## Introduction

In the first notebook we covered the basics of Python. Other topics are just as important for your data science curriculum: visualization, data structures and control structures. In this notebook you will learn to visualize real data with matplotlib's functions and get to know new data structures such as the dictionary and the Pandas DataFrame. After covering key concepts such as boolean logic, control flow and loops in Python, you will be ready to blend together everything you've learned to solve a case study.

- Visualization
- Data structures
- Control structures
- Case Study

In [None]:
# This will list all magic commands
%lsmagic

# https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

## Matplotlib 

Data visualization is a key skill for aspiring data scientists. Matplotlib makes it easy to create meaningful and insightful plots. In this section, you will learn to build various types of plots and to customize them to make them more visually appealing and interpretable. You can find the documentation [here](http://matplotlib.org/contents.html).

In [None]:
# Get matplotlib graphics to show up inline.
%matplotlib inline

import matplotlib.pyplot as plt

year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]

# With matplotlib, you can create a bunch of different plots in Python. The most basic plot is the line plot.
plt.plot(year, pop)

# labels
xlab = 'Years'
ylab = 'Population [in billion]'
title = 'World Growth'

plt.xlabel(xlab)
plt.ylabel(ylab)
plt.title(title)

plt.yticks([2,4,6,8],['2b','4b','6b','8b'])

%time plt.show()

In [None]:
# Change the line plot below to a scatter plot
plt.scatter(year, pop)

# Show plot
plt.show()

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])
%time plt.show()

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)

# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')

# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')

# tight_layout() will also adjust spacing between subplots to minimize the overlaps.
plt.tight_layout()

# Show the figure.
plt.show()

## Dictionaries

A dictionary stores (key, value) pairs, similar to a Map in Java or an object in JavaScript.

In [None]:
# Definition of countries and capital
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# Get index of 'germany': ind_ger
ind_ger = countries.index('germany')

# Use ind_ger to print out capital of Germany
print(capitals[ind_ger])

In [None]:
# From string in countries and capitals, create dictionary europe
europe = {'spain':'madrid','france':'paris','germany':'berlin','norway':'oslo'}

# Print europe
print(europe)

# Print out the keys in europe
print(europe.keys())

# Print out value that belongs to key 'norway'
print(europe['norway'])
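
The dictionary above was typed out by hand. As a minimal sketch, the same mapping can also be built from the two lists with `zip()`. The name `europe_from_lists` is only used for this illustration.

```python
# Same lists as in the cells above
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# zip() pairs the items position by position; dict() turns the pairs into a dictionary
europe_from_lists = dict(zip(countries, capitals))
print(europe_from_lists)

# A lookup by key replaces the index() detour needed with two separate lists
print(europe_from_lists['germany'])
```

This only works as intended when both lists have the same length and the same order.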

In [None]:
# Add italy to europe
europe['italy'] = 'rome'

# Print out italy in europe
print('italy' in europe)

# Add poland to europe
europe['poland'] = 'warsaw'

print(europe)

# Remove france from europe
del europe['france']

# Print europe
print(europe)

In [None]:
# Dictionary of dictionaries
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }


# Print out the capital of France
print(europe['france']['capital'])

# Create sub-dictionary data
data = {'capital':'rome', 'population':59.83}

# Add data to europe under key 'italy'
europe['italy'] = data

# Print europe
print(europe)

## Loop over a dictionary

In Python 3, use the **items()** method to loop over the (key, value) pairs of a dictionary:


```python
world = {"afghanistan": 30.55,
         "albania": 2.77,
         "algeria": 39.21}
for key, value in world.items():
    print(key + " -- " + str(value))
```

In [None]:
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'austria':'vienna' }
          
# Iterate over europe
for key, value in europe.items():
    print("the capital of " + key + " is " + value)
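
The same pattern extends to the dictionary of dictionaries shown earlier. The sketch below reuses that data under the illustrative name `europe_data`. Each value is itself a dictionary, so the inner fields are read with a second pair of square brackets.

```python
# Dictionary of dictionaries, same values as in the earlier cell
europe_data = {'spain': {'capital': 'madrid', 'population': 46.77},
               'france': {'capital': 'paris', 'population': 66.03},
               'germany': {'capital': 'berlin', 'population': 80.62},
               'norway': {'capital': 'oslo', 'population': 5.084}}

# Collect one summary line per country so the result can be inspected afterwards
summary = []
for country, info in europe_data.items():
    # info is the inner dictionary for this country
    summary.append(country + ": " + info['capital'] + " -- " + str(info['population']))

for line in summary:
    print(line)
```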

# Pandas: working with tabular datasets (DataFrame)


Pandas is an open source library, providing high-performance, easy-to-use data structures and data analysis tools for Python. Sounds promising!

The DataFrame is one of Pandas' most important data structures. It's basically a way to store tabular data, where you can label the rows and the columns. One way to build a DataFrame is from a dictionary.



![Tabular Dataset](http://drive.google.com/uc?export=view&id=0BxhVm1REqwr0N0loMHlqOUo5eXM)

- NumPy arrays contain only one data type
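
A small sketch of what this means in practice. The names `mixed` and `df` are illustrative and not part of the exercises. NumPy converts mixed input to one common type, while a DataFrame keeps a separate type for each column.

```python
import numpy as np
import pandas as pd

# Mixing a number, a string and a boolean forces NumPy to pick one common type
mixed = np.array([809, 'United States', True])
print(mixed.dtype)

# A DataFrame stores each column with its own type
df = pd.DataFrame({'country': ['United States', 'Japan'],
                   'cars_per_cap': [809, 588],
                   'drives_right': [True, False]})
print(df.dtypes)
```

The exact dtype names printed can differ between library versions, so they are not listed here.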

In the exercises that follow you will be working with vehicle data in different countries. Each observation corresponds to a country and the columns give information about the number of vehicles per capita, whether people drive left or right, and so on.

Three lists are defined in the script:

- names, containing the country names for which data is available.
- dr, a list with booleans that tells whether people drive left or right in the corresponding country.
- cpc, the number of motor vehicles per 1000 people in the corresponding country.

Each dictionary key is a column label and each value is a list which contains the column elements.

In [1]:
# Pre-defined lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

# Import pandas as pd
import pandas as pd

# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = {"country":names, "drives_right":dr, "cars_per_cap":cpc}

# Build a DataFrame cars from my_dict: cars
cars = pd.DataFrame(my_dict)

# Print cars
print(cars)

# Definition of row_labels
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']

# Specify row labels of cars
cars.index = row_labels

# Print cars again
print("\n", cars)


   cars_per_cap        country drives_right
0           809  United States         True
1           731      Australia        False
2           588          Japan        False
3            18          India        False
4           200         Russia         True
5            70        Morocco         True
6            45          Egypt         True

      cars_per_cap        country drives_right
US            809  United States         True
AUS           731      Australia        False
JAP           588          Japan        False
IN             18          India        False
RU            200         Russia         True
MOR            70        Morocco         True
EG             45          Egypt         True


Putting data in a dictionary and then building a DataFrame works, but it's not very efficient. What if you're dealing with millions of observations? In those cases, the data is typically available as files with a regular structure. One of those file types is the CSV file, which is short for "comma-separated values".

To import CSV data into Python as a Pandas DataFrame you can use **read_csv().**

Let's explore this function with the same cars data from the previous exercises. This time, however, the data is available in a CSV file, named cars.csv. It is available in your current working directory, so the path to the file is simply 'cars.csv'.

In [None]:
# Import pandas as pd
import pandas as pd

# Import the cars.csv data: cars
cars = pd.read_csv("cars.csv")

# Print out cars
print(cars)

In [None]:
print(cars.shape)

In [None]:
# Fix import by including index_col
cars = pd.read_csv('cars.csv', index_col = 0)

# Print out cars
print(cars)

In [None]:
print(cars.shape)
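
The two calls above differ only in `index_col`. The sketch below shows the effect without needing the file. It reads from an in-memory string, `csv_text`, which is a hypothetical stand-in and assumes that cars.csv stores the row labels in its first, unnamed column.

```python
import io
import pandas as pd

# Hypothetical stand-in for cars.csv: the first column holds the row labels
csv_text = """,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
"""

# Without index_col, the label column is read as an ordinary data column
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.shape)     # (3, 4)

# With index_col=0, the first column becomes the row index instead
labelled = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(labelled.shape)  # (3, 3)
print(labelled.index.tolist())
```

The column count drops by one because the labels move out of the data and into the index.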

## Square Brackets

You can index and select Pandas DataFrames in many different ways. The simplest, but not the most powerful way, is to use square brackets.

In the sample code below, the same cars data is imported from a CSV file as a Pandas DataFrame. To select only the cars_per_cap column from cars, you can use:

> cars['cars_per_cap']

> cars[['cars_per_cap']]

The single bracket version gives a Pandas Series, the double bracket version gives a Pandas DataFrame.





In [27]:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Print out drives_right column as Pandas Series
print(cars["drives_right"], "\n", type(cars["drives_right"]))

US      True
AUS    False
JAP    False
IN     False
RU      True
MOR     True
EG      True
Name: drives_right, dtype: bool 
 <class 'pandas.core.series.Series'>


In [None]:
# Print out country column as Pandas DataFrame
print(cars[["country"]], "\n", type(cars[["country"]]))

In [None]:
# Print out DataFrame with country and drives_right columns
print(cars[["country","drives_right"]])

Square brackets can do more than just selecting columns. You can also use them to get rows, or observations, 
from a DataFrame. The following call selects the first five rows from the cars DataFrame:

> <span style="color:blue"> cars[0:5]</span>

The result is another DataFrame containing only the rows you specified.

Pay attention: You can only select rows using square brackets if you specify a slice, like 0:4. Also, you're using the integer indexes of the rows here, not the row labels!
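
A short sketch of this rule on a small hand-built DataFrame. The name `cars_demo` and its three rows are chosen for illustration only. A slice selects rows by position, while a single integer is treated as a column label and fails.

```python
import pandas as pd

# Small stand-in for the cars DataFrame
cars_demo = pd.DataFrame({'cars_per_cap': [809, 731, 588],
                          'country': ['United States', 'Australia', 'Japan']},
                         index=['US', 'AUS', 'JAP'])

# A slice inside square brackets selects rows by integer position
first_two = cars_demo[0:2]
print(first_two)

# A single integer is looked up among the column labels, so this raises KeyError
raised = False
try:
    cars_demo[0]
except KeyError:
    raised = True
    print("cars_demo[0] fails: there is no column labelled 0")
```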



In [None]:
# Print out first 3 observations
print(cars[0:3])

In [None]:
# Print out fourth, fifth and sixth observation
print(cars[3:6])

## loc and iloc

With loc and iloc you can do practically any data selection operation on DataFrames you can think of. loc is label-based, which means that you have to specify rows and columns based on their row and column labels. iloc is integer index based, so you have to specify rows and columns by their integer index like you did in the previous exercise.

In [22]:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

print(cars)
print("-----")
# Print out observation for Japan as Series
print(cars.loc["JAP"])

# Print out observation for Japan as DataFrame
# print(cars.loc[["JAP"]])


     cars_per_cap        country drives_right
US            809  United States         True
AUS           731      Australia        False
JAP           588          Japan        False
IN             18          India        False
RU            200         Russia         True
MOR            70        Morocco         True
EG             45          Egypt         True
-----
cars_per_cap      588
country         Japan
drives_right    False
Name: JAP, dtype: object


In [25]:
print(cars.iloc[2])

cars_per_cap      588
country         Japan
drives_right    False
Name: JAP, dtype: object


In [None]:
# Print out observations for Australia and Egypt
print(cars.loc[["AUS","EG"]])

In [None]:
print(cars.iloc[[1,6]])

In [None]:
# Print out drives_right value of Morocco
print(cars.loc[["MOR"],["drives_right"]])

In [None]:
# Print sub-DataFrame
print(cars.loc[["RU","MOR"],["country","drives_right"]])

In [28]:
# Print out drives_right column as Series
print(cars.loc[:,"drives_right"])

US      True
AUS    False
JAP    False
IN     False
RU      True
MOR     True
EG      True
Name: drives_right, dtype: bool


In [29]:
# Print out drives_right column as DataFrame
print(cars.iloc[:,[2]])

    drives_right
US          True
AUS        False
JAP        False
IN         False
RU          True
MOR         True
EG          True


In [None]:
# Print out cars_per_cap and drives_right as DataFrame
print(cars.iloc[:,[0,2]])
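
One difference between the two is easy to miss when slicing. Label slices with loc include the end label, while integer slices with iloc exclude the end position. A minimal sketch on an illustrative DataFrame named `cars_demo`:

```python
import pandas as pd

# Small stand-in for the cars DataFrame
cars_demo = pd.DataFrame({'cars_per_cap': [809, 731, 588, 18],
                          'drives_right': [True, False, False, False]},
                         index=['US', 'AUS', 'JAP', 'IN'])

# loc slices by label and includes the end label 'JAP'
by_label = cars_demo.loc['US':'JAP']
print(by_label.index.tolist())     # ['US', 'AUS', 'JAP']

# iloc slices by position and excludes the end position 2
by_position = cars_demo.iloc[0:2]
print(by_position.index.tolist())  # ['US', 'AUS']
```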

## Loop over DataFrame

Iterating over a Pandas DataFrame is typically done with the **iterrows()** method. Used in a **for loop**, every observation is iterated over and on every iteration the row label and actual row contents are available:

```python
    for lab, row in anydataframe.iterrows() :
        ...
```

In [34]:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Print out the cars dataframe
print(cars)

     cars_per_cap        country drives_right
US            809  United States         True
AUS           731      Australia        False
JAP           588          Japan        False
IN             18          India        False
RU            200         Russia         True
MOR            70        Morocco         True
EG             45          Egypt         True


In [None]:
# Iterate over rows of cars
for label, row in cars.iterrows():
    print(label)
    print(row)

The row data that's generated by iterrows() on every run is a Pandas Series. This format is not very convenient to print out. Luckily, you can easily select variables from the Pandas Series using square brackets:

In [78]:
# Adapt for loop
for lab, row in cars.iterrows() :
    print(lab + ": " + str(row["cars_per_cap"]))

US: 809
AUS: 731
JAP: 588
IN: 18
RU: 200
MOR: 70
EG: 45


## Add Column


In [None]:
# Print out the cars dataframe
print(cars)

In [None]:
# Code for loop that adds COUNTRY column
for label, row in cars.iterrows():
    cars.loc[label,'COUNTRY'] = row["country"].upper()

# Print cars
print(cars)

Using **iterrows()** to iterate over every observation of a Pandas DataFrame is easy to understand, but not very efficient. On every iteration, you're creating a new Pandas Series.

If you want to add a column to a DataFrame by calling a function on another column, the **iterrows()** method in combination with a **for loop** is not the preferred way to go. Instead, you'll want to use **apply()**.

In [None]:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Use .apply(str.upper)
cars["COUNTRY"] = cars["country"].apply(str.upper)

print(cars)
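
For string columns specifically, Pandas also offers the `.str` accessor, which applies a string method to the whole column at once. The sketch below is an alternative to the cell above, not a replacement for it. It uses a small illustrative DataFrame named `cars_demo`.

```python
import pandas as pd

# Small stand-in for the cars DataFrame
cars_demo = pd.DataFrame({'country': ['United States', 'Australia', 'Japan']},
                         index=['US', 'AUS', 'JAP'])

# .str.upper() upper-cases every entry of the column in one call
cars_demo['COUNTRY'] = cars_demo['country'].str.upper()

# The result matches what apply(str.upper) produces
same = cars_demo['COUNTRY'].equals(cars_demo['country'].apply(str.upper))
print(cars_demo)
print(same)
```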

# Exercises

### Exercise 1
- Extract the drives_right column as a Pandas Series and store it as dr.
- Use dr, a boolean Series, to subset the cars DataFrame. Store the resulting selection in sel.
- Print sel, and assert that drives_right is True for all observations.

In [70]:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Extract drives_right column as Series: dr
#print (cars)
dr = cars['drives_right']

# Use dr to subset cars: sel
sel = cars[dr]

# Print sel
print(sel)

     cars_per_cap        country drives_right
US            809  United States         True
RU            200         Russia         True
MOR            70        Morocco         True
EG             45          Egypt         True
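
The last step of the exercise asks to assert that drives_right is True for every selected row. One possible way to make that check explicit is the Series method `all()`. The sketch uses a small illustrative DataFrame named `cars_demo` instead of cars.csv.

```python
import pandas as pd

# Small stand-in for the cars DataFrame
cars_demo = pd.DataFrame({'country': ['United States', 'Australia', 'Russia'],
                          'drives_right': [True, False, True]},
                         index=['US', 'AUS', 'RU'])

# A boolean Series inside square brackets keeps only the rows where it is True
dr = cars_demo['drives_right']
sel = cars_demo[dr]

# all() is True only if every value in the Series is True
assert sel['drives_right'].all()
print(sel)
```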


### Exercise 2

- Select the cars_per_cap column from cars as a Pandas Series and store it as cpc.
- Use cpc in combination with a comparison operator and 500. You want to end up with a boolean Series that's True if the corresponding country has a cars_per_cap of more than 500 and False otherwise. Store this boolean Series as many_cars.
- Use many_cars to subset cars, similar to what you did before. Store the result as car_maniac.
- Print out car_maniac to see if you got it right.

In [80]:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Create car_maniac: observations that have a cars_per_cap over 500
cpc = cars['cars_per_cap']
many_cars = cpc > 500
car_maniac = cars[many_cars]

# Print car_maniac
print(car_maniac)

     cars_per_cap        country drives_right
US            809  United States         True
AUS           731      Australia        False
JAP           588          Japan        False


In [82]:
# Loop over the observations that have a cars_per_cap over 500
for lab, row in cars[cars['cars_per_cap'] > 500].iterrows():
    print(lab + ": " + str(row["cars_per_cap"]))

US: 809
AUS: 731
JAP: 588

### Exercise 3

Remember **np.logical_and()**, **np.logical_or()** and **np.logical_not()**, the NumPy variants of the and, or and not operators? You can also use them on Pandas Series to do more advanced filtering operations.

Take this example that selects the observations that have a cars_per_cap between 10 and 80. Try out these lines of code step by step to see what's happening.

> cpc = cars['cars_per_cap']

> between = np.logical_and(cpc > 10, cpc < 80)

> medium = cars[between]

- Use the code sample above to create a DataFrame medium, that includes all the observations of cars that have a cars_per_cap between 100 and 500.
- Print out medium.
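
As a side note, the sample mask for the range 10 to 80 can also be written with the `&` operator on boolean Series. The sketch below uses a small illustrative DataFrame named `cars_demo` and compares both spellings. The parentheses around each comparison are required because `&` binds more tightly than `>` and `<`.

```python
import numpy as np
import pandas as pd

# Small stand-in for the cars DataFrame
cars_demo = pd.DataFrame({'cars_per_cap': [809, 18, 70, 45]},
                         index=['US', 'IN', 'MOR', 'EG'])
cpc = cars_demo['cars_per_cap']

# NumPy spelling, as in the sample code
between_np = np.logical_and(cpc > 10, cpc < 80)

# Operator spelling; each comparison needs its own parentheses
between_op = (cpc > 10) & (cpc < 80)

# Both masks select the same rows
print(between_np.equals(between_op))
print(cars_demo[between_op])
```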


In [70]:
world = {"afghanistan":30.55, "albania":2.77,"algeria":39.21, "albania":2.81}
world

{'afghanistan': 30.55, 'albania': 2.81, 'algeria': 39.21}

In [None]:
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Import numpy, you'll need this
import numpy as np

# Create medium: observations with cars_per_cap between 100 and 500
# TODO

# Print medium
# TODO