# Dictionaries and Pandas
## Table of Contents
* [Importing modules and defining functions](#Importing-modules-and-defining-functions)
* [01. Dictionaries, Part 1](#Dictionaries,-Part-1)
    * [1a. Motivation for dictionaries](#Motivation-for-dictionaries)
    * [1b. Create dictionary](#Create-dictionary)
    * [1c. Access dictionary](#Access-dictionary)
* [02. Dictionaries, Part 2](#Dictionaries,-Part-2)
    * [2a. Dictionary Manipulation (1)](#Dictionary-Manipulation-(1))
    * [2b. Dictionary Manipulation (2)](#Dictionary-Manipulation-(2))
    * [2c. Dictionariception](#Dictionariception)
* [03. Pandas, Part 1](#Pandas,-Part-1)
    * [3a. Dictionary to DataFrame (1)](#Dictionary-to-DataFrame-(1))
    * [3b. Dictionary to DataFrame (2)](#Dictionary-to-DataFrame-(2))
    * [3c. CSV to DataFrame (1)](#CSV-to-DataFrame-(1))
    * [3d. CSV to DataFrame (2)](#CSV-to-DataFrame-(2))
* [04. Pandas, Part 2](#Pandas,-Part-2)
    * [4a. Square Brackets (1)](#Square-Brackets-(1))
    * [4b. Square Brackets (2)](#Square-Brackets-(2))
    * [4c. loc and iloc (1)](#loc-and-iloc-(1))
    * [4d. loc and iloc (2)](#loc-and-iloc-(2))
    * [4e. loc and iloc (3)](#loc-and-iloc-(3))
* [Appendix: Methods](#Appendix:-Methods)

# Importing modules and defining functions

In [2]:
import matplotlib.pyplot as plt
import matplotlib.style as style
import numpy as np
import pandas as pd

# Dictionaries, Part 1

## Motivation for dictionaries

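To see why dictionaries are useful, have a look at the two lists defined below. countries contains the names of some European countries. capitals lists the names of their capitals in the same order. To find a capital, you first have to look up the position of the country and then use that position in the second list.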

* [list.index](https://docs.python.org/3/library/stdtypes.html#common-sequence-operations)

In [2]:
# Definition of countries and capital
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# Get index of 'germany': ind_ger
ind_ger = countries.index('germany')

# Use ind_ger to print out capital of Germany
print(capitals[ind_ger])

berlin


## Create dictionary

The countries and capitals lists are again available in the script. It's your job to convert this data to a dictionary where the country names are the keys and the capitals are the corresponding values. As a refresher, here is a recipe for creating a dictionary:

my_dict = {
   "key1":"value1",
   "key2":"value2",
}

In this recipe, both the keys and the values are strings. This will also be the case for this exercise.

In [4]:
# Definition of countries and capital
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# From string in countries and capitals, create dictionary europe
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo'}

# Print europe
print(europe)

{'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo'}
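
When the two lists are already aligned by position, typing every pair by hand is not the only option. The following is a minimal sketch, not part of the original exercise, that uses the built-in zip() and dict() functions to build the same mapping. The name europe_from_lists is chosen here purely for illustration.

```python
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# zip() pairs each country with the capital at the same position;
# dict() turns those (key, value) pairs into a dictionary.
europe_from_lists = dict(zip(countries, capitals))

print(europe_from_lists)
```

This relies on both lists having the same length and order; zip() silently stops at the shorter list if they do not.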


## Access dictionary

If the keys of a dictionary are chosen wisely, accessing the values in a dictionary is easy and intuitive. For example, to get the capital for France from europe you can use:

europe['france']

Here, 'france' is the key and 'paris' is the value that is returned.

* [dict.keys](https://docs.python.org/3/library/stdtypes.html#dict.keys)

In [5]:
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }

# Print out the keys in europe
print(europe.keys())

# Print out value that belongs to key 'norway'
print(europe['norway'])

dict_keys(['spain', 'france', 'germany', 'norway'])
oslo
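
Square brackets assume the key exists. As a small sketch of what happens otherwise (the key 'iceland' and the default 'unknown' are illustrative choices, not from the exercise), the snippet below contrasts square-bracket access with dict.get(), which returns a default instead of raising an error.

```python
europe = {'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo'}

# Square brackets raise KeyError when the key is missing
try:
    europe['iceland']
except KeyError as err:
    print('missing key:', err)

# get() returns None for a missing key unless a default is given
missing = europe.get('iceland')
fallback = europe.get('iceland', 'unknown')
print(missing, fallback)
```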


# Dictionaries, Part 2

## Dictionary Manipulation (1)

If you know how to access a dictionary, you can also assign a new value to it. To add a new key-value pair to europe you can use something like this:

europe['iceland'] = 'reykjavik'

In [1]:
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }

# Add italy to europe
europe['italy'] = 'rome' 

# To assert that 'italy' is now a key in europe, print out 'italy' in europe.
print('italy' in europe)

# Add poland to europe
europe['poland'] = 'warsaw'

# Print europe
print(europe)

True
{'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo', 'italy': 'rome', 'poland': 'warsaw'}


## Dictionary Manipulation (2)

Somebody thought it would be funny to mess with your accurately generated dictionary. An adapted version of the europe dictionary is defined in the code below.

Can you clean up? Do not do this by adapting the definition of europe, but by adding Python commands to the script to update and remove key:value pairs.

* [del statement](http://www.pythonforbeginners.com/basics/del-statement)

In [8]:
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'bonn',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw',
          'australia':'vienna' }

# Update capital of germany
europe['germany'] = 'berlin'

# Remove australia
del europe['australia']

# Print europe
print(europe)

{'spain': 'madrid', 'france': 'paris', 'germany': 'berlin', 'norway': 'oslo', 'italy': 'rome', 'poland': 'warsaw'}
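
The del statement is one way to remove a pair. As an alternative sketch, dict.pop() removes the key and also hands back the value that was stored, and it accepts a default so that a missing key does not raise an error. The key 'atlantis' below is a made-up example of a key that is not present.

```python
europe = {'spain': 'madrid', 'france': 'paris', 'germany': 'bonn',
          'norway': 'oslo', 'italy': 'rome', 'poland': 'warsaw',
          'australia': 'vienna'}

# Assigning to an existing key overwrites its value
europe['germany'] = 'berlin'

# pop() removes the key and returns the value it held
removed = europe.pop('australia')

# With a default, pop() does not raise KeyError for a missing key
not_there = europe.pop('atlantis', None)

print(removed, not_there)
print(europe)
```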


## Dictionariception

Remember lists? They could contain anything, even other lists. Well, for dictionaries the same holds. Dictionaries can contain key:value pairs where the values are again dictionaries.

As an example, have a look at the script where another version of europe - the dictionary you've been working with all along - is coded. The keys are still the country names, but the values are dictionaries that contain more information than just the capital.

It's perfectly possible to chain square brackets to select elements. To fetch the population for Spain from europe, for example, you need:

    europe['spain']['population']


In [10]:
# Dictionary of dictionaries
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }


# Print out the capital of France
print(europe['france']['capital'])
print()

# Create sub-dictionary data
data = {'capital':'rome', 'population':59.83}

# Add data to europe under key 'italy'
europe['italy'] = data

# Print europe
print(europe)

paris

{'spain': {'capital': 'madrid', 'population': 46.77}, 'france': {'capital': 'paris', 'population': 66.03}, 'germany': {'capital': 'berlin', 'population': 80.62}, 'norway': {'capital': 'oslo', 'population': 5.084}, 'italy': {'capital': 'rome', 'population': 59.83}}
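
Chained square brackets also work inside a loop. The sketch below, which goes slightly beyond the exercise, walks over the nested dictionary with items() and then picks the country with the largest population value. The variable name largest is introduced here for illustration.

```python
europe = {'spain': {'capital': 'madrid', 'population': 46.77},
          'france': {'capital': 'paris', 'population': 66.03},
          'germany': {'capital': 'berlin', 'population': 80.62},
          'norway': {'capital': 'oslo', 'population': 5.084}}

# Each value is itself a dictionary, so a second lookup reaches its fields
for country, info in europe.items():
    print(country, info['capital'], info['population'])

# Find the key whose sub-dictionary has the highest population value
largest = max(europe, key=lambda c: europe[c]['population'])
print(largest)
```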


# Pandas, Part 1

## Dictionary to DataFrame (1)

Pandas is an open source library, providing high-performance, easy-to-use data structures and data analysis tools for Python. Sounds promising!

The DataFrame is one of Pandas' most important data structures. It's basically a way to store tabular data where you can label the rows and the columns. One way to build a DataFrame is from a dictionary.

In the exercises that follow you will be working with vehicle data from different countries. Each observation corresponds to a country and the columns give information about the number of vehicles per capita, whether people drive left or right, and so on.

Three lists are defined in the script:

* names, containing the country names for which data is available.
* dr, a list of booleans that is True where people drive on the right and False where they drive on the left.
* cpc, the number of motor vehicles per 1000 people in the corresponding country.

Each dictionary key is a column label and each value is a list which contains the column elements.

* [pd.DataFrame()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)

In [3]:
# Pre-defined lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = {'country':names, 'drives_right':dr, 'cars_per_cap':cpc}

# Build a DataFrame cars from my_dict: cars
cars = pd.DataFrame(my_dict)

# Print cars
cars

   cars_per_cap        country  drives_right
0           809  United States          True
1           731      Australia         False
2           588          Japan         False
3            18          India         False
4           200         Russia          True
5            70        Morocco          True
6            45          Egypt          True


Notice that the columns of cars can be of different types. This was not possible with 2D NumPy arrays, which store all elements with a single data type.
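
To see the per-column types directly, one option is the dtypes attribute of the DataFrame. This is a small sketch rebuilding cars from the same three lists. Note that recent pandas versions keep the dictionary's insertion order for the columns, so the column order may differ from the output shown above.

```python
import pandas as pd

names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

cars = pd.DataFrame({'country': names, 'drives_right': dr, 'cars_per_cap': cpc})

# One dtype per column: text, boolean and integer columns live side by side
print(cars.dtypes)
```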

## Dictionary to DataFrame (2)

The Python code that solves the previous exercise is included below. Have you noticed that the row labels (i.e. the labels for the different observations) were automatically set to integers from 0 up to 6?

To replace them with more meaningful labels, a list row_labels has been created. You can use it to specify the row labels of the cars DataFrame by setting its index attribute, which you access as cars.index.

In [15]:
# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
# Named cars_dict rather than dict to avoid shadowing the built-in dict type
cars_dict = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
cars = pd.DataFrame(cars_dict)
print(cars)
print()
# Definition of row_labels
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']

# Specify row labels of cars
cars.index = row_labels

# Print cars again
print(cars)

   cars_per_cap        country  drives_right
0           809  United States          True
1           731      Australia         False
2           588          Japan         False
3            18          India         False
4           200         Russia          True
5            70        Morocco          True
6            45          Egypt          True

     cars_per_cap        country  drives_right
US            809  United States          True
AUS           731      Australia         False
JAP           588          Japan         False
IN             18          India         False
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True
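
Setting cars.index after the fact is what the exercise asks for. As an alternative sketch, the pd.DataFrame() constructor also accepts an index argument, so the labels can be supplied when the DataFrame is built. The variable name cars_data is chosen here for illustration.

```python
import pandas as pd

names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']

cars_data = {'country': names, 'drives_right': dr, 'cars_per_cap': cpc}

# Pass the row labels at construction time instead of assigning cars.index later
cars = pd.DataFrame(cars_data, index=row_labels)
print(cars)
```

The list of labels must have exactly one entry per row.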


## CSV to DataFrame (1)

Putting data in a dictionary and then building a DataFrame works, but it's not very efficient. What if you're dealing with millions of observations? In those cases, the data is typically available as files with a regular structure. One of those file types is the CSV file, which is short for "comma-separated values".

To import CSV data into Python as a Pandas DataFrame you can use read_csv().

Let's explore this function with the same cars data from the previous exercises. This time, however, the data is available in a CSV file, named cars.csv. It is available in your current working directory, so the path to the file is simply 'cars.csv'.

* [pd.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)

In [4]:
# Import pandas as pd
import pandas as pd

# Import the cars.csv data: cars
cars = pd.read_csv('cars.csv')

# Print out cars
print(cars)

  Unnamed: 0  cars_per_cap        country  drives_right
0         US           809  United States          True
1        AUS           731      Australia         False
2        JAP           588          Japan         False
3         IN            18          India         False
4         RU           200         Russia          True
5        MOR            70        Morocco          True
6         EG            45          Egypt          True


Looks nice, but not exactly what we expected.

## CSV to DataFrame (2)

Your read_csv() call to import the CSV data didn't generate an error, but the output is not entirely what we wanted. The row labels were imported as another column without a name.

Remember index_col, an argument of read_csv(), that you can use to specify which column in the CSV file should be used as a row label? Well, that's exactly what you need here!

Python code that solves the previous exercise is already included; can you make the appropriate changes to fix the data import?

In [5]:
# Fix import by including index_col
cars = pd.read_csv('cars.csv',index_col=0)

# Print out cars
print(cars)

     cars_per_cap        country  drives_right
US            809  United States          True
AUS           731      Australia         False
JAP           588          Japan         False
IN             18          India         False
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True
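
The effect of index_col can be reproduced without the file. The sketch below reads a short CSV text from memory through io.StringIO. The text in csv_text is an assumed stand-in for the first rows of cars.csv, reconstructed from the output above, not the actual file contents.

```python
import io
import pandas as pd

# Assumed stand-in for cars.csv: the header of the first column is empty
csv_text = """,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
"""

# Without index_col the label column becomes a regular column with a generated name
default = pd.read_csv(io.StringIO(csv_text))
print(default.columns.tolist())

# With index_col=0 the first column is used as the row labels
fixed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(fixed.index.tolist())
```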


# Pandas, Part 2

## Square Brackets (1)

You can index and select Pandas DataFrames in many different ways. The simplest, but not the most powerful way, is to use square brackets.

In the sample code below, the same cars data is imported from a CSV file as a Pandas DataFrame. To select only the cars_per_cap column from cars, you can use either of:

    cars['cars_per_cap']
    cars[['cars_per_cap']]

The single bracket version gives a Pandas Series, the double bracket version gives a Pandas DataFrame.

In [6]:
# Import cars data
cars = pd.read_csv('cars.csv', index_col = 0)

# Print out country column as Pandas Series
print(cars['country'])
print()
# Print out country column as Pandas DataFrame
print(cars[['country']])
print()
# Print out DataFrame with country and drives_right columns
print(cars[['country','drives_right']])

US     United States
AUS        Australia
JAP            Japan
IN             India
RU            Russia
MOR          Morocco
EG             Egypt
Name: country, dtype: object

           country
US   United States
AUS      Australia
JAP          Japan
IN           India
RU          Russia
MOR        Morocco
EG           Egypt

           country  drives_right
US   United States          True
AUS      Australia         False
JAP          Japan         False
IN           India         False
RU          Russia          True
MOR        Morocco          True
EG           Egypt          True
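
The difference between single and double brackets shows up in the type and shape of the result. In this sketch the cars DataFrame is rebuilt inline, as an assumption, so that the snippet runs without cars.csv.

```python
import pandas as pd

# Inline copy of the cars data so the example does not depend on cars.csv
cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18, 200, 70, 45],
     'country': ['United States', 'Australia', 'Japan', 'India',
                 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True]},
    index=['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG'])

as_series = cars['country']     # single brackets: one-dimensional Series
as_frame = cars[['country']]    # double brackets: DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```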


## Square Brackets (2)

Square brackets can do more than just selecting columns. You can also use them to get rows, or observations, from a DataFrame. The following call selects the first five rows from the cars DataFrame:

    cars[0:5]

The result is another DataFrame containing only the rows you specified.

Pay attention: You can only select rows using square brackets if you specify a slice, like 0:4. Also, you're using the integer indexes of the rows here, not the row labels!

In [7]:
# Import cars data
cars = pd.read_csv('cars.csv', index_col = 0)

# Print out first 3 observations
print(cars[:3])
print()
# Print out fourth, fifth and sixth observation
print(cars[3:6])

     cars_per_cap        country  drives_right
US            809  United States          True
AUS           731      Australia         False
JAP           588          Japan         False

     cars_per_cap  country  drives_right
IN             18    India         False
RU            200   Russia          True
MOR            70  Morocco          True
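
The slice requirement can be checked with a short experiment. In this sketch, with the cars data rebuilt inline as an assumption so it runs without cars.csv, a slice inside square brackets selects rows by position, while a single integer is interpreted as a column label and fails because no column is named 0.

```python
import pandas as pd

# Inline copy of the cars data so the example does not depend on cars.csv
cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18, 200, 70, 45],
     'country': ['United States', 'Australia', 'Japan', 'India',
                 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True]},
    index=['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG'])

# A slice selects rows by integer position; the end position is excluded
first_five = cars[0:5]
print(first_five.index.tolist())

# The same rows selected explicitly by position with iloc
same_rows = first_five.equals(cars.iloc[0:5])
print(same_rows)

# A single integer is treated as a column label, which does not exist here
try:
    cars[0]
except KeyError:
    print('cars[0] raises KeyError: there is no column named 0')
```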


## loc and iloc (1)

With loc and iloc you can do practically any data selection operation on DataFrames you can think of. loc is label-based, which means that you have to specify rows and columns based on their row and column labels. iloc is integer index based, so you have to specify rows and columns by their integer index like you did in the previous exercise.

Try out the following commands in the IPython Shell to experiment with loc and iloc to select observations. Each pair of commands here gives the same result.

    cars.loc['RU']
    cars.iloc[4]

    cars.loc[['RU']]
    cars.iloc[[4]]

    cars.loc[['RU', 'AUS']]
    cars.iloc[[4, 1]]

As before, code is included that imports the cars data as a Pandas DataFrame.

* [pandas indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing)
* [.loc()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)
* [.iloc()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html)

In [8]:
# Import cars data
cars = pd.read_csv('cars.csv', index_col = 0)

# Print out observation for Japan
print(cars.loc['JAP'])
print()
# Print out observations for Australia and Egypt
print(cars.loc[['AUS','EG']])

cars_per_cap      588
country         Japan
drives_right    False
Name: JAP, dtype: object

     cars_per_cap    country  drives_right
AUS           731  Australia         False
EG             45      Egypt          True
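
The claim that each pair of commands gives the same result can be verified with the equals() method. This is a sketch with the cars data rebuilt inline, as an assumption, so that it runs without cars.csv.

```python
import pandas as pd

# Inline copy of the cars data so the example does not depend on cars.csv
cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18, 200, 70, 45],
     'country': ['United States', 'Australia', 'Japan', 'India',
                 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True]},
    index=['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG'])

# Single label or position: both return the Russia row as a Series
row_match = cars.loc['RU'].equals(cars.iloc[4])

# List of labels or positions: both return a DataFrame in the requested order
frame_match = cars.loc[['RU', 'AUS']].equals(cars.iloc[[4, 1]])

print(row_match, frame_match)
```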


## loc and iloc (2)

loc and iloc also allow you to select both rows and columns from a DataFrame. To experiment, try out the following commands in the IPython Shell. Again, paired commands produce the same result.

    cars.loc['IN', 'cars_per_cap']
    cars.iloc[3, 0]

    cars.loc[['IN', 'RU'], 'cars_per_cap']
    cars.iloc[[3, 4], 0]

    cars.loc[['IN', 'RU'], ['cars_per_cap', 'country']]
    cars.iloc[[3, 4], [0, 1]]


In [30]:
# loc and iloc also allow you to select both rows and columns from a DataFrame.

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Print out drives_right value of Morocco
print(cars.loc['MOR','drives_right'])

# Print sub-DataFrame
print(cars.loc[['RU','MOR'],['country','drives_right']])


True
     country  drives_right
RU    Russia          True
MOR  Morocco          True
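
The paired commands above depend on the column order cars_per_cap, country, drives_right. The sketch below rebuilds cars inline with that order fixed explicitly through the columns argument (an assumption made so the snippet runs without cars.csv) and checks that the label-based and position-based selections agree.

```python
import pandas as pd

# Inline copy of the cars data with the column order stated explicitly
cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18, 200, 70, 45],
     'country': ['United States', 'Australia', 'Japan', 'India',
                 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True]},
    index=['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG'],
    columns=['cars_per_cap', 'country', 'drives_right'])

# Row and column together: a single cell
by_label = cars.loc['IN', 'cars_per_cap']
by_position = cars.iloc[3, 0]
print(by_label, by_position)

# Lists on both sides of the comma: a sub-DataFrame
sub_match = cars.loc[['IN', 'RU'], ['cars_per_cap', 'country']].equals(
    cars.iloc[[3, 4], [0, 1]])
print(sub_match)
```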


## loc and iloc (3)

It's also possible to select only columns with loc and iloc. In both cases, you simply put a slice going from beginning to end in front of the comma:

    cars.loc[:, 'country']
    cars.iloc[:, 1]

    cars.loc[:, ['country','drives_right']]
    cars.iloc[:, [1, 2]]


In [9]:
# Import cars data
cars = pd.read_csv('cars.csv', index_col = 0)

# Print out drives_right column as Series
print(cars.iloc[:,2])
print()
# Print out drives_right column as DataFrame
print(cars.loc[:,['drives_right']])
print()
# Print out cars_per_cap and drives_right as DataFrame
print(cars.loc[:,['cars_per_cap','drives_right']])


US      True
AUS    False
JAP    False
IN     False
RU      True
MOR     True
EG      True
Name: drives_right, dtype: bool

     drives_right
US           True
AUS         False
JAP         False
IN          False
RU           True
MOR          True
EG           True

     cars_per_cap  drives_right
US            809          True
AUS           731         False
JAP           588         False
IN             18         False
RU            200          True
MOR            70          True
EG             45          True
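
Column selection also works with slices on the column side. One detail worth knowing, shown in the sketch below, is that a label slice in loc includes its end label, whereas a position slice in iloc excludes its end position. The cars data is rebuilt inline, as an assumption, so the snippet runs without cars.csv.

```python
import pandas as pd

# Inline copy of the cars data with the column order stated explicitly
cars = pd.DataFrame(
    {'cars_per_cap': [809, 731, 588, 18, 200, 70, 45],
     'country': ['United States', 'Australia', 'Japan', 'India',
                 'Russia', 'Morocco', 'Egypt'],
     'drives_right': [True, False, False, False, True, True, True]},
    index=['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG'],
    columns=['cars_per_cap', 'country', 'drives_right'])

# A single column by label and by position
column_match = cars.loc[:, 'country'].equals(cars.iloc[:, 1])

# Label slice includes 'country'; position slice 0:2 stops before position 2
by_label = cars.loc[:, 'cars_per_cap':'country']
by_position = cars.iloc[:, 0:2]

print(column_match)
print(by_label.columns.tolist(), by_position.columns.tolist())
```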


# Appendix: Methods

__Methods__
* [list.index](https://docs.python.org/3/library/stdtypes.html#common-sequence-operations)
* [dict.keys](https://docs.python.org/3/library/stdtypes.html#dict.keys)
* [del statement](http://www.pythonforbeginners.com/basics/del-statement)
* [pd.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
* [pandas indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing)
* [.loc()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)
* [.iloc()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html)

__Objects__
* [pd.DataFrame()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)