## Basic pandas and dictionaries 

In [1]:
import pandas as pd

The benefit of a dictionary is that it saves you from using the index of a value in one list to find the corresponding value in a second list. A dictionary stores key-value pairs, so each value is associated with its key and can be accessed directly through it.

In [2]:
# Definition of countries and capital
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']

# Get index of 'germany': ind_ger
ind_ger = countries.index('germany')

# Use ind_ger to print out capital of Germany
print(capitals[ind_ger])

berlin


Instead, you could use a dictionary like this:

In [3]:
europe = {
    'spain':'madrid',
    'france':'paris',
    'germany':'berlin',
    'norway':'oslo'
}
print(europe['germany'])

berlin


You can print out all the keys as such:

In [4]:
europe.keys()

dict_keys(['spain', 'france', 'germany', 'norway'])

You can add further key-value pairs by assigning to a new key:

In [5]:
europe['italy'] = 'rome'
# Then check whether italy is in the dictionary using the 'in' keyword
'italy' in europe

True

The same assignment syntax also overwrites the value of a key that already exists. To delete a key and its value from a dictionary, use the `del` statement.

In [6]:
del europe['france']
'france' in europe

False
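As a small sketch of a few related dictionary operations (not part of the original notebook), the standard `get`, `pop` and `items` methods cover safe lookups, removal with a return value, and looping. The variable names `missing`, `removed` and `pairs` are only illustrative.

```python
# Rebuild the small country -> capital dictionary from the cells above
europe = {'spain': 'madrid', 'france': 'paris',
          'germany': 'berlin', 'norway': 'oslo'}

# .get() returns a default instead of raising KeyError for a missing key
missing = europe.get('italy', 'unknown')

# .pop() removes a key and hands back its value in one step
removed = europe.pop('france')

# .items() yields (key, value) pairs in insertion order
pairs = [country + ': ' + capital for country, capital in europe.items()]

print(missing)
print(removed)
print(pairs)
```

`del` is enough when the removed value is not needed; `pop` is handy when it is.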

### Dictionaries within dictionaries 

In [7]:
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }
print(europe)

{'spain': {'capital': 'madrid', 'population': 46.77}, 'france': {'capital': 'paris', 'population': 66.03}, 'germany': {'capital': 'berlin', 'population': 80.62}, 'norway': {'capital': 'oslo', 'population': 5.084}}


In [8]:
print(europe['france']['population'])
print(europe['norway']['capital'])

66.03
oslo


A new sub-dictionary can be assigned to a key in the same way as any other value:

In [9]:
# Create sub-dictionary data
data = {'capital':'rome', 'population':59.83}

# Add data to europe under key 'italy'
europe['italy'] = data

In [10]:
print(europe)

{'spain': {'capital': 'madrid', 'population': 46.77}, 'france': {'capital': 'paris', 'population': 66.03}, 'germany': {'capital': 'berlin', 'population': 80.62}, 'norway': {'capital': 'oslo', 'population': 5.084}, 'italy': {'capital': 'rome', 'population': 59.83}}
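A minimal sketch of how such a nested dictionary might be traversed, assuming the same structure as above: each value of the outer dictionary is itself a dictionary, so `items()` gives the country name together with its sub-dictionary. The names `summary` and `largest` are only illustrative.

```python
europe = {'spain': {'capital': 'madrid', 'population': 46.77},
          'france': {'capital': 'paris', 'population': 66.03},
          'germany': {'capital': 'berlin', 'population': 80.62},
          'norway': {'capital': 'oslo', 'population': 5.084}}

# Each value of the outer dictionary is an inner dictionary
summary = []
for country, info in europe.items():
    summary.append(country + ' -> ' + info['capital'])

# Find the key whose inner 'population' value is the largest
largest = max(europe, key=lambda name: europe[name]['population'])

print(summary)
print(largest)
```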


## Pandas, part 1

A data frame is a two-dimensional data structure whose columns can each hold a different data type. Rows usually represent observations, and each variable gets its own column. One way to create a data frame is to pass a dictionary of lists to `pd.DataFrame`: each key becomes a column name and each list becomes that column's values.

In [11]:
# Store each column as a list, rather than nesting dictionaries
europe2 = { 'country': ['spain', 'france', 'germany', 'norway'],
           'capital': ['madrid', 'paris', 'berlin', 'oslo'],
           'population':[46.77, 66.03, 80.62, 5.084] }
europe_frame = pd.DataFrame(europe2)
print(europe_frame)

   country capital  population
0    spain  madrid      46.770
1   france   paris      66.030
2  germany  berlin      80.620
3   norway    oslo       5.084


In [12]:
row_labels = ['ES', 'FR', 'DE', 'NO']
europe_frame.index = row_labels
europe_frame

    country capital  population
ES    spain  madrid      46.770
FR   france   paris      66.030
DE  germany  berlin      80.620
NO   norway    oslo       5.084
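As an alternative sketch, the row labels can also be supplied when the data frame is created, through the `index` argument of `pd.DataFrame`, instead of being assigned afterwards. The name `frame` is only illustrative.

```python
import pandas as pd

data = {'country': ['spain', 'france', 'germany', 'norway'],
        'capital': ['madrid', 'paris', 'berlin', 'oslo'],
        'population': [46.77, 66.03, 80.62, 5.084]}

# Pass the row labels at construction time rather than setting .index later
frame = pd.DataFrame(data, index=['ES', 'FR', 'DE', 'NO'])

print(frame)
```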


### Selecting data from pandas data frames

There are a couple of ways to select data from a data frame in pandas. You can use the square bracket syntax, or the label-based and position-based indexers `loc` and `iloc`.

In [13]:
europe_frame['population']

ES    46.770
FR    66.030
DE    80.620
NO     5.084
Name: population, dtype: float64

You can see that this has returned just the single column, but not as a data frame. It has returned it as something else. 

In [14]:
type(europe_frame['population'])

pandas.core.series.Series

This is a pandas Series: essentially a one-dimensional list of values with the row labels attached. It is **not** a single-column data frame. To get that, you must use double square brackets (`[[ ]]`).

In [15]:
europe_frame[['population']]

    population
ES      46.770
FR      66.030
DE      80.620
NO       5.084


In [16]:
print(type(europe_frame[['population']]))

<class 'pandas.core.frame.DataFrame'>


With single brackets, rows can only be selected with a slice. A single row label does not work, because `europe_frame['FR']` is looked up as a column name.

In [17]:
europe_frame[1:3]

    country capital  population
FR   france   paris       66.03
DE  germany  berlin       80.62
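The following sketch illustrates how single brackets treat slices and single labels differently. It rebuilds the data frame under the illustrative name `frame` so that it can be run on its own.

```python
import pandas as pd

frame = pd.DataFrame(
    {'country': ['spain', 'france', 'germany', 'norway'],
     'capital': ['madrid', 'paris', 'berlin', 'oslo'],
     'population': [46.77, 66.03, 80.62, 5.084]},
    index=['ES', 'FR', 'DE', 'NO'])

# A positional slice inside single brackets selects rows
by_position = frame[1:3]

# A slice of row labels also selects rows, and includes both end points
by_label = frame['FR':'DE']

# A single label inside single brackets is treated as a column name
try:
    frame['FR']
    outcome = 'found'
except KeyError:
    outcome = 'KeyError'

print(list(by_position.index), list(by_label.index), outcome)
```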


`loc` uses the row labels and the column names, whereas `iloc` uses the integer positions of the rows and columns.

In [18]:
# europe_frame.loc[['row1', 'row3'], ['column1', 'column4']]
europe_frame.loc[['DE', 'ES'], ['country', 'population']]

    country  population
DE  germany       80.62
ES    spain       46.77


In [19]:
# A similar selection using integer positions: rows 2 and 0, columns 1 and 2 (capital and population)
europe_frame.iloc[[2, 0], [1, 2]]

   capital  population
DE  berlin       80.62
ES  madrid       46.77
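Note that column positions `[1, 2]` pick `capital` and `population`. To reproduce the label-based selection of `country` and `population` exactly, positions `[0, 2]` are needed. A short sketch, using the illustrative name `frame`, that checks the two approaches agree and shows how `get_loc` translates a label into its position:

```python
import pandas as pd

frame = pd.DataFrame(
    {'country': ['spain', 'france', 'germany', 'norway'],
     'capital': ['madrid', 'paris', 'berlin', 'oslo'],
     'population': [46.77, 66.03, 80.62, 5.084]},
    index=['ES', 'FR', 'DE', 'NO'])

# Label-based selection
by_label = frame.loc[['DE', 'ES'], ['country', 'population']]

# Position-based selection of the same rows and columns
by_position = frame.iloc[[2, 0], [0, 2]]

# get_loc translates a label into its integer position
row_pos = frame.index.get_loc('DE')
col_pos = frame.columns.get_loc('population')

print(by_label.equals(by_position), row_pos, col_pos)
```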


If you want to select all of either the rows or columns then you can use the `:` syntax

In [20]:
europe_frame.loc[:, ['country', 'capital']]

    country capital
ES    spain  madrid
FR   france   paris
DE  germany  berlin
NO   norway    oslo


In [21]:
europe_frame[['country', 'population']]

    country  population
ES    spain      46.770
FR   france      66.030
DE  germany      80.620
NO   norway       5.084


In [22]:
europe_frame.loc[['FR'], :]
# == europe_frame.loc[['FR']]

   country capital  population
FR  france   paris       66.03


In [23]:
# First 3 observations
europe_frame[0:3]

    country capital  population
ES    spain  madrid       46.77
FR   france   paris       66.03
DE  germany  berlin       80.62


In [24]:
# Print france and germany
europe_frame.loc[['FR', 'DE']]

    country capital  population
FR   france   paris       66.03
DE  germany  berlin       80.62


In [25]:
# Print out population for Spain and Norway
europe_frame.loc[['ES', 'NO'], ['population']]

    population
ES      46.770
NO       5.084


### Importing data with read_csv
When you import data from a CSV file, you can choose which column supplies the row labels by setting the `index_col` argument to the integer position of that column.

`pd.read_csv('path/to/csv.csv', index_col = 0)`
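A self-contained sketch of `index_col` in action. Instead of a file on disk, it reads from an in-memory string through `io.StringIO`. The CSV text and its `code` header are made up for illustration, and they mirror the data frame used above.

```python
import io
import pandas as pd

# Hypothetical CSV content; the first column holds the row labels
csv_text = """code,country,capital,population
ES,spain,madrid,46.77
FR,france,paris,66.03
DE,germany,berlin,80.62
NO,norway,oslo,5.084
"""

# index_col=0 turns the first column into the index instead of a regular column
frame = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(frame)
```

Without `index_col`, pandas would keep `code` as an ordinary column and number the rows from 0.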

## Filtering pandas data frames

In [26]:
europe_frame

    country capital  population
ES    spain  madrid      46.770
FR   france   paris      66.030
DE  germany  berlin      80.620
NO   norway    oslo       5.084


In [27]:
# Select the column you want to filter
europe_frame['population']

ES    46.770
FR    66.030
DE    80.620
NO     5.084
Name: population, dtype: float64

Note how this is a Series type

In [28]:
# Generate the logical series based on your condition
europe_frame['population'] < 50.0

ES     True
FR    False
DE    False
NO     True
Name: population, dtype: bool

Note that the data type here is a boolean.
If you save this boolean series then you can use it to subset the original dataset. 

In [29]:
filter_bools = europe_frame['population'] < 50.0
europe_frame[filter_bools]

   country capital  population
ES   spain  madrid      46.770
NO  norway    oslo       5.084


As pandas is built on NumPy, it is possible to combine several conditions with NumPy's logical functions:

In [30]:
import numpy as np
europe_frame[np.logical_and(europe_frame['population'] < 50.0, europe_frame['country'].apply(len) > 5)]

   country capital  population
NO  norway    oslo       5.084
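As an alternative sketch, the same filter can be written with the `&` operator on boolean Series, with each condition wrapped in parentheses, and with the `.str.len()` accessor in place of `apply(len)`. The names `frame`, `mask` and `result` are only illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {'country': ['spain', 'france', 'germany', 'norway'],
     'capital': ['madrid', 'paris', 'berlin', 'oslo'],
     'population': [46.77, 66.03, 80.62, 5.084]},
    index=['ES', 'FR', 'DE', 'NO'])

# Parentheses are required because & binds more tightly than < and >
mask = (frame['population'] < 50.0) & (frame['country'].str.len() > 5)

result = frame[mask]
print(result)
```

Python's plain `and` does not work here, because it tries to reduce each Series to a single true or false value.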


## Looping through pandas data frames

In [31]:
# simple loop returns column names
for val in europe_frame:
    print(val)

country
capital
population


If you want to loop through the rows, which is the more common need, you must call the `.iterrows()` method on the data frame.

In [32]:
for label, row in europe_frame.iterrows():
    print(label)
    print(row)

ES
country        spain
capital       madrid
population     46.77
Name: ES, dtype: object
FR
country       france
capital        paris
population     66.03
Name: FR, dtype: object
DE
country       germany
capital        berlin
population      80.62
Name: DE, dtype: object
NO
country       norway
capital         oslo
population     5.084
Name: NO, dtype: object


You'll notice that the first element is the row label, while the second element is a pandas Series holding that row's values. The `row` Series can be indexed further to extract the information that you want from that particular row.

In [33]:
# Example
for lab, row in europe_frame.iterrows():
    print(str(lab) + ': ' + str(row['population']))

ES: 46.77
FR: 66.03
DE: 80.62
NO: 5.084


In [34]:
# Assign new column the long way
for lab, row in europe_frame.iterrows():
    europe_frame.loc[lab, 'capital_length'] = len(row['capital'])
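For comparison, a sketch of the short way: the whole column can be computed in one step with the `.str.len()` accessor, without looping over rows. The name `frame` is only illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {'country': ['spain', 'france', 'germany', 'norway'],
     'capital': ['madrid', 'paris', 'berlin', 'oslo'],
     'population': [46.77, 66.03, 80.62, 5.084]},
    index=['ES', 'FR', 'DE', 'NO'])

# Compute the length of every capital name at once
frame['capital_length'] = frame['capital'].str.len()

print(frame)
```

This version keeps the lengths as integers, whereas the row-by-row loop ends up with floats such as `6.0`, most likely because the new column starts out filled with missing values.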

## iloc tangent

Note: this section may behave differently when run from scratch, as the extra columns may not exist. In the original session, four additional columns had been created because `.loc` was not used when adding the column in the loop above.

If you want to use `iloc` with a slice, write the slice directly inside the indexer, as in `iloc[:, 4:8]`. The slice **cannot** be wrapped in its own pair of square brackets: `iloc[:, [4:8]]` is a syntax error.

In [35]:
europe_frame.iloc[:,4:8]

Empty DataFrame
Columns: []
Index: [ES, FR, DE, NO]


In [36]:
# Create the object of columns that you wish to drop
columns = europe_frame.columns[4:8]

In [37]:
# Actually drop them (this returns a new data frame; europe_frame itself is unchanged)
europe_frame.drop(columns, axis = 1)

    country capital  population  capital_length
ES    spain  madrid      46.770             6.0
FR   france   paris      66.030             5.0
DE  germany  berlin      80.620             6.0
NO   norway    oslo       5.084             4.0
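One point worth illustrating is that `drop` returns a new data frame and leaves the original untouched unless the result is assigned back. A minimal sketch, using the illustrative names `frame` and `trimmed` and the `columns` keyword instead of `axis = 1`:

```python
import pandas as pd

frame = pd.DataFrame(
    {'country': ['spain', 'france', 'germany', 'norway'],
     'capital': ['madrid', 'paris', 'berlin', 'oslo'],
     'population': [46.77, 66.03, 80.62, 5.084],
     'capital_length': [6, 5, 6, 4]},
    index=['ES', 'FR', 'DE', 'NO'])

# drop returns a new data frame; the original keeps all of its columns
trimmed = frame.drop(columns=['capital_length'])

print(list(frame.columns))
print(list(trimmed.columns))
```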


## Basic column selection

There are two basic ways to select a single column from a data frame in pandas: square brackets and dot notation.

In [38]:
europe_frame

    country capital  population  capital_length
ES    spain  madrid      46.770             6.0
FR   france   paris      66.030             5.0
DE  germany  berlin      80.620             6.0
NO   norway    oslo       5.084             4.0


In [39]:
europe_frame['capital']

ES    madrid
FR     paris
DE    berlin
NO      oslo
Name: capital, dtype: object

In [40]:
europe_frame.capital

ES    madrid
FR     paris
DE    berlin
NO      oslo
Name: capital, dtype: object
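The two forms are not always interchangeable. As a sketch, assume a hypothetical column named `count`, which clashes with the existing `DataFrame.count` method: dot notation then returns the method, while square brackets still return the column.

```python
import pandas as pd

# 'count' is a hypothetical column name chosen to clash with DataFrame.count
frame = pd.DataFrame(
    {'capital': ['madrid', 'paris', 'berlin', 'oslo'],
     'count': [6, 5, 6, 4]},
    index=['ES', 'FR', 'DE', 'NO'])

# Square brackets always look up the column
column = frame['count']

# Dot notation finds the DataFrame method first, not the column
attribute = frame.count

print(type(column).__name__)
print(callable(attribute))
```

Square brackets are also the only option for column names that are not valid Python identifiers, such as names containing spaces.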