## Basic Data Processing with Pandas

### DataFrame Data Structures

`DataFrame` is the primary object in the `Pandas` library. It is conceptually a two-dimensional series object, where there's an index and multiple columns of content, with each column having a label.

You can think of a `DataFrame` simply as a two-axis labeled array.

In [1]:
import pandas as pd

In [3]:
# School records for students
record1 = pd.Series({'Name': 'Alice',
                     'Class': 'Physics',
                     'Score': 85})

record2 = pd.Series({'Name': 'Jack',
                     'Class': 'Chemistry',
                     'Score': 82})

record3 = pd.Series({'Name': 'Helen',
                     'Class': 'Biology',
                     'Score': 90})

In [4]:
# Create a DataFrame using the individual Series as "rows"
df = pd.DataFrame([record1, record2, record3], index=['school1', 'school2', 'school3'])

df.head()

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school3,Helen,Biology,90


In [8]:
# Create a DataFrame using dictionaries
students = [{'Name': 'Alice',
             'Class': 'Physics',
             'Score': 85},
             {'Name': 'Jack',
              'Class': 'Chemistry',
              'Score': 82},
             {'Name': 'Helen',
              'Class': 'Biology',
              'Score': 90}]

df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])
df.head()

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90


In [9]:
# Using `.loc[]`
df.loc['school2']

Name          Jack
Class    Chemistry
Score           82
Name: school2, dtype: object

In [10]:
# With a unique index label, .loc returns a Series
type(df.loc['school2'])

pandas.core.series.Series

In [11]:
# With a non-unique index label, .loc returns all the matching rows, so the result is a DataFrame
df.loc['school1']

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school1,Helen,Biology,90


In [12]:
type(df.loc['school1'])

pandas.core.frame.DataFrame

In [13]:
# Select data based on multiple axes
df.loc['school1', 'Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

In [14]:
# Pandas reserves the indexing operator for column selection. Columns always have a name --> Column projection
df['Name']

school1    Alice
school2     Jack
school1    Helen
Name: Name, dtype: object

In [15]:
# However, you cannot use `.loc` with a single label for column projection: `.loc` looks that label up in the row index
# df.loc['Name'] # Raises a KeyError
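A minimal sketch of why that commented-out line fails, using a toy frame (`df_demo` is an illustrative name, not part of the notebook). It assumes only that a single label passed to `.loc` is looked up in the row index, and that passing `:` for the rows turns the same label into a column selector.

```python
import pandas as pd

# Toy copy of the student records used above
df_demo = pd.DataFrame(
    {'Name': ['Alice', 'Jack', 'Helen'], 'Score': [85, 82, 90]},
    index=['school1', 'school2', 'school1'],
)

# A single label passed to .loc is searched for in the row index,
# so a column name is not found
try:
    df_demo.loc['Name']
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Column projection with .loc: all rows (:), one column
names = df_demo.loc[:, 'Name']
```

Here `names` holds the same Series as `df_demo['Name']`.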

In [16]:
# Chain two indexing operations to select specific index labels and then a specific column
df.loc['school1']['Name']

school1    Alice
school1    Helen
Name: Name, dtype: object

**Chaining**

In [17]:
# Chained indexing (as in the previous cell) tends to make pandas return a copy of the DataFrame instead of a view.
# Prefer a single .loc call: `:` selects all the rows, and the list selects the columns
df.loc[:, ['Name', 'Score']]

Unnamed: 0,Name,Score
school1,Alice,85
school2,Jack,82
school1,Helen,90
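The copy-versus-view distinction matters most when assigning values. The sketch below uses a toy frame (`df_demo` is an illustrative name) and shows the form that is safe under that assumption: one `.loc[rows, columns]` call addresses the original DataFrame directly, whereas a chained assignment may end up modifying a temporary copy.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Name': ['Alice', 'Jack', 'Helen'], 'Score': [85, 82, 90]},
    index=['school1', 'school2', 'school3'],
)

# One .loc call with both a row and a column selector updates df_demo itself
df_demo.loc['school2', 'Score'] = 88

# The same single-call form works with a Boolean row selector
df_demo.loc[df_demo['Score'] >= 90, 'Name'] = 'Helen (top)'
```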


**Dropping data**

In [18]:
# Drop function to delete data in Series and DataFrames: returns a copy of the DataFrame!
df.drop('school1') # A normal drop

Unnamed: 0,Name,Class,Score
school2,Jack,Chemistry,82


In [19]:
df # df still intact

Unnamed: 0,Name,Class,Score
school1,Alice,Physics,85
school2,Jack,Chemistry,82
school1,Helen,Biology,90


In [20]:
# Optional parameters for drop: `inplace` (update the DataFrame in place instead of returning a copy) and `axis` (what to drop: 0 for rows, the default, or 1 for columns)
copy_df = df.copy()

copy_df.drop('Name', inplace=True, axis=1) # We drop the column 'Name' (axis 1) and it's updated inplace

copy_df

Unnamed: 0,Class,Score
school1,Physics,85
school2,Chemistry,82
school1,Biology,90


In [21]:
# Another way is to use `del`
del copy_df['Class']
copy_df

Unnamed: 0,Score
school1,85
school2,82
school1,90


In [22]:
# Adding a new column
df['ClassRanking'] = None
df

Unnamed: 0,Name,Class,Score,ClassRanking
school1,Alice,Physics,85,
school2,Jack,Chemistry,82,
school1,Helen,Biology,90,


### DataFrame Indexing and Loading

In Jupyter, you can integrate lower-level shell commands into the data science workflow using `IPython`. The shell command used here is `cat`, short for "concatenate", which prints the contents of a file. Putting an exclamation mark (`!`) at the beginning of a line executes the remainder of that line as a shell command.

In [28]:
# !cat ../resources/week-2/datasets/Admission_Predict.csv # Shows the csv file 

Invalid parameter: /resources


In [29]:
df = pd.read_csv('../resources/week-2/datasets/Admission_Predict.csv')
df.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [30]:
# Set the Serial No as the index for the students
df = pd.read_csv('../resources/week-2/datasets/Admission_Predict.csv', index_col=0)
df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [31]:
# Rename the columns "SOP" and "LOR" to make them clearer --> Use `rename`
new_df = df.rename(columns={'SOP': 'Statement of Purpose', 'LOR': 'Letter of Recommendation'})

new_df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,Statement of Purpose,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [32]:
# LOR did not get renamed, so we need to check the column names
new_df.columns # 'LOR ' and 'Chance of Admit ' have a space

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'Statement of Purpose',
       'LOR ', 'CGPA', 'Research', 'Chance of Admit '],
      dtype='object')
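The silent miss above can be reproduced on a toy frame. This is a minimal sketch (the frame `df_demo` and its values are illustrative): `rename()` skips dictionary keys that match no column, and its `errors='raise'` option turns such a mismatch into a `KeyError`.

```python
import pandas as pd

# Toy frame whose second column name carries a trailing space
df_demo = pd.DataFrame({'SOP': [4.5, 4.0], 'LOR ': [4.5, 4.5]})

# 'LOR' (no space) is not a column, so rename() skips it without an error
renamed = df_demo.rename(columns={'SOP': 'Statement of Purpose',
                                  'LOR': 'Letter of Recommendation'})

# errors='raise' reports labels that are not present instead of ignoring them
try:
    df_demo.rename(columns={'LOR': 'Letter of Recommendation'}, errors='raise')
    mismatch_detected = False
except KeyError:
    mismatch_detected = True
```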

In [34]:
# Rename again
new_df = new_df.rename(columns={'LOR ': 'Letter of Recommendation'})
new_df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,Statement of Purpose,Letter of Recommendation,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [36]:
# `rename` also accepts a function as the mapper: `str.strip` removes leading and trailing whitespace from every column name
new_df = new_df.rename(mapper=str.strip, axis='columns')
new_df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,Statement of Purpose,Letter of Recommendation,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [37]:
# We can also assign a list of names directly to `df.columns`, which renames every column at once (very efficient)
cols = list(df.columns)
cols = [x.lower().strip() for x in cols] # lowercase and strip functions for str

# Replace it in the df
df.columns = cols
df.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


### Querying a DataFrame

**50% of the work in data cleaning involves querying DataFrames**

**Boolean Masking**: $\rightarrow$ fast and efficient querying for `numpy` and `pandas`.

A *Boolean Mask* is a 1-D or 2-D array where each of the values is either `True` (`1`) or `False` (`0`). This array is overlaid on top of the data structure we are querying: any cell aligned with a `True` value is admitted into the final result, and any cell aligned with a `False` value is not.
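A minimal sketch of the idea on a toy Series (the names and scores are illustrative): a comparison produces the mask, and indexing with the mask keeps only the aligned `True` positions.

```python
import pandas as pd

scores = pd.Series([85, 82, 90], index=['Alice', 'Jack', 'Helen'])

# The comparison is applied to every element and yields a Boolean Series
mask = scores > 84

# Indexing with the mask keeps only the positions where it is True
selected = scores[mask]
```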

In [3]:
# Load the Admissions dataset
df = pd.read_csv('../resources/week-2/datasets/Admission_Predict.csv', index_col=0)

df.columns = [x.lower().strip() for x in df.columns] # Clean up column names
df.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


To create a Boolean Mask, apply operators directly to the `pandas` Series or DataFrame objects.

For example, let's filter the students that have a chance higher than 0.7.

This *broadcasts* a comparison operator (`>`) across the column, and the result is returned as a Boolean Series.

The resulting Series keeps the original index, and the value of each cell is either `True` or `False`.

`pandas` applies the comparison through *vectorization*: the operation runs over all the values in the array in optimized compiled code rather than in an explicit Python loop.

In [4]:
admit_mask = df['chance of admit'] > 0.7
admit_mask

Serial No.
1       True
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398     True
399    False
400     True
Name: chance of admit, Length: 400, dtype: bool

Once you have formed the Boolean Mask, you just lay it on top of the data to "hide" the data you don't want (`False` values), using `df.where(boolean_mask)`

In [5]:
df.where(admit_mask).head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
5,,,,,,,,


In [6]:
# If we don't want the NaN values, we just use `dropna()`
df.where(admit_mask).dropna().head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337.0,118.0,4.0,4.5,4.5,9.65,1.0,0.92
2,324.0,107.0,4.0,4.0,4.5,8.87,1.0,0.76
3,316.0,104.0,3.0,3.0,3.5,8.0,1.0,0.72
4,322.0,110.0,3.0,3.5,2.5,8.67,1.0,0.8
6,330.0,115.0,5.0,4.5,3.0,9.34,1.0,0.9


`pandas` also offers a shorthand that combines `where()` and `dropna()`: index the DataFrame directly with the Boolean mask. Unlike `where()`, this keeps the original integer dtypes.

In [7]:
df[df['chance of admit'] > 0.7].head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
6,330,115,5,4.5,3.0,9.34,1,0.9
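The two outputs above differ in one detail: the `where()` route shows `337.0` while direct indexing shows `337`. The sketch below reproduces this on a toy frame (three rows copied from the admissions data; `df_demo` is an illustrative name), assuming the usual behaviour that inserting `NaN` forces integer columns to float.

```python
import pandas as pd

df_demo = pd.DataFrame({'gre score': [337, 314, 330],
                        'chance of admit': [0.92, 0.65, 0.90]})
mask = df_demo['chance of admit'] > 0.7

# where() keeps the shape and fills masked rows with NaN,
# which converts the integer column to float
via_where = df_demo.where(mask).dropna()

# Boolean indexing simply leaves the masked rows out and keeps the dtypes
via_index = df_demo[mask]
```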


**Multiple-Criteria Boolean Masks**:

This can be achieved by using:

* `and` (`&`): if both masks must be `True` for a value to be in the final mask.

* `or` (`|`) if only one needs to be `True`.

Python's `and` and `or` keywords do not work element-wise on `pandas` objects, so you need to use `&` or `|` instead.

In [8]:
(df['chance of admit'] > 0.7) and (df['chance of admit'] < 0.9)

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [9]:
(df['chance of admit'] > 0.7) & (df['chance of admit'] < 0.9) # Remember to use parentheses

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool

Another way to do this is using the built-in functions which mimic this approach: `.gt()` and `.lt()`

In [10]:
df['chance of admit'].gt(0.7) & df['chance of admit'].lt(0.9) # No parentheses needed

Serial No.
1      False
2       True
3       True
4       True
5      False
       ...  
396     True
397     True
398    False
399    False
400    False
Name: chance of admit, Length: 400, dtype: bool

In [11]:
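# Careful: chaining these methods does NOT combine the two conditions.
# .gt(0.7) returns a Boolean Series, and .lt(0.9) then compares those True/False
# values (1/0) with 0.9, so the result is just the negation of the first mask.
df['chance of admit'].gt(0.7).lt(0.9)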

Serial No.
1      False
2      False
3      False
4      False
5       True
       ...  
396    False
397    False
398    False
399     True
400    False
Name: chance of admit, Length: 400, dtype: bool
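The chained output above is `True` exactly where the first comparison was `False`, because `.lt(0.9)` is applied to the Boolean result of `.gt(0.7)`. A hedged sketch of a one-call alternative on a toy Series: `Series.between()` performs the range test directly. The string value `inclusive='neither'` is assumed to be available (pandas 1.3 or later).

```python
import pandas as pd

chance = pd.Series([0.92, 0.76, 0.65, 0.80])

# Reference result built with two comparisons combined by &
combined = (chance > 0.7) & (chance < 0.9)

# between() expresses the same range test in one call;
# inclusive='neither' makes both bounds strict
in_range = chance.between(0.7, 0.9, inclusive='neither')

# Chaining the comparison methods compares the Boolean output of gt()
# with 0.9, which is simply the negation of the first mask
chained = chance.gt(0.7).lt(0.9)
```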

### Indexing DataFrames

The index is a row level label $\rightarrow$ `axis=0`. 

Indices can either be autogenerated (numeric values) or set explicitly (through a dictionary object or when loading a CSV file).

You can also set an index using the `set_index()` function. This function takes a list of columns and promotes those columns to an index. This is a **destructive** process: it does not keep the current index.

In [2]:
# Load the Admissions dataset
df = pd.read_csv('../resources/week-2/datasets/Admission_Predict.csv', index_col=0)

df.columns = [x.lower().strip() for x in df.columns] # Clean up column names
df.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [3]:
# Let's use 'chance of admit' as the index: keeping the Serial No.
# First, copy the indexed data into its own column
df['Serial Number'] = df.index

In [4]:
# Then, set the index to another column
df = df.set_index('chance of admit')
df.head()

chance of admit,gre score,toefl score,university rating,sop,lor,cgpa,research,Serial Number
0.92,337,118,4,4.5,4.5,9.65,1,1
0.76,324,107,4,4.0,4.5,8.87,1,2
0.72,316,104,3,3.0,3.5,8.0,1,3
0.8,322,110,3,3.5,2.5,8.67,1,4
0.65,314,103,2,2.0,3.0,8.21,0,5


In [5]:
# You can reset the index: use `reset_index()` -> Creates a default numbered index
df = df.reset_index()
df.head()

Unnamed: 0,chance of admit,gre score,toefl score,university rating,sop,lor,cgpa,research,Serial Number
0,0.92,337,118,4,4.5,4.5,9.65,1,1
1,0.76,324,107,4,4.0,4.5,8.87,1,2
2,0.72,316,104,3,3.0,3.5,8.0,1,3
3,0.8,322,110,3,3.5,2.5,8.67,1,4
4,0.65,314,103,2,2.0,3.0,8.21,0,5


**Multi-level indexing**: Similar to composite keys in relational database systems. You do this by calling the `set_index()` function and giving a list of columns you're interested in promoting to an index.

`pandas` will search through these in order, finding the *distinct* data and forming composite indices.

In [8]:
# Using the census dataset -> The United States Census Bureau: breakdown of the population level data at the US county level
df = pd.read_csv('../resources/week-2/datasets/census.csv')
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


In [9]:
df.columns

Index(['SUMLEV', 'REGION', 'DIVISION', 'STATE', 'COUNTY', 'STNAME', 'CTYNAME',
       'CENSUS2010POP', 'ESTIMATESBASE2010', 'POPESTIMATE2010',
       'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013',
       'POPESTIMATE2014', 'POPESTIMATE2015', 'NPOPCHG_2010', 'NPOPCHG_2011',
       'NPOPCHG_2012', 'NPOPCHG_2013', 'NPOPCHG_2014', 'NPOPCHG_2015',
       'BIRTHS2010', 'BIRTHS2011', 'BIRTHS2012', 'BIRTHS2013', 'BIRTHS2014',
       'BIRTHS2015', 'DEATHS2010', 'DEATHS2011', 'DEATHS2012', 'DEATHS2013',
       'DEATHS2014', 'DEATHS2015', 'NATURALINC2010', 'NATURALINC2011',
       'NATURALINC2012', 'NATURALINC2013', 'NATURALINC2014', 'NATURALINC2015',
       'INTERNATIONALMIG2010', 'INTERNATIONALMIG2011', 'INTERNATIONALMIG2012',
       'INTERNATIONALMIG2013', 'INTERNATIONALMIG2014', 'INTERNATIONALMIG2015',
       'DOMESTICMIG2010', 'DOMESTICMIG2011', 'DOMESTICMIG2012',
       'DOMESTICMIG2013', 'DOMESTICMIG2014', 'DOMESTICMIG2015', 'NETMIG2010',
       'NETMIG2011', 'NETMIG2012', 'NETMI

This dataset has two summary levels (`SUMLEV`): `40` marks rows that summarise a whole state, and `50` marks rows for individual counties.

If we want to see a list of all the unique values in a given column, we can use the `unique()` function, similar to the DISTINCT operator in SQL.

In [10]:
# Run unique on the sum level
df['SUMLEV'].unique() # Only two unique values: 40 and 50

array([40, 50], dtype=int64)

In [11]:
# Excluding all of the rows that are summaries at the state level
df = df[df['SUMLEV'] == 50]
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [12]:
# Reducing the data to just the total population estimates and the total number of births
columns_to_keep = ['STNAME', 'CTYNAME', 'BIRTHS2010', 'BIRTHS2011', 'BIRTHS2012', 'BIRTHS2013',
                   'BIRTHS2014', 'BIRTHS2015', 'POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 
                   'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
df = df[columns_to_keep]
df.head()

Unnamed: 0,STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
1,Alabama,Autauga County,151,636,615,574,623,600,54660,55253,55175,55038,55290,55347
2,Alabama,Baldwin County,517,2187,2092,2160,2186,2240,183193,186659,190396,195126,199713,203709
3,Alabama,Barbour County,70,335,300,283,260,269,27341,27226,27159,26973,26815,26489
4,Alabama,Bibb County,44,266,245,259,247,253,22861,22733,22642,22512,22549,22583
5,Alabama,Blount County,183,744,710,646,618,603,57373,57711,57776,57734,57658,57673


In [13]:
# The US Census data breaks down population estimates by state and county.
# Let's set the index to be a combination of the state and county values:
df = df.set_index(['STNAME', 'CTYNAME']) # [List of column identifiers you want indexed]
df.head()

STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
Alabama,Autauga County,151,636,615,574,623,600,54660,55253,55175,55038,55290,55347
Alabama,Baldwin County,517,2187,2092,2160,2186,2240,183193,186659,190396,195126,199713,203709
Alabama,Barbour County,70,335,300,283,260,269,27341,27226,27159,26973,26815,26489
Alabama,Bibb County,44,266,245,259,247,253,22861,22733,22642,22512,22549,22583
Alabama,Blount County,183,744,710,646,618,603,57373,57711,57776,57734,57658,57673


In [14]:
# Querying this DataFrame
# Using Multi-Index by providing the arguments in the order of level
df.loc['Michigan', 'Washtenaw County'] # df.loc[State, County]

BIRTHS2010            977
BIRTHS2011           3826
BIRTHS2012           3780
BIRTHS2013           3662
BIRTHS2014           3683
BIRTHS2015           3709
POPESTIMATE2010    345563
POPESTIMATE2011    349048
POPESTIMATE2012    351213
POPESTIMATE2013    354289
POPESTIMATE2014    357029
POPESTIMATE2015    358880
Name: (Michigan, Washtenaw County), dtype: int64

In [16]:
# Comparing two counties -> Use a list of tuples
df.loc[[('Michigan', 'Washtenaw County'),
       ('Michigan', 'Wayne County')]]

STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
Michigan,Washtenaw County,977,3826,3780,3662,3683,3709,345563,349048,351213,354289,357029,358880
Michigan,Wayne County,5918,23819,23270,23377,23607,23586,1815199,1801273,1792514,1775713,1766008,1759335
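A multi-level index can also be queried with only part of the key. The sketch below uses a three-row stand-in for the census frame (`df_demo` is an illustrative name; the population figures are copied from the outputs above) to show selection by the first level alone and by the full key plus a column.

```python
import pandas as pd

# Small stand-in for the census frame, indexed by state and county
df_demo = pd.DataFrame(
    {'STNAME': ['Alabama', 'Michigan', 'Michigan'],
     'CTYNAME': ['Autauga County', 'Washtenaw County', 'Wayne County'],
     'POPESTIMATE2015': [55347, 358880, 1759335]}
).set_index(['STNAME', 'CTYNAME'])

# A label for the first level only returns every county in that state
michigan = df_demo.loc['Michigan']

# The full key plus a column label returns a single value
wayne_pop = df_demo.loc[('Michigan', 'Wayne County'), 'POPESTIMATE2015']
```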


### Missing Values

Missing values are very common in data cleaning activities. 

Missing values may be due to:

* Omissions that are related to other observed variables, e.g. certain survey respondents being more likely to skip a question: **Missing at random (MAR)**
* Missing data without relationship to other variables: **Missing completely at random (MCAR)**
* Data that was never collected, either because the person responsible (e.g. a researcher) did not record it or because collecting it would not have made sense

Most missing values are formatted as `NaN`, `NULL`, `None`, or `N/A`. However, sometimes they are not labelled so clearly. For example, a sentinel value such as `99` may stand for "missing" in a column that otherwise holds binary categories.
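One way to handle such sentinel values is to declare them when loading the file. This is a sketch under assumptions: the survey extract, its column names, and the meaning of `99` are all hypothetical. It relies on the `na_values` parameter of `pd.read_csv`, which accepts a per-column dictionary of extra markers to treat as missing.

```python
import io
import pandas as pd

# Hypothetical survey extract in which 99 means "no answer" for 'smoker'
raw = io.StringIO("respondent,smoker\n1,0\n2,99\n99,1\n")

# Restrict the extra missing-value marker to the 'smoker' column,
# so that respondent number 99 is left alone
survey = pd.read_csv(raw, na_values={'smoker': [99]})
```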

In [3]:
df = pd.read_csv('../resources/week-2/datasets/class_grades.csv')
df.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,,63.15,48.89
3,7,,,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61


You can create a **boolean mask** of the missing data using the `.isnull()` function.

In [4]:
mask = df.isnull()
mask.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,True,False,False
3,False,True,True,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,True,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False
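The mask becomes more useful once it is summarised. A minimal sketch on a toy frame (`grades` holds a few values in the style of the class grades data): summing the mask counts missing cells per column, and `any(axis=1)` picks out the affected rows.

```python
import numpy as np
import pandas as pd

grades = pd.DataFrame({'Midterm': [64.38, np.nan, 49.38],
                       'TakeHome': [51.48, 63.15, np.nan],
                       'Final': [52.5, 48.89, 80.56]})

# True counts as 1, so summing the mask counts missing cells per column
missing_per_column = grades.isnull().sum()

# any(axis=1) flags the rows that contain at least one missing value
rows_with_missing = grades[grades.isnull().any(axis=1)]
```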


**`dropna()`**:

We can also **drop all of those rows** which have any missing data using the `.dropna()` function.

In [6]:
df.dropna().head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61
10,7,80.44,90.2,75.0,91.48,39.72
12,8,97.16,103.71,72.5,93.52,63.33
13,7,91.28,83.53,81.25,99.81,92.22
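Dropping every row with any gap can discard a lot of data. As a hedged sketch on a toy frame (values are illustrative), the `how` and `subset` parameters of `dropna()` narrow down which rows are removed.

```python
import numpy as np
import pandas as pd

grades = pd.DataFrame({'Midterm': [64.38, np.nan, np.nan],
                       'Final': [52.5, np.nan, 80.56]})

# Default: drop a row if any of its values is missing
complete_rows = grades.dropna()

# how='all': drop a row only if every value in it is missing
not_empty_rows = grades.dropna(how='all')

# subset: only look at the listed columns when deciding
has_final = grades.dropna(subset=['Final'])
```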


**`fillna()`**:

Another option is the **filling function** `.fillna()`, which takes either a value or a fill method.

*Number*: You can pass in a single value, a scalar value, to change all of the missing data to one value.

In [7]:
# Filling all missing values with 0
df.fillna(0, inplace=True)
df.head(10)

Unnamed: 0,Prefix,Assignment,Tutorial,Midterm,TakeHome,Final
0,5,57.14,34.09,64.38,51.48,52.5
1,8,95.05,105.49,67.5,99.07,68.33
2,8,83.7,83.17,0.0,63.15,48.89
3,7,0.0,0.0,49.38,105.93,80.56
4,8,91.32,93.64,95.0,107.41,73.89
5,7,95.0,92.58,93.12,97.78,68.06
6,8,95.05,102.99,56.25,99.07,50.0
7,7,72.85,86.85,60.0,0.0,56.11
8,8,84.26,93.1,47.5,18.52,50.83
9,7,90.1,97.55,51.25,88.89,63.61


*Method*: You can pass one of the common filling methods: `ffill` or `bfill`. 

* `ffill` is for *forward filling* and it updates a `NaN` value for a particular cell with the value from the previous row.
* `bfill` is for *backward filling*, which is the opposite of `ffill`. It fills the missing values with the next valid value.

**Important** Your data needs to be sorted in order for this to have the effect you want. Data from traditional database management systems usually has no order guarantee.
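A minimal sketch of both directions on a toy Series (the values are illustrative), using the `ffill()` and `bfill()` methods, which apply the same two filling strategies:

```python
import numpy as np
import pandas as pd

volume = pd.Series([10.0, np.nan, np.nan, 5.0, np.nan])

# Forward fill: each gap takes the last valid value before it
forward = volume.ffill()

# Backward fill: each gap takes the next valid value after it;
# the trailing gap has nothing after it and stays missing
backward = volume.bfill()
```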

**`method='ffill'`**:

We will use the *method* alternative for a dataset which has logs from online learning systems. In these systems, it's common for the video players to have a heartbeat functionality where playback statistics are sent to the server every so often (e.g. every 30 seconds). The following dataset has an example of these systems.

In [9]:
df = pd.read_csv('../resources/week-2/datasets/log.csv')
df.head(20)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


In this data, the first column is a timestamp in the Unix epoch format. The next column is the user name followed by a webpage they're visiting and the video that they're playing. Each row of the DataFrame has a playback position. We can see that as the playback position increases by one, the timestamp increases by about 30 seconds.

However, if a user pauses playback, the time increases but the playback position doesn't change.

There are a lot of missing values in the paused and volume columns. It's not efficient to send this information across the network if it hasn't changed, so this particular system just inserts null values into the database if there are no changes.

First, we will sort data by promoting `time` as the index and then sorting the index. 

However, we realise that the `time` isn't really unique because two users can use the system at the same time. So, we reset the index and use *multi-level* indexing on time AND user together instead.

In [10]:
# Set "time" as index and sort it
df = df.set_index('time')
df = df.sort_index()
df.head(20)

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,


In [11]:
# Resetting the index and switching to multi-level indexing
df = df.reset_index()
df = df.set_index(['time', 'user'])
df.head(20)

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,


Now, we can fill the missing data using `ffill`.

In [12]:
df = df.ffill() # Same result as fillna(method='ffill'), which recent pandas versions deprecate
df.head()

time,user,video,playback position,paused,volume
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,False,10.0
1469974454,sue,advanced.html,24,False,10.0
1469974484,cheryl,intro.html,7,False,10.0
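One caveat: a plain forward fill copies from the previous row regardless of which user that row belongs to. In the output above this is harmless because both users share the same settings, but with different settings it would mix them up. The sketch below uses a hypothetical four-row log (not the course dataset) to compare a plain fill with a per-user fill via `groupby('user')`.

```python
import numpy as np
import pandas as pd

# Hypothetical log: two users with different volume settings
log = pd.DataFrame({
    'time': [1, 1, 2, 2],
    'user': ['cheryl', 'sue', 'cheryl', 'sue'],
    'volume': [10.0, 5.0, np.nan, np.nan],
})

# Plain ffill copies from the previous row, whoever it belongs to
plain = log['volume'].ffill()

# Grouping by user first fills each gap from that user's own last value
per_user = log.groupby('user')['volume'].ffill()
```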


**Customised `replace()`**:

We can also do a customised fill-in using the `replace()` function. It supports several approaches:

* Value-to-Value
* List
* Dictionary
* Regex

In [13]:
# A simple example for Value-to-Value
df = pd.DataFrame({'A': [1, 1, 2, 3, 4],
                   'B': [3, 6, 3, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})

df

Unnamed: 0,A,B,C
0,1,3,a
1,1,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [14]:
# Replacing 1s with 100s
df.replace(1, 100)

Unnamed: 0,A,B,C
0,100,3,a
1,100,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [15]:
# Now using lists: 1s and 3s with 100s and 300s
df.replace([1, 3], [100, 300])

Unnamed: 0,A,B,C
0,100,300,a
1,100,6,b
2,2,300,c
3,300,8,d
4,4,9,e
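The dictionary approach is not demonstrated above, so here is a minimal sketch on the same toy values (`df_demo` is an illustrative name). A flat dictionary maps old values to new ones in every column, and a nested dictionary limits the replacement to one column.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 1, 2, 3, 4],
                        'B': [3, 6, 3, 8, 9]})

# Flat dictionary: each key is replaced by its value, in every column
replaced = df_demo.replace({1: 100, 3: 300})

# Nested dictionary: only look in column 'B' for the value 3
only_b = df_demo.replace({'B': {3: 300}})
```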


Using the Regex approach for our logs dataset.

In [16]:
df = pd.read_csv('../resources/week-2/datasets/log.csv')
df.head(20)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


To replace using **regex**:

* First parameter is the regex pattern we want to match
* Second parameter is the value we want to emit upon match
* Third parameter is `regex=True`

What is the regex pattern to detect all 'html' pages in the `video` column and overwrite that with the keyword 'webpage'?

**R/**: `".*\.html$"`   (`".*"`: any number of characters, `"\.html$"`: a literal ".html" anchored to the end)

In [17]:
df.replace(to_replace=r'.*\.html$', value='webpage', regex=True)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,webpage,5,False,10.0
1,1469974454,cheryl,webpage,6,,
2,1469974544,cheryl,webpage,9,,
3,1469974574,cheryl,webpage,10,,
4,1469977514,bob,webpage,1,,
5,1469977544,bob,webpage,1,,
6,1469977574,bob,webpage,1,,
7,1469977604,bob,webpage,1,,
8,1469974604,cheryl,webpage,11,,
9,1469974694,cheryl,webpage,14,,
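A detail worth checking in such patterns is the dot before `html`: unescaped, it matches any character, not just a literal period. The sketch below compares the two variants on a few hypothetical page names (`pages` is not part of the log data).

```python
import pandas as pd

# Hypothetical page names; only the first two end in a literal ".html"
pages = pd.Series(['intro.html', 'advanced.html', 'notes-xhtml', 'index.htm'])

# Unescaped dot: "." matches any character, so 'notes-xhtml' matches too
loose = pages.replace(to_replace=r'.*.html$', value='webpage', regex=True)

# Escaped dot: only names that really end in ".html" are replaced
strict = pages.replace(to_replace=r'.*\.html$', value='webpage', regex=True)
```

On the log data both patterns give the same result, since every video name ends in `.html`.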


Finally, when using statistical functions on DataFrames, these functions typically **ignore missing values**. This is usually what you want, but you should be aware of the values that are being excluded: why do you have missing values?

Inferring missing values might be unreasonable.
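To see how statistical functions skip missing values, here is a minimal sketch on a toy Series (two of the midterm scores from the grades data plus one gap). It shows the default behaviour of `mean()`, the `skipna=False` option, and `count()` as a quick check of how many values were actually used.

```python
import numpy as np
import pandas as pd

midterm = pd.Series([64.38, np.nan, 49.38])

# By default the missing entry is skipped: the mean covers 2 values
mean_ignoring_nan = midterm.mean()

# skipna=False propagates the missing value instead
mean_strict = midterm.mean(skipna=False)

# count() reports how many non-missing values there are
n_used = midterm.count()
```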
