## More Data Processing with Pandas

Reference books: 

* [Python for Data Analysis by McKinney](https://wesmckinney.com/pages/book.html)

### Merging Dataframes

When bringing multiple `DataFrame` objects together, we can either merge them horizontally (`merge()`) or stack them vertically (`concat()`).

We also need to understand a little *relational theory* and get some language conventions down.

**Venn Diagram**: Venn diagrams are usually used to show set membership. This is an example of all the kinds of set membership:

![Venn Diagram](https://docs.trifacta.com/download/attachments/160412683/JoinVennDiagram.png?version=1&modificationDate=1596167437085&api=v2)

A Venn diagram shows two populations that we might have data about, with some overlap between them. In `pandas`, these two populations can be two separate DataFrames, identified by their indices, that we want to join together. To do that, we have some choices to make:

* **Full Outer Join**: or *Union* in set theory, it's used to get a list of both populations regardless of what group/DataFrame they belong to. $\rightarrow$ Everyone in any circle.

* **Inner Join**: or *Intersection* in set theory, it's used to get a list of people who belong to both groups at the same time. It's the overlapping parts of each circle.

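* **Left Join**: We use this to get a list of everyone in the first (left) group, regardless of whether they are in the other group. If they are, we also get their information from the other group.

* **Right Join**: The mirror image of the Left Join. We get everyone in the second (right) group, plus their information from the left group where it exists.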
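
Before turning to pandas, the four choices above can be sketched with plain Python sets. This is only an illustration of the set logic, not of how `merge()` is implemented; the names are the ones used in the staff/student example in this section.

```python
# Two overlapping populations, identified by name
staff = {'Kelly', 'Sally', 'James'}
students = {'James', 'Mike', 'Sally'}

full_outer = staff | students   # union: everyone in any circle
inner = staff & students        # intersection: people in both groups
left_keys = staff               # a left join keeps every member of the left group
right_keys = students           # a right join keeps every member of the right group

print(sorted(full_outer))  # ['James', 'Kelly', 'Mike', 'Sally']
print(sorted(inner))       # ['James', 'Sally']
```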

Let's jump into Pandas to see how this works:

In [1]:
import pandas as pd

In [2]:
# We will use an example with students and staff
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])

staff_df = staff_df.set_index('Name')
staff_df.head()

Name,Role
Kelly,Director of HR
Sally,Course liasion
James,Grader


In [3]:
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])

student_df = student_df.set_index('Name')
student_df.head()

Name,School
James,Business
Mike,Law
Sally,Engineering


**Important**: To merge on the index, both DataFrames need to be indexed by the value we want to merge them on. In this case: `'Name'`.

**Union**:

We call `merge()`, passing in the DataFrame on the left and the one on the right, and tell it to do an *outer join* by passing `how='outer'`. We also indicate that we want to use the indexes as the join keys with `left_index=True` and `right_index=True`.

In [4]:
pd.merge(staff_df, student_df, how='outer', left_index=True, right_index=True)

Name,Role,School
James,Grader,Business
Kelly,Director of HR,
Mike,,Law
Sally,Course liasion,Engineering
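
For index-on-index merges like this one, `DataFrame.join()` is a shorter way to write the same thing. The sketch below rebuilds the two frames under new names (`staff` and `students`, chosen here so the notebook's own variables are not overwritten) and compares the two calls.

```python
import pandas as pd

# Rebuild the example frames, indexed by 'Name'
staff = pd.DataFrame({'Name': ['Kelly', 'Sally', 'James'],
                      'Role': ['Director of HR', 'Course liaison', 'Grader']}).set_index('Name')
students = pd.DataFrame({'Name': ['James', 'Mike', 'Sally'],
                         'School': ['Business', 'Law', 'Engineering']}).set_index('Name')

# join() uses the index of both frames as the key by default
joined = staff.join(students, how='outer')

# The explicit merge() call used in this section
merged = pd.merge(staff, students, how='outer', left_index=True, right_index=True)

print(joined)
print(joined.equals(merged))
```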


**Intersection**:

Similar to Union, but we pass `how='inner'`.

In [5]:
pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)

Name,Role,School
Sally,Course liasion,Engineering
James,Grader,Business


**Left Join**:

We pass `how='left'`. Here, the order of the DataFrames is important: the first argument is the left DataFrame, which holds the main population, and the second is the right DataFrame.

In [6]:
pd.merge(staff_df, student_df, how='left', left_index=True, right_index=True)

Name,Role,School
Kelly,Director of HR,
Sally,Course liasion,Engineering
James,Grader,Business


**Right Join**:

We pass `how='right'`.

In [7]:
pd.merge(staff_df, student_df, how='right', left_index=True, right_index=True)

Name,Role,School
James,Grader,Business
Mike,,Law
Sally,Course liasion,Engineering
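
To see which side each row of a join came from, `merge()` accepts `indicator=True`, which adds a `_merge` column. The sketch below is one way to check the four join types against the Venn diagram; the variable names `staff` and `students` are chosen here so the notebook's own frames are left alone.

```python
import pandas as pd

staff = pd.DataFrame({'Name': ['Kelly', 'Sally', 'James'],
                      'Role': ['Director of HR', 'Course liaison', 'Grader']}).set_index('Name')
students = pd.DataFrame({'Name': ['James', 'Mike', 'Sally'],
                         'School': ['Business', 'Law', 'Engineering']}).set_index('Name')

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
merged = pd.merge(staff, students, how='outer',
                  left_index=True, right_index=True, indicator=True)
print(merged['_merge'])

# Rows marked 'both' are exactly the rows an inner join would return
in_both = merged[merged['_merge'] == 'both']
print(sorted(in_both.index))  # ['James', 'Sally']
```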


**Use columns for Joins**:

We use the parameter `on` to define the column we want to do the merging with.

In [8]:
# First, let's reset the index for both dfs
staff_df = staff_df.reset_index()
student_df = student_df.reset_index()

In [9]:
# Use merge on the column name
pd.merge(staff_df, student_df, how='right', on='Name')

Unnamed: 0,Name,Role,School
0,James,Grader,Business
1,Mike,,Law
2,Sally,Course liasion,Engineering


**Managing conflicts when merging**:

In [12]:
# Let's add a new column 'Location' to both dfs
staff_df['Location'] = ['State Street', 'Washington Avenue', 'Washington Avenue']
student_df['Location'] = ['1024 Billiard Avenue', 'Fraternity House #22', '512 Wilson Crescent']

In [13]:
staff_df # with office locations

Unnamed: 0,Name,Role,Location
0,Kelly,Director of HR,State Street
1,Sally,Course liasion,Washington Avenue
2,James,Grader,Washington Avenue


In [14]:
student_df # with home addresses

Unnamed: 0,Name,School,Location
0,James,Business,1024 Billiard Avenue
1,Mike,Law,Fraternity House #22
2,Sally,Engineering,512 Wilson Crescent


The `merge()` function preserves columns that appear in both DataFrames but are not join keys: in this case, the column `'Location'`. It appends a suffix to each conflicting column name to show which DataFrame it came from: `_x` for the left DataFrame and `_y` for the right one.

In [15]:
pd.merge(staff_df, student_df, how='left', on='Name')

Unnamed: 0,Name,Role,Location_x,School,Location_y
0,Kelly,Director of HR,State Street,,
1,Sally,Course liasion,Washington Avenue,Engineering,512 Wilson Crescent
2,James,Grader,Washington Avenue,Business,1024 Billiard Avenue


`Location_x` refers to the office address (staff/left df) whereas `Location_y` refers to the home address (student/right df)
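
The default `_x`/`_y` suffixes can be replaced with more descriptive ones through the `suffixes` parameter of `merge()`. The sketch below rebuilds the two frames under new names; the suffixes `_office` and `_home` are illustrative choices, not something the original notebook uses.

```python
import pandas as pd

staff = pd.DataFrame({'Name': ['Kelly', 'Sally', 'James'],
                      'Role': ['Director of HR', 'Course liaison', 'Grader'],
                      'Location': ['State Street', 'Washington Avenue', 'Washington Avenue']})
students = pd.DataFrame({'Name': ['James', 'Mike', 'Sally'],
                         'School': ['Business', 'Law', 'Engineering'],
                         'Location': ['1024 Billiard Avenue', 'Fraternity House #22',
                                      '512 Wilson Crescent']})

# suffixes=(left, right) replaces the default ('_x', '_y')
merged = pd.merge(staff, students, how='left', on='Name',
                  suffixes=('_office', '_home'))
print(merged.columns.tolist())
# ['Name', 'Role', 'Location_office', 'School', 'Location_home']
```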

**Merging with multi-indexing and multiple columns**:

We can use `merge()` with multiple columns as join keys. Every key column needs to exist in both DataFrames.

In [16]:
# We create a similar example for students and staff
staff_df = pd.DataFrame([{'First Name': 'Kelly', 'Last Name': 'Desjardins',
                          'Role': 'Director of HR'},
                          {'First Name': 'Sally', 'Last Name': 'Brooks',
                           'Role': 'Course liasion'},
                          {'First Name': 'James', 'Last Name': 'Wilde',
                           'Role': 'Grader'}])
staff_df.head()

Unnamed: 0,First Name,Last Name,Role
0,Kelly,Desjardins,Director of HR
1,Sally,Brooks,Course liasion
2,James,Wilde,Grader


In [17]:
student_df = pd.DataFrame([{'First Name': 'James', 'Last Name': 'Hammond',
                            'School': 'Business'},
                           {'First Name': 'Mike', 'Last Name': 'Smith',
                            'School': 'Law'},
                           {'First Name': 'Sally', 'Last Name': 'Brooks',
                            'School': 'Engineering'}])
student_df.head()

Unnamed: 0,First Name,Last Name,School
0,James,Hammond,Business
1,Mike,Smith,Law
2,Sally,Brooks,Engineering


In [18]:
# An inner join returns only Sally Brooks, the one row where both keys (['First Name', 'Last Name']) match
pd.merge(staff_df, student_df, how='inner', on=['First Name', 'Last Name'])

Unnamed: 0,First Name,Last Name,Role,School
0,Sally,Brooks,Course liasion,Engineering
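
To see why the second key matters, the sketch below compares merging on `'First Name'` alone with merging on both columns. It rebuilds the same data under new variable names; with a single key, James Wilde (staff) and James Hammond (student) would be treated as the same person.

```python
import pandas as pd

staff = pd.DataFrame({'First Name': ['Kelly', 'Sally', 'James'],
                      'Last Name': ['Desjardins', 'Brooks', 'Wilde'],
                      'Role': ['Director of HR', 'Course liaison', 'Grader']})
students = pd.DataFrame({'First Name': ['James', 'Mike', 'Sally'],
                         'Last Name': ['Hammond', 'Smith', 'Brooks'],
                         'School': ['Business', 'Law', 'Engineering']})

# One key: both Sally and James match, and 'Last Name' becomes a conflicting column
by_first = pd.merge(staff, students, how='inner', on='First Name')

# Two keys: only Sally Brooks matches
by_both = pd.merge(staff, students, how='inner', on=['First Name', 'Last Name'])

print(len(by_first), len(by_both))  # 2 1
print(by_first.columns.tolist())
```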


### Concatenating Dataframes

When using `concat()`, we stack DataFrames on top of each other.

For this task, we are using the US Department of Education College Scorecard data. The data includes information on student completion, student debt, after-graduation income, etc. It is stored in separate CSV files, each containing one year's records.

We want to create a DataFrame with records from 2011 to 2013.

**Note**: I used the 2010-11 file because the 2011-12 file is corrupt.

In [23]:
%%capture
# Used to suppress Jupyter warning messages due to the messy file

# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
df_2011 = pd.read_csv('../resources/week-3/datasets/college_scorecard/MERGED2010_11_PP.csv', on_bad_lines='skip')
df_2012 = pd.read_csv('../resources/week-3/datasets/college_scorecard/MERGED2012_13_PP.csv', on_bad_lines='skip')
df_2013 = pd.read_csv('../resources/week-3/datasets/college_scorecard/MERGED2013_14_PP.csv', on_bad_lines='skip')

In [24]:
df_2011.head(2) # More than 1900 columns

Unnamed: 0,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
0,100654,100200,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
1,100663,105200,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,


In [25]:
df_2012.head(2)

Unnamed: 0,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
0,100654,100200,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
1,100663,105200,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,


In [26]:
df_2013.head(2)

Unnamed: 0,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
0,100654,100200,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
1,100663,105200,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,


In [27]:
# Let's put all three DataFrames in a list and pass it into the concat() function
frames = [df_2011, df_2012, df_2013]
pd.concat(frames) # More than 23000 rows

Unnamed: 0,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
0,100654,100200,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
1,100663,105200,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,
2,100690,2503400,25034,Amridge University,Montgomery,AL,36117-3553,,,,...,,,,,,,,,,
3,100706,105500,1055,University of Alabama in Huntsville,Huntsville,AL,35899,,,,...,,,,,,,,,,
4,100724,100500,1005,Alabama State University,Montgomery,AL,36104-0271,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7799,48285703,157107,1571,Georgia Military College-Columbus Campus,Columbus,GA,31909,,,,...,,,,,,,,,,
7800,48285704,157101,1571,Georgia Military College-Valdosta Campus,Valdosta,GA,31605,,,,...,,,,,,,,,,
7801,48285705,157105,1571,Georgia Military College-Warner Robins Campus,Warner Robins,GA,31093,,,,...,,,,,,,,,,
7802,48285706,157100,1571,Georgia Military College-Online,Milledgeville,GA,31061,,,,...,,,,,,,,,,


In [29]:
# Check that the row count of the concatenated DataFrame matches the sum of the three inputs
len(df_2011) + len(df_2012) + len(df_2013) 

23011
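
The same check can be reproduced without the Scorecard files. The sketch below uses two tiny hypothetical frames (the values are made up; only the column names are borrowed from the data above). It also shows that `concat()` keeps each frame's original row labels, which is why the output above ends at index 7802 rather than 23010; passing `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

# Hypothetical stand-ins for two yearly files
year_a = pd.DataFrame({'UNITID': [1, 2], 'INSTNM': ['School A', 'School B']})
year_b = pd.DataFrame({'UNITID': [3], 'INSTNM': ['School C']})

stacked = pd.concat([year_a, year_b])
print(len(stacked) == len(year_a) + len(year_b))  # True
print(stacked.index.tolist())                     # [0, 1, 0]

# ignore_index=True discards the original labels and renumbers from 0
renumbered = pd.concat([year_a, year_b], ignore_index=True)
print(renumbered.index.tolist())                  # [0, 1, 2]
```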

**Adding keys**:

Once the three DataFrames are concatenated, there is no longer any way to tell which year each observation came from.

The `keys` parameter of `concat()` solves this problem by labeling the rows of each input DataFrame.

In [30]:
pd.concat(frames, keys=['2011', '2012', '2013']) # Adds the keys as the outer level of the index

Unnamed: 0,Unnamed: 1,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
2011,0,100654,100200,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
2011,1,100663,105200,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,
2011,2,100690,2503400,25034,Amridge University,Montgomery,AL,36117-3553,,,,...,,,,,,,,,,
2011,3,100706,105500,1055,University of Alabama in Huntsville,Huntsville,AL,35899,,,,...,,,,,,,,,,
2011,4,100724,100500,1005,Alabama State University,Montgomery,AL,36104-0271,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2013,7799,48285703,157107,1571,Georgia Military College-Columbus Campus,Columbus,GA,31909,,,,...,,,,,,,,,,
2013,7800,48285704,157101,1571,Georgia Military College-Valdosta Campus,Valdosta,GA,31605,,,,...,,,,,,,,,,
2013,7801,48285705,157105,1571,Georgia Military College-Warner Robins Campus,Warner Robins,GA,31093,,,,...,,,,,,,,,,
2013,7802,48285706,157100,1571,Georgia Military College-Online,Milledgeville,GA,31061,,,,...,,,,,,,,,,
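
Because the keys become the outer level of a `MultiIndex`, each year can be pulled back out with `.loc`. The sketch below uses small hypothetical frames in place of the Scorecard data; the level names `'year'` and `'row'` are illustrative choices passed through the `names` parameter of `concat()`.

```python
import pandas as pd

# Hypothetical stand-ins for two yearly files
y2011 = pd.DataFrame({'UNITID': [1, 2], 'INSTNM': ['School A', 'School B']})
y2012 = pd.DataFrame({'UNITID': [1, 3], 'INSTNM': ['School A', 'School C']})

combined = pd.concat([y2011, y2012], keys=['2011', '2012'], names=['year', 'row'])

# Select every row that came from the 2011 frame
only_2011 = combined.loc['2011']
print(only_2011)

# Or turn the year level into an ordinary column
flat = combined.reset_index(level='year')
print(flat['year'].tolist())  # ['2011', '2011', '2012', '2012']
```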


**Inner and Outer Joins when concatenating**:

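The `concat()` function takes a `join` parameter that can be `'outer'` (the default) or `'inner'`. If the DataFrames don't have identical columns and you choose `'outer'`, every column is kept and the cells with no data are filled with `NaN`. If you choose `'inner'`, only the columns present in every DataFrame are kept and the rest are dropped; no rows are lost.

This is similar to the outer and inner joins of the `merge()` function, applied to column labels instead of key values.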
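
A small sketch of the `join` argument of `concat()`, using two hypothetical frames that share only one column. When stacking rows, `join='outer'` keeps the union of the columns and fills the gaps with `NaN`, while `join='inner'` keeps only the columns common to all frames.

```python
import pandas as pd

# Hypothetical frames: only 'UNITID' appears in both
left = pd.DataFrame({'UNITID': [1, 2], 'CITY': ['Normal', 'Birmingham']})
right = pd.DataFrame({'UNITID': [3], 'ZIP': ['35762']})

outer = pd.concat([left, right], join='outer')  # the default
inner = pd.concat([left, right], join='inner')

print(outer.columns.tolist())  # ['UNITID', 'CITY', 'ZIP']
print(inner.columns.tolist())  # ['UNITID']
print(len(outer), len(inner))  # 3 3
```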