## Part 3 - Merging

Let's import the cleaned CollegeData and cleaned USNews rankings dataframes:

In [1]:
import numpy as np
import pandas as pd
from fuzzywuzzy import process
from collegedata_names import school_rename_dict

COLLEGEDATA_CLEAN_CSV_PATH = 'data/collegedata_clean.csv'
USNEWS_CLEAN_CSV_PATH = 'data/usnews_clean.csv'

cols = ['Name', 'State', 'City']

collegedata_df = pd.read_csv(COLLEGEDATA_CLEAN_CSV_PATH, index_col = cols)
usnews_df = pd.read_csv(USNEWS_CLEAN_CSV_PATH, index_col = cols)

len(usnews_df)

1381

Since we have fewer ranks than schools, let's start by inner joining usnews_df to collegedata_df on the shared Name, State, and City index levels to create a new dataframe, joined_df:

In [2]:
joined_df = collegedata_df.merge(usnews_df, how = 'inner', left_index = True, 
                                 right_index = True, validate = '1:1')

collegedata_df = collegedata_df.drop(joined_df.index)
usnews_df = usnews_df.drop(joined_df.index)

len(usnews_df)

299
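Before chasing the orphans, it can help to see how the rows split between the two sources. The sketch below is not part of the original workflow: it uses small made-up dataframes (`left`, `right`) in place of collegedata_df and usnews_df, and an outer merge with `indicator=True` to label each row as matched or unmatched.

```python
import pandas as pd

# Toy stand-ins for the two cleaned dataframes (all values are made up).
cols = ['Name', 'State', 'City']
left = pd.DataFrame({
    'Name': ['Alpha College', 'Beta University', 'Gamma College'],
    'State': ['NY', 'PA', 'OH'],
    'City': ['Albany', 'Erie', 'Akron'],
    'Enrollment': [1200, 5400, 800],
}).set_index(cols)
right = pd.DataFrame({
    'Name': ['Alpha College', 'Beta Univ.'],
    'State': ['NY', 'PA'],
    'City': ['Albany', 'Erie'],
    'Rank': [10, 25],
}).set_index(cols)

# An outer merge keeps every row; the _merge column records where it was found.
audit = left.merge(right, how='outer', left_index=True, right_index=True,
                   validate='1:1', indicator=True)
counts = audit['_merge'].value_counts()
print(counts)
```

Rows labelled `right_only` play the role of the orphaned ranks counted above.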

Now we need to find a way to match these orphaned 299 US News ranks to the remaining CollegeData schools.

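One likely cause is that the same school is spelled slightly differently in each source (St. versus Saint, SUNY versus State University of New York, and so on). Let's normalize the names so they are easier to compare: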

In [3]:
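def strip(s):
    # Normalize names so that near-identical spellings compare equal.
    # regex is set explicitly: pandas 2.0+ treats patterns as literals by default.
    s = s.str.replace('SUNY', 'State University of New York', regex = False)
    s = s.str.replace('CUNY', 'City University of New York', regex = False)
    s = s.str.replace('\'s', 's', regex = False)
    s = s.str.replace(r"College|University|of|'", '', regex = True)
    s = s.str.replace(r'St\.', 'Saint', regex = True)
    s = s.str.replace(r'-|\.', ' ', regex = True)
    s = s.str.replace(r'\s+', ' ', regex = True)
    s = s.str.strip()
    return s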

collegedata_names = collegedata_df.index.get_level_values('Name')
collegedata_df['Stripped Name'] = strip(collegedata_names)

collegedata_df[['Stripped Name']].head()

| Name | State | City | Stripped Name |
| --- | --- | --- | --- |
| Bryn Athyn College | PA | Bryn Athyn | Bryn Athyn |
| Albany College of Pharmacy and Health Sciences | NY | Albany | Albany Pharmacy and Health Sciences |
| St. Joseph's College - Long Island Campus | NY | Patchogue | Saint Josephs Long Island Campus |
| Saint Anselm College | NH | Manchester | Saint Anselm |
| Saint Francis University | PA | Loretto | Saint Francis |
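To see why this kind of normalization helps, here is a reduced sketch of the same idea on made-up names. The helper `normalize` is hypothetical and differs from `strip()` in one respect: it wraps the filler words in `\b` word boundaries, so that "of" is not also removed from the middle of longer words.

```python
import pandas as pd

def normalize(names):
    """Reduced variant of strip(): expand 'St.', drop filler words, tidy spaces."""
    names = names.str.replace(r'St\.', 'Saint', regex=True)
    names = names.str.replace("'", '', regex=False)
    # \b keeps 'of' inside words such as 'Professional' intact.
    names = names.str.replace(r'\b(College|University|of)\b', '', regex=True)
    names = names.str.replace(r'-|\.', ' ', regex=True)
    names = names.str.replace(r'\s+', ' ', regex=True)
    return names.str.strip()

a = pd.Series(["St. Mary's College", 'University of Foo-Bar'])
b = pd.Series(['Saint Marys University', 'Foo Bar College'])

print(normalize(a).tolist())
print(normalize(b).tolist())
```

Both series reduce to the same two strings, which is what lets differently spelled names line up.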


We can apply `strip()` to the Name index level of usnews_df as well:

In [4]:
usnews_names = usnews_df.index.get_level_values('Name')
usnews_df['Stripped Name'] = strip(usnews_names)

And now we can apply process.extract() from the fuzzywuzzy module to each element of usnews_df's Stripped Name column. We pass collegedata_df's Stripped Name column as the choices argument, so process.extract() searches those values and returns the one that most closely matches each usnews_df Stripped Name:

In [5]:
match = lambda x: process.extract(x, collegedata_df['Stripped Name'], 
                                  limit = 1)

usnews_df['Closest Match'] = usnews_df['Stripped Name'].apply(match)

usnews_df = usnews_df.drop(columns = 'Stripped Name')
collegedata_df = collegedata_df.drop(columns = 'Stripped Name')

usnews_df[['Closest Match']].head()

| Name | State | City | Closest Match |
| --- | --- | --- | --- |
| Cooper Union | NY | New York | [(Cooper Union for the Advancement Science and... |
| Harvard University | MA | Cambridge | [(Harvard, 100, (Harvard College, MA, Cambridg... |
| California State University--Maritime Academy | CA | Vallejo | [(California Maritime Academy, 95, (California... |
| College of New Jersey | NJ | Ewing | [(The New Jersey, 95, (The College of New Jers... |
| University of South Carolina--Aiken | SC | Aiken | [(South Carolina Aiken, 100, (University of So... |
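The shape of each result, a (matched value, score, key) triple, can be illustrated without fuzzywuzzy. The sketch below swaps in the standard library's `difflib.SequenceMatcher` as the scorer, so its scores will not agree with fuzzywuzzy's; the helper `closest` and the toy `choices` dict are assumptions made for illustration only.

```python
from difflib import SequenceMatcher

def closest(query, choices):
    """Return (best_value, score, key) for a dict of {key: value} choices.

    Scores use difflib's similarity ratio scaled to 0-100,
    which is NOT the same scorer fuzzywuzzy uses.
    """
    def score(value):
        return round(100 * SequenceMatcher(None, query, value).ratio())
    key = max(choices, key=lambda k: score(choices[k]))
    return choices[key], score(choices[key]), key

# Keys mimic the (Name, State, City) index; values mimic the stripped names.
choices = {
    ('Harvard College', 'MA', 'Cambridge'): 'Harvard',
    ('The College of New Jersey', 'NJ', 'Ewing'): 'The New Jersey',
}
best = closest('New Jersey', choices)
print(best)
```

The last element of the triple is the key of the winning choice, which is how the full index of the matched row can be recovered.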


For each Stripped Name in usnews_df, process.extract() returned a list of tuples (which we limited here to just one). The tuple's first element contains the closest Stripped Name value in collegedata_df. The tuple's next element is an integer match score. The tuple's final element is another tuple representing the index in collegedata_df that corresponds to the matched Stripped Name, which is itself a three-level MultiIndex. The first element of that index tuple is the true Name of the match as recorded in collegedata_df, which is what we want to extract:

In [6]:
usnews_df['collegedata_df Name'] = \
    usnews_df['Closest Match'].str[0].str[2].str[0]

usnews_df = usnews_df.drop(columns = 'Closest Match')

usnews_df[['collegedata_df Name']].head()

| Name | State | City | collegedata_df Name |
| --- | --- | --- | --- |
| Cooper Union | NY | New York | Cooper Union for the Advancement of Science an... |
| Harvard University | MA | Cambridge | Harvard College |
| California State University--Maritime Academy | CA | Vallejo | California Maritime Academy |
| College of New Jersey | NJ | Ewing | The College of New Jersey |
| University of South Carolina--Aiken | SC | Aiken | University of South Carolina Aiken |
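The chained `.str[...]` calls work because the `.str` accessor indexes into each element of a Series, whether that element is a string, a list, or a tuple. A minimal sketch on a hand-built Series (the values mirror the output above but are typed in by hand, not produced by fuzzywuzzy):

```python
import pandas as pd

# Each element is a one-item list holding a (value, score, key) tuple.
matches = pd.Series([
    [('Harvard', 100, ('Harvard College', 'MA', 'Cambridge'))],
    [('The New Jersey', 95, ('The College of New Jersey', 'NJ', 'Ewing'))],
])

first = matches.str[0]          # unwrap the list: the (value, score, key) tuple
scores = first.str[1]           # second tuple element: the match score
names = first.str[2].str[0]     # key tuple, then its first element (Name)
cities = first.str[2].str[2]    # key tuple, then its third element (City)

print(names.tolist(), scores.tolist(), cities.tolist())
```

Keeping the score around, instead of discarding it with the rest of the tuple, makes it possible to filter out weak matches later.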


It is quite possible that several usnews_df Name values have mapped to the same collegedata_df Name. That shouldn't be a problem, though, because the upcoming 1-to-1 inner merge of usnews_df with collegedata_df uses not just the collegedata_df Name but also the State and City.

The merge will fail, however, if two schools in usnews_df share a City and State and have also matched to the same Name in collegedata_df:

In [7]:
usnews_df = usnews_df.reset_index()
usnews_df = usnews_df.rename(columns = {'Name':'usnews_df Name'})

cols = ['collegedata_df Name', 'State', 'City']
duplicates = usnews_df.duplicated(cols, keep = False)
usnews_df[duplicates]

|  | usnews_df Name | State | City | Rank | Rank Type | collegedata_df Name |
| --- | --- | --- | --- | --- | --- | --- |
| 43 | CUNY--Hunter College | NY | New York | 28 | Regional Universities North | City College of New York |
| 101 | CUNY--City College | NY | New York | 56 | Regional Universities North | City College of New York |
| 248 | Georgia Southern University--Armstrong | GA | Savannah | 114 | Regional Universities South | South College |
| 270 | South University | GA | Savannah | 114 | Regional Universities South | South College |
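The `keep = False` argument matters here: by default `duplicated()` leaves the first member of each duplicate group unflagged, which would hide half of every pair. A small sketch on made-up rows shows the difference:

```python
import pandas as pd

# Made-up rows: two US News names mapped to the same CollegeData name and city.
toy = pd.DataFrame({
    'usnews_df Name': ['A College', 'A University', 'B College'],
    'State': ['NY', 'NY', 'GA'],
    'City': ['New York', 'New York', 'Savannah'],
    'collegedata_df Name': ['A College', 'A College', 'B College'],
})
key = ['collegedata_df Name', 'State', 'City']

default_flags = toy.duplicated(key)           # first occurrence is not flagged
all_flags = toy.duplicated(key, keep=False)   # every member of a group is flagged

print(default_flags.tolist())
print(all_flags.tolist())
```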


Thankfully, there are only two pairs of duplicates. After a quick search, I can manually set the appropriate collegedata_df Name values:

In [8]:
usnews_df.loc[43, 'collegedata_df Name'] = 'Hunter College'
usnews_df.loc[248, 'collegedata_df Name'] = 'Armstrong State University'

Now we can keep the original name in the usnews_df Name column, rename collegedata_df Name to just Name, and set the index to Name, State, and City:

In [9]:
usnews_df = usnews_df.rename(columns = {'collegedata_df Name':'Name'})

cols = ['Name', 'State', 'City']
usnews_df = usnews_df.set_index(cols)

usnews_df.head()

| Name | State | City | usnews_df Name | Rank | Rank Type |
| --- | --- | --- | --- | --- | --- |
| Cooper Union for the Advancement of Science and Art | NY | New York | Cooper Union | 1 | Regional Colleges North |
| Harvard College | MA | Cambridge | Harvard University | 2 | National Universities |
| California Maritime Academy | CA | Vallejo | California State University--Maritime Academy | 3 | Regional Colleges West |
| The College of New Jersey | NJ | Ewing | College of New Jersey | 4 | Regional Universities North |
| University of South Carolina Aiken | SC | Aiken | University of South Carolina--Aiken | 6 | Regional Colleges South |


We should be good now to inner merge this usnews_df with collegedata_df, appending the results to our existing joined_df and dropping the joined values from both original dataframes:

In [10]:
result = collegedata_df.merge(usnews_df, how = 'inner', left_index = True, 
                              right_index = True, validate = '1:1')

result = result.drop(columns = 'usnews_df Name')

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
joined_df = pd.concat([joined_df, result], sort = False)

collegedata_df = collegedata_df.drop(result.index)
usnews_df = usnews_df.drop(result.index)

len(usnews_df)

73
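The same three steps (inner merge, add the hits to the running result, drop the hits from both sources) recur throughout this part, so they could be wrapped in a helper. The function below, `peel_matches`, is a hypothetical sketch and not something the notebook defines; it is shown on toy frames with a single-level index.

```python
import pandas as pd

def peel_matches(left, right, matched):
    """Inner-join on the shared index, add the hits to `matched`,
    and return (matched, left_rest, right_rest)."""
    hits = left.merge(right, how='inner', left_index=True, right_index=True,
                      validate='1:1')
    matched = pd.concat([matched, hits], sort=False)
    return matched, left.drop(hits.index), right.drop(hits.index)

# Toy data: only 'B' appears in both sources.
left = pd.DataFrame({'Size': [1, 2, 3]},
                    index=pd.Index(['A', 'B', 'C'], name='Name'))
right = pd.DataFrame({'Rank': [20, 40]},
                     index=pd.Index(['B', 'D'], name='Name'))
matched = pd.DataFrame({'Size': [9], 'Rank': [5]},
                       index=pd.Index(['Z'], name='Name'))

matched, left_rest, right_rest = peel_matches(left, right, matched)
print(list(matched.index), list(left_rest.index), list(right_rest.index))
```

Returning the leftovers explicitly avoids reassigning the source dataframes by hand after every round.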

Not bad. Let's add our original usnews_df Name values back to the dataframe index and see a bit of what remains:

In [11]:
usnews_df = usnews_df.reset_index()

usnews_df = usnews_df.rename(columns = {'Name':'collegedata_df Name'})
usnews_df = usnews_df.rename(columns = {'usnews_df Name':'Name'})

usnews_df = usnews_df.set_index(cols)
usnews_df.head(10)

| Name | State | City | collegedata_df Name | Rank | Rank Type |
| --- | --- | --- | --- | --- | --- |
| Cottey College | MO | Nevada | Ohio University Chillicothe | 7 | Regional Colleges Midwest |
| Embry-Riddle Aeronautical University | FL | Daytona Beach | Embry-Riddle Aeronautical University - Prescott | 12 | Regional Universities South |
| SUNY College of Technology--Alfred | NY | Alfred | York College | 14 | Regional Colleges North |
| SUNY Polytechnic Institute--Albany/Utica | NY | Utica | Penn State University Park | 15 | Regional Universities North |
| SUNY College of Technology--Delhi | NY | Delhi | York College | 17 | Regional Colleges North |
| Brigham Young University--Hawaii | HI | Laie Oahu | Brigham Young University - Hawaii | 18 | Regional Colleges West |
| Concordia College | NY | Bronxville | Concordia College | 19 | Regional Colleges North |
| St. Gregory's University | OK | Shawnee | St. Gregory's University | 19 | Regional Colleges West |
| Florida Memorial University | FL | Miami | Florida Memorial University | 21 | Regional Colleges South |
| St. Mary's University of San Antonio | TX | San Antonio | Saint Mary's College | 21 | Regional Universities West |


Some of the guessed collegedata_df Name values clearly point to the wrong school, and thus failed to match on the inner join. But some are extremely close, so much so that the problem may instead lie with the City field used in the join.
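One way to keep clearly wrong guesses out of later steps is to accept a match only above a similarity cutoff. The sketch below uses the standard library's `difflib.get_close_matches` in place of fuzzywuzzy; the helper `match_or_none`, the 0.85 cutoff, and the short `choices` list are all assumptions chosen for illustration.

```python
from difflib import get_close_matches

def match_or_none(query, choices, cutoff=0.85):
    """Return the best match at or above `cutoff` (0-1 scale), else None."""
    hits = get_close_matches(query, choices, n=1, cutoff=cutoff)
    return hits[0] if hits else None

choices = ['Ohio University Chillicothe', 'Concordia College', 'York College']

good = match_or_none('Concordia College', choices)
bad = match_or_none('Cottey College', choices)
print(good, bad)
```

Rows that come back as None can then be set aside for manual review instead of carrying a misleading guess.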

It's possible that some of the City values differ slightly between the two dataframes, in the same way that some Name values did. We can repeat the procedure we just used on the remaining City values:

In [12]:
collegedata_cities = collegedata_df.index.get_level_values('City')
collegedata_df['Stripped City'] = strip(collegedata_cities)

usnews_cities = usnews_df.index.get_level_values('City')
usnews_df['Stripped City'] = strip(usnews_cities)

match = lambda x: process.extract(x, collegedata_df['Stripped City'], 
                                  limit = 1)

usnews_df['Closest Match'] = usnews_df['Stripped City'].apply(match)

usnews_df = usnews_df.drop(columns = 'Stripped City')
collegedata_df = collegedata_df.drop(columns = 'Stripped City')

# The matched key is a (Name, State, City) tuple, so City is element 2.
usnews_df['collegedata_df City'] = \
    usnews_df['Closest Match'].str[0].str[2].str[2]

usnews_df = usnews_df.drop(columns = 'Closest Match')

usnews_df = usnews_df.reset_index()
usnews_df = usnews_df.rename(columns = {'City':'usnews_df City'})

cols = ['collegedata_df Name', 'State', 'collegedata_df City']
duplicates = usnews_df.duplicated(cols, keep = False)
usnews_df[duplicates]

|  | Name | State | usnews_df City | collegedata_df Name | Rank | Rank Type | collegedata_df City |
| --- | --- | --- | --- | --- | --- | --- | --- |


No duplicates! Proceeding...

In [None]:
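# Merge on the CollegeData spellings of Name and City,
# keeping the US News originals as ordinary columns.
usnews_df = usnews_df.rename(columns = {'Name':'usnews_df Name',
                                        'collegedata_df Name':'Name',
                                        'collegedata_df City':'City'})

cols = ['Name', 'State', 'City']
usnews_df = usnews_df.set_index(cols)

usnews_df.head()

result = collegedata_df.merge(usnews_df, how = 'inner', left_index = True, 
                              right_index = True, validate = '1:1')

result = result.drop(columns = ['usnews_df Name', 'usnews_df City'])

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
joined_df = pd.concat([joined_df, result], sort = False)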

collegedata_df = collegedata_df.drop(result.index)
usnews_df = usnews_df.drop(result.index)

len(usnews_df)

That method seemed to help a bit - it cut our orphaned usnews_df rows from 287 to 57. We're approaching the point where it may be more efficient to rename the remaining schools manually:

In [11]:
usnews_df.set_index('Name', inplace = True)
usnews_df.rename(index = school_rename_dict, inplace = True)
usnews_df = usnews_df[usnews_df.index.notna()]

usnews_df.set_index(['State'], append = True, inplace = True)
collegedata_df.set_index(['Name', 'State'], inplace = True)
joined = collegedata_df.join(usnews_df.drop(columns = 'City'), how = 'inner')

df = pd.concat([df, joined.set_index('City', append = True)], sort = False)
collegedata_df.drop(joined.index, inplace = True)
usnews_df.drop(joined.index, inplace = True)

has_ranks = df['Rank'].notna()
print("{} non-null Rank values in df.".format(has_ranks.sum()))
print("{} rows in collegedata_df, {} rows in collegedata_missing_df,\n"
      "{} rows in usnews_df, {} rows in df.".format(\
            len(collegedata_df), len(collegedata_missing_df),\
            len(usnews_df), len(df)))

1351 non-null Rank values in df.
624 rows in collegedata_df, 54 rows in collegedata_missing_df,
7 rows in usnews_df, 1351 rows in df.
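The contents of school_rename_dict are not shown here, but the `notna()` filter suggests that some names map to a missing value so they can be dropped. A sketch of that pattern, with a made-up `rename_map` standing in for the real dictionary:

```python
import numpy as np
import pandas as pd

# Hypothetical mapping: US News name -> CollegeData name, or NaN to discard.
rename_map = {
    'Foo State University--Main': 'Foo State University',
    'Closed College': np.nan,
}
ranks = pd.DataFrame(
    {'Rank': [5, 9, 12]},
    index=pd.Index(['Foo State University--Main', 'Closed College',
                    'Bar College'], name='Name'),
)

renamed = ranks.rename(index=rename_map)     # names not in the map are kept as-is
renamed = renamed[renamed.index.notna()]     # rows renamed to NaN are dropped

print(renamed)
```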


We can try to join the remaining 7 rows from usnews_df on the rows we put into collegedata_missing_df that were missing City and/or State values:

In [12]:
usnews_df.reset_index(inplace = True)
usnews_df.set_index('Name', inplace = True)
collegedata_missing_df.set_index('Name', inplace = True)

joined = usnews_df.join(collegedata_missing_df\
                        .drop(columns = ['State', 'City']), how = 'inner')

df = pd.concat([df, joined.set_index(['State', 'City'], append = True)],
               sort = False)

usnews_df.drop(joined.index, inplace = True)

usnews_df.reset_index(inplace = True)

has_ranks = df['Rank'].notna()
print("{} non-null Rank values in df.".format(has_ranks.sum()))
print("{} rows in collegedata_df, {} rows in collegedata_missing_df,\n"
      "{} rows in usnews_df, {} rows in df.".format(\
            len(collegedata_df), len(collegedata_missing_df),\
            len(usnews_df), len(df)))

1356 non-null Rank values in df.
624 rows in collegedata_df, 54 rows in collegedata_missing_df,
2 rows in usnews_df, 1356 rows in df.


There are only two schools remaining in usnews_df:

In [13]:
usnews_df

|  | Name | State | City | Rank | Rank Type |
| --- | --- | --- | --- | --- | --- |
| 0 | Concordia University | NE | Seward | 38 | Regional Universities Midwest |
| 1 | Concordia University | CA | Irvine | 41 | Regional Universities West |


These two share the same name in usnews_df, but their corresponding rows in collegedata_df are named differently. We'll rename them to match and then join them to df:

In [14]:
usnews_df.loc[0, 'Name'] = 'Concordia University Nebraska'
usnews_df.loc[1, 'Name'] = 'Concordia University Irvine'

usnews_df.set_index(['Name', 'State', 'City'], inplace = True)
collegedata_df.set_index('City', append = True, inplace = True)

joined = collegedata_df.join(usnews_df, how = 'inner')

df = pd.concat([df, joined], sort = False)

We'll also want to add our remaining collegedata_missing_df rows back to df, even 

In [15]:
len(df)



1358

In [19]:
df[['Name','State','City']].duplicated().sum()

271
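If Name, State, and City are index levels and not columns at this point, the same check can be run on the index directly. A sketch on a toy frame (the rows are made up):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('A College', 'NY', 'Albany'),
     ('A College', 'NY', 'Albany'),
     ('B College', 'GA', 'Savannah')],
    names=['Name', 'State', 'City'])
toy = pd.DataFrame({'Rank': [1, 1, 2]}, index=idx)

n_dupes = toy.index.duplicated().sum()                  # repeats after the first
deduped = toy[~toy.index.duplicated(keep='first')]      # keep one row per key

print(n_dupes, len(deduped))
```

Whether dropping all but the first row is appropriate depends on why the duplicates arose, so it is worth inspecting them before deduplicating.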

In [20]:
len(collegedata_missing_df)

54