# Part 3 - Fuzzy Merging

## Inspection

Let's import and inspect the cleaned CollegeData and cleaned USNews rankings dataframes that we want to merge:

In [1]:
import numpy as np
import pandas as pd
from fuzzywuzzy import process
from collegedata_names import usnews_rename_names
from collegedata_names import usnews_rename_cities

COLLEGEDATA_CLEAN_CSV_PATH = 'data/collegedata_clean.csv'
USNEWS_CLEAN_CSV_PATH = 'data/usnews_clean.csv'
JOINED_CSV_PATH = 'data/joined.csv'

cols = ['Name', 'State', 'City']

collegedata_df = pd.read_csv(COLLEGEDATA_CLEAN_CSV_PATH, index_col = cols)
usnews_df = pd.read_csv(USNEWS_CLEAN_CSV_PATH, index_col = cols)

collegedata_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2028 entries, (Bryn Athyn College, PA, Bryn Athyn) to (Fashion Institute of Design and Merchandising: Los Angeles, CA, Los Angeles)
Columns: 252 entries, SchoolId to Zip
dtypes: float64(157), int64(1), object(94)
memory usage: 3.9+ MB


Our scraped collegedata_df contains entries for 2028 schools with 252 columns, plus the 3 columns Name, State, and City used as its index. Looking at our scraped usnews_df:

In [2]:
usnews_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 1381 entries, (Calvin College, MI, Grand Rapids) to (Youngstown State University, OH, Youngstown)
Data columns (total 2 columns):
Rank         1381 non-null int64
Rank Type    1381 non-null object
dtypes: int64(1), object(1)
memory usage: 46.6+ KB


There are 1381 schools US News ranked, with 2 columns - one for the Rank Type and one for the Rank number itself - plus the 3 columns Name, State, and City used as its index.

We will try to join these together into a single combined dataframe, joined_df, that when we're finished should have 2028 entries and 252 + 2 = 254 columns, plus the 3 columns Name, State, and City as its index.

---

## Simple inner join

Since we have fewer entries in usnews_df than in collegedata_df, let's try a simple inner join of usnews_df to collegedata_df on their common Name, State, and City index to create the new dataframe joined_df. Then we can drop the successfully joined entries from both usnews_df and collegedata_df, and see how many unjoined records remain in usnews_df:

In [3]:
joined_df = collegedata_df.merge(usnews_df, how = 'inner', left_index = True, 
                                 right_index = True, validate = '1:1')

collegedata_df = collegedata_df.drop(joined_df.index)
usnews_df = usnews_df.drop(joined_df.index)

len(usnews_df)

299

Now we need to find a way to match these 299 orphaned usnews_df entries to the remaining collegedata_df entries.
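As a minimal sketch of the join-then-drop pattern used above, here is the same idea on two toy dataframes. The school names and values are made up for illustration and do not come from either dataset.

```python
import pandas as pd

cols = ['Name', 'State', 'City']

# Hypothetical stand-in for collegedata_df.
left = pd.DataFrame({
    'Name': ['Alpha College', 'Beta University'],
    'State': ['NY', 'CA'],
    'City': ['Albany', 'Irvine'],
    'Zip': ['12208', '92612'],
}).set_index(cols)

# Hypothetical stand-in for usnews_df; the second name is spelled differently.
right = pd.DataFrame({
    'Name': ['Alpha College', 'Beta Univ.'],
    'State': ['NY', 'CA'],
    'City': ['Albany', 'Irvine'],
    'Rank': [1, 2],
}).set_index(cols)

# Inner join on the shared three-level index; only exact matches survive.
joined = left.merge(right, how='inner', left_index=True,
                    right_index=True, validate='1:1')

# Drop the matched rows so only the orphans remain on each side.
left_rest = left.drop(joined.index)
right_rest = right.drop(joined.index)

print(len(joined), len(right_rest))
```

Only 'Alpha College' matches exactly, so 'Beta Univ.' is left behind as an orphan, which is the situation the rest of this part deals with.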

---

## Fuzzy matching with fuzzywuzzy

It's possible that for some entries, the City and State indexes are matching on the join but for some reason the Name is slightly off, perhaps due to formatting differences in how US News and CollegeData represent the Name.

We create a function `simplify()` that takes the Name index and returns a 'simplified' Name string for each entry. It expands commonly used acronyms like SUNY, CUNY, and A&M; removes apostrophes, periods, hyphens, and other formatting differences; expands abbreviations for 'Saint'; and removes the words 'College', 'University', and 'of'. Some schools have recently changed from 'College' to 'University' or vice versa, which could lead to discrepancies between the two dataframes.

We will use `simplify()` on the Name index in both dataframes and append the result as a new temporary 'Simplified Name' column in each. Let's take a look at what the Simplified Name looks like for the first few entries in collegedata_df:

In [4]:
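def simplify(s):
    # Expand common acronyms.
    s = s.str.replace('SUNY', 'State University of New York')
    s = s.str.replace('CUNY', 'City University of New York')
    s = s.str.replace('A&M', 'Agricultural and Mechanical')
    # Drop possessives, filler words, and stray apostrophes.
    s = s.str.replace("'s", 's')
    s = s.str.replace(r"College|University|of|'", '', regex = True)
    # Expand 'St.'/'Ste.' to 'Saint' (no trailing space, as in the output below).
    s = s.str.replace(r'Ste?\.? ', 'Saint', regex = True)
    # Turn hyphens and periods into spaces, then collapse whitespace.
    s = s.str.replace(r'-|\.', ' ', regex = True)
    s = s.str.replace(r'\s+', ' ', regex = True)
    s = s.str.strip()
    return s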

collegedata_names = collegedata_df.index.get_level_values('Name')
collegedata_df['Simplified Name'] = simplify(collegedata_names)

usnews_names = usnews_df.index.get_level_values('Name')
usnews_df['Simplified Name'] = simplify(usnews_names)

collegedata_df[['Simplified Name']].head()

| Name | State | City | Simplified Name |
| --- | --- | --- | --- |
| Bryn Athyn College | PA | Bryn Athyn | Bryn Athyn |
| Albany College of Pharmacy and Health Sciences | NY | Albany | Albany Pharmacy and Health Sciences |
| St. Joseph's College - Long Island Campus | NY | Patchogue | SaintJosephs Long Island Campus |
| Saint Anselm College | NH | Manchester | Saint Anselm |
| Saint Francis University | PA | Loretto | Saint Francis |


Similarly, there now exists a Simplified Name column in usnews_df.
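To see what each replacement step does, here is a plain-Python sketch that mirrors the same steps on a single string with `re.sub`. The helper `simplify_one` is hypothetical and only for illustration; the notebook itself works on the whole index at once through the `.str` accessor.

```python
import re

def simplify_one(name):
    """Hypothetical single-string version of the simplification steps."""
    # Expand common acronyms.
    name = name.replace('SUNY', 'State University of New York')
    name = name.replace('CUNY', 'City University of New York')
    name = name.replace('A&M', 'Agricultural and Mechanical')
    # Drop possessives, filler words, and stray apostrophes.
    name = name.replace("'s", 's')
    name = re.sub(r"College|University|of|'", '', name)
    # Expand 'St. ' / 'Ste. ' to 'Saint' (note: no trailing space).
    name = re.sub(r'Ste?\.? ', 'Saint', name)
    # Turn hyphens and periods into spaces, then collapse whitespace.
    name = re.sub(r'-|\.', ' ', name)
    name = re.sub(r'\s+', ' ', name)
    return name.strip()

print(simplify_one("St. Joseph's College - Long Island Campus"))
# SaintJosephs Long Island Campus
print(simplify_one('Albany College of Pharmacy and Health Sciences'))
# Albany Pharmacy and Health Sciences
```

Because the 'Saint' replacement has no trailing space, 'St. Joseph' becomes 'SaintJoseph', while a name already spelled 'Saint Joseph' keeps its space. The fuzzy matcher in the next step has to absorb that difference.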

We will now apply `process.extract()` from the fuzzywuzzy module to each element of usnews_df's Simplified Name column. `process.extract()` takes two required inputs: the first is the string for which we'd like to find a potential match (a usnews_df Simplified Name in our case), and the second is the collection of strings to consider (the entire collegedata_df Simplified Name column for us). An optional `limit` parameter caps how many results are returned. The function returns its results as a list.
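The idea of scoring one query against a keyed collection of choices can be sketched with `difflib` from the standard library. This is only a rough stand-in: it does not use fuzzywuzzy's scoring, so the scores will differ, and `closest_match` is a hypothetical helper. It does return the same shape of result, a (match, score, key) tuple.

```python
from difflib import SequenceMatcher

def closest_match(query, choices):
    """Return (matched string, score 0-100, key) for the best-scoring choice."""
    best = None
    for key, candidate in choices.items():
        # ratio() is a similarity in [0, 1]; scale it to 0-100.
        score = round(100 * SequenceMatcher(None, query, candidate).ratio())
        if best is None or score > best[1]:
            best = (candidate, score, key)
    return best

# Keys play the role of the (Name, State, City) index; values are simplified names.
choices = {
    ('Harvard College', 'MA', 'Cambridge'): 'Harvard',
    ('The College of New Jersey', 'NJ', 'Ewing'): 'The New Jersey',
}

print(closest_match('New Jersey', choices))
print(closest_match('Harvard', choices))
```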

In our case, we will limit `process.extract()` to return just one result - the closest match. The result is a tuple of three elements: the first is the closest matched Simplified Name, the second is a match score, the third is the index of collegedata_df corresponding to that match: 

In [5]:
fuzzymatch = lambda string: process.extract(string,
                                            collegedata_df['Simplified Name'],
                                            limit = 1)

usnews_df['Closest Match'] = usnews_df['Simplified Name'].apply(fuzzymatch)

usnews_df = usnews_df.drop(columns = 'Simplified Name')
collegedata_df = collegedata_df.drop(columns = 'Simplified Name')

usnews_df[['Closest Match']].head()

| Name | State | City | Closest Match |
| --- | --- | --- | --- |
| Cooper Union | NY | New York | [(Cooper Union for the Advancement Science and... |
| Harvard University | MA | Cambridge | [(Harvard, 100, (Harvard College, MA, Cambridg... |
| California State University--Maritime Academy | CA | Vallejo | [(California Maritime Academy, 95, (California... |
| College of New Jersey | NJ | Ewing | [(The New Jersey, 95, (The College of New Jers... |
| University of South Carolina--Aiken | SC | Aiken | [(South Carolina Aiken, 100, (University of So... |


For each Simplified Name in usnews_df, `process.extract()` returned a list of result tuples. Since we limited the results to one, this is a list containing a single result. That result is a tuple of three elements, and its third element is the index of collegedata_df corresponding to the closest match.
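The nested structure can be unpacked step by step with the `.str` accessor, which also indexes into lists and tuples stored in an object column. Here is a small sketch on two hand-written results shaped like the ones above; the variable names are only for illustration.

```python
import pandas as pd

# Each cell holds a one-element list: [(simplified match, score, index tuple)].
matches = pd.Series([
    [('Harvard', 100, ('Harvard College', 'MA', 'Cambridge'))],
    [('The New Jersey', 95, ('The College of New Jersey', 'NJ', 'Ewing'))],
])

first_result = matches.str[0]       # the single (match, score, index) tuple
index_tuple = first_result.str[2]   # the (Name, State, City) index tuple
names = index_tuple.str[0]          # the true Name of the match

print(names.tolist())
```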

Since collegedata_df is indexed by three columns - Name, State, and City - that index is represented by a tuple of three values, one for each index value. The first element of that index tuple is the true collegedata_df Name of the match - we want to extract that and save it as a new column in usnews_df:

In [6]:
usnews_df['collegedata_df Name'] = \
    usnews_df['Closest Match'].str[0].str[2].str[0]

usnews_df = usnews_df.drop(columns = 'Closest Match')

usnews_df[['collegedata_df Name']].head()

| Name | State | City | collegedata_df Name |
| --- | --- | --- | --- |
| Cooper Union | NY | New York | Cooper Union for the Advancement of Science an... |
| Harvard University | MA | Cambridge | Harvard College |
| California State University--Maritime Academy | CA | Vallejo | California Maritime Academy |
| College of New Jersey | NJ | Ewing | The College of New Jersey |
| University of South Carolina--Aiken | SC | Aiken | University of South Carolina Aiken |


We can see in the first few entries that each entry in usnews_df is now associated with a fuzzy matched Name from collegedata_df.

Though these first few entries look properly matched, it's quite possible that the fuzzy match has made some incorrect matches. This shouldn't be a problem, though: when we later replace the usnews_df Name index with the matched collegedata_df Name column and inner merge usnews_df to collegedata_df, we will merge not just on the Name but also on the State and City indexes. If the fuzzy match was wrong, all three are unlikely to line up.
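A toy example of why merging on all three index levels filters out bad fuzzy matches is sketched below. The schools here are hypothetical: the first US News row has been given a Name that exists in the other frame, but for a school in a different state and city.

```python
import pandas as pd

cols = ['Name', 'State', 'City']

# Hypothetical CollegeData side.
collegedata = pd.DataFrame({
    'Name': ['Example College', 'Sample University'],
    'State': ['TN', 'MA'],
    'City': ['Knoxville', 'Boston'],
    'SchoolId': [1, 2],
}).set_index(cols)

# Hypothetical US News side after fuzzy matching. The first row was matched
# to the wrong school: same Name, but a different State and City.
usnews = pd.DataFrame({
    'Name': ['Example College', 'Sample University'],
    'State': ['GA', 'MA'],
    'City': ['Savannah', 'Boston'],
    'Rank': [114, 2],
}).set_index(cols)

# The inner merge needs Name, State, and City to agree, so the bad match drops out.
result = collegedata.merge(usnews, how='inner',
                           left_index=True, right_index=True)

print(result.index.tolist())
```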

This will fail, however, if there was an extraordinary coincidence and two different schools in usnews_df that happen to be in the same state and city were fuzzy matched to the same collegedata_df Name. Let's check to see if there are any duplicates like this:

In [7]:
usnews_df = usnews_df.reset_index()
usnews_df = usnews_df.rename(columns = {'Name':'usnews_df Name'})

cols = ['collegedata_df Name', 'State', 'City']
duplicates = usnews_df.duplicated(cols, keep = False)
usnews_df[duplicates]

| | usnews_df Name | State | City | Rank | Rank Type | collegedata_df Name |
| --- | --- | --- | --- | --- | --- | --- |
| 43 | CUNY--Hunter College | NY | New York | 28 | Regional Universities North | City College of New York |
| 101 | CUNY--City College | NY | New York | 56 | Regional Universities North | City College of New York |
| 248 | Georgia Southern University--Armstrong | GA | Savannah | 114 | Regional Universities South | South College |
| 270 | South University | GA | Savannah | 114 | Regional Universities South | South College |


Good thing we checked for duplicates - but thankfully there are only two pairs. After a quick search, I can manually set the appropriate collegedata_df Name values. After doing that, let's set the index of usnews_df to use the matched collegedata_df Name along with the existing State and City:

In [8]:
usnews_df.loc[43, 'collegedata_df Name'] = 'Hunter College'
usnews_df.loc[248, 'collegedata_df Name'] = 'Armstrong State University'

usnews_df = usnews_df.rename(columns = {'collegedata_df Name':'Name'})

cols = ['Name', 'State', 'City']
usnews_df = usnews_df.set_index(cols)

usnews_df.head()

| Name | State | City | usnews_df Name | Rank | Rank Type |
| --- | --- | --- | --- | --- | --- |
| Cooper Union for the Advancement of Science and Art | NY | New York | Cooper Union | 1 | Regional Colleges North |
| Harvard College | MA | Cambridge | Harvard University | 2 | National Universities |
| California Maritime Academy | CA | Vallejo | California State University--Maritime Academy | 3 | Regional Colleges West |
| The College of New Jersey | NJ | Ewing | College of New Jersey | 4 | Regional Universities North |
| University of South Carolina Aiken | SC | Aiken | University of South Carolina--Aiken | 6 | Regional Colleges South |


We should now be able to inner merge this usnews_df with collegedata_df. We'll want to drop the usnews_df Name column from the results before appending them to our existing joined_df. Finally, we will drop the successfully joined entries from both original dataframes and take a look at how many unjoined entries remain in usnews_df:

In [9]:
result = collegedata_df.merge(usnews_df, how = 'inner', left_index = True, 
                              right_index = True, validate = '1:1')

result = result.drop(columns = 'usnews_df Name')

joined_df = pd.concat([joined_df, result], sort = False)

collegedata_df = collegedata_df.drop(result.index)
usnews_df = usnews_df.drop(result.index)

usnews_df = usnews_df.reset_index()

usnews_df = usnews_df.drop(columns = 'Name')
usnews_df = usnews_df.rename(columns = {'usnews_df Name':'Name'})

usnews_df = usnews_df.set_index(cols)

len(usnews_df)

73

There are 73 entries remaining in usnews_df that did not line up with collegedata_df on the matched Name, City, and State indexes. 

---

## Unaligned cities

It's possible that some of the entries' Name values do in fact align, but their City values differ slightly in formatting in the same way that some Name values did. 

We can look at all the remaining rows in usnews_df where the original usnews_df Name index actually does align with the collegedata_df Name, but for whatever reason isn't aligning with the State and City indexes:

In [10]:
usnews_df = usnews_df.reset_index().set_index('Name')
collegedata_df = collegedata_df.reset_index().set_index('Name')

mask = usnews_df.index.isin(collegedata_df.index)
matched_names = usnews_df[mask].index

usnews_df_matches = usnews_df.loc[matched_names, ['City', 'State']]
collegedata_df_matches = collegedata_df.loc[matched_names, ['City', 'State']]

matches_df = usnews_df_matches.join(collegedata_df_matches,
                                    lsuffix = ' usnews_df',
                                    rsuffix = ' collegedata_df')

matches_df

| Name | City usnews_df | State usnews_df | City collegedata_df | State collegedata_df |
| --- | --- | --- | --- | --- |
| St. Gregory's University | Shawnee | OK | NaN | NaN |
| Florida Memorial University | Miami | FL | Miami-Dade | FL |
| Saint Mary-of-the-Woods College | St. Mary-of-the-Woods | IN | Saint Mary of the Woods | IN |
| University of Richmond | Univ. of Richmond | VA | University of Richmond | VA |
| Mount Ida College | Newton | MA | NaN | NaN |
| Lake Superior State University | Sault Ste. Marie | MI | Sault Sainte Marie | MI |
| Bard College | Annandale on Hudson | NY | Annandale-on-Hudson | NY |
| Grace University | Omaha | NE | NaN | NaN |
| Philadelphia University | Philadelphia | PA | NaN | NaN |
| St. Mary's College of Maryland | St. Marys City | MD | St. Mary's City | MD |


Since many of the city names are only slightly off in their formatting - like 'Sault Ste. Marie' versus 'Sault Sainte Marie' - these rows are indeed referring to the same school in both dataframes. One match with completely different City names is College of Mount St. Vincent - but Riverdale is actually a neighborhood of the Bronx. 

This leaves a few matches that have City/State values in usnews_df, but are missing (NaN) from collegedata_df's information. I double checked collegedata.com to make sure there was no malfunction with the scraper, and it turns out the website is indeed missing values for these schools. But they are actually referring to the same schools.

So, all of these are actual matches. We can overwrite the collegedata_df City and State values with those from usnews_df for these rows and then join and update as we've done previously:

In [11]:
collegedata_df.loc[matched_names, ['City', 'State']] = \
    usnews_df.loc[matched_names, ['City', 'State']]

collegedata_df = collegedata_df.set_index(['State', 'City'], append = True)
usnews_df = usnews_df.set_index(['State', 'City'], append = True)

result = collegedata_df.merge(usnews_df, how = 'inner', left_index = True, 
                              right_index = True, validate = '1:1')

joined_df = pd.concat([joined_df, result], sort = False)

collegedata_df = collegedata_df.drop(result.index)
usnews_df = usnews_df.drop(result.index)

len(usnews_df)

55
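The label-aligned overwrite in the cell above can be sketched on toy data. The frames below are small stand-ins that borrow two rows from the table of mismatches plus one made-up row; `left` and `right` are names introduced only for this example.

```python
import pandas as pd

# Stand-in for collegedata_df, indexed by Name only.
left = pd.DataFrame(
    {'City': ['Sault Sainte Marie', None, 'Springfield'],
     'State': ['MI', None, 'IL']},
    index=pd.Index(['Lake Superior State University',
                    'Mount Ida College',
                    'Other College'], name='Name'))

# Stand-in for usnews_df, also indexed by Name.
right = pd.DataFrame(
    {'City': ['Sault Ste. Marie', 'Newton'],
     'State': ['MI', 'MA']},
    index=pd.Index(['Lake Superior State University',
                    'Mount Ida College'], name='Name'))

matched_names = right.index

# Assigning a dataframe through .loc aligns on row and column labels,
# so only the matched rows are overwritten.
left.loc[matched_names, ['City', 'State']] = \
    right.loc[matched_names, ['City', 'State']]

print(left)
```

After the assignment, both matched rows carry the US News spelling of City and State, so they line up exactly on the three-level index, while 'Other College' is untouched.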

With only 55 of the 1381 rankings left, and no more matching or even closely matching Names between the dataframes, this could be a good place to stop if we were short on time.

---

## Manually aligning the remaining entries

Since time is not a problem for me at the moment, I can do some searching around online to find out what's going on with these schools.

First, it turns out there is a pair of duplicate names causing a problem:

In [12]:
usnews_df = usnews_df.reset_index()
dups = usnews_df.duplicated('Name', keep = False)
usnews_df[dups]

| | Name | State | City | Rank | Rank Type |
| --- | --- | --- | --- | --- | --- |
| 17 | Concordia University | NE | Seward | 38 | Regional Universities Midwest |
| 18 | Concordia University | CA | Irvine | 41 | Regional Universities West |


These schools are in collegedata_df under slightly different names, which we can quickly rename:

In [13]:
usnews_df.loc[17, 'Name'] = 'Concordia University Nebraska'
usnews_df.loc[18, 'Name'] = 'Concordia University Irvine'

usnews_df = usnews_df.set_index(cols)

It turns out that many of the remaining schools in usnews_df have changed their names recently, and CollegeData and its Name column are not as up to date as US News. From my research, I put together dictionaries to rename the usnews_df Name and City values, then joined and appended the results as before. A few entries were still left:

In [14]:
# Rename usnews_df Name and City values from the imported dictionaries.
usnews_df = usnews_df.rename(usnews_rename_names, level = 'Name')
usnews_df = usnews_df.rename(usnews_rename_cities, level = 'City')

result = collegedata_df.merge(usnews_df, how = 'inner', left_index = True, 
                              right_index = True, validate = '1:1')

joined_df = pd.concat([joined_df, result], sort = False)

collegedata_df = collegedata_df.drop(result.index)
usnews_df = usnews_df.drop(result.index)

len(usnews_df)

22

CollegeData.com does not have any information for these remaining 22 schools in any way I could find. Since we don't have any data from these schools, this is as far as we can take usnews_df.
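The level-specific rename used in the cell above can be sketched as follows. The actual contents of `usnews_rename_names` and `usnews_rename_cities` are not shown here, so the dictionaries and school names below are purely hypothetical placeholders.

```python
import pandas as pd

cols = ['Name', 'State', 'City']

df = pd.DataFrame({
    'Name': ['New Name University', 'Unchanged College'],
    'State': ['PA', 'NY'],
    'City': ['St. Somewhere', 'Albany'],
    'Rank': [10, 20],
}).set_index(cols)

# Hypothetical mappings: label as it appears now -> label to align with.
rename_names = {'New Name University': 'Old Name College'}
rename_cities = {'St. Somewhere': 'Saint Somewhere'}

# level= restricts each mapping to one level of the MultiIndex;
# labels that are not in the dictionary are left as they are.
df = df.rename(index=rename_names, level='Name')
df = df.rename(index=rename_cities, level='City')

print(df.index.tolist())
```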

---

## Summary

Let's look at our joined_df:

In [15]:
len(joined_df)

1359

Out of our 1381 rows in usnews_df, all but 22 ended up in joined_df. Our unranked schools remain in collegedata_df:

In [16]:
len(collegedata_df)

669

We can append these remaining schools to joined_df and set their 'Rank Type' to 'Unranked School':

In [17]:
joined_df = pd.concat([joined_df, collegedata_df], sort = True)
joined_df['Rank Type'] = joined_df['Rank Type'].fillna('Unranked School')
joined_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2028 entries, (Abilene Christian University, TX, Abilene) to (Fashion Institute of Design and Merchandising: Los Angeles, CA, Los Angeles)
Columns: 254 entries, 2016 Graduates Who Took Out Loans to Zip
dtypes: float64(158), int64(1), object(95)
memory usage: 4.0+ MB


Our joined_df looks like we predicted it should at the start - it has all 2028 schools scraped from CollegeData and 254 columns, including the Rank and Rank Type joined from US News.
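A toy version of this final step is sketched below with `pd.concat`, which also works on pandas versions where `DataFrame.append` is no longer available. The two one-row frames are hypothetical stand-ins for the ranked and unranked schools.

```python
import pandas as pd

# Hypothetical ranked school, already joined with its ranking columns.
ranked = pd.DataFrame(
    {'Zip': ['10001'], 'Rank': [1], 'Rank Type': ['National Universities']},
    index=pd.Index(['Ranked Example'], name='Name'))

# Hypothetical unranked school: it has no Rank or Rank Type columns.
unranked = pd.DataFrame(
    {'Zip': ['20002']},
    index=pd.Index(['Unranked Example'], name='Name'))

# Stack the rows; sort=True orders the combined columns alphabetically.
combined = pd.concat([ranked, unranked], sort=True)

# Missing Rank Type values mark the schools that were never ranked.
combined['Rank Type'] = combined['Rank Type'].fillna('Unranked School')

print(combined)
print(combined.dtypes)
```

Rank is missing for the unranked row, so the column is stored as float64 after the concatenation. That is consistent with the `info()` output above, where the float64 count rises from 157 to 158 and Rank Type accounts for the extra object column.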

To end, we'll export this joined_df to .csv:

In [18]:
joined_df.to_csv(JOINED_CSV_PATH)