## Part 3 - Merging

Let's load the cleaned CollegeData and US News rankings dataframes:

In [1]:
import numpy as np
import pandas as pd
from fuzzywuzzy import process
from collegedata_names import school_rename_dict

COLLEGEDATA_CLEAN_CSV_PATH = 'data/collegedata_clean.csv'
USNEWS_CLEAN_CSV_PATH = 'data/usnews_clean.csv'

collegedata_df = pd.read_csv(COLLEGEDATA_CLEAN_CSV_PATH, index_col = None)
usnews_df = pd.read_csv(USNEWS_CLEAN_CSV_PATH, index_col = None)

print("{} rows in collegedata_df, {} rows in usnews_df."\
      .format(len(collegedata_df), len(usnews_df)))

2028 rows in collegedata_df, 1381 rows in usnews_df.


Since we have fewer ranks than schools, let's try left joining usnews_df to collegedata_df on the common Name, State, and City columns to create a new dataframe 'df'. Before we do that, let's check whether any schools in either dataframe are missing these values:

In [2]:
cols = ['Name', 'State', 'City']

collegedata_missing = collegedata_df[cols].isna()
collegedata_missing_df = collegedata_df[collegedata_missing.any(axis = 1)]
collegedata_df.drop(collegedata_missing_df.index, inplace = True)

usnews_missing = usnews_df[cols].isna()
usnews_missing_df = usnews_df[usnews_missing.any(axis = 1)]
usnews_df.drop(usnews_missing_df.index, inplace = True)

collegedata_df.set_index(cols, inplace = True)
usnews_df.set_index(cols, inplace = True)

print("{} rows in collegedata_missing_df, {} rows in usnews_missing_df."\
      .format(len(collegedata_missing_df), len(usnews_missing_df)))

print("{} rows in collegedata_df, {} rows in usnews_df."\
      .format(len(collegedata_df), len(usnews_df)))

54 rows in collegedata_missing_df, 0 rows in usnews_missing_df.
1974 rows in collegedata_df, 1381 rows in usnews_df.
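
The masking step above can be illustrated on a toy frame. This is a minimal sketch: the school names and values are made up, and `toy`, `incomplete`, and `complete` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Toy data; the schools and values here are invented for illustration.
toy = pd.DataFrame({
    'Name':  ['Alpha College', 'Beta University', 'Gamma Institute'],
    'State': ['NY', None, 'CA'],
    'City':  ['Albany', 'Boston', None],
})
key_cols = ['Name', 'State', 'City']

# True for every row in which at least one key column is missing.
missing_mask = toy[key_cols].isna().any(axis=1)

incomplete = toy[missing_mask]   # set aside, like collegedata_missing_df
complete = toy[~missing_mask]    # safe to index by key_cols

print(len(incomplete), len(complete))
```

Selecting with the boolean mask, rather than dropping in place, leaves the original frame untouched, which can make the step easier to re-run.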


Thankfully, usnews_df had no missing data. The 54 CollegeData rows with a missing Name, State, or City are set aside in collegedata_missing_df. Now let's left join usnews_df to collegedata_df:

In [3]:
df = collegedata_df.join(usnews_df)
has_ranks = df['Rank'].notna()
print("{} non-null Rank values in df.".format(has_ranks.sum()))

1082 non-null Rank values in df.
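
An alternative way to count matches, sketched here on made-up data, is `DataFrame.merge` with `indicator=True`, which adds a `_merge` column saying whether each row was found in both frames or only on the left. The frames `schools` and `ranks` are hypothetical stand-ins for collegedata_df and usnews_df.

```python
import pandas as pd

# Hypothetical stand-ins for collegedata_df and usnews_df.
schools = pd.DataFrame({
    'Name':  ['Alpha College', 'Beta University', 'Gamma Institute'],
    'State': ['NY', 'MA', 'CA'],
    'City':  ['Albany', 'Boston', 'Irvine'],
})
ranks = pd.DataFrame({
    'Name':  ['Alpha College', 'Beta Univ.'],   # second name is spelled differently
    'State': ['NY', 'MA'],
    'City':  ['Albany', 'Boston'],
    'Rank':  [1, 2],
})

# Left join on the three key columns; '_merge' records where each row came from.
merged = schools.merge(ranks, on=['Name', 'State', 'City'],
                       how='left', indicator=True)

matched = int((merged['_merge'] == 'both').sum())
print(matched, 'of', len(ranks), 'ranks matched')
```

Only the exactly matching school picks up a rank; the differently spelled one is left unmatched, which is the situation the rest of this part works through.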


This worked reasonably well: 1082 of the 1381 US News rows matched. But a number of Name, State, City values from usnews_df still failed to match those in collegedata_df.

Let's remove the rows that matched from both starting dataframes:

In [4]:
collegedata_df.drop(df[has_ranks].index, inplace = True)
usnews_df.drop(df.index, errors = 'ignore', inplace = True)
print("{} rows in collegedata_df, {} rows in collegedata_missing_df, "
      "{} rows in usnews_df.".format(\
            len(collegedata_df), len(collegedata_missing_df), len(usnews_df)))

892 rows in collegedata_df, 54 rows in collegedata_missing_df, 299 rows in usnews_df.


Now we need to find a way to match these orphaned 299 US News ranks to the remaining CollegeData schools.

We can look to see if maybe the Name and State match but for some reason the City is off:

In [5]:
collegedata_df.reset_index('City', inplace = True)
usnews_df.reset_index('City', inplace = True)
diff_cities = usnews_df.index.isin(collegedata_df.index)
diff_cities.sum()

12
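
To make the check above concrete, here is a small sketch of how `isin` behaves on a two-level (Name, State) index. The index values are invented; `left` and `right` are hypothetical names.

```python
import pandas as pd

# Two-level indexes standing in for collegedata_df.index and usnews_df.index
# after City has been moved out of the index.
left = pd.MultiIndex.from_tuples(
    [('Alpha College', 'NY'), ('Beta University', 'MA')],
    names=['Name', 'State'])
right = pd.MultiIndex.from_tuples(
    [('Alpha College', 'NY'), ('Alpha College', 'CA')],
    names=['Name', 'State'])

# An entry counts as present only if the whole (Name, State) tuple matches.
mask = right.isin(left)
print(mask.sum())
```

Because the comparison is on whole tuples, a school with the same name in a different state is not treated as a match.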

Let's see the actual City values for these 12 cases to make sure that nothing is terribly off:

In [6]:
s1 = usnews_df.loc[diff_cities, 'City']
s2 = collegedata_df.loc[s1.index, 'City']
s1.name, s2.name = 'US News City', 'CollegeData City'
pd.concat([s2, s1], axis = 1)

Name,State,CollegeData City,US News City
Florida Memorial University,FL,Miami-Dade,Miami
Saint Mary-of-the-Woods College,IN,Saint Mary of the Woods,St. Mary-of-the-Woods
University of Richmond,VA,University of Richmond,Univ. of Richmond
Lake Superior State University,MI,Sault Sainte Marie,Sault Ste. Marie
Bard College,NY,Annandale-on-Hudson,Annandale on Hudson
St. Mary's College of Maryland,MD,St. Mary's City,St. Marys City
Auburn University,AL,Auburn University,Auburn
College of Mount St. Vincent,NY,Riverdale,Bronx
Nova Southeastern University,FL,Fort Lauderdale,Ft. Lauderdale
American Jewish University,CA,Bel Air,Bel-Air


It looks like these are indeed the same schools - the City values fail to match only because they are formatted differently or name the same locality in a different way. Let's join them into 'df' on Name and State, keeping the CollegeData City, and remove them from the original dataframes:

In [7]:
joined = collegedata_df.join(usnews_df.drop(columns = 'City'), how = 'inner')
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, joined.set_index('City', append = True)], sort = False)

collegedata_df.drop(joined.index, inplace = True)
usnews_df.drop(joined.index, inplace = True)

collegedata_df.reset_index(inplace = True)
usnews_df.reset_index(inplace = True)

has_ranks = df['Rank'].notna()
print("{} non-null Rank values in df.".format(has_ranks.sum()))
print("{} rows in collegedata_df, {} rows in collegedata_missing_df, "
      "{} rows in usnews_df.".format(\
            len(collegedata_df), len(collegedata_missing_df), len(usnews_df)))

1094 non-null Rank values in df.
880 rows in collegedata_df, 54 rows in collegedata_missing_df, 287 rows in usnews_df.


We still have 287 ranks without a home. There's a good chance that the Name values in usnews_df are formatted slightly differently from the Name values in collegedata_df.

Using [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy), which was put together by the people at SeatGeek and built on difflib, we can find a list of potential Name value matches in collegedata_df for each Name value in usnews_df:

In [8]:
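def strip(s):
    # Expand system abbreviations so both sources spell them the same way
    s = s.str.replace('SUNY','State University of New York', regex = False)
    s = s.str.replace('CUNY','City University of New York', regex = False)
    s = s.str.replace("'s",'s', regex = False)
    # Remove generic words; note that 'of' is removed even inside a word
    s = s.str.replace(r"College|University|of|'",'', regex = True)
    s = s.str.replace(r'St\.','Saint', regex = True)
    s = s.str.replace(r'-|\.',' ', regex = True)
    # Collapse repeated whitespace (regex = True is required in pandas 2.0+)
    s = s.str.replace(r'\s+',' ', regex = True)
    s = s.str.strip()
    return s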

collegedata_df['Stripped Name'] = strip(collegedata_df['Name'])

match = lambda x: process.extract(x, collegedata_df['Stripped Name'])[0][0]
usnews_df['Stripped Name'] = strip(usnews_df['Name']).apply(match)

collegedata_df now has a Stripped Name column holding each school's own stripped name. For usnews_df, we stripped each Name and immediately passed it to fuzzywuzzy's process.extract, which returns candidate matches from collegedata_df's Stripped Name column ordered by score; we kept the top candidate and saved it as usnews_df's Stripped Name column.
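
The idea can be sketched without fuzzywuzzy by using the standard library's difflib, which fuzzywuzzy builds on. This is a stand-in, not the notebook's method: `strip_name` is a hypothetical single-string analogue of the strip() helper (without the SUNY/CUNY expansion), the school names are examples chosen for illustration, and difflib's ratio is not the same score that fuzzywuzzy's process.extract uses.

```python
import difflib
import re

def strip_name(name):
    """Single-string analogue of strip() above (illustrative assumption)."""
    name = name.replace("'s", "s")
    name = re.sub(r"College|University|of|'", "", name)
    name = re.sub(r"St\.", "Saint", name)
    name = re.sub(r"-|\.", " ", name)
    return re.sub(r"\s+", " ", name).strip()

# Example names standing in for collegedata_df['Name'].
candidates = [strip_name(n) for n in
              ["St. John's University", "Saint Joseph's College",
               "Johnson University"]]

# A differently formatted spelling standing in for a usnews_df Name.
query = strip_name("Saint Johns University")

# cutoff=0.0 always returns the single closest candidate, however poor.
best = difflib.get_close_matches(query, candidates, n=1, cutoff=0.0)
print(query, '->', best[0])
```

Always taking the top candidate means a poor match is still returned; in the notebook the subsequent join on State and City acts as the guard against such pairings.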

Now we'll set the indexes for both dataframes to include the Stripped Name column and then attempt an inner join (while dropping the useless original usnews_df Name column):

In [9]:
collegedata_df.set_index(['Stripped Name', 'State', 'City'], inplace = True)
usnews_df.set_index(['Stripped Name', 'State', 'City'], inplace = True)

joined = collegedata_df.join(usnews_df.drop(columns = 'Name'), how = 'inner')
collegedata_df.drop(joined.index, inplace = True)
usnews_df.drop(joined.index, inplace = True)

Now we'll reset the joined index, drop the now useless Stripped Name column, set the normal Name, State, City index, and append joined to our combined 'df' dataframe:

In [10]:
joined.reset_index(inplace = True)
joined.drop(columns = 'Stripped Name', inplace = True)
joined.set_index(['Name', 'State', 'City'], inplace = True)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, joined], sort = False)

collegedata_df.reset_index(inplace = True)
collegedata_df.drop(columns = 'Stripped Name', inplace = True)

usnews_df.reset_index(inplace = True)
usnews_df.drop(columns = 'Stripped Name', inplace = True)

has_ranks = df['Rank'].notna()
print("{} non-null Rank values in df.".format(has_ranks.sum()))
print("{} rows in collegedata_df, {} rows in collegedata_missing_df, "
      "{} rows in usnews_df.".format(\
            len(collegedata_df), len(collegedata_missing_df), len(usnews_df)))

1324 non-null Rank values in df.
651 rows in collegedata_df, 54 rows in collegedata_missing_df, 57 rows in usnews_df.


That method helped considerably - it cut our orphaned usnews_df rows from 287 to 57. With so few rows left, it is probably more efficient to rename them manually using the school_rename_dict we imported from collegedata_names:

In [11]:
usnews_df.set_index('Name', inplace = True)
usnews_df.rename(index = school_rename_dict, inplace = True)
usnews_df = usnews_df[usnews_df.index.notna()]

usnews_df.set_index(['State'], append = True, inplace = True)
collegedata_df.set_index(['Name', 'State'], inplace = True)
joined = collegedata_df.join(usnews_df.drop(columns = 'City'), how = 'inner')

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, joined.set_index('City', append = True)], sort = False)
collegedata_df.drop(joined.index, inplace = True)
usnews_df.drop(joined.index, inplace = True)

has_ranks = df['Rank'].notna()
print("{} non-null Rank values in df.".format(has_ranks.sum()))
print("{} rows in collegedata_df, {} rows in collegedata_missing_df, "
      "{} rows in usnews_df.".format(\
            len(collegedata_df), len(collegedata_missing_df), len(usnews_df)))

1351 non-null Rank values in df.
624 rows in collegedata_df, 54 rows in collegedata_missing_df, 7 rows in usnews_df.
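
The contents of school_rename_dict are not shown in this part, so the following is only a sketch of how such a mapping might behave. The mapping `rename_map` and the school names are hypothetical, and the use of None for a school with no counterpart is an assumption suggested by the `notna()` filter in the cell above.

```python
import pandas as pd

# Hypothetical mapping; the real school_rename_dict is not shown here.
rename_map = {
    'Beta Univ.': 'Beta University',
    'Defunct College': None,   # assumption: None marks "no counterpart"
}

ranks = pd.DataFrame(
    {'Rank': [2, 7, 9]},
    index=pd.Index(['Beta Univ.', 'Defunct College', 'Gamma Institute'],
                   name='Name'),
)

# Labels found in the mapping are replaced; all others are left unchanged.
renamed = ranks.rename(index=rename_map)

# Rows whose new label is missing are dropped.
renamed = renamed[renamed.index.notna()]
print(list(renamed.index))
```

Names absent from the mapping pass through untouched, so the dictionary only needs entries for the schools that actually differ.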


We can try to join the remaining 7 rows from usnews_df on the rows we put into collegedata_missing_df that were missing City and/or State values:

In [12]:
usnews_df.reset_index(inplace = True)
usnews_df.set_index('Name', inplace = True)
collegedata_missing_df.set_index('Name', inplace = True)

joined = usnews_df.join(collegedata_missing_df\
                        .drop(columns = ['State', 'City']), how = 'inner')

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, joined.set_index(['State', 'City'], append = True)],
               sort = False)

usnews_df.drop(joined.index, inplace = True)

usnews_df.reset_index(inplace = True)

has_ranks = df['Rank'].notna()
print("{} non-null Rank values in df.".format(has_ranks.sum()))
print("{} rows in collegedata_df, {} rows in collegedata_missing_df, "
      "{} rows in usnews_df.".format(\
            len(collegedata_df), len(collegedata_missing_df), len(usnews_df)))

1356 non-null Rank values in df.
624 rows in collegedata_df, 54 rows in collegedata_missing_df, 2 rows in usnews_df.


There are only two schools remaining in usnews_df:

In [13]:
usnews_df

,Name,State,City,Rank,Rank Type
0,Concordia University,NE,Seward,38,Regional Universities Midwest
1,Concordia University,CA,Irvine,41,Regional Universities West


These two share the same name in usnews_df, but their corresponding rows in collegedata_df have different names. We'll change the names to match and join them to 'df':

In [14]:
usnews_df.loc[0, 'Name'] = 'Concordia University Nebraska'
usnews_df.loc[1, 'Name'] = 'Concordia University Irvine'

usnews_df.set_index(['Name', 'State', 'City'], inplace = True)
collegedata_df.set_index('City', append = True, inplace = True)

joined = collegedata_df.join(usnews_df, how = 'inner')

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, joined], sort = False)

We'll also want to add our remaining collegedata_missing_df rows back to df, even 

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2250 entries, (Bryn Athyn College, PA, Bryn Athyn) to (Concordia University Nebraska, NE, Seward)
Columns: 254 entries, SchoolId to Rank Type
dtypes: float64(158), int64(1), object(95)
memory usage: 4.4+ MB
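
The 2250 entries reported above exceed the 2028 rows originally read from CollegeData, which may indicate that some schools appear more than once after the repeated appends. A possible sanity check, sketched on made-up data, is to look for duplicated index entries and keep the copy that carries a rank. The frames `base` and `late_match` are hypothetical.

```python
import pandas as pd

idx_cols = ['Name', 'State', 'City']

# 'base' mimics the left-joined frame: Beta University has no rank yet.
base = pd.DataFrame({
    'Name': ['Alpha College', 'Beta University'],
    'State': ['NY', 'MA'],
    'City': ['Albany', 'Boston'],
    'Rank': [1.0, None],
}).set_index(idx_cols)

# 'late_match' mimics a row matched in a later step and appended afterwards.
late_match = pd.DataFrame({
    'Name': ['Beta University'], 'State': ['MA'],
    'City': ['Boston'], 'Rank': [2.0],
}).set_index(idx_cols)

combined = pd.concat([base, late_match], sort=False)

# Count index entries that occur more than once.
n_dupes = int(combined.index.duplicated(keep=False).sum())

# sort_values places missing ranks last, so keep='first' retains the ranked copy.
ordered = combined.sort_values('Rank')
deduped = ordered[~ordered.index.duplicated(keep='first')]
print(n_dupes, len(deduped))
```

Whether duplicates should be dropped, and which copy to keep, depends on how the merged frame is used downstream.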
