### Using the FuzzyWuzzy Process Module for Fuzzy Dataframe Joins

#### Process Extract & ExtractOne Syntax:
* **process.extract(string, list, limit=n):**

Compares each item in list to string and returns a list of the n closest matches from list. limit defaults to 5; pass limit=None to get a score for every element in list.

Returned output is a list of tuples, in format:
[(closest match, closest match score), (second closest match, second closest match score), ...]

* **process.extractOne(string, list):**

Compares each item in list to string and returns only the closest match from list.

Returned output is a tuple in format: (closest match, closest match score)


In [3]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
%matplotlib inline

college="Harvard"
colleges=["Harvard","Hahvahd","Princeton","Yale","Tufts"]

print(process.extract(college, colleges))

[('Harvard', 100), ('Hahvahd', 71), ('Yale', 23), ('Princeton', 13), ('Tufts', 0)]


In [2]:
print(process.extract(college, colleges, limit=2))

[('Harvard', 100), ('Hahvahd', 71)]


In [3]:
print(process.extractOne(college, colleges))

('Harvard', 100)
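
To make the behaviour above concrete, here is a rough pure-Python sketch of what `process.extract` and `process.extractOne` do: score every choice, sort by score, keep the top n. The helper names `simple_ratio`, `simple_extract` and `simple_extract_one` are made up for illustration, and `difflib.SequenceMatcher` stands in for FuzzyWuzzy's real scorer (which also preprocesses strings and blends several ratios), so scores on other inputs will not necessarily match the library's.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of two strings as an integer from 0 to 100 (stand-in scorer)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def simple_extract(query, choices, limit=5):
    """Hypothetical stand-in for process.extract: the best `limit` (choice, score) pairs."""
    scored = [(choice, simple_ratio(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored if limit is None else scored[:limit]


def simple_extract_one(query, choices):
    """Hypothetical stand-in for process.extractOne: the single best pair, or None."""
    best_pairs = simple_extract(query, choices, limit=1)
    return best_pairs[0] if best_pairs else None


colleges = ["Harvard", "Hahvahd", "Princeton", "Yale", "Tufts"]
top_two = simple_extract("Harvard", colleges, limit=2)
best = simple_extract_one("Harvard", colleges)
print(top_two)
print(best)
```

The sort is the whole trick: `extractOne` is just `extract` with a limit of one, which is why it returns a single tuple rather than a list.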


#### Now Let's Use FuzzyWuzzy Process to Try to Join Two Pandas DataFrames

In this example, I will attempt to fuzzy join U.S. News college ranking data scraped from usnews.com to the Department of Education's Integrated Postsecondary Education Data System (IPEDS) database.

In [4]:
##Read In U.S. News Rankings Scraped from usnews.com
pickles=["usnews-ranking-national-universities.pickle",
         "usnews-ranking-national-liberal-arts-colleges.pickle",
         "usnews-ranking-regional-colleges-midwest.pickle",
         "usnews-ranking-regional-colleges-north.pickle",
         "usnews-ranking-regional-colleges-south.pickle",
         "usnews-ranking-regional-colleges-west.pickle",
         "usnews-ranking-regional-universities-midwest.pickle",
         "usnews-ranking-regional-universities-north.pickle",
         "usnews-ranking-regional-universities-south.pickle",
         "usnews-ranking-regional-universities-west.pickle"]

ranks=pd.DataFrame()

for pickle in pickles:
    cat=pd.read_pickle("data/us_news/"+pickle)
    
    ranks=pd.concat([ranks,cat],axis=0,ignore_index=True)

print(ranks.shape)
ranks.head(10)

(1506, 4)


| | category | location | school | score |
|---|---|---|---|---|
| 0 | National Universities | Princeton, NJ | Princeton University | 100 out of 100. |
| 1 | National Universities | Cambridge, MA | Harvard University | 99 out of 100. |
| 2 | National Universities | New Haven, CT | Yale University | 97 out of 100. |
| 3 | National Universities | New York, NY | Columbia University | 95 out of 100. |
| 4 | National Universities | Stanford, CA | Stanford University | 95 out of 100. |
| 5 | National Universities | Chicago, IL | University of Chicago | 95 out of 100. |
| 6 | National Universities | Cambridge, MA | Massachusetts Institute of Technology | 93 out of 100. |
| 7 | National Universities | Durham, NC | Duke University | 92 out of 100. |
| 8 | National Universities | Philadelphia, PA | University of Pennsylvania | 91 out of 100. |
| 9 | National Universities | Pasadena, CA | California Institute of Technology | 90 out of 100. |


In [5]:
##Clean U.S. News Ranking Data
ranks["city"]=ranks.location.str.split(", ").str.get(0).str.upper().str.strip()
ranks["state"]=ranks.location.str.split(", ").str.get(1).str.upper().str.strip()
ranks["school"]=ranks.school.str.upper().str.strip()

ranks.head(10)

| | category | location | school | score | city | state |
|---|---|---|---|---|---|---|
| 0 | National Universities | Princeton, NJ | PRINCETON UNIVERSITY | 100 out of 100. | PRINCETON | NJ |
| 1 | National Universities | Cambridge, MA | HARVARD UNIVERSITY | 99 out of 100. | CAMBRIDGE | MA |
| 2 | National Universities | New Haven, CT | YALE UNIVERSITY | 97 out of 100. | NEW HAVEN | CT |
| 3 | National Universities | New York, NY | COLUMBIA UNIVERSITY | 95 out of 100. | NEW YORK | NY |
| 4 | National Universities | Stanford, CA | STANFORD UNIVERSITY | 95 out of 100. | STANFORD | CA |
| 5 | National Universities | Chicago, IL | UNIVERSITY OF CHICAGO | 95 out of 100. | CHICAGO | IL |
| 6 | National Universities | Cambridge, MA | MASSACHUSETTS INSTITUTE OF TECHNOLOGY | 93 out of 100. | CAMBRIDGE | MA |
| 7 | National Universities | Durham, NC | DUKE UNIVERSITY | 92 out of 100. | DURHAM | NC |
| 8 | National Universities | Philadelphia, PA | UNIVERSITY OF PENNSYLVANIA | 91 out of 100. | PHILADELPHIA | PA |
| 9 | National Universities | Pasadena, CA | CALIFORNIA INSTITUTE OF TECHNOLOGY | 90 out of 100. | PASADENA | CA |


In [6]:
##Read In IPEDS Data (downloaded from https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx)
ipeds=pd.read_csv("data/ipeds/HD2014.csv")[["UNITID","INSTNM","CITY","STABBR"]]

ipeds.head(10)

| | UNITID | INSTNM | CITY | STABBR |
|---|---|---|---|---|
| 0 | 100636 | Community College of the Air Force | Montgomery | AL |
| 1 | 100654 | Alabama A & M University | Normal | AL |
| 2 | 100663 | University of Alabama at Birmingham | Birmingham | AL |
| 3 | 100690 | Amridge University | Montgomery | AL |
| 4 | 100706 | University of Alabama in Huntsville | Huntsville | AL |
| 5 | 100724 | Alabama State University | Montgomery | AL |
| 6 | 100733 | University of Alabama System Office | Tuscaloosa | AL |
| 7 | 100751 | The University of Alabama | Tuscaloosa | AL |
| 8 | 100760 | Central Alabama Community College | Alexander City | AL |
| 9 | 100812 | Athens State University | Athens | AL |


In [7]:
##Clean IPEDS Data
ipeds.rename(columns={"UNITID":"unitid","INSTNM":"school","CITY":"city","STABBR":"state"}, inplace=True)

ipeds["school"]=ipeds.school.str.upper().str.strip()
ipeds["city"]=ipeds.city.str.upper().str.strip()
ipeds["in_ipeds"]=1

ipeds.head(10)

| | unitid | school | city | state | in_ipeds |
|---|---|---|---|---|---|
| 0 | 100636 | COMMUNITY COLLEGE OF THE AIR FORCE | MONTGOMERY | AL | 1 |
| 1 | 100654 | ALABAMA A & M UNIVERSITY | NORMAL | AL | 1 |
| 2 | 100663 | UNIVERSITY OF ALABAMA AT BIRMINGHAM | BIRMINGHAM | AL | 1 |
| 3 | 100690 | AMRIDGE UNIVERSITY | MONTGOMERY | AL | 1 |
| 4 | 100706 | UNIVERSITY OF ALABAMA IN HUNTSVILLE | HUNTSVILLE | AL | 1 |
| 5 | 100724 | ALABAMA STATE UNIVERSITY | MONTGOMERY | AL | 1 |
| 6 | 100733 | UNIVERSITY OF ALABAMA SYSTEM OFFICE | TUSCALOOSA | AL | 1 |
| 7 | 100751 | THE UNIVERSITY OF ALABAMA | TUSCALOOSA | AL | 1 |
| 8 | 100760 | CENTRAL ALABAMA COMMUNITY COLLEGE | ALEXANDER CITY | AL | 1 |
| 9 | 100812 | ATHENS STATE UNIVERSITY | ATHENS | AL | 1 |


#### Let's Try Joining our U.S. News and IPEDS Data on State and School without Fuzzy Matching

In [8]:
merged=pd.merge(ranks,ipeds,how="left",on=["state","school"])

print(merged.in_ipeds.value_counts(dropna=False))

 1     1113
NaN     393
Name: in_ipeds, dtype: int64



Without fuzzy matching, we could match only 1,113 of the 1,506 schools (about 74%) reported in U.S. News to the IPEDS database. We can do better than that!
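
A quick way to see why the exact join misses schools: an exact join is just a lookup on the (state, school) key, so any difference in spelling is a miss. The sketch below mimics the left join on a handful of rows in plain Python. The Columbia and St. Lawrence spellings are real ones from the two datasets; the Yale row and all `unitid` values are placeholders I made up for illustration.

```python
# Toy stand-ins for the two tables (unitid values are placeholders, not real IDs).
ranks_rows = [
    {"state": "CT", "school": "YALE UNIVERSITY"},
    {"state": "NY", "school": "COLUMBIA UNIVERSITY"},
    {"state": "NY", "school": "ST. LAWRENCE UNIVERSITY"},
]
ipeds_rows = [
    {"state": "CT", "school": "YALE UNIVERSITY", "unitid": 1},
    {"state": "NY", "school": "COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK", "unitid": 2},
    {"state": "NY", "school": "ST LAWRENCE UNIVERSITY", "unitid": 3},
]

# Index IPEDS by the join key, like merging on ["state", "school"].
ipeds_by_key = {(row["state"], row["school"]): row["unitid"] for row in ipeds_rows}

# Left join: keep every ranked school, with None where there is no exact key match.
joined = [
    dict(row, unitid=ipeds_by_key.get((row["state"], row["school"])))
    for row in ranks_rows
]

matched = [row for row in joined if row["unitid"] is not None]
unmatched = [row["school"] for row in joined if row["unitid"] is None]
match_rate = 100.0 * len(matched) / len(joined)
print(unmatched)
print("matched {} of {} ({:.0f}%)".format(len(matched), len(joined), match_rate))
```

A single period ("ST." vs "ST") or a longer official name is enough to drop a school from the exact join, which is the gap fuzzy matching is meant to close.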

#### Enter FuzzyWuzzy:

Instead, let's use FuzzyWuzzy's process module to match each school in U.S. News to its CLOSEST, but not necessarily EXACT, IPEDS match.

Below I've written some sample code that loops through all of the schools in our U.S. News dataset and matches each one to the closest institution reported in IPEDS. I've done this process separately for each state in order to help prevent erroneous matches.

FYI - This process is rather expensive, since each U.S. News school gets scored against every IPEDS school in its state :(
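
As a quick aside before the real matching loop: to get a feel for the cost, and for why going state by state helps with speed as well as accuracy, here's a back-of-the-envelope sketch that counts string comparisons with and without per-state blocking. The per-state school counts are invented for illustration, not taken from the real data.

```python
# Hypothetical counts of schools per state in each dataset (made-up numbers).
ranks_per_state = {"NY": 100, "CA": 90, "TX": 60}
ipeds_per_state = {"NY": 450, "CA": 700, "TX": 400}

# Without blocking: every ranked school is scored against every IPEDS school.
total_ranks = sum(ranks_per_state.values())
total_ipeds = sum(ipeds_per_state.values())
all_pairs = total_ranks * total_ipeds

# With blocking: a ranked school is scored only against IPEDS schools in its state.
blocked_pairs = sum(
    n_ranks * ipeds_per_state.get(state, 0)
    for state, n_ranks in ranks_per_state.items()
)

print("comparisons without blocking: {}".format(all_pairs))
print("comparisons with per-state blocking: {}".format(blocked_pairs))
```

With more states in play, the gap between the two counts widens, because each school is compared against a smaller share of the full IPEDS list.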

In [9]:
matches=[] #create an empty list to hold school match data

for state in ranks.state.unique(): #Iterate through states in rank dataset
    ranks_schools=ranks[ranks.state==state]["school"] #Return series of schools from U.S. News
    ipeds_schools=ipeds[ipeds.state==state]["school"] #Return series of schools from IPEDS
    
    for school in ranks_schools: #iterate through all schools within state
        best_match=process.extractOne(school, ipeds_schools) #return closest IPEDS school match
        school_match=best_match[0]
        match_score=best_match[1]
        
        matches.append({"state":state, "school":school, "school_match":school_match, "match_score":match_score})
        
crosswalk=pd.DataFrame(matches) #List of Dictionaries -> Pandas Dataframe

#Now let's look at fuzzy matches from NY
crosswalk[(crosswalk.state=="NY") & (crosswalk.school!=crosswalk.school_match)].head(25)


| | match_score | school | school_match | state |
|---|---|---|---|---|
| 109 | 90 | COLUMBIA UNIVERSITY | COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK | NY |
| 117 | 86 | BINGHAMTON UNIVERSITY—SUNY | RIDLEY-LOWELL BUSINESS & TECHNICAL INSTITUTE-B... | NY |
| 118 | 92 | STONY BROOK UNIVERSITY—SUNY | STONY BROOK UNIVERSITY | NY |
| 120 | 91 | UNIVERSITY AT BUFFALO—SUNY | UNIVERSITY AT BUFFALO | NY |
| 122 | 95 | NEW SCHOOL | THE NEW SCHOOL | NY |
| 123 | 86 | UNIVERSITY AT ALBANY—SUNY | HAIR DESIGN INSTITUTE AT FIFTH AVENUE-BROOKLYN | NY |
| 125 | 92 | ST. JOHN FISHER COLLEGE | SAINT JOHN FISHER COLLEGE | NY |
| 127 | 95 | ST. JOHN'S UNIVERSITY | ST JOHN'S UNIVERSITY-NEW YORK | NY |
| 128 | 90 | PACE UNIVERSITY | PACE UNIVERSITY-NEW YORK | NY |
| 138 | 98 | ST. LAWRENCE UNIVERSITY | ST LAWRENCE UNIVERSITY | NY |


These matches look mostly good, but there are a few false positives (e.g. SUNY-Albany matched to Hair Design Institute at Fifth Avenue-Brooklyn). We may want to edit the code to accept only matches whose score is above some threshold, though then fewer schools will end up matched. Tradeoffs!
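
One way to apply a score threshold, sketched on a few rows copied from the NY output: keep a match only if its score clears a cutoff, and set the rest aside for manual review. The cutoff of 90 and the helper name `split_by_cutoff` are my own choices for illustration; the right cutoff depends on the data. (As far as I know, `process.extractOne` also accepts a `score_cutoff` argument and returns None when no choice meets it, which would do the filtering at match time.)

```python
def split_by_cutoff(matches, cutoff):
    """Split match records into (accepted, rejected) by match_score (hypothetical helper)."""
    accepted = [m for m in matches if m["match_score"] >= cutoff]
    rejected = [m for m in matches if m["match_score"] < cutoff]
    return accepted, rejected


# A few rows copied from the NY crosswalk output.
ny_matches = [
    {"school": "COLUMBIA UNIVERSITY",
     "school_match": "COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK",
     "match_score": 90},
    {"school": "UNIVERSITY AT ALBANY—SUNY",
     "school_match": "HAIR DESIGN INSTITUTE AT FIFTH AVENUE-BROOKLYN",
     "match_score": 86},
    {"school": "NEW SCHOOL",
     "school_match": "THE NEW SCHOOL",
     "match_score": 95},
    {"school": "ST. JOHN FISHER COLLEGE",
     "school_match": "SAINT JOHN FISHER COLLEGE",
     "match_score": 92},
]

accepted, rejected = split_by_cutoff(ny_matches, cutoff=90)
print("accepted: {}".format([m["school"] for m in accepted]))
print("needs review: {}".format([m["school"] for m in rejected]))
```

On these rows a cutoff of 90 drops the Albany false positive and keeps the three good matches, but a good match that happens to score in the 80s would be dropped the same way.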

We may also want to install the python-Levenshtein package to speed up our matching. This library calculates string similarity in C, so it is much faster than the pure-Python fallback! It uses Levenshtein (edit distance) ratios rather than Ratcliff-Obershelp ratios, but the resulting scores are often very similar.
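
To see why the two kinds of score tend to agree, here's a small pure-Python comparison on the Harvard/Hahvahd pair from the top of this notebook. `difflib.SequenceMatcher` is the standard library's Ratcliff-Obershelp-style matcher. `indel_distance` and `indel_ratio` are helpers I wrote for illustration, assuming the edit-distance ratio counts a substitution as a delete plus an insert (cost 2) and normalizes by the combined length of the two strings, which is my understanding of how python-Levenshtein's ratio works. It is not the library's actual C code.

```python
from difflib import SequenceMatcher


def difflib_ratio(a, b):
    """Ratcliff-Obershelp-style similarity from the standard library, scaled to 0-100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def indel_distance(a, b):
    """Edit distance where insert/delete cost 1 and substitute costs 2 (assumed weights)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            substitute = previous[j - 1] + (0 if char_a == char_b else 2)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitute))
        previous = current
    return previous[-1]


def indel_ratio(a, b):
    """Edit-distance similarity normalized by combined length, scaled to 0-100."""
    total = len(a) + len(b)
    if total == 0:
        return 100  # two empty strings are identical
    return int(round(100.0 * (total - indel_distance(a, b)) / total))


pair = ("Harvard", "Hahvahd")
print("difflib ratio: {}".format(difflib_ratio(*pair)))
print("edit-distance ratio: {}".format(indel_ratio(*pair)))
```

Both functions give 71 for this pair, the same score shown for Hahvahd earlier. The two approaches can disagree on longer strings with reordered words, so it's worth spot-checking a few matches after switching.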