### Using the FuzzyPandas Module :D

#### This example walks you through using FuzzyPandas to perform fuzzy joins on pandas DataFrames.

In this example, we will match U.S. News college rankings scraped from usnews.com to institution characteristics data from the Integrated Postsecondary Education Data System (IPEDS), joining on each college's state and name.

In [10]:
import numpy as np
import pandas as pd
import FuzzyPandas as fp
from matplotlib import pyplot as plt
%matplotlib inline

**First, let's read in the school ranking data scraped from usnews.com and the IPEDS data downloaded here:**
https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx

In [2]:
def read_us_news(pickle):
    #read one pickled ranking table and append it to the global us_news data frame
    df = pd.read_pickle(pickle)
    return pd.concat([us_news, df], axis=0, ignore_index=True)

us_news = pd.DataFrame() #initialize empty data frame
us_news = read_us_news("example_data/us_news/usnews-ranking-national-universities.pickle")
us_news = read_us_news("example_data/us_news/usnews-ranking-national-liberal-arts-colleges.pickle")
us_news = read_us_news("example_data/us_news/usnews-ranking-regional-colleges-midwest.pickle")
us_news = read_us_news("example_data/us_news/usnews-ranking-regional-colleges-north.pickle")
us_news = read_us_news("example_data/us_news/usnews-ranking-regional-colleges-south.pickle")
us_news = read_us_news("example_data/us_news/usnews-ranking-regional-colleges-west.pickle")
us_news = read_us_news("example_data/us_news/usnews-ranking-regional-universities-midwest.pickle")
us_news = read_us_news("example_data/us_news/usnews-ranking-regional-universities-north.pickle")
us_news = read_us_news("example_data/us_news/usnews-ranking-regional-universities-south.pickle")
us_news = read_us_news("example_data/us_news/usnews-ranking-regional-universities-west.pickle")

print us_news.shape
us_news.head(5)

(1506, 4)


| | category | location | school | score |
|---|---|---|---|---|
| 0 | National Universities | Princeton, NJ | Princeton University | 100 out of 100. |
| 1 | National Universities | Cambridge, MA | Harvard University | 99 out of 100. |
| 2 | National Universities | New Haven, CT | Yale University | 97 out of 100. |
| 3 | National Universities | New York, NY | Columbia University | 95 out of 100. |
| 4 | National Universities | Stanford, CA | Stanford University | 95 out of 100. |


In [3]:
ipeds=pd.read_csv("example_data/ipeds/HD2014.csv")[["UNITID","INSTNM","CITY","STABBR"]]

print ipeds.shape
ipeds.head(5)

(7687, 4)


| | UNITID | INSTNM | CITY | STABBR |
|---|---|---|---|---|
| 0 | 100636 | Community College of the Air Force | Montgomery | AL |
| 1 | 100654 | Alabama A & M University | Normal | AL |
| 2 | 100663 | University of Alabama at Birmingham | Birmingham | AL |
| 3 | 100690 | Amridge University | Montgomery | AL |
| 4 | 100706 | University of Alabama in Huntsville | Huntsville | AL |


**Now, for a little data pre-processing on the U.S. News Data**

In [4]:
us_news["city"] = us_news.location.apply(lambda x: x.split(", ")[0].upper().strip().encode("ascii", "ignore")) #parse out city, to upper case, dropping non-ASCII characters
us_news["state"] = us_news.location.apply(lambda x: x.split(", ")[1].upper().strip()) #parse out state, to upper case
us_news["school"] = us_news.school.apply(lambda x: x.upper().strip()) #school name to upper case
us_news["in_us_news"] = 1 #dummy flag for record belonging in U.S. News

us_news.drop("location", axis=1, inplace=True) #drop original location (now that we've split out into city and state)
us_news.head(5)

| | category | school | score | city | state | in_us_news |
|---|---|---|---|---|---|---|
| 0 | National Universities | PRINCETON UNIVERSITY | 100 out of 100. | PRINCETON | NJ | 1 |
| 1 | National Universities | HARVARD UNIVERSITY | 99 out of 100. | CAMBRIDGE | MA | 1 |
| 2 | National Universities | YALE UNIVERSITY | 97 out of 100. | NEW HAVEN | CT | 1 |
| 3 | National Universities | COLUMBIA UNIVERSITY | 95 out of 100. | NEW YORK | NY | 1 |
| 4 | National Universities | STANFORD UNIVERSITY | 95 out of 100. | STANFORD | CA | 1 |


**Apply Same Data Pre-Processing to IPEDS Data** 

In [5]:
ipeds.rename(columns={"UNITID":"unitid","INSTNM":"school","CITY":"city","STABBR":"state"},\
             inplace=True) #rename columns to match U.S. News Data

ipeds["school"] = ipeds.school.apply(lambda x: x.upper().strip()) #school to upper case
ipeds["city"] = ipeds.city.apply(lambda x: x.upper().strip()) #city to upper case
ipeds["state"] = ipeds.state.apply(lambda x: x.upper().strip()) #state to upper case
ipeds["in_ipeds"] = 1 #dummy flag for record belonging in IPEDS
ipeds.head(5)

| | unitid | school | city | state | in_ipeds |
|---|---|---|---|---|---|
| 0 | 100636 | COMMUNITY COLLEGE OF THE AIR FORCE | MONTGOMERY | AL | 1 |
| 1 | 100654 | ALABAMA A & M UNIVERSITY | NORMAL | AL | 1 |
| 2 | 100663 | UNIVERSITY OF ALABAMA AT BIRMINGHAM | BIRMINGHAM | AL | 1 |
| 3 | 100690 | AMRIDGE UNIVERSITY | MONTGOMERY | AL | 1 |
| 4 | 100706 | UNIVERSITY OF ALABAMA IN HUNTSVILLE | HUNTSVILLE | AL | 1 |
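
Both cleaning steps apply the same upper-case-and-strip treatment, so they could share one helper. The following is a minimal sketch rather than part of FuzzyPandas: the helper name `normalize_text` and the toy rows are hypothetical, and it uses pandas' vectorized `.str` methods in place of `apply` with a lambda.

```python
import pandas as pd

def normalize_text(series):
    """Upper-case and trim a string column with vectorized .str methods."""
    return series.str.upper().str.strip()

# Toy rows for illustration only (not the real U.S. News data)
demo = pd.DataFrame({
    "location": ["Princeton, NJ", "New Haven, CT"],
    "school": [" Princeton University", "Yale University "],
})

# Split "City, ST" into two columns, then normalize every text column
demo[["city", "state"]] = demo["location"].str.split(", ", expand=True)
for col in ["school", "city", "state"]:
    demo[col] = normalize_text(demo[col])

# The original location column is no longer needed
demo = demo.drop("location", axis=1)
```

Running the same helper over both data frames keeps the join keys formatted identically, which matters for exact merges and also helps fuzzy scores.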


**Now, let's try to match the U.S. News data to the IPEDS data on location (city and state) and college name with pandas merge**

In [6]:
merged = pd.merge(us_news, ipeds, how="left", on=["state", "city", "school"])

merged.in_ipeds.value_counts(dropna=False)


 1     1078
NaN     428
Name: in_ipeds, dtype: int64

Without fuzzy matching, we could only match 1,078 of our 1,506 colleges, about 72%. We can do better than that!
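
To see exactly which rows an exact merge leaves behind, pandas can label each row with where it came from. The sketch below uses two tiny made-up frames (the names `left` and `right` and their rows are illustrative, not the notebook's data) together with the `indicator=True` option of `pd.merge`.

```python
import pandas as pd

# Illustrative toy data, not the actual U.S. News or IPEDS records
left = pd.DataFrame({
    "state": ["NJ", "NY"],
    "school": ["PRINCETON UNIVERSITY", "COLUMBIA UNIVERSITY"],
})
right = pd.DataFrame({
    "state": ["NJ", "NY"],
    "school": ["PRINCETON UNIVERSITY",
               "COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK"],
})

# indicator=True adds a "_merge" column; in a left join each row is
# tagged "both" (exact partner found) or "left_only" (no partner)
exact = pd.merge(left, right, how="left", on=["state", "school"], indicator=True)

# Rows without an exact partner are the candidates for fuzzy matching
unmatched = exact[exact["_merge"] == "left_only"]
```

Here only the Princeton row matches exactly; the Columbia row differs by a suffix and ends up in `unmatched`.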

**Instead, let's use FuzzyPandas to match each school in the U.S. News data to its CLOSEST, but not necessarily EXACT, IPEDS match**

In [17]:
print us_news.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1506 entries, 0 to 1505
Data columns (total 7 columns):
category      1506 non-null object
school        1506 non-null object
score         1506 non-null object
city          1506 non-null object
state         1506 non-null object
in_us_news    1506 non-null int64
byvar         1506 non-null object
dtypes: int64(1), object(6)
memory usage: 94.1+ KB


In [None]:
from fuzzywuzzy import process

def fuzzy_merge(a, b, fuzz_on, how="inner", score_cutoff=60, show_score=True):
    #build one string key per row from the columns to fuzzy match on
    if len(fuzz_on) > 1:
        a["byvar"] = a[fuzz_on].apply(lambda x: " ".join(x), axis=1)
        b["byvar"] = b[fuzz_on].apply(lambda x: " ".join(x), axis=1)
    else:
        a["byvar"] = a[fuzz_on[0]]
        b["byvar"] = b[fuzz_on[0]]

    a["byvar"] = a.byvar.apply(lambda x: x.encode("ascii", "ignore")) #drop non-ASCII characters

    matches = []
    for each in a.byvar.unique():
        #fuzzywuzzy scores run from 0 to 100; extractOne returns None if nothing reaches score_cutoff
        match = process.extractOne(each, b.byvar, score_cutoff=score_cutoff)
        if match is not None:
            matches.append({"byvar": each, "matched": match[0], "score": match[1]})
    matches = pd.DataFrame(matches, columns=["byvar", "matched", "score"])

    b = b.rename(columns={"byvar": "matched"}) #rename returns a new frame, so keep the result
    merged1 = pd.merge(a, matches, on="byvar", how="left")
    merged2 = pd.merge(merged1, b, on="matched", how=how)

    #drop the helper columns; keep the score only if requested
    drop_cols = ["byvar", "matched"] if show_score else ["byvar", "matched", "score"]
    return merged2.drop(drop_cols, axis=1)

In [39]:
from fuzzywuzzy import process

def fuzzy_merge(a, b, fuzz_on, how="inner", score_cutoff=60, show_score=True):
    #debugging version: print the best match for the first 50 keys instead of merging
    if len(fuzz_on) > 1:
        a["byvar"] = a[fuzz_on].apply(lambda x: " ".join(x), axis=1)
        b["byvar"] = b[fuzz_on].apply(lambda x: " ".join(x), axis=1)
    else:
        a["byvar"] = a[fuzz_on[0]]
        b["byvar"] = b[fuzz_on[0]]

    b["byvar"] = b.byvar.apply(lambda x: x.decode("ascii", "ignore")) #drop non-ASCII bytes

    for each in a.byvar.unique()[:50]:
        match = process.extractOne(each, b.byvar, score_cutoff=score_cutoff)
        if match is not None: #None means nothing scored at least score_cutoff
            print each, match[0], match[1]




In [40]:
fuzzy_merge(us_news, ipeds, fuzz_on=["state", "city", "school"])

NJ PRINCETON PRINCETON UNIVERSITY NJ PRINCETON PRINCETON UNIVERSITY 100
MA CAMBRIDGE HARVARD UNIVERSITY MA CAMBRIDGE HARVARD UNIVERSITY 100
CT NEW HAVEN YALE UNIVERSITY CT NEW HAVEN YALE UNIVERSITY 100
NY NEW YORK COLUMBIA UNIVERSITY NY NEW YORK NEW YORK UNIVERSITY 95
CA STANFORD STANFORD UNIVERSITY CA STANFORD STANFORD UNIVERSITY 100
IL CHICAGO UNIVERSITY OF CHICAGO IL CHICAGO UNIVERSITY OF CHICAGO 100
MA CAMBRIDGE MASSACHUSETTS INSTITUTE OF TECHNOLOGY MA CAMBRIDGE MASSACHUSETTS INSTITUTE OF TECHNOLOGY 100
NC DURHAM DUKE UNIVERSITY NC DURHAM DUKE UNIVERSITY 100
PA PHILADELPHIA UNIVERSITY OF PENNSYLVANIA PA PHILADELPHIA UNIVERSITY OF PENNSYLVANIA 100
CA PASADENA CALIFORNIA INSTITUTE OF TECHNOLOGY CA PASADENA CALIFORNIA INSTITUTE OF TECHNOLOGY 100
MD BALTIMORE JOHNS HOPKINS UNIVERSITY MD BALTIMORE JOHNS HOPKINS UNIVERSITY 100
NH HANOVER DARTMOUTH COLLEGE NH HANOVER DARTMOUTH COLLEGE 100
IL EVANSTON NORTHWESTERN UNIVERSITY IL EVANSTON NORTHWESTERN UNIVERSITY 100
RI PROVIDENCE BROWN UNIVE

KeyboardInterrupt: 
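
Scoring every U.S. News key against every IPEDS key is expensive: with the shapes shown earlier, that is up to 1,506 × 7,687, or roughly 11.6 million, string comparisons. One common way to cut the work is blocking, that is, only comparing records that already agree on a cheap exact key such as state. The sketch below illustrates the idea with the standard library's `difflib.get_close_matches` standing in for fuzzywuzzy; the function name `match_within_state` and the toy rows are hypothetical.

```python
from collections import defaultdict
from difflib import get_close_matches

def match_within_state(left_rows, right_rows, cutoff=0.6):
    """Match (state, school) tuples, comparing only within the same state.

    Returns a dict mapping each left (state, school) to the closest right
    school in that state, or None when no candidate reaches the cutoff.
    """
    # Index the right-hand schools by state so each lookup is cheap
    by_state = defaultdict(list)
    for state, school in right_rows:
        by_state[state].append(school)

    results = {}
    for state, school in left_rows:
        # Only candidates from the same state are scored
        candidates = by_state.get(state, [])
        hits = get_close_matches(school, candidates, n=1, cutoff=cutoff)
        results[(state, school)] = hits[0] if hits else None
    return results

# Toy rows for illustration only
left_rows = [("NJ", "PRINCETON UNIV"), ("WY", "UNIVERSITY OF WYOMING")]
right_rows = [("NJ", "PRINCETON UNIVERSITY"),
              ("NJ", "RIDER UNIVERSITY"),
              ("CA", "STANFORD UNIVERSITY")]

results = match_within_state(left_rows, right_rows)
```

A side effect of blocking is that a school can never be matched to a similarly named institution in another state, which removes one class of false positives but also assumes the state field itself is correct.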

In [None]:
matches=[] #create an empty list to hold school match data

for state in us_news.state.unique(): #iterate through states in the U.S. News dataset
    ranks_schools=us_news[us_news.state==state]["school"] #series of schools from U.S. News
    ipeds_schools=ipeds[ipeds.state==state]["school"] #series of schools from IPEDS

    for school in ranks_schools: #iterate through all schools within state
        best_match=process.extractOne(school, ipeds_schools) #return closest IPEDS school match
        if best_match is None: #no IPEDS schools to compare against in this state
            continue
        school_match=best_match[0]
        match_score=best_match[1]

        matches.append({"state":state, "school":school, "school_match":school_match, "match_score":match_score})

crosswalk=pd.DataFrame(matches) #list of dictionaries -> pandas DataFrame

#Now let's look at fuzzy matches from NY
crosswalk[(crosswalk.state=="NY") & (crosswalk.school!=crosswalk.school_match)].head(25)

These matches look mostly good, but there are a few false positives (e.g., SUNY-Albany matched to Hair Design Institute at 5th Ave-Brooklyn). We may want to edit the code to accept only matches whose score exceeds a chosen cutoff, though fewer schools will then be matched. Tradeoffs!
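
The cutoff idea can be sketched without fuzzywuzzy. The example below is an illustration under assumptions, not the notebook's code: it scores candidates with the standard library's `difflib.SequenceMatcher`, rescaled to 0-100, and the helper name `best_match`, the candidate list, and the cutoff of 75 are all made up for the demonstration.

```python
from difflib import SequenceMatcher

def best_match(query, choices, cutoff=0):
    """Return (choice, score) for the closest choice, or None below cutoff.

    Scores are difflib similarity ratios rescaled to integers from 0 to 100.
    """
    scored = [(c, int(round(100 * SequenceMatcher(None, query, c).ratio())))
              for c in choices]
    if not scored:
        return None
    choice, score = max(scored, key=lambda pair: pair[1])
    return (choice, score) if score >= cutoff else None

# Hypothetical candidate list for illustration
choices = ["CORNELL UNIVERSITY", "HAIR DESIGN INSTITUTE"]

# A near-identical name clears the cutoff
good = best_match("CORNELL UNIV", choices, cutoff=75)

# With no cutoff, some "best" candidate is always returned, however poor
forced = best_match("SUNY ALBANY", choices)

# With a cutoff, the same query is left unmatched instead
rejected = best_match("SUNY ALBANY", choices, cutoff=75)
```

Raising the cutoff turns bad matches into missing matches, which is the tradeoff described above: the crosswalk becomes cleaner but less complete.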

We may also want to install the python-Levenshtein package to speed up our matching. This library calculates string similarity in C, so it is much faster! It uses Levenshtein (edit distance) ratios rather than Ratcliff/Obershelp ratios, but the resulting scores are often very similar.
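
To get a feel for how the two kinds of score relate, the sketch below scores one pair of strings both ways. `difflib.SequenceMatcher` is the standard library's matcher in the Ratcliff/Obershelp family; the Levenshtein distance is a plain-Python implementation written for illustration, and the `1 - distance / max_length` normalization is just one common choice that may differ from how python-Levenshtein normalizes its own ratio.

```python
from difflib import SequenceMatcher

def levenshtein(s, t):
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]

def levenshtein_similarity(s, t):
    """Rescale edit distance to a 0-1 similarity (assumed normalization)."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / float(longest)

a = "SUNY ALBANY"
b = "SUNY AT ALBANY"

ro = SequenceMatcher(None, a, b).ratio()  # Ratcliff/Obershelp-style ratio
lev = levenshtein_similarity(a, b)        # edit-distance-based similarity
```

Both numbers fall between 0 and 1 and rank near-duplicates highly, but they are not identical, so a cutoff tuned under one scorer is worth rechecking after switching to the other.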