# Task 1: Using RLTK to perform Entity Resolution (ER)

<sub>Content of this notebook was prepared by Basel Shbita, and modified by Avijit Thawani (thawani@usc.edu) as part of the class <u>DSCI 558: Building Knowledge Graphs</u> at University of Southern California (USC).</sub>

The Record Linkage ToolKit ([RLTK](https://github.com/usc-isi-i2/rltk)) is a general-purpose open-source record linkage platform that allows users to build powerful Python programs that link records referring to the same underlying entity.

This notebook introduces some applied examples using RLTK. You can also find additional examples and use-cases in [RLTK's documentation](https://rltk.readthedocs.io/en/master/).

## Dataset analysis & RLTK components construction

First, you need to define what a single entry looks like for each type of record (that is, for each dataset).

In [1]:
import rltk
import json
import csv
import pandas as pd
from difflib import SequenceMatcher

In [2]:
tokenizer = rltk.tokenizer.crf_tokenizer.crf_tokenizer.CrfTokenizer()

In [3]:
usnews = pd.read_csv("python/usnews.csv")
usnews.rename(columns={'Unnamed: 0':'ID'}, inplace=True)
gradcafe = pd.read_csv("gradcafe.csv")

In [4]:
usnews

Unnamed: 0,ID,name,url,location,no. of reviews,description,tuition,enrollment
0,0,Princeton University,https://www.usnews.com//best-colleges/princeto...,"Princeton, NJ",19 reviews,The ivy-covered campus of Princeton University...,"$56,010","4,773(fall 2020)"
1,1,Columbia University,https://www.usnews.com//best-colleges/columbia...,"New York, NY",39 reviews,Columbia University has three undergraduate sc...,"$63,530","6,170(fall 2020)"
2,2,Harvard University,https://www.usnews.com//best-colleges/harvard-...,"Cambridge, MA",17 reviews,Harvard University is a private institution in...,"$55,587","5,222(fall 2020)"
3,3,Massachusetts Institute of Technology,https://www.usnews.com//best-colleges/massachu...,"Cambridge, MA",12 reviews,Though the Massachusetts Institute of Technolo...,"$55,878","4,361(fall 2020)"
4,4,Yale University,https://www.usnews.com//best-colleges/yale-uni...,"New Haven, CT",10 reviews,"Yale University, located in New Haven, Connect...","$59,950","4,703(fall 2020)"
...,...,...,...,...,...,...,...,...
145,145,University of the Pacific,https://www.usnews.com//best-colleges/universi...,"Stockton, CA",4 reviews,The University of the Pacific is a private col...,"$52,352","3,524(fall 2020)"
146,146,University of Tulsa,https://www.usnews.com//best-colleges/universi...,"Tulsa, OK",6 reviews,Students at the University of Tulsa leave with...,"$44,838","2,929(fall 2020)"
147,147,Colorado State University,https://www.usnews.com//best-colleges/colorado...,"Fort Collins, CO",6 reviews,"Colorado State University, also known as CSU, ...","$31,540(out-of-state)","25,186(fall 2020)"
148,148,CUNY--City College,https://www.usnews.com//best-colleges/cuny-cit...,"New York, NY",3 reviews,"Founded in 1847, CUNY--City College is a publi...","$19,010(out-of-state)","12,587(fall 2020)"


In [5]:
gradcafe

Unnamed: 0,ID,university,major,degree,season,decision,decision_method,decision_date,gpa,gre_verbal,gre_quant,gre_writing,status,created_at,comment
0,1,Michigan State University,Mechanical Engineering,PhD,,Rejected,Website,19-Nov-21,3.63,,,,I,19-Nov-21,I am rejecting without mentioning any reason I...
1,2,Johns Hopkins University,Mechanical Engineering,PhD,,Rejected,E-mail,17-Nov-21,,,,,U,17-Nov-21,Department resubmitted my application for Fall...
2,3,Stevens Institute Of Technology,Mechanical Engineering,PhD,,Rejected,E-mail,16-Nov-21,3.85,156.0,161.0,4.0,I,16-Nov-21,"""...We have reviewed your application for a Ph..."
3,4,UIUC,Civil And Environmental Engineering,Masters,,Accepted,E-mail,5-May-21,3.42,155.0,162.0,3.5,I,14-Nov-21,Applied for PhD but accepted for non thesis ma...
4,5,Carnegie Mellon University,Electrical And Computer Engineering,PhD,,Accepted,E-mail,11-Nov-21,3.50,151.0,166.0,3.5,U,11-Nov-21,Oh my god. It is unbelievable.\nI am CMU MS EC...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49738,49745,Clemson University,Computer Engineering,PhD,F15,Rejected,E-mail,29-Jan-15,3.86,147.0,167.0,3.5,I,29-Jan-15,report spam\nreply
49739,49746,"University Of California, San Diego",Computer Science And Engineering,Masters,F15,Accepted,E-mail,29-Jan-15,,170.0,154.0,3.0,I,29-Jan-15,Non-CS undergrad. One paper in preparation. On...
49740,49747,University Of Texas Austin,Mechanical Engineering,PhD,F15,Rejected,Website,28-Jan-15,,,,,I,29-Jan-15,"1 Journal Publication,and another close to Pub..."
49741,49748,NC State,Nuclear Engineering,PhD,F15,Accepted,E-mail,29-Jan-15,3.64,162.0,164.0,4.5,A,29-Jan-15,report spam\nreply


In [6]:
class usnewsRecord(rltk.Record):
    def __init__(self, raw_object):
        super().__init__(raw_object)
        self.name = ''

    @rltk.cached_property
    def id(self):
        return self.raw_object['ID']

    @rltk.cached_property
    def name_string(self):
        return self.raw_object['name']

    @rltk.cached_property
    def name_tokens(self):
        return set(tokenizer.tokenize(self.name_string))

class gradcafeRecord(rltk.Record):
    def __init__(self, raw_object):
        super().__init__(raw_object)
        self.name = ''

    @rltk.cached_property
    def id(self):
        return self.raw_object['ID']

    @rltk.cached_property
    def name_string(self):
        return self.raw_object['university']
    
    @rltk.cached_property
    def name_tokens(self):
        return set(tokenizer.tokenize(self.name_string))

In [7]:
usnews_file = 'python/usnews.csv'
gradcafe_file = 'gradcafe.csv'

ds1 = rltk.Dataset(rltk.CSVReader(usnews_file), record_class=usnewsRecord)
ds2 = rltk.Dataset(rltk.CSVReader(gradcafe_file), record_class=gradcafeRecord)

df1 = ds1.generate_dataframe()
df2 = ds2.generate_dataframe()

The cell above loads the two CSV files into RLTK datasets using `rltk.CSVReader`.

We can now inspect a few entries:

In [8]:
print(ds1.generate_dataframe().head(5))

  id                            name_string  \
0  0                   Princeton University   
1  1                    Columbia University   
2  2                     Harvard University   
3  3  Massachusetts Institute of Technology   
4  4                        Yale University   

                                  name_tokens  
0                     {University, Princeton}  
1                      {University, Columbia}  
2                       {University, Harvard}  
3  {Massachusetts, Technology, Institute, of}  
4                          {University, Yale}  


In [9]:
print(ds2.generate_dataframe().head(5))

  id                      name_string                           name_tokens
0  1        Michigan State University         {University, Michigan, State}
1  2         Johns Hopkins University          {Hopkins, Johns, University}
2  3  Stevens Institute Of Technology  {Of, Stevens, Technology, Institute}
3  4                             UIUC                                {UIUC}
4  5       Carnegie Mellon University        {University, Mellon, Carnegie}


## Blocking

In [10]:
bg = rltk.HashBlockGenerator()
block = bg.generate(
    bg.block(ds1, property_='name_string'),
    bg.block(ds2, property_='name_string')
)
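To make the effect of hash blocking concrete, here is a minimal pure-Python sketch (not RLTK's implementation) that groups records from two lists by a key and keeps only the pairs that share a key. The helper `hash_block_pairs` and the sample names are hypothetical. Because the key used above is the full `name_string`, only exactly identical names land in the same block; a looser key (here, the first lowercase token) keeps more candidate pairs.

```python
from collections import defaultdict

def hash_block_pairs(left, right, key=lambda s: s):
    """Return candidate pairs (l, r) whose blocking keys are equal."""
    # Each block holds the left-side items and right-side items sharing one key
    blocks = defaultdict(lambda: ([], []))
    for item in left:
        blocks[key(item)][0].append(item)
    for item in right:
        blocks[key(item)][1].append(item)

    pairs = []
    for lefts, rights in blocks.values():
        for l in lefts:
            for r in rights:
                pairs.append((l, r))
    return pairs

# Hypothetical sample data
left_names = ["Yale University", "Harvard University"]
right_names = ["Yale University", "UIUC", "Harvard Univ."]

# Key = full string: only identical names are paired
exact_pairs = hash_block_pairs(left_names, right_names)

# Key = first lowercase token: a looser block that also pairs the Harvard variants
token_pairs = hash_block_pairs(
    left_names, right_names, key=lambda s: s.lower().split()[0]
)

print(exact_pairs)
print(token_pairs)
```

The choice of blocking key is a trade-off: a strict key removes many comparisons but can also discard true matches before the similarity functions ever see them.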

## Field (Attribute) Similarity

In [11]:
def name_string_similarity_1(r1, r2):
    s1 = r1.name_string.lower()
    s2 = r2.name_string.lower()
    
    return rltk.jaro_winkler_similarity(s1, s2)
    
def name_string_similarity_2(r1, r2):
    s1 = r1.name_string.lower()
    s2 = r2.name_string.lower()   
    if s1 == s2:
        return 1 
    return 0
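As a rough illustration of how an exact-match measure and a fuzzy measure behave differently, the sketch below compares the two on a few name pairs. It uses `difflib.SequenceMatcher` from the standard library as a stand-in fuzzy measure, which is an assumption made for the example and is not the same metric as `rltk.jaro_winkler_similarity`. The helper names and sample pairs are hypothetical.

```python
from difflib import SequenceMatcher

def exact_similarity(a, b):
    """1.0 if the lowercased strings are identical, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0

def ratio_similarity(a, b):
    """Fuzzy similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical name pairs: identical, case variant, typo, unrelated
pairs = [
    ("Yale University", "Yale University"),
    ("Stevens Institute of Technology", "Stevens Institute Of Technology"),
    ("Carnegie Mellon University", "Carnegie Melon University"),
    ("UIUC", "Princeton University"),
]

for a, b in pairs:
    print(f"{a!r} vs {b!r}: "
          f"exact={exact_similarity(a, b)}, ratio={ratio_similarity(a, b):.2f}")
```

The typo pair scores 0 on the exact measure but close to 1 on the fuzzy one, which is the kind of case a fuzzy measure is meant to recover. Abbreviations such as `UIUC` are not helped by either measure.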

## Entity Linking

In [12]:
# threshold value to determine if we are confident the records match
MY_TRESH = 0.8 # this number is just an example, you need to change it

# entity linkage scoring function
def rule_based_method(r1, r2):

#     isbn_score = isbn_similarity_1(r1, r2)
    name_score_1 = name_string_similarity_1(r1, r2)
    name_score_2 = name_string_similarity_2(r1, r2)
    
    total = (name_score_1 * 0.5 + name_score_2 * 0.5)
    
    # return two values: boolean if they match or not, float to determine confidence
    return total > MY_TRESH, total
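It is worth checking the arithmetic of the weighting above. Both similarity scores lie in [0, 1], so with equal weights of 0.5 a pair that fails the exact match can score at most 0.5, which is below the example threshold of 0.8. In other words, with these example values the rule behaves like a plain exact match. The sketch below reproduces the calculation with a hypothetical helper `combined_score`; the alternative weights at the end are only an illustration, not a recommendation.

```python
def combined_score(fuzzy, exact, w_fuzzy=0.5, w_exact=0.5):
    """Weighted sum of a fuzzy score in [0, 1] and an exact-match score in {0, 1}."""
    return fuzzy * w_fuzzy + exact * w_exact

THRESHOLD = 0.8  # same example value as MY_TRESH

# Best possible score when the names are NOT identical (exact = 0)
best_non_exact = combined_score(fuzzy=1.0, exact=0)

# Score when the names are identical (fuzzy similarity is then 1.0 as well)
identical = combined_score(fuzzy=1.0, exact=1)

print(best_non_exact, best_non_exact > THRESHOLD)
print(identical, identical > THRESHOLD)

# Illustrative alternative: weight the fuzzy score more heavily
near_match = combined_score(fuzzy=0.95, exact=0, w_fuzzy=0.9, w_exact=0.1)
print(near_match > THRESHOLD)
```

When tuning the threshold, the weights and the threshold have to be chosen together, otherwise one of the features can never influence the decision.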

## EL Evaluation

Evaluation is a built-in module for benchmarking. Let's load our development set.

In [13]:
dev = pd.read_csv('dev.csv')
# test = pd.read_csv("test.csv")

In [14]:
gradcafe

Unnamed: 0,ID,university,major,degree,season,decision,decision_method,decision_date,gpa,gre_verbal,gre_quant,gre_writing,status,created_at,comment
0,1,Michigan State University,Mechanical Engineering,PhD,,Rejected,Website,19-Nov-21,3.63,,,,I,19-Nov-21,I am rejecting without mentioning any reason I...
1,2,Johns Hopkins University,Mechanical Engineering,PhD,,Rejected,E-mail,17-Nov-21,,,,,U,17-Nov-21,Department resubmitted my application for Fall...
2,3,Stevens Institute Of Technology,Mechanical Engineering,PhD,,Rejected,E-mail,16-Nov-21,3.85,156.0,161.0,4.0,I,16-Nov-21,"""...We have reviewed your application for a Ph..."
3,4,UIUC,Civil And Environmental Engineering,Masters,,Accepted,E-mail,5-May-21,3.42,155.0,162.0,3.5,I,14-Nov-21,Applied for PhD but accepted for non thesis ma...
4,5,Carnegie Mellon University,Electrical And Computer Engineering,PhD,,Accepted,E-mail,11-Nov-21,3.50,151.0,166.0,3.5,U,11-Nov-21,Oh my god. It is unbelievable.\nI am CMU MS EC...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49738,49745,Clemson University,Computer Engineering,PhD,F15,Rejected,E-mail,29-Jan-15,3.86,147.0,167.0,3.5,I,29-Jan-15,report spam\nreply
49739,49746,"University Of California, San Diego",Computer Science And Engineering,Masters,F15,Accepted,E-mail,29-Jan-15,,170.0,154.0,3.0,I,29-Jan-15,Non-CS undergrad. One paper in preparation. On...
49740,49747,University Of Texas Austin,Mechanical Engineering,PhD,F15,Rejected,Website,28-Jan-15,,,,,I,29-Jan-15,"1 Journal Publication,and another close to Pub..."
49741,49748,NC State,Nuclear Engineering,PhD,F15,Accepted,E-mail,29-Jan-15,3.64,162.0,164.0,4.5,A,29-Jan-15,report spam\nreply


In [15]:
ls = []
for i in gradcafe.iterrows():
    for j in usnews.iterrows():
        combination = [str(i[1]['ID']), str(j[1]['ID'])]
        ls.append(combination)

In [16]:
len(ls)

7461450
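The nested loops above enumerate the full cross product of the two tables, which is why the list has 49743 × 150 = 7461450 entries. A compact way to express the same enumeration is `itertools.product`; the sketch below uses small hypothetical ID lists in place of the real DataFrames.

```python
from itertools import product

# Hypothetical stand-ins for the 'ID' columns of gradcafe and usnews
gradcafe_ids = ["1", "2", "3"]
usnews_ids = ["0", "1"]

# Every [gradcafe ID, usnews ID] combination, in the same order as the nested loops
all_pairs = [list(pair) for pair in product(gradcafe_ids, usnews_ids)]

print(all_pairs[:3])
print(len(all_pairs) == len(gradcafe_ids) * len(usnews_ids))

# The same size calculation for the real tables
print(49743 * 150)
```

Since the number of pairs grows with the product of the table sizes, this exhaustive comparison is exactly the cost that blocking is intended to reduce.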

And now build a ground truth based on the development set

In [17]:
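gt = rltk.GroundTruth()
# NOTE: each row of `ls` is [gradcafe ID, usnews ID] with no label column, so
# row[-1] below is the usnews ID rather than a match label. The labels are
# meant to come from the development set (`dev`).
for row in ls:
    r2 = ds2.get_record(row[0])  # gradcafe record
    r1 = ds1.get_record(row[1])  # usnews record
    if row[-1] == '1':
        gt.add_positive(r1.raw_object['ID'], r2.raw_object['ID'])
    else:
        gt.add_negative(r1.raw_object['ID'], r2.raw_object['ID'])

#gt.generate_all_negatives(ds1, ds2, range_in_gt=True)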

Let's run some candidates using the ground truth

In [18]:
trial = rltk.Trial(gt)
#candidate_pairs = rltk.get_record_pairs(ds1, ds2, ground_truth=gt)
#for r1, r2 in candidate_pairs:
for row in ls:
    r1 = ds1.get_record(row[1])  # usnews record
    r2 = ds2.get_record(row[0])  # gradcafe record
    result, confidence = rule_based_method(r1, r2)
    trial.add_result(r1, r2, result, confidence)
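Once predictions and labels are available for the same pairs, the usual summary metrics are precision, recall and F1. The sketch below computes them by hand on a tiny hypothetical example so the definitions are explicit; it is not RLTK's evaluation code, and the function name and sample pairs are made up for illustration.

```python
def precision_recall_f1(predicted, truth):
    """Compute metrics from two dicts mapping (id1, id2) -> bool."""
    tp = sum(1 for pair, p in predicted.items() if p and truth.get(pair, False))
    fp = sum(1 for pair, p in predicted.items() if p and not truth.get(pair, False))
    fn = sum(1 for pair, t in truth.items() if t and not predicted.get(pair, False))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical labels and predictions for three candidate pairs
truth = {("0", "7"): True, ("1", "8"): True, ("2", "9"): False}
predicted = {("0", "7"): True, ("1", "8"): False, ("2", "9"): True}

precision, recall, f1 = precision_recall_f1(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here one true match is found, one is missed and one non-match is wrongly linked, so all three metrics come out at 0.5.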

### Save Test predictions
You will be evaluated on dev and test predictions, over a hidden ground truth.

In [19]:
# Note: the [1:] slice drops the first pair (['1', '0']), so no prediction is made for it
test = ls[1:]
print(test[:2])

[['1', '1'], ['1', '2']]


In [20]:
predictions = []
for id1, id2 in test: 
    r1 = ds1.get_record(id2)  # usnews record
    r2 = ds2.get_record(id1)  # gradcafe record
    result, confidence = rule_based_method(r1, r2)
    predictions.append((r1.id, r2.id, result, confidence))

In [21]:
len(predictions), len(ds1.generate_dataframe()), len(ds2.generate_dataframe())

(7461449, 150, 49743)

In [22]:
with open('Saurabh_Jain_predictions.csv', mode='w', newline='') as file:
    writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in predictions:
        writer.writerow(row)
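A quick way to sanity-check the saved format is to write a few rows and read them back. The sketch below does this in memory with `io.StringIO` and hypothetical prediction tuples. Note that the `csv` module stores everything as text, so the boolean and the confidence need to be parsed explicitly when the file is read again.

```python
import csv
import io

# Hypothetical predictions in the same (id1, id2, result, confidence) layout
predictions = [("0", "1", False, 0.35), ("3", "4", True, 1.0)]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerows(predictions)

# Read the rows back and convert the text fields to their original types
buffer.seek(0)
restored = [
    (id1, id2, flag == "True", float(conf))
    for id1, id2, flag, conf in csv.reader(buffer)
]

print(restored == predictions)
```

Comparing the flag against the string `"True"` avoids the pitfall that `bool("False")` evaluates to `True` in Python.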