# Introduction
This IPython notebook explains a basic workflow two tables using py_entitymatching. The goal is to come up with a workflow to match books from Goodreads and Amazon. Specifically, we want to maximize F1. The datasets contain information about the books.

First, we need to import py_entitymatching package and other libraries as follows:

In [90]:
import py_entitymatching as em
import pandas as pd
import os
import sys
from timeit import default_timer as timer

In [91]:
# Display the versions
print('python version: ' + sys.version )
print('pandas version: ' + pd.__version__ )
print('magellan version: ' + em.__version__ )

python version: 3.5.2 (default, Sep 14 2017, 22:51:06) 
[GCC 5.4.0 20160609]
pandas version: 0.20.3
magellan version: 0.3.0


Matching two tables typically consists of the following three steps:

1. Reading the input tables

2. Blocking the input tables to get a candidate set

3. Matching the tuple pairs in the candidate set

## Read input tables

In [92]:
source1 = 'source1_cleaned.csv'
source2 = 'source2_cleaned.csv'

# Read the data
A = em.read_csv_metadata(source1)
B = em.read_csv_metadata(source2)



In [93]:
# Set the metadata
em.set_key(A, 'ID')
em.set_key(B, 'ID')

True

In [94]:
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))

Number of tuples in A: 3387
Number of tuples in B: 3001
Number of tuples in A X B (i.e the cartesian product): 10164387


In [95]:
A.head(2)

Unnamed: 0,ID,Name,Author,Publisher,Publishing_Date,Format,Pages,Rating
0,0,Age of Myth: Book One of The Legends of the First Empire,Michael J. Sullivan,Del Rey,2017-1-31,Paperback,464.0,4.5
1,1,Rise of the Dragons (Kings and Sorcerers--Book 1),Morgan Rice,Morgan Rice,2017-8-4,Hardcover,217.0,4.1


In [96]:
B.head(2)

Unnamed: 0,ID,Name,Author,Publisher,Publishing_Date,Format,Pages,Rating
0,0,Brides of Fantasy,Vanilla Orchid Books,,,Kindle Edition,,0.0
1,1,The Italian Secretary: A Further Adventure Of Sherlock Holmes,Caleb Carr,Sphere,2015-11-27,Paperback,288.0,3.19


In [97]:
# Display the keys of the input tables
em.get_key(A), em.get_key(B)

('ID', 'ID')

Here we will proceed without downsampling the datasets and use the entire dataset. 

## Block tables to get candidate set
Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the input tables. This would reduce the number of tuple pairs considered for matching.

### Rule Based Blocker
We first get the tokenizers and the similarity functions and then get the attribute correspondence for the two tables. 

We then define the following rules:
1. For a tuple pair, if the Levenshtein similarity for the **Name** attribute is less than 0.275, block them.
2. For a tuple pair, if the Jaccard similarity for the **Author** attribute is less than 0.5, block them.

#### Define the rules

In [98]:
# Rule-Based blocker 
rb0 = em.RuleBasedBlocker()
block_t = em.get_tokenizers_for_blocking()
block_s = em.get_sim_funs_for_blocking()
block_c = em.get_attr_corres(A, B)
atypes_A = em.get_attr_types(A)
atypes_B = em.get_attr_types(B)

block_f = em.get_features(A, B, atypes_A, atypes_B, block_c, block_t, block_s)

# add rule for book names : block tuples if Levenshtein Similarity is below 0.275
rb0.add_rule(['Name_Name_lev_sim(ltuple, rtuple) < 0.275'], block_f)

# add rule for authors : block tuples if Jaccard Similarity is below 0.5 in spaces delimited tokens
rb0.add_rule(['Author_Author_jac_dlm_dc0_dlm_dc0(ltuple, rtuple) < 0.5'], block_f)

'_rule_1'

#### Perform Blocking

In [99]:
start = timer()
C0 = rb0.block_tables(A, B,
                    l_output_attrs=['ID', 'Name', 'Author', 'Publisher', 'Publishing_Date', 'Format', 'Pages', 'Rating'], 
                    r_output_attrs=['ID', 'Name', 'Author', 'Publisher', 'Publishing_Date', 'Format', 'Pages', 'Rating'],
                     show_progress=False)

end = timer()
print("Time taken : " + str(end - start))

Time taken : 6.324349999998958


In [100]:
print(len(C0))

2754


In [101]:
C0.head(2)

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_Name,ltable_Author,ltable_Publisher,ltable_Publishing_Date,ltable_Format,ltable_Pages,ltable_Rating,rtable_Name,rtable_Author,rtable_Publisher,rtable_Publishing_Date,rtable_Format,rtable_Pages,rtable_Rating
14,0,1281,22,"The Magazine of Fantasy and Science Fiction, July 1969 (Volume 37, No. 1)",Fritz Leiber,Mercury Press,1969-0-0,Hardcover,,,"Gather, Darkness! (Nucleus Fantasy & Science Fiction)",Fritz Leiber,Collier Books,1992-12-01,Paperback,240.0,3.64
15,1,204,23,Beyond My Control: Forbidden Fantasies in an Uncensored Age,Nancy Friday,Sourcebooks,2009-4-1,Kindle,288.0,3.2,Men in Love: Men's Sexual Fantasies: The Triumph of Love Over Rage,Nancy Friday,Dell,1982-12-15,Mass Market Paperback,544.0,3.7


#### Overlap Blocker
We now apply the overlap blocker to the candidate set obtained in the previous step. Since the entity we are dealing with is books, there are quite a few stopwords present in the book names, such as "The", "Of", "And" etc. Hence we will remove these stopwords by setting the <i>rem_stop_words</i> to _True_ and then perform overlap blocking with the size set to 1.

We apply overlap blocking to the following attributes:
1. Book Names
2. Book Authors

In [102]:
start = timer()

# Overlap blocker
overlapBlocker = em.OverlapBlocker()
overlapBlocker.stop_words.append('of')
C1 = overlapBlocker.block_candset(C0, 'Name', 'Name', word_level=True, overlap_size=1, allow_missing=True, show_progress=False, rem_stop_words=True)

C1 = overlapBlocker.block_candset(C1, 'Author', 'Author', word_level=True, overlap_size=1, allow_missing=True, show_progress=False, rem_stop_words=True)

end = timer()
print("Time taken : " + str(end - start))

Time taken : 0.17012899999826914


In [103]:
print(len(C1))

1092


In [104]:
C1.head(2)

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_Name,ltable_Author,ltable_Publisher,ltable_Publishing_Date,ltable_Format,ltable_Pages,ltable_Rating,rtable_Name,rtable_Author,rtable_Publisher,rtable_Publishing_Date,rtable_Format,rtable_Pages,rtable_Rating
14,0,1281,22,"The Magazine of Fantasy and Science Fiction, July 1969 (Volume 37, No. 1)",Fritz Leiber,Mercury Press,1969-0-0,Hardcover,,,"Gather, Darkness! (Nucleus Fantasy & Science Fiction)",Fritz Leiber,Collier Books,1992-12-01,Paperback,240.0,3.64
15,1,204,23,Beyond My Control: Forbidden Fantasies in an Uncensored Age,Nancy Friday,Sourcebooks,2009-4-1,Kindle,288.0,3.2,Men in Love: Men's Sexual Fantasies: The Triumph of Love Over Rage,Nancy Friday,Dell,1982-12-15,Mass Market Paperback,544.0,3.7


## Debug blocker output
The number of tuple pairs considered for matching is reduced to 1092 (from 10164387), but we would want to make sure that the blocker did not drop any potential matches.

In [105]:
# Debug blocker output
dbg = em.debug_blocker(C1, A, B, output_size=200)

In [106]:
dbg.head(3)

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_Name,ltable_Author,ltable_Publisher,ltable_Publishing_Date,ltable_Format,rtable_Name,rtable_Author,rtable_Publisher,rtable_Publishing_Date,rtable_Format
0,0,457,285,Final Fantasy X-X2 HD Remaster Official Strategy Guide,BradyGames,BRADY GAMES,2014-3-18,Hardcover,Final Fantasy VII: Official Strategy Guide,David Cassady,Bradygames,1998-06-12,Paperback
1,1,457,2719,Final Fantasy X-X2 HD Remaster Official Strategy Guide,BradyGames,BRADY GAMES,2014-3-18,Hardcover,FINAL FANTASY X Official Strategy Guide,Dan Birlew,BradyGames,2001-12-17,Paperback
2,2,457,2378,Final Fantasy X-X2 HD Remaster Official Strategy Guide,BradyGames,BRADY GAMES,2014-3-18,Hardcover,Final Fantasy VIII Official Strategy Guide,David Cassady,BradyGames,1999-08-31,Paperback


We can see here that we already have some matches. Since the number of matches has dropped to just 1092 from 10164387, we decided to stop debugging the blocking step and proceed with training a matcher.

In [107]:
# Saving the tuples which survived the blocking step
C1.to_csv("TuplesAfterBlocking.csv", encoding='utf-8', index=False)

## Labeling the candidate set
We labeled the tuples from the previous step as a match or not. 1 indicates a match and 0 indicates a non match. We did not use the <i>label_table</i> function.

We sample 500 tuple pairs for labeling, from the 1092 obtained after blocking.

In [108]:
# Sample 500 tuples for labeling
S = em.sample_table(C1, 500)

# Save this for labeling
S.to_csv('TuplesForLabeling.csv', encoding='utf-8', index=False)

Labeling 1092 tuples took roughly 45 minutes.

In [109]:
# Load the golden data
S = em.read_csv_metadata('TuplesForLabeling_cleaned.csv', key='_id', ltable=A, rtable=B, 
                         fk_ltable='ltable_ID', fk_rtable='rtable_ID')



Samples from the golden data; The last column **match** indicates the labels we've added.

In [128]:
S.head(3)

Unnamed: 0,_id,ltable_ID,rtable_ID,ltable_Name,ltable_Author,ltable_Publisher,ltable_Format,ltable_Pages,ltable_Rating,rtable_Name,rtable_Author,rtable_Publisher,rtable_Format,rtable_Pages,rtable_Rating,match
0,51,5,478,The Fantasy Baseball Black Book 2018 (Fantasy Black Book),Joe Pisapia,Independently published,Kindle,157.0,4.6,The Fantasy Baseball Black Book 2017 Edition (Fantasy Black Book 10),Joe Pisapia,,Kindle Edition,182.0,3.88,0
1,378,13,2888,Grimgar of Fantasy and Ash (Light Novel) Vol. 1,Ao Jyumonji,Seven Seas,Paperback,280.0,4.3,"Grimgar of Fantasy and Ash, Vol. 1",Ao Jyumonji,Yen Press,Paperback,224.0,3.43,1
2,309,13,2615,Grimgar of Fantasy and Ash (Light Novel) Vol. 1,Ao Jyumonji,Seven Seas,Paperback,280.0,4.3,Grimgar of Fantasy and Ash: Volume 3,Ao Jyumonji,J-Novel Club,Kindle Edition,280.0,4.26,0


## Splitting the labeled data into development and evaluation set
In this step, we split the labeled data into two sets: development (I) and evaluation (J). Specifically, the development set is used to come up with the best learning-based matcher and the evaluation set used to evaluate the selected matcher on unseen data.

In [110]:
# Split S into development set (I) and evaluation set (J)
IJ = em.split_train_test(S, train_proportion=0.7, random_state=42)
I = IJ['train']
J = IJ['test']

In [111]:
len(I), len(J)

(350, 150)

### Save Set I and Set J

In [112]:
I.to_csv("SetI.csv", encoding='utf-8', index=False)
J.to_csv("SetJ.csv", encoding='utf-8', index=False)

## Selecting the best learning-based matcher
Selecting the best learning-based matcher typically involves the following steps:

1. Creating a set of learning-based matchers
2. Creating features
3. Converting the development set into feature vectors
4. Selecting the best learning-based matcher using k-fold cross validation

### Creating a set of learning-based matchers

Here, we tuned the hyperparameters a bit so that they are more relavent to our scenario.

In [113]:
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0, criterion='gini', class_weight='balanced')
svm = em.SVMMatcher(name='SVM', kernel='linear', random_state=0)
rf = em.RFMatcher(name='RF', n_estimators=50, criterion='gini', class_weight='balanced', random_state=0)
lg = em.LogRegMatcher(name='LogReg', penalty='l2', class_weight='balanced', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher(name='NaiveBayes')

### Creating features
Here we use the automatically generated features

In [114]:
# Generate features
feature_table = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)

In [115]:
# List the names of the features generated
feature_table['feature_name']

0                                           ID_ID_exm
1                                           ID_ID_anm
2                                      ID_ID_lev_dist
3                                       ID_ID_lev_sim
4                           Name_Name_jac_qgm_3_qgm_3
5                       Name_Name_cos_dlm_dc0_dlm_dc0
6                                       Name_Name_mel
7                                  Name_Name_lev_dist
8                                   Name_Name_lev_sim
9                       Author_Author_jac_qgm_3_qgm_3
10                  Author_Author_cos_dlm_dc0_dlm_dc0
11                  Author_Author_jac_dlm_dc0_dlm_dc0
12                                  Author_Author_mel
13                             Author_Author_lev_dist
14                              Author_Author_lev_sim
15                                  Author_Author_nmw
16                                   Author_Author_sw
17                Publisher_Publisher_jac_qgm_3_qgm_3
18            Publisher_Publ

### Dropping Features

We remove a few features from the generated set of features. The reasoning is as follows:

Consider the **Publisher** attribute. While labeling the true matches, we marked a tuple pair as a true match even if the publishers did not match. The same book is usually sold in different countries under different publishers and hence though the publishers might differ, the book still refers to the same real world object. Hence we do not consider the **Publisher** attribute as a feature, as they might differ for a match. The same reasoning is extended to **Pages** and **Rating** attributes as well.

In [116]:
# Drop publishing date, rating related features
feature_table = feature_table.drop([0,1,2,3,17,18,19,20,21,22,23,24,25,30,34,35,36,37,38])

### Converting the development set to feature vectors

In [117]:
# Convert the I into a set of feature vectors using F
H = em.extract_feature_vecs(I, 
                            feature_table=feature_table, 
                            attrs_after='match',
                            show_progress=False)

#### Check for missing values

In [118]:
H.isnull().sum()

_id                                          0
ltable_ID                                    0
rtable_ID                                    0
Name_Name_jac_qgm_3_qgm_3                    0
Name_Name_cos_dlm_dc0_dlm_dc0                0
Name_Name_mel                                0
Name_Name_lev_dist                           0
Name_Name_lev_sim                            0
Author_Author_jac_qgm_3_qgm_3              234
Author_Author_cos_dlm_dc0_dlm_dc0          234
Author_Author_jac_dlm_dc0_dlm_dc0          234
Author_Author_mel                          234
Author_Author_lev_dist                     234
Author_Author_lev_sim                      234
Author_Author_nmw                          234
Author_Author_sw                           234
Publishing_Date_Publishing_Date_lev_sim     36
Publishing_Date_Publishing_Date_jar         36
Publishing_Date_Publishing_Date_jwn         36
Publishing_Date_Publishing_Date_exm         36
Pages_Pages_exm                             49
Pages_Pages_a

#### Impute missing values with mean

In [119]:
H = em.impute_table(H, 
                exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'],
                strategy='mean')

### Selecting the best matcher using cross-validation
Now, we select the best matcher using k-fold cross-validation. We use five fold cross validation and use 'precision' and 'recall' metric to select the best matcher.

In [120]:
# Select the best ML matcher using CV
start = timer()
result = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H, 
        exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'],
        k=5,
        target_attr='match', metric_to_select_matcher='f1', random_state=42)
end = timer()
print(end-start)

9.62093399999867


In [121]:
print(result['cv_stats'])

        Matcher  Average precision  Average recall  Average f1
0  DecisionTree           0.592857        0.590476    0.564267
1            RF           0.966667        0.612381    0.707459
2           SVM           0.852381        0.590476    0.637121
3        LinReg           0.893333        0.566667    0.675556
4        LogReg           0.595311        0.886667    0.690131
5    NaiveBayes           0.609286        0.824762    0.691784


In [122]:
print(result['drill_down_cv_stats']['f1'])

           Name  \
0  DecisionTree   
1            RF   
2           SVM   
3        LinReg   
4        LogReg   
5    NaiveBayes   

                                                                            Matcher  \
0          <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x7f9d7c444a58>   
1          <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x7f9d7c444a20>   
2        <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x7f9d7c444978>   
3  <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x7f9d7bd83d30>   
4  <py_entitymatching.matcher.logregmatcher.LogRegMatcher object at 0x7f9d7c4444e0>   
5          <py_entitymatching.matcher.nbmatcher.NBMatcher object at 0x7f9d7bd83550>   

   Num folds    Fold 1    Fold 2    Fold 3    Fold 4    Fold 5  Mean score  
0          5  0.666667  0.714286  0.625000  0.615385  0.200000    0.564267  
1          5  0.909091  0.833333  0.461538  0.833333  0.500000    0.707459  
2          5  0.833333  0

#### As seen here, the random forest classifier has a precision of over 90% (96.66%) and has a recall of 61.23%. It also has the highest F1 score. Hence we do not debug further and proceed to use the random forest classifier on the test set.

## Evaluating the matching output
Evaluating the matching outputs for the evaluation set typically involves the following four steps:

1. Converting the evaluation set to feature vectors
2. Training matcher using the feature vectors extracted from the development set
3. Predicting the evaluation set using the trained matcher
4. Evaluating the predicted matches

### Converting the evaluation set to feature vectors
As before, we convert to the feature vectors (using the feature table and the evaluation set)

In [123]:
# Testing
# Convert J into a set of feature vectors using F
L = em.extract_feature_vecs(J, feature_table=feature_table,
                            attrs_after='match', show_progress=False)

### Impute the missing values in the test set

In [124]:
# Impute missing values
L = em.impute_table(L, 
                exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'],
                strategy='mean')

### Training the selected matcher
Now, we train the matcher using all of the feature vectors from the development set. Here, we use random forest as the selected matcher.

In [125]:
# Train using feature vectors from I 
rf.fit(table=L, 
       exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'], 
       target_attr='match')

### Predicting the matches
Next, we predict the matches for the evaluation set (using the feature vectors extracted from it).

In [126]:
# Predict on L 
predictions = rf.predict(table=L, exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'], 
              append=True, target_attr='predicted', inplace=False)

### Evaluating the predictions
Finally, we evaluate the accuracy of predicted outputs

In [127]:
# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'match', 'predicted')
em.print_eval_summary(eval_result)

Precision : 100.0% (12/12)
Recall : 100.0% (12/12)
F1 : 100.0%
False positives : 0 (out of 12 positive predictions)
False negatives : 0 (out of 138 negative predictions)


### Evaluation on all learning methods
Here we see how the other 5 learning methods perform on the test set.

**Decision Tree**

In [133]:
# Train using feature vectors from I 
dt.fit(table=L, 
       exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'], 
       target_attr='match')

# Predict on L 
predictions = dt.predict(table=L, exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'], 
              append=True, target_attr='predicted', inplace=False)

# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'match', 'predicted')
em.print_eval_summary(eval_result)

Precision : 100.0% (12/12)
Recall : 100.0% (12/12)
F1 : 100.0%
False positives : 0 (out of 12 positive predictions)
False negatives : 0 (out of 138 negative predictions)


**Support Vector Machines**

In [134]:
# Train using feature vectors from I 
svm.fit(table=L, 
       exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'], 
       target_attr='match')

# Predict on L 
predictions = svm.predict(table=L, exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'], 
              append=True, target_attr='predicted', inplace=False)

# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'match', 'predicted')
em.print_eval_summary(eval_result)

Precision : 88.89% (8/9)
Recall : 66.67% (8/12)
F1 : 76.19%
False positives : 1 (out of 9 positive predictions)
False negatives : 4 (out of 141 negative predictions)


**Logisitic Regression**

In [135]:
# Train using feature vectors from I 
lg.fit(table=L, 
       exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'], 
       target_attr='match')

# Predict on L 
predictions = lg.predict(table=L, exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'], 
              append=True, target_attr='predicted', inplace=False)

# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'match', 'predicted')
em.print_eval_summary(eval_result)

Precision : 57.89% (11/19)
Recall : 91.67% (11/12)
F1 : 70.97%
False positives : 8 (out of 19 positive predictions)
False negatives : 1 (out of 131 negative predictions)


**Linear Regression**

In [136]:
# Train using feature vectors from I 
ln.fit(table=L, 
       exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'], 
       target_attr='match')

# Predict on L 
predictions = ln.predict(table=L, exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'], 
              append=True, target_attr='predicted', inplace=False)

# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'match', 'predicted')
em.print_eval_summary(eval_result)

Precision : 90.0% (9/10)
Recall : 75.0% (9/12)
F1 : 81.82%
False positives : 1 (out of 10 positive predictions)
False negatives : 3 (out of 140 negative predictions)


**Naive Bayes**

In [137]:
# Train using feature vectors from I 
nb.fit(table=L, 
       exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'], 
       target_attr='match')

# Predict on L 
predictions = nb.predict(table=L, exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'], 
              append=True, target_attr='predicted', inplace=False)

# Evaluate the predictions
eval_result = em.eval_matches(predictions, 'match', 'predicted')
em.print_eval_summary(eval_result)

Precision : 50.0% (9/18)
Recall : 75.0% (9/12)
F1 : 60.0%
False positives : 9 (out of 18 positive predictions)
False negatives : 3 (out of 132 negative predictions)
