# Introduction
This IPython notebook explains a basic workflow for matching two tables using py_entitymatching. The goal is to come up with a workflow to match books from Goodreads and Amazon. Specifically, we want to maximize F1. The datasets contain book attributes such as name, author, publisher, publishing date, format, page count, and rating.

First, we import the py_entitymatching package and the other libraries we need:

In [1]:
import py_entitymatching as em
import pandas as pd
import os
import sys
from timeit import default_timer as timer

In [2]:
# Display the versions
print('python version: ' + sys.version )
print('pandas version: ' + pd.__version__ )
print('magellan version: ' + em.__version__ )

python version: 3.6.5 (default, Mar 30 2018, 06:42:10) 
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)]
pandas version: 0.22.0
magellan version: 0.3.0


Matching two tables typically consists of the following three steps:

1. Reading the input tables

2. Blocking the input tables to get a candidate set

3. Matching the tuple pairs in the candidate set

## Read input tables

In [3]:
source1 = 'source1_cleaned_2.csv'
source2 = 'source2_cleaned_2.csv'

# Read the data
A = em.read_csv_metadata(source1)
B = em.read_csv_metadata(source2)

Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.


In [4]:
# Set the metadata
em.set_key(A, 'ID')
em.set_key(B, 'ID')

True

In [5]:
print('Number of tuples in A: ' + str(len(A)))
print('Number of tuples in B: ' + str(len(B)))
print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))

Number of tuples in A: 3387
Number of tuples in B: 3001
Number of tuples in A X B (i.e the cartesian product): 10164387


In [6]:
A.head(2)

Unnamed: 0,ID,Name,Author,Publisher,Publishing_Date,Format,Pages,Rating
0,0,Age of Myth: Book One of The Legends of the First Empire,Michael J. Sullivan,Del Rey,1/31/17,Paperback,464.0,4.5
1,1,Rise of the Dragons (Kings and Sorcerers--Book 1),Morgan Rice,Morgan Rice,8/4/17,Hardcover,217.0,4.1


In [7]:
B.head(2)

Unnamed: 0,ID,Name,Author,Publisher,Publishing_Date,Format,Pages,Rating
0,0,Brides of Fantasy,Vanilla Orchid Books,,,Kindle Edition,,0.0
1,1,The Italian Secretary: A Further Adventure Of Sherlock Holmes,Caleb Carr,Sphere,11/27/15,Paperback,288.0,3.19


In [8]:
# Display the keys of the input tables
em.get_key(A), em.get_key(B)

('ID', 'ID')

Here we proceed without downsampling and use the entire datasets.

## Block tables to get candidate set
Before we do the matching, we would like to remove the obviously non-matching tuple pairs from the input tables. This would reduce the number of tuple pairs considered for matching.

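Here we first apply a rule-based blocker that drops a pair when the Levenshtein similarity of the book names is below 0.275 or the Jaccard similarity of the authors is below 0.5. We then apply an overlap blocker that keeps a pair only if the names, and then the authors, share at least one word after stop-word removal.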
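To make the idea of word-level overlap blocking concrete, here is a minimal pure-Python sketch. The helper names and the stop-word list are illustrative assumptions; this is not how `em.OverlapBlocker` is implemented, only the rule it applies in spirit.

```python
# Hypothetical helpers for illustration; not part of py_entitymatching.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to"}  # assumed list

def tokens(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop stop words."""
    return {w for w in text.lower().split() if w not in stop_words}

def survives_overlap_blocking(left, right, overlap_size=1):
    """Keep the pair only if the strings share at least overlap_size words."""
    return len(tokens(left) & tokens(right)) >= overlap_size

pairs = [
    ("Age of Myth", "The Age of Innocence"),   # shares "age"
    ("Age of Myth", "Brides of Fantasy"),      # shares only the stop word "of"
]
kept = [p for p in pairs if survives_overlap_blocking(*p)]
print(kept)
```

Without stop-word removal the second pair would survive on the word "of" alone, which is why the notebook adds 'of' to the blocker's stop words.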

In [None]:
start = timer()
# Rule-Based blocker [C0]
rb0 = em.RuleBasedBlocker()
block_t = em.get_tokenizers_for_blocking()
block_s = em.get_sim_funs_for_blocking()
block_c = em.get_attr_corres(A, B)
atypes_A = em.get_attr_types(A)
atypes_B = em.get_attr_types(B)

block_f = em.get_features(A, B, atypes_A, atypes_B, block_c, block_t, block_s)
# add rule for book names : block tuples if Levenshtein Similarity is below 0.275
rb0.add_rule(['Name_Name_lev_sim(ltuple, rtuple) < 0.275'], block_f)

# add rule for authors : block tuples if Jaccard Similarity is below 0.5 in space-delimited tokens
rb0.add_rule(['Author_Author_jac_dlm_dc0_dlm_dc0(ltuple, rtuple) < 0.5'], block_f)

# Output attributes must be columns of A and B (the tables have 'Publishing_Date', not 'Publishing_Year')
C0 = rb0.block_tables(A, B,
                    l_output_attrs=['ID', 'Name', 'Author', 'Publisher', 'Publishing_Date', 'Format', 'Pages', 'Rating'], 
                    r_output_attrs=['ID', 'Name', 'Author', 'Publisher', 'Publishing_Date', 'Format', 'Pages', 'Rating'])

end = timer()
print(end - start)
print(len(C0))
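The two rules in the cell above rely on a normalized Levenshtein similarity and a word-level Jaccard similarity. The sketch below shows one common way to compute both, assuming the similarity is 1 minus the edit distance divided by the longer string's length, and assuming a pair is dropped as soon as any one rule fires. The function names are hypothetical and the code is not taken from py_entitymatching.

```python
def levenshtein_distance(s, t):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def lev_sim(s, t):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(s, t) / longest

def jaccard_words(s, t):
    """Jaccard similarity over whitespace-delimited tokens."""
    a, b = set(s.split()), set(t.split())
    return 1.0 if not (a | b) else len(a & b) / len(a | b)

def is_blocked(left, right, name_threshold=0.275, author_threshold=0.5):
    """Drop the pair if either rule fires (thresholds taken from the notebook)."""
    return (lev_sim(left["Name"], right["Name"]) < name_threshold
            or jaccard_words(left["Author"], right["Author"]) < author_threshold)

left = {"Name": "Rise of the Dragons", "Author": "Morgan Rice"}
right = {"Name": "Rise of the Dragons (Kings and Sorcerers--Book 1)", "Author": "Morgan Rice"}
print(is_blocked(left, right))
```

Both thresholds are deliberately loose: a pair only needs a modest name similarity and at least half of the author tokens in common to survive.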

In [None]:
start = timer()

# Overlap blocker
overlapBlocker = em.OverlapBlocker()
overlapBlocker.stop_words.append('of')
# Keep only pairs whose names share at least one non-stop word
C1 = overlapBlocker.block_candset(C0, 'Name', 'Name', word_level=True, overlap_size=1,
                                  allow_missing=True, show_progress=False, rem_stop_words=True)

# Then keep only pairs whose authors share at least one non-stop word
C1 = overlapBlocker.block_candset(C1, 'Author', 'Author', word_level=True, overlap_size=1,
                                  allow_missing=True, show_progress=False, rem_stop_words=True)
end = timer()
print(end - start)

print(len(C1))

In [None]:
C1.head()

## Debug blocker output
Blocking reduces the number of tuple pairs considered for matching to 118876 (from 10164387), but we want to make sure that the blocker did not drop any potential matches.
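Conceptually, debugging a blocker means looking at pairs that were left out of the candidate set and ranking them by how similar they look. The brute-force sketch below illustrates that idea on toy records using a word-level Jaccard score; the names and data are hypothetical, and it does not reproduce how `em.debug_blocker` works internally.

```python
from itertools import product

def record_tokens(record):
    """All lowercase words across every attribute value of a record."""
    return {w for value in record.values() for w in str(value).lower().split()}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def likely_missed_pairs(table_a, table_b, candidate_ids, output_size=200):
    """Rank pairs NOT in the candidate set by similarity, highest first."""
    scored = []
    for (id_a, rec_a), (id_b, rec_b) in product(table_a.items(), table_b.items()):
        if (id_a, id_b) in candidate_ids:
            continue  # already a candidate, nothing to debug
        scored.append((jaccard(record_tokens(rec_a), record_tokens(rec_b)), id_a, id_b))
    scored.sort(reverse=True)
    return scored[:output_size]

# Toy tables keyed by ID (illustrative values only)
table_a = {0: {"Name": "Age of Myth", "Author": "Michael J. Sullivan"},
           1: {"Name": "Rise of the Dragons", "Author": "Morgan Rice"}}
table_b = {0: {"Name": "Age of Myth", "Author": "Michael Sullivan"},
           1: {"Name": "The Italian Secretary", "Author": "Caleb Carr"}}
candidate_ids = {(1, 1)}
missed = likely_missed_pairs(table_a, table_b, candidate_ids, output_size=2)
print(missed[0])
```

If highly similar pairs show up at the top of such a list, the blocking rules are probably too strict.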

In [None]:
# Debug blocker output
dbg = em.debug_blocker(C1, A, B, output_size=200)

In [None]:
dbg.head()

As we see here, an overlap factor of 3 on the name of the book leads to a lot of false matches. This is because book names contain a lot of stop words like "The" and "Of". Hence let's try attribute equivalence matching on the author of the book.

We will pass the candidate set obtained in the first step for attribute equivalence matching.
TODO: Change this

In [None]:
# Display the first five rows of C1
C1.head(5)

We can see here that we already have some matches. Since the candidate set has shrunk to just 333 tuple pairs from 10164387, we decided to stop debugging the blocking step and proceed with training a matcher.

In [None]:
# Saving the tuples which survived the blocking step
C1.to_csv("TuplesAfterBlocking_cleaned.csv", encoding='utf-8', index=False)

## Labeling the candidate set
We labeled the tuples from the previous step as a match or not: 1 indicates a match and 0 indicates a non-match. We did not use the <i>label_table</i> function.

Labeling 333 tuples took roughly 20 minutes.

In [None]:
# Sample 500 tuples for labeling
S = em.sample_table(C1, 500)

# Save this for labeling
S.to_csv('TuplesForLabeling_cleaned.csv', encoding='utf-8', index=False)

In [9]:
# Load the labeled set
S = em.read_csv_metadata('TuplesForLabeling_cleaned.csv', key='_id', ltable=A, rtable=B, 
                         fk_ltable='ltable_ID', fk_rtable='rtable_ID')

Metadata file is not present in the given path; proceeding to read the csv file.


## Splitting the labeled data into development and evaluation set
In this step, we split the labeled data into two sets: development (I) and evaluation (J). The development set is used to come up with the best learning-based matcher, and the evaluation set is used to evaluate the selected matcher on unseen data.
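As a rough illustration of what a 70/30 split does, the sketch below shuffles row indices with a fixed seed and cuts the list once. It uses only the standard library and a hypothetical function name; it will not select the same rows as `em.split_train_test`.

```python
import random

def split_rows(rows, train_proportion=0.7, random_state=42):
    """Shuffle a copy of rows reproducibly, then cut it into train and test."""
    rng = random.Random(random_state)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_proportion))
    return {"train": shuffled[:cut], "test": shuffled[cut:]}

labeled = list(range(500))          # stand-in for 500 labeled tuple pairs
parts = split_rows(labeled)
print(len(parts["train"]), len(parts["test"]))
```

Fixing the seed keeps the development and evaluation sets stable across reruns, so matcher comparisons stay comparable.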

In [10]:
# Split S into development set (I) and evaluation set (J)
IJ = em.split_train_test(S, train_proportion=0.7, random_state=42)
I = IJ['train']
J = IJ['test']

In [11]:
len(I), len(J)

(350, 150)

### Save Set I and Set J

In [12]:
I.to_csv("SetI.csv", encoding='utf-8', index=False)
J.to_csv("SetJ.csv", encoding='utf-8', index=False)

## Selecting the best learning-based matcher
Selecting the best learning-based matcher typically involves the following steps:

1. Creating a set of learning-based matchers
2. Creating features
3. Converting the development set into feature vectors
4. Selecting the best learning-based matcher using k-fold cross validation

### Creating a set of learning-based matchers

In [13]:
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0, criterion='gini', class_weight='balanced')
svm = em.SVMMatcher(name='SVM', kernel='linear', random_state=0)
rf = em.RFMatcher(name='RF', n_estimators=50, criterion='gini', class_weight='balanced', random_state=0)
lg = em.LogRegMatcher(name='LogReg', penalty='l2', class_weight='balanced', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
nb = em.NBMatcher(name='NaiveBayes')

### Creating features
Here we use the automatically generated features.

In [14]:
# Generate features
feature_table = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)

In [15]:
# List the names of the features generated
feature_table['feature_name']

0                                           ID_ID_exm
1                                           ID_ID_anm
2                                      ID_ID_lev_dist
3                                       ID_ID_lev_sim
4                           Name_Name_jac_qgm_3_qgm_3
5                       Name_Name_cos_dlm_dc0_dlm_dc0
6                                       Name_Name_mel
7                                  Name_Name_lev_dist
8                                   Name_Name_lev_sim
9                       Author_Author_jac_qgm_3_qgm_3
10                  Author_Author_cos_dlm_dc0_dlm_dc0
11                  Author_Author_jac_dlm_dc0_dlm_dc0
12                                  Author_Author_mel
13                             Author_Author_lev_dist
14                              Author_Author_lev_sim
15                                  Author_Author_nmw
16                                   Author_Author_sw
17                Publisher_Publisher_jac_qgm_3_qgm_3
18            Publisher_Publ
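The feature names encode the attribute pair followed by the similarity measure and, where relevant, the tokenizer: for example `jac` and `cos` are Jaccard and cosine, `lev_sim` is Levenshtein similarity, `exm` is exact match, and `anm` is absolute norm. As a hedged illustration, the sketch below gives plain-Python versions of three of them, assuming a set-based cosine over whitespace tokens and an absolute norm of 1 - |x - y| / max(|x|, |y|). These are re-implementations for intuition, not the library's code.

```python
import math

def cosine_words(left, right):
    """Set-based cosine over whitespace-delimited tokens."""
    a, b = set(left.split()), set(right.split())
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def abs_norm(x, y):
    """1.0 for equal numbers, approaching 0.0 as they diverge."""
    largest = max(abs(x), abs(y))
    return 1.0 if largest == 0 else 1.0 - abs(x - y) / largest

def exact_match(x, y):
    """1 if the two values are identical, else 0."""
    return 1 if x == y else 0

print(cosine_words("Age of Myth", "Age of Myth: Book One"))
print(abs_norm(530, 531))      # two page counts that differ by one
print(exact_match("Paperback", "Hardcover"))
```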

We drop the features derived from the **Rating** column, because books are rated differently on Amazon and Goodreads. The cell below also drops the ID features (indices 0 to 3 in the list above) and the other indices it lists.

In [16]:
# Drop publishing date, rating related features
feature_table = feature_table.drop([0,1,2,3,17,18,19,20,21,22,23,24,25,30,34,35,36,37,38])

### Converting the development set to feature vectors

In [17]:
# Convert I into a set of feature vectors using feature_table
H = em.extract_feature_vecs(I, 
                            feature_table=feature_table, 
                            attrs_after='match',
                            show_progress=False)

#### Check for missing values

In [18]:
H.isnull().sum()

_id                                          0
ltable_ID                                    0
rtable_ID                                    0
Name_Name_jac_qgm_3_qgm_3                    0
Name_Name_cos_dlm_dc0_dlm_dc0                0
Name_Name_mel                                0
Name_Name_lev_dist                           0
Name_Name_lev_sim                            0
Author_Author_jac_qgm_3_qgm_3              234
Author_Author_cos_dlm_dc0_dlm_dc0          234
Author_Author_jac_dlm_dc0_dlm_dc0          234
Author_Author_mel                          234
Author_Author_lev_dist                     234
Author_Author_lev_sim                      234
Author_Author_nmw                          234
Author_Author_sw                           234
Publishing_Date_Publishing_Date_lev_sim     36
Publishing_Date_Publishing_Date_jar         36
Publishing_Date_Publishing_Date_jwn         36
Publishing_Date_Publishing_Date_exm         36
Pages_Pages_exm                             49
Pages_Pages_a

#### Impute missing values with mean

In [19]:
H = em.impute_table(H, 
                exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'],
                strategy='mean')
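Mean imputation replaces each missing feature value with the average of the observed values in the same column. A minimal sketch on toy rows, with a hypothetical helper name and `None` standing in for a missing value:

```python
def impute_mean(rows, columns):
    """Return new rows where None in the given columns becomes the column mean."""
    means = {}
    for col in columns:
        observed = [r[col] for r in rows if r[col] is not None]
        if observed:                      # leave all-missing columns untouched
            means[col] = sum(observed) / len(observed)
    return [
        {k: (means[k] if v is None and k in means else v) for k, v in r.items()}
        for r in rows
    ]

toy = [
    {"_id": 1, "Author_Author_lev_sim": 1.0, "Pages_Pages_exm": 1.0},
    {"_id": 2, "Author_Author_lev_sim": None, "Pages_Pages_exm": 0.0},
    {"_id": 3, "Author_Author_lev_sim": 0.5, "Pages_Pages_exm": None},
]
filled = impute_mean(toy, ["Author_Author_lev_sim", "Pages_Pages_exm"])
print(filled[1]["Author_Author_lev_sim"], filled[2]["Pages_Pages_exm"])
```

With 234 of 350 author values missing in the development set, most author features end up at the column mean, which limits how much signal they can carry.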

### Selecting the best matcher using cross-validation
Now, we select the best matcher using k-fold cross-validation. We use five-fold cross-validation and select the best matcher by the 'f1' metric; precision and recall are reported alongside it.
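The mechanics of k-fold selection can be sketched without any library: split the rows into k folds, score each matcher on each held-out fold, and pick the matcher with the highest average score. The fold generator and the per-fold scores below are hypothetical (the scores are made-up toy numbers, not results from this notebook), and `em.select_matcher` may shuffle or stratify the folds differently.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, test_idx); each row lands in exactly one test fold."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size

def select_best(cv_scores):
    """cv_scores maps a matcher name to its per-fold F1 scores."""
    averages = {name: sum(s) / len(s) for name, s in cv_scores.items()}
    return max(averages, key=averages.get), averages

folds = list(k_fold_indices(350, k=5))
print([len(test) for _, test in folds])

toy_scores = {"A": [0.6, 0.7, 0.8, 0.7, 0.7], "B": [0.8, 0.7, 0.9, 0.8, 0.8]}
best, averages = select_best(toy_scores)
print(best)
```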

In [20]:
# Select the best ML matcher using CV

result = em.select_matcher([dt, rf, svm, ln, lg, nb], table=H, 
        exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'],
        k=5,
        target_attr='match', metric_to_select_matcher='f1', random_state=42)
print(result['cv_stats'])
print(result['drill_down_cv_stats']['f1'])




        Matcher  Average precision  Average recall  Average f1
0  DecisionTree           0.786111        0.636190    0.619048
1            RF           1.000000        0.612381    0.725641
2           SVM           0.942857        0.672381    0.750000
3        LinReg           1.000000        0.551429    0.687619
4        LogReg           0.578095        0.824762    0.655934
5    NaiveBayes           0.629286        0.844762    0.714291
           Name  \
0  DecisionTree   
1            RF   
2           SVM   
3        LinReg   
4        LogReg   
5    NaiveBayes   

                                                                         Matcher  \
0          <py_entitymatching.matcher.dtmatcher.DTMatcher object at 0x10db79e48>   
1          <py_entitymatching.matcher.rfmatcher.RFMatcher object at 0x10db79f28>   
2        <py_entitymatching.matcher.svmmatcher.SVMMatcher object at 0x10db79e80>   
3  <py_entitymatching.matcher.linregmatcher.LinRegMatcher object at 0x10db79fd0>   
4  <p

In [22]:
# Debugging SVM
# Fit the SVM matcher to the feature vectors
svm.fit(table=H, exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'], target_attr='match')

# Use the SVM matcher to predict if tuple pairs match
predictions = svm.predict(table=H, exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'], target_attr='predicted_labels', 
           append=True, inplace=False)

eval_result = em.eval_matches(predictions, 'match', 'predicted_labels')
em.print_eval_summary(eval_result)

Precision : 100.0% (26/26)
Recall : 74.29% (26/35)
F1 : 85.25%
False positives : 0 (out of 26 positive predictions)
False negatives : 9 (out of 324 negative predictions)
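The figures in a summary like this follow directly from the counts of true positives, false positives, and false negatives. A small sketch with a hypothetical helper (not part of py_entitymatching) that recomputes them from the counts reported above:

```python
def summarize(true_pos, false_pos, false_neg):
    """Return (precision, recall, f1) from raw confusion counts."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

precision, recall, f1 = summarize(true_pos=26, false_pos=0, false_neg=9)
print(f"Precision: {precision:.2%}, Recall: {recall:.2%}, F1: {f1:.2%}")
# Precision: 100.00%, Recall: 74.29%, F1: 85.25%
```

Note that these numbers are measured on the same development set the matcher was fitted on, so they are an optimistic view compared with the cross-validation averages.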


In [23]:
predictions[predictions['match'] != predictions['predicted_labels']]

Unnamed: 0,_id,ltable_ID,rtable_ID,Name_Name_jac_qgm_3_qgm_3,Name_Name_cos_dlm_dc0_dlm_dc0,Name_Name_mel,Name_Name_lev_dist,Name_Name_lev_sim,Author_Author_jac_qgm_3_qgm_3,Author_Author_cos_dlm_dc0_dlm_dc0,...,Author_Author_sw,Publishing_Date_Publishing_Date_lev_sim,Publishing_Date_Publishing_Date_jar,Publishing_Date_Publishing_Date_jwn,Publishing_Date_Publishing_Date_exm,Pages_Pages_exm,Pages_Pages_anm,Pages_Pages_lev_dist,match,predicted_labels
5,338,22,2719,0.290698,0.516398,0.639341,41.0,0.414286,1.0,1.0,...,10.0,0.125,0.5,0.5,0.0,1.0,1.0,0.0,1,0
38,357,360,2807,0.747664,0.741249,0.861764,66.0,0.326531,1.0,1.0,...,17.0,0.714286,0.78254,0.826032,0.0,0.0,0.998117,1.0,1,0
395,197,666,1786,0.518519,0.666667,0.7901,5.0,0.736842,1.0,1.0,...,17.0,0.428571,0.579365,0.579365,0.0,0.10299,0.667998,2.265781,1,0
4,310,16,2615,0.526316,0.629941,0.911638,16.0,0.659574,1.0,1.0,...,11.0,0.5,0.607143,0.607143,0.0,1.0,1.0,0.0,1,0
429,353,1177,2790,0.42623,0.668153,0.676795,34.0,0.32,1.0,1.0,...,16.0,0.285714,0.507937,0.507937,0.0,0.0,0.25,3.0,1,0
190,343,403,2720,0.326733,0.327327,0.833956,43.0,0.505747,1.0,1.0,...,12.0,0.714286,0.809524,0.847619,0.0,1.0,1.0,0.0,1,0
1,378,13,2888,0.634615,0.755929,0.928483,14.0,0.702128,1.0,1.0,...,11.0,0.714286,0.849206,0.879365,0.0,0.0,0.8,2.0,1,0
34,366,347,2847,0.526316,0.629941,0.911638,16.0,0.659574,1.0,1.0,...,11.0,0.571429,0.746032,0.746032,0.0,1.0,1.0,0.0,1,0
427,360,1114,2834,0.474576,0.721688,0.867172,40.0,0.42029,1.0,1.0,...,17.0,0.5,0.690476,0.721429,0.0,0.0,0.949219,2.0,1,0


In [24]:
# Triggers
# Use the constructor to create a trigger
# TODO : Decide if we actually need these
mt = em.MatchTrigger()
mt.add_cond_rule(['Name_Name_cos_dlm_dc0_dlm_dc0(ltuple, rtuple) >= 0.9', 'Author_Author_cos_dlm_dc0_dlm_dc0(ltuple, rtuple) >= 0.9'], feature_table)
mt.add_cond_status(True)
mt.add_action(1)

preds = mt.execute(input_table=predictions, label_column='predicted_labels', inplace=True)
predictions.head()
eval_result = em.eval_matches(predictions, 'match', 'predicted_labels')
em.print_eval_summary(eval_result)


Precision : 100.0% (26/26)
Recall : 74.29% (26/35)
F1 : 85.25%
False positives : 0 (out of 26 positive predictions)
False negatives : 9 (out of 324 negative predictions)


In [25]:
predictions[predictions['match'] != predictions['predicted_labels']]

Unnamed: 0,_id,ltable_ID,rtable_ID,Name_Name_jac_qgm_3_qgm_3,Name_Name_cos_dlm_dc0_dlm_dc0,Name_Name_mel,Name_Name_lev_dist,Name_Name_lev_sim,Author_Author_jac_qgm_3_qgm_3,Author_Author_cos_dlm_dc0_dlm_dc0,...,Author_Author_sw,Publishing_Date_Publishing_Date_lev_sim,Publishing_Date_Publishing_Date_jar,Publishing_Date_Publishing_Date_jwn,Publishing_Date_Publishing_Date_exm,Pages_Pages_exm,Pages_Pages_anm,Pages_Pages_lev_dist,match,predicted_labels
5,338,22,2719,0.290698,0.516398,0.639341,41.0,0.414286,1.0,1.0,...,10.0,0.125,0.5,0.5,0.0,1.0,1.0,0.0,1,0
38,357,360,2807,0.747664,0.741249,0.861764,66.0,0.326531,1.0,1.0,...,17.0,0.714286,0.78254,0.826032,0.0,0.0,0.998117,1.0,1,0
395,197,666,1786,0.518519,0.666667,0.7901,5.0,0.736842,1.0,1.0,...,17.0,0.428571,0.579365,0.579365,0.0,0.10299,0.667998,2.265781,1,0
4,310,16,2615,0.526316,0.629941,0.911638,16.0,0.659574,1.0,1.0,...,11.0,0.5,0.607143,0.607143,0.0,1.0,1.0,0.0,1,0
429,353,1177,2790,0.42623,0.668153,0.676795,34.0,0.32,1.0,1.0,...,16.0,0.285714,0.507937,0.507937,0.0,0.0,0.25,3.0,1,0
190,343,403,2720,0.326733,0.327327,0.833956,43.0,0.505747,1.0,1.0,...,12.0,0.714286,0.809524,0.847619,0.0,1.0,1.0,0.0,1,0
1,378,13,2888,0.634615,0.755929,0.928483,14.0,0.702128,1.0,1.0,...,11.0,0.714286,0.849206,0.879365,0.0,0.0,0.8,2.0,1,0
34,366,347,2847,0.526316,0.629941,0.911638,16.0,0.659574,1.0,1.0,...,11.0,0.571429,0.746032,0.746032,0.0,1.0,1.0,0.0,1,0
427,360,1114,2834,0.474576,0.721688,0.867172,40.0,0.42029,1.0,1.0,...,17.0,0.5,0.690476,0.721429,0.0,0.0,0.949219,2.0,1,0
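The evaluation summary and the list of mismatches are unchanged after the trigger. One way to see why is to check the trigger condition against the nine false negatives: the sketch below copies their name and author cosine values from the output above and tests them against the 0.9 thresholds. The helper name is hypothetical.

```python
# (Name_Name_cos_dlm_dc0_dlm_dc0, Author_Author_cos_dlm_dc0_dlm_dc0)
# for the nine false negatives, copied from the output above.
false_negatives = [
    (0.516398, 1.0), (0.741249, 1.0), (0.666667, 1.0),
    (0.629941, 1.0), (0.668153, 1.0), (0.327327, 1.0),
    (0.755929, 1.0), (0.629941, 1.0), (0.721688, 1.0),
]
NAME_THRESHOLD = 0.9
AUTHOR_THRESHOLD = 0.9

def trigger_fires(name_cos, author_cos):
    """Both conditions of the rule must hold for the label to be set to 1."""
    return name_cos >= NAME_THRESHOLD and author_cos >= AUTHOR_THRESHOLD

flipped = [row for row in false_negatives if trigger_fires(*row)]
print(len(flipped))                                  # 0
print(max(name for name, _ in false_negatives))      # 0.755929
```

Every false negative has a perfect author score but a name cosine below 0.9, so the rule never fires on them. Lowering the name threshold would let it fire on some of these rows, at the risk of introducing false positives, which would need to be checked on the development set.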


In [27]:
# Testing
# Convert J into a set of feature vectors using feature_table
L = em.extract_feature_vecs(J, feature_table=feature_table,
                            attrs_after='match', show_progress=False)

# Impute missing values
L = em.impute_table(L, 
                exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'],
                strategy='mean')
# Predict on L 
svm.predict(table=L, exclude_attrs=['_id', 'ltable_ID', 'rtable_ID', 'match'], 
              append=True, target_attr='predicted', inplace=True)

L.head()


Unnamed: 0,_id,ltable_ID,rtable_ID,Name_Name_jac_qgm_3_qgm_3,Name_Name_cos_dlm_dc0_dlm_dc0,Name_Name_mel,Name_Name_lev_dist,Name_Name_lev_sim,Author_Author_jac_qgm_3_qgm_3,Author_Author_cos_dlm_dc0_dlm_dc0,...,Author_Author_sw,Publishing_Date_Publishing_Date_lev_sim,Publishing_Date_Publishing_Date_jar,Publishing_Date_Publishing_Date_jwn,Publishing_Date_Publishing_Date_exm,Pages_Pages_exm,Pages_Pages_anm,Pages_Pages_lev_dist,match,predicted
361,334,497,2681,0.808824,0.912871,0.941026,23.0,0.705128,1.0,1.0,...,14.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1,1
73,530,386,400,0.095238,0.142857,0.616316,41.0,0.293103,0.974677,0.989866,...,12.843137,0.375,0.60119,0.60119,0.0,0.0,0.831025,2.0,0,0
374,386,582,2888,0.611111,0.683763,0.877662,46.0,0.417722,1.0,1.0,...,11.0,0.5,0.646429,0.646429,0.0,0.0,0.933333,2.0,0,0
155,784,386,2274,0.142857,0.227921,0.60389,48.0,0.283582,0.974677,0.989866,...,12.843137,0.428571,0.630952,0.630952,0.0,0.0,0.440443,3.0,0,0
104,637,386,1087,0.079545,0.154303,0.645243,40.0,0.310345,0.974677,0.989866,...,12.843137,0.285714,0.619048,0.619048,0.0,0.0,0.268698,3.0,0,0


In [28]:
# Apply the same match trigger to the test predictions, then evaluate them

mt = em.MatchTrigger()
mt.add_cond_rule(['Name_Name_cos_dlm_dc0_dlm_dc0(ltuple, rtuple) >= 0.9', 'Author_Author_cos_dlm_dc0_dlm_dc0(ltuple, rtuple) >= 0.9'], feature_table)
mt.add_cond_status(True)
mt.add_action(1)

mt.execute(input_table=L, label_column='predicted', inplace=True)

eval_result = em.eval_matches(L, 'match', 'predicted')
em.print_eval_summary(eval_result)


Precision : 100.0% (9/9)
Recall : 75.0% (9/12)
F1 : 85.71%
False positives : 0 (out of 9 positive predictions)
False negatives : 3 (out of 141 negative predictions)
