# Getting started with DeepMatcher


As an overview, this tutorial goes through a setup step followed by the 4 steps for using `deepmatcher`:

<ol start="0">
  <li>Setup</li>
  <li>Process labeled data</li>
  <li>Define neural network model</li>
  <li>Train model</li>
  <li>Apply model to new data</li>
</ol>

Let's begin!

## Step 0. Setup

If you are running this notebook inside Colab, you will first need to install necessary packages by running the code below:

In [1]:
try:
    import deepmatcher
except ImportError:
    !pip install -qqq deepmatcher

Now let's import `deepmatcher`, which will do all the heavy lifting of building and training neural network models for entity matching.

In [2]:
import deepmatcher as dm

We recommend having a GPU available for training in Step 3. If a GPU is not available, training will use all available CPU cores. You can run the following command to check whether a GPU is available and will be used for training:

In [3]:
import torch
torch.cuda.is_available()

False

### Download sample data for entity matching

Now let's get some sample data to play with in this tutorial. We will need three sets of labeled data and one set of unlabeled data:

1. **Training Data:** This is used for training our neural network model.
2. **Validation Data:** This is used for determining the configuration (i.e., hyperparameters) of our model in such a way that the model does not overfit to the training set.
3. **Test Data:** This is used to estimate the performance of our trained model on unlabeled data.
4. **Unlabeled Data:** The trained model is applied on this data to obtain predictions, which can then be used for downstream tasks in practical application scenarios.
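If you start from a single file of labeled pairs instead of pre-split files, the three labeled sets can be produced with a shuffle-and-slice split. The sketch below is only an illustration: the helper name `split_pairs`, the 3:1:1 ratio and the fixed seed are assumptions, not something `deepmatcher` prescribes.

```python
import random


def split_pairs(rows, ratios=(3, 1, 1), seed=0):
    """Shuffle labeled rows and split them into train/validation/test lists.

    The 3:1:1 default ratio is an illustrative assumption.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    total = sum(ratios)
    n_train = len(rows) * ratios[0] // total
    n_valid = len(rows) * ratios[1] // total
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]  # whatever remains goes to the test set
    return train, validation, test
```

Each of the three lists can then be written to its own CSV file.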

We download these four data sets to the `sample_data` directory:

In [4]:
!mkdir -p sample_data/itunes-amazon
!wget -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/anhaidgroup/deepmatcher/master/examples/sample_data/itunes-amazon/train.csv
!wget -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/anhaidgroup/deepmatcher/master/examples/sample_data/itunes-amazon/validation.csv
!wget -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/anhaidgroup/deepmatcher/master/examples/sample_data/itunes-amazon/test.csv
!wget -qnc -P sample_data/itunes-amazon https://raw.githubusercontent.com/anhaidgroup/deepmatcher/master/examples/sample_data/itunes-amazon/unlabeled.csv

/bin/sh: wget: command not found
/bin/sh: wget: command not found
/bin/sh: wget: command not found
/bin/sh: wget: command not found
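If `wget` is not installed, as in the output above, the same files can be fetched with Python's standard library. This is a minimal sketch: the URLs are copied from the `wget` commands, while the helper names `sample_urls` and `download_sample_data` are hypothetical.

```python
import os
import urllib.request

# Base URL and file names are copied from the wget commands above.
BASE_URL = ("https://raw.githubusercontent.com/anhaidgroup/deepmatcher/"
            "master/examples/sample_data/itunes-amazon/")
FILES = ["train.csv", "validation.csv", "test.csv", "unlabeled.csv"]


def sample_urls():
    """Return the full URL of each sample file."""
    return [BASE_URL + name for name in FILES]


def download_sample_data(dest="sample_data/itunes-amazon"):
    """Download each file into `dest`, skipping files that already exist."""
    os.makedirs(dest, exist_ok=True)
    for name, url in zip(FILES, sample_urls()):
        target = os.path.join(dest, name)
        if not os.path.exists(target):  # mirrors wget's -nc (no-clobber) flag
            urllib.request.urlretrieve(url, target)
```

Calling `download_sample_data()` in a notebook cell should leave the four CSV files in `sample_data/itunes-amazon`, provided the machine has network access.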


To get an idea of what our data looks like, let's take a peek at the training dataset:

In [4]:
import pandas as pd
pd.read_csv('/Users/seowookchoi/Desktop/entity/University/train.csv').head()

Unnamed: 0,label,left_School_Name,right_School_Name,id
0,1,"A. D. PATEL INSTITUTE OF TECHNOLOGY, Anand, India",A.D. Patel Institute of Technology,1
1,1,A.D. Patel Institute of Technology,A.D. Patel Institute of Technology,2
2,1,A.T. Still University,A.T. Still University,3
3,1,ABE International Business College,ABE International Business College,4
4,1,ACLC College of Butuan City,ACLC College of Butuan City,5


## Step 1. Process labeled data

Before we can use our data for training, `deepmatcher` needs to first load and process it in order to prepare it for neural network training. Currently `deepmatcher` only supports processing CSV files. Each CSV file is assumed to have the following kinds of columns:

* **"Left" attributes (required):** Our goal is to match tuple pairs. "Left" attributes are columns that correspond to the "left" tuple or the first tuple in the tuple pair. These column names are expected to be prefixed with "left_" by default.
* **"Right" attributes (required):** "Right" attributes are columns that correspond to the "right" tuple or the second tuple in the tuple pair. These column names are expected to be prefixed with "right_" by default.
* **Label column (required for train, validation, test):** Column containing the labels (match or non-match) for each tuple pair. Expected to be named "label" by default.
* **ID column (required):** Column containing a unique ID for each tuple pair. This is for evaluation convenience. Expected to be named "id" by default.

More details on what data processing involves and ways to customize it are described in **[this notebook](https://nbviewer.jupyter.org/github/anhaidgroup/deepmatcher/blob/master/examples/data_processing.ipynb)**. 
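Before handing a CSV file to `deepmatcher`, it can help to check its header against the column conventions listed above. The following is a rough sketch using only the standard library. The helper `check_schema` is hypothetical, and the check that every "left_" attribute has a "right_" counterpart is an assumption about what the library expects.

```python
import csv
import io


def check_schema(header, labeled=True):
    """Return a list of problems found in a CSV header (empty if none)."""
    problems = []
    left = {c[len("left_"):] for c in header if c.startswith("left_")}
    right = {c[len("right_"):] for c in header if c.startswith("right_")}
    if not left:
        problems.append("no 'left_' attribute columns")
    if not right:
        problems.append("no 'right_' attribute columns")
    if left != right:
        # Assumption: each left attribute should have a right counterpart.
        problems.append("left/right attributes differ: %s" % sorted(left ^ right))
    if "id" not in header:
        problems.append("missing 'id' column")
    if labeled and "label" not in header:
        problems.append("missing 'label' column")
    return problems


# A header and one row in the layout of the training data shown earlier.
sample = io.StringIO(
    "label,left_School_Name,right_School_Name,id\n"
    "1,A.T. Still University,A.T. Still University,3\n"
)
header = next(csv.reader(sample))
print(check_schema(header))  # an empty list means no problems were found
```

For unlabeled data, pass `labeled=False` so that a missing "label" column is not reported.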

### Processing labeled data
To process our train, validation and test CSV files, we call `dm.data.process` in the following code snippet. It loads and processes the CSV files and returns three processed `MatchingDataset` objects, one per file. These dataset objects will later be used for training and evaluation. The basic parameters of `dm.data.process` are as follows:

* **path (required):** The path where all data is stored. This includes train, validation and test. `deepmatcher` may create new files in this directory to store information about these data sets. This allows subsequent `dm.data.process` calls to be much faster.
* **train (required):** File name of training data in `path` directory.
* **validation (required):** File name of validation data in `path` directory.
* **test (optional):** File name of test data in `path` directory.
* **ignore_columns (optional):** Any columns in the CSV files that should be ignored for the purposes of training.

Note that the train, validation and test CSVs must all share the same schema, i.e., they should have the same columns. Processing data involves several steps and can take several minutes to complete, especially if this is the first time you are running the `deepmatcher` package.

NOTE: If you are running this in Colab, you may get a message saying 'Memory usage is close to the limit.' You can safely ignore it for now. We are working on reducing the memory footprint.

In [5]:
train, validation, test = dm.data.process(
    path='/Users/seowookchoi/Desktop/entity/University',
    train='train.csv',
    validation='validation.csv',
    test='test.csv')

#### Peeking at processed data
Let's take a look at what the processed data looks like. To do this, we get the raw `pandas` table corresponding to the processed training dataset object.

In [6]:
train_table = train.get_raw_table()
train_table.head()

Unnamed: 0,label,left_School_Name,right_School_Name,id
0,1,"a. d. patel institute of technology , anand , ...",a.d. patel institute of technology,1
1,1,a.d. patel institute of technology,a.d. patel institute of technology,2
2,1,a.t . still university,a.t . still university,3
3,1,abe international business college,abe international business college,4
4,1,aclc college of butuan city,aclc college of butuan city,5


The processed attribute values have been tokenized and lowercased, so they may not look exactly the same as the input training data. These modifications help the neural network generalize better, i.e., perform better on data it was not trained on.
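To make "tokenized and lowercased" concrete, here is a deliberately simple stand-in. It is not the tokenizer `deepmatcher` uses (the processed table above shows different handling of periods, e.g. `a.t .`). The function name `simple_tokenize` and the regular expression are illustrative assumptions.

```python
import re


def simple_tokenize(text):
    """Lowercase `text` and split it into word and punctuation tokens."""
    # \w+ grabs runs of letters/digits; [^\w\s] grabs single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())


print(simple_tokenize("ABE International Business College"))
# ['abe', 'international', 'business', 'college']
```

Lowercasing means that strings differing only in case, such as "A.D. PATEL INSTITUTE" and "A.D. Patel Institute", produce identical tokens.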

## Step 2. Define neural network model

In this step you tell `deepmatcher` what kind of neural network you would like to use for entity matching. The easiest way to do this is to use one of the several kinds of neural network models that come built in with `deepmatcher`. To use a built-in network, construct a `dm.MatchingModel` as follows:

`model = dm.MatchingModel(attr_summarizer='<TYPE>')`

where `<TYPE>` is one of `sif`, `rnn`, `attention` or `hybrid`. If you are not familiar with what these mean, we strongly recommend taking a look at either **[slides from our talk on deepmatcher](http://bit.do/deepmatcher-talk)** for a high-level overview, or **[our paper](http://pages.cs.wisc.edu/~anhai/papers1/deepmatcher-sigmod18.pdf)** for a more detailed explanation. Here we briefly describe the intuition behind these four model types:
* **sif:** This model considers the **words** present in each attribute value pair to determine a match or non-match. It does not take word order into account.
* **rnn:** This model considers the **sequences of words** present in each attribute value pair to determine a match or non-match.
* **attention:** This model considers the **alignment of words** present in each attribute value pair to determine a match or non-match. It does not take word order into account.
* **hybrid:** This model considers the **alignment of sequences of words** present in each attribute value pair to determine a match or non-match. This is the default.
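The difference between ignoring and respecting word order can be seen with two hand-written similarity functions. These are not the neural models above, only a toy analogy: `bag_similarity` treats a value as a set of words, while `sequence_similarity` compares the word sequences. Both helper names are hypothetical.

```python
from difflib import SequenceMatcher


def bag_similarity(left, right):
    """Jaccard overlap of the two token sets; word order is ignored."""
    a, b = set(left.lower().split()), set(right.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


def sequence_similarity(left, right):
    """Similarity of the two token sequences; word order matters."""
    return SequenceMatcher(None, left.lower().split(), right.lower().split()).ratio()


left, right = "university of technology", "technology university of"
print(bag_similarity(left, right))       # 1.0, same words
print(sequence_similarity(left, right))  # below 1.0, different order
```

A model that only sees the bag of words cannot tell these two values apart, whereas a sequence-aware model can.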

`deepmatcher` is highly customizable and allows you to tune almost every aspect of the neural network model for your application scenario. **[This tutorial](https://nbviewer.jupyter.org/github/anhaidgroup/deepmatcher/blob/master/examples/matching_models.ipynb)** discusses the structure of `MatchingModel`s and how they can be customized.

For this tutorial, let's create a `hybrid` model for entity matching:

In [7]:
model = dm.MatchingModel(attr_summarizer='hybrid')

## Step 3. Train model

Next, we train the defined neural network model using the processed training and validation data. To do so, we call the `run_train` method which takes the following basic parameters:

* **train:** The processed training dataset object (of type `MatchingDataset`).
* **validation:** The processed validation dataset object (of type `MatchingDataset`).
* **epochs:** Number of times to go over the entire `train` data for training the model.
* **batch_size:** Number of labeled examples (tuple pairs) to use for each training step. This value may be increased if you have a lot of training data and would like to speed up training. The optimal value is dataset dependent.
* **best_save_path:** Path to save the best model.
* **pos_neg_ratio:** The ratio of the weight of positive examples (matches) to the weight of negative examples (non-matches). This value should be increased if you have fewer matches than non-matches in your data. The optimal value is dataset dependent.

Many other aspects of the training algorithm can be customized. For details on this, please refer to the API documentation for **`run_train`**.
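Two of these parameters lend themselves to quick back-of-the-envelope calculations. The sketch below computes how many training steps one epoch takes for a given `batch_size`, and derives a starting value for `pos_neg_ratio` from the label counts. Using the non-match to match count ratio is a common heuristic assumed here, not a rule stated by `deepmatcher`. Both function names are hypothetical.

```python
import math
from collections import Counter


def steps_per_epoch(num_examples, batch_size):
    """Number of training steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)


def suggest_pos_neg_ratio(labels):
    """Heuristic starting point: number of non-matches per match.

    Expects integer labels, 1 for a match and 0 for a non-match.
    """
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive examples in labels")
    return counts[0] / counts[1]


print(steps_per_epoch(100, 16))              # 7
print(suggest_pos_neg_ratio([1, 0, 0, 0]))   # 3.0
```

Since the optimal values are dataset dependent, treat these numbers as a first guess to be tuned on the validation set.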

In [8]:
model.run_train(
    train,
    validation,
    epochs=5,
    batch_size=16,
    best_save_path='hybrid_model.pth',
    pos_neg_ratio=3)

* Number of trainable parameters: 2798703
===>  TRAIN Epoch 1


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 06:49:13


Finished Epoch 1 || Run Time: 24471.7 | Load Time:   82.3 || F1:  88.30 | Prec:  87.86 | Rec:  88.74 || Ex/s:  77.80

===>  EVAL Epoch 1


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:34:06


Finished Epoch 1 || Run Time: 2025.3 | Load Time:   21.2 || F1:  88.80 | Prec:  91.76 | Rec:  86.04 || Ex/s: 303.26

* Best F1: 88.80490467792055
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 2


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 07:54:07


Finished Epoch 2 || Run Time: 28373.8 | Load Time:   74.6 || F1:  90.44 | Prec:  90.90 | Rec:  89.99 || Ex/s:  67.15

===>  EVAL Epoch 2


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:28:44


Finished Epoch 2 || Run Time: 1707.8 | Load Time:   16.4 || F1:  87.10 | Prec:  94.76 | Rec:  80.59 || Ex/s: 359.96

---------------------

===>  TRAIN Epoch 3


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 13:28:20


Finished Epoch 3 || Run Time: 48420.8 | Load Time:   80.9 || F1:  89.96 | Prec:  91.01 | Rec:  88.93 || Ex/s:  39.38

===>  EVAL Epoch 3


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:28:41


Finished Epoch 3 || Run Time: 1703.8 | Load Time:   18.0 || F1:  90.13 | Prec:  95.80 | Rec:  85.09 || Ex/s: 360.46

* Best F1: 90.13186244936654
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 4


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 11:31:29


Finished Epoch 4 || Run Time: 41418.7 | Load Time:   72.9 || F1:  90.64 | Prec:  91.75 | Rec:  89.56 || Ex/s:  46.04

===>  EVAL Epoch 4


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:28:55


Finished Epoch 4 || Run Time: 1718.7 | Load Time:   16.4 || F1:  91.85 | Prec:  95.36 | Rec:  88.59 || Ex/s: 357.70

* Best F1: 91.850852033069
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 5


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 11:37:59


Finished Epoch 5 || Run Time: 41807.8 | Load Time:   73.2 || F1:  92.20 | Prec:  93.56 | Rec:  90.89 || Ex/s:  45.61

===>  EVAL Epoch 5


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:28:49


Finished Epoch 5 || Run Time: 1712.8 | Load Time:   16.4 || F1:  93.10 | Prec:  95.81 | Rec:  90.53 || Ex/s: 358.93

* Best F1: 93.0968119822609
Saving best model...
Done.
---------------------

Loading best model...
Training done.


93.0968119822609

## Step 4. Apply model to new data

### Evaluating on test data
Now that we have a trained model for entity matching, we can evaluate its accuracy on test data to estimate how the model will perform on unlabeled data.

In [9]:
# Compute F1 on test set
model.run_eval(test)

===>  EVAL Epoch 5


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:26:12


Finished Epoch 5 || Run Time: 1558.2 | Load Time:   14.8 || F1:  92.87 | Prec:  95.03 | Rec:  90.80 || Ex/s: 404.18



92.86758732737611
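The F1 value returned above is the harmonic mean of the precision and recall shown in the log line. As a quick sanity check, the sketch below recomputes it from the rounded values printed in the log. The helper name `f1_score` is our own, and the result agrees with the returned value only up to rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as printed (rounded) in the test-set log line above.
print(round(f1_score(95.03, 90.80), 2))  # 92.87
```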

### Evaluating on unlabeled data

We finally apply the trained model to unlabeled data to get predictions. To do this, we need to first process the unlabeled data.

#### Processing unlabeled data

To process unlabeled data, we use `dm.data.process_unlabeled`, as shown in the code snippet below. The basic parameters for this call are as follows:

* **path (required):** The full path to the unlabeled data file (not just the directory).
* **trained_model (required):** The trained model. The model is aware of the configuration of the training data on which it was trained, and so `deepmatcher` reuses the same configuration for the unlabeled data.
* **ignore_columns (optional):** Any columns in the unlabeled CSV file that you may want to ignore for the purposes of evaluation. If not specified, the columns that were ignored while processing the training set will also be ignored while processing the unlabeled data.

Note that the unlabeled CSV file must have the same schema as the train, validation and test CSVs.

In [10]:
unlabeled = dm.data.process_unlabeled(
    path='/Users/seowookchoi/Desktop/entity/University/unlabeled.csv',
    trained_model=model)


Reading and processing data from "/Users/seowookchoi/Desktop/entity/University/unlabeled.csv"
0% [##############################] 100% | ETA: 00:00:00

#### Obtaining predictions

Next, we call the `run_prediction` method, which takes a processed data set object and returns a `pandas` dataframe containing tuple pair IDs (`id` column) and the corresponding match score predictions (`match_score` column). Match scores lie in [0, 1], and a score above 0.5 indicates a match prediction.

In [11]:
predictions = model.run_prediction(unlabeled)
predictions.head()

===>  PREDICT Epoch 5


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 02:07:11


Finished Epoch 5 || Run Time: 7507.3 | Load Time:  124.2 || F1:   0.00 | Prec:   0.00 | Rec:   0.00 || Ex/s:   0.00



id,match_score
1,0.137177
2,0.136814
3,0.143847
4,0.142449
5,0.141171


You may optionally set the `output_attributes` parameter to `True` to also include all attributes present in the original input table. As mentioned earlier, the processed attribute values will likely look a bit different from the attribute values in the input CSV files due to modifications such as tokenization and lowercasing.

In [12]:
predictions = model.run_prediction(unlabeled, output_attributes=True)
predictions.head()

===>  PREDICT Epoch 5


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 02:06:51


Finished Epoch 5 || Run Time: 7533.7 | Load Time:   78.0 || F1:   0.00 | Prec:   0.00 | Rec:   0.00 || Ex/s:   0.00



id,match_score,left_School_Name,right_School_Name
1,0.137177,"Indira Gandhi National Tribal University, Amar...",Hunan Normal University
2,0.136814,University of Western Australia,Hunan Normal University
3,0.143847,GIST,Hunan Normal University
4,0.142449,Vignan University (Deemed to be University),Hunan Normal University
5,0.141171,Spiritan University College,Hunan Normal University


You can then save these predictions to CSV and use them for downstream tasks.

In [13]:
predictions.to_csv('/Users/seowookchoi/Desktop/entity/University/unlabeled_predictions.csv')
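A typical downstream step is to turn each match score into a match or non-match decision using the 0.5 threshold mentioned earlier. The sketch below reads a predictions file with the standard `csv` module. It assumes the saved file has `id` and `match_score` columns, and the helper name `label_matches` is hypothetical.

```python
import csv
import io


def label_matches(csv_file, threshold=0.5):
    """Map each pair id to True (match) or False based on its match_score."""
    reader = csv.DictReader(csv_file)
    return {row["id"]: float(row["match_score"]) > threshold for row in reader}


# First rows of the predictions shown above, in an id,match_score layout.
sample = io.StringIO(
    "id,match_score\n"
    "1,0.137177\n"
    "2,0.136814\n"
    "3,0.143847\n"
)
print(label_matches(sample))  # {'1': False, '2': False, '3': False}
```

For a real file, pass an open file handle instead of the in-memory sample, e.g. `label_matches(open(path, newline=''))`.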

#### Getting predictions on labeled data

You can also get predictions for labeled data such as the validation data. To do so, simply call the `run_prediction` method with the validation data as its argument.

In [14]:
valid_predictions = model.run_prediction(validation, output_attributes=True)
valid_predictions.head()

===>  PREDICT Epoch 5


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:29:54


Finished Epoch 5 || Run Time: 1771.0 | Load Time:   23.4 || F1:  93.14 | Prec:  95.74 | Rec:  90.68 || Ex/s: 345.87



id,match_score,label,left_School_Name,right_School_Name
1,0.979907,1,ACMCL College,ACMCL College
2,0.9759,1,MY - AIMST University,AIMST University
3,0.141378,1,Missouri S&T,AIMST University
4,0.931575,1,"AKS University, Satna AKSU",AKS University
5,0.990576,1,AKS University,AKS University


In [16]:
valid_predictions.to_csv('/Users/seowookchoi/Desktop/entity/University/valid_predictions.csv')
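Predictions on labeled data are useful for error analysis: any pair whose thresholded score disagrees with its label is worth a closer look. The sketch below applies this to the five validation rows shown above. The helper name `misclassified` is hypothetical, and the 0.5 threshold follows the convention described earlier.

```python
def misclassified(rows, threshold=0.5):
    """Return ids whose thresholded match_score disagrees with the label."""
    wrong = []
    for pair_id, score, label in rows:
        predicted = 1 if score > threshold else 0
        if predicted != label:
            wrong.append(pair_id)
    return wrong


# (id, match_score, label) triples copied from the validation output above.
rows = [
    (1, 0.979907, 1),
    (2, 0.9759, 1),
    (3, 0.141378, 1),
    (4, 0.931575, 1),
    (5, 0.990576, 1),
]
print(misclassified(rows))  # [3]
```

Pair 3 ("Missouri S&T" vs. "AIMST University") is flagged here. Inspecting such pairs can reveal model mistakes or, possibly, questionable labels in the data.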