# University Name Matching using DeepMatcher


This project develops a university name matcher that maps the different written forms of a university's name to the correct school. For example, The Ohio State University may be written as "osu", "ohio state", "ohio state university", and so on, yet all of these refer to the same school. On resumes and job applications, people rarely use a standardized form, so matching these names automatically makes school information much easier to process.

## Step 0. Setup

If you are running this notebook inside Colab, you will first need to install necessary packages by running the code below:

In [1]:
try:
    import deepmatcher
except ImportError:
    !pip install -qqq deepmatcher

Now let's import `deepmatcher` which will do all the heavy lifting to build and train neural network models for entity matching. 

In [2]:
import deepmatcher as dm

We recommend having a GPU available for the training in Step 3. If a GPU is not available, all available CPU cores will be used instead. You can run the following command to determine whether a GPU is available and will be used for training:

In [3]:
import torch
torch.cuda.is_available()

False
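As a small sketch of the same check, the helper below reports which device would be used and how many CPU cores are available for the CPU fallback. `describe_compute` is a hypothetical name introduced here, and treating a missing `torch` installation as "no GPU" is an assumption made for illustration, not something deepmatcher does.

```python
import os


def describe_compute():
    """Report the likely training device and the number of CPU cores."""
    try:
        import torch  # optional here: only needed for the GPU check
        has_gpu = torch.cuda.is_available()
    except ImportError:
        # Assumption for this sketch: no torch means no GPU training.
        has_gpu = False
    return {
        "device": "cuda" if has_gpu else "cpu",
        "cpu_cores": os.cpu_count(),
    }


info = describe_compute()
print(info)
```

In the run shown in this notebook the check returned `False`, so training ran on the CPU, which is consistent with the multi-hour epochs in Step 3.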

## Step 1. Process labeled data

To get an idea of what our data looks like, let's take a peek at the training dataset:

In [4]:
import pandas as pd
pd.read_csv('/Users/seowookchoi/Desktop/entity/University/train.csv').head()

Unnamed: 0,label,left_School_Name,right_School_Name,id
0,1,"A. D. PATEL INSTITUTE OF TECHNOLOGY, Anand, India",A.D. Patel Institute of Technology,1
1,1,A.D. Patel Institute of Technology,A.D. Patel Institute of Technology,2
2,1,A.T. Still University,A.T. Still University,3
3,1,ABE International Business College,ABE International Business College,4
4,1,ACLC College of Butuan City,ACLC College of Butuan City,5


In [5]:
train, validation, test = dm.data.process(
    path='/Users/seowookchoi/Desktop/entity/University',
    train='train.csv',
    validation='validation.csv',
    test='test.csv')
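Before processing, it can help to confirm that the CSV header follows the column layout visible in the training data above. The sketch below sorts header names into groups, assuming deepmatcher's default conventions (`id` and `label` columns, `left_` and `right_` prefixes for the two entities). `classify_columns` is a hypothetical helper, and the sample row is copied from the training data preview.

```python
import csv
import io


def classify_columns(header, id_attr="id", label_attr="label",
                     left_prefix="left_", right_prefix="right_"):
    """Group column names by the role they play in a labeled pair file."""
    groups = {"id": [], "label": [], "left": [], "right": [], "other": []}
    for name in header:
        if name == id_attr:
            groups["id"].append(name)
        elif name == label_attr:
            groups["label"].append(name)
        elif name.startswith(left_prefix):
            groups["left"].append(name)
        elif name.startswith(right_prefix):
            groups["right"].append(name)
        else:
            groups["other"].append(name)  # belongs to neither entity
    return groups


# Header and one row taken from the training data preview.
sample = io.StringIO(
    "Unnamed: 0,label,left_School_Name,right_School_Name,id\n"
    "2,1,A.T. Still University,A.T. Still University,3\n"
)
header = next(csv.reader(sample))
groups = classify_columns(header)
print(groups)
```

Columns that land in `other`, such as the `Unnamed: 0` index column here, describe neither side of the pair. How they are handled may depend on the deepmatcher version and arguments, so dropping them when the CSV is written is probably the simplest option.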

#### Peeking at processed data
Let's look at the processed training data, in which all letters have been converted to lowercase.

In [6]:
train_table = train.get_raw_table()
train_table.head()

Unnamed: 0,label,left_School_Name,right_School_Name,id
0,1,"a. d. patel institute of technology , anand , ...",a.d. patel institute of technology,1
1,1,a.d. patel institute of technology,a.d. patel institute of technology,2
2,1,a.t . still university,a.t . still university,3
3,1,abe international business college,abe international business college,4
4,1,aclc college of butuan city,aclc college of butuan city,5


The processed attribute values have been tokenized and lowercased, so they may not look exactly the same as the input training data. These modifications help the neural network generalize better, i.e., perform better on data it was not trained on.

## Step 2. Define neural network model


In [7]:
model = dm.MatchingModel(attr_summarizer='hybrid')

## Step 3. Train model
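The training call in this step passes `pos_neg_ratio=3`, which asks the model to weigh matching pairs more heavily than non-matching ones. The sketch below shows one way such a ratio can be turned into per-class loss weights that sum to 2. The formula is an assumption based on a reading of the deepmatcher source and may differ between versions; `class_weights` is a hypothetical helper, not part of the library.

```python
def class_weights(pos_neg_ratio):
    """Convert a positive:negative weight ratio into (neg_weight, pos_weight).

    Assumed scheme: the two weights always sum to 2, so a ratio of 1
    gives both classes a weight of 1.
    """
    if pos_neg_ratio <= 0:
        raise ValueError("pos_neg_ratio must be positive")
    pos_weight = 2 * pos_neg_ratio / (1 + pos_neg_ratio)
    neg_weight = 2 - pos_weight
    return neg_weight, pos_weight


neg_w, pos_w = class_weights(3)
print(neg_w, pos_w)
```

Under this scheme a ratio of 3 makes an error on a matching pair cost three times as much as an error on a non-matching pair, which tends to favor recall on the positive class.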


In [8]:
model.run_train(
    train,
    validation,
    epochs=5,
    batch_size=16,
    best_save_path='hybrid_model.pth',
    pos_neg_ratio=3)

* Number of trainable parameters: 2798703
===>  TRAIN Epoch 1


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 06:49:13


Finished Epoch 1 || Run Time: 24471.7 | Load Time:   82.3 || F1:  88.30 | Prec:  87.86 | Rec:  88.74 || Ex/s:  77.80

===>  EVAL Epoch 1


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:34:06


Finished Epoch 1 || Run Time: 2025.3 | Load Time:   21.2 || F1:  88.80 | Prec:  91.76 | Rec:  86.04 || Ex/s: 303.26

* Best F1: 88.80490467792055
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 2


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 07:54:07


Finished Epoch 2 || Run Time: 28373.8 | Load Time:   74.6 || F1:  90.44 | Prec:  90.90 | Rec:  89.99 || Ex/s:  67.15

===>  EVAL Epoch 2


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:28:44


Finished Epoch 2 || Run Time: 1707.8 | Load Time:   16.4 || F1:  87.10 | Prec:  94.76 | Rec:  80.59 || Ex/s: 359.96

---------------------

===>  TRAIN Epoch 3


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 13:28:20


Finished Epoch 3 || Run Time: 48420.8 | Load Time:   80.9 || F1:  89.96 | Prec:  91.01 | Rec:  88.93 || Ex/s:  39.38

===>  EVAL Epoch 3


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:28:41


Finished Epoch 3 || Run Time: 1703.8 | Load Time:   18.0 || F1:  90.13 | Prec:  95.80 | Rec:  85.09 || Ex/s: 360.46

* Best F1: 90.13186244936654
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 4


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 11:31:29


Finished Epoch 4 || Run Time: 41418.7 | Load Time:   72.9 || F1:  90.64 | Prec:  91.75 | Rec:  89.56 || Ex/s:  46.04

===>  EVAL Epoch 4


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:28:55


Finished Epoch 4 || Run Time: 1718.7 | Load Time:   16.4 || F1:  91.85 | Prec:  95.36 | Rec:  88.59 || Ex/s: 357.70

* Best F1: 91.850852033069
Saving best model...
Done.
---------------------

===>  TRAIN Epoch 5


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 11:37:59


Finished Epoch 5 || Run Time: 41807.8 | Load Time:   73.2 || F1:  92.20 | Prec:  93.56 | Rec:  90.89 || Ex/s:  45.61

===>  EVAL Epoch 5


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:28:49


Finished Epoch 5 || Run Time: 1712.8 | Load Time:   16.4 || F1:  93.10 | Prec:  95.81 | Rec:  90.53 || Ex/s: 358.93

* Best F1: 93.0968119822609
Saving best model...
Done.
---------------------

Loading best model...
Training done.


93.0968119822609
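The log above is easier to compare across epochs once the validation F1 values are pulled out. The following is a minimal sketch that parses log text in the format printed above using only the standard library. The function names are hypothetical, and the embedded log lines are the five EVAL summaries copied from the output.

```python
import re

LOG = """\
===>  EVAL Epoch 1
Finished Epoch 1 || Run Time: 2025.3 | Load Time:   21.2 || F1:  88.80 | Prec:  91.76 | Rec:  86.04 || Ex/s: 303.26
===>  EVAL Epoch 2
Finished Epoch 2 || Run Time: 1707.8 | Load Time:   16.4 || F1:  87.10 | Prec:  94.76 | Rec:  80.59 || Ex/s: 359.96
===>  EVAL Epoch 3
Finished Epoch 3 || Run Time: 1703.8 | Load Time:   18.0 || F1:  90.13 | Prec:  95.80 | Rec:  85.09 || Ex/s: 360.46
===>  EVAL Epoch 4
Finished Epoch 4 || Run Time: 1718.7 | Load Time:   16.4 || F1:  91.85 | Prec:  95.36 | Rec:  88.59 || Ex/s: 357.70
===>  EVAL Epoch 5
Finished Epoch 5 || Run Time: 1712.8 | Load Time:   16.4 || F1:  93.10 | Prec:  95.81 | Rec:  90.53 || Ex/s: 358.93
"""

PHASE = re.compile(r"===>\s+(TRAIN|EVAL|PREDICT) Epoch (\d+)")
METRICS = re.compile(r"F1:\s*([\d.]+) \| Prec:\s*([\d.]+) \| Rec:\s*([\d.]+)")


def eval_f1_by_epoch(log_text):
    """Return {epoch: validation F1} for every EVAL summary line."""
    phase, epoch, scores = None, None, {}
    for line in log_text.splitlines():
        header = PHASE.search(line)
        if header:
            phase, epoch = header.group(1), int(header.group(2))
            continue
        metrics = METRICS.search(line)
        if metrics and phase == "EVAL":
            scores[epoch] = float(metrics.group(1))
    return scores


def improving_epochs(scores):
    """Epochs whose F1 beat every earlier epoch (when a checkpoint is saved)."""
    best, improved = float("-inf"), []
    for epoch in sorted(scores):
        if scores[epoch] > best:
            best = scores[epoch]
            improved.append(epoch)
    return improved


scores = eval_f1_by_epoch(LOG)
saved = improving_epochs(scores)
print(scores, saved)
```

The epochs returned by `improving_epochs` line up with the "Saving best model..." messages in the log: validation F1 dipped in epoch 2, so no checkpoint was written for it.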

## Step 4. Apply model to test data

### Evaluating on test data
Now that we have a trained model for entity matching, we can evaluate its accuracy on the test data to estimate how the model will perform on unlabeled data.

In [9]:
# Compute F1 on test set
model.run_eval(test)

===>  EVAL Epoch 5


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:26:12


Finished Epoch 5 || Run Time: 1558.2 | Load Time:   14.8 || F1:  92.87 | Prec:  95.03 | Rec:  90.80 || Ex/s: 404.18



92.86758732737611
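The F1 score reported here is the harmonic mean of precision and recall, so it can be checked against the other two numbers on the same log line. The sketch below does this for the test-set precision and recall printed above; `f1_score` is a hypothetical helper written for this check.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall from the test-set evaluation line above (in percent).
test_f1 = f1_score(95.03, 90.80)
print(round(test_f1, 2))
```

Because the logged precision and recall are already rounded to two decimals, the recomputed value agrees with the reported F1 only to about two decimal places.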

#### Getting predictions on labeled data

You can also get predictions for labeled data such as the validation set. To do so, call the `run_prediction` method with the validation data as its argument; `output_attributes=True` includes the original attribute columns in the output.

In [14]:
valid_predictions = model.run_prediction(validation, output_attributes=True)
valid_predictions.head()

===>  PREDICT Epoch 5


0% [██████████████████████████████] 100% | ETA: 00:00:00
Total time elapsed: 00:29:54


Finished Epoch 5 || Run Time: 1771.0 | Load Time:   23.4 || F1:  93.14 | Prec:  95.74 | Rec:  90.68 || Ex/s: 345.87



id,match_score,label,left_School_Name,right_School_Name
1,0.979907,1,ACMCL College,ACMCL College
2,0.9759,1,MY - AIMST University,AIMST University
3,0.141378,1,Missouri S&T,AIMST University
4,0.931575,1,"AKS University, Satna AKSU",AKS University
5,0.990576,1,AKS University,AKS University
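The `match_score` column can be turned into a predicted label and compared with the true label to find pairs the model gets wrong. The sketch below does this for the five rows shown above. The 0.5 cutoff and the helper name `disagreements` are assumptions made for illustration, not values taken from deepmatcher.

```python
# (id, match_score, label, left_School_Name, right_School_Name)
rows = [
    (1, 0.979907, 1, "ACMCL College", "ACMCL College"),
    (2, 0.9759, 1, "MY - AIMST University", "AIMST University"),
    (3, 0.141378, 1, "Missouri S&T", "AIMST University"),
    (4, 0.931575, 1, "AKS University, Satna AKSU", "AKS University"),
    (5, 0.990576, 1, "AKS University", "AKS University"),
]

THRESHOLD = 0.5  # assumed cutoff: scores at or above it count as a match


def disagreements(rows, threshold=THRESHOLD):
    """Return rows whose thresholded prediction differs from the label."""
    return [row for row in rows if int(row[1] >= threshold) != row[2]]


missed = disagreements(rows)
for row_id, score, label, left, right in missed:
    print(row_id, score, label, left, right)
```

Only the "Missouri S&T" / "AIMST University" pair is flagged. It is labeled as a match but scored low, so it may be worth checking whether the label itself is correct before treating it as a model error.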


In [16]:
valid_predictions.to_csv('/Users/seowookchoi/Desktop/entity/University/valid_predictions.csv')

Both out-of-sample evaluations (validation and test) show an F1 score of about 93%, which is fairly accurate. However, there is still plenty of room for improvement. Making the model more complex is unlikely to make a large difference; what is needed is a larger, higher-quality data set.