# Entity Resolution project @ Wavestone
## Abt-Buy Products Matching

> *Dataset information is available [here](Datasets.md). \
> Description still to be written. Only the raw data is used, because the non-raw version was already pre-processed.*

> **Tristan PERROT**


## Import libraries


In [1]:
import os

import torch
import pickle

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

Using device: cpu


In [2]:
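while 'model' not in os.listdir():  # climb until the project root (the folder containing `model`)
    if os.getcwd() == os.path.dirname(os.getcwd()):
        raise FileNotFoundError("No 'model' directory found in any parent folder")
    os.chdir('..')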

In [3]:
MODEL_NAME = ''
DATA_NAME = 'abt-buy'
COMPUTER = 'gpu4.enst.fr:0'
DATA_DIR = os.path.join('data', DATA_NAME)

## Load data

In [4]:
import numpy as np

from model.BertModel import BertModel
from model.utils import deserialize_entities, load_data

In [8]:
X_train_ids, y_train, X_valid_ids, y_valid, X_test_ids, y_test = load_data(DATA_DIR, comp_er=False)
with open(os.path.join(DATA_DIR, '1_serialized.pkl'), 'rb') as f:
    table_a_serialized = pickle.load(f)
with open(os.path.join(DATA_DIR, '2_serialized.pkl'), 'rb') as f:
    table_b_serialized = pickle.load(f)

In [9]:
X_train = [table_a_serialized[i[0]] + ' [SEP] ' + table_b_serialized[i[1]] for i in X_train_ids]
X_valid = [table_a_serialized[i[0]] + ' [SEP] ' + table_b_serialized[i[1]] for i in X_valid_ids]
X_test = [table_a_serialized[i[0]] + ' [SEP] ' + table_b_serialized[i[1]] for i in X_test_ids]
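
The pickled tables hold one serialized string per record, and each pair is joined with ` [SEP] `. The samples printed in this notebook suggest a `[COL] <attribute> [VAL] <value>` layout. Below is a minimal sketch of how such strings could be produced, assuming each record is a plain dict. `serialize_entity` and `serialize_pair` are hypothetical helpers written for illustration, not the functions used to build the `.pkl` files.

```python
def serialize_entity(record):
    """Flatten one record (attribute -> value) into '[COL] attr [VAL] value' form."""
    parts = []
    for column, value in record.items():
        # Missing values are kept as an empty string so the column still appears.
        text = '' if value is None else str(value)
        parts.append(f'[COL] {column} [VAL] {text}')
    return ' '.join(parts)


def serialize_pair(record_a, record_b):
    """Join two serialized records with the separator used in this notebook."""
    return serialize_entity(record_a) + ' [SEP] ' + serialize_entity(record_b)


# Hypothetical records, for illustration only.
abt = {'name': 'Sony DVD Player', 'price': None}
buy = {'name': 'Sony DVP-NS700H/B DVD Player', 'manufacturer': 'Sony'}
pair = serialize_pair(abt, buy)
print(pair)
```

Keeping empty attributes in the string means every record from the same table exposes the same columns to the model, at the cost of a few extra tokens.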

In [10]:
# Display the first 5 samples of the training set
for i in range(5):
    print(f'Sample {i}:')
    print(X_train[i])
    print(f'Label: {y_train[i]}')
    print()

Sample 0:
[COL] name [VAL] Sony Silver Cyber-Shot Digital Camera - DSCS750 [COL] description [VAL] Sony Silver Cyber-Shot Digital Camera - DSCS750/ 7.2 Megapixel Super HAD CCD/ 2.5' LCD Display/ 3x Optical Zoom/ High Sensitivity Mode/ Auto Focus/ 22 MB Internal Memory/ Face Detection/ Burst Mode/ Convenient Photo Modes/ Beginner-Friendly Function Guide/ Silver Finish [COL] price [VAL]  [SEP] [COL] name [VAL] OmniMount Ultra Low Profile ULPT-L Flat Panel Wall Mount - ULPT-L A [COL] description [VAL] 200 lb - Anthracite [COL] manufacturer [VAL] OMNIMOUNT SYSTEMS, INC [COL] price [VAL] 278.95
Label: 0

Sample 1:
[COL] name [VAL] Canon Black EOS Rebel XSi Digital SLR Camera - XSIREB1855 [COL] description [VAL] Canon Black EOS Rebel XSi Digital SLR Camera - XSIREB1855/ 12.2 Megapixel/ DIGIC III Image Processor/ Extensive Noise Reduction Technology/ Auto Optimization/ 3.0' LCD Monitor/ Compatible With Compact SD And SDHC Memory Cards/ EOS Integrated Cleaning System/ 18-55MM Lens Included/ 27

### RoBERTa Base
- Architecture: Transformer-based model
- Parameters: ~125 million
- Layers: 12 Transformer layers
- Hidden Size: 768
- Attention Heads: 12

In [26]:
MODEL_NAME = 'roberta-base'

In [None]:
model = BertModel(model_name=MODEL_NAME, study_name=DATA_NAME + "/" + MODEL_NAME + "@" + COMPUTER, device=device)
train_loader, val_loader, test_loader = model.prepare_data(X_train, y_train, X_valid, y_valid, X_test, y_test, batch_size=32, num_workers=4)
model.fit(train_loader, val_loader, epochs=3, lr=2e-5, weight_decay=0.01, early_stopping=True, patience=3)

In [24]:
y_pred = model.evaluate(test_loader)

Testing: 100%|██████████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  4.64batch/s]

              precision    recall  f1-score   support

           0       1.00      0.99      1.00       166
           1       0.99      1.00      1.00       166

    accuracy                           1.00       332
   macro avg       1.00      1.00      1.00       332
weighted avg       1.00      1.00      1.00       332

Test loss: 0.01014
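
The per-class figures in the report follow directly from the confusion counts. The sketch below recomputes them in plain Python. The counts are hypothetical: the notebook does not print the confusion matrix, and a single non-match predicted as a match is just one error pattern that is consistent with the rounded values above.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from its raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts for 166 non-matches and 166 matches.
tn, fp, fn, tp = 165, 1, 0, 166

# Class 1 (match): positives are the matching pairs.
p1, r1, f1_1 = precision_recall_f1(tp=tp, fp=fp, fn=fn)
# Class 0 (non-match): the roles of the counts are swapped.
p0, r0, f1_0 = precision_recall_f1(tp=tn, fp=fn, fn=fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}')
print(f'class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}')
print(f'accuracy={accuracy:.2f}')
```

With only 332 test pairs, one misclassified pair shifts accuracy by about 0.3 percentage points, so differences in the second decimal should be read with care.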





In [25]:
# Example of a prediction
print(f'Prediction: {y_pred[0]}')
print(f'Label: {y_test[0]}')
print(f'Sample: {X_test[0]}')

Prediction: 0
Label: 0
Sample: <col>name<val>LaCie 500GB d2 Quadra External Hard Drive - 301825U <col>description<val>LaCie 500GB d2 Quadra External Hard Drive - 301825U/ Quadruple Interface For Full PC And Mac Compatibility/ Interface Bandwidth Up To 3Gbits/s (eSATA)/ Advanced Aluminum Heat Sink Design Cooling System For Quiet Operation/ 7200 Rotational Speed (rpm)/ 16MB Cache/ Compatible With Time Machine <col>price<val>$159.00<sep><col>name<val>Canon EF 75-300mm f/4-5.6 III Telephoto Zoom Lens - 6473A003 <col>description<val>f/4 to 5.6 <col>manufacturer<val>Canon <col>price<val>$159.95


In [26]:
# Index of the first positive (matching) pair in the test set
first_match = np.nonzero(y_test)[0][0]
e1_df, e2_df = deserialize_entities(X_test[first_match])
print('Entity 1:')
display(e1_df)
print('Entity 2:')
display(e2_df)
print(f'Label: {y_test[first_match]}')
print(f'Prediction: {y_pred[first_match]}')

Entity 1:


| | name | description | price |
|---|---|---|---|
| 0 | Sony Black 1080p Upscaling DVD Player - DVPNS7... | Sony Black 1080p Upscaling DVD Player - DVPNS7... | |


Entity 2:


| | name | description | manufacturer | price |
|---|---|---|---|---|
| 0 | Sony DVP-NS700H/B DVD Player | DVD-RW, DVD+RW, DVD-R, DVD+R, CD-RW - DVD Vide... | Sony | |


Label: 1
Prediction: 1
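
`deserialize_entities` comes from `model.utils`, and its implementation is not shown here. As a rough illustration of the idea, the sketch below splits a serialized pair back into two attribute dictionaries. It assumes the `[COL]` / `[VAL]` / `[SEP]` markers seen in the training samples; `parse_entity` and `parse_pair` are hypothetical names, and the real helper returns DataFrames rather than dicts.

```python
import re

# One '[COL] attr [VAL] value' group; the value runs until the next '[COL]' or the end.
_FIELD = re.compile(r'\[COL\]\s*(.*?)\s*\[VAL\]\s*(.*?)\s*(?=\[COL\]|$)', re.DOTALL)


def parse_entity(serialized):
    """Parse one serialized record into a dict of attribute -> value."""
    return {column: value for column, value in _FIELD.findall(serialized)}


def parse_pair(serialized_pair):
    """Split on the first '[SEP]' and parse both sides."""
    left, right = serialized_pair.split('[SEP]', 1)
    return parse_entity(left), parse_entity(right)


# Hypothetical input in the same layout as the training samples.
example = ('[COL] name [VAL] Sony DVD Player [COL] price [VAL]  [SEP] '
           '[COL] name [VAL] Sony DVP-NS700H/B DVD Player [COL] manufacturer [VAL] Sony')
entity_1, entity_2 = parse_pair(example)
print(entity_1)
print(entity_2)
```

This simple parser would misread a value that itself contains the literal text `[COL]` or `[SEP]`, which is one reason to treat it as a sketch only.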


### DistilRoBERTa
- Parameters: ~82 million
- Layers: 6 Transformer layers (half of RoBERTa-base)
- Hidden Size: 768 (same as RoBERTa-base)
- Attention Heads: 12 (same as RoBERTa-base)
- About twice as fast as RoBERTa-base on average, according to its model card.
- Has roughly a third fewer parameters (~82 million vs. ~125 million) while retaining most of RoBERTa-base's performance on common benchmarks.
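
The parameter counts can be checked with simple arithmetic from the layer count and hidden size listed above. The sketch below counts the weights of a BERT-style encoder. The vocabulary size (50265), the number of positions (514) and the single token type are taken from the public `roberta-base` configuration and are assumptions here, as is the feed-forward size of 3072. `encoder_params` is a hypothetical helper, and the count covers the encoder backbone with its pooler, not the classification head.

```python
def encoder_params(vocab_size, max_positions, type_vocab, layers,
                   hidden=768, intermediate=3072, pooler=True):
    """Count weights and biases of a BERT-style encoder."""
    layer_norm = 2 * hidden  # scale + bias
    # Token, position and token-type embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections (with biases), then LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden), then LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    pooler_params = hidden * hidden + hidden if pooler else 0
    return embeddings + layers * (attention + feed_forward) + pooler_params


roberta = encoder_params(50265, 514, 1, layers=12)
distil = encoder_params(50265, 514, 1, layers=6)
reduction = 1 - distil / roberta

print(f'roberta-base:       {roberta / 1e6:.1f}M parameters')
print(f'distilroberta-base: {distil / 1e6:.1f}M parameters')
print(f'Reduction:          {reduction:.1%}')
```

Because the embedding table is shared in size between the two models, halving the number of layers removes about a third of the parameters rather than half.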

In [27]:
MODEL_NAME = 'distilroberta-base'

In [28]:
model = BertModel(model_name=MODEL_NAME, study_name=DATA_NAME + "/" + MODEL_NAME + "@" + COMPUTER, device=device)
train_loader, val_loader, test_loader = model.prepare_data(X_train, y_train, X_valid, y_valid, X_test, y_test, batch_size=32, num_workers=4)
model.fit(train_loader, val_loader, epochs=3, lr=2e-5, weight_decay=0.01, early_stopping=True, patience=3)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model initialized
Device: cuda:0
Model: distilroberta-base
Study name: Abt-Buy/distilroberta-base@gpu6.enst.fr:0
Epoch 1/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:15<00:00,  3.09batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:01<00:00,  7.97batch/s]


Train loss: 0.4401
Val loss: 0.0743
              precision    recall  f1-score   support

           0       1.00      0.96      0.98       166
           1       0.97      1.00      0.98       166

    accuracy                           0.98       332
   macro avg       0.98      0.98      0.98       332
weighted avg       0.98      0.98      0.98       332

Epoch 2/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:15<00:00,  3.09batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:01<00:00,  8.16batch/s]


Train loss: 0.0536
Val loss: 0.0253
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       166
           1       0.99      1.00      0.99       166

    accuracy                           0.99       332
   macro avg       0.99      0.99      0.99       332
weighted avg       0.99      0.99      0.99       332

Epoch 3/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:15<00:00,  3.09batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:01<00:00,  8.11batch/s]


Train loss: 0.0196
Val loss: 0.0113
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       166
           1       0.99      1.00      0.99       166

    accuracy                           0.99       332
   macro avg       0.99      0.99      0.99       332
weighted avg       0.99      0.99      0.99       332

Epoch 4/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:15<00:00,  3.09batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:01<00:00,  8.20batch/s]


Train loss: 0.0158
Val loss: 0.0619
              precision    recall  f1-score   support

           0       0.97      1.00      0.99       166
           1       1.00      0.97      0.98       166

    accuracy                           0.98       332
   macro avg       0.99      0.98      0.98       332
weighted avg       0.99      0.98      0.98       332

Epoch 5/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:15<00:00,  3.09batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:01<00:00,  7.73batch/s]


Train loss: 0.0422
Val loss: 0.0157
              precision    recall  f1-score   support

           0       1.00      0.99      1.00       166
           1       0.99      1.00      1.00       166

    accuracy                           1.00       332
   macro avg       1.00      1.00      1.00       332
weighted avg       1.00      1.00      1.00       332

Epoch 6/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:15<00:00,  3.09batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:01<00:00,  8.17batch/s]

Train loss: 0.0140
Val loss: 0.0131
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       166
           1       0.99      0.99      0.99       166

    accuracy                           0.99       332
   macro avg       0.99      0.99      0.99       332
weighted avg       0.99      0.99      0.99       332

Early stopping triggered
Best weights restored
Mean time per epoch: 16.91
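
The run above stops after epoch 6 even though 10 epochs were scheduled. One reading consistent with the log is patience-based early stopping on the validation loss: training stops once the loss has failed to improve for `patience` consecutive epochs, and the weights from the best epoch are restored. The sketch below replays the logged validation losses through that rule. `EarlyStopper` is a hypothetical class, and the exact criterion inside `BertModel.fit` is not shown, so this is an assumption.

```python
class EarlyStopper:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float('inf')
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            # A real training loop would also snapshot the model weights here.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Validation losses copied from the DistilRoBERTa log above.
val_losses = [0.0743, 0.0253, 0.0113, 0.0619, 0.0157, 0.0131]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f'Stopped after epoch {stopped_at}; best epoch was {stopper.best_epoch} '
      f'(val loss {stopper.best_loss})')
```

Under this rule the best epoch is the third one, and three non-improving epochs follow it, which matches the point where the log reports that early stopping was triggered.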





In [29]:
y_pred_disti = model.evaluate(test_loader)

Testing: 100%|██████████████████████████████████████████████████████████████| 11/11 [00:01<00:00,  8.12batch/s]

              precision    recall  f1-score   support

           0       0.98      0.99      0.99       166
           1       0.99      0.98      0.98       166

    accuracy                           0.98       332
   macro avg       0.99      0.98      0.98       332
weighted avg       0.99      0.98      0.98       332

Test loss: 0.02789





### BERT Base
- Parameters: ~110 million
- Layers: 12 Transformer layers
- Hidden Size: 768
- Attention Heads: 12

In [30]:
MODEL_NAME = 'bert-base-uncased'

In [31]:
model = BertModel(model_name=MODEL_NAME, study_name=DATA_NAME + "/" + MODEL_NAME + "@" + COMPUTER, device=device)
train_loader, val_loader, test_loader = model.prepare_data(X_train, y_train, X_valid, y_valid, X_test, y_test, batch_size=32, num_workers=4)
model.fit(train_loader, val_loader, epochs=3, lr=2e-5, weight_decay=0.01, early_stopping=True, patience=3)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model initialized
Device: cuda:0
Model: bert-base-uncased
Study name: Abt-Buy/bert-base-uncased@gpu6.enst.fr:0
Epoch 1/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:30<00:00,  1.58batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  4.43batch/s]


Train loss: 0.4517
Val loss: 0.1832
              precision    recall  f1-score   support

           0       1.00      0.87      0.93       166
           1       0.88      1.00      0.94       166

    accuracy                           0.93       332
   macro avg       0.94      0.93      0.93       332
weighted avg       0.94      0.93      0.93       332

Epoch 2/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:30<00:00,  1.58batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  4.40batch/s]


Train loss: 0.0784
Val loss: 0.0632
              precision    recall  f1-score   support

           0       1.00      0.97      0.98       166
           1       0.97      1.00      0.99       166

    accuracy                           0.98       332
   macro avg       0.99      0.98      0.98       332
weighted avg       0.99      0.98      0.98       332

Epoch 3/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:30<00:00,  1.58batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  4.41batch/s]


Train loss: 0.0349
Val loss: 0.0169
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       166
           1       0.99      1.00      0.99       166

    accuracy                           0.99       332
   macro avg       0.99      0.99      0.99       332
weighted avg       0.99      0.99      0.99       332

Epoch 4/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:30<00:00,  1.58batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  4.42batch/s]


Train loss: 0.0181
Val loss: 0.0172
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       166
           1       0.99      1.00      0.99       166

    accuracy                           0.99       332
   macro avg       0.99      0.99      0.99       332
weighted avg       0.99      0.99      0.99       332

Epoch 5/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:30<00:00,  1.58batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  4.44batch/s]


Train loss: 0.0288
Val loss: 0.0322
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       166
           1       0.98      1.00      0.99       166

    accuracy                           0.99       332
   macro avg       0.99      0.99      0.99       332
weighted avg       0.99      0.99      0.99       332

Epoch 6/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:30<00:00,  1.58batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  4.44batch/s]

Train loss: 0.0165
Val loss: 0.0227
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       166
           1       0.99      1.00      0.99       166

    accuracy                           0.99       332
   macro avg       0.99      0.99      0.99       332
weighted avg       0.99      0.99      0.99       332

Early stopping triggered
Best weights restored
Mean time per epoch: 32.90





In [32]:
y_pred_bert = model.evaluate(test_loader)

Testing: 100%|██████████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  4.39batch/s]

              precision    recall  f1-score   support

           0       1.00      0.99      1.00       166
           1       0.99      1.00      1.00       166

    accuracy                           1.00       332
   macro avg       1.00      1.00      1.00       332
weighted avg       1.00      1.00      1.00       332

Test loss: 0.01531





### Electra
- Parameters: ~110 million
- Layers: 12 Transformer layers
- Hidden Size: 768
- Attention Heads: 12

In [33]:
MODEL_NAME = 'google/electra-base-discriminator'

In [34]:
model = BertModel(model_name=MODEL_NAME, study_name=DATA_NAME + "/" + MODEL_NAME + "@" + COMPUTER, device=device)
train_loader, val_loader, test_loader = model.prepare_data(X_train, y_train, X_valid, y_valid, X_test, y_test, batch_size=32, num_workers=4)
model.fit(train_loader, val_loader, epochs=3, lr=2e-5, weight_decay=0.01, early_stopping=True, patience=3)

Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model initialized
Device: cuda:0
Model: google/electra-base-discriminator
Study name: Abt-Buy/google/electra-base-discriminator@gpu6.enst.fr:0
Epoch 1/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:34<00:00,  1.37batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  3.94batch/s]


Train loss: 0.4390
Val loss: 0.2168
              precision    recall  f1-score   support

           0       1.00      0.84      0.92       166
           1       0.86      1.00      0.93       166

    accuracy                           0.92       332
   macro avg       0.93      0.92      0.92       332
weighted avg       0.93      0.92      0.92       332

Epoch 2/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:35<00:00,  1.37batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  3.92batch/s]


Train loss: 0.0514
Val loss: 0.0411
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       166
           1       0.98      1.00      0.99       166

    accuracy                           0.99       332
   macro avg       0.99      0.99      0.99       332
weighted avg       0.99      0.99      0.99       332

Epoch 3/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:35<00:00,  1.37batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  3.92batch/s]


Train loss: 0.0407
Val loss: 0.0462
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       166
           1       0.98      1.00      0.99       166

    accuracy                           0.99       332
   macro avg       0.99      0.99      0.99       332
weighted avg       0.99      0.99      0.99       332

Epoch 4/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:35<00:00,  1.37batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  3.89batch/s]


Train loss: 0.0241
Val loss: 0.0233
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       166
           1       0.98      1.00      0.99       166

    accuracy                           0.99       332
   macro avg       0.99      0.99      0.99       332
weighted avg       0.99      0.99      0.99       332

Epoch 5/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:34<00:00,  1.37batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  3.89batch/s]


Train loss: 0.0213
Val loss: 0.0268
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       166
           1       0.98      1.00      0.99       166

    accuracy                           0.99       332
   macro avg       0.99      0.99      0.99       332
weighted avg       0.99      0.99      0.99       332

Epoch 6/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:35<00:00,  1.37batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  3.90batch/s]


Train loss: 0.0138
Val loss: 0.0239
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       166
           1       0.98      1.00      0.99       166

    accuracy                           0.99       332
   macro avg       0.99      0.99      0.99       332
weighted avg       0.99      0.99      0.99       332

Epoch 7/10


Training: 100%|█████████████████████████████████████████████████████████████| 48/48 [00:35<00:00,  1.37batch/s]
Validation: 100%|███████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  3.90batch/s]


Train loss: 0.0122
Val loss: 0.0301
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       166
           1       0.98      1.00      0.99       166

    accuracy                           0.99       332
   macro avg       0.99      0.99      0.99       332
weighted avg       0.99      0.99      0.99       332

Early stopping triggered
Best weights restored
Mean time per epoch: 37.85


In [35]:
y_pred_electra = model.evaluate(test_loader)

Testing: 100%|██████████████████████████████████████████████████████████████| 11/11 [00:02<00:00,  3.88batch/s]

              precision    recall  f1-score   support

           0       1.00      0.99      0.99       166
           1       0.99      1.00      0.99       166

    accuracy                           0.99       332
   macro avg       0.99      0.99      0.99       332
weighted avg       0.99      0.99      0.99       332

Test loss: 0.01888
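
To compare the four runs side by side, the figures reported in the outputs above can be collected in a small table. The sketch below only restates numbers already printed in this notebook. The epoch time is assumed to be in seconds, and it is left empty for `roberta-base` because no training log is shown for that run.

```python
# (model, test accuracy, test loss, mean time per epoch or None if not logged)
results = [
    ('roberta-base', 1.00, 0.01014, None),
    ('distilroberta-base', 0.98, 0.02789, 16.91),
    ('bert-base-uncased', 1.00, 0.01531, 32.90),
    ('google/electra-base-discriminator', 0.99, 0.01888, 37.85),
]

print(f"{'model':<36}{'acc':>6}{'loss':>10}{'s/epoch':>10}")
# Sort by test loss, lowest first.
for name, acc, loss, secs in sorted(results, key=lambda row: row[2]):
    secs_text = f'{secs:.2f}' if secs is not None else 'n/a'
    print(f'{name:<36}{acc:>6.2f}{loss:>10.5f}{secs_text:>10}')

best = min(results, key=lambda row: row[2])[0]
print(f'Lowest test loss: {best}')
```

The test set holds only 332 pairs, so the gaps between models correspond to a handful of pairs and may not be significant.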



