# Entity Resolution project @ Wavestone
## Entity Matching

> *Dataset information from [here](https://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html)*

> **Tristan PERROT**


## Run results

In [1]:
!python result_blocking.py

Using device: cuda
2025-01-30 16:18:44.324265: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-01-30 16:18:44.353747: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-30 16:18:44.353772: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-30 16:18:44.354644: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-30 16:18:44.359663: I tensorflow/cor

## Import libraries

In [1]:
import os

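# Move up the directory tree until the working directory is the project root,
# identified by the presence of the 'model' folder.
while 'model' not in os.listdir():
    os.chdir('..')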
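
The loop above relies on some ancestor directory containing a `model` folder. If none does, it never terminates, because `os.chdir('..')` at the filesystem root leaves the working directory unchanged. A minimal sketch of a bounded variant follows, assuming the same `model` marker; `find_project_root` is a hypothetical helper name and not part of the project.

```python
from pathlib import Path
from typing import Optional


def find_project_root(marker: str = "model", start: Optional[Path] = None) -> Path:
    """Return the closest ancestor of `start` (inclusive) that contains `marker`."""
    start = (start or Path.cwd()).resolve()
    # Path.parents is finite, so the search stops at the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains '{marker}'")
```

In the notebook this could be used as `os.chdir(find_project_root())`, which fails with a clear error instead of hanging when the marker is missing.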

In [2]:
import re
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    auc,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
    roc_curve,
)
import matplotlib.pyplot as plt
import seaborn as sns

from model.utils import load_data
import result_blocking
import src.import_data as import_data

ModuleNotFoundError: No module named 'result_blocking'

In [7]:
print('Available models:')
print(result_blocking.MODELS)
print()
print('Available datasets:')
print(result_blocking.DATA_NAMES)

Available models:
['sentence-transformers/allenai-specter', 'all-distilroberta-v1', 'all-mpnet-base-v2', 'multi-qa-mpnet-base-dot-v1']

Available datasets:
['fodors-zagats', 'amazon-google', 'abt-buy']


## Amazon-Google

In [8]:
data_name = 'amazon-google'

(table_a_serialized, table_b_serialized,
 X_train_ids, y_train,
 X_valid_ids, y_valid,
 X_test_ids, y_test) = load_data(os.path.join(import_data.DATA_FOLDER, data_name))

Table A columns:
    column_name data_example
1   description          NaN
2  manufacturer   broderbund
3         price          0.0 

Table B columns:
    column_name              data_example
1   description  learning quickbooks 2007
2  manufacturer                    intuit
3         price                     38.99 

Serialized entities 



In [11]:
preds = {}
results_dict = {}
for data_name in result_blocking.DATA_NAMES:
    results_dict[data_name] = {}
    (table_a_serialized, table_b_serialized,
     X_train_ids, y_train,
     X_valid_ids, y_valid,
     X_test_ids, y_test) = load_data(os.path.join(import_data.DATA_FOLDER, data_name))
    # Map each labelled test pair to its label for constant-time lookups
    # (tuples are hashable, lists are not).
    test_labels = {tuple(pair): label for pair, label in zip(X_test_ids, y_test)}
    for model_name in result_blocking.MODELS:
        results_dict[data_name][model_name] = {}
        for order_cols in result_blocking.LOAD_OPTIONS['order_cols']:
            results_dict[data_name][model_name][order_cols] = {}
            for remove_col_names in result_blocking.LOAD_OPTIONS['remove_col_names']:
                # Note: a model name containing '/' makes this path point into a nested sub-directory.
                dir_name = f'{data_name}-blocking/{model_name}-order_cols_{order_cols}-remove_col_names_{remove_col_names}'
                # Each line of pairs.csv holds one candidate pair as comma-separated integer ids.
                with open(os.path.join(dir_name, 'pairs.csv'), 'r') as f:
                    pairs = [tuple(int(x) for x in line.strip().split(',')) for line in f if line.strip()]
                # A candidate pair counts as a predicted match if it is in the labelled test set;
                # its true label is the test label, or 0 when the pair is unlabelled.
                y_pred = [1 if pair in test_labels else 0 for pair in pairs]
                y_true = [test_labels.get(pair, 0) for pair in pairs]

                precision = precision_score(y_true, y_pred)
                recall = recall_score(y_true, y_pred)
                f1 = f1_score(y_true, y_pred)
                # With hard 0/1 predictions, ROC AUC reduces to balanced accuracy.
                roc_auc = roc_auc_score(y_true, y_pred)
                accuracy = accuracy_score(y_true, y_pred)
                conf_matrix = confusion_matrix(y_true, y_pred)

                results_dict[data_name][model_name][order_cols][remove_col_names] = {
                    'precision': precision,
                    'recall': recall,
                    'f1': f1,
                    'roc_auc': roc_auc,
                    'accuracy': accuracy,
                    'confusion_matrix': conf_matrix
                }

Table A columns:
  column_name              data_example
1        addr  '435 s. la cienega blv.'
2        city             'los angeles'
3       phone              310/246-1501
4        type                  american 

Table B columns:
  column_name           data_example
1        addr  '10801 w. pico blvd.'
2        city              'west la'
3       phone           310-475-3585
4        type               american 

Serialized entities 



FileNotFoundError: [Errno 2] No such file or directory: 'fodors-zagats-blocking/sentence-transformers/allenai-specter-order_cols_True-remove_col_names_True/pairs.csv'
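
One possible contributor to this error is the model name itself: `sentence-transformers/allenai-specter` contains a `/`, so the formatted path gains an extra directory level (`fodors-zagats-blocking/sentence-transformers/…`). Whether the blocking script wrote its output under that nested path, under a flattened name, or not at all cannot be told from the notebook. The sketch below only illustrates the difference; `blocking_dir` and the `__` replacement are assumptions, not the project's naming convention.

```python
import os


def blocking_dir(data_name: str, model_name: str, order_cols: bool,
                 remove_col_names: bool, flatten: bool = True) -> str:
    """Build the directory path for one blocking run.

    With flatten=True, '/' in the model name is replaced by '__' so the run
    directory stays one level below '<data_name>-blocking' (assumed convention).
    With flatten=False, the path matches the f-string used in the loop above.
    """
    safe_model = model_name.replace("/", "__") if flatten else model_name
    return os.path.join(
        f"{data_name}-blocking",
        f"{safe_model}-order_cols_{order_cols}-remove_col_names_{remove_col_names}",
    )


nested = blocking_dir("fodors-zagats", "sentence-transformers/allenai-specter",
                      True, True, flatten=False)
flat = blocking_dir("fodors-zagats", "sentence-transformers/allenai-specter",
                    True, True)
# Report which of the two candidate locations actually exists on disk.
for path in (nested, flat):
    print(path, "->", "found" if os.path.isdir(path) else "missing")
```

If both locations are reported missing, the run for that configuration was probably never produced and `result_blocking.py` would need to be executed first.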

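The evaluation cell above scores candidate pairs against the labelled test pairs with classification metrics. Blocking output is also commonly summarised by pair completeness (share of true matches that survive blocking), pair quality (share of candidates that are true matches), and reduction ratio (share of the full cross product that was pruned). The following is a sketch of those measures under the assumption that candidates and gold matches are available as `(id_a, id_b)` pairs; `blocking_quality` is a hypothetical helper, not part of `result_blocking`.

```python
def blocking_quality(candidates, gold_matches, n_a: int, n_b: int) -> dict:
    """Summarise a candidate set produced by blocking.

    candidates   -- iterable of (id_a, id_b) pairs kept by the blocker
    gold_matches -- iterable of (id_a, id_b) pairs known to be true matches
    n_a, n_b     -- number of records in table A and table B
    """
    candidates = {tuple(pair) for pair in candidates}
    gold = {tuple(pair) for pair in gold_matches}
    kept_matches = len(candidates & gold)
    total_pairs = n_a * n_b
    return {
        # Recall of the blocker: true matches that are still candidates.
        "pair_completeness": kept_matches / len(gold) if gold else 0.0,
        # Precision of the blocker: candidates that are true matches.
        "pair_quality": kept_matches / len(candidates) if candidates else 0.0,
        # Fraction of the A x B cross product that no longer needs comparing.
        "reduction_ratio": 1.0 - len(candidates) / total_pairs if total_pairs else 0.0,
    }


# Toy ids for illustration only; these are not taken from the benchmark data.
toy = blocking_quality(
    candidates=[(0, 0), (1, 1), (2, 0)],
    gold_matches=[(0, 0), (1, 1), (2, 2)],
    n_a=3,
    n_b=3,
)
print(toy)
```
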
In [30]:
for data_name, models in results_dict.items():
    best_f1 = 0
    best_option = None
    for model_name, orders in models.items():
        for order_cols, removes in orders.items():
            for remove_col_names, metrics in removes.items():
                if metrics['f1'] > best_f1:
                    best_f1 = metrics['f1']
                    best_option = (model_name, order_cols, remove_col_names)
    print(f"Dataset: {data_name}, Best F1 Score: {best_f1}, Options: {best_option}")

Dataset: fodors-zagats, Best F1 Score: 1.0, Options: ('cross-encoder/stsb-roberta-base', True, True)
Dataset: amazon-google, Best F1 Score: 0.9034749034749034, Options: ('cross-encoder/stsb-roberta-base', False, True)
Dataset: abt-buy, Best F1 Score: 0.9488372093023256, Options: ('cross-encoder/stsb-roberta-base', False, False)

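The selection above walks the nested dictionary by hand and keeps the first configuration that reaches the highest F1. As an alternative sketch, the same dictionary can be flattened into a `pandas` table so the best row per dataset becomes a single `groupby`. It assumes `results_dict` has the shape built earlier (dataset, model, `order_cols`, `remove_col_names`, metrics); `flatten_results` and `best_per_dataset` are hypothetical helper names, and the toy values are placeholders rather than project results.

```python
import pandas as pd


def flatten_results(results: dict) -> pd.DataFrame:
    """Turn {dataset: {model: {order_cols: {remove_col_names: metrics}}}} into rows."""
    rows = []
    for data_name, models in results.items():
        for model_name, orders in models.items():
            for order_cols, removes in orders.items():
                for remove_col_names, metrics in removes.items():
                    rows.append({
                        "dataset": data_name,
                        "model": model_name,
                        "order_cols": order_cols,
                        "remove_col_names": remove_col_names,
                        # Keep scalar metrics only; the confusion matrix is an array.
                        **{k: v for k, v in metrics.items() if k != "confusion_matrix"},
                    })
    return pd.DataFrame(rows)


def best_per_dataset(df: pd.DataFrame, metric: str = "f1") -> pd.DataFrame:
    """Return the row with the highest `metric` for each dataset (first one on ties)."""
    best_rows = df.groupby("dataset")[metric].idxmax()
    return df.loc[best_rows].reset_index(drop=True)


# Placeholder values for illustration only; these are not results from the project.
toy_results = {
    "toy-dataset": {
        "model-a": {True: {True: {"f1": 0.50}, False: {"f1": 0.75}}},
        "model-b": {True: {True: {"f1": 0.25}, False: {"f1": 0.60}}},
    }
}
best = best_per_dataset(flatten_results(toy_results))
print(best)
```

With the real dictionary, `best_per_dataset(flatten_results(results_dict))` would give one row per dataset, and the flattened table can also be sorted or plotted with `seaborn` directly.
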

In [31]:
for data_name, models in results_dict.items():
    for model_name, orders in models.items():
        for order_cols, removes in orders.items():
            for remove_col_names, metrics in removes.items():
                print(f"Dataset: {data_name}, Model: {model_name}, Order Cols: {order_cols}, Remove Col Names: {remove_col_names}")
                print(f"Precision: {metrics['precision']}")
                print(f"Recall: {metrics['recall']}")
                print(f"F1: {metrics['f1']}")
                print(f"ROC AUC: {metrics['roc_auc']}")
                print(f"Accuracy: {metrics['accuracy']}")
                print(f"Confusion Matrix: {metrics['confusion_matrix']}")
                print()

Dataset: fodors-zagats, Model: cross-encoder/stsb-roberta-base, Order Cols: True, Remove Col Names: True
Precision: 1.0
Recall: 1.0
F1: 1.0
ROC AUC: 1.0
Accuracy: 1.0
Confusion Matrix: [[55  0]
 [ 0 11]]

Dataset: fodors-zagats, Model: cross-encoder/stsb-roberta-base, Order Cols: True, Remove Col Names: False
Precision: 1.0
Recall: 1.0
F1: 1.0
ROC AUC: 1.0
Accuracy: 1.0
Confusion Matrix: [[55  0]
 [ 0 11]]

Dataset: fodors-zagats, Model: cross-encoder/stsb-roberta-base, Order Cols: False, Remove Col Names: True
Precision: 1.0
Recall: 1.0
F1: 1.0
ROC AUC: 1.0
Accuracy: 1.0
Confusion Matrix: [[55  0]
 [ 0 11]]

Dataset: fodors-zagats, Model: cross-encoder/stsb-roberta-base, Order Cols: False, Remove Col Names: False
Precision: 1.0
Recall: 1.0
F1: 1.0
ROC AUC: 1.0
Accuracy: 1.0
Confusion Matrix: [[55  0]
 [ 0 11]]

Dataset: fodors-zagats, Model: cross-encoder/stsb-distilroberta-base, Order Cols: True, Remove Col Names: True
Precision: 1.0
Recall: 1.0
F1: 1.0
ROC AUC: 1.0
Accuracy: 1.0
Con