# US Patent Phrase to Phrase Matching

* Updated June 5 5pm by Nabil Arnaoot
* Submission to Kaggle Competition
* By Mark Sosa, Norelle Liang, and Nabil Arnaoot
* for Machine Learning class, MSBA Saint Mary's College April 22, 2022

### Summary and Dependencies

This notebook is set up to test and score many models very quickly, then choose a single one for submission to the competition.
* When trying out many models, run it with the internet turned on.
* After selecting the best model, use the companion notebook Model_Import_Notebook to create the datasets it needs to run without internet.

When you've chosen the model you want to submit, add it to Model_Import_Notebook and run that notebook to download the datasets the model needs. Then, in this notebook, click + Add Data, look under your datasets and notebook outputs, and add the output from Model_Import_Notebook. At that point the code below should work.

### Gotchas

* Make sure you've imported the right datasets: sentencetransformers220 and whatever model you're using.
* The model you're using comes from Model_Import_Notebook. Make sure you've run and saved that notebook, then loaded its output as a dataset into this notebook.
* The other tricky bit is which dataset you're using. The competition's test dataset has no score column, so you can't calculate the model score on it. Turn the scoring off in the function or you'll get an error message.
* If you're submitting the notebook, turn off internet and save it first.
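
One way to handle the scoring gotcha is a small guard that only computes the score when labels exist. The sketch below is illustrative only: `pearson` and `maybe_score` are hypothetical helpers that are not part of this notebook, and the sketch assumes the caller passes `None` for the labels when working with the competition test set.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain-Python Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def maybe_score(predictions, labels=None):
    """Return the Pearson score, or None when no labels are available."""
    if labels is None:
        return None
    return pearson(labels, predictions)

# With labels (training data) a score comes back; without them (test data) it is skipped.
print(maybe_score([0.1, 0.5, 0.9], [0.0, 0.5, 1.0]))
print(maybe_score([0.1, 0.5, 0.9]))
```

In this notebook the call could look like `maybe_score(cos_sim, df['score'] if 'score' in df.columns else None)`.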


# Setup: import libraries, load & preprocess data.

In [None]:
# Import datasets and libraries

import string
import sys

import numpy as np
import pandas as pd
import scipy
from scipy import stats

# sentence-transformers is attached as a Kaggle dataset so it can be imported without internet
sys.path.append('../input/sentencetransformers220/sentence-transformers-2.2.0')
from sentence_transformers import SentenceTransformer, util

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

from unidecode import unidecode

# Load the training data; head() is the last line so the preview is displayed
df = pd.read_csv('../input/us-patent-phrase-to-phrase-matching/train.csv')
df.head()

In [None]:
# Pre-process data

# Build the stop set once: NLTK English stop words plus punctuation from the string module.
STOPSET = set(stopwords.words('english')) | set(string.punctuation)

def pre_process(corpus):
    # Convert the input phrase to lower case.
    corpus = corpus.lower()
    # Tokenize with word_tokenize, then drop stop words and punctuation tokens.
    corpus = " ".join(token for token in word_tokenize(corpus) if token not in STOPSET)
    # Transliterate non-ASCII characters to their closest ASCII equivalents.
    corpus = unidecode(corpus)
    return corpus

df['target'] = df['target'].apply(pre_process)
df['anchor'] = df['anchor'].apply(pre_process)

# Grab the two columns we care about

input_anchor = df.anchor.to_list()
input_target = df.target.to_list()


# Prepare a function that will allow us to quickly test models.

In [None]:
# This function takes a model name (the model must be available as an input dataset to run
# without internet) plus the anchor and target lists. It returns the per-row cosine and dot
# scores, along with the Pearson correlation of each against df.score (which is very close
# to the score the leaderboard assigns).
# Note: it reads df.score from the global dataframe, so df must contain a score column.

def use_model(model_name, input_anchor, input_target):
    model = SentenceTransformer(model_name)
    anchor_vec = model.encode(input_anchor)
    target_vec = model.encode(input_target)

    # Score each anchor/target pair with both cosine similarity and dot product.
    cos_sim = []
    dot_sim = []
    for i in range(len(anchor_vec)):
        cos_sim.append(util.cos_sim(anchor_vec[i], target_vec[i])[0][0].item())
        dot_sim.append(util.dot_score(anchor_vec[i], target_vec[i])[0][0].item())

    # pearsonr returns (correlation, p-value); only the correlation is kept.
    cos_sim_model_score, _ = scipy.stats.pearsonr(df.score, cos_sim)
    dot_sim_model_score, _ = scipy.stats.pearsonr(df.score, dot_sim)
    return cos_sim, cos_sim_model_score, dot_sim, dot_sim_model_score
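
The function above reports two similarity measures for each pair. As a reminder of how they differ, here is a minimal sketch in plain Python; the vectors are made up for illustration and stand in for real embeddings, and `dot` and `cosine` are hypothetical helpers rather than the `util` functions used above.

```python
from math import sqrt

def dot(u, v):
    """Dot product: sensitive to both direction and vector length."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: the dot product of the two vectors after scaling each to unit length."""
    return dot(u, v) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))

anchor = [1.0, 2.0, 2.0]
target = [2.0, 4.0, 4.0]  # same direction as anchor, twice the length

print(dot(anchor, target))     # 18.0
print(cosine(anchor, target))  # 1.0
```

For a model whose embeddings are already unit-length, the two measures give the same number, so the cosine and dot columns only diverge for models that leave their embeddings unnormalized.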


# Run a bunch of models and see which one has the best results.

Turn on internet to run many models quickly, then find the best one for submission to the competition.
* The chosen model has to be made available without internet.
* Add the chosen model to the companion notebook Model_Import_Notebook, run that notebook to download the model, then load the resulting dataset into this notebook.
* Use the chosen model for submission with internet turned off.
* Before turning off the internet, change the cell below from code to markdown.

Models selected from here:  https://www.sbert.net/docs/pretrained_models.html

Note: the next cell is the longest one to run in this notebook.



In [None]:
# Try many models quickly (with internet turned on) and find the scores for each one.
# Based on these scores, we'll choose which model to use for submission to the competition.

# For each model we're testing: assign model name, then run function, grab generated scores for each row 
# and the overall model score

# create an empty dataframe to hold model names and scores
model_comparisons = pd.DataFrame(columns = ['Model', 'Cos_Sim_Model_Score', 'Dot_Model_Score'])

# First model we're testing:
model_name = 'all-mpnet-base-v2'
cos_sim_function_results1, cos_sim_function_score, dot_sim_results1, dot_sim_model_score = use_model('sentence-transformers/' + model_name, input_anchor, input_target)
# Add a row with .loc, since DataFrame.append was removed in pandas 2.0
model_comparisons.loc[len(model_comparisons)] = [model_name, cos_sim_function_score, dot_sim_model_score]

# Next model we're testing:
model_name = 'multi-qa-mpnet-base-dot-v1'
cos_sim_function_results2, cos_sim_function_score, dot_sim_results2, dot_sim_model_score = use_model('sentence-transformers/' + model_name, input_anchor, input_target)
model_comparisons.loc[len(model_comparisons)] = [model_name, cos_sim_function_score, dot_sim_model_score]

# Next model we're testing:
model_name = 'all-distilroberta-v1'
cos_sim_function_results3, cos_sim_function_score, dot_sim_results3, dot_sim_model_score = use_model('sentence-transformers/' + model_name, input_anchor, input_target)
model_comparisons.loc[len(model_comparisons)] = [model_name, cos_sim_function_score, dot_sim_model_score]

# Next model we're testing:
model_name = 'all-MiniLM-L12-v2'
cos_sim_function_results4, cos_sim_function_score, dot_sim_results4, dot_sim_model_score = use_model('sentence-transformers/' + model_name, input_anchor, input_target)
model_comparisons.loc[len(model_comparisons)] = [model_name, cos_sim_function_score, dot_sim_model_score]

# Next model we're testing:
model_name = 'all-MiniLM-L6-v2'
cos_sim_function_results5, cos_sim_function_score, dot_sim_results5, dot_sim_model_score = use_model('sentence-transformers/' + model_name, input_anchor, input_target)
model_comparisons.loc[len(model_comparisons)] = [model_name, cos_sim_function_score, dot_sim_model_score]
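
The repeated blocks in the cell above follow one pattern, so they could also be driven by a loop. The sketch below is one possible shape, not the notebook's actual code: `compare_models` is a hypothetical helper, and `fake_use_model` is a stand-in that returns made-up numbers so the example runs without downloading any model.

```python
def compare_models(model_names, score_fn):
    """Run score_fn on each model name and collect one summary row per model."""
    rows = []
    per_row_results = {}
    for name in model_names:
        cos_sim, cos_score, dot_sim, dot_score = score_fn(name)
        # Keep the per-row scores so they can be reused later (e.g. for an ensemble).
        per_row_results[name] = {'cos': cos_sim, 'dot': dot_sim}
        rows.append({'Model': name,
                     'Cos_Sim_Model_Score': cos_score,
                     'Dot_Model_Score': dot_score})
    return rows, per_row_results

# Stand-in for use_model: returns fabricated values purely for demonstration.
def fake_use_model(name):
    return [0.5, 0.6], 0.1 * len(name), [5.0, 6.0], 0.05 * len(name)

rows, results = compare_models(['model-a', 'model-bb'], fake_use_model)
best = max(rows, key=lambda r: r['Cos_Sim_Model_Score'])
print(best['Model'])  # model-bb
```

In this notebook, `score_fn` could be `lambda n: use_model('sentence-transformers/' + n, input_anchor, input_target)`, and `pd.DataFrame(rows)` would give the same comparison table.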


In [None]:
# print results
print("\n\nYour results are:")
model_comparisons

In [None]:
# Try an ensemble model by averaging the cos_sim results of the other 5

r5 = np.array(cos_sim_function_results5)
r4 = np.array(cos_sim_function_results4)
r3 = np.array(cos_sim_function_results3)
r2 = np.array(cos_sim_function_results2)
r1 = np.array(cos_sim_function_results1)
avg = (r5 + r4 + r3 + r2 + r1)/5

avg_model_score, p = scipy.stats.pearsonr(df.score, avg)
print("Averaging the cos_sim scores of your five models gives a Pearson correlation of", avg_model_score)


# Try an ensemble model by averaging the dot results of the other 5

d5 = np.array(dot_sim_results5)
d4 = np.array(dot_sim_results4)
d3 = np.array(dot_sim_results3)
d2 = np.array(dot_sim_results2)
d1 = np.array(dot_sim_results1)
avg_dot = (d5 + d4 + d3 + d2 + d1)/5

avg_model_score, p = scipy.stats.pearsonr(df.score, avg_dot)
print("Averaging the dot scores of your five models gives a Pearson correlation of", avg_model_score)
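
The averages above weight every model equally on its raw scale. That is reasonable for cosine similarities, which all lie in [-1, 1], but dot-product scores from models with unnormalized embeddings can have very different ranges, so one model may dominate the average. One option, which this notebook does not use, is to standardize each model's scores before averaging. The sketch below illustrates that idea with made-up numbers; `standardize` and `ensemble` are hypothetical helpers.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale scores to zero mean and unit variance (assumes they are not all identical)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def ensemble(score_lists, normalize=False):
    """Average the scores row by row, optionally standardizing each model first."""
    if normalize:
        score_lists = [standardize(s) for s in score_lists]
    return [mean(column) for column in zip(*score_lists)]

model_a = [0.2, 0.4, 0.9]     # made-up scores on a small scale
model_b = [10.0, 30.0, 20.0]  # made-up scores on a much larger scale

print(ensemble([model_a, model_b]))                  # dominated by model_b
print(ensemble([model_a, model_b], normalize=True))  # both models contribute equally
```

Because Pearson correlation is unchanged by shifting or rescaling the predictions, standardizing a single model's scores does not alter its own score; it only changes how much weight each model carries in the average.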

# Run the chosen model without internet, then write the datafile for your submission.

In [None]:
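# Load the chosen model from the dataset produced by Model_Import_Notebook (no internet needed),
# then score each anchor/target pair with cosine similarity.
model = SentenceTransformer("../input/model-import-notebook/all-mpnet-base-v2")
anchor_vec = model.encode(input_anchor)
target_vec = model.encode(input_target)
cos_sim = []
for i in range(len(anchor_vec)):
    sim = util.cos_sim(anchor_vec[i], target_vec[i])
    cos_sim.append(sim[0][0].item())
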

In [None]:
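# Write the datafile for your submission
# Note: df is loaded from train.csv in the setup cell. For the actual competition submission,
# the ids and phrases need to come from the competition's test.csv instead.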

data = {'id': df.id, 'score': cos_sim}
submission = pd.DataFrame(data)
submission.to_csv('submission.csv', index=False)
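
Before submitting, it can help to read the file back and confirm it has the expected shape. The sketch below is a hypothetical check, not part of the original notebook: `check_submission` is an illustrative helper, and the demo file, ids, and scores are made up. It assumes the `id,score` column layout written by the cell above.

```python
import csv
import os
import tempfile

def check_submission(path, expected_ids):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ['id', 'score']:
            problems.append(f'unexpected columns: {reader.fieldnames}')
            return problems
        rows = list(reader)
    # The ids must match the expected ids, in the same order.
    if [row['id'] for row in rows] != list(expected_ids):
        problems.append('ids do not match the expected ids')
    # Every score must parse as a number.
    for row in rows:
        try:
            float(row['score'])
        except ValueError:
            problems.append(f"non-numeric score for id {row['id']}")
    return problems

# Tiny demonstration with made-up ids and scores.
demo_path = os.path.join(tempfile.mkdtemp(), 'submission.csv')
with open(demo_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'score'])
    writer.writerows([['a1', 0.25], ['b2', 0.75]])

print(check_submission(demo_path, ['a1', 'b2']))  # []
```

In this notebook the call could be `check_submission('submission.csv', df.id.to_list())`.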