# Pretrained sentence transformer model baseline

Before we start, check out my quick EDA over [here](https://www.kaggle.com/code/tanlikesmath/u-s-patent-phrase-to-phrase-matching-simple-eda).

Also, **before forking and submitting, please support and upvote this notebook**.


In this notebook, I provide a quick baseline. I use the [Sentence Transformers](https://www.sbert.net/) library, which is commonly used for text similarity tasks. It comes with some pretrained models, which I use here.


## Imports

Submissions have to run without internet access, so I exported the library to a dataset over [here](https://www.kaggle.com/tanlikesmath/sentence-transformers-dataset) and install it from there.

In [None]:
!cp -r ../input/sentence-transformers-dataset/sentence-transformers /tmp/st
!pip install /tmp/st

In [None]:
from sentence_transformers import SentenceTransformer, util
import numpy as np
import pandas as pd

## Get model

I use the "all-MiniLM-L6-v2" model, which is the one loaded in the cell below. The model card is [here](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). Submissions have to run without internet access, so I exported the pretrained model as a dataset over [here](https://www.kaggle.com/code/tanlikesmath/sentence-transformers-models).

In [None]:
model = SentenceTransformer('../input/sentence-transformers-models/all-MiniLM-L6-v2')

Okay, now let's process the test dataset. We will encode the anchors and the targets separately, which gives one embedding vector per phrase:

In [None]:
test_df = pd.read_csv('../input/us-patent-phrase-to-phrase-matching/test.csv')

In [None]:
test_df.head()

In [None]:
anchors = test_df.anchor.values
targets = test_df.target.values
embedding1 = model.encode(anchors, convert_to_tensor=True)
embedding2 = model.encode(targets, convert_to_tensor=True)

Now let's use the cosine similarity function to calculate similarity between the sentences.

In [None]:
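import torch.nn.functional as F

# Only the similarity between each anchor and its own target is needed,
# so compare the embeddings row by row instead of building the full N x N matrix.
scores = F.cosine_similarity(embedding1, embedding2, dim=1).cpu().tolist()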
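
Only the similarity between each anchor and its own target is needed, which is the diagonal of the full anchor-by-target similarity matrix. Here is a minimal sketch on made-up vectors, written in plain NumPy rather than with the library's own functions, showing that a row-by-row cosine similarity gives the same numbers as that diagonal. The helper name `pairwise_cosine` and the toy arrays are just for illustration.

```python
import numpy as np

def pairwise_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between a[i] and b[i] for every row i."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_unit * b_unit, axis=1)

# Toy stand-ins for the anchor and target embeddings (random, made-up values).
rng = np.random.default_rng(0)
toy_anchor = rng.normal(size=(5, 8))
toy_target = rng.normal(size=(5, 8))

# Row-by-row similarities: one number per (anchor, target) pair.
row_wise = pairwise_cosine(toy_anchor, toy_target)

# Full 5 x 5 matrix of every anchor against every target.
anchor_unit = toy_anchor / np.linalg.norm(toy_anchor, axis=1, keepdims=True)
target_unit = toy_target / np.linalg.norm(toy_target, axis=1, keepdims=True)
full = anchor_unit @ target_unit.T

# The diagonal of the full matrix matches the row-by-row result.
assert np.allclose(row_wise, np.diag(full))
```

The full matrix grows with the square of the number of rows, so the row-by-row version is the cheaper one when the test set is large.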

Now we can submit!

In [None]:
sample_df = pd.read_csv('../input/us-patent-phrase-to-phrase-matching/sample_submission.csv')
sample_df['score'] = scores
sample_df.head()
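
Writing the scores straight into the sample submission assumes its rows are in the same order as `test.csv`. If you want to be safe, here is a minimal sketch of matching the scores by identifier instead. It assumes both files share an `id` column, which this notebook never uses directly, and `attach_scores` and the toy frames are hypothetical names made up for the example.

```python
import pandas as pd

def attach_scores(test_df: pd.DataFrame, sample_df: pd.DataFrame, scores: list) -> pd.DataFrame:
    """Return a submission frame with one score per row, matched by id."""
    if len(scores) != len(test_df):
        raise ValueError("expected exactly one score per test row")
    scored = pd.DataFrame({"id": test_df["id"].values, "score": scores})
    # A left merge keeps the sample submission's row order;
    # validate raises if an id is duplicated on either side.
    return sample_df[["id"]].merge(scored, on="id", how="left", validate="one_to_one")

# Toy frames (made-up ids) where the sample submission is in a different order.
toy_test = pd.DataFrame({"id": ["a", "b", "c"]})
toy_sample = pd.DataFrame({"id": ["c", "a", "b"], "score": [0, 0, 0]})

toy_submission = attach_scores(toy_test, toy_sample, [0.1, 0.2, 0.3])
print(toy_submission)
```

If the two files are already in the same order, this gives the same result as the direct assignment.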

In [None]:
sample_df.to_csv('submission.csv', index=False)

Now, **WE ARE DONE!**

If you enjoyed this notebook, please give it an upvote.

If you have any questions or suggestions, please leave a comment!