# 02. Neural embeddings

We're going to use existing pretrained models trained on an English corpus. [Sentence-Transformers](https://www.sbert.net/) is a library that exposes them in an easy way. Let's start by checking what their embeddings look like on our dataset.

In [None]:
from datasets import load_dataset

import pandas as pd

In [None]:
tweet_qa_dataset = load_dataset("tweet_qa")
train_df = pd.DataFrame(tweet_qa_dataset["train"])

In [None]:
from sentence_transformers import SentenceTransformer

There are plenty of pretrained models available, and with Sentence-Transformers you can use any of the ones listed here: https://huggingface.co/docs/transformers/model_doc/auto.

In [None]:
embedder = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
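# Encode every question into a fixed-size vector; a plain list of strings is passed to encode().
question_embeddings = embedder.encode(train_df["Question"].tolist())
# (number of questions, embedding dimension)
question_embeddings.shape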

In [None]:
question_embeddings[0]

The same process can be applied to the answers, which in our case are the full tweets.

In [None]:
# Encode every tweet with the same model, so questions and tweets share one vector space.
answer_embeddings = embedder.encode(train_df["Tweet"].tolist())

In [None]:
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity

Now we're able to compare the embeddings of the questions and the corresponding answers. Typically Euclidean distance or cosine similarity is used for that purpose, and we won't be any different. Cosine similarity has the great property of being normalized: it always falls between -1 and 1, and is equal to 1 when both vectors point in the same direction and -1 when they point in opposite directions.
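
To make those boundary values concrete, here is a minimal sketch on toy vectors, using only the standard library. The helper `cosine_sim` and the example vectors are made up for illustration and are not part of the notebook's pipeline; the helper assumes equal-length, non-zero inputs.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

v = [1.0, 2.0, 3.0]
same_direction = cosine_sim(v, [2.0, 4.0, 6.0])   # scaled copy of v, close to 1
opposite = cosine_sim(v, [-1.0, -2.0, -3.0])      # v flipped, close to -1
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])   # perpendicular vectors, 0

print(same_direction, opposite, orthogonal)
```

Note that the scaled copy still scores (up to floating-point rounding) 1: cosine similarity only looks at the direction of the vectors, not their length.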

Remember good old kNN? This is him now. Feel old yet? 
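
The kNN idea mentioned above can be sketched as a brute-force search: score every candidate against the query and keep the `k` best. This is only an illustration under stated assumptions. The helpers `cosine_sim` and `top_k` and the 2-dimensional toy vectors are hypothetical; the real embeddings have many more dimensions, and a real search would typically use a vectorized or indexed implementation.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, candidates, k=2):
    """Return (index, similarity) pairs for the k candidates most similar to query."""
    scored = [(i, cosine_sim(query, c)) for i, c in enumerate(candidates)]
    # Highest similarity first, since a higher cosine similarity means a closer match.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy stand-ins for answer and question embeddings (made-up values).
toy_answers = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
toy_question = [0.9, 0.1]

neighbours = top_k(toy_question, toy_answers, k=2)
print(neighbours)
```

Here the two nearest neighbours are the candidates at indices 0 and 1, the ones pointing in roughly the same direction as the query.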

In [None]:
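# Cosine similarity between each question and its own tweet (one value per row).
# Note: the column is called "distances", but it stores similarities,
# so higher values mean a closer match.
train_df["distances"] = np.array([
    cosine_similarity([qe], [ae])[0, 0]
    for qe, ae in zip(question_embeddings, answer_embeddings)
])
train_df["distances"].describe()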

In [None]:
train_df["distances"].plot.hist()