# How Can We Represent Semantics? Embeddings 101

Ever since our high school programming classes, we have all known that computers understand only numbers. So how can an LLM understand human language?

## A Bit of History

A naive approach is to map every word in the dictionary to a number. A sentence like "Applied Machine Learning Days are a super cool conference" then becomes a list of integers:
```
[10, 40, 5, 11, 7, 8, 90, 123, 2]
```
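
The mapping described above can be sketched in a few lines of Python. This is only an illustration: the helper names `build_vocabulary` and `encode` are made up for this example, and the IDs it assigns (order of first appearance) are arbitrary, so they differ from the numbers shown above.

```python
def build_vocabulary(sentences):
    """Assign the next free integer to every distinct lowercased word."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def encode(sentence, vocabulary):
    """Replace each word of the sentence with its integer ID."""
    return [vocabulary[word] for word in sentence.lower().split()]


corpus = [
    "Applied Machine Learning Days are a super cool conference",
    "Machine Learning conferences are cool",
]
vocabulary = build_vocabulary(corpus)

first = encode(corpus[0], vocabulary)
second = encode(corpus[1], vocabulary)
print(first)   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(second)  # [1, 2, 9, 4, 7]
```

Note that "conference" and "conferences" receive different IDs, and nothing in the numbers themselves tells us the two words are related.
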
This method is very simple, but it is not sufficient for complex Natural Language Processing (NLP) tasks. Over the years, researchers in this field have developed many strategies for generating numerical representations that also capture semantic features of language. In 2013, Google's Word2Vec made its first appearance, introducing a method for generating dense vector representations of words, or embeddings, that capture a significant amount of a language's semantics. For example:
```
"Queen" = [0.3, 0.3, 0.2, ..., 0.3]
"King" = [0.5, -0.3, 0.1, ..., 0.5]
"Man" = [0.2, 0.95, 0.3, ..., 0.1]
"Woman" = [0.56, -0.5, 0.32, ..., 0.1]
```
So we have a multi-dimensional mathematical space in which vectors that are close to each other represent semantically similar words. In the example above, the vectors representing "Queen" and "Woman", or "King" and "Man", are probably close to each other. One of the coolest consequences is that we can apply mathematical operations to these vectors and obtain other meaningful vectors!

```
"King" - "Man" + "Woman" ~= "Queen"
```
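
To make the arithmetic concrete, here is a minimal sketch with hand-made vectors. The three dimensions (loosely "royalty", "maleness", "femaleness") and all the numbers are invented for illustration; real Word2Vec vectors are learned from text and have far more dimensions. Closeness is measured here with cosine similarity, a common choice for embeddings, and the helper names are hypothetical.

```python
import numpy as np

# Made-up toy vectors: [royalty, maleness, femaleness].
toy_vectors = {
    "king": np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man": np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.1, 0.1, 0.1]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def closest_word(target, vectors):
    """Return the word whose vector is most cosine-similar to the target."""
    return max(vectors, key=lambda word: cosine_similarity(target, vectors[word]))


# "King" - "Man" + "Woman"
analogy = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
result = closest_word(analogy, toy_vectors)
print(result)  # queen
```

With these toy numbers the result lands exactly on "queen"; with real embeddings the match is only approximate, which is why the formula above uses `~=`.
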

## TODO: add some history after Word2Vec here (sentence transformers, etc.)
[...]


But talk is cheap! Let's get our hands dirty and make our computer understand language!

# Every Good Craftsman Needs Good Tools

We import the Python dependencies that we will need to run this notebook:

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from sentence_transformers import SentenceTransformer
import umap
import pandas as pd
import plotly.express as px

from movie_buddy.preprocessing.movies_dataset import (
    get_movies_dataset,
    get_sentences_dataset,
)
from movie_buddy.preprocessing.utils import reduce_dimensions, add_umap_to_df
from movie_buddy.preprocessing.visualization import plot_sentences

# Let's Play With Sentences

As discussed earlier, we need to help our laptop understand sentences. We have talked about embedding models: they may sound scary, and they could be if we had to deal with all the details. Luckily, someone has already done a lot of the work for us, so we can simply download a pre-trained sentence embedding model from the Hugging Face Hub:

In [None]:
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

In [None]:
encoded_sent = encoder.encode("My first encoded sentence: AMLD are Cools")
# Show only the first and last five components of the embedding vector
print(
    f"First and last part of the resulting vector:\n{encoded_sent[:5]} ... {encoded_sent[-5:]}"
)

Congratulations! You have encoded your first sentence! 

Let's go one step further and embed multiple sentences. We have prepared a dataset of sentences covering different fields, ranging from "Nature and Environment" to "Sports".

In [None]:
sentences_df = get_sentences_dataset()
sentences_df

In [None]:
# Convert the column to a plain list of strings before encoding
encoded_sentences = encoder.encode(sentences_df["sentences"].tolist())

In [None]:
encoded_sentences.shape

In [None]:
reduced_encoded_sentences = reduce_dimensions(encoded_sentences)
full_sentences_df = add_umap_to_df(sentences_df, reduced_encoded_sentences)

# This is just to improve visualization
full_sentences_df["short_sentences"] = (
    full_sentences_df["sentences"].str.slice(0, 20) + "..."
)
full_sentences_df
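
The helpers `reduce_dimensions` and `add_umap_to_df` come from the `movie_buddy` package and their source is not shown in this notebook. The sketch below is a guess at what such helpers might look like, not their actual implementation: the function names ending in `_sketch`, the `x`/`y` column names, and the `random_state` value are all assumptions.

```python
import numpy as np
import pandas as pd


def reduce_dimensions_sketch(embeddings, n_components=2, random_state=42):
    """Hypothetical stand-in: project high-dimensional embeddings with UMAP."""
    import umap  # imported lazily so the DataFrame helper works without it

    reducer = umap.UMAP(n_components=n_components, random_state=random_state)
    return reducer.fit_transform(np.asarray(embeddings))


def add_coords_to_df_sketch(df, coords):
    """Hypothetical stand-in: return a copy of df with x/y coordinate columns."""
    coords = np.asarray(coords)
    if coords.shape != (len(df), 2):
        raise ValueError("expected one (x, y) pair per DataFrame row")
    out = df.reset_index(drop=True).copy()
    out["x"] = coords[:, 0]
    out["y"] = coords[:, 1]
    return out


# Toy usage with made-up 2D coordinates instead of a real UMAP projection.
toy_df = pd.DataFrame({"sentences": ["I love hiking", "Football is fun"]})
toy_coords = np.array([[0.1, 0.9], [0.8, 0.2]])
toy_with_coords = add_coords_to_df_sketch(toy_df, toy_coords)
print(toy_with_coords)
```

Reducing to two dimensions throws away information, so distances in the plot are only a rough picture of distances in the original embedding space.
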

In [None]:
plot_sentences(full_sentences_df)

Note that the encoder knew nothing about the field of each sentence. Can you spot any interesting patterns?
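
One way to check such patterns numerically, rather than by eye, is to look at each sentence's nearest neighbor in the original embedding space. The sketch below uses made-up 3-dimensional vectors as stand-ins for `encoded_sentences` (the real model produces 384-dimensional vectors); the helper `nearest_neighbors` and the toy field labels are assumptions for illustration.

```python
import numpy as np


def nearest_neighbors(embeddings):
    """For each row, return the index of the most cosine-similar other row."""
    vectors = np.asarray(embeddings, dtype=float)
    # Normalize rows so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarity = unit @ unit.T
    # Every row is most similar to itself, so mask out the diagonal.
    np.fill_diagonal(similarity, -np.inf)
    return similarity.argmax(axis=1)


# Made-up stand-ins for sentence embeddings and their fields.
toy_embeddings = np.array(
    [
        [0.9, 0.1, 0.0],  # a "sports" sentence
        [0.8, 0.2, 0.1],  # a "sports" sentence
        [0.1, 0.9, 0.2],  # a "nature" sentence
        [0.0, 0.8, 0.3],  # a "nature" sentence
    ]
)
toy_fields = ["sports", "sports", "nature", "nature"]

neighbors = nearest_neighbors(toy_embeddings)
same_field = [toy_fields[i] == toy_fields[j] for i, j in enumerate(neighbors)]
print(neighbors.tolist())  # [1, 0, 3, 2]
print(same_field)          # [True, True, True, True]
```

The same function could be applied to the real `encoded_sentences` array together with the `field` column, assuming both are in the same row order.
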

Now it's your turn! Add more sentences and try to guess where in the plot they will land.

In [None]:
your_sentences = [
    "Schools are very important for our society",
    "I run every day",
    "AI will revolutionize the computer science industry",
    # PUT YOUR SENTENCES BELOW, INSIDE THE LIST
    # |
    # V
]

In [None]:
your_sentences_df = pd.DataFrame(your_sentences, columns=["sentences"])
your_sentences_df["field"] = "your_sentences"

full_your_sentences_df = pd.concat([sentences_df, your_sentences_df], ignore_index=True)
full_your_sentences_df["short_sentences"] = (
    full_your_sentences_df["sentences"].str.slice(0, 20) + "..."
)
# Convert the column to a plain list of strings before encoding
your_encoded_sentences = encoder.encode(full_your_sentences_df["sentences"].tolist())

reduced_encoded_sentences = reduce_dimensions(your_encoded_sentences)
full_your_sentences_df = add_umap_to_df(
    full_your_sentences_df, reduced_encoded_sentences
)

In [None]:
plot_sentences(full_your_sentences_df)

## What About Movies? 

At the end of the day, we want to build an AI movie assistant, so what about movies?

We have a dataset of ~42000 movies containing information such as title, overview, genre, and release date. We can embed the overviews and see whether the encoder finds some structure in them.

In [None]:
movies_df = get_movies_dataset()

In [None]:
movies_df

In [None]:
len(movies_df["overview"].tolist())

In [None]:
encoded_movies = encoder.encode(movies_df["overview"].tolist())

In [None]:
reducer = umap.UMAP()
reduced_encoded_movies = reducer.fit_transform(encoded_movies)

In [None]:
len(reduced_encoded_movies.tolist())

In [None]:
movies_df["encoded_overview"] = reduced_encoded_movies.tolist()

In [None]:
movies_df

In [None]:
split = pd.DataFrame(movies_df["encoded_overview"].tolist(), columns=["x", "y"])

In [None]:
movies_df_joined = pd.concat(
    [movies_df.reset_index(drop=True), split.reset_index(drop=True)],
    axis=1,
)
movies_df_joined

In [None]:
fig = px.scatter(
    movies_df_joined.sample(5000),
    x="x",
    y="y",
    color="genre",
    height=512,
    hover_name="genre",
    hover_data={"overview": False, "title": True, "x": False, "y": False},
)
fig.update_layout(title_text="Which Movies Are Close?", template="plotly_white")
fig.update_traces(textposition="top center", marker=dict(size=5))
fig.show()