In [None]:
# `importlib.util` is a submodule and must be imported explicitly
import importlib.util

if not importlib.util.find_spec("movie_buddy"):
    !pip install -qqq git+https://github.com/xtreamsrl/movies-buddy

# How Can We Represent Semantics? Embeddings 101

Since high school computer programming classes, everyone knows that computers can only understand numbers. So, how can Large Language Models (LLMs) such as ChatGPT understand human language?

## A Bit of History

A naive way to do this is to **map every word in a dictionary to a number**. That is, map a sentence like "Applied Machine Learning Days are a super cool conference" to a list of integers like the following:

```
[10, 40, 5, 10, 7, 8, 90, 123, 2]
```

This process is called **encoding**, and this is a very simple encoding method. However, for complex Natural Language Processing (NLP) tasks, this is not sufficient. For example, here the word `cool` is used to denote something that is excellent, admirable. But `cool` can also mean "almost cold", or even "relaxed": how can we disambiguate?
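
The naive mapping described above can be sketched in a few lines. This is only an illustration: the vocabulary is built on the fly from two example sentences, so the integer ids are arbitrary and do not match the list shown above, and `build_vocabulary` and `encode` are hypothetical helper names.

```python
def build_vocabulary(sentences):
    """Assign an arbitrary integer id to every distinct lowercase word."""
    vocabulary = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            # setdefault keeps the id a word received the first time it was seen
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


def encode(sentence, vocabulary):
    """Replace every word of the sentence with its integer id."""
    return [vocabulary[word] for word in sentence.lower().split()]


corpus = [
    "Applied Machine Learning Days are a super cool conference",
    "The water is cool",
]
vocabulary = build_vocabulary(corpus)

print(encode(corpus[0], vocabulary))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(encode(corpus[1], vocabulary))  # [9, 10, 11, 7]
```

Both occurrences of `cool` receive the same id, even though they mean different things: the integer carries no information about context or meaning.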

To take another, slightly more complex example, consider the two pairs of words ("man", "woman") and ("king", "queen"). We know there is a relationship between those concepts: paraphrasing a [very famous research paper](https://arxiv.org/abs/1301.3781), we might want our encoding method to have this property: encoding("King") - encoding("Man") + encoding("Woman") ~= encoding("Queen").

In other words, we want our encoding method to carry the semantic relationships between words. In 2013, the four authors of the paper came up with Word2Vec: a neural network capable of generating dense vector representations of words, called *embeddings*, that capture a significant amount of language semantics. You can imagine the encoding performing this mapping:

```
"Queen" = [0.3, 0.3, 0.2, ..., 0.3]
"King" = [0.5, -0.3, 0.1, ..., 0.5]
"Man" = [0.2, 0.95, 0.3, ..., 0.1]
"Woman" = [0.56, -0.5, 0.32, ..., 0.1]
```

Here, the number of dimensions is the same for every word (yet another desirable feature of our encoding algorithm), so each vector "places" its word in a Euclidean space. During training, such neural networks adjust the position of every word in the vector space, producing a representation that reflects the semantic proximity of the terms in the data they were trained on.
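
To make the "king - man + woman" property concrete, here is a minimal sketch with hand-picked two-dimensional vectors. The values are invented for illustration (one axis loosely stands for "royalty", the other for "gender") and are not Word2Vec output; `toy_embeddings` and `nearest_word` are hypothetical names. Real embeddings have hundreds of dimensions, and the relation only holds approximately.

```python
import math

# Hand-picked toy vectors (royalty, gender): hypothetical values, not learned ones.
toy_embeddings = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "prince": [0.6, 0.7],
    "princess": [0.6, -0.7],
}


def nearest_word(vector, embeddings, exclude=()):
    """Return the word whose embedding is closest (Euclidean distance) to `vector`."""
    candidates = (word for word in embeddings if word not in exclude)
    return min(candidates, key=lambda word: math.dist(vector, embeddings[word]))


# king - man + woman, computed component by component
target = [
    k - m + w
    for k, m, w in zip(
        toy_embeddings["king"], toy_embeddings["man"], toy_embeddings["woman"]
    )
]

# The three input words are excluded, as is customary for analogy queries.
print(nearest_word(target, toy_embeddings, exclude=("king", "man", "woman")))  # queen
```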

Nowadays, such models have evolved beyond embedding words alone. For example, it is now possible to encode whole sentences (and this is what we'll do!), as well as images, text and images together, and audio tracks. But talk is cheap! Let's get our hands dirty and make our computer understand language!

# Your First Vector Embedding

As we discussed earlier, we need to help our laptop understand sentences. Embedding models may sound like a scary concept, and they could be if we had to deal with all the details. Luckily, someone did a lot of work for us: we can just download a pre-trained sentence embedding model from the Hugging Face model repository (this might take a bit):

In [None]:
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

Encoding is straightforward:

In [None]:
embedding_example = encoder.encode("My first encoded sentence: Hello, AMLD!")

print(
    f"First and last part of the resulting vector:\n{embedding_example[:5]} ... {embedding_example[-5:]}"
)

Congratulations! You have encoded your first sentence! How many dimensions does this vector have?

In [None]:
embedding_example.shape

If you changed the example sentence, you'd notice that the resulting encoded vector would still have the same number of dimensions.

This is far from the most thrilling result, though. To appreciate the representational power of embeddings, we need to take a step forward and embed a bunch of sentences. How about we put it to the test with a more meaningful example?

## Navigating the Embedding Space


Here is a utility function to load a dataset of sentences about different fields, such as "Nature and Environment" and "Sports".

In [None]:
from movie_buddy.data import get_sentences_dataset

sentences = get_sentences_dataset()
sentences

Now, your turn: how do you encode the sentences in this dataset? Keep in mind that our encoder can encode a whole sequence of sentences in one call.

In [None]:
sentence_embeddings = encoder.encode(sentences["sentences"].to_list())

What's the shape of this dataset?

In [None]:
sentence_embeddings.shape

As humans, it's impossible for us to make sense of this many dimensions. We resort to some black magic (AKA [UMAP](https://umap-learn.readthedocs.io/en/latest/index.html)) to sharply reduce the number of dimensions to 2.

This greatly diminishes the representational power of our embeddings, but will leave just enough to surprise us and still be interpretable.

In [None]:
from movie_buddy.utils import reduce_dimensions, add_umap_to_dataset

sentence_embeddings_reduced = reduce_dimensions(sentence_embeddings)
sentences_and_embeddings = add_umap_to_dataset(sentences, sentence_embeddings_reduced)

sentences_and_embeddings.head()

In [None]:
from movie_buddy.utils import plot_sentences

plot_sentences(sentences_and_embeddings)

Note that the encoder didn't know anything about the field of each sentence. Can you see any interesting patterns?
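
Proximity in the embedding space can also be measured with a number rather than read off a plot. A common choice is cosine similarity; the sketch below implements it in plain Python on made-up three-dimensional vectors. The vectors and the name `toy_sentence_embeddings` are assumptions for illustration only, not output of the encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 is same direction, 0 is orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional "sentence embeddings", invented for this example.
toy_sentence_embeddings = {
    "I run every day": [0.9, 0.1, 0.0],
    "She won the marathon": [0.8, 0.2, 0.1],
    "Forests absorb carbon dioxide": [0.0, 0.2, 0.9],
}

query = "I run every day"
for sentence, vector in toy_sentence_embeddings.items():
    score = cosine_similarity(toy_sentence_embeddings[query], vector)
    print(f"{score:.2f}  {sentence}")
```

In this toy setup the two sports sentences score much higher with each other than with the nature sentence. The same function should also work on two rows of `sentence_embeddings`, since it only iterates over the values; in that case the similarity is computed in the original high-dimensional space, not in the 2D projection.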

Now, it is your turn! Add more sentences and try to guess in which zone of the plot they will be positioned.

In [None]:
your_sentences = [
    "Schools are very important for our society",
    "I run every day",
    "AI will revolutionize the computer science industry",
    # PUT YOUR SENTENCES DOWN THERE IN THE LIST
    # ...
    # ...
]

In [None]:
from movie_buddy.utils import add_sentences

# fmt: off
(
    sentences_and_embeddings
    .pipe(add_sentences, sentences=your_sentences, encoder=encoder)
    .pipe(plot_sentences)
)
# fmt: on

## What About Movies? 

At the end of the day, though, we want to build an AI movie assistant, so where are the movies?

Here is a utility that will download a processed version of [this](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset) Kaggle movies dataset.

In [None]:
from movie_buddy.data import get_movies_dataset

movies = get_movies_dataset()

movies.head()

In [None]:
movies.info()

There are slightly more than 42 thousand movies, each with its title, an overview and its genre. Forget about covariates: how would you perform exploratory data analysis on textual data?
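
One possible starting point, before any embedding, is plain word statistics: how long the texts are and which words occur most often. The sketch below uses three made-up overviews and a tiny hand-written stop word list; `toy_overviews`, `STOP_WORDS` and `tokenize` are hypothetical names, and the real `overview` column would need a proper stop word list.

```python
import re
from collections import Counter

# Made-up overviews standing in for the real `overview` column.
toy_overviews = [
    "A young wizard discovers a hidden school of magic.",
    "Two detectives hunt a killer through the city.",
    "A robot learns what it means to be human.",
]

# Tiny illustrative stop word list, far from complete.
STOP_WORDS = {"a", "an", "the", "of", "to", "it", "what", "be", "through"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


# Number of whitespace-separated words in each overview
lengths = [len(overview.split()) for overview in toy_overviews]

# Frequency of every remaining token across all overviews
word_counts = Counter(
    token for overview in toy_overviews for token in tokenize(overview)
)

print("Words per overview:", lengths)  # [9, 8, 9]
print("Most common words:", word_counts.most_common(3))
```

Counts like these quickly reveal empty or very short overviews and dominant vocabulary, but they say nothing about meaning, which is where embeddings come in.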

Finally, you might want to embed the movie titles or overviews, and see if you can find some patterns. One note: embedding this many texts might take a bit of time. You can specify the `sample` parameter in the `get_movies_dataset` function to retrieve only a few thousand movies.

In [None]:
movies_embeddings = encoder.encode(movies["overview"].to_list())

In [None]:
from movie_buddy.utils import plot_movies

# fmt: off
(
    movies
    .pipe(add_umap_to_dataset, reduce_dimensions(movies_embeddings))
    .pipe(plot_movies)
)
# fmt: on