## Modern natural language processing: BERT through practical examples

## Finding similar texts

Hands-on part of the 3rd workshop of the Akademija umetne inteligence za poslovne aplikacije (Academy of Artificial Intelligence for Business Applications).

In this notebook we introduce *tf-idf* and embedding models such as *BERT*. We use both approaches to build text representations and then use those representations to search for movies that match a user's query.


## TF-IDF

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#### Example from the slides:

In [None]:
texts = [
    "Kje leži mačka?",            # "Where is the cat lying?"
    "Mačka leži na preprogi.",    # "The cat is lying on the rug."
    "V gozdu se nahaja medved."   # "There is a bear in the forest."
]

In [None]:
vectorizer = TfidfVectorizer()

Build the representations for the three example sentences:

In [None]:
representations = vectorizer.fit_transform(texts)

The list of words that serve as features in our tf-idf approach:

In [None]:
vectorizer.get_feature_names_out().tolist()

For each of the three example sentences we can print its tf-idf representation as a vector:

In [None]:
print(" " * 40 + " ".join([f"{word:<10}" for word in vectorizer.get_feature_names_out()]))

for i, row in enumerate(representations.toarray()):
    print(f"{texts[i]:<40}" + " ".join([f"{value:<10.2f}" for value in row]))

Using cosine similarity we can check how similar the example sentences are to one another.

We observe that:
- a sentence has similarity 1 with itself,
- the similarity between "Kje leži mačka?" and "Mačka leži na preprogi." is 0.44, because they share the words "leži" and "mačka",
- the similarity between "Kje leži mačka?" and "V gozdu se nahaja medved." is 0, because they have no words in common.

In [None]:
print(" " * 30 + " ".join([f"{text:<30}" for text in texts]))

for i, row in enumerate(cosine_similarity(representations, representations)):
    print(f"{texts[i]:<30}" + " ".join([f"{value:<30.2f}" for value in row]))
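
The values in the table can be reproduced directly from the definition of cosine similarity, the dot product of two vectors divided by the product of their lengths. The helper `cosine` below is only an illustration and is not part of scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "Kje leži mačka?",
    "Mačka leži na preprogi.",
    "V gozdu se nahaja medved.",
]
vectors = TfidfVectorizer().fit_transform(texts).toarray()


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of the two vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


sim_cat = cosine(vectors[0], vectors[1])   # two shared words
sim_bear = cosine(vectors[0], vectors[2])  # no shared words
print(round(sim_cat, 2), round(sim_bear, 2))  # 0.44 0.0
```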

#### Finding similar movies

In [None]:
from typing import List, Tuple

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Read the movie data:

In [None]:
path_to_train_csv = "https://raw.githubusercontent.com/valira-ai/llm-course/main/data/movies.csv"

df = pd.read_csv(path_to_train_csv)

In [None]:
len(df)

In [None]:
df.head()

In [None]:
titles = df["title"].to_numpy()
descriptions = df["description"].to_numpy()

Compute the representations:

In [None]:
vectorizer = TfidfVectorizer(max_features=10000)
representations = vectorizer.fit_transform(descriptions)

In [None]:
representations.shape
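
The shape has one row per description and at most `max_features` columns, and the result is a SciPy sparse matrix because each description uses only a small part of the vocabulary. The sketch below illustrates both points on three made-up descriptions (they are not taken from the movie dataset).

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, used only for illustration.
toy_descriptions = [
    "A cowboy doll feels threatened by a new spaceman toy.",
    "A boy is left home alone and defends the house from burglars.",
    "A young lion prince flees his kingdom after his father dies.",
]

full = TfidfVectorizer().fit_transform(toy_descriptions)
# max_features keeps only the most frequent terms across the corpus.
capped = TfidfVectorizer(max_features=5).fit_transform(toy_descriptions)

print(issparse(full), full.shape, capped.shape)

# Share of entries that are actually stored (non-zero).
density = full.nnz / (full.shape[0] * full.shape[1])
print(f"non-zero share: {density:.2f}")
```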

Finding similar movies:

In [None]:
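def get_similar_movies_tfidf(movie_title: str, n: int = 10) -> List[Tuple[str, float]]:
    movie_idxs = np.where(titles == movie_title)[0]
    if len(movie_idxs) == 0:
        raise ValueError("Movie not found")

    # If several movies share the title, use the first match.
    movie_idx = movie_idxs[0]
    similarity_scores = cosine_similarity(representations[movie_idx], representations).flatten()
    # Rank from most to least similar and drop the query movie by index,
    # instead of assuming that it always sorts into the last position.
    ranked = similarity_scores.argsort()[::-1]
    related_movie_indices = ranked[ranked != movie_idx][:n]
    return [(titles[i], similarity_scores[i]) for i in related_movie_indices]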

In [None]:
titles

In [None]:
get_similar_movies_tfidf("Toy Story")

Searching for movies with a query:

In [None]:
def find_movies_by_query_tfidf(query: str, n: int = 10) -> List[Tuple[str, float]]:
    query_tfidf = vectorizer.transform([query])
    similarity_scores = cosine_similarity(query_tfidf, representations).flatten()
    related_movie_indices = similarity_scores.argsort()[-n:][::-1]

    return [(titles[i], similarity_scores[i]) for i in related_movie_indices]
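
The slicing idiom `argsort()[-n:][::-1]` used for ranking is easier to follow on a tiny array. The scores below are made up for illustration.

```python
import numpy as np

scores = np.array([0.10, 0.80, 0.30, 0.95, 0.05])
n = 3

ascending = scores.argsort()   # indices ordered from lowest to highest score
top_n = ascending[-n:][::-1]   # take the last n indices, then flip to best-first

print(top_n.tolist())          # [3, 1, 2]
print(scores[top_n].tolist())  # [0.95, 0.8, 0.3]
```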

In [None]:
find_movies_by_query_tfidf("Some kid fights off the burglars in his house on Christmas after family leaves him behind.")
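
Because tf-idf compares exact words, a query only scores against descriptions that share vocabulary with it. The sketch below uses two made-up descriptions to show the effect: a paraphrased query with no overlapping words gets a score of 0 everywhere, while a query that reuses the description's words matches.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical descriptions, used only for illustration.
toy_descriptions = [
    "A boy defends his home from two thieves during the holidays.",
    "A cowboy doll feels threatened by a new spaceman toy.",
]
vectorizer = TfidfVectorizer()
representations = vectorizer.fit_transform(toy_descriptions)

queries = {
    "paraphrase": "kid fights burglars in house at Christmas",
    "exact_words": "boy defends home from thieves",
}

results = {}
for label, query in queries.items():
    # Words that were not seen during fit are silently ignored by transform.
    query_tfidf = vectorizer.transform([query])
    results[label] = cosine_similarity(query_tfidf, representations).flatten()
    print(label, results[label].round(2).tolist())
```

This is the gap that the embedding models in the next section aim to close, since they compare meaning and not only surface words.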

## BERT (or BERT-like models)

First, set up GPU access for this Colab session:
- `Edit -> Notebook settings -> Hardware accelerator` must be set to one of the GPUs.
- if needed, reconnect using the `Connect` button in the top-right corner.

In [None]:
!nvidia-smi
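
The same check can be done from Python. The helper `pick_device` below is a hypothetical convenience function, not part of any library; it assumes PyTorch may or may not be installed and falls back to the CPU in either failure case.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # present in Colab; installed as a sentence-transformers dependency
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
# The result could later be passed on explicitly, for example
# SentenceTransformer(model_name, device=device).
```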

In [None]:
%%capture
!pip install -U sentence-transformers

In [None]:
from typing import List, Tuple

import numpy as np
import pandas as pd

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [None]:
texts = [
    "Kje leži mačka?",            # "Where is the cat lying?"
    "Mačka leži na preprogi.",    # "The cat is lying on the rug."
    "V gozdu se nahaja medved."   # "There is a bear in the forest."
]

In [None]:
embeddings = model.encode(texts)

In [None]:
embeddings.shape
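
`embeddings` holds one fixed-length vector per sentence (384 values each for `all-MiniLM-L6-v2`), regardless of sentence length. Models of this kind commonly obtain that vector by averaging the per-token vectors (mean pooling) and then scaling the result to unit length. The sketch below shows only the pooling arithmetic; random numbers stand in for real transformer output, and the sizes are chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for transformer output: 2 sentences, up to 4 tokens, 8 dimensions.
token_embeddings = rng.normal(size=(2, 4, 8))
# 1 marks a real token, 0 marks padding added to equalise sentence lengths.
attention_mask = np.array([[1, 1, 1, 0],
                           [1, 1, 0, 0]])

mask = attention_mask[:, :, None]                # shape (2, 4, 1)
summed = (token_embeddings * mask).sum(axis=1)   # padded positions contribute 0
sentence_embeddings = summed / mask.sum(axis=1)  # mean over the real tokens only

# Scale every sentence vector to unit Euclidean length.
sentence_embeddings /= np.linalg.norm(sentence_embeddings, axis=1, keepdims=True)

print(sentence_embeddings.shape)  # (2, 8)
```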

In [None]:
print(" " * 30 + " ".join([f"{text:<30}" for text in texts]))

for i, row in enumerate(cosine_similarity(embeddings, embeddings)):
    print(f"{texts[i]:<30}" + " ".join([f"{value:<30.2f}" for value in row]))
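
When vectors have unit length, cosine similarity reduces to a plain dot product, which is a common shortcut when searching large collections. The sketch below demonstrates this with random stand-in vectors, not real model output; with sentence-transformers, unit-length output can be requested via `model.encode(..., normalize_embeddings=True)`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
vectors = rng.normal(size=(3, 384))  # random stand-ins for sentence embeddings

# Normalise every row to unit length, then compare all pairs with one matrix product.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
dot_products = unit @ unit.T

print(np.allclose(dot_products, cosine_similarity(vectors)))
```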

Similarity of movie descriptions:

In [None]:
path_to_train_csv = "https://raw.githubusercontent.com/valira-ai/llm-course/main/data/movies.csv"

df = pd.read_csv(path_to_train_csv)

In [None]:
titles = df["title"].to_numpy()
descriptions = df["description"].to_numpy()

In [None]:
embeddings = model.encode(descriptions, batch_size=128, show_progress_bar=True)
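
Encoding the whole dataset is the slowest step in this notebook, so it can be worth storing the result on disk and reloading it in later sessions. The following is a minimal sketch under stated assumptions: `load_or_compute` and the file name `movie_embeddings.npy` are hypothetical, and `fake_encode` stands in for the real `model.encode(descriptions, ...)` call.

```python
import tempfile
from pathlib import Path
from typing import Callable

import numpy as np


def load_or_compute(cache_path: Path, compute: Callable[[], np.ndarray]) -> np.ndarray:
    """Load embeddings from cache_path if it exists, else compute and save them."""
    if cache_path.exists():
        return np.load(cache_path)
    embeddings = np.asarray(compute())
    np.save(cache_path, embeddings)
    return embeddings


calls = []


def fake_encode() -> np.ndarray:
    # Placeholder for the expensive model.encode(...) call.
    calls.append(1)
    return np.random.default_rng(0).normal(size=(5, 384)).astype(np.float32)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "movie_embeddings.npy"
    first = load_or_compute(path, fake_encode)   # computes and writes the file
    second = load_or_compute(path, fake_encode)  # reads the file, no recomputation

print(len(calls), np.array_equal(first, second))  # 1 True
```

A cache like this is only valid as long as the model and the order of the descriptions stay the same.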

In [None]:
embeddings.shape

In [None]:
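def get_similar_movies_bert(movie_title: str, n: int = 10) -> List[Tuple[str, float]]:
    movie_idxs = np.where(titles == movie_title)[0]
    if len(movie_idxs) == 0:
        raise ValueError("Movie not found")

    # If several movies share the title, use the first match.
    movie_idx = movie_idxs[0]
    similarity_scores = cosine_similarity([embeddings[movie_idx]], embeddings).flatten()
    # Rank from most to least similar and drop the query movie by index,
    # instead of assuming that it always sorts into the last position.
    ranked = similarity_scores.argsort()[::-1]
    related_movie_indices = ranked[ranked != movie_idx][:n]
    return [(titles[i], similarity_scores[i]) for i in related_movie_indices]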

In [None]:
get_similar_movies_bert("Toy Story")

In [None]:
def find_movies_by_query_bert(query: str, n: int = 10) -> List[Tuple[str, float]]:
    similarity_scores = cosine_similarity(model.encode([query]), embeddings).flatten()
    related_movie_indices = similarity_scores.argsort()[-n:][::-1]

    return [(titles[i], similarity_scores[i]) for i in related_movie_indices]

In [None]:
find_movies_by_query_bert("Some kid fights off the burglars in his house on Christmas after family leaves him behind.")

## Experiment

In [None]:
%%capture
!pip install datasets

In [None]:
from typing import List, Tuple

import numpy as np
import pandas as pd

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

In [None]:
amazon_products = load_dataset("ckandemir/amazon-products")

In [None]:
amazon_products

In [None]:
names = amazon_products["train"]["Product Name"]
descriptions = [f"{name} - {description}" for name, description in zip(names, amazon_products["train"]["Description"])]

In [None]:
len(descriptions)

TODO
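
One possible starting point for the exercise is to reuse the query-search pattern from the movie section on the product descriptions. The sketch below is an assumption about how that could look, not a reference solution: `build_product_search` is a hypothetical helper, the three products are made up, and `toy_encode` (letter counts) only stands in for `model.encode` so that the example runs without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np


def build_product_search(
    names: Sequence[str],
    descriptions: Sequence[str],
    encode: Callable[[List[str]], np.ndarray],
) -> Callable[..., List[Tuple[str, float]]]:
    """Embed all descriptions once and return a search function over them."""
    embeddings = np.asarray(encode(list(descriptions)), dtype=float)
    # Unit-length rows make the dot product equal to cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    embeddings = embeddings / np.maximum(norms, 1e-12)

    def search(query: str, n: int = 10) -> List[Tuple[str, float]]:
        query_vec = np.asarray(encode([query]), dtype=float)[0]
        query_vec = query_vec / max(float(np.linalg.norm(query_vec)), 1e-12)
        scores = embeddings @ query_vec
        top = scores.argsort()[-n:][::-1]
        return [(names[i], float(scores[i])) for i in top]

    return search


def toy_encode(texts: List[str]) -> np.ndarray:
    # Stand-in encoder: counts of each letter a-z. Replace with model.encode.
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    return np.array([[t.lower().count(c) for c in alphabet] for t in texts], dtype=float)


demo_names = ["Mouse", "Kettle", "Backpack"]
demo_descriptions = [
    "Mouse - wireless optical mouse",
    "Kettle - electric kettle 1.7 l",
    "Backpack - waterproof hiking backpack",
]

search = build_product_search(demo_names, demo_descriptions, toy_encode)
hits = search("Kettle - electric kettle 1.7 l", n=2)
print(hits)
```

With the real data this would be called as `build_product_search(names, descriptions, model.encode)`, after which queries such as `search("noise cancelling headphones")` can be compared against tf-idf results on the same descriptions.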