<a href="https://colab.research.google.com/github/yilinmiao/genai-solution/blob/main/embedding_functions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Embedding functions

With an appropriate model, almost anything can be embedded. Here we use text embeddings to run a semantic search over Rick and Morty quotes.

### ChatGPT doesn't have parents or emotions

Vector search is useful for retrieving data that's not part of the model's training data.

For example, if we ask ChatGPT the following question, we get a generic-sounding answer built around the core claim "she does not make any statements about causing her parents misery".

But what if we searched through actual quotes from the show?

### [HuggingFace Sentence Transformers](https://huggingface.co/sentence-transformers)

Restart the kernel (`Kernel` --> `Restart` in the Jupyter Notebook menu) after running this cell to use the latest packages.

In [1]:
!pip install -U --quiet sentence-transformers==2.5.1 transformers==4.36.0


In [2]:
!pip freeze | grep transformers

sentence-transformers==2.5.1
transformers==4.36.0


### Read data

Now let's read the quotes from the included text file (source: https://parade.com/tv/rick-and-morty-quotes).

**SOLUTION** Here each quote lives on its own line of text, so `readlines` is enough. Remember to use a context manager so the file is closed when we're done.

In [3]:
def read_quotes() -> list[str]:
    with open("rick_and_morty_quotes.txt", "r") as fh:
        return fh.readlines()

In [4]:
rick_and_morty_quotes = read_quotes()
rick_and_morty_quotes[:3]

["Losers look stuff up while the rest of us are carpin' all them diems.\n",
 "He's not a hot girl. He can't just bail on his life and set up shop in someone else's.\n",
 "When you are an a—hole, it doesn't matter how right you are. Nobody wants to give you the satisfaction.\n"]

Oops, it seems we have an extra newline at the end of each quote. Does that matter?
How do we prove it to ourselves?

**SOLUTION** It turns out the answer is no. The tokenizer treats the trailing newline as whitespace and discards it, so both strings produce the same tokens and therefore the same embedding.

In [5]:
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

emb1, emb2 = model.encode([
 "Losers look stuff up while the rest of us are carpin' all them diems.\n",
 "Losers look stuff up while the rest of us are carpin' all them diems."
])

np.allclose(emb1, emb2)

True
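
Even though the newlines do not change the embeddings, they do show up when the quotes are printed. As a minimal sketch, a variant of the reader could strip them at load time. The name `read_quotes_stripped` and its `path` parameter are hypothetical and not part of the original exercise.

```python
def read_quotes_stripped(path: str = "rick_and_morty_quotes.txt") -> list[str]:
    """Read one quote per line, dropping trailing newlines and blank lines."""
    with open(path, "r", encoding="utf-8") as fh:
        # str.strip() removes the trailing "\n"; the `if` skips empty lines.
        return [line.strip() for line in fh if line.strip()]
```

The rest of the notebook works the same with either reader, since the embeddings are identical.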

### Write function to generate embeddings from text

Write a function that turns text into embeddings using Sentence Transformers.

**HINT**
1. Choose a [pre-trained model](https://www.sbert.net/docs/pretrained_models.html); you don't need to train your own.
2. See the Sentence Transformers API documentation and examples to learn how to encode text.

**SOLUTION**
Sentence Transformers makes this pretty easy. First we load the model using the model name we chose.
Then we call `model.encode()` to generate the embeddings. If you pass in a single string,
a 1D NumPy array is returned; if you pass in a list of strings, a 2D array with one row per string is returned.

In [6]:
import numpy as np
from sentence_transformers import SentenceTransformer
from typing import Union

MODEL_NAME = 'paraphrase-MiniLM-L6-v2'

def generate_embeddings(input_data: Union[str, list[str]]) -> np.ndarray:
    model = SentenceTransformer(MODEL_NAME)
    embeddings = model.encode(input_data)
    return embeddings

In [7]:
embeddings = generate_embeddings(rick_and_morty_quotes)



In [8]:
# Print the first three quotes alongside their embeddings
for sentence, embedding in zip(rick_and_morty_quotes[:3], embeddings[:3]):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Sentence: Losers look stuff up while the rest of us are carpin' all them diems.

Embedding: [ 0.6188342   0.06881845  0.443743   -0.45357847  0.3027152  -0.10784181
  0.4952491  -0.12448807  0.05482458 -0.0426283   0.04789171 -0.31940383
  0.18216977 -0.27199662 -0.14199574 -0.5600972  -0.355663   -0.44555116
 -0.03909537  0.42247915 -0.46049666  0.26436484  0.16821639  0.34295774
  0.20552626  0.2099483  -0.07352537 -0.02430931 -0.07486219  0.41356164
 -0.09713838 -0.02470851  0.02246377  0.10461555  0.25205332 -0.05957074
  0.02156204  0.24379644  0.20664135 -0.40555876 -0.18285923  0.13926397
 -0.29004875  0.14936335 -0.17484261 -0.22140737 -0.01152972 -0.17155811
  0.2581105   0.01463493 -0.05509429  0.02583235  0.01430622 -0.13821104
  0.16160002 -0.5648244   0.4062963   0.08129288  0.1872963  -0.06932872
 -0.17729409 -0.10064985  0.30244136 -0.2205626  -0.20505185  0.13730276
  0.32069105  0.22979227 -0.2280676   0.375768   -0.17270245 -0.17178829
  0.16163573  0.5295059  -0.1935

https://www.sbert.net/docs/pretrained_models.html


How many dimensions does each embedding have?

In [9]:
len(embeddings[0])

384

Are the embeddings normalized already?

**SOLUTION** No, they're not: the norms below range from roughly 4 to 8 instead of all being 1. We may want to add a step to the embedding function that normalizes all vectors, or use cosine similarity instead of Euclidean distance.

In [10]:
np.linalg.norm(embeddings, axis=1)

array([5.578064 , 4.8029366, 4.740227 , 5.141901 , 7.088952 , 4.5193715,
       4.1849637, 5.134198 , 4.9683104, 5.2417045, 5.5314946, 4.3252525,
       6.9031262, 5.592641 , 5.20466  , 5.814751 , 6.8994946, 5.7716303,
       6.385527 , 5.230854 , 6.576409 , 5.348642 , 6.0708694, 7.759796 ,
       4.2407436, 4.596545 , 5.975322 , 4.7049956, 5.027795 , 7.618719 ,
       5.839973 , 5.674178 , 5.2255936, 6.6308613, 7.290153 , 5.0769453,
       7.4152474, 5.501258 , 4.7184834, 5.834399 , 4.634561 , 5.4468746,
       5.329065 , 4.7717104, 5.175283 , 5.1571107, 6.2419667, 5.8772345,
       4.933059 , 7.8839326, 4.9240723, 6.0574803, 4.257353 , 5.084046 ,
       5.6248083, 4.061501 , 5.4896584, 4.048276 ], dtype=float32)
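
The normalization step mentioned in the solution can be sketched with plain NumPy. This is an illustrative helper, not part of the original notebook: `normalize_rows` is a hypothetical name, and the small `toy` array stands in for the real embedding matrix.

```python
import numpy as np


def normalize_rows(vectors: np.ndarray) -> np.ndarray:
    """Scale each row of a 2D array to unit L2 norm."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return vectors / np.clip(norms, 1e-12, None)


# Toy stand-in for the real embedding matrix.
toy = np.array([[3.0, 4.0], [0.0, 2.0]], dtype=np.float32)
unit = normalize_rows(toy)
print(np.linalg.norm(unit, axis=1))  # every row norm should now be close to 1
```

Sentence Transformers' `encode` also accepts `normalize_embeddings=True`, which should have the same effect without a separate step.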

### Let's put it all together

First, let's encode the question.

In [11]:
query_text = "Are you the cause of your parents' misery?"
query_embedding = model.encode(query_text)

Now we can reuse the `find_nearest_neighbors` function we wrote for exercise 1.

However, that version only returns the vectors, whereas we also want the quotes. So rewrite `find_nearest_neighbors` to return the *indices* of the nearest neighbors.

In [12]:
import numpy as np

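def euclidean_distance(v1: np.ndarray, v2: np.ndarray) -> np.ndarray:
    """
    Compute the Euclidean distance between two vectors.

    The inputs are broadcast against each other, so `v2` may also be a
    2D array of row vectors, in which case one distance per row is returned.

    Parameters
    ----------
    v1 : np.ndarray
        First vector.
    v2 : np.ndarray
        Second vector, or a 2D array with one vector per row.

    Returns
    -------
    np.ndarray
        Euclidean distance between `v1` and `v2` (one value per row of `v2`).
    """
    dist = v1 - v2
    # Take the norm over the last axis so both 1D and 2D inputs work.
    return np.linalg.norm(dist, axis=-1)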


def find_nearest_neighbors(query: np.ndarray,
                           vectors: np.ndarray,
                           k: int = 1) -> np.ndarray:
    """
    Find the indices of the k nearest neighbors of a query vector.

    Parameters
    ----------
    query : np.ndarray
        Query vector.
    vectors : np.ndarray
        Vectors to search, one per row.
    k : int, optional
        Number of nearest neighbors to return, by default 1.

    Returns
    -------
    np.ndarray
        Indices of the `k` nearest neighbors of `query` in `vectors`,
        closest first.
    """
    distances = euclidean_distance(query, vectors)
    # argsort orders indices by ascending distance; keep the first k.
    return np.argsort(distances)[:k]

In [13]:
indices = find_nearest_neighbors(query_embedding, embeddings, k=3)

In [14]:
for i in indices:
    print(rick_and_morty_quotes[i])

You're not the cause of your parents' misery. You're just a symptom of it.

Having a family doesn't mean that you stop being an individual. You know the best thing you can do for the people that depend on you? Be honest with them, even if it means setting them free.

B—h, my generation gets traumatized for breakfast.
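
Since the embeddings are not normalized, the cosine alternative mentioned earlier may be worth trying. The sketch below is one possible variant, not the notebook's own solution. `find_nearest_neighbors_cosine` is a hypothetical name, the toy vectors are made up for illustration, and the function assumes no vector is all zeros.

```python
import numpy as np


def find_nearest_neighbors_cosine(query: np.ndarray,
                                  vectors: np.ndarray,
                                  k: int = 1) -> np.ndarray:
    """Return indices of the k rows of `vectors` most similar to `query`."""
    # Cosine similarity = dot product divided by the product of the norms.
    sims = (vectors @ query) / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )
    # Negate so that argsort puts the highest similarity first.
    return np.argsort(-sims)[:k]


# Toy data: row 1 points the same way as the query but is much longer.
toy_vectors = np.array([[0.0, 1.0], [10.0, 10.0], [1.0, 0.2]])
toy_query = np.array([1.0, 1.0])
print(find_nearest_neighbors_cosine(toy_query, toy_vectors, k=2))
```

On this toy data, cosine similarity ranks row 1 first because it has the same direction as the query, whereas Euclidean distance would rank it last because of its length. In the notebook, the call would take `query_embedding` and `embeddings` in place of the toy arrays.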



#### Asking the question again

Now let's give ChatGPT the retrieved quotes and ask it to answer the question based on them, in addition to what it already knows.

In [15]:
"""
Answer the question based on the context.

Question: Are you the cause of your parents' misery?

Context:

You're not the cause of your parents' misery. You're just a symptom of it.

Having a family doesn't mean that you stop being an individual. You know the best thing you can do for the people that depend on you? Be honest with them, even if it means setting them free.

B—h, my generation gets traumatized for breakfast.
"""

"\nAnswer the question based on the context.\n\nQuestion: Are you the cause of your parents' misery?\n\nContext:\n\nYou're not the cause of your parents' misery. You're just a symptom of it.\n\nHaving a family doesn't mean that you stop being an individual. You know the best thing you can do for the people that depend on you? Be honest with them, even if it means setting them free.\n\nB—h, my generation gets traumatized for breakfast.\n"
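
The prompt above was assembled by hand. As a rough sketch, a small helper could build the same layout from whatever quotes the search returns. `build_prompt` is a hypothetical name introduced here for illustration.

```python
def build_prompt(question: str, context_quotes: list[str]) -> str:
    """Assemble a question-plus-context prompt from retrieved quotes."""
    # Strip the trailing newlines left over from readlines().
    context = "\n\n".join(quote.strip() for quote in context_quotes)
    return (
        "Answer the question based on the context.\n\n"
        f"Question: {question}\n\n"
        f"Context:\n\n{context}\n"
    )


prompt = build_prompt(
    "Are you the cause of your parents' misery?",
    ["You're not the cause of your parents' misery. You're just a symptom of it.\n"],
)
print(prompt)
```

In the notebook, the second argument would be `[rick_and_morty_quotes[i] for i in indices]`, so the prompt follows the search results automatically.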