# Embeddings & You - A Brief Introduction to Embeddings in Machine Learning

If you've toyed with LangChain, LlamaIndex, or even OpenAI's `ada` model - you've likely run into the word "Embeddings" a few times.

They've had a recent surge in popularity due to the proliferation of Retrieval Augmented Generation, but they've been around for a very long time.

If you come from an NLP background, embeddings are something you might be intimately familiar with - otherwise, you might find the topic a bit...dense. (this attempt at a joke will make more sense later)

In all seriousness, embeddings are a powerful piece of the NLP puzzle, so let's dive in!

> NOTE: While this notebook is language/NLP-centric, embeddings have uses beyond just text!

## Notebook Table of Contents:

- Breakout Room #1: Training Word2Vec from Scratch
  - Task 1: Dependencies
  - Task 2: Data Collection
  - Task 3: Data Preprocessing
    - ❓ Question #1
    - 👪❓ Discussion Question #1
  - Task 4: Training Word2Vec
    - 🏗️ Activity #1
    - ❓ Question #2
- Breakout Room #2:
  - Task 1: Fine-tuning Our Embedding Model
    - ❓ Question #3
    - 🏗️ Activity #2
  - Task 2: Evaluating our Embedding Model
    - 👪❓ Discussion Question #2

### Why Do We Even Need Embeddings?

In order to fully understand what Embeddings are, we first need to understand why we have them:

Machine Learning algorithms, ranging from the very big to the very small, all have one thing in common:

*They need numeric inputs.*

So we need a process by which to translate the domain we live in, dominated by images, audio, language, and more, into the domain of the machine: Numbers.

Another thing we want to be able to do is capture "semantic information" about words/phrases so that we can use algorithmic approaches to determine if words are closely related or not!

So, we need to come up with a process that does these two things well:

1. Convert non-numeric data into numeric-data
2. Capture potential semantic relationships between individual pieces of data
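
To make the two requirements concrete, here is a minimal sketch comparing one-hot vectors with dense vectors. The three-word vocabulary and the dense values are hand-picked assumptions for illustration, not the output of any trained model. One-hot encoding satisfies requirement 1 only: every pair of distinct words has similarity 0. Dense vectors can also satisfy requirement 2, because related words can point in similar directions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["cat", "dog", "car"]

# One-hot encoding: numeric, but all distinct words are equally unrelated.
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Hypothetical dense vectors, chosen by hand so that "cat" and "dog" are close.
dense = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

print("one-hot cat vs dog:", cosine_similarity(one_hot["cat"], one_hot["dog"]))
print("dense   cat vs dog:", cosine_similarity(dense["cat"], dense["dog"]))
print("dense   cat vs car:", cosine_similarity(dense["cat"], dense["car"]))
```

The rest of the notebook is about learning vectors like the `dense` ones from data instead of writing them by hand.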

## Breakout Room #1: Training Word2Vec from Scratch

Now that we have a bit of background on Embeddings - let's look at what it takes to create our own embeddings using Word2Vec!

We'll be leveraging the `gensim` library, which you can read all about [here](https://pypi.org/project/gensim/).

Before we begin training, however, we need some data!

Let's use the Wikipedia pages for Wicked and Gladiator II as examples.

### Task 1: Dependencies
We'll leverage the `wikipedia` library, and `langchain`'s `WikipediaLoader` to obtain our Wikipedia data!

In [1]:
!pip install -U -q wikipedia langchain langchain_community lxml datasets

> NOTE: Please reset the Colab environment after running the install cells.

### Task 2: Data Collection



In [2]:
from langchain_community.document_loaders import WikipediaLoader

wicked_docs = WikipediaLoader(
    query="Wicked (2024 film)",
    load_max_docs=5,
    doc_content_chars_max=1_000_000
    ).load()





In [3]:
len(wicked_docs)

4

In [4]:
gladiator_2_docs = WikipediaLoader(
    query="Gladiator II",
    load_max_docs=5,
    doc_content_chars_max=1_000_000
    ).load()

In [5]:
len(gladiator_2_docs)

5

### Task 3: Data Preprocessing

Now that we have some text, we need to do some preprocessing! That's right - classic NLP!

Let's begin by cleaning up our text, we'll:

- Remove special characters
- Remove stop words
- Remove links
- Convert to lowercase
- Strip whitespace

To do this, we'll need two main modules:

- The `re` standard library module
- `nltk`, a classic NLP library

In [6]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Let's take a peek at what these "stopwords" are in the context of traditional embedding models and NLP.

In [7]:
from nltk.corpus import stopwords
stopwords.words('english')[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

#### Text Normalization

The first step is to make a helper function that normalizes our text.

In [8]:
import re
from typing import List
from nltk.tokenize import word_tokenize

def preprocess_text(text: str) -> List[str]:
  # remove links
  text = re.sub(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "", text)
  # remove all special characters (keep alphabet characters)
  text = re.sub("[^a-zA-Z ]", " ", text)
  # tokenize text, make lowercase, and remove stop words
  stop_words = set(stopwords.words('english'))
  tokens = word_tokenize(text)
  filtered_tokens = [token.lower() for token in tokens if token.lower() not in stop_words]
  return filtered_tokens

Let's see how this works on some of our Wikipedia data!

In [9]:
preprocess_text(wicked_docs[0].page_content[:100])

['wicked',
 'titled',
 'onscreen',
 'wicked',
 'part',
 'american',
 'musical',
 'fantasy',
 'film',
 'directed',
 'jon']

#### Sentence Tokenization:

Now we'll turn our corpus into sets of sentences and apply our pre-processing function to each sentence individually.

In [10]:
from nltk.tokenize import sent_tokenize

def sentence_tokenization(text: str) -> List[List[str]]:
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)
    # Tokenize each sentence into words and store them in a list of lists
    sentence_tokens = [preprocess_text(sentence) for sentence in sentences]
    return sentence_tokens

In [11]:
sentence_tokenization(wicked_docs[0].page_content[:200])

[['wicked',
  'titled',
  'onscreen',
  'wicked',
  'part',
  'american',
  'musical',
  'fantasy',
  'film',
  'directed',
  'jon',
  'chu',
  'written',
  'winnie',
  'holzman',
  'dana',
  'fox',
  'songs',
  'stephen',
  'schwartz'],
 ['first']]

Perfect, with that, we're ready to create our corpus!

In [12]:
corpus = []

for doc in wicked_docs:
  corpus += sentence_tokenization(doc.page_content)

for doc in gladiator_2_docs:
  corpus += sentence_tokenization(doc.page_content)

##### ❓ Question #1:

Why is this normalization and tokenization necessary to train a Word2Vec Embedding Model?

#### **ANSWER**
Normalization and tokenization are important for training a Word2Vec model for several key reasons:

**Consistent Input Format**:
- Word2Vec learns relationships between words by analyzing their contexts. It needs words to be in a consistent, normalized format to recognize them as the same token.
- Without normalization, "training", "Training", and "TRAINING" would be treated as completely different words, diluting the learning signal

**Meaningful Context Windows**
- Sentence tokenization is important because Word2Vec learns from words that appear together within a context window
- Sentence boundaries provide natural breaks where word relationships shouldn't be learned
- Also, you don't want the model learning relationships between words at the end of one sentence and the start of another

**Noise Reduction**
- Removing stopwords and standardizing text helps focus the model on meaningful semantic relationships
- Without this pre-processing, the model would waste capacity learning relationships between very common words like "the", "and", "is" that don't carry much semantic meaning

**Vocabulary**
- Tokenization helps build a defined vocabulary for the model
- Normalizing words (like converting to lowercase) reduces the vocabulary size
- A smaller, more focused vocabulary means more training examples per word, less memory usage and better quality embeddings for the words that matter
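
As a small illustration of the vocabulary point, the sketch below counts tokens before and after lowercasing. The token list is made up for the example; only the standard library is used.

```python
from collections import Counter

# Hypothetical tokens; the three casings of "training" are really one word.
raw_tokens = ["Training", "training", "TRAINING", "the", "model", "Model"]

raw_counts = Counter(raw_tokens)
normalized_counts = Counter(token.lower() for token in raw_tokens)

print("distinct tokens before lowercasing:", len(raw_counts))
print("distinct tokens after lowercasing: ", len(normalized_counts))
print("occurrences of 'training' after lowercasing:", normalized_counts["training"])
```

After normalization the same text yields fewer vocabulary entries, each backed by more occurrences, which is what gives Word2Vec more training signal per word.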


##### 👪❓ Discussion Question #1:

When creating training data for Large Language Models, do we need to/should we use text normalization?

What arguments for or against text normalization exist at LLM-scale datasets?

### Task 4: Training Word2Vec

Now that we have our corpus set up, we can train our Word2Vec model.

Training is straightforward, thanks to `gensim`, and more can be understood about the process by reading the paper - but let's see it in code!

It's also worth considering/playing around with the `gensim` parameters.

In [13]:
!pip install -q -U gensim

### An Aside on Skip-gram (SG) and Continuous Bag of Words (CBOW):

**Skip-gram**:

Skip-gram is an approach to teaching computers the meaning of words by predicting the surrounding context from a given word. Think of it as a student who learns by taking a single word and trying to guess what words might appear around it. For example, given the word "sun," Skip-gram would learn to predict related words like "bright," "sky," and "shine." This method is particularly effective at handling rare words in the vocabulary and capturing multiple meanings of words, though it typically requires more training time. The key insight is that words appearing in similar contexts often have related meanings.

**Continuous Bag of Words (CBOW)**:

CBOW takes the opposite approach to Skip-gram by predicting a target word based on its surrounding context words. Imagine playing a fill-in-the-blank game where you see "The ___ is barking at the mailman" and need to predict "dog" based on the surrounding words. CBOW looks at multiple context words at once and tries to understand what word would make sense in the middle. This method tends to be faster to train than Skip-gram and performs particularly well with frequent words in the vocabulary. However, it might not be as effective at handling rare words or capturing multiple word meanings since it averages the context.
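
The difference between the two objectives is easiest to see in the training examples each one produces. The following is a simplified sketch, not `gensim`'s implementation: the helper names `skipgram_pairs` and `cbow_examples` are hypothetical, and real implementations add details such as sub-sampling and negative sampling.

```python
from typing import List, Tuple

def skipgram_pairs(sentence: List[str], window: int) -> List[Tuple[str, str]]:
    """(center, context) pairs: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

def cbow_examples(sentence: List[str], window: int) -> List[Tuple[List[str], str]]:
    """(context words, target) examples: predict the center word from its context."""
    examples = []
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        context = [sentence[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

sentence = ["dog", "barking", "mailman"]
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

Skip-gram turns one position into several single-word predictions, while CBOW turns it into one prediction from the combined context. This matches the description above of CBOW averaging its context.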

##### 🏗️ Activity #1:

Set appropriate hyperparameters for the gensim `Word2Vec` model.

> NOTE: Documentation is available [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)

##### ❓ Question #2:

What does each of the hyper-parameters mean or relate to?

- `VECTOR_SIZE` -> Dimensionality of word vectors
- `WINDOW` -> Maximum distance between target and context words
- `MIN_COUNT` -> Minimum frequency for words to be included in vocabulary

In [14]:
from gensim.models import Word2Vec

### Modify These Hyper Parameters
VECTOR_SIZE = 128
WINDOW = 10
MIN_COUNT = 2

### Leave this Hyper Parameter
# Training algo- 1 for skip-gram , 0 for CBOW
SG = 1

model = Word2Vec(
    sentences=corpus,
    vector_size=VECTOR_SIZE,
    window=WINDOW,
    min_count=MIN_COUNT,
    sg=SG
    )

Blink and you'll miss it. You just trained an embeddings model!

Let's try it out and see what we did!

In [15]:
model.wv["elphaba"]

array([-0.03414725, -0.04884041,  0.10193804,  0.20329528,  0.10152539,
       -0.13650948,  0.10819466,  0.04226966, -0.05617155,  0.05878397,
        0.00152129,  0.02239892, -0.0348038 , -0.15599115,  0.12895605,
        0.2589449 , -0.00040091,  0.0717862 , -0.21542439,  0.24777177,
        0.22817518,  0.10774313, -0.16589536, -0.23007196, -0.18244976,
        0.01020667, -0.18018322,  0.0852144 ,  0.0776168 , -0.06121664,
       -0.1311743 ,  0.17316589,  0.01349317, -0.16350332,  0.22495998,
        0.11282989,  0.34145167, -0.04497095,  0.11076037,  0.11443741,
        0.05830157,  0.0857458 ,  0.10610778, -0.07305935,  0.27919325,
       -0.07181875, -0.00339787, -0.06210322, -0.11192293,  0.09862242,
        0.07002874,  0.13819158,  0.23487525, -0.03220874,  0.08827607,
       -0.07670002,  0.21123056, -0.04948097, -0.13903558,  0.12029547,
       -0.19755921,  0.01538324,  0.09666726, -0.04356951,  0.21688014,
       -0.12563924, -0.06335825,  0.04929867, -0.00814048, -0.14

Finally! We see it: An embedding in the wild.

Notice how we input a word, in this case "elphaba", and we got back a 128-dimensional vector of floats (matching our `VECTOR_SIZE`).

Let's see if we can get back a list of the vectors most similar to the vectors for "elphaba" and "maximus"!

In [16]:
model.wv.most_similar(positive=["elphaba"], topn=3)

[('glinda', 0.9969524145126343),
 ('grande', 0.9950782656669617),
 ('wizard', 0.9950142502784729)]

In [17]:
model.wv.most_similar(positive=["maximus"], topn=3)

[('lucilla', 0.9979579448699951),
 ('marcus', 0.9978983402252197),
 ('son', 0.997671365737915)]
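
The scores returned by `most_similar` are cosine similarities. As a rough sketch of what that ranking involves, the code below does the same thing by hand with NumPy. The 3-dimensional `toy_vectors` are invented stand-ins for `model.wv`, and the local `most_similar` function is a simplified re-implementation for illustration only.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for the trained word vectors.
toy_vectors = {
    "elphaba": np.array([0.9, 0.1, 0.3]),
    "glinda": np.array([0.8, 0.2, 0.4]),
    "maximus": np.array([0.1, 0.9, 0.2]),
    "lucilla": np.array([0.2, 0.8, 0.1]),
}

def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = []
    for other, vec in vectors.items():
        if other == word:
            continue  # a word is trivially identical to itself
        scores.append((other, float(np.dot(query, vec / np.linalg.norm(vec)))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("elphaba", toy_vectors, topn=2))
print(most_similar("maximus", toy_vectors, topn=2))
```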

Now, for the moment of truth - let's do some vector math and see what happens!

In [18]:
galinda_vec = model.wv["galinda"]
good_vec = model.wv["good"]
mystery_vector = galinda_vec - good_vec

In [19]:
model.wv.most_similar(positive=[mystery_vector], topn=3)

[('parma', 0.5291705131530762),
 ('galinda', 0.5288470387458801),
 ('elphaba', 0.5266224145889282)]

And there we have it - embeddings, and a demonstration of what makes them so powerful!

> Note: This is a very small corpus, and while this result is what we'd hope for, it is largely coincidental. This behaviour is expressed much better in larger corpora of text.
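
Vector arithmetic is easier to see with hand-built vectors where each axis has an obvious meaning. In the sketch below the 2-D vectors are assumptions made up for the example (axis 0 loosely "royalty", axis 1 loosely "gender"), and `nearest` is a hypothetical helper. Learned embeddings are never this clean.

```python
import numpy as np

# Hypothetical 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "prince": np.array([0.8, 0.9]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target: np.ndarray, vectors: dict, exclude: set) -> str:
    """Word whose vector has the highest cosine similarity to `target`."""
    candidates = {w: cosine(target, v) for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=candidates.get)

# Remove the "man" direction from "king", then add the "woman" direction.
mystery = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(mystery, vectors, exclude={"king", "man", "woman"}))
```

With `gensim`, the same kind of query can be written with the `positive` and `negative` arguments of `most_similar` instead of subtracting the vectors manually.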

## Breakout Room #2: Fine-tuning a BERT-Style Embedding Model on Question Answer Pairs.

Now that we've seen where embeddings "started", as it were, let's see where they've gotten.

In this section, we'll be fine-tuning Hugging Face's [sentence transformers](https://www.sbert.net/).

Sentence Transformers leverages the work done in the [Sentence-BERT](https://arxiv.org/abs/1908.10084) paper. So while the idea of converting input text into a dense vector representation is the same, the way we got to those embeddings is a bit different.

> NOTE: As the name implies, the following model is an *ENTIRE* transformer model (though Encoder-only, as described by Sentence-BERT).

### Fine-tuning Our Embeddings Model

Finally, the setup is complete - and we can move on to fine-tuning our sentence transformer embedding model!

The process is simplified considerably by how amazing the Hugging Face `sentence-transformers` library is, so let's jump straight in!

In [20]:
!pip install -U -q sentence-transformers


In [21]:
from sentence_transformers import SentenceTransformer

We're going to use the `BAAI/bge-small-en` embedding model as an example, but you could use any of the `sentence-transformers` embedding models.

In [22]:
model_id = "BAAI/bge-small-en"
model = SentenceTransformer(model_id)


In [23]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
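
The printout shows three stages: a transformer, a pooling layer with `pooling_mode_cls_token` set to `True`, and a `Normalize()` layer. The sketch below mimics the last two stages in NumPy to show how one sentence ends up as a single 384-dimensional unit vector. The random `token_embeddings` are a stand-in assumption for the real transformer output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the transformer output of one sentence: (num_tokens, hidden_size).
# Random numbers here; the real model produces contextual token embeddings.
token_embeddings = rng.normal(size=(6, 384))

# Pooling with pooling_mode_cls_token=True: keep only the first ([CLS]) token's vector.
sentence_embedding = token_embeddings[0]

# Normalize(): rescale the vector to unit length.
sentence_embedding = sentence_embedding / np.linalg.norm(sentence_embedding)

print(sentence_embedding.shape)
print(np.linalg.norm(sentence_embedding))
```

Because the output has unit length, the dot product of two such embeddings equals their cosine similarity.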

Let's load our data into the desired format!

In [24]:
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

In [25]:
!git clone https://github.com/AI-Maker-Space/DataRepository

Cloning into 'DataRepository'...
remote: Enumerating objects: 119, done.
remote: Counting objects: 100% (111/111), done.
remote: Compressing objects: 100% (96/96), done.
remote: Total 119 (delta 36), reused 40 (delta 10), pack-reused 8 (from 1)
Receiving objects: 100% (119/119), 78.04 MiB | 14.74 MiB/s, done.
Resolving deltas: 100% (36/36), done.


In [26]:
TRAIN_DATASET_FPATH = './DataRepository/embedding_data/train_dataset.json'
VAL_DATASET_FPATH = './DataRepository/embedding_data/eval_dataset.json'

In [27]:
import json

# Read-only mode is all we need to load the JSON files
with open(TRAIN_DATASET_FPATH, 'r') as f:
    train_dataset = json.load(f)

with open(VAL_DATASET_FPATH, 'r') as f:
    val_dataset = json.load(f)

In [28]:
# Load the components from the training dataset
dataset = train_dataset


# Dictionary of documents/texts
corpus = dataset['corpus']

# Dictionary of search queries
queries = dataset['queries']

# Dictionary mapping queries to relevant document IDs
relevant_docs = dataset['relevant_docs']

# Create a list to store our training examples
examples = []

# Iterate through each query and its ID
for query_id, query in queries.items():

    # Get the ID of the first relevant document for this query
    node_id = relevant_docs[query_id][0]
    # Get the actual text of the relevant document from the corpus
    text = corpus[node_id]

    # Create a training example pair: [query, relevant_document]
    # This will be used by MultipleNegativesRankingLoss where:
    # query should be more similar to its relevant document than to other documents
    # Other documents in the batch will serve as negative examples
    example = InputExample(texts=[query, text])

    # Add the example to our training dataset
    examples.append(example)
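
Judging from the loop above, the JSON file holds three dictionaries keyed by ID. The toy dictionary below mirrors that shape with made-up IDs and texts (none of these strings come from the real dataset) and builds the same (query, relevant document) pairs, using plain tuples instead of `InputExample`.

```python
# Hypothetical stand-in for train_dataset; IDs and texts are invented for illustration.
toy_dataset = {
    "corpus": {
        "doc-1": "Embeddings map text to dense vectors.",
        "doc-2": "Word2Vec learns word vectors from context windows.",
    },
    "queries": {
        "q-1": "What do embeddings do?",
        "q-2": "How does Word2Vec learn?",
    },
    "relevant_docs": {"q-1": ["doc-1"], "q-2": ["doc-2"]},
}

pairs = []
for query_id, query in toy_dataset["queries"].items():
    # Only the first relevant document is used, as in the loop above.
    first_relevant_id = toy_dataset["relevant_docs"][query_id][0]
    pairs.append((query, toy_dataset["corpus"][first_relevant_id]))

for query, text in pairs:
    print(f"{query!r} -> {text!r}")
```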

We're going to be leveraging `sentence_transformers`' `MultipleNegativesRankingLoss` as our loss function.

You can read more about it in the docs, [here](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss).

Note that there is [research](https://arxiv.org/pdf/1705.00652.pdf) that indicates that performance generally scales with `BATCH_SIZE`, but we're going to stick with an arbitrary 10 for the example in the notebook.

##### ❓ Question #3:

What is happening in `MultipleNegativesRankingLoss` that makes it useful for our task?

In our query-document matching use case, fine-tuning the embedding model starts from three things:
- A corpus of documents
- Queries
- Relevant document IDs for each query

We create training examples where each query is paired with its relevant document. `MultipleNegativesRankingLoss` (MNRL) is a good fit for this because:

- For each query in the batch, the positive pair is (query, relevant_document)
- All other documents in that batch automatically become negative examples
- The goal is for each query to be more similar to its relevant document than to any other document
- MNRL learns to maximize similarity between queries and their matching documents and to minimize similarity between queries and non-matching documents
- It does this in a computationally efficient way, since it uses in-batch negatives (each text in the batch only needs to be embedded once)

For a batch of 10 examples, as in our notebook:
- Each `InputExample` contains a **(query, text)** pair
- With `batch_size=10`, each query is compared against:
    *   its own relevant document (positive)
    *   9 other documents (negatives) from the other query-document pairs in the batch

- For query 1, its paired document 1 is the positive, and documents 2-10 automatically become negatives

- For query 2, its paired document 2 is the positive, and documents 1 and 3-10 become negatives, and so on

We get 9 negative examples per query for free, without having to:
- Store additional negative examples in memory
- Compute additional embeddings
- Do separate loss calculations

We reuse the embeddings we had to compute anyway for the positive pairs, rather than computing and storing additional embeddings for explicit negative examples.

This approach is particularly efficient for our use case because:
- We don't need to explicitly specify negative examples
- It naturally handles the asymmetric nature of query-document matching
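
The in-batch negatives idea can be sketched in plain NumPy. This is a simplified illustration, not the library's implementation: the function name `in_batch_negatives_loss`, the random toy embeddings, and the `scale` value of 20 are all assumptions made for the example. Each row of the similarity matrix is treated as a classification problem whose correct answer sits on the diagonal.

```python
import numpy as np

def in_batch_negatives_loss(query_emb: np.ndarray, doc_emb: np.ndarray, scale: float = 20.0) -> float:
    """Cross-entropy over a (batch x batch) cosine-similarity matrix.

    Row i holds query i scored against every document in the batch;
    document i is the positive, all other documents are negatives.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    scores = scale * (q @ d.T)
    # Numerically stable log-softmax over each row.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The correct "class" for row i is column i.
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
docs = rng.normal(size=(10, 8))  # batch of 10 toy document embeddings
aligned_queries = docs + 0.01 * rng.normal(size=docs.shape)  # each query close to its own document
shuffled_queries = np.roll(aligned_queries, shift=1, axis=0)  # each query paired with the wrong document

print("loss when pairs match:   ", in_batch_negatives_loss(aligned_queries, docs))
print("loss when pairs mismatch:", in_batch_negatives_loss(shuffled_queries, docs))
```

The loss is low when each query is most similar to its own document and high otherwise. Note that all 10 x 10 scores come from just 20 embeddings.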




In [29]:
from sentence_transformers import losses

In [30]:
# Set up the loss function - MultipleNegativesRankingLoss
loss = losses.MultipleNegativesRankingLoss(model)

In [31]:
# Define batch size for training
BATCH_SIZE = 10

# Create a DataLoader to handle batching of our training examples
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

We'll set up the `InformationRetrievalEvaluator` to determine performance during training.

In [32]:
# Set up an evaluator to measure model performance during training
from sentence_transformers.evaluation import InformationRetrievalEvaluator

In [33]:
# Loading validation dataset
dataset = val_dataset

# Extract components from validation data
corpus = dataset['corpus']
queries = dataset['queries']
relevant_docs = dataset['relevant_docs']

# Create evaluator that will measure how well our model retrieves relevant documents for queries
evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

You could train for more epochs here, but for the example in the notebook, we'll stick with 10.

In [34]:
# Set number of training epochs
EPOCHS = 10

Nothing left to do but #trainthatmodel!

> NOTE: You'll need to make sure you enter the desired Weights and Biases key - you should be able to simply click the link `https://wandb.ai/authorize` and follow the outlined steps to get the API key.

In [35]:
# Calculate warmup steps (10% of total training steps)
# Warmup helps stabilize early training by gradually increasing learning rate
warmup_steps = int(len(loader) * EPOCHS * 0.1)

# Start the training process
model.fit(
    # Pair of data loader and loss function
    train_objectives=[(loader, loss)],
    # Number of training epochs
    epochs=EPOCHS,
    # Learning rate warmup
    warmup_steps=warmup_steps,
    output_path='exp_finetune',
    show_progress_bar=True,
    evaluator=evaluator,
    # Evaluate every 50 steps
    evaluation_steps=50,
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


| Step | Training Loss | Validation Loss | Cosine Accuracy@1 | Cosine Accuracy@3 | Cosine Accuracy@5 | Cosine Accuracy@10 | Cosine Precision@1 | Cosine Precision@3 | Cosine Precision@5 | Cosine Precision@10 | Cosine Recall@1 | Cosine Recall@3 | Cosine Recall@5 | Cosine Recall@10 | Cosine Ndcg@10 | Cosine Mrr@10 | Cosine Map@100 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 50 | No log | No log | 0.627517 | 0.785235 | 0.832215 | 0.892617 | 0.627517 | 0.261745 | 0.166443 | 0.089262 | 0.627517 | 0.785235 | 0.832215 | 0.892617 | 0.75848 | 0.715591 | 0.720487 |
| 60 | No log | No log | 0.620805 | 0.775168 | 0.832215 | 0.88255 | 0.620805 | 0.258389 | 0.166443 | 0.088255 | 0.620805 | 0.775168 | 0.832215 | 0.88255 | 0.750621 | 0.708377 | 0.713706 |
| 100 | No log | No log | 0.630872 | 0.788591 | 0.845638 | 0.90604 | 0.630872 | 0.262864 | 0.169128 | 0.090604 | 0.630872 | 0.788591 | 0.845638 | 0.90604 | 0.764965 | 0.720006 | 0.723902 |
| 120 | No log | No log | 0.644295 | 0.788591 | 0.838926 | 0.90604 | 0.644295 | 0.262864 | 0.167785 | 0.090604 | 0.644295 | 0.788591 | 0.838926 | 0.90604 | 0.771768 | 0.729144 | 0.733401 |
| 150 | No log | No log | 0.630872 | 0.785235 | 0.848993 | 0.892617 | 0.630872 | 0.261745 | 0.169799 | 0.089262 | 0.630872 | 0.785235 | 0.848993 | 0.892617 | 0.761787 | 0.719672 | 0.724533 |
| 180 | No log | No log | 0.627517 | 0.791946 | 0.845638 | 0.899329 | 0.627517 | 0.263982 | 0.169128 | 0.089933 | 0.627517 | 0.791946 | 0.845638 | 0.899329 | 0.762379 | 0.71849 | 0.723044 |
| 200 | No log | No log | 0.630872 | 0.791946 | 0.845638 | 0.90604 | 0.630872 | 0.263982 | 0.169128 | 0.090604 | 0.630872 | 0.791946 | 0.845638 | 0.90604 | 0.76759 | 0.723505 | 0.727488 |
| 240 | No log | No log | 0.651007 | 0.788591 | 0.838926 | 0.895973 | 0.651007 | 0.262864 | 0.167785 | 0.089597 | 0.651007 | 0.788591 | 0.838926 | 0.895973 | 0.770072 | 0.729936 | 0.734745 |
| 250 | No log | No log | 0.64094 | 0.798658 | 0.845638 | 0.899329 | 0.64094 | 0.266219 | 0.169128 | 0.089933 | 0.64094 | 0.798658 | 0.845638 | 0.899329 | 0.769081 | 0.727354 | 0.731877 |
| 300 | No log | No log | 0.644295 | 0.795302 | 0.842282 | 0.899329 | 0.644295 | 0.265101 | 0.168456 | 0.089933 | 0.644295 | 0.795302 | 0.842282 | 0.899329 | 0.769558 | 0.728181 | 0.732656 |



### Task 2: Evaluating our Embeddings Models

Now that we've fine-tuned our embedding model on our data - let's see how it performs compared to the base embeddings!

In [40]:
import json
from tqdm.notebook import tqdm
import pandas as pd

In [41]:
TRAIN_DATASET_FPATH = './DataRepository/embedding_data/train_dataset.json'
EVAL_DATASET_FPATH = './DataRepository/embedding_data/eval_dataset.json'

In [42]:
# Read-only mode is all we need to load the JSON files
with open(TRAIN_DATASET_FPATH, 'r') as f:
    train_dataset = json.load(f)

with open(EVAL_DATASET_FPATH, 'r') as f:
    eval_dataset = json.load(f)

We're going to be using the `InformationRetrievalEvaluator` to help us determine how well our embedding model is performing on a widely used task: Information Retrieval!

You can dive deeper into the documentation [here](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#informationretrievalevaluator) to see under the hood.

You'll notice, however, that we have common suffixes for our evaluation metrics:

- `X_accuracy@1`, `X_accuracy@3`, etc.

This is computing metrics by looking at the accuracy, recall, precision, Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), and Mean Average Precision (MAP) at various numbers of retrieved items.

That is to say:

We look at these scores as we include the first closest document, top three closest documents, etc.

We can think of these `@k` as "top-k" metrics.

These will help us guide important hyper-parameters when using these models for Information Retrieval tasks down the road!
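
To make the `@k` idea concrete, here is a minimal, dependency-free sketch of how accuracy, precision, recall, and MRR at a cutoff `k` can be computed from ranked results. The helper `metrics_at_k` and the two toy queries are assumptions for illustration; the evaluator used in this notebook computes these for you (along with NDCG and MAP, which are omitted here).

```python
from typing import Dict, List, Set

def metrics_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> Dict[str, float]:
    """Average accuracy, precision, recall, and MRR over all queries at cutoff k."""
    totals = {"accuracy": 0.0, "precision": 0.0, "recall": 0.0, "mrr": 0.0}
    for query_id, ranking in ranked.items():
        top_k = ranking[:k]
        hits = [doc_id for doc_id in top_k if doc_id in relevant[query_id]]
        totals["accuracy"] += 1.0 if hits else 0.0  # at least one relevant doc in the top k
        totals["precision"] += len(hits) / k  # share of the top k that is relevant
        totals["recall"] += len(hits) / len(relevant[query_id])  # share of relevant docs found
        for rank, doc_id in enumerate(top_k, start=1):
            if doc_id in relevant[query_id]:
                totals["mrr"] += 1.0 / rank  # reciprocal rank of the first relevant doc
                break
    return {name: total / len(ranked) for name, total in totals.items()}

# Hypothetical retrieval results: document IDs ordered from most to least similar.
ranked = {"q1": ["d1", "d2", "d3"], "q2": ["d3", "d2", "d1"]}
relevant = {"q1": {"d1"}, "q2": {"d2"}}

for k in (1, 3):
    print(k, metrics_at_k(ranked, relevant, k))
```

When every query has exactly one relevant document, accuracy@k and recall@k coincide and precision@k is simply accuracy@k divided by k. That appears to be the situation in this notebook's results: for the base model, 0.714765 / 3 ≈ 0.238255.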

In [43]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer

def evaluate_st(
    dataset,
    model_id,
    name,
):
    # Extract components from dataset
    corpus = dataset['corpus']
    queries = dataset['queries']
    relevant_docs = dataset['relevant_docs']

    # Create evaluator with all necessary components
    evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=name)

    # Load the model and run evaluation
    model = SentenceTransformer(model_id)

    return evaluator(model, output_path="/content/")

##### 🏗️ Activity #2:

Describe what the `evaluate_st` function is doing in the above cell in natural language.

#### ANSWER
- It takes three inputs: a dataset containing documents and queries, the ID of the model to test, and a name for this evaluation run
- It extracts three components from the dataset: `corpus` (the collection of documents), `queries` (questions or search terms), and `relevant_docs` (a mapping showing which documents should match which queries)
- It creates an evaluator that tests the model by taking each query, having the model find similar documents, and checking whether the documents the model retrieved match the ones listed in `relevant_docs`
- It loads the specified model and runs this evaluation
- It saves the results to the `/content/` directory and returns them


#### Base Embeddings Model Results

In [44]:
evaluate_st(eval_dataset, "BAAI/bge-small-en", name='bge')

{'bge_cosine_accuracy@1': 0.5067114093959731,
 'bge_cosine_accuracy@3': 0.714765100671141,
 'bge_cosine_accuracy@5': 0.7818791946308725,
 'bge_cosine_accuracy@10': 0.8288590604026845,
 'bge_cosine_precision@1': 0.5067114093959731,
 'bge_cosine_precision@3': 0.23825503355704697,
 'bge_cosine_precision@5': 0.1563758389261745,
 'bge_cosine_precision@10': 0.08288590604026844,
 'bge_cosine_recall@1': 0.5067114093959731,
 'bge_cosine_recall@3': 0.714765100671141,
 'bge_cosine_recall@5': 0.7818791946308725,
 'bge_cosine_recall@10': 0.8288590604026845,
 'bge_cosine_ndcg@10': 0.6710313851865369,
 'bge_cosine_mrr@10': 0.619814637264302,
 'bge_cosine_map@100': 0.6279603491960256}

#### Fine-tuned Results

In [45]:
evaluate_st(eval_dataset, "exp_finetune", name='finetuned')

{'finetuned_cosine_accuracy@1': 0.6442953020134228,
 'finetuned_cosine_accuracy@3': 0.7885906040268457,
 'finetuned_cosine_accuracy@5': 0.8389261744966443,
 'finetuned_cosine_accuracy@10': 0.9060402684563759,
 'finetuned_cosine_precision@1': 0.6442953020134228,
 'finetuned_cosine_precision@3': 0.2628635346756152,
 'finetuned_cosine_precision@5': 0.16778523489932884,
 'finetuned_cosine_precision@10': 0.09060402684563756,
 'finetuned_cosine_recall@1': 0.6442953020134228,
 'finetuned_cosine_recall@3': 0.7885906040268457,
 'finetuned_cosine_recall@5': 0.8389261744966443,
 'finetuned_cosine_recall@10': 0.9060402684563759,
 'finetuned_cosine_ndcg@10': 0.7717683446959837,
 'finetuned_cosine_mrr@10': 0.7291440289762436,
 'finetuned_cosine_map@100': 0.7334005887765855}

### Conclusion

Now we can compare the embeddings models to see which performed the best!

In [46]:
df_st_bge = pd.read_csv('/content/Information-Retrieval_evaluation_bge_results.csv')
df_st_finetuned = pd.read_csv('/content/Information-Retrieval_evaluation_finetuned_results.csv')

In [47]:
df_st_bge['model'] = 'bge'
df_st_finetuned['model'] = 'fine_tuned'
df_st_all = pd.concat([df_st_bge, df_st_finetuned])
df_st_all = df_st_all.set_index('model')
df_st_all

| model | epoch | steps | cosine-Accuracy@1 | cosine-Accuracy@3 | cosine-Accuracy@5 | cosine-Accuracy@10 | cosine-Precision@1 | cosine-Recall@1 | cosine-Precision@3 | cosine-Recall@3 | cosine-Precision@5 | cosine-Recall@5 | cosine-Precision@10 | cosine-Recall@10 | cosine-MRR@10 | cosine-NDCG@10 | cosine-MAP@100 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bge | -1 | -1 | 0.506711 | 0.714765 | 0.781879 | 0.828859 | 0.506711 | 0.506711 | 0.238255 | 0.714765 | 0.156376 | 0.781879 | 0.082886 | 0.828859 | 0.619815 | 0.671031 | 0.62796 |
| fine_tuned | -1 | -1 | 0.644295 | 0.788591 | 0.838926 | 0.90604 | 0.644295 | 0.644295 | 0.262864 | 0.788591 | 0.167785 | 0.838926 | 0.090604 | 0.90604 | 0.729144 | 0.771768 | 0.733401 |


##### 👪❓ Discussion Question #2:

We looked at the evaluation results for the base BAAI/bge-small-en model and the fine-tuned version. Some findings:

- The fine-tuned model outperforms the base model on every metric
- Accuracy@1 improved from 50.7% to 64.4% (a gain of about 14 percentage points)
- Accuracy@10 improved from 82.9% to 90.6% (a gain of about 8 percentage points)
- Both precision and recall improved across all k values
- Improvements are more pronounced at lower k values, which means the fine-tuned model is better at putting the most relevant documents first
- Ranking quality
   - MRR@10 improved from 0.620 to 0.729
   - NDCG@10 improved from 0.671 to 0.772
   - MAP@100 improved from 0.628 to 0.733

This indicates an improvement in overall ranking quality.
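
The accuracy gains can be read either as absolute differences in percentage points or as improvements relative to the base model. The sketch below computes both from the accuracy values reported in the results above. The `improvement` helper is a hypothetical name introduced for the example.

```python
# Accuracy values copied from the evaluation outputs above.
base = {"accuracy@1": 0.5067114093959731, "accuracy@10": 0.8288590604026845}
finetuned = {"accuracy@1": 0.6442953020134228, "accuracy@10": 0.9060402684563759}

def improvement(before: float, after: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = (after - before) * 100
    relative_percent = (after - before) / before * 100
    return absolute_points, relative_percent

for metric in base:
    points, percent = improvement(base[metric], finetuned[metric])
    print(f"{metric}: +{points:.1f} percentage points ({percent:.1f}% relative)")
```

Stating which of the two is meant avoids ambiguity, since the relative improvement at k=1 is roughly twice the percentage-point gain.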