## Finetuning Embeddings Pipeline

This notebook runs only the finetuning portion of the AIE4 Midterm assignment.

NOTE: All other midterm steps (RAG and RAGAS) are run in a separate notebook.

*I push the finetuned model from here to the Hugging Face model hub and use it in the other notebook for the remaining midterm steps.*


> Looking for the Main Midterm Assignment Notebook?
> ----
> 
> If you are looking for the main AIE4 midterm assignment notebook, please [click here](vc_aie4_midterm_rag_and_ragas_pipelines.ipynb).

### Details of finetuning
1.  Model: `Snowflake/snowflake-arctic-embed-m`, a `SentenceTransformer` model with approximately 110 million parameters

2.  I used `random.shuffle` to randomize the order of the chunks in the corpus before assigning them to the train, validation and test sets, so that the three subsets are distributionally similar.

3.  Train size: 300, validation size: 50, test size: 50

4.  For each chunk (context) in train/val/test, I generated two questions (`n_questions=2`) with the OpenAI chat model `gpt-4o-mini`

5.  Loss function: `MultipleNegativesRankingLoss` wrapped in `MatryoshkaLoss`

6.  Batch size: 16 during training
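
The shuffle-then-split step described in items 2 and 3 can be sketched as follows. This is a minimal illustration, not the code in `myutils.finetuning`: the helper name `shuffle_and_split` and the use of a local `random.Random` instance are assumptions, while the split sizes and the seed mirror the values used in this notebook.

```python
import random


def shuffle_and_split(chunks, sizes=(300, 50, 50), seed=69):
    """Shuffle chunks with a fixed seed, then slice into train/val/test.

    Hypothetical helper, for illustration only.
    """
    n_train, n_val, n_test = sizes
    if n_train + n_val + n_test > len(chunks):
        raise ValueError("not enough chunks for the requested split sizes")

    shuffled = list(chunks)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded, so the split is reproducible

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


# stand-in corpus: integers instead of real text chunks
corpus = list(range(500))
train, val, test = shuffle_and_split(corpus)
print(len(train), len(val), len(test))
```

Because the shuffle happens before slicing, each subset draws from the whole corpus rather than from one contiguous stretch of a single document.
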

### Why I Chose to Finetune the `snowflake-arctic-embed-m` Model

On the AIE4 midterm, we are asked to state why we chose the particular embedding model that we did for finetuning.  These are the criteria I used:

1.  PARSIMONY: The model has approximately 110 million parameters, so it can feasibly be finetuned on consumer-grade GPU and memory resources. Training runs quickly in a Colab notebook with a GPU runtime, for instance. I used an A100 to speed up the process, but training should also work on other GPUs such as the T4.

2.  PERFORMANCE: Despite having far fewer parameters than larger embedding models, the model holds its own on benchmark tasks.

3.  CONVENIENT ACCESS: The model is available on Hugging Face, so I could use the model hub along with the libraries that support this type of model (`SentenceTransformer`), including their training and finetuning capabilities.

4.  NO-BRAINER REASON: It is an open-source model, so we have access to all the parameters and configurations needed for finetuning.

### 1. Install Packages

In [1]:
!pip install -qU langchain_openai langchain_huggingface langchain_core==0.2.38 langchain langchain_community langchain-text-splitters langchain_experimental langchain_qdrant

In [2]:
!pip install -qU ragas pymupdf sentence_transformers datasets

### 2. Imports

In [3]:
import os
import getpass

In [4]:
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


In [5]:
os.environ['HF_TOKEN'] = getpass.getpass("Enter your <write-permissioned> Huggingface Token here:")

Enter your <write-permissioned> Huggingface Token here:··········


In [6]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
# import os
# from dotenv import load_dotenv

# load_dotenv()
# openai_api_key = os.environ.get("OPENAI_API_KEY")

In [7]:

from sentence_transformers import SentenceTransformer
from langchain_huggingface import HuggingFaceEmbeddings


### Special Note: Importing My Modules

The modules imported below do the heavy lifting for the finetuning process. They live in the `myutils` package, which the next cell imports from the current directory.

In [8]:
import sys
sys.path.append('./')

from myutils.rag_pipeline_utils import load_all_pdfs
from myutils.rag_pipeline_utils import SimpleTextSplitter
from myutils.finetuning import PrepareDataForFinetuning, FineTuneModelAndEvaluateRetriever

### 3. Pointers to PDF files

In [9]:
pdf_file_paths = [
    './data/docs_for_rag/Blueprint-for-an-AI-Bill-of-Rights.pdf',
    './data/docs_for_rag/NIST.AI.600-1.pdf'
]

### 4. The main class that leverages my modules to run the finetuning process

In [10]:
class FineTuneEmbeddingModel:
    def __init__(self,
                 pdf_file_paths=pdf_file_paths,
                 chunk_size=1000,
                 chunk_overlap=300,
                 train_val_test_size=[10, 5, 5],
                 train_val_test_split_type='random',
                 qa_chat_model_name='gpt-4o-mini',
                 random_seed=69,
                 n_questions=2,
                 batch_size=5,
                 base_model_id='Snowflake/snowflake-arctic-embed-m',
                 matryoshka_dimensions=[768, 512, 256, 128, 64],
                 number_of_training_epochs=5,
                 finetuned_model_output_path='finetuned_arctic',
                 evaluation_steps=50):

        # parameters to load docs and chunk them
        self.pdf_file_paths = pdf_file_paths
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

        # parameters that govern size of train, val and test
        # Also, if the flag below is set to 'random' then
        # the chunks are randomly assigned to train/val/test
        self.train_val_test_size = train_val_test_size
        self.train_val_test_split_type = train_val_test_split_type
        self.random_seed = random_seed

        # qa chat model to generate questions from contexts
        self.qa_chat_model_name = qa_chat_model_name

        # number of questions per context
        self.n_questions = n_questions

        # batch size for finetuning
        self.batch_size = batch_size

        # name of base model from HF - this model will be finetuned
        self.base_model_id = base_model_id
        # embedding dimensions used for the Matryoshka objective
        self.matryoshka_dimensions = matryoshka_dimensions

        # number of training epochs
        self.number_of_training_epochs = number_of_training_epochs

        # local path to finetuned model name
        self.finetuned_model_output_path = finetuned_model_output_path

        # number of steps between running eval on val dataset
        self.evaluation_steps = evaluation_steps

    def load_and_chunk_docs(self):
        """
        Load the PDF files and chunk them with `SimpleTextSplitter`
        (my wrapper around LangChain's RecursiveCharacterTextSplitter).
        """
        self.documents = load_all_pdfs(self.pdf_file_paths)
        # instantiate the baseline text splitter
        # NOTE: `SimpleTextSplitter` is my wrapper around LangChain's
        # RecursiveCharacterTextSplitter (see the module for the code)
        baseline_text_splitter = SimpleTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            documents=self.documents
        )

        # split text for baseline case
        self.baseline_text_splits = baseline_text_splitter.split_text()
        return self

    def prep_data_for_finetuning(self):
        """
        Prepare data for finetuning
            split into train, val and test sub-groups
            Generate questions for contexts
            Load train data into data loader
        """
        self.pdft = PrepareDataForFinetuning(
            all_splits=self.baseline_text_splits,
            train_val_test_size=self.train_val_test_size,
            train_val_test_split_type=self.train_val_test_split_type,
            random_seed=self.random_seed,
            qa_chat_model_name=self.qa_chat_model_name,
            n_questions=self.n_questions,
            batch_size=self.batch_size
        )

        self.pdft.run_all_prep_data()
        return self

    def finetune_and_eval_retriever(self):
        """
        Run the finetuning steps and evaluate the results
        using the simple hit rate metric

        Note the final step where the finetuned SentenceTransformer model is loaded
        into an instance object
        """
        self.evr = FineTuneModelAndEvaluateRetriever(
            train_data=self.pdft.train_dataset,
            val_data=self.pdft.val_dataset,
            test_data=self.pdft.test_dataset,
            batch_size=self.batch_size,
            base_model_id=self.base_model_id,
            matryoshka_dimensions=self.matryoshka_dimensions,
            number_of_training_epochs=self.number_of_training_epochs,
            finetuned_model_output_path=self.finetuned_model_output_path,
            evaluation_steps=self.evaluation_steps
        )

        self.evr.run_steps_to_finetune_model()

        # load finetuned SentenceTransformer model
        self.arctic_finetuned_model = SentenceTransformer(self.finetuned_model_output_path)
        return self

    def run_finetuning_steps(self):
        """
        Run all the steps to finetune model
        """
        self.load_and_chunk_docs()
        self.prep_data_for_finetuning()
        self.finetune_and_eval_retriever()
        return self
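
The loss setup named in the details section (`MultipleNegativesRankingLoss` wrapped in `MatryoshkaLoss`) can be illustrated in plain Python. This is a conceptual sketch rather than the library implementation: the function names are illustrative, the similarity scale of 20 and the equal weighting of every dimension are assumptions about the defaults, and the toy vectors stand in for real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(queries, contexts, scale=20.0):
    """In-batch-negatives loss: query i should score highest against context i."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, c) for c in contexts]
        # cross-entropy with the matching context (index i) as the target,
        # computed with the log-sum-exp trick for numerical stability
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)


def matryoshka_loss(queries, contexts, dims):
    """Sum the base loss over embeddings truncated to each prefix length."""
    return sum(
        mnr_loss([q[:d] for q in queries], [c[:d] for c in contexts])
        for d in dims
    )


# toy 4-dimensional embeddings for a batch of two (question, context) pairs
queries = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
contexts = [[0.9, 0.1, 0.0, 0.2], [0.1, 0.8, 0.3, 0.0]]

aligned = matryoshka_loss(queries, contexts, dims=[4, 2])
swapped = matryoshka_loss(queries, contexts[::-1], dims=[4, 2])
print(f"aligned pairs: {aligned:.4f}, swapped pairs: {swapped:.4f}")
```

The loss is lower when each question is paired with its own context, and the truncation to shorter prefixes is what pushes the leading dimensions to carry most of the signal, so the finetuned embeddings can later be shortened to any of the `matryoshka_dimensions`.
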

### 5. Instantiate the class and Run the Finetuning

In [11]:
# instantiate the class object for finetuning
ftem = FineTuneEmbeddingModel(train_val_test_size=[300, 50, 50],
                              batch_size=16)
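
As a sanity check on the training log produced by the next cell, the expected number of training steps can be worked out from these settings. The sketch assumes that every generated question becomes exactly one training pair and that the last partial batch of each epoch is kept; neither is confirmed by the code shown in this notebook. The values for `n_questions`, `number_of_training_epochs` and `evaluation_steps` are the class defaults.

```python
import math

train_chunks = 300        # train size passed to FineTuneEmbeddingModel
n_questions = 2           # questions generated per chunk (class default)
batch_size = 16
epochs = 5                # number_of_training_epochs (class default)
evaluation_steps = 50     # class default

# assumption: one (question, chunk) training pair per generated question
train_pairs = train_chunks * n_questions
steps_per_epoch = math.ceil(train_pairs / batch_size)
total_steps = steps_per_epoch * epochs

# evaluation points: the end of each epoch plus every `evaluation_steps` steps
epoch_ends = {steps_per_epoch * e for e in range(1, epochs + 1)}
periodic = set(range(evaluation_steps, total_steps + 1, evaluation_steps))
eval_points = sorted(epoch_ends | periodic)

print(train_pairs, steps_per_epoch, total_steps)
print(eval_points)
```

Under these assumptions there are 38 steps per epoch and 190 steps in total, with evaluations at steps 38, 50, 76, 100, 114, 150, 152 and 190, which matches the `Step` column in the training output.
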

In [12]:
# Run all steps to finetune model
ftem.run_finetuning_steps()

loaded ./data/docs_for_rag/Blueprint-for-an-AI-Bill-of-Rights.pdf with 73 pages 
loaded ./data/docs_for_rag/NIST.AI.600-1.pdf with 64 pages 
loaded all files: total number of pages: 137 


100%|██████████| 300/300 [04:25<00:00,  1.13it/s]
100%|██████████| 50/50 [00:52<00:00,  1.05s/it]
100%|██████████| 207/207 [02:55<00:00,  1.18it/s]
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Cosine metrics:

| Step | Training Loss | Validation Loss | Cosine Accuracy@1 | Cosine Accuracy@3 | Cosine Accuracy@5 | Cosine Accuracy@10 | Cosine Precision@1 | Cosine Precision@3 | Cosine Precision@5 | Cosine Precision@10 | Cosine Recall@1 | Cosine Recall@3 | Cosine Recall@5 | Cosine Recall@10 | Cosine Ndcg@10 | Cosine Mrr@10 | Cosine Map@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | No log | No log | 0.94 | 0.98 | 1.0 | 1.0 | 0.94 | 0.326667 | 0.2 | 0.1 | 0.94 | 0.98 | 1.0 | 1.0 | 0.973851 | 0.965 | 0.965 |
| 50 | No log | No log | 0.96 | 1.0 | 1.0 | 1.0 | 0.96 | 0.333333 | 0.2 | 0.1 | 0.96 | 1.0 | 1.0 | 1.0 | 0.983928 | 0.978333 | 0.978333 |
| 76 | No log | No log | 0.96 | 1.0 | 1.0 | 1.0 | 0.96 | 0.333333 | 0.2 | 0.1 | 0.96 | 1.0 | 1.0 | 1.0 | 0.982619 | 0.976667 | 0.976667 |
| 100 | No log | No log | 0.97 | 1.0 | 1.0 | 1.0 | 0.97 | 0.333333 | 0.2 | 0.1 | 0.97 | 1.0 | 1.0 | 1.0 | 0.987619 | 0.983333 | 0.983333 |
| 114 | No log | No log | 0.98 | 1.0 | 1.0 | 1.0 | 0.98 | 0.333333 | 0.2 | 0.1 | 0.98 | 1.0 | 1.0 | 1.0 | 0.991309 | 0.988333 | 0.988333 |
| 150 | No log | No log | 0.98 | 1.0 | 1.0 | 1.0 | 0.98 | 0.333333 | 0.2 | 0.1 | 0.98 | 1.0 | 1.0 | 1.0 | 0.991309 | 0.988333 | 0.988333 |
| 152 | No log | No log | 0.98 | 1.0 | 1.0 | 1.0 | 0.98 | 0.333333 | 0.2 | 0.1 | 0.98 | 1.0 | 1.0 | 1.0 | 0.991309 | 0.988333 | 0.988333 |
| 190 | No log | No log | 0.98 | 1.0 | 1.0 | 1.0 | 0.98 | 0.333333 | 0.2 | 0.1 | 0.98 | 1.0 | 1.0 | 1.0 | 0.991309 | 0.988333 | 0.988333 |

Dot-product metrics:

| Step | Dot Accuracy@1 | Dot Accuracy@3 | Dot Accuracy@5 | Dot Accuracy@10 | Dot Precision@1 | Dot Precision@3 | Dot Precision@5 | Dot Precision@10 | Dot Recall@1 | Dot Recall@3 | Dot Recall@5 | Dot Recall@10 | Dot Ndcg@10 | Dot Mrr@10 | Dot Map@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 0.94 | 0.98 | 1.0 | 1.0 | 0.94 | 0.326667 | 0.2 | 0.1 | 0.94 | 0.98 | 1.0 | 1.0 | 0.973851 | 0.965 | 0.965 |
| 50 | 0.96 | 1.0 | 1.0 | 1.0 | 0.96 | 0.333333 | 0.2 | 0.1 | 0.96 | 1.0 | 1.0 | 1.0 | 0.983928 | 0.978333 | 0.978333 |
| 76 | 0.96 | 1.0 | 1.0 | 1.0 | 0.96 | 0.333333 | 0.2 | 0.1 | 0.96 | 1.0 | 1.0 | 1.0 | 0.982619 | 0.976667 | 0.976667 |
| 100 | 0.97 | 1.0 | 1.0 | 1.0 | 0.97 | 0.333333 | 0.2 | 0.1 | 0.97 | 1.0 | 1.0 | 1.0 | 0.987619 | 0.983333 | 0.983333 |
| 114 | 0.98 | 1.0 | 1.0 | 1.0 | 0.98 | 0.333333 | 0.2 | 0.1 | 0.98 | 1.0 | 1.0 | 1.0 | 0.991309 | 0.988333 | 0.988333 |
| 150 | 0.98 | 1.0 | 1.0 | 1.0 | 0.98 | 0.333333 | 0.2 | 0.1 | 0.98 | 1.0 | 1.0 | 1.0 | 0.991309 | 0.988333 | 0.988333 |
| 152 | 0.98 | 1.0 | 1.0 | 1.0 | 0.98 | 0.333333 | 0.2 | 0.1 | 0.98 | 1.0 | 1.0 | 1.0 | 0.991309 | 0.988333 | 0.988333 |
| 190 | 0.98 | 1.0 | 1.0 | 1.0 | 0.98 | 0.333333 | 0.2 | 0.1 | 0.98 | 1.0 | 1.0 | 1.0 | 0.991309 | 0.988333 | 0.988333 |
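
To make the columns of the evaluation output above concrete, the sketch below computes accuracy@k, precision@k, recall@k and MRR@k for queries that each have exactly one relevant chunk. That single-relevant-chunk setup is my assumption, based on each question being generated from one chunk. The function names and the toy rankings are illustrative, not the evaluator used during training.

```python
def retrieval_metrics(ranked_ids, relevant_id, k):
    """Metrics for one query that has a single relevant document."""
    top_k = ranked_ids[:k]
    hit = relevant_id in top_k
    accuracy = 1.0 if hit else 0.0           # relevant doc appears in the top k
    precision = (1.0 if hit else 0.0) / k    # relevant docs in top k, divided by k
    recall = 1.0 if hit else 0.0             # only one relevant doc exists
    reciprocal_rank = 1.0 / (top_k.index(relevant_id) + 1) if hit else 0.0
    return accuracy, precision, recall, reciprocal_rank


def mean_metrics(queries, k):
    """Average each metric over all (ranking, relevant_id) pairs."""
    rows = [retrieval_metrics(ranked, rel, k) for ranked, rel in queries]
    return tuple(sum(col) / len(rows) for col in zip(*rows))


# toy data: (ranked chunk ids returned, id of the chunk the question came from)
queries = [
    (["c1", "c2", "c3"], "c1"),   # relevant chunk at rank 1
    (["c5", "c4", "c6"], "c4"),   # relevant chunk at rank 2
    (["c7", "c8", "c9"], "c9"),   # relevant chunk at rank 3
    (["c2", "c3", "c1"], "c0"),   # relevant chunk not retrieved
]

accuracy, precision, recall, mrr = mean_metrics(queries, k=3)
print(accuracy, precision, recall, mrr)
```

With one relevant chunk per query, precision@k cannot exceed 1/k, which is why a retriever that always finds the right chunk shows 0.333333 for Precision@3, 0.2 for Precision@5 and 0.1 for Precision@10.
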


Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


<__main__.FineTuneEmbeddingModel at 0x7d1a6f887490>

### 6. Push Finetuned Model to HF Model Hub

In [14]:
ftem.arctic_finetuned_model.push_to_hub("vincha77/finetuned_arctic")

'https://huggingface.co/vincha77/finetuned_arctic/commit/d45999f3caa43e1bf4770307caa501c719afd8e8'

### 7. Pull down Finetuned Model From HF Hub (as a check)

In [15]:
model_id = "vincha77/finetuned_arctic"
arctic_finetuned_model = SentenceTransformer(model_id)

In [16]:
arctic_finetuned_embeddings = HuggingFaceEmbeddings(model_name=model_id)
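
A quick smoke test for the embeddings object is to embed a question and a few chunks, then rank the chunks by cosine similarity. The sketch below relies only on the `embed_query` and `embed_documents` methods that LangChain embedding classes expose. `ToyEmbeddings` is a hypothetical stand-in so the example runs without downloading the model, and `rank_chunks` is an illustrative helper name.

```python
import math


class ToyEmbeddings:
    """Hypothetical stand-in: word counts over a tiny fixed vocabulary."""

    vocab = ["risk", "privacy", "bias", "data"]

    def _embed(self, text):
        words = text.lower().split()
        return [float(words.count(term)) for term in self.vocab]

    def embed_query(self, text):
        return self._embed(text)

    def embed_documents(self, texts):
        return [self._embed(t) for t in texts]


def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_chunks(embedder, question, chunks):
    """Return (score, chunk) pairs sorted from most to least similar."""
    q_vec = embedder.embed_query(question)
    doc_vecs = embedder.embed_documents(chunks)
    scored = [(cosine(q_vec, d), c) for d, c in zip(doc_vecs, chunks)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


chunks = [
    "bias in training data",
    "privacy risk for personal data",
    "unrelated text",
]
ranking = rank_chunks(ToyEmbeddings(), "what is the privacy risk", chunks)
for score, chunk in ranking:
    print(f"{score:.3f}  {chunk}")
```

In this notebook, passing `arctic_finetuned_embeddings` in place of `ToyEmbeddings()` should exercise the finetuned model in the same way.
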