<a href="https://colab.research.google.com/github/americanthinker/vectorsearch-applications/blob/main/notebooks/6-EmbeddingModel_FineTuning.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Run these cells if on Colab

In [24]:
# !curl -o preprocessing.py https://raw.githubusercontent.com/americanthinker/vectorsearch-applications/main/src/preprocessor/preprocessing.py

In [None]:
# !curl -o qa_training_triplets.json https://raw.githubusercontent.com/americanthinker/vectorsearch-applications/main/data/qa_training_triplets.json

In [34]:
# !pip install sentence-transformers loguru --quiet

# Fine-Tuning a SentenceTransformers Embedding Model
***

### Fine-tune High-Level Walkthrough

1. Get baseline retrieval scores (vector Hit Rate, MRR, and total misses) using the out-of-the-box baseline model.  Without a measured baseline you cannot tell objectively whether fine-tuning had any effect.  This may seem obvious, but practitioners sometimes jump straight into model improvement without first considering their starting point.
2. Collect a training dataset.  This step has already been completed for you, courtesy of `gpt-3.5-turbo`.  The training dataset consists of triplets in the following format:
   - **Anchor**: The context i.e. a random text chunk created by the initial baseline model
   - **Positive**: A query generated by the LLM that can be answered by the anchor context.
   - **Hard Negative**: A query generated by the LLM that is semantically similar to the positive, but cannot be answered by the anchor context.
   These triplets were generated using a prompt written specifically for the Huberman Lab corpus, so the training data is, for the most part, high quality and contextually relevant.
3. Train the model and set a path where the new model will reside.  I created a `models/` directory in the course repo, and included the directory in the `.gitignore` file so that models aren't being pushed with every commit.
4. Create a new dataset (as you learned in Notebook 1) but this time create the embeddings using the new fine-tuned model.
5. Create a new index on Weaviate using the new dataset you just created.
6. Run the `retrieval_evaluation` function again, this time instantiating your Weaviate client with the new fine-tuned model while holding all other parameters constant (i.e. don't change any other parameter from the baseline run).
7. Compare the fine-tuned retrieval results to the baseline results 🥳
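
The three baseline metrics named in step 1 can be sketched in plain Python. This is a minimal illustration and not the course's `retrieval_evaluation` function: the name `retrieval_metrics`, the input shapes (a ranked list of doc ids per query and one expected doc id per query), and the toy ids below are all assumptions made for the example.

```python
from typing import Dict, List


def retrieval_metrics(results: Dict[str, List[str]], gold: Dict[str, str]) -> Dict[str, float]:
    """Compute hit rate, MRR, and total misses (hypothetical helper).

    results: query -> ranked list of retrieved doc ids (best match first)
    gold:    query -> the single doc id that should be retrieved
    """
    hits = 0
    reciprocal_ranks = []
    for query, expected_id in gold.items():
        ranked = results.get(query, [])
        if expected_id in ranked:
            hits += 1
            # Rank is 1-based, so the first position contributes 1.0
            reciprocal_ranks.append(1.0 / (ranked.index(expected_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    total = len(gold)
    return {
        "hit_rate": hits / total if total else 0.0,
        "mrr": sum(reciprocal_ranks) / total if total else 0.0,
        "total_misses": total - hits,
    }


# Toy data (made up): q1 is found at rank 1, q2 at rank 2, q3 is missed
toy_gold = {"q1": "doc_a", "q2": "doc_b", "q3": "doc_c"}
toy_results = {
    "q1": ["doc_a", "doc_x"],
    "q2": ["doc_x", "doc_b"],
    "q3": ["doc_x", "doc_y"],
}
print(retrieval_metrics(toy_results, toy_gold))
```

Running the same computation for the baseline and the fine-tuned model, with every other setting unchanged, is what makes the comparison in step 7 meaningful.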

## Load Model


In [26]:
import sys
sys.path.append('../')
try:
  from src.preprocessor.preprocessing import FileIO
except ModuleNotFoundError:
  from preprocessing import FileIO

from torch import cuda 
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses, InputExample, models

### Execute Model Loading func

In [2]:
def load_pretrained_model(model_name: str='sentence-transformers/all-MiniLM-L6-v2'):
    '''
    Loads sentence transformer modules and returns a pretrained 
    model for finetuning. 
    '''
    word_embedding_model = models.Transformer(model_name)
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
    return model

In [1]:
model = load_pretrained_model()
model.device 

## Prep Data


### Import Training Dataset

In [27]:
# Path depends on whether you are running locally or on Colab (Colab path shown in the trailing comment)
data_path  = '../data/qa_training_triplets.json'  # '/content/qa_training_triplets.json'
data = FileIO.load_json(data_path)
len(data)

500
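
Before building training examples, it can help to confirm that every record really has the three fields used later in this notebook (`anchor`, `positive`, `hard_negative`). The check below is a small sketch; the helper name `find_invalid_triplets` and the toy records are assumptions for illustration, not part of the course code.

```python
from typing import Dict, List

# Keys taken from the triplet format used when building InputExamples
REQUIRED_KEYS = ("anchor", "positive", "hard_negative")


def find_invalid_triplets(samples: List[Dict[str, str]]) -> List[int]:
    """Return indices of samples with a missing, empty, or non-string field."""
    bad = []
    for i, sample in enumerate(samples):
        for key in REQUIRED_KEYS:
            value = sample.get(key)
            if not isinstance(value, str) or not value.strip():
                bad.append(i)
                break  # one problem is enough to flag the sample
    return bad


# Toy records (made up): index 1 lacks a hard negative, index 2 has an empty positive
toy_samples = [
    {"anchor": "some context", "positive": "answerable query", "hard_negative": "similar query"},
    {"anchor": "some context", "positive": "answerable query"},
    {"anchor": "some context", "positive": "   ", "hard_negative": "similar query"},
]
print(find_invalid_triplets(toy_samples))
```

In the notebook you would call `find_invalid_triplets(data)` and expect an empty list before moving on.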

#### Peek at the data

In [14]:
# data[0]

### Build list of InputExamples & Create Dataloader

In [29]:
train_examples = [InputExample(texts=[sample['anchor'],
                                      sample['positive'],
                                      sample['hard_negative']
                                     ]) for sample in data]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

#### Training example peek

In [13]:
# train_examples[0].__dict__

## Set Loss Function, Epochs, and warm-up


In [50]:
num_epochs = 3
train_loss = losses.MultipleNegativesRankingLoss(model=model)
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)  # warm up over the first 10% of total training steps
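
To make the warm-up arithmetic concrete, here is the same calculation worked through with the numbers used in this notebook (500 triplets, batch size 32, 3 epochs). It assumes the `DataLoader` keeps the final partial batch, which is its default behavior, so `len(train_dataloader)` equals the rounded-up batch count.

```python
import math

num_samples = 500   # size of the training set loaded above
batch_size = 32     # batch size passed to the DataLoader
num_epochs = 3

# With the default drop_last=False, the last partial batch is kept
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(total_steps * 0.1)  # int() truncates toward zero

print(steps_per_epoch, total_steps, warmup_steps)  # 16 48 4
```

The 16 steps per epoch match the `0/16` iteration count shown in the training progress bars.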

## Train model 

In [31]:
model.device

device(type='cpu')

In [51]:
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=num_epochs,
          warmup_steps=warmup_steps)

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration:   0%|          | 0/16 [00:00<?, ?it/s]

Iteration:   0%|          | 0/16 [00:00<?, ?it/s]

Iteration:   0%|          | 0/16 [00:00<?, ?it/s]
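
For intuition about what is being optimized during training, the idea behind `MultipleNegativesRankingLoss` can be sketched in plain Python. This is a simplified illustration and not the library's implementation: for each anchor, every positive and hard negative in the batch is a candidate, and the loss is the cross-entropy of picking the anchor's own positive from scaled cosine similarities. The function names and the 2-d toy vectors are assumptions; the scale of 20 mirrors the library's default as far as I know.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def in_batch_ranking_loss(anchors: Sequence[Vector],
                          positives: Sequence[Vector],
                          hard_negatives: Sequence[Vector],
                          scale: float = 20.0) -> float:
    """Mean cross-entropy where anchor i must pick positive i among all candidates."""
    # Candidates: all positives in the batch, followed by all hard negatives
    candidates = list(positives) + list(hard_negatives)
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_sum_exp - logits[i])  # -log softmax at the correct index
    return sum(losses) / len(losses)


# Toy 2-d embeddings (made up): aligned pairs vs. deliberately swapped pairs
toy_anchors = [[1.0, 0.0], [0.0, 1.0]]
toy_negatives = [[0.7, 0.7], [0.7, 0.7]]
aligned_loss = in_batch_ranking_loss(toy_anchors, [[1.0, 0.0], [0.0, 1.0]], toy_negatives)
swapped_loss = in_batch_ranking_loss(toy_anchors, [[0.0, 1.0], [1.0, 0.0]], toy_negatives)
print(aligned_loss < swapped_loss)
```

Because the other positives in a batch act as extra negatives, larger batch sizes generally give this kind of loss more contrast to learn from.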

### Save model
---
Similar to how you have labeled your datasets and collections, it's a good idea to stick to a naming convention that lets you keep track of the fine-tuned models you create.  I would suggest something like the following convention:  

`short-hand model name`-`finetuned`-`dataset size`

For example, a finetuned version of `all-MiniLM` could look like this:  
`allminilm-finetuned-500`

If you want to get even more granular you could add other unique identifiers for experimentation such as adding number of epochs:  
`allminilm-finetuned-500-2` --> an `all-MiniLM` model finetuned on 500 samples over 2 epochs

**I would also recommend creating a dedicated `models` folder in your top-level directory.  The repo `.gitignore` file already includes the `models` folder to avoid pushing large files to GitHub.  After you've created the folder you should be able to access the model via a path similar to this one:** 

`models/allminilm-finetuned-500`
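
If you run many experiments, a tiny helper keeps the names consistent. The sketch below simply encodes the convention described above; the function name `finetuned_model_name` is hypothetical and not part of the course repo.

```python
from typing import Optional


def finetuned_model_name(short_name: str, dataset_size: int, epochs: Optional[int] = None) -> str:
    """Build '<short name>-finetuned-<dataset size>[-<epochs>]'."""
    parts = [short_name, "finetuned", str(dataset_size)]
    if epochs is not None:
        parts.append(str(epochs))  # optional extra identifier for experiments
    return "-".join(parts)


name = finetuned_model_name("allminilm", 500)
name_with_epochs = finetuned_model_name("allminilm", 500, epochs=2)
model_dir = f"models/{name}"  # relative to the top-level directory
print(name, name_with_epochs, model_dir)
```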

In [52]:
model.save(path='local path', model_name='name of your model')

### COLAB-specific saving and downloading
---
If you are running this notebook on Google Colab then I recommend the following steps. 

#### Save the finetuned model in current dir

In [32]:
#define your path
model_path = './allminilm-finetuned-256'
model.save(model_path, model_name='name of your model')

#### Zip the model folder into a single file

In [None]:
#ensure the paths match
!zip -r /content/model.zip /content/allminilm-finetuned-256/

Once you have zipped the model, you can download it locally as a single zipped file by right-clicking on the file and selecting "Download".
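
If you prefer to stay in Python rather than shelling out to `zip`, the standard library can produce an equivalent archive. This is an optional alternative sketch; the helper name `zip_model_dir` is an assumption, and the paths in the usage comment are the Colab paths used above.

```python
import shutil
from pathlib import Path


def zip_model_dir(model_dir: str, archive_base: str) -> str:
    """Zip model_dir into '<archive_base>.zip' and return the archive path."""
    model_path = Path(model_dir)
    if not model_path.is_dir():
        raise FileNotFoundError(f"Model directory not found: {model_dir}")
    # root_dir/base_dir keep the top-level folder name inside the archive
    return shutil.make_archive(
        archive_base,
        "zip",
        root_dir=str(model_path.parent),
        base_dir=model_path.name,
    )


# Example usage on Colab (ensure the paths match your saved model):
# zip_model_dir('/content/allminilm-finetuned-256', '/content/model')
```

The function raises early if the folder does not exist, which catches the common mistake of a mismatch between the save path and the zip path.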

### Model Evaluation
---
Fine-tuning is just the start!  You still have to create a new dataset using the fine-tuned model, index that data on Weaviate, and then evaluate its performance.  This is why having a solid dataset creation and indexing pipeline is key, especially if you plan on running multiple experiments to optimize your results.  Follow this recipe:  
1. Create new dataset (from Notebook 1)
2. Index that dataset and give the collection an easily identifiable name, e.g. `Huberman_minilm_finetuned_256` (from Notebook 2)
3. Run the `execute_evaluation` function (from Notebook 4)

Assuming you are in the `notebooks` folder when performing the new evaluation and you have created a `models` folder in the top-level directory, the following code snippet will load the Weaviate client with the fine-tuned model and ensure that you are hitting the right collection for evaluation:

In [None]:
from src.database.database_utils import get_weaviate_client

model_path = '../models/minilm-finetuned-500/'
client = get_weaviate_client(model_name_or_path=model_path)
collection_name = 'Huberman_minilm_finetuned_256'