# PART 1: Synthetic Dataset Generation for Embedding Fine-Tuning Tasks

In this Jupyter notebook, we demonstrate how to use a Python script to generate a synthetic dataset of (query, relevant document) pairs from a corpus of documents. The dataset can be used to fine-tune embedding models and improve performance in custom RAG and retrieval use cases. The script uses natural language processing (NLP) techniques and a language model to automate the creation of a dataset suitable for tasks such as question answering, search, and information retrieval.

## Setup

The notebook relies on components from our script for loading the corpus, generating queries, and saving the dataset. They are imported in the next section.


## Generate the Corpus

We begin by importing the relevant helper functions from the script and initializing `CorpusLoader` with a directory containing our PDF documents. The loader splits the source files into training and validation sets according to `val_ratio`, and each set is then loaded into its own corpus.

The corpus of text chunks is built with LlamaIndex, which loads the sample PDFs and parses them into plain-text chunks.


In [13]:
import sys

sys.path.append("../")
sys.path.append("../../")

In [6]:
from src.generate_fine_tune_embed_dataset import (
    CorpusLoader,
    QueryGenerator,
    SambaNovaEndpoint,
    LangChainLLM,
    OpenAI,
    save_dict_safely,
)
import random

In [None]:
data_directory = "./sample_data"
val_ratio = 0.2

corpus_loader = CorpusLoader(directory=data_directory, val_ratio=val_ratio)
train_corpus = corpus_loader.load_corpus(corpus_loader.train_files)
val_corpus = corpus_loader.load_corpus(corpus_loader.val_files)
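
To make the split-and-chunk step concrete, here is a minimal standard-library sketch of the idea. It is not the `CorpusLoader` implementation: `split_files`, `chunk_text`, `build_corpus`, the fixed-size character chunking, and the chunk ID format are all hypothetical stand-ins for what LlamaIndex does in the real script. The one detail taken from this notebook is the resulting shape, a dictionary that maps a chunk ID to its text.

```python
import random
from typing import Dict, List, Tuple


def split_files(
    files: List[str], val_ratio: float, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Shuffle file paths and hold out a val_ratio fraction for validation."""
    shuffled = files[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    # The first n_val files become validation data, the rest training data.
    return shuffled[n_val:], shuffled[:n_val]


def chunk_text(text: str, chunk_size: int = 200) -> List[str]:
    """Cut text into fixed-size character chunks (stand-in for a real parser)."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_corpus(documents: Dict[str, str], chunk_size: int = 200) -> Dict[str, str]:
    """Map a generated chunk ID to each plain-text chunk."""
    corpus = {}
    for name, text in documents.items():
        for idx, chunk in enumerate(chunk_text(text, chunk_size)):
            corpus[f"{name}-chunk-{idx}"] = chunk  # hypothetical ID format
    return corpus


files = [f"doc_{i}.pdf" for i in range(10)]
train_files, val_files = split_files(files, val_ratio=0.2)
toy_corpus = build_corpus({"doc_0": "a" * 450})
print(len(train_files), len(val_files), list(toy_corpus))
```

Splitting at the file level rather than the chunk level keeps chunks from the same document out of both sets, which is presumably why the loader exposes `train_files` and `val_files`.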

## Saving the Loaded Corpora

After loading the training and validation corpora, we save them to files for later use. This ensures we can easily reload the corpora without reprocessing the original documents.


In [None]:
train_corpus_output_path = "./data/train_corpus.json"
val_corpus_output_path = "./data/val_corpus.json"

corpus_loader.save_corpus(train_corpus, train_corpus_output_path)
corpus_loader.save_corpus(val_corpus, val_corpus_output_path)
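
Reloading a saved corpus could look like the sketch below. It assumes `save_corpus` writes a plain JSON object mapping chunk IDs to text, which the `.json` extension suggests but this notebook does not confirm. `load_corpus` is a hypothetical helper, and the demo writes its own toy file to a temporary directory rather than touching `./data`.

```python
import json
import os
import tempfile


def load_corpus(path: str) -> dict:
    """Read a corpus stored as a JSON object of chunk ID -> text."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Round-trip demo with a toy corpus in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "train_corpus.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"chunk-0": "First chunk of text."}, f)
    reloaded = load_corpus(demo_path)

print(reloaded)
```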

## Defining the Language Model (LLM)

To generate queries, we first define the language model (LLM). You can use a SambaNova endpoint, an OpenAI model, or a model from another LLM provider, depending on your requirements and access.


In [9]:
# Example LLM instantiation:
# For a SambaNova LLM:
# base_url="YOUR_BASE_URL"
# project_id="YOUR_PROJECT_ID"
# endpoint_id="YOUR_ENDPOINT_ID"
# api_key="YOUR_API_KEY"

base_url = "https://sjc3-demo2.sambanova.net"
project_id = "60774d44-3cc3-47eb-aa91-87fae2e8655e"
endpoint_id = "b0e414eb-4863-4a8c-9839-3c2dfa718ae5"
api_key = "e2a3bac7-c31c-4712-a408-bb4b64d92c41"

llm = SambaNovaEndpoint(
    base_url=base_url,
    project_id=project_id,
    endpoint_id=endpoint_id,
    api_key=api_key,
    model_kwargs={
        "do_sample": True,
        "temperature": 0.01,
        "max_tokens_to_generate": 512,
    },
)

# The SambaNova endpoint wrapper is a LangChain LLM, so wrap it with LangChainLLM
llm = LangChainLLM(llm=llm)


# For OpenAI (uncomment to use instead of the SambaNova endpoint):
# llm = OpenAI(model="gpt-3.5-turbo")
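
The cell above hard-codes the endpoint credentials. A common alternative is to read them from environment variables so they stay out of the notebook. The sketch below is one way to do that; the helper `read_sambanova_credentials` and the variable names such as `SAMBANOVA_API_KEY` are hypothetical, not names the script or SambaNova requires.

```python
import os
from typing import Mapping, Optional


def read_sambanova_credentials(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect endpoint settings from environment variables.

    Raises KeyError listing every missing variable instead of failing later
    with an opaque authentication error.
    """
    env = os.environ if env is None else env
    # Hypothetical variable names; rename them to match your own setup.
    names = {
        "base_url": "SAMBANOVA_BASE_URL",
        "project_id": "SAMBANOVA_PROJECT_ID",
        "endpoint_id": "SAMBANOVA_ENDPOINT_ID",
        "api_key": "SAMBANOVA_API_KEY",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in names.items()}


# Demo with an in-memory mapping instead of the real environment.
demo_env = {
    "SAMBANOVA_BASE_URL": "https://example.invalid",
    "SAMBANOVA_PROJECT_ID": "demo-project",
    "SAMBANOVA_ENDPOINT_ID": "demo-endpoint",
    "SAMBANOVA_API_KEY": "demo-key",
}
creds = read_sambanova_credentials(demo_env)
print(sorted(creds))
```

Because the returned keys match the keyword arguments used above, the result could be passed along as `SambaNovaEndpoint(**creds, model_kwargs=...)`.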

In [None]:
# Initialize the QueryGenerator with your language model
# Note: Ensure you have access to the LLM and its credentials
# Note: Depending on the size of your corpus & model inference time, this can take a long time!

query_generator = QueryGenerator(llm=llm)

train_queries, train_relevant_docs = query_generator.generate_queries(
    train_corpus, verbose=True
)
val_queries, val_relevant_docs = query_generator.generate_queries(
    val_corpus, verbose=True
)
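
The two return values are easier to reason about with a toy version of the loop. The sketch below illustrates the general technique rather than the `QueryGenerator` implementation: the prompt wording, the `num_questions_per_chunk` parameter, the UUID query IDs, and the `fake_llm` stub are all assumptions. The shapes match how the values are used later in this notebook: `queries` maps a query ID to its text, and `relevant_docs` maps a query ID to a list of chunk IDs.

```python
import uuid
from typing import Callable, Dict, List, Tuple


def generate_queries_sketch(
    corpus: Dict[str, str],
    ask_llm: Callable[[str], str],
    num_questions_per_chunk: int = 2,
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Ask an LLM for questions about each chunk and record the source chunk."""
    queries: Dict[str, str] = {}
    relevant_docs: Dict[str, List[str]] = {}
    for doc_id, text in corpus.items():
        prompt = (
            f"Context:\n{text}\n\n"
            f"Write {num_questions_per_chunk} questions that can be answered "
            "from the context alone, one per line."
        )
        response = ask_llm(prompt)
        questions = [line.strip() for line in response.splitlines() if line.strip()]
        for question in questions[:num_questions_per_chunk]:
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            # The chunk that produced the question is its relevant document.
            relevant_docs[query_id] = [doc_id]
    return queries, relevant_docs


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return "What is discussed here?\nWhy does it matter?"


demo_corpus = {"chunk-0": "Some text.", "chunk-1": "More text."}
demo_queries, demo_relevant = generate_queries_sketch(demo_corpus, fake_llm)
print(len(demo_queries), "queries generated")
```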

## Inspecting Generated Queries

It's essential to inspect the generated queries and their corresponding relevant documents to ensure the quality of our dataset.


In [None]:
# Helper function to display generated queries and documents
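def display_generated_data(queries, relevant_docs, corpus, num_samples=5):
    # Cap the sample size so a small dataset does not raise a ValueError
    num_samples = min(num_samples, len(queries))
    sample_queries = random.sample(list(queries.items()), num_samples)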

    for query_id, query in sample_queries:
        print(f"Query: {query}")
        doc_ids = relevant_docs[query_id]
        for doc_id in doc_ids:
            print(
                f"Relevant Document: {corpus[doc_id][:200]}..."
            )  # Display the first 200 characters
        print("\n")


display_generated_data(train_queries, train_relevant_docs, train_corpus)
display_generated_data(val_queries, val_relevant_docs, val_corpus)
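
Reading a few samples catches poor questions, but structural problems are easier to find programmatically. The sketch below is a hypothetical check, not part of the script: it flags empty queries, queries with no relevant documents, and document IDs that are missing from the corpus.

```python
from typing import Dict, List


def find_dataset_problems(
    queries: Dict[str, str],
    relevant_docs: Dict[str, List[str]],
    corpus: Dict[str, str],
) -> List[str]:
    """Return a human-readable list of structural issues in the dataset."""
    problems = []
    for query_id, query in queries.items():
        if not query.strip():
            problems.append(f"{query_id}: empty query")
        doc_ids = relevant_docs.get(query_id, [])
        if not doc_ids:
            problems.append(f"{query_id}: no relevant documents")
        for doc_id in doc_ids:
            if doc_id not in corpus:
                problems.append(f"{query_id}: unknown document {doc_id}")
    return problems


# Toy data with one good query and two broken ones.
check_corpus = {"chunk-0": "Some text."}
check_queries = {"q1": "What is this about?", "q2": "   ", "q3": "Where is it?"}
check_relevant = {"q1": ["chunk-0"], "q2": ["chunk-0"], "q3": ["chunk-9"]}
issues = find_dataset_problems(check_queries, check_relevant, check_corpus)
print("\n".join(issues))
```

On the real data this would be called as `find_dataset_problems(train_queries, train_relevant_docs, train_corpus)`, and an empty list means no structural issues were found.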

## Saving the Dataset

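Finally, we save the generated dataset with `save_dict_safely`. Each split is written to its own JSON file containing the queries, the corpus, and the query-to-document relevance mapping. The helper is intended to write the data without running into memory issues, so the files can be loaded later for fine-tuning.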


In [12]:
train_output_path = "./data/train_dataset.json"
val_output_path = "./data/val_dataset.json"

save_dict_safely(
    {
        "queries": train_queries,
        "corpus": train_corpus,
        "relevant_docs": train_relevant_docs,
    },
    train_output_path,
)
save_dict_safely(
    {"queries": val_queries, "corpus": val_corpus, "relevant_docs": val_relevant_docs},
    val_output_path,
)

2024-02-06 14:50:40,945 - INFO - Saving data to ./data/train_dataset.json...
Saving data: 100%|██████████| 3/3 [00:00<00:00, 263.20it/s]
2024-02-06 14:50:40,961 - INFO - Saving data to ./data/val_dataset.json...
Saving data: 100%|██████████| 3/3 [00:00<00:00, 973.98it/s]
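
How `save_dict_safely` works internally is not shown in this notebook. As an illustration of one common meaning of "safe" saving, the sketch below writes JSON atomically: the data goes to a temporary file in the target directory first and is then moved into place, so an interrupted run never leaves a half-written dataset. `save_json_atomically` is a hypothetical helper, not the script's function.

```python
import json
import os
import tempfile


def save_json_atomically(data: dict, path: str) -> None:
    """Write data to a temp file, then atomically move it to its final path."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
        # os.replace overwrites the destination in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file before re-raising.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demo in a temporary directory so no real dataset is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "data", "train_dataset.json")
    save_json_atomically(
        {"queries": {}, "corpus": {}, "relevant_docs": {}}, out_path
    )
    with open(out_path, "r", encoding="utf-8") as f:
        saved = json.load(f)
    leftovers = [n for n in os.listdir(os.path.dirname(out_path)) if n.endswith(".tmp")]

print(sorted(saved), leftovers)
```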


## Part 1 Conclusion

This notebook showed how to generate a synthetic dataset of (query, relevant document) pairs from a document corpus using Python. Automating query generation streamlines the creation of training data for tasks such as question answering and information retrieval. In Part 2 of the series, we fine-tune an embedding model on this dataset.


# PART 2: Fine-Tuning Embedding Models for Enhanced NLP Performance

In Part 2 of this series, we use the synthetic dataset generated in Part 1 to fine-tune an open-source embedding model with Sentence Transformers. The goal is to improve the model's performance on custom retrieval and question answering (QA) use cases by adapting the embeddings to our specific dataset.

## Setup

To begin, we will import necessary functions from our fine-tuning script. This includes components for loading the dataset, configuring the fine-tuning process, and executing the training.


In [None]:
from src.finetune_embedding_model import DatasetLoader, FineTuneModel

## Loading the Synthetic Dataset

Our first step is to load the synthetic dataset created in Part 1. This dataset includes pairs of queries and relevant documents that we will use to fine-tune our embedding model.


In [None]:
train_dataset_path = "./data/train_dataset.json"
val_dataset_path = "./data/val_dataset.json"

# Initialize the dataset loader
dataset_loader = DatasetLoader(train_dataset_path)
val_dataset_loader = DatasetLoader(val_dataset_path)
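
Fine-tuning a retrieval embedding model typically consumes (query, positive document) text pairs. The sketch below shows how such pairs can be derived from the dataset layout saved in Part 1 (`queries`, `corpus`, `relevant_docs`). Whether `DatasetLoader` does exactly this is not shown here, so treat `build_training_pairs` as a hypothetical illustration.

```python
from typing import Dict, List, Tuple


def build_training_pairs(dataset: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Join each query with the text of every document marked relevant to it."""
    pairs = []
    for query_id, query in dataset["queries"].items():
        for doc_id in dataset["relevant_docs"][query_id]:
            pairs.append((query, dataset["corpus"][doc_id]))
    return pairs


# Toy dataset in the same layout as train_dataset.json.
toy_dataset = {
    "queries": {"q1": "What is chunk zero about?"},
    "corpus": {"chunk-0": "Chunk zero text.", "chunk-1": "Chunk one text."},
    "relevant_docs": {"q1": ["chunk-0"]},
}
pairs = build_training_pairs(toy_dataset)
print(pairs)
```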

## Initializing the Fine-Tuning Process

With our dataset ready, we can now initialize the model for fine-tuning. We'll specify the model identifier, paths to our training and validation datasets, and other training parameters.


In [None]:
model_id = "sentence-transformers/all-MiniLM-L6-v2"  # Example model ID
batch_size = 8
epochs = 4
output_path = "./finetuned_model"

# Initialize the fine-tuning model
finetune_model = FineTuneModel(
    model_id=model_id,
    train_dataset_path=train_dataset_path,
    val_dataset_path=val_dataset_path,
    batch_size=batch_size,
    epochs=epochs,
    output_path=output_path,
)
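
The batch size and epoch count determine how many optimizer steps the run takes. The arithmetic below is generic and not taken from `FineTuneModel`: the 10% warmup fraction is a common heuristic assumed here, and the figure of 1,000 training pairs is a made-up example.

```python
import math


def training_schedule(
    num_pairs: int, batch_size: int, epochs: int, warmup_fraction: float = 0.1
) -> dict:
    """Compute step counts for a run over num_pairs training examples."""
    steps_per_epoch = math.ceil(num_pairs / batch_size)
    total_steps = steps_per_epoch * epochs
    # Warmup fraction is an assumed heuristic, not a value from the script.
    warmup_steps = int(total_steps * warmup_fraction)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Hypothetical dataset size combined with the notebook's batch_size and epochs.
schedule = training_schedule(num_pairs=1000, batch_size=8, epochs=4)
print(schedule)
```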

## Fine-Tuning the Model

Now, we're ready to fine-tune our model. This process will adjust the embeddings to better suit our synthetic dataset, potentially improving performance on our target NLP tasks.


In [None]:
finetune_model.train()
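
To judge whether fine-tuning helped, one option is to compare a retrieval metric on the validation set before and after training. The sketch below computes hit rate at k with plain Python and hand-written toy vectors. `cosine` and `hit_rate_at_k` are hypothetical helpers, and in practice the vectors would come from encoding the validation queries and corpus with the base and fine-tuned models.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hit_rate_at_k(
    query_vecs: Dict[str, List[float]],
    doc_vecs: Dict[str, List[float]],
    relevant_docs: Dict[str, List[str]],
    k: int = 1,
) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    hits = 0
    for query_id, q_vec in query_vecs.items():
        ranked = sorted(
            doc_vecs, key=lambda d: cosine(q_vec, doc_vecs[d]), reverse=True
        )
        if any(doc_id in ranked[:k] for doc_id in relevant_docs[query_id]):
            hits += 1
    return hits / len(query_vecs) if query_vecs else 0.0


# Toy embeddings: q1 retrieves its relevant document first, q2 does not.
toy_query_vecs = {"q1": [1.0, 0.0], "q2": [0.0, 1.0]}
toy_doc_vecs = {"d1": [0.9, 0.1], "d2": [0.2, 0.8]}
toy_relevant = {"q1": ["d1"], "q2": ["d1"]}
hit_at_1 = hit_rate_at_k(toy_query_vecs, toy_doc_vecs, toy_relevant, k=1)
hit_at_2 = hit_rate_at_k(toy_query_vecs, toy_doc_vecs, toy_relevant, k=2)
print(hit_at_1, hit_at_2)
```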