# PART 1 Synthetic Dataset Generation for Embedding Finetuning Tasks

In this Jupyter notebook, we demonstrate how to use a Python script to generate a synthetic dataset of (query, relevant document) pairs from a corpus of documents. The dataset can be used to finetune embedding models and improve performance in custom RAG and retrieval use cases. We use natural language processing (NLP) techniques and a language model to automate the creation of a dataset suitable for tasks such as question answering, search, and information retrieval.

## Setup

First, let's import the necessary components from our script. These cover loading the corpus, generating queries, and saving our dataset.


## Generate the Corpus

We begin by importing the relevant helper functions from the script and initializing our `CorpusLoader` with a directory containing our PDF documents. This class splits the files into training and validation sets (controlled by `val_ratio`) and loads each set as a corpus.

We create the corpus by using LlamaIndex to load some sample PDFs, then parsing and chunking them into plain-text chunks.
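To make the split-then-chunk idea concrete, here is a minimal pure-Python sketch. It is not the implementation of `CorpusLoader` (which relies on LlamaIndex); the helpers `split_files` and `chunk_text`, the fixed seed, and the character-window chunking are all assumptions chosen for illustration. The only detail taken from this notebook is that a corpus is a dictionary mapping a chunk id to its text.

```python
import random
import uuid


def split_files(files, val_ratio=0.2, seed=42):
    """Shuffle file paths and split them into (train, val) lists.

    Hypothetical helper: the real CorpusLoader may split differently.
    """
    shuffled = sorted(files)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]


def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into overlapping character windows keyed by a unique id."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = {}
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip windows that contain only whitespace
            chunks[str(uuid.uuid4())] = piece
    return chunks


demo_files = [f"doc{i}.pdf" for i in range(10)]
demo_train, demo_val = split_files(demo_files, val_ratio=0.2)
demo_chunks = chunk_text("a" * 450, chunk_size=200, overlap=20)
```

Splitting at the file level rather than the chunk level keeps chunks from the same document out of both sets, which avoids leaking near-duplicate text from training into validation.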


In [1]:
import os
import yaml
import sys

current_dir = os.getcwd()
kit_dir = os.path.abspath(os.path.join(current_dir, ".."))
repo_dir = os.path.abspath(os.path.join(kit_dir, ".."))

sys.path.append(kit_dir)
sys.path.append(repo_dir)


In [None]:
from fine_tuning_embeddings.src.generate_fine_tune_embed_dataset import CorpusLoader, QueryGenerator, LangChainLLM, save_dict_safely
from utils.model_wrappers.api_gateway import APIGateway
import random

In [None]:
data_directory = os.path.join(kit_dir, "sample_data")
val_ratio = 0.2

corpus_loader = CorpusLoader(directory=data_directory, val_ratio=val_ratio)
train_corpus = corpus_loader.load_corpus(corpus_loader.train_files)
val_corpus = corpus_loader.load_corpus(corpus_loader.val_files)


In [None]:
corpus_loader.val_files

## Saving the Loaded Corpora

After loading the training and validation corpora, we save them to files for later use. This ensures we can easily reload the corpora without reprocessing the original documents.
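As a rough illustration of what saving and reloading a corpus involves, the sketch below round-trips a small dictionary through JSON. The functions `save_corpus_json` and `load_corpus_json` are hypothetical stand-ins, not the `CorpusLoader.save_corpus` method itself, and the demo corpus is made-up data written to a temporary directory.

```python
import json
import os
import tempfile


def save_corpus_json(corpus, path):
    """Write a {chunk_id: text} dictionary to a JSON file."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(corpus, f, ensure_ascii=False, indent=2)


def load_corpus_json(path):
    """Read a corpus dictionary back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Made-up data, saved under a temporary directory so nothing real is overwritten.
demo_corpus = {"chunk-1": "First text chunk.", "chunk-2": "Second text chunk."}
demo_path = os.path.join(tempfile.mkdtemp(), "data", "train_corpus.json")
save_corpus_json(demo_corpus, demo_path)
reloaded_corpus = load_corpus_json(demo_path)
```

Reloading from these files in a later session skips the PDF parsing and chunking step entirely.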


In [None]:
train_corpus_output_path = os.path.join(kit_dir, "data/train_corpus.json")
val_corpus_output_path = os.path.join(kit_dir, "data/val_corpus.json")

corpus_loader.save_corpus(train_corpus, train_corpus_output_path)
corpus_loader.save_corpus(val_corpus, val_corpus_output_path)


## Defining the Language Model (LLM)

For generating queries, we define the language model (LLM) to use. You can choose between a SambaNova model or an OpenAI / other LLM provider model based on your requirements and access.
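The cell in this section reads six keys from the `llm` section of `config.yaml`: `api`, `coe`, `do_sample`, `max_tokens_to_generate`, `temperature`, and `select_expert`. A small check like the one below can catch a missing key before the model is loaded. The helper `missing_llm_keys` is hypothetical and the example values are placeholders, not recommended settings.

```python
# Keys that the notebook reads from config['llm'] when calling APIGateway.load_llm.
REQUIRED_LLM_KEYS = {
    "api",
    "coe",
    "do_sample",
    "max_tokens_to_generate",
    "temperature",
    "select_expert",
}


def missing_llm_keys(config):
    """Return the sorted list of required 'llm' keys absent from a config dict."""
    llm_section = config.get("llm") or {}
    return sorted(REQUIRED_LLM_KEYS - set(llm_section))


# Placeholder config with two of the six keys present.
partial_config = {"llm": {"api": "placeholder-api", "temperature": 0.0}}
missing = missing_llm_keys(partial_config)
if missing:
    print(f"config.yaml is missing llm keys: {missing}")
```

Running the check on the dictionary returned by `load_config` gives a readable message instead of a `KeyError` partway through the call.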


In [9]:
def load_config(config_file: str) -> dict:
    """
    Load configuration parameters from a YAML file.

    Parameters:
    config_file (str): Path to the YAML configuration file.

    Returns:
    dict: The parsed configuration.
    """
    with open(config_file, 'r') as file:
        return yaml.safe_load(file)

config = load_config(os.path.join(kit_dir, 'config.yaml'))        
llm_info = config['llm']

llm = APIGateway.load_llm(
    type=llm_info['api'],
    streaming=True,
    coe=llm_info['coe'],
    do_sample=llm_info['do_sample'],
    max_tokens_to_generate=llm_info['max_tokens_to_generate'],
    temperature=llm_info['temperature'],
    select_expert=llm_info['select_expert'],
    process_prompt=False,
)

# Wrap the SambaNova endpoint with LangChainLLM, because the endpoint wrapper is a LangChain LLM
llm = LangChainLLM(llm=llm)


# For OpenAI (left commented out as an example; requires importing an OpenAI LLM class first):
# llm = OpenAI(model='gpt-3.5-turbo')

In [None]:
# Initialize the QueryGenerator with your language model
# Note: Ensure you have access to the LLM and its credentials
# Note: Depending on the size of your corpus & model inference time, this can take a long time! 

query_generator = QueryGenerator(llm=llm)

train_queries, train_relevant_docs = query_generator.generate_queries(train_corpus, verbose=True)
val_queries, val_relevant_docs = query_generator.generate_queries(val_corpus, verbose=True)

## Inspecting Generated Queries

It's essential to inspect the generated queries and their corresponding relevant documents to ensure the quality of our dataset.
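Alongside reading samples by eye, a quick structural check can flag obvious problems. The sketch below assumes the shapes used throughout this notebook: `queries` maps a query id to query text, `relevant_docs` maps a query id to a list of document ids, and `corpus` maps a document id to text. The function `check_dataset` is a hypothetical helper and the demo data is made up.

```python
def check_dataset(queries, relevant_docs, corpus):
    """Return a list of human-readable problems found in a generated dataset."""
    problems = []
    for query_id, query in queries.items():
        if not query.strip():
            problems.append(f"{query_id}: empty query text")
        doc_ids = relevant_docs.get(query_id, [])
        if not doc_ids:
            problems.append(f"{query_id}: no relevant documents")
        for doc_id in doc_ids:
            if doc_id not in corpus:
                problems.append(f"{query_id}: unknown document id {doc_id}")
    return problems


# Made-up example with one clean query and two broken ones.
demo_corpus = {"d1": "Some chunk of text."}
demo_queries = {"q1": "What does the chunk say?", "q2": "  ", "q3": "Orphan?"}
demo_relevant = {"q1": ["d1"], "q2": ["d1"], "q3": ["d9"]}
demo_problems = check_dataset(demo_queries, demo_relevant, demo_corpus)
```

An empty result does not prove the queries are good, only that the ids line up; the manual inspection is still needed to judge relevance.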


In [None]:
# Helper function to display generated queries and documents
def display_generated_data(queries, relevant_docs, corpus, num_samples=5):
    # Cap the sample size so small query sets do not raise a ValueError
    sample_queries = random.sample(list(queries.items()), min(num_samples, len(queries)))
    
    for query_id, query in sample_queries:
        print(f"Query: {query}")
        doc_ids = relevant_docs[query_id]
        for doc_id in doc_ids:
            print(f"Relevant Document: {corpus[doc_id][:200]}...")  # Display the first 200 characters
        print("\n")

display_generated_data(train_queries, train_relevant_docs, train_corpus)
display_generated_data(val_queries, val_relevant_docs, val_corpus)


## Saving the Dataset

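Finally, we save the generated dataset with the script's `save_dict_safely` helper, which is intended to avoid memory issues while writing. Each output file bundles the queries, the corpus, and the mapping from each query to its relevant documents, so it can be reloaded later for finetuning.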
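To show how a saved dataset can later be consumed, the sketch below turns a dictionary with the `queries`, `corpus`, and `relevant_docs` keys into (query, document text) pairs. The helper `iter_pairs` is hypothetical, the demo data is made up, and the JSON round-trip only stands in for reading a saved file; how a particular finetuning library expects its input may differ.

```python
import json


def iter_pairs(dataset):
    """Yield (query text, document text) pairs from a saved dataset dict."""
    for query_id, query in dataset["queries"].items():
        for doc_id in dataset["relevant_docs"][query_id]:
            yield query, dataset["corpus"][doc_id]


# Made-up dataset using the same three keys that this notebook saves.
demo_dataset = {
    "queries": {"q1": "What is chunk one about?", "q2": "What is chunk two about?"},
    "corpus": {"d1": "Chunk one text.", "d2": "Chunk two text."},
    "relevant_docs": {"q1": ["d1"], "q2": ["d2"]},
}

# Simulate a save/reload cycle through JSON before building pairs.
reloaded_dataset = json.loads(json.dumps(demo_dataset))
demo_pairs = list(iter_pairs(reloaded_dataset))
```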


In [None]:
train_output_path = os.path.join(kit_dir, "data/train_dataset.json")
val_output_path = os.path.join(kit_dir, "data/val_dataset.json")

save_dict_safely({'queries': train_queries, 'corpus': train_corpus, 'relevant_docs': train_relevant_docs}, train_output_path)
save_dict_safely({'queries': val_queries, 'corpus': val_corpus, 'relevant_docs': val_relevant_docs}, val_output_path)


## Conclusion

This notebook walked through generating a synthetic dataset of (query, relevant document) pairs in Python. By automating query generation over a corpus of document chunks, we streamline the creation of datasets for finetuning embedding models on tasks such as question answering and information retrieval.
