# Query/Context Dataset Generation
***

This notebook walks students through the process of generating datasets of query/context pairs which can be used for two primary purposes:
- Fine-tuning an embedding model
- Serving as ground truth for retrieval evaluation

In [71]:
%load_ext autoreload 
%autoreload 2

import sys
sys.path.append('../')

from src.evaluation.retrieval_evaluation import QueryContextGenerator
from src.evaluation.eval_prompt_templates import qa_triplet_generation_prompt, qa_flavors
from src.preprocessor.preprocessing import FileIO
from src.llm.llm_interface import LLM
from tqdm import tqdm
from rich import print
import random
import pandas as pd
import uuid
import re
import os

from dotenv import load_dotenv
env = load_dotenv('.env', override=True)

from datasets import load_dataset, load_dataset_builder, Dataset, DatasetDict



In [78]:
data_path = '../../version2/data/huberman_minilm-512.parquet'
data = FileIO().load_parquet(data_path)



Shape of data: (11602, 13)
Memory Usage: 1.15+ MB


In [79]:
# pri_keys = ['title', 'content', 'summary', 'guest', 'doc_id']
# data = [{k:v for k,v in d.items() if k in pri_keys} for d in data]

In [80]:
llm = LLM()

In [81]:
generator = QueryContextGenerator(llm)
print(qa_flavors)

In [82]:
dataset_512 = generator.generate_retrieval_dataset(data, 100)

2024-05-17 10:40:28.083 | INFO     | src.evaluation.retrieval_evaluation:generate_retrieval_dataset:270 - Changing QA Flavor: at count 25, using qa_flavor 1
2024-05-17 10:41:07.936 | INFO     | src.evaluation.retrieval_evaluation:generate_retrieval_dataset:270 - Changing QA Flavor: at count 50, using qa_flavor 2
2024-05-17 10:41:41.798 | INFO     | src.evaluation.retrieval_evaluation:generate_retrieval_dataset:270 - Changing QA Flavor: at count 75, using qa_flavor 3
QA Pair Generation: 100%|██████████| 100/100 [02:28<00:00,  1.48s/it]


In [83]:
FileIO.save_as_json('../data/golden_datasets/golden_512_hard.json', dataset_512)

2024-05-17 10:42:41.726 | INFO     | src.preprocessor.preprocessing:save_as_json:111 - Data saved as json file here: ../data/golden_datasets/golden_512_hard.json


In [10]:
# golden_dataset = generator.generate_qa_embedding_pairs(data, qa_generation_prompt, num_total_questions=100, num_questions_per_chunk=2, threshold=0.80)

In [11]:
io = FileIO()

In [18]:
io.save_as_json('../data/golden_datasets/golden_512.json', golden_dataset, indent=4)

2024-04-18 17:18:35.931 | INFO     | src.preprocessor.preprocessing:save_as_json:107 - Data saved as json file here: ../data/golden_datasets/golden_512.json

In [54]:
df = pd.DataFrame.from_dict(q, orient='index', columns=['queries']).reset_index(drop=False, names=['query_ids'])

In [63]:
df = pd.concat((df, pd.DataFrame(relevant_docs, columns=['contexts'])), axis=1)

In [64]:
df = pd.concat((df, pd.DataFrame(doc_ids, columns=['doc_ids'])), axis=1)

In [67]:
test = Dataset.from_pandas(df)

In [99]:
qs = ['How long did protein synthesis peak after the infusion of essential amino acids in the study mentioned in the transcript?',
 "What study out of Wolf's lab suggested the duration for which protein synthesis peaked after the infusion of essential amino acids?", 'How long did protein synthesis peak after the infusion of essential amino acids in the study mentioned in the transcript?',
 "What study out of Wolf's lab suggested the duration for which protein synthesis peaked after the infusion of essential amino acids?"]

In [89]:
qs = test['queries']

In [60]:
relevant_docs = [testrun['corpus'][di] for di in list(testrun['relevant_docs'].values())]
relevant_docs
doc_ids = list(testrun['relevant_docs'].values())
len(doc_ids)

50

### Load raw data
Load the raw data from a parquet file. The raw data should be in the same format as the dataset (corpus) created in [Notebook 1](https://github.com/americanthinker/vectorsearch-applications/blob/main/1-Data_Preprocessing_Week1_COLAB.ipynb).

In [4]:
data_path = './data/impact_theory_minilm_256.parquet'
data = FileIO().load_parquet(data_path)
len(data)

Shape of data: (26448, 12)
Memory Usage: 2.42+ MB


26448

### Data Length Analysis
Analyze the length of the content chunks. Length can be measured in characters (as in the cell below), raw words, or tokens. The main point is to get a sense of the mean length of the content chunks in the data so that the `total_chars` param in the `clean_validate_data` method can be set to an appropriate value.

In [5]:
# in this example the mean content length is roughly 1,000 characters
lengths = [len(d['content']) for d in data]
df = pd.DataFrame(lengths)
df.describe() 

                  0
count  26448.0
mean     991.729053
std      126.34487
min        4.0
25%      944.0
50%     1005.0
75%     1060.0
max     1974.0
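One way to turn these statistics into a `total_chars` value is sketched below. This is only an illustration on toy data: the 25th-percentile rule is an assumption chosen for the example, not a rule taken from the course code, and `toy_data` stands in for the notebook's `data`.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` (a list of dicts with a 'content' key).
toy_data = [
    {"content": "word " * 10},   # 50 characters
    {"content": "word " * 200},  # 1,000 characters
    {"content": "word " * 210},  # 1,050 characters
    {"content": "word " * 190},  # 950 characters
]

# Measure length two ways: characters and whitespace-separated words.
char_lengths = pd.Series([len(d["content"]) for d in toy_data])
word_lengths = pd.Series([len(d["content"].split()) for d in toy_data])

# Hypothetical heuristic: use the 25th percentile of character length as the cutoff.
total_chars = int(char_lengths.quantile(0.25))

# Keep only chunks longer than the cutoff, mirroring the "length > total_chars" rule.
kept = [d for d in toy_data if len(d["content"]) > total_chars]
print(total_chars, len(kept))
```

In the notebook's own data the 25th percentile is 944 characters, which is close to the `total_chars=950` value used in the split step.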


### Split Data

The `train_val_split` function first cleans and validates the raw data and then splits it into user-defined train/val sets.
- Cleaning simply strips the keys from the data that are not needed for the query/context generation process.
- Validation ensures that only content chunks of length > `total_chars` are passed to the LLM (this step prevents the LLM from asking questions about sparse context).

Users define the number of training samples and validation samples to generate. The number of questions per content chunk can also be set to more than 1. A few notes:
- Setting `num_questions_per_chunk` > 1 saves time (and money) by asking more than one question per content chunk, but the dataset will be less diverse. The model may also generate lower-quality questions if the content chunk isn't large enough or meaningful enough to support more than one question.
- Fine-tuning an embedding model with 200-300 training samples improved retrieval evaluation results by 5-10 percentage points. The upper bound on retrieval improvement as a function of training sample size is yet to be determined (have fun pushing the boundaries! 👊)
- A validation dataset is not required to see improvement from fine-tuning. Adding one, however, allows a user to test the results of fine-tuning on an unseen dataset.
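The clean, validate, and split steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `train_val_split` implementation: the helper names `clean_validate` and `split_train_val`, the `keep_keys` argument, and the toy data are all hypothetical.

```python
import random

def clean_validate(data, keep_keys, total_chars):
    """Hypothetical helper: drop unneeded keys, then drop short chunks."""
    cleaned = [{k: v for k, v in d.items() if k in keep_keys} for d in data]
    return [d for d in cleaned if len(d["content"]) > total_chars]

def split_train_val(data, n_train, n_val, seed=42):
    """Hypothetical helper: sample disjoint train and val sets without replacement."""
    if n_train + n_val > len(data):
        raise ValueError("Not enough valid chunks for the requested split sizes")
    sample = random.Random(seed).sample(data, n_train + n_val)
    return sample[:n_train], sample[n_train:]

# Toy data: 20 chunks with content lengths from 900 to 1,090 characters.
toy = [
    {"doc_id": f"doc_{i}", "content": "x" * (900 + 10 * i), "extra": i}
    for i in range(20)
]

valid = clean_validate(toy, keep_keys={"doc_id", "content"}, total_chars=950)
train, val = split_train_val(valid, n_train=10, n_val=4)
print(len(valid), len(train), len(val))
```

Sampling without replacement keeps the two sets disjoint, so the validation set remains unseen during fine-tuning.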

In [6]:
# split data into train/val sets
# in this example we create a training set of n=10 and a val set of n=5, ask the LLM for only 1 question per chunk,
# and keep only chunks longer than 950 characters
train, val = generator.train_val_split(data, 10, 5, 1, total_chars=950)

Length Training Data: 10
Length Validation Data: 5


### Generate QA pairs

To generate query/context pairs we need to pass in our cleaned data splits, a question-generation prompt, and the number of questions per chunk (this needs to be the same value passed into the `train_val_split` function).
The `qa_generation_prompt` is preconfigured and supplies the LLM with additional context about the Impact Theory show to help ensure that high-quality questions are asked.
Print out the prompt to see what is being asked of the model:

In [7]:
print(qa_generation_prompt)

The output of `generate_qa_embedding_pairs` is a llama_index `EmbeddingQAFinetuneDataset`, which is a simple wrapper around three dictionaries (`corpus`, `queries`, and `relevant_docs`). The llama_index class is not strictly necessary, but it smooths the transition to fine-tuning with the llama_index `SentenceTransformersFinetuneEngine` class. Generating 100 query/context pairs takes roughly 80 seconds, so a sample size of 300 takes about 4 minutes (much faster than doing this manually!).
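To make the three-dictionary layout concrete, here is a minimal sketch that uses plain Python dicts in place of the llama_index class. The ids and texts are invented for illustration; only the three field names come from the description above.

```python
# Plain-dict stand-ins for the three mappings (all ids and texts are hypothetical).
corpus = {
    "doc_1": "Context chunk about building habits.",
    "doc_2": "Context chunk about sleep and recovery.",
}
queries = {
    "q_1": "What does the guest say about building habits?",
    "q_2": "How does sleep affect recovery?",
}
# Each query id maps to a list of relevant doc ids.
relevant_docs = {"q_1": ["doc_1"], "q_2": ["doc_2"]}

# Join the three mappings into (question, context) pairs.
pairs = [(queries[qid], corpus[relevant_docs[qid][0]]) for qid in queries]

# Rough generation-time estimate from the figure quoted above (80 s per 100 pairs).
seconds_per_100_pairs = 80
estimated_minutes = 300 / 100 * seconds_per_100_pairs / 60
print(pairs[0], estimated_minutes)
```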

In [12]:
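# 2 questions per chunk x 10 training chunks = 20 training queries
training_set = generator.generate_qa_embedding_pairs(train, qa_generation_prompt, 2)
# 1 question per chunk x 5 validation chunks = 5 validation queries (val_set is used in the next cell)
val_set = generator.generate_qa_embedding_pairs(val, qa_generation_prompt, 1)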

100%|████████████████████████████████████████████████████████████████████| 10/10 [00:13<00:00,  1.30s/it]


In [13]:
#EmbeddingQAFinetuneDataset has no len, so check length of queries instead
len(training_set.queries), len(val_set.queries)

(20, 5)

### Dataset Analysis

It is always a good idea to check the quality of the generated pairs. Most pairs will be high quality, but some will not be. This is a chance for human intervention: adjust the questions manually to keep the overall quality high.

In [14]:
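# the type hint is a string so this cell runs even if EmbeddingQAFinetuneDataset has not been imported
def show_qa_pairs(data: "EmbeddingQAFinetuneDataset", print_results: bool=True):
    """Return (question, context) tuples, optionally printing each pair."""
    pairs = []
    for k, v in data.queries.items():
        # each query id maps to a list of relevant doc ids; use the first one
        doc_id = data.relevant_docs[k][0]
        context = data.corpus[doc_id]
        pairs.append((v, context))
    if print_results:
        for tup in pairs:
            print(f'Question: {tup[0]}\nContext: {tup[1]}\n\n')
    return pairs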

In [17]:
pairs = show_qa_pairs(training_set, print_results=True)
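A simple automated pass can complement the manual review by pointing at pairs worth a closer look. The sketch below is an assumption-laden illustration, not part of the course code: `flag_pairs`, its `min_question_words` threshold, and the toy pairs are hypothetical, and the input is a list of `(question, context)` tuples like the one returned by `show_qa_pairs`.

```python
from collections import Counter

def flag_pairs(pairs, min_question_words=5):
    """Hypothetical helper: flag duplicate or very short questions for manual review."""
    counts = Counter(q.strip().lower() for q, _ in pairs)
    flagged = []
    for idx, (question, _context) in enumerate(pairs):
        reasons = []
        if counts[question.strip().lower()] > 1:
            reasons.append("duplicate question")
        if len(question.split()) < min_question_words:
            reasons.append("very short question")
        if reasons:
            flagged.append((idx, reasons))
    return flagged

# Toy (question, context) pairs for illustration only.
toy_pairs = [
    ("How long did protein synthesis peak after the infusion?", "context a"),
    ("How long did protein synthesis peak after the infusion?", "context a"),
    ("Why sleep?", "context b"),
    ("What does the guest say about morning sunlight exposure?", "context c"),
]

flagged = flag_pairs(toy_pairs)
print(flagged)
```

Flagged pairs are only candidates for review; whether to rewrite or drop them remains a human judgment call.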

### Save to Disk  
Save the datasets to disk using your own filepaths. Below is an example that uses the length of each set as part of the filepath.

In [28]:
# training_set.save_json('./data/training_data_10.json')
# val_set.save_json('./data/validation_data_5.json')
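If you prefer to build the filepath from the dataset size programmatically, something like the following sketch could work. It uses only the standard library on a plain dict; `save_with_count` and the toy dataset are hypothetical and are not a replacement for the `save_json` method shown above.

```python
import json
import tempfile
from pathlib import Path

def save_with_count(dataset: dict, directory: str, prefix: str) -> Path:
    """Hypothetical helper: write the dataset to '<prefix>_<number of queries>.json'."""
    n_queries = len(dataset["queries"])
    path = Path(directory) / f"{prefix}_{n_queries}.json"
    path.write_text(json.dumps(dataset, indent=4), encoding="utf-8")
    return path

# Toy dataset with the same three top-level keys as the generated datasets.
toy_dataset = {
    "queries": {"q_1": "First question?", "q_2": "Second question?"},
    "corpus": {"doc_1": "First context.", "doc_2": "Second context."},
    "relevant_docs": {"q_1": ["doc_1"], "q_2": ["doc_2"]},
}

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_with_count(toy_dataset, tmp_dir, "training_data")
    saved_name = saved_path.name
    reloaded = json.loads(saved_path.read_text(encoding="utf-8"))

print(saved_name)
```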