# Synthetic Evaluation Data Generation

## Table of Contents
1. [Install required libraries](#Install-required-libraries)
2. [Prepare input data](#Prepare-input-data)
3. [Generate API key](#Generating-API-key)
4. [Loading dataset](#Loading-datasets)
5. [Reading pipeline config](#Read-pipeline-config)
6. [Data Generation](#Running-the-Synthetic-Data-Generator)
7. [Data Quality Assessment](#Data-Quality-Assessment)


### Install required libraries

Install NeMo-Curator and its required dependencies by following the NeMo-Curator installation steps. Then install the tutorial-specific dependencies as follows:
```
$ pip install -r requirements.txt
```

Please also see [README.md](../README.md) for environment setup including necessary library installation.


### Prepare input data

The synthetic data generation framework accepts input in the `rawdoc` format.

- `input_format=rawdoc`

The input file must be in JSONL format. Each line contains one document in the form `{"text": <document>, "title": <title>}`.

```
{"text": "The quick brown fox jumps over the lazy dog.", "title": "Classic Pangram" }
{"text": "The Eiffel Tower is an iron lattice tower on the Champ de Mars in Paris.", "title": "Iconic Landmark" }
...
```
If the documents already have document ids, the input file can include them, and the same ids are persisted in the generated data. In that case each line takes the form `{"_id": <document_id>, "text": <document>, "title": <title>}`.
```
{"_id": "5", "text": "The quick brown fox jumps over the lazy dog.", "title": "Classic Pangram" }
{"_id": "doc3", "text": "The Eiffel Tower is an iron lattice tower on the Champ de Mars in Paris.", "title": "Iconic Landmark" }
...
```
This repository contains a sample JSONL file `sample_data/sample_data.jsonl`.
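
Before running the pipeline, it can help to check that an input file matches the format described above. The following is a minimal sketch using only the standard library; the helper name `validate_rawdoc_lines` is hypothetical and is not part of NeMo-Curator.

```python
import json


def validate_rawdoc_lines(lines):
    """Parse JSONL lines and check that each record has `text` and `title`.

    Hypothetical helper for illustration; `_id` is treated as optional.
    """
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in ("text", "title") if key not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing fields {missing}")
        records.append(record)
    return records


sample_lines = [
    '{"text": "The quick brown fox jumps over the lazy dog.", "title": "Classic Pangram"}',
    '{"_id": "doc3", "text": "The Eiffel Tower is an iron lattice tower on the Champ de Mars in Paris.", "title": "Iconic Landmark"}',
]
records = validate_rawdoc_lines(sample_lines)
print(f"{len(records)} valid records")
```

To check a file on disk, pass an open file handle instead of a list, since iterating over a text file yields its lines.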

In [1]:
import os
from omegaconf import OmegaConf
import sys
import importlib
import warnings
warnings.filterwarnings('ignore')


from nemo_curator.filters.synthetic import EasinessFilter, AnswerabilityFilter
from nemo_curator.modules.filter import ScoreFilter, Score
from nemo_curator.datasets import DocumentDataset

config = importlib.import_module(
    "tutorials.nemo-retriever-synthetic-data-generation.config.config"
)
retriever_evalset_generator = importlib.import_module(
    "tutorials.nemo-retriever-synthetic-data-generation.retriever_evalset_generator"
)

### Generating API key

- The SDG pipeline uses NIM models. To use them, you need to generate an API key.

- Visit [this page](https://build.nvidia.com/mistralai/mixtral-8x7b-instruct) and click "Get API Key" to generate an API key.

![NVIDIA API Catalog](../figures/api_key.png) 
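
One way to avoid pasting the key directly into the notebook is to read it from an environment variable. This is a sketch under assumptions: the variable name `NVIDIA_API_KEY` and the helper `load_api_key` are illustrative choices, not names required by the pipeline.

```python
import os


def load_api_key(env_var="NVIDIA_API_KEY"):
    """Return the API key stored in `env_var`, or raise if it is unset or empty.

    The variable name is an assumption made for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key


# Example use once the pipeline config has been loaded:
# cfg.api_key = load_api_key()
```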

### Loading datasets
We now load a sample dataset from our data folder.

In [2]:
import pandas as pd
df = pd.read_json("../data/sample_data_rawdoc.jsonl", lines=True)

In [3]:
df.head()

Unnamed: 0,text,title
0,The Eiffel Tower is an iconic landmark of Pari...,Eiffel Tower - A French Icon
1,The Great Wall of China is a series of fortifi...,The Great Wall of China - Ancient Protection
2,The Taj Mahal is an ivory-white marble mausole...,Taj Mahal - A Symbol of Love
3,Machu Picchu is a 15th-century Inca citadel si...,Machu Picchu - Lost City of the Incas
4,"The Colosseum, also known as the Flavian Amphi...",The Colosseum - Ancient Roman Architecture


### Read pipeline config

In [4]:
cfg = config.RetrieverEvalSDGConfig.from_yaml("../config/config.yaml")
cfg.api_key = "your api key here"
retrieval_evalset_generator = retriever_evalset_generator.RetrieverEvalSetGenerator(cfg)

In [5]:
print (f"Generator model used = {cfg.generator_model}")

Generator model used = mistralai/mixtral-8x22b-instruct-v0.1


### Running the Synthetic Data Generator
We first create the dataset object from the pandas dataframe, then pass it through the generator and the filters. Each step of the pipeline (generator, filters) transforms the dataset object.

In [6]:
dataset = DocumentDataset.from_pandas(df)
generated_dataset = retrieval_evalset_generator(dataset)
generated_df = generated_dataset.df.compute()

### Probing the generated Data
For those documents that do not have a document id, the pipeline generates a random hash as document id. For those that have an existing document id, the pipeline persists the same ids in the generated data.
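
The id handling described above can be pictured with the sketch below. It is only an illustration: the hashing scheme the pipeline actually uses is not shown in this tutorial, so the SHA-256-of-a-random-UUID approach and the helper name `ensure_document_id` are assumptions.

```python
import hashlib
import uuid


def ensure_document_id(record):
    """Return the record unchanged if it has an `_id`, else add a random hash id.

    Hypothetical helper; the pipeline's own id scheme may differ.
    """
    if "_id" in record:
        return record
    random_id = hashlib.sha256(uuid.uuid4().bytes).hexdigest()
    return {"_id": random_id, **record}


with_id = ensure_document_id({"_id": "doc3", "text": "...", "title": "Iconic Landmark"})
without_id = ensure_document_id({"text": "...", "title": "Classic Pangram"})
print(with_id["_id"], len(without_id["_id"]))
```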

In [7]:
generated_df.head()

Unnamed: 0,text,title,question,_id,question-id,answer,score
0,The Eiffel Tower is an iconic landmark of Pari...,Eiffel Tower - A French Icon,What is the significance of the Eiffel Tower i...,342d2d470596528b192b9f0a12d0ec5f4798ab1fc84090...,c6075864cc0c9318df5456c2b06bfb581562542205ff99...,The Eiffel Tower is an iconic landmark in Pari...,1
1,The Eiffel Tower is an iconic landmark of Pari...,Eiffel Tower - A French Icon,Who was responsible for designing the Eiffel T...,12dcafeb731d5ef4e1903f1e6cc35bfa9d5e40f740e967...,003de77e8d7a0d499d75edfc5ad4633d4a2703b89c1f09...,The Eiffel Tower was designed by the engineer ...,1
2,The Eiffel Tower is an iconic landmark of Pari...,Eiffel Tower - A French Icon,When was the Eiffel Tower built and for what p...,e5d22c48da4684bf5da4afe414d2d6630709e5b134b847...,eb5bfbf35e7d53cc2affc58146721a017c72c38344ca1d...,The Eiffel Tower was built in 1889 for the Exp...,1
3,The Great Wall of China is a series of fortifi...,The Great Wall of China - Ancient Protection,What materials were used to construct the Grea...,dab619e293076e8119d9dd0d0ea4a69bf0fff0f526951f...,03c619187f0aae660725a45533184a2ccf58ebb264d92a...,The Great Wall of China was constructed using ...,1
4,The Great Wall of China is a series of fortifi...,The Great Wall of China - Ancient Protection,What was the primary purpose of building the G...,329021930f100a10785cea69e4c1c42a965e5c1892b3ae...,b4d63625700e8f80dd0c42668eb1625d8c58e9716b904a...,The primary purpose of building the Great Wall...,1


### Data Quality Assessment
We apply two filters:

*Answerability filter* uses an LLM as a judge to determine whether each question can be answered from the content of the passage. The filter weeds out questions that are invalid or irrelevant to the document chunk that was used to generate them.

*Easiness filter* removes questions that are too easy, that is, questions for which a retriever model can readily retrieve the positive passage. It uses an embedding model as the judge. The user must provide a threshold (a number between 0 and 1) for this filter. The lower the threshold, the harder the questions that remain in the dataset; a higher threshold lets more easy questions through.

The filters can be applied in any order. 
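
The effect of the easiness threshold can be illustrated with a small sketch. It assumes the easiness score is the cosine similarity between a question embedding and its passage embedding, and that questions scoring above the threshold are dropped. The real `EasinessFilter` also receives `cfg.percentile`, `cfg.truncate`, and `cfg.batch_size`, which this sketch ignores, so its actual cutoff logic may differ. The vectors are toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keep_hard_questions(pairs, threshold):
    """Return indices of (question, passage) embedding pairs scoring at or below threshold."""
    return [
        index
        for index, (question_vec, passage_vec) in enumerate(pairs)
        if cosine_similarity(question_vec, passage_vec) <= threshold
    ]


toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction: very easy
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal: hard
    ([1.0, 1.0], [1.0, 0.0]),  # partially aligned
]
kept = keep_hard_questions(toy_pairs, threshold=0.8)
print(kept)
```

Lowering `threshold` in this sketch keeps fewer, harder pairs, which matches the behavior described for the filter.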

In [8]:
ef = EasinessFilter(cfg.base_url,
                    cfg.api_key,
                    cfg.easiness_filter,
                    cfg.percentile,
                    cfg.truncate,
                    cfg.batch_size)
easiness_filter = ScoreFilter(ef,
                              text_field = ["text", "question"],
                              score_field = "easiness_scores")
af = AnswerabilityFilter(cfg.base_url,
                         cfg.api_key,
                         cfg.answerability_filter,
                         cfg.answerability_system_prompt,
                         cfg.answerability_user_prompt_template,
                         cfg.num_criteria)
answerability_filter = ScoreFilter(af,
                              text_field = ["text", "question"],
                              score_field = "answerability_scores")

### Easiness filter
We see an additional column being generated "easiness_scores". This filter removes questions that are too easy to retrieve by retriever models.

In [9]:
filtered_dataset = easiness_filter(generated_dataset)
filtered_df_1 = filtered_dataset.df.compute()

In [10]:
filtered_df_1.head()

Unnamed: 0,text,title,question,_id,question-id,answer,score,easiness_scores
1,The Eiffel Tower is an iconic landmark of Pari...,Eiffel Tower - A French Icon,Who was the engineer behind the design of the ...,5b31740eab0e66fa435ac3b2d0f3ad299e9bc885da22ad...,985cd7b5de889c7b62eca2d45b83eac5c1ba6fa2dce681...,The Eiffel Tower was designed by the engineer ...,1,0.569564
3,The Great Wall of China is a series of fortifi...,The Great Wall of China - Ancient Protection,What is the purpose of the Great Wall of China?,2e40d9da383f39586c7f4a2e6cdc930de7ceaa1800d41c...,108ee53f98dcba40d2e4654df9e41dadf313000b2cfbb0...,The purpose of the Great Wall of China is to p...,1,0.527854
4,The Great Wall of China is a series of fortifi...,The Great Wall of China - Ancient Protection,What materials were used to build the Great Wa...,b05babced766cf6b65f43bc0d8c927d08a271d30423cd8...,a698316fccdb6facb8341372778863bb092fe71bc60357...,The Great Wall of China was built using materi...,1,0.55047
5,The Great Wall of China is a series of fortifi...,The Great Wall of China - Ancient Protection,What is the general direction of the Great Wal...,88cd9adc26f148a24a1fbde7c5dfed1033db29c7ab997f...,a5996280c5a2b382c206ab4fc69b8981f7588a6c216b63...,The Great Wall of China was generally built al...,1,0.462438
6,The Taj Mahal is an ivory-white marble mausole...,Taj Mahal - A Symbol of Love,What is the Taj Mahal primarily used for?,4eaff3017898dab67377f19bef2cf7bbf7ee1223a661f7...,1e2f14820a6f599a5d124f5cd6b0e2575a0fa601a36d5f...,The Taj Mahal is primarily used as a mausoleum...,1,0.444493


In [11]:
print (f"Total number of generated data points = {generated_df.shape[0]}") 
print (f"Total number of data points after application of easiness filter = {filtered_df_1.shape[0]}")

Total number of generated data points = 30
Total number of data points after application of easiness filter = 21


### Answerability filter
We see an additional column, `answerability_scores`, which holds the ratings given by the LLM-as-judge on the criteria used to judge the questions. The criteria can be found in the config.
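
The `answerability_scores` values appear to be JSON strings, so they can be decoded for inspection. The sketch below assumes that; the key `criterion_1_explanation` comes from the output in this notebook, while `criterion_2_explanation`, the placeholder values, and the helper name `parse_answerability` are hypothetical.

```python
import json


def parse_answerability(raw):
    """Decode a JSON string of judge ratings; return {} if it is not a JSON object."""
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {}
    return parsed if isinstance(parsed, dict) else {}


# Hypothetical value shaped like the truncated strings in the notebook output.
example = '{\n"criterion_1_explanation": "placeholder", "criterion_2_explanation": "placeholder"}'
ratings = parse_answerability(example)
print(sorted(ratings))
```

Once `filtered_df_2` has been computed, the same helper could be applied to the whole column with `filtered_df_2["answerability_scores"].map(parse_answerability)`.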

In [12]:
filtered_dataset_2 = answerability_filter(filtered_dataset)
filtered_df_2 = filtered_dataset_2.df.compute()

In [13]:
filtered_df_2.head()

Unnamed: 0,text,title,question,_id,question-id,answer,score,easiness_scores,answerability_scores
3,The Great Wall of China is a series of fortifi...,The Great Wall of China - Ancient Protection,What materials were used to construct the Grea...,a5b2fd08b6a424a371b12c7d07c37044abddf168427dee...,e5078730ce04b2f8314fced830fbb037528097bfa4c9f8...,The Great Wall of China was constructed using ...,1,0.553092,"{\n""criterion_1_explanation"": ""The question is..."
4,The Great Wall of China is a series of fortifi...,The Great Wall of China - Ancient Protection,What was the primary purpose of building the G...,51d260dd9881d4176553b1a416d3f299a375d903fc677b...,4b5c0ad1ac49efb0f1d252792e4537c960566309f33eb0...,The primary purpose of building the Great Wall...,1,0.505319,"{\n""criterion_1_explanation"": ""The question is..."
5,The Great Wall of China is a series of fortifi...,The Great Wall of China - Ancient Protection,Which direction was the Great Wall of China ge...,5bf59b2efabd4a5b0d2f179841ff1cdc41086e6598a098...,96456def817ac60dfa8a34f3983aba20209ec661d88535...,The Great Wall of China was generally built al...,1,0.545968,"{\n""criterion_1_explanation"": ""The question is..."
6,The Taj Mahal is an ivory-white marble mausole...,Taj Mahal - A Symbol of Love,What is the Taj Mahal primarily made of?,7cf289552442f65170be4c4d0a950a65b9d21ffeeca05d...,53ea4d4ff9ba312d953c3a2a393bb503712f6036830aa8...,The Taj Mahal is primarily made of ivory-white...,1,0.422271,"{\n""criterion_1_explanation"": ""The question is..."
7,The Taj Mahal is an ivory-white marble mausole...,Taj Mahal - A Symbol of Love,Who commissioned the construction of the Taj M...,3290bf8bb526a81774e70939849fd84a4ed49e708677ca...,914f2cc013c20a7e34737435f17db7e05a38e8da333109...,The Taj Mahal was commissioned by the Mughal e...,1,0.547095,"{\n""criterion_1_explanation"": ""The question is..."


In [14]:
print (f"Total number of data points after application of answerability filter = {filtered_df_2.shape[0]}")

Total number of data points after application of answerability filter = 19


After applying the answerability filter, the number of data points drops further. The filter removes unanswerable questions, i.e. questions that cannot be answered solely from the content of the context document.
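
To see how much each filter removed, the row counts printed in this notebook (30, 21, 19) can be turned into retention rates. The helper `retention_report` is a hypothetical name used only for this sketch.

```python
def retention_report(stage_counts):
    """Return (stage, count, kept_vs_previous, kept_vs_first) for each stage.

    Assumes every count is non-zero.
    """
    report = []
    first = previous = None
    for stage, count in stage_counts:
        if first is None:
            first = previous = count
        report.append((stage, count, count / previous, count / first))
        previous = count
    return report


stages = [
    ("generated", 30),
    ("after easiness filter", 21),
    ("after answerability filter", 19),
]
report = retention_report(stages)
for stage, count, vs_previous, vs_first in report:
    print(f"{stage}: {count} rows, {vs_previous:.1%} of previous stage, {vs_first:.1%} of generated")
```

In this run the easiness filter keeps 21 of 30 questions (70%), the answerability filter keeps 19 of those 21 (about 90%), and 19 of the original 30 (about 63%) survive both filters.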