# 0. Installation

Install the OpenAI Python client library.

In [1]:
!pip install openai==1.2.4



# 1. Script for Generating Data from a Sample Output CSV and a Prompt Instruction
### Script for Generating Balanced LLM NLI Dataset

#### Overview
- **Purpose**: Automates the creation of a balanced NLI dataset using OpenAI's GPT-3.5 Turbo model (`gpt-3.5-turbo`).

#### Input
- **Sample CSV**: Path to a sample CSV file for understanding data structure.
- **User-Defined Prompt**: Custom prompt instruction provided by the user.
- **Configuration**:
  - `total_rows_target`: Target number of rows to generate.
  - `max_tokens_per_call`: Maximum token limit for each GPT API call.

#### Process
1. **Read Sample CSV**: Analyzes the sample CSV to define the data structure.
2. **Generate Prompt**: Merges the user-defined prompt with examples from the CSV to guide the GPT model.
3. **Data Generation**:
   - Iteratively calls the GPT API to generate data.
   - Filters generated entries: rows with the wrong number of columns are dropped when the response is parsed, and exact duplicates are removed.
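
The steps above can be sketched offline, without calling the API. This is a minimal illustration, not the notebook's actual run: the in-memory `sample` frame, its column names, the short prompt text, and the `fake_reply` string are all hypothetical stand-ins for the sample CSV, the user prompt, and the model's response.

```python
import csv
from io import StringIO

import pandas as pd

# Hypothetical in-memory sample standing in for the sample CSV file.
sample = pd.DataFrame(
    {
        "Premise": ["Summarize a news article in two sentences."],
        "Hypothesis": ["This is a summarization task."],
        "Label": ["entailment"],
    }
)
columns = sample.columns.tolist()

# Step 2: merge the user prompt with examples taken from the sample.
examples = sample.to_csv(index=False, sep=";", header=False)
prompt = (
    "Create NLI entries about LLM tasks.\n"
    f"Here are some examples:\n{examples}\n"
    f"Generate more entries in CSV format with the following columns: {', '.join(columns)}. "
    "Use semicolon ';' as the delimiter."
)

# Step 3: a hard-coded string stands in for the model's reply (no API call).
fake_reply = (
    "Translate a menu into French.;This is a translation task.;entailment\n"
    "Translate a menu into French.;This is a translation task.;entailment\n"
    "Draft a product description.;The output is a sales report.;contradiction\n"
    "Here are more entries:\n"
)

seen = set()
kept = []
for row in csv.reader(StringIO(fake_reply), delimiter=";"):
    if len(row) != len(columns):
        continue  # drop malformed lines such as chatty preambles
    key = tuple(row)
    if key in seen:
        continue  # drop exact duplicates
    seen.add(key)
    kept.append(row)

print(len(kept))
```

Of the four lines in the stand-in reply, one is a duplicate and one has the wrong number of fields, so two rows survive the filter.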

#### Output
- **Generated Dataset**: A dataset that follows the column structure of the sample CSV and the context given in the user prompt. Label balance depends on the model's output; the script itself does not enforce it.

#### Goal
- Efficiently create a large, diverse dataset by prompting GPT repeatedly until the target number of rows is reached.






In [2]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=False)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
from openai import OpenAI
import pandas as pd
from tqdm import tqdm
import time
import csv
from io import StringIO

# Set your OpenAI API key
from google.colab import userdata
open_ai_key = userdata.get('open_ai_key')

client = OpenAI(
    api_key=open_ai_key,
)

def generate_prompt_with_examples(sample_csv_path, user_prompt, num_examples=5):
    # Load the sample output CSV file and extract example entries
    sample_output_structure = pd.read_csv(sample_csv_path)
    output_columns = sample_output_structure.columns.tolist()
    example_entries = sample_output_structure.head(num_examples).to_csv(index=False, sep=';', header=False)

    # Combine user prompt with examples
    prompt_instruction = f"{user_prompt}\nHere are some examples:\n{example_entries}\nGenerate more entries in CSV format with the following columns: {', '.join(output_columns)}. Use semicolon ';' as the delimiter."
    return prompt_instruction, output_columns

def generate_data(prompt_instruction, max_tokens, output_columns):
    # Request one batch of rows from the model
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt_instruction}],
        max_tokens=max_tokens
    )

    # Parse the reply as semicolon-delimited CSV and keep only rows
    # with the expected number of columns (an empty reply yields no rows)
    generated_content = (response.choices[0].message.content or "").strip()
    reader = csv.reader(StringIO(generated_content), delimiter=';')
    generated_entries = [row for row in reader if len(row) == len(output_columns)]

    return generated_entries

def quality_checks(entries, unique_entries_set):
    quality_entries = []
    for entry in entries:
        entry_tuple = tuple(entry)
        if entry_tuple not in unique_entries_set:
            quality_entries.append(entry)
            unique_entries_set.add(entry_tuple)
    return quality_entries

# User inputs
## Your custom sample .csv file here
sample_csv_path = '/content/drive/MyDrive/Colab Notebooks/task_classification/NLI_Task_Classification_Dataset.csv'
## Your custom prompt instruction here
user_prompt = """
Create a dataset designed for large language model (LLM) task classification within the Natural Language Inference (NLI) framework. Each entry in the dataset should consist of the following elements:
1. **Premise:** Develop a statement that describes a realistic scenario where a specific task of a large language model (LLM) might be applied. Focus on clearly identifiable LLM tasks such as 'Text Generation', 'Conversation and Chatbots', 'Question Answering', 'Content Curation and Recommendation', and others. Ensure the scenarios are varied and accurately represent the diverse range of tasks LLMs are capable of performing.
2. **Hypothesis:** For each premise, formulate a statement that has one of the following relationships with the premise:
   - Entailment: The hypothesis should logically follow the premise and directly relate to the LLM task described in the premise.
   - Neutral: The hypothesis is unrelated or neutral to the premise, not directly commenting on the LLM task described.
   - Contradiction: The hypothesis contradicts the premise, suggesting a different LLM task or an outcome inconsistent with the task described in the premise.
3. **Label:** Assign one of three labels to each premise-hypothesis pair, indicating the relationship type (entailment, neutral, or contradiction).
Please generate a comprehensive and diverse dataset, focused on explicit LLM tasks, to be used for accurate NLI task classification.
"""

total_rows_target = 1000  # target number of generated rows (the loop stops once it is reached)
max_tokens_per_call = 3000  # max_tokens passed to each chat completion call

# Prepare the prompt with examples
prompt_instruction, output_columns = generate_prompt_with_examples(sample_csv_path, user_prompt)
unique_entries_set = set()
all_generated_data = []

for _ in tqdm(range(0, total_rows_target)):
    generated_batch = generate_data(prompt_instruction, max_tokens_per_call, output_columns)
    quality_batch = quality_checks(generated_batch, unique_entries_set)
    all_generated_data.extend(quality_batch)

    print(f'Current number of generated rows: {len(all_generated_data)}')
    if len(all_generated_data) >= total_rows_target:
        break


  0%|          | 1/1000 [00:15<4:16:58, 15.43s/it]

Current number of generated rows: 22


  0%|          | 2/1000 [00:27<3:41:53, 13.34s/it]

Current number of generated rows: 43


...


  5%|▌         | 50/1000 [11:01<3:44:49, 14.20s/it]

Current number of generated rows: 999


  5%|▌         | 50/1000 [11:13<3:33:21, 13.48s/it]

Current number of generated rows: 1020
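
The log above suggests a rough way to estimate how many API calls a given target needs. The figures below are read from this run's output (51 row-count lines, 1020 rows, 11:13 elapsed); treating the per-call yield as constant is an assumption, since it varies from call to call.

```python
import math

# Figures read from the run log above.
rows_generated = 1020
calls_made = 51  # one "Current number of generated rows" line per call
elapsed_seconds = 11 * 60 + 13

# Average yield and latency per call for this run.
rows_per_call = rows_generated / calls_made
seconds_per_call = elapsed_seconds / calls_made

# Estimated calls for a target, assuming the same average yield.
total_rows_target = 1000
calls_needed = math.ceil(total_rows_target / rows_per_call)

print(rows_per_call, round(seconds_per_call, 1), calls_needed)
```

The estimate comes out one call short of what the run actually used, because the 50th call ended at 999 rows, just under the target.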





In [4]:
# from openai import OpenAI
# import pandas as pd
# from tqdm import tqdm
# import time
# import csv
# from io import StringIO

# # Set your OpenAI API key
# from google.colab import userdata
# open_ai_key = userdata.get('open_ai_key')

# client = OpenAI(
#     # defaults to os.environ.get("OPENAI_API_KEY")
#     api_key=open_ai_key,
# )

# # Load the structure from the sample output CSV file
# sample_output_structure = pd.read_csv('/content/customer_data/Expanded_LLM_Task_Classification_Dataset.csv')
# output_columns = sample_output_structure.columns.tolist()

# #Define your prompt instruction
# prompt_instruction = """Create a dataset for task classification with entries in Natural Language Inference (NLI) format. Each entry should consist of:
# * A Premise: This should be a statement describing a realistic scenario in which a popular task of an LLM might be used, such as language translation, summarization, content generation, question answering, etc.
# * A Hypothesis: This should be a statement that either follows logically from the premise (entailment), is unrelated or neutral to the premise (neutral), or contradicts the premise (contradiction).
# * A NLILabel: This should be one of the three NLI labels ('entailment', 'neutral', 'contradiction') based on the relationship between the premise and the hypothesis.
# * A Task: This should be tasks performed by LLMs, ex: 'Text Generation', 'Conversation and Chatbots', 'Question Answering', 'Content Curation and Recommendation', etc..
# The dataset should reflect tasks performed by LLMs, ensuring that each entry's hypothesis is a realistic outcome or expectation of the premise described. Please balance the dataset with an equal number of entries for each label. Generate entries to form a diverse dataset that encompasses various domains where LLMs are typically applied.
# """

# prompt_instruction = """Create a dataset for task classification with entries in Natural Language Inference (NLI) format. Each entry should consist of:
# * A Premise: This should be a statement describing a realistic scenario in which a popular task of an LLM might be used, such as language translation, summarization, content generation, question answering, etc.
# * A Hypothesis: This should be a statement that either follows logically from the premise (entailment), is unrelated or neutral to the premise (neutral), or contradicts the premise (contradiction).
# * A Label: This should be one of the three NLI labels ('entailment', 'neutral', 'contradiction') based on the relationship between the premise and the hypothesis.
# The dataset should reflect tasks performed by LLMs, ensuring that each entry's hypothesis is a realistic outcome or expectation of the premise described. Please balance the dataset with an equal number of entries for each label. Generate entries to form a diverse dataset that encompasses various domains where LLMs are typically applied.
# """
# #Fix the Output Format which is generated from OpenAI Models
# prompt_instruction = f"Generate data in CSV format with the following columns: {', '.join(output_columns)}. Use semicolon ';' as the delimiter. " + prompt_instruction


# # Desired number of total rows and max tokens per API call
# total_rows_target = 500
# max_tokens_per_call = 3000  # Maximum tokens for GPT-3.5 Turbo

# # Function to generate data using GPT-3.5 Turbo
# def generate_data(prompt_instruction, max_tokens, output_format):
#     generated_entries = []

#     customized_prompt = f"{prompt_instruction} Generate output in the format: {output_format}"
#     response = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": customized_prompt}],
#     max_tokens= max_tokens
#     )
#     generated_content = response.choices[0].message.content
#     reader = csv.reader(StringIO(generated_content), delimiter=';')
#     generated_entries = [row for row in reader if len(row) == len(output_columns)]

#     return generated_entries

# # Set for storing unique entries (as tuples for comparison)
# unique_entries_set = set()

# # Function for quality checks including deduplication
# def quality_checks(entries, unique_entries_set):
#     quality_entries = []
#     for entry in entries:
#         entry_tuple = tuple(entry)
#         if entry_tuple not in unique_entries_set:
#             quality_entries.append(entry)
#             unique_entries_set.add(entry_tuple)
#     return quality_entries

# # Main loop for batch processing
# all_generated_data = []

# for _ in tqdm(range(0, total_rows_target)):
#     generated_batch = generate_data(prompt_instruction, max_tokens_per_call, ';'.join(output_columns))
#     quality_batch = quality_checks(generated_batch, unique_entries_set)
#     all_generated_data.extend(quality_batch)
#     print(f'Current number of generated rows: {len(all_generated_data)}')

#     if len(all_generated_data) >= total_rows_target:
#         break


In [5]:
# Convert the list of rows (each a list of strings) to a DataFrame with the sample CSV's column names
df = pd.DataFrame(all_generated_data, columns=output_columns)

# Save DataFrame to CSV without index and with proper header
df.to_csv("/content/LLM_Task_NLI_Dataset.csv", index=False)

print("Data generation complete.")

Data generation complete.
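
Because the script does not enforce label balance, it may be worth checking the label distribution after generation. The following is a minimal sketch on hypothetical rows; the column names `Premise`, `Hypothesis`, and `Label` are assumptions based on the prompt, and in the notebook the same counting could be applied to `df`.

```python
import pandas as pd

# Hypothetical rows; in the notebook this would be the DataFrame `df` built above.
rows = [
    ["p1", "h1", "entailment"],
    ["p1", "h2", "contradiction"],
    ["p2", "h3", "contradiction"],
    ["p2", "h4", "neutral"],
]
demo = pd.DataFrame(rows, columns=["Premise", "Hypothesis", "Label"])

# Normalize labels before counting, since model output can vary in case and spacing.
labels = demo["Label"].str.strip().str.lower()
counts = labels.value_counts()

# Ratio of the rarest to the most common label: 1.0 means perfectly balanced.
balance_ratio = counts.min() / counts.max()

print(counts.to_dict())
print(round(balance_ratio, 2))
```

A ratio well below 1.0 would indicate that some labels are underrepresented and that the data may need resampling or further generation.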


In [6]:
pd.DataFrame(all_generated_data).tail()

|      | 0 | 1 | 2 |
|------|---|---|---|
| 1015 | Generate short descriptions for online product... | The output will be a sales report. | contradiction |
| 1016 | Generate short descriptions for online product... | This is a machine translation task. | contradiction |
| 1017 | Compose an email to a friend based on a given ... | This is a text generation task. | entailment |
| 1018 | Compose an email to a friend based on a given ... | The output will be a song lyrics. | contradiction |
| 1019 | Compose an email to a friend based on a given ... | This is a sentiment analysis task. | contradiction |
