# Developing a Mock Dataset using the OpenAI GPT API

In this tutorial, we create a mock dataset using the OpenAI GPT API. The dataset simulates English writing assignments from students across different grade levels: the language model generates paragraphs that mimic the writing style of students in grades 3 to 12.

## Setting Up the Environment

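First, we set up the Python environment by importing the necessary libraries and loading the OpenAI API key. The code uses the `openai.ChatCompletion` interface and the `openai.error` module, which belong to versions of the `openai` package before 1.0.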

In [1]:
import numpy as np
import h5py
import time
import openai
from openai.error import APIError

def load_api_key(key_file='../educational-prompt-engineering/key.txt'):
    with open(key_file) as f:
        # Get API key from text file, dropping surrounding whitespace and newlines
        key = f.read().strip()
    return key

openai.api_key = load_api_key()
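
As an alternative to a key file, the key can be read from an environment variable. The sketch below assumes the variable is named `OPENAI_API_KEY`, which is the name the `openai` package conventionally looks for. The helper `load_api_key_from_env` is hypothetical and not part of the original notebook.

```python
import os

def load_api_key_from_env(var_name='OPENAI_API_KEY'):
    """Return the API key stored in an environment variable (hypothetical helper)."""
    # Treat a missing or whitespace-only variable as "not set"
    key = os.environ.get(var_name, '').strip()
    if not key:
        raise RuntimeError(f'Environment variable {var_name} is not set.')
    return key
```

With this helper, the assignment becomes `openai.api_key = load_api_key_from_env()`, and the key never needs to be stored next to the notebook.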

## Generating Prompts

To simulate different grade levels, we create a function that generates a prompt contextualized to a specific grade. This prompt will instruct the OpenAI model to generate a paragraph typical of a student in that grade.

### Prompt Structure:
- **Contextual Setup**: The prompt includes a context setup specifying the grade level and the nature of the task (writing a paragraph).
- **Randomly Selected Personas**: Each prompt includes a persona drawn at random from a pre-defined list, which increases the variety of topics and voices in the responses.
- **Response Format**: It instructs the model to list a topic and purpose followed by the paragraph, mimicking a typical school assignment.

In [2]:
def load_personas(file_path):
    with open(file_path, 'r') as file:
        # One persona per line; skip blank lines so no prompt gets an empty persona
        personas = [line.strip() for line in file if line.strip()]
    return personas

def develop_prompt(grade, personas):
    persona = np.random.choice(personas)
    task = 'Write a paragraph on a topic assigned by your teacher.'
    response = ('First list the topic and purpose in the form: \n\n# Topic: {insert_topic_here}\n\n# Purpose: {insert_purpose_here}\n\nThen follow '+
                f'this with your paragraph. Write in the style and at the level that is appropriate for a grade {grade} student.')
    prompt = f'# PERSONA #\n\n{persona}\n\n# CONTEXT #\n\nYou are writing a grade {grade} English assignment.\n\n# TASK #\n\n{task}\n\n# RESPONSE #\n\n{response}'
    return prompt

# Set grade level
grade = 10
# Load personas
personas = load_personas('data/prompt_personas.txt')
# Construct prompt
prompt = develop_prompt(grade=grade, personas=personas)
print(prompt)

# PERSONA #

As an animal lover, you are interested in learning about different species.

# CONTEXT #

You are writing a grade 10 English assignment.

# TASK #

Write a paragraph on a topic assigned by your teacher.

# RESPONSE #

First list the topic and purpose in the form: 

# Topic: {insert_topic_here}

# Purpose: {insert_purpose_here}

Then follow this with your paragraph. Write in the style and at the level that is appropriate for a grade 10 student.


## Example Response from GPT

We use the OpenAI GPT model to generate a response based on our prompt. The response simulates a student's paragraph, including a topic and purpose.

### Interaction Details:
- **Model Selection**: We use the `gpt-3.5-turbo` model for its language generation quality and speed.
- **Token Limit**: `max_tokens` caps the length of the generated response in tokens. A response that reaches the cap is cut off mid-sentence.
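
Because `max_tokens` can cut a paragraph short, it can help to check whether a response was truncated. The sketch below assumes the dictionary-style response layout used in this tutorial and relies on the `finish_reason` field of each choice, which the API sets to `'length'` when the token limit stops generation. The helper name `extract_text` is hypothetical.

```python
def extract_text(response):
    """Return (text, truncated) from a chat completion response.

    Assumes the dictionary-style layout used in this tutorial, where each
    choice carries a 'finish_reason' field alongside the message.
    """
    choice = response['choices'][0]
    text = choice['message']['content']
    # 'length' means generation stopped because max_tokens was reached
    truncated = choice.get('finish_reason') == 'length'
    return text, truncated

# Illustrative stand-in for an API response (not real model output)
mock_response = {'choices': [{'message': {'content': 'A short paragraph'},
                              'finish_reason': 'length'}]}
text, truncated = extract_text(mock_response)
print(truncated)
```

Truncated samples could then be dropped or regenerated with a larger `max_tokens`, depending on how complete the paragraphs need to be.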

In [3]:
model = 'gpt-3.5-turbo' 
max_tokens = 500

# Construct message
messages = [{'role': 'system', 'content': 'You are a helpful assistant who excels at generating randomized paragraph topics, purposes, and text content.'},
            {'role': 'user', 'content': prompt}]
# Prompt GPT model
response = openai.ChatCompletion.create(model=model, messages=messages, max_tokens=max_tokens)
# Collect response
txt = response['choices'][0]['message']['content']
print(txt)

# Topic: Endangered Species

# Purpose: To raise awareness about the importance of conserving endangered species and the impact of human activities on their survival.

Endangered species are a topic of great concern in today's world. As an animal lover, it is disheartening to learn about the declining population of various species. Endangered species are those that are at risk of extinction, meaning they are in danger of disappearing forever. The loss of species affects not only the ecological balance but also the biodiversity and overall health of our planet. Human activities, such as deforestation, pollution, and illegal hunting, are major contributors to the decline of these species. Additionally, climate change poses a significant threat by disrupting ecosystems and altering habitats. It is vital for us to understand the importance of conserving endangered species and taking action to protect them. Conservation efforts, such as habitat restoration, captive breeding, and implementin
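
Since every response follows the `# Topic:` / `# Purpose:` layout requested in the prompt, it can be split into separate fields. The following is a minimal sketch, assuming the model respects that layout; `parse_response` is a hypothetical helper, and the example string is an abbreviated stand-in for a real response.

```python
import re

def parse_response(txt):
    """Split a generated response into topic, purpose, and paragraph."""
    topic = re.search(r'^# Topic:[ \t]*(.+)$', txt, flags=re.MULTILINE)
    purpose = re.search(r'^# Purpose:[ \t]*(.+)$', txt, flags=re.MULTILINE)
    # The paragraph is whatever remains once the header lines are removed
    body = re.sub(r'^# (Topic|Purpose):.*$', '', txt, flags=re.MULTILINE).strip()
    return {
        'topic': topic.group(1).strip() if topic else None,
        'purpose': purpose.group(1).strip() if purpose else None,
        'paragraph': body,
    }

example = ('# Topic: Endangered Species\n\n'
           '# Purpose: To raise awareness.\n\n'
           'Endangered species are a topic of great concern.')
parsed = parse_response(example)
print(parsed['topic'])
```

Fields that the model omits come back as `None`, which makes malformed responses easy to filter out later.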

## Creating a Randomized Dataset

Our goal is to create a diverse and realistic dataset of student writings across different grades. To achieve this, we generate text for a large number of mock students.

### Dataset Creation Process:
- **HDF5 File**: We store the dataset in an HDF5 file with two resizable datasets, `grade` and `text`, so that records can be appended one at a time.
- **Loop for Data Generation**: For each mock student, a grade is randomly chosen, a prompt is generated, and a response from the OpenAI API is obtained and stored.

In [None]:
num_students = 1000
model = 'gpt-3.5-turbo'
max_tokens = 500
n_messages_store = 2

# Open the HDF5 file: 'w' creates a new file, 'a' appends to an existing one
mode = 'a'
with h5py.File('data/grade_level_writing.h5', mode) as h5f:
    if 'grade' not in h5f:
        # Create a resizable dataset for grades; None indicates an unlimited dimension
        max_size = (None,)
        grade_dset = h5f.create_dataset('grade', shape=(0,), maxshape=max_size, dtype='i')

        # Create a special datatype for variable-length strings
        dt = h5py.special_dtype(vlen=str)

        # Create a resizable dataset for texts with the special datatype
        text_dset = h5f.create_dataset('text', shape=(0,), maxshape=max_size, dtype=dt)
    else:
        # Append to the datasets already in the file
        grade_dset = h5f['grade']
        text_dset = h5f['text']

    # Begin a fresh message thread (needed in both 'w' and 'a' modes)
    messages = [{'role': 'system', 'content': 'You are a helpful assistant who excels at generating randomized paragraph topics, purposes, and text content.'}]
    # Append data to datasets in a loop
    for i in range(num_students):
        time_start = time.time()
        grade = np.random.choice(np.arange(3,13))

        # Define the prompt
        prompt = develop_prompt(grade=grade, personas=personas)
        messages.append({'role': 'user', 'content': prompt})

        # Prompt GPT with some randomness
        try:
            response = openai.ChatCompletion.create(model=model, messages=messages, max_tokens=max_tokens, 
                                                    seed=int(np.random.randint(0, 100_000)), temperature=float(np.random.uniform(0.7, 1.3)))
        except APIError as e:
            print(f'Server Error: {e}')
            # Drop the unanswered prompt so the thread keeps alternating user/assistant
            messages.pop()
            time.sleep(10)
            continue
        # Obtain response
        text = response['choices'][0]['message']['content']

        # Save response as part of the conversation (or else it repeats its responses)
        messages.append({'role': 'assistant', 'content': text})
        # Keep the system message plus the last n_messages_store back-and-forths
        if len(messages) > (2*n_messages_store + 1):
            messages = [messages[0]] + messages[-2*n_messages_store:]
        # Make sure last message was the assistant
        while messages[-1]['role']=='user':
            messages = messages[:-1]

        # Resize datasets to accommodate new data
        grade_dset.resize((len(grade_dset) + 1,))
        text_dset.resize((len(text_dset) + 1,))

        # Assign data
        grade_dset[-1] = grade
        text_dset[-1] = text

        # Estimate remaining time from the duration of this iteration
        time_end = time.time()
        n_left = num_students - (i + 1)
        t_left = (n_left * (time_end - time_start)) / 60
        print(f'{i+1} complete. Approximately {t_left:0.1f} minutes remaining...', end='\r')
print('\nAll done!')

490 complete. Approximately 155.8 minutes remaining....
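
The loop above waits a fixed 10 seconds after a server error and then moves on to the next student. One alternative is to retry the same request with exponentially growing delays. The sketch below is a generic wrapper under that assumption; `call_with_retries` and its parameters are hypothetical names, not part of the `openai` package.

```python
import time

def call_with_retries(fn, max_attempts=5, base_delay=1.0,
                      exceptions=(Exception,), sleep=time.sleep):
    """Call fn(), retrying with exponentially growing delays on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except exceptions:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
```

In the generation loop, the request could be wrapped as `call_with_retries(lambda: openai.ChatCompletion.create(model=model, messages=messages, max_tokens=max_tokens), exceptions=(APIError,))`, so that a transient failure does not cost a sample.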

## Inspecting the Generated Data

Finally, we can inspect the generated dataset by reading from the HDF5 file. This step is crucial for verifying the quality and diversity of our mock dataset.

In [49]:
with h5py.File('data/grade_level_writing.h5', 'r') as h5f:
    for i in range(3):
        print('\n# Grade: ', int(h5f['grade'][i]),'\n',)
        print(h5f['text'][i].decode('utf-8'))


# Grade:  8 

# Topic: Ancient Egypt

# Purpose: To inform the reader about the civilization of Ancient Egypt.

Ancient Egypt was a fascinating civilization that flourished along the banks of the Nile River over 4,000 years ago. The ancient Egyptians were known for their impressive architectural achievements, such as the Great Pyramids of Giza and the Temple of Luxor. They believed in the afterlife and spent a great deal of time and effort preparing for it. This is why they built elaborate tombs and filled them with valuable possessions. The Egyptians also had a complex system of hieroglyphic writing, which was used to record important events and religious beliefs. They worshiped many gods and goddesses, including Ra, the sun god, and Isis, the goddess of magic and fertility. The Nile River was crucial to their way of life, as it provided them with fertile land for farming and transportation. The ancient Egyptians also had a well-organized government led by a pharaoh, who was consider
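
Beyond reading a few samples, a quick per-grade summary can show whether all grades are represented and whether text length varies with grade. The sketch below is one possible check; `words_per_grade` is a hypothetical helper, and the short example texts are placeholders rather than generated data.

```python
from collections import defaultdict

def words_per_grade(grades, texts):
    """Return {grade: (n_samples, mean_word_count)} for paired grades and texts."""
    counts = defaultdict(list)
    for grade, text in zip(grades, texts):
        # Whitespace-separated tokens serve as a rough word count
        counts[int(grade)].append(len(text.split()))
    return {g: (len(c), sum(c) / len(c)) for g, c in sorted(counts.items())}

stats = words_per_grade(
    [3, 3, 8],
    ['I like dogs.', 'My cat is big.', 'Ancient Egypt was a fascinating civilization.'],
)
print(stats)  # {3: (2, 3.5), 8: (1, 6.0)}
```

With the HDF5 file open, the inputs could be built as `grades = h5f['grade'][:]` and `texts = [t.decode('utf-8') for t in h5f['text'][:]]`.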