# Developing a Mock Dataset using the OpenAI GPT API

In this tutorial, we create a mock dataset using the OpenAI GPT API. The dataset simulates English writing assignments from students across different grade levels: we prompt OpenAI's language model to generate text that mimics the writing style of students in grades 3 to 12.

## Setting Up the Environment

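First, we set up the Python environment by importing the necessary libraries and loading the OpenAI API key. Note that the code in this tutorial uses the pre-1.0 interface of the `openai` package (`openai.ChatCompletion` and `openai.error`), so it requires a version of the package older than 1.0.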

In [1]:
import numpy as np
import h5py
from time import sleep
import openai
from openai.error import APIError

def load_api_key(key_file='../educational-prompt-engineering/key.txt'):
    """Read the OpenAI API key from a plain-text file."""
    with open(key_file) as f:
        # Strip surrounding whitespace, including the trailing newline
        key = f.read().strip()
    return key

openai.api_key = load_api_key()
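
Keeping the key in a file next to the notebook is convenient, but a common alternative is to read it from an environment variable when the file is absent. The following is a minimal sketch of that idea, not part of the original notebook: the helper name `load_api_key_with_fallback` and its default arguments are assumptions chosen for illustration.

```python
import os
from pathlib import Path


def load_api_key_with_fallback(key_file='key.txt', env_var='OPENAI_API_KEY'):
    """Return the API key from key_file if it exists, else from env_var.

    Hypothetical helper: the name and both defaults are assumptions.
    """
    path = Path(key_file)
    if path.is_file():
        # strip() removes surrounding whitespace, including a trailing newline
        return path.read_text().strip()
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f'No API key found in {key_file!r} or ${env_var}')
    return key
```

With this variant, `openai.api_key = load_api_key_with_fallback('../educational-prompt-engineering/key.txt')` would behave like the original call whenever the key file is present.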

## Generating Prompts

To simulate different grade levels, we create a function that generates a prompt contextualized to a specific grade. This prompt will instruct the OpenAI model to generate a paragraph typical of a student in that grade.

### Prompt Structure:
- **Contextual Setup**: The prompt includes a context setup specifying the grade level and the nature of the task (writing a paragraph).
- **Response Format**: It instructs the model to list a topic and purpose followed by the paragraph, mimicking a typical school assignment.

In [2]:
def develop_prompt(grade):
    """Build a prompt asking the model to write as a student in the given grade."""
    context = f'You are a grade {grade} student in an English class who writes at the average level for a student in grade {grade}.'
    task = 'Write a paragraph on a topic assigned by your teacher for a purpose that is also assigned by your teacher.'
    response = ('First list the topic and purpose in the form: \n\n# Topic: {insert_topic_here}\n\n# Purpose: {insert_purpose_here}\n\n Then follow '+
                f'this with your paragraph. Write in the style and at the level that is appropriate for a grade {grade} student.')

    # Put it all together
    prompt = f'# CONTEXT #\n\n{context}\n\n# TASK #\n\n{task}\n\n# RESPONSE #\n\n{response}'

    return prompt

grade = 10
print(develop_prompt(grade=grade))

# CONTEXT #

You are a grade 10 student in an English class who writes at the average level for a student in grade 10.

# TASK #

Write a paragraph on a topic assigned by your teacher for a purpose that is also assigned by your teacher.

# RESPONSE #

First list the topic and purpose in the form: 

# Topic: {insert_topic_here}

# Purpose: {insert_purpose_here}

 Then follow this with your paragraph. Write in the style and at the level that is appropriate for a grade 10 student.


## Example Response from GPT

We use the OpenAI GPT model to generate a response based on our prompt. The response simulates a student's paragraph, including a topic and purpose.

### Interaction Details:
- **Model Selection**: We use `gpt-3.5-turbo` because it generates fluent text quickly.
- **Token Limit**: `max_tokens` caps the length of the generated response, measured in tokens. A reply that reaches the cap is cut off mid-text.
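
When a reply reaches the `max_tokens` cap, the Chat Completions API reports the choice's `finish_reason` as `'length'` instead of `'stop'`. The sketch below shows one way to check for this, assuming the response can be indexed like a dictionary (as in the cells of this tutorial). The helper name `was_truncated` is hypothetical, and the two dictionaries are stand-ins rather than real API output.

```python
def was_truncated(response):
    """Return True if the first choice stopped because it hit the token limit."""
    # 'length' means the reply was cut off at max_tokens;
    # 'stop' means the model ended the reply on its own
    return response['choices'][0].get('finish_reason') == 'length'


# Stand-in dictionaries shaped like an API response, for illustration only
complete = {'choices': [{'finish_reason': 'stop', 'message': {'content': '...'}}]}
cut_off = {'choices': [{'finish_reason': 'length', 'message': {'content': '...'}}]}

print(was_truncated(complete), was_truncated(cut_off))
```

A check like this could be used to skip or regenerate paragraphs that end mid-sentence before they are stored.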

In [3]:
model = 'gpt-3.5-turbo' 
max_tokens = 1000

# Define the prompt
prompt = develop_prompt(grade)

messages = [{'role': 'system', 'content': 'You are a helpful assistant who excels at generating randomized paragraph topics, purposes, and text content.'},
            {'role': 'user', 'content': prompt}]

response = openai.ChatCompletion.create(model=model, messages=messages, max_tokens=max_tokens)

txt = response['choices'][0]['message']['content']
print(txt)

# Topic: The Importance of Exercise

# Purpose: To persuade readers to incorporate regular exercise into their daily routine

Exercise is an essential part of a healthy lifestyle, and its importance cannot be overstated. Regular physical activity provides numerous benefits to both our physical and mental well-being. First and foremost, exercise helps to maintain a healthy weight by burning calories and boosting metabolism. It plays a crucial role in preventing a wide range of health problems, including heart disease, diabetes, and certain types of cancer. Additionally, working out helps to strengthen our muscles and bones, improving our overall strength and flexibility. Physical activity also has a significant impact on our mental health, as it releases endorphins, which are natural "feel-good" chemicals that help reduce feelings of stress and anxiety. Furthermore, exercise can improve our sleep patterns, enhance our mood, and boost our self-esteem. In conclusion, incorporating regular
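
Because every reply is asked to follow the `# Topic:` / `# Purpose:` layout, the three parts can be separated afterwards. The following is a rough sketch of such a parser using a regular expression; `parse_response` is a hypothetical helper, not part of the original notebook, and it assumes the model followed the requested layout.

```python
import re


def parse_response(text):
    """Split a generated reply into (topic, purpose, paragraph).

    Returns None if the '# Topic:' / '# Purpose:' headers are missing.
    """
    pattern = re.compile(
        r'#\s*Topic:\s*(?P<topic>.+?)\s*\n+'      # header line with the topic
        r'#\s*Purpose:\s*(?P<purpose>.+?)\s*\n+'  # header line with the purpose
        r'(?P<paragraph>.+)',                      # everything after the headers
        re.DOTALL,
    )
    match = pattern.search(text)
    if match is None:
        return None
    return (match['topic'].strip(),
            match['purpose'].strip(),
            match['paragraph'].strip())


sample = ('# Topic: My Favorite Hobby\n\n'
          '# Purpose: To describe a hobby.\n\n'
          'My favorite hobby is drawing. I love it.')
topic, purpose, paragraph = parse_response(sample)
print(topic)
print(purpose)
```

Returning `None` for replies that ignore the layout makes it easy to count how often the model deviates from the requested format.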

## Creating a Randomized Dataset

Our goal is to create a diverse and realistic dataset of student writings across different grades. To achieve this, we generate text for a large number of mock students.

### Dataset Creation Process:
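- **HDF5 File**: We store the dataset in an HDF5 file with resizable datasets, so new entries can be appended one at a time.
- **Loop for Data Generation**: For each mock student, a grade is randomly chosen, a prompt is generated, and a response from the OpenAI API is obtained and stored. The last few exchanges are kept in the conversation so that the model does not repeat its earlier responses.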

In [None]:
num_students = 5000
model = 'gpt-3.5-turbo'
max_tokens = 1000
n_messages = 3

# Open the HDF5 file in append mode ('a' creates the file if it does not exist)
with h5py.File('data/grades_texts_new.h5', 'a') as h5f:
    if 'grade' in h5f and 'text' in h5f:
        # Append to the existing datasets
        grade_dset = h5f['grade']
        text_dset = h5f['text']
    else:
        # Create resizable datasets; None indicates an unlimited dimension
        max_size = (None,)
        grade_dset = h5f.create_dataset('grade', shape=(0,), maxshape=max_size, dtype='i')

        # Create a special datatype for variable-length strings
        dt = h5py.special_dtype(vlen=str)

        # Create a dataset for texts with the special datatype and maximum size
        text_dset = h5f.create_dataset('text', shape=(0,), maxshape=max_size, dtype=dt)

    # Begin the message thread with the system message
    messages = [{'role': 'system', 'content': 'You are a helpful assistant who excels at generating randomized paragraph topics, purposes, and text content.'}]
    # Append data to datasets in a loop
    for i in range(num_students):
        grade = np.random.choice(np.arange(3,13))

        # Define the prompt
        prompt = develop_prompt(grade)
        messages.append({'role': 'user', 'content': prompt})

        # Prompt GPT with some randomness
        try:
            response = openai.ChatCompletion.create(model=model, messages=messages, max_tokens=max_tokens, 
                                                    seed=int(np.random.randint(0, 100_000)),
                                                    temperature=float(np.random.uniform(0.7, 1.3)))
        except APIError as e:
            print(f'Server error: {e}')
            # Drop the unanswered prompt so the thread never ends on a user message
            messages.pop()
            sleep(10)
            continue
        # Obtain response
        text = response['choices'][0]['message']['content']

        # Save response as part of the conversation (or else it repeats its responses)
        messages.append({'role': 'assistant', 'content': text})
        # Only keep the system message and the last n_messages back-and-forths
        if len(messages)>(2*n_messages):
            messages = [messages[0]] + messages[-int(2*n_messages):]

        # Resize datasets to accommodate new data
        grade_dset.resize((len(grade_dset) + 1,))
        text_dset.resize((len(text_dset) + 1,))

        # Assign data
        grade_dset[-1] = grade
        text_dset[-1] = text

        print(f'{i+1} complete...', end='\r')


62 complete...
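
The rolling conversation window used in the loop can also be written as a small standalone function, which makes it easier to test in isolation. This is a sketch under the assumption that `messages[0]` is the system message and the remaining entries alternate between user and assistant; the name `trim_history` is hypothetical.

```python
def trim_history(messages, n_exchanges):
    """Keep the system message plus the last n_exchanges user/assistant pairs.

    Assumes messages[0] is the system message and the rest alternate
    user, assistant, user, assistant, ...
    """
    system, turns = messages[0], messages[1:]
    recent = turns[-2 * n_exchanges:] if n_exchanges > 0 else []
    return [system] + recent


# Build a toy thread with five prompt/reply pairs
history = [{'role': 'system', 'content': 'sys'}]
for i in range(5):
    history.append({'role': 'user', 'content': f'prompt {i}'})
    history.append({'role': 'assistant', 'content': f'reply {i}'})

trimmed = trim_history(history, n_exchanges=3)
print(len(trimmed))           # 7: one system message + 3 pairs
print(trimmed[1]['content'])  # prompt 2
```

Bounding the history this way keeps each request short while still showing the model its most recent topics.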

## Inspecting the Generated Data

Finally, we can inspect the generated dataset by reading from the HDF5 file. This step is crucial for verifying the quality and diversity of our mock dataset.
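
Before reading individual entries, a quick count of samples per grade gives a first impression of how evenly the grades are covered. The sketch below uses a small stand-in list; with the real file the values could be read with `h5f['grade'][:]`. The helper name `grade_distribution` and its defaults are assumptions made for illustration.

```python
from collections import Counter


def grade_distribution(grades, lowest=3, highest=12):
    """Count samples per grade, including grades that have no samples."""
    counts = Counter(int(g) for g in grades)
    return {grade: counts.get(grade, 0) for grade in range(lowest, highest + 1)}


# Stand-in values, not taken from the generated dataset
example_grades = [4, 6, 6, 12, 3, 4, 4]
distribution = grade_distribution(example_grades)
for grade, count in distribution.items():
    print(f'Grade {grade}: {count}')
```

Since grades are drawn uniformly at random, the counts should be roughly equal once enough samples have been generated.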

In [7]:
with h5py.File('data/grades_texts_new.h5', 'r') as h5f:
    for i in range(10,20):
        print('\n# Grade: ', int(h5f['grade'][i]),'\n',)
        print(h5f['text'][i].decode('utf-8'))


# Grade:  4 

# Topic: My Favorite Hobby

# Purpose: To describe and share enthusiasm about a favorite hobby.

My favorite hobby is drawing. I love to pick up my colored pencils and create beautiful pictures. When I draw, I can let my imagination run wild and create anything I want. I draw people, animals, and even imaginary creatures that come straight from my mind. Drawing makes me feel happy and relaxed, like I'm in my own little world. It's a way for me to express myself and show others what I see in my imagination. I also enjoy looking at other people's drawings and learning new techniques. Drawing is a hobby that I can do anywhere, whether it's at home or even during a fun day at the park. It's a wonderful way for me to spend my free time and bring my ideas to life on paper.

# Grade:  6 

# Topic: The Importance of Friendship

# Purpose: To explain why having friends is important and how it positively impacts our lives.

Having friends is really important in our lives. Friends 