# Synthetic Data Generation

This notebook demonstrates how to generate synthetic data for the Claire text classification task.

The data is generated with the OpenAI API, using the `gpt-3.5-turbo-0125` model.

## Setup Local Environment

Install the following packages to run the code locally.

```bash
pip install openai==1.23.2
pip install python-dotenv==1.0.1
```

Set the value of `OPENAI_API_KEY` in the `.env` file before running the code.

## Import required Libraries

In [49]:
from openai import OpenAI

import os 
from dotenv import load_dotenv
import json
from time import sleep

## Define the prompt and function to generate synthetic data

In [None]:
## Load the API key from the .env file
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

In [50]:
## Initialize the OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)

In [189]:
def generate_text(prompt: str, seed: int = 0):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        response_format={ "type": "json_object" },
        seed=seed
    )
    return response.choices[0].message

In [212]:
PROMPT = """Text Classification Dataset Creation

Generate a JSON-format text classification dataset with entries comprising "text" and "label" keys.

### Be innovative while coming up with examples and think of real-world scenarios.
###  I will use the same prompt for all the examples so make sure that you are not repeating anything.

Clare, our Conversational AI, interacts with users via WhatsApp and calls to offer personalized mental health support. In WhatsApp conversations, Clare distinguishes between:

1. Regular Conversations: Mental health discussions, exercises, etc.
2. Product-related Conversations: Queries about Clare's functionalities, communication methods (phone or WhatsApp), etc.
3. Subscription-related Conversations: Questions regarding user data retrieval, such as trial duration.
4. Suicide: Detection of user expressions indicating active contemplation or planning, requiring redirection to clinically-approved conversational protocols.
5. Non-mental Health Topics: Gently guiding conversations back to mental health when users delve into unrelated topics like Clare's movie preferences.

"""

## Generate Synthetic Data and Parse the Output into JSON format

In [213]:
## Check if it works or not 
output = generate_text(PROMPT, seed=1)
json.loads(output.content)

In [217]:
NUM_CALLS = 100  # Number of calls to the OpenAI API, i.e. roughly the number of examples per class

In [218]:
raw_outputs = [] # Stores raw output returned from OpenAI API calls
json_objects = [] # Stores properly formatted json strings
failed_indexes = [] # Stores indexes that aren't properly formatted json strings

In [233]:
for num in range(NUM_CALLS):
    output = generate_text(PROMPT, seed=num)
    raw_outputs.append(output.content)
    try:
        json_object = json.loads(output.content)
        json_objects.append(json_object)
    except Exception as e:
        failed_indexes.append(num)
        print(f"Error processing output {num}")
    sleep(2) # Sleep for 2 seconds to avoid rate limiting
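
The fixed two-second sleep above is one way to stay under the rate limit. An alternative is to retry a failed call with exponentially growing delays. The following is a minimal sketch of that idea, not part of the original notebook: `call_with_backoff` and its parameters are hypothetical names, and it catches every exception for simplicity, whereas in practice one would probably narrow it to the client's rate-limit error.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep_fn: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying failed attempts with exponentially growing delays."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            # Give up after the last allowed attempt.
            if attempt == max_retries - 1:
                raise
            # Wait base_delay * 2^attempt seconds, plus a little random jitter.
            sleep_fn(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("max_retries must be at least 1")
```

Inside the loop, the API call could then be wrapped as `call_with_backoff(lambda: generate_text(PROMPT, seed=num))`.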

In [238]:
## Let's extract the data sample from the JSON object
data_samples = []
for obj in json_objects:
    for key in obj.keys():
        data_samples.append(obj[key])

## Visualize the Synthetic Data

In [138]:
import pandas as pd

In [241]:
all_df = []
for idx, json_data in enumerate(data_samples):
    print(f"Processing JSON data {idx}")
    try:
        cur_df = pd.DataFrame(json_data)
        all_df.append(cur_df)
    except Exception as e:
        print(f"Error processing JSON data {idx}")
all_df = pd.concat(all_df, ignore_index=True)

Processing JSON data 0
Processing JSON data 1
Processing JSON data 2
Processing JSON data 3
Processing JSON data 4
Processing JSON data 5
Processing JSON data 6
Processing JSON data 7
Processing JSON data 8
Processing JSON data 9
Processing JSON data 10
Processing JSON data 11
Processing JSON data 12
Processing JSON data 13
Processing JSON data 14
Processing JSON data 15
Processing JSON data 16
Processing JSON data 17
Processing JSON data 18
Processing JSON data 19
Processing JSON data 20
Processing JSON data 21
Processing JSON data 22
Processing JSON data 23
Processing JSON data 24
Processing JSON data 25
Processing JSON data 26
Processing JSON data 27
Processing JSON data 28
Processing JSON data 29
Processing JSON data 30
Processing JSON data 31
Processing JSON data 32
Processing JSON data 33
Processing JSON data 34
Processing JSON data 35
Processing JSON data 36
Processing JSON data 37
Processing JSON data 38
Processing JSON data 39
Processing JSON data 40
Processing JSON data 41
...

Some JSON objects could not be converted to a pandas DataFrame because they were not in the expected format.
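
One way to make the conversion more robust is to flatten every response into a plain list of `{"text", "label"}` records before building the DataFrame. The sketch below assumes the responses are either a single example, a list of examples, or a dict that wraps such a list under some key. The notebook does not show which shapes actually occurred, so this is an assumption, and `extract_records` is a hypothetical helper.

```python
from typing import Any, Dict, List


def extract_records(obj: Any) -> List[Dict[str, str]]:
    """Recursively collect every {"text": ..., "label": ...} dict found in obj."""
    records: List[Dict[str, str]] = []
    if isinstance(obj, dict):
        text, label = obj.get("text"), obj.get("label")
        if isinstance(text, str) and isinstance(label, str):
            # The dict is itself one example.
            records.append({"text": text, "label": label})
        else:
            # Otherwise look inside its values (e.g. {"dataset": [...]}).
            for value in obj.values():
                records.extend(extract_records(value))
    elif isinstance(obj, list):
        for item in obj:
            records.extend(extract_records(item))
    # Strings, numbers and None carry no examples and yield an empty list.
    return records
```

The result can be passed directly to `pd.DataFrame(extract_records(obj))`, since a list of flat dicts converts cleanly to rows.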

In [244]:
all_df.shape

(529, 2)

In [245]:
## Distribution of the classes
all_df['label'].value_counts()

label
Product-related Conversations         109
Regular Conversations                 105
Subscription-related Conversations    104
Suicide                               104
Non-mental Health Topics              103
Regular Conversation                    1
Product-related Conversation            1
Subscription-related Conversation       1
Non-mental Health Topic                 1
Name: count, dtype: int64

Some labels appear in the singular form (e.g. `Regular Conversation` instead of `Regular Conversations`), so we need to map them to the correct class names.

In [246]:
new_df = all_df.copy()

In [247]:
# Define a dictionary mapping the duplicate labels to their originals
duplicate_mapping = {
    'Regular Conversation': 'Regular Conversations',
    'Product-related Conversation': 'Product-related Conversations',
    'Subscription-related Conversation': 'Subscription-related Conversations',
    'Non-mental Health Topic': 'Non-mental Health Topics'
}

# Use map function to replace duplicate labels with their originals
new_df['label'] = new_df['label'].map(duplicate_mapping).fillna(new_df['label'])

In [248]:
new_df['label'].value_counts()

label
Product-related Conversations         110
Regular Conversations                 106
Subscription-related Conversations    105
Suicide                               104
Non-mental Health Topics              104
Name: count, dtype: int64
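
The hand-written mapping above covers the four variants seen in this run. A more general option is to compare each label against the list of canonical class names while ignoring case, surrounding whitespace and a trailing "s". The following is only a sketch of that idea: `CANONICAL_LABELS` and `normalize_label` are hypothetical names, and the class list is taken from the prompt and the counts above.

```python
from typing import Optional

CANONICAL_LABELS = [
    "Regular Conversations",
    "Product-related Conversations",
    "Subscription-related Conversations",
    "Suicide",
    "Non-mental Health Topics",
]


def _key(value: str) -> str:
    # Compare case-insensitively, ignoring surrounding spaces and trailing "s" characters.
    return value.strip().lower().rstrip("s")


def normalize_label(label: str) -> Optional[str]:
    """Map a label variant to its canonical class name, or None if it is unknown."""
    lookup = {_key(name): name for name in CANONICAL_LABELS}
    return lookup.get(_key(label))
```

With pandas this could be applied as `new_df['label'].map(normalize_label)`. Unknown labels come back as missing values, which makes them easy to spot and review.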

In [257]:
new_df.shape

(529, 2)

We managed to generate a dataset of 529 samples.

#### Check if there are duplicates in the dataset

In [262]:
## Count the unique texts of one class and look at the first five

regular_conversations = new_df[new_df['label'] == 'Regular Conversations']
print(len(regular_conversations['text'].unique()))
print(regular_conversations['text'].unique()[:5])

83
["I've been feeling really down lately and struggling with my emotions."
 "Hi Clare, I have been feeling really anxious lately and I don't know how to cope."
 "Hi Clare, I've been feeling really down lately and I'm not sure how to cope with it."
 "Hi Clare, I've been feeling really down lately and I don't know how to cope."
 'Hi Clare, how are you today?']


We can see that the generated samples are very repetitive: only 83 of the 106 texts in this class are unique, and most of them say nearly the same thing.

So, we need to come up with a better prompt to generate more diverse samples.
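
Counting exact duplicates with `unique()` misses texts that differ by only a few words. To put a rough number on the repetition, one could compare the texts pairwise with the standard library's `difflib`. The sketch below is an illustration, not part of the original analysis: `near_duplicate_pairs` and the `0.8` threshold are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple


def near_duplicate_pairs(
    texts: List[str], threshold: float = 0.8
) -> List[Tuple[str, str, float]]:
    """Return pairs of texts whose similarity ratio is at least threshold."""
    pairs = []
    for first, second in combinations(texts, 2):
        # ratio() is 1.0 for identical strings and 0.0 when nothing matches.
        ratio = SequenceMatcher(None, first.lower(), second.lower()).ratio()
        if ratio >= threshold:
            pairs.append((first, second, round(ratio, 2)))
    return pairs
```

For example, `near_duplicate_pairs(list(regular_conversations['text'].unique()))` would list the texts that are almost identical. The comparison is quadratic in the number of texts, which is acceptable for about a hundred texts per class.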

In [263]:
non_mental_health_topics = new_df[new_df['label'] == 'Non-mental Health Topics']
print(len(non_mental_health_topics['text'].unique()))
print(non_mental_health_topics['text'].unique()[:5])

88
['Have you watched any good movies lately? I need some recommendations.'
 "What's your favorite movie, Clare?"
 'Hey Clare, have you watched any good movies lately?'
 'So, Clare, do you like action movies or more of a rom-com person?'
 'Do you like watching movies, Clare?']


The samples are again very repetitive: they are all about movies, likely because the only example the prompt gives for this class is Clare's movie preferences.

So, we need to come up with a better prompt to generate more diverse samples.

In [265]:
suicide_conversations = new_df[new_df['label'] == "Suicide"]
print(len(suicide_conversations['text'].unique()))
print(suicide_conversations['text'].unique()[:5])

101
["I don't see a way out of this darkness. I need help."
 "I don't see the point in living anymore. I feel like ending everything."
 "I'm feeling so overwhelmed and I can't see a way out."
 "I'm feeling like there's no point in going on anymore..."
 "I can't bear this pain anymore, Clare. Everything feels pointless."]


We can see that the conversations indicate suicidal thoughts.

However, labeling all of these conversations as suicidal seems somewhat exaggerated.

In [266]:
product_conversations = new_df[new_df['label'] == "Product-related Conversations"]
len(product_conversations['text'].unique())
print(product_conversations['text'].unique()[:5])

['How do I schedule a call with Clare? I prefer to talk over the phone.'
 'How can I schedule a call with Clare to discuss my mental health concerns?'
 "How can I change my notification settings in Clare's app?"
 'How do I change my notification settings on Clare?'
 'Does Clare have a feature that allows me to set reminders for self-care activities?']


We can see that these are all product-related queries.

In [267]:
subscription_conversations = new_df[new_df['label'] == 'Subscription-related Conversations']
print(len(subscription_conversations['text'].unique()))
print(subscription_conversations['text'].unique()[:5])

105
['Can I download all my chat data for analysis purposes?'
 'Can you provide me with details about the subscription plans available for user data access?'
 "I want to know how long the free trial period is for Clare's services."
 "Can I cancel my subscription to Clare's service? I no longer need it."
 'How can I access my data after my trial period ends?']


We can see that these are all subscription-related queries.

## Save the final dataset to a CSV file

In [268]:
unique_texts_df = new_df.drop_duplicates(subset=['text'])

In [277]:
unique_texts_df.to_csv('claire_text_classification_data.csv', index=False)

### Upload the dataset to Hugging Face 


You can find the final dataset on Hugging Face Datasets [here](https://huggingface.co/datasets/shub-kris/claire-dataset).

I haven't included the code to upload the dataset to Hugging Face Hub to keep the notebook clean.

## What can be improved?

This notebook showed a simple example of generating synthetic data with the OpenAI API. There are several ways to improve the quality of the generated data:

- Use more detailed prompts and tune sampling parameters such as temperature to improve the quality and diversity of the generated data.
- Use a different model to generate the data, e.g. GPT-4, which is more capable than `gpt-3.5-turbo`.
- Vary the prompt across calls, for example one prompt per class or topic, instead of reusing a single prompt.
- We can generate more synthetic data by increasing the number of iterations.
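
As an illustration of the prompt-related ideas above, the sketch below builds several prompt variants that each steer one class toward a different topic. Everything in it is hypothetical: the `TOPIC_HINTS` lists, the `build_prompts` helper and the appended instruction wording are assumptions, not something the notebook used.

```python
import random
from typing import List

# Hypothetical topic hints; these lists are illustrative only.
TOPIC_HINTS = {
    "Regular Conversations": ["sleep problems", "work stress", "loneliness", "exam anxiety"],
    "Non-mental Health Topics": ["cooking", "football", "travel plans", "weather"],
}


def build_prompts(base_prompt: str, label: str, n_prompts: int, seed: int = 0) -> List[str]:
    """Create n_prompts variants of base_prompt, each steering one class toward a topic."""
    rng = random.Random(seed)  # Seeded so the same prompts can be regenerated.
    hints = TOPIC_HINTS[label]
    prompts = []
    for _ in range(n_prompts):
        topic = rng.choice(hints)
        prompts.append(
            f"{base_prompt}\n"
            f'Generate examples only for the class "{label}" '
            f"and focus on the topic: {topic}."
        )
    return prompts
```

Each variant could then be passed to `generate_text` in place of the single `PROMPT`. The `temperature` argument of `client.chat.completions.create` is another knob to experiment with, although which value works best here is untested.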



#### Note: I ran the code multiple times to generate the data. 