## Synthetic data with GPT-4o and OpenAI Batch service

In [34]:
import random, json, yaml
from string import Template

from dotenv import load_dotenv
from openai import OpenAI
import pandas as pd
from tqdm import tqdm

from fortune_500 import companies

load_dotenv("../.env")
random.seed(42)

In [36]:
client = OpenAI()

sector, company_name = random.choice(companies)

MODEL = "gpt-4o-2024-08-06"
SYSTEM = "You are an expert in data annotation for machine learning models, specifically in the areas of LLM and generative AI."

PROMPT = Template("""**Task Overview**
Your job is to create structured data in a specific format (YAML) that includes:

1. **Context:** This should be a collection of at least 10 paragraphs from quarterly and yearly reports of various companies in the S&P 500 list. The paragraphs can vary in length (10-15 sentences) and should contain both text and a table in markdown format with at least *10* rows and *5* columns. The table should be present in just one paragraph. The context should be rich enough to allow for detailed questions and answers. Talk about competitors in the context. Be creative while creating the context and questions so that it is different every time. 

2. **Questions:** You need to create a list of complex questions based on the context. These questions should require deep reasoning, involve multiple references, or be based on the information in the table. Include questions for each of these types:
- reasoning: questions that require synthesizing information from multiple paragraphs.
- cannot answer: questions that cannot be answered with the provided context. Say "I do not have the information to answer this question."
- tabular: questions that specifically ask for information from the table. Generate 3 questions which require different operations.
- extractive: questions that require extracting specific entities from the context.
- math: questions that involve different type of relevant math calculations.

3. **Answers:** Provide a concise answer based on the context for each question. If a question cannot be answered with the given information, state that you do not have the information. For math questions, show the calculations used to arrive at the answer.

# Schema of yaml output
Sector: $sector
Company: $company_name
Context: List[str] 
Questions : List[Tuple[type: str, question: str]] 
Answers: List[str]

Don't generate anything after generating the YAML output.
""")

print(PROMPT.substitute({'sector' : sector, 'company_name' : company_name}))

**Task Overview**
Your job is to create structured data in a specific format (YAML) that includes:

1. **Context:** This should be a collection of at least 10 paragraphs from quarterly and yearly reports of various companies in the S&P 500 list. The paragraphs can vary in length (10-15 sentences) and should contain both text and a table in markdown format with at least *10* rows and *5* columns. The table should be present in just one paragraph. The context should be rich enough to allow for detailed questions and answers. Talk about competitors in the context. Be creative while creating the context and questions so that it is different every time. 

2. **Questions:** You need to create a list of complex questions based on the context. These questions should require deep reasoning, involve multiple references, or be based on the information in the table. Include questions for each of these types:
- reasoning: questions that require synthesizing information from multiple paragraphs.
- c
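
The prompt above is filled in with `string.Template`, which treats every `$` as the start of a placeholder. A minimal sketch of that behaviour, using a hypothetical miniature template (not the real prompt) and a made-up company name: a literal dollar sign must be written as `$$`, and `substitute` raises `KeyError` when a placeholder is left unfilled.

```python
from string import Template

# Hypothetical miniature of the prompt template, for illustration only.
demo = Template("Sector: $sector\nCompany: $company_name\nBudget: $$5M")

# "$$" renders as a literal "$"; named placeholders are filled from keywords.
filled = demo.substitute(sector="Energy", company_name="Example Corp")
print(filled)

# substitute() refuses to render a template with a missing placeholder.
try:
    demo.substitute(sector="Energy")
    missing = None
except KeyError as err:
    missing = err.args[0]
print("missing placeholder:", missing)
```

This matters if financial text with dollar amounts is ever added to the template itself; values passed in through `substitute` are inserted as-is and need no escaping.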

In [75]:
def get_data(sector, company_name, system, prompt, model):
    """Fill the prompt template for one company and request a single completion."""
    return client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {
                "role": "user",
                "content": prompt.substitute(sector=sector, company_name=company_name),
            },
        ],
        temperature=1.0,
        max_tokens=4000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        response_format={"type": "text"},
    )

# Pass the template itself so the prompt is built for the company sampled here
response = get_data(*random.choice(companies), SYSTEM, PROMPT, model=MODEL)
print(response.choices[0].message.content)

### Create batches for generating synthetic data with GPT-4o

In [32]:
# Create a file with one chat-completion request per line
file_path = "../data/syn_data_rag/input/data.jsonl"
num_requests = 2000
print(file_path)
with open(file_path, "w") as f:
    for i in range(num_requests):
        sector, company_name = random.choice(companies)
        request = {
            "custom_id": str(i),
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": MODEL,
                "messages": [
                    {"role": "system", "content": SYSTEM},
                    {"role": "user", "content": PROMPT.substitute(sector=sector, company_name=company_name)},
                ],
                "max_tokens": 5000,
                "temperature": 1.0,
            },
        }
        f.write(json.dumps(request) + "\n")

# Upload the file to OpenAI (the context manager closes the handle afterwards)
with open(file_path, "rb") as f:
    batch_input_file_id = client.files.create(file=f, purpose="batch").id

# Create a batch
client.batches.create(
    input_file_id=batch_input_file_id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={"description": f"synthetic data, {num_requests} requests"},
)
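
Before uploading, it can help to sanity-check the request file locally. The sketch below is one possible check, not part of the original notebook: `check_batch_lines` and `REQUIRED_KEYS` are hypothetical names, and the rule set (valid JSON, the four top-level keys written above, unique `custom_id`) is an assumption about what a well-formed request line needs.

```python
import json

# Top-level keys written for every request in the cell above.
REQUIRED_KEYS = {"custom_id", "method", "url", "body"}

def check_batch_lines(lines):
    """Return a list of problems found in an iterable of JSONL request lines."""
    problems, seen_ids = [], set()
    for n, line in enumerate(lines, start=1):
        try:
            request = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {n}: not valid JSON")
            continue
        if not isinstance(request, dict):
            problems.append(f"line {n}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - request.keys()
        if missing:
            problems.append(f"line {n}: missing keys {sorted(missing)}")
        custom_id = request.get("custom_id")
        if custom_id in seen_ids:
            problems.append(f"line {n}: duplicate custom_id {custom_id!r}")
        seen_ids.add(custom_id)
    return problems

# Small in-memory demo with made-up request lines.
good = {"custom_id": "0", "method": "POST", "url": "/v1/chat/completions", "body": {}}
sample_lines = [
    json.dumps(good),
    json.dumps(good),                                   # same custom_id again
    json.dumps({"custom_id": "1", "method": "POST"}),   # no url, no body
    "not json",
]
problems = check_batch_lines(sample_lines)
for problem in problems:
    print(problem)
```

On the real file, the same function can be called as `check_batch_lines(open(file_path))`; an empty list means no problems were found by these checks.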

In [None]:
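# Download the results of the most recent batch
# (output_file_id stays None until the batch has completed)
batch = client.batches.list(limit=1).data[0]
file_response = client.files.content(batch.output_file_id)
with open("../data/syn_data_rag/output/test.jsonl", "w") as f:
    f.write(file_response.text)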
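
Not every line of a batch output file is guaranteed to hold a successful completion. The sketch below shows one defensive way to pull the message text out of a line; `extract_content` is a hypothetical helper, and the `error` and `status_code` fields are assumed to follow the batch output format, so verify them against a real output file before relying on this.

```python
import json

def extract_content(line):
    """Return the assistant message text from one batch output line, or None."""
    record = json.loads(line)
    response = record.get("response") or {}
    # Skip requests that failed outright or returned a non-200 status.
    if record.get("error") or response.get("status_code") != 200:
        return None
    choices = (response.get("body") or {}).get("choices") or []
    if not choices:
        return None
    return (choices[0].get("message") or {}).get("content")

# Made-up output lines for illustration.
ok_line = json.dumps({
    "custom_id": "0",
    "response": {
        "status_code": 200,
        "body": {"choices": [{"message": {"content": "Sector: Energy"}}]},
    },
    "error": None,
})
failed_line = json.dumps({
    "custom_id": "1",
    "response": None,
    "error": {"code": "server_error", "message": "hypothetical failure"},
})

print(extract_content(ok_line))
print(extract_content(failed_line))
```

Returning `None` for unusable lines lets the caller count and skip failures instead of stopping on the first bad record.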

### Create prompts and responses from the batch outputs

#### Output schema validation

In [6]:
from typing import List
from pydantic import BaseModel, ValidationError

class Question(BaseModel):
    type: str
    question: str

class Schema(BaseModel):
    Sector: str
    Company: str
    Context: List[str]
    Questions: List[Question]
    Answers: List[str]

def validate_dict(data: dict) -> bool:
    """Return True if `data` matches Schema, False otherwise."""
    try:
        Schema(**data)
        return True
    except ValidationError:
        return False

#### Data transformation

In [28]:
# format the context as a numbered list
def get_context_for_prompt(x):
    return ''.join(f"\n{i+1}. {p}" for i, p in enumerate(x))

# read the file
with open("../data/syn_data_rag/output/data.jsonl", "r") as f:
    data = f.readlines()

# parse the data
data_formatted = []
for line in data:
    try:
        content = json.loads(line)['response']['body']['choices'][0]['message']['content']
        # safe_load never constructs arbitrary Python objects from model output
        d = yaml.safe_load(content.replace("```yaml\n", "").replace("\n```", ""))
        if validate_dict(d):
            for question, answer in zip(d['Questions'], d['Answers']):
                random.shuffle(d['Context']) # shuffle every time to create variability
                context = get_context_for_prompt(d['Context'])
                data_formatted.append((d['Sector'], context,  question['type'], question['question'], answer))
    except (yaml.YAMLError, KeyError, IndexError, TypeError, AttributeError):
        # skip failed requests and outputs that are not valid YAML
        continue

# create a dataframe
df = pd.DataFrame(data_formatted, columns=['sector', 'context', 'question_type', 'question', 'answer'])
print(len(df))
df.head()

14018


| | sector | context | question_type | question | answer |
|---|---|---|---|---|---|
| 0 | Financials | \n1. Moody's risk assessment division has play... | reasoning | How has Moody's Corporation leveraged its geog... | Moody's has focused its geographical expansion... |
| 1 | Financials | \n1. Moody's internal forecasts suggest contin... | cannot answer | What are the CEO's personal views on the curre... | I do not have the information to answer this q... |
| 2 | Financials | \n1. Moody's internal forecasts suggest contin... | tabular | What is the year-over-year growth percentage i... | 15.38% |
| 3 | Financials | \n1. In terms of financial performance, Moody'... | tabular | Based on the table, which metric showed a nega... | The Equity Ratio showed a negative growth from... |
| 4 | Financials | \n1. During the annual shareholders' meeting, ... | tabular | Calculate the average Earnings per Share for Q... | The average Earnings per Share for Q1 and Q2 o... |
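
The parsing cell removes the Markdown fence around the YAML with two exact string replacements, which only works when the model writes the fence in precisely that form. A more tolerant variant could look like the sketch below; `strip_code_fence` is a hypothetical helper, and the accepted fence labels (`yaml`, `yml`, or none) are an assumption about what the model may emit.

```python
import re

# Build the three-backtick marker programmatically to keep this listing readable.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"(?:yaml|yml)?[ \t]*\n(.*?)\n" + TICKS, re.DOTALL)

def strip_code_fence(text):
    """Return the body of the first fenced block, or the stripped text if unfenced."""
    match = FENCE.search(text)
    return match.group(1) if match else text.strip()

fenced = TICKS + "yaml\nSector: Energy\nCompany: Example Corp\n" + TICKS
with_trailing_prose = TICKS + "\nSector: Energy\n" + TICKS + "\nLet me know if you need more."
unfenced = "Sector: Energy\n"

print(strip_code_fence(fenced))
print(strip_code_fence(with_trailing_prose))
print(strip_code_fence(unfenced))
```

The result can then be passed to the YAML loader in place of the chained `replace` calls.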


In [29]:
df.question_type.value_counts()

question_type
tabular          4688
math             3163
extractive       2812
reasoning        1739
cannot answer    1587
cannot_answer      29
Name: count, dtype: int64

In [30]:
df.sector.value_counts()

sector
Information Technology    2213
Health Care               1762
Financials                1754
Consumer Discretionary    1667
Industrials               1627
Consumer Staples          1165
Real Estate                973
Technology                 780
Materials                  616
Communication Services     614
Energy                     415
Utilities                  397
Healthcare                  26
Telecommunications           9
Name: count, dtype: int64
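
Both counts show that the model did not always use the same spelling for a label: `cannot_answer` appears next to `cannot answer`, and `Healthcare` next to `Health Care`. If these should count as one category, they could be normalized before saving. The sketch below is one way to do that; the helper names and the alias table are hypothetical, and merging `Technology` and `Telecommunications` into the larger sectors is an assumption about intent, not something the notebook states.

```python
# Hypothetical alias table; the merges below are assumptions, adjust as needed.
SECTOR_ALIASES = {
    "Healthcare": "Health Care",
    "Technology": "Information Technology",
    "Telecommunications": "Communication Services",
}

def normalize_question_type(label):
    """Lower-case the label and use spaces instead of underscores."""
    return label.strip().lower().replace("_", " ")

def normalize_sector(label):
    """Map known spelling variants onto one canonical sector name."""
    label = label.strip()
    return SECTOR_ALIASES.get(label, label)

print(normalize_question_type("cannot_answer"))
print(normalize_sector("Healthcare"))
print(normalize_sector("Energy"))
```

With the dataframe from above, these could be applied column-wise, for example `df["sector"] = df["sector"].map(normalize_sector)` and `df["question_type"] = df["question_type"].map(normalize_question_type)`.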

In [31]:
df.to_parquet("../data/syn_data_rag/extracted/data.parquet", index=False)