## Synthetic data with GPT-4o and OpenAI Batch service

In [8]:
import random, json, yaml
from string import Template

from dotenv import load_dotenv
from openai import OpenAI
import pandas as pd
from tqdm import tqdm

from fortune_500 import companies

load_dotenv("../.env")
random.seed(42)

In [2]:
companies[:3]

[('Industrials', '3M'),
 ('Industrials', 'A. O. Smith'),
 ('Health Care', 'Abbott Laboratories')]

In [74]:
sector, company_name = random.choice(companies)

MODEL = "gpt-4o-2024-08-06"
SYSTEM = "You are an expert in data annotation for machine learning models, specifically in the areas of LLM and generative AI."

PROMPT = Template("""**Task Overview**
Your job is to create structured data in a specific format (YAML) that includes:

1. **Context:** This should be a collection of at least 10 paragraphs from quarterly and yearly reports of various companies in the S&P 500 list. The paragraphs can vary in length (10-15 sentences) and should contain both text and a table in markdown format with at least *10* rows and *5* columns. The table should be present in just one paragraph. The context should be rich enough to allow for detailed questions and answers. Talk about competitors in the context. Be creative while creating the context and questions so that it is different every time. 

2. **Questions:** You need to create a list of complex questions based on the context. These questions should require deep reasoning, involve multiple references, or be based on the information in the table. Include questions for each of these types:
- reasoning: questions that require synthesizing information from multiple paragraphs.
- cannot answer: questions that cannot be answered with the provided context. Say "I do not have the information to answer this question."
- tabular: questions that specifically ask for information from the table. Generate 3 questions which require different operations.
- extractive: questions that require extracting specific entities from the context.
- math: questions that involve different type of relevant math calculations.

3. **Answers:** Provide a concise answer based on the context for each question. If a question cannot be answered with the given information, state that you do not have the information. For math questions, show the calculations used to arrive at the answer.

# Schema of yaml output
Sector: $sector
Company: $company_name
Context: List[str] 
Questions : List[Tuple[type: str, question: str]] 
Answers: List[str]

Don't generate anything after generating the YAML output.
""")

print(PROMPT.substitute({'sector' : sector, 'company_name' : company_name}))

**Task Overview**
Your job is to create structured data in a specific format (YAML) that includes:

1. **Context:** This should be a collection of at least 10 paragraphs from quarterly and yearly reports of various companies in the S&P 500 list. The paragraphs can vary in length (10-15 sentences) and should contain both text and a table in markdown format with at least *10* rows and *5* columns. The table should be present in just one paragraph. The context should be rich enough to allow for detailed questions and answers. Talk about competitors in the context. Be creative while creating the context and questions so that it is different every time. 

2. **Questions:** You need to create a list of complex questions based on the context. These questions should require deep reasoning, involve multiple references, or be based on the information in the table. Include questions for each of these types:
- reasoning: questions that require synthesizing information from multiple paragraphs.
- c
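
The prompt relies on `string.Template`, which fills `$sector` and `$company_name` at request time. The following is a minimal sketch of that mechanism; the two-line `demo` template is a hypothetical stand-in for the much longer `PROMPT`.

```python
from string import Template

# Hypothetical short template; the real PROMPT is much longer.
demo = Template("Sector: $sector\nCompany: $company_name")

filled = demo.substitute(sector="Industrials", company_name="3M")
print(filled)

# substitute() raises KeyError when a placeholder has no value,
# which surfaces a misspelled or forgotten placeholder immediately.
try:
    demo.substitute(sector="Industrials")
    missing = None
except KeyError as err:
    missing = err.args[0]
print(missing)
```

Because `$` marks a placeholder, a literal dollar sign inside a `Template` string has to be written as `$$`.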

In [75]:
client = OpenAI()

def get_data(sector, company_name, system, prompt, model):
    return client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": [{"text": system, "type": "text"}],
            },
            {
                "role": "user",
                "content": [{"text": prompt, "type": "text"}],
            },
        ],
        temperature=1.0,
        max_tokens=4000,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        response_format={"type": "text"},
    )

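# Build the prompt for the company sampled earlier and pass that same pair on,
# so the call arguments and the prompt text refer to one company.
user_prompt = PROMPT.substitute({'sector' : sector, 'company_name' : company_name})
response = get_data(sector, company_name, SYSTEM, user_prompt, model=MODEL)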
print(response.choices[0].message.content)

### Create batches for generating synthetic data with GPT-4o

In [78]:
for i in tqdm(range(5)):
    # Create a file with requests
    file_path = f"../data/syn_data_rag/input/{i}.jsonl"
    print(file_path)
    with open(file_path, "w") as f:
        # Use a separate index j so the batch index i is not overwritten
        for j in range(2000):
            sector, company_name = random.choice(companies)
            request = {"custom_id": str(j),
                    "method": "POST", "url": "/v1/chat/completions",
                    "body": {"model": MODEL,
                            "messages": [{"role": "system", "content": SYSTEM},
                                        {"role": "user", "content": PROMPT.substitute({'sector' : sector, 'company_name' : company_name})}],
                            "max_tokens": 5000,
                            "temperature": 1.0,}}
            f.write(json.dumps(request))
            f.write("\n")

    # Upload the file to OpenAI (the context manager closes the handle afterwards)
    with open(file_path, "rb") as f:
        batch_input_file_id = client.files.create(file=f, purpose="batch").id

    # Create a batch
    client.batches.create(
        input_file_id=batch_input_file_id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
        metadata={
        "description": f"synthetic data {i}"
        }
    )

  0%|          | 0/5 [00:00<?, ?it/s]

../data/syn_data_rag/input/0.jsonl


 20%|██        | 1/5 [00:01<00:06,  1.74s/it]

../data/syn_data_rag/input/1.jsonl


 40%|████      | 2/5 [00:03<00:04,  1.55s/it]

../data/syn_data_rag/input/2.jsonl


 60%|██████    | 3/5 [00:04<00:02,  1.41s/it]

../data/syn_data_rag/input/3.jsonl


 80%|████████  | 4/5 [00:05<00:01,  1.23s/it]

../data/syn_data_rag/input/4.jsonl


100%|██████████| 5/5 [00:06<00:00,  1.31s/it]
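
Before uploading, it can help to sanity-check each request file. The sketch below assumes every line should be a JSON object with the four keys used in the loop above and a `custom_id` that is unique within the file. The function name `check_batch_lines` and the sample lines are hypothetical.

```python
import json

REQUIRED_KEYS = {"custom_id", "method", "url", "body"}

def check_batch_lines(lines):
    """Return a list of problems found in an iterable of JSONL request lines."""
    problems = []
    seen_ids = set()
    for line_no, line in enumerate(lines, start=1):
        try:
            request = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {line_no}: not valid JSON")
            continue
        if not isinstance(request, dict):
            problems.append(f"line {line_no}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - request.keys()
        if missing:
            problems.append(f"line {line_no}: missing {sorted(missing)}")
        custom_id = request.get("custom_id")
        if custom_id is not None:
            if custom_id in seen_ids:
                problems.append(f"line {line_no}: duplicate custom_id {custom_id!r}")
            seen_ids.add(custom_id)
    return problems

# Made-up lines that exercise each check.
sample_lines = [
    json.dumps({"custom_id": "0", "method": "POST", "url": "/v1/chat/completions", "body": {}}),
    json.dumps({"custom_id": "0", "method": "POST", "url": "/v1/chat/completions", "body": {}}),
    json.dumps({"custom_id": "1", "method": "POST"}),
    "not json",
]
problems = check_batch_lines(sample_lines)
for problem in problems:
    print(problem)
```

For a real file, the same function could be called as `check_batch_lines(open(file_path))`, since a file object iterates line by line.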


In [81]:
dict(client.batches.list(limit=5).data[4])

{'id': 'batch_uxYswjyWz88KH5Q4X0O6gTkR',
 'completion_window': '24h',
 'created_at': 1724232268,
 'endpoint': '/v1/chat/completions',
 'input_file_id': 'file-w2yMjIvUoZagp6d0yzCRRamG',
 'object': 'batch',
 'status': 'completed',
 'cancelled_at': None,
 'cancelling_at': None,
 'completed_at': 1724233617,
 'error_file_id': None,
 'errors': None,
 'expired_at': None,
 'expires_at': 1724318668,
 'failed_at': None,
 'finalizing_at': 1724233512,
 'in_progress_at': 1724232270,
 'metadata': {'description': 'synthetic data 0'},
 'output_file_id': 'file-kNQ1IFgwmmKyDPyVNjZ6Hc94',
 'request_counts': BatchRequestCounts(completed=2000, failed=0, total=2000)}
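
A batch runs asynchronously, so its output file only exists once `status` reaches a terminal value. The helper below is a rough sketch of polling `client.batches.retrieve` until that happens. `wait_for_batch`, its parameters, and the `_FakeBatches` stand-in are all hypothetical; the fake exists only so the sketch runs without network access.

```python
import time
from types import SimpleNamespace

# Statuses after which a batch will not change any further.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(client, batch_id, poll_seconds=60, max_polls=1440, sleep=time.sleep):
    """Poll client.batches.retrieve until the batch reaches a terminal status."""
    for _ in range(max_polls):
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        sleep(poll_seconds)
    raise TimeoutError(f"batch {batch_id} still running after {max_polls} polls")

class _FakeBatches:
    """Offline stand-in for client.batches, used only to exercise the helper."""
    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def retrieve(self, batch_id):
        return SimpleNamespace(id=batch_id, status=next(self._statuses))

fake_client = SimpleNamespace(batches=_FakeBatches(["validating", "in_progress", "completed"]))
result = wait_for_batch(fake_client, "batch_demo", poll_seconds=0)
print(result.status)
```

With the real `client`, the returned object carries `output_file_id` when the status is `completed`; for the other terminal statuses it is worth inspecting `errors` and `error_file_id` before reading any output.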

In [None]:
output_file_id = client.batches.retrieve("batch_IxzjwViEMKxbwbPhjDfzzzsf").output_file_id
file_response = client.files.content(output_file_id)
with open("../data/syn_data_rag/output/test.jsonl", "w+") as f:
    f.write(file_response.text)

### Create prompts and responses using the prompt and outputs

#### Output schema validation

In [16]:
from typing import List
from pydantic import BaseModel, ValidationError

class Question(BaseModel):
    type: str
    question: str

class Schema(BaseModel):
    Sector: str
    Company: str
    Context: List[str]
    Questions: List[Question]
    Answers: List[str]

def validate_dict(data: dict) -> bool:
    """Return True if data matches Schema, False otherwise."""
    try:
        Schema(**data)
        return True
    except ValidationError:
        return False

#### Prompt formatting

In [26]:
def get_context_for_prompt(x):
    context = ''
    for i, p in enumerate(x):
        context += f"\n{i+1}. {p}"
    return context

instruction = "Answer the question using the information in the context."
template = """Context: {context}

Task: {instruction}
Question: {question}"""
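
To make the resulting prompt layout concrete, here is a small self-contained example. The helper and the template mirror the cell above so the block runs alone; the two paragraphs and the question are made up for illustration.

```python
def get_context_for_prompt(paragraphs):
    """Number the paragraphs from 1, putting each on its own line."""
    return "".join(f"\n{i}. {p}" for i, p in enumerate(paragraphs, start=1))

instruction = "Answer the question using the information in the context."
template = """Context: {context}

Task: {instruction}
Question: {question}"""

# Made-up paragraphs standing in for a generated context.
paragraphs = ["Revenue grew 5% year over year.", "Operating margin was 12%."]

prompt = template.format(
    context=get_context_for_prompt(paragraphs),
    instruction=instruction,
    question="What was the operating margin?",
)
print(prompt)
```

Since every paragraph is prefixed with a newline, the numbered list starts on the line after `Context:`.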

#### Data transformation

In [27]:
# read the file
with open("../data/syn_data_rag/output/test.jsonl", "r") as f:
    data = f.readlines()

# parse the data
data_formatted = []
for line in data:
    # strip the markdown fence that the model may wrap around the YAML
    d = json.loads(line)['response']['body']['choices'][0]['message']['content'].replace("```yaml\n", "").replace("\n```", "")
    try:
        d = yaml.safe_load(d)  # safe_load, because the text is untrusted model output
        if validate_dict(d):
            for question, answer in zip(d['Questions'], d['Answers']):
                random.shuffle(d['Context']) # shuffle every time to create variability
                context = get_context_for_prompt(d['Context'])
                data_formatted.append((d['Sector'], d['Company'], context,  question['type'], question['question'], answer))
    except (yaml.YAMLError, TypeError, ValueError):
        # skip responses that are not valid YAML or do not parse to a mapping
        continue

# create a dataframe
df = pd.DataFrame(data_formatted, columns=['sector', 'company', 'context', 'question_type', 'question', 'answer'])
df["prompt"] = df.apply(lambda x: template.format(instruction=instruction, question=x["question"], context=x["context"]), axis=1)
df.head()

| | sector | company | context | question_type | question | answer | prompt |
|---|---|---|---|---|---|---|---|
| 0 | Financials | Everest Re | \n1. The company's management stated that the ... | reasoning | How does Everest Re's approach to digital infr... | Everest Re is investing 20% more in its digita... | Context: \n1. The company's management stated ... |
| 1 | Financials | Everest Re | \n1. Everest Re's competitors, such as Munich ... | cannot answer | What is Everest Re's projected dividend yield ... | I do not have the information to answer this q... | Context: \n1. Everest Re's competitors, such a... |
| 2 | Financials | Everest Re | \n1. Looking ahead, Everest Re has ambitious p... | tabular | What was the percentage change in Everest Re's... | The percentage change in net income was 30%, s... | Context: \n1. Looking ahead, Everest Re has am... |
| 3 | Financials | Everest Re | \n1. Everest Re's Corporate Responsibility ini... | tabular | Calculate the difference in claims payouts bet... | The claims payouts increased by $50M (650 to 7... | Context: \n1. Everest Re's Corporate Responsib... |
| 4 | Financials | Everest Re | \n1. The macroeconomic environment poses both ... | tabular | Identify the financial parameter with the high... | Underwriting Profit has the highest percentage... | Context: \n1. The macroeconomic environment po... |


In [28]:
df.question_type.value_counts()

question_type
tabular          2220
math             1501
extractive       1322
reasoning         801
cannot answer     744
cannot_answer      19
Name: count, dtype: int64
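
The counts show the same category under two spellings (`cannot answer` and `cannot_answer`). One possible cleanup, sketched here with a hypothetical helper `normalize_question_type`, is to lower-case each label and treat underscores, hyphens, and spaces alike.

```python
from collections import Counter

def normalize_question_type(label):
    """Lower-case a label and collapse underscores, hyphens, and extra spaces."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())

# Made-up labels covering the variants seen in the counts above.
labels = ["tabular", "cannot answer", "cannot_answer", "Math"]
counts = Counter(normalize_question_type(label) for label in labels)
print(counts)
```

On the DataFrame, the same function could be applied with `df["question_type"] = df["question_type"].map(normalize_question_type)`.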

In [29]:
df.sector.value_counts()

sector
Financials                1062
Information Technology     979
Industrials                855
Health Care                770
Consumer Discretionary     679
Technology                 407
Energy                     385
Consumer Staples           377
Real Estate                377
Utilities                  277
Communication Services     253
Materials                  177
Tech                         9
Name: count, dtype: int64

In [30]:
df.company.value_counts()

company
Microsoft                  63
American Tower             56
PayPal                     56
News Corp (Class B)        50
Snap-on                    50
                           ..
Ansys                       7
TechSphere Inc              7
QuantumTech Innovations     7
Altria                      7
Textron                     7
Name: count, Length: 408, dtype: int64
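
The sector and company counts suggest that the model did not always echo the values it was given: `Technology` and `Tech` appear next to `Information Technology`, and names such as `TechSphere Inc` may not come from the `companies` list. A simple filter, sketched below under the assumption that only pairs present in the input list should be kept, drops such rows. `keep_known_rows` and the sample rows are hypothetical.

```python
def keep_known_rows(rows, known_pairs):
    """Keep only rows whose (sector, company) pair appears in known_pairs."""
    known = set(known_pairs)
    return [row for row in rows if (row["sector"], row["company"]) in known]

# Two pairs taken from the companies[:3] output shown earlier.
known_pairs = [("Industrials", "3M"), ("Health Care", "Abbott Laboratories")]

# Made-up rows: one valid, one with an unknown company, one with a changed sector.
rows = [
    {"sector": "Industrials", "company": "3M"},
    {"sector": "Technology", "company": "TechSphere Inc"},
    {"sector": "Tech", "company": "Abbott Laboratories"},
]
kept = keep_known_rows(rows, known_pairs)
print(kept)
```

With the DataFrame, an equivalent boolean mask could be built as `[pair in set(companies) for pair in zip(df.sector, df.company)]` and used to index `df`.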

In [31]:
df.to_parquet("../data/syn_data_rag/output/test.parquet", index=False)