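# Synthetic Data Generation
- Cases for generating data with GPT
1. CSV with a structured prompt
2. Creating a CSV with a Python program
3. Handling multiple CSVs with a Python program
4. Simple textual data generation
5. Handling imbalanced and non-diverse data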

In [None]:
%pip install openai
%pip install pandas
%pip install scikit-learn
%pip install matplotlib
%pip install python-dotenv

In [None]:
from openai import OpenAI
import re
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import json
import matplotlib

# 1. Creating a structured CSV file
- Row-and-column format

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()

datagen_model = "gpt-4-0125-preview"

# Housing data
question = """
Create a CSV file with 10 rows of housing data.
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense). Also only respond with the CSV.
"""

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model=datagen_model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data"},
        {"role": "user", "content": question}
    ]
)

res = response.choices[0].message.content
print(res)
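
The model returns the CSV as plain text, and it may wrap it in a Markdown code fence even when asked not to. Below is a minimal sketch of turning such text into rows, using only the standard library. `sample_response` is a made-up stand-in for `res`, and its column names are assumptions about what the model might return, not actual output.

```python
import csv
import io

FENCE = "`" * 3  # a Markdown code fence, built here so it does not clash with this block

def parse_csv_response(text):
    """Drop optional Markdown fence lines and parse the remaining CSV into dicts."""
    lines = [line for line in text.strip().splitlines()
             if not line.strip().startswith(FENCE)]
    return list(csv.DictReader(io.StringIO("\n".join(lines))))

# Hypothetical response, standing in for `res` from the cell above
sample_response = (
    FENCE + "csv\n"
    "id,house size (m^2),house price,location,number of bedrooms\n"
    "1,50,150000,Suburb,1\n"
    "2,120,420000,City Center,3\n"
    + FENCE
)

rows = parse_csv_response(sample_response)
print(len(rows), rows[0]["location"])
```

All values come back as strings, so numeric columns still need converting before any sanity checks. `pd.read_csv(io.StringIO(...))` on the cleaned text would do the same job and infer numeric types.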

# 2. Creating a CSV with Python
- Approach 1 cannot produce much data because of the token limit
- Instead, ask the LLM to write a program that generates the data

In [None]:
question = """
Create a Python program to generate 100 rows of housing data.
I want you to at the end of it output a pandas dataframe with 100 rows of data.
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)
res = response.choices[0].message.content
print(res)
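
The response usually mixes explanation with a fenced code block, so the program has to be pulled out before it can be saved or run. The sketch below assumes the model tags its block as `python`; `extract_python_code` and `sample_response` are hypothetical names used only for illustration.

```python
import re

FENCE = "`" * 3  # a Markdown code fence, built here so it does not clash with this block

def extract_python_code(text):
    """Return the contents of every fenced code block tagged as python."""
    pattern = re.compile(FENCE + r"python\s*\n(.*?)" + FENCE, re.DOTALL)
    return [block.strip() for block in pattern.findall(text)]

# Hypothetical response, standing in for `res` from the cell above
sample_response = (
    "Here is the program:\n\n"
    + FENCE + "python\n"
    "import pandas as pd\n"
    "df = pd.DataFrame({'id': [1, 2]})\n"
    + FENCE
    + "\n\nRun it to get the dataframe."
)

code_blocks = extract_python_code(sample_response)
print(code_blocks[0])
```

Generated code should be read before it is executed, since nothing guarantees it is correct or safe.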



# 3. Python program with multi-table CSVs
- Data that links related tables (relationships)

In [None]:
question = """
Create a Python program to generate 3 different pandas dataframes.

1. Housing data
I want 100 rows. Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms
 - house type
 + any relevant foreign keys

2. Location
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - country
 - city
 - population
 - area (m^2)
 + any relevant foreign keys

 3. House types
 - id (incrementing integer starting at 1)
 - house type
 - average house type price
 - number of houses
 + any relevant foreign keys

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).
Make sure that the dataframes generally follow common sense checks, e.g. the sizes of the dataframes make sense in comparison with one another.
Make sure the foreign keys match up. You can use previously generated dataframes when creating each subsequent dataframe.
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)
res = response.choices[0].message.content
print(res)

"""
To accomplish this task, first, we will need to install pandas by running:

```
!pip install pandas
```

Below is a Python program to generate the 3 different pandas DataFrames according to your requirements:

```python
import pandas as pd
import numpy as np

# Helper function to generate house prices based on several factors
def generate_house_price(size, bedrooms, location_id, house_type_id):
    base_price = 100000  # Base price for calculation
    price = base_price + (size * 3000) + (bedrooms * 50000)
    price_modifier = 1.0 + (location_id * 0.05) + (house_type_id * 0.1)
    return round(price * price_modifier)

# 1. Generating 'Location' DataFrame
location_data = {
    "id": range(1, 6),  # Assuming 5 unique locations for simplicity
    "country": ["CountryA", "CountryB", "CountryC", "CountryD", "CountryE"],
    "city": ["City1", "City2", "City3", "City4", "City5"],
    "population": [500000, 200000, 1000000, 750000, 300000],
    "area_m2": [100000, 50000, 200000, 150000, 120000]
}
location_df = pd.DataFrame(location_data)

# 2. Generating 'House types' DataFrame
house_types_data = {
    "id": range(1, 4),  # 3 Types of houses
    "house_type": ["Apartment", "Detached", "Townhouse"],
    "average_house_type_price": [200000, 300000, 250000],  # Base prices for simplicity
    "number_of_houses": [120, 70, 45]
}
house_types_df = pd.DataFrame(house_types_data)

# 3. Generating 'Housing' DataFrame
np.random.seed(42)  # For consistent random data

housing_data = {
    "id": range(1, 101),  # 100 Houses
    "house_size_m2": np.random.randint(50, 500, 100),  # Random sizes between 50 and 500 m^2
    "location_id": np.random.randint(1, 6, 100),  # Assuming 5 locations from 'Location' DataFrame
    "number_of_bedrooms": np.random.randint(1, 6, 100),  # Between 1 and 5 bedrooms
    "house_type_id": np.random.randint(1, 4, 100)  # 3 Types from 'House types' DataFrame
}

# Placeholder for house prices, will be generated next
housing_data["house_price"] = [0] * 100

# Generating 'House price' based on size, bedroom count, location, and house type
for i in range(100):
    size = housing_data["house_size_m2"][i]
    bedrooms = housing_data["number_of_bedrooms"][i]
    location_id = housing_data["location_id"][i]
    house_type_id = housing_data["house_type_id"][i]
    housing_data["house_price"][i] = generate_house_price(size, bedrooms, location_id, house_type_id)

housing_df = pd.DataFrame(housing_data)

# Display the DataFrames for inspection
print("Location DataFrame:")
print(location_df, "\n")

print("House Types DataFrame:")
print(house_types_df, "\n")

print("Housing DataFrame:")
print(housing_df.head())  # Displaying only the first 5 rows for brevity

```

This script first generates a DataFrame for locations and house types, which are simpler and not dependent on any other data. These are then used to create a more complex Housing DataFrame where house prices are determined by a custom function `generate_house_price`, taking into account several factors like house size, number of bedrooms, location, and house type. 

Please replace `"CountryA"`, `"CountryB"`, etc., and `"City1"`, `"City2"`, etc., with actual names as needed. The `generate_house_price` function and the random data are simplified for this example and can be adjusted to reflect more complex and realistic scenarios.
"""

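The prompt asks for matching foreign keys and sensible table sizes, but the generated program is not guaranteed to deliver either. In the output above, for example, the `number_of_houses` values sum to 235 while only 100 houses are generated. Below is a minimal sketch of a foreign-key check with pandas; `orphan_keys` is a hypothetical helper, and the small frames are stand-ins for the generated ones.

```python
import pandas as pd

def orphan_keys(child, fk_column, parent, pk_column="id"):
    """Return the foreign-key values in `child` that have no match in `parent`."""
    missing = ~child[fk_column].isin(parent[pk_column])
    return sorted(child.loc[missing, fk_column].unique().tolist())

# Small stand-in frames; location 7 is deliberately missing from location_df
location_df = pd.DataFrame({"id": [1, 2, 3], "city": ["City1", "City2", "City3"]})
housing_df = pd.DataFrame({"id": [1, 2, 3, 4], "location_id": [1, 3, 3, 7]})

orphans = orphan_keys(housing_df, "location_id", location_df)
print(orphans)
```

An empty list means every row in the child table points at an existing parent row. The same helper can be applied to `house_type_id` against the house types table.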

# 4. Simple text data
- Define the input and output pairs

In [None]:
output_string = ""
for i in range(3):
  question = f"""
  I am creating input output training pairs to fine tune my gpt model. The use case is a retailer generating a description for a product from a product catalogue. I want the input to be product name and category (to which the product belongs) and output to be description.
  The format should be of the form:
  1.
  Input: product_name, category
  Output: description
  2.
  Input: product_name, category
  Output: description

  Do not add any extra characters around that formatting as it will make the output parsing break.
  Create as many training pairs as possible.
  """

  response = client.chat.completions.create(
    model=datagen_model,
    messages=[
      {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
      {"role": "user", "content": question}
    ]
  )
  res = response.choices[0].message.content
  output_string += res + "\n" + "\n"
print(output_string[:1000]) #displaying truncated response


In [None]:
#regex to parse data
pattern = re.compile(r'Input:\s*(.+?),\s*(.+?)\nOutput:\s*(.+?)(?=\n\n|\Z)', re.DOTALL)
matches = pattern.findall(output_string)
products = []
categories = []
descriptions = []

for match in matches:
    product, category, description = match
    products.append(product.strip())
    categories.append(category.strip())
    descriptions.append(description.strip())
products
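
To see what the regex above captures, it helps to run it on a short string in the requested format. The sample below is invented for illustration and is not model output.

```python
import re

# Same pattern as in the cell above
pattern = re.compile(r'Input:\s*(.+?),\s*(.+?)\nOutput:\s*(.+?)(?=\n\n|\Z)', re.DOTALL)

# Hypothetical text in the format the prompt asks for
sample = (
    "1.\n"
    "Input: Trail Runner X, Shoes\n"
    "Output: Lightweight shoes for rough terrain.\n"
    "\n"
    "2.\n"
    "Input: AquaPure Bottle, Drinkware\n"
    "Output: A steel bottle that keeps drinks cold."
)

pairs = pattern.findall(sample)
for product, category, description in pairs:
    print(f"{product} | {category} | {description}")
```

The product name ends at the first comma, so a name that itself contains a comma is split early and the remainder lands in the category. The description ends at the first blank line, so a multi-paragraph description would be truncated.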

# 5. Handling imbalanced and non-diverse data
- Ways to improve data consistency, uniformity, and diversity
- Clustering with the k-means algorithm

In [None]:
# Create input/output pairs

output_string = ""
for i in range(3):
  question = f"""
  I am creating input output training pairs to fine tune my gpt model. I want the input to be product name and category and output to be description. The category should be things like: mobile phones, shoes, headphones, laptop, electronic toothbrush, etc. More importantly, the categories should come under 4 main topics: vehicle, clothing, toiletries, food.
  After the number of each example also state the topic area. The format should be of the form:
  1) topic_area
  Input: "product_name, category"
  Output: "description"

  Do not add any extra characters around that formatting as it will make the output parsing break.

  Here are some helpful examples so you get the style of output correct.

  1) clothing
  Input: "Shoe Name, Shoes"
  Output: "Experience unparalleled comfort. These shoes feature a blend of modern style and the traditional superior cushioning, perfect for those always on the move."
  """

  response = client.chat.completions.create(
    model="gpt-4",
    messages=[
      {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
      {"role": "user", "content": question}
    ]
  )
  res = response.choices[0].message.content
  output_string += res + "\n" + "\n"
print(output_string[:1000]) #displaying truncated response

In [None]:
# Parse the pairs with a regular expression; truncated entries do not match and are dropped
pattern = re.compile(r'(\d+)\) (\w+(?: \w+)?)\s*Input: "(.+?), (.+?)"\s*Output: "(.+?)"', re.DOTALL)
matches = pattern.findall(output_string)


topics = []
products = []
categories = []
descriptions = []

for match in matches:
    number, topic, product, category, description = match
    topics.append(topic)
    products.append(product)
    categories.append(category)
    descriptions.append(description)
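
Before clustering, a quick count of the parsed topics already shows whether the four requested topics are evenly represented. This is a minimal sketch; `topic_shares` is a hypothetical helper, and `sample_topics` is an invented stand-in for the `topics` list built above.

```python
from collections import Counter

def topic_shares(topics):
    """Return each topic's share of the examples, most common first."""
    counts = Counter(topic.strip().lower() for topic in topics)
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.most_common()}

# Invented stand-in for the `topics` list
sample_topics = ["clothing", "Clothing", "food", "clothing", "vehicle", "clothing"]
shares = topic_shares(sample_topics)
for topic, share in shares.items():
    print(f"{topic}: {share:.0%}")

# Topics requested in the prompt that never appeared in the output
expected = {"vehicle", "clothing", "toiletries", "food"}
missing = expected - set(shares)
print("missing:", sorted(missing))
```

A topic with a much larger share than the others, or one that is missing entirely, is a sign that further generation should target the under-represented topics.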


In [None]:
data = {
    'Product': products,
    'Category': categories,
    'Description': descriptions
}

df = pd.DataFrame(data)

In [None]:
# Embeddings
def get_embedding(text, model="text-embedding-3-small"):
    text = text.replace("\n", " ")

    response = client.embeddings.create(input=[text], model=model)

    return response.data[0].embedding

embedding_model = "text-embedding-3-small"
df["embedding"] = df.Category.apply(lambda x: get_embedding(x, model=embedding_model))

matrix = np.vstack(df.embedding.values)

In [None]:
# Example of clustering with the k-means algorithm

# Determine the optimal number of clusters using the elbow method
# inertias = []
# range_of_clusters = range(1, 13)  # Adjust the range as necessary

# for n_clusters in range_of_clusters:
#     kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42, n_init=10)
#     kmeans.fit(matrix)
#     inertias.append(kmeans.inertia_)


In [None]:
# Plotting the elbow plot
# plt.figure(figsize=(10, 6))
# plt.plot(range_of_clusters, inertias, '-o')
# plt.title('Elbow Method to Determine Optimal Number of Clusters')
# plt.xlabel('Number of Clusters')
# plt.ylabel('Inertia')
# plt.xticks(range_of_clusters)
# plt.show()
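
The elbow method in the commented-out cells can be tried without any API calls by swapping the embedding matrix for synthetic points. The sketch below assumes scikit-learn's `make_blobs` as a stand-in for `matrix`; the four groups and eight dimensions are arbitrary choices, not properties of the real embeddings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for `matrix`: 4 groups of 8-dimensional points
points, _ = make_blobs(n_samples=200, centers=4, n_features=8, random_state=42)

inertias = []
range_of_clusters = range(1, 9)
for n_clusters in range_of_clusters:
    kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42, n_init=10)
    kmeans.fit(points)
    inertias.append(kmeans.inertia_)

# Relative drop in inertia when going from k to k + 1 clusters
drops = [(inertias[i] - inertias[i + 1]) / inertias[i] for i in range(len(inertias) - 1)]
for k, drop in zip(range_of_clusters, drops):
    print(f"k={k} -> k={k + 1}: inertia falls by {drop:.1%}")
```

The elbow is the value of k after which the drops become small. On real embeddings the bend is often less sharp than on synthetic blobs, so the plot is a guide rather than a rule.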

In [None]:
n_clusters = 5

kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42, n_init=10)
kmeans.fit(matrix)
labels = kmeans.labels_
df["Cluster"] = labels

In [None]:
cluster_counts = df["Cluster"].value_counts().sort_index()
print(cluster_counts)

In [None]:
# Sample examples from each cluster and ask the model to name the topics

#selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(3)).reset_index(drop=True)
selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(min(len(x), 3))).reset_index(drop=True)

# Format the selected examples
formatted_examples = "\n".join(
    f'Input: "{row["Product"]}, {row["Category"]}"\nOutput: "{row["Description"]}"\nCluster: "{row["Cluster"]}"'
    for _, row in selected_examples.iterrows()
)

topic_prompt = f"""
    I previously generated some examples of input output training pairs and then I clustered them based on category. From each cluster I picked up to 3 example data points, which you can find below.
    I want you to identify the broad topic areas these clusters belong to.
    Previous examples:
    {formatted_examples}


    Your output should be strictly of the format:
    Cluster: number, topic: topic
    Cluster: number, topic: topic
    Cluster: number, topic: topic

    Do not add any extra characters around that formatting as it will make the output parsing break.
    """

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to analyze clustered data"},
    {"role": "user", "content": topic_prompt}
  ]
)
res = response.choices[0].message.content

pattern = r"Cluster: (\d+), topic: ([^\n]+)"
matches = re.findall(pattern, res)
clusters = [{"cluster": int(cluster), "topic": topic} for cluster, topic in matches]
json_output = json.dumps(clusters, indent=2)
print(json_output)

In [None]:
# Increasing data diversity

#selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(3)).reset_index(drop=True)
selected_examples = df.groupby('Cluster').apply(lambda x: x.sample(min(len(x), 3))).reset_index(drop=True)

# Format the selected examples
formatted_examples = "\n".join(
    f'Input: "{row["Product"]}, {row["Category"]}"\nOutput: "{row["Description"]}"\nCluster: "{row["Cluster"]}"'
    for _, row in selected_examples.iterrows()
)

topic_prompt = f"""
    I previously generated some examples of input output training pairs and then I clustered them based on category. From each cluster I picked up to 3 example data points, which you can find below.
    I want to promote diversity in my examples across categories so follow the procedure below:
    1. You must identify the broad topic areas these clusters belong to.
    2. You should generate further topic areas which don't exist so I can generate data within these topics to improve diversity.


    Previous examples:
    {formatted_examples}


    Your output should be strictly of the format:

    1. Cluster topic mapping
    Cluster: number, topic: topic
    Cluster: number, topic: topic
    Cluster: number, topic: topic

    2. New topics
    1. topic
    2. topic
    3. topic
    4. topic

    Do not add any extra characters around that formatting as it will make the output parsing break. It is very important you stick to that output format
    """

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to analyze clustered data"},
    {"role": "user", "content": topic_prompt}
  ]
)
res = response.choices[0].message.content
print(res)


In [None]:
parts = res.split("\n\n")
cluster_mapping_part = parts[0]
new_topics_part = parts[1]

# Parse cluster topic mapping
cluster_topic_mapping_lines = cluster_mapping_part.split("\n")[1:]  # Skip the header line
cluster_topic_mapping = [{"cluster": int(line.split(",")[0].split(":")[1].strip()), "topic": line.split(":")[2].strip()} for line in cluster_topic_mapping_lines]

# Parse new topics
new_topics_lines = new_topics_part.split("\n")[1:]  # Skip the first line
new_topics = [line.split(". ", 1)[1] for line in new_topics_lines]  # Split only on the first ". " so topic text stays whole

cluster_topic_mapping, new_topics
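
Splitting on blank lines works only when the model follows the layout exactly; an extra blank line or a missing header raises an error. Below is a more tolerant sketch that uses regular expressions instead. `parse_topic_response` is a hypothetical helper and `sample` is invented text, not model output.

```python
import re

def parse_topic_response(text):
    """Parse the cluster/topic mapping and the new-topic list from the response."""
    mapping = [
        {"cluster": int(cluster), "topic": topic.strip()}
        for cluster, topic in re.findall(r"Cluster:\s*(\d+),\s*topic:\s*([^\n]+)", text)
    ]
    # Only look for numbered topics after the "New topics" header
    _, _, tail = text.partition("New topics")
    new_topics = [t.strip() for t in re.findall(r"^\s*\d+\.\s*(.+)$", tail, re.MULTILINE)]
    return mapping, new_topics

# Invented response with stray blank lines that would break a plain split
sample = (
    "1. Cluster topic mapping\n"
    "Cluster: 0, topic: Clothing\n"
    "\n"
    "Cluster: 1, topic: Food\n"
    "\n\n"
    "2. New topics\n"
    "1. Electronics\n"
    "2. Home & Garden\n"
)

mapping, topics_found = parse_topic_response(sample)
print(mapping)
print(topics_found)
```

If the "New topics" header is absent, the function returns an empty topic list rather than failing, which makes a malformed response easy to detect and retry.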