<a href="https://colab.research.google.com/github/sap156/infinityskillshub/blob/main/Semantic_Data_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generative AI for Data Professionals
# Data Generation and Augmentation — Module 2 Part 1

# 🎯 Synthetic Data Generation with AI

**In this module, we will explore how Generative AI and Large Language Models (LLMs)
can assist with data generation and augmentation.**

This is one of the most impactful use cases of Generative AI, allowing us to create
synthetic data from scratch or augment existing datasets efficiently.

**Learning Objectives:**
- ✅ Understand why synthetic data is essential for data workflows
- ✅ Generate realistic synthetic datasets using AI
- ✅ Create domain-specific data for various business contexts
- ✅ Introduce controlled bias and edge cases for testing
- ✅ Generate location-specific and targeted datasets

# 🔄 Introduction: Why Synthetic Data Matters


## Common Data Professional Challenges

As data professionals, we often need to generate or extend datasets for several reasons:

🧪 **Testing Data Pipelines:** When developing data pipelines, we need sample data that mimics real-world data.

🔒 **Avoiding PII Exposure:** Using real production data introduces privacy risks, especially with
Personally Identifiable Information (PII). Synthetic data largely avoids this risk because no production records are involved.

🐛 **Handling Edge Cases:** We need to test how our systems handle unusual or problematic data.

⚖️ **Fixing Imbalanced Datasets:** Real data often underrepresents certain categories.

📊 **Development Speed:** Waiting for real data slows down development and testing cycles.


In [None]:
# Example of traditional random generation
import random
import datetime

print("🔧 Traditional Random Data Generation:")
traditional_data = []
for _ in range(5):
    # Each field is drawn independently, so no field relates to any other
    customer_data = {
        'customer_id': f'CUST{random.randint(100, 999)}',  # IDs may repeat
        'age': random.randint(18, 80),
        'purchase_amount': round(random.uniform(10, 500), 2),
        # Day is capped at 28 so every month yields a valid date
        'date': datetime.date(2024, random.randint(1, 12), random.randint(1, 28))
    }
    traditional_data.append(customer_data)

for data in traditional_data:
    print(data)

print("\n❌ Notice: Data is technically correct but feels artificial and random")
print("✅ AI-generated data will show realistic patterns and relationships")

In [None]:
# Install libraries
!pip install openai pandas scikit-learn matplotlib

In [None]:
# Generate Synthetic Data Example 1

from openai import OpenAI
import pandas as pd
import json
import numpy as np
from pprint import pprint
from google.colab import userdata

api_key = userdata.get('OPENAI_API_KEY')

# Initialize OpenAI client
client = OpenAI(api_key=api_key)

prompt = "Generate synthetic sales data for an e-commerce platform. Include fields for date, customer_id (Customer ###), order total (in $USD). For certain orders, the order total should be negative. Create data for 10 customers. Output in JSON form."


response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)

customer_data = json.loads(response.choices[0].message.content)
print(json.dumps(customer_data, indent=2))
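
Because the model chooses its own key names and values, it can help to check the parsed output before building on it. The sketch below is a minimal, offline example: `sample_orders` is a hand-written stand-in (not real model output), and the field names `date`, `customer_id`, and `order_total` are assumptions about what the model might return for the prompt above. The helper name `validate_orders` is hypothetical.

```python
import datetime

# Hand-written stand-in for parsed model output (hypothetical values and key names).
sample_orders = [
    {"date": "2024-01-15", "customer_id": "Customer 001", "order_total": 120.50},
    {"date": "2024-02-03", "customer_id": "Customer 002", "order_total": -35.00},
    {"date": "2024-13-40", "customer_id": "Customer 003", "order_total": 60.00},
    {"date": "2024-03-09", "customer_id": "Customer 004"},
]

REQUIRED_FIELDS = ("date", "customer_id", "order_total")


def validate_orders(records):
    """Split records into valid ones and (index, reason) problems."""
    valid, problems = [], []
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems.append((index, f"missing fields: {missing}"))
            continue
        try:
            # Assumes ISO dates (YYYY-MM-DD); adjust if the model uses another format.
            datetime.date.fromisoformat(record["date"])
        except (TypeError, ValueError):
            problems.append((index, f"invalid date: {record['date']!r}"))
            continue
        if not isinstance(record["order_total"], (int, float)):
            problems.append((index, "order_total is not numeric"))
            continue
        valid.append(record)
    return valid, problems


valid_orders, order_problems = validate_orders(sample_orders)
# The prompt asks for some negative totals, so confirm they actually appear.
negative_orders = [r for r in valid_orders if r["order_total"] < 0]

print(f"valid: {len(valid_orders)}, problems: {len(order_problems)}")
print(f"negative totals: {len(negative_orders)}")
for index, reason in order_problems:
    print(f"  record {index}: {reason}")
```

To apply this to real output, pass the list of records extracted from `customer_data` instead of `sample_orders`, after adjusting `REQUIRED_FIELDS` to the keys the model actually used.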




In [None]:
# Convert to DataFrame
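def clean(dict_variable):
    """Return the first list found among the top-level values of the JSON object."""
    for value in dict_variable.values():
        if isinstance(value, list):
            return value
    raise ValueError("No list of records found in the model response")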

df_customers = pd.DataFrame(clean(customer_data))
print(df_customers)
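
The `clean` helper above depends on how the model happens to wrap the records. One way to make the shape more predictable is to name the wrapper key and the fields in the prompt, then read that key directly. This is only a sketch: the helper names `build_prompt` and `extract_records` and the key `orders` are hypothetical, and the model may still deviate, so the lookup fails loudly rather than guessing.

```python
import json


def build_prompt(description, fields, wrapper_key, n_records):
    """Compose a prompt that names the wrapper key and the fields explicitly."""
    field_list = ", ".join(fields)
    return (
        f"{description} Create {n_records} records. "
        f"Return a JSON object with a single key '{wrapper_key}' whose value is "
        f"a list of objects with exactly these fields: {field_list}."
    )


def extract_records(payload, wrapper_key):
    """Return the list stored under wrapper_key, or raise if the shape is unexpected."""
    records = payload.get(wrapper_key)
    if not isinstance(records, list):
        raise ValueError(
            f"expected a list under {wrapper_key!r}, got keys {list(payload)}"
        )
    return records


order_prompt = build_prompt(
    description="Generate synthetic sales data for an e-commerce platform.",
    fields=["date", "customer_id", "order_total"],
    wrapper_key="orders",
    n_records=10,
)
print(order_prompt)

# Hand-written stand-in for a model response (not real output).
fake_response = (
    '{"orders": [{"date": "2024-01-15", '
    '"customer_id": "Customer 001", "order_total": 42.0}]}'
)
records = extract_records(json.loads(fake_response), "orders")
print(len(records))
```

The prompt still contains the word "JSON", which the `json_object` response format expects to see somewhere in the messages.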

In [None]:
# Generate Synthetic Data Example 2

prompt = "Generate 5 synthetic product reviews for a smartphone. Include fields for review_id, rating (1-5), and review_text. Output in JSON form."


response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)

product_data = json.loads(response.choices[0].message.content)
print(json.dumps(product_data, indent=2))

df_product = pd.DataFrame(clean(product_data))
print(df_product)
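
The prompt asks for ratings from 1 to 5, but nothing enforces that range in the output. A quick check of the rating distribution can catch out-of-range values and show whether the reviews skew toward one end. The sketch below runs offline on `sample_reviews`, a hand-written stand-in (not real model output); the helper name `rating_report` is hypothetical, and the key names assume the model followed the prompt.

```python
from collections import Counter

# Hand-written stand-in for parsed reviews (hypothetical values and key names).
sample_reviews = [
    {"review_id": 1, "rating": 5, "review_text": "Great battery life."},
    {"review_id": 2, "rating": 4, "review_text": "Solid camera, average speaker."},
    {"review_id": 3, "rating": 5, "review_text": "Fast and reliable."},
    {"review_id": 4, "rating": 7, "review_text": "Rating outside the allowed range."},
    {"review_id": 5, "rating": 1, "review_text": "Stopped charging after a week."},
]


def rating_report(reviews, low=1, high=5):
    """Count ratings inside [low, high] and collect review_ids that fall outside."""
    counts = Counter()
    out_of_range = []
    for review in reviews:
        rating = review.get("rating")
        if isinstance(rating, int) and low <= rating <= high:
            counts[rating] += 1
        else:
            out_of_range.append(review.get("review_id"))
    return counts, out_of_range


counts, out_of_range = rating_report(sample_reviews)
for rating in range(1, 6):
    print(f"{rating} stars: {counts[rating]}")
print(f"out-of-range review_ids: {out_of_range}")
```

With real output, pass the extracted list of reviews in place of `sample_reviews`.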
