<a href="https://colab.research.google.com/github/reckn/indoor-monitor-dashboard/blob/main/create_datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Synthetic Data Generation 📊

### Overview
This notebook generates three synthetic datasets (customer interactions, purchase history, and product details) that simulate a retail environment. Each dataset is built by a dedicated function and written to a CSV file for later analysis.

### Functions and Data Generation Explained:

#### 1. `generate_random_customer_ids(num_interactions, total_customers)`
   - **Purpose**: Sample unique customer IDs at random, one per interaction record. Raises `ValueError` if `num_interactions` exceeds `total_customers`.
   - **Inputs**:
       - `num_interactions` (int): Number of interactions.
       - `total_customers` (int): Total number of customers in the dataset.
   - **Output**: NumPy array of `num_interactions` unique customer IDs drawn from 1 to `total_customers`.
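
Sampling without replacement is what guarantees uniqueness here. The following is a minimal sketch of the same idea using NumPy's `Generator` API; the helper name `sample_customer_ids` and the `seed` parameter are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

def sample_customer_ids(num_interactions, total_customers, seed=None):
    """Sample unique IDs from 1..total_customers (hypothetical helper)."""
    if num_interactions > total_customers:
        raise ValueError("Number of interactions cannot exceed total customers.")
    rng = np.random.default_rng(seed)
    # choice(n, ...) draws from 0..n-1, so shift by one to get 1-based IDs.
    return rng.choice(total_customers, size=num_interactions, replace=False) + 1

ids = sample_customer_ids(5, 100, seed=0)
print(ids)
```

Passing a seed makes the sample reproducible between runs, which the notebook's unseeded `np.random.choice` call does not.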

#### 2. `fetch_avatar_links()`
   - **Purpose**: Fetch avatar links from a CSV file at a hardcoded GitHub URL (the function takes no arguments).
   - **Output**: List of avatar links.

#### 3. `fetch_product_icons()`
   - **Purpose**: Fetch product icon links from a CSV file at a hardcoded GitHub URL (the function takes no arguments).
   - **Output**: List of product icons.
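
Both fetch functions parse a header-less CSV and return its first column as a list. A minimal sketch of that parsing step follows, run on an in-memory string so it needs no network access; the helper name `parse_link_csv` and the `example.com` links are placeholders, not values from the notebook.

```python
from io import StringIO

import pandas as pd

def parse_link_csv(csv_text):
    """Return the first column of a header-less CSV as a list (hypothetical helper)."""
    data = pd.read_csv(StringIO(csv_text), header=None)
    return data.iloc[:, 0].tolist()

# Placeholder content standing in for the downloaded file.
sample_text = "https://example.com/a.png\nhttps://example.com/b.png\n"
links = parse_link_csv(sample_text)
print(links)
```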

#### 4. `generate_customer_interactions(num_interactions, customer_ids, purchase_history_df)`
   - **Purpose**: Generate one interaction record per customer: page views, time spent (in minutes), and a randomly assigned avatar link.
   - **Inputs**:
       - `num_interactions` (int): Number of interactions.
       - `customer_ids` (array): Array of customer IDs.
       - `purchase_history_df` (DataFrame): Purchase history DataFrame.
   - **Output**: DataFrame with the columns `Customer ID`, `Page Views`, `Time Spent (minutes)`, and `Avatar`.
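
The page-view and time-spent logic is easier to follow on a small example. The sketch below mirrors the approach in the code cell (sum page views per customer, add a random offset, then lengthen sessions until the minutes-per-page-view ratio exceeds 0.733) on toy data; the toy values and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy purchase history: customer 3 has no purchases.
purchases = pd.DataFrame({"Customer ID": [1, 1, 2], "Page Views": [2, 3, 4]})
customer_ids = np.array([1, 2, 3])

# Sum purchase-level page views per customer; customers without purchases get 1.
baseline = purchases.groupby("Customer ID")["Page Views"].sum()
baseline = baseline.reindex(customer_ids, fill_value=1).to_numpy()

# Add a random offset of 0..99 on top of each customer's baseline.
page_views = baseline + rng.integers(0, 100, size=len(customer_ids))

# Start with a short session, then lengthen it until the ratio passes 0.733.
time_spent = rng.uniform(1, 5, size=len(customer_ids))
mask = time_spent / page_views <= 0.733
while mask.any():
    time_spent[mask] += rng.uniform(5, 20, size=mask.sum())
    mask = time_spent / page_views <= 0.733
```

Because every pass adds at least 5 minutes to each record that still fails the check, the loop always terminates.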

#### 5. `generate_product_details(num_products)`
   - **Purpose**: Generate product details.
   - **Inputs**:
       - `num_products` (int): Number of products.
   - **Output**: DataFrame with the columns `Product ID`, `Category`, `Price`, `Ratings`, and `Product Icon`.
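
Prices are drawn uniformly from a range that depends on the product's category. A minimal sketch of that step, using two of the ranges from the notebook's `PRICE_RANGES` table; the variable names and the seeded generator are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative ranges taken from the notebook's PRICE_RANGES table.
price_ranges = {"Books and Media": (5, 100), "Electronics": (20, 5000)}

# Pick a category per product, then draw a price inside that category's range.
categories = rng.choice(list(price_ranges), size=6)
prices = np.array([rng.uniform(*price_ranges[cat]) for cat in categories])

for cat, price in zip(categories, prices):
    print(f"{cat}: {price:.2f}")
```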

#### 6. `generate_purchase_history(customer_ids, num_purchase_history, product_details_df)`
   - **Purpose**: Generate purchase history.
   - **Inputs**:
       - `customer_ids` (array): Array of customer IDs.
       - `num_purchase_history` (int): Number of purchase history records.
       - `product_details_df` (DataFrame): Product details DataFrame.
   - **Output**: DataFrame with the columns `Customer ID`, `Product ID`, `Purchase Date`, `Category`, `Price`, and `Page Views`.
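
Each purchase record copies its category and price from the product table, so the two datasets stay consistent. The sketch below shows that lookup on a toy product table, plus a vectorized way to generate purchase dates; the toy values are made up, and the vectorized date step is an alternative to the per-row `datetime.now()` loop in the code cell, not the notebook's own method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy product table with made-up values.
products = pd.DataFrame({
    "Product ID": [1, 2, 3],
    "Category": ["Toys and Games", "Pet Supplies", "Electronics"],
    "Price": [19.99, 7.50, 349.00],
})

# Sample product IDs with replacement, then look up their attributes.
product_ids = rng.choice(products["Product ID"].to_numpy(), size=5)
lookup = products.set_index("Product ID")
history = pd.DataFrame({
    "Product ID": product_ids,
    "Category": lookup.loc[product_ids, "Category"].to_numpy(),
    "Price": lookup.loc[product_ids, "Price"].to_numpy(),
})

# Random offsets of 1..364 days back from a single reference timestamp.
offsets = pd.to_timedelta(rng.integers(1, 365, size=5), unit="D")
history["Purchase Date"] = pd.Timestamp.now() - offsets
print(history)
```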

#### 7. `save_to_csv(customer_interactions_df, purchase_history_df, product_details_df)`
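   - **Purpose**: Save the three DataFrames to CSV files. The output paths are hardcoded to `/content/drive/MyDrive/fair_dataset/`, so Google Drive must be mounted in Colab and that folder must already exist.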
   - **Inputs**:
       - `customer_interactions_df` (DataFrame): Customer interactions DataFrame.
       - `purchase_history_df` (DataFrame): Purchase history DataFrame.
       - `product_details_df` (DataFrame): Product details DataFrame.
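
Outside Colab, the output location usually needs to be configurable. The following is a hedged sketch of one way to do that; `save_frames` and its `output_dir` parameter are hypothetical and do not exist in the notebook, and a temporary directory stands in for the real destination.

```python
from pathlib import Path
import tempfile

import pandas as pd

def save_frames(frames, output_dir):
    """Write each DataFrame to <output_dir>/<name>.csv (hypothetical helper)."""
    output_dir = Path(output_dir)
    # Create the folder if it is missing instead of failing on to_csv.
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, frame in frames.items():
        path = output_dir / f"{name}.csv"
        frame.to_csv(path, index=False)
        paths.append(path)
    return paths

demo = {"product_details_synthetic": pd.DataFrame({"Product ID": [1, 2], "Price": [9.5, 12.0]})}
with tempfile.TemporaryDirectory() as tmp:
    written = save_frames(demo, tmp)
    round_trip = pd.read_csv(written[0])
```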

### How to Use:
1. Set `num_interactions`, `num_purchase_history`, `num_products`, and `total_customers` in the `__main__` block. `num_interactions` must not exceed `total_customers`.
2. Run the code cell to generate the synthetic datasets.
3. Check the saved CSV files: `customer_interactions_synthetic.csv`, `purchase_history_synthetic.csv`, and `product_details_synthetic.csv`.


In [None]:
import numpy as np
import pandas as pd
import requests
from io import StringIO
from datetime import datetime, timedelta
import time

PRICE_RANGES = {
    "Apparel and Fashion Accessories": (10, 500),
    "Electronics": (20, 5000),
    "Beauty and Personal Care Products": (5, 200),
    "Home and Kitchen Appliances": (20, 2000),
    "Books and Media": (5, 100),
    "Consumer Electronics Accessories": (5, 200),
    "Health and Wellness Products": (5, 500),
    "Toys and Games": (5, 300),
    "Pet Supplies": (5, 300),
    "Sporting Goods and Fitness Equipment": (10, 2000)
}

def generate_random_customer_ids(num_interactions, total_customers):
    """Generate random unique customer IDs for interactions."""
    if num_interactions > total_customers:
        raise ValueError("Number of interactions cannot exceed total customers.")

    return np.random.choice(np.arange(1, total_customers + 1), num_interactions, replace=False)

def fetch_avatar_links():
    """Fetch avatar links from the provided CSV link."""
    avatar_csv_url = "https://raw.githubusercontent.com/reckn/super-disco/main/assets/avatar_link.csv"
    response = requests.get(avatar_csv_url, timeout=30)
    response.raise_for_status()  # fail loudly instead of parsing an HTTP error page as CSV
    data = pd.read_csv(StringIO(response.text), header=None)
    return data.iloc[:, 0].tolist()

def fetch_product_icons():
    """Fetch product icons from the provided CSV link."""
    product_icon_csv_url = "https://raw.githubusercontent.com/reckn/super-disco/main/assets/terra_link.csv"
    response = requests.get(product_icon_csv_url, timeout=30)
    response.raise_for_status()  # fail loudly instead of parsing an HTTP error page as CSV
    data = pd.read_csv(StringIO(response.text), header=None)
    return data.iloc[:, 0].tolist()

def generate_customer_interactions(num_interactions, customer_ids, purchase_history_df):
    """Generate customer interactions data."""
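    # Despite the name, this holds each customer's *total* page views summed over
    # their purchases; customers with no purchase records default to 1.
    max_page_views = purchase_history_df.groupby('Customer ID')['Page Views'].sum()
    max_page_views = max_page_views.reindex(customer_ids, fill_value=1)
    # Draw interaction page views from [total, total + 100) for each customer.
    num_page_views = np.random.randint(max_page_views, max_page_views + 100, size=num_interactions)

    # Start from 1 to 5 minutes, then keep adding 5 to 20 minutes to any record
    # whose minutes-per-page-view ratio is still at or below 0.733.
    time_spent = np.random.uniform(1, 5, size=num_interactions)
    while np.any(time_spent / num_page_views <= 0.733):
        mask = time_spent / num_page_views <= 0.733
        time_spent[mask] += np.random.uniform(5, 20, size=mask.sum())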

    avatar_links = fetch_avatar_links()
    avatars = np.random.choice(avatar_links, num_interactions)

    return pd.DataFrame({
        'Customer ID': customer_ids,
        'Page Views': num_page_views,
        'Time Spent (minutes)': np.round(time_spent),
        'Avatar': avatars
    })

def generate_product_details(num_products):
    """Generate product details."""
    categories = np.random.choice(list(PRICE_RANGES.keys()), num_products)
    prices = [np.random.uniform(*PRICE_RANGES[cat]) for cat in categories]
    product_icons = fetch_product_icons()
    icons = np.random.choice(product_icons, num_products)

    return pd.DataFrame({
        'Product ID': range(1, num_products + 1),
        'Category': categories,
        'Price': prices,
        'Ratings': np.round(np.random.uniform(1, 5, num_products), 1),
        'Product Icon': icons
    })

def generate_purchase_history(customer_ids, num_purchase_history, product_details_df):
    """Generate purchase history."""
    product_ids = np.random.choice(product_details_df['Product ID'], num_purchase_history)
    # Index once by Product ID, then look up category and price for each sampled product.
    product_lookup = product_details_df.set_index('Product ID')
    categories = product_lookup.loc[product_ids, 'Category'].values
    prices = product_lookup.loc[product_ids, 'Price'].values
    purchase_dates = np.array([datetime.now() - timedelta(
        days=np.random.randint(1, 365),
        hours=np.random.randint(0, 24),
        minutes=np.random.randint(0, 60),
        seconds=np.random.randint(0, 60)
    ) for _ in range(num_purchase_history)])
    page_views = np.random.randint(1, 5, num_purchase_history)

    return pd.DataFrame({
        'Customer ID': np.random.choice(customer_ids, num_purchase_history),
        'Product ID': product_ids,
        'Purchase Date': purchase_dates,
        'Category': categories,
        'Price': prices,
        'Page Views': page_views
    })

def save_to_csv(customer_interactions_df, purchase_history_df, product_details_df):
    """Save dataframes to CSV files."""
    customer_interactions_df.to_csv('/content/drive/MyDrive/fair_dataset/customer_interactions_synthetic.csv', index=False)
    purchase_history_df.to_csv('/content/drive/MyDrive/fair_dataset/purchase_history_synthetic.csv', index=False)
    product_details_df.to_csv('/content/drive/MyDrive/fair_dataset/product_details_synthetic.csv', index=False)

if __name__ == "__main__":
    start_time = time.time()

    num_interactions = 10000
    num_purchase_history = 300000
    num_products = 1043
    total_customers = 5000000

    customer_ids = generate_random_customer_ids(num_interactions, total_customers)
    product_details_df = generate_product_details(num_products)
    purchase_history_df = generate_purchase_history(customer_ids, num_purchase_history, product_details_df)
    customer_interactions_df = generate_customer_interactions(num_interactions, customer_ids, purchase_history_df)

    save_to_csv(customer_interactions_df, purchase_history_df, product_details_df)

    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Execution time: {execution_time} seconds")

    print("Synthetic Customer Interactions Data:")
    print(customer_interactions_df.head())

    print("\nSynthetic Purchase History Data:")
    print(purchase_history_df.head())

    print("\nSynthetic Product Details Data:")
    print(product_details_df.head())


Execution time: 8.015324354171753 seconds
Synthetic Customer Interactions Data:
   Customer ID  Page Views  Time Spent (minutes)  \
0      3849396         156                 129.0   
1      3110597         133                 103.0   
2       931005         140                 108.0   
3       507165          87                  81.0   
4       464587         113                  93.0   

                                              Avatar  
0  https://raw.githubusercontent.com/reckn/super-...  
1  https://raw.githubusercontent.com/reckn/super-...  
2  https://raw.githubusercontent.com/reckn/super-...  
3  https://raw.githubusercontent.com/reckn/super-...  
4  https://raw.githubusercontent.com/reckn/super-...  

Synthetic Purchase History Data:
   Customer ID  Product ID              Purchase Date  \
0      1217225         942 2023-10-05 06:51:34.943239   
1      3670929         884 2023-12-23 10:09:31.943314   
2      3103312         232 2023-08-27 14:17:02.943330   
3       980994 