# Data Preparation for Retail/CPG Planning Analytics

This notebook generates and prepares realistic synthetic datasets for a **global retail/CPG company's planning value chain** and saves them as Delta tables. The datasets cover planning processes from demand planning through supply, material, and production planning.

**Planning Value Chain Covered:**
- Demand Planning
- Supply Planning
- Inventory Netting
- Production Planning
- Material Planning
- Distribution Requirements Planning (DRP)

**Datasets Generated:**

| Dataset | Type | Use Case | Notebook |
|---------|------|----------|----------|
| `supplier_delay_risk` | Binary Classification | Predict supplier delivery delays | 01_classification |
| `material_shortage` | Multi-class Classification | Predict material shortage risk levels | 01_classification |
| `price_elasticity` | Regression | Predict price elasticity of demand | 02_regression |
| `promotion_lift` | Regression | Predict promotional sales lift | 02_regression |
| `scrap_anomaly` | Anomaly Detection | Detect unusual scrap/defect patterns | 03_outlier_detection |
| `demand_forecast` | Time Series | Forecast product demand by category and region | 04_time_series_forecasting |

**Run this notebook once** to set up all the data before running the analytics notebooks.

## Compute Setup

We recommend running this notebook on **Serverless Compute** with the **Base Environment V4**.

To configure:
1. Click on the compute selector in the notebook toolbar
2. Select **Serverless**
3. Under Environment, choose **Base Environment V4**

Serverless compute provides fast startup times and automatic scaling, ideal for interactive notebook workflows.

## 1. Install Required Packages

In [None]:
%pip install pandas numpy --quiet

In [None]:
dbutils.library.restartPython()

## 2. Configuration

Define the catalog and schema where datasets will be stored.

In [None]:
# Configure your catalog and schema
CATALOG = "tabpfn_databricks"
SCHEMA = "default"

# Create the catalog and schema if they don't exist
spark.sql(f"CREATE CATALOG IF NOT EXISTS {CATALOG}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {SCHEMA}")

print(f"Using catalog: {CATALOG}")
print(f"Using schema: {SCHEMA}")
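
The save cells below write to unqualified table names and rely on the `USE CATALOG` and `USE SCHEMA` statements above. As a minimal sketch of an alternative, a fully qualified, backtick-quoted name can be built explicitly. `qualified_name` is a hypothetical helper for illustration and is not part of this repo:

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a backtick-quoted three-level name such as `cat`.`sch`.`tbl`."""
    parts = []
    for part in (catalog, schema, table):
        if not part:
            raise ValueError("catalog, schema and table must be non-empty")
        # Escape embedded backticks by doubling them (Spark SQL identifier quoting)
        parts.append("`" + part.replace("`", "``") + "`")
    return ".".join(parts)


# Same catalog and schema values as in the configuration cell above
full_name = qualified_name("tabpfn_databricks", "default", "supplier_delay_risk")
print(full_name)
```

A name built this way could be passed to `saveAsTable` or `spark.table`, so a cell would no longer depend on the session's current catalog and schema.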

## 3. Import Data Generation Utilities

We use custom data generation functions that create realistic retail/CPG planning datasets.

In [None]:
import numpy as np
import pandas as pd
import sys
import os

# Add the scripts directory to the path so we can import util.py
# In Databricks, you may need to adjust this path based on your repo structure
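# Compare the current folder's name instead of searching the whole path for 'notebooks'
cwd = os.getcwd()
repo_root = os.path.dirname(cwd) if os.path.basename(cwd) == 'notebooks' else cwd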
scripts_path = os.path.join(repo_root, 'scripts')
if scripts_path not in sys.path:
    sys.path.insert(0, scripts_path)

# Import data generation functions
from util import (
    generate_supplier_delay_risk_data,
    generate_material_shortage_data,
    generate_price_elasticity_data,
    generate_promotion_lift_data,
    generate_scrap_anomaly_data,
    generate_aggregate_demand_forecast_data
)

print("Data generation utilities imported successfully!")
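
The path logic above assumes the notebook runs from the repo root or from a `notebooks` folder. If your layout differs, one option is to walk up the directory tree until a `scripts/util.py` is found. The sketch below uses only the standard library; `find_scripts_dir` is a hypothetical helper, not something the repo provides:

```python
import sys
from pathlib import Path
from typing import Optional


def find_scripts_dir(start: Path, marker: str = "util.py") -> Optional[Path]:
    """Walk upward from `start`; return the first `scripts` folder containing `marker`."""
    for folder in [start, *start.parents]:
        candidate = folder / "scripts"
        if (candidate / marker).is_file():
            return candidate
    return None  # nothing found anywhere above `start`


scripts_dir = find_scripts_dir(Path.cwd())
if scripts_dir is not None and str(scripts_dir) not in sys.path:
    sys.path.insert(0, str(scripts_dir))
print(f"scripts directory: {scripts_dir}")
```

If this prints `None`, the import of `util` will fail and the path needs to be set by hand.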

## 4. Supplier Delay Risk Dataset (Binary Classification)

**Use Case:** Supply Planning - Predict which supplier deliveries are at risk of delay

This dataset helps supply planners identify high-risk orders and take proactive mitigation actions such as expediting, finding alternative suppliers, or adjusting production schedules.

In [None]:
# Generate supplier delay risk data
df_supplier_delay = generate_supplier_delay_risk_data(n_samples=2000, seed=42)

print(f"Supplier Delay Risk Dataset")
print(f"Shape: {df_supplier_delay.shape}")
print(f"\nTarget distribution (is_delayed):")
print(df_supplier_delay['is_delayed'].value_counts())
print(f"\nDelay rate: {df_supplier_delay['is_delayed'].mean():.1%}")

display(df_supplier_delay.head(10))

In [None]:
# Save to Delta table
spark.createDataFrame(df_supplier_delay).write.mode("overwrite").saveAsTable("supplier_delay_risk")
print(f"✓ Saved to {CATALOG}.{SCHEMA}.supplier_delay_risk")

## 5. Material Shortage Dataset (Multi-class Classification)

**Use Case:** Material Planning - Predict which materials are at risk of shortage

This dataset supports material planners in prioritizing procurement actions based on shortage risk levels (No Risk, At Risk, Critical).

In [None]:
# Generate material shortage data
df_material_shortage = generate_material_shortage_data(n_samples=1500, seed=42)

print(f"Material Shortage Dataset")
print(f"Shape: {df_material_shortage.shape}")
print(f"\nTarget distribution (shortage_risk):")
print("0 = No Risk, 1 = At Risk, 2 = Critical")
print(df_material_shortage['shortage_risk'].value_counts().sort_index())

display(df_material_shortage.head(10))

In [None]:
# Save to Delta table
spark.createDataFrame(df_material_shortage).write.mode("overwrite").saveAsTable("material_shortage")
print(f"✓ Saved to {CATALOG}.{SCHEMA}.material_shortage")
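
The `shortage_risk` column stores integer codes (0 = No Risk, 1 = At Risk, 2 = Critical). For reporting, it can help to translate the codes into readable labels. The sketch below uses a made-up list of codes in place of the real column:

```python
from collections import Counter

# Code-to-label mapping taken from the printout in the cell above
RISK_LABELS = {0: "No Risk", 1: "At Risk", 2: "Critical"}

# Hypothetical sample of shortage_risk codes, standing in for the real column
sample_codes = [0, 0, 1, 2, 0, 1, 0, 2, 0, 0]

counts = Counter(RISK_LABELS[code] for code in sample_codes)
for label in RISK_LABELS.values():
    share = counts[label] / len(sample_codes)
    print(f"{label:<8} {counts[label]:>3}  ({share:.0%})")
```

On the pandas DataFrame itself, the same mapping can be applied with `df_material_shortage['shortage_risk'].map(RISK_LABELS)`.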

## 6. Price Elasticity Dataset (Regression)

**Use Case:** Demand Planning - Understand how price changes affect demand

This dataset helps demand planners and revenue management teams predict how changes in pricing will impact unit sales across different products and market conditions.
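
As a reminder of what the target measures, price elasticity of demand is the percentage change in quantity divided by the percentage change in price. The sketch below uses the simple point formula with illustrative numbers; whether the generator in `util.py` defines the target in exactly this way is an assumption:

```python
def price_elasticity(q_old: float, q_new: float, p_old: float, p_new: float) -> float:
    """Percentage change in quantity divided by percentage change in price."""
    pct_quantity = (q_new - q_old) / q_old
    pct_price = (p_new - p_old) / p_old
    if pct_price == 0:
        raise ValueError("price did not change; elasticity is undefined")
    return pct_quantity / pct_price


# Illustrative numbers only: a 10% price increase and a 15% drop in units sold
e = price_elasticity(q_old=1000, q_new=850, p_old=2.00, p_new=2.20)
print(f"elasticity = {e:.2f}")
```

Here the elasticity is about -1.5: demand falls proportionally more than price rises, so the product counts as elastic.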

In [None]:
# Generate price elasticity data
df_price_elasticity = generate_price_elasticity_data(n_samples=3000, seed=42)

print(f"Price Elasticity Dataset")
print(f"Shape: {df_price_elasticity.shape}")
print(f"\nTarget (price_elasticity) statistics:")
print(df_price_elasticity['price_elasticity'].describe())
print(f"\nNote: Elasticity values are typically negative (demand decreases as price increases)")
print(f"More negative = more elastic (price sensitive)")

display(df_price_elasticity.head(10))

In [None]:
# Save to Delta table
spark.createDataFrame(df_price_elasticity).write.mode("overwrite").saveAsTable("price_elasticity")
print(f"✓ Saved to {CATALOG}.{SCHEMA}.price_elasticity")

## 7. Promotion Lift Dataset (Regression)

**Use Case:** Demand Planning - Predict the sales impact of planned promotions

This dataset supports trade promotion planning by predicting the incremental sales lift from different promotion types, depths, and marketing support.
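
Promotional lift is commonly expressed as the percentage increase of promoted sales over the baseline (non-promoted) sales. The sketch below shows that calculation with made-up numbers; it is an assumption that `promotion_lift_pct` in this dataset follows the same definition:

```python
def promotion_lift_pct(baseline_units: float, promo_units: float) -> float:
    """Incremental sales during a promotion, as a percentage of baseline sales."""
    if baseline_units <= 0:
        raise ValueError("baseline_units must be positive")
    return (promo_units - baseline_units) / baseline_units * 100.0


# Illustrative numbers only: 400 units in a normal week, 520 during the promotion
lift = promotion_lift_pct(baseline_units=400, promo_units=520)
print(f"promotion lift = {lift:.1f}%")
```

A negative value would mean the promoted period sold less than the baseline.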

In [None]:
# Generate promotion lift data
df_promotion_lift = generate_promotion_lift_data(n_samples=2500, seed=42)

print(f"Promotion Lift Dataset")
print(f"Shape: {df_promotion_lift.shape}")
print(f"\nTarget (promotion_lift_pct) statistics:")
print(df_promotion_lift['promotion_lift_pct'].describe())
print(f"\nPromotion type distribution:")
print(df_promotion_lift['promotion_type'].value_counts())

display(df_promotion_lift.head(10))

In [None]:
# Save to Delta table
spark.createDataFrame(df_promotion_lift).write.mode("overwrite").saveAsTable("promotion_lift")
print(f"✓ Saved to {CATALOG}.{SCHEMA}.promotion_lift")

## 8. Scrap Anomaly Dataset (Anomaly Detection)

**Use Case:** Production Planning - Detect unusual scrap/defect patterns

This dataset helps production managers identify abnormal production runs that may indicate equipment issues, material problems, or process deviations.
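
To make "unusual scrap pattern" concrete, the sketch below flags production runs whose scrap rate lies far from the median, measured in units of the median absolute deviation (MAD). The runs and the threshold are invented for illustration, and this is not the method used in `03_outlier_detection`:

```python
from statistics import median

# Hypothetical production runs: (units_produced, units_scrapped)
runs = [(1000, 21), (980, 19), (1020, 23), (1010, 20), (990, 95), (1005, 22)]

scrap_rates = [scrapped / produced for produced, scrapped in runs]

# Robust centre and spread: median and median absolute deviation
med = median(scrap_rates)
mad = median(abs(rate - med) for rate in scrap_rates)

# Flag runs whose distance from the median exceeds THRESHOLD times the MAD
# (assumes mad > 0, which holds for these sample values)
THRESHOLD = 5.0  # assumed cut-off; tune for real data
flags = [abs(rate - med) / mad > THRESHOLD for rate in scrap_rates]

for (produced, scrapped), rate, flagged in zip(runs, scrap_rates, flags):
    marker = "ANOMALY" if flagged else "ok"
    print(f"produced={produced:>5} scrapped={scrapped:>3} rate={rate:.3f} {marker}")
```

Only the run with 95 scrapped units is flagged; the others stay within a few MADs of the median scrap rate.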

In [None]:
# Generate scrap anomaly data
df_scrap_anomaly, anomaly_labels = generate_scrap_anomaly_data(n_samples=1000, anomaly_rate=0.08, seed=42)

# Add the labels to the dataframe
df_scrap_anomaly['is_anomaly'] = anomaly_labels

print(f"Scrap Anomaly Dataset")
print(f"Shape: {df_scrap_anomaly.shape}")
print(f"\nAnomaly distribution:")
print(f"Normal (0): {(anomaly_labels == 0).sum()}")
print(f"Anomaly (1): {(anomaly_labels == 1).sum()}")
print(f"Anomaly rate: {anomaly_labels.mean():.1%}")

display(df_scrap_anomaly.head(10))

In [None]:
# Save to Delta table
spark.createDataFrame(df_scrap_anomaly).write.mode("overwrite").saveAsTable("scrap_anomaly")
print(f"✓ Saved to {CATALOG}.{SCHEMA}.scrap_anomaly")

## 9. Demand Forecast Dataset (Time Series)

**Use Case:** Demand Planning - Forecast product demand by category and region

This dataset contains monthly demand data across multiple product categories and regions, supporting demand forecasting and inventory planning processes.
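
With `n_series=50` and `n_months=36`, the table is expected to hold 50 x 36 = 1,800 rows. When evaluating forecasts, the most recent months are usually held out rather than sampled at random. The sketch below shows such a split on a hypothetical month index; the 2021 start year and the 6-month holdout are assumptions, not properties of the generated data:

```python
# Hypothetical monthly index: 36 consecutive months, matching n_months=36
months = [f"{2021 + i // 12}-{i % 12 + 1:02d}" for i in range(36)]

HOLDOUT = 6  # assumed number of most recent months reserved for evaluation

train_months = months[:-HOLDOUT]
test_months = months[-HOLDOUT:]

print(f"train: {train_months[0]} to {train_months[-1]} ({len(train_months)} months)")
print(f"test:  {test_months[0]} to {test_months[-1]} ({len(test_months)} months)")
```

Every training month precedes every test month, so no future information leaks into the training window.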

In [None]:
# Generate demand forecast data
df_demand_forecast = generate_aggregate_demand_forecast_data(n_series=50, n_months=36, seed=42)

print(f"Demand Forecast Dataset")
print(f"Shape: {df_demand_forecast.shape}")
print(f"Number of time series: {df_demand_forecast['series_id'].nunique()}")
print(f"Time range: {df_demand_forecast['date'].min()} to {df_demand_forecast['date'].max()}")
print(f"\nCategory distribution:")
print(df_demand_forecast['category'].value_counts())
print(f"\nRegion distribution:")
print(df_demand_forecast['region'].value_counts())

display(df_demand_forecast.head(10))

In [None]:
# Save to Delta table
spark.createDataFrame(df_demand_forecast).write.mode("overwrite").saveAsTable("demand_forecast")
print(f"✓ Saved to {CATALOG}.{SCHEMA}.demand_forecast")

## 10. Verify All Tables

In [None]:
# List all tables in the schema
print(f"Tables in {CATALOG}.{SCHEMA}:")
display(spark.sql(f"SHOW TABLES IN {CATALOG}.{SCHEMA}"))

In [None]:
# Preview each table
tables = [
    ("supplier_delay_risk", "Supply Planning - Predict delivery delays"),
    ("material_shortage", "Material Planning - Predict shortage risk"),
    ("price_elasticity", "Demand Planning - Price sensitivity"),
    ("promotion_lift", "Demand Planning - Promotion impact"),
    ("scrap_anomaly", "Production Planning - Detect anomalies"),
    ("demand_forecast", "Demand Planning - Time series forecasting")
]

for table_name, description in tables:
    print(f"\n{'='*60}")
    print(f"{table_name}: {description}")
    print(f"{'='*60}")
    display(spark.table(table_name).limit(5))
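
Beyond previewing rows, it can be worth checking that each table holds the number of rows requested from its generator. The sketch below keeps the comparison in plain Python so it can be tried without Spark; `find_row_count_mismatches` is a hypothetical helper, and the expected counts are the `n_samples` values used above (with 50 series x 36 months assumed for `demand_forecast`):

```python
EXPECTED_ROWS = {
    "supplier_delay_risk": 2000,
    "material_shortage": 1500,
    "price_elasticity": 3000,
    "promotion_lift": 2500,
    "scrap_anomaly": 1000,
    "demand_forecast": 1800,  # 50 series x 36 months
}


def find_row_count_mismatches(actual: dict, expected: dict = EXPECTED_ROWS) -> dict:
    """Return {table: (expected, actual)} for tables whose count differs or is missing."""
    return {
        table: (rows, actual.get(table))
        for table, rows in expected.items()
        if actual.get(table) != rows
    }


# In the notebook, the actual counts could come from Spark, for example:
#   actual = {t: spark.table(t).count() for t in EXPECTED_ROWS}
actual = dict(EXPECTED_ROWS, scrap_anomaly=990)  # made-up counts for illustration
print(find_row_count_mismatches(actual))
```

An empty dictionary means every table matches its expected row count.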

## Summary

All datasets have been prepared and saved as Delta tables in `tabpfn_databricks.default` (or in the catalog and schema you set in the Configuration section):

| Table | Task | Planning Process | Samples | Features |
|-------|------|------------------|---------|----------|
| `supplier_delay_risk` | Binary Classification | Supply Planning | 2,000 | 14 |
| `material_shortage` | Multi-class Classification | Material Planning | 1,500 | 15 |
| `price_elasticity` | Regression | Demand Planning | 3,000 | 13 |
| `promotion_lift` | Regression | Demand Planning | 2,500 | 15 |
| `scrap_anomaly` | Anomaly Detection | Production Planning | 1,000 | 13 |
| `demand_forecast` | Time Series Forecasting | Demand Planning | 1,800 (50 series × 36 months) | 7 |

**Next steps:** Run the individual notebooks (01-04) to explore TabPFN capabilities using these prepared datasets.

### Planning Value Chain Coverage

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Demand Planning │───▶│ Supply Planning │───▶│ Production      │
│                 │    │                 │    │ Planning        │
│ • Forecasting   │    │ • Supplier Risk │    │ • Yield Pred.   │
│ • Price Elast.  │    │ • Lead Time     │    │ • Scrap Detect. │
│ • Promo Lift    │    │ • Material      │    │                 │
│                 │    │   Shortage      │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```