# 01 - Data Generation for Spark Performance Benchmark

**Objective:** Generate synthetic datasets to benchmark I/O performance and join strategies.

This notebook creates:
1. **fact_sales**: Large fact table (~1-5M rows) with transaction data
2. **dim_customers**: Dimension table (~100k rows) with customer information

Data is saved in three formats:
- **CSV**: Row-oriented, uncompressed
- **Parquet**: Columnar, compressed (Snappy)
- **Delta Lake**: Columnar, versioned, optimized with Z-Ordering
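
As a rough intuition for the row-oriented versus columnar distinction above, here is a toy sketch in plain Python. It is only an illustration: the three sample records are made up, and real CSV and Parquet files involve encoding, compression, and metadata that this ignores.

```python
# Toy illustration only: made-up records, no real file formats involved.
rows = [
    {"transaction_id": 0, "customer_id": 17, "amount": 25.0},
    {"transaction_id": 1, "customer_id": 42, "amount": 110.5},
    {"transaction_id": 2, "customer_id": 17, "amount": 64.5},
]

# Row-oriented: each record is stored whole, so summing one field
# still walks through every record in full.
total_from_rows = sum(row["amount"] for row in rows)

# Column-oriented: each field is stored contiguously, so a query that
# needs only `amount` can ignore the other columns entirely.
columns = {name: [row[name] for row in rows] for name in rows[0]}
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Both layouts hold the same data; the difference is how much of it a single-column query has to touch.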

---

## Setup and Imports

In [None]:
# Make the project's src/ directory importable
import sys
from pathlib import Path

# Assumes the notebook is run from a directory directly under the project root
notebook_dir = Path.cwd()
project_root = notebook_dir.parent
src_dir = project_root / "src"
sys.path.insert(0, str(src_dir))

print(f"Project root: {project_root}")
print(f"Src directory: {src_dir}")

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType, DateType
from datetime import datetime, timedelta
import random

# Import project modules
from config import (
    get_data_path,
    FACT_SALES_TABLE,
    DIM_CUSTOMERS_TABLE,
    SPARK_APP_NAME
)
from benchmark_utils import BenchmarkTimer, get_directory_size_mb

print("✓ All imports successful")
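
The `config` and `benchmark_utils` modules are not shown in this notebook. For readers without access to them, the following is a minimal sketch of what such helpers might look like, inferred only from how they are called here. The names `data_path`, `directory_size_mb`, `SketchTimer`, and `DATA_ROOT` are hypothetical stand-ins (deliberately renamed so they do not shadow the real imports), and the real implementations may differ.

```python
import time
from pathlib import Path

DATA_ROOT = Path("data")  # hypothetical location; the real root lives in config


def data_path(fmt, table_name):
    """Hypothetical stand-in for get_data_path: <root>/<format>/<table>."""
    return DATA_ROOT / fmt / table_name


def directory_size_mb(path):
    """Sum the sizes of all files under `path`, in MiB (1024 * 1024 bytes)."""
    root = Path(path)
    total_bytes = sum(f.stat().st_size for f in root.rglob("*") if f.is_file())
    return total_bytes / (1024 * 1024)


class SketchTimer:
    """Context manager that reports wall-clock time for the enclosed block."""

    def __init__(self, name, description="", spark=None):
        self.name = name
        self.description = description
        self.spark = spark  # accepted to mirror the call sites; unused here
        self.elapsed = None

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._start
        print(f"{self.name}: {self.elapsed:.3f} s")
        return False  # never swallow exceptions from the timed block
```

Because Spark evaluates lazily, a timer like this only measures real work when the timed block contains an action such as a write or a `count()`.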

## Initialize Spark Session

In [None]:
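# Create Spark session with Delta Lake support
# Note: the Delta Lake JARs must already be on the classpath (for example via
# spark.jars.packages or delta-spark's configure_spark_with_delta_pip helper)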
spark = (
    SparkSession.builder
    .appName(f"{SPARK_APP_NAME} - Data Generation")
    .master("local[*]")  # Use all available cores
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.driver.memory", "4g")  # Adjust based on available RAM
    .config("spark.sql.shuffle.partitions", "8")  # Far fewer than the default of 200; enough for a local run
    .getOrCreate()
)

print(f"✓ Spark {spark.version} session initialized")
print(f"✓ Master: {spark.sparkContext.master}")
print(f"✓ App Name: {spark.sparkContext.appName}")

## Configuration: Data Size Settings

Adjust these parameters based on your system's capabilities:
- **Small**: 100K sales, 10K customers (for testing)
- **Medium**: 1M sales, 100K customers (recommended)
- **Large**: 5M+ sales, 500K customers (for production benchmarks)
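
One way to make the three tiers above selectable by name is a small preset table. This is only a sketch: `SIZE_PRESETS` and `pick_preset` are hypothetical names that do not exist in the project, and the "large" entry uses the 5M lower bound of that tier.

```python
# Hypothetical presets mirroring the Small / Medium / Large tiers listed above.
SIZE_PRESETS = {
    "small": {"sales": 100_000, "customers": 10_000},
    "medium": {"sales": 1_000_000, "customers": 100_000},
    "large": {"sales": 5_000_000, "customers": 500_000},  # lower bound of "5M+"
}


def pick_preset(name):
    """Return (num_sales, num_customers) for a preset name, case-insensitively."""
    try:
        preset = SIZE_PRESETS[name.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown preset {name!r}; choose from {sorted(SIZE_PRESETS)}"
        ) from None
    return preset["sales"], preset["customers"]


num_sales, num_customers = pick_preset("medium")
print(f"{num_sales:,} sales, {num_customers:,} customers")
```

The returned pair could then be assigned to `NUM_SALES_RECORDS` and `NUM_CUSTOMER_RECORDS` in the configuration cell.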

In [None]:
# Configuration - Adjust based on your needs
NUM_SALES_RECORDS = 2_000_000      # 2 Million sales transactions
NUM_CUSTOMER_RECORDS = 100_000     # 100K customers

# Seed for reproducibility
RANDOM_SEED = 42

print("Configuration:")
print(f"  - Sales records: {NUM_SALES_RECORDS:,}")
print(f"  - Customer records: {NUM_CUSTOMER_RECORDS:,}")
print(f"  - Random seed: {RANDOM_SEED}")

## Generate Fact Table: `fact_sales`

This represents a large transaction table with:
- `transaction_id`: Unique identifier
- `customer_id`: Foreign key to customers
- `amount`: Transaction amount
- `date`: Transaction date
- `product_category`: Product category

In [None]:
# Generate fact_sales DataFrame
print("Generating fact_sales table...")

# Product categories for variety
categories = ["Electronics", "Clothing", "Food", "Books", "Sports", "Home", "Beauty", "Toys"]

# Create the DataFrame using Spark functions
fact_sales = (
    spark.range(0, NUM_SALES_RECORDS)
    .withColumn("transaction_id", F.col("id").cast("int"))
    .withColumn(
        "customer_id",
        (F.rand(seed=RANDOM_SEED) * NUM_CUSTOMER_RECORDS).cast("int")
    )
    .withColumn(
        "amount",
        (F.rand(seed=RANDOM_SEED + 1) * 1000 + 10).cast("double")  # Between 10 and 1010
    )
    .withColumn(
        "date",
        F.date_add(F.lit("2023-01-01"), (F.rand(seed=RANDOM_SEED + 2) * 365).cast("int"))
    )
    .withColumn(
        "product_category",
        # Index with [] rather than getItem(): passing a Column to getItem()
        # is deprecated since Spark 3.0
        F.array([F.lit(cat) for cat in categories])[
            (F.rand(seed=RANDOM_SEED + 3) * len(categories)).cast("int")
        ]
    )
    .drop("id")
)

# Cache for reuse during multiple writes
fact_sales.cache()
fact_sales_count = fact_sales.count()

print(f"✓ Generated {fact_sales_count:,} sales records")
print("\nSample data:")
fact_sales.show(5, truncate=False)
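
The generation cell above relies on one piece of arithmetic throughout: scaling a uniform value in [0, 1) by `n` and truncating to an integer gives an index in `0..n-1`. The plain-Python sketch below mirrors that arithmetic without Spark, to show why the category lookup never goes out of range and why the dates stay within 2023. `scale_to_index` and `offset_date` are hypothetical helper names used only for this illustration.

```python
from datetime import date, timedelta

categories = ["Electronics", "Clothing", "Food", "Books", "Sports", "Home", "Beauty", "Toys"]


def scale_to_index(u, n):
    """Map u in [0, 1) to an integer in 0..n-1, like (rand * n).cast("int")."""
    return int(u * n)


def offset_date(start, u, span_days):
    """Shift `start` by a whole number of days in 0..span_days-1."""
    return start + timedelta(days=scale_to_index(u, span_days))


# Probe the extremes of the [0, 1) range plus one value in the middle.
for u in (0.0, 0.5, 0.999999):
    category = categories[scale_to_index(u, len(categories))]
    day = offset_date(date(2023, 1, 1), u, 365)
    print(u, category, day)
```

Because the upper bound of the uniform range is exclusive, the largest index is `n - 1` and the latest possible date is 364 days after the start.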

In [None]:
# Display schema and basic statistics
print("Schema:")
fact_sales.printSchema()

print("\nBasic Statistics:")
fact_sales.describe().show()

## Generate Dimension Table: `dim_customers`

This represents a smaller dimension table with:
- `customer_id`: Primary key
- `name`: Customer name
- `region`: Geographic region
- `signup_date`: Registration date

In [None]:
# Generate dim_customers DataFrame
print("Generating dim_customers table...")

# Regions for diversity
regions = ["North", "South", "East", "West", "Central"]

dim_customers = (
    spark.range(0, NUM_CUSTOMER_RECORDS)
    .withColumn("customer_id", F.col("id").cast("int"))
    .withColumn(
        "name",
        F.concat(F.lit("Customer_"), F.col("id").cast("string"))
    )
    .withColumn(
        "region",
        # Index with [] rather than getItem(): passing a Column to getItem()
        # is deprecated since Spark 3.0
        F.array([F.lit(reg) for reg in regions])[
            (F.rand(seed=RANDOM_SEED + 4) * len(regions)).cast("int")
        ]
    )
    .withColumn(
        "signup_date",
        F.date_add(F.lit("2020-01-01"), (F.rand(seed=RANDOM_SEED + 5) * 1095).cast("int"))  # 3 years
    )
    .drop("id")
)

# Cache for reuse during multiple writes
dim_customers.cache()
dim_customers_count = dim_customers.count()

print(f"✓ Generated {dim_customers_count:,} customer records")
print("\nSample data:")
dim_customers.show(5, truncate=False)

In [None]:
# Display schema
print("Schema:")
dim_customers.printSchema()

print("\nRegion Distribution:")
dim_customers.groupBy("region").count().orderBy("region").show()

## Save Data in Multiple Formats

We'll save both tables in three formats:
1. **CSV** - Baseline format (row-oriented)
2. **Parquet** - Columnar format with Snappy compression
3. **Delta Lake** - Advanced format with ACID transactions

### Save as CSV

In [None]:
# Save fact_sales as CSV
with BenchmarkTimer(
    "Save fact_sales as CSV",
    description=f"Writing {NUM_SALES_RECORDS:,} records to CSV",
    spark=spark
):
    csv_path = str(get_data_path("csv", FACT_SALES_TABLE))
    fact_sales.write.mode("overwrite").option("header", "true").csv(csv_path)

csv_size = get_directory_size_mb(get_data_path("csv", FACT_SALES_TABLE))
print(f"CSV Size: {csv_size:.2f} MB")

In [None]:
# Save dim_customers as CSV
with BenchmarkTimer(
    "Save dim_customers as CSV",
    description=f"Writing {NUM_CUSTOMER_RECORDS:,} records to CSV",
    spark=spark
):
    csv_path = str(get_data_path("csv", DIM_CUSTOMERS_TABLE))
    dim_customers.write.mode("overwrite").option("header", "true").csv(csv_path)

csv_size = get_directory_size_mb(get_data_path("csv", DIM_CUSTOMERS_TABLE))
print(f"CSV Size: {csv_size:.2f} MB")

### Save as Parquet (with Snappy compression)

In [None]:
# Save fact_sales as Parquet
with BenchmarkTimer(
    "Save fact_sales as Parquet",
    description=f"Writing {NUM_SALES_RECORDS:,} records to Parquet (Snappy)",
    spark=spark
):
    parquet_path = str(get_data_path("parquet", FACT_SALES_TABLE))
    fact_sales.write.mode("overwrite").option("compression", "snappy").parquet(parquet_path)

parquet_size = get_directory_size_mb(get_data_path("parquet", FACT_SALES_TABLE))
print(f"Parquet Size: {parquet_size:.2f} MB")

In [None]:
# Save dim_customers as Parquet
with BenchmarkTimer(
    "Save dim_customers as Parquet",
    description=f"Writing {NUM_CUSTOMER_RECORDS:,} records to Parquet (Snappy)",
    spark=spark
):
    parquet_path = str(get_data_path("parquet", DIM_CUSTOMERS_TABLE))
    dim_customers.write.mode("overwrite").option("compression", "snappy").parquet(parquet_path)

parquet_size = get_directory_size_mb(get_data_path("parquet", DIM_CUSTOMERS_TABLE))
print(f"Parquet Size: {parquet_size:.2f} MB")

### Save as Delta Lake (with optimization)

In [None]:
# Save fact_sales as Delta
with BenchmarkTimer(
    "Save fact_sales as Delta",
    description=f"Writing {NUM_SALES_RECORDS:,} records to Delta Lake",
    spark=spark
):
    delta_path = str(get_data_path("delta", FACT_SALES_TABLE))
    fact_sales.write.mode("overwrite").format("delta").save(delta_path)

delta_size_before = get_directory_size_mb(get_data_path("delta", FACT_SALES_TABLE))
print(f"Delta Size (before optimization): {delta_size_before:.2f} MB")

In [None]:
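# Optimize fact_sales Delta table with Z-Ordering on customer_id
# Z-ordering clusters rows with nearby customer_id values into the same files,
# which lets filters on customer_id skip files and may also help joins on that key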
print("Optimizing fact_sales Delta table with Z-ORDER BY customer_id...")

with BenchmarkTimer(
    "Optimize fact_sales Delta (ZORDER)",
    description="Running OPTIMIZE with ZORDER BY customer_id",
    spark=spark
):
    delta_path = str(get_data_path("delta", FACT_SALES_TABLE))
    spark.sql(f"OPTIMIZE delta.`{delta_path}` ZORDER BY (customer_id)")

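# OPTIMIZE writes new compacted files but keeps the old ones until VACUUM runs,
# so the directory size measured here includes both sets of files
delta_size_after = get_directory_size_mb(get_data_path("delta", FACT_SALES_TABLE))
print(f"Delta Size (after optimization): {delta_size_after:.2f} MB")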
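
To see why clustering on `customer_id` can reduce the number of files read, consider this plain-Python toy. It is not Delta's implementation: it assumes that clustering on a single column behaves roughly like sorting by that column, and that a reader can skip any file whose minimum and maximum exclude the key it is looking for. `split_into_files` and `files_to_scan` are hypothetical helpers, and the row and file counts are arbitrary.

```python
import random


def split_into_files(values, num_files):
    """Cut a list into equally sized consecutive chunks, one per 'file'."""
    size = len(values) // num_files
    return [values[i * size:(i + 1) * size] for i in range(num_files)]


def files_to_scan(files, key):
    """Count files whose [min, max] range could contain `key`."""
    return sum(1 for f in files if min(f) <= key <= max(f))


rng = random.Random(42)
customer_ids = [rng.randrange(100_000) for _ in range(80_000)]

# Unclustered: every file spans almost the whole customer_id range.
unclustered = split_into_files(customer_ids, 8)

# Clustered: sorting first gives each file a narrow, mostly disjoint range.
clustered = split_into_files(sorted(customer_ids), 8)

lookup_key = 50_000
print("unclustered files to scan:", files_to_scan(unclustered, lookup_key))
print("clustered files to scan:  ", files_to_scan(clustered, lookup_key))
```

With random placement, each file's minimum and maximum cover nearly the full key range, so no file can be ruled out. After sorting, at most a couple of files can contain any given key.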

In [None]:
# Save dim_customers as Delta
with BenchmarkTimer(
    "Save dim_customers as Delta",
    description=f"Writing {NUM_CUSTOMER_RECORDS:,} records to Delta Lake",
    spark=spark
):
    delta_path = str(get_data_path("delta", DIM_CUSTOMERS_TABLE))
    dim_customers.write.mode("overwrite").format("delta").save(delta_path)

delta_size = get_directory_size_mb(get_data_path("delta", DIM_CUSTOMERS_TABLE))
print(f"Delta Size: {delta_size:.2f} MB")

In [None]:
# Optimize dim_customers Delta table
print("Optimizing dim_customers Delta table...")

with BenchmarkTimer(
    "Optimize dim_customers Delta",
    description="Running OPTIMIZE on dimension table",
    spark=spark
):
    delta_path = str(get_data_path("delta", DIM_CUSTOMERS_TABLE))
    spark.sql(f"OPTIMIZE delta.`{delta_path}`")

## Summary: Storage Comparison

In [None]:
# Calculate and display storage statistics
import pandas as pd

def get_format_sizes(table_name):
    """Get sizes for all formats of a table."""
    sizes = {}
    for fmt in ["csv", "parquet", "delta"]:
        path = get_data_path(fmt, table_name)
        if path.exists():
            sizes[fmt] = get_directory_size_mb(path)
        else:
            sizes[fmt] = 0.0
    return sizes

# Get sizes for both tables
sales_sizes = get_format_sizes(FACT_SALES_TABLE)
customer_sizes = get_format_sizes(DIM_CUSTOMERS_TABLE)

# Create comparison DataFrame
comparison_df = pd.DataFrame({
    'Format': ['CSV', 'Parquet', 'Delta'],
    'fact_sales (MB)': [sales_sizes['csv'], sales_sizes['parquet'], sales_sizes['delta']],
    'dim_customers (MB)': [customer_sizes['csv'], customer_sizes['parquet'], customer_sizes['delta']]
})

# Calculate compression ratios (vs CSV)
comparison_df['Sales Compression %'] = (
    (1 - comparison_df['fact_sales (MB)'] / sales_sizes['csv']) * 100
).round(1)
comparison_df['Customer Compression %'] = (
    (1 - comparison_df['dim_customers (MB)'] / customer_sizes['csv']) * 100
).round(1)

print("="*80)
print("STORAGE COMPARISON SUMMARY")
print("="*80)
print(comparison_df.to_string(index=False))
print("="*80)
print("\nNote: Negative compression % means the format is larger than CSV.")
print("Delta sizes include files superseded by OPTIMIZE until VACUUM removes them.")
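
The "compression %" columns above are size reductions relative to CSV, computed as `(1 - size / csv_size) * 100`. The sketch below works through that formula on invented sizes (100, 25, and 150 MB are illustrative, not measured results). `size_reduction_pct` is a hypothetical helper, and the guard for a missing CSV baseline is an addition that the summary cell does not have.

```python
def size_reduction_pct(size_mb, csv_size_mb):
    """Percent size reduction relative to CSV; negative means larger than CSV."""
    if csv_size_mb <= 0:
        return float("nan")  # CSV baseline missing, so the ratio is undefined
    return round((1 - size_mb / csv_size_mb) * 100, 1)


# Invented sizes for illustration only, with CSV as the 100 MB baseline.
for label, size_mb in [("CSV", 100.0), ("Parquet", 25.0), ("Delta", 150.0)]:
    print(f"{label:8s} {size_reduction_pct(size_mb, 100.0):6.1f} %")
```

A format a quarter the size of CSV scores 75 %, while one that is half again as large scores -50 %.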

## Verification: Read Sample Data from Each Format

In [None]:
# Verify CSV
print("Reading from CSV:")
csv_df = spark.read.option("header", "true").option("inferSchema", "true").csv(
    str(get_data_path("csv", FACT_SALES_TABLE))
)
print(f"Count: {csv_df.count():,}")
csv_df.show(3)

In [None]:
# Verify Parquet
print("Reading from Parquet:")
parquet_df = spark.read.parquet(str(get_data_path("parquet", FACT_SALES_TABLE)))
print(f"Count: {parquet_df.count():,}")
parquet_df.show(3)

In [None]:
# Verify Delta
print("Reading from Delta:")
delta_df = spark.read.format("delta").load(str(get_data_path("delta", FACT_SALES_TABLE)))
print(f"Count: {delta_df.count():,}")
delta_df.show(3)

## Cleanup and Final Summary

In [None]:
# Unpersist cached DataFrames
fact_sales.unpersist()
dim_customers.unpersist()

print("✓ Cached DataFrames released")

# Print final summary
print("\n" + "="*80)
print("DATA GENERATION COMPLETE")
print("="*80)
print(f"✓ Generated {NUM_SALES_RECORDS:,} sales records")
print(f"✓ Generated {NUM_CUSTOMER_RECORDS:,} customer records")
print("✓ Saved in 3 formats: CSV, Parquet, Delta")
print("✓ Optimized Delta tables (fact_sales Z-Ordered by customer_id)")
print("\nNext step: Run notebook 02_format_benchmark.ipynb")
print("="*80)