# 01 - Data Generation for Spark Performance Benchmark (PaySim Dataset)

**Objective:** Use the PaySim mobile money transaction dataset to benchmark I/O performance and join strategies.

**Dataset:** PaySim1 - Mobile Money Transactions
- Source: https://www.kaggle.com/datasets/ealaxi/paysim1
- Size: ~6.3 million transactions
- Features: Transaction types, amounts, balances, fraud indicators

This notebook:
1. **Loads the PaySim CSV data** (downloaded from Kaggle) from `data/raw/`
2. **Creates two tables**:
   - **fact_transactions**: Main transaction table (~6.3M rows)
   - **dim_accounts**: Unique account dimension table
3. **Saves in three formats**:
   - **CSV**: Row-oriented, uncompressed
   - **Parquet**: Columnar, compressed (Snappy)
   - **Delta Lake**: Columnar, versioned, optimized with Z-Ordering

---

## Prerequisites: Download PaySim Dataset

**Manual Download (Required):**
1. Visit: https://www.kaggle.com/datasets/ealaxi/paysim1
2. Click the "Download" button (requires a free Kaggle account)
3. Extract the ZIP file
4. Place `PS_20174392719_1491204439457_log.csv` in your project's `data/raw/` folder
5. Optionally rename it to `paysim.csv` for simplicity

**Expected file location:**
```
project/
└── data/
    └── raw/
        └── paysim.csv  (or PS_20174392719_1491204439457_log.csv)
```

## Setup and Imports

In [1]:
# Add src directory to path
import sys
from pathlib import Path

# Resolve the project root (the notebook's parent directory) and add its src folder to the import path
notebook_dir = Path.cwd()
project_root = notebook_dir.parent
src_dir = project_root / "src"
sys.path.insert(0, str(src_dir))

print(f"Project root: {project_root}")
print(f"Src directory: {src_dir}")

Project root: C:\Users\samvo\source\repos\Spark-Performance-Benchmark
Src directory: C:\Users\samvo\source\repos\Spark-Performance-Benchmark\src


In [2]:
# Configure Hadoop for Windows (MUST run before creating SparkSession)
from config import configure_hadoop_home
configure_hadoop_home()


✓ All directories verified/created successfully
✓ HADOOP_HOME already set: C:\hadoop


In [3]:
# Import required libraries
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
from pathlib import Path

# Import project modules
from config import (
    get_data_path,
    SPARK_APP_NAME,
    ensure_directories_exist
)
from benchmark_utils import BenchmarkTimer, get_directory_size_mb

# Ensure all directories exist
ensure_directories_exist()

print("✓ All imports successful")

✓ All directories verified/created successfully
✓ All imports successful


## Initialize Spark Session

In [4]:
# Create Spark session with Delta Lake support
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder
    .appName(f"{SPARK_APP_NAME} - PaySim Data Processing")
    .master("local[*]")  # Use all available cores
    .config("spark.driver.memory", "4g")  # Adjust based on available RAM
    .config("spark.sql.shuffle.partitions", "8")  # Optimize for local execution
    # --- IMPORTANT: Delta Lake configuration ---
    .config("spark.jars.packages", "io.delta:delta-spark_2.12:3.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Windows fix for temporary files:
    .config("spark.sql.warehouse.dir", "file:///C:/temp")
)

# configure_spark_with_delta_pip is the key step: it connects the pip-installed delta-spark package to Spark's JVM side.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

print(f"✓ Spark {spark.version} session initialized with Delta Lake")
print(f"✓ Master: {spark.sparkContext.master}")
print(f"✓ App Name: {spark.sparkContext.appName}")

✓ Spark 3.5.0 session initialized with Delta Lake
✓ Master: local[*]
✓ App Name: SparkPerformanceBenchmark - PaySim Data Processing


## Verify Dataset and Display Info

In [5]:
# Setup paths and check for PaySim CSV
from pathlib import Path
from config import RAW_DATA_DIR

# Check for different possible filenames
possible_files = [
    RAW_DATA_DIR / "paysim.csv",
    RAW_DATA_DIR / "PS_20174392719_1491204439457_log.csv"
]

PAYSIM_CSV_PATH = None
for file_path in possible_files:
    if file_path.exists():
        PAYSIM_CSV_PATH = file_path
        break

if PAYSIM_CSV_PATH is None:
    print("❌ ERROR: PaySim CSV file not found!")
    print("\nPlease download the dataset from:")
    print("https://www.kaggle.com/datasets/ealaxi/paysim1")
    print(f"\nAnd place it in: {RAW_DATA_DIR}/")
    print("\nAccepted filenames:")
    for f in possible_files:
        print(f"  - {f.name}")
    raise FileNotFoundError("PaySim CSV not found")

file_size_mb = PAYSIM_CSV_PATH.stat().st_size / (1024 * 1024)
print("="*70)
print("DATASET FOUND")
print("="*70)
print(f"File: {PAYSIM_CSV_PATH.name}")
print(f"Path: {PAYSIM_CSV_PATH}")
print(f"Size: {file_size_mb:.2f} MB")
print("="*70)

DATASET FOUND
File: PS_20174392719_1491204439457_log.csv
Path: C:\Users\samvo\source\repos\Spark-Performance-Benchmark\data\raw\PS_20174392719_1491204439457_log.csv
Size: 470.67 MB


## Load PaySim Dataset

**PaySim Schema:**
- `step`: Time step (1 unit = 1 hour)
- `type`: Transaction type (CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER)
- `amount`: Transaction amount
- `nameOrig`: Account that initiated the transaction
- `oldbalanceOrg`: Originator's balance before the transaction
- `newbalanceOrig`: Originator's balance after the transaction
- `nameDest`: Recipient account of the transaction (customer or merchant)
- `oldbalanceDest`: Recipient's balance before the transaction
- `newbalanceDest`: Recipient's balance after the transaction
- `isFraud`: 1 if fraud, 0 otherwise
- `isFlaggedFraud`: 1 if flagged as fraud, 0 otherwise
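
To make the column types concrete, the sketch below parses the first two sample rows of the dataset with Python's `csv` module and casts each field to the type that Spark's schema inference reports for this file. The `PAYSIM_TYPES` mapping and the `parse_rows` helper are illustrative names, not part of the project. The balance check is only shown for these two sample rows and is not claimed to hold for the whole dataset.

```python
import csv
import io
import math

# Column -> Python type, mirroring the inferred Spark schema
# (integer -> int, double -> float, string -> str). Illustrative mapping only.
PAYSIM_TYPES = {
    "step": int, "type": str, "amount": float,
    "nameOrig": str, "oldbalanceOrg": float, "newbalanceOrig": float,
    "nameDest": str, "oldbalanceDest": float, "newbalanceDest": float,
    "isFraud": int, "isFlaggedFraud": int,
}

SAMPLE_CSV = """step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
"""

def parse_rows(text):
    """Parse CSV text and cast every field according to PAYSIM_TYPES."""
    reader = csv.DictReader(io.StringIO(text))
    return [{col: PAYSIM_TYPES[col](val) for col, val in row.items()} for row in reader]

rows = parse_rows(SAMPLE_CSV)
for row in rows:
    # For these payments the originator's new balance is the old balance minus the amount.
    expected_after = row["oldbalanceOrg"] - row["amount"]
    consistent = math.isclose(expected_after, row["newbalanceOrig"], abs_tol=0.01)
    print(row["nameOrig"], row["type"], consistent)
```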

In [6]:
# Load PaySim CSV
print("Loading PaySim dataset...")

with BenchmarkTimer(
    "Load PaySim CSV",
    description=f"Reading {file_size_mb:.2f} MB CSV file",
    spark=spark,
    clear_cache=False
):
    paysim_raw = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(str(PAYSIM_CSV_PATH))
    )
    
    # Cache for reuse
    paysim_raw.cache()
    row_count = paysim_raw.count()

print(f"✓ Loaded {row_count:,} transactions")
print("\nSample data:")
paysim_raw.show(5, truncate=False)

Loading PaySim dataset...

Starting benchmark: Load PaySim CSV
Description: Reading 470.67 MB CSV file

✓ Completed: Load PaySim CSV
Duration: 18.090 seconds (0.30 minutes)

✓ Results logged to: C:\Users\samvo\source\repos\Spark-Performance-Benchmark\results\benchmark_logs.csv
✓ Loaded 6,362,620 transactions

Sample data:
+----+--------+--------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+
|step|type    |amount  |nameOrig   |oldbalanceOrg|newbalanceOrig|nameDest   |oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|
+----+--------+--------+-----------+-------------+--------------+-----------+--------------+--------------+-------+--------------+
|1   |PAYMENT |9839.64 |C1231006815|170136.0     |160296.36     |M1979787155|0.0           |0.0           |0      |0             |
|1   |PAYMENT |1864.28 |C1666544295|21249.0      |19384.72      |M2044282225|0.0           |0.0           |0      |0             |
|1   |TRANSFER|181.0 

In [7]:
# Display schema and basic statistics
print("Schema:")
paysim_raw.printSchema()

print("\nData Quality Check:")
print(f"Total rows: {paysim_raw.count():,}")
print(f"Null values per column:")
paysim_raw.select([F.sum(F.col(c).isNull().cast("int")).alias(c) for c in paysim_raw.columns]).show()

print("\nTransaction Type Distribution:")
paysim_raw.groupBy("type").count().orderBy("count", ascending=False).show()

print("\nFraud Statistics:")
fraud_stats = paysim_raw.agg(
    F.sum("isFraud").alias("total_fraud"),
    F.sum("isFlaggedFraud").alias("total_flagged"),
    F.count("*").alias("total_transactions")
).collect()[0]
print(f"Fraudulent transactions: {fraud_stats['total_fraud']:,} ({fraud_stats['total_fraud']/fraud_stats['total_transactions']*100:.2f}%)")
print(f"Flagged transactions: {fraud_stats['total_flagged']:,}")

Schema:
root
 |-- step: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- nameOrig: string (nullable = true)
 |-- oldbalanceOrg: double (nullable = true)
 |-- newbalanceOrig: double (nullable = true)
 |-- nameDest: string (nullable = true)
 |-- oldbalanceDest: double (nullable = true)
 |-- newbalanceDest: double (nullable = true)
 |-- isFraud: integer (nullable = true)
 |-- isFlaggedFraud: integer (nullable = true)


Data Quality Check:
Total rows: 6,362,620
Null values per column:
+----+----+------+--------+-------------+--------------+--------+--------------+--------------+-------+--------------+
|step|type|amount|nameOrig|oldbalanceOrg|newbalanceOrig|nameDest|oldbalanceDest|newbalanceDest|isFraud|isFlaggedFraud|
+----+----+------+--------+-------------+--------------+--------+--------------+--------------+-------+--------------+
|   0|   0|     0|       0|            0|             0|       0|             0|             0|      

## Create Fact Table: `fact_transactions`

We'll use the PaySim data directly as our fact table with some transformations:
- Rename columns for consistency
- Add a proper transaction_id
- Keep all relevant transaction details
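
The `transaction_id` column is generated with `monotonically_increasing_id()`, which guarantees unique, increasing IDs but not consecutive ones. According to the Spark documentation, the current implementation puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The sketch below decodes IDs under that layout. `decode_id` is an illustrative helper, and the two large values are taken from the verification output near the end of this notebook.

```python
RECORD_BITS = 33  # lower 33 bits hold the row number within a partition

def decode_id(transaction_id):
    """Split an ID into (partition_id, record_number) under the documented layout."""
    partition_id = transaction_id >> RECORD_BITS
    record_number = transaction_id & ((1 << RECORD_BITS) - 1)
    return partition_id, record_number

for tid in (0, 1, 8589934592, 8589934594, 111669149696):
    partition_id, record_number = decode_id(tid)
    print(f"{tid:>13} -> partition {partition_id}, record {record_number}")
```

Under this layout, `8589934592` is the first row of partition 1 and `111669149696` is the first row of partition 13, which explains why the IDs read back from disk are far from 0. If dense, gap-free IDs were required, a different approach (for example `zipWithIndex` on the underlying RDD) would be needed, at extra cost.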

In [8]:
# Transform PaySim data into fact_transactions
print("Creating fact_transactions table...")

# Use monotonically_increasing_id() directly (no Window function needed).
# This is much faster because it avoids shuffling all data to a single partition.
# Note: the generated IDs are unique and increasing, but not consecutive.
fact_transactions = (
    paysim_raw
    .withColumn("transaction_id", F.monotonically_increasing_id())
    .select(
        "transaction_id",
        F.col("step").alias("time_step"),
        F.col("type").alias("transaction_type"),
        F.col("amount").alias("amount"),
        F.col("nameOrig").alias("account_orig"),
        F.col("oldbalanceOrg").alias("balance_orig_before"),
        F.col("newbalanceOrig").alias("balance_orig_after"),
        F.col("nameDest").alias("account_dest"),
        F.col("oldbalanceDest").alias("balance_dest_before"),
        F.col("newbalanceDest").alias("balance_dest_after"),
        F.col("isFraud").alias("is_fraud"),
        F.col("isFlaggedFraud").alias("is_flagged_fraud")
    )
)

# Cache for reuse during multiple writes
fact_transactions.cache()
fact_count = fact_transactions.count()

print(f"✓ Created fact_transactions with {fact_count:,} records")
print("\nSample data:")
fact_transactions.show(5, truncate=False)

Creating fact_transactions table...
✓ Created fact_transactions with 6,362,620 records

Sample data:
+--------------+---------+----------------+--------+------------+-------------------+------------------+------------+-------------------+------------------+--------+----------------+
|transaction_id|time_step|transaction_type|amount  |account_orig|balance_orig_before|balance_orig_after|account_dest|balance_dest_before|balance_dest_after|is_fraud|is_flagged_fraud|
+--------------+---------+----------------+--------+------------+-------------------+------------------+------------+-------------------+------------------+--------+----------------+
|0             |1        |PAYMENT         |9839.64 |C1231006815 |170136.0           |160296.36         |M1979787155 |0.0                |0.0               |0       |0               |
|1             |1        |PAYMENT         |1864.28 |C1666544295 |21249.0            |19384.72          |M2044282225 |0.0                |0.0               |0       |0 

In [9]:
# Display schema and statistics
print("Schema:")
fact_transactions.printSchema()

print("\nBasic Statistics:")
fact_transactions.select("amount", "balance_orig_before", "balance_orig_after").describe().show()

Schema:
root
 |-- transaction_id: long (nullable = false)
 |-- time_step: integer (nullable = true)
 |-- transaction_type: string (nullable = true)
 |-- amount: double (nullable = true)
 |-- account_orig: string (nullable = true)
 |-- balance_orig_before: double (nullable = true)
 |-- balance_orig_after: double (nullable = true)
 |-- account_dest: string (nullable = true)
 |-- balance_dest_before: double (nullable = true)
 |-- balance_dest_after: double (nullable = true)
 |-- is_fraud: integer (nullable = true)
 |-- is_flagged_fraud: integer (nullable = true)


Basic Statistics:
+-------+------------------+-------------------+------------------+
|summary|            amount|balance_orig_before|balance_orig_after|
+-------+------------------+-------------------+------------------+
|  count|           6362620|            6362620|           6362620|
|   mean|179861.90354913156|  833883.1040744876| 855113.6685785913|
| stddev| 603858.2314629381| 2888242.6730375625| 2924048.502954259|
|    m

## Create Dimension Table: `dim_accounts`

Extract unique accounts from both origin and destination accounts:
- Combine all unique account names
- Calculate account statistics (total transactions, total volume, fraud rate)
- Determine account type (Customer 'C' vs Merchant 'M')
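
The aggregation logic can be illustrated on a handful of rows in plain Python before reading the Spark version. This is a sketch only: the transactions and the `build_accounts` helper are made up for illustration, and the dictionary merge stands in for the full outer join on `account_id`.

```python
from collections import defaultdict

# (nameOrig, nameDest, amount, isFraud): made-up illustrative rows
transactions = [
    ("C1", "M1", 100.0, 0),
    ("C1", "C2", 250.0, 1),
    ("C2", "M1", 50.0, 0),
]

def build_accounts(rows):
    """Aggregate per-account stats over both the origin and the destination role."""
    stats = defaultdict(lambda: {"total_transactions": 0, "total_volume": 0.0, "fraud_count": 0})
    for orig, dest, amount, is_fraud in rows:
        # An account is counted once per role it plays in a transaction,
        # which mirrors adding the origin-side and destination-side aggregates.
        for account_id in (orig, dest):
            entry = stats[account_id]
            entry["total_transactions"] += 1
            entry["total_volume"] += amount
            entry["fraud_count"] += is_fraud
    accounts = {}
    for account_id, entry in stats.items():
        if account_id.startswith("C"):
            account_type = "Customer"
        elif account_id.startswith("M"):
            account_type = "Merchant"
        else:
            account_type = "Unknown"
        fraud_rate = entry["fraud_count"] / entry["total_transactions"]
        accounts[account_id] = {**entry, "account_type": account_type, "fraud_rate": fraud_rate}
    return accounts

accounts = build_accounts(transactions)
for account_id in sorted(accounts):
    print(account_id, accounts[account_id])
```

Note that a fraudulent transfer counts toward `fraud_count` for both the sender and the recipient, so `fraud_rate` measures involvement in fraudulent transactions rather than who initiated them.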

In [10]:
# Build dim_accounts from per-account statistics (origin and destination side)
print("Creating dim_accounts table...")

# Get origin account statistics
origin_stats = (
    paysim_raw
    .groupBy(F.col("nameOrig").alias("account_id"))
    .agg(
        F.count("*").alias("total_transactions_orig"),
        F.sum("amount").alias("total_amount_orig"),
        F.sum("isFraud").alias("fraud_count_orig")
    )
)

# Get destination account statistics
dest_stats = (
    paysim_raw
    .groupBy(F.col("nameDest").alias("account_id"))
    .agg(
        F.count("*").alias("total_transactions_dest"),
        F.sum("amount").alias("total_amount_dest"),
        F.sum("isFraud").alias("fraud_count_dest")
    )
)

# Combine and create dimension table
dim_accounts = (
    origin_stats
    .join(dest_stats, "account_id", "full_outer")
    .fillna(0, subset=[
        "total_transactions_orig", "total_amount_orig", "fraud_count_orig",
        "total_transactions_dest", "total_amount_dest", "fraud_count_dest"
    ])
    .withColumn(
        "total_transactions",
        F.col("total_transactions_orig") + F.col("total_transactions_dest")
    )
    .withColumn(
        "total_volume",
        F.col("total_amount_orig") + F.col("total_amount_dest")
    )
    .withColumn(
        "fraud_count",
        F.col("fraud_count_orig") + F.col("fraud_count_dest")
    )
    .withColumn(
        "account_type",
        F.when(F.col("account_id").startswith("C"), "Customer")
         .when(F.col("account_id").startswith("M"), "Merchant")
         .otherwise("Unknown")
    )
    .withColumn(
        "fraud_rate",
        F.when(F.col("total_transactions") > 0, 
               F.col("fraud_count") / F.col("total_transactions"))
         .otherwise(0.0)
    )
    .select(
        "account_id",
        "account_type",
        "total_transactions",
        "total_volume",
        "fraud_count",
        "fraud_rate"
    )
)

# Cache for reuse during multiple writes
dim_accounts.cache()
account_count = dim_accounts.count()

print(f"✓ Created dim_accounts with {account_count:,} unique accounts")
print("\nSample data:")
dim_accounts.show(10, truncate=False)

Creating dim_accounts table...
✓ Created dim_accounts with 9,073,900 unique accounts

Sample data:
+-----------+------------+------------------+------------+-----------+----------+
|account_id |account_type|total_transactions|total_volume|fraud_count|fraud_rate|
+-----------+------------+------------------+------------+-----------+----------+
|C1000005555|Customer    |1                 |233109.79   |0          |0.0       |
|C1000008393|Customer    |1                 |58347.84    |0          |0.0       |
|C1000008582|Customer    |1                 |315626.96   |0          |0.0       |
|C1000009272|Customer    |1                 |2262.44     |0          |0.0       |
|C1000012233|Customer    |1                 |331041.93   |0          |0.0       |
|C1000014489|Customer    |1                 |5787.18     |0          |0.0       |
|C100002506 |Customer    |1                 |57887.52    |0          |0.0       |
|C100002808 |Customer    |1                 |232765.42   |0          |0.0       |

In [11]:
# Display schema and statistics
print("Schema:")
dim_accounts.printSchema()

print("\nAccount Type Distribution:")
dim_accounts.groupBy("account_type").count().orderBy("count", ascending=False).show()

print("\nTop 10 Accounts by Transaction Volume:")
dim_accounts.orderBy("total_volume", ascending=False).show(10, truncate=False)

Schema:
root
 |-- account_id: string (nullable = true)
 |-- account_type: string (nullable = false)
 |-- total_transactions: long (nullable = true)
 |-- total_volume: double (nullable = false)
 |-- fraud_count: long (nullable = true)
 |-- fraud_rate: double (nullable = true)


Account Type Distribution:
+------------+-------+
|account_type|  count|
+------------+-------+
|    Customer|6923499|
|    Merchant|2150401|
+------------+-------+


Top 10 Accounts by Transaction Volume:
+-----------+------------+------------------+--------------------+-----------+--------------------+
|account_id |account_type|total_transactions|total_volume        |fraud_count|fraud_rate          |
+-----------+------------+------------------+--------------------+-----------+--------------------+
|C439737079 |Customer    |18                |3.5744083144000006E8|0          |0.0                 |
|C707403537 |Customer    |17                |2.993744184199999E8 |0          |0.0                 |
|C167875008 |Cus

## Save Data in Multiple Formats

We'll save both tables in three formats:
1. **CSV** - Baseline format (row-oriented)
2. **Parquet** - Columnar format with Snappy compression
3. **Delta Lake** - Parquet-based table format with a transaction log (ACID transactions, versioning)
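
The sizes reported in the cells below come from the project helper `get_directory_size_mb` in `benchmark_utils`, whose source is not shown in this notebook. A minimal sketch of how such a helper could work, assuming it simply sums the sizes of all files under a directory (the name `directory_size_mb` is hypothetical and chosen to avoid implying this is the project's implementation):

```python
import tempfile
from pathlib import Path

def directory_size_mb(path):
    """Return the total size of all regular files under `path`, in MiB."""
    root = Path(path)
    total_bytes = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return total_bytes / (1024 * 1024)

# Usage example on a throwaway directory with two files of known size.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "part-0000.csv").write_bytes(b"x" * 1024 * 1024)       # 1 MiB
    nested = Path(tmp) / "_delta_log"
    nested.mkdir()
    (nested / "00000000000000000000.json").write_bytes(b"y" * 524288)  # 0.5 MiB
    size_mb = directory_size_mb(tmp)
    print(f"{size_mb:.2f} MB")
```

A helper of this kind counts every file on disk. For Delta tables that includes the transaction log and any data files that are no longer part of the current table version, which matters when comparing formats.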

### Save as CSV

In [12]:
# Save fact_transactions as CSV
with BenchmarkTimer(
    "Save fact_transactions as CSV",
    description=f"Writing {fact_count:,} records to CSV",
    spark=spark
):
    csv_path = str(get_data_path("csv", "fact_transactions"))
    fact_transactions.write.mode("overwrite").option("header", "true").csv(csv_path)

csv_size = get_directory_size_mb(get_data_path("csv", "fact_transactions"))
print(f"CSV Size: {csv_size:.2f} MB")

✓ Cache cleared for: Save fact_transactions as CSV

Starting benchmark: Save fact_transactions as CSV
Description: Writing 6,362,620 records to CSV

✓ Completed: Save fact_transactions as CSV
Duration: 7.250 seconds (0.12 minutes)

✓ Results logged to: C:\Users\samvo\source\repos\Spark-Performance-Benchmark\results\benchmark_logs.csv
CSV Size: 552.89 MB


In [13]:
# Save dim_accounts as CSV
with BenchmarkTimer(
    "Save dim_accounts as CSV",
    description=f"Writing {account_count:,} records to CSV",
    spark=spark
):
    csv_path = str(get_data_path("csv", "dim_accounts"))
    dim_accounts.write.mode("overwrite").option("header", "true").csv(csv_path)

csv_size = get_directory_size_mb(get_data_path("csv", "dim_accounts"))
print(f"CSV Size: {csv_size:.2f} MB")

✓ Cache cleared for: Save dim_accounts as CSV

Starting benchmark: Save dim_accounts as CSV
Description: Writing 9,073,900 records to CSV

✓ Completed: Save dim_accounts as CSV
Duration: 14.783 seconds (0.25 minutes)

✓ Results logged to: C:\Users\samvo\source\repos\Spark-Performance-Benchmark\results\benchmark_logs.csv
CSV Size: 337.20 MB


### Save as Parquet (with Snappy compression)

In [14]:
# Save fact_transactions as Parquet
with BenchmarkTimer(
    "Save fact_transactions as Parquet",
    description=f"Writing {fact_count:,} records to Parquet (Snappy)",
    spark=spark
):
    parquet_path = str(get_data_path("parquet", "fact_transactions"))
    fact_transactions.write.mode("overwrite").option("compression", "snappy").parquet(parquet_path)

parquet_size = get_directory_size_mb(get_data_path("parquet", "fact_transactions"))
print(f"Parquet Size: {parquet_size:.2f} MB")

✓ Cache cleared for: Save fact_transactions as Parquet

Starting benchmark: Save fact_transactions as Parquet
Description: Writing 6,362,620 records to Parquet (Snappy)

✓ Completed: Save fact_transactions as Parquet
Duration: 8.856 seconds (0.15 minutes)

✓ Results logged to: C:\Users\samvo\source\repos\Spark-Performance-Benchmark\results\benchmark_logs.csv
Parquet Size: 290.38 MB


In [15]:
# Save dim_accounts as Parquet
with BenchmarkTimer(
    "Save dim_accounts as Parquet",
    description=f"Writing {account_count:,} records to Parquet (Snappy)",
    spark=spark
):
    parquet_path = str(get_data_path("parquet", "dim_accounts"))
    dim_accounts.write.mode("overwrite").option("compression", "snappy").parquet(parquet_path)

parquet_size = get_directory_size_mb(get_data_path("parquet", "dim_accounts"))
print(f"Parquet Size: {parquet_size:.2f} MB")

✓ Cache cleared for: Save dim_accounts as Parquet

Starting benchmark: Save dim_accounts as Parquet
Description: Writing 9,073,900 records to Parquet (Snappy)

✓ Completed: Save dim_accounts as Parquet
Duration: 13.769 seconds (0.23 minutes)

✓ Results logged to: C:\Users\samvo\source\repos\Spark-Performance-Benchmark\results\benchmark_logs.csv
Parquet Size: 112.84 MB


### Save as Delta Lake (with optimization)

In [17]:
# Save fact_transactions as Delta
with BenchmarkTimer(
    "Save fact_transactions as Delta",
    description=f"Writing {fact_count:,} records to Delta Lake",
    spark=spark
):
    delta_path = str(get_data_path("delta", "fact_transactions"))
    fact_transactions.write.mode("overwrite").format("delta").save(delta_path)

delta_size_before = get_directory_size_mb(get_data_path("delta", "fact_transactions"))
print(f"Delta Size (before optimization): {delta_size_before:.2f} MB")

✓ Cache cleared for: Save fact_transactions as Delta

Starting benchmark: Save fact_transactions as Delta
Description: Writing 6,362,620 records to Delta Lake

✓ Completed: Save fact_transactions as Delta
Duration: 11.348 seconds (0.19 minutes)

✓ Results logged to: C:\Users\samvo\source\repos\Spark-Performance-Benchmark\results\benchmark_logs.csv
Delta Size (before optimization): 580.80 MB


In [18]:
# Optimize fact_transactions Delta table with Z-Ordering on account_orig
# Z-Ordering co-locates rows with similar account_orig values in the same files,
# so filters and joins on that column can skip more data
print("Optimizing fact_transactions Delta table with Z-ORDER BY account_orig...")

with BenchmarkTimer(
    "Optimize fact_transactions Delta (ZORDER)",
    description="Running OPTIMIZE with ZORDER BY account_orig",
    spark=spark
):
    delta_path = str(get_data_path("delta", "fact_transactions"))
    spark.sql(f"OPTIMIZE delta.`{delta_path}` ZORDER BY (account_orig)")

delta_size_after = get_directory_size_mb(get_data_path("delta", "fact_transactions"))
print(f"Delta Size (after optimization): {delta_size_after:.2f} MB")

Optimizing fact_transactions Delta table with Z-ORDER BY account_orig...
✓ Cache cleared for: Optimize fact_transactions Delta (ZORDER)

Starting benchmark: Optimize fact_transactions Delta (ZORDER)
Description: Running OPTIMIZE with ZORDER BY account_orig

✓ Completed: Optimize fact_transactions Delta (ZORDER)
Duration: 21.682 seconds (0.36 minutes)

✓ Results logged to: C:\Users\samvo\source\repos\Spark-Performance-Benchmark\results\benchmark_logs.csv
Delta Size (after optimization): 862.78 MB
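
The directory grows after `OPTIMIZE` because Delta Lake does not delete the files it rewrites: superseded data files stay on disk (for time travel) until `VACUUM` removes them, so a directory-size measurement counts old and new copies together. The 580.80 MB measured before optimization is also roughly twice the Parquet size of 290.38 MB, which would be consistent with files from an earlier run still being present after `mode("overwrite")`, although the notebook does not confirm this. One way to estimate the size of the current table version is to replay the JSON commits in `_delta_log`. The sketch below assumes a small table with no checkpoint files and no log cleanup yet; `live_table_size_bytes` is an illustrative helper and the demo log is synthetic.

```python
import json
import tempfile
from pathlib import Path

def live_table_size_bytes(table_path):
    """Replay add/remove actions from the JSON commits and sum the sizes of live files."""
    log_dir = Path(table_path) / "_delta_log"
    live = {}  # data file path -> size in bytes
    # Commit files are zero-padded version numbers, so name order is version order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live[action["add"]["path"]] = action["add"]["size"]
            elif "remove" in action:
                live.pop(action["remove"]["path"], None)
    return sum(live.values())

# Synthetic demo: version 0 writes two files, version 1 (like OPTIMIZE) compacts them.
with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp) / "_delta_log"
    log_dir.mkdir()
    v0 = [{"add": {"path": "a.parquet", "size": 100}},
          {"add": {"path": "b.parquet", "size": 200}}]
    v1 = [{"remove": {"path": "a.parquet"}},
          {"remove": {"path": "b.parquet"}},
          {"add": {"path": "c.parquet", "size": 280}}]
    for version, actions in enumerate([v0, v1]):
        name = f"{version:020d}.json"
        (log_dir / name).write_text("\n".join(json.dumps(a) for a in actions), encoding="utf-8")
    live_bytes = live_table_size_bytes(tmp)
    print(live_bytes)
```

With a Spark session available, `DESCRIBE DETAIL` on the table reports `sizeInBytes` and `numFiles` for the current version and is the more robust option, since it also handles checkpoints.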


In [19]:
# Save dim_accounts as Delta
with BenchmarkTimer(
    "Save dim_accounts as Delta",
    description=f"Writing {account_count:,} records to Delta Lake",
    spark=spark
):
    delta_path = str(get_data_path("delta", "dim_accounts"))
    dim_accounts.write.mode("overwrite").format("delta").save(delta_path)

delta_size = get_directory_size_mb(get_data_path("delta", "dim_accounts"))
print(f"Delta Size: {delta_size:.2f} MB")

✓ Cache cleared for: Save dim_accounts as Delta

Starting benchmark: Save dim_accounts as Delta
Description: Writing 9,073,900 records to Delta Lake

✓ Completed: Save dim_accounts as Delta
Duration: 14.875 seconds (0.25 minutes)

✓ Results logged to: C:\Users\samvo\source\repos\Spark-Performance-Benchmark\results\benchmark_logs.csv
Delta Size: 112.86 MB


In [20]:
# Optimize dim_accounts Delta table
print("Optimizing dim_accounts Delta table...")

with BenchmarkTimer(
    "Optimize dim_accounts Delta",
    description="Running OPTIMIZE on dimension table",
    spark=spark
):
    delta_path = str(get_data_path("delta", "dim_accounts"))
    spark.sql(f"OPTIMIZE delta.`{delta_path}`")

Optimizing dim_accounts Delta table...
✓ Cache cleared for: Optimize dim_accounts Delta

Starting benchmark: Optimize dim_accounts Delta
Description: Running OPTIMIZE on dimension table

✓ Completed: Optimize dim_accounts Delta
Duration: 10.378 seconds (0.17 minutes)

✓ Results logged to: C:\Users\samvo\source\repos\Spark-Performance-Benchmark\results\benchmark_logs.csv


## Summary: Storage Comparison

In [21]:
# Calculate and display storage statistics
import pandas as pd

def get_format_sizes(table_name):
    """Get sizes for all formats of a table."""
    sizes = {}
    for fmt in ["csv", "parquet", "delta"]:
        path = get_data_path(fmt, table_name)
        if path.exists():
            sizes[fmt] = get_directory_size_mb(path)
        else:
            sizes[fmt] = 0.0
    return sizes

# Get sizes for both tables
transactions_sizes = get_format_sizes("fact_transactions")
accounts_sizes = get_format_sizes("dim_accounts")

# Create comparison DataFrame
comparison_df = pd.DataFrame({
    'Format': ['CSV', 'Parquet', 'Delta'],
    'fact_transactions (MB)': [transactions_sizes['csv'], transactions_sizes['parquet'], transactions_sizes['delta']],
    'dim_accounts (MB)': [accounts_sizes['csv'], accounts_sizes['parquet'], accounts_sizes['delta']]
})

# Calculate compression ratios (vs CSV)
comparison_df['Transactions Compression %'] = (
    (1 - comparison_df['fact_transactions (MB)'] / transactions_sizes['csv']) * 100
).round(1)
comparison_df['Accounts Compression %'] = (
    (1 - comparison_df['dim_accounts (MB)'] / accounts_sizes['csv']) * 100
).round(1)

print("="*90)
print("STORAGE COMPARISON SUMMARY")
print("="*90)
print(comparison_df.to_string(index=False))
print("="*90)
print(f"\nNote: Negative compression % means the format is larger than CSV")

STORAGE COMPARISON SUMMARY
 Format  fact_transactions (MB)  dim_accounts (MB)  Transactions Compression %  Accounts Compression %
    CSV              552.894993         337.201606                         0.0                     0.0
Parquet              290.375078         112.840783                        47.5                    66.5
  Delta              862.776222         225.712968                       -56.0                    33.1

Note: Negative compression % means the format is larger than CSV
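
The percentages in the table are computed as (1 - size / CSV size) * 100. Reproducing them from the reported sizes makes the arithmetic explicit. The numbers below are copied from the summary output, and `savings_pct` is an illustrative helper name.

```python
sizes_mb = {
    "fact_transactions": {"CSV": 552.894993, "Parquet": 290.375078, "Delta": 862.776222},
    "dim_accounts": {"CSV": 337.201606, "Parquet": 112.840783, "Delta": 225.712968},
}

def savings_pct(size_mb, baseline_mb):
    """Space saved relative to the baseline, in percent (negative means larger)."""
    return round((1 - size_mb / baseline_mb) * 100, 1)

for table, by_format in sizes_mb.items():
    baseline = by_format["CSV"]
    for fmt, size in by_format.items():
        print(f"{table:<18} {fmt:<8} {savings_pct(size, baseline):>6}")
```

The Delta figures measure the whole table directory. For `dim_accounts`, 225.71 MB is almost exactly twice the 112.86 MB measured before `OPTIMIZE`, so the Delta numbers likely reflect superseded files that are still on disk rather than worse compression than Parquet.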


## Verification: Read Sample Data from Each Format

In [22]:
# Verify CSV
print("Reading from CSV:")
csv_df = spark.read.option("header", "true").option("inferSchema", "true").csv(
    str(get_data_path("csv", "fact_transactions"))
)
print(f"Count: {csv_df.count():,}")
csv_df.show(3)

Reading from CSV:
Count: 6,362,620
+--------------+---------+----------------+---------+------------+-------------------+------------------+------------+-------------------+------------------+--------+----------------+
|transaction_id|time_step|transaction_type|   amount|account_orig|balance_orig_before|balance_orig_after|account_dest|balance_dest_before|balance_dest_after|is_fraud|is_flagged_fraud|
+--------------+---------+----------------+---------+------------+-------------------+------------------+------------+-------------------+------------------+--------+----------------+
|  111669149696|      370|        CASH_OUT|131195.36|  C631630829|           43169.28|               0.0| C1324313732|          447142.99|         578338.34|       0|               0|
|  111669149697|      370|         CASH_IN| 88693.24| C2094891829|           106252.0|         194945.24| C2051512830|      1.259020603E7|     1.250151279E7|       0|               0|
|  111669149698|      370|         PAYMENT| 1

In [23]:
# Verify Parquet
print("Reading from Parquet:")
parquet_df = spark.read.parquet(str(get_data_path("parquet", "fact_transactions")))
print(f"Count: {parquet_df.count():,}")
parquet_df.show(3)

Reading from Parquet:
Count: 6,362,620
+--------------+---------+----------------+---------+------------+-------------------+------------------+------------+-------------------+------------------+--------+----------------+
|transaction_id|time_step|transaction_type|   amount|account_orig|balance_orig_before|balance_orig_after|account_dest|balance_dest_before|balance_dest_after|is_fraud|is_flagged_fraud|
+--------------+---------+----------------+---------+------------+-------------------+------------------+------------+-------------------+------------------+--------+----------------+
|    8589934592|       18|         PAYMENT| 12756.09| C1715701002|          171830.01|         159073.92|  M752830443|                0.0|               0.0|       0|               0|
|    8589934593|       18|         PAYMENT|  8780.95| C1811023780|          159073.92|         150292.97| M1404577539|                0.0|               0.0|       0|               0|
|    8589934594|       18|        TRANSFE

In [24]:
# Verify Delta
print("Reading from Delta:")
delta_df = spark.read.format("delta").load(str(get_data_path("delta", "fact_transactions")))
print(f"Count: {delta_df.count():,}")
delta_df.show(3)

Reading from Delta:
Count: 6,362,620
+--------------+---------+----------------+---------+------------+-------------------+------------------+------------+-------------------+------------------+--------+----------------+
|transaction_id|time_step|transaction_type|   amount|account_orig|balance_orig_before|balance_orig_after|account_dest|balance_dest_before|balance_dest_after|is_fraud|is_flagged_fraud|
+--------------+---------+----------------+---------+------------+-------------------+------------------+------------+-------------------+------------------+--------+----------------+
|    8589934592|       18|         PAYMENT| 12756.09| C1715701002|          171830.01|         159073.92|  M752830443|                0.0|               0.0|       0|               0|
|    8589934593|       18|         PAYMENT|  8780.95| C1811023780|          159073.92|         150292.97| M1404577539|                0.0|               0.0|       0|               0|
|    8589934594|       18|        TRANSFER|

## Cleanup and Final Summary

In [25]:
# Unpersist cached DataFrames
paysim_raw.unpersist()
fact_transactions.unpersist()
dim_accounts.unpersist()

print("✓ Cached DataFrames released")

# Print final summary
print("\n" + "="*80)
print("DATA PROCESSING COMPLETE (PaySim Dataset)")
print("="*80)
print(f"✓ Loaded PaySim dataset: {row_count:,} transactions")
print(f"✓ Created fact_transactions: {fact_count:,} records")
print(f"✓ Created dim_accounts: {account_count:,} unique accounts")
print("✓ Saved in 3 formats: CSV, Parquet, Delta")
print("✓ Optimized Delta tables (Z-Ordering on fact_transactions only)")
print("\nNext step: Run notebook 02_format_benchmark.ipynb")
print("="*80)

✓ Cached DataFrames released

DATA PROCESSING COMPLETE (PaySim Dataset)
✓ Loaded PaySim dataset: 6,362,620 transactions
✓ Created fact_transactions: 6,362,620 records
✓ Created dim_accounts: 9,073,900 unique accounts
✓ Saved in 3 formats: CSV, Parquet, Delta
✓ Optimized Delta tables (Z-Ordering on fact_transactions only)

Next step: Run notebook 02_format_benchmark.ipynb
