# PySpark Merge Performance Test

**Purpose:** Merge all 36 monthly source tables into a single `pyspark_bronze_sales` table and time the operation

**Source:** `source_bronze_YYYY_MM` (36 tables)

**Target:** `pyspark_bronze_sales`

**Comparison:** This will be compared against dbt Jobs performance

In [1]:
import time

# Configuration
SOURCE_PREFIX = "source_bronze"
TARGET_TABLE = "pyspark_bronze_sales"
YEARS = [2023, 2024, 2025]
MONTHS = range(1, 13)

print("="*60)
print("PYSPARK MERGE PERFORMANCE TEST")
print("="*60)
print(f"Source: {SOURCE_PREFIX}_YYYY_MM (36 tables)")
print(f"Target: {TARGET_TABLE}")
print("Method: PySpark DataFrame UNION")

PYSPARK MERGE PERFORMANCE TEST
Source: source_bronze_YYYY_MM (36 tables)
Target: pyspark_bronze_sales
Method: PySpark DataFrame UNION


In [3]:
# PERFORMANCE TEST START
print("\nStarting performance test...")
start_time = time.time()

df_merged = None
tables_merged = 0

for year in YEARS:
    for month in MONTHS:
        table_name = f"{SOURCE_PREFIX}_{year}_{month:02d}"
        
        try:
            df = spark.table(table_name)
            
            if df_merged is None:
                df_merged = df
            else:
                # union() matches columns by position, so every source table
                # must share the same column order.
                df_merged = df_merged.union(df)
            
            tables_merged += 1
        except Exception as e:
            print(f"  Error reading {table_name}: {type(e).__name__}: {e}")

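# spark.table() and union() are lazy: this timer covers catalog lookups and
# query-plan construction. The data itself is scanned during the write step.
read_time = time.time() - start_time
print(f"\nRead & Union: {read_time:.2f} seconds ({tables_merged} tables)")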


Starting performance test...

Read & Union: 30.70 seconds (36 tables)
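
The loop above builds the union one table at a time. A minimal sketch of the same pattern using `functools.reduce` is shown below; it is an illustration, not the code that produced the timings in this notebook. The helper names `source_table_names` and `union_all` are hypothetical, and `unionByName` is swapped in for `union` on the assumption that matching columns by name is preferred over matching by position. Because `spark.table()` and the union calls are lazy, neither version scans any data until an action such as the write runs.

```python
from functools import reduce


def source_table_names(prefix, years, months):
    """Build names such as source_bronze_2023_01 (hypothetical helper)."""
    return [f"{prefix}_{year}_{month:02d}" for year in years for month in months]


def union_all(frames):
    """Fold DataFrames into one, matching columns by name (hypothetical helper)."""
    frames = list(frames)
    if not frames:
        raise ValueError("no DataFrames to union")
    return reduce(lambda left, right: left.unionByName(right), frames)


# Usage inside the notebook session, assuming `spark` and the constants above exist:
# names = source_table_names(SOURCE_PREFIX, YEARS, MONTHS)
# df_merged = union_all(spark.table(name) for name in names)
```

Unlike the loop, this version does not skip tables that fail to load; a missing table raises on the `spark.table` call, which may or may not be the desired behavior for a benchmark.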


In [4]:
# Write merged data to target table
write_start = time.time()

df_merged.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(TARGET_TABLE)

write_time = time.time() - write_start
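# total_time is wall-clock time since start_time was set in the read/union cell,
# so it also counts any time that elapsed between cell executions and can
# exceed read_time + write_time.
total_time = time.time() - start_time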

print(f"Write Time:   {write_time:.2f} seconds")

Write Time:   112.02 seconds


In [6]:
# Verify row count
row_count = spark.table(TARGET_TABLE).count()

print("\n" + "="*60)
print("PYSPARK MERGE - RESULTS")
print("="*60)
print(f"Source Tables:     {SOURCE_PREFIX}_YYYY_MM")
print(f"Target Table:      {TARGET_TABLE}")
print(f"Tables Merged:     {tables_merged}")
print(f"Total Rows:        {row_count:,}")
print()
print(f"Read/Union Time:   {read_time:.2f} seconds")
print(f"Write Time:        {write_time:.2f} seconds")
print()
print(f">>> TOTAL TIME:    {total_time:.2f} seconds <<<")
print("="*60)


PYSPARK MERGE - RESULTS
Source Tables:     source_bronze_YYYY_MM
Target Table:      pyspark_bronze_sales
Tables Merged:     36
Total Rows:        3,600,000

Read/Union Time:   30.70 seconds
Write Time:        112.02 seconds

>>> TOTAL TIME:    192.96 seconds <<<
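
The two reported phases sum to 30.70 + 112.02 = 142.72 seconds, about 50 seconds less than the 192.96-second total. That gap is consistent with `total_time` being measured from a `start_time` set in an earlier cell, so it includes time between cell runs. One way to keep the total independent of such gaps is to time each phase separately and add the results. The sketch below assumes a hypothetical `timed` helper and `timings` dictionary that are not part of the original notebook.

```python
import time
from contextlib import contextmanager

timings = {}  # hypothetical store: phase label -> elapsed seconds


@contextmanager
def timed(label, store=timings):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        store[label] = store.get(label, 0.0) + (time.perf_counter() - start)


def total_seconds(store=timings):
    """Sum only the measured phases, ignoring idle time between them."""
    return sum(store.values())


# Usage inside the notebook, assuming df_merged and TARGET_TABLE exist:
# with timed("write"):
#     df_merged.write.format("delta").mode("overwrite").saveAsTable(TARGET_TABLE)
# print(f"Total: {total_seconds():.2f} seconds")
```

For a comparison against dbt Jobs, a total built from the measured phases is probably the fairer figure, since it excludes time the notebook spent waiting between cells.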


In [10]:
# Show sample
print(f"Sample data from {TARGET_TABLE}:")
spark.table(TARGET_TABLE).show(5)

Sample data from pyspark_bronze_sales:
+--------------+-----------+---------+-------------+----------+----------+----------+-----------+---------+----------+-----------------+----------+------------+-----+-------+---------------+----------+--------+------------+-----------+----------+----------+---------------+------------+-----------------+---------------------+------------------+-----------------------+--------------+------------+-----------+----------+---------------+-------------+--------------+----------+
|      order_id|customer_id|driver_id|restaurant_id|order_date|order_time|order_year|order_month|order_day|order_hour|order_day_of_week|is_weekend|        city|state|country|restaurant_type|item_count|subtotal|delivery_fee|service_fee|tax_amount|tip_amount|discount_amount|total_amount|prep_time_minutes|delivery_time_minutes|total_time_minutes|delivery_distance_miles|payment_method|order_status|device_type|promo_code|customer_rating|driver_rating|is_first_order|is_reorder|
+------

In [11]:
# Verify distribution by year
print("Row distribution by year:")
spark.sql(f"""
    SELECT order_year, COUNT(*) as row_count
    FROM {TARGET_TABLE}
    GROUP BY order_year
    ORDER BY order_year
""").show()

Row distribution by year:
+----------+---------+
|order_year|row_count|
+----------+---------+
|      2023|  1200000|
|      2024|  1200000|
|      2025|  1200000|
+----------+---------+
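
The per-year counts above add up to the 3,600,000 rows reported in the results. A further check, sketched below, would compare the summed source row counts against the target. `validate_row_counts` is a hypothetical helper; it assumes only that `spark.table(name).count()` works for each table name. It triggers one count job per source table, so it is best kept outside any timed section.

```python
def validate_row_counts(spark, source_tables, target_table):
    """Return (source_total, target_total, match) for a merge sanity check."""
    source_total = sum(spark.table(name).count() for name in source_tables)
    target_total = spark.table(target_table).count()
    return source_total, target_total, source_total == target_total


# Usage inside the notebook, assuming `spark` and the configuration constants exist:
# sources = [f"{SOURCE_PREFIX}_{y}_{m:02d}" for y in YEARS for m in MONTHS]
# src, tgt, ok = validate_row_counts(spark, sources, TARGET_TABLE)
# print(f"source={src:,} target={tgt:,} match={ok}")
```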

