# Lab 1: Spark Batch Processing

## 🎯 Objectives
- Master Spark DataFrame operations
- Learn data processing patterns
- Understand performance optimization
- Practice with real-world datasets

## 📋 Prerequisites
- Spark cluster running
- Basic Python knowledge
- Understanding of SQL concepts

## 🏗️ Architecture Overview
```
Data Sources → Spark DataFrame → Transformations → Actions → Results
     ↓              ↓                ↓              ↓
   CSV/JSON    Select/Filter    GroupBy/Join    Collect/Write
   Parquet     WithColumn      Aggregations    Database/File
   Database    Drop/Rename     Window Functions
```

## 📊 Sample Datasets
- **Sales Data**: Transaction records with customer, product, timestamp
- **Customer Data**: Demographics, preferences, purchase history  
- **Product Catalog**: Product details, categories, pricing
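
Because the lab builds these datasets with Python's `random` module, each run produces different rows and therefore different aggregates. The sketch below shows one way to make the data reproducible, assuming that is wanted. The helper name `make_sales`, the seed, and the fixed `today` date are hypothetical and not part of the lab, and the field list is trimmed to a few of the sales columns.

```python
import random
from datetime import date, timedelta

PRODUCTS = ["Laptop", "Phone", "Tablet", "Headphones"]
CUSTOMERS = ["Alice", "Bob", "Charlie", "Diana"]


def make_sales(n, seed=42, today=date(2025, 1, 1)):
    """Return n reproducible sales records (hypothetical helper, not lab code)."""
    rng = random.Random(seed)  # private generator: global random state is untouched
    rows = []
    for i in range(n):
        rows.append({
            "sale_id": f"SALE_{i + 1:04d}",
            "customer_name": rng.choice(CUSTOMERS),
            "product_name": rng.choice(PRODUCTS),
            "quantity": rng.randint(1, 5),
            "unit_price": round(rng.uniform(50, 2000), 2),
            "sale_date": (today - timedelta(days=rng.randint(0, 365))).isoformat(),
        })
    return rows


first = make_sales(5)
second = make_sales(5)
print(first == second)  # same seed, so both calls build identical rows
print(first[0]["sale_id"])
```

Inside the lab notebook, `round` is shadowed by the star import from `pyspark.sql.functions`, so `builtins.round` would be needed there instead of the bare name.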


In [None]:
# Install and Import Dependencies
%pip install pyspark findspark pandas numpy pyarrow psycopg2-binary sqlalchemy

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random
import builtins  # Import builtins to use Python's built-in round(), which the star import above shadows

print("✅ Dependencies installed and imported successfully!")


Note: you may need to restart the kernel to use updated packages.
✅ Dependencies installed and imported successfully!


# Initialize Spark Session
spark = SparkSession.builder \
    .appName("SparkBatchProcessingLab") \
    .master("spark://spark-master:7077") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()


In [2]:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext('local')
spark = SparkSession(sc)

25/11/26 07:49:21 WARN Utils: Your hostname, DSAI-TrungTrans-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 172.20.10.2 instead (on interface en0)
25/11/26 07:49:21 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/26 07:49:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:

# Set log level to reduce verbosity
spark.sparkContext.setLogLevel("WARN")

print("🚀 Spark Session initialized successfully!")
print(f"📊 Spark Version: {spark.version}")
print(f"🔗 Master URL: {spark.sparkContext.master}")
#print(f"👥 Available Executors: {spark.sparkContext.statusTracker().getExecutorInfos()}")


🚀 Spark Session initialized successfully!
📊 Spark Version: 3.5.0
🔗 Master URL: local


In [7]:
# Create Sample Data for Spark Lab
print("📊 Creating sample datasets for Spark Batch Processing Lab...")

# Sample Sales Data
sales_data = []
products = ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Camera', 'Monitor', 'Keyboard', 'Mouse']
customers = ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry']
categories = ['Electronics', 'Accessories', 'Computing']

for i in range(1000):
    sales_data.append({
        'sale_id': f'SALE_{i+1:04d}',
        'customer_name': random.choice(customers),
        'product_name': random.choice(products),
        'category': random.choice(categories),
        'quantity': random.randint(1, 5),
        'unit_price': builtins.round(random.uniform(50, 2000), 2),  # Use Python's built-in round()
        'sale_date': (datetime.now() - timedelta(days=random.randint(0, 365))).strftime('%Y-%m-%d'),
        'region': random.choice(['North', 'South', 'East', 'West', 'Central'])
    })

# Sample Customer Data
customer_data = []
for customer in customers:
    customer_data.append({
        'customer_name': customer,
        'age': random.randint(25, 65),
        'city': random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']),
        'membership_level': random.choice(['Bronze', 'Silver', 'Gold', 'Platinum']),
        'join_date': (datetime.now() - timedelta(days=random.randint(30, 1000))).strftime('%Y-%m-%d'),
        'total_purchases': random.randint(5, 50)
    })

# Sample Product Data
product_data = []
for product in products:
    product_data.append({
        'product_name': product,
        'category': random.choice(categories),
        'brand': random.choice(['TechCorp', 'ElectroMax', 'DigitalPro', 'SmartTech']),
        'cost_price': builtins.round(random.uniform(30, 1500), 2),  # Use Python's built-in round()
        'in_stock': random.randint(0, 100),
        'supplier': random.choice(['SupplierA', 'SupplierB', 'SupplierC'])
    })

print(f"✅ Sample data created:")
print(f"   📊 Sales records: {len(sales_data)}")
print(f"   👥 Customer records: {len(customer_data)}")
print(f"   📦 Product records: {len(product_data)}")


📊 Creating sample datasets for Spark Batch Processing Lab...
✅ Sample data created:
   📊 Sales records: 1000
   👥 Customer records: 8
   📦 Product records: 8


In [8]:
# Create Spark DataFrames
print("🔄 Creating Spark DataFrames from sample data...")

# Convert to Spark DataFrames
sales_df = spark.createDataFrame(sales_data)
customers_df = spark.createDataFrame(customer_data)
products_df = spark.createDataFrame(product_data)

# Show schema and sample data
print("\n📊 Sales DataFrame Schema:")
sales_df.printSchema()

print("\n📊 Sales DataFrame Sample:")
sales_df.show(5, truncate=False)

print("\n👥 Customers DataFrame Schema:")
customers_df.printSchema()

print("\n👥 Customers DataFrame Sample:")
customers_df.show(5, truncate=False)

print("\n📦 Products DataFrame Schema:")
products_df.printSchema()

print("\n📦 Products DataFrame Sample:")
products_df.show(5, truncate=False)

print(f"\n✅ DataFrames created successfully!")
print(f"   📊 Sales: {sales_df.count()} records")
print(f"   👥 Customers: {customers_df.count()} records")
print(f"   📦 Products: {products_df.count()} records")


🔄 Creating Spark DataFrames from sample data...

📊 Sales DataFrame Schema:
root
 |-- category: string (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- product_name: string (nullable = true)
 |-- quantity: long (nullable = true)
 |-- region: string (nullable = true)
 |-- sale_date: string (nullable = true)
 |-- sale_id: string (nullable = true)
 |-- unit_price: double (nullable = true)


📊 Sales DataFrame Sample:

+-----------+-------------+------------+--------+-------+----------+---------+----------+
|category   |customer_name|product_name|quantity|region |sale_date |sale_id  |unit_price|
+-----------+-------------+------------+--------+-------+----------+---------+----------+
|Electronics|Diana        |Monitor     |4       |East   |2025-04-12|SALE_0001|1326.25   |
|Computing  |Alice        |Headphones  |4       |West   |2025-02-02|SALE_0002|245.87    |
|Accessories|Charlie      |Camera      |2       |East   |2025-01-15|SALE_0003|1787.48   |
|Computing  |Alice        |Tablet      |5       |Central|2025-05-27|SALE_0004|1036.16   |
|Accessories|Diana        |Camera      |5       |West   |2025-06-13|SALE_0005|1028.07   |
+-----------+-------------+------------+--------+-------+----------+---------+----------+
only showing top 5 rows


👥 Customers DataFrame Schema:
root
 |-- age: long (nullable = true)
 |-- city: string (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- join_da

## Exercise 1: Basic DataFrame Operations

### 🎯 **Learning Objectives:**
- Master DataFrame transformations
- Learn filtering and selection patterns
- Practice column operations
- Understand DataFrame caching strategies

### 📚 **Key Concepts:**
1. **Transformations**: Lazy operations that build an execution plan
2. **Actions**: Operations that trigger computation
3. **Column Operations**: Working with DataFrame columns
4. **Caching**: Optimizing repeated operations
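
The split between transformations and actions can be modeled in a few lines of plain Python. This is a toy illustration only, not how Spark is implemented: the class `LazyRows` and its methods are hypothetical names. Transformations only append a step to a recorded plan, and the plan runs when an action (`collect` or `count`) is called.

```python
class LazyRows:
    """Toy model of lazy evaluation (hypothetical; not Spark's implementation)."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan        # recorded transformations, not yet executed
        self.evaluations = 0     # how many times an action has run the plan

    def filter(self, predicate):
        # Transformation: return a new object with a longer plan; no work is done.
        return LazyRows(self._rows, self._plan + (("filter", predicate),))

    def with_column(self, name, fn):
        # Transformation: also only recorded.
        return LazyRows(self._rows, self._plan + (("with_column", (name, fn)),))

    def collect(self):
        # Action: run every recorded step over the source rows.
        self.evaluations += 1
        out = [dict(r) for r in self._rows]
        for kind, arg in self._plan:
            if kind == "filter":
                out = [r for r in out if arg(r)]
            else:
                name, fn = arg
                out = [{**r, name: fn(r)} for r in out]
        return out

    def count(self):
        # Action: needs the full result, so it runs the plan as well.
        return len(self.collect())


sales = LazyRows([
    {"quantity": 4, "unit_price": 600.0},
    {"quantity": 1, "unit_price": 900.0},
    {"quantity": 3, "unit_price": 100.0},
])
big = (sales.filter(lambda r: r["quantity"] > 2 and r["unit_price"] > 500)
            .with_column("total_amount", lambda r: r["quantity"] * r["unit_price"]))
print(big.evaluations)  # the plan has been built but not run
print(big.collect())
print(big.evaluations)
```

In the same spirit, each action on an uncached Spark DataFrame re-runs its plan, which is the motivation for caching a DataFrame that several actions reuse.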


In [None]:
# Exercise 1: Basic DataFrame Operations
print("🔧 Exercise 1: Basic DataFrame Operations")

print("\n1️⃣ Filtering Operations:")
print("   Filter sales with quantity > 2 and unit_price > 500")

filtered_sales = sales_df.filter(
    (col("quantity") > 2) & (col("unit_price") > 500)
)

print(f"   📊 Filtered records: {filtered_sales.count()}")
filtered_sales.show(5)

print("\n2️⃣ Column Selection and Renaming:")
print("   Select specific columns and rename them")

selected_sales = sales_df.select(
    col("sale_id").alias("transaction_id"),
    col("customer_name").alias("customer"),
    col("product_name").alias("product"),
    col("quantity"),
    col("unit_price").alias("price")
)

selected_sales.show(5)

print("\n3️⃣ Adding Calculated Columns:")
print("   Add total_amount column (quantity * unit_price)")

sales_with_total = sales_df.withColumn(
    "total_amount", 
    col("quantity") * col("unit_price")
).withColumn(
    "discount_applied",
    when(col("total_amount") > 1000, col("total_amount") * 0.1).otherwise(0)
).withColumn(
    "final_amount",
    col("total_amount") - col("discount_applied")
)

sales_with_total.show(5)

print("\n4️⃣ Data Type Conversions:")
print("   Convert sale_date to proper date type")

sales_with_dates = sales_df.withColumn(
    "sale_date", 
    to_date(col("sale_date"), "yyyy-MM-dd")
).withColumn(
    "sale_year",
    year(col("sale_date"))
).withColumn(
    "sale_month", 
    month(col("sale_date"))
)

sales_with_dates.select("sale_id", "sale_date", "sale_year", "sale_month").show(5)

print("\n✅ Basic DataFrame operations completed!")


## Exercise 2: Aggregations and Grouping

### 🎯 **Learning Objectives:**
- Master groupBy operations
- Learn aggregation functions
- Practice window functions
- Understand data summarization patterns

### 📚 **Key Concepts:**
1. **GroupBy**: Grouping data by columns
2. **Aggregations**: Sum, count, avg, min, max operations
3. **Window Functions**: Advanced analytical functions
4. **Pivoting**: Reshaping data for analysis
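
When a window has an `orderBy`, Spark's default frame is a range frame from unbounded preceding to the current row, so rows that tie on the ordering column (for example, two sales by one customer on the same `sale_date`) receive the same running total. The plain-Python sketch below models that behavior next to a row-based frame. The function names and sample rows are hypothetical, and the code is meant for checking small results by hand rather than as Spark code.

```python
from itertools import groupby


def running_total_rows(rows):
    """Row-based frame: each row adds only itself to the total so far."""
    out, total = [], 0.0
    for r in sorted(rows, key=lambda r: r["sale_date"]):
        total += r["amount"]
        out.append((r["sale_date"], r["amount"], total))
    return out


def running_total_range(rows):
    """Range-based frame: rows with the same sale_date are peers and share a total."""
    out, total = [], 0.0
    ordered = sorted(rows, key=lambda r: r["sale_date"])
    for _, peers in groupby(ordered, key=lambda r: r["sale_date"]):
        peers = list(peers)
        total += sum(p["amount"] for p in peers)  # add every peer before emitting
        out.extend((p["sale_date"], p["amount"], total) for p in peers)
    return out


# One customer's partition with a tie on 2025-01-02 (made-up values).
alice = [
    {"sale_date": "2025-01-01", "amount": 100.0},
    {"sale_date": "2025-01-02", "amount": 50.0},
    {"sale_date": "2025-01-02", "amount": 25.0},
    {"sale_date": "2025-01-03", "amount": 10.0},
]
for row in running_total_range(alice):
    print(row)
```

In PySpark, a row-based running total can be requested explicitly with `rowsBetween(Window.unboundedPreceding, Window.currentRow)` on the window specification. With ties, the order among tied rows is still not deterministic unless a tie-breaker such as `sale_id` is added to `orderBy`.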


In [None]:
# Exercise 2: Aggregations and Grouping
print("📊 Exercise 2: Aggregations and Grouping")

print("\n1️⃣ Basic Aggregations:")
print("   Calculate total sales by product")

product_sales = sales_df.groupBy("product_name").agg(
    sum("quantity").alias("total_quantity"),
    sum(col("quantity") * col("unit_price")).alias("total_revenue"),
    avg("unit_price").alias("avg_price"),
    count("*").alias("sale_count")
).orderBy(desc("total_revenue"))

product_sales.show()

print("\n2️⃣ Multi-level Grouping:")
print("   Sales by region and category")

region_category_sales = sales_df.groupBy("region", "category").agg(
    sum(col("quantity") * col("unit_price")).alias("total_revenue"),
    avg(col("quantity") * col("unit_price")).alias("avg_transaction_value"),
    count("*").alias("transaction_count")
).orderBy(desc("total_revenue"))

region_category_sales.show()

print("\n3️⃣ Window Functions:")
print("   Calculate running totals and rankings")

# Define window specification
window_spec = Window.partitionBy("customer_name").orderBy("sale_date")

# Add running totals and rankings
sales_with_window = sales_df.withColumn(
    "running_total", 
    sum(col("quantity") * col("unit_price")).over(window_spec)
).withColumn(
    "transaction_rank",
    row_number().over(window_spec)
).withColumn(
    "customer_avg_transaction",
    avg(col("quantity") * col("unit_price")).over(
        Window.partitionBy("customer_name")
    )
)

sales_with_window.select(
    "customer_name", "sale_date", "quantity", "unit_price", 
    "running_total", "transaction_rank", "customer_avg_transaction"
).show(10)

print("\n4️⃣ Pivot Operations:")
print("   Pivot sales data by region")

pivot_sales = sales_df.groupBy("product_name").pivot("region").agg(
    sum(col("quantity") * col("unit_price")).alias("revenue")
).fillna(0)

pivot_sales.show()

print("\n✅ Aggregations and grouping completed!")


## Exercise 3: Joins and Data Integration

### 🎯 **Learning Objectives:**
- Master different join types
- Learn data integration patterns
- Practice complex join operations
- Understand join optimization

### 📚 **Key Concepts:**
1. **Inner Join**: Matching records from both tables
2. **Left/Right Join**: Including all records from one side
3. **Outer Join**: Including all records from both sides
4. **Join Optimization**: Efficient join strategies
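
The join types listed above differ only in which unmatched rows they keep. A small plain-Python model makes the row counts easy to predict before running the Spark joins. The helper `join` and the sample rows are hypothetical, and the model ignores Spark details such as null-key handling and duplicate column names.

```python
def join(left, right, key, how="inner"):
    """Model inner/left/right/outer joins on lists of dicts (hypothetical helper)."""
    out = []
    matched_right = set()
    for lrow in left:
        hits = [(i, rrow) for i, rrow in enumerate(right) if rrow[key] == lrow[key]]
        for i, rrow in hits:
            matched_right.add(i)
            out.append({**rrow, **lrow})  # one output row per matching pair
        if not hits and how in ("left", "outer"):
            out.append(dict(lrow))        # keep the unmatched left row
    if how in ("right", "outer"):
        # keep right rows that never matched any left row
        out.extend(dict(r) for i, r in enumerate(right) if i not in matched_right)
    return out


sales = [
    {"customer_name": "Alice", "sale_id": "SALE_0001"},
    {"customer_name": "Alice", "sale_id": "SALE_0002"},
    {"customer_name": "Zed", "sale_id": "SALE_0003"},   # no matching customer
]
customers = [
    {"customer_name": "Alice", "city": "Chicago"},
    {"customer_name": "Bob", "city": "Houston"},        # no matching sale
]
for how in ("inner", "left", "right", "outer"):
    print(how, len(join(sales, customers, "customer_name", how)))
```

In the lab's generated data every `customer_name` in the sales rows also appears in the customer table, so the inner and left joins both return 1000 rows. A lower inner-join count would signal unmatched keys.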


In [None]:
# Exercise 3: Joins and Data Integration
print("🔗 Exercise 3: Joins and Data Integration")

print("\n1️⃣ Inner Join:")
print("   Join sales with customer data")

sales_customers = sales_df.join(
    customers_df, 
    sales_df.customer_name == customers_df.customer_name, 
    "inner"
).select(
    sales_df["*"],
    customers_df.age.alias("customer_age"),
    customers_df.city.alias("customer_city"),
    customers_df.membership_level.alias("customer_membership")
)

print(f"   📊 Joined records: {sales_customers.count()}")
sales_customers.show(5)

print("\n2️⃣ Left Join:")
print("   Join sales with product data (include all sales)")

sales_products = sales_df.join(
    products_df,
    sales_df.product_name == products_df.product_name,
    "left"
).select(
    sales_df["*"],
    products_df.brand.alias("product_brand"),
    products_df.cost_price.alias("product_cost"),
    products_df.in_stock.alias("current_stock")
)

print(f"   📊 Joined records: {sales_products.count()}")
sales_products.show(5)

print("\n3️⃣ Complex Multi-table Join:")
print("   Join sales, customers, and products")

complete_sales = sales_df.join(
    customers_df,
    sales_df.customer_name == customers_df.customer_name,
    "inner"
).join(
    products_df,
    sales_df.product_name == products_df.product_name,
    "inner"
).select(
    sales_df.sale_id,
    sales_df.customer_name,
    sales_df.product_name,
    sales_df.quantity,
    sales_df.unit_price,
    (sales_df.quantity * sales_df.unit_price).alias("total_amount"),
    customers_df.age,
    customers_df.city,
    customers_df.membership_level,
    products_df.brand,
    products_df.cost_price,
    (sales_df.quantity * sales_df.unit_price - sales_df.quantity * products_df.cost_price).alias("profit")
)

print(f"   📊 Complete joined records: {complete_sales.count()}")
complete_sales.show(5)

print("\n4️⃣ Join Analysis:")
print("   Analyze profit by customer membership level")

profit_by_membership = complete_sales.groupBy("membership_level").agg(
    sum("total_amount").alias("total_revenue"),
    sum("profit").alias("total_profit"),
    avg("profit").alias("avg_profit_per_transaction"),
    count("*").alias("transaction_count")
).orderBy(desc("total_profit"))

profit_by_membership.show()

print("\n✅ Joins and data integration completed!")


## Exercise 4: Performance Optimization

### 🎯 **Learning Objectives:**
- Learn DataFrame caching strategies
- Understand partitioning concepts
- Practice performance monitoring
- Master optimization techniques

### 📚 **Key Concepts:**
1. **Caching**: Storing DataFrames in memory
2. **Partitioning**: Data distribution strategies
3. **Broadcast Joins**: Optimizing small table joins
4. **Performance Monitoring**: Tracking execution metrics
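
Repartitioning by a column sends all rows with the same key to the same partition. The sketch below models this with `zlib.crc32` as a stand-in hash. That choice is an assumption for illustration: Spark uses its own hash function, so real partition numbers will differ. The point it shows is that five regions spread over four partitions must leave at least two regions sharing a partition.

```python
import zlib
from collections import defaultdict

REGIONS = ["North", "South", "East", "West", "Central"]


def partition_for(key, num_partitions):
    """Stand-in for hash partitioning: stable hash of the key modulo partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(keys, num_partitions):
    """Group keys by the partition they would land in."""
    layout = defaultdict(list)
    for key in keys:
        layout[partition_for(key, num_partitions)].append(key)
    return dict(layout)


layout = assign(REGIONS, 4)
for pid in sorted(layout):
    print(pid, layout[pid])
```

Some partitions may stay empty while others hold several regions, so repartitioning by a low-cardinality column does not guarantee balanced partitions.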


In [None]:
# Exercise 4: Performance Optimization
print("⚡ Exercise 4: Performance Optimization")

print("\n1️⃣ DataFrame Caching:")
print("   Cache frequently used DataFrames")

# Cache the complete sales DataFrame for multiple operations
complete_sales.cache()
print("   📊 Cached complete_sales DataFrame")

# Test cache performance
import time
start_time = time.time()
count1 = complete_sales.count()
first_run_time = time.time() - start_time

start_time = time.time()
count2 = complete_sales.count()
second_run_time = time.time() - start_time

print(f"   ⏱️ First count: {first_run_time:.3f}s")
print(f"   ⏱️ Second count: {second_run_time:.3f}s")
print(f"   📊 Records: {count1}")

print("\n2️⃣ Broadcast Join:")
print("   Use broadcast join for small tables")

# Broadcast the small customers table
from pyspark.sql.functions import broadcast

sales_broadcast = sales_df.join(
    broadcast(customers_df),
    "customer_name",  # join on the column name so the result has a single, unambiguous customer_name
    "inner"
)

print("   📊 Used broadcast join for customers table")
sales_broadcast.select("sale_id", "customer_name", "age", "city").show(5)

print("\n3️⃣ Repartitioning:")
print("   Optimize data partitioning")

# Check current partitions
print(f"   📊 Current partitions: {sales_df.rdd.getNumPartitions()}")

# Repartition by region for better performance
sales_repartitioned = sales_df.repartition(4, "region")
print(f"   📊 Repartitioned partitions: {sales_repartitioned.rdd.getNumPartitions()}")

# Coalesce to reduce partitions
sales_coalesced = sales_repartitioned.coalesce(2)
print(f"   📊 Coalesced partitions: {sales_coalesced.rdd.getNumPartitions()}")

print("\n4️⃣ Performance Monitoring:")
print("   Monitor Spark application performance")

# Get Spark context information
sc = spark.sparkContext
print(f"   🔧 Spark Version: {sc.version}")
print(f"   🔧 Master: {sc.master}")
print(f"   🔧 App Name: {sc.appName}")

# PySpark's StatusTracker has no getExecutorInfos() (Scala/Java API only),
# so report job and stage activity plus the Spark UI address instead.
tracker = sc.statusTracker()
print(f"   🔧 Default parallelism: {sc.defaultParallelism}")
print(f"   🔧 Active jobs: {tracker.getActiveJobsIds()}")
print(f"   🔧 Active stages: {tracker.getActiveStageIds()}")
print(f"   🔧 Spark UI: {sc.uiWebUrl}")

print("\n5️⃣ Memory Management:")
print("   Check DataFrame memory usage")

# Show storage level
print(f"   📊 Storage Level: {complete_sales.storageLevel}")

# Unpersist cached DataFrames
complete_sales.unpersist()
print("   🗑️ Unpersisted cached DataFrame")

print("\n✅ Performance optimization completed!")


## Exercise 5: Data Persistence and Export

### 🎯 **Learning Objectives:**
- Learn data persistence strategies
- Practice different file formats
- Understand data export patterns
- Master data pipeline completion

### 📚 **Key Concepts:**
1. **File Formats**: Parquet, JSON, CSV, Avro
2. **Data Persistence**: Saving processed results
3. **Partitioned Storage**: Organizing data by partitions
4. **Data Export**: Writing to external systems
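
With `partitionBy`, Spark writes one subdirectory per distinct value, named `column=value`, and stores the partition column in the path instead of in the data files. The sketch below models that layout with `pathlib`. The helper names are hypothetical, and the file name `part-0000.parquet` is a placeholder, since real part-file names are generated by Spark.

```python
from pathlib import PurePosixPath


def partition_dir(base, column, value):
    """Directory a row would be written under for a single partition column."""
    return PurePosixPath(base) / f"{column}={value}"


def parse_partition(path):
    """Recover partition column values from the column=value parts of a path."""
    return dict(part.split("=", 1) for part in PurePosixPath(path).parts if "=" in part)


base = "/tmp/spark_lab_output/sales_partitioned"
for region in ["North", "South", "East", "West", "Central"]:
    print(partition_dir(base, "region", region))

# Placeholder file name; Spark generates the real part-file names.
example = partition_dir(base, "region", "East") / "part-0000.parquet"
print(parse_partition(example))
```

A reader that filters on `region` can then skip whole directories, which is why partitioning by a frequently filtered column helps query performance.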


In [None]:
# Exercise 5: Data Persistence and Export
print("💾 Exercise 5: Data Persistence and Export")

print("\n1️⃣ Export to Different Formats:")
print("   Save processed data in various formats")

# Create output directory
output_dir = "/tmp/spark_lab_output"
print(f"   📁 Output directory: {output_dir}")

# Export to Parquet (recommended for Spark)
print("\n   📊 Exporting to Parquet format...")
complete_sales.write \
    .mode("overwrite") \
    .parquet(f"{output_dir}/sales_parquet")

print("   ✅ Parquet export completed")

# Export to JSON
print("\n   📊 Exporting to JSON format...")
product_sales.write \
    .mode("overwrite") \
    .json(f"{output_dir}/product_sales_json")

print("   ✅ JSON export completed")

# Export to CSV
print("\n   📊 Exporting to CSV format...")
profit_by_membership.write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv(f"{output_dir}/profit_analysis_csv")

print("   ✅ CSV export completed")

print("\n2️⃣ Partitioned Storage:")
print("   Save data partitioned by region")

# Partition by region for better query performance
print("\n   📊 Exporting partitioned data...")
sales_df.write \
    .mode("overwrite") \
    .partitionBy("region") \
    .parquet(f"{output_dir}/sales_partitioned")

print("   ✅ Partitioned export completed")

print("\n3️⃣ Data Validation:")
print("   Verify exported data")

# Read back and validate
print("\n   📊 Reading Parquet data...")
parquet_data = spark.read.parquet(f"{output_dir}/sales_parquet")
print(f"   📊 Parquet records: {parquet_data.count()}")

print("\n   📊 Reading JSON data...")
json_data = spark.read.json(f"{output_dir}/product_sales_json")
print(f"   📊 JSON records: {json_data.count()}")

print("\n   📊 Reading CSV data...")
csv_data = spark.read.option("header", "true").csv(f"{output_dir}/profit_analysis_csv")
print(f"   📊 CSV records: {csv_data.count()}")

print("\n4️⃣ Summary Statistics:")
print("   Final data processing summary")

print(f"\n📊 Processing Summary:")
print(f"   📈 Total sales records processed: {sales_df.count()}")
print(f"   👥 Customer records: {customers_df.count()}")
print(f"   📦 Product records: {products_df.count()}")
print(f"   🔗 Joined records: {complete_sales.count()}")
print(f"   📊 Product sales analysis: {product_sales.count()}")
print(f"   💰 Profit analysis records: {profit_by_membership.count()}")

print("\n✅ Data persistence and export completed!")
print("🎉 Spark Batch Processing Lab completed successfully!")


In [None]:
# Cleanup and Best Practices
print("🧹 Cleanup and Best Practices")

print("\n📋 Spark Batch Processing Best Practices:")
print("✅ Use appropriate file formats (Parquet for analytics)")
print("✅ Cache DataFrames that are used multiple times")
print("✅ Use broadcast joins for small tables")
print("✅ Partition data by frequently queried columns")
print("✅ Monitor Spark UI for performance insights")
print("✅ Use appropriate data types to save memory")
print("✅ Avoid unnecessary shuffles and repartitions")
print("✅ Use column pruning and predicate pushdown")
print("✅ Set appropriate batch sizes for streaming")
print("✅ Clean up cached DataFrames when done")

print("\n🔧 Performance Tips:")
print("✅ Enable adaptive query execution (AQE)")
print("✅ Use Kryo serializer for better performance")
print("✅ Tune executor memory and cores")
print("✅ Use appropriate storage levels")
print("✅ Monitor and optimize shuffle operations")

print("\n📊 Data Quality Tips:")
print("✅ Validate data schemas")
print("✅ Handle null values appropriately")
print("✅ Use consistent data types")
print("✅ Implement data quality checks")
print("✅ Document data transformations")

print("\n🎯 Next Steps:")
print("🚀 Try Lab 2: Spark Streaming for real-time processing")
print("🤖 Try Lab 3: Spark MLlib for machine learning")
print("📈 Explore Spark UI for performance monitoring")
print("🔍 Practice with larger datasets")

print("\n✅ Spark Batch Processing Lab completed!")
print("🎉 Ready for Spark Streaming and MLlib labs!")
