# PySpark Interview Preparation - Set 5 (Hard)

## Overview & Instructions

### How to run this notebook in Google Colab:
1. Upload this .ipynb file to Google Colab
2. Run the installation cells below
3. Execute each problem cell sequentially

### Installation Commands:
The following cell installs Java and PySpark:

In [None]:
# Install Java 8 and PySpark
# PySpark is kept on the 3.x line because Spark 4.x requires Java 17 or newer
!apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
!pip install -q "pyspark<4"

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

### SparkSession Initialization:

In [None]:
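from pyspark.sql import SparkSession
# Note: the star import below shadows Python built-ins such as round, sum, min, max and abs.
# Use builtins.round (etc.) wherever the plain Python version is needed.
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

# Adaptive query execution is enabled once, on the builder
spark = (
    SparkSession.builder
    .appName("PySparkInterviewSet5")
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)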

### DataFrame Assertion Function:

This function compares DataFrames ignoring order and with floating-point tolerance:

In [None]:
import builtins
from collections import Counter

def assert_dataframe_equal(df_actual, df_expected, epsilon=1e-6):
    """Compare two DataFrames ignoring row order, with floating-point tolerance."""

    # Check schema first: column names and data types must match.
    # Nullability is ignored, since aggregates such as count() produce non-nullable columns.
    actual_schema = [(f.name, f.dataType) for f in df_actual.schema.fields]
    expected_schema = [(f.name, f.dataType) for f in df_expected.schema.fields]
    if actual_schema != expected_schema:
        print("Schema mismatch!")
        print("Actual schema:", df_actual.schema)
        print("Expected schema:", df_expected.schema)
        raise AssertionError("Schema mismatch")

    # Collect data
    actual_data = df_actual.collect()
    expected_data = df_expected.collect()

    if len(actual_data) != len(expected_data):
        print(f"Row count mismatch! Actual: {len(actual_data)}, Expected: {len(expected_data)}")
        raise AssertionError("Row count mismatch")

    # Normalise each row into a hashable tuple for order-insensitive comparison
    def row_to_comparable(row):
        values = []
        for field in row:
            if isinstance(field, float):
                # Snap floats to a grid of width epsilon. builtins.round is required here
                # because `from pyspark.sql.functions import *` shadows round().
                values.append(builtins.round(field / epsilon) * epsilon)
            elif isinstance(field, list):
                # Arrays: element order is ignored
                values.append(tuple(sorted(field)) if field else tuple())
            elif isinstance(field, dict):
                # Maps (MapType columns are collected as dicts)
                values.append(tuple(sorted(field.items())))
            else:
                values.append(field)
        return tuple(values)

    # A Counter (multiset) keeps duplicate rows, which a plain set would silently drop
    actual_rows = Counter(row_to_comparable(row) for row in actual_data)
    expected_rows = Counter(row_to_comparable(row) for row in expected_data)

    if actual_rows != expected_rows:
        print("Data mismatch!")
        print("Actual data:", actual_rows)
        print("Expected data:", expected_rows)
        raise AssertionError("Data content mismatch")

    print("✓ DataFrames are equal!")
    return True
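
Order-insensitive comparison is sensitive to how duplicate rows are handled. The sketch below uses hypothetical collected rows (not taken from any problem in this set) to show that a `set` treats two different results as equal, while a `collections.Counter` does not.

```python
from collections import Counter

# Hypothetical collected rows: same row count, different duplicates.
actual_rows = [("C001", 1), ("C001", 1), ("C002", 2)]
expected_rows = [("C001", 1), ("C002", 2), ("C002", 2)]

# A set drops duplicates, so the two lists look identical.
sets_match = set(actual_rows) == set(expected_rows)

# A Counter keeps the multiplicity of each row, so the difference is detected.
multisets_match = Counter(actual_rows) == Counter(expected_rows)

print(sets_match, multisets_match)
```

This matters most for deduplication problems, where an accidental duplicate is exactly the bug a test should catch.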

## Table of Contents - Set 5 (Hard)

**Difficulty Distribution:** 30 Hard Problems

**Topics Covered:**
- Advanced Joins & Complex Deduplication (9 problems)
- Sophisticated Window Functions (4 problems)
- Multi-level Aggregations & OLAP (3 problems)
- Advanced Pandas UDFs & Performance (3 problems)
- Production File Format Handling (8 problems)
- Complex Nested Data Structures (7 problems)
- Performance & Optimization (2 problems)



## Problem 1: Customer Churn Prediction Features

**Requirement:** Analytics team needs features for customer churn prediction model.

**Scenario:** Calculate customer engagement metrics: purchase frequency, recency, and monetary value.

In [None]:
# Source DataFrame
customer_engagement_data = [
    ("C001", "2023-01-15", 100.0),
    ("C001", "2023-02-10", 150.0),
    ("C001", "2023-03-05", 200.0),
    ("C002", "2023-01-20", 300.0),
    ("C002", "2023-03-15", 250.0),
    ("C003", "2023-02-01", 500.0),
    ("C004", "2023-01-05", 150.0),
    ("C004", "2023-01-25", 175.0),
    ("C004", "2023-02-20", 200.0),
    ("C004", "2023-03-10", 225.0)
]

customer_engagement_df = spark.createDataFrame(customer_engagement_data, ["customer_id", "order_date", "amount"])
customer_engagement_df = customer_engagement_df.withColumn("order_date", col("order_date").cast("date"))
customer_engagement_df.show()

In [None]:
# Expected Output
expected_data = [
    ("C004", 4, 750.0, 187.5, 64),
    ("C001", 3, 450.0, 150.0, 54),
    ("C002", 2, 550.0, 275.0, 54),
    ("C003", 1, 500.0, 500.0, 37)
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "frequency", "monetary", "avg_order_value", "recency_days"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

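**Instructor Notes:** RFM analysis implementation. Tests date calculations and multi-metric aggregation. The problem does not state the reference date for `recency_days`, and the expected values do not correspond to a single reference date, so agree on one before checking that column.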
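
A plain-Python reference for the RFM metrics can help sanity-check a Spark solution. This is a minimal sketch: the function name `rfm` and the `REFERENCE_DATE` value are assumptions made for illustration, so the `recency_days` it produces is illustrative only.

```python
from collections import defaultdict
from datetime import date

orders = [
    ("C001", "2023-01-15", 100.0), ("C001", "2023-02-10", 150.0),
    ("C001", "2023-03-05", 200.0), ("C002", "2023-01-20", 300.0),
    ("C002", "2023-03-15", 250.0), ("C003", "2023-02-01", 500.0),
    ("C004", "2023-01-05", 150.0), ("C004", "2023-01-25", 175.0),
    ("C004", "2023-02-20", 200.0), ("C004", "2023-03-10", 225.0),
]

def rfm(rows, reference_date):
    """Return frequency, monetary, average order value and recency per customer."""
    by_customer = defaultdict(list)
    for customer_id, order_date, amount in rows:
        by_customer[customer_id].append((date.fromisoformat(order_date), amount))

    metrics = {}
    for customer_id, customer_rows in by_customer.items():
        amounts = [amount for _, amount in customer_rows]
        last_order = max(order_date for order_date, _ in customer_rows)
        metrics[customer_id] = {
            "frequency": len(customer_rows),
            "monetary": sum(amounts),
            "avg_order_value": round(sum(amounts) / len(amounts), 2),
            # Recency is measured from the assumed reference date to the last order.
            "recency_days": (reference_date - last_order).days,
        }
    return metrics

REFERENCE_DATE = date(2023, 3, 31)  # assumption: not specified by the problem
metrics = rfm(orders, REFERENCE_DATE)
print(metrics["C004"])
```

In PySpark the same shape is usually a single `groupBy("customer_id").agg(...)` with `count`, `sum`, `avg` and a `datediff` against the reference date.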

## Problem 2: Inventory Stock Analysis

**Requirement:** Supply chain needs current stock levels with lead time calculations.

**Scenario:** Calculate current inventory levels considering incoming and outgoing shipments.

In [None]:
# Source DataFrames
inventory_data = [
    ("P001", "Laptop", 50),
    ("P002", "Mouse", 100),
    ("P003", "Keyboard", 75)
]

incoming_shipments_data = [
    ("S001", "P001", "2023-03-01", 20),
    ("S002", "P002", "2023-03-02", 50),
    ("S003", "P001", "2023-03-03", 10)
]

outgoing_orders_data = [
    ("O001", "P001", "2023-03-01", 15),
    ("O002", "P002", "2023-03-02", 30),
    ("O003", "P001", "2023-03-03", 25),
    ("O004", "P003", "2023-03-03", 20)
]

inventory_df = spark.createDataFrame(inventory_data, ["product_id", "product_name", "current_stock"])
incoming_df = spark.createDataFrame(incoming_shipments_data, ["shipment_id", "product_id", "arrival_date", "quantity"])
outgoing_df = spark.createDataFrame(outgoing_orders_data, ["order_id", "product_id", "order_date", "quantity"])

print("Inventory:")
inventory_df.show()
print("Incoming Shipments:")
incoming_df.show()
print("Outgoing Orders:")
outgoing_df.show()

In [None]:
# Expected Output
expected_data = [
    ("P001", "Laptop", 50, 30, 40, 40),
    ("P002", "Mouse", 100, 50, 30, 120),
    ("P003", "Keyboard", 75, 0, 20, 55)
]

expected_df = spark.createDataFrame(expected_data, ["product_id", "product_name", "current_stock", "incoming_qty", "outgoing_qty", "projected_stock"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Multi-table aggregation with conditional sums. Tests complex join scenarios with multiple data sources.
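
The projected stock follows `current_stock + incoming_qty - outgoing_qty`, with a missing side counted as 0 (P003 has no incoming shipments). A plain-Python sketch of that arithmetic, using the source data above:

```python
from collections import Counter

inventory = [("P001", "Laptop", 50), ("P002", "Mouse", 100), ("P003", "Keyboard", 75)]
incoming = [("S001", "P001", "2023-03-01", 20), ("S002", "P002", "2023-03-02", 50),
            ("S003", "P001", "2023-03-03", 10)]
outgoing = [("O001", "P001", "2023-03-01", 15), ("O002", "P002", "2023-03-02", 30),
            ("O003", "P001", "2023-03-03", 25), ("O004", "P003", "2023-03-03", 20)]

# Aggregate each movement table per product before combining them.
incoming_qty = Counter()
for _, product_id, _, quantity in incoming:
    incoming_qty[product_id] += quantity

outgoing_qty = Counter()
for _, product_id, _, quantity in outgoing:
    outgoing_qty[product_id] += quantity

# A Counter returns 0 for a missing key, which plays the role of a null-to-zero fill.
projected = [
    (pid, name, stock, incoming_qty[pid], outgoing_qty[pid],
     stock + incoming_qty[pid] - outgoing_qty[pid])
    for pid, name, stock in inventory
]
print(projected)
```

Aggregating each side before joining is the important design choice: joining the raw shipment and order rows to inventory first would multiply rows for P001 and inflate both totals.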

## Problem 3: Employee Attendance Pattern Analysis

**Requirement:** HR needs to analyze employee attendance patterns for workforce planning.

**Scenario:** Calculate consecutive work days and identify attendance patterns using window functions.

In [None]:
# Source DataFrame
attendance_data = [
    ("E001", "2023-03-01", "Present"),
    ("E001", "2023-03-02", "Present"),
    ("E001", "2023-03-03", "Absent"),
    ("E001", "2023-03-04", "Present"),
    ("E001", "2023-03-05", "Present"),
    ("E001", "2023-03-06", "Present"),
    ("E002", "2023-03-01", "Present"),
    ("E002", "2023-03-02", "Present"),
    ("E002", "2023-03-03", "Present"),
    ("E002", "2023-03-04", "Absent"),
    ("E002", "2023-03-05", "Present")
]

attendance_df = spark.createDataFrame(attendance_data, ["employee_id", "date", "status"])
attendance_df = attendance_df.withColumn("date", col("date").cast("date"))
attendance_df.show()

In [None]:
# Expected Output
expected_data = [
    ("E001", "2023-03-01", "Present", 1),
    ("E001", "2023-03-02", "Present", 2),
    ("E001", "2023-03-03", "Absent", 0),
    ("E001", "2023-03-04", "Present", 1),
    ("E001", "2023-03-05", "Present", 2),
    ("E001", "2023-03-06", "Present", 3),
    ("E002", "2023-03-01", "Present", 1),
    ("E002", "2023-03-02", "Present", 2),
    ("E002", "2023-03-03", "Present", 3),
    ("E002", "2023-03-04", "Absent", 0),
    ("E002", "2023-03-05", "Present", 1)
]

expected_df = spark.createDataFrame(expected_data, ["employee_id", "date", "status", "consecutive_days"])
expected_df = expected_df.withColumn("date", col("date").cast("date"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Complex window functions with conditional reset. Tests pattern detection and state management in window operations.
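
One common way to express a conditional reset is the "gaps and islands" idea: every absent day starts a new island, and the streak is the position of a present day within its island. The plain-Python sketch below mirrors that logic (the function name is hypothetical); in Spark the island id would typically be a running count of absent rows over a window ordered by date.

```python
def streaks_via_islands(statuses):
    """Return the consecutive-present count for each day, 0 on absent days."""
    island = 0                # incremented on every absent day
    position_in_island = {}   # island id -> number of present days seen so far
    streaks = []
    for status in statuses:
        if status != "Present":
            island += 1
            streaks.append(0)
        else:
            position_in_island[island] = position_in_island.get(island, 0) + 1
            streaks.append(position_in_island[island])
    return streaks

e001 = ["Present", "Present", "Absent", "Present", "Present", "Present"]
e002 = ["Present", "Present", "Present", "Absent", "Present"]
print(streaks_via_islands(e001))
print(streaks_via_islands(e002))
```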

## Problem 4: Financial Portfolio Analysis

**Requirement:** Investment team needs portfolio performance analysis with risk metrics.

**Scenario:** Calculate portfolio weights, returns, and risk metrics across different assets.

In [None]:
# Source DataFrame
portfolio_data = [
    ("AAPL", 10000.0, 150.0, 155.0),
    ("GOOGL", 15000.0, 2800.0, 2850.0),
    ("MSFT", 8000.0, 300.0, 295.0),
    ("TSLA", 12000.0, 200.0, 210.0)
]

portfolio_df = spark.createDataFrame(portfolio_data, ["symbol", "investment", "purchase_price", "current_price"])
portfolio_df.show()

In [None]:
# Expected Output
expected_data = [
    ("AAPL", 10000.0, 150.0, 155.0, 66.67, 10333.33, 3.33, 333.33),
    ("GOOGL", 15000.0, 2800.0, 2850.0, 5.36, 15267.86, 1.79, 267.86),
    ("MSFT", 8000.0, 300.0, 295.0, 26.67, 7866.67, -1.67, -133.33),
    ("TSLA", 12000.0, 200.0, 210.0, 60.0, 12600.0, 5.0, 600.0)
]

expected_df = spark.createDataFrame(expected_data, ["symbol", "investment", "purchase_price", "current_price", "shares", "current_value", "return_pct", "return_amt"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Financial calculations with multiple derived metrics. Tests mathematical operations and percentage calculations.
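
The derived value and return columns all follow from the price ratio `current_price / purchase_price`. A minimal sketch of that arithmetic for the TSLA row, assuming two-decimal rounding as in the expected output (the helper name `position_metrics` is hypothetical):

```python
def position_metrics(investment, purchase_price, current_price):
    """Derive current value and returns for one position, rounded to 2 decimals."""
    growth = current_price / purchase_price    # price ratio since purchase
    current_value = investment * growth
    return {
        "current_value": round(current_value, 2),
        "return_pct": round((growth - 1) * 100, 2),
        "return_amt": round(current_value - investment, 2),
    }

tsla = position_metrics(12000.0, 200.0, 210.0)
print(tsla)
```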

## Problem 5: Healthcare Patient Journey Analysis

**Requirement:** Medical analytics needs patient treatment pathway analysis.

**Scenario:** Analyze patient journeys through different medical departments and treatments.

In [None]:
# Source DataFrame
patient_journey_data = [
    ("P001", "Emergency", "2023-01-15 10:00:00"),
    ("P001", "Radiology", "2023-01-15 11:30:00"),
    ("P001", "Surgery", "2023-01-15 14:00:00"),
    ("P001", "ICU", "2023-01-15 18:00:00"),
    ("P002", "OPD", "2023-01-16 09:00:00"),
    ("P002", "Lab", "2023-01-16 10:00:00"),
    ("P002", "Pharmacy", "2023-01-16 11:00:00"),
    ("P003", "Emergency", "2023-01-17 15:00:00"),
    ("P003", "Radiology", "2023-01-17 16:00:00")
]

patient_journey_df = spark.createDataFrame(patient_journey_data, ["patient_id", "department", "timestamp"])
patient_journey_df = patient_journey_df.withColumn("timestamp", col("timestamp").cast("timestamp"))
patient_journey_df.show()

In [None]:
# Expected Output
expected_data = [
    ("P001", "Emergency", "Radiology", 90),
    ("P001", "Radiology", "Surgery", 150),
    ("P001", "Surgery", "ICU", 240),
    ("P002", "OPD", "Lab", 60),
    ("P002", "Lab", "Pharmacy", 60),
    ("P003", "Emergency", "Radiology", 60)
]

expected_df = spark.createDataFrame(expected_data, ["patient_id", "from_dept", "to_dept", "time_minutes"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Time-based analysis with lead/lag operations. Tests patient journey analysis and time interval calculations.
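
Each output row pairs a visit with the next visit of the same patient, which is what `lead` over a window partitioned by patient and ordered by timestamp provides in Spark. A plain-Python sketch of the same pairing, using the source data above (the function name is hypothetical):

```python
from datetime import datetime
from itertools import groupby

visits = [
    ("P001", "Emergency", "2023-01-15 10:00:00"), ("P001", "Radiology", "2023-01-15 11:30:00"),
    ("P001", "Surgery", "2023-01-15 14:00:00"), ("P001", "ICU", "2023-01-15 18:00:00"),
    ("P002", "OPD", "2023-01-16 09:00:00"), ("P002", "Lab", "2023-01-16 10:00:00"),
    ("P002", "Pharmacy", "2023-01-16 11:00:00"), ("P003", "Emergency", "2023-01-17 15:00:00"),
    ("P003", "Radiology", "2023-01-17 16:00:00"),
]

def transitions(rows):
    """Pair every visit with the patient's next visit and measure the time between them."""
    # Sort by patient, then by timestamp, so consecutive rows are consecutive visits.
    parsed = sorted((pid, datetime.fromisoformat(ts), dept) for pid, dept, ts in rows)
    result = []
    for pid, group in groupby(parsed, key=lambda visit: visit[0]):
        steps = list(group)
        for (_, t0, dept0), (_, t1, dept1) in zip(steps, steps[1:]):
            minutes = int((t1 - t0).total_seconds() // 60)
            result.append((pid, dept0, dept1, minutes))
    return result

print(transitions(visits))
```

The last visit of each patient has no successor and therefore produces no row, which is why the Spark version needs a filter on the null `lead` value.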

## Problem 6: E-commerce Customer Segmentation

**Requirement:** Marketing needs advanced customer segmentation for targeted campaigns.

**Scenario:** Segment customers based on RFM (Recency, Frequency, Monetary) scores and clustering logic.

In [None]:
# Source DataFrame
customer_segmentation_data = [
    ("C001", 45, 15, 2500.0),
    ("C002", 120, 3, 800.0),
    ("C003", 10, 25, 5000.0),
    ("C004", 80, 8, 1500.0),
    ("C005", 200, 2, 400.0),
    ("C006", 5, 30, 7500.0),
    ("C007", 60, 12, 3000.0)
]

customer_segmentation_df = spark.createDataFrame(customer_segmentation_data, ["customer_id", "recency_days", "frequency", "monetary"])
customer_segmentation_df.show()

In [None]:
# Expected Output
expected_data = [
    ("C001", 45, 15, 2500.0, "Gold"),
    ("C002", 120, 3, 800.0, "Bronze"),
    ("C003", 10, 25, 5000.0, "Platinum"),
    ("C004", 80, 8, 1500.0, "Silver"),
    ("C005", 200, 2, 400.0, "Bronze"),
    ("C006", 5, 30, 7500.0, "Platinum"),
    ("C007", 60, 12, 3000.0, "Gold")
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "recency_days", "frequency", "monetary", "segment"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Customer segmentation with business rules. Tests conditional logic and multi-criteria classification.

## Problem 7: Supply Chain Route Optimization

**Requirement:** Logistics needs optimal delivery route analysis with cost calculations.

**Scenario:** Calculate delivery routes, distances, and costs considering multiple stops and constraints.

In [None]:
# Source DataFrame
delivery_routes_data = [
    ("R001", "Warehouse", "Store_A", 50.0, 100.0),
    ("R001", "Store_A", "Store_B", 30.0, 60.0),
    ("R001", "Store_B", "Warehouse", 40.0, 80.0),
    ("R002", "Warehouse", "Store_C", 70.0, 140.0),
    ("R002", "Store_C", "Store_D", 25.0, 50.0),
    ("R002", "Store_D", "Warehouse", 60.0, 120.0),
    ("R003", "Warehouse", "Store_E", 90.0, 180.0)
]

delivery_routes_df = spark.createDataFrame(delivery_routes_data, ["route_id", "from_location", "to_location", "distance_km", "cost"])
delivery_routes_df.show()

In [None]:
# Expected Output
expected_data = [
    ("R001", 120.0, 240.0, 3),
    ("R002", 155.0, 310.0, 3),
    ("R003", 90.0, 180.0, 1)
]

expected_df = spark.createDataFrame(expected_data, ["route_id", "total_distance", "total_cost", "stops"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Route optimization with aggregation. Tests group-based calculations and multi-leg journey analysis.

## Problem 8: Media Content Performance Analysis

**Requirement:** Media analytics needs content engagement metrics and performance trends.

**Scenario:** Calculate content engagement rates, completion rates, and audience retention metrics.

In [None]:
# Source DataFrame
content_performance_data = [
    ("V001", "Tutorial", 10000, 8500, 7500, 6000),
    ("V002", "Entertainment", 15000, 12000, 11000, 9000),
    ("V003", "News", 8000, 6000, 5000, 3500),
    ("V004", "Documentary", 5000, 4500, 4200, 3800),
    ("V005", "Sports", 20000, 18000, 16000, 14000)
]

content_performance_df = spark.createDataFrame(content_performance_data, ["content_id", "category", "impressions", "views", "engagements", "completions"])
content_performance_df.show()

In [None]:
# Expected Output
expected_data = [
    ("V001", "Tutorial", 85.0, 75.0, 60.0, 70.6),
    ("V002", "Entertainment", 80.0, 73.3, 60.0, 75.0),
    ("V003", "News", 75.0, 62.5, 43.8, 58.3),
    ("V004", "Documentary", 90.0, 84.0, 76.0, 84.4),
    ("V005", "Sports", 90.0, 80.0, 70.0, 77.8)
]

expected_df = spark.createDataFrame(expected_data, ["content_id", "category", "view_rate", "engagement_rate", "completion_rate", "retention_rate"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Media analytics with percentage calculations. Tests ratio computations and performance metric derivations.
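
The expected values are consistent with three rates taken against impressions and a retention rate taken against views, each rounded to one decimal. The sketch below encodes that reading, which is inferred from the expected output rather than stated in the problem; the helper names are hypothetical.

```python
def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)

def content_rates(impressions, views, engagements, completions):
    return (
        pct(views, impressions),        # view_rate
        pct(engagements, impressions),  # engagement_rate
        pct(completions, impressions),  # completion_rate
        pct(completions, views),        # retention_rate
    )

print(content_rates(10000, 8500, 7500, 6000))
print(content_rates(8000, 6000, 5000, 3500))
```

Note that Spark's `round` rounds half up while Python's built-in `round` rounds exact ties to the nearest even digit; the two agree on this dataset but can differ on other tie values.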

## Problem 9: Educational Course Progress Tracking

**Requirement:** Education platform needs student progress analytics and course completion tracking.

**Scenario:** Calculate student progress, completion rates, and identify at-risk students.

In [None]:
# Source DataFrame
student_progress_data = [
    ("S001", "C001", 10, 8, 85.0),
    ("S001", "C002", 15, 5, 65.0),
    ("S002", "C001", 10, 10, 95.0),
    ("S002", "C003", 20, 15, 88.0),
    ("S003", "C001", 10, 3, 55.0),
    ("S003", "C002", 15, 2, 45.0),
    ("S004", "C003", 20, 18, 92.0)
]

student_progress_df = spark.createDataFrame(student_progress_data, ["student_id", "course_id", "total_modules", "completed_modules", "avg_score"])
student_progress_df.show()

In [None]:
# Expected Output
expected_data = [
    ("S001", 25, 13, 52.0, 75.0, "At Risk"),
    ("S002", 30, 25, 83.3, 91.5, "Excellent"),
    ("S003", 25, 5, 20.0, 50.0, "Critical"),
    ("S004", 20, 18, 90.0, 92.0, "Excellent")
]

expected_df = spark.createDataFrame(expected_data, ["student_id", "total_modules", "completed_modules", "completion_rate", "avg_score", "status"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Student analytics with multi-criteria status classification. Tests aggregation and conditional business logic.

## Problem 10: IoT Sensor Data Anomaly Detection

**Requirement:** IoT monitoring needs real-time anomaly detection in sensor data streams.

**Scenario:** Identify sensor readings that deviate significantly from historical patterns using statistical methods.

In [None]:
# Source DataFrame
sensor_data = [
    ("Sensor_A", "2023-03-01 10:00:00", 25.5),
    ("Sensor_A", "2023-03-01 11:00:00", 26.1),
    ("Sensor_A", "2023-03-01 12:00:00", 25.8),
    ("Sensor_A", "2023-03-01 13:00:00", 45.2),  # Anomaly
    ("Sensor_A", "2023-03-01 14:00:00", 25.9),
    ("Sensor_B", "2023-03-01 10:00:00", 30.2),
    ("Sensor_B", "2023-03-01 11:00:00", 31.0),
    ("Sensor_B", "2023-03-01 12:00:00", 15.8),  # Anomaly
    ("Sensor_B", "2023-03-01 13:00:00", 30.5),
    ("Sensor_B", "2023-03-01 14:00:00", 30.8)
]

sensor_df = spark.createDataFrame(sensor_data, ["sensor_id", "timestamp", "value"])
sensor_df = sensor_df.withColumn("timestamp", col("timestamp").cast("timestamp"))
sensor_df.show()

In [None]:
# Expected Output
expected_data = [
    ("Sensor_A", "2023-03-01 13:00:00", 45.2, "Anomaly"),
    ("Sensor_B", "2023-03-01 12:00:00", 15.8, "Anomaly")
]

expected_df = spark.createDataFrame(expected_data, ["sensor_id", "timestamp", "value", "status"])
expected_df = expected_df.withColumn("timestamp", col("timestamp").cast("timestamp"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Statistical anomaly detection with window functions. Tests standard deviation calculations and outlier identification.
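
The z-score threshold needs care with only five readings per sensor: when the sample standard deviation is used, no point can lie more than (n-1)/sqrt(n), about 1.79 for n = 5, deviations from the mean, so the common "2 sigma" rule flags nothing here. The sketch below uses a threshold of 1.5, which is an assumption chosen to reproduce the expected output, not a value given by the problem.

```python
from statistics import mean, stdev

readings = {
    "Sensor_A": [25.5, 26.1, 25.8, 45.2, 25.9],
    "Sensor_B": [30.2, 31.0, 15.8, 30.5, 30.8],
}

Z_THRESHOLD = 1.5  # assumption: not specified by the problem

def anomalies(values, threshold):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, like Spark's stddev
    return [value for value in values if abs(value - mu) / sigma > threshold]

for sensor_id, values in readings.items():
    print(sensor_id, anomalies(values, Z_THRESHOLD))
```

In PySpark the per-sensor mean and standard deviation would typically come from `avg` and `stddev` over a window partitioned by `sensor_id`.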

## Problem 11: Financial Transaction Pattern Analysis

**Requirement:** Fraud detection needs transaction pattern analysis for suspicious activity identification.

**Scenario:** Analyze transaction patterns to identify unusual spending behaviors and potential fraud.

In [None]:
# Source DataFrame
transaction_patterns_data = [
    ("T001", "C001", "2023-03-01 09:00:00", 100.0, "Retail"),
    ("T002", "C001", "2023-03-01 10:30:00", 50.0, "Dining"),
    ("T003", "C001", "2023-03-01 15:00:00", 200.0, "Electronics"),
    ("T004", "C001", "2023-03-02 08:00:00", 5000.0, "Jewelry"),  # Suspicious
    ("T005", "C002", "2023-03-01 11:00:00", 75.0, "Groceries"),
    ("T006", "C002", "2023-03-01 14:00:00", 120.0, "Entertainment"),
    ("T007", "C002", "2023-03-02 10:00:00", 80.0, "Dining")
]

transaction_patterns_df = spark.createDataFrame(transaction_patterns_data, ["transaction_id", "customer_id", "timestamp", "amount", "category"])
transaction_patterns_df = transaction_patterns_df.withColumn("timestamp", col("timestamp").cast("timestamp"))
transaction_patterns_df.show()

In [None]:
# Expected Output
expected_data = [
    ("T004", "C001", "2023-03-02 08:00:00", 5000.0, "Jewelry", "High Value")
]

expected_df = spark.createDataFrame(expected_data, ["transaction_id", "customer_id", "timestamp", "amount", "category", "risk_level"])
expected_df = expected_df.withColumn("timestamp", col("timestamp").cast("timestamp"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Fraud detection with pattern analysis. Tests statistical comparisons and anomaly flagging based on historical patterns.

## Problem 12: Multi-Dimensional Sales Analysis

**Requirement:** Business intelligence needs sales analysis across multiple dimensions.

**Scenario:** Analyze sales performance across time, geography, and product categories with rollup aggregations.

In [None]:
# Source DataFrame
multi_dim_sales_data = [
    ("2023-Q1", "North", "Electronics", "Laptop", 50000),
    ("2023-Q1", "North", "Electronics", "Tablet", 30000),
    ("2023-Q1", "South", "Electronics", "Laptop", 45000),
    ("2023-Q1", "South", "Electronics", "Tablet", 25000),
    ("2023-Q1", "North", "Clothing", "Shirt", 20000),
    ("2023-Q1", "South", "Clothing", "Shirt", 22000),
    ("2023-Q2", "North", "Electronics", "Laptop", 55000),
    ("2023-Q2", "North", "Electronics", "Tablet", 32000)
]

multi_dim_sales_df = spark.createDataFrame(multi_dim_sales_data, ["quarter", "region", "category", "product", "sales"])
multi_dim_sales_df.show()

In [None]:
# Expected Output
expected_data = [
    ("2023-Q1", "North", "Electronics", 80000),
    ("2023-Q1", "North", "Clothing", 20000),
    ("2023-Q1", "South", "Electronics", 70000),
    ("2023-Q1", "South", "Clothing", 22000),
    ("2023-Q2", "North", "Electronics", 87000),
    ("2023-Q1", "North", "All", 100000),
    ("2023-Q1", "South", "All", 92000),
    ("2023-Q1", "All", "Electronics", 150000),
    ("2023-Q1", "All", "Clothing", 42000)
]

expected_df = spark.createDataFrame(expected_data, ["quarter", "region", "category", "total_sales"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Multi-dimensional analysis with rollup aggregations. Tests cube/rollup operations for hierarchical reporting.
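
The expected output combines three grouping levels: (quarter, region, category), (quarter, region) and (quarter, category). That combination corresponds to grouping sets rather than a plain rollup, since a rollup over quarter, region, category would not produce the per-category subtotals. A plain-Python sketch of the three levels follows; unlike the expected table, which lists subtotal rows for 2023-Q1 only, it produces them for every quarter.

```python
from collections import defaultdict

sales = [
    ("2023-Q1", "North", "Electronics", "Laptop", 50000),
    ("2023-Q1", "North", "Electronics", "Tablet", 30000),
    ("2023-Q1", "South", "Electronics", "Laptop", 45000),
    ("2023-Q1", "South", "Electronics", "Tablet", 25000),
    ("2023-Q1", "North", "Clothing", "Shirt", 20000),
    ("2023-Q1", "South", "Clothing", "Shirt", 22000),
    ("2023-Q2", "North", "Electronics", "Laptop", 55000),
    ("2023-Q2", "North", "Electronics", "Tablet", 32000),
]

def grouping_sets_totals(rows):
    """Sum sales at the detail level and at the two subtotal levels."""
    totals = defaultdict(int)
    for quarter, region, category, _product, amount in rows:
        totals[(quarter, region, category)] += amount  # detail level
        totals[(quarter, region, "All")] += amount     # subtotal across categories
        totals[(quarter, "All", category)] += amount   # subtotal across regions
    return dict(totals)

totals = grouping_sets_totals(sales)
print(totals[("2023-Q1", "North", "All")], totals[("2023-Q1", "All", "Electronics")])
```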

## Problem 13: Complex UDF for Natural Language Processing

**Requirement:** Customer feedback analysis needs text processing for sentiment and topic extraction.

**Scenario:** Create advanced UDFs to process customer feedback text for sentiment analysis and key topic identification.

In [None]:
# Source DataFrame
customer_feedback_data = [
    (1, "The product is amazing! Great quality and fast delivery."),
    (2, "Terrible experience. The item arrived damaged and customer service was unhelpful."),
    (3, "Average product, nothing special but gets the job done."),
    (4, "Excellent service! Will definitely buy again. Highly recommended."),
    (5, "Poor quality product. Broke after first use. Very disappointed.")
]

customer_feedback_df = spark.createDataFrame(customer_feedback_data, ["feedback_id", "feedback_text"])
customer_feedback_df.show(truncate=False)

In [None]:
# Expected Output
expected_data = [
    (1, "The product is amazing! Great quality and fast delivery.", "Positive", "product quality"),
    (2, "Terrible experience. The item arrived damaged and customer service was unhelpful.", "Negative", "customer service"),
    (3, "Average product, nothing special but gets the job done.", "Neutral", "product quality"),
    (4, "Excellent service! Will definitely buy again. Highly recommended.", "Positive", "customer service"),
    (5, "Poor quality product. Broke after first use. Very disappointed.", "Negative", "product quality")
]

expected_df = spark.createDataFrame(expected_data, ["feedback_id", "feedback_text", "sentiment", "main_topic"])
expected_df.show(truncate=False)

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Advanced UDFs for text processing. Tests string analysis, keyword matching, and sentiment classification logic.

## Problem 14: Time-Series Forecasting Features

**Requirement:** Forecasting team needs feature engineering for time-series prediction models.

**Scenario:** Create lag features, moving averages, and trend indicators for sales forecasting.

In [None]:
# Source DataFrame
sales_forecasting_data = [
    ("2023-01-01", 1000.0),
    ("2023-01-02", 1200.0),
    ("2023-01-03", 1100.0),
    ("2023-01-04", 1300.0),
    ("2023-01-05", 1400.0),
    ("2023-01-06", 1250.0),
    ("2023-01-07", 1500.0),
    ("2023-01-08", 1600.0)
]

sales_forecasting_df = spark.createDataFrame(sales_forecasting_data, ["date", "sales"])
sales_forecasting_df = sales_forecasting_df.withColumn("date", col("date").cast("date"))
sales_forecasting_df.show()

In [None]:
# Expected Output
expected_data = [
    ("2023-01-01", 1000.0, None, None, None, None),
    ("2023-01-02", 1200.0, 1000.0, None, None, 200.0),
    ("2023-01-03", 1100.0, 1200.0, 1000.0, 1100.0, -100.0),
    ("2023-01-04", 1300.0, 1100.0, 1200.0, 1200.0, 200.0),
    ("2023-01-05", 1400.0, 1300.0, 1100.0, 1266.67, 100.0),
    ("2023-01-06", 1250.0, 1400.0, 1300.0, 1316.67, -150.0),
    ("2023-01-07", 1500.0, 1250.0, 1400.0, 1383.33, 250.0),
    ("2023-01-08", 1600.0, 1500.0, 1250.0, 1450.0, 100.0)
]

expected_df = spark.createDataFrame(expected_data, ["date", "sales", "lag_1", "lag_2", "moving_avg_3", "daily_change"])
expected_df = expected_df.withColumn("date", col("date").cast("date"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Time-series feature engineering. Tests lag features, moving averages, and trend calculations for forecasting.
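
The expected output leaves `moving_avg_3` empty until three observations are available. A window average over `rowsBetween(-2, 0)` in Spark averages whatever rows exist in the frame, so the first two rows need to be masked explicitly. The plain-Python sketch below reproduces the expected columns; the helper names are hypothetical.

```python
sales = [1000.0, 1200.0, 1100.0, 1300.0, 1400.0, 1250.0, 1500.0, 1600.0]

def lag(values, n):
    """Shift values down by n positions (n >= 1), padding the start with None."""
    return [None] * n + values[:len(values) - n]

def moving_avg_3(values):
    """Three-point trailing average, None until a full window is available."""
    return [
        None if i < 2 else round(sum(values[i - 2:i + 1]) / 3, 2)
        for i in range(len(values))
    ]

def daily_change(values):
    """Difference from the previous observation, None for the first one."""
    return [None] + [current - previous for previous, current in zip(values, values[1:])]

print(lag(sales, 1))
print(lag(sales, 2))
print(moving_avg_3(sales))
print(daily_change(sales))
```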

## Problem 15: Complex Data Validation Framework

**Requirement:** Data governance needs comprehensive data quality validation framework.

**Scenario:** Implement multi-level data validation checks with custom business rules and cross-field validation.

In [None]:
# Source DataFrame
data_validation_data = [
    (1, "John Doe", "john@email.com", "1990-01-15", "2023-01-01", 5000.0),
    (2, "Jane Smith", "invalid-email", "1985-12-20", "2023-01-15", -100.0),  # Invalid
    (3, "Bob Johnson", "bob@company.com", "2005-06-10", "2023-02-01", 3000.0),  # Underage
    (4, "Alice Brown", "alice@domain.com", "1975-03-25", "2022-12-01", 7500.0),  # Future date
    (5, "", "charlie@email.com", "1988-07-30", "2023-01-10", 4000.0)  # Empty name
]

data_validation_df = spark.createDataFrame(data_validation_data, ["customer_id", "name", "email", "birth_date", "signup_date", "balance"])
data_validation_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "John Doe", "john@email.com", "1990-01-15", "2023-01-01", 5000.0, "Valid"),
    (2, "Jane Smith", "invalid-email", "1985-12-20", "2023-01-15", -100.0, "Invalid Email, Negative Balance"),
    (3, "Bob Johnson", "bob@company.com", "2005-06-10", "2023-02-01", 3000.0, "Underage"),
    (4, "Alice Brown", "alice@domain.com", "1975-03-25", "2022-12-01", 7500.0, "Future Signup Date"),
    (5, "", "charlie@email.com", "1988-07-30", "2023-01-10", 4000.0, "Empty Name")
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "name", "email", "birth_date", "signup_date", "balance", "validation_errors"])
expected_df.show(truncate=False)

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

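**Instructor Notes:** Comprehensive data validation framework. Tests multiple validation rules and error message aggregation. Row 4's `signup_date` (2022-12-01) is earlier than every other sign-up date in the sample, so the "Future Signup Date" rule needs an explicit definition before the expected output can be reproduced.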

## Problem 16: Advanced Window Functions for Gap Analysis

**Requirement:** Business operations needs gap analysis in service delivery timelines.

**Scenario:** Identify service gaps and calculate downtime between consecutive service events.

In [None]:
# Source DataFrame
service_events_data = [
    ("S001", "2023-03-01 09:00:00", "2023-03-01 10:00:00"),
    ("S001", "2023-03-01 11:30:00", "2023-03-01 12:30:00"),
    ("S001", "2023-03-01 14:00:00", "2023-03-01 15:00:00"),
    ("S002", "2023-03-01 08:00:00", "2023-03-01 09:00:00"),
    ("S002", "2023-03-01 10:00:00", "2023-03-01 11:00:00"),
    ("S002", "2023-03-01 13:00:00", "2023-03-01 14:00:00")
]

service_events_df = spark.createDataFrame(service_events_data, ["service_id", "start_time", "end_time"])
service_events_df = service_events_df.withColumn("start_time", col("start_time").cast("timestamp"))\
                                   .withColumn("end_time", col("end_time").cast("timestamp"))
service_events_df.show()

In [None]:
# Expected Output
expected_data = [
    ("S001", "2023-03-01 09:00:00", "2023-03-01 10:00:00", None, None),
    ("S001", "2023-03-01 11:30:00", "2023-03-01 12:30:00", "2023-03-01 10:00:00", 90),
    ("S001", "2023-03-01 14:00:00", "2023-03-01 15:00:00", "2023-03-01 12:30:00", 90),
    ("S002", "2023-03-01 08:00:00", "2023-03-01 09:00:00", None, None),
    ("S002", "2023-03-01 10:00:00", "2023-03-01 11:00:00", "2023-03-01 09:00:00", 60),
    ("S002", "2023-03-01 13:00:00", "2023-03-01 14:00:00", "2023-03-01 11:00:00", 120)
]

expected_df = spark.createDataFrame(expected_data, ["service_id", "start_time", "end_time", "prev_end_time", "gap_minutes"])
expected_df = expected_df.withColumn("start_time", col("start_time").cast("timestamp"))\
                       .withColumn("end_time", col("end_time").cast("timestamp"))\
                       .withColumn("prev_end_time", col("prev_end_time").cast("timestamp"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Gap analysis with window functions. Tests time interval calculations and service continuity analysis.
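
Here the comparison runs from the previous event's end to the current event's start, so the Spark version uses `lag` on `end_time` rather than `lead`. A plain-Python sketch for one service at a time (the function name is hypothetical):

```python
from datetime import datetime

def gap_minutes(intervals):
    """intervals: (start, end) ISO strings for one service; returns the gap before each event."""
    gaps = []
    previous_end = None
    for start, end in sorted(intervals):
        start_ts = datetime.fromisoformat(start)
        if previous_end is None:
            gaps.append(None)  # first event has nothing before it
        else:
            gaps.append(int((start_ts - previous_end).total_seconds() // 60))
        previous_end = datetime.fromisoformat(end)
    return gaps

s001 = [("2023-03-01 09:00:00", "2023-03-01 10:00:00"),
        ("2023-03-01 11:30:00", "2023-03-01 12:30:00"),
        ("2023-03-01 14:00:00", "2023-03-01 15:00:00")]
s002 = [("2023-03-01 08:00:00", "2023-03-01 09:00:00"),
        ("2023-03-01 10:00:00", "2023-03-01 11:00:00"),
        ("2023-03-01 13:00:00", "2023-03-01 14:00:00")]
print(gap_minutes(s001), gap_minutes(s002))
```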

## Problem 17: Complex Business Rule Engine

**Requirement:** Insurance claims processing needs automated rule-based decision engine.

**Scenario:** Implement complex business rules for insurance claim approval with multiple conditions and scoring.

In [None]:
# Source DataFrame
insurance_claims_data = [
    ("CL001", 5000.0, 2, "2023-01-15", "Approved"),
    ("CL002", 15000.0, 1, "2023-02-20", "Pending"),
    ("CL003", 25000.0, 3, "2023-03-05", "Approved"),
    ("CL004", 50000.0, 0, "2023-03-10", "Rejected"),
    ("CL005", 8000.0, 5, "2023-03-15", "Approved")
]

insurance_claims_df = spark.createDataFrame(insurance_claims_data, ["claim_id", "claim_amount", "previous_claims", "claim_date", "current_status"])
insurance_claims_df.show()

In [None]:
# Expected Output
expected_data = [
    ("CL001", 5000.0, 2, "2023-01-15", "Approved", "Auto Approved"),
    ("CL002", 15000.0, 1, "2023-02-20", "Pending", "Manual Review Required"),
    ("CL003", 25000.0, 3, "2023-03-05", "Approved", "High Risk - Approved"),
    ("CL004", 50000.0, 0, "2023-03-10", "Rejected", "Exceeds Limit"),
    ("CL005", 8000.0, 5, "2023-03-15", "Approved", "Frequent Claimant - Review")
]

expected_df = spark.createDataFrame(expected_data, ["claim_id", "claim_amount", "previous_claims", "claim_date", "current_status", "decision_reason"])
expected_df.show(truncate=False)

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Complex business rule engine implementation. Tests multi-condition decision logic and business rule application.

## Problem 18: Advanced Data Partitioning Strategy

**Requirement:** Big data processing needs optimized partitioning for performance.

**Scenario:** Implement custom partitioning strategy for large-scale customer transaction data.

In [None]:
# Source DataFrame
large_transactions_data = [
    ("T001", "C001", "2023-03-01", "Electronics", 1000.0),
    ("T002", "C002", "2023-03-01", "Clothing", 500.0),
    ("T003", "C001", "2023-03-02", "Electronics", 1500.0),
    ("T004", "C003", "2023-03-02", "Home", 2000.0),
    ("T005", "C002", "2023-03-03", "Electronics", 800.0),
    ("T006", "C004", "2023-03-03", "Clothing", 300.0),
    ("T007", "C001", "2023-03-04", "Home", 1200.0),
    ("T008", "C003", "2023-03-04", "Electronics", 2500.0)
]

large_transactions_df = spark.createDataFrame(large_transactions_data, ["transaction_id", "customer_id", "date", "category", "amount"])
large_transactions_df.show()

In [None]:
# Expected Output
expected_data = [
    ("2023-03-01", "Electronics", 1000.0),
    ("2023-03-01", "Clothing", 500.0),
    ("2023-03-02", "Electronics", 1500.0),
    ("2023-03-02", "Home", 2000.0),
    ("2023-03-03", "Electronics", 800.0),
    ("2023-03-03", "Clothing", 300.0),
    ("2023-03-04", "Home", 1200.0),
    ("2023-03-04", "Electronics", 2500.0)
]

expected_df = spark.createDataFrame(expected_data, ["date", "category", "total_amount"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Advanced partitioning and aggregation strategy. Tests efficient data organization for large-scale processing.

## Problem 19: Multi-Source Data Integration

**Requirement:** Data warehouse needs integration of multiple source systems with conflict resolution.

**Scenario:** Merge customer data from different source systems with priority-based conflict resolution.

In [None]:
# Source DataFrames
crm_customers_data = [
    ("C001", "John Doe", "john@old-email.com", "123-456-7890"),
    ("C002", "Jane Smith", "jane@email.com", "987-654-3210"),
    ("C003", "Bob Johnson", "bob@company.com", "555-123-4567")
]

erp_customers_data = [
    ("C001", "John Doe", "john@new-email.com", "123-456-7890"),
    ("C002", "Jane Smith", "jane@email.com", "987-654-0000"),
    ("C004", "Alice Brown", "alice@domain.com", "111-222-3333")
]

crm_customers_df = spark.createDataFrame(crm_customers_data, ["customer_id", "name", "email", "phone"])
erp_customers_df = spark.createDataFrame(erp_customers_data, ["customer_id", "name", "email", "phone"])

print("CRM Customers:")
crm_customers_df.show()
print("ERP Customers:")
erp_customers_df.show()

In [None]:
# Expected Output
expected_data = [
    ("C001", "John Doe", "john@new-email.com", "123-456-7890"),
    ("C002", "Jane Smith", "jane@email.com", "987-654-0000"),
    ("C003", "Bob Johnson", "bob@company.com", "555-123-4567"),
    ("C004", "Alice Brown", "alice@domain.com", "111-222-3333")
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "name", "email", "phone"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Multi-source data integration with conflict resolution. Tests complex join logic and priority-based merging.
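
The expected output suggests that ERP records win whenever both systems hold the same customer, and CRM only contributes customers missing from ERP. That priority is inferred from the expected rows, not stated in the requirement. A plain-Python sketch of the rule (the helper name `merge_by_priority` is hypothetical) is shown below. In PySpark the same idea is commonly written as a full outer join on `customer_id` followed by `coalesce(erp_col, crm_col)` per column.

```python
# Plain-Python sketch of priority-based merging.
# Assumption: ERP overrides CRM (inferred from the expected output).
crm = {
    "C001": ("John Doe", "john@old-email.com", "123-456-7890"),
    "C002": ("Jane Smith", "jane@email.com", "987-654-3210"),
    "C003": ("Bob Johnson", "bob@company.com", "555-123-4567"),
}
erp = {
    "C001": ("John Doe", "john@new-email.com", "123-456-7890"),
    "C002": ("Jane Smith", "jane@email.com", "987-654-0000"),
    "C004": ("Alice Brown", "alice@domain.com", "111-222-3333"),
}

def merge_by_priority(low, high):
    """Full outer merge on the key; rows from `high` win on conflicts."""
    merged = dict(low)    # start from the low-priority source
    merged.update(high)   # overwrite or add rows from the high-priority source
    return sorted((cid, *fields) for cid, fields in merged.items())

merged_customers = merge_by_priority(crm, erp)
for row in merged_customers:
    print(row)
```

This sketch resolves conflicts per row. A `coalesce`-based join resolves them per column, which only differs when the high-priority source contains nulls.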

## Problem 20: Complex Hierarchical Calculations

**Requirement:** Financial reporting needs hierarchical profit center calculations.

**Scenario:** Roll up financial metrics across the organizational hierarchy with weighted allocations.

In [None]:
# Source DataFrame
profit_centers_data = [
    ("PC001", "North Region", "Region", None, 1000000.0),
    ("PC002", "NY Division", "Division", "PC001", 400000.0),
    ("PC003", "NJ Division", "Division", "PC001", 350000.0),
    ("PC004", "CT Division", "Division", "PC001", 250000.0),
    ("PC005", "NY Store A", "Store", "PC002", 150000.0),
    ("PC006", "NY Store B", "Store", "PC002", 120000.0),
    ("PC007", "NY Store C", "Store", "PC002", 130000.0)
]

profit_centers_df = spark.createDataFrame(profit_centers_data, ["center_id", "center_name", "level", "parent_id", "revenue"])
profit_centers_df.show()

In [None]:
# Expected Output
expected_data = [
    ("PC001", "North Region", "Region", None, 1000000.0, 1000000.0),
    ("PC002", "NY Division", "Division", "PC001", 400000.0, 400000.0),
    ("PC003", "NJ Division", "Division", "PC001", 350000.0, 350000.0),
    ("PC004", "CT Division", "Division", "PC001", 250000.0, 250000.0),
    ("PC005", "NY Store A", "Store", "PC002", 150000.0, 150000.0),
    ("PC006", "NY Store B", "Store", "PC002", 120000.0, 120000.0),
    ("PC007", "NY Store C", "Store", "PC002", 130000.0, 130000.0)
]

expected_df = spark.createDataFrame(expected_data, ["center_id", "center_name", "level", "parent_id", "revenue", "rolled_up_revenue"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Hierarchical calculations with recursive relationships. Tests complex organizational structure processing.
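
One reading that is consistent with the expected output: a center with children reports the sum of its direct children's revenue, and a center without children keeps its own revenue. The weighting scheme mentioned in the scenario is not specified, so this plain-Python sketch ignores it. The function name `rolled_up_revenue` is hypothetical.

```python
from collections import defaultdict

# (center_id, parent_id, revenue), taken from the source data above
centers = [
    ("PC001", None, 1000000.0),
    ("PC002", "PC001", 400000.0),
    ("PC003", "PC001", 350000.0),
    ("PC004", "PC001", 250000.0),
    ("PC005", "PC002", 150000.0),
    ("PC006", "PC002", 120000.0),
    ("PC007", "PC002", 130000.0),
]

def rolled_up_revenue(rows):
    """Sum direct children's revenue; centers without children keep their own."""
    child_sum = defaultdict(float)
    for _, parent_id, revenue in rows:
        if parent_id is not None:
            child_sum[parent_id] += revenue
    return {
        center_id: child_sum[center_id] if center_id in child_sum else revenue
        for center_id, _, revenue in rows
    }

rolled = rolled_up_revenue(centers)
print(rolled["PC001"], rolled["PC002"], rolled["PC005"])
```

In PySpark the same one-level rollup can be expressed as a self-join of the DataFrame on `parent_id = center_id`, aggregated per parent.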

## Problem 21: Advanced Time-Series Correlation

**Requirement:** Financial analytics needs correlation analysis between different time-series.

**Scenario:** Calculate rolling correlations between stock prices and market indicators.

In [None]:
# Source DataFrame
stock_correlation_data = [
    ("2023-01-01", "AAPL", 150.0, 4500.0),
    ("2023-01-02", "AAPL", 152.0, 4520.0),
    ("2023-01-03", "AAPL", 151.5, 4480.0),
    ("2023-01-04", "AAPL", 153.0, 4550.0),
    ("2023-01-05", "AAPL", 154.5, 4600.0),
    ("2023-01-06", "AAPL", 153.5, 4580.0),
    ("2023-01-07", "AAPL", 155.0, 4620.0),
    ("2023-01-08", "AAPL", 156.0, 4650.0)
]

stock_correlation_df = spark.createDataFrame(stock_correlation_data, ["date", "symbol", "price", "market_index"])
stock_correlation_df = stock_correlation_df.withColumn("date", col("date").cast("date"))
stock_correlation_df.show()

In [None]:
# Expected Output
expected_data = [
    ("2023-01-01", "AAPL", 150.0, 4500.0, None),
    ("2023-01-02", "AAPL", 152.0, 4520.0, None),
    ("2023-01-03", "AAPL", 151.5, 4480.0, None),
    ("2023-01-04", "AAPL", 153.0, 4550.0, 0.87),
    ("2023-01-05", "AAPL", 154.5, 4600.0, 0.92),
    ("2023-01-06", "AAPL", 153.5, 4580.0, 0.89),
    ("2023-01-07", "AAPL", 155.0, 4620.0, 0.91),
    ("2023-01-08", "AAPL", 156.0, 4650.0, 0.93)
]

expected_df = spark.createDataFrame(expected_data, ["date", "symbol", "price", "market_index", "correlation_5d"])
expected_df = expected_df.withColumn("date", col("date").cast("date"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Advanced time-series correlation analysis. Tests statistical calculations and rolling window correlations.
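
A rolling correlation applies the Pearson formula to a trailing window of rows. The sketch below is plain Python and uses two assumptions that the notebook does not state: the window is the current row plus the four preceding rows, and a value is emitted once at least four rows are available (inferred from the first three expected rows being null). The names `pearson`, `rolling_corr`, and `min_periods` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (None if undefined)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)

def rolling_corr(xs, ys, window=5, min_periods=4):
    """Trailing-window correlation; None until min_periods rows are available."""
    out = []
    for i in range(len(xs)):
        start = max(0, i - window + 1)
        wx, wy = xs[start:i + 1], ys[start:i + 1]
        out.append(pearson(wx, wy) if len(wx) >= min_periods else None)
    return out

prices = [150.0, 152.0, 151.5, 153.0, 154.5, 153.5, 155.0, 156.0]
index = [4500.0, 4520.0, 4480.0, 4550.0, 4600.0, 4580.0, 4620.0, 4650.0]
correlation_5d = rolling_corr(prices, index)
print(correlation_5d)
```

In PySpark this is usually expressed with the `corr` aggregate over a row-based window such as `Window.partitionBy("symbol").orderBy("date").rowsBetween(-4, 0)`.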

## Problem 22: Complex Data Enrichment Pipeline

**Requirement:** Customer analytics needs comprehensive data enrichment from multiple sources.

**Scenario:** Enrich customer data with demographic, geographic, and behavioral attributes from external sources.

In [None]:
# Source DataFrames
customers_base_data = [
    ("C001", "John Doe", "10001"),
    ("C002", "Jane Smith", "90001"),
    ("C003", "Bob Johnson", "60601")
]

demographic_data = [
    ("10001", 35, "Married", "Bachelor"),
    ("90001", 28, "Single", "Master"),
    ("60601", 42, "Married", "PhD")
]

behavioral_data = [
    ("C001", "High", "Frequent", "Premium"),
    ("C002", "Medium", "Occasional", "Standard"),
    ("C003", "Low", "Rare", "Basic")
]

customers_base_df = spark.createDataFrame(customers_base_data, ["customer_id", "name", "postal_code"])
demographic_df = spark.createDataFrame(demographic_data, ["postal_code", "avg_age", "marital_status", "education"])
behavioral_df = spark.createDataFrame(behavioral_data, ["customer_id", "spending_level", "purchase_frequency", "customer_tier"])

print("Base Customers:")
customers_base_df.show()
print("Demographic Data:")
demographic_df.show()
print("Behavioral Data:")
behavioral_df.show()

In [None]:
# Expected Output
expected_data = [
    ("C001", "John Doe", "10001", 35, "Married", "Bachelor", "High", "Frequent", "Premium"),
    ("C002", "Jane Smith", "90001", 28, "Single", "Master", "Medium", "Occasional", "Standard"),
    ("C003", "Bob Johnson", "60601", 42, "Married", "PhD", "Low", "Rare", "Basic")
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "name", "postal_code", "avg_age", "marital_status", "education", "spending_level", "purchase_frequency", "customer_tier"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Complex data enrichment pipeline. Tests multi-source joins and comprehensive data augmentation.

## Problem 23: Advanced Statistical Analysis

**Requirement:** Data science needs advanced statistical metrics for model feature engineering.

**Scenario:** Calculate z-scores, percentiles, and other statistical measures for data normalization.

In [None]:
# Source DataFrame
statistical_data = [
    ("P001", 150.0),
    ("P002", 175.0),
    ("P003", 200.0),
    ("P004", 125.0),
    ("P005", 225.0),
    ("P006", 180.0),
    ("P007", 160.0),
    ("P008", 190.0),
    ("P009", 210.0),
    ("P010", 140.0)
]

statistical_df = spark.createDataFrame(statistical_data, ["product_id", "price"])
statistical_df.show()

In [None]:
# Expected Output
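expected_data = [
    # z_score uses the sample standard deviation (mean = 175.5), rounded to 2 decimals.
    # percentile is (rank - 1) / n, with rows ranked by ascending price.
    ("P001", 150.0, -0.8, 0.2),
    ("P002", 175.0, -0.02, 0.4),
    ("P003", 200.0, 0.77, 0.7),
    ("P004", 125.0, -1.58, 0.0),
    ("P005", 225.0, 1.55, 0.9),
    ("P006", 180.0, 0.14, 0.5),
    ("P007", 160.0, -0.49, 0.3),
    ("P008", 190.0, 0.45, 0.6),
    ("P009", 210.0, 1.08, 0.8),
    ("P010", 140.0, -1.11, 0.1)
]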

expected_df = spark.createDataFrame(expected_data, ["product_id", "price", "z_score", "percentile"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Advanced statistical analysis with window functions. Tests z-score calculations and percentile rankings.
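
A z-score is `(value - mean) / standard deviation`, and a percentile here ranks each price against the others. The plain-Python sketch below assumes the sample standard deviation (what Spark's `stddev` computes) and a percentile defined as `(rank - 1) / n` by ascending price. Both definitions are assumptions, since the notebook does not state them.

```python
import statistics

prices = {
    "P001": 150.0, "P002": 175.0, "P003": 200.0, "P004": 125.0, "P005": 225.0,
    "P006": 180.0, "P007": 160.0, "P008": 190.0, "P009": 210.0, "P010": 140.0,
}

values = list(prices.values())
mean = statistics.mean(values)
std = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)

# z-score per product, rounded to 2 decimals
z_score = {pid: round((price - mean) / std, 2) for pid, price in prices.items()}

# percentile as (rank - 1) / n, ranking by ascending price
ranked = sorted(prices, key=prices.get)
percentile = {pid: round(i / len(ranked), 2) for i, pid in enumerate(ranked)}

for pid in prices:
    print(pid, prices[pid], z_score[pid], percentile[pid])
```

In PySpark the mean and standard deviation can be attached to every row with `avg` and `stddev` over an unpartitioned window, and the ranking with `rank()` or `percent_rank()` over a window ordered by `price`. Note that `percent_rank()` divides by `n - 1`, so it yields a different scale from the one sketched here.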

## Problem 24: Complex Data Quality Monitoring

**Requirement:** Data governance needs automated data quality monitoring with trend analysis.

**Scenario:** Implement data quality metrics tracking with trend analysis and alerting capabilities.

In [None]:
# Source DataFrame
data_quality_metrics_data = [
    ("2023-01-01", "Completeness", 95.5),
    ("2023-01-02", "Completeness", 96.2),
    ("2023-01-03", "Completeness", 94.8),
    ("2023-01-04", "Completeness", 97.1),
    ("2023-01-05", "Completeness", 93.5),
    ("2023-01-01", "Accuracy", 98.0),
    ("2023-01-02", "Accuracy", 97.5),
    ("2023-01-03", "Accuracy", 96.8),
    ("2023-01-04", "Accuracy", 98.2),
    ("2023-01-05", "Accuracy", 95.9)
]

data_quality_metrics_df = spark.createDataFrame(data_quality_metrics_data, ["date", "metric", "score"])
data_quality_metrics_df = data_quality_metrics_df.withColumn("date", col("date").cast("date"))
data_quality_metrics_df.show()

In [None]:
# Expected Output
expected_data = [
    ("2023-01-01", "Completeness", 95.5, None),
    ("2023-01-02", "Completeness", 96.2, 0.7),
    ("2023-01-03", "Completeness", 94.8, -1.4),
    ("2023-01-04", "Completeness", 97.1, 2.3),
    ("2023-01-05", "Completeness", 93.5, -3.6),
    ("2023-01-01", "Accuracy", 98.0, None),
    ("2023-01-02", "Accuracy", 97.5, -0.5),
    ("2023-01-03", "Accuracy", 96.8, -0.7),
    ("2023-01-04", "Accuracy", 98.2, 1.4),
    ("2023-01-05", "Accuracy", 95.9, -2.3)
]

expected_df = spark.createDataFrame(expected_data, ["date", "metric", "score", "daily_change"])
expected_df = expected_df.withColumn("date", col("date").cast("date"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Data quality monitoring with trend analysis. Tests time-series analysis for quality metric tracking.

## Problem 25: Complex Business Metric Calculation

**Requirement:** Executive dashboard needs complex business KPIs with multiple calculation steps.

**Scenario:** Calculate customer acquisition cost, lifetime value, and return on investment metrics.

In [None]:
# Source DataFrame
business_metrics_data = [
    ("2023-Q1", 1000, 50000.0, 250000.0, 5000.0),
    ("2023-Q2", 1200, 60000.0, 300000.0, 5500.0),
    ("2023-Q3", 1500, 75000.0, 400000.0, 6000.0),
    ("2023-Q4", 1800, 90000.0, 500000.0, 6500.0)
]

business_metrics_df = spark.createDataFrame(business_metrics_data, ["quarter", "new_customers", "acquisition_cost", "customer_revenue", "operating_cost"])
business_metrics_df.show()

In [None]:
# Expected Output
expected_data = [
    ("2023-Q1", 1000, 50000.0, 250000.0, 5000.0, 50.0, 250.0, 5.0, 245000.0),
    ("2023-Q2", 1200, 60000.0, 300000.0, 5500.0, 50.0, 250.0, 5.0, 294500.0),
    ("2023-Q3", 1500, 75000.0, 400000.0, 6000.0, 50.0, 266.67, 5.33, 394000.0),
    ("2023-Q4", 1800, 90000.0, 500000.0, 6500.0, 50.0, 277.78, 5.56, 493500.0)
]

expected_df = spark.createDataFrame(expected_data, ["quarter", "new_customers", "acquisition_cost", "customer_revenue", "operating_cost", "cac", "clv", "roi", "net_profit"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Complex business metric calculations. Tests multi-step financial calculations and KPI derivations.
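
The KPI formulas are not spelled out in the requirement. The ones below are inferred from the expected output and should be treated as assumptions: `cac = acquisition_cost / new_customers`, `clv = customer_revenue / new_customers`, `roi = customer_revenue / acquisition_cost`, and `net_profit = customer_revenue - operating_cost`. The helper `kpis` is hypothetical.

```python
def kpis(new_customers, acquisition_cost, customer_revenue, operating_cost):
    """Derive the four KPI columns for one quarter (formulas are inferred)."""
    return {
        "cac": round(acquisition_cost / new_customers, 2),
        "clv": round(customer_revenue / new_customers, 2),
        "roi": round(customer_revenue / acquisition_cost, 2),
        "net_profit": customer_revenue - operating_cost,
    }

q3 = kpis(1500, 75000.0, 400000.0, 6000.0)
q4 = kpis(1800, 90000.0, 500000.0, 6500.0)
print(q3)
print(q4)
```

In PySpark each formula maps to one `withColumn` call, with `round(..., 2)` applied to the ratio columns so the result matches the two-decimal expected values.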

## Problem 26: Advanced Data Transformation Pipeline

**Requirement:** ETL pipeline needs complex multi-stage data transformation with error handling.

**Scenario:** Implement a robust ETL pipeline with data validation, transformation, and error logging.

In [None]:
# Source DataFrame
etl_source_data = [
    ("C001", "john doe  ", "1000.50", "2023-01-15"),
    ("C002", "Jane Smith", "invalid", "2023-01-16"),
    ("C003", "bob johnson", "1200.25", "2023-01-17"),
    ("C004", "Alice Brown", "950.00", "2023-01-18")
]

etl_source_df = spark.createDataFrame(etl_source_data, ["customer_id", "customer_name", "amount", "date"])
etl_source_df.show()

In [None]:
# Expected Output
expected_data = [
    ("C001", "John Doe", 1000.5, "2023-01-15", "Success"),
    ("C002", "Jane Smith", None, "2023-01-16", "Invalid Amount"),
    ("C003", "Bob Johnson", 1200.25, "2023-01-17", "Success"),
    ("C004", "Alice Brown", 950.0, "2023-01-18", "Success")
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "customer_name", "amount", "date", "status"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Robust ETL pipeline with error handling. Tests data validation, transformation, and error management.
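
The per-record logic implied by the expected output is: trim and title-case the name, parse the amount, and record a status instead of failing when parsing is impossible. A plain-Python sketch of that rule follows. The function name `clean_record` is hypothetical.

```python
def clean_record(customer_id, customer_name, amount, date):
    """Normalize one source row and flag unparseable amounts."""
    name = customer_name.strip().title()  # "john doe  " -> "John Doe"
    try:
        parsed_amount = float(amount)
        status = "Success"
    except ValueError:
        # Keep the row, but null out the amount and log the reason
        parsed_amount, status = None, "Invalid Amount"
    return (customer_id, name, parsed_amount, date, status)

source_rows = [
    ("C001", "john doe  ", "1000.50", "2023-01-15"),
    ("C002", "Jane Smith", "invalid", "2023-01-16"),
    ("C003", "bob johnson", "1200.25", "2023-01-17"),
    ("C004", "Alice Brown", "950.00", "2023-01-18"),
]
cleaned = [clean_record(*row) for row in source_rows]
for row in cleaned:
    print(row)
```

A PySpark version would typically use `initcap(trim(col("customer_name")))` for the name and `col("amount").cast("double")` for the amount. With ANSI mode disabled, an invalid string casts to null, so `when(col(...).isNull(), "Invalid Amount").otherwise("Success")` can derive the status.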

## Problem 27: Complex Join Optimization

**Requirement:** Performance tuning needs optimized join strategies for large datasets.

**Scenario:** Implement efficient join strategies for large customer and transaction datasets.

In [None]:
# Source DataFrames
customers_large_data = [
    ("C001", "John Doe"),
    ("C002", "Jane Smith"),
    ("C003", "Bob Johnson"),
    ("C004", "Alice Brown")
]

transactions_large_data = [
    ("T001", "C001", 100.0),
    ("T002", "C001", 150.0),
    ("T003", "C002", 200.0),
    ("T004", "C003", 75.0),
    ("T005", "C004", 300.0),
    ("T006", "C001", 125.0)
]

customers_large_df = spark.createDataFrame(customers_large_data, ["customer_id", "customer_name"])
transactions_large_df = spark.createDataFrame(transactions_large_data, ["transaction_id", "customer_id", "amount"])

print("Customers:")
customers_large_df.show()
print("Transactions:")
transactions_large_df.show()

In [None]:
# Expected Output
expected_data = [
    ("C001", "John Doe", 3, 375.0),
    ("C002", "Jane Smith", 1, 200.0),
    ("C003", "Bob Johnson", 1, 75.0),
    ("C004", "Alice Brown", 1, 300.0)
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "customer_name", "transaction_count", "total_amount"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Join optimization strategies. Tests efficient aggregation and join patterns for large datasets.
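
A common way to keep this kind of join cheap is to aggregate the large side first, so that only one row per customer takes part in the join. The plain-Python sketch below shows that order of operations. It illustrates the pattern only and does not measure performance.

```python
from collections import defaultdict

customers = [
    ("C001", "John Doe"), ("C002", "Jane Smith"),
    ("C003", "Bob Johnson"), ("C004", "Alice Brown"),
]
transactions = [
    ("T001", "C001", 100.0), ("T002", "C001", 150.0), ("T003", "C002", 200.0),
    ("T004", "C003", 75.0), ("T005", "C004", 300.0), ("T006", "C001", 125.0),
]

# Step 1: pre-aggregate the large side to one entry per customer_id
totals = defaultdict(lambda: [0, 0.0])  # customer_id -> [count, sum]
for _, customer_id, amount in transactions:
    totals[customer_id][0] += 1
    totals[customer_id][1] += amount

# Step 2: join the small customer table against the aggregated result
# (customers without transactions would get a count of 0 and a total of 0.0)
joined = [(cid, name, *totals[cid]) for cid, name in customers]
for row in joined:
    print(row)
```

In PySpark the equivalent is `groupBy("customer_id").agg(count(...), sum(...))` on the transactions, followed by a join to the customers. Wrapping the small table in `broadcast(...)` from `pyspark.sql.functions` hints that it should be shipped to every executor instead of shuffled.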

## Problem 28: Complex Window Function Patterns

**Requirement:** Advanced analytics needs complex window function patterns for time-series analysis.

**Scenario:** Implement advanced window function patterns for financial time-series analysis.

In [None]:
# Source DataFrame
financial_series_data = [
    ("2023-01-01", "AAPL", 150.0),
    ("2023-01-02", "AAPL", 152.0),
    ("2023-01-03", "AAPL", 151.5),
    ("2023-01-04", "AAPL", 153.0),
    ("2023-01-05", "AAPL", 154.5),
    ("2023-01-06", "AAPL", 153.5),
    ("2023-01-07", "AAPL", 155.0),
    ("2023-01-08", "AAPL", 156.0)
]

financial_series_df = spark.createDataFrame(financial_series_data, ["date", "symbol", "price"])
financial_series_df = financial_series_df.withColumn("date", col("date").cast("date"))
financial_series_df.show()

In [None]:
# Expected Output
expected_data = [
    ("2023-01-01", "AAPL", 150.0, None, None, None),
    ("2023-01-02", "AAPL", 152.0, 150.0, 2.0, 1.33),
    ("2023-01-03", "AAPL", 151.5, 152.0, -0.5, -0.33),
    ("2023-01-04", "AAPL", 153.0, 151.5, 1.5, 0.99),
    ("2023-01-05", "AAPL", 154.5, 153.0, 1.5, 0.98),
    ("2023-01-06", "AAPL", 153.5, 154.5, -1.0, -0.65),
    ("2023-01-07", "AAPL", 155.0, 153.5, 1.5, 0.98),
    ("2023-01-08", "AAPL", 156.0, 155.0, 1.0, 0.65)
]

expected_df = spark.createDataFrame(expected_data, ["date", "symbol", "price", "prev_price", "price_change", "pct_change"])
expected_df = expected_df.withColumn("date", col("date").cast("date"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Advanced window function patterns. Tests financial calculations and time-series analysis techniques.
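
The three derived columns follow from the previous row's price: `price_change = price - prev_price` and `pct_change = price_change / prev_price * 100`, both rounded to two decimals. A plain-Python sketch of that lag pattern:

```python
prices = [150.0, 152.0, 151.5, 153.0, 154.5, 153.5, 155.0, 156.0]

rows = []
prev_price = None
for price in prices:
    if prev_price is None:
        # The first row has no predecessor, so every derived column is null
        rows.append((price, None, None, None))
    else:
        change = round(price - prev_price, 2)
        pct = round((price - prev_price) / prev_price * 100, 2)
        rows.append((price, prev_price, change, pct))
    prev_price = price

for row in rows:
    print(row)
```

In PySpark, `lag("price").over(Window.partitionBy("symbol").orderBy("date"))` provides `prev_price`, and the other two columns are ordinary column arithmetic wrapped in `round(..., 2)`.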

## Problem 29: Complex Data Aggregation Strategy

**Requirement:** Business reporting needs complex multi-level aggregation with custom logic.

**Scenario:** Implement custom aggregation logic for hierarchical business reporting.

In [None]:
# Source DataFrame
business_aggregation_data = [
    ("Region_A", "Division_1", "Department_X", 100000.0),
    ("Region_A", "Division_1", "Department_Y", 150000.0),
    ("Region_A", "Division_2", "Department_Z", 200000.0),
    ("Region_B", "Division_3", "Department_W", 120000.0),
    ("Region_B", "Division_3", "Department_V", 180000.0),
    ("Region_B", "Division_4", "Department_U", 220000.0)
]

business_aggregation_df = spark.createDataFrame(business_aggregation_data, ["region", "division", "department", "revenue"])
business_aggregation_df.show()

In [None]:
# Expected Output
expected_data = [
    ("Region_A", "Division_1", "Department_X", 100000.0),
    ("Region_A", "Division_1", "Department_Y", 150000.0),
    ("Region_A", "Division_1", "All", 250000.0),
    ("Region_A", "Division_2", "Department_Z", 200000.0),
    ("Region_A", "Division_2", "All", 200000.0),
    ("Region_A", "All", "All", 450000.0),
    ("Region_B", "Division_3", "Department_W", 120000.0),
    ("Region_B", "Division_3", "Department_V", 180000.0),
    ("Region_B", "Division_3", "All", 300000.0),
    ("Region_B", "Division_4", "Department_U", 220000.0),
    ("Region_B", "Division_4", "All", 220000.0),
    ("Region_B", "All", "All", 520000.0),
    ("All", "All", "All", 970000.0)
]

expected_df = spark.createDataFrame(expected_data, ["region", "division", "department", "revenue"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Complex multi-level aggregation. Tests hierarchical rollup operations and custom aggregation logic.
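
A rollup adds one subtotal row per prefix of the grouping columns: per (region, division), per region, and a grand total. The plain-Python sketch below builds those levels explicitly, using the label "All" for the collapsed columns as in the expected output. The function name `rollup_totals` is hypothetical.

```python
from collections import defaultdict

rows = [
    ("Region_A", "Division_1", "Department_X", 100000.0),
    ("Region_A", "Division_1", "Department_Y", 150000.0),
    ("Region_A", "Division_2", "Department_Z", 200000.0),
    ("Region_B", "Division_3", "Department_W", 120000.0),
    ("Region_B", "Division_3", "Department_V", 180000.0),
    ("Region_B", "Division_4", "Department_U", 220000.0),
]

def rollup_totals(data):
    """Accumulate revenue at detail, division, region, and grand-total level."""
    totals = defaultdict(float)
    for region, division, department, revenue in data:
        levels = [
            (region, division, department),  # detail row
            (region, division, "All"),       # division subtotal
            (region, "All", "All"),          # region subtotal
            ("All", "All", "All"),           # grand total
        ]
        for key in levels:
            totals[key] += revenue
    return dict(totals)

totals = rollup_totals(rows)
print(totals[("Region_A", "All", "All")], totals[("All", "All", "All")])
```

In PySpark, `df.rollup("region", "division", "department").agg(sum("revenue"))` produces the same levels with nulls in the collapsed columns, which `fillna("All")` can then relabel.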

## Problem 30: Advanced Performance Optimization

**Requirement:** Large-scale data processing needs advanced performance optimization techniques.

**Scenario:** Implement performance optimization strategies for complex data processing pipelines.

In [None]:
# Source DataFrame
performance_data = [
    ("P001", "Electronics", "North", 1000.0),
    ("P001", "Electronics", "South", 1500.0),
    ("P002", "Clothing", "North", 800.0),
    ("P002", "Clothing", "South", 1200.0),
    ("P003", "Home", "North", 2000.0),
    ("P003", "Home", "South", 1800.0),
    ("P004", "Electronics", "North", 900.0),
    ("P004", "Electronics", "South", 1100.0)
]

performance_df = spark.createDataFrame(performance_data, ["product_id", "category", "region", "sales"])
performance_df.show()

In [None]:
# Expected Output
expected_data = [
    ("Electronics", 4500.0, 1900.0, 2600.0),
    ("Clothing", 2000.0, 800.0, 1200.0),
    ("Home", 3800.0, 2000.0, 1800.0)
]

expected_df = spark.createDataFrame(expected_data, ["category", "total_sales", "north_sales", "south_sales"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Advanced performance optimization. Tests efficient aggregation patterns and data processing strategies.
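
The per-region columns can be produced in the same pass as the category total by adding each sale to two accumulators: the overall total and the one for its region. A plain-Python sketch of that conditional aggregation follows. The function name `sales_by_category` is hypothetical.

```python
from collections import defaultdict

performance_rows = [
    ("P001", "Electronics", "North", 1000.0),
    ("P001", "Electronics", "South", 1500.0),
    ("P002", "Clothing", "North", 800.0),
    ("P002", "Clothing", "South", 1200.0),
    ("P003", "Home", "North", 2000.0),
    ("P003", "Home", "South", 1800.0),
    ("P004", "Electronics", "North", 900.0),
    ("P004", "Electronics", "South", 1100.0),
]

def sales_by_category(rows):
    """Single pass: total plus one conditional sum per region."""
    out = defaultdict(
        lambda: {"total_sales": 0.0, "north_sales": 0.0, "south_sales": 0.0}
    )
    for _, category, region, sales in rows:
        out[category]["total_sales"] += sales
        out[category][f"{region.lower()}_sales"] += sales
    return dict(out)

summary = sales_by_category(performance_rows)
for category, metrics in summary.items():
    print(category, metrics)
```

In PySpark the same single-pass shape is `groupBy("category").agg(sum("sales"), sum(when(col("region") == "North", col("sales"))), ...)`, which avoids a separate pivot and join.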

# Set 5 Complete!

You've completed all 30 Hard problems in Set 5. These problems cover:
- Complex joins and relationship analysis
- Advanced window functions and analytics
- Multi-level aggregations and rollups
- Complex UDFs and data transformations
- Performance optimization and partitioning
- Statistical analysis and business metrics
- Data quality monitoring and validation

Congratulations! You have completed all 150 PySpark interview problems across 5 difficulty levels!