# PySpark Interview Preparation - Set 2 (Easy/Medium)

## Overview & Instructions

### How to run this notebook in Google Colab:
1. Upload this .ipynb file to Google Colab
2. Run the installation cells below
3. Execute each problem cell sequentially

### Installation Commands:
The following cell installs Java and PySpark:

In [None]:
# Install Java and PySpark
# Java 17 is used because recent PySpark releases (4.x) require it; Spark 3.5 also runs on it
!apt-get install -y openjdk-17-jdk-headless -qq > /dev/null
!pip install -q pyspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-17-openjdk-amd64"

### SparkSession Initialization:

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

spark = (
    SparkSession.builder
    .appName("PySparkInterviewSet2")
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

### DataFrame Assertion Function:

This function compares DataFrames ignoring order and with floating-point tolerance:

In [None]:
def assert_dataframe_equal(df_actual, df_expected, epsilon=1e-6):
    """Compare two DataFrames ignoring order and with floating-point tolerance"""
    
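    # Check column names and data types first (nullability is ignored,
    # because aggregates such as count() return non-nullable columns)
    actual_fields = [(f.name, f.dataType) for f in df_actual.schema.fields]
    expected_fields = [(f.name, f.dataType) for f in df_expected.schema.fields]
    if actual_fields != expected_fields:
        print("Schema mismatch!")
        print("Actual schema:", df_actual.schema)
        print("Expected schema:", df_expected.schema)
        raise AssertionError("Schema mismatch")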
    
    # Collect data
    actual_data = df_actual.collect()
    expected_data = df_expected.collect()
    
    if len(actual_data) != len(expected_data):
        print(f"Row count mismatch! Actual: {len(actual_data)}, Expected: {len(expected_data)}")
        raise AssertionError("Row count mismatch")
    
    # Convert rows to hashable tuples for order-insensitive comparison
    def row_to_comparable(row):
        values = []
        for field in row:
            if isinstance(field, float):
                # Handle floating point comparison
                values.append(round(field / epsilon) * epsilon)
            elif isinstance(field, list):
                # Handle arrays
                values.append(tuple(sorted(field)) if field else tuple())
            elif isinstance(field, dict):
                # Handle maps (MapType values are collected as Python dicts)
                values.append(tuple(sorted(field.items())))
            else:
                values.append(field)
        return tuple(values)
    
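    # Sort instead of building sets so that duplicate rows are still compared
    actual_set = sorted((row_to_comparable(row) for row in actual_data), key=repr)
    expected_set = sorted((row_to_comparable(row) for row in expected_data), key=repr)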
    
    if actual_set != expected_set:
        print("Data mismatch!")
        print("Actual data:", actual_set)
        print("Expected data:", expected_set)
        raise AssertionError("Data content mismatch")
    
    print("✓ DataFrames are equal!")
    return True

## Table of Contents - Set 2 (Easy/Medium)

**Difficulty Distribution:** 30 Easy/Medium Problems

**Topics Covered:**
- Advanced Joins & Deduplication (8 problems)
- Complex Window Functions (6 problems) 
- Multi-level Aggregations (6 problems)
- Advanced UDFs & Pandas UDFs (4 problems)
- Nested Data Operations (3 problems)
- Performance & Partitioning (3 problems)

## Problem 1: Customer Lifetime Value Calculation

**Requirement:** Marketing analytics needs to calculate Customer Lifetime Value (CLV) for segmentation.

**Scenario:** Calculate total revenue, average order value, and purchase frequency for each customer over their lifetime.

In [None]:
# Source DataFrame
customer_orders_data = [
    (1, "C001", "2023-01-15", 100.0),
    (2, "C001", "2023-02-20", 150.0),
    (3, "C002", "2023-01-10", 200.0),
    (4, "C001", "2023-03-05", 75.0),
    (5, "C003", "2023-02-01", 300.0),
    (6, "C002", "2023-03-15", 250.0),
    (7, "C003", "2023-03-20", 100.0),
    (8, "C004", "2023-01-25", 500.0),
    (9, "C001", "2023-04-10", 125.0)
]

customer_orders_df = spark.createDataFrame(customer_orders_data, ["order_id", "customer_id", "order_date", "amount"])
customer_orders_df = customer_orders_df.withColumn("order_date", col("order_date").cast("date"))
customer_orders_df.show()

In [None]:
# Expected Output
expected_data = [
    ("C004", 1, 500.0, 500.0, 500.0),
    ("C003", 2, 400.0, 200.0, 200.0),
    ("C002", 2, 450.0, 225.0, 225.0),
    ("C001", 4, 450.0, 112.5, 112.5)
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "total_orders", "total_revenue", "avg_order_value", "clv"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Multi-column aggregation with customer metrics. Tests complex business metric calculations.

## Problem 2: Employee Department Hierarchy

**Requirement:** HR needs to identify employees with their managers for organizational reporting.

**Scenario:** Perform self-join on employee table to get manager names for each employee.

In [None]:
# Source DataFrame
employees_hierarchy_data = [
    (1, "John CEO", None, "CEO"),
    (2, "Alice VP", 1, "VP Engineering"),
    (3, "Bob Manager", 2, "Engineering Manager"),
    (4, "Charlie Developer", 3, "Senior Developer"),
    (5, "Diana VP", 1, "VP Marketing"),
    (6, "Eve Specialist", 5, "Marketing Specialist"),
    (7, "Frank Manager", 2, "QA Manager")
]

employees_hierarchy_df = spark.createDataFrame(employees_hierarchy_data, ["emp_id", "emp_name", "manager_id", "title"])
employees_hierarchy_df.show()

In [None]:
# Expected Output
expected_data = [
    (2, "Alice VP", "John CEO", "VP Engineering"),
    (3, "Bob Manager", "Alice VP", "Engineering Manager"),
    (4, "Charlie Developer", "Bob Manager", "Senior Developer"),
    (5, "Diana VP", "John CEO", "VP Marketing"),
    (6, "Eve Specialist", "Diana VP", "Marketing Specialist"),
    (7, "Frank Manager", "Alice VP", "QA Manager")
]

expected_df = spark.createDataFrame(expected_data, ["emp_id", "emp_name", "manager_name", "title"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Self-join operation. Tests joining a table with itself on different conditions.

## Problem 3: Running Total with Window Functions

**Requirement:** Finance team needs running total of daily sales for cash flow analysis.

**Scenario:** Calculate cumulative sum of sales ordered by date using window functions.

In [None]:
# Source DataFrame
daily_sales_data = [
    ("2023-01-01", 1000.0),
    ("2023-01-02", 1500.0),
    ("2023-01-03", 800.0),
    ("2023-01-04", 2000.0),
    ("2023-01-05", 1200.0),
    ("2023-01-06", 1800.0)
]

daily_sales_df = spark.createDataFrame(daily_sales_data, ["date", "daily_sales"])
daily_sales_df = daily_sales_df.withColumn("date", col("date").cast("date"))
daily_sales_df.show()

In [None]:
# Expected Output
expected_data = [
    ("2023-01-01", 1000.0, 1000.0),
    ("2023-01-02", 1500.0, 2500.0),
    ("2023-01-03", 800.0, 3300.0),
    ("2023-01-04", 2000.0, 5300.0),
    ("2023-01-05", 1200.0, 6500.0),
    ("2023-01-06", 1800.0, 8300.0)
]

expected_df = spark.createDataFrame(expected_data, ["date", "daily_sales", "running_total"])
expected_df = expected_df.withColumn("date", col("date").cast("date"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Window function with cumulative sum. Tests unbounded window for running totals.
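
For checking a result by hand, the running total can be reproduced without Spark. The sketch below is a plain-Python cross-check, not the intended PySpark solution; the helper name `running_totals` is an illustrative assumption. In Spark, the same idea is usually expressed as `sum` over a window ordered by date.

```python
from itertools import accumulate

# (date, daily_sales) pairs copied from the source data of this problem
daily_sales = [
    ("2023-01-01", 1000.0),
    ("2023-01-02", 1500.0),
    ("2023-01-03", 800.0),
    ("2023-01-04", 2000.0),
    ("2023-01-05", 1200.0),
    ("2023-01-06", 1800.0),
]

def running_totals(rows):
    """Return (date, amount, running_total) tuples ordered by date."""
    ordered = sorted(rows, key=lambda r: r[0])  # ISO dates sort correctly as strings
    totals = accumulate(amount for _, amount in ordered)
    return [(d, a, t) for (d, a), t in zip(ordered, totals)]

for row in running_totals(daily_sales):
    print(row)
```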

## Problem 4: Product Recommendation Engine

**Requirement:** E-commerce team wants to recommend products frequently bought together.

**Scenario:** Find product pairs that are frequently purchased in the same order.

In [None]:
# Source DataFrame
order_items_data = [
    (1, "P001", "Laptop"),
    (1, "P002", "Mouse"),
    (1, "P003", "Laptop Bag"),
    (2, "P001", "Laptop"),
    (2, "P002", "Mouse"),
    (3, "P004", "Monitor"),
    (3, "P002", "Mouse"),
    (4, "P001", "Laptop"),
    (4, "P005", "Keyboard"),
    (5, "P002", "Mouse"),
    (5, "P005", "Keyboard")
]

order_items_df = spark.createDataFrame(order_items_data, ["order_id", "product_id", "product_name"])
order_items_df.show()

In [None]:
# Expected Output
expected_data = [
    # Each pair is listed once, with the smaller product_id first
    ("P001", "Laptop", "P002", "Mouse", 2),
    ("P001", "Laptop", "P003", "Laptop Bag", 1),
    ("P001", "Laptop", "P005", "Keyboard", 1),
    ("P002", "Mouse", "P003", "Laptop Bag", 1),
    ("P002", "Mouse", "P004", "Monitor", 1),
    ("P002", "Mouse", "P005", "Keyboard", 1)
]

expected_df = spark.createDataFrame(expected_data, ["product1_id", "product1_name", "product2_id", "product2_name", "pair_count"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Self-join for co-occurrence analysis. Tests complex join conditions and pair counting.
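
Pair counts are easy to get wrong by hand, so a small plain-Python tally can serve as a cross-check. This is a sketch only, and `count_pairs` is a hypothetical helper. It assumes each pair is unordered and reported with the smaller product id first, which in Spark is commonly achieved by self-joining on `order_id` with a condition such as `product1_id < product2_id`.

```python
from collections import Counter, defaultdict
from itertools import combinations

# (order_id, product_id) pairs copied from the source data of this problem
order_items = [
    (1, "P001"), (1, "P002"), (1, "P003"),
    (2, "P001"), (2, "P002"),
    (3, "P004"), (3, "P002"),
    (4, "P001"), (4, "P005"),
    (5, "P002"), (5, "P005"),
]

def count_pairs(items):
    """Count how many orders contain each unordered product pair."""
    by_order = defaultdict(set)
    for order_id, product_id in items:
        by_order[order_id].add(product_id)
    counts = Counter()
    for products in by_order.values():
        # sorted() puts the smaller id first, so (A, B) and (B, A) share one key
        counts.update(combinations(sorted(products), 2))
    return counts

pair_counts = count_pairs(order_items)
for pair, n in sorted(pair_counts.items()):
    print(pair, n)
```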

## Problem 5: Time-Based Sessionization

**Requirement:** Analytics team needs to group user activities into sessions based on time gaps.

**Scenario:** Group each user's activities into sessions, starting a new session whenever the gap since that user's previous activity exceeds 30 minutes.

In [None]:
# Source DataFrame
user_activities_data = [
    ("U001", "2023-01-01 10:00:00", "login"),
    ("U001", "2023-01-01 10:05:00", "browse"),
    ("U001", "2023-01-01 10:10:00", "click"),
    ("U001", "2023-01-01 10:45:00", "purchase"),  # New session (35 min gap)
    ("U001", "2023-01-01 10:50:00", "logout"),
    ("U002", "2023-01-01 11:00:00", "login"),
    ("U002", "2023-01-01 11:15:00", "browse"),
    ("U002", "2023-01-01 11:20:00", "click")
]

user_activities_df = spark.createDataFrame(user_activities_data, ["user_id", "timestamp", "action"])
user_activities_df = user_activities_df.withColumn("timestamp", col("timestamp").cast("timestamp"))
user_activities_df.show()

In [None]:
# Expected Output
expected_data = [
    ("U001", "2023-01-01 10:00:00", "login", 1),
    ("U001", "2023-01-01 10:05:00", "browse", 1),
    ("U001", "2023-01-01 10:10:00", "click", 1),
    ("U001", "2023-01-01 10:45:00", "purchase", 2),
    ("U001", "2023-01-01 10:50:00", "logout", 2),
    ("U002", "2023-01-01 11:00:00", "login", 1),
    ("U002", "2023-01-01 11:15:00", "browse", 1),
    ("U002", "2023-01-01 11:20:00", "click", 1)
]

expected_df = spark.createDataFrame(expected_data, ["user_id", "timestamp", "action", "session_id"])
expected_df = expected_df.withColumn("timestamp", col("timestamp").cast("timestamp"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Advanced window functions for sessionization. Tests time gap analysis and conditional session creation.
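
The session rule can be illustrated without Spark. The sketch below is plain Python with a hypothetical helper, `assign_sessions`, and it assumes session ids restart at 1 for every user. A typical PySpark version would compare each timestamp with `lag` over a per-user window, flag gaps above the threshold, and take a cumulative sum of the flags.

```python
from datetime import datetime, timedelta

def assign_sessions(events, gap=timedelta(minutes=30)):
    """events: iterable of (user_id, 'YYYY-MM-DD HH:MM:SS', action).
    Returns the same rows with a per-user session_id appended."""
    out = []
    last_seen = {}  # user_id -> timestamp of that user's previous event
    session = {}    # user_id -> current session number
    for user, ts, action in sorted(events, key=lambda e: (e[0], e[1])):
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        if user not in last_seen:
            session[user] = 1
        elif t - last_seen[user] > gap:
            session[user] += 1  # gap exceeded: start a new session
        last_seen[user] = t
        out.append((user, ts, action, session[user]))
    return out

events = [
    ("U001", "2023-01-01 10:00:00", "login"),
    ("U001", "2023-01-01 10:10:00", "click"),
    ("U001", "2023-01-01 10:45:00", "purchase"),
    ("U002", "2023-01-01 11:00:00", "login"),
]
for row in assign_sessions(events):
    print(row)
```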

## Problem 6: Complex UDF for Text Analysis

**Requirement:** Customer service needs to categorize support tickets based on sentiment and urgency.

**Scenario:** Create a UDF that analyzes ticket text and returns priority level based on keywords.

In [None]:
# Source DataFrame
support_tickets_data = [
    (1, "My login is not working, need immediate help", "John"),
    (2, "Feature request for dark mode", "Jane"),
    (3, "URGENT: Payment failed but money deducted", "Bob"),
    (4, "Bug report: button color issue", "Alice"),
    (5, "CRITICAL: System down, cannot access anything", "Charlie")
]

support_tickets_df = spark.createDataFrame(support_tickets_data, ["ticket_id", "description", "reporter"])
support_tickets_df.show(truncate=False)

In [None]:
# Expected Output
expected_data = [
    (1, "My login is not working, need immediate help", "John", "High"),
    (2, "Feature request for dark mode", "Jane", "Low"),
    (3, "URGENT: Payment failed but money deducted", "Bob", "Critical"),
    (4, "Bug report: button color issue", "Alice", "Medium"),
    (5, "CRITICAL: System down, cannot access anything", "Charlie", "Critical")
]

expected_df = spark.createDataFrame(expected_data, ["ticket_id", "description", "reporter", "priority"])
expected_df.show(truncate=False)

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Complex UDF with string analysis. Tests text processing and conditional logic in UDFs.

## Problem 7: Multiple Column Pivot

**Requirement:** Sales analytics needs quarterly sales totals for each product category, summed across regions.

**Scenario:** Create a pivot table showing sales by product category and quarter.

In [None]:
# Source DataFrame
regional_sales_data = [
    ("Electronics", "Q1", "North", 50000),
    ("Electronics", "Q1", "South", 45000),
    ("Electronics", "Q2", "North", 60000),
    ("Electronics", "Q2", "South", 55000),
    ("Clothing", "Q1", "North", 30000),
    ("Clothing", "Q1", "South", 35000),
    ("Clothing", "Q2", "North", 40000),
    ("Clothing", "Q2", "South", 45000)
]

regional_sales_df = spark.createDataFrame(regional_sales_data, ["category", "quarter", "region", "sales"])
regional_sales_df.show()

In [None]:
# Expected Output
expected_data = [
    ("Electronics", 95000, 115000),
    ("Clothing", 65000, 85000)
]

expected_df = spark.createDataFrame(expected_data, ["category", "Q1_sales", "Q2_sales"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Multi-level pivot with aggregation. Tests complex pivot operations with multiple grouping columns.

## Problem 8: Advanced Deduplication with Multiple Criteria

**Requirement:** Data quality team needs to identify and remove duplicate customer records.

**Scenario:** Find duplicate customers that match on name, email, or phone, and keep one record per duplicate group (the expected output retains the lowest cust_id).

In [None]:
# Source DataFrame
customer_duplicates_data = [
    (1, "John Doe", "john@email.com", "123-456-7890"),
    (2, "John Doe", "john.doe@email.com", "123-456-7890"),
    (3, "Jane Smith", "jane@email.com", "987-654-3210"),
    (4, "Jane Smith", "jane@email.com", "555-123-4567"),
    (5, "Bob Johnson", "bob@email.com", "111-222-3333"),
    (6, "Robert Johnson", "bob@email.com", "111-222-3333")
]

customer_duplicates_df = spark.createDataFrame(customer_duplicates_data, ["cust_id", "name", "email", "phone"])
customer_duplicates_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "John Doe", "john@email.com", "123-456-7890"),
    (3, "Jane Smith", "jane@email.com", "987-654-3210"),
    (5, "Bob Johnson", "bob@email.com", "111-222-3333")
]

expected_df = spark.createDataFrame(expected_data, ["cust_id", "name", "email", "phone"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Advanced deduplication with multiple matching criteria. Tests window functions and complex duplicate identification logic.

## Problem 9: Nested JSON Data Processing

**Requirement:** Analytics team needs to flatten nested JSON data from API responses.

**Scenario:** Extract and flatten nested customer order data that contains an array of items.

In [None]:
# Source DataFrame with nested structure
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer", StructType([
        StructField("name", StringType(), True),
        StructField("email", StringType(), True)
    ]), True),
    StructField("items", ArrayType(StructType([
        StructField("product", StringType(), True),
        StructField("quantity", IntegerType(), True),
        StructField("price", IntegerType(), True)
    ])), True)
])

nested_data = [
    ("O001", ("John Doe", "john@email.com"), [("Laptop", 1, 1000), ("Mouse", 2, 50)]),
    ("O002", ("Jane Smith", "jane@email.com"), [("Monitor", 1, 300), ("Keyboard", 1, 100)])
]

nested_df = spark.createDataFrame(nested_data, schema)
nested_df.show(truncate=False)
nested_df.printSchema()

In [None]:
# Expected Output
expected_data = [
    ("O001", "John Doe", "john@email.com", "Laptop", 1, 1000),
    ("O001", "John Doe", "john@email.com", "Mouse", 2, 50),
    ("O002", "Jane Smith", "jane@email.com", "Monitor", 1, 300),
    ("O002", "Jane Smith", "jane@email.com", "Keyboard", 1, 100)
]

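# quantity and price are declared as int to match the IntegerType fields in the source schema
expected_df = spark.createDataFrame(
    expected_data,
    "order_id string, customer_name string, customer_email string, product string, quantity int, price int"
)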
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Complex nested data flattening. Tests struct and array operations with explode.

## Problem 10: Time-Series Gap Filling

**Requirement:** Finance team needs complete time series data with missing dates filled.

**Scenario:** Fill missing dates in stock price data and forward-fill the last known prices.

In [None]:
# Source DataFrame
stock_prices_data = [
    ("2023-01-01", "AAPL", 150.0),
    ("2023-01-03", "AAPL", 152.0),
    ("2023-01-04", "AAPL", 151.5),
    ("2023-01-06", "AAPL", 153.0),
    ("2023-01-01", "GOOG", 2800.0),
    ("2023-01-02", "GOOG", 2810.0),
    ("2023-01-05", "GOOG", 2820.0)
]

stock_prices_df = spark.createDataFrame(stock_prices_data, ["date", "symbol", "price"])
stock_prices_df = stock_prices_df.withColumn("date", col("date").cast("date"))
stock_prices_df.show()

In [None]:
# Expected Output
expected_data = [
    ("2023-01-01", "AAPL", 150.0),
    ("2023-01-02", "AAPL", 150.0),
    ("2023-01-03", "AAPL", 152.0),
    ("2023-01-04", "AAPL", 151.5),
    ("2023-01-05", "AAPL", 151.5),
    ("2023-01-06", "AAPL", 153.0),
    ("2023-01-01", "GOOG", 2800.0),
    ("2023-01-02", "GOOG", 2810.0),
    ("2023-01-03", "GOOG", 2810.0),
    ("2023-01-04", "GOOG", 2810.0),
    ("2023-01-05", "GOOG", 2820.0),
    ("2023-01-06", "GOOG", 2820.0)
]

expected_df = spark.createDataFrame(expected_data, ["date", "symbol", "price"])
expected_df = expected_df.withColumn("date", col("date").cast("date"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Time-series gap filling with last observation carried forward. Tests complex window functions and date generation.
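
Last-observation-carried-forward can be sketched in plain Python for cross-checking. The helper `forward_fill` is hypothetical, and it assumes that every symbol is filled over the same calendar range (the earliest to the latest date in the whole dataset), which is what the expected output for this problem suggests.

```python
from datetime import date, timedelta

def forward_fill(prices):
    """prices: list of (date, symbol, price). Returns one row per symbol per
    calendar day between the global min and max date, carrying the last
    known price forward."""
    known = {(d, s): p for d, s, p in prices}
    symbols = sorted({s for _, s, _ in prices})
    start = min(d for d, _, _ in prices)
    end = max(d for d, _, _ in prices)
    filled = []
    for s in symbols:
        last = None
        day = start
        while day <= end:
            last = known.get((day, s), last)  # keep the previous price on missing days
            if last is not None:              # skip days before a symbol's first quote
                filled.append((day, s, last))
            day += timedelta(days=1)
    return filled

sample = [
    (date(2023, 1, 1), "GOOG", 2800.0),
    (date(2023, 1, 2), "GOOG", 2810.0),
    (date(2023, 1, 5), "GOOG", 2820.0),
]
for row in forward_fill(sample):
    print(row)
```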

## Problem 11: Multi-Table Relationship Analysis

**Requirement:** Business intelligence needs customer journey analysis across multiple touchpoints.

**Scenario:** Join customer, orders, and payments tables to analyze complete customer journey.

In [None]:
# Source DataFrames
customers_multi_data = [
    ("C001", "John Doe", "Premium"),
    ("C002", "Jane Smith", "Standard"),
    ("C003", "Bob Johnson", "Premium")
]

orders_multi_data = [
    ("O001", "C001", "2023-01-15", 1000.0),
    ("O002", "C001", "2023-02-20", 1500.0),
    ("O003", "C002", "2023-01-10", 800.0),
    ("O004", "C003", "2023-03-05", 2000.0)
]

payments_multi_data = [
    ("P001", "O001", "2023-01-16", "Credit Card"),
    ("P002", "O002", "2023-02-21", "PayPal"),
    ("P003", "O003", "2023-01-11", "Credit Card"),
    ("P004", "O004", "2023-03-06", "Bank Transfer")
]

customers_multi_df = spark.createDataFrame(customers_multi_data, ["customer_id", "customer_name", "membership"])
orders_multi_df = spark.createDataFrame(orders_multi_data, ["order_id", "customer_id", "order_date", "amount"])
payments_multi_df = spark.createDataFrame(payments_multi_data, ["payment_id", "order_id", "payment_date", "method"])

print("Customers:")
customers_multi_df.show()
print("Orders:")
orders_multi_df.show()
print("Payments:")
payments_multi_df.show()

In [None]:
# Expected Output
expected_data = [
    ("C001", "John Doe", "Premium", "O001", 1000.0, "P001", "Credit Card"),
    ("C001", "John Doe", "Premium", "O002", 1500.0, "P002", "PayPal"),
    ("C002", "Jane Smith", "Standard", "O003", 800.0, "P003", "Credit Card"),
    ("C003", "Bob Johnson", "Premium", "O004", 2000.0, "P004", "Bank Transfer")
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "customer_name", "membership", "order_id", "amount", "payment_id", "payment_method"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Multiple table joins with complex relationships. Tests chaining multiple join operations.

## Problem 12: Advanced Window Functions with Multiple Partitions

**Requirement:** Sales team needs ranking of products within each category and region.

**Scenario:** Calculate product rankings within each category and region based on sales.

In [None]:
# Source DataFrame
product_region_sales_data = [
    ("Electronics", "North", "Laptop", 50000),
    ("Electronics", "North", "Smartphone", 75000),
    ("Electronics", "North", "Tablet", 30000),
    ("Electronics", "South", "Laptop", 45000),
    ("Electronics", "South", "Smartphone", 60000),
    ("Electronics", "South", "Tablet", 25000),
    ("Clothing", "North", "Shirt", 20000),
    ("Clothing", "North", "Pants", 30000),
    ("Clothing", "South", "Shirt", 25000),
    ("Clothing", "South", "Pants", 35000)
]

product_region_sales_df = spark.createDataFrame(product_region_sales_data, ["category", "region", "product", "sales"])
product_region_sales_df.show()

In [None]:
# Expected Output
expected_data = [
    ("Electronics", "North", "Smartphone", 75000, 1),
    ("Electronics", "North", "Laptop", 50000, 2),
    ("Electronics", "North", "Tablet", 30000, 3),
    ("Electronics", "South", "Smartphone", 60000, 1),
    ("Electronics", "South", "Laptop", 45000, 2),
    ("Electronics", "South", "Tablet", 25000, 3),
    ("Clothing", "North", "Pants", 30000, 1),
    ("Clothing", "North", "Shirt", 20000, 2),
    ("Clothing", "South", "Pants", 35000, 1),
    ("Clothing", "South", "Shirt", 25000, 2)
]

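expected_df = spark.createDataFrame(expected_data, ["category", "region", "product", "sales", "rank"])
# rank() returns an integer column, so cast the inferred long to int before comparing
expected_df = expected_df.withColumn("rank", col("rank").cast("int"))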
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Multi-partition window functions. Tests complex window specifications with multiple partition keys.

## Problem 13: Data Quality Validation UDF

**Requirement:** Data governance team needs comprehensive data quality checks.

**Scenario:** Create UDFs to validate email format, phone numbers, and age ranges.

In [None]:
# Source DataFrame
customer_validation_data = [
    (1, "John Doe", "john@email.com", "123-456-7890", 25),
    (2, "Jane Smith", "invalid-email", "987-654-3210", 35),
    (3, "Bob Johnson", "bob@company.com", "555-1234", 17),
    (4, "Alice Brown", "alice@domain.com", "111-222-3333", 150),
    (5, "Charlie Wilson", "charlie@email.com", "444-555-6666", 45)
]

customer_validation_df = spark.createDataFrame(customer_validation_data, ["cust_id", "name", "email", "phone", "age"])
customer_validation_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "John Doe", "john@email.com", "123-456-7890", 25, "Valid", "Valid", "Valid"),
    (2, "Jane Smith", "invalid-email", "987-654-3210", 35, "Invalid", "Valid", "Valid"),
    (3, "Bob Johnson", "bob@company.com", "555-1234", 17, "Valid", "Invalid", "Valid"),
    (4, "Alice Brown", "alice@domain.com", "111-222-3333", 150, "Valid", "Valid", "Invalid"),
    (5, "Charlie Wilson", "charlie@email.com", "444-555-6666", 45, "Valid", "Valid", "Valid")
]

expected_df = spark.createDataFrame(expected_data, ["cust_id", "name", "email", "phone", "age", "email_status", "phone_status", "age_status"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Multiple UDFs for data validation. Tests regex patterns and complex validation logic.

## Problem 14: Complex Conditional Aggregation

**Requirement:** Business intelligence needs segmented revenue analysis.

**Scenario:** Calculate revenue by multiple customer segments and product categories simultaneously.

In [None]:
# Source DataFrame
segmented_sales_data = [
    ("Premium", "Electronics", 1000.0),
    ("Premium", "Clothing", 500.0),
    ("Standard", "Electronics", 800.0),
    ("Standard", "Clothing", 300.0),
    ("Premium", "Electronics", 1200.0),
    ("Standard", "Electronics", 600.0),
    ("Premium", "Clothing", 400.0),
    ("Standard", "Clothing", 200.0)
]

segmented_sales_df = spark.createDataFrame(segmented_sales_data, ["membership", "category", "amount"])
segmented_sales_df.show()

In [None]:
# Expected Output
expected_data = [
    ("Electronics", 2200.0, 1400.0),
    ("Clothing", 900.0, 500.0)
]

expected_df = spark.createDataFrame(expected_data, ["category", "premium_revenue", "standard_revenue"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Complex conditional aggregation with multiple sum conditions. Tests advanced aggregation patterns.

## Problem 15: Array and Map Operations

**Requirement:** Product analytics needs to analyze product feature usage patterns.

**Scenario:** Process arrays and maps to analyze which features are used together.

In [None]:
# Source DataFrame with complex types
from pyspark.sql.types import MapType

product_features_schema = StructType([
    StructField("product_id", StringType(), True),
    StructField("features", ArrayType(StringType()), True),
    StructField("usage_stats", MapType(StringType(), IntegerType()), True)
])

product_features_data = [
    ("P001", ["search", "filter", "sort"], {"search": 150, "filter": 75, "sort": 50}),
    ("P002", ["search", "export"], {"search": 200, "export": 30}),
    ("P003", ["filter", "sort", "import"], {"filter": 100, "sort": 60, "import": 20}),
    ("P004", ["search", "filter"], {"search": 180, "filter": 90})
]

product_features_df = spark.createDataFrame(product_features_data, product_features_schema)
product_features_df.show(truncate=False)
product_features_df.printSchema()

In [None]:
# Expected Output
expected_data = [
    ("search", 3, 530),
    ("filter", 3, 265),
    ("sort", 2, 110),
    ("export", 1, 30),
    ("import", 1, 20)
]

expected_df = spark.createDataFrame(expected_data, ["feature", "product_count", "total_usage"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Complex type operations with arrays and maps. Tests explode and map value extraction.

## Problem 16: Advanced Date/Time Operations

**Requirement:** Operations team needs business day calculations excluding weekends/holidays.

**Scenario:** Count the business days (Monday to Friday) from each start date up to, but not including, the end date. No holiday calendar is applied in this dataset.

In [None]:
# Source DataFrame
business_dates_data = [
    # Business days are counted from start_date up to, but not including, end_date
    (1, "2023-01-02", "2023-01-05"),  # Mon to Thu: Mon, Tue, Wed = 3 business days
    (2, "2023-01-06", "2023-01-09"),  # Fri to Mon: Fri = 1 business day
    (3, "2023-01-09", "2023-01-13"),  # Mon to Fri: Mon to Thu = 4 business days
    (4, "2023-01-13", "2023-01-17")   # Fri to Tue: Fri, Mon = 2 business days
]

business_dates_df = spark.createDataFrame(business_dates_data, ["task_id", "start_date", "end_date"])
business_dates_df = business_dates_df.withColumn("start_date", col("start_date").cast("date"))\
                                   .withColumn("end_date", col("end_date").cast("date"))
business_dates_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "2023-01-02", "2023-01-05", 3),
    (2, "2023-01-06", "2023-01-09", 1),
    (3, "2023-01-09", "2023-01-13", 4),
    (4, "2023-01-13", "2023-01-17", 2)
]

expected_df = spark.createDataFrame(expected_data, ["task_id", "start_date", "end_date", "business_days"])
expected_df = expected_df.withColumn("start_date", col("start_date").cast("date"))\
                       .withColumn("end_date", col("end_date").cast("date"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Advanced date operations with business logic. Tests date sequence generation and conditional counting.
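
A business-day count is simple to verify in plain Python. The sketch below uses a hypothetical helper, `business_days`, and assumes a half-open range (start date counted, end date not) with no holiday calendar. A Spark solution might instead generate the dates with `sequence`, explode them, and count the weekdays.

```python
from datetime import date, timedelta

def business_days(start, end):
    """Count Monday-to-Friday dates in the half-open range [start, end)."""
    count = 0
    day = start
    while day < end:
        if day.weekday() < 5:  # 0 = Monday ... 4 = Friday
            count += 1
        day += timedelta(days=1)
    return count

print(business_days(date(2023, 1, 2), date(2023, 1, 5)))  # Mon to Thu
print(business_days(date(2023, 1, 6), date(2023, 1, 9)))  # Fri to Mon
```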

## Problem 17: Hierarchical Data Processing

**Requirement:** HR analytics needs organizational hierarchy reporting.

**Scenario:** Process employee-manager relationships to build organizational trees.

In [None]:
# Source DataFrame
org_hierarchy_data = [
    (1, "CEO", None),
    (2, "VP Engineering", 1),
    (3, "Engineering Manager", 2),
    (4, "Senior Developer", 3),
    (5, "Junior Developer", 3),
    (6, "VP Marketing", 1),
    (7, "Marketing Manager", 6),
    (8, "Marketing Specialist", 7)
]

org_hierarchy_df = spark.createDataFrame(org_hierarchy_data, ["emp_id", "title", "manager_id"])
org_hierarchy_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "CEO", 0),
    (2, "VP Engineering", 1),
    (3, "Engineering Manager", 2),
    (4, "Senior Developer", 3),
    (5, "Junior Developer", 3),
    (6, "VP Marketing", 1),
    (7, "Marketing Manager", 2),
    (8, "Marketing Specialist", 3)
]

expected_df = spark.createDataFrame(expected_data, ["emp_id", "title", "hierarchy_level"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Hierarchical data processing with iterative logic. Tests complex self-joins and level calculation.
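
The level of an employee is the number of manager hops up to the root. The plain-Python sketch below, with a hypothetical helper `hierarchy_levels`, shows that definition; it assumes a single root whose `manager_id` is `None` and no cycles. In Spark, which has no recursive DataFrame API for this, the same result is often built by repeated self-joins, one level per iteration.

```python
def hierarchy_levels(rows):
    """rows: list of (emp_id, title, manager_id).
    Returns {emp_id: level}, where the root (manager_id is None) is level 0."""
    manager_of = {emp_id: mgr for emp_id, _, mgr in rows}
    levels = {}

    def level(emp_id):
        # Memoised walk up the management chain
        if emp_id not in levels:
            mgr = manager_of[emp_id]
            levels[emp_id] = 0 if mgr is None else level(mgr) + 1
        return levels[emp_id]

    for emp_id in manager_of:
        level(emp_id)
    return levels

org = [
    (1, "CEO", None),
    (2, "VP Engineering", 1),
    (3, "Engineering Manager", 2),
    (6, "VP Marketing", 1),
    (7, "Marketing Manager", 6),
]
print(hierarchy_levels(org))
```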

## Problem 18: Advanced String Manipulation

**Requirement:** Data engineering needs to parse and standardize address data.

**Scenario:** Extract and standardize address components from unstructured text.

In [None]:
# Source DataFrame
customer_addresses_data = [
    (1, "123 MAIN ST, NEW YORK, NY 10001"),
    (2, "456 oak avenue, Los Angeles, CA 90001"),
    (3, "789 Pine Rd, Suite 100, Chicago, IL 60601"),
    (4, "321 ELM STREET BOSTON MA 02101"),
    (5, "555 Cedar Ln, Apt 2B, Miami, FL 33101")
]

customer_addresses_df = spark.createDataFrame(customer_addresses_data, ["cust_id", "full_address"])
customer_addresses_df.show(truncate=False)

In [None]:
# Expected Output
expected_data = [
    (1, "123 Main St", "New York", "NY", "10001"),
    (2, "456 Oak Avenue", "Los Angeles", "CA", "90001"),
    (3, "789 Pine Rd", "Chicago", "IL", "60601"),
    (4, "321 Elm Street", "Boston", "MA", "02101"),
    (5, "555 Cedar Ln", "Miami", "FL", "33101")
]

expected_df = spark.createDataFrame(expected_data, ["cust_id", "street", "city", "state", "zipcode"])
expected_df.show(truncate=False)

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Complex string parsing with regex and case normalization. Tests advanced string manipulation patterns.

## Problem 19: Multi-Conditional Window Functions

**Requirement:** Financial analytics needs moving averages with different conditions.

**Scenario:** Calculate different types of moving averages (simple, exponential) for stock prices.

In [None]:
# Source DataFrame
stock_ma_data = [
    ("2023-01-01", "AAPL", 150.0),
    ("2023-01-02", "AAPL", 152.0),
    ("2023-01-03", "AAPL", 151.5),
    ("2023-01-04", "AAPL", 153.0),
    ("2023-01-05", "AAPL", 154.5),
    ("2023-01-06", "AAPL", 153.5),
    ("2023-01-07", "AAPL", 155.0)
]

stock_ma_df = spark.createDataFrame(stock_ma_data, ["date", "symbol", "price"])
stock_ma_df = stock_ma_df.withColumn("date", col("date").cast("date"))
stock_ma_df.show()

In [None]:
# Expected Output
expected_data = [
    ("2023-01-01", "AAPL", 150.0, None, None),
    ("2023-01-02", "AAPL", 152.0, None, None),
    ("2023-01-03", "AAPL", 151.5, 151.17, 151.17),
    ("2023-01-04", "AAPL", 153.0, 152.17, 152.08),
    ("2023-01-05", "AAPL", 154.5, 153.0, 153.29),
    ("2023-01-06", "AAPL", 153.5, 153.67, 153.39),
    ("2023-01-07", "AAPL", 155.0, 154.33, 154.19)
]

expected_df = spark.createDataFrame(expected_data, ["date", "symbol", "price", "sma_3d", "ema_3d"])
expected_df = expected_df.withColumn("date", col("date").cast("date"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Complex window functions with multiple moving averages. Tests financial calculations and window bounds.

## Problem 20: Data Skew Handling Strategy

**Requirement:** Performance optimization for skewed customer order data.

**Scenario:** Handle data skew in customer orders by implementing a salting technique, then produce the total order amount per customer.

In [None]:
# Source DataFrame (skewed data - one customer has most orders)
skewed_orders_data = [
    (1, "C001", 100.0),
    (2, "C001", 150.0),
    (3, "C001", 200.0),
    (4, "C001", 175.0),
    (5, "C001", 125.0),
    (6, "C002", 300.0),
    (7, "C003", 250.0),
    (8, "C004", 400.0),
    (9, "C005", 350.0)
]

skewed_orders_df = spark.createDataFrame(skewed_orders_data, ["order_id", "customer_id", "amount"])
skewed_orders_df.show()

In [None]:
# Expected Output
expected_data = [
    ("C001", 750.0),
    ("C002", 300.0),
    ("C003", 250.0),
    ("C004", 400.0),
    ("C005", 350.0)
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "total_amount"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Data skew handling with a salting technique. Tests performance optimization strategies for skewed data.
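
Salting splits a hot key into several sub-keys, aggregates each sub-key separately, and then merges the partial results. The plain-Python sketch below illustrates those two stages only; `NUM_SALTS`, the seeded random generator, and the function name `salted_totals` are assumptions made for illustration. In PySpark the same idea is usually expressed by adding a salt column (for example from `rand()`) and running two `groupBy` aggregations.

```python
import random
from collections import defaultdict

orders = [
    (1, "C001", 100.0), (2, "C001", 150.0), (3, "C001", 200.0),
    (4, "C001", 175.0), (5, "C001", 125.0), (6, "C002", 300.0),
    (7, "C003", 250.0), (8, "C004", 400.0), (9, "C005", 350.0),
]

NUM_SALTS = 3  # hypothetical number of sub-keys per customer


def salted_totals(rows, num_salts=NUM_SALTS, seed=0):
    """Two-stage aggregation: by (customer, salt) first, then by customer."""
    rng = random.Random(seed)
    # Stage 1: the hot key C001 is spread over up to `num_salts` groups.
    partial = defaultdict(float)
    for _order_id, customer_id, amount in rows:
        salt = rng.randrange(num_salts)
        partial[(customer_id, salt)] += amount
    # Stage 2: drop the salt and combine the partial sums per customer.
    totals = defaultdict(float)
    for (customer_id, _salt), subtotal in partial.items():
        totals[customer_id] += subtotal
    return dict(totals)


totals = salted_totals(orders)
print(totals)
```

The final totals do not depend on how the salt is assigned; only the distribution of work in the first stage does.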

## Problem 21: Complex Filter with Multiple Joins

**Requirement:** Customer service needs to identify high-value customers with recent issues.

**Scenario:** Find customers with high lifetime value who have open support tickets.

In [None]:
# Source DataFrames
customers_high_value_data = [
    ("C001", "John Doe", 5000.0),
    ("C002", "Jane Smith", 3000.0),
    ("C003", "Bob Johnson", 7500.0),
    ("C004", "Alice Brown", 2000.0)
]

support_tickets_complex_data = [
    ("T001", "C001", "Open", "2023-01-15"),
    ("T002", "C002", "Closed", "2023-01-10"),
    ("T003", "C003", "Open", "2023-01-20"),
    ("T004", "C001", "Open", "2023-01-18")
]

customers_high_value_df = spark.createDataFrame(customers_high_value_data, ["customer_id", "customer_name", "lifetime_value"])
support_tickets_complex_df = spark.createDataFrame(support_tickets_complex_data, ["ticket_id", "customer_id", "status", "created_date"])

print("Customers:")
customers_high_value_df.show()
print("Support Tickets:")
support_tickets_complex_df.show()

In [None]:
# Expected Output
expected_data = [
    ("C001", "John Doe", 5000.0, "T001", "Open"),
    ("C001", "John Doe", 5000.0, "T004", "Open"),
    ("C003", "Bob Johnson", 7500.0, "T003", "Open")
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "customer_name", "lifetime_value", "ticket_id", "status"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Complex filtering with multiple join conditions. Tests business logic implementation with joins.
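
The join-then-filter logic can be sketched in plain Python as a check on the expected rows. The cutoff `MIN_LIFETIME_VALUE = 5000.0` is an assumption, since the problem does not state what counts as high value; for this sample data any cutoff up to 5000 produces the same three rows. The function name is hypothetical.

```python
customers = [
    ("C001", "John Doe", 5000.0),
    ("C002", "Jane Smith", 3000.0),
    ("C003", "Bob Johnson", 7500.0),
    ("C004", "Alice Brown", 2000.0),
]
tickets = [
    ("T001", "C001", "Open", "2023-01-15"),
    ("T002", "C002", "Closed", "2023-01-10"),
    ("T003", "C003", "Open", "2023-01-20"),
    ("T004", "C001", "Open", "2023-01-18"),
]

MIN_LIFETIME_VALUE = 5000.0  # assumed cutoff, not given in the problem


def high_value_open_tickets(customers, tickets, min_value=MIN_LIFETIME_VALUE):
    """Inner-join tickets to customers, keeping open tickets of high-value customers."""
    by_id = {cid: (name, ltv) for cid, name, ltv in customers}
    rows = []
    for ticket_id, cid, status, _created in tickets:
        if status != "Open" or cid not in by_id:
            continue
        name, ltv = by_id[cid]
        if ltv >= min_value:
            rows.append((cid, name, ltv, ticket_id, status))
    return sorted(rows)


result_rows = high_value_open_tickets(customers, tickets)
print(result_rows)
```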

## Problem 22: Advanced Grouping with Multiple Aggregates

**Requirement:** Sales analytics needs comprehensive product performance metrics.

**Scenario:** Calculate multiple statistics (count, sum, average, sample standard deviation) for each category and product, aggregated across regions. Round the standard deviation to two decimals.

In [None]:
# Source DataFrame
product_performance_data = [
    ("Electronics", "North", "Laptop", 50000),
    ("Electronics", "North", "Laptop", 55000),
    ("Electronics", "South", "Laptop", 45000),
    ("Electronics", "South", "Laptop", 48000),
    ("Electronics", "North", "Tablet", 30000),
    ("Electronics", "South", "Tablet", 25000),
    ("Clothing", "North", "Shirt", 20000),
    ("Clothing", "South", "Shirt", 22000)
]

product_performance_df = spark.createDataFrame(product_performance_data, ["category", "region", "product", "sales"])
product_performance_df.show()

In [None]:
# Expected Output
expected_data = [
    ("Electronics", "Laptop", 4, 198000, 49500.0, 4203.17),
    ("Electronics", "Tablet", 2, 55000, 27500.0, 3535.53),
    ("Clothing", "Shirt", 2, 42000, 21000.0, 1414.21)
]

expected_df = spark.createDataFrame(expected_data, ["category", "product", "transaction_count", "total_sales", "avg_sales", "std_sales"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Multi-level aggregation with statistical functions. Tests complex grouping and multiple aggregate functions.
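
The statistics can be verified with Python's standard library. This is a plain-Python cross-check rather than the PySpark solution; it uses `statistics.stdev`, which computes the sample standard deviation, the same definition as Spark's `stddev` (an alias of `stddev_samp`). For the four Laptop sales the sample standard deviation is about 4203.17. The function name `summarize` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

product_rows = [
    ("Electronics", "North", "Laptop", 50000),
    ("Electronics", "North", "Laptop", 55000),
    ("Electronics", "South", "Laptop", 45000),
    ("Electronics", "South", "Laptop", 48000),
    ("Electronics", "North", "Tablet", 30000),
    ("Electronics", "South", "Tablet", 25000),
    ("Clothing", "North", "Shirt", 20000),
    ("Clothing", "South", "Shirt", 22000),
]


def summarize(rows):
    """Group by (category, product) and return count, sum, mean, sample stddev."""
    grouped = defaultdict(list)
    for category, _region, product, sales in rows:
        grouped[(category, product)].append(sales)
    return {
        key: (len(vals), sum(vals), float(mean(vals)), round(stdev(vals), 2))
        for key, vals in grouped.items()
    }


summary = summarize(product_rows)
for key, stats in summary.items():
    print(key, stats)
```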

## Problem 23: Data Enrichment with External Reference

**Requirement:** Marketing needs customer data enriched with geographic information.

**Scenario:** Join customer data with postal code reference table to add city/state information.

In [None]:
# Source DataFrames
customers_geo_data = [
    ("C001", "John Doe", "10001"),
    ("C002", "Jane Smith", "90001"),
    ("C003", "Bob Johnson", "60601"),
    ("C004", "Alice Brown", "02101")
]

postal_codes_data = [
    ("10001", "New York", "NY"),
    ("90001", "Los Angeles", "CA"),
    ("60601", "Chicago", "IL"),
    ("02101", "Boston", "MA"),
    ("33101", "Miami", "FL")
]

customers_geo_df = spark.createDataFrame(customers_geo_data, ["customer_id", "customer_name", "postal_code"])
postal_codes_df = spark.createDataFrame(postal_codes_data, ["postal_code", "city", "state"])

print("Customers:")
customers_geo_df.show()
print("Postal Codes:")
postal_codes_df.show()

In [None]:
# Expected Output
expected_data = [
    ("C001", "John Doe", "10001", "New York", "NY"),
    ("C002", "Jane Smith", "90001", "Los Angeles", "CA"),
    ("C003", "Bob Johnson", "60601", "Chicago", "IL"),
    ("C004", "Alice Brown", "02101", "Boston", "MA")
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "customer_name", "postal_code", "city", "state"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Data enrichment with reference table join. Tests lookup operations and data augmentation.
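
A reference-table join behaves like a dictionary lookup keyed on the postal code. The sketch below is plain Python with left-join semantics (a customer with an unknown postal code would get `None` for city and state); the function name `enrich` is hypothetical. Postal codes are kept as strings so that `"02101"` retains its leading zero.

```python
customers_geo = [
    ("C001", "John Doe", "10001"),
    ("C002", "Jane Smith", "90001"),
    ("C003", "Bob Johnson", "60601"),
    ("C004", "Alice Brown", "02101"),
]
postal_codes = [
    ("10001", "New York", "NY"),
    ("90001", "Los Angeles", "CA"),
    ("60601", "Chicago", "IL"),
    ("02101", "Boston", "MA"),
    ("33101", "Miami", "FL"),
]


def enrich(customers, reference):
    """Left-join customers to the postal reference table on postal_code."""
    lookup = {code: (city, state) for code, city, state in reference}
    return [
        (cid, name, code, *lookup.get(code, (None, None)))
        for cid, name, code in customers
    ]


enriched = enrich(customers_geo, postal_codes)
print(enriched)
```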

## Problem 24: Conditional Window Functions

**Requirement:** Analytics needs to calculate conditional running totals.

**Scenario:** Calculate a running total of sales in date order, resetting it whenever the category changes from the previous row.

In [None]:
# Source DataFrame
category_sales_data = [
    ("2023-01-01", "Electronics", 1000.0),
    ("2023-01-02", "Electronics", 1500.0),
    ("2023-01-03", "Clothing", 800.0),
    ("2023-01-04", "Clothing", 1200.0),
    ("2023-01-05", "Electronics", 2000.0),
    ("2023-01-06", "Electronics", 1800.0)
]

category_sales_df = spark.createDataFrame(category_sales_data, ["date", "category", "sales"])
category_sales_df = category_sales_df.withColumn("date", col("date").cast("date"))
category_sales_df.show()

In [None]:
# Expected Output
expected_data = [
    ("2023-01-01", "Electronics", 1000.0, 1000.0),
    ("2023-01-02", "Electronics", 1500.0, 2500.0),
    ("2023-01-03", "Clothing", 800.0, 800.0),
    ("2023-01-04", "Clothing", 1200.0, 2000.0),
    ("2023-01-05", "Electronics", 2000.0, 2000.0),
    ("2023-01-06", "Electronics", 1800.0, 3800.0)
]

expected_df = spark.createDataFrame(expected_data, ["date", "category", "sales", "category_running_total"])
expected_df = expected_df.withColumn("date", col("date").cast("date"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Conditional window functions with partition reset. Tests complex window specifications and ordering.
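
This is a "gaps and islands" pattern: the total restarts for every consecutive run of the same category, so partitioning by category alone would carry the earlier Electronics total forward (4500.0 on 2023-01-05 instead of 2000.0). The plain-Python sketch below shows the run-based logic with `itertools.groupby`; it is an illustration, not the PySpark solution, and the function name is hypothetical. In PySpark a common approach is to flag category changes with `lag`, turn the flags into a run id with a cumulative `sum`, and then sum sales within each run.

```python
from itertools import accumulate, groupby

category_sales = [
    ("2023-01-01", "Electronics", 1000.0),
    ("2023-01-02", "Electronics", 1500.0),
    ("2023-01-03", "Clothing", 800.0),
    ("2023-01-04", "Clothing", 1200.0),
    ("2023-01-05", "Electronics", 2000.0),
    ("2023-01-06", "Electronics", 1800.0),
]


def running_totals_with_reset(rows):
    """Running total per consecutive run of the same category, in date order."""
    ordered = sorted(rows, key=lambda r: r[0])  # ISO date strings sort chronologically
    out = []
    # groupby only merges adjacent rows, so each group is one consecutive run.
    for _category, run in groupby(ordered, key=lambda r: r[1]):
        run = list(run)
        for row, total in zip(run, accumulate(r[2] for r in run)):
            out.append((*row, total))
    return out


running = running_totals_with_reset(category_sales)
print([row[3] for row in running])
```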

## Problem 25: Multi-Column Deduplication

**Requirement:** Data quality needs advanced duplicate detection with fuzzy matching.

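**Scenario:** Identify potential duplicates based on name similarity and other attributes, and assign each set of likely duplicates a shared `duplicate_group` number. In the sample data, each duplicate pair shares a phone number.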

In [None]:
# Source DataFrame
fuzzy_duplicates_data = [
    (1, "John Doe", "john@email.com", "123-456-7890"),
    (2, "Jon Doe", "john.doe@email.com", "123-456-7890"),
    (3, "Jane Smith", "jane@email.com", "987-654-3210"),
    (4, "Jane Smithe", "jane.smith@email.com", "987-654-3210"),
    (5, "Bob Johnson", "bob@email.com", "111-222-3333"),
    (6, "Robert Johnson", "bob.johnson@email.com", "111-222-3333")
]

fuzzy_duplicates_df = spark.createDataFrame(fuzzy_duplicates_data, ["cust_id", "name", "email", "phone"])
fuzzy_duplicates_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "John Doe", "john@email.com", "123-456-7890", 1),
    (2, "Jon Doe", "john.doe@email.com", "123-456-7890", 1),
    (3, "Jane Smith", "jane@email.com", "987-654-3210", 2),
    (4, "Jane Smithe", "jane.smith@email.com", "987-654-3210", 2),
    (5, "Bob Johnson", "bob@email.com", "111-222-3333", 3),
    (6, "Robert Johnson", "bob.johnson@email.com", "111-222-3333", 3)
]

expected_df = spark.createDataFrame(expected_data, ["cust_id", "name", "email", "phone", "duplicate_group"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Advanced deduplication with grouping logic. Tests complex duplicate identification strategies.
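
One simple grouping rule that reproduces the expected output is to treat records with the same phone number as one group and number the groups in order of first appearance. That rule is an assumption; the problem does not specify the matching logic, and the function name below is hypothetical. A plain-Python sketch:

```python
fuzzy_rows = [
    (1, "John Doe", "john@email.com", "123-456-7890"),
    (2, "Jon Doe", "john.doe@email.com", "123-456-7890"),
    (3, "Jane Smith", "jane@email.com", "987-654-3210"),
    (4, "Jane Smithe", "jane.smith@email.com", "987-654-3210"),
    (5, "Bob Johnson", "bob@email.com", "111-222-3333"),
    (6, "Robert Johnson", "bob.johnson@email.com", "111-222-3333"),
]


def assign_duplicate_groups(rows):
    """Assumed rule: same phone means same group, numbered by first appearance."""
    group_ids = {}
    out = []
    for cust_id, name, email, phone in sorted(rows):  # order by cust_id
        # setdefault keeps the existing id or registers the next one.
        group = group_ids.setdefault(phone, len(group_ids) + 1)
        out.append((cust_id, name, email, phone, group))
    return out


grouped_rows = assign_duplicate_groups(fuzzy_rows)
print([(row[0], row[4]) for row in grouped_rows])
```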

## Problem 26: Complex Data Type Transformations

**Requirement:** Data engineering needs to transform nested JSON structures.

**Scenario:** Convert each user's array of key/value structs into a map, then flatten it into one column per preference key (`theme`, `language`, `notifications`).

In [None]:
# Source DataFrame
user_preferences_data = [
    ("U001", [("theme", "dark"), ("language", "en"), ("notifications", "on")]),
    ("U002", [("theme", "light"), ("language", "es"), ("notifications", "off")]),
    ("U003", [("theme", "dark"), ("language", "fr"), ("notifications", "on")])
]

user_preferences_schema = StructType([
    StructField("user_id", StringType(), True),
    StructField("preferences", ArrayType(StructType([
        StructField("key", StringType(), True),
        StructField("value", StringType(), True)
    ])), True)
])

user_preferences_df = spark.createDataFrame(user_preferences_data, user_preferences_schema)
user_preferences_df.show(truncate=False)
user_preferences_df.printSchema()

In [None]:
# Expected Output
expected_data = [
    ("U001", "dark", "en", "on"),
    ("U002", "light", "es", "off"),
    ("U003", "dark", "fr", "on")
]

expected_df = spark.createDataFrame(expected_data, ["user_id", "theme", "language", "notifications"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Complex data type transformations. Tests array and struct manipulation for data reshaping.
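
The reshaping amounts to turning each list of key/value pairs into a mapping and then reading fixed keys from it. The plain-Python sketch below shows that shape change; the key list `PREFERENCE_KEYS` and the function name are assumptions taken from the expected column names. In PySpark, `map_from_entries` converts an array of two-field structs into a map column, after which individual keys can be selected.

```python
user_preferences = [
    ("U001", [("theme", "dark"), ("language", "en"), ("notifications", "on")]),
    ("U002", [("theme", "light"), ("language", "es"), ("notifications", "off")]),
    ("U003", [("theme", "dark"), ("language", "fr"), ("notifications", "on")]),
]

PREFERENCE_KEYS = ["theme", "language", "notifications"]  # assumed output columns


def flatten_preferences(rows, keys=PREFERENCE_KEYS):
    """Array of (key, value) pairs -> dict -> one value per requested key."""
    flattened = []
    for user_id, prefs in rows:
        pref_map = dict(prefs)
        # A missing key yields None, like a missing map key in Spark.
        flattened.append((user_id, *(pref_map.get(key) for key in keys)))
    return flattened


flat_rows = flatten_preferences(user_preferences)
print(flat_rows)
```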

## Problem 27: Advanced Partitioning Strategy

**Requirement:** Performance optimization for large-scale time-series data.

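**Scenario:** Implement a partitioning strategy for efficient querying of time-series data: derive a `date` column from the timestamp and compute each sensor's daily minimum and maximum value.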

In [None]:
# Source DataFrame
time_series_large_data = [
    ("2023-01-01 10:00:00", "Sensor_A", 25.5),
    ("2023-01-01 10:00:00", "Sensor_B", 30.2),
    ("2023-01-01 11:00:00", "Sensor_A", 26.1),
    ("2023-01-01 11:00:00", "Sensor_B", 31.0),
    ("2023-01-02 10:00:00", "Sensor_A", 24.8),
    ("2023-01-02 10:00:00", "Sensor_B", 29.5),
    ("2023-01-02 11:00:00", "Sensor_A", 25.3),
    ("2023-01-02 11:00:00", "Sensor_B", 30.1)
]

time_series_large_df = spark.createDataFrame(time_series_large_data, ["timestamp", "sensor_id", "value"])
time_series_large_df = time_series_large_df.withColumn("timestamp", col("timestamp").cast("timestamp"))
time_series_large_df.show()

In [None]:
# Expected Output
expected_data = [
    ("2023-01-01", "Sensor_A", 25.5, 26.1),
    ("2023-01-01", "Sensor_B", 30.2, 31.0),
    ("2023-01-02", "Sensor_A", 24.8, 25.3),
    ("2023-01-02", "Sensor_B", 29.5, 30.1)
]

expected_df = spark.createDataFrame(expected_data, ["date", "sensor_id", "min_value", "max_value"])
expected_df = expected_df.withColumn("date", col("date").cast("date"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Partitioning strategy for performance. Tests date extraction and efficient aggregation patterns.
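
The date extraction and per-day aggregation can be cross-checked in plain Python. This sketch is not the PySpark solution and the function name is hypothetical; it simply truncates each timestamp to its calendar date and takes the minimum and maximum per (date, sensor) pair.

```python
from collections import defaultdict
from datetime import datetime

readings = [
    ("2023-01-01 10:00:00", "Sensor_A", 25.5),
    ("2023-01-01 10:00:00", "Sensor_B", 30.2),
    ("2023-01-01 11:00:00", "Sensor_A", 26.1),
    ("2023-01-01 11:00:00", "Sensor_B", 31.0),
    ("2023-01-02 10:00:00", "Sensor_A", 24.8),
    ("2023-01-02 10:00:00", "Sensor_B", 29.5),
    ("2023-01-02 11:00:00", "Sensor_A", 25.3),
    ("2023-01-02 11:00:00", "Sensor_B", 30.1),
]


def daily_min_max(rows):
    """Group readings by (calendar date, sensor) and return min and max values."""
    grouped = defaultdict(list)
    for ts, sensor, value in rows:
        day = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date()
        grouped[(day, sensor)].append(value)
    return sorted(
        (day.isoformat(), sensor, min(vals), max(vals))
        for (day, sensor), vals in grouped.items()
    )


daily_stats = daily_min_max(readings)
print(daily_stats)
```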

## Problem 28: Complex Business Logic Implementation

**Requirement:** Finance needs commission calculation with tiered rates.

**Scenario:** Calculate sales commissions with different rates based on sales tiers.

In [None]:
# Source DataFrame
sales_commissions_data = [
    ("S001", "John", 5000.0),
    ("S002", "Jane", 15000.0),
    ("S003", "Bob", 8000.0),
    ("S004", "Alice", 25000.0),
    ("S005", "Charlie", 12000.0)
]

sales_commissions_df = spark.createDataFrame(sales_commissions_data, ["sales_id", "salesperson", "sales_amount"])
sales_commissions_df.show()

In [None]:
# Expected Output
expected_data = [
    ("S001", "John", 5000.0, 250.0),
    ("S002", "Jane", 15000.0, 1050.0),
    ("S003", "Bob", 8000.0, 480.0),
    ("S004", "Alice", 25000.0, 2150.0),
    ("S005", "Charlie", 12000.0, 780.0)
]

expected_df = spark.createDataFrame(expected_data, ["sales_id", "salesperson", "sales_amount", "commission"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Complex business logic with tiered calculations. Tests conditional logic and mathematical operations.
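
The problem does not state the tier boundaries or rates, so the sketch below does not implement a tier table. It only derives the effective commission rate implied by each expected row, which can help when working backwards to a rule. The variable names are hypothetical.

```python
# (sales_id, sales_amount, commission) taken from the expected output.
expected_commissions = [
    ("S001", 5000.0, 250.0),
    ("S002", 15000.0, 1050.0),
    ("S003", 8000.0, 480.0),
    ("S004", 25000.0, 2150.0),
    ("S005", 12000.0, 780.0),
]

# Effective rate in percent = commission / sales_amount * 100.
effective_rates = {
    sales_id: round(commission / amount * 100, 2)
    for sales_id, amount, commission in expected_commissions
}

# Print in ascending order of sales amount to make the progression visible.
for sales_id, amount, _commission in sorted(expected_commissions, key=lambda r: r[1]):
    print(sales_id, amount, effective_rates[sales_id])
```

The effective rate rises with the sales amount (5%, 6%, 6.5%, 7%, 8.6%), which is consistent with some form of tiering but does not by itself pin down the tiers.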

## Problem 29: Multi-Step Data Transformation Pipeline

**Requirement:** ETL pipeline needs complex multi-step data transformation.

**Scenario:** Implement a multi-step transformation: clean the raw string fields, cast `sale_date` and `amount` to proper types, and aggregate total sales per category.

In [None]:
# Source DataFrame
raw_sales_data = [
    ("  john  ", "Electronics", "2023-01-15", "1000.50"),
    ("Jane", "Clothing", "2023-01-16", "800.75"),
    ("bob", "Electronics", "2023-01-17", "1200.25"),
    ("Alice", "Clothing", "2023-01-18", "950.00")
]

raw_sales_df = spark.createDataFrame(raw_sales_data, ["salesperson", "category", "sale_date", "amount"])
raw_sales_df.show()

In [None]:
# Expected Output
expected_data = [
    ("Electronics", 2200.75),
    ("Clothing", 1750.75)
]

expected_df = spark.createDataFrame(expected_data, ["category", "total_sales"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Multi-step transformation pipeline. Tests data cleaning, type conversion, and aggregation in sequence.
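
The pipeline steps can be sketched in plain Python to confirm the expected totals. This is an illustration under assumptions, not the PySpark solution: name cleaning is taken to mean trimming whitespace and title-casing, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import date

raw_sales = [
    ("  john  ", "Electronics", "2023-01-15", "1000.50"),
    ("Jane", "Clothing", "2023-01-16", "800.75"),
    ("bob", "Electronics", "2023-01-17", "1200.25"),
    ("Alice", "Clothing", "2023-01-18", "950.00"),
]


def clean(rows):
    """Step 1 and 2: trim and title-case names, parse dates, cast amounts to float."""
    return [
        (name.strip().title(), category, date.fromisoformat(sale_date), float(amount))
        for name, category, sale_date, amount in rows
    ]


def total_by_category(cleaned_rows):
    """Step 3: sum the numeric amount per category."""
    totals = defaultdict(float)
    for _name, category, _sale_date, amount in cleaned_rows:
        totals[category] += amount
    return dict(totals)


cleaned = clean(raw_sales)
category_totals = total_by_category(cleaned)
print(category_totals)
```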

## Problem 30: Complex Join with Aggregation

**Requirement:** Business intelligence needs customer behavior analysis with purchase patterns.

**Scenario:** Join customer data with orders and calculate per-customer metrics: order count, total spent, average order value, and days from signup to first order.

In [None]:
# Source DataFrames
customers_behavior_data = [
    ("C001", "John", "2023-01-01"),
    ("C002", "Jane", "2023-01-05"),
    ("C003", "Bob", "2023-01-10")
]

orders_behavior_data = [
    ("O001", "C001", "2023-01-15", 100.0),
    ("O002", "C001", "2023-01-20", 150.0),
    ("O003", "C001", "2023-02-01", 200.0),
    ("O004", "C002", "2023-01-25", 300.0),
    ("O005", "C003", "2023-02-05", 250.0)
]

customers_behavior_df = spark.createDataFrame(customers_behavior_data, ["customer_id", "customer_name", "signup_date"])
orders_behavior_df = spark.createDataFrame(orders_behavior_data, ["order_id", "customer_id", "order_date", "amount"])

print("Customers:")
customers_behavior_df.show()
print("Orders:")
orders_behavior_df.show()

In [None]:
# Expected Output
expected_data = [
    ("C001", "John", 3, 450.0, 150.0, 14.0),
    ("C002", "Jane", 1, 300.0, 300.0, 20.0),
    ("C003", "Bob", 1, 250.0, 250.0, 26.0)
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "customer_name", "order_count", "total_spent", "avg_order_value", "days_to_first_order"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Complex join with multiple aggregations and date calculations. Tests comprehensive data analysis patterns.
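
The per-customer metrics can be cross-checked with the standard library. The sketch below is plain Python rather than the PySpark solution, with a hypothetical function name and inner-join semantics (customers without orders are skipped). For C001, the gap from signup on 2023-01-01 to the first order on 2023-01-15 is 14 days.

```python
from datetime import date

customers_behavior = [
    ("C001", "John", "2023-01-01"),
    ("C002", "Jane", "2023-01-05"),
    ("C003", "Bob", "2023-01-10"),
]
orders_behavior = [
    ("O001", "C001", "2023-01-15", 100.0),
    ("O002", "C001", "2023-01-20", 150.0),
    ("O003", "C001", "2023-02-01", 200.0),
    ("O004", "C002", "2023-01-25", 300.0),
    ("O005", "C003", "2023-02-05", 250.0),
]


def customer_metrics(customers, orders):
    """Order count, total, average, and days from signup to first order."""
    rows = []
    for customer_id, name, signup in customers:
        own = [
            (date.fromisoformat(order_date), amount)
            for _order_id, cid, order_date, amount in orders
            if cid == customer_id
        ]
        if not own:
            continue  # inner-join semantics: no orders, no output row
        amounts = [amount for _d, amount in own]
        first_order = min(d for d, _amount in own)
        days = (first_order - date.fromisoformat(signup)).days
        rows.append(
            (customer_id, name, len(own), sum(amounts), sum(amounts) / len(amounts), days)
        )
    return rows


metrics = customer_metrics(customers_behavior, orders_behavior)
print(metrics)
```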

# Set 2 Complete!

You've completed all 30 Easy/Medium problems in Set 2. These problems cover:
- Advanced joins and deduplication
- Complex window functions
- Multi-level aggregations
- Advanced UDFs and data validation
- Nested data operations
- Performance optimization strategies
- Complex business logic implementation

Ready for Set 3, with Medium-difficulty problems?