# PySpark Interview Preparation - Set 1 (Easy)

## Overview & Instructions

### How to run this notebook in Google Colab:
1. Upload this .ipynb file to Google Colab
2. Run the installation cells below
3. Execute each problem cell sequentially

### Installation Commands:
The following cell installs Java and PySpark:

In [None]:
# Install Java and PySpark (Java 8 works with PySpark 3.x; PySpark 4.x needs Java 17 or newer)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!pip install -q pyspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

### SparkSession Initialization:

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

spark = SparkSession.builder\
    .appName("PySparkInterviewSet1")\
    .config("spark.sql.adaptive.enabled", "true")\
    .getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")

### DataFrame Assertion Function:

This function compares DataFrames ignoring order and with floating-point tolerance:

In [None]:
def assert_dataframe_equal(df_actual, df_expected, epsilon=1e-6):
    """Compare two DataFrames ignoring order and with floating-point tolerance"""
    
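    # Check column names and types first. Nullability is ignored because
    # aggregates and literals often yield non-nullable columns.
    actual_fields = [(f.name, f.dataType) for f in df_actual.schema.fields]
    expected_fields = [(f.name, f.dataType) for f in df_expected.schema.fields]
    if actual_fields != expected_fields:
        print("Schema mismatch!")
        print("Actual schema:", df_actual.schema)
        print("Expected schema:", df_expected.schema)
        raise AssertionError("Schema mismatch")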
    
    # Collect data
    actual_data = df_actual.collect()
    expected_data = df_expected.collect()
    
    if len(actual_data) != len(expected_data):
        print(f"Row count mismatch! Actual: {len(actual_data)}, Expected: {len(expected_data)}")
        raise AssertionError("Row count mismatch")
    
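    # Convert each row to a hashable tuple for order-insensitive comparison
    def row_to_comparable(row):
        # The star import from pyspark.sql.functions shadows Python's round(),
        # so call the builtin explicitly
        import builtins
        values = []
        for field in row:
            if isinstance(field, float):
                # Snap floats to a grid of width epsilon so near-equal values match
                values.append(builtins.round(field / epsilon) * epsilon)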
            elif isinstance(field, list):
                # Handle arrays
                values.append(tuple(sorted(field)) if field else tuple())
            elif isinstance(field, dict):
                # Handle maps (MapType values are collected as dicts; structs arrive as Row objects)
                values.append(tuple(sorted(field.items())))
            else:
                values.append(field)
        return tuple(values)
    
    # Compare as multisets so duplicate rows are counted rather than collapsed
    from collections import Counter
    actual_counts = Counter(row_to_comparable(row) for row in actual_data)
    expected_counts = Counter(row_to_comparable(row) for row in expected_data)
    
    if actual_counts != expected_counts:
        print("Data mismatch!")
        print("Actual data:", dict(actual_counts))
        print("Expected data:", dict(expected_counts))
        raise AssertionError("Data content mismatch")
    
    print("✓ DataFrames are equal!")
    return True
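
To see what the `epsilon` tolerance does and does not absorb, here is a small plain-Python sketch (no Spark needed) of the bucketing idea used in `row_to_comparable`. The helper name `bucket` is hypothetical and exists only for this illustration.

```python
# Plain-Python sketch of the float bucketing idea in row_to_comparable.
# `bucket` is a hypothetical helper name used only for this illustration.
def bucket(value, epsilon=1e-6):
    # Index of the nearest multiple of epsilon
    return round(value / epsilon)

# A difference far smaller than epsilon lands in the same bucket
same = bucket(108.33) == bucket(108.33 + 1e-9)

# An unrounded result does not match an expected value given to 2 decimals
unrounded = 325.0 / 3  # 108.3333...
differs = bucket(unrounded) != bucket(108.33)

print(same, differs)
```

In practice this suggests that when an expected value is written to two decimals, the solution needs an explicit rounding step before the comparison.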

## Table of Contents - Set 1 (Easy)

**Difficulty Distribution:** 30 Easy Problems

**Topics Covered:**
- Basic Filtering & Selection (6 problems)
- Simple Aggregations (6 problems) 
- Basic Joins (6 problems)
- Window Functions (4 problems)
- String & Date Operations (4 problems)
- UDFs (4 problems)

## Problem 1: Active Customer Filter

**Requirement:** The marketing team needs a list of all active customers for a promotional campaign.

**Scenario:** Filter customers where status is 'Active' from the customer database.

In [None]:
# Source DataFrame
customer_data = [
    (1, "John Doe", "Active", "2023-01-15"),
    (2, "Jane Smith", "Inactive", "2023-02-20"),
    (3, "Bob Johnson", "Active", "2023-03-10"),
    (4, "Alice Brown", "Active", "2023-01-05"),
    (5, "Charlie Wilson", "Inactive", "2023-04-01")
]

customer_df = spark.createDataFrame(customer_data, ["customer_id", "customer_name", "status", "join_date"])
customer_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "John Doe", "Active", "2023-01-15"),
    (3, "Bob Johnson", "Active", "2023-03-10"),
    (4, "Alice Brown", "Active", "2023-01-05")
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "customer_name", "status", "join_date"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Basic filtering operation. Tests understanding of filter() or where() methods.

## Problem 2: Total Sales by Product

**Requirement:** Sales department wants total revenue by product for quarterly reporting.

**Scenario:** Calculate the sum of sales amount grouped by product_id and product_name, returned as total_sales.

In [None]:
# Source DataFrame
sales_data = [
    (1, "A", 100.0),
    (1, "A", 150.0),
    (2, "B", 200.0),
    (1, "A", 75.0),
    (3, "C", 300.0),
    (2, "B", 250.0),
    (3, "C", 100.0)
]

sales_df = spark.createDataFrame(sales_data, ["product_id", "product_name", "amount"])
sales_df.show()

In [None]:
# Expected Output
expected_data = [
    (3, "C", 400.0),
    (1, "A", 325.0),
    (2, "B", 450.0)
]

expected_df = spark.createDataFrame(expected_data, ["product_id", "product_name", "total_sales"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Basic groupBy and aggregation. Tests sum() function and alias usage.

## Problem 3: Employee Department Join

**Requirement:** HR needs a combined view of employees with their department names.

**Scenario:** Join the employees DataFrame with the departments DataFrame on dept_id.

In [None]:
# Source DataFrames
employees_data = [
    (1, "John", 101),
    (2, "Jane", 102),
    (3, "Bob", 101),
    (4, "Alice", 103),
    (5, "Charlie", 102)
]

departments_data = [
    (101, "Engineering"),
    (102, "Marketing"),
    (103, "Sales")
]

employees_df = spark.createDataFrame(employees_data, ["emp_id", "emp_name", "dept_id"])
departments_df = spark.createDataFrame(departments_data, ["dept_id", "dept_name"])

print("Employees:")
employees_df.show()
print("Departments:")
departments_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "John", 101, "Engineering"),
    (2, "Jane", 102, "Marketing"),
    (3, "Bob", 101, "Engineering"),
    (4, "Alice", 103, "Sales"),
    (5, "Charlie", 102, "Marketing")
]

expected_df = spark.createDataFrame(expected_data, ["emp_id", "emp_name", "dept_id", "dept_name"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Basic inner join operation. Tests join syntax and column handling.

## Problem 4: Top N Products by Sales

**Requirement:** Business stakeholders want to identify top 3 best-selling products.

**Scenario:** Use window functions to rank products by total sales and select top 3.

In [None]:
# Source DataFrame
product_sales_data = [
    ("P001", "Laptop", 50000.0),
    ("P002", "Mouse", 15000.0),
    ("P003", "Keyboard", 25000.0),
    ("P004", "Monitor", 45000.0),
    ("P005", "Headphones", 18000.0),
    ("P006", "Tablet", 35000.0)
]

product_sales_df = spark.createDataFrame(product_sales_data, ["product_id", "product_name", "total_sales"])
product_sales_df.show()

In [None]:
# Expected Output
expected_data = [
    ("P001", "Laptop", 50000.0, 1),
    ("P004", "Monitor", 45000.0, 2),
    ("P006", "Tablet", 35000.0, 3)
]

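expected_df = spark.createDataFrame(expected_data, ["product_id", "product_name", "total_sales", "rank"])
# rank() returns an integer column, so cast the inferred long to int
expected_df = expected_df.withColumn("rank", col("rank").cast("int"))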
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Basic window function with ranking. Tests Window specification and rank() function.
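
As a way to sanity-check the expected output without Spark, the ranking can be reproduced in plain Python. This is a sketch of the logic only, not the PySpark solution, and it assumes there are no ties in `total_sales` (with ties, `rank()` and a simple position counter would differ).

```python
# (product_id, product_name, total_sales)
products = [
    ("P001", "Laptop", 50000.0),
    ("P002", "Mouse", 15000.0),
    ("P003", "Keyboard", 25000.0),
    ("P004", "Monitor", 45000.0),
    ("P005", "Headphones", 18000.0),
    ("P006", "Tablet", 35000.0),
]

# Sort by total_sales descending and number positions from 1.
# With no ties, the position equals what a rank over this ordering gives.
ordered = sorted(products, key=lambda p: p[2], reverse=True)
top3 = [
    (pid, name, total, position)
    for position, (pid, name, total) in enumerate(ordered[:3], start=1)
]

print(top3)
```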

## Problem 5: Customer Email Domain Extraction

**Requirement:** Marketing team wants to analyze customer distribution by email domain.

**Scenario:** Extract domain from email addresses and count customers by domain.

In [None]:
# Source DataFrame
customers_data = [
    (1, "john@gmail.com"),
    (2, "jane@yahoo.com"),
    (3, "bob@gmail.com"),
    (4, "alice@company.com"),
    (5, "charlie@gmail.com"),
    (6, "diana@yahoo.com")
]

customers_df = spark.createDataFrame(customers_data, ["customer_id", "email"])
customers_df.show()

In [None]:
# Expected Output
expected_data = [
    ("gmail.com", 3),
    ("yahoo.com", 2),
    ("company.com", 1)
]

expected_df = spark.createDataFrame(expected_data, ["domain", "customer_count"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** String manipulation with split function. Tests string operations and array element access.
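
The split-then-count logic can be checked in plain Python before writing the Spark version. The sketch below mirrors the idea (take the element after `@`, then count per value) and is not the PySpark solution itself.

```python
from collections import Counter

emails = [
    "john@gmail.com", "jane@yahoo.com", "bob@gmail.com",
    "alice@company.com", "charlie@gmail.com", "diana@yahoo.com",
]

# The element at index 1 after splitting on "@" is the domain
domain_counts = Counter(email.split("@")[1] for email in emails)

print(dict(domain_counts))
```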

## Problem 6: Age Category UDF

**Requirement:** Analytics team needs to categorize customers by age groups for segmentation.

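**Scenario:** Create a UDF to categorize ages and apply it to customer data. The cutoffs are not specified; one set consistent with the expected output is Teen (under 18), Adult (18 to 59), and Senior (60 and over).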

In [None]:
# Source DataFrame
customer_ages_data = [
    (1, "John", 25),
    (2, "Jane", 35),
    (3, "Bob", 17),
    (4, "Alice", 45),
    (5, "Charlie", 60),
    (6, "Diana", 15)
]

customer_ages_df = spark.createDataFrame(customer_ages_data, ["customer_id", "name", "age"])
customer_ages_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "John", 25, "Adult"),
    (2, "Jane", 35, "Adult"),
    (3, "Bob", 17, "Teen"),
    (4, "Alice", 45, "Adult"),
    (5, "Charlie", 60, "Senior"),
    (6, "Diana", 15, "Teen")
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "name", "age", "age_category"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Basic UDF creation and application. Tests UDF registration and usage with column operations.

## Problem 7: Monthly Sales Growth

**Requirement:** Finance team needs month-over-month sales growth percentage.

**Scenario:** Calculate percentage growth compared to the previous month using window functions, rounded to 2 decimal places. The first month has no previous month, so its growth is null.

In [None]:
# Source DataFrame
monthly_sales_data = [
    ("2023-01", 100000.0),
    ("2023-02", 120000.0),
    ("2023-03", 110000.0),
    ("2023-04", 130000.0),
    ("2023-05", 150000.0)
]

monthly_sales_df = spark.createDataFrame(monthly_sales_data, ["month", "sales"])
monthly_sales_df.show()

In [None]:
# Expected Output
expected_data = [
    ("2023-01", 100000.0, None),
    ("2023-02", 120000.0, 20.0),
    ("2023-03", 110000.0, -8.33),
    ("2023-04", 130000.0, 18.18),
    ("2023-05", 150000.0, 15.38)
]

expected_df = spark.createDataFrame(expected_data, ["month", "sales", "growth_pct"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Window function with lag for time-series analysis. Tests lag() and percentage calculations.
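
The growth figures in the expected output follow from `(current - previous) / previous * 100`, rounded to two decimals. A plain-Python sketch of that arithmetic, with a running `previous` variable standing in for the lagged value:

```python
sales = [
    ("2023-01", 100000.0), ("2023-02", 120000.0), ("2023-03", 110000.0),
    ("2023-04", 130000.0), ("2023-05", 150000.0),
]

growth = []
previous = None  # stands in for the previous month's sales
for month, amount in sales:
    if previous is None:
        pct = None  # first month has nothing to compare against
    else:
        pct = round((amount - previous) / previous * 100, 2)
    growth.append((month, amount, pct))
    previous = amount

for row in growth:
    print(row)
```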

## Problem 8: Duplicate Order Detection

**Requirement:** Operations team needs to identify duplicate orders for fraud detection.

**Scenario:** Find orders that share the same customer_id and product_id and were placed within 1 hour of each other (compare order_time). Return every order in such a group.

In [None]:
# Source DataFrame
orders_data = [
    (1, 101, "2023-01-01 10:00:00", 2, 100.0),
    (2, 101, "2023-01-01 10:30:00", 1, 50.0),
    (3, 102, "2023-01-01 11:00:00", 1, 75.0),
    (4, 101, "2023-01-01 10:45:00", 1, 50.0),  # Duplicate
    (5, 103, "2023-01-01 12:00:00", 3, 200.0),
    (6, 102, "2023-01-01 13:00:00", 2, 150.0)
]

orders_df = spark.createDataFrame(orders_data, ["order_id", "customer_id", "order_time", "product_id", "amount"])
orders_df = orders_df.withColumn("order_time", col("order_time").cast("timestamp"))
orders_df.show()

In [None]:
# Expected Output
expected_data = [
    (2, 101, "2023-01-01 10:30:00", 1, 50.0),
    (4, 101, "2023-01-01 10:45:00", 1, 50.0)
]

expected_df = spark.createDataFrame(expected_data, ["order_id", "customer_id", "order_time", "product_id", "amount"])
expected_df = expected_df.withColumn("order_time", col("order_time").cast("timestamp"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Window functions with time-based duplicate detection. Tests timestamp operations and conditional filtering.
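
One way to confirm which orders should be flagged is a brute-force pairwise check in plain Python. This sketch assumes "duplicate" means same customer and product with timestamps at most one hour apart; it is meant for checking the expected output, not as the window-function solution.

```python
from datetime import datetime, timedelta
from itertools import combinations

# (order_id, customer_id, order_time, product_id)
orders = [
    (1, 101, "2023-01-01 10:00:00", 2),
    (2, 101, "2023-01-01 10:30:00", 1),
    (3, 102, "2023-01-01 11:00:00", 1),
    (4, 101, "2023-01-01 10:45:00", 1),
    (5, 103, "2023-01-01 12:00:00", 3),
    (6, 102, "2023-01-01 13:00:00", 2),
]

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

duplicate_ids = set()
for a, b in combinations(orders, 2):
    same_key = (a[1], a[3]) == (b[1], b[3])  # same customer and product
    close = abs(parse(a[2]) - parse(b[2])) <= timedelta(hours=1)
    if same_key and close:
        duplicate_ids.update({a[0], b[0]})

print(sorted(duplicate_ids))
```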

## Problem 9: Product Price Range Categorization

**Requirement:** Pricing team wants to categorize products into price ranges for analysis.

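**Scenario:** Use CASE WHEN statements to categorize products by price ranges. The ranges are not specified; one set consistent with the expected output is Budget (under 50), Standard (50 to under 300), and Premium (300 and above).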

In [None]:
# Source DataFrame
products_data = [
    ("P001", "Laptop", 999.99),
    ("P002", "Mouse", 25.50),
    ("P003", "Keyboard", 75.00),
    ("P004", "Monitor", 299.99),
    ("P005", "Headphones", 150.00),
    ("P006", "Tablet", 450.00)
]

products_df = spark.createDataFrame(products_data, ["product_id", "product_name", "price"])
products_df.show()

In [None]:
# Expected Output
expected_data = [
    ("P001", "Laptop", 999.99, "Premium"),
    ("P002", "Mouse", 25.50, "Budget"),
    ("P003", "Keyboard", 75.00, "Standard"),
    ("P004", "Monitor", 299.99, "Standard"),
    ("P005", "Headphones", 150.00, "Standard"),
    ("P006", "Tablet", 450.00, "Premium")
]

expected_df = spark.createDataFrame(expected_data, ["product_id", "product_name", "price", "price_category"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Conditional logic with CASE WHEN. Tests when().otherwise() pattern for categorization.

## Problem 10: Customer Order Summary

**Requirement:** Sales team wants a summary of each customer's order history.

**Scenario:** For each customer, calculate total orders, total amount, and average order value (rounded to 2 decimal places).

In [None]:
# Source DataFrame
customer_orders_data = [
    (1, 101, 100.0),
    (2, 101, 150.0),
    (3, 102, 200.0),
    (4, 101, 75.0),
    (5, 103, 300.0),
    (6, 102, 250.0),
    (7, 103, 100.0),
    (8, 104, 500.0)
]

customer_orders_df = spark.createDataFrame(customer_orders_data, ["order_id", "customer_id", "amount"])
customer_orders_df.show()

In [None]:
# Expected Output
expected_data = [
    (104, 1, 500.0, 500.0),
    (103, 2, 400.0, 200.0),
    (101, 3, 325.0, 108.33),
    (102, 2, 450.0, 225.0)
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "total_orders", "total_amount", "avg_order_value"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Multiple aggregations in single groupBy. Tests count, sum, avg functions together.
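
The three aggregates can be verified by hand with a small plain-Python grouping. This is a sketch of the arithmetic behind the expected output (including the two-decimal rounding of the average), not the PySpark solution.

```python
from collections import defaultdict

# (order_id, customer_id, amount)
orders = [
    (1, 101, 100.0), (2, 101, 150.0), (3, 102, 200.0), (4, 101, 75.0),
    (5, 103, 300.0), (6, 102, 250.0), (7, 103, 100.0), (8, 104, 500.0),
]

amounts_by_customer = defaultdict(list)
for _order_id, customer_id, amount in orders:
    amounts_by_customer[customer_id].append(amount)

# customer_id -> (total_orders, total_amount, avg_order_value)
summary = {
    customer_id: (len(a), sum(a), round(sum(a) / len(a), 2))
    for customer_id, a in amounts_by_customer.items()
}

print(summary)
```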

## Problem 11: Employee Salary Percentile

**Requirement:** HR analytics needs to calculate salary percentiles by department.

**Scenario:** Use window functions to calculate percentile rank of salaries within each department.

In [None]:
# Source DataFrame
employees_salary_data = [
    (1, "John", "Engineering", 80000),
    (2, "Jane", "Engineering", 95000),
    (3, "Bob", "Engineering", 70000),
    (4, "Alice", "Marketing", 60000),
    (5, "Charlie", "Marketing", 75000),
    (6, "Diana", "Sales", 65000),
    (7, "Eve", "Sales", 85000)
]

employees_salary_df = spark.createDataFrame(employees_salary_data, ["emp_id", "emp_name", "department", "salary"])
employees_salary_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "John", "Engineering", 80000, 0.5),
    (2, "Jane", "Engineering", 95000, 1.0),
    (3, "Bob", "Engineering", 70000, 0.0),
    (4, "Alice", "Marketing", 60000, 0.0),
    (5, "Charlie", "Marketing", 75000, 1.0),
    (6, "Diana", "Sales", 65000, 0.0),
    (7, "Eve", "Sales", 85000, 1.0)
]

expected_df = spark.createDataFrame(expected_data, ["emp_id", "emp_name", "department", "salary", "percentile"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Window function with percent_rank for percentile calculations. Tests partitioning and ranking.
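
The expected percentiles follow the percent-rank formula `(rank - 1) / (rows in partition - 1)`, with 0.0 for a single-row partition. A plain-Python sketch, assuming all salaries within a department are distinct (ties would need proper rank handling); the function name `percent_rank_of` is hypothetical.

```python
def percent_rank_of(values):
    """Map each distinct value to (rank - 1) / (n - 1), ranking in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 0.0}
    # enumerate() starts at 0, which is already rank - 1
    return {value: index / (n - 1) for index, value in enumerate(ordered)}

engineering = percent_rank_of([80000, 95000, 70000])
marketing = percent_rank_of([60000, 75000])

print(engineering)
print(marketing)
```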

## Problem 12: Product Sales Pivot

**Requirement:** Business intelligence needs monthly sales data in pivot table format.

**Scenario:** Pivot sales data to show product sales by month as columns.

In [None]:
# Source DataFrame
product_monthly_sales_data = [
    ("P001", "2023-01", 1000),
    ("P001", "2023-02", 1200),
    ("P001", "2023-03", 1100),
    ("P002", "2023-01", 500),
    ("P002", "2023-02", 600),
    ("P002", "2023-03", 550),
    ("P003", "2023-01", 800),
    ("P003", "2023-02", 900),
    ("P003", "2023-03", 850)
]

product_monthly_sales_df = spark.createDataFrame(product_monthly_sales_data, ["product_id", "month", "sales"])
product_monthly_sales_df.show()

In [None]:
# Expected Output
expected_data = [
    ("P001", 1000, 1200, 1100),
    ("P002", 500, 600, 550),
    ("P003", 800, 900, 850)
]

expected_df = spark.createDataFrame(expected_data, ["product_id", "2023-01", "2023-02", "2023-03"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Pivot table operation. Tests pivot() method with aggregation for data reshaping.
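
The reshaping itself can be pictured in plain Python: one row per product, one column per month, with sales summed into each cell. This sketch fills missing cells with 0 for simplicity, which is an assumption of the illustration and not necessarily what a Spark pivot produces for absent combinations.

```python
# (product_id, month, sales)
rows = [
    ("P001", "2023-01", 1000), ("P001", "2023-02", 1200), ("P001", "2023-03", 1100),
    ("P002", "2023-01", 500), ("P002", "2023-02", 600), ("P002", "2023-03", 550),
    ("P003", "2023-01", 800), ("P003", "2023-02", 900), ("P003", "2023-03", 850),
]

# Distinct months become the output columns, in sorted order
months = sorted({month for _, month, _ in rows})

pivot = {}
for product_id, month, sales in rows:
    by_month = pivot.setdefault(product_id, dict.fromkeys(months, 0))
    by_month[month] += sales

table = [
    (product_id, *(by_month[m] for m in months))
    for product_id, by_month in sorted(pivot.items())
]

print(months)
print(table)
```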

## Problem 13: Customer Purchase Intervals

**Requirement:** Marketing team wants to analyze time between customer purchases for retention.

**Scenario:** Calculate days between consecutive purchases for each customer. Exclude each customer's first purchase, which has no previous purchase to compare against.

In [None]:
# Source DataFrame
customer_purchases_data = [
    (1, 101, "2023-01-01"),
    (2, 101, "2023-01-05"),
    (3, 101, "2023-01-12"),
    (4, 102, "2023-01-02"),
    (5, 102, "2023-01-15"),
    (6, 103, "2023-01-03")
]

customer_purchases_df = spark.createDataFrame(customer_purchases_data, ["order_id", "customer_id", "order_date"])
customer_purchases_df = customer_purchases_df.withColumn("order_date", col("order_date").cast("date"))
customer_purchases_df.show()

In [None]:
# Expected Output
expected_data = [
    (2, 101, "2023-01-05", 4),
    (3, 101, "2023-01-12", 7),
    (5, 102, "2023-01-15", 13)
]

expected_df = spark.createDataFrame(expected_data, ["order_id", "customer_id", "order_date", "days_since_last_purchase"])
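expected_df = expected_df.withColumn("order_date", col("order_date").cast("date"))
# datediff() returns an integer column, so cast the inferred long to int
expected_df = expected_df.withColumn("days_since_last_purchase", col("days_since_last_purchase").cast("int"))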
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Date operations with window functions. Tests datediff and lag for time interval calculations.
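
The expected gaps are plain date differences between each purchase and the one before it for the same customer. A quick plain-Python check of those numbers, pairing each date with its predecessor via `zip`:

```python
from datetime import date

# customer_id -> purchase dates in chronological order
purchases = {
    101: [date(2023, 1, 1), date(2023, 1, 5), date(2023, 1, 12)],
    102: [date(2023, 1, 2), date(2023, 1, 15)],
    103: [date(2023, 1, 3)],
}

# Pair each date with the previous one; a single purchase yields no gaps
gaps = {
    customer_id: [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    for customer_id, dates in purchases.items()
}

print(gaps)
```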

## Problem 14: String Pattern Matching

**Requirement:** Support team needs to find customers with specific email patterns for outreach.

**Scenario:** Filter customers whose email contains 'support' or 'help' in any case.

In [None]:
# Source DataFrame
customer_emails_data = [
    (1, "john@gmail.com"),
    (2, "support@company.com"),
    (3, "jane@yahoo.com"),
    (4, "HELPdesk@business.com"),
    (5, "bob@gmail.com"),
    (6, "info@company.com")
]

customer_emails_df = spark.createDataFrame(customer_emails_data, ["customer_id", "email"])
customer_emails_df.show()

In [None]:
# Expected Output
expected_data = [
    (2, "support@company.com"),
    (4, "HELPdesk@business.com")
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "email"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** String pattern matching with regex. Tests rlike and case-insensitive pattern matching.
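
The pattern can be tried out with Python's `re` module before using it in Spark. The sketch below uses the inline `(?i)` flag for case-insensitive matching; Spark's `rlike` uses Java regular expressions, which also accept this inline flag, but treat the exact pattern as one possible choice.

```python
import re

# (?i) makes the whole alternation case-insensitive
pattern = re.compile(r"(?i)support|help")

emails = [
    "john@gmail.com", "support@company.com", "jane@yahoo.com",
    "HELPdesk@business.com", "bob@gmail.com", "info@company.com",
]

matches = [email for email in emails if pattern.search(email)]

print(matches)
```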

## Problem 15: Array Column Operations

**Requirement:** Product team needs to analyze product tags for categorization.

**Scenario:** Explode array column of product tags and count products per tag.

In [None]:
# Source DataFrame
from pyspark.sql.types import ArrayType, StringType

products_tags_data = [
    ("P001", "Laptop", ["electronics", "computing", "premium"]),
    ("P002", "Mouse", ["electronics", "accessories"]),
    ("P003", "Notebook", ["stationery", "office"]),
    ("P004", "Monitor", ["electronics", "computing"]),
    ("P005", "Pen", ["stationery", "office"])
]

products_tags_df = spark.createDataFrame(products_tags_data, ["product_id", "product_name", "tags"])
products_tags_df.show()

In [None]:
# Expected Output
expected_data = [
    ("electronics", 3),
    ("computing", 2),
    ("stationery", 2),
    ("office", 2),
    ("premium", 1),
    ("accessories", 1)
]

expected_df = spark.createDataFrame(expected_data, ["tag", "product_count"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Array operations with explode function. Tests handling of complex types and flattening arrays.
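
Flattening the tag arrays and counting per tag can be mirrored in plain Python, which is a handy way to confirm the expected counts. The nested generator below plays the role of the explode step; it is a sketch of the logic, not the PySpark solution.

```python
from collections import Counter

product_tags = {
    "P001": ["electronics", "computing", "premium"],
    "P002": ["electronics", "accessories"],
    "P003": ["stationery", "office"],
    "P004": ["electronics", "computing"],
    "P005": ["stationery", "office"],
}

# One item per (product, tag) pair, then count by tag
tag_counts = Counter(tag for tags in product_tags.values() for tag in tags)

print(dict(tag_counts))
```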

## Problem 16: Null Value Handling

**Requirement:** Data quality team needs to handle missing customer phone numbers.

**Scenario:** Replace null phone numbers with 'Not Provided'. Counting nulls by city is a follow-up exercise that the expected output does not cover.

In [None]:
# Source DataFrame
customers_contact_data = [
    (1, "John", "New York", "123-456-7890"),
    (2, "Jane", "Chicago", None),
    (3, "Bob", "New York", None),
    (4, "Alice", "Chicago", "987-654-3210"),
    (5, "Charlie", "Boston", None),
    (6, "Diana", "New York", "555-123-4567")
]

customers_contact_df = spark.createDataFrame(customers_contact_data, ["customer_id", "customer_name", "city", "phone"])
customers_contact_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "John", "New York", "123-456-7890"),
    (2, "Jane", "Chicago", "Not Provided"),
    (3, "Bob", "New York", "Not Provided"),
    (4, "Alice", "Chicago", "987-654-3210"),
    (5, "Charlie", "Boston", "Not Provided"),
    (6, "Diana", "New York", "555-123-4567")
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "customer_name", "city", "phone"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Null value handling with fillna. Tests data cleaning and missing value imputation.

## Problem 17: Column Renaming and Selection

**Requirement:** Reporting team needs specific column names for their dashboard.

**Scenario:** Select specific columns and rename them to business-friendly names.

In [None]:
# Source DataFrame
employee_details_data = [
    (1, "John Doe", "Engineering", 80000, "2020-01-15"),
    (2, "Jane Smith", "Marketing", 75000, "2019-03-20"),
    (3, "Bob Johnson", "Engineering", 90000, "2018-06-10"),
    (4, "Alice Brown", "Sales", 70000, "2021-02-05")
]

employee_details_df = spark.createDataFrame(employee_details_data, ["emp_id", "emp_name", "department", "salary", "hire_date"])
employee_details_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "John Doe", "Engineering", 80000),
    (2, "Jane Smith", "Marketing", 75000),
    (3, "Bob Johnson", "Engineering", 90000),
    (4, "Alice Brown", "Sales", 70000)
]

expected_df = spark.createDataFrame(expected_data, ["EmployeeID", "FullName", "Department", "AnnualSalary"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Column selection and renaming. Tests alias usage and selective column operations.

## Problem 18: Date Format Conversion

**Requirement:** International team needs dates in specific format for reporting.

**Scenario:** Convert date format from YYYY-MM-DD to DD/MM/YYYY for international standards.

In [None]:
# Source DataFrame
orders_date_data = [
    (1, "2023-01-15"),
    (2, "2023-02-20"),
    (3, "2023-03-10"),
    (4, "2023-04-05"),
    (5, "2023-05-25")
]

orders_date_df = spark.createDataFrame(orders_date_data, ["order_id", "order_date"])
orders_date_df = orders_date_df.withColumn("order_date", col("order_date").cast("date"))
orders_date_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "15/01/2023"),
    (2, "20/02/2023"),
    (3, "10/03/2023"),
    (4, "05/04/2023"),
    (5, "25/05/2023")
]

expected_df = spark.createDataFrame(expected_data, ["order_id", "formatted_date"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Date formatting operations. Tests date_format function for international date standards.

## Problem 19: Simple Union Operation

**Requirement:** Operations team needs to combine current and historical customer data.

**Scenario:** Union two customer DataFrames with same schema into single dataset.

In [None]:
# Source DataFrames
current_customers_data = [
    (1, "John", "Active"),
    (2, "Jane", "Active"),
    (3, "Bob", "Active")
]

historical_customers_data = [
    (4, "Alice", "Inactive"),
    (5, "Charlie", "Inactive")
]

current_customers_df = spark.createDataFrame(current_customers_data, ["customer_id", "customer_name", "status"])
historical_customers_df = spark.createDataFrame(historical_customers_data, ["customer_id", "customer_name", "status"])

print("Current Customers:")
current_customers_df.show()
print("Historical Customers:")
historical_customers_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "John", "Active"),
    (2, "Jane", "Active"),
    (3, "Bob", "Active"),
    (4, "Alice", "Inactive"),
    (5, "Charlie", "Inactive")
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "customer_name", "status"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Union operation for combining datasets. Tests union() with same schema DataFrames.

## Problem 20: Distinct Value Count

**Requirement:** Data governance team needs to count distinct values for data quality assessment.

**Scenario:** Count distinct departments and distinct job titles in employee data.

In [None]:
# Source DataFrame
employees_diverse_data = [
    (1, "John", "Engineering", "Software Engineer"),
    (2, "Jane", "Engineering", "Data Scientist"),
    (3, "Bob", "Marketing", "Marketing Manager"),
    (4, "Alice", "Marketing", "Content Writer"),
    (5, "Charlie", "Engineering", "Software Engineer"),
    (6, "Diana", "Sales", "Sales Executive"),
    (7, "Eve", "Sales", "Sales Executive")
]

employees_diverse_df = spark.createDataFrame(employees_diverse_data, ["emp_id", "emp_name", "department", "job_title"])
employees_diverse_df.show()

In [None]:
# Expected Output
expected_data = [
    (3, 5)
]

expected_df = spark.createDataFrame(expected_data, ["distinct_departments", "distinct_job_titles"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Distinct counting with countDistinct. Tests aggregation without grouping for overall distinct counts.

## Problem 21: Column Concatenation

**Requirement:** Reporting team needs full names from separate first and last name columns.

**Scenario:** Concatenate first and last name columns with space separator.

In [None]:
# Source DataFrame
employees_names_data = [
    (1, "John", "Doe"),
    (2, "Jane", "Smith"),
    (3, "Bob", "Johnson"),
    (4, "Alice", "Brown"),
    (5, "Charlie", "Wilson")
]

employees_names_df = spark.createDataFrame(employees_names_data, ["emp_id", "first_name", "last_name"])
employees_names_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "John Doe"),
    (2, "Jane Smith"),
    (3, "Bob Johnson"),
    (4, "Alice Brown"),
    (5, "Charlie Wilson")
]

expected_df = spark.createDataFrame(expected_data, ["emp_id", "full_name"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** String concatenation with concat function. Tests string manipulation and literal usage.

## Problem 22: Row Number Generation

**Requirement:** Analytics team needs sequential row numbers for data processing.

**Scenario:** Add sequential row numbers to customer data ordered by customer_id.

In [None]:
# Source DataFrame
customers_sequential_data = [
    (105, "John"),
    (102, "Jane"),
    (108, "Bob"),
    (101, "Alice"),
    (107, "Charlie")
]

customers_sequential_df = spark.createDataFrame(customers_sequential_data, ["customer_id", "customer_name"])
customers_sequential_df.show()

In [None]:
# Expected Output
expected_data = [
    (101, "Alice", 1),
    (102, "Jane", 2),
    (105, "John", 3),
    (107, "Charlie", 4),
    (108, "Bob", 5)
]

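expected_df = spark.createDataFrame(expected_data, ["customer_id", "customer_name", "row_number"])
# row_number() returns an integer column, so cast the inferred long to int
expected_df = expected_df.withColumn("row_number", col("row_number").cast("int"))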
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Row number generation with window functions. Tests row_number() for sequential numbering.

## Problem 23: Simple Conditional Aggregation

**Requirement:** Finance team needs separate totals for domestic and international sales.

**Scenario:** Calculate total sales amount for domestic vs international orders using conditional sum.

In [None]:
# Source DataFrame
orders_international_data = [
    (1, "Domestic", 1000.0),
    (2, "International", 1500.0),
    (3, "Domestic", 800.0),
    (4, "International", 2000.0),
    (5, "Domestic", 1200.0),
    (6, "International", 1800.0)
]

orders_international_df = spark.createDataFrame(orders_international_data, ["order_id", "order_type", "amount"])
orders_international_df.show()

In [None]:
# Expected Output
expected_data = [
    (3000.0, 5300.0)
]

expected_df = spark.createDataFrame(expected_data, ["domestic_sales", "international_sales"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Conditional aggregation with when().otherwise(). Tests conditional sum operations.
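
A conditional sum adds the amount when the condition holds and 0 otherwise. The plain-Python sketch below mirrors that pattern to confirm the two expected totals; it is an illustration of the idea rather than the PySpark code.

```python
# (order_id, order_type, amount)
orders = [
    (1, "Domestic", 1000.0), (2, "International", 1500.0),
    (3, "Domestic", 800.0), (4, "International", 2000.0),
    (5, "Domestic", 1200.0), (6, "International", 1800.0),
]

# Each sum sees every row but contributes 0.0 for non-matching types
domestic_sales = sum(
    amount if order_type == "Domestic" else 0.0 for _, order_type, amount in orders
)
international_sales = sum(
    amount if order_type == "International" else 0.0 for _, order_type, amount in orders
)

print(domestic_sales, international_sales)
```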

## Problem 24: Data Type Conversion

**Requirement:** Data engineering team needs to convert string columns to proper data types.

**Scenario:** Convert string representations of numbers and dates to appropriate data types.

In [None]:
# Source DataFrame
raw_data_data = [
    ("1", "1000.50", "2023-01-15"),
    ("2", "2000.75", "2023-02-20"),
    ("3", "1500.25", "2023-03-10")
]

raw_data_df = spark.createDataFrame(raw_data_data, ["id_str", "amount_str", "date_str"])
raw_data_df.show()
raw_data_df.printSchema()

In [None]:
# Expected Output
expected_data = [
    (1, 1000.5, "2023-01-15"),
    (2, 2000.75, "2023-02-20"),
    (3, 1500.25, "2023-03-10")
]

expected_df = spark.createDataFrame(expected_data, ["id", "amount", "date"])
expected_df = expected_df.withColumn("id", col("id").cast("integer"))\
                       .withColumn("amount", col("amount").cast("double"))\
                       .withColumn("date", col("date").cast("date"))
expected_df.show()
expected_df.printSchema()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Data type casting operations. Tests cast() method for type conversions.

## Problem 25: Simple Left Join

**Requirement:** Customer service needs order details with customer information, including customers without orders.

**Scenario:** Perform left join between customers and orders to include all customers.

In [None]:
# Source DataFrames
customers_left_data = [
    (1, "John"),
    (2, "Jane"),
    (3, "Bob"),
    (4, "Alice")
]

orders_left_data = [
    (101, 1, 100.0),
    (102, 1, 150.0),
    (103, 2, 200.0),
    (104, 3, 75.0)
]

customers_left_df = spark.createDataFrame(customers_left_data, ["customer_id", "customer_name"])
orders_left_df = spark.createDataFrame(orders_left_data, ["order_id", "customer_id", "amount"])

print("Customers:")
customers_left_df.show()
print("Orders:")
orders_left_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "John", 101, 100.0),
    (1, "John", 102, 150.0),
    (2, "Jane", 103, 200.0),
    (3, "Bob", 104, 75.0),
    (4, "Alice", None, None)
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "customer_name", "order_id", "amount"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Left join operation. Tests different join types and handling of null values from unmatched records.

## Problem 26: Basic Statistical Aggregations

**Requirement:** Analytics team needs basic statistics for sales data analysis.

**Scenario:** Calculate count, mean, sample standard deviation, min, and max of sales amounts, rounding the mean and standard deviation to two decimal places.

In [None]:
# Source DataFrame
sales_stats_data = [
    (100.0,),
    (150.0,),
    (200.0,),
    (75.0,),
    (300.0,),
    (250.0,),
    (100.0,)
]

sales_stats_df = spark.createDataFrame(sales_stats_data, ["amount"])
sales_stats_df.show()

In [None]:
# Expected Output
expected_data = [
    # mean = 1175 / 7; stddev is the sample standard deviation (n - 1 denominator)
    (7, 167.86, 85.04, 75.0, 300.0)
]

expected_df = spark.createDataFrame(expected_data, ["count", "mean", "stddev", "min", "max"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Multiple statistical aggregations. Tests various aggregation functions together.

## Problem 27: Column Dropping

**Requirement:** Data privacy team requires removal of sensitive columns from dataset.

**Scenario:** Drop sensitive columns (SSN, salary) from employee data for external sharing.

In [None]:
# Source DataFrame
employees_sensitive_data = [
    (1, "John", "123-45-6789", "Engineering", 80000),
    (2, "Jane", "987-65-4321", "Marketing", 75000),
    (3, "Bob", "456-78-9123", "Engineering", 90000)
]

employees_sensitive_df = spark.createDataFrame(employees_sensitive_data, ["emp_id", "emp_name", "ssn", "department", "salary"])
employees_sensitive_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "John", "Engineering"),
    (2, "Jane", "Marketing"),
    (3, "Bob", "Engineering")
]

expected_df = spark.createDataFrame(expected_data, ["emp_id", "emp_name", "department"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Column dropping operation. Tests drop() method for removing specific columns.

## Problem 28: Simple Sort Operation

**Requirement:** Reporting team needs customer data sorted for consistent presentation.

**Scenario:** Sort customers by name in alphabetical order, breaking ties by customer_id in descending order.

In [None]:
# Source DataFrame
customers_sort_data = [
    (3, "Charlie"),
    (1, "Alice"),
    (4, "Diana"),
    (2, "Bob"),
    (5, "Eve")
]

customers_sort_df = spark.createDataFrame(customers_sort_data, ["customer_id", "customer_name"])
customers_sort_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie"),
    (4, "Diana"),
    (5, "Eve")
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "customer_name"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

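**Instructor Notes:** Sorting operation. Tests orderBy for data ordering. Note that assert_dataframe_equal ignores row order, so it confirms the rows but not the sort itself.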

## Problem 29: Basic Mathematical Operations

**Requirement:** Finance team needs calculated fields for financial reporting.

**Scenario:** Calculate tax (15%) and total amount after tax for each sale.

In [None]:
# Source DataFrame
sales_tax_data = [
    (1, 100.0),
    (2, 200.0),
    (3, 150.0),
    (4, 300.0)
]

sales_tax_df = spark.createDataFrame(sales_tax_data, ["sale_id", "amount"])
sales_tax_df.show()

In [None]:
# Expected Output
expected_data = [
    (1, 100.0, 15.0, 115.0),
    (2, 200.0, 30.0, 230.0),
    (3, 150.0, 22.5, 172.5),
    (4, 300.0, 45.0, 345.0)
]

expected_df = spark.createDataFrame(expected_data, ["sale_id", "amount", "tax", "total_amount"])
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Mathematical operations on columns. Tests arithmetic operations and column references.

## Problem 30: Simple Filter with Multiple Conditions

**Requirement:** Sales team needs to identify high-value recent customers.

**Scenario:** Filter customers who joined in 2023 and have spent more than $1000.

In [None]:
# Source DataFrame
customers_high_value_data = [
    (1, "John", "2023-01-15", 800.0),
    (2, "Jane", "2023-02-20", 1500.0),
    (3, "Bob", "2022-12-10", 1200.0),
    (4, "Alice", "2023-03-05", 2000.0),
    (5, "Charlie", "2022-11-20", 900.0)
]

customers_high_value_df = spark.createDataFrame(customers_high_value_data, ["customer_id", "customer_name", "join_date", "total_spent"])
customers_high_value_df = customers_high_value_df.withColumn("join_date", col("join_date").cast("date"))
customers_high_value_df.show()

In [None]:
# Expected Output
expected_data = [
    (2, "Jane", "2023-02-20", 1500.0),
    (4, "Alice", "2023-03-05", 2000.0)
]

expected_df = spark.createDataFrame(expected_data, ["customer_id", "customer_name", "join_date", "total_spent"])
expected_df = expected_df.withColumn("join_date", col("join_date").cast("date"))
expected_df.show()

In [None]:
# YOUR SOLUTION HERE

# Test your solution
assert_dataframe_equal(result_df, expected_df)

**Instructor Notes:** Multiple condition filtering with date functions. Tests compound conditions and date extraction functions.

# Set 1 Complete!

You've completed all 30 Easy problems in Set 1. These problems cover:
- Basic filtering and selection
- Simple aggregations
- Basic joins
- Window functions
- String and date operations
- UDFs
- Data type conversions
- And other fundamental PySpark operations

Next up: Set 2, with Easy/Medium difficulty problems.