## Spark Streaming Resource Pool Performance Demo

This demo showcases the impact of Spark's fair scheduler pools on streaming performance under high load. The script runs multiple concurrent streaming jobs with varying resource demands to illustrate how pools can improve consistency and throughput in resource-constrained environments.

Key elements:

1. **Workload Simulation**: Processes complex JSON data with nested structures through multiple streaming pipelines, including data ingestion, windowed aggregations, text analysis, and statistical calculations.

2. **Resource Contention**: Creates deliberate competition for cluster resources by running four concurrent streams with overlapping processing windows and CPU-intensive operations.

3. **Comparison Methodology**: Toggles between using dedicated scheduler pools (`USE_POOLS=True`) and shared resources (`USE_POOLS=False`) to demonstrate performance differences.

4. **Performance Metrics**: Tracks processing time and input rows for each stream to illustrate how pool allocation affects execution consistency.

5. **Visualization**: Generates time-series plots and summary statistics to quantify the impact of resource pools on processing latency and throughput.

This demo helps students understand when and why resource pools should be implemented in production Spark streaming applications, especially in multi-tenant environments or with complex workloads.
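
The pool names used later in this notebook (`count_pool`, `word_pool`, `analytics_pool`) can be given explicit settings in a fair scheduler allocation file. The sketch below builds such a file with the Python standard library. The weights and `minShare` values are illustrative assumptions, since the notebook does not show its allocation file; pools that are not listed in the file fall back to Spark's defaults (FIFO within the pool, weight 1, minShare 0).

```python
import xml.etree.ElementTree as ET

# Hypothetical pool settings: the weights and minShare values below are
# illustrative assumptions, not values taken from this demo.
POOLS = {
    "count_pool": {"schedulingMode": "FAIR", "weight": 1, "minShare": 1},
    "word_pool": {"schedulingMode": "FAIR", "weight": 1, "minShare": 1},
    "analytics_pool": {"schedulingMode": "FAIR", "weight": 2, "minShare": 2},
}


def build_allocation_xml(pools):
    """Return the contents of a fair scheduler allocation file as a string."""
    root = ET.Element("allocations")
    for name, settings in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        for key in ("schedulingMode", "weight", "minShare"):
            ET.SubElement(pool, key).text = str(settings[key])
    if hasattr(ET, "indent"):  # Pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


allocation_xml = build_allocation_xml(POOLS)
print(allocation_xml)
```

For the file to take effect, `spark.scheduler.mode` would need to be `FAIR` and `spark.scheduler.allocation.file` would need to point at the file when the Spark context starts; on a managed cluster this is typically part of the cluster configuration.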

In [0]:
%run ./includes/includes

## Setup Variables

In [0]:
# Control Variables
# !!!!!!!!!!!!!!!!!!!!!!!!!
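USE_POOLS = True  # True: streams run in dedicated scheduler pools; False: all streams share resources
GEN_INPUT_FILES = False  # Set to True to delete and regenerate the input files
FILE_COUNT = 20000  # Number of JSON input files to generate
CONCURRENT_STREAMS = 4  # Number of streams that run concurrently
MAX_FILES_PER_TRIGGER = 10  # Number of files processed per trigger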
# !!!!!!!!!!!!!!!!!!!!!!!!!
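
As a rough sizing check, the control variables determine how many micro-batches the ingestion stream needs to drain the input. The sketch below assumes that `MAX_FILES_PER_TRIGGER` is passed to the file source's `maxFilesPerTrigger` option inside `./includes/includes`, which is not shown here; the helper name `expected_micro_batches` is hypothetical.

```python
import math

FILE_COUNT = 20000
MAX_FILES_PER_TRIGGER = 10


def expected_micro_batches(file_count, max_files_per_trigger):
    """Number of triggers needed if each trigger reads at most max_files_per_trigger files."""
    return math.ceil(file_count / max_files_per_trigger)


batches = expected_micro_batches(FILE_COUNT, MAX_FILES_PER_TRIGGER)
print(batches)  # 2000
```

Raising `MAX_FILES_PER_TRIGGER` gives fewer, larger batches; lowering it gives more, smaller batches, which produces more data points in the plots at the cost of a longer run.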

## Demo Execution

In [0]:
if GEN_INPUT_FILES:
    # Remove input files left by previous runs (recursive delete)
    dbutils.fs.rm(input_path, True)
    generate_json_files(input_path, FILE_COUNT)
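
`generate_json_files` is defined in `./includes/includes` and is not shown here. As a minimal stand-in, the sketch below writes nested JSON records to a local directory using only the standard library. The function name `generate_json_files_local`, the field names, and the record layout are all assumptions made for illustration, not the demo's actual schema.

```python
import json
import os
import random
import tempfile


def generate_json_files_local(output_dir, file_count, records_per_file=5, seed=42):
    """Write file_count JSON-lines files of nested records into output_dir."""
    rng = random.Random(seed)  # Seeded so that repeated runs produce the same data
    os.makedirs(output_dir, exist_ok=True)
    words = ["spark", "stream", "pool", "delta", "batch"]
    for file_idx in range(file_count):
        path = os.path.join(output_dir, f"part-{file_idx:05d}.json")
        with open(path, "w", encoding="utf-8") as f:
            for rec_idx in range(records_per_file):
                record = {
                    "id": file_idx * records_per_file + rec_idx,
                    "user": {
                        "name": f"user_{rng.randint(1, 100)}",
                        "region": rng.choice(["us", "eu"]),
                    },
                    "metrics": {"value": round(rng.uniform(0, 100), 2)},
                    "text": " ".join(rng.choices(words, k=8)),
                }
                f.write(json.dumps(record) + "\n")
    return file_count


demo_dir = tempfile.mkdtemp(prefix="json_demo_")
written = generate_json_files_local(demo_dir, file_count=3)
print(written, sorted(os.listdir(demo_dir)))
```

One JSON object per line keeps each file readable by Spark's JSON source without the `multiLine` option.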


### Start Multiple Concurrent Streams

In [0]:
clean_up_delta_tables()

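# Apply the same session settings in both modes so that the comparison
# isolates the effect of the scheduler pools
spark.conf.set("spark.sql.shuffle.partitions", "8")  # Moderate parallelism
# Note: spark.default.parallelism is normally read when the Spark context is
# created, so setting it on a running session may have no effect
spark.conf.set("spark.default.parallelism", "8")
spark.conf.set("spark.sql.adaptive.enabled", "false")  # Disable adaptive query execution to make pool effects clearer

if USE_POOLS:
    # Jobs submitted from this thread run in the pool named by this local property.
    # This takes effect only when spark.scheduler.mode is FAIR.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "default")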
    json_stream = start_json_stream()
    
    # wait for the delta table to ensure it exists
    wait_for_delta_table(delta_table_path)
    
    # Start the remaining streams, each in its own pool; once started, they run concurrently
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "count_pool")
    count_stream = start_count_stream()
    # wait for the delta table to ensure it exists
    wait_for_delta_table(count_table_path)

    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "word_pool")
    word_stream = start_word_count_stream()
    # wait for the delta table to ensure it exists
    wait_for_delta_table(word_count_path)

    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "analytics_pool")
    analytics_stream = start_analytics_stream()
    # wait for the delta table to ensure it exists
    wait_for_delta_table(analytics_table_path)   
else:
    # Without pools, all streams will compete for the same resources
    json_stream = start_json_stream()
    
    # Wait for the delta table to ensure it exists
    wait_for_delta_table(delta_table_path)
    
    # Start the remaining streams without pool separation
    count_stream = start_count_stream()
    wait_for_delta_table(count_table_path)

    word_stream = start_word_count_stream()
    wait_for_delta_table(word_count_path)

    analytics_stream = start_analytics_stream()
    wait_for_delta_table(analytics_table_path)   

# Monitor the streams until all of the data is processed
import time
from datetime import datetime

print(f"Monitoring streams with USE_POOLS = {USE_POOLS}")

POLL_INTERVAL_SECONDS = 10
row_count = 1  # Non-zero so that the loop runs at least once
i = 0
while row_count != 0:
    time.sleep(POLL_INTERVAL_SECONDS)
    i += 1  # Count every poll, even when no stats are available yet
    
    # Get streaming stats
    stats = get_streaming_stats()
    
    # Display current status
    if not stats.empty:
        # Group by query name and get the latest stats
        latest_stats = stats.sort_values("elapsed_time").groupby("query").last().reset_index()
        print(f"\nStatus at {datetime.now().strftime('%H:%M:%S')} (Elapsed: {i * POLL_INTERVAL_SECONDS}s):")
        for _, row in latest_stats.iterrows():
            print(f"  {row['query']}: {row['input_rows']} rows, {row['processing_time']}ms processing time")
        # The loop ends once the latest batch of every query read zero rows
        row_count = latest_stats["input_rows"].sum()

# Collect final streaming metrics
df = get_streaming_stats()

import matplotlib.pyplot as plt  # Harmless if ./includes/includes already imports it

# Plot processing time and input rows over elapsed time by query
if not df.empty:
    # Create a figure with two subplots
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10), sharex=True)
    
    # Plot processing time
    for query in df["query"].unique():
        subset = df[df["query"] == query]
        ax1.plot(subset["elapsed_time"], subset["processing_time"], marker='o', linestyle='-', label=query)
    
    ax1.set_ylabel("Processing Time (ms)")
    ax1.set_title(f"Spark Streaming Processing Time - USE_POOLS = {USE_POOLS}")
    ax1.legend()
    ax1.grid(True)
    
    # Plot input rows
    for query in df["query"].unique():
        subset = df[df["query"] == query]
        ax2.plot(subset["elapsed_time"], subset["input_rows"], marker='o', linestyle='-', label=query)
    
    ax2.set_xlabel("Elapsed Time (seconds)")
    ax2.set_ylabel("Input Rows")
    ax2.set_title(f"Spark Streaming Input Rows - USE_POOLS = {USE_POOLS}")
    ax2.legend()
    ax2.grid(True)
    
    plt.tight_layout()
    plt.show()
    
    # Generate statistics summary
    print("\nStreaming Performance Summary:")
    summary = df.groupby("query").agg({
        "processing_time": ["mean", "max", "min", "std"],
        "input_rows": ["sum", "mean", "max"]
    }).reset_index()
    display(summary)
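
`get_streaming_stats` is defined in `./includes/includes` and is not shown here. One plausible way to build such a table is to flatten the progress records that Structured Streaming reports for each query. The sketch below works on plain dictionaries so that it runs without a cluster. The mapping of `processing_time` to `durationMs.triggerExecution` is an assumption, the query names are hypothetical, and the sample records are hand-written for illustration rather than measured.

```python
def flatten_progress(progress_records):
    """Turn streaming progress dicts into flat rows with the notebook's column names."""
    rows = []
    for p in progress_records:
        rows.append({
            "query": p.get("name"),
            "timestamp": p.get("timestamp"),
            "input_rows": p.get("numInputRows", 0),
            # Assumption: processing time is the total trigger execution time
            "processing_time": p.get("durationMs", {}).get("triggerExecution"),
        })
    return rows


def latest_per_query(rows):
    """Return the most recent row for each query, keyed by query name."""
    latest = {}
    # ISO-8601 timestamps in the same format sort correctly as strings
    for row in sorted(rows, key=lambda r: r["timestamp"]):
        latest[row["query"]] = row
    return latest


def all_streams_idle(rows):
    """True when at least one query has reported and every latest batch read zero rows."""
    latest = latest_per_query(rows)
    return bool(latest) and sum(r["input_rows"] for r in latest.values()) == 0


# Hand-written sample records (illustrative, not measured)
sample_progress = [
    {"name": "count_stream", "timestamp": "2024-01-01T00:00:10.000Z",
     "numInputRows": 50, "durationMs": {"triggerExecution": 1200}},
    {"name": "count_stream", "timestamp": "2024-01-01T00:00:20.000Z",
     "numInputRows": 0, "durationMs": {"triggerExecution": 90}},
    {"name": "word_stream", "timestamp": "2024-01-01T00:00:15.000Z",
     "numInputRows": 0, "durationMs": {"triggerExecution": 80}},
]

rows = flatten_progress(sample_progress)
print(all_streams_idle(rows))
```

With a live session, the records would come from `recentProgress` on each query in `spark.streams.active`; in PySpark 3.x that attribute is a list of dictionaries with these keys.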



In [0]:
# Stop all the active streams
for s in spark.streams.active:
    s.stop()