# Notebook 03 · Build Streaming Input from Bronze (Log Replay Setup)

## Purpose

This notebook prepares the **streaming input** for FunnelPulse by transforming historical events stored in the **bronze** layer into a file-based “event stream”.

Because we are working in a classroom environment without a production Kafka cluster, we simulate a real-time event firehose by:

- Selecting a time range of historical events from `bronze_events`
- Repartitioning those events into many small Parquet files
- Writing them into a `stream_input/` directory that Spark Structured Streaming can treat as a source

This notebook does **not** perform any analytics itself. Its only job is to produce a realistic, incremental input for the streaming pipeline implemented in Notebook 04.

---

## Inputs and Outputs

**Input table**

- `tables/bronze_events`
  - Built in Notebook 01
  - Contains raw but normalized events for October and November
  - Partitioned by `event_date`

**Output directory**

- `stream_input/`
  - Located under the project root, e.g. `~/funnelpulse/stream_input/`
  - Contains many small Parquet files
  - Each file holds a subset of bronze events in the chosen date range
  - Used as a streaming source in Notebook 04

---

## High-Level Workflow

1. Initialize Spark and project paths
2. Load the bronze event table covering October and November
3. Select a **streaming window** (a subset of days) to simulate as “live traffic”
4. Repartition this subset into many small files
5. Write the subset as Parquet files into `stream_input/`
6. Optionally inspect the number of files and sample rows for sanity
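
For step 4, the number of files trades off against the number of rows each file carries. The sketch below is a back-of-the-envelope calculation only, assuming rows are spread roughly evenly across files; the helper names `rows_per_file` and `files_for_target` are hypothetical, and the row count is the one Cell 2 reports for the example date range.

```python
import math


def rows_per_file(total_rows: int, num_files: int) -> int:
    """Approximate rows per file if rows are spread evenly over num_files."""
    if num_files <= 0:
        raise ValueError("num_files must be positive")
    return math.ceil(total_rows / num_files)


def files_for_target(total_rows: int, target_rows_per_file: int) -> int:
    """Number of files needed so that no file exceeds target_rows_per_file."""
    if target_rows_per_file <= 0:
        raise ValueError("target_rows_per_file must be positive")
    return max(1, math.ceil(total_rows / target_rows_per_file))


# Row count printed by Cell 2 for 2019-10-15 .. 2019-10-31
subset_rows = 2_140_953

print(rows_per_file(subset_rows, 50))          # 42820
print(files_for_target(subset_rows, 10_000))   # 215
```

With 50 files, each micro-batch of one file would carry roughly 43 thousand events; a smaller target per file yields more, shorter micro-batches.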

---

## Streaming Simulation Design

In a production deployment, FunnelPulse would consume events from a real-time source such as Kafka or a log service. In this environment, we approximate that behavior with a **file-based streaming source**:

1. **Choose a historical period**
   - For example, focus on events between:
     - `2019-10-15` and `2019-10-31`
   - This range covers 17 days and about 2.1 million events (per the row count reported in Cell 2), enough to exercise the streaming pipeline without overwhelming the local Spark session

2. **Subset bronze events**
   - Filter `bronze_events` to keep only events whose `event_date` falls within the chosen range
   - Preserve all other columns so the streaming job sees the same schema as the batch pipeline

3. **Chunk the data into many small files**
   - Repartition the subset into a target number of partitions (e.g., 50)
   - Each partition becomes a separate Parquet file in `stream_input/`
   - This allows Spark Structured Streaming to ingest new files incrementally, one or a few at a time

4. **Write to the streaming input directory**
   - Use `mode("overwrite")` so the streaming input can be rebuilt cleanly
   - The resulting directory contains a backlog of “to-be-streamed” events
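
Cell 3 calls `repartition(num_partitions)`, which spreads rows across partitions without regard to `event_time`, so each file can contain events from the whole date range. If a replay in which each file covers a contiguous slice of time is preferred, one possible variant is sketched below. It is not what this notebook does; the function name `write_time_ordered_chunks` and its parameters are hypothetical, and it assumes `df` is a Spark DataFrame with an `event_time` column.

```python
def write_time_ordered_chunks(df, output_path, num_files=50, time_col="event_time"):
    """Write df as up to num_files Parquet files, each covering a contiguous time range.

    Assumption: df is a pyspark.sql.DataFrame containing time_col.
    """
    (
        df
        # Range partitioning places rows with nearby timestamps in the same partition.
        .repartitionByRange(num_files, time_col)
        # Sort rows inside each partition so every file is internally time-ordered.
        .sortWithinPartitions(time_col)
        .write
        .mode("overwrite")
        .parquet(output_path)
    )


# Hypothetical usage with the names defined in this notebook:
# write_time_ordered_chunks(bronze_stream_subset, stream_input_path, num_files=50)
```

This only controls what each file contains. The order in which the streaming file source picks up the files is decided by Spark, not by this function, so strictly chronological replay is not guaranteed.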

---

## How This Is Used by the Streaming Pipeline

The streaming notebook (Notebook 04) uses `stream_input/` as a **file-based streaming source**:

- It defines a streaming DataFrame with:
  - Schema taken from `bronze_events`
  - A `maxFilesPerTrigger` setting that caps how many new files Spark reads per micro-batch (one file, in this design)
- As the streaming query runs, Spark:
  - Monitors `stream_input/` for new files
  - Treats each file’s contents as the next batch of events
  - Applies the same cleaning and aggregation logic used in the batch pipeline, but in streaming mode

This log replay pattern approximates how a Kafka-based or log-based streaming system delivers events incrementally, while staying within the constraints of the course environment.
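
The source definition described above could look roughly like the sketch below. Notebook 04 is not shown here, so this is an assumption about its shape rather than its actual code; the function name `read_stream_input` and its parameters are hypothetical. The calls used (`readStream`, `schema`, `option`, `parquet`) are standard Structured Streaming reader methods.

```python
def read_stream_input(spark, bronze_path, stream_input_path, max_files_per_trigger=1):
    """Return a streaming DataFrame over the Parquet files in stream_input_path.

    Assumption: spark is an active SparkSession and bronze_path holds the
    batch bronze table whose schema the stream should share.
    """
    # By default, file-based streaming sources need an explicit schema,
    # so reuse the schema of the batch bronze table.
    schema = spark.read.parquet(bronze_path).schema

    return (
        spark.readStream
        .schema(schema)
        # Upper bound on the number of new files consumed per micro-batch.
        .option("maxFilesPerTrigger", max_files_per_trigger)
        .parquet(stream_input_path)
    )


# Hypothetical usage with the paths defined in this notebook:
# events_stream = read_stream_input(spark, bronze_path, stream_input_path)
```

With `max_files_per_trigger=1` and the 50 files written by this notebook, the backlog would drain over about 50 micro-batches.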

---

## Role of This Notebook in the Overall System

Within the FunnelPulse architecture, this notebook serves as the **bridge between batch history and streaming simulation**:

- Notebooks 01 and 02:
  - Build the **batch lakehouse**, including bronze, silver, and multiple gold tables
- Notebook 03:
  - Converts a slice of the bronze history into a pseudo-real-time **stream input**
  - Makes it possible to test and demonstrate the streaming pipeline without external infrastructure
- Notebook 04:
  - Consumes `stream_input/` with Spark Structured Streaming
  - Computes real-time hourly funnel metrics by brand, analogous to the batch gold table
- Later notebooks:
  - Use the gold metrics for anomaly detection and incident surfacing

In short, Notebook 03 is an operational preparation step that turns static historical data into a sequence of events that the streaming system can consume as if they were arriving live.

In [4]:
# CELL 1: Spark initialization and paths for streaming input builder

import os
import sys

# Point JAVA_HOME at a local Java 17 install (macOS Homebrew path; adjust for your machine)
os.environ["JAVA_HOME"] = "/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home"

# Add parent directory to path
sys.path.insert(0, os.path.dirname(os.getcwd()))

from pyspark.sql import SparkSession

# Create Spark session for local execution
spark = (
    SparkSession.builder
    .appName("FunnelPulse Build Stream Input")
    .master("local[*]")
    .config("spark.driver.memory", "4g")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

print(spark)
print("Spark UI available at: http://localhost:4040")

# Project paths (parent of notebooks folder)
project_root = os.path.dirname(os.getcwd())
tables_dir = os.path.join(project_root, "tables")

bronze_path       = os.path.join(tables_dir, "bronze_events")
stream_input_path = os.path.join(project_root, "stream_input")

# Ensure stream_input directory exists
os.makedirs(stream_input_path, exist_ok=True)

print("Bronze path      :", bronze_path)
print("Stream input path:", stream_input_path)

<pyspark.sql.session.SparkSession object at 0x11267e3c0>
Spark UI available at: http://localhost:4040
Bronze path      : /Users/aranyaaryaman/Desktop/bigData 2/finalProject/Big-Data-Project/tables/bronze_events
Stream input path: /Users/aranyaaryaman/Desktop/bigData 2/finalProject/Big-Data-Project/stream_input


In [5]:
# CELL 2: Load bronze and filter a subset period for streaming

from pyspark.sql.functions import col

bronze = spark.read.parquet(bronze_path)

print("Total BRONZE rows (Oct+Nov):", bronze.count())
bronze.select("event_time", "event_date", "event_type").show(5, truncate=False)

# For streaming demo, let's take events between 2019-10-15 and 2019-10-31 (example)
bronze_stream_subset = bronze.filter(
    (col("event_date") >= "2019-10-15") & (col("event_date") <= "2019-10-31")
)

print("Rows in streaming subset:", bronze_stream_subset.count())
bronze_stream_subset.orderBy("event_time").show(10, truncate=False)

Total BRONZE rows (Oct+Nov): 8738120
+-------------------+----------+----------+
|event_time         |event_date|event_type|
+-------------------+----------+----------+
|2019-11-22 00:00:00|2019-11-22|cart      |
|2019-11-22 00:00:00|2019-11-22|view      |
|2019-11-22 00:00:00|2019-11-22|cart      |
|2019-11-22 00:00:01|2019-11-22|view      |
|2019-11-22 00:00:01|2019-11-22|cart      |
+-------------------+----------+----------+
only showing top 5 rows
Rows in streaming subset: 2140953
+-------------------+----------------+----------+-------------------+-------------+--------+-----+---------+------------------------------------+----------+
|event_time         |event_type      |product_id|category_id        |category_code|brand   |price|user_id  |user_session                        |event_date|
+-------------------+----------------+----------+-------------------+-------------+--------+-----+---------+------------------------------------+----------+
|2019-10-15 00:00:01|view            |

In [6]:
# CELL 3: Write streaming subset as many small Parquet files

# Repartition to produce multiple small files (e.g., 50)
# Adjust 50 up/down depending on size; more partitions = more "micro-batches"
num_partitions = 50

(
    bronze_stream_subset
    .repartition(num_partitions)
    .write
    .mode("overwrite")
    .parquet(stream_input_path)
)

print("Wrote streaming subset to:", stream_input_path)

# Quick inspection: count files (handles paths with spaces)
import os
files = [f for f in os.listdir(stream_input_path) if f.endswith('.parquet')]
print("Number of parquet files in stream_input:", len(files))



Wrote streaming subset to: /Users/aranyaaryaman/Desktop/bigData 2/finalProject/Big-Data-Project/stream_input
Number of parquet files in stream_input: 50
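
Beyond the file count, it can be useful to check that the files are similar in size, since uneven files produce uneven micro-batches. The following is a minimal sketch using only the standard library; the helper name `summarize_parquet_parts` is hypothetical, and it assumes the directory is on the local filesystem, as in this notebook.

```python
import os


def summarize_parquet_parts(directory):
    """Return count and size statistics (in bytes) for *.parquet files in directory."""
    sizes = [
        os.path.getsize(os.path.join(directory, name))
        for name in os.listdir(directory)
        # Skip Spark's bookkeeping files such as _SUCCESS and *.crc checksums.
        if name.endswith(".parquet")
    ]
    if not sizes:
        return {"files": 0, "total_bytes": 0, "min_bytes": 0, "max_bytes": 0}
    return {
        "files": len(sizes),
        "total_bytes": sum(sizes),
        "min_bytes": min(sizes),
        "max_bytes": max(sizes),
    }


# Hypothetical usage with the path defined in Cell 1:
# print(summarize_parquet_parts(stream_input_path))
```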

