# Notebook 05 · Anomaly Detection on Hourly Brand Funnels

## Purpose

This notebook builds the **anomaly detection layer** for FunnelPulse.

Using the historical **hourly funnel metrics by brand** generated in the batch pipeline, it:

- Learns typical conversion behavior for each brand (and for each brand at each hour of day)
- Scores each hour’s conversion rate against those baselines
- Flags statistically unusual hours as **anomalies** (drops or spikes)
- Writes a dedicated anomalies table for downstream reporting and investigation

This closes the loop from “we have metrics” to “we can surface incidents that need attention.”

---

## Inputs and Outputs

**Input table**

- `tables/gold_funnel_hourly_brand`
  - Built in Notebook 01 from the silver event layer
  - Grain: 1-hour window × brand
  - Metrics:
    - `views`, `carts`, `purchases`, `revenue`
    - `view_to_cart_rate`, `cart_to_purchase_rate`, `conversion_rate`
  - Covers October and November

**Output table**

- `tables/gold_anomalies_hourly_brand`
  - Grain: 1-hour window × brand
  - Contains:
    - All standard hourly funnel metrics
    - Baseline statistics per brand and per brand × hour_of_day
    - Z-score style anomaly scores
    - Flags and labels indicating anomaly type (drop or spike)
    - `window_date` partition for efficient querying

---

## High-Level Workflow

1. Initialize Spark and project paths
2. Load the batch hourly brand funnel table
3. Filter for reliable windows (sufficient volume)
4. Derive additional features:
   - Hour of day
5. Compute baselines:
   - Per brand across all hours
   - Per brand × hour_of_day
6. Compute anomaly scores using z-scores
7. Flag significant drops and spikes as anomalies
8. Write anomalies into a dedicated gold table
9. Inspect and sort anomalies to understand the most severe incidents

---

## Data Preparation and Filtering

To keep noisy, low-traffic windows from distorting results, the notebook:

- Reads the full `gold_funnel_hourly_brand` table
- Keeps only brand-hour windows that meet basic volume criteria:
  - At least 20 views in that hour (`views >= 20`, tunable)
  - A non-null `brand`
- Adds `hour_of_day` (0–23) extracted from `window_start`

This creates a focused dataset where anomaly scores are more meaningful and less driven by random fluctuations.
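
The filtering and feature step can be sketched in plain Python, independent of Spark. This is a minimal illustration rather than the notebook's implementation: `MIN_VIEWS_BASELINE` and `prepare` are hypothetical names, the rows are made-up samples, and the 20-view threshold mirrors the filter used in Cell 2.

```python
from datetime import datetime

# Hypothetical constant mirroring the views >= 20 filter in Cell 2.
MIN_VIEWS_BASELINE = 20

def prepare(rows, min_views=MIN_VIEWS_BASELINE):
    """Keep brand-hour rows with a brand and enough views; add hour_of_day."""
    prepared = []
    for row in rows:
        if row["brand"] is None or row["views"] < min_views:
            continue  # drop unbranded or low-traffic windows
        prepared.append({**row, "hour_of_day": row["window_start"].hour})
    return prepared

# Made-up sample rows, for illustration only.
sample = [
    {"brand": "brand_a", "views": 60, "window_start": datetime(2019, 10, 1, 20)},
    {"brand": "brand_b", "views": 1, "window_start": datetime(2019, 10, 1, 20)},
    {"brand": None, "views": 202, "window_start": datetime(2019, 10, 1, 21)},
]
prepared_rows = prepare(sample)
print(prepared_rows)
```

Only the `brand_a` row survives here: the second row has too few views and the third has no brand.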

---

## Baseline Modeling

The anomaly logic uses **simple statistical baselines** (mean and sample standard deviation) derived directly from the hourly funnel table:

1. **Per-brand baseline**

   For each brand, the notebook computes, across all hours in the historical data:

   - `conv_mean_brand` = average conversion rate
   - `conv_std_brand`  = standard deviation of conversion rate

   This captures how the brand typically converts overall, regardless of time of day.

2. **Per-brand, per-hour-of-day baseline**

   For each brand and each hour of day (0–23), it computes:

   - `conv_mean_brand_hour` = average conversion rate at that hour
   - `conv_std_brand_hour`  = standard deviation of conversion rate at that hour

   This captures daily patterns (for example, evenings vs mornings) and allows more precise comparisons when enough data is available.

Both sets of baselines are computed using window functions so they scale over the full dataset.
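
To make the two baselines concrete, here is a small plain-Python sketch that groups rows and computes the mean and sample standard deviation per group. It swaps Spark window functions for a dictionary-based group-by, so it only illustrates the arithmetic; the `baselines` helper and the sample rows are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, stdev

def baselines(rows, key_fields):
    """Return {key: (mean, sample_std)} of conversion_rate per group.

    sample_std is None when a group has a single observation, which
    stands in for the null standard deviation the notebook guards against.
    """
    groups = defaultdict(list)
    for row in rows:
        if row["conversion_rate"] is None:
            continue  # skip missing rates, as SQL aggregates do
        key = tuple(row[field] for field in key_fields)
        groups[key].append(row["conversion_rate"])
    return {
        key: (mean(vals), stdev(vals) if len(vals) > 1 else None)
        for key, vals in groups.items()
    }

# Made-up sample rows, for illustration only.
rows = [
    {"brand": "brand_a", "hour_of_day": 20, "conversion_rate": 0.10},
    {"brand": "brand_a", "hour_of_day": 20, "conversion_rate": 0.30},
    {"brand": "brand_a", "hour_of_day": 9, "conversion_rate": 0.20},
]
per_brand = baselines(rows, ["brand"])
per_brand_hour = baselines(rows, ["brand", "hour_of_day"])
print(per_brand)
print(per_brand_hour)
```

The brand × hour groups are much smaller than the brand groups, which is why the hour-level standard deviation is more often missing or unstable.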

---

## Anomaly Scoring

For every hourly brand window, the notebook computes **z-scores** that measure how far the current conversion rate deviates from its baseline:

- `z_brand`      = (current conversion − brand mean) / brand std
- `z_brand_hour` = (current conversion − brand-hour mean) / brand-hour std

To handle edge cases:

- If a standard deviation is zero or null (e.g., very few observations), the corresponding z-score is set to 0 so it does not trigger anomalies.
- A stricter view-count threshold (for example, `views >= 50`) is applied at flagging time, on top of the 20-view filter used when building baselines, so that only windows with sufficient traffic are considered for anomaly decisions.

This produces two separate anomaly signals per window (brand level and brand × hour level), both derived from the same conversion rate.
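
As a worked example, the guarded z-score can be written as a small plain-Python function. The `z_score` helper is a hypothetical name for this sketch; the input values are taken from the `runail` row at `2019-09-30 20:00:00` in the notebook output.

```python
def z_score(value, mean, std):
    """Z-score with the guard described above: 0.0 when std is missing or zero."""
    if std is None or std == 0:
        return 0.0
    if value is None:
        return None  # a missing conversion rate yields a missing score
    return (value - mean) / std

# conversion_rate, conv_mean_brand, conv_std_brand for the runail row.
z = z_score(0.8, 0.20059888898908887, 0.1602073128205354)
print(round(z, 4))  # 3.7414
```

A conversion rate of 0.8 against a brand mean of about 0.20 and a standard deviation of about 0.16 sits roughly 3.7 standard deviations above normal, which matches the `z_brand` value shown for that row.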

---

## Anomaly Flagging and Labeling

The notebook then converts z-scores into human-readable anomaly flags:

- **Drop anomaly**
  - Marked when:
    - The window has enough views, and
    - Either `z_brand` or `z_brand_hour` falls below a negative threshold (e.g., −2.0)
  - Interpreted as: “conversion is significantly lower than expected for this brand, or for this brand at this hour of day.”

- **Spike anomaly**
  - Marked when:
    - The window has enough views, and
    - Either `z_brand` or `z_brand_hour` exceeds a positive threshold (e.g., +2.0)
  - Interpreted as: “conversion is significantly higher than expected.”

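An `anomaly_type` field is derived with the value `"drop"`, `"spike"`, or `NULL` when no anomaly is present. If a window satisfies both conditions (one z-score below the negative threshold and the other above the positive one), it is labeled `"drop"`, because the drop check is evaluated first.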

This structure makes it easy to filter and rank incidents by severity (`z_brand` or `z_brand_hour`), and to build lists of “top 20 worst drops” or “largest positive spikes.”
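
The flagging rules and the ranking idea can be sketched in plain Python as follows. The names `MIN_VIEWS_FLAG`, `Z_THRESHOLD`, and `classify` are hypothetical; the thresholds follow the example values in the text, and only the first two sample rows use z-scores from the notebook output.

```python
# Hypothetical constants mirroring the example thresholds in the text.
MIN_VIEWS_FLAG = 50
Z_THRESHOLD = 2.0

def classify(views, z_brand, z_brand_hour,
             min_views=MIN_VIEWS_FLAG, z_threshold=Z_THRESHOLD):
    """Return "drop", "spike", or None for one brand-hour window."""
    if views < min_views:
        return None  # not enough traffic to trust the score
    if z_brand <= -z_threshold or z_brand_hour <= -z_threshold:
        return "drop"  # checked first, so it wins over a simultaneous spike
    if z_brand >= z_threshold or z_brand_hour >= z_threshold:
        return "spike"
    return None

# (brand, views, z_brand, z_brand_hour). The runail and irisk rows come from
# the notebook output; brand_x and brand_y are made up.
windows = [
    ("runail", 60, 3.7414091807554484, 1.2738285317170406),
    ("irisk", 30, 0.004215652334940692, -0.025729782764338932),
    ("brand_x", 80, -2.6, 0.4),
    ("brand_y", 120, -3.1, -1.0),
]
labeled = [
    (brand, z_b, classify(views, z_b, z_bh))
    for brand, views, z_b, z_bh in windows
]

# Worst drops first: most negative brand-level z-score at the top.
worst_drops = sorted((w for w in labeled if w[2] == "drop"), key=lambda w: w[1])
print(worst_drops)
```

Filtering to `"drop"` before sorting keeps spikes out of a "worst drops" list even when there are fewer drops than the requested list length.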

---

## Anomaly Table Storage

All anomaly rows are written to `tables/gold_anomalies_hourly_brand`:

- Each record includes:
  - Time window (`window_start`, `window_end`, `window_date`)
  - Brand and original funnel metrics
  - Baseline statistics (`conv_mean_brand`, `conv_std_brand`, etc.)
  - Z-scores (`z_brand`, `z_brand_hour`)
  - Anomaly flags and type
- The table is partitioned by `window_date` to support efficient time-based queries and visualization tools.

This table serves as the primary source for:

- Alert lists
- Incident dashboards
- Post-incident root cause analysis workflows
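
Partitioning by `window_date` means Spark writes each date into its own subdirectory named `window_date=<value>`, so a query filtered on that column only needs to read the matching directories. The sketch below simply builds the directory name one would expect for a given date; `partition_dir` is a hypothetical helper, not part of the notebook, and the files inside each directory are not shown.

```python
import os
from datetime import date

def partition_dir(table_path, window_date):
    """Directory expected to hold the rows for one window_date partition."""
    return os.path.join(table_path, f"window_date={window_date.isoformat()}")

example_dir = partition_dir("tables/gold_anomalies_hourly_brand", date(2019, 10, 1))
print(example_dir)
```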

---

## Role of This Notebook in the Overall System

Within the full FunnelPulse architecture:

- Notebooks 01 and 02:
  - Build the batch lakehouse:
    - Bronze → Silver → multiple gold metrics (hourly, daily, by brand, category, price band)
- Notebook 03:
  - Prepares a file-based streaming input from historical bronze events
- Notebook 04:
  - Runs a Spark Structured Streaming job to compute hourly brand funnel metrics in near real time
- Notebook 05 (this notebook):
  - Adds a statistical anomaly detection layer on top of the hourly funnel metrics

Together, these components form an end-to-end system that:

1. Ingests and cleans large-scale event data
2. Produces both batch and streaming funnel KPIs
3. Automatically surfaces **when** and **where** funnel behavior becomes unusual

This elevates the project from “reporting metrics” to **proactive funnel monitoring and incident detection**, which is exactly the goal of FunnelPulse.

In [1]:
# CELL 1: Spark initialization and paths for anomaly detection

import os
import sys

# Point JAVA_HOME at a local Java 17 install.
# This is the Homebrew path on Apple Silicon macOS; adjust it for your machine.
os.environ["JAVA_HOME"] = "/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home"

# Add parent directory to path
sys.path.insert(0, os.path.dirname(os.getcwd()))

from pyspark.sql import SparkSession

# Create Spark session for local execution
spark = (
    SparkSession.builder
    .appName("FunnelPulse Anomaly Hourly Brand")
    .master("local[*]")
    .config("spark.driver.memory", "4g")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

print(spark)
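# 4040 is only the default port; Spark moves to the next free port if it is taken.
# spark.sparkContext.uiWebUrl reports the address that was actually bound.
print("Spark UI available at: http://localhost:4040")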

# Project paths (parent of notebooks folder)
project_root = os.path.dirname(os.getcwd())
tables_dir = os.path.join(project_root, "tables")

gold_hourly_brand_path = os.path.join(tables_dir, "gold_funnel_hourly_brand")
anomaly_hourly_brand_path = os.path.join(tables_dir, "gold_anomalies_hourly_brand")

print("Batch hourly funnel path:", gold_hourly_brand_path)
print("Anomalies output path   :", anomaly_hourly_brand_path)

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/11/27 07:34:12 WARN Utils: Your hostname, Aranyas-Laptop.local, resolves to a loopback address: 127.0.0.1; using 192.168.1.180 instead (on interface en0)
25/11/27 07:34:12 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/27 07:34:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/11/27 07:34:13 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/11/27 07:34:13 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
25/11/27 07:34:13 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
25/11/27 07:34:13 WARN 

<pyspark.sql.session.SparkSession object at 0x11746dd60>
Spark UI available at: http://localhost:4040
Batch hourly funnel path: /Users/aranyaaryaman/Desktop/bigData 2/finalProject/Big-Data-Project/tables/gold_funnel_hourly_brand
Anomalies output path   : /Users/aranyaaryaman/Desktop/bigData 2/finalProject/Big-Data-Project/tables/gold_anomalies_hourly_brand


In [2]:
# CELL 2: Load batch hourly funnel and do basic filtering

from pyspark.sql.functions import hour, col

gold = spark.read.parquet(gold_hourly_brand_path)

print("Rows in batch hourly funnel:", gold.count())
gold.orderBy("window_start", "brand").show(10, truncate=False)

# Keep only rows with enough traffic to be meaningful
# You can tune these thresholds
gold_filtered = gold.filter(
    (col("views") >= 20) &    # at least 20 views in that hour
    (col("brand").isNotNull())
)

print("Rows after baseline traffic filter:", gold_filtered.count())

# Add hour of day for more granular baselines (brand x hour_of_day)
gold_feat = gold_filtered.withColumn("hour_of_day", hour(col("window_start")))

gold_feat.select(
    "window_start", "window_end", "brand",
    "views", "carts", "purchases", "conversion_rate",
    "hour_of_day"
).orderBy("window_start", "brand").show(10, truncate=False)

Rows in batch hourly funnel: 183145
+-----------+-----+-----+---------+-----------------+-------------------+-------------------+------------------+---------------------+-------------------+-----------+
|brand      |views|carts|purchases|revenue          |window_start       |window_end         |view_to_cart_rate |cart_to_purchase_rate|conversion_rate    |window_date|
+-----------+-----+-----+---------+-----------------+-------------------+-------------------+------------------+---------------------+-------------------+-----------+
|NULL       |202  |185  |59       |284.0199999999999|2019-09-30 20:00:00|2019-09-30 21:00:00|0.9158415841584159|0.31891891891891894  |0.29207920792079206|2019-09-30 |
|airnails   |0    |2    |2        |2.38             |2019-09-30 20:00:00|2019-09-30 21:00:00|NULL              |1.0                  |NULL               |2019-09-30 |
|ardell     |1    |0    |0        |0.0              |2019-09-30 20:00:00|2019-09-30 21:00:00|0.0               |NULL             

In [3]:
# CELL 3: Compute baselines (brand level and brand x hour_of_day)

from pyspark.sql.window import Window
from pyspark.sql.functions import avg, stddev_samp

w_brand = Window.partitionBy("brand")
w_brand_hour = Window.partitionBy("brand", "hour_of_day")

gold_with_baselines = (
    gold_feat
    .withColumn("conv_mean_brand", avg("conversion_rate").over(w_brand))
    .withColumn("conv_std_brand", stddev_samp("conversion_rate").over(w_brand))
    .withColumn("conv_mean_brand_hour", avg("conversion_rate").over(w_brand_hour))
    .withColumn("conv_std_brand_hour", stddev_samp("conversion_rate").over(w_brand_hour))
)

gold_with_baselines.select(
    "window_start", "brand", "views", "conversion_rate",
    "hour_of_day",
    "conv_mean_brand", "conv_std_brand",
    "conv_mean_brand_hour", "conv_std_brand_hour"
).orderBy("window_start", "brand").show(10, truncate=False)

+-------------------+---------+-----+-------------------+-----------+-------------------+-------------------+--------------------+--------------------+
|window_start       |brand    |views|conversion_rate    |hour_of_day|conv_mean_brand    |conv_std_brand     |conv_mean_brand_hour|conv_std_brand_hour |
+-------------------+---------+-----+-------------------+-----------+-------------------+-------------------+--------------------+--------------------+
|2019-09-30 20:00:00|runail   |60   |0.8                |20         |0.20059888898908887|0.1602073128205354 |0.2838033711964399  |0.40523242803154963 |
|2019-09-30 21:00:00|irisk    |30   |0.2                |21         |0.19918839008798714|0.19252297094947488|0.20394748720490852 |0.15342093017511443 |
|2019-09-30 21:00:00|runail   |32   |0.53125            |21         |0.20059888898908887|0.1602073128205354 |0.21051508929878074 |0.2407001359396143  |
|2019-09-30 22:00:00|runail   |28   |2.392857142857143  |22         |0.20059888898908887

In [4]:
# CELL 4: Compute anomaly scores and flags

from pyspark.sql.functions import col, when

df = gold_with_baselines

# Avoid divide by zero or null stddev (set z to 0 in those cases)
df = df.withColumn(
    "z_brand",
    when(
        (col("conv_std_brand").isNull()) | (col("conv_std_brand") == 0),
        0.0
    ).otherwise(
        (col("conversion_rate") - col("conv_mean_brand")) / col("conv_std_brand")
    )
)

df = df.withColumn(
    "z_brand_hour",
    when(
        (col("conv_std_brand_hour").isNull()) | (col("conv_std_brand_hour") == 0),
        0.0
    ).otherwise(
        (col("conversion_rate") - col("conv_mean_brand_hour")) / col("conv_std_brand_hour")
    )
)

# Define anomaly flags in both directions: drops (z <= -2) and spikes (z >= 2)
# You can tune these thresholds
df = df.withColumn(
    "is_drop_anomaly",
    when(
        (col("views") >= 50) &  # require more traffic for reliability
        ((col("z_brand") <= -2.0) | (col("z_brand_hour") <= -2.0)),
        True
    ).otherwise(False)
)

df = df.withColumn(
    "is_spike_anomaly",
    when(
        (col("views") >= 50) &  # same traffic threshold
        ((col("z_brand") >= 2.0) | (col("z_brand_hour") >= 2.0)),
        True
    ).otherwise(False)
)

df = df.withColumn(
    "anomaly_type",
    when(col("is_drop_anomaly"), "drop")
    .when(col("is_spike_anomaly"), "spike")
    .otherwise(None)
)

df.select(
    "window_start", "brand", "views", "carts", "purchases",
    "conversion_rate", "hour_of_day",
    "conv_mean_brand", "conv_std_brand",
    "z_brand", "z_brand_hour",
    "anomaly_type"
).orderBy("window_start", "brand").show(20, truncate=False)

+-------------------+---------+-----+-----+---------+-------------------+-----------+-------------------+-------------------+--------------------+---------------------+------------+
|window_start       |brand    |views|carts|purchases|conversion_rate    |hour_of_day|conv_mean_brand    |conv_std_brand     |z_brand             |z_brand_hour         |anomaly_type|
+-------------------+---------+-----+-----+---------+-------------------+-----------+-------------------+-------------------+--------------------+---------------------+------------+
|2019-09-30 20:00:00|runail   |60   |65   |48       |0.8                |20         |0.20059888898908887|0.1602073128205354 |3.7414091807554484  |1.2738285317170406   |spike       |
|2019-09-30 21:00:00|irisk    |30   |8    |6        |0.2                |21         |0.19918839008798714|0.19252297094947488|0.004215652334940692|-0.025729782764338932|NULL        |
|2019-09-30 21:00:00|runail   |32   |62   |17       |0.53125            |21         |0.200

In [5]:
# CELL 5: Filter anomalies and write to gold_anomalies_hourly_brand

anomalies = df.filter(col("anomaly_type").isNotNull())

print("Total anomaly windows found:", anomalies.count())

anomalies.orderBy("window_start", "brand").show(50, truncate=False)

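# Write to Parquet, partitioned by window_date for easy querying.
# window_date already exists in the input table; recomputing it from
# window_start keeps the partition column consistent with the window.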
from pyspark.sql.functions import to_date

anomalies_out = anomalies.withColumn("window_date", to_date(col("window_start")))

(
    anomalies_out
    .write
    .mode("overwrite")
    .partitionBy("window_date")
    .parquet(anomaly_hourly_brand_path)
)

print("Anomalies table written to:", anomaly_hourly_brand_path)

Total anomaly windows found: 442
+---------+-----+-----+---------+------------------+-------------------+-------------------+-------------------+---------------------+-------------------+-----------+-----------+-------------------+-------------------+--------------------+-------------------+-------------------+-------------------+---------------+----------------+------------+
|brand    |views|carts|purchases|revenue           |window_start       |window_end         |view_to_cart_rate  |cart_to_purchase_rate|conversion_rate    |window_date|hour_of_day|conv_mean_brand    |conv_std_brand     |conv_mean_brand_hour|conv_std_brand_hour|z_brand            |z_brand_hour       |is_drop_anomaly|is_spike_anomaly|anomaly_type|
+---------+-----+-----+---------+------------------+-------------------+-------------------+-------------------+---------------------+-------------------+-----------+-----------+-------------------+-------------------+--------------------+-------------------+----------------

Anomalies table written to: /Users/aranyaaryaman/Desktop/bigData 2/finalProject/Big-Data-Project/tables/gold_anomalies_hourly_brand

In [6]:
# CELL 6: Reload anomalies table and inspect

anoms = spark.read.parquet(anomaly_hourly_brand_path)

print("Anomaly rows stored:", anoms.count())

anoms.select(
    "window_start", "window_end", "window_date",
    "brand", "views", "carts", "purchases",
    "conversion_rate",
    "conv_mean_brand", "conv_std_brand",
    "z_brand", "z_brand_hour",
    "anomaly_type"
).orderBy("window_start", "brand").show(50, truncate=False)

# Example: top 20 largest drops, most negative brand-level z-score first
anoms.filter(col("anomaly_type") == "drop").orderBy("z_brand").select(
    "window_start", "brand", "views", "purchases",
    "conversion_rate", "conv_mean_brand", "z_brand"
).show(20, truncate=False)

Anomaly rows stored: 442
+-------------------+-------------------+-----------+---------+-----+-----+---------+-------------------+-------------------+-------------------+-------------------+-------------------+------------+
|window_start       |window_end         |window_date|brand    |views|carts|purchases|conversion_rate    |conv_mean_brand    |conv_std_brand     |z_brand            |z_brand_hour       |anomaly_type|
+-------------------+-------------------+-----------+---------+-----+-----+---------+-------------------+-------------------+-------------------+-------------------+-------------------+------------+
|2019-09-30 20:00:00|2019-09-30 21:00:00|2019-09-30 |runail   |60   |65   |48       |0.8                |0.20059888898908887|0.1602073128205354 |3.7414091807554484 |1.2738285317170406 |spike       |
|2019-10-01 00:00:00|2019-10-01 01:00:00|2019-10-01 |runail   |113  |100  |54       |0.4778761061946903 |0.20059888898908887|0.1602073128205354 |1.7307400787391518 |2.408188244923