# Lecture 3 Lab — Kafka + Spark Structured Streaming (Real-time Fraud Detection)

**Data:** `creditcard.csv`  
**Kafka topics:** `transactions` (input) → `fraud_alerts` (output)  

Lab outcomes:
- Train a baseline fraud model offline (Spark ML)
- Stream transactions from Kafka and score in real time
- Compute windowed KPI analytics with event-time + watermark
- Demonstrate fault tolerance with checkpointing


## 0) Paths & versions
This notebook is designed for the Docker Compose package:
- Jupyter image: `jupyter/pyspark-notebook:spark-3.5.1`
- Kafka image: `bitnami/kafka:3.7`

Folder mapping:
- Host `./work` → container `/home/jovyan/work`
- Host `./data/creditcard.csv`: the dataset, bundled with the package (already copied)


In [1]:
from pathlib import Path
import os

WORK = Path("/home/jovyan/work")
DATA = WORK / "data"
MODELS = WORK / "models"
OUT = WORK / "output_alerts"
CKPT = WORK / "checkpoints"

for p in [DATA, MODELS, OUT, CKPT]:
    p.mkdir(parents=True, exist_ok=True)

csv_path = str(DATA / "creditcard.csv")
csv_path

'/home/jovyan/work/data/creditcard.csv'

In [2]:
import shutil, pathlib

# If the dataset is missing under /home/jovyan/work/data, try to copy from /data (if you mounted it)
fallback = pathlib.Path("/data/creditcard.csv")
target = pathlib.Path(csv_path)
if (not target.exists()) and fallback.exists():
    shutil.copy2(fallback, target)

target.exists(), target

(True, PosixPath('/home/jovyan/work/data/creditcard.csv'))

## 1) Start Spark with Kafka connector
Spark needs the Kafka connector package. The coordinates must match your Spark version.
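
The matching rule can be sketched as a small helper. The function name `kafka_connector_coordinate` is hypothetical, and the default Scala suffix `2.12` is an assumption that matches the coordinate used in this notebook; check the Scala version of your own Spark build before relying on it.

```python
def kafka_connector_coordinate(spark_version: str, scala_version: str = "2.12") -> str:
    """Build the Maven coordinate of the Spark-Kafka connector.

    Hypothetical helper: the artifact suffix is the Scala version and the
    trailing version must equal the Spark version that is actually running.
    """
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


coordinate = kafka_connector_coordinate("3.5.0")
print(coordinate)
```

Deriving the version from `pyspark.__version__`, as the next cell does, avoids a mismatch when the image tag and the installed PySpark version differ.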


In [3]:
from pyspark.sql import SparkSession
import pyspark

spark_version = pyspark.__version__
# Take the connector version from the installed PySpark so the two always match (Scala 2.12 build):
kafka_pkg = f"org.apache.spark:spark-sql-kafka-0-10_2.12:{spark_version}"

spark = (SparkSession.builder
         .appName("Lecture3-Kafka-Fraud")
         .master("local[*]")
         .config("spark.sql.shuffle.partitions", "8")
         .config("spark.jars.packages", kafka_pkg)
         .getOrCreate())

spark.sparkContext.setLogLevel("WARN")
spark_version, kafka_pkg

('3.5.0', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0')

## 2) Offline data audit + prepare event_time
Audit: duplicates, missing values, and class imbalance.


In [4]:
import pandas as pd
import datetime

pdf = pd.read_csv(csv_path)
print("Shape:", pdf.shape)
print("Duplicates:", pdf.duplicated().sum())
print("Missing cells:", int(pdf.isna().sum().sum()))
print("Class distribution:")
print(pdf["Class"].value_counts())
print((pdf["Class"].value_counts(normalize=True)*100).round(4))

pdf = pdf.drop_duplicates()

# Create event_time (timestamp) for streaming windows
start_ts = datetime.datetime.now().replace(microsecond=0)
pdf["event_time"] = pd.to_datetime(start_ts) + pd.to_timedelta(pdf["Time"], unit="s")

pdf[["Time","event_time","Amount","Class"]].head()

Shape: (284807, 31)
Duplicates: 1081
Missing cells: 0
Class distribution:
Class
0    284315
1       492
Name: count, dtype: int64
Class
0    99.8273
1     0.1727
Name: proportion, dtype: float64


| | Time | event_time | Amount | Class |
|---|---|---|---|---|
| 0 | 0.0 | 2025-12-16 12:46:03 | 149.62 | 0 |
| 1 | 0.0 | 2025-12-16 12:46:03 | 2.69 | 0 |
| 2 | 1.0 | 2025-12-16 12:46:04 | 378.66 | 0 |
| 3 | 1.0 | 2025-12-16 12:46:04 | 123.5 | 0 |
| 4 | 2.0 | 2025-12-16 12:46:05 | 69.99 | 0 |


## 3) Train and save a baseline model (Spark ML)
Pipeline: assembler → scaler → logistic regression. Fraud is rare (about 0.17% of rows), so each row carries a class weight that is passed to the model through `weightCol`.
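
A minimal sketch of the class-weight arithmetic, in plain Python. The helper name `fraud_weight` is hypothetical; the counts are the training-split counts printed by the next cell. The idea is that legitimate rows keep weight 1.0 and each fraud row gets the ratio of legitimate to fraud rows, so both classes contribute the same total weight.

```python
def fraud_weight(counts: dict) -> float:
    """Weight for the fraud class: (# legitimate rows) / (# fraud rows).

    Hypothetical helper mirroring the notebook's formula; missing classes
    default to 1 so the division is always defined.
    """
    n_legit = counts.get(0, 1)
    n_fraud = max(counts.get(1, 1), 1)
    return n_legit / n_fraud


counts = {0: 226792, 1: 365}      # training-split counts from this lab run
w1 = fraud_weight(counts)

# With these weights the two classes carry (almost exactly) equal total weight.
total_legit = counts[0] * 1.0
total_fraud = counts[1] * w1
print(w1, total_legit, total_fraud)
```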


In [5]:
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator

sdf = spark.createDataFrame(pdf)

feature_cols = [c for c in sdf.columns if c.startswith("V")] + ["Amount"]
label_col = "Class"

train, test = sdf.randomSplit([0.8, 0.2], seed=42)

counts = train.groupBy(label_col).count().collect()
counts = {row[label_col]: row["count"] for row in counts}
w0 = 1.0
w1 = counts.get(0,1) / max(counts.get(1,1), 1)
print("Class counts:", counts, "fraud_weight:", w1)

train = train.withColumn("weight", F.when(F.col(label_col)==1, F.lit(float(w1))).otherwise(F.lit(float(w0))))
test  = test.withColumn("weight",  F.when(F.col(label_col)==1, F.lit(float(w1))).otherwise(F.lit(float(w0))))

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features_raw")
scaler    = StandardScaler(inputCol="features_raw", outputCol="features", withMean=True, withStd=True)
lr        = LogisticRegression(featuresCol="features", labelCol=label_col, weightCol="weight", maxIter=50)

pipe = Pipeline(stages=[assembler, scaler, lr])
model = pipe.fit(train)

pred = model.transform(test)
pr_auc = BinaryClassificationEvaluator(labelCol=label_col, rawPredictionCol="rawPrediction", metricName="areaUnderPR").evaluate(pred)
print("PR-AUC:", pr_auc)

Class counts: {0: 226792, 1: 365} fraud_weight: 621.3479452054795
PR-AUC: 0.6973528063247361


In [6]:
model_path = str(MODELS / "fraud_lr_model")

# overwrite() replaces any model saved by a previous run, so no manual cleanup is needed
model.write().overwrite().save(model_path)
model_path

'/home/jovyan/work/models/fraud_lr_model'

## 4) Read transactions from Kafka
Producer sends JSON messages with fields:
`event_time, Time, V1..V28, Amount, Class`.
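
The producer itself is not part of this notebook. As a rough sketch of the message shape it is expected to emit, the following plain-Python function serializes one row to JSON bytes. The name `to_message` is hypothetical, and the timestamp string format (`YYYY-MM-DD HH:MM:SS`) is an assumption; verify that it parses with the `TimestampType` schema used in the next cell.

```python
import json
from datetime import datetime, timedelta


def to_message(row: dict, start_ts: datetime) -> bytes:
    """Serialize one transaction row as a UTF-8 JSON Kafka value (hypothetical helper)."""
    # event_time = stream start + the dataset's relative "Time" offset in seconds
    event_time = start_ts + timedelta(seconds=float(row["Time"]))
    payload = {
        "event_time": event_time.isoformat(sep=" "),  # assumed timestamp format
        "Time": float(row["Time"]),
    }
    for i in range(1, 29):
        payload[f"V{i}"] = float(row[f"V{i}"])
    payload["Amount"] = float(row["Amount"])
    payload["Class"] = int(row["Class"])
    return json.dumps(payload).encode("utf-8")


# Made-up row for illustration: all V features set to 0.0
row = {"Time": 2.0, "Amount": 69.99, "Class": 0, **{f"V{i}": 0.0 for i in range(1, 29)}}
msg = to_message(row, datetime(2025, 12, 16, 12, 46, 3))
decoded = json.loads(msg)
print(decoded["event_time"], len(decoded))
```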


In [7]:
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, TimestampType
from pyspark.sql import functions as F

schema_fields = [
    StructField("event_time", TimestampType(), True),
    StructField("Time", DoubleType(), True),
]
for i in range(1, 29):
    schema_fields.append(StructField(f"V{i}", DoubleType(), True))
schema_fields += [
    StructField("Amount", DoubleType(), True),
    StructField("Class", IntegerType(), True),
]
txn_schema = StructType(schema_fields)

BOOTSTRAP = "kafka:9092"
TOPIC_IN  = "transactions"

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", BOOTSTRAP)
       .option("subscribe", TOPIC_IN)
       .option("startingOffsets", "earliest")
       .load())

# Kafka value is bytes → string → JSON → columns
parsed = (raw.selectExpr("CAST(value AS STRING) AS json_str")
            .select(F.from_json(F.col("json_str"), txn_schema).alias("r"))
            .select("r.*"))

parsed.isStreaming

True

## 5) Streaming inference + alerts
- Load the saved model
- Compute fraud probability (`fraud_prob`)
- Trigger alert when `fraud_prob >= threshold`
- Write alerts to:
  1) Kafka topic `fraud_alerts`
  2) Parquet in `/home/jovyan/work/output_alerts/`
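
The alert rule in the list above can be sketched without Spark. The function names here are hypothetical, and the probability vectors are made-up values in the form `[P(Class=0), P(Class=1)]`.

```python
def fraud_probability(probability) -> float:
    """Return P(Class = 1) from a two-element probability vector, or 0.0 if unavailable."""
    try:
        return float(probability[1])
    except (TypeError, IndexError):
        # Missing or malformed vector: treat the record as not fraud
        return 0.0


def is_alert(probability, threshold: float = 0.90) -> bool:
    """An alert fires when the fraud probability reaches the threshold."""
    return fraud_probability(probability) >= threshold


examples = [[0.97, 0.03], [0.08, 0.92], [0.10, 0.90], None]
flags = [is_alert(p) for p in examples]
print(flags)
```

Note that the comparison is inclusive, so a probability exactly equal to the threshold raises an alert.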


In [8]:
from pyspark.ml import PipelineModel
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

loaded_model = PipelineModel.load(model_path)

threshold = 0.90
TOPIC_OUT = "fraud_alerts"

# UDF: extract P(Class = 1) from the model's probability vector
def extract_prob(v):
    try:
        return float(v[1])
    except (TypeError, IndexError):
        # Missing or malformed vector: treat the record as not fraud
        return 0.0

extract_prob_udf = F.udf(extract_prob, DoubleType())

scored = loaded_model.transform(parsed)
scored = scored.withColumn("fraud_prob", extract_prob_udf(F.col("probability")))

alerts = (scored
          .withColumn("is_alert", (F.col("fraud_prob") >= F.lit(threshold)).cast("int"))
          .filter(F.col("is_alert") == 1)
          .select("event_time","Amount","Class","fraud_prob","is_alert"))

# 1) Write to Kafka (value as JSON)
alerts_json = (alerts
               .select(F.to_json(F.struct(*alerts.columns)).alias("value")))

q_kafka = (alerts_json.writeStream
           .format("kafka")
           .option("kafka.bootstrap.servers", BOOTSTRAP)
           .option("topic", TOPIC_OUT)
           .option("checkpointLocation", str(CKPT / "alerts_to_kafka"))
           .outputMode("append")
           .start())

# 2) Write to Parquet
q_parquet = (alerts.writeStream
             .format("parquet")
             .option("path", str(OUT))
             .option("checkpointLocation", str(CKPT / "alerts_to_parquet"))
             .outputMode("append")
             .start())

print("Started alert streams: to Kafka and to Parquet.")

Started alert streams: to Kafka and to Parquet.


## 6) Windowed KPI analytics (event-time + watermark)
We compute per-minute KPIs:
- n_txn, n_alert, n_fraud_true, TP, FP
- precision, recall
The 2-minute watermark tells Spark how long to wait for late events before it finalizes a window and drops that window's state.
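
To make the window and watermark mechanics concrete, here is a simplified plain-Python simulation. It is a sketch under stated assumptions, not Spark's implementation: it advances the watermark after every event, whereas Spark updates it between micro-batches, and all names and sample events are hypothetical.

```python
from collections import defaultdict

WINDOW_S = 60       # 1-minute tumbling windows
WATERMARK_S = 120   # events may arrive up to 2 minutes late


def windowed_kpis(events):
    """events: (event_ts_seconds, is_alert, is_fraud) tuples in arrival order."""
    stats = defaultdict(lambda: {"n_txn": 0, "n_alert": 0, "n_fraud_true": 0, "tp": 0, "fp": 0})
    max_seen = float("-inf")
    dropped = 0
    for ts, is_alert, is_fraud in events:
        watermark = max_seen - WATERMARK_S
        start = int(ts // WINDOW_S) * WINDOW_S
        if start + WINDOW_S <= watermark:
            dropped += 1          # window already finalized: the late event is ignored
            continue
        s = stats[start]
        s["n_txn"] += 1
        s["n_alert"] += is_alert
        s["n_fraud_true"] += is_fraud
        s["tp"] += int(is_alert and is_fraud)
        s["fp"] += int(is_alert and not is_fraud)
        max_seen = max(max_seen, ts)
    for s in stats.values():
        # Undefined ratios stay None, like the NULLs in the Spark query
        s["precision"] = s["tp"] / s["n_alert"] if s["n_alert"] > 0 else None
        s["recall"] = s["tp"] / s["n_fraud_true"] if s["n_fraud_true"] > 0 else None
    return dict(stats), dropped


events = [(5, 1, 1), (20, 0, 0), (70, 1, 0), (65, 0, 1), (400, 0, 0), (30, 1, 1)]
kpis, dropped = windowed_kpis(events)
for start in sorted(kpis):
    print(start, kpis[start])
print("dropped:", dropped)
```

In this sample, the event at second 65 arrives out of order but is still counted, while the event at second 30 arrives after the watermark has passed its window and is dropped.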


In [9]:
# KPI stream (no filtering; use all scored records)
all_alerts = (scored
              .withColumn("is_alert", (F.col("fraud_prob") >= F.lit(threshold)).cast("int"))
              .select("event_time","Class","is_alert"))

kpi = (all_alerts
       .withWatermark("event_time", "2 minutes")
       .groupBy(F.window("event_time", "1 minute"))
       .agg(
           F.count("*").alias("n_txn"),
           F.sum("is_alert").alias("n_alert"),
           F.sum(F.when(F.col("Class")==1, 1).otherwise(0)).alias("n_fraud_true"),
           F.sum(F.when((F.col("is_alert")==1) & (F.col("Class")==1), 1).otherwise(0)).alias("tp"),
           F.sum(F.when((F.col("is_alert")==1) & (F.col("Class")==0), 1).otherwise(0)).alias("fp"),
       )
       .withColumn("precision", F.when(F.col("n_alert")>0, F.col("tp")/F.col("n_alert")).otherwise(F.lit(None)))
       .withColumn("recall", F.when(F.col("n_fraud_true")>0, F.col("tp")/F.col("n_fraud_true")).otherwise(F.lit(None)))
      )

q_kpi = (kpi.writeStream
         .format("console")
         .option("truncate", "false")
         .option("checkpointLocation", str(CKPT / "kpi_console"))
         .outputMode("update")
         .start())

print("Started KPI console stream.")

Started KPI console stream.


### 6.1 Monitor

In [10]:
q_kpi.status, q_kpi.lastProgress

({'message': 'Initializing sources',
  'isDataAvailable': False,
  'isTriggerActive': True},
 None)

### 6.2 Stop streams (important)

In [11]:
for q in [q_kpi, q_kafka, q_parquet]:
    try:
        q.stop()
    except Exception as e:
        print("stop error:", e)
print("Stopped all streams.")

Stopped all streams.


## 7) Audit outputs (offline)
Read the Parquet alerts and compute basic counts. If no alerts were written, `COUNT(*)` is 0 and `SUM` over the empty set returns NULL.


In [12]:
alerts_saved = spark.read.parquet(str(OUT))
alerts_saved.createOrReplaceTempView("alerts_saved")

spark.sql('''
SELECT
  COUNT(*) AS n_alert_rows,
  SUM(CASE WHEN Class=1 THEN 1 ELSE 0 END) AS n_fraud_true_in_alerts
FROM alerts_saved
''').show(truncate=False)

+------------+----------------------+
|n_alert_rows|n_fraud_true_in_alerts|
+------------+----------------------+
|0           |NULL                  |
+------------+----------------------+



## 8) Exercises
1) Tune threshold: 0.7 / 0.8 / 0.9 / 0.95. Compare precision/recall.
2) Change watermark to 30 seconds. Explain what changes.
3) Write KPI to Kafka as another topic (optional).
4) Restart the notebook kernel and rerun the streams. Explain how checkpointing changes the behavior.
