# Task C: Streaming Inference + Alerts

**Objectives:**
1. Read transactions from the Kafka topic `transactions`
2. Parse the JSON messages and apply the ML model for real-time scoring
3. Generate fraud alerts when the fraud probability meets the threshold
4. Write alerts to:
   - Kafka topic `fraud_alerts`
   - Parquet files (for batch analysis)
5. Monitor and verify the streaming pipeline

In [1]:
from pathlib import Path
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, TimestampType
from pyspark.ml import PipelineModel
import pyspark
import json

# Setup paths
WORK = Path("/home/jovyan/work")
MODELS = WORK / "models"
OUT = WORK / "output_alerts"
CKPT = WORK / "checkpoints"

for p in [MODELS, OUT, CKPT]:
    p.mkdir(parents=True, exist_ok=True)

print(f"Paths configured:")
print(f"  Models: {MODELS}")
print(f"  Alerts output: {OUT}")
print(f"  Checkpoints: {CKPT}")

Paths configured:
  Models: /home/jovyan/work/models
  Alerts output: /home/jovyan/work/output_alerts
  Checkpoints: /home/jovyan/work/checkpoints


## 1. Initialize Spark with Kafka Support

In [2]:
spark_version = pyspark.__version__
kafka_pkg = f"org.apache.spark:spark-sql-kafka-0-10_2.12:{spark_version}"

spark = (SparkSession.builder
         .appName("TaskC-Streaming-Fraud-Detection")
         .master("local[*]")
         .config("spark.sql.shuffle.partitions", "8")
         .config("spark.jars.packages", kafka_pkg)
         .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true")
         .getOrCreate())

spark.sparkContext.setLogLevel("WARN")

print("="*80)
print("SPARK SESSION INITIALIZED")
print("="*80)
print(f"Spark version: {spark_version}")
print(f"Kafka package: {kafka_pkg}")
print("="*80)

SPARK SESSION INITIALIZED
Spark version: 3.5.0
Kafka package: org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0
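
The connector coordinate above hard-codes the Scala suffix `_2.12`. That matches the output shown here, but it would not match a Spark build compiled for Scala 2.13. A minimal sketch of a helper that keeps both parts explicit; `kafka_package` and its `scala_version` parameter are hypothetical names, not part of PySpark:

```python
def kafka_package(spark_version: str, scala_version: str = "2.12") -> str:
    """Build the Maven coordinate for the Structured Streaming Kafka connector.

    Assumption: the caller knows which Scala version the local Spark build uses.
    """
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


print(kafka_package("3.5.0"))
```

In the notebook, the result would replace the `kafka_pkg` f-string, for example `kafka_pkg = kafka_package(pyspark.__version__)`.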


## 2. Load Model Selection Summary

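Load the model selection summary from Task A to get the selected model's name and the recommended alert threshold. The model itself is loaded in Section 6.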

In [3]:
# Load model selection summary from Task A
audit_path = WORK / "audit_results" / "model_selection_summary.json"

if audit_path.exists():
    with open(audit_path, "r") as f:
        model_summary = json.load(f)
    
    selected_model_name = model_summary["selected_model"]
    pr_auc = model_summary["pr_auc"]
    recommended_threshold = model_summary["recommended_threshold"]
    
    print("Model Selection Summary (from Task A):")
    print(f"  Selected Model: {selected_model_name}")
    print(f"  PR-AUC: {pr_auc:.6f}")
    print(f"  Recommended Threshold: {recommended_threshold:.2f}")
    
    # Determine model path
    if "Logistic" in selected_model_name:
        model_path = str(MODELS / "fraud_lr_model")
    else:
        model_path = str(MODELS / "fraud_rf_model")
else:
    print("WARNING: Model selection summary not found. Using defaults.")
    model_path = str(MODELS / "fraud_lr_model")
    recommended_threshold = 0.5

print(f"\nModel path: {model_path}")
print(f"Alert threshold: {recommended_threshold}")

Model Selection Summary (from Task A):
  Selected Model: Random Forest
  PR-AUC: 0.784108
  Recommended Threshold: 0.50

Model path: /home/jovyan/work/models/fraud_rf_model
Alert threshold: 0.5


## 3. Define Transaction Schema

In [4]:
# Define schema for incoming JSON messages
schema_fields = [
    StructField("event_time", TimestampType(), True),
    StructField("Time", DoubleType(), True),
]

for i in range(1, 29):
    schema_fields.append(StructField(f"V{i}", DoubleType(), True))

schema_fields += [
    StructField("Amount", DoubleType(), True),
    StructField("Class", IntegerType(), True),
]

txn_schema = StructType(schema_fields)

print("Transaction schema defined:")
print(f"  Total fields: {len(schema_fields)}")
print(f"  Feature fields (V1-V28 + Amount): 29")
print(f"  Label field: Class")
print(f"  Event time field: event_time")

Transaction schema defined:
  Total fields: 32
  Feature fields (V1-V28 + Amount): 29
  Label field: Class
  Event time field: event_time
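
For reference, a message on `transactions` that matches this schema could be built as below. This is a sketch with placeholder values (all zeros), not data from the real stream. `make_transaction_message` is a hypothetical helper, and the ISO-8601 `event_time` format is an assumption about what the producer sends.

```python
import json
from datetime import datetime, timezone


def make_transaction_message(event_time: datetime) -> bytes:
    """Serialize one placeholder transaction with the 32 fields of txn_schema."""
    record = {"event_time": event_time.isoformat(), "Time": 0.0}
    # V1..V28 are the feature columns; zeros are placeholders only.
    record.update({f"V{i}": 0.0 for i in range(1, 29)})
    record.update({"Amount": 0.0, "Class": 0})
    return json.dumps(record).encode("utf-8")


msg = make_transaction_message(datetime(2024, 1, 1, tzinfo=timezone.utc))
decoded = json.loads(msg)
print(len(decoded), list(decoded)[:3])
```

The bytes returned here correspond to the Kafka message `value` that the parsing step casts to a string and passes to `from_json`.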


## 4. Read Streaming Data from Kafka

In [5]:
BOOTSTRAP = "kafka:9092"
TOPIC_IN = "transactions"
TOPIC_OUT = "fraud_alerts"

print(f"Kafka configuration:")
print(f"  Bootstrap server: {BOOTSTRAP}")
print(f"  Input topic: {TOPIC_IN}")
print(f"  Output topic: {TOPIC_OUT}")

# Read from Kafka
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", BOOTSTRAP)
       .option("subscribe", TOPIC_IN)
       .option("startingOffsets", "earliest")
       .load())

print(f"\nStream created: {raw.isStreaming}")
print(f"Schema: {raw.schema.simpleString()}")

Kafka configuration:
  Bootstrap server: kafka:9092
  Input topic: transactions
  Output topic: fraud_alerts

Stream created: True
Schema: struct<key:binary,value:binary,topic:string,partition:int,offset:bigint,timestamp:timestamp,timestampType:int>


## 5. Parse JSON and Extract Fields

In [6]:
# Kafka value is bytes → convert to string → parse JSON → extract columns
parsed = (raw
          .selectExpr("CAST(value AS STRING) AS json_str")
          .select(F.from_json(F.col("json_str"), txn_schema).alias("data"))
          .select("data.*"))

print("Parsed stream schema:")
parsed.printSchema()

Parsed stream schema:
root
 |-- event_time: timestamp (nullable = true)
 |-- Time: double (nullable = true)
 |-- V1: double (nullable = true)
 |-- V2: double (nullable = true)
 |-- V3: double (nullable = true)
 |-- V4: double (nullable = true)
 |-- V5: double (nullable = true)
 |-- V6: double (nullable = true)
 |-- V7: double (nullable = true)
 |-- V8: double (nullable = true)
 |-- V9: double (nullable = true)
 |-- V10: double (nullable = true)
 |-- V11: double (nullable = true)
 |-- V12: double (nullable = true)
 |-- V13: double (nullable = true)
 |-- V14: double (nullable = true)
 |-- V15: double (nullable = true)
 |-- V16: double (nullable = true)
 |-- V17: double (nullable = true)
 |-- V18: double (nullable = true)
 |-- V19: double (nullable = true)
 |-- V20: double (nullable = true)
 |-- V21: double (nullable = true)
 |-- V22: double (nullable = true)
 |-- V23: double (nullable = true)
 |-- V24: double (nullable = true)
 |-- V25: double (nullable = true)
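 |-- V26: double (nullable = true)
 |-- V27: double (nullable = true)
 |-- V28: double (nullable = true)
 |-- Amount: double (nullable = true)
 |-- Class: integer (nullable = true)
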
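
With its default settings, `from_json` yields nulls for messages it cannot parse, and a `VectorAssembler` stage raises on null inputs unless configured otherwise, so a single malformed message can fail the scoring step. A minimal sketch of a guard, assuming the model's features are `V1` to `V28` plus `Amount` (as the schema cell prints). `drop_incomplete` is a hypothetical helper that relies only on `DataFrame.dropna`:

```python
FEATURE_COLS = [f"V{i}" for i in range(1, 29)] + ["Amount"]


def drop_incomplete(df, required_cols=FEATURE_COLS):
    """Return df without rows that have a null in any required column.

    `df` is expected to be a Spark DataFrame (batch or streaming);
    dropna is a plain row filter, so no streaming state is needed.
    """
    return df.dropna(subset=list(required_cols))


# Usage in the notebook (assumes `parsed` from the cell above):
# parsed_clean = drop_incomplete(parsed)
```

Dropping rows silently hides producer problems, so routing rejected messages to a separate sink may be preferable in a real deployment.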

## 6. Load ML Model and Apply Scoring

In [7]:
# Load saved model
loaded_model = PipelineModel.load(model_path)
print(f"Model loaded from: {model_path}")
print(f"Model stages: {[type(stage).__name__ for stage in loaded_model.stages]}")

# Define UDF to extract probability for fraud class (class 1)
def extract_prob(v):
    """Return P(class = 1) from the model's probability vector, or 0.0 on failure."""
    try:
        return float(v[1])
    except Exception:
        # Null or malformed vectors fall back to 0.0
        return 0.0

extract_prob_udf = F.udf(extract_prob, DoubleType())

# Apply model to streaming data
scored = loaded_model.transform(parsed)
scored = scored.withColumn("fraud_prob", extract_prob_udf(F.col("probability")))

print("\nScoring applied to stream")
print("Additional columns: prediction, rawPrediction, probability, fraud_prob")

Model loaded from: /home/jovyan/work/models/fraud_rf_model
Model stages: ['VectorAssembler', 'RandomForestClassificationModel']

Scoring applied to stream
Additional columns: prediction, rawPrediction, probability, fraud_prob
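
The UDF's fallback is worth keeping in mind: any row whose `probability` cannot be indexed gets `fraud_prob = 0.0`. The sketch below exercises a copy of the function on plain Python lists, which stand in for Spark ML vectors here (an assumption made so the example runs without Spark).

```python
def extract_prob(v):
    """Return the class-1 probability, or 0.0 if it cannot be read."""
    try:
        return float(v[1])
    except Exception:
        return 0.0


print(extract_prob([0.8, 0.2]))  # second element is the fraud-class probability
print(extract_prob(None))        # unreadable input falls back to 0.0
print(extract_prob([1.0]))       # a too-short vector also falls back to 0.0
```

On Spark 3.0 and later, `pyspark.ml.functions.vector_to_array` is a built-in alternative to a Python UDF for this extraction.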


## 7. Generate Alerts

In [8]:
# Generate alert flag based on threshold
threshold = recommended_threshold

alerts = (scored
          .withColumn("is_alert", (F.col("fraud_prob") >= F.lit(threshold)).cast("int"))
          .filter(F.col("is_alert") == 1)
          .select(
              "event_time",
              "Amount",
              "Class",
              "fraud_prob",
              "is_alert",
              "prediction"
          ))

print(f"Alert threshold: {threshold:.2f}")
print(f"Alerts stream created (filtered for fraud_prob >= {threshold:.2f})")

Alert threshold: 0.50
Alerts stream created (filtered for fraud_prob >= 0.50)


## 8. Write Alerts to Kafka

In [9]:
# Convert alerts to JSON for Kafka
alerts_json = alerts.select(
    F.to_json(F.struct(
        F.col("event_time").cast("string").alias("event_time"),
        "Amount",
        "Class",
        "fraud_prob",
        "is_alert",
        "prediction"
    )).alias("value")
)

# Start writing to Kafka
query_kafka = (alerts_json.writeStream
               .format("kafka")
               .option("kafka.bootstrap.servers", BOOTSTRAP)
               .option("topic", TOPIC_OUT)
               .option("checkpointLocation", str(CKPT / "alerts_to_kafka"))
               .outputMode("append")
               .start())

print(f"Started streaming alerts to Kafka topic: {TOPIC_OUT}")
print(f"Query name: {query_kafka.name}")
print(f"Query ID: {query_kafka.id}")

Started streaming alerts to Kafka topic: fraud_alerts
Query name: None
Query ID: 31b47a62-47bc-435d-ab99-0964d62bf830
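
A downstream consumer of `fraud_alerts` receives each alert as a UTF-8 JSON object built from the six fields selected above. A minimal sketch of decoding and sanity-checking one payload; `parse_alert` is a hypothetical helper, and the payload values are made up for illustration, not taken from a real run:

```python
import json

ALERT_FIELDS = ("event_time", "Amount", "Class", "fraud_prob", "is_alert", "prediction")


def parse_alert(value: bytes) -> dict:
    """Decode one Kafka message value and check that all alert fields are present."""
    alert = json.loads(value.decode("utf-8"))
    missing = [name for name in ALERT_FIELDS if name not in alert]
    if missing:
        raise ValueError(f"alert is missing fields: {missing}")
    return alert


# Illustrative payload only; the numbers are invented.
sample = (b'{"event_time": "2024-01-01 00:00:00", "Amount": 12.5, "Class": 1, '
          b'"fraud_prob": 0.91, "is_alert": 1, "prediction": 1.0}')
alert = parse_alert(sample)
print(alert["fraud_prob"], alert["is_alert"])
```

By default in Spark 3, `to_json` omits fields whose value is null, so a consumer may prefer to treat some of these fields as optional rather than raise.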


## 9. Write Alerts to Parquet

In [10]:
# Write to Parquet for batch analysis
query_parquet = (alerts.writeStream
                 .format("parquet")
                 .option("path", str(OUT))
                 .option("checkpointLocation", str(CKPT / "alerts_to_parquet"))
                 .outputMode("append")
                 .start())

print(f"Started streaming alerts to Parquet: {OUT}")
print(f"Query name: {query_parquet.name}")
print(f"Query ID: {query_parquet.id}")

Started streaming alerts to Parquet: /home/jovyan/work/output_alerts
Query name: None
Query ID: fa7a7ed4-21f6-4158-a7ee-b993c7ceea5f


## 10. Write Console Output (for monitoring)

In [11]:
# Optional: Write to console for real-time monitoring
query_console = (alerts.writeStream
                 .format("console")
                 .option("truncate", "false")
                 .option("numRows", "10")
                 .outputMode("append")
                 .start())

print("Started console output for monitoring")

Started console output for monitoring


## 11. Monitor Streaming Queries

In [12]:
import time

# Monitor query status
print("="*80)
print("STREAMING QUERIES STATUS")
print("="*80)

queries = [query_kafka, query_parquet, query_console]
query_names = ["Kafka", "Parquet", "Console"]

for name, q in zip(query_names, queries):
    status = q.status
    print(f"\n{name} Query:")
    print(f"  ID: {q.id}")
    print(f"  Status: {status['message']}")
    print(f"  Is Active: {q.isActive}")
    print(f"  Data Available: {status.get('isDataAvailable', 'N/A')}")

print("\n" + "="*80)
print("Queries are running. Let them process for 30 seconds...")
print("="*80)
time.sleep(30)

STREAMING QUERIES STATUS

Kafka Query:
  ID: 31b47a62-47bc-435d-ab99-0964d62bf830
  Status: Getting offsets from KafkaV2[Subscribe[transactions]]
  Is Active: True
  Data Available: False

Parquet Query:
  ID: fa7a7ed4-21f6-4158-a7ee-b993c7ceea5f
  Status: Getting offsets from KafkaV2[Subscribe[transactions]]
  Is Active: True
  Data Available: False

Console Query:
  ID: 5fe50ba0-390d-4f31-aa0d-681a6dc754a5
  Status: Getting offsets from KafkaV2[Subscribe[transactions]]
  Is Active: True
  Data Available: False

Queries are running. Let them process for 30 seconds...


In [13]:
# Check progress
print("\n" + "="*80)
print("QUERY PROGRESS")
print("="*80)

for name, q in zip(query_names, queries):
    progress = q.lastProgress
    if progress:
        print(f"\n{name} Query Progress:")
        print(f"  Batch: {progress.get('batchId', 'N/A')}")
        print(f"  Input rows: {progress.get('numInputRows', 'N/A')}")
        print(f"  Processing time: {progress.get('durationMs', {}).get('triggerExecution', 'N/A')} ms")
    else:
        print(f"\n{name} Query: No progress data yet")


QUERY PROGRESS

Kafka Query: No progress data yet

Parquet Query: No progress data yet

Console Query: No progress data yet
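
A fixed `time.sleep(30)` may end before the first micro-batch completes, which is one possible reason for the "No progress data yet" output above. A sketch of a polling helper that waits until a query reports progress or a timeout expires; `wait_for_progress` and its parameters are hypothetical, and it relies only on the `lastProgress` and `isActive` attributes already used in this notebook:

```python
import time


def wait_for_progress(query, timeout_s: float = 120.0, poll_s: float = 1.0):
    """Poll a streaming query until it reports progress.

    Returns the progress dict, or None if the query stops or the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        progress = query.lastProgress
        if progress:
            return progress
        if not query.isActive:
            return None  # query terminated before producing any progress
        time.sleep(poll_s)
    return None


# Usage in the notebook (assumes the `queries` list defined above):
# for name, q in zip(query_names, queries):
#     print(name, wait_for_progress(q, timeout_s=60))
```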


## 12. Consume Alerts from Kafka (Verification)

In [14]:
# Read alerts from Kafka topic to verify
alerts_from_kafka = (spark.read
                     .format("kafka")
                     .option("kafka.bootstrap.servers", BOOTSTRAP)
                     .option("subscribe", TOPIC_OUT)
                     .option("startingOffsets", "earliest")
                     .load())

# Parse JSON value
alerts_parsed = (alerts_from_kafka
                 .selectExpr("CAST(value AS STRING) AS json_str")
                 .select(F.from_json(F.col("json_str"), 
                                     StructType([
                                         StructField("event_time", TimestampType()),
                                         StructField("Amount", DoubleType()),
                                         StructField("Class", IntegerType()),
                                         StructField("fraud_prob", DoubleType()),
                                         StructField("is_alert", IntegerType()),
                                         StructField("prediction", DoubleType())
                                     ])).alias("alert"))
                 .select("alert.*"))

print("\n" + "="*80)
print("SAMPLE ALERTS FROM KAFKA")
print("="*80)
alerts_parsed.show(10, truncate=False)

alert_count = alerts_parsed.count()
print(f"\nTotal alerts in Kafka topic: {alert_count}")


SAMPLE ALERTS FROM KAFKA
+----------+------+-----+----------+--------+----------+
|event_time|Amount|Class|fraud_prob|is_alert|prediction|
+----------+------+-----+----------+--------+----------+
+----------+------+-----+----------+--------+----------+


Total alerts in Kafka topic: 678
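
In the output above, the sample table is empty while the count is 678. Each action on a batch Kafka read scans the topic again, so `show` and `count` can see different data while the streaming query is still writing. A sketch of taking a single snapshot first; `snapshot` is a hypothetical helper, and Spark caching is best-effort, so this reduces the mismatch rather than guaranteeing it away:

```python
def snapshot(df):
    """Materialize a batch DataFrame once so later actions reuse the same rows.

    Assumes `df` is a batch (non-streaming) Spark DataFrame: cache() marks it
    for caching and count() triggers the read that fills the cache.
    """
    cached = df.cache()
    n_rows = cached.count()
    return cached, n_rows


# Usage in the notebook (assumes `alerts_parsed` from the cell above):
# alerts_snapshot, alert_count = snapshot(alerts_parsed)
# alerts_snapshot.show(10, truncate=False)
# print(f"Total alerts in Kafka topic: {alert_count}")
```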


## 13. Verify Parquet Output

In [15]:
# Check whether the file sink has written any Parquet part files
parquet_files = list(OUT.glob("*.parquet"))

print("\n" + "="*80)
print("PARQUET OUTPUT VERIFICATION")
print("="*80)
print(f"Output directory: {OUT}")
print(f"Number of parquet files: {len(parquet_files)}")

if parquet_files:
    # Read alerts from parquet
    alerts_from_parquet = spark.read.parquet(str(OUT))
    
    print("\nSample alerts from Parquet:")
    alerts_from_parquet.show(10, truncate=False)
    
    parquet_count = alerts_from_parquet.count()
    print(f"\nTotal alerts in Parquet: {parquet_count}")
    
    # Statistics
    print("\nAlert Statistics:")
    alerts_from_parquet.describe(["fraud_prob", "Amount"]).show()
    
    # True vs False positives
    print("\nAlert Quality:")
    alerts_from_parquet.groupBy("Class").count().show()
else:
    print("\nNo parquet files found yet. Query may still be processing.")


PARQUET OUTPUT VERIFICATION
Output directory: /home/jovyan/work/output_alerts
Number of parquet files: 3

Sample alerts from Parquet:
+----------+------+-----+----------+--------+
|event_time|Amount|Class|fraud_prob|is_alert|
+----------+------+-----+----------+--------+
+----------+------+-----+----------+--------+


Total alerts in Parquet: 0

Alert Statistics:
+-------+----------+------+
|summary|fraud_prob|Amount|
+-------+----------+------+
|  count|         0|     0|
|   mean|      NULL|  NULL|
| stddev|      NULL|  NULL|
|    min|      NULL|  NULL|
|    max|      NULL|  NULL|
+-------+----------+------+


Alert Quality:
+-----+-----+
|Class|count|
+-----+-----+
+-----+-----+



## 14. Alert Analysis

In [16]:
if parquet_files:
    print("\n" + "="*80)
    print("ALERT ANALYSIS")
    print("="*80)
    
    # True Positives vs False Positives
    tp = alerts_from_parquet.filter(F.col("Class") == 1).count()
    fp = alerts_from_parquet.filter(F.col("Class") == 0).count()
    total_alerts = tp + fp
    
    if total_alerts > 0:
        precision = tp / total_alerts
        print(f"\nAlert Breakdown:")
        print(f"  Total alerts: {total_alerts}")
        print(f"  True Positives (TP): {tp}")
        print(f"  False Positives (FP): {fp}")
        print(f"  Precision: {precision:.4f} ({precision*100:.2f}%)")
        
        print(f"\nInterpretation:")
        if precision >= 0.5:
            print(f"  ✓ Good precision: {precision*100:.1f}% of alerts are true fraud")
        else:
            print(f"  ⚠ Low precision: {precision*100:.1f}% of alerts are true fraud")
            print(f"  Consider increasing threshold or improving model")
    else:
        print("\nNo alerts generated yet")
else:
    print("\nWaiting for alert data...")


ALERT ANALYSIS

No alerts generated yet


## 15. Stop Streaming Queries

**IMPORTANT:** Always stop queries before shutting down or restarting.

In [17]:
print("Stopping all streaming queries...")

for name, q in zip(query_names, queries):
    try:
        q.stop()
        print(f"  ✓ {name} query stopped")
    except Exception as e:
        print(f"  ✗ Error stopping {name} query: {e}")

print("\nAll queries stopped.")

Stopping all streaming queries...
  ✓ Kafka query stopped
  ✓ Parquet query stopped
  ✓ Console query stopped

All queries stopped.


## 16. Task C Deliverables

### Deliverable C.1: Screenshot of running stream
Take a screenshot of the console output showing alerts being processed.

### Deliverable C.2: Sample alert messages
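See the "SAMPLE ALERTS FROM KAFKA" output in Section 12. The deliverable needs at least 5 alert examples, so rerun that cell once the stream has processed data if the table is empty.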

Key alert fields:
- `event_time`: When the transaction occurred
- `Amount`: Transaction amount
- `Class`: True label (0=normal, 1=fraud)
- `fraud_prob`: Model's fraud probability
- `is_alert`: Alert flag (1 if fraud_prob >= threshold)
- `prediction`: Model's binary prediction

### Pipeline Summary
- Input: Kafka topic `transactions`
- Processing: ML model scoring in real-time
- Output 1: Kafka topic `fraud_alerts` (for downstream consumers)
- Output 2: Parquet files (for batch analysis)
- Checkpointing: Enabled for fault tolerance