# Stream Processing — Overview

## Purpose
Stream processing enables the continuous ingestion, processing, and analysis of data in real time as it arrives. Unlike batch processing, which operates on bounded datasets, stream processing handles unbounded, continuously flowing data with low latency.

## Key Questions
- What is stream processing and how does it differ from batch processing?
- What are the core concepts of event-driven architecture?
- How do popular stream processing frameworks (Kafka, Flink, Spark Streaming) compare?
- What is the difference between event time and processing time?

---
## What is Stream Processing?

**Stream processing** is a data processing paradigm that processes data records continuously and sequentially as they arrive, rather than waiting for all data to be collected first.

### Characteristics of Stream Processing

| Characteristic | Description |
|----------------|-------------|
| **Real-time** | Data is processed shortly after arrival, with latency ranging from milliseconds to seconds |
| **Unbounded data** | Works with infinite, continuously arriving data streams |
| **Stateful/Stateless** | Can maintain state across events or process each event independently |
| **Fault-tolerant** | Handles failures gracefully with exactly-once or at-least-once guarantees |

### Stream Processing vs Batch Processing

| Aspect | Stream Processing | Batch Processing |
|--------|-------------------|------------------|
| **Data** | Unbounded, continuous | Bounded, finite |
| **Latency** | Milliseconds to seconds | Minutes to hours |
| **Processing** | Record-by-record or micro-batch | Large chunks at once |
| **Use Cases** | Fraud detection, real-time analytics | ETL, reporting, ML training |
| **Complexity** | Higher (state management, ordering) | Lower (simpler semantics) |
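
To make the contrast concrete, here is a minimal sketch in plain Python (no streaming framework involved). The function names `batch_mean` and `streaming_mean` and the sample `readings` are illustrative assumptions: the batch version needs the whole dataset up front, while the streaming version emits an updated result after every record and keeps only constant-size state.

```python
from typing import Iterable, Iterator


def batch_mean(values: list[float]) -> float:
    """Batch style: the whole bounded dataset must exist before computing."""
    return sum(values) / len(values)


def streaming_mean(values: Iterable[float]) -> Iterator[float]:
    """Stream style: emit an updated mean after every record, with O(1) state."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0, 40.0]       # hypothetical sensor readings
running = list(streaming_mean(readings))  # one result per incoming record
```

The last running value equals the batch result; the difference is that intermediate answers are available as soon as each record arrives.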

---
## Event-Driven Architecture Concepts

Event-driven architecture (EDA) is a design pattern where the flow of the program is determined by events—significant changes in state.

### Core Components

```
┌──────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Producer   │────▶│   Event Broker   │────▶│    Consumer      │
│ (Event Source)│     │  (Message Queue) │     │ (Event Handler)  │
└──────────────┘     └──────────────────┘     └──────────────────┘
```

### Key Concepts

| Concept | Description |
|---------|-------------|
| **Event** | An immutable record of something that happened (e.g., "OrderPlaced", "UserClicked") |
| **Producer** | Generates and publishes events to a channel/topic |
| **Consumer** | Subscribes to and processes events from channels/topics |
| **Event Broker** | Middleware that routes events between producers and consumers |
| **Topic/Channel** | Named destination where events are published and consumed |
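
The roles in the table can be sketched with a toy in-memory broker. This is only an illustration of the decoupling idea: `InMemoryBroker`, its methods, and the event fields are hypothetical names, and delivery here is synchronous and non-durable, unlike a real broker.

```python
from collections import defaultdict
from typing import Callable

Event = dict  # hypothetical event shape: a plain dict
Handler = Callable[[Event], None]


class InMemoryBroker:
    """Toy broker: routes each published event to every subscriber of its topic."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        # Synchronous delivery; a real broker would persist and deliver asynchronously.
        for handler in self._subscribers[topic]:
            handler(event)


broker = InMemoryBroker()
shipped: list[str] = []
audit_log: list[str] = []

# Two independent consumers of the same topic; the producer knows neither.
broker.subscribe("orders", lambda e: shipped.append(e["order_id"]))
broker.subscribe("orders", lambda e: audit_log.append(e["type"]))

broker.publish("orders", {"type": "OrderPlaced", "order_id": "o-1"})
broker.publish("clicks", {"type": "UserClicked", "user_id": "u-7"})  # no subscribers
```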

### Event Processing Patterns

1. **Simple Event Processing**: React to individual events immediately
2. **Event Stream Processing**: Process continuous streams of events
3. **Complex Event Processing (CEP)**: Detect patterns across multiple events over time
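
As a rough illustration of the third pattern, the sketch below flags a user after three failed logins within 60 seconds. The event tuple layout, the `LoginFailed` type name, and the thresholds are assumptions chosen for the example, and the input is assumed to be ordered by timestamp.

```python
from collections import defaultdict, deque


def detect_repeated_failures(events, threshold=3, window_seconds=60):
    """Return (user, timestamp) alerts when `threshold` failures fall within the window.

    `events` is an iterable of (timestamp_seconds, user, event_type) tuples,
    assumed to be ordered by timestamp.
    """
    recent = defaultdict(deque)  # user -> timestamps of recent failures
    alerts = []
    for ts, user, event_type in events:
        if event_type != "LoginFailed":
            continue
        failures = recent[user]
        failures.append(ts)
        # Drop failures that fell out of the time window.
        while failures and ts - failures[0] > window_seconds:
            failures.popleft()
        if len(failures) >= threshold:
            alerts.append((user, ts))
            failures.clear()  # reset so one burst produces one alert
    return alerts


events = [
    (0, "alice", "LoginFailed"),
    (10, "bob", "LoginFailed"),
    (20, "alice", "LoginFailed"),
    (30, "alice", "LoginSucceeded"),
    (45, "alice", "LoginFailed"),
    (200, "bob", "LoginFailed"),
    (210, "bob", "LoginFailed"),
]
alerts = detect_repeated_failures(events)
```

Only `alice` triggers an alert: her three failures fall within 60 seconds, while `bob` never has more than two failures inside one window.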

---
## Stream Processing Frameworks Overview

### Apache Kafka

**Kafka** is a distributed event streaming platform designed for high-throughput, fault-tolerant messaging.

```
┌─────────────────────────────────────────────────────────┐
│                    Kafka Cluster                        │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐                 │
│  │ Broker 1│  │ Broker 2│  │ Broker 3│                 │
│  │         │  │         │  │         │                 │
│  │Topic A  │  │Topic A  │  │Topic B  │                 │
│  │Partition│  │Partition│  │Partition│                 │
│  │  0,1    │  │  2,3    │  │  0,1,2  │                 │
│  └─────────┘  └─────────┘  └─────────┘                 │
└─────────────────────────────────────────────────────────┘
```

**Key Features:**
- Distributed log-based storage
- Horizontal scalability via partitions
- Consumer groups for parallel processing
- Kafka Streams for stream processing within the Kafka ecosystem
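
The interplay of partitions and consumer groups can be sketched without a Kafka client. In this simplified model, `partition_for` and `assign_partitions` are hypothetical helpers: CRC32 is used as a stand-in hash (Kafka's own partitioner uses a different hash function), and the round-robin assignment is only one of several possible strategies.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. CRC32 is a stand-in, not Kafka's actual hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions round-robin across the members of one consumer group."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


NUM_PARTITIONS = 4

# The same key always lands on the same partition, which preserves per-key order.
same_key = {partition_for("user-42", NUM_PARTITIONS) for _ in range(3)}

# Each partition is owned by exactly one consumer in the group.
assignment = assign_partitions(list(range(NUM_PARTITIONS)), ["consumer-a", "consumer-b"])
```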

---

### Apache Flink

**Flink** is a distributed stream processing framework with true event-at-a-time processing.

**Key Features:**
- Native streaming (not micro-batch)
- Advanced windowing and event time support
- Exactly-once state consistency
- Unified batch and streaming API

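```python
# Flink conceptual example (PyFlink)
from pyflink.datastream import StreamExecutionEnvironment

# Entry point for building and running a DataStream job
env = StreamExecutionEnvironment.get_execution_environment()

# A bounded in-memory collection stands in for a real unbounded source
stream = env.from_collection([1, 2, 3, 4, 5])

# Double each element, then keep only results greater than 4 (6, 8, 10)
result = stream.map(lambda x: x * 2).filter(lambda x: x > 4)

# Print to stdout; nothing runs until execute() submits the job
result.print()
env.execute("Flink Example")
```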

---

### Apache Spark Streaming

**Spark Streaming** (the legacy DStream API) and its successor, **Structured Streaming**, process data in micro-batches by default using the Spark engine.

**Key Features:**
- Micro-batch processing model
- Integration with Spark ecosystem (MLlib, SQL)
- DataFrame/Dataset API for streaming
- End-to-end exactly-once guarantees when checkpointing is combined with replayable sources and idempotent sinks

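```python
# Spark Structured Streaming conceptual example
# Assumes the Kafka connector (spark-sql-kafka-0-10) is on the classpath
# and a broker is reachable at localhost:9092.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamExample").getOrCreate()

# Read from Kafka; each row exposes binary `key` and `value` columns
stream_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load()

# Decode the payload so the console output is readable
decoded_df = stream_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Process and write
query = decoded_df.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

# Block until the query stops; a standalone script would otherwise exit after start()
query.awaitTermination()
```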

---

### Framework Comparison

| Feature | Kafka Streams | Apache Flink | Spark Streaming |
|---------|---------------|--------------|------------------|
| **Processing Model** | Record-at-a-time | Record-at-a-time | Micro-batch |
| **Latency** | Very low | Very low | Low (seconds) |
| **State Management** | RocksDB (default), in-memory | Heap, RocksDB | In-memory (checkpoint-backed), RocksDB |
| **Exactly-once** | ✅ | ✅ | ✅ |
| **Deployment** | Lightweight (library) | Cluster | Cluster |
| **Best For** | Kafka-centric apps | Complex event processing | Spark ecosystem users |

---
## Ordering, Keys, and Backpressure

**Ordering** is only guaranteed **within a partition** (Kafka) or **keyed stream** (Flink).
- Choose partition keys carefully to preserve order where needed.
- Cross-partition ordering requires additional coordination.

**Keys and parallelism:**
- Partitioning by key enables **stateful per-entity** processing.
- Skewed keys (a few keys receiving most of the traffic) create hot partitions, while very high key cardinality can cause state blow-up.

**Backpressure** protects the system when downstream is slow.
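- Systems either propagate backpressure upstream so that sources slow down, or buffer with bounded limits. With a durable log such as Kafka, a slow consumer typically shows up as growing consumer lag rather than a blocked producer.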
- Watch for excessive lag, large queues, or growing watermark delays.
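
The "buffer with limits" idea can be sketched with a bounded queue from the standard library. This is a single-threaded toy, not how any particular framework implements backpressure; `try_publish` and the buffer size of 3 are assumptions for the example.

```python
import queue

buffer = queue.Queue(maxsize=3)  # bounded buffer between producer and consumer


def try_publish(item) -> bool:
    """Return False instead of blocking when the buffer is full.

    A False result is the backpressure signal: the producer should slow down,
    retry later, or shed load.
    """
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        return False


accepted = [try_publish(i) for i in range(5)]  # consumer is idle, so the buffer fills up
drained = buffer.get_nowait()                  # consumer catches up by one item
accepted_after_drain = try_publish(99)         # producer may proceed again
```

A blocking `put()` would be the alternative design: the producer simply waits until space is available, which propagates the slowdown upstream automatically.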

---
## Reliability, State, and Fault Tolerance

Stream processors often keep **state** (e.g., counts, windows). Correctness depends on how state is stored and recovered.

**Key mechanisms:**
- **Checkpointing**: periodic snapshots of state to durable storage.
- **State backends**: RocksDB (disk), in-memory + spill, or external stores.
- **Exactly-once processing**: coordinated checkpoints + transactional sinks.
- **Idempotent sinks**: make repeated writes harmless, so duplicates caused by at-least-once delivery do not change the result.
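
A minimal sketch of an idempotent sink, assuming every event carries a unique identifier (the `event_id` field and the `IdempotentSink` class are hypothetical): writes are keyed upserts, so a redelivered event overwrites itself instead of being counted twice.

```python
class IdempotentSink:
    """Keyed upsert sink: writing the same event twice leaves one record."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def write(self, event: dict) -> None:
        # `event_id` is a hypothetical unique identifier carried by each event.
        self.rows[event["event_id"]] = event


sink = IdempotentSink()
delivered = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 5},
    {"event_id": "e1", "amount": 10},  # redelivered after a retry
]
for event in delivered:
    sink.write(event)

total = sum(row["amount"] for row in sink.rows.values())
```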

**Failure recovery flow:**
1. Detect failure and restart tasks
2. Restore from latest checkpoint
3. Reprocess events after the checkpoint offset
4. Resume normal processing
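
The recovery flow above can be simulated in a few lines. This sketch assumes a replayable source (a Python list standing in for a log partition) and a hypothetical `CountingJob` class; the key point it illustrates is that state and source offset are snapshotted together, so replay after a restore does not double count.

```python
import copy


class CountingJob:
    """Toy stateful job: counts events per key and checkpoints (state, offset) together."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}
        self.offset = 0  # index of the next event to read from the log
        self._checkpoint = ({}, 0)

    def process(self, log: list[str], upto: int) -> None:
        while self.offset < upto:
            key = log[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def checkpoint(self) -> None:
        # State and offset are snapshotted as a pair so they never disagree.
        self._checkpoint = (copy.deepcopy(self.counts), self.offset)

    def restore(self) -> None:
        state, offset = self._checkpoint
        self.counts, self.offset = copy.deepcopy(state), offset


log = ["a", "b", "a", "c", "a"]  # replayable source
job = CountingJob()
job.process(log, upto=3)
job.checkpoint()                 # snapshot: counts={'a': 2, 'b': 1}, offset=3
job.process(log, upto=4)         # progress that is lost in the simulated crash
job.restore()                    # steps 1-2: restart and restore the snapshot
job.process(log, upto=len(log))  # steps 3-4: replay from offset 3, then continue
```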

---
## Event Time vs Processing Time

Understanding time semantics is crucial for correct stream processing results.

### Time Concepts

| Time Type | Definition | Example |
|-----------|------------|----------|
| **Event Time** | When the event actually occurred (embedded in the event) | User clicked at 10:00:00 |
| **Processing Time** | When the event is processed by the system | System processes click at 10:00:05 |
| **Ingestion Time** | When the event enters the streaming system | Event arrives at broker at 10:00:02 |

### Why Event Time Matters

```
Event Time:      10:00:00   10:00:01   10:00:02   10:00:03
                    │          │          │          │
Events:            [A]        [B]        [C]        [D]
                    │          │          │          │
Arrival Order:     [A]        [C]        [B]        [D]  ← Out of order!
                    │          │          │          │
Processing Time: 10:00:05  10:00:06   10:00:07   10:00:08
```

**Challenges with Event Time:**
- Events can arrive out of order
- Events can be delayed (network issues, mobile devices offline)
- Need to handle "late" data
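
A small sketch of why the choice of time matters: the same four events are counted per minute, once by event time and once by processing time. The timestamps are made-up values in seconds, and `counts_per_minute` is a hypothetical helper.

```python
from collections import Counter

# (event_time, processing_time) in seconds; hypothetical values.
# The third event occurred in minute 0 but was processed in minute 1.
events = [(10, 12), (50, 55), (58, 65), (70, 72)]


def counts_per_minute(timestamps):
    """Count timestamps per 60-second bucket, keyed by bucket index."""
    return dict(Counter(ts // 60 for ts in timestamps))


by_event_time = counts_per_minute(e for e, _ in events)
by_processing_time = counts_per_minute(p for _, p in events)
```

Bucketing by event time attributes three events to minute 0, while bucketing by processing time attributes only two, because the delayed event is counted in the minute it happened to be processed.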

### Watermarks

**Watermarks** are a mechanism to track progress in event time and handle late data.

```
Watermark W = "No more events with event time ≤ W are expected; any that arrive later are late"

Event Stream: ──[E1:10:00]──[E3:10:02]──[E2:10:01]──[W:10:02]──[E4:10:03]──▶
                                                      │
                                           Watermark triggers window close
```
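
One common heuristic is to let the watermark trail the largest event time seen so far by a fixed bound. The sketch below illustrates that idea only; the class name, the 2-second bound, and the integer timestamps are assumptions, and real frameworks differ in details such as how often watermarks are emitted.

```python
class BoundedOutOfOrdernessWatermark:
    """Watermark = max event time seen so far minus an allowed out-of-orderness bound."""

    def __init__(self, max_out_of_orderness: int) -> None:
        self.max_out_of_orderness = max_out_of_orderness
        self.max_event_time = None

    def current(self):
        """Return the current watermark, or None before any event is seen."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.max_out_of_orderness

    def observe(self, event_time: int) -> bool:
        """Record an event; return True if it is late (at or behind the watermark)."""
        watermark = self.current()
        late = watermark is not None and event_time <= watermark
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return late


wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness=2)
arrivals = [100, 102, 101, 105, 102]  # event times in arrival order (hypothetical)
late_flags = [wm.observe(t) for t in arrivals]
```

The event at 101 is out of order but still within the bound, so it is accepted; the second 102 arrives after the watermark has advanced to 103 and is therefore flagged as late.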

### Windowing with Event Time

| Window Type | Description |
|-------------|-------------|
| **Tumbling** | Fixed-size, non-overlapping windows |
| **Sliding** | Fixed-size, overlapping windows |
| **Session** | Dynamic windows based on activity gaps |

```
Tumbling Windows (5 min):
├────────────┼────────────┼────────────┤
│  Window 1  │  Window 2  │  Window 3  │
│ 10:00-10:05│ 10:05-10:10│ 10:10-10:15│

Sliding Windows (5 min window, 2 min slide):
├────────────┤
│  Window 1  │
│ 10:00-10:05│
    ├────────────┤
    │  Window 2  │
    │ 10:02-10:07│
        ├────────────┤
        │  Window 3  │
        │ 10:04-10:09│
```
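
Window assignment reduces to simple arithmetic on timestamps. The sketch below uses integer minutes past 10:00 to match the diagrams; the function names are hypothetical, windows are treated as half-open `[start, end)` intervals, and the session helper returns the first and last event time of each session.

```python
def tumbling_window(ts: int, size: int) -> tuple[int, int]:
    """Return the [start, end) tumbling window containing timestamp `ts`."""
    start = ts - (ts % size)
    return (start, start + size)


def sliding_windows(ts: int, size: int, slide: int) -> list[tuple[int, int]]:
    """Return every [start, end) sliding window containing `ts`, earliest first."""
    windows = []
    start = ts - (ts % slide)  # latest window start at or before ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def session_windows(timestamps: list[int], gap: int) -> list[tuple[int, int]]:
    """Group timestamps into sessions separated by more than `gap` of inactivity."""
    sessions: list[tuple[int, int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] <= gap:
            sessions[-1] = (sessions[-1][0], ts)  # extend the current session
        else:
            sessions.append((ts, ts))             # start a new session
    return sessions


tumbling = tumbling_window(7, size=5)            # 10:07 falls in 10:05-10:10
sliding = sliding_windows(4, size=5, slide=2)    # 10:04 falls in three windows
sessions = session_windows([1, 2, 3, 10, 11, 30], gap=5)
```

Note that an event belongs to exactly one tumbling window but to several sliding windows, which is why sliding aggregations cost more state and computation.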

---
## Takeaway

### Key Points

1. **Stream processing** handles unbounded, continuously flowing data with low latency—ideal for real-time analytics, monitoring, and event-driven systems

2. **Event-driven architecture** decouples producers and consumers through an event broker, enabling scalable and loosely coupled systems

3. **Choose your framework wisely:**
   - **Kafka Streams**: Lightweight, Kafka-native applications
   - **Apache Flink**: Complex event processing, low-latency requirements
   - **Spark Streaming**: When you need the broader Spark ecosystem

4. **Event time vs processing time** is a critical distinction—use event time when correctness matters and handle late data with watermarks

5. **Windowing** enables aggregations over infinite streams by grouping events into finite chunks

### When to Use Stream Processing

| Use Case | Why Stream Processing |
|----------|----------------------|
| Fraud detection | Immediate response needed |
| Real-time dashboards | Live metrics and monitoring |
| IoT sensor data | Continuous data from devices |
| Log aggregation | Centralized, real-time log analysis |
| Recommendation engines | React to user behavior instantly |