# Stream Processing Windows

## Overview

In stream processing, **windows** are fundamental mechanisms for grouping unbounded streams of data into finite chunks for aggregation and analysis. Since streaming data is continuous and potentially infinite, we need windows to define temporal boundaries for computations.

### Why Windows Matter

- **Bounded Computation**: Transform infinite streams into finite, processable chunks
- **Time-Based Aggregations**: Calculate metrics like "events per minute" or "average over 5 minutes"
- **State Management**: Limit memory usage by processing data in bounded segments
- **Real-time Analytics**: Enable continuous queries over streaming data

### Event Time vs Processing Time

| Concept | Description | Use Case |
|---------|-------------|----------|
| **Event Time** | When the event actually occurred (embedded in data) | Accurate historical analysis |
| **Processing Time** | When the event is processed by the system | Simple, low-latency scenarios |
| **Ingestion Time** | When the event enters the streaming system | Compromise between the two |
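
To make the distinction concrete, the sketch below assigns one event to a one-minute window twice: once by its event time and once by its processing time. The `Event` dataclass and the `minute_window` helper are illustrative names, not part of any framework.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Event:
    event_time: datetime       # when the event occurred (set by the producer)
    processing_time: datetime  # when the pipeline handled it


def minute_window(ts: datetime) -> datetime:
    """Return the start of the one-minute window containing ts."""
    return ts.replace(second=0, microsecond=0)


# An event created at 10:00:58 that reaches the pipeline 5 seconds later
occurred = datetime(2024, 1, 1, 10, 0, 58, tzinfo=timezone.utc)
event = Event(event_time=occurred, processing_time=occurred + timedelta(seconds=5))

print(minute_window(event.event_time).time())       # 10:00:00
print(minute_window(event.processing_time).time())  # 10:01:00
```

A five-second delay is enough to move the event into the next window when processing time is used, which is why event time is preferred when results must reflect when things actually happened.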

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from collections import defaultdict
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

---

## 1. Tumbling Windows

**Tumbling windows** are fixed-size, non-overlapping windows that partition the stream into discrete segments.

### Characteristics
- Fixed window size (e.g., 5 minutes)
- No overlap between windows
- Each event belongs to exactly one window
- Windows are aligned to epoch (or configurable offset)

```
Window Size: 10

Time:     0    5    10   15   20   25   30
          |----|----|----|----|----|----|---->
Windows:  [---W1---][---W2---][---W3---]
```
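
Window assignment reduces to modular arithmetic. The function below is a minimal sketch of that calculation, including the optional alignment offset mentioned in the characteristics; `tumbling_window_start` and its parameters are hypothetical names chosen for illustration.

```python
def tumbling_window_start(timestamp: float, size: float, offset: float = 0.0) -> float:
    """Start of the tumbling window containing `timestamp` (all values in seconds)."""
    return timestamp - (timestamp - offset) % size


# Size 5, aligned to 0: windows are [0, 5), [5, 10), [10, 15), ...
print(tumbling_window_start(7, 5))            # 5.0
print(tumbling_window_start(10, 5))           # 10.0 (a boundary belongs to the later window)

# Same size, shifted by an offset of 2: windows are [2, 7), [7, 12), ...
print(tumbling_window_start(7, 5, offset=2))  # 7.0
print(tumbling_window_start(6, 5, offset=2))  # 2.0
```

Because every timestamp maps to exactly one start value, each event lands in exactly one window.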

### Use Cases
- Hourly/daily aggregations
- Batch-like processing in streaming context
- Regular metric snapshots

In [None]:
class TumblingWindow:
    """
    Tumbling Window Implementation
    
    Fixed-size, non-overlapping windows that partition the stream.
    """
    
    def __init__(self, window_size_seconds: int):
        self.window_size = window_size_seconds
        self.windows = defaultdict(list)
    
    def get_window_key(self, event_time: datetime) -> int:
        """Calculate which window an event belongs to."""
        timestamp = event_time.timestamp()
        return int(timestamp // self.window_size) * self.window_size
    
    def add_event(self, event_time: datetime, value: float):
        """Add an event to the appropriate window."""
        window_key = self.get_window_key(event_time)
        self.windows[window_key].append({
            'event_time': event_time,
            'value': value,
            'window_start': datetime.fromtimestamp(window_key),
            'window_end': datetime.fromtimestamp(window_key + self.window_size)
        })
    
    def get_window_aggregations(self) -> dict:
        """Get count, sum, and average for each window."""
        results = {}
        for window_key, events in sorted(self.windows.items()):
            values = [e['value'] for e in events]
            results[window_key] = {
                'window_start': datetime.fromtimestamp(window_key),
                'window_end': datetime.fromtimestamp(window_key + self.window_size),
                'count': len(values),
                'sum': sum(values),
                'avg': np.mean(values) if values else 0
            }
        return results


# Generate sample streaming events
np.random.seed(42)
base_time = datetime(2024, 1, 1, 10, 0, 0)
events = []

for i in range(50):
    event_time = base_time + timedelta(seconds=np.random.uniform(0, 300))
    value = np.random.uniform(10, 100)
    events.append((event_time, value))

# Sort events by time
events.sort(key=lambda x: x[0])

# Apply tumbling window (60-second windows)
tumbling = TumblingWindow(window_size_seconds=60)
for event_time, value in events:
    tumbling.add_event(event_time, value)

# Display results
print("=" * 60)
print("TUMBLING WINDOW AGGREGATIONS (60-second windows)")
print("=" * 60)
for window_key, agg in tumbling.get_window_aggregations().items():
    print(f"\nWindow: {agg['window_start'].strftime('%H:%M:%S')} - {agg['window_end'].strftime('%H:%M:%S')}")
    print(f"  Events: {agg['count']}, Sum: {agg['sum']:.2f}, Avg: {agg['avg']:.2f}")

---

## 2. Sliding (Hopping) Windows

**Sliding windows** have a fixed size but can overlap. They "slide" forward by a configurable interval.

### Characteristics
- Fixed window size
- Configurable slide interval (hop size)
- Windows can overlap (event may belong to multiple windows)
- When slide = size, it becomes a tumbling window

```
Window Size: 10, Slide: 5

Time:     0    5    10   15   20   25   30
          |----|----|----|----|----|----|---->
W1:       [--------]
W2:            [--------]
W3:                 [--------]
W4:                      [--------]
```
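
As a worked example of the diagram's parameters (size 10, slide 5), the sketch below lists every window that contains a given timestamp by walking backwards from the latest window start. The helper name `sliding_windows_for` is an assumption made for illustration, and windows are assumed to be aligned to 0.

```python
def sliding_windows_for(timestamp: float, size: int, slide: int) -> list:
    """Return (start, end) for every window [start, start + size) containing timestamp."""
    # Latest window start at or before the timestamp
    start = int(timestamp // slide) * slide
    windows = []
    # Walk backwards while the window still reaches past the timestamp
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


print(sliding_windows_for(12, size=10, slide=5))  # [(5, 15), (10, 20)]
print(sliding_windows_for(3, size=10, slide=5))   # [(-5, 5), (0, 10)]
```

An event at time 12 falls into both W2 and W3. The second call shows a side effect of fixed alignment: a window that starts before the first event still exists logically, although it only ever holds data from time 0 onward.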

### Use Cases
- Moving averages
- Trend detection
- Smoothing metrics over time

In [None]:
class SlidingWindow:
    """
    Sliding (Hopping) Window Implementation
    
    Fixed-size windows that slide forward by a configurable interval.
    Events may belong to multiple overlapping windows.
    """
    
    def __init__(self, window_size_seconds: int, slide_seconds: int):
        self.window_size = window_size_seconds
        self.slide = slide_seconds
        self.events = []
    
    def add_event(self, event_time: datetime, value: float):
        """Store event for later window assignment."""
        self.events.append({
            'event_time': event_time,
            'timestamp': event_time.timestamp(),
            'value': value
        })
    
    def get_windows_for_event(self, event_timestamp: float) -> list:
        """Find all windows that contain this event."""
        windows = []
        # Find the earliest window that could contain this event
        earliest_window_start = int((event_timestamp - self.window_size) // self.slide + 1) * self.slide
        
        window_start = earliest_window_start
        while window_start <= event_timestamp:
            window_end = window_start + self.window_size
            if window_start <= event_timestamp < window_end:
                windows.append((window_start, window_end))
            window_start += self.slide
        
        return windows
    
    def compute_aggregations(self) -> dict:
        """Compute aggregations for all windows."""
        window_data = defaultdict(list)
        
        for event in self.events:
            windows = self.get_windows_for_event(event['timestamp'])
            for window_start, window_end in windows:
                window_data[(window_start, window_end)].append(event['value'])
        
        results = {}
        for (start, end), values in sorted(window_data.items()):
            results[(start, end)] = {
                'window_start': datetime.fromtimestamp(start),
                'window_end': datetime.fromtimestamp(end),
                'count': len(values),
                'sum': sum(values),
                'avg': np.mean(values) if values else 0
            }
        
        return results


# Apply sliding window (90-second window, 30-second slide)
sliding = SlidingWindow(window_size_seconds=90, slide_seconds=30)
for event_time, value in events:
    sliding.add_event(event_time, value)

# Display results
print("=" * 60)
print("SLIDING WINDOW AGGREGATIONS (90s window, 30s slide)")
print("=" * 60)
for key, agg in list(sliding.compute_aggregations().items())[:8]:  # Show first 8
    print(f"\nWindow: {agg['window_start'].strftime('%H:%M:%S')} - {agg['window_end'].strftime('%H:%M:%S')}")
    print(f"  Events: {agg['count']}, Sum: {agg['sum']:.2f}, Avg: {agg['avg']:.2f}")

---

## 3. Session Windows

**Session windows** are dynamic windows that group events by activity periods, separated by inactivity gaps.

### Characteristics
- Variable window size based on activity
- Defined by a gap timeout (inactivity threshold)
- No overlap between sessions of the same key
- Windows can be keyed (per user, per device)

```
Gap Timeout: 5 seconds

Events:   *  * *      *  *      * * * *
Time:     0  2 4      12 14     25 26 27 28
          |--|-|------|--|------|--|--|--|---->
Sessions: [S1---]     [S2-]     [--S3------]
```
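
The sketch below shows why keying matters: events are first grouped per key and only then split into sessions. It sorts each key's timestamps up front, which is a simplification for illustration; a streaming engine would build and merge sessions incrementally as events arrive. `sessionize_by_key` and the sample data are hypothetical.

```python
from collections import defaultdict


def sessionize_by_key(events, gap_timeout):
    """events: iterable of (key, timestamp). Returns {key: [[timestamps], ...]}."""
    by_key = defaultdict(list)
    for key, t in events:
        by_key[key].append(t)

    result = {}
    for key, stamps in by_key.items():
        sessions = []
        for t in sorted(stamps):
            if sessions and t - sessions[-1][-1] <= gap_timeout:
                sessions[-1].append(t)   # within the gap: extend the current session
            else:
                sessions.append([t])     # first event or gap exceeded: new session
        result[key] = sessions
    return result


keyed_events = [("alice", 0), ("bob", 3), ("alice", 2), ("alice", 4),
                ("bob", 20), ("alice", 12), ("alice", 14)]
sessions = sessionize_by_key(keyed_events, gap_timeout=5)
print(sessions["alice"])  # [[0, 2, 4], [12, 14]]
print(sessions["bob"])    # [[3], [20]]
```

Without the per-key grouping, bob's event at time 3 would be absorbed into alice's first session.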

### Use Cases
- User session tracking
- Clickstream analysis
- Activity-based billing

In [None]:
class SessionWindow:
    """
    Session Window Implementation
    
    Dynamic windows based on activity gaps. Events are grouped
    into sessions when they occur within the gap timeout of each other.
    """
    
    def __init__(self, gap_timeout_seconds: int):
        self.gap_timeout = gap_timeout_seconds
        self.sessions = defaultdict(list)  # key -> list of sessions
    
    def process_events(self, events: list, key: str = 'default'):
        """
        Process events and group them into sessions.
        Events should be sorted by time.
        """
        if not events:
            return
        
        current_session = []
        last_event_time = None
        
        for event_time, value in events:
            if last_event_time is None:
                # First event
                current_session.append({'event_time': event_time, 'value': value})
            elif (event_time - last_event_time).total_seconds() > self.gap_timeout:
                # Gap exceeded - close current session, start new one
                if current_session:
                    self.sessions[key].append(current_session)
                current_session = [{'event_time': event_time, 'value': value}]
            else:
                # Within gap - add to current session
                current_session.append({'event_time': event_time, 'value': value})
            
            last_event_time = event_time
        
        # Don't forget the last session
        if current_session:
            self.sessions[key].append(current_session)
    
    def get_session_stats(self, key: str = 'default') -> list:
        """Get statistics for each session."""
        results = []
        for i, session in enumerate(self.sessions[key]):
            values = [e['value'] for e in session]
            start_time = session[0]['event_time']
            end_time = session[-1]['event_time']
            duration = (end_time - start_time).total_seconds()
            
            results.append({
                'session_id': i + 1,
                'start': start_time,
                'end': end_time,
                'duration_seconds': duration,
                'event_count': len(session),
                'total_value': sum(values),
                'avg_value': np.mean(values)
            })
        
        return results


# Create events with gaps to demonstrate sessions
session_events = [
    # Session 1 - clustered activity
    (base_time + timedelta(seconds=0), 10),
    (base_time + timedelta(seconds=5), 20),
    (base_time + timedelta(seconds=12), 15),
    (base_time + timedelta(seconds=18), 25),
    # Gap of 47 seconds
    # Session 2
    (base_time + timedelta(seconds=65), 30),
    (base_time + timedelta(seconds=70), 35),
    # Gap of 50 seconds
    # Session 3 - longer session
    (base_time + timedelta(seconds=120), 40),
    (base_time + timedelta(seconds=125), 45),
    (base_time + timedelta(seconds=130), 50),
    (base_time + timedelta(seconds=138), 55),
    (base_time + timedelta(seconds=145), 60),
]

# Apply session window (30-second gap timeout)
session_window = SessionWindow(gap_timeout_seconds=30)
session_window.process_events(session_events, key='user_123')

# Display results
print("=" * 60)
print("SESSION WINDOW AGGREGATIONS (30-second gap timeout)")
print("=" * 60)
for session in session_window.get_session_stats('user_123'):
    print(f"\nSession {session['session_id']}:")
    print(f"  Time: {session['start'].strftime('%H:%M:%S')} - {session['end'].strftime('%H:%M:%S')}")
    print(f"  Duration: {session['duration_seconds']:.1f}s, Events: {session['event_count']}")
    print(f"  Total: {session['total_value']:.2f}, Avg: {session['avg_value']:.2f}")

---

## 4. Watermarks and Late Data Handling

### The Challenge of Event Time Processing

In distributed systems, events may arrive **out of order** due to:
- Network delays
- Device offline periods
- Distributed system clock skew

### Watermarks

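A **watermark** is a timestamp W that asserts: *"No more events with a timestamp at or before W are expected to arrive."* It is a heuristic, so an event that does arrive with a timestamp at or before the current watermark is treated as **late**.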

```
Event Time:     |--1--|--2--|--3--|--4--|--5--|
                         ^
                    Watermark (W=3)
                    
Meaning: All events with event_time <= 3 have been received
```
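
One common heuristic derives the watermark from the largest event time seen so far, minus a fixed delay that bounds the expected out-of-orderness. The class below is a minimal sketch of that idea; `BoundedDelayWatermark` is a hypothetical helper, not a framework API.

```python
class BoundedDelayWatermark:
    """Hypothetical helper: watermark = max event time seen - max_delay."""

    def __init__(self, max_delay: float):
        self.max_delay = max_delay
        self.max_event_time = float("-inf")

    @property
    def watermark(self) -> float:
        return self.max_event_time - self.max_delay

    def is_late(self, event_time: float) -> bool:
        # Same convention as the diagram: everything <= W is assumed received
        return event_time <= self.watermark

    def observe(self, event_time: float) -> None:
        # The watermark never moves backwards because it tracks a running max
        self.max_event_time = max(self.max_event_time, event_time)


wm = BoundedDelayWatermark(max_delay=2)
results = []
for t in [1, 2, 5, 3, 4]:      # 3 and 4 arrive after 5 (out of order)
    late = wm.is_late(t)       # compare against the watermark before this event
    wm.observe(t)
    results.append((t, "late" if late else "on time", wm.watermark))

for row in results:
    print(row)
# (1, 'on time', -1)
# (2, 'on time', 0)
# (5, 'on time', 3)
# (3, 'late', 3)
# (4, 'on time', 3)
```

After the event at time 5 the watermark is 3, so the straggler at time 3 is late while the one at time 4 is still on time. A larger `max_delay` tolerates more disorder at the cost of later results.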

### Late Data Strategies

| Strategy | Description | Trade-off |
|----------|-------------|----------|
| **Drop** | Discard late events | Simple, may lose data |
| **Recompute** | Update window results | Accurate, more complex |
| **Side Output** | Route to separate stream | Flexible, requires handling |
| **Allowed Lateness** | Accept events within threshold | Balance of accuracy and resources |
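
The strategies in the table can be combined into a single routing decision. The sketch below assumes a window is closed once the watermark reaches its end and that events are accepted for `allowed_lateness` seconds beyond that point; `LatePolicy` and `route_event` are illustrative names.

```python
from enum import Enum


class LatePolicy(Enum):
    DROP = "drop"
    SIDE_OUTPUT = "side_output"


def route_event(window_end: float, watermark: float,
                allowed_lateness: float, policy: LatePolicy) -> str:
    """Decide what to do with an event whose window ends at window_end."""
    if watermark < window_end:
        return "on_time"          # window is still open
    if watermark <= window_end + allowed_lateness:
        return "late_accepted"    # window already fired: recompute its result
    # Beyond allowed lateness: either discard or hand off to a separate stream
    return "side_output" if policy is LatePolicy.SIDE_OUTPUT else "dropped"


print(route_event(60, watermark=50, allowed_lateness=30, policy=LatePolicy.DROP))          # on_time
print(route_event(60, watermark=75, allowed_lateness=30, policy=LatePolicy.DROP))          # late_accepted
print(route_event(60, watermark=110, allowed_lateness=30, policy=LatePolicy.DROP))         # dropped
print(route_event(60, watermark=110, allowed_lateness=30, policy=LatePolicy.SIDE_OUTPUT))  # side_output
```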

In [None]:
class WatermarkWindow:
    """
    Window with Watermark and Late Data Handling
    
    Demonstrates how watermarks work and different strategies
    for handling late-arriving data.
    """
    
    def __init__(self, window_size_seconds: int, 
                 max_lateness_seconds: int = 0,
                 watermark_delay_seconds: int = 5):
        self.window_size = window_size_seconds
        self.max_lateness = max_lateness_seconds
        self.watermark_delay = watermark_delay_seconds
        
        self.windows = defaultdict(list)
        self.late_events = []
        self.dropped_events = []
        self.current_watermark = None
        self.max_event_time_seen = None
    
    def update_watermark(self, event_time: datetime):
        """Update watermark based on max event time seen."""
        if self.max_event_time_seen is None or event_time > self.max_event_time_seen:
            self.max_event_time_seen = event_time
        
        # Watermark = max_event_time - delay (to allow for out-of-order events)
        self.current_watermark = self.max_event_time_seen - timedelta(seconds=self.watermark_delay)
    
    def get_window_key(self, event_time: datetime) -> int:
        timestamp = event_time.timestamp()
        return int(timestamp // self.window_size) * self.window_size
    
    def is_window_closed(self, window_end: datetime) -> bool:
        """Check if window has been closed by watermark."""
        if self.current_watermark is None:
            return False
        return window_end <= self.current_watermark
    
    def add_event(self, event_time: datetime, value: float, processing_time: datetime):
        """
        Add event with late data handling.
        
        Args:
            event_time: When the event occurred
            value: Event payload
            processing_time: When event is being processed
        """
        # Simplification: this demo advances the watermark from processing time.
        # Production engines usually derive it from the maximum event time observed.
        self.update_watermark(processing_time)
        
        window_key = self.get_window_key(event_time)
        window_end = datetime.fromtimestamp(window_key + self.window_size)
        
        event_record = {
            'event_time': event_time,
            'processing_time': processing_time,
            'value': value,
            'window_key': window_key
        }
        
        # Check if this is late data
        if self.is_window_closed(window_end):
            lateness = (self.current_watermark - window_end).total_seconds()
            
            if lateness <= self.max_lateness:
                # Accept late event (within allowed lateness)
                self.windows[window_key].append(event_record)
                event_record['status'] = 'late_accepted'
                self.late_events.append(event_record)
            else:
                # Drop event (too late)
                event_record['status'] = 'dropped'
                event_record['lateness'] = lateness
                self.dropped_events.append(event_record)
        else:
            # On-time event
            event_record['status'] = 'on_time'
            self.windows[window_key].append(event_record)
    
    def get_summary(self) -> dict:
        """Get processing summary."""
        total_events = sum(len(w) for w in self.windows.values())
        return {
            'on_time_events': total_events - len(self.late_events),
            'late_accepted_events': len(self.late_events),
            'dropped_events': len(self.dropped_events),
            'total_windows': len(self.windows),
            'current_watermark': self.current_watermark
        }


# Simulate out-of-order event processing
watermark_window = WatermarkWindow(
    window_size_seconds=60,
    max_lateness_seconds=30,
    watermark_delay_seconds=10
)

# Simulate events arriving (some out of order)
# (event_time, value, processing_time)
out_of_order_events = [
    (base_time + timedelta(seconds=5), 10, base_time + timedelta(seconds=6)),
    (base_time + timedelta(seconds=15), 20, base_time + timedelta(seconds=16)),
    (base_time + timedelta(seconds=10), 15, base_time + timedelta(seconds=20)),  # Arrived late
    (base_time + timedelta(seconds=45), 30, base_time + timedelta(seconds=46)),
    (base_time + timedelta(seconds=65), 40, base_time + timedelta(seconds=66)),
    (base_time + timedelta(seconds=70), 45, base_time + timedelta(seconds=71)),
    (base_time + timedelta(seconds=90), 50, base_time + timedelta(seconds=91)),
    # Very late event - dropped (watermark is 50s past its window end, limit is 30s)
    (base_time + timedelta(seconds=8), 12, base_time + timedelta(seconds=120)),
    # Late but within allowed lateness (watermark is 10s past its window end)
    (base_time + timedelta(seconds=62), 42, base_time + timedelta(seconds=140)),
]

for event_time, value, proc_time in out_of_order_events:
    watermark_window.add_event(event_time, value, proc_time)

# Display results
print("=" * 60)
print("WATERMARK AND LATE DATA HANDLING")
print("=" * 60)
summary = watermark_window.get_summary()
print(f"\nProcessing Summary:")
print(f"  On-time events: {summary['on_time_events']}")
print(f"  Late (accepted): {summary['late_accepted_events']}")
print(f"  Dropped events: {summary['dropped_events']}")
print(f"  Current watermark: {summary['current_watermark'].strftime('%H:%M:%S') if summary['current_watermark'] else 'N/A'}")

if watermark_window.dropped_events:
    print(f"\nDropped Events (exceeded {watermark_window.max_lateness}s lateness):")
    for e in watermark_window.dropped_events:
        print(f"  Event time: {e['event_time'].strftime('%H:%M:%S')}, "
              f"Processed: {e['processing_time'].strftime('%H:%M:%S')}, "
              f"Lateness: {e['lateness']:.1f}s")

---

## 5. Visualization of Window Types

Let's create an interactive visualization to understand how different window types partition the same stream of events.

In [None]:
# Generate events for visualization
np.random.seed(123)
viz_base_time = datetime(2024, 1, 1, 0, 0, 0)
viz_events = []

for i in range(30):
    event_time = viz_base_time + timedelta(seconds=i * 3 + np.random.uniform(-1, 1))
    value = np.random.uniform(20, 80)
    viz_events.append((event_time, value))

viz_events.sort(key=lambda x: x[0])

# Create DataFrame for visualization
events_df = pd.DataFrame(viz_events, columns=['event_time', 'value'])
events_df['seconds'] = (events_df['event_time'] - viz_base_time).dt.total_seconds()

# Create visualization
fig = make_subplots(
    rows=4, cols=1,
    subplot_titles=(
        'Raw Event Stream',
        'Tumbling Windows (20s)',
        'Sliding Windows (30s window, 10s slide)',
        'Session Windows (10s gap)'
    ),
    vertical_spacing=0.08,
    row_heights=[0.2, 0.25, 0.25, 0.3]
)

# Row 1: Raw events
fig.add_trace(
    go.Scatter(
        x=events_df['seconds'],
        y=events_df['value'],
        mode='markers',
        marker=dict(size=10, color='#636EFA'),
        name='Events',
        showlegend=True
    ),
    row=1, col=1
)

# Row 2: Tumbling windows
tumbling_size = 20
colors = px.colors.qualitative.Set2
for i, start in enumerate(range(0, 100, tumbling_size)):
    end = start + tumbling_size
    fig.add_vrect(
        x0=start, x1=end,
        fillcolor=colors[i % len(colors)],
        opacity=0.3,
        line_width=2,
        line_color=colors[i % len(colors)],
        row=2, col=1
    )

fig.add_trace(
    go.Scatter(
        x=events_df['seconds'],
        y=events_df['value'],
        mode='markers',
        marker=dict(size=8, color='#636EFA'),
        name='Events (Tumbling)',
        showlegend=False
    ),
    row=2, col=1
)

# Row 3: Sliding windows
slide_size = 30
slide_interval = 10
y_positions = [60, 45, 30]  # Offset windows vertically

for i, start in enumerate(range(0, 80, slide_interval)):
    end = start + slide_size
    y_pos = y_positions[i % 3]
    fig.add_shape(
        type='rect',
        x0=start, x1=end,
        y0=y_pos - 10, y1=y_pos + 10,
        fillcolor=colors[i % len(colors)],
        opacity=0.4,
        line=dict(color=colors[i % len(colors)], width=2),
        row=3, col=1
    )
    fig.add_annotation(
        x=(start + end) / 2,
        y=y_pos,
        text=f'W{i+1}',
        showarrow=False,
        font=dict(size=10),
        row=3, col=1
    )

fig.add_trace(
    go.Scatter(
        x=events_df['seconds'],
        y=[50] * len(events_df),
        mode='markers',
        marker=dict(size=8, color='#636EFA', symbol='diamond'),
        name='Events (Sliding)',
        showlegend=False
    ),
    row=3, col=1
)

# Row 4: Session windows
gap_timeout = 10
session_start = None
session_end = None
session_id = 0
sessions = []

for i, (_, row) in enumerate(events_df.iterrows()):
    if session_start is None:
        session_start = row['seconds']
        session_end = row['seconds']
    elif row['seconds'] - session_end <= gap_timeout:
        session_end = row['seconds']
    else:
        sessions.append((session_id, session_start, session_end))
        session_id += 1
        session_start = row['seconds']
        session_end = row['seconds']

sessions.append((session_id, session_start, session_end))

for sid, start, end in sessions:
    fig.add_vrect(
        x0=start - 1, x1=end + 1,
        fillcolor=colors[sid % len(colors)],
        opacity=0.4,
        line_width=2,
        line_color=colors[sid % len(colors)],
        row=4, col=1
    )
    fig.add_annotation(
        x=(start + end) / 2,
        y=85,
        text=f'Session {sid + 1}',
        showarrow=False,
        font=dict(size=10, color='black'),
        row=4, col=1
    )

fig.add_trace(
    go.Scatter(
        x=events_df['seconds'],
        y=events_df['value'],
        mode='markers',
        marker=dict(size=8, color='#636EFA'),
        name='Events (Session)',
        showlegend=False
    ),
    row=4, col=1
)

# Update layout
fig.update_layout(
    height=900,
    title_text='<b>Stream Processing Window Types Comparison</b>',
    title_x=0.5,
    showlegend=True
)

fig.update_xaxes(title_text='Time (seconds)', row=4, col=1)
fig.update_yaxes(title_text='Value', row=1, col=1)
fig.update_yaxes(title_text='Value', row=2, col=1)
fig.update_yaxes(title_text='Window Layer', row=3, col=1)
fig.update_yaxes(title_text='Value', row=4, col=1)

fig

---

## 6. Window Aggregation Comparison

In [None]:
# Compare aggregation results across window types

# Prepare data
tumbling_aggs = []
window_size = 20

for start in range(0, 100, window_size):
    end = start + window_size
    mask = (events_df['seconds'] >= start) & (events_df['seconds'] < end)
    window_events = events_df[mask]
    if len(window_events) > 0:
        tumbling_aggs.append({
            'window': f'{start}-{end}s',
            'center': (start + end) / 2,
            'count': len(window_events),
            'sum': window_events['value'].sum(),
            'avg': window_events['value'].mean()
        })

tumbling_df = pd.DataFrame(tumbling_aggs)

# Create comparison chart
fig2 = make_subplots(
    rows=1, cols=3,
    subplot_titles=('Event Count per Window', 'Sum per Window', 'Average per Window')
)

fig2.add_trace(
    go.Bar(
        x=tumbling_df['window'],
        y=tumbling_df['count'],
        marker_color='#636EFA',
        name='Count'
    ),
    row=1, col=1
)

fig2.add_trace(
    go.Bar(
        x=tumbling_df['window'],
        y=tumbling_df['sum'],
        marker_color='#EF553B',
        name='Sum'
    ),
    row=1, col=2
)

fig2.add_trace(
    go.Bar(
        x=tumbling_df['window'],
        y=tumbling_df['avg'],
        marker_color='#00CC96',
        name='Average'
    ),
    row=1, col=3
)

fig2.update_layout(
    height=400,
    title_text='<b>Tumbling Window Aggregations</b>',
    title_x=0.5,
    showlegend=False
)

fig2.show()

---

## 7. Real-World Framework Examples

### Apache Flink (Java/Scala)

```java
// Tumbling Window
dataStream
    .keyBy(event -> event.getKey())
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .sum("value");

// Sliding Window
dataStream
    .keyBy(event -> event.getKey())
    .window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(5)))
    .sum("value");

// Session Window
dataStream
    .keyBy(event -> event.getKey())
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .sum("value");
```

### Apache Kafka Streams (Java)

```java
// Tumbling Window
stream.groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
    .count();

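// Hopping window (fixed 5-minute advance; the analogue of the sliding window in section 2)
stream.groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(10))
        .advanceBy(Duration.ofMinutes(5)))
    .count();

// Sliding window (Kafka Streams defines these by the time difference between records,
// not by a fixed hop; the second argument is the grace period)
stream.groupByKey()
    .windowedBy(SlidingWindows.ofTimeDifferenceAndGrace(
        Duration.ofMinutes(10), Duration.ofMinutes(1)))
    .count();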

// Session Window
stream.groupByKey()
    .windowedBy(SessionWindows.ofInactivityGapWithNoGrace(Duration.ofMinutes(5)))
    .count();
```

### Apache Spark Structured Streaming (Python)

```python
from pyspark.sql.functions import col, window, sum as sum_

# `df` is assumed to be a streaming DataFrame with `timestamp` and `value` columns

# Tumbling Window
df.groupBy(
    window(col("timestamp"), "5 minutes")
).agg(sum_("value"))

# Sliding Window
df.groupBy(
    window(col("timestamp"), "10 minutes", "5 minutes")
).agg(sum_("value"))

# Watermark for late data
df.withWatermark("timestamp", "10 minutes") \
  .groupBy(window(col("timestamp"), "5 minutes")) \
  .agg(sum_("value"))
```

---

## 🎯 Key Takeaways

### Window Type Selection

| Window Type | Best For | Avoid When |
|-------------|----------|------------|
| **Tumbling** | Regular periodic aggregations, reporting | Need smoothing or overlap |
| **Sliding** | Moving averages, trend analysis | Memory constraints with small slides |
| **Session** | User behavior, activity-based analysis | Regular time-based metrics needed |

### Best Practices

1. **Choose Event Time over Processing Time** when accuracy matters more than simplicity

2. **Set Appropriate Watermark Delays** based on your expected event lateness

3. **Define Allowed Lateness** to balance completeness vs. resource usage

4. **Monitor Late Data** - high late event rates may indicate:
   - Network issues
   - Incorrect watermark configuration
   - Clock synchronization problems

5. **Consider Window Size Trade-offs**:
   - Smaller windows → More real-time, but noisier
   - Larger windows → Smoother trends, but more latency

### Common Pitfalls to Avoid

- ❌ Using processing time when event time is available
- ❌ Setting watermark delay to zero (any out-of-order event is then treated as late and may be dropped)
- ❌ Ignoring late data without monitoring
- ❌ Creating too many overlapping sliding windows (memory explosion)
- ❌ Using session windows without per-key partitioning

### Formula Reference

**Sliding Window Count**: To cover a time range $T$ with windows of size $W$ and slide $S$, where the first window starts at the beginning of the range and $T \ge W$:

$$\text{Number of windows} = \left\lceil \frac{T - W}{S} \right\rceil + 1$$

**Windows per Event** (sliding), an upper bound that is exact when $W$ is a multiple of $S$:

$$\text{Windows per event} \le \left\lceil \frac{W}{S} \right\rceil$$
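
The two formulas can be checked numerically. The sketch below assumes windows start at multiples of $S$ and cover half-open intervals $[kS, kS + W)$; the function names are illustrative.

```python
import math


def windows_needed(T: float, W: float, S: float) -> int:
    """Windows starting at 0, S, 2S, ... needed until one reaches the end of T."""
    return math.ceil((T - W) / S) + 1


def windows_containing(t: float, W: float, S: float) -> int:
    """Count windows [k*S, k*S + W), for integer k, that contain t."""
    first_k = math.floor((t - W) / S) + 1  # smallest k with k*S > t - W
    last_k = math.floor(t / S)             # largest k with k*S <= t
    return last_k - first_k + 1


print(windows_needed(T=30, W=10, S=5))    # 5
print(windows_containing(12, W=10, S=5))  # 2, which equals ceil(10 / 5)

# When W is not a multiple of S, the count depends on where the event falls
print(windows_containing(3, W=10, S=4))   # 2
print(windows_containing(9, W=10, S=4))   # 3, which equals ceil(10 / 4)
```

The last two calls show that $\lceil W/S \rceil$ is the maximum number of windows per event, which matters when sizing state for small slide intervals.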