# Lesson 1: Time Series Data & Preprocessing

## Learning Objectives
- LO6: Understand what time series data is and how it differs from sequential data
- LO7: Apply techniques for transforming time series data

---

## Setup: Import Required Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set style for better-looking plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("Libraries imported successfully!")

---
## Part 1: Opening Activity - "What Do You See?"

Let's start by generating and visualizing three different types of data patterns.

In [None]:
# Generate sample data for opening activity
np.random.seed(42)

# 1. Daily temperature over a year
days = np.arange(365)
temperature = 15 + 10 * np.sin(2 * np.pi * days / 365) + np.random.normal(0, 2, 365)

# 2. Text as sequence (letter positions in alphabet)
text = "TIMESERIES"
letter_values = [ord(char) - ord('A') + 1 for char in text]

# 3. Sensor readings over time (with trend and noise)
time_points = np.arange(100)
sensor_readings = 50 + 0.3 * time_points + 5 * np.sin(time_points / 5) + np.random.normal(0, 3, 100)

# Visualize all three
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].plot(days, temperature, linewidth=1, color='orange')
axes[0].set_title('Pattern A', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Index')
axes[0].set_ylabel('Value')
axes[0].grid(True, alpha=0.3)

axes[1].plot(range(len(letter_values)), letter_values, marker='o', linewidth=2, markersize=8, color='blue')
axes[1].set_title('Pattern B', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Index')
axes[1].set_ylabel('Value')
axes[1].grid(True, alpha=0.3)

axes[2].plot(time_points, sensor_readings, linewidth=1, color='green')
axes[2].set_title('Pattern C', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Index')
axes[2].set_ylabel('Value')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🤔 Discussion Questions:")
print("- Which two patterns look similar to each other?")
print("- What makes them similar?")
print("- What role does TIME play in each pattern?")

### 💭 Reflection (Discuss with your neighbor)

Write your observations here:
- Patterns A and C are: ___________
- Pattern B is different because: ___________
- Time is important for: ___________

---
## Part 2: Time Series vs Sequential Data

### Key Definitions

**Time Series Data:**
- Observations ordered in time with **regular intervals**
- Time itself is intrinsically important
- Examples: stock prices, temperature readings, heart rate per second

**Sequential Data:**
- Data where **order matters** but time intervals may not be regular or important
- Examples: DNA sequences, text, browsing history
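
To make the distinction concrete, here is a minimal sketch. The heart-rate numbers and the DNA string are made up for illustration; the point is only that the first object carries timestamps on a regular grid while the second carries nothing but order.

```python
import pandas as pd

# Time series: every value is tied to a timestamp on a regular 1-second grid.
heart_rate = pd.Series(
    [72, 75, 74, 78],
    index=pd.date_range("2023-01-01 09:00:00", periods=4,
                        freq=pd.Timedelta(seconds=1)),
)

# Sequential data: only the order of the elements carries meaning.
dna = list("GATTACA")

# On a regular grid, the gap between consecutive timestamps is constant.
gaps = heart_rate.index.to_series().diff().dropna()
print("Distinct gaps between timestamps:", gaps.nunique())

# In a sequence we can only ask for a position, not for a point in time.
print("Position of the first 'T':", dna.index("T"))
```

If the gaps were not all equal, the series would be irregularly sampled and would usually need resampling before applying the techniques in this lesson.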

### 📝 Exercise 1: Categorization

Classify each example as either **Time Series (TS)** or **Sequential (SEQ)**:

1. DNA sequence: _______
2. Heart rate per second: _______
3. Search history: _______
4. Daily sales figures: _______
5. Words in a sentence: _______
6. Hourly energy consumption: _______
7. Customer purchases on a website: _______
8. Temperature readings every 10 minutes: _______

### 👥 Group Activity (15 minutes)

**Your domain:** ___________  (healthcare / retail / industry / transport)

**Brainstorm:**
- 2 time series examples from your domain:
  1. ___________
  2. ___________

- 2 sequential data examples from your domain:
  1. ___________
  2. ___________

**Discussion:** Why is this distinction important for analysis?

---
## Part 3: Characteristics of Time Series

Time series data typically contains three main components:
1. **Trend**: Long-term increase or decrease
2. **Seasonality**: Regular, repeating patterns
3. **Noise**: Random, irregular fluctuations
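
One way to write this down is an additive model, which is also how the synthetic data in this part is combined. The symbols are notation assumed here for illustration:

$$
y_t = T_t + S_t + \varepsilon_t
$$

Here $y_t$ is the observed value at time $t$, $T_t$ the trend, $S_t$ the seasonal component (possibly the sum of several cycles, such as daily and yearly), and $\varepsilon_t$ the noise.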

### Generate Sample Energy Consumption Data

In [None]:
# Create synthetic energy consumption data
np.random.seed(42)

# Time range: 2 years of hourly data
hours = pd.date_range('2022-01-01', periods=24*365*2, freq='h')  # 'h' = hourly ('H' is deprecated in recent pandas)
n = len(hours)

# Components
trend = np.linspace(100, 120, n)  # Gradual increase in consumption
seasonal_yearly = 15 * np.sin(2 * np.pi * np.arange(n) / (24*365))  # Yearly seasonality
seasonal_daily = 10 * np.sin(2 * np.pi * np.arange(n) / 24)  # Daily seasonality
noise = np.random.normal(0, 3, n)  # Random noise

# Combine all components
energy_consumption = trend + seasonal_yearly + seasonal_daily + noise

# Create DataFrame
df_energy = pd.DataFrame({
    'timestamp': hours,
    'consumption': energy_consumption,
    'trend': trend,
    'seasonal_yearly': seasonal_yearly,
    'seasonal_daily': seasonal_daily,
    'noise': noise
})

print(f"Dataset created: {len(df_energy)} hours of data")
print(f"Date range: {df_energy['timestamp'].min()} to {df_energy['timestamp'].max()}")
df_energy.head()

### 📊 Exercise 2: Visualize the Data

**Task:** Plot the energy consumption data and identify its components.

In [None]:
# Visualize the complete time series
plt.figure(figsize=(15, 5))
plt.plot(df_energy['timestamp'], df_energy['consumption'], linewidth=0.5, alpha=0.8)
plt.title('Energy Consumption Over Time', fontsize=16, fontweight='bold')
plt.xlabel('Time')
plt.ylabel('Energy Consumption (kWh)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### 🔍 Your Analysis:

Look at the plot above and answer:

1. **Trend**: Do you see an overall increase, decrease, or stability? ___________

2. **Seasonality**: Do you notice any repeating patterns? At what frequency? ___________

3. **Noise**: How much random variation is present? ___________

### Decompose the Time Series

In [None]:
# Visualize individual components
fig, axes = plt.subplots(4, 1, figsize=(15, 10))

# Plot first 30 days for clarity
sample_days = 30
sample_data = df_energy.iloc[:24*sample_days]

axes[0].plot(sample_data['timestamp'], sample_data['consumption'], linewidth=1)
axes[0].set_title('Original Series', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Value')
axes[0].grid(True, alpha=0.3)

axes[1].plot(sample_data['timestamp'], sample_data['trend'], color='red', linewidth=2)
axes[1].set_title('Trend Component', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Value')
axes[1].grid(True, alpha=0.3)

axes[2].plot(sample_data['timestamp'], sample_data['seasonal_yearly'] + sample_data['seasonal_daily'], color='green', linewidth=1)
axes[2].set_title('Seasonal Component', fontsize=12, fontweight='bold')
axes[2].set_ylabel('Value')
axes[2].grid(True, alpha=0.3)

axes[3].plot(sample_data['timestamp'], sample_data['noise'], color='gray', linewidth=0.5, alpha=0.7)
axes[3].set_title('Noise Component', fontsize=12, fontweight='bold')
axes[3].set_xlabel('Time')
axes[3].set_ylabel('Value')
axes[3].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
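
In the plots above we could show each component because we built the data ourselves. With real data the components are unknown and have to be estimated. The sketch below shows one simple way to estimate the trend: a centered rolling mean whose window equals one seasonal period. The series is a small synthetic stand-in (assumed values, not the `df_energy` data), so that the estimate can be compared with the true trend.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 24 * 30                       # 30 days of hourly samples
t = np.arange(n)
trend = np.linspace(100, 110, n)  # true trend, known only because the data is synthetic
daily = 10 * np.sin(2 * np.pi * t / 24)
series = pd.Series(trend + daily + rng.normal(0, 1, n))

# Averaging over exactly one seasonal period (24 hours) cancels the daily cycle,
# leaving an estimate of the trend. Edge values are NaN where the window is incomplete.
trend_est = series.rolling(window=24, center=True).mean()

# What remains after subtracting the estimated trend is seasonality plus noise.
detrended = series - trend_est

max_err = (trend_est - trend).abs().max()
print(f"Largest trend estimation error: {max_err:.2f}")
```

The window length matters: a window that does not match the seasonal period leaves part of the cycle in the trend estimate.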

### 💬 Think-Pair-Share

**Individual:** Which component do you see most strongly in the original data?

Your answer: ___________

**Pair:** Compare with your neighbor

**Share:** Be ready to share your findings with the class

### 📚 Key Concept: Stationarity

**Stationary Time Series:**
- Statistical properties (mean, variance, autocorrelation) remain constant over time
- No trend, no seasonality
- Important for many forecasting models

**Non-Stationary Time Series:**
- Statistical properties change over time
- Has trend and/or seasonality
- Often needs transformation before modeling
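
A quick, informal way to check the "constant mean" part of the definition is to split a series in half and compare the two means. This is only a rough sanity check under assumed synthetic data, not a formal statistical test.

```python
import numpy as np

rng = np.random.default_rng(42)
stationary = rng.normal(50, 5, 500)
trending = 30 + 0.1 * np.arange(500) + rng.normal(0, 5, 500)

def half_means(x):
    """Return the mean of the first and second half of a 1-D array."""
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()

for name, x in [("stationary", stationary), ("trending", trending)]:
    first, second = half_means(x)
    print(f"{name:>10}: first half {first:.1f}, second half {second:.1f}")
```

For the stationary series the two means should be close; for the trending series the second half should be clearly higher.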

In [None]:
# Visualize stationary vs non-stationary
fig, axes = plt.subplots(1, 2, figsize=(15, 4))

# Stationary series (white noise)
stationary = np.random.normal(50, 5, 500)
axes[0].plot(stationary, linewidth=1)
axes[0].axhline(y=np.mean(stationary), color='red', linestyle='--', label=f'Mean = {np.mean(stationary):.1f}')
axes[0].set_title('Stationary Series', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Time')
axes[0].set_ylabel('Value')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Non-stationary series (with trend)
non_stationary = 30 + 0.1 * np.arange(500) + np.random.normal(0, 5, 500)
axes[1].plot(non_stationary, linewidth=1)
axes[1].plot(30 + 0.1 * np.arange(500), color='red', linestyle='--', linewidth=2, label='Trend')
axes[1].set_title('Non-Stationary Series', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Time')
axes[1].set_ylabel('Value')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n❓ Why is stationarity important?")
print("Many statistical models assume constant statistical properties.")
print("Non-stationary data often needs transformation first!")
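
One common transformation for removing a trend is differencing: replace each value with its change from the previous value. The sketch below applies it to a synthetic trending series like the one plotted above (the seed and values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
trending = pd.Series(30 + 0.1 * np.arange(500) + rng.normal(0, 5, 500))

# First difference: y_t - y_{t-1}. A linear trend with slope 0.1 turns into
# a constant offset of about 0.1, so the mean no longer drifts over time.
diffed = trending.diff().dropna()

mid = len(diffed) // 2
orig_first, orig_second = trending.iloc[:250].mean(), trending.iloc[250:].mean()
diff_first, diff_second = diffed.iloc[:mid].mean(), diffed.iloc[mid:].mean()

print(f"Original halves:    {orig_first:.1f} vs {orig_second:.1f}")
print(f"Differenced halves: {diff_first:.2f} vs {diff_second:.2f}")
```

Note that differencing shortens the series by one point and makes the noise relatively larger, so it is usually combined with the smoothing techniques later in this lesson.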

---
## ☕ BREAK (10 minutes)
---

---
## Part 4: Data Transformation Techniques

Real-world time series data often has problems:
- Missing values
- Outliers
- Too much noise
- Different scales

Let's learn how to fix these!

### Create "Dirty" Data for Practice

In [None]:
# Create a problematic time series
np.random.seed(42)
n_points = 200
time = pd.date_range('2023-01-01', periods=n_points, freq='h')  # 'h' = hourly ('H' is deprecated in recent pandas)

# Base signal
clean_signal = 50 + 10 * np.sin(2 * np.pi * np.arange(n_points) / 24) + 0.1 * np.arange(n_points)

# Add problems
dirty_signal = clean_signal.copy()
dirty_signal += np.random.normal(0, 5, n_points)  # Add noise

# Add missing values (10% of data)
missing_indices = np.random.choice(n_points, size=int(n_points * 0.1), replace=False)
dirty_signal[missing_indices] = np.nan

# Add outliers (5% of data)
outlier_indices = np.random.choice(n_points, size=int(n_points * 0.05), replace=False)
dirty_signal[outlier_indices] += np.random.choice([-1, 1], size=len(outlier_indices)) * np.random.uniform(30, 50, size=len(outlier_indices))

# Create DataFrame
df_dirty = pd.DataFrame({
    'timestamp': time,
    'value': dirty_signal,
    'clean_value': clean_signal
})

# Visualize the problem
plt.figure(figsize=(15, 5))
plt.plot(df_dirty['timestamp'], df_dirty['value'], 'o-', markersize=3, linewidth=0.5, label='Dirty Data', alpha=0.7)
plt.plot(df_dirty['timestamp'], df_dirty['clean_value'], linewidth=2, label='Clean Signal', alpha=0.8)
plt.title('"Dirty" Time Series Data', fontsize=16, fontweight='bold')
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\n📊 Data Quality Report:")
print(f"Total points: {len(df_dirty)}")
print(f"Missing values: {df_dirty['value'].isna().sum()} ({df_dirty['value'].isna().sum()/len(df_dirty)*100:.1f}%)")
print(f"Outliers injected: ~{int(n_points * 0.05)}")  # approximate: some may coincide with missing values
print(f"\n🎯 Goal: Clean this data to make it analyzable!")

---
### Technique 1: Dealing with Missing Data (6 minutes)

In [None]:
# Different strategies for handling missing values
df_missing = df_dirty.copy()

# Method 1: Forward fill
ffill_values = df_missing['value'].ffill()  # fillna(method=...) is deprecated in recent pandas

# Method 2: Backward fill
bfill_values = df_missing['value'].bfill()

# Method 3: Linear interpolation
interp_values = df_missing['value'].interpolate(method='linear')

# Method 4: Mean imputation
mean_values = df_missing['value'].fillna(df_missing['value'].mean())

# Visualize comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

axes[0, 0].plot(df_missing['timestamp'], ffill_values, linewidth=1)
axes[0, 0].scatter(df_missing['timestamp'][missing_indices], ffill_values[missing_indices], color='red', s=50, zorder=5, label='Imputed')
axes[0, 0].set_title('Forward Fill', fontsize=12, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

axes[0, 1].plot(df_missing['timestamp'], bfill_values, linewidth=1)
axes[0, 1].scatter(df_missing['timestamp'][missing_indices], bfill_values[missing_indices], color='red', s=50, zorder=5, label='Imputed')
axes[0, 1].set_title('Backward Fill', fontsize=12, fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

axes[1, 0].plot(df_missing['timestamp'], interp_values, linewidth=1)
axes[1, 0].scatter(df_missing['timestamp'][missing_indices], interp_values[missing_indices], color='red', s=50, zorder=5, label='Imputed')
axes[1, 0].set_title('Linear Interpolation', fontsize=12, fontweight='bold')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

axes[1, 1].plot(df_missing['timestamp'], mean_values, linewidth=1)
axes[1, 1].scatter(df_missing['timestamp'][missing_indices], mean_values[missing_indices], color='red', s=50, zorder=5, label='Imputed')
axes[1, 1].set_title('Mean Imputation', fontsize=12, fontweight='bold')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
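
The differences between the four methods are easier to see on a tiny series. The sketch below uses a made-up four-point series with a two-point gap and prints what each method puts into the gap.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled = pd.DataFrame({
    "ffill": s.ffill(),                         # repeat the last known value
    "bfill": s.bfill(),                         # repeat the next known value
    "linear": s.interpolate(method="linear"),   # straight line between neighbors
    "mean": s.fillna(s.mean()),                 # overall mean, ignoring position
})
print(filled)
```

Forward and backward fill create flat steps, interpolation follows the local slope, and mean imputation ignores where in time the gap occurs.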

### 💭 Reflection: When to use which method?

- **Forward/Backward Fill**: Best when ___________
- **Interpolation**: Best when ___________
- **Mean Imputation**: Best when ___________

**Your choice for this data:** ___________

In [None]:
# Apply your chosen method
df_clean = df_dirty.copy()
df_clean['value'] = df_clean['value'].interpolate(method='linear')  # Change this to your preferred method

print("✅ Missing values handled!")
print(f"Remaining missing values: {df_clean['value'].isna().sum()}")

---
### Technique 2: Noise Reduction (7 minutes)

In [None]:
# Moving Average (Simple Smoothing)
def moving_average(data, window_size):
    return data.rolling(window=window_size, center=True).mean()

# Exponential Smoothing
def exponential_smoothing(data, alpha):
    return data.ewm(alpha=alpha, adjust=False).mean()

# Apply different smoothing techniques
ma_5 = moving_average(df_clean['value'], 5)
ma_15 = moving_average(df_clean['value'], 15)
exp_01 = exponential_smoothing(df_clean['value'], 0.1)
exp_03 = exponential_smoothing(df_clean['value'], 0.3)

# Visualize
fig, axes = plt.subplots(2, 1, figsize=(15, 10))

# Moving average comparison
axes[0].plot(df_clean['timestamp'], df_clean['value'], linewidth=0.5, alpha=0.5, label='Original (noisy)')
axes[0].plot(df_clean['timestamp'], ma_5, linewidth=2, label='MA (window=5)')
axes[0].plot(df_clean['timestamp'], ma_15, linewidth=2, label='MA (window=15)')
axes[0].plot(df_clean['timestamp'], df_clean['clean_value'], linewidth=2, linestyle='--', label='True Signal', color='black')
axes[0].set_title('Moving Average Smoothing', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Value')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Exponential smoothing comparison
axes[1].plot(df_clean['timestamp'], df_clean['value'], linewidth=0.5, alpha=0.5, label='Original (noisy)')
axes[1].plot(df_clean['timestamp'], exp_01, linewidth=2, label='Exp Smoothing (α=0.1)')
axes[1].plot(df_clean['timestamp'], exp_03, linewidth=2, label='Exp Smoothing (α=0.3)')
axes[1].plot(df_clean['timestamp'], df_clean['clean_value'], linewidth=2, linestyle='--', label='True Signal', color='black')
axes[1].set_title('Exponential Smoothing', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Time')
axes[1].set_ylabel('Value')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
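
To see what exponential smoothing does under the hood, the sketch below computes it by hand on a short made-up series and compares the result with `ewm(alpha=..., adjust=False)`, the call used in `exponential_smoothing` above. Each smoothed value is a blend of the new observation (weight `alpha`) and the previous smoothed value (weight `1 - alpha`).

```python
import numpy as np
import pandas as pd

x = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0])
alpha = 0.3

# Manual recursion: start from the first observation, then blend step by step.
manual = [x.iloc[0]]
for value in x.iloc[1:]:
    manual.append(alpha * value + (1 - alpha) * manual[-1])

pandas_result = x.ewm(alpha=alpha, adjust=False).mean()

print("Manual:", np.round(manual, 3))
print("Matches pandas:", np.allclose(manual, pandas_result))
```

A small `alpha` gives the past more weight, so the curve is smoother but reacts more slowly to real changes.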

### 🧪 Experiment: Try Different Window Sizes

Modify the code below to experiment with different parameters:

In [None]:
# YOUR TURN: Experiment with different parameters
window_size = 10  # Try changing this: 3, 5, 10, 20, 30
alpha = 0.2  # Try changing this: 0.1, 0.2, 0.5, 0.8

# Apply smoothing
custom_ma = moving_average(df_clean['value'], window_size)
custom_exp = exponential_smoothing(df_clean['value'], alpha)

# Visualize your results
plt.figure(figsize=(15, 5))
plt.plot(df_clean['timestamp'], df_clean['value'], linewidth=0.5, alpha=0.4, label='Original')
plt.plot(df_clean['timestamp'], custom_ma, linewidth=2, label=f'Your MA (window={window_size})')
plt.plot(df_clean['timestamp'], custom_exp, linewidth=2, label=f'Your Exp Smoothing (α={alpha})')
plt.title('Your Smoothing Experiment', fontsize=16, fontweight='bold')
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("- Larger window size = ___________")
print("- Smaller alpha = ___________")
print("- Best parameter for this data = ___________")

In [None]:
# Apply your chosen smoothing method
df_clean['value_smoothed'] = moving_average(df_clean['value'], 10)  # Adjust as needed
print("✅ Noise reduction applied!")

---
### Technique 3: Normalization (6 minutes)

In [None]:
# Create multiple series with different scales
series_a = df_clean['value_smoothed'].bfill().ffill()  # Scale: ~50-100 (fill NaN edges left by the centered window)
series_b = series_a * 10  # Scale: ~500-1000
series_c = series_a / 2  # Scale: ~25-50

# Normalize using different methods
def min_max_scaling(data):
    return (data - data.min()) / (data.max() - data.min())

def z_score_normalization(data):
    return (data - data.mean()) / data.std()

# Apply normalizations
a_minmax = min_max_scaling(series_a)
b_minmax = min_max_scaling(series_b)
c_minmax = min_max_scaling(series_c)

a_zscore = z_score_normalization(series_a)
b_zscore = z_score_normalization(series_b)
c_zscore = z_score_normalization(series_c)

# Visualize
fig, axes = plt.subplots(3, 1, figsize=(15, 12))

# Original (different scales)
axes[0].plot(df_clean['timestamp'], series_a, label='Series A (~50-100)', linewidth=2)
axes[0].plot(df_clean['timestamp'], series_b, label='Series B (~500-1000)', linewidth=2)
axes[0].plot(df_clean['timestamp'], series_c, label='Series C (~25-50)', linewidth=2)
axes[0].set_title('Original Series (Different Scales)', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Value')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Min-Max Normalized (0-1 range)
axes[1].plot(df_clean['timestamp'], a_minmax, label='Series A (normalized)', linewidth=2)
axes[1].plot(df_clean['timestamp'], b_minmax, label='Series B (normalized)', linewidth=2)
axes[1].plot(df_clean['timestamp'], c_minmax, label='Series C (normalized)', linewidth=2)
axes[1].set_title('Min-Max Scaling (0-1 range)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Normalized Value')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Z-score Normalized (mean=0, std=1)
axes[2].plot(df_clean['timestamp'], a_zscore, label='Series A (normalized)', linewidth=2)
axes[2].plot(df_clean['timestamp'], b_zscore, label='Series B (normalized)', linewidth=2)
axes[2].plot(df_clean['timestamp'], c_zscore, label='Series C (normalized)', linewidth=2)
axes[2].set_title('Z-Score Normalization (mean=0, std=1)', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Time')
axes[2].set_ylabel('Normalized Value')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📏 Normalization Statistics:")
print("\nMin-Max Scaling:")
print(f"  Range: [{a_minmax.min():.2f}, {a_minmax.max():.2f}]")
print("\nZ-Score Normalization:")
print(f"  Mean: {a_zscore.mean():.2f}")
print(f"  Std: {a_zscore.std():.2f}")
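
You may have noticed that the three normalized curves lie exactly on top of each other. That is expected: Series B and C are just rescaled copies of Series A, and both normalization methods remove a constant positive scale factor. The sketch below checks this on a short made-up series, using the same two functions as above.

```python
import numpy as np
import pandas as pd

def min_max_scaling(data):
    return (data - data.min()) / (data.max() - data.min())

def z_score_normalization(data):
    return (data - data.mean()) / data.std()

a = pd.Series([40.0, 55.0, 62.0, 48.0, 80.0])
b = a * 10   # same shape, 10x larger
c = a / 2    # same shape, half the size

same_minmax = np.allclose(min_max_scaling(a), min_max_scaling(b)) and \
              np.allclose(min_max_scaling(a), min_max_scaling(c))
same_zscore = np.allclose(z_score_normalization(a), z_score_normalization(b)) and \
              np.allclose(z_score_normalization(a), z_score_normalization(c))

print("Min-max results identical:", same_minmax)
print("Z-score results identical:", same_zscore)
```

After normalization only the shape of a series is left, which is what makes series with different units comparable.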

### 💭 Reflection: Why normalize?

**When is normalization needed?**
- Comparing multiple time series with different scales
- Machine learning models sensitive to scale
- Computing distances or similarities

**Which method to use?**
- Min-Max: ___________
- Z-Score: ___________

---
### Technique 4: Outlier Detection (6 minutes)

In [None]:
# Detect outliers using statistical methods
data = df_clean['value'].bfill().ffill()  # fillna(method=...) is deprecated in recent pandas

# Method 1: Z-score method (threshold = 3)
z_scores = np.abs(stats.zscore(data))
outliers_zscore = z_scores > 3

# Method 2: IQR method
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = (data < lower_bound) | (data > upper_bound)

# Visualize outlier detection
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Z-score method visualization
axes[0, 0].plot(df_clean['timestamp'], data, linewidth=1, label='Data')
axes[0, 0].scatter(df_clean['timestamp'][outliers_zscore], data[outliers_zscore], 
                   color='red', s=100, zorder=5, label=f'Outliers (n={outliers_zscore.sum()})')
axes[0, 0].set_title('Z-Score Method (threshold=3)', fontsize=12, fontweight='bold')
axes[0, 0].set_ylabel('Value')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# IQR method visualization
axes[0, 1].plot(df_clean['timestamp'], data, linewidth=1, label='Data')
axes[0, 1].scatter(df_clean['timestamp'][outliers_iqr], data[outliers_iqr], 
                   color='red', s=100, zorder=5, label=f'Outliers (n={outliers_iqr.sum()})')
axes[0, 1].axhline(y=lower_bound, color='orange', linestyle='--', label='IQR bounds')
axes[0, 1].axhline(y=upper_bound, color='orange', linestyle='--')
axes[0, 1].set_title('IQR Method', fontsize=12, fontweight='bold')
axes[0, 1].set_ylabel('Value')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Boxplot for IQR visualization
axes[1, 0].boxplot(data, vert=True)
axes[1, 0].scatter([1]*outliers_iqr.sum(), data[outliers_iqr], color='red', s=100, zorder=5, label='Outliers')
axes[1, 0].set_title('Boxplot (IQR Method)', fontsize=12, fontweight='bold')
axes[1, 0].set_ylabel('Value')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Distribution with outliers marked
axes[1, 1].hist(data, bins=30, alpha=0.7, label='Data distribution')
axes[1, 1].hist(data[outliers_iqr], bins=30, alpha=0.7, color='red', label='Outliers')
axes[1, 1].axvline(x=lower_bound, color='orange', linestyle='--', linewidth=2, label='IQR bounds')
axes[1, 1].axvline(x=upper_bound, color='orange', linestyle='--', linewidth=2)
axes[1, 1].set_title('Distribution with Outliers', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Value')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Outlier Detection Summary:")
print(f"Z-score method detected: {outliers_zscore.sum()} outliers")
print(f"IQR method detected: {outliers_iqr.sum()} outliers")
print(f"\nIQR bounds: [{lower_bound:.2f}, {upper_bound:.2f}]")
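
The two methods can disagree, and a tiny example shows why. In the made-up sample below, one extreme value inflates both the mean and the standard deviation, so its own z-score stays under the threshold of 3. The IQR rule is based on quartiles, which the extreme value barely affects.

```python
import numpy as np
import pandas as pd

data = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])

# Z-score method (population standard deviation, as in scipy.stats.zscore)
z = np.abs((data - data.mean()) / data.std(ddof=0))
outliers_z = z > 3

# IQR method
q1, q3 = data.quantile(0.25), data.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers_iqr = (data < lower) | (data > upper)

print(f"Largest z-score: {z.max():.2f}")
print(f"Z-score outliers: {outliers_z.sum()}, IQR outliers: {outliers_iqr.sum()}")
print(f"IQR bounds: [{lower:.2f}, {upper:.2f}]")
```

With only six points the z-score of a single value cannot reach 3 at all, so the z-score method works best on larger samples with few outliers.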

### 🛠️ Handling Outliers

In [None]:
# Different strategies for handling outliers
data_cleaned = data.copy()

# Strategy 1: Remove outliers (set to NaN)
data_removed = data.copy()
data_removed[outliers_iqr] = np.nan

# Strategy 2: Cap outliers (winsorization)
data_capped = data.copy()
data_capped[data_capped < lower_bound] = lower_bound
data_capped[data_capped > upper_bound] = upper_bound

# Strategy 3: Replace with median
data_median = data.copy()
data_median[outliers_iqr] = data.median()

# Visualize strategies
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

axes[0, 0].plot(df_clean['timestamp'], data, linewidth=1)
axes[0, 0].scatter(df_clean['timestamp'][outliers_iqr], data[outliers_iqr], color='red', s=50, zorder=5)
axes[0, 0].set_title('Original (with outliers)', fontsize=12, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)

axes[0, 1].plot(df_clean['timestamp'], data_removed, linewidth=1)
axes[0, 1].set_title('Strategy 1: Removed (NaN)', fontsize=12, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

axes[1, 0].plot(df_clean['timestamp'], data_capped, linewidth=1)
axes[1, 0].scatter(df_clean['timestamp'][outliers_iqr], data_capped[outliers_iqr], color='orange', s=50, zorder=5)
axes[1, 0].set_title('Strategy 2: Capped at bounds', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Time')
axes[1, 0].grid(True, alpha=0.3)

axes[1, 1].plot(df_clean['timestamp'], data_median, linewidth=1)
axes[1, 1].scatter(df_clean['timestamp'][outliers_iqr], data_median[outliers_iqr], color='green', s=50, zorder=5)
axes[1, 1].set_title('Strategy 3: Replaced with median', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Time')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Which strategy to use?")
print("- Remove: When outliers are measurement errors")
print("- Cap: When you want to preserve the pattern but limit extreme values")
print("- Replace: When you want to maintain data continuity")
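
Two of these strategies can be written more compactly. The sketch below uses `Series.clip` for capping, and combines "remove" with the linear interpolation from Technique 1 so that no gap is left behind. The data and the bounds are made-up values for illustration, not results from the cells above.

```python
import numpy as np
import pandas as pd

data = pd.Series([10.0, 12.0, 11.0, 100.0, 13.0, 12.0])
lower_bound, upper_bound = 9.0, 15.0  # hypothetical bounds, e.g. from an IQR rule

# Cap: values outside the bounds are pulled back to the nearest bound.
capped = data.clip(lower=lower_bound, upper=upper_bound)

# Remove, then repair: mark outliers as NaN and interpolate across the gap.
is_outlier = (data < lower_bound) | (data > upper_bound)
repaired = data.mask(is_outlier).interpolate(method="linear")

print(pd.DataFrame({"original": data, "capped": capped, "repaired": repaired}))
```

Capping keeps a trace of the spike at the bound, while remove-and-interpolate replaces it with a value that follows the neighboring points.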

### 👥 Peer Review (5 minutes)

Exchange notebooks with your neighbor and discuss:

1. Which outlier detection method found more outliers?
2. Which handling strategy seems most appropriate for this data?
3. What are the trade-offs of each approach?

**Your conclusions:**

___________________________________________

___________________________________________

---
## 🎓 Final Exercise: Complete Pipeline

Now apply all techniques to create a complete preprocessing pipeline!

In [None]:
# Complete preprocessing pipeline
def preprocess_time_series(data):
    """
    Complete preprocessing pipeline for time series data
    """
    # Step 1: Handle missing values
    data_clean = data.interpolate(method='linear')
    
    # Step 2: Detect outliers
    Q1 = data_clean.quantile(0.25)
    Q3 = data_clean.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = (data_clean < lower_bound) | (data_clean > upper_bound)
    
    # Step 3: Handle outliers (cap them)
    data_clean[data_clean < lower_bound] = lower_bound
    data_clean[data_clean > upper_bound] = upper_bound
    
    # Step 4: Smooth noise
    data_clean = data_clean.rolling(window=10, center=True).mean()
    data_clean = data_clean.bfill().ffill()  # fill NaN edges left by the centered window
    
    # Step 5: Normalize
    data_normalized = (data_clean - data_clean.min()) / (data_clean.max() - data_clean.min())
    
    return data_normalized, outliers

# Apply pipeline to our dirty data
final_data, detected_outliers = preprocess_time_series(df_dirty['value'])

# Visualize before and after
fig, axes = plt.subplots(2, 1, figsize=(15, 8))

axes[0].plot(df_dirty['timestamp'], df_dirty['value'], linewidth=1, alpha=0.7)
axes[0].set_title('Before Preprocessing', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Value')
axes[0].grid(True, alpha=0.3)

axes[1].plot(df_dirty['timestamp'], final_data, linewidth=2, color='green')
axes[1].set_title('After Complete Preprocessing Pipeline', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Time')
axes[1].set_ylabel('Normalized Value')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✅ Preprocessing Pipeline Complete!")
print(f"\n📊 Summary:")
print(f"  - Missing values handled: {df_dirty['value'].isna().sum()}")
print(f"  - Outliers detected and handled: {detected_outliers.sum()}")
print(f"  - Data smoothed and normalized")
print(f"  - Ready for analysis! 🎉")

---
## 📝 Exit Ticket

Before you leave, please answer these questions:

### 1. Name ONE difference between time series and sequential data:

___________________________________________

### 2. Which transformation technique would you use for sensor data with a lot of noise?

___________________________________________

### 3. One question you still have:

___________________________________________

---
## 🎯 Key Takeaways

1. **Time Series vs Sequential**: A time series has regular time intervals and time itself is intrinsically important; in sequential data only the order matters

2. **Time Series Components**:
   - Trend: Long-term direction
   - Seasonality: Regular patterns
   - Noise: Random fluctuations

3. **Preprocessing Techniques**:
   - **Missing data**: Interpolation, forward/backward fill
   - **Noise reduction**: Moving average, exponential smoothing
   - **Normalization**: Min-max scaling, z-score
   - **Outliers**: Z-score method, IQR method

4. **Stationarity** is important for many forecasting models

---

## 🔮 Next Lesson Preview

Now that we can prepare time series data, we'll learn how to make **forecasts** using ARIMA models!

**See you next time! 👋**