# Urban Pulse - Data Preprocessing

## Data Cleaning and Feature Engineering

This notebook handles:
- Missing value imputation
- Outlier detection and handling
- Datetime parsing and temporal feature creation
- Derived feature engineering (rush hour, traffic stress levels)
- Data quality documentation


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import sys
import os
from pathlib import Path

# Add project root to path (works in PyCharm and Jupyter)
# Works when the notebook is run from the project root, from notebooks/,
# or from any directory below the project root
current_dir = Path().resolve()
# Check if we're in notebooks directory or project root
if (current_dir / 'src').exists():
    # We're in project root
    project_root = current_dir
elif (current_dir.parent / 'src').exists():
    # We're in notebooks directory, go up one level
    project_root = current_dir.parent
else:
    # Try to find project root by looking for src directory
    project_root = current_dir
    while project_root != project_root.parent:
        if (project_root / 'src').exists():
            break
        project_root = project_root.parent

# Add project root to Python path
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

from src.data_processing import (
    load_data,
    inspect_data,
    handle_missing_values,
    handle_outliers,
    parse_datetime,
    create_rush_hour_feature,
    create_traffic_stress_level,
    preprocess_pipeline,
    load_and_clean_data
)

print("✓ Libraries imported successfully")


✓ Libraries imported successfully
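
The root lookup in the cell above can also be written as a small helper. The sketch below is an illustration only: `find_project_root` and its `marker` parameter are hypothetical names, not part of `src.data_processing`. It returns `None` when no marker directory is found, so a wrong working directory is noticed early instead of the search ending at the filesystem root.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "src") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    # Check the starting directory first, then each parent in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None  # no marker directory anywhere up the tree


project_root = find_project_root(Path.cwd())
if project_root is None:
    print("Could not locate a directory containing 'src'")
else:
    print(f"Project root: {project_root}")
```

A caller can then decide whether a missing root should stop the notebook or fall back to a default.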


## 1. Load Raw Data

Load the raw Metro Interstate Traffic Volume CSV from `data/raw/`. The path below is relative to the `notebooks/` directory.


In [2]:
# Load raw data
data_path = '../data/raw/Metro_Interstate_Traffic_Volume.csv'

try:
    df_raw = load_data(data_path)
    print(f"✓ Raw data loaded: {df_raw.shape}")
except FileNotFoundError:
    print("⚠️  Data file not found. Run 01_data_exploration.ipynb first or check that the file exists at data_path")
    raise  # Stop here: every later cell needs df_raw


✓ Data loaded successfully: 48204 rows, 9 columns
✓ Raw data loaded: (48204, 9)


## 2. Run Complete Preprocessing Pipeline

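`preprocess_pipeline` runs the cleaning and feature engineering steps in a single call: datetime parsing, missing-value handling, outlier capping, rush hour features, and traffic stress levels. It returns the processed DataFrame together with a preprocessing report.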
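
Forward fill is worth a quick check before applying it to a sparse label column. In the quality report for this dataset, `holiday` is the only column with missing values (99.87% of rows). If those gaps simply mean "no holiday" (an assumption about the raw file, not something the log confirms), forward fill would carry each holiday name into every following row until the next holiday. The toy frame below is made up to show the difference between forward fill and a constant fill.

```python
import pandas as pd

# Toy data (made up): one holiday label followed by ordinary rows.
toy = pd.DataFrame(
    {
        "date_time": pd.to_datetime(
            [
                "2016-07-04 00:00",
                "2016-07-04 12:00",
                "2016-07-05 00:00",
                "2016-07-05 12:00",
            ]
        ),
        "holiday": ["Independence Day", None, None, None],
    }
)

# Forward fill carries the last seen label into every later row,
# including rows that fall on the next calendar day.
forward_filled = toy["holiday"].ffill()

# Constant fill keeps the label only on the row where it was recorded.
constant_filled = toy["holiday"].fillna("None")

print(pd.DataFrame({"ffill": forward_filled, "constant": constant_filled}))
```

Comparing `df_processed['holiday'].value_counts()` with the raw counts after the pipeline runs is a cheap way to see which behaviour applies here.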


In [3]:
# Run complete preprocessing pipeline
df_processed, preprocessing_report = preprocess_pipeline(
    df_raw,
    target_column='traffic_volume',
    date_column='date_time',
    missing_strategy='forward_fill',  # Suits slowly varying series; review sparse columns such as 'holiday'
    outlier_method='cap'  # Cap outliers rather than remove
)


STARTING DATA PREPROCESSING PIPELINE
DATA QUALITY REPORT
Shape: 48204 rows × 9 columns

Missing Values:
  holiday: 48143 (99.87%)

Duplicate Rows: 17
Memory Usage: 11.71 MB
✓ Parsed datetime column 'date_time' and extracted temporal features

Handling missing values using 'forward_fill' strategy...
  holiday: 48143 → 0 missing values
✓ Capped outliers in 'traffic_volume' at [-4417.00, 10543.00]
✓ Created rush hour features
✓ Created traffic stress levels:
  Low: < 2158
  Medium: 2158 - 4586
  High: >= 4586
DATA QUALITY REPORT
Shape: 48204 rows × 19 columns

Missing Values:

Duplicate Rows: 17
Memory Usage: 17.52 MB

PREPROCESSING PIPELINE COMPLETE
Initial rows: 48204
Final rows: 48204
Features created: 10
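
The capping bounds in the log look like the output of the common 1.5 × IQR rule, although the log does not name the method, so treat that as an assumption. Under that rule the bounds are `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR`, which makes their distance 4 × IQR. The reported bounds would then imply IQR = (10543 + 4417) / 4 = 3740, Q1 = 1193 and Q3 = 4933. A negative lower bound can never bind on non-negative counts, so only the upper cap could change any value. The sketch below applies the rule to made-up numbers; `iqr_bounds` is a hypothetical helper, not the function used by `handle_outliers`.

```python
from typing import Tuple

import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> Tuple[float, float]:
    """Return (lower, upper) capping bounds using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data (made up): six plausible readings and one extreme value.
volume = pd.Series([0, 1200, 1500, 3400, 4800, 5000, 25000], name="traffic_volume")

lower, upper = iqr_bounds(volume)

# clip() replaces values outside the bounds with the nearest bound,
# so the row count stays the same (capping, not removal).
capped = volume.clip(lower=lower, upper=upper)

print(f"Bounds: [{lower:.2f}, {upper:.2f}]")
print(capped.tolist())
```

Capping keeps every row, which matches the unchanged row count (48204) reported by the pipeline.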


## 3. Verify Preprocessing Results

Confirm that each expected feature exists, then print its value range (numeric columns) or category counts (text columns).


In [4]:
# Display new features created
print("New Features Created:")
print("="*60)
new_features = ['year', 'month', 'day', 'hour', 'day_of_week', 'is_weekend',
                'is_rush_hour', 'rush_hour_type', 'traffic_stress_level', 'is_congested']

for feature in new_features:
    if feature not in df_processed.columns:
        # Report missing features instead of skipping them silently
        print(f"✗ {feature} (missing)")
        continue
    print(f"✓ {feature}")
    if pd.api.types.is_numeric_dtype(df_processed[feature]):
        print(f"    Range: {df_processed[feature].min()} - {df_processed[feature].max()}")
    else:
        print(f"    Values: {df_processed[feature].value_counts().to_dict()}")


New Features Created:
✓ year
    Range: 2012 - 2018
✓ month
    Range: 1 - 12
✓ day
    Range: 1 - 31
✓ hour
    Range: 0 - 23
✓ day_of_week
    Range: 0 - 6
✓ is_weekend
    Range: 0 - 1
✓ is_rush_hour
    Range: 0 - 1
✓ rush_hour_type
    Values: {'normal': 36147, 'morning_rush': 6177, 'evening_rush': 5880}
✓ traffic_stress_level
    Values: {'Medium': 16387, 'High': 15910, 'Low': 15907}
✓ is_congested
    Range: 0 - 1
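
For readers who want to see how features like these can be derived, here is a minimal sketch on made-up data. It is not the code in `src.data_processing`. The rush hour windows (07:00-09:59 and 16:00-18:59) are illustrative assumptions, and the stress levels use tercile cut points, which is a guess based on the three classes being nearly equal in size in the output above.

```python
import numpy as np
import pandas as pd

# Toy data (made up): a Friday and a Saturday.
df = pd.DataFrame(
    {
        "date_time": pd.to_datetime(
            [
                "2018-09-28 03:00",
                "2018-09-28 08:00",
                "2018-09-28 12:00",
                "2018-09-28 17:00",
                "2018-09-29 08:00",
                "2018-09-29 22:00",
            ]
        ),
        "traffic_volume": [400, 5600, 4300, 6100, 2500, 1500],
    }
)

# Temporal features from the parsed datetime column.
df["hour"] = df["date_time"].dt.hour
df["day_of_week"] = df["date_time"].dt.dayofweek  # Monday = 0, Sunday = 6
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)

# Rush hour windows: hypothetical values chosen for illustration.
MORNING_RUSH_HOURS = [7, 8, 9]
EVENING_RUSH_HOURS = [16, 17, 18]
df["rush_hour_type"] = np.select(
    [df["hour"].isin(MORNING_RUSH_HOURS), df["hour"].isin(EVENING_RUSH_HOURS)],
    ["morning_rush", "evening_rush"],
    default="normal",
)
df["is_rush_hour"] = (df["rush_hour_type"] != "normal").astype(int)

# Stress levels from tercile cut points: Low < low_cut <= Medium < high_cut <= High.
low_cut = df["traffic_volume"].quantile(1 / 3)
high_cut = df["traffic_volume"].quantile(2 / 3)
df["traffic_stress_level"] = np.select(
    [df["traffic_volume"] < low_cut, df["traffic_volume"] < high_cut],
    ["Low", "Medium"],
    default="High",
)

print(df[["hour", "is_weekend", "rush_hour_type", "traffic_stress_level"]])
```

Because the cut points come from the data itself, they shift whenever the input changes; storing them alongside the processed file keeps later notebooks consistent.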


## 4. Save Processed Data

Save the cleaned and processed dataset for use in EDA and ML notebooks.


In [5]:
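# Save processed data (create the output directory if it does not exist yet)
output_path = '../data/processed/traffic_cleaned.csv'
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
df_processed.to_csv(output_path, index=False)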
print(f"✓ Processed data saved to: {output_path}")
print(f"  Shape: {df_processed.shape}")


✓ Processed data saved to: ../data/processed/traffic_cleaned.csv
  Shape: (48204, 19)
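
CSV files do not store column types, so `date_time` comes back as plain text unless the reader asks for date parsing. The sketch below shows the round trip on a two-row toy frame written to an in-memory buffer; the values are made up, and the column names are the only parts taken from this notebook.

```python
import io

import pandas as pd

# Toy frame (made up) with one datetime column and one label column.
original = pd.DataFrame(
    {
        "date_time": pd.to_datetime(["2018-09-30 22:00", "2018-09-30 23:00"]),
        "traffic_stress_level": ["Medium", "Low"],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Default read: date_time is loaded as text.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Read with parse_dates: date_time is restored as a datetime column.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["date_time"])

print(naive.dtypes)
print(restored.dtypes)
```

Downstream notebooks that load `traffic_cleaned.csv` would likely want the same `parse_dates=['date_time']` argument.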
