# Urban Pulse - Data Preprocessing

## Data Cleaning and Feature Engineering

This notebook handles:
- Missing value imputation
- Outlier detection and handling
- Datetime parsing and temporal feature creation
- Derived feature engineering (rush hour, traffic stress levels)
- Data quality documentation
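
To make the derived features above concrete, here is a minimal pandas sketch of temporal features, a rush hour flag, and a stress level. The sample rows, the rush hour windows (07-09 and 16-18 on weekdays), and the volume cut points are assumptions chosen for illustration; the functions in `data_processing` may use different rules.

```python
import pandas as pd

# Tiny hypothetical sample; column names follow the ones used in this notebook.
df = pd.DataFrame({
    "date_time": ["2018-09-28 08:00:00", "2018-09-29 13:00:00", "2018-10-01 17:00:00"],
    "traffic_volume": [5500, 3000, 6200],
})

# Datetime parsing and temporal features
df["date_time"] = pd.to_datetime(df["date_time"])
df["hour"] = df["date_time"].dt.hour
df["day_of_week"] = df["date_time"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"] >= 5

# Rush hour flag: weekday mornings 07-09 and evenings 16-18 (assumed windows)
morning = df["hour"].between(7, 9)
evening = df["hour"].between(16, 18)
df["is_rush_hour"] = (morning | evening) & ~df["is_weekend"]

# Stress level from fixed volume bins (assumed cut points)
df["traffic_stress_level"] = pd.cut(
    df["traffic_volume"],
    bins=[-1, 2000, 5000, float("inf")],
    labels=["low", "medium", "high"],
)

print(df[["hour", "is_weekend", "is_rush_hour", "traffic_stress_level"]])
```

Fixed cut points keep the labels comparable across datasets; quantile-based bins (`pd.qcut`) would be an alternative when only relative load matters.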


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import sys
import os

# Add src to path
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))

from data_processing import (
    load_data,
    inspect_data,
    handle_missing_values,
    handle_outliers,
    parse_datetime,
    create_rush_hour_feature,
    create_traffic_stress_level,
    preprocess_pipeline,
    load_and_clean_data
)

print("✓ Libraries imported successfully")


## 1. Load Raw Data

Load the raw dataset from `../data/raw/`. If the file is missing, run `01_data_exploration.ipynb` first.


In [None]:
# Load raw data
data_path = '../data/raw/Metro_Interstate_Traffic_Volume.csv'

try:
    df_raw = load_data(data_path)
    print(f"✓ Raw data loaded: {df_raw.shape}")
except FileNotFoundError:
    print(f"⚠️  Data file not found at {data_path}. Run 01_data_exploration.ipynb first or check the path.")
    raise  # Stop here: the following cells need df_raw


## 2. Run Complete Preprocessing Pipeline

`preprocess_pipeline` runs the cleaning and feature engineering steps in a single call and returns the processed DataFrame together with a preprocessing report.


In [None]:
# Run complete preprocessing pipeline
df_processed, preprocessing_report = preprocess_pipeline(
    df_raw,
    target_column='traffic_volume',
    date_column='date_time',
    missing_strategy='forward_fill',  # Good for time series data
    outlier_method='cap'  # Cap outliers rather than remove
)
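
The two options passed above, `missing_strategy='forward_fill'` and `outlier_method='cap'`, can be illustrated with plain pandas. This is a sketch under assumptions: the sample values are made up, and capping is shown with the common 1.5 × IQR rule, which may not be the rule `preprocess_pipeline` applies.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings with one gap and one extreme value
s = pd.Series([100.0, np.nan, 120.0, 110.0, 5000.0, 105.0], name="traffic_volume")

# Forward fill: each gap takes the last observed value
filled = s.ffill()

# IQR capping (assumed rule): clip to [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = filled.clip(lower=lower, upper=upper)

print(pd.DataFrame({"raw": s, "filled": filled, "capped": capped}))
```

Capping keeps every row, so the hourly sequence stays unbroken, whereas dropping outlier rows would leave gaps in the time series.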


## 3. Verify Preprocessing Results

Check that each expected feature column exists, and print its value counts or range.


In [None]:
# Display new features created
print("New Features Created:")
print("="*60)
new_features = ['year', 'month', 'day', 'hour', 'day_of_week', 'is_weekend',
                'is_rush_hour', 'rush_hour_type', 'traffic_stress_level', 'is_congested']

for feature in new_features:
    if feature not in df_processed.columns:
        print(f"✗ {feature} (missing)")
        continue
    print(f"✓ {feature}")
    if pd.api.types.is_numeric_dtype(df_processed[feature]):
        print(f"    Range: {df_processed[feature].min()} - {df_processed[feature].max()}")
    else:
        # Strings and categoricals: show the value distribution instead of a range
        print(f"    Values: {df_processed[feature].value_counts().to_dict()}")
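
Printing ranges confirms that the columns exist, but not that their values are consistent with each other. The sketch below adds a few consistency checks. The helper `check_temporal_features` is hypothetical, and it assumes `day_of_week` uses the pandas convention (Monday=0) with Saturday and Sunday as the weekend; adjust it if the pipeline encodes these differently.

```python
import pandas as pd

def check_temporal_features(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if "hour" in df.columns and not df["hour"].between(0, 23).all():
        problems.append("hour outside 0-23")
    if "month" in df.columns and not df["month"].between(1, 12).all():
        problems.append("month outside 1-12")
    if {"day_of_week", "is_weekend"} <= set(df.columns):
        # Assumed convention: Monday=0, so 5 and 6 are Saturday and Sunday
        expected = df["day_of_week"] >= 5
        if not (df["is_weekend"].astype(bool) == expected).all():
            problems.append("is_weekend disagrees with day_of_week")
    return problems

# Hypothetical frame with one deliberately invalid hour
sample = pd.DataFrame({
    "hour": [8, 25],
    "month": [1, 12],
    "day_of_week": [5, 0],
    "is_weekend": [1, 0],
})
problems = check_temporal_features(sample)
print(problems)
```

In the notebook, the same helper could be called as `check_temporal_features(df_processed)`.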


## 4. Save Processed Data

Save the cleaned and processed dataset for use in EDA and ML notebooks.


In [None]:
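# Save processed data (create the output directory if it does not exist yet)
output_path = '../data/processed/traffic_cleaned.csv'
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df_processed.to_csv(output_path, index=False)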
print(f"✓ Processed data saved to: {output_path}")
print(f"  Shape: {df_processed.shape}")
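
CSV does not store dtypes, so a datetime column comes back as text unless the reader asks for it to be parsed. The sketch below shows the round trip with a small made-up frame and a temporary directory; downstream notebooks could apply the same `parse_dates` argument when reading `traffic_cleaned.csv`, assuming `date_time` is kept in the processed output.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for the processed data
df = pd.DataFrame({
    "date_time": pd.to_datetime(["2018-09-28 08:00:00", "2018-09-28 09:00:00"]),
    "traffic_volume": [5500, 4800],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "traffic_cleaned.csv")
    df.to_csv(path, index=False)

    # Without parse_dates, date_time is read back as strings
    plain = pd.read_csv(path)
    # With parse_dates, date_time is restored as a datetime column
    parsed = pd.read_csv(path, parse_dates=["date_time"])

print(plain.dtypes)
print(parsed.dtypes)
```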
