# Urban Pulse - Machine Learning Models

## Traffic Congestion Prediction

This notebook implements:
- Logistic Regression model for binary classification
- Decision Tree model for comparison
- Model evaluation and comparison
- Feature importance analysis


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import os

# Add src to path
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'src'))

from models import (
    prepare_features,
    train_logistic_regression,
    train_decision_tree,
    plot_confusion_matrix,
    plot_feature_importance,
    plot_model_comparison,
    print_model_comparison_summary
)

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Libraries imported successfully")


## 1. Load Processed Data

Load the cleaned and preprocessed dataset with all features.


In [None]:
# Load processed data
data_path = '../data/processed/traffic_cleaned.csv'

try:
    df = pd.read_csv(data_path, parse_dates=['date_time'])
    print(f"✓ Data loaded: {df.shape}")
    print("Target distribution:")
    print(df['is_congested'].value_counts())
except FileNotFoundError:
    print(f"⚠️  {data_path} not found. Please run 02_data_preprocessing.ipynb first")
    raise  # Stop here: every later cell needs df


## 2. Prepare Features for Machine Learning

Select and prepare features for model training.


In [None]:
# Prepare features and target
X, y = prepare_features(df, target_column='is_congested')

print(f"\nFeature matrix shape: {X.shape}")
print(f"Target shape: {y.shape}")
print("\nFeature columns:")
for i, col in enumerate(X.columns, 1):
    print(f"  {i:2d}. {col}")
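
The `prepare_features` helper lives in `src/models.py` and is not shown in this notebook. As a rough illustration only, a helper of this kind might separate the target, drop the raw timestamp, and one-hot encode text columns. The function name `prepare_features_sketch` and the `hour` and `weather` columns below are assumptions made for the example, not the project's actual code or schema.

```python
import pandas as pd

def prepare_features_sketch(df, target_column="is_congested"):
    """Hypothetical stand-in for models.prepare_features (real logic not shown)."""
    y = df[target_column].astype(int)
    X = df.drop(columns=[target_column])
    # Drop raw timestamps: the models below need numeric inputs
    X = X.select_dtypes(exclude=["datetime", "datetimetz"])
    # One-hot encode any remaining text or categorical columns
    X = pd.get_dummies(X, drop_first=True)
    return X, y

# Toy frame; 'hour' and 'weather' are illustrative column names
demo_df = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 13:00",
        "2024-01-06 08:00", "2024-01-06 18:00",
    ]),
    "hour": [8, 13, 8, 18],
    "weather": ["Rain", "Clear", "Clear", "Rain"],
    "is_congested": [1, 0, 0, 1],
})

X_demo, y_demo = prepare_features_sketch(demo_df)
print(X_demo.columns.tolist())
print(y_demo.tolist())
```

Whatever the real helper does, the printed column list above is the quickest check that neither the target nor the raw timestamp reached the feature matrix.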


## 3. Train Logistic Regression Model

Train the first model: Logistic Regression for binary classification.


In [None]:
# Train Logistic Regression
lr_model, lr_metrics = train_logistic_regression(
    X, y,
    test_size=0.2,
    random_state=42,
    max_iter=1000
)
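
The internals of `train_logistic_regression` are not shown here. The sketch below is one plausible shape for such a helper, assuming it splits the data, fits scikit-learn's `LogisticRegression`, and returns a metrics dictionary with the keys this notebook reads later (`confusion_matrix`, `coefficients`, `y_test`, `y_pred`). The name `train_lr_sketch`, the stratified split, and the synthetic data are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

def train_lr_sketch(X, y, test_size=0.2, random_state=42, max_iter=1000):
    """Hypothetical stand-in for models.train_logistic_regression."""
    # Stratify so both splits keep the same congested / not congested ratio
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    model = LogisticRegression(max_iter=max_iter)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "coefficients": dict(zip(X.columns, model.coef_[0])),
        "y_test": y_test,
        "y_pred": y_pred,
    }
    return model, metrics

# Synthetic binary-classification data standing in for the traffic features
X_arr, y_arr = make_classification(n_samples=500, n_features=5, random_state=0)
X_syn = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
sketch_model, sketch_metrics = train_lr_sketch(X_syn, pd.Series(y_arr))
print(f"Accuracy: {sketch_metrics['accuracy']:.3f}")
print(sketch_metrics["confusion_matrix"])
```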


## 4. Visualize Logistic Regression Results

Plot confusion matrix and analyze coefficients.


In [None]:
# Plot confusion matrix
plot_confusion_matrix(
    lr_metrics['confusion_matrix'],
    'Logistic Regression',
    save_path='../reports/figures/lr_confusion_matrix.png'
)

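# Display top coefficients. Magnitudes are only comparable across features
# if the inputs were put on a common scale before fitting.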
print("\nTop 10 Most Important Features (by coefficient magnitude):")
print("="*60)
coef_df = pd.DataFrame({
    'Feature': list(lr_metrics['coefficients'].keys()),
    'Coefficient': list(lr_metrics['coefficients'].values())
})
coef_df['Abs_Coefficient'] = coef_df['Coefficient'].abs()
coef_df = coef_df.sort_values('Abs_Coefficient', ascending=False).head(10)
print(coef_df.to_string(index=False))
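
Ranking features by coefficient magnitude is only meaningful when the features share a scale, and this notebook does not show whether `train_logistic_regression` standardizes its inputs. The sketch below uses made-up data in which two features influence the outcome equally but one is stored in units 100 times larger. It compares raw coefficients with those from a `StandardScaler` pipeline. All names and data here are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
z1, z2 = rng.normal(size=n), rng.normal(size=n)

# Both signals drive the outcome equally strongly
p = 1 / (1 + np.exp(-(1.5 * z1 + 1.5 * z2)))
y_toy = (rng.random(n) < p).astype(int)

# Same information, but the second feature is stored in units 100x larger
X_toy = pd.DataFrame({"small_scale": z1, "large_scale": 100 * z2})

raw = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled.fit(X_toy, y_toy)

raw_coef = dict(zip(X_toy.columns, raw.coef_[0]))
scaled_coef = dict(
    zip(X_toy.columns, scaled.named_steps["logisticregression"].coef_[0])
)
print("Raw coefficients:   ", raw_coef)
print("Scaled coefficients:", scaled_coef)
```

On the raw inputs the large-scale feature gets a much smaller coefficient even though it matters just as much. After standardization the two coefficients are of similar size, so a magnitude ranking reflects influence rather than units.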


## 5. Train Decision Tree Model

Train the second model: Decision Tree for comparison.


In [None]:
# Train Decision Tree
dt_model, dt_metrics = train_decision_tree(
    X, y,
    test_size=0.2,
    random_state=42,
    max_depth=10,  # Limit depth to prevent overfitting
    min_samples_split=20
)
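
To see why `max_depth` and `min_samples_split` are set, it can help to compare a restricted tree with an unrestricted one on noisy data. The sketch below does this with scikit-learn on synthetic data (an assumption; it does not use the traffic dataset). The gap between train and test accuracy is the usual sign of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, so a fully grown tree can memorize noise
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

settings = {
    "unrestricted": {},
    "restricted": {"max_depth": 10, "min_samples_split": 20},
}

results = {}
for label, params in settings.items():
    tree = DecisionTreeClassifier(random_state=42, **params).fit(X_tr, y_tr)
    results[label] = {
        "train": tree.score(X_tr, y_tr),
        "test": tree.score(X_te, y_te),
        "depth": tree.get_depth(),
    }
    print(
        f"{label:>12}: depth={results[label]['depth']:2d}  "
        f"train={results[label]['train']:.3f}  test={results[label]['test']:.3f}"
    )
```

The unrestricted tree fits the training split perfectly, which says little about how it generalizes. The same train-versus-test comparison on the real data would show whether `max_depth=10` is a reasonable limit.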


## 6. Visualize Decision Tree Results

Plot confusion matrix and feature importance.


In [None]:
# Plot confusion matrix
plot_confusion_matrix(
    dt_metrics['confusion_matrix'],
    'Decision Tree',
    save_path='../reports/figures/dt_confusion_matrix.png'
)

# Plot feature importance
plot_feature_importance(
    dt_metrics,
    top_n=10,
    save_path='../reports/figures/dt_feature_importance.png'
)


## 7. Model Comparison

Compare both models side by side.


In [None]:
# Compare models
plot_model_comparison(
    lr_metrics,
    dt_metrics,
    save_path='../reports/figures/model_comparison.png'
)

# Print detailed comparison
print_model_comparison_summary(lr_metrics, dt_metrics)


## 8. Detailed Classification Reports

Generate detailed classification reports for both models.


In [None]:
from sklearn.metrics import classification_report

print("="*60)
print("LOGISTIC REGRESSION - Classification Report")
print("="*60)
print(classification_report(lr_metrics['y_test'], lr_metrics['y_pred'],
                           target_names=['Not Congested', 'Congested']))

print("\n" + "="*60)
print("DECISION TREE - Classification Report")
print("="*60)
print(classification_report(dt_metrics['y_test'], dt_metrics['y_pred'],
                           target_names=['Not Congested', 'Congested']))
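
The precision and recall columns in these reports come straight from the confusion matrix. As a small worked example with made-up labels (not results from this project), the sketch below derives both values by hand for the "Congested" class and checks them against scikit-learn.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 0 = Not Congested, 1 = Congested
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of predicted congested, how many really were
recall = tp / (tp + fn)     # of truly congested, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# The hand-computed values agree with scikit-learn's
assert abs(precision - precision_score(y_true, y_hat)) < 1e-12
assert abs(recall - recall_score(y_true, y_hat)) < 1e-12
```

Here precision is 3/5 = 0.60 and recall is 3/4 = 0.75. For congestion alerts, low recall means missed jams, while low precision means false alarms.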


## 9. Key Insights and Conclusions

### Model Performance Summary

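**Logistic Regression:**
- Provides interpretable coefficients: the sign shows whether a feature raises or lowers the odds of congestion
- Good baseline performance
- Easy to explain feature contributions

**Decision Tree:**
- Slightly better accuracy than Logistic Regression (see the comparison in Section 7)
- Shows clear feature importance
- Captures non-linear patterns, such as congestion peaking at specific hours rather than rising steadily through the day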

### Critical Factors for Traffic Congestion

Based on the Logistic Regression coefficients (Section 4) and the Decision Tree feature importances (Section 6), the most important factors are:
1. **Time of Day (Hour)**: Strongest predictor
2. **Rush Hour Flag**: Clear indicator of congestion
3. **Day of Week**: Weekdays vs weekends
4. **Weather Conditions**: Impact traffic patterns

### Recommendations

1. **For City Planning**: Focus on rush hour management (7-9 AM, 5-7 PM)
2. **For Real-time Prediction**: Use Decision Tree for better accuracy
3. **For Interpretability**: Use Logistic Regression to explain factors

**Next Steps:**
- Consider ensemble methods for improved performance
- Add more features (events, construction, accidents)
- Deploy model for real-time predictions
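
As a starting point for the ensemble step, the sketch below compares a single decision tree with a random forest using 5-fold cross-validation. It runs on synthetic data, and the choice of `RandomForestClassifier` is only one possible ensemble. Both are assumptions for illustration, and the printed scores say nothing about the traffic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the traffic features
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=4, random_state=0
)

candidates = {
    "Decision Tree": DecisionTreeClassifier(
        max_depth=10, min_samples_split=20, random_state=42
    ),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=10, min_samples_split=20, random_state=42
    ),
}

# Cross-validation gives a mean and a spread instead of a single split's score
cv_scores = {
    name: cross_val_score(est, X_syn, y_syn, cv=5, scoring="accuracy")
    for name, est in candidates.items()
}
for name, scores in cv_scores.items():
    print(f"{name:>13}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The same loop applied to `X` and `y` from Section 2 would show whether an ensemble actually improves on the single tree before any deployment work begins.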
