# Bus Duration Prediction: Feature Engineering and Modeling

This notebook contains the feature engineering process and model development for predicting bus journey durations.

## Table of Contents
1. [Setup and Imports](#setup)
2. [Data Loading and Preprocessing](#data)
3. [Feature Engineering](#features)
   - Basic Features
   - Weather Features
   - Time-based Features
   - Advanced Feature Engineering
4. [Model Development](#modeling)
   - Data Splitting
   - Model Training
   - Model Evaluation
   - Stacking Ensemble
5. [Model Interpretability](#interpretability)
6. [Model Deployment](#deployment)
7. [Results Analysis](#results)
   - Feature Importance
   - Performance Metrics
   - Visualizations

<a id='setup'></a>
## 1. Setup and Imports

Import required libraries and set up the environment

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
import matplotlib.pyplot as plt
import seaborn as sns

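# Import custom utility functions
import sys
sys.path.append('../analysis_src')
# The helpers from create_interaction_features onward are called in later cells;
# they are assumed to live in model_utils alongside the others.
from model_utils import (
    dataframe_info,
    create_cyclical_features,
    evaluate_model,
    select_features,
    plot_feature_importance,
    tune_hyperparameters,
    analyze_predictions,
    plot_learning_curves,
    create_error_analysis,
    create_interaction_features,
    create_polynomial_features,
    create_lag_features,
    create_stacking_ensemble,
    analyze_feature_importance_shap,
    save_model_artifacts,
    create_prediction_api,
)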

# Set random seed for reproducibility
np.random.seed(42)

AttributeError: module 'matplotlib._api' has no attribute 'MatplotlibDeprecationWarning'

<a id='data'></a>
## 2. Data Loading and Preprocessing

Load the preprocessed data and perform any necessary cleaning or transformations

In [None]:
# Load preprocessed data
train_data = pd.read_csv('../data/processed/train_data.csv')
test_data = pd.read_csv('../data/processed/test_data.csv')

# Display basic information about the dataset
print("Training data shape:", train_data.shape)
print("\nFeature information:")
dataframe_info(train_data)

<a id='features'></a>
## 3. Feature Engineering

Create and transform features for model training

### 3.1 Basic Features

Process route, distance, and other basic features

In [None]:
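print("Creating basic features...")
# Calculate distances, speeds, etc.
# Caution: avg_speed is derived from the target (duration), so it is unknown at
# prediction time and leaks the target if it is kept as a model input.
train_data['avg_speed'] = train_data['distance'] / train_data['duration']
train_data['stops_per_km'] = train_data['num_stops'] / train_data['distance']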

### 3.2 Weather Features

Process and engineer weather-related features

In [None]:
# Weather feature processing code here
# ...

### 3.3 Time-based Features

Create cyclical time features and time-based aggregations

In [None]:
print("\nCreating time-based features...")
# Extract time components
train_data['hour'] = pd.to_datetime(train_data['departure_time']).dt.hour
train_data['day_of_week'] = pd.to_datetime(train_data['departure_date']).dt.dayofweek
train_data['month'] = pd.to_datetime(train_data['departure_date']).dt.month

# Create cyclical features for time components
print("Creating cyclical features...")
train_data['hour_sin'], train_data['hour_cos'] = create_cyclical_features(train_data, 'hour', 24)
train_data['day_sin'], train_data['day_cos'] = create_cyclical_features(train_data, 'day_of_week', 7)
train_data['month_sin'], train_data['month_cos'] = create_cyclical_features(train_data, 'month', 12)
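
The helper `create_cyclical_features` lives in `model_utils` and is not shown in this notebook. A minimal sketch of what such a helper might do, assuming it returns a sine/cosine pair for a column with a known period (the name `cyclical_encode` is hypothetical, not the project's function):

```python
import numpy as np
import pandas as pd

def cyclical_encode(df: pd.DataFrame, column: str, period: int):
    """Hypothetical stand-in for create_cyclical_features.

    Maps a periodic column onto the unit circle so that values one
    full period apart (e.g. hour 0 and hour 24) get the same encoding.
    """
    angle = 2 * np.pi * df[column] / period
    return np.sin(angle), np.cos(angle)

demo = pd.DataFrame({'hour': [0, 6, 12, 18, 23]})
demo['hour_sin'], demo['hour_cos'] = cyclical_encode(demo, 'hour', 24)
print(demo.round(3))
```

With this encoding, hour 23 and hour 0 end up close together, which a raw integer `hour` column cannot express.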

In [None]:
# Weather Features (if available)
if 'temperature' in train_data.columns and 'precipitation' in train_data.columns:
    print("\nProcessing weather features...")
    # Bin temperature into categories
    train_data['temp_category'] = pd.qcut(train_data['temperature'], q=5, labels=['Very Cold', 'Cold', 'Moderate', 'Warm', 'Hot'])
    
    # Create precipitation categories
    train_data['weather_condition'] = np.where(train_data['precipitation'] == 0, 'Clear',
                                    np.where(train_data['precipitation'] < 2.5, 'Light Rain',
                                    np.where(train_data['precipitation'] < 7.6, 'Moderate Rain', 'Heavy Rain')))
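
The nested `np.where` calls above can be hard to scan. As a sketch, the same thresholds can be written with `np.select`, which checks the conditions in order; the sample values below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative precipitation readings, including values on the thresholds
precip = pd.Series([0.0, 0.4, 2.5, 7.5, 7.6, 12.0], name='precipitation')
values = precip.to_numpy()

# Conditions are evaluated in order; the first match wins
conditions = [
    values == 0,
    values < 2.5,
    values < 7.6,
]
labels = ['Clear', 'Light Rain', 'Moderate Rain']

weather_condition = pd.Series(
    np.select(conditions, labels, default='Heavy Rain'),
    index=precip.index,
    name='weather_condition',
)
print(weather_condition.tolist())
```

In both versions a missing precipitation value fails every comparison and therefore falls through to `'Heavy Rain'`, so it may be worth handling missing values before this step.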

In [None]:
# Feature Selection
print("\nSelecting most important features...")
feature_cols = [col for col in train_data.columns if col not in ['duration', 'departure_time', 'departure_date']]
X = train_data[feature_cols]
y = train_data['duration']

# Select top features using f_regression
X_selected, selected_features = select_features(X, y, method='f_regression', k=15)
print("\nTop 15 selected features:", selected_features)

# Display feature information
print("\nFeature information after engineering:")
feature_info = dataframe_info(X_selected)
display(feature_info)
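
`select_features` is a project helper whose implementation is not shown. A rough sketch of univariate selection with `f_regression`, assuming the helper wraps scikit-learn's `SelectKBest` and only considers numeric columns (the function name `select_top_k` and the synthetic data are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

def select_top_k(X: pd.DataFrame, y: pd.Series, k: int):
    """Keep the k numeric columns with the highest F-statistic against y."""
    # f_regression needs numeric input without missing values
    numeric = X.select_dtypes(include=[np.number]).dropna(axis=1)
    selector = SelectKBest(score_func=f_regression, k=min(k, numeric.shape[1]))
    selector.fit(numeric, y)
    chosen = numeric.columns[selector.get_support()].tolist()
    return numeric[chosen], chosen

# Synthetic data: duration depends on distance and num_stops, not on noise
rng = np.random.default_rng(42)
demo_X = pd.DataFrame({
    'distance': rng.uniform(1, 30, 200),
    'num_stops': rng.integers(2, 40, 200),
    'noise': rng.normal(size=200),
    'route_name': ['A'] * 200,  # non-numeric, ignored by the selector
})
demo_y = 3.0 * demo_X['distance'] + 1.5 * demo_X['num_stops'] + rng.normal(size=200)

X_top, top_features = select_top_k(demo_X, demo_y, k=2)
print(top_features)
```

Categorical columns such as `temp_category` or `weather_condition` would need to be encoded first if they should take part in the selection.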

### 3.4 Advanced Feature Engineering


In [None]:
print("Creating advanced features...")

# Create interaction features
numeric_features = X.select_dtypes(include=[np.number]).columns
feature_pairs = [
    ('distance', 'num_stops'),
    ('avg_speed', 'stops_per_km'),
    ('hour', 'day_of_week')
]

X_advanced = create_interaction_features(X, feature_pairs)

# Create polynomial features for key metrics
poly_features = ['distance', 'avg_speed', 'stops_per_km']
X_advanced = create_polynomial_features(X_advanced, poly_features, degree=2)

# Create lag features if we have time-series data
if 'departure_time' in train_data.columns:
    print("\nCreating time-based lag features...")
    lag_features = ['duration', 'avg_speed']
    group_columns = ['route_id'] if 'route_id' in train_data.columns else None
    X_advanced = create_lag_features(
        X_advanced, 
        lag_features,
        'departure_time',
        group_columns=group_columns
    )

# Update feature selection with new features
X_selected, selected_features = select_features(X_advanced, y, method='f_regression', k=20)
print("\nSelected features after advanced engineering:", selected_features)
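
The helpers `create_interaction_features` and `create_lag_features` are not defined in this notebook. A minimal sketch of the two ideas, assuming pairwise products for interactions and a per-route shift for lags (function names, column names and data below are illustrative only):

```python
import pandas as pd

def add_interactions(df, pairs):
    """Add one product column per (a, b) pair."""
    out = df.copy()
    for a, b in pairs:
        out[f'{a}_x_{b}'] = out[a] * out[b]
    return out

def add_lag_features(df, columns, time_col, group_col=None, lag=1):
    """Add the previous value of each column, ordered by time within each group."""
    out = df.sort_values(time_col).copy()
    for col in columns:
        source = out.groupby(group_col)[col] if group_col is not None else out[col]
        # shift(lag) only looks backwards, so a row never sees its own value
        out[f'{col}_lag{lag}'] = source.shift(lag)
    return out.sort_index()

trips = pd.DataFrame({
    'route_id': ['A', 'B', 'A', 'B'],
    'departure_time': pd.to_datetime([
        '2024-01-01 08:00', '2024-01-01 08:05',
        '2024-01-01 09:00', '2024-01-01 09:05',
    ]),
    'distance': [10.0, 5.0, 10.0, 5.0],
    'num_stops': [20, 8, 20, 8],
    'duration': [30.0, 15.0, 36.0, 14.0],
})
trips = add_interactions(trips, [('distance', 'num_stops')])
trips = add_lag_features(trips, ['duration'], 'departure_time', group_col='route_id')
print(trips[['route_id', 'distance_x_num_stops', 'duration_lag1']])
```

Note that `X` in the cell above was built without `duration` and `departure_time`, so a helper along these lines would need those columns supplied from `train_data`. The first trip on each route has no predecessor and gets a missing lag value.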

<a id='modeling'></a>
## 4. Model Development

### 4.1 Data Splitting

Split the data into training and test sets (80/20)

In [None]:
print("Splitting data into train and test sets...")
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

### 4.2 Model Training

Train and compare different models

In [None]:
# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'XGBoost': XGBRegressor(n_estimators=100, random_state=42),
    'KNN': KNeighborsRegressor(n_neighbors=5)
}

# Train and evaluate models
results = {}
print("\nTraining and evaluating models...")

for name, model in models.items():
    print(f"\nTraining {name}...")
    model.fit(X_train, y_train)
    
    # Evaluate model
    metrics = evaluate_model(model, X_train, X_test, y_train, y_test)
    results[name] = metrics
    
    print(f"{name} Results:")
    print(f"Train RMSE: {metrics['rmse_train']:.2f}")
    print(f"Test RMSE: {metrics['rmse_test']:.2f}")
    print(f"R² Score (Test): {metrics['r2_test']:.4f}")
    
    # Plot feature importance for tree-based models
    if name in ['Random Forest', 'XGBoost']:
        print(f"\n{name} Feature Importance:")
        plot_feature_importance(model, selected_features)

In [None]:
# Create results summary
results_df = pd.DataFrame(results).T
print("\nModel Comparison:")
display(results_df)

# Find best model
best_model = min(results.items(), key=lambda x: x[1]['rmse_test'])
print(f"\nBest Model: {best_model[0]}")
print(f"Test RMSE: {best_model[1]['rmse_test']:.2f}")
print(f"R² Score: {best_model[1]['r2_test']:.4f}")
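
The stacking cell further down refers to `best_rf` and `best_xgb`, which are never assigned in this notebook; they presumably come from `tune_hyperparameters`. A sketch of how a tuned random forest could be obtained with scikit-learn's `GridSearchCV`, using synthetic data and a deliberately small grid (both are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=42
)

# Small illustrative grid; a real search would cover more values
param_grid = {'n_estimators': [25, 50], 'max_depth': [4, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring='neg_root_mean_squared_error',  # higher is better, so RMSE is negated
    cv=3,
)
search.fit(X_demo, y_demo)

best_rf_demo = search.best_estimator_  # refit on all of X_demo with the best params
print(search.best_params_, f"CV RMSE: {-search.best_score_:.2f}")
```

The search should only ever see the training split, so that the test set remains untouched for the final comparison.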

### 4.3 Model Evaluation

Evaluate model performance using various metrics

In [None]:
# Model evaluation code here
# ...
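
The cell above is still a placeholder. As one possible starting point, the usual regression metrics can be computed directly with scikit-learn; the function name `regression_report` and the sample values are hypothetical:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return MAE, RMSE and R² for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        'mae': float(mean_absolute_error(y_true, y_pred)),
        'rmse': float(np.sqrt(mse)),
        'r2': float(r2_score(y_true, y_pred)),
    }

# Illustrative durations in minutes
y_true = np.array([30.0, 45.0, 20.0, 60.0])
y_pred = np.array([28.0, 50.0, 22.0, 55.0])

report = regression_report(y_true, y_pred)
print(report)
```

MAE reads directly as "minutes off on average", while RMSE penalises large misses more heavily; reporting both helps show whether a few very late buses dominate the error.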

### 4.4 Stacking Ensemble


In [None]:
print("\nCreating stacking ensemble...")

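# Define base models with tuned hyperparameters
# NOTE: best_rf and best_xgb are not defined earlier in this notebook; they are
# assumed to be tuned estimators, e.g. produced by tune_hyperparameters.
base_models = {
    'rf': best_rf,
    'xgb': best_xgb,
    'knn': KNeighborsRegressor(n_neighbors=5),
    'ridge': Ridge(alpha=1.0)
}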

# Use LightGBM as meta-learner
from lightgbm import LGBMRegressor
meta_model = LGBMRegressor(n_estimators=100, random_state=42)

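# Create stacking ensemble
# Fit on the training split only: X_selected and y also contain the test rows,
# so passing them here would leak test data into the ensemble.
stacking_model, meta_train, meta_test = create_stacking_ensemble(
    base_models,
    meta_model,
    X_train,
    y_train,
    X_test
)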

# Evaluate stacking ensemble
print("\nEvaluating stacking ensemble...")
analyze_predictions(stacking_model, meta_test, y_test)
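
`create_stacking_ensemble` is a project helper that is not shown here. For comparison, scikit-learn ships a `StackingRegressor` that trains the meta-learner on out-of-fold predictions of the base models. The sketch below uses synthetic data and swaps the LightGBM meta-learner for `Ridge` to stay within scikit-learn; it is not the notebook's implementation:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the selected features and durations
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, noise=15.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

stack = StackingRegressor(
    estimators=[
        ('rf', RandomForestRegressor(n_estimators=50, random_state=42)),
        ('knn', KNeighborsRegressor(n_neighbors=5)),
        ('ridge', Ridge(alpha=1.0)),
    ],
    final_estimator=Ridge(alpha=1.0),  # meta-learner (Ridge here, not LightGBM)
    cv=5,  # base-model predictions for the meta-learner come from 5-fold CV
)
stack.fit(X_tr, y_tr)  # only the training split is used for fitting
print(f"Stacked R² on held-out data: {stack.score(X_te, y_te):.3f}")
```

Because the meta-learner only sees out-of-fold predictions, it is not rewarded for base models that simply memorise the training rows.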

<a id='interpretability'></a>
## 5. Model Interpretability

In [None]:
print("\nAnalyzing feature importance using SHAP values...")
analyze_feature_importance_shap(best_rf, X_test, max_display=15)

<a id='deployment'></a>
## 6. Model Deployment

In [None]:
print("\nPreparing model for deployment...")

# Save model artifacts
save_model_artifacts(
    best_rf,
    selected_features,
    output_dir='model_artifacts'
)

# Generate FastAPI application
api_code = create_prediction_api()

# Save API code
with open('app.py', 'w') as f:
    f.write(api_code)

print("\nModel artifacts saved and API code generated.")
print("To deploy the model:")
print("1. Install requirements: pip install fastapi uvicorn")
print("2. Start the API: uvicorn app:app --reload")
print("3. Access the API documentation at http://localhost:8000/docs")
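
`save_model_artifacts` is not defined in this notebook. A minimal sketch of what persisting a model together with its feature list could look like, assuming `joblib` is used for serialization (the function name `save_artifacts`, the file name and the synthetic data are hypothetical):

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def save_artifacts(model, feature_names, output_dir):
    """Store the model and the ordered feature list in a single file."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / 'model.joblib'
    joblib.dump({'model': model, 'features': list(feature_names)}, path)
    return path

# Small synthetic model to demonstrate the round trip
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5])
model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = save_artifacts(model, ['distance', 'num_stops', 'hour_sin'], tmp)
    bundle = joblib.load(path)  # what a serving process would do at startup

restored = bundle['model']
print(bundle['features'])
```

Saving the feature list next to the model lets the serving code check that incoming requests supply the same columns, in the same order, as the training data.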

### Final Summary


In [None]:
print("\nModel Development Summary:")
print("1. Advanced Feature Engineering:")
print(f"   - Created {len(X_advanced.columns) - len(X.columns)} new features")
print(f"   - Selected top {len(selected_features)} features")

print("\n2. Model Performance:")
print("   Base Models:")
for name, metrics in results.items():
    print(f"   - {name}: RMSE = {metrics['rmse_test']:.2f}, R² = {metrics['r2_test']:.4f}")
print("\n   Stacking Ensemble:")
ensemble_metrics = evaluate_model(stacking_model, meta_train, meta_test, y_train, y_test)
print(f"   - RMSE = {ensemble_metrics['rmse_test']:.2f}, R² = {ensemble_metrics['r2_test']:.4f}")

print("\n3. Deployment:")
print("   - Model artifacts saved in 'model_artifacts' directory")
print("   - FastAPI application generated in 'app.py'")
print("   - Ready for deployment with feature scaling and validation")

<a id='results'></a>
## 7. Results Analysis

### 7.1 Feature Importance

Analyze which features contribute most to the predictions

In [None]:
# Feature importance analysis code here
# ...
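
The cell above is a placeholder. One model-agnostic option, shown here as a sketch and not as the notebook's SHAP analysis, is scikit-learn's permutation importance: shuffle one feature at a time and measure how much the score drops. The data below is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: duration depends strongly on distance, weakly on num_stops
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    'distance': rng.uniform(1, 30, 300),
    'num_stops': rng.integers(2, 40, 300),
    'noise': rng.normal(size=300),
})
y_demo = 3.0 * X_demo['distance'] + 0.5 * X_demo['num_stops'] + rng.normal(size=300)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Mean drop in R² over 5 shuffles of each column
result = permutation_importance(model, X_demo, y_demo, n_repeats=5, random_state=42)
importance = pd.Series(
    result.importances_mean, index=X_demo.columns
).sort_values(ascending=False)
print(importance)
```

In practice the importance would be computed on held-out data such as `X_test`, so that it reflects what the model relies on when generalising.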

### 7.2 Performance Metrics

Detailed analysis of model performance metrics

In [None]:
# Performance metrics analysis code here
# ...
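
The cell above is a placeholder. A possible first step is to break the error down by group, for example by route, to see where the model is biased. The column names and values below are illustrative, not results from this notebook:

```python
import numpy as np
import pandas as pd

# Illustrative predictions in minutes
preds = pd.DataFrame({
    'route_id': ['A', 'A', 'B', 'B', 'B'],
    'actual': [30.0, 36.0, 15.0, 14.0, 20.0],
    'predicted': [28.0, 39.0, 15.0, 16.0, 17.0],
})
preds['error'] = preds['predicted'] - preds['actual']

# Per-route bias (mean signed error), MAE and RMSE
summary = preds.groupby('route_id')['error'].agg(
    bias='mean',
    mae=lambda e: e.abs().mean(),
    rmse=lambda e: np.sqrt((e ** 2).mean()),
)
print(summary.round(3))
```

A positive bias means the model tends to over-predict durations on that route; the same grouping can be applied to `hour` or `weather_condition`.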

### 7.3 Visualizations

Visualize model predictions and performance

In [None]:
# Visualization code here
# ...