# 🚲 Bike Demand Prediction System - Complete Demo

**Level 2 MLOps Portfolio Project**

This notebook demonstrates the complete bike demand forecasting system, including:
- 📊 **Data Pipeline**: Real-time data collection from NYC Citi Bike & Weather APIs
- 🔧 **Feature Engineering**: 100+ automated features (temporal, lag, rolling, weather)
- 🤖 **ML Training**: XGBoost, LightGBM, CatBoost with MLflow tracking
- 📈 **Predictions**: Single, batch, and multi-hour forecasts
- 🔍 **Monitoring**: Data drift detection and model performance tracking

---

## System Architecture

```
APIs (Citi Bike + Weather)
    ↓
Data Collection → PostgreSQL → Feature Engineering → Feature Store
                                                           ↓
                                      Model Training → MLflow Registry
                                                           ↓
                                      FastAPI + Streamlit Dashboard
                                                           ↓
                                      Prometheus + Grafana Monitoring
```

## 1. Setup & Imports

In [None]:
# Core libraries
import sys
import os
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

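# Add the project root (so `from src...` imports resolve) and src/ to the import path.
# Assumes the notebook is run from a subfolder directly under the project root.
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / "src"))
sys.path.insert(0, str(project_root))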

# Data science
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

# MLflow
import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("bike-demand-notebook-demo")

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

print("✅ Imports successful")
print(f"📁 Project root: {project_root}")
print(f"📊 MLflow tracking: http://localhost:5000")

## 2. Data Collection Pipeline

### 2.1 Collect Bike Station Data

In [None]:
from src.data.collectors.citi_bike_collector import CitiBikeCollector
from src.data.collectors.weather_collector import WeatherCollector

# Initialize collectors
bike_collector = CitiBikeCollector()
weather_collector = WeatherCollector()

print("🚲 Collecting bike station data from NYC Citi Bike API...")

with bike_collector:
    # Collect station information
    stations = bike_collector.collect_station_information()
    print(f"✅ Collected {len(stations)} bike stations")
    
    # Collect current status
    statuses = bike_collector.collect_station_status()
    print(f"✅ Collected {len(statuses)} station statuses")

# Convert to DataFrames
df_stations = pd.DataFrame(stations)
df_statuses = pd.DataFrame(statuses)

print(f"\n📊 Station data shape: {df_stations.shape}")
print(f"📊 Status data shape: {df_statuses.shape}")

# Preview
display(df_stations.head())
display(df_statuses.head())
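
The collector internals are not shown in this notebook. As a rough sketch of what flattening a station-status response could look like, the snippet below assumes the feed follows the public GBFS layout (`data.stations` entries with `num_bikes_available` and `num_docks_available`) and that the collector renames those fields to the `bikes_available` / `docks_available` columns used later. The function name `flatten_station_status` and the sample payload are hypothetical.

```python
import pandas as pd

COLUMNS = ["station_id", "bikes_available", "docks_available", "is_installed"]


def flatten_station_status(payload: dict) -> pd.DataFrame:
    """Flatten a GBFS-style station_status payload into one row per station.

    Hypothetical helper: the real CitiBikeCollector may parse the feed differently.
    """
    rows = []
    for station in payload.get("data", {}).get("stations", []):
        rows.append({
            "station_id": str(station["station_id"]),
            "bikes_available": int(station.get("num_bikes_available", 0)),
            "docks_available": int(station.get("num_docks_available", 0)),
            "is_installed": bool(station.get("is_installed", 0)),
        })
    return pd.DataFrame(rows, columns=COLUMNS)


# Made-up payload in the GBFS shape, for illustration only
sample_payload = {
    "data": {
        "stations": [
            {"station_id": "101", "num_bikes_available": 5, "num_docks_available": 20, "is_installed": 1},
            {"station_id": "102", "num_bikes_available": 7, "num_docks_available": 0, "is_installed": 1},
        ]
    }
}
df_sample_status = flatten_station_status(sample_payload)
print(df_sample_status)
```

Keeping the parsing in a pure function like this makes it easy to unit-test against a saved payload without calling the live API.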

### 2.2 Collect Weather Data

In [None]:
print("🌤️ Collecting weather data from OpenWeatherMap API...")

with weather_collector:
    weather = weather_collector.collect_current_weather()
    print(f"✅ Collected weather data")

# Display weather
print(f"\n📊 Current Weather in NYC:")
print(f"  🌡️ Temperature: {weather['temperature']}°C")
print(f"  💧 Humidity: {weather['humidity']}%")
print(f"  💨 Wind Speed: {weather['wind_speed']} m/s")
print(f"  ☁️ Condition: {weather['weather_condition']}")
print(f"  📍 Location: ({weather['latitude']}, {weather['longitude']})")

df_weather = pd.DataFrame([weather])
display(df_weather)

### 2.3 Save Raw Data

In [None]:
# Create data directories
data_dir = project_root / "data"
raw_dir = data_dir / "raw"
processed_dir = data_dir / "processed"

raw_dir.mkdir(parents=True, exist_ok=True)
processed_dir.mkdir(parents=True, exist_ok=True)

# Save raw data with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

stations_file = raw_dir / f"stations_{timestamp}.csv"
statuses_file = raw_dir / f"statuses_{timestamp}.csv"
weather_file = raw_dir / f"weather_{timestamp}.csv"

df_stations.to_csv(stations_file, index=False)
df_statuses.to_csv(statuses_file, index=False)
df_weather.to_csv(weather_file, index=False)

print(f"💾 Saved raw data to:")
print(f"  📄 {stations_file}")
print(f"  📄 {statuses_file}")
print(f"  📄 {weather_file}")

### 2.4 Data Exploration

In [None]:
# Merge station info with status
df_merged = df_statuses.merge(df_stations, on='station_id', how='left')

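# Derive capacity and utilization from the live status counts.
# Stations reporting 0 bikes and 0 docks yield NaN utilization; they are dropped before plotting.
df_merged['total_capacity'] = df_merged['bikes_available'] + df_merged['docks_available']
df_merged['utilization'] = df_merged['bikes_available'] / df_merged['total_capacity']

# Statistics
print("📊 Station Statistics:")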
print(f"  Total Stations: {len(df_merged)}")
print(f"  Active Stations: {df_merged['is_installed'].sum()}")
print(f"  Total Bikes Available: {df_merged['bikes_available'].sum()}")
print(f"  Total Docks Available: {df_merged['docks_available'].sum()}")
print(f"  Average Utilization: {df_merged['utilization'].mean():.2%}")

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Bikes available distribution
axes[0, 0].hist(df_merged['bikes_available'], bins=30, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Distribution of Bikes Available', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Bikes Available')
axes[0, 0].set_ylabel('Frequency')

# 2. Utilization distribution
axes[0, 1].hist(df_merged['utilization'].dropna(), bins=30, edgecolor='black', alpha=0.7, color='orange')
axes[0, 1].set_title('Station Utilization Distribution', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Utilization Rate')
axes[0, 1].set_ylabel('Frequency')

# 3. Top 10 stations by bikes
top_stations = df_merged.nlargest(10, 'bikes_available')[['name', 'bikes_available']]
axes[1, 0].barh(range(len(top_stations)), top_stations['bikes_available'].values)
axes[1, 0].set_yticks(range(len(top_stations)))
axes[1, 0].set_yticklabels([name[:30] + '...' if len(name) > 30 else name for name in top_stations['name']], fontsize=8)
axes[1, 0].set_title('Top 10 Stations by Bikes Available', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Bikes Available')
axes[1, 0].invert_yaxis()

# 4. Capacity vs bikes scatter
axes[1, 1].scatter(df_merged['capacity'], df_merged['bikes_available'], alpha=0.5)
axes[1, 1].set_title('Station Capacity vs Bikes Available', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Station Capacity')
axes[1, 1].set_ylabel('Bikes Available')

plt.tight_layout()
plt.savefig(data_dir / 'exploratory_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\n💾 Saved visualization to {data_dir / 'exploratory_analysis.png'}")

## 3. Feature Engineering

Generate 100+ features from raw data

In [None]:
from src.features.temporal_features import TemporalFeatureGenerator
from src.features.lag_features import LagFeatureGenerator
from src.features.rolling_features import RollingFeatureGenerator
from src.features.weather_features import WeatherFeatureGenerator
from src.features.holiday_features import HolidayFeatureGenerator

# Initialize generators
temporal_gen = TemporalFeatureGenerator()
lag_gen = LagFeatureGenerator()
rolling_gen = RollingFeatureGenerator()
weather_gen = WeatherFeatureGenerator()
holiday_gen = HolidayFeatureGenerator()

print("🔧 Generating features...\n")

### 3.1 Create Sample Time Series Data

In [None]:
# Create 30 days of hourly data for demonstration
hours = 30 * 24
dates = pd.date_range(start='2024-11-01', periods=hours, freq='h')

# Simulate demand with daily/weekly patterns
np.random.seed(42)
base_demand = 15
hour_effect = 5 * np.sin(2 * np.pi * np.arange(hours) / 24)  # Daily pattern
day_effect = 3 * np.sin(2 * np.pi * np.arange(hours) / (24 * 7))  # Weekly pattern
noise = np.random.normal(0, 2, hours)
demand = base_demand + hour_effect + day_effect + noise
demand = np.maximum(demand, 0)  # No negative demand

# Create DataFrame
df_timeseries = pd.DataFrame({
    'station_id': 'demo_station_001',
    'timestamp': dates,
    'bikes_available': demand,
    'docks_available': 30 - demand,
    'temperature': 15 + 5 * np.sin(2 * np.pi * np.arange(hours) / 24) + np.random.normal(0, 2, hours),
    'humidity': 60 + 10 * np.random.randn(hours),
    'wind_speed': 5 + 2 * np.random.randn(hours),
    'precipitation': np.random.choice([0, 0, 0, 0.5, 1.0], hours),
    'weather_condition': np.random.choice(['Clear', 'Clouds', 'Rain'], hours, p=[0.6, 0.3, 0.1])
})

print(f"✅ Created time series data: {df_timeseries.shape}")
display(df_timeseries.head(10))

### 3.2 Generate Temporal Features

In [None]:
print("⏰ Generating temporal features...")
df_features = temporal_gen.generate(df_timeseries.copy())

temporal_cols = [col for col in df_features.columns if col not in df_timeseries.columns]
print(f"✅ Generated {len(temporal_cols)} temporal features")
print(f"   Features: {', '.join(temporal_cols[:10])}...")
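
`TemporalFeatureGenerator` is imported from the project and its implementation is not shown here. A minimal sketch of the kind of columns such a generator might produce, assuming calendar fields plus a cyclical hour encoding; the function name and every column name except `hour_of_day` (which the error analysis later checks for) are assumptions:

```python
import numpy as np
import pandas as pd


def add_temporal_features(df: pd.DataFrame, ts_col: str = "timestamp") -> pd.DataFrame:
    """Add calendar fields and a sin/cos encoding of the hour (illustrative only)."""
    out = df.copy()
    ts = pd.to_datetime(out[ts_col])
    out["hour_of_day"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek  # Monday = 0
    out["is_weekend"] = (out["day_of_week"] >= 5).astype(int)
    # Cyclical encoding keeps 23:00 and 00:00 close together in feature space
    out["hour_sin"] = np.sin(2 * np.pi * out["hour_of_day"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour_of_day"] / 24)
    return out


demo_ts = pd.DataFrame({"timestamp": pd.date_range("2024-11-01", periods=48, freq="h")})
demo_temporal = add_temporal_features(demo_ts)
print(demo_temporal.head())
```

Tree models can split on the raw hour directly, so the sin/cos pair matters more if a linear or distance-based model is added later.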

### 3.3 Generate Lag Features

In [None]:
print("⏮️ Generating lag features...")
df_features = lag_gen.generate(df_features)

lag_cols = [col for col in df_features.columns if 'lag_' in col or 'change_' in col]
print(f"✅ Generated {len(lag_cols)} lag features")
print(f"   Features: {', '.join(lag_cols[:10])}...")
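
The filter above looks for `lag_` and `change_` in column names. One way such columns could be built, sketched under the assumption that lags are computed per station with a grouped `shift`; the function name, column naming, and lag choices are hypothetical, not the project's `LagFeatureGenerator`:

```python
import pandas as pd


def add_lag_features(df, target="bikes_available", lags=(1,), group_col="station_id"):
    """Add per-station lagged values and a 1-step change column (illustrative only)."""
    out = df.sort_values([group_col, "timestamp"]).copy()
    grouped = out.groupby(group_col)[target]
    for lag in lags:
        # shift() inside each group so one station never borrows another's history
        out[f"{target}_lag_{lag}h"] = grouped.shift(lag)
    out[f"{target}_change_1h"] = grouped.diff(1)
    return out


demo_lag_input = pd.DataFrame({
    "station_id": ["a"] * 3 + ["b"] * 3,
    "timestamp": list(pd.date_range("2024-11-01", periods=3, freq="h")) * 2,
    "bikes_available": [10, 12, 9, 3, 4, 8],
})
demo_lag = add_lag_features(demo_lag_input)
print(demo_lag)
```

The first row of each station has no history, which is why the training section drops NaN rows before fitting.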

### 3.4 Generate Rolling Features

In [None]:
print("📊 Generating rolling window features...")
df_features = rolling_gen.generate(df_features)

rolling_cols = [col for col in df_features.columns if 'rolling_' in col]
print(f"✅ Generated {len(rolling_cols)} rolling features")
print(f"   Features: {', '.join(rolling_cols[:10])}...")
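
Rolling statistics of the target can leak the value being predicted if the window includes the current row. Whether `RollingFeatureGenerator` guards against this is not visible here; the sketch below shows one common guard, shifting by one step before rolling. The function and column names are assumptions.

```python
import pandas as pd


def add_rolling_features(df, target="bikes_available", windows=(3,), group_col="station_id"):
    """Add per-station rolling means over strictly past values (illustrative only)."""
    out = df.sort_values([group_col, "timestamp"]).copy()
    for w in windows:
        # shift(1) first, so the window ends at the previous hour and excludes the current target
        out[f"{target}_rolling_mean_{w}h"] = out.groupby(group_col)[target].transform(
            lambda s, w=w: s.shift(1).rolling(w, min_periods=w).mean()
        )
    return out


demo_roll_input = pd.DataFrame({
    "station_id": ["a"] * 5,
    "timestamp": pd.date_range("2024-11-01", periods=5, freq="h"),
    "bikes_available": [10, 12, 14, 16, 18],
})
demo_roll = add_rolling_features(demo_roll_input)
print(demo_roll)
```

If test metrics look implausibly good, checking whether any rolling column includes the current row is a quick first diagnostic.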

### 3.5 Generate Weather Features

In [None]:
print("🌤️ Generating weather features...")
df_features = weather_gen.generate(df_features)

weather_cols = [col for col in df_features.columns if any(x in col for x in ['temp_', 'humidity_', 'wind_', 'is_rainy', 'weather_severity'])]
print(f"✅ Generated {len(weather_cols)} weather features")
print(f"   Features: {', '.join(weather_cols[:10])}...")
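
The column filter above mentions `is_rainy` and `weather_severity`. How `WeatherFeatureGenerator` derives them is not shown, so the following is only a guess at the idea: a rain flag, an ordinal severity score, and one interaction term. The severity mapping and the `temp_humidity_interaction` name are invented for illustration.

```python
import pandas as pd

# Hypothetical ordinal mapping; the project may rank conditions differently
SEVERITY = {"Clear": 0, "Clouds": 1, "Rain": 2}


def add_weather_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive simple weather indicators from raw weather columns (illustrative only)."""
    out = df.copy()
    rainy = (out["precipitation"] > 0) | (out["weather_condition"] == "Rain")
    out["is_rainy"] = rainy.astype(int)
    # Unknown conditions fall back to the middle severity level
    out["weather_severity"] = out["weather_condition"].map(SEVERITY).fillna(1).astype(int)
    out["temp_humidity_interaction"] = out["temperature"] * out["humidity"] / 100
    return out


demo_weather_input = pd.DataFrame({
    "temperature": [20.0, 5.0, 12.0],
    "humidity": [50.0, 80.0, 90.0],
    "precipitation": [0.0, 0.0, 1.0],
    "weather_condition": ["Clear", "Rain", "Clouds"],
})
demo_weather = add_weather_features(demo_weather_input)
print(demo_weather)
```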

### 3.6 Generate Holiday Features

In [None]:
print("🎉 Generating holiday features...")
df_features = holiday_gen.generate(df_features)

holiday_cols = [col for col in df_features.columns if 'holiday' in col]
print(f"✅ Generated {len(holiday_cols)} holiday features")
print(f"   Features: {holiday_cols}")

# Summary
total_features = len(df_features.columns) - len(df_timeseries.columns)
print(f"\n🎯 Total features generated: {total_features}")
print(f"📊 Final dataset shape: {df_features.shape}")
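
For reference, a holiday flag can be derived with pandas alone. This sketch uses `USFederalHolidayCalendar` from `pandas.tseries.holiday`; whether `HolidayFeatureGenerator` uses this calendar, a different library, or NYC-specific dates is an assumption, as is the `is_holiday` column name.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar


def add_holiday_flag(df: pd.DataFrame, ts_col: str = "timestamp") -> pd.DataFrame:
    """Flag rows whose calendar day is a US federal holiday (illustrative only)."""
    out = df.copy()
    days = pd.to_datetime(out[ts_col]).dt.normalize()  # strip the time of day
    holidays = USFederalHolidayCalendar().holidays(start=days.min(), end=days.max())
    out["is_holiday"] = days.isin(holidays).astype(int)
    return out


# Three days of hourly rows around Thanksgiving 2024 (Thursday, November 28)
demo_holiday_input = pd.DataFrame(
    {"timestamp": pd.date_range("2024-11-27", periods=72, freq="h")}
)
demo_holiday = add_holiday_flag(demo_holiday_input)
print(demo_holiday.groupby(demo_holiday["timestamp"].dt.date)["is_holiday"].max())
```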

### 3.7 Save Processed Features

In [None]:
# Save processed features
features_file = processed_dir / f"features_{timestamp}.csv"
df_features.to_csv(features_file, index=False)
print(f"💾 Saved features to: {features_file}")

# Preview the engineered feature table
display(df_features.head())
print(f"\n📋 All features ({len(df_features.columns)}):")
print(df_features.columns.tolist())

## 4. Model Training

Train XGBoost, LightGBM, and CatBoost models with MLflow tracking

### 4.1 Prepare Training Data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor

# Drop NaN values (from lag/rolling features at the start)
df_clean = df_features.dropna().reset_index(drop=True)

# Define target and features
target = 'bikes_available'
exclude_cols = ['station_id', 'timestamp', target, 'docks_available', 'weather_condition']
feature_cols = [col for col in df_clean.columns if col not in exclude_cols]

X = df_clean[feature_cols]
y = df_clean[target]

print(f"📊 Training data prepared:")
print(f"   Samples: {len(X)}")
print(f"   Features: {len(feature_cols)}")
print(f"   Target: {target}")

# Time-based split (70% train, 15% val, 15% test)
train_size = int(0.7 * len(X))
val_size = int(0.15 * len(X))

X_train = X[:train_size]
y_train = y[:train_size]

X_val = X[train_size:train_size + val_size]
y_val = y[train_size:train_size + val_size]

X_test = X[train_size + val_size:]
y_test = y[train_size + val_size:]

print(f"\n📊 Data splits:")
print(f"   Train: {len(X_train)} samples ({len(X_train)/len(X):.1%})")
print(f"   Val:   {len(X_val)} samples ({len(X_val)/len(X):.1%})")
print(f"   Test:  {len(X_test)} samples ({len(X_test)/len(X):.1%})")
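
The split above is a single chronological holdout. If more than one validation window is wanted, scikit-learn's `TimeSeriesSplit` gives expanding-window folds in which training indices always precede validation indices. A small sketch on a placeholder array (the notebook itself does not use this splitter):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder design matrix: 100 time-ordered rows
X_demo = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
folds = []
for fold, (train_idx, val_idx) in enumerate(tscv.split(X_demo), start=1):
    folds.append((train_idx, val_idx))
    print(
        f"Fold {fold}: train rows {train_idx[0]}-{train_idx[-1]}, "
        f"val rows {val_idx[0]}-{val_idx[-1]}"
    )
```

With lag features spanning many hours, the `gap` parameter of `TimeSeriesSplit` can be used to leave a buffer between the training and validation windows.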

### 4.2 Helper Functions

In [None]:
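def calculate_metrics(y_true, y_pred, set_name="Test"):
    """Calculate and display RMSE, MAE, MAPE, and R² for one data split."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    # Demand is clipped at 0 upstream, so skip zero targets to keep MAPE finite
    y_true_arr = np.asarray(y_true, dtype=float)
    y_pred_arr = np.asarray(y_pred, dtype=float)
    nonzero = y_true_arr != 0
    if nonzero.any():
        mape = np.mean(np.abs((y_true_arr[nonzero] - y_pred_arr[nonzero]) / y_true_arr[nonzero])) * 100
    else:
        mape = float('nan')
    r2 = r2_score(y_true, y_pred)
    
    print(f"\n📊 {set_name} Metrics:")
    print(f"   RMSE: {rmse:.4f}")
    print(f"   MAE:  {mae:.4f}")
    print(f"   MAPE: {mape:.2f}%")
    print(f"   R²:   {r2:.4f}")
    
    return {'rmse': rmse, 'mae': mae, 'mape': mape, 'r2': r2}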

def plot_predictions(y_true, y_pred, model_name, save_path=None):
    """Plot actual vs predicted"""
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Scatter plot
    axes[0].scatter(y_true, y_pred, alpha=0.5)
    axes[0].plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--', lw=2)
    axes[0].set_xlabel('Actual')
    axes[0].set_ylabel('Predicted')
    axes[0].set_title(f'{model_name}: Actual vs Predicted')
    
    # Residuals
    residuals = y_true - y_pred
    axes[1].scatter(y_pred, residuals, alpha=0.5)
    axes[1].axhline(y=0, color='r', linestyle='--', lw=2)
    axes[1].set_xlabel('Predicted')
    axes[1].set_ylabel('Residuals')
    axes[1].set_title(f'{model_name}: Residual Plot')
    
    plt.tight_layout()
    if save_path:
        plt.savefig(save_path, dpi=150, bbox_inches='tight')
    plt.show()

print("✅ Helper functions defined")

### 4.3 Train XGBoost Model

In [None]:
print("🤖 Training XGBoost model...\n")

with mlflow.start_run(run_name="xgboost_demo") as run:
    # Model parameters
    params_xgb = {
        'objective': 'reg:squarederror',
        'max_depth': 6,
        'learning_rate': 0.1,
        'n_estimators': 100,
        'random_state': 42
    }
    
    # Log parameters
    mlflow.log_params(params_xgb)
    
    # Train
    model_xgb = xgb.XGBRegressor(**params_xgb)
    model_xgb.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        verbose=False
    )
    
    # Predictions
    y_pred_train = model_xgb.predict(X_train)
    y_pred_val = model_xgb.predict(X_val)
    y_pred_test = model_xgb.predict(X_test)
    
    # Metrics
    metrics_train = calculate_metrics(y_train, y_pred_train, "Train")
    metrics_val = calculate_metrics(y_val, y_pred_val, "Validation")
    metrics_test = calculate_metrics(y_test, y_pred_test, "Test")
    
    # Log metrics
    mlflow.log_metrics({
        'train_rmse': metrics_train['rmse'],
        'val_rmse': metrics_val['rmse'],
        'test_rmse': metrics_test['rmse'],
        'test_mae': metrics_test['mae'],
        'test_r2': metrics_test['r2']
    })
    
    # Plot
    plot_path = data_dir / 'xgboost_predictions.png'
    plot_predictions(y_test, y_pred_test, "XGBoost", plot_path)
    mlflow.log_artifact(str(plot_path))
    
    # Save model
    models_dir = project_root / "models"
    models_dir.mkdir(exist_ok=True)
    
    model_path = models_dir / f"xgboost_{timestamp}.json"
    model_xgb.save_model(model_path)
    print(f"\n💾 Saved XGBoost model to: {model_path}")
    
    # Log model to MLflow
    mlflow.xgboost.log_model(model_xgb, "model")
    
    print(f"\n✅ XGBoost training complete!")
    print(f"   MLflow Run ID: {run.info.run_id}")

### 4.4 Train LightGBM Model

In [None]:
print("🤖 Training LightGBM model...\n")

with mlflow.start_run(run_name="lightgbm_demo") as run:
    # Model parameters
    params_lgb = {
        'objective': 'regression',
        'max_depth': 6,
        'learning_rate': 0.1,
        'n_estimators': 100,
        'random_state': 42,
        'verbose': -1
    }
    
    # Log parameters
    mlflow.log_params(params_lgb)
    
    # Train
    model_lgb = lgb.LGBMRegressor(**params_lgb)
    model_lgb.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)]
    )
    
    # Predictions
    y_pred_train = model_lgb.predict(X_train)
    y_pred_val = model_lgb.predict(X_val)
    y_pred_test = model_lgb.predict(X_test)
    
    # Metrics
    metrics_train = calculate_metrics(y_train, y_pred_train, "Train")
    metrics_val = calculate_metrics(y_val, y_pred_val, "Validation")
    metrics_test = calculate_metrics(y_test, y_pred_test, "Test")
    
    # Log metrics
    mlflow.log_metrics({
        'train_rmse': metrics_train['rmse'],
        'val_rmse': metrics_val['rmse'],
        'test_rmse': metrics_test['rmse'],
        'test_mae': metrics_test['mae'],
        'test_r2': metrics_test['r2']
    })
    
    # Plot
    plot_path = data_dir / 'lightgbm_predictions.png'
    plot_predictions(y_test, y_pred_test, "LightGBM", plot_path)
    mlflow.log_artifact(str(plot_path))
    
    # Save model
    model_path = models_dir / f"lightgbm_{timestamp}.txt"
    model_lgb.booster_.save_model(str(model_path))
    print(f"\n💾 Saved LightGBM model to: {model_path}")
    
    # Log model to MLflow
    mlflow.lightgbm.log_model(model_lgb, "model")
    
    print(f"\n✅ LightGBM training complete!")
    print(f"   MLflow Run ID: {run.info.run_id}")

### 4.5 Train CatBoost Model

In [None]:
print("🤖 Training CatBoost model...\n")

with mlflow.start_run(run_name="catboost_demo") as run:
    # Model parameters
    params_cat = {
        'iterations': 100,
        'depth': 6,
        'learning_rate': 0.1,
        'random_state': 42,
        'verbose': False
    }
    
    # Log parameters
    mlflow.log_params(params_cat)
    
    # Train
    model_cat = CatBoostRegressor(**params_cat)
    model_cat.fit(
        X_train, y_train,
        eval_set=(X_val, y_val)
    )
    
    # Predictions
    y_pred_train = model_cat.predict(X_train)
    y_pred_val = model_cat.predict(X_val)
    y_pred_test = model_cat.predict(X_test)
    
    # Metrics
    metrics_train = calculate_metrics(y_train, y_pred_train, "Train")
    metrics_val = calculate_metrics(y_val, y_pred_val, "Validation")
    metrics_test = calculate_metrics(y_test, y_pred_test, "Test")
    
    # Log metrics
    mlflow.log_metrics({
        'train_rmse': metrics_train['rmse'],
        'val_rmse': metrics_val['rmse'],
        'test_rmse': metrics_test['rmse'],
        'test_mae': metrics_test['mae'],
        'test_r2': metrics_test['r2']
    })
    
    # Plot
    plot_path = data_dir / 'catboost_predictions.png'
    plot_predictions(y_test, y_pred_test, "CatBoost", plot_path)
    mlflow.log_artifact(str(plot_path))
    
    # Save model
    model_path = models_dir / f"catboost_{timestamp}.cbm"
    model_cat.save_model(str(model_path))
    print(f"\n💾 Saved CatBoost model to: {model_path}")
    
    # Log model to MLflow
    mlflow.catboost.log_model(model_cat, "model")
    
    print(f"\n✅ CatBoost training complete!")
    print(f"   MLflow Run ID: {run.info.run_id}")

### 4.6 Model Comparison

In [None]:
# Compare all models
print("📊 Model Comparison Summary\n")
print("="*60)
print("View detailed comparison in MLflow UI:")
print("👉 http://localhost:5000")
print("="*60)
print("\n🏆 All models trained and logged to MLflow!")
print(f"\n💾 Models saved in: {models_dir}")
print(f"📊 Visualizations saved in: {data_dir}")
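
The cell above defers the comparison to the MLflow UI, and each training cell overwrites `metrics_val`, so no side-by-side table survives in the notebook. A minimal sketch of an in-notebook comparison, assuming predictions from each model are kept in a dict; `compare_models` is a hypothetical helper, and the toy arrays below are constructed for illustration rather than taken from any run:

```python
import numpy as np
import pandas as pd


def compare_models(y_true, predictions):
    """Return one row per model with RMSE and MAE, best (lowest RMSE) first."""
    y_true = np.asarray(y_true, dtype=float)
    rows = []
    for name, y_pred in predictions.items():
        err = y_true - np.asarray(y_pred, dtype=float)
        rows.append({
            "model": name,
            "rmse": float(np.sqrt(np.mean(err ** 2))),
            "mae": float(np.mean(np.abs(err))),
        })
    return pd.DataFrame(rows).sort_values("rmse").reset_index(drop=True)


# Toy targets and deliberately offset "predictions", not real model output
y_toy = np.array([10.0, 12.0, 14.0, 16.0])
toy_predictions = {"off_by_three": y_toy - 3.0, "off_by_one": y_toy + 1.0}
comparison = compare_models(y_toy, toy_predictions)
print(comparison)
```

In this notebook the call would look like `compare_models(y_val, {"XGBoost": model_xgb.predict(X_val), "LightGBM": model_lgb.predict(X_val), "CatBoost": model_cat.predict(X_val)})`, ranking on the validation split so the test split stays untouched for the final report.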

## 5. Making Predictions

Use the best model for forecasting

In [None]:
# Use XGBoost for demo (typically you'd select based on metrics)
best_model = model_xgb
print("🎯 Using XGBoost model for predictions\n")

# Single prediction
sample_features = X_test.iloc[0:1]
single_pred = best_model.predict(sample_features)[0]
actual_value = y_test.iloc[0]

print(f"📍 Single Prediction Example:")
print(f"   Predicted: {single_pred:.2f} bikes")
print(f"   Actual:    {actual_value:.2f} bikes")
print(f"   Error:     {abs(single_pred - actual_value):.2f} bikes")

# Batch predictions
batch_size = 24  # 24 hours
batch_features = X_test.iloc[:batch_size]
batch_preds = best_model.predict(batch_features)
batch_actuals = y_test.iloc[:batch_size].values

print(f"\n📊 Batch Prediction (24 hours):")
print(f"   Mean predicted: {batch_preds.mean():.2f} bikes")
print(f"   Mean actual:    {batch_actuals.mean():.2f} bikes")
print(f"   RMSE:          {np.sqrt(mean_squared_error(batch_actuals, batch_preds)):.2f}")

### 5.1 Forecast Visualization

In [None]:
# Forecast visualization over the held-out test window, capped at 7 days.
# With 30 days of hourly data and a 15% test split, the test set holds at most
# 108 rows, so the window is shorter than 7 days and labels use the actual length.
forecast_hours = min(7 * 24, len(X_test))
forecast_features = X_test.iloc[:forecast_hours]
forecast_preds = best_model.predict(forecast_features)
forecast_actuals = y_test.iloc[:forecast_hours].values

# Plot
plt.figure(figsize=(15, 6))
hours_range = range(len(forecast_preds))

plt.plot(hours_range, forecast_actuals, label='Actual', linewidth=2, alpha=0.7)
plt.plot(hours_range, forecast_preds, label='Predicted', linewidth=2, alpha=0.7, linestyle='--')
plt.fill_between(hours_range, forecast_actuals, forecast_preds, alpha=0.2)

plt.xlabel('Hours into Test Window', fontsize=12)
plt.ylabel('Bikes Available', fontsize=12)
plt.title(f'{forecast_hours}-Hour Demand Forecast: Actual vs Predicted', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)

forecast_plot = data_dir / '7day_forecast.png'
plt.savefig(forecast_plot, dpi=150, bbox_inches='tight')
plt.show()

print(f"💾 Saved forecast visualization to: {forecast_plot}")
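
The plot above scores held-out rows whose lag features come from observed values, so every point is effectively a one-step-ahead prediction. A genuine multi-hour forecast has to feed each prediction back in as the next step's lag. The sketch below shows that recursion with a toy model; `recursive_forecast` and `predict_next` are hypothetical names, and the "model" is just the mean of the last three values.

```python
from collections import deque
from typing import Callable, List, Sequence


def recursive_forecast(
    history: Sequence[float],
    predict_next: Callable[[List[float]], float],
    horizon: int,
    n_lags: int = 3,
) -> List[float]:
    """Forecast `horizon` steps ahead, reusing each prediction as a future lag."""
    if len(history) < n_lags:
        raise ValueError("history must contain at least n_lags observations")
    window = deque(history[-n_lags:], maxlen=n_lags)
    forecasts = []
    for _ in range(horizon):
        y_hat = float(predict_next(list(window)))
        forecasts.append(y_hat)
        window.append(y_hat)  # the oldest lag drops out automatically
    return forecasts


def mean_of_lags(lags: List[float]) -> float:
    """Toy stand-in for a trained model: average of the lag window."""
    return sum(lags) / len(lags)


multi_step = recursive_forecast([10.0, 12.0, 14.0], mean_of_lags, horizon=3)
print(multi_step)
```

Errors compound under this scheme, so accuracy at hour 24 is usually worse than the one-step numbers reported above. Plugging in a real model would also require rebuilding the rolling, temporal, and weather features at each step.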

## 6. Monitoring & Analysis

### 6.1 Feature Importance

In [None]:
# Get feature importance
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': best_model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot top 20
top_features = feature_importance.head(20)

plt.figure(figsize=(12, 8))
plt.barh(range(len(top_features)), top_features['importance'].values)
plt.yticks(range(len(top_features)), top_features['feature'].values)
plt.xlabel('Importance', fontsize=12)
plt.title('Top 20 Most Important Features', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.tight_layout()

importance_plot = data_dir / 'feature_importance.png'
plt.savefig(importance_plot, dpi=150, bbox_inches='tight')
plt.show()

print(f"💾 Saved feature importance plot to: {importance_plot}")
print(f"\n🔝 Top 10 Features:")
for i, row in top_features.head(10).iterrows():
    print(f"   {row['feature']}: {row['importance']:.4f}")

### 6.2 Error Analysis

In [None]:
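# Analyze errors by time of day for the selected model.
# y_pred_test was last overwritten by the CatBoost cell, so recompute it from best_model.
test_df = df_clean.iloc[train_size + val_size:].copy()
test_df['prediction'] = best_model.predict(X_test)
test_df['error'] = np.abs(test_df[target] - test_df['prediction'])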

# Error by hour
if 'hour_of_day' in test_df.columns:
    error_by_hour = test_df.groupby('hour_of_day')['error'].mean()
    
    plt.figure(figsize=(12, 5))
    plt.bar(error_by_hour.index, error_by_hour.values)
    plt.xlabel('Hour of Day', fontsize=12)
    plt.ylabel('Mean Absolute Error', fontsize=12)
    plt.title('Prediction Error by Hour of Day', fontsize=14, fontweight='bold')
    plt.xticks(range(24))
    plt.grid(True, alpha=0.3, axis='y')
    
    error_plot = data_dir / 'error_by_hour.png'
    plt.savefig(error_plot, dpi=150, bbox_inches='tight')
    plt.show()
    
    print(f"💾 Saved error analysis to: {error_plot}")
    print(f"\n📊 Peak error hours:")
    print(error_by_hour.nlargest(5))
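
The introduction lists data drift detection under monitoring, but no drift check appears in this notebook. One common, dependency-light option is the Population Stability Index (PSI) between a reference sample (for example, training data) and a recent sample of the same feature. The sketch below is an illustration on synthetic data, not the project's monitoring code; the 0.1 and 0.25 cut-offs are a widely quoted rule of thumb rather than a standard.

```python
import numpy as np


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """PSI between two samples, using quantile bins taken from the reference."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Quantile edges give roughly equal reference mass per bin; unique() guards against ties
    edges = np.unique(np.quantile(reference, np.linspace(0.0, 1.0, bins + 1)))
    # Values outside the reference range are counted in the outermost bins
    current = np.clip(current, edges[0], edges[-1])
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the proportions so empty bins do not produce log(0)
    ref_pct = np.clip(ref_pct, eps, None)
    cur_pct = np.clip(cur_pct, eps, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


rng = np.random.default_rng(0)
reference_temp = rng.normal(15, 3, 5000)  # synthetic "training" temperatures
stable_temp = rng.normal(15, 3, 5000)     # same distribution
shifted_temp = rng.normal(18, 3, 5000)    # mean shifted by one standard deviation

psi_stable = population_stability_index(reference_temp, stable_temp)
psi_shifted = population_stability_index(reference_temp, shifted_temp)
print(f"PSI (stable):  {psi_stable:.4f}")
print(f"PSI (shifted): {psi_shifted:.4f}")
```

A value below roughly 0.1 is often read as no meaningful shift and above roughly 0.25 as a shift worth investigating. Such a check could be run per feature on each new batch.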

## 7. Summary & Next Steps

In [None]:
print("="*70)
print(" "*20 + "🎉 DEMO COMPLETE! 🎉")
print("="*70)

print("\n📊 What We Accomplished:")
print("\n1️⃣ Data Pipeline")
print(f"   ✅ Collected {len(df_stations)} bike stations from NYC Citi Bike API")
print(f"   ✅ Collected {len(df_statuses)} station statuses")
print(f"   ✅ Collected weather data from OpenWeatherMap")
print(f"   ✅ Saved raw data to: {raw_dir}")

print("\n2️⃣ Feature Engineering")
print(f"   ✅ Generated {total_features} features")
print(f"   ✅ Created {hours} hours of time series data")
print(f"   ✅ Saved features to: {processed_dir}")

print("\n3️⃣ Model Training")
print(f"   ✅ Trained 3 models: XGBoost, LightGBM, CatBoost")
print(f"   ✅ Logged all experiments to MLflow")
print(f"   ✅ Saved models to: {models_dir}")

print("\n4️⃣ Predictions & Monitoring")
print(f"   ✅ Generated forecast over the held-out test window (up to 7 days)")
print(f"   ✅ Analyzed feature importance")
print(f"   ✅ Performed error analysis")
print(f"   ✅ Saved visualizations to: {data_dir}")

print("\n🔗 Quick Links:")
print(f"   📊 MLflow UI:       http://localhost:5000")
print(f"   🚀 FastAPI Docs:    http://localhost:8000/docs")
print(f"   📈 Dashboard:       http://localhost:8501")
print(f"   📁 Data Folder:     {data_dir}")
print(f"   🤖 Models Folder:   {models_dir}")

print("\n🎯 Next Steps:")
print("   1. Start Docker services: docker-compose up -d")
print("   2. View MLflow experiments: http://localhost:5000")
print("   3. Test FastAPI: http://localhost:8000/docs")
print("   4. Open Streamlit dashboard: http://localhost:8501")
print("   5. Set up Airflow DAGs for automation")
print("   6. Configure monitoring with Prometheus + Grafana")

print("\n" + "="*70)
print("🚀 System is ready for deployment!")
print("="*70)

## 8. Bonus: Quick Model Deployment Test

In [None]:
# Test if API is running
import requests

try:
    response = requests.get('http://localhost:8000/health', timeout=2)
    if response.status_code == 200:
        print("✅ FastAPI is running!")
        print(f"   Response: {response.json()}")
        print("\n   Try making a prediction at: http://localhost:8000/docs")
    else:
        print("⚠️ FastAPI returned unexpected status")
except requests.exceptions.ConnectionError:
    print("ℹ️ FastAPI not running. Start it with:")
    print("   python src/serving/api/main.py")
except Exception as e:
    print(f"ℹ️ Could not connect to API: {e}")

---

## 📚 Resources

- **Documentation**: See `docs/` folder
- **API Guide**: `docs/API_QUICK_START.md`
- **Deployment**: `docs/DEPLOYMENT.md`
- **GitHub**: https://github.com/shima-maleki/Bike-Demand-Prediction-for-Smart-Cities

---

**Level 2 MLOps Portfolio Project**  
*Automated Data Pipeline • Experiment Tracking • Model Registry • CI/CD • Monitoring*