# Urban Mobility Delay Analysis
A complete end-to-end data science project with EDA, feature engineering, and predictive modeling.

## 1. Introduction

Efficient public transportation is essential for urban mobility. This project analyzes train delay patterns using real-world transit data. The goal is to identify the key factors contributing to delays and build a predictive model to classify whether a train will be delayed.

This notebook presents a concise summary of the full workflow:
 - Exploratory data analysis (EDA)
 - Feature engineering
 - Model development and optimization
 - Interpretation of results and insights

## 2. Project Objectives
- Analyze urban train delay patterns using real operational data
- Engineer time-based and system-based features
- Build a predictive model to classify whether a train will be delayed
- Evaluate model performance and derive actionable insights

## 3. Dataset Description

The dataset contains operational records for trains, including:
 - Train metadata (train type, operator)
 - Station information (departure and arrival stations)
 - Timing information (scheduled and actual times)
 - Delay information (delay at departure/arrival)

The target variable is:
 - is_delayed — binary classification (1 = delayed, 0 = not delayed)

## 4. Data Cleaning & Preprocessing

Key preprocessing steps:

 - Removed entries with missing times that prevented delay calculation.

 - Encoded categorical features (e.g., train type, operator, station).

 - Created time-based features:

    - Hour of day

    - Day of week

 - Engineered “delay duration” whenever possible.

 - Handled class imbalance using appropriate model weighting.

These steps produced a clean dataset suitable for modeling.
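
The steps above can be sketched in pandas as follows. This is a minimal illustration, not the notebook's actual code: the column names (`scheduled_time`, `actual_time`, `train_type`, `station`), the toy records, and the 5-minute threshold used to derive `is_delayed` are all assumptions made for the example.

```python
import pandas as pd

# Toy records; every column name and value here is a hypothetical stand-in.
raw = pd.DataFrame({
    "train_type": ["A", "B", "C", "A"],
    "station": ["Station 1", "Station 2", "Station 1", "Station 3"],
    "scheduled_time": ["2024-01-08 07:30", "2024-01-08 12:00",
                       "2024-01-09 17:45", "2024-01-09 23:10"],
    "actual_time": ["2024-01-08 07:42", "2024-01-08 12:01",
                    None, "2024-01-09 23:10"],
})


def preprocess(df, delay_threshold_min=5.0):
    """Clean the records and derive time-based features.

    delay_threshold_min is an assumed cut-off, not taken from the project.
    """
    out = df.copy()
    out["scheduled_time"] = pd.to_datetime(out["scheduled_time"])
    out["actual_time"] = pd.to_datetime(out["actual_time"])
    # Drop rows whose missing times make the delay impossible to compute.
    out = out.dropna(subset=["scheduled_time", "actual_time"])
    # Delay duration in minutes.
    delta = out["actual_time"] - out["scheduled_time"]
    out["delay_min"] = delta.dt.total_seconds() / 60
    # Time-based features (Monday = 0).
    out["hour"] = out["scheduled_time"].dt.hour
    out["day_of_week"] = out["scheduled_time"].dt.dayofweek
    # Binary target.
    out["is_delayed"] = (out["delay_min"] >= delay_threshold_min).astype(int)
    # One-hot encode the categorical columns.
    return pd.get_dummies(out, columns=["train_type", "station"])


clean = preprocess(raw)
print(clean[["delay_min", "hour", "day_of_week", "is_delayed"]])
```

One-hot encoding is only one possible encoding; for a column with many distinct stations, an ordinal or target encoding may be more practical.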

## 5. Exploratory Data Analysis (EDA)
### 5.1 Delay Distribution
- This histogram illustrates the distribution of train delay durations across all records in the dataset.
- The majority of delays cluster near zero, indicating that **most trains operate on time or with minimal delay**.
- The long right tail shows the presence of **occasional extreme delays**, which heavily influence the overall variance in the dataset.
- Such skewness suggests the need for **robust statistical methods** when modeling delays, as the delay durations are not normally distributed.
- Understanding this distribution helps quantify overall system reliability and identify the frequency of severe disruptions.
  
![Delay distribution.jpg](attachment:99f34d32-9e59-4d21-bfae-749c1f0cf5bc.jpg)
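
To make the point about skewness concrete, the sketch below compares the mean with robust summaries (median, interquartile range, 95th percentile) on a small, made-up list of delay durations. The values and the helper name `robust_summary` are illustrative assumptions, not data or code from the project.

```python
import pandas as pd


def robust_summary(delays):
    """Summarize a right-skewed delay sample with robust statistics."""
    s = pd.Series(delays, dtype=float)
    q1, q3 = s.quantile([0.25, 0.75])
    return {
        "mean": s.mean(),        # pulled upward by extreme delays
        "median": s.median(),    # robust measure of the typical delay
        "iqr": q3 - q1,          # robust measure of spread
        "p95": s.quantile(0.95),
        "skew": s.skew(),        # positive for a long right tail
    }


# Hypothetical delays in minutes: mostly small, with one extreme value.
delays = [0, 0, 1, 1, 2, 2, 3, 4, 6, 90]
summary = robust_summary(delays)
print(summary)
```

In this toy sample a single 90-minute delay lifts the mean to 10.9 minutes while the median stays at 2 minutes, which is why the median and percentiles are often preferred for describing such data.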

### 5.2 Delay by Station
- This bar chart summarizes how frequently delays occur at each major station.
- The stations differ substantially in delay counts, indicating **location-based operational variability**.
- Larger hubs naturally demonstrate higher delay frequencies due to **greater rail traffic volume and congestion sensitivity**.
- Identifying stations with consistently high delays helps focus optimization or infrastructure improvements.
- The variation suggests that station-specific factors, such as traffic density, track layout, or regional weather, play a significant role in delay patterns.

  ![Delay rate by station.jpg](attachment:a83d5d60-4a12-4d0a-adec-b930e48f3546.jpg)
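
Because larger hubs handle more trains, raw delay counts and delay rates can tell different stories. The following sketch computes both per station; the station labels and records are invented for illustration, and the `station` and `is_delayed` column names are assumed.

```python
import pandas as pd

# Hypothetical records: station "A" is a busier hub than "B" or "C".
records = pd.DataFrame({
    "station": ["A", "A", "A", "A", "B", "B", "C", "C"],
    "is_delayed": [1, 1, 0, 0, 1, 0, 0, 0],
})

station_stats = (
    records.groupby("station")["is_delayed"]
    .agg(n_trains="count", n_delayed="sum", delay_rate="mean")
    .sort_values("delay_rate", ascending=False)
)
print(station_stats)
```

Here station A has twice as many delayed trains as station B, yet both have the same delay rate of 0.5, so ranking by rate rather than count avoids penalizing a station simply for being busy.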

### 5.3 Delay Patterns by Hour of the Day
- The plot shows how **average delays vary across the 24 hours of the day**.
- Delays generally **increase during the early morning hours**, reflecting the beginning of peak commuter periods when demand rises and congestion increases.
- Midday hours typically show **lower and more stable delay values**, which aligns with reduced passenger volume and smoother operations.
- A **second increase appears during late afternoon and early evening**, corresponding to the evening rush hour when the system again experiences heavier load.
- During late-night hours, delays remain **consistently low**, as service demand is minimal and operations face fewer disruptions.
- Overall, the plot highlights a **clear daily cycle**, where delays correspond strongly with typical human activity patterns and transportation system usage.
  
![Delay by hour of the day.jpg](attachment:ac466558-1ffa-4bf5-a3aa-92cdd3809828.jpg)
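
An hourly profile like the one plotted above can be produced by grouping on the hour of the scheduled time. The sketch below assumes a `scheduled_time` datetime column and a `delay_min` column; both names and all values are hypothetical.

```python
import pandas as pd

# Hypothetical trips with delay durations in minutes.
trips = pd.DataFrame({
    "scheduled_time": pd.to_datetime([
        "2024-01-08 07:30", "2024-01-08 08:10", "2024-01-08 12:00",
        "2024-01-08 17:45", "2024-01-09 08:20", "2024-01-09 23:10",
    ]),
    "delay_min": [6.0, 9.0, 1.0, 7.0, 5.0, 0.0],
})

# Average delay per hour of day.
hourly = trips.groupby(trips["scheduled_time"].dt.hour)["delay_min"].mean()
# Reindex to all 24 hours so hours without trips show up as NaN.
hourly = hourly.reindex(range(24))
print(hourly.dropna())
```

Reindexing to the full 24-hour range keeps gaps visible in a plot instead of silently connecting distant hours.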

### 5.4 Delay Trend Over Time
- This line plot tracks the average train delay for each day in the recorded period.
- The fluctuating pattern suggests that delays are influenced by **temporal factors**, including weather, weekday vs. weekend patterns, seasonal travel or sporadic disruptions.
- The presence of sharp spikes indicates days with **system-wide operational stress**, likely due to cascading delays.
- The lack of a strong trend indicates **irregular, event-driven delay behavior** rather than systematic improvement or deterioration.
- Time-based analysis such as this is crucial for understanding when delays are most likely to occur.

   ![Delay trend over time.jpg](attachment:1762dbf0-8dca-4480-9349-7b5fd78cad82.jpg)
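
A daily average series and a simple spike flag can be sketched as below. The rule used here (daily mean above the median plus three times the median absolute deviation) is an assumption chosen for illustration, not the method used in the project, and the data are made up.

```python
import pandas as pd

# Hypothetical delay records spread over five days.
obs = pd.DataFrame({
    "scheduled_time": pd.to_datetime([
        "2024-01-08 08:00", "2024-01-08 18:00",
        "2024-01-09 09:00",
        "2024-01-10 10:00",
        "2024-01-11 08:00", "2024-01-11 17:00",
        "2024-01-12 12:00",
    ]),
    "delay_min": [1.0, 3.0, 3.0, 2.0, 15.0, 25.0, 3.0],
})

# Average delay per calendar day.
daily = obs.set_index("scheduled_time")["delay_min"].resample("D").mean()

# Robust spike rule (assumed): median + 3 * median absolute deviation.
center = daily.median()
mad = (daily - center).abs().median()
spike_days = daily[daily > center + 3 * mad]
print(spike_days)
```

With these toy values only 2024-01-11 is flagged. A median-based threshold is used because, unlike a mean and standard deviation, it is not inflated by the spikes it is meant to detect.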

## 6. Feature Importance (Random Forest)

A Random Forest model was used to evaluate the importance of each feature.

**Interpretation:**

 - Hour of day and train type were among the strongest predictors.

 - Certain stations show high importance, supporting earlier EDA findings.

 - Time-based features rank above geographic features in importance, emphasizing temporal patterns in delays.

This informs where to direct operational improvements (e.g., schedule adjustments during peak hours).
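
For reference, impurity-based feature importances can be read from a fitted scikit-learn Random Forest as sketched below. The data are synthetic: the feature names (`hour`, `day_of_week`, `station_id`) are assumed, and the label is constructed to depend only on a hypothetical set of peak hours, so the ranking reflects this toy setup rather than the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic features with assumed names.
X = pd.DataFrame({
    "hour": rng.integers(0, 24, n),
    "day_of_week": rng.integers(0, 7, n),
    "station_id": rng.integers(0, 5, n),
})
# Synthetic label: delayed only during hypothetical peak hours.
y = X["hour"].isin([7, 8, 17, 18]).astype(int)

forest = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
forest.fit(X, y)

# Impurity-based importances sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to favor features with many distinct values, so `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.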

## 7. Model Development
**Model: Logistic Regression**

Selected for its interpretability and strong performance on structured data.

**Hyperparameter Optimization**: Performed using RandomizedSearchCV.

The search tuned the following parameters:

 - C (regularization strength)

 - penalty

 - solver

These improved predictive accuracy and generalization.
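
A search over `C`, `penalty`, and `solver` with `RandomizedSearchCV` might look like the sketch below. The synthetic dataset, the search ranges, the scaling step, the `f1` scoring choice, and `class_weight="balanced"` are assumptions for illustration; the notebook's actual search space and best values are not reproduced here.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the delay dataset.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling helps the solvers converge; class weights address the imbalance.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Assumed search space; liblinear and saga both support l1 and l2.
param_distributions = {
    "logisticregression__C": loguniform(1e-3, 1e2),
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__solver": ["liblinear", "saga"],
}

search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=10,
    scoring="f1", cv=5, random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_)
```

Splitting off the test set before the search keeps the final evaluation independent of the cross-validated tuning.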

## 8. Model Evaluation
**Confusion Matrix Interpretation**

 - True negatives dominate, which shows that most trains are on time (class imbalance).

 - The model successfully identifies delayed trains but remains conservative due to fewer delay examples.

**Classification Report**

 - Precision for the delayed class: high. When the model predicts a delay, it is usually correct.

 - Recall for the delayed class: moderate. Some delays are missed, which is expected in an imbalanced dataset.

 - F1 Score: Indicates consistent performance across both classes.

**Model Performance Table**

| Metric    | Score  |
|-----------|--------|
| Accuracy  | 84.89% |
| Precision | 83.82% |
| Recall    | 81.75% |
| F1 Score  | 82.77% |
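
As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so the reported value can be recomputed from the other two rows of the table. The second helper shows how all four metrics follow from confusion-matrix counts; the counts passed to it are hypothetical and are not the project's confusion matrix.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 for the positive (delayed) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


# Precision and recall as reported in the table above.
f1 = f1_from(0.8382, 0.8175)
print(f"F1 from reported precision and recall: {f1:.2%}")

# Hypothetical counts illustrating how imbalance can inflate accuracy.
toy = metrics_from_counts(tp=8, fp=2, fn=2, tn=88)
print(toy)
```

The recomputed F1 rounds to 82.77%, matching the table. In the toy counts, accuracy reaches 96% while precision and recall are both 80%, which is why accuracy alone is a weak summary when on-time trains dominate.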


**Overall Assessment**
- The model performs reliably for a naturally imbalanced operational dataset.
- Delay prediction is inherently challenging because many causal factors (weather, emergencies, maintenance incidents) are not captured in the dataset.
- Despite this, the model provides useful predictive insight and can serve as a baseline for future improvements using richer feature data.

## 9. Key Insights
**Most trains run on time**

 - Delays follow a right-skewed distribution with a few extreme cases.

**Time of day is a major factor**

 - Peak hours show significantly higher delay likelihood.

**Certain stations consistently exhibit higher delays**

 - This points to operational bottlenecks or infrastructure limitations.

**Train type influences reliability**

 - Some service classes have inherently different delay characteristics.

**Predictive modeling is feasible**

 - Logistic Regression provides meaningful classification with interpretable coefficients.

## 10. Limitations

 - Missing or unreliable delay fields limited the precision of some analyses.

 - Delay causes are not included (e.g., weather, mechanical issues).

 - Real-world delays have random components that are not easily predicted.

## 11. Future Work

 - Incorporate external data (weather, events, maintenance logs).

 - Experiment with gradient boosting models (XGBoost, LightGBM).

 - Build a dashboard using Streamlit for interactive visualization.

 - Analyze delays spatially using geospatial mapping.

## 12. Conclusion
This project explored patterns of train delays across major German cities using a dataset containing schedules, updated times, station information, and binary delay labels. Through data cleaning, feature engineering, and exploratory analysis, several meaningful insights emerged:
- Delay patterns vary significantly across stations and hours of the day, with peak delays occurring during typical high-traffic periods.
- Although most trains are on time (heavy class imbalance), the machine learning model was able to learn meaningful patterns and achieved strong performance across accuracy, precision, recall and F1-score.
- The Logistic Regression classifier performed reliably and demonstrated that even limited temporal and categorical features can predict delay likelihood with reasonable confidence.
  
Overall, the analysis provides a solid foundation for understanding operational behavior in train networks and highlights how data-driven approaches can support transportation planning and reliability forecasting.