# Project Documentation: Traffic Flow Prediction Based on Weather

## 1. Introduction

The **Traffic Flow Prediction Based on Weather** project is a data-driven initiative aimed at understanding and predicting the impact of weather on urban traffic flow. By combining weather data with traffic patterns, this project seeks to improve traffic management and decision-making in urban areas.


## 2. Problem Statement

Traffic congestion in urban areas causes delays, reduced productivity, and environmental strain. Weather conditions significantly influence traffic flow, yet weather data is often underused in traffic prediction models. This project addresses that gap by building a predictive model that forecasts traffic velocity from weather and road attributes.


## 3. Objectives

1. **Analyze** the influence of weather conditions on traffic flow.  
2. **Develop** a predictive model to forecast traffic velocity based on weather data.  
3. **Provide** actionable insights for effective traffic management and planning.  


## 4. Tools and Technologies

- **Programming Language**: Python  
- **Libraries**: Pandas, Matplotlib, Seaborn, XGBoost, Scikit-learn  
- **Application Framework**: Streamlit  
- **Development Tools**: Jupyter Notebook, Visual Studio Code  


## 5. Data Loading

We began by loading two datasets:

1. **Traffic Flow Data in Ho Chi Minh City**:
   - Contains detailed traffic data such as segment velocity, updated timestamps, and geographical information.
   - **Key Files**:
     - `segment_status.csv`: Traffic velocities for road segments.
     - `segments.csv`: Segment details like street name and max velocity.

2. **Vietnam Weather Data**:
   - Includes weather records (2009–2021) for 40 provinces in Vietnam, covering factors like temperature, humidity, wind speed, and rainfall.
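
The loading step can be sketched as below. This is a minimal illustration rather than the project's actual loader: the helper name `load_tables`, the `data` directory, and the weather file name `weather.csv` are assumptions, since the source does not name them.

```python
from pathlib import Path

import pandas as pd


def load_tables(data_dir, filenames):
    """Read each CSV in `filenames` from `data_dir` into a DataFrame keyed by file stem."""
    data_dir = Path(data_dir)
    return {Path(name).stem: pd.read_csv(data_dir / name) for name in filenames}


# Example call (directory and weather file name are assumptions):
# tables = load_tables("data", ["segment_status.csv", "segments.csv", "weather.csv"])
# status, segments = tables["segment_status"], tables["segments"]
```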



## 6. Data Slicing and Feature Selection

Key features were identified and selected from the datasets:
- **Traffic Data**:
  - `segment_id`, `velocity`, `updated_at_30min` (timestamp), `length`, and segment-specific attributes.
- **Weather Data**:
  - `rain`, `max` and `min` (maximum and minimum temperature), `humidi` (humidity), `cloud` (cloud cover), and `pressure`.

This allowed us to focus on the most relevant factors for analyzing traffic flow and its relationship to weather.
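
A small sketch of the slicing step, using the column names listed above. The helper `select_columns` is hypothetical; it keeps only the requested columns that are actually present, so it tolerates files whose schema differs slightly.

```python
import pandas as pd

# Column names taken from the feature lists above; adjust if the raw files differ.
TRAFFIC_COLS = ["segment_id", "velocity", "updated_at_30min", "length"]
WEATHER_COLS = ["rain", "max", "min", "humidi", "cloud", "pressure"]


def select_columns(df, wanted):
    """Return a copy of `df` restricted to the `wanted` columns that exist in it."""
    present = [col for col in wanted if col in df.columns]
    missing = sorted(set(wanted) - set(present))
    if missing:
        print(f"Skipping columns not found: {missing}")
    return df[present].copy()
```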

## 7. Feature Engineering

We introduced additional features to enhance the dataset:
- **Date and Time Features**:
  - Extracted `year`, `month`, `day`, and `hour` from timestamps.
- **Geographical Mapping**:
  - Used `lat` (latitude) and `long` (longitude) for spatial visualizations.
- **Traffic Segments**:
  - Merged street-level details like `street_level` and `street_type`.
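
The date and time extraction and the segment merge could look roughly like this. The function names are hypothetical, and the join key `segment_id` on the segments table is an assumption about the raw schema.

```python
import pandas as pd


def add_time_features(df, ts_col="updated_at_30min"):
    """Add year, month, day, and hour columns derived from a timestamp column."""
    out = df.copy()
    # Unparseable timestamps become NaT instead of raising an error.
    ts = pd.to_datetime(out[ts_col], errors="coerce")
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    out["hour"] = ts.dt.hour
    return out


def attach_segment_details(status, segments, key="segment_id"):
    """Left-join street-level attributes onto the traffic records.

    Assumes both tables share the `key` column.
    """
    details = segments[[key, "street_level", "street_type"]]
    return status.merge(details, on=key, how="left")
```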


## 8. Data Cleaning and Merging

- **Handling Missing Values**:
  - Removed or imputed missing entries where necessary.
  - Ensured date and time fields were correctly parsed.

- **Outlier Handling**:
  - Used the **Interquartile Range (IQR)** method to remove outliers in key columns like `velocity` and `rain`.

- **Dataset Merging**:
  - Joined traffic and weather datasets using `date`, `time`, and `city` as keys.
  - The resulting dataset included a comprehensive combination of traffic velocities, geographical coordinates, and weather conditions.
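
The outlier filter and the join described above can be sketched as follows. The 1.5 multiplier is the conventional IQR fence and is an assumption here, as is the inner join; the source only states that the IQR method and the `date`, `time`, and `city` keys were used.

```python
import pandas as pd


def remove_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR].

    Rows with a missing value in `column` are dropped as well.
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]


def merge_traffic_weather(traffic, weather, keys=("date", "time", "city")):
    """Join traffic and weather rows that share the same key values."""
    return traffic.merge(weather, on=list(keys), how="inner")
```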


## 9. Daily Aggregated Data



To simplify analysis and modeling, we created the **Daily Aggregated Data** dataset:
- **Creation**:
  - Aggregated the cleaned and merged dataset by `date`.
  - Calculated daily statistical summaries for `velocity` (mean, median, standard deviation) and included daily averages for weather attributes (`rain`, `humidi`, `cloud`, `pressure`, etc.).
  - Incorporated street-level and geographical features.

- **Purpose**:
  - The Daily Aggregated Data provides a higher-level view of traffic and weather trends over time, reducing data granularity while retaining critical patterns and relationships.
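
One way to build such a daily table with pandas is shown below. The output names `mean_velocity` and `median_velocity` match the columns used later in this document; `std_velocity` and the function name are assumptions.

```python
import pandas as pd


def aggregate_daily(df, weather_cols=("rain", "humidi", "cloud", "pressure")):
    """Collapse the merged dataset to one row per date."""
    agg_spec = {
        "mean_velocity": ("velocity", "mean"),
        "median_velocity": ("velocity", "median"),
        "std_velocity": ("velocity", "std"),
    }
    # Weather attributes are summarized by their daily mean.
    for col in weather_cols:
        agg_spec[col] = (col, "mean")
    return df.groupby("date").agg(**agg_spec).reset_index()
```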


## 10. Exploratory Data Analysis (EDA)

Using Streamlit, we performed EDA to understand the dataset better:
- **Descriptive Statistics**:
  - Displayed summary statistics of key columns like `velocity`, `rain`, and `humidi`.

- **Correlation Heatmap**:
  - Highlighted relationships between features using a heatmap, identifying key predictors of traffic velocity.

- **Traffic Velocity Distribution**:
  - Visualized the distribution of `velocity` values to analyze variability.

- **Time Series Analysis**:
  - Explored trends in `mean_velocity` and `median_velocity` over time using line plots.
  - Compared `min` and `max` temperature trends for daily analysis.
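
As an illustration of the correlation step, the sketch below ranks numeric features by the strength of their correlation with the target. It is not the app's code; the helper name is hypothetical, and the plotting calls are only indicated in comments.

```python
import pandas as pd


def rank_correlations(df, target="mean_velocity"):
    """Return numeric features sorted by absolute Pearson correlation with `target`."""
    corr = df.select_dtypes("number").corr()
    return corr[target].drop(target).sort_values(key=lambda s: s.abs(), ascending=False)


# The full matrix can be drawn with seaborn.heatmap(corr, annot=True)
# and shown in a Streamlit page with st.pyplot(fig).
```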

## 11. Machine Learning: Model Development and Evaluation

### Random Forest Model
- **Model Selection**:
  - Chose Random Forest due to its robustness in handling non-linear relationships and feature interactions.

- **Data Preparation**:
  - Used the aggregated dataset, focusing on weather attributes and street-level features as predictors.
  - Target variable: `mean_velocity` (daily mean traffic velocity).

- **Model Training**:
  - Trained a Random Forest Regressor on 80% of the data (training set) and evaluated it on the remaining 20% (test set).

- **Evaluation Metrics**:
  - Assessed model performance with **Mean Absolute Error (MAE)** and **R² score**: the model reached an MAE of 1.73 and an R² of 0.95 on the test set.

- **Deployment**:
  - Saved the trained model using Pickle for integration into the Streamlit app.
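
The training, evaluation, and saving steps above might be sketched as follows. Hyperparameters such as `n_estimators=100` and the fixed random seed are assumptions, since the source does not report them.

```python
import pickle

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def train_random_forest(df, feature_cols, target="mean_velocity", seed=42):
    """Fit a Random Forest on an 80/20 split and return the model with test metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df[target], test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    metrics = {
        "mae": mean_absolute_error(y_test, preds),
        "r2": r2_score(y_test, preds),
    }
    return model, metrics


def save_model(model, path):
    """Serialize the fitted model with pickle so the Streamlit app can load it."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
```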

### XGBoost Model
- **Implementation**:
  - Trained an **XGBoost Regressor** as an alternative to **Random Forest**.

- **Evaluation**:
  - Compared Random Forest and XGBoost on the same metrics; XGBoost reached an **MAE** of 0.03 and an **R² score** of 1.00.
  - XGBoost outperformed Random Forest, leading to its selection for deployment.
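
A model comparison of this kind can be sketched with a small helper that fits each candidate on the same split and picks the one with the lowest MAE. To keep the example dependent on scikit-learn only, `GradientBoostingRegressor` is used here as a stand-in for the boosted model; `xgboost.XGBRegressor` follows the same `fit`/`predict` interface and can be swapped in. The helper name and hyperparameters are assumptions.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split


def compare_models(models, X, y, seed=42):
    """Fit every model on the same 80/20 split; return scores and the lowest-MAE name."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        scores[name] = {
            "mae": mean_absolute_error(y_test, preds),
            "r2": r2_score(y_test, preds),
        }
    best = min(scores, key=lambda name: scores[name]["mae"])
    return scores, best


candidates = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    # Stand-in for XGBoost; replace with xgboost.XGBRegressor(...) if installed.
    "boosted_trees": GradientBoostingRegressor(random_state=42),
}
```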


## 12. Clustering: K-Means Model

- **Objective**:
  - Uncover hidden patterns in traffic and weather data.

- **Implementation**:
  - Applied K-Means clustering to group data points based on similarities.
  - Chose an optimal number of clusters using the elbow method.

- **Insights**:
  - Identified distinct patterns in traffic behavior under different weather conditions.

- **Integration**:
  - Incorporated clustering results into the Streamlit app for interactive visualization and analysis.
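
The elbow method mentioned above amounts to fitting K-Means for several values of `k` and looking for the point where inertia stops dropping sharply. The sketch below assumes the features are standardized first, which the source does not state, and the function name is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def elbow_inertias(X, k_values=range(1, 9), seed=42):
    """Return {k: inertia} for K-Means fitted on standardized features."""
    # Scaling is an assumption; it keeps large-valued columns such as
    # pressure from dominating the distance computation.
    X_scaled = StandardScaler().fit_transform(X)
    inertias = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_scaled)
        inertias[k] = km.inertia_
    return inertias
```

Plotting the returned values against `k` gives the elbow curve; the chosen `k` is then passed to a final `KMeans` fit whose `labels_` can be attached to the dataset.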




## 13. Challenges and Solutions

### Challenge 1: Merging Traffic and Weather Data  
- **Solution**: Aligned the two datasets on shared `date`, `time`, and `city` keys before joining them.

### Challenge 2: Outliers in Traffic Data  
- **Solution**: Used the IQR method to remove extreme values.

### Challenge 3: Model Optimization  
- **Solution**: Fine-tuned the XGBoost Regressor to improve accuracy and reduce prediction error.



## 14. Conclusion

The project predicts daily mean traffic velocity from weather and road attributes, with XGBoost giving the lowest test error (MAE 0.03, compared with 1.73 for Random Forest), and provides insights for urban traffic management. Future work aims to further improve predictive accuracy and add real-time capabilities.