
Extreme Gradient Boosting (XGBoost) model for predicting hourly traffic volume. Utilized MAPE for model scoring, train-test splits with TimeSeriesSplit and hyperparameter tuning with GridSearchCV.


theo-obadiah-teguh/Traffic-XGB


Traffic-XGB

A Jupyter Notebook used to conduct exploratory data analysis and time series forecasting on traffic volume for westbound I-94, a major interstate highway in the US that connects Minneapolis and St. Paul, Minnesota. Here, we developed an Extreme Gradient Boosting (XGBoost) model with Mean Absolute Percentage Error (MAPE) as the scoring method. Cross-validation splits were generated with TimeSeriesSplit, and hyperparameter tuning was conducted with GridSearchCV.

Data Source

The dataset used in this notebook was taken from the UC Irvine Machine Learning Repository. The data was collected by the Minnesota Department of Transportation (MnDOT) from 2012 to 2018 at a station roughly midway between the two cities. The included variables are described below.

  • holiday - (Categorical) : US national holidays plus a regional holiday, the Minnesota State Fair
  • temp - (Numeric) : Average temperature in kelvin
  • rain_1h - (Numeric) : Rainfall that occurred in the hour, in mm
  • snow_1h - (Numeric) : Snowfall that occurred in the hour, in mm
  • clouds_all - (Numeric) : Percentage of cloud cover
  • weather_main - (Categorical) : Short textual description of the current weather
  • weather_description - (Categorical) : Longer textual description of the current weather
  • date_time - (DateTime) : Hour of the data collected, in local CST
  • traffic_volume - (Numeric) : Hourly westbound traffic volume reported at I-94 ATR station 301
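Because tree models cannot consume a raw `date_time` column directly, a typical preprocessing step is to derive calendar features from it. The exact features used in the notebook are not listed here, so the sketch below, with made-up sample rows, is only one plausible approach:

```python
import pandas as pd

# Hypothetical slice of the dataset; column names follow the UCI schema above,
# but the values are illustrative, not taken from the real data.
df = pd.DataFrame({
    "date_time": pd.to_datetime(["2013-05-01 08:00", "2013-05-01 09:00", "2013-05-03 17:00"]),
    "traffic_volume": [5545, 4516, 6015],
})

# Derive calendar features a gradient-boosted tree can split on.
df["hour"] = df["date_time"].dt.hour
df["dayofweek"] = df["date_time"].dt.dayofweek  # Monday = 0
df["month"] = df["date_time"].dt.month

print(df[["hour", "dayofweek", "month"]])
```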

Additional Resources

This project was made possible with the help of the following openly available learning resources.

Remarks

This project was probably my first time training and using a powerful, openly available model like XGBoost. I learned concepts such as model training, train-test splitting for time series data, L1 and L2 regularization, and data visualization with correlation matrices, boxplots, time series line plots, and so on. Note that TimeSeriesSplit must be used to avoid data leakage, where we would train on future data while testing on past data during cross-validation. This is a common mistake when following online tutorials.
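The leakage point above is easy to verify directly: in every TimeSeriesSplit fold, all training indices precede all test indices, whereas a shuffled KFold can place future rows in the training fold. A small sketch:

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 hourly observations in time order

# TimeSeriesSplit: every training index precedes every test index.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < test_idx.min()  # no future data in training

# A shuffled KFold, by contrast, can mix future rows into the training fold,
# leaking information when the target is autocorrelated.
train_idx, test_idx = next(iter(KFold(n_splits=4, shuffle=True, random_state=0).split(X)))
print(sorted(test_idx))
```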

Another issue was the gap of missing observations, particularly around the year 2015. Although this project was overall a good exercise for getting familiar with one of the most popular publicly available models out there, I certainly wish the data had been of better quality. Finally, there was no proper held-out test set for this data. People generally use 80-20 splits, which is something I did not do in this project. It may also be beneficial to look at more traditional time series forecasting models such as ARMA, and to actually learn to deal with stationarity, trends, and other aspects of time series data.
