# Traffic Congestion: A Comparison of Machine Learning Models in a Time-Series Scenario

Managing vehicular traffic is a pressing challenge for most of our modern highways and road networks. We run the world on our roads and highways, and it is hard to imagine a world without them. As cities expand and people, particularly in the developing world, move to urban areas for opportunities and livelihood, vehicle ownership has risen rapidly. The growing number of active automobiles poses challenges for civic and government authorities, and road mishaps increase in parallel. Many developed countries face less population pressure and hence less strain from growing vehicle ownership, but in countries such as China and India the picture is altogether different. Even the most well-laid-out road can see accidents and traffic mismanagement. In such circumstances, machine learning can help predict the traffic volume at a given hour, which in turn informs how it could be managed at that point. So let's see how.

Presented below is a typical time-series dataset with hourly timestamps, starting from the 1960s all the way to the present day. The problem might seem trivial, but as we proceed it becomes clear that data preprocessing and deep exploration are compulsory steps before applying learning methods. The principle is simple: if you do not know the data well, you cannot predict on it well.

This analysis can serve several purposes, from learning some of the nuances of typical time-series projects to simple reference material for those looking to refresh their knowledge of time-series workflows. The dataset was featured in a [hackathon][1] and is also available on [Kaggle][2], free for public use.

With all that said, let's define our goals and dive straight in. 

[1]: https://www.hackerearth.com/challenges/competitive/IIT-Madras-Sangam-ML-Hackathon-2019/
[2]: https://www.kaggle.com/rohith203/traffic-volume-dataset

## Objectives of this Notebook

By the end of the project, we'll have learned how to:

- Preprocess time-series datasets, drawing on similar techniques used in supervised analysis.
- Follow best practices for deciding the time-series hyper-parameters (p, d and q).
- Use cross-validation to compare standard machine learning models.
- Use cross-validation to compare time-series parameters.

The libraries used are mostly standard scientific modules such as `statsmodels` and `scikit-learn`. Additional libraries are described briefly wherever they are used. I personally like to import only the required modules and make the most of them.
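To make the cross-validation objectives a little more concrete: with time-ordered data, folds are usually built so that each model is trained on the past and evaluated on the future. The sketch below uses scikit-learn's `TimeSeriesSplit` on a synthetic 24-point series. The series and the choice of three splits are assumptions for illustration only, not the setup used later in this notebook.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for an hourly series (hypothetical, not the traffic data)
y = np.arange(24)

# Expanding-window splits: every training window ends before its test window starts
splitter = TimeSeriesSplit(n_splits=3)

folds = []
for train_idx, test_idx in splitter.split(y):
    folds.append((train_idx, test_idx))
    print(f"train 0..{train_idx[-1]} -> test {test_idx[0]}..{test_idx[-1]}")
```

Unlike a shuffled K-fold, this never lets a model see observations that come after the ones it is asked to predict, which is the property that matters when comparing models on a time series.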

# Data Loading and Exploration

We start off by loading our data and the relevant modules. The data is in standard .csv format.

In [None]:
from statsmodels.tsa.statespace.sarimax import SARIMAX
from numpy import log
from statsmodels.tsa.stattools import adfuller
from xgboost import XGBRegressor, XGBRFRegressor
from yellowbrick.regressor import residuals_plot, prediction_error
from yellowbrick.features import rank2d, rank1d
from yellowbrick.model_selection import RFECV, ValidationCurve, LearningCurve
from sklearn.linear_model import LinearRegression, LassoCV, ElasticNetCV, SGDRegressor, PassiveAggressiveRegressor, Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor
from sklearn.svm import SVR, NuSVR
from warnings import filterwarnings
from matplotlib.pylab import rcParams
from pandas.plotting import register_matplotlib_converters
from seaborn import catplot
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import pandas_profiling
import statsmodels.api as sm
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
register_matplotlib_converters()
filterwarnings("ignore")
plt.style.use("seaborn-whitegrid")
# rcParams['figure.figsize'] = 10, 8
plt.ion()
np.random.seed(1000)
print("\nEnvironment is ready.")

# Load the training data into a pandas DataFrame
main_data = pd.read_csv("Train.csv")
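For hourly data, a quick first check after loading is whether the timestamps are parsed as datetimes, free of duplicates, and evenly spaced. Here is a minimal sketch of that check, using a tiny in-memory CSV in place of `Train.csv`. The column names `date_time` and `traffic_volume` and the toy rows are assumptions made for illustration and may not match the real file.

```python
import io
import pandas as pd

# Toy stand-in for Train.csv; column names and values are hypothetical
csv_text = """date_time,traffic_volume
2020-01-01 09:00:00,100
2020-01-01 10:00:00,200
2020-01-01 10:00:00,200
2020-01-01 12:00:00,300
"""

# Parse the timestamp column as datetimes while reading
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date_time"])

# Drop repeated timestamps, then index and sort by time
sample = (
    sample.drop_duplicates(subset="date_time")
    .set_index("date_time")
    .sort_index()
)

# Reindex onto a full hourly grid so missing hours show up as NaN rows
hourly = sample.asfreq(pd.offsets.Hour())
missing_hours = int(hourly["traffic_volume"].isna().sum())
print(f"{len(hourly)} hourly rows, {missing_hours} missing")
```

Making the gaps explicit at this stage is a design choice: models such as SARIMAX expect a regular time index, so it is easier to decide how to fill or drop missing hours before any modelling begins.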