# TPS March 2022 - Traffic Congestion Forecasting

In Kaggle's TPS March 2022 competition, we are challenged to forecast traffic congestion in a U.S. metropolis. The dataset contains congestion measurements for 65 roadways from April to September 1991. We will forecast the afternoon of 30 September 1991.

Since I have no experience with time series, I will follow [Kaggle's Time Series Course](https://www.kaggle.com/learn/time-series) by [Ryan Holbrook](https://www.kaggle.com/ryanholbrook) to gain some.

If you are here to see the model I used, you can jump directly to the section [Boosted Hybrid Model](#Boosted-Hybrid-Model).

Let's start with importing necessary packages and datasets.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import holidays
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess

df_train = pd.read_csv("/kaggle/input/tabular-playground-series-mar-2022/train.csv",
                      parse_dates=["time"], index_col="row_id")
df_test = pd.read_csv("/kaggle/input/tabular-playground-series-mar-2022/test.csv",
                      parse_dates=["time"], index_col="row_id")
df_train.head()

# Unique Directions

Direction can be one of 8 values: 
* NB ↑
* NE ↗
* EB →
* SE ↘
* SB ↓
* SW ↙
* WB ←
* NW ↖

Since the dataset contains only 65 roadways, not every (x, y) coordinate has all 8 directions. Let's find the unique combinations of coordinate and direction.

In [None]:
uniq_dirs = (df_train.x.astype(str) +" "+ df_train.y.astype(str) +" "+ df_train.direction).unique()
print("There are {} unique directions as:".format(len(uniq_dirs)))
print(uniq_dirs)
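
The same count can also be obtained without building strings. The following is a minimal sketch on a tiny made-up frame (the `toy` data is an assumption, not the competition data) that uses `drop_duplicates` to list roadways and `groupby` to count the directions at each location:

```python
import pandas as pd

# Tiny synthetic stand-in for df_train (values are made up for illustration)
toy = pd.DataFrame({
    "x": [0, 0, 0, 1, 1, 0],
    "y": [0, 0, 1, 0, 0, 0],
    "direction": ["EB", "NB", "EB", "SB", "SB", "EB"],
})

# One row per distinct (x, y, direction) combination, i.e. per roadway
roads = toy[["x", "y", "direction"]].drop_duplicates().reset_index(drop=True)
n_roads = len(roads)

# Number of distinct directions observed at each (x, y) location
dirs_per_location = roads.groupby(["x", "y"])["direction"].nunique()
print(n_roads)
print(dirs_per_location)
```

Keeping `x`, `y` and `direction` as separate columns avoids having to split strings again later.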

# Imputing Missing Time Stamps

The dataset has no missing values in any row or column. Still, some time stamps are missing entirely. In this section, I will create these time stamps. To impute their congestion values, I will use the mean congestion of the same road at the same time on the same weekday.

In [None]:
# finding the difference between all possible times and our times
all_times = pd.DataFrame(pd.date_range("1991-04-01 00:00:00","1991-09-30 11:40:00",freq="20Min"), columns=["time"])
missing_times = list(set(all_times.time.values)-set(df_train.time.unique()))
print("There are {} missing time stamps.".format(len(missing_times)))
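
As an alternative to the set arithmetic above, pandas index objects offer `difference`, which keeps the result sorted and typed as timestamps. A small sketch on a made-up two-hour window (the window and the removed positions are assumptions for illustration):

```python
import pandas as pd

# Full 20-minute grid for a short, made-up window
expected = pd.date_range("1991-04-01 00:00", "1991-04-01 02:00", freq="20min")

# Observed stamps, with two entries removed on purpose to create gaps
observed = expected.delete([2, 5])

# Stamps on the grid that never appear in the observed data
missing = expected.difference(observed)
print(missing)
```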

In [None]:
# Calculating average congestion for each road_weekday_hour_minute
df = df_train.copy()
df["day_of_week"] = df.time.dt.day_name()
df["hour"] = df["time"].dt.hour.astype('Int64')
df["minute"] = df["time"].dt.minute.astype('Int64')
df["road_and_time"] = df.x.astype(str)+df.y.astype(str)+df.direction+"_"+df.day_of_week+"_"+df.hour.astype(str)+"_"+df.minute.astype(str)
df = df.groupby("road_and_time")["congestion"].mean().round(0).astype(int)
df.head()

In [None]:
# looping through missing times for all roads
miss_cong = []
for t in missing_times:
    for uniq_dir in uniq_dirs:
        # road:
        uniq_dir = uniq_dir.split()
        x = uniq_dir[0]
        y = uniq_dir[1]
        direc = uniq_dir[2]
        
        # time:
        t = pd.to_datetime(t)
        dayofweek = t.day_name()
        hour = t.hour
        minute = t.minute
        
        # creating string to search
        search = x+y+direc+"_"+dayofweek+"_"+str(hour)+"_"+str(minute)
        cong = df[search] # avg congestion
        miss_cong.append([t, int(x), int(y), direc, cong]) # saving in a list

# into DataFrame
miss_cong = pd.DataFrame(miss_cong, columns=df_train.columns)
miss_cong.head()
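
The nested loop can also be expressed as a group-wise mean followed by a merge. This is only a sketch under assumptions: the frames `obs` and `gaps` and the helper `add_slot` are hypothetical names with made-up values, standing in for the training data and the missing rows.

```python
import pandas as pd

# Made-up observations for one roadway: two Mondays at 00:00
obs = pd.DataFrame({
    "time": pd.to_datetime(["1991-04-01 00:00", "1991-04-08 00:00"]),
    "x": [0, 0], "y": [0, 0], "direction": ["EB", "EB"],
    "congestion": [40, 50],
})
# A third Monday at 00:00 that is missing from the data
gaps = pd.DataFrame({
    "time": pd.to_datetime(["1991-04-15 00:00"]),
    "x": [0], "y": [0], "direction": ["EB"],
})

def add_slot(frame):
    # Weekday, hour and minute identify the weekly time slot
    out = frame.copy()
    out["dow"] = out["time"].dt.dayofweek
    out["hour"] = out["time"].dt.hour
    out["minute"] = out["time"].dt.minute
    return out

keys = ["x", "y", "direction", "dow", "hour", "minute"]
slot_mean = (add_slot(obs).groupby(keys)["congestion"].mean()
             .round(0).astype(int).reset_index())

# The left merge looks up the slot average for every missing row
filled = add_slot(gaps).merge(slot_mean, on=keys, how="left")[obs.columns]
print(filled)
```

A merge like this avoids the string keys and scales with the number of rows rather than with the number of Python-level loop iterations.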

In [None]:
# concat with original dataframe
df_train = pd.concat([df_train, miss_cong])
# sorting and resetting index
df_train = df_train.set_index(["time", "x", "y", "direction"]).sort_index().reset_index()

In [None]:
all_times = pd.DataFrame(pd.date_range("1991-04-01 00:00:00","1991-09-30 11:40:00",freq="20Min"), columns=["time"])
missing_times = list(set(all_times.time.values)-set(df_train.time.unique()))
print("After imputation, remaining missing time stamps: {}".format(len(missing_times)))

# Congestion per Day

I would like to see the mean congestion for each day between April and September. I will group the dataset by day, take the mean, and plot it. I will also include a moving average in the plot to show the trend.

In [None]:
# creating column for date
df = df_train.copy()
df["date"] = pd.to_datetime(df.time.dt.date)

# grouping by date
df = df.groupby("date")["congestion"].mean()  # select the column first so non-numeric columns are not averaged
df = pd.DataFrame(df).reset_index()

# Calculating moving average
df["congestion_ma"] = df.congestion.rolling(window=45,  # 45-day window
                            center=True, min_periods=15).mean()

# Plot
sns.set(style="whitegrid")
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot("date", "congestion", data=df, color='0.5', ls='--', marker='o')
ax.plot("date", "congestion_ma", data=df, lw=3)
ax.set_title('Average Traffic Congestion per Day');
plt.show()
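
To make the effect of `center=True` and `min_periods` concrete, here is a minimal sketch on a five-point series (the values and the window of 3 are made up; the cell above uses a 45-day window with `min_periods=15`):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Centered window of 3: each point is averaged with its two neighbours.
# min_periods=2 lets the two end points use the two values that exist
# instead of returning NaN.
ma = s.rolling(window=3, center=True, min_periods=2).mean()
print(ma)
```

With a centered window the smoothed curve is not shifted in time relative to the data, and `min_periods` keeps the curve from being cut off at both ends of the series.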

## Fitting a Linear Regression Line with Weekly Dummies

In the previous plot, the first thing that caught my attention is the weekly seasonality. Most probably, there is less traffic on weekends than on weekdays. I will add a one-hot encoded dummy variable for the day of the week. For the trend, I will use a first-order time-step feature. I will also include a 15-day forecast in the plot.

In [None]:
# defining 1st order DeterministicProcess to fit linear regression line
dp = DeterministicProcess(index=df.date.dt.to_period(), order=1, drop=True)
X = dp.in_sample()
X_fore = dp.out_of_sample(steps=15) # forecasting next 15 days

# Weekly Dummies
X["day_of_week"] = X.index.dayofweek.astype(str)
X = pd.get_dummies(X, drop_first=True)
X_fore["day_of_week"] = X_fore.index.dayofweek.astype(str)
X_fore = pd.get_dummies(X_fore, drop_first=True)

# linear regression
model = LinearRegression()
model.fit(X, df.loc[:,"congestion"])

y_pred = pd.Series(model.predict(X), index=X.index)
y_fore = pd.Series(model.predict(X_fore), index=X_fore.index)

# Plot
sns.set(style="whitegrid")
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot("date", "congestion", data=df, color='0.5', ls='--', marker='o')
ax.plot(y_pred.index, y_pred.values, lw=2, color="C0")
ax.plot(y_fore.index, y_fore.values, lw=2, color="C3")
ax.set_title('Average Traffic Congestion per Day');
plt.show()
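
To show what the trend and dummy columns capture, the sketch below fits the same kind of design matrix to a made-up series with a known slope and weekend dip. It uses `np.linalg.lstsq` as a stand-in for `LinearRegression`, and all values are assumptions for illustration:

```python
import numpy as np
import pandas as pd

days = pd.period_range("1991-04-01", periods=28, freq="D")
t = np.arange(1, len(days) + 1)          # time-step feature (first-order trend)
is_weekend = days.dayofweek >= 5         # Saturday = 5, Sunday = 6

# Made-up series: linear trend plus a fixed weekend dip
y = 50 + 0.5 * t - 8 * is_weekend

# Constant, trend, and one-hot weekday columns (Monday dropped as baseline)
X = pd.DataFrame({"const": 1.0, "trend": t}, index=days)
dow = pd.Series(days.dayofweek.astype(str), index=days)
dummies = pd.get_dummies(dow, prefix="day_of_week", drop_first=True)
X = pd.concat([X, dummies.astype(float)], axis=1)

# Ordinary least squares on the design matrix
coef, *_ = np.linalg.lstsq(X.to_numpy(dtype=float), y, rcond=None)
coef = pd.Series(coef, index=X.columns)
print(coef.round(3))
```

Because the synthetic series has no noise, the fitted trend coefficient and the Saturday and Sunday dummies recover the slope and the dip that were put in.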

In the following code cell, I also include annual seasonality. For now, I am not sure whether to use it in the model, since we don't have data for a complete year.

In [None]:
# defining 1st order DeterministicProcess to fit linear regression line
fourier_Q = CalendarFourier(freq="A", order=3)

dp = DeterministicProcess(index=df.date.dt.to_period(),
                          additional_terms=[fourier_Q],
                          order=1, drop=True)
X = dp.in_sample()
X_fore = dp.out_of_sample(steps=15) # forecasting next 15 days

# Weekly Dummies 
X["day_of_week"] = X.index.dayofweek.astype(str)
X = pd.get_dummies(X, drop_first=True)
X_fore["day_of_week"] = X_fore.index.dayofweek.astype(str)
X_fore = pd.get_dummies(X_fore, drop_first=True)

# linear regression
model = LinearRegression()
model.fit(X, df.loc[:,"congestion"])

y_pred = pd.Series(model.predict(X), index=X.index)
y_fore = pd.Series(model.predict(X_fore), index=X_fore.index)

# Plot
sns.set(style="whitegrid")
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot("date", "congestion", data=df, color='0.5', ls='--', marker='o')
ax.plot(y_pred.index, y_pred.values, lw=2, color="C0")
ax.plot(y_fore.index, y_fore.values, lw=2, color="C3")
ax.set_title('Average Traffic Congestion per Day');
plt.show()

# Holidays

Using Python's holidays module, we can look up holiday dates for different countries. I took the 1991 U.S. holidays and used them as a feature alongside a first-order trend to see whether they affect traffic congestion. In the plot, the fitted line drops on holidays, which suggests there is less traffic on those days.

In [None]:
all_holidays = holidays.US(years=1991)
all_holidays_list = list(all_holidays.keys()) # taking days into list
print("All holidays in 1991 in US:")
pd.DataFrame.from_dict(all_holidays, orient="index", columns=["Holiday"])

In [None]:
dp = DeterministicProcess(index=df.date.dt.to_period(), order=1, drop=True)
X = dp.in_sample()

# CREATING HOLIDAY FEATURE
X["holiday"] = 0
X.loc[X.index.to_timestamp().isin(all_holidays_list),"holiday"] = 1

model = LinearRegression() # linear regression
model.fit(X, df.loc[:,"congestion"])
y_pred = pd.Series(model.predict(X), index=X.index)

# Plot
sns.set(style="whitegrid")
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot("date", "congestion", data=df, color='0.5', ls='--', marker='o')
ax.plot(y_pred.index, y_pred.values, lw=2, color="C0")
ax.plot_date(X.loc[X.holiday==1].index, y_pred[X.loc[X.holiday==1].index], color="C3", ms=8)
ax.set_title('Average Traffic Congestion per Day');
plt.show()
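
The holiday flag can be built by comparing timestamps with timestamps, which does not depend on how pandas matches `datetime.date` objects. In this sketch the two dates are hard-coded for illustration only (the notebook takes them from `holidays.US`), and `X_demo` is a hypothetical name:

```python
import datetime as dt
import pandas as pd

# Hard-coded for illustration; the notebook reads these from holidays.US
holiday_dates = [dt.date(1991, 7, 4), dt.date(1991, 9, 2)]
holiday_ts = pd.to_datetime(holiday_dates)   # convert to a DatetimeIndex

# One week of daily periods around 4 July
days = pd.period_range("1991-07-01", "1991-07-07", freq="D")
X_demo = pd.DataFrame(index=days)

# 1 on holidays, 0 otherwise
X_demo["holiday"] = days.to_timestamp().isin(holiday_ts).astype(int)
print(X_demo)
```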

# Hourly Congestion

In this section, I will investigate the effect of time on traffic congestion.

In [None]:
# creating column for hour:minute
df = df_train.copy()
df["hr_mn"] = df["time"].dt.time

# grouping by hour:minute
df = df.groupby("hr_mn")["congestion"].mean()  # select the column first so non-numeric columns are not averaged
df = pd.DataFrame(df).reset_index()

# Plot
sns.set(style="whitegrid")
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df["hr_mn"].astype(str), df["congestion"], color='0.5', ls='--', marker='o')
ax.tick_params(axis="x", labelrotation=90)
ax.set_title('Average Traffic Congestion');
plt.show()

The previous graph shows that there is more traffic in the afternoon than in the morning. I will try to fit a regression line using high-order daily Fourier components.
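
Fourier components are pairs of sine and cosine curves whose frequencies are whole multiples of the base cycle. The sketch below builds them by hand with NumPy to show the idea; `fourier_features` is a hypothetical helper, and `CalendarFourier` produces comparable columns from the calendar position of each time stamp.

```python
import numpy as np

def fourier_features(step, period, order):
    """Return sin/cos pairs for harmonics 1..order of a cycle of `period` steps."""
    cols = []
    for k in range(1, order + 1):
        angle = 2 * np.pi * k * step / period
        cols.append(np.sin(angle))
        cols.append(np.cos(angle))
    return np.column_stack(cols)

steps_per_day = 24 * 3                  # number of 20-minute steps in one day
step = np.arange(2 * steps_per_day)     # two days of consecutive steps
F = fourier_features(step, steps_per_day, order=4)
print(F.shape)
```

Each row repeats exactly one day later, so a linear model on these columns can only learn a pattern that is the same every day. A higher order adds faster-varying curves and lets the fitted line follow sharper features such as rush-hour peaks.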

In [None]:
# creating column for hour:minute
df = df_train.copy()
df["hr_mn"] = df["time"].dt.time
df["week"] = df["time"].dt.isocalendar().week #week number

# grouping by week and time stamp (mean over all roads)
df = df.groupby(["week","time"])["congestion"].mean()
df = pd.DataFrame(df).reset_index()
print("Dataset contains information for weeks: {}".format(df.week.unique()))

In [None]:
# Looking at week 14
week_no = 14
df2 = df[df.week == week_no]

# daily fourier components
fourier = CalendarFourier(freq="D", order=24)
dp = DeterministicProcess(index=df2.time.dt.to_period("20min"),
                          additional_terms=[fourier],
                          order=1, drop=True)
X = dp.in_sample()
X_fore = dp.out_of_sample(steps=24*3) # forecasting for 1 day

# linear regression
model = LinearRegression()
model.fit(X, df2.loc[:,"congestion"])
y_pred = pd.Series(model.predict(X), index=X.index)
y_fore = pd.Series(model.predict(X_fore), index=X_fore.index)

# Plot
sns.set(style="whitegrid")
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df2["time"], df2["congestion"], color='0.5', ls='--', marker='o')
ax.plot(y_pred.index.to_timestamp(), y_pred.values, lw=3, color="C0")
ax.plot(y_fore.index.to_timestamp(), y_fore.values, lw=3, color="C3")
ax.set_title('Average Traffic Congestion per Day (Week {})'.format(week_no));
ax.tick_params(axis="x", labelrotation=90)
plt.show()

In [None]:
# Looking at week 30
week_no = 30
df2 = df[df.week == week_no]

# daily fourier components
fourier = CalendarFourier(freq="D", order=24)
dp = DeterministicProcess(index=df2.time.dt.to_period("20min"),
                          additional_terms=[fourier],
                          order=1, drop=True)
X = dp.in_sample()
X_fore = dp.out_of_sample(steps=24*3) # forecasting for 1 day

# linear regression
model = LinearRegression()
model.fit(X, df2.loc[:,"congestion"])
y_pred = pd.Series(model.predict(X), index=X.index)
y_fore = pd.Series(model.predict(X_fore), index=X_fore.index)

# Plot
sns.set(style="whitegrid")
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df2["time"], df2["congestion"], color='0.5', ls='--', marker='o')
ax.plot(y_pred.index.to_timestamp(), y_pred.values, lw=3, color="C0")
ax.plot(y_fore.index.to_timestamp(), y_fore.values, lw=3, color="C3")
ax.set_title('Average Traffic Congestion per Day (Week {})'.format(week_no));
ax.tick_params(axis="x", labelrotation=90)
plt.show()

In [None]:
# Looking at week 38
week_no = 38
df2 = df[df.week == week_no]

# daily fourier components
fourier = CalendarFourier(freq="D", order=24)
dp = DeterministicProcess(index=df2.time.dt.to_period("20min"),
                          additional_terms=[fourier],
                          order=1, drop=True)
X = dp.in_sample()
X_fore = dp.out_of_sample(steps=24*3) # forecasting for 1 day

# linear regression
model = LinearRegression()
model.fit(X, df2.loc[:,"congestion"])
y_pred = pd.Series(model.predict(X), index=X.index)
y_fore = pd.Series(model.predict(X_fore), index=X_fore.index)

# Plot
sns.set(style="whitegrid")
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(df2["time"], df2["congestion"], color='0.5', ls='--', marker='o')
ax.plot(y_pred.index.to_timestamp(), y_pred.values, lw=3, color="C0")
ax.plot(y_fore.index.to_timestamp(), y_fore.values, lw=3, color="C3")
ax.set_title('Average Traffic Congestion per Day (Week {})'.format(week_no));
ax.tick_params(axis="x", labelrotation=90)
plt.show()

# Linear Regression Model

In this section, I will build a linear regression model to predict traffic congestion. Based on the previous sections, it will use the following features:
* Dummy variable for week day
* Holiday Feature
* Daily fourier components

In [None]:
# Preparing DataFrame for training
df = df_train.copy()
df["time"] = df.time.dt.to_period("20min")
df = df.set_index(["time", "x", "y", "direction"]).sort_index()

y = df.unstack(["x", "y", "direction"])
y.head(3)
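
The `unstack` call turns the long table (one row per time stamp and road) into a wide one (one row per time stamp, one column per road), so that a single regression can fit all roads at once. A minimal sketch with made-up values for two roads and two time stamps:

```python
import pandas as pd

# Long format: one row per (time, road); values are made up
long_df = pd.DataFrame({
    "time": pd.to_datetime(["1991-04-01 00:00"] * 2 + ["1991-04-01 00:20"] * 2),
    "x": [0, 0, 0, 0],
    "y": [0, 1, 0, 1],
    "direction": ["EB", "NB", "EB", "NB"],
    "congestion": [70, 49, 24, 60],
})

# Wide format: rows are time stamps, columns are (congestion, x, y, direction)
wide = (long_df.set_index(["time", "x", "y", "direction"])
               .sort_index()
               .unstack(["x", "y", "direction"]))

# stack reverses the operation and returns to one row per (time, road)
back = wide.stack(["x", "y", "direction"])
print(wide)
```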

In [None]:
# daily fourier components
fourier = CalendarFourier(freq="D", order=24)
dp = DeterministicProcess(
    index=y.index,
    constant=True,
    additional_terms=[fourier],
    drop=True)

X = dp.in_sample()
X_test = dp.out_of_sample(steps=12*3) #forecasting next 12 hours for every 20 minutes

# Weekly Dummies
X["day_of_week"] = X.index.dayofweek.astype(str)
X = pd.get_dummies(X, drop_first=True)
X_test["day_of_week"] = X_test.index.dayofweek.astype(str)
X_test = pd.get_dummies(X_test, drop_first=True)

# Since test set is only on Monday, there will be missing columns for other days
# Creating these missing column in the test set
miss_dummies = list(set(X.columns) - set(X_test.columns))
X_test[miss_dummies] = 0
X_test = X_test[X.columns] #reordering columns

# Holiday Feature
X["holiday"] = 0
X.loc[X.index.to_timestamp().isin(all_holidays_list),"holiday"] = 1
X_test["holiday"] = 0
X_test.loc[X_test.index.to_timestamp().isin(all_holidays_list),"holiday"] = 1

X.head(3)
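
Aligning the test columns with the training columns can also be done in one step with `reindex`. The sketch below uses made-up weekday labels (`X_tr` and `X_te` are hypothetical names) and reproduces the situation where the test set contains only Mondays:

```python
import pandas as pd

train_days = pd.Series(["0", "1", "2", "3", "4", "5", "6"], name="day_of_week")
test_days = pd.Series(["0", "0", "0"], name="day_of_week")   # Monday only

X_tr = pd.get_dummies(train_days, prefix="day_of_week", drop_first=True)
X_te = pd.get_dummies(test_days, prefix="day_of_week", drop_first=True)

# Add the columns that are missing in the test set (filled with 0)
# and put all columns in the training order
X_te = X_te.reindex(columns=X_tr.columns, fill_value=0)
print(X_te)
```

With only one weekday present and `drop_first=True`, `get_dummies` returns no dummy columns at all for the test set, which is why the alignment step is needed before predicting.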

In [None]:
# Linear Regression Model
model = LinearRegression(fit_intercept=False)
model.fit(X, y)

# Predicting Test set
y_submit = pd.DataFrame(model.predict(X_test), index=X_test.index, columns=y.columns)
y_submit = y_submit.stack(["x", "y", "direction"])

# Bringing df_test into the same format as y_submit
# so that I can concatenate row_id and congestion
df_tst = df_test.copy()
df_tst = df_tst.reset_index()

df_tst["time"] = df_tst.time.dt.to_period("20min")
df_tst = df_tst.set_index(["time", "x", "y", "direction"]).sort_index()

# concating y_submit and df_tst
y_submit = pd.concat([df_tst, y_submit], axis=1)
y_submit.head()

# Boosted Hybrid Model

In this section, I will build a boosted hybrid model by stacking Linear Regression and XGBoost. I first train the linear regression model and calculate its residuals, then use XGBoost to fit these residuals. I also use the predictions of the first layer as a lag feature for the second layer. My features are:

Linear Regression:
* Daily fourier components (high order)
* Dummy variable for week day
* Holiday (boolean)

XGBoost:
* Lag feature
* Dummy variable for week day
* Hour
* Month
* Afternoon (boolean)
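
The two-stage idea can be written compactly. The notation is my own shorthand, not taken from the course: $f_1$ is the linear regression on features $X_1$, $f_2$ is XGBoost on features $X_2$, and $r$ is the residual left over by the first stage.

$$
\begin{aligned}
\hat{y}_1 &= f_1(X_1) && \text{(fit } f_1 \text{ on } y\text{)} \\
r &= y - \hat{y}_1 && \text{(residuals of the first stage)} \\
\hat{r} &= f_2(X_2) && \text{(fit } f_2 \text{ on } r\text{)} \\
\hat{y} &= \hat{y}_1 + \hat{r} && \text{(final prediction)}
\end{aligned}
$$

The linear model can extrapolate the periodic and calendar structure, while the tree model only has to learn what the first stage could not explain.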

In [None]:
# definition of boosted hybrid model
class BoostedHybrid:
    def __init__(self, model_1, model_2):
        self.model_1 = model_1
        self.model_2 = model_2
        self.y_columns = None  # to store column names
        
    def fit(self, X_1, X_2, y):
        #training first model and calculation residuals
        self.model_1.fit(X_1, y)
        y_fit = pd.DataFrame(self.model_1.predict(X_1),
                             index=X_1.index, columns=y.columns)
        y_resid = y - y_fit
        y_resid = y_resid.stack(["x", "y", "direction"]).squeeze() # wide to long
        
        # training second model on residuals
        self.model_2.fit(X_2, y_resid)
        self.y_columns = y.columns # saving for predict method
        
    def predict(self, X_1, X_2):
        # predicting with first model
        y_pred = pd.DataFrame(self.model_1.predict(X_1),
                              index=X_1.index, columns=self.y_columns)
        y_pred = y_pred.stack(["x", "y", "direction"]).squeeze()  # wide to long
        
        # predicting with second model and taking sum
        y_pred += self.model_2.predict(X_2)
        return y_pred
    
    def model1_fit_predict(self, X_1, X_1_test, y):
        self.model_1.fit(X_1, y)
        y_pred = pd.DataFrame(self.model_1.predict(X_1_test),
                              index=X_1_test.index, columns=y.columns)
        return y_pred.stack(["x", "y", "direction"])

# BoostedHybrid instance
model = BoostedHybrid(model_1 = LinearRegression(),
                      model_2 = XGBRegressor(random_state=0))

In [None]:
# Preparing data for the first model
df = df_train.copy()
df["time"] = df.time.dt.to_period("20min")
df = df.set_index(["time", "x", "y", "direction"]).sort_index()

y = df.unstack(["x", "y", "direction"])

# Same X for linear regression as the previous section:
fourier = CalendarFourier(freq="D", order=24)
dp = DeterministicProcess(
    index=y.index,
    constant=True,
    additional_terms=[fourier],
    drop=True)
X_1 = dp.in_sample()
X_test_1 = dp.out_of_sample(steps=12*3) #forecasting next 12 hours for every 20 minutes

# Weekly Dummies
X_1["day_of_week"] = X_1.index.dayofweek.astype(str)
X_1 = pd.get_dummies(X_1, drop_first=True)
X_test_1["day_of_week"] = X_test_1.index.dayofweek.astype(str)
X_test_1 = pd.get_dummies(X_test_1, drop_first=True)

# Since test set is only on Monday, there will be missing columns for other days
# Creating these missing column in the test set
miss_dummies = list(set(X_1.columns) - set(X_test_1.columns))
X_test_1[miss_dummies] = 0
X_test_1 = X_test_1[X_1.columns] #reordering columns

# Holiday Feature
X_1["holiday"] = 0
X_1.loc[X_1.index.to_timestamp().isin(all_holidays_list),"holiday"] = 1
X_test_1["holiday"] = 0
X_test_1.loc[X_test_1.index.to_timestamp().isin(all_holidays_list),"holiday"] = 1

In [None]:
# Preparing data for the second model
test_len = len(df_test) # length of test set

# using predictions from first layer as lag feature for second layer
y_pred_1 = model.model1_fit_predict(X_1, X_test_1, y) # prediction with linear regression
df_test_copy = df_test.copy()
df_test_copy["congestion"] = y_pred_1.values 

# concat train and test set (to prepare lag features)
df_all = pd.concat([df_train, df_test_copy], axis=0)
df_all["time"] = df_all.time.dt.to_period("20min")
df_all = df_all.set_index(["time", "x", "y", "direction"]).sort_index()
df_all = df_all.unstack(["x", "y", "direction"])

##### LAG FEATURES:
lag_1 = df_all.shift(1) # 1 step lag (20 min)
lag_1_roll = lag_1.rolling(6).mean() # rolling mean of 1 step lag
#lag_1week = df_all.shift(7*24*3) # 1 week lag
# renaming columns
lag_1.columns = lag_1.columns.set_levels(["lag_1"],level=0)
lag_1_roll.columns = lag_1_roll.columns.set_levels(["lag_1_roll"],level=0)
#lag_1week.columns = lag_1week.columns.set_levels(["lag_1week"],level=0)

# concat all lags
df_all = pd.concat([df_all, lag_1, lag_1_roll], axis=1)
df_all = df_all.stack(["x", "y", "direction"]).reset_index(["x", "y", "direction"])

# dropping congestion, day of week & hour feature, get dummies:
df_all = df_all.drop(["congestion"], axis=1)
df_all = df_all.dropna()
df_all["day_of_week"] = df_all.index.dayofweek.astype(str)
df_all["hour"] = df_all.index.hour.astype('float64') # float, because fractional hours are added below
df_all["month"] = df_all.index.month.astype('int64')
df_all.loc[df_all.index.minute == 20, "hour"] += 0.33 # encoding minute in hour
df_all.loc[df_all.index.minute == 40, "hour"] += 0.67
df_all["is_afternoon"] = 0
# 11:20 and 11:40 are encoded as 11.33 and 11.67, so compare against 12 instead of 11
df_all.loc[df_all.hour>=12, "is_afternoon"] = 1
df_all = pd.get_dummies(df_all)

# split train and test set
X_2 = df_all.head(len(df_all) - test_len)
X_test_2 = df_all.tail(test_len)

# since I dropped the rows with missing lags in X_2, I have to drop the same time steps in X_1 and y:
X_2_wide_len = int(len(X_2)/65)
X_1 = X_1.tail(X_2_wide_len)
y = y.tail(X_2_wide_len)
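
The number of dropped rows follows from how the lags are built. A small sketch on a made-up series for a single road shows that a 1-step lag followed by a 6-step rolling mean leaves the first six time steps without a complete feature row:

```python
import pandas as pd

s = pd.Series(range(10), dtype="float64")   # stand-in for one road's congestion
lag_1 = s.shift(1)                          # value of the previous step
lag_1_roll = lag_1.rolling(6).mean()        # mean of the six values before each step

features = pd.DataFrame({"lag_1": lag_1, "lag_1_roll": lag_1_roll})

# Rows that dropna() would remove because a lag feature is still undefined
n_dropped = len(features) - len(features.dropna())
print(features)
print(n_dropped)
```

Because the rolling mean is taken over the lagged series, both features use only values from strictly earlier time steps, so the current congestion never leaks into its own features.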

In [None]:
model.fit(X_1, X_2, y)
y_submit = model.predict(X_test_1, X_test_2)

# Bringing df_test into the same format as y_submit
# so that I can concatenate row_id and congestion
df_tst = df_test.copy()
df_tst = df_tst.reset_index()

df_tst["time"] = df_tst.time.dt.to_period("20min")
df_tst = df_tst.set_index(["time", "x", "y", "direction"]).sort_index()

# concating y_submit and df_tst
y_submit = pd.concat([df_tst, y_submit], axis=1)
y_submit.head()

In [None]:
y_submit.to_csv('submission.csv', index=False)