# Forecasting Strategies

In this notebook I try out some different forecasting techniques:

- A direct forecasting strategy, where we make a 36-step (12 hour) forecast from multiple forecasting origins on the day of the test set. A different regressor is trained for each forecasting step. For example, one regressor is trained to predict 1 step ahead, another is trained to forecast 2 steps ahead, and so on. I decide that this approach is probably not necessary for this competition, but it shows some promise.
- A direct-recursive (DirRec) forecasting strategy. This is a direct approach as before, but with the congestion predictions from previous steps used as additional lag features for the model of the current step. I show that a recursive strategy will likely not work well in this competition, as the scores for steps above 1 or 2 suffer from error propagation.


Before you read:

- This is my first time implementing these techniques. I would probably do it slightly differently if implementing them again.
- The score for this notebook comes from the normal forecasting technique: a 0-step-ahead forecast.
- Model parameters were chosen at random.

References:
- These techniques are based on the Kaggle Time Series Forecasting Course https://www.kaggle.com/learn/time-series
- Clipping the final predictions to be in some range: https://www.kaggle.com/code/ambrosm/tpsmar22-generalizing-the-special-values

Additional thoughts on the competition:

- I implemented a SARIMAX model in this competition. I do not believe SARIMAX models are necessary: they are slow and a little tricky to work with. Additionally, seasonal, autoregressive (lag), moving-average and even exogenous terms (e.g. similar roadways) can instead be added as features for more conventional machine learning techniques such as GBDTs. See this comment I found on the Kaggle time series course: https://www.kaggle.com/learn/time-series/discussion/307318

- I do not believe hybrid regressors are necessary for this competition. One of the main advantages of hybrid regressors is that they can extrapolate trends past the training data, where one (less powerful) model fits the trend and another (more powerful) model, e.g. a GBDT, fits the residuals. However, as we are only predicting 12 hours ahead there is very little extrapolation required, so hybrid models don't seem necessary. That being said, I could be wrong; indeed [MARTYNOV ANDREY](https://www.kaggle.com/martynovandrey) has had great results with them.

Improvements to be made:
- 1 day of validation is likely not enough.
- Add lag and moving average features. The ability to use these features is one of the main advantages of direct forecasting.
- For the direct strategy, rather than forecasting 36 steps ahead from every possible origin on the test day, I probably should have used only forecasting origins from the morning and kept only the predicted afternoon congestion values.
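
As a rough sketch of the lag and moving-average features mentioned in the list above (they are not used in this notebook), the toy frame below adds one lag and one rolling mean per roadway. The column names `lag_1` and `ma_3` and all values are assumptions made up for illustration.

```python
import pandas as pd

# Toy data: two roadways, five 20-minute observations each (values are made up)
toy = pd.DataFrame({
    "time": list(pd.date_range("1991-04-01", periods=5, freq="20min")) * 2,
    "roadway": ["00EB"] * 5 + ["00NB"] * 5,
    "congestion": [10, 20, 30, 40, 50, 5, 5, 5, 5, 5],
}).sort_values(["roadway", "time"]).reset_index(drop=True)

grouped = toy.groupby("roadway")["congestion"]

# Lag feature: the previous observation of the same roadway
toy["lag_1"] = grouped.shift(1)

# Moving average of the three previous observations; shifting first keeps
# the current value out of its own feature
toy["ma_3"] = grouped.transform(lambda s: s.shift(1).rolling(3).mean())
```

Grouping by roadway before shifting stops one roadway's values from leaking into the first rows of the next.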

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_absolute_error

from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf

from lightgbm import LGBMRegressor

In [None]:
train_df = pd.read_csv("/kaggle/input/tabular-playground-series-mar-2022/train.csv", index_col='row_id', parse_dates=['time'])
test_df = pd.read_csv("/kaggle/input/tabular-playground-series-mar-2022/test.csv", index_col='row_id', parse_dates=['time'])

In [None]:
train_df["roadway"] = train_df["x"].astype(str) + train_df["y"].astype(str) + train_df["direction"]
test_df["roadway"] = test_df["x"].astype(str) + test_df["y"].astype(str) + test_df["direction"]
train_df.drop(columns=["x","y","direction"], inplace=True)
test_df.drop(columns=["x","y","direction"], inplace=True)

There are no missing (NaN) values, but some timestamps are missing from the data. The function below fills them in.
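
To make the idea concrete, here is a minimal sketch on a made-up single-roadway series: `asfreq` turns an absent timestamp into an explicit NaN row, which can then be filled. The plain series median used here is only a stand-in for the grouped median (time of day, day of week, roadway) used in this notebook.

```python
import pandas as pd

# Toy series with the 00:40 observation missing (values are made up)
times = pd.to_datetime(["1991-04-01 00:00", "1991-04-01 00:20", "1991-04-01 01:00"])
s = pd.Series([30.0, 32.0, 35.0], index=times)

# asfreq inserts the absent timestamp as an explicit NaN row
regular = s.asfreq("20min")

# Stand-in for the grouped median: fill with the median of the whole series
filled = regular.fillna(regular.median())
```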

In [None]:
def fill_missing_times():
    """
    Filling times is a little more complicated as times are not unique (there are multiple matching time entries, one for each roadway).
    
    We fill missing values with the median for each time of day, day of week and roadway.
    
    There are no missing values on Mondays, so provided we only use Mondays as validation we will not introduce data leakage from the median calculations.
    """
    #Introduce the missing times
    print("Total times (original):", len(train_df["time"].unique()))
    new_df = train_df.set_index("time").pivot(columns="roadway").asfreq('20Min').stack(dropna=False).reset_index()
    print("Total times (new):", len(new_df["time"].unique()))
    print("Missing values to fill:",new_df.isnull().sum().sum())
    
    #Create temporary features required for median calculations
    new_df["minute"] = new_df['time'].dt.hour * 60 + new_df['time'].dt.minute
    new_df["dayofweek"] = new_df['time'].dt.dayofweek
    
    #fill missing values with median
    fill_df = new_df.groupby(["minute","roadway","dayofweek"])["congestion"].median().rename("median_congestion").reset_index()
    new_df = new_df.merge(fill_df, on=["minute", "roadway","dayofweek"])
    new_df = new_df.fillna({"congestion": new_df["median_congestion"]})
    print("Missing values remaining:",new_df.isnull().sum().sum())
    
    # return train_df to original format
    new_df = new_df.drop(columns=["minute","dayofweek","median_congestion"]) 
    new_df = new_df.sort_values(["time","roadway"]).reset_index(drop=True)
    return new_df


In [None]:
train_df = fill_missing_times()
display(train_df)

In [None]:
def add_features(df):
    new_df = df.copy()
    new_df['minutes'] = df['time'].dt.hour * 60 + df['time'].dt.minute
    new_df['dayofweek'] = df['time'].dt.dayofweek.astype("category")
    new_df['weekend'] = (df['time'].dt.dayofweek >= 5).astype(int).astype("category")
    
    
    new_df['date'] = df['time'].dt.date # Drop later, just makes processing a bit easier
    
    return new_df

In [None]:
# These features don't work well with multioutput forecasting
def descriptive_features(df, limit_date = datetime.date(1991, 9, 30)):
    """ Set limit_date to avoid data leakage, we only calculate descriptive statistics on congestion values BEFORE the limit date
    
    Example: a limit_date of datetime.date(1991, 9, 23) calculates the statistics on all times before 1991-09-23 12:00
    
    default = all available times
    """
    new_df = df.copy()
    
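    # Note: the statistics are computed from the global train_df_2 (which needs 'date' and 'minutes' columns), not from df
    limit_df = train_df_2[(train_df_2.date < limit_date) | ( (train_df_2.date == limit_date) & (train_df_2.minutes < 720))]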
    print("Descriptive features added using calculations done on all times before and including:",limit_df.time.max())
    
    median_df = limit_df.groupby(["minutes","roadway","dayofweek"])["congestion"].median().rename("median_congestion").reset_index()
    new_df = new_df.merge(median_df, on=["minutes", "roadway","dayofweek"])
    
    mean_df = limit_df.groupby(["minutes","roadway","dayofweek"])["congestion"].mean().rename("mean_congestion").reset_index()
    new_df = new_df.merge(mean_df, on=["minutes", "roadway","dayofweek"])
    
    stddev_df = limit_df.groupby(["minutes","roadway","dayofweek"])["congestion"].std().rename("std_congestion").reset_index()
    new_df = new_df.merge(stddev_df, on=["minutes", "roadway","dayofweek"])
    
    #Restore original order
    new_df = new_df.sort_values(["time","roadway"]).reset_index(drop=True)

    return new_df

In [None]:
def make_cyclic(df, plot=True):
    new_df = df.copy()
    
    new_df["minutes_sin"] = np.sin(new_df['minutes'] * (2 * np.pi / 1440)) # There's 1440 minutes in a day
    new_df["minutes_cos"] = np.cos(new_df['minutes'] * (2 * np.pi / 1440))
    #new_df = new_df.drop(columns=["minutes"])
    if plot:
        f,ax = plt.subplots(figsize=(20,5))
        plt.subplot(1,3,1)
        sns.scatterplot(data = new_df.sample(1000), x="minutes_sin", y="minutes_cos")
        plt.subplot(1,3,2)
        sns.lineplot(data = new_df.sample(1000), x="minutes", y="minutes_sin")
        plt.subplot(1,3,3)
        sns.lineplot(data = new_df.sample(1000), x="minutes", y="minutes_cos")
    
    return new_df

In [None]:
y_full = train_df.congestion
y_pivot = train_df.set_index("time").pivot(columns = "roadway")
display(y_full.head(2))
display(y_pivot.head(2))

In [None]:
train_df_2 = add_features(train_df)
train_df_2 = descriptive_features(train_df_2, limit_date = datetime.date(1991, 9, 23))
train_df_2 = make_cyclic(train_df_2)

In [None]:
X_train = train_df_2[(train_df_2.date < datetime.date(1991, 9, 23)) | ((train_df_2.date == datetime.date(1991, 9, 23)) & (train_df_2["minutes"] < 12*60))] # I dont want to use future values in the training data
X_train = X_train.drop(columns=["date","congestion","minutes"])
y_train = y_full.loc[y_full.index.isin(X_train.index)]
X_train = X_train.set_index("time")


X_train_pivot = X_train.pivot(columns="roadway")
y_train_pivot = y_pivot.loc[y_pivot.index.isin(X_train_pivot.index)]

#Number the roadways:
enc = OrdinalEncoder()
X_train['roadway'] = enc.fit_transform(X_train[['roadway']])
X_train['roadway'] = X_train["roadway"].astype(int)
display(X_train.head(2))
display(y_train.head(2))

In [None]:
def make_multistep_target(ts, steps):
    return pd.concat(
        {f'y_step_{i}': ts.shift(-i)
         for i in range(steps+1)},
        axis=1)

We make multistep targets: for every time and every roadway we create a y_step_i column that holds the congestion value i steps into the future (20-minute intervals). For example, y_step_1 for roadway 01EB at time 1991-04-01 00:00:00 contains the known congestion value at 1991-04-01 00:20:00.
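
A small illustration of these targets on a made-up four-point series for a single roadway. The helper mirrors `make_multistep_target` from the cell above; the values and the choice of `steps=2` are assumptions for readability.

```python
import pandas as pd

def make_multistep_target(ts, steps):
    # Column y_step_i holds the value i steps ahead of each row's time
    return pd.concat({f"y_step_{i}": ts.shift(-i) for i in range(steps + 1)}, axis=1)

# Toy single-roadway series at 20-minute intervals (values are made up)
ts = pd.Series(
    [10, 20, 30, 40],
    index=pd.date_range("1991-04-01 00:00", periods=4, freq="20min"),
    name="congestion",
)

targets = make_multistep_target(ts, steps=2)
```

The row for 00:00 holds 10, 20 and 30 (the values at 00:00, 00:20 and 00:40), while the last rows contain NaN wherever the future value is not yet known.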

The idea is that by having separate columns we can use multiple-output regressors to train a different regressor for each future step. For example, at step 16 we use the X training data (containing hour/minute, day of week, moving average features, lagged features etc.) from **step 0** to train a regressor that only attempts to **predict step 16** (16 20-minute time steps into the future).

Note that at the end of the training data (the last 12 hours) we don't have the full set of congestion values 1,...,36 steps into the future, so we have to remove these rows from the targets. This means we also have to remove the corresponding X rows, so we lose 12 hours' worth of training data. When I first implemented this I thought it would be an issue, as we would lose AR (lag) and MA (moving average) data from the morning of Monday 1991-09-23. It shouldn't be: we can just extend the validation data back to start at 1991-09-23 00:00:00, so we can still use lags and MA terms. Note that we don't even lose the 1991-09-23 morning congestion data from training, e.g. the congestion at 11:40 on 1991-09-23 is contained in y_step_36 of 23:40 on 1991-09-22.
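
The cutoff arithmetic can be checked with a short sketch. The constants `STEP` and `HORIZON` and the toy frames are assumptions for illustration; only the timestamps come from this notebook.

```python
import pandas as pd

STEP = pd.Timedelta(minutes=20)
HORIZON = 36  # 36 steps of 20 minutes = 12 hours

last_train_time = pd.Timestamp("1991-09-23 11:40:00")

# Last origin whose full 36-step target still lies inside the training data
last_origin = last_train_time - HORIZON * STEP

# Toy feature and target objects sharing one time index
idx = pd.date_range("1991-09-22 00:00", last_train_time, freq="20min")
X = pd.DataFrame({"minutes": (idx.hour * 60 + idx.minute).to_numpy()}, index=idx)
y = pd.Series(range(len(idx)), index=idx, dtype="float")
y_36 = y.shift(-HORIZON).dropna()

# Keep only the feature rows that still have a complete target
X_trimmed = X.loc[X.index <= last_origin]
```

Here `last_origin` comes out as 1991-09-22 23:40:00, the same cutoff used when trimming `X_train` in the next cell.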

In [None]:
y_train = make_multistep_target(y_train_pivot, steps=36).dropna()
y_train = y_train.stack("roadway")
display(y_train)
X_train = X_train.loc[X_train.index <= '1991-09-22 23:40:00']
X_train = X_train.set_index("roadway", append=True, drop=False)
X_train["roadway"] = X_train["roadway"].astype("category") #Roadways are not ordered
display(X_train)

In [None]:
%%time
from sklearn.multioutput import MultiOutputRegressor

model_lgbm = MultiOutputRegressor(LGBMRegressor(random_state=1, learning_rate=0.05, n_estimators=800, n_jobs=-1))
model_lgbm.fit(X_train, y_train)

Now we have to decide which predictions to use. For a prediction at 12:20 on 1991-09-23, do we use y_step_1 of 12:00, y_step_2 of 11:40, or y_step_3 of 11:20? Or do we take the average of all of these predictions? I am not sure which is best.

This has implications for the validation features (X_val): we have to make predictions from all of the 23rd September, not just the afternoon.

We can also use afternoon times as forecasting origins (e.g. 16:00 to forecast 23:00). However, this is slightly problematic as we don't have the congestion values for 12:00 - 16:00, which means some lags and moving averages could not be used as features. So we may want to use only the times before 12:00 as origins. For now we include them all.
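
A minimal sketch of the choice described above: it lists every (origin, step) pair whose forecast lands on 12:20, then keeps only the morning origins. The names `pairs`, `morning_pairs` and `cutoff` are hypothetical and not used elsewhere in this notebook.

```python
import pandas as pd

STEP = pd.Timedelta(minutes=20)
target = pd.Timestamp("1991-09-23 12:20:00")
cutoff = pd.Timestamp("1991-09-23 12:00:00")  # congestion is only known before this time

# Every (origin, step) pair whose forecast lands on the target time
pairs = [(target - i * STEP, i) for i in range(37)]

# Restricting to morning origins keeps lag-type features computable
morning_pairs = [(origin, i) for origin, i in pairs if origin < cutoff]
```

With the morning restriction, steps 0 and 1 drop out for this target and the nearest usable forecast is y_step_2 from 11:40.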

In [None]:
X_val = train_df_2.loc[train_df_2["date"] == datetime.date(1991,9,23)]
X_val = X_val.set_index("time").drop(columns=["date","congestion","minutes"])
X_val = X_val.set_index("roadway", append=True, drop=False)
X_val["roadway"] = enc.transform(X_val[["roadway"]]).astype(int)
X_val["roadway"] = X_val["roadway"].astype("category")
#display(X_val)
#The validation is the congestion values on the afternoon of September 23rd
y_val_index = train_df_2.loc[(train_df_2["date"] == datetime.date(1991,9,23)) & (train_df_2["minutes"] >= 12*60)].index
y_val = y_full.loc[y_val_index]

In [None]:
preds_lgbm = model_lgbm.predict(X_val)
preds_lgbm_df = pd.DataFrame(preds_lgbm, columns = y_train.columns, index = X_val.index)
preds_lgbm_df = preds_lgbm_df.reset_index(level="roadway")
display(preds_lgbm_df)

To implement this, let's first make an empty dataframe for the target period.
- Each row corresponds to one time period and one roadway that we are predicting congestion values for.
- Each column holds the y_step value that lands on that target time.

Some examples:
- [1991-09-23 12:00:00, y_step_2]  in the new df will be from [1991-09-23 11:20:00, y_step_2] of the old dataframe.
- [1991-09-23 12:00:00, y_step_36]  in the new df will be from [1991-09-23 00:00:00, y_step_36] of the old dataframe.
- [1991-09-23 23:40:00, y_step_1] in the new df will be from [1991-09-23 23:20:00, y_step_1] of the old dataframe.
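
The same re-indexing can be sketched for a single roadway with `shift`: moving column y_step_i down by i rows lines each forecast up with its target time. This is an illustrative alternative to the loop-based cells below, not the code this notebook runs, and the toy values are made up.

```python
import pandas as pd

# Toy predictions for one roadway, indexed by forecasting origin (values are made up)
origins = pd.date_range("1991-09-23 11:20", periods=4, freq="20min")
preds = pd.DataFrame(
    {"y_step_0": [1.0, 2.0, 3.0, 4.0],
     "y_step_1": [10.0, 20.0, 30.0, 40.0],
     "y_step_2": [100.0, 200.0, 300.0, 400.0]},
    index=origins,
)

# Shifting column y_step_i down by i rows re-indexes it by target time
by_target = pd.DataFrame({col: preds[col].shift(i) for i, col in enumerate(preds.columns)})

# Keep only the target period
by_target = by_target.loc["1991-09-23 12:00":]
```

With several roadways in one frame, the shift would have to be applied per roadway (for example via `groupby`), since this sketch assumes one row per time.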


In [None]:
#First create the empty dataframe
def empty_preds_df(preds_df, test=False):
    if test == False:
        new_index = preds_df.loc[preds_df.index >= "1991-09-23 12:00:00"].index
    else:
        new_index = preds_df.loc[preds_df.index >= "1991-09-30 12:00:00"].index
    X_final = pd.DataFrame(index = new_index, columns = preds_df.columns)
    if test == False:
        X_final["roadway"] = preds_df.loc[preds_df.index >= "1991-09-23 12:00:00","roadway"]
    else:
        X_final["roadway"] = preds_df.loc[preds_df.index >= "1991-09-30 12:00:00","roadway"]
    return X_final

#X_val_final = empty_preds_df(preds_lgbm_df, test=False)
#display(X_val_final.head(2))

In [None]:
def generate_preds(preds_df, test = False):
    empty_df = empty_preds_df(preds_df, test=test)
    for time in empty_df.index.unique():
        for i in range(0,37):
            corresponding_time = time - (pd.Timedelta('0 days 00:20:00')*i)
            corresponding_values = preds_df.loc[preds_df.index == corresponding_time, "y_step_"+str(i)]
            #print(corresponding_values)
            empty_df.loc[empty_df.index == time, "y_step_"+str(i)] = corresponding_values.values
    return empty_df
    

In [None]:
multioutput_preds = generate_preds(preds_lgbm_df, test=False)
multioutput_preds.set_index("roadway",append=True, inplace=True)
multioutput_preds["mean"] = multioutput_preds.mean(axis=1)
multioutput_preds

**Observations:**
- There is a fair amount of variation between steps
- y_step_0 is very different from the other steps


**Insights**

- Each row is each time period (and each roadway) that we are predicting congestion values for.
- Each column is the y_step model required to reach that target time. 



In [None]:
def clipped(preds):
    #Credit to AmbrosM: https://www.kaggle.com/code/ambrosm/tpsmar22-generalizing-the-special-values
    sep = train_df_2[(train_df_2.time.dt.hour >= 12) & (train_df_2.dayofweek.isin([0,1,2,3,4])) &
            (train_df_2.time.dt.dayofyear >= 246)]
    lower = sep.groupby(['minutes', 'roadway']).congestion.quantile(0.15).values
    upper = sep.groupby(['minutes', 'roadway']).congestion.quantile(0.7).values
    
    clipped_preds = preds["congestion"].clip(lower,upper)
    return clipped_preds

In [None]:
def generate_mae_steps():
    mae_preds = []
    mae_round = []
    mae_clipped = []
    mae_clipped_round = []
    mae_round_clipped = []
    for i in range(0,37):
        preds = multioutput_preds["y_step_"+str(i)]
        mae_preds.append(mean_absolute_error(preds,y_val))
        mae_round.append(mean_absolute_error(preds.round().astype(int),y_val))
        mae_clipped.append(mean_absolute_error(clipped(preds).astype("float"),y_val))
        mae_clipped_round.append(mean_absolute_error(clipped(preds).astype("float").round().astype(int),y_val))
        mae_round_clipped.append(mean_absolute_error(clipped(preds.astype("float").round().astype(int)),y_val))
    new_df = pd.DataFrame({"mae":mae_preds},index=["y_step"+str(i) for i in range(0,37)])
    new_df["mae_round"] =  mae_round
    new_df["mae_clipped"] =  mae_clipped
    new_df["mae_clipped_round"] =  mae_clipped_round
    new_df["mae_round_clipped"] =  mae_round_clipped
    return new_df
generate_mae_steps()

In [None]:
print("MAE mean:", mean_absolute_error(multioutput_preds["mean"],y_val))
print("MAE round:", mean_absolute_error(multioutput_preds["mean"].round().astype(int),y_val))
print("MAE: clip-round ", mean_absolute_error(clipped(multioutput_preds["mean"].astype("float").round().astype(int).rename("congestion").reset_index()),y_val))

**Observations:**
- y_step_0 has the lowest MAE.
- Clipping gives a massive improvement to the score.
- Rounding before clipping is slightly better than clipping before rounding.
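
The two orders can give different results whenever the clipping bounds are not integers. The sketch below uses made-up predictions and bounds; it only shows that the outputs differ, not which order scores better.

```python
import pandas as pd

# Toy predictions and per-row bounds (all values are made up)
preds = pd.Series([33.6, 47.2, 51.7])
lower = pd.Series([34.4, 40.0, 52.3])
upper = pd.Series([40.0, 46.4, 60.0])

# Clip first, then round: always an integer, but it can leave the bounds again
clip_then_round = preds.clip(lower, upper).round()

# Round first, then clip: stays inside the bounds, but may be fractional
round_then_clip = preds.round().clip(lower, upper)
```

For the first row, clipping then rounding gives 34, which sits below the lower bound of 34.4, whereas rounding then clipping returns the bound itself.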

**Insight:**

- Forecasting 0 steps ahead received the best score, but only because of the added descriptive features (e.g. the median for that time, day of week and roadway). If we remove these descriptive features, all forecasts would achieve roughly the same score. This makes sense: after all, how useful is the median congestion for 9:00 on Monday as a feature when we are trying to predict congestion levels at 13:00 (12 steps ahead)? Whereas when predicting the congestion levels at 9:00 (0 steps ahead), having the 9:00 median is a very useful feature. This is the primary disadvantage of this technique and the reason I chose not to use it, although there may be ways around this problem.

**Assessing performance by forecasting origin**

This isn't the only way of assessing performance. We could instead compare the MAE of forecasts made from the same forecasting origin, e.g. when we use X_val data from 11:40 as the origin, should we expect better predictions than when using X_val data from 00:20? Or perhaps it won't make any difference? However, we would have to forecast further than 36 steps ahead to get a full comparison for all origins, so I have not included it here.
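
To see why a full comparison needs more than 36 steps, the sketch below computes what fraction of the afternoon (12:00 to 23:40) each origin can reach within 36 steps. The function name `afternoon_coverage` and the three example origins are hypothetical.

```python
import pandas as pd

STEP = pd.Timedelta(minutes=20)
MAX_STEP = 36
start = pd.Timestamp("1991-09-23 12:00:00")
end = pd.Timestamp("1991-09-23 23:40:00")
n_targets = int((end - start) / STEP) + 1  # 36 afternoon time slots

def afternoon_coverage(origin):
    """Fraction of afternoon slots reachable from `origin` within MAX_STEP steps."""
    targets = [origin + i * STEP for i in range(MAX_STEP + 1)]
    return sum(start <= t <= end for t in targets) / n_targets

coverage = {
    "00:20": afternoon_coverage(pd.Timestamp("1991-09-23 00:20:00")),
    "06:00": afternoon_coverage(pd.Timestamp("1991-09-23 06:00:00")),
    "11:40": afternoon_coverage(pd.Timestamp("1991-09-23 11:40:00")),
}
```

An origin of 00:20 reaches only 12:00 and 12:20, whereas 11:40 reaches every afternoon slot.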

Let's organise the results so that the rows are the target times and the columns are the forecasting origins:

In [None]:
new_index = preds_lgbm_df.loc[preds_lgbm_df.index >= "1991-09-23 12:00:00"].index
X_val_final_forecast_date = pd.DataFrame(index = new_index, columns = ["roadway"] + preds_lgbm_df.index.unique().tolist())
X_val_final_forecast_date["roadway"] = preds_lgbm_df.loc[preds_lgbm_df.index >= "1991-09-23 12:00:00","roadway"]
display(X_val_final_forecast_date.head(2))

In [None]:
# Works but not very clean
def generate_preds_by_date():
    new_df = X_val_final_forecast_date.copy()
    for time in preds_lgbm_df.index.unique():
        steps_to_12 = int((pd.Timestamp('1991-09-23 12:00:00') - time) / pd.Timedelta('0 days 00:20:00'))
        if steps_to_12 >= 0:
            relevant_cols = ["y_step_"+str(i) for i in range(steps_to_12,steps_to_12 + 37)]
            relevant_steps = [i for i in range(steps_to_12,steps_to_12 + 37)]
        else:
            relevant_cols = ["y_step_"+str(i) for i in range(0,36+steps_to_12)]
            relevant_steps = [i for i in range(0,36+steps_to_12)]
        relevant_cols = [i for i in relevant_cols if i in preds_lgbm_df.columns] # Check if real lag value (e.g not y_step_40)
        relevant_steps = [i for i in relevant_steps if i<37]
        for n,col in enumerate(relevant_cols):
            relevant_step = relevant_steps[n]
            selected_steps = preds_lgbm_df.loc[time, col].values # The congestion values for time and step
            selected_steps = [item for sublist in selected_steps for item in sublist]
            #Selecting the correct row to update
            new_df.loc[new_df.index == (relevant_step*pd.Timedelta('0 days 00:20:00') + time), time] = selected_steps
    return new_df

In [None]:
preds_forecast_date = generate_preds_by_date()
#display(preds_forecast_date)
#display(preds_forecast_date.loc[preds_forecast_date["roadway"]=="00EB"])

As of now, there is only one forecasting origin (11:40) that contains predictions for the whole afternoon of 1991-09-23.

In [None]:
print("MAE:", mean_absolute_error(preds_forecast_date.loc[:,pd.Timestamp('1991-09-23 11:40:00')],y_val))
print("MAE round:", mean_absolute_error(preds_forecast_date.loc[:,pd.Timestamp('1991-09-23 11:40:00')].astype("float").round().astype(int),y_val))

# DirRec Strategy

The DirRec strategy combines the direct strategy (above) with a recursive strategy. The congestion values predicted at previous steps are used as additional (lag) features in the model for the next step-ahead forecast.
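
The difference between the two strategies shows up in how many input features each per-step model receives. The sketch below uses random toy data and `LinearRegression` as a stand-in for LightGBM (both are assumptions chosen to keep the example fast), with scikit-learn's `MultiOutputRegressor` and `RegressorChain`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor, RegressorChain

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))   # toy features
Y = rng.normal(size=(200, 4))   # toy targets for steps 0..3

# Direct: one independent model per step, each seeing only X
direct = MultiOutputRegressor(LinearRegression()).fit(X, Y)

# DirRec: the model for step i sees X plus the targets of steps 0..i-1
# (true values during fit, its own predictions during predict)
chain = RegressorChain(LinearRegression()).fit(X, Y)

direct_widths = [est.n_features_in_ for est in direct.estimators_]
chain_widths = [est.n_features_in_ for est in chain.estimators_]
```

Every direct model receives 3 features, while the chained models receive 3, 4, 5 and 6. Errors in early steps are therefore fed into later steps as inputs.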

In [None]:
#display(X_train)
#display(y_train)

In [None]:
%%time
from sklearn.multioutput import RegressorChain

model_lgbm = RegressorChain(LGBMRegressor(random_state=1, learning_rate=0.05, n_estimators=800, n_jobs=-1))
model_lgbm.fit(X_train, y_train)

In [None]:
preds_lgbm = model_lgbm.predict(X_val)
preds_lgbm_df = pd.DataFrame(preds_lgbm, columns = y_train.columns, index = X_val.index)
preds_lgbm_df = preds_lgbm_df.reset_index(level="roadway")
#display(preds_lgbm_df)

In [None]:
multioutput_preds = generate_preds(preds_lgbm_df)
multioutput_preds.set_index("roadway",append=True, inplace=True)
multioutput_preds["mean"] = multioutput_preds.mean(axis=1)
display(multioutput_preds)

In [None]:
print("MAE:", mean_absolute_error(multioutput_preds["mean"],y_val))
print("MAE:", mean_absolute_error(multioutput_preds["mean"].round().astype(int),y_val))

In [None]:
generate_mae_steps()

**Observations:**

- The higher the step, the higher the MAE.
- Clipping helps to stop the error from propagating as the step increases.

**Insights:** 

- We can see the error starts to propagate as the step increases. This likely isn't a good dataset to use a recursive strategy on (perhaps with the exception of the first few lags).

**Additional evidence against a recursive strategy:**

We can use the ACF and PACF to decide how many lag values we should include in our model. This is a common strategy for deciding the order of a SARIMAX model (see my [notebook](https://www.kaggle.com/code/cabaxiom/tps-mar-22-sarima-linear-regression) where I do this). I'll recreate this here:

- I only consider Mondays; that way I only have to worry about the daily seasonality and not the weekly seasonality as well (Mondays follow each other in the time series).
- The time series is made stationary before plotting the ACF/PACF.
- AR = autoregressive (number of lags to use)
- MA = moving average
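
The stationarity step is a seasonal difference at lag 72 (one day of 20-minute steps). A minimal sketch on a synthetic series, where the sine-shaped daily pattern and the level shift of 2 per day are made-up assumptions:

```python
import numpy as np
import pandas as pd

PERIOD = 72  # 72 twenty-minute steps per day

# Toy series: a repeating daily pattern plus a level shift of 2 per day
t = np.arange(PERIOD * 4)
daily_pattern = 10 * np.sin(2 * np.pi * (t % PERIOD) / PERIOD)
series = pd.Series(40 + daily_pattern + 2.0 * (t // PERIOD))

# Seasonal differencing subtracts the value from the same time one day earlier
diffed = series.diff(periods=PERIOD).dropna()
```

The daily pattern cancels out and only the day-to-day change remains, which is the kind of series the ACF and PACF plots below are drawn from.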

In [None]:
def plot_series(df):
    plt.subplots(figsize=(25, 6))
    plt.title("Time series")
    xticks = df[df["minutes"]==0]["time_count"].values
    xtick_dates = df["week"].unique()
    ax = sns.lineplot(data=df, x="time_count", y="congestion", linewidth=1);

def plot_series_diff(df):
    temp_df = df.copy()

    temp_df["congestion_diff_72"] = temp_df["congestion"].diff(periods=72)
    temp_df = temp_df.dropna()
    plt.subplots(figsize=(25, 6))
    plt.title("Stationary time series (difference 72)")
    xticks = df[df["minutes"]==0]["time_count"].values
    xtick_dates = df["week"].unique()
    ax = sns.lineplot(data=temp_df, x="time_count", y="congestion_diff_72", linewidth=1 );

    return temp_df["congestion_diff_72"]

def plot_acf_pacf(df):
    temp_df = df.copy()
    temp_df["congestion_diff_72"] = temp_df["congestion"].diff(periods=72)
    temp_df = temp_df.dropna()
    
    f,ax= plt.subplots(figsize=(25, 12))
    ax = plt.subplot(2, 1, 1)
    plot_acf(temp_df["congestion_diff_72"], lags=300, ax=ax);
    for i in range(5):
            plt.axvline(i*72, color='r', lw=1)
            
    ax = plt.subplot(2, 1, 2)
    plot_pacf(temp_df["congestion_diff_72"], lags=300, method='ywm', ax=ax);
    for i in range(5):
        plt.axvline(i*72, color='r', lw=1)
        
def decide_orders(df):
    plot_series(df)
    series = plot_series_diff(df)
    plot_acf_pacf(df)

In [None]:
#Removing: 
#Memorial day - Monday May 27 1991
#Labor Day - Monday 2nd September 1991
mon = train_df_2.loc[(train_df_2["dayofweek"] == 0) & (~train_df_2["date"].isin([datetime.date(1991, 5, 27),datetime.date(1991, 9, 2)])),:].copy()

#Assign each time a unique ID number to make plotting easier.
enc = OrdinalEncoder()
enc.fit(mon[["time"]])
mon["time_count"] = enc.transform(mon[["time"]]).astype(int)
mon["week"] = mon["time"].dt.isocalendar().week # Include week for plotting

In [None]:
temp_df = mon[mon["roadway"] == "02NB"]
decide_orders(temp_df)

**Observations:**

- Significant ACF spikes at lag 1,71,72,73.
- The PACF tapers to 0 seasonally.
- The PACF has 2 significant spikes, at lags 1 and 2.

Non-Seasonal terms:

- A spike at lag 1 in both the ACF and PACF could indicate either an MA(1) or an AR(1) term, or perhaps both: ARMA(1,1). As the early PACF terms might be slightly tapering, an MA(1) term is more likely than an AR(1) term.
- Perhaps an AR(2) term is also possible.

Seasonal

- There is 1 significant spike at lag 72 in the ACF (spikes at lags 71 and 73 too). A seasonal MA(1) component seems likely.

Possible seasonal ARIMA models to consider (most likely first):

- (0,0,1) x (0,1,1)72
- (1,0,1) x (0,1,1)72
- (1,0,0) x (0,1,1)72
- (2,0,0) x (0,1,1)72
- (2,0,1) x (0,1,1)72

Notation: (AR order, differencing order, MA order) x (seasonal AR order, seasonal differencing order, seasonal MA order), with the trailing number giving the number of lags per season (72).
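
As a reading aid for this notation, the first candidate, (0,0,1) x (0,1,1)72, can be written with the backshift operator $B$, where $B^k y_t = y_{t-k}$. The symbols $\theta_1$, $\Theta_1$ and $\varepsilon_t$ are not defined in this notebook: they are assumed here to denote the non-seasonal MA coefficient, the seasonal MA coefficient and white noise. The plus signs follow the convention used by statsmodels' SARIMAX; some textbooks use minus signs instead.

$$
(1 - B^{72})\, y_t = (1 + \theta_1 B)\,(1 + \Theta_1 B^{72})\, \varepsilon_t
$$

The left-hand side is the seasonal difference at lag 72, and the right-hand side combines the MA(1) and seasonal MA(1) terms.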

**Insight:**

- The main point to notice here is the AR order for the non-seasonal terms. It seems unlikely that more than 2 lag values will improve the model (at least for this roadway). If lag values are not that useful, then how can we expect a recursive model that uses previous predictions as additional features to work well?
- Perhaps this can give more insight about what other features might be appropriate
- This is only a tool to help decide on lagged/MA features, it does not mean that these features are guaranteed to work, or guarantee that other features will not work.

# Test

We use a 0-step-ahead forecast from the direct model for our predictions, as this achieved the best validation score (lowest MAE).

Using all the training data to train the model:

In [None]:
train_df_2 = add_features(train_df)
train_df_2 = descriptive_features(train_df_2)
test_df_2 = add_features(test_df)
test_df_2 = descriptive_features(test_df_2)

train_df_2 = make_cyclic(train_df_2)
test_df_2 = make_cyclic(test_df_2, plot=False)

In [None]:
X_train = train_df_2
X_train = X_train.drop(columns=["date","congestion","minutes"])
y_train = y_full.loc[y_full.index.isin(X_train.index)]
X_train = X_train.set_index("time")


X_train_pivot = X_train.pivot(columns="roadway")
y_train_pivot = y_pivot.loc[y_pivot.index.isin(X_train_pivot.index)]

#Number the roadways:
enc = OrdinalEncoder()
X_train['roadway'] = enc.fit_transform(X_train[['roadway']])
X_train['roadway'] = X_train["roadway"].astype(int)

y_train = make_multistep_target(y_train_pivot, steps=36).dropna()
y_train = y_train.stack("roadway")

X_train = X_train.loc[X_train.index <= '1991-09-29 23:40:00']
X_train = X_train.set_index("roadway", append=True, drop=False)
X_train["roadway"] = X_train["roadway"].astype("category") #Roadways are not ordered
display(y_train)
display(X_train)

In [None]:
X_test = pd.concat([train_df_2.loc[train_df_2["date"] == datetime.date(1991,9,30)], test_df_2])
X_test = X_test.set_index("time").drop(columns=["date","congestion","minutes"])
X_test = X_test.set_index("roadway", append=True, drop=False)
X_test["roadway"] = enc.transform(X_test[["roadway"]]).astype(int)
X_test["roadway"] = X_test["roadway"].astype("category")
X_test["dayofweek"] = X_test["dayofweek"].astype("category")
X_test["weekend"] = X_test["weekend"].astype("category")
display(X_test)

In [None]:
%%time
from sklearn.multioutput import MultiOutputRegressor

model_lgbm = MultiOutputRegressor(LGBMRegressor(random_state=1, learning_rate=0.05, n_estimators=800, n_jobs=-1))
model_lgbm.fit(X_train, y_train)

In [None]:
preds_lgbm = model_lgbm.predict(X_test)
preds_lgbm_df = pd.DataFrame(preds_lgbm, columns = y_train.columns, index = X_test.index)
preds_lgbm_df = preds_lgbm_df.reset_index(level="roadway")
#display(preds_lgbm_df)

In [None]:
multioutput_preds = generate_preds(preds_lgbm_df, test=True)
multioutput_preds.set_index("roadway",append=True, inplace=True)
multioutput_preds["mean"] = multioutput_preds.mean(axis=1)
multioutput_preds

In [None]:
preds = multioutput_preds.reset_index().loc[:,["time","roadway","y_step_0"]]#.reset_index()
preds.columns = preds.columns.droplevel(level=1)
preds = preds.rename(columns={"y_step_0":"congestion"})
preds["congestion"] = preds["congestion"].astype("float").round()
preds["congestion"] = clipped(preds)
preds

In [None]:
submission = pd.read_csv("../input/tabular-playground-series-mar-2022/sample_submission.csv")
submission['congestion'] = preds["congestion"]
display(submission)

In [None]:
submission.to_csv('submission.csv', index=False)