# Tabular Playground March 2022

This notebook builds on the median/mean baseline, in which we take the median congestion over all Monday afternoons for each roadway and use those medians as the predictions.
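
As a rough illustration of that baseline, here is a minimal sketch on made-up data. The roadway label `00EB`, the congestion values and the names `toy` and `baseline` are assumptions for the example, not taken from the competition files.

```python
import pandas as pd

# Hypothetical toy data: two Mondays of afternoon readings for one roadway,
# plus one Tuesday reading that the filter below should exclude.
toy = pd.DataFrame({
    "time": pd.to_datetime([
        "1991-09-16 12:00", "1991-09-16 12:20",
        "1991-09-23 12:00", "1991-09-23 12:20",
        "1991-09-24 12:00",
    ]),
    "roadway": ["00EB"] * 5,
    "congestion": [40, 50, 60, 70, 99],
})

toy["minutes"] = toy["time"].dt.hour * 60 + toy["time"].dt.minute
is_monday_afternoon = (toy["time"].dt.dayofweek == 0) & (toy["minutes"] >= 12 * 60)

# One median per (roadway, time of day), taken across all Mondays.
baseline = toy[is_monday_afternoon].groupby(["roadway", "minutes"])["congestion"].median()
print(baseline)
```

Each test row would then be filled by looking up its roadway and time of day in `baseline`.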

However, instead of using all Mondays to calculate the median, we look for Monday mornings that are similar to the Monday morning of the test day. We do this by calculating the MAE between the morning of Monday, September 30th and every other Monday morning (from `TIME_START` to 11:40), and selecting the `X_SMALLEST` dates with the smallest MAE for each roadway. The mean or median of those dates' afternoon values is then used as the prediction for the test period (September 30th, 12:00 to 23:40).
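
The selection step can be sketched for a single roadway as follows. This is a simplified illustration under stated assumptions, not the notebook's implementation: the date labels `d1` to `d3`, the `target` column standing in for the test day, the values, and `K` (playing the role of `X_SMALLEST`) are all made up.

```python
import pandas as pd

# Hypothetical morning readings for one roadway, one column per date.
# "target" stands in for the test day, whose afternoon is unknown.
mornings = pd.DataFrame(
    {"d1": [30, 40, 50], "d2": [10, 20, 30], "d3": [32, 41, 49], "target": [31, 40, 50]},
    index=[360, 380, 400],  # minutes since midnight
)
# Hypothetical afternoon readings for the historical dates only.
afternoons = pd.DataFrame(
    {"d1": [60, 62], "d2": [20, 22], "d3": [64, 66]},
    index=[720, 740],
)

K = 2  # plays the role of X_SMALLEST

# MAE between each historical morning and the target morning.
history = mornings.drop(columns="target")
mae = history.sub(mornings["target"], axis=0).abs().mean()

# Keep the K dates with the smallest MAE, then take the median of their afternoons.
closest = mae.nsmallest(K).index
prediction = afternoons[closest].median(axis=1)
print(prediction)
```

Here `d2` has a very different morning and is dropped, so the prediction is built from `d1` and `d3` only.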

The results were not very promising: they were worse than simply taking the median of all Monday afternoons.

I had more success when looking for the best mornings across all weekdays.

In [None]:
X_SMALLEST = 41 # Number of closest mornings to keep per roadway; for reference, there are 26 weeks in the train data
DAYS_OF_WEEK = [0,1,2,3,4] # Days of the week to search for the closest mornings; Monday = 0
TIME_START = 5 # Hour of the day at which the morning comparison with the test Monday starts
METHOD = "median" # "mean" or "median" (lowercase): how the afternoon values of the X_SMALLEST closest days are combined into the final prediction

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import datetime

In [None]:
train_df = pd.read_csv("/kaggle/input/tabular-playground-series-mar-2022/train.csv", index_col='row_id', parse_dates=['time'])
test_df = pd.read_csv("/kaggle/input/tabular-playground-series-mar-2022/test.csv", index_col='row_id', parse_dates=['time'])

In [None]:
train_df["roadway"] = train_df["x"].astype(str) + train_df["y"].astype(str) + train_df["direction"]
test_df["roadway"] = test_df["x"].astype(str) + test_df["y"].astype(str) + test_df["direction"]

In [None]:
train_df.drop(columns=["x","y","direction"], inplace=True)
test_df.drop(columns=["x","y","direction"], inplace=True)

In [None]:
def add_features(df):
    new_df = df.copy()
    
    new_df['minutes'] = df['time'].dt.hour * 60 + df['time'].dt.minute
    new_df['dayofweek'] = df['time'].dt.dayofweek
    new_df['date'] = df['time'].dt.date
 
    new_df.drop(columns=["time"], inplace=True)
    
    return new_df

In [None]:
train_df_2 = add_features(train_df)
test_df_2 = add_features(test_df)

In [None]:
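# Mornings (from TIME_START:00 up to, but not including, 12:00) on the selected days of the week.
# The "monday" names are kept from the Monday-only version of this notebook.
monday_morning = train_df_2[(train_df_2["dayofweek"].isin(DAYS_OF_WEEK)) & (train_df_2["minutes"] < 12*60) & (train_df_2["minutes"] >= TIME_START*60)]

# The morning of the test day (Monday 30 September 1991) is taken from the train data.
test_monday_morning = monday_morning[monday_morning["date"] == datetime.date(1991, 9, 30)]

train_mm = monday_morning[monday_morning["date"] != datetime.date(1991, 9, 30)].groupby(["roadway", "date","minutes"])["congestion"].first()

test_mm = test_monday_morning.groupby(["roadway", "minutes"])["congestion"].first()
# Smooth the test-day morning with a centred 3-point rolling mean.
# Note: the window runs over the whole Series, so it also spans roadway boundaries.
test_mm = test_mm.rolling(3, min_periods=1, center=True).mean()

# Absolute error per roadway, date and time of day; the two Series align on their shared index levels.
abs_err = (abs(train_mm - test_mm)).groupby(["roadway", "date","minutes"]).first()

# Mean absolute error of each (roadway, date) morning against the test-day morning.
mon_MAE_df = abs_err.groupby(["roadway", "date"]).mean().reset_index().rename(columns={"congestion":"congestionMAE"})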
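
The subtraction `train_mm - test_mm` relies on pandas aligning two MultiIndexed Series on their shared level names (`roadway` and `minutes`). A more explicit way to compute the same kind of per-date MAE is to join the test-day values onto the historical rows first. The sketch below uses made-up data; `hist`, `target`, `paired` and the date labels are hypothetical names for illustration only.

```python
import pandas as pd

# Hypothetical historical mornings: two dates, two time slots, one roadway.
hist = pd.DataFrame({
    "roadway": ["00EB"] * 4,
    "date": ["d1", "d1", "d2", "d2"],
    "minutes": [360, 380, 360, 380],
    "congestion": [30, 40, 10, 20],
})
# Hypothetical test-day morning, indexed by (roadway, minutes).
target = pd.Series(
    [31, 40],
    index=pd.MultiIndex.from_tuples(
        [("00EB", 360), ("00EB", 380)], names=["roadway", "minutes"]
    ),
    name="target",
)

# Attach the test-day value to every historical row with the same roadway and time of day.
paired = hist.join(target, on=["roadway", "minutes"])
paired["abs_err"] = (paired["congestion"] - paired["target"]).abs()

# Mean absolute error per (roadway, date).
mae = paired.groupby(["roadway", "date"])["abs_err"].mean()
print(mae)
```

The explicit join makes the pairing visible in `paired`, which can help when checking that every historical row found a matching test-day value.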

In [None]:
mon_MAE_df

In [None]:
plt.subplots(figsize=(25, 6))
sns.barplot(data = mon_MAE_df, x = "roadway", y="congestionMAE")
plt.xticks(rotation=90);
plt.title(f"Mean congestion MAE between the morning of Monday 30th Sept and all other specified days ({TIME_START}:00 to 11:40) for each individual road");

**Insight:**

- This graph shows which roads have morning congestion levels on the selected days that are similar to those on the morning of the test day.
- A lower congestionMAE means that the test day started in a similar way to the earlier days it is compared with.

In [None]:
plt.subplots(figsize=(25, 6))
sns.barplot(data = mon_MAE_df.groupby(["roadway"])["congestionMAE"].min().reset_index(), x = "roadway", y="congestionMAE")
plt.xticks(rotation=90);
plt.title(f"Minimum congestion MAE between the morning of Monday 30th Sept and all other specified days ({TIME_START}:00 to 11:40) for each individual road");

In [None]:
# Keep the X_SMALLEST dates with the lowest morning MAE for each roadway.
lowest5 = mon_MAE_df.sort_values(['roadway','congestionMAE'],ascending=True).groupby('roadway').head(X_SMALLEST)

# Afternoon readings (12:00 onwards) from those closest dates.
low5_train = lowest5.merge(train_df_2, on=["roadway", "date"])
low5_train = low5_train[low5_train["minutes"] >= 720 ]

if METHOD == "median":
    low5_train = low5_train.groupby(["roadway","minutes"])["congestion"].median().round().astype(int) # Median across the selected dates, rounded to an integer
elif METHOD == "mean":
    low5_train = low5_train.groupby(["roadway","minutes"])["congestion"].mean().round().astype(int) # Mean across the selected dates, rounded to an integer
else:
    raise ValueError(f"Unknown METHOD: {METHOD!r}")
test_df_2 = test_df_2.drop(columns=["dayofweek","date"]).merge(low5_train, how="left", left_on=["roadway", "minutes"], right_index=True)

In [None]:
lowest5

In [None]:
submission = pd.read_csv("../input/tabular-playground-series-mar-2022/sample_submission.csv")
submission['congestion'] = test_df_2["congestion"].values

submission.head()

In [None]:
submission.to_csv('submission.csv', index=False)