# Tabular Playground March 2022

This notebook builds on the median/mean baseline, in which the median of all Monday afternoons for each roadway is used as the prediction.
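
As a point of reference, the baseline can be sketched on toy data as follows. The frame `toy` and its values are made up for illustration; only the column names mirror the ones used later in this notebook.

```python
import pandas as pd

# Toy data: one roadway, three Mondays, two afternoon time slots (in minutes)
toy = pd.DataFrame({
    "roadway": ["00EB"] * 6,
    "date": ["1991-09-09", "1991-09-09", "1991-09-16",
             "1991-09-16", "1991-09-23", "1991-09-23"],
    "minutes": [720, 740] * 3,
    "congestion": [40, 50, 42, 58, 47, 51],
})

# Baseline prediction: the median over Mondays for each (roadway, time slot)
baseline = toy.groupby(["roadway", "minutes"])["congestion"].median()
print(baseline)
```

Every extra day added to the group changes which value ends up as the median, which is what the rest of the notebook tries to exploit.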

The aim of this notebook is to "borrow" days from the other weekdays and from similar roadways, and to treat them as additional "Monday" data when calculating the Monday-afternoon median used for the predictions.

The idea is that the median is normally calculated from only 26 Mondays. Increasing the number of days that go into it should make the result more accurate.

To do this we:
1. Find roadways that are similar to each target roadway. We then use these similar roadways as extra data.
2. From all weekdays, including the newly created data from other roadways, find the days that best match the median Monday afternoon for that roadway and assign them as additional Mondays.
3. Filter down the list of new Mondays by removing those whose morning is not similar to the morning of the test day.

In [None]:
QUANTILE = 0.55 # A quantile of 0.4 will select the closest 40% of dates for each roadway
METHOD = 'median' # Method applied to the afternoons to make the final prediction: 'mean' or 'median'
TIME_START = 6 # Time of day in hours at which comparisons to the test Monday start
SIMILAR_ROAD_CUTOFF = 3.5 # MAE cutoff, a lower number means fewer roads are identified as similar
NEAREST_MORNING_CUTOFF = 11 # MAE cutoff, a lower number removes more of the poorly matching days from the list of Mondays

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import datetime

from sklearn.metrics import mean_absolute_error

In [None]:
train_df = pd.read_csv("/kaggle/input/tabular-playground-series-mar-2022/train.csv", index_col='row_id', parse_dates=['time'])
test_df = pd.read_csv("/kaggle/input/tabular-playground-series-mar-2022/test.csv", index_col='row_id', parse_dates=['time'])

In [None]:
train_df["roadway"] = train_df["x"].astype(str) + train_df["y"].astype(str) + train_df["direction"]
test_df["roadway"] = test_df["x"].astype(str) + test_df["y"].astype(str) + test_df["direction"]
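
A small caveat on the key built above: concatenating `x` and `y` without a separator is only unambiguous while both coordinates are single digits, which seems to be the case in this dataset. The sketch below uses made-up coordinates to show how keys could collide otherwise, and how a separator avoids it. The frame `coords` is hypothetical.

```python
import pandas as pd

# Made-up coordinates with more than one digit
coords = pd.DataFrame({"x": [1, 11], "y": [12, 2], "direction": ["EB", "EB"]})

# Plain concatenation: both rows collapse to the same key "112EB"
plain = coords["x"].astype(str) + coords["y"].astype(str) + coords["direction"]

# With a separator the two locations stay distinct
safe = coords["x"].astype(str) + "_" + coords["y"].astype(str) + "_" + coords["direction"]

print(plain.tolist())
print(safe.tolist())
```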

In [None]:
train_df.drop(columns=["x","y","direction"], inplace=True)
test_df.drop(columns=["x","y","direction"], inplace=True)

In [None]:
def add_features(df):
    new_df = df.copy()
    
    new_df['minutes'] = df['time'].dt.hour * 60 + df['time'].dt.minute
    new_df['dayofweek'] = df['time'].dt.dayofweek
    new_df['date'] = df['time'].dt.date
 
    new_df.drop(columns=["time"], inplace=True)
    
    return new_df

In [None]:
train_df_2 = add_features(train_df)
test_df_2 = add_features(test_df)

### Borrowing Roads

Okay, so let's say we want to generate even more "Mondays" for each road. We can do this by using roads that are similar to the target road. Once we have identified a similar road, we can use all of its Mondays too.

Whether or not this is a good idea is questionable...

Implementation:
- For each road, locate its similar roads and add a duplicate of each similar road's rows to the dataframe, with the road name changed to the target road.
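
The borrowing step can be sketched on toy data like this. The frame `toy` and the mapping `similar` are hypothetical stand-ins for the training data and for the similarity dictionary built further down.

```python
import pandas as pd

toy = pd.DataFrame({
    "roadway": ["A", "A", "B", "B"],
    "date": ["d1", "d2", "d1", "d2"],
    "congestion": [10, 20, 12, 18],
})
similar = {"A": ["B"], "B": []}  # hypothetical: B is similar to A

# Original rows keep their own road as the source road
frames = [toy.assign(original_road=toy["roadway"])]
for target, neighbours in similar.items():
    for source in neighbours:
        # Copy the similar road's rows and relabel them as the target road
        borrowed = toy[toy["roadway"] == source]
        frames.append(borrowed.assign(roadway=target, original_road=source))

augmented = pd.concat(frames, ignore_index=True)
print(augmented)
```

Keeping the source road in a separate column makes it possible to tell original rows (where `roadway == original_road`) from borrowed ones later on.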

In [None]:
monday_day = train_df_2[(train_df_2["dayofweek"].isin([0])) & (train_df_2["minutes"] >= 6*60)]
monday_day_med = monday_day.groupby(["roadway","minutes"])["congestion"].median()
monday_day_med

In [None]:
mon_meds = monday_day_med.reset_index().pivot(index="roadway", columns="minutes", values="congestion")
a = pd.DataFrame(columns = mon_meds.index, index = mon_meds.index, dtype=float)
for n,row in enumerate(mon_meds.values):
    for m, row2 in enumerate(mon_meds.values):
        a.loc[mon_meds.index[n],mon_meds.index[m]] = mean_absolute_error(row,row2)
a
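
The double loop above is easy to read but does one `mean_absolute_error` call per pair of roads. A possible vectorised alternative, sketched here on a made-up `profiles` table that stands in for `mon_meds`, uses NumPy broadcasting and should give the same matrix as long as there are no missing values.

```python
import numpy as np
import pandas as pd

# Made-up median profiles: one row per road, one column per time slot
profiles = pd.DataFrame(
    [[10.0, 20.0, 30.0], [12.0, 18.0, 33.0], [50.0, 60.0, 70.0]],
    index=["A", "B", "C"],
)

values = profiles.to_numpy()
# Shape (n, 1, t) minus shape (1, n, t) broadcasts to (n, n, t);
# averaging over the last axis gives the pairwise MAE
pairwise = np.abs(values[:, None, :] - values[None, :, :]).mean(axis=2)
dist = pd.DataFrame(pairwise, index=profiles.index, columns=profiles.index)
print(dist)
```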

In [None]:
f, ax = plt.subplots(figsize=(15, 10))
ax = sns.heatmap(a, cmap = sns.cm.rocket_r)

In [None]:
a_mask = ((a < SIMILAR_ROAD_CUTOFF) & (a > 0))
similarRoads = {}
for i in a.index:
    similarRoads[i] = a_mask.index[a_mask[i]].tolist()
for key, val in similarRoads.items():
    print(key, ":", val)

In [None]:
def create_days_from_similar_roads():
    # Original rows keep their own road as the source road
    frames = [train_df_2.assign(original_road=train_df_2["roadway"])]
    for road_key, similar in similarRoads.items():
        for road_val in similar:
            # Copy the similar road's rows and relabel them as the target road,
            # remembering which road they were borrowed from
            temp_df = train_df_2.loc[train_df_2["roadway"] == road_val]
            frames.append(temp_df.assign(roadway=road_key, original_road=road_val))
    return pd.concat(frames)

In [None]:
train_df_3 = create_days_from_similar_roads()

In [None]:
train_df_3

### Borrowing Days

To select which days will be assigned as Mondays we will:
1. Calculate the median Monday congestion for every road and time slot (possibly only for times past midday).
2. Calculate the MAE between this median Monday and each of the other weekdays.
3. Select the days with the lowest MAE to use as additional Mondays.
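
The three steps above can be sketched for a single road as follows. The data is made up, and the sketch uses a wide table (one row per day) rather than the index alignment used in the cells below, so it illustrates the idea rather than the notebook's exact code.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["mon1"] * 2 + ["mon2"] * 2 + ["mon3"] * 2 + ["tue1"] * 2 + ["wed1"] * 2,
    "dayofweek": [0] * 6 + [1] * 2 + [2] * 2,
    "minutes": [720, 740] * 5,
    "congestion": [40, 50, 42, 52, 44, 54, 41, 53, 60, 70],
})

# One row per day, one column per time slot
wide = toy.pivot(index="date", columns="minutes", values="congestion")
is_monday = toy.groupby("date")["dayofweek"].first() == 0

# Step 1: median Monday profile per time slot
monday_profile = wide[is_monday].median()

# Step 2: MAE of every non-Monday day against that profile
day_mae = wide[~is_monday].sub(monday_profile, axis=1).abs().mean(axis=1)

# Step 3: keep the closest day(s) as additional Mondays
closest = day_mae.nsmallest(1).index.tolist()
print(day_mae)
print(closest)
```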


Step 1: Calculate the median Monday congestion for every road and time slot (possibly only for times past midday)

We only include the original roads, not the borrowed ones, when measuring the median Monday:

In [None]:
monday_afternoon = train_df_2[(train_df_2["dayofweek"].isin([0])) & (train_df_2["minutes"] >= 12*60)]
monday_afternoon_med = monday_afternoon.groupby(["roadway","minutes"])["congestion"].median()
monday_afternoon_med

In [None]:
#Basically the same as original version 
#monday_afternoon_all = train_df_3[(train_df_3["dayofweek"].isin([0])) & (train_df_3["minutes"] >= 12*60)]
monday_afternoon_v2 = train_df_3[(train_df_3["dayofweek"].isin([0])) & (train_df_3["roadway"] == train_df_3["original_road"]) & (train_df_3["minutes"] >= 12*60)]

Step 2: Calculate the MAE between this median Monday and the other weekdays, including the new roads we just assigned:

In [None]:
#We do not want to include mondays with original roads as we use these regardless, but we do want mondays from the other roads
#weekday_afternoon = train_df_3[(train_df_3["dayofweek"].isin([1,2,3,4])) & (train_df_3["minutes"] >= 12*60)]
weekday_afternoon = train_df_3[((train_df_3["dayofweek"].isin([1,2,3,4])) & (train_df_3["minutes"] >= 12*60)) | ((train_df_3["dayofweek"].isin([0])) & (train_df_3["roadway"] != train_df_3["original_road"]) & (train_df_3["minutes"] >= 12*60))]
weekday_afternoon = weekday_afternoon.groupby(["roadway", "minutes", "date", "original_road"])["congestion"].first()
weekday_afternoon

In [None]:
abs_err = abs(weekday_afternoon - monday_afternoon_med)
mae = abs_err.groupby(["roadway", "date", "original_road"]).mean()
mae

In [None]:
plt.subplots(figsize=(10, 6))
plt.title("MAE congestion values between the original median monday afternoons and other weekday afternoons");
plt.xlabel("Congestion MAE")
sns.histplot(x = mae.values);

Step 3: Select the dates with the lowest MAE values for each roadway to use as additional Mondays. The QUANTILE parameter currently controls this. A quantile of 0.4 will select the closest 40% of afternoons for each roadway to be used when calculating the median.

Option 2: Keep only the days whose congestion MAE is below some fixed value.
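
The two options can be compared on a toy MAE series. The values, the quantile of 0.4 and the threshold of 8 are made up for illustration, and `transform` is used here only to make the per-roadway broadcast explicit.

```python
import pandas as pd

# Made-up MAE per (roadway, date)
mae_toy = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
    index=pd.MultiIndex.from_product(
        [["A", "B"], ["d1", "d2", "d3", "d4", "d5"]], names=["roadway", "date"]
    ),
    name="congestion",
)

# Option 1: per-roadway quantile cutoff, broadcast back to every row
cutoff = mae_toy.groupby(level="roadway").transform(lambda s: s.quantile(0.4))
by_quantile = mae_toy[mae_toy < cutoff]

# Option 2: one absolute threshold shared by all roadways
by_threshold = mae_toy[mae_toy < 8]

print(by_quantile)
print(by_threshold)
```

With the quantile, every roadway contributes roughly the same share of its days. With a fixed threshold, a roadway whose days all differ strongly from its median Monday (road B here) contributes nothing.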

In [None]:
mae_cutoff = mae.groupby(["roadway"]).quantile(QUANTILE)
mae_low_mae = mae[(mae - mae_cutoff) < 0]
mae_lowest = mae_low_mae.reset_index()

#mae_lowest = mae.reset_index()[mae.reset_index()["congestion"] < 8]
mae_low = mae_lowest.drop(columns=["congestion"])
mae_low

In [None]:
newMondays = mae_low.merge(train_df_3, on=["roadway", "date", "original_road"]) # Merge to get the original congestion values for each minute
newMondays

In [None]:
newMondays["dayofweek"].value_counts()

In [None]:
newMondays["original_road"].value_counts()

TO DO: There are some missing time values. These may cause issues, such as a lower MAE being reported than there should be.
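
One way to find the affected days, sketched here on made-up data, is to count the distinct time slots per roadway and date and flag the days that have fewer than expected. It assumes that every slot occurs on at least one day.

```python
import pandas as pd

toy = pd.DataFrame({
    "roadway": ["A"] * 5,
    "date": ["d1", "d1", "d1", "d2", "d2"],
    "minutes": [720, 740, 760, 720, 760],  # d2 is missing the 740 slot
    "congestion": [40, 50, 45, 41, 44],
})

slots_per_day = toy.groupby(["roadway", "date"])["minutes"].nunique()
expected = toy["minutes"].nunique()  # assumes every slot occurs on at least one day
incomplete = slots_per_day[slots_per_day < expected]
print(incomplete)
```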

In [None]:
newmonday_afternoon = newMondays[newMondays["minutes"] >= 720]
newmonday_afternoon

In [None]:
additional_mondays = pd.concat([newmonday_afternoon, monday_afternoon_v2]) # Add all the original mondays back in
additional_mondays

In [None]:
additional_mondays["dayofweek"].value_counts()

In [None]:
additional_mondays["roadway"].value_counts()

So far we have only the afternoon data; we now want the morning data too:

In [None]:
add_mondays_dates = additional_mondays.groupby(["roadway","date", "original_road"])["congestion"].first() # first() is just a placeholder; only the (roadway, date, original_road) keys are needed

In [None]:
new_monday_data = train_df_3.merge(add_mondays_dates, on=["roadway","date", "original_road"]).drop(columns=["congestion_y"]) # congestion_y only holds the placeholder values; congestion_x is the real data
new_monday_data

### Matching mornings

The aim now is to filter down our original and borrowed Mondays. We do this by finding the Monday mornings that don't match the morning of the test day (Monday, September 30) and removing those days from the calculation of the median.
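
The filter can be sketched for a single road as follows. The values are made up, `reference` stands in for the morning of the test day, and `cutoff` plays the role of NEAREST_MORNING_CUTOFF.

```python
import pandas as pd

# Made-up mornings of two candidate days
mornings = pd.DataFrame({
    "date": ["d1", "d1", "d2", "d2"],
    "minutes": [360, 380, 360, 380],
    "congestion": [30, 35, 60, 65],
})
# Hypothetical reference morning (stands in for the test-day morning)
reference = pd.Series({360: 32, 380: 34})

# Absolute error per time slot, then the mean per day
err = (mornings["congestion"] - mornings["minutes"].map(reference)).abs()
morning_mae = err.groupby(mornings["date"]).mean()

cutoff = 11  # plays the role of NEAREST_MORNING_CUTOFF
kept_dates = morning_mae[morning_mae < cutoff].index.tolist()
print(morning_mae)
print(kept_dates)
```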

Now that we have the new Monday afternoons, we want the matching Monday mornings:

In [None]:
#Check everything looks correct
new_monday_data["dayofweek"].value_counts()

In [None]:
monday_morning = new_monday_data[(new_monday_data["minutes"] < 12*60) & (new_monday_data["minutes"] >= TIME_START*60)].rename(columns={"congestion_x":"congestion"}) # The 30th is not included here, because the days were selected by their afternoon data earlier
test_monday_morning = train_df_2[(train_df_2["date"] == datetime.date(1991, 9, 30)) & (train_df_2["minutes"] >= TIME_START*60)]
#test_monday_morning = train_df_2[(train_df_2["date"] == datetime.date(1991, 9, 23)) & (train_df_2["minutes"] >= TIME_START*60)]
train_mm = monday_morning.groupby(["roadway", "date", "original_road", "minutes"])["congestion"].first()
test_mm = test_monday_morning.groupby(["roadway", "minutes"])["congestion"].first()
abs_err = (abs(train_mm - test_mm)).groupby(["roadway", "date", "original_road", "minutes"]).first()
mon_MAE_df = abs_err.groupby(["roadway", "date", "original_road"]).mean().reset_index().rename(columns={"congestion":"congestionMAE"})
mon_MAE_df

In [None]:
highestx = mon_MAE_df.sort_values(['congestionMAE'],ascending=True)
plt.subplots(figsize=(10, 6))
plt.title("MAE congestion values between the monday morning on 30th Sept and other mornings");
plt.xlabel("Congestion MAE")
sns.histplot(data = highestx, x="congestionMAE");

In [None]:
highestx = highestx[highestx["congestionMAE"] < NEAREST_MORNING_CUTOFF]
highestx

In [None]:
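highestx_train = highestx.merge(train_df_3, on=["roadway", "date", "original_road"]) # Match on original_road as well, so rows from borrowed roads keep their own congestion values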
highestx_train = highestx_train[highestx_train["minutes"] >= 720 ]

# Aggregate the selected afternoons per roadway and time slot; the result can be fractional, so round it to an integer
if METHOD == "median":
    highestx_train = highestx_train.groupby(["roadway","minutes"])["congestion"].median().round().astype(int)
elif METHOD == "mean":
    highestx_train = highestx_train.groupby(["roadway","minutes"])["congestion"].mean().round().astype(int)
else:
    raise ValueError("METHOD must be 'median' or 'mean'")
    
    
test_df_2 = test_df_2.drop(columns=["dayofweek","date"]).merge(highestx_train, how="left", left_on=["roadway", "minutes"], right_index=True)
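
Because the merge above is a left merge, any roadway and time slot for which no day survived the cutoffs would end up with a missing prediction. A quick check, sketched here on made-up data with hypothetical names `test_rows` and `preds`, could look like this:

```python
import pandas as pd

test_rows = pd.DataFrame({"roadway": ["A", "A", "B"], "minutes": [720, 740, 720]})
# Made-up predictions that cover road A only
preds = pd.Series(
    [42, 52],
    index=pd.MultiIndex.from_tuples(
        [("A", 720), ("A", 740)], names=["roadway", "minutes"]
    ),
    name="congestion",
)

merged = test_rows.merge(preds, how="left", left_on=["roadway", "minutes"], right_index=True)
# Rows without a matching prediction have NaN in the congestion column
missing = merged[merged["congestion"].isna()]
print(missing)
```

If `missing` is not empty, loosening the cutoffs or falling back to the plain Monday median for those rows would be two possible remedies.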

### Submission

In [None]:
submission = pd.read_csv("../input/tabular-playground-series-mar-2022/sample_submission.csv")
submission['congestion'] = test_df_2["congestion"].values

submission

In [None]:
submission.to_csv('submission.csv', index=False)