# Generalizing the Special Values

Most top solutions to the March TPS competition follow the same three-step pattern:
1. Predict test congestions using an ensemble of gradient-boosted trees
2. Replace some predictions with so-called "special values" ([EDA introducing the special values](https://www.kaggle.com/ambrosm/tpsmar22-eda-which-makes-sense#Congestion-and-its-special-values))
3. Round the predictions to the nearest integer ([Why rounding improves the score](https://www.kaggle.com/c/tabular-playground-series-jan-2022/discussion/301249))

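In this notebook, we generalize step 2: rather than replacing some predictions with special values (which are medians of the training data), we clip all predictions to an interval bounded by two quantiles of the training data. The quantiles used here are the 15 % and 70 % quantiles, computed separately for every time of day and roadway.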
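
To illustrate the idea on a small scale, here is a minimal sketch with made-up data. The `roadway` column, the toy values and the 25 % / 75 % quantiles are assumptions chosen for readability; they are not taken from the competition data. Unlike the cell below, the sketch attaches the bounds to the predictions by key with `merge` rather than by row position.

```python
import pandas as pd

# Toy training data: two hypothetical roadways, five observations each
train_toy = pd.DataFrame({
    'roadway': ['A'] * 5 + ['B'] * 5,
    'congestion': [10, 20, 30, 40, 50, 60, 62, 64, 66, 68],
})

# Per-roadway lower and upper quantiles (25 % and 75 % for illustration only)
grouped = train_toy.groupby('roadway').congestion
bounds = pd.DataFrame({
    'lower': grouped.quantile(0.25),
    'upper': grouped.quantile(0.75),
}).reset_index()

# Hypothetical model predictions for the same roadways
preds = pd.DataFrame({
    'roadway': ['A', 'A', 'B', 'B'],
    'congestion': [5.0, 33.0, 70.0, 63.0],
})

# Attach the bounds by key, then clip every prediction into [lower, upper]
merged = preds.merge(bounds, on='roadway', how='left')
merged['clipped'] = merged.congestion.clip(merged.lower, merged.upper)
print(merged)
```

Predictions that already lie between the two quantiles (33 and 63 here) are left unchanged; only the outliers are pulled back to the nearest bound.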

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error


In [None]:
# Read and prepare the training data
train = pd.read_csv('../input/tabular-playground-series-mar-2022/train.csv', parse_dates=['time'])
train['hour'] = train['time'].dt.hour
train['minute'] = train['time'].dt.minute

# Read the current top public submission by Mirena Borisova
submission_in = pd.read_csv('../input/tabular-playground-march-2022-04/lightautoml_rounded_special_ve_37_v2.csv')

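# Compute the quantiles of workday afternoons in September, excluding Labor Day
# (in a non-leap year, dayofyear >= 246 keeps only dates from September 3 onwards)
sep = train[(train.time.dt.hour >= 12) & (train.time.dt.weekday < 5) &
            (train.time.dt.dayofyear >= 246)]
# NOTE: .values drops the group keys, so the bounds are matched to the submission
# by position only. This assumes the submission rows are sorted by
# (hour, minute, x, y, direction), the order in which groupby returns its groups.
grouped = sep.groupby(['hour', 'minute', 'x', 'y', 'direction']).congestion
lower = grouped.quantile(0.15).values
upper = grouped.quantile(0.7).values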

# Clip the submission data to the quantiles
submission_out = submission_in.copy()
submission_out['congestion'] = submission_in.congestion.clip(lower, upper)
#submission_out.to_csv('submission.csv', index=False)

# Display some statistics
mae = mean_absolute_error(submission_in.congestion, submission_out.congestion)
print(f'Mean absolute modification: {mae:.4f}')
print(f"Submission was below lower bound: {(submission_in.congestion < lower - 0.5).sum()}")
print(f"Submission was above upper bound: {(submission_in.congestion > upper + 0.5).sum()}")

#submission_out

In [None]:
# Round the submission
submission_out['congestion'] = submission_out.congestion.round().astype(int)
submission_out.to_csv('submission_rounded.csv', index=False)
submission_out
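
One detail of the rounding step may matter when a clipped value lands exactly halfway between two integers: pandas `Series.round()` rounds such ties to the nearest even integer rather than always upwards. The values in this small sketch are made up for illustration.

```python
import pandas as pd

# Hypothetical clipped predictions, three of them exactly on a .5 boundary
clipped = pd.Series([0.5, 1.5, 2.5, 2.51])

# Ties go to the nearest even integer: 0.5 -> 0, 1.5 -> 2, 2.5 -> 2
rounded = clipped.round().astype(int)
print(rounded.tolist())
```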

In [None]:
# How many predictions were changed by how much?
difference = submission_out.congestion - submission_in.congestion
histogram = difference[difference != 0].value_counts()
plt.rcParams['axes.facecolor'] = '#0057b8' # blue
plt.bar(histogram.index, histogram, color='#ffd700')
plt.xlabel('change in congestion (after - before clipping)')
plt.ylabel('count')
plt.show()