# Motivation
Out of the 950 distinct pressure values in the training data, only 12 have the same count as a neighboring value!

*So the relative frequencies of neighboring values are almost never equal!*

Inspired by the recent, wonderful dummy rounding strategy, I explored a probability-weighted rounding idea, which is illustrated by a simple scenario:

*  Assume the step width is 1, a prediction is 0.3, and we need to round it to either 0 or 1. The dummy approach rounds to *0*, which implicitly assumes equal probability weights for both values.
*  But suppose the actual probabilities are not equal, say 0.01 at 0 and 0.04 at 1. Then the relative proportion of 0 within the step is 0.01/(0.01 + 0.04) = 20%, so the probability-adjusted cut point becomes relative proportion x step = 20% x 1 = 0.2. This new cut point rounds the prediction 0.3 to *1* instead of 0.
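
The toy scenario above can be checked numerically. The following is a minimal sketch, not the notebook's implementation; the helper names `prob_cut_point` and `round_with_cut` are hypothetical and introduced only for illustration.

```python
def prob_cut_point(lower, upper, freq_lower, freq_upper):
    """Cut point between two neighboring values, weighted by their frequencies.

    The lower value keeps the share freq_lower / (freq_lower + freq_upper)
    of the step, so the more frequent neighbor gets the wider interval.
    """
    step = upper - lower
    return lower + step * freq_lower / (freq_lower + freq_upper)


def round_with_cut(prediction, lower, upper, cut):
    """Round to `lower` below the cut point, otherwise to `upper`."""
    return lower if prediction < cut else upper


# Scenario from the text: values 0 and 1 with frequencies 0.01 and 0.04.
cut = prob_cut_point(0.0, 1.0, 0.01, 0.04)
naive = round_with_cut(0.3, 0.0, 1.0, 0.5)      # equal weights: cut at the midpoint
weighted = round_with_cut(0.3, 0.0, 1.0, cut)   # frequency-weighted cut
print(cut, naive, weighted)
```

With equal weights the cut sits at the midpoint 0.5 and 0.3 goes down; with the weighted cut near 0.2 the same prediction goes up.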

## Results
After applying the method to two public submissions (LB 0.156 and 0.157), both LB scores dropped by **0.002**. This is very similar to the dummy method, or at least indistinguishable on an LB that shows only 3 decimals.

The two methods actually produce different predictions for 2.2% of the test data. So the probability-weighted rounding idea may have the potential to *further* improve the score by up to 2.2% $\times$ 0.0703 = **0.00155** in the best case, or to change it only in the 5th decimal place, say **0.00003**, in the worst case.

## Discussion


Why does prob rounding not improve further here?
*  The pressure relative frequency distribution is clearly single-mode.
*  The relative frequencies of two neighbors are just too local and do not borrow any strength from the whole spectrum. A smoothed distribution may do better.
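
One way to borrow strength from more than two neighbors is to smooth the counts before computing the cut points. The sketch below is only an assumption about how that could look: the centered moving average, the function name `smoothed_cut_points`, and the `window` parameter are all illustrative choices that have not been tested on the leaderboard.

```python
import numpy as np


def smoothed_cut_points(values, counts, window=3):
    """Frequency-weighted cut points computed from moving-average-smoothed counts.

    values : sorted 1-D array of the distinct target values
    counts : occurrence count of each value, same length as `values`
    window : odd width of the centered moving average (hypothetical default)
    Returns one cut point per adjacent pair, i.e. len(values) - 1 entries.
    """
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    pad = window // 2
    # Pad with edge values so the ends are not pulled toward zero.
    padded = np.pad(counts, pad, mode="edge")
    smooth = np.convolve(padded, np.ones(window) / window, mode="valid")
    lower, upper = smooth[:-1], smooth[1:]
    # Share of each step kept by the lower neighbor.
    share_lower = lower / (lower + upper)
    return values[:-1] + np.diff(values) * share_lower


values = np.array([0.0, 1.0, 2.0, 3.0])
counts = np.array([10, 40, 40, 10])
cuts = smoothed_cut_points(values, counts, window=3)
print(cuts)
```

A wider `window` pulls the cut points back toward the midpoints, so the window width trades local detail against stability.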

Will prob rounding work better in other competitions?
Maybe, if the relative frequency distribution is not single-mode but has multiple modes or more complicated patterns.

Please leave a comment if you have any thoughts, or if you tried a similar idea and also saw little improvement. Thanks!

## Reference
1.  https://www.kaggle.com/snnclsr/a-dummy-approach-to-improve-your-score-postprocess
1.  https://www.kaggle.com/tenffe/finetune-of-tensorflow-bidirectional-lstm
1.  https://www.kaggle.com/cdeotte/ensemble-folds-with-median-0-153/


# What we know

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df_train = pd.read_csv("../input/ventilator-pressure-prediction/train.csv")
df_sub = pd.read_csv("../input/ensemble-of-public-submissions/submission.csv")

In [None]:
unique_pressures = df_train["pressure"].unique()
sorted_pressures = np.sort(unique_pressures)

In [None]:
total_pressures_len = len(sorted_pressures)

def find_nearest(prediction):
    insert_idx = np.searchsorted(sorted_pressures, prediction)
    if insert_idx == total_pressures_len:
        # If the predicted value is bigger than the highest pressure in the train dataset,
        # return the max value.
        return sorted_pressures[-1]
    elif insert_idx == 0:
        # Same control but for the lower bound.
        return sorted_pressures[0]
    lower_val = sorted_pressures[insert_idx - 1]
    upper_val = sorted_pressures[insert_idx]
    return lower_val if abs(lower_val - prediction) < abs(upper_val - prediction) else upper_val
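
Calling `find_nearest` through `Series.apply` runs one Python call per row. A vectorized variant is sketched below on a toy grid; the name `round_to_grid` and the sample values are hypothetical, and the tie-breaking (ties go to the upper neighbor) is intended to mirror `find_nearest`.

```python
import numpy as np


def round_to_grid(predictions, grid):
    """Round each prediction to the nearest value of a sorted 1-D grid."""
    predictions = np.asarray(predictions, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Clipping the insertion index keeps out-of-range predictions on the
    # first or last grid value instead of indexing past the ends.
    idx = np.clip(np.searchsorted(grid, predictions), 1, len(grid) - 1)
    lower, upper = grid[idx - 1], grid[idx]
    # Strict '<' sends exact ties to the upper neighbor.
    return np.where(np.abs(predictions - lower) < np.abs(upper - predictions),
                    lower, upper)


grid = np.array([0.0, 1.0, 2.0])
preds = np.array([-0.5, 0.3, 0.5, 1.7, 2.4])
rounded = round_to_grid(preds, grid)
print(rounded)
```

In the notebook this would be called with `df_sub["pressure"].to_numpy()` and `sorted_pressures`.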

In [None]:
%%time
df_sub["pressure"] = df_sub["pressure"].apply(find_nearest)

In [None]:
df_sub

In [None]:
df_sub.to_csv("submission_round_LB154.csv", index=False)
submission_round_LB154 = df_sub.copy()

# What is proposed

## Construct look up table: pressure_freq

In [None]:
pressure_freq = df_train['pressure'].value_counts().to_frame()
pressure_freq['freq'] = df_train['pressure'].value_counts(normalize = True).values
pressure_freq = pressure_freq.sort_index().reset_index()
pressure_freq.columns = ['pressure', 'count', 'freq']

In [None]:
pressure_freq # already sorted by 'pressure' due to sort_index()

In [None]:
pressure_freq['count_pre'] = pressure_freq['count'].shift(1)
pressure_freq[pressure_freq['count_pre'] == pressure_freq['count']]

## Out of 950 distinct pressure values, only 12 have the same count as their preceding neighbor!

*  So neighboring relative frequencies are predominantly unequal!

In [None]:
pressure_freq['pressure_pre'] = pressure_freq['pressure'].shift(1)
pressure_freq['pressure_step'] = pressure_freq['pressure'] - pressure_freq['pressure_pre']

In [None]:
PRESSURE_MIN = pressure_freq['pressure'].min()
PRESSURE_MAX = pressure_freq['pressure'].max()
PRESSURE_STEP = pressure_freq['pressure_step'].mean()
PRESSURE_STEP

In [None]:
pressure_freq['pressure_step'].describe()

No gap between adjacent values spans 2 or more steps among the 950 distinct pressures.
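
A direct way to check this is to compare every gap between adjacent distinct values with the smallest gap. The sketch below uses small synthetic grids in place of `pressure_freq['pressure']`; the helper name `max_gap_in_steps` is hypothetical.

```python
import numpy as np


def max_gap_in_steps(sorted_values):
    """Largest gap between adjacent values, in units of the smallest gap."""
    gaps = np.diff(np.asarray(sorted_values, dtype=float))
    return gaps.max() / gaps.min()


# Evenly spaced synthetic grid, using the approximate step 0.0703 from the text.
even_grid = 0.0703 * np.arange(10)
# Same grid with one value removed, which creates a double step.
holed_grid = np.delete(even_grid, 4)

print(max_gap_in_steps(even_grid) < 1.5)    # no gap spans two or more steps
print(max_gap_in_steps(holed_grid) < 1.5)   # the missing value is detected
```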

In [None]:
pressure_freq['freq_pre'] = pressure_freq['freq'].shift(1)
# Share of the step kept by the lower neighbor ('pressure_pre'), as in the Motivation example:
# the more frequent neighbor gets the wider interval.
pressure_freq['freq_relative_pct'] = pressure_freq['freq_pre'] / (pressure_freq['freq'] + pressure_freq['freq_pre'])
# 'pressure_prob' will be the neighboring probability weighted cut point to be used
pressure_freq['pressure_prob'] = pressure_freq['pressure_pre'] + \
                                 pressure_freq['pressure_step'] * pressure_freq['freq_relative_pct']

In [None]:
# 'pressure_half' will be the middle cut point for the usual equal weighted rounding
pressure_freq['pressure_half'] = (pressure_freq['pressure_pre'] + pressure_freq['pressure']) / 2
plt.figure(figsize=(10,6))
plt.plot(pressure_freq['pressure_prob'] - pressure_freq['pressure_half'], 
             label = 'Differences in two cut points within a pressure step')
plt.plot(pressure_freq['freq'], label = 'Relative freqs of 950 distinct pressures')
plt.title('Differences in two cut points of two rounding methods')
plt.legend()
plt.show()

In [None]:
pressure_freq

## Prob rounding function

In [None]:
# sorted_pressures is reused from above; it has the same order as the rows of pressure_freq.
total_pressures_len = len(sorted_pressures)

def find_nearest_prob(prediction):
    '''
    Probability weighted rounding.
    Only the lines after 'upper_val' differ from the function 'find_nearest'.
    '''
    insert_idx = np.searchsorted(sorted_pressures, prediction)
    if insert_idx == total_pressures_len:
        # If the predicted value is bigger than the highest pressure in the train dataset,
        # return the max value.
        return sorted_pressures[-1]
    elif insert_idx == 0:
        # Same control but for the lower bound.
        return sorted_pressures[0]
    lower_val = sorted_pressures[insert_idx - 1]
    upper_val = sorted_pressures[insert_idx]
    # Cut point between lower_val and upper_val, stored on the row of upper_val.
    cut_val = pressure_freq['pressure_prob'].iloc[insert_idx]
    # Usual rounding without frequency adjustment, kept for reference:
    # return lower_val if abs(lower_val - prediction) < abs(upper_val - prediction) else upper_val
    # New probability weighted rounding, adjusted for the difference in relative frequencies.
    return lower_val if prediction < cut_val else upper_val
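
For comparison, the same idea can be written in vectorized form. This is a sketch on toy data that follows the cut-point definition from the Motivation section (the lower neighbor keeps its share of the two frequencies); the name `round_to_grid_weighted` and the sample values are hypothetical.

```python
import numpy as np


def round_to_grid_weighted(predictions, grid, freqs):
    """Frequency-weighted rounding of predictions onto a sorted grid.

    grid  : sorted 1-D array of distinct values
    freqs : positive relative frequency of each grid value (same length as `grid`)
    """
    predictions = np.asarray(predictions, dtype=float)
    grid = np.asarray(grid, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    # cuts[i] separates grid[i] from grid[i + 1]; the more frequent
    # neighbor gets the wider part of the step.
    cuts = grid[:-1] + np.diff(grid) * freqs[:-1] / (freqs[:-1] + freqs[1:])
    # Clipping keeps out-of-range predictions on the first or last grid value.
    idx = np.clip(np.searchsorted(grid, predictions), 1, len(grid) - 1)
    return np.where(predictions < cuts[idx - 1], grid[idx - 1], grid[idx])


grid = np.array([0.0, 1.0, 2.0])
freqs = np.array([0.01, 0.04, 0.01])
preds = np.array([0.1, 0.3, 1.5, 1.9])
rounded_w = round_to_grid_weighted(preds, grid, freqs)
print(rounded_w)
```

Here the cut points land near 0.2 and 1.8, so both 0.3 and 1.5 are pulled toward the more frequent middle value.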


## Data re-loading and overview

In [None]:
df_sub = pd.read_csv("../input/ensemble-of-public-submissions/submission.csv")

In [None]:
sum(df_sub['pressure'] < PRESSURE_MIN)

In [None]:
np.searchsorted(sorted_pressures, df_sub['pressure'].min())

In [None]:
sum(df_sub['pressure'] > PRESSURE_MAX)

In [None]:
sum(df_sub['pressure'] > PRESSURE_MAX) / len(df_sub)

In [None]:
PRESSURE_MAX

In [None]:
df_sub['pressure'].max() 

In [None]:
df_sub['pressure'].max() - PRESSURE_MAX

In [None]:
 (df_sub['pressure'].max() - PRESSURE_MAX) / PRESSURE_STEP

The maximum pressure in the ensemble submission can exceed PRESSURE_MAX, but the overshoot stays within the measurement error range of one PRESSURE_STEP (about 0.0703).

So over-estimation or overfitting does not seem to be a problem?

## Apply probability rounding

In [None]:
df_sub

In [None]:
%%time
df_sub["pressure"] = df_sub["pressure"].apply(find_nearest_prob)

In [None]:
df_sub.to_csv("submission.csv", index=False)
submission_prob = df_sub.copy()

In [None]:
df_sub

# Compare

In [None]:
y_old = submission_round_LB154['pressure']
y_new = submission_prob['pressure']

In [None]:
(y_new - y_old).describe().round(3)

In [None]:
sum((y_new > y_old))

In [None]:
sum((y_new > y_old))/len(df_sub)

In [None]:
sum((y_new < y_old))


In [None]:
sum((y_new < y_old))/len(df_sub)

In [None]:
(0.0112 + 0.0108) * 0.0703 

The shares of upward and downward changes between the two methods are very similar (1.12% vs 1.08%). Are the signs just flipping by chance?
In total, prob rounding changed 2.2% of the pressures in the submission data compared to non-prob rounding.

But it is hard to believe that prob rounding reduces the MAE for all of these 2.2%.
If it did, then **at best** the new rounding would further improve the MAE by 2.2% * PRESSURE_STEP = **0.0015**.

This is not observed.
The simple non-prob rounding method has already improved the score by 0.002. There may be very little room left for prob rounding to improve further: it may cut the score by only 0.0000284?
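
The two figures above can be computed in one helper. The sketch below assumes that every changed row moves by exactly one step, which holds when both methods choose between the same two neighbors; the function name `rounding_change_summary` and the toy arrays are hypothetical.

```python
import numpy as np


def rounding_change_summary(y_old, y_new, step):
    """Summarize how a new rounding differs from an old one."""
    y_old = np.asarray(y_old, dtype=float)
    y_new = np.asarray(y_new, dtype=float)
    frac_up = float(np.mean(y_new > y_old))
    frac_down = float(np.mean(y_new < y_old))
    return {
        "frac_up": frac_up,
        "frac_down": frac_down,
        # Upper bound on the MAE change: each changed row moves by one step.
        "max_mae_change": (frac_up + frac_down) * step,
        # Net shift of the mean prediction: upward minus downward moves.
        "net_mean_shift": (frac_up - frac_down) * step,
    }


# Toy data: one of four rows moves up by one step, none move down.
summary = rounding_change_summary([0, 0, 1, 1], [1, 0, 1, 1], step=1.0)
print(summary)
```

With the shares reported above (1.12% up, 1.08% down) and a step of about 0.0703, the upper bound comes out near 0.0015.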

In [None]:
(sum(y_new > y_old) - sum(y_new < y_old))/len(df_sub)

In [None]:
0.0004033300198807157 * PRESSURE_STEP

# Another public submission: LB 0.157, cut by 0.002

In [None]:
sub_1 = pd.read_csv('../input/finetune-of-tensorflow-bidirectional-lstm/submission.csv')
sub_2 = pd.read_csv('../input/finetune-of-tensorflow-bidirectional-lstm/submission.csv')

In [None]:
%%time
sub_1["pressure"] = sub_1["pressure"].apply(find_nearest)
sub_1.to_csv("submission_LB157_round_LB155.csv", index=False)

In [None]:
%%time
sub_2["pressure"] = sub_2["pressure"].apply(find_nearest_prob)
sub_2.to_csv("submission_LB157_prob_LB155.csv", index=False)

In [None]:
y_old = sub_1['pressure']
y_new = sub_2['pressure']

In [None]:
sum((y_new > y_old))/len(sub_1)

In [None]:
sum((y_new < y_old))/len(sub_1)