# Rough Argument about the Theoretical Lower Bound of the LB Score

In this notebook, I give a rough argument about the theoretical lower bound of the score for this task.

Considering how the host created the target values [1], it is natural to assume that each target value already contains uncertainty corresponding to its `standard_error`.

So I conducted a naive experiment:

1. generate Gaussian noise (sigma = standard error)
2. resample 600 and 2000 samples from the Gaussian noise (these correspond to the number of samples in the public and private test datasets)
3. calculate the RMSE (this corresponds to the hypothesis that the model estimates the targets with 100% accuracy)
4. iterate steps 1-3 10000 times and examine the variance of the RMSE

---

[1] https://www.kaggle.com/c/commonlitreadabilityprize/discussion/240423

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
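# Read the training data from the competition's input directory.
# The path is assumed from the competition slug in [1]; adjust it if the data is mounted elsewhere.
train_df = pd.read_csv('/kaggle/input/commonlitreadabilityprize/train.csv')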

First, a naive estimate of the RMSE of a 100%-accurate model is simply the mean of `standard_error`:

In [None]:
seq = train_df['standard_error']
# Exclude rows whose standard_error is exactly 0 from the summary statistics
filter_zero = seq != 0

print(f'mean standard_error: {seq[filter_zero].mean():.4f}')
print(f'min standard_error: {seq[filter_zero].min():.4f}')
print(f'max standard_error: {seq[filter_zero].max():.4f}')

sns.histplot(seq[filter_zero])
plt.title('Distribution of S.E.')
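Strictly speaking, for independent zero-mean noise with a per-sample standard deviation, the expected squared error is the mean of the squared standard errors, so the root mean square of the standard errors should be a slightly closer closed-form estimate than the plain mean. The sketch below illustrates the difference on synthetic values; the array `se` and its uniform range are hypothetical and are not the competition data.

```python
import numpy as np

# Hypothetical standard errors for illustration only (not the competition data).
rng = np.random.default_rng(0)
se = rng.uniform(0.4, 0.6, size=2000)

mean_se = se.mean()                   # the naive estimate: plain mean of S.E.
rms_se = np.sqrt(np.mean(se ** 2))    # square root of the expected MSE

# Monte Carlo check: one Gaussian residual per sample, repeated 2000 times.
noise = rng.normal(0.0, se, size=(2000, se.size))
simulated_rmse = np.sqrt((noise ** 2).mean(axis=1)).mean()

print(f'mean of S.E.:             {mean_se:.4f}')
print(f'RMS of S.E.:              {rms_se:.4f}')
print(f'simulated RMSE (average): {simulated_rmse:.4f}')
```

When the standard errors are spread narrowly, as in this synthetic example, the two estimates differ only in the third decimal place, so the plain mean remains a reasonable first approximation.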

Next, let's run the experiment with 10000 iterations for two sample sizes:

1. 600 samples (public test dataset)
2. 2000 samples (private test dataset)

In [None]:
rsme_public = []
seq = train_df['standard_error']
n_public = int(0.3 * 2000)  # 600 samples

for i in range(10000):
    np.random.seed(123 + i)
    # Resample 600 standard errors with replacement
    sampled = seq.sample(n=n_public, replace=True, random_state=234 + i).values
    # Draw one Gaussian residual per sample, using its standard error as sigma
    noise = np.random.normal(0, sampled)
    assert len(noise) == 600
    # RMSE of a model that predicts the noise-free targets exactly
    rsme_public.append(np.sqrt(np.square(noise).mean()))

sns.histplot(rsme_public)
plt.title('#sample dataset = 600')

In [None]:
print(f'p2.5 = {np.quantile(rsme_public, 0.025):.4f}')
print(f'p97.5 = {np.quantile(rsme_public, 0.975):.4f}')
print(f'mean = {np.mean(rsme_public):.4f}')
print(f'std = {np.std(rsme_public):.4f}')

In [None]:
rsme_private = []
seq = train_df['standard_error']
n_private = 2000

for i in range(10000):
    np.random.seed(123 + i)
    # Resample 2000 standard errors with replacement
    sampled = seq.sample(n=n_private, replace=True, random_state=234 + i).values
    # Draw one Gaussian residual per sample, using its standard error as sigma
    noise = np.random.normal(0, sampled)
    assert len(noise) == 2000
    # RMSE of a model that predicts the noise-free targets exactly
    rsme_private.append(np.sqrt(np.square(noise).mean()))

sns.histplot(rsme_private)
plt.title('#sample dataset = 2000')

In [None]:
print(f'p2.5 = {np.quantile(rsme_private, 0.025):.4f}')
print(f'p97.5 = {np.quantile(rsme_private, 0.975):.4f}')
print(f'mean = {np.mean(rsme_private):.4f}')
print(f'std = {np.std(rsme_private):.4f}')

## Result

From the experiments above, we can say:

1. the theoretical lower bound of the public LB score is around 0.46-0.52 (the central 95% of the simulated RMSE values)
2. the theoretical lower bound of the private LB score is around 0.48-0.51 (the central 95% of the simulated RMSE values)
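As a rough cross-check, if all residuals shared one constant noise level sigma, the standard deviation of the RMSE over n samples would be approximately sigma / sqrt(2n) (a delta-method approximation). The sketch below evaluates this under the assumption sigma = 0.49; the helper `rmse_interval` is a hypothetical name introduced only for this illustration, and the constant-sigma assumption does not hold exactly for the real data.

```python
import numpy as np

def rmse_interval(sigma, n, z=1.96):
    """Approximate central 95% interval of the RMSE of n i.i.d. N(0, sigma) residuals."""
    std = sigma / np.sqrt(2 * n)  # delta-method approximation of the RMSE's std
    return sigma - z * std, sigma + z * std

sigma = 0.49  # assumed constant noise level (an approximation, not a measured value)
for n in (600, 2000):
    lo, hi = rmse_interval(sigma, n)
    print(f'n = {n}: {lo:.3f} - {hi:.3f}')
```

The interval narrows by a factor of sqrt(2000 / 600), roughly 1.8, when going from 600 to 2000 samples, which is in line with the narrower private range above.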

### Note

This notebook's argument just examines the general behavior of the RMSE loss when the residual errors follow a Gaussian distribution N(0, σ) [2]. (Strictly speaking, the setting here is slightly different because the standard error is not constant across samples.)

Besides, the result relies on the correctness of the standard errors of the Bradley-Terry estimates of the target values. I am not sure whether these errors truly give the theoretical lower bound. At the very least, the targets must contain some uncertainty, and this notebook shows how the RMSE varies if that uncertainty is around 0.49.

[2] https://zenn.dev/bilzard/articles/f62c762a0016b9