# CommonLit: Signal vs. Noise

### The nature of standard_error

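The standard_error column is a *measure of spread of scores among multiple raters for each excerpt.* Each target is therefore an aggregate of several human judgements rather than a single label, which makes the dataset itself a kind of ensemble. To take an example from the training data set: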

In [None]:
import pandas as pd
train = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
train.head()

Sample *id c12129c31* has a target of *-0.340* and a standard_error of *0.464*. A simulation that approximates how the underlying rater scores might be distributed:

In [None]:
from scipy.stats import norm

# Simulate scores from 100 hypothetical raters for the first excerpt (normality assumed)
scores = norm.rvs(loc=train['target'][0], scale=train['standard_error'][0], size=100, random_state=42)
print("""Mean: {:.3f}
Std: {:.3f}
Range: {:.3f} to {:.3f}""".format(scores.mean(), scores.std(), scores.min(), scores.max()))

While these draws are not the actual observations, the simulation illustrates the range of difficulty scores that human labellers might assign to this excerpt: from -1.556 to 0.519 across the 100 draws. This suggests there is substantial variation in the human labels.
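The minimum and maximum of a random sample widen as more raters are drawn, so the range above depends on the choice of 100 draws. A less sample-dependent summary, under the same normality assumption, is the central 95% interval computed directly from the target and standard_error quoted above. The sketch below hard-codes those two values so it runs without the competition file; `rater_scores` is only an illustrative name:

```python
from statistics import NormalDist

# Values quoted above for sample c12129c31
target, standard_error = -0.340, 0.464

# Assumption: rater scores are normally distributed around the target
rater_scores = NormalDist(mu=target, sigma=standard_error)

# Central 95% interval: the 2.5th and 97.5th percentiles
lower = rater_scores.inv_cdf(0.025)
upper = rater_scores.inv_cdf(0.975)
print("95% interval: {:.3f} to {:.3f}".format(lower, upper))
```

The interval comes out at roughly -1.25 to 0.57, which supports the same conclusion: raters can plausibly disagree by well over a full point on this excerpt.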

### Applying standard_error to the dataset

The standard_error may be used to simulate the views of a human labeller. I am assuming that the errors follow a normal distribution:

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

# Use the training portion of each of 5 folds as a separate simulation
k_folds = KFold(5, shuffle=True, random_state=42)  # seeded so the splits are reproducible

rmses = []  # RMSE of each simulated labeller
for i, (train_ix, _) in enumerate(k_folds.split(train), 1):
    # Consensus targets and per-excerpt standard errors for this subset
    target, error = train.loc[train_ix, ['target', 'standard_error']].to_numpy().T
    normal_sample = norm.rvs(size=len(train_ix), random_state=i)  # draws from the standard normal
    prediction = target + error * normal_sample  # scale each draw by that excerpt's standard_error
    rmse = mean_squared_error(target, prediction) ** 0.5  # simulated labeller vs. consensus target
    rmses.append(rmse)
    print('Fold {} {:.3f}'.format(i, rmse))
print('Average RMSE {:.3f}'.format(sum(rmses) / len(rmses)))

This suggests that an average human labeller would have an RMSE of about 0.490 against the consensus target. The leading models on the leaderboard are pushing towards an RMSE of 0.460, while scores in the 0.490s sit around rank 375, or roughly the top third of competitors. For me this raises three questions:
- What is the distribution and nature of the errors?
- Does an RMSE substantially below 0.490 indicate overfitting?
- What does it mean for a model to be a *better* labeller of difficulty than a human labeller?
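
As a sanity check on the simulated labeller RMSE, the figure can also be derived without sampling. Under the normality assumption, each simulated label differs from its target by standard_error times a standard normal draw, so the expected squared error for an excerpt is its standard_error squared, and the expected RMSE is the root mean square of the standard_error column. The sketch below checks this on synthetic values using only the standard library. The uniform range of 0.40 to 0.60 and the target distribution are made-up stand-ins, not the competition data:

```python
import math
import random

rng = random.Random(0)
n = 20000

# Hypothetical per-excerpt values (NOT the competition data)
errors = [rng.uniform(0.40, 0.60) for _ in range(n)]
targets = [rng.gauss(-1.0, 1.0) for _ in range(n)]

# Simulated labeller: target plus standard_error times a standard normal draw
simulated = [t + e * rng.gauss(0.0, 1.0) for t, e in zip(targets, errors)]

# RMSE of the simulated labeller against the targets
sim_rmse = math.sqrt(sum((s - t) ** 2 for s, t in zip(simulated, targets)) / n)

# Closed form: root mean square of the standard errors
closed_form = math.sqrt(sum(e * e for e in errors) / n)

print("Simulated RMSE   {:.3f}".format(sim_rmse))
print("Closed-form RMSE {:.3f}".format(closed_form))
```

On the real data the equivalent one-liner would be `(train['standard_error'] ** 2).mean() ** 0.5`. If the two numbers agree there as well, the simulated RMSE is simply a summary of the standard_error column rather than an independent measurement.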