# Potential Shake Up Analysis

In this competition, many of us are having a hard time finding a good CV/LB correlation.

As always, the big final question is "Should we trust CV or LB?"

In this notebook, I analyze the behaviour of two models I trained that are different (different parameters) but similar (same architecture).
Their overall OOF CV scores are very close, around 0.84, but their LB scores differ (it wouldn't be fun otherwise!).

Let's see what we can learn from that.

The whole idea behind this analysis is that the models have similar CV and similar architectures, so they should essentially be equivalent.
Any difference between the two can be considered as an observation of the noise of our metric and problem.

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import cohen_kappa_score
from matplotlib import pyplot as plt
from tqdm.notebook import tqdm

df_train = pd.read_csv("/kaggle/input/learning-agency-lab-automated-essay-scoring-2/train.csv")

model_1_oof = pd.read_csv("/kaggle/input/shakeupanalysis/model_1_oof.csv")
model_2_oof = pd.read_csv("/kaggle/input/shakeupanalysis/model_2_oof.csv")

df_train["model_1"] = model_1_oof["0"]
df_train["model_2"] = model_2_oof["0"]

cv_model_1 = cohen_kappa_score(np.clip(np.rint(df_train["model_1"]), 1, 6),
                                 df_train.score, weights='quadratic')

cv_model_2 = cohen_kappa_score(np.clip(np.rint(df_train["model_2"]), 1, 6),
                                 df_train.score, weights='quadratic')

print(f"Model 1 CV: {cv_model_1:.4f} -> LB score (fold 1) 0.805")
print(f"Model 2 CV: {cv_model_2:.4f} -> LB score (fold 1) 0.809")

We know that the test set has approximately 8K rows, with 30% (2400 rows) in the Public LB and 70% (5600 rows) in the Private LB.

So let's randomly draw a Public and a Private LB from the train data, and repeat that 1000 times to compute a few bootstrapped statistics.

In [None]:
models = ["model_1", "model_2"]
public_results = {col: [] for col in models}
private_results = {col: [] for col in models}

# Round and clip the OOF predictions once, outside the resampling loop
rounded_preds = {col: np.clip(np.rint(df_train[col].values), 1, 6) for col in models}
targets = df_train.score.values

for rand_iter in tqdm(range(1000)):

    # Shuffle all row indices: the first 2400 form the Public Leaderboard,
    # the next 5600 form a disjoint Private Leaderboard.
    # This avoids the slow `i not in rand_idx` scan over every row.
    shuffled_idx = np.random.permutation(len(df_train))
    rand_idx = shuffled_idx[:2400]
    private_idx = shuffled_idx[2400:2400 + 5600]

    for col in models:

        local_score = cohen_kappa_score(rounded_preds[col][rand_idx],
                                        targets[rand_idx], weights='quadratic')
        public_results[col].append(local_score)
        private_score = cohen_kappa_score(rounded_preds[col][private_idx],
                                          targets[private_idx], weights='quadratic')
        private_results[col].append(private_score)

## Public LB variability

The two plots below tell us multiple things.

First, depending on which rows end up in the Public LB, the scores can vary quite a lot, from ~0.825 to ~0.855.
This is somewhat expected, as some cases must be harder than others. It could also explain the CV/LB gap that most of us are experiencing.

Secondly, the scores of two equally good models can differ by as much as 0.01, and the standard deviation of that difference is ~0.004.
This tells us that a difference in score <0.004 on the Public Leaderboard is not really significant.
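
To put numbers on the spread between the two models, a small helper along these lines could summarize the paired differences. This is only a sketch: `summarize_differences` is a hypothetical name, and the synthetic scores below are made-up stand-ins for `public_results["model_1"]` and `public_results["model_2"]`, not values from this competition.

```python
import numpy as np

def summarize_differences(scores_a, scores_b):
    """Summarize paired score differences across resampled leaderboards."""
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    low, high = np.percentile(diff, [2.5, 97.5])
    return {
        "mean": float(diff.mean()),
        "std": float(diff.std()),
        "p2.5": float(low),
        "p97.5": float(high),
        # Fraction of resampled leaderboards where model A scores higher
        "frac_a_wins": float((diff > 0).mean()),
    }

# Synthetic stand-in data (assumption): a shared per-leaderboard difficulty
# term plus independent per-model noise.
rng = np.random.default_rng(0)
difficulty = rng.normal(0.84, 0.006, size=1000)
fake_a = difficulty + rng.normal(0.0, 0.003, size=1000)
fake_b = difficulty + rng.normal(0.0, 0.003, size=1000)

summary = summarize_differences(fake_a, fake_b)
print(summary)
```

In the notebook itself, the call would be `summarize_differences(public_results["model_1"], public_results["model_2"])`.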

In [None]:
x_values = public_results["model_1"]
y_values = public_results["model_2"]

# Find the minimum and maximum values
min_val = min(np.min(x_values), np.min(y_values))-0.001
max_val = max(np.max(x_values), np.max(y_values))+0.001

# Create the scatter plot
plt.scatter(x_values, y_values)

# Set the same limits for both axes
plt.xlim(min_val, max_val)
plt.ylim(min_val, max_val)

# Plot the lines y=x, y=x+0.005, y=x-0.005
x_line = np.linspace(min_val, max_val, 100)
plt.plot(x_line, x_line, 'r--', label='y = x')  # Line y = x
plt.plot(x_line, x_line + 0.005, 'g--', label='y = x + 0.005')  # Line y = x + 0.005
plt.plot(x_line, x_line - 0.005, 'b--', label='y = x - 0.005')  # Line y = x - 0.005

# Add legend
plt.legend()
plt.title("Public LB variability")
plt.xlabel("Model 1")
plt.ylabel("Model 2")
# Display the plot
plt.show()

In [None]:
plt.hist([x - y for x, y in zip(public_results["model_1"], public_results["model_2"])], bins=100);
plt.title(f"Public difference between the two models (std: {np.std([(x - y) for x, y in zip(public_results['model_1'],public_results['model_2'])]):.2e})")

plt.show()

## Private LB variability

If we plot the same thing for the Private LB scores, what can we say?

We can see from the histogram below that having more data reduces the deviation between the two models' private scores, and that a difference of 0.002 on the Private LB should be significant.
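
As a rough sanity check on why more rows help, one can assume (this is a rule of thumb, not something measured in this notebook) that sampling noise shrinks like one over the square root of the number of rows. Under that assumption, the Public standard deviation of ~0.004 can be rescaled to the Private size:

```python
import math

public_rows, private_rows = 2400, 5600
public_std = 0.004  # approximate Public LB std of the difference, quoted above

# Assumption: noise scales like 1 / sqrt(number of rows)
scale = math.sqrt(public_rows / private_rows)
rough_private_std = public_std * scale

print(f"scale factor: {scale:.3f}")
print(f"rough private std: {rough_private_std:.4f}")
```

The histogram remains the actual measurement; this only indicates the order of magnitude to expect.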

However, since there are almost 2000 participants in this competition, we can expect that some of us will get a "lucky boost" as big as 0.005 on the Private Leaderboard.
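
The "lucky boost" reasoning can be illustrated with a small simulation. This is a sketch under strong assumptions that are not part of the analysis above: every participant's private score is shifted by independent Gaussian noise, and `noise_std=0.0015` is a hypothetical placeholder rather than a measured value. The function name `expected_luckiest_boost` is also made up for illustration.

```python
import numpy as np

def expected_luckiest_boost(n_participants, noise_std, n_trials=500, seed=0):
    """Average, over simulated competitions, of the largest noise draw."""
    rng = np.random.default_rng(seed)
    # One row per simulated competition, one column per participant
    draws = rng.normal(0.0, noise_std, size=(n_trials, n_participants))
    return float(draws.max(axis=1).mean())

# Hypothetical noise level (assumption), with ~2000 participants
boost = expected_luckiest_boost(n_participants=2000, noise_std=0.0015)
print(f"expected luckiest boost: {boost:.4f}")
```

The point is only that the luckiest of many participants lands several standard deviations above zero, so the largest boost grows with the number of competitors.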

In [None]:
x_values = private_results["model_1"]
y_values = private_results["model_2"]

# Find the minimum and maximum values
min_val = min(np.min(x_values), np.min(y_values))-0.001
max_val = max(np.max(x_values), np.max(y_values))+0.001

# Create the scatter plot
plt.scatter(x_values, y_values)

# Set the same limits for both axes
plt.xlim(min_val, max_val)
plt.ylim(min_val, max_val)

# Plot the lines y=x, y=x+0.005, y=x-0.005
x_line = np.linspace(min_val, max_val, 100)
plt.plot(x_line, x_line, 'r--', label='y = x')  # Line y = x
plt.plot(x_line, x_line + 0.005, 'g--', label='y = x + 0.005')  # Line y = x + 0.005
plt.plot(x_line, x_line - 0.005, 'b--', label='y = x - 0.005')  # Line y = x - 0.005

# Add legend
plt.legend()
plt.title("Private LB variability")
plt.xlabel("Model 1")
plt.ylabel("Model 2")
# Display the plot
plt.show()

In [None]:
plt.hist([x - y for x, y in zip(private_results["model_1"], private_results["model_2"])], bins=100);
plt.title(f"Private difference between the two models (std: {np.std([(x - y) for x, y in zip(private_results['model_1'],private_results['model_2'])]):.2e})")

plt.show()

# Conclusion

This analysis is only meant to give some insight into a potential shake-up and is by no means a proof of anything.

However, since the Public Leaderboard of this competition is quite crowded around 0.820-0.822 (mostly because of public notebooks), I would say that competitors with Public scores above 0.826 are (almost) certain to finish above the public notebooks.
Obviously, things are more complicated when looking at the public notebooks' scores, as they are by design created to overfit the LB. Considering this, I would say that anyone with a solution independent of the top public notebooks and a score > 0.815 can have reasonable hope of outperforming this baseline on the Private Leaderboard.

Finally, there is still a long way to go before the end of the competition, and the gold zone will probably require a score >0.830, but at the moment I would definitely foresee a large shake-up!

Happy kaggling!