# What is the best possible CV score?

In this notebook we explore what the best possible CV score on the validation data would be.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

Let's read the data.

In [None]:
train = pd.read_csv('../input/jigsaw-toxic-severity-rating/validation_data.csv')
train

How are comment pairs distributed?

In [None]:
train['freq'] = train.groupby(['less_toxic', 'more_toxic']).worker.transform('count')
train.freq.value_counts()

We see that the number of rows whose pair appears twice is double the number of rows whose pair appears once. This is because many comment pairs contribute three rows of the form (row order is not meaningful, and the worker column is omitted):
```
less_toxic more_toxic 
comment1   comment2   
comment1   comment2   
comment2   comment1   
```
Two annotators ranked the two comments the same way, and a third ranked them the other way. The majority vote is:
```
less_toxic more_toxic 
comment1   comment2
```
To analyse this further, let's add every pair in the reverse order and define the target explicitly. The target is 1 for the original rows and 0 for the reversed rows. For the example above we now get 6 rows:
```
less_toxic more_toxic target 
comment1   comment2   1      
comment1   comment2   1      
comment2   comment1   1      
comment2   comment1   0      
comment2   comment1   0      
comment1   comment2   0      
```

In [None]:
# Same annotations with the two comments swapped.
train_reverse = pd.DataFrame({
    'worker': train.worker.values,
    'less_toxic': train.more_toxic.values,
    'more_toxic': train.less_toxic.values,
})

# Original rows get target 1, reversed rows get target 0.
train['target'] = 1
train_reverse['target'] = 0
train2 = pd.concat([train, train_reverse]).reset_index(drop=True)
train2

Now we can compute the mean target per ordered pair. This mean can be interpreted as the probability, according to the annotators, that the pair is in the right order. In our example above we would get:
```
less_toxic more_toxic target proba
comment1   comment2   1      2/3
comment1   comment2   1      2/3
comment2   comment1   1      1/3
comment2   comment1   0      1/3
comment2   comment1   0      1/3
comment1   comment2   0      2/3
```


In [None]:
train2['proba'] = train2.groupby(['less_toxic', 'more_toxic']).target.transform('mean')
train2

If we look at our example above and keep only the rows with a probability above 1/2, we get:

```
less_toxic more_toxic target proba
comment1   comment2   1      2/3
comment1   comment2   1      2/3
comment1   comment2   0      2/3
```
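These are our three original annotations, all written in the majority order, with an explicit target (1 where the annotator agreed with that order, 0 where they did not) and a probability.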

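We can see that the best possible average score for these three rows is 2/3, i.e. their common probability. A model that gives each comment a single score orders the pair the same way in all three rows, so it can agree with at most two of the three annotators.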
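This can be checked directly on the three-row example. The sketch below assumes the score is the fraction of rows in which the model gives `more_toxic` a strictly higher score than `less_toxic`; `annotations` and `agreement` are illustrative names, not part of the notebook.

```python
# Three annotations of the same pair, written as (less_toxic, more_toxic).
annotations = [
    ("comment1", "comment2"),
    ("comment1", "comment2"),
    ("comment2", "comment1"),
]

def agreement(scores):
    """Fraction of annotations where more_toxic gets a strictly higher score."""
    hits = [scores[less] < scores[more] for less, more in annotations]
    return sum(hits) / len(hits)

# With one score per comment, the model can rank the pair only one way or the other.
majority_order = agreement({"comment1": 0.0, "comment2": 1.0})
minority_order = agreement({"comment1": 1.0, "comment2": 0.0})
print(majority_order, minority_order)
```

Following the majority agrees with two annotators out of three, following the minority with one out of three, so no choice of scores does better than 2/3 on these rows.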

The reasoning is easier when all three rows are ordered the same way. In that case the best possible average score is 1, which is again their probability:

```
less_toxic more_toxic target proba
comment1   comment2   1      1
comment1   comment2   1      1
comment1   comment2   1      1
```

Therefore, the average score over all rows cannot exceed the average of the row probabilities.
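The same argument can be written for any number of annotations per pair. The notation here is introduced only for illustration and assumes a model that gives each comment a single score: let pair $i$ have $n_i$ annotations, $k_i$ of them in one order and $n_i - k_i$ in the other, and let $N = \sum_i n_i$ be the total number of rows. The model orders pair $i$ the same way in all $n_i$ rows, so it agrees with at most $\max(k_i, n_i - k_i)$ of them:

$$
\text{score} \;\le\; \frac{1}{N} \sum_i \max(k_i,\, n_i - k_i) \;=\; \frac{1}{N} \sum_i n_i \, p_i,
\qquad p_i = \frac{\max(k_i,\, n_i - k_i)}{n_i}
$$

Here $p_i$ is the majority probability of pair $i$, so the right-hand side is the row-weighted mean of those probabilities.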

Let's compute it.

In [None]:
# Keep each pair in its majority order; a pair with proba == 0.5 is kept in both orders.
train = train2[train2.proba >= 0.5].reset_index(drop=True)
train.proba.mean()

This is the best possible score on the validation data. If we assume that the validation data is representative of the private test data (which seems to be the case here), then we also get an upper bound on private leaderboard (LB) scores.
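As a sanity check, the same steps can be run on a tiny made-up dataset where the answer is easy to work out by hand. The frame `toy` and its contents are invented for illustration: one pair is annotated 2 against 1, another 3 against 0, so the best a model can do is agree with 2 + 3 = 5 of the 6 rows.

```python
import pandas as pd

# Made-up annotations: pair (a, b) is split 2-vs-1, pair (c, d) is unanimous.
toy = pd.DataFrame({
    "worker": [1, 2, 3, 4, 5, 6],
    "less_toxic": ["a", "a", "b", "c", "c", "c"],
    "more_toxic": ["b", "b", "a", "d", "d", "d"],
})

# Add every row in reverse order, with target 0 for the reversed copies.
toy_reverse = pd.DataFrame({
    "worker": toy.worker.values,
    "less_toxic": toy.more_toxic.values,
    "more_toxic": toy.less_toxic.values,
})
toy["target"] = 1
toy_reverse["target"] = 0
toy2 = pd.concat([toy, toy_reverse]).reset_index(drop=True)

# Mean target per ordered pair, then keep the majority orientation.
toy2["proba"] = toy2.groupby(["less_toxic", "more_toxic"]).target.transform("mean")
best = toy2[toy2.proba >= 0.5].proba.mean()
print(best)
```

The printed value should match the hand count of 5/6, which is the average of the kept row probabilities (three rows at 2/3 and three rows at 1).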