# Understanding the evaluation metric & CV for this competition

**Update:
The code to calculate CV for all mentioned submissions is added at the end of this notebook.**

It might be somewhat unclear how the results are evaluated in this competition.

The official explanation presents the Mean F1-Score metric as a function of true positives (tp), false positives (fp) and false negatives (fn). However, a multi-label task introduces some uncertainty into the definition of these counts. For example, if the true label is `'scab frog_eye_leaf_spot'` and the submitted prediction is `'scab'`, will this prediction be considered *partially correct* or just *incorrect* for the purpose of evaluation? A few related questions were raised in [this discussion thread](https://www.kaggle.com/c/plant-pathology-2021-fgvc8/discussion/227237), which inspired this notebook.

First of all, I would like to test:
1. Whether there is some value in predicting correctly only some labels, but not all labels, for a particular picture (if the true label is `'scab frog_eye_leaf_spot'`, will we get some credit for predicting just `'scab'` or just `'frog_eye_leaf_spot'`?).
2. Whether the order of predicted classes matters (`'scab frog_eye_leaf_spot'` vs `'frog_eye_leaf_spot scab'`).

With the code in this notebook I made a few submissions in which the prediction for every image in the test set was the same:

| Prediction | Public score |
|---|---|
| `'healthy'` | 0.251 |
| `'scab'` | 0.292 |
| `'scab healthy'` | 0.365 |
| `'healthy scab'` | 0.365 |
| all possible labels | 0.265 |

The implications:
1. **There is a benefit in partially predicting the correct answer.** Otherwise the nonsense prediction `'healthy scab'`, which can never match a true label exactly, would score zero.

2. It looks like **the order of labels does not matter**, because the scores for `'healthy scab'` and `'scab healthy'` are the same.

You might want to use this approach to test other hypotheses about the evaluation.
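To make the two conclusions concrete, here is a minimal sketch of label-level counting for a single image, assuming the evaluator compares predicted and true labels as unordered sets. The helper name `count_labels` is hypothetical, and this illustrates the hypothesis rather than the official implementation.

```python
def count_labels(true_str, pred_str):
    """Count tp, fp, fn for one image, treating labels as unordered sets."""
    true_set = set(true_str.split())
    pred_set = set(pred_str.split())
    tp = len(true_set & pred_set)  # predicted and actually present
    fp = len(pred_set - true_set)  # predicted but not present
    fn = len(true_set - pred_set)  # present but not predicted
    return tp, fp, fn

# Partial prediction: one of the two true labels is found.
partial = count_labels('scab frog_eye_leaf_spot', 'scab')

# Same labels in a different order give identical counts.
order_a = count_labels('scab', 'scab healthy')
order_b = count_labels('scab', 'healthy scab')

print(partial)             # (1, 0, 1)
print(order_a == order_b)  # True
```

Under this assumption a partial prediction still contributes a true positive, and swapping the label order cannot change the score.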

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

Let's list all the label combinations in the train set, with their counts.

In [None]:
train = pd.read_csv('../input/plant-pathology-2021-fgvc8/train.csv')
train.labels.value_counts()

A prediction with all possible labels would look like this:

In [None]:
s = set()
for k in train.labels.value_counts().index:
    for c in k.split(' '):
        s.add(c)
' '.join(sorted(s))

Now let's create a few submissions with the same prediction for every image.
* Submission1: `'healthy'`
* Submission2: `'scab'`
* Submission3: `'scab healthy'`
* Submission4: `'healthy scab'`
* Submission5: `'cider_apple_rust complex frog_eye_leaf_spot healthy powdery_mildew rust scab'`

In [None]:
TEST_FOLDER = '../input/plant-pathology-2021-fgvc8/test_images/'
images = os.listdir(TEST_FOLDER)
sub = pd.DataFrame(images, columns=['image'])

In [None]:
# sub['labels'] = 'healthy' # score 0.251
# sub['labels'] = 'scab' # score 0.292
# sub['labels'] = 'scab healthy' # score 0.365
sub['labels'] = 'healthy scab' # score 0.365
# sub['labels'] = 'cider_apple_rust complex frog_eye_leaf_spot healthy powdery_mildew rust scab' # score 0.265

sub.to_csv("submission.csv", index=False)

# Calculating CV score from train

Here is [an article](https://medium.com/synthesio-engineering/precision-accuracy-and-f1-score-for-multi-label-classification-34ac6bdfb404) that explains how the Mean F1-Score metric applies to multi-label classification, including the difference between 'macro' and 'micro' averaging. The article claims that macro-averaging is to be preferred over micro-averaging.

At the moment, I am not sure which version ('macro' or 'micro') is used in the evaluation. However, macro-averaging is problematic here because it is not defined for some cases mentioned in this notebook. Consider Submission1, where the prediction for every image is `'healthy'`. Macro-averaging requires calculating precision and recall for every label, but all labels except `'healthy'` have tp=0 and fp=0, so p=tp/(tp+fp) is undefined.
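The undefined case can be reproduced on a toy example. The sketch below uses three made-up images (not competition data) and a constant `'healthy'` prediction. The helper `per_label_precision` is hypothetical and returns `None` wherever tp + fp = 0.

```python
def per_label_precision(true_rows, pred_rows, labels):
    """Return {label: precision or None}; None marks the undefined 0/0 case."""
    result = {}
    for label in labels:
        tp = fp = 0
        for true_str, pred_str in zip(true_rows, pred_rows):
            predicted = label in pred_str.split()
            present = label in true_str.split()
            if predicted and present:
                tp += 1
            elif predicted:
                fp += 1
        # Precision only exists if the label was predicted at least once.
        result[label] = tp / (tp + fp) if tp + fp > 0 else None
    return result

# Toy data, invented for illustration only.
toy_true = ['healthy', 'scab', 'scab frog_eye_leaf_spot']
toy_pred = ['healthy'] * len(toy_true)

precisions = per_label_precision(
    toy_true, toy_pred, ['healthy', 'scab', 'frog_eye_leaf_spot'])
print(precisions)
```

Only `'healthy'` gets a numeric precision here; the other labels are never predicted. Libraries typically resolve this by substituting a fixed value for the undefined term (scikit-learn's `f1_score`, for instance, exposes a `zero_division` argument), which is a convention rather than something the formula dictates.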

That is why I try to use micro-averaging here, which is properly defined for all cases mentioned above. Micro-averaging requires calculating the sum of tp, fp and fn across all classes. Then the formula described in the competition overview is applied: 
$$F1 = 2\frac{p \cdot r}{p+r}\ \ \mathrm{where}\ \ p = \frac{tp}{tp+fp},\ \ r = \frac{tp}{tp+fn}$$


In [None]:
predictions = [
    'healthy',
    'scab', 
    'scab healthy',
    'healthy scab',
    'cider_apple_rust complex frog_eye_leaf_spot healthy powdery_mildew rust scab',
]
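# Every image gets the same prediction, so each count is a sum over labels of
# the number of train rows that contain (or lack) that label.
# `s` is the set of all individual labels built in an earlier cell.
for prediction in predictions:
    prediction_labels = set(prediction.split(' '))
    # tp: predicted label is present in the row
    tp = sum([sum(train.labels.map(lambda x: c in x.split(' '))) for c in prediction_labels])
    # fp: predicted label is absent from the row
    fp = sum([sum(train.labels.map(lambda x: c not in x.split(' '))) for c in prediction_labels])
    # fn: label present in the row but not predicted
    fn = sum([sum(train.labels.map(lambda x: c in x.split(' '))) for c in s if c not in prediction_labels])
    p = tp/(tp+fp)
    r = tp/(tp+fn)
    f1 = 2*p*r/(p+r)
    print('{}: tp={}, fp={}, fn={}, p={:.3f}, r={:.3f}, f1={:.3f}'.format(
        prediction, tp, fp, fn, p, r, f1
    ))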
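The loop above only handles a constant prediction. For a real validation fold, where each image has its own prediction, the same micro-averaged counts can be accumulated row by row. The following is a sketch under the same micro-averaging assumption; `micro_f1` is a hypothetical helper and the toy rows are invented for illustration.

```python
def micro_f1(true_rows, pred_rows):
    """Micro-averaged F1 over paired, space-separated label strings."""
    tp = fp = fn = 0
    for true_str, pred_str in zip(true_rows, pred_rows):
        true_set = set(true_str.split())
        pred_set = set(pred_str.split())
        tp += len(true_set & pred_set)
        fp += len(pred_set - true_set)
        fn += len(true_set - pred_set)
    if tp == 0:
        return 0.0  # convention chosen here to avoid 0/0 when nothing matches
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

# Toy data, invented for illustration only.
toy_true = ['healthy', 'scab', 'scab frog_eye_leaf_spot']
toy_pred = ['healthy', 'scab healthy', 'scab']

score = micro_f1(toy_true, toy_pred)
print(round(score, 3))  # 0.75
```

On the toy rows the totals are tp=3, fp=1 and fn=1, so p = r = 0.75. With a constant prediction, for example `micro_f1(train.labels, ['healthy scab'] * len(train))`, it should reproduce the corresponding value from the loop above.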
    

Let's put the results into a table:

| Prediction | Public score | CV |
|---|---|---|
| `'healthy'` | 0.251 | 0.238 |
| `'scab'` | 0.292 | 0.294 |
| `'scab healthy'` | 0.365 | 0.360 |
| `'healthy scab'` | 0.365 | 0.360 |
| all possible labels | 0.265 | 0.268 |

This does not look bad at all: the micro-averaged CV is within about 0.013 of the public score for every submission. Probably even too good to be true.