## Evaluation

In this notebook, we read in a submission file and calculate the success@5 and NDCG@5 scores against a ground truth file. Success@5 is a simple order-independent metric, typically used when there is only one true match among all the candidates. However, we do want to encourage participants to take order into account, so we also include the NDCG@5 score, a widely used metric for ranking tasks.
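
To make the two metrics concrete, here is a minimal sketch that assumes binary relevance and exactly one true match per item. Under that assumption the ideal DCG is 1, so NDCG@5 reduces to `1 / log2(rank + 1)`, where `rank` is the 1-based position of the correct match. The helper name `single_match_scores` is hypothetical and is not part of the evaluation code.

```python
import math


def single_match_scores(rank, k=5):
    """Return (success@k, NDCG@k) when the only correct match is at 1-based `rank`.

    `rank=None` means the correct match was not predicted at all.
    Hypothetical helper for illustration only.
    """
    if rank is None or rank > k:
        return 0, 0.0
    # One relevant item: ideal DCG is 1, so NDCG equals the discounted gain.
    return 1, 1.0 / math.log2(rank + 1)


for rank in (1, 2, 3, 5, None):
    success, ndcg = single_match_scores(rank)
    print(rank, success, round(ndcg, 3))
# 1 1 1.0
# 2 1 0.631
# 3 1 0.5
# 5 1 0.387
# None 0 0.0
```

Success@5 is the same for any correct match within the top five, while NDCG@5 rewards placing it closer to the top.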

Notice that there is a column called `confidence` in the sample submission. This is the probability for each prediction. Although it is not used to calculate the NDCG or Success score, we would really appreciate it if you could include it in your submission, because it will help Westat staff better review the results.

We currently plan on using these measures, suitable for ranking problems, for model evaluation. Participating teams will be notified of any changes in the future.

In [None]:
import pandas as pd
import numpy as np

In [None]:
gen_pt = 'sample_submission.csv'
ref_pt = 'fake_ground_truth.csv'

In [None]:
submission = pd.read_csv(gen_pt, dtype=str)
ground_truth = pd.read_csv(ref_pt, dtype=str)

In [None]:
%%time
submission = submission.drop_duplicates()

In [None]:
%%time
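# Outer merge, then keep only rows with a ground-truth label (ec_y).
# Ground-truth UPCs with no prediction stay in the frame and score 0.
df = pd.merge(submission, ground_truth, how='outer', on='upc')
df = df[df['ec_y'].notna()]
# 1 where the predicted code (ec_x) equals the true code (ec_y), else 0.
df['correct'] = df['ec_x'] == df['ec_y']
df['correct'] = df['correct'].astype(int)
# Collapse to one row per UPC; each column becomes a list of that UPC's rows.
df = df.groupby('upc').agg(lambda x: x.tolist()).reset_index()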
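
As a rough illustration of what this step produces, the sketch below runs the same merge and filter on made-up data. The values are invented, the column names `upc` and `ec` are inferred from the merge above, and for brevity only the `correct` column is aggregated (with `agg(list)`) rather than every column.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real files.
toy_submission = pd.DataFrame({
    "upc": ["001", "001", "001", "002"],
    "ec": ["A", "B", "C", "X"],
})
toy_truth = pd.DataFrame({
    "upc": ["001", "002", "003"],
    "ec": ["B", "Y", "Z"],
})

toy = pd.merge(toy_submission, toy_truth, how="outer", on="upc")
toy = toy[toy["ec_y"].notna()]
toy["correct"] = (toy["ec_x"] == toy["ec_y"]).astype(int)

# One list of 0/1 flags per UPC, in the order the predictions were submitted.
correct_by_upc = toy.groupby("upc")["correct"].agg(list).to_dict()
print(correct_by_upc)
# UPC 001 -> [0, 1, 0]  (correct code at rank 2)
# UPC 002 -> [0]        (one wrong prediction)
# UPC 003 -> [0]        (no prediction submitted)
```

UPC `003` has no prediction at all, yet it still appears with a single 0, so missing UPCs lower the average score.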

In [None]:
def dcg_at_k(r, k=5):
    """
    Args:
        r: Relevance scores list (binary value) in rank order
        k: Number of results to consider

    Returns:
        Discounted Cumulative Gain
    """
    r = np.asarray(r, dtype=float)[:k]  # np.asfarray was removed in NumPy 2.0
    if r.size:
        return np.sum(r / np.log2(np.arange(2, r.size + 2))) 
    return 0.


def ndcg_at_k(r, k=5):
    """
    Args:
        r: Relevance scores list (binary value) in rank order
        k: Number of results to consider
    Returns:
        Normalized Discounted Cumulative Gain
    """
    dcg_max = dcg_at_k(sorted(r, reverse=True), k)
    if not dcg_max:
        return 0.
    return dcg_at_k(r, k) / dcg_max


def success_at_k(r, k=5):
    """
    Args:
        r: Correct match list (binary value) in rank order
        k: Number of results to consider
    Returns:
        Number of correct matches among the top k results
        (0 or 1 when each item has a single true match)
    """
    return np.sum(r[:k])

def padding(r, k=5):
    """Pad r in place with zeros until it has at least k entries, then return it."""
    while len(r) < k:
        r.append(0)
    return r

In [None]:
%%time
# Pad so that every UPC has at least five predictions.
# UPCs with fewer than five predictions are filled with 0 to keep the scoring consistent.
df['correct'] = df['correct'].apply(padding)
df['ndcg@5'] = df['correct'].apply(ndcg_at_k)
df['s@5'] = df['correct'].apply(success_at_k)
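
As a quick sanity check, the sketch below suggests that padding only standardizes the list length: trailing zeros add nothing to the DCG or to the success count. `ndcg_at_5` is a compact, hypothetical restatement of the NDCG logic above for binary relevance, not the notebook's own function.

```python
import numpy as np


def ndcg_at_5(r):
    """Compact NDCG@5 for a list of binary relevance flags (illustrative only)."""
    def dcg(x):
        x = np.asarray(x, dtype=float)[:5]
        if not x.size:
            return 0.0
        return float(np.sum(x / np.log2(np.arange(2, x.size + 2))))

    ideal = dcg(sorted(r, reverse=True))
    return dcg(r) / ideal if ideal else 0.0


short = [0, 1]             # two predictions, correct one at rank 2
padded = short + [0] * 3   # the same list padded to five entries

print(ndcg_at_5(short), ndcg_at_5(padded))   # identical NDCG@5 values
print(sum(short[:5]), sum(padded[:5]))       # identical success counts
```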

In [None]:
# Here is an example of the scoring details
df.head()

In [None]:
%%time
ndcg_score = df['ndcg@5'].mean()
success_score = df['s@5'].mean()

In [None]:
print("The NDCG@5 score is: {}".format(round(ndcg_score, 3)))
print("The Success@5 score is: {}".format(round(success_score, 3)))

## Extra Help Notes

To help with the evaluation process, we have included a Python script in this folder. To use it, run the following on the command line:
 - `python evaluation_script.py <submission_csv_file_path> <ground_truth_csv_file_path>`  

If you don't want to write output files every time and would rather see the scores inside your notebook, you can import the `evaluate` function from the script and use the following code.

In [None]:
from evaluation_script import evaluate

In [None]:
evaluate(submission, ground_truth)

Please let us know if you find any issues with the evaluation script. We will keep you posted on any future updates.