# Faster Metric Computation

At the end of this notebook you can find a small change that doubles the speed of the code.
Otherwise the notebook is a verbatim copy of [CPMP's notebook](https://www.kaggle.com/cpmpml/faster-metric-computation).

Rob Mulla @robikscube wrote code to compute the evaluation metric in this [notebook](https://www.kaggle.com/robikscube/student-writing-competition-twitch). This is extremely valuable because metric computation is often tricky, and finding the right Python code for it can be time-consuming. Rob saved lots of time for many here.

Version 3: After I published my notebook, Rob commented this:

*Just a heads up that you might be using an earlier version of my code from before the host clarified that the evaluation metric is indeed macro_f1. If you look at the latest version of my notebook the original score_feedback_comp function is replaced with score_feedback_comp_micro and the new score_feedback_comp scores the macro f1 based off each class. You can still use the older version but will just need to call it on each class individually and then average.*

Here is an updated version which, hopefully, takes Rob's comment into account.

Version 4: typo fixes.

Version 5: I moved the prediction string split before the outer join. It speeds up the code further.

Version 6: I added a variant of the code which gets a perfect score of 1 when ground truth is used as prediction.

Version 7: Some typo fixes and further optimisation of the variant code.

Version 8: There is no need to explicitly compute the match between predictions and ground truth to compute the score. Our perfect score code benefits from it.

Version 9: Typo fixes.
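
Rob's remark about macro versus micro F1 can be made concrete with a small sketch. The per-class counts below are made up for illustration, and `f1_from_counts` is a hypothetical helper name; only the F1 formula itself is taken from the code in this notebook.

```python
# Hypothetical per-class counts (TP, FP, FN); the numbers are made up.
counts = {
    "Claim": (8, 2, 2),
    "Evidence": (5, 5, 0),
    "Lead": (1, 0, 3),
}

def f1_from_counts(tp, fp, fn):
    """F1 for one set of counts: TP / (TP + 0.5 * (FP + FN))."""
    return tp / (tp + 0.5 * (fp + fn))

# Macro F1: score each class on its own, then take the unweighted mean.
class_f1 = {name: f1_from_counts(*c) for name, c in counts.items()}
macro_f1 = sum(class_f1.values()) / len(class_f1)

# Micro F1: pool the counts over all classes first, then score once.
tp, fp, fn = (sum(col) for col in zip(*counts.values()))
micro_f1 = f1_from_counts(tp, fp, fn)

print(class_f1)
print(f"macro: {macro_f1:.4f}  micro: {micro_f1:.4f}")
```

With these counts the small, poorly scored class pulls the macro average down, while it barely affects the pooled micro score.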

Let me reproduce Rob's code to start with.

In [None]:
import pandas as pd
import numpy as np

# from Rob Mulla @robikscube
# https://www.kaggle.com/robikscube/student-writing-competition-twitch
def calc_overlap(row):
    """
    Calculates the overlap between prediction and
    ground truth and overlap percentages used for determining
    true positives.
    """
    set_pred = set(row.predictionstring_pred.split(" "))
    set_gt = set(row.predictionstring_gt.split(" "))
    # Length of each and intersection
    len_gt = len(set_gt)
    len_pred = len(set_pred)
    inter = len(set_gt.intersection(set_pred))
    overlap_1 = inter / len_gt
    overlap_2 = inter / len_pred
    return [overlap_1, overlap_2]

def score_feedback_comp_micro(pred_df, gt_df):
    """
    A function that scores for the kaggle
        Student Writing Competition

    Uses the steps in the evaluation page here:
        https://www.kaggle.com/c/feedback-prize-2021/overview/evaluation
    """
    gt_df = (
        gt_df[["id", "discourse_type", "predictionstring"]]
        .reset_index(drop=True)
        .copy()
    )
    pred_df = pred_df[["id", "class", "predictionstring"]].reset_index(drop=True).copy()
    pred_df["pred_id"] = pred_df.index
    gt_df["gt_id"] = gt_df.index
    # Step 1. all ground truths and predictions for a given class are compared.
    joined = pred_df.merge(
        gt_df,
        left_on=["id", "class"],
        right_on=["id", "discourse_type"],
        how="outer",
        suffixes=("_pred", "_gt"),
    )
    joined["predictionstring_gt"] = joined["predictionstring_gt"].fillna(" ")
    joined["predictionstring_pred"] = joined["predictionstring_pred"].fillna(" ")

    joined["overlaps"] = joined.apply(calc_overlap, axis=1)

    # 2. If the overlap between the ground truth and prediction is >= 0.5,
    # and the overlap between the prediction and the ground truth >= 0.5,
    # the prediction is a match and considered a true positive.
    # If multiple matches exist, the match with the highest pair of overlaps is taken.
    joined["overlap1"] = joined["overlaps"].apply(lambda x: eval(str(x))[0])
    joined["overlap2"] = joined["overlaps"].apply(lambda x: eval(str(x))[1])

    joined["potential_TP"] = (joined["overlap1"] >= 0.5) & (joined["overlap2"] >= 0.5)
    joined["max_overlap"] = joined[["overlap1", "overlap2"]].max(axis=1)
    tp_pred_ids = (
        joined.query("potential_TP")
        .sort_values("max_overlap", ascending=False)
        .groupby(["id", "predictionstring_gt"])
        .first()["pred_id"]
        .values
    )

    # 3. Any unmatched ground truths are false negatives
    # and any unmatched predictions are false positives.
    fp_pred_ids = [p for p in joined["pred_id"].unique() if p not in tp_pred_ids]

    matched_gt_ids = joined.query("potential_TP")["gt_id"].unique()
    unmatched_gt_ids = [c for c in joined["gt_id"].unique() if c not in matched_gt_ids]

    # Get numbers of each type
    TP = len(tp_pred_ids)
    FP = len(fp_pred_ids)
    FN = len(unmatched_gt_ids)
    # calc microf1
    my_f1_score = TP / (TP + 0.5 * (FP + FN))
    return my_f1_score

def score_feedback_comp(pred_df, gt_df, return_class_scores=False):
    class_scores = {}
    pred_df = pred_df[["id", "class", "predictionstring"]].reset_index(drop=True).copy()
    for discourse_type, gt_subset in gt_df.groupby("discourse_type"):
        pred_subset = (
            pred_df.loc[pred_df["class"] == discourse_type]
            .reset_index(drop=True)
            .copy()
        )
        class_score = score_feedback_comp_micro(pred_subset, gt_subset)
        class_scores[discourse_type] = class_score
    f1 = np.mean([v for v in class_scores.values()])
    if return_class_scores:
        return f1, class_scores
    return f1


Let's apply it to the ground truth to see how much time it takes.

In [None]:
train_df = pd.read_csv('../input/feedback-prize-2021/train.csv')
pred_df = train_df[['id','discourse_type','predictionstring']].copy()
pred_df.columns = ['id','class','predictionstring']

score_feedback_comp(pred_df, train_df)

We look at why it isn't equal to 1 later in this notebook. For now we are rather interested in the time the metric computation takes.

In [None]:
%timeit score_feedback_comp(pred_df, train_df)

OK, less than a minute, this is probably good enough if we run this code once per epoch.

Let's run it on a sample of the ground truth as well, to get a second comparison point.

In [None]:
pred_df2 = pred_df.sample(frac=0.7).reset_index(drop=True)
score_feedback_comp(pred_df2, train_df)

In [None]:
%timeit score_feedback_comp(pred_df2, train_df)

Can we do better? I think we can, as shown below. We can remove unnecessary data frame copies, list constructions, and `fillna` calls, and, most importantly, avoid the `apply` function. Row-wise `apply` is very convenient, but it is extremely slow because it calls a Python function once per row. We can also split strings only once, before the outer join.
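
To get a feel for the cost of row-wise `apply`, here is a toy comparison on made-up data. The frame, the column names, and the helper `shared_words` are assumptions for illustration only; the timings depend on the machine and the pandas version, so none are quoted here.

```python
import timeit
import pandas as pd

# Toy frame: 2,000 rows of space-separated word indices (made-up data).
df = pd.DataFrame({
    "pred": ["1 2 3 4"] * 2000,
    "gt": ["3 4 5 6"] * 2000,
})

def shared_words(pred, gt):
    """Number of word indices that two prediction strings have in common."""
    return len(set(pred.split(" ")) & set(gt.split(" ")))

def with_apply():
    # Row-wise apply builds a Series object for every row before calling the function.
    return df.apply(lambda row: shared_words(row["pred"], row["gt"]), axis=1).tolist()

def with_zip():
    # Iterating over the two columns directly skips the per-row Series.
    return [shared_words(p, g) for p, g in zip(df["pred"], df["gt"])]

same = with_apply() == with_zip()
t_apply = timeit.timeit(with_apply, number=3)
t_zip = timeit.timeit(with_zip, number=3)
print(same, f"apply: {t_apply:.3f}s  zip: {t_zip:.3f}s")
```

Both variants return the same values; only the way the rows are visited differs.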

Let's do this.

In [None]:
def calc_overlap2(set_pred, set_gt):
    """
    Calculates the overlap between prediction and
    ground truth and overlap percentages used for determining
    true positives.
    """
    # Length of each and intersection
    try:
        len_gt = len(set_gt)
        len_pred = len(set_pred)
        inter = len(set_gt & set_pred)
        overlap_1 = inter / len_gt
        overlap_2 = inter / len_pred
        return (overlap_1, overlap_2)
    except (TypeError, ZeroDivisionError):  # at least one of the inputs is NaN (row without a match in the outer join)
        return (0, 0)

def score_feedback_comp_micro2(pred_df, gt_df, discourse_type):
    """
    A function that scores for the kaggle
        Student Writing Competition
        
    Uses the steps in the evaluation page here:
        https://www.kaggle.com/c/feedback-prize-2021/overview/evaluation
    """
    gt_df = gt_df.loc[gt_df['discourse_type'] == discourse_type, 
                      ['id', 'predictionstring']].reset_index(drop=True)
    pred_df = pred_df.loc[pred_df['class'] == discourse_type,
                      ['id', 'predictionstring']].reset_index(drop=True)
    pred_df['pred_id'] = pred_df.index
    gt_df['gt_id'] = gt_df.index
    pred_df['predictionstring'] = [set(pred.split(' ')) for pred in pred_df['predictionstring']]
    gt_df['predictionstring'] = [set(pred.split(' ')) for pred in gt_df['predictionstring']]
    
    # Step 1. all ground truths and predictions for a given class are compared.
    joined = pred_df.merge(gt_df,
                           left_on='id',
                           right_on='id',
                           how='outer',
                           suffixes=('_pred','_gt')
                          )
    overlaps = [calc_overlap2(*args) for args in zip(joined.predictionstring_pred, 
                                                     joined.predictionstring_gt)]
    
    # 2. If the overlap between the ground truth and prediction is >= 0.5, 
    # and the overlap between the prediction and the ground truth >= 0.5,
    # the prediction is a match and considered a true positive.
    # If multiple matches exist, the match with the highest pair of overlaps is taken.
    joined['potential_TP'] = [(overlap[0] >= 0.5 and overlap[1] >= 0.5) \
                              for overlap in overlaps]
    joined['max_overlap'] = [max(*overlap) for overlap in overlaps]
    joined_tp = joined.query('potential_TP').reset_index(drop=True)
    tp_pred_ids = joined_tp\
        .sort_values('max_overlap', ascending=False) \
        .groupby(['id','gt_id'])['pred_id'].first()

    # 3. Any unmatched ground truths are false negatives
    # and any unmatched predictions are false positives.
    fp_pred_ids = set(joined['pred_id'].unique()) - set(tp_pred_ids)

    matched_gt_ids = joined_tp['gt_id'].unique()
    unmatched_gt_ids = set(joined['gt_id'].unique()) -  set(matched_gt_ids)

    # Get numbers of each type
    TP = len(tp_pred_ids)
    FP = len(fp_pred_ids)
    FN = len(unmatched_gt_ids)
    #calc microf1
    my_f1_score = TP / (TP + 0.5*(FP+FN))
    return my_f1_score

def score_feedback_comp2(pred_df, gt_df, return_class_scores=False):
    class_scores = {}
    for discourse_type in gt_df.discourse_type.unique():
        class_score = score_feedback_comp_micro2(pred_df, gt_df, discourse_type)
        class_scores[discourse_type] = class_score
    f1 = np.mean([v for v in class_scores.values()])
    if return_class_scores:
        return f1, class_scores
    return f1


Is this code computing the same value as Rob's code?

In [None]:
score_feedback_comp2(pred_df, train_df)

In [None]:
score_feedback_comp2(pred_df2, train_df)

It seems so.  Is it faster?

In [None]:
%timeit score_feedback_comp2(pred_df, train_df)

In [None]:
%timeit score_feedback_comp2(pred_df2, train_df)

It is about 5 times faster.

## Why Not A Perfect Score?

I was a bit puzzled as to why we don't get a score of 1 when using ground truth as prediction. Looking at the details, I found that FP is 3 in that case. These prediction ids are declared false positives: `{13933, 30714, 143325}`

Further investigation led to three cases of significant overlap in prediction strings in different rows:

In [None]:
train_df.iloc[13933:13935]

In [None]:
train_df.iloc[30713:30715]

In [None]:
train_df.iloc[143325:143327]

Let's look at the first example.

In [None]:
filename = '../input/feedback-prize-2021/train/96948C0AFC15.txt'
with open(filename, "r") as f:
    text = f.read()
print(text)

The issue is that some words are not separated by whitespace, for instance in "more money saved,more space,pollution, and no traffic." at the end of the first paragraph. The way prediction strings are created is a bit weird when that happens. To know more, you can have a look at this post by Chris Deotte @cdeotte: https://www.kaggle.com/c/feedback-prize-2021/discussion/297591
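
As a small illustration, assuming that word indices come from whitespace splitting (my reading of the competition's evaluation page), the fragment quoted above tokenizes as follows. This only shows how the glued words are tokenized; it does not reproduce how the overlapping prediction strings were generated.

```python
# The sentence fragment quoted above, with the missing spaces kept as-is.
fragment = "more money saved,more space,pollution, and no traffic."

# Whitespace splitting keeps "saved,more" and "space,pollution," as single tokens.
tokens = fragment.split()
print(len(tokens), tokens)
```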

The bottom line is that Rob's code is correct when the input is correct: the issue happens when there is some overlapping ground truth across rows. A possible cure is presented below.
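
To make "overlapping ground truth across rows" concrete, here is a toy sketch with two made-up spans, one contained in the other. It reuses the overlap definition from `calc_overlap`; the spans and the helper name `overlaps` are hypothetical, and I have not checked that the three rows above fail in exactly this way.

```python
def overlaps(pred, gt):
    """Return (share of gt covered by pred, share of pred covered by gt)."""
    inter = len(pred & gt)
    return inter / len(gt), inter / len(pred)

# Two hypothetical ground-truth spans of the same class in the same essay;
# span "a" is fully contained in span "b".
spans = {
    "a": {"10", "11", "12"},
    "b": {"10", "11", "12", "13"},
}

# Ground truth is used as prediction, so each span plays both roles.
rows = []
for gt_name, gt in spans.items():
    for pred_name, pred in spans.items():
        o1, o2 = overlaps(pred, gt)
        is_candidate = o1 >= 0.5 and o2 >= 0.5
        rows.append((gt_name, pred_name, is_candidate, max(o1, o2)))
        print(f"gt {gt_name} vs pred {pred_name}: overlaps=({o1:.2f}, {o2:.2f}) "
              f"candidate={is_candidate} max_overlap={max(o1, o2):.2f}")
```

All four pairings are candidates with a `max_overlap` of 1, so sorting by `max_overlap` cannot prefer the exact match. In such a tie, one prediction may be selected for both ground truths while the other is counted as a false positive.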

## A Perfect Score Code

A further optimization does not compute FP and FN values explicitly. Instead, it uses the total number of predictions and the total number of ground truth rows. It also removes the computation of the match between ground truth and predictions. This code is faster than our code above, but it computes values that differ slightly from those of Rob's code. The good news is that it does compute a value of 1 when the ground truth is used as prediction.
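
The reason totals are enough is a simple identity: with TP + FP equal to the number of predictions and TP + FN equal to the number of ground truth rows, TP / (TP + 0.5 * (FP + FN)) equals 2 * TP / (n_pred + n_gt). Below is a quick numeric check on made-up counts; the function names are hypothetical.

```python
def f1_explicit(tp, fp, fn):
    """The form used so far: TP / (TP + 0.5 * (FP + FN))."""
    return tp / (tp + 0.5 * (fp + fn))

def f1_from_totals(tp, n_pred, n_gt):
    """Same value from totals, where n_pred = TP + FP and n_gt = TP + FN."""
    return 2 * tp / (n_pred + n_gt)

# Made-up counts for one class.
tp, fp, fn = 7, 3, 2
a = f1_explicit(tp, fp, fn)
b = f1_from_totals(tp, n_pred=tp + fp, n_gt=tp + fn)
print(a, b)
```

The two forms agree whenever the counts are consistent, so any difference from Rob's scores comes from how TP is counted, not from the formula.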

In [None]:
def calc_overlap3(set_pred, set_gt):
    """
    Calculates if the overlap between prediction and
    ground truth is enough for a potential True positive
    """
    # Length of each and intersection
    try:
        len_gt = len(set_gt)
        len_pred = len(set_pred)
        inter = len(set_gt & set_pred)
        overlap_1 = inter / len_gt
        overlap_2 = inter / len_pred
        return overlap_1 >= 0.5 and overlap_2 >= 0.5
    except (TypeError, ZeroDivisionError):  # at least one of the inputs is NaN (row without a match in the outer join)
        return False

def score_feedback_comp_micro3(pred_df, gt_df, discourse_type):
    """
    A function that scores for the kaggle
        Student Writing Competition
        
    Uses the steps in the evaluation page here:
        https://www.kaggle.com/c/feedback-prize-2021/overview/evaluation
    """
    gt_df = gt_df.loc[gt_df['discourse_type'] == discourse_type, 
                      ['id', 'predictionstring']].reset_index(drop=True)
    pred_df = pred_df.loc[pred_df['class'] == discourse_type,
                      ['id', 'predictionstring']].reset_index(drop=True)
    pred_df['pred_id'] = pred_df.index
    gt_df['gt_id'] = gt_df.index
    pred_df['predictionstring'] = [set(pred.split(' ')) for pred in pred_df['predictionstring']]
    gt_df['predictionstring'] = [set(pred.split(' ')) for pred in gt_df['predictionstring']]
    
    # Step 1. all ground truths and predictions for a given class are compared.
    joined = pred_df.merge(gt_df,
                           left_on='id',
                           right_on='id',
                           how='outer',
                           suffixes=('_pred','_gt')
                          )
    overlaps = [calc_overlap3(*args) for args in zip(joined.predictionstring_pred, 
                                                     joined.predictionstring_gt)]
    
    # 2. If the overlap between the ground truth and prediction is >= 0.5, 
    # and the overlap between the prediction and the ground truth >= 0.5,
    # the prediction is a match and considered a true positive.
    # If multiple matches exist, the match with the highest pair of overlaps is taken.
    # we don't need to compute the match to compute the score
    TP = joined.loc[overlaps]['gt_id'].nunique()

    # 3. Any unmatched ground truths are false negatives
    # and any unmatched predictions are false positives.
    TPandFP = len(pred_df)
    TPandFN = len(gt_df)
    
    #calc microf1
    my_f1_score = 2*TP / (TPandFP + TPandFN)
    return my_f1_score

def score_feedback_comp3(pred_df, gt_df, return_class_scores=False):
    class_scores = {}
    for discourse_type in gt_df.discourse_type.unique():
        class_score = score_feedback_comp_micro3(pred_df, gt_df, discourse_type)
        class_scores[discourse_type] = class_score
    f1 = np.mean([v for v in class_scores.values()])
    if return_class_scores:
        return f1, class_scores
    return f1


In [None]:
score_feedback_comp3(pred_df, train_df)

In [None]:
score_feedback_comp3(pred_df2, train_df)

In [None]:
%timeit score_feedback_comp3(pred_df, train_df)

In [None]:
%timeit score_feedback_comp3(pred_df2, train_df)

Whether to use the last version or not is left to the reader...

Once again, let me thank Rob for his code. The fact that his code could be made faster does not mean that his code was bad at all. First, his code was correct, and it was running fast enough to be used in practice. Second, my understanding is that he wrote his code live in a Twitch stream. I am not sure I would have written correct code in that setting. Offline optimization of already correct code is much easier.

That said, I hope the optimized code can still be useful to some in this competition.

# A Further Speed-Up

The only change below is that set operations can be avoided, because each `predictionstring` contains consecutive integers, that is, an interval of word indices. Converting only the first and the last element to a number therefore gives all the information needed. This doubles the speed. If you want to check the performance of your model, you can speed it up further by avoiding the generation of the prediction string and by preprocessing the ground truth.
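
A minimal sketch of why the first and last index suffice, assuming the indices in each string are contiguous. `get_range` mirrors the helper defined in the next cell; `inter_by_sets`, `inter_by_ranges`, and the example strings are made up for illustration.

```python
def get_range(pred_str):
    """Half-open (start, end) interval from the first and last word index."""
    first = pred_str.split(" ", 1)[0]
    last = pred_str.rsplit(" ", 1)[-1]
    return int(first), int(last) + 1

def inter_by_sets(pred_str, gt_str):
    """Intersection size computed the set-based way."""
    return len(set(pred_str.split(" ")) & set(gt_str.split(" ")))

def inter_by_ranges(pred_str, gt_str):
    """Intersection size computed from the two interval end points only."""
    (p0, p1), (g0, g1) = get_range(pred_str), get_range(gt_str)
    return max(0, min(p1, g1) - max(p0, g0))

# Contiguous spans: both methods agree.
pred, gt = "3 4 5 6 7", "5 6 7 8"
print(inter_by_sets(pred, gt), inter_by_ranges(pred, gt))

# A span with a gap (hypothetical): the range length also counts the missing index 4.
gappy = "3 5 6 7"
start, end = get_range(gappy)
print(len(gappy.split(" ")), end - start)
```

The second print shows the caveat: if a prediction string ever skipped an index, the interval length would overstate the number of words, so the shortcut relies on contiguous spans.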

In [None]:
def calc_overlap4(pred_range, gt_range):
    """
    Calculates if the overlap between prediction and
    ground truth is enough for a potential True positive
    """
    # Length of each and intersection
    try:
        len_gt = gt_range[1]-gt_range[0]
        len_pred = pred_range[1]-pred_range[0]
        inter = max(0, min(gt_range[1], pred_range[1])-max(gt_range[0], pred_range[0]))
        overlap_1 = inter / len_gt
        overlap_2 = inter / len_pred
        return overlap_1 >= 0.5 and overlap_2 >= 0.5
    except (TypeError, ZeroDivisionError):  # at least one of the inputs is NaN (row without a match in the outer join)
        return False

def get_range(pred_str):
    return int(pred_str.split(' ',1)[0]), int(pred_str.rsplit(' ', 1)[-1])+1

def score_feedback_comp_micro4(pred_df, gt_df, discourse_type):
    """
    A function that scores for the kaggle
        Student Writing Competition
        
    Uses the steps in the evaluation page here:
        https://www.kaggle.com/c/feedback-prize-2021/overview/evaluation
    """
    gt_df = gt_df.loc[gt_df['discourse_type'] == discourse_type, 
                      ['id', 'predictionstring']].reset_index(drop=True)
    pred_df = pred_df.loc[pred_df['class'] == discourse_type,
                      ['id', 'predictionstring']].reset_index(drop=True)
    pred_df['pred_id'] = pred_df.index
    gt_df['gt_id'] = gt_df.index
    pred_df['pred_range'] = [get_range(pred) for pred in pred_df['predictionstring']]
    gt_df['gt_range'] = [get_range(pred) for pred in gt_df['predictionstring']]

    #pred_df['predictionstring'] = [set(pred.split(' ')) for pred in pred_df['predictionstring']]
    #gt_df['predictionstring'] = [set(pred.split(' ')) for pred in gt_df['predictionstring']]
    
    # Step 1. all ground truths and predictions for a given class are compared.
    joined = pred_df.merge(gt_df,
                           left_on='id',
                           right_on='id',
                           how='outer',
                           suffixes=('_pred','_gt')
                          )
    overlaps = [calc_overlap4(*args) for args in zip(joined.pred_range, 
                                                     joined.gt_range)]
    
    # 2. If the overlap between the ground truth and prediction is >= 0.5, 
    # and the overlap between the prediction and the ground truth >= 0.5,
    # the prediction is a match and considered a true positive.
    # If multiple matches exist, the match with the highest pair of overlaps is taken.
    # we don't need to compute the match to compute the score
    TP = joined.loc[overlaps]['gt_id'].nunique()

    # 3. Any unmatched ground truths are false negatives
    # and any unmatched predictions are false positives.
    TPandFP = len(pred_df)
    TPandFN = len(gt_df)
    
    #calc microf1
    my_f1_score = 2*TP / (TPandFP + TPandFN)
    return my_f1_score

def score_feedback_comp4(pred_df, gt_df, return_class_scores=False):
    class_scores = {}
    for discourse_type in gt_df.discourse_type.unique():
        class_score = score_feedback_comp_micro4(pred_df, gt_df, discourse_type)
        class_scores[discourse_type] = class_score
    f1 = np.mean([v for v in class_scores.values()])
    if return_class_scores:
        return f1, class_scores
    return f1


In [None]:
score_feedback_comp4(pred_df, train_df)

In [None]:
score_feedback_comp4(pred_df2, train_df)

In [None]:
%timeit score_feedback_comp4(pred_df, train_df)

In [None]:
%timeit score_feedback_comp4(pred_df2, train_df)
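
The idea of preprocessing the ground truth, mentioned at the start of this section, could look like the sketch below: compute the interval end points once and reuse them for every evaluation. The helper `add_ranges`, the `start` and `end` column names, and the tiny frame are assumptions for illustration; the scoring functions above would need to be adapted to read these columns instead of parsing `predictionstring` on each call.

```python
import pandas as pd

def add_ranges(df):
    """Return a copy of df with integer 'start' and 'end' columns (end is exclusive)."""
    words = df["predictionstring"].str.split(" ")
    out = df.copy()
    out["start"] = words.str[0].astype(int)
    out["end"] = words.str[-1].astype(int) + 1
    return out

# Tiny made-up ground truth; in this notebook it would be train_df.
gt = pd.DataFrame({
    "id": ["essay1", "essay1"],
    "discourse_type": ["Lead", "Claim"],
    "predictionstring": ["0 1 2 3", "4 5 6"],
})

# Computed once, then reused for every evaluation (for example once per epoch).
gt_ranges = add_ranges(gt)
print(gt_ranges[["start", "end"]].values.tolist())
```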