This notebook explains my understanding of how the F0.5 metric is computed in this competition. I have also posted in the discussion forum (https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/discussion/230457) to confirm this understanding.

There is a class at the end of this notebook which encapsulates the entire evaluation logic.

Please feel free to share your thoughts and questions.

Reference: https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/overview/evaluation

## Imports and utils

In [None]:
import re
import numpy as np
import pandas as pd

In [None]:
# Utils
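def jaccard(str1, str2):
    """Word-level, case-insensitive Jaccard similarity between two strings."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    # |A ∩ B| / |A ∪ B|, with the union size computed as |A| + |B| - |A ∩ B|
    return float(len(c)) / (len(a) + len(b) - len(c))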
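
As a quick sanity check of the word-level Jaccard above, here is a minimal sketch applied to one pair from the sample used later in this notebook. `word_jaccard` is a hypothetical helper that mirrors `jaccard`. The guard for two empty strings is my own assumption and is not part of the function above.

```python
def word_jaccard(str1, str2):
    """Word-level, case-insensitive Jaccard (hypothetical helper mirroring `jaccard`)."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    union = a | b
    # Assumption: two empty strings score 0.0 instead of raising ZeroDivisionError.
    return len(a & b) / len(union) if union else 0.0

score = word_jaccard("postsecondary student", "beginning postsecondary student")
print(round(score, 3))  # 2 shared words out of 3 distinct words
print(score >= 0.5)     # whether the pair meets the 0.5 match threshold
```

Because the comparison is on sets of words, word order and repeated words do not affect the score.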

In [None]:
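# Util for displaying the (ground-truth x prediction) matrix as a DataFrame.
# Note: relies on the globals y_true_labels and y_pred_labels, which are defined further down.
def display_sample_matrix(np_mat):
    return pd.DataFrame(np_mat, columns=y_pred_labels, index=y_true_labels)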

## Sample ground-truth and prediction
Let's use this sample ground truth and set of predictions to walk through the evaluation steps.

In [None]:
# '|'.join(train_df[train_df.Id == 'ecc1b24e-35b3-4b7e-8c54-c9d42bef16a9'].cleaned_label.values)
y_true_str = ('baccalaureate and beyond longitudinal study|baccalaureate and beyond|beginning postsecondary student|education longitudinal study|national education longitudinal study')

y_pred_str = 'postsecondary student|education longitudinal study|xyz'

## Split label-strings into labels

In [None]:
y_true_labels = y_true_str.split('|')
y_pred_labels = sorted(y_pred_str.split('|'))
y_pred_labels

## Compute Jaccard similarity for each prediction & ground-truth pair

In [None]:
# Get Jaccard-matrix with ground-truths as rows and predictions in columns 
gt_p_jaccards = np.array([[jaccard(pred_label, true_label) for pred_label in y_pred_labels] for true_label in y_true_labels])
display_sample_matrix(gt_p_jaccards)

## Get matched predictions matrix

For each ground-truth label, identify the prediction that matches it: the one with the highest Jaccard score, provided that score is at least 0.5. The matches are recorded in a binary matrix.

Result matrix characteristics:
- Ground truths are represented by rows and predictions by columns.
- If a ground truth has a match, its row is a binary array with a 1 at the position of the matched prediction. Otherwise, the row is all 0's.
- If several predictions tie for the best Jaccard score, the one that comes first alphabetically is chosen.
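
The tie-break rule in the last bullet can be illustrated with a small sketch. The ground truth and the two predictions below are made up so that both predictions score the same. `word_jaccard` is a hypothetical helper. The sketch relies on `np.argmax` returning the first maximum, which is the alphabetically earliest prediction once the predictions are sorted.

```python
import numpy as np

def word_jaccard(str1, str2):
    """Word-level, case-insensitive Jaccard (hypothetical helper)."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    return len(a & b) / len(a | b)

truth = "education longitudinal study"
# Made-up predictions: each shares 3 words with the truth and adds 1 extra word.
preds = sorted(["national education longitudinal study",
                "education longitudinal study survey"])

scores = np.array([word_jaccard(truth, p) for p in preds])
best = int(np.argmax(scores))  # first maximum, so the alphabetically earliest on ties
chosen = preds[best] if scores[best] >= 0.5 else None

print(scores)  # both predictions tie
print(chosen)  # the alphabetically earlier prediction wins
```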

In [None]:
def get_matched_preds(jacc_scores):
    true_pred_flags = np.zeros(len(jacc_scores))
    
    best_jacc_score = jacc_scores.max()
    
    if best_jacc_score < 0.5:
        # No match: return an array of zeros.
        # Note: a prediction / ground-truth pair counts as a match only if
        # its Jaccard score is at least 0.5.
        return true_pred_flags
    
    # If there is a match, identify position of best prediction
    best_score_positions = np.nonzero(jacc_scores == best_jacc_score)[0]
    best_score_pos = best_score_positions[0]  # Note: requires columns to be sorted alphabetically
    true_pred_flags[best_score_pos] = 1
    return true_pred_flags
    
matched_preds_mt = np.apply_along_axis(get_matched_preds, 1, gt_p_jaccards)
display_sample_matrix(matched_preds_mt)

## Compute F-Beta
Reference: https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/overview/evaluation

I have a post in the discussion forum - https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/discussion/230457 - to confirm my understanding of the formulae.

**Note:**
- Any matched predictions where the Jaccard score meets or exceeds the threshold of 0.5 are counted as true positives (TP), the remainder as false positives (FP).
- Any unmatched predictions are counted as false positives (FP).
- Any ground truths with no nearest predictions are counted as false negatives (FN).

All TP, FP and FN across all samples are used to calculate a final micro F0.5 score. (Note that a micro F score does precisely this, creating one pool of TP, FP and FN that is used to calculate a score for the entire set of predictions.)
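
For reference, this is the formula as I understand it, written in terms of the pooled counts. $P$ and $R$ are my shorthand for the pooled precision and recall. The last form is just the first with $P$ and $R$ substituted.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_{\beta} = \frac{(1 + \beta^{2})\, P\, R}{\beta^{2} P + R}
          = \frac{(1 + \beta^{2})\, TP}{(1 + \beta^{2})\, TP + \beta^{2}\, FN + FP}
```

With $\beta = 0.5$, a false positive is weighted four times as heavily as a false negative, so precision matters more than recall.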

In [None]:
sample_tp = matched_preds_mt.sum()  # No. of matched predictions and ground-truths
sample_fp = (~matched_preds_mt.any(axis=0)).sum()  # No. of predictions (columns) without any matches
sample_fn = (~matched_preds_mt.any(axis=1)).sum()  # No. of ground-truths (rows) without any matches

Combine the TP, FP and FN of all samples and compute the F0.5 score. There is only one sample here, so each sum has a single term.

In [None]:
tp = sum([sample_tp])
fp = sum([sample_fp])
fn = sum([sample_fn])

precision = tp / (tp + fp)
recall = tp / (tp + fn)

beta = 0.5
fbeta = (1 + beta**2) * precision * recall / ((beta**2 * precision) + recall)
fbeta
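
To make the pooling step concrete, here is a minimal sketch with exact fractions. If I have traced the cells above correctly, the walkthrough sample gives TP = 3, FP = 1 and FN = 2. The second sample's counts are made up purely for illustration, and `fbeta_from_counts` is a hypothetical helper. The macro average is shown only for contrast. As I understand it, the competition uses the micro score.

```python
from fractions import Fraction

def fbeta_from_counts(tp, fp, fn, beta=Fraction(1, 2)):
    """F-beta from pooled counts, using exact fractions (hypothetical helper)."""
    precision = Fraction(tp, tp + fp)
    recall = Fraction(tp, tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# (tp, fp, fn) per sample: the first is the walkthrough sample, the second is made up.
per_sample = [(3, 1, 2), (1, 0, 0)]

# Micro: pool the counts across samples first, then compute a single score.
tp, fp, fn = (sum(col) for col in zip(*per_sample))
micro = fbeta_from_counts(tp, fp, fn)

# Macro (for contrast only): score each sample separately, then average.
macro = sum(fbeta_from_counts(*counts) for counts in per_sample) / len(per_sample)

print(fbeta_from_counts(3, 1, 2))  # walkthrough sample alone: 5/7, about 0.714
print(micro, macro)                # 10/13 and 6/7
```

The two averages differ because the micro score gives every label the same weight, while the macro score gives every sample the same weight.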

## ColeridgeEvaluation class
This class encapsulates all of the evaluation logic above.

In [None]:
class ColeridgeEvaluation:
    @classmethod
    def evaluate_samples(cls, y_true_list, y_pred_list):
        tp_fp_fn_list = []
        
        # Compute TP, FP, FN for each sample
        for sample_y_true, sample_y_pred in zip(y_true_list, y_pred_list):
            sample_tp_fp_fn = cls.evaluate_sample(sample_y_true, sample_y_pred)
            tp_fp_fn_list.append(sample_tp_fp_fn)
        
        # Pool TP, FP and FN across all samples, then compute the micro F0.5
        tp, fp, fn = np.array(tp_fp_fn_list).sum(axis=0)
        return cls.compute_fbeta(tp, fp, fn)
    
    @classmethod
    def evaluate_sample(cls, y_true_str, y_pred_str):
        # Split label-strings into labels
        y_true_labels = y_true_str.split('|')
        y_pred_labels = sorted(y_pred_str.split('|'))
        
        # -- Compute Jaccard similarity for each prediction & ground-truth pair --
        # Get Jaccard-matrix with ground-truths as rows and predictions in columns 
        gt_p_jaccards = np.array([[cls.jaccard(pred_label, true_label) for pred_label in y_pred_labels] 
                                      for true_label in y_true_labels])

        # Binarize matched-predictions
        matched_preds_mt = np.apply_along_axis(cls._get_matched_preds, 1, gt_p_jaccards)
        
        # Return sample's tp, fp, fn
        return cls.compute_sample_cf_metrics(matched_preds_mt)
    
    @classmethod
    def jaccard(cls, str1, str2): 
        a = set(str1.lower().split()) 
        b = set(str2.lower().split())
        c = a.intersection(b)
        return float(len(c)) / (len(a) + len(b) - len(c))
    
    @classmethod
    def _get_matched_preds(cls, jacc_scores):
        true_pred_flags = np.zeros(len(jacc_scores))

        best_jacc_score = jacc_scores.max()

        if best_jacc_score < 0.5:
            # No match: return an array of zeros.
            # Note: a prediction / ground-truth pair counts as a match only if
            # its Jaccard score is at least 0.5.
            return true_pred_flags

        # If there is a match, identify position of the best prediction
        best_score_positions = np.nonzero(jacc_scores == best_jacc_score)[0]
        best_score_pos = best_score_positions[0]  # Get first position of best-score predictions
        true_pred_flags[best_score_pos] = 1
        return true_pred_flags
    
    @classmethod
    def compute_sample_cf_metrics(cls, matched_preds_mt):
        sample_tp = matched_preds_mt.sum()  # No. of matched predictions and ground-truths
        sample_fp = (~matched_preds_mt.any(axis=0)).sum()  # No. of predictions (columns) without any matches
        sample_fn = (~matched_preds_mt.any(axis=1)).sum()  # No. of ground-truths (rows) without any matches
        return sample_tp, sample_fp, sample_fn
        
    @classmethod
    def compute_fbeta(cls, tp, fp, fn, beta=0.5):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)

        fbeta = (1 + beta**2) * precision * recall / ((beta**2 * precision) + recall)
        return fbeta

In [None]:
ColeridgeEvaluation.evaluate_samples([y_true_str], [y_pred_str])
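
As a cross-check, here is a compact pure-Python sketch of the same logic as I understand it. It is not the official scorer. The names `word_jaccard`, `sample_counts` and `micro_fbeta` are hypothetical. Returning 0.0 when there are no counts at all is my own assumption, since the class above would divide by zero in that case.

```python
def word_jaccard(str1, str2):
    """Word-level, case-insensitive Jaccard (hypothetical helper)."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def sample_counts(y_true_str, y_pred_str, threshold=0.5):
    """Return (tp, fp, fn) for one sample of '|'-separated labels."""
    truths = y_true_str.split('|')
    preds = sorted(y_pred_str.split('|'))  # sorted so ties go to the alphabetically first
    matched_preds = set()
    tp = 0
    for truth in truths:
        scores = [word_jaccard(truth, pred) for pred in preds]
        best = max(range(len(preds)), key=lambda i: scores[i])  # first maximum on ties
        if scores[best] >= threshold:
            tp += 1  # one TP per matched ground truth
            matched_preds.add(best)
    fp = len(preds) - len(matched_preds)  # predictions that matched no ground truth
    fn = len(truths) - tp                 # ground truths with no match
    return tp, fp, fn

def micro_fbeta(y_true_list, y_pred_list, beta=0.5):
    """Pool counts over all samples and compute a single F-beta score."""
    tp = fp = fn = 0
    for y_true_str, y_pred_str in zip(y_true_list, y_pred_list):
        s_tp, s_fp, s_fn = sample_counts(y_true_str, y_pred_str)
        tp, fp, fn = tp + s_tp, fp + s_fp, fn + s_fn
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    # Assumption: score 0.0 when there is nothing to count.
    return (1 + beta**2) * tp / denom if denom else 0.0

y_true_str = ('baccalaureate and beyond longitudinal study|baccalaureate and beyond|'
              'beginning postsecondary student|education longitudinal study|'
              'national education longitudinal study')
y_pred_str = 'postsecondary student|education longitudinal study|xyz'

print(sample_counts(y_true_str, y_pred_str))
print(micro_fbeta([y_true_str], [y_pred_str]))
```

On the sample above this should agree with the class, since both count one TP per matched ground truth and one FP per prediction that matched nothing.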