# Sample-Level Evaluation
In this guide, we present `pEYES`' `sample_metrics` sub-module, which enables comparing two sequences of labels: one considered the "**ground truth**" and the other the "**prediction**". The sub-module implements multiple metrics for evaluating the prediction sequence against the ground-truth sequence. Some of these metrics are "global" measures of agreement, while others are label-specific and require specifying a "positive" label for the calculation.  
This sub-module also contains several functions that return a vector or matrix of values rather than a single number (so they are not, strictly speaking, "metrics"). For example, `label_counts` returns a vector of counts per label, and `confusion_matrix` returns a matrix of sample counts for each combination of ground-truth and predicted label.

In [1]:
!pip install peyes --upgrade



In [2]:
import numpy as np
import pandas as pd

import peyes

## Step One: Preparing the Data
We start by downloading one of `peyes`' datasets, for example the `lund2013` dataset.  
We take data from a single trial, which was annotated by two human annotators, _"RA"_ and _"MN"_.  
The output below shows the two arrays of labels, one per annotator; from here on, we use only _"RA"_ as our ground-truth annotator.

In [3]:
dataset = peyes.datasets.lund2013(directory=None, save=False, verbose=True)

Downloading...


Processing Files: 100%|██████████| 96/96 [00:00<00:00, 198.18it/s]


In [4]:
trial1_data = dataset[dataset[peyes.constants.TRIAL_ID_STR] == 1]
ra = trial1_data["RA"].values
mn = trial1_data["MN"].values

ra, mn

(array([1., 1., 1., ..., 4., 4., 4.]), array([1., 1., 1., ..., 4., 4., 4.]))

## Step Two: Algorithmic Detection
We create an instance of Engbert's detector and use it to label the data from a single trial. To do so, we extract the time, x, and y columns from the trial data, as well as the pixel size and viewer distance. As shown in a previous notebook, the `detect` method returns two objects: the detected labels, and metadata calculated during the detection process.  
Here, we output the detected labels for the first trial.

In [5]:
engbert = peyes.create_detector("engbert", missing_value=np.nan, min_event_duration=4, pad_blinks_time=0)

trial1_t=trial1_data[peyes.constants.T].values
trial1_x=trial1_data[peyes.constants.X].values
trial1_y=trial1_data[peyes.constants.Y].values
trial1_pixel_size = trial1_data["pixel_size"].values[0]
trial1_viewer_distance = trial1_data["viewer_distance"].values[0]

eng_labels, eng_metadata = engbert.detect(
    t=trial1_t, x=trial1_x, y=trial1_y, pixel_size_cm=trial1_pixel_size, viewer_distance_cm=trial1_viewer_distance
)
eng_labels

[<EventLabelEnum.UNDEFINED: 0>,
 <EventLabelEnum.UNDEFINED: 0>,
 <EventLabelEnum.UNDEFINED: 0>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <EventLabelEnum.FIXATION: 1>,
 <Eve

## Step Three: Calculating Sample-Level Metrics
As mentioned, we are interested in comparing the human annotators to the algorithmic detectors, on a sample-by-sample level (other comparison levels are shown in this and subsequent notebooks). To do so, we will use the `sample_metrics` sub-module of `peyes`, which provides an easy way to calculate many sample-level metrics.  

Metric calculations require that we define a **ground truth** sequence of labels (usually one of the human annotators) and a **predicted** sequence of labels (usually one of the detectors). Some metrics (like _d-prime_ or _f1-score_) also require a **positive** label, which is the label we are interested in detecting (all other labels are implicitly considered negative). Other metrics (like _Cohen's Kappa_ or the _Matthews Correlation Coefficient_) provide a global measure of agreement between the two sequences of labels, without requiring a positive label.  
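
To make the role of the positive label concrete, here is a minimal pure-Python sketch of how a multi-label comparison can be collapsed into a binary one. The helper `binary_counts` and the toy sequences are hypothetical illustrations, not part of `peyes`, and the library's internal implementation may differ.

```python
def binary_counts(ground_truth, prediction, pos_label):
    """Count (TP, FP, FN, TN), treating every label other than pos_label as negative."""
    if len(ground_truth) != len(prediction):
        raise ValueError("sequences must have equal length")
    tp = fp = fn = tn = 0
    for gt, pred in zip(ground_truth, prediction):
        gt_pos, pred_pos = gt == pos_label, pred == pos_label
        if gt_pos and pred_pos:
            tp += 1   # both sequences mark the sample as positive
        elif pred_pos:
            fp += 1   # predicted positive, ground truth negative
        elif gt_pos:
            fn += 1   # ground truth positive, predicted negative
        else:
            tn += 1   # both negative (whatever the actual labels are)
    return tp, fp, fn, tn


# Toy label sequences (1 = fixation, 2 = saccade, 4 = smooth pursuit, 0 = undefined)
gt = [1, 1, 1, 2, 2, 4, 4, 4]
pred = [1, 1, 2, 2, 1, 1, 4, 0]

print(binary_counts(gt, pred, pos_label=1))  # (2, 2, 1, 3)
print(binary_counts(gt, pred, pos_label=4))  # (1, 0, 2, 5)
```

The same pair of sequences yields different counts for different positive labels, which is why label-specific metrics must be told which label to treat as positive.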

We will show an example of a few metrics below, using the `sample_metrics` sub-module.

### Example 3.1: Global Agreement Metrics
We start by calculating some global agreement metrics, which do not require a positive label. Such metrics include _Accuracy_ and _Balanced Accuracy_, which are simple measures of agreement between the two sequences of labels (only the latter is computed below). We also calculate _Cohen's Kappa_ and the _Matthews Correlation Coefficient_, which are more complex measures of agreement. Finally, we calculate the _Complement Normalized-Levenshtein Distance_ ($1-NLD$), i.e. one minus the normalized edit distance between the two sequences of labels.

In [6]:
bacc = peyes.sample_metrics.balanced_accuracy(ra, eng_labels)
kappa = peyes.sample_metrics.cohen_kappa(ra, eng_labels)
mcc = peyes.sample_metrics.mcc(ra, eng_labels)
nld = peyes.sample_metrics.complement_nld(ra, eng_labels)

print("Ground Truth: RA,\tPredicted: Engbert")
print(f"Balanced Accuracy: {bacc:.2f}")
print(f"Cohen's Kappa: {kappa:.2f}")
print(f"Matthew's Correlation Coefficient: {mcc:.2f}")
print(f"Complement NLD: {nld:.2f}")

Ground Truth: RA,	Predicted: Engbert
Balanced Accuracy: 0.49
Cohen's Kappa: 0.02
Matthew's Correlation Coefficient: 0.12
Complement NLD: 0.09
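
The complement NLD can be illustrated with a short pure-Python sketch. The functions below are hypothetical and assume that the edit distance is normalized by the length of the longer sequence; `peyes` may normalize differently, so treat this as an illustration of the idea rather than the library's implementation.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to the empty prefix of b
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # delete item_a
                            curr[j - 1] + 1,      # insert item_b
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def complement_nld_sketch(a, b):
    """1 - (edit distance / length of the longer sequence); assumed normalization."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


gt = [1, 1, 1, 2, 2, 4, 4]
pred = [1, 1, 2, 2, 4, 4, 4]   # same events, boundaries shifted by one sample

print(levenshtein(gt, pred))                      # 2
print(round(complement_nld_sketch(gt, pred), 3))  # 0.714
```

Identical sequences score 1, and the score decreases as more edits are needed to turn one sequence into the other.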


### Example 3.2: Positive-Label Metrics
We continue by calculating some metrics that require specifying a positive label. These metrics include _Precision_, _Recall_, _F1-Score_, and Signal Detection Theory metrics, _D-Prime_ and _Criterion_.  
We specify the positive label using the `pos_labels` argument, which accepts various data types that can represent a label (like _int_, _float_ or `peyes`' own enum-based label representation). Behind the scenes, `peyes` converts this label to its own enum-based representation, so make sure you follow the same convention when specifying the positive label (e.g. `1` is a fixation).

In [7]:
rec = peyes.sample_metrics.recall(ra, eng_labels, pos_labels=1)
prec = peyes.sample_metrics.precision(ra, eng_labels, pos_labels=1)
f1 = peyes.sample_metrics.f1_score(ra, eng_labels, pos_labels=1)
dprime = peyes.sample_metrics.d_prime(ra, eng_labels, pos_labels=1)
crit = peyes.sample_metrics.criterion(ra, eng_labels, pos_labels=1)

print("Ground Truth: RA,\tPredicted: Engbert")
print(f"Recall: {rec:.2f}")
print(f"Precision: {prec:.2f}")
print(f"F1-Score: {f1:.2f}")
print(f"D-Prime: {dprime:.2f}")
print(f"Criterion: {crit:.2f}")

Ground Truth: RA,	Predicted: Engbert
Recall: 0.97
Precision: 0.08
F1-Score: 0.15
D-Prime: 0.69
Criterion: -1.49
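
As a sanity check, these values can be re-derived by hand. The sketch below uses the counts from the confusion matrix printed in Example 3.3 (fixation, label 1, as the positive label) and assumes the textbook definitions, including $d' = Z(\text{hit rate}) - Z(\text{false-alarm rate})$ and $c = -\frac{1}{2}\left[Z(\text{hit rate}) + Z(\text{false-alarm rate})\right]$ with no correction for extreme rates. Whether `peyes` applies any such correction is not stated here, so this is an illustration only.

```python
from statistics import NormalDist

# Counts read off the RA-vs-Engbert confusion matrix (Example 3.3), positive label = 1
hits = 115                              # ground truth 1, predicted 1
misses = 3 + 1                          # ground truth 1, predicted 0 or 2
false_alarms = 1 + 1343                 # ground truth 3 or 4, predicted 1
correct_rejections = 31 + 6 + 4 + 154   # ground truth not 1, predicted not 1

recall = hits / (hits + misses)             # also the hit rate
precision = hits / (hits + false_alarms)
f1 = 2 * precision * recall / (precision + recall)

z = NormalDist().inv_cdf                    # inverse of the standard normal CDF
fa_rate = false_alarms / (false_alarms + correct_rejections)
d_prime = z(recall) - z(fa_rate)
criterion = -0.5 * (z(recall) + z(fa_rate))

print(f"Recall: {recall:.2f}, Precision: {precision:.2f}, F1-Score: {f1:.2f}")
print(f"D-Prime: {d_prime:.2f}, Criterion: {criterion:.2f}")
```

Under these assumptions the results should agree with the cell output above to two decimals. The low precision reflects the many smooth-pursuit samples (label 4) that the detector labels as fixations.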


### Example 3.3: Confusion Matrix
We can evaluate the performance of the detector by calculating the confusion matrix between the two sequences of labels. Other than specifying the "ground truth" and "prediction" sequences, we can also specify the optional argument _labels_. When this argument is specified, the confusion matrix is calculated only for those labels. Otherwise, it is calculated for all labels present in the two sequences. Here, we show how to compute the confusion matrix with and without specifying a subset of labels.

In [8]:
conf = peyes.sample_metrics.confusion_matrix(ra, eng_labels)

print(conf)

Prediction    0     1    2  3  4  5
Ground Truth                       
0             0     0    0  0  0  0
1             3   115    1  0  0  0
2             0     0   31  0  0  0
3             0     1    6  0  0  0
4             4  1343  154  0  0  0
5             0     0    0  0  0  0


In [9]:
conf_1_2 = peyes.sample_metrics.confusion_matrix(ra, eng_labels, labels=[1, 2])
print(conf_1_2)

Prediction      1   2
Ground Truth         
1             115   1
2               0  31
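
The global agreement metrics from Example 3.1 can also be recovered from the full confusion matrix. The sketch below copies the matrix printed above and assumes the usual definitions: balanced accuracy as the mean per-label recall over labels that occur in the ground truth, and Cohen's Kappa as $(p_o - p_e) / (1 - p_e)$. It is a hand calculation for illustration, not the `peyes` implementation.

```python
# Rows: ground-truth labels 0..5, columns: predicted labels 0..5 (copied from the output above)
conf = [
    [0, 0, 0, 0, 0, 0],
    [3, 115, 1, 0, 0, 0],
    [0, 0, 31, 0, 0, 0],
    [0, 1, 6, 0, 0, 0],
    [4, 1343, 154, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

n = sum(map(sum, conf))                          # total number of samples
row_totals = [sum(row) for row in conf]          # samples per ground-truth label
col_totals = [sum(col) for col in zip(*conf)]    # samples per predicted label
diagonal = [conf[i][i] for i in range(len(conf))]

# Balanced accuracy: average recall over labels present in the ground truth
recalls = [d / r for d, r in zip(diagonal, row_totals) if r > 0]
balanced_accuracy = sum(recalls) / len(recalls)

# Cohen's Kappa: observed agreement corrected for chance agreement
p_observed = sum(diagonal) / n
p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
kappa = (p_observed - p_expected) / (1 - p_expected)

print(f"Balanced Accuracy: {balanced_accuracy:.2f}")  # 0.49
print(f"Cohen's Kappa: {kappa:.2f}")                  # 0.02
```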


### Example 3.4: Single Sequence Evaluation
The `sample_metrics` sub-module also offers ways to evaluate a single sequence of labels:
- _Label Counts_ returns a pandas Series with the counts of each label in the input sequence.
- _Transition Matrix_ returns a pandas DataFrame indicating the number of transitions from an origin label (DataFrame's rows) to a destination label (DataFrame's columns). If argument _normalize_rows_ is set to `True`, the transitions are normalized by total row-count, which produces the transition probabilities (instead of counts).

In [10]:
cnts = peyes.sample_metrics.label_counts(eng_labels)

print("Label Counts:")
print(cnts)

Label Counts:
0       7
1    1459
2     192
3       0
4       0
5       0
Name: count, dtype: int64


In [11]:
trans_cnts = peyes.sample_metrics.transition_matrix(eng_labels)

print("Transition Counts:")
print(trans_cnts)

Transition Counts:
To    0     1    2
From              
0     5     1    0
1     1  1424   34
2     0    34  158


In [12]:
trans_probs = peyes.sample_metrics.transition_matrix(eng_labels, normalize_rows=True)

print("Transition Probabilities:")
print(trans_probs)

Transition Probabilities:
To           0         1         2
From                              
0     0.833333  0.166667  0.000000
1     0.000685  0.976011  0.023304
2     0.000000  0.177083  0.822917
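
Conceptually, a transition matrix counts consecutive label pairs. The following pure-Python sketch shows the idea on a toy sequence; the helper names are hypothetical and the output is a plain dictionary rather than the DataFrame that `peyes` returns.

```python
from collections import Counter


def transition_counts(labels):
    """Count (origin, destination) pairs over consecutive samples."""
    return Counter(zip(labels[:-1], labels[1:]))


def transition_probabilities(labels):
    """Divide each count by the total number of transitions leaving its origin label."""
    counts = transition_counts(labels)
    row_totals = Counter()
    for (origin, _destination), count in counts.items():
        row_totals[origin] += count
    return {pair: count / row_totals[pair[0]] for pair, count in counts.items()}


seq = [0, 0, 1, 1, 1, 2, 2, 1, 1, 0]

print(transition_counts(seq)[(1, 1)])         # 3
print(transition_probabilities(seq)[(1, 1)])  # 0.6
print(transition_probabilities(seq)[(0, 1)])  # 0.5
```

The final sample has no successor, so it contributes to the label counts but not to any row of the matrix. This is consistent with the outputs above, where label 0 appears 7 times but row 0 of the transition counts sums to 6.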


## Step Four: Multiple Trials
Of course, we aren't interested in a single trial, but rather the performance over the entire dataset (or subset of it).  
Here we show how to calculate multiple metrics for the first 10 trials in the dataset, and present the results in a pandas DataFrame.

In [13]:
metrics = [
    peyes.constants.BALANCED_ACCURACY_STR,
    peyes.constants.COHENS_KAPPA_STR,
    peyes.constants.RECALL_STR,
    peyes.constants.PRECISION_STR,
    peyes.constants.D_PRIME_STR,
]

results = {}
for tr in range(1, 11):
    trial_data = dataset[dataset[peyes.constants.TRIAL_ID_STR] == tr]
    ra = trial_data["RA"].values
    
    t = trial_data[peyes.constants.T].values
    x = trial_data[peyes.constants.X].values
    y = trial_data[peyes.constants.Y].values
    pixel_size = trial_data["pixel_size"].values[0]
    viewer_distance = trial_data["viewer_distance"].values[0]
    eng_labels, _ = engbert.detect(t=t, x=x, y=y, pixel_size_cm=pixel_size, viewer_distance_cm=viewer_distance)
    
    results[tr] = pd.Series(peyes.sample_metrics.calculate(ra, eng_labels, *metrics, pos_labels=1))

results_df = pd.DataFrame(results).T
results_df

    balanced_accuracy  cohen's_kappa    recall  precision   d_prime
1            0.491597       0.024323  0.966387   0.078821  0.688051
2            0.522733       0.163100  0.958678   0.346269  1.318624
3            0.458667       0.186651  0.960000   0.307692  1.466417
4            0.470000       0.085841  0.880000   0.067073  0.607073
5            0.322581       0.060600       NaN   0.000000       NaN
6            0.307692       0.015464       NaN   0.000000       NaN
7            0.488372       0.083319  0.953488   0.257862  1.313033
8            0.311111       0.052670       NaN   0.000000       NaN
9            0.500000       0.016490       NaN   0.000000       NaN
10           0.312500       0.096693       NaN   0.000000       NaN
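
Several trials have no recall or d-prime value. A plausible reading (an assumption, not confirmed by the output itself) is that the ground truth of those trials contains no samples with the positive label, so recall would require dividing by zero. The sketch below shows that case with a hypothetical `safe_recall` helper, and then averages the recall column from the table above while skipping undefined entries.

```python
import math


def safe_recall(true_positives, false_negatives):
    """Recall, or NaN when the ground truth has no positive samples."""
    positives = true_positives + false_negatives
    return true_positives / positives if positives else math.nan


print(safe_recall(115, 4))  # defined: trial 1 has fixation samples in the ground truth
print(safe_recall(0, 0))    # nan: nothing to recall

# Recall column copied from the results table above (NaN where no value was reported)
recall_by_trial = {
    1: 0.966387, 2: 0.958678, 3: 0.96, 4: 0.88, 5: math.nan,
    6: math.nan, 7: 0.953488, 8: math.nan, 9: math.nan, 10: math.nan,
}

# Average only over trials where recall is defined
defined = {trial: value for trial, value in recall_by_trial.items() if not math.isnan(value)}
mean_recall = sum(defined.values()) / len(defined)

print(f"Mean recall over {len(defined)} of {len(recall_by_trial)} trials: {mean_recall:.3f}")
```

When summarizing results across trials, it is worth reporting how many trials contributed to each average, since undefined values are silently skipped by NaN-aware aggregations such as the default `pandas.DataFrame.mean`.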


## Summary
In this notebook, we showed how to use `peyes`' `sample_metrics` sub-module to evaluate the performance of an algorithmic detector against a human annotator. We demonstrated how to calculate various metrics, both global and label-specific, and how to evaluate a single sequence of labels. Finally, we showed how to calculate multiple metrics over multiple trials and present the results in a pandas DataFrame.