# Datasets
In this tutorial, we take a closer look at the datasets available through `pEYES`: which datasets are provided, what kind of data they contain, and how to load them from a local directory or download them from the internet.

In [1]:
!pip install peyes --upgrade



In [2]:
import os
import json

import numpy as np
import pandas as pd

import peyes

_DATASETS_DIR = "path/to/datasets"

## Available Datasets
The `pEYES` package provides an API to download and parse four datasets: "lund2013", "irf", "hfc" and "gazecom".  
All datasets contain eye-tracking data (t-x-y coordinates) and human annotations from at least one annotator. They also include recording-specific metadata, such as screen resolution, sampling rate and viewer distance.  
Thorough descriptions of these datasets are available in Table 1 of the article "Evaluating eye movement event detection: A review of the state of the art" (Startsev, M., & Zemblys, R., 2023).

### The `datasets` Submodule
  
The datasets are made available through the `peyes.datasets` submodule.
The submodule can download and parse a dataset, or load a previously parsed copy from a local directory. Both are done by calling the `load_dataset` function with the dataset name as an argument. Additional arguments specify the directory to load the dataset from and whether to save the parsed dataset to that directory once it is loaded.  
  
The `peyes.datasets` submodule also provides the `get_metadata` function, which returns a dictionary with the dataset's metadata: article citation, URL, license, etc.  
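
The load-or-download behaviour described above can be illustrated with a small standalone sketch. This is not the `pEYES` implementation: the function `load_or_build`, the JSON cache file and the `fake_download` stand-in are all hypothetical, chosen only to show how a `directory` and a `save` flag typically interact.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_build(name: str, directory: str, save: bool, build: Callable[[], list]) -> list:
    """Return the cached copy of `name` from `directory` if present, otherwise build it."""
    cache_file = Path(directory) / f"{name}.json"  # hypothetical cache layout
    if cache_file.is_file():
        return json.loads(cache_file.read_text())
    data = build()  # stands in for "download and parse"
    if save:
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(json.dumps(data))
    return data


calls = []


def fake_download() -> list:
    # Records each invocation so we can see how often a "download" happens.
    calls.append("download")
    return [{"t": 0.0, "x": 1.0, "y": 2.0}]


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_build("toy", tmp, save=True, build=fake_download)
    second = load_or_build("toy", tmp, save=True, build=fake_download)
```

With `save=True`, the second call reads the cached file instead of calling `fake_download` again. With `save=False`, every call would rebuild the data, which matches the "not found in directory ... Downloading..." messages printed later in this tutorial.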

### Descriptive Statistics
We will extract descriptive statistics for each of the labelers in the dataset: number of labeled samples & trials, distribution of labels, etc.

In [3]:
def _labeler_statistics(dataset: pd.DataFrame, labeler: str) -> dict:
    labeler_data = dataset[dataset[labeler].notnull()]
    labels_distribution = labeler_data[labeler].value_counts(dropna=True, normalize=True).sort_index()
    labels_distribution.index = labels_distribution.index.map(peyes.parse_label)
    labeler_statistics = {
        "num_samples": len(labeler_data),
        "num_subjects": labeler_data[peyes.constants.SUBJECT_ID_STR].nunique(),
        "num_trials": labeler_data[peyes.constants.TRIAL_ID_STR].nunique(),
        "num_labels": labeler_data[labeler].nunique(),
        "labels_distribution": (100 * labels_distribution).to_dict(),
    }
    return labeler_statistics


def _extended_metadata(dataset: pd.DataFrame, metadata: dict) -> dict:
    metadata = metadata.copy()
    metadata["num_samples"] = len(dataset)
    metadata["num_subjects"] = dataset[peyes.constants.SUBJECT_ID_STR].nunique()
    metadata["num_trials"] = dataset[peyes.constants.TRIAL_ID_STR].nunique()
    metadata["sampling_rate"] = peyes._utils.event_utils.calculate_sampling_rate(dataset[peyes.constants.T].values)
    metadata["stimuli"] = dataset[peyes.constants.STIMULUS_TYPE_STR].unique().tolist()
    metadata["raters"] = list(set(dataset.columns) - {
        # Remove non-rater columns
        peyes.constants.TRIAL_ID_STR, peyes.constants.SUBJECT_ID_STR,
        peyes.constants.STIMULUS_TYPE_STR, peyes.constants.STIMULUS_NAME_STR,
        peyes.constants.T, peyes.constants.X, peyes.constants.Y, peyes.constants.PUPIL,
        peyes.constants.PIXEL_SIZE_STR, peyes.constants.VIEWER_DISTANCE_STR,
        "subject_group", "v",
    })
    for rtr in metadata["raters"]:
        metadata[rtr] = _labeler_statistics(dataset, rtr)
    return metadata


def load(
        dataset_name: str, directory: str = _DATASETS_DIR, save: bool = False, verbose: bool = False
) -> tuple[pd.DataFrame, dict]:
    """
    Load a dataset from a local directory or download it from the internet.
    The dataset is parsed and returned as a pandas DataFrame.
    The metadata of the dataset is also returned as a dictionary.
    
    :param dataset_name: The name of the dataset to load.
    :param directory: The directory from which to load the dataset or save the parsed dataset.
    :param save: Whether to save the parsed dataset to the directory.
    :param verbose: Whether to print information about the dataset.
    
    :return: The parsed dataset and the metadata of the dataset.
    """
    from peyes.datasets.load_dataset import load_dataset
    dataset = load_dataset(dataset_name, directory, save, verbose)
    metadata = peyes.datasets.get_metadata(dataset_name, False)
    metadata = _extended_metadata(dataset, metadata)
    return dataset, metadata


def pretty_print(metadata: dict, indent: int = 4):
    out = json.dumps(metadata, indent=indent)
    print(out)
    

def labeler_stimulus_stats(dataframe: pd.DataFrame, labeler: str) -> pd.DataFrame:
    # Pass an empty string as `labeler` to count all samples and skip the label distribution.
    if labeler:
        subset = dataframe[dataframe[labeler].notnull()]
    else:
        subset = dataframe
    counts = pd.concat([
        subset.groupby("stimulus_type").size().rename("num_samples"),
        subset.groupby("stimulus_type")["subject_id"].nunique().rename("num_subjects"),
        subset.groupby("stimulus_type")["trial_id"].nunique().rename("num_trials"),
    ], axis=1)
    total_counts = pd.Series(
        [len(subset), subset["subject_id"].nunique(), subset["trial_id"].nunique()],
        index=counts.columns, name="total"
    )
    counts.loc["total"] = total_counts
    
    if not labeler:
        return counts
    stats = pd.concat([
        subset[labeler].value_counts(dropna=True, normalize=True).sort_index().rename("total"),
        subset.groupby("stimulus_type")[labeler].value_counts(dropna=True, normalize=True).unstack().fillna(0).T
    ], axis=1).T * 100
    stats.index.name = peyes.constants.LABEL_STR
    return pd.concat([counts, stats], axis=1)
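
To make the per-labeler logic concrete, here is a minimal sketch on made-up data. The column names `rater_a` and `rater_b`, the toy values and the helper `label_percentages` are illustrative assumptions, not part of `pEYES`. It shows why a labeler's sample and trial counts can be smaller than the dataset totals: samples a labeler did not annotate are stored as missing values and dropped before counting.

```python
import numpy as np
import pandas as pd

# Toy data: rater_a labeled every sample, rater_b labeled only trial 1.
toy = pd.DataFrame({
    "trial_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "rater_a": [1, 1, 2, 1, 1, 4, 4, 1],
    "rater_b": [1, 1, 2, 2, np.nan, np.nan, np.nan, np.nan],
})


def label_percentages(df: pd.DataFrame, rater: str) -> dict:
    """Percentage of each label among the samples this rater actually labeled."""
    labeled = df[df[rater].notnull()]
    return (100 * labeled[rater].value_counts(normalize=True).sort_index()).to_dict()


dist_a = label_percentages(toy, "rater_a")  # {1: 62.5, 2: 12.5, 4: 25.0}
dist_b = label_percentages(toy, "rater_b")  # {1.0: 50.0, 2.0: 50.0}
trials_b = toy.loc[toy["rater_b"].notnull(), "trial_id"].nunique()  # 1
```

Because `rater_b` contains missing values, pandas stores that column as floats, so its labels appear as `1.0` and `2.0` rather than `1` and `2`.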

#### Lund2013 Dataset

In [4]:
lund_data, lund_metadata = load("lund2013", directory=_DATASETS_DIR, save=False, verbose=True)
pretty_print(lund_metadata)

Dataset Lund2013 not found in directory path/to/datasets.
Downloading...


Processing Files: 100%|██████████| 96/96 [00:00<00:00, 201.08it/s]

{
    "name": "Lund2013",
    "url": "https://github.com/richardandersson/EyeMovementDetectorEvaluation/archive/refs/heads/master.zip",
    "articles": [
        "Andersson, R., Larsson, L., Holmqvist, K., Stridh, M., & Nystr\u00f6m, M. (2017): One algorithm to rule them all? An evaluation and discussion of ten eye movement event-detection algorithms. Behavior Research Methods, 49(2), 616-637."
    ],
    "license": "GNU GPL-3.0",
    "num_samples": 383212,
    "num_subjects": 30,
    "num_trials": 63,
    "sampling_rate": 500.0,
    "stimuli": [
        "moving_dot",
        "image",
        "video"
    ],
    "raters": [
        "RA",
        "MN"
    ],
    "RA": {
        "num_samples": 381886,
        "num_subjects": 30,
        "num_trials": 62,
        "num_labels": 6,
        "labels_distribution": {
            "0": 0.14035602247791226,
            "1": 42.38385277281702,
            "2": 5.525470952064229,
            "3": 3.060337378170449,
            "4": 46.80899535463463




Descriptive statistics about each stimulus type:

In [5]:
stim_stats = labeler_stimulus_stats(lund_data, "")

stim_stats

| stimulus_type | num_samples | num_subjects | num_trials |
|---|---|---|---|
| image | 87790 | 18 | 20 |
| moving_dot | 21326 | 19 | 24 |
| video | 274096 | 18 | 19 |
| total | 383212 | 30 | 63 |


Descriptive statistics about each labeler:

In [6]:
labeler_stats = pd.concat(
    {rtr: pd.json_normalize(lund_metadata[rtr], sep="_") for rtr in lund_metadata["raters"]},
).droplevel(1)

labeler_stats

|  | num_samples | num_subjects | num_trials | num_labels | labels_distribution_0 | labels_distribution_1 | labels_distribution_2 | labels_distribution_3 | labels_distribution_4 | labels_distribution_5 |
|---|---|---|---|---|---|---|---|---|---|---|
| RA | 381886 | 30 | 62 | 6 | 0.140356 | 42.383853 | 5.525471 | 3.060337 | 46.808995 | 2.080988 |
| MN | 104745 | 20 | 34 | 6 | 0.296912 | 61.448279 | 7.185068 | 4.361067 | 22.623514 | 4.085159 |
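
The table above is built by flattening each labeler's nested statistics dictionary with `pd.json_normalize` and stacking the resulting one-row frames. The sketch below repeats the same steps on a small made-up metadata dictionary (the rater names `A` and `B` and their values are illustrative only).

```python
import pandas as pd

# Made-up metadata with the same nesting as the dictionaries printed in this tutorial.
toy_metadata = {
    "raters": ["A", "B"],
    "A": {"num_samples": 8, "labels_distribution": {"1": 62.5, "2": 12.5, "4": 25.0}},
    "B": {"num_samples": 4, "labels_distribution": {"1": 50.0, "2": 50.0}},
}

# json_normalize flattens nested keys into "outer_inner" column names (one row per rater).
per_rater = {r: pd.json_normalize(toy_metadata[r], sep="_") for r in toy_metadata["raters"]}

# concat stacks the rows under a (rater, row-number) index; droplevel(1) keeps only the rater.
table = pd.concat(per_rater).droplevel(1)
```

A label that a rater never used (label `4` for rater `B` here) shows up as `NaN` in that rater's row, because the column exists only in the other rater's flattened frame.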


In [7]:
extended_stats = pd.concat([
    labeler_stimulus_stats(lund_data, "RA"), labeler_stimulus_stats(lund_data, "MN")
], keys=["RA", "MN"], axis=0)
extended_stats = extended_stats.reorder_levels([1, 0]).reindex(
    axis=0, level=0, labels=["image", "video", "moving_dot", "total"]
)
extended_stats

|  |  | num_samples | num_subjects | num_trials | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| image | RA | 87790 | 18 | 20 | 0.144663 | 76.455177 | 9.181 | 4.759084 | 4.777309 | 4.682766 |
| image | MN | 63849 | 13 | 14 | 0.198907 | 79.597175 | 8.592147 | 5.243622 | 0.853576 | 5.514573 |
| video | RA | 274096 | 18 | 19 | 0.079899 | 33.62654 | 4.413417 | 2.635573 | 57.883734 | 1.360837 |
| video | MN | 29029 | 9 | 9 | 0.055117 | 42.974267 | 5.174136 | 3.382824 | 46.381205 | 2.03245 |
| moving_dot | RA | 20000 | 19 | 23 | 0.95 | 12.845 | 4.72 | 1.425 | 79.53 | 0.53 |
| moving_dot | MN | 11867 | 10 | 11 | 1.415691 | 8.99132 | 4.533581 | 2.005562 | 81.638156 | 1.415691 |
| total | RA | 381886 | 30 | 62 | 0.140356 | 42.383853 | 5.525471 | 3.060337 | 46.808995 | 2.080988 |
| total | MN | 104745 | 20 | 34 | 0.296912 | 61.448279 | 7.185068 | 4.361067 | 22.623514 | 4.085159 |


#### IRF Dataset

In [8]:
irf_data, irf_metadata = load("irf", directory=_DATASETS_DIR, save=False, verbose=True)
irf_stats = pd.concat(
    {rtr: pd.json_normalize(irf_metadata[rtr], sep="_") for rtr in irf_metadata["raters"]},
).droplevel(1)

pretty_print(irf_metadata)
irf_stats

Dataset IRF not found in directory path/to/datasets.
Downloading...


Processing Files: 6it [00:00, 20.55it/s]


{
    "name": "IRF",
    "url": "https://github.com/r-zemblys/irf/archive/refs/heads/master.zip",
    "articles": [
        "Zemblys, Raimondas and Niehorster, Diederick C and Komogortsev, Oleg and Holmqvist, Kenneth. Using machine learning to detect events in eye-tracking data. Behavior Research Methods, 50(1), 160\u2013181 (2018)."
    ],
    "license": "MIT",
    "num_samples": 486016,
    "num_subjects": 6,
    "num_trials": 6,
    "sampling_rate": 1000.0,
    "stimuli": [
        "moving_dot"
    ],
    "raters": [
        "RZ"
    ],
    "RZ": {
        "num_samples": 486016,
        "num_subjects": 6,
        "num_trials": 6,
        "num_labels": 5,
        "labels_distribution": {
            "0": 0.33908348696339213,
            "1": 86.76792533579142,
            "2": 5.653517579668159,
            "3": 3.0007242559915723,
            "5": 4.238749341585462
        }
    }
}


|  | num_samples | num_subjects | num_trials | num_labels | labels_distribution_0 | labels_distribution_1 | labels_distribution_2 | labels_distribution_3 | labels_distribution_5 |
|---|---|---|---|---|---|---|---|---|---|
| RZ | 486016 | 6 | 6 | 5 | 0.339083 | 86.767925 | 5.653518 | 3.000724 | 4.238749 |


#### HFC Dataset

In [9]:
hfc_data, hfc_metadata = load("hfc", directory=_DATASETS_DIR, save=False, verbose=True)
hfc_stats = pd.concat(
    {rtr: pd.json_normalize(hfc_metadata[rtr], sep="_") for rtr in hfc_metadata["raters"]},
).droplevel(1)

pretty_print(hfc_metadata)
hfc_stats

Dataset HFC not found in directory path/to/datasets.
Downloading...


Processing Files: 70it [00:00, 875.29it/s]


{
    "name": "HFC",
    "url": "https://github.com/dcnieho/humanFixationClassification/archive/refs/heads/master.zip",
    "articles": [
        "Hooge, I.T.C., Niehorster, D.C., Nystr\u00f6m, M., Andersson, R. & Hessels, R.S. (2018). Is human classification by experienced untrained observers a gold standard in fixation detection?"
    ],
    "license": "CC NC-BY-SA 4.0",
    "num_samples": 1267692,
    "num_subjects": 70,
    "num_trials": 70,
    "sampling_rate": 300.0300030002925,
    "stimuli": [
        "free_viewing",
        "search_task"
    ],
    "raters": [
        "JB",
        "JV",
        "RH",
        "RA",
        "MS",
        "DN",
        "PZ",
        "IH",
        "MN",
        "TC",
        "KH",
        "JF"
    ],
    "JB": {
        "num_samples": 1267692,
        "num_subjects": 70,
        "num_trials": 70,
        "num_labels": 2,
        "labels_distribution": {
            "0": 28.538162266544237,
            "1": 71.46183773345575
        }
    },
    "

|  | num_samples | num_subjects | num_trials | num_labels | labels_distribution_0 | labels_distribution_1 |
|---|---|---|---|---|---|---|
| JB | 1267692 | 70 | 70 | 2 | 28.538162 | 71.461838 |
| JV | 1267692 | 70 | 70 | 2 | 28.952774 | 71.047226 |
| RH | 1267692 | 70 | 70 | 2 | 25.199496 | 74.800504 |
| RA | 1267692 | 70 | 70 | 2 | 29.250954 | 70.749046 |
| MS | 1267692 | 70 | 70 | 2 | 27.788453 | 72.211547 |
| DN | 1267692 | 70 | 70 | 2 | 25.008283 | 74.991717 |
| PZ | 1267692 | 70 | 70 | 2 | 33.378139 | 66.621861 |
| IH | 1267692 | 70 | 70 | 2 | 23.150103 | 76.849897 |
| MN | 1267692 | 70 | 70 | 2 | 27.861342 | 72.138658 |
| TC | 1267692 | 70 | 70 | 2 | 29.728988 | 70.271012 |


#### Gazecom Dataset

In [10]:
gazecom_data, gazecom_metadata = load("gazecom", directory=_DATASETS_DIR, save=False, verbose=True)
gazecom_stats = pd.concat(
    {rtr: pd.json_normalize(gazecom_metadata[rtr], sep="_") for rtr in gazecom_metadata["raters"]},
).droplevel(1)

pretty_print(gazecom_metadata)
gazecom_stats

Dataset GazeCom not found in directory path/to/datasets.
Downloading...


Processing Files: 2532it [00:54, 46.79it/s]


{
    "name": "GazeCom",
    "url": "https://gin.g-node.org/ioannis.agtzidis/gazecom_annotations/archive/master.zip",
    "articles": [
        "Agtzidis, I., Startsev, M., & Dorr, M. (2016a). In the pursuit of (ground) truth: A hand-labelling tool for eye movements recorded during dynamic scene viewing. In 2016 IEEE second workshop on eye tracking andvisualization (ETVIS) (pp. 65\u201368).",
        "Michael Dorr, Thomas Martinetz, Karl Gegenfurtner, and Erhardt Barth. Variability of eye movements when viewing dynamic natural scenes. Journal of Vision, 10(10):1-17, 2010.",
        "Startsev, M., Agtzidis, I., & Dorr, M. (2016). Smooth pursuit. http://michaeldorr.de/smoothpursuit/"
    ],
    "license": "GNU GPL-3.0",
    "num_samples": 12954168,
    "num_subjects": 54,
    "num_trials": 2532,
    "sampling_rate": 250.0,
    "stimuli": [
        "video"
    ],
    "raters": [
        "HL_FINAL",
        "HL2",
        "HL1"
    ],
    "HL_FINAL": {
        "num_samples": 12954168,
    

|  | num_samples | num_subjects | num_trials | num_labels | labels_distribution_0 | labels_distribution_1 | labels_distribution_2 | labels_distribution_4 |
|---|---|---|---|---|---|---|---|---|
| HL_FINAL | 12954168 | 54 | 2532 | 4 | 5.90349 | 72.545053 | 10.532216 | 11.019241 |
| HL2 | 12954168 | 54 | 2532 | 4 | 5.291224 | 77.081932 | 10.578649 | 7.048195 |
| HL1 | 12954168 | 54 | 2532 | 4 | 5.453635 | 72.248368 | 10.886658 | 11.411339 |
