# Adding a Public Dataset to the Library

## What you will learn in this tutorial:

- how to add a new dataset to the pymovements dataset library so that it can be accessed easily
- how to test whether the dataset integration works properly

**This tutorial is offered on three levels. Pick the one that corresponds to your prior knowledge:**

- **[Basic](#basic):** You have never used pymovements before and don't have any programming experience.  
  → You will learn how to create an issue that will allow pymovements maintainers to add your dataset

- **[Intermediate](#intermediate):** You have some experience with programming, but are not familiar with Git.  
  → You will learn how to create a pull request with a draft of your dataset definition file

- **[Advanced](#advanced):** You are proficient in Python and familiar with Git.  
  → You will learn how to add and test your dataset definition on your local machine

## Prerequisites: hosting your dataset

**pymovements does not _host_ datasets**; it only provides an interface for downloading and reading them. You will therefore need to upload your dataset to a location from which pymovements can download it.

- Your data must be openly available and downloadable from a simple link without requiring additional steps like logging in. We recommend [OSF](https://osf.io/) for hosting your files, but other platforms like Zenodo or GitHub will also work.
- Your data must be stored in one of the supported formats: CSV, ASC (EyeLink), IPC/Feather.
- Your dataset may consist of multiple files, including ZIP files containing nested folders.
- Trial information (e.g., trial or participant IDs) may be stored as additional columns in the data files, in the filenames, or as messages in ASC files.

## Basic

In order to add your dataset, we will need some information on where and in what format you stored your data, as well as some metadata about your data collection. Specifically, we need:

- Links to your data files (containing sample-based data, event-based data, and/or aggregated measures)
- Information on where/how participant IDs, trial IDs, and related data are stored (within the data files, or in the filename)
- Information on the screen you used to present the stimuli:
  - Screen size in centimeters
  - Screen resolution in pixels
  - Eye-to-screen distance
- Information on the eye-tracker you used:
  - Model and manufacturer
  - Sampling rate
  - Where the origin (0, 0) of the gaze coordinates recorded by the eye-tracker is (e.g., top left of screen, center of screen)
- Any paper(s) you would like to be referenced by users of your dataset

Once you have all the information, you can [create an issue in the pymovements repository on GitHub](https://github.com/pymovements/pymovements/issues/new?template=DATASET.md). You need a GitHub account to do this. If you prefer not to create a GitHub account, please send the information above to pymovements@python.org instead.

After receiving your information, we will start working to include your dataset. It is likely that we will need some additional information from you, so please keep an eye on the GitHub issue. Once the inclusion is completed, your dataset will be included in the next release of pymovements. This process may take several weeks.

## Intermediate

To add a new dataset to the library, you will need to create a _dataset definition_. This is a text file in the YAML format that contains information about where your dataset is hosted, what format it is stored in, how it was collected, and other metadata. You can find some examples of YAML files for existing datasets [here](https://github.com/pymovements/pymovements/tree/main/src/pymovements/datasets).

You will need to draft a new YAML file and create a pull request on GitHub. This requires a GitHub account. You can use this link to create a new file directly in your browser: [**NEW YAML FILE**](https://github.com/pymovements/pymovements/new/main/src/pymovements/datasets?filename=my_dataset.yaml&value=name%3A%20MyDataset%0A%0Along_name%3A%20%22Long%20name%20of%20my%20dataset%22%0A%0Aresources%3A%0A%20%20-%20content%3A%20gaze%0A%20%20%20%20url%3A%20%22https%3A%2F%2Furl.to%2Fdata%2Ffile%2Fgaze.csv%22%0A%20%20%20%20filename%3A%20%22gaze.csv%22%0A%20%20%20%20md5%3A%20%22%3Chash%20of%20file%20content%3E%22%0A%0Aexperiment%3A%0A%20%20-%20screen_width_px%3A%201920%0A%20%20-%20screen_height_px%3A%201080%0A%20%20-%20screen_width_cm%3A%2050%0A%20%20-%20screen_height_cm%3A%2028%0A%20%20-%20distance_cm%3A%2060%0A%20%20-%20origin%3A%20%22upper%20left%22%0A%20%20-%20sampling_rate%3A%201000)

The most important fields are `name`, `long_name`, `resources` (which contains the links to the data files), and `experiment` (which contains metadata about the physical setup and the eye tracker):

```yaml
name: MyDataset

long_name: "Long name of my dataset"

resources:
  - content: gaze
    url: "https://url.to/data/file/gaze.csv"
    filename: "gaze.csv"
    md5: "<MD5 hash of file content>"

experiment:
  - screen_width_px: 1920
  - screen_height_px: 1080
  - screen_width_cm: 50
  - screen_height_cm: 28
  - distance_cm: 60
  - origin: "upper left"
  - sampling_rate: 1000
```

Detailed documentation on the different fields can be found [here](https://pymovements.readthedocs.io/en/stable/reference/api/pymovements.DatasetDefinition.html).
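Before submitting the `experiment` values, it can help to sanity-check them. The following is a minimal sketch, not part of pymovements: the helper name `has_consistent_pixel_density` and the 5% tolerance are assumptions chosen for illustration, and the check assumes a monitor with square pixels. It compares the horizontal and vertical pixel density, which should be nearly identical if the screen size and resolution were entered correctly.

```py
def has_consistent_pixel_density(
        screen_width_px, screen_height_px,
        screen_width_cm, screen_height_cm,
        tolerance=0.05,
):
    """Return True if horizontal and vertical pixels per cm roughly agree.

    Hypothetical helper for illustration; assumes square pixels.
    """
    horizontal = screen_width_px / screen_width_cm
    vertical = screen_height_px / screen_height_cm
    # Relative difference between the two densities.
    relative_difference = abs(horizontal - vertical) / max(horizontal, vertical)
    return relative_difference <= tolerance


# Values from the example definition above.
print(has_consistent_pixel_density(1920, 1080, 50, 28))
```

If the check fails, a common cause is swapped width and height values or a size given in the wrong unit.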

To get the MD5 hash of a data file, you can either use the command line:

```bash
md5sum path/to/gaze.csv
```

or Python code:

```py
from pymovements.datasets._utils._downloads import _calculate_md5
_calculate_md5("path/to/gaze.csv")
```

After adding your information in the file, click "Commit changes" to create a pull request. Feel free to create a pull request even if the file is still missing some information. You (or pymovements maintainers) will still be able to edit it later. If you are unsure about something, just add a comment on the pull request and we will help you.

Once the pull request is created, we will start working to include your dataset. It is likely that we will need some additional information from you, so please keep an eye on the pull request. Once the pull request is completed and merged, your dataset will be included in the next release of pymovements. This process may take several weeks.

## Advanced

Follow the [contributing guide](https://github.com/pymovements/pymovements/blob/main/CONTRIBUTING.md) to set up your development environment.

### Setting up your dataset locally

We recommend setting up and testing your dataset locally first. Please refer to the [corresponding tutorial](https://pymovements.readthedocs.io/en/stable/tutorials/local-dataset.html).

As described in that tutorial, the `DatasetDefinition` for our **local** toy dataset looks like this:

```py
import pymovements as pm

experiment = pm.gaze.Experiment(
    screen_width_px=1280,
    screen_height_px=1024,
    screen_width_cm=38,
    screen_height_cm=30.2,
    distance_cm=68,
    origin='upper left',
    sampling_rate=1000,
)

dataset_definition = pm.DatasetDefinition(
    name='my_dataset',
    has_files={
        'gaze': True,
        'precomputed_events': False,
        'precomputed_reading_measures': False,
    },
    experiment=experiment,
    filename_format={'gaze': r'trial_{text_id:d}_{page_id:d}.csv'},
    filename_format_schema_overrides={'gaze': {'text_id': int, 'page_id': int}},
    custom_read_kwargs={'gaze': {'separator': '\t'}},
    time_column='timestamp',
    time_unit='ms',
    pixel_columns=['x', 'y'],
)
```

### Adding resource definitions

The dataset definition above makes it possible to load the dataset if the data files have already been downloaded. However, the files cannot yet be downloaded automatically through pymovements. To achieve this, we need to add resource definitions (`ResourceDefinition` objects), which specify where your files are hosted, under which filename they should be saved after download, and what kind of data they contain.

Let's add one to the `DatasetDefinition` of our toy dataset:

```py
dataset_definition = pm.DatasetDefinition(
    name='my_dataset',
    resources=[
        {
            'content': 'gaze',
            'url': 'https://github.com/pymovements/pymovements-toy-dataset/archive/refs/heads/main.zip',  # noqa: E501
            'filename': 'pymovements-toy-dataset.zip',
            'md5': '256901852c1c07581d375eef705855d6',
            'filename_pattern': r'trial_{text_id:d}_{page_id:d}.csv',
            'filename_pattern_schema_overrides': {
                'text_id': int,
                'page_id': int,
            },
        },
    ],
    experiment=experiment,
    custom_read_kwargs={'gaze': {'separator': '\t'}},
    time_column='timestamp',
    time_unit='ms',
    pixel_columns=['x', 'y'],
)
```

Note that some of the information previously specified at the definition level (`has_files`, `filename_format`, `filename_format_schema_overrides`) is now specified at the level of individual resources (`content`, `filename_pattern`, `filename_pattern_schema_overrides`).
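To build an intuition for how a filename pattern and its schema overrides turn a filename into trial metadata, here is a rough, standard-library sketch. It is not the pymovements implementation: the helpers `pattern_to_regex` and `parse_filename` are hypothetical, and the sketch assumes that `:d` means "digits only" while any other placeholder matches arbitrary text.

```py
import re


def pattern_to_regex(pattern):
    """Convert a pattern like 'trial_{text_id:d}_{page_id:d}.csv' to a regex."""
    regex = ''
    position = 0
    for field in re.finditer(r'\{(\w+)(?::(\w+))?\}', pattern):
        # Escape the literal text between two placeholders.
        regex += re.escape(pattern[position:field.start()])
        name, kind = field.group(1), field.group(2)
        # Assumption: ':d' means digits only; anything else matches any text.
        regex += f'(?P<{name}>' + (r'\d+' if kind == 'd' else '.+?') + ')'
        position = field.end()
    regex += re.escape(pattern[position:])
    return re.compile(regex)


def parse_filename(filename, pattern, schema_overrides=None):
    """Extract the placeholder values from a filename, or return None."""
    match = pattern_to_regex(pattern).fullmatch(filename)
    if match is None:
        return None
    values = match.groupdict()
    # Cast the extracted strings to the requested types.
    for key, cast in (schema_overrides or {}).items():
        values[key] = cast(values[key])
    return values


print(parse_filename(
    'trial_3_12.csv',
    r'trial_{text_id:d}_{page_id:d}.csv',
    {'text_id': int, 'page_id': int},
))
# → {'text_id': 3, 'page_id': 12}
```

Files in the archive that do not match the pattern yield `None` in this sketch, which mirrors the idea that only matching files are treated as gaze files of the resource.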

To get the MD5 hash of a file, you can either use the command line:

```bash
md5sum path/to/pymovements-toy-dataset.zip
```

or Python code:

```py
from pymovements.datasets._utils._downloads import _calculate_md5
_calculate_md5("path/to/pymovements-toy-dataset.zip")
```

Let's test whether the data files can be downloaded:

```py
dataset = pm.Dataset(
    definition=dataset_definition,
    path='data/my_dataset',
)
dataset.download()
```

Then load the dataset into memory:

```py
dataset.load()
```

### Writing the YAML file

All public datasets in the library are defined in YAML files (see [here](https://github.com/pymovements/pymovements/tree/main/src/pymovements/datasets) for examples). These YAML files contain exactly the same fields as the `DatasetDefinition` objects, and the two can be easily converted into each other.

Let's convert the `DatasetDefinition` of our toy dataset to a YAML file:

```py
dataset_definition.to_yaml('my_dataset.yaml')

with open('my_dataset.yaml', encoding='utf-8') as f:
    print(f.read())
```

This YAML file can now be added to `src/pymovements/datasets/`. Commit the file in your fork of the pymovements repository, and create a pull request. We will then review the pull request and request additional information or changes if necessary, so please keep an eye on the pull request. Once the pull request is completed and merged, your dataset will be included in the next release of pymovements. This process may take several weeks.

If you run into problems, feel free to create a draft pull request and explain the issue, so that we can provide support.

### Running integration tests

To check whether the integration in the dataset library is working properly, you can run the integration test for your dataset:

```bash
tox -e integration -- 'tests/integration/public_dataset_processing_test.py::test_public_dataset_processing[my_dataset]'
```