## Prerequisites: hosting your dataset

**pymovements does not _host_ datasets**; it only provides an interface for downloading and reading them. You will therefore need to upload your dataset to a hosting platform so that pymovements can download it.

- Your data must be openly available and downloadable via a direct link, without additional steps such as logging in. We recommend [OSF](https://osf.io/) for hosting your files, but other platforms like Zenodo or GitHub will also work.
- Your data must be stored in one of the supported formats: CSV, ASC (EyeLink), or IPC/Feather.
- Your dataset may consist of multiple files, including ZIP files containing nested folders.
- Trial information (e.g., trial or participant IDs) may be stored as additional columns in the data files, in the filenames, or as messages in ASC files.

(adding_dataset_intermediate)=
## Intermediate

To add a new dataset to the library, you will need to create a {py:class}`~pymovements.DatasetDefinition`. In the library, each definition is stored as a text file in YAML format that describes where your dataset is hosted, what format it is stored in, how it was collected, and other metadata. You can find examples of YAML files for existing datasets [here](https://github.com/pymovements/pymovements/tree/main/src/pymovements/datasets).

You will need to draft a new YAML file and create a pull request on GitHub. This requires a GitHub account. You can use this link to create a new file directly in your browser: [**NEW YAML FILE**](https://github.com/pymovements/pymovements/new/main/src/pymovements/datasets?filename=my_dataset.yaml&value=name%3A%20MyDataset%0A%0Along_name%3A%20%22Long%20name%20of%20my%20dataset%22%0A%0Aresources%3A%0A%20%20-%20content%3A%20gaze%0A%20%20%20%20url%3A%20%22https%3A%2F%2Furl.to%2Fdata%2Ffile%2Fgaze.csv%22%0A%20%20%20%20filename%3A%20%22gaze.csv%22%0A%20%20%20%20md5%3A%20%22%3Chash%20of%20file%20content%3E%22%0A%0Aexperiment%3A%0A%20%20-%20screen_width_px%3A%201920%0A%20%20-%20screen_height_px%3A%201080%0A%20%20-%20screen_width_cm%3A%2050%0A%20%20-%20screen_height_cm%3A%2028%0A%20%20-%20distance_cm%3A%2060%0A%20%20-%20origin%3A%20%22upper%20left%22%0A%20%20-%20sampling_rate%3A%201000)

The most important fields are `name`, `long_name`, `resources` (the links to the data files), and `experiment` (metadata about the physical setup and the eye tracker):

```yaml
name: MyDataset

long_name: "Long name of my dataset"

resources:
  - content: gaze
    url: "https://url.to/data/file/gaze.csv"
    filename: "gaze.csv"
    md5: "<MD5 hash of file content>"

experiment:
  - eyetracker:
    - sampling_rate: 1000
  - screen:
    - width_px: 1920
    - height_px: 1080
    - width_cm: 50
    - height_cm: 28
    - distance_cm: 60
    - origin: "upper left"
```

The field {py:attr}`~pymovements.DatasetDefinition.resources` contains a list of {py:class}`~pymovements.ResourceDefinition` instances. Each instance holds the information needed to download and load a specific resource (group) of a dataset, including the type of content stored in the files of that resource group. In our example we only have a single type of resource: ``gaze`` (samples). Other supported content types are ``precomputed_events`` and ``precomputed_reading_measures``.
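Before opening a pull request, it can help to sanity-check the `resources` entries. The following is a minimal sketch using only the Python standard library. The helper `check_resources` is hypothetical and not part of pymovements. It assumes that `url`, `filename`, and `md5` are all required because they appear in the example above; the library itself may be more lenient.

```python
import re

# Content types named in this guide. Treat the set as an assumption and
# consult the ResourceDefinition API reference for the authoritative list.
KNOWN_CONTENT_TYPES = {"gaze", "precomputed_events", "precomputed_reading_measures"}
MD5_PATTERN = re.compile(r"[0-9a-fA-F]{32}")


def check_resources(resources):
    """Return a list of human-readable problems found in the resource entries."""
    problems = []
    for index, resource in enumerate(resources):
        content = resource.get("content")
        if content not in KNOWN_CONTENT_TYPES:
            problems.append(f"resource {index}: unknown content type {content!r}")
        for key in ("url", "filename"):
            if not resource.get(key):
                problems.append(f"resource {index}: missing {key!r}")
        md5 = str(resource.get("md5") or "")
        if not MD5_PATTERN.fullmatch(md5):
            problems.append(f"resource {index}: md5 is not a 32-character hex digest")
    return problems


# The resource entry from the YAML example, written as a plain dict.
resources = [
    {
        "content": "gaze",
        "url": "https://url.to/data/file/gaze.csv",
        "filename": "gaze.csv",
        "md5": "<MD5 hash of file content>",  # placeholder, flagged below
    },
]
for problem in check_resources(resources):
    print(problem)  # resource 0: md5 is not a 32-character hex digest
```

The placeholder hash is reported as a problem, which is a reminder to replace it with the real digest of the uploaded file.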

Detailed documentation on the different fields can be found in the API references for {py:class}`~pymovements.DatasetDefinition` and {py:class}`~pymovements.ResourceDefinition`.

To get the MD5 hash of a data file, you can either use the command line:

```bash
md5sum path/to/gaze.csv
```

or Python code:

```py
from pymovements.datasets._utils._downloads import _calculate_md5
_calculate_md5("path/to/gaze.csv")
```
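The leading underscores in `_calculate_md5` suggest that it is a private helper, so its import path may change between releases. As an alternative, here is a sketch that relies only on `hashlib` from the Python standard library. The function name `md5_of_file` and the chunk size are illustrative choices, not part of pymovements.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as file:
        # Read in chunks so that large recordings need not fit into memory.
        for chunk in iter(lambda: file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_of_file("path/to/gaze.csv")` returns a 32-character hexadecimal string, the same digest that `md5sum` prints, which you can paste into the `md5` field.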

After adding your information to the file, click "Commit changes" to create a pull request. Feel free to create a pull request even if the file is still missing some information. You (or the pymovements maintainers) will still be able to edit it later. If you are unsure about something, just add a comment on the pull request, and we will help you.

Once the pull request is created, we will start working to include your dataset. It is likely that we will need some additional information from you, so please keep an eye on the pull request. Once the pull request is completed and merged, your dataset will be included in the next release of pymovements. This process may take several weeks.

### Setting up your dataset locally

We recommend setting up and testing your dataset locally first. Please refer to the {ref}`working_with_local_datasets` tutorial.

The {py:class}`~pymovements.DatasetDefinition` that we get from that tutorial looks like this:

### Adding resource definitions

```py
import pymovements as pm

# `experiment` is the experiment definition created in the tutorial referenced above.
url = 'https://github.com/pymovements/pymovements-toy-dataset/archive/refs/heads/main.zip'

dataset_definition = pm.DatasetDefinition(
    name='MyDataset',
    resources=[
        pm.ResourceDefinition(
            content='gaze',
            url=url,
            filename='pymovements-toy-dataset.zip',
            md5='256901852c1c07581d375eef705855d6',
            filename_pattern=r'trial_{text_id:d}_{page_id:d}.csv',
            filename_pattern_schema_overrides={
                'text_id': int,
                'page_id': int,
            },
            load_kwargs={
                'read_csv_kwargs': {'separator': '\t'},
                'time_column': 'timestamp',
                'time_unit': 'ms',
                'pixel_columns': ['x', 'y'],
            },
        ),
    ],
    experiment=experiment,
)
```
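To illustrate what a `filename_pattern` such as `trial_{text_id:d}_{page_id:d}.csv` expresses, the sketch below translates the pattern into a regular expression and extracts the two IDs from a hypothetical filename. It uses only the standard library and is not how pymovements implements pattern matching internally. The helper `pattern_to_regex` is an assumption made for this example and handles only `{name}` and `{name:d}` fields.

```python
import re


def pattern_to_regex(pattern):
    """Translate a curly-brace filename pattern into a compiled regex.

    Only `{name}` and `{name:d}` fields are handled in this sketch.
    """
    regex = ""
    position = 0
    for field in re.finditer(r"\{(\w+)(?::(\w))?\}", pattern):
        # Escape the literal text between two fields.
        regex += re.escape(pattern[position:field.start()])
        name, kind = field.group(1), field.group(2)
        # `d` fields match digits only; untyped fields match any text.
        value = r"\d+" if kind == "d" else r".+?"
        regex += f"(?P<{name}>{value})"
        position = field.end()
    regex += re.escape(pattern[position:])
    return re.compile(regex)


regex = pattern_to_regex(r"trial_{text_id:d}_{page_id:d}.csv")
match = regex.fullmatch("trial_1_2.csv")  # hypothetical filename
# Cast to int, mirroring `filename_pattern_schema_overrides` in the definition.
metadata = {key: int(value) for key, value in match.groupdict().items()}
print(metadata)  # {'text_id': 1, 'page_id': 2}
```

A file whose name does not fit the pattern, for example `trial_a_2.csv`, yields no match, so it is worth checking that every data file in your archive follows the same naming scheme.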

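```py
# Download the resources listed in the definition into the local dataset folder.
dataset = pm.Dataset(
    definition=dataset_definition,
    path='data/my_dataset',
)
dataset.download()

# Load the downloaded files to confirm that the definition works.
dataset.load()

# Export the definition to a YAML file.
dataset_definition.to_yaml('my_dataset.yaml')

# Inspect the generated YAML.
with open('my_dataset.yaml', encoding='utf-8') as f:
    print(f.read())
```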

### Running integration tests

To check whether your dataset is integrated into the dataset library correctly, you can run the integration test for it:

```bash
tox -e integration -- \
  'tests/integration/public_dataset_processing_test.py::test_public_dataset_processing[my_dataset]'
```