# Working with a Local Dataset

In this tutorial, we show how to use your own local dataset with the `Dataset` class, which helps you manage and process your eye-tracking data.

## Preparations

We import `pymovements` as the alias `pm` for convenience.

In [None]:
import pymovements as pm

For demonstration purposes, we will use the raw data provided by the Toy dataset, a sample dataset that comes with *pymovements*.

We will download the resources of this dataset to a directory to simulate a local dataset for you.
All downloaded archive files are automatically extracted and then removed.
The directory of the dataset will be `data/my_dataset`.

After that, we no longer need the Python object and delete it
(the files on your system will stay in place).
Don't worry if these lines confuse you, as they are not relevant to your use case.

Just keep in mind that we now have some files with gaze data in the directory `data/my_dataset`.

In [None]:
toy_dataset = pm.Dataset('ToyDataset', path='data/my_dataset')
toy_dataset.download(remove_finished=True)

del toy_dataset

## Defining your Dataset

In order to load your dataset you will need to specify a {py:class}`~pymovements.DatasetDefinition`.

The following fields are required:

- `name`: the (abbreviated) name of your dataset
- `experiment`: the particular experiment setup
- `resources`: metadata on your available dataset resources

Some additional fields are optional:

- `long_name`: the long-form name of your dataset

## Define your Experiment

To use the Dataset class, we first need to create an Experiment instance. This class represents the properties of the experiment, such as the screen dimensions and sampling rate.

In [None]:
experiment = pm.Experiment(
    screen_width_px=1280,
    screen_height_px=1024,
    screen_width_cm=38,
    screen_height_cm=30.2,
    distance_cm=68,
    origin='upper left',
    sampling_rate=1000,
)

## Defining your resources

Next we will define our dataset resources by setting up a {py:class}`~pymovements.ResourceDefinition`.

A `ResourceDefinition` should always include the following fields:
- `content`: the type of content (e.g., `gaze`, `precomputed_events`)
- `filename_pattern`: the filename pattern of resource files

Some additional fields are optional but might be necessary for your dataset:
- `filename_pattern_schema_overrides`: specify datatypes of named groups in `filename_pattern`
- `load_function`: the loading function, usually inferred automatically
- `load_kwargs`: additional keyword arguments that are passed to the loading function

In our tutorial dataset we only have one type of content: `gaze` sample data stored in csv files. Hence, we only need to set up a single `ResourceDefinition`.

The `filename_pattern` is a pattern expression used to match dataset filenames.
The named groups in the curly braces will be parsed as additional metadata.

In our tutorial dataset all files conform to the filename pattern:

In [None]:
filename_pattern = r'trial_{text_id:d}_{page_id:d}.csv'

This will match filenames like `trial_1_2.csv` and parse the values of `text_id==1` and `page_id==2`.

As both `text_id` and `page_id` are numeric values, we can explicitly specify the datatype of these values as `int`:

In [None]:
filename_pattern_schema_overrides = {
    'text_id': int,
    'page_id': int,
}
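
To make the matching step concrete, here is a minimal sketch of how such a curly-brace pattern could be parsed using only Python's `re` module. It is an illustration, not the parser used by *pymovements*: the helpers `pattern_to_regex` and `parse_filename` are hypothetical names, and the sketch assumes only the field forms `{name:d}` (digits) and `{name}` (any text).

```python
import re


def pattern_to_regex(filename_pattern):
    """Translate a curly-brace pattern into a compiled regular expression.

    Only two field forms are handled (an assumption of this sketch):
    ``{name:d}`` matches digits, ``{name}`` matches any text.
    """
    regex = ''
    position = 0
    for field in re.finditer(r'\{(\w+)(?::(\w))?\}', filename_pattern):
        # Escape the literal text between fields (e.g. the dot in '.csv').
        regex += re.escape(filename_pattern[position:field.start()])
        name, conversion = field.group(1), field.group(2)
        regex += f'(?P<{name}>\\d+)' if conversion == 'd' else f'(?P<{name}>.+?)'
        position = field.end()
    regex += re.escape(filename_pattern[position:])
    return re.compile(regex)


def parse_filename(filename, filename_pattern, schema_overrides=None):
    """Return the named groups of a matching filename, or None if no match."""
    match = pattern_to_regex(filename_pattern).fullmatch(filename)
    if match is None:
        return None
    schema_overrides = schema_overrides or {}
    # Captured groups are strings; cast those that have an override.
    return {
        key: schema_overrides.get(key, str)(value)
        for key, value in match.groupdict().items()
    }


metadata = parse_filename(
    'trial_1_2.csv',
    r'trial_{text_id:d}_{page_id:d}.csv',
    {'text_id': int, 'page_id': int},
)
print(metadata)
```

Filenames that do not conform to the pattern yield `None` in this sketch, which mirrors the idea that only matching files belong to the resource.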

## Column Definitions

The `trial_columns` argument can be used to specify which columns define a single trial.

This is important for correctly applying all preprocessing methods.

For this very small single-user dataset, a trial is defined by `text_id` and `page_id` alone.

In [None]:
trial_columns = ['text_id', 'page_id']
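
The following sketch illustrates why trial boundaries matter for preprocessing. It uses plain Python and a few hypothetical sample rows (not the actual tutorial data) to group samples by the trial columns, so that differences between consecutive samples are only computed within a trial.

```python
from collections import defaultdict

trial_columns = ['text_id', 'page_id']

# Hypothetical sample rows; real data would come from the csv files.
samples = [
    {'text_id': 1, 'page_id': 1, 'timestamp': 0, 'x': 100.0},
    {'text_id': 1, 'page_id': 1, 'timestamp': 1, 'x': 102.0},
    {'text_id': 1, 'page_id': 2, 'timestamp': 0, 'x': 500.0},
    {'text_id': 1, 'page_id': 2, 'timestamp': 1, 'x': 501.0},
]

# Collect the rows of each trial under a key built from the trial columns.
trials = defaultdict(list)
for row in samples:
    key = tuple(row[column] for column in trial_columns)
    trials[key].append(row)

# Differences between consecutive samples are computed within a trial only,
# so the jump from x=102 to x=500 between trials is never treated as movement.
steps_per_trial = {
    key: [b['x'] - a['x'] for a, b in zip(rows, rows[1:])]
    for key, rows in trials.items()
}
print(steps_per_trial)
```

Without the grouping, the step between the last sample of one page and the first sample of the next would show up as a large, spurious movement.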

The `time_column` and `pixel_columns` arguments can be used to correctly map the columns in your dataframes. If the time unit differs from the default of milliseconds (`ms`), you must also specify the `time_unit` for correct computations.

Depending on the content of your dataset, you can alternatively also provide `position_columns`, `velocity_columns` and `acceleration_columns`.

Specifying these columns is needed for correctly applying preprocessing methods. For example, if you want to apply the {py:meth}`~pymovements.Dataset.pix2deg` method, you will need to specify `pixel_columns` accordingly.

If your dataset has gaze positions available only in degrees of visual angle, you have to specify the `position_columns` instead.

In [None]:
time_column = 'timestamp'
time_unit = 'ms'
pixel_columns = ['x', 'y']

## Setting up loading function parameters

Now we must set up the parameters for our loading function.

As the content is `gaze` and the filename extension of the `filename_pattern` is `.csv`, the loading function is automatically inferred to be `from_csv`.

In case the loading function cannot be automatically inferred from your `filename_pattern`, you will have to specify it explicitly:

In [None]:
load_function = 'from_csv'

Have a look at the {py:func}`~pymovements.gaze.from_csv` reference to see what additional parameters you can set up.

We will use our defined values for `time_column`, `time_unit` and `pixel_columns`.

As our csv files are tab separated we need to specify that separator via `read_csv_kwargs`:

In [None]:
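load_kwargs = {
    # Reuse the column definitions from the previous section.
    'time_column': time_column,
    'time_unit': time_unit,
    'pixel_columns': pixel_columns,
    # The csv files are tab separated.
    'read_csv_kwargs': {'separator': '\t'},
}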

We can now initialize our {py:class}`~pymovements.ResourceDefinition`. The content keyword for our gaze sample files is `gaze`.

In [None]:
resource_definition = pm.ResourceDefinition(
    content='gaze',
    filename_pattern=filename_pattern,
    filename_pattern_schema_overrides=filename_pattern_schema_overrides,
    load_function=load_function,
    load_kwargs=load_kwargs,
)

## Define and load the Dataset

Next, we combine all these definitions into a {py:class}`~pymovements.DatasetDefinition` by passing in the dataset name, the `Experiment` instance, and the list of resource definitions.

In [None]:
dataset_definition = pm.DatasetDefinition(
    name='my_dataset',
    experiment=experiment,
    resources=[resource_definition],
)

Finally, we create a {py:class}`~pymovements.Dataset` instance by using the {py:class}`~pymovements.DatasetDefinition` and specifying the directory path.

In [None]:
dataset = pm.Dataset(
    definition=dataset_definition,
    path='data/my_dataset/',
)

If you have a root data directory that holds all your local datasets, you can instead define the paths of the dataset explicitly.

The `dataset`, `raw`, `preprocessed`, and `events` parameters define the names of the directories for the dataset, raw data, preprocessed data, and events data, respectively.

In [None]:
dataset_paths = pm.DatasetPaths(
    root='data/',
    raw='raw',
    preprocessed='preprocessed',
    events='events',
)

dataset = pm.Dataset(
    definition=dataset_definition,
    path=dataset_paths,
)
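
The directory layout implied by these settings can be pictured with `pathlib`. This is only a sketch under two assumptions that are not stated above: that the dataset directory name falls back to the name of the dataset definition (`my_dataset`) when it is not given, and that the `raw`, `preprocessed`, and `events` directories sit directly below the dataset directory. Please check the `DatasetPaths` reference for the authoritative behavior.

```python
from pathlib import PurePosixPath

root = PurePosixPath('data/')
dataset_dirname = 'my_dataset'  # assumed default: the name of the dataset definition

# Assumed layout: <root>/<dataset>/<subdirectory>
dataset_dir = root / dataset_dirname
layout = {
    'raw': dataset_dir / 'raw',
    'preprocessed': dataset_dir / 'preprocessed',
    'events': dataset_dir / 'events',
}

for purpose, directory in layout.items():
    print(f'{purpose}: {directory}')
```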

Now let's load the dataset into memory. Here we select a subset including the first page of texts with ID 1 and 2.

In [None]:
subset = {
    'text_id': [1, 2],
    'page_id': 1,
}

dataset.load(subset=subset)
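
Conceptually, the subset acts as a filter on the metadata parsed from the filenames. The sketch below shows one way such a filter could work in plain Python. The helper `matches_subset` and the list of file metadata are hypothetical and only illustrate the idea that a value may be given either as a single item or as a list of allowed items.

```python
def matches_subset(file_metadata, subset):
    """Check the metadata of one file against a subset specification."""
    for key, allowed in subset.items():
        # A scalar is treated like a one-element list of allowed values.
        if not isinstance(allowed, (list, tuple, set, range)):
            allowed = [allowed]
        if file_metadata[key] not in allowed:
            return False
    return True


# Hypothetical metadata, as it might be parsed from the filenames.
files = [
    {'text_id': text_id, 'page_id': page_id}
    for text_id in (0, 1, 2, 3)
    for page_id in (1, 2, 3)
]

subset = {
    'text_id': [1, 2],
    'page_id': 1,
}

selected = [metadata for metadata in files if matches_subset(metadata, subset)]
print(selected)
```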

## Use the Dataset

Once we have created the Dataset instance, we can use its methods to preprocess and analyze data in our local dataset.

In [None]:
dataset.gaze[0]

Here we use the {py:meth}`~pymovements.Dataset.pix2deg` method to convert the pixel coordinates to degrees of visual angle.

In [None]:
dataset.pix2deg()

dataset.gaze[0]
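
To give an intuition for this conversion, here is a minimal one-dimensional sketch of the underlying geometry. It assumes that the eye is positioned in front of the screen center and that pixel coordinates start at the screen edge (as with `origin='upper left'`). The function `pix2deg_1d` is a hypothetical helper, and the library's implementation may differ in its details.

```python
import math


def pix2deg_1d(pixel, screen_px, screen_cm, distance_cm):
    """Convert one pixel coordinate to degrees of visual angle.

    Assumes the coordinate origin is at the screen edge and that the
    eye faces the center of the screen.
    """
    # Shift the origin to the screen center (pixel indices run 0..screen_px - 1).
    centered_px = pixel - (screen_px - 1) / 2
    # Express the eye-to-screen distance in pixels.
    distance_px = distance_cm * screen_px / screen_cm
    return math.degrees(math.atan2(centered_px, distance_px))


# Geometry taken from the Experiment defined earlier in this tutorial.
x_deg = pix2deg_1d(1000.0, screen_px=1280, screen_cm=38, distance_cm=68)
y_deg = pix2deg_1d(200.0, screen_px=1024, screen_cm=30.2, distance_cm=68)
print(x_deg, y_deg)
```

In this sketch, a gaze point right of the screen center yields a positive horizontal angle, and a point above the center yields a negative vertical angle, because pixel rows are counted from the top.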

We can use the {py:meth}`~pymovements.Dataset.pos2vel` method to calculate the velocity of the gaze position.

In [None]:
dataset.pos2vel(method='savitzky_golay', degree=2, window_length=7)

dataset.gaze[0]
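
The idea behind the `savitzky_golay` method with `degree=2` and `window_length=7` can be sketched in plain Python. A polynomial is fitted to each window of seven samples, and its slope at the window center is taken as the velocity. For a symmetric window and a polynomial of degree 2, this slope reduces to a simple weighted sum. The function `savgol_velocity` is a hypothetical helper for illustration: it leaves the first and last three samples undefined, whereas the library's handling of the edges may differ.

```python
def savgol_velocity(positions, sampling_rate, half_window=3):
    """Estimate velocity with a Savitzky-Golay first-derivative filter.

    For a polynomial of degree 2 fitted to a symmetric window, the slope at
    the window center is sum(i * y[t + i]) / sum(i ** 2) per sample.
    Samples closer than `half_window` to either end are left as None.
    """
    offsets = range(-half_window, half_window + 1)
    norm = sum(i * i for i in offsets)  # 28 for a window length of 7
    velocities = [None] * len(positions)
    for t in range(half_window, len(positions) - half_window):
        slope_per_sample = sum(i * positions[t + i] for i in offsets) / norm
        # Convert from position units per sample to position units per second.
        velocities[t] = slope_per_sample * sampling_rate
    return velocities


# Hypothetical gaze positions in degrees, moving 0.01 degrees per sample.
positions_deg = [0.01 * k for k in range(10)]
velocity = savgol_velocity(positions_deg, sampling_rate=1000)
print(velocity[3:7])
```

With a sampling rate of 1000 Hz, a steady movement of 0.01 degrees per sample corresponds to a velocity of about 10 degrees per second.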