## How to include a dataset in scikit-mobility
<br>
<i>v 1.0</i>

To add a dataset to `scikit-mobility`, create a dataset pre-processing script (DPS) and a JSON file that stores the information about the dataset.<br>
The JSON file should contain metadata about the dataset, such as the license and the URL where the resource is available. <br>
The DPS contains the instructions to load and pre-process the dataset, making it available to the user without any effort on their part.

The types of mobility datasets that may be included in the library are:
 - <i>trajectories</i> $\rightarrow$ `TrajDataFrame`
 - <i>flows</i> $\rightarrow$ `FlowDataFrame`
 - <i>shapes</i> $\rightarrow$ `GeoDataFrame`
 - <i>others</i> $\rightarrow$ `DataFrame`

### A practical example: aliens moving in the G24 galaxy

Suppose we want to include in the `scikit-mobility` data collection the trajectories of aliens moving in a galaxy called G24.
We first have to choose a name for the dataset that is unique in the collection. The name must be lowercase, start with a letter, and contain only letters, digits, and `_`.<br><br>
<span style="color:green">**Valid**</span> names are: `trajectories_g24`, `alien_movements_g24`, `alien_traces_g24`. <br><br>
<span style="color:red">**Invalid**</span> names are: `TrajAlien_G24`, `99_aliens_in_motion`, `traj-of: aliens`.
<br><br> Assume we choose the name `alien_movements_g24`.
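
The naming rule can be checked mechanically before submitting a dataset. The following is a minimal sketch, not part of `scikit-mobility`: the helper `is_valid_dataset_name` is a hypothetical name, and the pattern assumes that "alphanumerical" means ASCII letters and digits only.

```python
import re

# Pattern derived from the naming rule: a lowercase letter first,
# then any mix of lowercase letters, digits, and underscores.
_DATASET_NAME = re.compile(r"[a-z][a-z0-9_]*")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if `name` satisfies the dataset naming rule."""
    # fullmatch requires the whole string to match, not just a prefix.
    return _DATASET_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["alien_movements_g24", "TrajAlien_G24",
                      "99_aliens_in_motion", "traj-of: aliens"]:
        print(candidate, is_valid_dataset_name(candidate))
```

Uniqueness within the collection is not covered by this check and still has to be verified against the existing dataset names.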

#### 1. Create the `JSON` file

The first step is to create a `JSON` file with the same name as the dataset (in our case `alien_movements_g24.json`).<br>

The `JSON` file should include the following **mandatory** information:

- `name` (str): The name of the dataset. Must be unique inside the data collection.
- `description` (str): A description of the dataset
- `url` (str): The URL at which to download the dataset
- `hash` (str): A known hash (checksum) of the file; it will be used to verify the correctness of the download or to check if an existing file needs to be updated.
- `data_type` (str): The type of the dataset. It can be 'trajectory', 'flow', 'shape', or 'auxiliar'.
- `auth` (str $\in$ {yes, no}): Specifies whether authentication is required to download the file at the specified URL.
- `processor` (str): If not `None`, denotes the compression method that the function `skmob_downloader` applies to the downloaded dataset.
- `license` (str): the license associated with the dataset
- `reference` (str): the reference or references to cite if someone publishes material based on this dataset.
- `tags` (str): a set of tags separated by `;`, used by the dataset search engine.
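
To fill in the `hash` field, a checksum of the file served at `url` is needed. This guide does not state which hash algorithm the downloader expects, so the sketch below assumes SHA-256; check the downloader before relying on it. The helper name `file_sha256` is hypothetical.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of the file at `path`.

    Assumption: SHA-256 is the algorithm expected for the `hash` field.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `file_sha256` on a locally downloaded copy of the dataset returns a 64-character hexadecimal string that can be pasted into the JSON file.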

<br>
An example of the structure follows (more keys can be added):

```
{
"name": "alien_movements_g24",
"description": "Trajectories of 99 aliens moving in the G24 Galaxy.",
"url": "https://something.com",
"hash": "s1162a19242KBGMAGC14Z",
"data_type": "trajectory",
"auth": "no",
"processor": null,
"license": "MIT LICENSE",
"reference": "Extragalactic Mobility, DOI XXX",
"tags": "mobility;trajectories;aliens;spaceships;galaxy"
}
```
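
Before moving on, it can help to verify that a metadata file carries every mandatory key. The sketch below is one possible check, not a library feature: `check_metadata`, `MANDATORY_KEYS`, and `DATA_TYPES` are hypothetical names, and the embedded metadata is illustrative and assumed to be strictly valid JSON.

```python
import json

# Mandatory keys and allowed data types, as listed in this guide.
MANDATORY_KEYS = ("name", "description", "url", "hash", "data_type",
                  "auth", "processor", "license", "reference", "tags")
DATA_TYPES = {"trajectory", "flow", "shape", "auxiliar"}


def check_metadata(meta: dict) -> list:
    """Return a list of problems found in a dataset metadata dict."""
    problems = [f"missing key: {key}" for key in MANDATORY_KEYS
                if key not in meta]
    if "data_type" in meta and meta["data_type"] not in DATA_TYPES:
        problems.append(f"unknown data_type: {meta['data_type']}")
    if "auth" in meta and meta["auth"] not in {"yes", "no"}:
        problems.append(f"auth must be 'yes' or 'no': {meta['auth']}")
    return problems


# Illustrative metadata; in practice, read it from the dataset's JSON file.
example = json.loads("""
{
  "name": "alien_movements_g24",
  "description": "Trajectories of 99 aliens moving in the G24 Galaxy.",
  "url": "https://something.com",
  "hash": "s1162a19242KBGMAGC14Z",
  "data_type": "trajectory",
  "auth": "no",
  "processor": null,
  "license": "MIT LICENSE",
  "reference": "Extragalactic Mobility, DOI XXX",
  "tags": "mobility;trajectories;aliens;spaceships;galaxy"
}
""")

print(check_metadata(example))     # an empty list means no problems were found
print(example["tags"].split(";"))  # the tags as a Python list
```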

#### 2. Create the data pre-processing script

The second step is to create the data pre-processing script (DPS).
The name of the DPS should be the name of the dataset followed by `.py`; in our case the DPS name is `alien_movements_g24.py`.
A template of the DPS can be found in `dps_example.py`.<br><br>

The dataset class name (line 12 of `dps_example.py`) must be the same as the dataset name, in our case `alien_movements_g24`.
<br><br>
```
line 12:
class alien_movements_g24(DatasetBuilder):
```
<br><br>
The final step is to implement the function `prepare` (line 31 of `dps_example.py`).

To implement the function, assume that the argument `f_names` (a list of strings) holds the paths of the files downloaded from the URL specified in the JSON file. Then:
1. load the dataset
2. pre-process it if necessary (e.g., adjust the timezone or delete/add some information)
3. convert it into the correct skmob format:
 - trajectory $\rightarrow$ `TrajDataFrame`
 - flow $\rightarrow$ `FlowDataFrame`
 - shape $\rightarrow$ `GeoDataFrame`
 - auxiliar $\rightarrow$ `DataFrame`
4. return the dataset in the skmob format

<br>
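In our example, the `prepare` function could look like the following sketch. It assumes a single CSV file with hypothetical columns `alien_id`, `lat`, `lng`, and `time`; adapt the loading and filtering to the actual dataset:

```
import pandas as pd
from skmob import TrajDataFrame


def prepare(self, f_names):

    # step 1: load the dataset (assumed here: a single CSV file)
    file_path = f_names[0]
    df = pd.read_csv(file_path)

    # step 2: pre-process it, e.g., drop records without coordinates
    df_filtered = df.dropna(subset=["lat", "lng"])

    # step 3: convert it into the skmob format
    # (column names are hypothetical; map them to your own)
    file_tdf = TrajDataFrame(df_filtered, latitude="lat", longitude="lng",
                             datetime="time", user_id="alien_id")

    # step 4: return the dataset in the skmob format
    return file_tdf

```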


Finally, put the files `alien_movements_g24.json` and `alien_movements_g24.py` inside a folder with the same name as the dataset (`alien_movements_g24`).

```
Folder alien_movements_g24 contains:
- alien_movements_g24.json
- alien_movements_g24.py
```
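
The layout and naming conventions described in this guide can be checked with a short script. This is a sketch under stated assumptions, not a tool shipped with `scikit-mobility`: `check_dataset_folder` is a hypothetical helper, and it assumes the metadata file is strictly valid JSON. The DPS is parsed with `ast` rather than imported, so the check does not execute it.

```python
import ast
import json
from pathlib import Path


def check_dataset_folder(folder: str) -> list:
    """Return a list of layout problems for a dataset folder."""
    root = Path(folder).resolve()
    name = root.name  # the folder name doubles as the dataset name
    json_path = root / f"{name}.json"
    dps_path = root / f"{name}.py"
    problems = []

    if not json_path.is_file():
        problems.append(f"missing {json_path.name}")
    else:
        try:
            meta = json.loads(json_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            problems.append(f"{json_path.name} is not valid JSON")
        else:
            if not isinstance(meta, dict) or meta.get("name") != name:
                problems.append("JSON 'name' does not match the folder name")

    if not dps_path.is_file():
        problems.append(f"missing {dps_path.name}")
    else:
        # Parse the script without executing it; collect top-level classes.
        tree = ast.parse(dps_path.read_text(encoding="utf-8"))
        classes = {node.name for node in tree.body
                   if isinstance(node, ast.ClassDef)}
        if name not in classes:
            problems.append(f"no class named {name} in {dps_path.name}")

    return problems
```

Calling `check_dataset_folder("alien_movements_g24")` returns an empty list when both files are present, the JSON `name` matches the folder name, and the DPS defines a class with that name.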


### DATA-PAPER

https://docs.google.com/spreadsheets/d/1K2O8Zqp_F_TyHiVc3Ynp3DCJ30QHXUhTyJu1CS8HeB0/edit?usp=sharing