# make_dataset pipeline
This notebook explains the data-processing pipeline in our `make_dataset.py` file. It walks through each step, from the raw data to the segmented data with computed KPIs, and explains the functions and choices made along the way.

For each step, we state the command that runs it and then explain the relevant parts of the underlying code. Additionally, we show what happens to the data after each step. For a quick overview, all commands are listed here.

#### ***Step-by-step***  
**Step 1 - Convert:**       `python src/data/make_dataset.py convert`  
**Step 2 - Validate:**      `python src/data/make_dataset.py validate` (\*)    
**Step 3 - Segment:**       `python src/data/make_dataset.py segment`  
**Step 4 - Match:**         `python src/data/make_dataset.py match`  
**Step 5 - Resample:**      `python src/data/make_dataset.py resample` (\*)  
**Step 6 - KPI:**           `python src/data/make_dataset.py kpi`

Note that steps marked with (\*) also support a `--verbose` flag, which may display plots and additional tqdm progress bars.

#### ***Everything-at-once***  
To run everything at once, use `python src/data/make_dataset.py all`. Adding `--verbose` still enables the verbose output at the relevant steps.

#### ***Begin-from-and-do-the-rest***
The pipeline can also continue from any step. For example, if you have already run `convert`, `validate` and `segment` and wish to run the remaining steps at once, use `python src/data/make_dataset.py --begin-from match`.
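
The three ways of invoking the script can be pictured with a small `argparse` sketch. This is a hypothetical illustration of how the step selection could work, not the actual code in `make_dataset.py`: the function name `steps_to_run`, the `STEPS` list, and the choice to treat a missing step as `all` are assumptions made for this example.

```python
import argparse

# Pipeline steps in execution order, as listed in the overview above.
STEPS = ['convert', 'validate', 'segment', 'match', 'resample', 'kpi']


def steps_to_run(argv):
    """Return the list of steps selected by the command-line arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # Assumption: a missing step behaves like 'all'.
    parser.add_argument('step', nargs='?', default='all', choices=STEPS + ['all'])
    parser.add_argument('--begin-from', choices=STEPS)
    parser.add_argument('--verbose', action='store_true')
    args = parser.parse_args(argv)

    if args.begin_from is not None:
        # Run the named step and everything after it.
        return STEPS[STEPS.index(args.begin_from):]
    if args.step == 'all':
        return list(STEPS)
    return [args.step]


single = steps_to_run(['convert'])
everything = steps_to_run(['all'])
remaining = steps_to_run(['--begin-from', 'match'])
```

Here `remaining` contains `match`, `resample` and `kpi`, mirroring the `--begin-from match` command above.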


## Step 1 - Converting data 

**Command:** `python src/data/make_dataset.py convert`

### Explanation
The raw data is the CPH1 route from the LiRA dataset, as can be found in [Table 7](https://doi.org/10.1016/j.dib.2023.109426) of the LiRA-CD paper. Our first goal, as described in *section 3* of the paper, is to convert some of the car sensor signals:

$$
\begin{align*}
    s = ((s_{\text{LiRA-CD}} - b^* \cdot r^*) - b) \cdot r,
\end{align*}
$$
where $s_{\text{LiRA-CD}}$ is the sensor signal stored in LiRA-CD, $b^*$ and $r^*$ are the offset and resolution values (obtained through the CanZE application), and $b$ and $r$ are the corrected offset and resolution values (found in the LiRA project). The values are taken from the following [paper](https://doi.org/10.1016/j.dib.2023.109426) and are listed below:

```Python
CONVERT_PARAMETER_DICT = {
    'acc_long':     {'bstar': 198,      'rstar': 1,     'b': 198,   'r': 0.05   },
    'acc_trans':    {'bstar': 32768,    'rstar': 1,     'b': 32768, 'r': 0.04   },
    'acc_yaw':      {'bstar': 2047,     'rstar': 1,     'b': 2047,  'r': 0.1    },
    'brk_trq_elec': {'bstar': 4096,     'rstar': -1,    'b': 4098,  'r': -1     },
    'whl_trq_est':  {'bstar': 12800,    'rstar': 0.5,   'b': 12700, 'r': 1      },
    'trac_cons':    {'bstar': 80,       'rstar': 1,     'b': 79,    'r': 1      },
    'trip_cons':    {'bstar': 0,        'rstar': 0.1,   'b': 0,     'r': 1      }
}
```
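
To see how these parameters enter the formula, here is a minimal sketch that applies the conversion to a handful of samples. The helper name `convert_signal` and the raw values are made up for illustration; only the `acc_long` parameters are copied from `CONVERT_PARAMETER_DICT` above, and the actual pipeline code may be organised differently.

```python
def convert_signal(values, bstar, rstar, b, r):
    """Apply s = ((s_raw - bstar * rstar) - b) * r to each raw sample (hypothetical helper)."""
    return [((v - bstar * rstar) - b) * r for v in values]


# Parameters for 'acc_long', copied from CONVERT_PARAMETER_DICT.
acc_long_params = {'bstar': 198, 'rstar': 1, 'b': 198, 'r': 0.05}

# Made-up raw samples, for illustration only.
raw = [396.0, 416.0, 376.0]
converted = convert_signal(raw, **acc_long_params)
```

For the first sample, both offsets are subtracted (396 - 198 - 198 = 0) before scaling by $r$, so the converted value is zero. The other two samples differ from it by 20 raw units, which corresponds to 1 unit after scaling by $r = 0.05$.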
In addition to performing the conversion, we also smooth some of the car sensor signals, as they are prone to noise and can show a lot of sporadic behavior. To smooth the signals we use Locally Weighted Scatterplot Smoothing (LOWESS).

```Python
SMOOTH_PARAMETER_DICT = {
    'acc.xyz':       {'kind': 'lowess', 'frac': 0.005},
    'spd_veh':       {'kind': 'lowess', 'frac': 0.005},
    'acc_long':      {'kind': 'lowess', 'frac': 0.005},
    'acc_trans':     {'kind': 'lowess', 'frac': 0.005}
}
```
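
To make the role of `frac` concrete, the following is a minimal, dependency-free sketch of the tricube-weighted local linear fit at the core of LOWESS. It is a simplified illustration and not the pipeline's implementation: it omits the robustness iterations of full LOWESS, and the function name `lowess_sketch` and the synthetic signal are assumptions made for this example. In practice, a library routine such as `statsmodels.nonparametric.smoothers_lowess.lowess` would typically be used.

```python
import math
import random


def lowess_sketch(x, y, frac):
    """Simplified LOWESS: tricube-weighted local linear fit, no robustness iterations."""
    n = len(x)
    k = max(2, int(round(frac * n)))  # number of neighbours used in each local fit
    smoothed = []
    for xi in x:
        # The distance to the k-th nearest point sets the local bandwidth.
        dists = [abs(xj - xi) for xj in x]
        h = sorted(dists)[k - 1] or 1e-12
        # Tricube weights: 1 at the centre, falling to 0 at the bandwidth.
        w = [(1.0 - min(d / h, 1.0) ** 3) ** 3 for d in dists]
        # Weighted least-squares line through the neighbourhood.
        sw = sum(w)
        mx = sum(wj * xj for wj, xj in zip(w, x)) / sw
        my = sum(wj * yj for wj, yj in zip(w, y)) / sw
        sxx = sum(wj * (xj - mx) ** 2 for wj, xj in zip(w, x))
        sxy = sum(wj * (xj - mx) * (yj - my) for wj, xj, yj in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (xi - mx))
    return smoothed


# Synthetic noisy signal, for illustration only.
rng = random.Random(0)
t = [10.0 * i / 999 for i in range(1000)]
clean = [math.sin(ti) for ti in t]
noisy = [c + rng.gauss(0.0, 0.3) for c in clean]

# frac=0.005 of 1000 samples means each local fit uses 5 neighbours.
smoothed = lowess_sketch(t, noisy, frac=0.005)
```

A small `frac` keeps each fit very local, so the smoothed signal follows the original closely and only the sample-to-sample noise is damped. Larger values smooth more aggressively at the cost of flattening genuine short-lived features.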

### What happened to the data?