In [2]:
import sys
sys.path.insert(0, "../src/data")

import h5py
# from make_dataset import *

# make_dataset pipeline
This notebook seeks to explain the data-processing pipeline of our make_dataset.py file. It will take you through each step, going from the raw data to the segmented data with computed KPIs, while explaining the functions and choices made at each step.

For each step, we state the command that runs the *step* and then explain the relevant parts of the code behind it. Additionally, we showcase what happens to the data after each step. For a quick overview, all commands are listed here.

#### ***Step-by-step***  
**Step 1 - Convert:**       `python src/data/make_dataset.py convert`  
**Step 2 - Validate:**      `python src/data/make_dataset.py validate` (\*)    
**Step 3 - Segment:**       `python src/data/make_dataset.py segment`  
**Step 4 - Match:**         `python src/data/make_dataset.py match`  
**Step 5 - Resample:**      `python src/data/make_dataset.py resample` (\*)  
**Step 6 - KPI:**           `python src/data/make_dataset.py kpi`

Note that steps marked with (\*) also support a `--verbose` flag, which may display plots and additional tqdm progress bars.

#### ***Everything-at-once***  
If you wish to run everything at once, use the command `python src/data/make_dataset.py all`. Adding `--verbose` still takes effect at the relevant steps.
#### ***Begin-from-and-do-the-rest***
We added functionality that lets you continue from any point in the data pipeline. For example, if you have already run `convert`, `validate` and `segment` and wish to run everything else at once, use the command `python src/data/make_dataset.py --begin-from match`.
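
The begin-from behaviour can be pictured as slicing an ordered list of steps. The sketch below is illustrative only: the step names come from the commands listed above, but `PIPELINE_STEPS` and `steps_to_run` are hypothetical names and not necessarily how `make_dataset.py` is organised internally.

```python
# Ordered step names, taken from the commands listed above.
PIPELINE_STEPS = ["convert", "validate", "segment", "match", "resample", "kpi"]


def steps_to_run(begin_from=None):
    """Return the steps executed when starting at `begin_from` (hypothetical helper)."""
    if begin_from is None:
        # No starting point given: run the whole pipeline.
        return list(PIPELINE_STEPS)
    if begin_from not in PIPELINE_STEPS:
        raise ValueError(f"Unknown step: {begin_from!r}")
    start = PIPELINE_STEPS.index(begin_from)
    return PIPELINE_STEPS[start:]


print(steps_to_run("match"))
```

Starting from `match` yields `match`, `resample` and `kpi`, in that order.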


# Raw data

Before we dive into the data pipeline, let's get an overview of the datasets we are dealing with. The raw data consists of four datasets:

#### ***1. GM data (AutoPi and CAN)***  

The GM data holds the AutoPi and CAN sensor signals. It is stored in an HDF5 file organised by car ID and pass ID, as shown below.

>platoon_CPH1.hdf5<br>
│── GM<br>
│&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;│── Car ID [int]<br>
│&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;│&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;│── Pass ID [int]<br>
│&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;│&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;│&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;└── [ acc.xyz (11359), acc_long (16011), acc_trans (16011), acc_yaw (16011), alt (125), asr_trq_req_dyn (8005), ... ]<br>
│&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;│&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;...<br>
│&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;...<br>
│<br>
... (Not used)


#### ***2. GoPro***

This one is honestly a hassle.


#### ***3. ARAN***

Used for computing KPIs

In [12]:
import pandas as pd

aran_raw = pd.read_csv("../data/raw/ref_data/cph1_aran_hh.csv", sep=';', encoding='unicode_escape')
aran_raw.head()

Unnamed: 0,L_Route_ID,DCSTimeStamp,BeginChainage,EndChainage,Venstre IRI (m/km),Højre IRI (m/km),Rivninger MeanRI (cm³/m²),Rivninger MeanExistingRI (cm³/m²),Rivninger MeanRPI (cm³/m²),Rivninger MeanAVC (cm³/m²),...,Latitude To (rad),Longitude From (rad),Longitude To (rad),Heading (rad),Elevation (m),Lat,Lon,Alt,Heading,Bearing
0,9990001-0-HVB1,44055.48542,-53.642797,-52.642797,0.0,0.0,,,,,...,0.971321,0.217917,0.217917,1.548209,38.407361,55.652595,12.485727,38.407361,88.705816,89.282278
1,9990001-0-HVB1,44055.48542,-52.642797,-51.642797,,,,,,,...,0.971321,0.217917,0.217917,1.548716,38.409111,55.652596,12.485742,38.409111,88.734917,89.282278
2,9990001-0-HVB1,44055.48542,-51.642797,-50.642797,,,,,,,...,0.971321,0.217917,0.217918,1.549302,38.414436,55.652596,12.485758,38.414436,88.768446,89.271525
3,9990001-0-HVB1,44055.48542,-50.642797,-49.642797,,,28.416636,32.142211,2238.577759,2266.9944,...,0.971321,0.217918,0.217918,1.550076,38.423914,55.652596,12.485774,38.423914,88.812826,89.265799
4,9990001-0-HVB1,44055.48542,-49.642797,-48.642797,,,,,,,...,0.971321,0.217918,0.217918,1.550698,38.431759,55.652596,12.48579,38.431759,88.848426,89.300827


#### ***4. P79***

Not currently used, but it is a very precise laser measurement of the road surface.
Using a model such as the ***quarter car model***, one can simulate a vehicle driving on the road.

In [13]:
p79_raw = pd.read_csv("../data/raw/ref_data/cph1_zp_hh.csv", sep=';', encoding='unicode_escape')
p79_raw.head()

Unnamed: 0,Distance [m],Laser 1 [mm],Laser 2 [mm],Laser 3 [mm],Laser 4 [mm],Laser 5 [mm],Laser 6 [mm],Laser 7 [mm],Laser 8 [mm],Laser 9 [mm],...,Laser 22 [mm],Laser 23 [mm],Laser 24 [mm],Laser 25 [mm],Lat,Lon,Højde,GeoHøjde,Alt,Bearing
0,0.0,77.979963,76.129577,73.432504,72.343552,71.085527,70.748823,71.22128,69.414377,66.948905,...,47.506321,44.715228,42.602378,40.864356,55.652685,12.488391,12.8,38.400002,51.200002,87.366884
1,0.100709,77.577069,75.484964,72.750423,71.660123,70.377488,69.797778,70.464433,69.102455,66.336974,...,47.03387,44.391341,42.129228,41.160619,55.652685,12.488392,12.8,38.400002,51.200002,87.366884
2,0.201419,76.67461,74.928609,72.165267,70.970617,69.58861,69.407065,69.937898,68.307188,65.791922,...,46.652856,44.162898,42.144453,41.047513,55.652685,12.488394,12.8,38.400002,51.200002,87.366884
3,0.302128,76.192724,74.53589,71.536658,70.283476,69.134717,68.887796,69.609289,68.062037,65.154582,...,46.271888,43.681742,41.807248,40.788764,55.652685,12.488396,12.8,38.400002,51.200002,87.366884
4,0.402838,75.589468,73.560795,70.603412,69.809506,68.749661,68.331567,69.250653,67.711902,64.769605,...,45.98265,43.383675,41.542447,40.4437,55.652685,12.488397,12.8,38.400002,51.200002,87.309725


# Step 1 - Converting data 

**Command:** `python src/data/make_dataset.py convert`

## Explanation

#### ***AutoPi and CAN data***  
The raw data is the CPH1 route from the LiRA dataset, listed in [Table 7](https://doi.org/10.1016/j.dib.2023.109426) of the LiRA-CD paper. Our first goal, as described in *section 3* of the paper, is to convert some of the car sensor signals:

$$
\begin{align*}
    s = ((s_{\text{LiRA-CD}} - b^* \cdot r^*) - b) \cdot r,
\end{align*}
$$
where $s_{\text{LiRA-CD}}$ is the sensor signal stored in LiRA-CD, $b^*$ and $r^*$ are the offset and resolution values (obtained through the CanZE application), and $b$ and $r$ are the corrected offset and resolution values (found in the LiRA project). The values are taken from the [paper](https://doi.org/10.1016/j.dib.2023.109426) and are listed below:

```Python
CONVERT_PARAMETER_DICT = {
    'acc_long':     {'bstar': 198,      'rstar': 1,     'b': 198,   'r': 0.05   },
    'acc_trans':    {'bstar': 32768,    'rstar': 1,     'b': 32768, 'r': 0.04   },
    'acc_yaw':      {'bstar': 2047,     'rstar': 1,     'b': 2047,  'r': 0.1    },
    'brk_trq_elec': {'bstar': 4096,     'rstar': -1,    'b': 4098,  'r': -1     },
    'whl_trq_est':  {'bstar': 12800,    'rstar': 0.5,   'b': 12700, 'r': 1      },
    'trac_cons':    {'bstar': 80,       'rstar': 1,     'b': 79,    'r': 1      },
    'trip_cons':    {'bstar': 0,        'rstar': 0.1,   'b': 0,     'r': 1      }
}
```
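
As a minimal sketch of how the conversion formula could be applied, the snippet below implements it directly. The helper name `convert_signal`, the reduced dictionary `PARAMS` and the input values are made up for illustration; only the parameter values are copied from `CONVERT_PARAMETER_DICT`.

```python
# Two entries copied from CONVERT_PARAMETER_DICT above, for illustration.
PARAMS = {
    'acc_long':    {'bstar': 198,   'rstar': 1,   'b': 198,   'r': 0.05},
    'whl_trq_est': {'bstar': 12800, 'rstar': 0.5, 'b': 12700, 'r': 1},
}


def convert_signal(s_lira, bstar, rstar, b, r):
    """Apply s = ((s_LiRA-CD - b* * r*) - b) * r (hypothetical helper).

    Works on plain numbers and, element-wise, on NumPy arrays.
    """
    return ((s_lira - bstar * rstar) - b) * r


# Made-up stored value, chosen only to show the arithmetic:
# ((400 - 198 * 1) - 198) * 0.05 = 4 * 0.05
example = convert_signal(400, **PARAMS['acc_long'])
print(example)
```

Keeping the parameters in a dictionary keyed by signal name means the same function can be applied to every signal that needs conversion.
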
In addition to performing the conversion, we also smooth some of the car sensor signals, as they are prone to noise and can show a lot of sporadic behavior. To smooth the signals we use Locally Weighted Scatterplot Smoothing (LOWESS).

```Python
SMOOTH_PARAMETER_DICT = {
    'acc.xyz':       {'kind': 'lowess', 'frac': 0.005},
    'spd_veh':       {'kind': 'lowess', 'frac': 0.005},
    'acc_long':      {'kind': 'lowess', 'frac': 0.005},
    'acc_trans':     {'kind': 'lowess', 'frac': 0.005}
}
```
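
To give a feel for what the `frac` parameter does, here is a simplified, dependency-free LOWESS sketch. It is not the pipeline's implementation: it runs a single weighted linear fit per point with tricube weights and skips the robustness iterations of full LOWESS. The names `tricube` and `lowess_simple` and the synthetic test signal are assumptions made for this example. In practice one would likely use a library routine such as `lowess` from `statsmodels.nonparametric.smoothers_lowess`, though whether the pipeline does so is not stated here.

```python
import math


def tricube(u):
    """Tricube kernel: weight 1 at distance 0, falling to 0 at distance 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def lowess_simple(x, y, frac):
    """Simplified LOWESS: one weighted linear fit per point, no robustness passes."""
    n = len(x)
    # Number of neighbours in each local fit (at least 3, at most n).
    k = min(n, max(3, math.ceil(frac * n)))
    smoothed = []
    for x0 in x:
        # The k points closest to x0, and the bandwidth they span.
        nearest = sorted(range(n), key=lambda i: abs(x[i] - x0))[:k]
        h = max(abs(x[i] - x0) for i in nearest) or 1.0
        w = [tricube((x[i] - x0) / h) for i in nearest]
        sw = sum(w)
        # Weighted means of x and y inside the window.
        mx = sum(wi * x[i] for wi, i in zip(w, nearest)) / sw
        my = sum(wi * y[i] for wi, i in zip(w, nearest)) / sw
        sxx = sum(wi * (x[i] - mx) ** 2 for wi, i in zip(w, nearest))
        sxy = sum(wi * (x[i] - mx) * (y[i] - my) for wi, i in zip(w, nearest))
        # Fall back to the weighted mean if the window is degenerate.
        slope = sxy / sxx if sxx > 1e-12 else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed


# Synthetic signal: a sine wave with alternating +/-0.2 "noise".
t = [i * 0.1 for i in range(200)]
noisy = [math.sin(ti) + (0.2 if i % 2 else -0.2) for i, ti in enumerate(t)]
smooth = lowess_simple(t, noisy, frac=0.05)
```

A larger `frac` widens each local window and gives a smoother result. A small value such as the `0.005` used above keeps the fit local, so fast changes in the signal are preserved.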

#### ***GoPro Data***  

The values of the GoPro data are not altered by any conversion, but we do change the structure of the files. Instead of storing the measurements according to each GoPro recording, the measurements are paired with the corresponding car (***16006***, ***16009*** or ***16011***), and all passes on the road in the given trip are joined in a single csv file for each measurement type (***accl***, ***gps5***, ***gyro***). By doing this, we have aligned the structures of the possible input data, ***GM*** and ***GoPro***, leading to easier matching in the coming steps.


# Step 2 - Validation
To validate that our conversion is done correctly and that the LOWESS smoothing has improved the signal, we compare the AutoPi data with the CAN data.

#### Process data for comparison
1. We fix the sampling frequency (used to calculate the time) to $f_s=10$.
2. Extract the speed distance, i.e. the distance calculated from the vehicle speed in the GM data (`spd_veh`).
3. Extract the GPS (`gps`) longitude and latitude data.
4. Extract the odometer (`odo`) distance measure and add the fine distance measure (`f_dist`), all computed in meters.
5. Extract and normalise the AutoPi 3D accelerations (`acc.xyz`).
6. Extract and normalise the transverse (`acc_trans`) and longitudinal (`acc_long`) accelerations.
7. Resample time to 100 Hz.
8. Resample all extracted data to 100 Hz via interpolation in the function `clean_int(...)`.
9. Ensure accelerations are in $m/s^2$ and not $g$.
10. Determine the orientation of the sensors and reorient if needed, based on the correlation of the AutoPi 3D x- and y-accelerations with the CAN accelerations.
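
The resampling and unit steps above can be sketched as follows. This is not the `clean_int(...)` function itself, whose details are not shown here. `resample_linear` and `g_to_ms2` are hypothetical helpers, and the use of standard gravity (9.80665 m/s²) for the conversion is an assumption. On arrays, NumPy's `np.interp` performs the same linear interpolation.

```python
G = 9.80665  # standard gravity in m/s^2 (assumed conversion factor)


def resample_linear(t, values, fs_out=100.0):
    """Linearly interpolate (t, values) onto a uniform grid at fs_out Hz.

    `t` must be strictly increasing and contain at least two samples.
    """
    n_out = int(round((t[-1] - t[0]) * fs_out)) + 1
    t_new = [t[0] + k / fs_out for k in range(n_out)]
    out = []
    j = 0  # index of the sample to the left of the current grid point
    for tk in t_new:
        while j < len(t) - 2 and t[j + 1] < tk:
            j += 1
        frac = (tk - t[j]) / (t[j + 1] - t[j])
        # Clamp to guard against rounding just outside the bracket.
        frac = min(max(frac, 0.0), 1.0)
        out.append(values[j] + frac * (values[j + 1] - values[j]))
    return t_new, out


def g_to_ms2(values):
    """Convert accelerations from units of g to m/s^2."""
    return [v * G for v in values]


# Made-up 10 Hz samples, upsampled to 100 Hz.
t_10hz = [0.0, 0.1, 0.2]
acc_g = [0.0, 1.0, 0.0]
t_100hz, acc_100hz = resample_linear(t_10hz, acc_g, fs_out=100.0)
acc_ms2 = g_to_ms2(acc_100hz)
```

Putting all signals on the same 100 Hz time grid is what makes the sample-by-sample comparisons in the next part possible.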

#### Compare and calculate correlation coefficients
1. Compare x-accelerations
2. Compare y-accelerations
3. Compare distance, speed distance with gps and speed distance with odometer.
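
A minimal sketch of the correlation check, under the assumption that both signals are already on the same time grid. `pearson` and `align_sign` are hypothetical helpers and the sample values are made up. The reorientation shown here only handles a sign flip of one axis, so the pipeline's own logic may do more. On arrays, NumPy's `np.corrcoef` computes the same coefficient.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def align_sign(autopi_axis, can_axis):
    """Flip the AutoPi axis if it is negatively correlated with the CAN axis."""
    r = pearson(autopi_axis, can_axis)
    if r < 0:
        return [-v for v in autopi_axis], -r
    return list(autopi_axis), r


# Made-up example: the AutoPi sensor is mounted the opposite way round.
can_acc = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5]
autopi_acc = [-v for v in can_acc]
aligned, r = align_sign(autopi_acc, can_acc)
```

A coefficient close to 1 after alignment indicates that the two sensors agree. A strongly negative value before alignment is the hint that an axis points the other way.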

**NOTE**: This step does not change the structure of the data.