# Data pre-processing for AV edgecase

**Jasvir Dhillon June 30, 2021**

## All functions to extract features and outputs from result files

**Workflow**
* Download the dataset
* Read and order files
* Extract data from the files and place it in a dataframe. Clean the data and apply transformations if necessary
* Explore and investigate data
* Store data for later access
* Split the dataset into training and test data. Store both for later use

Please note: All visual plots (from plotly) consume a lot of storage space. Hence they have been exported to 'edgecase_datat_processing.html' for review and have been commented out in this notebook

### Importing libraries

In [1]:
import numpy as np
import pandas as pd

In [22]:
pip install natsort

Collecting natsort
  Downloading natsort-7.1.1-py3-none-any.whl (35 kB)
Installing collected packages: natsort
Successfully installed natsort-7.1.1
Note: you may need to restart the kernel to use updated packages.


## Read and organize data

**Dataset information**

The variations in features are shown in the image as follows:
										
<img src='nb_ims/feature_variations.jpg'/>

**Features:** 
* speed: The cruising speed of the vehicle when the traffic object is detected on a collision course (kph)
* body_mass: The sprung mass of the vehicle body (kg)
* cogx: Centre of gravity location of the sprung mass along the longitudinal axis in the vehicle frame of reference. The lower this value, the closer the centre of gravity is to the rear hitch point (m)
* tire_rr: Rolling resistance factor for the tyres (same value assumed for all tyres)
* Brake_file: Name of the hydraulic brake system parameter file
* road_mu: Road coefficient of friction
* react_time: Time taken by the driver (or controller) to react to the object detection and apply the brakes (s)
* obj_dist: Straight-line distance in front of the vehicle at which the traffic object is detected (m)
* update_rate: Update frequency of the simulated LIDAR sensor, which is responsible for traffic object detection (Hz)

*Each brake file changes seven parameters. The rest of the hydraulic system has been kept unchanged*

```
            pedal.ratio   boo.ampli   mc.area   pf.area   pf.rbrake   pr.area   pr.rbrake

HydESP_1    3.0           5.0         4.5       23        0.10        11        0.10
HydESP_2    1.5           6.0         3.5       30        0.07        14        0.07
HydESP_3    2.5           5.0         4.5       15        0.10        15        0.10
HydESP_4    4.0           3.0         6.0       12        0.14         8        0.14
HydESP_5    3.0           5.0         3.5       30        0.10        14        0.10
```

**Brake file features**
* pedal.ratio: Brake pedal ratio 
* boo.ampli: Brake booster amplification factor
* mc.area: Master cylinder piston area (cm^2)
* pf.area: Effective brake cylinder piston area at individual front brake (cm^2)
* pf.rbrake: Effective front brake radius (m)
* pr.area: Effective brake cylinder piston area at individual rear brake (cm^2)
* pr.rbrake: Effective rear brake radius (m)
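
The brake-file table above can also be kept as a lookup table and joined onto the results by brake file name. The following is a minimal sketch under stated assumptions: `brake_table` and `runs` are illustrative names, the rows in `runs` are made up, and the brake file names are assumed to be already stripped of surrounding whitespace.

```python
import pandas as pd

# Brake parameters per file, copied from the table above
brake_table = pd.DataFrame(
    {
        "pedal.ratio": [3.0, 1.5, 2.5, 4.0, 3.0],
        "boo.ampli": [5.0, 6.0, 5.0, 3.0, 5.0],
        "mc.area": [4.5, 3.5, 4.5, 6.0, 3.5],
        "pf.area": [23, 30, 15, 12, 30],
        "pf.rbrake": [0.10, 0.07, 0.10, 0.14, 0.10],
        "pr.area": [11, 14, 15, 8, 14],
        "pr.rbrake": [0.10, 0.07, 0.10, 0.14, 0.10],
    },
    index=["HydESP_1", "HydESP_2", "HydESP_3", "HydESP_4", "HydESP_5"],
)

# Made-up results: one brake file name per simulation (hypothetical rows)
runs = pd.DataFrame({"brake_file": ["HydESP_4", "HydESP_1", "HydESP_4"]})

# Look up the seven brake parameters for every row by file name
expanded = runs.join(brake_table, on="brake_file")
print(expanded[["brake_file", "pedal.ratio", "pf.area"]])
```

A single join adds all seven columns at once, so the table only has to be written down in one place.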

Total number of features: 15 (15-dimensional data)

Total number of outputs: 2 ('collision' and 'dist_to_col')

The total number of variations: 4x5x3x3x5x3x3x4x3 -> 97,200

Each simulation (variation) result stores two files:
* .dat.info: A text file that contains the feature names and values that we need to extract
* .dat: An ASCII file that contains the output values (binary collision flag, Yes: 1, No: 0, and the distance to collision logged at the end of each simulation)
    
The total number of files: 97,200x2 -> 194,400

Total data size: 660 MB | Total disk space required: 1.05 GB
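
The variation and file counts above can be checked with a few lines of arithmetic. In this sketch the assignment of each factor to a parameter is an assumption (the factors are taken in the order of the feature list, with the brake file counted as one parameter with five values); only the product itself comes from the text.

```python
import math

# Number of distinct values per varied parameter.
# The name-to-count mapping is assumed; the text only gives the product.
levels = {
    "speed": 4,
    "body_mass": 5,
    "cogx": 3,
    "tire_rr": 3,
    "brake_file": 5,
    "road_mu": 3,
    "react_time": 3,
    "obj_dist": 4,
    "update_rate": 3,
}

n_variations = math.prod(levels.values())
n_files = 2 * n_variations  # one .dat and one .dat.info per variation

print(n_variations)  # 97200
print(n_files)       # 194400
```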

### Download data from dropbox storage

In [37]:
%mkdir data
!wget -O data/edge-case_dataset.zip https://www.dropbox.com/s/j9jrkkaptkygsb5/edge-case_dataset.zip?dl=0

--2021-06-17 08:43:30--  https://www.dropbox.com/s/j9jrkkaptkygsb5/edge-case_dataset.zip?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.6.18, 2620:100:601c:18::a27d:612
Connecting to www.dropbox.com (www.dropbox.com)|162.125.6.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/j9jrkkaptkygsb5/edge-case_dataset.zip [following]
--2021-06-17 08:43:31--  https://www.dropbox.com/s/raw/j9jrkkaptkygsb5/edge-case_dataset.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uce6f9ae643bfa54ba7f4b99efb9.dl.dropboxusercontent.com/cd/0/inline/BQn7wIhAnZjkHdqth1kz6CuDAMQSWi6VBTFQrdyZXowXnUMls3GRpeVOqcUzSfwv4evd2qnZhUPvtenL3McLTrQ2TAmVTBA4X-JNu4B-vPNWFGOuxwx7zNSBSHt780cmJ8_42_kZxTd3J_xkSxWfbXxv/file# [following]
--2021-06-17 08:43:31--  https://uce6f9ae643bfa54ba7f4b99efb9.dl.dropboxusercontent.com/cd/0/inline/BQn7wIhAnZjkHdqth1kz6CuDAMQSWi6VBTFQrdyZXowXnUMls3GRpeVOqc

In [39]:
# Unzip the data
### Output removed because it printed all the 194,400 file names which was too long ###

#!unzip data/edge-case_dataset.zip -d data
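
As an alternative to the shell `unzip` call, the archive can be extracted from Python without printing every file name. This is a sketch using the standard-library `zipfile` module; `extract_quietly` is a hypothetical helper name, not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_quietly(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member count."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return len(archive.namelist())


# Example call, using the paths from the cells above:
# n_members = extract_quietly("data/edge-case_dataset.zip", "data")
```

Returning the member count gives a quick sanity check against the expected number of files without listing them.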

### Read data files

In [40]:
import glob
import natsort

def read_org_data(dat_filepath, dat_info_filepath):
    """Return the output (.dat) and feature (.dat.info) file lists in matching order."""
    # glob.glob already returns a list of the matching paths
    op_filelist = glob.glob(dat_filepath)
    feature_filelist = glob.glob(dat_info_filepath)

    # Natural-sort both lists (descending) so that entry i of each list
    # refers to the same simulation
    op_filelist = natsort.natsorted(op_filelist, reverse=True)
    feature_filelist = natsort.natsorted(feature_filelist, reverse=True)

    return op_filelist, feature_filelist
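
Sorting both lists the same way keeps them aligned only if every `.dat` file has a matching `.dat.info` file. A small check like the one below can confirm that before any data is extracted. It is a sketch: `pair_result_files` is a hypothetical helper and the file names in the example are made up.

```python
def pair_result_files(dat_paths, info_paths):
    """Pair each .dat file with its .dat.info file by name.

    Returns (pairs, unmatched): pairs is a list of (dat, info) tuples and
    unmatched lists every path that has no partner.
    """
    info_set = set(info_paths)
    pairs, unmatched = [], []
    for dat in dat_paths:
        info = dat + ".info"
        if info in info_set:
            pairs.append((dat, info))
            info_set.discard(info)
        else:
            unmatched.append(dat)
    unmatched.extend(sorted(info_set))  # .dat.info files without a .dat
    return pairs, unmatched


dats = ["run_2.dat", "run_1.dat", "run_3.dat"]
infos = ["run_1.dat.info", "run_2.dat.info", "run_4.dat.info"]
pairs, unmatched = pair_result_files(dats, infos)
print(pairs)      # [('run_2.dat', 'run_2.dat.info'), ('run_1.dat', 'run_1.dat.info')]
print(unmatched)  # ['run_3.dat', 'run_4.dat.info']
```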

In [41]:
dat_filepath = "data/case-LIDAR-1/*.dat"
dat_info_filepath = "data/case-LIDAR-1/*.dat.info"
op_filelist_1, feature_filelist_1 = read_org_data(dat_filepath, dat_info_filepath)

In [42]:
op_filelist_1[0:10]

['data/case-LIDAR-1/case-LIDAR1_Variation 32399_012801_032408.dat',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32398_012800_032407.dat',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32397_012759_032406.dat',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32396_012758_032405.dat',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32395_012757_032404.dat',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32394_012756_032403.dat',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32393_012755_032402.dat',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32392_012754_032401.dat',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32391_012753_032400.dat',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32390_012752_032399.dat']

In [43]:
feature_filelist_1[0:10]

['data/case-LIDAR-1/case-LIDAR1_Variation 32399_012801_032408.dat.info',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32398_012800_032407.dat.info',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32397_012759_032406.dat.info',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32396_012758_032405.dat.info',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32395_012757_032404.dat.info',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32394_012756_032403.dat.info',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32393_012755_032402.dat.info',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32392_012754_032401.dat.info',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32391_012753_032400.dat.info',
 'data/case-LIDAR-1/case-LIDAR1_Variation 32390_012752_032399.dat.info']

In [44]:
display(len(op_filelist_1))
display(len(feature_filelist_1))

32403

32403

In [59]:
dat_filepath = "data/case-LIDAR-2/*.dat"
dat_info_filepath = "data/case-LIDAR-2/*.dat.info"
op_filelist_2, feature_filelist_2 = read_org_data(dat_filepath, dat_info_filepath)

display(op_filelist_2[0:10])
display(feature_filelist_2[0:10])
display(len(op_filelist_2))
display(len(feature_filelist_2))

['data/case-LIDAR-2/case-LIDAR2_235959_003251.dat',
 'data/case-LIDAR-2/case-LIDAR2_235958_003250.dat',
 'data/case-LIDAR-2/case-LIDAR2_235957_003249.dat',
 'data/case-LIDAR-2/case-LIDAR2_235956_003248.dat',
 'data/case-LIDAR-2/case-LIDAR2_235955_003247.dat',
 'data/case-LIDAR-2/case-LIDAR2_235954_003246.dat',
 'data/case-LIDAR-2/case-LIDAR2_235953_003245.dat',
 'data/case-LIDAR-2/case-LIDAR2_235952_003244.dat',
 'data/case-LIDAR-2/case-LIDAR2_235950_003243.dat',
 'data/case-LIDAR-2/case-LIDAR2_235949_003242.dat']

['data/case-LIDAR-2/case-LIDAR2_235959_003251.dat.info',
 'data/case-LIDAR-2/case-LIDAR2_235958_003250.dat.info',
 'data/case-LIDAR-2/case-LIDAR2_235957_003249.dat.info',
 'data/case-LIDAR-2/case-LIDAR2_235956_003248.dat.info',
 'data/case-LIDAR-2/case-LIDAR2_235955_003247.dat.info',
 'data/case-LIDAR-2/case-LIDAR2_235954_003246.dat.info',
 'data/case-LIDAR-2/case-LIDAR2_235953_003245.dat.info',
 'data/case-LIDAR-2/case-LIDAR2_235952_003244.dat.info',
 'data/case-LIDAR-2/case-LIDAR2_235950_003243.dat.info',
 'data/case-LIDAR-2/case-LIDAR2_235949_003242.dat.info']

32400

32400

In [60]:
dat_filepath = "data/case-LIDAR-3/*.dat"
dat_info_filepath = "data/case-LIDAR-3/*.dat.info"
op_filelist_3, feature_filelist_3 = read_org_data(dat_filepath, dat_info_filepath)

display(op_filelist_3[0:10])
display(feature_filelist_3[0:10])
display(len(op_filelist_3))
display(len(feature_filelist_3))

['data/case-LIDAR-3/case-LIDAR3_Variation 32399_220431_032408.dat',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32398_220431_032407.dat',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32397_220430_032406.dat',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32396_220428_032405.dat',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32395_220427_032404.dat',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32394_220426_032403.dat',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32393_220424_032402.dat',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32392_220424_032401.dat',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32391_220422_032400.dat',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32390_220420_032399.dat']

['data/case-LIDAR-3/case-LIDAR3_Variation 32399_220431_032408.dat.info',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32398_220431_032407.dat.info',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32397_220430_032406.dat.info',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32396_220428_032405.dat.info',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32395_220427_032404.dat.info',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32394_220426_032403.dat.info',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32393_220424_032402.dat.info',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32392_220424_032401.dat.info',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32391_220422_032400.dat.info',
 'data/case-LIDAR-3/case-LIDAR3_Variation 32390_220420_032399.dat.info']

32400

32400

### Define method to extract data from result files and append to dataframe

In [45]:
from io import StringIO

def get_data(op_filelist, feature_filelist, update_rate):
  """
  Method to extract first level data from the data files
  
    Arguments
    op_filelist: The list of file names containing output data values
    feature_filelist: The list of file names containing input feature names and values
    update_rate: The sensor update frequency in Hz used for that dataset
    
  """
  column_names = ["body_mass", "cogx", "obj_dist", "react_time", "road_mu", "speed", "tire_rr", "brake_file", "collision", "dist_to_col"]
  main_df = pd.DataFrame(columns = column_names)

  for i in range(0, len(op_filelist)):
    # extracting output
    with open(op_filelist[i], 'r') as f:
      data = f.read()

    df = pd.read_fwf(StringIO(data), 
                    sep=r"\s+", 
                    skiprows=1, 
                    usecols=[2,3], 
                    names=['collision','dist_to_col'])

    shape = df.shape
    y_series = df.loc[shape[0]-1] # this is the output pandas series

    # extracting input features
    with open(feature_filelist[i], 'r') as f:
      feature_file = f.read()
      
    df_feature = pd.read_fwf(StringIO(feature_file), 
                    sep=r"\s+", 
                    skiprows=1 
                    )

    #df_feature.loc[110:127]
    item_no = [111, 113, 115, 117, 119, 121, 123, 127] # line numbers of interest
    feature_list = ['body_mass', 'cogx', 'obj_dist', 'react_time', 'road_mu', 'speed', 'tire_rr', 'brake_file']
    value_list = []

    for j in item_no:
      if j < 123:
        string = df_feature.loc[j].values
        string_split = string[0].split('=')
        value = float(string_split[1])
        value_list.append(value)
      else:
        string = df_feature.loc[j].values
        string_split = string[0].split('=')
        value = string_split[1]
        value_list.append(value)

    feature_array = np.array(value_list)
    feature_series = pd.Series(feature_array, index=feature_list)

    # combine feature and data series
    data_series = pd.concat([feature_series, y_series], axis=0)

    # append feature and output series to main_df
    # (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
    main_df = pd.concat([main_df, data_series.to_frame().T], ignore_index=True)
  
  # Transform dataframe columns to replace strings with values  
  # Replace tire file names with values
  main_df['tire_rr'] = main_df['tire_rr'].replace({' tire_1': 0.008, ' tire_2': 0.015, ' tire_3': 0.03})

  # place 'brake_file' column at the end
  new_columns = ["body_mass", "cogx", "obj_dist", "react_time", "road_mu", "speed", "tire_rr", "collision", "dist_to_col", "brake_file"]
  main_df = main_df.reindex(columns=new_columns)

  # Add required brake feature columns to dataframe
  brake_columns_list = ["pedal.ratio", "boo.ampli", "mc.area", "pf.area", "pf.rbrake", "pr.area", "pr.rbrake"]
  HydESP_1 = [3, 5, 4.5 , 23, 0.1, 11, 0.1]
  HydESP_2 = [1.5, 6, 3.5 , 30, 0.07, 14, 0.07]
  HydESP_3 = [2.5, 5, 4.5 , 15, 0.1, 15, 0.1]
  HydESP_4 = [4, 3, 6 , 12, 0.14, 8, 0.14]
  HydESP_5 = [3, 5, 3.5 , 30, 0.1, 14, 0.1]

  for name in brake_columns_list:
    main_df[name] = main_df["brake_file"]
    i = brake_columns_list.index(name)
    main_df[name] = main_df[name].replace({' HydESP_1': HydESP_1[i], ' HydESP_2': HydESP_2[i], ' HydESP_3': HydESP_3[i], ' HydESP_4': HydESP_4[i], ' HydESP_5': HydESP_5[i]})

  # delete the 'brake_file' which is now not required
  del main_df["brake_file"] 

  # Add the sensor update_rate column
  main_df["update_rate"] = update_rate

  # place 'update_rate' and the output columns at the end
  final_columns = ["body_mass", "cogx", "obj_dist", "react_time", "road_mu", "speed", "tire_rr", "pedal.ratio", "boo.ampli", "mc.area", "pf.area", "pf.rbrake", "pr.area", "pr.rbrake", "update_rate", "collision", "dist_to_col"]
  main_df = main_df.reindex(columns=final_columns)

  return main_df
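
Growing `main_df` one row at a time inside the loop copies the whole frame on every iteration, which becomes slow for tens of thousands of files. A common alternative is to collect the rows in a plain list and build the dataframe once at the end. The sketch below shows only that pattern on made-up records; `rows_to_frame` is a hypothetical helper and the speed-up for this dataset has not been measured.

```python
import pandas as pd


def rows_to_frame(records, columns):
    """Build one DataFrame from an iterable of per-simulation dicts."""
    rows = []
    for record in records:
        rows.append(record)  # cheap list append, no DataFrame copy
    # Single construction at the end instead of one concat per row
    return pd.DataFrame(rows, columns=columns)


# Made-up records standing in for the parsed feature and output values
records = (
    {"speed": 80.0 + 20.0 * i, "collision": float(i % 2)} for i in range(4)
)
df = rows_to_frame(records, columns=["speed", "collision"])
print(df.shape)  # (4, 2)
```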

#### Get the dataframe for case-LIDAR-1

In [46]:
update_rate = 10
main_df_1 = get_data(op_filelist_1, feature_filelist_1, update_rate)
main_df_1

IndexError: list index out of range

For case-LIDAR-1, `get_data()` raised the error 'list index out of range'. The error is caused by rogue files that do not belong to the intended simulation results, so we need to find all of them

In [47]:
# Identify rogue files in the LIDAR-1 dataset

rogue_list = []
for i in range(0, len(op_filelist_1)):
    with open(feature_filelist_1[i], 'r') as f: 
        feature_file_1 = f.read()
      
    df_feature_1 = pd.read_fwf(StringIO(feature_file_1), 
                    sep=r"\s+", 
                    skiprows=1 
                    )
    if len(df_feature_1) != 145:
        rogue_list.append(op_filelist_1[i])

In [48]:
rogue_list

['data/case-LIDAR-1/case-LIDAR1_Variation 31567_091557.dat']

In [49]:
# Remove the rogue files
op_filelist_1.remove('data/case-LIDAR-1/case-LIDAR1_Variation 31567_091557.dat')
feature_filelist_1.remove('data/case-LIDAR-1/case-LIDAR1_Variation 31567_091557.dat.info')

In [50]:
# print file list length after removing rogue files
display(len(op_filelist_1))
display(len(feature_filelist_1))

32402

32402
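
If more than one rogue file turns up, removing the pairs with a filter avoids hard-coding each path. This is a sketch: `drop_rogue_pairs` is a hypothetical helper, the example file names are made up, and both lists are assumed to be index-aligned as returned by `read_org_data()`.

```python
def drop_rogue_pairs(op_files, feature_files, rogue_ops):
    """Remove rogue output files together with their matching feature files.

    Assumes op_files[i] and feature_files[i] belong to the same simulation.
    """
    rogue = set(rogue_ops)
    kept = [(op, ft) for op, ft in zip(op_files, feature_files) if op not in rogue]
    kept_ops = [op for op, _ in kept]
    kept_features = [ft for _, ft in kept]
    return kept_ops, kept_features


ops = ["a.dat", "b.dat", "c.dat"]
features = ["a.dat.info", "b.dat.info", "c.dat.info"]
ops, features = drop_rogue_pairs(ops, features, rogue_ops=["b.dat"])
print(ops)       # ['a.dat', 'c.dat']
print(features)  # ['a.dat.info', 'c.dat.info']
```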

In [51]:
update_rate = 10
main_df_1 = get_data(op_filelist_1, feature_filelist_1, update_rate)
main_df_1

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,3000.0,2.8,540.0,0.4,1.0,140.0,0.030,3.0,5,3.5,30,0.1,14,0.1,10,0.0,1.894697
1,3000.0,2.8,500.0,0.4,1.0,140.0,0.030,3.0,5,3.5,30,0.1,14,0.1,10,1.0,-38.105303
2,3000.0,2.8,460.0,0.4,1.0,140.0,0.030,3.0,5,3.5,30,0.1,14,0.1,10,1.0,-78.105303
3,3000.0,2.8,440.0,0.4,1.0,140.0,0.030,3.0,5,3.5,30,0.1,14,0.1,10,1.0,-98.105303
4,3000.0,2.8,540.0,0.2,1.0,140.0,0.030,3.0,5,3.5,30,0.1,14,0.1,10,0.0,4.345519
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32397,1200.0,2.0,440.0,0.2,0.4,80.0,0.008,3.0,5,4.5,23,0.1,11,0.1,10,1.0,-18.877209
32398,1200.0,2.0,540.0,0.1,0.4,80.0,0.008,3.0,5,4.5,23,0.1,11,0.1,10,0.0,81.643414
32399,1200.0,2.0,500.0,0.1,0.4,80.0,0.008,3.0,5,4.5,23,0.1,11,0.1,10,0.0,41.643414
32400,1200.0,2.0,460.0,0.1,0.4,80.0,0.008,3.0,5,4.5,23,0.1,11,0.1,10,0.0,1.643414


In [56]:
main_df_1.isnull().values.any()

False

In [55]:
collisions = main_df_1[(main_df_1.collision == 1)]['collision']
percent_collisions = (len(collisions)/len(main_df_1))*100
print('The percentage of collisions in the LIDAR-1 dataset is: ', percent_collisions, '%')

The percentage of collisions in the LIDAR-1 dataset is:  67.51435096598975 %
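
Because 'collision' only takes the values 0 and 1, the same percentage can be obtained from the column mean. The sketch below uses made-up values rather than the real dataset.

```python
import pandas as pd

# Made-up outcomes: 1.0 = collision, 0.0 = no collision
toy = pd.DataFrame({"collision": [1.0, 0.0, 1.0, 1.0]})

# The mean of a 0/1 column is the fraction of ones
percent_collisions = toy["collision"].mean() * 100
print(f"{percent_collisions:.2f} %")  # 75.00 %
```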


#### Get the dataframe for case-LIDAR-2

In [61]:
update_rate_2 = 40
main_df_2 = get_data(op_filelist_2, feature_filelist_2, update_rate_2)
main_df_2

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,2100.0,2.0,500.0,0.1,0.4,80.0,0.008,3.0,5,4.5,23,0.1,11,0.1,40,0.0,30.891520
1,2100.0,2.0,460.0,0.1,0.4,80.0,0.008,3.0,5,4.5,23,0.1,11,0.1,40,1.0,-9.108480
2,2100.0,2.0,440.0,0.1,0.4,80.0,0.008,3.0,5,4.5,23,0.1,11,0.1,40,1.0,-29.108480
3,1650.0,2.8,540.0,0.4,1.0,80.0,0.030,3.0,5,3.5,30,0.1,14,0.1,40,0.0,108.450310
4,1650.0,2.8,500.0,0.4,1.0,80.0,0.030,3.0,5,3.5,30,0.1,14,0.1,40,0.0,68.450311
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32395,2100.0,2.0,540.0,0.2,0.4,80.0,0.008,3.0,5,4.5,23,0.1,11,0.1,40,0.0,70.308482
32396,2100.0,2.0,500.0,0.2,0.4,80.0,0.008,3.0,5,4.5,23,0.1,11,0.1,40,0.0,30.308482
32397,2100.0,2.0,460.0,0.2,0.4,80.0,0.008,3.0,5,4.5,23,0.1,11,0.1,40,1.0,-9.691518
32398,2100.0,2.0,440.0,0.2,0.4,80.0,0.008,3.0,5,4.5,23,0.1,11,0.1,40,1.0,-29.691518


In [62]:
main_df_2.isnull().values.any()

False

In [63]:
collisions = main_df_2[(main_df_2.collision == 1)]['collision']
percent_collisions = (len(collisions)/len(main_df_2))*100
print('The percentage of collisions in the LIDAR-2 dataset is: ', percent_collisions, '%')

The percentage of collisions in the LIDAR-2 dataset is:  66.86111111111111 %


#### Get the dataframe for case-LIDAR-3

In [64]:
%%time
update_rate_3 = 60
main_df_3 = get_data(op_filelist_3, feature_filelist_3, update_rate_3)
main_df_3

CPU times: user 7min 21s, sys: 1.82 s, total: 7min 23s
Wall time: 7min 24s


Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,3000.0,2.8,540.0,0.4,1.0,140.0,0.030,3.0,5,3.5,30,0.1,14,0.1,60,0.0,5.472431
1,3000.0,2.8,500.0,0.4,1.0,140.0,0.030,3.0,5,3.5,30,0.1,14,0.1,60,1.0,-34.527569
2,3000.0,2.8,460.0,0.4,1.0,140.0,0.030,3.0,5,3.5,30,0.1,14,0.1,60,1.0,-74.527569
3,3000.0,2.8,440.0,0.4,1.0,140.0,0.030,3.0,5,3.5,30,0.1,14,0.1,60,1.0,-94.527569
4,3000.0,2.8,540.0,0.2,1.0,140.0,0.030,3.0,5,3.5,30,0.1,14,0.1,60,0.0,7.923254
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32395,1200.0,2.0,440.0,0.2,0.4,80.0,0.008,3.0,5,4.5,23,0.1,11,0.1,60,1.0,-17.543875
32396,1200.0,2.0,540.0,0.1,0.4,80.0,0.008,3.0,5,4.5,23,0.1,11,0.1,60,0.0,82.976747
32397,1200.0,2.0,500.0,0.1,0.4,80.0,0.008,3.0,5,4.5,23,0.1,11,0.1,60,0.0,42.976747
32398,1200.0,2.0,460.0,0.1,0.4,80.0,0.008,3.0,5,4.5,23,0.1,11,0.1,60,0.0,2.976747


In [68]:
main_df_3.isnull().values.any()

False

In [69]:
collisions = main_df_3[(main_df_3.collision == 1)]['collision']
percent_collisions = (len(collisions)/len(main_df_3))*100
print('The percentage of collisions in the LIDAR-3 dataset is: ', percent_collisions, '%')

The percentage of collisions in the LIDAR-3 dataset is:  66.78395061728395 %


#### Store dataframes for later access

In [70]:
main_df_1.to_pickle("data/main_df_1.pkl")
# with open("./main_df_1.pkl", "rb") as fh:
  #main_df_1 = pickle.load(fh)

In [71]:
main_df_2.to_pickle("data/main_df_2.pkl")
# with open("./main_df_2.pkl", "rb") as fh:
  #main_df_2 = pickle.load(fh)

In [72]:
main_df_3.to_pickle("data/main_df_3.pkl")
# with open("./main_df_3.pkl", "rb") as fh:
  #main_df_3 = pickle.load(fh)

#### Concatenate the three dataframes into one dataframe

In [73]:
main_df = pd.concat([main_df_1, main_df_2, main_df_3], axis=0, ignore_index=True)
main_df.head(10)

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,3000.0,2.8,540.0,0.4,1.0,140.0,0.03,3.0,5,3.5,30,0.1,14,0.1,10,0.0,1.894697
1,3000.0,2.8,500.0,0.4,1.0,140.0,0.03,3.0,5,3.5,30,0.1,14,0.1,10,1.0,-38.105303
2,3000.0,2.8,460.0,0.4,1.0,140.0,0.03,3.0,5,3.5,30,0.1,14,0.1,10,1.0,-78.105303
3,3000.0,2.8,440.0,0.4,1.0,140.0,0.03,3.0,5,3.5,30,0.1,14,0.1,10,1.0,-98.105303
4,3000.0,2.8,540.0,0.2,1.0,140.0,0.03,3.0,5,3.5,30,0.1,14,0.1,10,0.0,4.345519
5,3000.0,2.8,500.0,0.2,1.0,140.0,0.03,3.0,5,3.5,30,0.1,14,0.1,10,1.0,-35.654481
6,3000.0,2.8,460.0,0.2,1.0,140.0,0.03,3.0,5,3.5,30,0.1,14,0.1,10,1.0,-75.654481
7,3000.0,2.8,440.0,0.2,1.0,140.0,0.03,3.0,5,3.5,30,0.1,14,0.1,10,1.0,-95.654481
8,3000.0,2.8,540.0,0.1,1.0,140.0,0.03,3.0,5,3.5,30,0.1,14,0.1,10,0.0,5.414779
9,3000.0,2.8,500.0,0.1,1.0,140.0,0.03,3.0,5,3.5,30,0.1,14,0.1,10,1.0,-34.585221


In [74]:
main_df.shape

(97202, 17)

Shuffle rows of the final data frame 

In [75]:
for i in range(0, 3):
  main_df = main_df.sample(frac=1).reset_index(drop=True)

main_df.head(10)

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,2550.0,2.0,540.0,0.2,0.4,140.0,0.03,1.5,6,3.5,30,0.07,14,0.07,60,1.0,-112.22947
1,2550.0,2.4,460.0,0.4,0.7,140.0,0.015,2.5,5,4.5,15,0.1,15,0.1,60,1.0,-149.58166
2,1200.0,2.4,460.0,0.4,0.7,120.0,0.008,4.0,3,6.0,12,0.14,8,0.14,10,1.0,-66.651648
3,3000.0,2.8,500.0,0.2,1.0,120.0,0.03,3.0,5,4.5,23,0.1,11,0.1,60,1.0,-30.155521
4,2100.0,2.8,440.0,0.1,0.4,100.0,0.015,3.0,5,4.5,23,0.1,11,0.1,40,1.0,-50.162968
5,2100.0,2.0,460.0,0.1,0.4,80.0,0.008,3.0,5,3.5,30,0.1,14,0.1,60,0.0,1.268645
6,1200.0,2.0,460.0,0.2,0.7,100.0,0.015,4.0,3,6.0,12,0.14,8,0.14,10,1.0,-27.204663
7,1200.0,2.4,460.0,0.2,0.4,100.0,0.015,3.0,5,3.5,30,0.1,14,0.1,60,1.0,-27.207652
8,2550.0,2.0,460.0,0.2,1.0,120.0,0.03,3.0,5,3.5,30,0.1,14,0.1,40,1.0,-27.017936
9,1200.0,2.8,500.0,0.1,0.4,100.0,0.008,3.0,5,4.5,23,0.1,11,0.1,10,0.0,10.942852
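
The shuffle above gives a different row order on every run. If a repeatable order is wanted, `DataFrame.sample` accepts a `random_state` seed. The sketch below uses a made-up frame, and the seed value 42 is arbitrary.

```python
import pandas as pd

toy = pd.DataFrame({"run": range(6)})

# The same seed yields the same permutation on every call
shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))  # True
```

A single seeded pass is already a uniform random permutation, so repeating the shuffle does not make the order more random.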


### Special operation on the final dataframe
The column 'obj_dist' contains the position of the traffic object measured from the origin coordinate of the virtual environment. We want this value relative to the vehicle, so we subtract the distance the vehicle has travelled at the moment the traffic object appears in front of it.

The traffic object comes into sensor range once the vehicle has travelled 400 m from the start of the simulation, a run-up that lets the vehicle stabilize at the cruising speed. Hence we subtract 400 from the 'obj_dist' values of 'main_df'
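
The offset can also be applied without a round trip through a temporary CSV file by converting the column explicitly. This is a sketch on made-up rows; `SENSOR_RANGE_START_M` is an illustrative name for the 400 m run-up described above, and the string values mimic the column before conversion.

```python
import pandas as pd

SENSOR_RANGE_START_M = 400  # distance travelled before the object appears (m)

# Made-up rows; strings mimic the column state before type conversion
toy = pd.DataFrame({"obj_dist": ["540.0", "500.0", "460.0", "440.0"]})

# Convert to numbers first, then make the distance relative to the vehicle
toy["obj_dist"] = pd.to_numeric(toy["obj_dist"]) - SENSOR_RANGE_START_M
print(toy["obj_dist"].tolist())  # [140.0, 100.0, 60.0, 40.0]
```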

In [119]:
main_df.to_csv("tmp.csv", index=False) # converts string values to float automatically
main_df = pd.read_csv("tmp.csv")
main_df.obj_dist = main_df.obj_dist - 400
main_df

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,2550.0,2.0,140.0,0.2,0.4,140.0,0.030,1.5,6,3.5,30,0.07,14,0.07,60,1.0,-112.229470
1,2550.0,2.4,60.0,0.4,0.7,140.0,0.015,2.5,5,4.5,15,0.10,15,0.10,60,1.0,-149.581660
2,1200.0,2.4,60.0,0.4,0.7,120.0,0.008,4.0,3,6.0,12,0.14,8,0.14,10,1.0,-66.651648
3,3000.0,2.8,100.0,0.2,1.0,120.0,0.030,3.0,5,4.5,23,0.10,11,0.10,60,1.0,-30.155521
4,2100.0,2.8,40.0,0.1,0.4,100.0,0.015,3.0,5,4.5,23,0.10,11,0.10,40,1.0,-50.162968
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97197,1200.0,2.4,100.0,0.1,0.4,80.0,0.008,1.5,6,3.5,30,0.07,14,0.07,40,0.0,43.224258
97198,2550.0,2.4,40.0,0.1,0.4,100.0,0.030,2.5,5,4.5,15,0.10,15,0.10,40,1.0,-64.368848
97199,3000.0,2.4,60.0,0.2,1.0,120.0,0.030,2.5,5,4.5,15,0.10,15,0.10,10,1.0,-112.050890
97200,2100.0,2.4,100.0,0.4,0.7,80.0,0.030,4.0,3,6.0,12,0.14,8,0.14,40,0.0,15.687989


#### Store concatenated dataframe as .csv and .pkl files for later access

In [120]:
main_df.to_csv("data/edgecase_dataset.csv", index=False) 

In [121]:
main_df.to_pickle("data/main_df.pkl")
# pip install pickle5
# import pickle5 as pickle
# with open("data/main_df.pkl", "rb") as fh:
  #main_df = pickle.load(fh)

## Explore data

In [6]:
#pip install pickle5

In [7]:
#import pickle5 as pickle
#with open("data/main_df.pkl", "rb") as fh:
#  main_df = pickle.load(fh)

In [3]:
collisions = main_df[(main_df.collision == 1)]['collision']
percent_collisions = (len(collisions)/len(main_df))*100
print('The percentage of collisions in the dataset is: ', percent_collisions, '%')

The percentage of collisions in the dataset is:  67.05314705458736 %


In [4]:
main_df.isnull().values.any()

False

### Plotting and Investigation

In [18]:
#jupyter labextension install jupyterlab-plotly

In [17]:
 #pip install plotly

In [5]:
import plotly.express as px

In [6]:
def plot_hist(dataset, feature_name, color_feature, marginal=None, histfunc=None, y=None):
  """
  Method to check frequency distribution of dataset features
  
    Arguments
    dataset: the dataframe containing complete data
    feature_name: string "name" of the series of feature of interest
    color_feature: string "name" of the feature for color classification
    marginal: "violin" to plot frequency distribution subfigure
    histfunc: 'avg' to plot average of feature in y-axis
    y: string "name" of the series of feature for y-axis
    
  """
  fig = px.histogram(dataset, x=feature_name, y=y, color=color_feature, marginal=marginal, histfunc=histfunc)
  fig.show("notebook")

In [14]:
feature_names = ["body_mass", "cogx", "obj_dist", "react_time", "road_mu", "speed", "tire_rr", "pedal.ratio", "boo.ampli", "mc.area", "pf.area", "pf.rbrake", "pr.area", "pr.rbrake", "update_rate", "collision", "dist_to_col"]
color_feature = "update_rate"

for feature_name in feature_names[0:7]:
  plot_hist(main_df, feature_name, color_feature)

# Plots exported to .html to save storage space

**Observation**
* The first 6 features (columns) have equal counts for each of their values across the different sensor update rates

In [15]:
feature_names = ["body_mass", "cogx", "obj_dist", "react_time", "road_mu", "speed", "tire_rr", "pedal.ratio", "boo.ampli", "mc.area", "pf.area", "pf.rbrake", "pr.area", "pr.rbrake", "update_rate", "collision", "dist_to_col"]
color_feature = "update_rate"

for feature_name in feature_names[7:14]:
  plot_hist(main_df, feature_name, color_feature)

# Plots exported to .html to save storage space

**Observation**
* The hydraulic brake system parameters are unequally distributed. Covering all combinations of these parameters would create a very large number of variations, which would require HPC capacity to finish the simulations in reasonable time
* Because of this unequal distribution, the dataset does not cover the full range of brake parameter variations. Hence we can expect lower prediction accuracy for the brake parameters specifically
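
One way to see the unequal distribution directly from the brake-file table is to count how often each parameter value occurs across the five files. The sketch below does this for 'pedal.ratio' only and assumes each brake file is used in the same number of simulations.

```python
from collections import Counter

# pedal.ratio per brake file, copied from the brake-file table
pedal_ratio = {
    "HydESP_1": 3.0,
    "HydESP_2": 1.5,
    "HydESP_3": 2.5,
    "HydESP_4": 4.0,
    "HydESP_5": 3.0,
}

# A value shared by two files appears twice as often in the dataset
counts = Counter(pedal_ratio.values())
print(counts[3.0])  # 2
print(counts[1.5])  # 1
```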

In [16]:
color_feature = "obj_dist"
marginal="violin"

for feature_name in feature_names[15:17]:
  plot_hist(main_df, feature_name, color_feature, marginal)

# Plots exported to .html to save storage space

**Observation (output data)**
* These graphs give a good understanding of how a given feature relates to the outputs (collision or no collision, and distance to collision). In the plotted case, collisions are less frequent for higher 'obj_dist' and more frequent for lower 'obj_dist'
* The second plot gives a more detailed view of the first plot. The distance to collision is larger and mostly positive when the vehicle is far away from the detected object, and vice versa, as expected

In [17]:
color_feature = "speed"
marginal="violin"

for feature_name in feature_names[15:17]:
  plot_hist(main_df, feature_name, color_feature, marginal)

# Plots exported to .html to save storage space

**Observation**
Here we plot similar graphs for another feature: 'speed'
* We see that collision cases are more frequent at higher vehicle speeds and less frequent at lower speeds
* The second plot gives a more detailed view of the first plot

In [19]:
for feature_name in feature_names[0:14]:
  color_feature = feature_name
  #y="dist_to_col"
  y="collision"
  marginal=None
  histfunc='avg'
  plot_hist(main_df, feature_name, color_feature, marginal, histfunc, y)

# Plots exported to .html to save storage space

**Observation**
* These plots are complementary to PCA (Principal Component Analysis) plots. Plotting the average of 'collision' (0 or 1) over the data points for each value of a feature reveals how that feature (vehicle parameter) affects the output value (collision)
* One example is 'body_mass': according to the plot, the probability of collision increases dramatically as it increases. This makes sense, since braking distances are longer for heavier vehicles than for lighter vehicles with the same tyre and brake properties
* Another example is 'cogx': based on the plot, changes in the centre of gravity location along the longitudinal axis of the vehicle body have no significant influence on whether a collision occurs. Increasing this parameter increases the weight transfer onto the front tyres, but it also takes weight off the rear tyres. The overall effect is not drastic, as the plots show

Similarly, the remaining plots give us valuable insights and help us make sense of the data
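
The quantity shown in these plots, the average of 'collision' per feature value, can also be computed as a table with a group-by. The sketch below uses made-up rows, so the numbers are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up rows: two body masses with different collision outcomes
toy = pd.DataFrame(
    {
        "body_mass": [1200, 1200, 3000, 3000, 3000, 3000],
        "collision": [0.0, 1.0, 1.0, 1.0, 0.0, 1.0],
    }
)

# Mean of the 0/1 flag per feature value = observed collision rate
collision_rate = toy.groupby("body_mass")["collision"].mean()
print(collision_rate)
```

For the toy rows this gives a rate of 0.5 for a body mass of 1200 and 0.75 for 3000.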

In [21]:
fig = px.scatter(main_df, x="dist_to_col", y="obj_dist", color="speed")
fig.show("notebook")

# Plots exported to .html to save storage space

In [22]:
fig = px.scatter(main_df, x="dist_to_col", y="speed", color="body_mass")
fig.show("notebook")

# Plots exported to .html to save storage space

**Observation**
* The plots above show a different way of representing the data. The color bar helps us spot visually obvious trends in the data, just as we saw with the histograms.
* For example, we can see that for a given distance from the traffic object, the distance to collision at the end of braking increases with increasing speed (bright yellow points) and decreases with decreasing speed (dark blue points)


## Convert dataframe to CSV and split into train and test sets

In [26]:
# Split 90% data to train and 10% to test set
num = np.random.rand(len(main_df)) < 0.9
train_df = main_df[num].reset_index(drop=True)
test_df = main_df[~num].reset_index(drop=True)
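
The random mask above produces a different split on every run, and the split is only approximately 90/10. A seeded variant makes the split repeatable. This is a sketch: `split_train_test` is a hypothetical helper, the seed value is arbitrary, and the frame is made up.

```python
import numpy as np
import pandas as pd


def split_train_test(df, train_fraction=0.9, seed=0):
    """Randomly assign each row to train or test using a seeded generator."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(df)) < train_fraction
    train = df[mask].reset_index(drop=True)
    test = df[~mask].reset_index(drop=True)
    return train, test


toy = pd.DataFrame({"x": range(1000)})
toy_train, toy_test = split_train_test(toy)
print(len(toy_train), len(toy_test))
```

Every row lands in exactly one of the two sets, and calling the function again with the same seed returns the same split.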

In [27]:
train_df

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,2550.0,2.0,140.0,0.2,0.4,140.0,0.030,1.5,6,3.5,30,0.07,14,0.07,60,1.0,-112.229470
1,2550.0,2.4,60.0,0.4,0.7,140.0,0.015,2.5,5,4.5,15,0.10,15,0.10,60,1.0,-149.581660
2,1200.0,2.4,60.0,0.4,0.7,120.0,0.008,4.0,3,6.0,12,0.14,8,0.14,10,1.0,-66.651648
3,3000.0,2.8,100.0,0.2,1.0,120.0,0.030,3.0,5,4.5,23,0.10,11,0.10,60,1.0,-30.155521
4,2100.0,2.8,40.0,0.1,0.4,100.0,0.015,3.0,5,4.5,23,0.10,11,0.10,40,1.0,-50.162968
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87475,1200.0,2.4,100.0,0.1,0.4,80.0,0.008,1.5,6,3.5,30,0.07,14,0.07,40,0.0,43.224258
87476,2550.0,2.4,40.0,0.1,0.4,100.0,0.030,2.5,5,4.5,15,0.10,15,0.10,40,1.0,-64.368848
87477,3000.0,2.4,60.0,0.2,1.0,120.0,0.030,2.5,5,4.5,15,0.10,15,0.10,10,1.0,-112.050890
87478,2100.0,2.4,100.0,0.4,0.7,80.0,0.030,4.0,3,6.0,12,0.14,8,0.14,40,0.0,15.687989


In [28]:
test_df

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,2100.0,2.4,60.0,0.4,0.4,80.0,0.008,2.5,5,4.5,15,0.10,15,0.10,60,1.0,-3.537620
1,2550.0,2.0,60.0,0.2,0.7,100.0,0.015,2.5,5,4.5,15,0.10,15,0.10,10,1.0,-49.004542
2,1200.0,2.0,40.0,0.1,0.4,100.0,0.008,4.0,3,6.0,12,0.14,8,0.14,60,1.0,-62.001117
3,2550.0,2.4,100.0,0.2,0.4,120.0,0.030,3.0,5,3.5,30,0.10,14,0.10,60,1.0,-29.573343
4,2100.0,2.4,100.0,0.2,0.7,140.0,0.015,1.5,6,3.5,30,0.07,14,0.07,60,1.0,-84.509473
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9717,1200.0,2.0,140.0,0.4,0.4,80.0,0.015,3.0,5,4.5,23,0.10,11,0.10,10,0.0,80.233185
9718,1650.0,2.8,40.0,0.2,1.0,140.0,0.015,4.0,3,6.0,12,0.14,8,0.14,60,1.0,-166.882730
9719,2100.0,2.4,60.0,0.1,0.4,100.0,0.030,3.0,5,3.5,30,0.10,14,0.10,10,1.0,-30.751471
9720,2100.0,2.0,140.0,0.4,0.4,80.0,0.008,2.5,5,4.5,15,0.10,15,0.10,60,0.0,72.876206


In [25]:
main_df

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,2550.0,2.0,140.0,0.2,0.4,140.0,0.030,1.5,6,3.5,30,0.07,14,0.07,60,1.0,-112.229470
1,2550.0,2.4,60.0,0.4,0.7,140.0,0.015,2.5,5,4.5,15,0.10,15,0.10,60,1.0,-149.581660
2,1200.0,2.4,60.0,0.4,0.7,120.0,0.008,4.0,3,6.0,12,0.14,8,0.14,10,1.0,-66.651648
3,3000.0,2.8,100.0,0.2,1.0,120.0,0.030,3.0,5,4.5,23,0.10,11,0.10,60,1.0,-30.155521
4,2100.0,2.8,40.0,0.1,0.4,100.0,0.015,3.0,5,4.5,23,0.10,11,0.10,40,1.0,-50.162968
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97197,1200.0,2.4,100.0,0.1,0.4,80.0,0.008,1.5,6,3.5,30,0.07,14,0.07,40,0.0,43.224258
97198,2550.0,2.4,40.0,0.1,0.4,100.0,0.030,2.5,5,4.5,15,0.10,15,0.10,40,1.0,-64.368848
97199,3000.0,2.4,60.0,0.2,1.0,120.0,0.030,2.5,5,4.5,15,0.10,15,0.10,10,1.0,-112.050890
97200,2100.0,2.4,100.0,0.4,0.7,80.0,0.030,4.0,3,6.0,12,0.14,8,0.14,40,0.0,15.687989


In [194]:
train_df.to_csv("data/train_test/train.csv", index=False)

In [200]:
test_df.to_csv("data/train_test/test.csv", index=False)

In [201]:
# Remove testdata folder
#!rm -rf 'testdata'