# Read and organize NEW data files

### Jasvir Dhillon June 30, 2021

## All functions to extract features and outputs from result files

**Workflow**
* Download the new dataset
* Read and order files
* Extract data from the files and place it in a dataframe. Clean the data and apply transformations if necessary
* Explore and investigate data
* Combine with the original dataset
* Store data for later access
* Split dataset into training and test data. Store both for later use

### Importing libraries

In [1]:
import numpy as np
import pandas as pd

In [2]:
pip install natsort

Collecting natsort
  Downloading natsort-7.1.1-py3-none-any.whl (35 kB)
Installing collected packages: natsort
Successfully installed natsort-7.1.1
Note: you may need to restart the kernel to use updated packages.


## Read and organize files from data/post-case-LIDAR-1

The variations in features are shown in the image as follows:
										
<img src='nb_ims/new_feature_variations.jpg'/>

*New brake file parameters*

```
              pedal.ratio  boo.ampli  mc.area  pf.area  pf.rbrake  pr.area  pr.rbrake
HydESP_test   3.5          5.5        3.5      25       0.12       14       0.12

```
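
The brake file name can be expanded into its parameter columns with a small lookup table. The following is a minimal sketch that uses only the `HydESP_test` values from the table above; `BRAKE_PARAMS` and `expand_brake_file` are hypothetical names, and the leading space in the brake file name is an assumption about how it is read from the `.dat.info` files.

```python
import pandas as pd

# Column names and HydESP_test values are taken from the brake parameter table.
BRAKE_COLUMNS = ["pedal.ratio", "boo.ampli", "mc.area", "pf.area",
                 "pf.rbrake", "pr.area", "pr.rbrake"]
BRAKE_PARAMS = {"HydESP_test": [3.5, 5.5, 3.5, 25, 0.12, 14, 0.12]}


def expand_brake_file(df, params=BRAKE_PARAMS):
    """Replace the 'brake_file' column with one column per brake parameter."""
    lookup = pd.DataFrame.from_dict(params, orient="index", columns=BRAKE_COLUMNS)
    # Strip surrounding whitespace so ' HydESP_test' matches 'HydESP_test'
    keys = df["brake_file"].str.strip().to_numpy()
    # Unknown brake file names yield NaN rows instead of raising an error
    expanded = lookup.reindex(keys).reset_index(drop=True)
    base = df.drop(columns="brake_file").reset_index(drop=True)
    return pd.concat([base, expanded], axis=1)


demo = pd.DataFrame({"speed": [180.0, 130.0],
                     "brake_file": [" HydESP_test", " HydESP_test"]})
expanded_demo = expand_brake_file(demo)
print(expanded_demo)
```

A lookup table keeps the parameter values in one place, so adding another brake file only requires one more dictionary entry.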

Total number of features: 15 (15-dimensional data)

Total number of outputs: 2 ('collision' and 'dist_to_col')

The total number of variations: 4x3x3x1x2x3x1x3x2 -> 1296

The total number of files: 1296x2 -> 2,592

Total data size: 8.5 MB ¦ Total disk space required: 14.6 MB
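
The counts above can be checked with a few lines of arithmetic. This sketch assumes that the factor of two in the file count reflects one `.dat` and one `.dat.info` file per variation; the variable names are illustrative only.

```python
import math

# Number of levels per varied parameter, as listed above
levels_per_parameter = [4, 3, 3, 1, 2, 3, 1, 3, 2]

n_variations = math.prod(levels_per_parameter)
n_files = n_variations * 2  # assumed: one .dat and one .dat.info per variation

print(n_variations, n_files)  # 1296 2592
```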

### Download data from Dropbox storage

In [4]:
%mkdir data
!wget -O data/new_edge-case_dataset.zip https://www.dropbox.com/s/hz0kia92aiwawan/new_edge-case_dataset.zip?dl=0

mkdir: cannot create directory ‘data’: File exists
--2021-06-27 12:00:34--  https://www.dropbox.com/s/hz0kia92aiwawan/new_edge-case_dataset.zip?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.6.18, 2620:100:601c:18::a27d:612
Connecting to www.dropbox.com (www.dropbox.com)|162.125.6.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/hz0kia92aiwawan/new_edge-case_dataset.zip [following]
--2021-06-27 12:00:34--  https://www.dropbox.com/s/raw/hz0kia92aiwawan/new_edge-case_dataset.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uce1e29eb1930bc1cc8a7f8acb12.dl.dropboxusercontent.com/cd/0/inline/BRP-ZaSZxVDT5Z76DmeXyD6sJr49_NpKbtsosKFPKCtbNJFCTQRgrAEX4xVm0B9L0A_5g4PXHwGaox9iOf4HAR7ZHDZ46OsS4IkZKanffj8Xfiu5ctyr6drYKlqqGZbqcgs-It0nBohBPg1TOLWiYOlg/file# [following]
--2021-06-27 12:00:35--  https://uce1e29eb1930bc1cc8a7f8acb12.dl.dropboxusercontent.com/cd/0/in

In [6]:
#!unzip data/new_edge-case_dataset.zip -d data

### Read data files

In [7]:
import glob
import natsort

def read_org_data(dat_filepath, dat_info_filepath):
    """
    Collect the output (.dat) and feature (.dat.info) file names that match
    the given glob patterns and return both lists in the same order.
    """
    # glob.glob already returns a list of matching file names
    op_filelist = glob.glob(dat_filepath)
    feature_filelist = glob.glob(dat_info_filepath)

    # Natural-sort both lists (descending) so that the i-th output file
    # and the i-th feature file belong to the same variation
    op_filelist = natsort.natsorted(op_filelist, reverse=True)
    feature_filelist = natsort.natsorted(feature_filelist, reverse=True)

    return op_filelist, feature_filelist

In [8]:
dat_filepath = "data/new_edge-case_dataset/post_case-LIDAR-2/*.dat"
dat_info_filepath = "data/new_edge-case_dataset/post_case-LIDAR-2/*.dat.info"
op_filelist_2, feature_filelist_2 = read_org_data(dat_filepath, dat_info_filepath)

display(op_filelist_2[0:10])
display(feature_filelist_2[0:10])
display(len(op_filelist_2))
display(len(feature_filelist_2))

['data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 647_094507_000656.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 646_094507_000655.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 645_094505_000654.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 644_094505_000653.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 643_094504_000652.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 642_094504_000651.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 641_094502_000650.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 640_094501_000649.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 639_094501_000648.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 638_094500_000647.dat']

['data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 647_094507_000656.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 646_094507_000655.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 645_094505_000654.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 644_094505_000653.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 643_094504_000652.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 642_094504_000651.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 641_094502_000650.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 640_094501_000649.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation 639_094501_000648.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-2/post_case-LIDAR2_Variation

648

648

In [9]:
dat_filepath = "data/new_edge-case_dataset/post_case-LIDAR-3/*.dat"
dat_info_filepath = "data/new_edge-case_dataset/post_case-LIDAR-3/*.dat.info"
op_filelist_3, feature_filelist_3 = read_org_data(dat_filepath, dat_info_filepath)

display(op_filelist_3[0:10])
display(feature_filelist_3[0:10])
display(len(op_filelist_3))
display(len(feature_filelist_3))

['data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 647_100051_000656.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 646_100049_000655.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 645_100049_000654.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 644_100048_000653.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 643_100047_000652.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 642_100046_000651.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 641_100045_000650.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 640_100045_000649.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 639_100044_000648.dat',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 638_100043_000647.dat']

['data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 647_100051_000656.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 646_100049_000655.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 645_100049_000654.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 644_100048_000653.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 643_100047_000652.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 642_100046_000651.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 641_100045_000650.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 640_100045_000649.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation 639_100044_000648.dat.info',
 'data/new_edge-case_dataset/post_case-LIDAR-3/post_case-LIDAR3_Variation

648

648
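
Because every later step relies on the i-th output file and the i-th feature file describing the same variation, a quick pairing check can catch ordering problems early. This is a minimal sketch, assuming each feature file name is the output file name plus `.info` (as the listings above suggest); `check_pairing` is a hypothetical helper and the demo file names are made up.

```python
def check_pairing(op_files, feature_files, suffix=".info"):
    """Return the indices where the feature file is not op_file + suffix."""
    if len(op_files) != len(feature_files):
        raise ValueError("file lists differ in length")
    return [i for i, (op, feat) in enumerate(zip(op_files, feature_files))
            if feat != op + suffix]


# Made-up file names for illustration
ops = ["run_Variation 2.dat", "run_Variation 1.dat"]
feats = ["run_Variation 2.dat.info", "run_Variation 1.dat.info"]

mismatches = check_pairing(ops, feats)
print(mismatches)  # []
```

With the real lists, `check_pairing(op_filelist_2, feature_filelist_2)` would be expected to return an empty list.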

### Define method to extract data from result files and append to dataframe

In [15]:
from io import StringIO

def get_data(op_filelist, feature_filelist, update_rate):
  """
  Method to extract first level data from the data files
  
    Arguments
    op_filelist: The list of file names containing output data values
    feature_filelist: The list of file names containing input feature names and values
    update_rate: The sensor update frequency in Hz used for that dataset
    
  """
  column_names = ["body_mass", "cogx", "obj_dist", "react_time", "road_mu", "speed", "tire_rr", "brake_file", "collision", "dist_to_col"]
  main_df = pd.DataFrame(columns = column_names)

  for i in range(0, len(op_filelist)):
    # extracting output
    with open(op_filelist[i], 'r') as f:
      data = f.read()

    df = pd.read_fwf(StringIO(data), 
                    sep=r"\s+",
                    skiprows=1, 
                    usecols=[2,3], 
                    names=['collision','dist_to_col'])

    shape = df.shape
    y_series = df.loc[shape[0]-1] # this is the output pandas series

    # extracting input features
    with open(feature_filelist[i], 'r') as f:
      feature_file = f.read()
      
    df_feature = pd.read_fwf(StringIO(feature_file), 
                    sep=r"\s+",
                    skiprows=1 
                    )

    #df_feature.loc[110:127]
    item_no = [111, 113, 115, 117, 119, 121, 123, 127] # line numbers of interest
    feature_list = ['body_mass', 'cogx', 'obj_dist', 'react_time', 'road_mu', 'speed', 'tire_rr', 'brake_file']
    value_list = []

    for j in item_no:
      if j < 123:
        string = df_feature.loc[j].values
        string_split = string[0].split('=')
        value = float(string_split[1])
        value_list.append(value)
      else:
        string = df_feature.loc[j].values
        string_split = string[0].split('=')
        value = string_split[1]
        value_list.append(value)

    feature_array = np.array(value_list)
    feature_series = pd.Series(feature_array, index=feature_list)

    # combine feature and data series
    data_series = pd.concat([feature_series, y_series], axis=0)

    # append feature and output series to main_df as a new row
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    main_df = pd.concat([main_df, data_series.to_frame().T], ignore_index=True)
  
  # Transform dataframe columns to replace strings with values  
  # Replace tire file names with values
  main_df['tire_rr'] = main_df['tire_rr'].replace({' tire_1': 0.008, ' tire_2': 0.015, ' tire_3': 0.03})

  # place 'brake_file' column at the end
  new_columns = ["body_mass", "cogx", "obj_dist", "react_time", "road_mu", "speed", "tire_rr", "collision", "dist_to_col", "brake_file"]
  main_df = main_df.reindex(columns=new_columns)

  # Add required brake feature columns to dataframe
  brake_columns_list = ["pedal.ratio", "boo.ampli", "mc.area", "pf.area", "pf.rbrake", "pr.area", "pr.rbrake"]
  HydESP_1 = [3, 5, 4.5 , 23, 0.1, 11, 0.1]
  HydESP_2 = [1.5, 6, 3.5 , 30, 0.07, 14, 0.07]
  HydESP_3 = [2.5, 5, 4.5 , 15, 0.1, 15, 0.1]
  HydESP_4 = [4, 3, 6 , 12, 0.14, 8, 0.14]
  HydESP_5 = [3, 5, 3.5 , 30, 0.1, 14, 0.1]
  HydESP_test = [3.5, 5.5, 3.5 , 25, 0.12, 14, 0.12]

  for name in brake_columns_list:
    main_df[name] = main_df["brake_file"]
    i = brake_columns_list.index(name)
    main_df[name] = main_df[name].replace({' HydESP_1': HydESP_1[i], ' HydESP_2': HydESP_2[i], ' HydESP_3': HydESP_3[i], ' HydESP_4': HydESP_4[i], ' HydESP_5': HydESP_5[i], ' HydESP_test': HydESP_test[i]})

  # delete the 'brake_file' which is now not required
  del main_df["brake_file"] 

  # Add the sensor update_rate column
  main_df["update_rate"] = update_rate

  # place the output columns ('collision' and 'dist_to_col') at the end
  final_columns = ["body_mass", "cogx", "obj_dist", "react_time", "road_mu", "speed", "tire_rr", "pedal.ratio", "boo.ampli", "mc.area", "pf.area", "pf.rbrake", "pr.area", "pr.rbrake", "update_rate", "collision", "dist_to_col"]
  main_df = main_df.reindex(columns=final_columns)

  return main_df
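
The feature extraction in `get_data` splits each selected line of the `.dat.info` file at `=` and converts the right-hand side to a number where appropriate. The idea can be sketched in isolation as follows. The exact `.dat.info` line format is not shown in this notebook, so the `name = value` lines below are assumptions, and `parse_info_line` is a hypothetical helper. Unlike `get_data`, this sketch strips surrounding whitespace from the value.

```python
def parse_info_line(line):
    """Split 'name = value' and convert the value to float when possible."""
    name, _, raw = line.partition("=")
    raw = raw.strip()
    try:
        value = float(raw)
    except ValueError:
        value = raw  # keep file names such as 'tire_2' as strings
    return name.strip(), value


# Assumed example lines for illustration
print(parse_info_line("speed = 180"))
print(parse_info_line("tire = tire_2"))
```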

#### Get the dataframe for post_case-LIDAR-2

In [16]:
update_rate_2 = 40
new_df_2 = get_data(op_filelist_2, feature_filelist_2, update_rate_2)
new_df_2

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,2500.0,3.0,540.0,0.1,1.0,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-32.417780
1,2500.0,3.0,520.0,0.1,1.0,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-52.417780
2,2500.0,3.0,500.0,0.1,1.0,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-72.417780
3,2500.0,3.0,540.0,0.1,0.7,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-61.772550
4,2500.0,3.0,520.0,0.1,0.7,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-81.772550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
643,1700.0,2.4,520.0,0.1,0.7,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,40,0.0,21.840349
644,1700.0,2.4,500.0,0.1,0.7,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,40,0.0,1.840349
645,1700.0,2.4,540.0,0.1,0.4,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,40,1.0,-9.570047
646,1700.0,2.4,520.0,0.1,0.4,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,40,1.0,-29.570047


In [17]:
new_df_2.isnull().values.any()

False

In [18]:
collisions = new_df_2[(new_df_2.collision == 1)]['collision']
percent_collisions = (len(collisions)/len(new_df_2))*100
print('The percentage of collisions in the LIDAR-2 dataset is: ', percent_collisions, '%')

The percentage of collisions in the LIDAR-2 dataset is:  73.61111111111111 %
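
Since `collision` only takes the values 0 and 1, the same percentage can also be obtained from the column mean. A minimal sketch on made-up data, assuming the column is numeric:

```python
import pandas as pd

# Made-up data: three collisions out of four runs
demo = pd.DataFrame({"collision": [1.0, 1.0, 0.0, 1.0]})

percent_collisions = demo["collision"].mean() * 100
print(f"{percent_collisions:.1f} %")  # 75.0 %
```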


#### Get the dataframe for post_case-LIDAR-3

In [20]:
update_rate_3 = 60
new_df_3 = get_data(op_filelist_3, feature_filelist_3, update_rate_3)
new_df_3

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,2500.0,3.0,540.0,0.1,1.0,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,60,1.0,-32.517779
1,2500.0,3.0,520.0,0.1,1.0,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,60,1.0,-52.517779
2,2500.0,3.0,500.0,0.1,1.0,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,60,1.0,-72.517779
3,2500.0,3.0,540.0,0.1,0.7,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,60,1.0,-61.872548
4,2500.0,3.0,520.0,0.1,0.7,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,60,1.0,-81.872548
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
643,1700.0,2.4,520.0,0.1,0.7,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,60,0.0,22.129238
644,1700.0,2.4,500.0,0.1,0.7,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,60,0.0,2.129238
645,1700.0,2.4,540.0,0.1,0.4,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,60,1.0,-9.281158
646,1700.0,2.4,520.0,0.1,0.4,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,60,1.0,-29.281158


In [21]:
new_df_3.isnull().values.any()

False

In [22]:
collisions = new_df_3[(new_df_3.collision == 1)]['collision']
percent_collisions = (len(collisions)/len(new_df_3))*100
print('The percentage of collisions in the LIDAR-3 dataset is: ', percent_collisions, '%')

The percentage of collisions in the LIDAR-3 dataset is:  73.61111111111111 %


## Concatenate the two new dataframes (LIDAR-2 and LIDAR-3) into one dataframe

In [23]:
new_df = pd.concat([new_df_2, new_df_3], axis=0, ignore_index=True)
new_df

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,2500.0,3.0,540.0,0.1,1.0,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-32.417780
1,2500.0,3.0,520.0,0.1,1.0,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-52.417780
2,2500.0,3.0,500.0,0.1,1.0,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-72.417780
3,2500.0,3.0,540.0,0.1,0.7,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-61.772550
4,2500.0,3.0,520.0,0.1,0.7,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-81.772550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1291,1700.0,2.4,520.0,0.1,0.7,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,60,0.0,22.129238
1292,1700.0,2.4,500.0,0.1,0.7,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,60,0.0,2.129238
1293,1700.0,2.4,540.0,0.1,0.4,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,60,1.0,-9.281158
1294,1700.0,2.4,520.0,0.1,0.4,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,60,1.0,-29.281158


### Special operation on the concatenated new dataframe
The column 'obj_dist' contains the distance of the traffic object from the origin coordinate of the virtual environment. We want to make this value relative to the vehicle. Hence we subtract from the values in this column the distance the vehicle has travelled when the traffic object appears in front of the vehicle.

The traffic object comes into sensor range of the vehicle once the vehicle has travelled 400 m from the start of the simulation, a distance that lets it stabilize at the cruising speed. Hence we subtract 400 from the 'obj_dist' values of 'new_df'.
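
The offset described above can be sketched on a few sample values. `SENSOR_ONSET_M` and `to_relative_distance` are hypothetical names introduced for illustration; the input values are the first three 'obj_dist' entries of the concatenated dataframe.

```python
SENSOR_ONSET_M = 400  # distance travelled before the object enters sensor range


def to_relative_distance(obj_dist_from_origin, onset=SENSOR_ONSET_M):
    """Convert an object distance measured from the origin to a vehicle-relative one."""
    return obj_dist_from_origin - onset


relative = [to_relative_distance(d) for d in (540.0, 520.0, 500.0)]
print(relative)  # [140.0, 120.0, 100.0]
```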

In [24]:
new_df.to_csv("tmp.csv", index=False) # CSV round trip: read_csv below parses the string values as numbers
new_df = pd.read_csv("tmp.csv")
new_df.obj_dist = new_df.obj_dist - 400
new_df

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,2500.0,3.0,140.0,0.1,1.0,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-32.417780
1,2500.0,3.0,120.0,0.1,1.0,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-52.417780
2,2500.0,3.0,100.0,0.1,1.0,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-72.417780
3,2500.0,3.0,140.0,0.1,0.7,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-61.772550
4,2500.0,3.0,120.0,0.1,0.7,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-81.772550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1291,1700.0,2.4,120.0,0.1,0.7,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,60,0.0,22.129238
1292,1700.0,2.4,100.0,0.1,0.7,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,60,0.0,2.129238
1293,1700.0,2.4,140.0,0.1,0.4,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,60,1.0,-9.281158
1294,1700.0,2.4,120.0,0.1,0.4,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,60,1.0,-29.281158
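
The CSV round trip is used here to turn string values into numbers. A possible alternative that avoids the temporary file is `pd.to_numeric` applied column by column. This is only a sketch on made-up data, and it assumes every column holds values that can be parsed as numbers.

```python
import pandas as pd

# Made-up frame with object columns, mimicking values that were read as strings
demo = pd.DataFrame({"obj_dist": ["540.0", "520.0"], "collision": [1.0, 0.0]},
                    dtype=object)

numeric = demo.apply(pd.to_numeric)  # converts each column to a numeric dtype
numeric["obj_dist"] = numeric["obj_dist"] - 400

print(numeric.dtypes)
print(numeric)
```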


#### Shuffle new_df

In [25]:
for i in range(0, 3):
  new_df = new_df.sample(frac=1).reset_index(drop=True)

new_df

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,2000.0,2.6,100.0,0.1,0.4,140.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,40,1.0,-72.774239
1,1700.0,3.0,140.0,0.1,0.4,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,40,1.0,-7.237760
2,2000.0,2.6,100.0,0.1,0.4,130.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-48.736288
3,1700.0,2.4,120.0,0.1,1.0,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,40,0.0,21.851754
4,2500.0,3.0,140.0,0.1,1.0,160.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,40,1.0,-58.446640
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1291,2000.0,3.0,140.0,0.1,1.0,140.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,40,0.0,11.415281
1292,1700.0,3.0,140.0,0.1,0.4,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,60,1.0,-133.482550
1293,2500.0,3.0,100.0,0.1,0.7,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-101.772550
1294,1700.0,3.0,140.0,0.1,0.7,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,60,0.0,37.376125
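
A single call to `sample(frac=1)` already returns the rows in a random order, so repeating it is optional. If the shuffle should be reproducible between runs, a seed can be passed via `random_state`. A minimal sketch on made-up data; the seed value 42 is an arbitrary choice:

```python
import pandas as pd

demo = pd.DataFrame({"row_id": range(10)})

# The same seed gives the same permutation on every run
shuffled_a = demo.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = demo.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))  # True
```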


#### Store dataframes for later access

In [26]:
new_df.to_pickle("data/new_df.pkl")

## Read original dataset in order to concatenate it with new_df

In [27]:
pip install pickle5

Collecting pickle5
  Downloading pickle5-0.0.11.tar.gz (132 kB)
     |████████████████████████████████| 132 kB 11.1 MB/s eta 0:00:01
Building wheels for collected packages: pickle5
  Building wheel for pickle5 (setup.py) ... done
  Created wheel for pickle5: filename=pickle5-0.0.11-cp36-cp36m-linux_x86_64.whl size=123209 sha256=5ca9f525897f8e599c109464e51ed7b70dfc7f57ce44565e6ba67e00e5012d07
  Stored in directory: /home/ec2-user/.cache/pip/wheels/f9/b7/be/bf9768ab0daa28fa4b386f3ad1bac5dd4d9c349c60e83b24e3
Successfully built pickle5
Installing collected packages: pickle5
Successfully installed pickle5-0.0.11
Note: you may need to restart the kernel to use updated packages.


In [28]:
import pickle5 as pickle
with open("data/main_df.pkl", "rb") as fh:
  main_df = pickle.load(fh)
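
`pickle5` is a backport of pickle protocol 5 for older Python versions; from Python 3.8 on, the standard `pickle` module supports that protocol. A loader that works in both environments can be sketched as follows. `load_pickle` is a hypothetical helper, and the demo uses a throwaway object rather than `data/main_df.pkl`.

```python
import os
import tempfile

try:
    import pickle5 as pickle  # backport, only needed on Python < 3.8
except ImportError:
    import pickle  # standard library, supports protocol 5 from Python 3.8


def load_pickle(path):
    """Load a pickled object from the given path."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Round trip with a throwaway object in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pkl")
    with open(demo_path, "wb") as fh:
        pickle.dump({"rows": 3}, fh)
    loaded = load_pickle(demo_path)

print(loaded)
```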

In [29]:
main_df

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,2550.0,2.0,140.0,0.2,0.4,140.0,0.030,1.5,6,3.5,30,0.07,14,0.07,60,1.0,-112.229470
1,2550.0,2.4,60.0,0.4,0.7,140.0,0.015,2.5,5,4.5,15,0.10,15,0.10,60,1.0,-149.581660
2,1200.0,2.4,60.0,0.4,0.7,120.0,0.008,4.0,3,6.0,12,0.14,8,0.14,10,1.0,-66.651648
3,3000.0,2.8,100.0,0.2,1.0,120.0,0.030,3.0,5,4.5,23,0.10,11,0.10,60,1.0,-30.155521
4,2100.0,2.8,40.0,0.1,0.4,100.0,0.015,3.0,5,4.5,23,0.10,11,0.10,40,1.0,-50.162968
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97197,1200.0,2.4,100.0,0.1,0.4,80.0,0.008,1.5,6,3.5,30,0.07,14,0.07,40,0.0,43.224258
97198,2550.0,2.4,40.0,0.1,0.4,100.0,0.030,2.5,5,4.5,15,0.10,15,0.10,40,1.0,-64.368848
97199,3000.0,2.4,60.0,0.2,1.0,120.0,0.030,2.5,5,4.5,15,0.10,15,0.10,10,1.0,-112.050890
97200,2100.0,2.4,100.0,0.4,0.7,80.0,0.030,4.0,3,6.0,12,0.14,8,0.14,40,0.0,15.687989


## Concatenate the original and new dataframes into one dataframe

In [32]:
updated_df = pd.concat([main_df, new_df], axis=0, ignore_index=True)
updated_df

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,2550.0,2.0,140.0,0.2,0.4,140.0,0.030,1.5,6.0,3.5,30,0.07,14,0.07,60,1.0,-112.229470
1,2550.0,2.4,60.0,0.4,0.7,140.0,0.015,2.5,5.0,4.5,15,0.10,15,0.10,60,1.0,-149.581660
2,1200.0,2.4,60.0,0.4,0.7,120.0,0.008,4.0,3.0,6.0,12,0.14,8,0.14,10,1.0,-66.651648
3,3000.0,2.8,100.0,0.2,1.0,120.0,0.030,3.0,5.0,4.5,23,0.10,11,0.10,60,1.0,-30.155521
4,2100.0,2.8,40.0,0.1,0.4,100.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,40,1.0,-50.162968
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98493,2000.0,3.0,140.0,0.1,1.0,140.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,40,0.0,11.415281
98494,1700.0,3.0,140.0,0.1,0.4,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,60,1.0,-133.482550
98495,2500.0,3.0,100.0,0.1,0.7,180.0,0.015,3.5,5.5,3.5,25,0.12,14,0.12,40,1.0,-101.772550
98496,1700.0,3.0,140.0,0.1,0.7,130.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,60,0.0,37.376125


#### Shuffle updated_df

In [34]:
for i in range(0, 3):
  updated_df = updated_df.sample(frac=1).reset_index(drop=True)

updated_df

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,1650.0,2.0,60.0,0.4,0.7,100.0,0.008,4.0,3.0,6.0,12,0.14,8,0.14,40,1.0,-53.507872
1,2100.0,2.0,40.0,0.2,0.4,100.0,0.030,3.0,5.0,3.5,30,0.10,14,0.10,10,1.0,-51.201674
2,2100.0,2.4,60.0,0.2,0.4,100.0,0.008,3.0,5.0,3.5,30,0.10,14,0.10,60,1.0,-31.126988
3,1200.0,2.0,100.0,0.1,0.4,140.0,0.030,2.5,5.0,4.5,15,0.10,15,0.10,10,1.0,-69.550360
4,1650.0,2.4,100.0,0.2,0.7,100.0,0.030,3.0,5.0,3.5,30,0.10,14,0.10,60,0.0,46.013491
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98493,3000.0,2.4,140.0,0.2,0.7,140.0,0.030,2.5,5.0,4.5,15,0.10,15,0.10,60,1.0,-87.013884
98494,2550.0,2.0,140.0,0.4,0.7,120.0,0.015,1.5,6.0,3.5,30,0.07,14,0.07,10,1.0,-23.404287
98495,1650.0,2.0,60.0,0.1,1.0,140.0,0.008,3.0,5.0,4.5,23,0.10,11,0.10,10,1.0,-50.718682
98496,3000.0,2.8,140.0,0.4,0.4,100.0,0.008,2.5,5.0,4.5,15,0.10,15,0.10,10,0.0,10.972057


#### Store dataframes for later access

In [35]:
updated_df.to_pickle("data/updated_df.pkl")

#### Save the .csv file

In [36]:
updated_df.to_csv("data/updated_edgecase_dataset.csv", index=False)

# Convert updated dataframe to csv. Split into train and test set

This time we use a larger training share (95%) because there are only about 1,300 new data variations compared to 97,202 variations in the original dataset. This helps ensure that the new data points are captured in the updated training dataset.

In [37]:
# Split 95% data to train and 5% to test set
num = np.random.rand(len(updated_df)) < 0.95
updated_train_df = updated_df[num].reset_index(drop=True)
updated_test_df = updated_df[~num].reset_index(drop=True)
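
The mask-based split can be wrapped in a small function with a seed so that the same rows end up in the same set on every run. This is a sketch on made-up data; `random_split` and the seed value are assumptions, not part of the original workflow. The resulting sizes are only approximately 95% and 5%, because each row is assigned independently.

```python
import numpy as np
import pandas as pd


def random_split(df, train_fraction=0.95, seed=0):
    """Assign each row to the training set with probability train_fraction."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(df)) < train_fraction
    return df[mask].reset_index(drop=True), df[~mask].reset_index(drop=True)


demo = pd.DataFrame({"row_id": range(1000)})
train_df, test_df = random_split(demo)

print(len(train_df), len(test_df))
```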

In [38]:
updated_train_df

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,1650.0,2.0,60.0,0.4,0.7,100.0,0.008,4.0,3.0,6.0,12,0.14,8,0.14,40,1.0,-53.507872
1,2100.0,2.0,40.0,0.2,0.4,100.0,0.030,3.0,5.0,3.5,30,0.10,14,0.10,10,1.0,-51.201674
2,2100.0,2.4,60.0,0.2,0.4,100.0,0.008,3.0,5.0,3.5,30,0.10,14,0.10,60,1.0,-31.126988
3,1200.0,2.0,100.0,0.1,0.4,140.0,0.030,2.5,5.0,4.5,15,0.10,15,0.10,10,1.0,-69.550360
4,1650.0,2.4,100.0,0.2,0.7,100.0,0.030,3.0,5.0,3.5,30,0.10,14,0.10,60,0.0,46.013491
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93544,3000.0,2.4,140.0,0.2,0.7,140.0,0.030,2.5,5.0,4.5,15,0.10,15,0.10,60,1.0,-87.013884
93545,2550.0,2.0,140.0,0.4,0.7,120.0,0.015,1.5,6.0,3.5,30,0.07,14,0.07,10,1.0,-23.404287
93546,1650.0,2.0,60.0,0.1,1.0,140.0,0.008,3.0,5.0,4.5,23,0.10,11,0.10,10,1.0,-50.718682
93547,3000.0,2.8,140.0,0.4,0.4,100.0,0.008,2.5,5.0,4.5,15,0.10,15,0.10,10,0.0,10.972057


In [39]:
updated_test_df

Unnamed: 0,body_mass,cogx,obj_dist,react_time,road_mu,speed,tire_rr,pedal.ratio,boo.ampli,mc.area,pf.area,pf.rbrake,pr.area,pr.rbrake,update_rate,collision,dist_to_col
0,2100.0,2.0,40.0,0.2,0.7,140.0,0.008,3.0,5.0,3.5,30,0.10,14,0.10,60,1.0,-89.868168
1,2100.0,2.8,40.0,0.2,0.7,80.0,0.008,3.0,5.0,3.5,30,0.10,14,0.10,40,0.0,3.184095
2,1200.0,2.8,40.0,0.4,0.4,80.0,0.030,3.0,5.0,3.5,30,0.10,14,0.10,60,1.0,-18.427254
3,1200.0,2.0,40.0,0.2,1.0,140.0,0.008,1.5,6.0,3.5,30,0.07,14,0.07,10,1.0,-81.651985
4,1200.0,2.4,40.0,0.4,0.7,120.0,0.008,3.0,5.0,3.5,30,0.10,14,0.10,60,1.0,-35.910026
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4944,3000.0,2.4,140.0,0.1,1.0,80.0,0.015,3.0,5.0,4.5,23,0.10,11,0.10,10,0.0,78.027713
4945,1200.0,2.8,40.0,0.2,1.0,100.0,0.008,3.0,5.0,3.5,30,0.10,14,0.10,60,1.0,-0.579072
4946,2550.0,2.0,40.0,0.4,1.0,120.0,0.008,2.5,5.0,4.5,15,0.10,15,0.10,40,1.0,-118.930990
4947,3000.0,2.8,60.0,0.1,0.7,100.0,0.008,4.0,3.0,6.0,12,0.14,8,0.14,10,1.0,-119.526330


In [40]:
updated_train_df.to_csv("data/new_train_test/train.csv", index=False)

In [41]:
updated_test_df.to_csv("data/new_train_test/test.csv", index=False)
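
`DataFrame.to_csv` is not expected to create missing parent directories, so writing to `data/new_train_test/` presumably requires that folder to exist. A small helper that creates the folder first can be sketched as follows; `save_csv` is a hypothetical name and the demo writes into a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df, path):
    """Create the parent directory if needed, then write df as CSV."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


with tempfile.TemporaryDirectory() as tmp:
    out = save_csv(pd.DataFrame({"collision": [1.0, 0.0]}),
                   Path(tmp) / "new_train_test" / "train.csv")
    written = out.read_text().splitlines()

print(written)
```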