# Statistical EDA

In this section of the notebook we use summary statistics and other measures to examine the data for interesting trends, occurrences, and relationships. We start with univariate statistics and plan to investigate multivariate relationships afterwards.

In [2]:
import numpy as np
import pandas as pd
import math

In [8]:
# Helper functions for EDA
def isnotnan(x):
    return not math.isnan(x)

def dropna(x):
    # NaN fails every comparison, so this keeps only non-NaN values (it also drops -inf).
    return [v for v in x if v > -np.inf]

def describe(x_):
    x = dropna(x_)
    if len(x) == 0:
        print("No description available for this data as there are no non-NaN values present.")
        return
    n = len(x)  # dropna has already removed every NaN, so no second check is needed
    print('N: \t\t\t{}\nRange: \t\t\t[{:.4}  -:-  {:.4}]\nMean: \t\t\t{:.4}\nStandard Deviation: \t{:.4}'.format(n, min(x), max(x), np.mean(x), np.std(x)))

def min_diff(x):
    if (len(x)) == 1: return 0
    return min((x[i + 1] - x[i] for i in range(len(x) - 1)))
    
def _is_int(x):
    for i in range(len(x) - 1):
        if (x[i + 1] - x[i]) != 1: return False
    return True
        
def granularity(x):
    # Expects x to hold the sorted unique values of a column.
    span = max(x) - min(x)
    if (span == 0): return 0 # No real definition of granularity if there is only one unique value, essentially 0.

    g = min_diff(x)

    if g == 1:
        if _is_int(x): return 'integer'  # every consecutive gap is exactly 1
    if (g / span <= 2**-16): # If it requires more than 16 bits of information to encode the data and is not an integer,
        return 'real'        # we assume it is a real number
    return '{:.4}'.format(float(g))  # float() keeps the format spec valid for integer columns
    
def unique(x_, force_print = False):
    x = dropna(x_)
    if len(x) == 0:
        print("No uniqueness information available for this data as there are no non-NaN values present.")
        return

    uniques = np.sort(pd.Series(x).unique())  # sorted copy of the distinct values
    g = granularity(uniques)
    n = len(uniques)

    print('N unique: \t\t\t{}\nGranularity (estimated): \t {}'.format(n, g))
    if n <= 100 or force_print: print(uniques)

In [4]:
path = 'data/VehicleID_152851_DriverID_22209/VehicleID_152851_DriverID_22209/File_ID_1229.csv'
data = pd.read_csv(path, na_values=" ")

In [5]:
col = data['vtti.lane_width']
col

0              NaN
1              NaN
2              NaN
3              NaN
4              NaN
5              NaN
6              NaN
7              NaN
8              NaN
9              NaN
10             NaN
11             NaN
12             NaN
13             NaN
14             NaN
15             NaN
16             NaN
17             NaN
18             NaN
19             NaN
20        0.000000
21             NaN
22             NaN
23             NaN
24             NaN
25             NaN
26             NaN
27             NaN
28             NaN
29      631.443896
           ...    
2093    334.771411
2094    334.771411
2095    334.771411
2096    334.771411
2097    334.771411
2098    334.771411
2099    334.771411
2100    334.771411
2101    334.771411
2102    334.771411
2103    334.771411
2104    334.771411
2105    334.771411
2106    334.771411
2107    334.771411
2108    334.771411
2109    334.771411
2110    334.771411
2111    334.771411
2112    334.771411
2113    334.771411
2114    334.

In [6]:
describe(col)

N: 			1933
Range: 			[0.0  -:-  748.3]
Mean: 			372.3
Standard Deviation: 	113.0


In [9]:
unique(col)

N unique: 			859
Granularity (estimated): 	 real


# Statistical EDA discoveries:

### vtti.file_id
* Constant within a file, so it can be dropped to save RAM and keep only relevant data. Purely an ID/key.
* CAN BE DROPPED

### vtti.abs
* No non-zero, non-nan values for files 1308 or 1229

### vtti.alcohol_interior
* Average value is ~4082, range of (4050, 4095)

### vtti.accel (\_x, \_y, \_z)
* All have same amount of nans (87), this is comforting
    * Should be something that we check and enforce if we use accelerometer readings.
* Could/should be consolidated?
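
The check suggested above (that the accelerometer axes are missing at exactly the same rows) could be enforced with something like the following. This is a minimal sketch on a made-up frame; the column names `vtti.accel_x`, `vtti.accel_y`, and `vtti.accel_z` are inferred from the heading and `nans_aligned` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def nans_aligned(df, columns):
    """Return True if the listed columns are missing at exactly the same rows."""
    masks = df[columns].isna()
    # A row is acceptable when it is missing in every column or in none of them.
    row_ok = masks.all(axis=1) | ~masks.any(axis=1)
    return bool(row_ok.all())

# Toy data standing in for the real CSV (values are invented for illustration).
toy = pd.DataFrame({
    "vtti.accel_x": [0.1, np.nan, 0.3],
    "vtti.accel_y": [0.0, np.nan, 0.2],
    "vtti.accel_z": [9.8, np.nan, 9.7],
})
aligned = nans_aligned(toy, list(toy.columns))
```

Comparing row positions is stricter than comparing NaN counts, since two columns can each have 87 NaNs at different rows.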
    
### vtti.cruise_state
* No non-zero, non-nan values for file 1308
    * Perhaps a driver's usage of cruise control could suggest something about the safety of their driving habits?
    
### computed.day_of_month
* All non-nan values are 6 (the only value present among a sea of NaNs in the column)
    * Indicates that all values will be the same for all csvs in this format, should be verified regardless.
* CAN BE DROPPED

### vtti.driver_button
* Not used in either file that Tommy has access to.
    * ergo we don't have crash validation data

### vtti.elevation_gps
* Used 1321/13507 in file 1308
    * Suggests roughly 1/10 the frequency of acceleration data.
    
### vtti.engine_rpm
* Only 21 nan values!
* Can be compared to throttle

### vtti.esc
* Only 21 nan values
* No non-zero non-nan values
* No idea what this variable is, possibly the automatic event detection (automated version of driver button)

### vtti.gyro (\_x, \_y, \_z)
* 1356 nan values
    * Seems to be same frequency as acceleration data, but with more drops
    * Could smoothly interpolate these data points fairly easily (no meaning lost)
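
Filling short gaps between known gyro samples might look like the sketch below. It uses pandas' linear interpolation on invented values; the `limit=2` cap on gap length is an assumption, not something taken from the data.

```python
import numpy as np
import pandas as pd

# Invented gyro readings with dropped samples.
gyro = pd.Series([0.10, np.nan, 0.30, np.nan, np.nan, 0.60])

# Fill at most 2 consecutive NaNs, and only between known values
# (limit_area="inside" avoids extending the series past its ends).
filled = gyro.interpolate(method="linear", limit=2, limit_area="inside")
```

Longer runs of missing values are only partly filled under this cap, which keeps long outages visible instead of papering over them.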
    
### vtti.head_confidence
* 1833 nan values
* Integer, ranges from 0 to 235, mean of 145 and sd of 70
    * Confidence is likely relative to itself
        * Will require determining max possible value and scaling this down to a probability
    * Suspect that this can be fit into 8 bits (and may even be 8 bits for vtti) as values are likely 0-255 and are definitely integer.
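
If the confidence really is an unsigned 8-bit value, scaling and compact storage could be sketched as follows. `ASSUMED_MAX = 255` is exactly that assumption and still needs to be verified; the sample values are invented.

```python
import numpy as np
import pandas as pd

ASSUMED_MAX = 255  # assumption: confidence is stored as an unsigned 8-bit integer

confidence = pd.Series([0, 145, 235, np.nan])

# Scale to [0, 1]; missing readings stay missing.
probability = confidence / ASSUMED_MAX

# Nullable unsigned 8-bit storage keeps the missing value without falling back to float64.
compact = confidence.astype("UInt8")
```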
        

### vtti.head_position (\_x, \_y, \_z)
* 1833 nan values
    * Good that it matches the confidence, should be verified.
* Integer values, likely mapping the center of the head in pixels.
    
### vtti.head\_position\_~\_baseline (where ~ is x, y, or z)
* 14 values
* BASELINE VARIES, will be necessary to consider these variations in order to get accurate relative location data. Maybe.
* Hypotheses about variations:
    * The baseline resets after a set amount of time
        * Incorrect since indices of resets are NOT at even intervals in the index, which directly correlates to realtime.
    * The baseline resets when the driver adjusts too much
        * Would require video confirmation
        * Could be investigated by examining the amount of spread in the head's location prior to a reset, and comparing this spread to the average spread during the segment between resets.
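
The timed-reset hypothesis can be checked by locating where the baseline changes and looking at the spacing between those points. A rough sketch, with `reset_indices` as a hypothetical helper and an invented baseline series:

```python
import numpy as np
import pandas as pd

def reset_indices(baseline):
    """Return the index labels at which a sparsely recorded baseline takes a new value."""
    known = baseline.dropna()
    if known.empty:
        return known.index
    changed = known.ne(known.shift())  # True where a value differs from the previous known one
    changed.iloc[0] = False            # the first observation is not a reset
    return known.index[changed.to_numpy()]

# Invented baseline: recorded only occasionally, changing value twice.
baseline = pd.Series([np.nan, 5.0, np.nan, 5.0, 7.0, np.nan, 7.0, 4.0])
resets = reset_indices(baseline)
gaps = np.diff(resets)  # roughly equal gaps would support a timed reset
```

The same reset positions can then be used to slice the head position columns and compare the spread just before each reset with the spread over the whole segment.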
            
### vtti.head_rotation (\_x, \_y, \_z)
* 1833 nan values
    * Great that this matches position/confidence
    
### vtti.head\_rotation\_~\_baseline (where ~ is x, y, or z)
* 14 values
* BASELINE VARIES at same spots as head position baseline, so they are reset at the same time.
    * For hypotheses about why resets happen, see vtti.head_position_~_baseline
        
### vtti.heading_gps
* 1321 values
    * Same frequency as elevation_gps (which is comforting)
* Seems to be unnecessary, why do we need to know the direction of the car?

### vtti.headlights
* 543 values
* All values are 1 or NaN
    * -1: unavailable (vs nan??)
    * 0: headlights off
    * 1: headlights on
* Might not be useful in these simulated driving situations (i.e. if a person who always puts headlights on tends to drive safer, that increased safety is tossed out the window when they begin to drive recklessly on purpose).

### vtti.lane_distance_off_center
* 13290 values
* Ranges from ~-900 to ~700
    * Standard deviation of 78, likely measured in cm then.
    * Outliers may have to be removed (esp. around lane changes)
        * May actually be helpful in identifying lane changes.

### vtti.lane_width
* 13099 values
    * Strange that it is not the same as lane_distance_off_center (would assume these measurements come from the same peripheral).
* Ranges from ~60 to ~750
    * Average of 350, which means this is likely in cm (assuming a 10-12 ft average lane width on a road).
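
The unit guess can be sanity-checked with a quick conversion. The 10-12 ft lane width is the assumption stated above, and 350 is the rounded mean from these notes.

```python
CM_PER_FOOT = 30.48

# Assumed typical lane width of 10-12 ft, converted to centimetres.
low_cm = 10 * CM_PER_FOOT   # 304.8
high_cm = 12 * CM_PER_FOOT  # 365.76

mean_lane_width = 350  # rounded mean reported in the notes
plausible_cm = low_cm <= mean_lane_width <= high_cm
```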
    

### vtti.left_line_right_distance
* 13269 values
    * Strange that these all seem to vary.
* Ranges from -970 to ~310
    * Mean: -210, std: 90
* Unsure what this measures, assuming it tells how close one side of the car is to one of the lines, but not sure which side and which line.
    * May be very helpful to see how close the driver is willing to drive to the lines, shows recklessness.

### vtti.(left/right)\_marker_probability
* 13290 values
    * Same as lane distance off center at least
* Ranges from 0 to 1024 (close to a 10-bit scale, although a 10-bit value tops out at 1023, so the exact maximum should be verified)
* Allows us to discard data we are not certain about in the (left/right)\_marker_type feature.
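
Discarding uncertain marker types might look like this. The cut-off `MIN_PROBABILITY` is a hypothetical value chosen for illustration, the column names are inferred from the headings, and the rows are invented.

```python
import pandas as pd

MIN_PROBABILITY = 512  # hypothetical cut-off on the 0-1024 scale

lanes = pd.DataFrame({
    "vtti.left_marker_probability": [1024, 100, 800],
    "vtti.left_marker_type": [1, 3, 2],
})

# Keep the marker type only where the detector was confident; the rest becomes NaN.
trusted_type = lanes["vtti.left_marker_type"].where(
    lanes["vtti.left_marker_probability"] >= MIN_PROBABILITY
)
```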

### vtti.(left/right)\_marker_type
* 13290 values
    * Same as a couple now, thankfully
* Encoded as five distinct values: 0, 1, 2, 3, 4
* Allows us to distinguish between legal/illegal turns as well as detect reckless behavior (crossing double yellows, crossing into the shoulder)

### vtti.latitude / vtti.longitude
* 1321 values 
    * Matches GPS data, suggesting this info was provided by the GPS
* Hard to imagine how this will be useful besides comparing speed to speed limits on the same road.

### vtti.light_level
* 4029 values
    * ~3x the rate of gps data
* Ranges from 1 to 11.56
* I believe this details the ambient lighting level, but at the time of writing cannot find the actual description of what this value represents.
    * If this *is* ambient light, then could be useful to contrast the 'recklessness' of one's driving vs. the safety of the road at the time.
    
### vtti.month_gps/vtti.year_gps
* 1321 values
    * Same as other GPS records
* Static values, no need to keep 1321 of them when one will do.
* Provides very little information anyway

### vtti.number_of_satellites
* 1321 values
    * Same as other GPS records
* Ranges from 4 to 8 satellites at all times, with an average of 7.2 and an sd of .7.
    * Implies very good GPS coverage for this trip (1308)
        * This will not always be the case
* This will need to be taken into consideration when using any of the GPS measures, as each additional satellite provides a significant increase in the accuracy of GPS measures.

### vtti.odometer
* 267 values
    * Only 14 unique values
        * Tells us this trip was ~13-15 miles
* Very low density information, likely not useful in reality.

### vtti.pdop
* 1321 values
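    * Same count as the other GPS fields, which fits a GPS quality metric
* Most likely Position Dilution of Precision (PDOP), a standard GPS measure of satellite geometry in which lower values mean better positional accuracy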
* Ranges from 1.31 to 5.47, with a mean of 1.61 and an sd of 0.41
    * Implies this value is generally low, with aberrant behavior being high values

### vtti.pedal_brake_state
* 13487 values
    * Likely from bus of car
* Ranges from 0 to 1
    * Likely a percentage of how pressed down the brake pedal is at the present.
* Very good measure of the intent of the driver in braking
    * Allows us to detect hard braking behavior, and contrast this with earlier (and hence safer) braking

### vtti.pedal_gas_position
* 13487 values
    * Likely from bus of car
* Ranges from 0 to ~37
    * Likely a measure of pressure on the gas pedal, has a granularity of 0.38
        * Only 94 levels present, but a max of nearly 38 and a g of 0.38 implies that there are 100 possible levels for the gas pedal
            * Likely unverifiable, instead we can simply use this data as a proportion of the maximum gas pedal pressure.
* Very good measure of the intent of the driver in accelerating
    * Allows us to detect aggressive accelerations and dangerous cornering, and to contrast these with safer behaviors
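
The level count and the proportion-of-maximum idea can be sketched as below. The pedal readings are invented; only the granularity of 0.38 and the maximum of roughly 38 come from the notes.

```python
import pandas as pd

# Maximum of ~38 divided by the estimated granularity of 0.38.
implied_levels = round(38 / 0.38)

# Invented pedal readings, expressed as a share of the largest value observed.
gas = pd.Series([0.0, 0.38, 19.0, 37.62])
proportion = gas / gas.max()
```

Dividing by the observed maximum is only a stand-in for the true mechanical maximum, so a trip that never reaches full throttle would be overstated.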

### vtti.prndl
* 13487 values
* Direct encoding of Park, Reverse, Neutral, Drive, Low
    * Only values 0, 1, 2, 3 are present (4 unique values)
        * My suspicion is that this is caused by the fact that no one ever puts their car into lower gears without hauling uphill.
* Will be useful to split behaviors into dangerous *driving* behaviors, *reversing* behaviors, and possibly behaviors while in neutral

### vtti.right_line_left_distance
* 13283 values
    * The fact that this count differs from the count for the similarly named left_line_right_distance implies these are calculated values
        * Does this mean we should weigh how far we trust it by the marker probabilities for the line estimates?
* Still not sure what these variables mean

### vtti.seatbelt_driver
* No non-NaN values present in file 1308
    * Not even relevant anyway, since this training data does not include situations where the driver breaks the law (e.g. not wearing a seatbelt)

### vtti.speed_gps
* 1321 values
    * Matches all other GPS values
* Ranges from 0 to 106.5, mean of 35 and sd of 36
    * Lots of variation, will be interesting to use this
* May not be useful compared to other speed measures

### vtti.speed_network
* 12077 values
    * Doesn't match any other known values, may be a sign of missing data in this
        * Still likely more accurate than GPS data
* Ranges from 0 to 108, mean of 36 and sd of 36
    * Similar variation to GPS data

### vtti.steering_wheel_position
* 13486 values
    * Likely from bus of car (off by 1 of other suspects for this)
* Ranges from -320 to 433, mean of 27 and sd of 85
    * Granularity of .125 suggests we could encode this more efficiently
        * Would require *true* range of this feature
* Can tell us how quickly the driver is turning the wheel. Sharper wheel turns associated with higher speeds imply hard turning/cornering (and therefore reckless behavior)
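
One way to flag fast wheel movement at speed is sketched below. Both thresholds are hypothetical, the readings are invented, and the rate is measured per observation rather than per second.

```python
import pandas as pd

# Invented readings, one per observation.
steering = pd.Series([0.0, 0.125, 10.0, 45.0, 46.0])
speed = pd.Series([30.0, 30.0, 60.0, 62.0, 61.0])

MIN_RATE = 20.0   # hypothetical: wheel-position change per observation
MIN_SPEED = 50.0  # hypothetical: same units as the speed column

# Absolute change in wheel position between consecutive observations.
steering_rate = steering.diff().abs()

# The first row has no previous value, so its rate is NaN and it is never flagged.
hard_turn = (steering_rate > MIN_RATE) & (speed > MIN_SPEED)
```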


### vtti.temperature_interior
* 936 values, 142 unique, 0.00625 granularity
* Ranges from ~28 to ~42, with a mean of 36 and an sd of 2.5

### computed.time_bin
* 1321 values
    * Implies that this data is computed from GPS data
* Only value is 6, so I'm not sure what this value represents.

### vtti.traction_control_state
* 13486 values
    * Implies this data is taken from bus of car
* All values are 0
   * Regardless, this feature should be useful for event detection (hard cornering, hard acceleration)

### vtti.turn_signal
* No non-NaN values present in file 1308 or file 1229
* This data would allow us to measure the safety of a driver's turning and lane-changing behaviors
    * This would be very valuable information, but without turn signal data being present this will be hard.

### vtti.video_frame
* 13480 values
    * Integer values, naturally
* Each video frame seems to correspond to an observation
* Allows us to easily check out events we detect in the sample video.

### vtti.wiper
* 543 values
* All values for trip 1308 are 0
* Allows us to detect (at the very least moderate-severe) rainy/snowy conditions without relying on extracting video data.

### Vehicle tracking data:
* Track1 is quite crowded, whereas others are not (seems obvious)
* Data allows us to understand how many other drivers are around
    * Dangerous behavior with others around is *even more* reckless, so this will be something to consider
        * Also will be able to consider proximity, catchup behavior, following behavior (i.e. tailgating or not)

# Data Cleaning

This section of the notebook deals with cleaning the data based on observations made in the EDA section. This will require smoothing some variables, pruning others, and converting some to more compact formats.

# Cleaning to be done:

## High Priority
* HIGH: Add smoothing of some sort (likely thresholding plus random variation) to clean the accelerometer, pedal, and speed behavior
    * May be missing some categories
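
As a placeholder until the smoothing approach is settled, a centered rolling median is one simple option. Note that this is a different technique from the thresholding idea above, shown only as a sketch on invented values; the window of 3 is arbitrary.

```python
import pandas as pd

# Invented signal with a single spike at position 2.
signal = pd.Series([1.0, 1.1, 9.0, 1.2, 1.0, 1.1])

# The median over a 3-sample centered window suppresses isolated spikes;
# min_periods=1 lets the two end points use a shorter window.
smoothed = signal.rolling(window=3, center=True, min_periods=1).median()
```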


## Medium Priority
* MED: Remove any columns that are all nulls or all the same value.
    * Save the singular value if it seems like it could be useful, else ignore it
* MED: Write a function to call the Google API for each lat/long and get the current speed limit
    * Will need to be converted to the speed format used in speed_gps and speed_network
    * Not really data cleaning, belongs in an external data notebook
        * Could also use the following feature extraction notebooks:
            * Event detection via csv
            * Event detection via video
            * Distraction detection via video
                * From posture detection using OpenCV
            * Distraction detection via csv
                * From the head position and rotation data provided by vtti
            * Braking behavior assessment via csv
            * Acceleration behavior assessment via csv
            * Following behavior assessment via csv
            * Lane changing behavior assessment via csv
            * Frequency domain conversion (for any time-series columns that are not already assessed broadly enough) via csv
        * Also need a notebook for clustering 'sections' of driving (likely between full-stops)
            * Will require converting time-series data to frequency data (as sections are of different length)
                * This will have to be done in its own notebook
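
The first medium-priority item (removing columns that are all null or constant, while saving the single value) could be sketched like this. `split_constant_columns` is a hypothetical helper and the frame is invented.

```python
import numpy as np
import pandas as pd

def split_constant_columns(df):
    """Drop columns with at most one distinct non-NaN value.

    Returns the trimmed frame and a dict of the single values that were removed.
    """
    distinct = df.nunique(dropna=True)
    constant = distinct[distinct <= 1].index
    # All-NaN columns have nothing worth saving, so they are skipped here.
    saved = {c: df[c].dropna().iloc[0] for c in constant if df[c].notna().any()}
    return df.drop(columns=constant), saved

toy = pd.DataFrame({
    "computed.day_of_month": [6.0, np.nan, 6.0],
    "vtti.abs": [np.nan, np.nan, np.nan],
    "vtti.speed_network": [1.0, 2.0, 3.0],
})
trimmed, saved = split_constant_columns(toy)
```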


## Low Priority
* LOW: Any columns that have a non-real granularity should be converted to a space-saving format
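
For columns whose values are all whole numbers, pandas can pick a smaller integer type automatically. A minimal sketch on invented values:

```python
import pandas as pd

# Invented whole-number column that pandas would otherwise store as float64.
whole_numbers = pd.Series([0.0, 1.0, 2.0, 3.0])

# downcast="integer" picks the smallest signed integer dtype that holds every value.
compact = pd.to_numeric(whole_numbers, downcast="integer")
```

A column that still contains NaN is left as float by this call, so missing values need to be handled first, or a nullable integer dtype used instead.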

In [None]:
TO_DROP = [
    'vtti.file_id', # file id is in the name of the csv
    'computed.day_of_month', # consistent throughout the entire csv. Verify this before dropping.
    'computed.time_bin', # see ^
    'vtti.month_gps', # see ^
    'vtti.year_gps' # see ^
] 


In [None]:
print(",\t".join(data.columns))