# Big G Express - Data Exploration

## Team: Elden Ring

<img src="https://eldenring.wiki.fextralife.com/file/Elden-Ring/mirel_pastor_of_vow.jpg" alt="PRAISE DOG" style="width:806px;height:600px;"/>

#### PRAISE THE DOG!

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.feature_selection import chi2

In [2]:
faults = pd.read_csv('../data/J1939Faults.csv', low_memory=False, parse_dates=['EventTimeStamp', 'LocationTimeStamp']) #index_col='EventTimeStamp'
service_fault = pd.read_excel('../data/Service Fault Codes_1_0_0_167.xlsx')
vehicle_diagnostic = pd.read_csv('../data/VehicleDiagnosticOnboardData.csv')




Key points from our questions to Josh Treet:
- throw out dates from 2011 and older; they come from an integer overflow mistake that took a few days to correct
- any lead time in predicting a derate is great (even just a few hours)
- derates are going to be related to emissions conditions
- coolant level codes (and some others) can often flip between on and off
- a derate with the light continuing to be on is the same event (a pulse of it)
- spn + fmi together determine the fault code
- most trucks are fairly similar or the same (within about 4 years)
- a mispredicted potential derate costs maybe about $500

## Exploratory Data Analysis

In [3]:
print(faults.shape)
print(service_fault.shape)
print(vehicle_diagnostic.shape)

(1187335, 20)
(7124, 14)
(12821626, 4)


Faults joins to vehicle_diagnostic on RecordID = FaultId.

Also, the columns actionDescription and faultValue in faults are unused.

`faults['actionDescription'].isna().sum()`

We also remove the 2169 rows whose EquipmentID has more than 5 characters.

In [4]:
faults = (
    faults.drop(['actionDescription', 'faultValue'], axis=1)
    [faults['EquipmentID'].str.len() <= 5]
)

There are three service locations that appear in the dataset, and fault signals might be switched on and off while a truck is being worked on there. To keep those events out of the counts, we check whether the truck's Latitude and Longitude are both within 0.01 degrees of a service location. At this latitude, 0.01 degrees is roughly 0.7 miles north-south and about 0.56 miles east-west.

Doing so, we eliminate 131778 events.
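
As a quick check on how large the 0.01 degree box is, here is a minimal sketch that converts a degree offset into miles. The helper name `degrees_to_miles` and the mean Earth radius constant are assumptions made for illustration; the notebook itself only uses the raw 0.01 threshold.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (assumed constant for this sketch)

def degrees_to_miles(delta_deg, latitude_deg):
    """Approximate north-south and east-west distance covered by delta_deg degrees."""
    miles_per_degree = math.radians(1) * EARTH_RADIUS_MILES
    north_south = delta_deg * miles_per_degree
    # a degree of longitude shrinks with the cosine of the latitude
    east_west = delta_deg * miles_per_degree * math.cos(math.radians(latitude_deg))
    return north_south, east_west

# first service location used in the filter
ns, ew = degrees_to_miles(0.01, 36.0666667)
print(round(ns, 2), round(ew, 2))
```

So the filter is a box of a bit over a mile on each side, centered on each service location, rather than a circle of one-mile radius.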

In [5]:
for lat, lon in [(36.0666667, -86.4347222), (35.5883333, -86.4438888), (36.1950, -83.174722)]:
    
    faults = faults.loc[~((abs(lat - faults['Latitude']) <= 0.01) &
                          (abs(lon - faults['Longitude']) <= 0.01))]

Also filter out all erroneous years, 2011 or earlier (394 rows), caused by an integer overflow.

In [6]:
faults = faults.loc[faults['EventTimeStamp'].dt.year > 2011]

Finally, remove the rows where the 'active' column is False; those mark where an indicator was turned off (506690 rows).

So we end up with 546674 rows in faults.

In [7]:
faults = faults.loc[faults['active'] == True]

## Modifying faults into rolling window

Next, combine the spn and fmi columns into a single code so that it can be one-hot encoded and used in the rolling window.

> note: need to order by event time stamp in order to use the rolling window later

In [8]:
faults['spn_fmi'] = ['_'.join(i) for i in zip(faults['spn'].astype(str), faults['fmi'].astype(str))]

faults_encoded = pd.get_dummies(faults, columns=['spn_fmi'], prefix='spn_fmi')

faults_encoded = faults_encoded.sort_values(by='EventTimeStamp')
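
To make the encoding step concrete, here is a small sketch on a made-up three-row frame (the `toy` data is invented for illustration, not taken from the dataset). It builds the combined code with string concatenation, which gives the same result as the `zip`/`join` approach, and then one-hot encodes it.

```python
import pandas as pd

# toy data: two different fmi values for spn 111, plus one derate row
toy = pd.DataFrame({'spn': [111, 5246, 111], 'fmi': [17, 0, 18]})

# combine spn and fmi into a single code such as '111_17'
toy['spn_fmi'] = toy['spn'].astype(str) + '_' + toy['fmi'].astype(str)

# one column per distinct code; the original spn and fmi columns are kept
encoded = pd.get_dummies(toy, columns=['spn_fmi'], prefix='spn_fmi')
print(encoded.columns.tolist())
```

Each distinct spn/fmi pair becomes its own indicator column, which is why the real frame ends up with hundreds of `spn_fmi_*` columns.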

In [95]:
# to obtain the one hot encoded columns since there are so many
spnfmi_cols = [col for col in faults_encoded.columns if 'spn_fmi' in col]
fixed_cols = ['RecordID', 'spn', 'fmi']

In [12]:
# for some reason, agg with sum works without a groupby;
# but once the groupby is added, it keeps running without ever completing

# d1 = dict.fromkeys(fixed_cols, lambda x: x[-1]) #this function gets the last value in group!
# d2 = dict.fromkeys(spnfmi_cols, 'sum')

# d = {**d1, **d2}

# faults_encoded.groupby('EquipmentID')[['EventTimeStamp'] + fixed_cols + spnfmi_cols].rolling(window = '1d', on = "EventTimeStamp").agg(d)

Using the groupby (for each truck) and rolling window on top of that:

In [51]:
faults_rolling = (
    faults_encoded
    .groupby('EquipmentID')[['EventTimeStamp'] + spnfmi_cols]
    .rolling(window = '1d', on = "EventTimeStamp")
    .sum()
)

faults_rolling = faults_rolling.reset_index()
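
A minimal sketch of what the per-truck rolling sum produces, on a made-up frame with two trucks and a single indicator column (the `toy` data is an assumption for illustration). For each row, the `'1d'` window sums the indicator over that truck's events in the 24 hours up to and including the current event.

```python
import pandas as pd

toy = pd.DataFrame({
    'EquipmentID': ['A', 'A', 'A', 'B'],
    'EventTimeStamp': pd.to_datetime([
        '2015-01-01 00:00', '2015-01-01 12:00',
        '2015-01-02 06:00', '2015-01-01 01:00']),
    'spn_fmi_111_17': [1, 1, 1, 1],
}).sort_values('EventTimeStamp')  # time-based windows need sorted timestamps

rolled = (
    toy.groupby('EquipmentID')[['EventTimeStamp', 'spn_fmi_111_17']]
       .rolling(window='1d', on='EventTimeStamp')
       .sum()
       .reset_index()  # the original row index comes back as 'level_1'
)
print(rolled)
```

Truck A's third event only counts the 12:00 event and itself, because the midnight event is more than 24 hours old by then; truck B's single event is counted on its own.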

In [14]:
# bring back the spn and fmi information; this merge is an alternative to agg, which made the kernel crash!
faults_rolling = pd.merge(faults_encoded[fixed_cols],
                          faults_rolling,
                          left_index= True,
                          right_on = 'level_1').drop(columns='level_1')

Randomly sampling 1000 rows of the past-24-hour counts, we see that the most common faults are:
- 111_17, coolant level below normal, low severity
- 929_9, abnormal update rate, tire location
- 96_3, high voltage in fuel level
- 829_3, high voltage left fuel level
- 596_31, Condition Exists Cruise Control Enable Switch
- 111_18, Low engine coolant level detected, med severity
- 51923_0, (description not found)
- 4096_0, High (Severity High) NOx limits exceeded due t....
- 97_15, High (Severity Low) Water In Fuel Indicator; Water has been detected in the fuel filter.
- 639_2, Incorrect Data J1939 Network #1; The ECM has a communication error.
- 629_12, ECM power supply errors / ECM error / ECM data lost
- 2863_7, Not Reporting Data Front Operator Wiper Switch
- 1068_2, Incorrect Data Brake Signal Sensor 2
- 50353_0, (description not found)
- 1807_2, Incorrect Data Steering Wheel Angle
- 807_5, Low Current Dif 2 - ASR Valve
- 611_14, Special Instructions System Diagnostic Code #1
- 0_0, (description not found)
- 4276_0, (description not found)
- 412_0, High (Severity High) Engine Exhaust Gas Recirc...; The EGR temperature sensor indicates that the



Some events we couldn't find the description for.

Now onto figuring out which SPN and FMI codes might be useful for predicting a derate. The logic here is to randomly sample rows and compare the share of each code in that sample with the share of each code in the rows where a derate occurred.

In [15]:
sample_codes = (
    faults_rolling
        .sample(5000)
        .drop(columns=['EventTimeStamp','EquipmentID','spn', 'fmi'])
        .sum()
)

# 928 rows have derate as current event
derate_codes = (
    faults_rolling
        .loc[faults_rolling['spn'] == 5246]
        .drop(columns=['EventTimeStamp','EquipmentID','spn', 'fmi'])
        .sum()
)

code_differences = (derate_codes / derate_codes.sum()) - (sample_codes / sample_codes.sum())
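
To show how the difference of shares reads, here is a small worked sketch with invented counts for three codes (the numbers are made up for illustration only). Each count vector is normalized to sum to 1 before subtracting, so the two groups are comparable even though they contain different numbers of rows.

```python
import pandas as pd

# invented 24-hour code counts summed over a random sample and over derate rows
sample_counts = pd.Series({'spn_fmi_111_17': 60, 'spn_fmi_1569_31': 20, 'spn_fmi_96_3': 20})
derate_counts = pd.Series({'spn_fmi_111_17': 10, 'spn_fmi_1569_31': 30, 'spn_fmi_96_3': 10})

# share of each code within its own group, then the difference of shares
diff = derate_counts / derate_counts.sum() - sample_counts / sample_counts.sum()
print(diff.sort_values(ascending=False))
```

A positive value means the code makes up a larger share of the faults seen around a derate than of the faults seen in general; a value near zero means the code is about equally common in both.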

In [16]:
# note 5246 are all derates ... .sort_values(ascending=False).head(20)
code_differences = (
    code_differences
    .to_frame()
    .reset_index()
    .rename(columns={'index': 'spn_fmi', 0:'rel_frequency'})
)

# I did it kinda backwards; these need to be eliminated earlier
code_differences = code_differences.loc[~code_differences['spn_fmi'].str.contains('5246')]

In [17]:
code_differences.sort_values(by='rel_frequency', ascending=False).head(20)

Unnamed: 0,spn_fmi,rel_frequency
142,spn_fmi_1569_31,0.115231
446,spn_fmi_4094_18,0.049594
617,spn_fmi_5394_17,0.022633
368,spn_fmi_3362_31,0.020105
620,spn_fmi_5394_5,0.01637
193,spn_fmi_1761_19,0.016152
196,spn_fmi_1761_9,0.013359
379,spn_fmi_3364_9,0.011733
809,spn_fmi_6802_31,0.010817
301,spn_fmi_3216_9,0.01072


In [31]:
# I'm not exactly sure how to interpret this
#chi2(faults_rolling.drop(columns=['EventTimeStamp','EquipmentID','spn', 'fmi']), faults_rolling['spn'])

Even for the codes above with a positive relative frequency (more common around a derate than in the random sample), the 24-hour counts in the derate rows sit very close to 0, as the summary below shows. Codes with a negative relative frequency are less common around a derate than in the sample.

In [44]:
faults_rolling.loc[faults_rolling['spn'] == 5246][[
    'spn_fmi_1569_31',
    'spn_fmi_3362_31',
    'spn_fmi_4094_18',
    'spn_fmi_1761_19',
    'spn_fmi_1761_9',
    'spn_fmi_3364_9',
    'spn_fmi_5394_17',
    'spn_fmi_5394_5',
    'spn_fmi_6802_31',
    'spn_fmi_3031_9',
]].describe()

#.to_csv('../data/rolling_trucks.csv')

Unnamed: 0,spn_fmi_1569_31,spn_fmi_3362_31,spn_fmi_4094_18,spn_fmi_1761_19,spn_fmi_1761_9,spn_fmi_3364_9,spn_fmi_5394_17,spn_fmi_5394_5,spn_fmi_6802_31,spn_fmi_3031_9
count,491.0,491.0,491.0,491.0,491.0,491.0,491.0,491.0,491.0,491.0
mean,0.391039,0.06721,0.164969,0.057026,0.046843,0.040733,0.075356,0.05499,0.03666,0.03055
std,0.735292,0.266428,0.413145,0.307735,0.247116,0.207931,0.271849,0.253608,0.268557,0.183735
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,5.0,2.0,3.0,4.0,3.0,2.0,2.0,2.0,4.0,2.0


There are 1045 trucks in the dataset across 1185166 rows; 498 have a partial derate, 211 a total derate, and 182 have both.

Below we find the trucks that have only a partial derate, only a total derate, both, or neither.

In [20]:
all_trucks = faults['EquipmentID'].unique()
partial_derate = faults.loc[(faults['spn'] == 1569) & (faults['fmi'] == 31)]['EquipmentID'].unique()
total_derate = faults.loc[faults['spn'] == 5246]['EquipmentID'].unique()

partial_derate_only = partial_derate[np.isin(partial_derate, total_derate, invert=True)]
total_derate_only = total_derate[np.isin(total_derate, partial_derate, invert=True)]
partial_and_total_derate = np.intersect1d(partial_derate, total_derate)
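# trucks with neither kind of derate: exclude any truck that appears in either list
# (the two "only" groups are disjoint, so OR-ing their complements would select all 1042 trucks)
no_derate = all_trucks[~np.isin(all_trucks, partial_derate) & ~np.isin(all_trucks, total_derate)]



In [21]:
print(len(partial_derate_only))
print(len(total_derate_only))
print(len(partial_and_total_derate))
print(len(no_derate))  # 1042 - 330 - 28 - 161

330
28
161
523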


## Creating the Target Variable

In order to train the models, we need to create a target variable: look into the "future" of each row and see if a derate happens (say, within a 6-hour window).

To do that, I used a similar approach to the rolling window above. The major difference here is that I sorted the time series in descending order (in order to "look at the future") and used only spn instead of spn_fmi, because full derates share the same spn but can have different fmi values.

In [110]:
faults_encodedspn = faults.drop(columns='spn_fmi')

faults_encodedspn['spn_double'] = faults_encodedspn['spn']

faults_encodedspn = pd.get_dummies(faults_encodedspn, columns=['spn_double'], prefix='spn')

# have to invert the time order here to look into the future!
faults_encodedspn = faults_encodedspn.sort_values(by='EventTimeStamp', ascending=False)

In [111]:
spn_cols = [col for col in faults_encodedspn.columns if 'spn_' in col]
fixed_cols = ['RecordID', 'spn', 'fmi']

In [114]:
faults_rolling_future = (
    faults_encodedspn
    .groupby('EquipmentID')[['EventTimeStamp'] + spn_cols]
    .rolling(window = '6h', on = "EventTimeStamp")
    .sum()
)

faults_rolling_future = faults_rolling_future.reset_index()

In [115]:
faults_rolling_future = pd.merge(faults_encodedspn[fixed_cols],
                          faults_rolling_future,
                          left_index= True,
                          right_on = 'level_1').drop(columns='level_1')

In [123]:
# to leave out the rows where the derate itself is the current event, also require: & (faults_rolling_future['spn'] != 5246)
faults_rolling_future['target'] = np.where(faults_rolling_future['spn_5246'] > 0, 1, 0)
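
For comparison, here is a sketch of a different way to build the same kind of label, using `pd.merge_asof` with `direction='forward'` instead of a reversed rolling window. This is not the method used in this notebook; the `events` data and the `next_derate` column name are made up for illustration. Each event is matched to the next derate on the same truck, and the `tolerance` drops matches that are more than 6 hours ahead.

```python
import pandas as pd

events = pd.DataFrame({
    'EquipmentID': ['A', 'A', 'A', 'B', 'B'],
    'EventTimeStamp': pd.to_datetime([
        '2015-01-01 00:00', '2015-01-01 08:00', '2015-01-01 10:00',
        '2015-01-01 09:00', '2015-01-01 12:00']),
    'spn': [111, 96, 5246, 111, 639],
})

# timestamps of full derates (spn 5246), one row per derate event
derates = (
    events.loc[events['spn'] == 5246, ['EquipmentID', 'EventTimeStamp']]
          .rename(columns={'EventTimeStamp': 'next_derate'})
          .sort_values('next_derate')
)

# both frames must be sorted by their time key for merge_asof
labelled = pd.merge_asof(
    events.sort_values('EventTimeStamp'),
    derates,
    left_on='EventTimeStamp', right_on='next_derate',
    by='EquipmentID',
    direction='forward',
    tolerance=pd.Timedelta('6h'),
)

# 1 when a derate on the same truck falls within the next 6 hours
labelled['target'] = labelled['next_derate'].notna().astype(int)
print(labelled[['EquipmentID', 'EventTimeStamp', 'spn', 'target']])
```

In this toy data, truck A's midnight event is 10 hours before the derate and gets 0, while the 08:00 event is 2 hours before and gets 1. The derate row matches itself and also gets 1, the same as when the `spn != 5246` condition is left out of the label above.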

In [128]:
y = faults_rolling_future[['RecordID', 'target']]

## Vehicle Diagnostic

For vehicle diagnostic:
- Id – the record Id
- Name – the name of the diagnostic
- Value – the value for that diagnostic
- FaultId – foreign key to the QCJ1939Fault record

Work on the Diagnostic table done by Alison Cordoba

In [23]:
vehicle_diagnostic.head(10)

Unnamed: 0,Id,Name,Value,FaultId
0,1,IgnStatus,False,1
1,2,EngineOilPressure,0,1
2,3,EngineOilTemperature,96.74375,1
3,4,TurboBoostPressure,0,1
4,5,EngineLoad,11,1
5,6,AcceleratorPedal,0,1
6,7,IntakeManifoldTemperature,78.8,1
7,8,FuelRate,0,1
8,9,FuelLtd,12300.907429328,1
9,10,EngineRpm,0,1


In [45]:
# make a real copy of the DataFrame (plain assignment would only create a second name for the same object)
Diagnostics = vehicle_diagnostic.copy()

In [47]:
Diagnostics = Diagnostics.pivot(index="FaultId", columns="Name", values="Value").reset_index()
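
A minimal sketch of what the pivot does, on a made-up long table with the same four columns (the values are invented for illustration). Each distinct `Name` becomes a column, each `FaultId` becomes one row, and any diagnostic that was not recorded for a fault shows up as NaN.

```python
import pandas as pd

# long format: one row per (fault, diagnostic name)
long = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Name': ['IgnStatus', 'EngineRpm', 'EngineRpm', 'LampStatus'],
    'Value': ['False', '0', '1200', '1023'],
    'FaultId': [1, 1, 2, 2],
})

# wide format: one row per fault, one column per diagnostic name
wide = long.pivot(index='FaultId', columns='Name', values='Value').reset_index()
print(wide)
```

Note that `pivot` raises a `ValueError` if the same `FaultId` and `Name` pair appears more than once, so it only works when each diagnostic is recorded at most once per fault.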

In [48]:
Diagnostics.isna().sum()

Name
FaultId                            0
AcceleratorPedal              655446
BarometricPressure            601359
CruiseControlActive           612419
CruiseControlSetSpeed         610877
DistanceLtd                   601516
EngineCoolantTemperature      601264
EngineLoad                    601714
EngineOilPressure             601091
EngineOilTemperature          603423
EngineRpm                     600414
EngineTimeLtd                 605969
FuelLevel                     684540
FuelLtd                       602140
FuelRate                      602098
FuelTemperature               888225
IgnStatus                     578881
IntakeManifoldTemperature     601044
LampStatus                         0
ParkingBrake                  787363
ServiceDistance              1187120
Speed                         603419
SwitchedBatteryVoltage       1073276
Throttle                      766832
TurboBoostPressure            603984
dtype: int64

In [49]:
# convert columns types
cols = ['AcceleratorPedal', 'BarometricPressure', 'CruiseControlSetSpeed', 'DistanceLtd', 
        'EngineCoolantTemperature', 'EngineLoad', 'EngineOilPressure', 'EngineOilTemperature', 
        'FuelTemperature',
        'IntakeManifoldTemperature','ServiceDistance', 'Speed', 'SwitchedBatteryVoltage', 
        'Throttle', 'TurboBoostPressure']

In [50]:
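# Remove commas from the first 21 columns
for col in Diagnostics.columns[:21]:
    Diagnostics[col] = Diagnostics[col].astype(str).str.replace(',', '')

# Convert all columns to numeric; anything that does not parse (including the
# 'True'/'False' strings in columns such as IgnStatus) is coerced to NaN
Diagnostics = Diagnostics.apply(pd.to_numeric, errors='coerce')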

Diagnostics

Name,FaultId,AcceleratorPedal,BarometricPressure,CruiseControlActive,CruiseControlSetSpeed,DistanceLtd,EngineCoolantTemperature,EngineLoad,EngineOilPressure,EngineOilTemperature,...,FuelTemperature,IgnStatus,IntakeManifoldTemperature,LampStatus,ParkingBrake,ServiceDistance,Speed,SwitchedBatteryVoltage,Throttle,TurboBoostPressure
0,1,0.0,14.2100,,66.48672,423178.70000,100.4,11.0,0.00,96.74375,...,,,78.8,1023,,,0.00000,3276.75,,0.00
1,2,,,,,,,,,,...,,,,1279,,,,,,
2,3,,,,,,,,,,...,,,,1279,,,,,,
3,4,,,,,,,,,,...,,,,1279,,,,,,
4,5,,,,,,,,,,...,,,,16639,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1187330,1248454,,,,,,,,,,...,,,,1023,,,,,,
1187331,1248455,100.0,14.5000,,64.62260,423937.90000,185.0,51.0,37.12,211.49370,...,32.0,,98.6,18431,,,65.01096,,73.2,7.83
1187332,1248456,0.0,14.3550,,66.48672,465925.40000,186.8,62.0,41.18,212.84380,...,,,91.4,17407,,,66.57410,,100.0,6.96
1187333,1248457,1.6,14.4275,,67.72946,28606.65625,181.4,0.0,27.26,221.73120,...,,,100.4,1023,,,11.84489,14.10,100.0,1.74
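
In the output above, the True/False columns such as IgnStatus come out as NaN, because `pd.to_numeric` cannot parse those strings. One possible way to keep them, sketched on made-up values, is to map the boolean strings to 1/0 before converting the rest. The `bool_cols` and `num_cols` lists here are assumptions for illustration, not the full column lists for this table.

```python
import pandas as pd

# toy wide table with string values, as they come out of the pivot
wide = pd.DataFrame({
    'DistanceLtd': ['423,178.7', '28606.65625', None],
    'IgnStatus': ['True', 'False', None],
})

bool_cols = ['IgnStatus']    # hypothetical list of True/False columns
num_cols = ['DistanceLtd']   # hypothetical list of numeric columns

# map boolean strings to 1/0; missing values stay missing
for col in bool_cols:
    wide[col] = wide[col].map({'True': 1, 'False': 0})

# strip commas, then convert; unparseable values become NaN
for col in num_cols:
    wide[col] = pd.to_numeric(wide[col].str.replace(',', '', regex=False), errors='coerce')

print(wide)
```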
