# Big G Express - Data Exploration

## Team: Elden Ring

<img src="https://eldenring.wiki.fextralife.com/file/Elden-Ring/mirel_pastor_of_vow.jpg" alt="PRAISE DOG" style="width:806px;height:600px;"/>

#### PRAISE THE DOG!

In [75]:
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.feature_selection import chi2

In [2]:
faults = pd.read_csv('../data/J1939Faults.csv', low_memory=False, parse_dates=['EventTimeStamp', 'LocationTimeStamp']) #index_col='EventTimeStamp'
service_fault = pd.read_excel('../data/Service Fault Codes_1_0_0_167.xlsx')
vehicle_diagnostic = pd.read_csv('../data/VehicleDiagnosticOnboardData.csv')




A few key points from our questions to Josh Treet:
- throw out the 2001 dates (the super old ones); they come from an integer overflow mistake that took a few days to correct
- any amount of advance warning before a derate is great
- derates are going to be related to emissions conditions
- coolant level codes (and some others) can often flip between on and off
- a derate with the light staying on is the same event (a pulse of it)
- spn + fmi together determine the fault code
- most trucks are fairly similar or the same (within about 4 years of each other)
- a mispredicted potential derate costs maybe about $500
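
Since a derate that keeps pulsing counts as one event, one possible way to collapse the repeats is to start a new event only when enough time has passed since the truck's previous derate row. This is a minimal sketch on toy data; the one-hour `gap` and the `event_id` column are assumptions for illustration, not values from Josh or from the dataset.

```python
import pandas as pd

# Toy derate rows; the timestamps and truck IDs are made up for illustration.
derates = pd.DataFrame({
    "EquipmentID": ["A", "A", "A", "B"],
    "EventTimeStamp": pd.to_datetime([
        "2015-01-01 00:00", "2015-01-01 00:10",
        "2015-01-03 00:00", "2015-01-01 00:05",
    ]),
}).sort_values(["EquipmentID", "EventTimeStamp"])

gap = pd.Timedelta(hours=1)  # hypothetical threshold, would need tuning

# Time since the previous derate row of the same truck (NaT for the first row).
since_prev = derates.groupby("EquipmentID")["EventTimeStamp"].diff()

# A row opens a new event if it is the truck's first row or follows a long gap.
derates["new_event"] = (since_prev.isna() | (since_prev > gap)).astype(int)

# Running count of event starts per truck gives an event number for each row.
derates["event_id"] = derates.groupby("EquipmentID")["new_event"].cumsum()
```

Rows sharing the same `EquipmentID` and `event_id` would then be treated as one derate.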

## Exploratory Data Analysis

In [3]:
print(faults.shape)
print(service_fault.shape)
print(vehicle_diagnostic.shape)

(1187335, 20)
(7124, 14)
(12821626, 4)


Faults joins to vehicle_diagnostic with RecordID = FaultID
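
As a rough illustration of that join, the sketch below uses two tiny made-up frames. It assumes the key columns are named `RecordID` and `FaultId`, and that the diagnostics are stored in long form (one `Name`/`Value` pair per row), so they are pivoted to one row per fault before merging. The diagnostic names and values are invented.

```python
import pandas as pd

# Toy stand-ins for faults and vehicle_diagnostic (values are invented).
faults_demo = pd.DataFrame({
    "RecordID": [1, 2, 3],
    "spn": [111, 96, 5246],
    "fmi": [17, 3, 0],
})
diag_demo = pd.DataFrame({
    "Id": [10, 11, 12],
    "Name": ["Speed", "EngineLoad", "Speed"],
    "Value": ["55", "40", "0"],
    "FaultId": [1, 1, 3],
})

# Long to wide: one row per fault, one column per diagnostic name.
diag_wide = diag_demo.pivot(index="FaultId", columns="Name", values="Value")

# Left join keeps faults that have no diagnostics (their columns become NaN).
joined = faults_demo.merge(diag_wide, left_on="RecordID", right_index=True, how="left")
```

A left join is used here so that faults without any diagnostic record are not silently dropped.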

Also, the columns actionDescription and faultValue in faults are unused. The number of missing values can be checked with:

`faults['actionDescription'].isna().sum()`

We also remove the 2169 rows whose EquipmentID has more than 5 characters.

In [4]:
faults = (
    faults.drop(['actionDescription', 'faultValue'], axis=1)
    [faults['EquipmentID'].str.len() <= 5]
)

There are three service locations that appear in the dataset. The fault signals might be going on and off there. To eliminate those counts, we check whether the Latitude and Longitude coordinates of the truck are within 0.01 degrees (in both the Lat and Long directions) of a service location. At these latitudes, 0.01 degrees is roughly 0.7 miles north-south and a bit under 0.6 miles east-west, so the filter is a box of about a mile on each side around every service location.

Doing so, we eliminate 131778 events.
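
To get a feel for how big the 0.01 degree box is, here is a small back-of-the-envelope sketch. It assumes about 69 miles per degree of latitude and scales longitude by the cosine of the latitude; the helper name `box_half_width_miles` is made up for this example.

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # approximate, assumed constant over the area


def box_half_width_miles(lat_deg, delta_deg=0.01):
    """Return the (north-south, east-west) size in miles of delta_deg at lat_deg."""
    north_south = delta_deg * MILES_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


# First service location from the filter cell.
ns, ew = box_half_width_miles(36.0666667)
print(round(ns, 2), round(ew, 2))
```

If a true radius is wanted instead of a box, a haversine distance would be the usual alternative.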

In [5]:
for lat, lon in [(36.0666667, -86.4347222), (35.5883333, -86.4438888), (36.1950, -83.174722)]:
    
    faults = faults.loc[~((abs(lat - faults['Latitude']) <= 0.01) &
                          (abs(lon - faults['Longitude']) <= 0.01))]

Also filter out all rows with erroneous years, 2011 or earlier (394 rows), caused by an integer error.

In [6]:
faults = faults.loc[faults['EventTimeStamp'].dt.year > 2011]

Finally, remove the rows where the 'active' column is False; those represent an indicator being turned off (506690 rows).

So we end up with 546674 rows in faults.

In [12]:
faults = faults.loc[faults['active'] == True]

Next, combine the spn and fmi columns together in order to get them ready to one hot encode and use in the rolling window.

In [14]:
faults['spn_fmi'] = ['_'.join(i) for i in zip(faults['spn'].astype(str), faults['fmi'].astype(str))]

In [15]:
faults_encoded = pd.get_dummies(faults, columns=['spn_fmi'], prefix='spn_fmi')

In [16]:
# ordering by time in order to have the rolling window later
faults_encoded = faults_encoded.sort_values(by='EventTimeStamp')

In [17]:
# to obtain the one hot encoded columns since there are so many
spnfmi_cols = [col for col in faults_encoded.columns if 'spn_fmi' in col]
fixed_cols = ['spn', 'fmi']

In [None]:
# for some reason, the agg function with sum works without grouping by;
# but when added the groupby, it just keeps running without being able to complete

# d1 = dict.fromkeys(fixed_cols, lambda x: x[-1]) #this function gets the last value in group!
# d2 = dict.fromkeys(spnfmi_cols, 'sum')

# d = {**d1, **d2}

# faults_encoded.groupby('EquipmentID')[['EventTimeStamp'] + fixed_cols + spnfmi_cols].rolling(window = '1d', on = "EventTimeStamp").agg(d)

In [18]:
# note that you can use groupby and then rolling window!
faults_rolling = faults_encoded.groupby('EquipmentID')[['EventTimeStamp'] + spnfmi_cols].rolling(window = '1d', on = "EventTimeStamp").sum()

faults_rolling = faults_rolling.reset_index()

In [19]:
# to bring back the spn and fmi information - this was an alternative since the agg version made the kernel crash!
faults_rolling = pd.merge(faults_encoded[fixed_cols], faults_rolling, left_index= True, right_on = 'level_1').drop(columns='level_1')
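
To make the output of the rolling step easier to picture, here is a minimal sketch of the same groupby + rolling pattern on a toy frame. The truck IDs, timestamps and codes are invented, and the window is written as `'1D'` (uppercase) on the assumption that this spelling is accepted across pandas versions.

```python
import pandas as pd

# Toy events for two trucks (all values invented for illustration).
events = pd.DataFrame({
    "EquipmentID": ["A", "A", "A", "B"],
    "EventTimeStamp": pd.to_datetime([
        "2015-01-01 00:00", "2015-01-01 01:00",
        "2015-01-02 06:00", "2015-01-01 00:30",
    ]),
    "spn_fmi": ["111_17", "96_3", "111_17", "96_3"],
})

# One-hot encode the combined code, then order by time for the rolling window.
encoded = pd.get_dummies(
    events, columns=["spn_fmi"], prefix="spn_fmi", dtype=int
).sort_values("EventTimeStamp")
code_cols = [c for c in encoded.columns if c.startswith("spn_fmi_")]

# Per truck, count how often each code fired in the 24 hours up to each event.
rolled = (
    encoded.groupby("EquipmentID")[["EventTimeStamp"] + code_cols]
    .rolling(window="1D", on="EventTimeStamp")
    .sum()
    .reset_index()
)
```

Each row of `rolled` holds the counts for the 24 hours ending at that event, and the `level_1` column created by `reset_index` carries the original row index, which is what the merge above relies on.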

Randomly sampling 1000 rows of the past-24-hours counts, we see that the most common faults are:
- 111_17, coolant level below normal, low severity
- 929_9, abnormal update rate, tire location
- 96_3, high voltage in fuel level
- 829_3, high voltage left fuel level
- 596_31, Condition Exists Cruise Control Enable Switch
- 111_18, Low engine coolant level detected, med severity
- 51923_0, description not found
- 4096_0, High (Severity High) NOx limits exceeded due t....
- 97_15, High (Severity Low) Water In Fuel Indicator; Water has been detected in the fuel filter.
- 639_2, Incorrect Data J1939 Network #1; The ECM has a communication error.
- 629_12, ECM power supply errors / ECM error / ECM data lost
- 2863_7, Not Reporting Data Front Operator Wiper Switch
- 1068_2, Incorrect Data Brake Signal Sensor 2
- 50353_0, description not found
- 1807_2, Incorrect Data Steering Wheel Angle
- 807_5, Low Current Dif 2 - ASR Valve
- 611_14, Special Instructions System Diagnostic Code #1
- 0_0, description not found
- 4276_0, description not found
- 412_0, High (Severity High) Engine Exhaust Gas Recirc...; The EGR temperature sensor indicates that the



Some events we couldn't find the description for.

Now on to figuring out which SPN and FMI codes might be useful for predicting a derate. The logic here is to randomly sample rows and compare the relative frequency of the codes present there with the relative frequency of the codes present in the rows where a derate occurred.

In [110]:
sample_codes = (
    faults_rolling
        .sample(5000)
        .drop(columns=['EventTimeStamp','EquipmentID','spn', 'fmi'])
        .sum()
)

# 928 rows have derate as current event
derate_codes = (
    faults_rolling
        .loc[faults_rolling['spn'] == 5246]
        .drop(columns=['EventTimeStamp','EquipmentID','spn', 'fmi'])
        .sum()
)

code_differences = (derate_codes / derate_codes.sum()) - (sample_codes / sample_codes.sum())

In [111]:
# note 5246 are all derates ... .sort_values(ascending=False).head(20)
code_differences = (
    code_differences
    .to_frame()
    .reset_index()
    .rename(columns={'index': 'spn_fmi', 0:'rel_frequency'})
)

# I did it kinda backwards; these need to be eliminated earlier
code_differences = code_differences.loc[~code_differences['spn_fmi'].str.contains('5246')]

In [113]:
code_differences.sort_values(by='rel_frequency', ascending=False)

| index | spn_fmi | rel_frequency |
|---|---|---|
| 142 | spn_fmi_1569_31 | 0.152800 |
| 368 | spn_fmi_3362_31 | 0.033332 |
| 446 | spn_fmi_4094_18 | 0.031990 |
| 193 | spn_fmi_1761_19 | 0.020596 |
| 196 | spn_fmi_1761_9 | 0.019046 |
| ... | ... | ... |
| 449 | spn_fmi_4096_0 | -0.089148 |
| 69 | spn_fmi_111_17 | -0.098143 |
| 948 | spn_fmi_929_9 | -0.100353 |
| 921 | spn_fmi_829_3 | -0.115545 |


In [87]:
# I'm not exactly sure how to interpret this
#chi2(faults_rolling.drop(columns=['EventTimeStamp','EquipmentID','spn', 'fmi']), faults_rolling['spn'])

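The end results are these (positive numbers mean the code was relatively more frequent in windows ending in a derate; negative numbers mean the opposite, i.e. it was more frequent in randomly sampled windows). The exact values differ slightly from the table above because the sample is random.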
- spn_fmi_1569_31    0.153992
- spn_fmi_3362_31    0.033548
- spn_fmi_4094_18    0.032013
- spn_fmi_1761_19    0.020426
- spn_fmi_1761_9     0.018891
- spn_fmi_3364_9     0.016721
- spn_fmi_5394_17    0.016165
- spn_fmi_5394_5     0.012213
- spn_fmi_6802_31    0.011312
- spn_fmi_3031_9     0.010798

- spn_fmi_0_0       -0.010679
- spn_fmi_596_31    -0.015601
- spn_fmi_4276_0    -0.021850
- spn_fmi_51923_0   -0.072363
- spn_fmi_4096_0    -0.074337
- spn_fmi_111_17    -0.099292
- spn_fmi_929_9     -0.103831
- spn_fmi_829_3     -0.125463
- spn_fmi_96_3      -0.163479
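
To make the sign convention concrete, here is a toy sketch of the same calculation with two made-up codes. The code names and counts are invented; only the formula mirrors the `code_differences` cell.

```python
import pandas as pd

# Hypothetical 24h code counts summed over derate rows and over sampled rows.
derate_counts = pd.Series({"spn_fmi_a": 6, "spn_fmi_b": 4})
sample_counts = pd.Series({"spn_fmi_a": 2, "spn_fmi_b": 8})

# Share of each code among derate windows minus its share among sampled windows.
diff = derate_counts / derate_counts.sum() - sample_counts / sample_counts.sum()
print(diff)
```

Here `spn_fmi_a` makes up 60% of the codes before derates but only 20% of the sample, so it gets +0.4, while `spn_fmi_b` gets -0.4. Because both sides are shares that sum to 1, the differences always sum to zero, so a very common code can look negative simply because it is common everywhere.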

For the columns in faults:
- ESS_Id – the event subscriber service event that contained the fault
- EventTimeStamp – when the event took place
- eventDescription – brief text of meaning of the code (not always present)
- actionDescription – never seen this filled in
- ecuSoftwareVersion – version string from the reporting vehicle computer system
- ecuSerialNumber – Serial number of the reporting Engine Control Module (ECM)
- ecuModel – Model of the reporting ECM
- ecuMake – Manufacturer of the reporting ECM
- ecuSource –
- spn – Fault code being reported
- fmi – Failure Mode associated with the Fault Code
- active – whether the code is being set or being removed
- activeTransitionCount – Number of times code has been set/unset
- faultValue – never seen used
- EquipmentID – Assigned truck number of the unit in question - 1122 different trucks
- MCTNumber – Communications Terminal assigned to the truck
- Latitude – Latitude at time of event
- Longitude – Longitude at time of event
- LocationTimeStamp – Time latitude and longitude were obtained

There are 1045 trucks in the dataset, 1185166 rows; 498 have a partial derate, 211 a total derate, and 182 have both.

Below we find the trucks that have only a partial derate, only a total derate, both, or neither.

In [71]:
all_trucks = faults['EquipmentID'].unique()
partial_derate = faults.loc[(faults['spn'] == 1569) & (faults['fmi'] == 31)]['EquipmentID'].unique()
total_derate = faults.loc[faults['spn'] == 5246]['EquipmentID'].unique()

partial_derate_only = partial_derate[np.isin(partial_derate, total_derate, invert=True)]
total_derate_only = total_derate[np.isin(total_derate, partial_derate, invert=True)]
partial_and_total_derate = np.intersect1d(partial_derate, total_derate)
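# a truck has no derate only if it is in neither list; the earlier version OR-ed the
# two "*_only" masks, which is True for every truck and returned all 1042 of them
no_derate = all_trucks[np.isin(all_trucks, partial_derate, invert=True) & np.isin(all_trucks, total_derate, invert=True)]



In [72]:
print(len(partial_derate_only))
print(len(total_derate_only))
print(len(partial_and_total_derate))
print(len(no_derate))

330
28
161
523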
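
One way to sanity-check a split like this is that the four groups should partition the fleet, so their sizes must add up to the number of trucks. The sketch below shows the idea with NumPy set routines on made-up truck IDs; the variable names are for illustration only.

```python
import numpy as np

# Toy truck IDs (invented): trucks 1-2 had partial derates, 2-3 total derates.
all_trucks = np.array(["1", "2", "3", "4", "5"])
partial = np.array(["1", "2"])
total = np.array(["2", "3"])

partial_only = np.setdiff1d(partial, total)   # partial but never total
total_only = np.setdiff1d(total, partial)     # total but never partial
both = np.intersect1d(partial, total)         # appear in both lists
neither = np.setdiff1d(all_trucks, np.union1d(partial, total))

# The four groups are disjoint and should cover every truck exactly once.
group_sizes = [len(partial_only), len(total_only), len(both), len(neither)]
assert sum(group_sizes) == len(all_trucks)
```

If the final assertion fails on the real data, one of the masks is overlapping or missing trucks.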


In [73]:
faults.loc[faults['spn'] == 5246]['fmi'].value_counts()

0     323
16     78
15     54
19     32
14      4
Name: fmi, dtype: int64

## Vehicle Diagnostic

For vehicle diagnostic:
- Id -  the record Id
- Name – the name of the diagnostic
- Value – the value for that diagnostic
- FaultId – foreign key to the QCJ1939Fault record

In [None]:
vehicle_diagnostic.head(10)

In [None]:
service_fault.head()