# Big G Express - Data Exploration

## Team: Elden Ring

<img src="https://eldenring.wiki.fextralife.com/file/Elden-Ring/mirel_pastor_of_vow.jpg" alt="PRAISE DOG" style="width:806px;height:600px;"/>

#### PRAISE THE DOG!

In [2]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sklearn.feature_selection import chi2
from sklearn.impute import SimpleImputer

In [3]:
faults = pd.read_csv('../data/J1939Faults.csv', low_memory=False, parse_dates=['EventTimeStamp', 'LocationTimeStamp']) #index_col='EventTimeStamp'
service_fault = pd.read_excel('../data/Service Fault Codes_1_0_0_167.xlsx')
vehicle_diagnostic = pd.read_csv('../data/VehicleDiagnosticOnboardData.csv')




A few key points from our questions to Josh Treet:
- throw out dates in 2011 and older; they come from an integer overflow mistake that took a few days to correct
- being able to predict a derate at any lead time is great (even just a few hours ahead)
- derates are going to be related to emissions conditions
- coolant level codes (and some others) can often flip between on and off
- a derate with the light continuing to be on is the same event (a pulse of it)
- spn + fmi together determine the fault code
- most trucks are fairly similar or the same (within about 4 years of each other)
- a mispredicted potential derate costs maybe about $500

## Exploratory Data Analysis

In [4]:
print(faults.shape)
print(service_fault.shape)
print(vehicle_diagnostic.shape)

(1187335, 20)
(7124, 14)
(12821626, 4)


Faults joins to vehicle_diagnostic on RecordID = FaultId.

Also, columns actionDescription and faultValue in the faults are unused.

`faults['actionDescription'].isna().sum()`

We also remove the 2169 rows whose EquipmentID has more than 5 characters.

In [5]:
faults = (
    faults.drop(['actionDescription', 'faultValue'], axis=1)
    [faults['EquipmentID'].str.len() <= 5]
)

Three service locations appear in the dataset, and the fault signals might be going on and off there. To eliminate those counts, we check whether the Latitude and Longitude of the truck are both within 0.01 degrees of a service location. At these latitudes 0.01 degrees is roughly 1.1 km (0.7 mi) north-south and 0.9 km (0.56 mi) east-west, so somewhat under a mile.

Doing so, we eliminate 131778 events.

In [6]:
for lat, lon in [(36.0666667, -86.4347222), (35.5883333, -86.4438888), (36.1950, -83.174722)]:
    
    faults = faults.loc[~((abs(lat - faults['Latitude']) <= 0.01) &
                          (abs(lon - faults['Longitude']) <= 0.01))]
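
As a rough check on how far 0.01 degrees reaches, the arithmetic can be sketched as below. The 111.32 km per degree of latitude figure is a common spherical-Earth approximation, not something taken from the dataset, and the variable names are only for illustration.

```python
import math

# Approximate length of one degree of latitude (assumption: spherical Earth).
KM_PER_DEGREE_LAT = 111.32
KM_PER_MILE = 1.609344

service_lat = 36.0666667  # latitude of the first service location in the filter
threshold_deg = 0.01      # threshold used in the filter

# North-south reach of the threshold.
km_north_south = threshold_deg * KM_PER_DEGREE_LAT
# East-west reach shrinks with the cosine of the latitude.
km_east_west = km_north_south * math.cos(math.radians(service_lat))

print(f"north-south: {km_north_south:.2f} km ({km_north_south / KM_PER_MILE:.2f} mi)")
print(f"east-west:   {km_east_west:.2f} km ({km_east_west / KM_PER_MILE:.2f} mi)")
```

So the box reaches somewhat less than a mile from each service location in every direction.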

We also filter out all erroneous dates from 2011 or earlier (394 rows), which were caused by the integer overflow error.

In [7]:
faults = faults.loc[faults['EventTimeStamp'].dt.year > 2011]

Finally, remove the rows where the 'active' column is False (506690 rows); those mark where an indicator was turned off.

So we end up with 546674 rows in faults.

In [8]:
faults = faults.loc[faults['active'] == True]

In [9]:
# save the filtered faults to use for ml
# faults.to_pickle('../data/faults_filtered.pkl')

## Modifying faults into rolling window

Next, combine the spn and fmi columns together in order to get them ready to one hot encode and use in the rolling window.

> note: need to order by event time stamp in order to use the rolling window later

In [10]:
faults_encoded = faults.copy()

# fault code = spn and fmi joined with an underscore, e.g. '1569_31'
faults_encoded['spn_fmi'] = faults_encoded['spn'].astype(str) + '_' + faults_encoded['fmi'].astype(str)

# one-hot encode the combined code; the spn_fmi column only exists on faults_encoded
faults_encoded = pd.get_dummies(faults_encoded, columns=['spn_fmi'], prefix='spn_fmi')

# sort by time so the rolling window can be applied later
faults_encoded = faults_encoded.sort_values(by='EventTimeStamp')

In [None]:
# to obtain the one hot encoded columns since there are so many
spnfmi_cols = [col for col in faults_encoded.columns if 'spn_fmi' in col]
fixed_cols = ['RecordID', 'spn', 'fmi']

In [None]:
# for some reason, the agg function with sum works without grouping by;
# but when added the groupby, it just keeps running without being able to complete

# d1 = dict.fromkeys(fixed_cols, lambda x: x[-1]) #this function gets the last value in group!
# d2 = dict.fromkeys(spnfmi_cols, 'sum')

# d = {**d1, **d2}

# faults_encoded.groupby('EquipmentID')[['EventTimeStamp'] + fixed_cols + spnfmi_cols].rolling(window = '1d', on = "EventTimeStamp").agg(d)

Using the groupby (for each truck) and rolling window on top of that:

In [None]:
faults_rolling = (
    faults_encoded
    .groupby('EquipmentID')[['EventTimeStamp'] + spnfmi_cols]
    .rolling(window = '1d', on = "EventTimeStamp")
    .sum()
)

faults_rolling = faults_rolling.reset_index()
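
To make the per-truck rolling count easier to picture, here is a minimal sketch on a made-up five-row frame. The rows and the `toy`, `code_cols` and `rolled` names are invented for illustration; `dtype=int` is passed to `get_dummies` so the dummy columns are plain integers regardless of pandas version.

```python
import pandas as pd

# Invented events: truck A has four faults, truck B has one.
toy = pd.DataFrame({
    'EquipmentID': ['A', 'A', 'A', 'A', 'B'],
    'EventTimeStamp': pd.to_datetime([
        '2020-01-01 00:00', '2020-01-01 12:00', '2020-01-01 18:00',
        '2020-01-03 00:00', '2020-01-01 03:00']),
    'spn': [111, 96, 111, 96, 111],
    'fmi': [17, 3, 17, 3, 17],
})

# Combine spn and fmi, then one-hot encode the combined code.
toy['spn_fmi'] = toy['spn'].astype(str) + '_' + toy['fmi'].astype(str)
toy = pd.get_dummies(toy, columns=['spn_fmi'], prefix='spn_fmi', dtype=int)

# Time-based rolling windows need the timestamps in order.
toy = toy.sort_values('EventTimeStamp')
code_cols = [c for c in toy.columns if c.startswith('spn_fmi_')]

# For each truck, count how often each code fired in the 24 hours up to each event.
rolled = (
    toy.groupby('EquipmentID')[['EventTimeStamp'] + code_cols]
       .rolling(window='1d', on='EventTimeStamp')
       .sum()
       .reset_index()
)
print(rolled)
```

Truck A's third event (18:00) sees two `111_17` faults in its window, while its last event two days later only sees itself.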

In [None]:
# bring back the spn and fmi information with a merge - this was an alternative since the agg made the kernel crash!
faults_rolling = pd.merge(faults_encoded[fixed_cols],
                          faults_rolling,
                          left_index= True,
                          right_on = 'level_1').drop(columns='level_1')

Randomly sampling 1000 rows of the past-24-hours windows, we see that the most common faults are (codes written as spn_fmi):
- 111_17, coolant level below normal, low severity
- 929_9, abnormal update? rate tire location
- 96_3, high voltage in fuel level
- 829_3, high voltage left fuel level
- 596_31, Condition Exists Cruise Control Enable Switch
- 111_18, Low engine coolant level detected, med severity
- 51923_0, ???
- 4096_0, High (Severity High) NOx limits exceeded due t....
- 97_15, High (Severity Low) Water In Fuel Indicator; Water has been detected in the fuel filter.
- 639_2, Incorrect Data J1939 Network #1; The ECM has a communication error.
- 629_12, ECM power supply errors / ECM error / ECM data lost
- 2863_7, Not Reporting Data Front Operator Wiper Switch
- 1068_2, Incorrect Data Brake Signal Sensor 2
- 50353_0, ???
- 1807_2, Incorrect Data Steering Wheel Angle
- 807_5, Low Current Dif 2 - ASR Valve
- 611_14, Special Instructions System Diagnostic Code #1
- 0_0, ???
- 4276_0, ???
- 412_0, High (Severity High) Engine Exhaust Gas Recirc...; The EGR temperature sensor indicates that the

For the codes marked ??? we couldn't find a description.

Now on to figuring out which SPN and FMI combinations might be useful for predicting a derate. The logic here is to randomly sample rows and compare the relative frequency of the codes present there with the relative frequency of the codes present in rows where a derate occurred.

In [None]:
sample_codes = (
    faults_rolling
        .sample(5000)
        .drop(columns=['RecordID','EventTimeStamp','EquipmentID','spn', 'fmi'])
        .sum()
)

# 928 rows have derate as current event
derate_codes = (
    faults_rolling
        .loc[faults_rolling['spn'] == 5246]
        .drop(columns=['RecordID','EventTimeStamp','EquipmentID','spn', 'fmi'])
        .sum()
)

code_differences = (derate_codes / derate_codes.sum()) - (sample_codes / sample_codes.sum())
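
A small worked example of this comparison, using invented counts (the numbers and the `sample_counts`, `derate_counts` and `diff` names are assumptions for illustration, not values from the data):

```python
import pandas as pd

# Invented code counts in a random sample of windows...
sample_counts = pd.Series({'spn_fmi_111_17': 60, 'spn_fmi_3362_31': 10, 'spn_fmi_96_3': 30})
# ...and in the windows that end on a derate.
derate_counts = pd.Series({'spn_fmi_111_17': 20, 'spn_fmi_3362_31': 60, 'spn_fmi_96_3': 20})

# Turn both into shares of their own total, then subtract.
diff = derate_counts / derate_counts.sum() - sample_counts / sample_counts.sum()

# Positive: over-represented before a derate; negative: under-represented.
print(diff.sort_values(ascending=False))
```

Here `spn_fmi_3362_31` makes up 60% of the derate windows but only 10% of the sample, so its difference is +0.5, while `spn_fmi_111_17` drops from 60% to 20% and gets -0.4.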

In [None]:
# note 5246 are all derates ... .sort_values(ascending=False).head(20)
code_differences = (
    code_differences
    .to_frame()
    .reset_index()
    .rename(columns={'index': 'spn_fmi', 0:'rel_frequency'})
)

# I did it kinda backwards; these need to be eliminated earlier
code_differences = code_differences.loc[~code_differences['spn_fmi'].str.contains('5246')]

In [None]:
code_differences.sort_values(by='rel_frequency', ascending=False).head(20)

In [None]:
# I'm not exactly sure how to interpret this
#chi2(faults_rolling.drop(columns=['EventTimeStamp','EquipmentID','spn', 'fmi']), faults_rolling['spn'])

The codes above with a positive relative frequency difference are the ones more likely to be associated with a derate, yet their distributions over the derate rows (described below) are very close to 0. Codes with a negative difference are more likely not to be associated with a derate.

In [None]:
faults_rolling.loc[(faults_rolling['spn'] == 5246)][['spn_fmi_1569_31',
    'spn_fmi_3362_31',
    'spn_fmi_4094_18',
    'spn_fmi_1761_19',
    'spn_fmi_1761_9',
    'spn_fmi_3364_9',
    'spn_fmi_5394_17',
    'spn_fmi_5394_5',
    'spn_fmi_6802_31',
    'spn_fmi_3031_9']].describe()

#.to_csv('../data/rolling_trucks.csv')

There are 1045 trucks in the dataset (1185166 rows): 498 have a partial derate, 211 have a total derate, and 182 have both.

The cells below find the trucks that have only a partial derate, only a total derate, both, or neither.

In [9]:
all_trucks = faults['EquipmentID'].unique()
partial_derate = faults.loc[(faults['spn'] == 1569) & (faults['fmi'] == 31)]['EquipmentID'].unique()
total_derate = faults.loc[faults['spn'] == 5246]['EquipmentID'].unique()

partial_derate_only = partial_derate[np.isin(partial_derate, total_derate, invert=True)]
total_derate_only = total_derate[np.isin(total_derate, partial_derate, invert=True)]
partial_and_total_derate = np.intersect1d(partial_derate, total_derate)
# trucks that show neither a partial nor a total derate
no_derate = all_trucks[np.isin(all_trucks, partial_derate, invert=True) & np.isin(all_trucks, total_derate, invert=True)]



In [10]:
print(len(partial_derate_only))
print(len(total_derate_only))
print(len(partial_and_total_derate))
print(len(no_derate))

330
28
161
523
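
One way to sanity-check a split like this is that the four groups should be disjoint and add up to the total number of trucks. A minimal sketch with invented truck IDs (all `toy_` names are hypothetical) using NumPy's set routines:

```python
import numpy as np

# Invented truck IDs for illustration.
toy_all = np.array(['T1', 'T2', 'T3', 'T4', 'T5'])
toy_partial = np.array(['T1', 'T2'])   # trucks with a partial derate
toy_total = np.array(['T2', 'T3'])     # trucks with a total derate

toy_partial_only = np.setdiff1d(toy_partial, toy_total)
toy_total_only = np.setdiff1d(toy_total, toy_partial)
toy_both = np.intersect1d(toy_partial, toy_total)
# Neither: remove every truck that shows any kind of derate.
toy_neither = np.setdiff1d(toy_all, np.union1d(toy_partial, toy_total))

group_sizes = [len(toy_partial_only), len(toy_total_only), len(toy_both), len(toy_neither)]
print(group_sizes, sum(group_sizes) == len(toy_all))
```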


## Creating the Target Variable

In order to train the models, we need to create a target variable. Essentially, for each row we look into the "future" and see whether a derate happens (let's say within a 6hr window).

To do that, I used a similar approach to the rolling window above. The major difference here is that I sorted the time series backwards (in order to "look at the future") and only used spn instead of spn_fmi, because full derates share the same spn but can have different fmi values.

In [61]:
faults_target_derate = faults.copy()
faults_target_75derate = faults.copy()

# this is not needed!
#faults_target_derate['spn_double'] = faults_target_derate['spn']
#faults_target_75derate['spn_double'] = faults_target_75derate['spn']

# this column lets us sort derates to the top when events happen at the same time;
# doing so captures an additional 48 rows (that happen at the same time as a derate)
# the same has to be done for the partial derate
faults_target_derate['dummy_derate'] = np.where(faults_target_derate['spn'] == 5246, 1, 0)
faults_target_75derate['dummy_derate'] = np.where(faults_target_75derate['spn'] == 1569, 1, 0)

faults_target_derate = pd.get_dummies(faults_target_derate, columns=['spn'], prefix='spn')
faults_target_75derate = pd.get_dummies(faults_target_75derate, columns=['spn'], prefix='spn')

# have to invert the time order here to look into the future!
faults_target_derate = faults_target_derate.sort_values(by=['EquipmentID','EventTimeStamp','dummy_derate'], ascending=[False, False, False])
faults_target_75derate = faults_target_75derate.sort_values(by=['EquipmentID','EventTimeStamp','dummy_derate'], ascending=[False, False, False])

In [62]:
# these are same for both dataframes
var_cols = ['EventTimeStamp'] + [col for col in faults_target_derate.columns if 'spn_' in col]

In [64]:
rolling_derate_future = (
    faults_target_derate
    .groupby('EquipmentID')[var_cols]
    .rolling(window = '6h', on = "EventTimeStamp")
    .sum()
)

rolling_derate_future = rolling_derate_future.reset_index()

In [63]:
rolling_75derate_future = (
    faults_target_75derate
    .groupby('EquipmentID')[var_cols]
    .rolling(window = '6h', on = "EventTimeStamp")
    .sum()
)

rolling_75derate_future = rolling_75derate_future.reset_index()

In [65]:
rolling_derate_future = pd.merge(faults_target_derate['RecordID'],
                          rolling_derate_future,
                          left_index= True,
                          right_on = 'level_1').drop(columns='level_1')

rolling_75derate_future = pd.merge(faults_target_75derate['RecordID'],
                          rolling_75derate_future,
                          left_index= True,
                          right_on = 'level_1').drop(columns='level_1')

In [66]:
# if we don't include the times when derate already happened & (faults_rolling_future['spn'] != 5246)
rolling_derate_future['target'] = np.where(rolling_derate_future['spn_5246'] > 0, 1, 0)
rolling_75derate_future['target'] = np.where(rolling_75derate_future['spn_1569'] > 0, 1, 0)

In [67]:
# this is just to keep it separate, just the recordID and the two possible targets of interest
y_derate = rolling_derate_future[['RecordID', 'target']]
y_75derate = rolling_75derate_future[['RecordID', 'target']]

Looking 6 hours into the future, there are 1389 rows with a full derate and 10864 rows with a partial derate.

In [68]:
print(y_derate['target'].sum())
print(y_75derate['target'].sum())

1389
10864
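
For comparison, the same "is there a derate in the next 6 hours?" question can also be asked with `pd.merge_asof` and `direction='forward'`. This is a different technique from the reversed rolling sum used above, shown here only as a sketch on invented rows; the `events`, `derates`, `next_derate` and `labelled` names are hypothetical.

```python
import pandas as pd

# Invented events for two trucks; spn 5246 marks a full derate.
events = pd.DataFrame({
    'EquipmentID': ['A', 'A', 'A', 'A', 'B', 'B'],
    'EventTimeStamp': pd.to_datetime([
        '2020-01-01 00:00', '2020-01-01 05:00', '2020-01-01 08:00',
        '2020-01-02 12:00', '2020-01-01 01:00', '2020-01-01 02:00']),
    'spn': [111, 96, 5246, 111, 111, 96],
})
# merge_asof requires both frames to be sorted by the "on" key.
events = events.sort_values('EventTimeStamp').reset_index(drop=True)

# Derate rows only, keeping a copy of the timestamp so it survives the merge.
derates = (
    events.loc[events['spn'] == 5246, ['EquipmentID', 'EventTimeStamp']]
          .assign(next_derate=lambda d: d['EventTimeStamp'])
)

# For every event, find the next derate of the same truck within 6 hours.
labelled = pd.merge_asof(
    events, derates,
    on='EventTimeStamp', by='EquipmentID',
    direction='forward', tolerance=pd.Timedelta('6h'),
)
labelled['target'] = labelled['next_derate'].notna().astype(int)
print(labelled[['EquipmentID', 'EventTimeStamp', 'spn', 'target']])
```

In this sketch the derate row itself is also labelled 1, because exact timestamp matches are allowed by default. Truck A's 00:00 event is 8 hours before the derate and therefore stays 0.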


In [78]:
# save the filtered faults to use for ml
# y_derate.to_pickle('../data/target_derate.pkl')
# y_75derate.to_pickle('../data/target_75derate.pkl')

### Alternative

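Ajay designed a function (below) that should return a similar output. As written, it labels the rows that fall inside the window before a full derate (spn 5246) with target 1 and the derate row itself with 0, and it appears to expect the rows sorted newest first.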

In [None]:
def GetFilteredSPNbyDays(df_faults, windowTimeframeUnit, day_window):
    # columns copied over from the original row; 'target' is the label we add
    copied_cols = ['RecordID', 'EquipmentID', 'EventTimeStamp', 'active', 'spn', 'fmi']
    df_new = pd.DataFrame(columns=copied_cols + ['target'])
    df_new = df_new.astype({'EquipmentID': 'int'})

    def add_row(index, row, target):
        # write the label first, then copy the original columns
        df_new.loc[index, 'target'] = target
        for col in copied_cols:
            df_new.loc[index, col] = row[col]

    dts_evt_max = None  # earliest timestamp still inside the window of the current derate
    hasDerate = False

    # loop through rows of original dataframe and assign new values to columns of new dataframe
    for index, row in df_faults.iterrows():
        if row['spn'] == 5246:
            # full derate: open a new window unless one is already open
            if not hasDerate:
                hasDerate = True
                dts_evt = row['EventTimeStamp']
                if windowTimeframeUnit == "hours":
                    dts_evt_max = dts_evt - timedelta(hours=day_window)
                elif windowTimeframeUnit == "days":
                    dts_evt_max = dts_evt - timedelta(days=day_window)
                add_row(index, row, 0)
        elif dts_evt_max is not None:
            if row['EventTimeStamp'] > dts_evt_max and hasDerate:
                # the row falls inside the window before the derate
                add_row(index, row, 1)
            else:
                # we left the window: reset and wait for the next derate
                hasDerate = False
                dts_evt_max = None
    return df_new
    

In [None]:
df_filtered = pd.DataFrame()
for chunk in pd.read_csv("../data/df_faults.csv", chunksize=5000, parse_dates=['EventTimeStamp']):
    dfW = GetFilteredSPNbyDays(chunk, "hours", 6)
    #print(dfW)
    dfW['active']=dfW['active'].astype(bool) #to avoid futuretype warning
    df_filtered = pd.concat([df_filtered, dfW])

## Vehicle Diagnostic

For vehicle diagnostic:
- Id - the record Id
- Name - the name of the diagnostic
- Value - the value for that diagnostic
- FaultId - foreign key to the QCJ1939Fault record

Work on the Diagnostic table done by Alison Cordoba

In [79]:
vehicle_diagnostic.head(10)

Unnamed: 0,Id,Name,Value,FaultId
0,1,IgnStatus,False,1
1,2,EngineOilPressure,0,1
2,3,EngineOilTemperature,96.74375,1
3,4,TurboBoostPressure,0,1
4,5,EngineLoad,11,1
5,6,AcceleratorPedal,0,1
6,7,IntakeManifoldTemperature,78.8,1
7,8,FuelRate,0,1
8,9,FuelLtd,12300.907429328,1
9,10,EngineRpm,0,1


In [80]:
# make a copy of DF to prevent accidental changes
Diagnostics = vehicle_diagnostic.copy()

In [81]:
Diagnostics = Diagnostics.pivot(index="FaultId", columns="Name", values="Value").reset_index()

In [82]:
# Remove commas from the first 21 columns
for col in Diagnostics.columns[:21]:
    Diagnostics[col] = Diagnostics[col].astype(str).str.replace(',', '')

# Convert all columns to numeric
Diagnostics = Diagnostics.apply(pd.to_numeric, errors='coerce')

Diagnostics

Name,FaultId,AcceleratorPedal,BarometricPressure,CruiseControlActive,CruiseControlSetSpeed,DistanceLtd,EngineCoolantTemperature,EngineLoad,EngineOilPressure,EngineOilTemperature,...,FuelTemperature,IgnStatus,IntakeManifoldTemperature,LampStatus,ParkingBrake,ServiceDistance,Speed,SwitchedBatteryVoltage,Throttle,TurboBoostPressure
0,1,0.0,14.2100,,66.48672,423178.70000,100.4,11.0,0.00,96.74375,...,,,78.8,1023,,,0.00000,3276.75,,0.00
1,2,,,,,,,,,,...,,,,1279,,,,,,
2,3,,,,,,,,,,...,,,,1279,,,,,,
3,4,,,,,,,,,,...,,,,1279,,,,,,
4,5,,,,,,,,,,...,,,,16639,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1187330,1248454,,,,,,,,,,...,,,,1023,,,,,,
1187331,1248455,100.0,14.5000,,64.62260,423937.90000,185.0,51.0,37.12,211.49370,...,32.0,,98.6,18431,,,65.01096,,73.2,7.83
1187332,1248456,0.0,14.3550,,66.48672,465925.40000,186.8,62.0,41.18,212.84380,...,,,91.4,17407,,,66.57410,,100.0,6.96
1187333,1248457,1.6,14.4275,,67.72946,28606.65625,181.4,0.0,27.26,221.73120,...,,,100.4,1023,,,11.84489,14.10,100.0,1.74
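
The long-to-wide reshaping and numeric clean-up can be pictured on a tiny invented table. The values below, including the comma used as a thousands separator, are assumptions made up for illustration, as are the `long_df`, `wide` and `value_cols` names.

```python
import pandas as pd

# Invented long-format diagnostics: one row per (fault, diagnostic name).
long_df = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Name': ['EngineLoad', 'DistanceLtd', 'EngineLoad', 'IgnStatus'],
    'Value': ['11', '423,178.7', '51', 'False'],
    'FaultId': [1, 1, 2, 2],
})

# One row per fault, one column per diagnostic name.
wide = long_df.pivot(index='FaultId', columns='Name', values='Value').reset_index()

# Strip commas, then coerce everything that is not a number to NaN.
value_cols = wide.columns.drop('FaultId')
wide[value_cols] = wide[value_cols].apply(
    lambda s: pd.to_numeric(s.astype(str).str.replace(',', '', regex=False), errors='coerce')
)
print(wide)
```

Diagnostics that a fault never reported, and non-numeric values such as `'False'`, both end up as NaN, which is why the wide table above is so sparse.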


In [83]:
Diagnostics = Diagnostics.drop(columns=['CruiseControlActive', 'IgnStatus', 'ParkingBrake'])

In [84]:
faults_improved = faults.merge(Diagnostics, left_on='RecordID', right_on='FaultId')
faults_improved

Unnamed: 0,RecordID,ESS_Id,EventTimeStamp,eventDescription,ecuSoftwareVersion,ecuSerialNumber,ecuModel,ecuMake,ecuSource,spn,...,FuelLtd,FuelRate,FuelTemperature,IntakeManifoldTemperature,LampStatus,ServiceDistance,Speed,SwitchedBatteryVoltage,Throttle,TurboBoostPressure
0,1,990349,2015-02-21 10:47:13,Low (Severity Low) Engine Coolant Level,unknown,unknown,unknown,unknown,0,111,...,12300.907429,0.000000,,78.8,1023,,0.000000,3276.75,,0.00
1,2,990360,2015-02-21 11:34:34,,unknown,unknown,unknown,unknown,11,629,...,,,,,1279,,,,,
2,4,990370,2015-02-21 11:35:33,Incorrect Data Steering Wheel Angle,unknown,unknown,unknown,unknown,11,1807,...,,,,,1279,,,,,
3,6,990431,2015-02-21 11:40:22,Low (Severity Low) Engine Coolant Level,04993120*00025921*082113134117*07700053*I0*BBZ*,79466580,6X1u10D1500000000,CMMNS,0,111,...,70349.809964,4.583399,,111.2,1023,,13.602200,3276.75,,6.67
4,7,990439,2015-02-21 11:40:52,Low (Severity Low) Engine Coolant Level,unknown,unknown,unknown,unknown,0,111,...,40961.065437,14.291750,,78.8,1023,,41.534780,3276.75,,20.59
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
546669,1248448,123899434,2020-03-06 13:12:43,High Voltage (Fuel Level),,,CECU3B-NAMUX4,PACCR,49,96,...,51466.131257,0.620806,,120.2,1279,,0.941766,,100.0,1.16
546670,1248452,123901805,2020-03-06 13:42:48,Low (Severity Medium) Engine Coolant Level,04358814*06030918*051718174436*09401683*G1*BDR*,79904453,6X1u13D1500000000,CMMNS,0,111,...,64491.926797,0.515137,,104.0,2047,,5.932153,,100.0,0.58
546671,1248455,123905139,2020-03-06 14:04:23,Condition Exists Engine Protection Torque Derate,04358814*06099720*030816202706*09400153*G1*BDR*,79932020,6X1u13D1500000000,CMMNS,0,1569,...,58979.184416,7.647805,32.0,98.6,18431,,65.010960,,73.2,7.83
546672,1248456,123905996,2020-03-06 14:13:38,Abnormal Rate of Change Aftertreatment 1 Intak...,05317106*05100987*050719120655*09401585*G1*BDR*,79880653,6X1u13D1500000000,CMMNS,0,3216,...,65080.105870,8.995086,,91.4,17407,,66.574100,,100.0,6.96


In [85]:
cols = ['activeTransitionCount','MCTNumber', 'AcceleratorPedal',
         'BarometricPressure', 'CruiseControlSetSpeed', 'DistanceLtd',
         'EngineCoolantTemperature', 'EngineLoad', 'EngineOilPressure', 
        'EngineOilTemperature', 'EngineRpm', 'EngineTimeLtd', 'FuelLevel', 'FuelLtd', 
        'FuelRate', 'FuelTemperature', 'IntakeManifoldTemperature', 'LampStatus', 
        'ServiceDistance', 'Speed', 'SwitchedBatteryVoltage', 'Throttle', 'TurboBoostPressure']


imputer = SimpleImputer(strategy='mean')


for column in cols:
    #print('current ', column)
    
    equipment_fixed = faults_improved.groupby('EquipmentID')[column].apply(lambda x: imputer.fit_transform(x.values.reshape(-1, 1)))

    for Id in equipment_fixed.index :
        #print('id', Id)
        
        # when a truck has no value at all in this column, the imputer drops the column and returns
        # an empty array, and assigning that back would crash the loop; those trucks are skipped here,
        # meaning we still have to impute values for them some other way
        if equipment_fixed.loc[Id].size > 0:
            faults_improved.loc[faults_improved['EquipmentID'] == Id, column] = equipment_fixed.loc[Id].flatten()

In [86]:
faults.columns

Index(['RecordID', 'ESS_Id', 'EventTimeStamp', 'eventDescription',
       'ecuSoftwareVersion', 'ecuSerialNumber', 'ecuModel', 'ecuMake',
       'ecuSource', 'spn', 'fmi', 'active', 'activeTransitionCount',
       'EquipmentID', 'MCTNumber', 'Latitude', 'Longitude',
       'LocationTimeStamp'],
      dtype='object')

In [89]:
# dropped the columns from saving because that reduces the file size from 200mb to 80mb
# faults_improved.drop(columns=['RecordID', 'ESS_Id', 'EventTimeStamp', 'eventDescription',
#        'ecuSoftwareVersion', 'ecuSerialNumber', 'ecuModel', 'ecuMake',
#        'ecuSource', 'spn', 'fmi', 'active', 'activeTransitionCount',
#        'EquipmentID', 'MCTNumber', 'Latitude', 'Longitude',
#        'LocationTimeStamp']).to_pickle('../data/diagnostics_imputed.pkl')

An alternative to the above, maybe faster ...

In [None]:
# Michael's function to deal with the empty lists that were causing trouble above...
imputer = SimpleImputer(strategy='mean')

def impute_values(x):
    imputer_results = imputer.fit_transform(x.values.reshape(-1,1))
    
    if len(imputer_results[0]) == 0:
        return np.array([np.nan] * len(x))
    return imputer_results

In [156]:
diagnostics_imputed = faults.merge(Diagnostics, left_on='RecordID', right_on='FaultId')

# sorting by EquipmentID is needed so that we can simply reassign the results back to the dataframe:
# once it is sorted, the grouping and apply return the rows in the same order as the dataframe
diagnostics_imputed = diagnostics_imputed.sort_values(by='EquipmentID')

In [168]:
cols = ['activeTransitionCount','MCTNumber', 'AcceleratorPedal',
         'BarometricPressure', 'CruiseControlSetSpeed', 'DistanceLtd',
         'EngineCoolantTemperature', 'EngineLoad', 'EngineOilPressure', 
        'EngineOilTemperature', 'EngineRpm', 'EngineTimeLtd', 'FuelLevel', 'FuelLtd', 
        'FuelRate', 'FuelTemperature', 'IntakeManifoldTemperature', 'LampStatus', 
        'ServiceDistance', 'Speed', 'SwitchedBatteryVoltage', 'Throttle', 'TurboBoostPressure']


for column in cols:
    # double explode!! can't simply assign back because the right side's EquipmentID is not a unique index
    diagnostics_imputed[column] = diagnostics_imputed.groupby('EquipmentID')[column].apply(lambda x: impute_values(x)).explode().explode().array

Doing it this way only takes 15.6s instead of 8 min!!!
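
Another option that avoids both the sorting and the double explode is to let pandas compute the per-truck mean and fill the gaps with it. This is a sketch of a different approach on invented data, not what was run above; the `toy` frame and its values are made up.

```python
import numpy as np
import pandas as pd

# Invented readings: truck C has no value at all for this column.
toy = pd.DataFrame({
    'EquipmentID': ['A', 'A', 'A', 'B', 'B', 'C'],
    'EngineLoad': [10.0, np.nan, 30.0, np.nan, 50.0, np.nan],
})

# transform('mean') returns the group mean aligned to the original rows,
# so the result can be used directly in fillna without reordering.
group_means = toy.groupby('EquipmentID')['EngineLoad'].transform('mean')
toy['EngineLoad_imputed'] = toy['EngineLoad'].fillna(group_means)
print(toy)
```

Trucks with no readings at all (truck C here) keep their NaN, just as in the cells above, so they would still need a fallback such as the overall column mean.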

In [169]:
diagnostics_imputed[cols].sort_index()

Unnamed: 0,activeTransitionCount,MCTNumber,AcceleratorPedal,BarometricPressure,CruiseControlSetSpeed,DistanceLtd,EngineCoolantTemperature,EngineLoad,EngineOilPressure,EngineOilTemperature,...,FuelLtd,FuelRate,FuelTemperature,IntakeManifoldTemperature,LampStatus,ServiceDistance,Speed,SwitchedBatteryVoltage,Throttle,TurboBoostPressure
0,2.0,105354361.0,0.0,14.21,66.48672,423178.7,100.4,11.0,0.0,96.74375,...,12300.907429,0.0,,78.8,1023.0,,0.0,3276.75,,0.0
1,127.0,105354361.0,32.440129,14.429612,65.284064,480916.417049,171.2,29.976821,28.270323,191.222193,...,21801.213896,4.18574,,108.378065,1279.0,,25.395354,3276.75,,5.8501
2,127.0,105336226.0,41.956863,14.310221,65.438916,513604.422772,183.966337,43.970588,28.681569,203.305754,...,81782.122302,7.218394,,106.805882,1279.0,,39.45926,3276.75,,9.8687
3,1.0,105438630.0,48.0,14.4275,64.6226,470381.4,181.4,30.0,38.28,196.5313,...,70349.809964,4.583399,,111.2,1023.0,,13.6022,3276.75,,6.67
4,2.0,105344243.0,82.8,14.2825,64.6226,278736.7,188.6,80.0,39.44,210.0313,...,40961.065437,14.29175,,78.8,1023.0,,41.53478,3276.75,,20.59
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
546669,126.0,105355619.0,0.0,14.645,66.48672,391932.6,181.4,11.0,22.62,197.6,...,51466.131257,0.620806,32.0,120.2,1279.0,,0.941766,,100.0,1.16
546670,93.0,105351219.0,0.0,14.355,66.48672,457529.7,181.4,11.0,19.72,207.2188,...,64491.926797,0.515137,32.0,104.0,2047.0,,5.932153,,100.0,0.58
546671,5.0,105354084.0,100.0,14.5,64.6226,423937.9,185.0,51.0,37.12,211.4937,...,58979.184416,7.647805,32.0,98.6,18431.0,,65.01096,,73.2,7.83
546672,1.0,105336308.0,0.0,14.355,66.48672,465925.4,186.8,62.0,41.18,212.8438,...,65080.10587,8.995086,32.0,91.4,17407.0,,66.5741,,100.0,6.96


In [99]:
faults_improved[cols]

Unnamed: 0,activeTransitionCount,MCTNumber,AcceleratorPedal,BarometricPressure,CruiseControlSetSpeed,DistanceLtd,EngineCoolantTemperature,EngineLoad,EngineOilPressure,EngineOilTemperature,...,FuelLtd,FuelRate,FuelTemperature,IntakeManifoldTemperature,LampStatus,ServiceDistance,Speed,SwitchedBatteryVoltage,Throttle,TurboBoostPressure
0,2,105354361,0.000000,14.210000,66.486720,423178.700000,100.400000,11.000000,0.000000,96.743750,...,12300.907429,0.000000,,78.800000,1023,,0.000000,3276.75,,0.0000
1,127,105354361,32.440129,14.429612,65.284064,480916.417049,171.200000,29.976821,28.270323,191.222193,...,21801.213896,4.185740,,108.378065,1279,,25.395354,3276.75,,5.8501
2,127,105336226,41.956863,14.310221,65.438916,513604.422772,183.966337,43.970588,28.681569,203.305754,...,81782.122302,7.218394,,106.805882,1279,,39.459260,3276.75,,9.8687
3,1,105438630,48.000000,14.427500,64.622600,470381.400000,181.400000,30.000000,38.280000,196.531300,...,70349.809964,4.583399,,111.200000,1023,,13.602200,3276.75,,6.6700
4,2,105344243,82.800000,14.282500,64.622600,278736.700000,188.600000,80.000000,39.440000,210.031300,...,40961.065437,14.291750,,78.800000,1023,,41.534780,3276.75,,20.5900
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
546669,126,105355619,0.000000,14.645000,66.486720,391932.600000,181.400000,11.000000,22.620000,197.600000,...,51466.131257,0.620806,32.0,120.200000,1279,,0.941766,,100.0,1.1600
546670,93,105351219,0.000000,14.355000,66.486720,457529.700000,181.400000,11.000000,19.720000,207.218800,...,64491.926797,0.515137,32.0,104.000000,2047,,5.932153,,100.0,0.5800
546671,5,105354084,100.000000,14.500000,64.622600,423937.900000,185.000000,51.000000,37.120000,211.493700,...,58979.184416,7.647805,32.0,98.600000,18431,,65.010960,,73.2,7.8300
546672,1,105336308,0.000000,14.355000,66.486720,465925.400000,186.800000,62.000000,41.180000,212.843800,...,65080.105870,8.995086,32.0,91.400000,17407,,66.574100,,100.0,6.9600
