# Big G Express - Data Exploration

## Team: Elden Ring

<img src="https://eldenring.wiki.fextralife.com/file/Elden-Ring/mirel_pastor_of_vow.jpg" alt="PRAISE DOG" style="width:806px;height:600px;"/>

#### PRAISE THE DOG!

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.ensemble import GradientBoostingClassifier

from imblearn.over_sampling import SMOTE

from sklearn.impute import SimpleImputer

from sklearn.metrics import roc_auc_score

from joblib import dump, load

In [2]:
faults = pd.read_pickle('../data/faults_filtered.pkl')
#y_derate = pd.read_pickle('../data/target_derate.pkl') # this one is the starting/base model, 6 hr
y_derate = pd.read_pickle('../data/target_derate3h.pkl')
#y_derate = pd.read_pickle('../data/target_derate12h.pkl')
#y_derate = pd.read_pickle('../data/target_derate24h.pkl')
# y_derate = pd.read_pickle('../data/target_derate1wk.pkl')
#y_derate = pd.read_pickle('../data/target_derate6h_noderaterow.pkl')
#y_75derate = pd.read_pickle('../data/target_75derate.pkl')
diagnostics_imputed = pd.read_pickle('../data/diagnostics_imputed.pkl')
#diagnostics_imputed = pd.read_pickle('../data/diagnostics_imputed_median.pkl')

In [3]:
# this one is mostly NaNs, just 250 values or so
diagnostics_imputed = diagnostics_imputed.drop(columns='ServiceDistance')

# and this drops columns that are not useful for predictions
faults = faults.drop(columns=['ESS_Id', 'active', 'eventDescription','ecuSoftwareVersion', 'ecuSerialNumber', 
    'ecuModel', 'ecuMake', 'ecuSource', 'MCTNumber', 'Latitude', 'Longitude', 'LocationTimeStamp'])

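Remember that some columns still have gaps: wherever a particular truck had no values at all for a column, the per-truck imputation had nothing to average and left NaNs behind.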

In [4]:
# this was just a simple fill with mean..
diagnostics_imputed['AcceleratorPedal'] = diagnostics_imputed['AcceleratorPedal'].fillna(value=diagnostics_imputed['AcceleratorPedal'].mean())
diagnostics_imputed['CruiseControlSetSpeed'] = diagnostics_imputed['CruiseControlSetSpeed'].fillna(value=diagnostics_imputed['CruiseControlSetSpeed'].mean())
diagnostics_imputed['EngineTimeLtd'] = diagnostics_imputed['EngineTimeLtd'].fillna(value=diagnostics_imputed['EngineTimeLtd'].mean())
diagnostics_imputed['FuelLevel'] = diagnostics_imputed['FuelLevel'].fillna(value=diagnostics_imputed['FuelLevel'].mean())
diagnostics_imputed['FuelTemperature'] = diagnostics_imputed['FuelTemperature'].fillna(value=diagnostics_imputed['FuelTemperature'].mean())
diagnostics_imputed['SwitchedBatteryVoltage'] = diagnostics_imputed['SwitchedBatteryVoltage'].fillna(value=diagnostics_imputed['SwitchedBatteryVoltage'].mean())
diagnostics_imputed['Throttle'] = diagnostics_imputed['Throttle'].fillna(value=diagnostics_imputed['Throttle'].mean())

#same but when using median - slightly worse than the mean
# diagnostics_imputed['AcceleratorPedal'] = diagnostics_imputed['AcceleratorPedal'].fillna(value=diagnostics_imputed['AcceleratorPedal'].median())
# diagnostics_imputed['CruiseControlSetSpeed'] = diagnostics_imputed['CruiseControlSetSpeed'].fillna(value=diagnostics_imputed['CruiseControlSetSpeed'].median())
# diagnostics_imputed['EngineTimeLtd'] = diagnostics_imputed['EngineTimeLtd'].fillna(value=diagnostics_imputed['EngineTimeLtd'].median())
# diagnostics_imputed['FuelLevel'] = diagnostics_imputed['FuelLevel'].fillna(value=diagnostics_imputed['FuelLevel'].median())
# diagnostics_imputed['FuelTemperature'] = diagnostics_imputed['FuelTemperature'].fillna(value=diagnostics_imputed['FuelTemperature'].median())
# diagnostics_imputed['SwitchedBatteryVoltage'] = diagnostics_imputed['SwitchedBatteryVoltage'].fillna(value=diagnostics_imputed['SwitchedBatteryVoltage'].median())
# diagnostics_imputed['Throttle'] = diagnostics_imputed['Throttle'].fillna(value=diagnostics_imputed['Throttle'].median())
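
The seven near-identical `fillna` lines above could be collapsed into a single call. This is only a sketch on a toy frame: the `toy` data and the shortened `cols_to_fill` list are made up for illustration, and in the notebook the list would hold the seven column names used above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for diagnostics_imputed (values are made up for illustration)
toy = pd.DataFrame({
    "FuelLevel": [50.0, np.nan, 70.0],
    "Throttle": [np.nan, 20.0, 40.0],
    "FaultId": [1, 2, 3],
})

# Hypothetical shortened list; the notebook fills seven columns
cols_to_fill = ["FuelLevel", "Throttle"]

# DataFrame.mean() returns one mean per column, and fillna accepts that
# Series, matching each mean to its column by name
toy[cols_to_fill] = toy[cols_to_fill].fillna(toy[cols_to_fill].mean())
print(toy)
```

Swapping `.mean()` for `.median()` gives the median variant without duplicating the block.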

In [5]:
# this took 30 min and didn't stop ...
# from sklearn.experimental import enable_iterative_imputer
# from sklearn.impute import IterativeImputer, KNNImputer
# scaler = StandardScaler().fit(diagnostics_imputed)

# knn_filled = scaler.inverse_transform(KNNImputer().fit_transform(scaler.transform(diagnostics_imputed)))

# diagnostics_imputed = IterativeImputer().fit_transform(diagnostics_imputed)

NOTE: during one of the training runs, a particular spn-fmi code rose to the top with an importance of 0.8! It was the 46262 code. On inspection, it appears only once in the dataset, but because many events happen in a small timeframe around it, the model picked it up as important.
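
One way to spot such codes before training is to count how often each spn-fmi combination occurs. The sketch below runs on invented data; `min_count` is a hypothetical threshold and not a value used anywhere in this notebook.

```python
import pandas as pd

# Toy fault log; codes and counts are invented for illustration
toy_faults = pd.DataFrame({
    "spn": [111, 111, 1569, 1569, 1569, 9999],
    "fmi": [17, 17, 31, 31, 31, 9],
})

# Same "spn_fmi" label format that is later used for the one-hot columns
toy_faults["spn_fmi"] = (
    toy_faults["spn"].astype(str) + "_" + toy_faults["fmi"].astype(str)
)

code_counts = toy_faults["spn_fmi"].value_counts()

# min_count is a hypothetical threshold, chosen only for this example
min_count = 2
rare_codes = code_counts[code_counts < min_count].index.tolist()
print(rare_codes)
```

Codes flagged this way could be inspected by hand, or their one-hot columns dropped, before trusting a high importance score.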

## Better Train-Test split

Initially I used just a regular train-test split on the rows. However, that leaves trucks whose events end up mixed between the train and test sets. Instead, we want to make sure that each individual truck appears in only one of them.

I also refined the process that I initially used and combined it into a function as below.

In [6]:
print(faults['EquipmentID'].nunique())
print(faults.loc[faults['spn'] == 5246]['EquipmentID'].nunique())

1042
189


First off, get the two lists of trucks that had (or not) a full derate.

In [7]:
all_trucks = faults['EquipmentID'].unique()
derate_trucks = faults.loc[faults['spn'] == 5246]['EquipmentID'].unique()
no_derate_trucks = all_trucks[np.isin(all_trucks, derate_trucks, invert=True)]

Secondly, put those lists together, marking whether a derate occurred (1) or not (0).

In [8]:
trucks_df = pd.concat([
            pd.DataFrame({'EquipmentID': derate_trucks, 'derate': 1}),
            pd.DataFrame({'EquipmentID': no_derate_trucks, 'derate': 0}) 
            ])

Lastly, use train_test_split on the trucks, with stratify keeping the proportion of derate trucks the same in both sets.

In [9]:
trucks_train, trucks_test = train_test_split(trucks_df, stratify=trucks_df['derate'], train_size = 0.8, test_size = 0.2, random_state = 42)

In [10]:
# this was just to verify that the proportions of trucks with and without derate in two samples are equal
# print(trucks_train['derate'].value_counts(normalize=True))
# print(trucks_test['derate'].value_counts(normalize=True))

# print(faults.loc[faults['EquipmentID'].isin(trucks_train['EquipmentID'])].shape[0])
# print(faults.loc[faults['EquipmentID'].isin(trucks_test['EquipmentID'])].shape[0])
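
The two properties this split is meant to guarantee (no truck on both sides, same derate share on both sides) can be checked directly. A minimal sketch on a made-up truck table, assuming the same `train_test_split` call pattern as above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy truck table: 10 trucks with a derate, 40 without (made-up IDs)
toy_trucks = pd.DataFrame({
    "EquipmentID": [f"T{i}" for i in range(50)],
    "derate": [1] * 10 + [0] * 40,
})

toy_train, toy_test = train_test_split(
    toy_trucks, stratify=toy_trucks["derate"], test_size=0.2, random_state=42
)

# No truck may appear on both sides of the split
overlap = set(toy_train["EquipmentID"]) & set(toy_test["EquipmentID"])
print(len(overlap))

# Stratification keeps the share of derate trucks the same in both parts
print(toy_train["derate"].mean(), toy_test["derate"].mean())
```

Because the split is done on trucks rather than on fault rows, the number of rows per set can still differ from an exact 80/20 ratio.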

Finally, use that information to split the diagnostics and targets.

In [11]:
# need to extract this because the train dataset only has RecordID
records_train = faults.loc[faults['EquipmentID'].isin(trucks_train['EquipmentID'])]['RecordID']
records_test = faults.loc[faults['EquipmentID'].isin(trucks_test['EquipmentID'])]['RecordID']

In [12]:
y_train = y_derate.loc[y_derate['RecordID'].isin(records_train)].sort_values('RecordID').drop(columns='RecordID')['target']
y_test = y_derate.loc[y_derate['RecordID'].isin(records_test)].sort_values('RecordID').drop(columns='RecordID')['target']

Now that the y_train and y_test are sorted, time to do the same for the X_train and X_test.

In [13]:
faults_diagnostics = faults.merge(diagnostics_imputed, left_on='RecordID', right_on='FaultId', how='inner').drop(columns='FaultId')

Next it depends on how these get prepared, so I'll build a function. It takes the faults + diagnostic con

In [14]:
def windowize_predictors(fulldetail_faults, time_window='1d', faults_agg='max', windowize_diagnostics = True, diagnostics_agg='mean'):

    # pull out the diagnostics table columns for later
    diagnostics_cols = [col for col in fulldetail_faults.columns if col not in ['RecordID', 'spn', 'fmi', 'EquipmentID']]

    # create a combined spn_fmi column to make dummies out of
    fulldetail_faults['spn_fmi'] = ['_'.join(i) for i in zip(fulldetail_faults['spn'].astype(str), fulldetail_faults['fmi'].astype(str))]

    # make dummies (one hot encode)
    fulldetail_faults = pd.get_dummies(fulldetail_faults, columns=['spn_fmi'], prefix='spn_fmi')

    # make sure the dataframe is in the right order to be able to later re-assign RecordID to it
    fulldetail_faults = fulldetail_faults.sort_values(by=['EquipmentID', 'EventTimeStamp'])

    # pull out all the Faults table columns (now one hot encoded)
    faults_cols = ['EventTimeStamp'] + [col for col in fulldetail_faults.columns if 'spn_fmi' in col] 

    # rolling window function over faults - by default just taking IF a code appears in a 24 hr past window
    faults_rolling = (
        fulldetail_faults
            .groupby('EquipmentID')[faults_cols]
            .rolling(window = time_window, on = "EventTimeStamp")
            .agg(faults_agg)
            .reset_index()
    )
    
    # by default I also decided to apply the same rolling window for the diagnostics part
    # (can be turned off by setting = False, it is quick to execute)
    if windowize_diagnostics:

        # rolling window over diagnostics, by default using mean
        diagnostics_rolling = (
            fulldetail_faults
                .groupby('EquipmentID')[diagnostics_cols]
                .rolling(window = time_window, on = "EventTimeStamp")
                .agg(diagnostics_agg)
                .reset_index()
        )

        # joining back the faults rw to the original dataframe to get the "RecordID" out
        faults_rolling = pd.merge(fulldetail_faults[['RecordID', 'spn']],
                            faults_rolling,
                            left_index= True,
                            right_on = 'level_1').drop(columns='level_1')
        
        ###### ONLY uncomment this next line IF the derate rows are not tagged
        # faults_rolling = faults_rolling.loc[faults_rolling['spn'] != 5246]

        # joining back the diagnostics rw to the original dataframe to get the "RecordID" out
        diagnostics_rolling = pd.merge(fulldetail_faults[['RecordID', 'spn']],
                                diagnostics_rolling,
                                left_index= True,
                                right_on = 'level_1').drop(columns='level_1')
        
        ####### ONLY uncomment this next line IF the derate rows are not tagged
        # diagnostics_rolling = diagnostics_rolling.loc[diagnostics_rolling['spn'] != 5246]
        
        # joining the two rolling windows
        faults_diagnostics_rolling =  pd.merge(
            diagnostics_rolling.drop(columns=['EquipmentID', 'EventTimeStamp', 'spn']),
            faults_rolling.drop(columns=['EquipmentID', 'EventTimeStamp', 'spn']),
            on = 'RecordID'
        )

    # this gets used if we only want to take into account the current diagnostics
    # (essentially, NO rolling window for diagnostics)
    else :

        # simply get back 'RecordID' and other diagnostic columns
        faults_diagnostics_rolling = pd.merge(
            fulldetail_faults[['RecordID', 'spn'] + diagnostics_cols].drop(columns=['EventTimeStamp']),
            faults_rolling.drop(columns=['EquipmentID', 'EventTimeStamp']),
            left_index= True,
            right_on = 'level_1').drop(columns='level_1')
        
        ####### ONLY uncomment this next line IF the derate rows are not tagged
        # faults_diagnostics_rolling = faults_diagnostics_rolling.loc[faults_diagnostics_rolling['spn'] != 5246]
        
        faults_diagnostics_rolling = faults_diagnostics_rolling.drop(columns='spn')
        
    predictor_train = (
        faults_diagnostics_rolling
        .loc[faults_diagnostics_rolling['RecordID']
             .isin(records_train)]
        .sort_values('RecordID')
        .drop(columns='RecordID')
    )
    predictor_test = (
        faults_diagnostics_rolling
        .loc[faults_diagnostics_rolling['RecordID']
             .isin(records_test)]
        .sort_values('RecordID')
        .drop(columns='RecordID')
    )

    return predictor_train, predictor_test

In [15]:
X_train, X_test = windowize_predictors(faults_diagnostics, time_window='7d', faults_agg='max', windowize_diagnostics=True, diagnostics_agg='mean')
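
The core step inside `windowize_predictors` is a time-based rolling window per truck. The toy example below (made-up events, a single one-hot column) is only meant to show what that step produces, not to reproduce the full function.

```python
import pandas as pd

# Toy events for two trucks; column names mirror the notebook, values are made up
toy = pd.DataFrame({
    "EquipmentID": ["A", "A", "A", "B"],
    "EventTimeStamp": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 12:00",
        "2020-01-03 00:00", "2020-01-01 06:00",
    ]),
    "spn_fmi_111_17": [1, 0, 0, 0],
}).sort_values(["EquipmentID", "EventTimeStamp"])

# For each row: did the code appear for this truck in the past 24 hours?
rolled = (
    toy
    .groupby("EquipmentID")[["EventTimeStamp", "spn_fmi_111_17"]]
    .rolling(window="1d", on="EventTimeStamp")
    .max()
    .reset_index()
)
print(rolled)
```

Truck A's second event still "sees" the code from 12 hours earlier, while its third event, two days later, does not. Windows never cross from one truck into another because of the `groupby`.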

In [16]:
gbr = Pipeline(
    steps = [
        ('gb', GradientBoostingClassifier(verbose=True)) #, n_estimators =350, learning_rate=0.03
    ]
)

In [17]:
#gbr.fit(X_train, y_train)

In [18]:
# print('confusion matrix')
# print(confusion_matrix(y_train, gbr.predict(X_train)))
# print('\n')
# print('classification report')
# print(classification_report(y_train, gbr.predict(X_train)))
# print('\n')

# importances = pd.DataFrame({
#     'variable': gbr.feature_names_in_,
#     'importance': gbr['gb'].feature_importances_
# })

# print('Variable Importances:')
# display(importances.sort_values('importance', ascending = False).head(20))

# print('------ TEST')
# print(confusion_matrix(y_test, gbr.predict(X_test)))
# print('ROC AUC Score')
# print(roc_auc_score(y_true=y_test, y_score=gbr.predict_proba(X_test)[:,1]))

In [19]:
oversampler = SMOTE(k_neighbors=5, random_state=42)

X_smote, y_smote = oversampler.fit_resample(X_train, y_train)

In [41]:
# this is going to re-fit from scratch, unless we set warm_start=True
# also, simply add this line to all X_ variables if you want to exclude 5246 influencing the model:
# .drop(columns=[col for col in X_smote.columns if 'spn_fmi_5246' in col])
# for partial derates, this one is much worse
# .drop(columns=[col for col in X_smote.columns if 'spn_fmi_1569' in col])
gbr.fit(X_smote.drop(columns=[col for col in X_smote.columns if 'spn_fmi_5246' in col]), y_smote)

      Iter       Train Loss   Remaining Time 
         1           1.2303           19.92m
         2           1.1026           19.59m
         3           0.9966           19.41m
         4           0.9074           19.01m
         5           0.8317           18.65m
         6           0.7720           18.47m
         7           0.7148           18.57m
         8           0.6696           18.26m
         9           0.6246           17.92m
        10           0.5903           17.69m
        20           0.3474           16.68m
        30           0.2600           14.78m
        40           0.2188           12.79m
        50           0.1963           10.70m
        60           0.1805            8.60m
        70           0.1681            6.46m
        80           0.1589            4.32m
        90           0.1501            2.16m
       100           0.1421            0.00s


In [42]:
print('confusion matrix')
print(confusion_matrix(y_train, gbr.predict(X_train.drop(columns=[col for col in X_smote.columns if 'spn_fmi_5246' in col]))))
print('\n')
print('classification report')
print(classification_report(y_train, gbr.predict(X_train.drop(columns=[col for col in X_smote.columns if 'spn_fmi_5246' in col]))))
print('\n')

importances = pd.DataFrame({
    'variable': gbr.feature_names_in_,
    'importance': gbr['gb'].feature_importances_
})

print('Variable Importances:')
display(importances.sort_values('importance', ascending = False).head(20))

print('------ TEST')
print('confusion matrix')
print(confusion_matrix(y_test, gbr.predict(X_test.drop(columns=[col for col in X_smote.columns if 'spn_fmi_5246' in col]))))
print('classification report')
print(classification_report(y_test, gbr.predict(X_test.drop(columns=[col for col in X_smote.columns if 'spn_fmi_5246' in col]))))
print('ROC AUC Score')
print(roc_auc_score(y_true=y_test, y_score=gbr.predict_proba(X_test.drop(columns=[col for col in X_smote.columns if 'spn_fmi_5246' in col]))[:,1]))

confusion matrix
[[428651  13269]
 [    21    860]]


classification report
              precision    recall  f1-score   support

           0       1.00      0.97      0.98    441920
           1       0.06      0.98      0.11       881

    accuracy                           0.97    442801
   macro avg       0.53      0.97      0.55    442801
weighted avg       1.00      0.97      0.98    442801



Variable Importances:


Unnamed: 0,variable,importance
15,LampStatus,0.397541
13,FuelTemperature,0.315414
163,spn_fmi_1569_31,0.116364
20,activeTransitionCount,0.048253
2,CruiseControlSetSpeed,0.020625
90,spn_fmi_111_17,0.019034
389,spn_fmi_3362_31,0.00958
636,spn_fmi_5394_5,0.009374
219,spn_fmi_1787_11,0.00533
392,spn_fmi_3363_3,0.004143


------ TEST
confusion matrix
[[100231   3439]
 [    28    175]]
classification report
              precision    recall  f1-score   support

           0       1.00      0.97      0.98    103670
           1       0.05      0.86      0.09       203

    accuracy                           0.97    103873
   macro avg       0.52      0.91      0.54    103873
weighted avg       1.00      0.97      0.98    103873

ROC AUC Score
0.9884836595468474
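
To make the gap between the high ROC AUC and the low precision concrete, the class-1 numbers in the test report can be recomputed by hand from the test confusion matrix printed above. The variable names are just for this worked example.

```python
# Test-set confusion matrix as printed above:
# rows are actual classes, columns are predicted classes
tn, fp = 100231, 3439
fn, tp = 28, 175

precision = tp / (tp + fp)            # share of raised alarms that were real
recall = tp / (tp + fn)               # share of real positives that were caught
false_positive_rate = fp / (fp + tn)  # share of negatives that raised an alarm

print(f"precision={precision:.3f} recall={recall:.3f} fpr={false_positive_rate:.3f}")
```

Only about 3% of the negative rows are flagged, but because negatives outnumber positives by roughly 500 to 1, those few percent still swamp the 175 true positives.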


In [43]:
# to load model
# gbr = load('../models/gbr_model_1.joblib') 

# to save model
#dump(gbr, '../models/gbr_model_28.joblib') 

['../models/gbr_model_28.joblib']

Besides saving the models, I will construct a json file that describes how they were obtained.

In [44]:
import json

In [46]:
to_dump = {
    'file_path' : '../models/gbr_model_28.joblib',
    'targets' : 'any row where a derate (5246) happens in the next 3 hours',
    'diagnostics_file' : 'used imputer to average data per truck and then simple mean to average any remaining nulls',
    'train_test_split' : 'using trucks and assuring same ratio of derate and nonderate',
    'windowize_predictors': {'dataframe': 'merged faults and diagnostics',
                             'how far in the past to aggregate' : '7 days',
                             'how to aggregate the one-hot encoded spn_fmi': 'max (default)',
                             'use rolling window on diagnostics?' : 'True ',
                             'how to aggregate diagnostics data' : 'max'},
    'pipeline' : {'step 1': 'GradientBoostingClassifier (default values)'},
    'rebalancing' : {'over or under fitting': 'used SMOTE(k_neighbors=5, random_state=42)',
                     'variables used': 'eliminated 5246 columns (derates), by using the .drop on X_train and X_test'}

}

tmp_matrix = confusion_matrix(y_train, gbr.predict(X_train.drop(columns=[col for col in X_smote.columns if 'spn_fmi_5246' in col])))

to_dump['train_confusion_matrix'] = {'TN': int(tmp_matrix[0][0]),
                                     'FP': int(tmp_matrix[0][1]),
                                     'FN': int(tmp_matrix[1][0]),
                                     'TP': int(tmp_matrix[1][1])}

tmp_matrix = confusion_matrix(y_test, gbr.predict(X_test.drop(columns=[col for col in X_smote.columns if 'spn_fmi_5246' in col])))

to_dump['test_confusion_matrix'] = {'TN': int(tmp_matrix[0][0]),
                                    'FP': int(tmp_matrix[0][1]),
                                    'FN': int(tmp_matrix[1][0]),
                                    'TP': int(tmp_matrix[1][1])}


to_dump['test_rocaouc_score'] = roc_auc_score(y_true=y_test, y_score=gbr.predict_proba(X_test.drop(columns=[col for col in X_smote.columns if 'spn_fmi_5246' in col]))[:,1])

importances = pd.DataFrame({
    'variable': gbr.feature_names_in_,
    'importance': gbr['gb'].feature_importances_
})

importances = importances.sort_values('importance', ascending = False).head(20)

tmp_dict={}

for index, row in importances.iterrows():
    tmp_dict[row["variable"]] = row['importance']

to_dump['top20_fature_importances'] = tmp_dict


json_object = json.dumps(to_dump, indent=4)

In [47]:
# with open('../models/gbr_model_28.json', 'w') as outfile:
#     outfile.write(json_object)

## Looking into the most promising model

First few steps are straightforward as outlined below:

Michael's suggestion was to check whether any of the false positives are actually followed by a derate within 24 hours. The cell below adds that information to the dataframe from above.

In [48]:
# load the model that I want to use to look at the predictions
gbr_best = load('../models/gbr_model_28.joblib')

# get the target values 
# note: the model might have been trained on a different time window, but if it correctly predicts derates further down the road, that is perfectly fine
# that is why, using Michael's suggestion, always compare models to the 24 hr derate window
y_derate = pd.read_pickle('../data/target_derate24h.pkl')

y_comparison = y_derate.loc[y_derate['RecordID'].isin(records_test)].sort_values('RecordID')

# predict y values based on the model, excluding the 5246 (derate) columns as in training
y_pred = gbr_best.predict(X_test.drop(columns=[col for col in X_smote.columns if 'spn_fmi_5246' in col]))

# put all of it together in a dataframe
y_comparison['predicted'] = y_pred

# merge it back to get the complete faults info
test_results = pd.merge(faults, y_comparison, on='RecordID', how='inner')

# flag the rows where the derate occurred
test_results['dummy_derate'] = np.where(test_results['spn'] == 5246, 1, 0)

# sort test_results in the right order since that's what's needed for rolling windows
# note that the dummy_derate now needs to be last in case of a tie (as opposed to when we were looking in the future)
test_results = test_results.sort_values(by=['EquipmentID','EventTimeStamp','dummy_derate'], ascending=[False, True, True])

> IMPORTANT note: initially I did not consider the fact that there will be some codes following a derate event whose predictions should be ignored (likely truck being worked on etc).

There are 208 rows in the test dataset that follow within 1 day after a derate.

In [49]:
after_derate = (
    test_results
        .groupby('EquipmentID')[['EventTimeStamp', 'dummy_derate']]
        .rolling(window = '7d', on = "EventTimeStamp")
        .sum()
        .reset_index()
)

In [50]:
test_results = pd.merge(test_results.drop(columns=['dummy_derate']),
        after_derate.drop(columns=['EquipmentID', 'EventTimeStamp']),
        left_index= True,
        right_on = 'level_1').drop(columns='level_1')

In [51]:
test_results = test_results.loc[(test_results['dummy_derate'] == 0.) | (test_results['spn'] == 5246)]

Confusion matrix for model 5:
- "TN": 100535
- "FP": 3099
- "FN": 37
- "TP": 202

Confusion matrix for model 13:
- "TN": 99699
- "FP": 3726
- "FN": 81
- "TP": 367

Confusion matrix for model 15:
- "TN": 99585
- "FP": 3840
- "FN": 78
- "TP": 370

Confusion matrix for model 26:
- "TN": 100410
- "FP": 3224
- "FN": 30
- "TP": 209

In [52]:
# select the false positives
false_positive = test_results.loc[(test_results['target'] == 0) & (test_results['predicted'] == 1)]

The logic here is that if I use the rolling window again, I can sum up the "predicted" values. Any sum greater than 1 indicates a repeated prediction, i.e. one that falls within 24 hours of an earlier prediction and is therefore not a separate prediction of the model.

On top of that, I realized that some "predicted" derates were happening AFTER a derate occurred, within the next hour or two. Those also shouldn't be counted as false predictions, since the truck is likely being worked on.

To get the unique false predictions, we count the rows where the rolling sum of 'predicted' is exactly 1.

**results**:
- model 5: 819 out of 3099 are unique false positives
- model 13: 644 out of 3726 are unique false positives
- model 15: 609 out of 3840 are unique false positives
- model 26: 496 out of 3224 are unique false positives
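
The counting logic described above can be seen on a toy set of alarms for a single truck (made-up timestamps): a rolling 1-day sum of exactly 1 marks an alarm with no other alarm in the preceding 24 hours.

```python
import pandas as pd

# Four alarms for one truck: three clustered on the same day, one days later
alarms = pd.DataFrame({
    "EquipmentID": ["A"] * 4,
    "EventTimeStamp": pd.to_datetime([
        "2020-01-01 08:00", "2020-01-01 09:00",
        "2020-01-01 20:00", "2020-01-05 08:00",
    ]),
    "predicted": [1, 1, 1, 1],
})

rolled = (
    alarms
    .groupby("EquipmentID")[["EventTimeStamp", "predicted"]]
    .rolling(window="1d", on="EventTimeStamp")
    .sum()
)

# A sum of exactly 1 means this alarm is the first one in its 24-hour window
unique_alarms = rolled.loc[rolled["predicted"] == 1.0]
print(len(unique_alarms))
```

Here the three clustered alarms collapse into one, so four raw alarms count as two unique ones.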

In [53]:
# first off, need to see how many of these false positives occur in the next day or so after the derate
# because the truck will likely be repaired and/or worked on, so the conditions for the derate might still be there for the model to pick up on

false_positive = (
    false_positive
    .groupby('EquipmentID')[['EventTimeStamp', 'predicted']]
    .rolling(window = '1d', on = "EventTimeStamp")
    .sum()
)

In [54]:
false_positive.loc[false_positive['predicted'] == 1.]

Unnamed: 0_level_0,Unnamed: 1_level_0,EventTimeStamp,predicted
EquipmentID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1350,653,2015-04-05 22:00:30,1.0
1366,1140,2015-06-10 01:45:22,1.0
1366,1175,2015-07-01 12:38:58,1.0
1366,1302,2015-10-26 14:33:56,1.0
1366,1303,2015-10-28 15:14:30,1.0
...,...,...,...
2015,98086,2020-02-12 00:06:58,1.0
2019,98117,2017-06-06 09:59:45,1.0
2045,98431,2017-04-04 09:46:14,1.0
309,103417,2018-03-12 12:33:31,1.0


Similar approach to get the false negatives, except now we invert target and predicted.

**results**:
- model 5: 11 out of 37 are unique false negatives
- model 13: 24 out of 81 are unique false negatives
- model 15: 24 out of 78 are unique false negatives

In [55]:
# select the false negatives
false_negative = test_results.loc[(test_results['target'] == 1) & (test_results['predicted'] == 0)]


In [56]:
false_negative = (
    false_negative
    .groupby('EquipmentID')[['EventTimeStamp', 'target']]
    .rolling(window = '1d', on = "EventTimeStamp")
    .sum()
)

len(false_negative.loc[false_negative['target'] == 1.])

29

Finally, looking at the true positives

**results**: 
- model 5: 67 out of 202 are unique true positives, of which 21 are predicted at least 2 hours in advance
- model 13: 68 out of 367 are unique true positives, of which 41 are predicted at least 2 hours in advance
- model 15: 68 out of 367 are unique true positives, of which 43 are predicted at least 2 hours in advance
- model 26: 68 out of 209

In [57]:
# select the true positive
true_positive = test_results.loc[(test_results['target'] == 1) & (test_results['predicted'] == 1)]


In [58]:
true_positive = (
    true_positive
    .groupby('EquipmentID')[['EventTimeStamp', 'RecordID', 'predicted', 'target']]
    .rolling(window = '1d', on = "EventTimeStamp")
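    # .iloc[-1] takes the last row of each window by position (x[-1] can be treated as a label lookup)
    .agg({'RecordID': lambda x: x.iloc[-1], 'predicted': 'sum', 'target': 'sum'})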
    .reset_index()
)

true_positive['RecordID'] = true_positive['RecordID'].astype(int)

true_positive = true_positive.loc[true_positive['predicted'] == 1.]

> NOTE: do not use iterrows to modify the dataframe that is being iterated over!! The results are not guaranteed.

In [59]:
true_positive

Unnamed: 0,EquipmentID,EventTimeStamp,RecordID,predicted,target
0,1329,2015-02-25 13:53:08,5715,1.0,1.0
2,1339,2015-06-12 08:24:15,85259,1.0,1.0
4,1366,2015-06-11 10:08:58,84237,1.0,1.0
14,1366,2015-07-03 15:10:45,109732,1.0,1.0
16,1366,2015-09-23 07:25:22,214277,1.0,1.0
...,...,...,...,...,...
254,1922,2019-07-07 11:13:03,1176722,1.0,1.0
265,1928,2018-08-03 12:34:33,1042659,1.0,1.0
266,1970,2019-04-28 17:50:36,1153464,1.0,1.0
273,2004,2019-07-03 07:08:25,1176070,1.0,1.0


In [60]:
derate_times = []

# find the timestamp of next actual derate that happens
for index, row in true_positive.iterrows():
    derate_times.append(
        faults.loc[(faults['EquipmentID'] == str(row['EquipmentID']))
                   & (faults['spn'] == 5246) 
                   & (faults['EventTimeStamp'] >= row['EventTimeStamp'])]
                   .iloc[0]['EventTimeStamp']
    )

# save that in the dataframe
true_positive['derateTimeStamp'] = derate_times

# measure how soon the prediction happened before the derate
true_positive['timediff'] = true_positive['derateTimeStamp'] - true_positive['EventTimeStamp']
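
The `iterrows` loop above could also be written without a Python loop using `pd.merge_asof`. This is a sketch of that alternative on made-up data, not what the notebook ran; the `alarms` and `derates` frames are hypothetical stand-ins for `true_positive` and the 5246 rows of `faults`.

```python
import pandas as pd

# Toy stand-ins (made-up values) for the first alarms and the derate events
alarms = pd.DataFrame({
    "EquipmentID": ["A", "B"],
    "EventTimeStamp": pd.to_datetime(["2020-01-01 08:00", "2020-01-02 10:00"]),
})
derates = pd.DataFrame({
    "EquipmentID": ["A", "A", "B"],
    "derateTimeStamp": pd.to_datetime(
        ["2020-01-01 11:30", "2020-02-01 00:00", "2020-01-02 10:45"]
    ),
})

# merge_asof needs both frames sorted by their time keys
matched = pd.merge_asof(
    alarms.sort_values("EventTimeStamp"),
    derates.sort_values("derateTimeStamp"),
    left_on="EventTimeStamp",
    right_on="derateTimeStamp",
    by="EquipmentID",          # only match events of the same truck
    direction="forward",       # first derate at or after the alarm
)

# how long before the derate the alarm was raised
matched["timediff"] = matched["derateTimeStamp"] - matched["EventTimeStamp"]
print(matched)
print((matched["timediff"] > pd.Timedelta(hours=2)).sum())
```

Unlike `.iloc[0]` in the loop, an alarm with no later derate does not raise an error here; it simply gets `NaT` in `derateTimeStamp`.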

In [61]:
len(true_positive.loc[true_positive['timediff'] > timedelta(hours= 2)])

43

Exploring some more.

In [None]:
# test_results.loc[(test_results['target'] == 0) & (test_results['predicted'] == 1)]['activeTransitionCount'].mean()
# test_results.loc[test_results['target']]['activeTransitionCount'].mean()
# faults['activeTransitionCount'].mean()