# CNN with Keras by TensorFlow

# 1.0 Dependencies and Notes

This notebook was built with the libraries imported below and the following versions:

Pandas 2.2.3 <br>
Numpy 2.0.2 <br>
Altair 5.4.1 <br>
sklearn 1.5.0 <br>
Keras 3.6.0 <br>

Different versions of these libraries may affect the functionality of this notebook.

The purpose of this notebook is to create a convolutional neural network (CNN) to predict the remaining useful life (RUL) of jet engines using data provided by NASA. The notebook includes definitions to build the model, fit it, and then explore and store the results. 

Results are stored via a CSV file. There is a function for looping through different parameters, and other functions for viewing, exploring, and saving the results. 

In [None]:
import pandas as pd
import numpy as np
import altair as alt
import sklearn
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from numpy import array, hstack
import pickle
import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv1D, MaxPooling1D
from keras import Input

In [2]:
print(pd.__version__)
print(np.__version__)
print(alt.__version__)
print(sklearn.__version__)
print(keras.__version__)

2.2.3
2.0.2
5.4.1
1.5.0
3.6.0


## 1.1 Load and define train and test data. 

Input data is batched and processed per the 'data_cleaning.ipynb' and 'data_processing.ipynb' notebooks. 

In [3]:
#link = 'processed_data_pickle_files_no_smoothing/'
link = '../data/batched_data_pickle_files/'

with open(link + 'processed_test_data.pkl', 'rb') as file:
    test = pickle.load(file)
    
with open(link + 'processed_train_data.pkl', 'rb') as file:
    train = pickle.load(file)    
    
with open(link + 'processed_train_targets.pkl', 'rb') as file:
    y_train = pickle.load(file)
    
with open(link + 'true_rul.pkl', 'rb') as file:
    y_test = pickle.load(file)

In [4]:
display(train.shape)
display(y_train.shape)
display(test.shape)
display(y_test.shape)

(17731, 30, 14)

(17731,)

(100, 30, 14)

(100,)

In [5]:
#Initialize so definitions don't throw an error. Will be redefined in subsequent section. 
default_params = {}
loops = {}

# 2.0 Definitions

In [6]:
def make_cnn(params, print_summary = False):
    
    """
    Initializes a convolutional neural network using the TensorFlow Keras library. 
    Parameters are entered as a dictionary. 
    
    Inputs:
    params: dictionary of input parameters, must include 'sec_conv', '#filters_conv1', 'filter_size_conv1', 
            'act_func_conv1', 'input_1', 'input_2', 'dense_neurons', 'act_func_dense'; must also include
            '#filters_conv2', 'filter_size_conv2', and 'act_func_conv2' when 'sec_conv' is True. 
    print_summary: boolean that dictates whether to print a summary of the constructed model 
            (provides the number of parameters in each layer, etc.)
    """
    
    
    #A sequential model can have different NN layers added. Use a list to define layers, or use add method.
    #Sequential models only work for one input tensor and one output tensor (including for each layer).
    model = Sequential()
    
    #Input(shape = (params['input_1'], params["input_2"]))
    
    #Adds a 1-dimensional convolution layer (only moves in one direction: down each 'sample' of timeseries).
    model.add(Conv1D(filters = params["#filters_conv1"], #dimension of output space (number of filters)
                     kernel_size = params["filter_size_conv1"], #size of the convolution window
                     strides = 1, #convolution layer stride length
                     padding = 'valid', #zero padding at ends of convolutions, options: 'valid' (no padding), 'same'
                     activation=params["act_func_conv1"], #activation function to derive non-linear relationships; default is None.
                     input_shape=(params['input_1'], params["input_2"]) #steps followed by number of features (same order as input)
                    )) 
    
    #Adds a second 1D convolution filter if user requests.
    if params["sec_conv"] == True:
        model.add(Conv1D(filters = params["#filters_conv2"], #dimension of output space (number of filters)
                     kernel_size = params["filter_size_conv2"], #size of the convolution window
                     activation=params["act_func_conv2"], #activation function to derive non-linear relationships; default is None.
                        )) 
    
    #Adds a pooling layer to reduce dimensions/computation time - Max or Average is available. 
    #Downsizes the input from the Conv1D layer, so structure is kept but some info is lost.
    model.add(MaxPooling1D(pool_size = 2)) #Number of consecutive time steps pooled together.
    
    #Adds a flattening layer, which makes the 3D data into a 1D array so it's compatible with Dense Layers.
    model.add(Flatten())
    
    #Adds a "dense" layer, which is a regular NN layer. This interprets the output of the previous layers.
    model.add(Dense(params['dense_neurons'], #First parameter is the output space dimensionality.
                    activation = params['act_func_dense'])) 
    
    #Adds a second dense layer to reduce the neurons of the previous dense layer to a single output (the RUL estimate).
    #This is done separately for each sample.
    model.add(Dense(1))
    
    #This method configures the model for training. It's where you choose parameters.
    #Other parameters available: loss_weights, weighted_metrics, etc. 
    model.compile(optimizer = 'adam', loss = 'mse', metrics = ["mean_squared_error", 
                                                               "root_mean_squared_error", 
                                                               "mean_absolute_error", 
                                                               "mean_absolute_percentage_error", 
                                                               "r2_score"])
    
    if print_summary == True:
        model.summary()
    
    return model
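
The layer sizes that `make_cnn` produces can be checked by hand. The sketch below is plain Python rather than Keras. It assumes 'valid' padding with stride 1 and the final parameter values used later in this notebook (30 time steps, 14 features, 64 filters of size 3, then 384 filters of size 5, a pool size of 2, and 50 dense neurons). The helper names are made up for illustration.

```python
def conv1d_out_len(steps, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' (no) padding."""
    return (steps - kernel_size) // stride + 1

def conv1d_params(in_channels, kernel_size, filters):
    """Trainable weights of a Conv1D layer: kernel weights plus one bias per filter."""
    return (kernel_size * in_channels + 1) * filters

steps, features = 30, 14

len1 = conv1d_out_len(steps, 3)   # after the first convolution
len2 = conv1d_out_len(len1, 5)    # after the second convolution
pooled = len2 // 2                # max pooling with pool_size = 2 halves the length
flat = pooled * 384               # Flatten: time steps x filters

total_params = (
    conv1d_params(features, 3, 64)   # first Conv1D
    + conv1d_params(64, 5, 384)      # second Conv1D
    + (flat + 1) * 50                # Dense(50): weights + biases
    + (50 + 1) * 1                   # Dense(1): weights + bias
)

print(len1, len2, pooled, flat, total_params)
```

Under these assumptions most of the trainable parameters sit in the first dense layer, because the flattened input to it is large.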

In [7]:
def cnn_run_for_log(model, train_data, train_target, test_data, test_target, params):
    
    """ 
    Trains an input model on the input train data, then collects various scoring metrics of both the 
    train and test data. The input parameters dictionary is then concatenated with the metrics to provide 
    a dictionary of both the metrics and input parameters used. 
    
    Inputs:
    
    model: CNN model from the function above ('make_cnn') or a CNN defined via other means
    
    Various data inputs: Train and Test, plus targets
    
    params: dictionary of parameters. Must include 'perform_validation' and '#_epochs'.
    
    Output: dictionary of train and test metrics, along with the parameters included in the 'params' input. 
    
    """
    
    if params['perform_validation'] == True:
        v_s = 0.2
    else:
        v_s = None
    
    history = model.fit(train_data, 
                      train_target, 
                      validation_split = v_s, 
                      epochs = params["#_epochs"], 
                      verbose = 0
                     )
    logger = {}
    
    y_hat = model.predict(test_data, verbose = 0)
    
    logger["train_MSE"] = history.history["mean_squared_error"][-1]
    logger["test_MSE"] = mean_squared_error(test_target, y_hat).item()
    
    logger["train_RMSE"] = history.history["root_mean_squared_error"][-1]
    logger["test_RMSE"] = np.sqrt(mean_squared_error(test_target, y_hat)).item()
    
    logger["train_MAE"] = history.history["mean_absolute_error"][-1]
    logger["test_MAE"] = mean_absolute_error(test_target, y_hat).item()
    
    logger["train_MAPE"] = history.history["mean_absolute_percentage_error"][-1]
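    #Flatten predictions to 1D so the subtraction is element-wise, and divide by the targets passed in (not the global y_test).
    logger["test_MAPE"] = np.mean(np.abs((test_target - y_hat.ravel()) / test_target)).item() * 100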
    
    logger["train_R2"] = history.history["r2_score"][-1]
    logger["test_R2"] = r2_score(test_target, y_hat)
    
    logger.update(params)
    
    return logger
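
One detail worth checking when computing test metrics by hand: for a single-output model, `model.predict` returns a 2D array of shape `(n, 1)`, while the targets in this notebook are 1D with shape `(n,)`. The NumPy sketch below uses made-up numbers (not notebook data) to show what happens when the two are subtracted directly, and how flattening the predictions first gives the intended element-wise errors.

```python
import numpy as np

# Made-up values for illustration only.
y_true = np.array([10.0, 20.0, 40.0])          # shape (3,), like the test targets
y_pred = np.array([[12.0], [18.0], [30.0]])    # shape (3, 1), like model.predict output

# Without flattening, broadcasting produces every pairwise difference.
pairwise = y_true - y_pred
print(pairwise.shape)

# Flattening the predictions gives one error per sample.
errors = y_true - y_pred.ravel()
mape = np.mean(np.abs(errors / y_true)) * 100
print(errors.shape, mape)
```

The scikit-learn metric functions used above accept either shape, so this only matters for metrics written directly in NumPy.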
    
    

In [8]:
def add_to_logger(new_instance, existing_dict):
    
    """
    Takes in a new result from the function 'cnn_run_for_log', which returns the performance metrics 
    for a CNN using the listed input parameters. It then adds that result to a dictionary of lists,
    where each index in the list represents a new run of 'cnn_run_for_log'. This is intended to be used
    in running loops during parameter tuning to keep track of which parameters perform the best. 
    
    Input:
    new_instance: a dictionary of the most recent parameters and performance metrics
    existing_dict: a dictionary of lists that keep a record of performance metrics and the parameters that
            led to those results
    
    Output:
    An updated record of parameters/performance metrics in which the most recent parameters are added to the record
    
    """
    
    if existing_dict is None:
        record = {}
        for key in new_instance.keys():
            record[key] = []
    else:
        record = existing_dict.copy()
        
    for key in record.keys():
        l = record[key]
        if key in new_instance.keys():
            l.append(new_instance[key])     
        else:
            l.append(np.nan)
        record[key] = l
        
    return record

In [9]:
def loop_through_parameters(loops, 
                            parameters, 
                            train = train, 
                            y_train = y_train, 
                            test = test, 
                            y_test = y_test):
    """
    Runs parameter loops for model and records the resulting metrics. Returns a dictionary that is 
    a log of the results, where each key represents a parameter or performance metric, and each item is
    a list where the indexes represent the runs in chronological order. 
    
    Inputs:
    loops: a dictionary of the keys and values to loop through (one to four keys are supported).
    parameters: a dictionary with all the parameters necessary to build the CNN, train it, and acquire the 
    results using the functions defined previously.
    
    Output:
    A dictionary of lists with input parameters and resultant performance metrics. Can easily be used
    to construct a dataframe. 
    """
    
    record = None
    keys = list(loops.keys())
    
    def nested_function(p1 = None, p2 = None, p3 = None, p4 = None):
        parameters[keys[0]] = p1
        if len(keys) > 1:
            parameters[keys[1]] = p2
        if len(keys) > 2:
            parameters[keys[2]] = p3
        if len(keys) > 3:
            parameters[keys[3]] = p4
            
        #print(parameters)
        model = make_cnn(parameters)
        new_instance = cnn_run_for_log(model, train, y_train, test, y_test, parameters)
        r = add_to_logger(new_instance, record)
        return r
        
    if len(loops.keys()) == 1:
        for p_1 in loops[keys[0]]:
            record = nested_function(p1 = p_1)
            
    elif len(loops.keys()) == 2:
        for p_1 in loops[keys[0]]:
            for p_2 in loops[keys[1]]:
                record = nested_function(p1 = p_1, p2 = p_2)
                
    elif len(loops.keys()) == 3:
        for p_1 in loops[keys[0]]:
            for p_2 in loops[keys[1]]:
                for p_3 in loops[keys[2]]:
                    record = nested_function(p1 = p_1, p2 = p_2, p3 = p_3)
                    
    else:
        for p_1 in loops[keys[0]]:
            for p_2 in loops[keys[1]]:
                for p_3 in loops[keys[2]]:
                    for p_4 in loops[keys[3]]:
                        record = nested_function(p1 = p_1, p2 = p_2, p3 = p_3, p4 = p_4)
                        
    return record
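
The hand-written loop depths above handle at most four keys. As an alternative sketch (not the implementation used in this notebook), `itertools.product` can enumerate every combination for any number of keys. The name `iter_param_grid` is hypothetical, and the example values are arbitrary.

```python
from itertools import product

def iter_param_grid(loops, defaults):
    """Yield one full parameter dict per combination of the values in `loops`."""
    keys = list(loops)
    for combo in product(*(loops[k] for k in keys)):
        params = dict(defaults)            # copy so the defaults are never mutated
        params.update(zip(keys, combo))    # overwrite only the looped keys
        yield params

# Arbitrary example values.
example_defaults = {"#_epochs": 4, "dense_neurons": 50, "perform_validation": False}
example_loops = {"dense_neurons": [50, 200], "perform_validation": [True, False]}

grid = list(iter_param_grid(example_loops, example_defaults))
for p in grid:
    print(p)
```

Each yielded dictionary could then be passed to `make_cnn` and `cnn_run_for_log` in a single loop. The first key varies slowest, which matches the ordering of the nested loops above.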
        

In [10]:
def add_to_cnn_log(record, link = 'CNN_log.csv', save_changes = False):
    
    """
    Combines most recent group of models with those saved in the log file. Includes designating a
    tuning group based on the most recent tuning group in the log file.
    
    Inputs:
    record: most recent record set of parameter tunings, as returned by above functions.
    link: pathway and filename for the saved log file, if it exists. If it doesn't exist, this 
        function won't work.
    save_changes: designates whether to save the updates to the CSV file that stores the model parameter tuning results.
    
    Output:
    A dataframe that combines the records on file and the most recent tuning group. 
    """
    
    df = pd.read_csv(link)
    group_num = df['tuning_group'].max() + 1
    r = pd.DataFrame(record)
    r.insert(0, 'tuning_group', group_num)
    
    df = pd.concat([df, r], ignore_index = True)

    if save_changes == True:
        df.to_csv(link, index = False)
    
    return df
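
As the docstring notes, `add_to_cnn_log` assumes the log file already exists. One possible way to handle a missing file is sketched below: start at tuning group 1 when there is nothing to read. The helper name `next_tuning_group`, the file name, and the metric values are all placeholders for illustration, and the example writes only to a temporary directory.

```python
import os
import tempfile
import pandas as pd

def next_tuning_group(path):
    """Return (existing log or None, group number to assign to the next batch)."""
    if not os.path.exists(path):
        return None, 1
    log = pd.read_csv(path)
    return log, int(log["tuning_group"].max()) + 1

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_log.csv")          # placeholder file name

    log, first_group = next_tuning_group(path)           # no file yet
    batch = pd.DataFrame({"test_RMSE": [15.0, 15.5]})    # made-up values
    batch.insert(0, "tuning_group", first_group)
    batch.to_csv(path, index=False)

    log, second_group = next_tuning_group(path)          # file now exists

print(first_group, second_group)
```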

In [11]:
def plot_results(data, fields):
    
    """
    Makes a simple Altair chart for comparing results visually.
    
    Input:
    data: Dataframe that includes the fields from the most current record or from the CNN_log saved as a CSV.
    fields: designated fields to encode using shape and color. Default are the first two designated in the 
        most recent record.
        
    Output:
    A chart comparing changes in the designated fields, mapped against test RMSE and difference between
        train and test RMSE.
    """

    data['RMSE_diff'] = data['test_RMSE'] - data['train_RMSE']
    fields_to_keep = ["test_RMSE", "train_RMSE", "RMSE_diff"] + fields
    data = data[fields_to_keep]

    chart = alt.Chart(data).mark_point().encode(x = alt.X("test_RMSE").scale(zero = False), 
                                                y = 'RMSE_diff', 
                                                color = fields[0] + ":N", 
                                                shape = fields[1] + ":N")

    return chart

def show_df_of_results(data, fields):
    data['RMSE_diff'] = data['test_RMSE'] - data['train_RMSE']
    fields_to_keep = ["test_RMSE", "train_RMSE", "RMSE_diff"] + fields
    data = data[fields_to_keep]
    return data

# 3.0 Building Models and Exploring Results

## Define the default parameters and those to change while looping. 

The cell below was used while tuning the CNN model. Different "tuning groups" were used in loops and then investigated by studying plots and dataframes. Thereafter, a new tuning group would be created for further investigation. The goal was to find the best model that maximized the accuracy but minimized overtraining. 

For repeatability, the functions called out in the blocks above were used in the tuning. 

In [18]:
#Parameter List
default_params = {"sec_conv": True, 
          "perform_validation": False,
          "num_train_samples": train.shape[0],
          "input_1": train.shape[1], 
          "input_2": train.shape[2],
          "#_epochs": 4,
          "#filters_conv1": 64,
          "filter_size_conv1": 3,
          "act_func_conv1": 'sigmoid',
          "#filters_conv2": 384,
          "filter_size_conv2": 5,
          "act_func_conv2": 'relu',
          "dense_neurons": 50,
          "act_func_dense": 'relu',
         }

loops = {
#          "#_epochs": [3, 4, 5, 6, 9, 15], #GROUP 1 - USED 4 IN GROUP 2
#          "#filters_conv1": [32, 64, 128, 256, 512], #GROUP 1 - USED 48 IN GROUP 2
#          "filter_size_conv1": [2, 3, 4, 5, 6, 8], #GROUP 2 AND GROUP 3 - go with 3 in filter 1, 5 in filter 2
#          "act_func_conv1": ['sigmoid', 'tanh', 'relu'], #GROUP 2 - KEEP AS SIGMOID
#          "#filters_conv1": [16, 32, 48, 64, 80, 96, 112, 128], #GROUP 3 - go with 48 and add second conv layer
#          "#filters_conv2": [16, 32, 48, 64, 80, 96, 128, 256, 384, 512], #GROUP 4 - KEEP 512
#          "act_func_conv2": ['relu', 'sigmoid', 'tanh'], #GROUP 4 - KEEP RELU
#          "filter_size_conv2": [2, 3, 4, 5, 6, 8, 10], #GROUP #5
#          "perform_validation": [True, False], #GROUP #5
#          "#_epochs": [3, 4, 5, 6, 7], #GROUP 6
#          "#filters_conv1": [16, 32, 64, 96, 128], #GROUP 6
#          "filter_size_conv1": [2, 3, 4, 5], #GROUP 6
#          "#filters_conv1": [48, 56, 64, 72], #GROUP 7
#          "#filters_conv2": [48, 96, 256, 384, 512], #GROUP 7
#          "filter_size_conv2": [3, 5, 8], #GROUP 7
#          "#filters_conv1": [32, 32, 32, 64, 64, 64], #GROUP 8
#          "#filters_conv2": [48, 48, 48, 96, 96, 96, 384, 384, 384], #GROUP 8  
#            "dense_neurons": [20, 40, 60, 80, 100, 200, 300, 400, 500], #GROUP 9
#            "act_func_dense": ['relu', 'sigmoid', 'tanh'], #GROUP 9
#            "dense_neurons": [80, 150, 180, 200, 220, 250, 300], #GROUP 10 - check several 50 vs. 200 next
#            "perform_validation": [True, False], #GROUP #10
           "dense_neurons": (([50] * 10) + ([200] * 10)), #GROUP 11
        }

The loops are run in the block below using the inputs designated in the block above.

When not being actively used, the function "loop_through_parameters" is commented out. 

In [None]:
#for each key and associated list in 'loops', make a record of results for different parameters.

# record = loop_through_parameters(loops, default_params)

## Compare results graphically and in a dataframe

Results from the most recent tuning group were viewed below. 

These items were commented out after tuning concluded. They can be reused by loading and filtering the log file. 

In [20]:
print(loops.keys())
# plot_results(pd.DataFrame(record), list(loops.keys()))

display(plot_results(pd.DataFrame(record), ['dense_neurons', 'sec_conv']))
# display(plot_results(pd.DataFrame(record), ['#filters_conv2', 'filter_size_conv2']))

dict_keys(['dense_neurons', 'perform_validation'])


A table of the most recent results was viewed below, sorted by test RMSE or train/test performance difference. 

These items were commented out after tuning concluded. They can be reused by loading and filtering the log file.

In [37]:
df = show_df_of_results(pd.DataFrame(record), list(loops.keys()))
df.sort_values("RMSE_diff")
df.sort_values("test_RMSE")

| | test_RMSE | train_RMSE | RMSE_diff | dense_neurons | perform_validation |
|---|---|---|---|---|---|
| 3 | 14.545352 | 14.331355 | 0.213997 | 50 | False |
| 5 | 14.635473 | 14.496909 | 0.138564 | 50 | False |
| 2 | 14.715892 | 14.36711 | 0.348782 | 50 | False |
| 17 | 14.723854 | 14.294314 | 0.42954 | 200 | False |
| 16 | 14.930543 | 14.097642 | 0.832901 | 200 | False |
| 4 | 14.952317 | 14.648026 | 0.304291 | 50 | False |
| 8 | 14.963652 | 14.529636 | 0.434015 | 50 | False |
| 15 | 15.10658 | 14.150016 | 0.956564 | 200 | False |
| 7 | 15.200913 | 14.283427 | 0.917485 | 50 | False |
| 11 | 15.238183 | 14.497846 | 0.740338 | 200 | False |


The cell below was used to initialize the performance log CSV. 

In [19]:
#Initialize the log with the first tuning group. 
#FILENAME REMOVED - DO NOT OVERWRITE LOG!

# df = pd.DataFrame(record)

# df.insert(0, "tuning_group", 1)

# df.to_csv("....csv")

The below block saves the most recent tuning group to the CSV log.  

In [34]:
# add_to_cnn_log(record, save_changes = True)

The block below is used to explore the entire CSV log to compare results. The results are also organized by performance/overfitting to find the best parameters to test. 

In [36]:
# df = pd.DataFrame(record)
# print(df.columns)
df = pd.read_csv("CNN_log.csv")

##USE CTRL + "/" TO COMMENT OUT FIELDS
df = df[[
    'tuning_group',
#     'train_MSE', 
#     'test_MSE', 
    'train_RMSE', 
    'test_RMSE', 
#     'train_MAE',
#     'test_MAE', 
#     'train_MAPE', 
#     'test_MAPE', 
#     'train_R2', 
#     'test_R2',
    'sec_conv', 
    'perform_validation', 
#     'num_train_samples', 
#     'input_1',
#     'input_2', 
    '#_epochs', 
    '#filters_conv1', 
    'filter_size_conv1',
    'act_func_conv1', 
    '#filters_conv2', 
    'filter_size_conv2',
    'act_func_conv2',
    'dense_neurons',
    'act_func_dense',
    ]]
df.insert(3, "RMSE_diff", df["test_RMSE"] - df["train_RMSE"])
pd.set_option('display.max_rows', None)
df = df.sort_values("test_RMSE").reset_index()[:20]
df

| | index | tuning_group | train_RMSE | test_RMSE | RMSE_diff | sec_conv | perform_validation | #_epochs | #filters_conv1 | filter_size_conv1 | act_func_conv1 | #filters_conv2 | filter_size_conv2 | act_func_conv2 | dense_neurons | act_func_dense |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 397 | 10 | 14.001601 | 14.22076 | 0.219159 | True | False | 4 | 64 | 3 | sigmoid | 384 | 5 | relu | 200 | relu |
| 1 | 178 | 6 | 14.448936 | 14.383256 | -0.06568 | True | False | 4 | 64 | 3 | sigmoid | 512 | 5 | relu | 50 | relu |
| 2 | 343 | 8 | 14.18666 | 14.438518 | 0.251858 | True | False | 4 | 64 | 3 | sigmoid | 384 | 5 | relu | 50 | relu |
| 3 | 378 | 9 | 14.239456 | 14.463041 | 0.223585 | True | False | 4 | 64 | 3 | sigmoid | 384 | 5 | relu | 200 | relu |
| 4 | 361 | 8 | 14.448208 | 14.474606 | 0.026398 | True | False | 4 | 64 | 3 | sigmoid | 384 | 5 | relu | 50 | relu |
| 5 | 334 | 8 | 14.154706 | 14.494445 | 0.339739 | True | False | 4 | 32 | 3 | sigmoid | 384 | 5 | relu | 50 | relu |
| 6 | 132 | 4 | 14.143855 | 14.526599 | 0.382744 | True | False | 4 | 48 | 3 | sigmoid | 512 | 5 | relu | 50 | relu |
| 7 | 113 | 4 | 13.828195 | 14.526994 | 0.698799 | True | False | 4 | 48 | 3 | sigmoid | 48 | 5 | tanh | 50 | relu |
| 8 | 354 | 8 | 14.309043 | 14.53213 | 0.223087 | True | False | 4 | 64 | 3 | sigmoid | 48 | 5 | relu | 50 | relu |
| 9 | 407 | 11 | 14.331355 | 14.545352 | 0.213997 | True | False | 4 | 64 | 3 | sigmoid | 384 | 5 | relu | 50 | relu |


# 4.0 Final Model

## 4.1 Discussion

Tuning group #11 was performed to see whether 50 or 200 dense neurons was more effective. Ten iterations with the same parameters were run for each setting to get an idea of the random variation in the modeling.  

It seems the best answer is 50 (see below), even though there was an extreme outlier where overfitting seemed to occur (see Section 3). 

The variation in the modeling results seems to be wide, which may make finding the best model difficult. However, the best models seem to reach a test RMSE of around 15 when overtraining isn't significant.

In [43]:
#This cell is used to see whether 50 dense neurons or 200 dense neurons is more effective. 
df = pd.read_csv('CNN_log.csv')
df = df[df['tuning_group'] == 11]
df.insert(3, "RMSE_diff", df["test_RMSE"] - df["train_RMSE"])
print(df[df['dense_neurons'] == 50][['test_RMSE', 'train_RMSE', 'RMSE_diff', 'dense_neurons']].mean(), "\n")
print(df[df['dense_neurons'] == 200][['test_RMSE', 'train_RMSE', 'RMSE_diff', 'dense_neurons']].mean(), "\n")

del df

test_RMSE        15.760149
train_RMSE       14.383757
RMSE_diff         1.376391
dense_neurons    50.000000
dtype: float64 

test_RMSE         16.289293
train_RMSE        14.310314
RMSE_diff          1.978979
dense_neurons    200.000000
dtype: float64 
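
The means above do not show how much individual runs differ from one another. As a sketch, the spread could be reported alongside the mean with `groupby` and `agg`. The numbers below are made-up placeholders, not values from `CNN_log.csv`.

```python
import pandas as pd

# Placeholder values for illustration only; not taken from CNN_log.csv.
runs = pd.DataFrame({
    "dense_neurons": [50, 50, 50, 200, 200, 200],
    "test_RMSE":     [14.5, 15.0, 15.5, 15.0, 16.0, 17.0],
})

# One row per setting, with the mean and the run-to-run spread.
summary = runs.groupby("dense_neurons")["test_RMSE"].agg(["mean", "std", "min", "max"])
print(summary)
```

If the gap between the two group means is small relative to the standard deviation within each group, the difference between settings may just be run-to-run noise.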



## 4.2 Choose final model parameters/architecture and save it

Using the above tuning, a final model was chosen with the parameters below. Then, the model was trained with the same training data and saved: the model as a `.keras` file and its training history as a pickle file. 

In [44]:
#Final parameter list
params = {"sec_conv": True, 
          "perform_validation": False,
          "num_train_samples": train.shape[0],
          "input_1": train.shape[1], 
          "input_2": train.shape[2],
          "#_epochs": 4,
          "#filters_conv1": 64,
          "filter_size_conv1": 3,
          "act_func_conv1": 'sigmoid',
          "#filters_conv2": 384,
          "filter_size_conv2": 5,
          "act_func_conv2": 'relu',
          "dense_neurons": 50,
          "act_func_dense": 'relu',
         }

Model is built/defined in block below, and the model structure summary is provided as output. 

In [48]:
cnn = make_cnn(params, print_summary = True)

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Model is trained in the block below, and then saved for future use. 

In [49]:
history = cnn.fit(train, 
                  y_train,  
                  epochs = params["#_epochs"], 
                  verbose = 0,
                    )

In [50]:
print("Train RMSE: ", history.history['root_mean_squared_error'][-1])
y_hat = cnn.predict(test)
print("Test RMSE: ", np.sqrt(mean_squared_error(y_test, y_hat)).item())
      

Train RMSE:  14.054810523986816
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step
Test RMSE:  14.805909461321166


Save the model and model history below, if desired. 

WARNING: THIS WILL SAVE OVER ANY EXISTING FILES.

In [55]:
# cnn.save('CNN_model_trained.keras')

# with open('CNN_model_history.pkl', 'wb') as f:
#     pickle.dump(history.history, f)

# np.save('CNN_model_trained_test_predictions.npy', y_hat)

## 4.3 Load saved models and associated files for future use

Model and associated save files can be re-loaded and used below. 

In [64]:
with open('CNN_model_history.pkl', 'rb') as f:
    history = pickle.load(f)

In [66]:
history['root_mean_squared_error']

[22.921009063720703,
 14.874977111816406,
 14.557729721069336,
 14.054810523986816]

In [67]:
cnn = keras.models.load_model('CNN_model_trained.keras')

In [68]:
y_hat = cnn.predict(test)

4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step


In [69]:
np.sqrt(mean_squared_error(y_test, y_hat)).item()

14.805909461321166

In [70]:
y_hat = np.load('CNN_model_trained_test_predictions.npy')
np.sqrt(mean_squared_error(y_test, y_hat)).item()

14.805909461321166

## 4.4 Exploration of Results 

As can be seen below, the model performed much better when the RUL was closer to 0. For samples where the actual RUL was less than 25 cycles, the mean absolute error was less than 3 cycles. For samples where the actual RUL was less than 10 cycles, the mean absolute error was less than 2 cycles. 

The model overestimated RUL on 39% of estimations. This is a problem, as the engine would have failed before the estimated end of life. There were not enough samples to determine if this holds true as the RUL nears 0. 

In [105]:
df = pd.DataFrame(y_hat, columns = ['prediction'])
df['actual'] = y_test
df['difference'] = df['actual'] - df['prediction']
print("Number of RUL overestimations (failure occurs before estimate): ", len(df[df['difference'] < 0]))
num = 10
temp1 = len(df[df['actual'] < num])
temp2 = len(df[(df['difference'] < 0) & (df['actual'] < num)])
print("Percentage of RUL overestimations where RUL < ", str(num), ": ", temp2/temp1)
del temp1, temp2, num
print("Average absolute error where RUL < 10: ", df[df['actual'] < 10].difference.abs().mean())
print("Average absolute error where RUL < 25: ", df[df['actual'] < 25].difference.abs().mean())
print("Average absolute error where RUL b/t 25 & 75: ", df[(df['actual'] > 25) & (df['actual'] < 75)].difference.abs().mean())
print("Average absolute error where RUL b/t 75 & 125: ", df[(df['actual'] > 75) & (df['actual'] < 125)].difference.abs().mean())
print("Average absolute error where RUL > 125: ", df[df['actual'] > 125].difference.abs().mean())
df.describe()

Number of RUL overestimations (failure occurs before estimate):  39
Percentage of RUL overestimations where RUL <  10 :  0.2
Average absolute error where RUL < 10:  1.8784668922424317
Average absolute error where RUL < 25:  2.854082233027408
Average absolute error where RUL b/t 25 & 75:  10.504339694976807
Average absolute error where RUL b/t 75 & 125:  11.491354900857676
Average absolute error where RUL > 125:  24.109589316628195


| | prediction | actual | difference |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 73.281151 | 75.52 | 2.238848 |
| std | 39.381599 | 41.76497 | 14.709391 |
| min | 5.088395 | 7.0 | -37.626015 |
| 25% | 27.863488 | 32.75 | -5.641363 |
| 50% | 87.461006 | 86.0 | 2.357883 |
| 75% | 104.856033 | 112.25 | 9.57402 |
| max | 130.551117 | 145.0 | 42.673317 |
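
The range filters in the cell above use strict inequalities, so a sample whose actual RUL is exactly 25, 75, or 125 falls into none of the middle or upper ranges. One alternative, sketched here on made-up values, is `pd.cut` with half-open bins so that every sample lands in exactly one band. The column names mirror the DataFrame built above; the band labels and the `demo` name are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; not the notebook's predictions.
demo = pd.DataFrame({
    "actual":     [5.0, 25.0, 60.0, 75.0, 130.0],
    "prediction": [6.0, 22.0, 70.0, 80.0, 110.0],
})
demo["difference"] = demo["actual"] - demo["prediction"]

# right=False makes each bin include its lower edge: [0, 25), [25, 75), ...
bins = [0, 25, 75, 125, np.inf]
labels = ["<25", "25-75", "75-125", ">=125"]
demo["rul_band"] = pd.cut(demo["actual"], bins=bins, labels=labels, right=False)

# Mean absolute error within each band.
mae_by_band = demo["difference"].abs().groupby(demo["rul_band"], observed=True).mean()
print(mae_by_band)
```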


The below plot shows the predicted vs. the actual RUL for all 100 test samples. You can see the accuracy improves closer to zero, as there is less variance. 

In [86]:
alt.Chart(df).mark_point().encode(x = alt.X("actual").scale(zero = False), 
                                                y = 'prediction', 
                                                color = "difference", 
                                                )

The below plot shows the predicted vs. the actual RUL for samples where actual RUL is less than 12 (close-up of bottom left of above chart). 

In [107]:
temp = df[df['actual'] < 12]
display(alt.Chart(temp).mark_point().encode(x = alt.X("actual").scale(zero = False), 
                                                y = 'prediction', 
                                                color = "difference", 
                                                ))
del temp