# CNN with Keras by TensorFlow using Smoothed Data

This notebook is a variation of the workbook 'CNN.ipynb'; the primary difference is that the input data is smoothed. It was developed as an exploratory measure to determine whether data smoothing would improve model results. The conclusion, documented in the following sections, is that it does not. 

# 1.0 Dependencies and Notes

This notebook was built with the libraries imported below and the following versions:

Pandas 2.2.3 <br>
Numpy 2.0.2 <br>
Altair 5.4.1 <br>
sklearn 1.5.0 <br>
Keras 3.6.0 <br>

Different versions of these libraries may affect the functionality of this notebook.

The purpose of this notebook is to create a convolutional neural network to predict remaining useful life of jet engines using data provided by NASA that was then smoothed to reduce noise. The notebook includes definitions to build the model, fit it, and then explore and store the results. 

Results are stored via a CSV file. There is a function for looping through different parameters, and other functions for viewing, exploring, and saving the results. 

In [2]:
import pandas as pd
import numpy as np
import altair as alt
import sklearn
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from numpy import array, hstack
import pickle
import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten, Conv1D, MaxPooling1D
from keras import Input

In [3]:
print(pd.__version__)
print(np.__version__)
print(alt.__version__)
print(sklearn.__version__)
print(keras.__version__)

2.2.3
2.0.2
5.4.1
1.5.0
3.6.0


## 1.1 Load smoothed batched data and define train and test sets 

In [5]:
#link = 'processed_data_pickle_files_no_smoothing/'
link = '../data/smoothed_batched_data_pickle_files/'

with open(link + 'processed_test_data.pkl', 'rb') as file:
    test = pickle.load(file)
    
with open(link + 'processed_train_data.pkl', 'rb') as file:
    train = pickle.load(file)    
    
with open(link + 'processed_train_targets.pkl', 'rb') as file:
    y_train = pickle.load(file)
    
with open('../data/batched_data_pickle_files/' + 'true_rul.pkl', 'rb') as file:
    y_test = pickle.load(file)

In [8]:
display(train.shape)
display(y_train.shape)
display(test.shape)
display(y_test.shape)

(17731, 30, 14)

(17731,)

(100, 30, 14)

(100,)

In [9]:
#Initialize so definitions don't throw an error. Will be redefined in subsequent section. 
default_params = {}
loops = {}

# 2.0 Definitions

In [10]:
def make_cnn(params, print_summary = False):
    
    """
    Initializes a convolutional neural network using TensorFlow Keras Library. 
    Parameters are entered as a dictionary. 
    
    Inputs:
    params: dictionary of input parameters, must include 'sec_conv', '#filters_conv1', 'filter_size_conv1', 
            'act_func_conv1', 'input_1', 'input_2', 'dense_neurons', 'act_func_dense'; must also include 
            '#filters_conv2', 'filter_size_conv2', and 'act_func_conv2' when 'sec_conv' is True. 
    print_summary: boolean that dictates whether to print a summary of the constructed model 
            (provides the output shape and number of parameters of each layer, etc.)
    """
    
    
    #A sequential model can have different NN layers added. Use a list to define layers, or use add method.
    #Sequential models only work for one input tensor and one output tensor (including for each layer).
    model = Sequential()
    
    #Declares the input shape: steps followed by number of features (same order as input).
    #Keras 3 warns when 'input_shape' is passed to a layer, so an Input object is used instead.
    model.add(Input(shape = (params['input_1'], params["input_2"])))
    
    #Adds a 1-dimensional convolution layer (only moves in one direction: down each 'sample' of timeseries).
    model.add(Conv1D(filters = params["#filters_conv1"], #dimension of output space (number of filters)
                     kernel_size = params["filter_size_conv1"], #size of the convolution window
                     strides = 1, #convolution layer stride length
                     padding = 'valid', #zero padding at ends of convolutions, options: 'valid' (no padding), 'same'
                     activation=params["act_func_conv1"] #activation function to derive non-linear relationships; default is None.
                    )) 
    
    #Adds a second 1D convolution filter if user requests.
    if params["sec_conv"] == True:
        model.add(Conv1D(filters = params["#filters_conv2"], #dimension of output space (number of filters)
                     kernel_size = params["filter_size_conv2"], #size of the convolution window
                     activation=params["act_func_conv2"], #activation function to derive non-linear relationships; default is None.
                        )) 
    
    #Adds a pooling layer to reduce dimensions/computation time - Max or Average is available. 
    #Downsizes the input from the Conv1D layer, so structure is kept but some info is lost.
    model.add(MaxPooling1D(pool_size = 2)) #Number of consecutive time steps pooled together.
    
    #Adds a flattening layer, which makes the 3D data into a 1D array so it's compatible with Dense Layers.
    model.add(Flatten())
    
    #Adds a "dense" layer, which is a regular NN layer. This interprets the output of the previous layers.
    model.add(Dense(params['dense_neurons'], #First parameter is the output space dimensionality.
                    activation = params['act_func_dense'])) 
    
    #Adds a second dense layer to reduce the previous layer's neurons (50 by default) to a single output.
    #This is done separately for each batch.
    model.add(Dense(1))
    
    #This method configures the model for training. It's where you choose parameters.
    #Other parameter available: loss_weights, metrics, weighted_metrics, etc. 
    model.compile(optimizer = 'adam', loss = 'mse', metrics = ["mean_squared_error", 
                                                               "root_mean_squared_error", 
                                                               "mean_absolute_error", 
                                                               "mean_absolute_percentage_error", 
                                                               "r2_score"])
    
    if print_summary == True:
        model.summary()
    
    return model
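As a rough check on the layer comments above, the sequence length after each layer can be worked out by hand. The sketch below assumes 'valid' padding, a convolution stride of 1, and a pooling stride equal to the pool size (the Keras default), and plugs in the window length and default parameter values used later in this notebook. The helper names are hypothetical and not part of the notebook.

```python
def conv1d_valid_length(steps, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' (no) padding."""
    return (steps - kernel_size) // stride + 1


def max_pool_length(steps, pool_size):
    """Output length of 1D max pooling when the stride equals the pool size."""
    return (steps - pool_size) // pool_size + 1


steps = 30                                                     # time steps per window
after_conv1 = conv1d_valid_length(steps, kernel_size=3)        # first Conv1D
after_conv2 = conv1d_valid_length(after_conv1, kernel_size=5)  # second Conv1D
after_pool = max_pool_length(after_conv2, pool_size=2)         # MaxPooling1D
flattened = after_pool * 384                                   # Flatten: steps x filters in conv2

print(after_conv1, after_conv2, after_pool, flattened)
```

Under these assumptions, the values should agree with the output shapes reported by `model.summary()` when `print_summary` is True.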

In [11]:
def cnn_run_for_log(model, train_data, train_target, test_data, test_target, params):
    
    """ 
    Trains an input model on the input train data, then collects various scoring metrics of both the 
    train and test data. The input parameters dictionary is then concatenated with the metrics to provide 
    a dictionary of both the metrics and input parameters used. 
    
    Inputs:
    
    model: CNN model from the previous function above ('make_cnn') or CNN defined via other means
    
    Various data inputs: Train and Test, plus targets
    
    params: dictionary of parameters. Must include 'perform_validation' and '#_epochs'.
    
    Output: dictionary of train and test metrics, along with parameters included in the 'params' input. 
    
    """
    
    if params['perform_validation'] == True:
        v_s = 0.2
    else:
        v_s = None
    
    history = model.fit(train_data, 
                      train_target, 
                      validation_split = v_s, 
                      epochs = params["#_epochs"], 
                      verbose = 0
                     )
    logger = {}
    
    y_hat = model.predict(test_data, verbose = 0)
    
    logger["train_MSE"] = history.history["mean_squared_error"][-1]
    logger["test_MSE"] = mean_squared_error(test_target, y_hat).item()
    
    logger["train_RMSE"] = history.history["root_mean_squared_error"][-1]
    logger["test_RMSE"] = np.sqrt(mean_squared_error(test_target, y_hat)).item()
    
    logger["train_MAE"] = history.history["mean_absolute_error"][-1]
    logger["test_MAE"] = mean_absolute_error(test_target, y_hat).item()
    
    logger["train_MAPE"] = history.history["mean_absolute_percentage_error"][-1]
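    #Flatten so the (n, 1) predictions align with the (n,) targets instead of broadcasting to (n, n).
    y_true_flat = np.ravel(test_target)
    logger["test_MAPE"] = np.mean(np.abs((y_true_flat - np.ravel(y_hat)) / y_true_flat)).item() * 100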
    
    logger["train_R2"] = history.history["r2_score"][-1]
    logger["test_R2"] = r2_score(test_target, y_hat)
    
    logger.update(params)
    
    return logger
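
One detail worth checking when computing metrics by hand from `model.predict` output: for a model ending in `Dense(1)`, predictions come back with shape `(n, 1)`, while the targets in this notebook have shape `(n,)`. Subtracting the two broadcasts to an `(n, n)` matrix rather than `n` residuals. The sketch below uses made-up numbers (not the NASA data) to show the effect and one way to align the shapes.

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([100.0, 50.0, 20.0])        # shape (3,), like the targets
y_pred = np.array([[110.0], [45.0], [25.0]])  # shape (3, 1), like model.predict output

# Mismatched shapes broadcast to every target/prediction pair.
broadcast_diff = y_true - y_pred
print(broadcast_diff.shape)

# Flattening the predictions gives one residual per sample.
aligned_diff = y_true - y_pred.ravel()
print(aligned_diff.shape)

# Mean absolute percentage error on the aligned residuals (roughly 15 percent here).
mape = np.mean(np.abs(aligned_diff / y_true)) * 100
print(mape)
```

The scikit-learn metric functions used above reshape their inputs internally, so this mainly matters for metrics written directly in NumPy.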
    
    

In [12]:
def add_to_logger(new_instance, existing_dict):
    
    """
    Takes in a new instance of the function 'cnn_run_for_log' which returns the performance metrics 
    for a CNN using the listed input parameters. It then adds that instance to a dictionary of lists,
    where each index in the list represents a new run of 'cnn_run_for_log'. This is intended to be used
    in running loops during parameter tuning to keep track of which parameters perform the best. 
    
    Input:
    new_instance: a dictionary of the most recent parameters and performance metrics
    existing_dict: a dictionary of lists that keep a record of performance metrics and the parameters that
            led to those results
    
    Output:
    An updated record of parameters/performance metrics in which the most recent parameters are added to the record
    
    """
    
    if existing_dict is None:
        record = {}
        for key in new_instance.keys():
            record[key] = []
    else:
        record = existing_dict.copy()
        
    for key in record.keys():
        l = record[key]
        if key in new_instance.keys():
            l.append(new_instance[key])     
        else:
            l.append(np.nan)
        record[key] = l
        
    return record
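
The dictionary-of-lists layout described in the docstring above can also be built in one pass from a list of per-run dictionaries. The sketch below is an alternative under stated assumptions, not the notebook's method: `records_to_columns` is a hypothetical helper, the run values are made up, and unlike `add_to_logger` it also keeps keys that first appear in a later run.

```python
import math


def records_to_columns(instances):
    """Turn a list of per-run dicts into a dict of equal-length lists."""
    # Collect every key in first-seen order so column order is stable.
    keys = []
    for instance in instances:
        for key in instance:
            if key not in keys:
                keys.append(key)
    # Fill in NaN wherever a run did not report a given key.
    return {key: [instance.get(key, math.nan) for instance in instances] for key in keys}


# Made-up runs, for illustration only.
runs = [
    {"test_RMSE": 25.3, "#_epochs": 3},
    {"test_RMSE": 26.5, "#_epochs": 2, "dense_neurons": 50},
]
columns = records_to_columns(runs)
print(columns)
```

Either layout can be passed to `pd.DataFrame` to get one row per run.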

In [13]:
def loop_through_parameters(loops, 
                            parameters, 
                            train = train, 
                            y_train = y_train, 
                            test = test, 
                            y_test = y_test):
    """
    Runs parameter loops for model and records the resulting metrics. Returns a dictionary that is 
    a log of the results, where each key represents a parameter or performance metric, and each item is
    a list where the indexes represent the runs in chronological order. 
    
    Inputs:
    loops: a dictionary of the keys and values to loop through (one to four keys are supported).
    parameters: a dictionary with all the parameters necessary to build the CNN, train it, and acquire the 
    results using the functions defined previously.
    
    Output:
    A dictionary of lists with input parameters and resultant performance metrics. Can easily be used
    to construct a dataframe. 
    """
    
    record = None
    keys = list(loops.keys())
    
    def nested_function(p1 = None, p2 = None, p3 = None, p4 = None):
        parameters[keys[0]] = p1
        if len(keys) > 1:
            parameters[keys[1]] = p2
        if len(keys) > 2:
            parameters[keys[2]] = p3
        if len(keys) > 3:
            parameters[keys[3]] = p4
            
        #print(parameters)
        model = make_cnn(parameters)
        new_instance = cnn_run_for_log(model, train, y_train, test, y_test, parameters)
        r = add_to_logger(new_instance, record)
        return r
        
    if len(loops.keys()) == 1:
        for p_1 in loops[keys[0]]:
            record = nested_function(p1 = p_1)
            
    elif len(loops.keys()) == 2:
        for p_1 in loops[keys[0]]:
            for p_2 in loops[keys[1]]:
                record = nested_function(p1 = p_1, p2 = p_2)
                
    elif len(loops.keys()) == 3:
        for p_1 in loops[keys[0]]:
            for p_2 in loops[keys[1]]:
                for p_3 in loops[keys[2]]:
                    record = nested_function(p1 = p_1, p2 = p_2, p3 = p_3)
                    
    else:
        for p_1 in loops[keys[0]]:
            for p_2 in loops[keys[1]]:
                for p_3 in loops[keys[2]]:
                    for p_4 in loops[keys[3]]:
                        record = nested_function(p1 = p_1, p2 = p_2, p3 = p_3, p4 = p_4)
                        
    return record
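
The nested loops above visit every combination of the values in `loops`, for up to four keys. A minimal sketch of the same enumeration for any number of keys, using `itertools.product`, is shown below. The generator name `iterate_parameter_grid` and the example values are assumptions for illustration; the sketch only produces parameter dictionaries and does not build or train a model.

```python
from itertools import product


def iterate_parameter_grid(loops, defaults):
    """Yield one full parameter dict per combination of the values in 'loops'."""
    keys = list(loops)
    # product() varies the last key fastest, matching the nested for-loops.
    for values in product(*(loops[key] for key in keys)):
        params = dict(defaults)          # copy so the defaults are never mutated
        params.update(zip(keys, values)) # overwrite only the looped keys
        yield params


# Example values, for illustration only.
example_loops = {"#_epochs": [2, 3], "filter_size_conv1": [2, 3, 4]}
example_defaults = {"#_epochs": 4, "filter_size_conv1": 3, "dense_neurons": 50}

combos = list(iterate_parameter_grid(example_loops, example_defaults))
print(len(combos))
print(combos[0])
```

Each yielded dictionary could then be passed to `make_cnn` and `cnn_run_for_log` in place of the mutated `parameters` dictionary.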
        

In [14]:
def add_to_cnn_log(record, link = 'CNN_log.csv', save_changes = False):
    
    """
    Combines most recent group of models with those saved in the log file. Includes designating a
    tuning group based on the most recent tuning group in the log file.
    
    Inputs:
    record: most recent record set of parameter tunings, as returned by above functions.
    link: pathway and filename for the saved log file, if it exists. If it doesn't exist, this 
        function won't work.
    save_changes: designates whether to save the updates to the CSV file that stores the model parameter tuning results.
    
    Output:
    A dataframe of the combined records on file and the most recent tuning group. 
    """
    
    df = pd.read_csv(link)
    group_num = df['tuning_group'].max() + 1
    r = pd.DataFrame(record)
    r.insert(0, 'tuning_group', group_num)
    
    df = pd.concat([df, r], ignore_index = True)

    if save_changes == True:
        df.to_csv(link, index = False)
    
    return df
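
The docstring above notes that `add_to_cnn_log` needs an existing log file. A minimal sketch of one way to relax that is shown below: start at tuning group 1 when the file is missing, otherwise continue from the largest group on file. The helper name `append_tuning_group`, the example file name, and the metric values are hypothetical and not part of this notebook.

```python
import os
import tempfile

import pandas as pd


def append_tuning_group(record, path):
    """Return the log at 'path' (if any) plus 'record' as the next tuning group."""
    new_rows = pd.DataFrame(record)
    if os.path.exists(path):
        log = pd.read_csv(path)
        group_num = int(log["tuning_group"].max()) + 1
    else:
        log = None
        group_num = 1  # no log yet, so this record becomes the first group
    new_rows.insert(0, "tuning_group", group_num)
    if log is None:
        return new_rows
    return pd.concat([log, new_rows], ignore_index=True)


# Demonstration in a temporary directory so no real log is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_log.csv")
    first = append_tuning_group({"test_RMSE": [25.3, 26.5]}, path)
    first.to_csv(path, index=False)
    second = append_tuning_group({"test_RMSE": [27.0]}, path)

print(second)
```

Saving is left to the caller here, which keeps the "do not overwrite the log" decision explicit.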

In [15]:
def plot_results(data, fields):
    
    """
    Makes a simple Altair Chart for comparing results visually.
    
    Input:
    data: Dataframe that includes the fields from the most current record or from the CNN_log saved as a CSV.
    fields: list of fields to encode; the first is encoded using color and the second using shape. 
        Typically these are the first two designated in the most recent record.
        
    Output:
    A chart comparing changes in the designated fields, mapped against test RMSE and difference between
        train and test RMSE.
    """

    data['RMSE_diff'] = data['test_RMSE'] - data['train_RMSE']
    fields_to_keep = ["test_RMSE", "train_RMSE", "RMSE_diff"] + fields
    data = data[fields_to_keep]

    chart = alt.Chart(data).mark_point().encode(x = alt.X("test_RMSE").scale(zero = False), 
                                                y = 'RMSE_diff', 
                                                color = fields[0] + ":N", 
                                                shape = fields[1] + ":N")

    return chart

def show_df_of_results(data, fields):
    data['RMSE_diff'] = data['test_RMSE'] - data['train_RMSE']
    fields_to_keep = ["test_RMSE", "train_RMSE", "RMSE_diff"] + fields
    data = data[fields_to_keep]
    return data

# 3.0 Building Models and Exploring Results

## Define the default parameters and those to change while looping. 

The cell below was used while tuning the CNN model. Different "tuning groups" were used in loops and then investigated by studying plots and dataframes. Thereafter, a new tuning group would be created for further investigation. The goal was to find the best model that maximized the accuracy but minimized overtraining. 

For repeatability, the functions called out in the blocks above were used in the tuning. 

SPECIAL NOTE FOR SMOOTHED DATA: Only one tuning group was attempted. The results did not generalize well so no further tuning groups were attempted. Results are saved in a log file. 

In [18]:
#Parameter List
default_params = {"sec_conv": True, 
          "perform_validation": False,
          "num_train_samples": train.shape[0],
          "input_1": train.shape[1], 
          "input_2": train.shape[2],
          "#_epochs": 4,
          "#filters_conv1": 64,
          "filter_size_conv1": 3,
          "act_func_conv1": 'sigmoid',
          "#filters_conv2": 384,
          "filter_size_conv2": 5,
          "act_func_conv2": 'relu',
          "dense_neurons": 50,
          "act_func_dense": 'relu',
         }

loops = {
         "#_epochs": [2, 3, 4, 5, 6, 9, 15], #GROUP 1 - USED 4 IN GROUP 2
         "#filters_conv1": [32, 64, 128, 256, 512], #GROUP 1 - USED 48 IN GROUP 2
         "filter_size_conv1": [2, 3, 4, 5, 6, 8], #GROUP 1
#          "act_func_conv1": ['sigmoid', 'tanh', 'relu'], #GROUP 2 - KEEP AS SIGMOID
#          "#filters_conv1": [16, 32, 48, 64, 80, 96, 112, 128], #GROUP 3 - go with 48 and add second conv layer
#          "#filters_conv2": [16, 32, 48, 64, 80, 96, 128, 256, 384, 512], #GROUP 4 - KEEP 512
#          "act_func_conv2": ['relu', 'sigmoid', 'tanh'], #GROUP 4 - KEEP RELU
#          "filter_size_conv2": [2, 3, 4, 5, 6, 8, 10], #GROUP #5
#          "perform_validation": [True, False], #GROUP #5
#          "#_epochs": [3, 4, 5, 6, 7], #GROUP 6
#          "#filters_conv1": [16, 32, 64, 96, 128], #GROUP 6
#          "filter_size_conv1": [2, 3, 4, 5], #GROUP 6
#          "#filters_conv1": [48, 56, 64, 72], #GROUP 7
#          "#filters_conv2": [48, 96, 256, 384, 512], #GROUP 7
#          "filter_size_conv2": [3, 5, 8], #GROUP 7
#          "#filters_conv1": [32, 32, 32, 64, 64, 64], #GROUP 8
#          "#filters_conv2": [48, 48, 48, 96, 96, 96, 384, 384, 384], #GROUP 8  
#            "dense_neurons": [20, 40, 60, 80, 100, 200, 300, 400, 500], #GROUP 9
#            "act_func_dense": ['relu', 'sigmoid', 'tanh'], #GROUP 9
#            "dense_neurons": [80, 150, 180, 200, 220, 250, 300], #GROUP 10
#            "perform_validation": [True, False], #GROUP #10
        }

The loops are run in the block below using the inputs designated in the block above. 

The function "loop_through_parameters" is commented out when not actively in use. 

In [None]:
#for each key and associated list in 'loops', make a record of results for different parameters.

# record = loop_through_parameters(loops, default_params)

ADDED FOR SMOOTHED DATA ANALYSIS: The record can be set equal to the saved data log to replicate the results and view the graphs.

In [21]:
record = pd.read_csv("CNN_SMOOTH_data_log.csv")

## Compare results graphically and in a dataframe

Results from the most recent tuning group were viewed below.

ADDED NOTE FOR SMOOTHED DATA ANALYSIS: As can be seen in the visualization below, the difference between test and train RMSE does not drop significantly below 14, meaning the models do not generalize to the test data after training. This was assumed to be caused by a problem in the smoothing process (an issue with the input data), so further analysis was abandoned. 

In [22]:
print(loops.keys())
plot_results(pd.DataFrame(record), list(loops.keys()))

# display(plot_results(pd.DataFrame(record), ['#filters_conv1', '#filters_conv2']))
# display(plot_results(pd.DataFrame(record), ['#filters_conv2', 'filter_size_conv2']))

dict_keys(['#_epochs', '#filters_conv1', 'filter_size_conv1'])


A table of the most recent results was viewed below, sorted by test RMSE or train/test performance difference. 

In [24]:
df = show_df_of_results(pd.DataFrame(record), list(loops.keys()))
df.sort_values("RMSE_diff")
#df.sort_values("test_RMSE")

| | test_RMSE | train_RMSE | RMSE_diff | #_epochs | #filters_conv1 | filter_size_conv1 |
|---|---|---|---|---|---|---|
| 18 | 28.016074 | 14.02531 | 13.990765 | 2 | 256 | 2 |
| 21 | 27.031874 | 12.796519 | 14.235355 | 2 | 256 | 5 |
| 7 | 26.549247 | 12.119228 | 14.430019 | 2 | 64 | 3 |
| 42 | 25.26248 | 10.654352 | 14.608128 | 3 | 128 | 2 |
| 5 | 28.932498 | 13.764834 | 15.167663 | 2 | 32 | 8 |
| 15 | 27.39951 | 12.130551 | 15.268958 | 2 | 128 | 5 |
| 22 | 28.481653 | 13.188814 | 15.292839 | 2 | 256 | 6 |
| 19 | 28.077034 | 12.752298 | 15.324736 | 2 | 256 | 3 |
| 13 | 26.834564 | 11.369565 | 15.464999 | 2 | 128 | 3 |
| 2 | 27.773286 | 12.14154 | 15.631747 | 2 | 32 | 4 |
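
As a quick arithmetic check on the generalization gap, the sketch below recomputes `RMSE_diff` from the (test_RMSE, train_RMSE) pairs in the ten rows shown above and also looks at the test-to-train ratio. It only restates the displayed values; the ratio is an extra view added here for illustration and is not a metric stored in the log.

```python
# (test_RMSE, train_RMSE) pairs copied from the ten rows shown above.
rows = [
    (28.016074, 14.02531),
    (27.031874, 12.796519),
    (26.549247, 12.119228),
    (25.26248, 10.654352),
    (28.932498, 13.764834),
    (27.39951, 12.130551),
    (28.481653, 13.188814),
    (28.077034, 12.752298),
    (26.834564, 11.369565),
    (27.773286, 12.14154),
]

# Gap between test and train error for each run (the RMSE_diff column).
gaps = [test - train for test, train in rows]
smallest_gap = min(gaps)

# How many times larger the test error is than the train error.
ratios = [test / train for test, train in rows]
smallest_ratio = min(ratios)

print(round(smallest_gap, 2), round(smallest_ratio, 2))
```

For these rows the test RMSE is about twice the train RMSE even in the best case, which is consistent with the note that the models do not generalize.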


The cell below was used to initialize the performance log CSV. 

In [17]:
#Initialize the log with the first tuning group. 
#FILENAME REMOVED - DO NOT OVERWRITE LOG!

# df = pd.DataFrame(record)

# df.insert(0, "tuning_group", 1)

# df.to_csv("CNN_SMOOTH....csv")

The below block saves the most recent tuning group to the CSV log.  

NOTE FOR SMOOTHED DATA: Not used since only one tuning group was attempted. 

In [None]:
#add_to_cnn_log(record, save_changes = True)

The block below is used to explore the entire CSV log to compare results. The results are also sorted by performance/overfitting to find the best parameters to test. 

In [25]:
# df = pd.DataFrame(record)
# print(df.columns)
df = pd.read_csv("CNN_SMOOTH_data_log.csv")

##USE CTRL + "/" TO COMMENT OUT FIELDS
df = df[[
    'tuning_group',
#     'train_MSE', 
#     'test_MSE', 
    'train_RMSE', 
    'test_RMSE', 
#     'train_MAE',
#     'test_MAE', 
#     'train_MAPE', 
#     'test_MAPE', 
#     'train_R2', 
#     'test_R2',
    'sec_conv', 
    'perform_validation', 
#     'num_train_samples', 
#     'input_1',
#     'input_2', 
    '#_epochs', 
    '#filters_conv1', 
    'filter_size_conv1',
    'act_func_conv1', 
    '#filters_conv2', 
    'filter_size_conv2',
    'act_func_conv2',
    'dense_neurons',
    'act_func_dense',
    ]]
df.insert(3, "RMSE_diff", df["test_RMSE"] - df["train_RMSE"])
pd.set_option('display.max_rows', None)
df = df.sort_values("test_RMSE").reset_index()[:20]
df

| | index | tuning_group | train_RMSE | test_RMSE | RMSE_diff | sec_conv | perform_validation | #_epochs | #filters_conv1 | filter_size_conv1 | act_func_conv1 | #filters_conv2 | filter_size_conv2 | act_func_conv2 | dense_neurons | act_func_dense |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42 | 1 | 10.654352 | 25.26248 | 14.608128 | True | False | 3 | 128 | 2 | sigmoid | 384 | 5 | relu | 50 | relu |
| 1 | 7 | 1 | 12.119228 | 26.549247 | 14.430019 | True | False | 2 | 64 | 3 | sigmoid | 384 | 5 | relu | 50 | relu |
| 2 | 174 | 1 | 10.111709 | 26.637015 | 16.525306 | True | False | 9 | 512 | 2 | sigmoid | 384 | 5 | relu | 50 | relu |
| 3 | 13 | 1 | 11.369565 | 26.834564 | 15.464999 | True | False | 2 | 128 | 3 | sigmoid | 384 | 5 | relu | 50 | relu |
| 4 | 21 | 1 | 12.796519 | 27.031874 | 14.235355 | True | False | 2 | 256 | 5 | sigmoid | 384 | 5 | relu | 50 | relu |
| 5 | 15 | 1 | 12.130551 | 27.39951 | 15.268958 | True | False | 2 | 128 | 5 | sigmoid | 384 | 5 | relu | 50 | relu |
| 6 | 26 | 1 | 11.850192 | 27.609002 | 15.75881 | True | False | 2 | 512 | 4 | sigmoid | 384 | 5 | relu | 50 | relu |
| 7 | 78 | 1 | 10.412127 | 27.610314 | 17.198187 | True | False | 4 | 256 | 2 | sigmoid | 384 | 5 | relu | 50 | relu |
| 8 | 67 | 1 | 9.879682 | 27.721093 | 17.841411 | True | False | 4 | 64 | 3 | sigmoid | 384 | 5 | relu | 50 | relu |
| 9 | 2 | 1 | 12.14154 | 27.773286 | 15.631747 | True | False | 2 | 32 | 4 | sigmoid | 384 | 5 | relu | 50 | relu |


# 4.0 Final Model

NOTE FOR SMOOTHED DATA ANALYSIS: No final model was chosen due to poor results when analyzing the smoothed data. 