# Gradient Boosted Decision Tree with XGBoost

# 1.0 Dependencies and Notes

This notebook was built with the libraries imported below and the following versions:

Pandas 2.2.3 <br>
Numpy 2.0.2 <br>
Altair 5.4.1 <br>
sklearn 1.5.0 <br>
XGBoost 2.1.2 <br>

Different versions of these libraries may affect the functionality of this notebook.

The purpose of this notebook is to build a gradient-boosted decision tree (GBDT) model that predicts the remaining useful life (RUL) of jet engines using data provided by NASA. The notebook defines functions to build the model, fit it, and then explore and store the results. 

Results are stored via a CSV file. There is a function for looping through different parameters, and other functions for viewing, exploring, and saving the results. 

In [139]:
import pandas as pd
import numpy as np
import altair as alt
import sklearn
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from numpy import array, hstack
import pickle
import xgboost as xg

In [140]:
print(pd.__version__)
print(np.__version__)
print(alt.__version__)
print(sklearn.__version__)
print(xg.__version__)

2.2.3
2.0.2
5.4.1
1.5.0
2.1.2


## 1.1 Load and define train and test data. 

In [187]:
#link = 'processed_data_pickle_files_no_smoothing/'
link1 = 'cleaned_data_no_batches/'
link2 = 'batched_data_pickle_files/'

with open(link1 + 'test_data.pkl', 'rb') as file:
    test = pickle.load(file)
    
with open(link1 + 'train_data.pkl', 'rb') as file:
    train = pickle.load(file)    
    
y_train = train['RUL']
train.drop(["RUL"], axis = 1, inplace = True)
    
with open(link2 + 'true_rul.pkl', 'rb') as file:
    y_test = pickle.load(file)

In [188]:
display(train.shape)
display(y_train.shape)
display(test.shape)
display(y_test.shape)

(20631, 16)

(20631,)

(13096, 16)

(100,)

# 2.0 Definitions

Definitions used in the model creation and tuning. 

## 2.1 Function Definitions for Data Processing

Data can be scaled, but this may not be necessary with decision trees, since tree splits depend only on the ordering of feature values.

Two options for a scaler exist - 'standard' and 'minmax', both built off the canned sklearn scalers. 

In [235]:
def scale_data(train_data, test_data, scaler = 'standard'):
    
    trn = train_data.iloc[:,2:]
    tst = test_data.iloc[:,2:]
    
    cols = trn.columns
    
    trn = trn.to_numpy()
    tst = tst.to_numpy()
    
    if scaler == 'standard':
        scaler = StandardScaler()
    elif scaler == 'minmax':
        scaler = MinMaxScaler()
        
    trn = pd.DataFrame(scaler.fit_transform(trn), columns = cols)
    tst = pd.DataFrame(scaler.transform(tst), columns = cols)  #reuse the statistics fitted on the training data
    
    new_train = pd.concat([train_data.iloc[:,:2], trn], axis = 1)
    new_test = pd.concat([test_data.iloc[:,:2], tst], axis = 1)
    
    return new_train, new_test

In [236]:
#Test Run Only
a, b = scale_data(train, test, scaler = 'minmax')
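The usual convention when scaling is to fit the scaler on the training data only and then reuse those fitted statistics on the test data. The sketch below illustrates that convention on made-up toy frames (`toy_train`, `toy_test` and their values are assumptions for illustration, not part of the NASA data).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only
toy_train = pd.DataFrame({"s1": [1.0, 2.0, 3.0], "s2": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"s1": [2.0, 4.0], "s2": [20.0, 40.0]})

toy_scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data
scaled_train = pd.DataFrame(toy_scaler.fit_transform(toy_train), columns=toy_train.columns)

# transform reuses the training statistics, so a test value equal to the
# training mean maps to zero regardless of the test distribution
scaled_test = pd.DataFrame(toy_scaler.transform(toy_test), columns=toy_test.columns)

print(scaled_train)
print(scaled_test)
```

Fitting a second scaler on the test data would place train and test values on different scales, so the same raw sensor reading could map to different scaled values.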

Since decision trees consider each instance separately (as a singular moment in time), extra consideration should be given to how the different time series fields are changing over the time cycles. To do this, the below functions break apart the data set into the separate engines, add additional fields representing changes in each time series field, and then rebuild the dataset by concatenating the engines back into a single dataframe with the new fields. 

For each existing field, one or two additional fields are possible to represent the observed changes in that field. The changes are measured using windows and are summarized as follows:

<ul>
  <li>Average Change ('avg'): average change between each consecutive time step within the designated window. The larger the time window, the larger the time frame the field considers for average change.</li><br>
  <li>Absolute Change ('abs'): absolute change between the first and last rows in a window.</li><br>
  <li>Acceleration of Change ('acc'): Measures if the rate of change of the field is increasing or decreasing, and to what extent.</li><br>
    <li>Difference ('dif'): The difference between the current instance and the immediate preceding instance. Is the same as absolute change with a window size of 2. </li>
</ul>
This methodology creates some fields with NaN values at the beginning of each engine's time series. An option is available to drop the rows with NaN values; the default is "True" (drop them). 
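As a small worked example of the four change types listed above, the sketch below applies the same pandas rolling operations to a single made-up series with a window of 3. The series values and the variable names are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import pandas as pd

# Toy sensor readings for one engine, invented for illustration
s = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0])
window = 3

# 'dif': change from the immediately preceding instance
dif = s.diff()

# 'avg': mean of the step-to-step changes inside the window
avg = dif.rolling(window=window).mean()

# 'abs': last value minus first value of the raw series inside the window
abs_change = s.rolling(window=window).apply(lambda x: x.iloc[-1] - x.iloc[0])

# 'acc': last minus first step-to-step change inside the window
acc = dif.rolling(window=window).apply(lambda x: x.iloc[-1] - x.iloc[0])

summary = pd.DataFrame({"value": s, "dif": dif, "avg": avg, "abs": abs_change, "acc": acc})
print(summary)
```

With the default `min_periods`, a window produces a value only once it holds `window` non-NaN entries, which is why the 'avg' and 'acc' columns start one row later than 'abs'.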

In [None]:
def extract_test_instances(data):
    #Collect the last recorded cycle of each engine (one row per engine)
    rows = []
    
    for engine in data['unit_number'].unique():
        temp = data[data['unit_number'] == engine].iloc[[-1]]
        rows.append(temp)
        
    return pd.concat(rows, axis = 0)
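Assuming the goal of the cell above is to keep one row per engine (its final recorded cycle), the same selection can be sketched with a pandas group-by. The toy frame and its column values are invented for illustration.

```python
import pandas as pd

# Toy frame with two engines, invented for illustration
toy = pd.DataFrame({
    "unit_number": [1, 1, 1, 2, 2],
    "time_cycles": [1, 2, 3, 1, 2],
    "sensor": [0.1, 0.2, 0.3, 0.5, 0.6],
})

# tail(1) keeps the last row of each group and preserves the original index
last_rows = toy.groupby("unit_number").tail(1)
print(last_rows)
```

This relies on the rows of each engine already being ordered by cycle, which is an assumption about the input.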

In [237]:
def add_features(data, first_window, first_type = "dif", second_window = None, second_type = None, min_periods = None):
    
    cols = list(data.iloc[:,2:].columns)          
    
    for col in cols:
        dif = data[col].diff()
#         print(dif.diff())
        
        #WINDOW 1
        name1 = 'w1_' + first_type + "_" + col
        if first_type == 'avg':
            win = dif.rolling(window = first_window, min_periods = min_periods)
            data[name1] = win.mean()
        elif first_type == 'abs':
            abs_win = data[col].rolling(window = first_window, min_periods = min_periods)
            data[name1] = abs_win.apply(lambda x: x.iloc[-1] - x.iloc[0])
        elif first_type == 'acc':
            acc = dif.rolling(window = first_window, min_periods = min_periods)
            data[name1] = acc.apply(lambda x: x.iloc[-1] - x.iloc[0])
        elif first_type == 'dif':
            data[name1] = dif
        
        #WINDOW 2
        if second_window != None:
            name2 = 'w2_' + second_type + "_" + col
            if second_type == 'avg':
                win = dif.rolling(window = second_window, min_periods = min_periods)
                data[name2] = win.mean()
            elif second_type == 'abs':
                abs_win = data[col].rolling(window = second_window, min_periods = min_periods)
                data[name2] = abs_win.apply(lambda x: x.iloc[-1] - x.iloc[0])
            elif second_type == 'acc':
                acc = dif.rolling(window = second_window, min_periods = min_periods)
                data[name2] = acc.apply(lambda x: x.iloc[-1] - x.iloc[0])
            elif second_type == 'dif':
                data[name2] = dif
                
    return data


def isolate_engines_add_features(params, data, drop_na = True):
    
    df = pd.DataFrame(columns = list(data.columns))
    
    for engine in data['unit_number'].unique():
        temp = data[data['unit_number'] == engine].copy()
        
        temp = add_features(temp, 
                     first_window = params['first_window'] , 
                     first_type = params['first_window_type'], 
                     second_window = params['second_window'], 
                     second_type = params['second_window_type'], 
                     min_periods = params['min_periods'],
                    )
        
        if len(df) == 0:
            df = temp
        else:    
            df = pd.concat([df, temp], axis = 0)
            
    if drop_na == True:
        df.dropna(inplace = True)
        
    return df
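The reason for isolating each engine before differencing can be seen on a toy frame: a difference taken over the whole column leaks the last reading of one engine into the first row of the next. The sketch below uses invented values and a pandas group-by as one possible way to show the effect; it is not the notebook's own implementation.

```python
import pandas as pd

# Two toy engines with very different sensor levels, invented for illustration
toy = pd.DataFrame({
    "unit_number": [1, 1, 1, 2, 2, 2],
    "sensor": [1.0, 2.0, 3.0, 10.0, 12.0, 14.0],
})

# Differencing the whole column mixes engine 1's last row into engine 2's first row
naive_dif = toy["sensor"].diff()

# Differencing per engine leaves NaN at the start of each engine's series
per_engine_dif = toy.groupby("unit_number")["sensor"].diff()

print(pd.DataFrame({"naive": naive_dif, "per_engine": per_engine_dif}))
```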

In [241]:
#Test Run Only
temp_params = {'scaler': 'standard',
 'first_window': 3,
 'first_window_type': 'avg',
 'second_window': 3,
 'second_window_type': 'avg',
 'min_periods': None}

a, b = scale_data(train, test, scaler = 'standard')
df = isolate_engines_add_features(temp_params, a, drop_na = True)
df

Unnamed: 0,unit_number,time_cycles,Bleed Enthalpy,Bypass Ratio,"Coolant Bleed, HPT","Coolant Bleed, LPT","Corr. Speed, Fan","Speed, core","Speed, fan","Stat Press, HPC out","Temp, HPC out","Temp, LPC out","Temp, LPT out","Tot Press, HPC out","Tot Press, bypass",phi Fuel Flow Ratio,w1_avg_Bleed Enthalpy,w2_avg_Bleed Enthalpy,w1_avg_Bypass Ratio,w2_avg_Bypass Ratio,"w1_avg_Coolant Bleed, HPT","w2_avg_Coolant Bleed, HPT","w1_avg_Coolant Bleed, LPT","w2_avg_Coolant Bleed, LPT","w1_avg_Corr. Speed, Fan","w2_avg_Corr. Speed, Fan","w1_avg_Speed, core","w2_avg_Speed, core","w1_avg_Speed, fan","w2_avg_Speed, fan","w1_avg_Stat Press, HPC out","w2_avg_Stat Press, HPC out","w1_avg_Temp, HPC out","w2_avg_Temp, HPC out","w1_avg_Temp, LPC out","w2_avg_Temp, LPC out","w1_avg_Temp, LPT out","w2_avg_Temp, LPT out","w1_avg_Tot Press, HPC out","w2_avg_Tot Press, HPC out","w1_avg_Tot Press, bypass","w2_avg_Tot Press, bypass",w1_avg_phi Fuel Flow Ratio,w2_avg_phi Fuel Flow Ratio
3,1,4,-1.095310,-1.089802,1.106042,1.115505,-1.156655,-0.910566,-1.114881,-1.105662,-1.074177,-1.064201,-1.091843,1.108578,-1.980079,1.108458,0.006059,0.006059,0.003596,0.003596,-0.005434,-0.005434,-0.004156,-0.004156,-0.003109,-0.003109,-0.007298,-0.007298,-0.007633,-0.007633,0.001573,0.001573,-0.004001,-0.004001,-0.012209,-0.012209,0.004418,0.004418,-0.012207,-0.012207,0.017430,0.017430,-0.006328,-0.006328
4,1,5,-1.090481,-1.086160,1.101504,1.111569,-1.156733,-0.914403,-1.117122,-1.103117,-1.080430,-1.068025,-1.088342,1.097537,-1.956553,1.103455,0.005370,0.005370,0.003526,0.003526,-0.004788,-0.004788,-0.003909,-0.003909,-0.001696,-0.001696,-0.005305,-0.005305,-0.004543,-0.004543,0.002174,0.002174,-0.005700,-0.005700,-0.007344,-0.007344,0.003875,0.003875,-0.011795,-0.011795,0.019726,0.019726,-0.005629,-0.005629
5,1,6,-1.086164,-1.082427,1.097480,1.107996,-1.156398,-0.917087,-1.117943,-1.100449,-1.084439,-1.071473,-1.085035,1.089226,-1.931658,1.098934,0.004841,0.004841,0.003608,0.003608,-0.004451,-0.004451,-0.003797,-0.003797,-0.000502,-0.000502,-0.003858,-0.003858,-0.002435,-0.002435,0.002496,0.002496,-0.005527,-0.005527,-0.004630,-0.004630,0.003539,0.003539,-0.010314,-0.010314,0.022927,0.022927,-0.005033,-0.005033
6,1,7,-1.082012,-1.078630,1.093518,1.104566,-1.156383,-0.919136,-1.117766,-1.097896,-1.085802,-1.075930,-1.081752,1.082856,-1.906910,1.094579,0.004433,0.004433,0.003724,0.003724,-0.004175,-0.004175,-0.003646,-0.003646,0.000091,0.000091,-0.002857,-0.002857,-0.000962,-0.000962,0.002589,0.002589,-0.003875,-0.003875,-0.003910,-0.003910,0.003364,0.003364,-0.008574,-0.008574,0.024390,0.024390,-0.004627,-0.004627
7,1,8,-1.078134,-1.074868,1.090137,1.101120,-1.155044,-0.920274,-1.114646,-1.095109,-1.087562,-1.077773,-1.078641,1.077303,-1.880498,1.090368,0.004116,0.004116,0.003764,0.003764,-0.003789,-0.003789,-0.003483,-0.003483,0.000563,0.000563,-0.001957,-0.001957,0.000825,0.000825,0.002669,0.002669,-0.002378,-0.002378,-0.003249,-0.003249,0.003234,0.003234,-0.006745,-0.006745,0.025352,0.025352,-0.004363,-0.004363
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20626,100,196,1.630793,1.609318,-1.600296,-1.608053,1.436355,1.705526,1.413391,1.574708,1.591582,1.614849,1.604044,-1.573784,1.134921,-1.579370,0.003709,0.003709,0.006384,0.006384,-0.009462,-0.009462,-0.004452,-0.004452,0.013121,0.013121,0.023220,0.023220,0.013899,0.013899,0.004504,0.004504,0.004425,0.004425,0.003207,0.003207,0.007279,0.007279,-0.004200,-0.004200,0.014784,0.014784,-0.004774,-0.004774
20627,100,197,1.633671,1.612732,-1.605214,-1.608648,1.449696,1.727694,1.427235,1.576720,1.592804,1.615060,1.607476,-1.576266,1.147243,-1.581425,0.002758,0.002758,0.004861,0.004861,-0.006824,-0.006824,-0.002107,-0.002107,0.013255,0.013255,0.022666,0.022666,0.013872,0.013872,0.003023,0.003023,0.002734,0.002734,0.001561,0.001561,0.005202,0.005202,-0.003102,-0.003102,0.013579,0.013579,-0.003145,-0.003145
20628,100,198,1.636695,1.615648,-1.609700,-1.608875,1.463096,1.749438,1.441124,1.578018,1.593555,1.614623,1.609968,-1.578461,1.157417,-1.583066,0.002788,0.002788,0.003698,0.003698,-0.005136,-0.005136,-0.000813,-0.000813,0.013309,0.013309,0.022199,0.022199,0.013861,0.013861,0.002009,0.002009,0.001388,0.001388,0.000384,0.000384,0.003608,0.003608,-0.002502,-0.002502,0.011906,0.011906,-0.002238,-0.002238
20629,100,199,1.639481,1.617543,-1.613631,-1.608625,1.476285,1.770664,1.455010,1.578420,1.593160,1.613329,1.611835,-1.580621,1.165693,-1.584284,0.002896,0.002896,0.002742,0.002742,-0.004445,-0.004445,-0.000191,-0.000191,0.013310,0.013310,0.021713,0.021713,0.013873,0.013873,0.001237,0.001237,0.000526,0.000526,-0.000507,-0.000507,0.002597,0.002597,-0.002279,-0.002279,0.010257,0.010257,-0.001638,-0.001638


## 2.2 Model Function Definitions

In [None]:
default_parameters = {'scaler': 'standard', 
                      'first_window': 3, 
                      'first_window_type': 'avg', #options: avg, dif, acc, abs
                      'second_window': None, 
                      'second_window_type': 'avg', 
                      'min_periods': None,
                      'mod_loss': 'squared_error', 
                      'mod_learning_rate': 0.1, 
                      'mod_n_estimators': 100, 
                      'mod_subsample': 1.0,
                      'mod_min_samples_split': 2, 
                      'mod_min_samples_leaf': 1, 
                      'mod_max_depth': 3, 
                      'mod_validation': None  #requires integer as input - number of iterations with no change
                     }

In [259]:
def make_and_train_GBDT(params, train_data, train_target, test_data, test_target):
    
    """
    Initializes and fits a gradient-boosted decision tree regressor using sklearn's GradientBoostingRegressor. 
    Parameters are entered as a dictionary. 
    
    Inputs:
    params: dictionary of input parameters, must include 'mod_learning_rate', 'mod_n_estimators', 'mod_subsample', 
            'mod_min_samples_split', 'mod_min_samples_leaf', 'mod_max_depth', 'mod_validation'. 
    train_data, train_target: features and target used to fit the model.
    test_data, test_target: accepted for a consistent signature; not used during fitting.
    
    Output: the fitted model.
    """
    
    model = GradientBoostingRegressor(loss = 'squared_error', 
                                      learning_rate = params['mod_learning_rate'], 
                                      n_estimators = params['mod_n_estimators'], 
                                      subsample = params['mod_subsample'],   #Set to less than 1 to help with overfitting
                                      min_samples_split = params['mod_min_samples_split'], #Min to split into two branches
                                      min_samples_leaf = params['mod_min_samples_leaf'], #Min in each branch at a split-can help with overfitting
                                      max_depth = params['mod_max_depth'], 
                                      validation_fraction = 0.1, 
                                      n_iter_no_change = params['mod_validation']) 
    
    model.fit(train_data, train_target)
    
    return model
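To illustrate how `validation_fraction` and `n_iter_no_change` interact in sklearn's `GradientBoostingRegressor`, the sketch below fits a model on synthetic data with early stopping enabled. The data, seed, and parameter values are assumptions picked for a quick demonstration, not tuned settings from this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data, invented for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

toy_model = GradientBoostingRegressor(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.1,  # share of training data held out for early stopping
    n_iter_no_change=5,       # stop after 5 iterations without validation improvement
    random_state=0,
)
toy_model.fit(X, y)

# n_estimators_ reports how many trees were actually fitted
n_trees_used = toy_model.n_estimators_
print(n_trees_used)
```

When `n_iter_no_change` is `None`, no data is held out and all `n_estimators` trees are fitted.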

In [260]:
def GBDT_run_for_log(model, train_data, train_target, test_data, test_target, params):
    
    """ 
    Collects various scoring metrics for an already-fitted model on both the train and test data. 
    The input parameters dictionary is then merged with the metrics to provide 
    a dictionary of both the metrics and input parameters used. 
    
    Inputs:
    
    model: fitted Gradient-Boosted Decision Tree model from the function above ('make_and_train_GBDT') 
    or a GBDT defined via other means
    
    Various data inputs: Train and Test, plus targets
    
    params: dictionary of parameters used to build the model; copied into the output as-is.
    
    Output: dictionary of train and test metrics, along with the parameters included in the 'params' input. 
    
    """
    
    logger = {}
    
    y_hat_train = model.predict(train_data)
    y_hat_test = model.predict(test_data)
    
    logger["train_MSE"] = mean_squared_error(train_target, y_hat_train).item()
    logger["test_MSE"] = mean_squared_error(test_target, y_hat_test).item()
    
    logger["train_RMSE"] = np.sqrt(mean_squared_error(train_target, y_hat_train)).item()
    logger["test_RMSE"] = np.sqrt(mean_squared_error(test_target, y_hat_test)).item()
    
    logger["train_MAE"] = mean_absolute_error(train_target, y_hat_train).item()
    logger["test_MAE"] = mean_absolute_error(test_target, y_hat_test).item()
    
    logger["train_MAPE"] = np.mean(np.abs((train_target - y_hat_train) / y_hat_train)).item() * 100
    logger["test_MAPE"] = np.mean(np.abs((test_target - y_hat_test) / y_hat_test)).item() * 100
    
    logger["train_R2"] = r2_score(train_target, y_hat_train)
    logger["test_R2"] = r2_score(test_target, y_hat_test)
    
    logger.update(params)
    
    return logger
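As a hand-checkable illustration of the MSE, RMSE, MAE, and R2 metrics collected above, the sketch below scores four made-up predictions. The target and prediction values are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented targets and predictions (for example, RUL in cycles)
y_true = np.array([100.0, 50.0, 20.0, 10.0])
y_pred = np.array([90.0, 55.0, 25.0, 10.0])

# Errors are 10, -5, -5, 0
mse = mean_squared_error(y_true, y_pred)   # (100 + 25 + 25 + 0) / 4
rmse = float(np.sqrt(mse))
mae = mean_absolute_error(y_true, y_pred)  # (10 + 5 + 5 + 0) / 4
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot = 1 - 150 / 4900

print(mse, rmse, mae, r2)
```

RMSE is in the same units as the target, which makes it the easiest of these to compare against the RUL values themselves.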
    
    

In [261]:
def add_to_logger(new_instance, existing_dict):
    
    """
    Takes in a new instance of the function 'GBDT_run_for_log' which returns the performance metrics 
    for a GBDT using the listed input parameters. It then adds that instance to a dictionary of lists,
    where each index in the list represents a new run of 'GBDT_run_for_log'. This is intended to be used
    in running loops during parameter tuning to keep track of which parameters perform the best. 
    
    Input:
    new_instance: a dictionary of the most recent parameters and performance metrics
    existing_dict: a dictionary of lists that keep a record of performance metrics and the parameters that
            led to those results
    
    Output:
    An updated record of parameters/performance metrics in which the most recent parameters are added to the record
    
    """
    
    if existing_dict is None:
        record = {}
        for key in new_instance.keys():
            record[key] = []
    else:
        record = existing_dict.copy()
        
    for key in record.keys():
        l = record[key]
        if key in new_instance.keys():
            l.append(new_instance[key])     
        else:
            l.append(np.nan)
        record[key] = l
        
    return record
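The dictionary-of-lists layout described above can be sketched in a few lines: each run contributes one value per key, and the result converts directly to a dataframe. The run values and the use of `setdefault` are illustrative assumptions rather than the notebook's own helper.

```python
import pandas as pd

# Two invented runs, each a flat dictionary of metrics and parameters
runs = [
    {"test_RMSE": 15.2, "mod_max_depth": 3},
    {"test_RMSE": 14.8, "mod_max_depth": 5},
]

record = {}
for run in runs:
    for key, value in run.items():
        # One list per key; position i in every list belongs to run i
        record.setdefault(key, []).append(value)

log_df = pd.DataFrame(record)
print(log_df)
```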

In [262]:
def loop_through_parameters(loops, 
                            parameters, 
                            train = train, 
                            y_train = y_train, 
                            test = test, 
                            y_test = y_test):
    """
    Runs parameter loops for the model and records the resulting metrics. Returns a dictionary that is 
    a log of the results, where each key represents a parameter or performance metric, and each item is
    a list where the indexes represent the runs in chronological order. 
    
    Inputs:
    loops: a dictionary of the keys and values to loop through (up to four keys).
    parameters: a dictionary with all the parameters necessary to build the GBDT, train it, and acquire the 
    results using the functions defined previously.
    
    Output:
    A dictionary of lists with input parameters and resultant performance metrics. Can easily be used
    to construct a dataframe. 
    """
    
    record = None
    keys = list(loops.keys())
    
    def nested_function(p1 = None, p2 = None, p3 = None, p4 = None):
        parameters[keys[0]] = p1
        if len(keys) > 1:
            parameters[keys[1]] = p2
        if len(keys) > 2:
            parameters[keys[2]] = p3
        if len(keys) > 3:
            parameters[keys[3]] = p4
            
        #print(parameters)
        model = make_and_train_GBDT(parameters, train, y_train, test, y_test)
        new_instance = GBDT_run_for_log(model, train, y_train, test, y_test, parameters)
        r = add_to_logger(new_instance, record)
        return r
        
    if len(loops.keys()) == 1:
        for p_1 in loops[keys[0]]:
            record = nested_function(p1 = p_1)
            
    elif len(loops.keys()) == 2:
        for p_1 in loops[keys[0]]:
            for p_2 in loops[keys[1]]:
                record = nested_function(p1 = p_1, p2 = p_2)
                
    elif len(loops.keys()) == 3:
        for p_1 in loops[keys[0]]:
            for p_2 in loops[keys[1]]:
                for p_3 in loops[keys[2]]:
                    record = nested_function(p1 = p_1, p2 = p_2, p3 = p_3)
                    
    else:
        for p_1 in loops[keys[0]]:
            for p_2 in loops[keys[1]]:
                for p_3 in loops[keys[2]]:
                    for p_4 in loops[keys[3]]:
                        record = nested_function(p1 = p_1, p2 = p_2, p3 = p_3, p4 = p_4)
                        
    return record
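The nested loops above enumerate every combination of the listed values for up to four keys. As an alternative sketch, `itertools.product` generates the same combinations for any number of keys. The parameter names and values below are assumptions for illustration, and no model is trained.

```python
from itertools import product

# Invented search space and base parameters
toy_loops = {"mod_max_depth": [3, 5], "mod_learning_rate": [0.05, 0.1, 0.2]}
toy_base = {"mod_n_estimators": 100}

combos = []
for values in product(*toy_loops.values()):
    trial = dict(toy_base)                       # copy so runs do not share state
    trial.update(zip(toy_loops.keys(), values))  # overwrite the looped keys
    combos.append(trial)

print(len(combos))
print(combos[0])
```

Copying the base dictionary for each trial also keeps the caller's parameter dictionary unchanged after the loop.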
        

In [263]:
def add_to_GBDT_log(record, link = 'GBDT_log.csv', save_changes = False):
    
    """
    Combines the most recent group of models with those saved in the log file. Assigns a
    tuning group number one higher than the most recent tuning group in the log file.
    
    Inputs:
    record: most recent record set of parameter tunings, as returned by above functions.
    link: pathway and filename for the saved log file. The file must already exist; if it doesn't, 
        this function won't work.
    save_changes: designates whether to save the updates to the CSV file that stores the model parameter tuning results.
    
    Output:
    A dataframe that combines the records on file with the most recent tuning group. 
    """
    
    df = pd.read_csv(link)
    group_num = df['tuning_group'].max() + 1
    r = pd.DataFrame(record)
    r.insert(0, 'tuning_group', group_num)
    
    df = pd.concat([df, r], ignore_index = True)

    if save_changes == True:
        df.to_csv(link, index = False)
    
    return df
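Since the function above requires an existing log file, one possible variation is to create the file on the first call. The sketch below shows that idea with a hypothetical helper, `append_to_log`, writing to a temporary directory; the helper name, file name, and values are all assumptions and not part of this notebook.

```python
import os
import tempfile

import pandas as pd


def append_to_log(record, path):
    """Hypothetical helper: append one tuning group to a CSV log, creating it if needed."""
    new_rows = pd.DataFrame(record)
    if os.path.exists(path):
        existing = pd.read_csv(path)
        group_num = int(existing["tuning_group"].max()) + 1
        new_rows.insert(0, "tuning_group", group_num)
        combined = pd.concat([existing, new_rows], ignore_index=True)
    else:
        # First call: start the log at tuning group 1
        new_rows.insert(0, "tuning_group", 1)
        combined = new_rows
    combined.to_csv(path, index=False)
    return combined


with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_log.csv")
    first = append_to_log({"test_RMSE": [15.2, 14.8]}, toy_path)
    second = append_to_log({"test_RMSE": [14.5]}, toy_path)

groups = second["tuning_group"].tolist()
print(groups)
```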

In [264]:
def plot_results(data, fields):
    
    """
    Makes a simple Altair chart for comparing results visually.
    
    Input:
    data: Dataframe that includes the fields from the most current record or from the log saved as a CSV.
    fields: list of at least two fields to encode; the first is mapped to color and the second 
        to shape.
        
    Output:
    A chart comparing changes in the designated fields, mapped against test RMSE and difference between
        train and test RMSE.
    """

    data['RMSE_diff'] = data['test_RMSE'] - data['train_RMSE']
    fields_to_keep = ["test_RMSE", "train_RMSE", "RMSE_diff"] + fields
    data = data[fields_to_keep]

    chart = alt.Chart(data).mark_point().encode(x = alt.X("test_RMSE").scale(zero = False), 
                                                y = 'RMSE_diff', 
                                                color = fields[0] + ":N", 
                                                shape = fields[1] + ":N")

    return chart

def show_df_of_results(data, fields):
    data['RMSE_diff'] = data['test_RMSE'] - data['train_RMSE']
    fields_to_keep = ["test_RMSE", "train_RMSE", "RMSE_diff"] + fields
    data = data[fields_to_keep]
    return data
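The train/test RMSE gap used above as an overfitting indicator can be illustrated on a toy results table. The sketch below uses `DataFrame.assign` so the input frame is left unchanged; the metric values and the `mod_max_depth` column are invented for illustration.

```python
import pandas as pd

# Invented results for three runs
toy_results = pd.DataFrame({
    "train_RMSE": [14.0, 13.5, 14.4],
    "test_RMSE": [14.2, 15.0, 14.5],
    "mod_max_depth": [3, 6, 4],
})

# assign returns a new frame, so toy_results itself gains no extra column
ranked = toy_results.assign(
    RMSE_diff=toy_results["test_RMSE"] - toy_results["train_RMSE"]
).sort_values("test_RMSE")

print(ranked)
```

A run with a low test RMSE and a small `RMSE_diff` is the kind of candidate the tuning process looks for; a large gap suggests the model fits the training data much better than unseen data.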

# 3.0 Building Models and Exploring Results

## Define the default parameters and those to change while looping. 

The cell below was used while tuning the GBDT model. Different "tuning groups" were used in loops and then investigated by studying plots and dataframes. Thereafter, a new tuning group would be created for further investigation. The goal was to find the best model that maximized the accuracy but minimized overtraining. 

For repeatability, the functions called out in the blocks above were used in the tuning. 

In [275]:
#Parameter List
default_params = {"sec_conv": True, 
          "perform_validation": False,
          "num_train_samples": train.shape[0],
          "input_1": train.shape[1], 
          "input_2": train.shape[2],
          "#_epochs": 4,
          "#filters_conv1": 64,
          "filter_size_conv1": 3,
          "act_func_conv1": 'sigmoid',
          "#filters_conv2": 384,
          "filter_size_conv2": 5,
          "act_func_conv2": 'relu',
          "dense_neurons": 50,
          "act_func_dense": 'relu',
         }

loops = {
#          "#_epochs": [3, 4, 5, 6, 9, 15], #GROUP 1 - USED 4 IN GROUP 2
#          "#filters_conv1": [32, 64, 128, 256, 512], #GROUP 1 - USED 48 IN GROUP 2
#          "filter_size_conv1": [2, 3, 4, 5, 6, 8], #GROUP 2 AND GROUP 3 - go with 3 in filter 1, 5 in filter 2
#          "act_func_conv1": ['sigmoid', 'tanh', 'relu'], #GROUP 2 - KEEP AS SIGMOID
#          "#filters_conv1": [16, 32, 48, 64, 80, 96, 112, 128], #GROUP 3 - go with 48 and add second conv layer
#          "#filters_conv2": [16, 32, 48, 64, 80, 96, 128, 256, 384, 512], #GROUP 4 - KEEP 512
#          "act_func_conv2": ['relu', 'sigmoid', 'tanh'], #GROUP 4 - KEEP RELU
#          "filter_size_conv2": [2, 3, 4, 5, 6, 8, 10], #GROUP #5
#          "perform_validation": [True, False], #GROUP #5
#          "#_epochs": [3, 4, 5, 6, 7], #GROUP 6
#          "#filters_conv1": [16, 32, 64, 96, 128], #GROUP 6
#          "filter_size_conv1": [2, 3, 4, 5], #GROUP 6
#          "#filters_conv1": [48, 56, 64, 72], #GROUP 7
#          "#filters_conv2": [48, 96, 256, 384, 512], #GROUP 7
#          "filter_size_conv2": [3, 5, 8], #GROUP 7
#          "#filters_conv1": [32, 32, 32, 64, 64, 64], #GROUP 8
#          "#filters_conv2": [48, 48, 48, 96, 96, 96, 384, 384, 384], #GROUP 8  
#            "dense_neurons": [20, 40, 60, 80, 100, 200, 300, 400, 500], #GROUP 9
#            "act_func_dense": ['relu', 'sigmoid', 'tanh'], #GROUP 9
           "dense_neurons": [80, 150, 180, 200, 220, 250, 300], #GROUP 10
           "perform_validation": [True, False], #GROUP #10
        }

The loops are run in the block below using the inputs designated in the block above. 

In [276]:
#for each key and associated list in 'loops', make a record of results for different parameters.
record = loop_through_parameters(loops, default_params)

  super().__init__(activity_regularizer=activity_regularizer, **kwargs)

## Compare results graphically and in a dataframe

Results from the most recent tuning group were viewed below. 

In [277]:
print(loops.keys())
plot_results(pd.DataFrame(record), list(loops.keys()))

# display(plot_results(pd.DataFrame(record), ['#filters_conv1', '#filters_conv2']))
# display(plot_results(pd.DataFrame(record), ['#filters_conv2', 'filter_size_conv2']))

dict_keys(['dense_neurons', 'perform_validation'])


A table of the most recent results was viewed below, sorted by test RMSE or train/test performance difference. 

In [278]:
df = show_df_of_results(pd.DataFrame(record), list(loops.keys()))
df.sort_values("RMSE_diff")
df.sort_values("test_RMSE")

Unnamed: 0,test_RMSE,train_RMSE,RMSE_diff,dense_neurons,perform_validation
7,14.22076,14.001601,0.219159,200,False
12,14.585687,14.161062,0.424625,300,True
2,14.631289,14.232586,0.398704,150,True
4,14.771598,14.589047,0.182551,180,True
8,14.791526,14.339052,0.452474,220,True
11,14.805165,14.026073,0.779093,250,False
5,14.856135,14.292357,0.563777,180,False
6,15.170737,14.784474,0.386262,200,True
3,15.214619,14.126642,1.087977,150,False
0,15.443559,14.471303,0.972256,80,True


The cell below was used to initialize the performance log CSV. 

In [279]:
#Initialize the log with the first tuning group. 
#FILENAME REMOVED - DO NOT OVERWRITE LOG!

# df = pd.DataFrame(record)

# df.insert(0, "tuning_group", 1)

# df.to_csv("....csv")

The below block saves the most recent tuning group to the CSV log.  

In [280]:
#add_to_GBDT_log(record, save_changes = True)

Unnamed: 0.1,Unnamed: 0,tuning_group,train_MSE,test_MSE,train_RMSE,test_RMSE,train_MAE,test_MAE,train_MAPE,test_MAPE,...,input_2,#_epochs,#filters_conv1,filter_size_conv1,act_func_conv1,#filters_conv2,filter_size_conv2,act_func_conv2,dense_neurons,act_func_dense
0,0.0,1,355.895904,337.148566,18.865204,18.361606,14.578653,13.619847,59482784.0,149.622047,...,14,3,32,3,sigmoid,16,5,relu,50,relu
1,1.0,1,261.337524,261.644746,16.165937,16.175436,12.574144,12.334615,64940664.0,137.768392,...,14,3,64,3,sigmoid,16,5,relu,50,relu
2,2.0,1,237.478317,241.57048,15.410332,15.542538,12.074533,11.737319,75410768.0,144.281251,...,14,3,128,3,sigmoid,16,5,relu,50,relu
3,3.0,1,213.664215,233.558094,14.617257,15.282608,11.420168,11.452817,68126056.0,140.974781,...,14,3,256,3,sigmoid,16,5,relu,50,relu
4,4.0,1,204.825729,241.239781,14.311734,15.531896,11.161909,11.784251,52648680.0,151.277022,...,14,3,512,3,sigmoid,16,5,relu,50,relu
5,5.0,1,287.872467,258.666353,16.966804,16.083108,13.105846,11.970575,59131460.0,147.409903,...,14,4,32,3,sigmoid,16,5,relu,50,relu
6,6.0,1,210.422592,227.718637,14.50595,15.090349,11.263729,11.352382,38350452.0,145.565197,...,14,4,64,3,sigmoid,16,5,relu,50,relu
7,7.0,1,203.550232,246.400471,14.267103,15.697148,11.199593,12.066667,56108536.0,151.755213,...,14,4,128,3,sigmoid,16,5,relu,50,relu
8,8.0,1,183.459992,224.885678,13.544741,14.996189,10.270009,11.029661,42406780.0,147.017653,...,14,4,256,3,sigmoid,16,5,relu,50,relu
9,9.0,1,193.699326,254.986108,13.91759,15.968284,10.732617,12.037212,41657932.0,137.45793,...,14,4,512,3,sigmoid,16,5,relu,50,relu


The below block is used to explore the entire CSV log to compare results. Results are also sorted by performance/overfitting to find the best parameters to test. 

In [291]:
# df = pd.DataFrame(record)
# print(df.columns)
df = pd.read_csv("CNN_log.csv")

##USE CTRL + "/" TO COMMENT OUT FIELDS
df = df[[
    'tuning_group',
#     'train_MSE', 
#     'test_MSE', 
    'train_RMSE', 
    'test_RMSE', 
#     'train_MAE',
#     'test_MAE', 
#     'train_MAPE', 
#     'test_MAPE', 
#     'train_R2', 
#     'test_R2',
    'sec_conv', 
    'perform_validation', 
#     'num_train_samples', 
#     'input_1',
#     'input_2', 
    '#_epochs', 
    '#filters_conv1', 
    'filter_size_conv1',
    'act_func_conv1', 
    '#filters_conv2', 
    'filter_size_conv2',
    'act_func_conv2',
    'dense_neurons',
    'act_func_dense',
    ]]
df.insert(3, "RMSE_diff", df["test_RMSE"] - df["train_RMSE"])
pd.set_option('display.max_rows', None)
df = df.sort_values("test_RMSE").reset_index()[:20]
df

Unnamed: 0,index,tuning_group,train_RMSE,test_RMSE,RMSE_diff,sec_conv,perform_validation,#_epochs,#filters_conv1,filter_size_conv1,act_func_conv1,#filters_conv2,filter_size_conv2,act_func_conv2,dense_neurons,act_func_dense
0,397,10,14.001601,14.22076,0.219159,True,False,4,64,3,sigmoid,384,5,relu,200,relu
1,178,6,14.448936,14.383256,-0.06568,True,False,4,64,3,sigmoid,512,5,relu,50,relu
2,343,8,14.18666,14.438518,0.251858,True,False,4,64,3,sigmoid,384,5,relu,50,relu
3,378,9,14.239456,14.463041,0.223585,True,False,4,64,3,sigmoid,384,5,relu,200,relu
4,361,8,14.448208,14.474606,0.026398,True,False,4,64,3,sigmoid,384,5,relu,50,relu
5,334,8,14.154706,14.494445,0.339739,True,False,4,32,3,sigmoid,384,5,relu,50,relu
6,132,4,14.143855,14.526599,0.382744,True,False,4,48,3,sigmoid,512,5,relu,50,relu
7,113,4,13.828195,14.526994,0.698799,True,False,4,48,3,sigmoid,48,5,tanh,50,relu
8,354,8,14.309043,14.53213,0.223087,True,False,4,64,3,sigmoid,48,5,relu,50,relu
9,335,8,14.088375,14.552896,0.464521,True,False,4,32,3,sigmoid,384,5,relu,50,relu


# 4.0 Final Model

Using the above tuning, a final model was chosen with the parameters below. Then, the model was trained with the same data and saved into a pickle file. 

In [294]:
#Final parameter list
params = {"sec_conv": True, 
          "perform_validation": False,
          "num_train_samples": train.shape[0],
          "input_1": train.shape[1], 
          "input_2": train.shape[2],
          "#_epochs": 4,
          "#filters_conv1": 64,
          "filter_size_conv1": 3,
          "act_func_conv1": 'sigmoid',
          "#filters_conv2": 384,
          "filter_size_conv2": 5,
          "act_func_conv2": 'relu',
          "dense_neurons": 200,
          "act_func_dense": 'relu',
         }

Model is built/defined in block below, and the model structure summary is provided as output. 

In [297]:
cnn = make_cnn(params, print_summary = True)

Model is trained in the block below, and then saved to a pickle file for future use. 

In [299]:
history = cnn.fit(train, 
                  y_train,  
                  epochs = params["#_epochs"], 
                  verbose = 0,
                    )

with open('CNN_model_trained.pkl', 'wb') as f:
    pickle.dump(history, f)
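As a sketch of saving a fitted estimator for future use, the example below pickles a small sklearn gradient-boosted model in memory and checks that the restored copy gives the same predictions. The synthetic data and model settings are assumptions for illustration; note that it serializes the fitted model object itself.

```python
import pickle

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, invented for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + 0.1 * rng.normal(size=100)

toy_model = GradientBoostingRegressor(n_estimators=20, random_state=0).fit(X, y)

# Serialize to bytes and restore; pickle.dump / pickle.load work the same way with a file
payload = pickle.dumps(toy_model)
restored = pickle.loads(payload)

same_predictions = bool(np.allclose(toy_model.predict(X), restored.predict(X)))
print(same_predictions)
```

Pickled models are generally only safe to reload with the same library versions used to create them, which is one reason the versions are listed at the top of this notebook.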