# Gradient Boosted Decision Tree to Predict Jet Engine RUL

# 1.0 Dependencies and Notes

This notebook was built with the libraries imported below and the following versions:

Pandas 2.2.3 <br>
Numpy 2.0.2 <br>
Altair 5.4.1 <br>
sklearn 1.5.0 <br>
XGBoost 2.1.2 <br>
matplotlib 3.9.0 <br>

Different versions of these libraries may affect the functionality of this notebook.

The purpose of this notebook is to create a gradiet-boosted random forest to predict remaining useful life of jet engines using data provided by NASA. The notebook includes definitions to build the model, fit it, and then explore and store the results. 

Results are stored via a CSV file. There is a function for looping through different parameters, and other functions for viewing, exploring, and saving the results. 

In [None]:
import pandas as pd
import numpy as np
import altair as alt
import sklearn
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from numpy import array, hstack
import pickle
import xgboost as xg
import matplotlib
import matplotlib.pyplot as plt

In [2]:
print("IMPORTED VERSIONS OF DEPENDENT LIBRARIES:")
print(pd.__version__)
print(np.__version__)
print(alt.__version__)
print(sklearn.__version__)
print(xg.__version__)
print(matplotlib.__version__)

IMPORTED VERSIONS OF DEPENDENT LIBRARIES:
2.2.3
2.0.2
5.4.1
1.5.0
2.1.2
3.9.0


The cell below can be used to view active variables in the notebook. It can be referred back to as the user goes through the notebook. 

In [3]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
%whos

Variable                    Type                          Data/Info
-------------------------------------------------------------------
GradientBoostingRegressor   ABCMeta                       <class 'sklearn.ensemble.<...>adientBoostingRegressor'>
MinMaxScaler                type                          <class 'sklearn.preproces<...>sing._data.MinMaxScaler'>
StandardScaler              type                          <class 'sklearn.preproces<...>ng._data.StandardScaler'>
alt                         module                        <module 'altair' from 'C:<...>es\\altair\\__init__.py'>
array                       builtin_function_or_method    <built-in function array>
hstack                      _ArrayFunctionDispatcher      <function hstack at 0x0000019E580934C0>
matplotlib                  module                        <module 'matplotlib' from<...>matplotlib\\__init__.py'>
mean_absolute_error         function                      <function mean_absolute_e<...>or at 0x0000019E6AD40AF0

## 1.1 Load and define train and test data. 

Since the format of data input into the GBDT does not match that of other algorithms (cheifly, the neural networks), additional data preprocessing is conducted below. It maintains the same dropped columns and 'early RUL' as provided in the preprocessing workbook. Scaling occurs later in Section 2.1 and is optional. No batching occurs, as data is required to be 2 dimensional input for the GBDT. 

### 1.1.1 Load raw data, select 'early_RUL' parameter

Load the raw data as it is provided by the NASA website.

Variables to keep active include early_RUL, and data variables for further processing: raw_train, raw_test, y_test.
<ul>
    <li><b>early_RUL</b> - sets the threshold of the maximum remaining useful life. This is intended to lessen the impact of data instance far from the end of useful life; see the data processing notebook for more. </li>
    <li><b>raw_train, raw_test, y_test</b> - the raw, unprocessed data from the NASA data repository. The raw files are preprocessed in Section 1.1 of this notebook (and maintain the same variable names) to remove unneeded fields and to create the RUL column of the training set.</li>
<ul>

In [4]:
early_RUL = 125

#Load and process data from scratch
train_file = "../data/CMAPSSData/train_FD001.txt"
test_file = "../data/CMAPSSData/test_FD001.txt"
rul_file = "../data/CMAPSSData/RUL_FD001.txt"

raw_train = pd.read_csv(train_file, sep="\s+", header = None)
raw_test = pd.read_csv(test_file, sep = "\s+", header = None)
y_test = pd.read_csv(rul_file, sep = "\s+", header = None).iloc[:,0]

del train_file, test_file, rul_file

### 1.1.2 Add RUL values to train set for each seperate unit engine

The definition below was borrowed from the 'data_preprocessing.ipynb' workbook. It takes in a single engine and returns the RUL for each line of data for that engine. 

The train targets are then defined using these output RUL values. As explained previously, the 'early_RUL' variable is used to dictate the maximum value that can be provided for RUL in any given data instance. Values above the 'early_RUL' parameter are automatically set to the 'early_RUL' parameter. 

In [5]:
def process_targets(data_length, early_rul=None):
    
    # if no early RUL is provided, generate a descending sequence from data_length -1 to 0
    if early_rul is None:
        return np.arange(data_length - 1, -1, -1)
    
    else:
        # calculate the duration for which early RUL is applicable
        early_rul_duration = data_length - early_rul
        
        # if the early RUL duration is non-positive, use a linear degradation curve (same as when early_rul is none)
        if early_rul_duration <= 0:
            return np.arange(data_length - 1, -1, -1)
        
        else:
            # create an array where the first early_rul_duration values are equal to early_rul
            target_array = np.append(early_rul * np.ones(shape=(early_rul_duration,)),
                                     np.arange(early_rul - 1, -1, -1))  # Add descending values from early_rul-1 to 0
            return target_array
        
raw_y_train = np.ndarray([])

for engine in raw_train.iloc[:,0].unique():
    
    data_length = len(raw_train[raw_train.iloc[:,0] == engine])
    if engine ==1:
        raw_y_train = process_targets(data_length, early_rul = early_RUL)
    else:
        new = process_targets(data_length, early_rul = 125)
        raw_y_train = np.append(raw_y_train, new)

del process_targets
        
print(len(raw_y_train))
print(raw_y_train)

20631
[125. 125. 125. ...   2.   1.   0.]


In [6]:
display(raw_train.shape)
display(raw_y_train.shape)
display(raw_test.shape)
display(y_test.shape)

(20631, 26)

(20631,)

(13096, 26)

(100,)

### 1.1.3 Create column field names; Drop columns that are included in the 'missing_indices.npy' file

Column titles are defined below, as well as which columns will be dropped. These updates are then applied to the test and train sets. 

The dropped columns were saved as a .npy file and were determined in the 'data_cleaning.ipynb' workbook. 

In [7]:
index_names = ['unit_number', 'time_cycles', 'setting_1', 'setting_2', 'setting_3']
index_names = index_names + ['sensor_{}'.format(i) for i in range(1, 22)]

columns_to_drop = np.load("../data/missing_indices.npy")

print("Columns to be dropped: ")
print(columns_to_drop)

Columns to be dropped: 
[ 2  3  4  5  9 14 18 20 22 23]


In [8]:
raw_train.columns = index_names
raw_test.columns = index_names

raw_train.drop(raw_train.iloc[:,columns_to_drop], axis = 1, inplace = True)
raw_test.drop(raw_test.iloc[:,columns_to_drop], axis = 1, inplace = True)

del index_names

In [9]:
print(raw_train.shape)
print(raw_y_train.shape)
print(raw_test.shape)
print(y_test.shape)

(20631, 16)
(20631,)
(13096, 16)
(100,)


## 1.2 Capturing Headings and Choosing Features

This section is for reference only and does not affect the raw data files used in the following sections. 

Hard coded descriptions of the different sensors are provided below, both abbreviated descriptions (to use as column headers if so desired) and longer descriptions that can be printed out. Dictionaries are created for each based on sensor number.

Then, the sensors that were removed are shown, along with the sensors that were kept in the dataset. 

Only variables for the dictionaries are kept: feature_dictionary_short, feature_dictionary_long

In [10]:
description_headers = ["Temp, fan in", "Temp, LPC out", "Temp, HPC out", 
                           "Temp, LPT out", "Press, fan in", "Tot Press, bypass", "Tot Press, HPC out",
                           "Speed, fan", "Speed, core", "Eng Press Ratio", "Stat Press, HPC out", 
                           "phi Fuel Flow Ratio", "Corr. Speed, Fan", "Corr. Speed, Core", "Bypass Ratio", 
                           "Burner Fuel/Air Ratio", "Bleed Enthalpy", "Dem Speed, fan", "(?) Dem Corr Speed, fan", 
                           "Coolant Bleed, HPT", "Coolant Bleed, LPT" ]
long_descriptions = ["Total temperature at fan inlet, degrees rankine", 
                     "Total temperature at Low Pressure Compressor (LPC) outlet, degrees rankine", 
                     "Total temperature at High Pressure Compressor (HPC) outlet, degrees rankine", 
                     "Total temperature at Low Pressure Turbine (LPT) outlet, degrees rankine", 
                     "Pressure at fan inlet, psia", 
                     "Total pressure in bypass-duct, psia", 
                     "Total pressure at HPC outlet, psia", 
                     "Physical fan speed, rpm", 
                     "Physical core speed, rpm", 
                     "Engine pressure ratio (P50/P2) where P2 is Pressure at Fan Inlet and P50 is Pressure at LPT outlet, psia", 
                     "Static pressure at HPC outlet, psia", 
                     "Ratio of fuel flow to Ps30 where Ps30 is static pressure at HPC outlet, pps/psi", 
                     "Corrected fan speed, rpm", 
                     "Corrected core speed, rpm", 
                     "Bypass Ratio, unitless", 
                     "Burner fuel-air ratio, unitless", 
                     "Bleed Enthalpy, unitless", 
                     "Demanded fan speed, rpm", 
                     "Demanded corrected fan speed, rpm", 
                     "HPT coolant bleed, lbm/s", 
                     "LPT coolant bleed, lbm/s"]
    #Temps are in R; for temps in F, subtract 459.67
    #Pressures are in psia
    #Speed is in rpm
    #"phi Fuel Flow Ratio" is ration of fuel flow to static pressure at HPC, in pps/psi
    #Bypass ratio - proportion of air mass passing through bypass versus the engine core (compressors/burners/turbines)
    #"Bleed Enthalpy" refers to bleed air (?), and the total enthalpy of it (Enthalpy = Internal Energy + (Pressure*Volume))
    #Coolant Bleed is in pound mass per second (lbm/s)
    
sensor_names = ['sensor_{}'.format(i) for i in range(1, 22)]

feature_dictionary_short = {}
feature_dictionary_long = {}
for i, sensor in enumerate(sensor_names):
    feature_dictionary_short[sensor] = description_headers[i]
    feature_dictionary_long[sensor] = long_descriptions[i]
    
for key in feature_dictionary_long.keys():
    print(key, ": ", feature_dictionary_long[key])
    
del description_headers
del long_descriptions
del sensor_names

raw_train.head()

sensor_1 :  Total temperature at fan inlet, degrees rankine
sensor_2 :  Total temperature at Low Pressure Compressor (LPC) outlet, degrees rankine
sensor_3 :  Total temperature at High Pressure Compressor (HPC) outlet, degrees rankine
sensor_4 :  Total temperature at Low Pressure Turbine (LPT) outlet, degrees rankine
sensor_5 :  Pressure at fan inlet, psia
sensor_6 :  Total pressure in bypass-duct, psia
sensor_7 :  Total pressure at HPC outlet, psia
sensor_8 :  Physical fan speed, rpm
sensor_9 :  Physical core speed, rpm
sensor_10 :  Engine pressure ratio (P50/P2) where P2 is Pressure at Fan Inlet and P50 is Pressure at LPT outlet, psia
sensor_11 :  Static pressure at HPC outlet, psia
sensor_12 :  Ratio of fuel flow to Ps30 where Ps30 is static pressure at HPC outlet, pps/psi
sensor_13 :  Corrected fan speed, rpm
sensor_14 :  Corrected core speed, rpm
sensor_15 :  Bypass Ratio, unitless
sensor_16 :  Burner fuel-air ratio, unitless
sensor_17 :  Bleed Enthalpy, unitless
sensor_18 :  Dema

Unnamed: 0,unit_number,time_cycles,sensor_2,sensor_3,sensor_4,sensor_6,sensor_7,sensor_8,sensor_9,sensor_11,sensor_12,sensor_13,sensor_15,sensor_17,sensor_20,sensor_21
0,1,1,641.82,1589.7,1400.6,21.61,554.36,2388.06,9046.19,47.47,521.66,2388.02,8.4195,392,39.06,23.419
1,1,2,642.15,1591.82,1403.14,21.61,553.75,2388.04,9044.07,47.49,522.28,2388.07,8.4318,392,39.0,23.4236
2,1,3,642.35,1587.99,1404.2,21.61,554.26,2388.08,9052.94,47.27,522.42,2388.03,8.4178,390,38.95,23.3442
3,1,4,642.35,1582.79,1401.87,21.61,554.45,2388.11,9049.48,47.13,522.86,2388.08,8.3682,392,38.88,23.3739
4,1,5,642.37,1582.85,1406.22,21.61,554.0,2388.06,9055.15,47.28,522.19,2388.04,8.4294,393,38.9,23.4044


In [11]:
print('SENSORS REMOVED') 
for item in columns_to_drop:
    if (item - 4) > 0:
        sensor = "sensor_" + str(item - 4)
        print(sensor + " : ", feature_dictionary_long[sensor])

SENSORS REMOVED
sensor_1 :  Total temperature at fan inlet, degrees rankine
sensor_5 :  Pressure at fan inlet, psia
sensor_10 :  Engine pressure ratio (P50/P2) where P2 is Pressure at Fan Inlet and P50 is Pressure at LPT outlet, psia
sensor_14 :  Corrected core speed, rpm
sensor_16 :  Burner fuel-air ratio, unitless
sensor_18 :  Demanded fan speed, rpm
sensor_19 :  Demanded corrected fan speed, rpm


In [12]:
print('SENSORS KEPT AS FIELDS') 
for field in raw_train.columns:
    if 'sensor' in field:
        sensor = field
        print(sensor + " : ", feature_dictionary_long[sensor])

SENSORS KEPT AS FIELDS
sensor_2 :  Total temperature at Low Pressure Compressor (LPC) outlet, degrees rankine
sensor_3 :  Total temperature at High Pressure Compressor (HPC) outlet, degrees rankine
sensor_4 :  Total temperature at Low Pressure Turbine (LPT) outlet, degrees rankine
sensor_6 :  Total pressure in bypass-duct, psia
sensor_7 :  Total pressure at HPC outlet, psia
sensor_8 :  Physical fan speed, rpm
sensor_9 :  Physical core speed, rpm
sensor_11 :  Static pressure at HPC outlet, psia
sensor_12 :  Ratio of fuel flow to Ps30 where Ps30 is static pressure at HPC outlet, pps/psi
sensor_13 :  Corrected fan speed, rpm
sensor_15 :  Bypass Ratio, unitless
sensor_17 :  Bleed Enthalpy, unitless
sensor_20 :  HPT coolant bleed, lbm/s
sensor_21 :  LPT coolant bleed, lbm/s


# 2.0 Definitions

Definitions used in the model creation and tuning. 

## 2.1 Function Definitions for Data Processing

Data can be scaled, but may not be neccessary with decision trees.

Two options for a scaler exist - 'standard' and 'minmax', both built off the canned sklearn scalers. 

Since decision trees consider each instance separately (as a singular moment in time), extra consideration was given to how the different time series fields are changing over the time cycles. To do this, the below functions break apart the data set into the separate engines, add additional fields representing changes in each time series field, and then rebuild the dataset by concatenating the engines back into a single dataframe with the new fields (this is Step 4 in the main pipeline function defined below). 

For each existing field, one or two additional fields are possible to represent the observed changes in that field. The changes are measured using windows and are summarized as follows:

<ul>
  <li>Average Change ('avg'): average change between each consecutive time step within the designated window. The larger the time window, the larger the time frame the field considers for average change.</li><br>
  <li>Absolute Change ('abs'): absolute change between the first and last rows in a window.</li><br>
  <li>Acceleration of Change ('acc'): Measures if the rate of change of the field is increasing or decreasing, and to what extent. It does this by first taking the difference as defined below with input 'dif', and then taking the absolute differences (dependent on window size) of those differences.</li><br>
    <li>Difference ('dif'): The difference between the current instance and the immediate preceding instance. Is the same as absolute change with a window size of 2. </li>
</ul>
This methodoly creates some fields with NaN values at the beginnning of each engine's time series. An option is available to drop the rows with NaN values with the default option being "True" (default is to drop NaN values). 

In [13]:
def add_window_field(data, size, typ, min_p, num):
    if num == "1":
        cols = list(data.columns)
    elif num == "2":
        cols = list(data.columns[0:14])
    
    for col in cols:
        dif = data[col].diff()
        name = "w" + num + "_" + typ + "_" + col
        lamb_x = lambda x: x.iloc[-1] - x.iloc[0]
        if typ == 'avg':
            data[name] = dif.rolling(window = size, min_periods = min_p).mean()
        elif typ == 'abs':
            data[name] = data[col].rolling(window = size, min_periods = min_p).apply(lamb_x)
        elif typ == 'acc':
            data[name] = dif.rolling(window = size, min_periods = min_p).apply(lamb_x)
        elif typ == 'dif':
            data[name] = dif
        else:
            raise NameError("The window type was not properly specified.")           
    return data

def add_window_fields(dat, p, num, show_prints = False):
    #Initialize new dataframe, exclude identifier columns (unit and cycle)
    n_tn = dat.copy().iloc[:0,2:]
    
    #Break apart existing dataframe by engine
    for unit in dat['unit_number'].unique():
            #Create new fields w/ function
            chunk = dat[dat['unit_number'] == unit].iloc[:,2:]
            if show_prints == True:
                print("Unit ", unit)
                display(chunk)    
            chunk = add_window_field(chunk, 
                                     p["first_window"], 
                                     p["first_window_type"], 
                                     p["min_periods"], 
                                     num)
            if show_prints == True:
                display(chunk)
            #Append chunk to newly initialized dataframe
            n_tn = pd.concat([n_tn, chunk], ignore_index = True)
            
    #First add back in identifier columns, then return new dataframe
    n_tn = pd.concat([dat.iloc[:,:2], n_tn], axis = 1, ignore_index = False)
    return n_tn

def process_data(params, train, test, y_train):
    
    #STEP 1
    #Throw an error if the designated windows are too big. 
    if params['first_window'] != None:
        if params['first_window'] > 30:
            raise ValueError('Window size cannot be greater than 30 to maintain functionality with the given test set.')
    if params['second_window'] != None:
        if params['second_window'] > 30:
            raise ValueError('Window size cannot be greater than 30 to maintain functionality with the given test set.')
    
    #STEP 2
    #Initialize required variables/datasets used in this function definition.
    trn = train.iloc[:,2:]
    tst = test.iloc[:,2:]
    y_trn = y_train
    cols = train.columns[2:]
    before = len(trn)
    
#     print("STEP 2:")
#     display(tst.head(5))
    
    #STEP 3
    #Scale the test and train sets fit on all the fields in the train set (regardless of unit).
    #Don't forget to reattach the 'unit_number' and 'cycles'
    if params['scaler'] == 'standard':
        scaler = StandardScaler()
    elif params['scaler'] == 'minmax':
        scaler = MinMaxScaler()
    elif params['scaler'] == None:
        scaler = None
    else:
        raise NameError("The designated scaler does not exist or is not available.")
    
    if scaler != None:
        trn = pd.DataFrame(scaler.fit_transform(trn), columns = cols)
        tst = pd.DataFrame(scaler.transform(tst), columns = cols)

    trn = pd.concat([train.iloc[:,:2], trn], axis = 1)
    tst = pd.concat([test.iloc[:,:2], tst], axis = 1)
    
#     print("STEP 3:")
#     display(tst.head(5))
    
    #STEP 4 (OPTIONAL BOOLEAN - USES PREVIOUSLY DEFINED FUNCTIONS)
    #Add new features to test and train sets, keep it confined to individual engines.
    #Use if statement to determine if new columns are desired
    if params["first_window"] != None:
        trn = add_window_fields(trn, params, "1")
        tst = add_window_fields(tst, params, "1")
    if params["second_window"] != None:
        trn = add_window_fields(trn, params, "2")
        tst = add_window_fields(tst, params, "2")

#     print("STEP 4:")
#     display(tst.head(5))
    
    #STEP 5 (OPTIONAL BOOLEAN)
    #Drop na values from train and test sets, if desired. 
    if params['drop_nan_train'] == True:
        trn["RUL"] = y_trn
        trn.dropna(inplace = True)
        y_trn = trn["RUL"]
        trn.drop(['RUL'], axis = 1, inplace = True)
        
        tst.dropna(inplace = True)

#     print("STEP 5:")
#     display(tst.head(5))
    
    #STEP 6
    #Extract final row from each test set. 
    temp_df = tst.iloc[:0,:]
    for unit in tst['unit_number'].unique():
        tail = tst[tst['unit_number'] == unit].tail(1)
        temp_df = pd.concat([temp_df, tail], ignore_index = True)
    tst = temp_df
    
#     print("STEP 6:")
#     display(tst.head(5))
    
    #STEP 7
    #Record number of dropped rows from train data and number of features, to include in record.
    dropped = before - len(trn)
    num_feat = len(trn.columns)
    
    #STEP 7
    #Return each processed dataframe
    return trn, tst, y_trn, {'num_features': num_feat - 2, 'train_samples_dropped': dropped}

## 2.2 Model Function Definitions

In [14]:
def make_and_train_GBDT(params, train_data, train_target):
    
    """
    Intializes a gradient-boosted decision tree using sklearn Library. 
    Parmeters are entered as a library. 
    
    Inputs:
    params: library of input parameters, must include .... 
    
    """

    model = GradientBoostingRegressor(loss = 'squared_error', 
                                      learning_rate = params['mod_learning_rate'], 
                                      n_estimators = params['mod_n_estimators'], 
                                      subsample = params['mod_subsample'],   #Set to less than 1 to help with overfitting
                                      min_samples_split = params['mod_min_samples_split'], #Min to split into two branches
                                      min_samples_leaf = params['mod_min_samples_leaf'], #Min in each branch at a split-can help with overfitting
                                      max_depth = params['mod_max_depth'], 
                                      validation_fraction = 0.1, 
                                      n_iter_no_change = params['mod_validation']) 
    
    model.fit(train_data, train_target)
    
    return model

In [15]:
def GBDT_run_for_log(model, train_data, train_target, test_data, test_target, params):
    
    """ 
    Trains an input model on the input train data, then collects various scoring metrics of both the 
    train and test data. The input parameters dictionary is then concatenated with the metrics to provide 
    a dictionary of both the metrics and input parameters used. 
    
    Inputs:
    
    Model: Gradient-Boosted Decision Tree model from the previous function above ('make_and_train_GBDT') 
    or GBDT defined via other means
    
    Various data inputs: Train and Test, plus targets
    
    params: library of parameters. Must include 'perform_validation' and '#_epochs'.
    
    Output: library of train and test parameters, along with parameters included in the 'params' input. 
    
    """
    
    logger = {}
    
    y_hat_train = model.predict(train_data)
    y_hat_test = model.predict(test_data)
    
    logger["train_MSE"] = mean_squared_error(train_target, y_hat_train).item()
    logger["test_MSE"] = mean_squared_error(test_target, y_hat_test).item()
    
    logger["train_RMSE"] = np.sqrt(mean_squared_error(train_target, y_hat_train)).item()
    logger["test_RMSE"] = np.sqrt(mean_squared_error(test_target, y_hat_test)).item()
    
    logger["train_MAE"] = mean_absolute_error(train_target, y_hat_train).item()
    logger["test_MAE"] = mean_absolute_error(test_target, y_hat_test).item()
    
    logger["train_MAPE"] = np.mean(np.abs((train_target - y_hat_train) / y_hat_train)).item() * 100
    logger["test_MAPE"] = np.mean(np.abs((test_target - y_hat_test) / y_hat_test)).item() * 100
    
    logger["train_R2"] = r2_score(train_target, y_hat_train)
    logger["test_R2"] = r2_score(test_target, y_hat_test)
    
    logger.update(params)
    
    return logger
    
    

In [16]:
def add_to_logger(new_instance, existing_dict):
    
    """
    Takes in a new instance of the function 'GBDT_run_for_log' which returns the performance metrics 
    for a GBDT using the listed input parameters. It then adds that instance to a dictionary of lists,
    where each index in the list represents a new run of 'GBDT_run_for_log'. This is intended to be used
    in running loops during parameter tuning to keep track of which parameters perform the best. 
    
    Input:
    new_instance: a dictionary of the most recent parameters and performance metrics
    existing_dict: a dictionary of lists that keep a record of performance metrics and the parameters that
            led to those results
    
    Output:
    An updated record of parameters/performance metrics in which the most recent parameters are added to the record
    
    """
    
    if existing_dict == None:
        record = {}
        for key in new_instance.keys():
            record[key] = []
    else:
        record = existing_dict.copy()
        
    for key in record.keys():
        l = record[key]
        if key in new_instance.keys():
            l.append(new_instance[key])     
        else:
            l.append(np.nan)
        record[key] = l
        
    return record

In [17]:
def loop_through_parameters(loops, 
                            parameters, 
                            train, 
                            y_train, 
                            test, 
                            y_test):
    """
    Runs parameter loops for model and records the resulting metrics. Returns a dictionary that is 
    a log of the results, where each key represents a parameter or performance metric, and each item is
    a list where the indexes represent the runs in chronological order. 
    
    Inputs:
    loops: a dictionary of the the keys and values to loop through.
    params: a dictionary with all the parameters neccessary to build the CNN, train it, and acquire the 
    results using the functions defined previously.
    
    Output:
    A dictionary of lists with input parameters and resultant performance metrics. Can easily be used
    to construct a dataframe. 
    """
    
    record = None
    keys = list(loops.keys())
    
    def nested_function(p1 = None, p2 = None, p3 = None, p4 = None):
        parameters[keys[0]] = p1
        if len(keys) > 1:
            parameters[keys[1]] = p2
        if len(keys) > 2:
            parameters[keys[2]] = p3
        if len(keys) > 3:
            parameters[keys[3]] = p4
            
#         print(parameters)
#         print(test.columns)

        trn, tst, trn_tar, d = process_data(parameters, train, test, y_train)

        trn = trn.iloc[:,2:]
        tst = tst.iloc[:,2:]
#         print(trn.shape)
#         print(tst.shape)
#         print(trn_tar.shape)
#         print(y_test)
        
        model = make_and_train_GBDT(parameters, trn, trn_tar, tst, y_test)
        
        new_instance = GBDT_run_for_log(model, trn, trn_tar, tst, y_test, parameters)
        
        for key in d.keys():
            new_instance[key] = d[key]
        
        r = add_to_logger(new_instance, record)
        
        return r
        
    if len(loops.keys()) == 1:
        for p_1 in loops[keys[0]]:
            print("Starting first loop with value ", p_1, ".")
            record = nested_function(p1 = p_1)
            
    elif len(loops.keys()) == 2:
        for p_1 in loops[keys[0]]:
            print("Starting first loop with value ", p_1, ".")
            for p_2 in loops[keys[1]]:
                record = nested_function(p1 = p_1, p2 = p_2)
                
    elif len(loops.keys()) == 3:
        for p_1 in loops[keys[0]]:
            print("Starting first loop with value ", p_1, ".")
            for p_2 in loops[keys[1]]:
                for p_3 in loops[keys[2]]:
                    record = nested_function(p1 = p_1, p2 = p_2, p3 = p_3)
                    
    else:
        for p_1 in loops[keys[0]]:
            print("Starting first loop with value ", p_1, ".")
            for p_2 in loops[keys[1]]:
                for p_3 in loops[keys[2]]:
                    for p_4 in loops[keys[3]]:
                        print("     Running with variables ", p_2, ", ", p_3, ", and ", p_4, ".")
                        record = nested_function(p1 = p_1, p2 = p_2, p3 = p_3, p4 = p_4)
                        
    return record
        

In [18]:
def add_to_GBDT_log(record, link = 'GBDT_log.csv', save_changes = False):
    
    """
    Combines most recent group of models with those saved in the log file. Includes designating a
    tuning group based on the most recent tuning group in the log file.
    
    Inputs:
    record: most recent record set of parameter tunings, as returned by above functions.
    link: pathway and filename for the saved log file, it if exists. If it doesn't exist, this 
        function won't work.
    save_changes: designates whether to save the updates to the CSV file that stores the model parameter tuning results.
    
    Output:
    A dataframe of the combines records on file and the most recent turning group. 
    """
    
    df = pd.read_csv(link)
    group_num = df['tuning_group'].max() + 1
    r = pd.DataFrame(record)
    r.insert(0, 'tuning_group', group_num)
    
    df = pd.concat([df, r], ignore_index = True)

    if save_changes == True:
        df.to_csv(link, index = False)
    
    return df

In [19]:
def plot_results(data, fields):
    
    """
    Makes a simple Altair Chart for compairing results visually.
    
    Input:
    data: Dataframe that includes the fields from the most current record or from the CNN_log saved as a CSV.
    fields: designated fields to encode using shape and color. Default are the first two designated in the 
        most recent record.
        
    Output:
    A chart comparing changes in the designated fields, mapped against test RMSE and difference between
        train and test RMSE.
    """

    data['RMSE_diff'] = data['test_RMSE'] - data['train_RMSE']
    fields_to_keep = ["test_RMSE", "train_RMSE", "RMSE_diff"] + fields
    data = data[fields_to_keep]
    
    if len(fields) > 1:
        chart = alt.Chart(data).mark_point().encode(x = alt.X("test_RMSE").scale(zero = False), 
                                                y = 'RMSE_diff', 
                                                color = fields[0] + ":N", 
                                                shape = fields[1] + ":N")
    else:
        chart = alt.Chart(data).mark_point().encode(x = alt.X("test_RMSE").scale(zero = False), 
                                                y = 'RMSE_diff', 
                                                shape = fields[0] + ":N",
                                                color = fields[0] + ":N"
                                                )

    return chart

def show_df_of_results(data, fields):
    data['RMSE_diff'] = data['test_RMSE'] - data['train_RMSE']
    fields_to_keep = ["test_RMSE", "train_RMSE", "RMSE_diff"] + fields
    data = data[fields_to_keep]
    return data

# 3.0 Building Models and Exploring Results

Different processing procedures (for example, different window sizes and additional fields added) and model input parameters were used to explore the best data input variation and the best model parameter selection. The functions defined in Sections 2.1 and 2.2 are utilized to perform looping of different variations of model and data architecture. Results are stored in a .CSV file. 

## 3.1 Define the default parameters and those to change while looping. 

The cell below was used while tuning the GBDT model. Different "tuning groups" were used in loops and then investigated by studying plots and dataframes. Thereafter, a new tuning group would be created for further investigation. The goal was to find the best model that maximized the accuracy but minimized overtraining. 

The dictionary variable "parameters" hold default parameters used by the processing/model. The dictionary variable "loops" shows the fields that are looped through for each tuning group. If a field is not included in "loops", then the default value is used for each model iteration. 

For future reference, each tuning group was commented out after the results were recorded. 

In [20]:
#Parameter List
parameters = {'scaler': 'standard', 
            'first_window': 30, 
            'first_window_type': 'abs', #options: avg, dif, acc, abs
            'second_window': None, 
            'second_window_type': 'avg', 
            'drop_nan_train': True,
            'min_periods': None, 
            'mod_loss': 'squared_error', 
            'mod_learning_rate': 0.2, 
            'mod_n_estimators': 400, 
            'mod_subsample': 1.0,
            'mod_min_samples_split': 2, 
            'mod_min_samples_leaf': 1, 
            'mod_max_depth': 3, 
            'mod_validation': None  #requires integer as input - number of iterations with no change
            }

loops = {
#          'mod_learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5], #GROUP 1 - CHOOSE AROUND 0.4
#          'mod_n_estimators': [10, 20, 50, 100, 200, 300, 500], #GROUP 1 - CHOOSE B/T 20 AND 100
#          'mod_max_depth': [3, 5, 7, 9], #GROUP 1 - CHOOSE 2-5
#          'scaler': ['standard', 'minmax', None] #GROUP 1 - MAKES NO DIFFERENCE, STAY WITH STANDARD
#          'mod_learning_rate': [3.75, 0.4, 0.425, 0.45, 0.475], #GROUP 2 - CHOOSE 0.425
#          'mod_n_estimators': [15, 20, 25, 30, 35, 40, 45, 50, 55, 60], #GROUP 2 - CHOOSE 55
#          'mod_max_depth': [2, 3, 4], #GROUP 2 - CHOOSE 3
#          'mod_validation': [None, 2], #GROUP 2 - GO W/ NONE
#          'first_window': [3, 5, 7, 10, 15, 20], #GROUP 3 - try higher window sizes
#          'first_window_type': ['avg', 'dif', 'acc', 'abs'], #GROUP 3 - stick to avg and abs
#          'mod_n_estimators': [50, 55, 60, 70, 80, 100], #GROUP 3 - try higher number of estimators
#          'mod_learning_rate': [0.1, 0.425], #GROUP 3 - try wider range
#          'first_window': [12, 18, 20, 25, 30], #GROUP 4 - go with 30
#          'first_window_type': ['avg', 'abs'], #GROUP 4 - GO WITH avg
#          'mod_n_estimators': [100, 200, 300], #GROUP 4 - try 200 - 600
#          'mod_learning_rate': [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5] #GROUP 4 - TRY .18 TO .32
#          'first_window': [20, 22, 24, 26, 28, 30], #GROUP 5 - KEEP 30
#          'first_window_type': ['avg', 'abs', 'acc'], #GROUP 5 - KEEP ABS
#          'mod_n_estimators': [250, 300, 350, 400, 450, 500], #GROUP 5 - KEEP 400
#          'mod_learning_rate': [0.175, 0.2, 0.225, 0.25, 0.275, 0.3, 0.325] #GROUP 5 - KEEP 0.2, BEST PROBABLY B/T .2 AND .25
         'second_window': [3, 5, 10, 20], #GROUP 6
         'second_window_type': ['avg', 'abs', 'acc', 'dif'], #GROUP 6
         'mod_n_estimators': [300, 400, 500, 600], #GROUP 6
        }

## 3.2 Run processing/model loops and study results. 

Each tuning group was assessed manually and individually. The results were used to determine the looping parameters to use in the consecutive tuning group. Results are stored in the .CSV file. 

### 3.2.1 Run processing/model loops as defined in Section 3.1

The loops are ran with in the block below using the inputs designated in the block above. 

When not in use, the function below is commented out. 

In [21]:
#for each key and associated list in 'loops', make a record of results for different parameters.

# record = loop_through_parameters(loops, parameters, raw_train, raw_y_train, raw_test, y_test)

### 3.2.2 Compare results graphically and in a dataframe

Results from the most recent tuning group were viewed below. 

This cell block was commented out after tuning concluded. 

In [22]:
# print(loops.keys())
# plot_results(pd.DataFrame(record), list(loops.keys()))

# display(plot_results(pd.DataFrame(record), ['second_window']))
# display(plot_results(pd.DataFrame(record), ['second_window_type']))
# display(plot_results(pd.DataFrame(record), ['mod_n_estimators']))
# display(plot_results(pd.DataFrame(record), ['mod_learning_rate']))

A table of the most recent results was viewed below, sorted by test RMSE or train/test performance difference. Only the fields that changed with each iterations are included.

This cell block was commented out after tuning concluded. 

In [23]:
# df = show_df_of_results(pd.DataFrame(record), list(loops.keys()))
# df = df[abs(df["RMSE_diff"]) < 0.5].sort_values("RMSE_diff")
# df.sort_values("test_RMSE")[0:100]


The cell below was used to <b>initialize</b> the performance log CSV. This cell should not be uncommented unless the user wants to initialize a new log file for looping. 

In [24]:
#Initialize the log with the first tuning group. 
#FILENAME REMOVED - DO NOT OVERWRITE LOG!

# df = pd.DataFrame(record)

# df.insert(0, "tuning_group", 1)

# df.to_csv("....csv", index = False)

The below block saves the most recent tuning group to the CSV log. The value for 'tuning_group' is determined based on the last entry into the log. 

In [25]:
# add_to_GBDT_log(record, save_changes = True)

The below block is used to explore the entire CSV log to compare results. Also organized by performace/overfitting to find the best parameters to test. 

In [27]:
df = pd.read_csv("GBDT_log.csv")

##USE CTRL + "/" TO COMMENT OUT FIELDS
df = df[[ 'tuning_group',
#          'train_MSE', 
#          'test_MSE', 
         'train_RMSE', 
         'test_RMSE', 
#          'train_MAE', 
#          'test_MAE', 
#          'train_MAPE', 
#          'test_MAPE', 
#          'train_R2', 
#          'test_R2', 
         'scaler',
         'first_window', 
         'first_window_type', 
         'second_window',
         'second_window_type', 
#          'drop_nan_train', 
#          'min_periods', 
         'mod_loss',
         'mod_learning_rate', 
         'mod_n_estimators', 
         'mod_subsample',
#          'mod_min_samples_split',
#          'mod_min_samples_leaf', 
         'mod_max_depth',
         'mod_validation', 
         'num_features', 
         'train_samples_dropped'
        ]]
df.insert(3, "RMSE_diff", df["test_RMSE"] - df["train_RMSE"])
pd.set_option('display.max_rows', None)
#df = df.sort_values("test_RMSE").reset_index(drop = True)[:20]
df.sort_values("test_RMSE")[:15]

Unnamed: 0,tuning_group,train_RMSE,test_RMSE,RMSE_diff,scaler,first_window,first_window_type,second_window,second_window_type,mod_loss,mod_learning_rate,mod_n_estimators,mod_subsample,mod_max_depth,mod_validation,num_features,train_samples_dropped
2155,6,14.586652,14.488147,-0.098505,standard,30.0,abs,3.0,dif,squared_error,0.2,400,1.0,3,,42,2900
2203,6,14.586652,14.488147,-0.098505,standard,30.0,abs,20.0,dif,squared_error,0.2,400,1.0,3,,42,2900
2167,6,14.586652,14.488147,-0.098505,standard,30.0,abs,5.0,acc,squared_error,0.2,400,1.0,3,,42,2900
2151,6,14.586652,14.488147,-0.098505,standard,30.0,abs,3.0,acc,squared_error,0.2,400,1.0,3,,42,2900
2175,6,14.586652,14.488147,-0.098505,standard,30.0,abs,10.0,avg,squared_error,0.2,400,1.0,3,,42,2900
2183,6,14.586652,14.488147,-0.098505,standard,30.0,abs,10.0,acc,squared_error,0.2,400,1.0,3,,42,2900
2195,6,14.586652,14.488147,-0.098505,standard,30.0,abs,20.0,abs,squared_error,0.2,400,1.0,3,,42,2900
2199,6,14.586652,14.488147,-0.098505,standard,30.0,abs,20.0,acc,squared_error,0.2,400,1.0,3,,42,2900
2179,6,14.586652,14.488147,-0.098505,standard,30.0,abs,10.0,abs,squared_error,0.2,400,1.0,3,,42,2900
2163,6,14.586652,14.488147,-0.098505,standard,30.0,abs,5.0,abs,squared_error,0.2,400,1.0,3,,42,2900


## 3.2 Investigating Feature Importances

In tuning group 6, it was observed that the second window size and type did not have an effect on model performance. For this reason, looping ceased here. This second explores the best-performing models from each tuning group to compare the results and to study the feature importances. 

In [28]:
models_to_test = [2203, 2179, 2155, 2080, 1382, 1126, 897, 861, 444]
feat_imp = {}
RMSE_results = {}

df = pd.read_csv("GBDT_log.csv")
pd.set_option('display.float_format', None)

# df[df['tuning_group'] == 1].sort_values('test_RMSE')

df = df.iloc[models_to_test] #, 11:-2]
df.drop(df.columns[[1, 2, 5, 6, 7, 8, 9, 10]], axis = 1, inplace = True)
df.insert(3, "RMSE_diff", df["test_RMSE"] - df["train_RMSE"])
display(df)
df = df.iloc[:, 4:-2]

Unnamed: 0,tuning_group,train_RMSE,test_RMSE,RMSE_diff,scaler,first_window,first_window_type,second_window,second_window_type,drop_nan_train,min_periods,mod_loss,mod_learning_rate,mod_n_estimators,mod_subsample,mod_min_samples_split,mod_min_samples_leaf,mod_max_depth,mod_validation,num_features,train_samples_dropped
2203,6,14.586652,14.488147,-0.098505,standard,30.0,abs,20.0,dif,True,,squared_error,0.2,400,1.0,2,1,3,,42,2900
2179,6,14.586652,14.488147,-0.098505,standard,30.0,abs,10.0,abs,True,,squared_error,0.2,400,1.0,2,1,3,,42,2900
2155,6,14.586652,14.488147,-0.098505,standard,30.0,abs,3.0,dif,True,,squared_error,0.2,400,1.0,2,1,3,,42,2900
2080,5,14.586652,14.488147,-0.098505,standard,30.0,abs,,avg,True,,squared_error,0.2,400,1.0,2,1,3,,28,2900
1382,4,15.122653,14.729283,-0.39337,standard,30.0,abs,,avg,True,,squared_error,0.2,300,1.0,2,1,3,,28,2900
1126,3,17.78646,17.620346,-0.166115,standard,15.0,abs,,avg,True,,squared_error,0.1,100,1.0,2,1,3,,28,1400
897,3,17.154092,17.62414,0.470048,standard,3.0,avg,,avg,True,,squared_error,0.425,80,1.0,2,1,3,,28,300
861,2,17.810731,17.69794,-0.112791,standard,,avg,,avg,True,,squared_error,0.475,40,1.0,2,1,3,2.0,14,0
444,1,17.740941,18.077246,0.336305,standard,,avg,,avg,True,,squared_error,0.4,50,1.0,2,1,3,,14,0


The below code loops through the selected models to explore the feature importances and RMSE results. 

Note the below cells take a while to process. 

In [29]:
#Parameter List
parameters = {'scaler': 'standard', 
            'first_window': 30, 
            'first_window_type': 'abs', #options: avg, dif, acc, abs
            'second_window': None, 
            'second_window_type': 'avg', 
            'drop_nan_train': True,
            'min_periods': None, 
            'mod_loss': 'squared_error', 
            'mod_learning_rate': 0.2, 
            'mod_n_estimators': 400, 
            'mod_subsample': 1.0,
            'mod_min_samples_split': 2, 
            'mod_min_samples_leaf': 1, 
            'mod_max_depth': 3, 
            'mod_validation': None  #requires integer as input - number of iterations with no change
            }

np.set_printoptions(suppress=True)

#Loop through the different models to re-run them.
for model in models_to_test:
    print("MODEL NUMBER: ", model)
    for param in list(dict(df.loc[model]).keys()):
        if isinstance(parameters[param], str):
            parameters[param] = df.loc[model, param]
        elif param in ['first_window', 'second_window', 'mod_validation']:
            if np.isnan(df.loc[model, param]):
                parameters[param] = None
            else:
                parameters[param] = int(df.loc[model, param].item())
        elif np.isnan(df.loc[model, param]):
            parameters[param] = None
        else:
            parameters[param] = df.loc[model, param].item()

#     display(parameters)
    
    #Re-run given data/model architecture
    tn, ts, y_tn, d = process_data(parameters, raw_train, raw_test, raw_y_train)
#     display(tn.head(5))
    print("Length of train DF fields: ", len(tn.columns))
    m = make_and_train_GBDT(parameters, tn.iloc[:,2:], y_tn)
    display(m)
    
#     print(np.round(m.feature_importances_, decimals = 6))
    print("Length of feature importances: ", len(m.feature_importances_), "\n\n")
    
    feat_imp[model] = m.feature_importances_
    RMSE_tn = np.sqrt(mean_squared_error(y_tn, m.predict(tn.iloc[:,2:])).item())
    RMSE_ts = np.sqrt(mean_squared_error(y_test, m.predict(ts.iloc[:,2:])).item())
    RMSE_d = RMSE_ts - RMSE_tn
    RMSE_results[model] = {'RMSE Train': RMSE_tn, "RMSE Test": RMSE_ts, "RMSE Difference": RMSE_d}

MODEL NUMBER:  2203
Length of train DF fields:  44


Length of feature importances:  42 


MODEL NUMBER:  2179
Length of train DF fields:  44


Length of feature importances:  42 


MODEL NUMBER:  2155
Length of train DF fields:  44


Length of feature importances:  42 


MODEL NUMBER:  2080
Length of train DF fields:  30


Length of feature importances:  28 


MODEL NUMBER:  1382
Length of train DF fields:  30


Length of feature importances:  28 


MODEL NUMBER:  1126
Length of train DF fields:  30


Length of feature importances:  28 


MODEL NUMBER:  897
Length of train DF fields:  30


Length of feature importances:  28 


MODEL NUMBER:  861
Length of train DF fields:  16


Length of feature importances:  14 


MODEL NUMBER:  444
Length of train DF fields:  16


Length of feature importances:  14 




In [30]:
for key in RMSE_results.keys():
    specs = RMSE_results[key]
    print("FOR MODEL #", key, ":")
    print("    Train RMSE: ", specs["RMSE Train"])
    print("    Test RMSE: ", specs["RMSE Test"])
    print("    RMSE Difference: ", specs["RMSE Difference"], "\n")

FOR MODEL # 2203 :
    Train RMSE:  14.586651945396225
    Test RMSE:  14.488147304127896
    RMSE Difference:  -0.09850464126832925 

FOR MODEL # 2179 :
    Train RMSE:  14.586651945396223
    Test RMSE:  14.4881473041279
    RMSE Difference:  -0.09850464126832392 

FOR MODEL # 2155 :
    Train RMSE:  14.586651945396223
    Test RMSE:  14.488147304127901
    RMSE Difference:  -0.09850464126832215 

FOR MODEL # 2080 :
    Train RMSE:  14.586651945396225
    Test RMSE:  14.488147304127905
    RMSE Difference:  -0.09850464126832037 

FOR MODEL # 1382 :
    Train RMSE:  15.12265312773571
    Test RMSE:  14.729282921197221
    RMSE Difference:  -0.39337020653848853 

FOR MODEL # 1126 :
    Train RMSE:  17.786460332394018
    Test RMSE:  17.620345548155495
    RMSE Difference:  -0.16611478423852333 

FOR MODEL # 897 :
    Train RMSE:  17.154092036312626
    Test RMSE:  17.62414037048984
    RMSE Difference:  0.47004833417721414 

FOR MODEL # 861 :
    Train RMSE:  18.052931160855216
    Tes

In [31]:
for feat in feat_imp.keys():
    print("FOR MODEL #", feat)
    print("Max Importance: ", feat_imp[feat].max())
    print("Min Importance: ", feat_imp[feat].min())
    print("Mean Importance: ", feat_imp[feat].mean())
    print("Median Importance: ", np.median(feat_imp[feat]), "\n")

FOR MODEL # 2203
Max Importance:  0.34805993310441363
Min Importance:  2.8632880017066257e-05
Mean Importance:  0.023809523809523808
Median Importance:  0.0032184990460810345 

FOR MODEL # 2179
Max Importance:  0.3480719510613346
Min Importance:  1.8640704522204286e-05
Mean Importance:  0.023809523809523808
Median Importance:  0.002857378448609946 

FOR MODEL # 2155
Max Importance:  0.3481099665058965
Min Importance:  4.222091613234322e-05
Mean Importance:  0.023809523809523808
Median Importance:  0.003012635743279813 

FOR MODEL # 2080
Max Importance:  0.34810848792387716
Min Importance:  1.864070452220401e-05
Mean Importance:  0.03571428571428572
Median Importance:  0.0065763805623032575 

FOR MODEL # 1382
Max Importance:  0.3514819233086173
Min Importance:  3.1323941393568315e-05
Mean Importance:  0.03571428571428571
Median Importance:  0.006461577049542311 

FOR MODEL # 1126
Max Importance:  0.3872993480522112
Min Importance:  0.0
Mean Importance:  0.03571428571428572
Median Import

# 4.0 Final Model

Using the above tuning and evaluation from Section 3, model number 2080 from tuning group 5 was chosen as a final model. 

## 4.1 Train best model and save results. 

The parameters are shown below. The model is trained with the same data and saved into a pickle file. 

In [49]:
final_model = 2080

df = pd.read_csv("GBDT_log.csv")
pd.set_option('display.float_format', None)

df = df.iloc[final_model]
df[11:-2]

scaler                        standard
first_window                      30.0
first_window_type                  abs
second_window                      NaN
second_window_type                 avg
drop_nan_train                    True
min_periods                        NaN
mod_loss                 squared_error
mod_learning_rate                  0.2
mod_n_estimators                   400
mod_subsample                      1.0
mod_min_samples_split                2
mod_min_samples_leaf                 1
mod_max_depth                        3
mod_validation                     NaN
Name: 2080, dtype: object

In [50]:
#Final parameter list
parameters = {'scaler': 'standard', 
            'first_window': 30, 
            'first_window_type': 'abs', #options: avg, dif, acc, abs
            'second_window': None, 
            'second_window_type': "avg", 
            'drop_nan_train': True,
            'min_periods': None, 
            'mod_loss': 'squared_error', 
            'mod_learning_rate': 0.2, 
            'mod_n_estimators': 400, 
            'mod_subsample': 1.0,
            'mod_min_samples_split': 2, 
            'mod_min_samples_leaf': 1, 
            'mod_max_depth': 3, 
            'mod_validation': None  #requires integer as input - number of iterations with no change
            }

Input data and model are built/defined in block below. Model is also fit. 

In [51]:
train, test, y_train, shapes = process_data(parameters, 
                                            raw_train, 
                                            raw_test, 
                                            raw_y_train
                                           )
GBDT = make_and_train_GBDT(parameters, train.iloc[:,2:], y_train)

Model is saved to a pickle file for future use. Predicted values are also saved. 

In [54]:
# with open('GBDT_model_trained.pkl', 'wb') as f:
#     pickle.dump(GBDT, f)

y_hat = GBDT.predict(test.iloc[:,2:])

# np.save('GBDT_model_trained_test_predictions.npy', y_hat)

In [55]:
np.sqrt(mean_squared_error(y_hat, y_test)).item()

14.488147304127896

## 4.2 Load saved models and results for future use

In [65]:
with open('GBDT_model_trained.pkl', 'rb') as f:
    GBDT = pickle.load(f)
    
y_hat = np.load('GBDT_model_trained_test_predictions.npy')

Ensure the predicts the same as the saved predictions.

In [66]:
print(np.sqrt(mean_squared_error(GBDT.predict(test.iloc[:,2:]), y_test)).item())
print(np.sqrt(mean_squared_error(y_hat, y_test)).item())

14.488147304127896
14.488147304127896


In [67]:
y_hat

array([114.57991939, 125.82720139,  66.01696257,  86.568706  ,
       109.56795163, 114.40896923, 101.9307254 , 100.91611769,
       114.08474266,  99.78745599,  73.16868441,  94.78639028,
        65.5192576 , 119.23026932, 110.60819126, 111.3250461 ,
        53.76055598,  33.59269857,  96.15978725,  16.00577641,
        66.92882035, 116.88979519, 115.27962852,  17.97458204,
       119.53167021, 116.26774713,  97.02631986,  98.58303633,
       100.75082514, 102.96715719,  14.61926489,  41.70619637,
        90.45756868,   3.08408637,  12.09906439,  16.49987447,
        16.89510525,  56.44937932, 124.62681799,  31.39206463,
        42.4671207 ,   9.34325517,  60.56036866,  82.76508726,
        67.19159649,  50.21126618, 134.27703579,  94.93770772,
        12.75531386, 104.00673144, 106.72653079,  16.4831585 ,
        35.17127304, 124.63342908, 117.49545156,  17.51717158,
       110.54926563,  49.26994803, 122.56278014, 106.02142777,
        32.90560977,  46.42070273,  66.88289792,  26.35