<h1>Introduction to Deep Learning & Neural Networks with Keras</h1>

<h2>Peer-graded Assignment: Build a Regression Model in Keras</h2>

<h3>Submission by Xander Mol</h3>

<b>Imports</b>

In [26]:
#Import statements
import pandas as pd
import numpy as np
import keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense

<b>Load, examine and prepare dataset</b>

In [2]:
#Load CSV set provided into a Pandas dataframe
concrete_data = pd.read_csv('https://cocl.us/concrete_data')

#Show first lines of result
concrete_data.head()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


In [3]:
#Show amount of data points
concrete_data.shape

(1030, 9)

In [4]:
#Show dataset statistics
concrete_data.describe()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [5]:
#Check for missing values
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

In [6]:
#Split data into predictors and target
concrete_data_columns = concrete_data.columns

predictors = concrete_data[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
target = concrete_data['Strength'] # Strength column

predictors.head()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [7]:
target.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

In [8]:
n_cols = predictors.shape[1] # number of predictors

<h3>Part A: Build a baseline model</h3>

<b>Building and testing the model</b>

In [9]:
#Define regression model given the requirement parameters:
#- One hidden layer of 10 nodes, and a ReLU activation function
#- The adam optimizer and the mean squared error as the loss function

def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

#Set loop for building baseline model 50 times with 50 random train/test splits

error = []

for x in range(50):
    
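    #Create a train/test split using the scikit-learn function; seeding with the run
    #index x gives a different split on every run (a fixed seed would repeat one split)
    X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3, random_state=x)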
    
    #Build the model
    model = regression_model()

    #Fit the model
    model.fit(X_train, y_train, epochs=50, verbose=0)
    
    #Test the model
    y_pred = model.predict(X_test)
    error.append(mean_squared_error(y_test, y_pred))
    print('The Mean Squared Error on run {} is {}.'.format(x,error[x]))

The Mean Squared Error on run 0 is 255.41045981567794.
The Mean Squared Error on run 1 is 344.095991421802.
The Mean Squared Error on run 2 is 125.65801999513812.
The Mean Squared Error on run 3 is 138.78105112787512.
The Mean Squared Error on run 4 is 385.32902554598616.
The Mean Squared Error on run 5 is 297.11226211338715.
The Mean Squared Error on run 6 is 330.67506132498823.
The Mean Squared Error on run 7 is 205.7465356678943.
The Mean Squared Error on run 8 is 100.9209921891931.
The Mean Squared Error on run 9 is 113.63502807779125.
The Mean Squared Error on run 10 is 125.39678647441322.
The Mean Squared Error on run 11 is 213.57733565761146.
The Mean Squared Error on run 12 is 223.75607754267145.
The Mean Squared Error on run 13 is 109.4021548258976.
The Mean Squared Error on run 14 is 296.5381907373424.
The Mean Squared Error on run 15 is 106.17246106782247.
The Mean Squared Error on run 16 is 105.30500095920333.
The Mean Squared Error on run 17 is 102.15324337062951.
...

<b>Calculate mean and standard deviations of errors</b>

In [10]:
print('The mean of the calculated Mean Squared Errors is {}'.format(np.mean(error)))
print('The standard deviation of the calculated Mean Squared Errors is {}'.format(np.std(error)))

The mean of the calculated Mean Squared Errors is 302.51654109455666
The standard deviation of the calculated Mean Squared Errors is 293.274290284197
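
Whether the 50 runs really see 50 different train/test splits depends on the seed handed to the splitter. The following is a minimal NumPy-only sketch of that effect. `split_indices` is a hypothetical helper written for illustration, not scikit-learn's implementation; it only mimics the idea of shuffling row indices and cutting off a test set.

```python
import numpy as np

def split_indices(n_samples, test_size, seed):
    """Hypothetical helper: shuffle row indices and cut off a test set."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_size))
    return shuffled[n_test:], shuffled[:n_test]  # train indices, test indices

# Same seed on every run: the test set never changes.
fixed = [split_indices(1030, 0.3, seed=42)[1] for _ in range(3)]
fixed_identical = all(np.array_equal(fixed[0], t) for t in fixed)

# Run index as seed: each run draws its own test set.
varied = [split_indices(1030, 0.3, seed=run)[1] for run in range(3)]
varied_identical = all(np.array_equal(varied[0], t) for t in varied)

print(fixed_identical, varied_identical)
```

With a fixed seed, any spread in the MSE values can only come from other random sources such as the initial network weights; with a per-run seed, the split contributes to the spread as well.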


<h3>Part B: Normalize the data</h3>

<b>Normalize the data</b>

In [11]:
#Normalize the data by subtracting the mean and dividing by the standard deviation.
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,0.862735,-1.217079,-0.279597
1,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,1.055651,-1.217079,-0.279597
2,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,3.55134
3,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,5.055221
4,-0.790075,0.678079,-0.846733,0.488555,-1.038638,0.070492,0.647569,4.976069
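
As a quick sanity check on this kind of normalisation, the sketch below applies the same formula to a small made-up matrix with NumPy and confirms that every column ends up with mean 0 and standard deviation 1. The values in `X` are illustrative only and not taken from the dataset; `ddof=1` is used because pandas' `.std()` computes the sample standard deviation by default.

```python
import numpy as np

# Toy predictor matrix (made-up values, two columns on very different scales).
X = np.array([[540.0, 28.0],
              [332.5, 270.0],
              [198.6, 360.0],
              [266.0, 90.0]])

# ddof=1 gives the sample standard deviation, matching pandas' default for .std().
X_norm = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

col_means = X_norm.mean(axis=0)
col_stds = X_norm.std(axis=0, ddof=1)
print(np.round(col_means, 6), np.round(col_stds, 6))
```

After this step all predictors are on a comparable scale, so no single column dominates the weighted sums in the first layer just because of its units.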


<b>Repeat steps of part A, now with the normalized data.</b>

In [12]:
#Define regression model given the requirement parameters:
#- One hidden layer of 10 nodes, and a ReLU activation function
#- The adam optimizer and the mean squared error as the loss function

def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

#Train and evaluate the model 50 times, each with a random train/test split

errorB = []

for x in range(50):
    
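    #Create a train/test split using the scikit-learn function; seeding with the run
    #index x gives a different split on every run (a fixed seed would repeat one split)
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=x)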
    
    #Build the model
    model = regression_model()

    #Fit the model
    model.fit(X_train, y_train, epochs=50, verbose=0)
    
    #Test the model
    y_pred = model.predict(X_test)
    errorB.append(mean_squared_error(y_test, y_pred))
    print('The Mean Squared Error on run {} is {}.'.format(x,errorB[x]))
    
print('The mean of the calculated Mean Squared Errors is {}'.format(np.mean(errorB)))
print('The standard deviation of the calculated Mean Squared Errors is {}'.format(np.std(errorB)))

The Mean Squared Error on run 0 is 308.6051466082772.
The Mean Squared Error on run 1 is 518.5970739640179.
The Mean Squared Error on run 2 is 259.83636650333915.
The Mean Squared Error on run 3 is 262.7961751774522.
The Mean Squared Error on run 4 is 347.52544647772527.
The Mean Squared Error on run 5 is 278.16250450133015.
The Mean Squared Error on run 6 is 402.6685292996889.
The Mean Squared Error on run 7 is 379.7382342358728.
The Mean Squared Error on run 8 is 294.8534523990761.
The Mean Squared Error on run 9 is 235.36170280704707.
The Mean Squared Error on run 10 is 260.88744868813006.
The Mean Squared Error on run 11 is 232.84275862859056.
The Mean Squared Error on run 12 is 251.25070839891225.
The Mean Squared Error on run 13 is 241.97782921551567.
The Mean Squared Error on run 14 is 285.5628803479116.
The Mean Squared Error on run 15 is 282.3559009303809.
The Mean Squared Error on run 16 is 449.39685882533854.
The Mean Squared Error on run 17 is 818.1808395249852.
...

<b>Conclusion</b>

A: Without normalisation, the mean MSE is not higher than with normalised data, but its spread across runs is much larger. The outcome of a single run therefore depends heavily on chance, for example on an unlucky train/test split.

B: With normalised data, the mean MSE changes little (it is even slightly higher here), but the spread is much smaller, which greatly reduces the risk of an unlucky run.


<h3>Part C: Increase the number of epochs</h3>

<b>Repeat steps of part B, now with 100 epochs instead of 50.</b>

In [13]:
#Define regression model given the requirement parameters:
#- One hidden layer of 10 nodes, and a ReLU activation function
#- The adam optimizer and the mean squared error as the loss function

def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

#Train and evaluate the model 50 times, each with a random train/test split

errorC = []

for x in range(50):
    
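    #Create a train/test split using the scikit-learn function; seeding with the run
    #index x gives a different split on every run (a fixed seed would repeat one split)
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=x)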
    
    #Build the model
    model = regression_model()

    #Fit the model
    model.fit(X_train, y_train, epochs=100, verbose=0)
    
    #Test the model
    y_pred = model.predict(X_test)
    errorC.append(mean_squared_error(y_test, y_pred))
    print('The Mean Squared Error on run {} is {}.'.format(x,errorC[x]))
    
print('The mean of the calculated Mean Squared Errors is {}'.format(np.mean(errorC)))
print('The standard deviation of the calculated Mean Squared Errors is {}'.format(np.std(errorC)))

The Mean Squared Error on run 0 is 164.70283542411582.
The Mean Squared Error on run 1 is 167.31431977087666.
The Mean Squared Error on run 2 is 147.64225578648353.
The Mean Squared Error on run 3 is 155.84248541633363.
The Mean Squared Error on run 4 is 176.12245454832367.
The Mean Squared Error on run 5 is 152.9303164686516.
The Mean Squared Error on run 6 is 175.13502732544626.
The Mean Squared Error on run 7 is 159.52822460861123.
The Mean Squared Error on run 8 is 144.82155597021543.
The Mean Squared Error on run 9 is 142.29354571410346.
The Mean Squared Error on run 10 is 167.6231660080155.
The Mean Squared Error on run 11 is 154.46874413345753.
The Mean Squared Error on run 12 is 160.98163496470775.
The Mean Squared Error on run 13 is 155.67840124298743.
The Mean Squared Error on run 14 is 159.61787813087162.
The Mean Squared Error on run 15 is 170.63777592106968.
The Mean Squared Error on run 16 is 151.5377538080527.
The Mean Squared Error on run 17 is 146.72958683801093.
...

<b>Conclusion:</b>
    
C: Increasing the number of epochs from 50 to 100 substantially decreased the mean MSE, to roughly half of that in parts A and B. The spread decreased even more dramatically, by roughly a factor of ten compared to part B. The conclusion is that training for more epochs clearly improves the prediction quality of the model compared to the base situation B.
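
The "half" and "tenfold" figures can be checked with a little arithmetic. The sketch below simply divides the part B values by the part C values, using the mean and standard deviation reported in the summary table of the recap; the variable names are chosen here for illustration.

```python
# Summary values copied from the recap table (parts B and C).
mean_b, std_b = 321.086687, 97.488763
mean_c, std_c = 157.461195, 9.470662

mean_ratio = mean_b / mean_c   # how many times smaller the mean MSE became
std_ratio = std_b / std_c      # how many times smaller the spread became

print(round(mean_ratio, 2), round(std_ratio, 2))
```

The mean MSE shrinks by a factor of about 2.0 and the standard deviation by a factor of about 10.3.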

<h3>Part D: Increase the number of hidden layers</h3>

<b>Repeat steps of part B, now with three hidden layers, each of 10 nodes and ReLU activation function</b>

In [14]:
#Define regression model given the requirement parameters

def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

#Train and evaluate the model 50 times, each with a random train/test split

errorD = []

for x in range(50):
    
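    #Create a train/test split using the scikit-learn function; seeding with the run
    #index x gives a different split on every run (a fixed seed would repeat one split)
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=x)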
    
    #Build the model
    model = regression_model()

    #Fit the model
    model.fit(X_train, y_train, epochs=50, verbose=0)
    
    #Test the model
    y_pred = model.predict(X_test)
    errorD.append(mean_squared_error(y_test, y_pred))
    print('The Mean Squared Error on run {} is {}.'.format(x,errorD[x]))
    
print('The mean of the calculated Mean Squared Errors is {}'.format(np.mean(errorD)))
print('The standard deviation of the calculated Mean Squared Errors is {}'.format(np.std(errorD)))

The Mean Squared Error on run 0 is 144.24521363651405.
The Mean Squared Error on run 1 is 143.92361396193485.
The Mean Squared Error on run 2 is 127.17493497170618.
The Mean Squared Error on run 3 is 119.89260025244563.
The Mean Squared Error on run 4 is 133.8050248938132.
The Mean Squared Error on run 5 is 128.78857266549446.
The Mean Squared Error on run 6 is 108.64030159542895.
The Mean Squared Error on run 7 is 86.18459737832804.
The Mean Squared Error on run 8 is 90.54050248921946.
The Mean Squared Error on run 9 is 123.05930688593575.
The Mean Squared Error on run 10 is 141.18233583368533.
The Mean Squared Error on run 11 is 142.12638578711883.
The Mean Squared Error on run 12 is 104.81863932069743.
The Mean Squared Error on run 13 is 133.37437009913407.
The Mean Squared Error on run 14 is 93.5314012640949.
The Mean Squared Error on run 15 is 132.4113716241138.
The Mean Squared Error on run 16 is 131.5045678614935.
The Mean Squared Error on run 17 is 143.19920949881296.
...

<b>Conclusion:</b>
    
D: Adding hidden layers has an effect similar to increasing the number of epochs. It reduces the mean MSE slightly more than part C does, but the spread is slightly larger than in part C. The conclusion is that adding hidden layers clearly improves the prediction quality of the model compared to the base situation B.
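
One way to see what the extra layers add is to count trainable parameters. The sketch below does this by hand for fully connected layers (weights plus one bias per node). `dense_param_count` is a hypothetical helper introduced only for this illustration, and `n_cols = 8` assumes the eight predictor columns shown earlier.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a stack of fully connected layers.

    layer_sizes lists the input width followed by the width of every layer.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

n_cols = 8  # number of predictors in the concrete dataset

one_hidden = dense_param_count([n_cols, 10, 1])            # parts A to C
three_hidden = dense_param_count([n_cols, 10, 10, 10, 1])  # part D

print(one_hidden, three_hidden)  # 101 321
```

The part D network has roughly three times as many parameters as the baseline, which gives it more capacity to fit the data within the same 50 epochs.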


<h3>Recap and conclusion</h3>

In [36]:
#Show table to summarise error stats per part of the assignment
errortable = [ ['A', np.mean(error), np.std(error)],
               ['B', np.mean(errorB), np.std(errorB)],
               ['C', np.mean(errorC), np.std(errorC)],
               ['D', np.mean(errorD), np.std(errorD)]]

errordf = pd.DataFrame(errortable, columns = ['Part', 'Mean MSE', 'Std MSE'])
errordf.set_index('Part')

Part,Mean MSE,Std MSE
A,302.516541,293.27429
B,321.086687,97.488763
C,157.461195,9.470662
D,123.745451,15.780066


<b>Conclusion recap</b>

A: Without normalisation, the mean MSE (about 302.5) is not higher than with normalised data, but its spread across runs is much larger (standard deviation about 293.3). The outcome of a single run therefore depends heavily on chance, for example on an unlucky train/test split.

B: With normalised data, the mean MSE changes little (it is even slightly higher here, about 321.1), but the spread is much smaller (about 97.5), which greatly reduces the risk of an unlucky run.

C: Increasing the number of epochs from 50 to 100 substantially decreased the mean MSE, to roughly half of that in parts A and B. The spread decreased even more dramatically, by roughly a factor of ten compared to part B. The conclusion is that training for more epochs clearly improves the prediction quality of the model compared to the base situation B.

D: Adding hidden layers has an effect similar to increasing the number of epochs. It reduces the mean MSE slightly more than part C does, but the spread is slightly larger than in part C. The conclusion is that adding hidden layers clearly improves the prediction quality of the model compared to the base situation B.

Comparing C to D shows that adding epochs has an effect similar to adding hidden layers: D reaches the lower mean MSE (about 123.7 versus 157.5), while C has the smaller spread (about 9.5 versus 15.8), so neither is clearly better on both measures. It may be worthwhile to test a combination of the two for further optimisation.