# Build a Regression Model in Keras

## Download and Clean Dataset

Let's start by importing the pandas and NumPy libraries.

In [14]:
import pandas as pd
import numpy as np

We will be playing around with the same dataset that we used in the videos.

<strong>The dataset is about the compressive strength of different samples of concrete based on the volumes of the different ingredients that were used to make them. Ingredients include:</strong>

<strong>1. Cement</strong>

<strong>2. Blast Furnace Slag</strong>

<strong>3. Fly Ash</strong>

<strong>4. Water</strong>

<strong>5. Superplasticizer</strong>

<strong>6. Coarse Aggregate</strong>

<strong>7. Fine Aggregate</strong>

Let's download the data and read it into a pandas dataframe.

In [15]:
concrete_data = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv')
concrete_data.head()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


#### Let's check how many data points we have.

In [16]:
concrete_data.shape

(1030, 9)

So, there are approximately 1000 samples to train our model on. With so few samples, we have to be careful not to overfit the training data.

Let's look at the summary statistics and check the dataset for any missing values.

In [17]:
concrete_data.describe()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [18]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

The data looks very clean and is ready to be used to build our model.


#### Split data into predictors and target

The target variable in this problem is the concrete sample strength. Our predictors will be the seven ingredient columns: Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, and Fine Aggregate. Note that the Age column is not used as a predictor here.

In [19]:
predictors = concrete_data[['Cement','Blast Furnace Slag','Fly Ash','Water','Superplasticizer','Coarse Aggregate','Fine Aggregate']]
target = concrete_data['Strength'] # Strength column

Let's do a quick sanity check of the predictors dataframe and the target series.

In [20]:
predictors.head()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5


In [21]:
target.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

Let's save the number of predictors to n_cols since we will need this number when building our network.

In [22]:
n_cols = predictors.shape[1] # number of predictors

## A. Build a baseline model

Let's import the packages we need from the Keras library.

In [11]:
import keras
from keras.models import Sequential
from keras.layers import Dense

Using TensorFlow backend.



#### Build a Neural Network

Let's define a function that defines our regression model for us so that we can conveniently call it to create our model.

In [12]:
# define regression model a
def regression_model_a():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model
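
To make the architecture concrete, here is a minimal NumPy sketch of the computation that a 10-node ReLU hidden layer followed by a single linear output node performs. The weights and inputs are random placeholders (assumptions for illustration), not the values Keras would learn, and the helper names `relu` and `forward` are hypothetical.

```python
import numpy as np

def relu(x):
    # ReLU keeps positive values and replaces negatives with zero
    return np.maximum(x, 0.0)

def forward(X, W1, b1, W2, b2):
    # hidden layer: 10 nodes with ReLU activation
    hidden = relu(X @ W1 + b1)   # shape (n_samples, 10)
    # output layer: one node, no activation (regression output)
    return hidden @ W2 + b2      # shape (n_samples, 1)

rng = np.random.default_rng(0)
n_cols = 7                                   # number of predictors used above
W1 = rng.normal(size=(n_cols, 10))           # placeholder weights, not trained
b1 = np.zeros(10)
W2 = rng.normal(size=(10, 1))
b2 = np.zeros(1)

X = rng.normal(size=(5, n_cols))             # five made-up input rows
y_hat = forward(X, W1, b1, W2, b2)
print(y_hat.shape)
```

Each row of the input produces exactly one predicted value, which is why the last layer has a single node and no activation.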

#### Train and Test the Network

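Next, we will train and evaluate the model using the fit method. Each of the 50 iterations builds a fresh model, randomly holds out 30% of the data as a test set, trains for 50 epochs, and records the mean squared error on the held-out data.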

In [19]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

list_mean_squared_error_a = []

for i in range(50):
    
    # build the model a
    model_a = regression_model_a()

    # split Train/Test dataset
    X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3)

    # fit the model
    model_a.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=200, verbose=0)

    # prediction
    predicted_y = model_a.predict(X_test)

    # calculate mean_squared_error
    mse = mean_squared_error(y_test, predicted_y)
    
    # append the mean_squared_error to the list
    list_mean_squared_error_a.append(mse)
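
The split in the loop above is random and unseeded, so the list of errors will differ from run to run. The following is a rough NumPy-only sketch of what a seeded 70/30 shuffle split does; `split_indices` is a hypothetical helper, not part of scikit-learn, and the seed value is an arbitrary assumption.

```python
import numpy as np

def split_indices(n_samples, test_fraction, seed):
    # shuffle the row positions reproducibly, then cut off the test part
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(np.ceil(test_fraction * n_samples))
    return order[n_test:], order[:n_test]

train_idx, test_idx = split_indices(1030, 0.3, seed=0)
print(len(train_idx), len(test_idx))

# the same seed gives the same split again
train_again, test_again = split_indices(1030, 0.3, seed=0)
print(np.array_equal(test_idx, test_again))
```

With scikit-learn itself, passing a different `random_state` to `train_test_split` in each iteration would be one way to keep the 50 splits distinct but repeatable.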

In [20]:
print(list_mean_squared_error_a)

[14254.812329019782, 5210.099913534312, 2206.2123609454957, 435.3783659152507, 4526.215799363006, 8882.411912389614, 6815.300331046927, 1715.3271309626014, 3461.7028576143457, 239.94475643226014, 8348.504155516172, 4075.3003681703144, 1101.619808127377, 1819.4295486509434, 1131.6937247104486, 7412.98860732725, 1653.271395119261, 929.2118739900956, 1005.6605777557676, 12950.519498390036, 616.8546918777123, 401.91507484359715, 259.1064215093184, 164.2058625233736, 11632.571307444325, 1305.4935834490605, 7534.328494095634, 1061.5202530084632, 709.4943970277664, 1902.6327513013302, 2389.9430861581754, 263.24563639471046, 8530.407844604842, 680.4314691663091, 2528.895084518088, 9688.063981700237, 858.1906509022039, 33378.13767558322, 1680.7614897267322, 719.5628564395112, 545.9972824131454, 5128.859824850969, 2357.059221449216, 1224.3691333101522, 1585.4569124940872, 2845.737217362201, 1327.3438132735896, 6061.56891429838, 3929.277464901644, 1048.6022771149892]


Report the mean and the standard deviation of the mean squared errors.

In [26]:
print("The mean of the mean squared errors of model a: {}".format(np.mean(list_mean_squared_error_a))) 
print("The standard deviation of the mean squared errors of model a: {}".format(np.std(list_mean_squared_error_a))) 

The mean of the mean squared errors of model a: 4010.7127997744847
The standard deviation of the mean squared errors of model a: 5491.347688532758
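
One detail worth keeping in mind when reporting the spread: `np.std` computes the population standard deviation by default (`ddof=0`), while `ddof=1` gives the sample standard deviation. A small sketch with made-up error values (the numbers below are hypothetical, not taken from the run above):

```python
import numpy as np

mse_values = [120.0, 95.0, 143.0, 110.0]   # hypothetical MSE values

mean_mse = np.mean(mse_values)
population_std = np.std(mse_values)          # divides by n
sample_std = np.std(mse_values, ddof=1)      # divides by n - 1

print(mean_mse, population_std, sample_std)
```

With 50 repetitions the two values are close, but it helps to state which one is being reported.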


### Use one model for all iterations:

In [23]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

list_mean_squared_error_a2 = []

# build the model a2
model_a2 = regression_model_a()
    
for i in range(50):
    

    # split Train/Test dataset
    X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3)

    # fit the model
    model_a2.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=200, verbose=0)

    # prediction
    predicted_y = model_a2.predict(X_test)

    # calculate mean_squared_error
    mse = mean_squared_error(y_test, predicted_y)
    
    print(i, mse)
    
    # append the mean_squared_error to the list
    list_mean_squared_error_a2.append(mse)

0 8623.258005808888
1 2349.427536374766
2 1845.0431621535004
3 1768.7565690093888
4 1482.0604105283246
5 1390.0175609907992
6 1306.9694845588765
7 1236.6282063977094
8 1013.1611977284341
9 986.8938278007666
10 761.893020752844
11 766.9743768919915
12 538.887846635419
13 542.974163296708
14 519.5328099129738
15 428.93244213384634
16 410.87201247551553
17 371.41951589663995
18 302.6181721110425
19 266.38811672635035
20 253.01805770959305
21 218.9761002057276
22 214.9268083872216
23 208.60024803430457
24 193.86804333501556
25 166.1188141390311
26 174.91112786063786
27 157.82396129771593
28 173.08235227574517
29 179.5477669510943
30 147.94490023639054
31 156.9283455896055
32 161.74136261500496
33 154.11491064209116
34 147.1791881996213
35 158.35295640124784
36 148.77984674121532
37 144.33097908424733
38 168.54588931742546
39 140.1346361254324
40 162.3648706910635
41 167.33220898663134
42 155.13617931105415
43 137.3296143494658
44 165.4955824093595
45 148.01549515246043
46 150.0521829941127

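When we use one model for all iterations, the model keeps its trained weights between calls to fit, so each iteration continues training where the previous one stopped and the mean squared error gets smaller and smaller. Because the data is re-split in every iteration, later test sets also contain samples the model has already been trained on, so these errors are optimistic. If we want to perform a completely new training for each iteration, we should initialize a new model for each iteration instead of keeping the same model.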
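
The effect can be illustrated without Keras. The sketch below uses a tiny linear model trained by gradient descent as a stand-in; the class `TinyLinearModel`, its parameters, and the synthetic data are all assumptions for illustration. The only point is that weights persist across repeated `fit` calls unless a new model is created.

```python
import numpy as np

class TinyLinearModel:
    """Hypothetical stand-in for a Keras model: weights persist across fit() calls."""

    def __init__(self, n_features):
        self.w = np.zeros(n_features)

    def fit(self, X, y, steps=20, lr=0.05):
        # plain gradient descent on mean squared error, starting from the current weights
        for _ in range(steps):
            grad = 2.0 * X.T @ (X @ self.w - y) / len(y)
            self.w -= lr * grad

    def mse(self, X, y):
        return float(np.mean((X @ self.w - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # synthetic data
y = X @ np.array([1.5, -2.0, 0.5])

reused = TinyLinearModel(3)
reused_errors, fresh_errors = [], []
for _ in range(5):
    reused.fit(X, y)                          # continues from the previous iteration
    reused_errors.append(reused.mse(X, y))

    fresh = TinyLinearModel(3)                # starts from scratch every time
    fresh.fit(X, y)
    fresh_errors.append(fresh.mse(X, y))

print(reused_errors)
print(fresh_errors)
```

The reused model's error keeps shrinking across iterations, while the freshly created model reaches the same error each time.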

## B. Normalize the data

Repeat Part A but use a normalized version of the data. Recall that one way to normalize the data is by subtracting the mean from the individual predictors and dividing by the standard deviation.

In [27]:
# Normalize the data
predictors_norm = (predictors - predictors.mean()) / predictors.std()
    
list_mean_squared_error_b = []

for i in range(50):
    
    # build a fresh model (same architecture as model a)
    model_b = regression_model_a()

    # split Train/Test dataset
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)

    # fit the model
    model_b.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=200, verbose=0)

    # prediction
    predicted_y = model_b.predict(X_test)

    # calculate mean_squared_error
    mse = mean_squared_error(y_test, predicted_y)
    
    # append the mean_squared_error to the list
    list_mean_squared_error_b.append(mse)

Report the mean and the standard deviation of the mean squared errors.

In [28]:
print("The mean of the mean squared errors of model b: {}".format(np.mean(list_mean_squared_error_b))) 
print("The standard deviation of the mean squared errors of model b: {}".format(np.std(list_mean_squared_error_b))) 

The mean of the mean squared errors of model b: 1399.8211761141522
The standard deviation of the mean squared errors of model b: 72.79052008587949


#### How does the mean of the mean squared errors compare to that from Step A?

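Both the mean and the standard deviation of the mean squared errors are smaller for model b than for model a: the mean drops from about 4010.7 to about 1399.8, and the standard deviation from about 5491.3 to about 72.8. This suggests that, with the predictors on a common scale, training is much less sensitive to the random weight initialization and the random split, so the results are both better on average and far more consistent.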
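
Part B normalizes with the mean and standard deviation of the whole dataset, as the exercise asks. A common variant, sketched below under the assumption that the test set should stay unseen, computes the statistics on the training split only and reuses them for the test split. The helper names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_standardizer(X_train):
    # statistics come from the training rows only
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)   # ddof=1 matches the pandas default for .std()
    return mean, std

def standardize(X, mean, std):
    return (X - mean) / std

rng = np.random.default_rng(0)
# two made-up predictor columns on very different scales
X = rng.normal(loc=[300.0, 180.0], scale=[100.0, 20.0], size=(100, 2))
X_train, X_test = X[:70], X[70:]

mean, std = fit_standardizer(X_train)
X_train_norm = standardize(X_train, mean, std)
X_test_norm = standardize(X_test, mean, std)   # reuses the training statistics

print(X_train_norm.mean(axis=0), X_train_norm.std(axis=0, ddof=1))
```

After this step the training columns have mean 0 and standard deviation 1, while the test columns are only approximately standardized, which is expected.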

## C. Increase the number of epochs

Repeat Part B but use 100 epochs this time for training.

In [29]:
list_mean_squared_error_c = []

for i in range(50):
    
    # build a fresh model (same architecture as model a)
    model_c = regression_model_a()

    # split Train/Test dataset
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)

    # fit the model
    model_c.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, batch_size=200, verbose=0)

    # prediction
    predicted_y = model_c.predict(X_test)

    # calculate mean_squared_error
    mse = mean_squared_error(y_test, predicted_y)
    
    # append the mean_squared_error to the list
    list_mean_squared_error_c.append(mse)

Report the mean and the standard deviation of the mean squared errors.

In [30]:
print("The mean of the mean squared errors of model c: {}".format(np.mean(list_mean_squared_error_c))) 
print("The standard deviation of the mean squared errors of model c: {}".format(np.std(list_mean_squared_error_c))) 

The mean of the mean squared errors of model c: 1128.787041036273
The standard deviation of the mean squared errors of model c: 122.26529313929659


#### How does the mean of the mean squared errors compare to that from Step B?

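The mean of the mean squared errors of model c (about 1128.8) is smaller than that of model b (about 1399.8), which suggests the network was still underfitting after 50 epochs and benefits from more training. The standard deviation, however, rises from about 72.8 to about 122.3.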
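
A quick back-of-the-envelope calculation shows how little training either setting actually performs. The sketch assumes 721 training rows (1030 rows minus a 30% test split of 309 rows) and that the last partial batch of each epoch still counts as one update; `gradient_updates` is a hypothetical helper.

```python
import math

def gradient_updates(n_train, batch_size, epochs):
    # one weight update per batch, including the final partial batch
    batches_per_epoch = math.ceil(n_train / batch_size)
    return batches_per_epoch * epochs

n_train = 721  # assumption: 1030 rows minus 309 test rows

updates_b = gradient_updates(n_train, batch_size=200, epochs=50)
updates_c = gradient_updates(n_train, batch_size=200, epochs=100)
print(updates_b, updates_c)  # 200 400
```

With a batch size of 200 there are only four updates per epoch, so doubling the epochs doubles a fairly small number of optimizer steps. A smaller batch size would be another way to get more updates per epoch.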

## D. Increase the number of hidden layers 

Repeat Part B but use a neural network with the following instead:

- Three hidden layers, each of 10 nodes and ReLU activation function.

In [31]:
# define regression model d
def regression_model_d():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model
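
To get a feel for how much larger this network is, the number of trainable parameters of a stack of dense layers can be counted by hand: each layer has one weight per input-output pair plus one bias per output node. The helper `dense_param_count` below is a hypothetical name, and the layer sizes assume the seven predictors used in this notebook.

```python
def dense_param_count(layer_sizes):
    # weights (n_in * n_out) plus biases (n_out) for each consecutive pair of layers
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

params_a = dense_param_count([7, 10, 1])            # one hidden layer
params_d = dense_param_count([7, 10, 10, 10, 1])    # three hidden layers
print(params_a, params_d)  # 91 311
```

Calling `model.summary()` on the Keras models should report the same totals.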

In [37]:
list_mean_squared_error_d = []

for i in range(50):
    
    # build the model d
    model_d = regression_model_d()

    # split Train/Test dataset
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)

    # fit the model
    model_d.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=200, verbose=0)

    # prediction
    predicted_y = model_d.predict(X_test)

    # calculate mean_squared_error
    mse = mean_squared_error(y_test, predicted_y)
    
    print(i, mse)
    
    # append the mean_squared_error to the list
    list_mean_squared_error_d.append(mse)

0 497.47364669507505
1 580.9602738396808
2 363.89008359197715
3 362.2277551517633
4 327.8419576047078
5 336.9612648484292
6 396.48037009433506
7 431.92575724003206
8 560.7352060770709
9 811.7190339876831
10 455.8797879233869
11 280.3273037196742
12 727.2361894332123
13 804.2956149063433
14 435.6193100860439
15 409.1150430277346
16 1080.6893915881874
17 715.4008886399005
18 622.3942543215849
19 350.06001939949266
20 1002.0247542352176
21 371.28105523181256
22 425.8934158213333
23 447.32791338449965
24 1551.7763534827398
25 431.1537522930931
26 732.5122903491404
27 414.05524160785296
28 471.4934948707767
29 425.34855843008825
30 901.5636865285379
31 652.6141671844994
32 1510.2261552038
33 639.2630173692049
34 1377.5141512606115
35 599.4028856283505
36 307.9611375232856
37 509.3493945958369
38 541.1745716841461
39 411.79336647896884
40 348.55075764911606
41 293.82672493692644
42 527.9579037681615
43 362.7809753075818
44 468.9537592064883
45 309.1499749148454
46 477.5996049183303
47 581.38

Report the mean and the standard deviation of the mean squared errors.

In [5]:
print("The mean of the mean squared errors of model d: {}".format(np.mean(list_mean_squared_error_d))) 
print("The standard deviation of the mean squared errors of model d: {}".format(np.std(list_mean_squared_error_d))) 

The mean of the mean squared errors of model d: 567.0791177716301
The standard deviation of the mean squared errors of model d: 293.2009954632069


#### How does the mean of the mean squared errors compare to that from Step B?

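The mean of the mean squared errors of model d (about 567.1) is less than half that of model b (about 1399.8), which suggests that the extra hidden layers help the model fit the data within the same 50 epochs. The standard deviation is larger, though (about 293.2 versus about 72.8), so the results vary more from run to run.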
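
To put the four results in context, it can help to compare them with a trivial baseline: always predicting the mean strength gives a mean squared error close to the variance of the target. The sketch below uses the standard deviation of Strength from the describe() output and the means reported above; treating the full-dataset variance as the baseline for every test split is a simplifying assumption.

```python
import math

target_std = 16.705742                 # std of Strength from describe()
baseline_mse = target_std ** 2         # roughly 279

reported_mean_mse = {
    "a": 4010.7127997744847,
    "b": 1399.8211761141522,
    "c": 1128.787041036273,
    "d": 567.0791177716301,
}

for name, mse in reported_mean_mse.items():
    rmse = math.sqrt(mse)              # error in the same units as Strength
    ratio = mse / baseline_mse
    print(f"model {name}: mean MSE {mse:.1f}, RMSE {rmse:.1f}, {ratio:.1f}x the baseline")
```

Under this assumption, all four configurations are still above the constant-mean baseline, which indicates that the networks remain undertrained with these settings even though each step improves on the previous one.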