<h2>Regression Models with Keras. Part B</h2>
<h3>Objective for this Notebook:</h3>    
Repeat Regression Models with Keras Part A but use a normalized version of the data. Recall that one way to normalize the data is by subtracting the mean from the individual predictors and dividing by the standard deviation.  

How does the mean of the mean squared errors compare to that from Part A?  

<h3>Concrete Data:</h3>    
The data can be found here: 
https://cocl.us/concrete_data
   
 
       


## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
    
1. <a href="#item21">Prepare the data: Download, Check and Split the Dataset</a>  
2. <a href="#item22">Import Keras</a>  
3. <a href="#item23">Build a Neural Network</a>  
4. <a href="#item24">Train and Test the Network</a>  
5. <a href="#item25">Evaluate the model</a>      

</font>
</div>


<a id="item21"></a>

## 1. Prepare the data: Download, Check and Split the Dataset

#### 1.1 Download the data 

Import the <em>pandas</em> and NumPy libraries.


In [1]:
# Uncomment the following if running on desktop:
#!pip install numpy==1.21.4
#!pip install pandas==1.3.4
#!pip install keras==2.1.6

In [2]:
import pandas as pd
import numpy as np

Download the data and read it into a <em>pandas</em> dataframe.

In [3]:
concrete_data = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv')
concrete_data.head()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


The dataset is about the compressive strength of different samples of concrete based on the volumes of the different ingredients that were used to make them. Ingredients include:

1. Cement
2. Blast Furnace Slag
3. Fly Ash
4. Water
5. Superplasticizer
6. Coarse Aggregate
7. Fine Aggregate


#### 1.2 Check the data 

Check how many data points we have.

In [4]:
concrete_data.shape

(1030, 9)

Check the dataset for any missing values.

In [5]:
concrete_data.describe()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [6]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

#### 1.3 Split the data into predictors and target


In [7]:
concrete_data_columns = concrete_data.columns

predictors = concrete_data[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
target = concrete_data['Strength'] # Strength column
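The same separation can also be written with `DataFrame.drop`. The following is a minimal sketch on a small made-up frame (the `toy` names and values are illustrative assumptions, not rows from the concrete dataset); it checks that both selections give the same predictors.

```python
import pandas as pd

# Toy frame with made-up values; only the column layout mirrors the real data.
toy = pd.DataFrame({
    "Cement": [540.0, 332.5, 198.6],
    "Water": [162.0, 228.0, 192.0],
    "Strength": [79.99, 40.27, 44.30],
})

# Boolean-mask selection, as in the cell above.
cols = toy.columns
toy_predictors_mask = toy[cols[cols != "Strength"]]

# Equivalent selection with drop(); the original frame is left unchanged.
toy_predictors_drop = toy.drop(columns=["Strength"])
toy_target = toy["Strength"]

print(toy_predictors_drop.columns.tolist())
print(toy_predictors_mask.equals(toy_predictors_drop))
```

Either form works; `drop` tends to read more directly when only one column is excluded.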

Sanity check of the predictors and the target dataframes.


In [8]:
predictors.head()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [9]:
target.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

Normalize the data by subtracting the mean and dividing by the standard deviation.


In [10]:
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,0.862735,-1.217079,-0.279597
1,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,1.055651,-1.217079,-0.279597
2,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,3.55134
3,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,5.055221
4,-0.790075,0.678079,-0.846733,0.488555,-1.038638,0.070492,0.647569,4.976069
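As a check on the arithmetic, the first Cement value above is (540.0 - 281.167864) / 104.506364 ≈ 2.4767, using the mean and standard deviation reported by `describe()`. The sketch below applies the same formula to a small made-up frame (the `toy` values are assumptions for illustration) and confirms that each normalized column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Cement": [540.0, 332.5, 198.6, 266.0],
    "Age": [28.0, 270.0, 360.0, 90.0],
})

# pandas broadcasts the per-column mean and std across all rows.
# DataFrame.std() uses the sample standard deviation (ddof=1) by default.
toy_norm = (toy - toy.mean()) / toy.std()

col_means = toy_norm.mean()
col_stds = toy_norm.std()
print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))
```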


Save the number of predictors to *n_cols* since we will need this number when building our network.


In [11]:
n_cols = predictors_norm.shape[1] # number of predictors

<a id="item22"></a>

## 2. Import Keras 
Import Keras and the packages from the Keras library that we will need to build our regression model.

In [12]:
import keras

Using TensorFlow backend.


In [13]:
from keras.models import Sequential
from keras.layers import Dense

<a id="item23"></a>

## 3. Build a Neural Network

Define a function that builds and compiles our regression model so that we can conveniently call it whenever we need a new model.

- One hidden layer of 10 nodes, and a ReLU activation function

- Use the adam optimizer and the mean squared error as the loss function.


In [14]:
# define regression model
def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model
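To make the architecture concrete, here is a minimal NumPy sketch of the computation this network performs on a batch of inputs. It is not the Keras implementation: the weights are random placeholders, and the names `W1`, `b1`, `W2`, `b2`, `forward`, and `X_demo` are assumptions introduced for illustration. It also counts the trainable parameters for 8 predictors and 10 hidden nodes.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 8   # number of predictors in this notebook (n_cols)
n_hidden = 10    # nodes in the single hidden layer

# Randomly initialised weights, standing in for what training would learn.
W1 = rng.normal(size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(X):
    """Dense(10, relu) followed by Dense(1) with no output activation."""
    hidden = np.maximum(0.0, X @ W1 + b1)  # ReLU
    return hidden @ W2 + b2                # linear output for regression

X_demo = rng.normal(size=(5, n_features))  # 5 made-up, already-normalized rows
y_demo = forward(X_demo)

# 8*10 weights + 10 biases + 10*1 weights + 1 bias
n_params = W1.size + b1.size + W2.size + b2.size
print(y_demo.shape, n_params)  # (5, 1) 101
```

The output layer has no activation because the target, concrete strength, is an unbounded real value.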

Call the function to create our model.


In [15]:
# build the model
model = regression_model()







<a id="item24"></a>

## 4. Train and Test the Network


4.1. Randomly split the data into training and test sets, holding out 30% of the data for testing.  
4.2. Train the model on the training data using 50 epochs.

In [16]:
# fit the model
model.fit(predictors_norm, target, validation_split=0.3, epochs=50, verbose=2)



Train on 721 samples, validate on 309 samples
Epoch 1/50




 - 0s - loss: 1648.7299 - val_loss: 1181.4954
Epoch 2/50
 - 0s - loss: 1630.2484 - val_loss: 1170.4517
Epoch 3/50
 - 0s - loss: 1611.5799 - val_loss: 1159.1297
Epoch 4/50
 - 0s - loss: 1591.8213 - val_loss: 1147.4915
Epoch 5/50
 - 0s - loss: 1571.8002 - val_loss: 1135.2472
Epoch 6/50
 - 0s - loss: 1550.3522 - val_loss: 1122.3776
Epoch 7/50
 - 0s - loss: 1527.8443 - val_loss: 1109.0256
Epoch 8/50
 - 0s - loss: 1504.0465 - val_loss: 1094.7893
Epoch 9/50
 - 0s - loss: 1479.1157 - val_loss: 1079.8594
Epoch 10/50
 - 0s - loss: 1453.2125 - val_loss: 1063.6247
Epoch 11/50
 - 0s - loss: 1425.1282 - val_loss: 1046.9714
Epoch 12/50
 - 0s - loss: 1396.2742 - val_loss: 1028.8161
Epoch 13/50
 - 0s - loss: 1365.5060 - val_loss: 1009.6894
Epoch 14/50
 - 0s - loss: 1333.4061 - val_loss: 989.5419
Epoch 15/50
 - 0s - loss: 1299.4184 - val_loss: 967.9475
Epoch 16/50
 - 0s - loss: 1263.4158 - val_loss: 945.8518
Epoch 17/50
 - 0s - loss: 1226.5697 - val_loss: 922.1422
Epoch 18/50
 - 0s - loss: 1188.0190 - 

<keras.callbacks.History at 0x7f127f278d90>
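According to the Keras documentation, `validation_split` holds out the last fraction of the rows in their given order, before any shuffling, so it is not a random split. A random 70/30 split can be sketched with NumPy alone, as below. The helper `random_split` and the demo arrays are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np

def random_split(X, y, test_fraction=0.3, seed=None):
    """Shuffle row indices, then hold out `test_fraction` of the rows."""
    rng = np.random.default_rng(seed)
    n = len(X)
    n_test = int(round(n * test_fraction))
    idx = rng.permutation(n)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Made-up data: 10 rows, 2 features.
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)

X_train, X_test, y_train, y_test = random_split(X_demo, y_demo, seed=0)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

With pandas objects such as `predictors_norm` and `target`, the same idea would need `.to_numpy()` or `.iloc` indexing; scikit-learn's `train_test_split` is a common alternative.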


<a id="item25"></a>

## 5. Evaluate the model

5.1. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength.    

In [17]:
## evaluate the model
from sklearn.metrics import mean_squared_error

# y_pred holds the model's predictions for every row of predictors_norm (training and validation rows together)
y_pred = model.predict(predictors_norm)

# Calculate Mean Squared Error (MSE) 
mse = mean_squared_error(target, y_pred)

print(f'Mean Squared Error: {mse}')



Mean Squared Error: 275.28829297644904
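The mean squared error is $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. A minimal NumPy sketch of the same calculation follows; the helper name `mse` and the demo values are assumptions for illustration. Both inputs are flattened first, because `model.predict` returns a column of shape `(n, 1)` while the target has shape `(n,)`, and subtracting those directly would broadcast to an `(n, n)` matrix.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error; both inputs are flattened to 1-D first."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([[2.0], [5.0], [9.0]])  # column vector, shaped like predict() output

# Squared errors are 1, 0 and 4, so the result is 5/3.
print(mse(y_true_demo, y_pred_demo))

# Without flattening, the shapes (3,) and (3, 1) broadcast to (3, 3).
print((y_true_demo - y_pred_demo).shape)
```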


5.2. Repeat steps 4.1, 4.2 and 5.1, 50 times, i.e., create a list of 50 mean squared errors.

In [18]:
num_repeats = 50
mse_list = []
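for _ in range(num_repeats):
    # Note: this re-fits the same model object, so each pass continues
    # training from the previous weights instead of starting from scratch.
    model.fit(predictors_norm, target, validation_split=0.3, epochs=50, verbose=0)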
    y_pred = model.predict(predictors_norm)
    mse = mean_squared_error(target, y_pred)
    # Add MSE to the list
    mse_list.append(mse)

print("List of Mean Squared Errors:")
print(mse_list)

List of Mean Squared Errors:
[173.0126932889651, 132.91013248318612, 107.36636403400351, 93.8786821695912, 86.80236225986665, 82.42504480595427, 78.95697734260592, 75.7247187790033, 70.8277863760353, 64.85133575643043, 61.32670158327704, 58.712532262119886, 56.19405482284779, 54.789488193549204, 54.06946594248254, 53.845246256645716, 54.31565016805929, 54.80047650567156, 55.71780358748601, 56.34646942985413, 56.87394894625188, 57.414034617967175, 57.67425394122681, 58.32007925430286, 58.75282030893213, 58.73503520649855, 59.080693526909705, 59.81078510733657, 59.74919777674657, 60.2239332619884, 60.72727194571895, 61.20866205295705, 61.62363084604996, 61.865514286885855, 62.37266955166058, 63.00172363394282, 63.37821312273349, 63.76698316108783, 64.62737436992536, 65.47245993355236, 65.61890852936195, 66.0503177479169, 66.65699576726847, 67.18190049133521, 67.82729958099253, 68.42562167906452, 68.51435624095029, 68.58562139794222, 68.87540863894068, 69.35842115761692]
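The loop above keeps training a single model and scores it on all rows, which is why the values drift smoothly rather than varying independently. Step 5.2 can also be read as 50 independent runs, each with a new random split, a newly created model, and an MSE computed only on the held-out rows. The sketch below shows that structure under stated assumptions: the data is synthetic, and ordinary least squares (`np.linalg.lstsq`) stands in for the Keras network so the example runs without TensorFlow. The names `fit_fresh_model`, `predict`, and `demo_mse_list` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (made up): 200 rows, 3 features, linear target plus noise.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

def fit_fresh_model(X_train, y_train):
    """Stand-in for regression_model() + fit(): least squares with an intercept."""
    A = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return coef

def predict(coef, X_new):
    """Apply the fitted coefficients (last entry is the intercept)."""
    return np.column_stack([X_new, np.ones(len(X_new))]) @ coef

demo_mse_list = []
for _ in range(5):
    # New random 70/30 split on every repeat.
    idx = rng.permutation(len(X))
    test_idx, train_idx = idx[:60], idx[60:]
    # New model on every repeat, so nothing carries over between runs.
    coef = fit_fresh_model(X[train_idx], y[train_idx])
    # Score only the held-out rows.
    residuals = y[test_idx] - predict(coef, X[test_idx])
    demo_mse_list.append(float(np.mean(residuals ** 2)))

print(demo_mse_list)
```

In the notebook, the stand-in would correspond to calling `regression_model()` inside the loop and fitting on the training rows only.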


5.3. Report the mean and the standard deviation of the mean squared errors.

In [19]:
mean_mse = np.mean(mse_list)
std_mse = np.std(mse_list)

print(f"Mean of Mean Squared Errors: {mean_mse}")
print(f"Standard Deviation of Mean Squared Errors: {std_mse}")

Mean of Mean Squared Errors: 68.37296244263399
Standard Deviation of Mean Squared Errors: 20.43967020178593
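Note that `np.std` computes the population standard deviation (`ddof=0`) by default, whereas pandas' `std()` uses the sample version (`ddof=1`). For 50 values the difference is small, but it matters when comparing numbers across tools. A minimal sketch with made-up values (`demo_mse` is an illustrative name):

```python
import numpy as np

demo_mse = [4.0, 6.0, 8.0, 10.0]  # made-up values

mean_demo = np.mean(demo_mse)
std_population = np.std(demo_mse)       # ddof=0, NumPy's default: divides by n
std_sample = np.std(demo_mse, ddof=1)   # divides by n - 1

print(mean_demo, std_population, std_sample)
```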


Normalizing the data appears to improve performance: the mean MSE decreased from 89.95 in Part A to 68.37. The standard deviation also decreased, which indicates more consistent model performance.