## Download and Clean Dataset

In [1]:
import pandas as pd
import numpy as np

<strong>The dataset is about the compressive strength of different samples of concrete based on the volumes of the different ingredients that were used to make them. Ingredients include:</strong>

<strong>1. Cement</strong>

<strong>2. Blast Furnace Slag</strong>

<strong>3. Fly Ash</strong>

<strong>4. Water</strong>

<strong>5. Superplasticizer</strong>

<strong>6. Coarse Aggregate</strong>

<strong>7. Fine Aggregate</strong>

Download data and read it into  a <em>pandas</em> dataframe.

In [2]:
concrete_data = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv')
concrete_data.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


So the first concrete sample has 540 cubic meter of cement, 0 cubic meter of blast furnace slag, 0 cubic meter of fly ash, 162 cubic meter of water, 2.5 cubic meter of superplaticizer, 1040 cubic meter of coarse aggregate, 676 cubic meter of fine aggregate. Such a concrete mix which is 28 days old, has a compressive strength of 79.99 MPa. 

#### Check how many data points we have

In [3]:
concrete_data.shape

(1030, 9)

We have approximately 1000 samples to train our model on. Because of the few samples, we have to be careful not to overfit the training data.

Check dataset for any missing values

In [4]:
concrete_data.describe()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [5]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

Dataset looks clean so we're ready to be used to build our model.

#### Split dataset into predictors X and target y

The target variable in this problem is the concrete sample strength. Therefore, our predictors will be all the other columns.

In [6]:
concrete_data_columns = concrete_data.columns

X = concrete_data[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
y = concrete_data['Strength'] # Strength column

Checking predictors and target dataframes

In [7]:
X.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [8]:
y.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

Lastly, normalize the data by substracting the mean and dividing by the standard deviation.

In [9]:
X_norm = (X - X.mean()) / X.std()
X_norm.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,0.862735,-1.217079,-0.279597
1,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,1.055651,-1.217079,-0.279597
2,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,3.55134
3,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,5.055221
4,-0.790075,0.678079,-0.846733,0.488555,-1.038638,0.070492,0.647569,4.976069


Save the number of predictors to *n_cols* since we will need this number when building our network.

In [10]:
n_cols = X_norm.shape[1] # number of predictors

## Import Keras and Other Necessary Packages

In [11]:
import keras

In [12]:
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

## A. Build a baseline model

One hidden layer of 10 nodes, and a ReLU activation function
Use the <b>adam</b> optimizer and the <b>mean squared error</b> as the loss function.

In [13]:
# Define regression model
def regression_model():
    # Create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # Compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

In [14]:
# Build the model
model = regression_model()

1. Randomly split the data into a training and test sets by holding 30% of the data for testing. You can use the <b>train_test_split</b> helper function from Scikit-learn.
2. Train the model on the training data using <b>50 epochs</b>.
3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. You can use the <b>mean_squared_error</b> function from Scikit-learn.

4. Repeat steps 1 - 3, <b>50 times</b>, i.e., create a list of 50 mean squared errors.

5. Report the <b>mean and the standard deviation of the mean squared errors</b>.

In [15]:
# Arrays to store mean and st dev of the mean squared errors
mse_PartA = []
r2_PartA = []

# Repeat 50 times (create a list of 50 mse's)
for i in range(50):
    # Split data into testing and training sets (30% testing)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    X_train = np.asarray(X_train)
    X_test = np.asarray(X_test)
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)
    
    # Train model on training data using 50 epochs
    model.fit(X_train, y_train, epochs=50, verbose=0)
    
    # Compute mse between predicted and actual concrete strength
    y_pred = model.predict(X_test)
    
    mse_PartA.append(mean_squared_error(y_test, y_pred))
    r2_PartA.append(r2_score(y_test, y_pred))



#### Mean and Standard Deviation of Mean Squared Errors for Part A

In [16]:
print('mse_Mean: {:.2f}'.format(np.mean(mse_PartA)))
print('mse_StdDev: {:.2f}'.format(np.std(mse_PartA)))
print('R^2_Mean: {:.2f}'.format(np.mean(r2_PartA)))
print('R^2_StdDev: {:.2f}'.format(np.std(r2_PartA)))

mse_Mean: 55.01
mse_StdDev: 21.06
R^2_Mean: 0.80
R^2_StdDev: 0.07


## B. Normalize the data

Repeat part A but now use X_norm instead of X

In [17]:
# Arrays to store mean and st dev of the mean squared errors
mse_PartB = []
r2_PartB = []

# Repeat 50 times (create a list of 50 mse's)
for i in range(50):
    # Split data into testing and training sets (30% testing)
    X_train, X_test, y_train, y_test = train_test_split(X_norm, y, test_size=0.3)
    X_train = np.asarray(X_train)
    X_test = np.asarray(X_test)
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)
    
    # Train model on training data using 50 epochs
    model.fit(X_train, y_train, epochs=50, verbose=0)
    
    # Compute mse between predicted and actual concrete strength
    y_pred = model.predict(X_test)
    
    mse_PartB.append(mean_squared_error(y_test, y_pred))
    r2_PartB.append(r2_score(y_test, y_pred))



#### Mean and Standard Deviation of Mean Squared Errors for Part B

In [18]:
print('mse_Mean: {:.2f}'.format(np.mean(mse_PartB)))
print('mse_StdDev: {:.2f}'.format(np.std(mse_PartB)))
print('R^2_Mean: {:.2f}'.format(np.mean(r2_PartB)))
print('R^2_StdDev: {:.2f}'.format(np.std(r2_PartB)))

mse_Mean: 45.08
mse_StdDev: 33.41
R^2_Mean: 0.84
R^2_StdDev: 0.12


<b>How does the mean of the mean squared errors compare to that from Step A?</b>

The mse mean is lower, the R^2 mean is slightly higher

## C. Increase the number of epochs

Repeat Part B but use <b>100 epochs this time for training.</b>

In [19]:
# Arrays to store mean and st dev of the mean squared errors
mse_PartC = []
r2_PartC = []

# Repeat 50 times (create a list of 50 mse's)
for i in range(50):
    # Split data into testing and training sets (30% testing)
    X_train, X_test, y_train, y_test = train_test_split(X_norm, y, test_size=0.3)
    X_train = np.asarray(X_train)
    X_test = np.asarray(X_test)
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)
    
    # Train model on training data using 100 epochs
    model.fit(X_train, y_train, epochs=100, verbose=0)
    
    # Compute mse between predicted and actual concrete strength
    y_pred = model.predict(X_test)
    
    mse_PartC.append(mean_squared_error(y_test, y_pred))
    r2_PartC.append(r2_score(y_test, y_pred))



#### Mean and Standard Deviation of Mean Squared Errors for Part C

In [20]:
print('mse_Mean: {:.2f}'.format(np.mean(mse_PartC)))
print('mse_StdDev: {:.2f}'.format(np.std(mse_PartC)))
print('R^2_Mean: {:.2f}'.format(np.mean(r2_PartC)))
print('R^2_StdDev: {:.2f}'.format(np.std(r2_PartC)))

mse_Mean: 27.99
mse_StdDev: 2.59
R^2_Mean: 0.90
R^2_StdDev: 0.01


<b>How does the mean of the mean squared errors compare to that from Step B?</b>

The mse mean is lower, the R^2 mean is slightly higher

## D. Increase the number of hidden layers

Repeat part B but use a neural network with the following instead:

- Three hidden layers, each of 10 nodes and ReLU activation function.

In [22]:
# Define regression model
def regression_model2():
    # Create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    
    # Compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

In [23]:
# Build the model
model2 = regression_model2()

In [24]:
# Arrays to store mean and st dev of the mean squared errors
mse_PartD = []
r2_PartD = []

# Repeat 50 times (create a list of 50 mse's)
for i in range(50):
    # Split data into testing and training sets (30% testing)
    X_train, X_test, y_train, y_test = train_test_split(X_norm, y, test_size=0.3)
    X_train = np.asarray(X_train)
    X_test = np.asarray(X_test)
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)
    
    # Train model on training data using 50 epochs
    model2.fit(X_train, y_train, epochs=50, verbose=0)
    
    # Compute mse between predicted and actual concrete strength
    y_pred = model.predict(X_test)
    
    mse_PartD.append(mean_squared_error(y_test, y_pred))
    r2_PartD.append(r2_score(y_test, y_pred))



#### Mean and Standard Deviation of Mean Squared Errors for Part D

In [25]:
print('mse_Mean: {:.2f}'.format(np.mean(mse_PartD)))
print('mse_StdDev: {:.2f}'.format(np.std(mse_PartD)))
print('R^2_Mean: {:.2f}'.format(np.mean(r2_PartD)))
print('R^2_StdDev: {:.2f}'.format(np.std(r2_PartD)))

mse_Mean: 24.93
mse_StdDev: 2.36
R^2_Mean: 0.91
R^2_StdDev: 0.01


<b>How does the mean of the mean squared errors compare to that from Step B?</b>

The mse mean is lower, the R^2 mean is slightly higher