# Assignment: Build a Regression Model in Keras 
## Part-A<a href="#parta"> click here</a>

Use the Keras library to build a neural network with the following:

- One hidden layer of 10 nodes, and a ReLU activation function

- Use the adam optimizer and the mean squared error as the loss function.

1. Randomly split the data into a training and test sets by holding 30% of the data for testing. You can use the train_test_splithelper function from Scikit-learn.

2. Train the model on the training data using 50 epochs.

3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. You can use the mean_squared_error function from Scikit-learn.

4. Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.

5. Report the mean and the standard deviation of the mean squared errors.

# Part (B)-- <a href="#partb"> click here</a>

B. Normalize the data (5 marks)

Repeat Part A but use a normalized version of the data. Recall that one way to normalize the data is by subtracting the mean from the individual predictors and dividing by the standard deviation.

How does the mean of the mean squared errors compare to that from Step A?

In [41]:
# @title Importing Library
import pandas as pd
import numpy as np

# Library for modal
import keras
from keras.models import Sequential
from keras.layers import Dense

#For data spliting
from sklearn.model_selection import train_test_split

#For mean sqare error
from sklearn.metrics import mean_squared_error

In [42]:
# @title Loading data set-

concrete_data=pd.read_csv('https://cocl.us/concrete_data')

In [43]:
concrete_data.sample(5)

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
1000,141.9,166.6,129.7,173.5,10.9,882.6,785.3,28,44.61
388,385.0,0.0,136.0,158.0,20.0,903.0,768.0,28,55.55
178,286.3,200.9,0.0,144.7,11.2,1004.6,803.7,91,76.8
164,425.0,106.3,0.0,153.5,16.5,852.1,887.1,91,65.2
996,152.6,238.7,0.0,200.0,6.3,1001.8,683.9,28,26.86


**The dataset is about the compressive strength of different samples of concrete based on the volumes of the different ingredients that were used to make them. Ingredients include:**

1. Cement

2. Blast Furnace Slag

3. Fly Ash

4. Water

5. Superplasticizer

6. Coarse Aggregate

7. Fine Aggregate

### Let's check how many data points we have.

In [44]:
concrete_data.shape

(1030, 9)

In [45]:
concrete_data.describe()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [46]:
concrete_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Cement              1030 non-null   float64
 1   Blast Furnace Slag  1030 non-null   float64
 2   Fly Ash             1030 non-null   float64
 3   Water               1030 non-null   float64
 4   Superplasticizer    1030 non-null   float64
 5   Coarse Aggregate    1030 non-null   float64
 6   Fine Aggregate      1030 non-null   float64
 7   Age                 1030 non-null   int64  
 8   Strength            1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB


In [47]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

There are **1030** samples in the dataset.

**Strength is our target**

**The data looks very clean and is ready to be used to build our model.**

In [48]:
# Split data into predictors and target

predictors = concrete_data.iloc[:,:-1] # strength is the last column so this will exclude the last column.
target = concrete_data['Strength'] # Strength column

In [49]:
predictors.sample(5)

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
460,178.0,129.8,118.6,179.9,3.6,1007.3,746.8,100
510,424.0,22.0,132.0,178.0,8.5,822.0,750.0,7
73,425.0,106.3,0.0,151.4,18.6,936.0,803.7,3
483,446.0,24.0,79.0,162.0,11.6,967.0,712.0,56
434,178.0,129.8,118.6,179.9,3.6,1007.3,746.8,28


In [50]:
target.head(3)

0    79.99
1    61.89
2    40.27
Name: Strength, dtype: float64

In [51]:
# No. of features
n_cols=predictors.shape[1]
n_cols

8

The below function creates a model that has one hidden layer with 10 neurons and a ReLU activation function. It uses the adam optimizer and the mean squared error as the loss function.

Function is using keras Sequantial that we have imported above

# <div id='parta'>Part-(A)</div>

In [52]:
# define regression model
def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,))) # hidden layers with node 10 and relu-- activation function
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

In [53]:
# Let's split the data in training and testing

X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3, random_state=42)

### Train and Test the Network

**Let's call the function now to create our model.**

In [54]:
# build the model
# model = regression_model()
# epochs=50

# # Fit
# model.fit(X_train, y_train, epochs=epochs, verbose=1)

In [55]:
# Evaluate the model on the test data.

loss_val = model.evaluate(X_test, y_test)
y_pred = model.predict(X_test)
loss_val



39.82253760427333

### Create a list of 50 mean squared errors and report mean and the standard deviation of the mean squared errors.

In [56]:
# Create a list of 50 mean squared errors and report mean and the standard deviation of the mean squared errors.
# Making it as a function as we will use it later--
def iterate(epochs,total_mean_squared_errors):
#     total_mean_squared_errors = 50
#     epochs = 50
    mean_squared_errors = []
    # Iterating 50 times
    for i in range(0, total_mean_squared_errors):
        X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3, random_state=i)
        model.fit(X_train, y_train, epochs=epochs, verbose=0)
        MSE = model.evaluate(X_test, y_test, verbose=0)
        print("MSE "+str(i+1)+": "+str(MSE))
        y_pred = model.predict(X_test)
        mean_square_error = mean_squared_error(y_test, y_pred)
        mean_squared_errors.append(mean_square_error)

    mean_squared_errors = np.array(mean_squared_errors)
    mean = np.mean(mean_squared_errors)
    standard_deviation = np.std(mean_squared_errors)

    print('\n')
    print("Below is the mean and standard deviation of " +str(total_mean_squared_errors) + " mean squared errors without normalized data. Total number of epochs for each training is: " +str(epochs) + "\n")
    print("Mean: "+str(mean))
    print("Standard Deviation: "+str(standard_deviation))
    return (mean, standard_deviation)

In [57]:
mean_a, std_a = iterate(50,50)

MSE 1: 35.65390965236429
MSE 2: 38.69336439953653
MSE 3: 33.60121722668892
MSE 4: 38.45173524035605
MSE 5: 40.61745424486673
MSE 6: 39.14148593643337
MSE 7: 44.87296801323258
MSE 8: 32.89427901085912
MSE 9: 36.29923606304675
MSE 10: 36.18784581650422
MSE 11: 42.38047342393005
MSE 12: 36.19032150564842
MSE 13: 45.110932260655275
MSE 14: 45.12595520513343
MSE 15: 40.145171415458606
MSE 16: 32.93197467643466
MSE 17: 38.579589720297015
MSE 18: 40.072668044698275
MSE 19: 39.98779251197395
MSE 20: 37.99384855992586
MSE 21: 38.4340170579435
MSE 22: 40.93059833613028
MSE 23: 33.61716669817187
MSE 24: 37.312451168171414
MSE 25: 39.97815508055456
MSE 26: 46.72314250662103
MSE 27: 35.69689537714986
MSE 28: 35.07320332450003
MSE 29: 43.49672279234457
MSE 30: 38.19403510726386
MSE 31: 39.56578155320053
MSE 32: 31.639139990204747
MSE 33: 35.097436022218375
MSE 34: 38.68452098068682
MSE 35: 38.70165282314264
MSE 36: 42.21679671451112
MSE 37: 41.80738207746092
MSE 38: 40.502178179793376
MSE 39: 35.333

<div id='partb'></div>

# Part-(B)

Result from previous part- 

The mean and standard deviation of 50 mean squared errors without normalized data. Total number of epochs for each training is: 50

Mean: 51.87652089428187
Standard Deviation: 8.24017812052958


**Here first job is normalisation.**

**--Then split data in train and test**

**--Build the modal**

**--Iterate for MSE and show mean and deviation**

In [58]:
#Normalising

predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,0.862735,-1.217079,-0.279597
1,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,1.055651,-1.217079,-0.279597
2,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,3.55134
3,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,5.055221
4,-0.790075,0.678079,-0.846733,0.488555,-1.038638,0.070492,0.647569,4.976069


In [59]:
# Let's split the data in training and testing

X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=42)

In [60]:
mean_b, std_b = iterate(epochs=50,total_mean_squared_errors=50)

MSE 1: 37.268244536563415
MSE 2: 40.55566886482115
MSE 3: 33.593345450737715
MSE 4: 37.18928058710684
MSE 5: 39.317809070969865
MSE 6: 37.75623835097625
MSE 7: 47.92779812612194
MSE 8: 33.81234507267529
MSE 9: 38.035536904936855
MSE 10: 37.53126548100444
MSE 11: 38.39990730656003
MSE 12: 39.059246322483695
MSE 13: 40.127409999810375
MSE 14: 47.57109183400966
MSE 15: 39.741065016070614
MSE 16: 33.04913864012289
MSE 17: 37.16833649175452
MSE 18: 39.56336800411681
MSE 19: 41.653844481533014
MSE 20: 36.89947919629538
MSE 21: 33.73653437790362
MSE 22: 41.70790543293876
MSE 23: 42.86693616206592
MSE 24: 37.85636900620939
MSE 25: 39.745946964399714
MSE 26: 43.18807336504791
MSE 27: 35.811724949808955
MSE 28: 34.83083066971171
MSE 29: 45.651836012559414
MSE 30: 37.73076068544851
MSE 31: 36.12266467690082
MSE 32: 33.037524257277205
MSE 33: 33.17627251881226
MSE 34: 38.201913210180585
MSE 35: 39.72997197518457
MSE 36: 41.87195194577708
MSE 37: 38.801529628173434
MSE 38: 37.99495188395182
MSE 39:

**Here now we will see the change in mean and standard deviation**

In [61]:
change=[(mean_a, std_a),(mean_b, std_b),(mean_b-mean_a, std_b-std_a)]
pd.DataFrame(change,index=['A','B','Change'],columns=['Mean','STD'])

Unnamed: 0,Mean,STD
A,39.029834,3.474486
B,38.686191,3.512375
Change,-0.343642,0.037889
