This project models concrete compressive strength using machine learning. The dataset comes from the UC Irvine Machine Learning Repository. The project is part of my ML portfolio.

# **Concrete Compressive Strength**
In this project, a multivariate linear regression (as a baseline) and a neural network are used to model the data. Because of the non-linearity in the data, the neural network can be expected to perform better.

Import TensorFlow

In [None]:
# Run on TensorFlow 2.x
%tensorflow_version 2.x
from __future__ import absolute_import, division, print_function, unicode_literals

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.


Import the necessary modules to prepare data etc.

In [None]:
#Import relevant modules
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import regularizers
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# The following lines adjust the granularity of reporting.
pd.options.display.max_rows = 10
pd.options.display.float_format = "{:.1f}".format

Load the dataset. Files can be found on the UC Irvine ML Repo.

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving Concrete_dataset.csv to Concrete_dataset.csv
User uploaded file "Concrete_dataset.csv" with length 41561 bytes


Import the dataset

In [None]:
dataset = pd.read_csv(filepath_or_buffer="Concrete_dataset.csv")

First let's have a quick look at the dataset.

In [None]:
dataset.head()

Unnamed: 0,Cement (kg_in_m3),Blast Furnace Slag (kg_in_m3),Fly Ash (kg_in_m3),Water (kg_in_m3),Superplasticizer (kg_in_m3),Coarse Aggregate (kg_in_m3),Fine Aggregate (kg_in_m3),Age (day),Concrete compressive strength (Mpa)
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,80.0
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.9
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.3
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.0
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


It is also worth looking at the summary statistics for the dataset.

In [None]:
dataset.describe()

Unnamed: 0,Cement (kg_in_m3),Blast Furnace Slag (kg_in_m3),Fly Ash (kg_in_m3),Water (kg_in_m3),Superplasticizer (kg_in_m3),Coarse Aggregate (kg_in_m3),Fine Aggregate (kg_in_m3),Age (day),Concrete compressive strength (Mpa)
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.2,73.9,54.2,181.6,6.2,972.9,773.6,45.7,35.8
std,104.5,86.3,64.0,21.4,6.0,77.8,80.2,63.2,16.7
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.3
25%,192.4,0.0,0.0,164.9,0.0,932.0,731.0,7.0,23.7
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.4
75%,350.0,142.9,118.3,192.0,10.2,1029.4,824.0,56.0,46.1
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


Before modeling, it is good practice to look at the strongest correlations between the features (and with the target) so that the dataset can be understood better.

In [None]:
dataset.corr(method='pearson')

Unnamed: 0,Cement (kg_in_m3),Blast Furnace Slag (kg_in_m3),Fly Ash (kg_in_m3),Water (kg_in_m3),Superplasticizer (kg_in_m3),Coarse Aggregate (kg_in_m3),Fine Aggregate (kg_in_m3),Age (day),Concrete compressive strength (Mpa)
Cement (kg_in_m3),1.0,-0.3,-0.4,-0.1,0.1,-0.1,-0.2,0.1,0.5
Blast Furnace Slag (kg_in_m3),-0.3,1.0,-0.3,0.1,0.0,-0.3,-0.3,-0.0,0.1
Fly Ash (kg_in_m3),-0.4,-0.3,1.0,-0.3,0.4,-0.0,0.1,-0.2,-0.1
Water (kg_in_m3),-0.1,0.1,-0.3,1.0,-0.7,-0.2,-0.5,0.3,-0.3
Superplasticizer (kg_in_m3),0.1,0.0,0.4,-0.7,1.0,-0.3,0.2,-0.2,0.4
Coarse Aggregate (kg_in_m3),-0.1,-0.3,-0.0,-0.2,-0.3,1.0,-0.2,-0.0,-0.2
Fine Aggregate (kg_in_m3),-0.2,-0.3,0.1,-0.5,0.2,-0.2,1.0,-0.2,-0.2
Age (day),0.1,-0.0,-0.2,0.3,-0.2,-0.0,-0.2,1.0,0.3
Concrete compressive strength (Mpa),0.5,0.1,-0.1,-0.3,0.4,-0.2,-0.2,0.3,1.0
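
The last row of the table gives the correlation of each feature with the target. As a small sketch, the rounded values from that row can be ranked by absolute value to see which features have the strongest linear relationship with strength. The dictionary is typed in by hand from the table, so it carries the same one-decimal rounding, and the variable names are only illustrative.

```python
# Rounded Pearson correlations with the target, copied from the last row
# of the correlation table.
corr_with_strength = {
    "Cement (kg_in_m3)": 0.5,
    "Blast Furnace Slag (kg_in_m3)": 0.1,
    "Fly Ash (kg_in_m3)": -0.1,
    "Water (kg_in_m3)": -0.3,
    "Superplasticizer (kg_in_m3)": 0.4,
    "Coarse Aggregate (kg_in_m3)": -0.2,
    "Fine Aggregate (kg_in_m3)": -0.2,
    "Age (day)": 0.3,
}

# Sort by the magnitude of the correlation, strongest first.
ranked = sorted(corr_with_strength.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, r in ranked:
    print(f"{name:32s} {r:+.1f}")
```

On the full DataFrame, an expression along the lines of `dataset.corr()[target].drop(target).abs().sort_values(ascending=False)` should give the same ranking without the rounding, where `target` is a name assumed here for the strength column.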


## Dataset preparation
Here we separate the input features from the target (Concrete compressive strength (Mpa)) so that the features can be normalized later.



In [None]:
y_dataset = dataset.pop('Concrete compressive strength (Mpa)')
X_dataset = dataset

The features are normalized using the Z-score (subtract the mean, then divide by the standard deviation) so that they are all on a comparable scale.

In [None]:
X_dataset_mean = dataset.mean()
X_dataset_std = dataset.std()
X_dataset_norm = (X_dataset - X_dataset_mean)/X_dataset_std
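
To make the Z-score concrete, here is a minimal sketch on a single column. It uses only the five cement values shown by `dataset.head()`, so the mean and standard deviation differ from those of the full dataset. The names `cement` and `cement_norm` are introduced just for this illustration.

```python
import numpy as np

# Cement values (kg per m^3) from the first five rows of dataset.head().
cement = np.array([540.0, 540.0, 332.5, 332.5, 198.6])

# pandas' .std() uses the sample standard deviation (ddof=1) by default,
# so ddof=1 is used here to mirror the notebook's calculation.
cement_mean = cement.mean()
cement_std = cement.std(ddof=1)

# Z-score: centre on the mean, scale by the standard deviation.
cement_norm = (cement - cement_mean) / cement_std

print(cement_norm)
print(cement_norm.mean(), cement_norm.std(ddof=1))
```

After the transformation the column has a mean of (numerically) zero and a sample standard deviation of one, which is what the normalization does for each feature.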

Now that all the features are normalized, the dataset can be split into a training set and a test set. An 80/20 split works well for this project. The random state is set to 100 for reproducibility.

In [None]:
from sklearn.model_selection import train_test_split

# Split the normalized features (X_dataset_norm), not the raw X_dataset
X_train_norm, X_test_norm, y_train, y_test = train_test_split(X_dataset_norm, y_dataset, test_size=0.2, random_state=100)
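
A common variation, which is not what this notebook does, is to split first and compute the normalization statistics from the training rows only, so that no information from the test set enters the preprocessing. The sketch below illustrates the idea on synthetic data with plain NumPy; the array `X_demo` and every other name in it are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(100)

# Synthetic stand-in for two feature columns (the values are made up).
X_demo = rng.normal(loc=[280.0, 180.0], scale=[100.0, 20.0], size=(50, 2))

# Shuffle the row indices and hold out 20% of the rows as the test set.
shuffled = rng.permutation(len(X_demo))
n_test = len(X_demo) // 5
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
X_tr, X_te = X_demo[train_idx], X_demo[test_idx]

# The statistics come from the training rows only ...
tr_mean = X_tr.mean(axis=0)
tr_std = X_tr.std(axis=0, ddof=1)

# ... and are then applied to both subsets.
X_tr_norm = (X_tr - tr_mean) / tr_std
X_te_norm = (X_te - tr_mean) / tr_std

print(X_tr_norm.shape, X_te_norm.shape)
```

With this ordering the training features have exactly zero mean and unit standard deviation, while the test features are only approximately standardized.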

## Multivariate linear regression
A multivariate linear regression model will be used as a baseline to compare with the NN. Here we train the model with the training data.

In [None]:
# Create a linear regression object and train it
model_LR = LinearRegression()
model_LR.fit(X_train_norm, y_train)

# Print parameters
print(model_LR.intercept_)
print(model_LR.coef_)

-34.27352699820249
[ 0.12415357  0.10366839  0.093371   -0.13429401  0.28804259  0.02065756
  0.02563037  0.11461733]


Now we evaluate the performance of the LR model using the test set. Mean squared error (MSE) is an appropriate metric to judge the model by.

In [None]:
# Predict the output for the test data
y_pred_LR = model_LR.predict(X_test_norm)

# Evaluate the performance using MSE
print(mean_squared_error(y_test, y_pred_LR))

113.17875937789907
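
MSE is expressed in squared units (MPa² here), which makes it hard to read directly. As a small sketch, the function below spells out the definition on three made-up values, and the square root of the recorded test MSE gives the root mean squared error (RMSE) in MPa. The helper `mse` and the toy numbers are illustrative only.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example with made-up strengths: errors of 2, 0 and 4 MPa.
toy_mse = mse([30.0, 40.0, 50.0], [32.0, 40.0, 46.0])
print(toy_mse)

# Test MSE of the linear regression, as printed above.
lr_test_mse = 113.17875937789907
lr_test_rmse = math.sqrt(lr_test_mse)
print(f"RMSE: {lr_test_rmse:.1f} MPa")
```

An RMSE of roughly 10.6 MPa can be compared with the standard deviation of the target, which `dataset.describe()` reports as 16.7 MPa.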


## Neural network
Now the neural network will be defined. For this we will try three hidden layers with 32, 16, and 8 nodes. The activation function is ReLU, and each hidden layer uses an L2 kernel regularizer with a factor of 0.001.

In [None]:
# Create the neural network.
# The L2 factor is passed positionally because the keyword name
# differs between Keras versions (l in older ones, l2 in newer ones).
model_NN = tf.keras.Sequential([
    layers.Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.001), input_shape=(X_train_norm.shape[1],)),
    layers.Dense(16, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(8, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(1)  # single linear output for the regression target
])
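
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: each dense layer has one weight per input-output pair plus one bias per unit. The sketch below assumes 8 input features, matching the 8 coefficients printed for the linear regression.

```python
# Assumption: 8 input features, as in the linear regression above.
n_features = 8
layer_units = [32, 16, 8, 1]  # three hidden layers plus the output layer

n_inputs = n_features
total_params = 0
for units in layer_units:
    # weights (n_inputs x units) plus one bias per unit
    layer_params = n_inputs * units + units
    print(f"Dense({units}): {layer_params} parameters")
    total_params += layer_params
    n_inputs = units

print(f"Total: {total_params} parameters")
```

Calling `model_NN.summary()` should report the same total, provided the model really receives 8 features.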

The neural network is trained on the training data. Here we set the number of epochs (200), the optimization algorithm ('adam'), the batch size (30), and the loss function (MSE).

In [None]:
# Compile the model
model_NN.compile(optimizer='adam', loss='mean_squared_error', metrics=['mean_squared_error'])

# Fit the model
model_NN.fit(x=X_train_norm, y=y_train, batch_size=30, epochs=200)

Epoch 1/200 through Epoch 78 (the captured log contains no loss values and is cut off at this point)

<keras.src.callbacks.History at 0x7fd465b073d0>
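
As a rough sanity check on the training settings, the number of weight updates can be estimated from the dataset size. The sketch assumes that the 80/20 split of the 1030 rows leaves 824 training rows and that the last, smaller batch of each epoch is still used (the default behaviour of `fit` as far as I know).

```python
import math

n_samples = 1030   # row count from dataset.describe()
batch_size = 30
epochs = 200

# 20% of the rows are held out for testing.
n_test = math.ceil(n_samples * 20 / 100)
n_train = n_samples - n_test

# The last batch of each epoch is smaller than the others.
steps_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print(n_train, steps_per_epoch, last_batch_size, total_updates)
```

Under these assumptions each epoch consists of 28 batches, the last of which holds 14 rows, for 5600 gradient updates over the full run.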

To evaluate the neural network against the test set, MSE is used again.

In [None]:
# Evaluate on the test set; returns [loss, mean_squared_error],
# where the loss also includes the L2 regularization penalty
model_NN.evaluate(X_test_norm, y_test)



[56.44139862060547, 56.406795501708984]
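
The two numbers differ slightly because the first is the loss, which adds the L2 penalty to the MSE, while the second is the plain MSE metric. As a sketch, and assuming the penalty is the L2 factor times the sum of the squared kernel weights, the difference gives a rough estimate of how large the weights are.

```python
l2_factor = 0.001                # regularization factor used in the model
test_loss = 56.44139862060547    # first value returned by evaluate()
test_mse = 56.406795501708984    # second value returned by evaluate()

# The gap between loss and metric is attributed to the L2 penalty.
l2_penalty = test_loss - test_mse

# Implied sum of squared kernel weights over the regularized layers.
implied_sum_of_squares = l2_penalty / l2_factor

print(f"L2 penalty: {l2_penalty:.4f}")
print(f"Implied sum of squared weights: {implied_sum_of_squares:.1f}")
```

The penalty is small compared with the MSE itself, so the test MSE of about 56.4 is the figure to compare with the linear regression.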

In [None]:
# MSE of the linear regression on the training set
y_tra_LR = model_LR.predict(X_train_norm)

# Evaluate the performance using the mean squared error
print(mean_squared_error(y_train, y_tra_LR))

105.96676510674719


## Conclusion
The MSE for the linear regression was 106 on the training set and 113 on the test set. For the neural network, the training MSE was 49 and the test MSE was about 56 (the evaluation above reports 56.4). From this it can be concluded that the neural network is better at describing and predicting the data. Since the MSE is not drastically different between the training data and the test data, the linear model does not suffer from significant overfitting, and the same can be said about the neural network.
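
One way to put a number on the difference between training and test error is the relative gap between the two MSE values. The sketch below uses the values printed earlier in the notebook. The neural network's training MSE of 49 is the rounded figure quoted in the text, since the training log does not show it, so that gap is only approximate.

```python
def relative_gap(train_mse, test_mse):
    """Fraction by which the test MSE exceeds the training MSE."""
    return test_mse / train_mse - 1.0


# Linear regression: both values are printed in the notebook.
gap_lr = relative_gap(105.96676510674719, 113.17875937789907)

# Neural network: test MSE from evaluate(), training MSE as quoted in the text.
gap_nn = relative_gap(49.0, 56.406795501708984)

print(f"Linear regression: {gap_lr:.1%}")
print(f"Neural network:    {gap_nn:.1%}")
```

With these inputs the gap is roughly 7% for the linear regression and roughly 15% for the neural network.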