# Context

Concrete is one of the most important materials in civil engineering. Concrete's compressive strength is a highly nonlinear function of age and ingredients.

# Business Problem

This is a supervised regression problem. The objective is to accurately predict the compressive strength of concrete from its age and the ingredients used in its production.


# Data

The dataset contains 1030 observations across 8 input variables and one output variable. The name, data type, measurement unit and a brief description of each variable are provided below.

| Name | Data Type | Measurement | Description |
|---|---|---|---|
| Cement (component 1) | quantitative | kg in a m3 mixture | Input Variable |
| Blast Furnace Slag (component 2) | quantitative | kg in a m3 mixture | Input Variable |
| Fly Ash (component 3) | quantitative | kg in a m3 mixture | Input Variable |
| Water (component 4) | quantitative | kg in a m3 mixture | Input Variable |
| Superplasticizer (component 5) | quantitative | kg in a m3 mixture | Input Variable |
| Coarse Aggregate (component 6) | quantitative | kg in a m3 mixture | Input Variable |
| Fine Aggregate (component 7) | quantitative | kg in a m3 mixture | Input Variable |
| Age | quantitative | Day (1~365) | Input Variable |
| Concrete compressive strength | quantitative | MPa | Output Variable |


# Acknowledgements

Original Owner and Donor

Prof. I-Cheng Yeh

Department of Information Management

Chung-Hua University

Citation Request:

I-Cheng Yeh, "Modeling of strength of high performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).


# Importing Libraries & Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR, LinearSVR
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor 

In [None]:
# Import Dataset
df = pd.read_csv(r"../input/regression-with-neural-networking/concrete_data.csv")

# 1) Exploratory Data Analysis

# 1.1 Univariate Analysis

In [None]:
# Any missing values ?
df.isna().sum()

In [None]:
# What are the datatypes of our variables ?
df.info()

In [None]:
# Summary statistics
df.describe()

All predictors in the dataset are quantitative, and there are no missing values. Let's analyse the distribution of each variable through a series of univariate density plots.

In [None]:
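# Snapshot of all nine columns, used below only for their names when plotting.
# (df[1:] would slice rows and silently drop the first observation.)
# The predictor/target split used for modelling is made later.
X = df.copy()
y = df['Strength']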

In [None]:
# Kernel density plots to see the distribution of variables
sns.set(style="white") 
plt.figure(figsize = (20 , 20))
for variable in range(9):
    plt.subplot(4, 3, variable + 1)
    # fill replaces the deprecated shade argument in recent seaborn releases
    sns.kdeplot(df[list(X)[variable]], fill = True, color="green")
plt.show()

Most variables do not follow a normal distribution. It would be interesting to convert the response variable into two categories, 'optimal' and 'suboptimal', in order to detect any patterns between the predictors and the target variable. I'll consider observations at or above the 75th percentile to be optimal.
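As a side note, the cut-off does not have to be hard-coded. The sketch below derives a 75th-percentile threshold with `Series.quantile` and labels the observations from it. The `strength` values are made up purely for illustration and are not taken from the concrete dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical strength values in MPa, used only for illustration
strength = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0])

# 75th percentile (pandas interpolates linearly between order statistics)
threshold = strength.quantile(0.75)

# Label observations at or above the threshold as optimal
labels = np.where(strength >= threshold, "optimal", "suboptimal")
print(threshold, list(labels))
```

Deriving the threshold this way keeps the label consistent with the data if the dataset changes, at the cost of a slightly less readable constant.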

# 1.2 Bivariate Analysis

In [None]:
# Observations above the 75th percentile (roughly 46 MPa) are considered optimal; the rest are suboptimal.
df['target_binary'] = np.where(df.Strength > 46, 'optimal', 'suboptimal')

In [None]:
# Boxplots of optimal and suboptimal compressive strength against ingredients.
sns.set(style="white") 
plt.figure(figsize = (20 , 60))
for variable in range(8):
    plt.subplot(15,3 , variable + 1)
    sns.boxplot(x = df['target_binary'], y =df[list(X)[variable]],  palette='Set1' )
plt.show()

Some interesting patterns emerge between cement, water, superplasticizer and strength. Let's split the target variable into 4 categories (good, better, great, perfect) to refine this visualization further.

In [None]:
# Creating 4 bins out of the response variable. The open upper edge keeps
# every observation in a bin even if a strength value exceeds 82.
df['target_cat'] = pd.cut(df.Strength,
                     bins=[0, 23, 34, 46, np.inf],
                     labels=["Good", "Better", "Great", "Perfect"])

In [None]:
# Boxplots of the four strength categories against ingredients.
sns.set(style="white") 
plt.figure(figsize = (20 , 60))
for variable in range(9):
    plt.subplot(15,3 , variable + 1)
    sns.boxplot(x = df['target_cat'], y = df[list(X)[variable]],  palette='Set1' )
plt.show()

There seem to be some interesting patterns between the ingredients and the strength of the product. Greater quantities of cement and superplasticizer and lower amounts of water seem to correlate with a stronger product. The boxplots also indicate the presence of outliers. Let's create a Pearson correlation correlogram to quantify these associations.

In [None]:
# Let's make subsets of the independent and dependent variables
X =  df.drop(columns=['Strength', 'target_cat', 'target_binary'])
y =  df['Strength']

In [None]:
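# Let's plot a correlogram of the numeric variables
# (the two categorical target columns are excluded, since corr() needs numeric data)
colormap = plt.cm.Blues
plt.figure(figsize=(10,8))
sns.heatmap(df.select_dtypes("number").corr(), cmap=colormap, annot=True, linewidths=0.2)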

Pearson's correlation measures linear association, and inference based on it typically assumes roughly normal, homoscedastic data without strong outliers. These conditions are violated for most variables here, which may be why Pearson's correlation does not capture the nuances between the predictors and the dependent variable properly. Additionally, the relationships among the inputs seem to be largely non-linear. Nonetheless, cement, water, and superplasticizer do seem to be moderately correlated with strength.
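One common alternative when relationships are monotonic but not linear is Spearman's rank correlation, which pandas exposes through `DataFrame.corr(method="spearman")`. The sketch below compares the two coefficients on synthetic data. The `frame` DataFrame and its exponential relationship are assumptions made for illustration, not part of the concrete dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical data: y grows monotonically but non-linearly with x
x = np.arange(1, 21, dtype=float)
frame = pd.DataFrame({"x": x, "y": np.exp(x / 3.0)})

# Pearson looks at linear association, Spearman at the agreement of ranks
pearson = frame.corr(method="pearson").loc["x", "y"]
spearman = frame.corr(method="spearman").loc["x", "y"]
print("Pearson: {:.2f}, Spearman: {:.2f}".format(pearson, spearman))
```

Because the ranks of `x` and `y` agree perfectly, Spearman's coefficient reaches its maximum while Pearson's stays lower. The same call on the numeric columns of `df` would give a rank-based version of the correlogram.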

# 2) Data Preprocessing

Some machine learning algorithms require features to be scaled through normalization or standardization. I will now split the dataset into training and testing sets and then perform standardization. This will transform the features so that, on the training set, each has a mean of 0 and a standard deviation of 1.
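To make the arithmetic concrete, here is a minimal NumPy sketch of standardization in which the mean and standard deviation are estimated on the training split only and then reused for the test split. The `X_train_demo` and `X_test_demo` arrays are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical feature matrices (rows = samples, columns = features)
X_train_demo = np.array([[100.0, 1.0], [200.0, 3.0], [300.0, 5.0]])
X_test_demo = np.array([[400.0, 3.0]])

# Statistics are estimated on the training split only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std  # reuses the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

The scaled test features are not guaranteed to have mean 0 and standard deviation 1. That is expected, since fitting the scaler on the test data would leak information from it into preprocessing.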

In [None]:
# Split the dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, shuffle = True, random_state = 43)

# Feature Scaling 
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test =sc.transform(X_test)

# 3) Predictive Modelling

Let's apply a number of linear and non-linear regression models and compare their R2 scores on the test set.
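For reference, the `score` method of scikit-learn regressors returns the coefficient of determination, R2 = 1 - SS_res / SS_tot. The sketch below computes it by hand. The helper name `r2_demo` and the sample values are hypothetical and only serve to illustrate the formula.

```python
import numpy as np

def r2_demo(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true_demo = [20.0, 30.0, 40.0, 50.0]
perfect = r2_demo(y_true_demo, y_true_demo)       # predictions match exactly
baseline = r2_demo(y_true_demo, [35.0] * 4)       # always predict the mean
print(perfect, baseline)
```

A model that always predicts the mean of the targets scores 0, a perfect model scores 1, and a model worse than the mean baseline can score below 0.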

In [None]:
# Store models in a dictionary
algorithms = {"KNN": KNeighborsRegressor(),
          "Linear Regression": LinearRegression(), 
          "Ridge Regression": Ridge(),
          "Random Forest": RandomForestRegressor(),
          "SVR RBF": SVR(),
          "Linear SVR": LinearSVR(),
          "Decision Tree":DecisionTreeRegressor(),
          "Adaboost" :AdaBoostRegressor(),
          "Gradient Boosting":GradientBoostingRegressor(),
          "Neural Network": MLPRegressor(max_iter=10000) ,
          "XGBRegressor" : XGBRegressor()}


# Train each model and collect its R2 on the test set
def train_and_test(algorithms, X_train, y_train, X_test, y_test):
    model_scores = {}
    for name, model in algorithms.items():
        model.fit(X_train, y_train)
        model_scores[name] = model.score(X_test, y_test)
        print(name + " R2: {:.2f}".format(model_scores[name]))
    return model_scores

# Training and testing
model_scores = train_and_test(algorithms, X_train, y_train, X_test, y_test)

The non-linear algorithms perform far better than their linear counterparts. Amongst the non-linear models, XGBoost seems to outperform all the others. I will now set up a grid search and tune its hyperparameters to try to further enhance its performance.
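A grid search fits one model for every combination of candidate values and every cross-validation fold, so its cost grows multiplicatively with each added hyperparameter. The sketch below enumerates a small grid with `itertools.product`. The `demo_grid` values are a hypothetical, reduced grid, and `n_folds = 5` assumes the default 5-fold cross-validation of `GridSearchCV`.

```python
from itertools import product

# Hypothetical, reduced grid used only to illustrate the enumeration
demo_grid = {"learning_rate": [0.05, 0.10], "max_depth": [5, 6, 8]}

# Every combination of one value per hyperparameter
names = list(demo_grid)
combinations = [dict(zip(names, values))
                for values in product(*(demo_grid[name] for name in names))]

n_folds = 5  # assumed: the default number of folds in GridSearchCV
total_fits = len(combinations) * n_folds
print(len(combinations), total_fits)
```

By the same arithmetic, the grid used in this notebook has 4 x 3 x 4 x 4 x 4 x 1 = 768 combinations, each with 1000 trees, so the search can take a while to run.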

In [None]:
# XGboost on the default settings seems to outperform all other models.
estimator = XGBRegressor()
estimator.fit(X_train, y_train)
print("R2: {:.2f}".format(estimator.score(X_test, y_test)))

In [None]:
# Let's set up a GridSearchCV and optimize some hyperparameters
param_grid       =      {"learning_rate": (0.05, 0.10, 0.15, 0.2),
                         "max_depth": [5, 6, 8],
                         "min_child_weight": [ 5, 7, 9, 11],
                         "gamma":[ 0.0, 0.1, 0.2, 0.25],
                         "colsample_bytree":[ 0.3, 0.4, 0.5, 0.7],
                          "n_estimators": [1000]}

# GridSearchCV
optimized_estimator =  GridSearchCV(estimator, param_grid)
optimized_estimator.fit(X_train, y_train)

# Retrieve Best Parameters
for i, j in optimized_estimator.best_params_.items():
    print("\nBest " + str(i) + " parameter: " +  str(j))

In [None]:
print("XGBoost R2 after hyperparameter tuning: {:.2f}".format(optimized_estimator.score(X_test, y_test)))

The model seems to have improved very slightly. 

# 4) Conclusion

This notebook performed supervised learning to predict the compressive strength of concrete from its age and the ingredients used to produce it. The tasks carried out include exploratory data analysis through visualizations and statistics, followed by preprocessing, modelling and hyperparameter tuning. There is still room for improvement, but I'll stop here for now.

# **Thank you for reading ! :D**