# Concrete Compressive Testing

The compressive strength of concrete is a key measure of its quality. It is generally determined by a standard crushing test on a concrete cylinder. Engineers cast small cylinders with different combinations of raw materials and test them to see how strength varies as each raw material changes. The recommended wait before testing a cylinder is 28 days, to ensure correct results. Preparing and testing the different prototypes therefore takes a lot of time and labour. The method is also prone to human error, and one small mistake can drastically increase the wait time.

One way to reduce both the wait time and the number of combinations to try is digital simulation: we give the computer what we already know, and it tries different combinations to predict the compressive strength. This reduces the number of combinations we have to test physically and the time spent on experimentation. To design such software, however, we have to know the relations between all the raw materials and how each one affects the strength. It is possible to derive mathematical equations and run simulations based on them, but we cannot expect those relations to hold exactly in the real world. These tests have also been performed many times by now, so there is enough real-world data to use for predictive modelling.

## Import Necessary Libraries

In [None]:
%matplotlib inline
!pip install pyforest

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from pyforest import *

import warnings
warnings.filterwarnings('ignore')

In [None]:
from sklearn import metrics
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score 
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor
from sklearn.svm import SVR
from sklearn.ensemble import VotingRegressor
from scipy import stats
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.cluster import KMeans
from sklearn.utils import resample

## Loading dataset from UCI

In [None]:
df = pd.read_csv('../input/concrete-data/Concrete_Data.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df = df.rename(columns={"Cement (component 1)(kg in a m^3 mixture)":"cement",
                        "Blast Furnace Slag (component 2)(kg in a m^3 mixture)":"slag",
                        "Fly Ash (component 3)(kg in a m^3 mixture)":"ash",
                        "Water  (component 4)(kg in a m^3 mixture)":"water",
                        "Superplasticizer (component 5)(kg in a m^3 mixture)":"superplastic",
                        "Coarse Aggregate  (component 6)(kg in a m^3 mixture)":"coarseagg",
                        "Fine Aggregate (component 7)(kg in a m^3 mixture)":"fineagg",
                        "Age (day)":"age",
                        "Concrete compressive strength(MPa, megapascals) ":"strength"})

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.describe()

## Exploratory Data Analysis

In [None]:
from scipy import stats

# Cement
Q1 = df['cement'].quantile(q=0.25)
Q3 = df['cement'].quantile(q=0.75)
print('1st quartile: ' , Q1 )
print('3rd quartile: ' , Q3 )
print('Interquartile range (IQR): ' , stats.iqr(df['cement']))

In [None]:
L_outlier = Q1-1.5*(Q3-Q1)
U_outlier = Q3+1.5*(Q3-Q1)
print('Lower outlier limit in cement: ', L_outlier)
print('Upper outlier limit in cement: ', U_outlier)

In [None]:
# Use the limits computed above instead of hard-coded numbers.
print('Number of outliers in cement (upper): ',df[df['cement']>U_outlier]['cement'].count())
print('Number of outliers in cement (lower): ',df[df['cement']<L_outlier]['cement'].count())

In [None]:
sns.boxplot(x='cement',data=df);
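
The same quartile and fence calculation is repeated for several columns in this notebook. As a minimal sketch, it could be wrapped in two small helpers. The names `iqr_bounds` and `count_outliers` are hypothetical (they are not part of the original notebook), and the toy series below is made-up data standing in for a column such as `df['cement']`.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) outlier limits Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outliers(series: pd.Series, k: float = 1.5):
    """Count the values below the lower limit and above the upper limit."""
    low, high = iqr_bounds(series, k)
    return int((series < low).sum()), int((series > high).sum())


# Toy data (hypothetical values), standing in for a column such as df['cement'].
demo = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
low, high = iqr_bounds(demo)
n_low, n_high = count_outliers(demo)
print('Lower limit:', low, 'Upper limit:', high)
print('Outliers below:', n_low, 'Outliers above:', n_high)
```

With the real data, a call such as `count_outliers(df['water'])` would replace the separate quartile, limit, and count cells for that column.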

In [None]:
# Ash
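# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement.
sns.histplot(df['ash'], kde=True).set_title('Ash');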

In [None]:
# Water
W_Q1 = df['water'].quantile(q=0.25)
W_Q3 = df['water'].quantile(q=0.75)
print('1st quartile: ' , W_Q1 )
print('3rd quartile: ' , W_Q3 )
print('Interquartile range (IQR): ' , stats.iqr(df['water']))

In [None]:
WL_outlier = W_Q1-1.5*(W_Q3-W_Q1)
WU_outlier = W_Q3+1.5*(W_Q3-W_Q1)
print('Lower outlier limit in water: ', WL_outlier)
print('Upper outlier limit in water: ', WU_outlier)

In [None]:
print('Number of outliers in water (upper): ',df[df['water']>WU_outlier]['water'].count())
print('Number of outliers in water (lower): ',df[df['water']<WL_outlier]['water'].count())

In [None]:
sns.boxplot(x='water',data=df);

In [None]:
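# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement.
sns.histplot(df['water'], kde=True).set_title('water');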

In [None]:
# Slag
S_Q1 = df['slag'].quantile(q=0.25)
S_Q3 = df['slag'].quantile(q=0.75)

In [None]:
LS_outliers = S_Q1-1.5*(S_Q3-S_Q1)
US_outliers = S_Q3+1.5*(S_Q3-S_Q1)
print('Lower outlier in slag', LS_outliers)
print('Upper outlier in slag', US_outliers)

In [None]:
print('Number of outliers in slag (upper): ',df[df['slag']>US_outliers]['slag'].count())
print('Number of outliers in slag (lower): ',df[df['slag']<LS_outliers]['slag'].count())

In [None]:
sns.boxplot(x='slag',data=df); 

In [None]:
# Age 
A_Q1 = df['age'].quantile(q=0.25)
A_Q3 = df['age'].quantile(q=0.75)

In [None]:
LA_outliers = A_Q1-1.5*(A_Q3-A_Q1)
UA_outliers = A_Q3+1.5*(A_Q3-A_Q1)
print('Lower outlier in age', LA_outliers)
print('Upper outlier in age', UA_outliers)

In [None]:
print('Number of outlier in age Upper: ',df[df['age']>129.5]['age'].count())
print('Number of outlier in age Lower: ',df[df['age']<-66.5]['age'].count())

In [None]:
sns.boxplot(x='age',data=df)

In [None]:
# Ash 
As_Q1 = df['ash'].quantile(q=0.25)
As_Q3 = df['ash'].quantile(q=0.75)

In [None]:
LAs_outliers = As_Q1-1.5*(As_Q3-As_Q1)
UAs_outliers = As_Q3+1.5*(As_Q3-As_Q1)
print('Lower outlier in ash', LAs_outliers)
print('Upper outlier in ash', UAs_outliers)

In [None]:
print('Number of outlier in ash Upper: ',df[df['ash']>295.75]['ash'].count())
print('Number of outlier in ash Lower: ',df[df['ash']<-177.45]['ash'].count())

In [None]:
sns.boxplot(x='ash',data=df);

In [None]:
fig, ax2 = plt.subplots(3,3, figsize=(12,12))
# One histogram with a KDE overlay per column, filling the 3x3 grid row by row.
# histplot replaces distplot, which is deprecated in recent seaborn releases.
cols = ['cement','slag','ash','water','superplastic','coarseagg','fineagg','age','strength']
for ax, col in zip(ax2.ravel(), cols):
    sns.histplot(df[col], kde=True, ax=ax)

In [None]:
sns.pairplot(df)

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(),square=True,annot=True,cmap='viridis');

In [None]:
df.boxplot(figsize=(10,10))

In [None]:
print('outlier in cement: ',df[((df.cement-df.cement.mean())/df.cement.std()).abs()>3]['cement'].count())
print('outlier in slag : ',df[((df.slag-df.slag.mean())/df.slag.std()).abs()>3]['slag'].count())
print('outlier in ash : ',df[((df.ash-df.ash.mean())/df.ash.std()).abs()>3]['ash'].count())
print('outlier in water: ',df[((df.water-df.water.mean())/df.water.std()).abs()>3]['water'].count())
print('outlier in superplastic: ',df[((df.superplastic-df.superplastic.mean())/df.superplastic.std()).abs()>3]['superplastic'].count())
print('outlier in  coarseagg: ',df[((df.coarseagg-df.coarseagg.mean())/df.coarseagg.std()).abs()>3]['coarseagg'].count())
print('outlier in fineagg: ',df[((df.fineagg-df.fineagg.mean())/df.fineagg.std()).abs()>3]['fineagg'].count())
print('outlier in age: ',df[((df.age-df.age.mean())/df.age.std()).abs()>3]['age'].count())
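
The eight print statements above apply the same rule to each column: a value counts as an outlier when its z-score exceeds 3 in absolute value. A compact sketch of that rule for a whole frame is shown below. The function name `zscore_outlier_counts` and the toy frame are assumptions made for illustration only.

```python
import pandas as pd


def zscore_outlier_counts(frame: pd.DataFrame, threshold: float = 3.0) -> pd.Series:
    """Count, per column, the values whose |z-score| exceeds the threshold."""
    # pandas' std() uses the sample standard deviation (ddof=1), as in the cell above.
    z = (frame - frame.mean()) / frame.std()
    return (z.abs() > threshold).sum()


# Toy frame (hypothetical values): column 'a' has one extreme value, 'b' has none.
demo = pd.DataFrame({
    'a': [0.0] * 19 + [100.0],
    'b': [float(i) for i in range(20)],
})
counts = zscore_outlier_counts(demo)
print(counts)
```

On the real data, `zscore_outlier_counts(df.drop('strength', axis=1))` would produce the per-feature counts in one call.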

In [None]:
for cols in df.columns[:-1]:
    Q1 = df[cols].quantile(0.25)
    Q3 = df[cols].quantile(0.75)
    iqr = Q3-Q1
    
    low = Q1-1.5*iqr
    high = Q3+1.5*iqr
    df.loc[(df[cols]<low) | (df[cols]>high), cols] = df[cols].median()

In [None]:
df.boxplot(figsize=(10,10))

## Feature Engineering & Model Building

In [None]:
df.head()

In [None]:
X = df.drop('strength',axis=1)
y = df['strength']

In [None]:
from scipy.stats import zscore

Xscaled = X.apply(zscore)
# Use the feature columns only; df.columns also contains the target 'strength'.
Xscaled_df = pd.DataFrame(Xscaled,columns=X.columns)

In [None]:
X_train,X_test,y_train,y_test = train_test_split(Xscaled,y, test_size=0.3,random_state=1)
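
The cells above standardise the whole feature matrix before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative, sketched here under the assumption that a held-out estimate free of that influence is wanted, is to split first and fit the scaler on the training rows only. This is not the notebook's method, and the synthetic arrays `X_demo` and `y_demo` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (hypothetical values).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=1.0, size=200)

# Split first, then learn the scaling parameters from the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training mean and std

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardised, which is expected.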

## Building Different Models

### 1) Random Forest Regressor

In [None]:
r_model = RandomForestRegressor()
r_model.fit(X_train,y_train)

In [None]:
y_pred = r_model.predict(X_test)

In [None]:
r_model.score(X_train,y_train)

In [None]:
r_model.score(X_test,y_test)

In [None]:
acc_r = metrics.r2_score(y_test,y_pred)
acc_r

In [None]:
metrics.mean_squared_error(y_test,y_pred)

In [None]:
result_1 = pd.DataFrame({'Algorithm':['Random Forest'],'accuracy':acc_r},index=['1'])
results = result_1[['Algorithm','accuracy']]                         
results                         

### 2) Random Forest Regressor KFold cross validation

In [None]:
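k=20

# random_state only takes effect when shuffle=True (recent scikit-learn raises an error otherwise).
kfold = KFold(n_splits=k,shuffle=True,random_state=70)
K_results = cross_val_score(r_model,X,y,cv=kfold)
# Average the fold R^2 scores directly; abs() would hide any negative fold.
accuracy = np.mean(K_results)
accuracy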

In [None]:
K_results
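
Since the per-fold scores vary, it can help to report their spread alongside the mean. The following is a minimal sketch on synthetic data, assuming shuffled folds and the default R² scoring for regressors. The arrays `X_demo` and `y_demo` and the reduced `n_splits=5` are illustrative choices, not values from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem (hypothetical data): two informative features, two noise features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(120, 4))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] ** 2 + rng.normal(scale=0.05, size=120)

# Shuffled K-fold split; random_state makes the fold assignment repeatable.
cv = KFold(n_splits=5, shuffle=True, random_state=70)
model = RandomForestRegressor(n_estimators=50, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='r2')

print('R^2 per fold:', np.round(scores, 3))
print('mean = %.3f, std = %.3f' % (scores.mean(), scores.std()))
```

A large standard deviation relative to the mean suggests the single averaged number should be read with caution.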

In [None]:
random_re = pd.DataFrame({'Algorithm':['Random Forest K_fold'],'accuracy':accuracy},index=['2'])
results = pd.concat([results,random_re])
results = results[['Algorithm','accuracy']]
results

### 3) Gradient Boosting Regressor

In [None]:
g_model = GradientBoostingRegressor()
g_model.fit(X_train,y_train)

In [None]:
gy_pred = g_model.predict(X_test)

In [None]:
g_model.score(X_train,y_train)

In [None]:
acc_g = metrics.r2_score(y_test,gy_pred)
acc_g

In [None]:
g_model.score(X_test,y_test)

In [None]:
metrics.mean_squared_error(y_test,gy_pred)

In [None]:
gradient_re = pd.DataFrame({'Algorithm':['GradientBoostingRegressor'],'accuracy':acc_g},index=['3'])
results = pd.concat([results,gradient_re])
results = results[['Algorithm','accuracy']]
results

### 4) Gradient Boosting Regressor KFold cross validation

In [None]:
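k=20

# random_state only takes effect when shuffle=True (recent scikit-learn raises an error otherwise).
kfold = KFold(n_splits=k,shuffle=True,random_state=70)
G_results = cross_val_score(g_model,X,y,cv=kfold)
# Average the fold R^2 scores directly; abs() would hide any negative fold.
g_accuracy = np.mean(G_results)
g_accuracy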

In [None]:
gradient_k = pd.DataFrame({'Algorithm':['GradientBoostingRegressor KFold'],'accuracy':g_accuracy},index=['4'])
results = pd.concat([results,gradient_k])
results = results[['Algorithm','accuracy']]
results

### 5) Ada Boosting Regressor

In [None]:
ada_model = AdaBoostRegressor()
ada_model.fit(X_train,y_train)

In [None]:
# Predict with the AdaBoost model (the original cell reused g_model by mistake).
aday_pred = ada_model.predict(X_test)

In [None]:
ada_model.score(X_train,y_train)

In [None]:
acc_ada = metrics.r2_score(y_test,aday_pred)
acc_ada

In [None]:
ada_model.score(X_test,y_test)

In [None]:
metrics.mean_squared_error(y_test,aday_pred)

In [None]:
ada_re = pd.DataFrame({'Algorithm':['AdaBoostRegressor'],'accuracy':acc_ada},index=['5'])
results = pd.concat([results,ada_re])
results = results[['Algorithm','accuracy']]
results

### 6) Ada Boosting Regressor KFold cross validation

In [None]:
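k=20

# random_state only takes effect when shuffle=True (recent scikit-learn raises an error otherwise).
kfold = KFold(n_splits=k,shuffle=True,random_state=70)
ada_results = cross_val_score(ada_model,X,y,cv=kfold)
# Average the fold R^2 scores directly; abs() would hide any negative fold.
ada_accuracy = np.mean(ada_results)
ada_accuracy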

In [None]:
ada_k = pd.DataFrame({'Algorithm':['AdaBoostRegressor KFold'],'accuracy':ada_accuracy},index=['6'])
results = pd.concat([results,ada_k])
results = results[['Algorithm','accuracy']]
results

### 7) KNN Regression

In [None]:
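# Track the test RMSE for each k; an exact-match error rate (pred != y) is not meaningful for regression.
diff_k=[]
for i in range(1,45):
    knn = KNeighborsRegressor(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i=knn.predict(X_test)
    diff_k.append(np.sqrt(mean_squared_error(y_test,pred_i)))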

In [None]:
plt.figure(figsize=(10,5))
plt.plot(range(1,45),diff_k,color='blue',linestyle='dashed',marker='o',markerfacecolor='red',markersize=8)
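
The loop above scans `n_neighbors` against the test split, which means the test data also guides the choice of k. One alternative, sketched here as an assumption rather than the notebook's approach, is to pick k by cross-validation with `GridSearchCV` and keep the test split for the final check only. The data below is synthetic, and the range of 1 to 20 is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic training data (hypothetical values) standing in for X_train, y_train.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(150, 2))
y_demo = np.sin(3 * X_demo[:, 0]) + X_demo[:, 1] + rng.normal(scale=0.1, size=150)

# 5-fold cross-validated search over k, scored by (negated) RMSE.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={'n_neighbors': list(range(1, 21))},
    scoring='neg_root_mean_squared_error',
    cv=5,
)
search.fit(X_demo, y_demo)

best_k = search.best_params_['n_neighbors']
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print('best k:', best_k, 'cross-validated RMSE:', round(best_rmse, 3))
```

After fitting, `search.best_estimator_` is a `KNeighborsRegressor` refitted on all the data passed to `fit` with the selected k.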

In [None]:
k_model = KNeighborsRegressor(n_neighbors=3)
k_model.fit(X_train,y_train)

In [None]:
ky_pred = k_model.predict(X_test)

In [None]:
k_model.score(X_train,y_train)

In [None]:
acc_k = metrics.r2_score(y_test,ky_pred)
acc_k

In [None]:
k_model.score(X_test,y_test)

In [None]:
metrics.mean_squared_error(y_test,ky_pred)

In [None]:
k_re = pd.DataFrame({'Algorithm':['KNeighborsRegressor'],'accuracy':acc_k},index=['7'])
results = pd.concat([results,k_re])
results = results[['Algorithm','accuracy']]
results

### 8) KNN Regression KFold Validation

In [None]:
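k=20

# random_state only takes effect when shuffle=True (recent scikit-learn raises an error otherwise).
kfold = KFold(n_splits=k,shuffle=True,random_state=70)
knn_results = cross_val_score(k_model,X,y,cv=kfold)
# Average the fold R^2 scores directly; abs() would hide any negative fold.
knn_accuracy = np.mean(knn_results)
knn_accuracy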

In [None]:
knn_k = pd.DataFrame({'Algorithm':['KNN KFold'],'accuracy':knn_accuracy},index=['8'])
results = pd.concat([results,knn_k])
results = results[['Algorithm','accuracy']]
results

### 9) Bagging Regressor

In [None]:
b_model = BaggingRegressor()
b_model.fit(X_train,y_train)

In [None]:
by_pred = b_model.predict(X_test)

In [None]:
b_model.score(X_train,y_train)

In [None]:
acc_b = metrics.r2_score(y_test,by_pred)
acc_b

In [None]:
b_model.score(X_test,y_test)

In [None]:
metrics.mean_squared_error(y_test,by_pred)

In [None]:
b_re = pd.DataFrame({'Algorithm':['BaggingRegressor'],'accuracy':acc_b},index=['9'])
results = pd.concat([results,b_re])
results = results[['Algorithm','accuracy']]
results

### 10) Bagging Regressor KFold Validation

In [None]:
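k=20

# random_state only takes effect when shuffle=True (recent scikit-learn raises an error otherwise).
kfold = KFold(n_splits=k,shuffle=True,random_state=70)
b_results = cross_val_score(b_model,X,y,cv=kfold)
# Average the fold R^2 scores directly; abs() would hide any negative fold.
b_accuracy = np.mean(b_results)
b_accuracy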

In [None]:
b_k = pd.DataFrame({'Algorithm':['BaggingRegressor KFold'],'accuracy':b_accuracy},index=['10'])
results = pd.concat([results,b_k])
results = results[['Algorithm','accuracy']]
results

### 11) Support Vector Regressor

In [None]:
s_model = SVR(kernel='linear')
s_model.fit(X_train,y_train)

In [None]:
sy_pred = s_model.predict(X_test)

In [None]:
s_model.score(X_train,y_train)

In [None]:
acc_s = metrics.r2_score(y_test,sy_pred)
acc_s

In [None]:
s_model.score(X_test,y_test)

In [None]:
metrics.mean_squared_error(y_test,sy_pred)

In [None]:
s_re = pd.DataFrame({'Algorithm':['SVRegressor'],'accuracy':acc_s},index=['11'])
results = pd.concat([results,s_re])
results = results[['Algorithm','accuracy']]
results

### 12) Support Vector Regressor KFold Validation

In [None]:
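k=20

# random_state only takes effect when shuffle=True (recent scikit-learn raises an error otherwise).
kfold = KFold(n_splits=k,shuffle=True,random_state=70)
s_results = cross_val_score(s_model,X,y,cv=kfold)
# Average the fold R^2 scores directly; abs() would hide any negative fold.
s_accuracy = np.mean(s_results)
s_accuracy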

In [None]:
s_k = pd.DataFrame({'Algorithm':['SVR KFold'],'accuracy':s_accuracy},index=['12'])
results = pd.concat([results,s_k])
results = results[['Algorithm','accuracy']]
results

### 13) XGBoost Regressor

In [None]:
import xgboost as xgb
from xgboost.sklearn import XGBRegressor
xgr = XGBRegressor()

xgr.fit(X_train,y_train);

In [None]:
xy_pred = xgr.predict(X_test)

In [None]:
xgr.score(X_train,y_train)

In [None]:
acc_x = metrics.r2_score(y_test,xy_pred)
acc_x

In [None]:
xgr.score(X_test,y_test)

In [None]:
metrics.mean_squared_error(y_test,xy_pred)

In [None]:
x_re = pd.DataFrame({'Algorithm':['XGB Regressor'],'accuracy':acc_x},index=['13'])
results = pd.concat([results,x_re])
results = results[['Algorithm','accuracy']]
results

Here, the XGBoost Regressor gives the highest score: an R² of about 0.90 on the test set (the value reported as "accuracy" in the table).
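
Every model section above follows the same fit, predict, score, and append pattern. As a rough sketch, that pattern can be written as one loop that builds the comparison table in a single pass. The synthetic data, the three models chosen, and the `summary` column names are assumptions for illustration, and the numbers this sketch prints say nothing about the concrete dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical values) standing in for the scaled features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=0),
    'Gradient Boosting': GradientBoostingRegressor(random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rows.append({
        'Algorithm': name,
        'r2': r2_score(y_te, pred),
        'rmse': np.sqrt(mean_squared_error(y_te, pred)),
    })

# Best model first, ranked by test R^2.
summary = pd.DataFrame(rows).sort_values('r2', ascending=False).reset_index(drop=True)
print(summary)
```

Reporting RMSE next to R² keeps the error in the units of the target (MPa for the strength column), which the R² value alone does not convey.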