# <h1 align="center"> Compressive Strength of Concrete Analysis</h1> 

### Prepared by- [Shantanil Bagchi](https://www.linkedin.com/in/shantanilbagchi/) ([Github Repo](https://github.com/ShantanilBagchi/Hackathons_Notebooks/tree/master/Compressive_Strength_of_Concrete_Study))

## Objective
Compressive strength or compression strength is the capacity of a material or structure to withstand loads tending to reduce size, as opposed to tensile strength, which withstands loads tending to elongate.

Compressive strength is one of the most important engineering properties of concrete. It is standard industrial practice to classify concrete by grade, and the grade is simply the compressive strength of a concrete cube or cylinder. Cube or cylinder samples are usually tested in a compression testing machine to obtain the compressive strength. The test requirements differ from country to country, depending on the design code.

The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.

The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory. The data is in raw form (not scaled).

Our objective is to build a machine learning model that helps civil engineers estimate the compressive strength of concrete, so that they can decide whether the concrete should be used in their current project.

## Dataset Summary

| Component           | Variable Type |
| :-------------       |:-------------:|
| Cement              | Input Variable |
| Blast Furnace Slag  | Input Variable|
| Fly Ash             | Input Variable|
| Water               | Input Variable|
| Superplasticizer    | Input Variable|
| Coarse Aggregate    | Input Variable|
| Fine Aggregate      | Input Variable|
| Age                 | Input Variable|
| **Concrete compressive strength** | **Output Variable** |

## Models

| Model                       | RMSE    | Accuracy (test R² × 100) |
| :--------------------------- |:-------:| :----------: |
| Gradient Boosting Regressor | 4.23       | 93        |
| Random Forest Regressor     | 5.06    | 90        |
| Decision Tree Regressor     | 6.52    | 83.5        |
| Extra Trees Regressor       | 4.80    | 91        |
| AdaBoost Regressor          | 8    | 75        |
| **XGBoost Regressor**       | 4.06    | 93.62        |
| Deep Neural Network         | 4.63    |            |
| **Bagging Regressor (estimator= grid_searched XGBoost)**       | 4.17    | 93.24        |
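
The two metric columns above can be reproduced by hand. The sketch below is a minimal illustration in plain Python, assuming that the accuracy-style column is the value printed by the notebook's `model()` helper, i.e. the regressor's `score` (R² for scikit-learn regressors) multiplied by 100. The function names and the sample numbers are made up for illustration and are not taken from the dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (MPa here)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up strengths in MPa, for illustration only
y_true = [30.0, 40.0, 50.0, 60.0]
y_pred = [32.0, 38.0, 51.0, 59.0]

example_rmse = rmse(y_true, y_pred)              # sqrt(10 / 4)
example_r2_pct = 100 * r2_score(y_true, y_pred)  # 100 * (1 - 10 / 500)
```

RMSE is reported in MPa, so lower is better, while the R²-based percentage is unitless and higher is better.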


***Note-***
* Outlier detection was carried out, but removing the outliers gave comparatively poor performance.
* A new engineered feature, the water/binder ratio, was introduced but did not improve performance.
* Columns (Fly Ash, Coarse Agg, Fine Agg) were removed to check performance, but the results did not improve.

The following results for **XGBoost** are given for reference (not included in the notebook):

| Detail                                            | RMSE(Whole) | Test Acc |
| :-------------                                    | :----------:|:--------:|
|  X_original                                             | 5.04  |  90.78 |
|  X_without_outliers                                     | 5.06  | 90.37  |
|  X_with_columns_removed (Fly Ash, Coarse Agg, Fine Agg) | 5.08  |  90.61 |
|  X_feature_engineered_with_water_cement                 | 4.52  | 91.69  |
|  X_feature_engineered (without Water, Cement)           | 4.06  |  93.62 |
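
The notes above mention a water/binder ratio without defining it. A minimal sketch follows, assuming the usual concrete-engineering convention that the binder is the sum of cement, blast furnace slag and fly ash; that definition, the function name and the sample mix are assumptions for illustration and are not taken from the notebook.

```python
def water_binder_ratio(water, cement, slag=0.0, fly_ash=0.0):
    """Water divided by total binder (assumed: cement + slag + fly ash).

    All quantities must be expressed in the same units.
    """
    binder = cement + slag + fly_ash
    if binder <= 0:
        raise ValueError("binder content must be positive")
    return water / binder

# Made-up mix, for illustration only
wb = water_binder_ratio(water=180.0, cement=300.0, slag=100.0, fly_ash=50.0)  # 180 / 450
wc = water_binder_ratio(water=180.0, cement=300.0)  # reduces to the water/cement ratio
```

On the notebook's DataFrame the same idea would read `data['Water'] / (data['Cement'] + data['Blast_Furnace_Slag'] + data['Fly_Ash'])`, using the column names assigned when the data is loaded.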

**Further Analysis can be done to improve performance**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load the dataset and view the first few rows

In [None]:
data =  pd.read_csv("../input/regression-with-neural-networking/concrete_data.csv")
data.head()

In [None]:

data.columns=['Cement',
       'Blast_Furnace_Slag',
       'Fly_Ash',
       'Water',
       'Superplasticizer',
       'Coarse_Aggregate',
       'Fine_Aggregate', 'Age',
       'Concrete_compressive_strength']

### Data Info and Missing Values

In [None]:
data.info()

In [None]:
data.isnull().sum()

# Creating Interaction Terms

In [None]:
data['Water_Cement_ratio']=data['Water']/data['Cement']

### Data Summary

In [None]:
plt.figure(figsize=(12,8))
# Drop the 'count' row, round to 2 decimals, and annotate cells with 2-decimal labels
sns.heatmap(data.describe()[1:].transpose().round(2),linewidths=2,annot=True,fmt=".2f",cmap="YlGnBu")
plt.xticks(fontsize=20)
plt.yticks(fontsize=12)
plt.title("Variables summary")
plt.show()


# Histogram of the complete dataset

In [None]:
# Histogram with a KDE overlay for every column
# (sns.distplot is deprecated since seaborn 0.11; histplot is its replacement)
columns = ['Cement', 'Blast_Furnace_Slag', 'Fly_Ash', 'Water', 'Superplasticizer',
           'Coarse_Aggregate', 'Fine_Aggregate', 'Age', 'Concrete_compressive_strength']
fig, ax2 = plt.subplots(3, 3, figsize=(16, 16))
for col, ax in zip(columns, ax2.ravel()):
    sns.histplot(data[col], kde=True, stat="density", ax=ax)

**Key Insights**
* Cement is almost normal.
* Blast furnace slag has two or three Gaussians and is right-skewed.
* Fly ash has two Gaussians and is right-skewed.
* Water has three Gaussians and is slightly left-skewed.
* Superplasticizer has two Gaussians and is right-skewed.
* Coarse aggregate has three Gaussians and is almost normal.
* Fine aggregate has almost two Gaussians and looks normal.
* Age has multiple Gaussians and is right-skewed.

# Skewness Degree

In [None]:

from scipy.stats import skew
numerical_features = data.select_dtypes(include=[np.number]).columns
categorical_features = data.select_dtypes(include=["object"]).columns  # np.object was removed in NumPy 1.24
skew_values = skew(data[numerical_features], nan_policy = 'omit')
dummy = pd.concat([pd.DataFrame(list(numerical_features), columns=['Features']), 
           pd.DataFrame(list(skew_values), columns=['Skewness degree'])], axis = 1)
dummy.sort_values(by = 'Skewness degree' , ascending = False)
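
For reference, with its default settings `scipy.stats.skew` returns the biased Fisher-Pearson coefficient, the third central moment divided by the second central moment raised to the power 3/2. The sketch below re-implements that formula in plain Python on made-up numbers; the helper name `skewness` is hypothetical and is not part of the notebook.

```python
def skewness(values):
    """Biased Fisher-Pearson skewness: g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = skewness([1, 2, 3, 4, 5])      # symmetric sample, skewness is zero
right_skewed = skewness([1, 2, 3, 4, 10])  # long right tail, positive skewness
```

A positive value corresponds to the right-skewed columns noted in the key insights, and a negative value to a left-skewed column such as Water.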

In [None]:
data.groupby("Age").mean()

### Visualize pairplot

In [None]:
sns.pairplot(data)

### Further Analysis

In [None]:
fig, ax2 = plt.subplots(2,4, figsize=(20, 10))
# Regression plot of each input against strength.
# seaborn >= 0.12 requires x and y to be passed as keyword arguments.
features = ['Cement', 'Blast_Furnace_Slag', 'Fly_Ash', 'Water',
            'Superplasticizer', 'Coarse_Aggregate', 'Fine_Aggregate', 'Age']
for feature, ax in zip(features, ax2.ravel()):
    sns.regplot(x='Concrete_compressive_strength', y=feature, data=data, ax=ax)

### **Concrete Compressive Strength vs. independent attributes**

- **Strength vs Cement**: Strength increases roughly linearly with cement content. However, for a given value of cement there are many different values of strength, so cement alone is not a very good predictor.
- **Strength vs Slag and Fly Ash**: There is no particular trend, as a lot of values are zero.
- **Strength vs Age**: For a given value of age, we have different values of strength. Hence, it is not a very good predictor.
- **Strength vs Superplasticizer**: For a given value of superplasticizer, we have different values of strength, and a lot of values are zero. Hence, it is not a good predictor.
- The other attributes do not show any strong relationship with Strength.

Hence, none of the independent attributes is a good predictor of strength on its own, so we will not use a linear model.

**Thus, an interaction term, the water/cement ratio, was created earlier; it has an inverse relationship with Strength. The Water and Cement columns are then dropped, since their relationship is already captured by the interaction term.**
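
One way to check an inverse relationship of this kind is the Pearson correlation coefficient, which is also what `data.corr()` computes by default. The sketch below is a plain-Python illustration on made-up numbers; `pearson_r` is a hypothetical helper, and the sample values are not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up water/cement ratios and strengths (MPa), for illustration only
ratio = [0.3, 0.4, 0.5, 0.6, 0.7]
strength = [62.0, 51.0, 44.0, 35.0, 30.0]
r = pearson_r(ratio, strength)  # strongly negative for this sample
```

A coefficient close to -1 indicates a strong inverse linear association. Pearson correlation only captures linear trends, so a weak coefficient does not rule out a nonlinear relationship.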

### Dropping Water and Cement columns (after trial and error; see the notes at the top)

In [None]:
data=data[['Superplasticizer',
       'Coarse_Aggregate', 'Fine_Aggregate', 'Age', 'Water_Cement_ratio',
       'Blast_Furnace_Slag', 'Fly_Ash',
       'Concrete_compressive_strength']]

In [None]:
cor = data.corr()

mask = np.zeros_like(cor)
mask[np.triu_indices_from(mask)] = True

plt.figure(figsize=(12,10))

with sns.axes_style("white"):
    sns.heatmap(cor,annot=True,linewidth=2,
                mask = mask,cmap="YlGnBu")
plt.title("Correlation between variables")
plt.show()

# Outlier Detection

In [None]:
from scipy import stats
outlier_list=[]
for c in data.columns[:-1]:
    Q1=data[c].quantile(q=0.25)
    Q3=data[c].quantile(q=0.75)
    print ("***************************************************************************")
    print('OUTLIER DETECTION FOR',c.upper())
    print ("***************************************************************************")
    
    print('1st Quartile (Q1) is: ', Q1)
    print('3rd Quartile (Q3) is: ', Q3)
    print('Interquartile range (IQR) is ', stats.iqr(data[c]))
    L_outliers=Q1-1.5*(Q3-Q1)
    U_outliers=Q3+1.5*(Q3-Q1)
    print('Lower outliers in',c, L_outliers)
    print('Upper outliers in ',c, U_outliers)
    print ("***************************************************************************")
    print('Number of outliers in',c, 'upper : ', data[data[c]>U_outliers][c].count())
    print('Number of outliers in',c,' lower : ', data[data[c]<L_outliers][c].count())
    print('% of Outlier in ',c,' upper: ',round(data[data[c]>U_outliers][c].count()*100/len(data)), '%')
    print('% of Outlier in ',c,' lower: ',round(data[data[c]<L_outliers][c].count()*100/len(data)), '%')
    print ("***************************************************************************")
    print(data[  (data[c] < L_outliers) | (data[c] > U_outliers)  ].index)
    outlier_list.extend(data[  (data[c] < L_outliers) | (data[c] > U_outliers)  ].index)
    print('\n')

In [None]:
data.loc[list(set(outlier_list))]

In [None]:
data_outlier=data.drop(outlier_list,axis=0).reset_index(drop = True)
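
The rule applied in the loop above is the standard 1.5 × IQR fence. Below is a minimal plain-Python sketch of the same rule on made-up numbers. It assumes that `statistics.quantiles` with `method="inclusive"` matches the linear interpolation pandas uses by default for `quantile`; the helper names are hypothetical.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outlier_indices(values, k=1.5):
    """Indices of the values that fall outside the fences."""
    lower, upper = iqr_fences(values, k)
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Made-up sample with one obvious outlier
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
fences = iqr_fences(sample)             # Q1 = 3, Q3 = 7, IQR = 4
outliers = iqr_outlier_indices(sample)  # only the last value is flagged
```

In the notebook, a row flagged in several columns is appended to `outlier_list` once per column, which is why the list is passed through `set()` before the flagged rows are displayed.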

# Algo Selection

In [None]:
# Input/independent variables
X = data.drop('Concrete_compressive_strength', axis = 1)   # drop the target column; not in place, so `data` is unchanged
y = data['Concrete_compressive_strength'] 

In [None]:
from sklearn.model_selection import  train_test_split, cross_val_score
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,random_state=2)

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn import metrics
from scipy.stats import pearsonr
warnings.filterwarnings("ignore")
#sns.set(style="darkgrid", color_codes=True) 

target = "Concrete_compressive_strength"
def model(algorithm,dtrainx,dtrainy,dtestx,dtesty,of_type,plot=False):
    
    print (algorithm)
    print ("***************************************************************************")
    algorithm.fit(dtrainx,dtrainy)
    
    #print(algorithm.get_params(deep=True))
    
    prediction = algorithm.predict(dtestx)
    
    print ("ROOT MEAN SQUARED ERROR :", np.sqrt(mean_squared_error(dtesty,prediction)) )
    print ("***************************************************************************")
    
    print ('Performance on training data :', algorithm.score(dtrainx,dtrainy)*100)
    print ('Performance on testing data :', algorithm.score(dtestx,dtesty)*100)

    print ("***************************************************************************")
    if plot==True:
        # stat_func is no longer accepted by recent seaborn releases, so it is omitted
        sns.jointplot(x=dtesty, y=prediction, kind="reg", color="k")
    
       
    prediction = pd.DataFrame(prediction)
    cross_val = cross_val_score(algorithm,dtrainx,dtrainy,cv=10)#,scoring="neg_mean_squared_error"
    cross_val = cross_val.ravel()
    print ("CROSS VALIDATION SCORE")
    print ("************************")
    print ("cv-mean :",cross_val.mean()*100)
    print ("cv-std  :",cross_val.std()*100)
    
    if plot==True:
        plt.figure(figsize=(20,22))
        plt.subplot(211)

        testy = dtesty.reset_index()["Concrete_compressive_strength"]

        ax = testy.plot(label="originals",figsize=(20,9),linewidth=2)
        ax = prediction[0].plot(label = "predictions",figsize=(20,9),linewidth=2)
        plt.legend(loc="best")
        plt.title("ORIGINALS VS PREDICTIONS")
        plt.xlabel("index")
        plt.ylabel("values")
        ax.set_facecolor("k")

        plt.subplot(212)

        if of_type == "coef":
            coef = pd.DataFrame(algorithm.coef_.ravel())
            coef["feat"] = dtrainx.columns
            # seaborn >= 0.12 requires x and y as keyword arguments
            ax1 = sns.barplot(x=coef["feat"],y=coef[0],palette="jet_r",
                              linewidth=2,edgecolor="k")
            ax1.set_facecolor("lightgrey")
            ax1.axhline(0,color="k",linewidth=2)
            plt.ylabel("coefficients")
            plt.xlabel("features")
            plt.title('FEATURE IMPORTANCES')

        elif of_type == "feat":
            coef = pd.DataFrame(algorithm.feature_importances_)
            coef["feat"] = dtrainx.columns
            # seaborn >= 0.12 requires x and y as keyword arguments
            ax2 = sns.barplot(x=coef["feat"],y=coef[0],palette="jet_r",
                              linewidth=2,edgecolor="k")
            ax2.set_facecolor("lightgrey")
            ax2.axhline(0,color="k",linewidth=2)
            plt.ylabel("importance")
            plt.xlabel("features")
            plt.title('FEATURE IMPORTANCES')
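
The `cross_val_score(..., cv=10)` call inside `model()` splits the training set into ten folds and scores the estimator on each held-out fold. For a regressor and an integer `cv`, scikit-learn uses unshuffled contiguous folds, and the default score is the estimator's own `score` (R²). The sketch below illustrates only the index-splitting step in plain Python; `kfold_indices` is a hypothetical helper, not a scikit-learn function.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample
        size = base + (1 if fold < extra else 0)
        stop = start + size
        val = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, val
        start = stop

folds = list(kfold_indices(n_samples=10, n_splits=3))
fold_sizes = [len(val) for _, val in folds]
```

Each sample is used for validation exactly once, and the mean and standard deviation printed as `cv-mean` and `cv-std` summarize the per-fold scores.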


# XGBoost Regressor

In [None]:
import xgboost as xgb
from xgboost.sklearn import XGBRegressor
xgr =XGBRegressor(random_state=2)
#model(xgr,X_train_or,y_train_or,X_test_or,y_test_or,"feat")
model(xgr,X_train,y_train,X_test,y_test,"feat")

In [None]:
xgr_1=XGBRegressor(random_state=2,learning_rate = 0.2,
                max_depth = 2, n_estimators = 800,n_jobs=-1,reg_alpha=0.005,gamma=0.1,subsample=0.7,colsample_bytree=0.9, colsample_bylevel=0.9, colsample_bynode=0.9)
model(xgr_1,X_train,y_train,X_test,y_test,"feat",True)

### Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
param_grid = {'n_estimators': [500, 800, 1000, 1200],
              'max_depth': [1, 2, 3, 5, 7, 9, 10, 11, 15],
              'learning_rate': [0.0001, 0.001, 0.01, 0.1, 0.15, 0.2, 0.8, 1.0]}
# Create a base model
xgbr = XGBRegressor(random_state = 2,reg_alpha=0.005,gamma=0.1,subsample=0.7,colsample_bytree=0.9, colsample_bylevel=0.9, colsample_bynode=0.9)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = xgbr, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)
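
Before running the search it helps to know how much work the grid implies. The sketch below enumerates the same grid with `itertools.product` and counts the model fits for 3-fold cross-validation. It is an illustration of the bookkeeping only and does not call scikit-learn.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    "n_estimators": [500, 800, 1000, 1200],
    "max_depth": [1, 2, 3, 5, 7, 9, 10, 11, 15],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.15, 0.2, 0.8, 1.0],
}
cv_folds = 3

# Every combination of one value per parameter is one candidate model
names = list(param_grid)
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

n_candidates = len(candidates)     # 4 * 9 * 8
n_fits = n_candidates * cv_folds   # each candidate is trained once per fold
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the whole training set, so the total is one fit higher than `n_fits`.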

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
print(grid_search.best_params_)
best_grid = grid_search.best_estimator_
model(best_grid,X_train,y_train,X_test,y_test,"feat",True)

# Using Bagging Technique

In [None]:
from sklearn.ensemble import BaggingRegressor
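# The base estimator is passed positionally: the keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2
regr = BaggingRegressor(xgr_1,
                         n_estimators=400, random_state=2,n_jobs=-1).fit(X_train, y_train)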
pred=regr.predict(X_test)
print('Root Mean Squared Error is: ', np.sqrt(mean_squared_error(y_test, pred)))

In [None]:
# Base estimator passed positionally (keyword renamed to `estimator` in scikit-learn 1.2)
regr_1 = BaggingRegressor(best_grid,
                         n_estimators=400, random_state=2,n_jobs=-1).fit(X_train, y_train)
pred=regr_1.predict(X_test)
print('Root Mean Squared Error is: ', np.sqrt(mean_squared_error(y_test, pred)))
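
Bagging trains each base model on a bootstrap resample of the training data and averages the predictions. The sketch below shows that mechanism in plain Python with a deliberately trivial base learner. `MeanRegressor` and `bagging_predict` are hypothetical names and the data is made up; it is not how `BaggingRegressor` is implemented internally.

```python
import random

class MeanRegressor:
    """Hypothetical stand-in base learner: predicts the mean of its training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

def bagging_predict(make_estimator, X_train, y_train, X_test, n_estimators=10, seed=2):
    """Average the predictions of estimators fitted on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(X_train)
    totals = [0.0] * len(X_test)
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        est = make_estimator().fit([X_train[i] for i in idx], [y_train[i] for i in idx])
        for j, p in enumerate(est.predict(X_test)):
            totals[j] += p
    return [t / n_estimators for t in totals]

# Made-up data, for illustration only
X_train = [[0.3], [0.4], [0.5], [0.6]]
y_train = [60.0, 50.0, 40.0, 30.0]
bagged = bagging_predict(MeanRegressor, X_train, y_train, X_test=[[0.45], [0.55]])
```

Averaging over resamples mainly reduces variance, which may explain why bagging a tuned XGBoost model changes the RMSE only slightly in the results table.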

### Gradient Boosting Regressor

In [None]:
from sklearn.ensemble import  GradientBoostingRegressor
# Only the non-default settings are kept. The original call also spelled out
# defaults such as loss='ls', min_impurity_split and presort, which newer
# scikit-learn releases no longer accept.
gbr = GradientBoostingRegressor(learning_rate=0.1, max_depth=4,
             n_estimators=800, random_state=2)
model(gbr,X_train,y_train,X_test,y_test,"feat",True)

### Random Forest Regressor

In [None]:
from sklearn.ensemble import  RandomForestRegressor
# Only the non-default settings are kept. criterion='mse', max_features='auto'
# and min_impurity_split are no longer accepted by newer scikit-learn releases.
rf = RandomForestRegressor(max_depth=80, n_estimators=400, n_jobs=-1)
model(rf,X_train,y_train,X_test,y_test,"feat")

### Decision Tree Regressor

In [None]:
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
model(dtr,X_train,y_train,X_test,y_test,"feat")

### Extra Trees Regressor

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
etr = ExtraTreesRegressor()
model(etr,X_train,y_train,X_test,y_test,"feat")

### AdaBoost Regressor

In [None]:
from sklearn.ensemble import AdaBoostRegressor
adb = AdaBoostRegressor()
model(adb,X_train,y_train,X_test,y_test,"feat")

# Neural Network

In [None]:
### Neural Network
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

# Building ANN As a Regressor
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import BatchNormalization  # the keras.layers.normalization path is not available in newer Keras
from keras import backend

#Defining Root Mean Square Error As our Metric Function 
def rmse(y_true, y_pred):
    return backend.sqrt(backend.mean(backend.square(y_pred - y_true), axis=-1))

# Initialising the ANN
model_nn = Sequential()

# Adding the input layer and the first hidden layer
model_nn.add(Dense(512, activation = 'relu', input_dim = 7))
model_nn.add(BatchNormalization())
# Adding the second hidden layer
model_nn.add(Dense(units = 256, activation = 'relu'))
model_nn.add(BatchNormalization())
# Adding the third hidden layer
model_nn.add(Dense(units = 128, activation = 'relu'))
model_nn.add(BatchNormalization())
model_nn.add(Dense(units = 32, activation = 'relu'))
model_nn.add(BatchNormalization())
# Adding the output layer
model_nn.add(Dense(units = 1))

# Optimize , Compile And Train The Model 
opt = keras.optimizers.Adam(learning_rate=0.0015)  # `lr` is the deprecated name of this argument
#print(model_nn.summary())
model_nn.compile(optimizer=opt,loss='mean_squared_error',metrics=[rmse])

In [None]:
import tensorflow as tf
checkpoint_filepath ='best.hdf5'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_rmse',
    mode='min',
    save_best_only=True)

# Model weights are saved at the end of every epoch, if it's the best seen
# so far.
history=model_nn.fit(sc.fit_transform(X_train),y_train,epochs = 100 ,batch_size=32,validation_data=(sc.transform(X_test), y_test), callbacks=[model_checkpoint_callback])

# The model weights (that are considered the best) are loaded into the model.
model_nn.load_weights(checkpoint_filepath)
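
The network is trained on standardized inputs: the scaler is fitted on the training rows only (`sc.fit_transform(X_train)`) and the same statistics are reused for the test rows (`sc.transform(X_test)`). The sketch below shows that arithmetic in plain Python on made-up numbers, assuming the population standard deviation that `StandardScaler` uses; `fit_scaler` and `transform` are hypothetical helpers.

```python
import math

def fit_scaler(rows):
    """Return per-column (mean, std) using the population standard deviation."""
    n = len(rows)
    stats = []
    for col in zip(*rows):
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        stats.append((mean, std if std > 0 else 1.0))  # constant column: avoid dividing by zero
    return stats

def transform(rows, stats):
    """Standardize each value with the statistics learned from the training rows."""
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]

# Made-up feature rows, for illustration only
train = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
test = [[4.0, 400.0]]

stats = fit_scaler(train)
train_scaled = transform(train, stats)  # each column now has mean 0 and unit variance
test_scaled = transform(test, stats)    # reuses the training mean and std
```

Reusing the training statistics keeps information about the test rows out of the preprocessing step.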

In [None]:
# Predicting on the test set and computing the root mean squared error
y_predict = model_nn.predict(sc.transform(X_test))
print('Root Mean Squared Error is: ', np.sqrt(mean_squared_error(y_test, y_predict))) 

plt.figure(figsize=(20,5))
plt.plot(list(y_test) ,color = 'red', label = 'Real data',marker='o')
plt.plot(y_predict, color = 'blue', label = 'Predicted data',marker='o')
plt.title('Prediction')
plt.legend()
plt.show()

# Plotting Loss And Root Mean Square Error For both Training And Test Sets
plt.plot(history.history['rmse'])
plt.plot(history.history['val_rmse'])
plt.title('Root Mean Squared Error')
plt.ylabel('rmse')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()