In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## **INTRODUCTION**

**Data Source:** Prof. I.-C. Yeh, Chung-Hua University via UC Irvine Machine Learning Repository  
**Citation:** I-Cheng Yeh, "Modeling of strength of high performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998). 
  
**Motivation:** I have spent the previous four months learning to program in Python and applying it to machine learning. For my first independent data science project, I looked for a dataset that was relatively easy to work with, suitable for regression analysis, and of personal interest. This dataset meets all three criteria: as a structural geologist, I am interested in the compressive strength of materials (usually rocks, but concrete is neat too).  
  
**Abstract:** Concrete has widespread use as a building material and is a primary constituent of many infrastructure and construction projects. Traditional concrete has three ingredients: Portland cement, aggregates, and water. High-performance concrete (HPC), the focus of this dataset, uses additional materials that have cementitious properties (Yeh, 1998). The addition of these materials makes for a complex material whose compressive strength is difficult to predict. Given this problem, I.-C. Yeh, the original owner of this dataset, set out to predict the strength of HPC using neural networks. His work produced a model of compressive strength with an R2 of ~0.9 +/- ~0.05. This measure of accuracy can be used as a benchmark for the models produced in this analysis.  
  
**Objective:** The main objective of this project is to understand variability in the prediction of different regression models. I will accomplish this in three steps: 1) predict compressive strength with various regression models, 2) compare the models using visualizations and measures of accuracy, and 3) analyze the importance of features in each model.

## **DATA EXPLORATION AND VISUALIZATION**

In this section, basic data exploration steps are carried out to understand the data, make it easier to work with as a DataFrame, and make initial observations about the distributions of the data and the correlation between each feature and the compressive strength, the target for prediction.

In [None]:
# This code cell reads the concrete compressive strength .csv into a pandas DataFrame.
concrete = pd.read_csv('../input/concrete-compressive-strength-uci/Concrete_Data.csv')
concrete.head()

In [None]:
# As seen above, the column names in this dataset are quite long. 
# The code below will rename the columns so they are easier to work with.
columns = ['cement', 'slag', 'flyash', 'water', 'superplasticizer', 'coarseagg', 'fineagg', 'age', 'strength']
concrete.columns = columns
concrete.head()

In [None]:
# Check column data types.
concrete.dtypes

In [None]:
# Check for missing values in the dataset.
concrete.isnull().sum()

In [None]:
# There is no missing data. Thus our data cleaning efforts will be minimal.
# This cell provides summary information for the dataframe.
concrete.describe()

In [None]:
# The code below gives a first-pass look at the correlation between each feature and our target, (compressive) strength.
concrete.corr()
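
The full correlation matrix can be hard to scan. As a rough sketch, the column for the target can be pulled out and ranked by absolute value. The `toy` DataFrame and its values are made up for illustration and only stand in for `concrete`.

```python
import pandas as pd

# Small made-up table standing in for the concrete DataFrame (illustrative values only).
toy = pd.DataFrame({
    "cement":   [100, 200, 300, 400, 500],
    "water":    [200, 190, 185, 170, 160],
    "age":      [28, 7, 90, 3, 28],
    "strength": [12, 25, 33, 41, 55],
})

# Correlation of every feature with the target, ranked by absolute value.
target_corr = toy.corr()["strength"].drop("strength")
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Ranking by absolute value keeps strongly negative relationships (such as water in this toy table) near the top instead of burying them at the bottom.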

Visualizations are a key step in understanding the dataset. Key features for visualization were selected using the first pass look at correlation coefficients in the **Data Exploration** section. Below are plots of these key features (x-axis) vs. compressive strength of HPC (y-axis). 

In [None]:
# This cell plots key features vs. compressive strength. 
fig = plt.figure(figsize=(15, 10))
ax1 = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax3 = fig.add_subplot(223)
ax4 = fig.add_subplot(224)

ax1.scatter(concrete.cement, concrete.strength, color='gray')
ax2.scatter(concrete.age, concrete.strength, color='gray')
ax3.scatter(concrete.superplasticizer, concrete.strength, color='gray')
ax4.scatter(concrete.water, concrete.strength, color='gray')

ax1.title.set_text('Strength vs. Cement')
ax2.title.set_text('Strength vs. Age')
ax3.title.set_text('Strength vs. Superplasticizer')
ax4.title.set_text('Strength vs. Water')

The above plots illustrate that none of our current features display a particularly strong relationship with compressive strength. Feature engineering may provide us with features which display stronger relationships. 

## **FEATURE ENGINEERING**

Using the work of Yeh (1998) as a guide, the following section creates additional features that may be useful in predicting the compressive strength of HPC. Many of the additives to HPC are cementitious by nature, meaning they add to the bonding or binding strength of the mixture. We can create features that look at the ratio of these materials ('Binder') to other ingredients within the HPC (e.g., the water/binder ratio of Yeh, 1998). 

In [None]:
# This cell copies the original DataFrame in order to engineer new features without disturbing the original data.
concrete_eng = concrete.copy()
concrete_eng.head()

In [None]:
# This code cell creates new features for the DataFrame following the procedure of Yeh (1998). 
concrete_eng['total'] = concrete_eng.cement + concrete_eng.slag + concrete_eng.flyash + concrete_eng.water + concrete_eng.superplasticizer + concrete_eng.coarseagg + concrete_eng.fineagg
concrete_eng['binder'] = concrete_eng.cement + concrete_eng.flyash + concrete_eng.slag
concrete_eng['wcratio'] = concrete_eng.water/concrete_eng.cement
concrete_eng['wbratio'] = concrete_eng.water/concrete_eng.binder
concrete_eng['spbratio'] = concrete_eng.superplasticizer/concrete_eng.binder
concrete_eng['fabratio'] = concrete_eng.flyash/concrete_eng.binder
concrete_eng['sbratio'] = concrete_eng.slag/concrete_eng.binder
concrete_eng['fasbratio'] = (concrete_eng.flyash + concrete_eng.slag)/concrete_eng.binder
concrete_eng.describe()
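
To make the ratio definitions concrete, here is a small worked example for a single mix. The quantities are hypothetical and are not taken from the dataset. Only the formulas mirror the cell above.

```python
# Illustrative mix quantities (hypothetical values, same units for every ingredient).
mix = {"cement": 300.0, "slag": 100.0, "flyash": 50.0,
       "water": 180.0, "superplasticizer": 9.0}

# Binder is the sum of all cementitious materials.
binder = mix["cement"] + mix["slag"] + mix["flyash"]

wcratio = mix["water"] / mix["cement"]        # water to cement
wbratio = mix["water"] / binder               # water to binder
spbratio = mix["superplasticizer"] / binder   # superplasticizer to binder

print(binder, wcratio, wbratio, spbratio)
```

Because the binder can never be smaller than the cement alone, the water to binder ratio is always less than or equal to the water to cement ratio. The two are equal only for mixes with no slag or fly ash.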

In [None]:
# Initial look at correlation between engineered features and compressive strength, which helps select key features for visualization.
concrete_eng.corr()

In [None]:
# Plots of key engineered features vs. compressive strength.
engfig = plt.figure(figsize=(15, 10))
ax1 = engfig.add_subplot(221)
ax2 = engfig.add_subplot(222)
ax3 = engfig.add_subplot(223)
ax4 = engfig.add_subplot(224)

ax1.scatter(concrete_eng.wcratio, concrete_eng.strength, color='gray')
ax2.scatter(concrete_eng.wbratio, concrete_eng.strength, color='gray')
ax3.scatter(concrete_eng.spbratio, concrete_eng.strength, color='gray')
ax4.scatter(concrete_eng.binder, concrete_eng.strength, color='gray')

ax1.title.set_text('Strength vs. W/C Ratio')
ax2.title.set_text('Strength vs. W/B Ratio')
ax3.title.set_text('Strength vs. Superplasticizer to Binder Ratio')
ax4.title.set_text('Strength vs. Binder')

The plots above show a relatively strong relationship between strength and both the water to cement ratio and the water to binder ratio. The remaining engineered features do not show strong correlations. Based on these observations, it may be helpful to include the water to cement and water to binder ratios in our model, along with total binder (which is kept in the final DataFrame below), and to drop the other engineered features.  

In [None]:
# Create the final DataFrame for model development based on observations from the previous sections.
final_columns = ['cement', 'slag', 'flyash', 'water', 'superplasticizer', 'coarseagg', 'fineagg', 'age', 'strength', 'binder', 'wcratio', 'wbratio']
concrete_final = concrete_eng[final_columns]
concrete_final.head(10)

## **MODEL DEVELOPMENT**

In [None]:
# The code below sets up the variables for prediction and splits the data into training and testing sets.
from sklearn.model_selection import train_test_split

indep_var = ['cement', 'slag', 'flyash', 'water', 'superplasticizer', 'coarseagg', 'fineagg', 'age', 'binder', 'wcratio', 'wbratio']
dep_var = ['strength']
X = concrete_final[indep_var]
y = concrete_final[dep_var]
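# Note: train_test_split returns its outputs in the order (X_train, X_test, y_train, y_test).
# Because the names below are assigned in the opposite order and test_size is 0.8, the net effect
# is that the models train on 80% of the rows (X_train) and are tested on the remaining 20% (X_test).
X_test, X_train, y_test, y_train = train_test_split(X, y, test_size = 0.8, random_state=1)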
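
As a small check of how `train_test_split` orders its return values, the sketch below runs it on ten made-up rows. The `X_demo` and `y_demo` arrays are placeholders, not the concrete data. It shows the conventional call with `test_size=0.2`, followed by the swapped-name pattern with `test_size=0.8`, which yields the same 80/20 proportions but a different set of rows in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 made-up rows, 2 features
y_demo = np.arange(10)

# Conventional order: the training split is returned first, the test split second.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
print(len(X_tr), len(X_te))

# With test_size=0.8 the first returned array is the small (20%) part,
# so swapping the names again gives a large training set.
small, large, _, _ = train_test_split(X_demo, y_demo, test_size=0.8, random_state=1)
print(len(small), len(large))
```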

In [None]:
# Here four models are set up. The predictions of various models will be compared later in this analysis.
from sklearn import linear_model
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

linear = linear_model.LinearRegression()
ridge = linear_model.BayesianRidge()
tree = DecisionTreeRegressor(random_state=1)
forest = RandomForestRegressor(random_state=1)

In [None]:
# Each model is fit using the training data.
linear.fit(X_train, y_train.values.ravel())
ridge.fit(X_train, y_train.values.ravel())
tree.fit(X_train, y_train.values.ravel())
forest.fit(X_train, y_train.values.ravel())

Now that multiple models are defined, we can predict the compressive strength of our test dataset and evaluate and compare the accuracy of the different models chosen for this problem.  
  
I once again create a plot with four subplots, one for each model in this project. Beyond the visual comparison, we can also measure accuracy using MSE, max error, and R2. MSE and R2 were chosen because they are common measures and easily understood. Max error was chosen specifically for this problem: if HPC is used for a construction or infrastructure project, it is useful to know which model tends to make predictions that are far off, as we want to minimize the risk of material failure for safety purposes. 
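
For reference, the three measures can be computed by hand on a handful of numbers. The sketch below uses made-up actual and predicted values (not model output) and plain Python, following the standard definitions of MSE, max error, and R2.

```python
# Made-up actual and predicted strengths (illustrative values only).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 40.0]

residuals = [t - p for t, p in zip(y_true, y_pred)]

# Mean squared error: average of the squared residuals.
mse = sum(r ** 2 for r in residuals) / len(residuals)

# Max error: the single worst absolute miss.
max_err = max(abs(r) for r in residuals)

# R2: 1 minus (residual sum of squares / total sum of squares around the mean).
mean_true = sum(y_true) / len(y_true)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mse, max_err, round(r2, 3))
```

MSE and R2 summarize the typical error across all samples, while max error depends on a single sample. This is why a model can rank well on the first two and poorly on the third.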

In [None]:
# This code cell uses the models created above to predict concrete strength with the test data.
linear_preds = linear.predict(X_test)
ridge_preds = ridge.predict(X_test)
tree_preds = tree.predict(X_test)
forest_preds = forest.predict(X_test)

In [None]:
# Here I visualize the predicted values vs. the known values.
engfig = plt.figure(figsize=(15, 10))
ax1 = engfig.add_subplot(221)
ax2 = engfig.add_subplot(222)
ax3 = engfig.add_subplot(223)
ax4 = engfig.add_subplot(224)

ax1.scatter(linear_preds, y_test, color='gray')
ax2.scatter(ridge_preds, y_test, color='gray')
ax3.scatter(tree_preds, y_test, color='gray')
ax4.scatter(forest_preds, y_test, color='gray')

ax1.title.set_text('Linear Model')
ax2.title.set_text('Ridge Model')
ax3.title.set_text('Decision Tree Model')
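ax4.title.set_text('Random Forest Model')

# Label the axes so the orientation (predicted on x, actual on y) is unambiguous.
for ax in (ax1, ax2, ax3, ax4):
    ax.set_xlabel('Predicted strength')
    ax.set_ylabel('Actual strength')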

In [None]:
# To compare models beyond visualizations, a measure of accuracy is needed. For this analysis, I use MSE, max error, and R2.

from sklearn.metrics import mean_squared_error
from sklearn.metrics import max_error
from sklearn.metrics import r2_score

linear_mse = mean_squared_error(y_test, linear_preds)
ridge_mse = mean_squared_error(y_test, ridge_preds)
tree_mse = mean_squared_error(y_test, tree_preds)
forest_mse = mean_squared_error(y_test, forest_preds)

linear_max = max_error(y_test, linear_preds)
ridge_max = max_error(y_test, ridge_preds)
tree_max = max_error(y_test, tree_preds)
forest_max = max_error(y_test, forest_preds)

linear_r2 = r2_score(y_test, linear_preds)
ridge_r2 = r2_score(y_test, ridge_preds)
tree_r2 = r2_score(y_test, tree_preds)
forest_r2 = r2_score(y_test, forest_preds)

print('Linear model MSE, max error, R2 =', linear_mse, ',', linear_max, ',', linear_r2)
print('Ridge model MSE, max error, R2 =', ridge_mse, ',', ridge_max, ',', ridge_r2)
print('Decision tree model MSE, max error, R2 =', tree_mse, ',', tree_max, ',', tree_r2)
print('Random forest model MSE, max error, R2 =', forest_mse, ',', forest_max, ',', forest_r2)

As is clear in both the visualization and the measures of accuracy, the Random Forest model is the most accurate predictor of HPC compressive strength, with the Decision Tree model a close second. Their R2 values of 0.92 and 0.88, respectively, are within the range of R2 values from the work of Yeh (1998). We could potentially tune the models used here to increase accuracy, but for now that is beyond the scope of this project. Random Forest is most likely more accurate than Decision Tree because it is an ensemble learning method: it averages the predictions of many trees, which tends to reduce the overfitting of any single tree.
  
Additionally, we can see the importance of using the max error measure of accuracy. While the Decision Tree model is significantly better than the Linear and Ridge models on the MSE and R2 measures, it is actually worse on the max error measure. In a real-world situation, if the predicted compressive strength is significantly higher than the actual strength, there is a higher likelihood of material failure during its lifespan and thus a higher chance of damage to property or harm to people.

## **FEATURE IMPORTANCE ANALYSIS**

Here I use Permutation Importance to gain insight into which features are most important to the various models. Permutation Importance was selected because it is quick to calculate and relatively easy to understand.
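
The idea behind permutation importance can be sketched in a few lines of NumPy: shuffle one feature column at a time and record how much the model's score drops. The example below is only an illustration on synthetic data with a least-squares fit as a stand-in model. It is not the eli5 implementation, and all names in it are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X_demo = rng.normal(size=(n, 2))                 # column 0 informative, column 1 pure noise
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=n)

# Fit ordinary least squares (with an intercept) as a stand-in model.
A = np.column_stack([X_demo, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

def r2(X, y):
    """R2 of the fitted linear model on (X, y)."""
    pred = np.column_stack([X, np.ones(len(X))]) @ coef
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

baseline = r2(X_demo, y_demo)
importances = []
for j in range(X_demo.shape[1]):
    X_perm = X_demo.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between feature j and y
    importances.append(baseline - r2(X_perm, y_demo))

print(importances)
```

A large drop in score means the model relied heavily on that feature, while a drop near zero means the feature contributed little to the predictions.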

In [None]:
# This cell imports and establishes the permutation importance function.
import eli5
from eli5.sklearn import PermutationImportance

linear_perm = PermutationImportance(linear, random_state=1).fit(X_train, y_train)
ridge_perm = PermutationImportance(ridge, random_state=1).fit(X_train, y_train)
tree_perm = PermutationImportance(tree, random_state=1).fit(X_train, y_train)
forest_perm = PermutationImportance(forest, random_state=1).fit(X_train, y_train)

In [None]:
# The code below from the eli5 library creates a visualization of permutation importance for the linear model.
eli5.show_weights(linear_perm, feature_names = X_train.columns.tolist())

In [None]:
# The code below from the eli5 library creates a visualization of permutation importance for the ridge model.
eli5.show_weights(ridge_perm, feature_names = X_train.columns.tolist())

In [None]:
# The code below from the eli5 library creates a visualization of permutation importance for the decision tree model.
eli5.show_weights(tree_perm, feature_names = X_train.columns.tolist())

In [None]:
# The code below from the eli5 library creates a visualization of permutation importance for the random forest model.
eli5.show_weights(forest_perm, feature_names = X_train.columns.tolist())
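
As an alternative sketch that does not depend on eli5, scikit-learn provides `sklearn.inspection.permutation_importance`. The example below uses synthetic data and placeholder feature names. It evaluates importance on a held-out split rather than on the training data used in the cells above, which is one common way to see which features matter for generalization.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X_demo = rng.uniform(size=(300, 3))
y_demo = 10.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)   # only column 0 matters

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
model = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out split and average the drop in score.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=1)
for name, mean, std in zip(["f0", "f1", "f2"], result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```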

Traditionally (per Yeh, 1998), Abrams' rule has been used, which states that an increase in the water to cement ratio generally decreases concrete strength. This feature importance analysis suggests that the rule generally still holds, but because HPC contains additional cementitious materials, they must be included in the ratio. As a result, the water to binder ratio acts, along with age, as a primary control on HPC compressive strength. The water to cement ratio and the total amount of binder both play an important role in the models as well. 

## **CONCLUSIONS**

**Findings:** The above analysis demonstrates that the compressive strength of high-performance concrete is primarily controlled by the age of the material and its water to binder ratio. This finding is consistent with the work of Yeh (1998). The Decision Tree and Random Forest regression models used here were able to predict the compressive strength of HPC with accuracy in line with the work of Yeh (1998). 
  
**Objectives:** I was able to accomplish my objectives for this project, my first independent machine learning project. The use of different regression models, along with visualizing and comparing measures of accuracy, allowed me to understand why certain models are used more frequently. While completing Kaggle's Microcourses, I was curious why the machine learning courses began with Decision Tree and Random Forest regression models without any discussion of a basic Linear model. This analysis helps demonstrate why: on this dataset they are significantly more accurate. The feature importance analysis also helped to illustrate which features had the largest impact on the models and led to a potential next step for this project. 
  
**Learnings and Next Steps:** Not surprisingly, the ensemble model (Random Forest) performed better than the simpler Linear, Ridge, and Decision Tree models, which reinforces the benefit of ensemble methods. Potential next steps to improve the prediction of compressive strength of HPC are: 1) select features for model building based on cutoff values from the feature importance results, 2) try additional ensemble methods for model building, and 3) tune the models, all of which are first-pass models here, by adjusting their hyperparameters (e.g., setting a value for max_leaf_nodes) to improve accuracy. 
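
As a rough sketch of the third next step, a cross-validated grid search over `max_leaf_nodes` could look like the following. The data is synthetic and the candidate values in `param_grid` are arbitrary placeholders. For the real analysis, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X_demo = rng.uniform(size=(200, 4))
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.2, size=200)

# Candidate tree sizes to compare (arbitrary illustrative values).
param_grid = {"max_leaf_nodes": [5, 20, 50, 100]}

# 5-fold cross-validation on the training data, scored by R2.
search = GridSearchCV(DecisionTreeRegressor(random_state=1), param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Because the search cross-validates within the training data, the test set stays untouched until the final comparison.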