Domain: Cement manufacturing

Context: Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.

Objective: Model the strength of high-performance concrete using machine learning, covering EDA, building ML models for regression, and hyperparameter tuning

In [None]:
# 1. Data pre-processing: perform all the preprocessing needed to get the data ready for
# featurization, model selection and tuning

In [None]:
# Import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

In [None]:
# Load the data and name it as cData (i.e. Concrete Data)
cData = pd.read_csv("../input/concrete/concrete.csv")
cData.shape

- The data set has a total of 9 columns (properties) and 1030 rows (concrete data points)

In [None]:
# Display the data set info
cData.info()

- 9 properties are provided: the ingredients used in preparing the concrete, its age, and the resulting strength
- 1 property, the number of days ('age'), is of integer type
- The remaining 8 properties hold decimal (float) values

In [None]:
# Make a copy of the original data set cData for further processing with various machine learning techniques
# in later parts of the project
cDataOrig = cData.copy() # cDataOrig = original copy of the master data set for concrete strength measurement

# Let's get the sample look of the data
cDataOrig.head()

- All attributes are numerical in the data set
- None of the properties are categorical, hence no encoding or dummy property creation is required

In [None]:
# Check if there are any missing values in the data set
cData.isnull().values.any()

- The result above indicates there are no blank (null) values in the dataset provided

In [None]:
# Check how many zero values are present in each column
(cData == 0).sum(axis=0)

- From the zero counts above, the following share of values in each column is zero, and we treat these zeros as missing:
    - 47% in slag
    - 56% in ash
    - 37% in superplastic

- Let's review the remaining data distributions below; a technique to address the missing values is applied later
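
The percentages quoted above can be derived directly from the zero counts. The following is a minimal sketch on a toy frame; the values are made up for illustration and are not taken from concrete.csv.

```python
import pandas as pd

# Toy stand-in for cData; the numbers are illustrative only
toy = pd.DataFrame({
    "slag": [0.0, 120.5, 0.0, 98.1],
    "ash": [0.0, 0.0, 0.0, 141.0],
    "water": [162.0, 175.5, 180.0, 192.3],
})

# A boolean mask averaged per column gives the fraction of zero entries
zero_pct = (toy == 0).mean() * 100
print(zero_pct.round(1))
```

On the real data set the same expression would presumably be `(cData == 0).mean() * 100`.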

In [None]:
# Let's have a quick look at the data description to get an idea of the 5 point summary
cData.describe().T

- All properties are numerical, hence none of the columns are dropped from the summary
- slag: at least 25% of the concrete samples contain no slag (Q1 is 0). The Q3 (75th percentile) value is 142.95, whereas the max value is 359.5. This indicates possible outliers in this column
- ash: at least 50% of the concrete samples contain no ash (the median is 0). The Q3 value is 118.3, whereas the max value is 200.1. This indicates possible outliers in this column
- superplastic: at least 25% of the concrete samples contain no superplasticizer (Q1 is 0). The Q3 value is 10.2, whereas the max value is 32.2. This indicates possible outliers in this column as well
- age: concrete strength is measured over a range of 1 to 365 days; however, 75% of the data points are measured at 56 days or less
- strength is the target variable in the data set; the remaining 8 properties (the ingredients and age) are the independent attributes

In [None]:
# Display the range of values of each column
cData.max() - cData.min()

In [None]:
# Let's review the skewness of the properties
cData.skew()

- The following properties are positively skewed (skewness > 0.5), which indicates that these attributes are not normally distributed. The tail of the distribution is longer on the right side, and the mean is greater than the median for these properties.

    - slag, superplastic, age

- The following properties are negatively skewed, which indicates that these attributes are also not normally distributed. The tail of the distribution is longer on the left side, and the mean is less than the median for these properties.

    - coarseagg, fineagg
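
The link between the sign of the skewness and the order of mean and median can be checked on a small example. This is only a sketch with invented numbers, not data from the concrete set.

```python
import pandas as pd

# One large value stretches the right tail
right_skewed = pd.Series([1, 1, 2, 2, 3, 3, 4, 20])
# Mirroring the values moves the long tail to the left
left_skewed = -right_skewed

for name, s in [("right", right_skewed), ("left", left_skewed)]:
    print(name, "skew:", round(s.skew(), 2),
          "mean:", s.mean(), "median:", s.median())
```

For the right-skewed series the mean sits above the median, and for the mirrored series it sits below.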

In [None]:
# Understanding the attributes - find the relationship between different attributes (independent variables) and
# choose which attributes should be part of the analysis and why

In [None]:
# Describe the correlation of the variables through a graphical heat map
# +ve and -ve numbers indicate how the variables are correlated with each other

colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize=(15,15))
plt.title('Concrete properties Correlation of attributes', y=1.05, size=19)
sns.heatmap(cData.corr(),linewidths=0.1,vmax=1.0, 
            square=True, cmap=colormap, linecolor='white', annot=True)

- The following features have a comparatively strong relationship with the target column strength:
    - cement, superplastic, duration (i.e. age) and water, in that order
- fineagg and coarseagg are not strongly related to the strength of the concrete
- superplastic and ash are closely related, and superplastic has a positive influence on concrete strength

In [None]:
# Display the pair plot
sns.pairplot(cData)
plt.show()

- From the pair plot above, we observe the following:
    - the slag, ash and age features are not normally distributed and are right-skewed
    - strength has multiple peaks, indicating possible clusters in the dataset
    - none of the features has a clearly linear relationship with strength
    - the slag, ash and superplastic features have many zero values in the dataset
    - the age column can be grouped into 4 separate clusters

In [None]:
# Measure of possible outliers in the dataset -
# This can be done by checking the box plot of each property or by evaluating the z-scores of each column

# As there are 9 properties in the dataset, let's get the z-score of the entire data set
# to check for outliers statistically

from scipy.stats import zscore

# Get the z score
z_cData = cData.apply(zscore)

# Set the limit to 3 sigma to check for outliers in the concrete data provided
# Note: this flags only values more than 3 sigma above the column mean
limit = 3
t1 = np.where(z_cData > limit) # store the outlier positions in a tuple variable

# Print the positions in the original data set of the values beyond 3 sigma
print(t1[0]) # row positions of the outliers
print(t1[1]) # column positions of the outliers

In [None]:
list1 = t1[0] # row positions of the outliers in the data set
list2 = t1[1] # column positions of the outliers in the data set

# Print each outlier's column and its value from the data set provided
for row, col in zip(list1, list2): # walk the (row, column) pairs together
    print("Outliers exist in properties: ", cData.columns[col], " and the value is: ", cData.iloc[row, col])

- The following properties have multiple outliers in the dataset, as the programmatic check above shows:
    - slag, age, superplastic, water
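
The check above compares the z-score against +3 only, so unusually low values are not flagged. Whether the low side matters here is a judgment call; as an assumption, a two-sided variant could look like the sketch below (toy values, not the concrete data).

```python
import pandas as pd
from scipy.stats import zscore

# Toy column: 20 ordinary readings plus one very low and one very high value
values = [50.0] * 10 + [52.0] * 10 + [0.0, 100.0]
toy = pd.DataFrame({"water": values})

z = toy.apply(zscore)
limit = 3

# One-sided check, as in the cell above: only the high value is flagged
upper_only = toy[(z > limit).any(axis=1)]
# Two-sided check: the absolute z-score also catches the low value
both_tails = toy[(z.abs() > limit).any(axis=1)]

print(len(upper_only), len(both_tails))
```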

In [None]:
# Let's review the outliers visually through below graph i.e. to find presence of leverage points
# plot strength vs age relation
sns.boxplot(x = "age", y = "strength", data = cData); 
plt.show()

- There are outliers for the age group of 14, 28 and 180
- 14 unique age groups can be obtained from the dataset

In [None]:
# Let's view the presence of outliers in overall age dataset
sns.boxplot(y = "age", data = cData); 
plt.show()

In [None]:
# Let's view the presence of outliers in slag dataset
sns.boxplot(y = "slag", data = cData); 
plt.show()

In [None]:
# Let's view the presence of outliers in overall water dataset
sns.boxplot(y="water", data = cData); 
plt.show()

In [None]:
# Let's view the presence of outliers in overall superplastic dataset
sns.boxplot(y="superplastic", data = cData); 
plt.show()

In [None]:
#Distribution plot of Slag
sns.distplot(cData['slag'])
plt.show()

#Distribution plot of ash
sns.distplot(cData['ash'])
plt.show()

#Distribution plot of age
sns.distplot(cData['age'])
plt.show()

#Distribution plot of superplastic
sns.distplot(cData['superplastic'])
plt.show()

#Distribution plot of water
sns.distplot(cData['water'])
plt.show()

In [None]:
# Perform necessary imputation - to address the presence of outliers and missing values

In [None]:
# From the above histograms of the slag, ash and superplastic columns we see that a significant % of data points have a zero value
# Hence let's impute these variables, replacing each zero with the column's median value

# Replace all the zero values with the corresponding median in the slag, ash and superplastic columns
from sklearn.impute import SimpleImputer # Use SimpleImputer to replace all zeros in a generic way

# Initialize the imputer for the data set; zeros are treated as the missing-value marker,
# so each column median is computed from its non-zero values only
imputer_for_cData = SimpleImputer(missing_values = 0 , strategy = 'median')
imputer_for_cData.fit(cData)

col_names_cData = cData.columns.values # Get the column names for the data set

# Get the new data set after replacing the zeros with the corresponding column median
cData = pd.DataFrame(imputer_for_cData.transform(cData), columns=col_names_cData)

In [None]:
# Let's obtain before and after comparison with respect to 5 point summary of the data set
cDataOrig.describe().T # from the dataset before the imputation

In [None]:
cData.describe().T # from the dataset after the imputation

- From the comparison above, we see that the summary statistics (mean, standard deviation, min, quartiles, max) have changed for the slag, ash and superplastic properties

In [None]:
# Let's compare the median value of the columns before and after the imputation
cDataOrig.median() #column median from the original data set before imputation

In [None]:
cData.median() #column median from the data set after imputing zero values

- Above indicates that the median value changed only for slag, ash and superplastic properties
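
The medians shift because `SimpleImputer` with `missing_values=0` ignores the zeros when it computes each column's median. A small sketch on an invented single-column array illustrates this:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with two zeros standing in for "missing" entries
toy = np.array([[0.0], [2.0], [4.0], [0.0], [6.0]])

imputer = SimpleImputer(missing_values=0, strategy="median")
filled = imputer.fit_transform(toy)

print("median including zeros:", np.median(toy))
print("median used for filling:", imputer.statistics_[0])
print("filled column:", filled.ravel())
```

The fill value is the median of the non-zero entries (2, 4, 6), not the median of the raw column.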

In [None]:
# Let's review the data distribution after imputation of zeros i.e. missing values in the given concrete data set
#Distribution plot of Slag - after imputation of zeros
sns.distplot(cData['slag'])
plt.show()

#Distribution plot of superplastic - after imputation of zeros
sns.distplot(cData['superplastic'])
plt.show()

#Distribution plot of ash
#more than 50% data points had zero in the ash column, kde bandwidth is zero as well. 
#Hence marked kde as False in histogram plot
sns.distplot(cData['ash'], kde=False) 
plt.show()

- From the 3 distribution plots above for slag, ash and superplastic, we see the data is better distributed, looking closer to a normal distribution with one major peak

In [None]:
# Let's review the dataset to get rid of the outliers
# Previously we measured the 3 sigma z-score on the entire dataset and found outliers statistically.

# Let's remove the outliers from the data set
cData_zScore = cData.apply(zscore) # Get the data set with z score

# Make a new dataframe after removing the outliers
cData_wo_outliers_by_Zscore = cData[(cData_zScore <3).all(axis=1)]

print("The shape of the dataset (before outliers removal by Z-score): ", cData.shape)
print("The shape of the dataset (after outliers removal by Z-score): ", cData_wo_outliers_by_Zscore.shape)

- From the computation above, 77 of the 1030 records (about 7%) are dropped

In [None]:
# In another approach, we can replace the outliers with the upper or lower whisker value
# In this case no records need to be dropped from the data set

# Get the quartile and inter quartile ranges
Q1 = cData.quantile(0.25)
Q3 = cData.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

In [None]:
cData_wo_outliers_by_whisker = cData.copy()

# Whisker limits per column (1.5 * IQR beyond the quartiles)
lower_whisker = Q1 - 1.5 * IQR
upper_whisker = Q3 + 1.5 * IQR

# Replace every outlier on the lower side by the lower whisker
for i, j in zip(*np.where(cData_wo_outliers_by_whisker < lower_whisker)):
    cData_wo_outliers_by_whisker.iloc[i, j] = lower_whisker.iloc[j]

# Replace every outlier on the upper side by the upper whisker
for i, j in zip(*np.where(cData_wo_outliers_by_whisker > upper_whisker)):
    cData_wo_outliers_by_whisker.iloc[i, j] = upper_whisker.iloc[j]

In [None]:
print("The shape of the dataset (before outliers replacement by whisker): ", cData.shape)
print("The shape of the dataset (after outliers replacement by whisker): ", cData_wo_outliers_by_whisker.shape)

- As shown above, the outliers are replaced rather than removed, hence the shape of the data set does not change
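
The same capping can also be written without explicit loops. The sketch below uses `DataFrame.clip` with per-column bounds on a toy frame; the values are invented, and the variable names are illustrative rather than taken from this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "water": [150.0, 160.0, 165.0, 170.0, 175.0, 180.0, 300.0],
    "age": [1.0, 7.0, 14.0, 28.0, 28.0, 56.0, 365.0],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# axis=1 aligns the bounds with the columns, so each column gets its own limits
capped = toy.clip(lower=lower, upper=upper, axis=1)

print(capped)
print(toy.shape, capped.shape)
```

As with the loop version, values beyond the limits are pulled back to the limit and no rows are dropped.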

In [None]:
# Let's review whether there are still outliers in the dataset - after replacement by whisker

# Check outliers in superplastic
sns.boxplot(y="superplastic", data = cData_wo_outliers_by_whisker);
plt.show()

# Check outliers in age
sns.boxplot(y="age", data = cData_wo_outliers_by_whisker);
plt.show()

# Check outliers in water
sns.boxplot(y="water", data = cData_wo_outliers_by_whisker);
plt.show()

# Check outliers in slag
sns.boxplot(y="slag", data = cData_wo_outliers_by_whisker);
plt.show()


- From above graphical representation of slag, water, age and superplastic, we see no more outliers in the dataset

In [None]:
# Identify opportunities to create a composite feature, drop a feature etc

In [None]:
# Make a copy of data set to check if any of the feature can be divided into subgroups
cDataFE = cData.copy() # cDataFE - stands for Concrete data set for Feature Engineering
cDataFE.head()

In [None]:
# Let's count the unique values in each of the columns
for col in cDataFE.columns:
    print("Number of unique values in", col, "column:", cDataFE[col].nunique())

- As the age column has only 14 unique values (numbers of days) across the 1030 records, we can split this feature into categories as follows:

    - Less than 1 month (for # of days from 1 to 30)
    - Less than 3 month (for # of days from 31 to 90)
    - Less than 6 month (for # of days from 91 to 180)
    - Less than a year  (for # of days from 181 to 365)
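
The four buckets can also be expressed with `pandas.cut`. The sketch below runs on a handful of illustrative day counts and treats the bin edges as right-closed (30 days falls in the first bucket), which is an assumption about how the boundaries should be read.

```python
import pandas as pd

# Illustrative day counts, including the boundary values
days = pd.Series([1, 7, 28, 30, 56, 90, 91, 180, 270, 365], name="age")

labels = ["Less than 1 month", "Less than 3 month",
          "Less than 6 month", "Less than a year"]

# Right-closed bins: (0, 30], (30, 90], (90, 180], (180, 365]
age_group = pd.cut(days, bins=[0, 30, 90, 180, 365], labels=labels)

print(pd.concat([days, age_group.rename("age_group")], axis=1))
```

The result is a categorical column that can be passed to `pd.get_dummies` directly.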

In [None]:
# Let's categorize the age column as explained above.
# As a first step, convert the range of days into 4 numeric category codes

# Loop over a snapshot of the age values and write the category code back to the same row.
# .loc[j, 'age'] assigns in place; the chained form .loc[j]['age'] may write to a temporary copy.
for j, i in enumerate(cDataFE['age'].tolist()):
    if i > 0 and i <= 30:
        cDataFE.loc[j, 'age'] = 1 # Less than 1 month (1 to 30 days)
    elif i > 30 and i <= 90:
        cDataFE.loc[j, 'age'] = 2 # Less than 3 month (31 to 90 days)
    elif i > 90 and i <= 180:
        cDataFE.loc[j, 'age'] = 3 # Less than 6 month (91 to 180 days)
    elif i > 180 and i <= 365:
        cDataFE.loc[j, 'age'] = 4 # Less than a year (181 to 365 days)

In [None]:
# Have a look into the Age column's unique value now
cDataFE.age.unique()

In [None]:
# Now replace the numeric category codes with the descriptive label of each day range
cDataFE['age'] = cDataFE['age'].replace({1: 'Less than 1 month', 
                                         2: 'Less than 3 month', 
                                         3: 'Less than 6 month',
                                         4: 'Less than a year'})


In [None]:
# Have a look into the Age column's unique value - after replacement with exact category
cDataFE.age.unique()

In [None]:
# Create dummy properties for the age column (dtype=int keeps them as 1/0 rather than True/False)
cDataFE = pd.get_dummies(cDataFE, columns=['age'], dtype=int)

In [None]:
cDataFE.head(10)

- The output above shows that the # of days has been converted into binary dummy properties, with 1 = yes and 0 = no for each category of days range
- In following sections we will validate the importance of these features on target variable 'strength'

In [None]:
# Let's review if any composite feature can be created from the given properties in the data set
# In concrete preparation the water to cement ratio is a critical combination. Hence we can calculate the w/c ratio
# and drop the individual cement and water features

# Take a copy of the data set prepared for Feature Engineering above (this has the additional categorization of the days column)
cDataFE_unscaled_wc_ratio = cDataFE.copy()

# Insert the new composite feature to the unscaled dataset
cDataFE_unscaled_wc_ratio.insert(cDataFE_unscaled_wc_ratio.shape[-1]-1,
                                 'water-cement-ratio', 
                                 cDataFE_unscaled_wc_ratio['water']/cDataFE_unscaled_wc_ratio['cement'])

# Drop the original individual feature as the water/cement ratio is introduced as composite feature
cDataFE_unscaled_wc_ratio.drop(['water', 'cement'], axis=1, inplace=True)

# Let's scale the dataset
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler() # Instantiate MaxAbsScaler
# Scale through fit-transform and wrap the result back into a DataFrame with the same columns and index
cDataFE_scaled_wc_ratio = pd.DataFrame(scaler.fit_transform(cDataFE_unscaled_wc_ratio),
                                       columns=cDataFE_unscaled_wc_ratio.columns,
                                       index=cDataFE_unscaled_wc_ratio.index)
cDataFE_scaled_wc_ratio.head() # View the scaled data set with a composite feature water/cement ratio

In [None]:
# Let's view the distribution of the new composite feature introduced
cDataFE_scaled_wc_ratio['water-cement-ratio'].describe()

In [None]:
# Describe the correlation of the variables through a graphical heat map
# +ve and -ve numbers indicate how the variables are correlated with each other

colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize=(15,15))
plt.title('Concrete properties Correlation of attributes - with water-cement ratio', y=1.05, size=19)
sns.heatmap(cDataFE_scaled_wc_ratio.corr(),linewidths=0.1,vmax=1.0, 
            square=True, cmap=colormap, linecolor='white', annot=True)

- The heatmap does not show a strong correlation between the water-cement ratio composite feature and concrete strength; however, the age dummies for the two youngest buckets (up to 30 days and 31 to 90 days) have a stronger influence on strength

In [None]:
# As explained above, water to binder ratio also can be computed to obtain another composite feature in the dataset

# Take a copy of the data set prepared for Feature Engineering above (this has the additional categorization of the days column)
cDataFE_unscaled_wb_ratio = cDataFE.copy()

# Insert the new composite feature to the unscaled dataset
cDataFE_unscaled_wb_ratio.insert(cDataFE_unscaled_wb_ratio.shape[-1]-1,
                                 'water-binder-ratio', 
                                 cDataFE_unscaled_wb_ratio['water']/(cDataFE_unscaled_wb_ratio['cement']+
                                                                     cDataFE_unscaled_wb_ratio['ash']+
                                                                     cDataFE_unscaled_wb_ratio['slag']))

# Drop the original individual feature as the water/binder (consists of cement, ash & slag) ratio is introduced 
# as composite feature
cDataFE_unscaled_wb_ratio.drop(['water','cement','ash','slag'], axis=1, inplace=True)

# Let's scale the dataset
from sklearn.preprocessing import MaxAbsScaler

scaler1 = MaxAbsScaler() # Instantiate MaxAbsScaler
# Scale through fit-transform and wrap the result back into a DataFrame with the same columns and index
cDataFE_scaled_wb_ratio = pd.DataFrame(scaler1.fit_transform(cDataFE_unscaled_wb_ratio),
                                       columns=cDataFE_unscaled_wb_ratio.columns,
                                       index=cDataFE_unscaled_wb_ratio.index)
cDataFE_scaled_wb_ratio.head() # View the scaled data set with a composite feature water/binder ratio

In [None]:
# Let's view the distribution of the new composite feature introduced
cDataFE_scaled_wb_ratio['water-binder-ratio'].describe()

In [None]:
# Describe the correlation of the variables through a graphical heat map
# +ve and -ve numbers indicate how the variables are correlated with each other

colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize=(15,15))
plt.title('Concrete properties Correlation of attributes - with water-binder ratio', y=1.05, size=19)
sns.heatmap(cDataFE_scaled_wb_ratio.corr(),linewidths=0.1,vmax=1.0, 
            square=True, cmap=colormap, linecolor='white', annot=True)

- The correlation matrix above likewise shows that the water-binder ratio composite feature does not have a strong correlation with concrete strength

In [None]:
# Decide on complexity of the model, should it be simple linear model in terms of parameters or 
# would a quadratic or higher degree help

In [None]:
# Let's implement the linear model and check the model accuracy

# Import Linear Regression machine learning library
from sklearn.linear_model import LinearRegression

# Scale the original dataset cData - without dummy features for the 'age' column
cData_scaled = cData.apply(zscore)

# Scale the dataset cDataFE - with dummy features for the 'age' column
cDataFE_scaled = cDataFE.apply(zscore)

# Prepare dependent-independent data set - For original dataset
# Copy all the predictor variables into the X dataframe. Since 'strength' is the dependent variable, let's drop it
X_Linear_orig = cData_scaled.drop('strength', axis=1) # original dataset, without dummy columns for 'age'
# Copy the target variable 'strength' column alone into the y dataframe. This is the dependent variable
y_Linear_orig = cData_scaled[['strength']]

# Prepare dependent-independent data set - For dataset with age column dummy
# Copy all the predictor variables into the X dataframe. Since 'strength' is the dependent variable, let's drop it
X_Linear_wd = cDataFE_scaled.drop('strength', axis=1) # cDataFE dataset, with dummy columns for the 'age' feature
# Copy the target variable 'strength' column alone into the y dataframe. This is the dependent variable
y_Linear_wd = cDataFE_scaled[['strength']]

# Prepare dependent-independent data set - For dataset with water-cement ratio
# Copy all the predictor variables into the X dataframe. Since 'strength' is the dependent variable, let's drop it
X_Linear_wc = cDataFE_scaled_wc_ratio.drop('strength', axis=1) # dataset with the water-cement ratio composite feature
# Copy the target variable 'strength' column alone into the y dataframe. This is the dependent variable
y_Linear_wc = cDataFE_scaled_wc_ratio[['strength']]

# Prepare dependent-independent data set - For dataset with water-binder ratio
# Copy all the predictor variables into the X dataframe. Since 'strength' is the dependent variable, let's drop it
X_Linear_wb = cDataFE_scaled_wb_ratio.drop('strength', axis=1) # dataset with the water-binder ratio composite feature
# Copy the target variable 'strength' column alone into the y dataframe. This is the dependent variable
y_Linear_wb = cDataFE_scaled_wb_ratio[['strength']]

from sklearn.model_selection import train_test_split # Import train_test_split to split the dataset

# Split X and y into training and test set in 70:30 ratio - For original dataset
X_train_Linear_orig, X_test_Linear_orig, y_train_Linear_orig, y_test_Linear_orig = train_test_split(X_Linear_orig, y_Linear_orig, test_size=0.30 , 
                                                                            random_state=1)

# Split X and y into training and test set in 70:30 ratio - For dataset with age column dummy
X_train_Linear_wd, X_test_Linear_wd, y_train_Linear_wd, y_test_Linear_wd = train_test_split(X_Linear_wd, y_Linear_wd, test_size=0.30 , 
                                                                            random_state=1)

# Split X and y into training and test set in 70:30 ratio - For dataset with water-cement ratio
X_train_Linear_wc, X_test_Linear_wc, y_train_Linear_wc, y_test_Linear_wc = train_test_split(X_Linear_wc, y_Linear_wc, test_size=0.30 , 
                                                                            random_state=1)

# Split X and y into training and test set in 70:30 ratio - For dataset with water-binder ratio
X_train_Linear_wb, X_test_Linear_wb, y_train_Linear_wb, y_test_Linear_wb = train_test_split(X_Linear_wb, y_Linear_wb, test_size=0.30 , 
                                                                            random_state=1)

# invoke the LinearRegression function and find the bestfit model on training data - For original dataset
Linear_regression_model_orig = LinearRegression()
Linear_regression_model_orig.fit(X_train_Linear_orig, y_train_Linear_orig)

# invoke the LinearRegression function and find the bestfit model on training data - For dataset with age column dummy
Linear_regression_model_wd = LinearRegression()
Linear_regression_model_wd.fit(X_train_Linear_wd, y_train_Linear_wd)

# invoke the LinearRegression function and find the bestfit model on training data - For dataset with water-cement ratio
Linear_regression_model_wc = LinearRegression()
Linear_regression_model_wc.fit(X_train_Linear_wc, y_train_Linear_wc)

# invoke the LinearRegression function and find the bestfit model on training data - For dataset with water-binder ratio
Linear_regression_model_wb = LinearRegression()
Linear_regression_model_wb.fit(X_train_Linear_wb, y_train_Linear_wb)

In [None]:
# Print coefficients for each of the independent attributes

print("The coefficient for - original dataset")
for i, col_name in enumerate(X_train_Linear_orig.columns):
    print("{} = {}".format(col_name, Linear_regression_model_orig.coef_[0][i]))

# Print the intercept for the model
intercept1 = Linear_regression_model_orig.intercept_[0]
print("The intercept of the model is {}".format(intercept1))
print("===============================================================")

print("The coefficient for - dataset with age column dummy")
for i, col_name in enumerate(X_train_Linear_wd.columns):
    print("{} = {}".format(col_name, Linear_regression_model_wd.coef_[0][i]))

# Print the intercept for the model
intercept2 = Linear_regression_model_wd.intercept_[0]
print("The intercept of the model is {}".format(intercept2))
print("===============================================================")

print("The coefficient for - dataset with water-cement ratio")
for i, col_name in enumerate(X_train_Linear_wc.columns):
    print("{} = {}".format(col_name, Linear_regression_model_wc.coef_[0][i]))

# Print the intercept for the model
intercept3 = Linear_regression_model_wc.intercept_[0]
print("The intercept of the model is {}".format(intercept3))
print("===============================================================")

print("The coefficient for - dataset with water-binder ratio")
for i, col_name in enumerate(X_train_Linear_wb.columns):
    print("{} = {}".format(col_name, Linear_regression_model_wb.coef_[0][i]))

# Print the intercept for the model
intercept4 = Linear_regression_model_wb.intercept_[0]
print("The intercept of the model is {}".format(intercept4))
print("===============================================================")

In [None]:
# Get the Linear Regression model score - in Train data
print("Train data model score - for original dataset: ", Linear_regression_model_orig.score(X_train_Linear_orig, y_train_Linear_orig))
print("-----------------------------------------------------------------------------------------")
print("Train data model score - for dataset with age column dummy: ", Linear_regression_model_wd.score(X_train_Linear_wd, y_train_Linear_wd))
print("-----------------------------------------------------------------------------------------")
print("Train data model score - for dataset with water-cement ratio: ", Linear_regression_model_wc.score(X_train_Linear_wc, y_train_Linear_wc))
print("-----------------------------------------------------------------------------------------")
print("Train data model score - for dataset with water-binder ratio: ", Linear_regression_model_wb.score(X_train_Linear_wb, y_train_Linear_wb))
print("-----------------------------------------------------------------------------------------")

In [None]:
# Get the Linear Regression model score - in Test data
print("Test data model score - for original dataset: ", Linear_regression_model_orig.score(X_test_Linear_orig, y_test_Linear_orig))
print("-----------------------------------------------------------------------------------------")
print("Test data model score - for dataset with age column dummy: ", Linear_regression_model_wd.score(X_test_Linear_wd, y_test_Linear_wd))
print("-----------------------------------------------------------------------------------------")
print("Test data model score - for dataset with water-cement ratio: ", Linear_regression_model_wc.score(X_test_Linear_wc, y_test_Linear_wc))
print("-----------------------------------------------------------------------------------------")
print("Test data model score - for dataset with water-binder ratio: ", Linear_regression_model_wb.score(X_test_Linear_wb, y_test_Linear_wb))
print("-----------------------------------------------------------------------------------------")

- Based on the above comparison, introducing the dummy features for the different ranges of the age column gives a better model score for the simple linear model
- The other data sets, i.e. the original and the data sets with the water-cement ratio or water-binder ratio, don't yield a better score
- However, the overall score (R^2) of the simple linear model is around 0.68, which is not particularly high. Hence we shall continue to try other models to measure accuracy
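
The value returned by `LinearRegression.score` is the coefficient of determination R^2, not a classification accuracy. The sketch below reproduces it by hand on synthetic data; the generated data and coefficients are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression problem: 3 features, known coefficients, some noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE: square root of the mean squared prediction error
rmse = np.sqrt(np.mean((y - pred) ** 2))

print("manual R^2:", r2_manual)
print("model.score:", model.score(X, y))
print("RMSE:", rmse)
```

A score of 0.68 therefore means the model explains about 68% of the variance in the scaled strength values.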

In [None]:
# Compute the mean squared error (MSE): average the squared differences between
# the predicted and the actual y values for the test cases
# (.to_numpy() keeps the result a plain scalar regardless of pandas version)
mse_orig = np.mean((Linear_regression_model_orig.predict(X_test_Linear_orig)-y_test_Linear_orig.to_numpy())**2)
mse_wd = np.mean((Linear_regression_model_wd.predict(X_test_Linear_wd)-y_test_Linear_wd.to_numpy())**2)
mse_wc = np.mean((Linear_regression_model_wc.predict(X_test_Linear_wc)-y_test_Linear_wc.to_numpy())**2)
mse_wb = np.mean((Linear_regression_model_wb.predict(X_test_Linear_wb)-y_test_Linear_wb.to_numpy())**2)

print("MSE - for original dataset: ", mse_orig)
print("-----------------------------------------------------------------------------------------")
print("MSE - for dataset with age column dummy: ", mse_wd)
print("-----------------------------------------------------------------------------------------")
print("MSE - for dataset with water-cement ratio: ", mse_wc)
print("-----------------------------------------------------------------------------------------")
print("MSE - for dataset with water-binder ratio: ", mse_wb)
print("-----------------------------------------------------------------------------------------")

# The square root of the mean squared error is the root mean squared error (RMSE),
# i.e. the typical size of the difference between predicted and actual values
import math
print("RMSE - for original dataset: ", math.sqrt(mse_orig))
print("-----------------------------------------------------------------------------------------")
print("RMSE - for dataset with age column dummy: ", math.sqrt(mse_wd))
print("-----------------------------------------------------------------------------------------")
print("RMSE - for dataset with water-cement ratio: ", math.sqrt(mse_wc))
print("-----------------------------------------------------------------------------------------")
print("RMSE - for dataset with water-binder ratio: ", math.sqrt(mse_wb))
print("-----------------------------------------------------------------------------------------")

In [None]:
# As the dataset with additional dummy columns yields the best score, 
# predict strength for the dataset with age column dummy
y_pred_wd = Linear_regression_model_wd.predict(X_test_Linear_wd)

# Since this is regression, plot the predicted y value vs actual y values for the test data
# A good model's prediction will be close to actual leading to high R and R2 values
plt.scatter(y_test_Linear_wd['strength'], y_pred_wd)

In [None]:
# Build a Ridge model for each of the different data set and let's review corresponding coefficients

from sklearn.linear_model import Ridge # Load Ridge library

# Instantiate the model
ridge_orig = Ridge(alpha=.3) # for original data set
ridge_wd = Ridge(alpha=.3) # for dataset with age dummy
ridge_wc = Ridge(alpha=.3) # for dataset with water-cement ratio
ridge_wb = Ridge(alpha=.3) # for dataset with water-binder ratio

# Fit the model
ridge_orig.fit(X_train_Linear_orig,y_train_Linear_orig)
ridge_wd.fit(X_train_Linear_wd,y_train_Linear_wd)
ridge_wc.fit(X_train_Linear_wc,y_train_Linear_wc)
ridge_wb.fit(X_train_Linear_wb,y_train_Linear_wb)

# Display the coefficients
print ("Ridge model - for original data set: ", (ridge_orig.coef_))
print("---------------------------------------------------------------------------------------------------------")
print ("Ridge model - for dataset with age dummy: ", (ridge_wd.coef_))
print("---------------------------------------------------------------------------------------------------------")
print ("Ridge model - for dataset with water-cement ratio: ", (ridge_wc.coef_))
print("---------------------------------------------------------------------------------------------------------")
print ("Ridge model - for dataset with water-binder ratio: ", (ridge_wb.coef_))
print("---------------------------------------------------------------------------------------------------------")
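
- The effect of the Ridge `alpha` parameter on the coefficients can be sketched on synthetic data. The data, the true weights and the list of `alphas` below are assumptions chosen for illustration; only `alpha=.3` appears in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(1)

# Synthetic data with known weights (illustration only)
X = rng.normal(size=(100, 3))
y = X @ np.array([4.0, -3.0, 2.0]) + rng.normal(scale=0.5, size=100)

alphas = [0.3, 30.0, 3000.0]
norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    # A larger alpha penalises large weights more, so the coefficient vector shrinks
    norms.append(float(np.linalg.norm(model.coef_)))
    print("alpha =", alpha, "coefficients =", model.coef_)

print("Coefficient norms:", norms)
```

- Ridge shrinks coefficients towards zero but does not normally set them exactly to zero, so it does not drop features on its own.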

In [None]:
# Build a Lasso model for each of the different data set and let's review corresponding coefficients

from sklearn.linear_model import Lasso # Load Lasso library

# Instantiate the model
lasso_orig = Lasso(alpha=.1) # for original data set
lasso_wd = Lasso(alpha=.1) # for dataset with age dummy
lasso_wc = Lasso(alpha=.1) # for dataset with water-cement ratio
lasso_wb = Lasso(alpha=.1) # for dataset with water-binder ratio

# Fit the model
lasso_orig.fit(X_train_Linear_orig,y_train_Linear_orig)
lasso_wd.fit(X_train_Linear_wd,y_train_Linear_wd)
lasso_wc.fit(X_train_Linear_wc,y_train_Linear_wc)
lasso_wb.fit(X_train_Linear_wb,y_train_Linear_wb)

# Display the coefficients
print ("Lasso model - for original data set: ", (lasso_orig.coef_))
print("---------------------------------------------------------------------------------------------------------")
print ("Lasso model - for dataset with age dummy: ", (lasso_wd.coef_))
print("---------------------------------------------------------------------------------------------------------")
print ("Lasso model - for dataset with water-cement ratio: ", (lasso_wc.coef_))
print("---------------------------------------------------------------------------------------------------------")
print ("Lasso model - for dataset with water-binder ratio: ", (lasso_wb.coef_))
print("---------------------------------------------------------------------------------------------------------")

- In the Lasso model, many of the coefficients become zero. This indicates that the corresponding features can be dropped from the dataset, as they do not have enough influence on the target feature 'strength' in this case
- Ex: 
    - In the original data set - cement, water, coarseagg, fineagg and age are related to the strength of the concrete mixture 
    - In the data set with dummy - cement, water, coarseagg, fineagg and age_less than 30 days are related to the strength of the concrete mixture. The other age categories can be dropped from the data set
    - In the data set with water-cement ratio or water-binder ratio - none of the features yields a result that helps to build a predictive model. The composite features do not benefit model building. Hence we shall not use these data sets for further model building.
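
- The way Lasso drives coefficients to exactly zero can be sketched on synthetic data. The feature names, the data and the true weights below are hypothetical; only `alpha=.1` matches the value used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(1)
feature_names = ["signal_a", "signal_b", "noise_a", "noise_b"]  # hypothetical names

# Synthetic data: only the first two columns drive the target
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

lasso = Lasso(alpha=0.1).fit(X, y)

# Features whose coefficient was shrunk exactly to zero are candidates for dropping
kept = [name for name, coef in zip(feature_names, lasso.coef_) if coef != 0]
dropped = [name for name, coef in zip(feature_names, lasso.coef_) if coef == 0]

print("Coefficients:", lasso.coef_)
print("Kept:   ", kept)
print("Dropped:", dropped)
```

- Which features survive depends on `alpha` and on the scale of each column, so a zero coefficient is best read as "not needed at this penalty" rather than as proof that the feature is irrelevant.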

In [None]:
# Let's get the Ridge and Lasso model score for comparison

# Get the Ridge model score - in Train data
print("Ridge Model score")
print("=========================================================================================")
print("Train data model score - for original dataset: ", ridge_orig.score(X_train_Linear_orig, y_train_Linear_orig))
print("-----------------------------------------------------------------------------------------")
print("Train data model score - for dataset with age column dummy: ", ridge_wd.score(X_train_Linear_wd, y_train_Linear_wd))
print("-----------------------------------------------------------------------------------------")

# Get the Ridge model score - in Test data
print("Test data model score - for original dataset: ", ridge_orig.score(X_test_Linear_orig, y_test_Linear_orig))
print("-----------------------------------------------------------------------------------------")
print("Test data model score - for dataset with age column dummy: ", ridge_wd.score(X_test_Linear_wd, y_test_Linear_wd))
print("-----------------------------------------------------------------------------------------")

# Get the Lasso model score - in Train data
print("Lasso Model score")
print("=========================================================================================")
print("Train data model score - for original dataset: ", lasso_orig.score(X_train_Linear_orig, y_train_Linear_orig))
print("-----------------------------------------------------------------------------------------")
print("Train data model score - for dataset with age column dummy: ", lasso_wd.score(X_train_Linear_wd, y_train_Linear_wd))
print("-----------------------------------------------------------------------------------------")

# Get the Lasso model score - in Test data
print("Test data model score - for original dataset: ", lasso_orig.score(X_test_Linear_orig, y_test_Linear_orig))
print("-----------------------------------------------------------------------------------------")
print("Test data model score - for dataset with age column dummy: ", lasso_wd.score(X_test_Linear_wd, y_test_Linear_wd))
print("-----------------------------------------------------------------------------------------")

- The Ridge model score for the dataset with age dummy is still around 69% on test data, whereas Lasso yields only 53%
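
- For scikit-learn regressors such as `Ridge` and `Lasso`, `.score()` returns the coefficient of determination R², not a classification accuracy. A minimal sketch of how R² is computed, using made-up `y_true` and `y_pred` values:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_pred)            # same quantity computed by scikit-learn

print("R2 (manual): ", r2_manual)
print("R2 (sklearn):", r2_sklearn)
```

- A score of 0.69 therefore means that about 69% of the variance in strength is explained by the model on that data.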

In [None]:
# It was evident earlier from the joint plot that the independent variables have a non-linear relation with the 
# dependent or target variable. Hence below we build a polynomial model with higher-degree features 
# to evaluate the model score and compare

from sklearn.preprocessing import PolynomialFeatures # Get the polynomial feature library

# Instantiate the polynomial feature transformer
poly_orig = PolynomialFeatures(degree = 2, interaction_only=True) # Instantiate - for original data set
poly_wd = PolynomialFeatures(degree = 2, interaction_only=True) # Instantiate - for data set with age dummy feature

# Fit Transform the dataset
X_poly_orig = poly_orig.fit_transform(X_Linear_orig) # for original data set
X_poly_wd = poly_wd.fit_transform(X_Linear_wd) # for data set with age dummy feature

# Split the data set
X_train_poly_orig, X_test_poly_orig, y_train_poly_orig, y_test_poly_orig = train_test_split(X_poly_orig, y_Linear_orig, test_size=0.30, random_state=1)
X_train_poly_wd, X_test_poly_wd, y_train_poly_wd, y_test_poly_wd = train_test_split(X_poly_wd, y_Linear_wd, test_size=0.30, random_state=1)

# View the shape of the data after introducing the higher degree dimensions
print("Shape of the original data set: ", X_train_poly_orig.shape)
print("Shape of the data set with age dummy feature: ", X_train_poly_wd.shape)
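
- What `interaction_only=True` changes can be seen on a single row. The sketch below uses a hypothetical row with two features `a=2` and `b=3`; the 8-column case at the end is an assumption standing in for a dataset with 8 predictors.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.array([[2.0, 3.0]])  # one hypothetical row: a=2, b=3

# interaction_only=True keeps products of distinct features but no squares
inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X_demo)
# The default also adds the squared terms
full = PolynomialFeatures(degree=2).fit_transform(X_demo)

print(inter)  # columns: 1, a, b, a*b
print(full)   # columns: 1, a, b, a^2, a*b, b^2

# With n input columns, interaction_only gives 1 + n + n*(n-1)/2 output columns
n_out_8 = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(np.zeros((1, 8))).shape[1]
print("Output columns for 8 inputs:", n_out_8)
```

- Since no squared terms are generated, this setting captures interactions between ingredients but not the curvature of a single ingredient.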

In [None]:
# Compute the linear model score with polynomial features introduced

# Apply for Ridge model:
# Instantiate the model
ridge_poly_orig = Ridge(alpha=.3) # for original data set
ridge_poly_wd = Ridge(alpha=.3) # for dataset with age dummy

# Fit the model
ridge_poly_orig.fit(X_train_poly_orig,y_train_poly_orig)
ridge_poly_wd.fit(X_train_poly_wd,y_train_poly_wd)

# Display the coefficients
print ("Ridge model (polynomial) - for original data set: ", (ridge_poly_orig.coef_))
print("---------------------------------------------------------------------------------------------------------")
print ("Ridge model (polynomial) - for dataset with age dummy: ", (ridge_poly_wd.coef_))
print("---------------------------------------------------------------------------------------------------------")

# Apply for the Lasso model:
# Instantiate the model
lasso_poly_orig = Lasso(alpha=.1) # for original data set
lasso_poly_wd = Lasso(alpha=.1) # for dataset with age dummy

# Fit the model
lasso_poly_orig.fit(X_train_poly_orig,y_train_poly_orig)
lasso_poly_wd.fit(X_train_poly_wd,y_train_poly_wd)

# Display the coefficients
print ("Lasso model (polynomial coefficients) - for original data set: ", (lasso_poly_orig.coef_))
print("---------------------------------------------------------------------------------------------------------")
print ("Lasso model (polynomial coefficients) - for dataset with age dummy: ", (lasso_poly_wd.coef_))
print("---------------------------------------------------------------------------------------------------------")


- In Lasso most of the coefficients become zero, indicating that those polynomial features can be dropped

In [None]:
# Let's get the Ridge and Lasso model score for comparison

# Get the Ridge model score - in Train data
print("Ridge Model (polynomial) score")
print("=========================================================================================")
print("Train data model score - for original dataset: ", ridge_poly_orig.score(X_train_poly_orig,y_train_poly_orig))
print("-----------------------------------------------------------------------------------------")
print("Train data model score - for dataset with age column dummy: ", ridge_poly_wd.score(X_train_poly_wd,y_train_poly_wd))
print("-----------------------------------------------------------------------------------------")

# Get the Ridge model score - in Test data
print("Test data model score - for original dataset: ", ridge_poly_orig.score(X_test_poly_orig,y_test_poly_orig))
print("-----------------------------------------------------------------------------------------")
print("Test data model score - for dataset with age column dummy: ", ridge_poly_wd.score(X_test_poly_wd,y_test_poly_wd))
print("-----------------------------------------------------------------------------------------")

# Get the Lasso model score - in Train data
print("Lasso Model (polynomial) score")
print("=========================================================================================")
print("Train data model score - for original dataset: ", lasso_poly_orig.score(X_train_poly_orig,y_train_poly_orig))
print("-----------------------------------------------------------------------------------------")
print("Train data model score - for dataset with age column dummy: ", lasso_poly_wd.score(X_train_poly_wd,y_train_poly_wd))
print("-----------------------------------------------------------------------------------------")

# Get the Lasso model score - in Test data
print("Test data model score - for original dataset: ", lasso_poly_orig.score(X_test_poly_orig,y_test_poly_orig))
print("-----------------------------------------------------------------------------------------")
print("Test data model score - for dataset with age column dummy: ", lasso_poly_wd.score(X_test_poly_wd,y_test_poly_wd))
print("-----------------------------------------------------------------------------------------")


- After introducing higher-degree features through PolynomialFeatures and applying Ridge and Lasso, we observed the following:
    - The Ridge model score jumped to 72% on Train and 70% on Test for the dataset with age dummy columns
    - The Lasso score improved as well, to 55% on test data

- A few general observations:
    - In all cases the model score is better on the data set where the Age column (i.e. # of days to achieve a concrete strength level) is further classified into 4 groups. 
    - The original dataset didn't perform that well on these models
    - Applying polynomial features improves the model score compared with the plain Ridge & Lasso models

In [None]:
# Explore for gaussians. If data is likely to be a mix of gaussians,explore individual clusters and 
# present your findings in terms of the independent attributes and their suitability to predict strength

In [None]:
# Display each of the feature distribution in the original dataset to view the gaussians

sns.kdeplot(cData['cement'])
plt.show()
sns.kdeplot(cData['slag'])
plt.show()
sns.histplot(cData['ash'], kde=False)
plt.show()
sns.kdeplot(cData['water'])
plt.show()
sns.kdeplot(cData['superplastic'])
plt.show()
sns.kdeplot(cData['coarseagg'])
plt.show()
sns.kdeplot(cData['fineagg'])
plt.show()
sns.kdeplot(cData['age'])
plt.show()
sns.kdeplot(cData['strength'])
plt.show()

- From the above plots we see the following distributions -
    - cement and fineagg are almost normally distributed
    - slag, ash and superplastic have a high peak, indicating the data may need up/down sampling during model building
    - age, water and coarseagg show multiple gaussians. Hence let's review the number of unique values in these columns
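
- One way to check whether a feature looks like a mix of gaussians is to fit Gaussian mixture models with different component counts and compare their BIC. This is a sketch on synthetic data, not the method used in this notebook; `values` stands in for a single column such as water or age.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)

# Synthetic 1-D feature drawn from two well-separated gaussians (illustration only)
values = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(10.0, 1.0, 300)])
X_1d = values.reshape(-1, 1)  # GaussianMixture expects a 2-D array

bic_by_k = {}
for k in range(1, 5):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X_1d)
    bic_by_k[k] = gmm.bic(X_1d)  # lower BIC means a better fit/complexity trade-off

best_k = min(bic_by_k, key=bic_by_k.get)
print("BIC per component count:", bic_by_k)
print("Component count with lowest BIC:", best_k)
```

- A clearly lower BIC for two or more components than for one supports the visual impression of multiple gaussians in the density plot.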

In [None]:
print("Number of unique values in water column: ", len(cData.water.unique()))
print("Number of unique values in coarseagg column: ", len(cData.coarseagg.unique()))
print("Number of unique values in age column: ", len(cData.age.unique()))

- Though water and coarseagg show multiple gaussians, these features have too many unique values to split into further subgroups
- The subgroup technique was applied to the Age column, and we observed earlier how it helped improve the model score

- The target column strength is a continuous variable, and the highest model score we observed for linear regression was 70%, with higher-degree features introduced.
- In order to apply further models to predict the strength, let's review the distribution of this feature and whether it can be grouped further.
- From the summary statistics of the strength feature, we see the following:
    - mean: 35.817961
    - standard deviation: 16.705742
    - min: 2.33
    - Q1: 23.710
    - Q2: 34.445
    - Q3: 46.135
    - Max: 82.6
    - Number of unique values: 845
- Based on the above observation, we break the strength feature into the following 3 groups:
    - strength level 1 (value between 2.33 and 23)
    - strength level 2 (value between 23 and 46)
    - strength level 3 (value above 46)
- Also get the corresponding count in each level of strength    

In [None]:
# Let's categorize the Strength column as explained above.
# As a first step, take a copy of the dataset with dummies
cDataFES = cDataFE.copy() # cDataFES denotes Concrete dataset for Feature Engineering and for Strength dummies

j = 0 # Instantiate the iterator

for i in cDataFES['strength']: # Loop through the strength column to find the match defined in below conditions
    if i > 0 and i <=23.0:        
        cDataFES.loc[j,'strength'] = 1 # Denote the strength level 1 (value between 2.33 and 23) with 1
        j +=1
    elif i > 23.0 and i <= 46.0:
        cDataFES.loc[j,'strength'] = 2 # Denote the strength level 2 (value between 23 and 46) with 2
        j +=1    
    elif i > 46.0:
        cDataFES.loc[j,'strength'] = 3 # Denote the strength level 3 (value above 46) with 3
        j +=1
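
- The same binning can also be expressed without an explicit loop. The sketch below uses `pd.cut` on a small hypothetical frame `demo`; it is an alternative formulation and not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical strength values that sit on and around the bin edges
demo = pd.DataFrame({"strength": [2.33, 23.0, 23.5, 46.0, 46.1, 82.6]})

# Right-inclusive bins: (0, 23], (23, 46], (46, inf)
demo["strength_level"] = pd.cut(
    demo["strength"],
    bins=[0, 23.0, 46.0, np.inf],
    labels=["level 1", "level 2", "level 3"],
)

# Count how many rows fall into each level
level_counts = demo["strength_level"].value_counts().sort_index()
print(demo)
print(level_counts)
```

- The bin edges follow the same rule as the loop: a value of exactly 23 or 46 belongs to the lower level.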

In [None]:
cDataFES.head() # View the dataset after dividing the strength into 3 categories

In [None]:
# Now replace the numeric category with the corresponding strength level label
cDataFES['strength'] = cDataFES['strength'].replace({1: 'level 1', 
                                         2: 'level 2', 
                                         3: 'level 3'})

In [None]:
# Create dummy properties for strength column
cDataFES = pd.get_dummies(cDataFES, columns=['strength'])

In [None]:
# Have a look into the dataset now
cDataFES.head()

In [None]:
# Let's view the distribution of strength level in the dataset
print("Strength Level 1: ", len(cDataFES.loc[cDataFES['strength_level 1'] == 1]))
print("Strength Level 2: ", len(cDataFES.loc[cDataFES['strength_level 2'] == 1]))
print("Strength Level 3: ", len(cDataFES.loc[cDataFES['strength_level 3'] == 1]))

In [None]:
# Based on the cluster obtained above for different strength level, in the upcoming model building technique we shall 
# see how individual features have strong/less influence on concrete strength

# Let's obtain feature importance for the individual features and present your findings

# Build the Decision Tree classifier, assess model score and evaluate feature importance

# Let's build the Decision Tree Model
from sklearn.tree import DecisionTreeClassifier

# Make a copy of the data set and scale it
cDataFES_scaled = cDataFES.apply(zscore)

# Prepare dependent-independent data set
# Copy all the predictor variables into X dataframe. Since 'strength' is dependent variable, let's drop it
X_sl1 = cDataFES_scaled.drop(['strength_level 1','strength_level 2','strength_level 3'], axis=1) # for strength_level 1
# Copy the target variable 'strength' column alone into the y dataframe. This is the dependent variable
y_sl1 = cDataFES[['strength_level 1']]

X_sl2 = cDataFES_scaled.drop(['strength_level 1','strength_level 2','strength_level 3'], axis=1) # for strength_level 2
# Copy the target variable 'strength' column alone into the y dataframe. This is the dependent variable
y_sl2 = cDataFES[['strength_level 2']]

X_sl3 = cDataFES_scaled.drop(['strength_level 1','strength_level 2','strength_level 3'], axis=1) # for strength_level 3
# Copy the target variable 'strength' column alone into the y dataframe. This is the dependent variable
y_sl3 = cDataFES[['strength_level 3']]


# Split X and y into training and test set in 70:30 ratio - for strength_level 1
X_train_sl1, X_test_sl1, y_train_sl1, y_test_sl1 = train_test_split(X_sl1, y_sl1, test_size=0.30 , 
                                                                            random_state=1)

# Split X and y into training and test set in 70:30 ratio - for strength_level 2
X_train_sl2, X_test_sl2, y_train_sl2, y_test_sl2 = train_test_split(X_sl2, y_sl2, test_size=0.30 , 
                                                                            random_state=1)

# Split X and y into training and test set in 70:30 ratio - for strength_level 3
X_train_sl3, X_test_sl3, y_train_sl3, y_test_sl3 = train_test_split(X_sl3, y_sl3, test_size=0.30 , 
                                                                            random_state=1)

#considered the Gini criteria to take decision. Other option was to use Entropy
cDataDecisionTree_sl1 = DecisionTreeClassifier(criterion = 'gini', random_state=1) # for strength_level 1
cDataDecisionTree_sl2 = DecisionTreeClassifier(criterion = 'gini', random_state=1) # for strength_level 2
cDataDecisionTree_sl3 = DecisionTreeClassifier(criterion = 'gini', random_state=1) # for strength_level 3

# Fit the model
cDataDecisionTree_sl1.fit(X_train_sl1, y_train_sl1) # for strength_level 1
cDataDecisionTree_sl2.fit(X_train_sl2, y_train_sl2) # for strength_level 2
cDataDecisionTree_sl3.fit(X_train_sl3, y_train_sl3) # for strength_level 3

# Get the decision tree score on Train and Test data
print("Decision Tree Score (on Train data - For strength_level 1): ", cDataDecisionTree_sl1.score(X_train_sl1, y_train_sl1))
print("Decision Tree Score (on Test data - For strength_level 1): ", cDataDecisionTree_sl1.score(X_test_sl1, y_test_sl1))

print("Decision Tree Score (on Train data - For strength_level 2): ", cDataDecisionTree_sl2.score(X_train_sl2, y_train_sl2))
print("Decision Tree Score (on Test data - For strength_level 2): ", cDataDecisionTree_sl2.score(X_test_sl2, y_test_sl2))

print("Decision Tree Score (on Train data - For strength_level 3): ", cDataDecisionTree_sl3.score(X_train_sl3, y_train_sl3))
print("Decision Tree Score (on Test data - For strength_level 3): ", cDataDecisionTree_sl3.score(X_test_sl3, y_test_sl3))

In [None]:
# Importance of features in the tree building, based on Gini impurity
print("Feature Importance - considering model for Strength Level 1")
print("===========================================================")
print (pd.DataFrame(cDataDecisionTree_sl1.feature_importances_, columns = ["Imp"], index = X_train_sl1.columns))
print(" ")
print("Feature Importance - considering model for Strength Level 2")
print("===========================================================")
print (pd.DataFrame(cDataDecisionTree_sl2.feature_importances_, columns = ["Imp"], index = X_train_sl2.columns))
print(" ")
print("Feature Importance - considering model for Strength Level 3")
print("===========================================================")
print (pd.DataFrame(cDataDecisionTree_sl3.feature_importances_, columns = ["Imp"], index = X_train_sl3.columns))

- From the above feature importance values, the following observations can be made:
    - cement, coarseagg, fineagg, water, slag and an age within a month are important for reaching strength level 1 (i.e. up to 23 units)
    - cement, coarseagg, fineagg, water, ash, slag and an age within a month or 3 months are important for reaching strength level 2 (i.e. between 23 and 46 units). At this level of strength, ash influences the concrete strength
    - cement, fineagg, water, slag, ash, coarseagg and an age within a month are important for reaching strength level 3 (i.e. more than 46 units). Coarseagg has less importance for achieving this highest level of strength
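
- How `feature_importances_` behaves can be sketched on synthetic data where the answer is known. The column names `signal`, `noise_a` and `noise_b` are hypothetical; the target depends on `signal` only.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)

# Synthetic predictors: the class is decided by the sign of 'signal' alone
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "noise_a", "noise_b"])
y_demo = (X_demo["signal"] > 0).astype(int)

tree = DecisionTreeClassifier(criterion="gini", random_state=1).fit(X_demo, y_demo)

# Importances are normalised to sum to 1; sort them so the strongest feature comes first
importance = pd.Series(tree.feature_importances_, index=X_demo.columns).sort_values(ascending=False)
print(importance)
```

- An importance value says how much a feature reduced impurity in the fitted tree. It does not say whether more of that ingredient raises or lowers the strength.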

In [None]:
# Let's review below models and apply technique to confirm which will be suitable for this project
# Refer to the AUC-ROC implementation to identify which model to select

In [None]:
# Let's implement Logistic and SVM model for the datasets with specific strength level and measure the ROC/AUC to compare
# the models and pick the best one

from sklearn.linear_model import LogisticRegression # Load the logistic regression library
from sklearn import svm # Load the support vector machine library

from sklearn.utils import shuffle
from sklearn.metrics import roc_curve, auc
random_state = np.random.RandomState(0)

# Set up the dependent and independent variables
array_x_data_scaled = cDataFES_scaled.values
X = array_x_data_scaled[:,0:11]

array_y_data = cDataFES.values
Y_sl1 = array_y_data[:,11]
Y_sl2 = array_y_data[:,12]
Y_sl3 = array_y_data[:,13]

# Split the dataset
# suffix 'mt' = model tuning
X_train_mt_sl1, X_test_mt_sl1, y_train_mt_sl1, y_test_mt_sl1 = train_test_split(X, Y_sl1, test_size=0.30, random_state=1) # for strength level 1
X_train_mt_sl2, X_test_mt_sl2, y_train_mt_sl2, y_test_mt_sl2 = train_test_split(X, Y_sl2, test_size=0.30, random_state=1) # for strength level 2
X_train_mt_sl3, X_test_mt_sl3, y_train_mt_sl3, y_test_mt_sl3 = train_test_split(X, Y_sl3, test_size=0.30, random_state=1) # for strength level 3

# Instantiate the models:
# For logistic regression
LR_sl1 = LogisticRegression() # for strength level 1
LR_sl2 = LogisticRegression() # for strength level 2
LR_sl3 = LogisticRegression() # for strength level 3
# For SVM
SVM_sl1 = svm.SVC(kernel='linear', probability=True) # for strength level 1
SVM_sl2 = svm.SVC(kernel='linear', probability=True) # for strength level 2
SVM_sl3 = svm.SVC(kernel='linear', probability=True) # for strength level 3

# Fit the models & get the model score
# For LR model
prob_score_LR_sl1 = LR_sl1.fit(X_train_mt_sl1, y_train_mt_sl1).predict_proba(X_test_mt_sl1)
prob_score_LR_sl2 = LR_sl2.fit(X_train_mt_sl2, y_train_mt_sl2).predict_proba(X_test_mt_sl2)
prob_score_LR_sl3 = LR_sl3.fit(X_train_mt_sl3, y_train_mt_sl3).predict_proba(X_test_mt_sl3)
# For SVM model
prob_score_SVM_sl1 = SVM_sl1.fit(X_train_mt_sl1, y_train_mt_sl1).predict_proba(X_test_mt_sl1)
prob_score_SVM_sl2 = SVM_sl2.fit(X_train_mt_sl2, y_train_mt_sl2).predict_proba(X_test_mt_sl2)
prob_score_SVM_sl3 = SVM_sl3.fit(X_train_mt_sl3, y_train_mt_sl3).predict_proba(X_test_mt_sl3)


In [None]:
# Let's view the model score: LR model:
print("Logistic Regression model score (in Train data) for strength level 1: ", LR_sl1.score(X_train_mt_sl1, y_train_mt_sl1))
print("Logistic Regression model score (in Test data) for strength level 1: ", LR_sl1.score(X_test_mt_sl1, y_test_mt_sl1))
print("======================================================================================================")
print("Logistic Regression model score (in Train data) for strength level 2: ", LR_sl2.score(X_train_mt_sl2, y_train_mt_sl2))
print("Logistic Regression model score (in Test data) for strength level 2: ", LR_sl2.score(X_test_mt_sl2, y_test_mt_sl2))
print("======================================================================================================")
print("Logistic Regression model score (in Train data) for strength level 3: ", LR_sl3.score(X_train_mt_sl3, y_train_mt_sl3))
print("Logistic Regression model score (in Test data) for strength level 3: ", LR_sl3.score(X_test_mt_sl3, y_test_mt_sl3))
print("======================================================================================================")
# Let's view the model score: SVM model:
print("SVM model score (in Train data) for strength level 1: ", SVM_sl1.score(X_train_mt_sl1, y_train_mt_sl1))
print("SVM model score (in Test data) for strength level 1: ", SVM_sl1.score(X_test_mt_sl1, y_test_mt_sl1))
print("======================================================================================================")
print("SVM model score (in Train data) for strength level 2: ", SVM_sl2.score(X_train_mt_sl2, y_train_mt_sl2))
print("SVM model score (in Test data) for strength level 2: ", SVM_sl2.score(X_test_mt_sl2, y_test_mt_sl2))
print("======================================================================================================")
print("SVM model score (in Train data) for strength level 3: ", SVM_sl3.score(X_train_mt_sl3, y_train_mt_sl3))
print("SVM model score (in Test data) for strength level 3: ", SVM_sl3.score(X_test_mt_sl3, y_test_mt_sl3))
print("======================================================================================================")

In [None]:
# Compute ROC curve and area under the curve for logistic model
fpr1_LR, tpr1_LR, thresholds1_LR = roc_curve(y_test_mt_sl1, prob_score_LR_sl1[:, 1])
roc_auc1_LR = auc(fpr1_LR, tpr1_LR)
print("Area under the ROC curve - for strength level 1: %f" % roc_auc1_LR)

fpr2_LR, tpr2_LR, thresholds2_LR = roc_curve(y_test_mt_sl2, prob_score_LR_sl2[:, 1])
roc_auc2_LR = auc(fpr2_LR, tpr2_LR)
print("Area under the ROC curve - for strength level 2: %f" % roc_auc2_LR)

fpr3_LR, tpr3_LR, thresholds3_LR = roc_curve(y_test_mt_sl3, prob_score_LR_sl3[:, 1])
roc_auc3_LR = auc(fpr3_LR, tpr3_LR)
print("Area under the ROC curve - for strength level 3: %f" % roc_auc3_LR)

In [None]:
# Compute ROC curve and area under the curve for SVM model
fpr1_SVM, tpr1_SVM, thresholds1_SVM = roc_curve(y_test_mt_sl1, prob_score_SVM_sl1[:, 1])
roc_auc1_SVM = auc(fpr1_SVM, tpr1_SVM)
print("Area under the ROC curve - for strength level 1: %f" % roc_auc1_SVM)

fpr2_SVM, tpr2_SVM, thresholds2_SVM = roc_curve(y_test_mt_sl2, prob_score_SVM_sl2[:, 1])
roc_auc2_SVM = auc(fpr2_SVM, tpr2_SVM)
print("Area under the ROC curve - for strength level 2: %f" % roc_auc2_SVM)

fpr3_SVM, tpr3_SVM, thresholds3_SVM = roc_curve(y_test_mt_sl3, prob_score_SVM_sl3[:, 1])
roc_auc3_SVM = auc(fpr3_SVM, tpr3_SVM)
print("Area under the ROC curve - for strength level 3: %f" % roc_auc3_SVM)

- No significant difference was found in AUC between the Logistic and SVM models
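
- The two-step `roc_curve` + `auc` computation can be checked on a tiny hand-made example. The labels and scores below are made up for illustration; `roc_auc_score` is shown as a one-call equivalent.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
area = auc(fpr, tpr)                               # trapezoidal area under those points
area_direct = roc_auc_score(y_true, y_score)       # same value in a single call

print("AUC via roc_curve + auc:", area)
print("AUC via roc_auc_score:  ", area_direct)
```

- An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so models are compared by how far above 0.5 they sit.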

In [None]:
# Depict the ROC curve and determine better model at the same level of strength

import pylab as pl

# Plot the curve - for LR vs SVM for strength level 1
pl.clf()
pl.plot(fpr1_LR, tpr1_LR, label='ROC curve for logistic at strength level 1 (area = %0.2f)' % roc_auc1_LR)
pl.plot(fpr1_SVM, tpr1_SVM, label='ROC curve for SVC at strength level 1 (area = %0.2f)' % roc_auc1_SVM)
pl.plot([0, 1], [0, 1], 'k--')
pl.xlim([0.0, 1.0])
pl.ylim([0.0, 1.0])
pl.xlabel('False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('Receiver operating characteristic example')
pl.legend(loc="lower right")
pl.show()

- The above plot shows that the Logistic regression and SVM models yield the same result when predicting concrete strength level 1

In [None]:
# Plot the curve - for LR vs SVM for strength level 2
pl.clf()
pl.plot(fpr2_LR, tpr2_LR, label='ROC curve for logistic at strength level 2 (area = %0.2f)' % roc_auc2_LR)
pl.plot(fpr2_SVM, tpr2_SVM, label='ROC curve for SVC at strength level 2 (area = %0.2f)' % roc_auc2_SVM)
pl.plot([0, 1], [0, 1], 'k--')
pl.xlim([0.0, 1.0])
pl.ylim([0.0, 1.0])
pl.xlabel('False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('Receiver operating characteristic example')
pl.legend(loc="lower right")
pl.show()

- The overall predictability is low at strength level 2; however, from the above we can see that the LR model is better than SVM

In [None]:
# Plot the curve - for LR vs SVM for strength level 3
pl.clf()
pl.plot(fpr3_LR, tpr3_LR, label='ROC curve for logistic at strength level 3 (area = %0.2f)' % roc_auc3_LR)
pl.plot(fpr3_SVM, tpr3_SVM, label='ROC curve for SVC at strength level 3 (area = %0.2f)' % roc_auc3_SVM)
pl.plot([0, 1], [0, 1], 'k--')
pl.xlim([0.0, 1.0])
pl.ylim([0.0, 1.0])
pl.xlabel('False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('Receiver operating characteristic example')
pl.legend(loc="lower right")
pl.show()

- The overall predictability is very good at strength level 3; however, from the above we can see that the LR model is better than SVM

In [None]:
# Let's implement a k-nearest-neighbors classifier and look for the best hyperparameter set, tuning of which would provide the best 
# model performance

from sklearn.neighbors import KNeighborsClassifier # Import the library

# Instantiate knn model for each of the data set
knn_sl1 = KNeighborsClassifier() # for strength level 1
knn_sl2 = KNeighborsClassifier() # for strength level 2
knn_sl3 = KNeighborsClassifier() # for strength level 3

# Fit the models & get the model score
knn_sl1.fit(X_train_mt_sl1, y_train_mt_sl1)
knn_sl2.fit(X_train_mt_sl2, y_train_mt_sl2)
knn_sl3.fit(X_train_mt_sl3, y_train_mt_sl3)

In [None]:
# Let's view the model score: KNN model:
print("KNN model score (in Train data) for strength level 1: ", knn_sl1.score(X_train_mt_sl1, y_train_mt_sl1))
print("KNN model score (in Test data) for strength level 1: ", knn_sl1.score(X_test_mt_sl1, y_test_mt_sl1))
print("======================================================================================================")
print("KNN model score (in Train data) for strength level 2: ", knn_sl2.score(X_train_mt_sl2, y_train_mt_sl2))
print("KNN model score (in Test data) for strength level 2: ", knn_sl2.score(X_test_mt_sl2, y_test_mt_sl2))
print("======================================================================================================")
print("KNN model score (in Train data) for strength level 3: ", knn_sl3.score(X_train_mt_sl3, y_train_mt_sl3))
print("KNN model score (in Test data) for strength level 3: ", knn_sl3.score(X_test_mt_sl3, y_test_mt_sl3))
print("======================================================================================================")

In [None]:
from sklearn.metrics import accuracy_score

# Take the list of parameter which will be used for tuning
param_grid = {'n_neighbors': list(range(1,9)),
             'algorithm': ('auto', 'ball_tree', 'kd_tree' , 'brute') }

# Instantiate the grid search algorithm to identify best fit parameter set
from sklearn.model_selection import GridSearchCV

gs_sl1 = GridSearchCV(knn_sl1,param_grid,cv=10) # for strength level 1
gs_sl2 = GridSearchCV(knn_sl2,param_grid,cv=10) # for strength level 2
gs_sl3 = GridSearchCV(knn_sl3,param_grid,cv=10) # for strength level 3

# Fit the Grid Search model
gs_sl1.fit(X_train_mt_sl1, y_train_mt_sl1)
gs_sl2.fit(X_train_mt_sl2, y_train_mt_sl2)
gs_sl3.fit(X_train_mt_sl3, y_train_mt_sl3)

In [None]:
# Let's view the model score: GridSearch model:
print("Grid Search model score (in Train data) for strength level 1: ", gs_sl1.score(X_train_mt_sl1, y_train_mt_sl1))
print("Grid Search model score (in Test data) for strength level 1: ", gs_sl1.score(X_test_mt_sl1, y_test_mt_sl1))
print("======================================================================================================")
print("Grid Search model score (in Train data) for strength level 2: ", gs_sl2.score(X_train_mt_sl2, y_train_mt_sl2))
print("Grid Search model score (in Test data) for strength level 2: ", gs_sl2.score(X_test_mt_sl2, y_test_mt_sl2))
print("======================================================================================================")
print("Grid Search model score (in Train data) for strength level 3: ", gs_sl3.score(X_train_mt_sl3, y_train_mt_sl3))
print("Grid Search model score (in Test data) for strength level 3: ", gs_sl3.score(X_test_mt_sl3, y_test_mt_sl3))
print("======================================================================================================")

In [None]:
# View the best parameters at each strength level, i.e. the ones that yield the best cross-validated model score
print("Best parameter for strength level 1 dataset: ", gs_sl1.best_params_)
print("Best parameter for strength level 2 dataset: ", gs_sl2.best_params_)
print("Best parameter for strength level 3 dataset: ", gs_sl3.best_params_)
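When reading the grid search scores above, it helps to separate two numbers: `best_score_` is the mean cross-validated score of the best parameter set, while `score()` on the fitted search object evaluates the refitted best estimator on whatever data is passed in. The sketch below illustrates this on synthetic data; the dataset, the reduced grid and the names `gs_demo`, `X_tr`, `X_te` are assumptions made for illustration and are not part of this notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for one strength-level dataset (hypothetical data)
X, y = make_classification(n_samples=300, n_features=8, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Reduced grid, only to keep the sketch fast
demo_grid = {"n_neighbors": list(range(1, 9))}
gs_demo = GridSearchCV(KNeighborsClassifier(), demo_grid, cv=5)
gs_demo.fit(X_tr, y_tr)

cv_score = gs_demo.best_score_                                  # mean CV accuracy of the best setting
test_score_via_search = gs_demo.score(X_te, y_te)               # delegates to the refitted best estimator
test_score_via_best = gs_demo.best_estimator_.score(X_te, y_te)

print("Best parameters:", gs_demo.best_params_)
print("Mean CV score of best parameters:", cv_score)
print("Test score of refitted best estimator:", test_score_via_search)
```

The cross-validated score is usually the fairer basis for comparing parameter sets, since the training score of the refitted model is measured on data it has already seen.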

In [None]:
# View the mean cross-validated test score of every parameter combination, for each dataset
print("Mean test score for strength level 1")
gs_sl1.cv_results_['mean_test_score']

In [None]:
print("Mean test score for strength level 2")
gs_sl2.cv_results_['mean_test_score']

In [None]:
print("Mean test score for strength level 3")
gs_sl3.cv_results_['mean_test_score']
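The raw `mean_test_score` arrays above are hard to read because they are not shown next to the parameters that produced them. One possible way to inspect them is to load `cv_results_` into a DataFrame and sort by rank. This is a minimal sketch on synthetic data; the dataset and the names `gs_demo` and `summary` are illustrative assumptions, while the grid mirrors the one used above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical, not the concrete dataset)
X, y = make_classification(n_samples=200, n_features=8, n_informative=5, random_state=0)

param_grid_demo = {"n_neighbors": list(range(1, 9)),
                   "algorithm": ("auto", "ball_tree", "kd_tree", "brute")}
gs_demo = GridSearchCV(KNeighborsClassifier(), param_grid_demo, cv=5)
gs_demo.fit(X, y)

# One row per parameter combination, best-ranked combinations first
results = pd.DataFrame(gs_demo.cv_results_)
summary = results[["param_n_neighbors", "param_algorithm",
                   "mean_test_score", "std_test_score", "rank_test_score"]]
summary = summary.sort_values("rank_test_score")
print(summary.head())
```

Applied to the notebook, the same pattern would use `gs_sl1.cv_results_`, `gs_sl2.cv_results_` and `gs_sl3.cv_results_` in place of `gs_demo.cv_results_`.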

In [None]:
# Measure the model performance range at 95% confidence level

In [None]:
# Let's build a model based on bootstrap sampling and measure the model score confidence range at 95%

# Implement the bootstrap
from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier  # needed by the bootstrap loop below

n_iterations = 1000              # Number of bootstrap samples to create
n_size = int(len(cDataFES) * 0.50)   # picking only 50 % of the given data in every bootstrap sample
values = cDataFES.values

# run bootstrap
stats_sl1 = list() # for Strength level 1
stats_sl2 = list() # for Strength level 2
stats_sl3 = list() # for Strength level 3

for i in range(n_iterations):
    # prepare train and test sets for each of the strength levels
    train = resample(values, n_samples=n_size)  # Sampling with replacement
    train_rows = set(map(tuple, train))  # rows drawn into this sample, hashed once for fast lookup
    test = np.array([x for x in values if tuple(x) not in train_rows])  # rest of the data, not drawn into the sample
    # fit model
    model_sl1 = DecisionTreeClassifier() # for Strength level 1
    model_sl2 = DecisionTreeClassifier() # for Strength level 2 
    model_sl3 = DecisionTreeClassifier() # for Strength level 3
    model_sl1.fit(train[:,:-3], train[:,-3]) # strength level 1 is the target (third-last column)
    model_sl2.fit(train[:,:-3], train[:,-2]) # strength level 2 is the target (second-last column)
    model_sl3.fit(train[:,:-3], train[:,-1]) # strength level 3 is the target (last column)
    # evaluate model
    predictions_sl1 = model_sl1.predict(test[:,:-3])
    predictions_sl2 = model_sl2.predict(test[:,:-3])
    predictions_sl3 = model_sl3.predict(test[:,:-3])
    # Get the score
    score_sl1 = accuracy_score(test[:,-3], predictions_sl1)
    score_sl2 = accuracy_score(test[:,-2], predictions_sl2)
    score_sl3 = accuracy_score(test[:,-1], predictions_sl3)
    # Store the score into a list
    stats_sl1.append(score_sl1)
    stats_sl2.append(score_sl2)
    stats_sl3.append(score_sl3)
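The out-of-sample split in the loop above can also be sketched with row indices instead of comparing row values. The example below is an illustration on synthetic data, with a single target and only 50 iterations; the data and the names `rng`, `oob_mask` and `oob_scores` are assumptions. Note one difference from the loop above: index-based selection keeps a row in the test set even if an identical duplicate row was drawn into the training sample, whereas the value-based filter removes such duplicates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a single target (hypothetical)
X, y = make_classification(n_samples=200, n_features=8, n_informative=5, random_state=0)

rng = np.random.default_rng(42)   # seeded so the sketch is repeatable
n_rows = len(X)
n_size = n_rows // 2              # 50% of the data per bootstrap sample
oob_scores = []

for _ in range(50):               # fewer iterations than above, to keep the sketch fast
    # Draw row indices with replacement
    train_idx = rng.integers(0, n_rows, size=n_size)
    # Rows never drawn form the out-of-bag test set
    oob_mask = np.ones(n_rows, dtype=bool)
    oob_mask[train_idx] = False
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    oob_scores.append(accuracy_score(y[oob_mask], model.predict(X[oob_mask])))

print("Mean out-of-bag accuracy:", np.mean(oob_scores))
```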

In [None]:
# plot scores - for strength level 1
plt.hist(stats_sl1)
plt.show()
# plot scores - for strength level 2
plt.hist(stats_sl2)
plt.show()
# plot scores - for strength level 3
plt.hist(stats_sl3)
plt.show()

# confidence intervals - for strength level 1
alpha = 0.95                             # for 95% confidence 
p1 = ((1.0-alpha)/2.0) * 100              # lower percentile: 2.5% of the scores lie in each tail (left and right)
lower1 = max(0.0, np.percentile(stats_sl1, p1))  
p1 = (alpha+((1.0-alpha)/2.0)) * 100      # upper percentile: 97.5
upper1 = min(1.0, np.percentile(stats_sl1, p1))
print('%.1f%% confidence interval for strength level 1: %.1f%% to %.1f%%' % (alpha*100, lower1*100, upper1*100))

# confidence intervals - for strength level 2
p2 = ((1.0-alpha)/2.0) * 100              # lower percentile: 2.5% of the scores lie in each tail (left and right)
lower2 = max(0.0, np.percentile(stats_sl2, p2))  
p2 = (alpha+((1.0-alpha)/2.0)) * 100      # upper percentile: 97.5
upper2 = min(1.0, np.percentile(stats_sl2, p2))
print('%.1f%% confidence interval for strength level 2: %.1f%% to %.1f%%' % (alpha*100, lower2*100, upper2*100))

# confidence intervals - for strength level 3
p3 = ((1.0-alpha)/2.0) * 100              # lower percentile: 2.5% of the scores lie in each tail (left and right)
lower3 = max(0.0, np.percentile(stats_sl3, p3))  
p3 = (alpha+((1.0-alpha)/2.0)) * 100      # upper percentile: 97.5
upper3 = min(1.0, np.percentile(stats_sl3, p3))
print('%.1f%% confidence interval for strength level 3: %.1f%% to %.1f%%' % (alpha*100, lower3*100, upper3*100))
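The percentile arithmetic used for the three intervals is the same each time, so it can be captured in a small helper. The function name `percentile_interval` and the evenly spaced demo scores are hypothetical and only serve to show the calculation: for a 95% level, the bounds are the 2.5th and 97.5th percentiles of the bootstrap scores, clipped to the valid accuracy range of 0 to 1.

```python
import numpy as np

def percentile_interval(scores, alpha=0.95):
    """Return the (lower, upper) percentile bounds of a list of accuracy scores."""
    tail = (1.0 - alpha) / 2.0 * 100                        # 2.5 for alpha = 0.95
    lower = max(0.0, np.percentile(scores, tail))           # clip at 0% accuracy
    upper = min(1.0, np.percentile(scores, 100.0 - tail))   # clip at 100% accuracy
    return lower, upper

# Demo on evenly spaced scores 0.00, 0.01, ..., 1.00 (illustrative values only)
demo_scores = np.linspace(0.0, 1.0, 101)
low, high = percentile_interval(demo_scores)
print('95.0%% interval: %.1f%% to %.1f%%' % (low * 100, high * 100))
```

In the notebook the helper would be called as, for example, `percentile_interval(stats_sl1)`.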

- From all the previous analysis we can draw the following inferences:
    - Clustering on the Age feature and adding dummy columns improved the predictability of concrete strength with the Linear model
    - Composite features, i.e. the water-cement or water-binder combination, didn't improve the model score for predicting strength
    - The Ridge model highlighted the features which are important for predicting strength
    - The Lasso model indicated which features to drop from model building
    - Polynomial parameters, i.e. higher-degree dimensions, improved the model predictability, and both Ridge and Lasso indicated the importance of the features
    - The Ridge model on polynomial features improved the model score on the dataset with age dummy columns
    - Once the strength column is clustered into 3 subsets,
        - The Decision Tree classifier model yields the best model accuracy for predicting strength levels 1 and 2
        - Logistic regression, SVM and KNN models achieved higher scores than the Linear model
        - The grid search technique highlighted the hyperparameters to focus on to get the best model accuracy