### What you will learn
- Regression
- Exploratory Data Analysis
- Modeling
- Hyperparameter tuning

## Concrete Compressive Strength Prediction


Concrete is one of the most important materials in civil engineering, and knowing its compressive strength is essential when constructing a building or a bridge. The compressive strength of concrete is a highly nonlinear function of its ingredients and their characteristics. A machine learning model that predicts strength could therefore help identify combinations of ingredients that result in high strength.


### Problem Statement
Predict the compressive strength of concrete from its age and the measured quantities of its ingredients.

### Data Description

* Number of instances - 1030
* Number of Attributes - 9
  * Attribute breakdown - 8 quantitative inputs, 1 quantitative output

#### Attribute information
##### Inputs
* Cement
* Blast Furnace Slag
* Fly Ash
* Water
* Superplasticizer
* Coarse Aggregate
* Fine Aggregate

All of the above features are measured in kg/$m^3$.

* Age (in days)

##### Output
* Concrete Compressive Strength (MPa)




In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import zscore
from sklearn.model_selection import train_test_split, cross_val_score

%matplotlib inline
plt.rc('font',size=14)
sns.set(style='darkgrid',color_codes=True)

##### Loading the Data 

In [None]:
data = pd.read_csv("../input/concrete.csv")

In [None]:
data.head()

Simplifying Column names, since they appear to be too lengthy.

In [None]:
data.head()
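The renaming step itself is not shown here, while the later cells rely on the short names `cement`, `slag`, `ash`, `water`, `superplastic`, `coarseagg`, `fineagg`, `age` and `strength`. If a raw file arrives with long headers, one way to rename by position is sketched below. This is only an illustration under assumptions: the frame `raw`, its placeholder headers and its single row of values are hypothetical, and the column order is assumed to match the attribute list above.

```python
import pandas as pd

# Short names used by the rest of the notebook, in the assumed column order.
short_names = ["cement", "slag", "ash", "water", "superplastic",
               "coarseagg", "fineagg", "age", "strength"]

# Hypothetical stand-in for a raw frame with long headers (illustrative values only).
raw = pd.DataFrame(
    [[540.0, 0.0, 0.0, 162.0, 2.5, 1040.0, 676.0, 28, 79.99]],
    columns=[f"long column name {i}" for i in range(9)],
)

# Rename by position; stop early if the column count is not what we assume.
if raw.shape[1] != len(short_names):
    raise ValueError("unexpected number of columns")
renamed = raw.copy()
renamed.columns = short_names
print(renamed.columns.tolist())
```

Renaming by position is fragile if the column order ever changes, so a name-to-name mapping with `DataFrame.rename` is the safer choice once the exact raw headers are known.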

# Exploratory data quality report

### 1.a Univariate analysis

##### checking datatypes

In [None]:
data.info()

###### Data types information

In [None]:
data.dtypes

In [None]:
data.shape

##### There are 1030 rows and 9 columns

In [None]:
# column names
data.columns.tolist()

In [None]:
# dataset distribution
data.describe().T

- cement, slag and ash are right skewed (their means are larger than their medians).
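Skewness can also be checked numerically instead of being read off the summary table. The sketch below uses a small made-up frame `toy` (not the concrete data) to show how the sign of `DataFrame.skew()` relates to the mean and median; on the real data the same three calls would be applied to `data`.

```python
import pandas as pd

# Toy columns (not the concrete data): one with a long right tail, one symmetric.
toy = pd.DataFrame({
    "right_tailed": [0, 0, 0, 1, 1, 2, 3, 10, 25],
    "symmetric": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),  # > 0: right skewed, < 0: left skewed, near 0: symmetric
})
print(summary)
```

A right-skewed column has a long tail of large values, which pulls the mean above the median.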

###### 1.a -> Checking for 'null' values

In [None]:
data.isna().sum()

In [None]:
features = [col for col in data.columns.tolist() if col not in ['strength']]

In [None]:
def univariate_analysis(data):
    for col in features:
        print("*"*50)
        print("Column name: ", col)
        print('Range of values: ', data[col].max() - data[col].min())
        print("<- Pivot values -> ")
        print('Minimum value: ', data[col].min())
        print('Maximum value: ',data[col].max())
        print('Mean value: ', data[col].mean())
        print('Median value: ',data[col].median())
        print('Standard deviation: ', data[col].std())
        print('Null values: ',data[col].isnull().any())
        print("<- Outlier Detection -> ")
        Q1=data[col].quantile(q=0.25)
        Q3=data[col].quantile(q=0.75)
        print('1st Quartile (Q1) is: ', Q1)
        print('3rd Quartile (Q3) is: ', Q3)
        print('Interquartile range (IQR) is ', stats.iqr(data[col]))
        # Values beyond 1.5 * IQR from the quartiles are counted as outliers
        L_outliers=Q1-1.5*(Q3-Q1)
        U_outliers=Q3+1.5*(Q3-Q1)
        n_upper = data[data[col] > U_outliers][col].count()
        n_lower = data[data[col] < L_outliers][col].count()
        print(f'Lower outlier limit in {col}: {L_outliers}')
        print(f'Upper outlier limit in {col}: {U_outliers}')
        print(f'Number of outliers in {col} upper : ', n_upper)
        print(f'Number of outliers in {col} lower : ', n_lower)
        print(f'% of Outlier in {col} upper: {round(n_upper*100/len(data))} %')
        print(f'% of Outlier in {col} lower: {round(n_lower*100/len(data))} %')
        fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))

        # boxplot (vertical, so the column goes on the y axis)
        sns.boxplot(y=data[col],ax=ax1)
        ax1.set_ylabel(col, fontsize=15)
        ax1.set_title(f'Boxplot of {col}', fontsize=15)
        ax1.tick_params(labelsize=15)

        # density histogram with KDE (histplot replaces the deprecated distplot)
        sns.histplot(data[col],kde=True,stat='density',ax=ax2)
        ax2.set_xlabel(col, fontsize=15)
        ax2.set_ylabel('Density', fontsize=15)
        ax2.set_title(f'Distribution of {col}', fontsize=15)
        ax2.tick_params(labelsize=15)

        # histogram of raw counts
        ax3.hist(data[col])
        ax3.set_xlabel(col, fontsize=15)
        ax3.set_ylabel('Count', fontsize=15)
        ax3.set_title(f'Histogram of {col}', fontsize=15)
        ax3.tick_params(labelsize=15)

        plt.subplots_adjust(wspace=0.5)
        plt.tight_layout() 
        plt.show()
        print("#"*50)
univariate_analysis(data)

### Multivariate Analysis


##### 1.b Checking the pairwise relations of Features.

In [None]:
sns.pairplot(data)
plt.show()

There seems to be no high correlation between the independent variables (features). This can be further confirmed by plotting the **Pearson correlation coefficients** between the features.

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(16, 16))
# One distribution plot per column, in column order
# (histplot replaces the deprecated distplot)
for ax, col in zip(axes.ravel(), data.columns):
    sns.histplot(data[col], kde=True, stat='density', ax=ax)
plt.show()

### Observations
- cement is almost normal.
- slag has three Gaussians and is right skewed.
- ash has two Gaussians and is right skewed.
- water has three Gaussians and is slightly left skewed.
- superplastic has two Gaussians and is right skewed.
- coarseagg has three Gaussians and is almost normal.
- fineagg has almost two Gaussians and looks roughly normal.
- age has multiple Gaussians and is right skewed.

In [None]:
corr = data.corr()

plt.figure(figsize=(14,10))
sns.heatmap(corr, annot=True, cmap='Blues')
b, t = plt.ylim()
plt.ylim(b+0.5, t-0.5)
plt.title("Feature Correlation Heatmap")
plt.show()

### Observations
* There aren't any **high** correlations between **Compressive Strength** and the other features except for **Cement**, which is expected, since more cement should give more strength.
* **Age** and **Super Plasticizer** are the two other features most correlated with **Compressive Strength**.
* **Super Plasticizer** has a high negative correlation with **Water** and positive correlations with **Fly Ash** and **Fine Aggregate**.
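To rank features by their correlation with the target without reading values off the heatmap, the target column of the correlation matrix can be sorted by absolute value. The sketch below does this on a synthetic frame `toy`; the columns and coefficients are made up for illustration and are not the concrete data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in: strength rises with cement, falls with water, ignores noise.
toy = pd.DataFrame({
    "cement": rng.normal(size=n),
    "water": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy["strength"] = 2.0 * toy["cement"] - 1.0 * toy["water"] + rng.normal(scale=0.5, size=n)

# Pearson correlation of every feature with the target, strongest first.
target_corr = toy.corr()["strength"].drop("strength")
target_corr = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(target_corr)
```

On the notebook's data the same two lines would start from `data.corr()["strength"]`. Pearson correlation only captures linear association, so a low value does not rule out a nonlinear effect.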

We can further analyze these correlations visually by plotting these relations.

In [None]:
plt.figure(figsize=(14, 10))
# histplot with a KDE overlay replaces the deprecated distplot
ax = sns.histplot(data.strength, kde=True, stat='density')
ax.set_title("Compressive Strength Distribution")

###  2.c

In [None]:
fig, ax = plt.subplots(figsize=(14,10))
sns.scatterplot(y="strength", x="cement", hue="water", size="age", data=data, ax=ax, sizes=(50, 300))
ax.set_title("Strength vs (Cement, Age, Water)")
ax.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()

#### Observations from Strength vs (Cement, Age, Water)
* Compressive **strength increases with amount of cement**
* Compressive **strength increases with age**
* Concrete with **low age** requires **more cement** for **higher strength**
* The **older the concrete** is, the **more water** it requires
* Concrete **strength increases** when **less water** is used in preparing it  

In [None]:
data.columns

In [None]:
fig, ax = plt.subplots(figsize=(14,10))
sns.scatterplot(y="strength", x="fineagg", hue="ash", size="superplastic", 
                data=data, ax=ax, sizes=(50, 300))
ax.set_title("Strength vs (Fine aggregate, Super Plasticizer, FlyAsh)")
ax.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()

##### Observations from CC Strength vs (Fine aggregate, Super Plasticizer, FlyAsh)
* As **fly ash increases**, the **strength decreases**
* **Strength increases** with **Super plasticizer**

In [None]:
fig, ax = plt.subplots(figsize=(14,10))
sns.scatterplot(y="strength", x="fineagg", hue="water", size="superplastic", 
                data=data, ax=ax, sizes=(50, 300))
ax.set_title("Strength vs (Fine aggregate, Super Plasticizer, Water)")
ax.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()

##### Observations from CC Strength vs (Fine aggregate, Super Plasticizer, Water)
* **Strength decreases** with an **increase in water** and **increases** with an **increase in Super plasticizer** (as already seen in the plots above)
* **More Fine aggregate** is used when **less water** and **more Super plasticizer** are used.


### Data Preprocessing

In [None]:
concrete_df1 = data.copy()
concrete_df1.boxplot(figsize=(35,15))

In [None]:
# Count the values that lie more than 3 standard deviations from the column mean
for col in features:
    z = (concrete_df1[col] - concrete_df1[col].mean()) / concrete_df1[col].std()
    print(f'Number of outliers in {col}: ', (z.abs() > 3).sum())

* Here, we have used the standard deviation method to detect outliers. Any data point that lies more than 3 standard deviations from the mean is very likely to be an outlier.
* We can see that slag, water, superplastic and age contain outliers.
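The counts in this section use the 3 standard deviation rule, while the replacement step later in the notebook uses the 1.5 × IQR rule. The two rules do not always flag the same points. The sketch below compares them on a small made-up series `toy`; the helper `count_outliers` is introduced here only for illustration.

```python
import pandas as pd

def count_outliers(s: pd.Series) -> dict:
    """Count outliers in s under the 3-sigma rule and the 1.5 * IQR rule."""
    z = (s - s.mean()) / s.std()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    outside_iqr = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
    return {"z_score": int((z.abs() > 3).sum()), "iqr": int(outside_iqr.sum())}

# Toy values: a tight bulk, one mild outlier (17) and one extreme outlier (30).
toy = pd.Series([10] * 5 + [11] * 5 + [12] * 5 + [13] * 5 + [17, 30])
counts = count_outliers(toy)
print(counts)
```

In this toy case the IQR rule flags both 17 and 30, while the 3-sigma rule flags only 30, because the extreme value inflates the standard deviation it is measured against.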

In [None]:
print('Records containing outliers in slag: \n',concrete_df1[((concrete_df1.slag - concrete_df1.slag.mean()) / concrete_df1.slag.std()).abs() >3]['slag'])

In [None]:
print('Records containing outliers in water: \n',concrete_df1[((concrete_df1.water - concrete_df1.water.mean()) / concrete_df1.water.std()).abs() >3]['water'])

In [None]:
print('Records containing outliers in superplastic: \n',concrete_df1[((concrete_df1.superplastic - concrete_df1.superplastic.mean()) / concrete_df1.superplastic.std()).abs() >3]['superplastic'])

In [None]:
print('Records containing outliers in age: \n',concrete_df1[((concrete_df1.age - concrete_df1.age.mean()) / concrete_df1.age.std()).abs() >3]['age'])

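### Dealing with outliers

We will replace the outliers with the column median. Note that the replacement below flags outliers with the 1.5 × IQR rule rather than the 3 standard deviation rule used for the counts above.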

In [None]:
for col_name in concrete_df1.columns[:-1]:
    q1 = concrete_df1[col_name].quantile(0.25)
    q3 = concrete_df1[col_name].quantile(0.75)
    iqr = q3 - q1
    
    low = q1-1.5*iqr
    high = q3+1.5*iqr
    concrete_df1.loc[(concrete_df1[col_name] < low) | (concrete_df1[col_name] > high), col_name] = concrete_df1[col_name].median()

In [None]:
concrete_df1.boxplot(figsize=(35,15))

### Feature Engineering

##### Scaling 
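Standardize the data, i.e. rescale each column to have a mean of 0 and a standard deviation of 1. Note that the target column `strength` is standardized as well, so the MSE values reported later are in standardized units.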

In [None]:
concrete_df_z = concrete_df1.apply(zscore)
concrete_df_z = pd.DataFrame(concrete_df_z,columns=data.columns) 
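Standardization maps each value to $z = (x - \mu)/\sigma$. Applying it to the full dataset before splitting lets the test rows influence $\mu$ and $\sigma$. A hedged alternative, not what this notebook does, is to fit the scaler on the training split only, for example inside a scikit-learn pipeline. The sketch below uses synthetic arrays `X_toy` and `y_toy` and a KNN regressor purely as an example estimator.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on a large scale and a target that depends on the first one.
X_toy = rng.normal(loc=100.0, scale=20.0, size=(200, 3))
y_toy = 0.5 * X_toy[:, 0] + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=123)

# The scaler learns mean and standard deviation from the training rows only.
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
pipe.fit(X_tr, y_tr)

# The fitted scaler is then reused, unchanged, on the test rows.
scaler = pipe.named_steps["standardscaler"]
train_scaled = scaler.transform(X_tr)
print("Test R^2:", pipe.score(X_te, y_te))
```

Passing the whole pipeline to `cross_val_score` repeats the same fit-on-train, transform-on-test pattern within every fold.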

##### Separate features and target

In [None]:
X = concrete_df_z.iloc[:,:-1]         # Features - All columns but last
y = concrete_df_z.iloc[:,-1]          # Target - Last Column

##### Splitting data into Training and Test. 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 123)

### 3.a Model Building


#### Decision Trees

Decision trees are a reasonable choice here because several input features contain many zeros, as seen from their distributions in the pair plot above. A tree can split on conditions on those features (for example, whether an ingredient is present at all), which can improve performance.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics
from sklearn.model_selection import KFold

dt_model = DecisionTreeRegressor()
dt_model.fit(X_train , y_train)

#printing the feature importance
print('Feature importances: \n',pd.DataFrame(dt_model.feature_importances_,columns=['Imp'],index=X_train.columns).sort_values('Imp', ascending=False))

- cement is the most important feature.
- ash, coarseagg, fineagg, superplastic and slag are the less significant variables. They have less impact on the strength column, as we also saw in the pair plot.

In [None]:
y_pred = dt_model.predict(X_test)
# R^2 on train data (score() returns R^2 for regressors)
print('Performance on training data using Decision Tree:',dt_model.score(X_train,y_train))
# R^2 on test data
print('Performance on testing data using Decision Tree:',dt_model.score(X_test,y_test))
# Evaluate the model using R^2 (stored as "accuracy" in the results table)
acc_DT=metrics.r2_score(y_test, y_pred)
print('R^2 DT: ',acc_DT)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

* The model is overfitting: it scores about 99% on the training data, but the score drops on the test data.
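The gap between training and test scores can be made visible by fitting trees of increasing depth. The sketch below does this on synthetic data; `X_toy` and `y_toy` are made up for illustration and are not the concrete dataset, so the printed numbers say nothing about the notebook's results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Noisy nonlinear target, so a fully grown tree can memorize the noise.
X_toy = rng.uniform(-3, 3, size=(300, 2))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] + rng.normal(scale=0.5, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=123)

# (train R^2, test R^2) for each depth; None means the tree grows until leaves are pure.
depth_scores = {}
for depth in [2, 4, 8, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    depth_scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (r2_train, r2_test) in depth_scores.items():
    print(f"max_depth={depth}: train R^2={r2_train:.3f}, test R^2={r2_test:.3f}")
```

An unrestricted tree reproduces the training targets almost exactly, so its training score alone says little about how it generalizes.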

In [None]:
results = pd.DataFrame({'Method':['Decision Tree'], 'accuracy': acc_DT},index={'1'})
results = results[['Method', 'accuracy']]
results

### K fold cross validation

In [None]:
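num_folds = 10
seed = 77
# random_state only takes effect with shuffle=True (recent scikit-learn raises an error otherwise)
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
results1 = cross_val_score(dt_model,X, y, cv=kfold)
# cross_val_score returns one R^2 per fold; average directly, since abs() would hide negative scores
accuracy=np.mean(results1)
print('Average R^2: ',accuracy)
print('Standard Deviation: ',results1.std())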
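The per-fold values that `cross_val_score` returns for a regressor are $R^2$ scores, not classification accuracy, and $R^2$ can be negative when a model does worse than predicting the mean. The small sketch below, on made-up numbers, shows why taking absolute values before averaging can overstate performance.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r2_score(y_true, y_true)            # predictions equal the targets
r2_mean = r2_score(y_true, np.full(4, 2.5))      # always predict the mean
r2_reversed = r2_score(y_true, y_true[::-1])     # predictions in reversed order

# Hypothetical fold scores, one of them negative.
fold_scores = np.array([0.8, -0.2])
plain_mean = fold_scores.mean()
abs_mean = np.abs(fold_scores).mean()
print(r2_perfect, r2_mean, r2_reversed, plain_mean, abs_mean)
```

Here the plain mean of the fold scores is lower than the mean of their absolute values, so averaging absolute values would report a better score than the model earned.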

In [None]:
tempResultsDf = pd.DataFrame({'Method':['Decision Tree k fold'], 'accuracy': [accuracy]},index={'2'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

##### Drop least significant features

In [None]:
concrete_df2=concrete_df_z.copy()

In [None]:
#independent and dependent variable
X = concrete_df2.drop( ['strength','ash','coarseagg','fineagg'] , axis=1)
y = concrete_df2['strength']
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 123)

In [None]:
dt_model = DecisionTreeRegressor()
dt_model.fit(X_train , y_train)

In [None]:
print('Feature importances: \n',pd.DataFrame(dt_model.feature_importances_,columns=['Imp'], index=X_train.columns).sort_values('Imp', ascending=False))

In [None]:
y_pred = dt_model.predict(X_test)
# performance on train data
print('Performance on training data using DT:',dt_model.score(X_train,y_train))
# performance on test data
print('Performance on testing data using DT:',dt_model.score(X_test,y_test))
#Evaluate the model using accuracy
acc_DT=metrics.r2_score(y_test, y_pred)
print('Accuracy DT: ',acc_DT)

* The score on the testing dataset has not improved; the model is still overfitting.

In [None]:
tempResultsDf = pd.DataFrame({'Method':['Decision Tree2'], 'accuracy': [acc_DT]},index={'3'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results


### Regularising/Pruning of Decision Tree


In [None]:
X=concrete_df_z.iloc[:,0:8]
y = concrete_df_z.iloc[:,8]
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 123)

In [None]:
reg_dt_model = DecisionTreeRegressor( max_depth = 4,random_state=1,min_samples_leaf=5)
reg_dt_model.fit(X_train, y_train)

In [None]:
print('Feature importances: \n',pd.DataFrame(reg_dt_model.feature_importances_,columns=['Imp'], index=X_train.columns).sort_values('Imp', ascending=False))
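The values `max_depth = 4` and `min_samples_leaf = 5` above are set by hand. One possible way to tune them is a cross-validated grid search. The sketch below is an assumption-laden illustration: the parameter grid, the 5-fold setup and the synthetic arrays `X_toy` and `y_toy` are all chosen for the example and are not taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 2))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] + rng.normal(scale=0.5, size=300)

# Candidate values for the two pruning parameters (illustrative grid).
param_grid = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 10]}
cv = KFold(n_splits=5, shuffle=True, random_state=77)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    param_grid,
    cv=cv,
    scoring="r2",  # same metric the notebook reports
)
search.fit(X_toy, y_toy)
print("Best parameters:", search.best_params_)
print("Best mean CV R^2:", search.best_score_)
```

On the notebook's data, `search.fit(X_train, y_train)` would keep the test split untouched until the final evaluation.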

### Visualizing the Regularized Tree

In [None]:
!pip install pydotplus --quiet

In [None]:
from sklearn.tree import export_graphviz
from io import StringIO  
from IPython.display import Image  
import pydotplus
# Feature names in the column order used to fit reg_dt_model
feature_cols = X_train.columns

In [None]:
dot_data = StringIO()
# class_names applies only to classifiers, so it is omitted for this regressor
export_graphviz(reg_dt_model, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,feature_names = list(feature_cols))
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('concrete_pruned.png')
Image(graph.create_png())

In [None]:
y_pred = reg_dt_model.predict(X_test)
# performance on train data
print('Performance on training data using DT:',reg_dt_model.score(X_train,y_train))
# performance on test data
print('Performance on testing data using DT:',reg_dt_model.score(X_test,y_test))
#Evaluate the model using accuracy
acc_RDT=metrics.r2_score(y_test, y_pred)
print('Accuracy DT: ',acc_RDT)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

In [None]:
tempResultsDf = pd.DataFrame({'Method':['Pruned Decision Tree'], 'accuracy': [acc_RDT]},index={'4'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

### K fold cross validation

In [None]:
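num_folds = 10
seed = 77
# random_state only takes effect with shuffle=True
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
results1 = cross_val_score(reg_dt_model,X, y, cv=kfold)
# average the per-fold R^2 directly; abs() would hide negative scores
accuracy=np.mean(results1)
print('Average R^2: ',accuracy)
print('Standard Deviation: ',results1.std())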

In [None]:
tempResultsDf = pd.DataFrame({'Method':['Pruned Decision Tree k fold'], 'accuracy': [accuracy]},index={'5'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

In [None]:
concrete_df3=concrete_df_z.copy()

In [None]:
X = concrete_df3.drop( ['strength','ash','coarseagg','fineagg'] , axis=1)
y = concrete_df3['strength']
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 1)

In [None]:
reg_dt_model = DecisionTreeRegressor( max_depth = 4,random_state=1,min_samples_leaf=5)
reg_dt_model.fit(X_train, y_train)

In [None]:
y_pred = reg_dt_model.predict(X_test)
# performance on train data
print('Performance on training data using DT:',reg_dt_model.score(X_train,y_train))
# performance on test data
print('Performance on testing data using DT:',reg_dt_model.score(X_test,y_test))
#Evaluate the model using accuracy
acc_RDT=metrics.r2_score(y_test, y_pred)
print('Accuracy DT: ',acc_RDT)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

In [None]:
tempResultsDf = pd.DataFrame({'Method':['Pruned Decision Tree2'], 'accuracy': [acc_RDT]},index={'6'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

### K-Means Clustering

In [None]:
from sklearn.cluster import KMeans

In [None]:
cluster_range = range( 1, 15 )  
cluster_errors = []
for num_clusters in cluster_range:
    # Fit on the standardized data, the same data used for the final clustering below
    clusters = KMeans( n_clusters = num_clusters, n_init = 5, random_state = 2354 )
    clusters.fit(concrete_df_z)
    # inertia_ is the within-cluster sum of squared distances
    cluster_errors.append( clusters.inertia_ )
clusters_df = pd.DataFrame( { "num_clusters":cluster_range, "cluster_errors": cluster_errors } )
clusters_df[0:15]

In [None]:
plt.figure(figsize=(12,6))
plt.plot( clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o" )

In [None]:
cluster = KMeans( n_clusters = 6, random_state = 2354 )
cluster.fit(concrete_df_z)

In [None]:
prediction=cluster.predict(concrete_df_z)
concrete_df_z["GROUP"] = prediction     
# Creating a mirror copy for later re-use instead of building repeatedly
concrete_df_z_copy = concrete_df_z.copy(deep = True)  

In [None]:
centroids = cluster.cluster_centers_
centroids

In [None]:
centroid_df = pd.DataFrame(centroids, columns = list(concrete_df1) )
centroid_df

In [None]:
# plot centroids and the data in the cluster into box plots
concrete_df_z.boxplot(by = 'GROUP',  layout=(3,3), figsize=(15, 10))

* None of the dimensions is a good predictor of the target variable.
* For every dimension (variable), the clusters cover a similar range of values, except in one case.
* The bodies of the clusters overlap.
* So although k-means finds clusters along different dimensions, they show no distinct characteristics that would justify splitting the data into clusters and building a separate model for each.

## Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, BaggingRegressor

In [None]:
X=concrete_df_z.iloc[:,0:8]
y = concrete_df_z.iloc[:,8]
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 1)

In [None]:
model=RandomForestRegressor()
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
# R^2 on train data
print('Performance on training data using RFR:',model.score(X_train,y_train))
# R^2 on test data
print('Performance on testing data using RFR:',model.score(X_test,y_test))
# Evaluate the model using R^2 (stored as "accuracy" in the results table)
acc_RFR=metrics.r2_score(y_test, y_pred)
print('R^2 RFR: ',acc_RFR)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

* This model is also overfit.

In [None]:
tempResultsDf = pd.DataFrame({'Method':['Random Forest Regressor'], 'accuracy': [acc_RFR]},index={'7'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

### K fold cross validation

In [None]:
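num_folds = 10
seed = 77
# random_state only takes effect with shuffle=True
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
results1 = cross_val_score(model,X, y, cv=kfold)
# average the per-fold R^2 directly; abs() would hide negative scores
accuracy=np.mean(results1)
print('Average R^2: ',accuracy)
print('Standard Deviation: ',results1.std())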

In [None]:
tempResultsDf = pd.DataFrame({'Method':['Random Forest Regressor k fold'], 'accuracy': [accuracy]},index={'8'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

## Gradient Boosting Regressor

In [None]:
model=GradientBoostingRegressor()
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
# R^2 on train data
print('Performance on training data using GBR:',model.score(X_train,y_train))
# R^2 on test data
print('Performance on testing data using GBR:',model.score(X_test,y_test))
# Evaluate the model using R^2 (stored as "accuracy" in the results table)
acc_GBR=metrics.r2_score(y_test, y_pred)
print('R^2 GBR: ',acc_GBR)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

In [None]:
tempResultsDf = pd.DataFrame({'Method':['Gradient Boost Regressor'], 'accuracy': [acc_GBR]},index={'9'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

In [None]:
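num_folds = 10
seed = 77
# random_state only takes effect with shuffle=True
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
results1 = cross_val_score(model,X, y, cv=kfold)
# average the per-fold R^2 directly; abs() would hide negative scores
accuracy=np.mean(results1)
print('Average R^2: ',accuracy)
print('Standard Deviation: ',results1.std())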

In [None]:
tempResultsDf = pd.DataFrame({'Method':['Gradient Boost Regressor k fold'], 'accuracy': [accuracy]},index={'10'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

## Bagging regressor  

In [None]:
model=BaggingRegressor()
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
# R^2 on train data
print('Performance on training data using BR:',model.score(X_train,y_train))
# R^2 on test data
print('Performance on testing data using BR:',model.score(X_test,y_test))
# Evaluate the model using R^2 (stored as "accuracy" in the results table)
acc_BR=metrics.r2_score(y_test, y_pred)
print('R^2 BR: ',acc_BR)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

In [None]:
tempResultsDf = pd.DataFrame({'Method':['Bagging Regressor'], 'accuracy': [acc_BR]},index={'13'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

### K Fold Cross validation

In [None]:
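num_folds = 10
seed = 77
# random_state only takes effect with shuffle=True
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
results1 = cross_val_score(model,X, y, cv=kfold)
# average the per-fold R^2 directly; abs() would hide negative scores
accuracy=np.mean(results1)
print('Average R^2: ',accuracy)
print('Standard Deviation: ',results1.std())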

In [None]:
tempResultsDf = pd.DataFrame({'Method':['Bagging Regressor k fold'], 'accuracy': [accuracy]},index={'14'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

### KNN Regressor

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
error=[]
for i in range(1,30):
    knn = KNeighborsRegressor(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    # Mean squared error on the test set; comparing with != only makes sense for class labels
    error.append(np.mean((pred_i - y_test) ** 2))

In [None]:
plt.figure(figsize=(12,6))
plt.plot(range(1,30),error,color='red', linestyle='dashed',marker='o',markerfacecolor='blue',markersize=10)
plt.title('Test MSE vs. K Value')
plt.xlabel('K Value')
plt.ylabel('Mean squared error')
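Choosing K from a curve computed on the test split lets the test data influence the model. A hedged alternative is to pick K by cross-validated error on the training data only. The sketch below illustrates the idea on synthetic arrays `X_toy` and `y_toy`; the 5-fold setup is an assumption made for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 2))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] + rng.normal(scale=0.5, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=77)

# Mean cross-validated MSE for each K (scikit-learn reports it negated).
cv_mse = {}
for k in range(1, 30):
    neg_mse = cross_val_score(
        KNeighborsRegressor(n_neighbors=k), X_toy, y_toy,
        cv=cv, scoring="neg_mean_squared_error",
    )
    cv_mse[k] = -neg_mse.mean()

best_k = min(cv_mse, key=cv_mse.get)
print("K with the lowest cross-validated MSE:", best_k)
```

The selected K can then be evaluated once on the held-out test split.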

In [None]:
model = KNeighborsRegressor(n_neighbors=3)
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
# performance on train data
print('Performance on training data using KNNR:',model.score(X_train,y_train))
# performance on test data
print('Performance on testing data using KNNR:',model.score(X_test,y_test))
#Evaluate the model using accuracy
acc_K=metrics.r2_score(y_test, y_pred)
print('Accuracy KNNR: ',acc_K)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

In [None]:
tempResultsDf = pd.DataFrame({'Method':['KNN Regressor'], 'accuracy': [acc_K]},index={'15'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

### K Fold cross validation

In [None]:
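num_folds = 10
seed = 77
# random_state only takes effect with shuffle=True
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
results1 = cross_val_score(model,X, y, cv=kfold)
# average the per-fold R^2 directly; abs() would hide negative scores
accuracy=np.mean(results1)
print('Average R^2: ',accuracy)
print('Standard Deviation: ',results1.std())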

In [None]:
tempResultsDf = pd.DataFrame({'Method':['KNN Regressor k fold'], 'accuracy': [accuracy]},index={'16'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

## Support Vector Regressor

In [None]:
from sklearn.svm import SVR

In [None]:
model = SVR(kernel='linear')
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
# performance on train data
print('Performance on training data using SVR:',model.score(X_train,y_train))
# performance on test data
print('Performance on testing data using SVR:',model.score(X_test,y_test))
#Evaluate the model using accuracy
acc_S=metrics.r2_score(y_test, y_pred)
print('Accuracy SVR: ',acc_S)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

In [None]:
tempResultsDf = pd.DataFrame({'Method':['Support Vector Regressor'], 'accuracy': [acc_S]},index={'17'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

### K Fold Cross Validation

In [None]:
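num_folds = 10
seed = 77
# random_state only takes effect with shuffle=True
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
results1 = cross_val_score(model,X, y, cv=kfold)
# average the per-fold R^2 directly; abs() would hide negative scores
accuracy=np.mean(results1)
print('Average R^2: ',accuracy)
print('Standard Deviation: ',results1.std())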

In [None]:
tempResultsDf = pd.DataFrame({'Method':['SVR k fold'], 'accuracy': [accuracy]},index={'18'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

## Ensemble KNN Regressor, SVR, LR

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import VotingRegressor

LR=LinearRegression()
KN=KNeighborsRegressor(n_neighbors=3)
SVM=SVR(kernel='linear') 

In [None]:
evc=VotingRegressor(estimators=[('LR',LR),('KN',KN),('SVM',SVM)])
evc.fit(X_train, y_train)

In [None]:
y_pred = evc.predict(X_test)
# performance on train data
print('Performance on training data using ensemble:',evc.score(X_train,y_train))
# performance on test data
print('Performance on testing data using ensemble:',evc.score(X_test,y_test))
#Evaluate the model using accuracy
acc_E=metrics.r2_score(y_test, y_pred)
print('Accuracy ensemble: ',acc_E)
print('MSE: ',metrics.mean_squared_error(y_test, y_pred))

In [None]:
tempResultsDf = pd.DataFrame({'Method':['Ensemble'], 'accuracy': [acc_E]},index={'19'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

### K fold cross validation

In [None]:
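num_folds = 10
seed = 77
# random_state only takes effect with shuffle=True
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
results1 = cross_val_score(evc,X, y, cv=kfold)
# average the per-fold R^2 directly; abs() would hide negative scores
accuracy=np.mean(results1)
print('Average R^2: ',accuracy)
print('Standard Deviation: ',results1.std())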

In [None]:
tempResultsDf = pd.DataFrame({'Method':['Ensemble k fold'], 'accuracy': [accuracy]},index={'20'})
results = pd.concat([results, tempResultsDf])
results = results[['Method', 'accuracy']]
results

* After applying all the models, we can see that the Random Forest Regressor and the Gradient Boost Regressor (both on the hold-out split and with k fold) and the Bagging Regressor give better results than the other models.
* Since the features show several Gaussians, another option is to apply k-means clustering first, then fit the models and compare the scores.

## Bootstrap Sampling

In [None]:
concrete_XY = X.join(y)

### 4.c Using Gradient Boosting Regressor

In [None]:
from sklearn.utils import resample
values = concrete_XY.values
# Number of bootstrap samples to create
n_iterations = 1000        
# size of a bootstrap sample
n_size = len(values)
row_idx = np.arange(len(values))

# run bootstrap
# list that will hold the R^2 score of each bootstrap iteration
# (named boot_scores so that it does not shadow scipy's stats module)
boot_scores = list()   
for i in range(n_iterations):
    # prepare train and test sets
    train_idx = resample(row_idx, n_samples=n_size)  # sampling row indices with replacement
    test_idx = np.setdiff1d(row_idx, train_idx)      # out-of-bag rows that were not drawn
    train = values[train_idx]
    test = values[test_idx]

    # fit model against independent variables and corresponding target values
    gbmTree = GradientBoostingRegressor(n_estimators=50)
    gbmTree.fit(train[:,:-1], train[:,-1]) 

    # evaluate model (R^2) on the out-of-bag rows
    y_test = test[:,-1]    
    score = gbmTree.score(test[:, :-1] , y_test)
    boot_scores.append(score)

In [None]:
from matplotlib import pyplot
pyplot.hist(boot_scores)
pyplot.show()
# confidence intervals
alpha = 0.95                             # for 95% confidence 
p = ((1.0-alpha)/2.0) * 100              # lower percentile: 2.5% tail on each side
lower = max(0.0, np.percentile(boot_scores, p))  
p = (alpha+((1.0-alpha)/2.0)) * 100      # upper percentile: 97.5
upper = min(1.0, np.percentile(boot_scores, p))
print('%.1f%% confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))
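In each bootstrap iteration, the rows that are never drawn into the resample form the test set. For a sample of the same size as the data, the expected share of such out-of-bag rows is $(1 - 1/n)^n \approx e^{-1} \approx 36.8\%$. The sketch below checks this empirically; `n = 1030` matches the number of instances listed in the data description, and the 200 repetitions are an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.utils import resample

n = 1030
row_idx = np.arange(n)
rng = np.random.RandomState(0)

# Fraction of rows left out of each bootstrap sample of size n.
oob_fractions = []
for _ in range(200):
    train_idx = resample(row_idx, n_samples=n, random_state=rng)
    oob_fractions.append(1.0 - np.unique(train_idx).size / n)

mean_oob = float(np.mean(oob_fractions))
expected_oob = (1.0 - 1.0 / n) ** n
print(f"Mean out-of-bag fraction: {mean_oob:.3f}, expected: {expected_oob:.3f}")
```

Each bootstrap model is therefore scored on roughly a third of the rows, none of which it saw during fitting.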

### Using Random Forest Regressor

In [None]:
values = concrete_XY.values
# Number of bootstrap samples to create
n_iterations = 1000        
# size of a bootstrap sample
n_size = len(values)
row_idx = np.arange(len(values))

# run bootstrap
# list that will hold the R^2 score of each bootstrap iteration
# (named boot_scores so that it does not shadow scipy's stats module)
boot_scores = list()   
for i in range(n_iterations):
    # prepare train and test sets
    train_idx = resample(row_idx, n_samples=n_size)  # sampling row indices with replacement
    test_idx = np.setdiff1d(row_idx, train_idx)      # out-of-bag rows that were not drawn
    train = values[train_idx]
    test = values[test_idx]

    # fit model against independent variables and corresponding target values
    rfTree = RandomForestRegressor(n_estimators=100)
    rfTree.fit(train[:,:-1], train[:,-1]) 

    # evaluate model (R^2) on the out-of-bag rows
    y_test = test[:,-1]    
    score = rfTree.score(test[:, :-1] , y_test)
    boot_scores.append(score)

In [None]:
from matplotlib import pyplot
pyplot.hist(boot_scores)
pyplot.show()
# confidence intervals
alpha = 0.95                             # for 95% confidence 
p = ((1.0-alpha)/2.0) * 100              # lower percentile: 2.5% tail on each side
lower = max(0.0, np.percentile(boot_scores, p))  
p = (alpha+((1.0-alpha)/2.0)) * 100      # upper percentile: 97.5
upper = min(1.0, np.percentile(boot_scores, p))
print('%.1f%% confidence interval %.1f%% and %.1f%%' % (alpha*100, lower*100, upper*100))

The bootstrapped random forest regression model scores between 84% and 90.8% (95% confidence interval of the R² score), which is better than the other regression algorithms tried here.
