<b> Dataset : </b> Concrete Compressive Strength

<b> Domain : </b>Material manufacturing

<b>Description: </b> The actual concrete compressive strength (MPa) for a given mixture at a
specific age (days) was determined in the laboratory. The data is in raw form (not
scaled). It has 8 quantitative input variables, 1 quantitative output
variable, and 1030 instances (observations).

<b> Objective : </b>Modeling of strength of high performance concrete using Machine Learning


<b>Steps : </b>This project involves feature exploration and selection to predict the strength of high-performance concrete. Regression models such as decision tree regressors are used to find the most important features and predict the strength. Cross-validation and grid search are used to tune the hyperparameters for the best model performance.

<b>Skills and Tools :</b> Regression, Decision trees, feature engineering

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 


# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## Import the necessary libraries :

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import itertools
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from scipy.stats import zscore
from sklearn import preprocessing
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.cluster import KMeans
from sklearn.svm import SVR
from pprint import pprint
from matplotlib import pyplot
import time
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor

<b>Comment :</b> Here I have used numpy, pandas, matplotlib, seaborn and scipy for EDA and data visualization. I have also used sklearn for data splitting, model building and model evaluation.

# ::--------------------------- Exploratory Data Analysis -------------------------------- ::

In [None]:
df  = pd.read_csv('/kaggle/input/concrete-compressive-strength/concrete.csv')
df.head(5)

<b>Comment:</b> Here I have read the Concrete dataset using the read_csv() function of pandas. df is a dataframe. I have used the head() function to display the first 5 records of the dataset.

<b> Features(attributes) Understanding from the above dataframe :- </b> 
- <b>Cement </b> measured in kg per m³ of mixture
- <b>Blast furnace slag </b> measured in kg per m³ of mixture
- <b>Fly ash </b> measured in kg per m³ of mixture
- <b>Water </b> measured in kg per m³ of mixture
- <b>Superplasticizer </b> measured in kg per m³ of mixture
- <b>Coarse Aggregate </b> measured in kg per m³ of mixture
- <b>Fine Aggregate </b> measured in kg per m³ of mixture
- <b>Age </b> measured in days (1 to 365)

<b>Concrete compressive strength:-</b> measured in MPa



### Shape of the data :- 

In [None]:
rows_count, columns_count = df.shape
print('Total Number of rows :', rows_count)
print('Total Number of columns :', columns_count)


<b>Comment:</b> Shape of the dataframe is (1030, 9).
There are 1030 rows and 9 columns in the dataset. 

###  Data type of each attribute :-

In [None]:
df.info()

<b>Comment :</b> Here we can see that all the variables are numerical.

### Checking the presence of missing values :-

In [None]:
sns.heatmap(df.isna(), yticklabels=False, cbar=False, cmap='viridis')

<b>Observation : </b> From the heatmap above we can see that no missing values are present. The per-column null counts below confirm this.

In [None]:
df.isnull().sum()

###  Descriptive Statistics :-

In [None]:
df_transpose = df.describe().T
df_transpose

<b>Observations : </b> From the table above we can see that the mean and the median are nearly the same for cement, water, superplastic, coarseagg, fineagg and strength, so these columns are approximately symmetric. For slag, ash and age the mean is well above the median, so these columns are skewed to the right.
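
The mean-versus-median comparison above can also be checked numerically. The following is a minimal sketch on a small made-up dataframe (the column names `symmetric` and `right_skewed` are hypothetical, not columns of the concrete dataset): a positive value from `skew()` together with mean > median points to a right-skewed column.

```python
import pandas as pd

# Toy data (an assumption for illustration, not the concrete dataset)
demo = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 20.0],
})

# Collect mean, median and sample skewness side by side
summary = pd.DataFrame({
    "mean": demo.mean(),
    "median": demo.median(),
    "skew": demo.skew(),
})
print(summary)
```

The same three-column summary could be built from `concrete_df` to back up the observation for slag, ash and age.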

### Copying Dataframe :-
- Before doing any manipulation with the dataframe it is better to copy the dataframe into another dataframe and keep the original dataframe as it is.

In [None]:
concrete_df = df.copy()

## Checking the presence of outliers :-

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=concrete_df, orient="h", palette="Set2", dodge=False)

<b>Observation : </b> From the above boxplot we can see that there are outliers in some columns. The slag, water, superplastic, fineagg and age columns have clear outliers. Let's look at these plots separately. I will count the outliers during the analysis of each attribute and <b>fix the outliers</b> after visualizing and analyzing every attribute.

# ::-------------------------------------- Data Visualization ------------------------------------::

###  Pair plot that includes all the columns of the data frame :-

In [None]:
sns.pairplot(concrete_df,markers="h", diag_kind = 'kde')
plt.show()

<b>Observation : </b> From the above pair plot we can infer the association among the attributes and target column as follows:
- No two features are highly correlated.
- Strength has some positive linear relation with cement and with superplastic: mixtures with more cement or superplastic tend to be stronger.
- The highest strength values occur between approximately 20 and 150 days.
- Strength decreases again after approximately 250 days.
- The distributions of slag, ash, water, superplastic and age appear to be mixtures of multiple Gaussians.
- Slag, cement and ash show a weak tendency towards a linear relation, but it is not prominent.
- The remaining pairs of attributes mostly form cloud-shaped or symmetrical scatter patterns.

## Analysis of each attributes with the help of plots :-

### Cement  :-

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(concrete_df['cement'],ax=ax1)
ax1.tick_params(labelsize=15)
ax1.set_xlabel('cement', fontsize=15)
ax1.set_title("Distribution Plot")


sns.boxplot(concrete_df['cement'],ax=ax2)
ax2.set_title("Box Plot")
ax2.set_xlabel('cement', fontsize=15)

<b>Insight : </b>From above we can see that there are no outliers in the cement column and it looks approximately normally distributed. Cement values lie in the range 100 to 500.

### Slag :-

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(concrete_df['slag'],ax=ax1)
ax1.set_xlabel('Slag', fontsize=15)
ax1.set_title("Distribution Plot")

sns.boxplot(concrete_df['slag'],ax=ax2)
ax2.set_xlabel('Slag', fontsize=15)
ax2.set_title("Box Plot")


<b>Insight: </b>From the boxplot above we can see that there are outliers in the slag column. Most of the nonzero slag values lie in the range 100 to 200, and the highest slag value is approximately 400.

### Ash :-

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(concrete_df['ash'],ax=ax1)
ax1.set_xlabel('Ash', fontsize=15)
ax1.set_title("Distribution Plot")

sns.boxplot(concrete_df['ash'],ax=ax2)
ax2.set_xlabel('Ash', fontsize=15)
ax2.set_title("Box Plot")


<b>Insight :</b> From above we can see that there are no outliers in the ash column. There is a tall spike in the range 0 to 20, which indicates that many mixtures contain little or no fly ash.

### Water :-

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(concrete_df['water'],ax=ax1)
ax1.set_xlabel('Water', fontsize=15)
ax1.set_title("Distribution Plot")

sns.boxplot(concrete_df['water'],ax=ax2)
ax2.set_xlabel('Water', fontsize=15)
ax2.set_title("Box Plot")



<b>Insight : </b>From above we can see that there are outliers in the water column and that there is right skewness because the long tail is on the right side.
####  As there are outliers in water, we will check how many there are.

In [None]:
outlier_columns = []

Q1 =  concrete_df['water'].quantile(0.25) # 1st quartile
Q3 =  concrete_df['water'].quantile(0.75) # 3rd quartile
IQR = Q3 - Q1                      # Interquartile range

LTV_water = Q1 - 1.5 * IQR   # lower bound 
UTV_water = Q3 + 1.5 * IQR   # upper bound

print('Interquartile range = ', IQR)
print('water <',LTV_water ,'and >',UTV_water, ' are outliers')
print('Number of outliers in water column below the lower whisker =', concrete_df[concrete_df['water'] < LTV_water]['water'].count())
print('Number of outliers in water column above the upper whisker =', concrete_df[concrete_df['water'] > UTV_water]['water'].count())

# storing the column name and the upper bound for columns where outliers are present
outlier_columns.append('water')
upperLowerBound_Disct = {'water':UTV_water}

## Superplastic :-

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(concrete_df['superplastic'],ax=ax1)
ax1.set_xlabel('superplastic', fontsize=15)
ax1.set_title("Distribution Plot")

sns.boxplot(concrete_df['superplastic'],ax=ax2)
ax2.set_xlabel('Superplastic', fontsize=15)
ax2.set_title("Box Plot")


<b> Insight : </b>From above we can see that there are outliers in the superplastic column and that there is right skewness because the long tail is on the right side (mean > median).

<b>As there are outliers in superplastic, we will check how many there are.</b>

In [None]:
Q1 =  concrete_df['superplastic'].quantile(0.25) # 1st quartile
Q3 =  concrete_df['superplastic'].quantile(0.75) # 3rd quartile
IQR = Q3 - Q1                      # Interquartile range

LTV_superplastic = Q1 - 1.5 * IQR   # lower bound 
UTV_superplastic = Q3 + 1.5 * IQR   # upper bound

print('Interquartile range = ', IQR)
print('superplastic <',LTV_superplastic ,'and >',UTV_superplastic, ' are outliers')
print('Number of outliers in superplastic column below the lower whisker =', concrete_df[concrete_df['superplastic'] < LTV_superplastic]['superplastic'].count())
print('Number of outliers in superplastic column above the upper whisker =', concrete_df[concrete_df['superplastic'] > UTV_superplastic]['superplastic'].count())

# storing the column name and the upper bound for columns where outliers are present
outlier_columns.append('superplastic')
upperLowerBound_Disct['superplastic'] = UTV_superplastic

## Coarseagg :-

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(concrete_df['coarseagg'],ax=ax1)
ax1.set_xlabel('Coarseagg', fontsize=15)
ax1.set_title("Distribution Plot")

sns.boxplot(concrete_df['coarseagg'],ax=ax2)
ax2.set_xlabel('Coarseagg', fontsize=15)
ax2.set_title("Box Plot")

<b>Insight : </b>From above we can see that there are no outliers in coarseagg and that there is right skewness because the long tail is on the right side (mean > median).

## Fineagg :-

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(concrete_df['fineagg'],ax=ax1)
ax1.set_xlabel('Fineagg', fontsize=15)
ax1.set_title("Distribution Plot")

sns.boxplot(concrete_df['fineagg'],ax=ax2)
ax2.set_xlabel('Fineagg', fontsize=15)
ax2.set_title("Box Plot")


<b>Insight : </b>From above we can see that there are outliers in the fineagg column, that the distribution plot has two peaks, and that there is right skewness because the long tail is on the right side (mean > median).

## Age :-

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(concrete_df['age'],ax=ax1)
ax1.set_xlabel('Age', fontsize=15)
ax1.set_title("Distribution Plot")

sns.boxplot(concrete_df['age'],ax=ax2)
ax2.set_xlabel('Age', fontsize=15)
ax2.set_title("Box Plot")

<b>Insight : </b>From above we can see that there are outliers in the age column, that the distribution plot has many peaks, and that there is right skewness because the long tail is on the right side (mean > median), which agrees with the descriptive statistics.

In [None]:
Q1 =  concrete_df['age'].quantile(0.25) # 1st quartile
Q3 =  concrete_df['age'].quantile(0.75) # 3rd quartile
IQR = Q3 - Q1                      # Interquartile range

LTV_age = Q1 - 1.5 * IQR   # lower bound 
UTV_age = Q3 + 1.5 * IQR   # upper bound

print('Interquartile range = ', IQR)
print('age <',LTV_age ,'and >',UTV_age, ' are outliers')
print('Number of outliers in age column below the lower whisker =', concrete_df[concrete_df['age'] < LTV_age]['age'].count())
print('Number of outliers in age column above the upper whisker =', concrete_df[concrete_df['age'] > UTV_age]['age'].count())

# storing the column name and the upper bound for columns where outliers are present
outlier_columns.append('age')
upperLowerBound_Disct['age'] = UTV_age

## Strength :-

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(concrete_df['strength'],ax=ax1)
ax1.tick_params(labelsize=15)
ax1.set_xlabel('strength', fontsize=15)
ax1.set_title("Distribution Plot")


sns.boxplot(concrete_df['strength'],ax=ax2)
ax2.set_title("Box Plot")
ax2.set_xlabel('strength', fontsize=15)

<b>Insight : </b>From above we can see that there are no outliers in the strength column, that the distribution plot has two peaks, and that there is right skewness because the long tail is on the right side (mean > median).

## ----------------------------------------------- Fixing Outliers -----------------------------------------------------

- As we have seen above, outliers are present in the given dataset.
- There are multiple ways to deal with outliers, but I mostly prefer either to drop them or to replace them with the median or mean.
- Here I am going to replace the outliers with the median, because dropping rows may lose important information.
- The number of outliers in each affected column was shown in the per-attribute analysis above.

In [None]:
print('These are the columns which have outliers : \n\n',outlier_columns)
print('\n\n',upperLowerBound_Disct)

In [None]:
concrete_df_new = concrete_df.copy()

In [None]:
for col_name in concrete_df_new.columns[:-1]:
    q1 = concrete_df_new[col_name].quantile(0.25)
    q3 = concrete_df_new[col_name].quantile(0.75)
    iqr = q3 - q1
    low = q1-1.5*iqr
    high = q3+1.5*iqr
    
    concrete_df_new.loc[(concrete_df_new[col_name] < low) | (concrete_df_new[col_name] > high), col_name] = concrete_df_new[col_name].median()

In [None]:
plt.figure(figsize=(15,8))
sns.boxplot(data=concrete_df_new, orient="h", palette="Set2", dodge=False)

<b>Important : </b> In the boxplot we can now see that most of the outliers have been replaced with the median of their column. Because the replacement changes the quartiles, and several attributes look like mixtures of Gaussians, a few points now fall outside the new whiskers and show up as new outliers, which we can ignore.
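
To see why new outliers can appear after median replacement, here is a minimal sketch on a made-up series (the values and the helper name `iqr_outlier_mask` are assumptions for illustration). Replacing the two extreme values pulls the third quartile down, so the upper whisker moves down and a point that was previously inside it is now flagged.

```python
import pandas as pd

def iqr_outlier_mask(s):
    """Boolean mask of values outside the 1.5 * IQR whiskers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Toy values with a heavy right tail (not taken from the dataset)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 14, 30, 40], dtype=float)

before = iqr_outlier_mask(values)
# Replace flagged values with the median, as the notebook does
replaced = values.mask(before, values.median())
after = iqr_outlier_mask(replaced)

print("outliers before:", values[before].tolist())    # [30.0, 40.0]
print("outliers after :", replaced[after].tolist())   # [14.0]
```

Here 30 and 40 are flagged first (upper whisker at 20). After they are replaced by the median 6.5, the upper whisker drops to 12.5 and 14 becomes a new outlier.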

#### After fixing outliers shape of dataframe:

In [None]:
concrete_df_new.shape

## Creating and view the correlation matrix :-

In [None]:
concrete_df_new.corr()

## ---------------------------------------- Correlation using Heatmap --------------------------------------------

In [None]:
mask = np.zeros_like(concrete_df_new.corr(), dtype=bool)  # np.bool was removed in recent NumPy versions
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(15,7))
plt.title('Correlation of Attributes', y=1.05, size=19)
sns.heatmap(concrete_df_new.corr(),vmin=-1, cmap='plasma',annot=True,  mask=mask, fmt='.2f')

<b>Correlation Insight :</b> From the correlation matrix above we can see that several features are correlated. Ash and cement have a correlation of -0.40, superplastic and water have a correlation of -0.66, and fineagg and water have a correlation of -0.45. Age and strength have a correlation of 0.50.
The negative correlation between superplastic and water is also evident in the pair plot, but let's move forward with the data as it is.

'cement' has the highest correlation with 'strength' (a positive correlation), followed by 'superplastic' (also positive); 'ash' has the lowest correlation.
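
Reading the ranking off a heatmap by eye is error-prone, so here is a minimal sketch of sorting features by their absolute correlation with the target. It uses synthetic data (the generated `cement`, `ash` and `strength` columns are assumptions for illustration, not the real measurements).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins: strength depends on cement, ash is unrelated noise
cement = rng.normal(300, 50, n)
ash = rng.normal(50, 20, n)
strength = 0.1 * cement + rng.normal(0, 2, n)
demo = pd.DataFrame({"cement": cement, "ash": ash, "strength": strength})

# Correlation of every feature with the target, ordered by absolute value
target_corr = demo.corr()["strength"].drop("strength")
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Applying the same three lines to `concrete_df_new` would list the features from most to least correlated with `strength`.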

## KMeans Clustering :-

In [None]:
cluster_range = range( 2, 6 )   # expect 3 to 4 clusters from the pair plot, so we try k = 2 to 5
cluster_errors = []
for num_clusters in cluster_range:
  clusters = KMeans( num_clusters, n_init = 5)
  clusters.fit(concrete_df_new)
  labels = clusters.labels_
  centroids = clusters.cluster_centers_
  cluster_errors.append( clusters.inertia_ )
clusters_df = pd.DataFrame( { "num_clusters":cluster_range, "cluster_errors": cluster_errors } )
clusters_df[0:15]

In [None]:
# Elbow plot

plt.figure(figsize=(12,6))
plt.plot( clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o" )

<b>Insight: </b>The elbow plot confirms our visual analysis that there are likely 3 good clusters.

In [None]:
cluster = KMeans( n_clusters = 3, random_state = 2354 )
cluster.fit(concrete_df_new)

prediction=cluster.predict(concrete_df_new)
concrete_df_new["GROUP"] = prediction     # Creating a new column "GROUP" which will hold the cluster id of each record

concrete_df_new_copy = concrete_df_new.copy(deep = True)  # Creating a mirror copy for later re-use instead of building repeatedly

In [None]:
centroids = cluster.cluster_centers_
centroids

<b>Comment :</b> From above we can see that all three groups are at the same level. I don't find any meaningful difference between them, so we will not proceed with the clustering.

## Standardizing the Variables :-

In [None]:
# The variables are on different scales, so we standardize them with z-scores.
# The cluster label is dropped first because we decided not to use the clustering.
XScaled = concrete_df_new.drop(columns='GROUP', errors='ignore').apply(zscore)
XScaled.head()

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(data=XScaled, orient="h", palette="Set2", dodge=False)

<b>Insight: </b> From the above graph we see the result of standardization (z-score): every variable is rescaled to have mean μ=0 and standard deviation σ=1.
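
As a quick numeric check of this property, the following minimal sketch applies `scipy.stats.zscore` to a tiny made-up dataframe (the values are assumptions for illustration). Note that `zscore` uses the population standard deviation (`ddof=0`) by default, so the check uses `std(ddof=0)`.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Toy columns on very different scales (not the real data)
demo = pd.DataFrame({
    "cement": [100.0, 200.0, 300.0, 400.0],
    "age": [1.0, 7.0, 28.0, 90.0],
})

scaled = demo.apply(zscore)

# After scaling, each column has mean 0 and population std 1
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```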

## Separating Independent and Dependent :-

In [None]:
y_set = XScaled[['strength']]
X_set = XScaled.drop(labels= "strength" , axis = 1)

## We will split the dataset into three parts: training and validation data for building the models, and a held-out test set for the final score of our models.
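
To make the resulting proportions explicit, here is a minimal sketch of the two-stage split on 1030 dummy rows, assuming the 80:20 and 70:30 ratios and the seed 7 used in the cells below. The variable names are placeholders, not the notebook's.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1030 dummy rows standing in for the concrete dataset
X = np.arange(1030).reshape(-1, 1)
y = np.arange(1030)

# Stage 1: hold out 20% as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=7)
# Stage 2: split the remaining 80% into 70% train and 30% validation
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.30, random_state=7)

for name, part in [("train", X_train), ("validation", X_val), ("test", X_test)]:
    print(f"{name:<10} {len(part):>4} rows ({len(part) / len(X):.0%})")
```

Overall this gives roughly 56% of the rows for training, 24% for validation and 20% for testing.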

In [None]:
# data splitting using an 80:20 train/test ratio and random seed 7
X_model_train, X_test, y_model_train, y_test = train_test_split(X_set, y_set, test_size=0.20, random_state=7)

In [None]:
print('---------------------- Data----------------------------- \n')
print('x train data {}'.format(X_model_train.shape))
print('y train data {}'.format(y_model_train.shape))
print('x test data  {}'.format(X_test.shape))
print('y test data  {}'.format(y_test.shape))


In [None]:
# data splitting using a 70:30 train/validation ratio and random seed 7
X_train, X_validate, y_train, y_validate = train_test_split(X_model_train, y_model_train, test_size=0.30, random_state=7)

In [None]:
print('---------------------- Data----------------------------- \n')
print('x train data {}'.format(X_train.shape))
print('y train data {}'.format(y_train.shape))
print('x validation data  {}'.format(X_validate.shape))
print('y validation data  {}'.format(y_validate.shape))


# :::::::::::::::::::::::::::::::::::::::: Model Building :::::::::::::::::::::::::::::::::::::::::

In [None]:
# Defining the KFold splitter for cross-validation
n_split = 10
random_state = 7
# shuffle=True is required for random_state to take effect in KFold
kfold = KFold(n_splits=n_split, shuffle=True, random_state=random_state)
linear_model = []
linear_model_score = []
linear_model_RMSE = []
linear_model_R_2 = []
Model = []
RMSE = []
R_sq = []

## Random Forest Regressor ::-

In [None]:
rfTree = RandomForestRegressor(n_estimators=100)
rfTree.fit(X_train, y_train.values.ravel())
print('Random Forest Regressor')
rfTree_train_score = rfTree.score(X_train, y_train)
print("Random Forest Regressor Model Training Set Score:",rfTree_train_score)


rfTree_score = rfTree.score(X_validate, y_validate)
print("Random Forest Regressor Model Validation Set Score:", rfTree_score)

rfTree_rmse = np.sqrt((-1) * cross_val_score(rfTree, X_train, y_train.values.ravel(), cv=kfold, scoring='neg_mean_squared_error').mean())
print("Random Forest Regressor Model RMSE :", rfTree_rmse)


rfTree_r2 = cross_val_score(rfTree, X_train, y_train.values.ravel(), cv=kfold, scoring='r2').mean()
print("Random Forest Regressor Model R-Square Value :", rfTree_r2)

rfTree_model_df = pd.DataFrame({'Training Score': [rfTree_train_score],
                           'Validation Score': [rfTree_score],
                           'RMSE': [rfTree_rmse],
                           'R Squared': [rfTree_r2]})
rfTree_model_df

In [None]:
print("Random Forest Regressor Model Test Data Set Score:")
rfTree_test_score = rfTree.score(X_test, y_test)
print(rfTree_test_score)

<b>Comment: </b> This is our baseline Random Forest Regressor model. After running it I found:
 - Training Data Score : 0.974034
 - Validation Data Score : 0.893274
 - Test Data Score : 0.8571364194016591

We need to tune the model further and check whether its score on the test data can be improved.
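
The scores above come from `.score()`, which for scikit-learn regressors returns the coefficient of determination R². The following minimal sketch, on made-up targets and predictions, shows how R² and RMSE are computed by hand and that the manual R² matches `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy targets and predictions (not model output from this notebook)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print("R^2 (manual) :", r2_manual)
print("R^2 (sklearn):", r2_score(y_true, y_pred))
print("RMSE         :", rmse)
```

Because the target was standardized, the RMSE values in this notebook are in units of standard deviations of strength rather than MPa.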

## Hyperparameter tuning of Random Forest Regressor ::-
#### RandomizedSearchCV :
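
Randomized search evaluates only a sample of the search space. As a minimal sketch, assuming a grid with the same shape as the one defined further below (the exact `max_features` options here are placeholders), `ParameterGrid` counts all combinations and `ParameterSampler` draws the handful that would actually be tried with `n_iter = 5`.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same shape as the notebook's random grid; values are illustrative
random_grid = {
    "n_estimators": [int(x) for x in np.linspace(start=10, stop=100, num=3)],
    "max_features": [1.0, "log2"],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Total number of combinations an exhaustive grid search would try
n_candidates = len(ParameterGrid(random_grid))
# The 5 combinations a randomized search with n_iter=5 would sample
sampled = list(ParameterSampler(random_grid, n_iter=5, random_state=7))

print("all combinations    :", n_candidates)
print("sampled combinations:", len(sampled))
```

With 10-fold cross-validation, 5 sampled candidates mean 50 model fits instead of 3240 for the full grid, at the cost of possibly missing the best combination.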

In [None]:
rf = RandomForestRegressor(random_state = 7)
print('Parameters currently in use:\n')
pprint(rf.get_params())

In [None]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 10 , stop = 100, num = 3)]   # 3 evenly spaced values: 10, 55, 100
# Number of features to consider at every split
max_features = [1.0, 'log2']   # 1.0 uses all features, which is what 'auto' meant for regressors
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 10, num = 2)]  # 2 evenly spaced values: 5 and 10
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

pprint(random_grid)

In [None]:
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                              n_iter = 5, scoring='neg_mean_absolute_error', 
                              cv = kfold, verbose=2, random_state=7, n_jobs=-1,
                              return_train_score=True)
# Fit the random search model
rf_random.fit(X_train, y_train.values.ravel());

In [None]:
# best ensemble model (with optimal combination of hyperparameters)
rfTree = rf_random.best_estimator_
rfTree.fit(X_train, y_train.values.ravel())
print('Random Forest Regressor')
rfTree_train_score = rfTree.score(X_train, y_train)
print("Random Forest Regressor Model Training Set Score:",rfTree_train_score)

rfTree_score = rfTree.score(X_validate, y_validate)
print("Random Forest Regressor Model Validation Set Score:",rfTree_score)

rfTree_rmse = np.sqrt((-1) * cross_val_score(rfTree, X_train, y_train.values.ravel(), cv=kfold, scoring='neg_mean_squared_error').mean())
print("Random Forest Regressor Model RMSE :", rfTree_rmse)

rfTree_r2 = cross_val_score(rfTree, X_train, y_train.values.ravel(), cv=kfold, scoring='r2').mean()
print("Random Forest Regressor Model R-Square Value :", rfTree_r2)

rfTree_random_model_df = pd.DataFrame({'Training Score': [rfTree_train_score],
                           'Validation Score': [rfTree_score],
                           'RMSE': [rfTree_rmse],
                           'R Squared': [rfTree_r2]})
rfTree_random_model_df

In [None]:
rfTree_test_score = rfTree.score(X_test, y_test)
print("Random Forest Regressor Model Test Data Set Score:", rfTree_test_score)

#### GridSearchCV :

In [None]:
param_grid = {
    'bootstrap': [True],
    'max_depth': [10],
    'max_features': ['log2'],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [5,10],
    'n_estimators': np.arange(50, 71)
}
rfg = RandomForestRegressor(random_state = 7)

grid_search = GridSearchCV(estimator = rfg, param_grid = param_grid, 
                          cv = kfold, n_jobs = 1, verbose = 0, return_train_score=True)

grid_search.fit(X_train, y_train.values.ravel());
grid_search.best_params_

In [None]:
# best ensemble model (with optimal combination of hyperparameters)
rfTree = grid_search.best_estimator_
rfTree.fit(X_train, y_train.values.ravel())
print('Random Forest Regressor')
rfTree_train_score = rfTree.score(X_train, y_train)
print("Random Forest Regressor Model Training Set Score:", rfTree_train_score)

rfTree_score = rfTree.score(X_validate, y_validate)
print("Random Forest Regressor Model Validation Set Score:",rfTree_score)

rfTree_rmse = np.sqrt((-1) * cross_val_score(rfTree, X_train, y_train.values.ravel(), cv=kfold, scoring='neg_mean_squared_error').mean())
print("Random Forest Regressor Model RMSE :", rfTree_rmse)

rfTree_r2 = cross_val_score(rfTree, X_train, y_train.values.ravel(), cv=kfold, scoring='r2').mean()
print("Random Forest Regressor Model R-Square Value :", rfTree_r2)

rfTree_random_model_df = pd.DataFrame({'Training Score': [rfTree_train_score],
                           'Validation Score': [rfTree_score],
                           'RMSE': [rfTree_rmse],
                           'R Squared': [rfTree_r2]})
rfTree_random_model_df

# :::::::::::::::::::::Comparing performances of all the models::::::::::::::::::::::: 

Definition of the function for comparing models

In [None]:
def input_scores(name, model, x, y):
    Model.append(name)
    RMSE.append(np.sqrt((-1) * cross_val_score(model, x, y, cv=kfold, 
                                               scoring='neg_mean_squared_error').mean()))
    R_sq.append(cross_val_score(model, x, y, cv=kfold, scoring='r2').mean())
# Comment: the function above appends the cross-validated RMSE and R-squared of a model to the comparison lists.

<b>Comment: </b> I found that the Random Forest Regressor has the lowest Root Mean Square Error (RMSE) and the highest R-squared value, so I consider it the best model for this problem.
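
A comparison of this kind could look like the minimal sketch below. It is not the notebook's actual run: it uses synthetic data from `make_regression`, and the choice of candidate models and the 5-fold splitter are assumptions for illustration. Each model gets a cross-validated RMSE and R², and the table is sorted by RMSE.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with 8 features (stand-in for the real data)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=7)
cv = KFold(n_splits=5, shuffle=True, random_state=7)

candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=7),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=7),
}

rows = []
for name, model in candidates.items():
    # Mean MSE across folds, then square root, as in input_scores above
    mse = -cross_val_score(model, X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error").mean()
    r2 = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2").mean()
    rows.append({"Model": name, "RMSE": np.sqrt(mse), "R Squared": r2})

comparison = pd.DataFrame(rows).sort_values("RMSE").reset_index(drop=True)
print(comparison)
```

On this synthetic linear problem the ranking will differ from the concrete data; the point is only the shape of the comparison table.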

In [None]:
rfTree_random_model_df

## Conclusion ::-

- From the above we conclude that RandomForestRegressor gives a good score (R²).
- I have also tested it on the validation data, where RandomForestRegressor again gives good results.
- Hence we can proceed with RandomForestRegressor for modeling the strength of high-performance concrete.
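
The steps at the top also mention finding the most important features. A minimal sketch of how that could be read from a fitted forest is shown below, using synthetic data in which `cement` is constructed to matter most (the data and coefficients are assumptions for illustration, not results from this notebook).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)

# Synthetic features; the target depends strongly on cement, weakly on age, not on ash
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["cement", "age", "ash"])
y_demo = 3.0 * X_demo["cement"] + 1.0 * X_demo["age"] + rng.normal(scale=0.1, size=300)

rf_demo = RandomForestRegressor(n_estimators=100, random_state=7).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; larger means more influential
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the tuned model from this notebook, the equivalent call would be `pd.Series(rfTree.feature_importances_, index=X_train.columns)`.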