# Featurization and Model Tuning
## Data Description :
The concrete compressive strength (MPa) for a given mixture at a
specific age (days) was determined in the laboratory.
### Domain  : Cement manufacturing/Civil Engineering
### Context : 
The concrete compressive strength is a highly nonlinear function of age and ingredients.
These ingredients include cement, blast furnace slag, fly ash, water,
superplasticizer, coarse aggregate, and fine aggregate.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [None]:
values={'?','NA','nA','Na','na'}
df=pd.read_csv('../input/yeh-concret-data/Concrete_Data_Yeh.csv',na_values=values)

In [None]:
df.head()

In [None]:
print ('Our Dataset has:{} rows and columns:{}'.format(df.shape[0],df.shape[1]))

## Exploratory data quality report

### Univariate analysis

#### Data types and Names of the independent attributes

In [None]:
df.info()

Our data contains 9 columns, all of which are non-null and numeric.

#### Central Tendencies (Mean, Min- Max(Range)), Standard Deviation, Quantiles

In [None]:
df.describe().T
    

###### Ranges of the variables (ingredients are measured in kg per m3 of mixture; age is in days and strength in MPa):
- cement: 102 to 540
- slag: 0 to 359
- ash: 0 to 200
- water: 121 to 247
- superplastic: 0 to 32
- coarseagg: 801 to 1145
- age: 1 to 365 (days)
- strength: 2.3 to 87 (MPa)



Let's check for duplicates in our data

In [None]:
print("Number of duplicate rows in our data is:{}".format(df.duplicated().sum()))

In [None]:
df=df.drop_duplicates(subset=None,keep='first',inplace=False)

I am using pandas' drop_duplicates to remove duplicate rows, retaining only the first occurrence of each for analysis.
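
As a minimal sketch of what `duplicated()` and `drop_duplicates(keep='first')` do, the toy frame below contains one repeated row. The values are hypothetical and are not taken from the concrete dataset.

```python
import pandas as pd

# Toy frame with one repeated row (hypothetical values, not from the dataset).
toy = pd.DataFrame({
    "cement": [540.0, 332.5, 540.0, 198.6],
    "water": [162.0, 228.0, 162.0, 192.0],
})

# duplicated() flags every repeat after the first occurrence.
n_duplicates = int(toy.duplicated().sum())

# keep='first' retains the first occurrence and drops later repeats.
deduped = toy.drop_duplicates(keep="first")

print(n_duplicates)
print(deduped.index.tolist())
```

The original index labels are preserved, so row 2 (the repeat of row 0) is simply missing from the result.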

In [None]:
print ('After Removing Duplicates Our Dataset has:{} rows and columns:{}'.format(df.shape[0],df.shape[1]))

##### Histogram

In [None]:
columns=list(df)
df[columns].hist(stacked=True,density=True, bins=100,color='blue', figsize=(16,30), layout=(10,3));

1. From the histograms above we can see that cement, coarseagg, fineagg, strength and water are approximately normally distributed.
2. Age, ash and superplastic are slightly skewed.

#### Data Skewness & Distribution of curves

In [None]:
df.skew()

In [None]:
# One KDE curve per column; sns.distplot is deprecated, so sns.kdeplot is used instead.
fig,ax=plt.subplots(1,9,figsize=(15,8))
for axis,column in zip(ax,df.columns):
    sns.kdeplot(df[column],ax=axis)
plt.show()
print(df.skew())

In terms of distribution, slag, ash, water, superplastic, coarseagg, fineagg and age are all multimodal (a mixture of Gaussians), which means they have multiple peaks and valleys.
Strength seems to be normally distributed; cement has several slightly sharp peaks.

#### Tails
1. Cement seems to be normally distributed
2. slag is slightly skewed towards the right
3. ash is normally distributed
4. water is slightly skewed towards the right
5. superplastic is skewed towards the right
6. coarseagg is normally distributed
7. fineagg is normally distributed
8. age is slightly positively skewed
9. strength is normally distributed

In [None]:
df.isnull().sum()

There are no null values in our dataset.

#### Check the presence of outliers through box plot

Outliers are extreme values present in the data. As the boxplot below shows, some columns in our data contain outliers. Outliers can affect many ML algorithms, so we should find a way to handle them.

In [None]:
plt.figure(figsize=(15,8))
sns.boxplot(data=df)
plt.xticks(rotation=45)

## Multivariate analysis

#### Target column call out:
##### In this dataset our variable of interest is the strength column (csMPa). It is a continuous variable that depends on the other parameters of the concrete mixture.

In [None]:
# hue_order only has an effect when hue is set, so it is dropped here.
sns.pairplot(df,diag_kind='kde');

In [None]:
cor=df.corr()
sns.heatmap(cor,annot=True);

##### Pair plot analysis:
1. Along the diagonal, the predictor variables each show 2-3 Gaussian peaks. We should do a cluster analysis to understand the grouping and hidden patterns in the data.
2. Our predictors have some relationship with the target.
3. From the correlation matrix we can infer that our variables have low correlation with each other. This is good, as linear models in particular are sensitive to strongly correlated predictors (multicollinearity).


In [None]:
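Cor_Matrix=df.corr().abs()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair is checked once.
# np.bool was removed in recent NumPy releases; the builtin bool works everywhere.
upper_tri = Cor_Matrix.where(np.triu(np.ones(Cor_Matrix.shape),k=1).astype(bool))
#print(upper_tri)
to_drop =[column for column in upper_tri.columns if any(upper_tri[column] > 0.60)]

print("Columns with an absolute correlation above 0.6:",to_drop[0:6])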

##### Interquartile Range Calculation

In [None]:
Q1=df.quantile(0.25)
Q3=df.quantile(0.75)

IQR=Q3-Q1
IQR

In [None]:
S= df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]

### Outlier Removal :

We have used the interquartile range (IQR = Q3 - Q1) to eliminate outliers: any row with a value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR in any column is dropped.
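
The rule can be sketched on a small toy frame. The values below are hypothetical and chosen so that exactly one row (age 365) falls outside the fences; the column names only mimic the dataset.

```python
import pandas as pd

# Toy data (hypothetical values): 'age' contains one extreme value, 365.
toy = pd.DataFrame({
    "age": [3, 7, 14, 28, 28, 56, 90, 365],
    "water": [160, 162, 170, 175, 180, 185, 190, 192],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# A row is an outlier if ANY column falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = ((toy < lower) | (toy > upper)).any(axis=1)
kept = toy[~is_outlier]

print(kept["age"].tolist())
```

Because the filter uses `any(axis=1)`, a single extreme value in one column removes the whole row, which is why the record count can drop noticeably.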

In [None]:
S.info()

Now after outlier removal and duplicates removal we are left with 911 records in our dataset.

In [None]:
plt.figure(figsize=(12,12))
sns.boxplot(data=S)
plt.show()

We still see some points flagged as outliers in the filtered data, but these are not real noise. They appear because the quartiles and the spread of the data changed once the original outliers were removed, so they need not be treated as outliers now.

## Feature Engineering techniques

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.decomposition import PCA
from scipy.stats import zscore

##### We will use PCA for dimensionality reduction 

To see how much information the features carry, we shall run a principal component analysis and look at the variance captured by each component, and then decide how many components are really required to predict the strength of the concrete mixture.



In [None]:
data_for_pca=S.copy()

In [None]:
data_for_pca=data_for_pca.apply(zscore)

Scaling is an important step before PCA because differing units of measurement affect the new axes that PCA computes. Standardization helps create the right principal components by giving every feature comparable weight.

Although most of our independent features are measured in kg (age is in days), they all have different magnitudes, hence I am scaling the dataset before using PCA.

In [None]:
Scaler=StandardScaler()
X_PCA=data_for_pca.drop(['csMPa'],axis=1)
Y_PCA=data_for_pca['csMPa']
PC=PCA(n_components=8,random_state=12)
comp_features=PC.fit(X_PCA)

Implementing PCA with all independent features to see the overall variance captured by each component.

Displaying the covariance matrix of the standardized features.

What is covariance?

Covariance is just an unstandardized version of correlation. To compute a correlation, we divide the covariance by the product of the standard deviations of the two variables to remove the units of measurement. So a covariance is just a correlation measured in the units of the original variables.
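
A small numerical check of that relationship, using two synthetic variables on very different scales. The data and the variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical features on very different scales.
x = rng.normal(loc=300.0, scale=100.0, size=500)
y = 0.05 * x + rng.normal(loc=0.0, scale=3.0, size=500)

cov_xy = np.cov(x, y)[0, 1]  # sample covariance (ddof=1)

# Correlation = covariance / (std of x * std of y), using the same ddof.
corr_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))
corr_numpy = np.corrcoef(x, y)[0, 1]

print(cov_xy, corr_manual, corr_numpy)
```

One detail worth keeping in mind when reading the heatmap: scipy's `zscore` divides by the population standard deviation (ddof=0) while `np.cov` defaults to the sample estimate (ddof=1), so the covariance matrix of z-scored data equals the correlation matrix scaled by n/(n-1).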

In [None]:
covmatrix=np.cov(X_PCA,rowvar=False)
plt.figure(figsize=(12,7))
sns.heatmap(covmatrix,annot=True)

In [None]:
print("####################The Eigen Values#########################")
print(PC.explained_variance_)
print("####################The Eigen Vectors#########################")
print(PC.components_)

In [None]:
plt.bar(list(range(1,9)),PC.explained_variance_ratio_, align='center')
plt.ylabel('Variation explained')
plt.xlabel('Principal component')
plt.show()

In [None]:
plt.step(list(range(1,9)),np.cumsum(PC.explained_variance_ratio_), where='mid')
plt.ylabel('Cum of variation explained')
plt.xlabel('Principal component')
plt.show()

In [None]:
print("Variance ratio covered by each component:{}".format(PC.explained_variance_ratio_ * 100))
P_Components=PC.explained_variance_ratio_
print("The first 6 components together explain {}% of the variance in the data".format(np.sum(P_Components[0:6])*100))

##### From the graphs above it is evident that 6 components capture just over 97% of the variance in the data, so rather than using all the features we can train and predict with just 6 principal components.

Now we shall fit PCA again with 6 components. This will allow us to train our models on both the original data and the PCA components separately.
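
The choice of the number of components can also be made programmatically from the cumulative explained variance. The sketch below uses hypothetical ratios (not the values printed by this notebook) and a hypothetical 95% threshold.

```python
import numpy as np

# Hypothetical explained-variance ratios for 8 components (they sum to 1).
ratios = np.array([0.28, 0.22, 0.18, 0.13, 0.10, 0.06, 0.02, 0.01])
threshold = 0.95

cumulative = np.cumsum(ratios)

# Index of the first cumulative value >= threshold, converted to a 1-based count.
n_components = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative.round(2), n_components)
```

scikit-learn's `PCA` can make the same selection itself when `n_components` is given as a float between 0 and 1 together with `svd_solver='full'`.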

In [None]:
PCA6=PCA(n_components=6)
PCA6.fit(X_PCA)
X_PCA_6=PCA6.transform(X_PCA)
Y_PCA_6=Y_PCA

In [None]:
PCA6.explained_variance_

In [None]:
PCA6.components_

#### As noted earlier in the multivariate analysis, it is time to explore the mixture of Gaussians in our data.

#### Cluster Analysis using K-Means (centroid based) and Agglomerative (hierarchical) clustering to explore the Gaussian mix

In [None]:
from sklearn.cluster import KMeans,AgglomerativeClustering
k_values=range(1,10)
SSE=[]
for i in k_values:
    model=KMeans(n_clusters=i)
    model.fit(S)
    SSE.append(model.inertia_)

In [None]:
plt.plot(k_values, SSE, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared errors (inertia)')
plt.title('Selecting k with the Elbow Method')

#### Identify the number of clusters.

The elbow method is used to find the number of clusters required to group the data: K-Means (commonly fitted with Lloyd's algorithm) is run for a range of k values, and the within-cluster sum of squared errors is plotted against k.

As you can see from the graph above, the bend is clearly visible at k=3, hence the ideal number of groups for this dataset is 3.

At k=1 all points fall into a single cluster, so the sum of squared errors is at its highest. As the number of clusters increases the error drops sharply, and at a certain point the "elbow" bends, in this case at 3. We can argue that this is the ideal number of clusters for our data, since it already captures most of the structure. As k keeps increasing beyond 3, the clusters become more homogeneous but the error decreases only marginally.
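
To make the elbow idea concrete, here is a minimal sketch on synthetic data that is built to contain three well-separated groups. The blob centres, sample size and seeds are assumptions for illustration; only `KMeans` and its `inertia_` attribute mirror what the notebook uses.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

sse = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(model.inertia_)  # within-cluster sum of squared distances

# Reduction in SSE gained by each extra cluster; the elbow is where it collapses.
drops = -np.diff(sse)
print(np.round(sse, 1))
print(np.round(drops, 1))
```

On data like this the reduction from k=2 to k=3 is large, while going from 3 to 4 gains very little, which is the numerical counterpart of the bend in the plot.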

In [None]:
S=S.apply(zscore)
K_Final=KMeans(n_clusters=3)
K_Final.fit(S)
PRED=K_Final.predict(S)
clusters=S.copy()
clusters['K-Means-Grouping']=PRED
clusters.head(10)

Agglomerative clustering is a connectivity-based technique; a dendrogram and the cophenetic coefficient are used to identify the number of clusters.

A dendrogram gives a clear view of how the data points merge. The cophenetic correlation compares the original distance between two data points with their dendrogrammatic distance.

Here, I have not used dendrograms; I am just comparing the clusters grouped by K-Means and Agglomerative clustering.
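
Although the notebook skips this step, the cophenetic correlation can be computed with SciPy in a few lines. The sketch below uses synthetic two-group data and average linkage, both of which are assumptions rather than choices made in this analysis.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two hypothetical, well-separated groups of points in 2-D.
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)),
               rng.normal(8.0, 1.0, size=(30, 2))])

Z = linkage(X, method="average")  # hierarchical merge tree
original = pdist(X)               # pairwise Euclidean distances

# cophenet returns the correlation and the dendrogram-implied distances.
coph_corr, coph_dists = cophenet(Z, original)

print(round(float(coph_corr), 3))
```

A value close to 1 means the dendrogram preserves the original pairwise distances well; comparing this value across linkage methods is one common way to pick a linkage.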

In [None]:
AG=AgglomerativeClustering()
AG.fit(S)
clusters['Agglomerative labels']=AG.labels_
clusters.head()

In [None]:
K_Means_group=clusters['K-Means-Grouping'].value_counts()
Agg_Group=clusters['Agglomerative labels'].value_counts()
fig,ax=plt.subplots(1,2,figsize=(15,5))
K_Means_group.plot.pie(shadow=True, startangle=120,autopct='%.2f',ax=ax[0])
Agg_Group.plot.pie(shadow=True, startangle=120,autopct='%.2f',ax=ax[1])

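As you can see from the pie plots, K-Means produced 3 clusters whereas Agglomerative produced 2. Note that this reflects the settings rather than a discovery by the algorithms: K-Means was asked for n_clusters=3, while AgglomerativeClustering() was left at its default of n_clusters=2. The two methods also group points differently because of the way K-Means (centroid based) and Agglomerative (hierarchical) clustering work.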

We will compare the attributes based on the clusters for better understanding...

In [None]:
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally.
for feature in ['cement','slag']:
    fig,ax=plt.subplots(1,2,figsize=(15,5))
    sns.scatterplot(x=clusters[feature],y=clusters['csMPa'],hue=clusters['K-Means-Grouping'],size=clusters['flyash'],ax=ax[0]);
    sns.scatterplot(x=clusters[feature],y=clusters['csMPa'],hue=clusters['Agglomerative labels'],size=clusters['flyash'],ax=ax[1]);

In [None]:
# x and y are passed as keywords; the first flyash panel keeps its reversed red palette.
fig,ax=plt.subplots(1,2,figsize=(15,5))
sns.scatterplot(x=clusters['flyash'],y=clusters['csMPa'],hue=clusters['K-Means-Grouping'],size=clusters['flyash'],ax=ax[0],palette='Reds_r');
sns.scatterplot(x=clusters['flyash'],y=clusters['csMPa'],hue=clusters['Agglomerative labels'],size=clusters['flyash'],ax=ax[1]);
fig,ax=plt.subplots(1,2,figsize=(15,5))
sns.scatterplot(x=clusters['water'],y=clusters['csMPa'],hue=clusters['K-Means-Grouping'],size=clusters['flyash'],ax=ax[0]);
sns.scatterplot(x=clusters['water'],y=clusters['csMPa'],hue=clusters['Agglomerative labels'],size=clusters['flyash'],ax=ax[1]);

In [None]:
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally.
for feature in ['superplasticizer','coarseaggregate']:
    fig,ax=plt.subplots(1,2,figsize=(15,5))
    sns.scatterplot(x=clusters[feature],y=clusters['csMPa'],hue=clusters['K-Means-Grouping'],size=clusters['flyash'],ax=ax[0]);
    sns.scatterplot(x=clusters[feature],y=clusters['csMPa'],hue=clusters['Agglomerative labels'],size=clusters['flyash'],ax=ax[1]);

In [None]:
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally.
for feature in ['fineaggregate','age']:
    fig,ax=plt.subplots(1,2,figsize=(15,5))
    sns.scatterplot(x=clusters[feature],y=clusters['csMPa'],hue=clusters['K-Means-Grouping'],size=clusters['flyash'],ax=ax[0]);
    sns.scatterplot(x=clusters[feature],y=clusters['csMPa'],hue=clusters['Agglomerative labels'],size=clusters['flyash'],ax=ax[1]);

##### From the scatter plots of every variable against strength, the groupings formed through clustering suggest that points in the same cluster have similar values

In [None]:
#fig,ax=plt.subplots(1,2,figsize=(15,5))
var = 'age'
var2='cement'
var3='water'
var4='fineaggregate'
var5='slag'
var6='flyash'
var7='superplasticizer'
var8='coarseaggregate'
with sns.axes_style("white"):
    plot = sns.lmplot(x=var,y='csMPa',data=clusters,col='K-Means-Grouping',x_estimator=np.mean)
    plot1 = sns.lmplot(x=var2,y='csMPa',data=clusters,col='K-Means-Grouping',x_estimator=np.mean)
    plot2 = sns.lmplot(x=var3,y='csMPa',data=clusters,col='K-Means-Grouping',x_estimator=np.mean)
    plot3 = sns.lmplot(x=var4,y='csMPa',data=clusters,col='K-Means-Grouping',x_estimator=np.mean)
    
plot.set(ylim = (-3,3));
plot1.set(ylim=(-3,3));
plot2.set(ylim=(-3,3));
plot3.set(ylim=(-3,3));

### Cluster Analysis and their relationship with the target (Concrete Strength)

1. From the above plots of Age vs Strength it is evident that age can be a strong predictor of the strength of the concrete mix. For all the clusters we see a strong positive linear relationship between age and strength. We can also infer that as the mixture ages, the strength of the concrete increases.
    1.1. The line of best fit is also close to the means, and the residuals (errors) are small.
2. Cement seems to have a slight positive relationship with strength, but may not be a strong predictor.
3. Water vs strength: group 2 shows some linear relationship, whereas group 0 and group 1 show only a slight one. Hence, water may also not be a strong predictor of strength.
4. Fineagg: for group 0 and group 1 the line is almost horizontal, which means a change in fineagg produces no considerable change in strength, but for group 2 there is some relationship between fineagg and strength. Hence fineagg may also not be a good predictor of concrete strength.

In [None]:
with sns.axes_style("white"):
    plot4 = sns.lmplot(x=var5,y='csMPa',data=clusters,col='K-Means-Grouping',x_estimator=np.mean)
    plot5 = sns.lmplot(x=var6,y='csMPa',data=clusters,col='K-Means-Grouping',x_estimator=np.mean)
    plot6 = sns.lmplot(x=var7,y='csMPa',data=clusters,col='K-Means-Grouping',x_estimator=np.mean)
    plot7 = sns.lmplot(x=var8,y='csMPa',data=clusters,col='K-Means-Grouping',x_estimator=np.mean)
plot4.set(ylim=(-3,3));
plot5.set(ylim=(-3,3));
plot6.set(ylim=(-3,3));
plot7.set(ylim=(-3,3));

### Cluster Analysis and their relationship with the target (Concrete Strength)

1. Slag is almost horizontal for group 0 and group 2, and group 1 has only a slight relationship, which makes slag a weak predictor of strength.
2. Ash, superplastic and coarseagg all have a horizontal distribution against strength in at least one group, which also makes them weak predictors of strength.

### So from our cluster analysis, we can infer that age has a strong relationship with strength for the data in all clusters

## Model Creation

### For this problem statement, linear models seem to be a reasonable starting point. We are not going to limit ourselves to linear regression; we will explore linear, polynomial, kernel and tree-based models to see which performs best and select one.

#### Overview of the next phases :
1. Scale the data: thus far we have mostly used the raw data (except for PCA and the final clustering, where the data was standardized). For many ML algorithms the unit of measurement plays a vital role in model performance, so it is essential to scale the data to keep one unit or magnitude from outweighing another. In this dataset there are two units, kg and days, so we will scale the data.
2. Split the data into training and test sets so that the model never sees the test data during training; a fixed random state makes the split reproducible.
3. We will use linear models as well as kernel and tree-based models, and evaluate their performance on the training and test sets.
4. Explore the feature importances of the models wherever applicable.
5. We will use both the scaled raw data and the PCA-extracted data with 6 components, which we prepared earlier.
6. Evaluate the scores of all the models on both the scaled original data and the PCA data and decide on the best.
7. Perform hyperparameter tuning using GridSearchCV; RandomizedSearchCV is an alternative.
8. Finally, do cross validation on the best model to evaluate its performance on unseen data.
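
Steps 1 and 2 can be combined so that the scaler only ever learns from the training rows. The sketch below is one possible arrangement, not the approach used in this notebook (which standardizes the full dataset before splitting); the synthetic data and step names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12)
# Synthetic stand-in for the features: one column in "kg", one in "days".
X = np.column_stack([rng.uniform(100, 540, 200), rng.uniform(1, 365, 200)])
y = 0.05 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(0, 1, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=12)

# The scaler is fitted on the training rows only, then reused on the test rows.
model = Pipeline([("scale", StandardScaler()), ("reg", LinearRegression())])
model.fit(X_train, y_train)

test_r2 = model.score(X_test, y_test)
print(X_train.shape, X_test.shape)
print(round(test_r2, 3))
```

Wrapping the scaler in a `Pipeline` also means that cross validation and grid search refit it inside every fold, which keeps the test folds out of the scaling statistics.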

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,AdaBoostRegressor,BaggingRegressor,GradientBoostingRegressor
X_SCALED=S.drop(['csMPa'],axis=1)
Y_SCALED=S['csMPa']

Split the data using train_test_split

In [None]:
X_Train,X_Test,Y_Train,Y_Test=train_test_split(X_SCALED,Y_SCALED,test_size=0.3,random_state=12)

In [None]:
M1_Linear_Model=LinearRegression()
M2_Poly=Pipeline([('Poly',PolynomialFeatures(degree=2)),
               ('Model2',LinearRegression())
               ])
M3_SVR=SVR()
M4_DTREE=DecisionTreeRegressor()
M5_RF=RandomForestRegressor()
M6_ADA=AdaBoostRegressor()
M7_BAG=BaggingRegressor()
M8_Lasso=Lasso(alpha=0.2)
M9_Ridge=Ridge()
M10_Gradient_Booster=GradientBoostingRegressor()

Instantiate all the regression models we want to compare and train them on the training set.

In [None]:
M1_Linear_Model.fit(X_Train,Y_Train)
M2_Poly.fit(X_Train,Y_Train)
M3_SVR.fit(X_Train,Y_Train)
M4_DTREE.fit(X_Train,Y_Train)
M5_RF.fit(X_Train,Y_Train)
M6_ADA.fit(X_Train,Y_Train)
M7_BAG.fit(X_Train,Y_Train)
M8_Lasso.fit(X_Train,Y_Train)
M9_Ridge.fit(X_Train,Y_Train)
M10_Gradient_Booster.fit(X_Train,Y_Train)

Though we have done PCA to look at the variance captured by each component, we shall also list the feature importances (or coefficients) captured by each of the models and try to understand how effective each feature is in predicting strength.


In [None]:
#Feature Importance from Decision Tree, RF, Lasso and Ridge
Features=(['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg','fineagg', 'age'])
Features_Linear_RAW=M1_Linear_Model.coef_
Features_DTREE_RAW=M4_DTREE.feature_importances_
features_RF_RAW=M5_RF.feature_importances_
features_lasso_RAW=M8_Lasso.coef_
features_ridge_RAW=M9_Ridge.coef_
summary={'FEATURES':Features,"Linear":Features_Linear_RAW,"Dtree":Features_DTREE_RAW,'Random Forest':features_RF_RAW,'Lasso':features_lasso_RAW,'Ridge':features_ridge_RAW}


FEATURES_DF=pd.DataFrame(summary)
FEATURES_DF

From the above table we could see that
1. Linear and Ridge (default alpha=1.0) have almost the same coefficients for all features.
2. Decision Tree and Random Forest have similar feature importances.
3. Lasso regularization (alpha=0.2) stands out because it sets the coefficients of less important features to exactly zero. This is because Lasso adds an L1 penalty on the absolute size of the coefficients, so it performs feature selection as well as parameter shrinkage, whereas Ridge (L2 penalty) only shrinks the coefficients.
4. Instead of PCA, Lasso regularization could also be used to select features, but for this case we have gone with PCA.
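
The difference between the two penalties is easy to reproduce on synthetic data. In the sketch below only the first two of six features influence the target; the data, the seed and the alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
# Six standardized synthetic features; only the first two drive the target.
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.2).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print(np.round(lasso.coef_, 2), n_zero_lasso)
print(np.round(ridge.coef_, 2), n_zero_ridge)
```

Lasso zeroes out irrelevant features and shrinks the two real coefficients, while Ridge leaves every coefficient small but non-zero.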

In [None]:
#Predictions on RAW SCALED and Feature importances
Y_PRED_LINEAR_RAW=M1_Linear_Model.predict(X_Test)
Y_PRED_POLY_RAW=M2_Poly.predict(X_Test)
Y_PRED_SVR_RAW=M3_SVR.predict(X_Test)
Y_PRED_DTREE_RAW=M4_DTREE.predict(X_Test)
Y_PRED_RF_RAW=M5_RF.predict(X_Test)
Y_PRED_ADA_RAW=M6_ADA.predict(X_Test)
Y_PRED_BAG_RAW=M7_BAG.predict(X_Test)
Y_PRED_LASSO_RAW=M8_Lasso.predict(X_Test)
Y_PRED_RIDGE_RAW=M9_Ridge.predict(X_Test)
Y_PRED_GRD_RAW=M10_Gradient_Booster.predict(X_Test)



In [None]:
Training_M1_Linear=M1_Linear_Model.score(X_Train,Y_Train)* 100
Training_M2_Poly=M2_Poly.score(X_Train,Y_Train)* 100
Training_M3_SVR=M3_SVR.score(X_Train,Y_Train)* 100
Training_M4_DTREE=M4_DTREE.score(X_Train,Y_Train)* 100
Training_M5_RF=M5_RF.score(X_Train,Y_Train)* 100
Training_M6_ADA=M6_ADA.score(X_Train,Y_Train)* 100
Training_M7_BAG=M7_BAG.score(X_Train,Y_Train)* 100
Training_M8_Lasso=M8_Lasso.score(X_Train,Y_Train)* 100
Training_M9_Ridge=M9_Ridge.score(X_Train,Y_Train)* 100
Training_M10_Gradient=M10_Gradient_Booster.score(X_Train,Y_Train)* 100
Test_M1_Linear=M1_Linear_Model.score(X_Test,Y_Test)* 100
Test_M2_Poly=M2_Poly.score(X_Test,Y_Test)* 100
Test_M3_SVR=M3_SVR.score(X_Test,Y_Test)* 100
Test_M4_DTREE=M4_DTREE.score(X_Test,Y_Test)* 100
Test_M5_RF=M5_RF.score(X_Test,Y_Test)* 100
Test_M6_ADA=M6_ADA.score(X_Test,Y_Test)* 100
Test_M7_BAG=M7_BAG.score(X_Test,Y_Test)* 100
Test_M8_Lasso=M8_Lasso.score(X_Test,Y_Test)* 100
Test_M9_Ridge=M9_Ridge.score(X_Test,Y_Test)* 100
Test_M10_Gradient=M10_Gradient_Booster.score(X_Test,Y_Test)* 100

Split the PCA components using train_test_split with 6 features (n_components=6) to see how our models perform on the PCA components.

In [None]:
X_pca_train,X_pca_test,Y_pca_train,Y_pca_test=train_test_split(X_PCA_6,Y_PCA_6,test_size=0.3,random_state=12)

In [None]:
M1_Linear_Model.fit(X_pca_train,Y_pca_train)
M2_Poly.fit(X_pca_train,Y_pca_train)
M3_SVR.fit(X_pca_train,Y_pca_train)
M4_DTREE.fit(X_pca_train,Y_pca_train)
M5_RF.fit(X_pca_train,Y_pca_train)
M6_ADA.fit(X_pca_train,Y_pca_train)
M7_BAG.fit(X_pca_train,Y_pca_train)
M8_Lasso.fit(X_pca_train,Y_pca_train)
M9_Ridge.fit(X_pca_train,Y_pca_train)
M10_Gradient_Booster.fit(X_pca_train,Y_pca_train)

In [None]:
#Predictions on PCA SCALED and Feature importances
Y_PRED_LINEAR_PCA=M1_Linear_Model.predict(X_pca_test)
Y_PRED_POLY_PCA=M2_Poly.predict(X_pca_test)
Y_PRED_SVR_PCA=M3_SVR.predict(X_pca_test)
Y_PRED_DTREE_PCA=M4_DTREE.predict(X_pca_test)
Y_PRED_RF_PCA=M5_RF.predict(X_pca_test)
Y_PRED_ADA_PCA=M6_ADA.predict(X_pca_test)
Y_PRED_BAG_PCA=M7_BAG.predict(X_pca_test)
Y_PRED_LASSO_PCA=M8_Lasso.predict(X_pca_test)
Y_PRED_RIDGE_PCA=M9_Ridge.predict(X_pca_test)
Y_PRED_GRD_PCA=M10_Gradient_Booster.predict(X_pca_test)

In [None]:
Training_PCA_M1_Linear=M1_Linear_Model.score(X_pca_train,Y_pca_train)* 100
Training_PCA_M2_Poly=M2_Poly.score(X_pca_train,Y_pca_train)* 100
Training_PCA_M3_SVR=M3_SVR.score(X_pca_train,Y_pca_train)* 100
Training_PCA_M4_DTREE=M4_DTREE.score(X_pca_train,Y_pca_train)* 100
Training_PCA_M5_RF=M5_RF.score(X_pca_train,Y_pca_train)* 100
Training_PCA_M6_ADA=M6_ADA.score(X_pca_train,Y_pca_train)* 100
Training_PCA_M7_BAG=M7_BAG.score(X_pca_train,Y_pca_train)* 100
Training_PCA_M8_Lasso=M8_Lasso.score(X_pca_train,Y_pca_train)* 100
Training_PCA_M9_Ridge=M9_Ridge.score(X_pca_train,Y_pca_train)* 100
Training_PCA_M10_Gradient=M10_Gradient_Booster.score(X_pca_train,Y_pca_train)* 100
Test_PCA_M1_Linear=M1_Linear_Model.score(X_pca_test,Y_pca_test)* 100
Test_PCA_M2_Poly=M2_Poly.score(X_pca_test,Y_pca_test)* 100
Test_PCA_M3_SVR=M3_SVR.score(X_pca_test,Y_pca_test)* 100
Test_PCA_M4_DTREE=M4_DTREE.score(X_pca_test,Y_pca_test)* 100
Test_PCA_M5_RF=M5_RF.score(X_pca_test,Y_pca_test)* 100
Test_PCA_M6_ADA=M6_ADA.score(X_pca_test,Y_pca_test)* 100
Test_PCA_M7_BAG=M7_BAG.score(X_pca_test,Y_pca_test)* 100
Test_PCA_M8_Lasso=M8_Lasso.score(X_pca_test,Y_pca_test)* 100
Test_PCA_M9_Ridge=M9_Ridge.score(X_pca_test,Y_pca_test)* 100
Test_PCA_M10_Gradient=M10_Gradient_Booster.score(X_pca_test,Y_pca_test)* 100

Now, with all the scores that we have stored in variables for the training and testing data on the scaled raw features and the PCA features, we will create a table that displays the model scores as a dataframe.

In [None]:
DTREE_COEFF_PCA=M4_DTREE.feature_importances_
RF_COEFF_PCA=M5_RF.feature_importances_
ADA_COEFF_PCA=M6_ADA.feature_importances_
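# BaggingRegressor exposes no feature importances; this stores the number of input features.
# n_features_ was removed in newer scikit-learn releases, n_features_in_ replaces it.
BAG_COEFF_PCA=M7_BAG.n_features_in_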
LAS_COEFF_PCA=M8_Lasso.coef_
RDGE_COEFF_PCA=M9_Ridge.coef_

In [None]:
TAB=pd.DataFrame({'Model_Names':['Linear Regression','Polynomial_regression','Support Vector Regressor','Decision Tree Regressor','Random Forest Regressor',
            'Adaboost Regressor','Bagging Regressor','Lasso Regressor','Ridge Regressor','Gradient Boost'],'Training_Score_Scaled_Raw':[Training_M1_Linear,
Training_M2_Poly,
Training_M3_SVR,
Training_M4_DTREE,
Training_M5_RF,
Training_M6_ADA,
Training_M7_BAG,
Training_M8_Lasso,
Training_M9_Ridge,Training_M10_Gradient],'Testing_Score_Scaled_Raw':[Test_M1_Linear,
Test_M2_Poly,
Test_M3_SVR,
Test_M4_DTREE,
Test_M5_RF,
Test_M6_ADA,
Test_M7_BAG,
Test_M8_Lasso,
Test_M9_Ridge,Test_M10_Gradient],'Training_Score_PCA':[Training_PCA_M1_Linear,
Training_PCA_M2_Poly,
Training_PCA_M3_SVR,
Training_PCA_M4_DTREE,
Training_PCA_M5_RF,
Training_PCA_M6_ADA,
Training_PCA_M7_BAG,
Training_PCA_M8_Lasso,
Training_PCA_M9_Ridge,Training_PCA_M10_Gradient],'Testing_Score_PCA':[Test_PCA_M1_Linear,
Test_PCA_M2_Poly,
Test_PCA_M3_SVR,
Test_PCA_M4_DTREE,
Test_PCA_M5_RF,
Test_PCA_M6_ADA,
Test_PCA_M7_BAG,
Test_PCA_M8_Lasso,
Test_PCA_M9_Ridge,Test_PCA_M10_Gradient]})



In [None]:
TAB

#### The linear model has not performed well on either the raw data or the PCA components; its scores are comparatively low.
#### Polynomial regression of degree 2 has been implemented, which increases the number of independent variables by adding squared and interaction terms. This turns the linear equation "y = mx + b" into a quadratic one, so we are no longer looking for the best straight line but for the best curve. For this problem I am limiting the polynomial to degree 2, since the EDA suggests mostly linear relationships, a higher-degree polynomial might overfit the data, and we may also run into the curse of dimensionality with such a small dataset.
#### The decision tree overfits: it produces a score of 99% on the training data but drops to 78% and 69% on the test data with raw and PCA components respectively, hence it is not the best model.
#### Random forest and Gradient boost seem to overfit slightly, but they generalise well on the test data for both raw and PCA inputs.
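
To see why higher degrees quickly inflate the feature space, the sketch below counts the terms a degree-2 expansion produces for 8 input features. It enumerates the terms by hand rather than calling scikit-learn; the count matches what `PolynomialFeatures(degree=2)` produces with its default bias column.

```python
from itertools import combinations_with_replacement
from math import comb

n_features = 8  # ingredients plus age, as in this dataset
degree = 2

# Every term of total degree <= 2 is a multiset of at most 2 features:
# the empty product is the bias column, one feature is a linear term,
# and two features give a square (x_i * x_i) or an interaction (x_i * x_j).
terms = [combo
         for d in range(degree + 1)
         for combo in combinations_with_replacement(range(n_features), d)]

n_terms = len(terms)
closed_form = comb(n_features + degree, degree)
print(n_terms, closed_form)
```

With degree 3 the same formula gives `comb(11, 3) = 165` columns, which is a lot for a dataset of under a thousand rows.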



In [None]:
print("Best Test score we have achieved on Raw Data:",TAB['Testing_Score_Scaled_Raw'].max())
print("Best Test score we have achieved on PCA components:",TAB['Testing_Score_PCA'].max())

#### Techniques employed to squeeze that extra performance out of the model without making it overfit or underfit

In [None]:
#Regularization using GridsearchCV
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

Though Random forest had a better score, I am selecting Gradient boost here as the best model, as its training and testing scores are closer together, which suggests it generalises better.
##### So, we will do hyperparameter tuning on Gradient boost with grid search to squeeze that extra performance out of the model without making it overfit or underfit

#### GridSearchCV

Every algorithm has parameters that are learned from the data during training and settings that are fixed before training starts. The latter, which configure the algorithm itself, are known as hyperparameters.
We shall try various values for these hyperparameters to find which set of values gives the best result for the model.
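
An exhaustive grid search tries every combination of the listed values, so its cost grows multiplicatively. The sketch below counts the candidates for the gradient boosting grid used in this notebook with 10-fold cross validation; it only does the bookkeeping and trains nothing.

```python
from itertools import product

# Same values as the grid used in this notebook for gradient boosting.
param_grid = {
    "n_estimators": [100, 200, 300, 400, 500, 600],
    "learning_rate": [0.001, 0.01, 0.1],
    "max_depth": [1, 2, 3, 4, 5],
    "subsample": [0.5, 0.75, 1],
    "random_state": [1],
}
cv_folds = 10

# Grid search tries the Cartesian product of all value lists.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)
```

That is 270 candidate settings and 2700 model fits, plus one final refit of the best candidate, which is why a randomized search over the same ranges is often used when the grid grows.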

In [None]:
param_grid={'n_estimators':[100,200,300,400,500,600],'learning_rate':[.001,0.01,.1],'max_depth':[1,2,3,4,5],'subsample':[.5,.75,1],'random_state':[1]                      
           }

In [None]:
estimator=M10_Gradient_Booster
Grid_CV=GridSearchCV(estimator=estimator,param_grid=param_grid,cv=10)

In [None]:
Grid_CV.fit(X_Train,Y_Train)

In [None]:
Grid_CV.best_estimator_

The above is the best estimator found by the grid search; we shall use its hyperparameter values in our final model to check how our scores improve.

In [None]:
Grid_CV.best_params_

### Now let's train our best model with the best hyperparameters found using GridSearch

In [None]:
Final=GradientBoostingRegressor(learning_rate=0.1,max_depth=2,n_estimators=600,random_state=1,subsample=0.75)

In [None]:
Final.fit(X_Train,Y_Train)
Final.score(X_Train,Y_Train) * 100

In [None]:
Final.score(X_Test,Y_Test) * 100

#### As you can see, the training and test scores of Gradient boost have improved; parameter tuning allows us to build a better model
- Without tuning: training score 95%, testing score 86%
- With hyperparameter tuning: training score 98%, testing score 91%

### Model performance range at 95% confidence level 
#### Having built the model with tuned hyperparameters and evaluated it on the training and test sets does not guarantee the same performance on unseen data.
#### Hence, it is essential to do a final evaluation on unseen data. We shall do this using K-fold cross validation

In [None]:
from sklearn.model_selection import cross_val_score,KFold

In [None]:
K=10
seed=12
kfold_Linear=KFold(shuffle=True,n_splits=K,random_state=seed)
accuracies = cross_val_score(estimator = Final, X = X_SCALED, y = Y_SCALED, cv = kfold_Linear) 
accuracies
print("K Fold score mean:{}".format(accuracies.mean()*100))
print("K Fold score standard deviation:{}".format(accuracies.std()*100))

# Conclusion- "When the Rubber meets the Road"

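The tuned model produces a mean R² score of 92.11% on 10-fold cross validation, with a standard deviation of 3.14.
One standard deviation around the mean gives a range of 88.97% to 95.25%. For an approximate 95% range, assuming the fold scores are roughly normally distributed, we take the mean plus or minus 1.96 standard deviations, which gives about 85.96% to 98.26%.
So I conclude that when this model is deployed on unseen data, we can expect a score in roughly that range, which is pretty good.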
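
The arithmetic behind such ranges can be sketched from the reported mean and standard deviation alone. The normality of the fold scores is an assumption, and with only 10 folds the result is a rough guide rather than a formal confidence interval.

```python
# Mean and standard deviation of the fold scores, as reported above (in percent).
mean_score = 92.11
std_score = 3.14

# Mean +/- 1 std covers roughly 68% of normally distributed values.
one_sigma = (mean_score - std_score, mean_score + std_score)

# Mean +/- 1.96 std covers roughly 95% under the same normality assumption.
z_95 = 1.96
lower_95 = mean_score - z_95 * std_score
upper_95 = mean_score + z_95 * std_score

print(round(one_sigma[0], 2), round(one_sigma[1], 2))
print(round(lower_95, 2), round(upper_95, 2))
```

The one-sigma band is 88.97 to 95.25, while the 1.96-sigma band is wider, at about 85.96 to 98.26.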