#Using Diabetes dataset:

In [2]:
from sklearn.datasets import load_diabetes
X, y= load_diabetes(return_X_y=True)

->Splitting data using KFold :

In [3]:
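from sklearn.model_selection import KFold
kf=KFold(n_splits=5)
#each iteration overwrites the previous split, so after the loop
#only the last (5th) fold remains in Xtrain, Xtest, ytrain, ytest
for train, test in kf.split(X,y):
  Xtrain, Xtest= X[train], X[test]
  ytrain, ytest= y[train], y[test]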

In [4]:
print(Xtrain.shape, ytrain.shape)
print(Xtest.shape, ytest.shape)

(354, 10) (354,)
(88, 10) (88,)
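
To score a model on every fold rather than on a single split, a minimal sketch could look like the one below. It assumes `cross_val_score` is acceptable for this purpose; `fold_scores` is a name made up for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# Same kind of splitter as in the notebook: 5 consecutive folds, no shuffling
kf = KFold(n_splits=5)

# One R^2 score per fold (R^2 is the default scoring for a regressor)
fold_scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(fold_scores)
print("mean R^2 :", fold_scores.mean())
```

Averaging over the folds gives a steadier estimate than the score of one fold alone.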


----

#Applying Linear Regression :

In [5]:
from sklearn.linear_model import LinearRegression
model_lr=LinearRegression()
model_lr.fit(Xtrain, ytrain)

LinearRegression()

In [6]:
y_pred_lr=model_lr.predict(Xtest)

#Metrics : 

Metrics are used to monitor and measure the performance of a model (during training and testing). Unlike a loss function, a metric doesn't need to be differentiable.

#->Checking score of our model using some metrics :

In [7]:
from sklearn import metrics
import numpy as np

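* Taking 4 metrics :
1. mean absolute error - the average of the absolute differences b/w predicted values and actual values
2. mean square error - the average of the squared differences b/w predicted values and actual values
3. root mean square error - the square root of the mean square error
4. model score - the value returned by model.score()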

1. Mean Absolute Error : It is the average absolute diff. b/w actual value and predicted value.

metrics.mean_absolute_error(actual_value, predicted_value)

In [8]:
print('Mean Absolute Error : ', metrics.mean_absolute_error(ytest,y_pred_lr))

Mean Absolute Error :  42.38729269073616


2. Mean Square Error : It is the average of the squared diff. b/w actual value and predicted value.

metrics.mean_squared_error(actual_value, predicted_value)

In [9]:
print('Mean Square Error : ', metrics.mean_squared_error(ytest,y_pred_lr))

Mean Square Error :  2910.2069332665283


3. Root Mean Square Error : It is the square root of the mean square error, so it is in the same units as y.

np.sqrt(metrics.mean_squared_error(actual_value, predicted_value))

In [10]:
print('Root Mean Square Error : ', np.sqrt(metrics.mean_squared_error(ytest,y_pred_lr)))

Root Mean Square Error :  53.94633382600275
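
To make the three error metrics concrete, here is a small sketch that computes them by hand with NumPy and compares them with the sklearn functions. The numbers in `y_true` and `y_hat` are made up for illustration and are not taken from the diabetes data.

```python
import numpy as np
from sklearn import metrics

# Small made-up example values (not from the diabetes data)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_hat          # [1, 0, -2, -1]

mae = np.mean(np.abs(errors))    # (1 + 0 + 2 + 1) / 4 = 1.0
mse = np.mean(errors ** 2)       # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)              # square root of 1.5

# The hand-computed values agree with sklearn's functions
assert np.isclose(mae, metrics.mean_absolute_error(y_true, y_hat))
assert np.isclose(mse, metrics.mean_squared_error(y_true, y_hat))
print(mae, mse, rmse)
```

Squaring makes MSE and RMSE react more strongly to a few large errors than MAE does.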


4. Model score : for a regressor, score() returns the R^2 value (1.0 is a perfect fit).

-> Test score (Xtest & ytest)

In [11]:
model_lr.score(Xtest,ytest)

0.5502492259658611

-> Training score

In [12]:
model_lr.score(Xtrain, ytrain)

0.5077517592258138

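->After checking the above scores we can see that our model score is low (about 0.5). The test score (0.55) is slightly higher than the training score (0.51), so there is no sign of overfitting here; if anything, the model is underfitting.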
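
For a regressor, score() returns the R^2 value. A minimal sketch of how R^2 is computed is given below; the helper `r2_by_hand` and the arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_by_hand(y_true, y_hat):
    # Residual sum of squares: error left after using the model
    ss_res = np.sum((y_true - y_hat) ** 2)
    # Total sum of squares: error of always predicting the mean
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 2.0, 7.0])   # made-up values
good = np.array([2.0, 5.0, 4.0, 8.0])     # fairly close predictions
bad = np.array([7.0, 2.0, 6.0, 1.0])      # predictions far from the truth

print(r2_by_hand(y_true, good), r2_score(y_true, good))
print(r2_by_hand(y_true, bad), r2_score(y_true, bad))
```

A score of 1.0 means perfect predictions, 0 means no better than always predicting the mean, and a negative score means worse than predicting the mean.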

---------------

#Applying Decision Tree Regression :

In [13]:
from sklearn.tree import DecisionTreeRegressor
model_dt=DecisionTreeRegressor()
model_dt.fit(Xtrain, ytrain)

DecisionTreeRegressor()

In [14]:
y_pred_dt=model_dt.predict(Xtest)

In [15]:
print('Mean Absolute Error : ', metrics.mean_absolute_error(ytest,y_pred_dt))
print('Mean Square Error : ', metrics.mean_squared_error(ytest,y_pred_dt))
print('Root Mean Square Error : ', np.sqrt(metrics.mean_squared_error(ytest,y_pred_dt)))


Mean Absolute Error :  66.0909090909091
Mean Square Error :  7029.727272727273
Root Mean Square Error :  83.84346887341478


-> We can see that the error in Decision Tree Regression is more than in Linear Regression

* Checking the score :

In [16]:
model_dt.score(Xtest, ytest)

-0.08639191461524609

In [17]:
model_dt.score(Xtrain, ytrain)

1.0

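-> We can see that our test score is very bad: it is negative, i.e. worse than always predicting the mean of ytest.

->A training score of 1.0 together with such a test score means the fully grown tree has memorised the training data (overfitting). No preprocessing technique has been applied so far; its effect is checked later in this notebook.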

------

#Ensemble :
An ensemble method is a machine learning technique that combines several base models in order to produce one stronger predictive model.

There are some built-in ensemble methods we can use, or we can create our own ensemble.

#-> Method-1 : Stacking method

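It is a level-based method (the models are stacked in levels). It depends on how many levels we want to take.

StackingRegressor takes 2 main parameters (actually 3, but here we focus on 2): estimators and final_estimator.

(3rd parameter is cv, the no. of cross-validation folds used to produce the level 0 predictions on which the final estimator is trained)

Here, we'll be taking 2 levels.

Level 0 has a stack containing 2 different models (the base models).

Level 1 has the meta-learner (final estimator), which learns how to combine the level 0 predictions.

Note : it depends on us which model we want to take as the meta-learner.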
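
The idea of the two levels can be sketched by hand as below. This is only a rough illustration of the principle, not scikit-learn's actual implementation; the names `level0_models`, `oof`, `meta` and the use of `random_state=0` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# Level 0: two different models
level0_models = [LinearRegression(), DecisionTreeRegressor(random_state=0)]

# Out-of-fold predictions: every sample is predicted by a model
# that did not see that sample during fitting (5 folds here)
oof = np.column_stack(
    [cross_val_predict(m, X, y, cv=5) for m in level0_models]
)
print(oof.shape)   # one column per level 0 model

# Level 1: the meta-learner is trained on those predictions
meta = LinearRegression().fit(oof, y)

# For new data, the level 0 models are refitted on all of X and
# their predictions are passed on to the meta-learner
fitted = [m.fit(X, y) for m in level0_models]
new_features = np.column_stack([m.predict(X[:5]) for m in fitted])
print(meta.predict(new_features))
```

Training the meta-learner on out-of-fold predictions keeps it from simply trusting a level 0 model that has memorised the training data.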

In [18]:
#we are predicting a continuous value (regression), therefore StackingRegressor
#for a classification problem we would use StackingClassifier

from sklearn.ensemble import StackingRegressor

For this we first have to make a list of all the models. For ex: here, in this case we have used 2 models - linear regression & decision tree regression.

#Level - 0 :

Step-1 :
 Create an empty list

In [19]:
level0=list()

Step-2 : Append the models to the list; here they are linear regression and decision tree regression

In [20]:
#in level0 we have 2 diff. models that will work together

level0.append(('lr',LinearRegression()))
level0.append(('dt',DecisionTreeRegressor()))

#Level - 1:

Step-3 : we have to choose the meta-learner (i.e. the final estimator that combines the level 0 predictions)

In [21]:
level1=LinearRegression()

----

Step-4 : Creating the model

(Note : instead of the above 2 models, we can append any no. of models here.

Suppose we want to append 5 or 6 models: we can do it, bcz a specific model may perform well only on specific data. That's why we go for an ensemble, so that the combined models can give us more accurate / more robust predictions.

In this way we can create our own ensemble using stacking. Besides stacking, there are also other methods with which we can build our own ensemble.)

In [22]:
#model is StackingRegressor which contains estimators
#estimators=level0, final_estimator=level1

#3rd parameter cv is the no. of cross-validation folds
#here, we have given 5, that means the training data is split into 5 folds and the level0 predictions made on each held-out fold are used to train level1..it is known as cross validation

model_en=StackingRegressor(estimators=level0, final_estimator=level1, cv=5)

Step-5 : fitting our data

In [23]:
model_en.fit(Xtrain, ytrain)

StackingRegressor(cv=5,
                  estimators=[('lr', LinearRegression()),
                              ('dt', DecisionTreeRegressor())],
                  final_estimator=LinearRegression())

Step-6 : prediction

In [24]:
y_pred_en=model_en.predict(Xtest)

->checking diff. b/w predicted & actual(true) values :


In [25]:
print('Mean Absolute Error : ', metrics.mean_absolute_error(ytest,y_pred_en))
print('Mean Square Error : ', metrics.mean_squared_error(ytest,y_pred_en))
print('Root Mean Square Error : ', np.sqrt(metrics.mean_squared_error(ytest,y_pred_en)))


Mean Absolute Error :  42.98505701489864
Mean Square Error :  2958.0993429824925
Root Mean Square Error :  54.3884118446429


In [26]:
model_en.score(Xtest, ytest)

0.5428478112781645

In [27]:
model_en.score(Xtrain, ytrain)

0.5348857133290079

As we can see, our model is predicting values somewhat similar to Linear Regression; one likely reason is that Linear Regression is both one of the level 0 models and the meta-learner.

Our prediction / score is still not good. So far we have not applied any preprocessing technique.

Our main focus was to learn how to use an ensemble.

# This is the way to ensemble any no. of classifiers or regressors.
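
As an illustration of stacking more than 2 models, here is a hedged sketch with 3 level 0 models. The choice of `KNeighborsRegressor` as the extra model, `Ridge` as the meta-learner, and the names `level0_more` and `model_more` are assumptions made for this example only.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# Same split as the last fold of KFold(n_splits=5) without shuffling
X_tr, X_te = X[:354], X[354:]
y_tr, y_te = y[:354], y[354:]

# Level 0: three different models instead of two
level0_more = [
    ('lr', LinearRegression()),
    ('dt', DecisionTreeRegressor(random_state=0)),
    ('knn', KNeighborsRegressor()),
]

# Level 1: a different meta-learner, just to show it is our choice
model_more = StackingRegressor(
    estimators=level0_more, final_estimator=Ridge(), cv=5
)
model_more.fit(X_tr, y_tr)
print(model_more.score(X_te, y_te))
```

For a classification problem the same pattern applies with `StackingClassifier` and classifier models.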

-----

#PCA (Principal Component Analysis) :

PCA is a dimensionality reduction technique using feature extraction, that means it does not keep the original feature values; it builds new features (components) as combinations of the original ones.

Using PCA we reduce the dimension of the data given to the model, and bcz of this complexity is reduced.

As we can see, with or without ensemble we are not getting good performance using all 10 features of our dataset.

Our score of the model should be high but it is low.

So, we can reduce the 10 features to a few components that keep most of the variance, and check whether the performance is deteriorating or improving.


# * We perform this step before splitting our data :

In [28]:
from sklearn.decomposition import PCA

#there are many parameters in PCA() but, here we'll only take n_components; the others keep their default values

#n_components means how many components (new features) we want to keep
pca=PCA(n_components=2)

#X contains 10 features
#from these 10 features it'll build the 2 components that capture the most variance, based on the formulas discussed in theory class (there are some statistical steps)
pca1=pca.fit_transform(X)
pca1

array([[ 2.79306207e-02, -9.26011612e-02],
       [-1.34686052e-01,  6.52634060e-02],
       [ 1.29447396e-02, -7.77641691e-02],
       [ 2.34543980e-03,  1.81819367e-02],
       [-3.59806910e-02,  3.86213572e-02],
       [-1.88660280e-01, -1.81251413e-02],
       [-9.48347610e-02, -3.83155499e-02],
       [ 9.87389258e-02,  8.69453424e-02],
       [ 2.86833351e-02, -4.19169143e-02],
       [-1.00910711e-02, -2.34450670e-02],
       [-1.83419418e-01, -7.28738089e-02],
       [ 1.88796730e-02, -3.00873701e-02],
       [-7.59323171e-02,  4.66838722e-02],
       [ 1.51473512e-02, -3.09692487e-02],
       [-8.01694940e-02,  8.64707740e-02],
       [ 1.40488092e-01,  4.51897240e-02],
       [ 7.58600302e-03,  4.33946961e-02],
       [ 1.08839198e-01, -7.06819254e-03],
       [-5.49947993e-02, -2.03725705e-02],
       [-8.42651351e-02,  4.41806793e-02],
       [-9.55955468e-02, -1.37941404e-02],
       [-9.18719735e-02, -3.95423697e-02],
       [-7.97268029e-02,  3.42316116e-02],
       [ 1.

Above are the 2 principal components produced by PCA.

We can see that these values are diff. from our feature values in X, bcz each component is a combination of all 10 original features.
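
The statistical steps behind PCA can be sketched with NumPy as follows. This is a minimal illustration assuming the covariance / eigenvector formulation; the names `Xc`, `top2` and `manual` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA

X, y = load_diabetes(return_X_y=True)

# 1. Centre every feature at zero mean
Xc = X - X.mean(axis=0)

# 2. Covariance matrix of the 10 features (10 x 10)
cov = np.cov(Xc, rowvar=False)

# 3. Eigen-decomposition; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
top2 = eigvecs[:, order[:2]]      # the 2 directions of largest variance

# 4. Project the centred data onto those 2 directions
manual = Xc @ top2

# Compare with sklearn, ignoring the sign of each component
sk = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(manual), np.abs(sk)))
```

The sign of a component is arbitrary, which is why the comparison uses absolute values.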



X is now pca1.

-> Checking how the original features contribute to the 2 components :
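
PCA does not pick 2 of the original columns, so one way to inspect the result is to look at how much variance each component captures and how strongly every original feature is weighted in it. A minimal sketch, assuming the `explained_variance_ratio_` and `components_` attributes of a fitted PCA are what we want to look at:

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA

data = load_diabetes()
pca = PCA(n_components=2).fit(data.data)

# Share of the total variance captured by each component
print(pca.explained_variance_ratio_)

# Weight of every original feature in each component
for i, comp in enumerate(pca.components_):
    print(f"component {i + 1}:")
    for name, w in zip(data.feature_names, comp):
        print(f"  {name:>4} {w:+.3f}")
```

A large absolute weight means that feature has a strong influence on the component.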

-> Splitting our data (pca1) :

In [29]:
kf2=KFold(n_splits=5)
for train, test in kf2.split(pca1):
  Xtrain, Xtest= pca1[train], pca1[test]
  ytrain, ytest= y[train], y[test]

In [30]:
print(Xtrain.shape, ytrain.shape)
print(Xtest.shape, ytest.shape)

(354, 2) (354,)
(88, 2) (88,)


----

#Checking performance using Linear Regression after using PCA :

In [31]:
model_lr=LinearRegression()
model_lr.fit(Xtrain, ytrain)

LinearRegression()

In [32]:
y_pred_lr=model_lr.predict(Xtest)

In [33]:
print('Mean Absolute Error : ', metrics.mean_absolute_error(ytest,y_pred_lr))
print('Mean Square Error : ', metrics.mean_squared_error(ytest,y_pred_lr))
print('Root Mean Square Error : ', np.sqrt(metrics.mean_squared_error(ytest,y_pred_lr)))

Mean Absolute Error :  53.646092669272754
Mean Square Error :  4031.692590044977
Root Mean Square Error :  63.49561079354208


In [34]:
model_lr.score(Xtest,ytest)

0.376931982975808

In [35]:
model_lr.score(Xtrain,ytrain)

0.33605983178074716

----

#Checking performance using Decision Tree Regression after using PCA :

In [36]:
model_dt=DecisionTreeRegressor()
model_dt.fit(Xtrain, ytrain)

DecisionTreeRegressor()

In [37]:
y_pred_dt=model_dt.predict(Xtest)

In [38]:
print('Mean Absolute Error : ', metrics.mean_absolute_error(ytest,y_pred_dt))
print('Mean Square Error : ', metrics.mean_squared_error(ytest,y_pred_dt))
print('Root Mean Square Error : ', np.sqrt(metrics.mean_squared_error(ytest,y_pred_dt)))

Mean Absolute Error :  71.19318181818181
Mean Square Error :  7922.102272727273
Root Mean Square Error :  89.00619232799072


In [39]:
model_dt.score(Xtest,ytest)

-0.2243018145577318

In [40]:
model_dt.score(Xtrain,ytrain)

1.0

-----

#Checking performance using Ensemble after using PCA :

In [41]:
level0=list()

level0.append(('lr',LinearRegression()))
level0.append(('dt',DecisionTreeRegressor()))

In [42]:
level1=LinearRegression()

In [43]:
model_en=StackingRegressor(estimators=level0, final_estimator=level1, cv=5)
model_en.fit(Xtrain, ytrain)

y_pred_en=model_en.predict(Xtest)

In [44]:
print('Mean Absolute Error : ', metrics.mean_absolute_error(ytest,y_pred_en))
print('Mean Square Error : ', metrics.mean_squared_error(ytest,y_pred_en))
print('Root Mean Square Error : ', np.sqrt(metrics.mean_squared_error(ytest,y_pred_en)))

Mean Absolute Error :  54.46664777677002
Mean Square Error :  4127.310161490468
Root Mean Square Error :  64.24414495882459


In [45]:
model_en.score(Xtest,ytest)

0.3621550005292007

In [46]:
model_en.score(Xtrain,ytrain)

0.25728330131119814

----

From the above Linear Regression, Decision Tree Regression and Ensemble results after applying PCA, we can see that our performance has not improved; with only 2 components the test scores are lower than with all 10 features.

-------

#Now, applying a preprocessing technique and then checking our models again (first without PCA) to see whether they are improved after preprocessing or not :

#data preprocessing :

In [51]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
X=scaler.fit_transform(X)

# we don't need to transform y as it is our target (the value to be predicted)
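
What StandardScaler does to each column can be sketched by hand. The array `A` below is made-up data with two features on very different scales, used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
A = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# StandardScaler: subtract the column mean, divide by the column std
by_hand = (A - A.mean(axis=0)) / A.std(axis=0)
scaled = StandardScaler().fit_transform(A)

print(scaled)
print(np.allclose(by_hand, scaled))
```

After scaling, every column has mean 0 and standard deviation 1, so no feature dominates only bcz of its units.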

----

#Checking without PCA :

In [52]:
kf=KFold(n_splits=5)
for train, test in kf.split(X,y):
  Xtrain, Xtest= X[train], X[test]
  ytrain, ytest= y[train], y[test]

In [53]:
print(Xtrain.shape, ytrain.shape)
print(Xtest.shape, ytest.shape)

(354, 10) (354,)
(88, 10) (88,)


#1. Linear Regression :

In [54]:
model_lr=LinearRegression()
model_lr.fit(Xtrain, ytrain)

LinearRegression()

In [55]:
y_pred_lr=model_lr.predict(Xtest)

In [56]:
print('Mean Absolute Error : ', metrics.mean_absolute_error(ytest,y_pred_lr))
print('Mean Square Error : ', metrics.mean_squared_error(ytest,y_pred_lr))
print('Root Mean Square Error : ', np.sqrt(metrics.mean_squared_error(ytest,y_pred_lr)))

Mean Absolute Error :  42.38729269073615
Mean Square Error :  2910.2069332665287
Root Mean Square Error :  53.94633382600275


In [58]:
model_lr.score(Xtest,ytest)

0.5502492259658609

In [59]:
model_lr.score(Xtrain,ytrain)

0.5077517592258138

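Linear Regression after preprocessing & without PCA gives the same results as the previous Linear Regression (up to rounding), bcz rescaling the features does not change the predictions of ordinary least squares.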
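
A quick check of this behaviour is sketched below: the same split is fitted once on the raw features and once on the standardized ones, and the predictions are compared. The names `X_raw`, `X_std`, `pred_raw` and `pred_std` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X_raw, y = load_diabetes(return_X_y=True)

# Same split as the last fold of KFold(n_splits=5) without shuffling
tr, te = slice(0, 354), slice(354, None)

X_std = StandardScaler().fit_transform(X_raw)

pred_raw = LinearRegression().fit(X_raw[tr], y[tr]).predict(X_raw[te])
pred_std = LinearRegression().fit(X_std[tr], y[tr]).predict(X_std[te])

# Largest difference between the two sets of predictions
print(np.max(np.abs(pred_raw - pred_std)))
```

The coefficients change with the scaling, but the fitted values do not, so the metrics stay the same.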

-------

#2. Decision Tree Regression :

In [60]:
model_dt=DecisionTreeRegressor()
model_dt.fit(Xtrain, ytrain)

DecisionTreeRegressor()

In [61]:
y_pred_dt=model_dt.predict(Xtest)

In [62]:
print('Mean Absolute Error : ', metrics.mean_absolute_error(ytest,y_pred_dt))
print('Mean Square Error : ', metrics.mean_squared_error(ytest,y_pred_dt))
print('Root Mean Square Error : ', np.sqrt(metrics.mean_squared_error(ytest,y_pred_dt)))

Mean Absolute Error :  67.32954545454545
Mean Square Error :  7338.329545454545
Root Mean Square Error :  85.66405048475437


In [65]:
model_dt.score(Xtest,ytest)

-0.13408409397240195

In [66]:
model_dt.score(Xtrain,ytrain)

1.0

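Decision Tree Regression after preprocessing & without PCA gives results similar to the previous Decision Tree Regression results. The small differences are most likely bcz no random_state is set, so the tree can differ slightly from run to run.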

--------

#3. Ensemble :

In [67]:
level0=list()

level0.append(('lr',LinearRegression()))
level0.append(('dt',DecisionTreeRegressor()))

In [68]:
level1=LinearRegression()

In [69]:
model_en=StackingRegressor(estimators=level0, final_estimator=level1, cv=5)
model_en.fit(Xtrain, ytrain)

y_pred_en=model_en.predict(Xtest)

In [70]:
print('Mean Absolute Error : ', metrics.mean_absolute_error(ytest,y_pred_en))
print('Mean Square Error : ', metrics.mean_squared_error(ytest,y_pred_en))
print('Root Mean Square Error : ', np.sqrt(metrics.mean_squared_error(ytest,y_pred_en)))

Mean Absolute Error :  43.250340678569685
Mean Square Error :  2992.355646029298
Root Mean Square Error :  54.70242815478393


In [71]:
model_en.score(Xtest,ytest)

0.5375537551632079

In [72]:
model_en.score(Xtrain,ytrain)

0.5865835225790016

Ensemble after preprocessing & without PCA gives results similar to the previous Ensemble results.

-------

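#***After Data Preprocessing (using StandardScaler) our models are still not performing well***

(Note : the features returned by load_diabetes are already mean-centred and scaled, so StandardScaler has little to change here.)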

-------
------

#For this dataset, even if we perform data preprocessing, then PCA, and then check the performance, it'll most likely still be low.


#Some other preprocessing technique might give a better result.

----
-----

step-1 : data preprocessing

step-2 : pca

step-3 : Splitting

step-4 : model selection, prediction & error calculation(using metrics: mean absolute error, etc.)
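
The 4 steps above can be sketched in one piece with a Pipeline. This is a hedged variation, not the exact procedure used in this notebook: here the scaler and PCA are fitted inside each training fold (so no information from the test fold is used), and every fold is scored instead of only the last one. The step names 'scale', 'pca' and 'model' are made up for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

pipe = Pipeline([
    ('scale', StandardScaler()),        # step-1 : data preprocessing
    ('pca', PCA(n_components=2)),       # step-2 : pca
    ('model', LinearRegression()),      # step-4 : model
])

# step-3 : splitting is done by cross_validate, fold by fold
results = cross_validate(
    pipe, X, y,
    cv=KFold(n_splits=5),
    scoring=('r2', 'neg_mean_absolute_error'),
)
print('R^2 per fold :', results['test_r2'])
print('MAE per fold :', -results['test_neg_mean_absolute_error'])
```

Swapping the last step for DecisionTreeRegressor or StackingRegressor gives the same comparison for the other models.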