# Ensemble Techniques

> Every machine learning algorithm we have seen has its advantages and disadvantages: each gives better results in some situations and worse results in others. Ensemble learning uses a group of predictors in order to increase accuracy, reduce bias, and so on. It combines the predictions of several base estimators built with a given learning algorithm in order to increase accuracy.

## The Ensemble techniques that we are going to use here are:
> Bagging

> RandomForest

> Boosting (AdaBoost and Gradient Boost)

In [None]:
import warnings

warnings.filterwarnings("ignore")

In [None]:
import numpy as np
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
stroke_data = pd.read_csv("../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
stroke_data.head()

### Before moving on to the machine learning algorithms, let's perform some EDA (Exploratory Data Analysis)

In [None]:
stroke_data.shape

In [None]:
# Total Null Values
stroke_data.isna().sum()

## Column bmi has 201 NaN values

#### There are multiple ways to handle NaN values: we can remove the affected rows, or we can impute them, for example by filling them with the mean or the median of the column. In this case I am going to fill the NaN values with the mean.
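To make the difference between the two imputation choices concrete, here is a minimal sketch on a made-up column (the numbers are illustrative, not taken from the stroke dataset). It assumes only NumPy and shows that a single extreme value pulls the mean up while the median stays put.

```python
import numpy as np

# Toy BMI-like column with two missing values and one extreme reading (made-up numbers).
bmi = np.array([22.0, 25.0, np.nan, 27.0, 60.0, np.nan, 24.0])

mean_value = np.nanmean(bmi)      # average of the non-missing values
median_value = np.nanmedian(bmi)  # middle of the non-missing values

# Replace only the missing entries, keep everything else as it is.
filled_with_mean = np.where(np.isnan(bmi), mean_value, bmi)
filled_with_median = np.where(np.isnan(bmi), median_value, bmi)

print("mean fill value:", mean_value)
print("median fill value:", median_value)
```

When a column has heavy outliers, the median is often the safer fill value; with a roughly symmetric column the two choices give almost the same result.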

In [None]:
stroke_data.describe()

In [None]:
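# Assign the result back; fillna(inplace=True) on a column selection is deprecated in newer pandas
stroke_data["bmi"] = stroke_data["bmi"].fillna(stroke_data["bmi"].mean())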

In [None]:
stroke_data.isna().sum()

In [None]:
# Let's look at the data once more to see that the NaN values in bmi have been replaced by the mean.
stroke_data.head()

In [None]:
# The id column is of no use to us, so we remove it
stroke_data.drop(columns=["id"], inplace=True)

A short note on why I use inplace=True there. If inplace=True is not given, the function returns a new dataframe that we have to store ourselves. The default value for inplace is False.

df = stroke_data.drop(columns=["id"], inplace=False)

>will return a dataframe without the id column and store it in df, leaving stroke_data unchanged.

whereas,

stroke_data.drop(columns=["id"], inplace=True)

>will remove the id column from stroke_data itself and return None.

### Visualize Data

In [None]:
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns

In [None]:
rcParams["figure.figsize"] = 12, 12

In [None]:
# Pass the column as x=; recent seaborn versions treat the first positional argument as data
sns.countplot(x=stroke_data["heart_disease"])

The count plot shows that only a small number of patients have heart disease.

In [None]:
# Create the figure and its axes in one call; a separate plt.figure() would leave an empty extra figure
fig, axs = plt.subplots(1, 3, figsize=(15, 15))
axs[0].boxplot(stroke_data["age"])
axs[0].set_title('Age', size=20)
axs[1].boxplot(stroke_data["avg_glucose_level"])
axs[1].set_title('Glucose Level', size=20)
axs[2].boxplot(stroke_data["bmi"])
axs[2].set_title("BMI", size=20)

In [None]:
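# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar plot
sns.histplot(stroke_data["avg_glucose_level"], kde=True, stat="density", color="blue", label="Glucose Level")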

In [None]:
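# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar plot
sns.histplot(stroke_data["bmi"], kde=True, stat="density", color="red", label="Body Mass Index")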

#### avg_glucose_level has the highest number of outliers. We can also find the outliers in Python using the IQR method

In [None]:
# Find all outliers in a column using the IQR method

def FindOutliers(data):
    outliers = []

    Q1, Q3 = data.quantile([0.25, 0.75])

    IQR = Q3 - Q1

    upper_range = Q3 + IQR*(1.5)
    lower_range = Q1 - IQR*(1.5)
    
    for x in data:
        if x > upper_range or x < lower_range:
            outliers.append(x)
            
    return outliers, upper_range, lower_range

In [None]:
# Outliers for the column avg_glucose_level
outliers_glucose_level, upper_glucose_lev, lower_glucose_lev = FindOutliers(stroke_data["avg_glucose_level"])

In [None]:
# Outliers for the column bmi
outliers_bmi, upper_bmi, lower_bmi = FindOutliers(stroke_data["bmi"])

In [None]:
# Total number of outliers in these two columns
print(len(outliers_glucose_level), len(outliers_bmi))

In total there are 627 outliers out of roughly 5,000 records, so if we removed them we might lose a lot of information. Let's choose another method for handling outliers. I will use capping: replacing outliers above the upper range with the upper range, and outliers below the lower range with the lower range.

In [None]:
# Applying capping for the glucose level column
stroke_data["avg_glucose_level"] = np.where(stroke_data["avg_glucose_level"] < lower_glucose_lev, lower_glucose_lev, stroke_data["avg_glucose_level"])
stroke_data["avg_glucose_level"] = np.where(stroke_data["avg_glucose_level"] > upper_glucose_lev, upper_glucose_lev, stroke_data["avg_glucose_level"])

In [None]:
# Performing Capping for the Bmi column
stroke_data["bmi"] = np.where(stroke_data["bmi"] < lower_bmi, lower_bmi, stroke_data["bmi"])
stroke_data["bmi"] = np.where(stroke_data["bmi"] > upper_bmi, upper_bmi, stroke_data["bmi"])
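The two np.where calls per column can also be written as a single clip. The sketch below is only an illustration on made-up values (not the stroke data); it recomputes the IQR bounds with np.percentile and caps with np.clip, which should give the same result as the capping above.

```python
import numpy as np

# Made-up values with one low and one high outlier.
values = np.array([1.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 100.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Everything below `lower` becomes `lower`, everything above `upper` becomes `upper`.
capped = np.clip(values, lower, upper)

print("bounds:", lower, upper)
print("capped:", capped)
```

On a pandas column the equivalent call would be `Series.clip(lower, upper)`, which keeps the index intact.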

In [None]:
stroke_data.describe()

In [None]:
# Create the figure and its axes in one call; a separate plt.figure() would leave an empty extra figure
fig, axs = plt.subplots(1, 2, figsize=(15, 15))
axs[0].boxplot(stroke_data["avg_glucose_level"])
axs[0].set_title('Glucose Level', size=20)
axs[1].boxplot(stroke_data["bmi"])
axs[1].set_title("BMI", size=20)

### Now we have handled the outliers. Let's handle the categorical data.

## Categorical Data

### For the categorical data we can use dummy variables or a LabelEncoder. I prefer label encoding because a label can be decoded back to its original value after prediction if needed, as long as the fitted encoder for that column is kept.

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
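le = LabelEncoder()

# apply() refits the same encoder on every column (numeric ones too),
# so afterwards `le` only remembers the labels of the last column.
en_stroke_data = stroke_data.apply(le.fit_transform)  ## en_ is a reminder that this data is encoded; it is the frame used from now on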

In [None]:
en_stroke_data.head()
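If decoding labels back is the goal, one option is to keep a separate fitted encoder per column. The sketch below is an assumption-laden illustration, not the approach used in the rest of this notebook: the tiny frame, its column names and the `encoders` dictionary are all made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up frame; column names and values are illustrative only.
df = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes"],
    "city": ["Oslo", "Rome", "Oslo", "Lima"],
})

encoders = {}          # hypothetical helper: one fitted encoder per column
encoded = df.copy()
for column in df.columns:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(df[column])

# Because each column kept its own encoder, any column can be decoded later.
decoded_city = encoders["city"].inverse_transform(encoded["city"])

print(encoded)
print(list(decoded_city))
```

LabelEncoder assigns integers in sorted order of the labels, so here "Lima" becomes 0, "Oslo" 1 and "Rome" 2.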

### Machine Learning algorithms.

As we said at the beginning about ensemble techniques, it is time to apply them one by one and see which one gives the best result.

In [None]:
# Dependent (response) variable y and independent (predictor) variables X.
X = en_stroke_data.iloc[:, : -1]
y = en_stroke_data.iloc[:, -1]

### Train Test Split:
Let's split our data into train and test sets. As we have roughly 5,000 records, we will use a 70-30 split.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 30)

## Bagging.
The first algorithm that we will use is bagging. Bagging, short for bootstrap aggregation, is a machine learning ensemble meta-algorithm designed to improve the accuracy and stability of machine learning algorithms. Bootstrap is a sampling technique in which k samples are chosen with replacement out of the n samples available. We then train our algorithm (here, a Decision Tree Classifier) on each of these samples. Because every sample is drawn at random, each model sees a slightly different view of the training data. Aggregation means that the predictions of all the models are combined to make the final prediction.
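The two halves of the name can be sketched in a few lines of NumPy. This is only an illustration of the idea, not how BaggingClassifier is implemented: the sizes n and k and the hard-coded predictions are made up, and a plain majority vote stands in for the aggregation step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bootstrap: draw k row indices out of n available rows, with replacement,
# so the same row can appear more than once in one sample.
n = 10
k = 6
indices = rng.choice(n, size=k, replace=True)
print("bootstrap sample indices:", indices)

# Aggregation: each row holds the 0/1 predictions of one model for four test records
# (hard-coded here instead of coming from trained models).
predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])
votes_for_one = predictions.sum(axis=0)
majority_vote = (votes_for_one > predictions.shape[0] / 2).astype(int)
print("majority vote:", majority_vote)  # [0 1 0 0]
```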

### Code for Bagging:


In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators = 500,       ## Total number of decision trees used to train the ensemble is 500
    max_samples = 100,        ## Each tree is trained on 100 training instances randomly sampled from the training set with replacement
    bootstrap = True,         ## bootstrap = True means bagging; if set to False, sampling is done without replacement (pasting), which we didn't cover here
    n_jobs = -1               ## n_jobs is the number of CPU cores used to train the ensemble; -1 means all of them
)

bag_clf.fit(x_train, y_train)

In [None]:
# Making predictions
y_pred_bagging = bag_clf.predict(x_test)

### Accuracy test of our model using confusion matrix

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_bagging)

In [None]:
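# Correct predictions sit on the diagonal of the confusion matrix (1468 in the original run)
(confusion_matrix(y_test, y_pred_bagging).trace() / len(y_test)) * 100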

#### So far we have achieved about 95.7% accuracy with bagging.

## RandomForest
The second algorithm that we will use is random forest. Here, forest means that we will have n trees. The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model.
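The "random subset of features" step can be pictured with a small NumPy sketch. The feature names are placeholders and `candidate_features` is a hypothetical helper written for this illustration; taking about the square root of the number of features is a common choice for classification, but it is an assumption here rather than something this notebook sets explicitly.

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder feature names, not the real column names of the dataset.
feature_names = [f"feature_{i}" for i in range(9)]

# Number of candidate features each split is allowed to look at.
n_candidates = max(1, int(np.sqrt(len(feature_names))))

def candidate_features():
    """Return the random subset of features that one split would search."""
    return rng.choice(feature_names, size=n_candidates, replace=False)

subset = candidate_features()
print("features considered at this split:", list(subset))
```

Every split draws a fresh subset, so different trees (and different nodes in the same tree) end up looking at different features, which is where the extra diversity comes from.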

In [None]:
#  Building a randomForest model

from sklearn.ensemble import RandomForestClassifier

random_forest_clf = RandomForestClassifier(
    n_estimators=350,         ## Train the ensemble with 350 decision trees; any number works, depending on the speed of our machine
    max_leaf_nodes = 15,      ## Each tree will have at most 15 leaf nodes
    n_jobs = -1,
)

random_forest_clf.fit(x_train, y_train)

In [None]:
y_pred_rf = random_forest_clf.predict(x_test)

In [None]:
from sklearn.metrics import accuracy_score

acc = (accuracy_score(y_test, y_pred_rf))*100
print(f"{round(acc, 2)}% of Accuracy")

#### Seems like we got the same accuracy here.

## Boosting:
Boosting (originally called hypothesis boosting) refers to any ensemble method that can combine several weak learners into a strong learner. The idea of boosting is to train predictors sequentially, each one trying to correct its predecessor. The boosting methods that we will use are AdaBoost and Gradient Boost.

### AdaBoost:
Whenever a model makes errors, focusing on the wrongly predicted records can help to increase accuracy. That is how AdaBoost works: the first base classifier is trained and makes predictions on the training set. The relative weight of every misclassified training instance is then increased. A second classifier is trained using the updated weights, it again makes predictions, the weights of the misclassified instances are increased again, and so on. This continues until the requested number of estimators has been trained or a perfect predictor is found.

Each new predictor therefore pays more attention to the records that its predecessor misclassified.
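One round of this reweighting can be sketched with NumPy. This follows the classic discrete AdaBoost update and is only meant to show the mechanics; the labels and predictions are made up, and scikit-learn's AdaBoostClassifier may differ in details such as how the learning rate enters.

```python
import numpy as np

# Made-up labels and the predictions of a first weak classifier (the third record is wrong).
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Every training instance starts with the same weight.
weights = np.full(len(y_true), 1 / len(y_true))
missed = y_pred != y_true

# Weighted error rate of this classifier, and the resulting classifier weight.
error_rate = weights[missed].sum() / weights.sum()
alpha = np.log((1 - error_rate) / error_rate)

# Boost the weights of the misclassified instances only, then renormalise.
new_weights = weights * np.exp(alpha * missed)
new_weights = new_weights / new_weights.sum()

print("error rate:", error_rate)
print("updated weights:", new_weights)
```

After the update the single misclassified record carries half of the total weight, so the next classifier is pushed hard to get it right.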

In [None]:
# Building our Adaboost ensemble model
from sklearn.ensemble import AdaBoostClassifier

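# Note: a fully grown tree usually fits the training set perfectly, so boosting may stop after
# the first estimator; shallow trees such as max_depth=1 stumps are the usual weak learner.
adaboost_clf = AdaBoostClassifier(
    DecisionTreeClassifier(),
    n_estimators = 400,
    learning_rate = 0.6
)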

adaboost_clf.fit(x_train, y_train)

In [None]:
adaboost_pred = adaboost_clf.predict(x_test)

confusion_matrix(y_test, adaboost_pred)

In [None]:
acc_boost = (accuracy_score(y_test, adaboost_pred))*100
print(f"{round(acc_boost, 2)}% Accuracy achieved")

Finally, the confusion matrix shows that AdaBoost assigns records to both categories. The algorithms used earlier gave a higher accuracy, but they were not able to classify both categories.
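Why can a model with higher accuracy still be the less useful one? A small sketch with a hypothetical confusion matrix (the counts are made up, not the output of any cell above) shows how a classifier that never predicts the rare class still scores well on accuracy while its recall on that class is zero.

```python
import numpy as np

# Hypothetical confusion matrix: rows are actual classes, columns are predicted classes.
cm = np.array([
    [950, 0],   # actual 0: all predicted as 0
    [50, 0],    # actual 1: all predicted as 0 as well
])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall_positive = cm[1, 1] / cm[1].sum()    # share of actual 1s that were found

print("accuracy:", accuracy)                # 0.95
print("recall for class 1:", recall_positive)  # 0.0
```

With a rare positive class, it is worth reading the confusion matrix (or a per-class metric such as recall) next to the accuracy figure.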

## Gradient Boost:
The last algorithm in this notebook is gradient boost. Gradient Boost is a popular boosting algorithm. Just like AdaBoost, Gradient Boost works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.
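Fitting each new predictor to the residuals is easiest to see on a small regression problem. The sketch below is a simplified illustration under my own assumptions, not GradientBoostingClassifier itself: the sine data, the learning rate and the `fit_stump` helper (a one-split regression tree) are all made up for the example, and squared error is used so that the residual is exactly what the next predictor should learn.

```python
import numpy as np

def fit_stump(x, target):
    """Fit a one-split regression stump: pick the threshold with the lowest squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:        # skip the largest value so both sides are non-empty
        left = x <= threshold
        left_value = target[left].mean()
        right_value = target[~left].mean()
        sse = ((target[left] - left_value) ** 2).sum() + ((target[~left] - right_value) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best

    def predict(values):
        return np.where(values <= threshold, left_value, right_value)

    return predict

# Made-up one-dimensional regression data.
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())         # start from a constant prediction
errors = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    residuals = y - prediction                 # what the ensemble still gets wrong
    stump = fit_stump(x, residuals)            # the new predictor learns the residuals
    prediction = prediction + learning_rate * stump(x)
    errors.append(np.mean((y - prediction) ** 2))

print("first error:", errors[0])
print("last error:", errors[-1])
```

Each round removes part of the remaining error, so the training error shrinks as predictors are added; the learning rate controls how much of each correction is applied.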

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gradient_clf = GradientBoostingClassifier(
    n_estimators=2000,
    learning_rate=0.5
)

gradient_clf.fit(x_train, y_train)

In [None]:
grad_pred = gradient_clf.predict(x_test)

In [None]:
confusion_matrix(y_test, grad_pred)

In [None]:
grad_acc = (accuracy_score(y_test, grad_pred))*100
print(f"{round(grad_acc, 2)}% Accuracy")

### End note:
Ensemble techniques can boost the performance of a model and give a better result. The ensemble techniques that we have used here can also be tuned further by passing more parameters. We used the classifier version of each algorithm; each also has a regression version for predicting numerical targets. Those are available from the same sklearn library: simply import GradientBoostingRegressor, AdaBoostRegressor, etc. There are many more ensemble techniques, but these are some of the main ones, and the others are worth exploring further.