# Ensemble Study

## 1. Definition

Ensemble learning is a machine learning technique where multiple models are trained on a dataset, and the **predictions of those models are combined to produce a prediction that is typically more accurate and robust** than that of any individual model. In other words, ensemble learning combines the predictions of several weaker models to create a stronger model.

## 2. Types

### Simple Ensemble Techniques

#### 1. Max Voting 

**Definition:** The max voting method is generally used for classification problems. Each model makes a prediction for every data point, and each prediction counts as a ‘vote’. **The prediction made by the majority of the models is used as the final prediction.**

In [47]:
import pandas as pd
# read the text file into a pandas dataframe
df = pd.read_csv("/Users/crystal/Desktop/Random Forest/heart.csv")

In [48]:
# IMPORTS
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import statistics as st
import warnings
warnings.filterwarnings('ignore')

In [49]:
# SPLITTING THE DATASET
x = df.drop('target', axis = 1)
y = df['target']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [9]:
# MODELS CREATION
model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3= LogisticRegression()

model1.fit(x_train,y_train)
model2.fit(x_train,y_train)
model3.fit(x_train,y_train)

LogisticRegression()

The final predictions are collected in a numpy array called final_pred, which starts out empty (np.array([])). The three models' predictions for the test dataset are stored in pred1, pred2, and pred3. A for loop then walks over every observation in x_test, and for each one the mode function from the statistics library is applied to the three predictions. **The mode function returns the most common prediction value among the three predictions.** That value is appended to final_pred with np.append, and the finished array is printed at the end.

In [10]:
# PREDICTION
pred1=model1.predict(x_test)
pred2=model2.predict(x_test)
pred3=model3.predict(x_test)

# FINAL_PREDICTION
final_pred = np.array([])
for i in range(0,len(x_test)):
    final_pred = np.append(final_pred, st.mode([pred1[i], pred2[i], pred3[i]]))
print(final_pred)

[1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1.
 1. 1. 1. 0. 1. 1. 1. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1.
 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0.
 1. 0. 1. 1. 1. 1. 1. 0. 0. 1. 0. 1. 0. 1. 1. 1. 1. 0. 1. 0. 1. 0. 0. 1.
 0. 1. 1. 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0.
 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 1.
 0. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 0. 1.]
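The voting loop above can also be written as a small reusable function. The following is a minimal sketch, not part of the original notebook: the name `majority_vote` and the toy prediction lists are hypothetical, and it uses `collections.Counter` from the standard library in place of `statistics.mode`.

```python
from collections import Counter


def majority_vote(*model_predictions):
    """Return the most common label at each position across the prediction lists."""
    final = []
    for votes in zip(*model_predictions):
        # most_common(1) returns [(label, count)] for the label with the most votes
        label, _count = Counter(votes).most_common(1)[0]
        final.append(label)
    return final


# Hypothetical predictions from three classifiers on five test points
pred1 = [1, 0, 1, 1, 0]
pred2 = [1, 1, 0, 1, 0]
pred3 = [0, 0, 1, 1, 1]

print(majority_vote(pred1, pred2, pred3))  # [1, 0, 1, 1, 0]
```

With three binary classifiers a tie cannot occur; with an even number of voters or more than two classes, ties are possible and need an explicit rule.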


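Alternatively, you can use the **VotingClassifier** class from sklearn, as follows. With voting='hard' it performs the same majority vote. Because this example combines only two models, the models can tie; in that case VotingClassifier selects the class that comes first in ascending sort order.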

In [14]:
from sklearn.ensemble import VotingClassifier
model1 = LogisticRegression(random_state=1)
model2 = DecisionTreeClassifier(random_state=1)
model = VotingClassifier(estimators=[('lr', model1), ('dt', model2)], voting='hard')
model.fit(x_train,y_train)
model.score(x_test,y_test)

0.9658536585365853

#### 2. Averaging

Similar to the max voting technique, multiple predictions are made for each data point in averaging. In this method, we take an average of predictions from all the models and use it to make the final prediction. **Averaging can be used for making predictions in regression problems or while calculating probabilities for classification problems.**

In [18]:
model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3= LogisticRegression()

model1.fit(x_train,y_train)
model2.fit(x_train,y_train)
model3.fit(x_train,y_train)

pred1=model1.predict_proba(x_test)
pred2=model2.predict_proba(x_test)
pred3=model3.predict_proba(x_test)

finalpred=(pred1+pred2+pred3)/3

In [19]:
#print(finalpred)
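The averaged array holds one probability per class, so a class label still has to be picked for each data point. A minimal sketch of that last step, assuming hypothetical `predict_proba` outputs for three models on three data points (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical predict_proba outputs, shape (n_samples, n_classes)
pred1 = np.array([[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]])
pred2 = np.array([[0.6, 0.4], [0.2, 0.8], [0.7, 0.3]])
pred3 = np.array([[0.3, 0.7], [0.3, 0.7], [0.6, 0.4]])

# Simple average of the class probabilities
finalpred = (pred1 + pred2 + pred3) / 3

# Pick the column with the highest averaged probability for each row
labels = finalpred.argmax(axis=1)
print(labels)
```

`argmax` returns column indices. With scikit-learn classifiers the columns of `predict_proba` follow the order of the model's `classes_` attribute, so for labels other than 0 and 1 the indices would need to be mapped back through `classes_`.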

#### 3. Weighted Averaging

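This is an extension of the averaging method. Each model is assigned a weight that reflects its importance for the prediction, and the final prediction is the weighted sum of the models' predictions. The weights should sum to 1 so that the combined probabilities remain valid probabilities; the example below uses 0.3, 0.3, and 0.4.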

In [29]:
model1 = DecisionTreeClassifier()
model2 = KNeighborsClassifier()
model3= LogisticRegression()

model1.fit(x_train,y_train)
model2.fit(x_train,y_train)
model3.fit(x_train,y_train)

pred1=model1.predict_proba(x_test)
pred2=model2.predict_proba(x_test)
pred3=model3.predict_proba(x_test)

finalpred=(pred1*0.3+pred2*0.3+pred3*0.4)

In [20]:
#finalpred
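To see how the weights can change the outcome, here is a minimal sketch that compares the plain and the weighted average on one data point. The probabilities are hypothetical; the weights are the 0.3, 0.3, 0.4 used above. `np.average` divides by the sum of the weights, so it also works when the weights do not add up to exactly 1.

```python
import numpy as np

# Hypothetical predict_proba outputs for a single data point, shape (1, n_classes)
pred1 = np.array([[0.70, 0.30]])
pred2 = np.array([[0.70, 0.30]])
pred3 = np.array([[0.15, 0.85]])

# Stack into shape (n_models, n_samples, n_classes)
preds = np.stack([pred1, pred2, pred3])
weights = np.array([0.3, 0.3, 0.4])

plain = preds.mean(axis=0)                             # every model counts equally
weighted = np.average(preds, axis=0, weights=weights)  # third model counts more

print("plain average:   ", plain, "->", plain.argmax(axis=1))
print("weighted average:", weighted, "->", weighted.argmax(axis=1))
```

Here the plain average favours class 0 while the weighted average favours class 1, because the third model carries more weight. scikit-learn's `VotingClassifier` offers the same idea through `voting='soft'` together with its `weights` argument.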

### 1. Bagging

### 2. Stacking

Stacking is an ensemble learning technique in which a second-level model, the meta-model, learns how to combine the predictions of several base models.

The basic idea behind stacking is to train several base models on the same dataset, then use their predictions as input features for the meta-model. The meta-model should be trained on predictions that the base models made for data they were not fitted on (out-of-fold or held-out predictions); otherwise it learns from over-optimistic base predictions. Once the meta-model is trained, it is used to predict the final outcome on new data.

**Below is a step-wise explanation for a simple stacked ensemble**

1. The train set is split into 10 parts.

<img style="float: left;" src="https://cdn.analyticsvidhya.com/wp-content/uploads/2018/05/image-11-768x555.png" width="35%"> 

2. A base model (suppose a decision tree) is fitted on 9 parts and predictions are made for the 10th part. This is done for each part of the train set.

<img style="float: left;" src="https://cdn.analyticsvidhya.com/wp-content/uploads/2018/05/image-10-768x638.png" width="30%"> 

3. The base model (in this case, decision tree) is then fitted on the whole train dataset.

4. Using this model, predictions are made on the test set.

<img style="float: left;" src="https://cdn.analyticsvidhya.com/wp-content/uploads/2018/05/image-2-768x577.png" width="35%"> 

5. Steps 2 to 4 are repeated for another base model (say knn) resulting in another set of predictions for the train set and test set.

<img style="float: left;" src="https://cdn.analyticsvidhya.com/wp-content/uploads/2018/05/image-3-768x573.png" width="35%"> 

6. The predictions from the train set are used as features to build a new model.

<img style="float: left;" src="https://cdn.analyticsvidhya.com/wp-content/uploads/2018/05/image12.png" width="25%"> 

7. This model is used to make final predictions on the test prediction set.
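The seven steps above can be sketched with scikit-learn's `cross_val_predict`, which returns out-of-fold predictions for every row of the train set. This is a minimal sketch under stated assumptions rather than the notebook's own code: it uses synthetic data from `make_classification` as a stand-in for the heart dataset, and the variable names (`base_models`, `meta_train`, `meta_test`, `meta_model`) are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the heart dataset used elsewhere)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

base_models = [DecisionTreeClassifier(random_state=1), KNeighborsClassifier()]

train_features, test_features = [], []
for base in base_models:
    # Steps 1-2: 10-fold out-of-fold class-1 probabilities for the train set
    oof = cross_val_predict(base, X_train, y_train, cv=10, method="predict_proba")[:, 1]
    train_features.append(oof)
    # Steps 3-4: refit on the whole train set, then predict the test set
    base.fit(X_train, y_train)
    test_features.append(base.predict_proba(X_test)[:, 1])

# Step 6: one column per base model becomes the meta-model's feature matrix
meta_train = np.column_stack(train_features)
meta_test = np.column_stack(test_features)

meta_model = LogisticRegression()
meta_model.fit(meta_train, y_train)

# Step 7: final predictions on the test-set predictions
y_pred = meta_model.predict(meta_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the stacked model:", accuracy)
```

scikit-learn also ships a `StackingClassifier` in `sklearn.ensemble` that wraps this procedure; the explicit loop is shown here only to mirror the steps one by one.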

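**Sample Code**

The code below uses a simplified hold-out variant of the procedure: instead of 10 folds, the train set is split once into two halves. The base models are fitted on the first half, and their predictions for the second half become the features for the meta-model.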

In [98]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

In [103]:
# SPLITTING THE DATASET
df = pd.read_csv("/Users/crystal/Desktop/Random Forest/heart.csv")
x = df.drop('target', axis = 1)
y = df['target']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [109]:
# Split the training data into two halves: one to fit the base models,
# one to fit the meta-model
train1, train2, y_train1, y_train2 = train_test_split(x_train, y_train, test_size=0.5, random_state=1)

# Train the first base model and predict class-1 probabilities
model1 = DecisionTreeClassifier(random_state=1)
model1.fit(train1, y_train1)
train_pred1 = model1.predict_proba(train2)[:, 1]
test_pred1 = model1.predict_proba(x_test)[:, 1]

# Train the second base model and predict class-1 probabilities
model2 = KNeighborsClassifier()
model2.fit(train1, y_train1)
train_pred2 = model2.predict_proba(train2)[:, 1]
test_pred2 = model2.predict_proba(x_test)[:, 1]

# Stack the base-model predictions column-wise:
# one row per data point, one column per base model
stacked_train = np.column_stack([train_pred1, train_pred2])
stacked_test = np.column_stack([test_pred1, test_pred2])

# Train a logistic regression meta-model on the stacked predictions
model = LogisticRegression()
model.fit(stacked_train, y_train2)

# Make predictions on the test set using the stacked model
y_pred = model.predict(stacked_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy score of the stacked model:", accuracy)


### 3. Boosting