<a href="https://colab.research.google.com/github/raj-vijay/ml/blob/master/03.Tree-based%20Models/09_Ensemble_Learning_Liver_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Ensemble Learning**
- Train several different models on the same dataset.
- Let each model make its own predictions.
- A meta-model aggregates the predictions of the individual models.
- The final prediction is typically more robust and less error-prone than that of any single model.
- Results are best when the models are skillful in different ways, so that their errors do not coincide.

![Ensemble prediction](https://raw.githubusercontent.com/raj-vijay/ml/master/images/Ensemble%20Prediction.png)
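The aggregation step can be illustrated by hand before any model is trained. The sketch below assumes three hypothetical models whose 0/1 predictions are made up for illustration (they are not taken from the liver dataset) and combines them by simple majority vote.

```python
import numpy as np

# Hypothetical 0/1 predictions from three models on five samples
# (illustrative values only, not taken from the liver dataset)
preds = np.array([
    [1, 0, 1, 1, 0],   # model A
    [1, 1, 0, 1, 0],   # model B
    [0, 0, 1, 1, 1],   # model C
])

# Majority vote: label a sample 1 when more than half of the models predict 1
votes_for_one = preds.sum(axis=0)
majority = (votes_for_one > preds.shape[0] / 2).astype(int)

print(majority)  # [1 0 1 1 0]
```

Each model is wrong on a different sample here, which is the situation in which voting helps most: the two correct models outvote the one that errs.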

In [None]:
!wget https://assets.datacamp.com/production/repositories/1796/datasets/24126c0cd9d2bd1ca0e72446c2caa40b222193d6/indian_liver_patient.zip

--2020-05-23 22:45:25--  https://assets.datacamp.com/production/repositories/1796/datasets/24126c0cd9d2bd1ca0e72446c2caa40b222193d6/indian_liver_patient.zip
Resolving assets.datacamp.com (assets.datacamp.com)... 13.224.166.92, 13.224.166.4, 13.224.166.96, ...
Connecting to assets.datacamp.com (assets.datacamp.com)|13.224.166.92|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31629 (31K)
Saving to: ‘indian_liver_patient.zip’


2020-05-23 22:45:26 (140 KB/s) - ‘indian_liver_patient.zip’ saved [31629/31629]



In [None]:
! unzip indian_liver_patient.zip

Archive:  indian_liver_patient.zip
  inflating: indian_liver_patient.csv  
   creating: __MACOSX/
  inflating: __MACOSX/._indian_liver_patient.csv  
  inflating: indian_liver_patient_preprocessed.csv  
  inflating: __MACOSX/._indian_liver_patient_preprocessed.csv  


In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('indian_liver_patient_preprocessed.csv')

In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,Age_std,Total_Bilirubin_std,Direct_Bilirubin_std,Alkaline_Phosphotase_std,Alamine_Aminotransferase_std,Aspartate_Aminotransferase_std,Total_Protiens_std,Albumin_std,Albumin_and_Globulin_Ratio_std,Is_male_std,Liver_disease
0,0,1.247403,-0.42032,-0.495414,-0.42887,-0.355832,-0.319111,0.293722,0.203446,-0.14739,0,1
1,1,1.062306,1.218936,1.423518,1.675083,-0.093573,-0.035962,0.939655,0.077462,-0.648461,1,1
2,2,1.062306,0.640375,0.926017,0.816243,-0.115428,-0.146459,0.478274,0.203446,-0.178707,1,1
3,3,0.815511,-0.372106,-0.388807,-0.449416,-0.36676,-0.312205,0.293722,0.329431,0.16578,1,1
4,4,1.679294,0.093956,0.179766,-0.395996,-0.295731,-0.177537,0.755102,-0.930414,-1.713237,1,1


In [None]:
# Feature matrix: drop the target column and the 'Unnamed: 0' row-counter column
X = np.array(df.drop(['Liver_disease', 'Unnamed: 0'], axis=1))

In [None]:
y = df['Liver_disease']

In [None]:
# Import functions to compute accuracy and split data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Import models, including VotingClassifier meta-model
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier
# Set seed for reproducibility
SEED = 1

Here, we instantiate three classifiers to predict whether a patient suffers from liver disease, using all the features in the dataset.

The classifiers are:
1. LogisticRegression
2. DecisionTreeClassifier
3. KNeighborsClassifier

In [None]:
# Instantiate lr
lr = LogisticRegression(random_state=SEED)

# Instantiate knn
knn = KNN(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list of (name, classifier) tuples
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

In [None]:
# Split dataset into 80% train, 20% test, preserving the class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=SEED)
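The stratify argument keeps the class proportions roughly equal in the train and test sets. A minimal sketch of that effect, assuming made-up labels with a 70/30 class balance (the names X_demo and y_demo are hypothetical and do not come from the liver dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 positives, 30 negatives
y_demo = np.array([1] * 70 + [0] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1
)

# Share of positives overall, in the training set, and in the test set
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

The three printed proportions should agree closely; without stratify, a small test set can end up with a noticeably different class balance.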

Now we evaluate the performance of each model in the classifiers list on the test set.

In [None]:
# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:

    # Fit clf to the training set
    clf.fit(X_train, y_train)

    # Predict the test set labels
    y_pred = clf.predict(X_test)

    # Calculate accuracy on the test set
    accuracy = accuracy_score(y_test, y_pred)

    # Report clf's accuracy
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

Logistic Regression : 0.672
K Nearest Neighbours : 0.698
Classification Tree : 0.664
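Accuracies from a single train/test split can shift when the split changes, so small gaps between models should be read with care. One way to check is k-fold cross-validation. The sketch below is an illustration only: it uses synthetic data from make_classification as a stand-in for the liver dataset, and the names X_demo, y_demo, and models are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

SEED = 1

# Synthetic stand-in data (an assumption, so the sketch runs without the CSV)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=SEED)

models = [
    ('Logistic Regression', LogisticRegression(random_state=SEED)),
    ('K Nearest Neighbours', KNeighborsClassifier(n_neighbors=27)),
    ('Classification Tree', DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)),
]

cv_means = {}
for name, model in models:
    # 5-fold cross-validated accuracy: one score per held-out fold
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_means[name] = scores.mean()
    print('{:s} : {:.3f} +/- {:.3f}'.format(name, scores.mean(), scores.std()))
```

The standard deviation across folds gives a rough sense of how much of the difference between two models could be due to the choice of split.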


**Combining the models with a Voting Classifier**

Now we evaluate a voting classifier that takes the outputs of the models defined in the classifiers list and assigns labels by majority voting. A voting classifier often, though not always, performs at least as well as its individual members.

In [None]:
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc (the default voting='hard' uses majority voting)
vc = VotingClassifier(estimators=classifiers)

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Predict the test set labels
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

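Voting Classifier: 0.664

On this split the voting classifier scores 0.664, the same as the classification tree and below both logistic regression (0.672) and K nearest neighbours (0.698). Majority voting is therefore not guaranteed to beat the best individual model.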
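Majority (hard) voting is only one way to aggregate predictions. VotingClassifier also accepts voting='soft', which averages the predicted class probabilities instead of counting votes. The sketch below compares the two modes on synthetic data from make_classification, used here as an assumed stand-in for the liver dataset; the names X_demo, y_demo, estimators, and results are hypothetical, and the scores it prints say nothing about how soft voting would perform on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

SEED = 1

# Synthetic stand-in data (an assumption, so the sketch runs without the CSV)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=SEED)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=SEED
)

estimators = [
    ('Logistic Regression', LogisticRegression(random_state=SEED)),
    ('K Nearest Neighbours', KNeighborsClassifier(n_neighbors=27)),
    ('Classification Tree', DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)),
]

results = {}
for mode in ('hard', 'soft'):
    # 'hard' counts predicted labels; 'soft' averages predicted probabilities
    vc_demo = VotingClassifier(estimators=estimators, voting=mode)
    vc_demo.fit(X_tr, y_tr)
    results[mode] = accuracy_score(y_te, vc_demo.predict(X_te))
    print('{:s} voting: {:.3f}'.format(mode, results[mode]))
```

Soft voting requires every estimator to implement predict_proba, which all three classifiers used here do.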
