# Ensemble Learning
-------

## 1. Advantages of CARTs
- Simple to understand.
- Simple to interpret.
- Easy to use.
- Flexibility: ability to describe non-linear dependencies.
- Preprocessing: no need to standardize or normalize features.
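The preprocessing point can be checked with a small experiment. The sketch below is only an illustration on synthetic data (the data, the rescaling factor of 1000, and all variable names are assumptions, not part of the project): a tree split depends on the ordering of feature values, so rescaling a feature should leave the predictions essentially unchanged.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, assumed for illustration only
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = (X[:, 0] + X[:, 1] ** 2 > 0.8).astype(int)

# Express the first feature in different units (e.g. grams instead of kilograms)
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0

# Same tree settings, fitted on the raw and on the rescaled features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_rescaled = DecisionTreeClassifier(random_state=0).fit(X_rescaled, y)

# New points, expressed in both unit systems
X_new = rng.uniform(0.0, 1.0, size=(50, 3))
X_new_rescaled = X_new.copy()
X_new_rescaled[:, 0] *= 1000.0

# Fraction of new points on which the two trees agree
agreement = np.mean(tree_raw.predict(X_new) == tree_rescaled.predict(X_new_rescaled))
print("Agreement between the two trees: {:.2f}".format(agreement))
```

Distance-based or gradient-based models such as KNN and logistic regression do not share this property, which is why the project below still scales the features.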

## 2. Limitations of CARTs
- Classification: can only produce orthogonal decision boundaries.
- Sensitive to small variations in the training set.
- High variance: unconstrained CARTs may overfit the training set.
- Solution: **ensemble learning**.
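The high-variance limitation can be made visible by comparing training and test accuracy. This is a minimal sketch on synthetic, noisy data (the dataset from `make_classification` and every parameter except `min_samples_leaf=0.13` are assumptions chosen for illustration): an unconstrained tree memorizes the training set, while a constrained tree usually shows a smaller gap.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic data with label noise (flip_y), assumed for illustration only
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Unconstrained tree: grows until every training sample is classified correctly
deep_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
# Constrained tree: every leaf must hold at least 13% of the training samples
small_tree = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=1).fit(X_train, y_train)

deep_train_acc = accuracy_score(y_train, deep_tree.predict(X_train))
deep_test_acc = accuracy_score(y_test, deep_tree.predict(X_test))
small_train_acc = accuracy_score(y_train, small_tree.predict(X_train))
small_test_acc = accuracy_score(y_test, small_tree.predict(X_test))

print("Unconstrained tree: train {:.3f}, test {:.3f}".format(deep_train_acc, deep_test_acc))
print("Constrained tree:   train {:.3f}, test {:.3f}".format(small_train_acc, small_test_acc))
```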

## 3. Ensemble Learning
- Train different models (e.g. SVM, KNN, linear models, decision trees) on the same dataset.
- Let each model make its predictions.
- Meta-model: aggregates predictions of individual models.
- Final prediction: more robust and less prone to errors.
- Best results: models are skillful in different ways.
- For more detail, see the scikit-learn documentation: https://scikit-learn.org/stable/modules/ensemble.html
![image.png](attachment:image.png)
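Why aggregation helps can be worked out with a little arithmetic. The sketch below assumes an idealized case that real models rarely satisfy: each of the N classifiers is correct with the same probability p, and their errors are independent. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p, n_models):
    """Probability that more than half of n_models independent classifiers,
    each correct with probability p, give the right answer."""
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n_models, k) * p ** k * (1 - p) ** (n_models - k)
               for k in range(k_min, n_models + 1))


# Three independent classifiers, each 70% accurate:
# 0.7^3 + 3 * 0.7^2 * 0.3 = 0.343 + 0.441 = 0.784
acc_3 = majority_vote_accuracy(0.7, 3)
print("Majority vote of 3 models: {:.3f}".format(acc_3))
```

Models trained on the same dataset make correlated errors, so the gain in practice is usually smaller than this idealized figure. This is why the best results come from models that are skillful in different ways.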

## 4. Ensemble Learning in Practice: Voting Classifier
- Binary classification task.
- N classifiers make predictions P1, P2, ..., PN, with each Pi = 0 or 1.
- Meta-model prediction: hard voting (majority of the predicted labels) or soft voting (average of the predicted class probabilities).

 ![image.png](attachment:image.png)
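The two voting rules can be sketched in a few lines of plain Python. The probabilities below are made-up values for three classifiers and three samples, and the helper names `hard_vote` and `soft_vote` are hypothetical; the example is only meant to show that the two rules can disagree.

```python
def hard_vote(labels_per_model):
    """Majority vote: label 1 wins when more than half of the models predict 1."""
    n_models = len(labels_per_model)
    return [int(2 * sum(votes) > n_models) for votes in zip(*labels_per_model)]


def soft_vote(probs_per_model):
    """Average the predicted P(class = 1) across models and threshold at 0.5."""
    return [int(sum(probs) / len(probs) > 0.5) for probs in zip(*probs_per_model)]


# Hypothetical P(class = 1) from three classifiers (rows) for three samples (columns)
probs = [[0.90, 0.40, 0.60],
         [0.45, 0.30, 0.70],
         [0.40, 0.90, 0.80]]

# Each classifier's hard label is its probability thresholded at 0.5
labels = [[int(p > 0.5) for p in row] for row in probs]

hard = hard_vote(labels)
soft = soft_vote(probs)
print(hard)  # [0, 0, 1]
print(soft)  # [1, 1, 1]
```

For the first two samples a single confident classifier outweighs two hesitant ones under soft voting, while hard voting counts every vote equally.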

### Project: Use three classifiers/estimators to predict whether a patient suffers from a liver disease using all the features present in the dataset.

[Read dataset](https://www.kaggle.com/jeevannagaraj/indian-liver-patient-dataset)

## Step 1: Import required modules

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import  LogisticRegression # base estimator 1
from sklearn.neighbors import KNeighborsClassifier as KNN # base estimator 2
from sklearn.tree import  DecisionTreeClassifier # base estimator 3
from sklearn.ensemble import VotingClassifier # meta-estimator
from sklearn.metrics import accuracy_score

## Step 2: Import Data

In [2]:
os.chdir("C:\\Users\\ramreddymyla\\Google Drive\\01 DS ML DL NLP and AI With Python Lab Copy\\02 Lab Data\\Python")
df = pd.read_csv("Indian Liver Patient Dataset (ILPD).csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
age                 583 non-null int64
gender              583 non-null object
tot_bilirubin       583 non-null float64
direct_bilirubin    583 non-null float64
tot_proteins        583 non-null int64
albumin             583 non-null int64
ag_ratio            583 non-null int64
sgpt                583 non-null float64
sgot                583 non-null float64
alkphos             579 non-null float64
is_patient          583 non-null int64
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB


In [4]:
df.head(10)

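| | age | gender | tot_bilirubin | direct_bilirubin | tot_proteins | albumin | ag_ratio | sgpt | sgot | alkphos | is_patient |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65 | Female | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.9 | 1 |
| 1 | 62 | Male | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 1 |
| 2 | 62 | Male | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 1 |
| 3 | 58 | Male | 1.0 | 0.4 | 182 | 14 | 20 | 6.8 | 3.4 | 1.0 | 1 |
| 4 | 72 | Male | 3.9 | 2.0 | 195 | 27 | 59 | 7.3 | 2.4 | 0.4 | 1 |
| 5 | 46 | Male | 1.8 | 0.7 | 208 | 19 | 14 | 7.6 | 4.4 | 1.3 | 1 |
| 6 | 26 | Female | 0.9 | 0.2 | 154 | 16 | 12 | 7.0 | 3.5 | 1.0 | 1 |
| 7 | 29 | Female | 0.9 | 0.3 | 202 | 14 | 11 | 6.7 | 3.6 | 1.1 | 1 |
| 8 | 17 | Male | 0.9 | 0.3 | 202 | 22 | 19 | 7.4 | 4.1 | 1.2 | 2 |
| 9 | 55 | Male | 0.7 | 0.2 | 290 | 53 | 58 | 6.8 | 3.4 | 1.0 | 1 |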


In [5]:
df['gender'] = np.where(df['gender'] == 'Female', 1, 2)

In [6]:
df.head()

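| | age | gender | tot_bilirubin | direct_bilirubin | tot_proteins | albumin | ag_ratio | sgpt | sgot | alkphos | is_patient |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65 | 1 | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.9 | 1 |
| 1 | 62 | 2 | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 1 |
| 2 | 62 | 2 | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 1 |
| 3 | 58 | 2 | 1.0 | 0.4 | 182 | 14 | 20 | 6.8 | 3.4 | 1.0 | 1 |
| 4 | 72 | 2 | 3.9 | 2.0 | 195 | 27 | 59 | 7.3 | 2.4 | 0.4 | 1 |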


In [7]:
# Convert the DataFrame to a float NumPy array
data = df.to_numpy().astype(np.float64)
# Drop rows that contain NaN values
data = data[~np.isnan(data).any(axis=1)].copy()

  """Entry point for launching an IPython kernel.


In [8]:
type(data)

numpy.ndarray

In [9]:
data.shape

(579, 11)

In [10]:
data[0:2]

array([[6.50e+01, 1.00e+00, 7.00e-01, 1.00e-01, 1.87e+02, 1.60e+01,
        1.80e+01, 6.80e+00, 3.30e+00, 9.00e-01, 1.00e+00],
       [6.20e+01, 2.00e+00, 1.09e+01, 5.50e+00, 6.99e+02, 6.40e+01,
        1.00e+02, 7.50e+00, 3.20e+00, 7.40e-01, 1.00e+00]])

## Step 3: Split Data

In [11]:
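# Set seed for reproducibility
SEED = 1
# Separate the features (all columns but the last) from the target (last column)
X, y = data[:, :-1], data[:, -1]
# Standardize the features to zero mean and unit variance.
# Note: fitting the scaler on the full dataset lets test-set statistics leak
# into training; fitting it on X_train only is the stricter approach.
X = StandardScaler().fit_transform(X)
# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)
print("train matrix: {}\ntest matrix: {}".format(X_train.shape, X_test.shape))

train matrix: (405, 10)
test matrix: (174, 10)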
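A stricter variant of this step fits the scaler on the training split only and then applies it to both splits. The sketch below uses random placeholder data with the same shape as the cleaned liver dataset (579 rows, 10 features); the placeholder values and the names `X_train_scaled` and `X_test_scaled` are assumptions, and the accuracies reported in this notebook come from the original scaling order, not from this variant.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

SEED = 1

# Placeholder data with the same shape as the cleaned dataset (illustration only)
rng = np.random.RandomState(SEED)
X = rng.normal(loc=50.0, scale=10.0, size=(579, 10))
y = rng.choice([1.0, 2.0], size=579)

# Split first, so the test set plays no part in estimating the scaling statistics
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# Learn the mean and standard deviation from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training statistics

print(X_train_scaled.shape, X_test_scaled.shape)
```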


## Step 4: Fit and Evaluate Individual Classifiers

In [12]:
# Instantiate lr
lr = LogisticRegression(random_state=SEED)

# Instantiate knn
knn = KNN(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list of (name, estimator) tuples
classifiers = [('Logistic Regression', lr), 
               ('K Nearest Neighbours', knn), 
               ('Classification Tree', dt)]

In [13]:
# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:

    # Fit clf to the training set
    clf.fit(X_train, y_train)

    # Predict the test set labels
    y_pred = clf.predict(X_test)

    # Calculate accuracy on the test set
    accuracy = accuracy_score(y_test, y_pred)

    # Print clf's name and test set accuracy
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

Logistic Regression : 0.753
K Nearest Neighbours : 0.718
Classification Tree : 0.730




**Which algorithm performs best on this dataset?**

<input type="radio" disabled> Classification Tree

<input type="radio" disabled> K Nearest Neighbours

<input type="radio" disabled checked> Logistic Regression

## Step 5: Fit VotingClassifier Model

In [14]:
VotingClassifier?

In [15]:
# Instantiate a VotingClassifier vc 
vc = VotingClassifier(estimators=classifiers)     

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Predict the test set labels
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))



Voting Classifier: 0.759
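`VotingClassifier` uses hard voting by default. As a hedged sketch of the soft-voting alternative, the example below sets `voting='soft'` on synthetic data (the dataset from `make_classification` and the short estimator names are assumptions, so the numbers are not comparable with the liver dataset results). It also checks that the soft-voting probabilities are the plain average of the base estimators' probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

SEED = 1

# Synthetic binary classification data, assumed for illustration only
X, y = make_classification(n_samples=500, n_features=10, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# Same three base estimators as in the project
estimators = [('lr', LogisticRegression(random_state=SEED)),
              ('knn', KNeighborsClassifier(n_neighbors=27)),
              ('dt', DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED))]

# Soft voting averages predict_proba across the fitted base estimators
vc_soft = VotingClassifier(estimators=estimators, voting='soft')
vc_soft.fit(X_train, y_train)

soft_proba = vc_soft.predict_proba(X_test)
manual_proba = np.mean([est.predict_proba(X_test) for est in vc_soft.estimators_], axis=0)

soft_accuracy = accuracy_score(y_test, vc_soft.predict(X_test))
print('Soft Voting Classifier: {:.3f}'.format(soft_accuracy))
```

Soft voting requires every base estimator to implement `predict_proba`, which holds for the three estimators used here.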


**Which algorithm performs best on this dataset?**

<input type="radio" disabled> Classification Tree

<input type="radio" disabled> K Nearest Neighbours

<input type="radio" disabled> Logistic Regression

<input type="radio" disabled checked> VotingClassifier