### Ensemble methods

**Max voting**
* Max voting is generally used for classification problems. Several models each predict a class for every data point, and each prediction counts as a ‘vote’. The class predicted by the majority of the models becomes the final prediction.
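
As a minimal sketch of the idea, the votes can be counted with NumPy. The prediction array and the `max_vote` helper are made up for illustration; the helper assumes integer class labels starting at 0 and resolves ties in favor of the lowest label.

```python
import numpy as np

# Hypothetical class predictions from three models for five data points
# (rows = models, columns = data points).
predictions = np.array([
    [0, 1, 1, 0, 1],   # model A
    [0, 1, 0, 0, 1],   # model B
    [1, 1, 1, 0, 0],   # model C
])

def max_vote(preds):
    """Return the most frequent class label in each column."""
    n_classes = preds.max() + 1
    # Count the votes for every class, one column (data point) at a time.
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return counts.argmax(axis=0)

final_prediction = max_vote(predictions)
print(final_prediction)  # [0 1 1 0 1]
```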

**Averaging**
* As in max voting, several models make a prediction for each data point. The final prediction is the average of the predictions from all the models. Averaging can be used for regression problems or for combining predicted class probabilities in classification problems.
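
A small sketch of both uses, with invented numbers: the first array holds regression predictions, the second holds predicted probabilities of the positive class. The 0.5 threshold is an assumption chosen for the example.

```python
import numpy as np

# Hypothetical regression predictions from three models for four data points
# (rows = models, columns = data points).
predictions = np.array([
    [10.0, 20.0, 30.0, 40.0],
    [12.0, 18.0, 33.0, 41.0],
    [14.0, 22.0, 27.0, 39.0],
])
average_prediction = predictions.mean(axis=0)
print(average_prediction)

# Hypothetical positive-class probabilities from three classifiers
# for two data points.
probabilities = np.array([
    [0.9, 0.4],
    [0.6, 0.3],
    [0.3, 0.2],
])
average_probability = probabilities.mean(axis=0)
# Turn the averaged probability into a class label (threshold is an assumption).
labels = (average_probability >= 0.5).astype(int)
print(average_probability, labels)
```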

**Weighted average**
* This is an extension of the averaging method. Each model is assigned a weight that reflects its importance, and the final prediction is the weighted average of the models’ predictions.
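
For illustration, a weighted average can be computed with `np.average`. The predictions and weights below are invented; in practice the weights might come from each model's validation score.

```python
import numpy as np

# Hypothetical predictions from three models for two data points
# (rows = models, columns = data points).
predictions = np.array([
    [10.0, 20.0],
    [20.0, 30.0],
    [30.0, 40.0],
])
# One weight per model (assumed values). np.average divides by the sum of
# the weights, so they do not have to add up to 1.
weights = [0.5, 0.3, 0.2]

weighted_prediction = np.average(predictions, axis=0, weights=weights)
print(weighted_prediction)
```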




---



**Bagging**
* The idea behind bagging is to combine the results of multiple models (for instance, many decision trees) to get a more generalized result.
* Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset by sampling with replacement.
* Bagging (short for Bootstrap Aggregating) trains one model on each of these subsets (bags) and aggregates their predictions, so that together the models get a fair idea of the distribution of the complete set.
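
The bootstrapping step can be sketched with NumPy as follows. The toy dataset, the number of bags, and the seed are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# A toy "dataset" of ten observations, identified by their index.
data = np.arange(10)
n_bags = 3

# Each bag has the same size as the original data and is drawn with
# replacement, so some observations repeat and others are left out.
bags = [rng.choice(data, size=len(data), replace=True) for _ in range(n_bags)]

for i, bag in enumerate(bags):
    print(f"bag {i}: {np.sort(bag)} unique observations: {len(np.unique(bag))}")
```

In bagging, one model would then be trained on each bag and the predictions combined by voting or averaging.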

**Random forest**
* Random forest is another ensemble machine learning algorithm that follows the bagging technique. The base estimators in a random forest are decision trees. Each tree is trained on a bootstrap sample of the rows, and at each node only a randomly selected subset of the features is considered when deciding the best split.
* Each individual tree in the random forest outputs a class prediction, and the class with the most votes becomes the model’s prediction.
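
The two sources of randomness can be sketched in isolation with NumPy. This only illustrates the sampling, not a full tree-building procedure, and the sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical sizes, chosen only for the illustration.
n_rows, n_features, max_features = 100, 8, 3

# Once per tree: a bootstrap sample of row indices (with replacement).
row_idx = rng.choice(n_rows, size=n_rows, replace=True)

# Once per split: a random subset of candidate features (without replacement).
feature_idx = rng.choice(n_features, size=max_features, replace=False)

print("distinct rows seen by this tree:", len(np.unique(row_idx)))
print("features considered at this split:", np.sort(feature_idx))
```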

**Boosting**
* If a data point is incorrectly predicted by the first model, and then by the next one (and probably by all the models), will combining the predictions give better results? Boosting is designed to handle such situations.
* Boosting is a sequential process in which each subsequent model attempts to correct the errors of the previous model, so each model depends on the ones before it.
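
One common way to make the next model focus on earlier errors is to reweight the data points. Below is a sketch of a single reweighting round in the style of classic (discrete) AdaBoost, using invented labels and predictions. It is meant to show the idea and is not a description of any particular library's implementation.

```python
import numpy as np

# Hypothetical true labels and the predictions of the first weak learner.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])   # the third point is misclassified

n = len(y_true)
weights = np.full(n, 1.0 / n)        # start with uniform weights

missed = y_pred != y_true
error = weights[missed].sum()        # weighted error rate of this learner
# Weight of this learner in the final vote (larger when the error is small).
alpha = 0.5 * np.log((1.0 - error) / error)

# Increase the weights of misclassified points, decrease the others,
# then renormalize so the weights sum to 1.
new_weights = weights * np.exp(np.where(missed, alpha, -alpha))
new_weights /= new_weights.sum()
print(new_weights)
```

After this round the misclassified point carries about half of the total weight, so the next learner is pushed to get it right.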


In [1]:
# Bagging using multiple kinds of models
import pandas
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

dataframe = pandas.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv")
print(dataframe.head(5))
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7

# Base estimator to bag; uncomment one alternative at a time to compare
Dtc = SVC() # 76%
#Dtc = GaussianNB() # 75%
#Dtc = KNeighborsClassifier() # 73%
#Dtc = DecisionTreeClassifier() # 76%

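# The base estimator is passed positionally: its keyword was renamed from
# base_estimator to estimator in scikit-learn 1.2.
model = BaggingClassifier(Dtc, n_estimators=100, random_state=seed, max_samples=300)
model.fit(X,Y)
print(model.score(X,Y)) # accuracy on the training data itself

results = model_selection.cross_val_score(model, X, Y, cv=20) # change cv and observe different outputs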

print(results)

print(results.mean())
print(results.min())
print(results.max())


   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  




0.7591145833333334




[0.74358974 0.76923077 0.71794872 0.71794872 0.71794872 0.66666667
 0.76923077 0.58974359 0.71052632 0.76315789 0.78947368 0.78947368
 0.68421053 0.78947368 0.73684211 0.84210526 0.78947368 0.73684211
 0.78947368 0.78947368]
0.7451417004048582
0.5897435897435898
0.8421052631578947


In [2]:
# Random forest
import pandas
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier

dataframe = pandas.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv")

array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

# No random_state is set, so the score varies slightly between runs
model = RandomForestClassifier(n_estimators=100, max_features=3)
results = model_selection.cross_val_score(model, X, Y, cv=5)
print(results.mean())

0.7696120872591461


In [3]:
# Boosting
# AdaBoost - short for Adaptive Boosting.
# AdaBoost is adaptive in the sense that subsequent weak learners are
# tweaked in favor of those instances misclassified by previous classifiers.
# A DecisionTreeClassifier with max_depth=1 (a decision stump) is used as the default estimator

import pandas
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier

dataframe = pandas.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv")

array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

model = AdaBoostClassifier(n_estimators=100, random_state=123)
results = model_selection.cross_val_score(model, X, Y, cv=5)
print(results.mean())

0.7617604617604619


In [4]:
# Voting Ensemble for Classification
import pandas
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier

import warnings
warnings.filterwarnings('ignore')

dataframe = pandas.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv")

array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
seed = 7

# create the sub models
estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
#model2 = RandomForestClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))
# create the ensemble model
# (hard, i.e. majority, voting is the default; weights scale each model's vote)
ensemble = VotingClassifier(estimators, weights=[3,2,3])
results = model_selection.cross_val_score(ensemble, X, Y, cv=10)
print(results.mean())

0.7695659603554341
