<h3> Voting Classifier </h3>

We train several different algorithms on the same data and predict the class that gets the most votes (a simple majority, not a weighted average). This is called a <b>hard voting classifier</b>.

Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the
best classifier in the ensemble. In fact, even if each classifier is a weak learner (meaning
it does only slightly better than random guessing), the ensemble can still be a
strong learner (achieving high accuracy), provided there are a sufficient number of
weak learners and they are sufficiently diverse.

<b><i>Ensemble methods work best when the predictors are as independent
from one another as possible. One way to get diverse classifiers
is to train them using very different algorithms. This increases the
chance that they will make very different types of errors, improving
the ensemble’s accuracy.</i></b>
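The claim about weak learners can be checked with a small simulation. This is only a sketch under idealized assumptions that are not in the original: a binary task, fully independent learners, and an assumed per-learner accuracy of 55%. Real classifiers trained on the same data are correlated, so the gain in practice is smaller.

```python
import random

random.seed(42)

n_learners = 501      # odd, so a binary majority vote can never tie
n_samples = 1000
p_correct = 0.55      # assumed accuracy of each independent weak learner

ensemble_hits = 0
for _ in range(n_samples):
    # Each learner votes independently; count how many get this sample right.
    correct_votes = sum(random.random() < p_correct for _ in range(n_learners))
    # The ensemble is right when more than half of the learners are right.
    if correct_votes > n_learners / 2:
        ensemble_hits += 1

ensemble_accuracy = ensemble_hits / n_samples
print("single learner accuracy: %.2f" % p_correct)
print("majority vote accuracy:  %.3f" % ensemble_accuracy)
```

Lowering `n_learners` or `p_correct` shows how quickly the advantage of the majority vote shrinks.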

In [1]:
from sklearn.datasets import load_iris

In [2]:
iris = load_iris()
X, y = iris['data'], iris['target']

In [63]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_predict
knn_clf = KNeighborsClassifier(n_neighbors=5)
y_predict = cross_val_predict(knn_clf, X, y)
from sklearn.metrics import accuracy_score
print("Knn accuracy is %.3f"%accuracy_score(y, y_predict))

Knn accuracy is 0.987


In [74]:
from sklearn.svm import LinearSVC
svc_clf = LinearSVC(loss="hinge", C=1)
y_predict = cross_val_predict(svc_clf, X, y)
print("Linear SVC accuracy is %.3f"%accuracy_score(y, y_predict))

Linear SVC accuracy is 0.947


In [81]:
from sklearn.tree import DecisionTreeClassifier
dt_clf = DecisionTreeClassifier(max_depth=3)
y_predict = cross_val_predict(dt_clf, X, y)
print("Decision Tree accuracy is %.3f"%accuracy_score(y, y_predict))

Decision Tree accuracy is 0.960


In [83]:
dt_clf = DecisionTreeClassifier(max_depth=3)
knn_clf = KNeighborsClassifier(n_neighbors=5)
svc_clf = LinearSVC(loss="hinge", C=1)
from sklearn.ensemble import VotingClassifier
vt_clf = VotingClassifier(
    estimators = [('dt_clf',dt_clf),('knn',knn_clf),('svc', svc_clf)],
    voting = 'hard'
)
y_predict = cross_val_predict(vt_clf, X, y)
print("Accuracy of hard voting classifier is %.3f"%accuracy_score(y, y_predict))

Accuracy of hard voting classifier is 0.973


<h4> A soft voting classifier averages the predicted class probabilities (optionally weighted) </h4>
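A toy illustration of how the two voting schemes can disagree. The probabilities below are made up for a single sample and three hypothetical classifiers; they are not taken from the models in this notebook.

```python
from collections import Counter

# Predicted probabilities [P(class 0), P(class 1)] from three hypothetical classifiers
probas = [
    [0.90, 0.10],   # very confident in class 0
    [0.40, 0.60],   # mildly prefers class 1
    [0.45, 0.55],   # mildly prefers class 1
]
n_classes = len(probas[0])

# Hard voting: each classifier casts one vote for its most likely class.
hard_votes = [max(range(n_classes), key=p.__getitem__) for p in probas]
hard_prediction = Counter(hard_votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then pick the largest mean.
mean_proba = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
soft_prediction = max(range(n_classes), key=mean_proba.__getitem__)

print("hard voting predicts class", hard_prediction)
print("soft voting predicts class", soft_prediction)
```

Hard voting follows the two mild votes, while soft voting lets the single confident classifier outweigh them.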

In [89]:
from sklearn.svm import SVC
dt_clf = DecisionTreeClassifier(max_depth=3)
knn_clf = KNeighborsClassifier(n_neighbors=5)
svc_clf = SVC(kernel='linear', C=1, probability=True)
from sklearn.ensemble import VotingClassifier
vt_clf = VotingClassifier(
    estimators = [('dt_clf',dt_clf),('knn',knn_clf),('svc', svc_clf)],
    voting = 'soft'
)
y_predict = cross_val_predict(vt_clf, X, y)
print("Accuracy of soft voting classifier is %.3f"%accuracy_score(y, y_predict))

Accuracy of soft voting classifier is 0.967


In [90]:
# Not sure why the voting classifiers get lower accuracy here than the best individual classifier (KNN)

<h3> Bagging and Pasting</h3>

Above, we gave the same training data to different classifiers and let them vote. Here we instead train the same model on different random subsets of the training data. When the subsets are sampled with replacement, this is called <b>Bagging</b> (short for bootstrap aggregating).

When the subsets are sampled without replacement, it is called <b>Pasting</b>.

 Each individual
predictor has a higher bias than if it were trained on the original training set, but
aggregation reduces both bias and variance.
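The difference between the two sampling schemes can be sketched with the standard library alone. The ten indices below are a stand-in for training instances, not real data.

```python
import random

random.seed(0)
training_indices = list(range(10))   # stand-in for 10 training instances
subset_size = 6

# Bagging: sample WITH replacement, so an instance may appear more than once.
bagging_subset = random.choices(training_indices, k=subset_size)

# Pasting: sample WITHOUT replacement, so every instance appears at most once.
pasting_subset = random.sample(training_indices, k=subset_size)

print("bagging:", sorted(bagging_subset))
print("pasting:", sorted(pasting_subset))
```

In `BaggingClassifier`, the same choice is made with `bootstrap=True` (bagging) or `bootstrap=False` (pasting).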

In [1]:
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()

In [2]:
X ,y = breast_cancer['data'], breast_cancer['target']

In [19]:
X.shape

(569, 30)

In [7]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, n_jobs=-1, max_samples=150, bootstrap=True)

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size = 0.2)

In [24]:
from sklearn.model_selection import cross_val_predict
y_predict = cross_val_predict(bag_clf, X, y, cv=10)

In [25]:
from sklearn.metrics import accuracy_score
accuracy_score(y, y_predict)

0.9525483304042179

<h4>Let's check the accuracy with fewer estimators </h4>

In [26]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, n_jobs=-1, max_samples=150, bootstrap=True)
from sklearn.model_selection import cross_val_predict
y_predict = cross_val_predict(bag_clf, X, y, cv=10)
from sklearn.metrics import accuracy_score
accuracy_score(y, y_predict)

0.9472759226713533

In [27]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=3, n_jobs=-1, max_samples=150, bootstrap=True)
from sklearn.model_selection import cross_val_predict
y_predict = cross_val_predict(bag_clf, X, y, cv=10)
from sklearn.metrics import accuracy_score
accuracy_score(y, y_predict)

0.9349736379613357

<h3> Out of Bag Evaluation </h3>

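In BaggingClassifier with bootstrap sampling, each estimator is trained on a random sample of the training set. When the sample is as large as the training set, only about 63% of the distinct instances end up in it on average. The interesting thing is that the remaining instances (about 37%), called out-of-bag (oob) instances, were never seen by that estimator, so it can be evaluated on them without a separate validation set.

The final out-of-bag score of the ensemble is obtained by averaging the oob evaluations of all predictors.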
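The size of the out-of-bag share can be estimated with a quick simulation. This sketch assumes a bootstrap sample as large as the training set. The cells in this notebook use `max_samples=150`, so the out-of-bag share per predictor is considerably larger there.

```python
import math
import random

random.seed(0)
m = 10_000   # training set size (arbitrary choice for the simulation)

# Draw m indices with replacement, as one bootstrap sample would.
bootstrap_sample = [random.randrange(m) for _ in range(m)]
in_bag = set(bootstrap_sample)

# Instances that were never drawn are out-of-bag for this predictor.
oob_fraction = 1 - len(in_bag) / m

print("simulated out-of-bag fraction: %.3f" % oob_fraction)
print("theoretical limit 1/e:         %.3f" % math.exp(-1))
```

The limit follows from (1 - 1/m)^m, the chance that a given instance is never drawn, which tends to 1/e as m grows.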

In [30]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, n_jobs=-1, max_samples=150, bootstrap=True, oob_score=True)
from sklearn.model_selection import cross_val_predict
y_predict = cross_val_predict(bag_clf, X, y, cv=10)
from sklearn.metrics import accuracy_score
cross_validation_accuracy = accuracy_score(y, y_predict)
print("Cross Validation Accuracy is %.3f"%cross_validation_accuracy)

Cross Validation Accuracy is 0.960


In [31]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, n_jobs=-1, max_samples=150, bootstrap=True, oob_score=True)
bag_clf.fit(X, y)
oob_score = bag_clf.oob_score_
print("oob score is %.3f"%oob_score)

oob score is 0.951


The OOB score is fairly close to the cross-validation accuracy. Increasing n_estimators should bring the two even closer.

<h3>Random Patches and Random Subspaces</h3>

BaggingClassifier supports sampling the features as well, just like sampling training instances. This is controlled by two hyperparameters:
<br>
<br>
<br>
<i>
max_features works like max_samples, but for features
<br>
bootstrap_features works like bootstrap, but for features
</i>
<br>
<br>
Sampling both training instances and features is called the <b>Random Patches</b> method.
<br><br>
Keeping all training instances (bootstrap=False and max_samples=1.0) but sampling features is called the <b>Random Subspaces</b> method.
<br><br>
Sampling features results in even more predictor diversity, trading a bit more bias for
a lower variance.
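A possible Random Subspaces configuration on the same breast cancer data, as a sketch. The values `n_estimators=50`, `max_features=10` and `random_state=42` are arbitrary choices for illustration, not tuned settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

subspace_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=1.0, bootstrap=False,           # keep every training instance
    max_features=10, bootstrap_features=True,   # draw 10 features per tree, with replacement
    random_state=42,
)

# 3-fold cross-validated accuracy of the Random Subspaces ensemble
subspace_scores = cross_val_score(subspace_clf, X, y, cv=3)
print("Random Subspaces accuracy: %.3f" % subspace_scores.mean())
```

Setting `bootstrap=True` with `max_samples` below 1.0 turns the same ensemble into Random Patches.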

In [37]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, n_jobs=-1, max_samples=300, bootstrap=True, oob_score=True, max_features=10, bootstrap_features=True)
y_predict = cross_val_predict(bag_clf, X, y, cv=3)
accuracy_score(y, y_predict)

0.9595782073813708

In [35]:
X.shape

(569, 30)

<h3> Feature Importance with Random Forest </h3>

In [41]:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators = 150, n_jobs = -1)
random_forest.fit(X, y)
for name, score in zip(breast_cancer['feature_names'], random_forest.feature_importances_):
    print("%20s %.3f"%(name, score*100))

         mean radius 4.179
        mean texture 1.530
      mean perimeter 5.491
           mean area 3.661
     mean smoothness 0.589
    mean compactness 1.460
      mean concavity 4.618
 mean concave points 7.466
       mean symmetry 0.371
mean fractal dimension 0.315
        radius error 1.952
       texture error 0.463
     perimeter error 1.479
          area error 3.173
    smoothness error 0.468
   compactness error 0.632
     concavity error 0.491
concave points error 0.446
      symmetry error 0.453
fractal dimension error 0.475
        worst radius 12.529
       worst texture 1.806
     worst perimeter 12.104
          worst area 12.761
    worst smoothness 1.410
   worst compactness 1.818
     worst concavity 2.265
worst concave points 13.641
      worst symmetry 1.302
worst fractal dimension 0.651


In [38]:
breast_cancer.keys()

dict_keys(['DESCR', 'target', 'data', 'target_names', 'feature_names'])

In [42]:
from sklearn.datasets import load_iris
iris = load_iris()
iris_X, iris_y = iris['data'], iris['target']

In [43]:
random_forest = RandomForestClassifier(n_estimators = 150, n_jobs = -1)
random_forest.fit(iris_X, iris_y)
for name, score in zip(iris['feature_names'], random_forest.feature_importances_):
    print("%20s %.3f"%(name, score*100))

   sepal length (cm) 10.361
    sepal width (cm) 2.303
   petal length (cm) 45.972
    petal width (cm) 41.364


<h2> Boosting </h2>

Boosting is an ensemble method that combines several weak learners into a strong learner.
<br><br>
The basic idea is to train several predictors sequentially, each one trying to correct its predecessor.<br>
<br>
1. AdaBoost (Adaptive Boosting)
2. Gradient Boosting

Because boosting is sequential, its training cannot be parallelized the way bagging can, so it does not scale as well.
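The idea of each predictor correcting its predecessor can be sketched by fitting small regression trees to the residual errors of the trees before them, which is the gradient boosting flavour of the idea. The synthetic quadratic data, the tree depth and the number of trees are all assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy quadratic data (made up for this sketch)
rng = np.random.default_rng(42)
X_toy = rng.uniform(-0.5, 0.5, size=(200, 1))
y_toy = 3 * X_toy[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

trees = []
residual = y_toy.copy()
for _ in range(3):
    # Each new tree is trained on what the previous trees still get wrong.
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_toy, residual)
    residual = residual - tree.predict(X_toy)
    trees.append(tree)

# The ensemble prediction is the sum of the predictions of all trees.
ensemble_pred = sum(tree.predict(X_toy) for tree in trees)
first_tree_mse = np.mean((y_toy - trees[0].predict(X_toy)) ** 2)
ensemble_mse = np.mean((y_toy - ensemble_pred) ** 2)

print("training MSE, first tree only: %.5f" % first_tree_mse)
print("training MSE, three trees:     %.5f" % ensemble_mse)
```

The loop also shows why training is sequential: each tree needs the residuals left by the previous ones before it can be fitted.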

<h3> Ada Boost </h3>

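Scikit-Learn actually uses a multiclass version of AdaBoost called SAMME (which
stands for Stagewise Additive Modeling using a Multiclass Exponential loss function).
When there are just two classes, SAMME is equivalent to AdaBoost.

If the predictors can estimate class probabilities (i.e. they have a predict_proba() method), Scikit-Learn can use a variant of <b>SAMME</b> called <b>SAMME.R</b> (the R stands for "Real"), which relies on class probabilities rather than predictions.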

In [44]:
from sklearn.ensemble import AdaBoostClassifier
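# Note: newer scikit-learn releases deprecate the "SAMME.R" option; if this line warns or fails, drop the algorithm argument
ada_boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200, algorithm="SAMME.R", learning_rate=0.5)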

In [48]:
y_predict  = cross_val_predict(ada_boost, X, y)
accuracy_score(y, y_predict)

0.9718804920913884

<h5> AdaBoost achieves very good accuracy </h5>

If your AdaBoost ensemble is overfitting the training set, regularize it by reducing the number of estimators or by more strongly regularizing the base estimator.
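One way to try the first regularization option is to compare cross-validated accuracy for a small and a large number of estimators. This is a sketch: the values 20 and 200 and `random_state=42` are arbitrary, and the `algorithm` argument is left at its default so the snippet does not depend on a particular scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

ada_scores = {}
for n_estimators in (20, 200):
    ada = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1),   # decision stumps as weak learners
        n_estimators=n_estimators,
        learning_rate=0.5,
        random_state=42,
    )
    # Mean 3-fold cross-validated accuracy for this ensemble size
    ada_scores[n_estimators] = cross_val_score(ada, X, y, cv=3).mean()

for n_estimators, score in ada_scores.items():
    print("n_estimators=%3d  accuracy=%.3f" % (n_estimators, score))
```

If the smaller ensemble scores about as well as the larger one, the extra estimators are not buying much and can be dropped.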

<h3>Gradient Boosting..</h3>

Gradient boosting is usually introduced with regression tasks (although it also works for classification, e.g. GradientBoostingClassifier), so it is skipped for now.

<h3>Model Stacking </h3>

Model stacking is very important; many people use it in Kaggle competitions.
<h5> Few tips </h5>
<ol>
    <li>Pick base estimators with varied structure</li>
    <li>Pick a meta estimator that can handle highly correlated inputs</li>
    <li>Simple meta estimators can aid interpretability (e.g. L2-penalized logistic regression)</li>
    <li>Feed the meta estimator continuous outputs from the base estimators, from methods like <b>predict_proba</b> or <b>decision_function</b>, rather than hard 0/1 class predictions</li>
</ol>
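The tips above can be sketched with scikit-learn's `StackingClassifier`. The choice of base estimators, their hyperparameters and the breast cancer dataset are assumptions made for illustration, not a recommended setup.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

stack_clf = StackingClassifier(
    # Base estimators with varied structure: a tree, a neighbour model, a kernel SVM
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=3, random_state=42)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("svc", SVC(kernel="rbf", C=1)),
    ],
    # Simple meta estimator; LogisticRegression is L2-penalized by default
    final_estimator=LogisticRegression(),
    # "auto" feeds predict_proba or decision_function outputs to the meta estimator
    stack_method="auto",
    cv=5,   # meta estimator is trained on out-of-fold base predictions
)

stack_scores = cross_val_score(stack_clf, X, y, cv=3)
print("Stacking accuracy: %.3f" % stack_scores.mean())
```

The inner `cv=5` keeps the meta estimator from seeing base predictions made on data the base estimators were trained on.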

<h5> Does this Work? </h5>
1. Asymptotically, a stacked estimator is no worse than your best base estimator
2. Anecdotally, we have found stacked models make great push-button estimators, especially for our 'medium' data

https://www.youtube.com/watch?v=3gpf1lGwecA