# Effects of changing hyperparameters
## Caution: This notebook is math-heavy and not for the faint-hearted.

I've seen people try to build highly accurate models without any insight into how to do so. They either set hyperparameters at random or don't bother changing them at all. They ask for more data and say nothing better can be done with what they have.

In this notebook I try to explain what effect changing a model's hyperparameters has on its accuracy. First I explain **where the parameters come from and why they are introduced**, and then I show the **effect of those parameters** on the model.

This notebook contains a lot of math, so I recommend going slowly. It might give you a headache at the start, but by the end you'll see it's worth the trouble.

Before reading further think of **regularization** as a way to penalize large values of learned weights $\theta$ and avoid overfitting.

# Contents:
1. **Read data and see small portion of it**


2. **Normalize data**


3. **Create Train/Test split**


4. **Basic notations**


5.  **Logistic Regression**

    5.1  Effect of value of C on L2 and L1 Regularization
    
    5.2  Weights for L2 regularization
    
    5.3  Weights for L1 Regularization
    
    5.4  Significance of type of regularization used
    
    5.5  Effect of solver on accuracy


6.  **Naive Bayes**
    
    6.1  Basic Notations
    
    6.2  Laplace smoothing


7.  **K Nearest Neighbors**
    
    7.1  Algorithm vs Accuracy


8.  **Support Vector Machine**

    8.1  Hyperparameters:

    8.2  Effect of type of kernel on accuracy

    8.3  Effect of value of C:
    
    8.4  Effect of value of $\gamma$ in RBF kernel


9.  **Decision Trees**

    9.1  Max Depth vs Accuracy

    9.2  Min samples split vs Accuracy
    
    9.3  Min samples leaf vs Accuracy
    
    9.4  Max leaf Nodes vs Accuracy


10.  **Random Forest (Ensemble method)**


11.  **Boosting**


12.  **Gradient Boosting Classifier**
    
    12.1  Learning rate vs Accuracy
    
    12.2  Number of estimators
    
    12.3  Subsampling vs Accuracy


13.  **AdaBoostClassifier**
    
    13.1  Base Estimator vs Accuracy
    
    13.2  Number of estimators vs Accuracy
    
    13.3  Algorithm vs Accuracy


14.  **Last trick in the arsenal: Voting classifier**
    
    14.1  Voting Strategy Vs Accuracy


15.  **Conclusion**


16. **Acknowledgement**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import os

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import VotingClassifier

# Read data and see small portion of it
* We read data as a pandas DataFrame object.
* Then look at the fields in the data.
* Then have a look at some records of data to get an idea on what we are dealing with.
* For a full data analysis we would want to do more than just this, but for our purpose that's enough information about the data.

In [None]:
data = pd.read_csv("../input/heart.csv")
data.info()
data.head()

# Normalize data
* Normalization here means min-max scaling: each feature is rescaled to the range $[0, 1]$ using its minimum and maximum values. This is useful if you plan to use a quadratic form such as the dot product or any other kernel to quantify the similarity of a pair of samples, because features on large scales would otherwise dominate.
* For polynomial kernels in SVM, training can take very long if the data is not normalized.
* Normalization makes many algorithms converge faster and can noticeably improve accuracy.
$$
X_{changed} = \frac{X-X_{min}}{X_{max}-X_{min}}
$$


In [None]:
x_data = data.drop(columns='target')
# Min-max scale every feature (column) to the range [0, 1]
x_data = (x_data - x_data.min()) / (x_data.max() - x_data.min())
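To make the min-max formula above concrete, here is a minimal sketch on a made-up toy matrix (the numbers and the names `X_toy`, `X_manual`, `X_sklearn` are illustrative assumptions, not part of the heart dataset). It applies the formula by hand and checks it against scikit-learn's `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (made-up numbers): rows are samples, columns are features.
X_toy = np.array([[29.0, 120.0],
                  [54.0, 200.0],
                  [77.0, 160.0]])

# Min-max scaling by hand, one column (feature) at a time.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation using scikit-learn.
X_sklearn = MinMaxScaler().fit_transform(X_toy)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```

After scaling, every column has minimum 0 and maximum 1, so no single feature dominates distance or dot-product computations.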

# Create Train/Test split
* Split the data into two parts.
* 80% to train the model and 20% to test the model.
* The `train_test_split` function has its `shuffle` parameter set to `True` by default, so the data is shuffled automatically before splitting.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    x_data, data['target'], test_size=0.2, random_state=0)
print("Number of training examples: {0}".format(X_train.shape[0]))
print("Number of features for a single example in the dataset: {0}".format(X_train.shape[1]))
print("Number of test examples: {0}".format(X_test.shape[0]))

# Basic notations
* {$x^{(i)}, y^{(i)}$} represents a training example in the dataset. **Note**: the superscript in $x^{(i)}$ has nothing to do with exponentiation; it just denotes the $i$-th example in the dataset.
* $x$ represents the features of an example in the dataset (age, sex, cp etc.).
* $ y \in \{0,1\} $, where $0, 1$ correspond to having or not having heart disease respectively (target).
* $m$ is the number of training examples in the dataset ($242$ in our case).
* $h_\theta(x)$ is the hypothesis function.
* $n$ is the number of input features ($13$ in our case).
* $h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n$.
* $h_\theta(x) = \sum\limits_{i=0}^{n} \theta_ix_i$, where we take $x_0 = 1$.
* $X$ represents all training examples stacked row-wise, one example per row.

$
X = \begin{bmatrix}x^{(1)} \\ x^{(2)} \\ x^{(3)} \\ ... \\ x^{(m)}\end{bmatrix}
$

* $\overrightarrow y$ represents the labels of the training examples.

$
\overrightarrow y = \begin{bmatrix}y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ ... \\ y^{(m)}\end{bmatrix}
$
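As a small illustration of the notation above, the following sketch evaluates $h_\theta(x)$ for one example and for a stacked matrix $X$. The weights and feature values are made up, and the helper name `hypothesis` is an assumption introduced only for this example.

```python
import numpy as np

def hypothesis(theta, x):
    """Compute h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n, taking x_0 = 1."""
    x_with_bias = np.concatenate(([1.0], x))  # prepend x_0 = 1
    return float(theta @ x_with_bias)

# Made-up weights for n = 2 features: theta_0, theta_1, theta_2.
theta_example = np.array([0.5, 2.0, -1.0])
x_example = np.array([3.0, 4.0])

# 0.5 + 2.0*3.0 - 1.0*4.0
h_value = hypothesis(theta_example, x_example)

# With the examples stacked as rows (x_0 = 1 in the first column),
# all hypotheses are obtained with a single matrix-vector product.
X_stacked = np.array([[1.0, 3.0, 4.0],
                      [1.0, 0.0, 1.0]])
h_all = X_stacked @ theta_example

print(h_value, h_all)
```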

# Logistic Regression
As an optimization problem, binary class L2 penalized logistic regression minimizes the following cost function:

$
\min_{w, c} \frac{1}{2}w^T w + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1) .
$

Similarly, L1 regularized logistic regression solves the following optimization problem

$
\min_{w, c} \|w\|_1 + C \sum_{i=1}^n \log(\exp(- y_i (X_i^T w + c)) + 1).
$

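Note that, in this notation, it's assumed that the observation $y_i$ takes values in the set $\{-1, 1\}$ at trial $i$, and the sum runs over the training examples (so $n$ here plays the role of $m$ from the notation section).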
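The two objectives above can be evaluated directly with NumPy. This is only a sketch for building intuition: the function name `logistic_objective` and the tiny dataset are assumptions made for illustration, and scikit-learn's solvers minimize these objectives rather than merely evaluating them.

```python
import numpy as np

def logistic_objective(w, c, X, y, C=1.0, penalty="l2"):
    """Evaluate the penalized logistic regression objective for labels y in {-1, +1}."""
    margins = y * (X @ w + c)
    # log(exp(-margin) + 1), computed in a numerically stable way
    data_term = C * np.sum(np.logaddexp(0.0, -margins))
    if penalty == "l2":
        reg_term = 0.5 * (w @ w)
    else:  # "l1"
        reg_term = np.sum(np.abs(w))
    return reg_term + data_term

# Made-up toy data: 4 examples, 2 features.
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y_toy = np.array([1.0, -1.0, 1.0, -1.0])
w_toy = np.array([1.0, -2.0])

obj_zero = logistic_objective(np.zeros(2), 0.0, X_toy, y_toy, C=2.0)
obj_l2 = logistic_objective(w_toy, 0.0, X_toy, y_toy, C=1.0, penalty="l2")
obj_l1 = logistic_objective(w_toy, 0.0, X_toy, y_toy, C=1.0, penalty="l1")
print(obj_zero, obj_l2, obj_l1)
```

With all weights at zero every example contributes $\log 2$, so the objective equals $C \cdot 4 \cdot \log 2$. The L1 and L2 values differ only through the penalty term.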


## Effect of value of C on L2 and L1 Regularization
* **C** is the inverse of regularization strength; it must be a positive float. Smaller values specify stronger regularization.
* If you don't want regularization, you need to set $C$ to a very large value.

In [None]:
C_values = [0.01, 0.1, 0.5, 1, 5, 10, 100, 1000, 1e42]
accuracies = []
weights_l2 = []

for C in C_values:
    clf_l2 = LogisticRegression(penalty='l2',
                             tol=0.0001,
                             C=C,
                             fit_intercept=True,
                             solver='liblinear',
    )

    # Train
    clf_l2.fit(X_train, y_train)
    
    # Store weights for further analysis
    weights_l2.append(clf_l2.coef_[0])
    
    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf_l2.score(X_test, y_test))

fig, ax = plt.subplots(ncols=2)

fig.set_size_inches(14, 4)

ax[0].set_xlabel("Value of C", fontsize=16)
ax[0].set_ylabel("Accuracy", fontsize=16)
ax[0].set_title("L2 Regularization", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax[0].text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=C_values, y=accuracies, ax=ax[0])


# For L1 regularization
accuracies = []
weights_l1 = []
for C in C_values:
    clf_l1 = LogisticRegression(penalty='l1',
                             C=C,
                             fit_intercept=True,
                             solver='liblinear',
    )

    # Train
    clf_l1.fit(X_train, y_train)

    # Store weights for further analysis
    weights_l1.append(clf_l1.coef_[0])

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf_l1.score(X_test, y_test))

ax[1].set_xlabel("Value of C", fontsize=16)
ax[1].set_ylabel("Accuracy", fontsize=16)
ax[1].set_title("L1 Regularization", fontsize=20)

for i, accuracy in enumerate(accuracies):
    ax[1].text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=C_values, y=accuracies, ax=ax[1])

* If we choose $C$ too small, the result actually turns out worse than using no regularization at all.

Now, let's have a look at the weights for too much regularization, for a value of $C$ that gave good accuracy, and for effectively no regularization.

## For L2 regularization

In [None]:
print("Value of C: %s" % C_values[0])
print("Weights:")
print(weights_l2[0])

print("Value of C: %s" % C_values[2])
print("Weights:")
print(weights_l2[2])

print("Value of C: %s" % C_values[-1])
print("Weights:")
print(weights_l2[-1])


## For L1 Regularization

In [None]:
print("Value of C: %s" % C_values[0])
print("Weights:")
print(weights_l1[0])

print("Value of C: %s" % C_values[2])
print("Weights:")
print(weights_l1[2])

print("Value of C: %s" % C_values[-1])
print("Weights:")
print(weights_l1[-1])

## Significance of type of regularization used
* The interesting part above is that L1 regularization with strong regularization gives sparse weights, i.e. using just 8 non-zero weights we got an accuracy of 82%, which can be seen as a form of feature selection.
* For $C = 10^{42}$ we get almost identical weights for both types of regularization.
* For $C$ around 1.0 we get the most out of regularization. The weights turn out to be very small in the case of L1 regularization, whereas for L2 regularization the weights are somewhat larger but still significantly smaller than in the no-regularization case.
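The sparsity effect can be checked on data other than this notebook's. The sketch below uses a synthetic dataset from `make_classification` (an assumption made so the example is self-contained, not the heart data) and counts the non-zero weights under strong L1 and L2 regularization.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with 13 features, only a few of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=4, n_redundant=2, random_state=0)

nonzero_counts = {}
for penalty in ("l1", "l2"):
    # Small C means strong regularization.
    clf = LogisticRegression(penalty=penalty, C=0.05, solver="liblinear")
    clf.fit(X_demo, y_demo)
    nonzero_counts[penalty] = int(np.count_nonzero(clf.coef_))

print(nonzero_counts)
```

L1 typically drives several weights to exactly zero, while L2 only shrinks them, which is why L1 can double as a feature selection step.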

## Effect of solver on accuracy
* Let's fix the value of C to 1 and see how changing the solver affects the accuracy.
* **Note:** **'newton-cg', 'lbfgs' and 'sag'** only handle the **L2** penalty, whereas **'liblinear' and 'saga'** also handle the **L1** penalty.


In [None]:
C = 1.0
solvers = ['newton-cg', 'lbfgs', 'sag', 'saga', 'liblinear']

# For L2 regularization
fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

accuracies = []
for solver in solvers:
    clf = LogisticRegression(penalty='l2',
                             C=C,
                             fit_intercept=True,
                             solver=solver,
                             max_iter=500,
    )

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

ax.set_title('L2 Regularization', fontsize=20)
ax.set_xlabel("Solver", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
# Annotate each bar once with its accuracy
for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracy, 2), color='black', ha="center", fontsize=14)

sns.barplot(x=solvers, y=accuracies, ax=ax)

# For L1 regularization
fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

accuracies = []
for solver in solvers[3:]:
    clf = LogisticRegression(penalty='l1',
                             C=C,
                             fit_intercept=True,
                             solver=solver,
                             max_iter=200,
    )

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

ax.set_title('L1 Regularization', fontsize=20)
ax.set_xlabel("Solver", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
# Annotate each bar once with its accuracy
for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracy, 2), color='black', ha="center", fontsize=14)

sns.barplot(x=solvers[3:], y=accuracies, ax=ax)


* We can't infer a lot from this apart from the fact that **liblinear** does slightly better than the others. That may not always be the case.
* The reason we don't have many choices for L1 regularization is that the $||w||_1$ term is not differentiable at zero, so solvers that rely on derivatives don't support it. Those that do support it have to use techniques such as the [Subgradient method](https://en.m.wikipedia.org/wiki/Subgradient_method?wprov=sfla1) to make it work.
* **'lbfgs'** can take many iterations to converge, which can be time-consuming on large datasets.

# Naive Bayes
* Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable.

* We will use Bernoulli Naive Bayes for our case. Note that it models each *feature* as a binary variable; scikit-learn's `BernoulliNB` binarizes the inputs with a threshold (`binarize=0.0` by default), so every scaled feature value above 0 is treated as 1.
* The decision rule for Bernoulli naive Bayes is based on:
$
\: P(x_i \mid y) = P(i \mid y) x_i + (1 - P(i \mid y)) (1 - x_i)
$
. It explicitly penalizes the non-occurrence of a feature that is an indicator for class $y$.

* Naive Bayes is also naive in terms of hyperparameters: we can only experiment with smoothing. Let's first see where this parameter alpha comes from.

## Basic Notations:
* Naive Bayes is a generative learning algorithm, i.e. instead of modeling $p(y|x)$ directly (as logistic regression does), we try to model $p(x|y)$ (and $p(y)$).
* For instance, if $y$ indicates whether an example is a dog (0) or an elephant (1), then $p(x|y=0)$ models the distribution of dogs' features, and $p(x|y=1)$ models the distribution of elephants' features.
* After modeling $p(y)$ (called the class priors) and $p(x|y)$, our algorithm can use Bayes' rule to derive $p(y|x)$:
$$
    \: p(y|x) = \frac{p(x|y)p(y)}{p(x)}
$$
* Here $p(x) = p(x|y=1)p(y=1) + p(x|y=0)p(y=0)$. Thus this can be expressed in terms of the quantities we have learned.
* In order to make predictions we don't actually need the $p(x)$ term.
$$
\arg\max_y p(y|x) = \arg\max_y \frac{p(x|y)p(y)}{p(x)}
$$
$$
= \arg\max_y p(x|y)p(y)
$$

* Let, 

$
y \sim Bernoulli(\phi)
$

$
p(x_i = 1| y=1) = \phi_{i|y=1}
$

$
p(x_i = 1 | y=0) = \phi_{i|y=0}
$

$
p(y=1) = \phi_y
$

* So parameters we want to learn are $\phi_{i|y=1}$, $\phi_{i|y=0}$, $\phi_y$

* We can write down the joint likelihood of the data as:
 $$
 \mathcal{L}(\phi_y, \phi_{i|y=0}, \phi_{i|y=1}) = \prod\limits_{i=1}^m{p(x^{(i)}, y^{(i)})}
 $$
* Maximizing the above equation w.r.t. $\phi_y, \phi_{i|y=0}, \phi_{i|y=1}$ gives us our trained parameters.
* To predict on a new example:
$$
p(y=1|x) = \frac{p(x|y=1)p(y=1)}{p(x)}
$$
$$
= \frac{(\prod\limits_{i=1}^n{p(x_i|y=1)})\ p(y=1)}{(\prod\limits_{i=1}^n{p(x_i|y=1)})\ p(y=1)\ +\ (\prod\limits_{i=1}^n{p(x_i|y=0)})\ p(y=0)}
$$
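The prediction formula above can be traced with a few lines of NumPy. The parameter values below are made up for illustration, and `bernoulli_nb_posterior` is a hypothetical helper name, not a scikit-learn function.

```python
import numpy as np

def bernoulli_nb_posterior(x, phi_y, phi_given_1, phi_given_0):
    """Return p(y=1 | x) for binary features x under the naive Bayes assumption."""
    x = np.asarray(x)
    # p(x | y) is a product over features of phi^x * (1 - phi)^(1 - x)
    lik_1 = np.prod(phi_given_1 ** x * (1 - phi_given_1) ** (1 - x))
    lik_0 = np.prod(phi_given_0 ** x * (1 - phi_given_0) ** (1 - x))
    joint_1 = lik_1 * phi_y          # p(x | y=1) p(y=1)
    joint_0 = lik_0 * (1 - phi_y)    # p(x | y=0) p(y=0)
    return joint_1 / (joint_1 + joint_0)

# Made-up parameters for n = 2 binary features.
phi_y = 0.5
phi_given_1 = np.array([0.8, 0.3])   # p(x_i = 1 | y = 1)
phi_given_0 = np.array([0.2, 0.6])   # p(x_i = 1 | y = 0)

posterior = bernoulli_nb_posterior([1, 0], phi_y, phi_given_1, phi_given_0)
print(posterior)
```

Here $p(x|y=1) = 0.8 \cdot 0.7 = 0.56$ and $p(x|y=0) = 0.2 \cdot 0.4 = 0.08$, so the posterior is $0.28 / (0.28 + 0.04) = 0.875$.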

## Laplace smoothing
* Let's take a look at the best explanation I've seen so far. It is taken from the [CS229 Lecture Notes](https://see.stanford.edu/materials/aimlcs229/cs229-notes2.pdf).
* Consider spam/email classification, and let's suppose that, after completing CS229 and having done excellent work on the project, you decide around June 2003 to submit the work you did to the NIPS conference for publication. (NIPS is one of the top machine learning conferences, and the deadline for submitting a paper is typically in late June or early July.) Because you end up discussing the conference in your emails, you also start getting messages with the word "nips" in it. But this is your first NIPS paper, and until this time, you had not previously seen any emails containing the word "nips"; in particular "nips" did not ever appear in your training set of spam/non-spam emails. Assuming that "nips" was the 35000th word in the dictionary, your Naive Bayes spam filter therefore had picked its maximum likelihood estimates of the parameters $\phi_{35000|y}$ to be
$$
\phi_{35000|y=1} = \frac{\sum\limits_{i=1}^{m} 1\{x_{35000}^{(i)} = 1 \wedge y^{(i)}=1\}}{\sum\limits_{i=1}^{m} 1\{ y^{(i)}=1\}} = 0
$$
$$
\phi_{35000|y=0} = \frac{\sum\limits_{i=1}^{m} 1\{x_{35000}^{(i)} = 1 \wedge y^{(i)}=0\}}{\sum\limits_{i=1}^{m} 1\{ y^{(i)}=0\}} = 0
$$

* I.e., because it has never seen "nips" before in either spam or non-spam training examples, it thinks the probability of seeing it in either type of email is zero. Hence, when trying to decide if one of these messages containing "nips" is spam, it calculates the class posterior probabilities and obtains
$$
p(y=1|x) = \frac{(\prod\limits_{i=1}^n{p(x_i|y=1)})\ p(y=1)}{(\prod\limits_{i=1}^n{p(x_i|y=1)})\ p(y=1)\ +\ (\prod\limits_{i=1}^n{p(x_i|y=0)})\ p(y=0)} = \frac{0}{0}
$$
* Hence our algorithm does not know how to make a prediction in this case.

* Stating the problem more broadly, it is statistically a bad idea to estimate the probability of some event to be zero just because you haven't seen it before in your finite training set.

* Take the problem of estimating the mean of a multinomial random variable $z$ taking values in $\{1, \ldots, k\}$. We can parameterize our multinomial with $\phi_i = p(z = i)$. Given a set of $m$ independent observations $\{z^{(1)}, \ldots, z^{(m)}\}$, the maximum likelihood estimates are given by

$$
\phi_j = \frac{\sum\limits_{i=1}^{m} 1\{z^{(i)} = j \}}{m}
$$

* As we saw previously, if we were to use these maximum likelihood estimates, then some of the $\phi_j$'s might end up as zero, which was a problem. To avoid this, we can use Laplace smoothing, which replaces the above estimate with the following (with $\alpha = 1$ this is classic add-one smoothing):
$$
\phi_j = \frac{\sum\limits_{i=1}^{m} 1\{z^{(i)} = j \} + \alpha}{m+k\alpha}
$$

* Now, since each feature takes $k = 2$ values, we have:
$$
\phi_{j|y=1} = \frac{\sum\limits_{i=1}^{m} 1\{x_{j}^{(i)} = 1 \wedge y^{(i)}=1\} + \alpha}{\sum\limits_{i=1}^{m} 1\{ y^{(i)}=1\} + 2\alpha}
$$
$$
\phi_{j|y=0} = \frac{\sum\limits_{i=1}^{m} 1\{x_{j}^{(i)} = 1 \wedge y^{(i)}=0\} + \alpha}{\sum\limits_{i=1}^{m} 1\{ y^{(i)}=0\} + 2\alpha}
$$

* The parameter $\alpha$ in the above equation is the one we are going to vary and see how our results vary.
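To see the smoothing at work, the sketch below computes the smoothed estimates by hand on a tiny made-up binary dataset and compares them with what `BernoulliNB` learns. It relies on scikit-learn adding $\alpha$ to each feature count and $2\alpha$ to each class count; the dataset and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up binary data: 6 examples, 3 features. Feature 3 never occurs in class 0.
X_bin = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 0]])
y_bin = np.array([1, 1, 1, 0, 0, 0])
alpha = 0.5

clf = BernoulliNB(alpha=alpha).fit(X_bin, y_bin)

# Smoothed estimate per class: (count of x_j = 1 in class + alpha) / (class size + 2 * alpha)
manual = []
for label in clf.classes_:
    rows = X_bin[y_bin == label]
    manual.append((rows.sum(axis=0) + alpha) / (len(rows) + 2 * alpha))
manual = np.array(manual)

# BernoulliNB stores log p(x_j = 1 | y); exponentiate to compare.
sklearn_estimates = np.exp(clf.feature_log_prob_)
print(manual)
print(sklearn_estimates)
```

The third feature never appears in class 0, yet its smoothed estimate is $(0 + 0.5) / (3 + 1) = 0.125$ rather than zero, which avoids the $0/0$ posterior described above.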

In [None]:
alpha_values = [0, 0.8, 1.0, 5.0, 100.0, 200.0, 230.0, 300.0, 500.0, 1000.0, 10000.0]

accuracies = []
for alpha in alpha_values:
    clf = BernoulliNB(alpha=alpha)

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

ax.set_xlabel("Value of $\\alpha$", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
ax.set_title("Naive Bayes", fontsize=20)

for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=alpha_values, y=accuracies, ax=ax)

* If we choose high values of smoothing we lose a lot of accuracy. We also saw that no smoothing may result in numerical errors (division by 0).

# K Nearest Neighbors
* Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.

* The hyperparameters here are:
* **K**: The number of nearest neighbors to compare the new point to.
* **Algorithm**: The algorithm used to compute the nearest neighbors ('ball_tree', 'kd_tree', 'brute').
* **Note**: I'm not considering 'leaf_size' as a hyperparameter because it benefits computation, not accuracy.

In [None]:
neighbors = [1, 2, 5, 10, 14, 16, 20, 25, 32, 50, 60]

accuracies = []
for neighbor in neighbors:
    clf = KNeighborsClassifier(
        n_neighbors=neighbor,
        algorithm='brute',
    )

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))
fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

ax.set_xlabel("Value of $K$", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
ax.set_title("K Nearest Neighbors", fontsize=20)

for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=neighbors, y=accuracies, ax=ax)

## Algorithm vs Accuracy
* Let's set K = 5, as that gave us the best accuracy, and compare the three algorithms.
* **Brute force** computes the distance from the new point to all $N$ training points, i.e. $N$ comparisons per prediction.
* **KD-Tree** exploits a tree data structure; it generalizes two-dimensional quad-trees and three-dimensional oct-trees to an arbitrary number of dimensions. A query takes roughly $O(\log N)$ time in low dimensions.
* **Ball trees** partition the data into a series of nesting hyper-spheres. A single distance calculation between the query point and a node's centroid is enough to bound the distance to all points within that node, so whole nodes can be skipped.
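The brute-force strategy described above is simple enough to write out by hand. The sketch below uses random made-up points (not the heart data), and `knn_predict_brute` is a hypothetical helper. It compares a manual implementation with the three scikit-learn search algorithms.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_pts = rng.random((200, 5))                           # made-up training points
y_pts = (X_pts[:, 0] + X_pts[:, 1] > 1.0).astype(int)  # made-up binary labels
X_query = rng.random((20, 5))
k = 5

def knn_predict_brute(X_train, y_train, X_new, k):
    """Predict by computing the distance to all N training points for each query."""
    preds = []
    for x in X_new:
        dists = np.linalg.norm(X_train - x, axis=1)   # N distance computations
        nearest = np.argsort(dists)[:k]               # indices of the k closest points
        votes = np.bincount(y_train[nearest], minlength=2)
        preds.append(int(np.argmax(votes)))           # majority vote
    return np.array(preds)

manual_preds = knn_predict_brute(X_pts, y_pts, X_query, k)

sk_preds = {
    alg: KNeighborsClassifier(n_neighbors=k, algorithm=alg).fit(X_pts, y_pts).predict(X_query)
    for alg in ("brute", "kd_tree", "ball_tree")
}
print(all(np.array_equal(manual_preds, p) for p in sk_preds.values()))
```

All search strategies find the same neighbors, so they are expected to give identical predictions; the trees only change how quickly those neighbors are found.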

In [None]:
algorithms = ['ball_tree', 'kd_tree', 'brute']

accuracies = []
for algorithm in algorithms:
    clf = KNeighborsClassifier(
        n_neighbors=5,
        algorithm=algorithm,
    )

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))
fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

ax.set_xlabel("Algorithm", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
ax.set_title("K Nearest Neighbors", fontsize=20)

for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=algorithms, y=accuracies, ax=ax)

* We can see that how the nearest neighbors are found makes no difference: we get the same results.
* The **algorithms** help in **reducing computation time**, not in improving accuracy.

# Support Vector Machine
* A support vector machine constructs a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier.
* Given training vectors $x_i \in R^p$, $i=1, \ldots, n$, in two classes, and a vector $y \in \{1,-1\}^n$, SVC solves the following primal problem:
$$
\begin{aligned}
\min_{w, b, \zeta} \quad & \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i \\
\textrm{subject to} \quad & y_i (w^T \phi (x_i) + b) \geq 1 - \zeta_i, \\
& \zeta_i \geq 0, \quad i=1, \ldots, n
\end{aligned}
$$

* The decision function is:
$$
\operatorname{sgn}(\sum_{i=1}^n y_i \alpha_i K(x_i, x) + \rho)
$$
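The decision function above can be rebuilt from a fitted `SVC`. In scikit-learn, `dual_coef_` holds the products $y_i \alpha_i$ for the support vectors and `intercept_` plays the role of $\rho$. The sketch uses a synthetic dataset from `make_classification`, an assumption made so the example is self-contained.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel

# Synthetic stand-in data (not the heart dataset).
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

gamma = 0.5
clf = svm.SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_demo, y_demo)

# K(x_i, x) between every support vector x_i and the first five points.
K = rbf_kernel(clf.support_vectors_, X_demo[:5], gamma=gamma)

# sum_i y_i * alpha_i * K(x_i, x) + rho
manual_scores = (clf.dual_coef_ @ K + clf.intercept_).ravel()

# The sign of the score selects the class.
manual_labels = clf.classes_[(manual_scores > 0).astype(int)]

print(manual_scores)
print(clf.decision_function(X_demo[:5]))
```

Only the support vectors enter the sum, which is why a trained SVM does not need to keep the rest of the training set for prediction.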

## Hyperparameters:
* **C**: Same as described for logistic regression. **C** is the inverse of regularization strength; it must be a positive float. Smaller values specify stronger regularization.
* **Kernel**: The kernel to use (linear, polynomial, rbf, sigmoid).
* **gamma**: Kernel coefficient for rbf, poly and sigmoid.
* **degree**: Degree of the polynomial kernel function ('poly').

## Effect of type of kernel on accuracy
* **linear**: $\langle x, x'\rangle$.
* **polynomial**: $(\gamma \langle x, x'\rangle + r)^d$. $d$ is specified by keyword `degree`, $r$ by `coef0`. In practice this kernel turns out to be too slow if we don't normalize the data.
* **rbf (Radial Basis Function)**: $\exp(-\gamma \|x-x'\|^2)$. $\gamma$ is specified by keyword `gamma` and must be greater than 0. This is an example of an infinite-dimensional kernel, i.e. it implicitly represents the data in an infinite-dimensional space.
* **sigmoid**: $\tanh(\gamma \langle x,x'\rangle + r)$, where $r$ is specified by `coef0`.
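The four kernel formulas above translate directly into NumPy. The sketch below evaluates them on random made-up vectors and checks each against the corresponding function in `sklearn.metrics.pairwise`. The values chosen for `gamma`, `r` and `d` are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel)

rng = np.random.default_rng(0)
A = rng.random((4, 3))   # 4 made-up vectors x
B = rng.random((5, 3))   # 5 made-up vectors x'
gamma, r, d = 0.5, 1.0, 3

inner = A @ B.T                                                 # <x, x'>
sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)   # ||x - x'||^2

K_linear = inner
K_poly = (gamma * inner + r) ** d
K_rbf = np.exp(-gamma * sq_dists)
K_sigmoid = np.tanh(gamma * inner + r)

print(np.allclose(K_linear, linear_kernel(A, B)))
print(np.allclose(K_poly, polynomial_kernel(A, B, degree=d, gamma=gamma, coef0=r)))
print(np.allclose(K_rbf, rbf_kernel(A, B, gamma=gamma)))
print(np.allclose(K_sigmoid, sigmoid_kernel(A, B, gamma=gamma, coef0=r)))
```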

In [None]:
kernels = ['linear', 'poly', 'rbf', 'sigmoid']

accuracies = []
for kernel in kernels:
    clf = svm.SVC(
        kernel=kernel,
    )

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

ax.set_xlabel("Kernel", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
ax.set_title("SVM Classifier", fontsize=20)

for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=kernels, y=accuracies, ax=ax)

* The *linear kernel* outperforms the other two by a significant margin here, which is not typical of classification tasks in general; rbf turns out to be better most of the time.

## Effect of value of C:


In [None]:
C_values = [0.01, 0.1, 0.2, 0.5, 1, 5, 10, 20]

accuracies = []

for C in C_values:
    clf = svm.SVC(
        C=C,
        kernel='linear',
    )

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))


fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

ax.set_xlabel("Value of C", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
ax.set_title("Effect of C SVM", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=C_values, y=accuracies, ax=ax)

## Effect of value of $\gamma$ in RBF kernel
* $\gamma$ is the kernel coefficient for rbf, poly and sigmoid.
* When training an SVM with the Radial Basis Function (RBF) kernel, gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.

In [None]:
gammas = [0.01, 0.1, 0.2, 0.5, 1, 5]

accuracies = []

for gamma in gammas:
    clf = svm.SVC(
        C=5,
        kernel='rbf',
        gamma=gamma,
    )

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))


fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

ax.set_xlabel("Value of $\\gamma$", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
ax.set_title("Effect of $\\gamma$ SVM", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=gammas, y=accuracies, ax=ax)

* The value of $\gamma$ can change accuracy dramatically, as we see above.

# Decision Trees

* Decision trees are non-linear classifiers, capable of performing multi-class classification on a dataset.
* The basis of this classifier is to divide the input space $\chi$ into disjoint subsets (or regions) $R_i$.

$$
\chi = \bigcup\limits_{i=0}^{n} R_i \\
\text{s.t. } R_i \cap R_j = \emptyset \text{ for } i \neq j
$$

* To select regions we use greedy, top-down, recursive partitioning. Each split divides a region into two child regions by thresholding a single feature. We then treat the children recursively, always selecting a leaf node, a feature, and a threshold to form a new split. Given a parent region $R_p$, a feature index $j$, and a threshold $t \in \mathbb{R}$, we obtain two child regions $R_1$ and $R_2$ as follows:
$$
    R_1 = \{X|X_j < t, X \in R_p \} \\
    R_2 = \{X|X_j \ge t, X \in R_p \}
$$

* In order to choose our splits we need to define a loss function. For a classification problem, we are interested in the misclassification loss $L_{misclass}$. For a region $R$, let $\hat{p_c}$ be the proportion of examples in $R$ that are of class $c$. The misclassification loss on $R$ can be written as:
$$
L_{misclass}(R) = 1- \max_c(\hat{p_c})
$$
* A more sensitive loss is the cross-entropy loss.
$$
L_{cross}(R) = - \sum\limits_{c} \hat{p_c} \log_2 \hat{p_c}
$$
* Or you can use [Gini impurity](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity).
* Decision trees have a simple interpretation but have many hyperparameters.
* **criterion** : The loss function to measure the quality of a split (`'entropy'` or `'gini'`).

* **max_depth** : The maximum allowed depth of the tree.

* **min_samples_split** : The minimum number of samples required to split an internal node.

* **min_samples_leaf** : The minimum number of samples required to be at a leaf node.

* **max_leaf_nodes** : The maximum number of leaf nodes.

* **min_impurity_decrease** : A node will be split if this split induces a decrease of the impurity greater than or equal to this value. **Note** : This is a problematic approach, as the greedy, single-feature-at-a-time nature of decision trees could mean missing higher-order interactions.
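The loss functions described above, together with Gini impurity, are short enough to compute by hand. In the sketch below the labels and the threshold are made up. Scoring a split by the size-weighted average of the child losses is a common convention assumed here, not something taken from scikit-learn's internals.

```python
import numpy as np

def class_proportions(labels):
    """Return p_hat_c for every class present in a non-empty region."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def misclassification_loss(labels):
    return float(1.0 - class_proportions(labels).max())

def cross_entropy_loss(labels):
    p = class_proportions(labels)
    return float(-(p * np.log2(p)).sum())

def gini_impurity(labels):
    p = class_proportions(labels)
    return float(1.0 - (p ** 2).sum())

def split_loss(feature, labels, t, loss):
    """Size-weighted loss of the children R_1 = {X_j < t} and R_2 = {X_j >= t}.
    Assumes the threshold leaves both children non-empty."""
    left, right = labels[feature < t], labels[feature >= t]
    n = len(labels)
    return len(left) / n * loss(left) + len(right) / n * loss(right)

# Made-up values of a single feature and the matching class labels.
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0, 0, 0, 1, 1, 0])

parent_gini = gini_impurity(labels)
child_gini = split_loss(feature, labels, 3.5, gini_impurity)
print(parent_gini, child_gini)
print(misclassification_loss(labels), cross_entropy_loss(labels))
```

The split at $t = 3.5$ produces one pure child, so the weighted Gini impurity drops from $4/9$ to $2/9$. A greedy tree builder would compare such values over all features and thresholds and keep the best one.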

## Max Depth vs Accuracy
* The maximum allowed depth of the tree.

In [None]:
max_depths = [2, 3, 4, 5, 10, 20, 50, 100]
accuracies = []

# For cross entropy loss
for max_depth in max_depths:
    clf = DecisionTreeClassifier(criterion='entropy',
                                max_depth=max_depth,
    )

    # Train
    clf.fit(X_train, y_train)
    
    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

fig, ax = plt.subplots(ncols=2)
fig.set_size_inches(14, 4)

ax[0].set_xlabel("Max Depth", fontsize=16)
ax[0].set_ylabel("Accuracy", fontsize=16)
ax[0].set_title("Cross Entropy", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax[0].text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=max_depths, y=accuracies, ax=ax[0])

accuracies = []

# For Gini loss
for max_depth in max_depths:
    clf = DecisionTreeClassifier(criterion='gini',
                                max_depth=max_depth,
    )

    # Train
    clf.fit(X_train, y_train)
    
    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

ax[1].set_xlabel("Max Depth", fontsize=16)
ax[1].set_ylabel("Accuracy", fontsize=16)
ax[1].set_title("Gini impurity", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax[1].text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=max_depths, y=accuracies, ax=ax[1])


## Min samples split vs Accuracy
* The minimum number of samples required to split an internal node.

In [None]:
min_samples_splits = [2, 3, 4, 5, 10, 20, 50, 100]

accuracies = []

for min_samples_split in min_samples_splits:
    clf = DecisionTreeClassifier(criterion='gini',
                                 max_depth=3,
                                 min_samples_split=min_samples_split,
    )

    # Train
    clf.fit(X_train, y_train)
    
    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

ax.set_xlabel("Min Samples Split", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
ax.set_title("Min Samples Split vs Accuracy", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=min_samples_splits, y=accuracies, ax=ax)


## Min samples leaf vs Accuracy
* The minimum number of samples required to be at a leaf node.

In [None]:
min_samples_leaves = [2, 3, 4, 5, 10, 20, 50, 100]

accuracies = []

for min_samples_leaf in min_samples_leaves:
    clf = DecisionTreeClassifier(criterion='gini',
                                 max_depth=3,
                                 min_samples_leaf=min_samples_leaf,
    )

    # Train
    clf.fit(X_train, y_train)
    
    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

ax.set_xlabel("Min Samples Leaf", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
ax.set_title("Min Samples Leaf vs Accuracy", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=min_samples_leaves, y=accuracies, ax=ax)

## Max leaf Nodes vs Accuracy
* The maximum number of leaf nodes.

In [None]:
max_leaves_nodes = [2, 3, 4, 5, 6, 7, 10, 20, 50, 100]

accuracies = []

for max_leaf_nodes in max_leaves_nodes:
    clf = DecisionTreeClassifier(criterion='gini',
                                 max_depth=3,
                                 max_leaf_nodes=max_leaf_nodes,
    )

    # Train
    clf.fit(X_train, y_train)
    
    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

ax.set_xlabel("Max Leaf Nodes", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
ax.set_title("Max Leaf Nodes vs Accuracy", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=max_leaves_nodes, y=accuracies, ax=ax)
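
One thing to keep in mind when reading these results: the loop also fixes `max_depth=3`, and a binary tree of depth 3 cannot have more than $2^3 = 8$ leaves. The sketch below shows which of the tried values can still act as the limit on tree size. `effective_leaf_cap` is a hypothetical helper for illustration only.

```python
def effective_leaf_cap(max_depth, max_leaf_nodes):
    """Upper bound on the number of leaves of a binary tree under both limits
    (hypothetical helper, illustration only)."""
    return min(2 ** max_depth, max_leaf_nodes)


max_leaves_nodes = [2, 3, 4, 5, 6, 7, 10, 20, 50, 100]
caps = [effective_leaf_cap(3, m) for m in max_leaves_nodes]
print(caps)  # [2, 3, 4, 5, 6, 7, 8, 8, 8, 8]
```

Under this bound the values 10, 20, 50 and 100 all allow the same trees, so similar accuracies for those settings are to be expected.
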

# Random Forest (Ensemble method)
* A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if `bootstrap=True`.
* Bootstrap is a method from statistics traditionally used to measure the uncertainty of some estimator (e.g. the mean).
* This is an ensemble of a number of decision trees and hence also has the hyperparameters of **decision tree** classifiers. In addition to those, we look at only two more hyperparameters.
* **Number of estimators**: Number of trees to use for the estimation.
* **Bootstrap**: Whether to use bootstrapping or not.
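
To see what drawing "with replacement" does to the training set of a single tree, here is a small sketch using only the standard library. The helper `bootstrap_indices` and the sample size are assumptions made for this illustration; it does not reproduce scikit-learn's internal sampling code.

```python
import random


def bootstrap_indices(n_samples, seed=0):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_samples)]


n_samples = 10_000
indices = bootstrap_indices(n_samples)
unique_fraction = len(set(indices)) / n_samples

# Same size as the original data, but some rows repeat and others are left out.
print(len(indices))     # 10000
print(unique_fraction)  # roughly 1 - 1/e, i.e. about 0.632
```

Each tree therefore sees only about two thirds of the distinct rows, which is one source of the diversity between trees.
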

In [None]:
num_estimators = [2, 3, 4, 5, 10, 20, 50, 100, 200]
accuracies = []

# With bootstrapping
for n_estimators in num_estimators:
    clf = RandomForestClassifier(
        criterion='gini',
        max_depth=3,
        min_samples_leaf=4,
        max_leaf_nodes=5,
        n_estimators=n_estimators,
        bootstrap=True,
        random_state=10,
    )

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

fig, ax = plt.subplots(ncols=2)
fig.set_size_inches(14, 4)

ax[0].set_xlabel("Number of estimators", fontsize=16)
ax[0].set_ylabel("Accuracy", fontsize=16)
ax[0].set_title("With Bootstrapping", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax[0].text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=num_estimators, y=accuracies, ax=ax[0])

accuracies = []

# Without bootstrapping
for n_estimators in num_estimators:
    clf = RandomForestClassifier(
        criterion='gini',
        max_depth=3,
        min_samples_leaf=4,
        max_leaf_nodes=5,
        n_estimators=n_estimators,
        bootstrap=False,
        random_state=10,
    )

    # Train
    clf.fit(X_train, y_train)
    
    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

ax[1].set_xlabel("Number of estimators", fontsize=16)
ax[1].set_ylabel("Accuracy", fontsize=16)
ax[1].set_title("Without Bootstrapping", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax[1].text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=num_estimators, y=accuracies, ax=ax[1])
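
Why should more trees help at all? A rough intuition comes from a strongly idealized model, which is my assumption and not what the forest actually does: suppose every tree is correct with the same probability and the trees err independently, and the ensemble takes a majority vote. Real trees are correlated, and scikit-learn averages class probabilities rather than counting votes, so treat this only as a sketch.

```python
from math import comb


def majority_vote_accuracy(n_voters, p_correct):
    """Probability that a strict majority of n_voters independent voters,
    each correct with probability p_correct, is correct."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


# Odd ensemble sizes avoid ties in the vote.
for n in [1, 3, 11, 51, 201]:
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

In this idealized setting the accuracy keeps rising with the number of voters, but with diminishing returns. Correlation between real trees flattens the curve much earlier.
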

# Boosting
* Boosting is used for bias reduction. We therefore start from high-bias, low-variance models, also known as weak learners.
* The general idea of boosting is to draw samples randomly while increasing the probability of samples that were predicted incorrectly and decreasing the probability of samples that were classified correctly.
* Here we will explore two boosting classifiers: **GradientBoostingClassifier** and **AdaBoostClassifier**.

# Gradient Boosting Classifier
* Gradient Tree Boosting or Gradient Boosted Regression Trees (GBRT) is a generalization of boosting to arbitrary differentiable loss functions. GBRT is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems. Gradient Tree Boosting models are used in a variety of areas including Web search ranking and ecology.

* The advantages of GBRT are:
    * Natural handling of data of mixed type (= heterogeneous features)
    * Predictive power
    * Robustness to outliers in output space (via robust loss functions)

* The disadvantages of GBRT are:
    * Scalability: due to the sequential nature of boosting, it can hardly be parallelized.

* GBRT considers additive models of the following form:
$$
F(x) = \sum_{m=1}^{M} \gamma_m h_m(x)
$$

* Gradient Tree Boosting uses decision trees of fixed size as weak learners. Decision trees have a number of abilities that make them valuable for boosting, namely the ability to handle data of mixed type and the ability to model complex functions.
* Similar to other boosting algorithms, GBRT builds the additive model in a greedy fashion:
$$
F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)
$$
* Where the newly added tree $h_m$ tries to minimize the loss L, given the previous ensemble $F_{m-1}$:
$$
h_m =  \arg\min_{h} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + h(x_i)).
$$

* Gradient Boosting attempts to solve this minimization problem numerically via steepest descent: The steepest descent direction is the negative gradient of the loss function evaluated at the current model $F_{m-1}$ which can be calculated for any differentiable loss function:
$$
F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^{n} \nabla_F L(y_i, F_{m-1}(x_i))
$$

* Where the step length $\gamma_m$ is chosen using line search:
$$
\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) - \gamma \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)}\right)
$$
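
To make the greedy update concrete, here is a minimal pure-Python sketch for the squared-error loss $L(y, F) = \frac{1}{2}(y - F)^2$, where the negative gradient is simply the residual $y - F$. The one-dimensional regression stump `fit_stump`, the toy data and the default `learning_rate=0.5` are my own choices for illustration. scikit-learn's implementation uses full regression trees and differs in many details.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on 1-D inputs: one threshold, two constants."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mean_squared_error(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs, ys, n_stages=20, learning_rate=0.5):
    """Greedy additive model F_m = F_{m-1} + learning_rate * h_m for squared loss."""
    f0 = sum(ys) / len(ys)  # F_0: the best constant prediction
    preds = [f0] * len(ys)
    losses = [mean_squared_error(ys, preds)]
    for _ in range(n_stages):
        # For L = (y - F)^2 / 2 the negative gradient is the residual y - F.
        residuals = [y - p for y, p in zip(ys, preds)]
        # h_m approximates the descent direction. A least-squares stump already
        # has the optimal step length (gamma = 1) for this loss, so only the
        # shrinkage factor learning_rate is applied on top.
        h = fit_stump(xs, residuals)
        preds = [p + learning_rate * h(x) for p, x in zip(preds, xs)]
        losses.append(mean_squared_error(ys, preds))
    return losses


xs = [i / 10 for i in range(30)]
ys = [x ** 2 for x in xs]
losses = boost(xs, ys)
print(losses[0], losses[-1])  # training loss before and after boosting
```
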

* Apart from the decision tree hyperparameters, the new hyperparameters are:
* **loss**: Loss function ($L$ in the equations above) to be optimized (`deviance` or `exponential`).
* **Learning rate**: `learning_rate` shrinks the contribution of each tree by a constant factor, applied on top of the step length $\gamma_m$. There is a trade-off between learning rate and number of estimators.
* **Number of estimators**: The number of boosting stages ($M$) to perform. Gradient boosting is fairly robust to over-fitting, so a large number usually results in better performance.
* **subsample**: The fraction of samples to be used for fitting the individual base learners.
* **criterion**: The function to measure the quality of a split: `friedman_mse` (mean squared error with improvement score by Friedman), `mse` (mean squared error) or `mae` (mean absolute error).

## Learning rate vs Accuracy
* `learning_rate` shrinks the contribution of each tree by a constant factor, applied on top of the step length $\gamma_m$. There is a trade-off between learning rate and number of estimators.

In [None]:
learning_rates = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]

accuracies = []

# With deviance loss
for learning_rate in learning_rates:
    clf = GradientBoostingClassifier(
        loss='deviance',
        learning_rate=learning_rate,
        max_depth=3,
        min_samples_leaf=4,
        max_leaf_nodes=5,
        random_state=10,
    )

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

fig, ax = plt.subplots(ncols=2)
fig.set_size_inches(14, 4)

ax[0].set_xlabel("Learning rate", fontsize=16)
ax[0].set_ylabel("Accuracy", fontsize=16)
ax[0].set_title("With Deviance", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax[0].text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=learning_rates, y=accuracies, ax=ax[0])

accuracies = []

# With exponential loss
for learning_rate in learning_rates:
    clf = GradientBoostingClassifier(
        loss='exponential',
        learning_rate=learning_rate,
        max_depth=3,
        min_samples_leaf=4,
        max_leaf_nodes=5,
        random_state=10,
    )

    # Train
    clf.fit(X_train, y_train)
    
    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

ax[1].set_xlabel("Learning rate", fontsize=16)
ax[1].set_ylabel("Accuracy", fontsize=16)
ax[1].set_title("Exponential Loss", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax[1].text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=learning_rates, y=accuracies, ax=ax[1])

* We see that the learning rate has a large impact on accuracy. Learning too slowly and learning too fast are both problematic.
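
The trade-off with the number of estimators can be made concrete under a strong simplifying assumption (mine, not something the experiment shows): squared-error loss and base learners that fit the current residuals exactly. Each stage then removes a fraction `learning_rate` of the remaining residual, so after $M$ stages a factor $(1 - \text{learning\_rate})^M$ is left. `stages_needed` is a hypothetical helper for this back-of-the-envelope calculation.

```python
import math


def stages_needed(learning_rate, target_fraction):
    """Smallest M with (1 - learning_rate) ** M <= target_fraction.

    Idealized: assumes every stage fits the current residuals exactly."""
    if not 0 < learning_rate <= 1 or not 0 < target_fraction < 1:
        raise ValueError("need 0 < learning_rate <= 1 and 0 < target_fraction < 1")
    if learning_rate == 1:
        return 1
    return math.ceil(math.log(target_fraction) / math.log(1 - learning_rate))


# Stages needed to shrink the initial residual to 1% of its size.
for lr in [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]:
    print(lr, stages_needed(lr, 0.01))
```

The loop above leaves `n_estimators` at its default of 100. In this idealized picture the two smallest learning rates still leave more than 90% of the initial residual after 100 stages, which matches the "too slow" end of the experiment.
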

## Number of estimators
* The number of boosting stages ($M$) to perform. Gradient boosting is fairly robust to over-fitting, so a large number usually results in better performance.

In [None]:
num_estimators = [2, 3, 4, 5, 10, 20, 50, 100, 200]
accuracies = []

# Vary the number of boosting stages
for n_estimators in num_estimators:
    clf = GradientBoostingClassifier(
        n_estimators=n_estimators,
        loss='deviance',
        learning_rate=0.1,
        max_depth=3,
        min_samples_leaf=4,
        max_leaf_nodes=5,
        random_state=10,
    )

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

ax.set_xlabel("Number of estimators", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
ax.set_title("Number of estimators Vs accuracy", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=num_estimators, y=accuracies, ax=ax)

## Subsampling vs Accuracy
* The fraction of samples to be used for fitting the individual base learners.

In [None]:
subsampling_values = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
accuracies = []

# Vary the fraction of samples used to fit each tree
for subsample in subsampling_values:
    clf = GradientBoostingClassifier(
        subsample=subsample,
        n_estimators=50,
        loss='deviance',
        learning_rate=0.1,
        max_depth=3,
        min_samples_leaf=4,
        max_leaf_nodes=5,
        random_state=10,
    )

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

ax.set_xlabel("Subsampling", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
ax.set_title("Subsampling Vs accuracy", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracies[i], 3), color='black', ha="center", fontsize=14)

sns.barplot(x=subsampling_values, y=accuracies, ax=ax)
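
With `subsample` below 1, every boosting stage is fitted on a different random part of the training set, drawn without replacement. The sketch below mimics only that sampling step. `draw_subsample` and the rounding rule are assumptions made for illustration; the exact bookkeeping inside scikit-learn may differ.

```python
import random


def draw_subsample(n_samples, subsample, rng):
    """Pick round(subsample * n_samples) distinct row indices, without replacement."""
    k = max(1, round(subsample * n_samples))
    return rng.sample(range(n_samples), k)


rng = random.Random(10)
n_samples = 100
for stage in range(3):
    rows = draw_subsample(n_samples, 0.6, rng)
    left_out = n_samples - len(set(rows))
    print(stage, len(rows), left_out)  # each stage sees 60 rows and skips 40
```

Unlike the bootstrap used by random forests, no row appears twice within one stage. The randomness comes only from which rows are left out.
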

# AdaBoostClassifier
* The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction. The data modifications at each so-called boosting iteration consist of applying weights $w_1, w_2, \ldots, w_N$ to each of the training samples. Initially, those weights are all set to $w_i = 1/N$, so that the first step simply trains a weak learner on the original data. For each successive iteration, the sample weights are individually modified and the learning algorithm is reapplied to the reweighted data. At a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly. As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence.
* There is not much math needed to explain where the hyperparameters come from in this case. The learning rate is analogous to the learning rate in the gradient boosting classifier.
* **base_estimator**: The classifier to be used as weak learner.
* **n_estimators**: Number of estimators to be used for prediction.
* **algorithm**: (SAMME, SAMME.R) If 'SAMME.R' then use the SAMME.R real boosting algorithm. `base_estimator` must support calculation of class probabilities. If 'SAMME' then use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations.
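
The reweighting described above can be sketched in a few lines. This is the textbook two-class (discrete) AdaBoost update, written as a hypothetical helper `adaboost_round`. scikit-learn's SAMME and SAMME.R variants differ in details such as the scaling of the vote weight, so read it as an illustration of the idea rather than the library's code.

```python
import math


def adaboost_round(weights, is_correct):
    """One reweighting round of classic discrete AdaBoost for two classes.

    weights    -- current sample weights, summing to 1
    is_correct -- is_correct[i] is True if the weak learner got sample i right
    Returns (alpha, new_weights), where alpha is the learner's vote weight."""
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    if not 0 < error < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified samples are scaled up, correctly classified ones scaled down.
    scaled = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, is_correct)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]


n = 10
weights = [1 / n] * n                  # initial weights w_i = 1/N
is_correct = [True] * 8 + [False] * 2  # the weak learner misses two samples
alpha, new_weights = adaboost_round(weights, is_correct)

print(round(alpha, 3))                     # 0.693
print([round(w, 4) for w in new_weights])  # eight times 0.0625, then twice 0.25
```

After the update the two missed samples carry half of the total weight, so the next weak learner cannot do well without getting them right.
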

## Base Estimator vs Accuracy
* The classifier to be used as weak learner.

In [None]:

classifiers = [
    svm.SVC(probability=True, C=5, kernel='rbf', gamma=0.5,),
    LogisticRegression(penalty='l2', tol=0.0001, C=1.0, fit_intercept=True, solver='liblinear'),
    RandomForestClassifier(criterion='gini', max_depth=3, min_samples_leaf=4, max_leaf_nodes=5,
                           n_estimators=10, bootstrap=True,random_state=10
    ),
    DecisionTreeClassifier(criterion='gini', max_depth=3, max_leaf_nodes=5),
    GradientBoostingClassifier(n_estimators=50, subsample=0.6, loss='deviance', learning_rate=0.1, max_depth=3,
                               min_samples_leaf=4, max_leaf_nodes=5, random_state=10,
    ),
]

classifier_names=['SVM', 'LogisticRegression', 'RandomForest', 'DecisionTree', 'GradientBoosting']

accuracies = []

# Try each classifier as the weak learner
for classifier in classifiers:
    clf = AdaBoostClassifier(
        base_estimator=classifier,
        learning_rate=0.1,
        random_state=10,
    )

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

ax.set_xlabel("Base estimator", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
ax.set_title("Base estimator Vs accuracy", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=classifier_names, y=accuracies, ax=ax)

* Here we see that using an ensemble does not always give better accuracy than a single classifier. The SVM used above has an individual accuracy of 89%, but with AdaBoost its accuracy decreases by 4%.

## Number of estimators vs Accuracy
* Number of estimators to be used for prediction.

In [None]:
num_estimators = [2, 3, 4, 5, 10, 20, 50, 100, 200]
accuracies = []

# Vary the number of weak learners
for n_estimators in num_estimators:
    clf = AdaBoostClassifier(
        n_estimators=n_estimators,
        learning_rate=0.1,
        random_state=10,
    )

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

ax.set_xlabel("Number of estimators", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
ax.set_title("Number of estimators Vs accuracy", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=num_estimators, y=accuracies, ax=ax)

* No clear pattern emerges here; the accuracy varies a lot with the number of estimators.

## Algorithm vs Accuracy
* If 'SAMME.R' then use the SAMME.R real boosting algorithm. `base_estimator` must support calculation of class probabilities. If 'SAMME' then use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations.

In [None]:
algorithms = ['SAMME', 'SAMME.R']
accuracies = []

for algorithm in algorithms:
    clf = AdaBoostClassifier(
        n_estimators=5,
        learning_rate=0.1,
        random_state=10,
        algorithm=algorithm,
    )

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

ax.set_xlabel("Algorithms", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
ax.set_title("Algorithm Vs accuracy", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracies[i], 2), color='black', ha="center", fontsize=14)

sns.barplot(x=algorithms, y=accuracies, ax=ax)

# Last trick in the arsenal: Voting classifier
* A voting classifier is an ensemble of classifiers which can be used to get a little more out of the best calibrated classifiers.

## Voting Strategy Vs Accuracy

In [None]:
votings = ['hard', 'soft']

accuracies = []

classifiers = [
    ('SVM',svm.SVC(probability=True, C=5, kernel='rbf', gamma=0.5)),
    ('LR',LogisticRegression(penalty='l2', tol=0.0001, C=1.0, fit_intercept=True, solver='liblinear')),
    ('RF',RandomForestClassifier(criterion='gini', max_depth=3, min_samples_leaf=4, max_leaf_nodes=5,
                           n_estimators=10, bootstrap=True,random_state=10
    )),
    ('DT',DecisionTreeClassifier(criterion='gini', max_depth=3, max_leaf_nodes=5)),
    ('GB',GradientBoostingClassifier(n_estimators=50, subsample=0.6, loss='deviance', learning_rate=0.1, max_depth=3,
                               min_samples_leaf=4, max_leaf_nodes=5, random_state=10,
    )),
    ('AB',AdaBoostClassifier(n_estimators=5, learning_rate=0.1, random_state=10)),
]

for voting in votings:
    clf = VotingClassifier(
        estimators=classifiers,
        voting=voting,
    )

    # Train
    clf.fit(X_train, y_train)

    # Calculate the mean accuracy on the given test data and labels.
    accuracies.append(clf.score(X_test, y_test))

fig, ax = plt.subplots()
fig.set_size_inches(14, 4)

ax.set_xlabel("Voting Strategy", fontsize=16)
ax.set_ylabel("Accuracy", fontsize=16)
ax.set_title("Voting Strategy Vs Accuracy", fontsize=20)
for i, accuracy in enumerate(accuracies):
    ax.text(i, accuracy, np.round(accuracies[i], 3), color='black', ha="center", fontsize=14)

sns.barplot(x=votings, y=accuracies, ax=ax)

* The voting strategy by itself can change the accuracy drastically.
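
The two strategies can disagree on the very same predictions. The sketch below uses made-up probabilities for a single sample from three imaginary classifiers. `hard_vote` and `soft_vote` are hypothetical helpers for illustration, not the `VotingClassifier` implementation, and ties are not handled.

```python
from collections import Counter


def hard_vote(probabilities):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average the class probabilities, then pick the most likely class."""
    n_classes = len(probabilities[0])
    mean = [sum(p[c] for p in probabilities) / len(probabilities) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


# Three classifiers, two classes: [P(class 0), P(class 1)] for a single sample.
probabilities = [
    [0.55, 0.45],  # weakly prefers class 0
    [0.55, 0.45],  # weakly prefers class 0
    [0.10, 0.90],  # strongly prefers class 1
]
print(hard_vote(probabilities))  # 0
print(soft_vote(probabilities))  # 1
```

Hard voting ignores how confident each classifier is, while soft voting lets one confident classifier outweigh two hesitant ones. This is why soft voting benefits from well calibrated probabilities.
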

# Conclusion
## Final accuracy: 90.2%
* We saw how tuning hyperparameters can be used to gain 5-10% accuracy in classifiers.
* This 5-10% increase can win you competitions.
* We can never say which set of hyperparameters will work best, but we can surely say it's worth tuning them.
* Some hyperparameters may be less effective than others. Knowing this, we can save ourselves some time.
* For **large datasets** we can't do such a thorough analysis, but we can still change a few hyperparameters which usually make a lot of difference.
* Some of the hyperparameters are more intuitive than others, while others just happen to exist.
* Voting classifiers can give you a slight edge in the end if you plan to use multiple classifiers.
* **Tree classifiers** tend to have more hyperparameters than other classifiers, and the hyperparameter values that work tend to be very unstable, in the sense that we are not able to predict any definite pattern for them.
* If we have **large datasets** and can't play around with hyperparameters a lot, then **SVM** seems to be a better choice. Trees, even though they give comparable accuracy, tend to be very complicated when it comes to tuning hyperparameters.
* **Note: The hyperparameters I've tuned aren't a complete set, and hence we never know what's yet to come.**

# Acknowledgement

#### The content for this notebook is taken mostly from two resources:
* [Sklearn User Guide](https://scikit-learn.org/stable/user_guide.html).
* [CS229 Lecture Notes](https://see.stanford.edu/course/cs229).


### Thanks for reading.