# Optimize Your Algorithms
Do you ever wonder if you can squeeze some more performance out of your models? I certainly do. Using the default parameters of a model can be useful if you need a quick indication of its effectiveness, but it won't always give you the accuracy you need. This is where the parameters come in. By tuning your parameters, you can increase (or decrease) the accuracy of your model. In this kernel I'll show you some predictive models and how to tune them.

## Index
1. Importing Libraries and Data Cleaning
2. Data Analysis
3. Testing different models with default parameters
4. Support Vector Machines (SVM)
5. Ridge Classifier
6. Decision Tree Classifier
7. K Neighbors Classifier
8. Extra Trees Classifier
9. Random Forest Classifier
10. Gradient Boosting Classifier
11. Ada Boost Classifier
12. Final Score
13. Conclusion
14. Sources





# 1. Importing Libraries and Data Cleaning
Before we can start trying different algorithms, we will have to clean the data we're going to use. This will ensure more optimal results later on.

We are going to start by importing the libraries we are going to use and turning our csv into a pandas DataFrame.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import warnings



warnings.filterwarnings("ignore")
data = pd.read_csv('../input/data.csv')

Let's take a look at the data we've just imported.

In [None]:
data.head(5)

We get a lot of data about the cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension, plus the diagnosis.

For each feature (everything except diagnosis) we get 3 numbers: the mean, the standard error (the `_se` columns) and the worst.
The worst of a feature is the largest value of that feature for this sample (the mean of the three largest values measured).


In [None]:
data.describe()

There seems to be a column named **Unnamed: 32** that is full of NaN values. We can drop it without affecting our data.

In [None]:
data.drop('Unnamed: 32', axis = 1, inplace = True)

In [None]:
print('Data has {} missing values.'.format(data.isnull().sum().sum()))

# 2. Data Analysis
Now that our data has been cleaned, we can take a look at what it contains. Let's start by looking at the distribution of benign and malignant cell nuclei.

The data contains 357 benign and 212 malignant cell nuclei. This means that a little more than 1 out of 3 cell nuclei in the dataset (about 37%) is malignant. This should be more than enough to train our models.

In [None]:
# Pass the column by keyword so this works across seaborn versions
sns.countplot(x='diagnosis', data=data)

Now let's take a look at some of the differences in the data between benign and malignant cell nuclei, starting with radius and texture.

The mean radius and texture of malignant cell nuclei are significantly higher than those of the benign cell nuclei. So, these are clear indications of cell nuclei being malignant.

In [None]:
data.groupby(['diagnosis']).mean()[['radius_mean', 'texture_mean', ]].plot.barh()

The mean of the perimeter also shows a significant increase when a cell nucleus is malignant. Another clear indication.

In [None]:
data.groupby(['diagnosis']).mean()['perimeter_mean'].plot.barh()

Looking at the remaining data we can see this trend continuing for every data point except fractal dimension. The mean of the concavity and concave points in a malignant cell nucleus are more than triple the mean of a benign cell nucleus!

In [None]:
data.groupby(['diagnosis']).mean()[['smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'fractal_dimension_mean' ]].plot.barh(figsize = (8,4))

When we take a closer look at the fractal dimension of a cell nucleus, we can see that while the mean doesn't differ much between benign and malignant cell nuclei the worst values of malignant cell nuclei are higher than those of benign cell nuclei.

In [None]:
data.groupby(['diagnosis']).mean()[['fractal_dimension_mean', 'fractal_dimension_worst',  ]].plot.barh(figsize = (8,4))

The standard error of the fractal dimension of a malignant cell nucleus is also higher, though not as significantly as some other values.

In [None]:
data.groupby(['diagnosis']).mean()['fractal_dimension_se'].plot.barh(figsize = (8,4))


Before we take a deeper look at these correlations, we have to edit our data so the diagnosis becomes a number (1 for malignant, 0 for benign) instead of a character.

In [None]:
# Map the labels to integers: malignant (M) -> 1, benign (B) -> 0.
# Assigning the result back avoids an in-place replace on a column selection.
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})

Just as we concluded earlier, high values of the radius, texture, perimeter and smoothness are all indicators of a malignant cell nucleus.

In [None]:
sns.heatmap(data[['diagnosis','radius_mean', 'texture_mean', 'perimeter_mean', 'smoothness_mean']].corr(), annot=True)

Again, these correlations match up with our earlier conclusions. The fractal dimension mean is the only mean that isn't effective when trying to identify a malignant cell nucleus.

In [None]:
sns.heatmap(data[['diagnosis','compactness_mean', 'concavity_mean', 'concave points_mean', 'fractal_dimension_mean']].corr(), annot=True)

And finally, the fractal dimension worst and standard error are also both useful in predicting malignant cell nuclei.

In [None]:
sns.heatmap(data[['diagnosis', 'fractal_dimension_mean', 'fractal_dimension_worst', 'fractal_dimension_se']].corr(), annot=True)

# 3. Testing different models with default parameters
Before we start optimising our models, we'll first run them with the default parameters. These values can later be used to compare the optimised models against them. 

The models are going to predict whether the cell nucleus is malignant or not. 


In [None]:
y = data['diagnosis']
X = data.drop(['diagnosis', 'id'], axis=1)

To view the models' performance, we'll make a new DataFrame in which we'll store the algorithm, its average score and the standard deviation of its scores. When possible we use **random_state=0** to ensure fair results.

In [None]:
from sklearn import ensemble, linear_model, svm, neighbors, gaussian_process, naive_bayes, tree 

scoreFrame = pd.DataFrame(columns = ['Algorithm Name', 'Average', 'Standard Deviation'])

algList=[
    #linear
    linear_model.Ridge(random_state=0),
    linear_model.SGDClassifier(random_state=0),
    #Neighbors
    neighbors.KNeighborsClassifier(),
    #SVM
    svm.SVC(),
    #Gaussian Process
    gaussian_process.GaussianProcessClassifier(random_state=0),
    #Naive Bayes
    naive_bayes.GaussianNB(),
    #Tree
    tree.DecisionTreeClassifier(random_state=0),
    #Ensemble
    ensemble.GradientBoostingClassifier(random_state=0),
    ensemble.RandomForestClassifier(random_state=0),
    ensemble.ExtraTreesClassifier(random_state=0),
    ensemble.AdaBoostClassifier(random_state=0)
]

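We are going to test the models using `cross_val_score`. This function fits and tests the model multiple times, each time on a different split of the data, and then returns a list of scores. Using the mean of this list we get an average score of the model. Note that the 'Standard Deviation' column stores twice the standard deviation of these scores.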
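To make the idea more concrete, here is a minimal pure-Python sketch of how k-fold cross validation splits the samples and how the scores are summarised. The helper names `k_fold_indices` and `summarise` are made up for this illustration, the fold accuracies are invented numbers, and the split is a plain contiguous k-fold. For classifiers, `cross_val_score` uses stratified folds, so treat this as an illustration of the principle rather than a copy of what scikit-learn does.

```python
from statistics import mean, pstdev


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    # The first n_samples % k folds get one extra sample, so every sample is tested once.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def summarise(scores):
    """Return the mean score and twice the (population) standard deviation."""
    return mean(scores), 2 * pstdev(scores)


folds = k_fold_indices(n_samples=10, k=3)
for train, test in folds:
    print("train:", train, "test:", test)

# Invented per-fold accuracies, only here to show the summary step.
average, spread = summarise([0.90, 0.94, 0.92])
print(average, spread)
```

Every sample ends up in exactly one test fold, which is why the averaged score is a fairer estimate than a single train/test split.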

In [None]:
from sklearn.model_selection import cross_val_score

for alg in algList:
    scores = cross_val_score(alg, X, y, cv = 10)
    algName = alg.__class__.__name__
    scoreAverage = scores.mean()
    scoreSTD = scores.std() * 2
    scoreFrame.loc[len(scoreFrame)] = [algName, scoreAverage, scoreSTD]

Using the sort_values function on the DataFrame gives us a nice overview of the performance of all the tested models.

In [None]:
scoreFrame.sort_values('Average', ascending=False)

# 4. Support Vector Machines (SVM)


### How does it work?
An SVM uses a hyperplane to classify the data: it looks for the hyperplane that separates the classes with the widest margin. A hyperplane is a subspace with one dimension fewer than the space it sits in. So in a 3D space it is a 2D plane, in a 2D space it is a line, etcetera.
### What parameters can we tune?
SVMs are able to use different kernels to classify the given data. These kernels can all be tuned using different parameters.

Kernels: Parameters
* Linear: C
* RBF: C, gamma
* Poly: C, gamma,  degree

### Default parameters
By default SVC uses these values:
* kernel = 'rbf'
* C = 1.0
* gamma = 'auto' ( = 1 / n_features ) in scikit-learn versions before 0.22; newer versions default to 'scale' ( = 1 / (n_features * X.var()) )
* degree = 3

## C - Penalty Parameter
C is the penalty parameter of the error term. It controls the trade-off between a smooth decision boundary and classifying the training points correctly: a low C tolerates more misclassified training points, a high C punishes them harder.
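The trade-off can be sketched with the standard soft-margin objective for a linear SVM, 0.5 * ||w||^2 + C * (sum of hinge losses). The function name `soft_margin_objective` and the toy 1-D data are made up for this illustration, and the labels are assumed to be -1 and +1. It only evaluates the objective for a fixed boundary; it does not train anything.

```python
def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses, with labels y in {-1, +1}."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        # Points that are on the correct side with a margin of at least 1 cost nothing.
        hinge_total += max(0.0, 1.0 - yi * score)
    return margin_term + C * hinge_total


# Toy 1-D data (invented); the last point sits on the wrong side of the boundary.
X = [[2.0], [3.0], [-2.0], [1.5]]
y = [1, 1, -1, -1]
w, b = [1.0], 0.0

for C in (0.1, 1.0, 10.0):
    print(C, soft_margin_objective(w, b, X, y, C))
```

The same misclassified point costs 25 times more at C = 10 than at C = 0.1 (ignoring the fixed margin term), so a high C pushes the model to bend the boundary around individual training points.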

Let's start by creating a List with some basic values to check the average score they give our model.

In [None]:
svmPenaltyFrame = pd.DataFrame(columns = ['C', 'Average', 'Standard Deviation'])

for c in [0.00001, 0.0001, 0.001, 0.01, 0.1,1,10,100,1000]:
    alg = svm.SVC(C=c)
    scores = cross_val_score(alg, X, y, cv = 3)
    scoreAverage = scores.mean()
    scoreSTD = scores.std() * 2
    svmPenaltyFrame.loc[len(svmPenaltyFrame)] = [c, scoreAverage, scoreSTD]
  
svmPenaltyFrame.sort_values('Average', ascending=False).head(10)

These values don't seem to change our predictions, so we'll wait until we have some more parameters figured out and then we'll come back to the C parameter.

## Gamma - How exact should I fit the data?
The higher the gamma value, the harder the model tries to exactly fit the training data set.
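A small sketch shows why. The RBF kernel measures the similarity between two points as exp(-gamma * squared distance). The function name `rbf_kernel` and the two points are made up for this illustration.

```python
import math


def rbf_kernel(a, b, gamma):
    """RBF similarity exp(-gamma * ||a - b||^2) between two points."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


point, neighbour = [0.0, 0.0], [1.0, 1.0]  # squared distance = 2
for gamma in (0.001, 0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(point, neighbour, gamma))
```

With a high gamma the similarity drops to almost zero even for nearby points, so every training point only influences its immediate surroundings and the boundary can wrap tightly around the training data. With a low gamma far-away points still count, which gives a smoother boundary.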

Let's start by using a range of 0.001 to 1000.

In [None]:
svmGammaFrame = pd.DataFrame(columns = ['Gamma', 'Average', 'Standard Deviation'])

for g in [0.001, 0.01, 0.1,1,10,100,1000]:
    alg = svm.SVC(gamma=g, C=0.1)
    scores = cross_val_score(alg, X, y, cv = 3)
    scoreAverage = scores.mean()
    scoreSTD = scores.std() * 2
    svmGammaFrame.loc[len(svmGammaFrame)] = [g, scoreAverage, scoreSTD]
  
svmGammaFrame.sort_values('Average', ascending=False).head(10)

This range doesn't give us any different scores, so we'll try much smaller values.
This time we'll use every value from 0.000001 up to 0.000999, in steps of 0.000001.

In [None]:
for g in range(1,1000):
    g = g/1000000
    alg = svm.SVC(gamma=g, C=0.1)
    scores = cross_val_score(alg, X, y, cv = 3)
    scoreAverage = scores.mean()
    scoreSTD = scores.std() * 2
    svmGammaFrame.loc[len(svmGammaFrame)] = [g, scoreAverage, scoreSTD]
   
svmGammaFrame.sort_values('Average', ascending=False).head(10)

Now we get some good results out of the SVM. The accuracy became around 0.3 points higher!

In [None]:
optimalSVMGamma = svmGammaFrame.sort_values('Average', ascending=False).iloc[0].values[0]

## Kernel - How do I fit the data?
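The kernel decides the shape of the boundary that separates the data: a linear kernel gives a flat hyperplane, while the poly and RBF kernels allow curved boundaries.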

In [None]:
svmKernelFrame = pd.DataFrame(columns = ['Kernel', 'Average', 'Standard Deviation'])
kernelList = ['linear', 'poly', 'rbf']
for k in kernelList:
    alg = svm.SVC(gamma=optimalSVMGamma, kernel=k, C=0.1)
    scores = cross_val_score(alg, X, y, cv = 3)
    scoreAverage = scores.mean()
    scoreSTD = scores.std() * 2
    svmKernelFrame.loc[len(svmKernelFrame)] = [k, scoreAverage, scoreSTD]
    
svmKernelFrame.sort_values('Average', ascending=False).head(10)

It turns out that the Poly kernel is the best type of kernel for our data.

In [None]:
optimalSVMKernel = 'poly'

## Degree - How sharp are the Poly Kernel's lines?
Degree determines the degree of the polynomial that the Poly kernel uses to split the data.

We'll try values between 1 and 3. Any values above this will only lead to really slow models.

In [None]:
svmDegreeFrame = pd.DataFrame(columns = ['Degrees', 'Average', 'Standard Deviation'])

for d in range(1,4):
    alg = svm.SVC(gamma=optimalSVMGamma, kernel='poly', degree=d, C=0.1)
    scores = cross_val_score(alg, X, y, cv = 3)
    scoreAverage = scores.mean()
    scoreSTD = scores.std() * 2
    svmDegreeFrame.loc[len(svmDegreeFrame)] = [d, scoreAverage, scoreSTD]
   
svmDegreeFrame.sort_values('Average', ascending=False).head(10)

In [None]:
# Cast to int: the row values come back as floats, but degree is an integer parameter
optimalSVMDegree = int(svmDegreeFrame.sort_values('Average', ascending=False).iloc[0].values[0])

## C - Again
Now that we have some more parameters tuned, let's try our C parameter again. 

In [None]:
svmPenaltyFrame = pd.DataFrame(columns = ['C', 'Average', 'Standard Deviation'])

for c in [0.00001, 0.0001, 0.001, 0.01, 0.1, 1]:
    alg = svm.SVC(gamma=optimalSVMGamma, kernel=optimalSVMKernel, degree=optimalSVMDegree, C=c)
    scores = cross_val_score(alg, X, y, cv = 3)
    scoreAverage = scores.mean()
    scoreSTD = scores.std() * 2
    svmPenaltyFrame.loc[len(svmPenaltyFrame)] = [c, scoreAverage, scoreSTD]
  
svmPenaltyFrame.sort_values('Average', ascending=False).head(10)

In [None]:
optimalSVMPenalty = svmPenaltyFrame.sort_values('Average', ascending=False).iloc[0].values[0]

# 5. Ridge Classifier
### How does it work?
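The Ridge Classifier creates a linear formula that outputs the prediction. It chooses the coefficients in this formula based on the data given. A penalty on the size of the coefficients keeps them close to 0, which makes the model less prone to overfitting. Note that the code in this section uses `linear_model.Ridge`, the regression variant, so `cross_val_score` reports the R² score for it rather than the classification accuracy; `linear_model.RidgeClassifier` is the classification counterpart.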

### What parameters can we tune?
The only parameter in this model that we'll tune is alpha.

### Default parameters
By default the Ridge Classifier uses this value:
* alpha= 1.0

## Alpha - Forcing coefficients toward zero
Ridge Classifiers use alpha to force coefficients more or less toward zero. The higher alpha is, the more strongly the coefficients are forced toward zero.
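For a single feature without an intercept, the ridge solution has a simple closed form: coefficient = sum(x * y) / (sum(x * x) + alpha). The sketch below uses that simplified one-feature case to show the shrinking effect; the function name `ridge_coefficient` and the toy data are made up for this illustration.

```python
def ridge_coefficient(x, y, alpha):
    """Closed-form ridge solution for one feature and no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # alpha is added to the denominator, so a larger alpha gives a smaller coefficient.
    return sxy / (sxx + alpha)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # exactly y = 2x, so the unpenalised coefficient is 2

for alpha in (0.0, 1.0, 14.0, 1000.0):
    print(alpha, ridge_coefficient(x, y, alpha))
```

With alpha = 0 we recover the plain least-squares coefficient of 2, and with alpha = 14 (equal to sum(x * x) here) the coefficient is cut in half.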

Starting with a simple list of values from 0.001 to 1000 we'll try to find the optimal alpha.

In [None]:
ridgeFrame = pd.DataFrame(columns = ['Alpha', 'Average', 'Standard Deviation'])

for a in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    alg = linear_model.Ridge(alpha = a)
    scores = cross_val_score(alg, X, y, cv = 3)
    scoreAverage = scores.mean()
    scoreSTD = scores.std() * 2
    ridgeFrame.loc[len(ridgeFrame)] = [a, scoreAverage, scoreSTD]

ridgeFrame.sort_values('Average', ascending=False).head(10)

It's clear that lower alphas are better in this case, so we'll create a longer list with some lower values.

In [None]:
ridgeFrame = pd.DataFrame(columns = ['Alpha', 'Average', 'Standard Deviation'])

for a in range(1,1000):
    a = a/100000
    alg = linear_model.Ridge(alpha = a)
    scores = cross_val_score(alg, X, y, cv = 3)
    scoreAverage = scores.mean()
    scoreSTD = scores.std() * 2
    ridgeFrame.loc[len(ridgeFrame)] = [a, scoreAverage, scoreSTD]

ridgeFrame.sort_values('Average', ascending=False).head(10)

In [None]:
optimalRidgeAlpha = ridgeFrame.sort_values('Average', ascending=False).iloc[0].values[0]

In [None]:
sns.relplot(x = 'Alpha', y = 'Average', data=ridgeFrame, kind="line")

It's clear in this graph that we won't get a much better score than around 0.7, but to be sure we'll graph a larger range of alphas.

In [None]:
for a in [0.000001,0.00001,0.0001,0.001,0.01,0.1,1,10,100,1000,10000,100000]:
    alg = linear_model.Ridge(alpha = a)
    scores = cross_val_score(alg, X, y, cv = 3)
    scoreAverage = scores.mean()
    scoreSTD = scores.std() * 2
    ridgeFrame.loc[len(ridgeFrame)] = [a, scoreAverage, scoreSTD]

sns.relplot(x = 'Alpha', y = 'Average', data=ridgeFrame, kind="line")

As expected, we don't see any spikes in the score above 0.7.

# 6. Decision Tree Classifier
### How does it work?
Decision Trees create a network of nodes and branches between these nodes. Every internal node has a condition that is either True or False. The answer sends the sample down a branch to the next node, etcetera, until it reaches a leaf, which holds the prediction.

### What parameters can we tune?
The parameters we'll tune are min_samples_leaf and max_depth

### Default parameters
By default Decision Tree Classifiers use these values:
* min_samples_leaf = 1
* max_depth = None

## min_samples_leaf - Minimum number of samples
The min_samples_leaf parameter decides if a Decision Tree is allowed to split a node. It will only be allowed if there are at least X training samples in both the left and right branch of the node, where X is the integer you gave the parameter.
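The rule described above can be written down in a few lines. The function names `split_allowed` and `candidate_splits` are made up for this illustration; the sketch only counts samples and ignores how good each split is.

```python
def split_allowed(n_left, n_right, min_samples_leaf):
    """A split is kept only if both children hold at least min_samples_leaf samples."""
    return n_left >= min_samples_leaf and n_right >= min_samples_leaf


def candidate_splits(n_samples, min_samples_leaf):
    """All (left, right) sizes that a node with n_samples samples may be split into."""
    return [
        (n_left, n_samples - n_left)
        for n_left in range(1, n_samples)
        if split_allowed(n_left, n_samples - n_left, min_samples_leaf)
    ]


print(candidate_splits(6, 1))
print(candidate_splits(6, 3))
print(candidate_splits(6, 4))
```

A node with 6 samples can be split in five ways when min_samples_leaf is 1, in only one way when it is 3, and not at all when it is 4. Higher values therefore stop the tree from growing earlier.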

In [None]:
treeSampleFrame = pd.DataFrame(columns = ['Samples', 'Average', 'Standard Deviation'])

for n in range(1,20):
    alg = tree.DecisionTreeClassifier(min_samples_leaf = n, random_state=0)
    scores = cross_val_score(alg, X, y, cv = 5)
    scoreAverage = scores.mean()
    scoreSTD = scores.std() * 2
    treeSampleFrame.loc[len(treeSampleFrame)] = [n, scoreAverage, scoreSTD]

treeSampleFrame.sort_values('Average', ascending=False).head(5)

In [None]:
optimalTreeSamples = int(treeSampleFrame.sort_values('Average', ascending=False).iloc[0].values[0])

In [None]:
sns.relplot(x = 'Samples', y = 'Average', data=treeSampleFrame, kind="line")

## max_depth
The integer you set here determines the maximum number of layers your Decision Tree makes.

In [None]:
treeDepthFrame = pd.DataFrame(columns = ['Depth', 'Average', 'Standard Deviation'])

for d in range(1,20):
    alg = tree.DecisionTreeClassifier(min_samples_leaf = optimalTreeSamples, max_depth = d, random_state=0)
    scores = cross_val_score(alg, X, y, cv = 5)
    scoreAverage = scores.mean()
    scoreSTD = scores.std() * 2
    treeDepthFrame.loc[len(treeDepthFrame)] = [d, scoreAverage, scoreSTD]

treeDepthFrame.sort_values('Average', ascending=False)

In [None]:
optimalTreeDepth = int(treeDepthFrame.sort_values('Average', ascending=False).iloc[0].values[0])

In [None]:
sns.relplot(x = 'Depth', y = 'Average', data=treeDepthFrame, kind="line")

Max_depth doesn't matter much after 4 layers.

## Depth & Samples
Let's try both max_depth and min_samples_leaf at once to make sure we have the optimal values.

In [None]:
treeFrame = pd.DataFrame(columns = ['Depth', 'Samples', 'Average', 'Standard Deviation'])

for n in range(1,20):
    for d in range(1, 10):
        alg = tree.DecisionTreeClassifier(min_samples_leaf = n, max_depth=d, random_state=0)
        scores = cross_val_score(alg, X, y, cv = 5)
        scoreAverage = scores.mean()
        scoreSTD = scores.std() * 2
        treeFrame.loc[len(treeFrame)] = [d, n, scoreAverage, scoreSTD]

treeFrame.sort_values('Average', ascending=False).head(10)

In [None]:
optimalTreeDepth = int(treeFrame.sort_values('Average', ascending=False).iloc[0].values[0])
optimalTreeSamples = int(treeFrame.sort_values('Average', ascending=False).iloc[0].values[1])

In [None]:
sns.heatmap(treeFrame[['Average', 'Depth', 'Samples']].corr(), annot=True)

Using the graphviz library we'll visualize the end result of the Decision Tree. 
Each node in the image shows multiple pieces of information:
* Condition of the node (only for nodes that are split further)
* Gini (or chance of incorrect measurement of a random training sample at that point)
* The number of samples that passed during fitting
* Class (or prediction) of the sample at that point

As we can see in the image, not all bottom leaves have a gini equal to 0.0. These leaves are the ones that could give us incorrect predictions. These could be improved by gathering more data that involves these leaves, though this is not always possible.
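The gini value of a node is 1 minus the sum of the squared class fractions in that node. A minimal sketch, with the function name `gini` and the label lists made up for this illustration:

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class fractions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


print(gini(['B', 'B', 'B', 'B']))  # only one class in the leaf
print(gini(['B', 'B', 'B', 'M']))  # one sample of the other class
print(gini(['B', 'B', 'M', 'M']))  # evenly mixed leaf
```

A leaf that holds a single class has a gini of 0.0, and for two classes the value peaks at 0.5 when the leaf is evenly mixed.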

In [None]:
DTC = tree.DecisionTreeClassifier(min_samples_leaf = optimalTreeSamples, max_depth=optimalTreeDepth, random_state=0)
DTC.fit(X,y)
import graphviz
dot_data = tree.export_graphviz(DTC, feature_names=X.columns.values, class_names=['B', 'M'], filled=True )
graphviz.Source(dot_data) 


# 7. K Neighbors Classifier
### How does it work?
The K Neighbors Classifier searches the correct answer based on the data it already has. It tries to predict the classification by looking at which data points are near it and what they are classified as.

### What parameters can we tune?
This algorithm uses 3 parameters that we can tune: n_neighbors, weights and algorithm.

### Default parameters
* n_neighbors = 5
* algorithm = 'auto'
* weights = 'uniform'

## n_neighbors - How many points are around me?
The value of this parameter determines how many points around the target data point are looked at to determine the outcome.
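Here is a minimal pure-Python sketch of that idea: find the n_neighbors closest training points and let them vote. The function name `knn_predict` and the toy points are made up for this illustration, and it uses a brute-force search with Euclidean distance.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, n_neighbors):
    """Predict the majority label among the n_neighbors closest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:n_neighbors]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Invented 2-D points: three of class 0, four of class 1.
train_points = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7], [3, 3]]
train_labels = [0, 0, 0, 1, 1, 1, 1]
query = [2.5, 2.5]

for k in (1, 3, 7):
    print(k, knn_predict(train_points, train_labels, query, k))
```

The prediction for the same query changes with the number of neighbors: the single closest point is of class 1, the three closest points are mostly class 0, and with all seven points class 1 wins again. That sensitivity is why it pays to search for a good value.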

In [None]:
kNeighborFrame = pd.DataFrame(columns = ['Neighbors', 'Average', 'Standard Deviation'])
for n in range(1,50):
    alg = neighbors.KNeighborsClassifier(n_neighbors=n)
    scores = cross_val_score(alg, X, y, cv = 5)
    scoreAverage = scores.mean()
    scoreSTD = scores.std() * 2
    kNeighborFrame.loc[len(kNeighborFrame)] = [n, scoreAverage, scoreSTD]
kNeighborFrame.sort_values('Average', ascending=False).head(5)

In [None]:
optimalKNNNeighbors = int(kNeighborFrame.sort_values('Average', ascending=False).iloc[0].values[0])

In [None]:
sns.relplot(x="Neighbors", y='Average', data = kNeighborFrame, kind = "line")

As shown in the table and plot, the optimal number of neighbors is 14.

## Algorithm - How do I find my neighbors?
There are 3 types of algorithms to use for finding neighbors:
* ball_tree
* kd_tree
* brute (force)

If you set the parameter to 'auto', the model will try to pick the most suitable algorithm itself.

In [None]:
kAlgorithmFrame = pd.DataFrame(columns = ['Algorithm', 'Average', 'Standard Deviation'])
for a in ['ball_tree', 'kd_tree', 'brute','auto']:
    alg = neighbors.KNeighborsClassifier(n_neighbors=optimalKNNNeighbors ,algorithm = a)
    scores = cross_val_score(alg, X, y, cv = 5)
    scoreAverage = scores.mean()
    scoreSTD = scores.std() * 2
    kAlgorithmFrame.loc[len(kAlgorithmFrame)] = [a, scoreAverage, scoreSTD]

kAlgorithmFrame.sort_values('Average', ascending=False).head(10)

In [None]:
optimalKNNAlg = kAlgorithmFrame.sort_values('Average', ascending=False).iloc[0].values[0]

It seems that the algorithm doesn't matter much this time, but it never hurts to have checked.

## Weight - Are all my neighbors equal?
We have all these different neighbors, but they aren't all the same. Not every neighbor is equally far away from our data point. This impacts our predictions.

We have 2 types of weight to give to data points:
* uniform - All data points are equal
* distance - The closer a data point is, the more influential it is to the prediction
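The two options can be compared in a small sketch. The function name `knn_vote` and the toy 1-D points are made up for this illustration; the 'distance' option weights each neighbor by the inverse of its distance, and the sketch assumes the query never coincides exactly with a training point (that would divide by zero).

```python
import math
from collections import defaultdict


def knn_vote(train_points, train_labels, query, n_neighbors, weights="uniform"):
    """Weighted vote among the n_neighbors closest training points."""
    neighbours = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )[:n_neighbors]
    votes = defaultdict(float)
    for distance, label in neighbours:
        # 'uniform': every neighbor counts once; 'distance': weight = 1 / distance.
        votes[label] += 1.0 if weights == "uniform" else 1.0 / distance
    return max(votes, key=votes.get)


# One malignant point very close to the query, two benign points further away.
train_points = [[0.0], [4.0], [5.0]]
train_labels = ['M', 'B', 'B']
query = [1.0]

print(knn_vote(train_points, train_labels, query, 3, weights="uniform"))
print(knn_vote(train_points, train_labels, query, 3, weights="distance"))
```

With uniform weights the two benign points outvote the malignant one. With distance weights the single close neighbor (weight 1) beats the two distant ones (1/3 + 1/4), so the prediction flips.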

In [None]:
kWeightFrame = pd.DataFrame(columns = ['Weight', 'Average', 'Standard Deviation'])
for w in ['uniform', 'distance']:
    alg = neighbors.KNeighborsClassifier(weights=w, algorithm = optimalKNNAlg, n_neighbors=optimalKNNNeighbors)
    scores = cross_val_score(alg, X, y, cv = 5)
    scoreAverage = scores.mean()
    scoreSTD = scores.std() * 2
    kWeightFrame.loc[len(kWeightFrame)] = [w, scoreAverage, scoreSTD]

kWeightFrame.sort_values('Average', ascending=False).head(10)

In [None]:
optimalKNNWeight = kWeightFrame.sort_values('Average', ascending=False).iloc[0].values[0]

The uniform weight apparently suits our data better, though it doesn't differ much from the distance weight.

# 8. Extra Trees Classifier
### How does it work?
The Extra Trees Classifier fits a number of randomized Decision Trees on sub-samples of the dataset and then uses averaging to get a more accurate prediction.

### What parameters can we tune?
The Extra Trees Classifier parameters that we'll tune are: n_estimators, max_depth and min_samples_leaf.

We have already discussed max_depth and min_samples_leaf in Decision Trees so we will skip their explanation in this part.

### Default parameters
* n_estimators = 10 (100 since scikit-learn 0.22)
* max_depth = None
* min_samples_leaf = 1

## n_estimators - How many?
The value for this parameter determines how many Decision Trees are made by the model.
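The sketch below shows how several trees are combined into one prediction. The "trees" here are hand-written one-split stumps with invented thresholds, and the names `make_stump` and `ensemble_predict` are made up for this illustration. It uses a simple majority vote; scikit-learn's forest classifiers average the predicted class probabilities of the trees instead, so this is a simplification.

```python
from collections import Counter


def make_stump(feature_index, threshold):
    """Return a one-split 'tree' that predicts 1 when the feature exceeds the threshold."""
    def predict(sample):
        return 1 if sample[feature_index] > threshold else 0
    return predict


def ensemble_predict(trees, sample):
    """Majority vote over the predictions of all trees."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]


# Three stumps, so this toy ensemble corresponds to n_estimators = 3.
trees = [make_stump(0, 14.0), make_stump(0, 16.0), make_stump(1, 20.0)]
sample = [15.0, 22.0]

individual = [tree(sample) for tree in trees]
print(individual)
print(ensemble_predict(trees, sample))
```

One of the three stumps disagrees, but the ensemble still predicts class 1. More trees make the vote more stable, at the cost of more training time.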

Let's start by trying N: 1 -> 100, Depth: 1 -> 30 and Samples: 1 -> 9

In [None]:
extraTreeFrame = pd.DataFrame(columns = ['n_estimators', 'Depth', 'Samples', 'Average', 'Standard Deviation'])

for n in range(1, 100, 10):
    for d in range(1, 30, 5):
        for s in range(1, 10):
            alg = ensemble.ExtraTreesClassifier(n_estimators=n, max_depth=d, min_samples_leaf=s ,random_state=0)
            scores = cross_val_score(alg, X, y, cv = 5)
            scoreAverage = scores.mean()
            scoreSTD = scores.std() * 2
            extraTreeFrame.loc[len(extraTreeFrame)] = [n, d, s, scoreAverage, scoreSTD]

extraTreeFrame.sort_values('Average', ascending=False).head(5)

In [None]:
sns.heatmap(extraTreeFrame[['Average','n_estimators', 'Depth', 'Samples']].corr(), annot=True)

In [None]:
sns.relplot(x = 'n_estimators', y = 'Average', data=extraTreeFrame, kind="line")

In [None]:
sns.relplot(x = 'Depth', y = 'Average', data=extraTreeFrame, kind="line")

Looking at the results and at the heatmap, we see that samples isn't really affecting our average score. The other plots show that the optimal number of trees is somewhere between 10 and 40. Knowing this we can try the other parameters a bit more in depth. The best depths are above 10 according to the plot.

In [None]:
extraTreeFrame = pd.DataFrame(columns = ['n_estimators', 'Depth', 'Average', 'Standard Deviation'])

for n in range(10, 50):
    for d in range(10, 30):
        alg = ensemble.ExtraTreesClassifier(n_estimators=n, max_depth=d, random_state=0)
        scores = cross_val_score(alg, X, y, cv = 5)
        scoreAverage = scores.mean()
        scoreSTD = scores.std() * 2
        extraTreeFrame.loc[len(extraTreeFrame)] = [n, d, scoreAverage, scoreSTD]

extraTreeFrame.sort_values('Average', ascending=False).head(10)

In [None]:
sns.relplot(x = 'n_estimators', y = 'Average', data=extraTreeFrame, kind="line")

In [None]:
sns.relplot(x = 'Depth', y = 'Average', data=extraTreeFrame, kind="line")

In [None]:
optimalExtraTreeN = int(extraTreeFrame.sort_values('Average', ascending=False).iloc[0].values[0])
optimalExtraTreeDepth = int(extraTreeFrame.sort_values('Average', ascending=False).iloc[0].values[1])

The plots show that the optimal n_estimators are mostly around 35 and the optimal maximum depth is mostly above 17.

# 9. Random Forest Classifier
### How does it work?
The Random Forest Classifier fits a number of randomized Decision Trees on sub-samples of the dataset and then uses averaging to get a more accurate prediction. This sounds really similar to the Extra Trees Classifier model, because it is. The big difference lies in how a node is split: the Random Forest Classifier searches for the most discriminative threshold for each candidate feature, while the Extra Trees Classifier draws the thresholds at random and keeps the best of those random splits.
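The difference in how a threshold is chosen can be sketched for a single feature. All function names and the toy data are made up for this illustration, and the sketch leaves out the random feature subsets and the sub-sampling that the real models also use.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of labels (0.0 for an empty list)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def weighted_gini(values, labels, threshold):
    """Size-weighted impurity of the two children created by the threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Random Forest style: try every midpoint and keep the purest split."""
    ordered = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    return min(candidates, key=lambda t: weighted_gini(values, labels, t))


def random_threshold(values, rng):
    """Extra Trees style: draw one threshold uniformly between min and max."""
    return rng.uniform(min(values), max(values))


values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]
rng = random.Random(0)

t_best = best_threshold(values, labels)
t_rand = random_threshold(values, rng)
print(t_best, weighted_gini(values, labels, t_best))
print(t_rand, weighted_gini(values, labels, t_rand))
```

The searched threshold is never worse than the random one on the training data, but the random draw is much cheaper and adds extra variety between the trees.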

### What parameters can we tune?
The Random Forest parameters that we'll tune are: n_estimators, max_depth and min_samples_split.

We have already discussed max_depth in Decision Tree and n_estimators in Extra Trees. The min_samples_split parameter sets the minimum number of samples a node must contain before it is allowed to be split.

### Default parameters
* n_estimators = 10 (100 since scikit-learn 0.22)
* max_depth = None
* min_samples_split = 2

Let's start by trying N: 1 -> 100, Depth: 1 -> 30 and Samples: 2 -> 4

In [None]:
randomForestFrame = pd.DataFrame(columns = ['n_estimators', 'Depth', 'Samples', 'Average', 'Standard Deviation'])

for n in range(1, 100, 10):
    for d in range(1, 30, 5):
        for s in range(2, 5):
            alg = ensemble.RandomForestClassifier(n_estimators=n, max_depth=d, min_samples_split=s, random_state=0)
            scores = cross_val_score(alg, X, y, cv = 5)
            scoreAverage = scores.mean()
            scoreSTD = scores.std() * 2
            randomForestFrame.loc[len(randomForestFrame)] = [n, d, s, scoreAverage, scoreSTD]

randomForest = randomForestFrame.sort_values('Average', ascending=False)
randomForest.head(10)

In [None]:
sns.relplot(x = 'n_estimators', y = 'Average', data = randomForestFrame, kind="line")

In [None]:
sns.relplot(x = 'Depth', y = 'Average', data=randomForestFrame, kind="line")

In [None]:
sns.relplot(x = 'Samples', y = 'Average', data=randomForestFrame)

In [None]:
sns.heatmap(randomForestFrame[['Average','n_estimators', 'Depth', 'Samples']].corr(), annot=True)

Looking at the plots we can clearly see that n_estimators should be somewhere between 40 and 80,  max_depth should be above 5 and samples can vary wildly.

In [None]:
randomForestFrame = pd.DataFrame(columns = ['n_estimators', 'Depth', 'Samples', 'Average', 'Standard Deviation'])

for n in range(40,80,2):
    for d in range(5,25,5):
        for s in range(2, 5):
            alg = ensemble.RandomForestClassifier(n_estimators=n, max_depth=d, min_samples_split=s, random_state=0)
            scores = cross_val_score(alg, X, y, cv = 5)
            scoreAverage = scores.mean()
            scoreSTD = scores.std() * 2
            randomForestFrame.loc[len(randomForestFrame)] = [n, d, s, scoreAverage, scoreSTD]

randomForestFrame.sort_values('Average', ascending=False).head(10)

In [None]:
optimalRandomForestN = int(randomForestFrame.sort_values('Average', ascending=False).iloc[0].values[0])
optimalRandomForestDepth = int(randomForestFrame.sort_values('Average', ascending=False).iloc[0].values[1])
optimalRandomForestSamples = int(randomForestFrame.sort_values('Average', ascending=False).iloc[0].values[2])

In [None]:
sns.relplot(x = 'n_estimators', y = 'Average', data=randomForestFrame, kind="line")

# 10. Gradient Boosting Classifier
### How does it work?
The Gradient Boosting Classifier builds multiple Decision Trees (just like the Random Forest Classifier), but does this in sequential stages: each new tree is fitted to reduce the loss that the trees before it left behind.

### What parameters can we tune?
The Gradient Boosting Classifier parameters that we'll tune are: learning_rate, n_estimators and max_depth.

We have already discussed max_depth in Decision Tree and n_estimators in Extra Trees.

### Default parameters
* learning_rate = 0.1
* n_estimators = 100
* max_depth = 3

## learning_rate - How much does a single Decision Tree contribute?
The learning_rate parameter shrinks the contribution of each tree by learning_rate.
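To see what this shrinking does, here is a heavily simplified sketch of boosting. Instead of Decision Trees it uses the simplest possible learner (a constant that predicts the mean residual) and squared error instead of the classification loss, so it only illustrates the role of learning_rate and n_estimators. The function name `boosted_prediction` and the targets are made up for this illustration.

```python
def boosted_prediction(targets, learning_rate, n_estimators):
    """Boost constant learners on squared error: each stage predicts the mean residual."""
    prediction = 0.0
    for _ in range(n_estimators):
        residual_mean = sum(t - prediction for t in targets) / len(targets)
        # Only a fraction of each stage's correction is added to the model.
        prediction += learning_rate * residual_mean
    return prediction


targets = [10.0, 10.0, 10.0]
for learning_rate in (1.0, 0.5, 0.1):
    results = [boosted_prediction(targets, learning_rate, n) for n in (1, 5, 50)]
    print(learning_rate, [round(r, 3) for r in results])
```

With a learning rate of 1 the target is reached after one stage, while a learning rate of 0.1 needs many more stages to get close. This is why learning_rate and n_estimators are best tuned together.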

Let's start by trying N: 1 -> 100, Depth: 1 -> 30 and learning_rate: 0.1 -> 1

In [None]:
gradientBoostingTotalFrame = pd.DataFrame(columns = ['n_estimators', 'Depth', 'learning rate', 'Average', 'Standard Deviation'])

for n in range(1, 100, 10):
    for d in range(1, 30, 5):
        for l in range(1, 11):
            l = l/10
            alg = ensemble.GradientBoostingClassifier(n_estimators = n, learning_rate = l, max_depth = d, random_state = 0)
            scores = cross_val_score(alg, X, y, cv = 5)
            scoreAverage = scores.mean()
            scoreSTD = scores.std() * 2
            gradientBoostingTotalFrame.loc[len(gradientBoostingTotalFrame)] = [n, d, l, scoreAverage, scoreSTD]

gradientBoostingTotalFrame.sort_values('Average', ascending=False).head(5)

In [None]:
sns.heatmap(gradientBoostingTotalFrame[['Average','n_estimators', 'Depth', 'learning rate']].corr(), annot=True)

In [None]:
sns.relplot(x = 'n_estimators', y = 'Average', data=gradientBoostingTotalFrame, kind="line")

In [None]:
sns.relplot(x = 'Depth', y = 'Average', data=gradientBoostingTotalFrame, kind="line")

In [None]:
sns.relplot(x = 'learning rate', y = 'Average', data=gradientBoostingTotalFrame, kind="line")

Looking at these results and plots we can see that n_estimators should be somewhere above 60, a max_depth of 1 is optimal and the learning_rate should be somewhere between 0.5 and 1.

In [None]:
gradientBoostingFrame = pd.DataFrame(columns = ['n_estimators', 'learning rate', 'Average', 'Standard Deviation'])
for n in range(60, 150):
    for l in range(1, 11):
        l = l / 10
        # max_depth = 1 was the best depth in the coarse search above, so we fix it here
        alg = ensemble.GradientBoostingClassifier(n_estimators = n, learning_rate = l, max_depth = 1, random_state = 0)
        scores = cross_val_score(alg, X, y, cv = 5)
        scoreAverage = scores.mean()
        scoreSTD = scores.std() * 2
        gradientBoostingFrame.loc[len(gradientBoostingFrame)] = [n, l, scoreAverage, scoreSTD]

gradientBoostingFrame.sort_values('Average', ascending=False).head(5)

In [None]:
optimalGBCN = int(gradientBoostingFrame.sort_values('Average', ascending=False).iloc[0].values[0])
optimalGBCLearningRate = gradientBoostingFrame.sort_values('Average', ascending=False).iloc[0].values[1]

optimalGBCDepth = int(gradientBoostingTotalFrame.sort_values('Average', ascending=False).iloc[0].values[1])

# 11. Ada Boost Classifier 
### How does it work?
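The Ada Boost Classifier builds the same model multiple times, but the training samples get assigned different weights every time: samples that the previous model misclassified receive more weight, so the next model focuses on them.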
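One reweighting round can be sketched as follows. This follows the textbook discrete AdaBoost update for a two-class problem; the function name `adaboost_round` and the toy weights are made up for this illustration, and the sketch assumes the weights sum to 1 and the weighted error lies strictly between 0 and 1.

```python
import math


def adaboost_round(weights, is_wrong, learning_rate=1.0):
    """One discrete AdaBoost reweighting step for a two-class problem."""
    # Weighted error of the current model (weights are assumed to sum to 1).
    error = sum(w for w, wrong in zip(weights, is_wrong) if wrong)
    estimator_weight = learning_rate * math.log((1.0 - error) / error)
    # Misclassified samples are boosted, correctly classified ones keep their weight.
    boosted = [
        w * math.exp(estimator_weight) if wrong else w
        for w, wrong in zip(weights, is_wrong)
    ]
    total = sum(boosted)
    return estimator_weight, [w / total for w in boosted]


weights = [0.25, 0.25, 0.25, 0.25]
is_wrong = [False, False, False, True]  # the model misclassified the last sample

alpha, new_weights = adaboost_round(weights, is_wrong)
print(alpha)
print(new_weights)
```

After one round the single misclassified sample carries half of the total weight, so the next model is pushed hard to get it right.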

### What parameters can we tune?
We'll tune the following parameters: base_estimator, n_estimators and learning_rate.

We have already discussed n_estimators in Extra Trees and learning_rate in Gradient Boosting.

### Default parameters
* base_estimator = DecisionTreeClassifier(max_depth=1)
* n_estimators = 50
* learning_rate = 1

## base_estimator - What kind of model do I use?
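The base estimator is the model on which the boost is built. In scikit-learn 1.2 and later this parameter is named `estimator`; the old name `base_estimator` was removed in version 1.4.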

In [None]:
adaBoostFrame = pd.DataFrame(columns = ['Name', 'n_estimators', 'Learning Rate', 'Average', 'Standard Deviation'])
for b in [    
    tree.DecisionTreeClassifier(min_samples_leaf = optimalTreeSamples, max_depth= optimalTreeDepth, random_state=0),
    ensemble.ExtraTreesClassifier(n_estimators=optimalExtraTreeN, max_depth=optimalExtraTreeDepth, random_state=0),
    ensemble.RandomForestClassifier(n_estimators=optimalRandomForestN, max_depth=optimalRandomForestDepth, min_samples_split=optimalRandomForestSamples, random_state=0),
    ensemble.GradientBoostingClassifier(n_estimators = optimalGBCN, learning_rate = optimalGBCLearningRate, max_depth = 1, random_state=0),
]:
    for n in range(1,100, 5):
        for l in range(1,11):
            l = l/10
            # Note: scikit-learn 1.2+ calls this argument `estimator` instead of `base_estimator`.
            alg = ensemble.AdaBoostClassifier(n_estimators = n, base_estimator=b, learning_rate = l, random_state = 0)
            scores = cross_val_score(alg, X, y, cv = 5)
            scoreAverage = scores.mean()
            scoreSTD = scores.std() * 2
            adaBoostFrame.loc[len(adaBoostFrame)] = [b.__class__.__name__, n, l, scoreAverage, scoreSTD]

adaBoostFrame.sort_values('Average', ascending=False).head(10)

In [None]:
bestAdaBoost = adaBoostFrame.sort_values('Average', ascending=False).iloc[0]
optimalAdaBoostN = int(bestAdaBoost['n_estimators'])
optimalAdaBoostLearningRate = bestAdaBoost['Learning Rate']
# The base estimator is set by hand, which assumes the best row above is a DecisionTreeClassifier.
optimalAdaBoostBase = tree.DecisionTreeClassifier(min_samples_leaf = optimalTreeSamples, max_depth= optimalTreeDepth, random_state=0)

# 12. Final Score

In [None]:
finalScoreFrame = pd.DataFrame(columns = ['Algorithm Name', 'Average', 'Standard Deviation'])

In [None]:
alg = svm.SVC(gamma=optimalSVMGamma, kernel=optimalSVMKernel, degree=optimalSVMDegree, C = optimalSVMPenalty)
scores = cross_val_score(alg, X, y, cv = 5)
scoreAverage = scores.mean()
print('Final SVM Score: {:01.5f}'.format(scoreAverage))
scoreSTD = scores.std() * 2
finalScoreFrame.loc[len(finalScoreFrame)] = [alg.__class__.__name__, scoreAverage, scoreSTD]

In [None]:
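# RidgeClassifier rather than the Ridge regressor, so that cross_val_score reports accuracy.
alg = linear_model.RidgeClassifier(alpha = optimalRidgeAlpha)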
scores = cross_val_score(alg, X, y, cv = 5)
scoreAverage = scores.mean()
print('Final Ridge Score: {:01.5f}'.format(scoreAverage))
scoreSTD = scores.std() * 2
finalScoreFrame.loc[len(finalScoreFrame)] = [alg.__class__.__name__, scoreAverage, scoreSTD]

In [None]:
alg = tree.DecisionTreeClassifier(min_samples_leaf = optimalTreeSamples, max_depth= optimalTreeDepth, random_state=0)
scores = cross_val_score(alg, X, y, cv = 5)
scoreAverage = scores.mean()
print('Final DecisionTree Score: {:01.5f}'.format(scoreAverage))
scoreSTD = scores.std() * 2
finalScoreFrame.loc[len(finalScoreFrame)] = [alg.__class__.__name__, scoreAverage, scoreSTD]

In [None]:
alg = neighbors.KNeighborsClassifier(n_neighbors = optimalKNNNeighbors, weights = optimalKNNWeight, algorithm = optimalKNNAlg)
scores = cross_val_score(alg, X, y, cv = 5)
scoreAverage = scores.mean()
print('Final KNC Score: {:01.5f}'.format(scoreAverage))
scoreSTD = scores.std() * 2
finalScoreFrame.loc[len(finalScoreFrame)] = [alg.__class__.__name__, scoreAverage, scoreSTD]

In [None]:
alg = ensemble.ExtraTreesClassifier(n_estimators=optimalExtraTreeN, max_depth=optimalExtraTreeDepth, random_state=0)
scores = cross_val_score(alg, X, y, cv = 5)
scoreAverage = scores.mean()
print('Final Extra Trees Score: {:01.5f}'.format(scoreAverage))
scoreSTD = scores.std() * 2
finalScoreFrame.loc[len(finalScoreFrame)] = [alg.__class__.__name__, scoreAverage, scoreSTD]

In [None]:
alg = ensemble.RandomForestClassifier(n_estimators=optimalRandomForestN, max_depth=optimalRandomForestDepth, min_samples_split=optimalRandomForestSamples, random_state=0)
scores = cross_val_score(alg, X, y, cv = 5)
scoreAverage = scores.mean()
print('Final Random Forest Score: {:01.5f}'.format(scoreAverage))
scoreSTD = scores.std() * 2
finalScoreFrame.loc[len(finalScoreFrame)] = [alg.__class__.__name__, scoreAverage, scoreSTD]

In [None]:
alg = ensemble.GradientBoostingClassifier(n_estimators = optimalGBCN, learning_rate = optimalGBCLearningRate, max_depth = optimalGBCDepth, random_state = 0)
scores = cross_val_score(alg, X, y, cv = 5)
scoreAverage = scores.mean()
print('Final Gradient Boosting Score: {:01.5f}'.format(scoreAverage))
scoreSTD = scores.std() * 2
finalScoreFrame.loc[len(finalScoreFrame)] = [alg.__class__.__name__, scoreAverage, scoreSTD]

In [None]:
alg = ensemble.AdaBoostClassifier(n_estimators = optimalAdaBoostN, base_estimator=optimalAdaBoostBase, learning_rate=optimalAdaBoostLearningRate, random_state = 0)
scores = cross_val_score(alg, X, y, cv = 5)
scoreAverage = scores.mean()
print('Final AdaBoost Score: {:01.5f}'.format(scoreAverage))
scoreSTD = scores.std() * 2
finalScoreFrame.loc[len(finalScoreFrame)] = [alg.__class__.__name__, scoreAverage, scoreSTD]

In [None]:
cols = ['Algorithm Name', 'Average', 'Standard Deviation']

# Stack the default-parameter scores and the tuned scores, labelling each row.
compareScoreFrame = pd.concat([
    scoreFrame[cols].assign(**{'Before/After': 'Before'}),
    finalScoreFrame[cols].assign(**{'Before/After': 'After'}),
], ignore_index=True)



In [None]:
compareScoreFrame.sort_values('Average', ascending=False).head(3)

Now we can see the results of our hard work. All of the models that we've tuned score higher than their default counterparts. The most surprising one is the SVM, whose accuracy went up by about 0.3. The best model, however, is the AdaBoostClassifier, with almost 98% accuracy.
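If you want the gain per model as a number rather than reading it off a table, a pivot on the Before/After column does the job. This is a small sketch on placeholder data: the frame `demo`, the model names and the scores are invented for illustration and are not results from this kernel. It assumes the same column layout as `compareScoreFrame`, with exactly one Before row and one After row per algorithm.

```python
import pandas as pd

# Placeholder scores for illustration only; these are NOT the kernel's results.
demo = pd.DataFrame({
    'Algorithm Name': ['ModelA', 'ModelB', 'ModelA', 'ModelB'],
    'Average': [0.50, 0.75, 0.875, 0.8125],
    'Before/After': ['Before', 'Before', 'After', 'After'],
})

# One row per algorithm, one column per stage ('After' and 'Before').
wide = demo.pivot(index='Algorithm Name', columns='Before/After', values='Average')

# Improvement gained by tuning.
wide['Gain'] = wide['After'] - wide['Before']
print(wide.sort_values('Gain', ascending=False))
```

Applied to this kernel, the same two lines would start from `compareScoreFrame` instead of `demo`; `pivot` raises an error if an algorithm name appears more than once per stage.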

In [None]:
# Plot every model's average score; the point size shows the spread across the folds.
g = sns.relplot(x = "Algorithm Name", y = "Average", hue="Before/After", size = 'Standard Deviation', data = compareScoreFrame, sizes = (100,500), height=7)
# Tilt the algorithm names so they don't overlap.
g.fig.autofmt_xdate()

# 13. Conclusion
By tuning your models, you can squeeze a bit more performance out of them. However, this requires a lot more work and time than just using the default parameters. So if you have the time, I definitely recommend optimizing your models.

# 14. Sources
* *Introduction to Machine Learning with Python* by A. C. Müller and S. Guido
* Scikit-learn: https://scikit-learn.org/stable/index.html
* SVM Tuning: https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769
* GBC Tuning: https://medium.com/all-things-ai/in-depth-parameter-tuning-for-gradient-boosting-3363992e9bae