# Decision Trees, Ensemble Methods and Hyperparameter tuning workshop:

# 3rd November 2022

![title](images/pydata_cardiff.jpg)

## Outline of the workshop

This notebook will hopefully provide a simple outline of how classification models can be created using Decision Trees: either single trees or groups of trees in an "ensemble model". The following key points are raised:

* What is a decision tree?
* How is it used to build a classification model?
* How the results of multiple decision trees can be combined to create very powerful and accurate models
    * An explanation of the term: "Ensembles of __weak learners__"
* A description of the hyper-parameters, which affect how trees are built, is provided
* An example of how the best parameters are chosen is given using the [Optuna](https://optuna.org/) package

In [None]:
!pip install optuna palmerpenguins

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.datasets import load_wine, load_breast_cancer
from sklearn.metrics import accuracy_score

from palmerpenguins import load_penguins

### The first dataset - Palmer Penguins!

<img src="images/palmer_penguins.png" width="600">

This dataset consists of 344 datapoints on 3 species of penguin. We are focussing on the following columns for this workshop:

* Bill length (mm)
* Flipper length (mm)
* Bill depth (mm)
* Body mass (g)

The idea behind this is that we can build various models that use this information to predict which species of penguin the datapoints belong to. Each model uses the labels in the existing dataset to learn how to classify new datapoints. These are all examples of [supervised learning](https://scikit-learn.org/stable/tutorial/basic/tutorial.html).

In [None]:
penguins_raw = load_penguins()
penguins = (
    penguins_raw
    .drop(columns=["island", "sex", "year"])
    .dropna()
)

Note that 2 of these rows must have contained some missing data points for these columns. As this is such a low number, they are just removed from the analysis here

In [None]:
penguins_raw.shape

In [None]:
penguins.shape

In [None]:
penguins.head()

### Visualising the data - Seaborn `pairplot`

If you are dealing with a fairly small number of datapoints and columns, then the [pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html) function from the Seaborn library is a good way to explore them.

This shows all of the scatter plots for pairwise combinations of columns, as well as the density plots of the individual features along the diagonal.

What we are interested in finding here is the plot that shows the clearest separation between the different species. When just looking at this graph, it looks as though this is done by comparing `flipper_length_mm` and `bill_length_mm`.

Note that these plots are symmetric about the diagonal, so the graph of these 2 features occurs twice.

In [None]:
sns.pairplot(
    penguins,
    hue="species",
    height=2.5,
    plot_kws={"s": 10}
);

## Using a JointPlot to focus on 2 features

Another useful function from seaborn is the [jointplot](https://seaborn.pydata.org/generated/seaborn.jointplot.html), which by default shows the scatter plot and distributions.

In [None]:
ax = sns.jointplot(
    data=penguins,
    x="flipper_length_mm",
    y="bill_length_mm",
    hue="species",
);

## Splitting by feature value

One thing to note when using a decision tree is that the splits are made at the feature level. What this means is that, when looking at this 2D plot - the splits will occur in a perpendicular fashion to the axes. This means that the splits are either horizontal or vertical - we cannot get diagonal splits.

What can be seen below is an attempt to find the best splits using visual judgement alone - using the following values:

* `flipper_length_mm` - 206mm
* `bill_length_mm` - 44mm

These values are very similar to those found by the decision tree algorithm, which you will see a few cells below - and hopefully provides an intuition as to what the algorithm is looking to perform.

In this first plot - all of the datapoints to the right of the red line can be classified as Gentoo - with only a few errors

In [None]:
ax = sns.jointplot(
    data=penguins,
    x="flipper_length_mm",
    y="bill_length_mm",
    hue="species"
)
ax.ax_joint.axvline(206, c='r');

#### This next plot is only looking at the remaining data points

Once again - the split is not perfect, but provides a very good generalisation to perform the remaining classifications

In [None]:
ax = sns.jointplot(
    data=penguins.loc[lambda x: x["flipper_length_mm"] < 206],
    x="flipper_length_mm",
    y="bill_length_mm",
    hue="species"
)
ax.ax_joint.axhline(44, c='r');

## Trying out some Decision Trees

In this section, 2 different trees will be built with a maximum depth of 1 and 2 respectively, which allows 1 and 2 levels of splits.

If this is not clear now - then some later diagrams should help to clarify what is going on.

In order to assess performance, we are making use of a procedure called "cross validation" (CV). This is used to ensure that a model's performance is always assessed on datapoints that __were not used when building the model__. This is such an important concept to remember, as the main interest when building a model is to assess how it will respond to new data points that it did not see when it was fitted. A more detailed description of this process can be read on [the scikit-learn website](https://scikit-learn.org/stable/modules/cross_validation.html).

Some key points to note:

* The data is split into _K_ "folds", and each fold gets to become the validation set once.
    * So for 10-fold CV the following process occurs:
        * The data is split into 10, with each fold containing 10% of the data
        * The model is trained on 90% of the data - with the performance assessed on the held out 10%
            * This process is repeated 10 times
* "Stratified" ensures that the proportions of each class are kept in each fold
* "shuffle" ensures that the whole dataset is shuffled before the splits occur
    * This can be kept the same for comparison purposes by setting a "seed" value via the `random_state` parameter
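To make the "stratified" point concrete, here is a minimal sketch on made-up labels (the 60/30/10 class sizes and the names `y_toy` and `X_toy` are assumptions chosen for illustration, not part of the workshop data). It simply counts how many of each class land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (hypothetical): 60 of class "a", 30 of class "b", 10 of class "c"
y_toy = np.array(["a"] * 60 + ["b"] * 30 + ["c"] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature values

cv_toy = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

fold_summaries = []
for train_idx, valid_idx in cv_toy.split(X_toy, y_toy):
    # Count how many of each class landed in this validation fold
    labels, counts = np.unique(y_toy[valid_idx], return_counts=True)
    fold_summaries.append({str(label): int(count) for label, count in zip(labels, counts)})

for i, summary in enumerate(fold_summaries, start=1):
    print(f"Fold {i}: {summary}")
```

Each validation fold should keep the same 6:3:1 ratio as the full set of labels, which is the behaviour the bullet points describe.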

In [None]:
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

In [None]:
X_penguins = penguins.drop(columns="species")
y_penguins = penguins["species"]

### Note that the number of penguins in each species is not the same - this is why using a Stratified CV is important

In [None]:
y_penguins.value_counts()

In [None]:
y_penguins.value_counts() / y_penguins.value_counts().sum()

## Fitting the models

Note that a decision tree with a depth of 1 can be called a __Decision Stump__

This tree only has the freedom to perform __one__ split to try and get the best possible classifier. It is therefore no surprise that its performance is below 83% accuracy for all splits.

### A note on performance metrics

Choosing the best way of judging a model's performance can get very complicated. A full description of this process is outside of the scope of this workshop, and to keep things simple, this notebook is making use of the `accuracy` metric, which means that the model is judged solely on whether it makes the correct classification. The outcome is binary for every validation datapoint (right or wrong) and the score is given as the proportion of correctly classified datapoints.

This metric is completely unsuitable if you have a very imbalanced dataset. For example, if there were only 2 species of penguin, with one of them accounting for <1% of the total number; then a model that just states that all penguins are of the majority species will have an accuracy value of >99% while also being completely useless!
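The imbalance problem can be reproduced in a few lines. This is only a sketch on invented labels (995 "common" and 5 "rare" penguins are assumed numbers), using a `DummyClassifier` that always predicts the majority class. Recall for the rare class is included purely to show what accuracy hides.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 995 "common" penguins and 5 "rare" ones
y_imbalanced = np.array(["common"] * 995 + ["rare"] * 5)
X_placeholder = np.zeros((len(y_imbalanced), 1))  # the dummy model ignores the features

# A "model" that always predicts the majority class
majority_model = DummyClassifier(strategy="most_frequent")
majority_model.fit(X_placeholder, y_imbalanced)
predictions = majority_model.predict(X_placeholder)

accuracy = accuracy_score(y_imbalanced, predictions)
# Proportion of the rare penguins that were actually found
rare_recall = recall_score(y_imbalanced, predictions, pos_label="rare")

print(f"Accuracy: {accuracy:.3f}")
print(f"Recall for the rare class: {rare_recall:.3f}")
```

The accuracy looks excellent even though not a single rare penguin is identified.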

In [None]:
tree1 = DecisionTreeClassifier(max_depth=1)

In [None]:
tree1_scores = cross_val_score(tree1, X_penguins, y_penguins, scoring="accuracy", cv=cv)

In [None]:
tree1_scores

In [None]:
tree1_scores.mean()

### A tree with 2 splits

This next tree is permitted to split the data again following the first split. This is very similar to the approach that was shown in the plots above and, not surprisingly, gets a much higher level of accuracy.

In [None]:
tree2 = DecisionTreeClassifier(max_depth=2)

In [None]:
tree2_scores = cross_val_score(tree2, X_penguins, y_penguins, scoring="accuracy", cv=cv)

In [None]:
tree2_scores

In [None]:
tree2_scores.mean()

## Plotting the Trees

The advantage of using single Decision Trees is that they are very easy to interpret. This can be seen when using the `plot_tree` function on a fitted model.

Some terminology:

* __Node__ - this is the visual representation of a block of datapoints - either before or after a split
    * This can be seen as the rectangular boxes in the plots below
* __Parent node__ - the node _before_ a split has occurred and is higher in the plot
* __Child node__ - the node _after_ a split has occurred and is lower in the plot
* __Branches__ - all of the paths flowing downward throughout the tree
* __Leaf node__ - the final nodes at the ends of the branches:
    * This is where the classification occurs

In [None]:
tree1.fit(X_penguins, y_penguins)

In [None]:
plot_tree(tree1);

### Getting the column names

Unfortunately - the plot does not show the column names, but we can use the index values and the columns from the original dataframe to obtain the feature that was used for the split

In [None]:
X_penguins.columns[2]
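As an alternative to looking up the index by hand, the tree helpers in scikit-learn accept a `feature_names` argument. The sketch below uses the wine data bundled with scikit-learn as a stand-in so that it runs on its own (the names `X_demo`, `y_demo` and `stump` are just for illustration), and uses `export_text` to print the tree as text.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

# Wine data stands in here so that the example is self-contained
wine = load_wine(as_frame=True)
X_demo, y_demo = wine.data, wine.target

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_demo, y_demo)

# Passing feature_names labels each split with the column name instead of an index
tree_text = export_text(stump, feature_names=list(X_demo.columns))
print(tree_text)

# The same argument is available for the plot used in this notebook, e.g.
# plot_tree(stump, feature_names=list(X_demo.columns), class_names=list(wine.target_names));
```

For the penguin stump, the equivalent would be passing `feature_names=list(X_penguins.columns)` to `plot_tree`.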

## Majority rules

Note that this decision stump has only been able to make a binary classification, as it was only able to perform a single split. The Gentoo penguins have been identified very well, but the remaining 2 species cannot be separated. Because there are more Adelie penguins than Chinstrap, the model classifies all of the remaining datapoints as Adelie, as this way the accuracy score will be higher.

In [None]:
np.unique(tree1.predict(X_penguins))

## Gini Impurity

This metric is the key to explaining how the decision tree makes its classifications. It can be thought of as "how clear is the signal from this node?". If a node following a split only contains values from 1 of the classes, then this message is 100% unambiguous... it is predicting that class with complete certainty. If however, the node contains equal numbers of each class, then it is impossible to tell which class is more likely.

This ambiguity can be quantified using the __Gini Impurity__ measure:

* A __lower__ value suggests __less__ ambiguity (with a minimum value of 0)
* A __higher__ value suggests __more__ ambiguity (max value dependent on number of classes)

#### The formula is as follows: (don't worry if this does not make sense)

$\Large 1 - \sum_{i=1}^{n} (p_{i})^{2}$

Here we create arrays of probabilities for class A and B. This represents the situations when a leaf node is fully populated with category A $p(A) = 1$ or category B $p(B) = 1$, and all combinations in between.

We can see that the impurity value is 0 when the probability of A is either 1 or 0 - in the latter case because we can be 100% sure of class B.

In [None]:
prob_a = np.linspace(0, 1, 1000)
prob_b = 1 - prob_a

In [None]:
gini_results = 1 - (prob_a**2 + prob_b**2)

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(prob_a, gini_results)
ax.set_xlabel("Probability of A")
ax.set_ylabel("Gini Impurity Value")
plt.suptitle("Gini Impurity for Binary Classification");

Note that when using more classes - the impurity value can go greater than the 0.5 level that is suggested in the figure

In [None]:
1 - 3 * (1 / 3) ** 2  # three equally likely classes
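The formula can also be wrapped in a small helper that works from the class counts in a node. This is a sketch for illustration only (the function name `gini_impurity` and the example counts are my own, not part of scikit-learn).

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of datapoints in each class."""
    counts = np.asarray(class_counts, dtype=float)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

pure_node = gini_impurity([50, 0, 0])        # only one class present
two_way_tie = gini_impurity([25, 25])        # worst case for 2 classes
three_way_tie = gini_impurity([20, 20, 20])  # worst case for 3 classes

print(pure_node, two_way_tie, three_way_tie)
```

With $n$ equally represented classes the value is $1 - \frac{1}{n}$, which is why the maximum depends on the number of classes.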

## Information Entropy

Another criterion measure that can be used to the same effect is Information or __Shannon's__ entropy. It has the following formula (using a logarithm of base 2):

$\Large - \sum_{i=1}^{n} p_{i} \cdot \log_{2}(p_{i})$

Note that the __dot__ here represents multiplication.

This is a perfectly reasonable alternative - but it is computationally slower due to the calculation of the logarithm.

In [None]:
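# Drop both end points (p = 0 and p = 1) so that log2(0) is never evaluated,
# while keeping prob_a and prob_b aligned so that each pair still sums to 1
entropy_results = -(prob_a[1:-1] * np.log2(prob_a[1:-1]) + prob_b[1:-1] * np.log2(prob_b[1:-1]))

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(prob_a[1:-1], entropy_results)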
ax.set_xlabel("Probability of A")
ax.set_ylabel("Entropy Value")
plt.suptitle("Entropy for Binary Classification");
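As with Gini, the entropy formula can be written as a helper that takes the class counts in a node. This is a minimal sketch (the name `shannon_entropy` is my own), and it uses the usual convention that a class with zero probability contributes nothing to the sum.

```python
import numpy as np

def shannon_entropy(class_counts):
    """Entropy (base 2) of a node, given the number of datapoints in each class."""
    counts = np.asarray(class_counts, dtype=float)
    proportions = counts / counts.sum()
    nonzero = proportions[proportions > 0]  # treat 0 * log2(0) as 0
    # Adding 0.0 turns a possible -0.0 into 0.0 for a pure node
    return float(-np.sum(nonzero * np.log2(nonzero))) + 0.0

pure_entropy = shannon_entropy([50, 0])            # no ambiguity
two_way_entropy = shannon_entropy([25, 25])        # worst case for 2 classes
three_way_entropy = shannon_entropy([20, 20, 20])  # worst case for 3 classes

print(pure_entropy, two_way_entropy, three_way_entropy)
```

The behaviour mirrors Gini: 0 for a pure node and a maximum when every class is equally represented ($\log_{2}(n)$ for $n$ classes).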

In [None]:
tree1.criterion = "entropy"

In [None]:
tree1.fit(X_penguins, y_penguins)

In [None]:
plot_tree(tree1);

## Looking at a tree with a depth of 2

In [None]:
tree2.fit(X_penguins, y_penguins)

### Note that the 3rd leaf node achieves a Gini score of 0 - meaning no impurity

In [None]:
_, ax = plt.subplots(figsize=(10, 5))
plot_tree(tree2, ax=ax);

### Looking at the information on the 2 left branches

This is what we saw earlier, and gets very good classification results

In [None]:
X_penguins.columns[2]

In [None]:
X_penguins.columns[0]

In [None]:
ax = sns.jointplot(
    data=penguins,
    x="flipper_length_mm",
    y="bill_length_mm",
    hue="species"
)

ax.ax_joint.axvline(206.5, c='r')
ax.ax_joint.axhline(43.35, c='r', xmax=0.555);  # This just took experimentation to get the xmax value correct!

## The right hand path looks a bit different - is the second split really necessary?

Also note that the top left area contains data points for Gentoo penguins __only__ - this is why we are getting a Gini value of 0 for this leaf node.

Also - notice that there are still 2 of the Gentoo penguins outside of this classification region.

__A zero Gini score does NOT mean that we have captured ALL of a particular class__

In [None]:
X_penguins.columns[1]

In [None]:
ax = sns.jointplot(
    data=penguins,
    x="bill_depth_mm",
    y="flipper_length_mm",
    hue="species"
)

ax.ax_joint.axvline(17.65, c='r', ymin=0.56)
ax.ax_joint.axhline(206.5, c='r');

### Is this last split really necessary?

The split in the top right hand corner has only identified 7 data points. While it can be argued that it is necessary here... it could be _specific to the data used to train the model_. This is a good example of what could very likely be a case of __Overfitting__. This is when we make a model that is too tailored to the particular data points used to train the model.

#### This is a common problem that can occur when using a single decision tree

# Ensemble methods

The next stage is not to just use 1 Decision Tree - but build lots of them in order to get a consensus decision!

## Wine Dataset

This is a more complex dataset, but is great to illustrate the power of ensemble methods.

The dataset consists of 178 datapoints of different wines, which have been classified into 3 different types, labelled 0, 1 and 2. Each wine has information on 13 characteristics, and the task of the model is again a multi-class classification task using this labelled data.

Note that this dataset now has too many features to be able to use the pair plot function effectively.

In [None]:
wine_raw = load_wine()

In [None]:
wine_df = pd.DataFrame(
    data=wine_raw["data"],
    columns=wine_raw["feature_names"]
)

full_wine_df = (
    wine_df
    .assign(**{
        "Type": wine_raw["target"].astype(str)
    })
)

## Slight imbalance again

We will look at a way that this can be dealt with once we start looking at Random Forests.

In [None]:
full_wine_df["Type"].value_counts()

## Fitting the Models - both single trees and ensembles

The work here is heavily based on a fantastic blog by Frank Ceballos, which can be read [here](https://towardsdatascience.com/an-intuitive-explanation-of-random-forest-and-extra-trees-classifiers-8507ac21d54b).

The main idea here is to do the following:

* Create a decision stump
* See how well this performs on the wine data
* Create different ensembles of this stump to try and get better performance - note that the final 2 require small changes to the decision tree parameters:
    * 1 with a deliberate error
    * 1 that works via bootstrap sampling
    * 1 that works in the manner of a __Random Forest model__
    * 1 that works in the manner of an __Extremely Randomised Trees model__

## Incorrect Bagging

The following cells show the first attempt at an ensemble.

First we create the Decision stump in the same manner as before. In order to get a baseline - we will use the [DummyClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) from scikit-learn. This model just produces predictions at random, and shows the result of a model that performs purely at chance level.

In [None]:
X_wine = wine_df.copy()
y_wine = full_wine_df["Type"]

The dummy classifier gives the expected poor results - note that with 3 classes that do not show massive levels of imbalance, making predictions based solely on chance would intuitively result in a score of c. $\frac{1}{3}$

In [None]:
dummy = DummyClassifier(strategy="stratified")

In [None]:
cross_val_score(dummy, X_wine, y_wine, cv=cv)

In [None]:
wine_tree = DecisionTreeClassifier(max_depth=1)

#### Slightly better performance

This is expected, as the single tree has at least _started_ to segregate the data - but it is not doing enough.

In [None]:
cross_val_score(wine_tree, X_wine, y_wine, cv=cv)

## Creating the ensemble

Note the parameters:

* `base_estimator` - The type of decision tree that will be used multiple times (this parameter is named `estimator` from scikit-learn 1.2 onwards)
* `n_estimators` - The number of trees that will be built
* `bootstrap` - More on this later!
* `n_jobs` - How many jobs you want to run in parallel

### Parallel computation!

A wonderful feature of these trees is that each can be trained _independently_ of each other. This means that we can make use of the multiple cores on the computer to train the models in an _embarrassingly parallel_ manner.

If you want to use as much parallelisation as possible, but don't know how many cores you have - just set this value to -1

In [None]:
wine_bagging_bad = BaggingClassifier(
    wine_tree,  # passed positionally: the keyword was `base_estimator` before scikit-learn 1.2 and is `estimator` afterwards
    n_estimators=100,
    bootstrap=False,
    n_jobs=-1
)

In [None]:
cross_val_score(wine_bagging_bad, X_wine, y_wine, cv=cv, n_jobs=-1)

## What??!!!!!!1

... is going on here? Wasn't this supposed to be better??

So - what is happening here is that we are effectively building __exact replicas__ of the same tree. As Frank Ceballos says - this is like asking the same person what their favourite movie is multiple times - you're going to get the same answer every time!

The trick here is in the `bootstrap` parameter - we need to set it to `True`

### What is bootstrapping?

In short - the term refers to __sampling with replacement__. So - if we have 100 data points - we don't just take all of them, we _sample_ 100 of them, with each datapoint having the chance to be selected multiple times.

The idea behind this is that each tree will not get the same datapoints, but the _sampled_ datapoints.

Another thing to remember when doing this resampling is that you are more likely to sample those that tend to cluster together, and _less_ likely to get the outliers. Of course - some of the trees will get the outliers - but remember that the result is taken as the _aggregate performance_, so on average, the outliers will have less of an effect on the final outcome.

It really seems as though this simple change of doing the sampling would not make much difference - but in fact it makes _all the difference!_

This is a perfect example of how these models are __Ensembles of weak learners__
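Here is a minimal sketch of what a single bootstrap sample looks like, using NumPy directly rather than anything inside `BaggingClassifier` (the seed and the 100 datapoints are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for repeatability

n_points = 100
original_indices = np.arange(n_points)

# Sampling WITH replacement: same size as the original, but duplicates are allowed
bootstrap_indices = rng.choice(original_indices, size=n_points, replace=True)

n_unique = len(np.unique(bootstrap_indices))
print(f"Sampled {len(bootstrap_indices)} datapoints, of which {n_unique} are distinct")
```

On average, roughly 63% of the original datapoints appear in any one bootstrap sample, so every tree in the ensemble sees a slightly different version of the training data.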

In [None]:
wine_bagging = BaggingClassifier(
    wine_tree,  # passed positionally: the keyword was `base_estimator` before scikit-learn 1.2 and is `estimator` afterwards
    n_estimators=100,
    bootstrap=True,
    random_state=456, # This is to ensure that the bootstrapping is the same
    n_jobs=-1
)

In [None]:
cross_val_score(wine_bagging, X_wine, y_wine, cv=cv, n_jobs=-1)

## Ensemble like a Random Forest

The following shows how the bagging occurs in a Random Forest:

* Each tree does __not__ get all of the features to use!
    * Instead - it has to work with a limited set
    * You can specify the exact number - but a common strategy is to use the square root or the log-2 value of the total number
    * Another thing is the `splitter` parameter - we will discuss this more in the next section
        * Here we explicitly set it as `best` - but this is the default anyway

In [None]:
wine_tree_2 = DecisionTreeClassifier(
    max_depth=1,
    max_features="sqrt",
    splitter="best",
    random_state=753  # this is needed due to the random nature of the feature selection
)

In [None]:
wine_bagging_2 = BaggingClassifier(
    wine_tree_2,  # passed positionally: the keyword was `base_estimator` before scikit-learn 1.2 and is `estimator` afterwards
    n_estimators=100,
    bootstrap=True,
    random_state=456,
    n_jobs=-1
)

In [None]:
cross_val_score(wine_bagging_2, X_wine, y_wine, cv=cv, n_jobs=-1)

## Why is this so much better?

This result shows that there _must_ be some variables that contain more useful information to allow the model to discriminate between the classes. Some of the features of the wine might not be of much use at all.

By only giving the different trees subsets of the variables, there will be cases where some trees get the _less_ useful ones, and others get the _more_ useful ones. Those with the better features will do a better job of classification, with lower Gini values in their leaf nodes. When we take the final aggregated outcome, these trees are sending a robust and cohesive signal, whereas the trees that had to work with the worse features will be more confused. __The majority signal will win on aggregate__

## Ensemble like an Extremely Randomised Trees Classifier

There are 2 key differences here:

* We do not bootstrap (by default when using the actual module from scikit-learn - you can if you want to!)
* We use a random splitter

#### What is meant by the splitter?

So this was the bit that confused me the most when I was learning the difference between these 2 algorithms; so I will do my best to summarise here:

* The best method performs a thorough scan of all the features when building the trees - calculating the Gini (or entropy) at all the stages
    * This gets the _best_ possible split
    * But... it is computationally expensive!
* A random method will instead pick a number of random splits to check (all different at each split and for the different trees) - and __not__ do a full scan
    * This won't necessarily find the best split overall - but will return the _best split of those that it has checked_
    * The idea being that if this is done across enough trees - we will still get a very powerful answer

In [None]:
wine_tree_3 = DecisionTreeClassifier(
    max_depth=1,
    max_features="sqrt",
    splitter="random",
    random_state=753  # This will also fix the randomness for the splitter
)

In [None]:
wine_bagging_3 = BaggingClassifier(
    wine_tree_3,  # passed positionally: the keyword was `base_estimator` before scikit-learn 1.2 and is `estimator` afterwards
    n_estimators=100,
    bootstrap=False,
    n_jobs=-1,
    random_state=456
)

In [None]:
cross_val_score(wine_bagging_3, X_wine, y_wine, cv=cv, n_jobs=-1)

## This isn't doing as well - why not just use the BEST splitter? Isn't "best" always best?

This was the bit that confused me the most when learning about this. But it does actually make sense when you think about what can happen with different datasets.

#### Random Forest

This algorithm will try its hardest to find the best split in all of the features given to each tree. While this is fantastic if there is a clear signal, it can actually be a hindrance if the trees have any features that contain no (or very little) signal, as they will _still_ try to find what they can. What this will result in is the model __trying to find signal in noise__, as it will find any bespoke signal in the training data and _assume_ that this can be generalised to new data points. This is the typical example of __what leads to overfitting__.

In fact, I have read that Random Forests, while often being the first algorithm that researchers will try, will often require some form of feature selection method, used in addition to the fitting of the model in a pipeline. More about this can be read in the [feature-selection](https://scikit-learn.org/stable/modules/feature_selection.html) and [pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline) sections of the scikit-learn documentation.

#### Extremely Randomised Trees

These, on the other hand are not so hampered by noise variables, as they will not expend so much effort trying to find non-existent best splits in noisy features. They will select some random splits, see what happens, and then move on.

As for the bootstrapping - it is True by default for Random Forests and False for Extremely Randomised Trees. This can be changed by setting the `bootstrap` parameter

# Using these models directly

Note that these models will get better performance, as the trees used to make them are no longer restricted to being stumps

## [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

Click on the title to see the scikit-learn documentation.

As can be seen - with the default parameters, we are getting amazing performance, with 7 out of the 10 splits achieving perfect classification

In [None]:
random_forest = RandomForestClassifier(random_state=123)

In [None]:
cross_val_score(random_forest, X_wine, y_wine, cv=cv, n_jobs=-1)

## [Extra Trees](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)

Click on the title to see the scikit-learn documentation.

Note that this is doing even better with 8 of the 10 splits having perfect classification! Showing that the "best" split doesn't necessarily mean "best" performance.

In [None]:
extra_trees = ExtraTreesClassifier(random_state=123)

In [None]:
cross_val_score(extra_trees, X_wine, y_wine, cv=cv, n_jobs=-1)

### Showing the time difference with `%%timeit`

This is a useful "magic" function that can be used in Jupyter notebooks. It runs the command several times in order to get a distribution of the compute time required.

As can be seen, the Extra Trees model runs in 70-80% of the time (or better) of the Random Forest. Note that this will vary on each run and might be different on your computer.

In [None]:
timer_random_forest = RandomForestClassifier(random_state=345)

In [None]:
%%timeit
timer_random_forest.fit(X_wine, y_wine)

In [None]:
timer_extra_trees = ExtraTreesClassifier(random_state=345)

In [None]:
%%timeit
timer_extra_trees.fit(X_wine, y_wine)

## A note on imbalanced classes and setting the `class_weight`

One interesting parameter that can be set in the tree-based models used so far (`DecisionTreeClassifier`, `RandomForestClassifier` and `ExtraTreesClassifier`) is `class_weight`.

This can be very useful in situations when you don't have equal representation of the different classes that you are trying to predict. It can be argued that we could have used it in this notebook so far.

### Weighted Classes

A full description of how this is done will not be discussed here, but the idea is as follows:

* The classes with lower representation are given a higher weight
* These datapoints then contribute __more__ towards a higher Gini Impurity than the lower weighted classes
* This has the effect that the splits start to favour trying to classify these _minority_ classes with higher accuracy
* The idea is that this counteracts the type of behaviour that we saw earlier with the Chinstrap penguins
* A good strategy is to use `class_weight="balanced"` when building the model, as this weights each class by its __inverse proportion__
    * This means that those classes with fewer data points get a higher weighting
    * For the forest models, you can also choose `class_weight="balanced_subsample"`, which recalculates the weights from the bootstrap sample drawn for each tree
        * The proportions in a bootstrap sample are usually close to the originals, so the two options often give similar weights
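To see what "inverse proportion" means in numbers, the sketch below uses scikit-learn's `compute_class_weight` utility on invented labels (8 Adelie and 2 Chinstrap are assumed counts for illustration) and compares the result with the formula `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 8 of one species and 2 of another
y_toy = np.array(["Adelie"] * 8 + ["Chinstrap"] * 2)
classes = np.unique(y_toy)

# Weights as used by class_weight="balanced"
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same values worked out by hand: n_samples / (n_classes * class_count)
manual_weights = len(y_toy) / (len(classes) * np.array([8, 2]))

for label, weight in zip(classes, weights):
    print(f"{label}: {weight}")
```

The minority class ends up with a weight 4 times that of the majority class, matching the 8:2 ratio of the counts.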

## A note on feature importance

While the ensemble methods are, in general, far superior at classification prediction, a big disadvantage is that they are not as interpretable, as we cannot just simply see how a single tree is traversed.

However, what we can do with a fitted model is look at the feature importances

In [None]:
rf_wine_importance = timer_random_forest.feature_importances_

wine_importance_df = pd.DataFrame({
    "Feature": X_wine.columns,
    "Importance": rf_wine_importance
}).sort_values("Importance", ascending=False)

In [None]:
wine_importance_df

In [None]:
top_5_features = list(wine_importance_df["Feature"].values[:5]) + ["Type"]

In [None]:
top_5_features

In [None]:
sns.pairplot(
    data=full_wine_df.loc[:, top_5_features],
    hue="Type",
    height=2.5,
    plot_kws={'s': 10}
);

# Hyper-parameters and their tuning

## Random Forest hyper-parameters - more can be seen in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

It should now be clear what these hyperparameters are doing and how they affect the building of the trees in the ensemble. Note that many of them are put in place to actually restrict the freedom that each tree has to a greater or lesser degree. This is vital when dealing with data sets that are not as clear cut as the examples that we have seen, and we need to make sure that the models do not overfit to noise in the training data.

Note that this is not the full list, and there are some parameters in the documentation that we have not discussed due to time. These will not be listed here, and are not required for the final task.

* `n_estimators` - Integer: The number of trees to build - default 100
* `criterion` - String: 'gini' or 'entropy' or 'log_loss' (which we haven't discussed here) - default 'gini'
* `max_depth` - Integer: Sets how deep the tree can go - default None = no limit
* `min_samples_split` - Integer or Float: How many datapoints a node must contain before it is allowed to be split further (proportion of total if float) - default 2
* `min_samples_leaf` - Integer or Float: How many datapoints must remain in a leaf node (proportion of total if float) - default 1
* `min_weighted_fraction_leaf` - Float: The minimum weighted fraction of the datapoints in a leaf as a proportion of the entire original set - default 0
    * Each datapoint is equally weighted if no `class_weight` is set
    * This parameter is a bit more complicated to understand - but is another way of stopping a leaf node from having too few datapoints
* `max_features` - Integer, Float or String: Either specify the exact number, the proportion - or use one of the built in strings like 'sqrt' - default 'sqrt'
    * see docs for more info
* `max_leaf_nodes` - Integer: How many leaf nodes are permitted per tree - default None = no limit
* `min_impurity_decrease` - Float: The reduction in impurity that must be seen for a split to occur - default 0
    * Simply put - "is this split really worth it? Does it bring any real benefit?"
* `max_samples` - Integer or Float: How many datapoints to draw for each tree when bootstrapping (proportion of total if float) - default None = the original sample size
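As a small illustration of how these parameters restrict a tree, the sketch below fits one unrestricted tree and one restricted tree on the wine data bundled with scikit-learn, then compares their size. The specific values (`max_depth=2`, `min_samples_leaf=10`) are arbitrary choices for the example, not recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_wine(return_X_y=True)

# No limits: the tree keeps splitting until every leaf is pure
unrestricted = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Limited depth and a minimum number of datapoints in every leaf
restricted = DecisionTreeClassifier(
    max_depth=2,
    min_samples_leaf=10,
    random_state=0,
).fit(X_demo, y_demo)

for name, tree in [("unrestricted", unrestricted), ("restricted", restricted)]:
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```

The same arguments can be passed to `RandomForestClassifier` and `ExtraTreesClassifier`, where they apply to every tree in the ensemble.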

In [None]:
heart_raw = pd.read_csv("https://raw.githubusercontent.com/pydatacardiff/pydata_cardiff_workshop4/main/data/heart.csv")

In [None]:
X_heart = heart_raw.drop(columns="output")
y_heart = heart_raw["output"]

Note that this is a pretty balanced dataset

In [None]:
y_heart.mean()

In [None]:
random_forest_heart = RandomForestClassifier(random_state=123)
extra_trees_heart = ExtraTreesClassifier(random_state=123)

In [None]:
cross_val_score(random_forest_heart, X_heart, y_heart, cv=cv, n_jobs=-1).mean()

In [None]:
cross_val_score(extra_trees_heart, X_heart, y_heart, cv=cv, n_jobs=-1).mean()

## Optuna

In [None]:
import optuna

In [None]:
def objective(trial):
    # This will check both Random Forests and Extra Trees
    classifier_type = trial.suggest_categorical("classifier", ["RandomForest", "ExtraTrees"])
    
    # Setting the distributions for the parameters - sticking to common parameters
    n_estimators = trial.suggest_int("n_estimators", 10, 1000)
    max_depth = trial.suggest_int("max_depth", 3, 200, log=True)
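    # scikit-learn requires both of these fractions to be strictly greater than 0,
    # so the lower bounds are set to a small positive value
    min_samples_split = trial.suggest_float("min_samples_split", 0.01, 1)
    min_samples_leaf = trial.suggest_float("min_samples_leaf", 0.01, 0.5)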
    
    if classifier_type == "RandomForest":
        model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            random_state=123,
            n_jobs=-1
        )
    else:
        model = ExtraTreesClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            random_state=123,
            n_jobs=-1
        )
        
    cv_scores = cross_val_score(model, X_heart, y_heart, cv=cv, n_jobs=-1, scoring="accuracy")
    
    score = cv_scores.mean()
    
    return score

In [None]:
study = optuna.create_study(study_name="ensemble_study", direction="maximize")
study.optimize(objective, n_trials=200)
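Once the study has finished, `study.best_params` is a dictionary keyed by the names passed to the `trial.suggest_*` calls, and `study.best_value` holds the matching score. The sketch below shows one way to turn such a dictionary back into a model. The helper `build_model` and the values in `example_params` are made up for illustration and are not results from this study.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

def build_model(params, random_state=123):
    """Rebuild a classifier from a dict shaped like `study.best_params`."""
    params = dict(params)  # copy so the caller's dict is not modified
    classifier_type = params.pop("classifier")
    if classifier_type == "RandomForest":
        model_class = RandomForestClassifier
    else:
        model_class = ExtraTreesClassifier
    return model_class(**params, random_state=random_state, n_jobs=-1)

# Hypothetical values standing in for `study.best_params` - not real results
example_params = {
    "classifier": "ExtraTrees",
    "n_estimators": 200,
    "max_depth": 10,
    "min_samples_split": 0.05,
    "min_samples_leaf": 0.02,
}

best_model = build_model(example_params)
print(best_model)
```

With the real study this would be `build_model(study.best_params)`, after which the model can be fitted or cross-validated in the same way as before.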

In [None]:
from optuna.visualization import plot_optimization_history

In [None]:
plot_optimization_history(study)