# (Task 1) Importing and filling the data
We're going to be abusing `pandas`, a pretty [standard tool for managing datasets for analysis](https://pandas.pydata.org/docs/). To import the data we'll read the csv files straight into [dataframes](https://pandas.pydata.org/docs/user_guide/dsintro.html#basics-dataframe) and handle missing values by [leaning on `pandas`](https://pandas.pydata.org/docs/user_guide/missing_data.html): we specify the "NA values" at read time, then fill them in with the respective column means.

This assignment gives us the impression that data is presented nicely to us. In reality, I hear from colleagues that data cleaning is a very sizeable task in and of itself.

In [2]:
import pandas as pd
from sklearn import tree

# WARN: the critical assumption here is that missing data is denoted via '?'. This was provided via a Piazza
# clarification at the time of the assignment; in practice we would absolutely need to do some data cleaning.
df = pd.read_csv('datasets/website-phishing.csv', na_values='?')
df = df.fillna(df.mean())
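
To see what the two `pandas` calls above do, here is a minimal sketch on a made-up three-row CSV. The names `raw` and `toy` and the column names are hypothetical, and `'?'` as the missing-value marker is the same assumption as in the cell above.

```python
import io

import pandas as pd

# Hypothetical CSV contents; '?' marks a missing value (assumption carried over from above).
raw = io.StringIO("a,b,label\n1,?,0\n3,4,1\n?,8,1\n")

# na_values='?' turns every '?' cell into NaN while parsing.
toy = pd.read_csv(raw, na_values="?")
print(toy.isna().sum().to_dict())  # number of missing cells per column

# fillna(toy.mean()) replaces each NaN with the mean of its own column.
filled = toy.fillna(toy.mean())
print(filled)
```

Column `a` holds 1 and 3, so its missing cell becomes 2; column `b` holds 4 and 8, so its missing cell becomes 6.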

# Training/Testing Sets
Here we split the data into the training and testing datasets, going old school with the 70/30 split. Not much to comment on here other than pointing out that it's important to:

- shuffle the dataset, to remove any potential sorting that exists (mitigating training bias), and
- adhere to the golden rule of not polluting the training efforts with our testing dataset.

Of course `scikit-learn` has a utility here, aptly named [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), as it very well should: it's literally one of the most common procedures to perform in these exercises.

We finish here by conveniently separating the features and labels for easy access (the `tr_` prefix represents training).

In [3]:
from sklearn.model_selection import train_test_split

print("overall size of the full dataset: ", df.size)
[training, testing] = train_test_split(df, shuffle=True, train_size=0.7, test_size=0.3)
print(f"size of training set: {training.size} (%{training.size/df.size * 100}%)")
print(f"size of testing set: {testing.size} (%{testing.size/df.size * 100}%)")

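# NOTE: the tr_ features and labels must come from the training split, never from the testing split
tr_features, tr_labels = training.iloc[:, :-1], training.iloc[:, -1]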

overall size of the full dataset:  342705
size of training set: 239878 (%69.99547715965626%)
size of testing set: 102827 (%30.004522840343732%)
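
As a small sanity check of the two rules above, here is a sketch on a made-up ten-row frame. The frame `toy` and its column names are hypothetical, and `random_state=0` is an addition here purely so the shuffle is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: two feature columns, with the label in the last column.
toy = pd.DataFrame({
    "f1": np.arange(10),
    "f2": np.arange(10) * 2,
    "label": [0, 1] * 5,
})

# shuffle=True removes any ordering in the rows before the 70/30 split.
train, test = train_test_split(toy, shuffle=True, train_size=0.7, random_state=0)

# len() counts rows, whereas DataFrame.size counts cells (rows * columns).
print(len(train), len(test))

# Golden rule: no row may appear in both sets.
assert set(train.index).isdisjoint(test.index)

# Features are every column but the last; labels are the last column.
train_X, train_y = train.iloc[:, :-1], train.iloc[:, -1]
test_X, test_y = test.iloc[:, :-1], test.iloc[:, -1]
```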


# (Task 2a & 2b) Decision Stump & Unpruned Tree
And so we begin! I separated these two subtasks out because they are suitably bland. For future reference, a good decision tree refresher is the `scikit-learn` [chapter on decision trees](https://scikit-learn.org/stable/modules/tree.html), as we're about to dive right into the realm of our friend the [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

To pay respects to the course as lectured by Jörg Simon Wicker in Summer-Autumn of 2023, we will fix the node splitting criterion for the length of this notebook to `criterion="entropy"`. That is, we quantify data "impurity" via a flavour of the very well explained "entropy" function (Shannon entropy is used in `scikit-learn`).

The [generalised definition](https://scikit-learn.org/stable/modules/tree.html#classification-criteria) is used here as we have datasets that have more than two possible labels (important distinction to make since we looked at only the binary classification definitions of entropy in the course).
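
For intuition, the generalised entropy of a node with class proportions $p_k$ is $H = -\sum_k p_k \log_2 p_k$. Below is a minimal sketch of that formula; the helper `entropy` is a hypothetical name, and measuring in bits (log base 2) is an assumption of this sketch rather than something taken from the assignment.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels (any number of classes)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()  # class proportions p_k
    return float(-(p * np.log2(p)).sum())


print(entropy([0, 0, 1, 1]))          # two balanced classes
print(entropy(["a", "b", "c", "d"]))  # four balanced classes
print(entropy([0, 0, 0, 1]))          # two unbalanced classes
```

With two balanced classes the value is 1 bit, which matches the binary definition from the course; with four balanced classes it rises to 2 bits, which the binary formula cannot express.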

Alongside fixing the `criterion`, we will also fix `random_state=0` and `splitter="best"` for the length of this document. Even with `splitter="best"`, the `scikit-learn` implementation of `DecisionTreeClassifier` randomly permutes the candidate features at each split, so ties between equally good splits can resolve differently from run to run unless `random_state` is set. We want deterministic behaviour to align with the teachings of this course. Hence all `DecisionTreeClassifier` instances will be constructed via:
```
DecisionTreeClassifier(random_state=0, criterion="entropy", splitter="best", ...)
```

The decision stump (called `stump_classifier` below) is just a decision tree classifier model with a depth of one. We're _technically_ assembling this tree via the "pre-pruning" technique of setting a `max_depth`, but that is sufficient for demonstration purposes. The completely unpruned decision tree (called `unpruned_classifier` below) is a tree that has no `max_depth` set (and no other hyperparameters set that would restrict its growth). Hence we have the following:

_NOTE: we log out the depth of the underlying constructed tree of the classifiers (perhaps unconventionally) via the property `.tree_.max_depth`_

_NOTE: as described above, the implementation of the `DecisionTreeClassifier` within `scikit-learn` introduces randomness when choosing splits, hence we fix this from now on too by setting the `random_state` parameter._

In [173]:
stump_classifier = tree.DecisionTreeClassifier(random_state=0, criterion="entropy", splitter="best", max_depth=1)
stump_classifier.fit(tr_features, tr_labels)
print("the depth of tree assembled for stump_classifier:", stump_classifier.tree_.max_depth)

unpruned_classifier = tree.DecisionTreeClassifier(random_state=0, criterion="entropy", splitter="best")
unpruned_classifier.fit(tr_features, tr_labels)
print("the depth of the tree assembled for the unpruned_classifier:", unpruned_classifier.tree_.max_depth)

the depth of tree assembled for stump_classifier: 1
the depth of the tree assembled for the unpruned_classifier: 22
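
The same stump-versus-unpruned comparison can be reproduced in isolation. This is a sketch on synthetic stand-in data from `make_classification` (an assumption, not the phishing dataset), and the names `common`, `stump`, `full` and `again` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the phishing dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# The fixed settings used throughout this notebook.
common = dict(random_state=0, criterion="entropy", splitter="best")

stump = DecisionTreeClassifier(max_depth=1, **common).fit(X, y)
full = DecisionTreeClassifier(**common).fit(X, y)

# get_depth() reports the same value as the tree_.max_depth property.
print("stump:", stump.get_depth(), "level,", stump.tree_.node_count, "nodes")
print("full: ", full.get_depth(), "levels,", full.tree_.node_count, "nodes")

# Refitting with the same random_state reproduces the same tree size.
again = DecisionTreeClassifier(**common).fit(X, y)
assert again.tree_.node_count == full.tree_.node_count
```

A stump always has three nodes (one decision node and two leaves), which is a handy check that `max_depth=1` did what we expected.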


# (Task 2c) A Post Pruned Tree
We opt for a post-pruned tree; that is, we'll take the tree behind `unpruned_classifier` and apply the "cost complexity pruning" technique to it.

Resources about this post pruning technique (as it's a novel technique relative to what has been covered in lectures):
- [Short video by StatQuest contextualised via Regression Trees](https://www.youtube.com/watch?v=D0efHEJsfHo&t)
- [Chapter from `scikit-learn` docs directly](https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html)

It is a parameterised technique (it takes in a "cost complexity" parameter, $\alpha$), which makes $\alpha$ a possible hyper parameter. The idea is that the magnitude of $\alpha$ controls how **aggressively** we prune this tree (_how many nodes we trim back_).

_NOTE: to drive the intuition of how this alpha hyper parameter works, there exists a large enough alpha such that the final output tree is pruned all the way back to its root node, leaving something even smaller than our decision stump._

Technically, for the sake of demonstration, a post pruned tree here can simply be the product of this tree and an alpha "strong enough" to prune at least one decision node. Choosing this value arbitrarily is slightly inaccurate, as we need to guarantee that some actual "pruning" happens, right? So we utilise a handy method on the instance of a `DecisionTreeClassifier` called `cost_complexity_pruning_path`, which lists out the alpha values that _make an impact_. From this selection we can arbitrarily choose one and apply the prune.

In [4]:
import random

# Fit the unpruned reference tree inside this cell so it does not depend on an earlier cell having been run
classifier = tree.DecisionTreeClassifier(random_state=0, criterion="entropy", splitter="best")
classifier.fit(tr_features, tr_labels)
path = classifier.cost_complexity_pruning_path(tr_features, tr_labels)

# NOTE: this is purely a demonstration, we'd not just randomly choose the cost complexity pruning alpha like this.
# The first alpha (0.0) prunes nothing and the last prunes the tree down to its root, so both are skipped.
chosen_ccp_alpha = random.choice(path.ccp_alphas[1:-1])

post_pruned_classifier = tree.DecisionTreeClassifier(random_state=0, criterion="entropy", splitter="best", ccp_alpha=chosen_ccp_alpha)
post_pruned_classifier.fit(tr_features, tr_labels)
print("total number of trimmed nodes: ", classifier.tree_.node_count - post_pruned_classifier.tree_.node_count)
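
To drive the intuition about $\alpha$ a little further, here is a sketch that refits one tree per alpha on the pruning path and counts the surviving nodes. It uses synthetic stand-in data (an assumption, not the phishing dataset); the names `common`, `alphas` and `node_counts` are hypothetical, and the clipping step is a precaution of this sketch against floating-point noise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the phishing dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
common = dict(random_state=0, criterion="entropy", splitter="best")

path = DecisionTreeClassifier(**common).cost_complexity_pruning_path(X, y)

# Precaution: clip any tiny negative value so every alpha is a valid ccp_alpha.
alphas = np.clip(path.ccp_alphas, 0.0, None)

# Refit once per effective alpha and record how many nodes survive the pruning.
node_counts = [
    DecisionTreeClassifier(ccp_alpha=a, **common).fit(X, y).tree_.node_count
    for a in alphas
]

print("smallest alpha:", alphas[0], "->", node_counts[0], "nodes")
print("largest alpha: ", alphas[-1], "->", node_counts[-1], "nodes")
```

The node count never grows as alpha increases: the smallest alpha keeps the full tree, while the largest one leaves only the root.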

# (Task 3) Hyper parameter search
Now for the real cool shit (where the data science really begins). We want to expand on what we found in our post pruned tree above... that is, finding the $\alpha$ that _performs the best_, because above we surfaced a series of candidates (provided by the above `cost_complexity_pruning_path`), but just picked a random one!

So how can we find the best $\alpha$? Well, we can split the training data set (which is still **strictly separated** from our `testing` dataset from way back at the beginning of this notebook) by the 70:30 ratio again, building two new sets; call them the `training` and `validation` sets. This allows us to iterate through each candidate $\alpha$, train a model with it on the `training` set and test the resulting $\alpha$-tuned model on the `validation` set.

But as pointed out in the lecture notes, this restricts our training dataset quite a lot and impacts the ability for our models to generalise well (as our training dataset becomes less and less representative of our overall global dataset the smaller it gets). This caveat here is really only visible when contrasted against the [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) technique 🙌.

We covered this technique thoroughly in course lectures. Effectively, we end up evaluating the learning process as regularised by an $\alpha$ whilst still testing it correctly, because each fold takes a turn as the held-out validation set. This wastes nothing of our original `training` data set (at the cost of some added compute resource).
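
To make the contrast concrete, here is a minimal sketch that scores a single candidate $\alpha$ both ways. The data is synthetic (an assumption, not the phishing dataset), the value `alpha = 0.01` is an arbitrary illustration, and five folds is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set (an assumption; not the phishing dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
common = dict(random_state=0, criterion="entropy", splitter="best")
alpha = 0.01  # arbitrary candidate, for illustration only

# Hold-out: 30% of the training rows are set aside and never used for fitting.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, train_size=0.7, random_state=0)
holdout_model = DecisionTreeClassifier(ccp_alpha=alpha, **common).fit(X_fit, y_fit)
holdout_acc = holdout_model.score(X_val, y_val)

# 5-fold cross-validation: every row is validated on exactly once
# and used for fitting in the other four folds.
fold_acc = cross_val_score(
    DecisionTreeClassifier(ccp_alpha=alpha, **common), X, y, cv=5, scoring="accuracy"
)

print("hold-out accuracy:", holdout_acc)
print("5-fold accuracies:", fold_acc, "mean:", fold_acc.mean())
```

The hold-out score rests on a single split, whereas the cross-validated mean averages five of them, at the price of five fits per candidate instead of one.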

In `scikit-learn` we utilise the [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) utility, which does what we want here for us. It takes in a "grid" of candidate parameters to check exhaustively and, after completion, neatly summarises the outcome, surfacing the best parameter to use.

Some resources for further reading:

- [Chapter from `scikit-learn` about hyper parameter tuning](https://scikit-learn.org/stable/modules/grid_search.html)
- [More practical to the point medium article on hyper parameter tuning](https://medium.com/chinmaygaikwad/hyperparameter-tuning-for-tree-models-f99a66446742)

_Note: in general, the more folds (`cv` param in `GridSearchCV`), the more representative the estimated model accuracy would be of the real error. The tradeoff here is patience: more folds require more time (the cell below goes to the extreme with `LeaveOneOut`, one fold per training row)... but I'm lucky here, I have a really powerful machine._

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import LeaveOneOut

print("cost complexity pruning alpha candidates: ", len(path.ccp_alphas))
params = {'ccp_alpha': path.ccp_alphas}

base_classifier = tree.DecisionTreeClassifier(random_state=0, criterion="entropy", splitter="best")
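# NOTE: LeaveOneOut builds one fold per training row, so this search is very slow on a dataset of this size
GS = GridSearchCV(estimator=base_classifier, n_jobs=-1, param_grid=params, cv=LeaveOneOut(), verbose=True, scoring='accuracy')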
GS.fit(tr_features, tr_labels)

print('Best Parameters:',GS.best_params_,end='\n\n')
print('Best Score:',GS.best_score_)
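
For a version of the search that finishes quickly, here is a sketch on synthetic stand-in data (an assumption, not the phishing dataset). It swaps `LeaveOneOut` for five folds, which is a deliberate change from the cell above, and the names `common`, `candidates` and `search` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the phishing dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
common = dict(random_state=0, criterion="entropy", splitter="best")

path = DecisionTreeClassifier(**common).cost_complexity_pruning_path(X, y)

# Precaution: clip tiny negative alphas and drop duplicates. The largest alpha
# is left out because it prunes the tree down to a single node.
candidates = np.unique(np.clip(path.ccp_alphas, 0.0, None))[:-1]

search = GridSearchCV(
    estimator=DecisionTreeClassifier(**common),
    param_grid={"ccp_alpha": candidates},
    cv=5,  # five folds instead of one fold per row
    scoring="accuracy",
    n_jobs=1,
)
search.fit(X, y)

print("candidates tried:", len(candidates))
print("best alpha:", search.best_params_["ccp_alpha"])
print("mean CV accuracy:", search.best_score_)
```

After fitting, `search.best_estimator_` is already refit on the full data with the winning alpha, so it can be scored directly against a held-back testing set.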