# (Task 1) Importing and filling the data
We're going to be abusing `pandas`, a pretty [standard tool for managing datasets for analysis](https://pandas.pydata.org/docs/). To import the data we'll just read the csv files directly into [dataframes](https://pandas.pydata.org/docs/user_guide/dsintro.html#basics-dataframe) and handle missing values by [leaning on `pandas`](https://pandas.pydata.org/docs/user_guide/missing_data.html): we specify the "NA values" when reading, then fill them in with the respective column means.
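
As a minimal sketch of that approach, the snippet below reads a tiny made-up CSV (the column names `a`, `b` and `label` are hypothetical and not taken from the assignment datasets), treats `?` as missing, and fills the gaps with column means.

```python
import io

import pandas as pd

# Hypothetical three-row CSV in which "?" marks a missing value.
csv_text = "a,b,label\n1,?,0\n3,4,1\n?,8,1\n"

# na_values="?" makes pandas parse each "?" as NaN, so the columns stay numeric.
df = pd.read_csv(io.StringIO(csv_text), na_values="?")
missing_before = int(df.isna().sum().sum())

# df.mean() skips NaN by default; fillna then fills each column with its own mean.
filled = df.fillna(df.mean())

print(missing_before)
print(filled)
```

Without `na_values`, the `?` entries would turn those columns into strings and the column means could not be computed.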

This assignment gives the impression that data arrives nicely presented. In reality, data cleaning is a very sizeable task in and of itself.

In [1]:
import pandas as pd
from sklearn import tree

import warnings
warnings.filterwarnings('ignore')

# WARN: the critical assumption here is that missing data is denoted via '?'. This was provided via a Piazza
# clarification at the time of the assignment; in general we would absolutely need to do some data cleaning.
wb_df = pd.read_csv('datasets/website-phishing.csv', na_values='?') 
bcp_df = pd.read_csv('datasets/BCP.csv', na_values='?') 
ary_df = pd.read_csv('datasets/arrhythmia.csv', na_values='?') 

wb_df = wb_df.fillna(wb_df.mean())
bcp_df = bcp_df.fillna(bcp_df.mean())
ary_df = ary_df.fillna(ary_df.mean())

imports = [
    {"name":"website-phishing", "df":wb_df},
    {"name":"arrythmia", "df":ary_df},
    {"name":"bcp", "df":bcp_df},
]

# Training/Testing Sets
Here we split the data into training and testing sets, going old school with the 70/30 split. Not much to comment on here other than pointing out that it's important to:

- shuffle the dataset, to remove any potential sorting that exists (mitigating training bias), and
- adhere to the golden rule of not polluting the training efforts with our testing dataset.

Of course `scikit-learn` has a utility for this, aptly named [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). As it very well should: it's literally one of the most common procedures to perform in these exercises.
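
To make those two rules concrete, here is a small sketch on a made-up dataframe (the `toy` frame and its columns are hypothetical). It is sorted by label on purpose, and `random_state` is fixed only so the sketch is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 rows sorted by label, the kind of ordering shuffling protects against.
toy = pd.DataFrame({
    "feature": np.arange(100),
    "label": [0] * 50 + [1] * 50,
})

# 70/30 split with shuffling; random_state is an assumption added for reproducibility.
train, test = train_test_split(toy, shuffle=True, train_size=0.7, test_size=0.3, random_state=0)

# The two sets partition the rows: no row index appears in both.
shared_rows = set(train.index) & set(test.index)
print(len(train), len(test), len(shared_rows))
```

An empty `shared_rows` is the "golden rule" in miniature: nothing the model is tested on was available during training.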

We introduce a couple lightweight [dataclasses](https://docs.python.org/3/library/dataclasses.html) to help structure the handling and iterating of task requirements on each `Dataset`.

In [15]:
from sklearn.model_selection import train_test_split
from dataclasses import dataclass

@dataclass
class LabelsFeatures:
    features: pd.DataFrame
    labels: pd.Series  # a single label column, as produced by `iloc[:, -1]` below


@dataclass
class Dataset:
    name: str
    dataframe_: pd.DataFrame
    testing: LabelsFeatures
    training: LabelsFeatures

full_datasets = []

for dataset in imports:
    name, df = dataset["name"], dataset["df"]
    rows, columns = df.shape
    print(f"-- Overall row count of the {name} full dataset: {rows}")
    print(f"-- With overall column count: {columns}")

    [training, testing] = train_test_split(df, shuffle=True, train_size=0.7, test_size=0.3)
    print(f"Size of training set: {training.shape[0]} (%{training.shape[0]/rows * 100}%)")
    print(f"Size of testing set: {testing.shape[0]} (%{testing.shape[0]/rows * 100}%)")
    print("\n")

    tr_features, tr_labels = training.iloc[:, :-1], training.iloc[:, -1]
    te_features, te_labels = testing.iloc[:, :-1], testing.iloc[:, -1]
    full_datasets.append(Dataset(
        name=name,
        dataframe_=df,
        training=LabelsFeatures(features=tr_features, labels=tr_labels),
        testing=LabelsFeatures(features=te_features, labels=te_labels)
    ))

-- Overall row count of the website-phishing full dataset: 11055
-- With overall column count: 31
Size of training set: 7738 (%69.99547715965626%)
Size of testing set: 3317 (%30.004522840343732%)


-- Overall row count of the arrythmia full dataset: 452
-- With overall column count: 280
Size of training set: 316 (%69.91150442477876%)
Size of testing set: 136 (%30.08849557522124%)


-- Overall row count of the bcp full dataset: 683
-- With overall column count: 11
Size of training set: 478 (%69.98535871156662%)
Size of testing set: 205 (%30.01464128843338%)




# (Task 2a & 2b) Decision Stump & Unpruned Tree
And so we begin! I separated these two subtasks out because they are fairly bland. For future reference, a good decision tree refresher is the `scikit-learn` [chapter on decision trees](https://scikit-learn.org/stable/modules/tree.html), as we're about to dive right into the realm of our friend the [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

To pay respects to the lecturer Jörg Simon Wicker (course hosted in Summer-Autumn of 2023), we will fix the node splitting criterion for the length of this notebook to `criterion="entropy"`. That is, we quantify data "impurity" via a flavour of the very well explained "entropy" function (Shannon entropy is used in `scikit-learn`).

The [generalised definition](https://scikit-learn.org/stable/modules/tree.html#classification-criteria) is used here as we have datasets that have more than two possible labels (important distinction to make since we looked at only the binary classification definitions of entropy in the course).
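
As a quick illustration of that generalised form, for a node whose samples fall into classes with proportions $p_k$ the entropy is $H = \sum_k p_k \log_2 (1 / p_k)$. The helper below is a sketch only: `shannon_entropy` is a name made up here, and the base-2 logarithm is an assumption so that the result reads in bits.

```python
import numpy as np

def shannon_entropy(labels) -> float:
    """Entropy (in bits) of the class distribution found in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    # Only classes that are present are counted, so no proportion is ever zero.
    return float(np.sum(proportions * np.log2(1.0 / proportions)))

balanced_binary = shannon_entropy([0, 0, 1, 1])      # two equally likely classes
balanced_four_way = shannon_entropy([0, 1, 2, 3])    # four equally likely classes
pure_node = shannon_entropy([1, 1, 1, 1])            # a single class
print(balanced_binary, balanced_four_way, pure_node)
```

With more than two classes the entropy can exceed 1 bit, which is the main practical difference from the binary definition.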

In conjunction with fixing the `criterion`, we will also fix `random_state=0` and `splitter="best"` for the length of this document. The `DecisionTreeClassifier` implementation in `scikit-learn` randomly permutes the features at each split (even with `splitter="best"`), so without a fixed `random_state` ties between equally good splits can be broken differently from run to run. We want deterministic behaviour to align with the teachings of this course. Hence all `DecisionTreeClassifier` instances will be set via:
```
DecisionTreeClassifier(random_state=0, criterion="entropy", splitter="best" ...)
```
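
A small sanity check of that determinism, sketched on synthetic stand-in data (`make_classification` is used here only so the snippet runs on its own; `fit_tree` is a hypothetical helper):

```python
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic stand-in data; the assignment datasets are not needed for this check.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def fit_tree():
    clf = tree.DecisionTreeClassifier(random_state=0, criterion="entropy", splitter="best")
    return clf.fit(X, y)

first, second = fit_tree(), fit_tree()

# Same data and same random_state: same split features and same predictions.
same_structure = (first.tree_.node_count == second.tree_.node_count
                  and (first.tree_.feature == second.tree_.feature).all())
same_predictions = (first.predict(X) == second.predict(X)).all()
print(same_structure, same_predictions)
```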

The decision stump (built by `getTrainedStumpClassifier` below) is just a decision tree classifier with a depth of one. Now, we're _technically_ assembling this tree via a "pre-pruning" technique by setting the `max_depth` of the tree, but that is sufficient for demonstration purposes. The completely unpruned decision tree (built by `getTrainedUnprunedClassifier` below) is a tree that has no `max_depth` set (and no other hyperparameters set that would otherwise restrict its growth). Hence we have the following:

_NOTE: we structure this assignment via some helper functions that will be called again and again (eg, `getTrainedStumpClassifier`); this helps generalise running the same tasks against all three datasets. There's a bit to be desired with this approach (like effective caching, so we're not re-training models every time we call them), but in my eyes that is an optimisation to make only if needed (and it is not needed here)._

In [14]:
def getTrainedStumpClassifier(ds: Dataset):
    stump_classifier = tree.DecisionTreeClassifier(
        random_state=0,
        criterion="entropy",
        splitter="best",
        max_depth=1)
    stump_classifier.fit(ds.training.features, ds.training.labels)
    return stump_classifier

def getTrainedUnprunedClassifier(ds: Dataset):
    unpruned_classifier = tree.DecisionTreeClassifier(
        random_state=0,
        criterion="entropy",
        splitter="best")
    unpruned_classifier.fit(ds.training.features, ds.training.labels)
    return unpruned_classifier

for ds in full_datasets:
    print("-- For dataset", ds.name)

    stump_classifier = getTrainedStumpClassifier(ds)
    print("The depth of tree assembled for stump_classifier:", stump_classifier.tree_.max_depth)

    unpruned_classifier = getTrainedUnprunedClassifier(ds)
    print("The depth of the tree assembled for the unpruned_classifier:", unpruned_classifier.tree_.max_depth)
    print("\n")

-- For dataset website-phishing
The depth of tree assembled for stump_classifier: 1
The depth of the tree assembled for the unpruned_classifier: 24


-- For dataset arrythmia
The depth of tree assembled for stump_classifier: 1
The depth of the tree assembled for the unpruned_classifier: 13


-- For dataset bcp
The depth of tree assembled for stump_classifier: 1
The depth of the tree assembled for the unpruned_classifier: 6




# (Task 2c) A Post Pruned Tree
We now enter "pruning", the general technique of trimming a decision tree down in order to tackle the problem of "overfitting". Concretely, in this section we'll take the tree from `getTrainedUnprunedClassifier` and apply the "cost complexity pruning" technique to it.

Resources about this post pruning technique (as it's a novel technique relative to what has been covered in lectures):
- [Short video by StatQuest contextualised via Regression Trees](https://www.youtube.com/watch?v=D0efHEJsfHo&t)
- [Chapter from `scikit-learn` docs directly](https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html)

It is a parameterised technique (it takes a "cost complexity" parameter, $\alpha$), and so is open to tuning as a hyperparameter. The idea is that the magnitude of $\alpha$ controls how **aggressively** we prune this tree (_how many nodes we trim back_).
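
For reference, the `scikit-learn` documentation defines the cost-complexity measure of a tree $T$ as

$$R_\alpha(T) = R(T) + \alpha|\widetilde{T}|$$

where $R(T)$ is the total (sample weighted) impurity of the terminal nodes of $T$ and $|\widetilde{T}|$ is the number of terminal nodes. With $\alpha = 0$ nothing penalises the size of the tree, while a growing $\alpha$ makes every extra leaf more expensive, so smaller subtrees win.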

Technically, for the sake of demonstration, a post pruned tree here can simply be the product of this tree and an alpha "strong enough" to prune at least one decision node. Choosing this value arbitrarily is slightly inaccurate, as we need to guarantee some sort of actual "pruning", right? So we utilise a handy method on `DecisionTreeClassifier` instances called `cost_complexity_pruning_path`, which lists the alpha values that _make an impact_. From this selection we can arbitrarily choose one (for the sake of demonstration) and apply the prune.

_NOTE: we conveniently wrap the candidate selection in `getCandidateCCPParameters`. I have removed the first candidate of the ccp alphas path as this value has no impact, and I want to force "some" sort of post pruning for the sake of clear demonstration and comparison._
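
To get a feel for what the pruning path contains, here is a sketch on synthetic stand-in data (not the assignment datasets) that refits one tree per candidate alpha and records how many nodes survive. Clipping the alphas at zero is a defensive assumption on my part, in case floating-point error ever produces a tiny negative value.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic stand-in data with some label noise so the full tree overfits.
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.1, random_state=0)

base = tree.DecisionTreeClassifier(random_state=0, criterion="entropy", splitter="best")
path = base.cost_complexity_pruning_path(X, y)

# Defensive assumption: ccp_alpha must be non-negative, so clip any rounding noise.
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)

# Refit once per candidate alpha and record how many nodes survive the pruning.
node_counts = []
for alpha in ccp_alphas:
    clf = tree.DecisionTreeClassifier(
        random_state=0, criterion="entropy", splitter="best", ccp_alpha=alpha)
    clf.fit(X, y)
    node_counts.append(clf.tree_.node_count)

print(list(zip(ccp_alphas[:5], node_counts[:5])))
print(node_counts[0], node_counts[-1])
```

The first candidate leaves the tree essentially intact while the last one prunes it back the furthest, which is why picking from this list guarantees a visible effect.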

In [16]:
import random

def getCandidateCCPParameters(ds: Dataset):
    classifier = tree.DecisionTreeClassifier(random_state=0, criterion="entropy", splitter="best")
    path = classifier.cost_complexity_pruning_path(ds.training.features, ds.training.labels)
    return path.ccp_alphas[1:]

for ds in full_datasets:
    print("-- For dataset", ds.name)

    # NOTE: this is purely a demonstration, we wouldn't normally choose the cost complexity pruning alpha at random
    chosen_ccp_alpha = random.choice(getCandidateCCPParameters(ds)) 

    post_pruned_classifier = tree.DecisionTreeClassifier(
        random_state=0,
        criterion="entropy",
        splitter="best",
        ccp_alpha=chosen_ccp_alpha
    )
    post_pruned_classifier.fit(ds.training.features, ds.training.labels)
    
    # Train the unpruned tree once and reuse its node count for both lines.
    unpruned_node_count = getTrainedUnprunedClassifier(ds).tree_.node_count
    print("Number of nodes in unpruned tree: ", unpruned_node_count)
    print("Total number of trimmed nodes: ", unpruned_node_count - post_pruned_classifier.tree_.node_count)
    print("\n")

-- For dataset website-phishing
Number of nodes in unpruned tree:  849
Total number of trimmed nodes:  712


-- For dataset arrythmia
Number of nodes in unpruned tree:  131
Total number of trimmed nodes:  110


-- For dataset bcp
Number of nodes in unpruned tree:  47
Total number of trimmed nodes:  44




# (Task 3) Hyperparameter (in our case just $\alpha$ ) search
Now for the real cool shit (where the data science really begins). We want to expand on what we found in our post pruned tree above, that is, find the $\alpha$ that _performs the best_: above we surfaced a series of candidates (provided by `cost_complexity_pruning_path`), but just picked a random one!

So how can we find the best $\alpha$? Well, we can split the training dataset (which is still **strictly separated** from the `testing` dataset we set aside way at the beginning of this notebook) in the 70:30 ratio again, building two new sets; call them the `training` and `validation` sets. This allows us to iterate through each candidate $\alpha$, train a model with it on the `training` set and test the resulting $\alpha$ tuned model on the `validation` set.

But as pointed out in the lecture notes, this restricts our training dataset quite a lot and impacts the ability for our models to generalise well (as our training dataset becomes less and less representative of our overall global dataset the smaller it gets). This caveat here is really only visible when contrasted against the [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) technique.

We covered this technique thoroughly in course lectures. Effectively, we evaluate the learning process regulated by a given $\alpha$ using _folds_ of the `training` set, so no second layer of splitting is needed and the final model can still be trained on the entire `training` set. In `scikit-learn`, we have the [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) utility: given a list of hyperparameters (and their respective candidate values), it performs an exhaustive search over the candidate hyperparameter combinations, returning to us the best combination for the `training` set provided. It does this by iterating through all combinations of hyperparameters and evaluating each one by k-fold cross validation. Each run gives an aggregated (mean) accuracy score for the model trained under those hyperparameters, which is then used to compare the effectiveness of the different combinations.
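
To show what that search boils down to, the sketch below does the same thing twice on synthetic stand-in data: once by hand with `cross_val_score`, and once with `GridSearchCV`. The alpha candidates and the `make_tree` helper are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data and a hypothetical handful of alpha candidates.
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.1, random_state=0)
alpha_candidates = [0.0, 0.005, 0.01, 0.02, 0.05]

def make_tree(alpha: float) -> tree.DecisionTreeClassifier:
    return tree.DecisionTreeClassifier(
        random_state=0, criterion="entropy", splitter="best", ccp_alpha=alpha)

# By hand: one mean 5-fold accuracy per candidate, then keep the best mean.
manual_means = [
    cross_val_score(make_tree(alpha), X, y, cv=5, scoring="accuracy").mean()
    for alpha in alpha_candidates
]
manual_best_score = max(manual_means)

# The same search delegated to GridSearchCV.
search = GridSearchCV(
    estimator=make_tree(0.0),
    param_grid={"ccp_alpha": alpha_candidates},
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)

print(manual_best_score, search.best_score_, search.best_params_)
```

With the default `refit=True`, `search.best_estimator_` is additionally refit on all of the data passed to `fit`, using the winning hyperparameters.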

We in particular have to define the strategy used to evaluate the performance of each $\alpha$ tuned model; the strategies available in `GridSearchCV` are [listed here](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values). Since we are dealing with basic classification we opt for the `accuracy` scoring method, that is, given the $i$th predicted value $\hat y_i$ with the associated true label $y_i$, and keeping in mind the sample count $n_{samples}$:

$$accuracy(y,\hat y) = \frac{\sum_{i=0}^{n_{samples}-1}1(\hat y_i = y_i)}{n_{samples}}$$
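
The formula is just "the fraction of predictions that match". A quick check on made-up labels (the `y_true` and `y_pred` arrays are hypothetical) against `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels for a three-class problem.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1])

# The indicator 1(y_hat_i = y_i) becomes a boolean array; its mean is the accuracy.
manual_accuracy = float(np.mean(y_pred == y_true))
library_accuracy = float(accuracy_score(y_true, y_pred))
print(manual_accuracy, library_accuracy)
```

Four of the five predictions match, so both values come out at 0.8.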

We can go a step further and perform what is known as nested cross-validation, a method known to yield more generalisable `accuracy` scores for each hyperparameter selection. I have left this out for simplicity, as using `GridSearchCV` exclusively is still a "proper way to select hyperparameters".

- [Chapter in `scikit-learn` that covers an exemplar nested cross-validation run](https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
- [Video that helps describe the intuition behind nested cross validation](https://www.youtube.com/watch?v=az60jS7MQhU)
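
For the curious, a minimal sketch of what a nested run could look like, again on synthetic stand-in data with a hypothetical alpha grid. It is not used anywhere else in this notebook.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, flip_y=0.1, random_state=0)

# Inner loop: picks ccp_alpha using only the data handed to it.
inner_search = GridSearchCV(
    estimator=tree.DecisionTreeClassifier(random_state=0, criterion="entropy", splitter="best"),
    param_grid={"ccp_alpha": [0.0, 0.005, 0.01, 0.02, 0.05]},
    scoring="accuracy",
    cv=3,
)

# Outer loop: each fold reruns the whole search on its training part and
# scores the selected model on data the search never saw.
outer_scores = cross_val_score(inner_search, X, y, cv=5, scoring="accuracy")
print(outer_scores.mean(), outer_scores.std())
```

The outer scores estimate how well the whole "search then fit" procedure generalises, rather than how well one particular $\alpha$ does.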


## Once we find this best $\alpha$

We build a final model, shown in `getFittedAlphaTunedClassifier` here, and use it as our strongest candidate that is using the "cost complexity pruning" technique. 

Some resources for further reading;

- [Chapter from `scikit-learn` about hyperparameter tuning](https://scikit-learn.org/stable/modules/grid_search.html)
- [A more practical, to the point Medium article on hyperparameter tuning](https://medium.com/chinmaygaikwad/hyperparameter-tuning-for-tree-models-f99a66446742)

_Note: in general, the more folds (the `cv` param in `GridSearchCV`), the more representative the measured model accuracy should be of the real accuracy. The tradeoff here is patience: a more representative estimate requires a higher `cv`, which in turn requires more time... I arbitrarily chose 10-fold._

In [17]:
from sklearn.model_selection import GridSearchCV
from typing import List

def getFittedAlphaTunedClassifier(ds: Dataset, alphaCandidates: List[float]):
    base_classifier = tree.DecisionTreeClassifier(random_state=0, criterion="entropy", splitter="best")
    params = {'ccp_alpha': alphaCandidates}
    search = GridSearchCV(
        estimator=base_classifier,
        param_grid=params,
        scoring='accuracy',
        cv=10,
        n_jobs=-1,
    )
    search.fit(ds.training.features, ds.training.labels)
    
    # NOTE: this refit isn't necessary; with the default refit=True, GridSearchCV already exposes the tuned and
    # trained model as search.best_estimator_
    best_performing_post_pruned_classifier = tree.DecisionTreeClassifier(
        random_state=0,
        criterion="entropy",
        splitter="best",
        ccp_alpha=search.best_params_['ccp_alpha']
    )
    
    best_performing_post_pruned_classifier.fit(ds.training.features, ds.training.labels)
    return best_performing_post_pruned_classifier

for ds in full_datasets:
    print("-- For dataset", ds.name)

    candidates = getCandidateCCPParameters(ds)
    print("Cost complexity pruning alpha candidates: ", len(candidates))
   
    best_performing_post_pruned_classifier = getFittedAlphaTunedClassifier(ds, candidates)
   
    print('Best "cost complexity alpha":             ',  best_performing_post_pruned_classifier.get_params()['ccp_alpha'])
    
    print("Number of nodes in unpruned tree:         ", getTrainedUnprunedClassifier(ds).tree_.node_count)
    print("Total number of trimmed nodes:            ", getTrainedUnprunedClassifier(ds).tree_.node_count - best_performing_post_pruned_classifier.tree_.node_count)
    print("\n")

-- For dataset website-phishing
Cost complexity pruning alpha candidates:  254
Best "cost complexity alpha":              0.0
Number of nodes in unpruned tree:          849
Total number of trimmed nodes:             0


-- For dataset arrythmia
Cost complexity pruning alpha candidates:  64
Best "cost complexity alpha":              0.03991433583500406
Number of nodes in unpruned tree:          131
Total number of trimmed nodes:             98


-- For dataset bcp
Cost complexity pruning alpha candidates:  18
Best "cost complexity alpha":              0.005763362975237382
Number of nodes in unpruned tree:          47
Total number of trimmed nodes:             8




## Datasets Commentary

From importing the data, we know that the website-phishing dataset is significantly bigger (wrt. row count) than arrythmia and bcp. This could explain why the unpruned tree for the website-phishing dataset is significantly more complex (wrt. number of tree nodes). That would suggest there is "more to trim" on the website-phishing dataset; however, the hyperparameter search over alpha for the cost complexity pruning technique showed that we cannot remove "much" from this tree (0 nodes of 849 in the run shown above). This suggests to me that the samples at the bottom of the tree are still quite entropic compared to arrythmia and bcp, where a larger proportion of the overall tree could be trimmed (98 nodes of 131 and 8 nodes of 47). This would suggest less relative entropy through the bottom of those trees.

# (Task 4) Evaluating these Models
It's time to formally evaluate these three models: the ones produced by `getFittedAlphaTunedClassifier`, `getTrainedUnprunedClassifier` and `getTrainedStumpClassifier`. The absolute performance of each model is what we consider first: you can quite easily fit each model on the `training` data and predict the labels for the unseen `testing` features, finding the [`accuracy_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) of each model and comparing them this way. It's very important to keep in mind other useful evaluation metrics such as [precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score) and [recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html). As mentioned in the course, these metrics can be used to dig further into particular behaviours of a model: perhaps we would prefer models that over-classify positive in a binary classification setting, or ones that are instead overly cautious.
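
To see why accuracy alone can hide those behaviours, here is a sketch with made-up binary labels (all three label lists are hypothetical). Both sets of predictions score the same accuracy, yet they differ sharply in precision and recall.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical binary labels: 3 positives and 5 negatives.
y_true     = [1, 1, 1, 0, 0, 0, 0, 0]
y_cautious = [1, 0, 0, 0, 0, 0, 0, 0]  # rarely predicts positive
y_eager    = [1, 1, 1, 1, 1, 0, 0, 0]  # over-predicts positive

for name, y_pred in [("cautious", y_cautious), ("eager", y_eager)]:
    print(
        name,
        accuracy_score(y_true, y_pred),
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )
```

The cautious predictions are never wrong when they say "positive" but miss two of the three positives, while the eager predictions catch every positive at the cost of two false alarms.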

Here we will focus on just `accuracy` for brevity.

In [30]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score

for ds in full_datasets:
    print("-- For dataset", ds.name)
    
    predicted_labels = getFittedAlphaTunedClassifier(ds, getCandidateCCPParameters(ds)).predict(ds.testing.features)
    score = accuracy_score(ds.testing.labels, predicted_labels)
    print(f"Post pruned classifier accuracy:                 %{score*100}")

    predicted_labels = getTrainedUnprunedClassifier(ds).predict(ds.testing.features)
    score = accuracy_score(ds.testing.labels, predicted_labels)
    print(f"Unpruned classifier classifier accuracy:         %{score*100}")

    predicted_labels = getTrainedStumpClassifier(ds).predict(ds.testing.features)
    score = accuracy_score(ds.testing.labels, predicted_labels)
    print(f"Stump classifier classifier accuracy:            %{score*100}")
    print("\n")

-- For dataset website-phishing
Post pruned classifier accuracy:                 %96.26168224299066
Unpruned classifier classifier accuracy:         %96.26168224299066
Stump classifier classifier accuracy:            %88.81519445281882


-- For dataset arrythmia
Post pruned classifier accuracy:                 %70.58823529411765
Unpruned classifier classifier accuracy:         %66.91176470588235
Stump classifier classifier accuracy:            %58.82352941176471


-- For dataset bcp
Post pruned classifier accuracy:                 %94.14634146341463
Unpruned classifier classifier accuracy:         %94.6341463414634
Stump classifier classifier accuracy:            %92.6829268292683




## Evaluating the statistical significance of pairwise model performance differences 

Now, it's not too safe to just look at these absolute values and compare them this way; we need to account for chance. How do we rule chance out? We take the two models we're comparing, say $M_1$ and $M_2$, and set the null hypothesis to "$M_1$ and $M_2$ perform the same". Using a [t-test](https://en.wikipedia.org/wiki/Student%27s_t-test), if we can reject the null hypothesis (ie, p-value < 0.05), then we can conclude the difference between $M_1$ and $M_2$ is **statistically significant**.

- [Quick video on the innards of the t-test for independent samples for those unfamiliar](https://www.youtube.com/watch?v=c9ombGmaEy8)

We need a series of like-for-like scores for each model; in our case we can keep using the standard `accuracy` scoring. We can generate these scores via something like a 100-fold cross validation, where we add the `accuracy` score of each fold onto the series of scores $S_{M_1}$ and $S_{M_2}$, and then utilise [`ttest_ind`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy-stats-ttest-ind) found within `scipy` in order to calculate the t-test for the means of two independent samples of scores.

In this evaluation method, it makes sense to perform the `cross_validate` across the entire dataset, as we are purely evaluating the models at this point.
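
As a sketch of how the p-value behaves, the snippet below runs `ttest_ind` on made-up score series (drawn from normal distributions purely for illustration, they are not real cross validation results): one pair of clearly different models, and one pair that is identical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-fold accuracy scores (100 folds each).
scores_m1 = rng.normal(loc=0.95, scale=0.02, size=100)
scores_m2 = rng.normal(loc=0.85, scale=0.02, size=100)
scores_m1_copy = scores_m1.copy()  # a "model" that behaves exactly like M1

_, p_different = stats.ttest_ind(scores_m1, scores_m2)
_, p_identical = stats.ttest_ind(scores_m1, scores_m1_copy)

print(p_different < 0.05)
print(p_identical)
```

Two identical score series give a t-statistic of zero and therefore a p-value of 1, which is what to expect whenever two models end up being the same tree.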

In [31]:
from sklearn.model_selection import cross_validate
from scipy import stats

for ds in full_datasets:
    print("-- For dataset", ds.name)
    
    scores = cross_validate(
        getFittedAlphaTunedClassifier(ds, getCandidateCCPParameters(ds)),
        pd.concat([ds.training.features, ds.testing.features]),
        pd.concat([ds.training.labels, ds.testing.labels]),
        scoring='accuracy',
        cv=100
    )
    tuned_post_pruned_scores = scores['test_score']

    scores = cross_validate(
        getTrainedStumpClassifier(ds),
        pd.concat([ds.training.features, ds.testing.features]),
        pd.concat([ds.training.labels, ds.testing.labels]),
        scoring='accuracy',
        cv=100
    )
    stump_scores = scores['test_score']
    
    scores = cross_validate(
        getTrainedUnprunedClassifier(ds),
        pd.concat([ds.training.features, ds.testing.features]),
        pd.concat([ds.training.labels, ds.testing.labels]),
        scoring='accuracy',
        cv=100
    )
    unpruned_scores = scores['test_score']


    _, p_value = stats.ttest_ind(tuned_post_pruned_scores, unpruned_scores)
    print("P-Value for independent t-test (tuned post pruned classifier, unpruned classifier): ", p_value)
    _, p_value = stats.ttest_ind(tuned_post_pruned_scores, stump_scores)
    print("P-Value for independent t-test (tuned post pruned classifier, stump classifier)   : ", p_value)
    _, p_value = stats.ttest_ind(unpruned_scores, stump_scores)
    print("P-Value for independent t-test          (unpruned classifier, stump classifier)   : ", p_value)
    print("\n")

-- For dataset website-phishing
P-Value for independent t-test (tuned post pruned classifier, unpruned classifier):  1.0
P-Value for independent t-test (tuned post pruned classifier, stump classifier)   :  1.591989221442911e-64
P-Value for independent t-test          (unpruned classifier, stump classifier)   :  1.591989221442911e-64


-- For dataset arrythmia
P-Value for independent t-test (tuned post pruned classifier, unpruned classifier):  0.44681146073063105
P-Value for independent t-test (tuned post pruned classifier, stump classifier)   :  9.218024593725492e-06
P-Value for independent t-test          (unpruned classifier, stump classifier)   :  0.00013376738822302387


-- For dataset bcp
P-Value for independent t-test (tuned post pruned classifier, unpruned classifier):  0.9839270620187288
P-Value for independent t-test (tuned post pruned classifier, stump classifier)   :  0.12099988221516042
P-Value for independent t-test          (unpruned classifier, stump classifier)   :  0.1

## Datasets Commentary

For all three datasets, it appears that the stump classifier fitted on the training set consistently performed worse (wrt. model accuracy) when evaluated on the testing dataset. However, from an independent two-sample t-test, with samples populated from each fold of the 100-fold cross validation run, we can only truly say that there is a difference for the website-phishing and arrythmia datasets (as p-value < 0.05).

Hence we can say for the bcp dataset that the stump classifier's accuracy was not _that different_ from the other tree approaches; this is evidence that there is a single column in the bcp dataset that highly correlates with the class label. In contrast, on the arrythmia and website-phishing datasets performance appears to improve significantly as the tree grows more complex, which suggests the classification patterns are more complex than can be captured by a single column.

# (Task 5) Pre-Pruned Tree
In contrast to our simple section in "A Post Pruned Tree", we will now look exclusively at some pre-pruning techniques that actively restrict the growth of the tree as it is built. The hyperparameters we will look at are, arbitrarily, `max_depth` and, to highlight further how `GridSearchCV` works, a series of `min_samples_leaf` values in conjunction. `min_samples_split` could be added to the grid in the same way (keeping in mind that as an integer it must be at least 2), but the cell below does not search over it.

Candidates for `max_depth` will be just `[1, 2, ..., unpruned.tree_.max_depth]`, and for `min_samples_leaf` they will be `[1, 5, 10, 20, 50, 100, 200, 500]`.

_Note: as you can see, we can search within an arbitrarily sized hyperparameter space._
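
The size of that space multiplies quickly. As a sketch, `ParameterGrid` (the helper `GridSearchCV` uses to enumerate a `param_grid`) can count the combinations up front; the depth of 27 is an example value assumed purely for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Assumed unpruned depth of 27, with the leaf-size candidates listed above.
params = {
    "max_depth": list(range(1, 27 + 1)),
    "min_samples_leaf": [1, 5, 10, 20, 50, 100, 200, 500],
}

grid = ParameterGrid(params)
n_candidates = len(grid)         # 27 depths x 8 leaf sizes
n_fits = n_candidates * 10       # one fit per candidate per fold under 10-fold CV
print(n_candidates, n_fits)
```

Every extra hyperparameter multiplies the candidate count again, which is the price of an exhaustive search.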

In [28]:
def getFittedTunedPrePrunedClassifier(ds: Dataset):
    base_classifier = tree.DecisionTreeClassifier(random_state=0, criterion="entropy", splitter="best")
    params = {
        'max_depth': list(range(1, getTrainedUnprunedClassifier(ds).tree_.max_depth + 1)),
        'min_samples_leaf': [1, 5, 10, 20, 50, 100, 200, 500],
    }
    search = GridSearchCV(
        estimator=base_classifier,
        n_jobs=-1,
        param_grid=params,
        cv=10,
        verbose=True,
        scoring='accuracy'
    )
    search.fit(ds.training.features, ds.training.labels)

    best_performing_pre_pruned_classifier = tree.DecisionTreeClassifier(
        random_state=0,
        criterion="entropy",
        splitter="best",
        max_depth=search.best_params_['max_depth'],
        min_samples_leaf=search.best_params_['min_samples_leaf'],
    )
    
    best_performing_pre_pruned_classifier.fit(ds.training.features, ds.training.labels)
    return best_performing_pre_pruned_classifier

for ds in full_datasets:
    print("-- For dataset", ds.name)
    
    preClassifier = getFittedTunedPrePrunedClassifier(ds)
    print('Best "max_depth": ' + str(preClassifier.get_params()['max_depth']) 
          + ' (out of: ' + str(getTrainedUnprunedClassifier(ds).tree_.max_depth) + ")")
    print('Best "min_samples_leaf":', preClassifier.get_params()['min_samples_leaf'])
    
    predicted_labels = preClassifier.predict(ds.testing.features)
    score = accuracy_score(ds.testing.labels, predicted_labels)
    print(f"Best performing pre pruned classifier accuracy: %{score*100}")

    postClassifier = getFittedAlphaTunedClassifier(ds, getCandidateCCPParameters(ds))
    predicted_labels = postClassifier.predict(ds.testing.features)
    score = accuracy_score(ds.testing.labels, predicted_labels)
    print(f"Best performing post pruned classifier accuracy: %{score*100}")
          
    scores = cross_validate(
        preClassifier,
        pd.concat([ds.training.features,ds.testing.features]),
        pd.concat([ds.training.labels,ds.testing.labels]),
        scoring='accuracy',
        cv=100
    )
    preScores = scores['test_score']
    
    scores = cross_validate(
        postClassifier,
        pd.concat([ds.training.features,ds.testing.features]),
        pd.concat([ds.training.labels,ds.testing.labels]),
        scoring='accuracy',
        cv=100
    )      
    postScores = scores['test_score']
    _, p_value = stats.ttest_ind(preScores, postScores)
    print("P-Value for independent t-test (pre pruned strategy, post pruned strategy): ", p_value)
    
    print("\n")

-- For dataset website-phishing
Fitting 10 folds for each of 216 candidates, totalling 2160 fits
Best "max_depth": 26 (out of: 27)
Best "min_samples_leaf": 1
Best performing pre pruned classifier accuracy: %96.29182996683751
Best performing post pruned classifier accuracy: %96.26168224299066
P-Value for independent t-test (pre pruned strategy, post pruned strategy):  0.7992654587493159


-- For dataset arrythmia
Fitting 10 folds for each of 104 candidates, totalling 1040 fits
Best "max_depth": 6 (out of: 13)
Best "min_samples_leaf": 10
Best performing pre pruned classifier accuracy: %69.11764705882352
Best performing post pruned classifier accuracy: %70.58823529411765
P-Value for independent t-test (pre pruned strategy, post pruned strategy):  0.727295682859028


-- For dataset bcp
Fitting 10 folds for each of 56 candidates, totalling 560 fits
Best "max_depth": 5 (out of: 7)
Best "min_samples_leaf": 1
Best performing pre pruned classifier accuracy: %95.60975609756098
Best performing po

## Datasets Commentary
Using the same independent two-sample t-test, we see that for all datasets there aren't significant differences in model accuracy between these particular post-pruned and pre-pruned techniques. This really doesn't suggest anything, and should not discourage continued interest in either technique on its own.

What is interesting is that the pre-pruning hyperparameters selected show that there _is_ some sort of benefit from applying these pre-pruning techniques, as the search has shown that non-trivial (restrictive) hyperparameter values perform well. Considering that both pre and post pruning techniques show an improvement overall, it'd be very interesting to perform a hyperparameter search in the context of a hybrid approach utilising both pre and post pruning techniques.