# 2 Decision Trees II

In this task, we explore how to program decision trees in Python. We will use popular libraries developed specifically for this machine learning method, and also write some of the functions needed to set up a decision tree classifier ourselves.

Before you start working on this task, check out the [Tree](https://scikit-learn.org/stable/modules/tree.html) module from the [Scikit-Learn](https://scikit-learn.org/stable/index.html) library and get familiar with some functions this module provides.

If you haven't installed the correct version 0.24.1 of scikit-learn yet, just run the following line of code to properly install it on your machine.

In [None]:
!pip install scikit-learn==0.24.1

## 2.1 Obtaining the **Iris Plants** dataset
Let's use another interesting dataset this time: the [Iris Plants Dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-dataset) provided by scikit-learn, which is used to classify three different types of iris flower. Similar to the Boston Housing Prices dataset, this one is fully preprocessed and ready for analysis.

In [None]:
# import libraries
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import datasets

In [None]:
iris = datasets.load_iris()

In [None]:
features = iris.feature_names
features

In [None]:
classes = iris.target_names
classes

So as you can see, the class names are the Latin names for each type of iris flower. But let's continue by taking a look at the dataset description:

In [None]:
print(iris.DESCR)

## 2.2 Automated Decision Tree Classification
We can now get the DecisionTreeClassifier and train it on our feature and target values.

In [None]:
from sklearn import tree
X, y = iris.data, iris.target
dt = tree.DecisionTreeClassifier(criterion='entropy', random_state=42)
dt = dt.fit(X, y)
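
Before plotting, it can help to inspect the fitted tree in plain text. The following is a minimal sketch, assuming the same setup as in the cell above; `export_text` and `score` are part of scikit-learn, while the variable names `rules` and `train_acc` are only illustrative. Note that the accuracy here is measured on the training data itself, so it says nothing about generalization.

```python
from sklearn import datasets, tree

# Same setup as above, repeated so that this cell runs on its own
iris = datasets.load_iris()
X, y = iris.data, iris.target
dt = tree.DecisionTreeClassifier(criterion='entropy', random_state=42).fit(X, y)

# Text rendering of the learned decision rules
rules = tree.export_text(dt, feature_names=list(iris.feature_names))
print(rules)

# Fraction of training samples the tree classifies correctly
train_acc = dt.score(X, y)
print(train_acc)
```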

Awesome, now it's time to plot our tree:

In [None]:
tree.plot_tree(dt)

This doesn't look too good. We need a bigger figure size and don't want to see this unreadable text. So let's use ```matplotlib``` to adjust the size of the plot and enter some parameters to the ```plot_tree()``` function:

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(30,10))
_ = tree.plot_tree(dt, feature_names=features, class_names=classes, filled=True, fontsize=14)

By the way, ```filled=True```, as described in the ```plot_tree``` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html), colors the nodes to indicate the majority class for classification, the extremity of values for regression, or the purity of the node for multi-output.

So now we know how to automatically create a decision tree using existing Python libraries. But since we still haven't looked behind the curtains of these automated functions, we want to write parts of a decision tree classifier ourselves.

## 2.3 Attribute Splitting with ID3 Algorithm

In the lecture, you got to know two criteria for attribute splitting, i.e. information gain (entropy) and Gini impurity. Now it's time to write some functions that determine the next attribute to split on according to the entropy criterion. **Your part will be to implement the four helper functions below**.

The first function ```entropy(...)``` takes an array of values (usually labels) and outputs the entropy (or _info_).

In [None]:
def entropy(value_arr):
    # your code here
    return entropy
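
If you get stuck, here is one possible sketch of the entropy calculation, assuming the usual definition $-\sum_i p_i \log_2 p_i$ over the relative frequencies $p_i$ of the distinct values. The name `entropy_sketch` is hypothetical and chosen so that it does not clash with your own `entropy(...)`; it uses only the standard library and accepts lists as well as NumPy arrays.

```python
from collections import Counter
from math import log2

def entropy_sketch(values):
    """Shannon entropy (in bits) of the distribution of distinct values."""
    n = len(values)
    counts = Counter(values)  # occurrences of each distinct value
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Two equally frequent classes carry exactly one bit of information
print(entropy_sketch(['a', 'a', 'b', 'b']))  # 1.0

# Three equally frequent classes: log2(3), roughly 1.585
print(entropy_sketch([0, 1, 2]))
```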

The second function ```avg_info(...)``` takes an array of values of an attribute (usually an entire column of the dataset, if each column represents one attribute) and an array of labels of the same length, and outputs the average information of that attribute. This means the function has to be invoked once for each attribute in the dataset.

In [None]:
def avg_info(attr_values, labels):
    # your code here
    return avg_info
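
One way to approach this is sketched below, under the assumption that every distinct attribute value forms its own branch (the categorical view of ID3): the labels are grouped by attribute value, and the entropy of each group is weighted by the group's share of the data. The names `entropy_sketch` and `avg_info_sketch` and the toy data are hypothetical and not taken from the lecture.

```python
from collections import Counter, defaultdict
from math import log2

def entropy_sketch(values):
    """Shannon entropy (in bits) of the distribution of distinct values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def avg_info_sketch(attr_values, labels):
    """Weighted average label entropy over the groups induced by the attribute."""
    groups = defaultdict(list)
    for value, label in zip(attr_values, labels):
        groups[value].append(label)  # collect the labels per attribute value
    n = len(labels)
    return sum(len(g) / n * entropy_sketch(g) for g in groups.values())

# Hypothetical toy data: 'sunny' is mixed (entropy 1), 'rain' is pure (entropy 0)
outlook = ['sunny', 'sunny', 'rain', 'rain']
play = ['no', 'yes', 'yes', 'yes']
print(avg_info_sketch(outlook, play))  # 0.5
```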

Now that we have the overall entropy (as a single number) and the average information of all attributes (as an array), the function ```info_gain(...)``` calculates the information gain of all attributes and returns the corresponding array. This means we call this function once for the whole dataset, not separately for each attribute.

In [None]:
def info_gain(info, attr_info):
    # your code here
    return gain
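
As a sketch, assuming the information gain of an attribute is the overall entropy minus that attribute's average information, the calculation reduces to an element-wise subtraction. The name `info_gain_sketch` and the numbers below are made up for illustration.

```python
def info_gain_sketch(info, attr_info):
    """Information gain per attribute: overall entropy minus average information."""
    return [info - a for a in attr_info]

# Hypothetical values: overall entropy 1.0 and three attributes
gains = info_gain_sketch(1.0, [0.5, 0.75, 0.0])
print(gains)  # [0.5, 0.25, 1.0]
```

With NumPy, the same result could be obtained as `info - np.asarray(attr_info)`.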

Finally, the function ```get_split_attr(...)``` returns the index position of the attribute with the highest information gain.

In [None]:
def get_split_attr(gain_arr):
    # your code here
    return attr_pos
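
A minimal sketch of this selection step, assuming the attribute with the largest information gain should be chosen and that ties are resolved in favor of the first such attribute. The name `get_split_attr_sketch` and the example values are hypothetical.

```python
def get_split_attr_sketch(gain_arr):
    """Index of the largest gain; the first index wins in case of a tie."""
    return max(range(len(gain_arr)), key=lambda i: gain_arr[i])

print(get_split_attr_sketch([0.5, 0.25, 1.0]))  # 2
```

For NumPy arrays, `np.argmax(gain_arr)` behaves the same way, including the tie-breaking.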

Here's some code to test your functions (you don't have to change anything here). All steps follow the equations on slide 34 of the [decision tree slide set](https://lernen.min.uni-hamburg.de/pluginfile.php/164416/mod_resource/content/2/L07%20Simple%20Decision%20trees_2021.pdf). So try to understand what calculations have to be done there and this task will be a breeze. ;-)

In [None]:
#
# TEST CODE / MAIN FUNCTION
#


# Step 1: Calculate Information (Entropy)
info = entropy(iris.target)

# Step 2: Calculate Average Information of all Attributes
attr_info = [avg_info(attr, iris.target) for attr in iris.data.T]

# Step 3: Calculate Information Gain
gain = info_gain(info, attr_info)

# Step 4: Determine Split Attribute based on Information Gain
attr_pos = get_split_attr(gain)
attr_name = iris.feature_names[attr_pos]

# Step 5: Some fancy output for debugging
print('The next attribute to use for splitting is {}.'.format(attr_name))

_Hint: Does the print statement output the same attribute that the decision tree above splits on at its root? That should be a good sign!_