## Decision Tree - Exercises

**NOTICE:**
1. You are allowed to work in groups of up to three people but **have to document** your group's\
 members in the top cell of your notebook.
2. **Comment your code** and explain what you do (refer to the slides). It will help you understand the topics\
 and help me understand your thought process. The quality of your comments will be graded. 
3. **Discuss** and analyze your results, and **write down what you learned**. These are not programming\
 exercises; they are about learning and getting a feel for these methods. Such questions might be asked in the\
 final exam. 
4. Feel free to **experiment** with these methods. Change parameters, think about improvements, and write down\
 what you learned. This is not only about collecting points for the final grade; it is about understanding\
 the methods. 
5. All exercises can be part of the final exam; your **answers, experiments and documented learnings will be graded**. 


In [1]:
# Execute this cell; it sets up the data used in the exercises below. 

import numpy as np

# attribute1: corner ({true, false})
# attribute2: blue ({true, false})
#

circle = 1
triangle = 2
rectangle = 3

A = np.array([  circle, circle, circle, circle, circle, circle, 
                triangle, triangle, triangle, 
                rectangle, rectangle, rectangle, rectangle, rectangle
                ])

B1 = np.array([ triangle, triangle, triangle,
                rectangle,rectangle,rectangle,rectangle,rectangle 
                ])

B2 = np.array([circle, circle, circle, circle, circle, circle])

C1 =  np.array([ triangle, triangle, triangle,
                circle, circle
                ])

C2 = np.array([circle, circle, circle, circle,
                rectangle, rectangle, rectangle, rectangle, rectangle])

### Exercise 1 - Entropy: 


**Summary:** In this task you implement a function that computes the entropy in Python. Compare your implementation with your
hand-calculated values from the exercise in the slides.

**Provided Code:** Use the cell below for your implementation.

**Your Tasks in this exercise:**
1. Implement a Python function that computes the entropy.
2. Document your learnings. 
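
As a starting point, the entropy of a label set with class proportions $p_k$ is commonly defined as $H(Y) = -\sum_k p_k \log_2 p_k$. The block below is a minimal sketch of one way to compute this with NumPy, not the reference solution. The name `entropy_sketch` is made up for illustration, and the definition should still be checked against the slides.

```python
import numpy as np

def entropy_sketch(Y: np.ndarray) -> float:
    """Entropy (in bits) of a non-empty 1-D array of integer class labels."""
    # Count how often each distinct label occurs.
    _, counts = np.unique(Y, return_counts=True)
    # Turn the counts into class proportions p_k.
    p = counts / counts.sum()
    # Every label returned by np.unique occurs at least once, so p > 0
    # and log2 never sees a zero.
    return float(-np.sum(p * np.log2(p)))

print(entropy_sketch(np.array([1, 1, 2, 2])))  # two equally likely classes
print(entropy_sketch(np.array([3, 3, 3])))     # a pure set
```

Calling it on the arrays `A`, `B1`, `B2`, `C1` and `C2` from the setup cell gives values to compare with the hand calculations.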


In [2]:
def entropy(Y: np.array):
    """ Compute the entropy of a given array of class-labels.
    
    Parameters
    ----------
    Y: np.array
        A one dimensional numpy array containing class labels. Class labels
        are assumed not to be one-hot encoded but categorical integer values. 

    Returns
    ----------
    Entropy of Y. 
    """
    pass

### Exercise 2 - Information Gain:

**Summary:** In this exercise you will implement the conditional entropy and the information gain in Python. 

**Provided Code:** Use the method stubs in the cells below for your implementation. 

**Your Tasks in this exercise:**
1. Implement a Python function that computes the conditional entropy. 
2. Implement a Python function that computes the information gain. 
3. Compare your implementation with your hand-calculated values. 
4. Document your learnings. 
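
One common formulation, which should be checked against the slides, weights the entropy of each subset by its relative size, $H(T \mid a) = \sum_v \frac{|S_v|}{|T|} H(S_v)$, and defines the information gain as $IG(T, a) = H(T) - H(T \mid a)$. The sketch below follows that formulation under the assumption that the subsets in `Sa` together contain exactly the labels in `T`. All function names and the toy arrays are made up for illustration.

```python
import numpy as np

def _entropy_bits(Y):
    """Helper: entropy (in bits) of a non-empty 1-D label array."""
    _, counts = np.unique(Y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy_sketch(Sa):
    """Size-weighted average of the entropies of the subsets in Sa."""
    total = sum(len(S) for S in Sa)
    return sum(len(S) / total * _entropy_bits(S) for S in Sa)

def information_gain_sketch(T, Sa):
    """Entropy of T minus the conditional entropy of its split Sa."""
    return _entropy_bits(T) - conditional_entropy_sketch(Sa)

# Toy labels (not the exercise data): a split that separates the classes
# completely and a split that leaves the class mix unchanged.
T_toy = np.array([1, 1, 2, 2])
perfect_split = [np.array([1, 1]), np.array([2, 2])]
useless_split = [np.array([1, 2]), np.array([1, 2])]

print(information_gain_sketch(T_toy, perfect_split))
print(information_gain_sketch(T_toy, useless_split))
```

With the arrays from the setup cell, `[B1, B2]` and `[C1, C2]` are two candidate splits of `A` that can be compared in the same way.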

In [3]:
def conditional_entropy(Sa: list):
    """ Compute the conditional entropy.

    Compute the conditional entropy for a list of numpy arrays with given class labels. Each list entry 
    is assumed to contain the class labels of a set of data that was created by splitting a training set
    of data according to an attribute.

    Parameters
    ----------
    Sa: [np.array]
        A list of one dimensional numpy arrays each containing class labels. Class labels
        are assumed not to be one-hot encoded but categorical integer values. 

    Returns
    ----------
    Conditional entropy of the split Sa. 
    """
    pass

In [4]:
def information_gain(T: np.array, Sa: list):
    """ Compute the information gain.

    Parameters
    ----------
    T: np.array
        A one dimensional numpy array containing class labels. Class labels
        are assumed not to be one-hot encoded but categorical integer values. 

    Sa: [np.array]
        A list of one dimensional numpy arrays each containing class labels. Class labels
        are assumed not to be one-hot encoded but categorical integer values. 

    Returns
    ----------
    Bits saved when encoding Sa instead of T. 
    """
    pass

### Exercise 3 - Gini Impurity:

**Summary:** In this exercise you will implement the Gini impurity. 

**Provided Code:** Use the method stubs in the cells below for your implementation. 

**Your Tasks in this exercise:**
1. Implement a Python function that computes the Gini impurity. 
2. Compare the results of your implementation with your hand-calculations. 
3. Document your learnings. 
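
The Gini impurity of a label set with class proportions $p_k$ is usually written as $G(Y) = 1 - \sum_k p_k^2$. The block below is a minimal sketch of that formula, assuming a non-empty label array. The name `gini_sketch` is hypothetical and the definition should be checked against the slides.

```python
import numpy as np

def gini_sketch(Y: np.ndarray) -> float:
    """Gini impurity of a non-empty 1-D array of integer class labels."""
    # Class proportions p_k from the label counts.
    _, counts = np.unique(Y, return_counts=True)
    p = counts / counts.sum()
    # One minus the probability of drawing the same class twice.
    return float(1.0 - np.sum(p ** 2))

print(gini_sketch(np.array([1, 1, 2, 2])))  # two equally likely classes
print(gini_sketch(np.array([3, 3, 3])))     # a pure set
```

Like the entropy, the value is zero for a pure set and grows as the classes become more evenly mixed.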

In [5]:
def gini_impurity(Y: np.array):
    """ Compute the gini impurity.

    Parameters
    ----------
    Y: np.array
        A one dimensional numpy array containing class labels. Class labels
        are assumed not to be one-hot encoded but categorical integer values. 

    Returns
    ----------
    Gini impurity of the set with labels Y.
    """
    pass

### Exercise 4 - Decision Tree in scikit-learn:

**Summary:** In this exercise you will build a decision tree based on a public implementation in scikit-learn. 

**Provided Code:** The cell below creates your training data. 

**Your Tasks in this exercise:**
1. Build a decision tree using **gini impurity** and a decision tree using **entropy** based on the implementation in sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). 
2. Interpret and compare the results by using the ```plot_tree()``` method. 
3. Document your learnings. 
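
The sketch below shows the general pattern for fitting one tree per splitting criterion. It uses a made-up toy data set (`X_toy`, `y_toy`) rather than the exercise data, and prints a text rendering with `tree.export_text` so that it runs without a plotting backend.

```python
import numpy as np
from sklearn import tree

# Hypothetical toy data (not the exercise data): two binary features,
# and the label is simply a copy of the first feature.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_toy = np.array([0, 0, 1, 1])

trees = {}
for criterion in ("gini", "entropy"):
    # criterion selects the impurity measure used to rate candidate splits.
    clf = tree.DecisionTreeClassifier(criterion=criterion, random_state=0)
    clf.fit(X_toy, y_toy)
    trees[criterion] = clf
    print(criterion)
    print(tree.export_text(clf, feature_names=["x1", "x2"]))
```

For the graphical view requested in the task, `tree.plot_tree(trees["gini"])` draws the same structure with matplotlib.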


In [26]:
from sklearn import tree
Y = np.array([  circle, circle, circle, circle, circle, circle, 
                triangle, triangle, triangle, 
                rectangle, rectangle, rectangle, rectangle, rectangle
            ])

# Attributes (aka Features)
# x1 = corners, x2 = blue
# Note that sklearn does not support categorical attributes directly. We therefore
# encode each boolean attribute as a number, using 0 for False and 1 for True. 
#
X = np.array([[0, 1],  # circle 1
              [0, 1],  # circle 2
              [0, 0], # circle 3
              [0, 0], # circle 4
              [0, 0], # circle 5
              [0, 0], # circle 6
              [1, 1],   # triangle 1
              [1, 1],   # triangle 2
              [1, 1],   # triangle 3
              [1, 0],  # rectangle 1
              [1, 0],  # rectangle 2
              [1, 0],  # rectangle 3
              [1, 0],  # rectangle 4
              [1, 0]   # rectangle 5
            ])

### Exercise 5 - Experiment with Decision Trees in scikit-learn:

**Summary:** In this exercise it is your job to experiment with the decision tree implementation of sklearn. 

**Provided Code:**  I provided you with a ```gen_data()``` method which is capable of generating (random) data.

**Your Tasks in this exercise:**
1. Train a decision tree using scikit-learn on this synthetic data. Evaluate its performance using appropriate evaluation techniques. 
2. Answer the following question:
    * How does the accuracy change with the number of samples?
3. Create separate datasets for training and testing.
    * What is the difference between training and test accuracy? Why is it different?
4. Restrict the depth of the decision tree.
    * What's the effect of changing the depth of the decision tree?
    * In which scenarios could this be useful?
5. Have a closer look at the decisions in a tree. Explain the results. (Note: You can increase the size of the plotted tree by adding
the line ```plt.figure(figsize=(20,20))``` before calling the ```tree.plot_tree``` method.)
6. Document your learnings. 

**Hints:**
* Use the ```accuracy_score()``` method from sklearn.metrics for evaluation.
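
The following sketch shows one possible evaluation workflow with a held-out test set and an optional depth limit. It uses stand-in data (`X_demo`, `y_demo`) generated inline instead of calling ```gen_data()```, so all names and sizes are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): uniform random points whose labels do not
# depend on the features, similar in spirit to gen_data() but without plotting.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = np.zeros(200, dtype=int)
y_demo[:100] = 1

# Hold out 30% of the samples for testing; shuffling is on by default.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

results = {}
for depth in (None, 2):  # None lets the tree grow until its leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, clf.predict(X_train))
    test_acc = accuracy_score(y_test, clf.predict(X_test))
    results[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")
```

Comparing the gap between training and test accuracy for both settings is a useful starting point for the questions above.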


In [30]:
import matplotlib.pyplot as plt

def gen_data(num_samples=10):

    std = 10
    mean = 0

    X = std * np.random.uniform(0, 1, (num_samples, 2)) + mean
    Y = np.zeros(num_samples)
    Y[0:int(num_samples/2)] = 1

    plt.figure()
    plt.scatter(X[0:int(num_samples/2),0], X[0:int(num_samples/2),1])
    # Slice to the end of the array; a stop index of -1 would drop the last sample.
    plt.scatter(X[int(num_samples/2):,0], X[int(num_samples/2):,1])
    plt.legend(['Class-1', 'Class-2'])
    return X,Y

### Exercise 6 - Condition monitoring of hydraulic systems


**Summary:** In this exercise you will work with a real-world data set to create decision trees that predict failure modes of a
hydraulic system. 

**Provided Code:** The cell below can be used to import the data set. 

**Your Tasks in this exercise:**
1. Load the dataset and use a decision tree to classify it. 
2. Make sure you use separate training and test splits.
3. Try to predict the different types of faults using decision trees. 
4. Explain which attributes the trees select for their predictions. 
5. Document your learnings. 
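
One way to see which attributes a fitted tree relies on is its `feature_importances_` attribute. The sketch below illustrates this on stand-in data with made-up sensor names, because the real data set is only loaded further down. It is an illustration of the idea, not an analysis of the hydraulic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): 300 samples with three made-up sensor features.
rng = np.random.default_rng(0)
names = ["sensor_a", "sensor_b", "sensor_c"]
X_fake = rng.normal(size=(300, 3))
# The target depends only on sensor_b, so the tree should pick it up.
y_fake = (X_fake[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_fake, y_fake, test_size=0.25, random_state=0, stratify=y_fake
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Rank the features by their share of the total impurity reduction.
ranking = sorted(
    zip(names, clf.feature_importances_), key=lambda item: item[1], reverse=True
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
print("test accuracy:", clf.score(X_test, y_test))
```

With the exercise data, the same pattern would presumably use `data.X` together with one of the targets such as `data.Y_cooler`, assuming `data.X` is a two-dimensional array of samples by attributes.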


------------

#### Additional Information

Source: https://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems

The data set was experimentally obtained with a hydraulic test rig. This test rig consists of a primary working and a secondary cooling-filtration circuit which are connected via the oil tank [1], [2]. The system cyclically repeats constant load cycles (duration 60 seconds) and measures process values such as pressures, volume flows and temperatures while the condition of four hydraulic components (cooler, valve, pump and accumulator) is quantitatively varied.

#### Attribute Information:

**Attributes are:**
```
Attribute   Sensor   Physical quantity              Unit    Sampling rate
X[0]        PS1      Pressure                       bar     100 Hz
X[1]        PS2      Pressure                       bar     100 Hz
X[2]        PS3      Pressure                       bar     100 Hz
X[3]        PS4      Pressure                       bar     100 Hz
X[4]        PS5      Pressure                       bar     100 Hz
X[5]        PS6      Pressure                       bar     100 Hz
X[6]        EPS1     Motor power                    W       100 Hz
X[7]        FS1      Volume flow                    l/min   10 Hz
X[8]        FS2      Volume flow                    l/min   10 Hz
X[9]        TS1      Temperature                    °C      1 Hz
X[10]       TS2      Temperature                    °C      1 Hz
X[11]       TS3      Temperature                    °C      1 Hz
X[12]       TS4      Temperature                    °C      1 Hz
X[13]       VS1      Vibration                      mm/s    1 Hz
X[14]       CE       Cooling efficiency (virtual)   %       1 Hz
X[15]       CP       Cooling power (virtual)        kW      1 Hz
X[16]       SE       Efficiency factor              %       1 Hz
```

The target conditions are:

**1: Cooler condition / %:**
* 3: close to total failure
* 20: reduced efficiency
* 100: full efficiency

**2: Valve condition / %:**
* 100: optimal switching behavior
* 90: small lag
* 80: severe lag
* 73: close to total failure

**3: Internal pump leakage:**
* 0: no leakage
* 1: weak leakage
* 2: severe leakage

**4: Hydraulic accumulator / bar:**
* 130: optimal pressure
* 115: slightly reduced pressure
* 100: severely reduced pressure
* 90: close to total failure

    

In [33]:
!wget https://github.com/shegenbart/Jupyter-Exercises/raw/main/data/condition_monitoring.pickle -P ../data

import pickle
import numpy as np
from dataclasses import dataclass

# Container for the pickled data set; the class must be defined before loading.
@dataclass
class Dataset:
    Description: str
    Attributes: list
    Targets_cooler: list
    Targets_valve: list
    Targets_leakage: list
    Targets_accu: list
    X: np.array
    Y_cooler: np.array
    Y_valve: np.array    
    Y_leakage: np.array
    Y_accu: np.array

def load_dataset(filename):
    with open(filename, 'rb') as fd:
        dataset = pickle.load(fd)
    return dataset

data = load_dataset('../data/condition_monitoring.pickle')