## ME4: Decision Tree Classifiers

#### Collaborators

Note: This homework can be done individually or by a group of two. If it is done by a group, please write the names of the students who worked together. Make sure that each individual makes their own submission(s).

-

-



### Setup

First, let's import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (although Python 2.x may work, it is deprecated, so we strongly recommend you use Python 3 instead), as well as Scikit-Learn ≥0.20.

In [None]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

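# Scikit-Learn ≥0.20 is required
import sklearn
# Compare the numeric (major, minor) parts; comparing version strings directly is lexicographic
assert tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 20)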

# Common imports
import numpy as np
import os
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)


# to make this notebook's output stable across runs
np.random.seed(42)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "decision_trees"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)

os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

## Part 0

- Read each cell of the examples below, run it, and check the output.

### Confusion matrix simple example 1 - binary classification

In [None]:
y_true1 = [1, 0, 0, 1, 1, 0, 1, 1, 0]
y_pred1 = [1, 1, 0, 1, 1, 0, 1, 1, 1]


confusion_mat1 = confusion_matrix(y_true1, y_pred1)

print(confusion_mat1)

In [None]:
# Print classification report
target_names1 = ['Class-0', 'Class-1']

result_metrics = classification_report(y_true1, y_pred1, target_names=target_names1)

print(result_metrics)
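
To see where the numbers in the report come from, the counts can be tallied by hand. The following is a minimal sketch in plain Python using the same labels as above. The names `tp`, `fp`, `fn`, and `tn` are introduced here for illustration only, and class 1 is treated as the positive class.

```python
# Same labels as in the binary example above
y_true1 = [1, 0, 0, 1, 1, 0, 1, 1, 0]
y_pred1 = [1, 1, 0, 1, 1, 0, 1, 1, 1]

pairs = list(zip(y_true1, y_pred1))

# Tally the four outcomes, treating class 1 as "positive"
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

precision = tp / (tp + fp)   # of everything predicted 1, how much was really 1
recall = tp / (tp + fn)      # of everything really 1, how much was found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(pairs)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision={:.2f} recall={:.2f} f1={:.2f} accuracy={:.2f}".format(
    precision, recall, f1, accuracy))
```

For a binary problem, scikit-learn lays out the confusion matrix as `[[TN, FP], [FN, TP]]`, with rows for the true class and columns for the predicted class.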

### Confusion matrix simple example 2 - multiclass classification

In [None]:
y_true2 = [1, 0, 0, 2, 1, 0, 3, 3, 3]
y_pred2 = [1, 1, 0, 2, 1, 0, 1, 3, 3]

confusion_mat2 = confusion_matrix(y_true2, y_pred2)
print(confusion_mat2)

In [None]:
# Print classification report
target_names2 = ['Class-0', 'Class-1', 'Class-2', 'Class-3']

result_metrics2 = classification_report(y_true2, y_pred2, target_names=target_names2)

print(result_metrics2)
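
In the multiclass case, per-class precision and recall can be read off the confusion matrix: the diagonal holds the correct predictions, column sums count how often each class was predicted, and row sums count how often each class actually occurred. The sketch below rebuilds the matrix with NumPy only, as an illustration of that arithmetic rather than of how scikit-learn implements it.

```python
import numpy as np

# Same labels as in the multiclass example above
y_true2 = np.array([1, 0, 0, 2, 1, 0, 3, 3, 3])
y_pred2 = np.array([1, 1, 0, 2, 1, 0, 1, 3, 3])

n_classes = 4
cm = np.zeros((n_classes, n_classes), dtype=int)
# Rows index the true class, columns index the predicted class
np.add.at(cm, (y_true2, y_pred2), 1)

correct = np.diag(cm)                  # correctly classified samples per class
precision = correct / cm.sum(axis=0)   # divide by "times predicted" (column sums)
recall = correct / cm.sum(axis=1)      # divide by "times it occurred" (row sums)

print(cm)
print("precision per class:", precision)
print("recall per class:   ", recall)
```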

### Decision Tree Classifier

- dataset: iris dataset

In [None]:
from IPython.display import Image

Image("images/iris.png")

#### Data visualization of the iris dataset before we start training and testing a model

- iris.csv is stored in a local folder 'data'. 

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

# read data from CSV file to dataframe
iris = pd.read_csv('./data/iris.csv')

# make sure you understand the type of the object
print(type(iris))

# check the top five and the bottom five data tuples
print(iris.head())
print(iris.tail())

# scatter matrix plot
pd.plotting.scatter_matrix(iris)

# display the scatter matrix (plt.figure() here would only open a new, empty figure)
plt.show()

## Decision Trees

- Read the details of decision tree classifier

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

- Check out the difference between model parameters and hyperparameters:

https://towardsdatascience.com/model-parameters-and-hyperparameters-in-machine-learning-what-is-the-difference-702d30970f6
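
As a rough illustration of the distinction for a decision tree: hyperparameters such as `max_depth` and `criterion` are set before training, while the splits themselves (which feature, which threshold) are learned during `fit()`. The sketch below assumes scikit-learn ≥0.20; the names `iris_demo` and `demo_clf` are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()
demo_clf = DecisionTreeClassifier(max_depth=2, criterion="gini", random_state=0)

# Hyperparameters: chosen by us before training
hyperparams = demo_clf.get_params()
print("max_depth:", hyperparams["max_depth"])
print("criterion:", hyperparams["criterion"])

demo_clf.fit(iris_demo.data, iris_demo.target)

# Model parameters: learned from the data during fit()
root_feature = demo_clf.tree_.feature[0]
print("number of nodes:", demo_clf.tree_.node_count)
print("root split feature:", iris_demo.feature_names[root_feature])
print("root split threshold:", demo_clf.tree_.threshold[0])
```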


## A simple example of DT modeling

- We start modeling without k-fold cross-validation and show, step by step, how to train a model and test it.

### Load data

- For the following code, we use sklearn.datasets package for loading a dataset instead of reading a data file stored on a local machine. 

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()

# make sure that you understand the type of the object
print(type(iris)) 
print(iris)

### Split the data into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

X = iris.data # sepal length and width, petal length and width
y = iris.target

#print(X)

# split the data: 80% for training, 20% for testing (test_size=0.20)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3, test_size=0.20)
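
A quick way to confirm what a given `test_size` does is to inspect the shapes of the resulting arrays. The sketch below repeats the split on a fresh copy of the data (the `_demo`, `_tr`, `_te` names are hypothetical) and also shows the optional `stratify` argument, which is not used in this notebook but keeps the class proportions equal in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# test_size=0.20 on 150 samples -> 120 for training, 30 for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=3)
print("train:", X_tr.shape, "test:", X_te.shape)

# Optional: stratify on the labels so each class is equally represented
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=3, stratify=y_demo)
print("test samples per class (plain):     ", np.bincount(y_te))
print("test samples per class (stratified):", np.bincount(y_te_s))
```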


## Training
### Learning using training data

- use Gini index measure 

*** Note: you can also use the information gain (entropy) measure by setting criterion="entropy" in the model.
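
Both measures score how mixed the classes in a node are. For class proportions $p_i$, Gini impurity is $1 - \sum_i p_i^2$ and entropy is $-\sum_i p_i \log_2 p_i$; both are zero for a pure node. The helper functions `gini` and `entropy` below are written for illustration and are not part of scikit-learn.

```python
import math

def gini(counts):
    """Gini impurity of a node, given the number of samples per class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy (in bits) of a node, given the number of samples per class."""
    total = sum(counts)
    # p * log2(1/p) equals -p * log2(p); empty classes contribute nothing
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

balanced = [50, 50, 50]   # e.g. the full iris dataset: 50 samples per class
pure = [50, 0, 0]         # a node containing a single class

print("balanced node: gini={:.3f} entropy={:.3f}".format(gini(balanced), entropy(balanced)))
print("pure node:     gini={:.3f} entropy={:.3f}".format(gini(pure), entropy(pure)))
```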

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(max_depth=2, criterion='gini')
tree_clf.fit(X_train, y_train)

## Testing
### Evaluating the model using testing data

In [None]:
y_pred = tree_clf.predict(X_test)

y_pred

## Model performance

### Confusion matrix

In [None]:
# compute and print the confusion matrix

confusion_mat = confusion_matrix(y_test, y_pred)

print(confusion_mat)

### Model performance summary

In [None]:
# Print classification report

target_names = iris.target_names

result_metrics = classification_report(y_test, y_pred, target_names=target_names)

print(result_metrics)


In [None]:
# you can access each class's metrics programmatically
# by setting output_dict=True, which returns a dictionary instead of a string
result_metrics_dict = classification_report(y_test, y_pred, target_names=target_names, output_dict=True)

print(result_metrics_dict)

# an example that shows how to access the value of precision metric of class 'setosa'
print(result_metrics_dict['setosa']['precision'])

### Draw a decision tree

- You may need to install graphviz package!

In [None]:
from graphviz import Source
from sklearn.tree import export_graphviz

export_graphviz(
        tree_clf,
        out_file=os.path.join(IMAGES_PATH, "iris_tree.dot"),
        feature_names=iris.feature_names,
        class_names=iris.target_names,
        rounded=True,
        filled=True
    )

Source.from_file(os.path.join(IMAGES_PATH, "iris_tree.dot"))
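
If graphviz is not available, a text rendering of the tree is one possible fallback. The sketch below uses `sklearn.tree.export_text`, which requires scikit-learn 0.21 or later (slightly newer than the 0.20 minimum checked in Setup). It trains its own small tree on the full iris data, so `iris_demo` and `demo_clf` are illustrative names, not the model trained above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris_demo = load_iris()
demo_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
demo_clf.fit(iris_demo.data, iris_demo.target)

# One line per node: split conditions on the way down, class at each leaf
tree_rules = export_text(demo_clf, feature_names=list(iris_demo.feature_names))
print(tree_rules)
```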

### Important features from a decision tree using gini index

- Decision tree classifier

https://scikit-learn.org/stable/modules/tree.html#classification

- Metrics

https://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
# plot important features
def plot_feature_importances(clf, feature_names):
    c_features = len(feature_names)
    plt.barh(range(c_features), clf.feature_importances_)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature name")
    plt.yticks(np.arange(c_features), feature_names)

In [None]:
clf1 = DecisionTreeClassifier(criterion='gini').fit(X_train, y_train)

print('Accuracy of DT classifier on training set: {:.2f}'
     .format(clf1.score(X_train, y_train)))
print('Accuracy of DT classifier on test set: {:.2f}'
     .format(clf1.score(X_test, y_test)))


plt.figure(figsize=(8,4), dpi=60)

# call the function above
plot_feature_importances(clf1, iris.feature_names)
plt.show()

print('Feature importances: {}'.format(clf1.feature_importances_))

In [None]:
clf2 = DecisionTreeClassifier(criterion='entropy').fit(X_train, y_train)

print('Accuracy of DT classifier on training set: {:.2f}'
     .format(clf2.score(X_train, y_train)))
print('Accuracy of DT classifier on test set: {:.2f}'
     .format(clf2.score(X_test, y_test)))


plt.figure(figsize=(8,4), dpi=60)

# call the function above
plot_feature_importances(clf2, iris.feature_names)
plt.show()

print('Feature importances: {}'.format(clf2.feature_importances_))

## k-Fold Cross-Validation

- Using the KFold class directly gives full control over how the folds are built and used

In [None]:
from sklearn.model_selection import KFold # import k-fold validation

kf = KFold(n_splits=3, random_state=None, shuffle=True) # Define the split - into 3 folds

kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator

print(kf) 
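
To make the folds concrete, the sketch below splits 150 stand-in samples into 3 folds and checks that each sample lands in a test fold exactly once. It fixes `random_state=42` so the shuffled folds are reproducible; with `random_state=None`, as above, the folds differ from run to run. `X_demo` and `kf_demo` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(150).reshape(-1, 1)   # stand-in for 150 samples
kf_demo = KFold(n_splits=3, shuffle=True, random_state=42)

fold_sizes = []
test_indices = []
for train_idx, test_idx in kf_demo.split(X_demo):
    fold_sizes.append((len(train_idx), len(test_idx)))
    test_indices.append(test_idx)

print("(train size, test size) per fold:", fold_sizes)

# Across all folds, every sample index appears in a test set exactly once
all_test = np.sort(np.concatenate(test_indices))
print("each sample tested once:", np.array_equal(all_test, np.arange(150)))
```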


### Applying k-Fold Cross-Validation

In [None]:
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)

for train_index, test_index in kf.split(X):
    #print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    tree_clf.fit(X_train, y_train)
    
    y_pred = tree_clf.predict(X_test)
    
    # Print classification report
    target_names = iris.target_names
    print(classification_report(y_test, y_pred, target_names=target_names))
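
When only a single score per fold is needed, `cross_val_score` offers a more compact alternative to the explicit loop. The sketch below is one possible way to summarize accuracy across folds, not a replacement for the per-fold reports above; the `_demo` names and the fixed `random_state` are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)

demo_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
cv = KFold(n_splits=3, shuffle=True, random_state=42)

# One accuracy value per fold (accuracy is the default score for classifiers)
scores = cross_val_score(demo_clf, X_demo, y_demo, cv=cv)

print("accuracy per fold:", scores)
print("mean accuracy: {:.3f} (std {:.3f})".format(scores.mean(), scores.std()))
```

Note that `cross_val_score` fits a fresh clone of the estimator on each fold, whereas the loop above refits the same `tree_clf` object, which therefore ends up holding the model from the last fold.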


## Predicting classes and class probabilities

In [None]:
# Class0 (setosa)
prob = tree_clf.predict_proba([[5.0, 3.4, 1.3, 0.2]]) 

# check predictions for different samples
# Class1 (versicolor)
#prob = tree_clf.predict_proba([[7.1, 3.1, 4.8, 1.4]])

# Class2 (virginica)
#prob = tree_clf.predict_proba([[6.4, 2.7, 4.9, 1.8]])

print(prob)


In [None]:
# predict the class of the same setosa sample used above (class 0 is setosa)
predicted = tree_clf.predict([[5.0, 3.4, 1.3, 0.2]])

print(predicted)
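
The two methods are closely related: `predict` returns the class with the highest probability from `predict_proba`. The sketch below checks this on the three sample flowers used above, with a tree trained on the full iris data; `iris_demo`, `demo_clf`, and `samples` are names made up for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()
demo_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
demo_clf.fit(iris_demo.data, iris_demo.target)

samples = [
    [5.0, 3.4, 1.3, 0.2],
    [7.1, 3.1, 4.8, 1.4],
    [6.4, 2.7, 4.9, 1.8],
]

proba = demo_clf.predict_proba(samples)   # one row per sample, one column per class
labels = demo_clf.predict(samples)        # predicted class index per sample

# Picking the most probable class by hand gives the same labels
from_proba = demo_clf.classes_[np.argmax(proba, axis=1)]

print(proba)
print(iris_demo.target_names[labels])
print("predict matches argmax of predict_proba:", np.array_equal(labels, from_proba))
```

For a decision tree, each probability is the fraction of training samples of that class in the leaf the sample falls into, so every row sums to 1.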

## ME4

### Part 1

## Construct decision trees

#### 1. Construct a decision tree using the following parameters

- Use the information gain (entropy) measure
- Apply 10-fold cross-validation (k=10) and print a summary of statistics (performance evaluation) for each fold


#### 2. Compare the performance results with those of the decision tree using Gini index measure in the above example

#### 3. For both trees, change the following parameters and observe the changes:

- The depth of the tree: currently max_depth=2 in the model training step. Change the depth to 3, 4, and 5 and check whether this affects the overall results.

- The k value for cross-validation is currently set to 3. Change it to k = 5, 7, and 10 and check whether this affects the overall results.

## Part 2

1. See DT examples at:
    
https://www.kaggle.com/dmilla/introduction-to-decision-trees-titanic-dataset

2. Discuss different ways to handle the following types of data for decision tree classification.

    - text data (strings): in the case a dataset includes non-numerical data. 

    - continuous data like age, weight, income, etc.
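
As a starting point for the discussion, the sketch below shows two common preprocessing options with pandas: one-hot encoding for string columns and binning for continuous columns. The toy table, the column names, and the bin edges are all made up for illustration. Keep in mind that scikit-learn's decision trees expect numeric input but split continuous features on thresholds directly, so binning is optional rather than required.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, 4.0, 61.0],
})

# Text (categorical) data: one-hot encode into one indicator column per value
encoded = pd.get_dummies(df, columns=["sex"])
print(encoded)

# Continuous data: optionally bin into ordered groups
# Intervals are (0, 12], (12, 40], (40, 100]; the edges are arbitrary choices
df["age_group"] = pd.cut(
    df["age"], bins=[0, 12, 40, 100], labels=["child", "adult", "senior"])
print(df)
```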


### Submission(s): Each student should make their own submission.

- Upload the notebook to your Git repo and provide a URL link in your summary.

- Submit your summary to Canvas.
