# Lab 7 - Machine Learning, Fall 2023
Collaborators: SJ Franklin

# <center>Lab : Decision Trees and Random Forests</center>

We use decision trees and random forests to classify the type of wine from its attributes. The wine dataset contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample. The class label variable has been transformed into a categorical variable.

The data contains no missing values and consists of only numeric data, with a three-class target variable (class label) for classification.

In [12]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, f1_score

import os
from IPython.display import Image

In [2]:
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine//wine.data', header=None)

df.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
              'Alcalinity of ash', 'Magnesium', 'Total phenols',
              'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
              'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']

display(df.head())

print('# data points: %d' % df.shape[0])
print('Class labels:', np.unique(df['Class label']))

| | Class label | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


# data points: 178
Class labels: [1 2 3]


## Data Preparation

### Exercise
Define a variable **X** that consists of all columns except the class label, and a variable **y** that holds just the class label column.

In [3]:
X = df.drop('Class label',axis=1)
y = df['Class label']

### Exercise
Split the data into training and test subsets. The test data should be about 20% of the total records.

In [4]:
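# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=19)
# Set aside a further 10% of the training rows as a validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=19)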
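
As a rough check on the sizes these two calls produce, the sketch below redoes the arithmetic by hand. It assumes that `train_test_split` rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the training side. The helper `split_sizes` is hypothetical and not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), assuming the test side is rounded up."""
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test

# First call: 178 rows, 20% held out for testing
n_train_full, n_test = split_sizes(178, 0.2)
# Second call: 10% of the remaining training rows become a validation set
n_train, n_val = split_sizes(n_train_full, 0.1)

print(n_train, n_val, n_test)
```

Under that assumption the test set holds 36 of the 178 rows (about 20%), which matches the totals of the confusion matrices later in the lab. Note that `X_val` and `y_val` are not used by any later cell.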

## Decision Tree

### Exercise :
Use the Scikit-learn package [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) to build a decision tree for our training set. You should use entropy to evaluate the splits, and the maximum depth of the tree should be 3.

In [6]:
tree_clf = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=3,
    random_state=19
    )
tree_clf.fit(X_train, y_train)
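
To make the `criterion="entropy"` setting more concrete, here is a minimal sketch of the quantity being minimised at each split. It is a hand-rolled illustration, not scikit-learn's implementation; the functions `entropy` and `information_gain` and the class counts are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a node with the given class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node: 10 samples of one class and 10 of another
parent = [10, 10]
# A candidate split that separates the two classes almost perfectly
children = [[9, 1], [1, 9]]

print(entropy(parent))                     # evenly mixed node: 1 bit
print(information_gain(parent, children))  # reduction in entropy from the split
```

A pure node has entropy 0, so a split that produces purer children yields a larger gain. The tree grows by repeatedly choosing the feature and threshold with the best such reduction until it reaches `max_depth`.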

### Exercise :
Display the accuracy of your trained model on the test set

In [9]:
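# Note: this refits clones of tree_clf on 3 folds of the test set; the tree fitted on X_train is not reused
cross_val_score(tree_clf, X_test, y_test, cv=3, scoring="accuracy")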

array([0.75      , 0.91666667, 0.75      ])
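
Because `cross_val_score` refits the estimator on each fold, the scores above come from trees trained on parts of the test set. A minimal sketch of scoring the tree fitted on the training data instead is given below. It uses scikit-learn's bundled copy of the wine data (`load_wine`) as a stand-in for the CSV download, which is an assumption made to keep the example self-contained. No output is shown because the figure depends on the split.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the CSV download: scikit-learn's bundled wine data
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=19
)

# Fit on the training rows only
tree_clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=19)
tree_clf.fit(X_train, y_train)

# Score the fitted tree on rows it has never seen
test_accuracy = accuracy_score(y_test, tree_clf.predict(X_test))
print(test_accuracy)
```

The same pattern should carry over to any fitted classifier: call `predict` on `X_test` and compare the result with `y_test`.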

### Exercise :
Display the confusion matrix, precision, recall, and F1 scores (for the latter three, you may want to look at the `classification_report` function).

In [11]:
# Confusion matrix
y_test_pred = cross_val_predict(tree_clf, X_test, y_test, cv=3)
confusion_matrix(y_test, y_test_pred)

array([[ 9,  1,  0],
       [ 0, 15,  2],
       [ 2,  2,  5]])

In [14]:
# Precision, recall, f1

print(precision_score(y_test, y_test_pred, average='weighted'))
print(recall_score(y_test, y_test_pred, average='weighted'))
print(f1_score(y_test, y_test_pred, average='weighted'))

0.7993626743626745
0.8055555555555556
0.7991071428571429
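
The weighted scores printed above can be rebuilt by hand from the decision tree's confusion matrix, which helps show what `average='weighted'` does. The sketch below is plain Python; the helper `weighted` is hypothetical, and the matrix is copied from the output above (rows are true classes, columns are predicted classes).

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = [
    [9, 1, 0],
    [0, 15, 2],
    [2, 2, 5],
]

n_classes = len(cm)
support = [sum(row) for row in cm]  # number of true samples per class
predicted = [sum(cm[i][j] for i in range(n_classes)) for j in range(n_classes)]
total = sum(support)

# Per-class scores: the diagonal holds the correctly classified samples
precision = [cm[k][k] / predicted[k] for k in range(n_classes)]
recall = [cm[k][k] / support[k] for k in range(n_classes)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

def weighted(values):
    """Average per-class values, weighting each class by its support."""
    return sum(v * s for v, s in zip(values, support)) / total

print(weighted(precision))
print(weighted(recall))
print(weighted(f1))
```

The three results agree with the printed scores up to floating-point rounding. The weighted recall is also the overall accuracy: 29 correct predictions out of 36.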


### Optional Exercise:
It can be interesting to see a visualization of the tree. To do this, follow the instructions in the book starting around page 175. If you are working on your local machine, you may need to install the graphviz package (http://www.graphviz.org/download). The graphviz application is already installed on Colab.

If you want to display the PNG file in your notebook, you can use the following code: `Image(filename='./output/fig-tree.png')`

In [39]:
import graphviz
from sklearn.tree import export_graphviz

if not os.path.exists('./output/'):
    os.makedirs('./output/')

class_names = ["1", "2", "3"]

export_graphviz(
  tree_clf,
  out_file='./output/fig-tree.png',
  feature_names=X.columns.tolist(),
  class_names=class_names,
  rounded=True,
  filled=True
)

# export_graphviz writes DOT source text (despite the .png file name), so Image() cannot display that file.
# Rendering the DOT file with graphviz creates a PDF of the tree instead.
graph = graphviz.Source.from_file('./output/fig-tree.png')
graph.render('./output/fig-tree.png', view=True)

'output/fig-tree.png.pdf'
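
If graphviz is awkward to set up, one alternative is scikit-learn's matplotlib-based `plot_tree`, which draws the tree without an external application. The sketch below swaps that technique in for `export_graphviz`. It fits a tree on the whole bundled wine dataset (`load_wine`) purely for illustration, and the output file name `fig-tree-plot.png` is hypothetical.

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; drop this line inside a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Stand-in data: scikit-learn's bundled copy of the wine dataset
data = load_wine()
tree_clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=19)
tree_clf.fit(data.data, data.target)

# Draw the fitted tree onto a matplotlib axis and save it as a PNG
out_path = "fig-tree-plot.png"  # hypothetical file name
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(
    tree_clf,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    filled=True,
    rounded=True,
    ax=ax,
)
fig.savefig(out_path, dpi=100)
plt.close(fig)

print(os.path.abspath(out_path))
```

In a notebook with `%matplotlib inline`, the figure should appear directly under the cell, so saving to disk becomes optional.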

## Random Forest

### Exercise :
Use the Scikit-learn package [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) to build a random forest for our training set. You should use entropy to evaluate the splits, and you should create approximately 200 trees.

In [40]:
rnd_clf = RandomForestClassifier(
    n_estimators=200,
    criterion='entropy',
    random_state=19
    )
rnd_clf.fit(X_train, y_train)

### Exercise :
Display the accuracy of your trained model on the test set

In [41]:
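# Note: this refits clones of rnd_clf on 3 folds of the test set; the forest fitted on X_train is not reused
cross_val_score(rnd_clf, X_test, y_test, cv=3, scoring="accuracy")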

array([1.        , 0.91666667, 0.91666667])

### Exercise :
Display the confusion matrix, precision, recall, and F1 scores (for the latter three, you may want to look at the `classification_report` function).

In [43]:
# Confusion matrix
y_test_pred = cross_val_predict(rnd_clf, X_test, y_test, cv=3)
confusion_matrix(y_test, y_test_pred)

array([[ 9,  1,  0],
       [ 0, 16,  1],
       [ 0,  0,  9]])

In [44]:
# Precision, recall, f1

print(precision_score(y_test, y_test_pred, average='weighted'))
print(recall_score(y_test, y_test_pred, average='weighted'))
print(f1_score(y_test, y_test_pred, average='weighted'))

0.9472222222222223
0.9444444444444444
0.9444444444444444


### Exercise:
Display the features that the random forest classifier found important (page 198 in the book).

In [48]:
for name, score in zip(X.columns, rnd_clf.feature_importances_):
  print(name, score)

Alcohol 0.10127100685964699
Malic acid 0.037531143520856916
Ash 0.015838205642826696
Alcalinity of ash 0.03655050978806521
Magnesium 0.023398203629595785
Total phenols 0.06567571546048477
Flavanoids 0.1790001991572172
Nonflavanoid phenols 0.015489066784891129
Proanthocyanins 0.014985454043488734
Color intensity 0.12563294528900623
Hue 0.061980129529799886
OD280/OD315 of diluted wines 0.1367229210881229
Proline 0.18592449920599777
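
The raw listing is easier to read once it is ranked. The sketch below sorts the importances in plain Python, using the values from the output above rounded to four decimals. The dictionary is typed in by hand for illustration; in the notebook itself the scores would come from `rnd_clf.feature_importances_`.

```python
# Importances copied from the output above, rounded to four decimals
importances = {
    "Alcohol": 0.1013,
    "Malic acid": 0.0375,
    "Ash": 0.0158,
    "Alcalinity of ash": 0.0366,
    "Magnesium": 0.0234,
    "Total phenols": 0.0657,
    "Flavanoids": 0.1790,
    "Nonflavanoid phenols": 0.0155,
    "Proanthocyanins": 0.0150,
    "Color intensity": 0.1256,
    "Hue": 0.0620,
    "OD280/OD315 of diluted wines": 0.1367,
    "Proline": 0.1859,
}

# Sort from most to least important
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

# Print each feature with the running share of total importance
cumulative = 0.0
for name, score in ranked:
    cumulative += score
    print(f"{name:30s} {score:.4f}  cumulative {cumulative:.4f}")

top5_share = sum(score for _, score in ranked[:5])
```

The importances sum to 1, and the five highest-ranked features (Proline, Flavanoids, OD280/OD315 of diluted wines, Color intensity, and Alcohol) account for roughly 73% of the total.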
