# Problem 1 (30 points)

## Problem Description
In this problem you will train decision tree and random forest models using sklearn on a real-world dataset. The dataset is the *Cylinder Bands Data Set* from the UCI Machine Learning Repository: [https://archive.ics.uci.edu/ml/datasets/Cylinder+Bands](https://archive.ics.uci.edu/ml/datasets/Cylinder+Bands). The dataset is generated from rotogravure printers, with 39 unique features and a binary classification label for each sample. The class is either 0 for 'band' or 1 for 'no band', where banding is an undesirable process delay that arises during the rotogravure printing process. By training ML models on this dataset, you could help identify or predict cases where these process delays are avoidable, thereby improving the efficiency of the printing. For the sake of this exercise, we only consider features 21-39 in the above link, and have removed any samples with missing values in that range. No further processing of the data is required on your part. The data has been partitioned into a training and testing set using an 80/20 split. Your models will be trained on just the training set, and accuracy results will be reported on both the training and testing sets.

Fill out the notebook as instructed, making the requested plots and printing necessary values. 

*You are welcome to use any of the code provided in the lecture activities.*

#### Summary of deliverables:

- Accuracy function
- Report accuracy of the DT model on the training and testing set
- Report accuracy of the Random Forest model on the training and testing set

#### Imports and Utility Functions:

In [60]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

## Load the data

Use the `np.load()` function to load "w5-hw1-train.npy" (training data) and "w5-hw1-test.npy" (testing data). The first 19 columns of each are the features. The last column is the label.

In [61]:
# YOUR CODE GOES HERE
# load the data
training_data = np.load("data/w5-hw1-train.npy")
testing_data = np.load("data/w5-hw1-test.npy")

# split the data into features and labels
X_train, y_train = training_data[:, :19], training_data[:, -1]
X_test, y_test = testing_data[:, :19], testing_data[:, -1]

# print(X_train.shape)
# print(y_train.shape)
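
As a quick check of the slicing used above, here is a minimal sketch on a stand-in array. The names `demo`, `X_demo`, and `y_demo` are hypothetical, and the random values are not the homework data. It assumes each file has exactly 20 columns (19 features plus the label), as the text describes.

```python
import numpy as np

# Stand-in for a loaded array: 5 samples, 19 feature columns + 1 label column.
# These values are random placeholders, not the Cylinder Bands data.
rng = np.random.default_rng(0)
demo = rng.random((5, 20))

# Same slicing as in the cell above: first 19 columns are features, last is the label.
X_demo, y_demo = demo[:, :19], demo[:, -1]

print(X_demo.shape, y_demo.shape)  # (5, 19) (5,)
```

With 20 columns, `demo[:, :19]` and `demo[:, :-1]` select the same features; the second form would keep working if the number of feature columns changed.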

## Write an accuracy function

Write a function `accuracy(pred,label)` that takes in the model's predictions and returns the percentage of predictions that match the corresponding labels.

In [62]:
# YOUR CODE GOES HERE
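def accuracy(pred, label):
    """Return the percentage of entries in pred that equal the matching entry in label."""
    # count the matches, divide by the number of labels, and scale to a percentage
    percent_correct = np.sum(pred == label) / len(label) * 100
    return percent_correct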
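
A small sanity check can confirm the function behaves as intended before it is applied to a model. The sketch below repeats the same formula on hand-made arrays; `toy_pred`, `toy_label`, and `toy_acc` are hypothetical names used only for illustration.

```python
import numpy as np

def accuracy(pred, label):
    # Percentage of predictions that equal the corresponding labels.
    return np.sum(pred == label) / len(label) * 100

# Hand-made example: the prediction at index 2 is wrong, the other three are right.
toy_pred = np.array([0, 1, 1, 0])
toy_label = np.array([0, 1, 0, 0])

toy_acc = accuracy(toy_pred, toy_label)
print(toy_acc)  # 3 of 4 match -> 75.0
```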

## Train a decision tree model

Train a decision tree using `DecisionTreeClassifier()` with a `max_depth` of 10 and using a `random_state` of 0 to ensure repeatable results. Print the accuracy of the model on both the training and testing sets.

In [63]:
# YOUR CODE GOES HERE
# train decision tree classifier
dt = DecisionTreeClassifier(max_depth=10, random_state=0)

dt.fit(X_train, y_train)
dt_pred_train = dt.predict(X_train)
dt_pred_test = dt.predict(X_test)

dt_accuracy_training = accuracy(dt_pred_train, y_train)
dt_accuracy_testing = accuracy(dt_pred_test, y_test)

# print the accuracies
print("Training accuracy of decision tree model: ", dt_accuracy_training,"%")
print("Testing accuracy of decision tree model: ", dt_accuracy_testing,"%")

Training accuracy of decision tree model:  93.12714776632302 %
Testing accuracy of decision tree model:  65.75342465753424 %



## Train a random forest model

Train a random forest model using `RandomForestClassifier()` with a `max_depth` of 10, `n_estimators` of 100, and a `random_state` of 0 to ensure repeatable results. Print the accuracy of the model on both the training and testing sets.

In [64]:
# YOUR CODE GOES HERE
# train random forest model
rf = RandomForestClassifier(max_depth=10, n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
rf_pred_train = rf.predict(X_train)
rf_pred_test = rf.predict(X_test)
rf_accuracy_training = accuracy(rf_pred_train, y_train)
rf_accuracy_testing = accuracy(rf_pred_test, y_test)

# print the accuracies
print("Training accuracy of random forest model: ", rf_accuracy_training,"%")
print("Testing accuracy of random forest model: ", rf_accuracy_testing,"%")

Training accuracy of random forest model:  100.0 %
Testing accuracy of random forest model:  82.1917808219178 %
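
To illustrate how a forest combines its trees, the sketch below fits a forest with the same hyperparameters on synthetic data and checks that its class probabilities are the mean of the per-tree probabilities. The synthetic data from `make_classification` and the names `X_syn`, `y_syn`, `forest`, `tree_probs`, and `mean_probs` are assumptions for illustration only; this does not use the homework data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data with 19 features (placeholder, not the homework data).
X_syn, y_syn = make_classification(n_samples=200, n_features=19, random_state=0)

# Same hyperparameters as the forest trained above.
forest = RandomForestClassifier(max_depth=10, n_estimators=100, random_state=0)
forest.fit(X_syn, y_syn)

# Collect each tree's class probabilities, then average across the 100 trees.
tree_probs = np.stack([tree.predict_proba(X_syn) for tree in forest.estimators_])
mean_probs = tree_probs.mean(axis=0)

# The forest's own probabilities should agree with the manual average.
matches = np.allclose(mean_probs, forest.predict_proba(X_syn))
print(matches)
```

The predicted class is the one with the highest averaged probability, so no single tree's errors decide the outcome on their own.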


## Discuss the performance of the models

Compare the training and testing accuracy of the two models, and explain why the random forest model is advantageous compared to a standard decision tree model.


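On the training set, the random forest (100%) scores higher than the decision tree (93.13%). On the testing set, the random forest (82.19%) also scores higher than the decision tree (65.75%). Both models fit the training data better than the testing data, so both overfit to some degree, but the decision tree's accuracy drops much further on unseen data. The random forest is advantageous because it averages the predictions of many trees, each trained on a bootstrap sample of the data with a random subset of features considered at each split. This averaging reduces the variance of a single deep tree, so the model overfits less and generalizes better to unseen data.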
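
The size of the overfitting can be quantified as the gap between training and testing accuracy. The sketch below computes it from the accuracies printed earlier in this notebook; the names `reported` and `gaps` are hypothetical.

```python
# Accuracies copied from the printed outputs above: (training %, testing %).
reported = {
    "decision tree": (93.12714776632302, 65.75342465753424),
    "random forest": (100.0, 82.1917808219178),
}

# Train-test gap in percentage points; a larger gap suggests more overfitting.
gaps = {}
for name, (train_acc, test_acc) in reported.items():
    gaps[name] = train_acc - test_acc
    print(f"{name}: train {train_acc:.2f}%, test {test_acc:.2f}%, gap {gaps[name]:.2f} points")
```

The decision tree's gap is about 27.37 points and the random forest's is about 17.81 points, so the forest loses less accuracy on unseen data even though it fits the training set perfectly.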