## Homework 3 - Part 1
## Decision Tree and Random Forest

In this homework, you will perform classification on the provided datasets using Decision Tree and Random Forest algorithms. 

The first dataset you will be working with contains 2 features. The second dataset contains 50 features. Both of them have a target label which can be 0 or 1.

You will go step by step with the first dataset. <br>
1 - Use a Decision Tree Classifier and observe the model performance.<br>
2 - Use a Random Forest Classifier and observe the model performance.<br>
3 - Use Grid Search to choose the optimal values for hyperparameters and observe the performance of the best model.


For the second dataset, you are required to generate an optimized Random Forest model using what you have learned in the steps mentioned above.

Dataset 1:
train_2features.csv and test_2features.csv are the training set and testing set respectively.


Dataset 2:
train_50features.csv and test_50features.csv are the training set and testing set respectively.


To get reproducible results, keep `random_state` fixed to the given value in every algorithm.
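As a rough illustration of why a fixed seed matters, the sketch below draws bootstrap samples (the resampling step a random forest uses by default to build each tree) with Python's standard `random` module. The helper `bootstrap_indices` is hypothetical and is not part of scikit-learn; it only shows that the same seed reproduces the same draw.

```python
import random

def bootstrap_indices(n_rows, seed):
    """Draw n_rows row indices with replacement from a seeded generator."""
    rng = random.Random(seed)
    return [rng.randrange(n_rows) for _ in range(n_rows)]

first = bootstrap_indices(10, seed=26)
second = bootstrap_indices(10, seed=26)
other = bootstrap_indices(10, seed=27)

print(first == second)  # same seed, so the two samples are identical
print(first == other)   # a different seed will almost always give a different sample
```

Passing `random_state=26` to a scikit-learn estimator plays the same role as the `seed` argument here.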


In [1]:
#Basic functions
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

#Models and metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

#Defined functions
from utils import * 

# to ignore warnings in sklearn
import warnings
warnings.filterwarnings("ignore")

# magic so that plots render inline without calling plt.show() every time
%matplotlib inline

plt.style.use("ggplot")

ModuleNotFoundError: No module named 'utils'

# Dataset 1

In [None]:
train_df = pd.read_csv("train_2features.csv")
f, ax = visualize_2d_data(train_df)
plt.title("Train Features")

In [None]:
train_df.head(2)

In [None]:
test_df = pd.read_csv("test_2features.csv")
f, ax = visualize_2d_data(test_df)
plt.title("Test Features")

In [None]:
test_df.head(2)

__Q. From the above visualizations, what can you tell about the need for a linear/non-linear model for classification?__


### Decision Tree
From sklearn.tree use DecisionTreeClassifier to build a classification model with default parameters.

In [None]:
# Separate the features (x) from the target label (y)
y_train = train_df["y"].values  # 1-D array, the label shape scikit-learn expects
x_train = train_df.drop("y", axis=1)
y_test = test_df["y"].values
x_test = test_df.drop("y", axis=1)

In [None]:
### Fit the classifier on the training data
model_tree = DecisionTreeClassifier(random_state=26)
model_tree.fit(x_train, y_train)

tree_pred_train = model_tree.predict(x_train)
tree_pred_proba_train = model_tree.predict_proba(x_train)[::,1]

__Q. Print accuracy, precision and recall for the predictions made on the training data.__
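For reference, these metrics can be computed by hand from the four confusion-matrix counts. The sketch below uses made-up toy labels, not the homework data, and is only meant to make the definitions concrete.

```python
# Toy labels, invented for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

`accuracy_score`, `precision_score`, `recall_score` and `f1_score` from `sklearn.metrics` return the same quantities for binary labels.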

In [None]:
### WRITE CODE HERE ###
print("Accuracy on train set: ",round(accuracy_score(y_train,tree_pred_train)*100,3),"%")
print("Precision on train set: ",round(precision_score(y_train,tree_pred_train)*100,3),"%")
print("Recall on train set: ",round(recall_score(y_train,tree_pred_train)*100,3),"%")
print("F1 Score on train set: ",round(f1_score(y_train,tree_pred_train)*100,3),"%")
print("ROC AUC Score on train set: ",round(roc_auc_score(y_train,tree_pred_proba_train),3))

In [None]:
print(classification_report(y_train, tree_pred_train))

In [None]:
### Make predictions on the testing data
### WRITE CODE HERE ###
tree_pred_test = model_tree.predict(x_test)
tree_pred_proba_test = model_tree.predict_proba(x_test)[::,1]

__Q. Print accuracy, precision and recall for the predictions made on the testing data.__

In [None]:
### WRITE CODE HERE ###
print("Accuracy on test set: ",round(accuracy_score(y_test,tree_pred_test)*100,3),"%")
print("Precision on test set: ",round(precision_score(y_test,tree_pred_test)*100,3),"%")
print("Recall on test set: ",round(recall_score(y_test,tree_pred_test)*100,3),"%")
print("F1 Score on test set: ",round(f1_score(y_test,tree_pred_test)*100,3),"%")
print("ROC AUC Score on test set: ",round(roc_auc_score(y_test,tree_pred_proba_test),3))

In [None]:
print(classification_report(y_test, tree_pred_test))

__Q. Plot ROC curve and obtain AUC for test predictions__
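One way to read the AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below checks this on toy scores. `auc_by_ranking` is a hypothetical helper written for illustration, and the labels and scores are invented.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))

# Toy data, invented for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(auc_by_ranking(y_true, scores))  # 3 of the 4 pairs are ordered correctly: 0.75
```

This pairwise form is quadratic in the number of samples, so it is useful for building intuition rather than as a replacement for `roc_auc_score`.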

In [None]:
# Plot the ROC curve by giving appropriate names for title and axes. 
### WRITE CODE HERE
fpr, tpr, _ = roc_curve(y_test,  tree_pred_proba_test)
auc = roc_auc_score(y_test, tree_pred_proba_test)

#Plotting
plt.plot(fpr,tpr,label="Class '1', auc="+str(round(auc,3)), color='gold',lw=.8)
plt.plot([0, 1], [0, 1], color='navy', lw=.5, linestyle='--', label='Random Model')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc=4)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic curve')

__Q. Based on the scores for training set and test set, explain the performance of the above model in terms of bias and variance.__ 

### Random Forest


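A single Decision Tree often generalizes worse than other methods because it has high variance: small changes in the training data can produce a very different tree. A Random Forest averages many trees trained on resampled data, which increases predictive power at the expense of interpretability.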
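The variance argument can be sketched with a small simulation. Here each "tree" is replaced by an independent noisy estimate of a true value of 0, which is a simplifying assumption: real trees in a forest are correlated, so the reduction in practice is smaller than this idealized case suggests.

```python
import random
import statistics

rng = random.Random(26)

def noisy_estimate():
    """One high-variance estimate of a true value of 0.0 (stand-in for a single tree)."""
    return rng.gauss(0.0, 1.0)

def averaged_estimate(n_members):
    """Average of n_members independent noisy estimates (stand-in for a forest)."""
    return statistics.fmean([noisy_estimate() for _ in range(n_members)])

single = [noisy_estimate() for _ in range(2000)]
ensemble = [averaged_estimate(25) for _ in range(2000)]

# The averaged estimates are spread far less widely around the true value.
print("variance of single estimates:  ", statistics.pvariance(single))
print("variance of averaged estimates:", statistics.pvariance(ensemble))
```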


From sklearn.ensemble use RandomForestClassifier to build a classification model with default parameters.

In [None]:
### Fit the classifier on the training data

model_rf = RandomForestClassifier(random_state=26)
model_rf.fit(x_train, y_train)

rf_pred_train = model_rf.predict(x_train)
rf_pred_proba_train = model_rf.predict_proba(x_train)[::,1]

__Q. Print accuracy, precision and recall for the predictions made on the training data.__

In [None]:
### WRITE CODE HERE ###
print("Accuracy on train set: ",round(accuracy_score(y_train,rf_pred_train)*100,3),"%")
print("Precision on train set: ",round(precision_score(y_train,rf_pred_train)*100,3),"%")
print("Recall on train set: ",round(recall_score(y_train,rf_pred_train)*100,3),"%")
print("F1 Score on train set: ",round(f1_score(y_train,rf_pred_train)*100,3),"%")
print("ROC AUC Score on train set: ",round(roc_auc_score(y_train,rf_pred_proba_train),3))
print("-"*80)
print(classification_report(y_train, rf_pred_train))

In [None]:
### Make predictions on the testing data
### WRITE CODE HERE ###
rf_pred_test = model_rf.predict(x_test)
rf_pred_proba_test = model_rf.predict_proba(x_test)[::,1]

__Q. Print accuracy, precision and recall for the predictions made on the testing data.__

In [None]:
### WRITE CODE HERE ###
print("Accuracy on test set: ",round(accuracy_score(y_test,rf_pred_test)*100,3),"%")
print("Precision on test set: ",round(precision_score(y_test,rf_pred_test)*100,3),"%")
print("Recall on test set: ",round(recall_score(y_test,rf_pred_test)*100,3),"%")
print("F1 Score on test set: ",round(f1_score(y_test,rf_pred_test)*100,3),"%")
print("ROC AUC Score on test set: ",round(roc_auc_score(y_test,rf_pred_proba_test),3))
print("-"*80)
print(classification_report(y_test, rf_pred_test))

__Q. Plot ROC curve and obtain AUC for the test predictions__

In [None]:
# Plot the ROC curve by giving appropriate names for title and axes. 
### WRITE CODE HERE
fpr, tpr, _ = roc_curve(y_test,  rf_pred_proba_test)
auc = roc_auc_score(y_test, rf_pred_proba_test)

#Plotting
plt.plot(fpr,tpr,label="Class '1', auc="+str(round(auc,3)), color='gold',lw=.8)
plt.plot([0, 1], [0, 1], color='navy', lw=.5, linestyle='--', label='Random Model')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc=4)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic curve')

__Q. Based on the scores for training set and test set, explain the performance of the above model in terms of bias and variance. Is the Random Forest model better or worse than the Decision Tree model? Explain why you think the performance may have improved or deteriorated.__ 

## Hyperparameters

"Model tuning" refers to model adjustments to better fit the data. This is separate from "fitting" or "training" the model. The fitting/training procedure is governed by the amount and quality of your training data, as the fitting algorithm is unique to each classifier (e.g. logistic regression or random forest). 

However, some aspects of a model are specified by the user. For example, a random forest is an ensemble of decision trees, so you have to choose the number of underlying trees. With too few trees the predictions are noisy and unstable; adding more trees generally does not cause overfitting, but it increases training time with diminishing returns. Settings that control the complexity of each tree, such as the maximum depth, do trade underfitting against overfitting. Parameters such as these are referred to as "hyperparameters" or "free parameters", as their values are determined by the user and not by the algorithm.

A straightforward way to optimize hyperparameters is Grid Search, which evaluates every combination of the candidate values you list for each parameter.
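Because every combination is tried, the cost of a grid search grows multiplicatively with the number of values per parameter. The sketch below counts the combinations for a grid with the same shape as the one used in this notebook. The `n_estimators` values are written out in plain Python rather than with `np.linspace`, and `cv_folds` is an illustrative name for the `cv` argument.

```python
from itertools import product

param_grid = {
    "max_depth": list(range(1, 7)),
    "max_features": [1, 2],
    "min_samples_leaf": [2, 3, 6],
    "min_samples_split": list(range(6, 13, 2)),
    # 8 evenly spaced values from 300 to 1100, truncated to integers
    "n_estimators": [int(300 + i * 800 / 7) for i in range(8)],
}
cv_folds = 3

# Every combination of one value per parameter
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv_folds  # each combination is fitted once per fold

print(len(combinations), "combinations,", n_fits, "model fits")
```

With the default `refit=True`, `GridSearchCV` also fits the best combination once more on the full training set, so it is worth checking this count before launching a long search.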


In [None]:
# In the dictionary below, fill in the list of values that you want to try out for each parameter.
# Refer to the descriptions in sklearn's docs to understand what each parameter controls.

param_grid = {
    'max_depth': list(range(1, 7)),  # a depth of 1 is a single split (a decision stump)
    'max_features': [1, 2],  # the dataset has only 2 features, so these are the only options
    'min_samples_leaf': [2, 3, 6],
    'min_samples_split': list(range(6, 13, 2)),
    'n_estimators': [int(x) for x in np.linspace(start=300, stop=1100, num=8)]  # more trees take longer to fit
}

In [None]:
rf = RandomForestClassifier(random_state=26)

In [None]:
grid_search_rf = GridSearchCV(estimator = rf, scoring='f1', param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)
grid_search_rf.fit(x_train,y_train)

__Q. Display the parameters of the best model.__

In [None]:
grid_search_rf.best_params_

In [None]:
grid_search_rf.best_score_

In [None]:
### Using the best model, do the following:
### Make predictions on the training set and display accuracy, precision and recall.
### Make predictions on the testing set and display accuracy, precision and recall. Plot ROC curve and print AUC.

### WRITE CODE HERE ###
# max_features='auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
model_rf_best_params = RandomForestClassifier(max_depth=4, max_features='sqrt', min_samples_leaf=2, min_samples_split=10,
                                              n_estimators=528, random_state=26)
model_rf_best_params.fit(x_train, y_train)

best_rf_pred_train = model_rf_best_params.predict(x_train)
best_rf_pred_proba_train = model_rf_best_params.predict_proba(x_train)[::,1]

best_rf_pred_test = model_rf_best_params.predict(x_test)
best_rf_pred_proba_test = model_rf_best_params.predict_proba(x_test)[::,1]

In [None]:
### WRITE CODE HERE ###
print("Accuracy on train set: ",round(accuracy_score(y_train,best_rf_pred_train)*100,3),"%")
print("Precision on train set: ",round(precision_score(y_train,best_rf_pred_train)*100,3),"%")
print("Recall on train set: ",round(recall_score(y_train,best_rf_pred_train)*100,3),"%")
print("F1 Score on train set: ",round(f1_score(y_train,best_rf_pred_train)*100,3),"%")
print("ROC AUC Score on train set: ",round(roc_auc_score(y_train,best_rf_pred_proba_train),3))
print("-"*80)
print(classification_report(y_train, best_rf_pred_train))

In [None]:
### WRITE CODE HERE ###
print("Accuracy on test set: ",round(accuracy_score(y_test,best_rf_pred_test)*100,3),"%")
print("Precision on test set: ",round(precision_score(y_test,best_rf_pred_test)*100,3),"%")
print("Recall on test set: ",round(recall_score(y_test,best_rf_pred_test)*100,3),"%")
print("F1 Score on test set: ",round(f1_score(y_test,best_rf_pred_test)*100,3),"%")
print("ROC AUC Score on test set: ",round(roc_auc_score(y_test,best_rf_pred_proba_test),3))
print("-"*80)
print(classification_report(y_test, best_rf_pred_test))

In [None]:
fpr, tpr, _ = roc_curve(y_test,  best_rf_pred_proba_test)
auc = roc_auc_score(y_test, best_rf_pred_proba_test)

#Plotting
plt.plot(fpr,tpr,label="Class '1', auc="+str(round(auc,3)), color='gold',lw=.8)
plt.plot([0, 1], [0, 1], color='navy', lw=.5, linestyle='--', label='Random Model')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.legend(loc=4)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic curve')

__Q. How did performing Grid Search impact the performance of the model? Were you able to optimize the hyperparameters?__

# Dataset 2

Following the procedure above, optimize a Random Forest classifier for a dataset with 50 features. The training data include labels, but the testing data do not, so you must use the training data alone to optimize generalization performance on the test data. Submit a CSV file with your predictions. It should contain a single column named "y".
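Since the test labels are unavailable, one option is to hold back part of the training data as a validation set and score candidate models on it. The sketch below only produces the row indices for such a split. `holdout_split` is a hypothetical helper, and the 20% validation fraction is an arbitrary choice for illustration.

```python
import random

def holdout_split(n_rows, validation_fraction=0.2, seed=26):
    """Shuffle row indices with a fixed seed and split them into train/validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_rows * validation_fraction))
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = holdout_split(10)

print("train rows:     ", sorted(train_idx))
print("validation rows:", sorted(val_idx))
```

scikit-learn's `train_test_split` performs the same kind of split directly on arrays, and the cross-validation inside `GridSearchCV` repeats the idea over several folds.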


In [None]:
train_df = pd.read_csv("train_50features.csv")
print(train_df.head(2))

test_data = pd.read_csv("test_50features.csv")
print(test_data.head(2))

In [None]:
x_train_50 = train_df.drop('y',axis=1)
y_train_50 = train_df["y"].values  # 1-D array, the label shape scikit-learn expects

In [None]:
##########################################
### Construct your final random forest model and optimize the hyperparameters using Grid Search ###
model_rf_50 = RandomForestClassifier(random_state=26)
model_rf_50.fit(x_train_50, y_train_50)

rf_50_pred = model_rf_50.predict(x_train_50)

print("Initial Accuracy on train set: ",round(accuracy_score(y_train_50,rf_50_pred)*100,3),"%")
print("Initial Precision on train set: ",round(precision_score(y_train_50,rf_50_pred)*100,3),"%")
print("Initial Recall on train set: ",round(recall_score(y_train_50,rf_50_pred)*100,3),"%")
print("Initial F1 Score on train set: ",round(f1_score(y_train_50,rf_50_pred)*100,3),"%")

print(classification_report(y_train_50, rf_50_pred))

In [None]:
param_grid_50 = {
    'max_depth': [i for i in range(3,7,1)],
    'max_features': ['sqrt', 'log2'],  # 'auto' duplicated 'sqrt' for classifiers and was removed in scikit-learn 1.3
    'min_samples_leaf': [2,4,6],
    'min_samples_split': [z for z in range(8,13,2)],
    'n_estimators': [int(x) for x in np.linspace(start = 300, stop = 1100, num = 7)],
    'class_weight': ['balanced', None]
}

In [None]:
grid_search_rf_50 = GridSearchCV(estimator = RandomForestClassifier(random_state=26), scoring='f1', param_grid = param_grid_50, 
                          cv = 3, n_jobs = -1, verbose = 2)
grid_search_rf_50.fit(x_train_50, y_train_50)
grid_search_rf_50.best_params_

In [None]:
# max_features='auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
model_rf_50_best_params = RandomForestClassifier(max_depth=5, max_features='sqrt', min_samples_leaf=4, min_samples_split=10,
                                              n_estimators=433, class_weight=None, random_state=26)

model_rf_50_best_params.fit(x_train_50, y_train_50)

rf_50_pred_bp = model_rf_50_best_params.predict(x_train_50)

print("Tuned Accuracy on train set: ",round(accuracy_score(y_train_50, rf_50_pred_bp)*100,3),"%")
print("Tuned Precision on train set: ",round(precision_score(y_train_50, rf_50_pred_bp)*100,3),"%")
print("Tuned Recall on train set: ",round(recall_score(y_train_50, rf_50_pred_bp)*100,3),"%")
print("Tuned F1 Score on train set: ",round(f1_score(y_train_50, rf_50_pred_bp)*100,3),"%")

print(classification_report(y_train_50, rf_50_pred_bp))

In [None]:
rf_50_pred_bp_test = model_rf_50_best_params.predict(test_data)
np.savetxt("predictions.csv", rf_50_pred_bp_test, delimiter=",", header='y', fmt='%g', comments='')