### Random Forest Exercise

------------------

In [1]:
# import pandas
import pandas as pd

In [2]:
# list for column headers
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# load data
df = pd.read_csv("pima_indians_diabetes.data.csv", names=names)

Spend some time exploring the dataset:
- head
- shape

In [3]:
df.head()

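|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |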


In [5]:
df.shape

(768, 9)
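
Beyond `head` and `shape`, two quick checks that are often useful here are the number of zeros per column and the class balance. The sketch below is only an illustration: it rebuilds the five rows shown in the `df.head()` output (the name `sample` is introduced here and is not part of the exercise), and it assumes that zeros in measurement columns such as `skin` or `test` may stand for missing values rather than real readings. On the full data you would run the same two lines on `df`.

```python
import pandas as pd

# The five rows shown in the df.head() output (not the full dataset).
sample = pd.DataFrame(
    [
        [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
        [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
        [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
        [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
        [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1],
    ],
    columns=['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'],
)

# Count zeros in each feature column. Assumption: a zero in a column such as
# 'skin' or 'test' may be a placeholder for a missing measurement.
zero_counts = (sample.drop('class', axis=1) == 0).sum()
print(zero_counts)

# Count how many rows belong to each class.
class_counts = sample['class'].value_counts()
print(class_counts)
```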

* create `X` and `y` (the goal is to predict the **class** column from the other variables)

In [6]:
X = df.drop('class', axis=1)
y = df['class']

* split the dataset into a training set and a test set

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

------------------------
#### Part 1: Setting up the Random Forest Classifier
* import RandomForestClassifier from sklearn. It is worth spending some time on the documentation of this classifier to get familiar with the available parameters.

In [9]:
from sklearn.ensemble import RandomForestClassifier

* create model

In [10]:
# Create a RandomForestClassifier
clf = RandomForestClassifier(random_state=42)

* fit the model to the training set with default parameters

In [11]:
# Train the model using the training sets
clf.fit(X_train, y_train)

* predict X_test

In [12]:
# Predict the response for test dataset
y_pred = clf.predict(X_test)

* import roc_auc_score and confusion_matrix from sklearn

In [13]:
from sklearn.metrics import roc_auc_score, confusion_matrix

* print confusion matrix

In [14]:
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

In [15]:
# Print the confusion matrix
print('Confusion Matrix:')
print(cm)

Confusion Matrix:
[[121  30]
 [ 27  53]]


* print AUC

In [16]:
# Compute AUC from the predicted class labels (hard 0/1 predictions, not probabilities)
auc = roc_auc_score(y_test, y_pred)

In [17]:
# Print AUC
print('AUC: ' + str(auc))

AUC: 0.7319122516556292
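
`roc_auc_score` is usually given a continuous score, such as the predicted probability of the positive class, instead of the hard labels returned by `predict`. When it receives hard labels, the ROC curve has a single interior point and the result equals the balanced accuracy. The following is a minimal sketch of that difference. It uses synthetic data from `make_classification` as a stand-in, so its numbers are not comparable to the results in this notebook, and all names ending in `_demo` or starting with `Xd_`/`yd_` are introduced only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

demo_clf = RandomForestClassifier(random_state=42)
demo_clf.fit(Xd_train, yd_train)

# AUC from hard 0/1 predictions, as done in this notebook.
labels_demo = demo_clf.predict(Xd_test)
auc_from_labels = roc_auc_score(yd_test, labels_demo)

# AUC from the predicted probability of the positive class (column 1).
scores_demo = demo_clf.predict_proba(Xd_test)[:, 1]
auc_from_scores = roc_auc_score(yd_test, scores_demo)

# Balanced accuracy: the mean of the recall on each class.
bal_acc = balanced_accuracy_score(yd_test, labels_demo)

print('AUC from labels:       ', auc_from_labels)
print('Balanced accuracy:     ', bal_acc)
print('AUC from probabilities:', auc_from_scores)
```

The first two printed values should agree, while the probability-based AUC also reflects how well the model ranks test samples across all thresholds.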


----------------------------------
#### Part 2: Using a Grid Search
- import GridSearchCV from sklearn

In [18]:
from sklearn.model_selection import GridSearchCV

* create the parameter grid (tune the number of trees and the maximum depth of each tree)

In [19]:
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [2, 4, 6, 8, 10]
}

* fit training data with grid search

In [20]:
# Create a RandomForestClassifier
clf = RandomForestClassifier(random_state=42)

In [21]:
# Create a GridSearchCV object (5-fold CV; with no `scoring` argument, candidates are ranked by accuracy)
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5)

In [22]:
# Fit the GridSearchCV object to the data
grid_search.fit(X_train, y_train)
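
After fitting, `best_params_` and `best_score_` show which combination was selected and its mean cross-validated score. With no `scoring` argument, `GridSearchCV` ranks candidates by the estimator's own `score` method, which is accuracy for `RandomForestClassifier`; passing `scoring='roc_auc'` makes the search optimize AUC instead. The sketch below illustrates this on synthetic data with a deliberately small grid so that it runs quickly. The data, the grid values, and the names `small_grid` and `demo_search` are assumptions made for this example, not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: NOT the diabetes dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# A reduced, hypothetical grid: 2 x 2 = 4 candidates, each fitted on 5 folds.
small_grid = {
    'n_estimators': [25, 50],
    'max_depth': [2, 4],
}

# scoring='roc_auc' ranks candidates by cross-validated AUC instead of accuracy.
demo_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=5,
    scoring='roc_auc',
)
demo_search.fit(X_demo, y_demo)

print('Best parameters:', demo_search.best_params_)
print('Best mean CV AUC:', demo_search.best_score_)
```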

* print confusion matrix with the best model

In [23]:
# Predict the response for test dataset using the best model
y_pred_best = grid_search.best_estimator_.predict(X_test)

In [24]:
# Compute confusion matrix
cm_best = confusion_matrix(y_test, y_pred_best)

In [25]:
# Print the confusion matrix
print('Confusion Matrix for Best Model:')
print(cm_best)

Confusion Matrix for Best Model:
[[122  29]
 [ 29  51]]


* print AUC with the best model

In [26]:
# Compute AUC for the best model from its predicted class labels (hard 0/1 predictions)
auc_best = roc_auc_score(y_test, y_pred_best)

In [27]:
# Print AUC
print('AUC for Best Model: ' + str(auc_best))

AUC for Best Model: 0.7227235099337749


- is the tuned model better than the default?

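The AUC (area under the ROC curve) is a metric for evaluating a binary classification model. It ranges from 0 to 1: a model that ranks every positive sample above every negative one has an AUC of 1, a model that ranks them in exactly the wrong order has an AUC of 0, and random guessing gives about 0.5. Note that both AUC values in this notebook were computed from hard class predictions (`predict`) rather than predicted probabilities, so each one equals the average of the true positive rate and the true negative rate (the balanced accuracy).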

In the case above, the AUC of the default model is approximately 0.732 and the AUC of the best model found by GridSearchCV is approximately 0.723. 

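Since the AUC of the default model is higher, the default model performs slightly better on this test set than the model selected by GridSearchCV. This can happen because GridSearchCV picks the parameter combination with the best cross-validated score on the training data (accuracy here, since no `scoring` argument was passed), which does not guarantee the best AUC on a held-out test set. The grid also caps `max_depth` at 10, whereas the default model leaves the depth unlimited, so the default configuration is not among the candidates.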

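However, the difference between the two AUC values is small (about 0.009). The confusion matrices differ by only a few of the 231 test samples: the tuned model has one more true negative (122 vs. 121) and two fewer true positives (51 vs. 53). A gap of this size is likely within the variation expected from a single train/test split, so the two models perform similarly. Other factors, such as model complexity and the computational resources required, also matter when choosing between models.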
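
To see how small the gap is in terms of individual predictions, the metrics can be recomputed directly from the two confusion matrices printed earlier. The sketch below copies those matrices by hand and assumes scikit-learn's layout (rows are true classes, columns are predicted classes, so the entries are `[[TN, FP], [FN, TP]]`). The helper `summarize` is a hypothetical name introduced for this example.

```python
import numpy as np

# Confusion matrices copied from the outputs above.
cm_default = np.array([[121, 30], [27, 53]])
cm_tuned = np.array([[122, 29], [29, 51]])


def summarize(cm):
    """Return accuracy, TPR, TNR, and their mean for a 2x2 confusion matrix."""
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tp + tn) / cm.sum()
    tpr = tp / (tp + fn)  # recall on the positive class
    tnr = tn / (tn + fp)  # recall on the negative class
    return {
        'accuracy': accuracy,
        'tpr': tpr,
        'tnr': tnr,
        'balanced_accuracy': (tpr + tnr) / 2,
    }


default_summary = summarize(cm_default)
tuned_summary = summarize(cm_tuned)

print('Default:', default_summary)
print('Tuned:  ', tuned_summary)
```

The `balanced_accuracy` entries reproduce the two reported AUC values (about 0.732 and 0.723), and the accuracies differ by a single test sample (174 vs. 173 correct out of 231).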