This is my exploration of Random Forest for Classification.

In this notebook I use the Wisconsin Breast Cancer dataset from UCI. More information about the dataset and about FNA scans can be found [here](http://pages.cs.wisc.edu/~olvi/uwmp/cancer.html). I also took some guidance from Raul's notebook on this topic, which can be found [here](https://www.kaggle.com/raviolli77/random-forest-in-python). I strongly recommend learning about the dataset and visiting Raul's notebook before exploring Random Forest on your own.

Additionally, reviewing Principal Component Analysis may be useful before viewing my work.

The purpose of this notebook is to classify cells as Malignant or Benign based on information recorded from an FNA scan.

I will be releasing another notebook soon which will compare Random Forest, Logistic Regression, Nearest-Neighbor, SVM, and Neural Networks for model selection.

For now, let's jump into this Random Forest classification problem.

In [None]:
# imports needed
%matplotlib inline

import time
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier 
from urllib.request import urlopen 

plt.style.use('ggplot')
pd.set_option('display.max_columns', 500) 

# Get Wisconsin Breast Cancer Data
breast_cancer = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

names = ['id', 'diagnosis', 'radius_mean', 
         'texture_mean', 'perimeter_mean', 'area_mean', 
         'smoothness_mean', 'compactness_mean', 
         'concavity_mean','concave_points_mean', 
         'symmetry_mean', 'fractal_dimension_mean',
         'radius_se', 'texture_se', 'perimeter_se', 
         'area_se', 'smoothness_se', 'compactness_se', 
         'concavity_se', 'concave_points_se', 
         'symmetry_se', 'fractal_dimension_se', 
         'radius_worst', 'texture_worst', 
         'perimeter_worst', 'area_worst', 
         'smoothness_worst', 'compactness_worst', 
         'concavity_worst', 'concave_points_worst', 
         'symmetry_worst', 'fractal_dimension_worst'] 

dx = ['Benign', 'Malignant']
breast_cancer.head()

In [None]:
# Setting 'id' as our index
breast_cancer.set_index(['id'], inplace = True) 

# Converted to binary to help later on with models and plots
breast_cancer['diagnosis'] = breast_cancer['diagnosis'].map({'M':1, 'B':0})

# check mapping worked 
mapping = breast_cancer[['diagnosis']]
mapping

In [None]:
# check the number of null values in each field
breast_cancer.apply(lambda x: x.isnull().sum())

# drop the field we don't need
breast_cancer = breast_cancer.drop(columns=['Unnamed: 32'])

# double check the field is gone
breast_cancer.apply(lambda x: x.isnull().sum())

In [None]:
# some information about the data, could also do this with breast_cancer.info()
print("Shape of our dataframe:\n", 
     breast_cancer.shape)
print("data types of our columns:\n",
     breast_cancer.dtypes)

In [None]:
# check whether the dataset suffers from imbalance between classes Benign = 0 and Malignant = 1
s = breast_cancer['diagnosis'].value_counts()

# for class 0
num_of_benign = s[0]
# for class 1
num_of_malignant = s[1]

total_cases = len(breast_cancer)

# calculate percentages of data that resides in both classes
percent_b = num_of_benign / total_cases
percent_m = num_of_malignant / total_cases

print("Distribution between Benign and Malignant\nPercent Benign: {0:.3f} \nPercent Malignant: {1:.3f} ".format(percent_b, percent_m))

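The classes are only mildly imbalanced (benign cases outnumber malignant ones, but neither class is rare), so we can proceed without resampling.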
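The balance check above can also be written with `value_counts(normalize=True)`. The sketch below is illustrative only: `toy_diagnosis`, `class_share`, and `imbalance_ratio` are hypothetical names, and the 357/212 counts are an assumption based on the class totals commonly reported for this dataset rather than output taken from this notebook.

```python
import pandas as pd

# Assumption: 357 benign (0) and 212 malignant (1) cases, the class totals
# commonly reported for this dataset; swap in your own labels as needed.
toy_diagnosis = pd.Series([0] * 357 + [1] * 212, name="diagnosis")

# Share of each class, ordered by class label (0 first, then 1)
class_share = toy_diagnosis.value_counts(normalize=True).sort_index()

# How many majority-class cases there are per minority-class case
imbalance_ratio = class_share.max() / class_share.min()

print(class_share.round(3))
print("Majority-to-minority ratio: {0:.2f}".format(imbalance_ratio))
```

A ratio well below 2 is usually treated as mild, which is why no resampling or class weighting is applied here.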

In [None]:
# create dataset
feature_space = breast_cancer.iloc[:, breast_cancer.columns != 'diagnosis']
feature_class = breast_cancer.iloc[:, breast_cancer.columns == 'diagnosis']

# train_test_split
training_set, test_set, class_set, test_class_set = train_test_split(feature_space,
                                                                    feature_class,
                                                                    test_size = 0.20, 
                                                                    random_state = 42)
# Cleaning test sets to avoid future warning messages
class_set = class_set.values.ravel() 
test_class_set = test_class_set.values.ravel()

In [None]:
# instantiate classifier 
rf_classifier = RandomForestClassifier(random_state=42, n_estimators=10)

In [None]:
np.random.seed(42)
start = time.time()

# give to GridSearchCV
param_dist = {'max_depth': [2, 3, 4],
              'bootstrap': [True, False],
              # 'auto' (an alias of 'sqrt' for classifiers) was removed in scikit-learn 1.3
              'max_features': ['sqrt', 'log2', None],
              'criterion': ['gini', 'entropy']}

# set up the GridSearch
cv_rf = GridSearchCV(rf_classifier, cv = 5,
                     param_grid=param_dist, 
                     n_jobs = 3)

# fit the GridSearch
cv_rf.fit(training_set, class_set)
print('Best Parameters using grid search: \n', cv_rf.best_params_)


end = time.time()
print('Time taken in grid search: {0: .2f}'.format(end - start))

In [None]:
# Set best parameters given by grid search 
rf_classifier.set_params(criterion = 'gini',
                  max_features = 'log2', 
                  max_depth = 3, 
                  )

In [None]:
# warm_start = True reuses the solution of the previous call to fit
# and adds more estimators to the ensemble; otherwise, a whole new forest is fit.

rf_classifier.set_params(warm_start=True, 
                  oob_score=True)

# adapted from the scikit-learn example https://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html
min_estimators = 15
max_estimators = 500

error_rate = {}

for i in range(min_estimators, max_estimators + 1):
    rf_classifier.set_params(n_estimators=i)
    rf_classifier.fit(training_set, class_set)

    oob_error = 1 - rf_classifier.oob_score_
    error_rate[i] = oob_error

In [None]:
# Convert dictionary to a pandas series for easy plotting 
oob_series = pd.Series(error_rate)

In [None]:
# Plotting the OOB_scores line graph: oob_error vs. n_estimators
fig, ax = plt.subplots(figsize=(10, 10))

ax.set_facecolor('#fafafa')

oob_series.plot(kind='line',
                color = 'blue')
plt.axhline(0.055, 
            color='#875FDB',
           linestyle='--')
plt.axhline(0.05, 
            color='#875FDB',
           linestyle='--')
plt.xlabel('n_estimators')
plt.ylabel('OOB Error Rate')
plt.title('OOB Error Rate Across various Forest sizes \n(From 15 to 500 trees)')

I ended up testing and choosing values between 400 and 430 for the number of estimators. This is where the error fluctuates the least, near 0.0500.
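The idea of picking a forest size from the OOB error can be sketched on synthetic data. This is a minimal illustration, not the tuning run from this notebook: `X_toy`, `y_toy`, `forest`, `oob_errors`, and `best_n` are hypothetical names, and the data comes from `make_classification` rather than the breast cancer set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: a small synthetic binary problem stands in for the real data
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

# warm_start=True keeps the trees already grown and only adds the new ones;
# oob_score=True needs bootstrap=True so each tree has out-of-bag samples
forest = RandomForestClassifier(warm_start=True, oob_score=True,
                                bootstrap=True, random_state=0)

oob_errors = {}
for n_trees in range(25, 101, 25):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_toy, y_toy)
    oob_errors[n_trees] = 1 - forest.oob_score_

# Forest size with the lowest OOB error among the sizes tried
best_n = min(oob_errors, key=oob_errors.get)

for n_trees, err in oob_errors.items():
    print("{0:>3} trees: OOB error = {1:.4f}".format(n_trees, err))
print("Lowest OOB error at n_estimators =", best_n)
```

In practice the curve is noisy, so it is often safer to choose a size inside a stable plateau than the single lowest point.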

In [None]:
# set the number of estimators, then turn warm_start and oob_score off before the final fit
rf_classifier.set_params(n_estimators=420,
                  bootstrap = True,
                  warm_start=False, 
                  oob_score=False)

In [None]:
# fit the Random forest to the training data
rf_classifier.fit(training_set, class_set)

In [None]:
# returns a dict with value pairs {importance: indices} for printing
def variable_importance(fit):
    try:
        # Checks whether first parameter is a model
        if not hasattr(fit, 'fit'):
            return print("'{0}' is not an instantiated model from scikit-learn".format(fit)) 

        # Checks whether model has been trained
        if not vars(fit)["estimators_"]:
            return print("Model does not appear to be trained.")
    except KeyError:
        print("Model entered does not contain 'estimators_' attribute.")

    importances = fit.feature_importances_
    # sort from most important to least
    indices = np.argsort(importances)[::-1]
    return {'importance': importances,
            'index': indices}

In [None]:
# get variable importance and their indexes
rf_var_imp = variable_importance(rf_classifier)

rf_importances = rf_var_imp['importance']

rf_indices = rf_var_imp['index']

In [None]:
# unpacks and prints values in importance dict according to the index 
def print_var_importance(importance, indices, name_index):
    print("Feature ranking:")
    # iterate through the variable indices, from most to least important
    for rank, idx in enumerate(indices, start=1):
        # prints the name of the feature and its importance metric 
        print("{0}. The feature '{1}' has a Mean Decrease in Impurity of {2:.5f}"
              .format(rank, name_index[idx], importance[idx]))

In [None]:
# get the feature names used to train this model
names_index = names[2:]

# print out features by importance in descending order
print_var_importance(rf_importances, rf_indices, names_index)

In [None]:
# Make a horizontal bar chart to visualize feature importance

def variable_importance_plot(importance, indices, name_index):
    index = np.arange(len(name_index))

    # importances in ascending order, so the largest bar ends up on top
    importance_desc = sorted(importance)

    feature_space = []

    # walk the indices from least to most important to match the bar order
    for i in range(indices.shape[0] - 1, -1, -1):
        feature_space.append(name_index[indices[i]])

    fig, ax = plt.subplots(figsize=(10, 10))

    plt.title('Feature importances for Random Forest Model\nBreast Cancer (Diagnostic)')
    
    plt.barh(index,
              importance_desc,
              align="center",
              color = '#FFB6C1')
    plt.yticks(index,
                feature_space)

    plt.ylim(-1, 30)
    plt.xlim(0, max(importance_desc) + 0.01)
    plt.xlabel('Mean Decrease in Impurity')
    plt.ylabel('Feature')

    plt.show()
    plt.close()

In [None]:
variable_importance_plot(rf_importances, rf_indices, names_index)

In [None]:
# Perform cross-validation to see how robust our model is 
import time

def cross_val_metrics(fit, training_set, class_set, estimator, print_results = True):
    """
    Returns mean accuracy of the model over K-fold cross-validation,
    together with a spread estimate.
    ----------
    scores.mean(): Float representing cross validation score
    scores.std() / 2: Float representing half the standard deviation
                of the cross validation scores (used as a rough error bar)
    """
    start = time.time()
    my_estimators = {
    'rf': 'estimators_',
    'nn': 'out_activation_',
    'knn': '_fit_method'
    }
    try:
        # Checks whether first parameter is a model
        if not hasattr(fit, 'fit'):
            return print("'{0}' is not an instantiated model from scikit-learn".format(fit)) 

        # Checks whether the model has been trained
        if not vars(fit)[my_estimators[estimator]]:
            return print("Model does not appear to be trained.")

    except KeyError as e:
        print("'{0}' does not correspond with the appropriate key inside the estimators dictionary. \
              \nPlease refer to function to check `my_estimators` dictionary.".format(estimator))
        raise

    # create KFolds validation
    n = KFold(n_splits=10)

    # record score for each split
    scores = cross_val_score(fit, 
                         training_set, 
                         class_set, 
                         cv = n)
    end = time.time() 
    # print how much time the Kfolds took
    print("Time elapsed to do Cross Validation: {0:.2f} seconds.".format(end-start))
    if print_results:
        for i in range(0, len(scores)):
            # print out the scores for each validation split
            print("Cross validation run {0}: {1: 0.3f}".format(i, scores[i]))

        print("Accuracy: {0: 0.3f} (+/- {1: 0.3f})".format(scores.mean(), scores.std() / 2))     
    else:
        return scores.mean(), scores.std() / 2

In [None]:
# call cross_val_metrics to see how our model did
cross_val_metrics(rf_classifier, 
                  training_set, 
                  class_set, 
                  'rf',
                  print_results = True)

In [None]:
# make a prediction on test set now that the model has been tuned and validated
rf_predictions = rf_classifier.predict(test_set)

In [None]:

def create_conf_mat(test_class_set, predictions):
    """Function returns confusion matrix comparing two arrays"""
    if test_class_set.ndim != 1 or predictions.ndim != 1:
        return print('Arrays entered are not 1-D.\nPlease enter the correctly sized sets.')
    elif (test_class_set.shape != predictions.shape):
        return print('Number of values inside the Arrays are not equal to each other.\nPlease make sure the array has the same number of instances.')
    else:
        # Set Metrics; Compute a simple cross tabulation of two (or more) factors. 
        # By default computes a frequency table of the factors unless an array of values and an aggregation function are passed.
        test_crosstb_comp = pd.crosstab(index = test_class_set,
                                        columns = predictions)
        # Changed for Future deprecation of as_matrix
        test_crosstb = test_crosstb_comp.values
        return test_crosstb

In [None]:
conf_mat = create_conf_mat(test_class_set, rf_predictions)

# use seaborn heatmap
sns.heatmap(conf_mat, annot=True, fmt='d', cbar=False)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Actual vs. Predicted Confusion Matrix')
plt.show()

In [None]:
# use the built-in score function to get the accuracy of this model against the test set
rf_accuracy = rf_classifier.score(test_set, test_class_set)

print("Here is our mean accuracy on the test set:\n {0:.3f}"\
      .format(rf_accuracy))

In [None]:
# Here we calculate the test error rate!
rf_test_error_rate = 1 - rf_accuracy
print("The test error rate for our model is:\n {0: .4f}"\
      .format(rf_test_error_rate))

Next we set up the ROC (receiver operating characteristic) curve, which plots the true positive rate against the false positive rate across different classification thresholds.

An ideal model will have a false positive rate of 0 and a true positive rate of 1. Most of its curve will be in the top left corner of the graph.

On the other hand, a ROC curve that follows the 45-degree diagonal is indicative of a model that is essentially guessing at random. Most of the curve will be in the middle of the graph.
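To make the threshold sweep concrete, here is a small pure-Python sketch of how ROC points and the area under the curve can be computed by hand. It is only an illustration of the idea, not how `sklearn.metrics.roc_curve` is implemented; `roc_points`, `trapezoid_auc`, and the four toy samples are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    # Start with a threshold above every score: nothing is predicted positive
    fpr, tpr = [0.0], [0.0]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area


# Hypothetical toy data: two negatives (0) and two positives (1)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_fpr, toy_tpr = roc_points(toy_labels, toy_scores)
toy_auc = trapezoid_auc(toy_fpr, toy_tpr)

print(list(zip(toy_fpr, toy_tpr)))
# [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print("AUC:", toy_auc)  # AUC: 0.75
```

One negative sample scores higher than one positive sample, so the curve leaves the top left corner once and the area drops from 1.0 to 0.75.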

In [None]:
# predict_proba returns an array with one column per class: the predicted probability
# of the negative class (column 0) and of the positive class (column 1).

# I just want the positive-class probabilities in this instance
predictions_prob = rf_classifier.predict_proba(test_set)[:, 1]

# use roc_curve to compute the false positive and true positive rates
fpr2, tpr2, _ = roc_curve(test_class_set,
                          predictions_prob,
                          pos_label = 1)

auc_rf = auc(fpr2, tpr2)

print(auc_rf)

In [None]:
def plot_roc_curve(fpr, tpr, auc, estimator, xlim=None, ylim=None):
    """
    Purpose
    ----------
    Function creates ROC Curve for respective model given selected parameters.
    Optional x and y limits to zoom into graph

    Parameters
    ----------
    * fpr: Array returned from sklearn.metrics.roc_curve for increasing
            false positive rates
    * tpr: Array returned from sklearn.metrics.roc_curve for increasing
            true positive rates
    * auc: Float returned from sklearn.metrics.auc (Area under Curve)
    * estimator: String representation of the appropriate model; must be one of
    the following: ['knn', 'rf', 'nn']
    * xlim: Set upper and lower x-limits
    * ylim: Set upper and lower y-limits
    """
    my_estimators = {'knn': ['K-Nearest Neighbors', 'deeppink'],
              'rf': ['Random Forest', 'red'],
              'nn': ['Neural Network', 'purple']}

    try:
        plot_title = my_estimators[estimator][0]
        color_value = my_estimators[estimator][1]
    except KeyError as e:
        print("'{0}' does not correspond with the appropriate key inside the estimators dictionary. \
\nPlease refer to function to check `my_estimators` dictionary.".format(estimator))
        raise

    fig, ax = plt.subplots(figsize=(10, 10))
    ax.set_facecolor('#fafafa')

    plt.plot(fpr, tpr,
             color=color_value,
             linewidth=1)
    plt.title('ROC Curve For {0} (AUC = {1: 0.3f})'\
              .format(plot_title, auc))

    plt.plot([0, 1], [0, 1], 'k--', lw=2) # Add Diagonal line
    plt.plot([0, 0], [1, 0], 'k--', lw=2) # left edge (false positive rate = 0)
    plt.plot([1, 0], [1, 1], 'k--', lw=2) # top edge (true positive rate = 1)
    if xlim is not None:
        plt.xlim(*xlim)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.show()
    plt.close()

In [None]:
plot_roc_curve(fpr2, tpr2, auc_rf, 'rf',
               xlim=(-0.01, 1.05), 
               ylim=(0.001, 1.05))

In [None]:
# zoom in
plot_roc_curve(fpr2, tpr2, auc_rf, 'rf', 
               xlim=(-0.01, 0.2), 
               ylim=(0.85, 1.01))

In [None]:
def print_class_report(predictions, alg_name):
    # print some title
    print('Classification Report for {0}:'.format(alg_name))
    # print class report metrics for each target
    # (classification_report expects the true labels first, then the predictions)
    print(classification_report(test_class_set, 
            predictions, 
            target_names = dx))

In [None]:
# the function prints the report and returns nothing, so there is no value to keep
print_class_report(rf_predictions, 'Random Forest')

In [None]:
# Re-import modules
# the notebook isn't producing the correct results unless I redo this bit of work.

%matplotlib inline

import time
import random
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier 
from urllib.request import urlopen 

# read in data and look at it 
plt.style.use('ggplot')
pd.set_option('display.max_columns', 500) 

breast_cancer = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

names = ['id', 'diagnosis', 'radius_mean', 
         'texture_mean', 'perimeter_mean', 'area_mean', 
         'smoothness_mean', 'compactness_mean', 
         'concavity_mean','concave_points_mean', 
         'symmetry_mean', 'fractal_dimension_mean',
         'radius_se', 'texture_se', 'perimeter_se', 
         'area_se', 'smoothness_se', 'compactness_se', 
         'concavity_se', 'concave_points_se', 
         'symmetry_se', 'fractal_dimension_se', 
         'radius_worst', 'texture_worst', 
         'perimeter_worst', 'area_worst', 
         'smoothness_worst', 'compactness_worst', 
         'concavity_worst', 'concave_points_worst', 
         'symmetry_worst', 'fractal_dimension_worst'] 

dx = ['Benign', 'Malignant']
breast_cancer.head()

In [None]:
# Setting 'id' as our index
breast_cancer.set_index(['id'], inplace = True) 

# Converted to binary to help later on with models and plots
breast_cancer['diagnosis'] = breast_cancer['diagnosis'].map({'M':1, 'B':0})

# check mapping worked 
mapping = breast_cancer[['diagnosis']]
mapping

In [None]:
# check the number of null values in each field
breast_cancer.apply(lambda x: x.isnull().sum())

# drop the field we don't need
breast_cancer = breast_cancer.drop(columns=['Unnamed: 32'])

# double check the field is gone
breast_cancer.apply(lambda x: x.isnull().sum())

In [None]:
breast_cancer.info()

In [None]:
from sklearn.model_selection import train_test_split

feature_space = breast_cancer.iloc[:, breast_cancer.columns != 'diagnosis']
feature_class = breast_cancer.iloc[:, breast_cancer.columns == 'diagnosis']

X_train, X_test, y_train, y_test = train_test_split(feature_space, feature_class, test_size=0.20, random_state = 42, stratify=feature_class)

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler

# instantiate scaler
scaler = StandardScaler()

# scale the train and test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

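# flatten the targets to 1-D arrays to avoid shape warnings later on
y_train = np.array(y_train).ravel()
y_test = np.array(y_test).ravel()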

In [None]:
rf_pca_classifier = RandomForestClassifier(random_state=42, n_estimators=10)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.decomposition import PCA 

# instantiate PCA to test the number of components; I will analyze explained_variance_ratio_
component_test = PCA(n_components=30)
component_test.fit(X_train_scaled)

sb.set(style='whitegrid')
plt.xlabel('Number of components')
plt.ylabel('Cumulative Explained Variance')
plt.plot(np.cumsum(component_test.explained_variance_ratio_))
plt.axvline(linewidth=4, color='r', linestyle = '--', x=14, ymin=0, ymax=1)
plt.show()

I will use PCA to reduce our feature space to 14 components, which together explain over 95% of the variance.
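Rather than reading the cut-off from the plot, the smallest number of components that reaches a target share of variance can be computed directly. The sketch below is an assumption-laden illustration: it loads scikit-learn's bundled copy of the dataset via `load_breast_cancer` instead of the CSV used in this notebook, and `X_tr_scaled`, `cumulative`, and `n_components_95` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the data stands in for the CSV
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# Fit the scaler on the training split only, as in the notebook
X_tr_scaled = StandardScaler().fit_transform(X_tr)

# Cumulative explained variance over all 30 components
cumulative = np.cumsum(PCA().fit(X_tr_scaled).explained_variance_ratio_)

# First position where the cumulative share reaches 95%, converted to a count
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1

print("Components needed for 95% of the variance:", n_components_95)
print("Variance kept with 14 components: {0:.4f}".format(cumulative[13]))
```

`PCA` also accepts a float between 0 and 1 as `n_components` (with `svd_solver='full'`) to select the count for a target share of variance, which may be a convenient alternative to computing it by hand.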

In [None]:
# initialize n_components
pca = PCA(n_components=14)

# fit to our scaled training data
pca.fit(X_train_scaled)

# use PCA fit to transform X_train and X_test data 
X_train_scaled_pca = pca.transform(X_train_scaled)
X_test_scaled_pca = pca.transform(X_test_scaled)

In [None]:
# just so I can view how much variance each component carries
evr = component_test.explained_variance_ratio_
cvr = np.cumsum(component_test.explained_variance_ratio_)
pca_df = pd.DataFrame()
pca_df['Cumulative Variance Ratio'] = cvr
pca_df['Explained Variance Ratio'] = evr
pca_df

In [None]:
# get the indices for the pca
pca_dims = []
for x in range(0, len(pca_df)):
    pca_dims.append('PCA Component {}'.format(x))

pca_test_df = pd.DataFrame(component_test.components_, columns=names_index, index=pca_dims)
pca_test_df.head()

In [None]:
np.random.seed(42)
start = time.time()

# give to GridSearchCV
param_dist = {'max_depth': [2, 3, 9, 10, 12, 13, 15],
              'bootstrap': [True, False],
              # 'auto' (an alias of 'sqrt' for classifiers) was removed in scikit-learn 1.3
              'max_features': ['sqrt', 'log2', None],
              'criterion': ['gini', 'entropy'],
              'min_samples_split': [6,8,10,12,16,18,20],
              'min_samples_leaf':[3,4,5,6,7,9]}

# set up the GridSearch
cv_rf = GridSearchCV(rf_pca_classifier, cv = 5,
                     param_grid=param_dist, 
                     n_jobs = 5)

# fit the GridSearch
cv_rf.fit(X_train_scaled_pca, y_train)
print('Best Parameters using grid search: \n', cv_rf.best_params_)


end = time.time()
print('Time taken in grid search: {0: .2f}'.format(end - start))

In [None]:
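# I set bootstrap = True so I can use the OOB score
# ('sqrt' replaces 'auto', its former alias for classifiers, removed in scikit-learn 1.3)
rf_pca_classifier.set_params(bootstrap=True,criterion='gini', max_depth=9,
                             max_features='sqrt', min_samples_leaf=3, min_samples_split=6)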


In [None]:
rf_pca_classifier.set_params(warm_start=True, 
                  oob_score=True)

min_estimators = 15
max_estimators = 200

error_rate = {}

for i in range(min_estimators, max_estimators + 1):
    rf_pca_classifier.set_params(n_estimators=i)
    rf_pca_classifier.fit(X_train_scaled_pca, y_train)

    oob_error = 1 - rf_pca_classifier.oob_score_
    error_rate[i] = oob_error

In [None]:
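# Rebuild the series from the PCA model's error rates before plotting
oob_series = pd.Series(error_rate)

fig, ax = plt.subplots(figsize=(10, 10))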

ax.set_facecolor('#fafafa')

oob_series.plot(kind='line',
                color = 'blue')
plt.axhline(0.055, 
            color='#875FDB',
           linestyle='--')
plt.axhline(0.05, 
            color='#875FDB',
           linestyle='--')
plt.xlabel('n_estimators')
plt.ylabel('OOB Error Rate')
plt.title('OOB Error Rate Across various Forest sizes \n(From 15 to 200 trees)')

In [None]:
rf_pca_classifier.set_params(n_estimators=185,
                  bootstrap = False,
                  warm_start=False, 
                  oob_score=False)

In [None]:
rf_pca_classifier.fit(X_train_scaled_pca, y_train)

In [None]:
cross_val_metrics(rf_pca_classifier, 
                  X_train_scaled_pca, 
                  y_train, 
                  'rf',
                  print_results = True)

In [None]:
pca_rf_predictions = rf_pca_classifier.predict(X_test_scaled_pca)

In [None]:
# compare against y_test (the labels from this stratified split), flattened to 1-D
conf_mat = create_conf_mat(np.ravel(y_test), pca_rf_predictions)
sns.heatmap(conf_mat, annot=True, fmt='d', cbar=False)
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.title('Actual vs. Predicted Confusion Matrix')
plt.show()

In [None]:
accuracy_rf = rf_pca_classifier.score(X_test_scaled_pca, y_test)

print("Here is our mean accuracy on the test set:\n {0:.3f}"\
      .format(accuracy_rf))

In [None]:
# Here we calculate the test error rate!
test_error_rate_rf = 1 - accuracy_rf
print("The test error rate for our model is:\n {0: .4f}"\
      .format(test_error_rate_rf))

In [None]:
# predict_proba returns an array with one column per class: the predicted probability
# of the negative class (column 0) and of the positive class (column 1).
# I just want the positive-class probabilities in this instance

predictions_prob_pca_rf = rf_pca_classifier.predict_proba(X_test_scaled_pca)[:, 1]

# use roc_curve to compute the false positive and true positive rates
fals_pos_r2, true_pos_r2, _ = roc_curve(y_test,
                          predictions_prob_pca_rf,
                          pos_label = 1)

auc_pca_rf = auc(fals_pos_r2, true_pos_r2)

print(auc_pca_rf)

In [None]:
plot_roc_curve(fals_pos_r2, true_pos_r2, auc_pca_rf, 'rf',
               xlim=(-0.01, 1.05), 
               ylim=(0.001, 1.05))

In [None]:
plot_roc_curve(fals_pos_r2, true_pos_r2, auc_pca_rf, 'rf',
               xlim=(-0.01, 0.2), 
               ylim=(0.85, 1.01))

In [None]:
# classification_report expects the true labels first, then the predictions
print('Classification Report for PCA reduced Random Forest\n', classification_report(y_test, pca_rf_predictions, target_names = dx))

Conclusion

More work is to be done...

As the results of these two models show, Random Forest classifies well when given enough information. The use of PCA benefits the overall model by capturing most of the variance of the 30 original features and consolidating it into 14 components. This reduction allowed the model to make faster predictions without losing precision or accuracy compared to the stand-alone Random Forest model.
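The scale, PCA, and Random Forest steps can also be chained in a single scikit-learn `Pipeline`, so that the scaler and PCA are refit inside every cross-validation fold. This is a sketch under stated assumptions, not the model tuned above: it uses scikit-learn's bundled copy of the data (where the target coding is 0 = malignant, 1 = benign, the reverse of this notebook), default forest settings with 100 trees, and the hypothetical names `pca_rf_pipeline` and `pipeline_scores`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of the data stands in for the CSV
X, y = load_breast_cancer(return_X_y=True)

# Each step is fit on the training folds only, then applied to the held-out fold
pca_rf_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=14)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

pipeline_scores = cross_val_score(pca_rf_pipeline, X, y, cv=5)

print("Fold accuracies:", pipeline_scores.round(3))
print("Mean accuracy: {0:.3f}".format(pipeline_scores.mean()))
```

Keeping the preprocessing inside the pipeline avoids leaking information from the validation folds into the scaler and the PCA fit.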

A great deal of analysis could still be done on the feature space that I did not do. The deepest analysis in this notebook was looking at the importance of each feature; I then went on to explore PCA rather than study each feature in depth. I will most likely explore this another time.