# Final Model Evaluation Notebook

This notebook evaluates our best model for the Forest Cover Type Prediction competition. That model is an Extra Trees classifier, which outperformed our ensemble.

# Import Statements

First, we will import the necessary packages for data manipulation and analysis.

In [None]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt
import joblib
import sklearn 

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import confusion_matrix, classification_report
from keras.metrics import top_k_categorical_accuracy

In [None]:
# load data
train = pd.read_csv('../input/forest-cover-type-prediction/train.csv')
# view data
train.head()

# Preprocessing

Next, we will preprocess the data: remove the ID column, add the new features from our feature engineering and PCA work, scale the numerical features, and split the data into training and validation sets.

In [None]:
# remove ID column from set
train = train.iloc[:, 1:]
train.head()

In [None]:
# add new features from feature engineering
train['Elev_to_Horizontal_Hyd'] = train.Elevation - 0.2 * train.Horizontal_Distance_To_Hydrology 
train['Elev_to_Horizontal_Road'] = train.Elevation - 0.05 * train.Horizontal_Distance_To_Roadways  
train['Elev_to_Verticle_Hyd'] = train.Elevation - train.Vertical_Distance_To_Hydrology 
train['Mean_Horizontal_Dist'] = (train.Horizontal_Distance_To_Fire_Points + train.Horizontal_Distance_To_Hydrology + 
                                 train.Horizontal_Distance_To_Roadways)/3 
train['Mean_Fire_Hydro'] = (train.Horizontal_Distance_To_Fire_Points + train.Horizontal_Distance_To_Hydrology)/2

In [None]:
# move target to first column
first_column = train.pop('Cover_Type')
  
# insert column using insert(position,column_name,first_column) function
train.insert(0, 'Cover_Type', first_column)
  
# view
train.head()

In [None]:
# create cat, num, and y
X_cat = train.iloc[:, 11:55].values
B = train.iloc[:, 55:60]
A = train.iloc[:, 1:11]
X_num = pd.concat([A, B], axis = 1).values
y = train.iloc[:, 0].values

In [None]:
# scale/standardizing numerical columns
# scaler object
scaler = StandardScaler()
# fit to training data
scaler.fit(X_num)
# scale num columns
X_num = scaler.transform(X_num)

# shape
print(f'Categorical Shape: {X_cat.shape}')
print(f'Numerical Shape: {X_num.shape}')
print(f'Label Shape: {y.shape}')

In [None]:
# combine num and cat
X = np.hstack((X_num, X_cat))
print(X.shape)

In [None]:
# train/validate split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = .20, random_state = 1)
print(X_train.shape)
print(X_valid.shape)
print(y_train.shape)
print(y_valid.shape)
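
The cells above fit the scaler on all rows before the split, so the validation rows contribute to the scaling statistics. A minimal sketch of the alternative ordering (split first, then fit the scaler on the training rows only) is shown below on synthetic data. The array names and sizes are made up for illustration, and the sketch does not establish whether this change would affect the scores reported in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical features (assumption: 200 rows, 3 columns)
rng = np.random.default_rng(1)
X_num_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(1, 8, size=200)   # labels 1-7, like Cover_Type

# Split first, so the validation rows never influence the scaler
Xn_train, Xn_valid, yd_train, yd_valid = train_test_split(
    X_num_demo, y_demo, test_size=0.20, random_state=1)

demo_scaler = StandardScaler()
Xn_train_scaled = demo_scaler.fit_transform(Xn_train)  # fit on the training rows only
Xn_valid_scaled = demo_scaler.transform(Xn_valid)      # reuse the training mean and std

print(Xn_train_scaled.mean(axis=0).round(4))   # close to 0 by construction
print(Xn_valid_scaled.mean(axis=0).round(4))   # near 0, but not exactly 0
```

With this ordering the validation columns are only approximately centered, which is expected because their statistics were not used to fit the scaler.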

# Load Model

Below, we will load in the best model from our Training Model Notebook and fit it to the training set. Normally, we would not refit the model, but we are interested in exploring the strength of the model by splitting the training set into a training and validation set, so we need to refit it to the training portion of the data.

In [None]:
# load final model
final_model = joblib.load('../input/dsci-598-fa21/Team_2/tree_model_final.joblib')

In [None]:
# fit model to training set 
final_model.fit(X_train, y_train)

# Accuracy

Because we do not have access to the test set labels for the Forest Cover Type competition, we will examine the accuracy of the model on a validation set.

In [None]:
# score for training set
train_acc = final_model.score(X_train, y_train)
# score for validation set
valid_acc = final_model.score(X_valid, y_valid)

print(f'Training Accuracy for Final Model: {round(train_acc, 4)}')
print(f'Validation Accuracy for Final Model: {round(valid_acc, 4)}')

The training accuracy is 1.0 and the validation accuracy is 0.8932, so the model is overfitting the training set.
The test accuracy is 0.7962, below the validation accuracy of 0.8932, so the model fits the validation set better than the test set.

In [None]:
# predictions
valid_pred = final_model.predict(X_valid)
# prob predictions
valid_proba = final_model.predict_proba(X_valid)

# Top K-Accuracy

We will now look at how often the actual label appears among the model's top 2 and top 3 predicted labels.

In [None]:
def top_k_accuracy(y_true, pred_prob, K):
    count = 0
    for i in range(len(y_true)):
        p = pred_prob[i, :]          # Get predictions for current observation
        rank = np.argsort(p) + 1     # Rank classes in increasing order; add 1 to get from 0-6 to 1-7 for label
        correct = y_true[i]          # Get correct class.
        if correct in rank[-K:]:     # See if correct class is in top k
            count += 1               # Increment count if so.

    return count / len(y_true)       # Return score

In [None]:
# calculate accuracy for top 2 predictions
top_2_accuracy = top_k_accuracy(y_valid, valid_proba, 2)
print(round(top_2_accuracy, 4))

In [None]:
# calculate accuracy for top 3 predictions
top_3_accuracy = top_k_accuracy(y_valid, valid_proba, 3)
print(round(top_3_accuracy, 4))

While the model predicts the correct label for 89.32% of the validation set, the correct label is among its top 2 predictions 97.98% of the time and among its top 3 predictions 99.54% of the time. If a correct label within the top 2 predictions were acceptable, this model would be very strong.
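
The top-k calculation can also be written without an explicit loop. The sketch below is one possible vectorized version, checked on a tiny hand-made example. The function name, the `classes` argument, and the demo arrays are assumptions for illustration. In the notebook, `classes` would be the label order that matches the columns of `valid_proba`, for example `final_model.classes_`.

```python
import numpy as np

def top_k_accuracy_vectorized(y_true, pred_prob, k, classes):
    """Share of rows whose true label is among the k highest-probability classes."""
    classes = np.asarray(classes)
    # Column indices of the k largest probabilities in each row
    top_k_idx = np.argsort(pred_prob, axis=1)[:, -k:]
    # Map column indices to class labels (same order as the probability columns)
    top_k_labels = classes[top_k_idx]
    # A row is a hit if its true label appears among its top-k labels
    hits = (top_k_labels == np.asarray(y_true)[:, None]).any(axis=1)
    return hits.mean()

# Tiny hand-made example with three classes labelled 1, 2, 3 (illustrative values)
demo_proba = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6],
                       [0.5, 0.4, 0.1],
                       [0.2, 0.3, 0.5]])
demo_true = np.array([1, 2, 3, 1])

top1 = top_k_accuracy_vectorized(demo_true, demo_proba, 1, classes=[1, 2, 3])
top2 = top_k_accuracy_vectorized(demo_true, demo_proba, 2, classes=[1, 2, 3])
print(top1, top2)
```

Passing the class labels explicitly avoids relying on the labels being exactly 1 through 7 in column order. Recent versions of scikit-learn also provide `sklearn.metrics.top_k_accuracy_score` for the same kind of metric.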

# Classification Report

One of the ways to assess the accuracy of our model is through examining precision and recall. The classification report below includes the precision and recall for each cover type. 

In [None]:
# classification report
c_report = classification_report(y_valid, valid_pred)
print(c_report)

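Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

Here TP, FP, and FN are the counts of true positives, false positives, and false negatives for a given cover type.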
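
As a worked illustration of the definitions above, the sketch below computes per-class precision, recall, and f1-score from a small confusion matrix. The matrix values are made up. It follows the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Made-up 3-class confusion matrix: rows are true classes, columns are predictions
cm_demo = np.array([[50,  5,  0],
                    [10, 30, 10],
                    [ 0,  5, 40]])

tp = np.diag(cm_demo)              # correct predictions per class
fp = cm_demo.sum(axis=0) - tp      # predicted as the class, but actually another class
fn = cm_demo.sum(axis=1) - tp      # actually the class, but predicted as another class

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

for i in range(len(tp)):
    print(f'class {i + 1}: precision={precision[i]:.4f} recall={recall[i]:.4f} f1={f1[i]:.4f}')
```

The second class in this toy matrix has the lowest recall because 20 of its 50 true rows are assigned to other classes.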

- The model precision was highest for Cover Type 4. 
- The model recall was highest for Cover Type 4 as well. 
- The model also yielded the highest f1-score for Cover Type 4. 

This makes sense, as the support for Cover Type 4 was among the two highest in the data set.


# Confusion Matrix

We will now examine the accuracy of the model via a confusion matrix. The confusion matrix displays the number of times the cover type was classified correctly and how many times it was misclassified as another cover type. 

In [None]:
# confusion matrix
cm = confusion_matrix(y_valid, valid_pred)
cm_df = pd.DataFrame(cm)
# label the columns (predicted) and rows (actual) with the cover types
cm_df.columns = [1, 2, 3, 4, 5, 6, 7]
cm_df.index = [1, 2, 3, 4, 5, 6, 7]
# display
cm_df
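
Raw counts can be hard to compare across classes, so one optional follow-up is to convert each row of the confusion matrix to shares and look for the largest off-diagonal entry. The sketch below does this on a made-up 3-class matrix. The names `cm_small` and `labels_small` are hypothetical and the values are not results from our model.

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predicted classes
cm_small = np.array([[50,  5,  0],
                     [12, 30,  8],
                     [ 0,  5, 40]])
labels_small = np.array([1, 2, 3])

# Share of each true class (row) assigned to each predicted class (column)
row_share = cm_small / cm_small.sum(axis=1, keepdims=True)

# Zero out the diagonal and locate the largest remaining share
off_diag = row_share.copy()
np.fill_diagonal(off_diag, 0.0)
true_idx, pred_idx = np.unravel_index(np.argmax(off_diag), off_diag.shape)

print(f'Class {labels_small[true_idx]} is most often mistaken for class '
      f'{labels_small[pred_idx]} ({off_diag[true_idx, pred_idx]:.1%} of its rows)')
```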

# Distribution of Class Probability Predictions

In this section, we provide some histograms displaying the distributions of probability estimates generated for each label.

In [None]:
# validation predictions data frame
df_prob = pd.DataFrame(valid_proba)
df_prob.columns = [1, 2, 3, 4, 5, 6, 7]
df_prob

In [None]:
# histogram chart for probabilities by cover type
plt.figure(figsize = [12,9])
plt.hist(df_prob)
plt.title('Histogram for Probabilities for Each Cover Type')
plt.ylabel('Count')
plt.xlabel('Probability')
plt.legend([1, 2, 3, 4, 5, 6, 7], title = 'Cover Type')
plt.show()

These visualizations show which labels our model is more likely to predict, and whether there are labels it tends to be more or less confident about.

In [None]:
df_prob.plot(kind = 'hist',
        alpha = 0.7,
        bins = 30,
        title = 'Histogram of Probabilities by Cover Type',
        rot = 45,
        grid = False,
        figsize = (12,8),
        fontsize = 15, 
        color = ['purple', 'orange', 'gold', 'pink', 'forestgreen', 'lightblue', 'navy'])
plt.xlabel('Probability')
plt.ylabel("Count");

In [None]:
maxValueIndex = df_prob.idxmax(axis=1)
maxValues = df_prob.max(axis=1)
pred_prob = pd.concat([maxValueIndex, maxValues], axis=1)
df_1 = pred_prob.loc[lambda x: x[0] == 1]
df_2 = pred_prob.loc[lambda x: x[0] == 2]
df_3 = pred_prob.loc[lambda x: x[0] == 3]
df_4 = pred_prob.loc[lambda x: x[0] == 4]
df_5 = pred_prob.loc[lambda x: x[0] == 5]
df_6 = pred_prob.loc[lambda x: x[0] == 6]
df_7 = pred_prob.loc[lambda x: x[0] == 7]

In [None]:
# histograms of probability estimates generated for each label
c_label = [1, 2, 3, 4, 5, 6, 7]
palette = ['purple', 'orange', 'gold', 'pink', 'forestgreen', 'lightblue', 'navy']
dflist = [df_1, df_2, df_3, df_4, df_5, df_6, df_7]
plt.figure(figsize = [18, 9])

# draw every histogram as a subplot of the single figure created above
for c in c_label:
    plt.subplot(2, 4, c)
    plt.hist(dflist[c-1][1], color = palette[c-1])
    plt.title(f'Probability Distribution for Cover Type {c_label[c-1]}')
    plt.ylabel('Count')
    plt.ylim(0, 300)
    plt.xlabel('Probability')
plt.tight_layout()
plt.show()

From these visualizations, we can see that our model is most confident about Cover Types 7, 4, and 5. This corresponds with the classification report data explored earlier. 
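
A numeric summary can complement the histograms. The sketch below groups the winning probability by predicted label and reports its mean and count. It uses made-up probabilities for three classes, and `proba_demo` is a hypothetical stand-in for `df_prob`.

```python
import pandas as pd

# Made-up probability rows for three classes labelled 1, 2, 3
proba_demo = pd.DataFrame(
    [[0.90, 0.05, 0.05],
     [0.60, 0.30, 0.10],
     [0.20, 0.70, 0.10],
     [0.10, 0.10, 0.80],
     [0.05, 0.05, 0.90]],
    columns=[1, 2, 3])

summary = pd.DataFrame({
    'predicted': proba_demo.idxmax(axis=1),   # label with the highest probability
    'confidence': proba_demo.max(axis=1),     # that highest probability
})

# Mean and count of the winning probability for each predicted label
per_class = summary.groupby('predicted')['confidence'].agg(['mean', 'count'])
print(per_class.sort_values('mean', ascending=False))
```

A single `groupby` like this could also replace the seven separate per-label data frames when only summary statistics are needed.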