In [None]:
from libs import *
from sklearn.metrics import confusion_matrix, accuracy_score

plt.style.use(['seaborn-whitegrid', 'seaborn-poster'])

# Model Evaluation
In the previous notebook:
- We built an ML model to predict Reddit comment categories
- We performed a simple evaluation using the accuracy score metric

In this notebook we will look at the types of errors that our model makes in a little more detail.

First, we load the best random forest model trained in the previous notebook, together with its vectorizer pipeline and the test data:

In [None]:
vectorizer_pipe = joblib.load('trained_models/vectorizer_pipe.pkl')
rf_clf = joblib.load('trained_models/random_forest_classifier.pkl')

test_df = pd.read_csv('datasets/test_data_text.csv')
TARGET = 'label'
y_test = test_df[TARGET]
# `columns=` already identifies the axis, so the extra `axis=1` is not needed
X_test = vectorizer_pipe.transform(test_df.drop(columns=[TARGET]))

Next, we use the model to predict the labels for the test dataset. We also retrieve the probability that the model assigned to each predicted class.

In [None]:
y_test_pred = rf_clf.predict(X_test)
y_test_pred_proba_arr = rf_clf.predict_proba(X_test)
y_test_pred_proba = np.amax(y_test_pred_proba_arr, axis=1)
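
To make the relationship between `predict` and `predict_proba` concrete, here is a minimal sketch on a made-up probability array. The class names and numbers are illustrative assumptions, not output from our model. For a scikit-learn random forest, the predicted label is the class with the highest averaged probability, so taking `np.amax` along `axis=1` gives the probability of the predicted class.

```python
import numpy as np

# Hypothetical class names and probabilities, for illustration only.
classes = np.array(['bug', 'other', 'recorder'])
proba = np.array([
    [0.70, 0.20, 0.10],   # sample 0: most mass on 'bug'
    [0.25, 0.30, 0.45],   # sample 1: most mass on 'recorder'
])

pred_idx = np.argmax(proba, axis=1)   # column index of the most probable class
pred_labels = classes[pred_idx]       # map column indices back to class names
pred_conf = np.amax(proba, axis=1)    # probability assigned to the predicted class

print(pred_labels)
print(pred_conf)
```

Each row of the probability array sums to 1, and the columns follow the order of `rf_clf.classes_`.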

You will recall that the model accuracy was around 65%.

In [None]:
acc = accuracy_score(y_test, y_test_pred)
print(f'Model accuracy is {acc:.1%}')

The confusion matrix below provides an intuitive way to look at the performance of a model on a multi-class classification problem (one with more than two possible labels). The rows of the matrix represent the actual labels for our test data. The columns represent the labels predicted by our model.

If our model has predicted the label correctly for a particular post, then we will increment one of the cells on the diagonal of the matrix. Otherwise a non-diagonal cell will be incremented. For a good classifier we hope to see high numbers on the diagonal and low numbers off the diagonal.

The confusion matrix also helps us to understand the types of errors that our model makes. For example, we can see that the model misclassifies quite a lot of the posts that are labelled as `other`. Similarly, we can see that the `bug` class is frequently misclassified as a `recorder` or `screener` issue.

In [None]:
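# Pass `labels` explicitly so the row/column order matches rf_clf.classes_,
# even if a class happens to be missing from the test data
cf = confusion_matrix(y_test, y_test_pred, labels=rf_clf.classes_)
df_cf = pd.DataFrame(cf, columns=rf_clf.classes_, index=rf_clf.classes_)
fig, ax = plt.subplots(figsize=(9, 8))
# fmt='d' prints the counts as plain integers rather than in scientific notation
sns.heatmap(df_cf, ax=ax, annot=True, fmt='d', cmap='Blues')
ax.set_xlabel('Predicted Label')
_ = ax.set_ylabel('True Label')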
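
Raw counts can be hard to compare when the classes have different sizes. As a rough sketch, the diagonal of a confusion matrix can be divided by the row totals to get per-class recall, or by the column totals to get per-class precision. The matrix below is made up for illustration and does not reflect our model's results.

```python
import numpy as np

# Made-up 3-class confusion matrix (rows = true labels, columns = predicted labels).
labels = ['bug', 'other', 'recorder']
cm = np.array([
    [30,  5, 15],
    [10, 20, 10],
    [ 2,  3, 45],
])

correct = np.diag(cm)            # correctly classified examples per class
row_totals = cm.sum(axis=1)      # number of true examples per class
col_totals = cm.sum(axis=0)      # number of predictions per class

recall = correct / row_totals        # share of each true class that was found
precision = correct / col_totals     # share of each predicted class that was right
accuracy = correct.sum() / cm.sum()  # overall fraction of correct predictions

for name, r, p in zip(labels, recall, precision):
    print(f'{name:>10}: recall={r:.2f} precision={p:.2f}')
print(f'accuracy={accuracy:.3f}')
```

Note that the precision calculation divides by zero for any class the model never predicts, so real data may need a guard for that case.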

Finally, we can investigate the probabilities that the model returned for each prediction. For a random forest these are class probabilities averaged over the trees, so they give a rough measure of the confidence with which the model predicted a particular class.

We can, for example, look at the posts that the model labelled wrongly but with high confidence. The following are the 10 most confident incorrect predictions.

In [None]:
test_df['pred_label'] = y_test_pred
test_df['pred_proba'] = y_test_pred_proba

incorrect_df = test_df[test_df['label'] != test_df['pred_label']].sort_values('pred_proba', ascending=False)

for _, row in incorrect_df.head(10).iterrows():
    print('-' * 80)
    print('Actual Label: {0} | Predicted Label: {1}'.format(row['label'], row['pred_label']))
    print('Predicted Probability: {0:.2%}'.format(row['pred_proba']))
    print()
    print(row['text'])
    print()
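
Beyond reading individual posts, it can help to summarise how confident the model is when it is right versus when it is wrong. The sketch below uses made-up labels and confidences, and the `THRESHOLD` value is an arbitrary assumption chosen for illustration.

```python
import numpy as np

# Made-up labels and confidences, for illustration only.
y_true = np.array(['bug', 'other', 'bug', 'recorder', 'other', 'bug'])
y_pred = np.array(['bug', 'bug', 'recorder', 'recorder', 'other', 'bug'])
conf = np.array([0.90, 0.85, 0.40, 0.75, 0.55, 0.60])

is_correct = y_true == y_pred

# Average confidence for correct and for incorrect predictions
mean_conf_correct = conf[is_correct].mean()
mean_conf_incorrect = conf[~is_correct].mean()

# Count errors made with confidence at or above a hypothetical threshold
THRESHOLD = 0.8
n_confident_errors = int(np.sum(~is_correct & (conf >= THRESHOLD)))

print(f'Mean confidence (correct):   {mean_conf_correct:.3f}')
print(f'Mean confidence (incorrect): {mean_conf_incorrect:.3f}')
print(f'Errors with confidence >= {THRESHOLD}: {n_confident_errors}')
```

If the two averages are close, the predicted probability may not be a reliable signal of whether a given prediction is correct.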