
In [0]:
%load_ext rpy2.ipython

# Introduction

This notebook summarizes chapter 10 of [Machine Learning with R](https://www.amazon.de/Machine-Learning-techniques-predictive-modeling-ebook/dp/B07PYXX3H5) and describes its code samples. The chapter shows how machine learning algorithms can be evaluated. More precisely, it explains why measures other than predictive accuracy are needed to assess performance. It also presents approaches for ensuring that the performance measures reasonably reflect a model's ability to predict or forecast unseen cases.

Simply dividing the number of correct predictions by the total number of predictions can give a misleading picture of a classifier's performance. This happens especially in datasets with a large class imbalance and is referred to as the **class imbalance problem**. For instance, if a positive event occurs very often (say, in 99% of the cases), a model that always predicts the positive class achieves an accuracy of 99%. However, it is useless for identifying the negative cases.
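The 99% example can be reproduced with a small sketch. The data below are hypothetical (990 positive and 10 negative cases, chosen only to match the 99% figure), and the variable names are illustrative assumptions rather than code from the book.

```r
# Hypothetical, strongly imbalanced data: 990 positive and 10 negative cases
actual <- factor(c(rep("positive", 990), rep("negative", 10)),
                 levels = c("negative", "positive"))

# A trivial "model" that always predicts the positive class
predicted <- factor(rep("positive", length(actual)),
                    levels = levels(actual))

# Accuracy: share of correct predictions among all predictions
accuracy <- mean(predicted == actual)

# Share of the negative cases that the model actually identifies
negatives_found <- mean(predicted[actual == "negative"] == "negative")

print(accuracy)         # 0.99
print(negatives_found)  # 0
```

The accuracy looks excellent, although not a single negative case is detected.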

Two types of data are obviously needed for evaluation: the **actual class values** and the **predicted class values**. The majority of models can deliver a third important type of information: the **estimated probability of the prediction**, in other words the model's confidence in a particular decision. So, when two models make the same number of mistakes, the one that assesses its own uncertainty more accurately can be considered the smarter one.

First, the predicted probabilities are obtained from the SMS model developed in chapter 4 of the book. The code of chapter 4 therefore needs to be executed before this section. In this example, the `predict()` function returns the probability of each possible outcome. Each row of the resulting output sums to 1 because the outcomes are mutually exclusive and exhaustive. In other words, an SMS is either "ham" or "spam"; it cannot be both at the same time, and it cannot be something else or something in between.

In the next step, the results are combined into a data frame. If the CSV file sms_results.csv is already available, this step is not needed, because the results can be read directly from that file.
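As a minimal sketch of these two points (rows summing to 1, and combining results into a data frame), the following uses a small hand-made probability matrix instead of the real model output. All names ending in `_demo` and the three example rows are assumptions for illustration. Only the column names of the resulting data frame follow sms_results.

```r
# Hypothetical class probabilities for three messages (columns: ham, spam)
prob_demo <- matrix(c(0.99984, 0.00016,
                      0.52464, 0.47536,
                      0.00000, 1.00000),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(NULL, c("ham", "spam")))

# Mutually exclusive and exhaustive outcomes: every row sums to 1
row_totals <- rowSums(prob_demo)

# Hypothetical actual labels, and predictions taken as the more probable class
actual_demo <- factor(c("ham", "spam", "spam"), levels = c("ham", "spam"))
predicted_demo <- factor(colnames(prob_demo)[max.col(prob_demo)],
                         levels = c("ham", "spam"))

# Combine everything into one data frame with the same columns as sms_results
results_demo <- data.frame(actual_type  = actual_demo,
                           predict_type = predicted_demo,
                           prob_spam    = round(prob_demo[, "spam"], 5),
                           prob_ham     = round(prob_demo[, "ham"], 5))
print(results_demo)
```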

In [0]:
%%R
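# load() restores objects under their saved names and returns only those names, so do not assign it (assumes the file holds sms_classifier)
load("sms_classifier.RData")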


In [0]:
%%R
# obtain the predicted probabilities
sms_test_prob <- predict(sms_classifier, sms_test, type = "raw")
head(sms_test_prob)

In [0]:
%%R
## Confusion matrices in R ----
sms_results <- read.csv("sms_results.csv")

This step takes a first look at sms_results. The left-hand columns show the actual type and the predicted type; the right-hand columns show the model's estimated probability that the message is spam or ham. As can be observed, the model was extremely certain about its decisions in these cases.

In [39]:
%%R
# the first several test cases
head(sms_results)

  actual_type predict_type prob_spam prob_ham
1         ham          ham   0.00000  1.00000
2         ham          ham   0.00000  1.00000
3         ham          ham   0.00016  0.99984
4         ham          ham   0.00004  0.99996
5        spam         spam   1.00000  0.00000
6         ham          ham   0.00020  0.99980


In the previous cases the model was very confident, but there are also cases in which it was far less certain about its decision.

In [40]:
%%R
head(subset(sms_results, prob_spam > 0.40 & prob_spam < 0.60))

     actual_type predict_type prob_spam prob_ham
377         spam          ham   0.47536  0.52464
717          ham         spam   0.56188  0.43812
1311         ham         spam   0.57917  0.42083


In [42]:
%%R
head(subset(sms_results, actual_type != predict_type))

    actual_type predict_type prob_spam prob_ham
53         spam          ham   0.00071  0.99929
59         spam          ham   0.00156  0.99844
73         spam          ham   0.01708  0.98292
76         spam          ham   0.00851  0.99149
184        spam          ham   0.01243  0.98757
332        spam          ham   0.00003  0.99997


The output above lists cases in which the prediction was wrong even though the model was highly confident. Especially in these cases, the question arises whether the model is useful.
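To quantify how confident the model was in these wrong predictions, the six misclassified rows printed above can be re-entered by hand. This is only a sketch: the `confidence` column and the data frame name `misclassified` are illustrative assumptions, not part of sms_results.

```r
# The six misclassified test cases, copied from the output above
misclassified <- data.frame(
  actual_type  = rep("spam", 6),
  predict_type = rep("ham", 6),
  prob_spam    = c(0.00071, 0.00156, 0.01708, 0.00851, 0.01243, 0.00003),
  prob_ham     = c(0.99929, 0.99844, 0.98292, 0.99149, 0.98757, 0.99997)
)

# Confidence = probability the model assigned to the class it predicted
misclassified$confidence <- ifelse(misclassified$predict_type == "spam",
                                   misclassified$prob_spam,
                                   misclassified$prob_ham)

# Even the least confident of these wrong predictions is above 98%
lowest_confidence <- min(misclassified$confidence)
print(lowest_confidence)  # 0.98292
```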

The next passage introduces **confusion matrices**. A confusion matrix has two dimensions: here, the rows display the actual values and the columns display the predicted values. In the diagonal cells the model predicted the actual value correctly; in the off-diagonal cells the predictions were incorrect.

In [45]:
%%R
table(sms_results$actual_type, sms_results$predict_type)

      
        ham spam
  ham  1203    4
  spam   31  152


Confusion matrices bring a new terminology with them, for which the class of interest needs to be defined first. In the ham/spam example, "spam" is the **positive class**, since the spam filter is interested in finding spam messages. Consequently, the ham messages form the **negative class**. Positive and negative do not mean "good" and "bad"; the terms only serve to distinguish the classes.
With this terminology, the cells of the confusion matrix can be labeled as follows:
- predicted spam / actual spam = True positive (TP -> 152)
- predicted ham / actual ham = True negative (TN -> 1203)
- predicted spam / actual ham = False positive (FP -> 4)
- predicted ham / actual spam = False negative (FN -> 31)
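
The four counts can also be read directly from the confusion matrix by name. The sketch below re-enters the matrix by hand from the output above, so it does not depend on sms_results. The object name `conf_mat` is an assumption for illustration. Accuracy is computed as defined earlier, as correct predictions divided by all predictions.

```r
# Confusion matrix from the output above (rows = actual, columns = predicted)
conf_mat <- matrix(c(1203,   4,
                       31, 152),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual    = c("ham", "spam"),
                                   predicted = c("ham", "spam")))

# "spam" is the positive class, "ham" the negative class
TP <- conf_mat["spam", "spam"]  # actual spam, predicted spam
TN <- conf_mat["ham",  "ham"]   # actual ham,  predicted ham
FP <- conf_mat["ham",  "spam"]  # actual ham,  predicted spam
FN <- conf_mat["spam", "ham"]   # actual spam, predicted ham

# Correct predictions lie on the diagonal, errors off the diagonal
accuracy   <- (TP + TN) / sum(conf_mat)
error_rate <- (FP + FN) / sum(conf_mat)

print(c(TP = TP, TN = TN, FP = FP, FN = FN))
print(round(accuracy, 4))    # 0.9748
print(round(error_rate, 4))  # 0.0252
```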


