# NAIVE BAYES WITH SMS SPAM COLLECTION DATASET: A TEXT MINING CASE

- This exercise is adapted from [Chapter 4 of "Machine Learning with R" by Brett Lantz](https://books.google.com.tr/books?id=ZaJNCgAAQBAJ&printsec=frontcover&hl=tr&source=gbs_ge_summary_r&cad=0#v=onepage&q&f=false)

- To develop the Naive Bayes classifier, we will use data adapted from the SMS Spam
Collection at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/.

- This dataset includes the text of SMS messages along with a label indicating
whether the message is unwanted. Junk messages are labeled spam, while
legitimate messages are labeled ham. 

If you want to continue from a previously saved session state:

In [None]:
sessionfile <- "02_naive_bayes_01.RData"

if(file.exists(sessionfile)) load(sessionfile)

Load the necessary libraries:

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better work flow and more tools to wrangle and visualize the data
library(tm) # for text mining
library(SnowballC) # for word stemming
library(gridExtra) # for multiple plots
library(wordcloud) # visualize text data
library(RColorBrewer) # for beautifying visualizations with custom colors
library(e1071) # for naive bayes
library(gmodels) # model evaluation
library(knitr) # for better table printing
library(kableExtra) # for better table printing
library(scales) # for formatting numbers
library(magrittr) # pipe operator and helpers such as set_colnames
library(purrr) # functional programming tools such as set_names
library(IRdisplay) # printing html tables from kable
options(warn = -1) # for suppressing warnings

# Data preparation

First let's read the data into a data.table object:

In [None]:
sms_raw <- fread("../data/csv/02_01_sms_spam.csv")

Review the data:

In [None]:
head(sms_raw)

Let's sample 10 ham and 10 spam entries:

In [None]:
set.seed(2018)
sample1 <- data.table(ham = sms_raw[type == "ham"][sample(.N, 10), text],
           spam = sms_raw[type == "spam"][sample(.N, 10), text])



In [None]:
sample1

Notice that words such as "free" and "urgent" appear in the spam sample but not in the ham sample.

View the structure of the object:

In [None]:
str(sms_raw)

It is better that we convert "type" from character to factor: 

In [None]:
sms_raw[,type := as.factor(type)]

In [None]:
str(sms_raw)

## Clean and Standardize Text Data

SMS messages are strings of text composed of words, spaces, numbers, and
punctuation.

Handling this type of complex data takes a lot of thought and
effort.

One needs to consider how to remove numbers and punctuation; handle
uninteresting words such as and, but, and or; and how to break apart sentences into
individual words.

The first step in processing text data involves creating a corpus, which is a collection of text documents.

The documents can be short or long, from individual news articles, pages in a book or on the web, or entire books.

In our case, the corpus will be a collection of SMS messages.

In [None]:
# read text with VectorSource and create corpus with VCorpus

sms_corpus <- sms_raw[,tm::VectorSource(text)] %>% tm::VCorpus()

The corpus holds one document for each of the messages:

In [None]:
sms_corpus

We can get a summary of specific messages with tm::inspect() function

In [None]:
tm::inspect(sms_corpus[1:2])

To get the actual message, we should convert a list item to character:

In [None]:
as.character(sms_corpus[[1]])

For viewing multiple messages, we'll use sapply:

In [None]:
sapply(sms_corpus[1:2], as.character)

In order to perform our analysis, we need to divide these messages into individual words.

But first, we need to clean the text, in order to standardize the words, by removing punctuation and other characters that clutter the result.

For example, we would like the strings Hello!, HELLO, and hello to be counted as instances of the same word.

The tm_map() function provides a method to apply a transformation (also known as mapping) to a tm corpus.

**tm_map to a corpus object is what "lapply" to an ordinary list object is: It applies the same function to all of its items and returns a corpus object**
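As a rough base-R analogy (it does not use tm at all), the sketch below applies one cleaning function to every element of a plain list with lapply(). The `messages` list and the `normalize` helper are made up for illustration.

```r
# Hypothetical messages; not taken from the SMS dataset
messages <- list("Hello!", "HELLO", "hello")

# Illustrative helper: lowercase first, then drop punctuation
normalize <- function(x) gsub("[[:punct:]]+", "", tolower(x))

# lapply() applies the same function to every item and returns a list,
# much as tm_map() applies a transformation to every document in a corpus
cleaned <- lapply(messages, normalize)

# All three variants collapse to a single word
unique(unlist(cleaned))
```

After normalization the three spellings are counted as the same word, which is exactly the goal of the cleaning steps that follow.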

### Case lowering

Our first order of business will be to standardize the messages to use only lowercase
characters. 

In [None]:
sms_corpus_clean <- tm::tm_map(sms_corpus, tm::content_transformer(tolower))

Let's compare a message before and after transformation:

In [None]:
as.character(sms_corpus[[1]])
as.character(sms_corpus_clean[[1]])

### Cleaning numbers

Let's remove numbers from messages:

In [None]:
sms_corpus_clean <- tm::tm_map(sms_corpus_clean, removeNumbers)

And the result:

In [None]:
sapply(sms_corpus[4:5], as.character)
sapply(sms_corpus_clean[4:5], as.character)

### Clean stop words

Our next task is to remove filler words such as to, and, but, and or from our SMS
messages.

These terms are known as stop words and are typically removed prior to
text mining.

This is due to the fact that although they appear very frequently, they do
not provide much useful information for machine learning.

In [None]:
tm::stopwords()

In [None]:
sms_corpus_clean <- tm::tm_map(sms_corpus_clean, tm::removeWords, tm::stopwords())
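To see what stop-word removal does in isolation, here is a base-R sketch on a single made-up message. The short `stop_words` vector is an assumption for illustration; tm::stopwords() returns a much longer list.

```r
# Hypothetical, tiny stop-word list (tm::stopwords() is far longer)
stop_words <- c("to", "and", "but", "or")

# Made-up, already lowercased message
msg <- "call now to claim your prize or reply stop"

# Split on spaces and keep only the tokens that are not stop words
tokens <- strsplit(msg, " ", fixed = TRUE)[[1]]
kept <- tokens[!tokens %in% stop_words]

paste(kept, collapse = " ")
```

Only the filler words disappear; the words that may carry a spam signal ("claim", "prize") are kept.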

### Remove punctuation

We can also eliminate any punctuation from
the text messages using the built-in removePunctuation() transformation:

In [None]:
sms_corpus_clean <- tm::tm_map(sms_corpus_clean, tm::removePunctuation)

We could also write a custom function to replace punctuation with whitespaces instead of removing them and then apply with tm_map:

In [None]:
replacePunctuation <- function(x)
{
    gsub("[[:punct:]]+", " ", x)
}

In [None]:
removePunctuation("Hello...World")
replacePunctuation("Hello...World")
replacePunctuation("Hello... World")

### Word stemming

Another common standardization for text data involves reducing words to their root form in a process called stemming.

The stemming process takes words like learned, learning, and learns, and strips the suffix in order to transform them into the base
form, learn. 

This allows machine learning algorithms to treat the related terms as a single concept rather than attempting to learn a pattern for each variant.

Let's see an example on how it works:

In [None]:
SnowballC::wordStem(c("learn", "learned", "learning", "learns"))

We "apply" this function to a corpus through the tm function tm::stemDocument

In [None]:
sms_corpus_clean <- tm::tm_map(sms_corpus_clean, tm::stemDocument)

And let's see the results:

In [None]:
set.seed(1500)
samplerows <- sample(1:length(sms_corpus), 10)
data.frame(type = sms_raw[samplerows, type],
           original = sapply(sms_corpus[samplerows], as.character),
          cleaned = sapply(sms_corpus_clean[samplerows], as.character))

### Strip whitespaces

Now we should strip additional whitespaces

In [None]:
tm::stripWhitespace("a       a")

In [None]:
sms_corpus_clean <- tm::tm_map(sms_corpus_clean, tm::stripWhitespace)

### Split documents into words 

Now that the data are processed to our liking, the final step is to split the messages
into individual components through a process called tokenization.

A token is a single element of a text string; in this case, the tokens are words.

We have two options for the object that holds the result:

- a data structure called a Document Term Matrix (DTM) in which rows indicate documents (SMS messages) and columns indicate terms (words).

- a data structure for a Term Document Matrix (TDM), which is simply a transposed DTM in which the rows are terms and the columns are documents.


Why the need for both?

Sometimes, it is more convenient to work with one or the other.

- For example, if the number of documents is small, while the word list is large, it may make sense to use a TDM because it is generally easier to display many rows than to display many columns.
- This said, the two are often interchangeable.
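The relationship between the two layouts can be sketched in base R with a toy corpus. The three documents below are made up, and table() stands in for what tm builds internally; this is an illustration, not tm's implementation.

```r
# Hypothetical, already cleaned toy documents
docs <- c(d1 = "free prize call", d2 = "call me later", d3 = "free free call")
tokens <- strsplit(docs, " ", fixed = TRUE)

# Long format: one row per (document, term) occurrence
long <- data.frame(doc = rep(names(tokens), lengths(tokens)),
                   term = unlist(tokens, use.names = FALSE))

# DTM: documents in rows, terms in columns, cells hold counts
dtm <- unclass(table(long$doc, long$term))

# TDM: simply the transpose, with terms in rows
tdm <- t(dtm)

dtm
dim(tdm)
```

The same counts appear in both objects; only the orientation changes.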

Let's create the DocumentTermMatrix:

In [None]:
sms_dtm <- tm::DocumentTermMatrix(sms_corpus_clean)

Let's see the structure and an excerpt of the matrix:

In [None]:
str(sms_dtm)

We should interpret the object above as follows:

- There are a total of 5559 documents and 6559 terms.
- Although there are about 3.6e7 possible document-term cells (5559 * 6559), each document contains only a handful of terms. The matrix stores only the 42147 nonzero cells (where a document has at least one instance of a term).
- "i" holds the document index of each of the 42K nonzero cells.
- "j" holds the term index of each of those cells.
- "v" holds the count of the term in that document.
- The first document has one instance each of terms 967, 2282, 2581, 2938 and 6210.
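The i/j/v layout can be illustrated with a tiny hand-made example in base R. The numbers below are hypothetical and unrelated to sms_dtm; the point is only how three parallel vectors describe a mostly empty matrix.

```r
# Hypothetical triplets for a 3-document, 4-term matrix
i <- c(1, 1, 2, 3, 3)   # document (row) index of each nonzero cell
j <- c(1, 3, 2, 1, 4)   # term (column) index of each nonzero cell
v <- c(1, 2, 1, 1, 3)   # count stored in that cell

# Rebuild the dense matrix: start from zeros and fill only the listed cells
dense <- matrix(0, nrow = 3, ncol = 4)
dense[cbind(i, j)] <- v
dense

# Term indices present in document 1 (same idea as filtering on sms_dtm$i)
terms_doc1 <- j[i == 1]
terms_doc1
```

Storing 5 triplets instead of 12 cells is a small saving here, but for a matrix with millions of cells and only tens of thousands of nonzero entries the difference is large.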

Let's check:

Let's play with the 1st document:

In [None]:
doci <- 1
doci

First let's view the first document: 

In [None]:
sapply(sms_corpus_clean[doci], as.character)

Term matches of the 1st document occur in the dtm at:

In [None]:
# get the indices in dtm where matches of 1st document occurs 
dtm_indices_1 <- which(sms_dtm$i == doci)
dtm_indices_1

Get the term indices of those matches:

In [None]:
term_indices <- sms_dtm$j[dtm_indices_1]
term_indices

See those terms:

In [None]:
sms_dtm$dimnames$Terms[term_indices]

These are simply the unique terms of the 1st document, in sorted order.

Now, v shows the count of occurrences of the term inside the doc. Let's see the possible values in this corpus:

In [None]:
unique_vs <- unique(sms_dtm$v)
unique_vs

So there is at least one instance in which a term appears 15 times in a doc. Let's get that:

First let's find the indices of the 15 occurrences:

In [None]:
index_at_max_v <- which(sms_dtm$v == max(unique_vs))
index_at_max_v

See which term it is:

In [None]:
term_ind_at_max_v <- sms_dtm$j[index_at_max_v]
term_ind_at_max_v

sms_dtm$dimnames$Terms[term_ind_at_max_v]

And see which doc it is:

In [None]:
doc_ind_at_max_v <- sms_dtm$i[index_at_max_v]
doc_ind_at_max_v

sapply(sms_corpus_clean[doc_ind_at_max_v], as.character)

We can also index the dtm as a matrix and view the contents with inspect

In [None]:
inspect(sms_dtm[doc_ind_at_max_v, term_ind_at_max_v])

Let's get 10 random matches (docs + terms) and subset the dtm for them:

In [None]:
set.seed(2000)
sample2 <- sample(length(sms_dtm[[1]]), 10)

docs <- sms_dtm[[1]][sample2]
terms <- sms_dtm[[2]][sample2]

sample_mat <- tm::inspect(sms_dtm[docs, terms])

However, it would be better to view them together with the doc contents:

In [None]:
data.table(docs = sapply(sms_corpus_clean[docs], as.character), sample_mat)

We just created an sms_dtm object that contains the tokenized corpus using the default settings, which apply minimal processing.

The default settings are appropriate because we have already prepared the corpus manually.

On the other hand, if we hadn't performed the preprocessing, we could do so
here by providing a list of control parameter options to override the defaults.

For example, to create a DTM directly from the raw, unprocessed SMS corpus, we can use the following command:

In [None]:
sms_dtm2 <- DocumentTermMatrix(sms_corpus,
    control = list(
        tolower = TRUE,
        removeNumbers = TRUE,
        stopwords = TRUE,
        removePunctuation = TRUE,
        stemming = TRUE
        )
    )

In [None]:
str(sms_dtm2)

This DTM may differ slightly from the previous one because the cleaning steps are applied in a different order.

In [None]:
sms_tdm <- tm::TermDocumentMatrix(sms_corpus_clean)

## Split dataset into train and test sets

With a .75/.25 split, we will have 4169 train and 1390 test observations:

Remember that sms_raw is a data.table, and in a data.table, .I is a placeholder for the row numbers 1:nrow(DT).

In [None]:
train_ind <- 1:4169
test_ind <- sms_raw[,.I[-train_ind]]

In [None]:
sms_dtm_train <- sms_dtm[train_ind,]
sms_dtm_test <- sms_dtm[test_ind,]

We should have a respective split of the type vector also:

In [None]:
sms_train_labels <- sms_raw[train_ind, type]
sms_test_labels <- sms_raw[test_ind, type]

To confirm that the subsets are representative of the complete set of SMS data, let's
compare the proportion of spam in the training and test data frames:

In [None]:
# geom_bar has no "height" argument, so it is dropped;
# after_stat() replaces the deprecated ..count.. notation
p1 <- ggplot2::ggplot(data.frame(labels = sms_train_labels)) +
    geom_bar(aes(x = labels, y = after_stat(count / sum(count)))) +
    ggtitle("Train Labels") +
    labs(x = "type", y = "proportion")

p2 <- ggplot2::ggplot(data.frame(labels = sms_test_labels)) +
    geom_bar(aes(x = labels, y = after_stat(count / sum(count)))) +
    ggtitle("Test Labels") +
    labs(x = "type", y = "proportion")

gridExtra::grid.arrange(p1, p2, ncol = 2)

The proportions are alike across sets.

We can also confirm this with prop tables:

Note the use of sapply on a newly created named list, together with an anonymous (unnamed, inline) function:

In [None]:
sapply(list(Train = sms_train_labels,
            Test = sms_test_labels),
      function(x) prop.table(table(x)))

## Visualize text data with word cloud

A word cloud is a way to visually depict the frequency at which words appear in text data.

The cloud is composed of words scattered somewhat randomly around the figure.

Words appearing more often in the text are shown in a larger font, while less common terms are shown in smaller fonts.

This type of figure has grown in popularity recently, since it provides a way to observe trending topics on social media websites.

We can create a word cloud directly from the corpus, for words that appear at least 50 times.

With random.order = F, more frequent words are placed closer to the center.

We beautify the cloud with RColorBrewer package:

In [None]:
wordcloud::wordcloud(sms_corpus_clean,
                        min.freq = 50,
                        random.order = F,
                        colors = RColorBrewer::brewer.pal(8, "Dark2"))

We can also create separate word clouds for hams and spams directly from the raw sms object.

wordcloud automatically does the necessary transformations

We do that in the concise "data.table" way:

In [None]:
# open new plot
plot.new()

# set parameters for layout, margins and main title relative text size
par(mfrow=c(1,2), mar = rep(1,4), cex.main = 1.5)

# inside data.table, for each type category (ham, spam), create a wordcloud and set a main title
# note the use of curly braces "{...}" for multiple statements
# "by" clause for aggregate/split operations
sms_raw[, { wordcloud::wordcloud(text,
         max.words = 40,
         #scale = c(1, 1),
         random.order = F,
         colors = RColorBrewer::brewer.pal(8, "Dark2"))
         title(main = type) }
, by = type]

You can see that the "spam" cloud includes some words like "free", "claim", "mobile", "prize" that do not appear in the "ham" cloud.

Among the words common to both, "call" is more frequent in the "spam" cloud, while "now", "get", "just", "you" and "will" are more frequent in the "ham" cloud.

"can" appears in the "ham" cloud but not in the "spam" cloud.

## Filtering for more frequent words and creating indicator features

The final step in the data preparation process is to transform the sparse matrix into a data structure that can be used to train a Naive Bayes classifier.

Currently, the sparse matrix includes over 6,500 features; this is a feature for every word that appears in at least one SMS message. It's unlikely that all of these are useful for classification.

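To reduce the number of features, we will eliminate any word that appears in fewer than five SMS messages, or in less than about 0.1 percent of the records in the training data. (Strictly speaking, tm::findFreqTerms() thresholds a term's total count across all documents, which is nearly the same thing for short messages.)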

In [None]:
sms_freq_words <- tm::findFreqTerms(sms_dtm_train, 5)

In [None]:
str(sms_freq_words)

There are 1139 words appearing in at least 5 messages

We now need to filter our DTM to include only the terms appearing in a specified
vector.

As done earlier, we'll use the data frame style [row, col] operations to
request specific portions of the DTM, noting that the columns are named after the
words the DTM contains. We can take advantage of this to limit the DTM to specific
words

In [None]:
sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]

sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]

In [None]:
lapply(list(train = sms_dtm_train, freq_train = sms_dtm_freq_train), dim)
lapply(list(test = sms_dtm_test, freq_test = sms_dtm_freq_test), dim)

Note how number of columns - denoting terms - shrank after the filtering. Row counts are preserved

The Naive Bayes classifier is typically trained on data with categorical features.

This poses a problem, since the cells in the sparse matrix are numeric and measure the number of times a word appears in a message.

We need to change this to a categorical variable that simply indicates yes or no depending on whether the word appears at all.

We first create a custom function and then apply it on the column margin of the dtm's:

In [None]:
# return "Yes" if the term occurs at least once, otherwise "No"
convert_counts <- function(x)
{
    ifelse(x > 0, "Yes", "No")
}

In [None]:
sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts)

sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts)


The result will be two character matrices, each with cells indicating "Yes" or "No" for whether the word represented by the column appears at any point in the message represented by the row.
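The conversion is easy to check on a tiny matrix. The counts below are made up, and `to_indicator` is an illustrative stand-in with the same logic as the conversion function used in this section.

```r
# Hypothetical count matrix: 2 documents, 3 terms
counts <- matrix(c(0, 2, 1,
                   1, 0, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"), c("call", "free", "prize")))

# Illustrative helper: any positive count becomes "Yes", zero becomes "No"
to_indicator <- function(x) ifelse(x > 0, "Yes", "No")

# MARGIN = 2 applies the helper column by column (term by term)
indicators <- apply(counts, MARGIN = 2, to_indicator)
indicators
```

Note that a count of 2 and a count of 1 both become "Yes": the indicator keeps only presence or absence.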

In [None]:
sms_train[1:10, 1:10]
sms_test[1:10, 1:10]

# Training a model on the data

Naive Bayes algorithm will use the presence or absence of words to estimate the probability that a
given SMS message is spam.

To build our model on the sms_train matrix, we'll use the following command:

In [None]:
sms_classifier <- e1071::naiveBayes(sms_train, sms_train_labels)

The sms_classifier object now contains a naiveBayes classifier object that can be
used to make predictions.

See the structure:

In [None]:
str(sms_classifier)

The probabilities appear inside the "tables" item of the list

Now let's view and interpret the tables for selected terms

In [None]:
selected_terms_spam <- c("free", "claim", "mobil", "prize", "call")

selected_terms_ham <- c("now", "get", "just", "you", "will", "can")

In [None]:
sms_classifier$tables[selected_terms_spam]

For a selected term, the table should be read as follows:

In [None]:
termm <- "free"
table1 <- sms_classifier$tables[[termm]]
table1

rows1 <- toupper(rownames(table1))
cols1 <- toupper(c("does not appear", "appears"))

outer(1:2,
      1:2,
      Vectorize(function(x, y) sprintf("Given that the sms is a %s, the prob. that the term \"%s\" %s is %.2f",
                                      rows1[x], toupper(termm), cols1[y], table1[x,y]))) %>%
    as.vector() %>%
    data.frame()

As it stands, this information is not directly useful. We want to go the other way: to figure out whether a message is ham or spam based on whether the term appears.

So we should process this info.

Now a little bit of information on how the Naive Bayes algorithm works:

Bayes theorem says: (from wikipedia)

$${\displaystyle p(C_{k}\mid \mathbf {x} )={\frac {p(C_{k})\ p(\mathbf {x} \mid C_{k})}{p(\mathbf {x} )}}\,}$$

In plain English, using Bayesian probability terminology, the above equation can be written as

$${\displaystyle {\mbox{posterior}}={\frac {{\mbox{prior}}\times {\mbox{likelihood}}}{\mbox{evidence}}}\,}$$


In our example:

$${\displaystyle {\mbox{p(spam | free) }}={\frac {{\mbox{p(spam)}}\times {\mbox{p(free | spam)}}}{\mbox{p(free)}}}\,}$$
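To make the formula concrete, the following sketch plugs in made-up numbers. The prior and likelihood values are assumptions chosen only to show the arithmetic; they are not estimates from the SMS data.

```r
# Hypothetical numbers, chosen only to illustrate the arithmetic
prior <- c(ham = 0.8, spam = 0.2)        # p(ham), p(spam)
lik_free <- c(ham = 0.02, spam = 0.25)   # p(free | ham), p(free | spam)

# Evidence: p(free) = p(ham) * p(free | ham) + p(spam) * p(free | spam)
evidence <- sum(prior * lik_free)

# Posterior: p(class | free) = prior * likelihood / evidence
posterior <- prior * lik_free / evidence
round(posterior, 3)
```

Even though only 20% of these hypothetical messages are spam, seeing "free" raises the spam probability to about 76%, because the term is much more likely in spam than in ham.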

So:
- We start with "the probability that FREE appears given the message is a SPAM" (likelihood)
- Multiply it with "the probability of SPAM" (prior)
- Divide with "the probability of FREE" (evidence)

"evidence" is simply the sum over both classes: p(free) = p(spam) * p(free | spam) + p(ham) * p(free | ham)

Recall our likelihoods:

In [None]:
table1

The sms classifier list holds the "a priori" counts of ham/spam, we convert them to proportions or probabilities:

In [None]:
prop <- prop.table(sms_classifier$apriori)
prop

We multiply the "a priori" probabilities with the likelihoods (element-wise, not matrix multiplication) to arrive at the joint probabilities of each class with No/Yes, which are the unnormalized posteriors:

In [None]:
posterior1 <- as.numeric(prop) * table1
posterior1

And last, we scale each column by its sum (the evidence) to arrive at the posterior probabilities:

In [None]:
posterior2 <- prop.table(posterior1, 2)
posterior2

We interpret the posterior probabilities as:

In [None]:
outer(1:2,
      1:2,
      Vectorize(function(x, y) sprintf("Given that the term \"%s\" %s, the prob. that the sms is a %s is %.2f",
                                      toupper(termm), cols1[y], rows1[x], posterior2[x,y] ))) %>%
    as.vector() %>%
    data.frame()

We are 74% confident that an sms with "free" is a spam and 89% confident that an sms without "free" is a ham 

## Joint model

A simplifying assumption of Naive Bayes is that the terms are conditionally independent of each other given the class. So we can easily calculate the posterior probability of a joint condition as:

${\displaystyle p(C_{k}\mid x_{1},\dots ,x_{n})={\frac {1}{Z}}p(C_{k})\prod _{i=1}^{n}p(x_{i}\mid C_{k})}$

where the evidence ${\displaystyle Z=p(\mathbf {x} )=\sum _{k}p(C_{k})\ p(\mathbf {x} \mid C_{k})}$ is a scaling factor
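A minimal sketch of this formula with hypothetical likelihoods for three terms follows. All numbers are made up for illustration; the real values come from the classifier's tables.

```r
# Hypothetical priors p(C_k)
prior <- c(ham = 0.8, spam = 0.2)

# Hypothetical p(term appears | class); rows are classes, columns are terms
lik <- rbind(ham  = c(free = 0.02, mobil = 0.01, call = 0.05),
             spam = c(free = 0.25, mobil = 0.15, call = 0.40))

# Numerator: p(C_k) * prod_i p(x_i | C_k), one value per class
joint <- prior * apply(lik, 1, prod)

# Dividing by Z (the sum over classes) gives the posterior
posterior <- joint / sum(joint)
round(posterior, 4)
```

Because each term is individually more likely in spam, multiplying the three likelihoods makes the combined evidence far stronger than any single term.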

Let's remember the selected words likely to signal spams:

In [None]:
table_spam <- sms_classifier$tables[c("free", "mobil", "call")]
table_spam

First, let's collect only the second ("Yes") column from each list item into a matrix:

In [None]:
mat_spam <- do.call(cbind, lapply(table_spam, "[", 1:2, 2))
mat_spam

And calculate the joint independent likelihoods of each row before multiplying with "a priori" probs of ham or spam:

In [None]:
mat_spam2 <- apply(mat_spam, 1, prod)
mat_spam2

With a pipe, we can do both in a single combined step.
Note that we format the output to 6 digits for better display with sprintf:

In [None]:
mat_spam2 <- do.call(cbind, lapply(table_spam, "[", 1:2, 2)) %>%
    apply(1,prod)

# we format the numbers with 6 decimal digits and preserve the names in the vector
sprintf("%.6f", mat_spam2) %>%
    purrr::set_names(names(mat_spam2))

Remember the "a priori" probabilities of ham and spam

In [None]:
prop

We multiply by the priors to get the unnormalized posteriors:

In [None]:
posterior_mat <- as.numeric(prop) * mat_spam2

sprintf("%.6f", posterior_mat) %>%
    purrr::set_names(names(posterior_mat))

And normalize by the evidence, that is, by their sum:

Note that we now further format the posterior probabilities:

* To percentage
* Keeping the original names
* And with a right aligned table format


In [None]:
posterior_mat2 <- prop.table(posterior_mat)

posterior_mat2 %>%
    scales::percent(accuracy = 0.0001, trim = F) %>%
    purrr::set_names(names(posterior_mat2)) %>%
    knitr::kable(col.names = "posterior", align = "r") %>%
    as.character() %>%
    IRdisplay::display_html()

So when the terms "free", "mobil" and "call" appear together in an sms, it is spam with 99.9% probability and ham with only 0.08% probability!

## Model prediction

### Manually Calculating posterior probabilities for the model

Now let's use the conditional likelihoods of all terms to calculate the posterior probabilities of the ham/spam classes:

In [None]:
mat_spam_all_ham <- do.call(cbind, lapply(sms_classifier$tables, "[", 1, 1:2))

In [None]:
mat_spam_all_ham

In [None]:
mat_spam_all_spam <- do.call(cbind, lapply(sms_classifier$tables, "[", 2, 1:2))

In [None]:
mat_spam_all_spam

With the joint model, we have to multiply all likelihoods in a row. However, zero values will distort the model, since multiplication by zero yields zero (we'll see more on this below in the "Laplace estimator" section).

We will replace values below a certain threshold with that threshold value. We choose thresh = 0.001, since this is the default value of the threshold argument in the predict() method for e1071's naiveBayes objects.

In [None]:
thresh <- 0.001
mat_spam_all_ham[mat_spam_all_ham < thresh] <- thresh
mat_spam_all_spam[mat_spam_all_spam < thresh] <- thresh

In [None]:
mat_spam_all_ham
mat_spam_all_spam

The dtm for test data is as follows:

In [None]:
sms_test

For each document, we'll calculate the joint "ham" and "spam" likelihoods (and then posterior probs) of all terms that appear or do not appear in this document: 

For each of ham/spam likelihoods, if the term does not appear, the likelihood in the "No" row, if the term appears, the likelihood in the "Yes" row will be taken and multiplied:

In [None]:
termsn <- ncol(mat_spam_all_ham)
sprintf("Number of terms is %s", termsn)

# This will give posterior probabilities of ham and spam for each document:
# the first (second) element of c() subsets the "ham" ("spam") likelihoods
# from the respective No/Yes rows of each term column and multiplies them,
# weighted by the class prior from prop (ham first, spam second)
priors <- as.numeric(prop)
posterior <- t(apply(sms_test, 1, function(x)
    c(priors[1] * prod(mat_spam_all_ham[cbind((x == "Yes") + 1, 1:termsn)]),
      priors[2] * prod(mat_spam_all_spam[cbind((x == "Yes") + 1, 1:termsn)])))) %>%
    prop.table(margin = 1)
                     
# this gives "ham and spam"
names_hs <- names(posterior_mat2)
sprintf("Column titles are %s", paste(names_hs, collapse = " and "))

# this will determine whether ham or spam posterior probability is higher
# and hence classify the document accordingly:
sms_test_pred_manual <- apply(posterior, 1, function(x) names_hs[which.max(x)])

# format the posterior probabilities with appropriate percent decimals, column and row names:
sms_test_pred_manual_percent <- posterior %>%
    apply(2,scales::percent, accuracy = 0.0001, trim = F) %>%
    magrittr::set_colnames(names_hs) %>%
    magrittr::set_rownames(sms_test_pred_manual)

sms_test_pred_manual_percent
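The manual computation above relies on indexing a matrix with a two-column matrix of (row, column) pairs, which picks exactly one cell per term. Here is that trick in isolation, on a made-up 2 x 3 likelihood matrix and a made-up document.

```r
# Hypothetical likelihoods: row 1 = "No", row 2 = "Yes"; one column per term
lik <- matrix(c(0.9, 0.1,
                0.7, 0.3,
                0.8, 0.2),
              nrow = 2,
              dimnames = list(c("No", "Yes"), c("free", "call", "prize")))

# Hypothetical document: which terms appear
doc <- c(free = "Yes", call = "No", prize = "Yes")

# FALSE + 1 = 1 selects the "No" row, TRUE + 1 = 2 selects the "Yes" row
rows <- (doc == "Yes") + 1

# Each (row, column) pair picks one cell: Yes/free, No/call, Yes/prize
picked <- lik[cbind(rows, seq_along(doc))]
picked

# Joint likelihood of the document under this class
prod(picked)
```

Indexing with `lik[rows, ]` instead would return whole rows, so the cbind() form is what makes the per-term selection work.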

### Prediction with predict() function

We usually do not manually go through the steps above to interpret the model:

We can easily apply the model on the test set to calculate predictions:

In [None]:
sms_test_pred <- predict(sms_classifier, sms_test)

sms_test_pred is a factor vector of levels "ham" and "spam":

In [None]:
str(sms_test_pred)

Just as sms_test_labels is:

In [None]:
str(sms_test_labels)

From the predict function we can also get the ham/spam posterior probabilities for each document with the "raw" option for type argument:

In [None]:
sms_test_pred_raw <- predict(sms_classifier, sms_test, type = "raw")

In [None]:
str(sms_test_pred_raw)

Let's print this matrix nicely:

In [None]:
# ham and spam
names_hs <- names(posterior_mat2)

# set percent format, rownames and column names
sms_test_pred_raw_percent <- sms_test_pred_raw %>%     
    apply(2,scales::percent, accuracy = 0.0001, trim = F) %>%
    magrittr::set_colnames(names_hs) %>%
    magrittr::set_rownames(apply(sms_test_pred_raw, 1, function(x) names_hs[which.max(x)]))

sms_test_pred_raw_percent

Now let's compare the ham/spam classifications we did manually and through the predict function:

In [None]:
data.frame(as.factor(sms_test_pred_manual), sms_test_pred)

Mostly they are parallel, but there may be some classification differences:

In [None]:
manual_predict_match <- sms_test_pred_manual == as.character(sms_test_pred)

sprintf("From a total of %s documents, in %s ones both our manual and predict function classifications are the same",
total_doc <- length(sms_test_pred_manual),
cor_match <- sum(manual_predict_match))

sprintf("So classifications of %s documents differ between our manual method and predict function",
        total_doc - cor_match)

Let's pick those documents where the classification from the predict() function differs from our manual calculation:

In [None]:
# mismatches from manual posterior probs.
m1 <- sms_test_pred_manual_percent[!manual_predict_match,]
# mismatches from predict() function
m2 <- sms_test_pred_raw_percent[!manual_predict_match,]

# display both matrices side by side
knitr::kable(list(list(m1, caption = data.frame(calculated_by = "manual")),
                  list(m2, caption = data.frame(calculated_by = "predict")))) %>%
    as.character() %>%
    IRdisplay::display_html()

Note that none of those mismatches shows a highly dominant posterior probability, such as > 99%.

So the differences come mostly from how the numbers are handled rather than from the model itself. Double-precision numbers in R carry only about 15 to 17 significant decimal digits, and values smaller than roughly 1e-308 underflow towards zero.

Multiplying 1000+ small numbers can therefore lose accuracy, and this may happen in our manual calculation. Our manual thresholding also replaces every likelihood below 0.001, whereas predict() by default replaces only zero probabilities. Libraries like Rmpfr can handle much higher precision.

The predict() function works with sums of log-probabilities, which avoids this kind of accuracy loss. So it is better to use the built-in predict() function to predict the labels of the test set.
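The effect of multiplying many small numbers can be demonstrated in base R. The sketch below uses made-up likelihoods and shows the usual remedy of summing logarithms; it illustrates the idea only and does not reproduce the internals of predict().

```r
# 1200 hypothetical likelihoods of 0.5 each
p <- rep(0.5, 1200)

# The true product, 0.5^1200, is far below the smallest positive double,
# so the direct product underflows to exactly 0
direct <- prod(p)
direct

# Summing logs keeps the same information in a representable range
log_lik <- sum(log(p))
log_lik

# Two classes can still be compared through the difference of log scores
log_a <- sum(log(rep(0.5, 1200)))   # class A, hypothetical
log_b <- sum(log(rep(0.4, 1200)))   # class B, hypothetical
post_a <- 1 / (1 + exp(log_b - log_a))
post_a
```

With the direct products both classes would score 0 and the posterior would be undefined (0 / 0), while the log-based version still identifies class A as the more likely one.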

## Model evaluation

Now, we will form a confusion matrix

CrossTable visualizes how much the test labels are correctly classified:

In [None]:
ct_nb <- gmodels::CrossTable(sms_test_pred,
                             sms_test_labels,
                             prop.chisq = F,
                             prop.t = F,
                             prop.r = F,
                             dnn = c('predicted', 'actual'))

ct_nb
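Reading accuracy off a confusion matrix is simple arithmetic on its diagonal. The sketch below uses a made-up 2 x 2 table, not the output of the classifier.

```r
# Hypothetical confusion matrix (rows: predicted, columns: actual)
# matrix() fills column by column, so the first column is actual "ham"
cm <- matrix(c(95, 5,
               3, 47),
             nrow = 2,
             dimnames = list(predicted = c("ham", "spam"),
                             actual = c("ham", "spam")))

# Correct classifications sit on the diagonal
correct <- sum(diag(cm))
total <- sum(cm)

accuracy <- correct / total
error_rate <- 1 - accuracy
round(c(accuracy = accuracy, error_rate = error_rate), 4)
```

The off-diagonal cells are the two kinds of mistakes: ham predicted as spam and spam predicted as ham.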

Now, let's automatically report findings:

In [None]:
sprintf("Out of a total of %s sms's:
- %s sms's are correctly classified as either ham or spam (%s),
- while %s sms's are misclassified (%s)",
        all <- ct_nb$t %>% sum(),
        cor <- ct_nb$t %>% diag() %>% sum(),
        (cor / all) %>% scales::percent(accuracy = 0.01),
        fal <- all - cor,
        (fal / all) %>% scales::percent(accuracy = 0.01)
       ) %>% cat()

## Laplace estimator

Let's get those terms that never occur in spams in our corpus:

In [None]:
sapply(sms_classifier$tables, "[", 2, 2) %>%
    magrittr::extract(.==0) %>%
    names()

Let's select a word here which we think, might signal a spam message, such as "present". And let's repeat the steps above including "present":

In [None]:
table_spam <- sms_classifier$tables[c("free", "mobil", "call", "present")]
table_spam

In [None]:
mat_spam <- do.call(cbind, lapply(table_spam, "[", 1:2, 2))
mat_spam

In [None]:
mat_spam2 <- do.call(cbind, lapply(table_spam, "[", 1:2, 2)) %>%
    apply(1,prod)

# we format the numbers with 10 decimal digits and preserve the names in the vector
sprintf("%.10f", mat_spam2) %>%
    purrr::set_names(names(mat_spam2))

In [None]:
mat_spam2 == 0

It seems that, with the "zero" occurrence word "present", our model tends to misclassify messages that contain "free", "mobil", "call" and "present" as ham with full confidence!

Multiplication by 0 yields 0, so a single unseen term wipes out all the other evidence. We correct this with the Laplace estimator, which adds a small number (here 1) to every count in the frequency tables so that no likelihood is exactly zero:
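The idea behind the correction can be sketched with made-up counts. This shows plain add-one smoothing for a Yes/No feature under assumed numbers; it is not meant to reproduce e1071's exact computation.

```r
# Hypothetical counts of messages containing "present", by class
n_with_term <- c(ham = 12, spam = 0)
n_messages  <- c(ham = 3000, spam = 500)

# Without smoothing, the spam likelihood is exactly zero
unsmoothed <- n_with_term / n_messages

# Add-one smoothing for a two-level (No/Yes) feature:
# add 1 to each cell count, so each class total grows by 2
smoothed <- (n_with_term + 1) / (n_messages + 2)

rbind(unsmoothed, smoothed)
```

The smoothed spam likelihood is small but nonzero, so other strong spam signals in the same message are no longer cancelled out.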

In [None]:
sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1)

In [None]:
sms_test_pred2 <- predict(sms_classifier2, sms_test)

In [None]:
ct_nb2 <- gmodels::CrossTable(sms_test_pred2,
           sms_test_labels,
           prop.chisq = F,
           prop.t = F,
           prop.r = F,
           dnn = c('predicted', 'actual'))

ct_nb2

In [None]:
sprintf("Out of a total of %s sms's:
- %s sms's are correctly classified as either ham or spam (%s),
- while %s sms's are misclassified (%s)",
        all2 <- ct_nb2$t %>% sum(),
        cor2 <- ct_nb2$t %>% diag() %>% sum(),
        (cor2 / all2) %>% scales::percent(accuracy = 0.01),
        fal2 <- all2 - cor2,
        (fal2 / all2) %>% scales::percent(accuracy = 0.01)
       ) %>% cat()

In [None]:
perf <- c("Worse", "Same", "Better")

sprintf("%s performance of new vs. old:
(%s vs. %s correct,
%s vs. %s misclassified)",
    perf[sign(cor2 - cor) + 2],
    cor2, cor, fal2, fal) %>% cat()

In [None]:
save.image(sessionfile)