# Exercise 02: Spam filtering with naive Bayes

## The naive Bayes classifier

The naive Bayes classifier is a probabilistic machine learning model used for classification tasks. It is based on the assumption that the features in a dataset are independent of each other given the class, which is called the "naive" assumption. Although this assumption rarely holds exactly, the naive Bayes classifier has been shown to be quite effective in many real-world applications.

The algorithm works by training on a labeled dataset, where the input data is split into classes based on the target variable. For each class, the algorithm calculates the probability of each feature being associated with that class. During the prediction phase, the algorithm uses these probabilities to predict the class for a new, unlabeled example by finding the class with the highest probability.

One of the strengths of the naive Bayes classifier is that it is simple and easy to implement, yet it can perform well on a variety of tasks. It is also relatively fast to train, which makes it a good choice for large datasets. However, it is important to note that the assumption of feature independence can sometimes lead to inaccurate predictions, particularly when the features are highly correlated. Despite this, the naive Bayes classifier can still be a useful tool in many situations.

Given a dataset with $n$ features and a target variable with $k$ classes, the goal is to estimate the class probabilities $P(C_i)$ for each class $C_i$ and the feature probabilities $P(x_j|C_i)$ for each feature $x_j$ given each class.

To make a prediction for a new, unlabeled example with feature values $x$, we can use Bayes' theorem to calculate the probability $P(C_i|x)$ for each class $C_i$:

$$ P(C_i|x) = \frac{P(x|C_i)P(C_i)}{P(x)} $$

$P(C_i|x)$ is called the posterior probability, $P(x|C_i)$ the likelihood, and $P(C_i)$ the prior. We will often work with the logarithm of these quantities, e.g., the log-likelihood. We then predict the class with the highest posterior probability (the maximum a posteriori, or MAP, estimate):

$$\arg\max_i P(C_i|x)$$

The probability $P(x)$ does not depend on the class, so it has no influence on which class attains the maximum. It is also often difficult to calculate, so it is usually dropped from the equation. This results in the simplified prediction rule:

$$\arg\max_i (P(x|C_i) \cdot P(C_i))$$

In the case of the naive Bayes classifier, we assume that the features are independent, so we can estimate $P(x|C_i)$ as the product of the individual feature probabilities:

$$P(x|C_i) = P(x_1|C_i) \cdot P(x_2|C_i) \cdot ... \cdot P(x_n|C_i)$$
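
As a small illustration of the prediction rule with the independence assumption, the sketch below scores one example with two binary features. The class names, priors, and feature probabilities are made-up numbers chosen for illustration only; they are not taken from the dataset used later.

```python
# Hypothetical toy numbers: two classes and two binary features.
priors = {"spam": 0.2, "ham": 0.8}

# Assumed values of P(x_j = 1 | class) for the two features.
feature_probs = {
    "spam": [0.6, 0.3],
    "ham": [0.1, 0.2],
}


def joint(x, cls):
    """Return P(x | cls) * P(cls) under the naive independence assumption."""
    p = priors[cls]
    for x_j, p_j in zip(x, feature_probs[cls]):
        # Use P(x_j = 1 | cls) if the feature is present, else its complement.
        p *= p_j if x_j == 1 else (1.0 - p_j)
    return p


x = [1, 0]  # feature 1 present, feature 2 absent
scores = {cls: joint(x, cls) for cls in priors}

# P(x) is the same for every class, so it only rescales the scores.
evidence = sum(scores.values())
posteriors = {cls: s / evidence for cls, s in scores.items()}

prediction = max(scores, key=scores.get)
print(prediction)                      # spam
print(round(posteriors["spam"], 3))    # 0.568
```

Dividing by `evidence` turns the scores into posterior probabilities but does not change which class wins, which is why the prediction rule can skip it.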

The class probabilities $P(C_i)$ and feature probabilities $P(x_j|C_i)$ can be estimated using maximum likelihood estimation, which involves counting the number of occurrences of each class and feature in the training data and dividing by the total number of examples.

For example, to estimate the probability $P(C_i)$, we would count the number of examples in class $C_i$ and divide by the total number of examples:

$$P(C_i) = \frac{\text{count}(C_i)}{\text{count}(total)}$$

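To estimate the probability $P(x_j|C_i)$, we would count the number of occurrences of feature $x_j$ in examples belonging to class $C_i$ and divide by the total number of examples in class $C_i$. (When the features are word counts, as in the exercises, the denominator is instead the total number of words in class $C_i$.)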

$$P(x_j|C_i) = \frac{\text{count}(x_j, C_i)}{\text{count}(C_i)} $$
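
The counting estimates can be sketched in a few lines of plain Python. The five training examples below are hypothetical, and each message is reduced to the set of words it contains, so the denominator is the number of examples in the class, as in the formula above.

```python
from collections import Counter

# Hypothetical toy training set: (label, set of words present in the message).
training = [
    ("spam", {"free", "win"}),
    ("spam", {"free", "call"}),
    ("ham", {"call", "home"}),
    ("ham", {"home"}),
    ("ham", {"free", "home"}),
]

# P(C_i) = count(C_i) / count(total)
class_counts = Counter(label for label, _ in training)
total = len(training)
priors = {cls: n / total for cls, n in class_counts.items()}

# count(x_j, C_i): number of examples of class C_i that contain word x_j
feature_counts = {cls: Counter() for cls in class_counts}
for label, words in training:
    feature_counts[label].update(words)


def cond_prob(word, cls):
    """Estimate P(word | cls) = count(word, cls) / count(cls)."""
    return feature_counts[cls][word] / class_counts[cls]


print(priors["spam"])            # 0.4
print(cond_prob("free", "spam")) # 1.0
print(cond_prob("win", "ham"))   # 0.0
```

The zero estimate for `"win"` in the ham class would wipe out the whole product for any message containing that word. This is the reason the solution code adds a small constant to the counts.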

These probabilities can then be plugged into the prediction rule to make predictions for new examples.

The prediction rule is:

$$\arg\max_i (P(x|C_i) \cdot P(C_i))$$

This means that, given a new example with feature values $x$, we want to find the class $C_i$ that has the highest probability of occurring given the feature values $x$. To do this, we multiply the probability of the features $x$ given the class $C_i$ by the probability of the class $C_i$ occurring, and choose the class with the highest resulting probability.

The probability $P(x|C_i)$ is calculated as the product of the individual feature probabilities $P(x_j|C_i)$, which are estimated using maximum likelihood estimation as described above.
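
A product of many small probabilities quickly underflows in floating-point arithmetic, which is the practical reason for working with logarithms. The sketch below uses made-up probabilities (400 features, each with the same conditional probability per class) purely to show the effect.

```python
import math

# Hypothetical setup: 400 features, one constant probability per class.
probs_a = [0.01] * 400
probs_b = [0.02] * 400


def product(values):
    """Multiply all values together (prone to underflow)."""
    result = 1.0
    for v in values:
        result *= v
    return result


def log_score(values):
    """Sum of logarithms: the same comparison, without underflow."""
    return sum(math.log(v) for v in values)


product_a = product(probs_a)
product_b = product(probs_b)
log_score_a = log_score(probs_a)
log_score_b = log_score(probs_b)

print(product_a, product_b)        # 0.0 0.0
print(log_score_b > log_score_a)   # True
```

Both products collapse to zero, so the classes can no longer be compared, whereas the log scores still rank them correctly. Because the logarithm is monotonic, taking the argmax of the log scores selects the same class as the argmax of the products.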

## Spam filtering

A classic example of a classification problem that can be solved using the Naive Bayes algorithm is spam filtering.

Here the input data consists of the text of emails, SMS messages, or other types of messages. The goal is to classify each message as either "spam" or "not spam". In general, a filter can use the words that appear in the message as well as other features such as the sender and the subject line; we will only look at the content of the message.

To solve this problem using the Naive Bayes algorithm, we start by extracting features from the messages, such as the presence or absence of certain words or phrases. We would then create a dataset with these features and the corresponding labels ("spam" or "not spam") for each message.

Next, we train a naive Bayes classifier on this dataset, using the features to make predictions about the labels. The classifier would use the relative frequency of each word or phrase in the "spam" and "not spam" messages to estimate the probability that a new message is spam or not spam. We could then use this trained classifier to predict the class label for new, unseen email messages.
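
To make the feature extraction step concrete, here is a minimal sketch that turns messages into presence/absence and count vectors over a shared vocabulary. The tokenizer (`tokenize`) and the two example messages are assumptions made for illustration; a library vectorizer may split text differently.

```python
import re


def tokenize(message):
    """Lowercase the message and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", message.lower())


# Hypothetical example messages.
messages = ["Free entry! Win a FREE prize", "Are you coming home?"]

# The vocabulary is the sorted set of all words seen in the messages.
vocabulary = sorted({word for m in messages for word in tokenize(m)})


def presence_vector(message):
    """1 if the vocabulary word occurs in the message, else 0."""
    words = set(tokenize(message))
    return [1 if w in words else 0 for w in vocabulary]


def count_vector(message):
    """Number of times each vocabulary word occurs in the message."""
    tokens = tokenize(message)
    return [tokens.count(w) for w in vocabulary]


print(vocabulary)
# ['a', 'are', 'coming', 'entry', 'free', 'home', 'prize', 'win', 'you']
print(presence_vector(messages[0]))  # [1, 0, 0, 1, 1, 0, 1, 1, 0]
print(count_vector(messages[0]))     # [1, 0, 0, 1, 2, 0, 1, 1, 0]
```

Each message becomes one row of the feature matrix, with one column per vocabulary word.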


## Data

The SMS Spam Collection is a set of labeled SMS messages that have been collected for mobile phone spam research. The dataset contains 5,572 SMS messages in English, tagged as either "spam" or "ham" (not spam). The messages have been collected from various sources, mostly from a publicly available corpus of SMS messages.

Each line in the data file corresponds to one message, and each line contains the label (ham or spam) and the message text, separated by a tab character.
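
The file layout described above can be parsed without any library. The sketch below works on two sample lines written for illustration (they are not guaranteed to match lines in the actual file), and `parse_line` is a hypothetical helper name.

```python
def parse_line(line):
    """Split one 'label<TAB>message' line into its two parts."""
    # Split on the first tab only, in case the message itself contains tabs.
    label, message = line.rstrip("\n").split("\t", 1)
    return label, message


# Illustrative sample in the same tab-separated layout as the data file.
sample = "ham\tOk lar... Joking wif u oni...\nspam\tFree entry in 2 a wkly comp\n"

records = [parse_line(line) for line in sample.splitlines()]
labels = [1 if label == "ham" else 0 for label, _ in records]

print(records[0])  # ('ham', 'Ok lar... Joking wif u oni...')
print(labels)      # [1, 0]
```

The numeric encoding follows the convention used in the solution code: spam is 0 and ham is 1.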

Source: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

### Exercises:

1. Load the `smsspamcollection` dataset and inspect its content (ham means "not spam"). The dataset is also available on itslearning. Change the label (spam or ham) to a numeric value, e.g., `[0,1]`.
2. Split the data into a training set and a test set (you can use `from sklearn.model_selection import train_test_split`).
3. Create a feature matrix (one feature is one word); in this case, this is the count matrix. You can use `from sklearn.feature_extraction.text import CountVectorizer` to generate this matrix.
4. Compute the class probabilities $P(C_1) = P(C = \text{spam})$ and $P(C_2) = P(C = \text{not spam})$. What is the probability of an SMS being spam?
5. Compute the conditional probabilities $P(x_i|C_1)$ and $P(x_i|C_2)$.
6. Print out the five most used words in the messages classified as spam and as not spam.
7. Calculate $\log P(C_1|x) = \log P(C_1) + \sum_{i=1}^n \log P(x_i|C_1) + \text{const}$ and $\log P(C_2|x) = \log P(C_2) + \sum_{i=1}^n \log P(x_i|C_2) + \text{const}$, where the constant $-\log P(x)$ is the same for both classes and can be ignored.
8. Without using Scikit-learn, write a classifier, classify the messages in the test set using the prediction rule, and evaluate how well your predictions match the true labels.
9. Use a built-in naive Bayes model in Scikit-learn, e.g., `MultinomialNB`, to train a classifier on the training set and evaluate how well it performs on the test set.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Solution 1

# Load the dataset
df = pd.read_csv('../data/smsspamcollection/SMSSpamCollection',
                 sep='\t', header=None, names=['label', 'message'])

# We have two classes:
# $C_1 = 0$ (spam)
# $C_2 = 1$ (not spam)
df['label'] = df['label'].map({'spam': 0, 'ham': 1})

# Display the DataFrame (the notebook shows only the first and last rows)
df

| | label | message |
|---|---|---|
| 0 | 1 | Go until jurong point, crazy.. Available only ... |
| 1 | 1 | Ok lar... Joking wif u oni... |
| 2 | 0 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 1 | U dun say so early hor... U c already then say... |
| 4 | 1 | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | 0 | This is the 2nd time we have tried 2 contact u... |
| 5568 | 1 | Will ü b going to esplanade fr home? |
| 5569 | 1 | Pity, * was in mood for that. So...any other s... |
| 5570 | 1 | The guy did some bitching but I acted like i'd... |


In [3]:
# Solution 2

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'],
                                                    test_size=0.33, random_state=1)

In [4]:
# Solution 3

# Import the CountVectorizer class from the sklearn.feature_extraction.text module
from sklearn.feature_extraction.text import CountVectorizer

# Create an instance of the CountVectorizer class
vectorizer = CountVectorizer()

# Fit the vectorizer to the training data and transform it into a numerical array
X_train = vectorizer.fit_transform(X_train).toarray()

# Transform the test data into a numerical array using the already-fitted vectorizer
X_test = vectorizer.transform(X_test).toarray()

In [5]:
# The shape of the count matrix
np.shape(X_train)

(3733, 6984)

In [6]:
# The count matrix is very sparse 
X_train

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [7]:
# Solution 4

# Create a dictionary to store the class probabilities
class_probs = {}

# Calculate the class probabilities
# P(C_1) and P(C_2)
class_probs[0] = (y_train == 0).mean()
class_probs[1] = (y_train == 1).mean()


print(f'The probability of an SMS being spam: {class_probs[0].round(3)}')

The probability of an SMS being spam: 0.134


In [8]:
# Solution 5

# The conditional probabilities for each class is given by P(x_i|C_1) and P(x_i|C_2)
# We get these by calculating the number of times each word appears in each class
# and dividing by the total number of words in each class

# The total number of words in each class
total_words_in_class_0 = X_train[y_train == 0].sum()
total_words_in_class_1 = X_train[y_train == 1].sum()

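# Calculate the number of times each word appears in each class.
# Add 0.1 to each count (additive smoothing) to avoid taking the log of 0 later.
# Note: the denominators below are not adjusted for the added counts,
# so the probabilities in each class sum to slightly more than 1.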
count_x_0 = X_train[y_train == 0].sum(axis=0) + 0.1
count_x_1 = X_train[y_train == 1].sum(axis=0) + 0.1

cond_probs_0 = count_x_0 / total_words_in_class_0
cond_probs_1 = count_x_1 / total_words_in_class_1
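
For comparison, additive smoothing is often written with a matching correction in the denominator so that the smoothed probabilities sum to one. The sketch below shows that variant on hypothetical counts; `smoothed_probs` and `alpha` are illustrative names, and this is not the exact computation used in the cell above.

```python
def smoothed_probs(counts, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * vocabulary size)."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


# Hypothetical word counts for one class over a three-word vocabulary.
counts = [3, 0, 1]
probs = smoothed_probs(counts, alpha=0.1)

print(all(p > 0 for p in probs))        # True
print(abs(sum(probs) - 1.0) < 1e-12)    # True
```

Every word keeps a non-zero probability, so the logarithm is always defined, and the values still form a proper probability distribution.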

In [9]:
np.shape(cond_probs_0)

(6984,)

In [10]:
# Solution 6

# The most used words in the spam messages (the top 10 are kept here)
top5_spam_words = pd.DataFrame(cond_probs_0).sort_values(by=0, ascending=False).head(10)

# The most used words in the not-spam (ham) messages (the top 10 are kept here)
top5_ham_words = pd.DataFrame(cond_probs_1).sort_values(by=0, ascending=False).head(10)

# Create a utility function for printing out the top words
def idx_to_word(idx):
    print(list(vectorizer.vocabulary_.keys())[list(vectorizer.vocabulary_.values()).index(idx)])

In [11]:
for i in top5_spam_words.index:
    idx_to_word(i)

to
call
you
your
for
the
free
or
now
txt


In [12]:
for i in top5_ham_words.index:
    idx_to_word(i)

you
to
the
and
in
me
is
my
it
that


In [13]:
# Solution 7

# We can now calculate the probability of a message being spam
# P(C_1|x) = \frac{P(C_1) \prod_{i=1}^n P(x_i|C_1)}{P(x)}

# We can ignore the denominator since it is the same for both classes
# P(C_1|x) \propto P(C_1) \prod_{i=1}^n P(x_i|C_1)

# We can calculate the log of the probability to avoid underflow.
# The dropped denominator becomes an additive constant that is the same for both classes:
# log(P(C_1|x)) = log(P(C_1)) + \sum_{i=1}^n log(P(x_i|C_1)) + const

log_probs = {}
log_probs[0] = np.log(class_probs[0]) + np.log(cond_probs_0).dot(X_test.T)

# And we do the same for P(C_2|x)
log_probs[1] = np.log(class_probs[1]) + np.log(cond_probs_1).dot(X_test.T)

In [14]:
# Solution 8

# We can now use the log probabilities to make our predictions

# Use the argmax function to find the class with the highest probability
y_pred = np.argmax([log_probs[0], log_probs[1]], axis=0)

# Check the accuracy of our model
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred).round(4))

0.9869
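
Since only about 13% of the training messages are spam, accuracy alone can hide how well the spam class itself is detected. The sketch below computes precision and recall for the spam class from scratch. The label lists are hypothetical, `confusion_counts` is an illustrative helper, and spam is encoded as 0 to match the mapping used in Solution 1.

```python
def confusion_counts(y_true, y_pred, positive=0):
    """Count true/false positives and negatives for the given positive label."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical labels (0 = spam, 1 = ham), not taken from the test set.
y_true_toy = [0, 0, 1, 1, 1, 1, 0, 1]
y_pred_toy = [0, 1, 1, 1, 1, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true_toy, y_pred_toy, positive=0)
accuracy = (tp + tn) / len(y_true_toy)
precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught

print(tp, fp, fn, tn)  # 2 1 1 4
print(accuracy)        # 0.75
```

In this toy case precision and recall are both 2/3, noticeably lower than the accuracy of 0.75.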


In [15]:
# Solution 9

from sklearn.naive_bayes import MultinomialNB

# Train a Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train, y_train)

# Check the accuracy of the model
y_pred_sklearn = model.predict(X_test)
print(accuracy_score(y_test, y_pred_sklearn).round(4))

0.9869
