<font size="5"> Data Literacy Exercise 01</font>

Machine Learning in Science, University of Tübingen, Winter Semester 2022



## Theoretical part of assignment 01

### 1. EXAMple question

given:

$P(match|guilty) = 0.99$ <br>
$P(match|\neg guilty) = 0.001$ <br>
$P(guilty) = 0.0001$ <br>

wanted:

$P(guilty|match) = \frac{P(match|guilty) \cdot P(guilty)}{P(match)}$ <br>

$P(match) = P(match|guilty)P(guilty) + P(match|\neg guilty)P(\neg guilty) = 0.99 \cdot 0.0001 + 0.001 \cdot 0.9999 = 0.0010989$

Thus, we can conclude that

$P(guilty|match) = \frac{0.99 \cdot 0.0001}{0.0010989} \approx 0.09009$

The probability of me being guilty, given the presented evidence, is approximately 0.09009.
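
The calculation can be checked numerically. The following is a minimal sketch in plain Python; the variable names are illustrative and not part of the assignment.

```python
# Given quantities from the exercise
p_match_given_guilty = 0.99
p_match_given_innocent = 0.001
p_guilty = 0.0001

# Law of total probability: P(match)
p_match = (p_match_given_guilty * p_guilty
           + p_match_given_innocent * (1 - p_guilty))

# Bayes' rule: P(guilty | match)
p_guilty_given_match = p_match_given_guilty * p_guilty / p_match

print(f"P(match) = {p_match:.7f}")
print(f"P(guilty | match) = {p_guilty_given_match:.5f}")
```

Note that $P(\neg guilty) = 1 - P(guilty) = 0.9999$ enters the evidence term, so the low prior dominates the result even though the test itself is very accurate.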

### 2. Theory Question - Pooled Testing

A. The probability that at least one sample is infected, let's call it $P(C \geq 1)$, is 1 minus the probability that none of the $k$ samples is infected, assuming the samples are independent: $P(C \geq 1) = 1 - (1 - P(C))^k$.
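
To get a feeling for how quickly this probability grows with the pool size, here is a small sketch. The function name and the prevalence of 2% are assumptions chosen for illustration only; the samples are assumed to be independent.

```python
def prob_pool_contains_infection(prevalence: float, pool_size: int) -> float:
    """P(at least one of `pool_size` independent samples is infected)."""
    return 1.0 - (1.0 - prevalence) ** pool_size


# Hypothetical prevalence P(C) = 0.02, evaluated for several pool sizes k
for k in (1, 5, 10, 30):
    print(k, round(prob_pool_contains_infection(0.02, k), 4))
```

For $k = 1$ the expression reduces to $P(C)$ itself, which is a quick sanity check of the formula.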

B. The specificity, given that one is not infected, is calculated as
$P(\neg T | \neg C) \cdot P(\neg T)^{k-1}$

where $P(\neg T) = P(C)P(\neg T|C) + P(\neg C)P(\neg T| \neg C)$.

All that is left is to calculate $P(\neg T|C)$.




### Introduction

You have probably received quite a number of spam messages since you started using digital forms of communication. In this exercise, we will use a Bayesian approach to judge the probability of a message being spam, based on the words in the message.

To start, you will need to download the dataset from https://archive.ics.uci.edu/ml/datasets/sms+spam+collection, which contains a number of messages labeled _spam_ or _ham_ (not *spam*). See the link for more information on the dataset.

You will also need the packages `pandas`, `numpy` and `sklearn` (installed as `scikit-learn`).

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np

### Loading and processing the data

We start by loading and preprocessing the data. We will create a `pandas` `DataFrame` with the columns *label* and *content*, containing whether a message is judged spam or ham, and the content (words) of the message, respectively.

If you are unfamiliar with `pandas`, we suggest taking a look here: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html, as this package will likely be used in more tutorials.

Since we would like to judge how well our spam filter does on future messages, in the first step we split our data into a train and a test set. We will use the train set for setting up the spam filter. The test set will only be used to validate the algorithm at the end of the notebook.


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

# TO DO: add the path to where you stored the dataset
dataset_path = 'smsspamcollection/SMSSpamCollection'

def read_and_clean_data(path):
    """
    load and clean dataset
    """
    
    # read data (tab-separated, no header row):
    sms_data = pd.read_csv(path, header=None, sep='\t', names=['label', 'content'])
    
    # clean data: replace non-word characters by spaces, collapse whitespace, tokenize
    # (raw strings avoid invalid-escape warnings for the regex patterns)
    sms_data['content'] = sms_data['content'].str.replace(r'\W+', ' ', regex=True)
    sms_data['content'] = sms_data['content'].str.replace(r'\s+', ' ', regex=True)
    sms_data['content'] = sms_data['content'].str.strip()
    sms_data['content'] = sms_data['content'].str.split()
    # Your code: make all letters lowercase
    
    sms_data['content'] = sms_data['content'].apply(lambda x: list(map(lambda y: y.lower(), x)))
    
    # Your code: make a 80% / 20% train / test split using the train_test_split function
    train_data, test_data = train_test_split(sms_data, test_size=0.2)
    
    return train_data, test_data
    
train_data, test_data = read_and_clean_data(dataset_path)

In [3]:
train_data


| | label | content |
|---|---|---|
| 4395 | ham | [baaaaaaaabe, wake, up, i, miss, you, i, crave... |
| 3473 | ham | [i, think, i, m, waiting, for, the, same, bus,... |
| 3788 | ham | [whore, you, are, unbelievable] |
| 1395 | ham | [r, we, still, meeting, 4, dinner, tonight] |
| 4186 | ham | [i, m, good, have, you, registered, to, vote] |
| ... | ... | ... |
| 4466 | ham | [cheers, for, callin, babe, sozi, culdnt, talk... |
| 2108 | ham | [hmmm, and, imagine, after, you, ve, come, hom... |
| 5430 | ham | [if, you, can, make, it, any, time, tonight, o... |
| 109 | ham | [i, know, grumpy, old, people, my, mom, was, l... |
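
Token lists such as `[i, m, good, ...]` are a direct consequence of the cleaning steps: every run of non-word characters (including apostrophes) becomes a space before the text is split. The sketch below reproduces the same steps on a single string using the standard `re` module; the helper name `tokenize` and the example sentence are made up for illustration.

```python
import re


def tokenize(text: str) -> list[str]:
    """Apply the same cleaning steps as read_and_clean_data to one string."""
    text = re.sub(r"\W+", " ", text)  # runs of non-word characters -> one space
    text = text.strip()               # drop leading / trailing whitespace
    return text.lower().split()       # lowercase and split into words


print(tokenize("I'm waiting... Call 0800-123 NOW!"))
```

Because the apostrophe is a non-word character, `I'm` is split into the two tokens `i` and `m`, and the hyphenated number is split into `0800` and `123`.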


### Posterior probability of a message being spam (1/2)

Let's say you receive a message containing the word *money*, what is the posterior probability of this message being spam?

Use the following notation: 

$P(S)$: _prior_ probability of a message being spam

$P(H)$: _prior_ probability of a message being ham (not spam)

$P(w)$: _evidence_  of the word *money*


Answer the following two questions before continuing:

#### Use Bayes rule to calculate the posterior probability, $P(S|w)$ of a message being spam, given it contains the word *money*

$P(S|w) = \frac{P(w|S)\cdot P(S)}{P(w)}$


#### Split up $P(w)$ using the law of total probability

$P(w) = P(w|S)\cdot P(S) + P(w|H) \cdot P(H)$
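
The two formulas combine into a single small function. This is only a sketch with invented numbers (the likelihoods 0.05 and 0.01 and the prior 0.13 are assumptions, not values from the dataset); the function name is hypothetical as well.

```python
def posterior_given_word(p_w_given_s: float, p_w_given_h: float, p_s: float):
    """Return (P(S|w), P(H|w)) via Bayes' rule and the law of total probability."""
    p_h = 1.0 - p_s
    evidence = p_w_given_s * p_s + p_w_given_h * p_h  # P(w)
    return p_w_given_s * p_s / evidence, p_w_given_h * p_h / evidence


# Hypothetical values: the word occurs in 5% of spam and 1% of ham messages
p_spam_given_w, p_ham_given_w = posterior_given_word(0.05, 0.01, p_s=0.13)
print(p_spam_given_w, p_ham_given_w)
```

Since both posteriors share the same evidence term, they always sum to 1.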

Now you are ready to calculate $P(S|w)$ based on the train data set.

#### Calculate the prior probabilities of a message being spam or ham, based on their occurrences



In [4]:
# YOUR CODE: calculate the total number of spam and ham messages in the dataset
n_spam = train_data['label'].value_counts()['spam']
n_ham = train_data['label'].value_counts()['ham']

# YOUR CODE: use this to calculate the prior probability of a message being spam / ham
prior_spam = n_spam / (n_spam + n_ham)
prior_ham = 1 - prior_spam

print("P(S) = " + str(prior_spam) + ", P(H) = " +str(prior_ham))


P(S) = 0.1359658963428315, P(H) = 0.8640341036571685


#### Create a lookup table that contains the likelihood of each word given that a message is spam $P(w|S)$ or ham $P(w|H)$


In [5]:
ds_spam = train_data.loc[train_data['label'] == 'spam']
ds_ham = train_data.loc[train_data['label'] == 'ham']


In [6]:
# get a list of all the words in the train data
vocabulary = list(set(train_data['content'].sum()))

# create a Pandas dataframe that we can use as lookup table for the likelihoods:
Likelihoods = pd.DataFrame(np.zeros((len(vocabulary),2)), 
                           index = vocabulary, columns =["p(w|S)","p(w|H)"])

# YOUR CODE: fill in the table. Hint: the pandas iterrows() function might be useful


for word in vocabulary:
    
    # count the number of spam / ham messages that contain the word
    count_spam = sum(word in s for s in ds_spam['content'])
    count_ham = sum(word in s for s in ds_ham['content'])
    
    # write with .loc: chained indexing may assign to a temporary copy
    Likelihoods.loc[word, 'p(w|S)'] = count_spam / n_spam
    Likelihoods.loc[word, 'p(w|H)'] = count_ham / n_ham
    


#### Get the Likelihoods for *money*

If you did the last step correctly, these can simply be looked up from the Likelihoods table

In [7]:
# YOUR CODE: get the likelihoods for the word money
likelihood_spam = Likelihoods['p(w|S)']['money']
likelihood_ham = Likelihoods['p(w|H)']['money']

In [10]:
Likelihoods

| | p(w\|S) | p(w\|H) |
|---|---|---|
| flood | 0.00000 | 0.000260 |
| cup | 0.00495 | 0.000779 |
| pride | 0.00000 | 0.000519 |
| 08718738034 | 0.00165 | 0.000000 |
| blame | 0.00000 | 0.000779 |
| ... | ... | ... |
| strewn | 0.00000 | 0.000260 |
| wc1n | 0.00165 | 0.000000 |
| received | 0.00330 | 0.000519 |
| sec | 0.00000 | 0.001039 |
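
A table like this can also be built in a single pass over the messages instead of scanning all messages once per word. The sketch below uses `collections.Counter` on a toy dataset; the function name and the toy messages are assumptions for illustration, and the result is a plain dictionary rather than a `DataFrame`.

```python
from collections import Counter


def word_likelihoods(messages, labels):
    """Return {word: (P(w|spam), P(w|ham))} from tokenized messages."""
    n_per_class = Counter(labels)  # number of messages per label
    doc_freq = {"spam": Counter(), "ham": Counter()}
    for tokens, label in zip(messages, labels):
        # set(): a word counts at most once per message
        doc_freq[label].update(set(tokens))
    vocab = set(doc_freq["spam"]) | set(doc_freq["ham"])
    return {
        w: (doc_freq["spam"][w] / n_per_class["spam"],
            doc_freq["ham"][w] / n_per_class["ham"])
        for w in vocab
    }


# Toy data, not taken from the SMS dataset
toy_messages = [["win", "money", "now"], ["money", "money", "prize"],
                ["see", "you", "now"], ["send", "money", "home"], ["call", "me"]]
toy_labels = ["spam", "spam", "ham", "ham", "ham"]

toy_table = word_likelihoods(toy_messages, toy_labels)
print(toy_table["money"], toy_table["call"])
```

As in the loop-based version, the likelihood is the fraction of messages of a class that contain the word, so repeated occurrences within one message do not count twice.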


#### Calculate posterior probability

In [11]:
# YOUR CODE: Use Bayes rule to calculate the posterior probabilities of a message 
# being spam or ham, given that it contains the word 'money'

# Bayes' rule: P(S|w) = P(w|S) * P(S) / P(w)

# P(w), the evidence, via the law of total probability
evidence_w = likelihood_spam * prior_spam + likelihood_ham * prior_ham
posterior_spam = prior_spam * likelihood_spam / evidence_w
posterior_ham = prior_ham * likelihood_ham / evidence_w

print("P(S|w) = " + str(posterior_spam) + ", P(H|w) = " +str(posterior_ham))


P(S|w) = 0.06249999999999999, P(H|w) = 0.9375


### Posterior probability of a message being spam (2/2)
In practice, messages contain more than one word. We will make the simplifying assumption that all words are independent given the label (which of course is not actually true!). This is called a Naive Bayes approach.

For a message containing $n$ words $w_i$, we are now interested in:

$P(S|w_1, w_2, \dots, w_n)$

#### Use independence to rewrite this in terms of the likelihoods and priors of individual words.
$P(S|w_1, w_2, \dots, w_n) \propto P(S) \cdot \prod_{i=1}^{n} P(w_i | S)$
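
The proportionality can be turned into proper probabilities by computing the same product for ham and dividing both by their sum. The following sketch uses an invented likelihood table and invented priors; none of the numbers or names come from the dataset.

```python
import math

# Hypothetical likelihoods: word -> (P(w|S), P(w|H))
toy_likelihoods = {
    "free": (0.20, 0.02),
    "call": (0.30, 0.05),
    "later": (0.01, 0.04),
}
toy_prior_spam, toy_prior_ham = 0.13, 0.87


def naive_bayes_posterior(message):
    """Return (P(S|message), P(H|message)) under the independence assumption."""
    joint_spam = toy_prior_spam * math.prod(toy_likelihoods[w][0] for w in message)
    joint_ham = toy_prior_ham * math.prod(toy_likelihoods[w][1] for w in message)
    total = joint_spam + joint_ham  # plays the role of the evidence
    return joint_spam / total, joint_ham / total


p_spam, p_ham = naive_bayes_posterior(["free", "call"])
print(p_spam, p_ham)
```

Dividing by the sum of the two unnormalized posteriors guarantees that the results add up to 1.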

#### Now you are ready to calculate the probability of a message being spam, given all words in this message

In [43]:
# sample a random message from the train set
message =  list(train_data.sample(axis='index')['content'])[0]
print(message)
# YOUR CODE: calculate posterior probability of this message being spam, given all words in it

# denominator = np.prod([Likelihoods['p(w|S)'][word] * prior_spam + Likelihoods['p(w|H)'][word] * prior_ham 
#                       for word in message])

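def normalize(a, b):
    # rescale two unnormalized posteriors so that they sum to 1
    total = a + b
    if total == 0:
        # both products are zero: fall back to an uninformative 50/50
        return np.array([0.5, 0.5])
    return np.array([a, b]) / total
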

posterior_spam = prior_spam * np.prod([Likelihoods['p(w|S)'][word] for word in message])
posterior_ham = prior_ham * np.prod([Likelihoods['p(w|H)'][word] for word in message])

posterior_spam, posterior_ham = normalize(posterior_spam, posterior_ham)

print("P(S|W) = " + str(posterior_spam) + ", P(H|W) = " +str(posterior_ham))



['aiyar', 'hard', '2', 'type', 'u', 'later', 'free', 'then', 'tell', 'me', 'then', 'i', 'call', 'n', 'scold', 'n', 'tell', 'u']
P(S|W) = 0.0, P(H|W) = 1.0


### Now we can create a classifier

Complete the following code to create a Bayesian spam filter. Note that, for this exercise, it is okay to ignore words not in the likelihood table. Furthermore, you do not need to calculate the evidence $P(w)$ for judging whether a message is more likely to be spam or ham.
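
One possible reading of both remarks is sketched below with a plain dictionary instead of the `Likelihoods` table: words without an entry are skipped, and the two unnormalized scores are compared directly. The function name, the toy table and the priors are assumptions for illustration.

```python
import math


def classify(message, likelihoods, prior_spam, prior_ham):
    """Return 1 for spam and 0 for ham; words missing from the table are skipped."""
    known = [w for w in message if w in likelihoods]
    score_spam = prior_spam * math.prod(likelihoods[w][0] for w in known)
    score_ham = prior_ham * math.prod(likelihoods[w][1] for w in known)
    # Dividing both scores by the same evidence would not change the comparison.
    return int(score_spam > score_ham)


# Hypothetical likelihoods: word -> (P(w|S), P(w|H))
toy_table = {"free": (0.20, 0.02), "later": (0.01, 0.04)}

print(classify(["free", "xyzzy"], toy_table, 0.13, 0.87))
print(classify(["later", "xyzzy"], toy_table, 0.13, 0.87))
```

A message without any known word falls back to comparing the priors, so it is labeled ham whenever ham is the more frequent class.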

In [44]:
class Classifier():
    """
    A Naive Bayes classifier
    """
    def __init__(self, Likelihoods, prior_spam, prior_ham):
        self.Likelihoods = Likelihoods
        self.prior_spam = prior_spam
        self.prior_ham = prior_ham
        
    def classify_message(self, message):
        # YOUR CODE: calculate posterior probabilities of a message being spam
        # If it is more likely to be spam, return 1, else return 0
        
        posterior_spam = self.prior_spam * np.prod([self.Likelihoods['p(w|S)'][word] for word in message])
        posterior_ham = self.prior_ham * np.prod([self.Likelihoods['p(w|H)'][word] for word in message])

        return np.argmax([posterior_ham, posterior_spam])

        

In [51]:
classifier = Classifier(Likelihoods, prior_spam, prior_ham)

# YOUR CODE: calculate accuracy of the classifier on the train and test sets
labels = ['ham', 'spam']
train_correct = 0
test_correct = 0

# Classify all messages in the train set:
for _, row in train_data.iterrows():
    
    try:
        pred = classifier.classify_message(row['content'])
    
        if labels[pred] == row['label']:
            train_correct += 1
    except KeyError:
        # a word is missing from the likelihood table; the message counts as misclassified
        print('Unknown word inside message. Skipping datapoint')

train_acc = train_correct / len(train_data)

# Classify all messages in the test set:
for _, row in test_data.iterrows():
    
    try:
        pred = classifier.classify_message(row['content'])
        if labels[pred] == row['label']:
            test_correct += 1

    except KeyError:
        # a word is missing from the likelihood table; the message counts as misclassified
        print('Unknown word inside message. Skipping datapoint')

test_acc = test_correct / len(test_data)

print(f"train accuracy: {train_acc}, test accuracy: {test_acc}")

Unknown word inside message. Skipping datapoint

### Bonus (optional)

Can you increase the classification accuracy of the Naive Bayes method? *Hint*: look at the likelihood computation for words that only ever occur in one of the two labels. Can you somehow regularize the likelihoods?

We could smooth the probability distribution of a word over the $k$ classes we would like to predict. This should give more accurate predictions, since the results are then less sensitive to missing occurrences.

Suppose that we have a spam SMS that contains a word which has never been seen in our training dataset. Our classifier treats the probability of this word occurring as 0, which is not incorrect based on the training data we have seen. As a result, the classifier cannot decide whether the instance is spam or ham. We can see this very clearly by looking at the accuracy for the test set in the previous cell. Assigning a very small default probability to any word, or smoothing the probabilities, could help.
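
One common way to implement this idea is additive (Laplace) smoothing, which is an assumption about what the hint is aiming at rather than the required solution. The sketch below adds a pseudo-count `alpha` to the "message contains the word" counts and sums logarithms instead of multiplying probabilities, so that long messages do not underflow. All counts, priors and names are invented for illustration.

```python
import math


def smoothed_likelihood(count: int, n_messages: int, alpha: float = 1.0) -> float:
    """Additive smoothing for the binary event 'message contains the word'."""
    return (count + alpha) / (n_messages + 2 * alpha)


def log_score(message, counts, n_messages, prior, alpha=1.0):
    """log P(class) + sum_i log P(w_i | class), using smoothed likelihoods."""
    return math.log(prior) + sum(
        math.log(smoothed_likelihood(counts.get(w, 0), n_messages, alpha))
        for w in message
    )


# Hypothetical per-class counts of messages containing each word
toy_spam_counts = {"free": 120, "call": 200}
toy_ham_counts = {"call": 150, "later": 90}
toy_n_spam, toy_n_ham = 600, 3850

toy_message = ["free", "call", "unseenword"]
score_spam = log_score(toy_message, toy_spam_counts, toy_n_spam, prior=0.135)
score_ham = log_score(toy_message, toy_ham_counts, toy_n_ham, prior=0.865)
print("spam" if score_spam > score_ham else "ham")
```

With `alpha > 0` no likelihood is exactly 0, so a single unseen word can no longer force a class score to zero or make the message unclassifiable.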