# INFO371 Problem Set 6: Naïve Bayes

* Stanley Susanto
* Ratul Jain


In [1]:
#import libraries needed
import numpy as np
import pandas as pd
import textwrap
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [2]:
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning) 

## 1. Load data and clean data

### 1. Load the lingspam-emails.csv.bz2 dataset.

In [3]:
# load the data
df = pd.read_csv("../data/lingspam-emails.csv.bz2", sep="\t")

#Browse a handful of emails, both spam and non-spam ones
df.sample(6)

Unnamed: 0,spam,files,message
1571,False,9-1091msg1.txt,Subject: new book from mitwpl : phonology - sy...
1542,False,9-1055msg1.txt,"Subject: available for review : semantics , le..."
565,True,spmsgc86.txt,Subject: april vip specials from legend micro ...
2662,False,9-1819msg1.txt,Subject: when voices clash : a study in litera...
943,False,6-1118msg3.txt,Subject: looking for horn 's negation book i ...
2761,False,9-496msg1.txt,Subject: workshop on modality ( second call fo...


In [4]:
df.shape

(2893, 3)

In [5]:
#textwrap.wrap(text = df.iloc[1140].message), textwrap.wrap(text = df.iloc[2127].message) #print long strings on multiple lines

#### Ensure the data is clean: remove all cases with missing spam and empty message field. We do not care about the file names.

In [6]:
df = df.dropna(subset=['spam','message']) #drop NA from spam and message
df.shape

(2893, 3)

The number of rows is the same (2893) before and after dropping missing values in the spam and message columns, so neither field has missing entries.

### 2. Vectorize emails so you have a DTM (I’ll refer to this as the design matrix X) and the spam/non-spam indicator y.

In [7]:
# define a binary bag-of-words vectorizer (1 = word present, 0 = word absent)
vectorizer = CountVectorizer(binary=True)
# vectorize the messages. Note: this creates a sparse matrix,
# use .toarray() if you run into trouble
X = vectorizer.fit_transform(df.message)
y = df.spam
# the actual words, in column order
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
vocabulary = vectorizer.get_feature_names_out()

In [8]:
X

<2893x60925 sparse matrix of type '<class 'numpy.int64'>'
	with 636763 stored elements in Compressed Sparse Row format>

In [9]:
different_emails = len(pd.unique(df['message']))
different_emails

2876

In [10]:
words = df['message'].str.lower().str.split()
different_words = words.apply(set).apply(len)
different_words.sum()

688548

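There are 2876 distinct emails in the data. The figure 688548 is the sum of the per-email unique word counts, so a word is counted once for every email it appears in; it is not the number of distinct words. The vocabulary built by `CountVectorizer` above has 60925 distinct tokens.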
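The number of distinct words in the whole corpus is the size of the union of the per-email word sets, not the sum of their sizes. A minimal sketch on a hypothetical three-message corpus (the list `messages` stands in for `df['message']` and is not part of the data):

```python
# Toy stand-in for df['message'] (hypothetical data)
messages = [
    "Subject: call for papers",
    "Subject: free money call now",
    "subject: papers on syntax",
]

# Per-email sets of lower-cased, whitespace-separated tokens
word_sets = [set(m.lower().split()) for m in messages]

# Sum of per-email set sizes: counts a word once per email that contains it
sum_of_sizes = sum(len(s) for s in word_sets)

# Size of the union: number of distinct words across all emails
distinct_words = len(set().union(*word_sets))

print(sum_of_sizes, distinct_words)
```

Whitespace splitting keeps punctuation attached to tokens, so this count will generally differ from the vocabulary size that `CountVectorizer` reports.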

### 3. Split data into training/validation chunks.

In [11]:
Xt, Xv, yt, yv = train_test_split(X, y, test_size = 0.2) #split into training and validation data

### 4. Design a scheme to name your variables so you can understand from the variable name which mathematical concept it refers to.

- P(S = 1): Probability that an email is spam -> P_S1
- P(S = 0): Probability that an email is non-spam -> P_S0
- P(W = 1): Probability that the word is present in an email -> P_W1
- P(W = 1|S = 1): Probability that the word is present, given that the email is spam -> P_W1_S1
- P(W = 1|S = 0): Probability that the word is present, given that the email is non-spam -> P_W1_S0
- P(W = 0): Probability that the word is absent from an email -> P_W0
- P(W = 0|S = 1): Probability that the word is absent, given that the email is spam -> P_W0_S1
- P(W = 0|S = 0): Probability that the word is absent, given that the email is non-spam -> P_W0_S0
- logP(S = 1): Log probability that an email is spam -> logP_S1
- logP(S = 0): Log probability that an email is non-spam -> logP_S0
- logP(W = 1): Log probability that the word is present in an email -> logP_W1
- logP(W = 1|S = 1): Log probability that the word is present, given that the email is spam -> logP_W1_S1
- logP(W = 1|S = 0): Log probability that the word is present, given that the email is non-spam -> logP_W1_S0
- logP(W = 0): Log probability that the word is absent from an email -> logP_W0
- logP(W = 0|S = 1): Log probability that the word is absent, given that the email is spam -> logP_W0_S1
- logP(W = 0|S = 0): Log probability that the word is absent, given that the email is non-spam -> logP_W0_S0

## 2. Naïve Bayes

### 1. What do these numbers show:

In [12]:
X[946:949, 30037:30042].toarray()

array([[0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0]], dtype=int64)

#### a) which emails do the rows correspond to?

#### b) which words do the columns correspond to?

#### c) what does the single “1” in the middle of the table mean?

#### d) what do the zeros mean?
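For questions a) to d): the rows of `X[946:949, 30037:30042]` are the emails at positions 946 to 948 of `df`, and the columns are the vocabulary entries at positions 30037 to 30041. The single 1 says that the word in column 30039 occurs at least once in email 947 (the vectorizer is binary), and each 0 says that the word does not occur in that email. The lookup can be sketched on a hypothetical toy corpus, where the names `emails` and `block` are illustrative only:

```python
# Toy corpus standing in for df.message (hypothetical data, not the lingspam emails)
emails = ["free money now", "syntax workshop", "call for papers on syntax"]

# Sorted vocabulary: one column per distinct word
vocabulary = sorted({w for e in emails for w in e.split()})

# Binary document-term matrix: X[i][j] = 1 if word j occurs in email i
X = [[int(w in e.split()) for w in vocabulary] for e in emails]

# A small window of the matrix, analogous to X[946:949, 30037:30042]
rows, cols = slice(0, 2), slice(3, 6)
block = [r[cols] for r in X[rows]]

print(emails[rows])       # the emails the rows correspond to
print(vocabulary[cols])   # the words the columns correspond to
print(block)              # 1 = word present in the email, 0 = absent
```

In the notebook itself, the analogous lookups would be `df.iloc[946:949]` for the rows and `vocabulary[30037:30042]` for the columns.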

### 2. What is the accuracy of the naive model that predicts all emails into the majority category?
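The majority-class model labels every email as non-spam, so its accuracy equals the share of non-spam emails. A minimal sketch, assuming the validation-set class counts that appear in the confusion matrix of step 8 (these depend on the random split and will vary between runs):

```python
# Validation class counts taken from the rows of the confusion matrix in step 8
n_nonspam = 489 + 0   # true non-spam emails
n_spam = 79 + 11      # true spam emails

# Predicting the majority class for every email is right for that class only
majority_accuracy = max(n_nonspam, n_spam) / (n_nonspam + n_spam)
print(round(majority_accuracy, 3))
```

With these counts the baseline is about 0.845, only slightly below the 0.864 that the unsmoothed model reaches.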

### 3. Compute the unconditional (log) probability that the email is spam/non-spam, log Pr(S = 1), and log Pr(S = 0). These probabilities are based on the values of y (i.e. spam) alone. They do not contain information about the words in emails.

In [13]:
P_S1 = np.mean(yt == 1)
P_S0 = 1 - P_S1

logP_S1 = np.log(P_S1)
logP_S0 = np.log(P_S0)

logP_S1, logP_S0

(-1.7780253477682562, -0.1850911621648422)

### 4. For each word w, compute the (log) probability that the word is present in spam emails, log Pr(W = 1|S = 1), and (log) probability that the word is present in non-spam emails, log Pr(W = 1|S = 0). These probabilities can easily be calculated from counts of how many times these words are present for each class.

In [14]:
P_W1_S1 = np.mean(Xt[yt == 1], axis = 0)
P_W1_S0 = np.mean(Xt[yt == 0], axis = 0)

logP_W1_S1 = np.log(P_W1_S1)
logP_W1_S0 = np.log(P_W1_S0)

logP_W1_S1, logP_W1_S0

(matrix([[-1.14842599, -1.18121582, -5.96870756, ...,        -inf,
          -5.96870756,        -inf]]),
 matrix([[-1.76254909, -3.05078224, -7.56164175, ..., -7.56164175,
                 -inf,        -inf]]))

### 5. For both classes, S = 1 and S = 0, compute the log-likelihood that the email belongs to this class.

In [15]:
S1 = Xv @ logP_W1_S1.T + logP_S1
S0 = Xv @ logP_W1_S0.T + logP_S0
S1, S0

(matrix([[          -inf],
         [          -inf],
         [          -inf],
         ...
[output truncated: all 37 displayed entries of S1 are -inf]
 

### 6. How many log-likelihoods do you have to compute? Explain why you need this many.
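Each validation email needs one log-likelihood per class, because the prediction comes from comparing the spam value with the non-spam value for that email. A short sketch of the count, assuming the validation-set size implied by the confusion matrix in step 8:

```python
# Number of validation emails, summed from the confusion matrix in step 8
n_validation = 489 + 0 + 79 + 11

# Two classes: S = 1 (spam) and S = 0 (non-spam)
n_classes = 2

# One log-likelihood per email and class
n_loglik = n_validation * n_classes
print(n_loglik)
```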

### 7. Based on the log-likelihoods, predict the class S = 1 or S = 0 for each email in the validation set.

In [16]:
yhat = S1 > S0
yhat

matrix([[False],
        [False],
        [False],
        ...
[output truncated: of the 59 displayed predictions, only one is True]

### 8. Print the resulting confusion matrix and accuracy (feel free to use existing libraries).

In [17]:
confusion_matrix(yv, yhat)

array([[489,   0],
       [ 79,  11]], dtype=int64)

In [18]:
accuracy_score(yv, yhat)

0.8635578583765112

### 9. If your results are like mine, you can see that they are not impressive at all: your model works no better than the naive guess. Explain why you get such mediocre results.
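A likely explanation is the zero probabilities. A word that never occurs in one class of the training data gets probability 0 for that class, and log(0) is -inf. Any validation email containing such a word then has a log-likelihood of -inf for that class. When both classes end up at -inf, the comparison `S1 > S0` is False and the email is labelled non-spam, which is what the majority guess does anyway. A small sketch with hypothetical probabilities:

```python
import numpy as np

# Hypothetical Pr(W = 1 | S = 1) for three words;
# the last word never occurred in spam in the training data.
P_W1_S1 = np.array([0.30, 0.05, 0.0])

with np.errstate(divide="ignore"):   # silence the log(0) warning
    logP_W1_S1 = np.log(P_W1_S1)     # last entry is -inf

# A validation email that contains all three words
x = np.array([1, 1, 1])

# Sum the log-probabilities of the words that are present
S1 = logP_W1_S1[x == 1].sum()

# Assume the non-spam class also hit a word it has never seen
S0 = -np.inf

print(S1)        # a single zero probability drives the whole sum to -inf
print(S1 > S0)   # -inf is not greater than -inf, so the prediction is non-spam
```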

## 3. (32pt) Add smoothing

### 1. (2pt) As you will be doing validation below, your first task is to mold what you did above into two functions: one for fitting and another one for predicting.

### 2. (18pt) Add smoothing to the model. Smoothing amounts to assuming that we have “seen” every possible word α ⩾ 0 times already, in both spam and non-spam emails.
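With smoothing, the estimate for a word becomes (count + α) / (n + 2α), where count is the number of emails of the class that contain the word and n is the number of emails in that class. The 2α in the denominator accounts for α extra emails with the word and α extra emails without it. A worked example with hypothetical counts, showing that the log stays finite for a word never seen in the class:

```python
import math

alpha = 0.001      # smoothing parameter
n_spam = 90        # hypothetical number of spam emails in the training data
n_with_word = 0    # the word never appears in a spam email

# Without smoothing the estimate is exactly zero, and its log would be -inf
unsmoothed = n_with_word / n_spam

# With smoothing the estimate is small but strictly positive
smoothed = (n_with_word + alpha) / (n_spam + 2 * alpha)

print(unsmoothed)
print(smoothed, math.log(smoothed))
```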

In [19]:
def fitting(At, bt, α):
    # smoothed class priors: each class is "seen" α extra times
    n = bt.shape[0]
    P_S1 = (np.sum(bt == 1) + α) / (n + 2 * α)
    P_S0 = (np.sum(bt == 0) + α) / (n + 2 * α)

    # smoothed word probabilities: in each class, every word is
    # "seen" α extra times and "not seen" α extra times
    P_W1_S1 = (np.sum(At[bt == 1], axis=0) + α) / (np.sum(bt == 1) + 2 * α)
    P_W1_S0 = (np.sum(At[bt == 0], axis=0) + α) / (np.sum(bt == 0) + 2 * α)

    return P_S1, P_S0, P_W1_S1, P_W1_S0


In [20]:
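def predicting(P_S1, P_S0, P_W1_S1, P_W1_S0, Av, yv):
    # work in logs so that products of many small probabilities become sums
    logP_S1 = np.log(P_S1)
    logP_S0 = np.log(P_S0)
    logP_W1_S1 = np.log(P_W1_S1)
    logP_W1_S0 = np.log(P_W1_S0)
    # log-likelihood of each class, summed over the words present in each email
    S1 = Av @ logP_W1_S1.T + logP_S1
    S0 = Av @ logP_W1_S0.T + logP_S0
    # predict spam where the spam log-likelihood is larger
    yhat = S1 > S0
    cm = confusion_matrix(yv, yhat)
    acc = accuracy_score(yv, yhat)
    return cm, acc

# fit on the training data with α = 0.001, then evaluate on the validation data
PS1, PS0, PW1S1, PW1S0 = fitting(Xt, yt, 0.001)
predicting(PS1, PS0, PW1S1, PW1S0, Xv, yv)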

(array([[487,   2],
        [  2,  88]], dtype=int64),
 0.9930915371329879)
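The same fit-then-predict pattern can be used to compare several values of α on the validation set. The sketch below is self-contained and runs on hypothetical toy data with dense arrays. The simplified helpers `fit` and `predict` are illustrative stand-ins, not the notebook's `fitting` and `predicting`, and the α grid is an arbitrary choice.

```python
import numpy as np

def fit(X, y, alpha):
    """Smoothed estimates of Pr(S=1) and Pr(W=1|S) from a dense binary matrix."""
    n1, n0 = np.sum(y == 1), np.sum(y == 0)
    p_s1 = (n1 + alpha) / (len(y) + 2 * alpha)
    p_w_s1 = (X[y == 1].sum(axis=0) + alpha) / (n1 + 2 * alpha)
    p_w_s0 = (X[y == 0].sum(axis=0) + alpha) / (n0 + 2 * alpha)
    return p_s1, p_w_s1, p_w_s0

def predict(X, p_s1, p_w_s1, p_w_s0):
    """Predict 1 where the spam log-likelihood (present words only) is larger."""
    s1 = X @ np.log(p_w_s1) + np.log(p_s1)
    s0 = X @ np.log(p_w_s0) + np.log(1 - p_s1)
    return (s1 > s0).astype(int)

# Hypothetical toy data: columns 0-1 occur mostly in spam, columns 2-3 in non-spam
Xt = np.array([[1, 1, 0, 0],
               [1, 0, 0, 0],
               [0, 0, 1, 1],
               [0, 0, 0, 1],
               [0, 1, 1, 1]])
yt = np.array([1, 1, 0, 0, 0])
Xv = np.array([[1, 1, 0, 0],
               [0, 0, 1, 1]])
yv = np.array([1, 0])

# Compare validation accuracy across a few smoothing values
for alpha in [0.001, 0.1, 1.0]:
    yhat = predict(Xv, *fit(Xt, yt, alpha))
    print(alpha, np.mean(yhat == yv))
```

On real data the accuracies would differ across α, and the value with the best validation accuracy would be the natural choice.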

## Finally