# DAT405 Introduction to Data Science and AI 
## 2020-2021, Reading Period 2
## Assignment 4: Spam classification using Naïve Bayes 
There will be an overall grade for this assignment. To get a pass grade (grade 5), you need to pass items 1-3 below. To receive higher grades, finish items 4 and 5 as well. 

The exercise takes place in a notebook environment where you can choose to use Jupyter or Google Colab. We recommend Google Colab, as it facilitates remote group work and makes the assignment less technical. 
Hints:
You can execute certain Linux shell commands by prefixing the command with `!`. You can insert Markdown cells and code cells: use the former to document and explain your results, and the latter to write the code snippets that carry out the required tasks.  

In this assignment you will implement a Naïve Bayes classifier in Python that will classify emails into spam and non-spam (“ham”) classes.  Your program should be able to train on a given set of spam and “ham” datasets. 
You will work with the datasets available at https://spamassassin.apache.org/old/publiccorpus/. There are three types of files in this location: 
-	easy-ham: non-spam messages typically quite easy to differentiate from spam messages. 
-	hard-ham: non-spam messages more difficult to differentiate 
-	spam: spam messages 

**Execute the cell below to download and extract the data into the environment of the notebook -- it will take a few seconds.** If you choose to use Jupyter notebooks, you will have to run the commands in the cell below on your local computer. On Windows, you can use 7zip (https://www.7-zip.org/download.html) to decompress the data.
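
For a local Jupyter setup without `wget` or `tar`, the same download can be done from Python. The following is a minimal sketch using only the standard library; the helper name `download_and_extract` and the constants `BASE_URL` and `ARCHIVES` are illustrative and not part of the assignment. It skips archives that are already on disk, which avoids collecting numbered copies when the cell is run several times.

```python
import tarfile
import urllib.request
from pathlib import Path

BASE_URL = "https://spamassassin.apache.org/old/publiccorpus/"
ARCHIVES = [
    "20021010_easy_ham.tar.bz2",
    "20021010_hard_ham.tar.bz2",
    "20021010_spam.tar.bz2",
]


def download_and_extract(url, dest_dir="."):
    """Download a .tar.bz2 archive (unless it is already present) and unpack it."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # The archive keeps the last path component of the URL as its file name.
    archive = dest_dir / url.rsplit("/", 1)[-1]
    if not archive.exists():
        urllib.request.urlretrieve(url, archive)
    # "r:bz2" opens a bzip2-compressed tar file for reading.
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall(dest_dir)
    return archive
```

Calling `download_and_extract(BASE_URL + name)` for each `name` in `ARCHIVES` should leave the folders `easy_ham`, `hard_ham`, and `spam` in the current directory.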



In [None]:
#Download and extract data
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
!wget https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
!tar -xjf 20021010_easy_ham.tar.bz2
!tar -xjf 20021010_hard_ham.tar.bz2
!tar -xjf 20021010_spam.tar.bz2

--2020-12-02 22:53:30--  https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 40.79.78.1, 95.216.24.32, 95.216.26.30, ...
Connecting to spamassassin.apache.org (spamassassin.apache.org)|40.79.78.1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1677144 (1.6M) [application/x-bzip2]
Saving to: ‘20021010_easy_ham.tar.bz2.6’


2020-12-02 22:53:31 (2.95 MB/s) - ‘20021010_easy_ham.tar.bz2.6’ saved [1677144/1677144]

--2020-12-02 22:53:31--  https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 95.216.26.30, 40.79.78.1, 95.216.24.32, ...
Connecting to spamassassin.apache.org (spamassassin.apache.org)|95.216.26.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1021126 (997K) [application/x-bzip2]
Saving to: ‘20021010_hard_ham.tar.bz2.6’


2020-12-02 22:53:31 (5.34 MB/s) - ‘20

The data is now in the three folders `easy_ham`, `hard_ham`, and `spam`.

In [None]:
!ls -lah

total 27M
drwxr-xr-x 1 root root 4.0K Dec  2 22:53 .
drwxr-xr-x 1 root root 4.0K Dec  2 20:21 ..
-rw-r--r-- 1 root root 1.6M Jun 29  2004 20021010_easy_ham.tar.bz2
-rw-r--r-- 1 root root 1.6M Jun 29  2004 20021010_easy_ham.tar.bz2.1
-rw-r--r-- 1 root root 1.6M Jun 29  2004 20021010_easy_ham.tar.bz2.2
-rw-r--r-- 1 root root 1.6M Jun 29  2004 20021010_easy_ham.tar.bz2.3
-rw-r--r-- 1 root root 1.6M Jun 29  2004 20021010_easy_ham.tar.bz2.4
-rw-r--r-- 1 root root 1.6M Jun 29  2004 20021010_easy_ham.tar.bz2.5
-rw-r--r-- 1 root root 1.6M Jun 29  2004 20021010_easy_ham.tar.bz2.6
-rw-r--r-- 1 root root 998K Dec 16  2004 20021010_hard_ham.tar.bz2
-rw-r--r-- 1 root root 998K Dec 16  2004 20021010_hard_ham.tar.bz2.1
-rw-r--r-- 1 root root 998K Dec 16  2004 20021010_hard_ham.tar.bz2.2
-rw-r--r-- 1 root root 998K Dec 16  2004 20021010_hard_ham.tar.bz2.3
-rw-r--r-- 1 root root 998K Dec 16  2004 20021010_hard_ham.tar.bz2.4
-rw-r--r-- 1 root root 998K Dec 16  2004 20021010_hard_ham.tar.bz2.5
-rw-r--r--

### 1. Preprocessing:
1.	Note that the email files contain a lot of extra information, besides the actual message. Ignore that for now and run on the entire text. Further down (in the higher-grade part), you will be asked to filter out the headers and footers. 
2.	We don’t want to train and test on the same data. Split the spam and the ham datasets into a training set and a test set (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`).


In [None]:
# Pre-processing code: list the e-mail files and split them into training and test sets.
# The working directory is '/content' (check with os.getcwd()).

import os
import random

####### Create hamtrain, spamtrain, hamtest and spamtest #######
# Only the file names are stored here; the texts are read later.
easy_ham_data = os.listdir('easy_ham')
hard_ham_data = os.listdir('hard_ham')
spam_data = os.listdir('spam')

# Shuffle before splitting. No seed is set, so each run gives a different split.
random.shuffle(easy_ham_data)
random.shuffle(hard_ham_data)
random.shuffle(spam_data)

# Training data 70%, test data 30%.
# Note: the test slices start at cut+1, so the file at index `cut`
# ends up in neither the training set nor the test set.

# Easy ham
cut = round(0.7*len(easy_ham_data))
hamtrain = easy_ham_data[:cut]
hamtest = easy_ham_data[cut+1:]

# Hard ham
cut = round(0.7*len(hard_ham_data))
hamtrain_hard = hard_ham_data[:cut]
hamtest_hard = hard_ham_data[cut+1:]

# Spam
cut = round(0.7*len(spam_data))
spamtrain = spam_data[:cut]
spamtest = spam_data[cut+1:]
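
A reproducible variant of the 70/30 split could look like the sketch below. The function name `train_test_split_names` and the `seed` parameter are assumptions made for illustration. The list is sorted first so the result does not depend on the order returned by `os.listdir`, and both halves are cut at the same index so every file lands in exactly one set.

```python
import random


def train_test_split_names(names, train_fraction=0.7, seed=0):
    """Shuffle a copy of `names` with a fixed seed and cut it once."""
    shuffled = sorted(names)               # fixed starting order
    random.Random(seed).shuffle(shuffled)  # private generator, global state untouched
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split_names([f"msg{i:03d}" for i in range(10)])
print(len(train), len(test))  # 7 3
```

With the notebook's variables this would be called as `hamtrain, hamtest = train_test_split_names(os.listdir('easy_ham'))`, and likewise for the other two folders.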

In [None]:

len(hamtest_hard)

74

I tried to load the files into a data frame, that is, a table with one row per message holding its label and its text, but did not succeed. I tried `import glob` and a loop that reads the files one by one, and have given up for now. I hope to solve it before the deadline; once this works I can answer question 4.
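
One way to get such a table is sketched below, using `pathlib` instead of `glob`. The helper `load_labelled` and the record keys are hypothetical names; the encoding is the one used elsewhere in this notebook.

```python
from pathlib import Path


def load_labelled(folder, label, encoding="ISO-8859-1"):
    """Return one {'label', 'file', 'text'} record per regular file in `folder`."""
    records = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():  # skip sub-directories
            records.append({
                "label": label,
                "file": path.name,
                "text": path.read_text(encoding=encoding),
            })
    return records
```

The records from several folders can be concatenated, for example `rows = load_labelled('easy_ham', 'ham') + load_labelled('spam', 'spam')`. If pandas is installed, `pandas.DataFrame(rows)` turns the list into a table with the columns `label`, `file`, and `text`.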

### 2. Write a Python program that:
1.	Uses four datasets (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`) 
2.	Trains a Naïve Bayes classifier (e.g. Sklearn) on `hamtrain` and `spamtrain`, classifies the test sets, and reports True Positive and False Negative rates on the `hamtest` and `spamtest` datasets. You can use `CountVectorizer` to transform the email texts into vectors. Please note that there are different types of Naïve Bayes classifiers in Sklearn ([Documentation here](https://scikit-learn.org/stable/modules/naive_bayes.html)). Test two of these classifiers that are well suited for this problem:
- Multinomial Naive Bayes  
- Bernoulli Naive Bayes. 

Please inspect the documentation to ensure input to the classifiers is appropriate. Discuss the differences between these two classifiers. 
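
To make concrete what "transforming email texts into vectors" means, here is a small bag-of-words sketch in plain Python. It imitates what we understand to be the default behaviour of `CountVectorizer` (lower-casing, tokens of two or more word characters, alphabetically ordered vocabulary); it is an illustration, not the library's implementation, and the function names are made up for this example.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters


def tokenize(text):
    return TOKEN.findall(text.lower())


def fit_vocabulary(docs):
    """Map every word seen in `docs` to a column index (alphabetical order)."""
    words = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {word: i for i, word in enumerate(words)}


def to_count_vector(doc, vocab):
    """Count how often each vocabulary word occurs in `doc`; unknown words are ignored."""
    counts = Counter(tokenize(doc))
    return [counts.get(word, 0) for word in vocab]  # dict order equals column order


docs = ["Free money, free prizes", "Meeting at noon"]
vocab = fit_vocabulary(docs)
vectors = [to_count_vector(d, vocab) for d in docs]
print(list(vocab))  # ['at', 'free', 'meeting', 'money', 'noon', 'prizes']
print(vectors)      # [[0, 2, 0, 1, 0, 1], [1, 0, 1, 0, 1, 0]]
```

The vocabulary is learned from the training texts only; a test e-mail is then mapped onto the same columns, which is why the notebook calls `fit_transform` on the training data and `transform` on the test data.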





In [None]:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

# Read the training e-mails (easy ham, hard ham and spam) as raw text.
# ISO-8859-1 maps every byte to a character, so decoding never fails.
ham_train=[]
for name in hamtrain:
  with open('easy_ham/'+name, 'r', encoding="ISO-8859-1") as f:
    ham_train.append(f.read())

ham_train_hard=[]
for name in hamtrain_hard:
  with open('hard_ham/'+name, 'r', encoding="ISO-8859-1") as f:
    ham_train_hard.append(f.read())

spam_train=[]
for name in spamtrain:
  with open('spam/'+name, 'r', encoding="ISO-8859-1") as f:
    spam_train.append(f.read())

# Transform the texts into count vectors (CountVectorizer).
# The labels in the next cell follow the same order: easy ham, spam, hard ham.
X=[]
X.extend(ham_train)
X.extend(spam_train)
X.extend(ham_train_hard)
X_train = vectorizer.fit_transform(X)

type(X_train)

scipy.sparse.csr.csr_matrix

In [None]:

import numpy as np 

ytrain=['ham'] * len(ham_train)
ytrain.extend(['spam'] * len(spam_train))
ytrain.extend(['ham'] * len(ham_train_hard))

# Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
clf_multinom = MultinomialNB()
clf_multinom.fit(X_train, ytrain)

# Bernoulli Naive Bayes
from sklearn.naive_bayes import BernoulliNB
clf_bernoulli = BernoulliNB()
clf_bernoulli.fit(X_train, ytrain)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [None]:
# Now we run it on the test data
spam_test=[]
for name in spamtest:
  with open('spam/'+name, 'r', encoding="ISO-8859-1") as f:
    spam_test.append(f.read())

ham_test=[]
for name in hamtest:
  with open('easy_ham/'+name, 'r', encoding="ISO-8859-1") as f:
    ham_test.append(f.read())

ham_test_hard=[]
for name in hamtest_hard:
  with open('hard_ham/'+name, 'r', encoding="ISO-8859-1") as f:
    ham_test_hard.append(f.read())

# Same order as the training data: easy ham, spam, hard ham
X=[]
X.extend(ham_test)
X.extend(spam_test)
X.extend(ham_test_hard)

# Reuse the vocabulary learned from the training data
X_test = vectorizer.transform(X)

### Predict whether the e-mails in the test dataset are ham or spam
# Multinomial
predicted_multinom = clf_multinom.predict(X_test)

# Bernoulli
predicted_bernoulli=clf_bernoulli.predict(X_test)

# Labels of the test dataset, in the same order as X
ytest=['ham'] * len(ham_test)
ytest.extend(['spam'] * len(spam_test))
ytest.extend(['ham'] * len(ham_test_hard))

In [None]:
a_multinom=predicted_multinom==ytest
a_bernoulli=predicted_bernoulli==ytest
print("Accuracy of prediction (Multinomial):")
print(np.count_nonzero(a_multinom)/len(a_multinom)) 
print("Accuracy of prediction (Bernoulli):")
print(np.count_nonzero(a_bernoulli)/len(a_bernoulli)) 

Accuracy of prediction (Multinomial):
0.9766970618034447
Accuracy of prediction (Bernoulli):
0.878419452887538
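
The task asks for True Positive and False Negative rates, which accuracy alone does not show. A minimal sketch of how they could be computed from the label lists follows; treating `'spam'` as the positive class is our assumption, and `positive_rates` is a hypothetical helper name.

```python
def positive_rates(y_true, y_pred, positive="spam"):
    """Return (true positive rate, false negative rate) for the `positive` class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError(f"no samples of class {positive!r} in y_true")
    # Both rates are fractions of the truly positive samples, so they sum to 1.
    return tp / (tp + fn), fn / (tp + fn)


y_true = ["spam", "spam", "spam", "spam", "ham", "ham"]
y_pred = ["spam", "spam", "spam", "ham", "ham", "spam"]
tpr, fnr = positive_rates(y_true, y_pred)
print(f"TPR={tpr:.2f}  FNR={fnr:.2f}")  # TPR=0.75  FNR=0.25
```

In the notebook this could be called as `positive_rates(ytest, predicted_multinom)` and `positive_rates(ytest, predicted_bernoulli)`. Passing `positive='ham'` gives the corresponding rates for the ham class.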


**Comment:** Multinomial is quite a lot better than Bernoulli.

From the documentation we learn the following:

The Multinomial Naïve Bayes classifier models how often each word occurs in an e-mail (word counts).

The Bernoulli Naïve Bayes classifier models only whether each word occurs (a binary feature: 0 means the word does not occur, 1 means it occurs). `BernoulliNB` binarizes the count vectors itself (`binarize=0.0` in the output above), so the same `X_train` can be passed to both classifiers.

Our output shows that the Multinomial classifier (accuracy 0.977) performs clearly better than the Bernoulli classifier (accuracy 0.878) on this split.
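
The difference between the two event models can be seen on a toy example. The sketch below is a from-scratch illustration with Laplace smoothing (`alpha=1`), not scikit-learn's code; class priors are left out so that only the per-class log-likelihoods are compared, and all names and the toy data are invented for the example.

```python
import math
from collections import Counter


def train(docs_by_class, alpha=1.0):
    """docs_by_class maps a label to a list of token lists; returns smoothed parameters."""
    vocab = sorted({w for docs in docs_by_class.values() for d in docs for w in d})
    params = {}
    for label, docs in docs_by_class.items():
        word_counts = Counter(w for d in docs for w in d)      # total occurrences
        doc_counts = Counter(w for d in docs for w in set(d))  # documents containing w
        total = sum(word_counts.values())
        params[label] = {
            # Multinomial: share of all word occurrences in the class
            "theta": {w: (word_counts[w] + alpha) / (total + alpha * len(vocab))
                      for w in vocab},
            # Bernoulli: share of the class's documents that contain the word
            "p": {w: (doc_counts[w] + alpha) / (len(docs) + 2 * alpha) for w in vocab},
        }
    return vocab, params


def multinomial_loglik(doc, theta):
    # Every occurrence contributes; words outside the vocabulary are skipped.
    return sum(math.log(theta[w]) for w in doc if w in theta)


def bernoulli_loglik(doc, p):
    # Every vocabulary word contributes: present -> log p, absent -> log(1 - p).
    present = set(doc)
    return sum(math.log(p[w]) if w in present else math.log(1 - p[w]) for w in p)


training = {
    "spam": [["free", "money", "free"], ["free", "offer"]],
    "ham": [["meeting", "today"], ["lunch", "today", "offer"]],
}
vocab, params = train(training)
doc = ["free", "free", "free", "offer"]
for label in training:
    print(label,
          round(multinomial_loglik(doc, params[label]["theta"]), 3),
          round(bernoulli_loglik(doc, params[label]["p"]), 3))
```

Repeating "free" three times changes the Multinomial score but not the Bernoulli score, which only sees that the word is present. The Bernoulli model in turn also scores the words that are absent from the e-mail.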

### 3. Run your program on 
-	Spam versus easy-ham 
-	Spam versus hard-ham.

In [None]:
# Training and test data (easy ham)
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

# Transform e-mail texts from training data set into vectors (CountVectorizer)
ham_train=[]
for i in range(len(hamtrain)):
  f=open('easy_ham/'+hamtrain[i], 'r', encoding = "ISO-8859-1")
  ham_train.append(f.read())

spam_train=[]
for i in range(len(spamtrain)):
  f=open('spam/'+spamtrain[i], 'r', encoding = "ISO-8859-1")
  spam_train.append(f.read())

X=[]
X.extend(ham_train)
X.extend(spam_train)
X_train = vectorizer.fit_transform(X)

type(X_train)


import numpy as np 

ytrain=['ham'] * len(ham_train)
ytrain.extend(['spam'] * len(spam_train))

# Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
clf_multinom = MultinomialNB()
clf_multinom.fit(X_train, ytrain)

# Bernoulli Naive Bayes
from sklearn.naive_bayes import BernoulliNB
clf_bernoulli = BernoulliNB()
clf_bernoulli.fit(X_train, ytrain)

spam_test=[]
for i in range(len(spamtest)):
  f=open('spam/'+spamtest[i], 'r', encoding = "ISO-8859-1")
  spam_test.append(f.read())

ham_test=[]
for i in range(len(hamtest)):
  f=open('easy_ham/'+hamtest[i], 'r', encoding = "ISO-8859-1")
  ham_test.append(f.read())

X=[]
X.extend(ham_test)
X.extend(spam_test)

X_test = vectorizer.transform(X)

### Predict if e-mails in test dataset is ham or spam
# Multinomial
predicted_multinom = clf_multinom.predict(X_test)

# Bernoulli
predicted_bernoulli=clf_bernoulli.predict(X_test)

# Labels of the test dataset
ytest=['ham'] * len(ham_test)
ytest.extend(['spam'] * len(spam_test))

In [None]:
a_multinom=predicted_multinom==ytest
a_bernoulli=predicted_bernoulli==ytest
print("Accuracy of prediction (Multinomial) easy ham:")
print(np.count_nonzero(a_multinom)/len(a_multinom)) 
print("Accuracy of prediction (Bernoulli) easy ham:")
print(np.count_nonzero(a_bernoulli)/len(a_bernoulli)) 

Accuracy of prediction (Multinomial) easy ham:
0.9671412924424972
Accuracy of prediction (Bernoulli) easy ham:
0.8970427163198248


In [None]:
#hamtrain_hard := training data for hard_ham
#hamtest_hard := test data for hard_ham

# Training and test data (hard_ham)
ham_train_hard=[]
for i in range(len(hamtrain_hard)):
  f=open('hard_ham/'+hamtrain_hard[i], 'r', encoding = "ISO-8859-1")
  ham_train_hard.append(f.read())

ham_test_hard=[]
for i in range(len(hamtest_hard)):
  f=open('hard_ham/'+hamtest_hard[i], 'r', encoding = "ISO-8859-1")
  ham_test_hard.append(f.read())

# Vectorizer
X=[]
X.extend(ham_train_hard)
X.extend(spam_train)
X_train_hard = vectorizer.fit_transform(X)

# Labels
ytrain_hard=['ham'] * len(ham_train_hard)
ytrain_hard.extend(['spam'] * len(spam_train))

# Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
clf_multinom = MultinomialNB()
clf_multinom.fit(X_train_hard, ytrain_hard)

# Bernoulli Naive Bayes
from sklearn.naive_bayes import BernoulliNB
clf_bernoulli = BernoulliNB()
clf_bernoulli.fit(X_train_hard, ytrain_hard)

# Test data
X=[]
X.extend(ham_test_hard)
X.extend(spam_test)

X_test_hard = vectorizer.transform(X)

### Predict if e-mails in test dataset is ham or spam
# Multinomial
predicted_multinom = clf_multinom.predict(X_test_hard)

# Bernoulli
predicted_bernoulli=clf_bernoulli.predict(X_test_hard)

# Labels of the test dataset
ytest=['ham'] * len(ham_test_hard)
ytest.extend(['spam'] * len(spam_test))

In [None]:
a_multinom=predicted_multinom==ytest
a_bernoulli=predicted_bernoulli==ytest
print("Accuracy of prediction (Multinomial) hard ham:")
print(np.count_nonzero(a_multinom)/len(a_multinom)) 
print("Accuracy of prediction (Bernoulli) hard ham:")
print(np.count_nonzero(a_bernoulli)/len(a_bernoulli)) 

Accuracy of prediction (Multinomial) hard ham:
0.9327354260089686
Accuracy of prediction (Bernoulli) hard ham:
0.8834080717488789


In [None]:
# Training and test data (hard_ham)
ham_train_hard=[]
for i in range(len(hamtrain_hard)):
  f=open('hard_ham/'+hamtrain_hard[i], 'r', encoding = "ISO-8859-1")
  ham_train_hard.append(f.read())

ham_test_hard=[]
for i in range(len(hamtest_hard)):
  f=open('hard_ham/'+hamtest_hard[i], 'r', encoding = "ISO-8859-1")
  ham_test_hard.append(f.read())




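# Vectorizer
# `vectorizer` is still the unfiltered CountVectorizer() created above.
X=[]
X.extend(ham_train_hard)
X.extend(spam_train)

X_train_hard = vectorizer.fit_transform(X)
# Test data
X=[]
X.extend(ham_test_hard)
X.extend(spam_test)

X_test_hard = vectorizer.transform(X)

# This filtered vectorizer is created only after the matrices above were
# built, and it is fitted on the test texts. It is therefore never used by
# the classifiers below, and the results equal those of the previous cell.
# The filtering is applied to the training data in part 4.
vectorizer=CountVectorizer(min_df=2, max_df=0.7)
vectorizer.fit(X)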



# Labels
ytrain_hard=['ham'] * len(ham_train_hard)
ytrain_hard.extend(['spam'] * len(spam_train))

# Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
clf_multinom = MultinomialNB()
clf_multinom.fit(X_train_hard, ytrain_hard)

# Bernoulli Naive Bayes
from sklearn.naive_bayes import BernoulliNB
clf_bernoulli = BernoulliNB()
clf_bernoulli.fit(X_train_hard, ytrain_hard)



### Predict if e-mails in test dataset is ham or spam
# Multinomial
predicted_multinom = clf_multinom.predict(X_test_hard)

# Bernoulli
predicted_bernoulli=clf_bernoulli.predict(X_test_hard)

# Labels of the test dataset
ytest=['ham'] * len(ham_test_hard)
ytest.extend(['spam'] * len(spam_test))

a_multinom=predicted_multinom==ytest
a_bernoulli=predicted_bernoulli==ytest
print("Accuracy of prediction (Multinomial) hard ham:")
print(np.count_nonzero(a_multinom)/len(a_multinom)) 
print("Accuracy of prediction (Bernoulli) hard ham:")
print(np.count_nonzero(a_bernoulli)/len(a_bernoulli)) 

Accuracy of prediction (Multinomial) hard ham:
0.9327354260089686
Accuracy of prediction (Bernoulli) hard ham:
0.8834080717488789


**Comment:** Accuracy is lower for hard_ham than for easy_ham. Multinomial is once again better than Bernoulli.

When we run the program on spam versus easy-ham, Multinomial (0.967) is better than Bernoulli (0.897).

When we run the program on spam versus hard-ham, Multinomial (0.933) is still somewhat better than Bernoulli (0.883).



### 4. To avoid classification based on common and uninformative words, it is common to filter these out. 

**a.** Argue why this may be useful. Try finding the words that are too common/uncommon in the dataset. 

**b.** Use the parameters in Sklearn’s `CountVectorizer` to filter out these words. Update the program from point 3 and run it on your data and report your results.

You have two options to do this in Sklearn: either using the words found in part (a) or letting Sklearn do it for you. Argue for your decision-making.


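Answer to a: Words that are very common appear in almost every e-mail, ham and spam alike, so they say little about the class even though their counts still enter the classifier. Words that are very rare appear in too few e-mails to give reliable probability estimates. Filtering out both groups may therefore result in a larger difference between the ham and the spam elements in the dataset.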

In [None]:
def count_uncommon_common(data, return_res=False):
  # Transform the data into a document-term count matrix
  vectorizer = CountVectorizer()
  counts = vectorizer.fit_transform(data)

  # Total number of occurrences of each word over all documents
  sumw = counts.sum(axis=0)
  frequency = [(word, int(sumw[0,index])) for word, index in vectorizer.vocabulary_.items()]
  frequency = sorted(frequency, key=lambda x: x[1], reverse=True)
  if return_res:
    return frequency
  else: 
    print(f"The data had {len(frequency)} different words")
    print("The top 5 common words are:")
    print(*frequency[0:5])
    # Number of words whose total count is exactly 1
    once = sum(1 for _, n in frequency if n == 1)
    print(f"{once} words occur only once")

count_uncommon_common(ham_train+ham_train_hard+spam_train)

The data had 82833 different words
The top 5 common words are:
('com', 48049) ('the', 29044) ('to', 27225) ('http', 21926) ('td', 20314)


The top 5 common words are: com, the, to, http and td.

There are many uncommon words, for example: consumable, money2002, 200209140347, easydownline.

To do b) we let Sklearn do the filtering through the `min_df` and `max_df` parameters of `CountVectorizer`. We decided to remove the words that appear in fewer than 5 e-mails (fewer than 2 e-mails for hard ham) or in more than 75% of the e-mails.
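
As far as we understand the Sklearn documentation, `min_df` and `max_df` are thresholds on the document frequency, that is, the number (or fraction) of e-mails a word appears in, not on the total number of occurrences. The sketch below reproduces that rule in plain Python to list which words would be dropped; the function names and the toy sentences are made up for illustration.

```python
import re
from collections import Counter


def document_frequency(docs):
    """For each word, count the documents it appears in at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(re.findall(r"\b\w\w+\b", doc.lower())))
    return df


def filtered_words(docs, min_df=5, max_df=0.75):
    """Return (too_rare, too_common) according to the document-frequency thresholds."""
    df = document_frequency(docs)
    limit = max_df * len(docs)  # max_df is a fraction of the documents
    too_rare = sorted(w for w, n in df.items() if n < min_df)
    too_common = sorted(w for w, n in df.items() if n > limit)
    return too_rare, too_common


docs = ["the cat sat", "the dog sat", "the bird flew", "the cat flew"]
rare, common = filtered_words(docs, min_df=2, max_df=0.75)
print(rare, common)  # ['bird', 'dog'] ['the']
```

Running `filtered_words` on the training texts gives an explicit word list for the first option in the task (passing the words found in part (a)), while `min_df` and `max_df` let Sklearn apply the same kind of rule directly.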

In [None]:
# Remove the words that appear in fewer than 5 e-mails or in more than 75% of the e-mails.
# Training and test data (easy ham)


vectorizer = CountVectorizer(min_df=5, max_df=0.75)

# Transform e-mail texts from training data set into vectors (CountVectorizer)
ham_train=[]
for i in range(len(hamtrain)):
  f=open('easy_ham/'+hamtrain[i], 'r', encoding = "ISO-8859-1")
  ham_train.append(f.read())

spam_train=[]
for i in range(len(spamtrain)):
  f=open('spam/'+spamtrain[i], 'r', encoding = "ISO-8859-1")
  spam_train.append(f.read())

X=[]
X.extend(ham_train)
X.extend(spam_train)
X_train = vectorizer.fit_transform(X)

type(X_train)


import numpy as np 

ytrain=['ham'] * len(ham_train)
ytrain.extend(['spam'] * len(spam_train))

# Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
clf_multinom = MultinomialNB()
clf_multinom.fit(X_train, ytrain)

# Bernoulli Naive Bayes
from sklearn.naive_bayes import BernoulliNB
clf_bernoulli = BernoulliNB()
clf_bernoulli.fit(X_train, ytrain)

spam_test=[]
for i in range(len(spamtest)):
  f=open('spam/'+spamtest[i], 'r', encoding = "ISO-8859-1")
  spam_test.append(f.read())

ham_test=[]
for i in range(len(hamtest)):
  f=open('easy_ham/'+hamtest[i], 'r', encoding = "ISO-8859-1")
  ham_test.append(f.read())

X=[]
X.extend(ham_test)
X.extend(spam_test)

X_test = vectorizer.transform(X)

### Predict if e-mails in test dataset is ham or spam
# Multinomial
predicted_multinom = clf_multinom.predict(X_test)

# Bernoulli
predicted_bernoulli=clf_bernoulli.predict(X_test)

# Labels of the test dataset
ytest=['ham'] * len(ham_test)
ytest.extend(['spam'] * len(spam_test))

a_multinom=predicted_multinom==ytest
a_bernoulli=predicted_bernoulli==ytest
print("Accuracy of prediction (Multinomial) easy ham (removed):")
print(np.count_nonzero(a_multinom)/len(a_multinom)) 
print("Accuracy of prediction (Bernoulli) easy ham (removed):")
print(np.count_nonzero(a_bernoulli)/len(a_bernoulli)) 

Accuracy of prediction (Multinomial) easy ham (removed):
0.9934282584884995
Accuracy of prediction (Bernoulli) easy ham (removed):
0.9868565169769989


In [None]:
# Remove the words that appear in fewer than 2 e-mails or in more than 75% of the e-mails.
# Training and test data (hard ham)
vectorizer = CountVectorizer(min_df=2, max_df=0.75)


# Transform e-mail texts from training data set into vectors (CountVectorizer)
ham_train_hard=[]
for i in range(len(hamtrain_hard)):
  f=open('hard_ham/'+hamtrain_hard[i], 'r', encoding = "ISO-8859-1")
  ham_train_hard.append(f.read())

spam_train=[]
for i in range(len(spamtrain)):
  f=open('spam/'+spamtrain[i], 'r', encoding = "ISO-8859-1")
  spam_train.append(f.read())

X=[]
X.extend(ham_train_hard)
X.extend(spam_train)
X_train = vectorizer.fit_transform(X)

type(X_train)


import numpy as np 

ytrain=['ham'] * len(ham_train_hard)
ytrain.extend(['spam'] * len(spam_train))

# Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
clf_multinom = MultinomialNB()
clf_multinom.fit(X_train, ytrain)

# Bernoulli Naive Bayes
from sklearn.naive_bayes import BernoulliNB
clf_bernoulli = BernoulliNB()
clf_bernoulli.fit(X_train, ytrain)

spam_test=[]
for i in range(len(spamtest)):
  f=open('spam/'+spamtest[i], 'r', encoding = "ISO-8859-1")
  spam_test.append(f.read())

ham_test_hard=[]
for i in range(len(hamtest_hard)):
  f=open('hard_ham/'+hamtest_hard[i], 'r', encoding = "ISO-8859-1")
  ham_test_hard.append(f.read())

X=[]
X.extend(ham_test_hard)
X.extend(spam_test)

X_test = vectorizer.transform(X)

### Predict if e-mails in test dataset is ham or spam
# Multinomial
predicted_multinom = clf_multinom.predict(X_test)

# Bernoulli
predicted_bernoulli=clf_bernoulli.predict(X_test)

# Labels of the test dataset
ytest=['ham'] * len(ham_test_hard)
ytest.extend(['spam'] * len(spam_test))

a_multinom=predicted_multinom==ytest
a_bernoulli=predicted_bernoulli==ytest
print("Accuracy of prediction (Multinomial) hard ham (removed):")
print(np.count_nonzero(a_multinom)/len(a_multinom)) 
print("Accuracy of prediction (Bernoulli) hard ham (removed):")
print(np.count_nonzero(a_bernoulli)/len(a_bernoulli)) 

Accuracy of prediction (Multinomial) hard ham (removed):
0.9237668161434978
Accuracy of prediction (Bernoulli) hard ham (removed):
0.8923766816143498


We removed words for both easy_ham and hard_ham. On easy_ham the accuracy of both classifiers increases: Bernoulli by about 9 percentage points (0.897 to 0.987) and Multinomial by about 3 percentage points (0.967 to 0.993). On hard_ham the accuracy is almost the same as without removing any words: Multinomial goes from 0.933 to 0.924 and Bernoulli from 0.883 to 0.892.

The numbers are taken from the outputs of the cells above; the split is random, so they change slightly from run to run.

The accuracy of the Multinomial classifier is higher than that of the Bernoulli classifier in every experiment.
