# Experiment: Spam Classifier with Naive Bayes model

In this experiment, I used a Naive Bayes model from `sklearn` to classify SMS messages as spam or not spam, following the instructions of an assignment on Udacity.

In this notebook, I first converted a few example messages into bag-of-words representations by hand. I then used the `CountVectorizer` class in `sklearn` to convert the dataset automatically into the bag-of-words form that is fed into the Naive Bayes model.

Spam messages tend to contain words such as "free," "winner," and "discount." Based on how often those words occur in a large volume of spam and non-spam messages, a Naive Bayes model can estimate the probability that a new message is spam.
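To make the idea concrete, here is a minimal sketch of the multinomial Naive Bayes calculation in plain Python. The four training messages are made up, the function names `log_score` and `spam_probability` are hypothetical, and add-one (Laplace) smoothing is an assumption chosen for illustration; none of this is taken from the notebook's dataset.

```python
import math
from collections import Counter

# Toy, made-up training data (an assumption for illustration only).
# Label 1 = spam, label 0 = not spam.
train = [
    ("win free prize now", 1),
    ("free discount winner", 1),
    ("are you free for lunch", 0),
    ("call me when you are home", 0),
]

# Count words per class and collect the vocabulary.
word_counts = {0: Counter(), 1: Counter()}
class_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    class_counts[label] += 1
vocabulary = set(word_counts[0]) | set(word_counts[1])


def log_score(words, label, alpha=1.0):
    """Return log P(label) + sum of log P(word | label), with add-alpha smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for word in words:
        if word not in vocabulary:
            continue  # ignore words never seen in training
        count = word_counts[label][word]
        score += math.log((count + alpha) / (total + alpha * len(vocabulary)))
    return score


def spam_probability(message):
    """Normalise the two class scores into P(spam | message)."""
    words = message.lower().split()
    log_spam = log_score(words, 1)
    log_ham = log_score(words, 0)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


print(spam_probability("free prize"))       # above 0.5: leans towards spam
print(spam_probability("call me at home"))  # below 0.5: leans towards not spam
```

The word "free" appears in both classes, but it is relatively more frequent in the spam messages, which is what tips the first example towards spam.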

The notebook can be executed sequentially. New messages can be entered at the end in order to see the classification (1 being spam, 0 being not spam).

Four metrics are displayed at the end: accuracy, precision, recall and F1 score. In this setting, spam that is misclassified as not spam is far less serious than a legitimate message (say, one from your mother) being misclassified as spam. Accuracy alone is therefore not enough to assess the model. Precision, the fraction of messages flagged as spam that really are spam, matters much more.
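The relationship between the four metrics is easier to see when they are computed from the confusion-matrix counts. The sketch below uses made-up labels (not the notebook's results) and a hypothetical helper named `classification_metrics`; 1 means spam and 0 means not spam.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for the positive class (1 = spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate message flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate message let through

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up example: 8 legitimate messages and 2 spam messages.
# The model misses one spam message and raises no false alarms.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this made-up case accuracy is 0.9 even though half of the spam is missed (recall 0.5), while precision is 1.0 because no legitimate message was flagged.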

## Examples
```
new_message = "first month free - no credit card needed"
test_instance = count_vectorizer.transform([new_message])
naive_bayes.predict(test_instance)
```
returns 1 (spam).

```
new_message = "win big prize - this weekend only. Terms and conditions apply."
test_instance = count_vectorizer.transform([new_message])
naive_bayes.predict(test_instance)
```
returns 1 (spam).

```
new_message = "hey, when're you picking me up?"
test_instance = count_vectorizer.transform([new_message])
naive_bayes.predict(test_instance)
```
returns 0 (not spam).
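The examples above repeat the same two steps each time: vectorise, then predict. One way to bundle them is a scikit-learn pipeline. The sketch below is an illustration only: it trains on a made-up four-message corpus rather than the SMS dataset, and `classify` is a hypothetical helper name that does not appear in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training messages (assumption); 1 = spam, 0 = not spam.
train_messages = [
    "win a free prize now",
    "free discount for the lucky winner",
    "are you free for lunch today",
    "call me when you get home",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorises raw text and feeds the counts to the classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_messages, train_labels)


def classify(message):
    """Return 1 if the message is predicted to be spam, otherwise 0."""
    return int(model.predict([message])[0])


print(classify("claim your free prize"))
print(classify("call me when you are home"))
```

Because the pipeline owns the vectorizer, new messages can be passed in as plain strings without a separate `transform` call.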


In [1]:
import pandas as pd

In [2]:
df = pd.read_table('./smsspamcollection/SMSSpamCollection', sep='\t', names=['classification', 'message'])
df.head()

| | classification | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
# convert values in the "classification" column into 0 (ham) and 1 (spam)
df['int_classification'] = df['classification'].map({'ham': 0, 'spam': 1})
df.head()

| | classification | message | int_classification |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


In [4]:
import string

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

# step 1: convert every document to lower case
lower_case_documents = [doc.lower() for doc in documents]
print(lower_case_documents)

# step 2: strip all punctuation characters
punctuation_table = str.maketrans('', '', string.punctuation)
sans_punctuation_documents = [doc.translate(punctuation_table) for doc in lower_case_documents]
print(sans_punctuation_documents)


['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']
['hello how are you', 'win money win from home', 'call me now', 'hello call hello you tomorrow']


In [5]:
# step 3: split each document into a list of words (tokens)
preprocessed_documents = [doc.split(' ') for doc in sans_punctuation_documents]
print(preprocessed_documents)


[['hello', 'how', 'are', 'you'], ['win', 'money', 'win', 'from', 'home'], ['call', 'me', 'now'], ['hello', 'call', 'hello', 'you', 'tomorrow']]


In [6]:
from collections import Counter
import pprint

# step 4: count how often each word occurs in each document
frequency_list = [dict(Counter(doc)) for doc in preprocessed_documents]
print(frequency_list)
pprint.pprint(frequency_list)

[{'hello': 1, 'how': 1, 'are': 1, 'you': 1}, {'win': 2, 'money': 1, 'from': 1, 'home': 1}, {'call': 1, 'me': 1, 'now': 1}, {'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1}]
[{'are': 1, 'hello': 1, 'how': 1, 'you': 1},
 {'from': 1, 'home': 1, 'money': 1, 'win': 2},
 {'call': 1, 'me': 1, 'now': 1},
 {'call': 1, 'hello': 2, 'tomorrow': 1, 'you': 1}]


In [7]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
print(count_vectorizer)

CountVectorizer()


In [8]:
count_vectorizer.fit(documents)
count_vectorizer.get_feature_names_out()

array(['are', 'call', 'from', 'hello', 'home', 'how', 'me', 'money',
       'now', 'tomorrow', 'win', 'you'], dtype=object)

In [9]:
doc_array = count_vectorizer.transform(documents).toarray()
doc_array

array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1]], dtype=int64)

In [10]:
frequency_matrix = pd.DataFrame(doc_array, columns=count_vectorizer.get_feature_names_out())
frequency_matrix

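| | are | call | from | hello | home | how | me | money | now | tomorrow | win | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |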


In [11]:
# split into the training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['int_classification'], random_state=1)
print(f"Number of rows in the total set: {df.shape[0]}")
print(f"Number of rows in the training set: {X_train.shape[0]}")
print(f"Number of rows in the test set: {X_test.shape[0]}")


Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393


In [12]:
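# learn the vocabulary from the training messages only
count_vectorizer = CountVectorizer()
count_vectorizer.fit(X_train)
# build the word-count matrix; .toarray() turns the sparse result into a dense array
training_data = count_vectorizer.transform(X_train).toarray()
training_data.shape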

(4179, 7456)

In [13]:
train_matrix = pd.DataFrame(training_data, columns=count_vectorizer.get_feature_names_out())
train_matrix

Unnamed: 0,00,000,008704050406,0121,01223585236,01223585334,0125698789,02,0207,02072069400,...,zed,zeros,zhong,zindgi,zoe,zoom,zouk,zyada,èn,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4174,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4175,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4176,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4177,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
test_data = count_vectorizer.transform(X_test)
test_data.shape

(1393, 7456)

In [15]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)

In [16]:
predictions = naive_bayes.predict(test_data)
predictions

array([0, 0, 0, ..., 0, 1, 0], dtype=int64)

In [17]:
# count_vectorizer.transform("")

In [18]:
# column indices of the words that occur exactly once in the first training message
word_index_list = [i for i in range(training_data.shape[1]) if training_data[0][i] == 1]

In [19]:
# look up the words behind those column indices
feature_names = count_vectorizer.get_feature_names_out()
[feature_names[idx] for idx in word_index_list]

['08000938767',
 '11mths',
 '4mths',
 'call',
 'camera',
 'cs',
 'had',
 'half',
 'latest',
 'line',
 'mobilesdirect',
 'now',
 'on',
 'or2stoptxt',
 'orange',
 'phone',
 'phones',
 'price',
 'rental',
 'to',
 'update',
 'your']

In [20]:
# compute accuracy, precision, recall, f1
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print(f"Accuracy score: {accuracy_score(y_test, predictions)}")
print(f"Precision score: {precision_score(y_test, predictions)}")
print(f"Recall score: {recall_score(y_test, predictions)}")
print(f"F1 score: {f1_score(y_test, predictions)}")

Accuracy score: 0.9885139985642498
Precision score: 0.9720670391061452
Recall score: 0.9405405405405406
F1 score: 0.9560439560439562


In [21]:
new_message = "first month free - no credit card needed"
test_instance = count_vectorizer.transform([new_message])
# test_instance.toarray()

In [22]:
naive_bayes.predict(test_instance)

array([1], dtype=int64)

In [23]:
new_message = "win big prize - this weekend only. Terms and conditions apply."
test_instance = count_vectorizer.transform([new_message])
naive_bayes.predict(test_instance)

array([1], dtype=int64)

In [24]:
new_message = "hey, when're you picking me up?"
test_instance = count_vectorizer.transform([new_message])
naive_bayes.predict(test_instance)

array([0], dtype=int64)