Import all the relevant libraries:

In [33]:
import pandas as pd
import numpy as np
import porter
import os
from sklearn.feature_extraction.text import CountVectorizer
import glob
import sklearn.utils

Next, we load a dataset that I downloaded from [Kaggle](https://www.kaggle.com/datasets/venky73/spam-mails-dataset). Originally, I wanted to use the SpamAssassin Public Corpus, as mentioned in Andrew Ng's course on this topic. However, those emails all carry many headers about the sender and the receiver, which would not work well for a model meant to classify text-only emails.

In [51]:
df = pd.read_csv(os.path.join('datasets', 'spam_ham_dataset.csv'))

The features have to come from the email bodies, which are stored in the 'text' column. The target is stored as either 'spam' or 'ham' in the 'label' column; we encode it as 1 for spam and 0 for ham.

In [35]:
X=df['text']
y=df['label'].apply(lambda x: 1 if x=='spam' else 0)

X, y = sklearn.utils.shuffle(X, y, random_state=420)


We split the data into training and test sets.

In [36]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=420)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(4136,) (4136,) (1035,) (1035,)


In [37]:
# Take an explicit copy so the in-place preprocessing below does not modify X_train.
X_tr_copy = X_train.to_numpy(copy=True)

In [38]:
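vec = CountVectorizer()
# Normalise each training email before vectorising it.
for i, x in enumerate(X_tr_copy):
    X_tr_copy[i] = porter.processEmail(x)
# Learn the vocabulary from the training emails and build the count matrix.
X_train_prepared = vec.fit_transform(X_tr_copy)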

In [39]:
# Take an explicit copy so the in-place preprocessing below does not modify X_test.
X_test_copy = X_test.to_numpy(copy=True)

To convert the emails into vectors that can be used for training a model, we first have to reduce each email to a set of "standard vocabulary" words. For example, leaving raw numbers in the emails would add a new "word" to the vocabulary for every distinct number. So we process each email and standardise its vocabulary, which makes it easier to vectorise.
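The `porter` module used in this notebook is a local helper whose source is not shown here. As a rough illustration of the kind of normalisation described above, a minimal sketch might look like the following. The function name `normalize_email` and the placeholder tokens `httpaddr` and `emailaddr` are assumptions; the `number` and `dollar` tokens are chosen to resemble the processed sample output later in the notebook. The sketch leaves out the stemming step entirely.

```python
import re

def normalize_email(text):
    """Hypothetical stand-in for porter.processEmail (no stemming)."""
    text = text.lower()
    # Replace URLs and email addresses with placeholder tokens (assumed names).
    text = re.sub(r"https?://\S+", " httpaddr ", text)
    text = re.sub(r"\S+@\S+", " emailaddr ", text)
    # Map every run of digits to one token so numbers do not bloat the vocabulary.
    text = re.sub(r"\d+", " number ", text)
    text = re.sub(r"\$+", " dollar ", text)
    # Drop any remaining non-letter characters and collapse whitespace.
    text = re.sub(r"[^a-z]+", " ", text)
    return " ".join(text.split())

print(normalize_email("Make $5000 per week! Visit http://example.com now."))
# make dollar number per week visit httpaddr now
```

With this kind of mapping, "$5000" and "$120" both become `dollar number`, so the vectorizer sees two shared tokens instead of one new token per amount.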

In [40]:
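# Apply the same preprocessing to the test emails.
for i, x in enumerate(X_test_copy):
    X_test_copy[i] = porter.processEmail(x)
# Reuse the vocabulary learned on the training set; do not refit here.
X_test_prepared = vec.transform(X_test_copy)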
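Note that the vectorizer is fitted on the training emails only and merely applied to the test emails, so both matrices share the same columns. A small sketch with made-up toy sentences (not taken from the dataset) illustrates the effect: words that were not seen during fitting are simply dropped at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy documents, only for illustration.
train_docs = ["win dollar number now", "meet me at number"]
test_docs = ["win a free prize now"]

toy_vec = CountVectorizer()
train_matrix = toy_vec.fit_transform(train_docs)  # learns the vocabulary
test_matrix = toy_vec.transform(test_docs)        # reuses it unchanged

print(sorted(toy_vec.vocabulary_))            # words learned from train_docs
print(train_matrix.shape, test_matrix.shape)  # same number of columns
print(test_matrix.toarray())                  # only "now" and "win" are counted
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the resulting columns would no longer line up with what the classifier was trained on.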

We use an MLP classifier to classify the data.

In [41]:
from sklearn.neural_network import MLPClassifier 
clf = MLPClassifier()
clf.fit(X_train_prepared, y_train)

In [42]:
y_pred = clf.predict(X_test_prepared)

In [43]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.9758454106280193

In [44]:
from sklearn.metrics import accuracy_score
y_train_pred = clf.predict(X_train_prepared)
accuracy_score(y_train, y_train_pred)

1.0

In [45]:
with open('sample/spamSample1.txt', 'r') as f:
    a = f.read()
a = porter.processEmail(a)
print(a)

do you want to make dollar number or more per week if you ar a motiv and qualifi individu  i will person demonstr to you a system that will make you dollar number number per week or more thi is not mlm call our number hour prerecord number to get the detail number  number  number i need peopl who want to make seriou monei make the call and get the fact invest number minut in yourself now number  number  number look forward to your call and i will introduc you to peopl like yourself who ar current make dollar number number plu per week number  number  number number ljgv number  number lean number lrm number  number wxho number qiyt number  number rjuv number hqcf number  number eidb number dmtvl number


In [46]:
a = vec.transform([a])

In [47]:
clf.predict(a)

array([1])

Finally, we store the classifier and the vectorizer in pickle files so that we can reuse them later on.

In [48]:
import pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(clf, f)

In [49]:
with open('vec.pkl', 'wb') as f:
    pickle.dump(vec, f)
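To reuse the saved objects later, both files would be loaded back and applied in the same order as above: preprocess, vectorize, predict. The sketch below is self-contained, so it trains a tiny stand-in model on made-up sentences and writes to a temporary directory rather than reading the `model.pkl` and `vec.pkl` produced by this notebook. The names `toy_vec`, `toy_clf` and the toy data are assumptions for illustration; with the real files, only the loading and prediction steps would be needed.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

# Toy stand-ins for the trained vectorizer and classifier (made-up data).
docs = ["win dollar number now", "cheap dollar offer",
        "meet at lunch", "see you at the meet"]
labels = [1, 1, 0, 0]
toy_vec = CountVectorizer()
toy_clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
toy_clf.fit(toy_vec.fit_transform(docs), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    vec_path = os.path.join(tmp, "vec.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(toy_clf, f)
    with open(vec_path, "wb") as f:
        pickle.dump(toy_vec, f)

    # Later, for example in another script: load both objects back.
    with open(model_path, "rb") as f:
        loaded_clf = pickle.load(f)
    with open(vec_path, "rb") as f:
        loaded_vec = pickle.load(f)

# The reloaded pair should give the same prediction as the original pair.
new_email = ["win cheap dollar now"]
original_pred = toy_clf.predict(toy_vec.transform(new_email))
reloaded_pred = loaded_clf.predict(loaded_vec.transform(new_email))
print(original_pred, reloaded_pred)
```

The vectorizer has to be saved alongside the classifier because the model's input columns are defined by the vocabulary learned during `fit_transform`.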