Step 0. Go to the Ham or Spam dataset [website](http://www2.aueb.gr/users/ion/data/enron-spam/index.html) and download the dataset.

In [1]:
# !wget http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron1.tar.gz
# !tar -xvzf enron1.tar.gz

Step 1. Traverse the dataset and create a Pandas dataframe. This is already done for you. You should recognize Pandas from task 1.

In [2]:
import pandas as pd
import os

def read_spam():
    category = 'spam'
    directory = './enron1/spam'
    return read_category(category, directory)

def read_ham():
    category = 'ham'
    directory = './enron1/ham'
    return read_category(category, directory)

def read_category(category, directory):
    emails = []
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(directory, filename), 'r') as fp:
            try:
                content = fp.read()
                emails.append({'name': filename, 'content': content, 'category': category})
            except UnicodeDecodeError:  # file is not valid text in the default encoding
                print(f'skipped {filename}')
    return emails

ham = read_ham()
spam = read_spam()

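# DataFrame.append was removed in pandas 2.0; pd.concat is the supported way to stack frames
df = pd.concat(
    [pd.DataFrame.from_records(ham), pd.DataFrame.from_records(spam)],
    ignore_index=True,
)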

skipped 2649.2004-10-27.GP.spam.txt
skipped 0754.2004-04-01.GP.spam.txt
skipped 2042.2004-08-30.GP.spam.txt
skipped 3304.2004-12-26.GP.spam.txt
skipped 4142.2005-03-31.GP.spam.txt
skipped 3364.2005-01-01.GP.spam.txt
skipped 4201.2005-04-05.GP.spam.txt
skipped 2140.2004-09-13.GP.spam.txt
skipped 2248.2004-09-23.GP.spam.txt
skipped 4350.2005-04-23.GP.spam.txt
skipped 4566.2005-05-24.GP.spam.txt
skipped 2526.2004-10-17.GP.spam.txt
skipped 1414.2004-06-24.GP.spam.txt
skipped 2698.2004-10-31.GP.spam.txt
skipped 5105.2005-08-31.GP.spam.txt
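
The skipped files listed above are most likely ones whose bytes cannot be decoded with the platform's default text encoding. A minimal sketch, assuming that is the cause, of reading such a file anyway by passing an explicit encoding and `errors="replace"` to `open`. The helper name `read_text_lenient` and the sample file are hypothetical and not part of the original notebook.

```python
import os
import tempfile

def read_text_lenient(path, encoding="utf-8"):
    # Hypothetical helper: undecodable bytes become U+FFFD instead of raising
    with open(path, "r", encoding=encoding, errors="replace") as fp:
        return fp.read()

# Demo on a throwaway file containing a byte (0xE9) that is not valid UTF-8 here
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.spam.txt")
    with open(path, "wb") as fp:
        fp.write(b"caf\xe9 prices")
    content = read_text_lenient(path)

print(repr(content))
```

Since the `preprocessor` in Step 2 replaces every non-alphabetic character with a space, the replacement characters would not end up in the vocabulary.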


Step 2. Data cleaning is a critical part of machine learning. You and I can recognize that 'Hello' and 'hello' are the same word, but a machine does not know this a priori. We can therefore help the machine by performing such normalization steps for it. Write a function `preprocessor` that takes a string, replaces every non-alphabetic character with a space, and lowercases the result.

In [3]:
import re

def preprocessor(e):
    return re.sub('[^A-Za-z]', ' ', e).lower()
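
To see what this normalization does, here is a small check on a made-up sample string (the sample is an assumption for illustration, not data from the Enron set). The function is repeated so the snippet runs on its own.

```python
import re

def preprocessor(e):
    # Same definition as in the cell above
    return re.sub('[^A-Za-z]', ' ', e).lower()

sample = "Hello, World! 123"
cleaned = preprocessor(sample)

print(repr(cleaned))    # punctuation and digits become spaces, length is unchanged
print(cleaned.split())  # the words a whitespace split would see
```

Note that passing a custom `preprocessor` to `CountVectorizer` overrides its built-in lowercasing step, which is why the function lowercases the text itself.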

Step 3. We will now train the machine learning model. All the functions that you will need are imported for you. The instructions explain how they work and hint at which functions to use. You will likely need to refer to the scikit-learn documentation to see exactly how to invoke the functions, so it will be handy to keep that tab open.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

vectorizer = CountVectorizer(preprocessor=preprocessor)

X_train,X_test,y_train,y_test = train_test_split(df["content"],df["category"],test_size=0.2,random_state=1)

X_train_df = vectorizer.fit_transform(X_train)

model = LogisticRegression()
model.fit(X_train_df,y_train)


X_test_df = vectorizer.transform(X_test)
y_pred = model.predict(X_test_df)


print(f'Accuracy:\n{accuracy_score(y_test,y_pred)}\n')
print(f'Confusion Matrix:\n{confusion_matrix(y_test,y_pred)}\n')
print(f'Detailed Statistics:\n{classification_report(y_test,y_pred)}\n')


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
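
The warning above is raised when the solver uses up its iteration budget (`max_iter`, which defaults to 100 in `LogisticRegression`) before converging. A minimal sketch of raising that budget, shown on a made-up toy corpus rather than the Enron data. The names `toy_texts`, `toy_vectorizer` and `toy_model` and the value 1000 are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus (made up for illustration, not taken from the Enron data)
toy_texts = [
    "cheap prices click http link",
    "remove me from this mobile offer",
    "please see the attached deal doc",
    "thanks for the meter report",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_texts)

# max_iter defaults to 100; 1000 is an arbitrary larger budget
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(X_toy, toy_labels)

print(toy_model.n_iter_)  # iterations the solver actually used
```

Applied to the full dataset, a larger `max_iter` may shift the fitted coefficients and the metrics slightly compared with the output shown in this notebook.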


Accuracy:
0.9767441860465116

Confusion Matrix:
[[726  16]
 [  8 282]]

Detailed Statistics:
              precision    recall  f1-score   support

         ham       0.99      0.98      0.98       742
        spam       0.95      0.97      0.96       290

    accuracy                           0.98      1032
   macro avg       0.97      0.98      0.97      1032
weighted avg       0.98      0.98      0.98      1032
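
The figures in the report can be recomputed by hand from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes, both in the order ham, spam. A short sketch of that arithmetic for the spam class, with the counts copied from the output above (the variable names are my own):

```python
# Counts copied from the confusion matrix above: rows = true class, columns = predicted
ham_as_ham, ham_as_spam = 726, 16
spam_as_ham, spam_as_spam = 8, 282

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam
accuracy = (ham_as_ham + spam_as_spam) / total

# Precision: of everything predicted spam, how much really was spam
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# Recall: of all real spam, how much was caught
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(total, round(accuracy, 4))
print(round(spam_precision, 2), round(spam_recall, 2), round(spam_f1, 2))
```

The rounded values agree with the `spam` row of the report: 16 legitimate emails were flagged as spam and 8 spam emails slipped through.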




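Step 4. Inspect the learned coefficients: print the ten words with the largest positive weights (pushing towards spam) and the ten words with the most negative weights (pushing towards ham).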

In [5]:
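features = vectorizer.get_feature_names_out()

# One weight per vocabulary word; positive weights push a prediction towards 'spam'
importance = model.coef_[0]

ranked = list(enumerate(importance))
print()
# Ten words with the largest positive weights
ranked.sort(key=lambda e: e[1], reverse=True)
for i, imp in ranked[:10]:
    print(imp, features[i])
print()
# Ten words with the most negative weights
ranked.sort(key=lambda e: e[1])
for i, imp in ranked[:10]:
    print(imp, features[i])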


1.061288626482321 http
0.8171514687730761 prices
0.7939415840701827 no
0.7566963969909719 remove
0.7217203117456084 removed
0.6916604371247976 off
0.6795655311184997 hello
0.6537526327583835 mobile
0.6338517298349112 more
0.6164935024552568 message

-1.5437128445038255 attached
-1.5401915054858066 daren
-1.5015980523929702 enron
-1.3008715559663415 thanks
-1.2496145154883649 doc
-1.1804716676252003 meter
-1.0830241823597755 xls
-1.0658050172716116 hpl
-1.04603390757529 neon
-1.0367514536221871 deal
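
One way to read these numbers: in a logistic regression, each coefficient is the change in the log-odds of the positive class per additional occurrence of the word, with all other counts held fixed. The sketch below assumes the positive class is `spam` (scikit-learn sorts the labels, so `ham` comes first), which matches the words in the output above, and converts two of the printed coefficients into odds multipliers.

```python
import math

# Coefficients copied from the output above
coef_http = 1.061288626482321
coef_attached = -1.5437128445038255

# exp(coefficient) is the factor by which the odds of spam change per extra occurrence
odds_http = math.exp(coef_http)
odds_attached = math.exp(coef_attached)

print(round(odds_http, 2))      # greater than 1: raises the odds of spam
print(round(odds_attached, 2))  # less than 1: lowers the odds of spam
```

Under that assumption, one extra `http` multiplies the odds of spam by roughly 2.9, while one extra `attached` cuts them to about a fifth.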


All Done!