Step 0. Unzip enron1.zip into the current directory.

Step 1. Traverse the dataset and create a pandas DataFrame. This is already done for you and should run without errors. You should recognize pandas from task 1.

In [10]:
import pandas as pd
import os

def read_spam():
    category = 'spam'
    directory = './enron1/spam'
    return read_category(category, directory)

def read_ham():
    category = 'ham'
    directory = './enron1/ham'
    return read_category(category, directory)

def read_category(category, directory):
    """Read every .txt file in `directory` and label it with `category`."""
    emails = []
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(directory, filename), 'r') as fp:
            try:
                content = fp.read()
                emails.append({'name': filename, 'content': content, 'category': category})
            except UnicodeDecodeError:
                # Files that cannot be decoded with the default encoding are skipped.
                print(f'skipped {filename}')
    return emails

ham = read_ham()
spam = read_spam()

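# ignore_index=True gives the combined frame a unique 0..n-1 index
# instead of repeating the row labels of the two source frames.
df = pd.concat([pd.DataFrame.from_records(ham), pd.DataFrame.from_records(spam)],
               ignore_index=True)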

skipped 2248.2004-09-23.GP.spam.txt
skipped 2526.2004-10-17.GP.spam.txt
skipped 2698.2004-10-31.GP.spam.txt
skipped 4566.2005-05-24.GP.spam.txt
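
The four skipped files presumably failed because their bytes could not be decoded with the default text encoding; the notebook does not say so, so treat that as an assumption. The sketch below shows one possible way to keep such files: open them with an explicit encoding and replace undecodable bytes. The helper name `read_text_lenient` and the sample file are hypothetical and not part of the original notebook.

```python
import os
import tempfile

def read_text_lenient(path, encoding="utf-8"):
    """Hypothetical helper: read a text file, replacing undecodable bytes."""
    # errors="replace" substitutes U+FFFD for any byte sequence that is
    # invalid in the chosen encoding instead of raising UnicodeDecodeError.
    with open(path, "r", encoding=encoding, errors="replace") as fp:
        return fp.read()

# Demonstration on a throwaway file that contains one invalid UTF-8 byte.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "wb") as fp:
        fp.write(b"cheap meds \xff now")
    sample_text = read_text_lenient(sample_path)

print(repr(sample_text))
```

Reading the skipped emails this way would change the dataset, so the numbers reported in Step 3 would no longer match exactly.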


Step 2. Data cleaning is a critical part of machine learning. You and I can recognize that 'Hello' and 'hello' are the same word, but a machine does not know this a priori. We can therefore help the machine by performing such normalization steps for it. Write a function `preprocessor` that takes a string, replaces every non-alphabetic character with a space, and then lowercases the result.

In [15]:
import re

def preprocessor(e):
    # Replace every character that is not an ASCII letter with a space, then lowercase.
    return re.sub(r'[^A-Za-z]', ' ', e).lower()
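
To check what the function does, it can be run on a short string. The snippet below is a small illustration; the sample strings are made up, and `preprocessor` is repeated so that the snippet runs on its own.

```python
import re

def preprocessor(e):
    # Mirrors the definition in the cell above.
    return re.sub('[^A-Za-z]', ' ', e).lower()

sample = "Hello, World! 123"
cleaned = preprocessor(sample)

# Each punctuation mark, space and digit becomes exactly one space,
# so the cleaned string keeps the length of the original.
print(repr(cleaned))
print(len(cleaned) == len(sample))

# Differently cased and punctuated inputs collapse to the same token.
print(preprocessor("FREE!!!").split() == preprocessor("free").split())
```

Because every replaced character turns into a space rather than being deleted, words that were separated only by punctuation stay separate tokens.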

Step 3. We will now train the machine learning model. All the functions that you will need are imported for you. The instructions explain how they work and hint at which functions to use. You will likely need to refer to the scikit-learn documentation to see exactly how to invoke the functions, so it is handy to keep that tab open.

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

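# Build a bag-of-words vectorizer that cleans each email with `preprocessor`.
vectorizer = CountVectorizer(preprocessor=preprocessor)

# Hold out 20% of the emails for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    df["content"], df["category"], test_size=0.2, random_state=1)

# Learn the vocabulary from the training emails only and convert them to count vectors.
X_train_df = vectorizer.fit_transform(X_train)

# Fit a logistic regression classifier on the training vectors.
model = LogisticRegression()
model.fit(X_train_df, y_train)

# Vectorize the test emails with the already fitted vocabulary and predict their labels.
X_test_df = vectorizer.transform(X_test)
y_pred = model.predict(X_test_df)

# Report overall accuracy, the confusion matrix, and per-class statistics.
print(f'Accuracy:\n{accuracy_score(y_test, y_pred)}\n')
print(f'Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}\n')
print(f'Detailed Statistics:\n{classification_report(y_test, y_pred)}\n')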

Accuracy:
0.9748549323017408

Confusion Matrix:
[[729  16]
 [ 10 279]]

Detailed Statistics:
              precision    recall  f1-score   support

         ham       0.99      0.98      0.98       745
        spam       0.95      0.97      0.96       289

    accuracy                           0.97      1034
   macro avg       0.97      0.97      0.97      1034
weighted avg       0.98      0.97      0.97      1034
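
The figures in the report can be reproduced by hand from the confusion matrix, which helps when reading it. The sketch below assumes scikit-learn's usual layout, in which rows are true labels and columns are predicted labels in sorted order (ham, spam); the variable names are illustrative.

```python
# Confusion matrix from the output above: rows are true labels,
# columns are predicted labels, both in the order (ham, spam).
ham_as_ham, ham_as_spam = 729, 16
spam_as_ham, spam_as_spam = 10, 279

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam
accuracy = (ham_as_ham + spam_as_spam) / total

# Precision: of the emails predicted as spam, the share that really is spam.
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# Recall: of the emails that really are spam, the share that was caught.
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
# F1: harmonic mean of precision and recall.
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"total          {total}")
print(f"accuracy       {accuracy:.4f}")
print(f"spam precision {spam_precision:.2f}")
print(f"spam recall    {spam_recall:.2f}")
print(f"spam f1        {spam_f1:.2f}")
```

In words: 16 legitimate emails were flagged as spam and 10 spam emails slipped through, out of 1034 test emails.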




STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


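Step 4. Inspect the fitted model's coefficients and print the ten words with the largest positive weights, followed by the ten words with the most negative weights.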

In [17]:
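features = vectorizer.get_feature_names_out()
importance = model.coef_[0]

# Pair each feature index with its coefficient so the pairs can be ranked.
ranked = list(enumerate(importance))

# Ten largest positive coefficients.
print()
ranked.sort(key=lambda pair: pair[1], reverse=True)
for i, imp in ranked[:10]:
    print(imp, features[i])

# Ten most negative coefficients.
print()
ranked.sort(key=lambda pair: pair[1])
for i, imp in ranked[:10]:
    print(imp, features[i])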




0.9465430515498962 http
0.8662010695956848 prices
0.8151424869384507 no
0.7513520177815852 pain
0.722508455690836 only
0.7190209612640248 paliourg
0.7186432269269624 money
0.693039324056552 more
0.6866479314882128 remove
0.6802226547274995 removed

-1.5992384489428075 enron
-1.5480715450490574 attached
-1.4053750528481797 daren
-1.382260089161263 thanks
-1.3564449591282988 doc
-1.185896700897139 deal
-1.1421505214315102 xls
-1.124158355843412 meter
-1.0600763079219155 hpl
-1.0216031374860164 neon
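
For a binary `LogisticRegression`, `coef_[0]` describes the second entry of `model.classes_`. With the labels sorted alphabetically that should be `spam`, which matches the output above: positive weights push toward spam and negative weights toward ham. One common way to read a coefficient is to exponentiate it, as sketched below with four values copied from the output and rounded to four decimal places.

```python
import math

# Coefficients copied from the output above (rounded to four places).
coefficients = {
    "http": 0.9465,
    "prices": 0.8662,
    "enron": -1.5992,
    "attached": -1.5481,
}

# exp(coefficient) is the factor by which the odds of "spam" are multiplied
# for each additional occurrence of the word, all other counts held fixed.
odds_ratios = {word: math.exp(c) for word, c in coefficients.items()}

for word, ratio in odds_ratios.items():
    print(f"{word:>10}  x{ratio:.2f}")
```

A factor above 1 raises the odds of spam and a factor below 1 lowers them, so each extra "http" makes spam more likely while each extra "enron" makes it much less likely.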


Submission
1. Upload the Jupyter notebook to Forage.

All Done!