Step 0. Unzip enron1.zip into the current directory.

Step 1. Traverse the dataset and create a pandas DataFrame. This is already done for you and should run without errors; any file that cannot be read is reported as skipped. You should recognize pandas from task 1.

In [2]:
import pandas as pd
import os

def read_spam():
    category = 'spam'
    directory = './enron1/spam'
    return read_category(category, directory)

def read_ham():
    category = 'ham'
    directory = './enron1/ham'
    return read_category(category, directory)

def read_category(category, directory):
    emails = []
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(directory, filename), 'r') as fp:
            try:
                content = fp.read()
                emails.append({'name': filename, 'content': content, 'category': category})
            except UnicodeDecodeError:
                # Skip files that cannot be decoded with the default text encoding
                print(f'skipped {filename}')
    return emails

ham = read_ham()
spam = read_spam()

df_ham = pd.DataFrame.from_records(ham)
df_spam = pd.DataFrame.from_records(spam)
df = pd.concat([df_ham, df_spam], ignore_index=True)


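# Older alternative, kept for reference only. DataFrame.append was removed
# in pandas 2.0, so pd.concat (used above) is the supported way to combine frames.
# df = pd.DataFrame.from_records(ham)
# df = df.append(pd.DataFrame.from_records(spam))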

skipped 2140.2004-09-13.GP.spam.txt
skipped 2042.2004-08-30.GP.spam.txt
skipped 1414.2004-06-24.GP.spam.txt
skipped 3304.2004-12-26.GP.spam.txt
skipped 4201.2005-04-05.GP.spam.txt
skipped 2698.2004-10-31.GP.spam.txt
skipped 0754.2004-04-01.GP.spam.txt
skipped 5105.2005-08-31.GP.spam.txt
skipped 2526.2004-10-17.GP.spam.txt
skipped 2649.2004-10-27.GP.spam.txt
skipped 4350.2005-04-23.GP.spam.txt
skipped 3364.2005-01-01.GP.spam.txt
skipped 4566.2005-05-24.GP.spam.txt
skipped 4142.2005-03-31.GP.spam.txt
skipped 2248.2004-09-23.GP.spam.txt
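
The skipped files above are most likely emails that Python could not decode with the platform's default text encoding. That is an assumption, since the notebook does not record the exception. As a minimal sketch, a more lenient reader could pass `errors='replace'` to `open`, which substitutes a placeholder character for undecodable bytes instead of raising. The helper name `read_text_lenient` and the demo file are hypothetical and not part of the task code.

```python
import os
import tempfile

def read_text_lenient(path, encoding='utf-8'):
    # errors='replace' swaps undecodable bytes for U+FFFD instead of raising
    with open(path, 'r', encoding=encoding, errors='replace') as fp:
        return fp.read()

# Demo on a throwaway file that contains a byte which is invalid in UTF-8
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.txt')
    with open(demo_path, 'wb') as fp:
        fp.write(b'cheap \xff prices')
    demo_text = read_text_lenient(demo_path)

print(repr(demo_text))
```

Reading the data this way would add the skipped emails to the DataFrame, so the results further down would no longer match the printed outputs exactly.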


Step 2. Data cleaning is a critical part of machine learning. You and I can recognize that 'Hello' and 'hello' are the same word, but a machine does not know this a priori. We can therefore help the machine by performing such normalization steps for it. Write a function `preprocessor` that takes a string, replaces every non-alphabetic character with a space, and then lowercases the result.

In [3]:
import re

def preprocessor(e):
    cleaned = re.sub(r'[^a-zA-Z]', ' ', e)
    cleaned = cleaned.lower()
    return cleaned
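
To check the function, it can be run on a short string. The sketch below repeats the definition so that it runs on its own; the sample text is made up for illustration.

```python
import re

def preprocessor(e):
    # Replace every character outside a-z / A-Z with a space, then lowercase
    cleaned = re.sub(r'[^a-zA-Z]', ' ', e)
    return cleaned.lower()

sample = "Hello, World! 50% OFF"
cleaned_sample = preprocessor(sample)

# Each removed character becomes exactly one space, so the length is unchanged
print(repr(cleaned_sample))
print(cleaned_sample.split())
```

Digits and punctuation turn into runs of spaces. That should be harmless here, because the vectorizer in the next step only picks out the words.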

Step 3. We will now train the machine learning model. All the functions that you will need are imported for you. The instructions explain how they work and hint at which functions to use. You will likely need to refer to the scikit-learn documentation to see exactly how to invoke the functions, so it will be handy to keep that tab open.
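
Before using CountVectorizer, it may help to see the bag-of-words idea in plain Python. The sketch below is a simplified illustration, not scikit-learn's implementation: it splits on whitespace only, and the function name `bag_of_words` and the two toy documents are hypothetical.

```python
from collections import Counter

def bag_of_words(docs):
    # Vocabulary: every distinct whitespace-separated token, in sorted order
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # One count per vocabulary word; a Counter returns 0 for absent words
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows

vocab, rows = bag_of_words(["free prices free", "meeting attached"])
print(vocab)
for row in rows:
    print(row)
```

Each document becomes one row of word counts and each vocabulary word becomes one column. CountVectorizer produces the same kind of matrix, stored in sparse form.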

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 1: Instantiate CountVectorizer
vectorizer = CountVectorizer(preprocessor=preprocessor)

# Step 2: Split the dataset into train and test sets
X = df['content']
y = df['category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Transform the text data into feature vectors
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Step 4: Fit the Logistic Regression model
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

# Step 5: Generate predictions on the test set
y_pred = model.predict(X_test_vectorized)

# Step 6: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')
print('Confusion Matrix:\n', conf_matrix)
print('Classification Report:\n', class_report)


Accuracy: 0.98
Confusion Matrix:
 [[695  12]
 [  5 320]]
Classification Report:
               precision    recall  f1-score   support

         ham       0.99      0.98      0.99       707
        spam       0.96      0.98      0.97       325

    accuracy                           0.98      1032
   macro avg       0.98      0.98      0.98      1032
weighted avg       0.98      0.98      0.98      1032
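
The figures in the classification report can be recomputed by hand from the confusion matrix. The sketch below copies the four counts from the output above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in the order (ham, spam).

```python
# Confusion matrix copied from the output above:
# rows are true labels, columns are predicted labels, order (ham, spam)
conf = [[695, 12],
        [5, 320]]

total = sum(sum(row) for row in conf)
accuracy = (conf[0][0] + conf[1][1]) / total

# Precision: of everything predicted as a class, how much was correct
# Recall: of everything truly in a class, how much was found
spam_precision = conf[1][1] / (conf[0][1] + conf[1][1])
spam_recall = conf[1][1] / (conf[1][0] + conf[1][1])
ham_precision = conf[0][0] / (conf[0][0] + conf[1][0])
ham_recall = conf[0][0] / (conf[0][0] + conf[0][1])

print(f'total={total} accuracy={accuracy:.2f}')
print(f'spam precision={spam_precision:.2f} recall={spam_recall:.2f}')
print(f'ham  precision={ham_precision:.2f} recall={ham_recall:.2f}')
```

Read this way, 12 ham emails were flagged as spam and 5 spam emails were classified as ham.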



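Step 4. Inspect the trained model: list the features that CountVectorizer created, read the coefficients of the logistic regression model, and report the 10 most positive (spam) and 10 most negative (ham) features.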

In [5]:
# Step 7: Display features created by CountVectorizer
features = vectorizer.get_feature_names_out()
print(f'Total features: {len(features)}')

# Step 8: Access model coefficients
coefficients = model.coef_[0]
feature_importance = pd.DataFrame({'feature': features, 'importance': coefficients})

# Step 9: Top 10 positive and negative features
top_positive_features = feature_importance.nlargest(10, 'importance')
top_negative_features = feature_importance.nsmallest(10, 'importance')

print('Top 10 Positive Features (Spam):\n', top_positive_features)
print('Top 10 Negative Features (Ham):\n', top_negative_features)


Total features: 41142
Top 10 Positive Features (Spam):
         feature  importance
18020      http    0.948530
28958    prices    0.880992
25568        no    0.830795
30849    remove    0.754530
30850   removed    0.712527
27036  paliourg    0.700734
17320      here    0.689060
24530      more    0.662717
17251     hello    0.645492
26137       off    0.641608
Top 10 Negative Features (Ham):
         feature  importance
12789     enron   -1.520467
36202    thanks   -1.501006
2616   attached   -1.488542
9646      daren   -1.448227
11212       doc   -1.314941
23792     meter   -1.251408
9836       deal   -1.227553
40382       xls   -1.199253
25256      neon   -1.068413
27978  pictures   -0.982330
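
One way to read these numbers: in logistic regression, a coefficient is the change in the log-odds of the positive class for each extra occurrence of the word, with all other counts held fixed. Exponentiating it gives the factor by which the odds are multiplied. The sketch below uses two coefficients copied from the output above and assumes that spam is the positive class, as the labels in the printout suggest.

```python
import math

# Coefficients copied from the output above (spam assumed to be the positive class)
coefs = {'http': 0.948530, 'enron': -1.520467}

# exp(coefficient) is the multiplicative change in the odds of spam
odds_ratios = {word: math.exp(c) for word, c in coefs.items()}

for word, ratio in odds_ratios.items():
    print(f'{word}: odds of spam multiplied by {ratio:.2f} per extra occurrence')
```

A factor above 1 pushes an email toward spam and a factor below 1 pushes it toward ham. Because LogisticRegression applies L2 regularization by default, these values are best treated as a rough ranking rather than exact effect sizes.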


Submission
1. Upload the Jupyter notebook to Forage.

All Done!