Step 0. Unzip enron1.zip into the current directory.
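
If you prefer to do the extraction from Python rather than from a file manager, the following is a minimal sketch using the standard-library `zipfile` module. The helper name `unzip_dataset` is made up for this illustration, and the sketch assumes `enron1.zip` sits next to the notebook.

```python
import zipfile
from pathlib import Path

def unzip_dataset(zip_path="enron1.zip", target_dir="."):
    """Extract every member of zip_path into target_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()

# Only extract when the archive is actually present in the current directory.
archive_path = Path("enron1.zip")
if archive_path.exists():
    extracted = unzip_dataset(archive_path)
    print(f"extracted {len(extracted)} entries")
else:
    print("enron1.zip not found in the current directory")
```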

Step 1. Traverse the dataset and create a Pandas dataframe. This is already done for you and should run without any errors. You should recognize Pandas from task 1.

In [9]:
import pandas as pd
import os

def read_spam():
    category = 'spam'
    directory = './enron1/spam'
    return read_category(category, directory)

def read_ham():
    category = 'ham'
    directory = './enron1/ham'
    return read_category(category, directory)

def read_category(category, directory):
    emails = []
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(directory, filename), 'r') as fp:
            try:
                content = fp.read()
                emails.append({'name': filename, 'content': content, 'category': category})
            except UnicodeDecodeError:
                # Skip files whose bytes cannot be decoded with the default encoding.
                print(f'skipped {filename}')
    return emails

ham = read_ham()
spam = read_spam()

df_ham = pd.DataFrame(ham)
df_spam = pd.DataFrame(spam)

data = pd.concat([df_ham, df_spam], ignore_index=True)

print(data.head())

skipped 2248.2004-09-23.GP.spam.txt
skipped 2526.2004-10-17.GP.spam.txt
skipped 2698.2004-10-31.GP.spam.txt
skipped 4566.2005-05-24.GP.spam.txt
                             name  \
0  0001.1999-12-10.farmer.ham.txt   
1  0002.1999-12-13.farmer.ham.txt   
2  0003.1999-12-14.farmer.ham.txt   
3  0004.1999-12-14.farmer.ham.txt   
4  0005.1999-12-14.farmer.ham.txt   

                                             content category  
0            Subject: christmas tree farm pictures\n      ham  
1  Subject: vastar resources , inc .\ngary , prod...      ham  
2  Subject: calpine daily gas nomination\n- calpi...      ham  
3  Subject: re : issue\nfyi - see note below - al...      ham  
4  Subject: meter 7268 nov allocation\nfyi .\n- -...      ham  
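
The "skipped" lines above come from the `except` branch in `read_category`. A likely cause (an assumption, since the notebook does not say) is that those files contain bytes that cannot be decoded as text. The sketch below reproduces the situation with a made-up temporary file and shows the `errors="replace"` option of `open` as one possible alternative to skipping. The results in this notebook were produced with those files skipped.

```python
import os
import tempfile

# Hypothetical file containing a byte (0xE9) that is not valid UTF-8 on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.spam.txt")
    with open(path, "wb") as fp:
        fp.write(b"Subject: caf\xe9 offer\n")

    # Strict decoding raises UnicodeDecodeError, which would trigger a "skipped" message.
    try:
        with open(path, "r", encoding="utf-8") as fp:
            strict_content = fp.read()
    except UnicodeDecodeError:
        strict_content = None

    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    with open(path, "r", encoding="utf-8", errors="replace") as fp:
        lenient_content = fp.read()

print(strict_content)         # None, because strict decoding failed
print(repr(lenient_content))  # the bad byte shows up as '\ufffd'
```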


Step 2. Data cleaning is a critical part of machine learning. You and I can recognize that 'Hello' and 'hello' are the same word, but a machine does not know this a priori. We can therefore help the machine by normalizing the text for it. Write a function `preprocessor` that takes a string, replaces every non-alphabetic character with a space, and then lowercases the result.

In [8]:
import re

def preprocessor(sentence: str) -> str:
    # Replace every character that is not an ASCII letter with a space, then lowercase.
    cleaned_text = re.sub(r'[^a-zA-Z]', ' ', sentence)
    cleaned_text = cleaned_text.lower()
    return cleaned_text
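
A quick way to check the function is to run it on a made-up subject line (the sample string is an illustration, not taken from the dataset). The block repeats a compact version of the function so that it runs on its own. Because every removed character is replaced by exactly one space, the cleaned string keeps the same length as the input.

```python
import re

def preprocessor(text: str) -> str:
    # Same behavior as the function above: non-letters become spaces, then lowercase.
    return re.sub(r"[^a-zA-Z]", " ", text).lower()

sample = "Subject: RE: Meter #7268, Nov-99 allocation!"
cleaned = preprocessor(sample)

print(repr(cleaned))
print(cleaned.split())  # ['subject', 're', 'meter', 'nov', 'allocation']
```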


Step 3. We will now train the machine learning model. All the functions that you will need are imported for you. The instructions explain how they work and hint at which functions to use. You will likely need to refer to the scikit-learn documentation to see exactly how to invoke the functions. It will be handy to keep that tab open.
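
The cell below relies on `CountVectorizer` to turn text into word counts. To make that idea concrete, here is a hand-rolled sketch of the same bag-of-words idea using only the standard library. It is not scikit-learn's implementation: the helper `bag_of_words` and the two sample sentences are made up for illustration, and the real `CountVectorizer` differs in details (for example, it returns a sparse matrix and its default tokenizer ignores single-character tokens).

```python
import re
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count_vectors) for a list of strings."""
    # Lowercase each document and keep only runs of letters as tokens.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    # The vocabulary is every distinct token, sorted so each word has a fixed column.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["free money free offer", "meeting schedule for the gas deal"]
vocab, vectors = bag_of_words(docs)

print(vocab)
for vector in vectors:
    print(vector)
```

Each row has one entry per vocabulary word, so "free" contributes a 2 in the first row and a 0 in the second.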

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression as logistic_regression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.pipeline import make_pipeline

# The CountVectorizer converts a text sample into a vector (think of it as an array of integer counts).
# Each entry in the vector corresponds to a single word and the value is the number of times the word appeared.
# Instantiate a CountVectorizer. Make sure to include the preprocessor you previously wrote in the constructor.

vectorizer = CountVectorizer(preprocessor=preprocessor)


# Use train_test_split to split the dataset into a train dataset and a test dataset.
# The machine learning model learns from the train dataset.
# Then the trained model is tested on the test dataset to see if it actually learned anything.
# If it had merely memorized the training examples, it would have a high accuracy on the train dataset and a low accuracy on the test dataset.
X_train, X_test, y_train, y_test = train_test_split(data['content'], data['category'], test_size=0.2, random_state=42)



# Combine the vectorizer and the model into a single pipeline.
# Simple machine learning models operate on numbers, so the pipeline first uses the CountVectorizer
# to turn each email into a vector and then passes that vector to the model.
pipeline = make_pipeline(vectorizer, logistic_regression(max_iter=1000))


# Use the LogisticRegression model to fit to the train dataset.
# You may remember y = mx + b and Linear Regression from high school. There, we fitted a line to a scatter plot.
# Logistic Regression is another form of regression.
# However, Logistic Regression helps us determine whether a point belongs in category A or B, which is a perfect fit for telling spam from ham.
pipeline.fit(X_train, y_train)

# Validate that the model has learned something.
# Recall the model operates on vectors. The pipeline transforms the test set with the fitted vectorizer
# and then generates the predictions, all in a single call.
y_pred = pipeline.predict(X_test)


# We now want to see how we have done. Three functions are available for this.
# `accuracy_score` tells us the fraction of test emails that were classified correctly.
# 90% means that 9 out of every 10 entries from the test dataset were predicted accurately.
# The `confusion_matrix` is a 2x2 matrix that gives us more insight (rows are true labels, columns are predicted labels).
# The top left shows us how many ham emails were predicted to be ham (that's good!).
# The bottom right shows us how many spam emails were predicted to be spam (that's good!).
# The other two quadrants tell us the misclassifications.
# Finally, the `classification_report` gives us detailed statistics which you may have seen in a statistics class.
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

print("Classification Report:")
print(classification_report(y_test, y_pred))


Accuracy: 0.97
Classification Report:
              precision    recall  f1-score   support

         ham       0.98      0.98      0.98       729
        spam       0.96      0.95      0.96       305

    accuracy                           0.97      1034
   macro avg       0.97      0.97      0.97      1034
weighted avg       0.97      0.97      0.97      1034
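
The cell above imports and describes `confusion_matrix`, but its output shows only the accuracy and the classification report. To illustrate what the four quadrants count, here is a hand-rolled sketch using only the standard library. The labels are made up and are not the notebook's results, and `confusion_counts` is a hypothetical helper. With scikit-learn, `confusion_matrix(y_test, y_pred, labels=['ham', 'spam'])` uses the same layout: rows are true labels and columns are predicted labels.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, labels=("ham", "spam")):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(true, pred)] for pred in labels] for true in labels]

# Made-up labels for six emails.
y_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "spam", "spam", "ham", "ham"]

matrix = confusion_counts(y_true, y_pred)
for label, row in zip(("ham", "spam"), matrix):
    print(label, row)

# Accuracy is the diagonal (correct predictions) divided by the total.
accuracy = (matrix[0][0] + matrix[1][1]) / len(y_true)
print(f"Accuracy: {accuracy:.2f}")
```

Here the top-left entry is 3 (ham predicted as ham), the bottom-right is 1 (spam predicted as spam), and the two off-diagonal entries are the misclassifications.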



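Step 4. Inspect what the model learned: the features that the vectorizer created and the coefficient that the model assigned to each one.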

In [13]:
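# Let's see which features (aka columns) the vectorizer created.
# They should be all the words that were contained in the training dataset.
# Reuse the vectorizer that the pipeline already fitted so the names line up with the model's coefficients.
vectorizer = pipeline.named_steps['countvectorizer']
feature_names = vectorizer.get_feature_names_out()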
print("Feature names:", feature_names)



# You may be wondering what a machine learning model is tangibly. It is just a collection of numbers.
# You can access these numbers, known as "coefficients", from the coef_ property of the model.
# We will be looking at coef_[0], which holds one coefficient per feature and serves as our measure of importance.
# What does importance mean in this context?
# Some words are more important than others for the model.
# It's nothing personal, just that spam emails tend to contain some words more frequently.
# A positive coefficient tells the model that having that word makes a new email more likely to be spam.

model = pipeline.named_steps['logisticregression']

# Get feature names and coefficients
feature_names = vectorizer.get_feature_names_out()
coefficients = model.coef_[0]

# Create a dictionary of feature names and their corresponding coefficients
feature_importance = dict(zip(feature_names, coefficients))

# Sort the features by importance (absolute value of coefficients)
sorted_features = sorted(feature_importance.items(), key=lambda x: abs(x[1]), reverse=True)

# Print the sorted features
print("Feature importances:")
for feature, importance in sorted_features:
    print(f"{feature}: {importance:.4f}")


# Iterate over the sorted features and find the 10 positive features with the largest magnitude.
# Similarly, find the 10 negative features with the largest magnitude.
# Positive features correspond to spam. Negative features correspond to ham.
# In the output below, `no` and `http` are among the strongest spam indicators.
# The `http` result makes sense: spam emails often want you to click on a link.
# Separate positive and negative features
positive_features = [feature for feature, coef in sorted_features if coef > 0]
negative_features = [feature for feature, coef in sorted_features if coef < 0]

# Get the top 10 positive and negative features
top_10_positive_features = positive_features[:10]
top_10_negative_features = negative_features[:10]

# Print the results
print("Top 10 Positive Features (Spam Indicators):")
for feature in top_10_positive_features:
    print(f"{feature}: {feature_importance[feature]:.4f}")

print("\nTop 10 Negative Features (Ham Indicators):")
for feature in top_10_negative_features:
    print(f"{feature}: {feature_importance[feature]:.4f}")


Feature names: ['aa' 'aaa' 'aabda' ... 'zzezrjok' 'zzo' 'zzocb']
Feature importances:
enron: -1.4940
thanks: -1.4594
attached: -1.3685
doc: -1.3260
daren: -1.2955
pictures: -1.2922
xls: -1.2271
deal: -1.1550
neon: -1.1528
hpl: -1.0435
meter: -1.0106
revised: -1.0056
sitara: -0.9888
no: 0.9201
gas: -0.8875
nom: -0.8833
wassup: -0.8760
http: 0.8536
prices: 0.8512
deals: -0.8290
pm: -0.8192
know: -0.8162
numbers: -0.7686
remove: 0.7604
schedule: -0.7557
resume: -0.7516
june: -0.7465
april: -0.7461
hello: 0.7363
bob: -0.7172
only: 0.7113
plant: -0.6974
nomination: -0.6861
love: -0.6857
mary: -0.6827
removed: 0.6774
hl: -0.6687
yahoo: -0.6684
here: 0.6637
flow: -0.6583
hope: -0.6447
texas: -0.6410
please: -0.6338
party: -0.6273
more: 0.6259
paliourg: 0.6223
hotmail: -0.6223
ken: -0.6202
pain: 0.6160
week: -0.6092
call: -0.6064
will: -0.6046
at: -0.6041
listbot: -0.6003
address: -0.5993
htm: -0.5791
mail: -0.5736
capacity: -0.5713
mobile: 0.5701
february: -0.5661
point: -0.5632
updates: -0.5
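
To see how the sign of a coefficient pushes a prediction toward spam or ham, here is a minimal hand-rolled sketch of the logistic function applied to a few coefficients copied from the output above. It is an illustration, not the fitted model: the intercept is a hypothetical 0.0 (the notebook does not print the real one), words missing from the small dictionary are treated as having zero weight, and `spam_probability` is a made-up helper.

```python
import math

# A few coefficients copied from the printed output above.
coefficients = {"http": 0.8536, "no": 0.9201, "enron": -1.4940, "thanks": -1.4594}

def spam_probability(word_counts, coefficients, intercept=0.0):
    """Logistic function of intercept + sum(coefficient * count)."""
    # intercept=0.0 is an assumption; the real model has its own fitted intercept.
    score = intercept + sum(
        coefficients.get(word, 0.0) * count for word, count in word_counts.items()
    )
    return 1.0 / (1.0 + math.exp(-score))

spammy = spam_probability({"http": 2, "no": 1}, coefficients)
hammy = spam_probability({"enron": 1, "thanks": 1}, coefficients)

print(f"email with 'http' x2 and 'no': {spammy:.3f}")     # above 0.5, leans spam
print(f"email with 'enron' and 'thanks': {hammy:.3f}")   # below 0.5, leans ham
```

Words with positive coefficients raise the score and move the probability toward spam, while words with negative coefficients lower it and move the probability toward ham.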

Submission
1. Upload the Jupyter notebook to Forage.

All Done!