Step 0. Unzip enron1.zip into the current directory.

Step 1. Traverse the dataset and create a Pandas dataframe. This is already done for you and should run without any errors. You should recognize Pandas from task 1.

In [25]:
import pandas as pd
import os

def read_spam():
    category = 'spam'
    directory = './enron1/spam'
    return read_category(category, directory)

def read_ham():
    category = 'ham'
    directory = './enron1/ham'
    return read_category(category, directory)

def read_category(category, directory):
    emails = []
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(directory, filename), 'r') as fp:
            try:
                content = fp.read()
                emails.append({'name': filename, 'content': content, 'category': category})
            except UnicodeDecodeError:
                # Skip files that cannot be decoded with the default encoding
                print(f'skipped {filename}')
    return emails

ham_emails = read_ham()
spam_emails = read_spam()

ham_df = pd.DataFrame.from_records(ham_emails)
spam_df = pd.DataFrame.from_records(spam_emails)

print("Number of Ham Emails:", len(ham_df))
print("Number of Spam Emails:", len(spam_df))

skipped 2248.2004-09-23.GP.spam.txt
skipped 2526.2004-10-17.GP.spam.txt
skipped 2698.2004-10-31.GP.spam.txt
skipped 4566.2005-05-24.GP.spam.txt
Number of Ham Emails: 3672
Number of Spam Emails: 1496
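
The four skipped files are most likely ones containing bytes that the default text encoding cannot decode. If you would rather keep such files than drop them, one option is to open them with an explicit encoding and replace the undecodable bytes. The sketch below is not part of the provided solution; `read_text_lenient` is a hypothetical helper, and the file it reads is a throwaway created only for the demonstration.

```python
import os
import tempfile

def read_text_lenient(path):
    """Hypothetical helper: read a file as UTF-8, replacing undecodable bytes."""
    with open(path, 'r', encoding='utf-8', errors='replace') as fp:
        return fp.read()

# Demonstrate on a temporary file that contains a byte that is invalid in UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.txt')
    with open(path, 'wb') as fp:
        fp.write(b'cheap meds \xff now')
    text = read_text_lenient(path)

# The invalid byte is replaced by U+FFFD instead of raising an error.
print(repr(text))
```

Using something like this inside `read_category` would change the email counts, so the numbers printed above would no longer match.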


Step 2. Data cleaning is a critical part of machine learning. You and I can recognize that 'Hello' and 'hello' are the same word, but a machine does not know this a priori. We can therefore help the machine by performing such normalization steps for it. Write a function `preprocessor` that takes a string, replaces every non-alphabetic character with a space, and then lowercases the result.

In [34]:
import re

def preprocessor(e):
    processed_text = re.sub(r'[^a-zA-Z]', ' ', e)
    processed_text = processed_text.lower()
    return processed_text
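
To check that the function does what the step asks, it can help to run it on a short string. The sketch below repeats the definition so that it runs on its own; the sample sentence is made up for illustration.

```python
import re

def preprocessor(e):
    # Replace every character that is not a letter with a space, then lowercase.
    processed_text = re.sub(r'[^a-zA-Z]', ' ', e)
    return processed_text.lower()

sample = "Hello, World! 123"
cleaned = preprocessor(sample)

# Punctuation and digits become spaces, so the length is unchanged.
print(repr(cleaned))
# Splitting on whitespace leaves only the lowercase words.
print(cleaned.split())  # ['hello', 'world']
```

Note that digits are removed as well, so a token such as `2004` disappears from the vocabulary entirely.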

Step 3. We will now train the machine learning model. All the functions that you will need are imported for you. The instructions explain how they work and hint at which functions to use. You will likely need to refer to the scikit-learn documentation to see exactly how to invoke the functions, so it will be handy to keep that tab open.

In [35]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Instantiate CountVectorizer
count_vectorizer = CountVectorizer(preprocessor=preprocessor)

# Split the dataset into train and test sets
X_train_ham, X_test_ham, y_train_ham, y_test_ham = train_test_split(ham_df["content"], ham_df["category"], test_size=0.2, random_state=1)
X_train_spam, X_test_spam, y_train_spam, y_test_spam = train_test_split(spam_df["content"], spam_df["category"], test_size=0.2, random_state=1)

# Concatenate the training and testing sets for both ham and spam
X_train = pd.concat([X_train_ham, X_train_spam])
X_test = pd.concat([X_test_ham, X_test_spam])
y_train = pd.concat([y_train_ham, y_train_spam])
y_test = pd.concat([y_test_ham, y_test_spam])

# Transform the dataset using CountVectorizer
X_train_transformed = count_vectorizer.fit_transform(X_train)
X_test_transformed = count_vectorizer.transform(X_test)

# Fit Logistic Regression model to the train dataset
logreg = LogisticRegression()
logreg.fit(X_train_transformed, y_train)

# Generate predictions on the test dataset
y_pred = logreg.predict(X_test_transformed)

# Evaluate the model
print(f'Accuracy:\n{accuracy_score(y_test, y_pred)}\n')
print(f'Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}\n')
print(f'Detailed Statistics:\n{classification_report(y_test, y_pred)}\n')


Accuracy:
0.9748792270531401

Confusion Matrix:
[[717  18]
 [  8 292]]

Detailed Statistics:
              precision    recall  f1-score   support

         ham       0.99      0.98      0.98       735
        spam       0.94      0.97      0.96       300

    accuracy                           0.97      1035
   macro avg       0.97      0.97      0.97      1035
weighted avg       0.98      0.97      0.97      1035
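
The numbers in the report can be recomputed by hand from the confusion matrix, which is a useful sanity check. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in sorted label order (`ham`, then `spam`), and treats spam as the positive class.

```python
# Confusion matrix copied from the output above.
# Rows: true ham, true spam. Columns: predicted ham, predicted spam.
cm = [[717, 18],
      [8, 292]]

(tn, fp), (fn, tp) = cm  # spam is the positive class
total = tn + fp + fn + tp

accuracy = (tn + tp) / total      # share of all emails classified correctly
spam_precision = tp / (tp + fp)   # of emails flagged as spam, how many were spam
spam_recall = tp / (tp + fn)      # of actual spam emails, how many were flagged
ham_precision = tn / (tn + fn)    # of emails predicted ham, how many were ham
ham_recall = tn / (tn + fp)       # of actual ham emails, how many were kept

print(f"accuracy       = {accuracy:.4f}")
print(f"spam precision = {spam_precision:.2f}, recall = {spam_recall:.2f}")
print(f"ham precision  = {ham_precision:.2f}, recall = {ham_recall:.2f}")
```

The 18 ham emails predicted as spam are the false positives, which in a spam filter are usually the more costly kind of mistake.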




STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
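
The message above is a convergence warning: the solver stopped at its iteration limit before fully converging. As the message suggests, one option is to raise `max_iter`, which defaults to 100. The sketch below shows the parameter on a tiny made-up corpus, not on the Enron data; the value 1000 is an arbitrary choice for illustration, and rerunning the notebook with it may change the reported metrics slightly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus, for illustration only.
docs = [
    "win money now",
    "cheap pills click here",
    "meeting at noon tomorrow",
    "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

X = CountVectorizer().fit_transform(docs)

# Give the solver more iterations than the default of 100.
model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

print(model.classes_)
print(model.predict(X))
```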


Step 4.

In [44]:
# Let's see which features (aka columns) the vectorizer created. 
# They should be all the words that were contained in the training dataset.
# Get the vocabulary (terms) from the CountVectorizer object
vocabulary = count_vectorizer.vocabulary_

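# Get the feature names (words) in column order, so that feature_names[i]
# matches importance[i]. The keys of vocabulary_ are not guaranteed to be in
# column order, so use get_feature_names_out() (scikit-learn >= 1.0) instead.
feature_names = count_vectorizer.get_feature_names_out()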

# You may be wondering what a machine learning model is tangibly. It is just a collection of numbers. 
# You can access these numbers known as "coefficients" from the coef_ property of the model
# We will be looking at coef_[0] which represents the importance of each feature.
# What does importance mean in this context?
# Some words are more important than others for the model.
# It's nothing personal, just that spam emails tend to contain some words more frequently.
# This indicates to the model that having that word would make a new email more likely to be spam.
importance = logreg.coef_[0]

# Iterate over importance and find the top 10 positive features with the largest magnitude.
# Similarly, find the top 10 negative features with the largest magnitude.
# Positive features correspond to spam. Negative features correspond to ham.
# You will see that `http` is the strongest feature that corresponds to spam emails. 
# It makes sense. Spam emails often want you to click on a link.
# Create a list of tuples containing index and importance
indexed_importance = list(enumerate(importance))

# Sort the list based on importance (descending order)
indexed_importance.sort(key=lambda e: e[1], reverse=True)

# Print the top 10 features with highest importance
print("Top 10 positive features (associated with spam):")
for i, imp in indexed_importance[:10]:
    print(imp, feature_names[i])

# Re-sort in ascending order so the most negative coefficients come first
indexed_importance.sort(key=lambda e: e[1])

# Print the top 10 features with lowest importance (negative)
print("\nTop 10 negative features (associated with ham):")
for i, imp in indexed_importance[:10]:
    print(imp, feature_names[i])

Top 10 positive features (associated with spam):
0.9053617461034007 quzqqcy
0.8530954395077608 toiling
0.8310957525690819 wncl
0.7752318571570987 synchronous
0.7501530396859558 levar
0.7160693884730204 acronymbronchitis
0.6903139675750463 vai
0.6809317580870108 limitate
0.652019209528551 fullness
0.6472422612668653 ikeja

Top 10 negative features (associated with ham):
-1.805572070184771 caught
-1.542090920102343 chawkins
-1.4784043515773455 tified
-1.3603392075670555 smerrill
-1.3446254573683243 bd
-1.172531720742926 dugout
-1.1508078100744572 dime
-1.081884981629174 czwsxm
-1.078875815265842 archilochian
-1.044061936205263 utdallas


Submission
1. Upload the Jupyter notebook to Forage.

All Done!