Step 0. Unzip enron1.zip into the current directory.

Step 1. Traverse the dataset and create a Pandas dataframe. This is already done for you and should run without any errors. You should recognize Pandas from task 1.

In [1]:
import pandas as pd
import os

def read_spam():
    category = 'spam'
    directory = './enron1/spam'
    return read_category(category, directory)

def read_ham():
    category = 'ham'
    directory = './enron1/ham'
    return read_category(category, directory)

def read_category(category, directory):
    emails = []
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(directory, filename), 'r') as fp:
            try:
                content = fp.read()
                emails.append({'name': filename, 'content': content, 'category': category})
            except UnicodeDecodeError:  # skip files that cannot be decoded with the default encoding
                print(f'skipped {filename}')
    return emails

ham = read_ham()
spam = read_spam()

df = pd.DataFrame.from_records(ham)
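# DataFrame.append was removed in pandas 2.0; pd.concat is the supported way to stack frames.
df = pd.concat([df, pd.DataFrame.from_records(spam)], ignore_index=True)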

skipped 2248.2004-09-23.GP.spam.txt
skipped 2526.2004-10-17.GP.spam.txt
skipped 2698.2004-10-31.GP.spam.txt
skipped 4566.2005-05-24.GP.spam.txt
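
The four files above were most likely skipped because they contain bytes that are not valid in the default text encoding (an assumption, since the loader does not print the underlying error). If you would rather keep such emails, one option is to decode leniently. The sketch below is not part of the provided loader: `read_text_lenient` and the sample bytes are made up for illustration.

```python
import os
import tempfile

def read_text_lenient(path, encoding="utf-8"):
    """Read a text file, substituting U+FFFD for any undecodable bytes."""
    with open(path, "r", encoding=encoding, errors="replace") as fp:
        return fp.read()

# Hypothetical sample: a lone 0xE9 byte is not valid UTF-8.
raw = b"Caf\xe9 prices reduced"
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(raw)
    sample_path = tmp.name

try:
    content = read_text_lenient(sample_path)
finally:
    os.remove(sample_path)

# ascii() escapes the replacement character so it prints safely on any console.
print(ascii(content))
```

The replacement character is later turned into a space by the preprocessor in Step 2, so lenient decoding would not add odd tokens to the vocabulary.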




Step 2. Data cleaning is a critical part of machine learning. You and I can recognize that 'Hello' and 'hello' are the same word, but a machine does not know this a priori. We can therefore help the machine by performing such normalization steps for it. Write a function `preprocessor` that takes in a string, replaces every non-alphabetic character with a space, and then lowercases the result.

In [2]:
import re

def preprocessor(e):
    return re.sub(r'[^A-Za-z]', ' ', e).lower()

# re.sub replaces every non-alphabetic character with a single space.
# .lower() then converts the remaining letters to lowercase.
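
A quick way to check the function is to run it on a short string. The sample sentence below is invented; the function body repeats the one from the cell above so the snippet runs on its own.

```python
import re

def preprocessor(e):
    # Same logic as the cell above: non-letters become spaces, then lowercase.
    return re.sub(r'[^A-Za-z]', ' ', e).lower()

sample = "Hello, World! Call 555-1234 NOW."  # made-up example text
cleaned = preprocessor(sample)

print(repr(cleaned))    # repr makes the runs of spaces visible
print(cleaned.split())  # the words that survive the cleaning
```

Each removed character becomes exactly one space, so the cleaned string has the same length as the input. The resulting runs of spaces are harmless here because the words are later extracted by a tokenizer rather than by position.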

Step 3. We will now train the machine learning model. All the functions that you will need are imported for you. The instructions explain how they work and hint at which functions to use. You will likely need to refer to the scikit-learn documentation to see exactly how to invoke the functions. It will be handy to keep that tab open.
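
Before filling in the cell, it may help to see what "turning text into a vector of word counts" means. The following is a rough pure-Python sketch of the idea, not scikit-learn's implementation; the two documents and the helper `to_count_vector` are invented for illustration.

```python
from collections import Counter

# Invented mini corpus standing in for the training emails.
docs = ["free money free prices", "meeting attached thanks"]

# "Fit": collect every distinct word and give each one a column, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(text, vocab):
    """'Transform': count how often each vocabulary word occurs in text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]  # Counter returns 0 for missing words

vectors = [to_count_vector(doc, vocab) for doc in docs]
unseen = to_count_vector("free lunch", vocab)  # 'lunch' is not in the vocabulary

print(vocab)
print(vectors)
print(unseen)
```

Words that were never seen while building the vocabulary, like `lunch` here, simply have no column. This is why the vocabulary is learned from the train dataset only and then reused unchanged on the test dataset.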

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# The CountVectorizer converts a text sample into a vector (think of it as an array of integer counts).
# Each entry in the vector corresponds to a single word and the value is the number of times the word appeared.
# Instantiate a CountVectorizer. Make sure to include the preprocessor you previously wrote in the constructor.
# TODO
vectorizer = CountVectorizer(preprocessor=preprocessor)

# Use train_test_split to split the dataset into a train dataset and a test dataset.
# The machine learning model learns from the train dataset.
# Then the trained model is tested on the test dataset to see if it actually learned anything.
# If it merely memorized the training examples, it would have a high accuracy on the train dataset and a low accuracy on the test dataset.
# TODO
X_train, X_test, y_train, y_test = train_test_split(df['content'], df['category'], test_size=0.2, random_state=1)

# Use the vectorizer to transform the existing dataset into a form the model can learn from.
# Remember that simple machine learning models operate on numbers, which the CountVectorizer conveniently produces for us.
# TODO
X_train_df = vectorizer.fit_transform(X_train)  # learns the vocabulary from X_train and converts it to count vectors in one step

# Use the LogisticRegression model to fit to the train dataset.
# You may remember y = mx + b and Linear Regression from high school. There, we fitted a line to a scatter plot.
# Logistic Regression is another form of regression.
# However, Logistic Regression helps us determine whether a point belongs in category A or B, which suits the spam vs. ham decision.
# TODO
model = LogisticRegression()
model.fit(X_train_df, y_train)  # learns the relationship between the word counts and the labels,
# which is then used to predict whether new messages are spam or ham

# Validate that the model has learned something.
# Recall the model operates on vectors. First transform the test set using the vectorizer. 
# Then generate the predictions.
# TODO
X_test_df = vectorizer.transform(X_test)  # converts the test data to count vectors using the vocabulary learned from X_train
y_pred = model.predict(X_test_df)  # uses the trained model to predict a label for each transformed test email

# We now want to see how we have done. We will be using three functions.
# `accuracy_score` tells us the fraction of test entries that were predicted correctly.
# 90% means that 9 out of every 10 entries from the test dataset were predicted correctly.
# The `confusion_matrix` is a 2x2 matrix that gives us more insight.
# The top left shows us how many ham emails were predicted to be ham (that's good!).
# The bottom right shows us how many spam emails were predicted to be spam (that's good!).
# The other two quadrants tell us the misclassifications.
# Finally, the `classification_report` gives us detailed statistics which you may have seen in a statistics class.
# TODO
print(f'Accuracy Score:\n{accuracy_score(y_test,y_pred)}\n')
print(f'Confusion Matrix:\n{confusion_matrix(y_test,y_pred)}\n')
print(f'Detailed Statistics:\n{classification_report(y_test,y_pred)}\n')

# accuracy_score, confusion_matrix, and classification_report are all functions from sklearn.metrics


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy Score:
0.9748549323017408

Confusion Matrix:
[[729  16]
 [ 10 279]]

Detailed Statistics:
              precision    recall  f1-score   support

         ham       0.99      0.98      0.98       745
        spam       0.95      0.97      0.96       289

    accuracy                           0.97      1034
   macro avg       0.97      0.97      0.97      1034
weighted avg       0.98      0.97      0.97      1034
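
The numbers in the report can be recomputed by hand from the confusion matrix, which is a useful check that you are reading the matrix correctly. The sketch below hard-codes the matrix printed above (rows are true labels, columns are predictions, in the order ham, spam) and treats spam as the positive class; the variable names are my own.

```python
# Confusion matrix copied from the output above.
cm = [[729, 16],
      [10, 279]]

tn, fp = cm[0]  # ham predicted as ham, ham predicted as spam
fn, tp = cm[1]  # spam predicted as ham, spam predicted as spam

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
spam_precision = tp / (tp + fp)  # of the emails flagged as spam, how many really were
spam_recall = tp / (tp + fn)     # of the real spam emails, how many were caught
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"total      {total}")
print(f"accuracy   {accuracy:.4f}")
print(f"precision  {spam_precision:.2f}")
print(f"recall     {spam_recall:.2f}")
print(f"f1-score   {spam_f1:.2f}")
```

The rounded values agree with the `spam` row of the report (0.95, 0.97, 0.96) and with the accuracy score printed earlier.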




Step 4.

In [6]:
# Let's see which features (aka columns) the vectorizer created. 
# They should be all the words that were contained in the training dataset.
# TODO
features = vectorizer.get_feature_names_out()
# returns an array of the feature names (words) learned from the training dataset during fit_transform()

# You may be wondering what a machine learning model is, tangibly. It is just a collection of numbers.
# You can access these numbers, known as "coefficients", from the coef_ property of the model.
# We will be looking at coef_[0], which represents the importance of each feature.
# What does importance mean in this context?
# Some words are more important than others for the model.
# It's nothing personal, just that spam emails tend to contain some words more frequently.
# This indicates to the model that having that word would make a new email more likely to be spam.
# TODO
importance = model.coef_[0]  # the single row of coefficients in this two-class model; positive values push toward spam

# Iterate over importance and find the top 10 positive features with the largest magnitude.
# Similarly, find the top 10 negative features with the largest magnitude.
# Positive features correspond to spam. Negative features correspond to ham.
# You will see that `http` is the strongest feature that corresponds to spam emails. 
# It makes sense. Spam emails often want you to click on a link.
# TODO
pairs = list(enumerate(importance))  # (column index, coefficient) tuples
print()
pairs.sort(key=lambda e: e[1], reverse=True)  # descending: largest coefficients first
for i, imp in pairs[:10]:
    print(imp, features[i])  # the 10 strongest spam indicators
print()
pairs.sort(key=lambda e: e[1])  # ascending: most negative coefficients first
for i, imp in pairs[:10]:
    print(imp, features[i])  # the 10 strongest ham indicators (large magnitude, not low importance)

# list(): converts the enumerate object into a list of (index, coefficient) tuples
# enumerate(importance): pairs each coefficient with its column index
# key=lambda e: e[1]: sorts the tuples by their second element, the coefficient
# pairs[:10]: slices the first 10 entries of the sorted list




0.9465510988012696 http
0.8661669423989268 prices
0.8151872641882113 no
0.7514211047666692 pain
0.7224411637471494 only
0.7189674663194974 paliourg
0.7186620449539075 money
0.6930729227716521 more
0.6866452781323215 remove
0.6802337530355546 removed

-1.5991402977917646 enron
-1.5480767907203077 attached
-1.405469124385368 daren
-1.3822225375373334 thanks
-1.3566712317313248 doc
-1.1859457594058171 deal
-1.1422686635202326 xls
-1.1241666686031604 meter
-1.060049346821973 hpl
-1.021684303017507 neon
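
To see how these coefficients turn into a prediction, recall that logistic regression adds up coefficient × count over the words in an email and passes the sum through the sigmoid function. The sketch below uses four coefficients rounded from the output above. The intercept of 0 and the two example emails are assumptions for illustration (the real intercept is stored in `model.intercept_`), and every other word is ignored, which the real model does not do.

```python
import math

# Coefficients rounded from the output above; all other words are ignored in this sketch.
coef = {"http": 0.9466, "prices": 0.8662, "enron": -1.5991, "attached": -1.5481}
intercept = 0.0  # hypothetical value, not taken from the trained model

def spam_probability(word_counts):
    """Sigmoid of the weighted sum of word counts."""
    z = intercept + sum(coef.get(word, 0.0) * n for word, n in word_counts.items())
    return 1.0 / (1.0 + math.exp(-z))

p_spammy = spam_probability({"http": 2, "prices": 1})     # invented email with two links
p_hammy = spam_probability({"enron": 1, "attached": 1})  # invented internal email

print(f"spam-like email: {p_spammy:.3f}")
print(f"ham-like email:  {p_hammy:.3f}")
```

Words with positive coefficients pull the probability above 0.5 (spam), while words with negative coefficients pull it below 0.5 (ham).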


Submission
1. Upload the jupyter notebook to Forage.

All Done!