## Part 3 - Spam Email Classifier

##### The dataset “spam_ham_dataset.csv” contains the text of emails a company received over a period of time. Each email is labeled as spam or not spam: the ‘label’ column holds ‘ham’ / ‘spam’, and the ‘class’ column encodes the same information as 0s and 1s (0 = ham, 1 = spam). Use the class column as the target.

In [1]:
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt

In [4]:
'''Read the data using the pandas read_csv method'''
df = pd.read_csv('spam_ham_dataset.csv')

In [5]:
'''Check for missing values'''
missing_values = df.isnull().sum() # Count the number of missing values in each column
print(f"Missing values:\n {missing_values}") # Print the missing values

Missing values:
 label      0
message    0
class      0
dtype: int64


In [6]:
'''Drop rows which have missing values'''
df_cleaned = df.dropna().copy() # Drop rows with missing values; copy so later column assignments do not trigger SettingWithCopyWarning

In [5]:
'''Convert all text to lowercase, remove commas and apostrophes, and lemmatize the text using the WordNetLemmatizer'''
def clean_text(text): # Define a function to preprocess the text
    # Convert text to lowercase
    text = text.lower()

    # Remove only commas and apostrophes; all other punctuation is kept
    text = text.replace(",", "").replace("'", "")

    # Tokenize the text into words, then lemmatize each token
    lemmatizer = WordNetLemmatizer() # Initialize the WordNetLemmatizer
    tokens = word_tokenize(text) # Tokenize the text into words
    tokens = [lemmatizer.lemmatize(word) for word in tokens] # Lemmatize each token

    # Join the lemmatized tokens back into a single string
    cleaned_text = " ".join(tokens)

    return cleaned_text # Return the cleaned text

# Apply the preprocessing function to the text column
df_cleaned["message"] = df_cleaned["message"].apply(clean_text)

# Print the preprocessed data
print("Preprocessed Data:")
df_cleaned.head() # Print the preprocessed data

Preprocessed Data:


Unnamed: 0,label,message,class
0,ham,subject : enron methanol ; meter # : 988291 th...,0
1,ham,subject : hpl nom for january 9 2001 ( see att...,0
2,ham,subject : neon retreat ho ho ho we re around t...,0
3,spam,subject : photoshop window office . cheap . ma...,1
4,ham,subject : re : indian spring this deal is to b...,0


In [6]:
'''Create the X and y datasets consisting of the features and labels'''
X = df_cleaned['message'] # Features
y = df_cleaned['class'] # Labels
print(f"Features:\n{X.head()}\nLabels:\n{y.head()}") # Print the features and labels

Features:
0    subject : enron methanol ; meter # : 988291 th...
1    subject : hpl nom for january 9 2001 ( see att...
2    subject : neon retreat ho ho ho we re around t...
3    subject : photoshop window office . cheap . ma...
4    subject : re : indian spring this deal is to b...
Name: message, dtype: object
Labels:
0    0
1    0
2    0
3    1
4    0
Name: class, dtype: int64


In [7]:
'''Split them into training and testing datasets'''
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42) # Split the data into training and testing
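
The split above is purely random. If the spam/ham ratio should be preserved in both subsets, `train_test_split` accepts a `stratify` argument. The sketch below uses small made-up data (`toy_messages` and `toy_labels` are illustrative names, not from the dataset) to show the effect; it is an optional variation, not what this notebook does.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 14 ham (0) and 6 spam (1) messages
toy_messages = [f"message {i}" for i in range(20)]
toy_labels = [0] * 14 + [1] * 6

# stratify=toy_labels keeps the 70/30 class ratio in both halves
msg_train, msg_test, lab_train, lab_test = train_test_split(
    toy_messages, toy_labels, test_size=0.5, random_state=42, stratify=toy_labels
)

print("train:", Counter(lab_train))
print("test: ", Counter(lab_test))
```

With an unstratified split the class counts in each half can drift away from the overall ratio, which matters more for small or imbalanced datasets.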

In [8]:
'''Instantiate a TfidfVectorizer object and fit it to the training data'''
vectorizer = TfidfVectorizer(stop_words='english') # Initialize the TfidfVectorizer object
X_train_tfidf = vectorizer.fit_transform(X_train) # Fit the vectorizer to the training data
print(f"Sparseform\n{X}\n\nFeatures\n{vectorizer.get_feature_names_out()}\n\nShape: {X_train_tfidf.shape}\n") # Note: {X} prints the cleaned message Series, not the sparse matrix; then the vocabulary and the TF-IDF matrix shape

Sparseform
0       subject : enron methanol ; meter # : 988291 th...
1       subject : hpl nom for january 9 2001 ( see att...
2       subject : neon retreat ho ho ho we re around t...
3       subject : photoshop window office . cheap . ma...
4       subject : re : indian spring this deal is to b...
                              ...                        
5166    subject : put the 10 on the ft the transport v...
5167    subject : 3 / 4 / 2000 and following noms hpl ...
5168    subject : calpine daily gas nomination > > jul...
5169    subject : industrial worksheet for august 2000...
5170    subject : important online banking alert dear ...
Name: message, Length: 5171, dtype: object

Features
['00' '000' '0000' ... 'zzo' 'zzocb' 'zzsyt']

Shape: (4395, 44088)
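
The 4395 training rows in the shape above are consistent with a 15% split of the 5171 messages: as far as I know, scikit-learn rounds the test set size up and gives the remainder to training. A quick arithmetic check under that assumption:

```python
import math

n_samples = 5171      # rows in the cleaned dataset (Series length in the output above)
test_size = 0.15      # fraction passed to train_test_split

n_test = math.ceil(test_size * n_samples)   # test set size, rounded up
n_train = n_samples - n_test                # remainder goes to training

print(n_train, n_test)
```

The 776 test rows also match the totals of the confusion matrices reported for the three classifiers.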



In [9]:
# Transform the testing data
X_test_tfidf = vectorizer.transform(X_test)

In [10]:
# Gaussian Naive Bayes Classifier
gnb = GaussianNB() # Initialize the Gaussian Naive Bayes Classifier
X_train_tfidf_dense = X_train_tfidf.toarray() # Convert the sparse matrix to a dense matrix
X_test_tfidf_dense = X_test_tfidf.toarray() # Convert the sparse matrix to a dense matrix
gnb.fit(X_train_tfidf_dense, y_train) # Fit the Gaussian Naive Bayes Classifier to the training data
y_pred_gnb = gnb.predict(X_test_tfidf_dense) # Predict the labels of the testing data
accuracy_gnb = accuracy_score(y_test, y_pred_gnb) # Calculate the accuracy of the model
conf_matrix_gnb = confusion_matrix(y_test, y_pred_gnb) # Calculate the confusion matrix

print("\nGaussian Naive Bayes Classifier:")
print("Accuracy:", accuracy_gnb)
print("Confusion Matrix:")
print(conf_matrix_gnb)

# Multinomial Naive Bayes Classifier
mnb = MultinomialNB() # Initialize the Multinomial Naive Bayes Classifier
mnb.fit(X_train_tfidf, y_train) # Fit the Multinomial Naive Bayes Classifier to the training data
y_pred_mnb = mnb.predict(X_test_tfidf) # Predict the labels of the testing data
accuracy_mnb = accuracy_score(y_test, y_pred_mnb) # Calculate the accuracy of the model
conf_matrix_mnb = confusion_matrix(y_test, y_pred_mnb) # Calculate the confusion matrix

print("\nMultinomial Naive Bayes Classifier:") # Print the results
print("Accuracy:", accuracy_mnb) # Print the accuracy
print("Confusion Matrix:") # Print the confusion matrix
print(conf_matrix_mnb) # Print the confusion matrix

# RandomForest Classifier
rfc = RandomForestClassifier(random_state=42) # Initialize the RandomForest Classifier
rfc.fit(X_train_tfidf, y_train) # Fit the RandomForest Classifier to the training data
y_pred_rfc = rfc.predict(X_test_tfidf) # Predict the labels of the testing data
accuracy_rfc = accuracy_score(y_test, y_pred_rfc) # Calculate the accuracy of the model
conf_matrix_rfc = confusion_matrix(y_test, y_pred_rfc) # Calculate the confusion matrix

print("\nRandomForest Classifier:") # Print the results
print("Accuracy:", accuracy_rfc) # Print the accuracy
print("Confusion Matrix:") # Print the confusion matrix
print(conf_matrix_rfc) # Print the confusion matrix



Gaussian Naive Bayes Classifier:
Accuracy: 0.9523195876288659
Confusion Matrix:
[[535  18]
 [ 19 204]]

Multinomial Naive Bayes Classifier:
Accuracy: 0.9265463917525774
Confusion Matrix:
[[552   1]
 [ 56 167]]

RandomForest Classifier:
Accuracy: 0.9780927835051546
Confusion Matrix:
[[544   9]
 [  8 215]]
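
`GaussianNB` does not accept sparse input, which is why the cell above calls `.toarray()` first. Using the reported matrix shape (4395 training rows, 44088 features), the 776 test rows implied by the confusion matrices, and assuming the vectorizer's default float64 dtype, a rough estimate of the dense memory footprint is:

```python
n_train, n_test, n_features = 4395, 776, 44088
bytes_per_value = 8  # float64 assumed

train_bytes = n_train * n_features * bytes_per_value
test_bytes = n_test * n_features * bytes_per_value

print(f"dense train matrix: {train_bytes / 1e9:.2f} GB")
print(f"dense test matrix:  {test_bytes / 1e9:.2f} GB")
```

The sparse representation used by the other two classifiers stores only the non-zero entries, so it is far smaller for text data like this.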


##### Random Forest is the best model, with the highest test accuracy (about 0.978).
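
Accuracy alone hides how the errors are split between the two classes. The sketch below derives precision, recall, and F1 for the spam class directly from the confusion matrices printed above, assuming scikit-learn's layout of `[[TN, FP], [FN, TP]]` with spam as the positive class.

```python
# Confusion matrices copied from the output above: [[TN, FP], [FN, TP]]
confusion_matrices = {
    "GaussianNB": [[535, 18], [19, 204]],
    "MultinomialNB": [[552, 1], [56, 167]],
    "RandomForest": [[544, 9], [8, 215]],
}

metrics = {}
for name, ((tn, fp), (fn, tp)) in confusion_matrices.items():
    precision = tp / (tp + fp)   # share of predicted spam that is really spam
    recall = tp / (tp + fn)      # share of real spam that was caught
    metrics[name] = {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

for name, m in metrics.items():
    print(f"{name:14s} acc={m['accuracy']:.4f} prec={m['precision']:.4f} "
          f"rec={m['recall']:.4f} f1={m['f1']:.4f}")
```

Seen this way, MultinomialNB almost never flags ham as spam (1 false positive) but misses 56 of the 223 spam emails, while Random Forest keeps both error types low.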

In [11]:
'''Pick the best model and get predictions for new emails'''
# New email data
emails = [
    "Hello George, how about a game of tennis tomorrow?",
    "Hello, click here if you want drugs tonight",
    "We offer free viagra!!! Click here now!!!",
    "Dear Sara, I prepared the annual report.",
    "Hi David, will we go for cinema tonight?",
    "Best holidays offers only here!!!",
    'Sir, Waiting for your mail.',
    '#@Photoshop a fake image!',
    'No problem. How are you doing?'
]

# Convert the list to a DataFrame
email_df = pd.DataFrame(emails, columns=['message']) # Convert the list to a DataFrame

# Apply the clean_text function to the new email data
email_df['message'] = email_df['message'].apply(clean_text) # Apply the clean_text function to the new email data

# Transform the new email data using the TF-IDF vectorizer
emails_tfidf = vectorizer.transform(email_df['message']) # Transform the new email data using the TF-IDF vectorizer

# Predict the classes of the new emails
email_predictions = rfc.predict(emails_tfidf) # Predict the classes of the new emails
email_df['predicted_class'] = email_predictions # Add the predicted classes to the DataFrame

print("\nPredictions for new emails:") # Print the predictions for the new emails
print(email_df) # Print the predictions for the new emails


Predictions for new emails:
                                             message  predicted_class
0  hello george how about a game of tennis tomorr...                1
1          hello click here if you want drug tonight                1
2    we offer free viagra ! ! ! click here now ! ! !                1
3           dear sara i prepared the annual report .                1
4           hi david will we go for cinema tonight ?                1
5                 best holiday offer only here ! ! !                1
6                        sir waiting for your mail .                1
7                       # @ photoshop a fake image !                1
8                   no problem . how are you doing ?                1


***Which model performed the best and what were the metrics for that model?*** <br>
* Ans) Random Forest performed the best, with about 97.8% accuracy and a roughly equal number of false positives (9) and false negatives (8).

***Looking at the new data for prediction, you should have a sense for which is spam
and which is not – how did your model do against your conclusions?***
* Ans) My model predicted every email as spam. This is clearly wrong, and it is perplexing because my random forest classifier had the highest accuracy score of all the models.
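
One way to start investigating a result like this is to check how much of each new email the fitted vectorizer actually recognizes, since tokens outside the training vocabulary are silently dropped by `transform`. The sketch below uses a tiny stand-in corpus (`toy_training_texts` and `toy_vectorizer` are hypothetical, not the real dataset); in the notebook the same helper could be called with the already fitted `vectorizer`. This is a diagnostic idea, not a confirmed explanation of the all-spam output.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the training messages
toy_training_texts = [
    "subject : meter nomination for january",
    "subject : cheap software offer click here",
    "subject : annual report attached",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_vectorizer.fit(toy_training_texts)

def vocabulary_coverage(message, fitted_vectorizer):
    """Return (known_tokens, unknown_tokens) for one message."""
    analyzer = fitted_vectorizer.build_analyzer()  # same tokenization as fit/transform
    tokens = analyzer(message)
    known = [t for t in tokens if t in fitted_vectorizer.vocabulary_]
    unknown = [t for t in tokens if t not in fitted_vectorizer.vocabulary_]
    return known, unknown

known, unknown = vocabulary_coverage(
    "dear sara i prepared the annual report .", toy_vectorizer
)
print("known:", known)
print("unknown:", unknown)
```

If most tokens of a short email are unknown, its feature vector is nearly all zeros, and the prediction then depends mainly on how the model treats the absence of the words it learned from.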

***Why did I ask you not to remove other types of punctuation?***
* Ans) Punctuation such as exclamation points, question marks, @ signs, and other symbols can be strong indicators of spam, so it should not be removed during preprocessing.
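
One caveat worth checking: with its default `token_pattern`, `TfidfVectorizer` treats punctuation as a separator and keeps only tokens of two or more word characters, so symbols preserved by `clean_text` do not automatically become features. The sketch below compares the default with a custom pattern that also keeps single punctuation marks; the custom pattern and the names `toy_texts`, `default_vec`, and `punct_vec` are assumptions for illustration, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_texts = [
    "we offer free viagra ! ! ! click here now ! ! !",
    "no problem . how are you doing ?",
]

# Default tokenization: punctuation never reaches the vocabulary
default_vec = TfidfVectorizer()
default_vec.fit(toy_texts)

# Assumed alternative: words of 2+ characters OR any single punctuation mark
punct_vec = TfidfVectorizer(token_pattern=r"(?u)\b\w\w+\b|[^\w\s]")
punct_vec.fit(toy_texts)

print("default keeps '!':", "!" in default_vec.vocabulary_)
print("custom keeps '!': ", "!" in punct_vec.vocabulary_)
```

Whether punctuation features actually improve the classifiers here would need to be measured on the test set.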

***What were some of the challenges you faced in doing this MP and what were the
key learnings?***
* Ans) The biggest challenge I faced was, surprisingly, with TF-IDF, where I hit a `ValueError`. The error occurred because I had combined the label and message columns as the features, whereas the TF-IDF vectorizer expects a single sequence of text documents, not a two-column array of strings. Another issue is the apparent overfitting of my random forest classifier, which was supposed to be my 'best' algorithm. Other than that, I had no issues implementing these algorithms.
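
A plausible way to reproduce this kind of error (an assumption about what happened, since the failing code is not shown): iterating over a two-column DataFrame yields its column names, so the vectorizer sees two "documents" instead of one per email, and the row count no longer matches the labels. The `toy_df` frame below is made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical miniature of the dataset
toy_df = pd.DataFrame({
    "label": ["ham", "spam", "ham"],
    "message": ["meeting at noon", "free offer click now", "report attached"],
    "class": [0, 1, 0],
})

# Passing two columns: the vectorizer iterates over column names, not rows
wrong = TfidfVectorizer().fit_transform(toy_df[["label", "message"]])
print("two columns ->", wrong.shape)

# Passing the single text column: one row per email
right = TfidfVectorizer().fit_transform(toy_df["message"])
print("one column  ->", right.shape)

# Fitting on the mismatched matrix fails because X and y differ in length
try:
    MultinomialNB().fit(wrong, toy_df["class"])
except ValueError as err:
    print("ValueError:", err)
```

Selecting the text column as a Series, as the notebook does with `df_cleaned['message']`, avoids the mismatch.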