# Ham or Spam?

🎯 The goal of this challenge is to classify emails as spam (1) or normal emails, also called ham (0)

🧹 First, you will apply cleaning techniques to these textual data

👩🏻‍🔬 Then, you will convert the cleaned texts into a numerical representation

✉️ Finally, you will apply the ***Multinomial Naive Bayes*** model to classify each email as either spam or a regular email.

## (0) The NLTK library (Natural Language Toolkit)

In [2]:
# !pip install nltk

In [3]:
# When using nltk for the first time, we also need to download a few of its data packages (stop-word lists, tokenizer models, WordNet)

import nltk

nltk.download('stopwords')
nltk.download('punkt')      # For nltk<3.9.0
nltk.download('punkt_tab')  # For nltk>=3.9.0
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/saranjthilak92/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /home/saranjthilak92/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/saranjthilak92/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/saranjthilak92/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/saranjthilak92/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [4]:
import pandas as pd

df = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/ham_spam_emails.csv")
df.head()

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


## (1) Cleaning the (text) dataset

The dataset is made up of emails that are labelled as ham [0] or spam [1]. You need to clean the text before training a prediction model.

### (1.1) Remove Punctuation

❓ Create a function to remove the punctuation. Apply it to the `text` column and add the output to a new column in the dataframe called `clean_text` ❓

In [5]:
import string

def remove_punctuation(text):
    """Remove every character listed in string.punctuation."""
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# Apply function and create new column
df['clean_text'] = df['text'].apply(remove_punctuation)

# View result
print(df)


                                                   text  spam  \
0     Subject: naturally irresistible your corporate...     1   
1     Subject: the stock trading gunslinger  fanny i...     1   
2     Subject: unbelievable new homes made easy  im ...     1   
3     Subject: 4 color printing special  request add...     1   
4     Subject: do not have money , get software cds ...     1   
...                                                 ...   ...   
5723  Subject: re : research and development charges...     0   
5724  Subject: re : receipts from visit  jim ,  than...     0   
5725  Subject: re : enron case study update  wow ! a...     0   
5726  Subject: re : interest  david ,  please , call...     0   
5727  Subject: news : aurora 5 . 2 update  aurora ve...     0   

                                             clean_text  
0     Subject naturally irresistible your corporate ...  
1     Subject the stock trading gunslinger  fanny is...  
2     Subject unbelievable new homes made eas
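
The loop above calls `str.replace` once for every punctuation character. A possible alternative, sketched here on the assumption that only the ASCII characters in `string.punctuation` need to go, is to build a translation table once and delete all of them in a single pass. The names `PUNCT_TABLE` and `remove_punctuation_translate` are made up for this illustration and are not part of the challenge.

```python
import string

# Translation table mapping every ASCII punctuation character to None (i.e. delete it)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_translate(text):
    """Delete all characters of string.punctuation in one pass over the text."""
    return text.translate(PUNCT_TABLE)

sample = "Subject: do not have money , get software cds !"
cleaned = remove_punctuation_translate(sample)
print(repr(cleaned))  # the spaces around the removed characters are kept
```

Both versions leave the surrounding whitespace untouched, so double spaces can remain; the tokenizer used later in the notebook takes care of them.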

### (1.2) Lower Case

❓ Create a function to lowercase the text. Apply it to `clean_text` ❓

In [6]:
# Function to lowercase text
def to_lowercase(text):
    return text.lower()

# Apply to the 'clean_text' column
df['clean_text'] = df['clean_text'].apply(to_lowercase)

# View result
print(df)


                                                   text  spam  \
0     Subject: naturally irresistible your corporate...     1   
1     Subject: the stock trading gunslinger  fanny i...     1   
2     Subject: unbelievable new homes made easy  im ...     1   
3     Subject: 4 color printing special  request add...     1   
4     Subject: do not have money , get software cds ...     1   
...                                                 ...   ...   
5723  Subject: re : research and development charges...     0   
5724  Subject: re : receipts from visit  jim ,  than...     0   
5725  Subject: re : enron case study update  wow ! a...     0   
5726  Subject: re : interest  david ,  please , call...     0   
5727  Subject: news : aurora 5 . 2 update  aurora ve...     0   

                                             clean_text  
0     subject naturally irresistible your corporate ...  
1     subject the stock trading gunslinger  fanny is...  
2     subject unbelievable new homes made eas

### (1.3) Remove Numbers

❓ Create a function to remove numbers from the text. Apply it to `clean_text` ❓

In [7]:
# Function to remove numbers from text
def remove_num(text):
    return ''.join(char for char in text if not char.isdigit())

# Apply to the 'clean_text' column
df['clean_text'] = df['clean_text'].apply(remove_num)

# View result
print(df)


                                                   text  spam  \
0     Subject: naturally irresistible your corporate...     1   
1     Subject: the stock trading gunslinger  fanny i...     1   
2     Subject: unbelievable new homes made easy  im ...     1   
3     Subject: 4 color printing special  request add...     1   
4     Subject: do not have money , get software cds ...     1   
...                                                 ...   ...   
5723  Subject: re : research and development charges...     0   
5724  Subject: re : receipts from visit  jim ,  than...     0   
5725  Subject: re : enron case study update  wow ! a...     0   
5726  Subject: re : interest  david ,  please , call...     0   
5727  Subject: news : aurora 5 . 2 update  aurora ve...     0   

                                             clean_text  
0     subject naturally irresistible your corporate ...  
1     subject the stock trading gunslinger  fanny is...  
2     subject unbelievable new homes made eas

### (1.4) Remove StopWords

❓ Create a function to remove stopwords from the text. Apply it to `clean_text`. ❓

In [11]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# Function to remove stop words
def remove_stop_words(text):
    tokens = word_tokenize(text)  # Tokenize inside the function
    tokens_cleaned = [w for w in tokens if w.lower() not in stop_words]
    return ' '.join(tokens_cleaned)  # Return a single string, not a list

# Apply to 'clean_text'
df['clean_text'] = df['clean_text'].apply(remove_stop_words)

# View result
print(df)


                                                   text  spam  \
0     Subject: naturally irresistible your corporate...     1   
1     Subject: the stock trading gunslinger  fanny i...     1   
2     Subject: unbelievable new homes made easy  im ...     1   
3     Subject: 4 color printing special  request add...     1   
4     Subject: do not have money , get software cds ...     1   
...                                                 ...   ...   
5723  Subject: re : research and development charges...     0   
5724  Subject: re : receipts from visit  jim ,  than...     0   
5725  Subject: re : enron case study update  wow ! a...     0   
5726  Subject: re : interest  david ,  please , call...     0   
5727  Subject: news : aurora 5 . 2 update  aurora ve...     0   

                                             clean_text  
0     subject naturally irresistible corporate ident...  
1     subject stock trading gunslinger fanny merrill...  
2     subject unbelievable new homes made eas
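
The order of the cleaning steps matters for stop words. NLTK's English list includes contractions written with an apostrophe (for example `don't`), but punctuation was already stripped in step 1.1, so such words reach this step as `dont` and are no longer matched. The sketch below illustrates the effect with a tiny, hypothetical stop-word set (`TOY_STOP_WORDS`) and plain whitespace splitting instead of `word_tokenize`; both are simplifications chosen so the example runs without any NLTK download.

```python
import string

# Hypothetical mini stop-word list for illustration only; NLTK's English list is much longer
TOY_STOP_WORDS = {"the", "is", "a", "don't", "do", "not"}

def drop_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Keep only whitespace-separated tokens that are not in stop_words."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stop_words)

raw = "don't miss the offer"
# Same sentence after punctuation removal, as in step 1.1
stripped = raw.translate(str.maketrans("", "", string.punctuation))

print(drop_stop_words(raw))       # miss offer
print(drop_stop_words(stripped))  # dont miss offer
```

Whether leftover tokens like `dont` hurt the classifier is an empirical question; the point is only that they are worth checking for when inspecting `clean_text`.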

### (1.5) Lemmatize

❓ Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`. ❓

In [13]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # Create once and reuse for every word
stop_words = set(stopwords.words('english'))

# Function to lemmatize
def lemmatize(text):
    tokens = word_tokenize(text)  # Tokenize inside the function
    tokens_cleaned = [w for w in tokens if w.lower() not in stop_words]

    # 1 - Lemmatizing the verbs
    verb_lemmatized = [
        lemmatizer.lemmatize(word, pos="v")  # v --> verbs
        for word in tokens_cleaned
    ]

    # 2 - Lemmatizing the nouns
    noun_lemmatized = [
        lemmatizer.lemmatize(word, pos="n")  # n --> nouns
        for word in verb_lemmatized
    ]
    return ' '.join(noun_lemmatized)

# Apply to 'clean_text'
df['clean_text'] = df['clean_text'].apply(lemmatize)

# View result
print(df)

                                                   text  spam  \
0     Subject: naturally irresistible your corporate...     1   
1     Subject: the stock trading gunslinger  fanny i...     1   
2     Subject: unbelievable new homes made easy  im ...     1   
3     Subject: 4 color printing special  request add...     1   
4     Subject: do not have money , get software cds ...     1   
...                                                 ...   ...   
5723  Subject: re : research and development charges...     0   
5724  Subject: re : receipts from visit  jim ,  than...     0   
5725  Subject: re : enron case study update  wow ! a...     0   
5726  Subject: re : interest  david ,  please , call...     0   
5727  Subject: news : aurora 5 . 2 update  aurora ve...     0   

                                             clean_text  
0     subject naturally irresistible corporate ident...  
1     subject stock trade gunslinger fanny merrill m...  
2     subject unbelievable new home make easy

## (2) Bag-of-words Modelling

### (2.1) Digitizing the textual data into numbers

❓ Vectorize the `clean_text` to a Bag-of-Words representation with a default CountVectorizer. Save as `X_bow`. ❓

In [45]:
from sklearn.feature_extraction.text import CountVectorizer

# 1. Initialize the vectorizer
count_vectorizer = CountVectorizer()

# 2. Fit and transform the cleaned text
X_bow = count_vectorizer.fit_transform(df['clean_text'])

# 3. View the result
print(X_bow.shape)

# Optionally, convert to a dense DataFrame to inspect
# (the dense array has 5728 x 28173 entries, so this uses a lot of memory)
X_bow_array = X_bow.toarray()
pd.DataFrame(X_bow_array, columns=count_vectorizer.get_feature_names_out())


(5728, 28173)


Unnamed: 0,aa,aaa,aaaenerfax,aadedeji,aagrawal,aal,aaldous,aaliyah,aall,aanalysis,...,zwzm,zxghlajf,zyban,zyc,zygoma,zymg,zzmacmac,zzn,zzncacst,zzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5723,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5724,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5725,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
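
To make the table above less abstract, here is a minimal pure-Python sketch of what a Bag-of-Words matrix contains: one row per document, one column per vocabulary word, and each cell holding a word count. It is a simplified illustration on three made-up documents, not scikit-learn's implementation. In particular, it splits on whitespace only, whereas the default `CountVectorizer` also lowercases the text and ignores single-character tokens.

```python
from collections import Counter

# Three made-up, already-cleaned documents
docs = ["free money free offer", "meeting schedule update", "free meeting"]

# Vocabulary: the sorted unique tokens across all documents (one column each)
vocab = sorted({tok for doc in docs for tok in doc.split()})

def to_counts(doc):
    """Return the row of word counts for one document, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocab]

matrix = [to_counts(doc) for doc in docs]

print(vocab)  # ['free', 'meeting', 'money', 'offer', 'schedule', 'update']
for row in matrix:
    print(row)
# [2, 0, 1, 1, 0, 0]
# [0, 1, 0, 0, 1, 1]
# [1, 1, 0, 0, 0, 0]
```

Most cells are zero because each email uses only a small part of the vocabulary, which is why `fit_transform` returns a sparse matrix rather than a dense array.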


### (2.2) Multinomial Naive Bayes Modelling

❓ Cross-validate a MultinomialNB model with the bag-of-words data. Score the model's accuracy. ❓

In [46]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Target: 1 = spam, 0 = ham
y = df['spam']

clf = MultinomialNB()
accuracy_scores = cross_val_score(clf, X_bow, y, cv=5, scoring='accuracy')

print("Cross-validated accuracy scores:", accuracy_scores)
print(f"Mean accuracy: {accuracy_scores.mean():.2f}")



Cross-validated accuracy scores: [0.9877836  0.98516579 0.9921466  0.98515284 0.99213974]
Mean accuracy: 0.99
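
For intuition about what `MultinomialNB` does with these counts, the sketch below works through the idea by hand on four made-up emails. For each class it adds the log prior to the log likelihood of every word, using add-one (Laplace) smoothing, and predicts the class with the highest score. The data, the helper names (`log_posterior`, `predict`) and the choice to ignore unseen words are assumptions for illustration; this is not scikit-learn's code.

```python
import math
from collections import Counter

# Toy training set (made up): cleaned text with labels 1 = spam, 0 = ham
train = [
    ("free money offer", 1),
    ("free free prize", 1),
    ("meeting schedule update", 0),
    ("project meeting money", 0),
]

vocab = sorted({tok for text, _ in train for tok in text.split()})
classes = sorted({label for _, label in train})

# Documents per class (for the prior) and word counts per class (for the likelihood)
doc_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in classes}
for text, label in train:
    word_counts[label].update(text.split())

def log_posterior(text, c, alpha=1.0):
    """Unnormalized log P(c | text) with add-alpha smoothing."""
    log_prob = math.log(doc_counts[c] / len(train))
    total = sum(word_counts[c].values()) + alpha * len(vocab)
    for tok in text.split():
        if tok in vocab:  # words never seen in training are ignored
            log_prob += math.log((word_counts[c][tok] + alpha) / total)
    return log_prob

def predict(text):
    """Return the class with the highest log posterior."""
    return max(classes, key=lambda c: log_posterior(text, c))

print(predict("free prize offer"))    # 1 (spam)
print(predict("schedule a meeting"))  # 0 (ham)
```

Smoothing keeps a word that never appeared in one class from driving that class's probability to zero, and working with sums of logs avoids numerical underflow on long emails.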


🏁 Congratulations!

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge!