# Ham or Spam?

🎯 The goal of this challenge is to classify emails as spams (1) or normal emails (0)

🧹 First, you will apply cleaning techniques to these textual data

👩🏻‍🔬 Then, you will convert the cleaned texts into a numerical representation

✉️ Eventually, you will apply the ***Multinomial Naive Bayes*** model to classify each email as either a spam or a regular email.

## (0) The NTLK library (Natural Language Toolkit)

In [1]:
# !pip install nltk

In [1]:
# When importing nltk for the first time, we need to also download a few built-in libraries

import nltk

nltk.download('stopwords')
nltk.download('punkt')      # For nltk<3.9.0
nltk.download('punkt_tab')  # For nltk>=3.9.0
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/glaznos/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/glaznos/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/glaznos/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /home/glaznos/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/glaznos/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:
import pandas as pd

df = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/ham_spam_emails.csv")
df.spam.sum()

1368

In [3]:
df

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1
...,...,...
5723,Subject: re : research and development charges...,0
5724,"Subject: re : receipts from visit jim , than...",0
5725,Subject: re : enron case study update wow ! a...,0
5726,"Subject: re : interest david , please , call...",0


## (1) Cleaning the (text) dataset

The dataset is made up of emails that are classified as ham [0] or spam[1]. You need to clean the dataset before training a prediction model.

### (1.1) Remove Punctuation

❓ Create a function to remove the punctuation. Apply it to the `text` column and add the output to a new column in the dataframe called `clean_text` ❓

In [4]:
# YOUR CODE HERE
import string
string.punctuation

def replace_punc(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

df['clean_text'] = df.text.map(replace_punc)

df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,Subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,Subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,Subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,Subject 4 color printing special request addi...
4,"Subject: do not have money , get software cds ...",1,Subject do not have money get software cds fr...
...,...,...,...
5723,Subject: re : research and development charges...,0,Subject re research and development charges t...
5724,"Subject: re : receipts from visit jim , than...",0,Subject re receipts from visit jim thanks ...
5725,Subject: re : enron case study update wow ! a...,0,Subject re enron case study update wow all ...
5726,"Subject: re : interest david , please , call...",0,Subject re interest david please call shi...


### (1.2) Lower Case

❓ Create a function to lowercase the text. Apply it to `clean_text` ❓

In [5]:
'JOHN'.lower()

'john'

In [6]:
# YOUR CODE HERE
df['clean_text'] = df.clean_text.map(lambda x: x.lower())
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,subject 4 color printing special request addi...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds fr...
...,...,...,...
5723,Subject: re : research and development charges...,0,subject re research and development charges t...
5724,"Subject: re : receipts from visit jim , than...",0,subject re receipts from visit jim thanks ...
5725,Subject: re : enron case study update wow ! a...,0,subject re enron case study update wow all ...
5726,"Subject: re : interest david , please , call...",0,subject re interest david please call shi...


### (1.3) Remove Numbers

❓ Create a function to remove numbers from the text. Apply it to `clean_text` ❓

In [7]:
# YOUR CODE HERE
df['clean_text'] = df.clean_text.map(lambda x: ''.join(char for char in x if not char.isdigit()))
df

Unnamed: 0,text,spam,clean_text
0,Subject: naturally irresistible your corporate...,1,subject naturally irresistible your corporate ...
1,Subject: the stock trading gunslinger fanny i...,1,subject the stock trading gunslinger fanny is...
2,Subject: unbelievable new homes made easy im ...,1,subject unbelievable new homes made easy im w...
3,Subject: 4 color printing special request add...,1,subject color printing special request addit...
4,"Subject: do not have money , get software cds ...",1,subject do not have money get software cds fr...
...,...,...,...
5723,Subject: re : research and development charges...,0,subject re research and development charges t...
5724,"Subject: re : receipts from visit jim , than...",0,subject re receipts from visit jim thanks ...
5725,Subject: re : enron case study update wow ! a...,0,subject re enron case study update wow all ...
5726,"Subject: re : interest david , please , call...",0,subject re interest david please call shi...


### (1.4) Remove StopWords

❓ Create a function to remove stopwords from the text. Apply it to `clean_text`. ❓

In [8]:
# YOUR CODE HERE
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english')) # you can also choose other languages

print(len(stop_words), 'stopwords in English it seems!')

# First tokenize the sentencs, i.e. break them into list of words
df['clean_text'] = df.clean_text.map(lambda wrd: [wrd for wrd in word_tokenize(df.clean_text.iloc[0]) if wrd not in stop_words])

198 stopwords in English it seems!


In [9]:
from nltk import WordNetLemmatizer

def lemmatizer(sentence):
    tmp = ' '.join(w for w in [WordNetLemmatizer().lemmatize(word, pos='v') for word in sentence])
    tmp = ' '.join(w for w in [WordNetLemmatizer().lemmatize(word, pos='n') for word in word_tokenize(tmp)])
    tmp = ' '.join(w for w in [WordNetLemmatizer().lemmatize(word, pos='a') for word in word_tokenize(tmp)])
    return tmp

lemmatizer(df.clean_text.iloc[10])

# ' '.join(w for w in [WordNetLemmatizer().lemmatize(word, pos='v') for word in df.clean_text.iloc[10]])

'subject naturally irresistible corporate identity lt really hard recollect company market full suqgestions information isoverwhelminq good catchy logo stylish statlonery outstanding website make task much easy promise havinq order iogo company automaticaily become world ieader isguite ciear without good product effective business organization practicable aim hotat nowadays market promise market effort become much effective list clear benefit creativeness hand make original logo specially do reflect distinctive company image convenience logo stationery provide format easy use content management system letsyou change website content even structure promptness see logo draft within three business day affordability market break make gap budget satisfaction guarantee provide unlimited amount change extra fee surethat love result collaboration look portfolio interest'

### (1.5) Lemmatize

❓ Create a function to lemmatize the text. Make sure the output is a single string, not a list of words. Apply it to `clean_text`. ❓

In [None]:
# YOUR CODE HERE


# (lambda x: ' '.join(w for w in [WordNetLemmatizer().lemmatize(word, pos='v') for word in x]))(df.clean_text.iloc[0])
df['clean_text'] = df.clean_text.map(lambda x: lemmatizer(x))
# df['clean_text'] = df.clean_text.map(lambda x: ' '.join(w for w in [WordNetLemmatizer().lemmatize(word, pos='a') for word in x]))

## (2) Bag-of-words Modelling

### (2.1) Digitizing the textual data into numbers

❓ Vectorize the `clean_text` to a Bag-of-Words representation with a default CountVectorizer. Save as `X_bow`. ❓

In [11]:
# YOUR CODE HERE
from sklearn.feature_extraction.text import CountVectorizer

# lemma_text = df.clean_text.iloc[0]
# count_vect = CountVectorizer()
# X = count_vect.fit_transform([lemma_text])
# X.toarray()

count_vect = CountVectorizer()
X = count_vect.fit_transform(df.clean_text)
X_bow = X.toarray()

# ?count_vect
# df.clean_text.values

ValueError: empty vocabulary; perhaps the documents only contain stop words

### (2.2) Multinomial Naive Bayes Modelling

❓ Cross-validate a MultinomialNB model with the bag-of-words data. Score the model's accuracy. ❓

In [68]:
# YOUR CODE HERE
import numpy as np

from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import recall_score

cv_results = cross_validate(MultinomialNB(), X_bow, df.spam.values, cv=5, scoring=['recall'])

In [69]:
cv_results['test_recall'].mean()

0.0

🏁 Congratulations !

💾 Don't forget to git add/commit/push your notebook...

🚀 ... and move on to the next challenge !

In [72]:
!git add 01-Ham-or-Spam.ipynb

! git commit -m "ham or spam"

! git push origin master

[master ee04ec1] ham or spam
 1 file changed, 935 insertions(+), 29 deletions(-)
Enumerating objects: 8, done.
Counting objects: 100% (8/8), done.
Delta compression using up to 4 threads
Compressing objects: 100% (8/8), done.
Writing objects: 100% (8/8), 5.79 KiB | 2.90 MiB/s, done.
Total 8 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), done.[K
To github.com:roninrp/data-ham-or-spam.git
 * [new branch]      master -> master
