<h1><center>Case Study 3:</center></h1>
<h2><center>Spam Classifier, Bayes and Clustering</center></h2>
<h3>Authors:</h3>
Joaquin Dominguez <br>
Richard Kim <br>

In [41]:
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string

### Importing/Merging from local folder

- The path must contain every folder of emails
- Each folder must have 'ham' or 'spam' in its name, and no other part of the path may contain 'ham'
- If 'ham' appears anywhere else in a spam folder's path, its spam emails will be labeled as ham later on, because labels come from a substring match on the full path
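One way to relax the path restriction is to read the label from the immediate parent folder only. The following is a minimal sketch, not the approach used in this notebook; `label_from_path` and the example paths are hypothetical.

```python
import os

def label_from_path(path):
    """Return 0 for ham and 1 for spam, looking only at the immediate parent folder name."""
    folder = os.path.basename(os.path.dirname(path))
    if 'ham' in folder:
        return 0
    if 'spam' in folder:
        return 1
    raise ValueError(f"cannot infer label from folder name: {folder!r}")

# 'hamilton' higher up in the path no longer affects the label (hypothetical paths)
print(label_from_path('/data/hamilton/spam_2/00001.abc'))
print(label_from_path('/data/hamilton/easy_ham/00001.abc'))
```

Raising on an unrecognized folder name makes a mislabeled directory fail loudly instead of silently defaulting to spam.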

In [17]:
import os

folders_path = '/home/joaquindominguez/Documents/QTW/case_studies/QTW_CaseStudy/Case Study 3/Data'

file_list = []
for root, dirs, files in os.walk(folders_path, topdown=False):
    for name in files:
        tmp = os.path.join(root,name)
        file_list.append(tmp)
    for item in dirs:
        print(item)

spam_2
easy_ham_2
easy_ham
hard_ham
spam


In [18]:
# Total number of files
len(file_list)

9353

### Parsing Emails: Text vs Multipart

- Each email is parsed into one of two groups: text emails and multipart emails
- Content types (e.g. text/plain) and labels (0 = ham, 1 = spam) are saved in separate lists for each group
- File paths are saved and serve as unique IDs
- Multipart messages whose payload **was not split into separate parts** (so no per-part content type was available) were treated as single text messages
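The text-part extraction described above can also be sketched with `Message.walk()`, which visits nested parts as well. This is an illustrative alternative, not the code used below; the sample message `raw` and the helper `text_parts` are made up for the example.

```python
import email

# A small hand-written multipart message (hypothetical sample data)
raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "plain body\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>html body</p>\n"
    "--XYZ--\n"
)

def text_parts(message):
    """Yield (content_type, payload) for every non-container text/* part."""
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the container itself, keep only leaf parts
        if part.get_content_maintype() == 'text':
            yield part.get_content_type(), part.get_payload()

parts = list(text_parts(email.message_from_string(raw)))
for content_type, payload in parts:
    print(content_type, repr(payload))
```

A message that declares a multipart type but whose body was never split would likely be skipped by this sketch, so the single-text fallback described in the last bullet would still need separate handling.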

In [19]:
# All-in-One
import email

text_list = []

messages_list = []
type_list = []
labels = []

uniq_mult_list = []
mult_list = []

mult_messages_list = []
mult_type_list = []
mult_labels = []

for i in range(len(file_list)):
    with open(file_list[i],'r',encoding='latin1') as f:        
        message = email.message_from_file(f)
        body = message.get_payload()
        content_type = message.get_content_type()
        
        if 'text' in content_type: 
            text_list.append(file_list[i])
            messages_list.append(body)
            if 'ham' in file_list[i]:
                labels.append(0)
            else: 
                labels.append(1)
            type_list.append(content_type)
        elif 'mult' in content_type: 
            uniq_mult_list.append(file_list[i])
            # True only when the payload is an unsplit string containing 'text'; keep it as a single message
            if 'text' in body:
                mult_list.append(file_list[i])
                mult_messages_list.append(body)
                mult_type_list.append(content_type)
                if 'ham' in file_list[i]:
                    mult_labels.append(0)
                else: 
                    mult_labels.append(1)
            else: 
                for j in body: 
                    if 'text' in j.get_content_type(): 
                        mult_list.append(file_list[i])
                        mult_messages_list.append(j.get_payload())
                        mult_type_list.append(j.get_content_type())
                        if 'ham' in file_list[i]:
                            mult_labels.append(0)
                        else: 
                            mult_labels.append(1)

In [20]:
# All-in-one count of files
print('**Total Number of Files**')
print(len(file_list), '\n')

print('**Text Emails**')
print('Text Email Count:',len(text_list))
print('Messages:',len(messages_list))
print('Spam/Ham Labels:',len(labels))
print('Content Type Labels:',len(type_list), '\n')

print('**Multipart Emails**')
print('Multipart Email Count:',len(uniq_mult_list))
print('Separated Messages:',len(mult_messages_list))
print('Spam/Ham Labels:',len(mult_labels))
print('Content Type Labels:',len(mult_type_list), '\n')

**Total Number of Files**
9353 

**Text Emails**
Text Email Count: 8607
Messages: 8607
Spam/Ham Labels: 8607
Content Type Labels: 8607 

**Multipart Emails**
Multipart Email Count: 746
Separated Messages: 1034
Spam/Ham Labels: 1034
Content Type Labels: 1034 



In [21]:
import pandas as pd

text_df = pd.DataFrame({'directory':text_list,'message':messages_list,'spam1':labels,'content type':type_list})
mult_df = pd.DataFrame({'directory':mult_list,'message':mult_messages_list,'spam1':mult_labels,'content type':mult_type_list})

In [22]:
# Count how many text parts each multipart email contributed (size > 1 means 2+ parts from the same file)
dups = mult_df.groupby(mult_df['directory'],as_index=False).size()
multipart_df = pd.merge(mult_df, dups, on='directory', how='left')
test = multipart_df[multipart_df['size'] > 1]

pd.set_option("display.max_colwidth", None)
pd.set_option('display.max_rows', None)

#print(test['message'])
# Number of message rows that belong to emails with 2+ text parts (rows, not distinct emails)
sum(multipart_df['size'] > 1)

677
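Note that `sum(multipart_df['size'] > 1)` counts message rows, not distinct emails. A minimal sketch on toy data (the `toy` frame is hypothetical) shows how the two counts differ:

```python
import pandas as pd

# Toy stand-in for mult_df: 'a' has 2 parts, 'b' has 1, 'c' has 3
toy = pd.DataFrame({'directory': ['a', 'a', 'b', 'c', 'c', 'c']})

counts = toy['directory'].value_counts()

emails_with_multiple_parts = int((counts > 1).sum())    # distinct emails with 2+ parts
rows_from_those_emails = int(counts[counts > 1].sum())  # message rows from those emails

print(emails_with_multiple_parts, rows_from_those_emails)
```

On the real data, the second quantity corresponds to the 677 reported above.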

In [8]:
pd.reset_option('display.max_colwidth')
pd.reset_option('display.max_rows')
print(multipart_df.iloc[1].head(5))

directory       /home/joaquindominguez/Documents/QTW/case_stud...
message         <html>\n<head>\n<title>Free Sizzling LTC Sales...
spam1                                                           1
content type                                            text/html
size                                                            2
Name: 1, dtype: object


In [45]:

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # built once; set lookups are faster than list lookups
text = ''
text_list = []
for i in range(len(multipart_df)):
    val = multipart_df.iloc[i,1]
    # Strip HTML tags and lowercase
    soup = BeautifulSoup(val,'lxml')
    text = soup.get_text().lower()
    # Remove URLs and email addresses
    text = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', text, flags=re.MULTILINE)
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text, flags=re.MULTILINE)
    # Remove punctuation and digits
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = ''.join([ch for ch in text if not ch.isdigit()])
    # Drop stop words, then lemmatize and stem each remaining token
    words_list = [w for w in text.split() if w not in stop_words]
    words_list = [lemmatizer.lemmatize(w) for w in words_list]
    words_list = [stemmer.stem(w) for w in words_list]
    text_list.append(' '.join(words_list))

" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.


In [46]:
multipart_df['proc_text'] = text_list

In [49]:
stop_words = set(stopwords.words('english'))  # built once; set lookups are faster than list lookups
text = ''
text_list = []
for i in range(len(text_df)):
    val = text_df.iloc[i,1]
    # Strip HTML tags and lowercase
    soup = BeautifulSoup(val,'lxml')
    text = soup.get_text().lower()
    # Remove URLs and email addresses
    text = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', text, flags=re.MULTILINE)
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', text, flags=re.MULTILINE)
    # Remove punctuation and digits
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = ''.join([ch for ch in text if not ch.isdigit()])
    # Drop stop words, then lemmatize and stem each remaining token
    words_list = [w for w in text.split() if w not in stop_words]
    words_list = [lemmatizer.lemmatize(w) for w in words_list]
    words_list = [stemmer.stem(w) for w in words_list]
    text_list.append(' '.join(words_list))



In [50]:
text_df['proc_text'] = text_list
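The same cleaning steps are applied to both data frames, so they could be collected into one helper. The sketch below is an assumption-laden illustration, not the notebook's code: `clean_text` and `STOP_WORDS` are hypothetical names, the URL pattern is a simpler stand-in, and HTML stripping, lemmatizing, and stemming are left out so the block runs with the standard library alone.

```python
import re
import string

URL_RE = re.compile(r'https?://\S+')  # simplified URL pattern (assumption)
EMAIL_RE = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
# Removes ASCII punctuation and ASCII digits in a single pass
PUNCT_DIGIT_TABLE = str.maketrans('', '', string.punctuation + string.digits)

# Placeholder stop words for illustration only; the notebook uses NLTK's English list
STOP_WORDS = frozenset({'the', 'a', 'an', 'to', 'at', 'or', 'and', 'is'})

def clean_text(text, stop_words=STOP_WORDS, normalize=lambda w: w):
    """Lowercase, drop URLs/emails/punctuation/digits/stop words, then normalize each token."""
    text = text.lower()
    text = URL_RE.sub('', text)
    text = EMAIL_RE.sub('', text)
    text = text.translate(PUNCT_DIGIT_TABLE)
    return ' '.join(normalize(w) for w in text.split() if w not in stop_words)

print(clean_text('Visit http://example.com or email Bob@Example.com to WIN $100 today!'))
```

Passing something like `normalize=lambda w: stemmer.stem(lemmatizer.lemmatize(w))` and NLTK's stop word list would bring the helper closer to the loops above.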

### Appendix (additional code)

In [18]:
# Text Emails (multipart directories saved elsewhere)
import email
text_list = []
messages_list = []
type_list = []
labels = []

mult_list = []

for i in range(len(file_list)):
    with open(file_list[i],'r',encoding='latin1') as f:
        message = email.message_from_file(f)
        body = message.get_payload()
        content_type = message.get_content_type()
        
        if 'text' in content_type: 
            text_list.append(file_list[i])  # record the path only for text emails so the lists stay aligned
            messages_list.append(body)
            if 'ham' in file_list[i]:
                labels.append(0)
            else: 
                labels.append(1)
            type_list.append(content_type)
        elif 'mult' in content_type: 
            mult_list.append(file_list[i])

In [19]:
# Multipart (keep parts where content types are text)
mult_messages_list = []
mult_type_list = []
mult_labels = []

for i in range(len(mult_list)):
    with open(mult_list[i],'r',encoding='latin1') as f:
        messages = email.message_from_file(f)
        body = messages.get_payload()
        content_type = messages.get_content_type()

        if 'text' in body:
            mult_messages_list.append(body)
            mult_type_list.append(content_type)
            if 'ham' in mult_list[i]:
                mult_labels.append(0)
            else: 
                mult_labels.append(1)
        else: 
            for j in body: 
                if 'text' in j.get_content_type(): 
                    mult_messages_list.append(j.get_payload())
                    mult_type_list.append(j.get_content_type())
                    if 'ham' in mult_list[i]:
                        mult_labels.append(0)
                    else: 
                        mult_labels.append(1)

### Checking Individual Emails

In [21]:
# Codeblock for looking at individual emails
filename = 'spam/00116.29e39a0064e2714681726ac28ff3fdef'

import os
with open(os.path.join('/home/joaquindominguez/Documents/QTW/case_studies/QTW_CaseStudy/Case Study 3/Data/',filename),'r',encoding='latin1') as f: 
    message = email.message_from_file(f)
    body = message.get_payload()
    print(body)

HABERDAR.COM - HABER VE MEDYA PORTALI
Artýk tüm haberleri sadece tek siteden takip edebileceksiniz. Haberdar.com açýldý!
Haber baþlýklarý, spor haberleri, teknoloji haberleri, kültür ve sanat haberleri, internet haberleri, bilim ve uzay, 
sinema, saðlýk...
Aradýðýnýz içerik http://www.haberdar.com adresinde
Sadece týklayýn ve haberdar olun

ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÓ+,ùÞµéX¬²'²Þu¼ÿ9 Íý8«yÚ¶­±©¢W\zYiÞüg­jw°êÞ~ÅDAÿÛi³ÿÿÃÿza¢xýÊ&þ¿Ú²ë­Ç¢¸×úÞ}Ê{³}ýÓÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿJ+^±Ê(¥êÿµ©d¨¥x%ËR×¬r)z¿íjYÿ+-³û(º·~à{ùÞ¶m¦ÏÿþX¬¶Ïì¢êÜyú+ïçzßåËlþX¬¶)ß£û"µë¢^¯ûZ
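The odd characters in the readable part of this output (ý, þ, ð) suggest Turkish text stored as ISO-8859-9 but read as Latin-1. That is an inference from the output, not something the email headers were checked for. Under that assumption, a minimal sketch of re-decoding one line:

```python
# One line copied from the output above, as decoded with latin1
garbled = 'Artýk tüm haberleri sadece tek siteden takip edebileceksiniz. Haberdar.com açýldý!'

# Recover the original bytes, then decode them as ISO-8859-9 (Turkish); the charset is an assumption
repaired = garbled.encode('latin1').decode('iso8859_9')
print(repaired)
```

Because Latin-1 maps every byte to a character, the round trip is lossless, so the files can still be read with `encoding='latin1'` and re-decoded per message when a different charset is known.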


