# Spam detection Data Science Project
This project was inspired by the Chapter 3 exercises of the book 'Hands-On Machine Learning' by Aurélien Géron. If you are new to Machine Learning / Data Science, I highly recommend the book.<br>
To build this project, I followed the line of thought in the following notebook by the same author: https://github.com/ageron/handson-ml2/blob/master/03_classification.ipynb<br>
Related article: https://towardsdatascience.com/spam-detection-in-emails-de0398ea3b48<br>
I avoided copying the code linked above, because doing so would hinder my learning. If you want to see a better solution, check the links above.<br>
The dataset used can be found at the following link: https://spamassassin.apache.org/old/publiccorpus/


## First, let's import the data

In [2]:
import os
import re

In [3]:
SPAM_PATH = './data/spam'
HAM_PATH = './data/easy_ham'

In [4]:
spam_filenames = os.listdir(SPAM_PATH)
ham_filenames = os.listdir(HAM_PATH)
# spam_filenames

In [5]:
import email
import email.parser
import email.policy

def load_email(filename, folder_path):
    """Parse one raw email file into an EmailMessage object."""
    with open(os.path.join(folder_path, filename), "rb") as f:
        return email.parser.BytesParser(policy=email.policy.default).parse(f)

In [6]:
ham_emails = [load_email(filename=name, folder_path=HAM_PATH) for name in ham_filenames]
spam_emails = [load_email(filename=name, folder_path=SPAM_PATH) for name in spam_filenames]
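
To see what the parser returns without touching the dataset, here is a minimal sketch that parses a message from memory with the same `BytesParser` and policy. The `RAW_MESSAGE` bytes are a made-up example, not a file from the corpus.

```python
import email.parser
import email.policy

# Hypothetical raw message, standing in for one file from ./data/easy_ham
RAW_MESSAGE = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Hello\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Just a short test body.\r\n"
)

parser = email.parser.BytesParser(policy=email.policy.default)
message = parser.parsebytes(RAW_MESSAGE)

# Headers are available by name, the body through get_content()
subject = message["Subject"]
content_type = message.get_content_type()
body = message.get_content().strip()
print(subject, content_type, body)
```

The objects in `ham_emails` and `spam_emails` can be inspected the same way.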

## Now it is time to clean the data: we need to remove hyperlinks, special characters, punctuation and digits

In [7]:
print(ham_emails[0].get_content().strip())

use Perl Daily Headline Mailer

Installing Perl 5.8.0 on Mac OS X 10.2
    posted by pudge on Thursday August 29, @15:03 (releases)
    http://use.perl.org/article.pl?sid=02/08/29/193225




Copyright 1997-2002 pudge.  All rights reserved.



You have received this message because you subscribed to it
on use Perl.  To stop receiving this and other
messages from use Perl, or to add more messages
or change your preferences, please go to your user page.

	http://use.perl.org/my/messages/

You can log in and change your preferences from there.


In [8]:
ham_emails[1].get_payload()

"URL: http://boingboing.net/#85515879\nDate: Not supplied\n\nDismuke has a 24 hour radio station and RealAudio archive of '20s and '30s \nmusic. Some nice stuff in here. (Also check out my favorite music archive, Red \nHot Jazz[1].) Link[2] Discuss[3]\n\n[1] http://boingboing.net/redhotjazz.com\n[2] http://dismuke.org/\n[3] http://www.quicktopic.com/16/H/xywUUftYHEg\n\n\n"

In [9]:
import base64

def get_email_body(emailobj):
    """Return the body of the email, preferably as plain text."""

    def _get_body(emailobj):
        """Return the first text/plain body found if the email is multipart,
        or just the regular payload otherwise.
        """
        if emailobj.is_multipart():
            for payload in emailobj.get_payload():
                # A signed message can carry a payload that is itself
                # multipart, so recurse and return the first body found in it
                if payload.is_multipart():
                    return _get_body(payload)

                if payload.get_content_type() == "text/plain":
                    return payload.get_payload()
            # No text/plain part was found
            return ""
        return emailobj.get_payload()

    body = _get_body(emailobj)

    # base64.decodestring was removed in Python 3.9; b64decode replaces it
    enc = emailobj["Content-Transfer-Encoding"]
    if enc is not None and str(enc).strip().lower() == "base64":
        body = base64.b64decode(body).decode("utf-8", errors="replace")

    return body
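
As an alternative to walking the parts by hand, the standard library's `EmailMessage.get_body()` can pick the plain-text part, and `get_content()` decodes the transfer encoding (base64 included). This is only a sketch: `get_plain_text` and the `demo` message are hypothetical names made up for illustration, and real corpus messages may declare unknown charsets, so a `try/except` around `get_content()` may be needed.

```python
import email.policy
from email.message import EmailMessage

def get_plain_text(message):
    """Return the decoded text/plain body, or "" when there is none."""
    part = message.get_body(preferencelist=("plain",))
    if part is None:
        return ""
    return part.get_content()

# Hypothetical multipart message built in memory for illustration
demo = EmailMessage(policy=email.policy.default)
demo["Subject"] = "Demo"
demo.set_content("Plain text version")
demo.add_alternative("<p>HTML version</p>", subtype="html")

plain = get_plain_text(demo).strip()
print(plain)
```

This only works for messages parsed with `email.policy.default`, which is the policy `load_email` already uses.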

In [10]:
def clean_body(text):
    """Strip hyperlinks, punctuation and digits from an email body and lowercase it."""
    # Characters removed outright. Note that dropping '\n' glues the last word
    # of a line to the first word of the next one (e.g. 'mailerinstalling').
    to_clean = ['\n', '@', '=', '!', '.', ',', '\'', ';', '(', ')', '<', '>', ':', '?', '"', '*',
                '1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
    # Characters replaced by a space so the words around them stay separate
    to_space = ['-', '_', '/']
    result = re.sub(r"http\S+", "", text)  # remove hyperlinks
    for sub in to_clean:
        result = result.replace(sub, "")
    for sub in to_space:
        result = result.replace(sub, ' ')
    # Drop leftover tokens that start with '/', '%' or 'r`'
    result = ' '.join(word for word in result.split(' ') if not word.startswith(('/', '%', 'r`')))
    return result.lower()
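
Because `clean_body` deletes `'\n'` instead of replacing it with a space, words on adjacent lines get merged. A possible variant that keeps word boundaries is sketched below. `clean_body_keep_boundaries` is a hypothetical name, and its behavior differs slightly from `clean_body`: every run of non-letters (other than apostrophes) becomes a single space, so `5.8.0` disappears and non-ASCII letters are dropped too.

```python
import re

def clean_body_keep_boundaries(text):
    """Hypothetical variant of clean_body that never glues words together."""
    text = re.sub(r"http\S+", " ", text)     # drop hyperlinks
    text = text.replace("'", "")             # "I've" -> "Ive", as clean_body does
    text = re.sub(r"[^a-zA-Z]+", " ", text)  # any other non-letter run becomes one space
    return text.lower().strip()

sample = "use Perl Daily Headline Mailer\n\nInstalling Perl 5.8.0 on Mac OS X 10.2"
cleaned = clean_body_keep_boundaries(sample)
print(cleaned)
# use perl daily headline mailer installing perl on mac os x
```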

In [14]:
%%time
clean_email = clean_body(get_email_body(ham_emails[3])).strip()
clean_email
# get_email_body(ham_emails[3])

CPU times: user 828 µs, sys: 54 µs, total: 882 µs
Wall time: 889 µs




In [16]:
from collections import Counter
count = Counter(clean_email.split())
count

Counter({'matthias': 2,
         'saou': 1,
         'wrote': 1,
         'i': 10,
         'guess': 1,
         'hope': 1,
         'some': 3,
         'other': 1,
         'people': 1,
         'from': 3,
         'the': 19,
         'list': 2,
         'will': 1,
         'try': 1,
         'it': 5,
         'out': 2,
         'both': 2,
         'problems': 1,
         'you': 1,
         'reported': 1,
         'libasoundso': 1,
         'and': 15,
         'wrong': 1,
         'xine': 1,
         'dependency': 3,
         'are': 2,
         'now': 2,
         'fixed': 1,
         'in': 7,
         'current': 1,
         'packages': 1,
         'oh': 1,
         'its': 1,
         'maybe': 1,
         'also': 4,
         'worth': 1,
         'pointing': 1,
         'ive': 6,
         'implemented': 1,
         'at': 4,
         'last': 2,
         'sorting': 1,
         'by': 5,
         'change': 1,
         'date': 1,
         'alphabetically': 1,
         'for': 10,
         ...

In [None]:
# For every email, run clean_body and then Counter

In [17]:
counter_list = [Counter(clean_body(get_email_body(email)).strip().split()) for email in ham_emails]

In [18]:
counter_list[0]

Counter({'use': 3,
         'perl': 4,
         'daily': 1,
         'headline': 1,
         'mailerinstalling': 1,
         'on': 2,
         'mac': 1,
         'os': 1,
         'x': 1,
         'posted': 1,
         'by': 1,
         'pudge': 2,
         'thursday': 1,
         'august': 1,
         'releases': 1,
         'copyright': 1,
         'all': 1,
         'rights': 1,
         'reservedyou': 1,
         'have': 1,
         'received': 1,
         'this': 2,
         'message': 1,
         'because': 1,
         'you': 2,
         'subscribed': 1,
         'to': 4,
         'iton': 1,
         'stop': 1,
         'receiving': 1,
         'and': 2,
         'othermessages': 1,
         'from': 2,
         'or': 1,
         'add': 1,
         'more': 1,
         'messagesor': 1,
         'change': 2,
         'your': 3,
         'preferences': 2,
         'please': 1,
         'go': 1,
         'user': 1,
         'page': 1,
         'can': 1,
         'log': 1,
         ...

In [None]:
# TODO: convert HTML emails to plain text (e.g. with BeautifulSoup)
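
The note above mentions BeautifulSoup. To stay dependency-free, the sketch below uses the standard library's `html.parser` instead, so it is a swapped-in technique and not the planned solution. `_TextExtractor` and `html_to_text` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse every whitespace run into one space
        return " ".join(" ".join(self._chunks).split())

def html_to_text(html):
    """Return the visible text of an HTML string."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()

sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Buy <b>now</b> &amp; save!</p></body></html>"
)
plain = html_to_text(sample_html)
print(plain)
# Buy now & save!
```

Joining chunks with a space keeps neighbouring paragraphs apart, at the cost of splitting a word whose letters are wrapped in separate tags.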

In [19]:
import pandas as pd
df = pd.DataFrame(counter_list)

In [21]:
# Words absent from an email show up as NaN; treat them as a count of 0
df.fillna(value=0, inplace=True)

In [22]:
df

Unnamed: 0,use,perl,daily,headline,mailerinstalling,on,mac,os,x,posted,...,walkcome,revisited,ratein,datafalse,afterstripping,merepresence,trafficso,revisitedif,saywhat,tomeasure
0,3.0,4.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2546,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2547,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2548,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2549,0.0,0.0,0.0,0.0,0.0,6.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### The result has a huge number of features, so there is still room for a lot of cleaning
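
One common way to shrink the feature set is to keep only the most frequent words across all emails and turn each `Counter` into a fixed-length vector. The sketch below shows the idea on made-up counters. `build_vocabulary`, `to_vector` and `max_words` are hypothetical names, not part of the notebook.

```python
from collections import Counter

def build_vocabulary(counters, max_words):
    """Return the max_words most frequent words across all counters."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return [word for word, _ in total.most_common(max_words)]

def to_vector(counter, vocabulary):
    """Map one email's word counts onto the fixed vocabulary order."""
    return [counter.get(word, 0) for word in vocabulary]

# Made-up counters standing in for counter_list
demo_counters = [
    Counter({"perl": 4, "use": 3, "mac": 1}),
    Counter({"use": 1, "list": 2}),
    Counter({"perl": 1, "list": 1}),
]

vocabulary = build_vocabulary(demo_counters, max_words=3)
vectors = [to_vector(c, vocabulary) for c in demo_counters]
print(vocabulary)  # ['perl', 'use', 'list']
print(vectors)     # [[4, 3, 0], [0, 1, 2], [1, 0, 1]]
```

Rare words such as `mac` fall outside the vocabulary, which is what keeps the number of columns bounded.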