# <center> TP - Text-Mining </center>

**Before you begin this notebook, make sure that you have the data folder and that you are running this notebook from your workspace.**

_Now let's practice and use some text-mining techniques!_
<br/>
<br/> In this notebook we will study a mail dataset.
Our final goal is to gain insight into the different subjects that appear in mailboxes. One major obstacle is that these mailboxes are polluted with a lot of spam.

> The sequence we propose:
- First, you'll pre-process the content of the mails
- Then you will try to detect whether a mail is spam or not with a supervised learning algorithm
- Finally, once you've trained an algorithm to detect spam, you will try to identify the main topics in the remaining non-spam mails

## The data

The data is split into six parts, each containing around 5000 mails.
<br/> Each of these data chunks is split into two parts:
>- hams: non-spam mails
- spams: mails containing advertisements or irrelevant content


Get the zip file here:
- https://drive.google.com/file/d/1j1xO3HSevP__ZsP2FvTx7p0UHJO__7rj/view?usp=sharing


***
***

## 0. Imports

In [2]:
# Import useful packages
import os
import re
import string
import random
import numpy as np
import pandas as pd
from collections import Counter

#Import nltk packages to manipulate text
import nltk
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.metrics import ConfusionMatrix
from nltk.stem.snowball import SnowballStemmer

from nltk import word_tokenize, WordNetLemmatizer, PorterStemmer
from nltk import NaiveBayesClassifier, classify
from nltk import pos_tag
from nltk import ngrams

#***
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('omw-1.4')

from nltk import word_tokenize,sent_tokenize

# Let's add a path containing some useful nltk data
nltk.data.path += ['/mnt/share/nltk_data']




[nltk_data] Downloading package punkt to
[nltk_data]     /Users/wailbenfatma/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/wailbenfatma/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/wailbenfatma/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/wailbenfatma/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [6]:
# Set the variable below to the path of your data folder
path_data  = '/Users/wailbenfatma/Documents/work/ESME/NLP/exoCours2/'

# I. Data ingestion

In [4]:
# Helper function: you do not need to study it in detail
def read_mails(folder):
    """
    Reads all the mails contained in a folder and gathers them in a list
    
    Args :
        folder (str) : path to the folder containing mails
        
    Returns :
        mails_list (list) : a list containing mails contents
    """
    
    mails_list = []
    files_list = os.listdir(folder)
    for file_name in files_list:
        # The with-statement closes every file, not only the last one opened
        with open(os.path.join(folder, file_name), 'r', encoding='latin1') as file_content:
            mails_list.append(file_content.read())
    return mails_list

## A. Load data

The data is split into six parts, each containing around 5000 mails.
<br/> Each of these data chunks is split into two parts: ham and spam
<br/>Let's load every data chunk found in `path_data`

In [7]:
spams = []
hams = []
for folder in os.listdir(path_data):
    if not folder.startswith('.'):
        # Load corresponding spams
        spams.extend(read_mails(path_data + folder + "/spam/"))
        # Load corresponding hams
        hams.extend(read_mails(path_data + folder + "/ham/"))
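Before going further, it can be worth checking how many mails of each class were loaded. The sketch below is only an illustration: the helper name `summarize_corpus` and the toy lists are assumptions, and in the notebook you would pass the real `spams` and `hams` lists instead.

```python
def summarize_corpus(spams, hams):
    """Return the number of mails per class and the share of spam."""
    total = len(spams) + len(hams)
    # Guard against an empty corpus to avoid a division by zero
    spam_share = len(spams) / total if total else 0.0
    return {'n_spam': len(spams), 'n_ham': len(hams), 'spam_share': spam_share}

# Toy data standing in for the real lists (hypothetical contents)
toy_spams = ['Subject: win money', 'Subject: cheap pills', 'Subject: hot offer']
toy_hams = ['Subject: meeting notes']
print(summarize_corpus(toy_spams, toy_hams))
```

A strongly unbalanced corpus is worth knowing about early, since it affects how the accuracy of a spam classifier should be read.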

# II. Preprocessing

## A. Preprocessing one email

Display one email content

In [8]:
#Get one spam-email & print it
single_email = spams[0]

In [9]:
print(single_email)

Subject: what up , , your cam babe
what are you looking for ?
if your looking for a companion for friendship , love , a date , or just good ole '
fashioned * * * * * * , then try our brand new site ; it was developed and created
to help anyone find what they ' re looking for . a quick bio form and you ' re
on the road to satisfaction in every sense of the word . . . . no matter what
that may be !
try it out and youll be amazed .
have a terrific time this evening
copy and pa ste the add . ress you see on the line below into your browser to come to the site .
http : / / www . meganbang . biz / bld / acc /
no more plz
http : / / www . naturalgolden . com / retract /
counterattack aitken step preemptive shoehorn scaup . electrocardiograph movie honeycomb . monster war brandywine pietism byrne catatonia . encomia lookup intervenor skeleton turn catfish .



### Lowercase the text

Lowercase the content of the previously displayed mail

In [10]:
#TODO : lower case the email & print it
lower_mail = single_email.lower()
print(lower_mail)

subject: what up , , your cam babe
what are you looking for ?
if your looking for a companion for friendship , love , a date , or just good ole '
fashioned * * * * * * , then try our brand new site ; it was developed and created
to help anyone find what they ' re looking for . a quick bio form and you ' re
on the road to satisfaction in every sense of the word . . . . no matter what
that may be !
try it out and youll be amazed .
have a terrific time this evening
copy and pa ste the add . ress you see on the line below into your browser to come to the site .
http : / / www . meganbang . biz / bld / acc /
no more plz
http : / / www . naturalgolden . com / retract /
counterattack aitken step preemptive shoehorn scaup . electrocardiograph movie honeycomb . monster war brandywine pietism byrne catatonia . encomia lookup intervenor skeleton turn catfish .



### Tokenization

Here we are using a word tokenizer to divide the sentence into tokens

In [11]:
#TODO : tokenize your email and print it
tokenized_mail = word_tokenize(lower_mail)
print(tokenized_mail)

['subject', ':', 'what', 'up', ',', ',', 'your', 'cam', 'babe', 'what', 'are', 'you', 'looking', 'for', '?', 'if', 'your', 'looking', 'for', 'a', 'companion', 'for', 'friendship', ',', 'love', ',', 'a', 'date', ',', 'or', 'just', 'good', 'ole', "'", 'fashioned', '*', '*', '*', '*', '*', '*', ',', 'then', 'try', 'our', 'brand', 'new', 'site', ';', 'it', 'was', 'developed', 'and', 'created', 'to', 'help', 'anyone', 'find', 'what', 'they', "'", 're', 'looking', 'for', '.', 'a', 'quick', 'bio', 'form', 'and', 'you', "'", 're', 'on', 'the', 'road', 'to', 'satisfaction', 'in', 'every', 'sense', 'of', 'the', 'word', '.', '.', '.', '.', 'no', 'matter', 'what', 'that', 'may', 'be', '!', 'try', 'it', 'out', 'and', 'youll', 'be', 'amazed', '.', 'have', 'a', 'terrific', 'time', 'this', 'evening', 'copy', 'and', 'pa', 'ste', 'the', 'add', '.', 'ress', 'you', 'see', 'on', 'the', 'line', 'below', 'into', 'your', 'browser', 'to', 'come', 'to', 'the', 'site', '.', 'http', ':', '/', '/', 'www', '.', 'me
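To make the idea of tokenization concrete, here is a minimal regular-expression sketch that splits a string into word tokens and punctuation tokens. It is only an approximation written for illustration (the helper name `simple_tokenize` is hypothetical) and it is not how `word_tokenize` works internally: the NLTK tokenizer handles contractions, abbreviations and many other cases.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample = "subject: what up , , your cam babe"
print(simple_tokenize(sample))
```

On this first line of the mail, the sketch yields the same tokens as the beginning of the output above, but the two tokenizers will diverge on harder inputs.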

In [14]:
#TODO : create bigrams of your email & print it
bigrams = list(ngrams(tokenized_mail, 2))
bigrams

[('subject', ':'),
 (':', 'what'),
 ('what', 'up'),
 ('up', ','),
 (',', ','),
 (',', 'your'),
 ('your', 'cam'),
 ('cam', 'babe'),
 ('babe', 'what'),
 ('what', 'are'),
 ('are', 'you'),
 ('you', 'looking'),
 ('looking', 'for'),
 ('for', '?'),
 ('?', 'if'),
 ('if', 'your'),
 ('your', 'looking'),
 ('looking', 'for'),
 ('for', 'a'),
 ('a', 'companion'),
 ('companion', 'for'),
 ('for', 'friendship'),
 ('friendship', ','),
 (',', 'love'),
 ('love', ','),
 (',', 'a'),
 ('a', 'date'),
 ('date', ','),
 (',', 'or'),
 ('or', 'just'),
 ('just', 'good'),
 ('good', 'ole'),
 ('ole', "'"),
 ("'", 'fashioned'),
 ('fashioned', '*'),
 ('*', '*'),
 ('*', '*'),
 ('*', '*'),
 ('*', '*'),
 ('*', '*'),
 ('*', ','),
 (',', 'then'),
 ('then', 'try'),
 ('try', 'our'),
 ('our', 'brand'),
 ('brand', 'new'),
 ('new', 'site'),
 ('site', ';'),
 (';', 'it'),
 ('it', 'was'),
 ('was', 'developed'),
 ('developed', 'and'),
 ('and', 'created'),
 ('created', 'to'),
 ('to', 'help'),
 ('help', 'anyone'),
 ('anyone', 'find'),


In [15]:
#TODO : create trigrams of your email & print it
trigrams = list(ngrams(tokenized_mail, 3))
trigrams

[('subject', ':', 'what'),
 (':', 'what', 'up'),
 ('what', 'up', ','),
 ('up', ',', ','),
 (',', ',', 'your'),
 (',', 'your', 'cam'),
 ('your', 'cam', 'babe'),
 ('cam', 'babe', 'what'),
 ('babe', 'what', 'are'),
 ('what', 'are', 'you'),
 ('are', 'you', 'looking'),
 ('you', 'looking', 'for'),
 ('looking', 'for', '?'),
 ('for', '?', 'if'),
 ('?', 'if', 'your'),
 ('if', 'your', 'looking'),
 ('your', 'looking', 'for'),
 ('looking', 'for', 'a'),
 ('for', 'a', 'companion'),
 ('a', 'companion', 'for'),
 ('companion', 'for', 'friendship'),
 ('for', 'friendship', ','),
 ('friendship', ',', 'love'),
 (',', 'love', ','),
 ('love', ',', 'a'),
 (',', 'a', 'date'),
 ('a', 'date', ','),
 ('date', ',', 'or'),
 (',', 'or', 'just'),
 ('or', 'just', 'good'),
 ('just', 'good', 'ole'),
 ('good', 'ole', "'"),
 ('ole', "'", 'fashioned'),
 ("'", 'fashioned', '*'),
 ('fashioned', '*', '*'),
 ('*', '*', '*'),
 ('*', '*', '*'),
 ('*', '*', '*'),
 ('*', '*', '*'),
 ('*', '*', ','),
 ('*', ',', 'then'),
 (',', 'th
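An n-gram is simply a window of n consecutive tokens. The following sketch builds n-grams with plain Python and counts them with `Counter`; the helper name `make_ngrams` and the short token list are assumptions used for illustration, and in the notebook `nltk.ngrams` does the same job.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    # Zip n shifted views of the list: tokens[0:], tokens[1:], ..., tokens[n-1:]
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['what', 'are', 'you', 'looking', 'for', '?', 'if', 'your', 'looking', 'for']
bigram_counts = Counter(make_ngrams(tokens, 2))
# Most frequent bigram in this short excerpt
print(bigram_counts.most_common(1))
```

Counting n-grams in this way is a common first step to spot recurring expressions in a corpus.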

### Lemmatize 

We want to lemmatize the tokens, i.e. reduce each word to its dictionary form (lemma)

In [19]:
nltk.download('omw-1.4')

#TODO get lemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/wailbenfatma/nltk_data...


True

In [20]:
#TODO print lemmatization of "found" without POS (the default POS is noun)
result = lemmatizer.lemmatize("found")
print(result)
#TODO print lemmatization of "found" with POS ('v' = verb)
result2 = lemmatizer.lemmatize("found", 'v')
print(result2)

found
find


One subtlety: to lemmatize effectively, the words must first be POS-tagged (with `pos_tag`) so that the lemmatizer knows their grammatical category

### Let's pos_tag the tokens

In [22]:
#TODO pos_tag your tokenized email & print it
pos_tagged_mail = pos_tag(tokenized_mail)
print(pos_tagged_mail)

[('subject', 'NN'), (':', ':'), ('what', 'WP'), ('up', 'IN'), (',', ','), (',', ','), ('your', 'PRP$'), ('cam', 'NN'), ('babe', 'IN'), ('what', 'WP'), ('are', 'VBP'), ('you', 'PRP'), ('looking', 'VBG'), ('for', 'IN'), ('?', '.'), ('if', 'IN'), ('your', 'PRP$'), ('looking', 'VBG'), ('for', 'IN'), ('a', 'DT'), ('companion', 'NN'), ('for', 'IN'), ('friendship', 'NN'), (',', ','), ('love', 'NN'), (',', ','), ('a', 'DT'), ('date', 'NN'), (',', ','), ('or', 'CC'), ('just', 'RB'), ('good', 'JJ'), ('ole', 'NN'), ("'", 'POS'), ('fashioned', 'VBN'), ('*', 'NNP'), ('*', 'NNP'), ('*', 'NNP'), ('*', 'NNP'), ('*', 'NNP'), ('*', 'NNP'), (',', ','), ('then', 'RB'), ('try', 'VB'), ('our', 'PRP$'), ('brand', 'NN'), ('new', 'JJ'), ('site', 'NN'), (';', ':'), ('it', 'PRP'), ('was', 'VBD'), ('developed', 'VBN'), ('and', 'CC'), ('created', 'VBN'), ('to', 'TO'), ('help', 'VB'), ('anyone', 'NN'), ('find', 'VB'), ('what', 'WP'), ('they', 'PRP'), ("'", 'VBP'), ('re', 'JJ'), ('looking', 'VBG'), ('for', 'IN'), ('

In [23]:
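# Try to understand
def get_wordnet_pos(treebank_tag):
    """
    Maps a Penn Treebank tag (as returned by pos_tag) to the coarser
    WordNet part of speech expected by the lemmatizer
    """
    # The argument is not named pos_tag, to avoid shadowing the nltk function
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Default to noun, which is also the lemmatizer's own default
        return wordnet.NOUN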

### Lemmatizer 

In [24]:
#TODO: Lemmatize your email without pos & print it
lemmatized_mail_no_pos = [lemmatizer.lemmatize(token) for token in tokenized_mail]
print(lemmatized_mail_no_pos)

['subject', ':', 'what', 'up', ',', ',', 'your', 'cam', 'babe', 'what', 'are', 'you', 'looking', 'for', '?', 'if', 'your', 'looking', 'for', 'a', 'companion', 'for', 'friendship', ',', 'love', ',', 'a', 'date', ',', 'or', 'just', 'good', 'ole', "'", 'fashioned', '*', '*', '*', '*', '*', '*', ',', 'then', 'try', 'our', 'brand', 'new', 'site', ';', 'it', 'wa', 'developed', 'and', 'created', 'to', 'help', 'anyone', 'find', 'what', 'they', "'", 're', 'looking', 'for', '.', 'a', 'quick', 'bio', 'form', 'and', 'you', "'", 're', 'on', 'the', 'road', 'to', 'satisfaction', 'in', 'every', 'sense', 'of', 'the', 'word', '.', '.', '.', '.', 'no', 'matter', 'what', 'that', 'may', 'be', '!', 'try', 'it', 'out', 'and', 'youll', 'be', 'amazed', '.', 'have', 'a', 'terrific', 'time', 'this', 'evening', 'copy', 'and', 'pa', 'ste', 'the', 'add', '.', 'res', 'you', 'see', 'on', 'the', 'line', 'below', 'into', 'your', 'browser', 'to', 'come', 'to', 'the', 'site', '.', 'http', ':', '/', '/', 'www', '.', 'mega

In [25]:
#TODO: Lemmatize your email with pos & print it
lemmatized_mail = [lemmatizer.lemmatize(token, get_wordnet_pos(tag)) for token, tag in pos_tagged_mail]
print(lemmatized_mail)

['subject', ':', 'what', 'up', ',', ',', 'your', 'cam', 'babe', 'what', 'be', 'you', 'look', 'for', '?', 'if', 'your', 'look', 'for', 'a', 'companion', 'for', 'friendship', ',', 'love', ',', 'a', 'date', ',', 'or', 'just', 'good', 'ole', "'", 'fashion', '*', '*', '*', '*', '*', '*', ',', 'then', 'try', 'our', 'brand', 'new', 'site', ';', 'it', 'be', 'develop', 'and', 'create', 'to', 'help', 'anyone', 'find', 'what', 'they', "'", 're', 'look', 'for', '.', 'a', 'quick', 'bio', 'form', 'and', 'you', "'", 're', 'on', 'the', 'road', 'to', 'satisfaction', 'in', 'every', 'sense', 'of', 'the', 'word', '.', '.', '.', '.', 'no', 'matter', 'what', 'that', 'may', 'be', '!', 'try', 'it', 'out', 'and', 'youll', 'be', 'amaze', '.', 'have', 'a', 'terrific', 'time', 'this', 'evening', 'copy', 'and', 'pa', 'ste', 'the', 'add', '.', 'res', 'you', 'see', 'on', 'the', 'line', 'below', 'into', 'your', 'browser', 'to', 'come', 'to', 'the', 'site', '.', 'http', ':', '/', '/', 'www', '.', 'meganbang', '.', 'bi

In [27]:
#TODO: Did we delete any words between tokenization & lemmatization?
print(len(lemmatized_mail_no_pos))
print(len(lemmatized_mail))

186
186
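Both lists have the same length, so the two lemmatizations can be compared position by position. The sketch below uses a few tokens copied from the two outputs above, in matching positions; the helper name `lemma_differences` is an assumption made for this illustration.

```python
def lemma_differences(first, second):
    """Return the pairs of lemmas that differ at the same position."""
    return [(a, b) for a, b in zip(first, second) if a != b]

# A few tokens taken from the outputs above, in matching positions
without_pos = ['are', 'you', 'looking', 'it', 'wa', 'developed', 'and', 'created']
with_pos = ['be', 'you', 'look', 'it', 'be', 'develop', 'and', 'create']

for before, after in lemma_differences(without_pos, with_pos):
    print(before, '->', after)
```

Without a POS hint the lemmatizer treats every token as a noun, which is why verbs such as `looking` are left unchanged and `was` is wrongly cut down to `wa`.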


### Stemmer

In [30]:
stemmer = PorterStemmer()
#TODO: stem your email
stemmed_email = [stemmer.stem(token) for token in tokenized_mail] 
print(stemmed_email)


['subject', ':', 'what', 'up', ',', ',', 'your', 'cam', 'babe', 'what', 'are', 'you', 'look', 'for', '?', 'if', 'your', 'look', 'for', 'a', 'companion', 'for', 'friendship', ',', 'love', ',', 'a', 'date', ',', 'or', 'just', 'good', 'ole', "'", 'fashion', '*', '*', '*', '*', '*', '*', ',', 'then', 'tri', 'our', 'brand', 'new', 'site', ';', 'it', 'wa', 'develop', 'and', 'creat', 'to', 'help', 'anyon', 'find', 'what', 'they', "'", 're', 'look', 'for', '.', 'a', 'quick', 'bio', 'form', 'and', 'you', "'", 're', 'on', 'the', 'road', 'to', 'satisfact', 'in', 'everi', 'sens', 'of', 'the', 'word', '.', '.', '.', '.', 'no', 'matter', 'what', 'that', 'may', 'be', '!', 'tri', 'it', 'out', 'and', 'youll', 'be', 'amaz', '.', 'have', 'a', 'terrif', 'time', 'thi', 'even', 'copi', 'and', 'pa', 'ste', 'the', 'add', '.', 'ress', 'you', 'see', 'on', 'the', 'line', 'below', 'into', 'your', 'browser', 'to', 'come', 'to', 'the', 'site', '.', 'http', ':', '/', '/', 'www', '.', 'meganbang', '.', 'biz', '/', 'b

In [29]:
#TODO: print the number of tokens in your stemmed email
print(len(stemmed_email))

186


### Stopwords

In [31]:
# Have a look at stopwords
stoplist = stopwords.words('english')

#TODO: print example of stop words
print(stoplist)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [32]:
#TODO: removing stopwords
mail_no_stopwords = [word for word in tokenized_mail if word not in stoplist]
print(mail_no_stopwords)

['subject', ':', ',', ',', 'cam', 'babe', 'looking', '?', 'looking', 'companion', 'friendship', ',', 'love', ',', 'date', ',', 'good', 'ole', "'", 'fashioned', '*', '*', '*', '*', '*', '*', ',', 'try', 'brand', 'new', 'site', ';', 'developed', 'created', 'help', 'anyone', 'find', "'", 'looking', '.', 'quick', 'bio', 'form', "'", 'road', 'satisfaction', 'every', 'sense', 'word', '.', '.', '.', '.', 'matter', 'may', '!', 'try', 'youll', 'amazed', '.', 'terrific', 'time', 'evening', 'copy', 'pa', 'ste', 'add', '.', 'ress', 'see', 'line', 'browser', 'come', 'site', '.', 'http', ':', '/', '/', 'www', '.', 'meganbang', '.', 'biz', '/', 'bld', '/', 'acc', '/', 'plz', 'http', ':', '/', '/', 'www', '.', 'naturalgolden', '.', 'com', '/', 'retract', '/', 'counterattack', 'aitken', 'step', 'preemptive', 'shoehorn', 'scaup', '.', 'electrocardiograph', 'movie', 'honeycomb', '.', 'monster', 'war', 'brandywine', 'pietism', 'byrne', 'catatonia', '.', 'encomia', 'lookup', 'intervenor', 'skeleton', 'turn
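Testing membership in a Python list scans the whole list, which becomes slow when the filter is applied to thousands of mails. A possible variant, sketched below under the assumption that the order of the stop words does not matter, converts the stop list to a set first. The helper name `remove_stopwords` and the small stop list are hypothetical.

```python
def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in the stop word collection."""
    stop_set = set(stop_words)  # set lookups take constant time on average
    return [token for token in tokens if token not in stop_set]

# Small hypothetical stop list, used instead of the full NLTK one
toy_stoplist = ['what', 'are', 'you', 'for', 'your', 'a', 'if']
tokens = ['what', 'are', 'you', 'looking', 'for', '?', 'if', 'your',
          'looking', 'for', 'a', 'companion']

kept = remove_stopwords(tokens, toy_stoplist)
print(kept)
print(len(tokens) - len(kept), 'tokens removed')
```

The result is the same as with the list comprehension used in the cell above; only the lookup cost changes.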

In [33]:
print(tokenized_mail)
print('---------------------------------')
print(mail_no_stopwords)

['subject', ':', 'what', 'up', ',', ',', 'your', 'cam', 'babe', 'what', 'are', 'you', 'looking', 'for', '?', 'if', 'your', 'looking', 'for', 'a', 'companion', 'for', 'friendship', ',', 'love', ',', 'a', 'date', ',', 'or', 'just', 'good', 'ole', "'", 'fashioned', '*', '*', '*', '*', '*', '*', ',', 'then', 'try', 'our', 'brand', 'new', 'site', ';', 'it', 'was', 'developed', 'and', 'created', 'to', 'help', 'anyone', 'find', 'what', 'they', "'", 're', 'looking', 'for', '.', 'a', 'quick', 'bio', 'form', 'and', 'you', "'", 're', 'on', 'the', 'road', 'to', 'satisfaction', 'in', 'every', 'sense', 'of', 'the', 'word', '.', '.', '.', '.', 'no', 'matter', 'what', 'that', 'may', 'be', '!', 'try', 'it', 'out', 'and', 'youll', 'be', 'amazed', '.', 'have', 'a', 'terrific', 'time', 'this', 'evening', 'copy', 'and', 'pa', 'ste', 'the', 'add', '.', 'ress', 'you', 'see', 'on', 'the', 'line', 'below', 'into', 'your', 'browser', 'to', 'come', 'to', 'the', 'site', '.', 'http', ':', '/', '/', 'www', '.', 'me

In [None]:
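#TODO: add some relevant stopwords
# Example choice only (an assumption): header and URL fragments visible in the mail above
extra_stopwords = ['subject', 'http', 'www', 'com']
# Build a separate list so that the original stoplist stays unchanged
extended_stoplist = stoplist + extra_stopwords
print(extended_stoplist[-len(extra_stopwords):])


In [None]:
#TODO: removing new stopwords
mail_no_stopwords = [word for word in tokenized_mail if word not in extended_stoplist]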

In [None]:
print(tokenized_mail)
print('---------------------------------')
print(mail_no_stopwords)
print('---------------------------------')
print ('Length single email without stopwords = ' + str(len(mail_no_stopwords)) + ' words' )

### Punctuation

In [35]:
#TODO: create a punctuation list
stop_punctuation = ['.','?',',','!',':']

#TODO: removing punctuation & print it
mail_clean = [word for word in tokenized_mail if word not in stop_punctuation]
print(mail_clean)

['subject', 'what', 'up', 'your', 'cam', 'babe', 'what', 'are', 'you', 'looking', 'for', 'if', 'your', 'looking', 'for', 'a', 'companion', 'for', 'friendship', 'love', 'a', 'date', 'or', 'just', 'good', 'ole', "'", 'fashioned', '*', '*', '*', '*', '*', '*', 'then', 'try', 'our', 'brand', 'new', 'site', ';', 'it', 'was', 'developed', 'and', 'created', 'to', 'help', 'anyone', 'find', 'what', 'they', "'", 're', 'looking', 'for', 'a', 'quick', 'bio', 'form', 'and', 'you', "'", 're', 'on', 'the', 'road', 'to', 'satisfaction', 'in', 'every', 'sense', 'of', 'the', 'word', 'no', 'matter', 'what', 'that', 'may', 'be', 'try', 'it', 'out', 'and', 'youll', 'be', 'amazed', 'have', 'a', 'terrific', 'time', 'this', 'evening', 'copy', 'and', 'pa', 'ste', 'the', 'add', 'ress', 'you', 'see', 'on', 'the', 'line', 'below', 'into', 'your', 'browser', 'to', 'come', 'to', 'the', 'site', 'http', '/', '/', 'www', 'meganbang', 'biz', '/', 'bld', '/', 'acc', '/', 'no', 'more', 'plz', 'http', '/', '/', 'www', 'na
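The hand-written list above contains five marks, so tokens such as `*`, `;`, `/` and `'` are still present in the output. One possible alternative, sketched here as an assumption rather than as the expected answer, relies on `string.punctuation` from the standard library; the helper name `is_punctuation` is hypothetical.

```python
import string

def is_punctuation(token):
    """True when every character of the token is an ASCII punctuation mark."""
    return len(token) > 0 and all(ch in string.punctuation for ch in token)

tokens = ['fashioned', '*', '*', ',', 'then', 'site', ';', 'http', ':',
          '/', '/', 'www', '.', 'they', "'", 're']
print([token for token in tokens if not is_punctuation(token)])
```

Note that this only covers ASCII punctuation; typographic quotes or other Unicode symbols would need to be handled separately.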

## B. Preprocessing all emails

### Functions definition 

In [41]:
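def preprocess(sentence):
    """
    Lowers, tokenizes, POS-tags and lemmatizes a text, removes stopwords
    and the listed punctuation marks, then stems the remaining tokens
    """
    # SnowballStemmer takes a language name; PorterStemmer does not accept 'english'
    stemmer = SnowballStemmer('english')
    lemmatizer = WordNetLemmatizer()
    stop_list = stop_punctuation + stoplist

    tokenized_mail = word_tokenize(sentence.lower())

    pos_tagged_mail = pos_tag(tokenized_mail)

    lemmatized_mail = [lemmatizer.lemmatize(word[0],get_wordnet_pos(word[1])) for word in pos_tagged_mail]
    
    # Stopwords are filtered on the lemmas, before stemming
    return [stemmer.stem(word) for word in lemmatized_mail if word not in stop_list]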


In [42]:
prepro_hams = [preprocess(mail) for mail in hams]
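Once every mail is a list of cleaned tokens, a quick way to inspect the result is to count the most frequent tokens over the whole collection. The sketch below assumes a list of token lists shaped like `prepro_hams`; the helper name `top_tokens` and the toy data are hypothetical.

```python
from collections import Counter

def top_tokens(preprocessed_mails, k):
    """Return the k most frequent tokens over a list of token lists."""
    counts = Counter()
    for tokens in preprocessed_mails:
        counts.update(tokens)
    return counts.most_common(k)

# Toy stand-in for prepro_hams (hypothetical contents)
toy_mails = [['ena', 'sale', 'hpl', 'gas'], ['gas', 'compani', 'gas'], ['compani', 'inc']]
print(top_tokens(toy_mails, 2))
```

Tokens that dominate this ranking without carrying much meaning are good candidates for an extended stop word list.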

In [43]:
prepro_hams[0]

['subject',
 'ena',
 'sale',
 'hpl',
 'updat',
 'project',
 "'",
 'status',
 'base',
 'new',
 'report',
 'scott',
 'mill',
 'ran',
 'sitara',
 'come',
 'follow',
 'counterparti',
 'one',
 'ena',
 'sell',
 'gas',
 'hpl',
 "'",
 'pipe',
 'altrad',
 'transact',
 'l',
 'l',
 'c',
 'gulf',
 'gas',
 'util',
 'compani',
 'brazoria',
 'citi',
 'panther',
 'pipelin',
 'inc',
 'central',
 'illinoi',
 'light',
 'compani',
 'praxair',
 'inc',
 'central',
 'power',
 'light',
 'compani',
 'reliant',
 'energi',
 '-',
 'entex',
 'ce',
 '-',
 'equistar',
 'chemic',
 'lp',
 'reliant',
 'energi',
 '-',
 'hl',
 '&',
 'p',
 'corpus',
 'christi',
 'gas',
 'market',
 'lp',
 'southern',
 'union',
 'compani',
 '&',
 'h',
 'gas',
 'compani',
 'inc',
 'texa',
 'util',
 'fuel',
 'compani',
 'duke',
 'energi',
 'field',
 'servic',
 'inc',
 'txu',
 'gas',
 'distribut',
 'entex',
 'gas',
 'market',
 'compani',
 'union',
 'carbid',
 'corpor',
 'equistar',
 'chemic',
 'lp',
 'unit',
 'gas',
 'transmiss',
 'compani',
 