# <center> TP - Text-Mining </center>

**Before you begin this notebook, please make sure that you have the data folder and that you are running this notebook from your workspace.**

_Now let's practice and use some text-mining techniques!_
<br/>
<br/> In this notebook we will study a mail dataset.
Our final goal is to gain some insight into the different subjects that appear in mailboxes. Among other problems, one big issue is that these mailboxes are polluted with a lot of spam.

> The sequence we propose:
- First, you'll work on pre-processing the content of the mails
- Then you will try to detect whether a mail is spam or not with a supervised learning algorithm
- Finally, once you've trained an algorithm to detect spam, you will try to identify the main topics in the remaining non-spam mails

## The data

The data is split into six parts, each containing around 5000 mails.
<br/> Each of these data chunks is split into two parts:
>- hams: non-spam mails
- spams: mails containing advertisements or irrelevant content


Get the zip file here:
- https://drive.google.com/file/d/1j1xO3HSevP__ZsP2FvTx7p0UHJO__7rj/view?usp=sharing


***
***

## 0. Imports

In [1]:
#Import useful packages
import os
import re
import string
import random
import numpy as np
import pandas as pd
from collections import Counter

#Import nltk packages to manipulate text
import nltk
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.metrics import ConfusionMatrix
from nltk.stem.snowball import SnowballStemmer

from nltk import word_tokenize, WordNetLemmatizer, PorterStemmer
from nltk import NaiveBayesClassifier, classify
from nltk import pos_tag
from nltk import ngrams

#***
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

from nltk import word_tokenize,sent_tokenize

# Let's add a path containing some useful nltk data
nltk.data.path += ['/mnt/share/nltk_data']

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline


[nltk_data] Downloading package punkt to /Users/sebila/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/sebila/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/sebila/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sebila/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# Modify the variable below to point to your data folder
path_data  = "./enron1"

# I. Data ingestion

In [29]:
# Nothing to understand
def read_mails(folder):
    """
    Reads all the mails contained in a folder and gathers them in a list
    
    Args :
        folder (str) : path to the folder containing mails
        
    Returns :
        mails_list (list) : a list containing mails contents
    """
    
    mails_list = []
    files_list = os.listdir(folder)
    for file_name in files_list:
        # The context manager closes each file as soon as it has been read
        with open(os.path.join(folder, file_name), 'r', encoding='latin1') as file_content:
            mails_list.append(file_content.read())
    return mails_list

## A. Load data

In [30]:
os.listdir(path_data)

['spam', '.DS_Store', 'ham', 'Summary.txt', '.idea']

Each data chunk is split into two folders: ham and spam.
<br/>Let's load the first data chunk.

In [39]:
# Load the spams and the hams from their respective sub-folders
spams = read_mails(path_data + '/spam/')
hams = read_mails(path_data + '/ham/')

# II. Preprocessing

## A. Preprocessing one email

Display the content of one email

In [40]:
#Get one spam-email & print it
single_email = spams[0]

In [42]:
print(single_email)

Subject: what up , , your cam babe
what are you looking for ?
if your looking for a companion for friendship , love , a date , or just good ole '
fashioned * * * * * * , then try our brand new site ; it was developed and created
to help anyone find what they ' re looking for . a quick bio form and you ' re
on the road to satisfaction in every sense of the word . . . . no matter what
that may be !
try it out and youll be amazed .
have a terrific time this evening
copy and pa ste the add . ress you see on the line below into your browser to come to the site .
http : / / www . meganbang . biz / bld / acc /
no more plz
http : / / www . naturalgolden . com / retract /
counterattack aitken step preemptive shoehorn scaup . electrocardiograph movie honeycomb . monster war brandywine pietism byrne catatonia . encomia lookup intervenor skeleton turn catfish .



### Lowercase the text

Lowercase the content of the previously displayed mail

In [46]:
#TODO : lower case the email & print it

lower_mail = single_email.lower()
print(lower_mail)

subject: what up , , your cam babe
what are you looking for ?
if your looking for a companion for friendship , love , a date , or just good ole '
fashioned * * * * * * , then try our brand new site ; it was developed and created
to help anyone find what they ' re looking for . a quick bio form and you ' re
on the road to satisfaction in every sense of the word . . . . no matter what
that may be !
try it out and youll be amazed .
have a terrific time this evening
copy and pa ste the add . ress you see on the line below into your browser to come to the site .
http : / / www . meganbang . biz / bld / acc /
no more plz
http : / / www . naturalgolden . com / retract /
counterattack aitken step preemptive shoehorn scaup . electrocardiograph movie honeycomb . monster war brandywine pietism byrne catatonia . encomia lookup intervenor skeleton turn catfish .



### Tokenization

Here we are using a word tokenizer to divide the sentence into tokens

In [47]:
#TODO : tokenize your email and print it
tokenized_mail = nltk.word_tokenize(single_email)
print(tokenized_mail)

['Subject', ':', 'what', 'up', ',', ',', 'your', 'cam', 'babe', 'what', 'are', 'you', 'looking', 'for', '?', 'if', 'your', 'looking', 'for', 'a', 'companion', 'for', 'friendship', ',', 'love', ',', 'a', 'date', ',', 'or', 'just', 'good', 'ole', "'", 'fashioned', '*', '*', '*', '*', '*', '*', ',', 'then', 'try', 'our', 'brand', 'new', 'site', ';', 'it', 'was', 'developed', 'and', 'created', 'to', 'help', 'anyone', 'find', 'what', 'they', "'", 're', 'looking', 'for', '.', 'a', 'quick', 'bio', 'form', 'and', 'you', "'", 're', 'on', 'the', 'road', 'to', 'satisfaction', 'in', 'every', 'sense', 'of', 'the', 'word', '.', '.', '.', '.', 'no', 'matter', 'what', 'that', 'may', 'be', '!', 'try', 'it', 'out', 'and', 'youll', 'be', 'amazed', '.', 'have', 'a', 'terrific', 'time', 'this', 'evening', 'copy', 'and', 'pa', 'ste', 'the', 'add', '.', 'ress', 'you', 'see', 'on', 'the', 'line', 'below', 'into', 'your', 'browser', 'to', 'come', 'to', 'the', 'site', '.', 'http', ':', '/', '/', 'www', '.', 'me
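To get a feel for what a word tokenizer does, here is a minimal regex-based sketch. It is a simplified stand-in, not how `nltk.word_tokenize` works internally, and the name `simple_tokenize` is made up for this illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a word; [^\w\s] matches one character that is neither
    # a word character nor whitespace, i.e. a punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "what up , , your cam babe ? they ' re looking"
tokens = simple_tokenize(sample)
print(tokens)
```

On mails like the one above, where punctuation is already surrounded by spaces, this gives one token per word or punctuation mark.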

In [48]:
#TODO : create bigrams of your email & print it
bigrams = list(ngrams(tokenized_mail, 2))
print(bigrams)

[('Subject', ':'), (':', 'what'), ('what', 'up'), ('up', ','), (',', ','), (',', 'your'), ('your', 'cam'), ('cam', 'babe'), ('babe', 'what'), ('what', 'are'), ('are', 'you'), ('you', 'looking'), ('looking', 'for'), ('for', '?'), ('?', 'if'), ('if', 'your'), ('your', 'looking'), ('looking', 'for'), ('for', 'a'), ('a', 'companion'), ('companion', 'for'), ('for', 'friendship'), ('friendship', ','), (',', 'love'), ('love', ','), (',', 'a'), ('a', 'date'), ('date', ','), (',', 'or'), ('or', 'just'), ('just', 'good'), ('good', 'ole'), ('ole', "'"), ("'", 'fashioned'), ('fashioned', '*'), ('*', '*'), ('*', '*'), ('*', '*'), ('*', '*'), ('*', '*'), ('*', ','), (',', 'then'), ('then', 'try'), ('try', 'our'), ('our', 'brand'), ('brand', 'new'), ('new', 'site'), ('site', ';'), (';', 'it'), ('it', 'was'), ('was', 'developed'), ('developed', 'and'), ('and', 'created'), ('created', 'to'), ('to', 'help'), ('help', 'anyone'), ('anyone', 'find'), ('find', 'what'), ('what', 'they'), ('they', "'"), ("'",
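An n-gram is simply a run of n consecutive tokens. The plain-Python sketch below shows the sliding window behind the output above, without relying on NLTK. The helper name `sliding_ngrams` is hypothetical.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # Zipping the list with copies of itself shifted by 1, 2, ..., n-1
    # positions yields the n-token windows; zip stops at the shortest copy.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['what', 'are', 'you', 'looking', 'for']
print(sliding_ngrams(tokens, 2))
print(sliding_ngrams(tokens, 3))
```

A list of k tokens produces k - n + 1 n-grams, so the five tokens above give four bigrams and three trigrams.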

In [None]:
#TODO : create trigrams of your email & print it
trigrams = list(ngrams(tokenized_mail, 3))
print(trigrams)

### Lemmatize 

We want to lemmatize the tokens of the mail

In [3]:
#TODO get lemmatizer
lemmatizer = WordNetLemmatizer()

In [4]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /Users/sebila/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [5]:
#TODO print lemmatisation of "found" without POS (the default POS is noun)
result = lemmatizer.lemmatize('found')
print(result)
#TODO print lemmatisation of "found" with POS ('v' for verb)
result2 = lemmatizer.lemmatize('found', 'v')
print(result2)

found
find


One little subtlety: lemmatizing well requires POS-tagging the words first, so that the lemmatizer knows their grammatical nature

### Let's pos_tag the tokens

In [6]:
#TODO pos_tag your tokenized email & print it
pos_tagged_mail = pos_tag(tokenized_mail)
print(pos_tagged_mail)

In [7]:
# Try to understand
def get_wordnet_pos(pos_tag):
    """
    Maps a Penn Treebank POS tag to the coarser WordNet POS expected by the lemmatizer
    """
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Default to noun, like WordNetLemmatizer does when no POS is given
        return wordnet.NOUN
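To see the mapping in action without loading any NLTK data, here is a standalone sketch of the same idea. It uses the single-letter codes that `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` stand for in NLTK. The `(token, tag)` pairs are hand-written for illustration, not real `pos_tag` output, and the name `to_wordnet_letter` is hypothetical.

```python
# Single-letter codes; these are the values of wordnet.ADJ, wordnet.VERB,
# wordnet.NOUN and wordnet.ADV in NLTK.
PREFIX_TO_WORDNET = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def to_wordnet_letter(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return PREFIX_TO_WORDNET.get(treebank_tag[:1], 'n')

# Hand-tagged tokens (an assumption made for this example)
tagged = [('the', 'DT'), ('quick', 'JJ'), ('site', 'NN'),
          ('was', 'VBD'), ('developed', 'VBN'), ('quickly', 'RB')]
converted = [(token, to_wordnet_letter(tag)) for token, tag in tagged]
print(converted)
```

Each resulting letter can then be passed as the second argument of `lemmatizer.lemmatize`, as was done with `'v'` for "found".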

### Lemmatizer 

In [8]:
#TODO: Lemmatize your email without pos & print it
lemmatized_mail_no_pos = [lemmatizer.lemmatize(token) for token in tokenized_mail]
print(lemmatized_mail_no_pos)

In [None]:
#TODO: Lemmatize your email with pos & print it
lemmatized_mail = [lemmatizer.lemmatize(token, get_wordnet_pos(tag)) for token, tag in pos_tag(tokenized_mail)]
print(lemmatized_mail)

In [None]:
#TODO: Did we delete any word between tokenization & lemmatization?
print(len(tokenized_mail) == len(lemmatized_mail))  # one lemma per token, so this should be True


### Stemmer

In [None]:
stemmer = SnowballStemmer('english')
#TODO: stem your email
stemmed_email = [stemmer.stem(token) for token in tokenized_mail]

In [None]:
#TODO: print the length of your email (in tokens)
print(len(tokenized_mail))


### Stopwords

In [None]:
# Have a look at stopwords
stoplist = stopwords.words('english')

#TODO: print example of stop words
print(stoplist)

In [None]:
#TODO: removing stopwords
mail_no_stopwords = [token for token in tokenized_mail if token.lower() not in stoplist]

In [None]:
print(tokenized_mail)
print('---------------------------------')
print(mail_no_stopwords)

In [None]:
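#TODO: add some relevant stopwords
# Illustrative additions (adapt to your data): header and URL tokens seen in the mail above
stoplist = stoplist + ['subject', 'http', 'www']
print(tokenized_mail)
print('---------------------------------')
print(mail_no_stopwords)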


In [None]:
#TODO: removing new stopwords
mail_no_stopwords = [token for token in tokenized_mail if token.lower() not in stoplist]

In [None]:
print(tokenized_mail)
print('---------------------------------')
print(mail_no_stopwords)
print('---------------------------------')
print ('Length single email without stopwords = ' + str(len(mail_no_stopwords)) + ' words' )

### Punctuation

In [None]:
#TODO: create a punctuation list
stop_punctuation = ['.','?',',','!']

#TODO: removing punctuation & print it
mail_clean = [token for token in mail_no_stopwords if token not in stop_punctuation]
print(mail_clean)
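A hand-written punctuation list misses marks such as `*`, `;` or `/`, which all appear in the sample mail. As a possible alternative, the sketch below relies on `string.punctuation` from the standard library. The helper name `drop_punctuation` and the sample tokens are assumptions made for this example.

```python
import string

def drop_punctuation(tokens):
    """Keep only the tokens that contain at least one non-punctuation character."""
    return [t for t in tokens if not all(ch in string.punctuation for ch in t)]

tokens = ['try', 'it', 'out', '!', '*', ';', 'http', ':', '/', '/',
          'they', "'", 're', '...']
print(drop_punctuation(tokens))
```

Checking every character of a token, rather than testing list membership, also removes multi-character tokens such as `'...'`.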

## B. Preprocessing all emails

### Functions definition 

In [None]:
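def preprocess(sentence):
    """
    Tokenizes, lowers, and stems
    """
    stemmer = SnowballStemmer('english')
    return [stemmer.stem(token.lower()) for token in word_tokenize(sentence)]


def get_features(text):
    #TODO: build the features of a mail from its preprocessed tokens
    pass


def get_features_no_processing(text):
    #TODO: build the features of a mail from its raw tokens, without preprocessing
    pass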
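One common way to turn a list of tokens into features is a bag of words: a dictionary that maps each token to its number of occurrences. The sketch below is only one possible design, not necessarily what this notebook intends for `get_features`. The name `bag_of_words_features` and the sample tokens are hypothetical.

```python
from collections import Counter

def bag_of_words_features(tokens, stoplist=()):
    """Count how often each token occurs, skipping the ones in stoplist."""
    return dict(Counter(token for token in tokens if token not in stoplist))

tokens = ['try', 'it', 'out', 'and', 'try', 'our', 'site']
features = bag_of_words_features(tokens, stoplist={'it', 'and', 'our'})
print(features)
```

A dictionary of this shape is the kind of feature set that `NaiveBayesClassifier` (imported above) is trained on, as a list of `(features, label)` pairs.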
  