## Email spam classification

We're going to classify the emails in a dataset as spam or not spam ("ham").

The first step will be to load the emails, then clean them a bit.

The second step will be to transform the text into a numerical representation that the machine can understand. We will start with a simple word-count encoding, and optionally try TF-IDF (term frequency-inverse document frequency) at the end.

And finally, we're going to use Multinomial Naive Bayes to classify our data.

```python
import pickle
import pandas as pd
```

### 1. Load the data and preprocess

Click this [link](https://wagon-public-datasets.s3.amazonaws.com/Data-Challenges_ML-Day02-Ex03_emails-dataset.pickle) to download the **spam email dataset**. Then put the file in this **exercise folder**.

Load the dataset using `with` and [`open [DOC]`](https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python). The file is a binary pickle file, so you need to open it in `'rb'` (read binary) mode.

To load the pickle file into a DataFrame, simply `import pickle` and then use it like this:
```python
emails_df = pickle.load(file)
```
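
Putting the two steps together, a minimal sketch might look like the following. The file name `emails.pickle` and the toy DataFrame are placeholders, created here only so that the snippet runs on its own; use the path of the file you actually downloaded.

```python
import os
import pickle
import tempfile

import pandas as pd

# Stand-in for the downloaded dataset (hypothetical content and file name),
# written to a temporary folder so the example is self-contained.
toy_df = pd.DataFrame({"message": [b"Hello there", b"Win a prize now"]})
path = os.path.join(tempfile.mkdtemp(), "emails.pickle")
with open(path, "wb") as file:
    pickle.dump(toy_df, file)

# 'rb' = read + binary: pickle files are not text files.
with open(path, "rb") as file:
    emails_df = pickle.load(file)

print(emails_df.head())
```

The `with` block closes the file automatically, even if loading fails.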

The text strings are still in bytes format (you can see they start with `b'`).

We need to decode them into regular strings. For that, you can apply a lambda function to the "message" column:

`.apply(lambda x: x.decode('latin-1'))`

Take a look at lambda functions and string encodings; both come up all the time.
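
As a small illustration of the decoding step, here is a sketch on a toy DataFrame (the two messages are made up). `latin-1` maps every possible byte to a character, so decoding with it never raises an error.

```python
import pandas as pd

# Toy stand-in for the real dataset: messages stored as bytes.
emails_df = pd.DataFrame({"message": [b"caf\xe9 menu", b"Hello"]})

# A lambda is a small anonymous function; apply() calls it on every cell.
emails_df["message"] = emails_df["message"].apply(lambda x: x.decode("latin-1"))

print(emails_df["message"].tolist())
```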

### 2. Clean the emails

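```python
from string import punctuation
import re

def clean_email(email):
    # Replace URLs with a space
    email = re.sub(r'http\S+', ' ', email)
    # Replace digits with a space (raw string avoids an invalid-escape warning)
    email = re.sub(r"\d+", " ", email)
    # Replace line breaks with spaces
    email = email.replace('\n', ' ')
    # Strip punctuation, then lowercase
    email = email.translate(str.maketrans("", "", punctuation))
    email = email.lower()
    return email

emails_df['message'] = emails_df['message'].apply(clean_email)
```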

You can call `head()` to see what the result looks like, and plot a histogram of the spam vs. ham distribution.
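
A quick way to inspect the data could look like this. The `label` column name and the toy rows are assumptions made for the example; check the actual column names of your DataFrame first.

```python
import pandas as pd

# Toy stand-in; the "label" column name is an assumption, check your own columns.
emails_df = pd.DataFrame({
    "message": ["cheap pills", "lunch tomorrow", "meeting notes", "win money"],
    "label": ["spam", "ham", "ham", "spam"],
})

print(emails_df.head())

# Number of emails per class, and the same as proportions.
counts = emails_df["label"].value_counts()
proportions = emails_df["label"].value_counts(normalize=True)
print(counts)
print(proportions)
```

Calling `counts.plot(kind="bar")` would turn the counts into a bar chart if matplotlib is installed.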

### 3. Count the occurrences in the emails

We need to encode the email strings in a form that we can feed to our machine learning models. One way of doing it is to build a vocabulary of every word in the dataset, and represent each document by the number of occurrences of each word.

Let's say you have 3 emails you want to encode\
'name is Thomas'\
'name is David'\
'you have time today'

The whole encoding would be 

| | name | is | Thomas | David | you | have | time | today |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| x1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| x2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| x3 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |

(As you can see, it's a very sparse representation of the data: most entries are 0!)
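
The table above can be reproduced with a few lines of scikit-learn. This is only a sketch on the three toy emails; `lowercase=False` is set here so that the names keep their capital letters as in the table (by default the vectorizer lowercases everything).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["name is Thomas", "name is David", "you have time today"]

# lowercase=False keeps 'Thomas' and 'David' capitalised as in the table.
vectorizer = CountVectorizer(lowercase=False)
X = vectorizer.fit_transform(docs)  # scipy sparse matrix, one row per email

# The vectorizer sorts the columns itself, so their order differs from the table.
vocab = vectorizer.vocabulary_  # maps each word to its column index
print(X.shape)
print(X.toarray())
```

The result is stored as a sparse matrix, which only keeps the non-zero entries in memory.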

We're going to use scikit-learn's [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) utility class.

Use the params:\
`stop_words`: to remove the words that don't bring any information (like 'and', 'for', 'the', etc.)\
`max_df`: to eliminate dataset-specific stop words by removing words that appear in too many documents (let's say more than 50%)

Calling `get_feature_names_out()` on your vectorizer (`get_feature_names()` in scikit-learn versions before 1.0) gives you the list of words (= features) used to represent each email. Get the number of features you're working with. You should be at around 144143 unique words/features.

### 4. Train test split

Now that we are satisfied with the representation of our documents, we're going to split the dataset into a training set and a test set. Use the `train_test_split` function on the new representation of the emails.

Look at the shape of your datasets. With a test size of 30%, the training and test sets should contain about 23601 and 10115 emails respectively.
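
A minimal sketch of the split, on toy arrays standing in for the encoded emails and their labels. The `random_state` value is arbitrary and only there to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 10 "emails" with 3 features each, and their labels.
X = np.arange(30).reshape(10, 3)
y = np.array([0, 1] * 5)

# test_size=0.3 keeps 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
```

`train_test_split` also accepts the sparse matrix returned by the vectorizer, so there is no need to convert it to a dense array first.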

### 5. Call and fit multinomial

We can now fit a Naive Bayes classifier on our training set. Make sure to check the score on the training set as well as on the test set. Be careful to use the right Naive Bayes model (Multinomial or Gaussian?)
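
As a sketch under toy assumptions (four made-up training emails, two made-up test emails), fitting and scoring could look like this. The comparison of the two scores is the point: a training score far above the test score suggests overfitting.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["free prize offer", "win free money",
              "meeting at noon", "project report attached"]
train_labels = ["spam", "spam", "ham", "ham"]
test_docs = ["free money offer", "report for the meeting"]
test_labels = ["spam", "ham"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)  # reuse the fitted vocabulary

# MultinomialNB models word counts; GaussianNB assumes continuous features.
model = MultinomialNB()
model.fit(X_train, train_labels)

train_score = model.score(X_train, train_labels)  # accuracy on seen data
test_score = model.score(X_test, test_labels)     # accuracy on unseen data
print(train_score, test_score)
```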

### 6. Observe classifier

#### Try on a few values

You can now observe the quality of your classifier by entering email strings and seeing if it classifies them correctly.
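
A sketch of the mechanics, with a toy model trained on four made-up emails. The key detail is to call `transform()` (not `fit_transform()`) on the new emails, so that they are encoded with the vocabulary learned during training.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data standing in for the real emails.
train_docs = ["best offers only here", "cheap holidays offers",
              "tennis game tomorrow", "please check the report"]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(train_docs), train_labels)

samples = ["How about a game of tennis tomorrow?",
           "Best offers only here!!!"]

# transform(), not fit_transform(): new emails must use the training vocabulary.
X_samples = vectorizer.transform(samples)
predictions = model.predict(X_samples)
print(list(zip(samples, predictions)))
```

With your real model, remember to apply the same cleaning function to the new emails before vectorizing them.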

```python
email_samples = ["Hello George, how about a game of tennis tomorrow?",
         "Dear Sara, I prepared the annual report. Please check the attachment.",
         "Hi David, will we go for cinema tonight?",
         "Best holidays offers only here!!!"]
```

#### Look at the most important features

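Using the classifier's `coef_` attribute, try to get the top 10 features, and use `get_feature_names()` to find the corresponding words. (In recent scikit-learn versions, `MultinomialNB` no longer exposes `coef_` and `get_feature_names()` has been removed; use `feature_log_prob_` and `get_feature_names_out()` instead.)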

#### Observe misclassifications

Try to extract some misclassified emails and look at what might have caused the classifier to make an error.
Hint: you can use `np.where`
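
A sketch of the hint, using toy arrays in place of the real test labels, predictions, and email texts (all values are made up):

```python
import numpy as np

# Toy arrays standing in for the real test labels, predictions, and emails.
y_test = np.array(["ham", "spam", "ham", "spam", "ham"])
y_pred = np.array(["ham", "ham", "ham", "spam", "spam"])
test_emails = np.array(["see you at lunch", "urgent reply needed", "weekly report",
                        "win a free prize", "free tickets for the team"])

# np.where on a boolean array returns a tuple; [0] holds the matching positions.
wrong_idx = np.where(y_test != y_pred)[0]

for i in wrong_idx:
    print(f"true={y_test[i]} predicted={y_pred[i]} : {test_emails[i]}")
```

If your labels are a pandas Series, convert them with `.to_numpy()` first so that the positions returned by `np.where` line up with the predictions.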


### 7. Optional - Minimum train size

Using `learning_curve`, find the minimum training size that you would need to get 97%+ performance.
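
A sketch of how `learning_curve` can be called, on a small made-up corpus (repeated so that there are enough rows for cross-validation). The training-size fractions and `cv=5` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, repeated to get 40 rows; replace with your own X and y.
docs = ["free prize offer", "win free money",
        "meeting at noon", "project report attached"] * 10
labels = np.array(["spam", "spam", "ham", "ham"] * 10)
X = CountVectorizer().fit_transform(docs)

train_sizes, train_scores, test_scores = learning_curve(
    MultinomialNB(), X, labels,
    train_sizes=[0.25, 0.5, 0.75, 1.0],  # fractions of each training fold
    cv=5,
)

# Average the validation accuracy over the folds for each training size.
mean_test = test_scores.mean(axis=1)
for size, score in zip(train_sizes, mean_test):
    print(size, round(score, 3))
```

The minimum training size is then the smallest entry of `train_sizes` whose mean validation score reaches 0.97.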

### 8. Optional - use tf-idf

See if you can improve the performance by using [`tf-idf`](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
with [`TfidfVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) instead of a simple count.
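
To get a feel for the difference, here is a sketch on three made-up emails. `TfidfVectorizer` is used with its default settings and is a drop-in replacement for `CountVectorizer`: the output has the same shape, but rare words receive a larger weight than words found in many emails.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["free free prize", "free meeting", "meeting notes"]

counts = CountVectorizer().fit_transform(docs)
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)

# Same shape as the count matrix, but the values are reweighted.
print(counts.shape, tfidf.shape)

# idf_ holds one weight per word: 'prize' (1 email) outweighs 'free' (2 emails).
vocab = tfidf_vectorizer.vocabulary_
idf_free = tfidf_vectorizer.idf_[vocab["free"]]
idf_prize = tfidf_vectorizer.idf_[vocab["prize"]]
print(idf_free, idf_prize)
```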