# Naive Bayes

## Introduction


In this exercise, you are going to train your first Naive Bayes model! You will classify the emails in a dataset as spam or not spam (ham). Naive Bayes is a particularly powerful Machine Learning algorithm when it comes to classifying texts, hence the chosen dataset.

The first step will be to load the emails, then you will need to clean them a bit.

**Dataset:** "emails.csv"<br>

**Columns:** <br>
**"text"**, emails in raw text <br>
**"spam"**, binary label: 1 for spam, 0 for ham <br>

## 1. Load the data

Load the csv file into a dataframe.
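
A minimal sketch of the loading step, assuming pandas is available. To stay runnable without the file, it reads a tiny in-memory stand-in (`sample_csv`, a made-up name with made-up rows) laid out like "emails.csv", on the assumption that the file starts with an unnamed index column. In the exercise you would pass the file path instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for "emails.csv": same columns, two made-up rows.
# In the exercise: df = pd.read_csv("emails.csv", index_col=0)
sample_csv = io.StringIO(
    ",text,spam\n"
    "0,Subject: cheap software cds,1\n"
    "1,Subject: meeting notes for monday,0\n"
)

# index_col=0 uses the first (unnamed) column as the index
# instead of creating an "Unnamed: 0" column.
df = pd.read_csv(sample_csv, index_col=0)

print(df.shape)
print(df.columns.tolist())
```

Passing `index_col=0` is optional: without it, the saved index simply shows up as an extra "Unnamed: 0" column.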

## 2. Clean the emails

This part is already done for you. It consists of cleaning the texts: removing URLs, digits, line breaks and punctuation, then lowercasing. This is general practice when it comes to text preprocessing. Don't panic, you will be guided through it during the Natural Language Processing day.

Run the code below and try to make sense of it!

In [67]:
from string import punctuation
import re

def clean_email(email):
    email = re.sub(r'http\S+', ' ', email) # Remove URLs
    email = re.sub(r'\d+', ' ', email) # Remove digits
    email = email.replace('\n', ' ') # Replace line breaks with spaces
    email = email.translate(str.maketrans("", "", punctuation)) # Remove punctuation
    email = email.lower() # Lowercase text
    return email

df['text'] = df['text'].apply(clean_email)

df.head()

| Unnamed: 0 | text | spam |
| --- | --- | --- |
| 0 | subject naturally irresistible your corporate ... | 1 |
| 1 | subject the stock trading gunslinger  fanny is... | 1 |
| 2 | subject unbelievable new homes made easy  im w... | 1 |
| 3 | subject  color printing special  request addi... | 1 |
| 4 | subject do not have money  get software cds fr... | 1 |


## 3. Class Balance

Check for class balance. You are probably going to need to downsample...
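
One possible way to check the balance and downsample, sketched on a toy dataframe (the 6 ham / 2 spam split is made up for illustration). It counts the labels with `value_counts()`, then draws the same number of rows from each class with `groupby(...).sample(...)`, which assumes pandas 1.1 or later.

```python
import pandas as pd

# Toy stand-in for the real dataframe: 6 ham (0) and 2 spam (1).
df = pd.DataFrame({
    "text": [f"email {i}" for i in range(8)],
    "spam": [0, 0, 0, 0, 0, 0, 1, 1],
})

# 1. Inspect the class distribution.
counts = df["spam"].value_counts()
print(counts)

# 2. Downsample every class to the size of the smallest one.
n_min = int(counts.min())
balanced = df.groupby("spam").sample(n=n_min, random_state=42)

# 3. Shuffle the rows so the classes are not grouped together.
balanced = balanced.sample(frac=1, random_state=42).reset_index(drop=True)
print(balanced["spam"].value_counts())
```

Downsampling throws away majority-class rows, so it is a trade-off: simple, but only reasonable when enough data remains afterwards.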

## 4. Transforming (Vectorizing) text data

Machine Learning algorithms don't take raw text as input. One way to transform text to a suitable numerical form is called Count Vectorizing.

Count Vectorizing consists of creating a dictionary with all the words within the dataset, and representing every document with the number of occurrences of each word.

Let's say you have 3 texts you want to encode:

x1 = 'name is Thomas'\
x2 = 'name is David'\
x3 = 'you have time today'

The whole encoding would be:

| | name | is | Thomas | David | you | have | time | today |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| x1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| x2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| x3 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |




Don't panic, you don't need to hand code it! Use sklearn's `CountVectorizer` utility class to vectorize the text.  [[doc]](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

Calling `get_feature_names_out()` on your fitted vectorizer (`get_feature_names()` in scikit-learn versions before 1.0) gives you the array of words (= features) used for your representation of each email. Get the number of features you're working with.

## 5. Train/test split

Now that you have a vectorized representation of the emails, you can go ahead and split the dataset into testing / training sets.
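
A sketch of the split with `train_test_split`, using small made-up arrays in place of the vectorized emails and their labels. The 20% test size, the `stratify` option and the random seed are illustrative choices, not requirements of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the vectorized emails (X) and their labels (y).
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# stratify=y keeps the spam/ham proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

`train_test_split` also accepts the sparse matrix returned by `CountVectorizer`, so no conversion to a dense array is needed.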

## 6. Train a Naive Bayes

Time to train a Naive Bayes classifier!

Go ahead, train the model and test its accuracy!

Be careful to use the right Naive Bayes model (Multinomial or Gaussian?) and to correctly check its accuracy.
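
The sketch below shows one way this step could look on a tiny made-up corpus (all texts and labels are invented for illustration). It uses `MultinomialNB` because word counts are discrete, non-negative features, whereas `GaussianNB` models continuous features as normally distributed. Accuracy is measured on texts the model did not see during training.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus: 1 = spam, 0 = ham.
train_texts = [
    "win free money now", "free offer click now",
    "cheap offer win prize", "lunch meeting tomorrow",
    "see you at the meeting", "project report attached",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["free prize offer", "meeting report tomorrow"]
test_labels = [1, 0]

# Fit the vocabulary on the training texts only, then reuse it for the test texts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

model = MultinomialNB()
model.fit(X_train, train_labels)

# Evaluate on held-out data, never on the training set.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(test_labels, y_pred))
```

Accuracy is only a fair summary when the classes are roughly balanced, which is one reason to check the class balance beforehand.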

## 7. Predicting

You can now observe the quality of your classifier by predicting on new emails.

Go ahead and predict the emails given below. Remember you need to vectorize the text (in exactly the same way!) before feeding it to the model for prediction.

In [1]:
new_emails = ["Hello George, how about a game of tennis tomorrow?",
              "Hi David, are we going to the movies tonight?",
              "Best holidays offers only here!!!"]
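
A possible shape for the prediction step, sketched with a model trained on a made-up toy corpus so that the snippet runs on its own. In the exercise you would reuse your own fitted vectorizer and model, and apply the same cleaning as in step 2 first. The key point is calling `transform()` rather than `fit_transform()` on the new emails, so that they are encoded with the vocabulary learned during training.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy training data standing in for the real emails: 1 = spam, 0 = ham.
train_texts = [
    "best offers win free money", "free holidays offer click here",
    "hello how about lunch tomorrow", "are we going to the meeting tonight",
]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

new_emails = ["Hello George, how about a game of tennis tomorrow?",
              "Hi David, are we going to the movies tonight?",
              "Best holidays offers only here!!!"]

# transform() reuses the training vocabulary; words never seen in training are ignored.
X_new = vectorizer.transform(new_emails)
predictions = model.predict(X_new)

for email, label in zip(new_emails, predictions):
    print(label, "|", email)
```

`model.predict_proba(X_new)` returns the class probabilities instead of hard labels, which can help judge how confident the classifier is.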
