# Naive Bayes for Spam Classification


Naive Bayes is a relatively simple probabilistic classification algorithm that is well suited to categorical data.

In machine learning, common applications of Naive Bayes include spam email classification, sentiment analysis, and document categorization. Compared with other commonly used classification algorithms, its advantages are simplicity, speed, and good accuracy on small data sets.

## Data Description

We will use a data set from the UCI Machine Learning Repository that contains YouTube comments from very popular music videos. Each comment has been labeled as either spam or ham (a legitimate comment). We will use this data to train our Naive Bayes algorithm for YouTube comment spam classification.

In [None]:
# Import modules

# For data manipulation
import pandas as pd

# For matrix operations
import numpy as np

# For regex
import re

In [None]:
# Load data from the 'YoutubeCommentsSpam.csv' file using pandas
data_comments = pd.read_csv("YoutubeCommentsSpam.csv")

# Name the columns 'content' and 'label' (the 'class' column is renamed to 'label')
# tip: the 'columns' attribute can be of help
data_comments.rename(index=None, columns={'class':'label'}, inplace=True)

# display the first rows of our dataset to make sure that the labels have been added
# data_comments.head()
data_comments.columns

$\textbf{WARNING: DO NOT check the links in the spam comments! ;)}$

In [None]:
# Show spam comments in data
# DO NOT GO ON THE LINKS BELOW!!! seriously, they're spams... 
print(data_comments["content"][data_comments["label"] == 1])

Browsing the comments that have been labeled as spam, it seems that they are either unrelated to the video or some form of advertisement. The phrase "check out" appears to be very popular in these comments.

## Summary Statistics and Data Cleaning

The table below shows that this data set consists of $1959$ YouTube comments; about $49\%$ of them are legitimate and about $51\%$ are spam. Because the two classes are almost evenly balanced, accuracy on the test set is an informative measure of how well the algorithm performs. The average comment is about $96$ characters long, which is roughly $15$ words.
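
As a rough illustration of how such figures are obtained, the sketch below computes the class shares and the average comment length on a handful of made-up comments. The `toy_comments` list is hypothetical and is not taken from the data set.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy comments (not from the data set): (content, label), 1 = spam, 0 = ham
toy_comments = [
    ("check out my channel", 1),
    ("great song", 0),
    ("subscribe to win a free phone", 1),
    ("this never gets old", 0),
]

# Share of each class among the comments
label_counts = Counter(label for _, label in toy_comments)
class_share = {label: count / len(toy_comments) for label, count in label_counts.items()}

# Average comment length in characters and in words
avg_chars = mean(len(text) for text, _ in toy_comments)
avg_words = mean(len(text.split()) for text, _ in toy_comments)

print(class_share)
print(avg_chars, avg_words)
```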

In [None]:
# Add another column with the corresponding comment length (in characters)
# tips: use map and lambda
data_comments['length'] = data_comments['content'].map(lambda text: len(text))

# Display summary statistics (mean, stdev, min, max)
data_comments[["label","length"]].describe()
# data_comments.head()

To evaluate our Naive Bayes classification algorithm, we will split the data into a training set and a test set. The training set will be used to train the spam classification algorithm, and the test set will only be used to test its accuracy. In general, the training set should be bigger than the test set, and both should be drawn from the same population (in our case, YouTube comments on music videos). We will randomly select about $75\%$ of the data for training and about $25\%$ for testing.
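
One way to picture this kind of split, sketched here with the standard library only: each item gets an independent uniform draw and goes to the training set when the draw is below the chosen fraction. The function name `split_train_test` and the toy data are assumptions made for this illustration. Because the assignment is random, the split is only approximately 75/25.

```python
import random

def split_train_test(items, train_fraction=0.75, seed=2017):
    """Send each item to train or test by comparing a uniform draw to train_fraction."""
    rng = random.Random(seed)  # fixed seed so the allocation is reproducible
    train, test = [], []
    for item in items:
        if rng.random() < train_fraction:
            train.append(item)
        else:
            test.append(item)
    return train, test

# Hypothetical stand-in for the comments: 1000 integer ids
toy_items = list(range(1000))
train_items, test_items = split_train_test(toy_items)
print(len(train_items), len(test_items))
```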

In [None]:
# Let's split data into training and test set (75% training, 25% test)

# Set seed so we get same random allocation on each run of code
np.random.seed(2017)

# Add column vector 'uniform' of randomly generated numbers from 0 to 1 
# tips: in numpy there is a method to draw sample from a uniform distribution
data_comments["uniform"] = np.random.uniform(0, 1, data_comments.shape[0])

# As the number in our 'uniform' column is uniformly distributed, 
# about 75% of these numbers should be less than 0.75, let's grab those 75%
data_comments_train = data_comments[data_comments["uniform"] < 0.75]

# likewise, about 25% of these numbers should be greater than or equal to 0.75
data_comments_test = data_comments[data_comments["uniform"] >= 0.75]

data_comments_train.head()

In [None]:
# Check that training data have both spam and ham comments
data_comments_train["label"].describe()

In [None]:
# Same for the test data 
data_comments_test["label"].describe()

Both the training and test data have a good mix of spam and ham comments, so we are ready to move on to training the Naive Bayes classifier.

In [None]:
# Join all the training comments into one big string
# tips: str.join()
training_list_words = " ".join(data_comments_train['content'])

# Convert to lower case, collapse repeated spaces, and split into words
training_list_words = training_list_words.lower()
training_list_words = re.sub(r" +", " ", training_list_words)
training_list_words = training_list_words.split(" ")

# Strip non-word characters first, then keep the unique set of non-empty words
train_unique_words = set(re.sub(r"\W", "", x) for x in training_list_words)
train_unique_words.discard("")

# Number of unique words in training 
vocab_size_train = len(train_unique_words)

# Description of summarized comments in training data
print('Unique words in training data: %s' % vocab_size_train)
print('First 5 words in our unique set of words: \n %s' % list(train_unique_words)[:5])

train_unique_words

## Naive Bayes for Spam Classification

OK, so here's the deal:

- we have already separated our data into 2 subsets: train and test

- next we create several functions that count how many times each word appears in spam and ham comments, and compute the probability of each word appearing in spam/ham

- then come the 2 most important functions: train() and classify()

- and finally we check the accuracy of our predictions
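
The steps above rest on Bayes' rule together with the "naive" assumption that the words of a comment are conditionally independent given its label. As a sketch, in notation introduced here for illustration (a comment $c$ with words $w_1, \dots, w_n$, a vocabulary of $V$ unique words, and a smoothing parameter $\alpha$):

$$P(\text{spam} \mid c) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}), \qquad P(w \mid \text{spam}) = \frac{\text{count}(w, \text{spam}) + \alpha}{\sum_{w'} \text{count}(w', \text{spam}) + \alpha V}.$$

The same expressions hold with "ham" in place of "spam". A comment is classified as spam when its spam score is larger than its ham score.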

Let's code!

In [None]:
# Dictionaries with words as "keys" and, as "values", the number of
# spam (positive) and ham (negative) comments in which each word appears
trainPositive = dict()
trainNegative = dict()

# Initialize the total word counts of each class to zero
positiveTotal = 0
negativeTotal = 0

# Initialize the prior probabilities to zero, but as floats ;) 
pSpam = 0.0
pNotSpam = 0.0

# Laplace smoothing
alpha = 1
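
The effect of Laplace smoothing can be seen on a tiny example. This is a minimal sketch with made-up counts; `smoothed_probability` and `spam_counts` are hypothetical names that are not part of the notebook. A word that never appears in spam still gets a small non-zero probability, and the smoothed probabilities still sum to 1.

```python
def smoothed_probability(word_count, total_count, vocab_size, alpha=1):
    """Laplace-smoothed estimate: (count + alpha) / (total + alpha * vocabulary size)."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# Hypothetical counts of four vocabulary words in spam comments
spam_counts = {"check": 3, "out": 2, "video": 1, "love": 0}
total = sum(spam_counts.values())
vocab_size = len(spam_counts)

probs = {word: smoothed_probability(count, total, vocab_size)
         for word, count in spam_counts.items()}
print(probs)
```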

In [None]:
# Initialize the dictionaries of word counts   
for word in train_unique_words:
    
    # Start every word with a count of zero in both classes
    trainPositive[word] = 0
    trainNegative[word] = 0

In [None]:
# Count the number of spam and ham comments in which each word appears
def processComment(comment,label):
    global positiveTotal
    global negativeTotal
    
    # Lowercase the comment, split it into words, and strip non-word characters
    words = [re.sub(r"\W", "", x) for x in comment.lower().split(" ")]
    
    # Keep each non-empty word once per comment
    words = set(x for x in words if x)
    
    # Go over each word in comment
    for word in words:
        # spam comment (label 1)
        if label == 1:
            # Increment number of times word appears in spam comments
            trainPositive[word] = trainPositive.get(word, 0) + 1
            positiveTotal += 1
        # ham comment (label 0)
        elif label == 0:
            # Increment number of times word appears in ham comments
            trainNegative[word] = trainNegative.get(word, 0) + 1
            negativeTotal += 1
            

In [None]:
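# Define Prob(word|spam) and Prob(word|ham)
def conditionalWord(word,label):
   
    # Laplace smoothing parameter
    # reminder: to assign to a global variable inside a function 
    # you have to declare it using the word 'global'
    global alpha
    
    # word in ham comment
    if(label == 0):
        # Compute Prob(word|ham) with Laplace smoothing
        return (trainNegative.get(word, 0) + alpha) / float(negativeTotal + alpha * vocab_size_train)
    
    # word in spam comment
    else:
        # Compute Prob(word|spam) with Laplace smoothing
        return (trainPositive.get(word, 0) + alpha) / float(positiveTotal + alpha * vocab_size_train)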

In [None]:
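# Define Prob(comment|spam) or Prob(comment|ham)
def conditionalComment(comment,label):
    
    # Initialize conditional probability
    prob_label_comment = 1.0
    
    # Split comments into list of words (same cleaning as in processComment)
    words = [re.sub(r"\W", "", x) for x in comment.lower().split(" ")]
    words = [x for x in words if x]
    
    # Go through all words in comments
    for word in words:
        
        # Multiply in Prob(word|label)
        # Conditional independence is assumed here
        prob_label_comment *= conditionalWord(word, label)
    
    return prob_label_comment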
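
A practical caveat with multiplying many small probabilities is floating-point underflow: for a long comment the product can round to exactly 0.0 for both labels. A common workaround, sketched below with hypothetical names (`log_score`, `tiny`), is to sum logarithms instead of multiplying probabilities. This is an optional variation and not part of the notebook's code.

```python
import math

def log_score(words, log_prior, word_probability):
    """Return log P(label) plus the sum of log P(word | label) over the words."""
    return log_prior + sum(math.log(word_probability(w)) for w in words)

# Hypothetical setting: every word has probability 1e-4 under the label
def tiny(word):
    return 1e-4

words = ["w"] * 100

# Plain product of 100 small probabilities
naive_product = 1.0
for w in words:
    naive_product *= tiny(w)

score = log_score(words, math.log(0.5), tiny)
print(naive_product)  # the product has underflowed
print(score)          # the log-space score is still a usable finite number
```

Since the logarithm is increasing, comparing log scores gives the same decision as comparing the original products whenever those products do not underflow.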

In [None]:
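# Train naive bayes by computing several conditional probabilities in training data
def train():
    # reminder: we will need pSpam and pNotSpam here ;) 
    global pSpam
    global pNotSpam

    # Initialize our variables: the total number of comments and the number of spam comments 
    total = 0
    numSpam = 0
    
    # Go over each comment in training data 
    for content, label in zip(data_comments_train["content"], data_comments_train["label"]):
        
        # check if comment is spam or not 
        # increment the values depending if comment is spam or not
        if label == 1:
            numSpam += 1
        total += 1
        
        # update dictionary of spam and not spam comments
        processComment(content, label)
    
    # Compute prior probabilities, P(spam), P(ham)
    pSpam = numSpam / float(total)
    pNotSpam = (total - numSpam) / float(total)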
    


In [None]:
# Run naive bayes
train()

In [None]:
# Classify a comment as spam or ham
def classify(comment):
    
    # get global variables
    global pSpam
    global pNotSpam
    
    # Compute value proportional to Pr(ham|comment)
    isNegative = pNotSpam * conditionalComment(comment, 0)
    
    # Compute value proportional to Pr(spam|comment)
    isPositive = pSpam * conditionalComment(comment, 1)
    
    # Output True = spam, False = ham depending on the 2 previously computed variables
    return isPositive > isNegative

In [None]:
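# Initialize spam prediction in test data
prediction_test = []

# Get prediction accuracy on test data
for comment in data_comments_test["content"]:

    # add classified comment to prediction_test list 
    prediction_test.append(classify(comment))

# Check accuracy: 
# first the number of correct predictions 
correct_labels = (data_comments_test["label"].values == np.array(prediction_test)).sum()
# then the mean of correct predictions
test_accuracy = float(correct_labels) / len(prediction_test)

#print(prediction_test)
print("Proportion of comments classified correctly on test set: %s" % test_accuracy)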
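
Accuracy alone does not show which kind of mistake a classifier makes. As an optional sketch on made-up labels, the hypothetical helper `confusion_counts` below separates spam that was missed from ham that was wrongly flagged, treating spam (label 1, prediction True) as the positive class.

```python
def confusion_counts(true_labels, predictions):
    """Count true positives, false positives, true negatives, and false negatives."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predictions):
        if predicted and truth == 1:
            tp += 1  # spam correctly flagged
        elif predicted and truth == 0:
            fp += 1  # ham wrongly flagged as spam
        elif not predicted and truth == 0:
            tn += 1  # ham correctly left alone
        else:
            fn += 1  # spam that was missed
    return tp, fp, tn, fn

# Hypothetical labels and predictions, for illustration only
toy_truth = [1, 1, 0, 0, 1, 0]
toy_pred = [True, False, False, True, True, False]

tp, fp, tn, fn = confusion_counts(toy_truth, toy_pred)
accuracy = (tp + tn) / len(toy_truth)
print(tp, fp, tn, fn, accuracy)
```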

Let's write some comments to see whether they are classified as spam or ham. Recall that "True" means spam and "False" means ham. 
Try your own!

In [None]:
# spam
classify("Guys check out my new chanell")

In [None]:
# spam
classify("I have solved P vs. NP, check my video https://www.youtube.com/watch?v=dQw4w9WgXcQ")

In [None]:
# ham
classify("I liked the video")

In [None]:
# ham
classify("Its great that this video has so many views")

### to go further...
## Extending Bag of Words by Using TF-IDF

So far we have been using the Bag of Words model to represent comments as vectors. The "Bag of Words" is a list of all unique words found in the training data; each comment can then be represented by a vector that contains the frequency of each unique word in the comment. For example, if the training data contains the words $(hi, how, my, grade, are, you),$ then the text "how are you you" can be represented by $(0,1,0,0,1,2).$ The main reason we do this in our application is that comments vary in length, whereas the number of unique words, and hence the length of every vector, stays fixed. 
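
The example above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `bag_of_words` is an assumption made for this illustration.

```python
def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary word in the text, in vocabulary order."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

vocabulary = ["hi", "how", "my", "grade", "are", "you"]
vector = bag_of_words("how are you you", vocabulary)
print(vector)  # [0, 1, 0, 0, 1, 2]
```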

In our context, TF-IDF measures how important a word is in a comment relative to all the comments in our training data. For example, if a word such as "the" appears in most of the comments, its TF-IDF will be small, because this word does not help us differentiate across comments. "TF" stands for "Term Frequency" and "IDF" stands for "Inverse Document Frequency". "TF", denoted by $tf(w,c)$, is the number of times the word $w$ appears in the given comment $c$, whereas "IDF" measures how much information a given word provides for differentiating comments. Specifically, $idf(w, D) = \log\left(\frac{\text{Number of comments in train data } D}{\text{Number of comments containing the word } w}\right).$ To combine "TF" and "IDF", we simply take the product, hence $$TFIDF(w,c,D) = tf(w,c) \times idf(w, D) = (\text{Number of times } w \text{ appears in comment } c)\times \log\left(\frac{\text{Number of comments in train data } D}{\text{Number of comments containing the word } w}\right).$$
TF-IDF can now be used to weight the vectors that result from the "Bag of Words" approach. For example, suppose a comment contains "this" 2 times, so $tf = 2$. If we have 1000 comments in our training data and the word "this" appears in 100 of them, then with the natural logarithm $idf = \log(1000/100) = \log(10) \approx 2.30.$ In this example, the TF-IDF weight of the word "this" in that comment is therefore about $2 \times 2.30 \approx 4.6$. To incorporate TF-IDF into the naive Bayes setting with add-one smoothing, we can compute $$Pr(word|spam) = \frac{\sum_{c \text{ is spam}}TFIDF(word,c,D) + 1}{\sum_{word' \text{ in spam } c}\sum_{c \text{ is spam}}TFIDF(word',c,D)+ \text{Number of unique words in data}},$$ where $TFIDF(word,c,D) = tf(word,c) \times idf(word,D).$ 
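
The numerical value of a TF-IDF weight depends on the base of the logarithm. The sketch below evaluates the example (a word appearing twice in a comment, 1000 training comments, 100 of them containing the word) with both the natural and the base-10 logarithm. The function name `tf_idf` and its `log` parameter are assumptions made for this illustration.

```python
import math

def tf_idf(term_count, num_comments, num_comments_with_word, log=math.log):
    """Term frequency times log(total comments / comments containing the word)."""
    return term_count * log(num_comments / num_comments_with_word)

natural = tf_idf(2, 1000, 100)                  # natural logarithm
base10 = tf_idf(2, 1000, 100, log=math.log10)   # base-10 logarithm
print(natural, base10)
```

Either base can be used, as long as the same one is applied to every word, because changing the base only rescales all weights by a constant factor.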

In [None]:
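# Compute tfidf(word, comment, data) for every word of a comment
def TFIDF(comment, train):
    
    # Split comment into list of words
    comment = comment.lower().split(" ")
    
    # Initialize tf-idf for given comment
    tfidf_comment = np.zeros(len(comment))
    
    # Initialize number of comments containing a word
    num_comment_word = 0
    
    # Initialize index for words in comment
    word_index = 0
    
    # Go over all words in comment
    for word in comment:
        
        # Compute term frequency (tf)
        # Count frequency of word in comment
        tf = comment.count(word)
        
        # Find number of comments containing word
        for text in train["content"]:
            
            # Increment word counter if word found in comment
            if word in text.lower().split(" "):
                num_comment_word += 1
        
        # Compute inverse document frequency (idf)
        # log(Total number of comments/number of comments with word)
        # Assumption: a word found in no training comment gets idf = 0
        if num_comment_word > 0:
            idf = np.log(len(train) / float(num_comment_word))
        else:
            idf = 0.0
        
        # Update tf-idf weight for word
        tfidf_comment[word_index] = tf * idf
        
        # Reset number of comments containing a word
        num_comment_word = 0
        
        # Move onto next word in comment
        word_index += 1
        
    return tfidf_comment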

In [None]:
TFIDF("Check out my new music video plz",data_comments_train)