# Naive Bayes Spam Classifier

Naive Bayes classifiers are a family of machine learning algorithms based on applying [Bayes' theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) with strong (naive) independence assumptions between features. In short, a naive Bayes classifier treats every feature as independent of the others given the class, which makes inference very efficient. These classifiers are commonly used for spam detection.

## Before we start...

Let's quickly cover some of the basic definitions needed to understand our problem.

### Bayes' Theorem

Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. The theorem itself is not naive; a classifier built on it is called "naive" because it assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. In other words, every feature contributes to the result independently of the others.

Mathematically, Bayes' theorem can be written as:

$$
P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}
$$

Let's break this down:
- $A$ and $B$ are separate events, and the probability of $B$ is nonzero, i.e. $P(B) \neq 0$.
- $P(A)$ and $P(B)$ are the probabilities of observing events $A$ and $B$ without regard to each other.
- $P(A \mid B)$ is the probability of observing event $A$ given that $B$ is true.
- $P(B \mid A)$ is the probability of observing event $B$ given that $A$ is true.
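
To make the formula concrete, here is a minimal sketch that evaluates it for a single word. The numbers are made up for illustration (they are not measured from the SMS data), and `posterior` is a hypothetical helper name. It expands $P(B)$ with the law of total probability, which is one common way to obtain the denominator.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A | B) using Bayes' theorem and the law of total probability."""
    # P(B) = P(B | A) P(A) + P(B | not A) P(not A)
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    if p_b == 0:
        raise ValueError("P(B) must be non-zero")
    return p_b_given_a * prior_a / p_b


# Illustrative assumptions: 20% of messages are spam, and the word "free"
# appears in 60% of spam messages but only 5% of legitimate ones.
p_spam_given_free = posterior(prior_a=0.2, p_b_given_a=0.6, p_b_given_not_a=0.05)
print(round(p_spam_given_free, 2))  # 0.75
```

With these assumed numbers, seeing the word "free" raises the probability of spam from the 20% prior to 75%.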

When applying Bayes' theorem to spam classification, we can rewrite the problem statement as:

$$
P(\textrm{spam} \mid w_1 \cap w_2 \cap \dots \cap w_n) = \frac{P(w_1 \cap w_2 \cap \dots \cap w_n \mid \textrm{spam}) \, P(\textrm{spam})}{P(w_1 \cap w_2 \cap \dots \cap w_n)}
$$

Here, a message $m$ is made up of $n$ words, so $m = (w_1 \cap w_2 \cap \dots \cap w_n)$. We assume that, given the class (spam or not spam), the occurrence of any word $w_i$ is independent of all other words.
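
Under that independence assumption, the joint likelihood factorizes into per-word terms, $P(w_1 \cap \dots \cap w_n \mid \textrm{spam}) = \prod_{i=1}^{n} P(w_i \mid \textrm{spam})$, so each class can be scored by multiplying its prior by one probability per word. The sketch below shows one way to do this for a Bernoulli model (word present or absent), working in log space to avoid underflow. It is an illustration only, not the `naive_bayes` module used later in this notebook: the toy messages, the function names, and the choice of Laplace smoothing are all assumptions.

```python
import math
from collections import Counter

# Toy training data, invented purely for illustration (not from the UCI set).
TRAIN = [
    ("win free prize now", "spam"),
    ("free mobile update call now", "spam"),
    ("are we meeting for lunch", "ham"),
    ("call me when you are free", "ham"),
    ("see you at lunch", "ham"),
]


def tokenize(text):
    """Lowercase and split on whitespace; return the set of words present."""
    return set(text.lower().split())


def train(examples):
    """Estimate class priors and per-word presence probabilities."""
    class_counts = Counter(label for _, label in examples)
    vocab = set()
    doc_freq = {label: Counter() for label in class_counts}
    for text, label in examples:
        words = tokenize(text)
        vocab |= words
        doc_freq[label].update(words)
    total = len(examples)
    priors = {c: n / total for c, n in class_counts.items()}
    # Laplace smoothing (+1 / +2) keeps every probability strictly in (0, 1).
    likelihoods = {
        c: {w: (doc_freq[c][w] + 1) / (class_counts[c] + 2) for w in vocab}
        for c in class_counts
    }
    return priors, likelihoods, vocab


def predict(text, priors, likelihoods, vocab):
    """Return the class with the highest log posterior (up to a constant)."""
    words = tokenize(text)
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for w in vocab:
            p = likelihoods[c][w]
            # Bernoulli model: absent words contribute log(1 - p).
            score += math.log(p if w in words else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


priors, likelihoods, vocab = train(TRAIN)
print(predict("free prize call now", priors, likelihoods, vocab))  # spam
print(predict("lunch at noon", priors, likelihoods, vocab))        # ham
```

The denominator $P(w_1 \cap \dots \cap w_n)$ is the same for both classes, so it can be dropped when only the most likely class is needed. Words that never appeared in training (such as "noon" above) are simply ignored in this sketch.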



## 1. Environment Setup

In [1]:
# Load the watermark extension and print details about the current platform
%load_ext watermark
%watermark -v -n -m -p numpy,scipy,sklearn,pandas,matplotlib
# Automatically reload imported modules when their source changes
%load_ext autoreload
%autoreload 2

# Make sure matplotlib plots display inline
%matplotlib inline

# Import packages
import os
import re
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.naive_bayes import BernoulliNB

# Get the project directory (two levels above the current working directory)
PROJ_ROOT = os.path.abspath(os.path.join(os.pardir, os.pardir))
print(PROJ_ROOT)

# Add the project's models directory to the import path
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir, 'models'))
if module_path not in sys.path:
    print(module_path)
    sys.path.append(module_path)

import naive_bayes as classifier

# Load environment variables from a .env file (located via find_dotenv)
%load_ext dotenv
%dotenv

Wed Dec 26 2018 

CPython 3.7.1
IPython 7.2.0

numpy 1.15.4
scipy 1.1.0
sklearn 0.20.1
pandas 0.23.4
matplotlib 3.0.2

compiler   : Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 18.0.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit
/Users/sebp/LocalDocuments2/DataScience/Personal/MachineLearning/MachineLearning
/Users/sebp/LocalDocuments2/DataScience/Personal/MachineLearning/MachineLearning/models
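
The model-building cells in this notebook read database credentials from environment variables with `os.environ.get`, which silently returns `None` when a variable is not set. As a hedged sketch, the lookup could be wrapped in a small helper that fails early instead. `load_db_credentials` and `DB_ENV_VARS` are hypothetical names, not part of the `naive_bayes` module; only the variable names themselves come from this notebook.

```python
import os

# Variable names taken from the notebook's own os.environ.get calls.
DB_ENV_VARS = {
    "user": "DB_USER",
    "password": "DB_PASSWORD",
    "host": "DB_HOST",
    "port": "DB_PORT",
    "database": "DB_DATABASE",
}


def load_db_credentials(environ=None):
    """Build the db_credentials dict, failing early if a variable is missing.

    Hypothetical helper for illustration; pass a mapping to override os.environ.
    """
    environ = os.environ if environ is None else environ
    missing = [var for var in DB_ENV_VARS.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: environ[var] for key, var in DB_ENV_VARS.items()}


# Example with a stand-in mapping instead of the real environment.
example_env = {
    "DB_USER": "demo",
    "DB_PASSWORD": "secret",
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "DB_DATABASE": "sms",
}
print(load_db_credentials(example_env)["host"])  # localhost
```

The resulting dictionary has the same keys as the `db_credentials` argument used in this notebook, so it could be passed in its place.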


## 2. Build and Test the Model

In [2]:
def build_and_test_model():
    naive_bayes_classifier = classifier.NaiveBayesClassifier(db_credentials = {'user' : os.environ.get('DB_USER'),
        'password' : os.environ.get('DB_PASSWORD'),
        'host' : os.environ.get('DB_HOST'),
        'port' : os.environ.get('DB_PORT'),
        'database' : os.environ.get('DB_DATABASE')})

    print('UCI SMS SPAM CLASSIFICATION PROBLEM SET\n  -- implemented by Bernoulli Naive Bayes Model\n')
#     naive_bayes_classifier.tabu_file = PROJ_ROOT + '/data/naive_bayes/interim/tabu.txt'          # user defined tabu file

    # build a tabu list based on the training data
    naive_bayes_classifier.tabu_list = naive_bayes_classifier.generate_tabu_list()

    naive_bayes_classifier.tabu, naive_bayes_classifier.tabu_length = naive_bayes_classifier.read_tabu_list(naive_bayes_classifier.tabu_list)
    # train the Naive Bayes Model using training data
    naive_bayes_classifier.NaiveBayes=naive_bayes_classifier.learn()
    # Test Model using testing data
    naive_bayes_classifier.test(naive_bayes_classifier.NaiveBayes)
    print('>>>Testing')
    # I select two messages from the test data here.
    naive_bayes_classifier.predictSMS('how many rows through dataframe that function will process')
    naive_bayes_classifier.predictSMS('Had your mobile 10 mths? Update to the latest Camera/Video phones for FREE. KEEP UR SAME NUMBER, Get extra free mins/texts. Text YES for a call')

In [3]:
build_and_test_model()

UCI SMS SPAM CLASSIFICATION PROBLEM SET
  -- implemented by Bernoulli Naive Bayes Model

>>>Generating Tabu List...
  Tabu List Size: 300
   The words shorter than 3 are ignored by model

>>>Learning...
  Learning Sample Size: 4429
  Accuarcy (Training sample): 98.22％

>>>Cross Validation...
  Testing Sample Size: 1118
  Accuarcy (Testing sample): 98.12％

>>>Testing
HAM: how many rows through dataframe that function will process
SPAM: Had your mobile 10 mths? Update to the latest Camera/Video phones for FREE. KEEP UR SAME NUMBER, Get extra free mins/texts. Text YES for a call
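
The run above trains on 4429 messages and evaluates on 1118, roughly an 80/20 split. The sketch below shows how such a hold-out split and an accuracy figure could be computed, assuming accuracy means the fraction of correctly classified messages. `train_test_split` and `accuracy` here are hypothetical pure-Python helpers, not the methods of the `naive_bayes` module, and the labels in the example are invented.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


train_items, test_items = train_test_split(range(10), test_fraction=0.2)
print(len(train_items), len(test_items))  # 8 2

# Invented labels: three of the four predictions are correct.
print(accuracy(["spam", "ham", "ham", "spam"], ["spam", "ham", "spam", "spam"]))  # 0.75
```

Evaluating on messages the model never saw during training is what makes the test accuracy a meaningful estimate of how the classifier behaves on new messages.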


In [6]:
def build_and_test_model2():
    naive_bayes_classifier = classifier.NaiveBayesClassifier(db_credentials = {'user' : os.environ.get('DB_USER'),
        'password' : os.environ.get('DB_PASSWORD'),
        'host' : os.environ.get('DB_HOST'),
        'port' : os.environ.get('DB_PORT'),
        'database' : os.environ.get('DB_DATABASE')})
    naive_bayes_classifier.data_source= 'sql'

    print('UCI SMS SPAM CLASSIFICATION PROBLEM SET\n  -- implemented by Bernoulli Naive Bayes Model\n')

    # build a tabu list based on the training data
    naive_bayes_classifier.tabu_list = naive_bayes_classifier.generate_tabu_list(spam_label=1, valid_label=0)

    naive_bayes_classifier.tabu, naive_bayes_classifier.tabu_length = naive_bayes_classifier.read_tabu_list(naive_bayes_classifier.tabu_list)
    # train the Naive Bayes Model using training data
    naive_bayes_classifier.NaiveBayes=naive_bayes_classifier.learn(spam_label=1)
    # Test Model using testing data
    naive_bayes_classifier.test(naive_bayes_classifier.NaiveBayes, spam_label=1)
    print('>>>Testing')
    # Classify four sample messages.
    naive_bayes_classifier.predict_fraud('I wanted to do something different so I made a list of everything I love about you. I ran out of paper, so I sent this text. You are the absolute best friend anyone could have.')
    naive_bayes_classifier.predict_fraud('You are my best friend that I’ve ever had in my life and even in imagination too. Happy Birthday.')
    naive_bayes_classifier.predict_fraud('Merry Christmas! Happy baby shopping.')
    naive_bayes_classifier.predict_fraud('Merry Christmas! Wish I was there for NutMeg’s First Christmas! Love you all so much!!!')

In [7]:
build_and_test_model2()

UCI SMS SPAM CLASSIFICATION PROBLEM SET
  -- implemented by Bernoulli Naive Bayes Model

>>>Generating Tabu List...
  Tabu List Size: 300
   The words shorter than 3 are ignored by model

>>>Learning...
  Learning Sample Size: 842
  Accuarcy (Training sample): 98.69％

>>>Cross Validation...
  Testing Sample Size: 220
  Accuarcy (Testing sample): 99.09％

>>>Testing
Fraud: I wanted to do something different so I made a list of everything I love about you. I ran out of paper, so I sent this text. You are the absolute best friend anyone could have.
Fraud: You are my best friend that I’ve ever had in my life and even in imagination too. Happy Birthday.
Not Fraud: Merry Christmas! Happy baby shopping.
Not Fraud: Merry Christmas! Wish I was there for NutMeg’s First Christmas! Love you all so much!!!
