# **Problem:** 
Let’s build a quick filter for an online message board that flags
a message as inappropriate if the author uses negative or abusive language.

# **Prepare: making word vectors from text**

In [71]:
# This function creates some example data to experiment with; it serves as the training data.
# The first value returned by loadDataSet() is a list of documents from a dog lovers' message board.
# Each document has already been split into tokens (words),
# and punctuation has been removed from the text.

# The second value returned by loadDataSet() is a list of class labels, one per document.
# There are two classes: abusive (1) and not abusive (0).
# The labels were assigned by a human and are used to train a program that automatically detects abusive posts.

import numpy as np

def loadDataSet():
    documentVec=[['my', 'dog', 'has', 'flea', # document - list of tokens or words
    'problems', 'help', 'please'],
    ['maybe', 'not', 'take', 'him',
    'to', 'dog', 'park', 'stupid'],
    ['my', 'dalmation', 'is', 'so', 'cute',
    'I', 'love', 'him'],
    ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
    ['mr', 'licks', 'ate', 'my', 'steak', 'how',
    'to', 'stop', 'him'],
    ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    # 6 labels for 6 messages
    classVec = [0,1,0,1,0,1] #1 is abusive, 0 not
    return documentVec,classVec

In [72]:
# This function creates a list of all the unique words across all of our documents.
# The vocabulary list contains every word you’d like to examine; each word becomes one feature.

def createVocabList(dataSet): # dataSet should be a list of token lists
    vocabSet = set() # a set holds only unique elements
    for document in dataSet:
        # add the words of each document to the vocabulary;
        # | is the union of two sets
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

In [73]:
# This function takes the vocabulary list and a document and outputs a vector of word counts:
# each entry says how many times the corresponding vocabulary word occurs in the given document
# (0 if it is absent). This is the bag-of-words model. For documents without repeated words,
# such as the training data here, the vector contains only 1s and 0s.

# First, create a vector the same length as the vocabulary list and fill it with 0s.
# Next, go through the words in the document; if a word
# is in the vocabulary list, increment its entry in the output vector.

def bagOfWords2Vec(vocabList, tokenList):
    returnVec = [0]*len(vocabList)
    for word in tokenList:
        try:
            index = vocabList.index(word)
        except ValueError:
            print("the word: {} is not in my Vocabulary!".format(word))
        else:
            returnVec[index] += 1 # bag of words model
            
    return np.array(returnVec)
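
To make the vectorization step concrete, here is a minimal self-contained sketch on two toy documents. The documents, the helper name `to_count_vector`, and the sorted vocabulary are assumptions for illustration only; `createVocabList` above returns the words in arbitrary set order, so the real vector layout may differ from run to run.

```python
import numpy as np

# Two toy documents (hypothetical, not part of the training data above)
docs = [["my", "dog", "my", "park"], ["stupid", "dog"]]

# Sorting is an assumption made here so the word order is reproducible
vocab = sorted(set().union(*docs))

def to_count_vector(vocab, tokens):
    """Count how often each vocabulary word occurs in tokens."""
    vec = np.zeros(len(vocab), dtype=int)
    for word in tokens:
        if word in vocab:
            vec[vocab.index(word)] += 1
    return vec

vec0 = to_count_vector(vocab, docs[0])
print(vocab)
print(vec0)
```

Because "my" occurs twice in the first toy document, its entry is 2 rather than 1, which is the difference between a bag-of-words vector and a pure presence/absence vector.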

# **Train: calculating probabilities from word vectors**

Let’s see how to calculate the probabilities with these numbers. You know whether a word occurs in a document,
and you know what class the document belongs to.

Let $w$ represent the vector of words: $w = [w_{0}, w_{1}, w_{2}, w_{3}, ...]$
<br>Now we need to calculate the probability that a message is abusive or not abusive (that is, the probability that it belongs to a particular class), given that it contains a particular vector of words,
<br> or $P(class_{i}|w)$

By Bayes' theorem,
<br>$P(class_{i}|w)$ = $\dfrac {P(w|class_{i})\ * P(class_{i})}{P(w)}$

Now, $P(class_{i})$ can be calculated as, $\dfrac{Number\ of\ times\ class_{i}\ has\ been\ reported}{Total\ number\ of\ reports}$

$P(w|class_{i})$ is the probability of that particular vector of words occurring, given that the message belongs to $class_{i}$.
<br>How can we calculate this? This is where our naïve assumption comes in.
<br>If we expand $w$ into
individual features, we can rewrite this as $P(w_{0}, w_{1}, w_{2}, ..., w_{N}|class_{i})$.
<br>Our assumption is that
all the words are conditionally independent given the class: within a class, the presence of one word tells us nothing about the presence of another.
Under this assumption we can calculate the probability as $P(w_{0}|class_{i})\ *\ P(w_{1}|class_{i})\ *\ P(w_{2}|class_{i})\ ...\ P(w_{N}|class_{i})$
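
As a small worked example of these formulas, the sketch below computes the class prior from the six training labels and then forms the numerator of Bayes' theorem for the abusive class. The per-word probabilities in `word_probs` are made-up numbers for illustration, not values estimated from the data.

```python
import math

# Labels from loadDataSet(): 1 = abusive, 0 = not abusive
class_vec = [0, 1, 0, 1, 0, 1]
p_abusive = sum(class_vec) / len(class_vec)   # P(class_1)

# Hypothetical per-word conditional probabilities P(w_j | class_1)
# for a three-word message; these numbers are made up for illustration.
word_probs = [0.2, 0.1, 0.05]

# Naive assumption: P(w | class_1) is the product of the per-word terms
p_w_given_abusive = math.prod(word_probs)

# Numerator of Bayes' theorem: P(w | class_1) * P(class_1)
numerator = p_w_given_abusive * p_abusive
print(p_abusive, p_w_given_abusive, numerator)
```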

In [74]:
# This function takes a list of word vectors (word-count vectors) and their corresponding category list
# and returns, for each class, the vector of log conditional probabilities of the words,
# i.e. [log P(w0|ci), log P(w1|ci), log P(w2|ci), ...], together with the class priors.

# We get these by adding up all the word vectors for each category,
# so that the summed vector contains the count of each word in that category.
# We divide it by the total word count of the category to get the conditional probabilities.

# trainMatrix - matrix (list) of documents
# trainCategory - corresponding list of categories
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix) # number of documents
    numWords = len(trainMatrix[0]) # number of words in each word vector (the length of vocabulary list)
    pAbusive = sum(trainCategory)/float(numTrainDocs) # probability of Abusive class
    
    # The numerator is a NumPy array with the same length as the word vector.
    # np.zeros would create an array of zeros of the given size; we use np.ones instead (explained below).
    # In the for loop we loop over all the documents in trainMatrix, our training set.
    # Every time a word appears in a document, the count for that word (p1Num or p0Num) gets incremented,
    # and the total word count for the class (p1Denom or p0Denom) grows by the number of words in that document.
    
    # To calculate the conditional probability of a document for a given class,
    # we have to multiply the probabilities of all the words that occur in it.
    # The problem is that multiplying many small values underflows: the product eventually rounds to 0.
    # Hence we convert them to logs and add them,
    # since log(a*b) = log(a) + log(b).
    # The natural log of a function can be used in place of the function when you’re interested
    # in finding where it is largest, because log is monotonically increasing.
    
    # A single zero count would make the whole product 0 (and log(0) = -inf),
    # so we initialize the numerators with ones and the denominators with 2.
    # There isn't an exact science to these values: adding 1 to each count and 2 to the total
    # is a heuristic that removes the zeros while keeping the ratios close to the raw estimates.
    
    p0Num = np.ones(numWords); p1Num = np.ones(numWords) # p0 - Class0, p0Num - p0 numerator
    p0Denom = 2.0; p1Denom = 2.0 # the denominator is initialized by twos
    for i in range(numTrainDocs): # p1-abusive, p0-notabusive
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])      
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = np.log(p1Num/p1Denom) # we convert them into log
    p0Vect = np.log(p0Num/p0Denom) 
    pNotAbusive = 1 - pAbusive
    return p0Vect,p1Vect,pAbusive, pNotAbusive
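
The two numerical concerns mentioned in the comments above (underflow and zero counts) can be seen in isolation in the following sketch. The 200 probabilities of 0.001 and the counts `count` and `total` are hypothetical values chosen only to trigger the effects.

```python
import numpy as np

# 200 hypothetical word probabilities of 0.001 each (illustrative values only)
probs = np.full(200, 1e-3)

product = np.prod(probs)          # true value is 1e-600, below the float64 range
log_sum = np.sum(np.log(probs))   # the same quantity on the log scale

print(product)   # underflows to 0.0
print(log_sum)   # finite, roughly -1381.55

# A word never seen in a class would get probability 0, and log(0) is -inf.
# Starting the count at 1 and the total at 2 avoids that.
count, total = 0, 10              # hypothetical counts
with np.errstate(divide="ignore"):
    unsmoothed = np.log(count / total)
smoothed = np.log((count + 1) / (total + 2))
print(unsmoothed, smoothed)
```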

In [75]:
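doclist, categorylist = loadDataSet()  # training documents and their labels
vocablist = createVocabList(doclist)   # unique words across all documents
trainMatrix = []
for doc in doclist:
    trainMatrix.append(bagOfWords2Vec(vocablist, doc))
p0Vec, p1Vec, class1, class0 = trainNB0(trainMatrix, categorylist)
p1Vec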

array([-3.04452244, -3.04452244, -3.04452244, -2.35137526, -3.04452244,
       -3.04452244, -3.04452244, -3.04452244, -3.04452244, -1.94591015,
       -2.35137526, -3.04452244, -2.35137526, -2.35137526, -3.04452244,
       -3.04452244, -3.04452244, -3.04452244, -2.35137526, -2.35137526,
       -3.04452244, -1.65822808, -3.04452244, -2.35137526, -3.04452244,
       -2.35137526, -1.94591015, -2.35137526, -2.35137526, -3.04452244,
       -2.35137526, -2.35137526])

In [76]:
# Up to this point, we've calculated the conditional probability vectors, i.e. P(w|c), from the training set.
# Each entry is the probability of a word (feature) given that the document belongs to class c.

# In the Naive Bayes classifier, we are given a new document to classify.
# The probability that the document belongs to class_i is P(class_i|w) = P(w|class_i) * P(class_i) / P(w)

# First we need to calculate P(w|class_i).
# We know (or at least assume) that the words (tokens, features) are independent given the class,
# and their individual probabilities are already stored in the probability vector.
# Hence we pick out the entries for the words in the document and combine them into P(w|class_i).

# We already computed P(class_i) from the training set.

# We don't need to calculate P(w):
# we are only comparing the classes against each other, and P(w) is the same for every class, so it cancels.

# Let's build the Naive Bayes classifier.

def classifyNB(vector2classify, p0Vec, p1Vec, class1, class0):
    p1 = sum(vector2classify * p1Vec) + np.log(class1) # sum the log probabilities of the document's words
    p0 = sum(vector2classify * p0Vec) + np.log(class0) # and add the log prior, log P(class)
    if p1 > p0:
        return 1
    else:
        return 0
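
To check that dropping $P(w)$ really leaves the decision unchanged, the sketch below normalizes two log scores into proper posteriors and compares the winning class before and after. The scores in `log_scores` are hypothetical; in the notebook they would correspond to `p0` and `p1` inside `classifyNB`.

```python
import numpy as np

# Hypothetical log scores, log P(w|class_i) + log P(class_i), for class 0 and class 1
log_scores = np.array([-7.2, -5.9])

# Dividing by P(w) means subtracting log P(w), where P(w) is the sum of exp(score)
# over the classes. Subtracting the max first keeps the exponentials stable.
shift = log_scores.max()
log_p_w = shift + np.log(np.sum(np.exp(log_scores - shift)))
log_posteriors = log_scores - log_p_w

posteriors = np.exp(log_posteriors)
print(posteriors)
print(np.argmax(log_scores), np.argmax(log_posteriors))
```

The same constant is subtracted from every class score, so the ordering of the classes cannot change; normalization is only needed if you want actual probabilities rather than a decision.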

In [77]:
# Let's put it all together: build the vocabulary, train on the example data, and classify a new document
def testNB(document2classify):
    trainDocuments, trainClass = loadDataSet()
    vocabList = createVocabList(trainDocuments)
    trainDocumentMatrix = []
    for i in range(len(trainDocuments)):
        trainDocumentMatrix.append(bagOfWords2Vec(vocabList, trainDocuments[i]))
    testDocVec = bagOfWords2Vec(vocabList, document2classify)
    p0Vec, p1Vec, class1, class0 = trainNB0(trainDocumentMatrix, trainClass)
    return classifyNB(testDocVec, p0Vec, p1Vec, class1, class0)

In [78]:
def NBclassifier(document2classify):
    x = testNB(document2classify)
    if x == 1:
        print("This document is abusive")
    elif x == 0:
        print("This document is not abusive")

In [79]:
NBclassifier(["stupid", "garbage"])

This document is abusive
