<img src="Images/naive_slide_1.png" width="700" height="700">



<img src="Images/naive_slide_2.png" width="700" height="700">

## Text Classification

Text classification attaches labels (for example, tax document or medical form) to bodies of text, based on the text itself

Think of the spam folder in your email. How does your email provider know whether a particular message is spam or “ham” (not spam)?

## What pre-processing is needed to apply a machine learning algorithm to text data?

1. The text must be parsed into words; this is called tokenization

2. The words then need to be encoded as integers or floating point values

3. The scikit-learn library offers easy-to-use tools for both tokenization and feature extraction of text data
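
The first two steps can be sketched in plain Python. This is only a minimal illustration, not what scikit-learn does internally: the `tokenize` helper, the regular expression, and the alphabetical assignment of integer ids are assumptions chosen for brevity.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

sentences = ["This is the first document.", "And this is the third one."]

# Step 1: tokenization
tokens = [tokenize(s) for s in sentences]

# Step 2: give each distinct word an integer id (alphabetical order here)
all_words = sorted({word for sentence in tokens for word in sentence})
vocab = {word: idx for idx, word in enumerate(all_words)}
encoded = [[vocab[word] for word in sentence] for sentence in tokens]

print(tokens[0])   # ['this', 'is', 'the', 'first', 'document']
print(encoded[0])  # [7, 3, 5, 2, 1]
```

Any consistent mapping from words to numbers would work; the point is that the model only ever sees the numbers.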

## Review Activity: Apply Bag-of-Words (BoW) to a text dataset

The BoW model is simple: it throws away all of the order information in the words and keeps only how often each word occurs in a document

Apply BoW with `CountVectorizer` in sklearn to the following `corpus`

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
corpus = ['This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?']
# vectorizer = TfidfVectorizer()
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
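# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())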
print(X.shape)

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)
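
To see what the count matrix contains, the same result can be rebuilt with the standard library. This is a sketch under the assumption that `CountVectorizer` runs with its documented defaults (lowercasing, tokens of two or more word characters, vocabulary in alphabetical order); the variable names are illustrative.

```python
import re
from collections import Counter

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Lowercase, then keep tokens made of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]

# Columns: every distinct word, in alphabetical order
vocabulary = sorted({word for doc in tokenized for word in doc})

# Rows: one per document, holding the count of each vocabulary word
bow = []
for doc in tokenized:
    counts = Counter(doc)
    bow.append([counts[word] for word in vocabulary])

for row in bow:
    print(row)
print(vocabulary)
```

Each row matches the corresponding row printed by `X.toarray()` above, and word order inside a document plays no role.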


## Word Frequencies with TfidfVectorizer

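Word counts are a good starting point, but they are very basic: common words such as "the" get large counts without saying much about a document

An alternative is to weight each word by how informative it is. By far the most popular method is TF-IDF (term frequency-inverse document frequency).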

**Term Frequency**: This summarizes how often a given word appears within a document

**Inverse Document Frequency**: This downscales words that appear a lot across documents

## Intuitive idea behind TF-IDF:
    
- If a word appears frequently in a document, it's important. Give the word a high score

- But if a word appears in many documents, it's not a unique identifier. Give the word a low score
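
The two rules above can be checked on the small corpus from the review activity. The sketch uses the textbook variant, tf × log(N / df), which is an assumption made for clarity; scikit-learn applies a smoothed and normalized version, so its numbers differ. The helper names `tf`, `idf`, and `tfidf` are illustrative.

```python
import math
import re

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
docs = [re.findall(r"\w+", doc.lower()) for doc in corpus]
n_docs = len(docs)

def tf(word, doc):
    """Term frequency: share of the document's tokens equal to `word`."""
    return doc.count(word) / len(doc)

def idf(word):
    """Inverse document frequency: log(N / number of documents containing `word`)."""
    df = sum(1 for doc in docs if word in doc)
    return math.log(n_docs / df)

def tfidf(word, doc):
    return tf(word, doc) * idf(word)

second_doc = docs[1]
print(tfidf("the", second_doc))       # 0.0: "the" occurs in every document
print(tfidf("document", second_doc))  # small: frequent here, but common elsewhere
print(tfidf("second", second_doc))    # largest: occurs only in this document
```

Although "document" appears twice in the second document, "second" scores higher because it appears in no other document.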

<img src="Images/tfidf_slide.png" width="700" height="700">

## TF-IDF in Sklearn

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
corpus = ['This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?']
vectorizer = TfidfVectorizer()
# vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.toarray())
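# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())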
print(X.shape)
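
The numbers produced by `TfidfVectorizer` can be reproduced by hand. The sketch below assumes the vectorizer's documented defaults: raw counts as term frequency, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper `tfidf_row` is a hypothetical name, not part of scikit-learn.

```python
import math
import re
from collections import Counter

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocabulary = sorted({word for doc in tokenized for word in doc})
n_docs = len(tokenized)

# Smoothed inverse document frequency for every vocabulary word
idf = {}
for word in vocabulary:
    df = sum(1 for doc in tokenized if word in doc)
    idf[word] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw count times idf, then scale the row to unit Euclidean length."""
    counts = Counter(doc)
    raw = [counts[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

first_row = [round(value, 4) for value in tfidf_row(tokenized[0])]
print(first_row)
# [0.0, 0.4698, 0.5803, 0.3841, 0.0, 0.0, 0.3841, 0.0, 0.3841]
```

In the first document, "first" gets the highest weight because it appears in only two of the four documents, while "is", "the", and "this" appear in all of them.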

## Optional Reading: Use Keras for text pre-processing

Another package with good methods for tokenization, BoW, TF-IDF, and more is Keras

Keras is a high-level neural networks API (deep learning API) written in Python

In [20]:
from keras.preprocessing.text import Tokenizer
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
# tok = Tokenizer(num_words=20)
tok = Tokenizer()
tok.fit_on_texts(corpus)
BoW = tok.texts_to_matrix(corpus, mode='tfidf')
print(BoW)
X = tok.texts_to_sequences(corpus)
print(X)
print(tok.word_index)

[[0.         0.58778666 0.58778666 0.58778666 0.69314718 0.84729786
  0.         0.         0.         0.        ]
 [0.         0.58778666 0.58778666 0.58778666 1.17360019 0.
  1.09861229 0.         0.         0.        ]
 [0.         0.58778666 0.58778666 0.58778666 0.         0.
  0.         1.09861229 1.09861229 1.09861229]
 [0.         0.58778666 0.58778666 0.58778666 0.69314718 0.84729786
  0.         0.         0.         0.        ]]
[[1, 2, 3, 5, 4], [1, 4, 2, 3, 6, 4], [7, 1, 2, 3, 8, 9], [2, 1, 3, 5, 4]]
{'this': 1, 'is': 2, 'the': 3, 'document': 4, 'first': 5, 'second': 6, 'and': 7, 'third': 8, 'one': 9}


## Naive Bayes Classifier (Math)

Bayes' theorem: $P(spam \mid w_1, w_2, ..., w_n) = \frac{P(w_1, w_2, ..., w_n \mid spam)\,P(spam)}{P(w_1, w_2, ..., w_n)}$

The Naive Bayes assumption is that, given the class, each word is independent of all other words. In reality, this is not true! But let's try it out on real-world examples

With this assumption, the relation above simplifies to: $P(spam \mid w_1, w_2, ..., w_n) = \frac{P(w_1 \mid spam)\,P(w_2 \mid spam) \cdots P(w_n \mid spam)\,P(spam)}{P(w_1, w_2, ..., w_n)}$

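Taking the log of both sides, and noting that the denominator is the same for spam and ham (so it contributes only a constant $C$):

$\log P(spam \mid w_1, w_2, ..., w_n) = \sum_{i=1}^{n} \log P(w_i \mid spam) + \log P(spam) + C$

Also:

$\log P(ham \mid w_1, w_2, ..., w_n) = \sum_{i=1}^{n} \log P(w_i \mid ham) + \log P(ham) + C$

where $C = -\log P(w_1, w_2, ..., w_n)$ does not depend on the class.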


If

$\sum_{i=1}^{n} \log P(w_i \mid spam) + \log P(spam) > \sum_{i=1}^{n} \log P(w_i \mid ham) + \log P(ham)$

then the sentence is spam; otherwise, the sentence is ham
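
Besides turning the product into a sum, working with logs avoids a numerical problem. The sketch below uses hypothetical likelihoods (1000 words, each with probability 1e-5) to show that the raw product underflows to zero in floating point, while the sum of logs stays usable for comparison.

```python
import math

# Hypothetical per-word likelihoods P(w_i | spam) for a long message
probs = [1e-5] * 1000

# Multiplying many small numbers underflows to exactly 0.0
product = 1.0
for p in probs:
    product *= p

# Summing the logs keeps a finite value that can still be compared
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # roughly -11512.9
```

If both the spam and ham products underflow to 0.0, they can no longer be compared, whereas the two log sums still can.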

## Pseudo-code for Naive Bayes on a spam/ham dataset:

- Assume the following small dataset is given

- The first column holds the labels of the received emails

- The second column holds the sentences (the body of each email)

<img src="Images/spam_minidataset.png" width="500" height="500">

1- Based on the dataset above, create the following two dictionaries:

    - Ham -> D_ham = {'Jos': 1,'ask':1, 'you':1,... }

    - Spam -> D_spam = {'Did': 1, 'you':3, ... }

Each dictionary holds all the words that occur in the ham or spam emails, with the number of occurrences of each word as the value

2- For any new sentence with words $w_1$, $w_2$, ... $w_n$, assume the sentence is ham and calculate:

 $P(w_1| ham)$, $P(w_2| ham)$, ..., $P(w_n| ham)$

 then $log(P(w_1| ham))$, $log(P(w_2| ham))$, ..., $log(P(w_n| ham))$

and add them all up


3- Calculate what fraction of the labels is ham -> $P(ham)$ -> take the log -> $log(P(ham))$

4- Add the values from steps (2) and (3)

5- Repeat steps (2) - (4), now assuming the new sentence is spam

6- Compare the two values; the class with the greater value is the prediction
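
The six steps can be sketched in a few lines of Python. The mini dataset below is hypothetical (it is not the one in the image), the helpers `class_score` and `classify` are illustrative names, and add-one (Laplace) smoothing is an addition to the steps: without it, a word never seen in one class would give log(0).

```python
import math
from collections import Counter

# Hypothetical mini dataset of (label, sentence) pairs
train = [
    ("ham", "are you coming to lunch"),
    ("ham", "see you at the meeting"),
    ("spam", "win a free prize now"),
    ("spam", "free entry win cash now"),
]

# Step 1: one word-count dictionary per class
word_counts = {"ham": Counter(), "spam": Counter()}
label_counts = Counter()
for label, sentence in train:
    word_counts[label].update(sentence.lower().split())
    label_counts[label] += 1

vocab = set(word_counts["ham"]) | set(word_counts["spam"])

def class_score(sentence, label):
    """Steps 2-4: sum of log P(w_i | label) plus log P(label)."""
    total_words = sum(word_counts[label].values())
    # Step 3: log of the fraction of training labels equal to `label`
    score = math.log(label_counts[label] / sum(label_counts.values()))
    for word in sentence.lower().split():
        # Step 2, with add-one smoothing (an assumption, see the note above)
        p = (word_counts[label][word] + 1) / (total_words + len(vocab))
        score += math.log(p)
    return score

def classify(sentence):
    """Steps 5-6: score both classes and keep the one with the larger score."""
    return max(("ham", "spam"), key=lambda label: class_score(sentence, label))

print(classify("win a free prize"))   # spam
print(classify("are you at lunch"))   # ham
```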
  

## Activity: Apply Naive Bayes to the spam/ham email dataset:

Please read this article: https://pythonmachinelearning.pro/text-classification-tutorial-with-naive-bayes/

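```
import pandas as pd

# Load the dataset and drop the three empty trailing columns
data = pd.read_csv('spam.csv',encoding='latin-1')
data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
# Rename the columns: v1 holds the label, v2 holds the message text
data = data.rename(columns={"v1":'label', "v2":'text'})
print(data.head())
tags = data["label"]
texts = data["text"]
```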

## Find the core parts of Naive Bayes in the following class

In [39]:
import os
import re
import string
import math
import pandas as pd

class SpamDetector(object):
    """Implementation of Naive Bayes for binary classification"""
    def clean(self, s):
        translator = str.maketrans("", "", string.punctuation)
        return s.translate(translator)

    def tokenize(self, text):
        text = self.clean(text).lower()
        return re.split(r"\W+", text)

    def get_word_counts(self, words):
        word_counts = {}
        for word in words:
            word_counts[word] = word_counts.get(word, 0.0) + 1.0
        return word_counts

    def fit(self, X, Y):
        """Fit our classifier
        Arguments:
            X {list} -- list of document contents
            Y {list} -- correct labels ('spam' or 'ham')
        """
        self.num_messages = {}
        self.log_class_priors = {}
        self.word_counts = {}
        self.vocab = set()

        n = len(X)
        self.num_messages['spam'] = sum(1 for label in Y if label == 'spam')
        self.num_messages['ham'] = sum(1 for label in Y if label == 'ham')
        self.log_class_priors['spam'] = math.log(self.num_messages['spam'] / n )
        self.log_class_priors['ham'] = math.log(self.num_messages['ham'] / n )
        self.word_counts['spam'] = {}
        self.word_counts['ham'] = {}

        for x, y in zip(X, Y):
            c = 'spam' if y == 'spam' else 'ham'
            counts = self.get_word_counts(self.tokenize(x))
            for word, count in counts.items():
                if word not in self.vocab:
                    self.vocab.add(word)
                if word not in self.word_counts[c]:
                    self.word_counts[c][word] = 0.0

                self.word_counts[c][word] += count

    def predict(self, X):
        result = []
        for x in X:
            counts = self.get_word_counts(self.tokenize(x))
            spam_score = 0
            ham_score = 0
            for word, _ in counts.items():
                if word not in self.vocab: continue
                
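                # add Laplace (add-one) smoothing so that a word unseen in a class does not give log(0)
                # NOTE: textbook multinomial NB divides by the total word count of the class
                # plus len(vocab); this code uses the message count of the class instead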
                log_w_given_spam = math.log( (self.word_counts['spam'].get(word, 0.0) + 1) / (self.num_messages['spam'] + len(self.vocab)) )
                log_w_given_ham = math.log( (self.word_counts['ham'].get(word, 0.0) + 1) / (self.num_messages['ham'] + len(self.vocab)) )

                spam_score += log_w_given_spam
                ham_score += log_w_given_ham

            spam_score += self.log_class_priors['spam']
            ham_score += self.log_class_priors['ham']

            if spam_score > ham_score:
                result.append('spam')
            else:
                result.append('ham')
        return result
        

if __name__ == '__main__':
    from sklearn.model_selection import train_test_split
    data = pd.read_csv('spam.csv',encoding='latin-1')
    data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
    data = data.rename(columns={"v1":'label', "v2":'text'})
    print(data.head())
    tags = data["label"]
    texts = data["text"]
    X, y = texts, tags
    print(len(X))
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    MNB = SpamDetector()
    MNB.fit(X_train.values, y_train.values)
    print(MNB.num_messages)
#     print(MNB.word_counts)
    pred = MNB.predict(X_test.values)
    true = y_test.values
    accuracy = sum(1 for i in range(len(pred)) if pred[i] == true[i]) / float(len(pred))
    print("{0:.4f}".format(accuracy))

  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
5572
{'spam': 567, 'ham': 3612}
0.9641
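
The smoothing step in `predict` can also be looked at in isolation. The sketch below uses the textbook multinomial form, which divides by the total number of word occurrences in the class plus the vocabulary size; note that this denominator differs from the one in the class above, which uses the number of messages. The counts and the `smoothed_prob` helper are hypothetical.

```python
import math

# Hypothetical word counts for the spam class and a five-word vocabulary
spam_counts = {"free": 3, "win": 2, "cash": 1}
vocab = ["free", "win", "cash", "meeting", "lunch"]

def smoothed_prob(word, counts, vocab, alpha=1.0):
    """P(word | class) with add-alpha smoothing (alpha=1 is Laplace smoothing)."""
    total = sum(counts.values())
    return (counts.get(word, 0) + alpha) / (total + alpha * len(vocab))

probs = {word: smoothed_prob(word, spam_counts, vocab) for word in vocab}

# A word never seen in spam still gets a small non-zero probability,
# so its log is finite
print(probs["meeting"], math.log(probs["meeting"]))

# With this denominator the probabilities over the vocabulary sum to 1
print(sum(probs.values()))
```

Setting `alpha=0` would give "meeting" a probability of zero, and a single such word would rule out the spam class no matter what else the message contains.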


## Activity: Apply sklearn CountVectorizer and MultinomialNB to the spam email dataset

Read this blog first: https://www.ritchieng.com/machine-learning-multinomial-naive-bayes-vectorization/

Steps:

1- Split the dataset (`train_test_split` lives in `sklearn.model_selection`; the older `sklearn.cross_validation` module was removed from scikit-learn)

`from sklearn.model_selection import train_test_split`

`X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)`

2- Create the vectorizer: `vect = CountVectorizer()`

3- Transform the training data into a document-term matrix (BoW): `X_train_dtm = vect.fit_transform(X_train)`

4- Build and evaluate the model


In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# instantiate a Multinomial Naive Bayes model
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
X_test_dtm = vect.transform(X_test)
y_pred_class = nb.predict(X_test_dtm)
metrics.accuracy_score(y_test, y_pred_class)

## Solution:

In [11]:
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

data = pd.read_csv('spam.csv',encoding='latin-1')
data = data.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
data = data.rename(columns={"v1":'label', "v2":'text'})
print(data.head())
tags = data["label"]
texts = data["text"]

X, y = texts, tags

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)


vectorizer = CountVectorizer()
X_train_dtm = vectorizer.fit_transform(X_train)
print(X_train_dtm)

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
X_test_dtm = vectorizer.transform(X_test)
y_pred_class = nb.predict(X_test_dtm)
metrics.accuracy_score(y_test, y_pred_class)

  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
  (0, 4527)	1
  (0, 3636)	1
  (0, 3647)	1
  (0, 3493)	1
  (0, 5683)	1
  (0, 3263)	1
  (0, 4469)	1
  (0, 2236)	1
  (0, 4798)	1
  (0, 1488)	1
  (0, 4854)	1
  (0, 3451)	1
  (0, 2218)	1
  (0, 6326)	2
  (0, 6604)	1
  (0, 1535)	2
  (0, 4176)	2
  (0, 7373)	1
  (0, 5065)	2
  (0, 2112)	1
  (0, 6727)	1
  (0, 5712)	1
  (0, 819)	1
  (0, 802)	1
  (0, 919)	1
  :	:
  (4176, 4894)	1
  (4176, 4833)	1
  (4176, 3439)	1
  (4176, 1590)	1
  (4176, 4219)	1
  (4176, 7163)	1
  (4176, 4450)	1
  (4176, 6638)	1
  (4176, 2304)	1
  (4176, 3416)	1
  (4176, 3252)	1
  (4176, 4747)	1
  (4177, 6953)	1
  (4177, 5232)	1
  (4177, 1848)	1
  (4177, 4766)	1
  (4177, 3162)	1
  (4

0.9856424982053122

## Summary:

- We learned how TF-IDF assigns a score to each word by considering its frequency in one document and across the other documents

- Naive Bayes is one of the simplest text classification algorithms

- Although it is simple, it works well for many applications

- To improve text classification performance, use more advanced models that treat a sentence as a sequence of words (which word comes before X and which word comes after X)

# Resources:

- https://www.ritchieng.com/machine-learning-multinomial-naive-bayes-vectorization/
