In [1]:
import numpy as np
import pandas as pd
import csv
from math import log
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, LancasterStemmer 

#nltk.download()

# Artificial Intelligence Computer Assignment 3
### ARASH HATEFI / Student No: 810098016

# Aim of Project

In this project, we apply the Naïve Bayes classifier for categorizing three types of emails based on their contents. Each email would be classified as related to one of the "Travel", "Business", or "Style & Beauty" categories. In the beginning, we try to separate emails from the first two mentioned categories using Naïve Bayes, and finally, we will apply the method for classifying the three categories.

The methodology and code are explained throughout this report, and the effects of several related parameters and techniques are discussed.

# Fast Access to CA's Questions

<a href="#Q1">Question 1: Stemming vs. Lemmatization</a><br />
<a href="#Q2">Question 2: Using TF-IDF with the Naïve Bayes classifier</a><br />
<a href="#Q3">Question 3: The reason why precision is not enough for evaluating a classifier</a><br />
<a href="#Q4">Question 4: The impact of a single word which appeared only once in one of the training classes on</a><br />

# 1. Naïve Bayes Classifiers

**Naïve Bayes classifiers** are simple probabilistic classifiers based on **applying Bayes' theorem with strong independence assumptions** between the features. They are among the simplest Bayesian network models.


Given a problem instance to be classified, represented by a vector $x = (x_{1},\ldots ,x_{n})$ of $n$ features (independent variables), Naïve Bayes assigns to this instance the probabilities

$$P ( C_k ∣ x_1,\ldots, x_n )$$
    
for each of the $K$ possible outcomes or classes $C_k$. The instance is then assigned to the class $C_k$ whose index $k$ maximizes the above conditional probability.

$$Class = \underset{C_k}{\operatorname{argmax}}P( C_k ∣ x_1,\ldots, x_n )\qquad(1)$$

To compute the above expressions, the classifier applies Bayes' theorem to each of the probabilities.

Bayes' theorem describes the probability of an event based on prior knowledge of conditions that might be related to the event. It is stated as follows:

$$P(C_k∣x) = \frac {P(C_k)}{P(x)}P(x ∣ C_k)$$

Using Bayes' rule and then the chain rule of probability, the conditional probabilities can be expanded:

$P( C_k \mid x_1,\ldots, x_n ) = \frac{ P(C_k)}{P(x_1,\ldots, x_n )}\ P(x_1,\ldots, x_n \mid C_k)$

$\qquad= \frac{ P(C_k)}{P(x_1,\ldots, x_n )}\ P(x_1 \mid x_2,\ldots, x_n , C_k)\ P(x_2,\ldots, x_n \mid C_k)$

$\qquad= \frac{ P(C_k)}{P(x_1,\ldots, x_n )}\ P(x_1 \mid x_2,\ldots, x_n , C_k)\ P(x_2 \mid x_3,\ldots, x_n , C_k)\ P(x_3,\ldots, x_n \mid C_k)$

$\qquad=\ldots= \frac{ P(C_k)}{P(x_1,\ldots, x_n )}\ P(x_1 \mid x_2,\ldots, x_n , C_k)\ P(x_2 \mid x_3,\ldots, x_n , C_k)\ldots P(x_n \mid C_k)\qquad(2)$



Now the "naive" conditional independence assumptions come into play: assume that all features in $x$ are mutually independent, conditional on the category $C_k$. Under this assumption,

$$ P ( x_i ∣ x_{i + 1} , \ldots , x_n , C_k ) = P ( x_i ∣ C_k ),\qquad i=1,2,\ldots,n\qquad(3)$$

Thus, equation (2) can be expressed as

$$P( C_k ∣ x_1,\ldots, x_n ) = \frac{ P(C_k)}{P(x_1,\ldots, x_n )} \prod_{i=1}^{n}{P( x_i ∣ C_k )}\qquad(4)$$

Using equation (4), we can rewrite equation (1) as follows:

$$Class = \underset{C_k}{\operatorname{argmax}}\frac{ P(C_k)}{P(x_1,\ldots, x_n )} \prod_{i=1}^{n}{P( x_i ∣ C_k )}$$


The term $P(x_1,\ldots, x_n )$ appears for every value of $k$ in the above expression, so it does not affect which class maximizes it. By removing it, we get:

$$Class = \underset{C_k}{\operatorname{argmax}}P(C_k) \prod_{i=1}^{n}{P( x_i ∣ C_k )}\qquad(5)$$
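As a minimal sketch of equation (5), the snippet below scores a short instance under two classes and picks the larger score. The priors, the per-word likelihoods, and the helper name `nb_score` are invented for illustration only; they are not taken from the dataset or from the implementation later in this report.

```python
# Toy illustration of equation (5); every probability here is made up.
priors = {"TRAVEL": 0.5, "BUSINESS": 0.5}
likelihoods = {
    "TRAVEL":   {"flight": 0.20, "profit": 0.01},
    "BUSINESS": {"flight": 0.02, "profit": 0.15},
}

def nb_score(words, class_name):
    """Unnormalized posterior: P(C_k) times the product of P(x_i | C_k)."""
    score = priors[class_name]
    for word in words:
        score *= likelihoods[class_name][word]
    return score

instance = ["flight", "flight", "profit"]
scores = {class_name: nb_score(instance, class_name) for class_name in priors}
predicted = max(scores, key=scores.get)  # the argmax over classes
print(scores, predicted)
```

The denominator $P(x_1,\ldots,x_n)$ never appears in the code, since it is the same for every class.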


# 2. Bag of Words Model

The bag of words model is a very common feature extraction procedure for sentences and documents. In this approach, we look at the histogram of the words within the text, i.e. considering each word count as a feature. It is called a "bag" of words as any information about the order or structure of words in the document is discarded and the model is only concerned with whether known words occur in the document, not where in the document.
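The idea can be sketched with the standard library alone. The function name `bag_of_words` and the two sample sentences are hypothetical; splitting on whitespace is a simplification of the tokenizer used later in this report.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to its word counts, discarding word order."""
    return Counter(text.lower().split())

a = bag_of_words("the hotel near the beach")
b = bag_of_words("near the beach the hotel")

print(a)       # word -> count histogram
print(a == b)  # same words in a different order give the same bag
```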

# 3. The Dataset

The dataset consists of about 25500 emails and their related information, including "**authors**", "**date**", "**headline**", "**link**", and "**short description**". Approximately 23000 of the emails are labeled with one of the "**Travel**", "**Business**", or "**Style and Beauty**" categories, while the rest are unlabeled. The labeled and unlabeled emails are stored in two separate CSV files.

In [2]:
LABELED_DATA_PATH = "./data.csv"
UNLABELED_DATA_PATH = "./test.csv"

raw_labeled_data = pd.read_csv(LABELED_DATA_PATH)
raw_unlabeled_data = pd.read_csv(UNLABELED_DATA_PATH)

The first 5 rows of the raw labeled data are as follows:

In [3]:
raw_labeled_data.head(5)

and the first 5 rows of the raw unlabeled data are as follows:

In [4]:
raw_unlabeled_data.head(5)

Unnamed: 0,index,headline,authors,link,short_description,date
0,0,Kate Middleton Has Not One But Two Style Wins ...,Jamie Feldman,https://www.huffingtonpost.com/entry/kate-midd...,"Now, we're not exactly saying Kate is cutting ...",2014-04-10
1,1,Instagram Local Lens Series Features Insider's...,,https://www.huffingtonpost.com/entry/instagram...,Instagram's Local Lens series is the perfect w...,2013-12-30
2,2,Where To Go This Thanksgiving,"Fodor's, ContributorFodors.com",https://www.huffingtonpost.com/entry/where-to-...,,2014-09-22
3,3,Retailers Hiring The Most Employees For The Ho...,,https://www.huffingtonpost.com/entry/retailers...,,2014-10-05
4,4,How To Get Flat Iron Waves In Under 2 Minutes ...,Simone Kitchens,https://www.huffingtonpost.com/entry/flat-iron...,Check out the video on how to get our favorite...,2012-04-25


# 4. Preparing the Dataset

## 4.1. Removing Extra Information

The classification process is based on the emails' descriptions. Therefore, all other information, such as authors, date, and link, is removed from the dataset. In the labeled dataset, only each email's short description and category are kept, while in the unlabeled dataset only the indices and the short descriptions are kept.

Some instances also have missing values in these columns. To avoid problems later on, we remove all such instances from the dataset.

In [5]:
labeled_data = raw_labeled_data[['category', 'short_description']].dropna()
unlabeled_data = raw_unlabeled_data[['index', 'short_description']].dropna()

## 4.2. Processing the Short Descriptions

As was discussed in section 2, the bag of words model deals with the individual words of a text. Therefore, each email's description is first split into its words.

The descriptions contain several sources of irrelevancy, and removing or at least reducing them is important for the performance of the algorithm. In the next two sections, we discuss these sources and ways of removing them from the descriptions.

### 4.2.1. Irrelevancies in Emails' Descriptions

Irrelevancies in the descriptions may make the classification task harder for the algorithm and also increase the computational cost. Three probable forms of irrelevancy are as follows:

1. **Upper Case and Lower Case Characters**: Despite having the same meaning, words written with different mixes of uppercase and lowercase letters differ from each other in shape, and Naïve Bayes treats them as different words. So it is expected that changing all the letters in the descriptions to their lowercase form raises the efficiency of the classifier.


2. **Stop Words**: Stop words are a set of commonly used words in any language. For example, in English, "the", "is", and "and" would easily qualify as stop words. In text processing tasks, these frequent words are usually removed, as they appear in all kinds of texts regardless of category and carry no useful information for classifying texts.


3. **Inflections**: For grammatical reasons, descriptions use different inflections of words. Despite their different appearance, inflections of a root word convey similar meanings. In the email classification problem, the classifier acts based on the similarity of words between a given instance and each class. Consequently, by reducing each word to its root form, the variety of words decreases and the classifier is expected to perform better.

We will discuss the effect of removing each of these irrelevancies on the final performance of the algorithm.
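The first two points can be illustrated with a small sketch. The stop word set below is a tiny hypothetical list (not NLTK's), and `normalize` is an illustrative helper rather than the function used in this report.

```python
STOP_WORDS = {"the", "is", "and", "a"}  # hypothetical mini list, not NLTK's

def normalize(words, lowercase=True, drop_stop_words=True):
    """Optionally lowercase the words and drop stop words."""
    if lowercase:
        words = [word.lower() for word in words]
    if drop_stop_words:
        words = [word for word in words if word not in STOP_WORDS]
    return words

tokens = ["The", "Beach", "is", "the", "best", "beach"]
raw_vocab = set(tokens)               # every surface form is a separate feature
clean_vocab = set(normalize(tokens))  # features left after normalization

print(len(raw_vocab), sorted(clean_vocab))
```

Six distinct surface forms collapse into two features, which is the reduction in word variety the list above describes.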

### 4.2.2. Ways of Removing Irrelevancies from Emails' Descriptions

Here, some methods are introduced for removing the sources of irrelevancy discussed in the previous section.

Dealing with the diversity in the shape of words caused by lowercase and uppercase letters is rather easy: we change all the letters in the descriptions to their lowercase form. For finding the stop words in the descriptions, we need a set of English stop words, which is provided by the Python **Natural Language Toolkit (NLTK) library**. With it, the stop words can easily be found and removed from the descriptions.


For inflections, there are two main removal methods:

1. **Stemming**: Stemming is the process of reducing inflected words to a root form, mapping a group of words to the same stem even if the stem is not a valid word in the language.


2. **Lemmatization**: Lemmatization is the process of replacing words with their lemma (the base or dictionary form of a word) depending on their meaning.

Lemmatization, unlike stemming, reduces the inflected words properly, ensuring that the root word belongs to the language. In lemmatization, the root word is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

We will use both of the mentioned methods for removing inflections and measure the effect of each on the final results of the classifier.

Both methods are provided in the NLTK library.
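<a id="Q1"></a>
To make the contrast concrete, here is a deliberately crude sketch: a suffix-stripping function stands in for a stemmer and a small lookup table stands in for a lemmatizer. Both `SUFFIXES` and `LEMMAS` are hypothetical and far simpler than NLTK's `PorterStemmer` or `WordNetLemmatizer`; the point is only that a stem need not be a real word, while a lemma is a dictionary form.

```python
SUFFIXES = ("ies", "ing", "ed", "s")  # hypothetical rules, much cruder than Porter's
LEMMAS = {"studies": "study", "studying": "study", "better": "good"}  # hypothetical dictionary

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for word in ["studies", "studying", "better"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

Here "studies" is stemmed to the non-word "stud" but lemmatized to "study", and "better" is untouched by the rule-based stemmer while the dictionary maps it to "good".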

### 4.2.3. Removing Irrelevancies from Emails' Descriptions

The function below gets a list of the descriptions and returns the processed texts. Each processing step is marked by a comment, and each stage can be disabled by commenting out its corresponding line.

In [6]:
def process_descriptions(description_list):
    
    description_list = list(description_list)
    
    tokenizer = RegexpTokenizer(r"\w+")
    ps = PorterStemmer()
    ls = LancasterStemmer()
    english_stopwords = stopwords.words("english")
    
    # Converting each description to a list of its words
    description_list = [tokenizer.tokenize(description) for description in description_list]
    
    # Removing the stop words from the lists (currently disabled; uncomment to enable)
    #description_list = [[word for word in description if word not in english_stopwords] for description in description_list]
    
    # Reducing inflections with the Porter stemmer
    description_list = [[ps.stem(word) for word in description] for description in description_list]
    
    # Reducing inflections further with the Lancaster stemmer
    # (this is a second, more aggressive stemmer, not a lemmatizer such as nltk's WordNetLemmatizer)
    description_list = [[ls.stem(word) for word in description] for description in description_list]
    
    # Converting all the characters to their lowercase form
    description_list = [[word.lower() for word in description] for description in description_list]
    
    return description_list


labeled_data['processed_description'] = process_descriptions(labeled_data['short_description'])
unlabeled_data['processed_description'] = process_descriptions(unlabeled_data['short_description'])

The first 5 rows of the processed labeled data are as follows:

In [7]:
labeled_data.head(5)

Unnamed: 0,category,short_description,processed_description
0,TRAVEL,Påskekrim is merely the tip of the proverbial ...,"[påskekrim, is, mer, the, tip, of, the, prover..."
2,STYLE & BEAUTY,"Madonna is slinking her way into footwear now,...","[madonn, is, slink, her, way, into, footwear, ..."
3,TRAVEL,But what if you're a 30-something couple that ...,"[but, what, if, you, re, a, 30, some, coupl, t..."
4,BUSINESS,Obamacare was supposed to make birth control f...,"[obamac, wa, suppo, to, mak, bir, control, fre..."
5,STYLE & BEAUTY,Madonna previously released a Truth or Dare fr...,"[madonn, prevy, relea, a, tru, or, dar, fragr,..."


and the first 5 rows of the processed unlabeled data are as follows:

In [8]:
unlabeled_data.head(5)

Unnamed: 0,index,short_description,processed_description
0,0,"Now, we're not exactly saying Kate is cutting ...","[now, we, re, not, exactl, say, kat, is, cut, ..."
1,1,Instagram's Local Lens series is the perfect w...,"[instagram, s, loc, len, ser, is, the, perfect..."
4,4,Check out the video on how to get our favorite...,"[check, out, the, video, on, how, to, get, our..."
5,5,Want to meet the flesh-and-blood Annie Oakley?...,"[want, to, meet, the, flesh, and, blood, ann, ..."
6,6,The latest line might be coming to us from Cha...,"[the, latest, lin, might, be, com, to, us, fro..."


## 4.3. Balancing the Dataset

Below, the number of emails in the dataset labeled with each category is shown:

In [9]:
print("Number of emails labeled as TRAVEL: {}".format(len(labeled_data[(labeled_data['category'] == 'TRAVEL')])))
print("Number of emails labeled as BUSINESS: {}".format(len(labeled_data[(labeled_data['category'] == 'BUSINESS')])))
print("Number of emails labeled as STYLE & BEAUTY: {}".format(len(labeled_data[(labeled_data['category'] == 'STYLE & BEAUTY')])))

Number of emails labeled as TRAVEL: 8461
Number of emails labeled as BUSINESS: 4568
Number of emails labeled as STYLE & BEAUTY: 8674


It can be seen that the number of emails labeled as 'TRAVEL' is roughly the same as 'STYLE & BEAUTY' but nearly twice the number labeled as 'BUSINESS'. This imbalance may cause problems later, including poor performance of the classifier in detecting 'BUSINESS' emails. To prevent such problems, we use a technique called **random oversampling** to artificially increase the number of instances of the 'BUSINESS' category.

Random oversampling involves randomly duplicating examples from the minority class and adding them to the training dataset. Here, the minority class is 'BUSINESS', so we randomly duplicate 4000 of its samples to bring its population close to that of the two other classes and balance the dataset.

In [10]:
N_SAMPLE=4000
duplicated_samples = labeled_data[(labeled_data['category'] == 'BUSINESS')].sample(N_SAMPLE)
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
labeled_data = pd.concat([labeled_data, duplicated_samples]).reset_index()

Now the dataset is almost balanced.

In [11]:
print("Number of emails labeled as TRAVEL: {}".format(len(labeled_data[(labeled_data['category'] == 'TRAVEL')])))
print("Number of emails labeled as BUSINESS: {}".format(len(labeled_data[(labeled_data['category'] == 'BUSINESS')])))
print("Number of emails labeled as STYLE & BEAUTY: {}".format(len(labeled_data[(labeled_data['category'] == 'STYLE & BEAUTY')])))

Number of emails labeled as TRAVEL: 8461
Number of emails labeled as BUSINESS: 8568
Number of emails labeled as STYLE & BEAUTY: 8674


## 4.4. Splitting the Dataset into Train and Test Sets

In classification tasks, the dataset is usually split into two subsets: training data and test data.

The training set is the data we use to train the model (for the Naïve Bayes classifier, this amounts to computing the probabilities required in equation (5)). The model learns from this data so that it can generalize to other data later on.

The test set is a subset of the data used to provide an unbiased evaluation of a model fit on the training dataset. It measures how well the model generalizes to other instances.

Here, we use 80% of the data as the training set and the remaining 20% as the test set.

In [12]:
TRAIN_DATA_PROPORTION = 0.8

train_set = labeled_data.sample(frac=TRAIN_DATA_PROPORTION).copy()
test_set = labeled_data.copy().drop(train_set.index)

In [13]:
len(train_set)/len(test_set)

3.9996109706282823
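The same 80/20 split can be sketched with the standard library. The helper `split_train_test` and its `seed` argument are hypothetical; fixing a seed makes the split reproducible, which in pandas corresponds to passing `random_state` to `DataFrame.sample`.

```python
import random

def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of items and cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

train, test = split_train_test(range(100))
print(len(train), len(test), len(train) / len(test))
```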

# 5. Model Evaluation

The four criteria below are used for evaluating the Naïve Bayes classifier:

**1- Recall**

For each class, Recall is defined as the ratio of the instances correctly classified as the class to the total number of instances in the class:

$$Recall\ for\ C_k = \frac {Number\ of\ Instances\ Correctly\ Classified\ as\ C_k} {Number\ of\ Instances\ in\ C_k}\qquad(6)$$

**2- Precision**

For each class, Precision is defined as the ratio of the instances correctly classified as the class to the total number of instances classified as the class:

$$Precision\ for\ C_k = \frac {Number\ of\ Instances\ Correctly\ Classified\ as\ C_k} {Number\ of\ Instances\ Classified\ as\ C_k}\qquad(7)$$

<a id="Q3"></a>
According to this formula, instances that belong to class $C_k$ but are misclassified are not counted in Precision. This makes Precision alone insufficient for evaluating a classifier: it may misclassify many instances from $C_k$ but still get a good Precision because most of its predictions of $C_k$ are correct.

**3- Accuracy**

Accuracy of the model is defined as the ratio of correctly classified instances to the total number of instances:

$$Accuracy = \frac {Number\ of\ Instances\ Classified\ Correctly} {Total\ Number\ of\ Instances}\qquad(8)$$

**4- The Confusion Matrix**

In classification tasks, a confusion matrix is a specific table layout that allows visualization of the performance of an algorithm. Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class (or vice versa). With this layout, the j'th element in the i'th row of the matrix is the number of instances that belong to class j and are classified as class i.


# 6. Applying the Naïve Bayes Classifiers to the Bag of Words Model 

Here, we apply the Naïve Bayes classification method to categorize the emails of the dataset based on their short descriptions. We treat each word in a preprocessed email description as a feature and design a classifier for separating emails based on the contents of their descriptions.


In the beginning, we try to separate emails from the first two mentioned categories using Naïve Bayes, and finally, we apply the method for classifying all emails from three categories. In both parts, we then evaluate the obtained models with criteria from section (5).

## 6.1. Computing the Probability for Each Class

To use equation (5) for determining the class to which each email belongs, we must first find the prior probability of each class ($P(C_k)$ for each class $k$) and the probability of each word appearing in a class ($P(x_i|C_k)$ for each feature $i$ and class $k$). 

An estimate of the prior probability of each class is as follows:

$$\hat{P}(C_k) = \frac {Number\ of\ instances\ from\ C_k} {Total\ number\ of\ training\ samples}\qquad(9)$$

In [14]:
travel_category_probability = len(train_set[train_set['category'] == 'TRAVEL'])/len(train_set)
business_category_probability = len(train_set[train_set['category'] == 'BUSINESS'])/len(train_set)
style_category_probability = len(train_set[train_set['category'] == 'STYLE & BEAUTY'])/len(train_set)

## 6.2. Computing the Relative Probability of Words in Classes

### 6.2.1. Using the Empirical Probabilities With Smoothing

One way of estimating the conditional probability of feature $x_i$ appearing in class $C_k$ ($P(x_i|C_k)$) is to use its empirical probability, as below:

$$\hat{P}(x_i | C_k) = \frac {Times\ that\ x_i\ appeared\ in\ instances\ from\ C_k} {Total\ number\ of\ words\ in\ instances\ from\ C_k}\qquad(10)$$

This is the right idea, but there is a small problem: what if a word $x$ from a new description belonging to class $C_k$ never appeared in the $C_k$ instances of the train set? In that case, $P(x|C_k) = 0$, and the entire probability of the email being labeled as $C_k$ goes to zero (see equation (5)). As a result, many new emails outside our train set can easily be misclassified just because a single word from their description is missing from the word list of the true class.

We would like our classifier to be robust to words it has not seen before. To address this problem, we must never let any word's probability be zero, which we achieve by smoothing the probabilities upwards using additive smoothing.

In statistics, additive smoothing, also called Laplace smoothing or Lidstone smoothing, is a technique used to solve the problem of zero probabilities.

Given an observation $x = ( x_1 , x_2 , \ldots , x_d )$ from a multinomial distribution with $N$ trials, a smoothed version of the data gives the estimator:

$$ \hat{P}(x_i) = \frac {x_i + \alpha} {N + \alpha d},\qquad ( i = 1 , \ldots , d )$$

where the pseudo count $\alpha > 0$ is a smoothing parameter and $\alpha = 0$ corresponds to no smoothing. Additive smoothing is a type of shrinkage estimator, as the resulting estimate will be between the empirical probability (relative frequency) $\frac{x_i} {N}$ and the uniform probability $\frac {1}{d}$.

Using additive smoothing with $\alpha=1$, and with the number of classes (3) taking the place of $d$ as in the implementation in section 6.2.3, the conditional probability of feature $x_i$ appearing in class $C_k$ ($P(x_i|C_k)$) would be:

$$P(x_i | C_k) = \frac {Times\ that\ x_i\ appeared\ in\ instances\ from\ C_k +1} {Total\ number\ of\ words\ in\ instances\ from\ C_k + 3}\qquad(11)$$

<a id="Q4"></a>
Applying the above formula for estimating probabilities can prevent the classifier from mislabeling an instance only because one of its words has appeared only in the wrong class. Instead, there is a chance for the instance's other words to affect the prediction.
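A small numeric sketch of the smoothed estimator follows. The function name `smoothed_probability` and the counts are hypothetical; the default `d=3` mirrors the constant added to the denominator in equation (11), and is an assumption of this sketch rather than a general rule.

```python
def smoothed_probability(count, total, alpha=1.0, d=3):
    """Additive smoothing: (count + alpha) / (total + alpha * d)."""
    return (count + alpha) / (total + alpha * d)

# Hypothetical class with 1000 words in total.
p_unseen = smoothed_probability(count=0, total=1000)           # word never seen in the class
p_seen = smoothed_probability(count=10, total=1000)            # word seen 10 times
p_raw = smoothed_probability(count=0, total=1000, alpha=0.0)   # no smoothing

print(p_unseen, p_seen, p_raw)
```

With smoothing, the unseen word keeps a small non-zero probability, so a single unknown word no longer forces the whole product in equation (5) to zero.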

### 6.2.2. Using TF-IDF <a id="Q2"></a>

Another approach for estimating relative probabilities is using the **TF-IDF** measure of each word. **TF-IDF** stands for **term frequency-inverse document frequency** and is a statistical measure used to evaluate how important a word is to a document in a collection. The importance increases proportionally to **the number of times a word appears in the document** but is offset by the **frequency of the word in the collection**.

The goal of using TF-IDF instead of the empirical probabilities with smoothing is to scale down the impact of words that occur very frequently in different classes and that are hence empirically less informative than features that occur in a small fraction of the training set.

To apply TF-IDF in the Naïve Bayes classifier, we compute the term frequency (TF) of each word using equation (11) and then multiply it by the IDF factor below:

$$IDF = \log \frac{Number\ of\ classes} {Number\ of\ classes\ containing\ the\ word}$$


Therefore, the TF-IDF measure of each word in the train set can be calculated using the equation below, where $P(x_i|C_k)$ is the smoothed estimate from equation (11):


$$TF\text{-}IDF(x_i|C_k) = P(x_i | C_k) \cdot \log \frac{3} {Number\ of\ classes\ containing\ x_i}\qquad(12)$$



Finally, we can replace the conditional probability $P(x_i | C_k)$ in equation (5) with $TF\text{-}IDF(x_i|C_k)$ and estimate the category of each email. 
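The IDF factor can be sketched as follows. The dictionary `classes_containing`, the example words, and the helper names are hypothetical; they only illustrate how the weight behaves for words spread across one, two, or all three classes.

```python
from math import log

N_CLASSES = 3

# Hypothetical record of the classes in which each word was observed.
classes_containing = {
    "the": {"TRAVEL", "BUSINESS", "STYLE & BEAUTY"},
    "brand": {"BUSINESS", "STYLE & BEAUTY"},
    "flight": {"TRAVEL"},
}

def idf(word):
    """log(number of classes / number of classes containing the word)."""
    return log(N_CLASSES / len(classes_containing[word]))

def tf_idf(smoothed_tf, word):
    """Scale a smoothed term frequency by the word's IDF factor."""
    return smoothed_tf * idf(word)

for word in classes_containing:
    print(word, idf(word), tf_idf(0.01, word))
```

A word that occurs in all three classes gets an IDF of exactly zero, so its TF-IDF value is zero as well; such words would need special handling before the values are multiplied together or passed to a logarithm.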

### 6.2.3. Implementing Relative  Probabilities in Code

In code, we use equation (11) for calculating relative probabilities. The class below is defined to estimate the log of the smoothed empirical probability of each word in each category.

In [15]:
class log_relative_probabilities:
    
    
    def __init__(self, class_probability, list_of_sentences, n_classes):
        
        self.__log_class_probability = log(class_probability)
        
        n_total_words = sum([len(sentence) for sentence in list_of_sentences])
        self.__log_probability_denominator = log(n_total_words+n_classes)
        
        words_frequency = {word:1 for sentence in list_of_sentences for word in sentence}
            
        for sentence in list_of_sentences:
            for word in sentence:
                words_frequency[word] += 1

        self.__log_words_probabilities = {key:log(words_frequency[key])-self.__log_probability_denominator for key in words_frequency.keys()}

        
    def __getitem__(self, word):
        
        try: 
            return self.__log_words_probabilities[word]
        
        except  KeyError:
            return -self.__log_probability_denominator
       
    
    def get_sentence_probability(self, sentence):
        
        return sum([self[word] for word in sentence]) + self.__log_class_probability
        
        
          
travel = log_relative_probabilities(class_probability=travel_category_probability,
                                    list_of_sentences=train_set.loc[train_set['category'] == 'TRAVEL']['processed_description'],
                                    n_classes=3)

business = log_relative_probabilities(class_probability=business_category_probability,
                                      list_of_sentences=train_set.loc[train_set['category'] == 'BUSINESS']['processed_description'],
                                      n_classes=3)

style = log_relative_probabilities(class_probability=style_category_probability,
                                   list_of_sentences=train_set.loc[train_set['category'] == 'STYLE & BEAUTY']['processed_description'],
                                   n_classes=3)

Having the above values, the class of each of the emails is determined using equation (5).
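The class above works with sums of logarithms rather than products of probabilities. A short sketch shows why this matters numerically; the 200 identical likelihoods are made-up values chosen only to trigger the effect.

```python
from math import log

word_probabilities = [1e-3] * 200  # hypothetical per-word likelihoods

# Direct product, as written in equation (5): underflows in floating point.
product = 1.0
for p in word_probabilities:
    product *= p

# Sum of logs: the same quantity on a log scale, which stays finite.
log_sum = sum(log(p) for p in word_probabilities)

print(product, log_sum)
```

Since the logarithm is monotonically increasing, comparing log-scores across classes selects the same class as comparing the original products.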

## 6.3. Classifying Emails

Below is the code for classifying the test set emails by applying the Naïve Bayes classifier. Initially, we try to separate emails with "Travel" and "Business" labels from each other, and then we use the same method to categorize all emails in the test set.

### 6.3.1. Classifying Emails With "Travel" and "Business" Labels

First, we select the test set instances that are labeled as "Travel" or "Business".

In [16]:
limited_test_set = test_set.loc[(test_set['category'] == 'TRAVEL') | (test_set['category'] == 'BUSINESS')].copy()
# assign returns a new DataFrame, so the result must be stored
limited_test_set = limited_test_set.assign(predicted_category="")

Now we apply the classification method to the limited test set and evaluate the classifier.

In [17]:
for idx in limited_test_set.index:

    sentence = limited_test_set['processed_description'][idx]

    travel_probability = travel.get_sentence_probability(sentence)
    business_probability = business.get_sentence_probability(sentence)
    
    limited_test_set.loc[idx,'predicted_category'] = 'TRAVEL' if travel_probability>business_probability else 'BUSINESS'
    

Finally, we evaluate the classifier with criteria from section 5.

The value of Accuracy is as follows:

In [18]:
print("Accuracy: {}".format(sum(limited_test_set['category']==limited_test_set['predicted_category'])/len(limited_test_set)))

Accuracy: 0.9079718640093787


The Recall and Precision criteria for the different classes are shown in the table below:

In [19]:
phase1_travel_recall = len(limited_test_set.loc[(limited_test_set['category'] == 'TRAVEL') & (limited_test_set['predicted_category'] == 'TRAVEL')])/len(limited_test_set.loc[limited_test_set['category'] == 'TRAVEL'])
phase1_business_recall = len(limited_test_set.loc[(limited_test_set['category'] == 'BUSINESS') & (limited_test_set['predicted_category'] == 'BUSINESS')])/len(limited_test_set.loc[limited_test_set['category'] == 'BUSINESS'])

phase1_travel_precision = len(limited_test_set.loc[(limited_test_set['category'] == 'TRAVEL') & (limited_test_set['predicted_category'] == 'TRAVEL')])/len(limited_test_set.loc[limited_test_set['predicted_category'] == 'TRAVEL'])
phase1_business_precision = len(limited_test_set.loc[(limited_test_set['category'] == 'BUSINESS') & (limited_test_set['predicted_category'] == 'BUSINESS')])/len(limited_test_set.loc[limited_test_set['predicted_category'] == 'BUSINESS'])


phase1_evaluation = pd.DataFrame({'Travel': [phase1_travel_recall, phase1_travel_precision], 
                                  'Business': [phase1_business_recall, phase1_business_precision]},
                                  index=['Recall', 'Precision']) 

phase1_evaluation


Unnamed: 0,Travel,Business
Recall,0.886837,0.928448
Precision,0.923125,0.894386


### 6.3.2. Classifying All Test Set Emails

First, we add an empty column to the test set for saving the classifier's predicted categories.

In [20]:
# assign returns a new DataFrame, so the result must be stored
test_set = test_set.assign(predicted_category="")

Now we apply the classification method to the complete test set and evaluate the classifier.

In [21]:
for idx in test_set.index:
    
    sentence = test_set['processed_description'][idx]

    travel_probability = travel.get_sentence_probability(sentence)
    business_probability = business.get_sentence_probability(sentence)
    style_probability = style.get_sentence_probability(sentence)
    
    if travel_probability>business_probability and travel_probability>style_probability:
        test_set.loc[idx,'predicted_category'] = 'TRAVEL'
    
    elif business_probability>travel_probability and business_probability>style_probability:
        test_set.loc[idx,'predicted_category'] = 'BUSINESS'
    
    else:
        test_set.loc[idx,'predicted_category'] = 'STYLE & BEAUTY'
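The chain of comparisons above is an argmax over the three class scores. As a hedged alternative sketch, the same decision can be written with `max` over a dictionary; the helper `predict` and the example scores are hypothetical.

```python
def predict(log_scores):
    """Return the class name with the highest log-score."""
    return max(log_scores, key=log_scores.get)

# Made-up log-scores for one description.
example_scores = {"TRAVEL": -42.7, "BUSINESS": -39.1, "STYLE & BEAUTY": -45.3}
predicted = predict(example_scores)
print(predicted)
```

One difference in behavior: on an exact tie, `max` returns the first maximal key in dictionary order, whereas the `if`/`elif`/`else` chain falls through to 'STYLE & BEAUTY'.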
    

Finally, we evaluate the classifier with criteria from section 5.

The value of Accuracy is as follows:

In [22]:
print("Accuracy: {}".format(sum(test_set['category']==test_set['predicted_category'])/len(test_set)))

Accuracy: 0.8562536471503599


The Recall and Precision criteria for the different classes are shown in the table below:

In [23]:
phase2_travel_recall = len(test_set.loc[(test_set['category'] == 'TRAVEL') & (test_set['predicted_category'] == 'TRAVEL')])/len(test_set.loc[test_set['category'] == 'TRAVEL'])
phase2_business_recall = len(test_set.loc[(test_set['category'] == 'BUSINESS') & (test_set['predicted_category'] == 'BUSINESS')])/len(test_set.loc[test_set['category'] == 'BUSINESS'])
phase2_style_recall = len(test_set.loc[(test_set['category'] == 'STYLE & BEAUTY') & (test_set['predicted_category'] == 'STYLE & BEAUTY')])/len(test_set.loc[test_set['category'] == 'STYLE & BEAUTY'])

phase2_travel_precision = len(test_set.loc[(test_set['category'] == 'TRAVEL') & (test_set['predicted_category'] == 'TRAVEL')])/len(test_set.loc[test_set['predicted_category'] == 'TRAVEL'])
phase2_business_precision = len(test_set.loc[(test_set['category'] == 'BUSINESS') & (test_set['predicted_category'] == 'BUSINESS')])/len(test_set.loc[test_set['predicted_category'] == 'BUSINESS'])
phase2_style_precision = len(test_set.loc[(test_set['category'] == 'STYLE & BEAUTY') & (test_set['predicted_category'] == 'STYLE & BEAUTY')])/len(test_set.loc[test_set['predicted_category'] == 'STYLE & BEAUTY'])


phase2_evaluation = pd.DataFrame({'Travel': [phase2_travel_recall, phase2_travel_precision], 
                                  'Business': [phase2_business_recall, phase2_business_precision],
                                  'Style & Beauty': [phase2_style_recall, phase2_style_precision]},
                                  index=['Recall', 'Precision']) 


phase2_evaluation

Unnamed: 0,Travel,Business,Style & Beauty
Recall,0.83383,0.896134,0.838057
Precision,0.847971,0.841734,0.880851


Also, the confusion matrix is as follows:

In [24]:
confusion_matrix = pd.DataFrame({'Actual Travel': [len(test_set.loc[(test_set['category'] == 'TRAVEL') & (test_set['predicted_category'] == 'TRAVEL')]),
                                                   len(test_set.loc[(test_set['category'] == 'TRAVEL') & (test_set['predicted_category'] == 'BUSINESS')]),
                                                   len(test_set.loc[(test_set['category'] == 'TRAVEL') & (test_set['predicted_category'] == 'STYLE & BEAUTY')])], 
                                 'Actual Business': [len(test_set.loc[(test_set['category'] == 'BUSINESS') & (test_set['predicted_category'] == 'TRAVEL')]),
                                                     len(test_set.loc[(test_set['category'] == 'BUSINESS') & (test_set['predicted_category'] == 'BUSINESS')]),
                                                     len(test_set.loc[(test_set['category'] == 'BUSINESS') & (test_set['predicted_category'] == 'STYLE & BEAUTY')])],
                                 'Actual Style & Beauty': [len(test_set.loc[(test_set['category'] == 'STYLE & BEAUTY') & (test_set['predicted_category'] == 'TRAVEL')]),
                                                           len(test_set.loc[(test_set['category'] == 'STYLE & BEAUTY') & (test_set['predicted_category'] == 'BUSINESS')]),
                                                           len(test_set.loc[(test_set['category'] == 'STYLE & BEAUTY') & (test_set['predicted_category'] == 'STYLE & BEAUTY')])]},
                                index=['Predicted Travel', 'Predicted Business', 'Predicted Style & Beauty'])
                                 
confusion_matrix

| | Actual Travel | Actual Business | Actual Style & Beauty |
|---|---|---|---|
| Predicted Travel | 1400 | 98 | 153 |
| Predicted Business | 165 | 1553 | 127 |
| Predicted Style & Beauty | 114 | 82 | 1449 |
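
As a cross-check, the recall and precision values reported earlier can be recovered directly from the confusion matrix counts. The following is a minimal sketch in plain Python; the helper name `per_class_metrics` and the nested-list layout are illustrative choices, not part of the notebook's code.

```python
# Confusion matrix counts copied from the table above.
# Rows are predicted classes, columns are actual classes.
labels = ['Travel', 'Business', 'Style & Beauty']
counts = [
    [1400, 98, 153],    # predicted Travel
    [165, 1553, 127],   # predicted Business
    [114, 82, 1449],    # predicted Style & Beauty
]


def per_class_metrics(counts, labels):
    """Return recall and precision for every class (hypothetical helper)."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = counts[i][i]
        predicted_total = sum(counts[i])               # row sum
        actual_total = sum(row[i] for row in counts)   # column sum
        metrics[label] = {
            'recall': true_positives / actual_total,
            'precision': true_positives / predicted_total,
        }
    return metrics


metrics = per_class_metrics(counts, labels)
for label, values in metrics.items():
    print(label, round(values['recall'], 6), round(values['precision'], 6))
```

Recall divides each diagonal entry by its column sum (all emails that truly belong to the class), while precision divides it by its row sum (all emails predicted as that class).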


## 6.5. The Effect of Preprocessing Stages on the Final Performance of the Classifier

### 6.5.1. Effect of Changing All the Letters to Their Lowercase Form

Converting all letters in the descriptions to lowercase improves the performance of the classifier. Words that differ only in capitalization differ from each other in form, not in meaning. Nevertheless, Naïve Bayes distinguishes between these forms and treats each of them as a separate word with its own counts. Lowercasing the descriptions merges these variants into a single token, which pools their counts and consequently enhances the performance of the algorithm.
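
The effect can be illustrated with a small sketch. The token list below is made up for illustration and does not come from the dataset.

```python
from collections import Counter

# Hypothetical tokens that differ only in capitalization.
tokens = ['Travel', 'travel', 'TRAVEL', 'Hotel', 'hotel']

# Without lowercasing, every spelling variant gets its own count.
raw_counts = Counter(tokens)

# After lowercasing, the variants are merged and their counts are pooled.
lower_counts = Counter(token.lower() for token in tokens)

print(len(raw_counts), 'distinct tokens before lowercasing')
print(len(lower_counts), 'distinct tokens after lowercasing')
print(lower_counts)
```

With pooled counts, each word's conditional probability is estimated from more observations, and fewer test words appear to be unseen in a class.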

### 6.5.2. Stemming Vs. Lemmatization  <a id="Q1"></a>

Both stemming and lemmatization improve the performance of the classification task dramatically, and the final evaluation results were almost the same with either of them (the difference was less than 3 percent for every evaluation metric). Below, we discuss some of the basic differences between these two methods.

Both stemming and lemmatization try to bring inflected words to the same form. Stemming uses a rule-based, algorithmic approach to removing prefixes and suffixes, so the result might not be an actual dictionary word. Lemmatization, on the other hand, looks words up in a vocabulary, so the result is normally a dictionary word.

This principal difference makes stemmers much faster than lemmatizers but also less accurate, as they rely only on rules for reducing inflections and not on a valid dictionary. Another notable property of lemmatization is that it can group a wider range of word forms with the same meaning together, and so it can usually reduce the diversity of words more than stemming. (For more on stemming and lemmatization, please see section 4.4.2.)
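
The contrast can be sketched with two toy functions. Both are deliberately simplified stand-ins written for this illustration: `toy_stem` is a crude suffix stripper (not the Porter or Lancaster algorithm), and `toy_lemmatize` uses a tiny hand-written lookup table (not a real lexical database).

```python
# Suffixes tried in order by the toy stemmer (an assumption for illustration).
SUFFIXES = ('ing', 'ies', 'ed', 'es', 's')

# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
LEMMA_TABLE = {
    'studies': 'study',
    'travelling': 'travel',
    'better': 'good',
    'geese': 'goose',
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word, word)


for word in ['studies', 'travelling', 'better', 'geese']:
    print(word, '->', toy_stem(word), '|', toy_lemmatize(word))
```

The rule-based function turns "studies" into "stud", which is not the intended dictionary form, and leaves irregular forms such as "geese" untouched. The lookup-based function maps both to their lemmas, at the cost of needing a dictionary.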
  

## 6.6. Classifying Unlabeled Emails

Here, we classify the unlabeled emails using the preprocessing operations that achieved the highest performance in the previous section. The preprocessing steps for preparing the train, test, and unlabeled datasets are those implemented in the `process_descriptions` function in section 4.2.3, and are as follows:
1. Changing all letters in the descriptions to their lowercase form
2. Removing the stopwords
3. Applying lemmatization

As in the previous sections, we add an empty column to the unlabeled dataset for storing the classifier's predicted categories.


In [25]:
# assign() returns a new DataFrame, so the result must be stored back
unlabeled_data = unlabeled_data.assign(category="")

Now we apply the classifier to the unlabeled emails. Since these emails have no true labels, the predictions cannot be evaluated here.

In [26]:
for idx in unlabeled_data.index:

    sentence = unlabeled_data['processed_description'][idx]

    # Score the description under each of the three class models
    travel_probability = travel.get_sentence_probability(sentence)
    business_probability = business.get_sentence_probability(sentence)
    style_probability = style.get_sentence_probability(sentence)

    # Assign the class whose score is strictly the highest
    if travel_probability > business_probability and travel_probability > style_probability:
        unlabeled_data.loc[idx, 'category'] = 'TRAVEL'

    elif business_probability > travel_probability and business_probability > style_probability:
        unlabeled_data.loc[idx, 'category'] = 'BUSINESS'

    # Every remaining case, including ties, falls through to 'STYLE & BEAUTY'
    else:
        unlabeled_data.loc[idx, 'category'] = 'STYLE & BEAUTY'

unlabeled_data

Unnamed: 0,index,short_description,processed_description,category
0,0,"Now, we're not exactly saying Kate is cutting ...","[now, we, re, not, exactl, say, kat, is, cut, ...",STYLE & BEAUTY
1,1,Instagram's Local Lens series is the perfect w...,"[instagram, s, loc, len, ser, is, the, perfect...",TRAVEL
4,4,Check out the video on how to get our favorite...,"[check, out, the, video, on, how, to, get, our...",STYLE & BEAUTY
5,5,Want to meet the flesh-and-blood Annie Oakley?...,"[want, to, meet, the, flesh, and, blood, ann, ...",TRAVEL
6,6,The latest line might be coming to us from Cha...,"[the, latest, lin, might, be, com, to, us, fro...",STYLE & BEAUTY
...,...,...,...,...
2543,2543,Ironically the new taxes will have relatively ...,"[iron, the, new, tax, wil, hav, rel, littl, ef...",BUSINESS
2544,2544,Pack with purpose.,"[pack, with, purpo]",BUSINESS
2545,2545,"Despite some downtrodden Santas, there are 700...","[despit, som, downtrod, sant, ther, ar, 700, 0...",BUSINESS
2546,2546,"In the end, it doesn’t matter if your food is ...","[in, the, end, it, doesn, t, mat, if, yo, food...",TRAVEL
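
The pairwise comparisons in the loop above amount to an argmax over the class scores, which extends to any number of classes. The sketch below shows this with a hypothetical `ToyClassModel` that stands in for the notebook's class objects; its word probabilities, the `unseen` fallback value, and the assumption that the score is a log probability are all made up for illustration.

```python
from math import log


class ToyClassModel:
    """Hypothetical stand-in for the per-class models used in this report."""

    def __init__(self, prior, word_probs, unseen=1e-6):
        self.prior = prior
        self.word_probs = word_probs
        self.unseen = unseen  # assumed probability for words not seen in the class

    def get_sentence_probability(self, sentence):
        # Log prior plus the sum of the per-word log likelihoods.
        return log(self.prior) + sum(
            log(self.word_probs.get(word, self.unseen)) for word in sentence
        )


models = {
    'TRAVEL': ToyClassModel(1 / 3, {'flight': 0.05, 'hotel': 0.04}),
    'BUSINESS': ToyClassModel(1 / 3, {'tax': 0.05, 'market': 0.04}),
    'STYLE & BEAUTY': ToyClassModel(1 / 3, {'dress': 0.05, 'hair': 0.04}),
}


def classify(sentence, models):
    """Return the label whose model gives the sentence the highest score."""
    scores = {
        label: model.get_sentence_probability(sentence)
        for label, model in models.items()
    }
    return max(scores, key=scores.get)


print(classify(['cheap', 'flight', 'hotel'], models))
```

One difference in behavior is worth noting: on an exact tie, `max` returns the first label in the dictionary's insertion order, whereas the loop above falls back to 'STYLE & BEAUTY'.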


Finally, we export the results to a CSV file.

In [27]:
unlabeled_data[['index', 'category']].to_csv('output.csv', index=False, header=True)