# AI - CA3 - Text Classification using Naïve Bayes

## Introduction

In this assignment, the goal is to train a Naive Bayes classifier that assigns news pieces to categories based on their short descriptions.

## The Classifier Model

The classifier is based on Bayes' theorem (the model is explained thoroughly below) and uses the following formula:

$$ P(c_i | X) = \frac{P(X | c_i)P(c_i)}{P(X)}$$

### The model definitions

#### Posterior Probability
The posterior probability is the probability that a news piece belongs to category $c_i$, given that it includes the words $X$.

#### Likelihood
The probability of the word $x_i$ appearing in a news piece of category $c_i$. How this value is calculated is explained later in the report (in the _Training the model_ part specifically).

#### Class Prior Probability
The probability of a news piece being in category $c_i$. This is calculated by dividing the number of news pieces in the $c_i$ category by the total number of news pieces.
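
As a small illustration, the class priors can be computed directly from per-category counts. The counts below are the ones reported later in this report for rows with a non-empty short description; the variable names are illustrative only.

```python
# Per-category counts of news pieces with a non-empty short_description
# (taken from the category summary shown later in this report).
category_counts = {'BUSINESS': 4568, 'STYLE & BEAUTY': 8674, 'TRAVEL': 8461}

total = sum(category_counts.values())

# P(c_i) = (number of news pieces in c_i) / (total number of news pieces)
priors = {category: count / total for category, count in category_counts.items()}

for category, prior in priors.items():
    print(f'{category}: {prior:.3f}')
```

BUSINESS ends up with the smallest prior, which reflects the class imbalance discussed later.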

#### Predictor prior probability
This is the probability of observing the words $x_1, x_2, \dots, x_n$, regardless of category. For a given news piece this value is the same for every category, so it has no effect on which category scores highest; it is omitted from the calculations for the sake of simplicity.

## How the Classifier Works

### Cleaning the given data
A set of training data is given to the model. In this assignment we want to use the words of the _short_description_ text as features, so every other column of the dataset is omitted from the explanations as well as from the code.

First, the short description texts are cleaned. Cleaning consists of removing stop words and punctuation marks, converting uppercase characters to lowercase (because, for classification purposes, _Program_ and _program_ are the same word and shouldn't be counted separately), and finally replacing each word by its root using either **lemmatization** or **stemming** (both are explained later in the report). Words are reduced to their roots because _program_, _programmed_ and _programming_ carry the same meaning for classification; after reduction, their occurrences are summed into the repetition count of _program_, which raises its probability and makes its effect more pronounced.
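
The cleaning steps can be sketched in plain Python as follows. This is only a toy illustration: the stop-word list and the `crude_stem` helper are hypothetical stand-ins for the NLTK stop-word list and the real stemmer or lemmatizer used in the notebook.

```python
import string

# Tiny illustrative stop-word list (assumption; the notebook uses NLTK's English list).
STOP_WORDS = {'the', 'a', 'an', 'is', 'are', 'was', 'and', 'of', 'to', 'in'}


def crude_stem(word):
    """Strip a few common suffixes; a hypothetical stand-in for a real stemmer."""
    for suffix in ('ming', 'med', 'ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def clean(text):
    # Lowercase first so that 'Program' and 'program' are treated alike.
    text = text.lower()
    # Replace every punctuation mark with a space, then split on whitespace.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    tokens = [token for token in text.translate(table).split()
              if token not in STOP_WORDS]
    return [crude_stem(token) for token in tokens]


print(clean('The Program was programmed; programming is fun!'))
```

All three forms of _program_ collapse to the same token, so their counts add up.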

### Preparing the data
Afterwards, the data are grouped by category and split into two parts: the training data and the test data. In this report, 80% of the data is used for training and the remaining 20% for testing, but these proportions can easily be changed.
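
A minimal sketch of this per-category split, using plain lists instead of DataFrames. The function name and the toy data are hypothetical. Like the notebook code, it takes the first part of each category as training data without shuffling.

```python
def split_by_category(rows_by_category, training_percentage=80):
    """Split each category's rows into a training part and a test part."""
    train, test = {}, {}
    for category, rows in rows_by_category.items():
        # Integer division keeps the partition point a valid index.
        partition_point = len(rows) * training_percentage // 100
        train[category] = rows[:partition_point]
        test[category] = rows[partition_point:]
    return train, test


# Hypothetical toy data: 10 BUSINESS rows and 5 TRAVEL rows.
toy = {'BUSINESS': list(range(10)), 'TRAVEL': list(range(5))}
train, test = split_by_category(toy)
print({category: len(rows) for category, rows in train.items()})
print({category: len(rows) for category, rows in test.items()})
```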

### Training the model
In the next step, the classifier calculates $P(x_i | c_i)$ ($x$ being a word, and $c$ a category) for each word and each category. $P(x_i | c_i)$ is calculated using this formula: $\frac{R_{x_i,c_i}}{W_{c_i}}$ where $R_{x_i,c_i}$ is the number of repetitions of $x_i$ in $c_i$ category news and $W_{c_i}$ is the total word count of the news pieces in category $c_i$.
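
As a hedged sketch of this counting step, the snippet below computes $\frac{R_{x_i,c_i}}{W_{c_i}}$ on a few made-up, already cleaned documents. The data and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical cleaned training documents, grouped by category.
training_docs = {
    'TRAVEL': [['beach', 'hotel', 'flight'], ['hotel', 'tour']],
    'BUSINESS': [['market', 'bank'], ['bank', 'stock', 'market', 'bank']],
}

likelihoods = {}
for category, docs in training_docs.items():
    # R_{x_i, c_i}: repetitions of each word in this category
    word_counts = Counter(word for doc in docs for word in doc)
    # W_{c_i}: total word count of this category
    total_words = sum(word_counts.values())
    likelihoods[category] = {word: count / total_words
                             for word, count in word_counts.items()}

print(likelihoods['TRAVEL']['hotel'])   # 2 of the 5 TRAVEL words
print(likelihoods['BUSINESS']['bank'])  # 3 of the 6 BUSINESS words
```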

### Predicting the categories
Finally, the classifier uses the previously calculated $P(x_i | c_i)$ values to calculate $P(c_i | X)\ (X = x_1,x_2,...,x_n)$.
To calculate this, we use Bayes' theorem: 

$$ P(c_i | X) = \frac{P(X | c_i)P(c_i)}{P(X)} = \frac{P(x_1,x_2,...,x_n | c_i)P(c_i)}{P(X)}$$

As we are using Naive Bayes in this assignment, the features are assumed to be conditionally independent of each other given the category. The above formula therefore becomes:

$$\frac{P(x_1,x_2,...,x_n | c_i)P(c_i)}{P(X)} = \frac{P(x_1 | c_i)P(x_2 | c_i)P(x_3 | c_i)...P(x_n | c_i)P(c_i)}{P(X)}$$

Since every category's score is divided by the same $P(X)$, which is constant for a given news piece, it can be omitted to simplify the calculations:

$${P(x_1,x_2,...,x_n | c_i)P(c_i)} = {P(x_1 | c_i)P(x_2 | c_i)P(x_3 | c_i)...P(x_n | c_i)P(c_i)}$$

To prevent the product from becoming too small and losing accuracy (floating-point underflow), we take the logarithm:

$$\log_{10}\big(P(x_1,x_2,...,x_n | c_i)P(c_i)\big) = \log_{10}({P(x_1 | c_i)P(x_2 | c_i)P(x_3 | c_i)...P(x_n | c_i)P(c_i)})\\ = \log_{10}P(x_1 | c_i) + \log_{10}P(x_2 | c_i) + ... + \log_{10}P(x_n | c_i) + \log_{10}P(c_i)$$
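
The need for the logarithm can be seen in a short experiment. The 400 identical probabilities below are made up purely for illustration.

```python
import math

# 400 hypothetical word likelihoods, each equal to 0.001.
probabilities = [1e-3] * 400

# Multiplying them directly underflows in double precision.
product = 1.0
for p in probabilities:
    product *= p

# Summing their logarithms stays well within range.
log_sum = sum(math.log10(p) for p in probabilities)

print(product)
print(log_sum)
```

The direct product collapses to zero, while the sum of logarithms stays close to -1200 and can still be compared across categories.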

To predict the news category, this score is calculated for each category, and the category with the maximum score (and therefore the maximum $P(c_i | X)$) is regarded as the predicted category.
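
A minimal sketch of this prediction step, with made-up priors and likelihoods. It assumes every token has a known likelihood in every category; unseen words are covered in the _Absent words in a category_ part.

```python
import math

# Hypothetical priors and per-category word likelihoods.
priors = {'TRAVEL': 0.5, 'BUSINESS': 0.5}
likelihoods = {
    'TRAVEL': {'hotel': 0.4, 'bank': 0.1},
    'BUSINESS': {'hotel': 0.1, 'bank': 0.5},
}


def predict(tokens):
    """Return the category with the highest log score, plus all scores."""
    scores = {}
    for category, prior in priors.items():
        score = math.log10(prior)
        for token in tokens:
            score += math.log10(likelihoods[category][token])
        scores[category] = score
    return max(scores, key=scores.get), scores


category, scores = predict(['bank', 'bank', 'hotel'])
print(category, scores)
```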

#### Absent words in a category
If a word is absent from a category, that is, it has not been seen in any training data piece of that category, its repetition count is zero and thus its $P(x | c_i)$ is zero too. This makes the whole $P(X|c_i)$ zero (and its logarithm undefined), which is not desirable at all. To prevent this, the repetition count of an absent word is set to 1, as it is the lowest meaningful count.
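
The fallback described above can be sketched as follows. For comparison, the snippet also shows add-one (Laplace) smoothing, which is a different, commonly used technique and not the one used in this report. All names and counts are hypothetical.

```python
import math
from collections import Counter

# Hypothetical word counts for one category.
word_counts = Counter({'hotel': 2, 'beach': 1, 'flight': 1, 'tour': 1})
total_words = sum(word_counts.values())  # W_c = 5


def log_likelihood_fallback(word):
    """Approach described above: an unseen word is treated as if seen once."""
    count = word_counts.get(word, 1)
    return math.log10(count / total_words)


def log_likelihood_laplace(word, vocabulary_size):
    """Add-one (Laplace) smoothing: an alternative, not used in this report."""
    return math.log10((word_counts.get(word, 0) + 1)
                      / (total_words + vocabulary_size))


print(log_likelihood_fallback('bank'))    # unseen word
print(log_likelihood_fallback('beach'))   # word seen once
print(log_likelihood_laplace('bank', vocabulary_size=6))
```

One consequence of the fallback is that an unseen word gets the same probability as a word seen exactly once.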

### Confusion Matrix
The confusion matrix is a matrix used to visualize the performance of a model. In the confusion matrix A, the cell $A_{i,j}$ shows the number of times the predicted class was $j$, while the actual class was $i$, thus making the cells in the form $A_{i,i}$ the number of correct predictions per class.
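
A confusion matrix of this form can be built in a few lines of plain Python. This is an illustrative sketch with made-up labels; the notebook itself relies on NLTK's `ConfusionMatrix`.

```python
from collections import defaultdict


def confusion_matrix(actual, predicted):
    """Return matrix[i][j] = number of items of actual class i predicted as j."""
    matrix = defaultdict(lambda: defaultdict(int))
    for actual_label, predicted_label in zip(actual, predicted):
        matrix[actual_label][predicted_label] += 1
    return matrix


# Hypothetical labels.
actual = ['TRAVEL', 'TRAVEL', 'BUSINESS', 'BUSINESS', 'BUSINESS']
predicted = ['TRAVEL', 'BUSINESS', 'BUSINESS', 'BUSINESS', 'TRAVEL']

matrix = confusion_matrix(actual, predicted)
labels = sorted(set(actual) | set(predicted))
for i in labels:
    print(i, [matrix[i][j] for j in labels])

# The diagonal cells hold the correct predictions per class.
correct = sum(matrix[label][label] for label in labels)
print(correct)
```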

## Reducing the words to their roots
For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set. 

### Lemmatization
**Lemmatization** usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

### Stemming
**Stemming** usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.

**The classifier designed in this assignment can use either of these methods.**

As seen further below, the stemming and lemmatization methods get very similar results in this assignment.

## What is the TF-IDF weight?
TF-IDF is a weight used to measure how important a word is to a document in data-mining or classification projects.

TF stands for Term Frequency; it measures how frequently a word occurs in a document. As this value depends heavily on the document length, it is usually divided by the length of the document.

IDF stands for Inverse Document Frequency; it measures the importance of a word based on the number of documents the word occurs in. Words such as stop words occur in many documents but are not of high importance, so the IDF measure weighs down such common words and weighs up rare ones.

A simple implementation of the TF measure would be:

$$TF(t) = \frac{Number\ of\ times\ term\ t\ appears\ in\ a\ document}{Total\ number\ of\ terms\ in\ the\ document}$$

and the implementation of IDF:

$$IDF(t) = \log_e\left(\frac{Total\ number\ of\ documents}{Number\ of\ documents\ with\ term\ t\ in\ it}\right)$$

In this assignment, TF is used (although without dividing the word count by the document length). If IDF were to be included, we could divide $P(x_i|c_i)$ by the number of categories that include $x_i$.
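
The two formulas above can be sketched in plain Python as follows. The documents are made up, and the helper names are illustrative; this is not part of the classifier implemented in this report.

```python
import math

# Hypothetical cleaned documents.
documents = [
    ['hotel', 'beach', 'hotel'],
    ['bank', 'market'],
    ['hotel', 'bank', 'stock', 'market'],
]


def tf(term, document):
    """Term frequency, normalized by the document length."""
    return document.count(term) / len(document)


def idf(term, documents):
    """Inverse document frequency; assumes the term occurs in at least one document."""
    containing = sum(1 for document in documents if term in document)
    return math.log(len(documents) / containing)  # natural logarithm


def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)


print(tf_idf('beach', documents[0], documents))
print(tf_idf('hotel', documents[0], documents))
```

In the first document, _beach_ receives a higher weight than _hotel_ even though it occurs less often, because it appears in fewer documents overall.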

In [1]:
from collections import defaultdict
import math

import pandas as pd
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.metrics import ConfusionMatrix
from prettytable import PrettyTable

DATASET_FILENAME = './data.csv'
TEST_FILENAME = './test.csv'
OUTPUT_FILENAME = './output.csv'

In [2]:
def print_results_table(results):
    table = PrettyTable()
    table.field_names = ['Category', 'Normalization Method', 'Oversampling', 'Recall', 'Precision', 'Accuracy']
    rows = []
    for category, category_results in results['categories'].items():
        rows.append([category, results['normalization_method'], results['oversampling'], category_results['recall'], category_results['precision'], results['accuracy']])
    for row in rows:
        table.add_row(row)
    print(table)

## Reading the Dataset
As seen below, the dataset contains some rows with an empty short_description field. These rows are deleted in the cleaning process.

In [3]:
dataset = pd.read_csv(DATASET_FILENAME)
dataset.head()

Unnamed: 0,index,authors,category,date,headline,link,short_description
0,0,"Katherine LaGrave, ContributorTravel writer an...",TRAVEL,2014-05-07,"EccentriCities: Bingo Parties, Paella and Isla...",https://www.huffingtonpost.com/entry/eccentric...,Påskekrim is merely the tip of the proverbial ...
1,1,Ben Hallman,BUSINESS,2014-06-09,Lawyers Are Now The Driving Force Behind Mortg...,https://www.huffingtonpost.com/entry/mortgage-...,
2,2,Jessica Misener,STYLE & BEAUTY,2012-03-12,Madonna 'Truth Or Dare' Shoe Line To Debut Thi...,https://www.huffingtonpost.com/entry/madonna-s...,"Madonna is slinking her way into footwear now,..."
3,3,"Victor and Mary, Contributor\n2Sense-LA.com",TRAVEL,2013-12-17,Sophistication and Serenity on the Las Vegas S...,https://www.huffingtonpost.com/entry/las-vegas...,But what if you're a 30-something couple that ...
4,4,"Emily Cohn, Contributor",BUSINESS,2015-03-19,It's Still Pretty Hard For Women To Get Free B...,https://www.huffingtonpost.com/entry/free-birt...,Obamacare was supposed to make birth control f...


In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22925 entries, 0 to 22924
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   index              22925 non-null  int64 
 1   authors            18523 non-null  object
 2   category           22925 non-null  object
 3   date               22925 non-null  object
 4   headline           22924 non-null  object
 5   link               22925 non-null  object
 6   short_description  21703 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.2+ MB


In [5]:
class DataCleaner:
    
    def __init__(self, normalization_method):
        if normalization_method == 'lemmatization':
            self.normalize = self.normalize_with_lemmatization
        elif normalization_method == 'stemming':
            self.normalize = self.normalize_with_stemmization
        else:
            raise ValueError("normalization_method value should be either 'lemmatization' or 'stemming'")
    
    def clean(self, dataset, desired_categories=None):
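        # Drop rows without a short description and keep only the needed columns.
        dataset = dataset[dataset.short_description.notnull()]
        # Copy so the assignment below does not write into a slice of the original frame.
        dataset = dataset.filter(['index', 'category', 'short_description']).copy()
        dataset['short_description'] = dataset['short_description'].apply(self.normalize)
        return dataset[dataset.category.isin(desired_categories)] if desired_categories else dataset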
        
    
    @staticmethod
    def normalize_with_lemmatization(text):
        def get_mapped_tag(nltk_tag):
            if nltk_tag.startswith('J'):
                return wordnet.ADJ
            elif nltk_tag.startswith('V'):
                return wordnet.VERB
            elif nltk_tag.startswith('N'):
                return wordnet.NOUN
            elif nltk_tag.startswith('R'):
                return wordnet.ADV
            else:
                return None

        sentences = sent_tokenize(text.lower())
        stop_words = set(stopwords.words('english'))
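        # A set is used here, so each lemma is kept at most once per description
        # and the original word order is not preserved.
        lemmatized_words = set()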
        lemmatizer = WordNetLemmatizer()
        for sentence in sentences:
            tokens = [token for token in word_tokenize(sentence) if (token not in stop_words and token.isalnum())]
            nltk_tagged = pos_tag(tokens)
            wn_tagged = map(lambda x: (x[0], get_mapped_tag(x[1])), nltk_tagged)
            for word, tag in wn_tagged:
                if tag is None:                        
                    lemmatized_words.add(word)
                else:
                    lemmatized_words.add(lemmatizer.lemmatize(word, tag))
        return ','.join([word.lower() for word in lemmatized_words])
    
    @staticmethod
    def normalize_with_stemmization(text):
        stop_words = set(stopwords.words('english'))
        stemmer = PorterStemmer()
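        # NOTE: tokens are compared with the lowercase stop-word list before being
        # lowercased, so capitalized stop words (e.g. 'The', 'In') are not removed.
        tokens = [token for token in word_tokenize(text) if (token not in stop_words and token.isalnum())]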
        tokens = map(stemmer.stem, tokens)
        return ','.join([token.lower() for token in tokens])

In [6]:
class Classifier:
    ds = None
    test_data = None
    train_data = None
    category_word_count = None
    category_total_word_count = None
    word_probabilities = None
    total_dataset_length = None

    def __init__(
        self, 
        dataset, 
        normalization_method, 
        desired_categories, 
        test_data_filename, 
        training_data_percentage,
        output_filename,
        oversample=False
    ):
        self.data_cleaner = DataCleaner(normalization_method)
        dataset = self.data_cleaner.clean(dataset, desired_categories)
        self.categories = set(dataset['category'])
        self.ds = {}
        self.total_dataset_length = len(dataset)
        for category in self.categories:
            self.ds[category] = dataset[dataset['category'] == category].reset_index(drop=True)
        self.training_data_percentage = training_data_percentage
        self._prepare_train_and_test_data(oversample)
        self._train()
        self.test_data_filename = test_data_filename
        self.output_filename = output_filename
        self.is_oversampled = oversample
        self.normalization_method = normalization_method
        

    def _prepare_train_and_test_data(self, oversample):
        self.test_data = {}
        self.train_data = {}
        for category, data in self.ds.items():
            partition_point = len(data) * self.training_data_percentage // 100
            self.train_data[category] = data[:partition_point]
            self.test_data[category] = data[partition_point:]
        if oversample:
            self.oversample()

    def oversample(self):
        max_category_data_count = len(max(self.train_data.values(), key=len))
        for category, data in self.train_data.items():
            lst = [data]
            lst.append(data.sample(
                max_category_data_count-len(data), replace=True))
            self.train_data[category] = pd.concat(lst)

    def _find_word_count(self):
        self.category_word_count = defaultdict(lambda: defaultdict(lambda: 0))
        self.category_total_word_count = defaultdict(lambda: 0)
        for category, data in self.train_data.items():
            for text in data['short_description']:
                tokens = text.split(',')
                for token in tokens:
                    self.category_word_count[category][token] += 1
                self.category_total_word_count[category] += len(tokens)

    def _train(self):
        self._find_word_count()
        self.word_probabilities = defaultdict(dict)
        for category, word_count in self.category_word_count.items():
            for word, count in word_count.items():
                self.word_probabilities[category][word] = math.log10(
                    count / self.category_total_word_count[category])

    def _predict_news_category(self, text):
        results = {}
        tokens = text.split(',')
        for category in self.categories:
            probability = math.log10(
                len(self.ds[category]) / self.total_dataset_length)
            for token in tokens:
                probability += self.word_probabilities[category].get(
                    token, math.log10(1/self.category_total_word_count[category]))
            results[category] = probability
        return max(results, key=results.get)

    def check_with_test_data(self):
        results = defaultdict(lambda: {'correct': 0, 'wrong': 0})
        prediction_count = defaultdict(lambda: 0)
        total_correct_predictions = 0
        test_data_count = 0
        actual_categories = []
        predicted_categories = []
        for category, data in self.test_data.items():
            test_data_count += len(data)
            actual_categories.extend([category] * len(data))
            for index, row in data.iterrows():
                predicted_category = self._predict_news_category(
                    row['short_description'])
                predicted_categories.append(predicted_category)
                prediction_count[predicted_category] += 1
                if predicted_category == category:
                    results[category]['correct'] += 1
                    total_correct_predictions += 1
                else:
                    results[category]['wrong'] += 1
        confusion_matrix = ConfusionMatrix(actual_categories, predicted_categories)
        category_results = defaultdict(dict)
        accuracy = total_correct_predictions / test_data_count
        for category in self.categories:
            category_results[category]['recall'] = results[category]['correct'] / \
                len(self.test_data[category])
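            # Guard against a category that was never predicted (avoids ZeroDivisionError).
            predicted_count = prediction_count[category]
            category_results[category]['precision'] = (
                results[category]['correct'] / predicted_count if predicted_count else 0.0)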
        
        return {
            'categories': {**category_results},
            'accuracy': accuracy,
            'oversampling': self.is_oversampled,
            'normalization_method': self.normalization_method,
            'confusion_matrix': confusion_matrix
        }

    def predict(self):
        data = pd.read_csv(self.test_data_filename)
        data = self.data_cleaner.clean(data)
        results = []
        for index, row in data.iterrows():
            results.append([row['index'], self._predict_news_category(row['short_description'])])
        final_df = pd.DataFrame(results, columns=['index', 'category'])
        final_df.to_csv(self.output_filename)
        return final_df
        

## A problem with the dataset
If we group the dataset by category, it can be seen that the data is imbalanced: the number of rows in the BUSINESS category is much lower than in the other two categories. To solve this problem, oversampling is used. In oversampling, random rows from the under-represented class are duplicated and appended to the dataset until all classes have an almost identical number of data pieces and their class prior probabilities become almost equal.

In [7]:
sample_dataset = dataset[dataset.short_description.notnull()]
sample_dataset = sample_dataset.groupby('category')
for group, data in sample_dataset:
    print(group, len(data))

BUSINESS 4568
STYLE & BEAUTY 8674
TRAVEL 8461
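
The oversampling idea described in this section can be sketched with plain lists. The function name, the fixed seed and the toy data are assumptions made for illustration; the notebook's own implementation uses `DataFrame.sample` with `replace=True`.

```python
import random


def oversample(rows_by_category, seed=0):
    """Pad every category with random duplicates up to the largest category's size."""
    rng = random.Random(seed)
    target = max(len(rows) for rows in rows_by_category.values())
    balanced = {}
    for category, rows in rows_by_category.items():
        # Sampling with replacement; yields an empty list for the largest category.
        extra = rng.choices(rows, k=target - len(rows))
        balanced[category] = rows + extra
    return balanced


# Hypothetical toy data: BUSINESS is under-represented.
toy = {'BUSINESS': ['b1', 'b2', 'b3'], 'TRAVEL': ['t1', 't2', 't3', 't4', 't5', 't6']}
balanced = oversample(toy)
print({category: len(rows) for category, rows in balanced.items()})
```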


## Showing the functionalities
A sample classifier is instantiated below to show the changes made to the dataset.

In [8]:
sample_classifier = Classifier(
        dataset=dataset,
        normalization_method='stemming', 
        desired_categories=['TRAVEL', 'BUSINESS'], 
        test_data_filename=TEST_FILENAME, 
        training_data_percentage=80,
        output_filename=OUTPUT_FILENAME,
        oversample=True
    )

### Training data
The training data are grouped by their categories, and if the oversampling argument value is set to True, the training data will be oversampled and each category will have an almost equal number of data pieces.

In [9]:
sample_classifier.train_data

{'BUSINESS':       index  category                                  short_description
 0         4  BUSINESS  obamacar,suppos,make,birth,control,free,women,...
 1        15  BUSINESS  in,fact,busi,insid,point,sunday,break,bad,audi...
 2        18  BUSINESS  the,measur,part,effort,block,teenag,adolesc,ge...
 3        23  BUSINESS  a,relief,well,compani,began,drill,earli,decemb...
 4        26  BUSINESS  some,talk,american,banker,said,polit,pressur,a...
 ...     ...       ...                                                ...
 177     942  BUSINESS                                  watch,wall,street
 241    1268  BUSINESS  perhap,surprisingli,three,ivi,leagu,school,lis...
 3058  15338  BUSINESS  when,think,great,leader,come,mind,richard,bran...
 2627  13132  BUSINESS                  how,prepar,how,anybodi,prepar,you
 2102  10559  BUSINESS  sinc,septemb,intergener,month,take,time,apprec...
 
 [6768 rows x 3 columns],
 'TRAVEL':       index category                                  short_d

In [10]:
sample_classifier.train_data['BUSINESS'].head()

Unnamed: 0,index,category,short_description
0,4,BUSINESS,"obamacar,suppos,make,birth,control,free,women,..."
1,15,BUSINESS,"in,fact,busi,insid,point,sunday,break,bad,audi..."
2,18,BUSINESS,"the,measur,part,effort,block,teenag,adolesc,ge..."
3,23,BUSINESS,"a,relief,well,compani,began,drill,earli,decemb..."
4,26,BUSINESS,"some,talk,american,banker,said,polit,pressur,a..."


In [11]:
sample_classifier.train_data['BUSINESS'].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6768 entries, 0 to 2102
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   index              6768 non-null   int64 
 1   category           6768 non-null   object
 2   short_description  6768 non-null   object
dtypes: int64(1), object(2)
memory usage: 211.5+ KB


### Test data
The test data are also grouped by their categories.

In [12]:
sample_classifier.test_data

{'BUSINESS':       index  category                                  short_description
 3654  18365  BUSINESS  we,live,corpor,environ,often,place,great,impor...
 3655  18370  BUSINESS  thi,year,term,reput,everywher,it,longer,primar...
 3656  18373  BUSINESS                   here,challeng,take,love,put,work
 3657  18374  BUSINESS  we,got,test,new,game,suppos,better,teach,strat...
 3658  18378  BUSINESS  madison,nicol,robinson,design,best,known,creat...
 ...     ...       ...                                                ...
 4563  22876  BUSINESS  in,addit,stronger,rule,need,make,sure,system,n...
 4564  22884  BUSINESS  privat,prison,corpor,geo,group,grew,1980,state...
 4565  22892  BUSINESS  low,price,make,oil,tanker,heist,africa,western...
 4566  22910  BUSINESS  simultan,without,know,i,learn,mani,valuabl,les...
 4567  22914  BUSINESS  the,bank,pledg,april,stop,financ,compani,sell,...
 
 [914 rows x 3 columns],
 'TRAVEL':       index category                                  short_de

In [13]:
sample_classifier.test_data['BUSINESS'].head()

Unnamed: 0,index,category,short_description
3654,18365,BUSINESS,"we,live,corpor,environ,often,place,great,impor..."
3655,18370,BUSINESS,"thi,year,term,reput,everywher,it,longer,primar..."
3656,18373,BUSINESS,"here,challeng,take,love,put,work"
3657,18374,BUSINESS,"we,got,test,new,game,suppos,better,teach,strat..."
3658,18378,BUSINESS,"madison,nicol,robinson,design,best,known,creat..."


In [14]:
sample_classifier.test_data['BUSINESS'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 914 entries, 3654 to 4567
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   index              914 non-null    int64 
 1   category           914 non-null    object
 2   short_description  914 non-null    object
dtypes: int64(1), object(2)
memory usage: 21.6+ KB


As can be seen above, there are 4568 pieces of news in the BUSINESS category, of which 3654 are used for training. After oversampling in the sample_classifier, the BUSINESS training data contains 6768 pieces of news (oversampling only affects the training data).

## Assignment Phase 1 results

The phase 1 and 2 results are the _Recall_ and _Precision_ values for each category and the _accuracy_ of the whole prediction process.

These values are evaluated from the following formulas:

$$Recall = \frac{Correct\ Category\ Predictions}{Total\ Category\ News\ Count}$$

$$Precision = \frac{Correct\ Category\ Predictions}{Total\ Category\ Predictions}$$

$$Accuracy = \frac{Total\ Correct\ Predictions}{Total\ Data\ Count}$$

Phase 1 considers only the 'BUSINESS' and 'TRAVEL' categories.

### What happens if only precision is considered?
The precision value is the number of true positives divided by the sum of true positives and false positives. Consequently, if a model makes only a single positive prediction and it is correct, its precision is 1.0! But this measure completely ignores the model's false negatives. For example, consider a model that has to detect a disease. In a dataset with 1000 pieces of data, of which 10 are positive, the model flags only 2 cases, both correctly. Its precision is 1.0 as there are no false positives, but this doesn't make it a good model, as it produced 8 false negatives. The recall for such a prediction would be 20%, which is not good at all. Combining different measures such as recall, precision and accuracy exposes this problem, whereas considering precision alone won't result in a good model.
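
The disease-detection example can be checked with a few lines of arithmetic. The variable names are illustrative.

```python
# Disease-detection example from the paragraph above.
total = 1000
actual_positives = 10
true_positives = 2
false_positives = 0

false_negatives = actual_positives - true_positives
true_negatives = total - actual_positives - false_positives

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
accuracy = (true_positives + true_negatives) / total

print(precision, recall, accuracy)
```

Precision and accuracy both look excellent here, and only recall reveals that most positive cases were missed.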

In [15]:
def phase_1():
    stemmized_classifier = Classifier(
        dataset=dataset,
        normalization_method='stemming', 
        desired_categories=['TRAVEL', 'BUSINESS'], 
        test_data_filename=TEST_FILENAME, 
        training_data_percentage=80,
        output_filename=OUTPUT_FILENAME,
        oversample=True
    )
    stemmization_test_results = stemmized_classifier.check_with_test_data()
    
    lemmatized_classifier = Classifier(
        dataset=dataset,
        normalization_method='lemmatization', 
        desired_categories=['TRAVEL', 'BUSINESS'], 
        test_data_filename=TEST_FILENAME, 
        training_data_percentage=80,
        output_filename=OUTPUT_FILENAME,
        oversample=True
    )
    lemmatization_test_results = lemmatized_classifier.check_with_test_data()
    return {
        'lemmatization': lemmatization_test_results,
        'stemming': stemmization_test_results
    }
    
p1_results = phase_1()

In [16]:
for norm_method, results in p1_results.items():
    print_results_table(results)

+----------+----------------------+--------------+--------------------+--------------------+--------------------+
| Category | Normalization Method | Oversampling |       Recall       |     Precision      |      Accuracy      |
+----------+----------------------+--------------+--------------------+--------------------+--------------------+
| BUSINESS |    lemmatization     |     True     | 0.8599562363238512 |      0.81875       | 0.8841580360567702 |
|  TRAVEL  |    lemmatization     |     True     | 0.8972238629651507 | 0.9222829386763813 | 0.8841580360567702 |
+----------+----------------------+--------------+--------------------+--------------------+--------------------+
+----------+----------------------+--------------+--------------------+--------------------+--------------------+
| Category | Normalization Method | Oversampling |       Recall       |     Precision      |      Accuracy      |
+----------+----------------------+--------------+--------------------+-----------------

## Assignment Phase 1 results without oversampling
This part demonstrates the effect of oversampling on the results.

In [17]:
def phase_1_without_oversampling():
    stemmized_classifier = Classifier(
        dataset=dataset,
        normalization_method='stemming', 
        desired_categories=['TRAVEL', 'BUSINESS'], 
        test_data_filename=TEST_FILENAME, 
        training_data_percentage=80,
        output_filename=OUTPUT_FILENAME,
        oversample=False
    )
    stemmization_test_results = stemmized_classifier.check_with_test_data()
    
    lemmatized_classifier = Classifier(
        dataset=dataset,
        normalization_method='lemmatization', 
        desired_categories=['TRAVEL', 'BUSINESS'], 
        test_data_filename=TEST_FILENAME, 
        training_data_percentage=80, 
        output_filename=OUTPUT_FILENAME,
        oversample=False
    )
    lemmatization_test_results = lemmatized_classifier.check_with_test_data()
    
    return {
        'lemmatization': lemmatization_test_results,
        'stemming': stemmization_test_results
    }

p1_results_without_oversampling = phase_1_without_oversampling()

In [18]:
for norm_method, results in p1_results_without_oversampling.items():
    print_results_table(results)

+----------+----------------------+--------------+--------------------+--------------------+--------------------+
| Category | Normalization Method | Oversampling |       Recall       |     Precision      |      Accuracy      |
+----------+----------------------+--------------+--------------------+--------------------+--------------------+
| BUSINESS |    lemmatization     |    False     | 0.8916849015317286 | 0.7560296846011132 | 0.8611430763329497 |
|  TRAVEL  |    lemmatization     |    False     | 0.8446544595392794 | 0.935251798561151  | 0.8611430763329497 |
+----------+----------------------+--------------+--------------------+--------------------+--------------------+
+----------+----------------------+--------------+--------------------+--------------------+--------------------+
| Category | Normalization Method | Oversampling |       Recall       |     Precision      |      Accuracy      |
+----------+----------------------+--------------+--------------------+-----------------

## Assignment Phase 2 results
This phase calculates the recall, precision and accuracy values for the 'BUSINESS', 'TRAVEL' and 'STYLE & BEAUTY' categories.
The confusion matrix is also printed for this prediction process.

In [19]:
def phase_2():
    stemmized_classifier = Classifier(
        dataset=dataset,
        normalization_method='stemming', 
        desired_categories=['TRAVEL', 'BUSINESS', 'STYLE & BEAUTY'], 
        test_data_filename=TEST_FILENAME, 
        training_data_percentage=80,
        output_filename=OUTPUT_FILENAME,
        oversample=True
    )
    stemmization_test_results = stemmized_classifier.check_with_test_data()
    
    lemmatized_classifier = Classifier(
        dataset=dataset,
        normalization_method='lemmatization', 
        desired_categories=['TRAVEL', 'BUSINESS', 'STYLE & BEAUTY'], 
        test_data_filename=TEST_FILENAME, 
        training_data_percentage=80,
        output_filename=OUTPUT_FILENAME,
        oversample=True
    )
    lemmatization_test_results = lemmatized_classifier.check_with_test_data()
    
    return {
        'lemmatization': lemmatization_test_results,
        'stemming': stemmization_test_results
    }
    
p2_results = phase_2()

In [20]:
for norm_method, results in p2_results.items():
    print_results_table(results)

+----------------+----------------------+--------------+--------------------+--------------------+-------------------+
|    Category    | Normalization Method | Oversampling |       Recall       |     Precision      |      Accuracy     |
+----------------+----------------------+--------------+--------------------+--------------------+-------------------+
|    BUSINESS    |    lemmatization     |     True     | 0.824945295404814  | 0.7384916748285995 | 0.845923537540304 |
|     TRAVEL     |    lemmatization     |     True     | 0.8417011222681631 | 0.8631132646880678 | 0.845923537540304 |
| STYLE & BEAUTY |    lemmatization     |     True     | 0.8610951008645533 | 0.8946107784431138 | 0.845923537540304 |
+----------------+----------------------+--------------+--------------------+--------------------+-------------------+
+----------------+----------------------+--------------+--------------------+--------------------+--------------------+
|    Category    | Normalization Method | Overs

In [21]:
for norm_method, data in p2_results.items():
    print('Normalization Method:', data['normalization_method'])
    print(data['confusion_matrix'])

Normalization Method: lemmatization
               |         S      |
               |         T      |
               |         Y      |
               |         L      |
               |         E      |
               |                |
               |    B    &      |
               |    U           |
               |    S    B    T |
               |    I    E    R |
               |    N    A    A |
               |    E    U    V |
               |    S    T    E |
               |    S    Y    L |
---------------+----------------+
      BUSINESS | <754>  66   94 |
STYLE & BEAUTY |  109<1494> 132 |
        TRAVEL |  158  110<1425>|
---------------+----------------+
(row = reference; col = test)

Normalization Method: stemming
               |         S      |
               |         T      |
               |         Y      |
               |         L      |
               |         E      |
               |                |
               |    B    &      |
               |  

## Final phase of the assignment
This phase predicts the categories of some unclassified data and writes the results to the given file (in this assignment, the output file is 'output.csv').

In [22]:
def final_phase(normalization_method):
    classifier = Classifier(
        dataset=dataset,
        normalization_method=normalization_method, 
        desired_categories=['TRAVEL', 'BUSINESS', 'STYLE & BEAUTY'], 
        test_data_filename=TEST_FILENAME, 
        training_data_percentage=80,
        output_filename=OUTPUT_FILENAME,
        oversample=True
    )
    # Return the predictions so the caller can inspect them.
    return classifier.predict()
    
final_phase('stemming')