# Bayes Theorem Classifier Report CS361 A3
#### UPI: ytia165, ID: 402799865
*****


The CSV file given contains 3 columns: ID, class, and abstract. Preprocessing starts by reading the class and abstract of each row and converting the unique classes into numeric class IDs.

Next come my functions, which implement several variants of Naive Bayes along with the validation functions used to test them. Each function is described in its docstring. The first method I used was the standard Naive Bayes model; with an 80/20 train/test split it yielded an accuracy of 0.1975. Since a single test split could happen to favour (or penalise) the model, I used 10-fold cross-validation to obtain an average accuracy. The results were:

| Split         | Accuracy      | 
| ------------- |---------------|
| 1             | 0.2075        | 
| 2             | 0.185         |
| 3             | 0.2175        | 
| 4             | 0.18          | 
| 5             | 0.1875        | 
| 6             | 0.1925        | 
| 7             | 0.185         | 
| 8             | 0.17          | 
| 9             | 0.19          | 
| 10            | 0.195         | 
| AVG           | 0.191         | 

Clearly, the standard implementation was extremely poor, as it performed much worse than a 'majority class' model. Looking at several extensions of Naive Bayes, I decided to implement the log version, which replaces the product of word probabilities with a sum of their logarithms and so dampens the effect of words that appear very often. With this quick change to logs, and summing to find the maximum, the 80/20 test yielded an accuracy of 0.94, a vast increase over the basic model. Here is the 10-fold cross-validation for a more reliable accuracy estimate.

| Split         | Accuracy      | 
| ------------- |---------------|
| 1             | 0.9475        | 
| 2             | 0.935         |
| 3             | 0.9425        | 
| 4             | 0.93          | 
| 5             | 0.9475        | 
| 6             | 0.92          | 
| 7             | 0.9525        | 
| 8             | 0.935         | 
| 9             | 0.9425        | 
| 10            | 0.9325        | 
| AVG           | 0.9385        | 

An average accuracy of 93.85% is an immense improvement at classifying abstracts based on word probabilities. Looking at our dictionary of words, many common words appear frequently but have little or no relation to the class. Examples include 'The', 'It', and 'And'. With this knowledge, we build an improved model that leaves these words out of the probability dictionary. The full list is in the variable 'stopwords'. Running this model on the 80/20 split gives the same accuracy of 0.94. However, 10-fold cross-validation yields different results. Here are the accuracies with stopword removal.

| Split         | Accuracy      | 
| ------------- |---------------|
| 1             | 0.95          | 
| 2             | 0.9525        |
| 3             | 0.945         | 
| 4             | 0.9375        | 
| 5             | 0.96          | 
| 6             | 0.935         | 
| 7             | 0.9525        | 
| 8             | 0.94          | 
| 9             | 0.945         | 
| 10            | 0.9425        | 
| AVG           | 0.9460        | 

An average accuracy of 94.60% is slightly better than the logged Naive Bayes without stopword removal. Here is a summary of the 3 models.

| Standard | Logged | Logged + Stopwords |
| :---: | :---: | :---: |
| 0.191 | 0.9385 | 0.9460 |

Finally, looking at the raw data, the classes are unbalanced. Class 'E' is the most common with 2144 appearances and class 'V' is the least common with only 216 appearances out of our 4000 data rows. This is a prime example of where a complement Naive Bayes model might be a better fit. Here, instead of estimating each word's probability from the documents inside a class, we estimate it from the documents in all other classes and choose the class whose complement fits the text worst. Unfortunately, I was unable to include this as a further model, since there seems to be a bug somewhere in the code: it only gives an accuracy of around 50%, which clearly should not be the case.

The logged model with stopword removal was therefore the one submitted to Kaggle, where it achieved a public leaderboard accuracy of 94.6%. The functions were finally cleaned and documented. The code is below.

# Code for Classifier Based on Text Inputs
*****

In [1]:
import csv

import numpy as np

### Reading in CSV files

In [2]:
classes = []
train_text = []
with open("trg.csv") as input_csv:
    reader = csv.reader(input_csv, delimiter=",", quotechar='"')
    next(reader)
    for row in reader:
        classes.append(row[1])
        train_text.append(row[2])
        
# change categorical classes to numeric class
unique_classes = sorted(set(classes))
class_to_id = {name: index for index, name in enumerate(unique_classes)}
id_to_class = {index: name for name, index in class_to_id.items()}
classes_numeric = np.array([class_to_id[x] for x in classes])

### Bayes Theorem and Validation functions 

In [3]:
def train_test_split(x, y, test_split):
    '''
    Returns training and test set 

            Parameters (in this instance):
                    x (list/numpy.ndarray): A list of explanatory variables
                    y (list/numpy.ndarray): A list of response variables
                    test_split (float): The proportion of the data held out as the test set

            Returns:
                    x[train], x[test] (numpy.ndarray): explanatory training and test set
                    y[train], y[test] (numpy.ndarray): response training and test set
    '''
    n_test = int(test_split * len(x))
    x, y = np.array(x), np.array(y)
    perm = np.random.default_rng(seed = 402799865).permutation(len(x))
    test, train = perm[:n_test], perm[n_test:]
    
    return x[train], x[test], y[train], y[test]

In [4]:
def class_prob(y_train):
    '''Returns the probability of a class (format: dictionary)'''
    unique, counts = np.unique(y_train, return_counts=True)
    dicY_counts = dict(zip(unique, counts))
    dicY_prob = {}
    for key, value in dicY_counts.items():
        dicY_prob[key] = value/sum(dicY_counts.values())
        
    return dicY_prob
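
The class priors also give the 'majority class' baseline that the report compares against. The sketch below is an illustration only: the function name `majority_baseline_accuracy` is hypothetical, and the label list is synthetic. Only the counts for 'E' (2144) and 'V' (216) out of 4000 rows come from the report; the remaining rows are lumped under a placeholder label.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    # most_common(1) returns a one-element list: [(label, count)]
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Synthetic labels: 'other' is a placeholder for every class except E and V.
labels = ["E"] * 2144 + ["V"] * 216 + ["other"] * (4000 - 2144 - 216)
baseline = majority_baseline_accuracy(labels)
print(f"{baseline:.3f}")
```

Under these assumptions the baseline is 2144/4000, so any model scoring below roughly 0.54 is doing worse than always guessing 'E'.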

In [5]:
def dictionary(x_train, y_train, stopwords = []):
    '''
    Returns all given words and words subsetted by class

            Parameters (in this instance):
                    x_train (numpy.ndarray): A list of explanatory variables
                    y_train (numpy.ndarray): A list of response variables 
                    stopwords (list): Words to not include in the dictionary

            Returns:
                    dictWordsFull (dictionary): All words 
                    listdict (dictionaries in list): All words subsetted by class
    '''
    dictWordsFull = {}
    listdict = [{} for _ in range(len(unique_classes))]
    for array in range(len(x_train)):  # visit every document, including the last one
        for word in x_train[array].split(): 
            if word not in stopwords:
                if word not in dictWordsFull:
                    dictWordsFull[word] = 1  
                else:
                    dictWordsFull[word] += 1

                if word not in listdict[y_train[array]]:
                    listdict[y_train[array]][word] = 1
                else:
                    listdict[y_train[array]][word] += 1
                    
    return dictWordsFull, listdict
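
The same counting step could also be written with `collections.Counter` and a set of stopwords, which makes each stopword lookup constant-time. This is a minimal sketch rather than the code used for the reported results; `count_words` and the toy abstracts are invented for illustration.

```python
from collections import Counter


def count_words(texts, labels, n_classes, stopwords=()):
    """Count words overall and per class, skipping any stopwords."""
    stop = set(stopwords)  # set membership is O(1); list membership is O(n)
    total = Counter()
    per_class = [Counter() for _ in range(n_classes)]
    for text, label in zip(texts, labels):
        words = [w for w in text.split() if w not in stop]
        total.update(words)
        per_class[label].update(words)
    return total, per_class


# Toy data, invented for illustration only
texts = ["the cell divides", "the virus infects the cell"]
labels = [0, 1]
total, per_class = count_words(texts, labels, n_classes=2, stopwords=["the"])
print(total)
```

Note that the comparison is case-sensitive, so a lowercase stopword list only removes 'The' if the text has been lowercased first.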

In [6]:
def dictionary_prob_log(dictWordsFull, listdict):
    '''Returns the log probability (add-one smoothed) of a word appearing given the class (format: dictionaries in list)'''
    listdictprob = [{} for _ in range(len(unique_classes))]
    vocab_size = len(dictWordsFull)
    for classi in range(len(unique_classes)):
        # The denominator is constant per class, so compute it once instead of once per word
        denom = sum(listdict[classi].values()) + vocab_size
        for key in dictWordsFull:
            # .get(key, 0) covers words that never appear in this class
            listdictprob[classi][key] = np.log((listdict[classi].get(key, 0) + 1) / denom)
                
    return listdictprob
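
To make the add-one (Laplace) smoothing in `dictionary_prob_log` concrete, here is a small worked example. The helper `smoothed_log_prob` and the word counts are hypothetical; the formula is the one used above, log((count + 1) / (class total + vocabulary size)).

```python
import math


def smoothed_log_prob(word_count, class_total, vocab_size):
    """Add-one smoothed log probability of a word given a class."""
    return math.log((word_count + 1) / (class_total + vocab_size))


# Toy counts for a single class; 'gene' never occurs in this class.
counts = {"cell": 3, "virus": 5, "gene": 0}
class_total = sum(counts.values())   # 8 word tokens in the class
vocab_size = len(counts)             # 3 distinct words overall

log_probs = {
    word: smoothed_log_prob(count, class_total, vocab_size)
    for word, count in counts.items()
}
# Exponentiating recovers 4/11, 6/11 and 1/11, which sum to 1
total_prob = sum(math.exp(lp) for lp in log_probs.values())
print(log_probs, total_prob)
```

The unseen word still gets a small non-zero probability (1/11 here), so its logarithm stays finite.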

In [7]:
def class_calculate(words, listdictprob, dicY_prob):
    '''Returns the most likely class based on Bayesian probability (largest sum of log probabilities)'''
    classes = [[np.log(prob)] for prob in dicY_prob.values()]
    for word in words.split():
        for classi in range(0,len(unique_classes)):
            if word in listdictprob[classi]:
                classes[classi].append(listdictprob[classi][word])
    
    maxSum = []
    for lst in classes:
        maxSum.append(np.sum(lst))

    return maxSum.index(max(maxSum))
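
One reason to sum log probabilities in `class_calculate` rather than multiply raw probabilities is floating-point underflow. The sketch below uses a hypothetical abstract of 1,000 words, each with a made-up likelihood of 1e-4, to show the difference.

```python
import math

# Hypothetical likelihoods: 1,000 words, each with probability 1e-4
word_probs = [1e-4] * 1000

product = 1.0
for p in word_probs:
    product *= p  # the true value (1e-4000) is far below the smallest float, so this underflows to 0.0

# The equivalent sum of logs stays finite and comparable across classes
log_sum = sum(math.log(p) for p in word_probs)

print(product)
print(log_sum)
```

If every class score underflows to 0.0, taking the maximum can no longer separate the classes. That is one possible, unverified, explanation for the low accuracy of the non-log model.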

In [8]:
def test_accuracy(x_test, listdictprob, dicY_prob, y_test):
    '''Returns accuracy based on x and y test sets'''
    test_array = []
    for array in x_test:
        test_class = class_calculate(array, listdictprob, dicY_prob)
        test_array.append(test_class)
    test_array_class = np.array([id_to_class[c] for c in test_array])
    
    return sum(1 for x,y in zip(test_array,y_test) if x == y) / len(test_array)

In [9]:
def classi_identifier_test(string, classi, test_split = 0.2):
    '''
    Returns accuracy of model

            Parameters (in this instance):
                    string (list/numpy.ndarray): A list of explanatory variables
                    classi (list/numpy.ndarray): A list of response variables 
                    test_split (float): The proportion of the data held out as the test set

            Returns:
                    accuracy (float): The accuracy of the model
    '''
    x_train, x_test, y_train, y_test = train_test_split(string, classi, test_split)
    dicY_prob = class_prob(y_train)
    dictWordsFull, listdict = dictionary(x_train, y_train)
    listdictprob = dictionary_prob_log(dictWordsFull, listdict)
    accuracy = test_accuracy(x_test, listdictprob, dicY_prob, y_test)
    
    return accuracy

In [10]:
print(classi_identifier_test(train_text, classes_numeric))

0.94


In [11]:
def classi_identifier_test_kfold(string, classi, fold = 10):
    '''Returns accuracy of model based on k-fold cross validation'''
    accuracy_lst = []
    perm = np.random.default_rng(seed = 402799865).permutation(len(string))
    chunks = [perm[x:x+int(len(perm)/fold)] for x in range(0, len(perm), int(len(perm)/fold))]
    string, classi = np.array(string), np.array(classi)
    
    for i in range(0,fold):
        training_lst = chunks[:i] + chunks[i+1:]
        training_lst_concat = [j for i in training_lst for j in i]
        string_train, classi_train = string[training_lst_concat], classi[training_lst_concat]
        string_test, classi_test = string[chunks[i]], classi[chunks[i]]
        
        dicY_prob = class_prob(classi_train)
        dictWordsFull, listdict = dictionary(string_train, classi_train)
        listdictprob = dictionary_prob_log(dictWordsFull, listdict)
        accuracy = test_accuracy(string_test, listdictprob, dicY_prob, classi_test)
        accuracy_lst.append(accuracy)
        
    return accuracy_lst, np.mean(accuracy_lst)

In [12]:
print(classi_identifier_test_kfold(train_text, classes_numeric))

([0.9475, 0.935, 0.9425, 0.93, 0.9475, 0.92, 0.9525, 0.935, 0.9425, 0.9325], 0.9385)
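
The chunking in `classi_identifier_test_kfold` works cleanly here because 4000 rows divide evenly into 10 folds. When the length is not a multiple of `fold`, the slicing produces an extra short chunk that is never used as a test set. A possible alternative, sketched below with the hypothetical helper `kfold_indices`, spreads the remainder across the first folds.

```python
import random


def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs so every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Give the first `extra` folds one additional sample each
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# 23 samples in 5 folds: the fold sizes differ by at most one
folds = list(kfold_indices(n_samples=23, n_folds=5))
fold_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_sizes)
```

The seed and sample count are arbitrary; the returned index lists can be used to index NumPy arrays in the same way as `chunks[i]` above.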


# Training with Entire Dataset

In [13]:
abstracts = []
with open("tst.csv") as input_csv:
    reader = csv.reader(input_csv, delimiter=",", quotechar='"')
    next(reader)
    for row in reader:
        abstracts.append(row[1])

In [14]:
def class_output(xpreds, listdictprob, dicY_prob):
    '''Class Identifier but for full dataset'''
    lst = []
    for array in xpreds:
        class_pred = class_calculate(array, listdictprob, dicY_prob)
        lst.append(class_pred)
    lst2 = np.array([id_to_class[c] for c in lst])
    
    return lst2

In [15]:
def classi_identifier_full(string, classi, xpreds, to_csv = False, stopw = []):
    '''MAIN Class Identifier but for full dataset'''
    dicY_prob = class_prob(classi)
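    # Pass the stopw argument; the global `stopwords` was used here before,
    # so stopwords were removed even when stopw was left empty
    dictWordsFull, listdict = dictionary(string, classi, stopw)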
    listdictprob = dictionary_prob_log(dictWordsFull, listdict)
    output = class_output(xpreds, listdictprob, dicY_prob)
    if to_csv:
        np.savetxt("ytia165_CS361_A3_PREDICTIONS2.csv", output, delimiter=", ", fmt='%s')
        
    return output

In [16]:
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

In [17]:
first_try = classi_identifier_full(train_text, classes_numeric, abstracts, to_csv = False)

### Incorporating stopwords

In [18]:
second_try = classi_identifier_full(train_text, classes_numeric, abstracts, to_csv = False, stopw = stopwords)

### Complement Naive Bayes

In [19]:
def dictionary_prob_complement(dictWordsFull, listdict):
    '''Returns the inverse probability of a word appearing outside the class (format: dictionaries in list)'''
    listdictprob = [{} for _ in range(len(unique_classes))]
    for key in dictWordsFull:
        for classi in range(0,len(unique_classes)):
            # .get(key, 0) avoids a KeyError for words that never occur in this class
            listdictprob[classi][key] = 1/((dictWordsFull[key]-listdict[classi].get(key, 0)+1)/(sum(dictWordsFull.values())-sum(listdict[classi].values())))
                
    return listdictprob

In [20]:
def class_calculate_mult(words, listdictprob, dicY_prob):
    '''Returns the most likely class based on Bayesian probability (smallest product of complement scores)'''
    classes = [[prob] for prob in dicY_prob.values()]
    for word in words.split():
        for classi in range(0,len(unique_classes)):
            if word in listdictprob[classi]:
                classes[classi].append(listdictprob[classi][word])
    
    minSum = []
    for lst in classes:
        minSum.append(np.prod(lst))

    return minSum.index(min(minSum))
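
For comparison, here is a minimal sketch of one common formulation of complement Naive Bayes, worked in log space so that long abstracts cannot overflow or underflow. It is not a fix of the functions above and has not been run on the assignment data. The names `train_complement_nb` and `predict_complement_nb` and the toy texts are hypothetical, and the class prior is left out for brevity.

```python
import math
from collections import Counter


def train_complement_nb(texts, labels, n_classes):
    """Return per-class log weights estimated from all *other* classes."""
    per_class = [Counter() for _ in range(n_classes)]
    for text, label in zip(texts, labels):
        per_class[label].update(text.split())
    total = Counter()
    for counts in per_class:
        total.update(counts)
    vocab_size = len(total)
    grand_total = sum(total.values())

    weights = []
    for counts in per_class:
        # Tokens outside this class, plus the add-one smoothing term
        comp_total = grand_total - sum(counts.values()) + vocab_size
        weights.append({
            word: math.log((total[word] - counts[word] + 1) / comp_total)
            for word in total
        })
    return weights


def predict_complement_nb(text, weights):
    """Pick the class whose complement explains the text worst."""
    scores = [
        sum(w[word] for word in text.split() if word in w)
        for w in weights
    ]
    return scores.index(min(scores))


# Toy data, invented for illustration only
texts = ["cell cell gene", "virus virus host", "cell gene gene", "virus host host"]
labels = [0, 1, 0, 1]
weights = train_complement_nb(texts, labels, n_classes=2)
print(predict_complement_nb("cell gene", weights))
```

The main differences from `dictionary_prob_complement` are the smoothing term in the denominator and the use of summed logs instead of a product of inverse probabilities.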