CS 762 A3 - yliu523

Data Preprocessing: 
trg.csv contains no missing data. During preprocessing, common stopwords and all punctuation are removed to improve training accuracy, all abstracts are converted to lowercase, and a simple manual stemming step strips the most common suffix (a trailing 's'). These methods were chosen after researching standard text preprocessing practice in industry.

Implementation:
Log probabilities: Instead of multiplying the fractional conditional probabilities, we sum the log probabilities of the words. Each abstract contains a large number of words, and each word has a small fractional probability. When these small fractions are multiplied together hundreds of times, the result can underflow: the value gets so small that Python represents it as zero. Because the logarithm is monotonically increasing, taking logs preserves the ordering of the class scores, so inferring the class with the maximum posterior probability still works the same way.
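
As a rough illustration of the underflow problem, the sketch below uses made-up numbers (300 words, each with probability 1e-5). The variable names are illustrative and are not taken from the notebook.

```python
import math

# Hypothetical document: 300 words, each with conditional probability 1e-5.
word_probs = [1e-5] * 300

# Multiplying the raw probabilities underflows to exactly 0.0 in double precision.
product = 1.0
for p in word_probs:
    product *= p

# Summing logs keeps the same information as a finite, comparable number.
log_sum = sum(math.log(p) for p in word_probs)

print(product)   # 0.0
print(log_sum)   # about -3453.9
```

The product is useless for comparing classes because every class would end up at zero, while the log sums remain distinct and can be compared directly.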

Word stemming: The data contains multiple versions of the same word (genes vs gene), so every word that ends in 's' has its final letter removed during preprocessing. This prevents the counts from being split between singular and plural forms that share the same meaning. Words that naturally end in 's' are truncated consistently, so their proportions remain the same.
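
A small sketch of what this suffix stripping does, using `str.removesuffix` (available from Python 3.9). The sample words are chosen for illustration and are not taken from trg.csv.

```python
# Sample words chosen for illustration only.
words = ["genes", "gene", "proteins", "virus", "analysis"]

# str.removesuffix strips one trailing 's' if it is present, otherwise it
# returns the word unchanged.
stems = [w.removesuffix("s") for w in words]

print(stems)
# ['gene', 'gene', 'protein', 'viru', 'analysi']
```

Plural and singular forms collapse onto one token, while words such as "virus" are shortened in the same way every time, so their counts stay intact as long as training and test text go through the same step.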

Extensive stop-words: Originally I had a list of only 20 stopwords. During implementation I found a more extensive list and wrote a Python script to import it for use in my own implementation. This increased the prediction accuracy slightly.
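
One way such an import step could look, assuming the list is stored one word per line in a plain-text file. The file name `stopwords.txt` and the helper `load_stopwords` are hypothetical and only serve the example.

```python
import tempfile
from pathlib import Path

def load_stopwords(path):
    """Read one stopword per line into a set (hypothetical helper)."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

# Demo with a small temporary file standing in for the real list.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "stopwords.txt"
    demo_path.write_text("the\nand\nOf\n\n", encoding="utf-8")
    stopwords = load_stopwords(demo_path)

tokens = ["the", "gene", "of", "interest"]
kept = [t for t in tokens if t not in stopwords]
print(kept)  # ['gene', 'interest']
```

Storing the words in a set rather than a list makes each membership test constant time, which matters when every word of every abstract is checked.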

Conclusion:
Several different implementations were tried; however, the base version, which considers all words and uses the extensive stopword list, yielded the best results. I was unable to optimize the TF-IDF runtime with my current structure and thus was only able to submit the base version of naive Bayes.
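
For reference, a minimal sketch of how TF-IDF weights can be computed in a single pass with `collections.Counter`. This is not the version that was attempted for the submission; the toy documents and the choice of idf = log(N / df) are assumptions made for the example.

```python
import math
from collections import Counter

# Toy tokenised documents (illustrative only).
docs = [
    ["gene", "protein", "gene"],
    ["protein", "virus"],
    ["gene", "virus", "virus"],
]

# Document frequency: number of documents that contain each word.
doc_freq = Counter(word for doc in docs for word in set(doc))
n_docs = len(docs)

def tfidf(doc):
    """Return {word: tf * idf} for one document, with idf = log(N / df)."""
    counts = Counter(doc)
    return {
        w: (c / len(doc)) * math.log(n_docs / doc_freq[w])
        for w, c in counts.items()
    }

weights = [tfidf(doc) for doc in docs]
print(weights[0])
```

Computing the document frequencies once up front keeps the total work roughly linear in the number of tokens, instead of rescanning the corpus for every word.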

Due to time constraints, I was only able to make an unsuccessful attempt at cross-validation and at advanced metrics for measuring success.
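
As an indication of what such metrics could look like, the sketch below computes per-class precision and recall from two label lists. The helper name `per_class_metrics` and the labels are made up for the example and are not results from this assignment.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall)} from two equal-length label lists."""
    tp = Counter()          # correct predictions per class
    pred_total = Counter()  # how often each class was predicted
    true_total = Counter()  # how often each class actually occurs
    for t, p in zip(y_true, y_pred):
        true_total[t] += 1
        pred_total[p] += 1
        if t == p:
            tp[t] += 1
    metrics = {}
    for label in sorted(set(true_total) | set(pred_total)):
        precision = tp[label] / pred_total[label] if pred_total[label] else 0.0
        recall = tp[label] / true_total[label] if true_total[label] else 0.0
        metrics[label] = (precision, recall)
    return metrics

# Made-up labels for illustration.
y_true = ["A", "B", "B", "E", "E", "V"]
y_pred = ["A", "B", "E", "E", "E", "B"]
print(per_class_metrics(y_true, y_pred))
```

Per-class figures are useful here because the classes are unbalanced, so overall accuracy alone can hide poor performance on the smaller classes.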

Overall model accuracy - test set: 83.66%

In [1]:
import csv
import string
import numpy as np
import math

with open('trg.csv', 'r', newline='') as file:
    data_iter = csv.reader(file, delimiter=',')
    next(data_iter)  # skip the header row
    data = list(data_iter)
   
data_array = np.asarray(data)

processed={}

# list of common english words
common_words= ['able', 'about', 'above', 'abroad', 'according', 'accordingly', 'across', 'actually', 'adj', 'after', 'afterwards', 'again', 'against', 'ago', 'ahead', "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'alongside', 'already', 'also', 'although', 'always', 'am', 'amid', 'amidst', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear', 'appreciate', 'appropriate', 'are', "aren't", 'around', 'as', "a's", 'aside', 'ask', 'asking', 'associated', 'at', 'available', 'away', 'awfully', 'back', 'backward', 'backwards', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'both', 'brief', 'but', 'by', 'came', 'can', 'cannot', 'cant', "can't", 'caption', 'cause', 'causes', 'certain', 'certainly', 'changes', 'clearly', "c'mon", 'co', 'co.', 'com', 'come', 'comes', 'concerning', 'consequently', 'consider', 'considering', 'contain', 'containing', 'contains', 'corresponding', 'could', "couldn't", 'course', "c's", 'currently', 'dare', "daren't", 'definitely', 'described', 'despite', 'did', "didn't", 'different', 'directly', 'do', 'does', "doesn't", 'doing', 'done', "don't", 'down', 'downwards', 'during', 'each', 'edu', 'eg', 'eight', 'eighty', 'either', 'else', 'elsewhere', 'end', 'ending', 'enough', 'entirely', 'especially', 'et', 'etc', 'even', 'ever', 'evermore', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'exactly', 'example', 'except', 'fairly', 'far', 'farther', 'few', 'fewer', 'fifth', 'first', 'five', 'followed', 'following', 'follows', 'for', 'forever', 'former', 'formerly', 'forth', 'forward', 'found', 'four', 'from', 'further', 'furthermore', 'get', 'gets', 'getting', 'given', 'gives', 'go', 'goes', 'going', 'gone', 'got', 'gotten', 'greetings', 'had', "hadn't", 'half', 'happens', 
'hardly', 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", 'hello', 'help', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', "here's", 'hereupon', 'hers', 'herself', "he's", 'hi', 'him', 'himself', 'his', 'hither', 'hopefully', 'how', 'howbeit', 'however', 'hundred', "i'd", 'ie', 'if', 'ignored', "i'll", "i'm", 'immediate', 'in', 'inasmuch', 'inc', 'inc.', 'indeed', 'indicate', 'indicated', 'indicates', 'inner', 'inside', 'insofar', 'instead', 'into', 'inward', 'is', "isn't", 'it', "it'd", "it'll", 'its', "it's", 'itself', "i've", 'just', 'k', 'keep', 'keeps', 'kept', 'know', 'known', 'knows', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 'let', "let's", 'like', 'liked', 'likely', 'likewise', 'little', 'look', 'looking', 'looks', 'low', 'lower', 'ltd', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', "mayn't", 'me', 'mean', 'meantime', 'meanwhile', 'merely', 'might', "mightn't", 'mine', 'minus', 'miss', 'more', 'moreover', 'most', 'mostly', 'mr', 'mrs', 'much', 'must', "mustn't", 'my', 'myself', 'name', 'namely', 'nd', 'near', 'nearly', 'necessary', 'need', "needn't", 'needs', 'neither', 'never', 'neverf', 'neverless', 'nevertheless', 'new', 'next', 'nine', 'ninety', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'no-one', 'nor', 'normally', 'not', 'nothing', 'notwithstanding', 'novel', 'now', 'nowhere', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'on', 'once', 'one', 'ones', "one's", 'only', 'onto', 'opposite', 'or', 'other', 'others', 'otherwise', 'ought', "oughtn't", 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'own', 'particular', 'particularly', 'past', 'per', 'perhaps', 'placed', 'please', 'plus', 'possible', 'presumably', 'probably', 'provided', 'provides', 'que', 'quite', 'qv', 'rather', 'rd', 're', 'really', 'reasonably', 'recent', 'recently', 'regarding', 'regardless', 'regards', 'relatively', 'respectively', 'right', 'round', 'said', 'same', 
'saw', 'say', 'saying', 'says', 'second', 'secondly', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sensible', 'sent', 'serious', 'seriously', 'seven', 'several', 'shall', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'since', 'six', 'so', 'some', 'somebody', 'someday', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specified', 'specify', 'specifying', 'still', 'sub', 'such', 'sup', 'sure', 'take', 'taken', 'taking', 'tell', 'tends', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', "that'll", 'thats', "that's", "that've", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', "there'd", 'therefore', 'therein', "there'll", "there're", 'theres', "there's", 'thereupon', "there've", 'these', 'they', "they'd", "they'll", "they're", "they've", 'thing', 'things', 'think', 'third', 'thirty', 'this', 'thorough', 'thoroughly', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'till', 'to', 'together', 'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', "t's", 'twice', 'two', 'un', 'under', 'underneath', 'undoing', 'unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'up', 'upon', 'upwards', 'us', 'use', 'used', 'useful', 'uses', 'using', 'usually', 'v', 'value', 'various', 'versus', 'very', 'via', 'viz', 'vs', 'want', 'wants', 'was', "wasn't", 'way', 'we', "we'd", 'welcome', 'well', "we'll", 'went', 'were', "we're", "weren't", "we've", 'what', 'whatever', "what'll", "what's", "what've", 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', "where's", 'whereupon', 'wherever', 'whether', 'which', 'whichever', 'while', 'whilst', 'whither', 'who', "who'd", 'whoever', 'whole', "who'll", 'whom', 'whomever', "who's", 'whose', 'why', 'will', 'willing', 'wish', 'with', 'within', 'without', 'wonder', "won't", 'would', "wouldn't", 'yes', 
'yet', 'you', "you'd", "you'll", 'your', "you're", 'yours', 'yourself', 'yourselves', "you've", 'zero', 'a', "how's", 'i', "when's", "why's", 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'j', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'uucp', 'w', 'x', 'y', 'z', 'I', 'www', 'amount', 'bill', 'bottom', 'call', 'computer', 'con', 'couldnt', 'cry', 'de', 'describe', 'detail', 'due', 'eleven', 'empty', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'forty', 'front', 'full', 'give', 'hasnt', 'herse', 'himse', 'interest', 'itse”', 'mill', 'move', 'myse”', 'part', 'put', 'show', 'side', 'sincere', 'sixty', 'system', 'ten', 'thick', 'thin', 'top', 'twelve', 'twenty', 'abst', 'accordance', 'act', 'added', 'adopted', 'affected', 'affecting', 'affects', 'ah', 'announce', 'anymore', 'apparently', 'approximately', 'aren', 'arent', 'arise', 'auth', 'beginning', 'beginnings', 'begins', 'biol', 'briefly', 'ca', 'date', 'ed', 'effect', 'et-al', 'ff', 'fix', 'gave', 'giving', 'heres', 'hes', 'hid', 'home', 'id', 'im', 'immediately', 'importance', 'important', 'index', 'information', 'invention', 'itd', 'keys', 'kg', 'km', 'largely', 'lets', 'line', "'ll", 'means', 'mg', 'million', 'ml', 'mug', 'na', 'nay', 'necessarily', 'nos', 'noted', 'obtain', 'obtained', 'omitted', 'ord', 'owing', 'page', 'pages', 'poorly', 'possibly', 'potentially', 'pp', 'predominantly', 'present', 'previously', 'primarily', 'promptly', 'proud', 'quickly', 'ran', 'readily', 'ref', 'refs', 'related', 'research', 'resulted', 'resulting', 'results', 'run', 'sec', 'section', 'shed', 'shes', 'showed', 'shown', 'showns', 'shows', 'significant', 'significantly', 'similar', 'similarly', 'slightly', 'somethan', 'specifically', 'state', 'states', 'stop', 'strongly', 'substantially', 'successfully', 'sufficiently', 'suggest', 'thered', 'thereof', 'therere', 'thereto', 'theyd', 'theyre', 'thou', 'thoughh', 'thousand', 'throug', 'til', 'tip', 'ts', 'ups', 'usefully', 'usefulness', "'ve", 'vol', 'vols', 'wed', 'whats', 'wheres', 
'whim', 'whod', 'whos', 'widely', 'words', 'world', 'youd', 'youre']



stop_set = set(common_words)  # set membership is O(1); the list would be scanned for every word

for abstract in data_array:  # use data_array[:20] to process only the first 20 abstracts
    classifier = abstract[1]
    # strip punctuation and lowercase the abstract before splitting it into words
    paragraph = abstract[2].translate(str.maketrans('', '', string.punctuation)).lower()
    clean_para = []
    for word in paragraph.split():
        if word not in stop_set:
            clean_para.append(word.removesuffix('s'))  # crude stemming, needs Python 3.9+

    # group the cleaned abstracts by class label
    processed.setdefault(classifier, []).append(clean_para)

# the processed dictionary {class: [word list per abstract]} is complete at the end of this block


In [2]:
def NaiveBayes(processed):
    structure = {}
    struc_const = {}
    all_w = []
    all_w_dupe = []
    pre_laplace = []
    prior = []

#     sparse_wc = []

    for paragraphs in processed.values():
        for para in paragraphs:
            all_w.extend(para)  # in-place extend avoids copying the whole list on every step
    
    all_w_uniq = np.unique(all_w)

    for key in processed: # a b c d looping
        class_word_dupe = []
        enum = []
        
#         term_freq = []
#         idf_list = []
        
        class_wordcounts = {}
        print(key)
        
        for para in processed[key]:
            class_word_dupe.extend(para)  # flatten all abstracts of this class into one word list

        # Count every unique word of this class in one pass, then add 1 to each count (Laplace smoothing).
        class_word_uniq, counts = np.unique(class_word_dupe, return_counts=True)

        # Map each unique word to its smoothed count.
        class_wordcounts = {w: int(n) + 1 for w, n in zip(class_word_uniq, counts)}
    
        
#         without_sparse = {i:j for i,j in class_wordcounts.items() if j > 50}
#         print('sparsity check')
#         print(len(class_wordcounts))
#         print(len(without_sparse))

        if key not in structure.keys(): # Initialize the final output (structure)
            structure[key] = class_wordcounts
            
        # sum of the smoothed counts for this class: total word count plus one per unique class word
        words_in_class = sum(class_wordcounts.values())
        
        pre_laplace.append(words_in_class)
        
    print('prelaplace vs prelaplace + laplace const')
    print(pre_laplace)
    
    laplace_con = len(all_w_uniq) # count of unique words in CSV

    denom = [x+laplace_con for x in pre_laplace]
    
    # PRIOR CALCULATION START
    all_row = 0
    row_in_class = []
    for key in processed:
        row_in_class.append(len(processed[key]))
        all_row += len(processed[key])

    prior_list = [x / all_row for x in row_in_class]
    
    z=0
    for key in processed:
        if key not in struc_const:
            struc_const[key] = {key + '_prior' : math.log(prior_list[z]),
                                key + '_empty' : math.log(1/denom[z])}
        z += 1
        
    # prior values and empty values
    print(struc_const)
    
    for c, (cls, counts) in enumerate(structure.items()):
        for word, n in counts.items():
            # store the log of (smoothed count / class denominator) for each word
            struc_const[cls][word] = math.log(n / denom[c])

#   print('Structure AFTER decimal and LOG calculations')
    return struc_const

    
task1 = NaiveBayes(processed)

B
A
E
V
prelaplace vs prelaplace + laplace const
[168635, 16870, 232633, 15301]
{'B': {'B_prior': -0.9150415124737231, 'B_empty': -12.193706498942605}, 'A': {'A_prior': -3.4420193761824103, 'A_empty': -10.731537060559845}, 'E': {'E_prior': -0.623621117911335, 'E_empty': -12.474342514460494}, 'V': {'V_prior': -3.4577677331505496, 'V_empty': -10.696661047163198}}


In [3]:
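# read in a test file and apply the same preprocessing as the training data
def read_tst(file):
    stop_set = set(common_words)
    strip_punct = str.maketrans('', '', string.punctuation)
    with open(file, 'r', newline='') as destination:
        data_iterator = csv.reader(destination, delimiter=',')
        next(data_iterator)  # skip the header row
        data = []
        for row in data_iterator:
            words = row[1].translate(strip_punct).lower().split()
            # drop stopwords and strip a trailing 's' so test tokens match the training vocabulary
            data.append([w.removesuffix('s') for w in words if w not in stop_set])

    return data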

tst = read_tst("tst.csv")

In [4]:
# pre-initialize the prior values in the fixed class order A, B, E, V
priors = [task1[c][c + '_prior'] for c in ['A', 'B', 'E', 'V']]

# print(priors)

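# MAKE PREDICTION, sentence = a row (list of words), pred_structure = log prob dictionary.
def predict(sentence, pred_structure):

    class_name = list(pred_structure)  # the classes the model was trained on
    class_prob = []

    for c in class_name:
        word_logp = pred_structure[c]
        # start from the log prior, then add one log likelihood per word
        total = word_logp[c + '_prior']
        for word in sentence:
            # words never seen in this class fall back to the smoothed '_empty' value
            total += word_logp.get(word, word_logp[c + '_empty'])
        class_prob.append(total)

    # log probabilities are negative, so the most probable class has the largest (least negative) sum
    final_prediction = class_name[class_prob.index(max(class_prob))]

    return final_prediction  # return either A B E or V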


final_counts = []

# tst is test data, it should have 1000 rows, for every row, run predict and append the letter to empty list we inited before.
for para in tst:
    final_counts.append(predict(para, task1))


# print(final_counts)

# print(final_counts.count('A'))
# print(final_counts.count('B'))
# print(final_counts.count('E'))
# print(final_counts.count('V'))



In [5]:
# Print results from above into a CSV file

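def print_to_csv(pred_list):
    # 'w' overwrites any earlier file, so re-running the cell does not append duplicate rows
    with open('yliu523.csv', 'w') as file:
        file.write('id,class\n')
        # ids start at 1, one row per prediction
        for row_id, pred in enumerate(pred_list, start=1):
            file.write(f'{row_id},{pred}\n')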

print_to_csv(final_counts) # This creates the file for kaggle handin

In [6]:
# cross validation code:
import random

def CV(trg):
    accuracy = []

    with open(trg, 'r') as file:
        data_iter=csv.reader(file,delimiter=',')
        next(data_iter)
        data=[data for data in data_iter]

    data_array = np.asarray(data)
    processed={}

    # common_words is the stopword list already defined in cell In [1]; it is reused here.

    for abstract in data_array: # use [1:21] to get first 20 
        classifier = abstract[1]
        paragraph = abstract[2].translate(str.maketrans('','',string.punctuation)).lower()
        p= paragraph.split()
        clean_para= []
        for word in p:
            if word not in common_words:
                clean_para.append(word.removesuffix('s'))

        if classifier not in processed.keys():
            processed[classifier]=[]
            processed[classifier].append(clean_para)
        else:
            processed[classifier].append(clean_para)
            
    acc = []
    for i in range(10):
        X, y = xFold(processed, i, folds = 10)
        
        X_train = X
        y_train = [i[2].translate(str.maketrans('','',string.punctuation)).lower() for i in y]
        
        for i in range(len(y_train)):
            y_train[i] = y_train[i].split()
            
        y_test = y[:,1]
        
        X_struc = naiveBayes(X_train)
        
        y_list = []
        
        for i in y_train:
            y_list.append(predict(i, X_struc))
        
        acc.append(sum(1 for i,j in zip(y_test, y_list) if i == j)) / len(y_list)
        
    return acc
        

def xFold(data, i, folds):
    np.random.seed()
    
    test = data[0::folds]
    train = [i for i in data if i not in test]
    
    return train, test


test = CV('trg.csv')
print(test)


TypeError: unhashable type: 'slice'
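
The error above arises because `processed` is a dictionary, and `data[0::folds]` in `xFold` tries to slice it like a list. A possible way to structure k-fold cross-validation is sketched below, assuming the data is kept as a flat list of (label, tokens) rows. The names `k_fold_accuracy`, `train_majority` and `predict_majority` are hypothetical, and the majority-class model is only a stand-in so that the example runs on its own.

```python
from collections import Counter

def k_fold_accuracy(rows, train_fn, predict_fn, folds=10):
    """rows: list of (label, tokens). Returns one accuracy per fold."""
    accuracies = []
    for i in range(folds):
        # Every folds-th row, starting at offset i, forms the held-out fold.
        test = rows[i::folds]
        train = [r for j, r in enumerate(rows) if j % folds != i]
        model = train_fn(train)
        correct = sum(1 for label, tokens in test
                      if predict_fn(tokens, model) == label)
        accuracies.append(correct / len(test))
    return accuracies

# Stand-in model for the demo: always predict the most common training label.
def train_majority(train_rows):
    return Counter(label for label, _ in train_rows).most_common(1)[0][0]

def predict_majority(tokens, model):
    return model

# Made-up rows: 8 of class "E" followed by 2 of class "B".
rows = [("E", ["gene"])] * 8 + [("B", ["virus"])] * 2
print(k_fold_accuracy(rows, train_majority, predict_majority, folds=5))
# [1.0, 1.0, 1.0, 0.5, 0.5]
```

In this notebook, `train_fn` would presumably group the training rows by class and call `NaiveBayes`, and `predict_fn` would be `predict`; shuffling the rows once before splitting would also avoid folds that depend on the file order.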