## Bigrams with Negation Handling

This section reuses the setup from the bigram section but applies it to the dataset created in the negation handling section.

The first cells do the groundwork: they create and load the negation-handled dataset and build the model classes for bigrams instead of unigrams.

Since this repeats earlier material, it's once again best to skip down to the "Naive Bayes" subheading.
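
As a rough illustration of what the two preprocessing steps do together, the sketch below prefixes every token after a word ending in "n't" with `NOT_` until the next punctuation token, then pairs adjacent tokens into bigrams. The helper names `mark_negation` and `bigrams` are hypothetical and simplified; they are not the notebook's own functions, and the punctuation test here only recognises single-character tokens.

```python
import string

def mark_negation(tokens):
    """Prefix tokens that follow an "n't" word with NOT_ until punctuation."""
    out = []
    negating = False
    for token in tokens:
        if len(token) == 1 and token in string.punctuation:
            # a punctuation token ends the negated span
            negating = False
            out.append(token)
        elif negating:
            out.append('NOT_' + token)
        else:
            out.append(token)
        if token.endswith("n't"):
            negating = True
    return out

def bigrams(tokens):
    """Join each pair of adjacent tokens with a space."""
    return [a + ' ' + b for a, b in zip(tokens, tokens[1:])]

sentence = "i didn't like this movie , but the cast was good".split()
marked = mark_negation(sentence)
print(marked)
print(bigrams(marked)[:3])
```

The negated tokens become distinct vocabulary entries, so a bigram such as `didn't NOT_like` is counted separately from `did like`.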

In [23]:
import os
import string
pos_path = "./data/txt_sentoken/pos/"
pos_path_out = "./data/txt_sentoken_negation/pos_negation/"
neg_path = "./data/txt_sentoken/neg/"
neg_path_out = "./data/txt_sentoken_negation/neg_negation/"

In [24]:
def handle_negation(in_path, out_path):
    ''' Copy every file in in_path to out_path, prefixing each token that
        follows a word ending in "n't" with NOT_ until the next
        punctuation token or the end of the line.
    '''
    for file in os.listdir(in_path):
        new_file = file + "_new.txt"
        new_file_sentences = []
        with open(in_path + file, 'r') as f, open(out_path + new_file, 'w') as f_out:
            for line in f:
                new_line = ''
                tokens = line.split()
                i = 0
                while i < len(tokens):
                    # the current token is always copied unchanged
                    new_line = new_line + tokens[i] + ' '
                    if tokens[i][-3:] == "n't":
                        # mark the following tokens as negated
                        try:
                            while tokens[i+1] not in string.punctuation:
                                new_line = new_line + 'NOT_' + tokens[i+1] + ' '
                                i+=1
                        except IndexError:
                            # reached the end of the line without punctuation
                            pass
                    i+=1
                new_file_sentences.append(new_line + '\n')

            f_out.writelines(new_file_sentences)

In [25]:
import os
import tarfile
import time

class PL04DataLoader_Part_1:
    
    def __init__(self):
        pass
    
    def get_labelled_dataset(self, fold = 0):
        ''' Compile a fold of the data set
        '''
        dataset = []
        for label in ('pos_negation', 'neg_negation'):
            for document in self.get_documents(
                fold = fold,
                label = label,
            ):
                dataset.append((document, label))
        return dataset
    
    def get_documents(self, fold = 0, label = 'pos'):
        ''' Enumerate the tokenised contents of all data set files
            for one fold and label.
            Args:
                fold: which fold to load (0 to n_folds-1)
                label: name of the class subfolder to read, e.g.
                    'pos_negation' or 'neg_negation' for the
                    negation-handled data
            Return:
                Iterable of tokenised documents, each a list of sentences
                that in turn are lists of tokens
        '''
        raise NotImplementedError

class PL04DataLoader(PL04DataLoader_Part_1):
    
    def get_xval_splits(self):
        ''' Split data with labels for cross-validation.
            Returns a list of k pairs (training_data, test_data)
            for k-fold cross-validation (k = 10 here).
        '''
        # load the folds
        folds = []
        for i in range(10):
            folds.append(self.get_labelled_dataset(
                fold = i
            ))
        # create training-test splits
        retval = []
        for i in range(10):
            test_data = folds[i]
            training_data = []
            for j in range(9):
                ij1 = (i+j+1) % 10
                assert ij1 != i
                training_data = training_data + folds[ij1]
            retval.append((training_data, test_data))
        return retval
    

class PL04DataLoaderFromStream(PL04DataLoader):
        
    def __init__(self, tgz_stream, **kwargs):
        super().__init__(**kwargs)
        self.data = {}
        counter = 0
        with tarfile.open(
            mode = 'r|gz',
            fileobj = tgz_stream
        ) as tar_archive:
            for tar_member in tar_archive:
                if counter == 2000:
                    break
                path_components = tar_member.name.split('/')
                filename = path_components[-1]
                if filename.startswith('cv') \
                and filename.endswith('.txt') \
                and '_' in filename:
                    label = path_components[-2]
                    fold = int(filename[2])
                    key = (fold, label)
                    if key not in self.data:
                        self.data[key] = []
                    f = tar_archive.extractfile(tar_member)
                    document = [
                        line.decode('utf-8').split()
                        for line in f.readlines()
                    ]
                    self.data[key].append(document)
                    counter += 1
            
    def get_documents(self, fold = 0, label = 'pos'):
        return self.data[(fold, label)]
    


class PL04DataLoaderFromFolder(PL04DataLoader):
    
    def __init__(self, data_dir, **kwargs):
        self.data_dir = data_dir
        super().__init__(**kwargs)
        
    def get_documents(self, fold = 0, label = 'pos_negation'):
        # read folder contents
        path = os.path.join(self.data_dir, label)
        dir_entries = os.listdir(path)
        # must process entries in numeric order to
        # replicate order of original experiments
        dir_entries.sort()
        # check each entry and add to data if matching
        # selection criteria
        for filename in dir_entries:
            if filename.startswith('cv') \
            and filename.endswith('.txt'):
                if fold == int(filename[2]):
                    # correct fold
                    # "yield" tells Python to return an iterator
                    # object that produces the yields of this
                    # function as elements without creating a
                    # full list of all elements
                    with open(os.path.join(path, filename), 'rt') as f:
                        document = [line.split() for line in f]
                    yield document

In [26]:
dir_entries = os.listdir()
dir_entries.sort()

In [27]:
data_loader = PL04DataLoaderFromFolder("./data/txt_sentoken_negation/")

In [28]:
# test "get_documents()"

def get_document_preview(document, max_length = 72):
    s = []
    count = 0
    reached_limit = False
    for sentence in document:
        i = 0
        while (i < len(sentence) - 1):
            token = sentence[i] + ' ' + sentence[i+1]
            if count + len(token) + len(s) > max_length:
                reached_limit = True
                break

            s.append(token)
            count += len(token)
            i+=1
        if reached_limit:
            break
    return '|'.join(s)
    
for label in 'pos_negation neg_negation'.split():
    print(f'== {label} ==')
    print('doc sentences start of first sentence')
    for index, document in enumerate(data_loader.get_documents(
        label = label
    )):
        print('%3d %7d   %s' %(
            index, len(document), get_document_preview(document)
        ))
        if index == 4:
            break

== pos_negation ==
doc sentences start of first sentence
  0      25   films adapted|adapted from|from comic|comic books|books have|have had
  1      39   every now|now and|and then|then a|a movie|movie comes|comes along
  2      19   you've got|got mail|mail works|works alot|alot better|better than
  3      42   " jaws|jaws "|" is|is a|a rare|rare film|film that|that grabs|grabs your
  4      25   moviemaking is|is a|a lot|lot like|like being|being the|the general
== neg_negation ==
doc sentences start of first sentence
  0      35   plot :|: two|two teen|teen couples|couples go|go to|to a|a church
  1      13   the happy|happy bastard's|bastard's quick|quick movie|movie review
  2      23   it is|is movies|movies like|like these|these that|that make|make a
  3      19   " quest|quest for|for camelot|camelot "|" is|is warner|warner bros
  4      37   synopsis :|: a|a mentally|mentally unstable|unstable man|man undergoing


In [29]:
# test "get_xval_splits()"

splits = data_loader.get_xval_splits()

print('tr-size te-size (number of documents)')
for xval_tr_data, xval_te_data in splits:
    print('%7d %7d' %(len(xval_tr_data), len(xval_te_data)))

tr-size te-size (number of documents)
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
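
The sizes above follow from the rotation in `get_xval_splits()`: each fold serves as the test set once, and the remaining folds are concatenated into the training set. A minimal sketch of the same rotation on toy data, assuming three folds of two items each (the function name `xval_splits` and the toy fold contents are made up for illustration):

```python
def xval_splits(folds):
    """Return one (training, test) pair per fold."""
    k = len(folds)
    splits = []
    for i in range(k):
        test = folds[i]
        training = []
        # walk through the other k-1 folds, starting after fold i
        for j in range(1, k):
            training = training + folds[(i + j) % k]
        splits.append((training, test))
    return splits

toy_folds = [['a0', 'a1'], ['b0', 'b1'], ['c0', 'c1']]
toy_splits = xval_splits(toy_folds)
for training, test in toy_splits:
    print(len(training), len(test), test)
```

With ten folds of 200 documents each, the same scheme gives 1800 training and 200 test documents per split.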


In [30]:
class PolarityPredictorInterface:

    def train(self, data_with_labels):
        raise NotImplementedError
        
    def predict(self, data):
        raise NotImplementedError

In [31]:
class PolarityPredictorWithVocabulary(PolarityPredictorInterface):
    
    def train(self, data_with_labels):
        self.reset_vocab()
        self.add_to_vocab_from_data(data_with_labels)
        self.finalise_vocab()
        tr_features = self.extract_features(
            data_with_labels
        )
        tr_targets = self.get_targets(data_with_labels)
        self.train_model_on_features(tr_features, tr_targets)
        
    def reset_vocab(self):
        self.vocab = set()
        
    def add_to_vocab_from_data(self, data):
        for document, label in data:
            for sentence in document:
                i = 0
                while (i < len(sentence) - 1):
                    token = sentence[i] + ' ' + sentence[i+1]
                    self.vocab.add(token)
                    i+=1

    def finalise_vocab(self):
        self.vocab = list(self.vocab)
        # create reverse map for fast token lookup
        self.token2index = {}
        for index, token in enumerate(self.vocab):
            self.token2index[token] = index
        
    def extract_features(self, data):
        raise NotImplementedError
    
    def get_targets(self, data, label2index = None):
        raise NotImplementedError
        
    def train_model_on_features(self, tr_features, tr_targets):
        raise NotImplementedError

In [32]:
import numpy

class PolarityPredictorWithBagOfWords_01(PolarityPredictorWithVocabulary):
    
    def __init__(self, clip_counts = True):
        self.clip_counts = clip_counts
        
    def extract_features(self, data):
        # create numpy array of required size
        columns = len(self.vocab)
        rows = len(data)
        features = numpy.zeros((rows, columns), dtype=numpy.int32)        
        # populate feature matrix
        for row, item in enumerate(data):
            document, _ = item
            for sentence in document:

                i = 0
                while (i < len(sentence)-1):
                    token = sentence[i] + ' ' + sentence[i+1]
                    i+=1

                    try:
                        index = self.token2index[token]
                    except KeyError:
                        # token not in vocab
                        # --> skip this token
                        # --> continue with next token
                        continue
                    if self.clip_counts:
                        features[row, index] = 1
                    else:
                        features[row, index] += 1

        return features
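
To make the effect of `clip_counts` concrete, here is a small pure-Python sketch of the same idea on two toy documents that are already lists of bigrams. The function `bigram_features` and the toy data are hypothetical; the class above uses a numpy array and builds the bigrams itself.

```python
def bigram_features(documents, clip_counts=True):
    """Build a bag-of-bigrams matrix as a list of rows."""
    vocab = sorted({bg for doc in documents for bg in doc})
    index = {bg: i for i, bg in enumerate(vocab)}
    rows = []
    for doc in documents:
        row = [0] * len(vocab)
        for bg in doc:
            if clip_counts:
                row[index[bg]] = 1      # presence or absence only
            else:
                row[index[bg]] += 1     # number of occurrences
        rows.append(row)
    return vocab, rows

docs = [['not good', 'not good', 'good film'], ['good film']]
vocab, clipped = bigram_features(docs, clip_counts=True)
_, counted = bigram_features(docs, clip_counts=False)
print(vocab)
print(clipped)
print(counted)
```

With clipping, the repeated bigram `not good` contributes 1 to the first row; without clipping it contributes 2.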

In [33]:
class PolarityPredictorWithBagOfWords(PolarityPredictorWithBagOfWords_01):
 
    def get_targets(self, data):
        ''' create column vector with target labels
        '''
        # prepare target vector
        targets = numpy.zeros(len(data), dtype=numpy.int8)
        index = 0
        for _, label in data:
            if label == 'pos_negation':
                targets[index] = 1
            index += 1
        return targets

    def train_model_on_features(self, tr_features, tr_targets):
        raise NotImplementedError

## Naive Bayes

In [34]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

class PolarityPredictorBowNB(PolarityPredictorWithBagOfWords):

    def train_model_on_features(self, tr_features, tr_targets):
        # pass numpy array to sklearn to train NB
        self.model = MultinomialNB()
        self.model.fit(tr_features, tr_targets)
        
    def predict(
        self, data, get_accuracy = False,
        get_confusion_matrix = False
    ):
        features = self.extract_features(data)
        # use numpy to get predictions
        y_pred = self.model.predict(features)
        # restore labels
        labels = []
        for is_positive in y_pred:
            if is_positive:
                labels.append('pos_negation')
            else:
                labels.append('neg_negation')
        if get_accuracy or get_confusion_matrix:
            retval = []
            retval.append(labels)
            y_true = self.get_targets(data)
            if get_accuracy:
                retval.append(
                    metrics.accuracy_score(y_true, y_pred)
                )
            if get_confusion_matrix:
                retval.append(
                    metrics.confusion_matrix(y_true, y_pred)
                )
            return retval
        else:
            return labels

In [35]:
# first functionality test

model = PolarityPredictorBowNB()
model.train(splits[0][0]) 

In [36]:
def print_first_predictions(model, te_data, n = 12):
    predictions = model.predict(te_data)
    for i in range(n):
        document, label = te_data[i]
        prediction = predictions[i]
        print('%4d %s %s %s' %(
            i, label, prediction,
            get_document_preview(document),
        ))
    
print_first_predictions(model, splits[0][1])

   0 pos_negation pos_negation films adapted|adapted from|from comic|comic books|books have|have had
   1 pos_negation pos_negation every now|now and|and then|then a|a movie|movie comes|comes along
   2 pos_negation pos_negation you've got|got mail|mail works|works alot|alot better|better than
   3 pos_negation pos_negation " jaws|jaws "|" is|is a|a rare|rare film|film that|that grabs|grabs your
   4 pos_negation neg_negation moviemaking is|is a|a lot|lot like|like being|being the|the general
   5 pos_negation pos_negation on june|june 30|30 ,|, 1960|1960 ,|, a|a self-taught|self-taught ,
   6 pos_negation pos_negation apparently ,|, director|director tony|tony kaye|kaye had|had a|a major
   7 pos_negation pos_negation one of|of my|my colleagues|colleagues was|was surprised|surprised when
   8 pos_negation pos_negation after bloody|bloody clashes|clashes and|and independence
   9 pos_negation pos_negation the american|american action|action film|film has|has been|been slowly
  10 pos_n

In [37]:
labels, accuracy, confusion_matrix = model.predict(
    splits[0][1], get_accuracy = True, get_confusion_matrix = True
)

print(accuracy)
print(confusion_matrix)

0.82
[[77 23]
 [13 87]]
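
For reading this matrix: scikit-learn's `confusion_matrix` puts true classes in rows and predicted classes in columns, ordered by label, so with the 0/1 targets from `get_targets()` row 0 is the negative class and row 1 the positive class. The sketch below derives precision, recall and F1 for either class from the matrix printed above, using plain lists; the helper `class_metrics` is an illustrative assumption and not part of the notebook. Under this layout, the fold 1 "Precision" (0.77) and "Recall" (0.8556) printed by `evaluate_model` in the next cell correspond to the recall and precision of the negative class respectively.

```python
def class_metrics(cm, positive=1):
    """Precision, recall and F1 for one class of a 2x2 confusion matrix.

    cm[i][j] counts documents whose true class is i and predicted class is j.
    """
    negative = 1 - positive
    tp = cm[positive][positive]
    fp = cm[negative][positive]   # other class predicted as this class
    fn = cm[positive][negative]   # this class predicted as the other class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

cm = [[77, 23], [13, 87]]
pos_metrics = class_metrics(cm, positive=1)
neg_metrics = class_metrics(cm, positive=0)
print('positive class:', pos_metrics)
print('negative class:', neg_metrics)
```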


In [38]:
def evaluate_model(model, splits, verbose = False):
    accuracies = []
    f1s = []
    for fold, (tr_data, te_data) in enumerate(splits):
        if verbose:
            print('Evaluating fold %d of %d' %(fold+1, len(splits)))
        model.train(tr_data)
        _, accuracy, confusion_matrix = model.predict(te_data, get_accuracy = True, get_confusion_matrix = True)

        # NOTE: sklearn lays the matrix out as [[tn, fp], [fn, tp]] with
        # class 0 (negative) first, so this unpacking treats the negative
        # class as the target class: "prec" below is the recall of the
        # negative class and "rec" is its precision. F1 is symmetric in
        # the two, so f1 is the F1 score of the negative class.
        tp, fp, fn, tn = confusion_matrix[0][0], confusion_matrix[0][1], confusion_matrix[1][0], confusion_matrix[1][1]
        prec = tp/(tp + fp)
        rec = tp/(tp + fn)
        f1 = (2*prec*rec)/(prec+rec)

        accuracies.append(accuracy)
        f1s.append(f1)
        if verbose:
            print('Accuracy -->', accuracy)
            print('Precision -->', prec)
            print('Recall -->', rec)
            print('F1 -->', f1)
            print()
    n = float(len(f1s))
    avg = sum(f1s) / n
    # population standard deviation of the F1 scores
    variance = sum([(x-avg)**2 for x in f1s]) / n
    return (avg, variance**0.5, min(f1s),
            max(f1s))

# this takes about 3 minutes
print(evaluate_model(model, splits, verbose = True))

Evaluating fold 1 of 10
Accuracy --> 0.82
Precision --> 0.77
Recall --> 0.8555555555555555
F1 --> 0.8105263157894737

Evaluating fold 2 of 10
Accuracy --> 0.88
Precision --> 0.86
Recall --> 0.8958333333333334
F1 --> 0.8775510204081632

Evaluating fold 3 of 10
Accuracy --> 0.84
Precision --> 0.82
Recall --> 0.8541666666666666
F1 --> 0.836734693877551

Evaluating fold 4 of 10
Accuracy --> 0.88
Precision --> 0.82
Recall --> 0.9318181818181818
F1 --> 0.8723404255319149

Evaluating fold 5 of 10
Accuracy --> 0.81
Precision --> 0.77
Recall --> 0.8369565217391305
F1 --> 0.8020833333333334

Evaluating fold 6 of 10
Accuracy --> 0.84
Precision --> 0.77
Recall --> 0.8953488372093024
F1 --> 0.8279569892473118

Evaluating fold 7 of 10
Accuracy --> 0.875
Precision --> 0.82
Recall --> 0.9213483146067416
F1 --> 0.8677248677248677

Evaluating fold 8 of 10
Accuracy --> 0.86
Precision --> 0.8
Recall --> 0.9090909090909091
F1 --> 0.8510638297872342

Evaluating fold 9 of 10
Accuracy --> 0.845
Precision --> 

## Logistic Regression

In [39]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

class PolarityPredictorBowLR(PolarityPredictorWithBagOfWords):

    def train_model_on_features(self, tr_features, tr_targets):
        # pass numpy array to sklearn to train Logistic Regression
        # iterations set to 1000 as default of 100 didn't guarantee convergence with our data
        self.model = LogisticRegression(max_iter=1000)
        self.model.fit(tr_features, tr_targets)
        
    def predict(
        self, data, get_accuracy = False,
        get_confusion_matrix = False
    ):
        features = self.extract_features(data)
        # use numpy to get predictions
        y_pred = self.model.predict(features)
        # restore labels
        labels = []
        for is_positive in y_pred:
            if is_positive:
                labels.append('pos')
            else:
                labels.append('neg')
        if get_accuracy or get_confusion_matrix:
            retval = []
            retval.append(labels)
            y_true = self.get_targets(data)
            if get_accuracy:
                retval.append(
                    metrics.accuracy_score(y_true, y_pred)
                )
            if get_confusion_matrix:
                retval.append(
                    metrics.confusion_matrix(y_true, y_pred)
                )
            return retval
        else:
            return labels

In [40]:
model = PolarityPredictorBowLR()
model.train(splits[0][0]) 

In [41]:
print_first_predictions(model, splits[0][1])

   0 pos_negation pos films adapted|adapted from|from comic|comic books|books have|have had
   1 pos_negation pos every now|now and|and then|then a|a movie|movie comes|comes along
   2 pos_negation neg you've got|got mail|mail works|works alot|alot better|better than
   3 pos_negation pos " jaws|jaws "|" is|is a|a rare|rare film|film that|that grabs|grabs your
   4 pos_negation neg moviemaking is|is a|a lot|lot like|like being|being the|the general
   5 pos_negation pos on june|june 30|30 ,|, 1960|1960 ,|, a|a self-taught|self-taught ,
   6 pos_negation pos apparently ,|, director|director tony|tony kaye|kaye had|had a|a major
   7 pos_negation pos one of|of my|my colleagues|colleagues was|was surprised|surprised when
   8 pos_negation pos after bloody|bloody clashes|clashes and|and independence
   9 pos_negation pos the american|american action|action film|film has|has been|been slowly
  10 pos_negation pos after watching|watching "|" rat|rat race|race "|" last|last week|week ,
  11 p

In [42]:
labels, accuracy, confusion_matrix = model.predict(
    splits[0][1], get_accuracy = True, get_confusion_matrix = True
)

print(accuracy)
print(confusion_matrix)

0.835
[[87 13]
 [20 80]]


In [43]:
print(evaluate_model(model, splits, verbose = True))

Evaluating fold 1 of 10
Accuracy --> 0.835
Precision --> 0.87
Recall --> 0.8130841121495327
F1 --> 0.8405797101449274

Evaluating fold 2 of 10
Accuracy --> 0.86
Precision --> 0.89
Recall --> 0.839622641509434
F1 --> 0.8640776699029127

Evaluating fold 3 of 10
Accuracy --> 0.81
Precision --> 0.91
Recall --> 0.7583333333333333
F1 --> 0.8272727272727273

Evaluating fold 4 of 10
Accuracy --> 0.825
Precision --> 0.84
Recall --> 0.8155339805825242
F1 --> 0.8275862068965517

Evaluating fold 5 of 10
Accuracy --> 0.79
Precision --> 0.8
Recall --> 0.7843137254901961
F1 --> 0.792079207920792

Evaluating fold 6 of 10
Accuracy --> 0.82
Precision --> 0.88
Recall --> 0.7857142857142857
F1 --> 0.830188679245283

Evaluating fold 7 of 10
Accuracy --> 0.83
Precision --> 0.8
Recall --> 0.851063829787234
F1 --> 0.8247422680412372

Evaluating fold 8 of 10
Accuracy --> 0.82
Precision --> 0.81
Recall --> 0.826530612244898
F1 --> 0.8181818181818183

Evaluating fold 9 of 10
Accuracy --> 0.835
Precision --> 0.88

## Decision Tree

In [44]:
from sklearn.tree import DecisionTreeClassifier

class PolarityPredictorBowDT(PolarityPredictorWithBagOfWords):

    def train_model_on_features(self, tr_features, tr_targets):
        # pass numpy array to sklearn to train a decision tree
        # with default settings
        self.model = DecisionTreeClassifier()
        self.model.fit(tr_features, tr_targets)
        
    def predict(
        self, data, get_accuracy = False,
        get_confusion_matrix = False
    ):
        features = self.extract_features(data)
        # use numpy to get predictions
        y_pred = self.model.predict(features)
        # restore labels
        labels = []
        for is_positive in y_pred:
            if is_positive:
                labels.append('pos')
            else:
                labels.append('neg')
        if get_accuracy or get_confusion_matrix:
            retval = []
            retval.append(labels)
            y_true = self.get_targets(data)
            if get_accuracy:
                retval.append(
                    metrics.accuracy_score(y_true, y_pred)
                )
            if get_confusion_matrix:
                retval.append(
                    metrics.confusion_matrix(y_true, y_pred)
                )
            return retval
        else:
            return labels

In [45]:
model = PolarityPredictorBowDT()
model.train(splits[0][0])

In [46]:
labels, accuracy, confusion_matrix = model.predict(
    splits[0][1], get_accuracy = True, get_confusion_matrix = True
)

print(accuracy)
print(confusion_matrix)

0.585
[[54 46]
 [37 63]]


In [47]:
print(evaluate_model(model, splits, verbose = True))

Evaluating fold 1 of 10
Accuracy --> 0.62
Precision --> 0.56
Recall --> 0.6363636363636364
F1 --> 0.5957446808510639

Evaluating fold 2 of 10
Accuracy --> 0.565
Precision --> 0.49
Recall --> 0.5764705882352941
F1 --> 0.5297297297297296

Evaluating fold 3 of 10
Accuracy --> 0.645
Precision --> 0.69
Recall --> 0.6330275229357798
F1 --> 0.6602870813397128

Evaluating fold 4 of 10
Accuracy --> 0.615
Precision --> 0.59
Recall --> 0.6210526315789474
F1 --> 0.6051282051282052

Evaluating fold 5 of 10
Accuracy --> 0.59
Precision --> 0.57
Recall --> 0.59375
F1 --> 0.5816326530612245

Evaluating fold 6 of 10
Accuracy --> 0.64
Precision --> 0.68
Recall --> 0.6296296296296297
F1 --> 0.6538461538461539

Evaluating fold 7 of 10
Accuracy --> 0.595
Precision --> 0.59
Recall --> 0.5959595959595959
F1 --> 0.5929648241206029

Evaluating fold 8 of 10
Accuracy --> 0.61
Precision --> 0.64
Recall --> 0.6037735849056604
F1 --> 0.6213592233009708

Evaluating fold 9 of 10
Accuracy --> 0.6
Precision --> 0.63
Rec

## Support Vector Machine

In [48]:
from sklearn import svm

class PolarityPredictorBowSVM(PolarityPredictorWithBagOfWords):

    def train_model_on_features(self, tr_features, tr_targets):
        # pass numpy array to sklearn to train a support vector
        # classifier with default settings
        self.model = svm.SVC()
        self.model.fit(tr_features, tr_targets)
        
    def predict(
        self, data, get_accuracy = False,
        get_confusion_matrix = False
    ):
        features = self.extract_features(data)
        # use numpy to get predictions
        y_pred = self.model.predict(features)
        # restore labels
        labels = []
        for is_positive in y_pred:
            if is_positive:
                labels.append('pos')
            else:
                labels.append('neg')
        if get_accuracy or get_confusion_matrix:
            retval = []
            retval.append(labels)
            y_true = self.get_targets(data)
            if get_accuracy:
                retval.append(
                    metrics.accuracy_score(y_true, y_pred)
                )
            if get_confusion_matrix:
                retval.append(
                    metrics.confusion_matrix(y_true, y_pred)
                )
            return retval
        else:
            return labels

In [49]:
model = PolarityPredictorBowSVM()
model.train(splits[0][0])

In [50]:
print_first_predictions(model, splits[0][1])

   0 pos_negation neg films adapted|adapted from|from comic|comic books|books have|have had
   1 pos_negation neg every now|now and|and then|then a|a movie|movie comes|comes along
   2 pos_negation neg you've got|got mail|mail works|works alot|alot better|better than
   3 pos_negation pos " jaws|jaws "|" is|is a|a rare|rare film|film that|that grabs|grabs your
   4 pos_negation neg moviemaking is|is a|a lot|lot like|like being|being the|the general
   5 pos_negation pos on june|june 30|30 ,|, 1960|1960 ,|, a|a self-taught|self-taught ,
   6 pos_negation pos apparently ,|, director|director tony|tony kaye|kaye had|had a|a major
   7 pos_negation pos one of|of my|my colleagues|colleagues was|was surprised|surprised when
   8 pos_negation neg after bloody|bloody clashes|clashes and|and independence
   9 pos_negation pos the american|american action|action film|film has|has been|been slowly
  10 pos_negation pos after watching|watching "|" rat|rat race|race "|" last|last week|week ,
  11 p

In [51]:
labels, accuracy, confusion_matrix = model.predict(
    splits[0][1], get_accuracy = True, get_confusion_matrix = True
)

print(accuracy)
print(confusion_matrix)

0.73
[[96  4]
 [50 50]]


In [52]:
print(evaluate_model(model, splits, verbose = True))

Evaluating fold 1 of 10
Accuracy --> 0.73
Precision --> 0.96
Recall --> 0.6575342465753424
F1 --> 0.7804878048780488

Evaluating fold 2 of 10
Accuracy --> 0.765
Precision --> 0.98
Recall --> 0.6853146853146853
F1 --> 0.8065843621399176

Evaluating fold 3 of 10
Accuracy --> 0.69
Precision --> 0.95
Recall --> 0.625
F1 --> 0.753968253968254

Evaluating fold 4 of 10
Accuracy --> 0.705
Precision --> 0.94
Recall --> 0.6394557823129252
F1 --> 0.7611336032388665

Evaluating fold 5 of 10
Accuracy --> 0.685
Precision --> 0.95
Recall --> 0.6209150326797386
F1 --> 0.7509881422924901

Evaluating fold 6 of 10
Accuracy --> 0.725
Precision --> 0.93
Recall --> 0.6595744680851063
F1 --> 0.7717842323651452

Evaluating fold 7 of 10
Accuracy --> 0.745
Precision --> 0.96
Recall --> 0.6713286713286714
F1 --> 0.7901234567901235

Evaluating fold 8 of 10
Accuracy --> 0.73
Precision --> 0.94
Recall --> 0.6619718309859155
F1 --> 0.7768595041322315

Evaluating fold 9 of 10
Accuracy --> 0.76
Precision --> 0.96
Reca

Ultimately, the combination of negation handling and bigrams proved mediocre at best.

Naive Bayes was the only algorithm that beat the baseline, and only marginally. Given the long running time, particularly for the SVM and the decision tree, this combination is probably best avoided.