## Negation Handling

I wanted to add the Negation Handling to original dataset. The basic idea is that if a word in any of our sentences ends in "n't" then any words that follow it should have 'NOT_' added as a prefix. This should continue until the next piece of punctuation appears within the sentence. Once this happens, we continue as normal until the next instance of "n't" is found.

The following four cells start by reading in all of the files contained in our positive folder and our negative folder. 

The function "handle_negation" iterates through the full list of files, finds any incidents of "n't", adds the "NOT_" prefix to words that follow up to the next punctuation.

It then saves the new document with this added negation handling back to a seperate folder.

Here is an example of what a sentence looks like before and after this is implemented:

Without negation handling - "i don't think anyone needs to be briefed on jack the ripper,"

With negation handling - "i don't NOT_think NOT_anyone NOT_needs NOT_to NOT_be NOT_briefed NOT_on NOT_jack NOT_the NOT_ripper," 



In [1]:
import os
import string
pos_path = "./data/txt_sentoken/pos/"
pos_path_out = "./data/txt_sentoken_negation/pos_negation/"
neg_path = "./data/txt_sentoken/neg/"
neg_path_out = "./data/txt_sentoken_negation/neg_negation/"

In [4]:
def handle_negation(in_path, out_path):
    file_list = os.listdir(in_path)
    for file in file_list:
        new_file = file
        new_file_sentences = []
        with open(in_path + file, 'r') as f, open(out_path + new_file, 'w+') as f_out:
            for line in f.readlines():
                new_line = ''
                tokens = line.split()
                i = 0
                while i < len(tokens):

                    if tokens[i][-3:] != "n't":
                        new_line = new_line + tokens[i] + ' '
                        i+=1
                    
                    else:
                        new_line = new_line + tokens[i] + ' '
                        try:
                            while tokens[i+1] not in string.punctuation:
                                new_line = new_line + 'NOT_' + tokens[i+1] + ' '
                                i+=1
                        except:
                            print("end of sentence")
                        i+=1
                new_file_sentences.append(new_line + '\n')
                
            f_out.writelines(new_file_sentences)

In [5]:
handle_negation(pos_path, pos_path_out)

FileNotFoundError: [Errno 2] No such file or directory: './data/txt_sentoken_negation/pos_negation/cv000_29590.txt'

In [4]:
handle_negation(neg_path, neg_path_out)

The final outcome is two folders that match our original ones in structure, containing the same documents as the original with negation handling.

This means we can now continue from this point like we did with baseline and experiment in the same way with various algorithms and compare them to each other and also to the baseline versions that didn't have negation handling.

Like the previous page, it is best to skip down to the "Naive Bayes" section at this point to start looking at the experimentation.

In [5]:
import os
import time
import tarfile
import time

class PL04DataLoader_Part_1:
    
    def __init__(self):
        pass
    
    def get_labelled_dataset(self, fold = 0):
        ''' Compile a fold of the data set
        '''
        dataset = []
        for label in ('pos_negation', 'neg_negation'):
            for document in self.get_documents(
                fold = fold,
                label = label,
            ):
                dataset.append((document, label))
        return dataset
    
    def get_documents(self, fold = 0, label = 'pos'):
        ''' Enumerate the raw contents of all data set files.
            Args:
                data_dir: relative or absolute path to the data set folder
                fold: which fold to load (0 to n_folds-1)
                label: 'pos' or 'neg' to
                    select data with positive or negative sentiment
                    polarity
            Return:
                List of tokenised documents, each a list of sentences
                that in turn are lists of tokens
        '''
        raise NotImplementedError

class PL04DataLoader(PL04DataLoader_Part_1):
    
    def get_xval_splits(self):
        ''' Split data with labels for cross-validation
            returns a list of k pairs (training_data, test_data)
            for k cross-validation
        '''
        # load the folds
        folds = []
        for i in range(10):
            folds.append(self.get_labelled_dataset(
                fold = i
            ))
        # create training-test splits
        retval = []
        for i in range(10):
            test_data = folds[i]
            training_data = []
            for j in range(9):
                ij1 = (i+j+1) % 10
                assert ij1 != i
                training_data = training_data + folds[ij1]
            retval.append((training_data, test_data))
        return retval
    
import tarfile
import time

class PL04DataLoaderFromStream(PL04DataLoader):
        
    def __init__(self, tgz_stream, **kwargs):
        super().__init__(**kwargs)
        self.data = {}
        counter = 0
        with tarfile.open(
            mode = 'r|gz',
            fileobj = tgz_stream
        ) as tar_archive:
            for tar_member in tar_archive:
                if counter == 2000:
                    break
                path_components = tar_member.name.split('/')
                filename = path_components[-1]
                if filename.startswith('cv') \
                and filename.endswith('.txt') \
                and '_' in filename:
                    label = path_components[-2]
                    fold = int(filename[2])
                    key = (fold, label)
                    if key not in self.data:
                        self.data[key] = []
                    f = tar_archive.extractfile(tar_member)
                    document = [
                        line.decode('utf-8').split()
                        for line in f.readlines()
                    ]
                    self.data[key].append(document)
                    counter += 1
            
    def get_documents(self, fold = 0, label = 'pos'):
        return self.data[(fold, label)]
    


class PL04DataLoaderFromFolder(PL04DataLoader):
    
    def __init__(self, data_dir, **kwargs):
        self.data_dir = data_dir
        super().__init__(**kwargs)
        
    def get_documents(self, fold = 0, label = 'pos_negation'):
        # read folder contents
        path = os.path.join(self.data_dir, label)
        dir_entries = os.listdir(path)
        # must process entries in numeric order to
        # replicate order of original experiments
        dir_entries.sort()
        # check each entry and add to data if matching
        # selection criteria
        for filename in dir_entries:
            if filename.startswith('cv') \
            and filename.endswith('.txt'):
                if fold == int(filename[2]):
                    # correct fold
                    f = open(os.path.join(path, filename), 'rt')
                    # "yield" tells Python to return an iterator
                    # object that produces the yields of this
                    # function as elements without creating a
                    # full list of all elements
                    yield [line.split() for line in f.readlines()]
                    f.close()

In [6]:
dir_entries = os.listdir()
dir_entries.sort()

In [7]:
data_loader = PL04DataLoaderFromFolder("./data/txt_sentoken_negation/")

In [8]:
# test "get_documents()"

def get_document_preview(document, max_length = 72):
    s = []
    count = 0
    reached_limit = False
    for sentence in document:
        for token in sentence:
            if count + len(token) + len(s) > max_length:
                reached_limit = True
                break
            s.append(token)
            count += len(token)
        if reached_limit:
            break
    return '|'.join(s)
    
for label in 'pos_negation neg_negation'.split():
    print(f'== {label} ==')
    print('doc sentences start of first sentence')
    for index, document in enumerate(data_loader.get_documents(
        label = label
    )):
        print('%3d %7d   %s' %(
            index, len(document), get_document_preview(document)
        ))
        if index == 4:
            break

== pos_negation ==
doc sentences start of first sentence
  0      25   films|adapted|from|comic|books|have|had|plenty|of|success|,|whether
  1      39   every|now|and|then|a|movie|comes|along|from|a|suspect|studio|,|with
  2      19   you've|got|mail|works|alot|better|than|it|deserves|to|.|in|order|to|make
  3      42   "|jaws|"|is|a|rare|film|that|grabs|your|attention|before|it|shows|you|a
  4      25   moviemaking|is|a|lot|like|being|the|general|manager|of|an|nfl|team|in
== neg_negation ==
doc sentences start of first sentence
  0      35   plot|:|two|teen|couples|go|to|a|church|party|,|drink|and|then|drive|.
  1      13   the|happy|bastard's|quick|movie|review|damn|that|y2k|bug|.|it's|got|a
  2      23   it|is|movies|like|these|that|make|a|jaded|movie|viewer|thankful|for|the
  3      19   "|quest|for|camelot|"|is|warner|bros|.|'|first|feature-length|,
  4      37   synopsis|:|a|mentally|unstable|man|undergoing|psychotherapy|saves|a|boy


In [9]:
# test "get_xval_splits()"

splits = data_loader.get_xval_splits()

print('tr-size te-size (number of documents)')
for xval_tr_data, xval_te_data in splits:
    print('%7d %7d' %(len(xval_tr_data), len(xval_te_data)))

tr-size te-size (number of documents)
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200
   1800     200


In [10]:
class PolarityPredictorInterface:

    def train(self, data_with_labels):
        raise NotImplementedError
        
    def predict(self, data):
        raise NotImplementedError

In [11]:
class PolarityPredictorWithVocabulary(PolarityPredictorInterface):
    
    def train(self, data_with_labels):
        self.reset_vocab()
        self.add_to_vocab_from_data(data_with_labels)
        self.finalise_vocab()
        tr_features = self.extract_features(
            data_with_labels
        )
        tr_targets = self.get_targets(data_with_labels)
        self.train_model_on_features(tr_features, tr_targets)
        
    def reset_vocab(self):
        self.vocab = set()
        
    def add_to_vocab_from_data(self, data):
        for document, label in data:
            for sentence in document:
                for token in sentence:
                    self.vocab.add(token)

    def finalise_vocab(self):
        self.vocab = list(self.vocab)
        # create reverse map for fast token lookup
        self.token2index = {}
        for index, token in enumerate(self.vocab):
            self.token2index[token] = index
        
    def extract_features(self, data):
        raise NotImplementedError
    
    def get_targets(self, data, label2index = None):
        raise NotImplementedError
        
    def train_model_on_features(self, tr_features, tr_targets):
        raise NotImplementedError

In [12]:
import numpy

class PolarityPredictorWithBagOfWords_01(PolarityPredictorWithVocabulary):
    
    def __init__(self, clip_counts = True):
        self.clip_counts = clip_counts
        
    def extract_features(self, data):
        # create numpy array of required size
        columns = len(self.vocab)
        rows = len(data)
        features = numpy.zeros((rows, columns), dtype=numpy.int32)        
        # populate feature matrix
        for row, item in enumerate(data):
            document, _ = item
            for sentence in document:
                for token in sentence:
                    try:
                        index = self.token2index[token]
                    except KeyError:
                        # token not in vocab
                        # --> skip this token
                        # --> continue with next token
                        continue
                    if self.clip_counts:
                        features[row, index] = 1
                    else:
                        features[row, index] += 1
        return features

In [13]:
class PolarityPredictorWithBagOfWords(PolarityPredictorWithBagOfWords_01):
 
    def get_targets(self, data):
        ''' create column vector with target labels
        '''
        # prepare target vector
        targets = numpy.zeros(len(data), dtype=numpy.int8)
        index = 0
        for _, label in data:
            if label == 'pos_negation':
                targets[index] = 1
            index += 1
        return targets

    def train_model_on_features(self, tr_features, tr_targets):
        raise NotImplementedError

## Naive Bayes

The first step is to retry the original Baseline Naive Bayes set up with this added Negation Handling

In [14]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

class PolarityPredictorBowNB(PolarityPredictorWithBagOfWords):

    def train_model_on_features(self, tr_features, tr_targets):
        # pass numpy array to sklearn to train NB
        self.model = MultinomialNB()
        self.model.fit(tr_features, tr_targets)
        
    def predict(
        self, data, get_accuracy = False,
        get_confusion_matrix = False
    ):
        features = self.extract_features(data)
        # use numpy to get predictions
        y_pred = self.model.predict(features)
        # restore labels
        labels = []
        for is_positive in y_pred:
            if is_positive:
                labels.append('pos_negation')
            else:
                labels.append('neg_negation')
        if get_accuracy or get_confusion_matrix:
            retval = []
            retval.append(labels)
            y_true = self.get_targets(data)
            if get_accuracy:
                retval.append(
                    metrics.accuracy_score(y_true, y_pred)
                )
            if get_confusion_matrix:
                retval.append(
                    metrics.confusion_matrix(y_true, y_pred)
                )
            return retval
        else:
            return labels

In [15]:
# first functionality test

model = PolarityPredictorBowNB()
model.train(splits[0][0]) 

In [16]:
def print_first_predictions(model, te_data, n = 12):
    predictions = model.predict(te_data)
    for i in range(n):
        document, label = te_data[i]
        prediction = predictions[i]
        print('%4d %s %s %s' %(
            i, label, prediction,
            get_document_preview(document),
        ))
    
print_first_predictions(model, splits[0][1])

   0 pos_negation neg_negation films|adapted|from|comic|books|have|had|plenty|of|success|,|whether
   1 pos_negation pos_negation every|now|and|then|a|movie|comes|along|from|a|suspect|studio|,|with
   2 pos_negation neg_negation you've|got|mail|works|alot|better|than|it|deserves|to|.|in|order|to|make
   3 pos_negation pos_negation "|jaws|"|is|a|rare|film|that|grabs|your|attention|before|it|shows|you|a
   4 pos_negation neg_negation moviemaking|is|a|lot|like|being|the|general|manager|of|an|nfl|team|in
   5 pos_negation pos_negation on|june|30|,|1960|,|a|self-taught|,|idealistic|,|yet|pragmatic|,|young
   6 pos_negation pos_negation apparently|,|director|tony|kaye|had|a|major|battle|with|new|line
   7 pos_negation pos_negation one|of|my|colleagues|was|surprised|when|i|told|her|i|was|willing|to|see
   8 pos_negation pos_negation after|bloody|clashes|and|independence|won|,|lumumba|refused|to|pander|to
   9 pos_negation pos_negation the|american|action|film|has|been|slowly|drowning|to|death

In [17]:
labels, accuracy, confusion_matrix = model.predict(
    splits[0][1], get_accuracy = True, get_confusion_matrix = True
)

print(accuracy)
print(confusion_matrix)

0.815
[[82 18]
 [19 81]]


In [18]:
def evaluate_model(model, splits, verbose = False):
    accuracies = []
    f1s = []
    fold = 0
    for tr_data, te_data in splits:
        if verbose:
            print('Evaluating fold %d of %d' %(fold+1, len(splits)))
            fold += 1
        model.train(tr_data)
        _, accuracy, confusion_matrix = model.predict(te_data, get_accuracy = True, get_confusion_matrix = True)
        
        tp, fp, fn, tn = confusion_matrix[0][0], confusion_matrix[0][1], confusion_matrix[1][0], confusion_matrix[1][1]
        prec = tp/(tp + fp)
        rec = tp/(tp + fn)
        f1 = (2*prec*rec)/(prec+rec)
        
        accuracies.append(accuracy)
        f1s.append(f1)
        if verbose:
            print('Accuracy -->', accuracy)
            print('Precision -->', prec)
            print('Recall -->', rec)
            print('F1 -->', f1)
            print()
    n = float(len(accuracies))
    avg = sum(f1s) / n
    mse = sum([(x-avg)**2 for x in accuracies]) / n
    return (avg, mse**0.5, min(f1s),
            max(f1s))

# this takes about 3 minutes
print(evaluate_model(model, splits, verbose = True))

Evaluating fold 1 of 10
Accuracy --> 0.815
Precision --> 0.82
Recall --> 0.8118811881188119
F1 --> 0.8159203980099502

Evaluating fold 2 of 10
Accuracy --> 0.825
Precision --> 0.85
Recall --> 0.8095238095238095
F1 --> 0.8292682926829269

Evaluating fold 3 of 10
Accuracy --> 0.83
Precision --> 0.84
Recall --> 0.8235294117647058
F1 --> 0.8316831683168315

Evaluating fold 4 of 10
Accuracy --> 0.805
Precision --> 0.82
Recall --> 0.7961165048543689
F1 --> 0.8078817733990147

Evaluating fold 5 of 10
Accuracy --> 0.805
Precision --> 0.81
Recall --> 0.801980198019802
F1 --> 0.8059701492537314

Evaluating fold 6 of 10
Accuracy --> 0.82
Precision --> 0.82
Recall --> 0.82
F1 --> 0.82

Evaluating fold 7 of 10
Accuracy --> 0.845
Precision --> 0.86
Recall --> 0.8349514563106796
F1 --> 0.8472906403940887

Evaluating fold 8 of 10
Accuracy --> 0.83
Precision --> 0.87
Recall --> 0.8055555555555556
F1 --> 0.8365384615384616

Evaluating fold 9 of 10
Accuracy --> 0.785
Precision --> 0.8
Recall --> 0.776699

The average F1 score for this set up was **0.824** which is actually a slight drop off from the baseline Naive Bayes, indicating that negation handling hasn't helped us here

## Logistic Regression

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

class PolarityPredictorBowLR(PolarityPredictorWithBagOfWords):

    def train_model_on_features(self, tr_features, tr_targets):
        # pass numpy array to sklearn to train Logistic Regression
        # iterations set to 1000 as default of 100 didn't guarantee convergence with our data
        self.model = LogisticRegression(max_iter=1000)
        self.model.fit(tr_features, tr_targets)
        
    def predict(
        self, data, get_accuracy = False,
        get_confusion_matrix = False
    ):
        features = self.extract_features(data)
        # use numpy to get predictions
        y_pred = self.model.predict(features)
        # restore labels
        labels = []
        for is_positive in y_pred:
            if is_positive:
                labels.append('pos_negation')
            else:
                labels.append('neg_negation')
        if get_accuracy or get_confusion_matrix:
            retval = []
            retval.append(labels)
            y_true = self.get_targets(data)
            if get_accuracy:
                retval.append(
                    metrics.accuracy_score(y_true, y_pred)
                )
            if get_confusion_matrix:
                retval.append(
                    metrics.confusion_matrix(y_true, y_pred)
                )
            return retval
        else:
            return labels

In [20]:
model = PolarityPredictorBowLR()

print(evaluate_model(model, splits, verbose = True))

Evaluating fold 1 of 10
Accuracy --> 0.845
Precision --> 0.84
Recall --> 0.8484848484848485
F1 --> 0.8442211055276383

Evaluating fold 2 of 10
Accuracy --> 0.875
Precision --> 0.91
Recall --> 0.8504672897196262
F1 --> 0.8792270531400966

Evaluating fold 3 of 10
Accuracy --> 0.83
Precision --> 0.86
Recall --> 0.8113207547169812
F1 --> 0.8349514563106797

Evaluating fold 4 of 10
Accuracy --> 0.875
Precision --> 0.85
Recall --> 0.8947368421052632
F1 --> 0.8717948717948718

Evaluating fold 5 of 10
Accuracy --> 0.85
Precision --> 0.83
Recall --> 0.8645833333333334
F1 --> 0.8469387755102041

Evaluating fold 6 of 10
Accuracy --> 0.86
Precision --> 0.88
Recall --> 0.8461538461538461
F1 --> 0.8627450980392156

Evaluating fold 7 of 10
Accuracy --> 0.885
Precision --> 0.87
Recall --> 0.8969072164948454
F1 --> 0.883248730964467

Evaluating fold 8 of 10
Accuracy --> 0.86
Precision --> 0.84
Recall --> 0.875
F1 --> 0.8571428571428572

Evaluating fold 9 of 10
Accuracy --> 0.84
Precision --> 0.88
Recal

The average F1 score is **0.864** which is, again, a slight drop off from Baseline Logistic Regression

## Decision Tree

In [21]:
from sklearn.tree import DecisionTreeClassifier

class PolarityPredictorBowDT(PolarityPredictorWithBagOfWords):

    def train_model_on_features(self, tr_features, tr_targets):
        # pass numpy array to sklearn to train Logistic Regression
        # iterations set to 1000 as default of 100 didn't guarantee convergence with our data
        self.model = DecisionTreeClassifier()
        self.model.fit(tr_features, tr_targets)
        
    def predict(
        self, data, get_accuracy = False,
        get_confusion_matrix = False
    ):
        features = self.extract_features(data)
        # use numpy to get predictions
        y_pred = self.model.predict(features)
        # restore labels
        labels = []
        for is_positive in y_pred:
            if is_positive:
                labels.append('pos')
            else:
                labels.append('neg')
        if get_accuracy or get_confusion_matrix:
            retval = []
            retval.append(labels)
            y_true = self.get_targets(data)
            if get_accuracy:
                retval.append(
                    metrics.accuracy_score(y_true, y_pred)
                )
            if get_confusion_matrix:
                retval.append(
                    metrics.confusion_matrix(y_true, y_pred)
                )
            return retval
        else:
            return labels

In [22]:
model = PolarityPredictorBowDT()

print(evaluate_model(model, splits, verbose = True))

Evaluating fold 1 of 10
Accuracy --> 0.575
Precision --> 0.58
Recall --> 0.5742574257425742
F1 --> 0.5771144278606964

Evaluating fold 2 of 10
Accuracy --> 0.64
Precision --> 0.67
Recall --> 0.6320754716981132
F1 --> 0.6504854368932038

Evaluating fold 3 of 10
Accuracy --> 0.685
Precision --> 0.71
Recall --> 0.6761904761904762
F1 --> 0.6926829268292682

Evaluating fold 4 of 10
Accuracy --> 0.58
Precision --> 0.61
Recall --> 0.5754716981132075
F1 --> 0.5922330097087378

Evaluating fold 5 of 10
Accuracy --> 0.645
Precision --> 0.64
Recall --> 0.6464646464646465
F1 --> 0.6432160804020101

Evaluating fold 6 of 10
Accuracy --> 0.565
Precision --> 0.55
Recall --> 0.5670103092783505
F1 --> 0.5583756345177665

Evaluating fold 7 of 10
Accuracy --> 0.59
Precision --> 0.56
Recall --> 0.5957446808510638
F1 --> 0.577319587628866

Evaluating fold 8 of 10
Accuracy --> 0.665
Precision --> 0.63
Recall --> 0.6774193548387096
F1 --> 0.6528497409326425

Evaluating fold 9 of 10
Accuracy --> 0.635
Precision

The average F1 score for Decision Tree with Negation Handling was slightly better than Baseline Decision Tree but still poor compared with other methods

## Support Vector Machine

In [23]:
from sklearn import svm

class PolarityPredictorBowSVM(PolarityPredictorWithBagOfWords):

    def train_model_on_features(self, tr_features, tr_targets):
        # pass numpy array to sklearn to train Logistic Regression
        # iterations set to 1000 as default of 100 didn't guarantee convergence with our data
        self.model = svm.SVC()
        self.model.fit(tr_features, tr_targets)
        
    def predict(
        self, data, get_accuracy = False,
        get_confusion_matrix = False
    ):
        features = self.extract_features(data)
        # use numpy to get predictions
        y_pred = self.model.predict(features)
        # restore labels
        labels = []
        for is_positive in y_pred:
            if is_positive:
                labels.append('pos')
            else:
                labels.append('neg')
        if get_accuracy or get_confusion_matrix:
            retval = []
            retval.append(labels)
            y_true = self.get_targets(data)
            if get_accuracy:
                retval.append(
                    metrics.accuracy_score(y_true, y_pred)
                )
            if get_confusion_matrix:
                retval.append(
                    metrics.confusion_matrix(y_true, y_pred)
                )
            return retval
        else:
            return labels

In [24]:
model = PolarityPredictorBowSVM()

print(evaluate_model(model, splits, verbose = True))

Evaluating fold 1 of 10
Accuracy --> 0.85
Precision --> 0.89
Recall --> 0.8240740740740741
F1 --> 0.8557692307692307

Evaluating fold 2 of 10
Accuracy --> 0.835
Precision --> 0.9
Recall --> 0.7964601769911505
F1 --> 0.8450704225352113

Evaluating fold 3 of 10
Accuracy --> 0.82
Precision --> 0.87
Recall --> 0.7909090909090909
F1 --> 0.8285714285714286

Evaluating fold 4 of 10
Accuracy --> 0.86
Precision --> 0.85
Recall --> 0.8673469387755102
F1 --> 0.8585858585858585

Evaluating fold 5 of 10
Accuracy --> 0.85
Precision --> 0.85
Recall --> 0.85
F1 --> 0.85

Evaluating fold 6 of 10
Accuracy --> 0.855
Precision --> 0.87
Recall --> 0.8446601941747572
F1 --> 0.8571428571428571

Evaluating fold 7 of 10
Accuracy --> 0.905
Precision --> 0.9
Recall --> 0.9090909090909091
F1 --> 0.9045226130653266

Evaluating fold 8 of 10
Accuracy --> 0.885
Precision --> 0.89
Recall --> 0.8811881188118812
F1 --> 0.8855721393034827

Evaluating fold 9 of 10
Accuracy --> 0.85
Precision --> 0.85
Recall --> 0.85
F1 --

The average F1 score here was **0.8624** which is a small increase vs. Baseline SVM. Once more, however, this SVM set up took many hours to train and so may not be the best choice even with a solid F1 score.

At this point it isn't clear if Negation Handling is an overall improvement. It seemed to improve some algorithms but not others.

The next step was to use Bigrams instead of Unigrams when training the models.