<a href="https://colab.research.google.com/github/rrajasek95/nlp-243-notebooks/blob/main/Section_3_Sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP 243 Section 03 - Classification

In this section, we will use scikit-learn and a few other libraries to approach the problem of text classification. We will go through the entire pipeline: reading data, preprocessing, feature creation and, ultimately, prediction.

## Dataset

We will be using the **Large Movie Review Dataset** for this task. It consists of highly polarized movie reviews, each labeled as either positive or negative sentiment. There are 25k training examples and 25k test examples.

Link to the dataset page: <a href="https://ai.stanford.edu/~amaas/data/sentiment/">https://ai.stanford.edu/~amaas/data/sentiment/</a>

In [None]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar xvf aclImdb_v1.tar.gz

Streaming output truncated to the last 5000 lines.
aclImdb/train/unsup/44982_0.txt
aclImdb/train/unsup/44981_0.txt
aclImdb/train/unsup/44980_0.txt
aclImdb/train/unsup/44979_0.txt
aclImdb/train/unsup/44978_0.txt
aclImdb/train/unsup/44977_0.txt
aclImdb/train/unsup/44976_0.txt
aclImdb/train/unsup/44975_0.txt
aclImdb/train/unsup/44974_0.txt
aclImdb/train/unsup/44973_0.txt
aclImdb/train/unsup/44972_0.txt
aclImdb/train/unsup/44971_0.txt
aclImdb/train/unsup/44970_0.txt
aclImdb/train/unsup/44969_0.txt
aclImdb/train/unsup/44968_0.txt
aclImdb/train/unsup/44967_0.txt
aclImdb/train/unsup/44966_0.txt
aclImdb/train/unsup/44965_0.txt
aclImdb/train/unsup/44964_0.txt
aclImdb/train/unsup/44963_0.txt
aclImdb/train/unsup/44962_0.txt
aclImdb/train/unsup/44961_0.txt
aclImdb/train/unsup/44960_0.txt
aclImdb/train/unsup/44959_0.txt
aclImdb/train/unsup/44958_0.txt
aclImdb/train/unsup/44957_0.txt
aclImdb/train/unsup/44956_0.txt
aclImdb/train/unsup/44955_0.txt
aclImdb/train/unsup/44954_0.txt
aclImdb

In [None]:
# Data Examples
!echo "Positive Example:"
!fold -w 80 -s aclImdb/train/pos/1_7.txt
!echo ''
!echo ''
!echo "Negative Example:"
!fold -w 80 -s aclImdb/train/neg/0_3.txt


Positive Example:
If you like adult comedy cartoons, like South Park, then this is nearly a 
similar format about the small adventures of three teenage girls at Bromwell 
High. Keisha, Natella and Latrina have given exploding sweets and behaved like 
bitches, I think Keisha is a good leader. There are also small stories going on 
with the teachers of the school. There's the idiotic principal, Mr. Bip, the 
nervous Maths teacher and many others. The cast is also fantastic, Lenny 
Henry's Gina Yashere, EastEnders Chrissie Watts, Tracy-Ann Oberman, Smack The 
Pony's Doon Mackichan, Dead Ringers' Mark Perry and Blunder's Nina Conti. I 
didn't know this came from Canada, but it is very good. Very good!

Negative Example:
Story of a man who has unnatural feelings for a pig. Starts out with a opening 
scene that is a terrific example of absurd comedy. A formal orchestra audience 
is turned into an insane, violent mob by the crazy chantings of it's singers. 
Unfortunately it stays absurd the W

## Pre-Processing

We need to prepare our datasets (train and test) so that we have enough information to construct the features we need.

In [None]:
!pip install spacy
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [None]:
import os
import spacy
import pickle as pkl

from tqdm.auto import tqdm

# We don't need some of the features of the spacy pipeline, 
# this speeds up processing time
# Only want to do: tokenization, lemmatization
nlp = spacy.load('en', disable=['parser', 'ner', 'sentencizer', 'tagger'])

def process_class(class_path, class_label):
    """
    Processes the data for the specific class that we're working with.
    We will generate additional features such as the tokenized words, lemmas.

    We won't use all of the features, but this illustrates a typical pipeline
    we would need to make for a task.
    """
    examples = []

    for file_name in tqdm(os.listdir(class_path)):
        file_path = os.path.join(class_path, file_name)

        with open(file_path, 'r') as example_file:
            text = example_file.read()
        doc = nlp(text)

        tokens = [t.text for t in doc]
        # t.lemma_ is the lemma string; t.lemma (no underscore) is its integer hash ID
        lemmas = [t.lemma_ for t in doc]

        # doc.sents
        # for sent in doc.sents:
        #   # Do something with words
        
        examples.append({
            'text': doc.text,
            'tokens': tokens,
            'lemmas': lemmas,
            'class': class_label
        })

        # run, running, ran -> run

    return examples


def process_split(split_folder):
    """
    Process all the examples in a given split
    """

    # Process positives
    pos_path = os.path.join(split_folder, 'pos')
    positive_data = process_class(pos_path, 'pos')

    # Process negatives
    neg_path = os.path.join(split_folder, 'neg')
    negative_data = process_class(neg_path, 'neg')
    
    return positive_data + negative_data

In [None]:
# How spacy works

text = 'Hello, my name is Rishi!'

doc = nlp(text)

print(doc)


Hello, my name is Rishi!


In [None]:

dataset_path = 'aclImdb'

splits = [
         'train',
         'test'
]

dataset = {}

for split in splits:
    split_folder = os.path.join(dataset_path, split)
    split_data = process_split(split_folder)

    dataset[split] = split_data

# Save the processed data

with open('processed_data.pkl', 'wb') as processed_dataset_file:
    pkl.dump(dataset, processed_dataset_file)

In [None]:
dataset['train'][0]

{'class': 'pos',
 'lemmas': [4690420944186131903,
  2158845516055552166,
  447765159362469301,
  11268780503345276587,
  9548244504980166557,
  8587113882175654237,
  14115759321026111728,
  16037325823156266367,
  1995909169258310477,
  18213940162184454424,
  2593208677638477497,
  7063653163634019529,
  9707179535890930240,
  18197037023634208128,
  295350960619799726,
  12150403604591081639,
  12646065887601541794,
  5059648917813135842,
  5345497848075657444,
  10382539506755952630,
  11901859001352538922,
  9778055143417507723,
  8371520351987192855,
  2982782142620708162,
  14560795576765492085,
  8578797347073582537,
  4690420944186131903,
  16875814820671380748,
  10239237003504588839,
  14692702688101715474,
  11901859001352538922,
  15635199268207170545,
  13531646641691661812,
  12646065887601541794,
  5059648917813135842,
  15659840250438241175,
  11500813967494186156,
  10382539506755952630,
  15549368236163590771,
  14121509715367036122,
  2283656566040971221,
  74259856

## Multinomial Naive Bayes

The model we are going to explore is a multinomial Naive Bayes classifier with n-gram features. 

The Naive Bayes model takes the n-gram counts as input and estimates a probability distribution over the n-grams for each class (positive or negative). For example, we would expect phrases such as "very good" or "amazing" to appear more frequently in positive reviews, while phrases such as "awful" or "terrible" would appear more frequently in negative reviews.
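To make the idea concrete, here is a minimal pure-Python sketch of multinomial Naive Bayes on unigrams. It is not the scikit-learn implementation; the helper names `train_multinomial_nb` and `predict`, the toy reviews, and the add-one smoothing value `alpha=1.0` are all assumptions made for illustration. Scores are summed in log space rather than multiplied as raw probabilities.

```python
import math
from collections import Counter

def train_multinomial_nb(token_lists, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods (hypothetical helper)."""
    classes = sorted(set(labels))
    vocab = sorted({tok for toks in token_lists for tok in toks})
    class_counts = Counter(labels)
    token_counts = {c: Counter() for c in classes}
    for toks, label in zip(token_lists, labels):
        token_counts[label].update(toks)

    # log P(class): fraction of training examples with that label
    log_prior = {c: math.log(class_counts[c] / len(labels)) for c in classes}
    # log P(token | class) with add-alpha smoothing over the vocabulary
    log_likelihood = {}
    for c in classes:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict(tokens, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Tokens outside the training vocabulary are skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in tokens if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy data, invented for this sketch
toy_texts = [["very", "good", "movie"], ["amazing", "movie"],
             ["awful", "movie"], ["terrible", "awful", "plot"]]
toy_labels = ["pos", "pos", "neg", "neg"]

log_prior, log_likelihood = train_multinomial_nb(toy_texts, toy_labels)
print(predict(["good", "plot", "amazing"], log_prior, log_likelihood))  # pos
```

The same structure carries over to bigrams: only the set of features being counted changes.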

Naive Bayes requires the counts of each n-gram as input in the form of a vector. Each index of the vector corresponds to an n-gram. Therefore, we will need to construct the feature vector out of our data. 

We will implement our own count vectorizer that returns, for each example, a vector of counts of the 10,000 most frequent n-grams (unigrams and bigrams) in our training data.

Vocabulary = {'movie', 'good', 'bad', 'is'}
NAIVE_BAYES = (1, 0, 1)

TEXT = 'movie is bad'
X = (1, 0, 1, 1)

TEXT = 'movie is good'
X = (1, 1, 0, 1)
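The toy example above can be reproduced with a few lines of Python. This is only a sketch: `word2idx` and `count_vector` are hypothetical names, the vocabulary is kept as an ordered list so that each word has a fixed index, and whitespace splitting stands in for real tokenization.

```python
# Ordered vocabulary: the position of each word is its index in the vector
vocabulary = ['movie', 'good', 'bad', 'is']
word2idx = {word: i for i, word in enumerate(vocabulary)}

def count_vector(text):
    """Count how often each vocabulary word occurs in a whitespace-split text."""
    vector = [0] * len(vocabulary)
    for word in text.split():
        idx = word2idx.get(word)
        if idx is not None:  # words outside the vocabulary are ignored
            vector[idx] += 1
    return tuple(vector)

print(count_vector('movie is bad'))   # (1, 0, 1, 1)
print(count_vector('movie is good'))  # (1, 1, 0, 1)
```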

In [None]:
from collections import Counter
import numpy as np

class CountVectorizer(object):
    def __init__(self, vocab_size=10000, bos='<s>', eos='</s>'):
        self.vocab_size = vocab_size
        self.bos = bos
        self.eos = eos
        self.ngram2idx = {}
        self.idx2ngram = {}
    
    def fit(self, tokenized_examples):
        # We will try to build a vectorizer of unigrams and bigrams
        # from our tokens
        n_gram_counter = Counter()

        for example in tokenized_examples:
            prev_token = self.bos
            for token in example:
                n_gram_counter[token] += 1
                n_gram_counter[(prev_token, token)] += 1
                prev_token = token
            
            n_gram_counter[(prev_token, self.eos)] += 1
        
        print("Number of unigrams and bigrams in corpus:", len(n_gram_counter))
        most_frequent_n_grams = n_gram_counter.most_common(self.vocab_size)

        # Mapping the most frequent n-grams with an index
        for i, (n_gram, _) in enumerate(most_frequent_n_grams):
            self.ngram2idx[n_gram] = i
            self.idx2ngram[i] = n_gram

    def transform(self, tokenized_examples):

        vectorized_counts = np.zeros((len(tokenized_examples), self.vocab_size))

        # Wrap the list (not the enumerate object) so tqdm knows the total
        for i, example in enumerate(tqdm(tokenized_examples)):
            prev_token = self.bos
            example_vector = vectorized_counts[i]
            for token in example:

                # If i see a particular unigram, I increment its count
                unigram_idx = self.ngram2idx.get(token, -1)
                if unigram_idx != -1:
                    example_vector[unigram_idx] += 1

                # If I see a particular bigram, I increment its count
                bigram_idx = self.ngram2idx.get((prev_token, token), -1)
                if bigram_idx != -1:
                    example_vector[bigram_idx] += 1
                prev_token = token
            
            bigram_idx = self.ngram2idx.get((prev_token, self.eos), -1)
            if bigram_idx != -1:
                example_vector[bigram_idx] += 1
                

        return vectorized_counts

    def fit_transform(self, tokenized_examples):
        self.fit(tokenized_examples)

        return self.transform(tokenized_examples)

    def describe_vector(self, example_vector):
        bow_dict = {}
        for j, count in enumerate(example_vector):
            if count > 0:
                bow_dict[self.idx2ngram[j]] = count
        
        return bow_dict

In [None]:
train_examples = dataset['train']
test_examples = dataset['test']

# Now we need to convert the datasets into feature vectors
train_tokenized_examples = [example['tokens'] for example in train_examples]
test_tokenized_examples = [example['tokens'] for example in test_examples]

vectorizer = CountVectorizer()

X_train = vectorizer.fit_transform(train_tokenized_examples)
y_train = [example['class'] for example in train_examples]
X_test = vectorizer.transform(test_tokenized_examples)
y_test = [example['class'] for example in test_examples]

Number of unigrams and bigrams in corpus: 1558010


In [None]:
from pprint import pprint
from textwrap import fill

print("Example:\n", fill(train_examples[0]['text']))
pprint(vectorizer.describe_vector(X_train[0]))

Example:
 I did not set very high expectations for this movie, which left me
pleasantly surprised. The story is a little strange sometimes but
overall I think it has an acceptable credibility. The action scenes
are rather nice and the accompanying music is used to induce a a bit
of patriotic feelings common to US movies. This may not be the best
movie ever but it's uncommon for Sweden and I hope to see more similar
ones in the future.
{"'s": 1.0,
 ',': 1.0,
 '.': 4.0,
 'I': 3.0,
 'The': 2.0,
 'This': 1.0,
 'US': 1.0,
 'a': 3.0,
 'acceptable': 1.0,
 'action': 1.0,
 'an': 1.0,
 'and': 2.0,
 'are': 1.0,
 'be': 1.0,
 'best': 1.0,
 'bit': 1.0,
 'but': 2.0,
 'common': 1.0,
 'credibility': 1.0,
 'did': 1.0,
 'ever': 1.0,
 'expectations': 1.0,
 'feelings': 1.0,
 'for': 2.0,
 'future': 1.0,
 'has': 1.0,
 'high': 1.0,
 'hope': 1.0,
 'in': 1.0,
 'is': 2.0,
 'it': 2.0,
 'left': 1.0,
 'little': 1.0,
 'may': 1.0,
 'me': 1.0,
 'more': 1.0,
 'movie': 2.0,
 'movies': 1.0,
 'music': 1.0,
 'nice': 1.0,
 

In [None]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()

classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)

In [None]:
print(test_tokenized_examples[0])
print(y_test[0])
print(predictions[0])

print(len(predictions))

['Dylan', 'and', 'Bobby', 'are', 'boyhood', 'friends', 'and', 'they', 'are', 'in', 'love', 'the', 'way', 'that', 'young', 'boys', 'sometimes', 'are', '.', 'But', 'Dylan', 'has', 'met', 'a', 'girl', 'and', 'is', 'starting', 'to', 'put', 'those', 'boyhood', 'things', 'aside', '.', 'Bobby', 'knows', 'that', 'he', "'s", 'not', 'interested', 'in', 'girls', 'and', 'misses', 'what', 'he', 'had', 'with', 'Dylan', '.', '<', 'br', '/><br', '/>Told', 'as', 'part', 'cheesy', '80', "'s", 'pop', 'video', ',', 'part', 'home', 'movie', ',', 'part', 'video', 'recollection', 'this', 'film', 'tells', 'a', 'confusing', 'and', 'sad', ',', 'but', 'all', 'too', 'often', 'true', ',', 'story', 'that', 'will', 'hit', 'home', 'with', 'many', 'that', 'see', 'it', '.', '<', 'br', '/><br', '/>This', 'film', 'brings', 'back', 'a', 'lot', 'of', 'memories', 'and', 'struck', 'a', 'very', 'true', 'chord', 'with', 'me', 'but', 'I', 'wish', 'the', 'film', 'maker', 'had', 'gone', 'a', 'bit', 'further', 'and', 'left', 'it',

In [None]:
from sklearn.metrics import accuracy_score

print("Accuracy: ", accuracy_score(y_test, predictions))

Accuracy:  0.82736
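Accuracy alone does not show how the model behaves on each class. As a rough sketch, per-class precision, recall and prediction counts can be computed from the label pairs as below; `per_class_report` and the toy labels are assumptions for illustration, and in practice `sklearn.metrics.classification_report` or `confusion_matrix` applied to `y_test` and `predictions` would give the same kind of information.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Precision, recall and prediction count per label (hypothetical helper)."""
    pair_counts = Counter(zip(y_true, y_pred))
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        true_positives = pair_counts[(label, label)]
        predicted = sum(n for (t, p), n in pair_counts.items() if p == label)
        actual = sum(n for (t, p), n in pair_counts.items() if t == label)
        report[label] = {
            'precision': true_positives / predicted if predicted else 0.0,
            'recall': true_positives / actual if actual else 0.0,
            'predicted_count': predicted,
        }
    return report

# Toy labels, invented for this sketch
toy_true = ['pos', 'pos', 'pos', 'neg', 'neg', 'neg']
toy_pred = ['pos', 'pos', 'neg', 'neg', 'neg', 'pos']

report = per_class_report(toy_true, toy_pred)
for label, stats in report.items():
    print(label, stats)
```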


In [None]:
# Let us manually inspect some examples

# Sample 20 distinct test examples
indices = np.random.choice(len(test_tokenized_examples), size=20, replace=False)

for i, example_index in enumerate(indices):
    example = test_examples[example_index]
    print("Example", i + 1)
    print("Text:")
    print(fill(example['text']))

    print("True Label:", example['class'])
    print("Predicted Label:", classifier.predict([X_test[example_index]])[0])
    print("Prediction probability: ", classifier.predict_proba([X_test[example_index]])[0])
    print()

Example 1
Text:
How important is the director, anyway? In this film, made in the
politically tumultuous times of the late 60s where questions of social
organization were prime conflicts, asks that question by making a
movie that turns the camera away from the action and only begs to
reveal the director, William Greaves. It is an important work, as it
shows like no other movie shows the difficulties in blocking,
organizing, and setting the scene; it reveals the role of the crew,
something most directors frankly would like to disappear completely
and that the invisibility of is essential for suspending disbelief;
and it also puts into consideration the role of performance and
scripting and how they match/don't match reality and what that has to
say about how the director ultimately influences reality (if at
all).<br /><br />The documentary, or pseudo-documentary, or fictional
narrative (whichever you prefer, via your interpretations of the
themes) has its brain in the over-educated, over

### Is word morphology important?

One question we'd like to address is whether word morphology plays a role in the predictions of a Naive Bayes model. To test this, we can train a model on just the lemmas of the words and compare its performance with that of the original model.

In [None]:
train_lemmatized_examples = [example['lemmas'] for example in train_examples]
test_lemmatized_examples = [example['lemmas'] for example in test_examples]

lemma_vectorizer = CountVectorizer()

X_lemma_train = lemma_vectorizer.fit_transform(train_lemmatized_examples)
X_lemma_test = lemma_vectorizer.transform(test_lemmatized_examples)

lemma_classifier = MultinomialNB()
lemma_classifier.fit(X_lemma_train, y_train)

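# Predict with the classifier trained on the lemma features; `classifier` was
# fit on the token features and uses a different n-gram-to-column mapping
lemma_predictions = lemma_classifier.predict(X_lemma_test)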

print("Accuracy: ", accuracy_score(y_test, lemma_predictions))

Number of unigrams and bigrams in corpus: 1343429


Accuracy:  0.51204


In [None]:

with open('nb_model_1.pkl', 'wb') as naive_bayes_model_1:
    pkl.dump({
        'vectorizer': vectorizer,
        'classifier': classifier
    }, naive_bayes_model_1)

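The recorded accuracy of 0.51204 is far below the 0.82736 of the token model and is essentially chance level for a balanced two-class dataset. It should not be read as evidence that lemmatization hurts: the figure was obtained by calling `predict` on `X_lemma_test` with `classifier`, the model trained on the token features, whose columns refer to different n-grams. To answer the morphology question, predict with `lemma_classifier` and compare that accuracy against the token model.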
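One practical point when comparing the two models: each fitted vectorizer defines its own mapping from n-grams to column indices, so a classifier is only meaningful on matrices produced by the vectorizer it was trained with. The sketch below shows this with a hypothetical `build_index` helper and invented toy data; it mimics only the index-assignment step of the `CountVectorizer` defined earlier, on unigrams.

```python
from collections import Counter

def build_index(tokenized_examples, vocab_size=2):
    """Map the most frequent tokens to column indices (ties broken alphabetically)."""
    counts = Counter(tok for example in tokenized_examples for tok in example)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: i for i, (tok, _) in enumerate(ranked[:vocab_size])}

# The same five toy reviews, once as surface tokens and once as lemmas
token_examples = [['great', 'films'], ['great', 'films'], ['great', 'film'],
                  ['great', 'film'], ['awful', 'film', 'films']]
lemma_examples = [['great', 'film'], ['great', 'film'], ['great', 'film'],
                  ['great', 'film'], ['awful', 'film', 'film']]

token_index = build_index(token_examples)
lemma_index = build_index(lemma_examples)
print(token_index)  # {'great': 0, 'film': 1}
print(lemma_index)  # {'film': 0, 'great': 1}
```

Column 0 means "great" under the token index but "film" under the lemma index, so a model trained on one feature space would misread the other.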

## Exercises to try out and points to ponder
1. Describe the Naive Bayes model mathematically. How do you construct the model from the data and make predictions using it?
1. What are the assumptions of the Naive Bayes model? Does our data actually obey those assumptions?
1. Add more metrics that measure class-specific performance and the behavior of the model.
    1. How often does the model predict each class label?
    1. For which one does the model perform better/worse?
    1. Try to do some error analysis of the model.
1. Play around with the vocabulary size (10 words, 100 words, 1k words, 10k words) and see how accuracy changes with vocabulary size. What trade-off are we making by varying the vocabulary size?
1. The question above asks us to test different values of our feature size, which is a form of hyperparameter tuning. How can you automate this process using SciKit-learn to get the best set of values?
1. Sometimes, review writers may use rare synonyms of meaningful words or may misspell words. This is especially common in real data and may cause our model to perform worse. Can you modify the code to handle these cases?
1. Try implementing n-gram TF-IDF features and use the model. Does it perform better or worse?
1. Create a command-line tool or web application that accepts text input and returns the sentiment predicted by our model. What would the design of your application look like? What are the components for processing the text, creating the features and predicting the output?
1. Try to implement your own MultinomialNB class. What are some of the implementation issues you should consider and how will you handle them? Here are a few:
    1. Many words are incredibly rare - the probability of some words would be on the order of $10^{-6}$. Naive Bayes involves multiplying many of these probabilities, so the probability of a 100-token text could be something like $10^{-600}$, which underflows to zero in 64-bit floating point (the smallest positive normal value is about $10^{-308}$). How would you handle this?
    2. Writing for-loops that iterate over n-grams may be very slow. How can you optimize this?
1. In the Naive Bayes model, some words may occur rarely in one class and more commonly in the other. These are usually the most informative words for our classification. 
    1. The odds ratio of such words will be either very large or close to 0. 
    1. That is, the ratio $\frac{p(w \mid pos)}{p(w \mid neg)}$ grows very large when the numerator dominates and approaches 0 when the denominator dominates.
    1. If a word is equally likely in both classes, the ratio will be close to 1.
    1. Look at the attributes of the MultinomialNB class, and see which ones can be used to obtain this metric.
    1. Do the words you see from this analysis make intuitive sense? If some words don't make sense, they could be a source of error. Identify such cases in your dataset.
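As a starting point for the last exercise, here is a small sketch of the log odds ratio $\log \frac{p(w \mid pos)}{p(w \mid neg)}$ computed directly from token counts. The function `log_odds`, the add-one smoothing and the toy reviews are assumptions for illustration. With a fitted scikit-learn `MultinomialNB`, the per-class log probabilities are available in the `feature_log_prob_` attribute (one row per class, in the order of `classes_`), so the same quantity is the difference between two rows.

```python
import math
from collections import Counter

def log_odds(pos_examples, neg_examples, alpha=1.0):
    """Smoothed log p(w|pos) - log p(w|neg) for every word (hypothetical helper)."""
    pos_counts = Counter(tok for ex in pos_examples for tok in ex)
    neg_counts = Counter(tok for ex in neg_examples for tok in ex)
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values()) + alpha * len(vocab)
    neg_total = sum(neg_counts.values()) + alpha * len(vocab)
    return {
        tok: math.log((pos_counts[tok] + alpha) / pos_total)
        - math.log((neg_counts[tok] + alpha) / neg_total)
        for tok in vocab
    }

# Toy data, invented for this sketch
toy_pos = [['amazing', 'movie'], ['very', 'good', 'movie']]
toy_neg = [['awful', 'movie'], ['terrible', 'movie']]

scores = log_odds(toy_pos, toy_neg)
# Most positive-leaning words first, most negative-leaning words last
for tok, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok:10s} {score:+.3f}")
```

Working with the log of the ratio keeps the values symmetric around 0: large positive values indicate words typical of positive reviews, large negative values indicate words typical of negative reviews, and values near 0 indicate words that are about equally likely in both.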

## References

1. Building a sentiment classifier using SciKit-Learn (Dorian Lazar): https://towardsdatascience.com/building-a-sentiment-classifier-using-scikit-learn-54c8e7c5d2f0
1. Learning Word Vectors for Sentiment Analysis (Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts, 2011): https://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf