# Homework 1: Sentiment Analysis with Naïve Bayes
#### CSCI 3832 Natural Language Processing


1. Lemmas and inflected forms, hyponyms/hypernyms, the distributional hypothesis
2. Tokenization, vocabularies, and feature extraction for a Naive Bayes model 

*Your name and email here*

## Section 1: Free Response Questions


**Question 1: Write down the lemmas of the following inflected forms:**
1. walked
2. taught
3. best
4. are
5. running

*Your answer here*

**Question 2: Write down 3 hyponyms of the following words:**
1. dog
2. food
3. profession

*Your answer here*

**Question 3: In your own words, describe:**
1. The distributional hypothesis (see lecture on distributional semantics)
2. How the distributional hypothesis is relevant to NLP systems

*Your answer here*

## Section 2: Sentiment Analysis with Naive Bayes

In this section, our goal is to classify a set of movie reviews as positive or negative. For our dataset, we'll use the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/). To get started, download the dataset from the link and extract it into the same directory as your notebook. Next, we'll load the data and look at a couple of examples.

*Important: for any project which involves creating or training models, you can **only** do your exploratory data analysis on the training set. Looking at the test set in any way can invalidate your results!*

In [None]:
import os

data_dir = 'aclImdb/'

pos_train_dir = data_dir + 'train/pos/'
neg_train_dir = data_dir + 'train/neg/'

def read_folder(folder):
    examples = []
    for fname in os.listdir(folder):
        with open(os.path.join(folder, fname), encoding='utf8') as f:
            examples.append(f.readline().strip())
    return examples

pos_examples = read_folder(pos_train_dir)
neg_examples = read_folder(neg_train_dir)

print('Number of positive examples: {}\nNumber of negative examples: {}\n\n'.format(len(pos_examples), len(neg_examples)))

print('Sample positive example: {}\n\n'.format(pos_examples[0]))
print('Sample negative example: {}'.format(neg_examples[0]))



Now that we've loaded the data, let's create our vocabulary. While we want our final vocabulary to cover the whole training set, we'll keep the positive and negative vocabularies separate for now to see if there are any words which are frequently found in one class or the other -- these words might be informative features for classification!

The simplest way to create a vocabulary is to split on spaces:

In [None]:
pos_words = []  # A list of all space-separated tokens found across all positive examples. (Contains duplicates)
neg_words = []  # The same, for negative examples.

pos_vocab = set()  # A set of *unique* space-separated tokens found across all positive examples. (No duplicates)
neg_vocab = set()  # The same, for negative examples.

In [None]:
''' 
Your code here. For each class (positive/negative), build both the list of tokens (pos_words / neg_words)
and the set of types (pos_vocab / neg_vocab). To separate each example into words, split the example on spaces.

'''
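
As a hedged illustration of the token/type distinction, here is a minimal sketch on a made-up two-review corpus. The names `toy_examples`, `toy_tokens` and `toy_types` are hypothetical and the text is invented, not taken from the dataset.

```python
# Toy corpus (invented for illustration; not taken from the dataset)
toy_examples = ["the movie was great great fun", "the plot was thin"]

toy_tokens = []  # every space-separated token, duplicates included
for example in toy_examples:
    toy_tokens.extend(example.split(' '))

toy_types = set(toy_tokens)  # unique tokens only

print(len(toy_tokens))  # number of tokens
print(len(toy_types))   # number of types
```

Note that `str.split(' ')` splits on every single space, while `str.split()` with no argument splits on any run of whitespace, so the two can produce different counts on real text.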

In [None]:
# Sanity check

print(len(pos_words))
print(len(pos_vocab))

assert len(pos_words) == 2958696
assert len(pos_vocab) == 178873

Now let's calculate word frequencies for each class. (Hint: use Python's `collections.Counter` class.)
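
The hint can be sketched on a toy token list (the list and the `toy_*` names are invented for illustration):

```python
from collections import Counter

# Toy token list (invented for illustration)
toy_tokens = ["the", "film", "the", "cast", "the", "film"]

toy_counter = Counter(toy_tokens)

# most_common() returns (word, count) tuples sorted by descending count
toy_frequencies = toy_counter.most_common()
print(toy_frequencies[:2])
```

Passing a number, as in `toy_counter.most_common(2)`, returns only that many of the most frequent entries.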

In [None]:
pos_frequencies = [] # A list of tuples of the form (word, count). 
                 # The list should be sorted in descending order, using the count of each tuple as the key

neg_frequencies = []

In [None]:
from collections import Counter

''' 
Your code here. For each class (positive/negative), calculate the frequency of each word with a Counter
(e.g. pos_counter and neg_counter), then store the sorted (word, count) tuples in pos_frequencies and neg_frequencies.

Print the top 15 most common words for each class. 

'''

In [None]:
assert pos_frequencies[0] == ('the', 148413)
assert neg_frequencies[0] == ('the', 138612)

Looking at the top 15 words for each class we see two problems:

1. The words are essentially the same for each class, which doesn't give us any information on how to differentiate them.
2. Look at the most frequent tokens. Are there any tokens which aren't words? Are there cases where tokens with different surface forms but the same meaning are counted separately (and if so, how might we control for this)?

*Your answer to 2 here*

Instead of looking at the most frequent words overall, let's look at the most frequent words in each class which do not appear in the other class at all.

In [None]:
only_pos_words = [word for word in pos_words if word not in neg_vocab]
only_neg_words = [word for word in neg_words if word not in pos_vocab]

opw_counter = Counter(only_pos_words)
onw_counter = Counter(only_neg_words)

print(opw_counter.most_common(50))
print('\n')
print(onw_counter.most_common(50))

We begin to see some words we would expect to denote a negative review, but not so much for the positive reviews. Why might this be the case? What types of tokens are found in positive reviews but not in negative reviews?

*Your answer here*

In [None]:
# Let's now make our combined vocabulary
space_vocab = list(pos_vocab.union(neg_vocab))
print('Length of space separated vocab: {}'.format(len(space_vocab)))
print(space_vocab[:50])

Looking at some words from our vocab, what issue do we find by only splitting on spaces?

*Your answer here*

Now, rather than naively splitting on spaces, we can use tools which are informed about English grammar rules to create a cleaner tokenization.

In [None]:
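# word_tokenize needs NLTK's Punkt tokenizer data; if you get a LookupError,
# run the nltk.download(...) call named in the error message.
from nltk.tokenize import word_tokenize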

pos_examples_tokenized = [word_tokenize(ex) for ex in pos_examples]
neg_examples_tokenized = [word_tokenize(ex) for ex in neg_examples]

print(pos_examples_tokenized[0])

Looking at the first example we can see that things like apostrophes, periods, "n'ts" and ellipses are better handled.
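
To make the contrast with space splitting concrete, here is a rough sketch using only the standard library. The regular expression is a crude stand-in chosen for illustration; it is not the algorithm NLTK uses, and `TOKEN_PATTERN` and the sample sentence are made up.

```python
import re

text = "I didn't like it... at all."

# Naive approach: punctuation stays glued to the neighbouring word
space_tokens = text.split()

# Crude regex stand-in for a real tokenizer (NOT NLTK's algorithm):
# a word with an optional apostrophe suffix, a literal ellipsis,
# or any single punctuation character
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|\.{3}|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(text)

print(space_tokens)
print(regex_tokens)
```

Here the space-split tokens include `it...` and `all.`, while the regex version separates the ellipsis and the final period from the words. NLTK's `word_tokenize` goes further than this sketch, for example by splitting contractions into pieces such as `n't`.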

Let's begin defining features for our model. The simplest features indicate whether or not a word appears in an example -- however, this will be very slow if we decide to use the whole vocabulary. Instead, let's create these features for the 100 most common words.

In [None]:
all_tokenized_words = [word for ex in pos_examples_tokenized for word in ex] + \
    [word for ex in neg_examples_tokenized for word in ex]

atw_counter = Counter(all_tokenized_words)
top100 = [tup[0] for tup in atw_counter.most_common(100)] # A list of the top 100 most frequent words

print(top100)

Use the following block to define your own features for the NB model.

In [None]:
# Define features here

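def top100_word_features(example): # 100 features, 1 for each word in the top 100 most frequent words
    tokens = set(word_tokenize(example))  # Tokenize first so we test word membership, not substring matches
    return {word : 1 if word in tokens else 0 for word in top100}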

''' Define your own methods here, which take in a single example, and return a feature value (could be a 0/1 truth value, or a count)
    Some ideas:
        Look at the length of examples, is there a difference between positive and negative examples?
        Are there specific words that could be very indicative? They may not be in the top 100. 
'''

def example_feature(example): #Delete or modify this 
    feature_val = 5 if 'your feature' in example else 0
    return {'example_feat' : feature_val}


def create_feature_dictionary(example):
    features = {}
    for feat in [top100_word_features, example_feature]: #Once you've created your methods, add them to this list
        features.update(feat(example))
    return features
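
As one possible starting point for the ideas listed above, here is a hedged sketch of two extra feature functions. Both names (`length_bucket_feature`, `exclamation_count_feature`) and all cutoff values are hypothetical placeholders, not values tuned on the dataset.

```python
def length_bucket_feature(example, short_cutoff=100, long_cutoff=300):
    """Hypothetical feature: bucket a review by its number of space-separated tokens.

    The cutoffs are arbitrary placeholders, not values tuned on the dataset.
    """
    n_tokens = len(example.split())
    if n_tokens < short_cutoff:
        bucket = 'short'
    elif n_tokens < long_cutoff:
        bucket = 'medium'
    else:
        bucket = 'long'
    return {'length_bucket': bucket}


def exclamation_count_feature(example, cap=3):
    """Hypothetical feature: number of '!' characters, capped so large counts share one value."""
    return {'exclamation_count': min(example.count('!'), cap)}


print(length_bucket_feature('a short review'))
print(exclamation_count_feature('Loved it!!!!!'))
```

Bucketing and capping are used here because NLTK's Naive Bayes classifier treats each feature value as a discrete category, so raw lengths or counts would produce many rarely seen values. Functions like these could be appended to the list inside `create_feature_dictionary`.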


Now that we've defined our features for our model, we can create our final dataset, which will consist of extracted features and the example label. 

We'll also create a *validation* split by taking 20% of the training dataset. Remember, we never use the test set to make modeling decisions (in this case, decisions about features). Experiment with multiple models that use different combinations of features. Measure their accuracy on the validation split, and use the show_most_informative_features function to see which features are the most helpful. When you've found your final model, evaluate its performance on the held-out test data.

In [None]:
from nltk.classify import NaiveBayesClassifier
import random

# Convert training examples to a set of features.
train = [(create_feature_dictionary(ex), 0) for ex in neg_examples] + \
                [(create_feature_dictionary(ex), 1) for ex in pos_examples]

random.seed(42)
random.shuffle(train)

split_percent = .2

cutoff = int(split_percent * len(train))

validation_set = train[:cutoff]
training_set = train[cutoff:]

model = NaiveBayesClassifier.train(training_set)
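
To show roughly what training and classifying with Naive Bayes involves, here is a from-scratch sketch on toy data. It assumes add-one smoothing and uses hypothetical names (`train_toy_nb`, `classify_toy_nb`); it is not NLTK's implementation, and NLTK's own probability estimates may differ.

```python
import math
from collections import Counter, defaultdict


def train_toy_nb(labeled_featuresets):
    """Collect the counts needed for Naive Bayes from (features, label) pairs."""
    label_counts = Counter()              # label -> number of training examples
    value_counts = defaultdict(Counter)   # (label, feature) -> Counter of feature values
    feature_values = defaultdict(set)     # feature -> set of values seen in training
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            feature_values[name].add(value)
    return label_counts, value_counts, feature_values


def classify_toy_nb(model, features):
    """Return the label with the highest log P(label) + sum of log P(value | label)."""
    label_counts, value_counts, feature_values = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for name, value in features.items():
            if name not in feature_values:
                continue  # skip features never seen during training
            n_values = len(feature_values[name] | {value})
            seen = value_counts[(label, name)][value]
            # Add-one smoothing keeps unseen (value, label) pairs from zeroing the product
            score += math.log((seen + 1) / (count + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data in the same (feature dict, label) format used above
toy_train = [
    ({'great': 1, 'boring': 0}, 1),
    ({'great': 1, 'boring': 0}, 1),
    ({'great': 0, 'boring': 1}, 0),
    ({'great': 0, 'boring': 0}, 0),
]
toy_model = train_toy_nb(toy_train)
print(classify_toy_nb(toy_model, {'great': 1, 'boring': 0}))
print(classify_toy_nb(toy_model, {'great': 0, 'boring': 1}))
```

Working in log space avoids multiplying many small probabilities together, which matters once there are a hundred or more features per example.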

In [None]:
from nltk.classify.util import accuracy

print('Validation accuracy: {}'.format(accuracy(model, validation_set)))
model.show_most_informative_features(10)

Describe the sets of features you've considered, and note down their performance below. What is the final set of features you found?

*Your answer here*

Finally, test your model on the test set. 

In [None]:
# Load and process test data
pos_test_examples = read_folder(data_dir + 'test/pos/')
neg_test_examples = read_folder(data_dir + 'test/neg/')

test_set = [(create_feature_dictionary(ex), 0) for ex in neg_test_examples] + \
                [(create_feature_dictionary(ex), 1) for ex in pos_test_examples]

In [None]:
print('Test set accuracy: {}'.format(accuracy(model, test_set)))

# Note that we're looking at accuracy -- this is not always the most reliable metric and other choices like F1 might be more informative. 
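
As a sketch of the alternative metric mentioned in the note above, precision, recall and F1 can be computed from gold and predicted labels with plain Python. The function name `precision_recall_f1` and the toy label lists are invented for illustration; on the real data, predicted labels could be obtained by calling the model's `classify` method on each feature dictionary.

```python
def precision_recall_f1(gold, predicted, positive_label=1):
    """Compute precision, recall and F1 for one class from parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels (invented for illustration)
toy_gold = [1, 1, 0, 0, 1]
toy_pred = [1, 0, 0, 1, 1]

toy_precision, toy_recall, toy_f1 = precision_recall_f1(toy_gold, toy_pred)
print(toy_precision, toy_recall, toy_f1)
```

Because this dataset has the same number of positive and negative reviews, accuracy and F1 are likely to tell a similar story here; the difference matters more when classes are imbalanced.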
