# HW2: Naïve Bayes

In this homework, you'll implement and (very briefly) discuss a bag-of-words Naïve Bayes sentiment classifier—a simple but effective example of a linear classifier.

This assignment is due at the start of class on September 29. When you're done, upload your edited `ipynb` file to NYU Classes.

## Setup

First, let's load the Stanford Sentiment Treebank. If you don't already have it, download it from here: [the train/dev/test Stanford Sentiment Treebank distribution](http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip), unzip it, and put the resulting folder in the same directory as this notebook. (If you want to put it somewhere else, change `sst_home` below.)

Note: Unlike with k-nearest neighbors, Naïve Bayes evaluation should be quite fast (thousands of examples per second at least), so we don't need to trim down the dev and test sets. 

In [None]:
sst_home = './trees'

import re
import random

def load_sst_data(path):
    # Let's do 2-way positive/negative classification instead of 5-way
    EASY_LABEL_MAP = {0:0, 1:0, 2:None, 3:1, 4:1}
    
    data = []
    with open(path) as f:
        for line in f:
            example = {}
            example['label'] = EASY_LABEL_MAP[int(line[1])]
            if example['label'] is None:
                continue
            
            # Strip out the parse information and the phrase labels---we don't need those here
            text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)
            example['text'] = text[1:]
            data.append(example)

    return data
     
training_set = load_sst_data(sst_home + '/train.txt')
dev_set = load_sst_data(sst_home + '/dev.txt')
test_set = load_sst_data(sst_home + '/test.txt')
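
To see what the loader does to a single line, here is a small sketch that applies the same label lookup and regular expression to a made-up tree. The `line` string is a hypothetical example written in the treebank's bracketed format, not an actual entry from the dataset.

```python
import re

# Hypothetical line in the treebank's bracketed format (not real data).
line = "(3 (2 The) (3 (2 movie) (3 works)))\n"

# The root label is the digit right after the first parenthesis.
root_label = int(line[1])

# Same substitution as in load_sst_data: drop every "(<digit>" and every ")".
text = re.sub(r'\s*(\(\d)|(\))\s*', '', line)

# One leading space survives the substitution, so slice it off.
text = text[1:]

print(root_label)
print(text)
```

Here the root label 3 maps to the positive class under `EASY_LABEL_MAP`, and the remaining text is the plain sentence.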

## Part 1: Bags of words

Next, let's write a function to convert these sentences into feature vectors. The function template here simply extracts three useless (?) dummy features:

- The number of characters in the review.
- The first letter in the review.
- Whether the letters 'th' appear in the review.

This function depends upon a simple dictionary trick that allows us to reason about features by name rather than by index.
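
As a rough illustration of that trick, the sketch below maps named features to vector positions for two toy examples. The feature names (`length`, `has_digit`) and values are hypothetical, and the sketch uses plain lists and sorted names to stay small and deterministic, whereas the template below uses `np.zeros` and an unordered set.

```python
# Hypothetical toy examples: each one stores its features by name.
examples = [
    {'length': 3.0, 'has_digit': 1.0},
    {'length': 5.0},
]

# Collect every feature name that occurs, then give each one an index.
feature_names = sorted(set(name for ex in examples for name in ex))
feature_indices = {name: i for i, name in enumerate(feature_names)}

# Build one dense vector per example; features an example lacks stay at zero.
vectors = []
for ex in examples:
    vector = [0.0] * len(feature_indices)
    for name, value in ex.items():
        vector[feature_indices[name]] = value
    vectors.append(vector)

print(feature_indices)
print(vectors)
```

The code that extracts features never has to know which column a feature ends up in. Indices are assigned only after all the names have been seen.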

For this classifier, we'll be sticking to bag-of-words features. Delete the existing features, and replace them with bag-of-words features.

In [None]:
import numpy as np
import collections

In [None]:
def feature_function(datasets):
    '''Annotates datasets with feature vectors.'''
                         
    feature_names = set()
    for dataset in datasets:
        for example in dataset:
            example['features'] = collections.defaultdict(float)
            
            # Extract features (by name) for one example
            example['features']['dummy_char_count'] = len(example['text'])
            example['features']['dummy_first_char_' + example['text'][0]] = 1
            example['features']['dummy_contains_th'] = 'th' in example['text']
                
            feature_names.update(example['features'].keys())
                            
    # By now, we know what all the features will be, so we can
    # assign indices to them.
    feature_indices = dict(zip(feature_names, range(len(feature_names))))
    indices_to_features = {v: k for k, v in feature_indices.items()}
    dim = len(feature_indices)
                
    # Now we create actual vectors from those indices.
    for dataset in datasets:
        for example in dataset:
            example['vector'] = np.zeros((dim))
            for feature in example['features']:
                example['vector'][feature_indices[feature]] = example['features'][feature]
    return indices_to_features
    
indices_to_features = feature_function([training_set, dev_set, test_set])

## Part 2: Implementing Naïve Bayes

Next, implement a Naïve Bayes classifier that you can train and test on the feature vectors you just extracted. Use Laplace (add-one) smoothing.
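
As a reminder, under one common multinomial bag-of-words formulation (the notation here is an assumption, not something the assignment prescribes), add-one smoothing estimates the likelihood of a word $w$ given a class $c$ as:

$$\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}$$

Here $\mathrm{count}(w, c)$ is the number of times $w$ occurs in training examples of class $c$, and $V$ is the vocabulary. Adding one to every count keeps a word that never appeared with a class from receiving probability zero.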

In [None]:
class NaiveBayesClassifier:
    def __init__(self):
        pass
    
    def train(self, training_set):
        pass
    
    def classify(self, example):
        return 0

Here's how it's trained:

In [None]:
classifier = NaiveBayesClassifier()
classifier.train(training_set)

Here's how it's called. It returns a label (0 for negative, 1 for positive):

In [None]:
classifier.classify(dev_set[0])

Here's a simple function to evaluate a classifier. It expects a function that maps an example to a label.

In [None]:
def evaluate_classifier(classifier, eval_set):
    correct = 0
    for example in eval_set:
        hypothesis = classifier(example)
        if hypothesis == example['label']:
            correct += 1
    return correct / float(len(eval_set))
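
As a quick sanity check of the calling convention, the sketch below evaluates a trivial baseline on a toy set. The function `always_positive` and the four toy examples are hypothetical, and `evaluate_classifier` is repeated only so that the snippet runs on its own.

```python
def evaluate_classifier(classifier, eval_set):
    # Same logic as the cell above, repeated so this sketch is self-contained.
    correct = 0
    for example in eval_set:
        if classifier(example) == example['label']:
            correct += 1
    return correct / float(len(eval_set))

# Hypothetical toy evaluation set: three positive examples, one negative.
toy_set = [
    {'text': 'a fine film', 'label': 1},
    {'text': 'a dull film', 'label': 0},
    {'text': 'great fun', 'label': 1},
    {'text': 'worth seeing', 'label': 1},
]

def always_positive(example):
    # Ignores its input and always predicts the positive class.
    return 1

baseline_accuracy = evaluate_classifier(always_positive, toy_set)
print(baseline_accuracy)  # 0.75
```

Any callable that takes an example dictionary and returns a label works here, which is why the bound method `classifier.classify` can be passed in directly.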

This runs the primary evaluation. It prints accuracy as a fraction between 0 and 1. If you've implemented Naïve Bayes correctly, you should see accuracy above 0.75 (75%) on the dev set. The [original Stanford Sentiment Treebank paper](http://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf) reports 81.8% test accuracy with Naïve Bayes.

In [None]:
print(evaluate_classifier(classifier.classify, dev_set))

## Part 3: Questions

Briefly answer each of the questions below.

**Question 1:** Most implementations of Naïve Bayes (hopefully including yours) never actually compute $P(d|c)$, but instead directly compute $\log P(d|c)$. Why is this?

**Question 2:** In class, we found that a nearest neighbor model built over bag-of-words features barely surpassed 60% accuracy on the dev set. Why is Naïve Bayes so much better at classifying sentences using this same style of feature?

**Question 3:** Do some error analysis: identify three sentences that the model misclassified, and speculate about ways in which a better (but still realistic) machine learning model trained on this same training set might be able to do better.