# HW 1 Classification

Welcome to CS 287 HW1. To begin this assignment, first turn on the Python 3 and GPU backend for this Colab by clicking `Runtime > Change Runtime Type` above.

In this homework you will be building several varieties of text classifiers. Text classifiers are not that exciting from an NLP point of view, but they are a great way to get up to speed on the core technologies we will use in this class.



## Goal

We ask that you construct the following models in PyTorch:

1. A naive Bayes unigram classifier (follow Wang and Manning http://www.aclweb.org/anthology/P/P12/P12-2.pdf#page=118: you should only implement Naive Bayes, not the combined classifier with SVM).
2. A logistic regression model over word types (you can implement this as $y = \sigma(\sum_i W x_i + b)$)
3. A continuous bag-of-words neural network with embeddings (similar to CBOW in Mikolov et al. https://arxiv.org/pdf/1301.3781.pdf).
4. A simple convolutional neural network (any variant of CNN as described in Kim http://aclweb.org/anthology/D/D14/D14-1181.pdf).
5. Your own extensions to these models...

Consult the papers provided for hyperparameters. 
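
One way to read the formula in item 2 above: if each $x_i$ is the one-hot vector of the $i$-th word in the sentence, then $\sum_i W x_i$ is simply the sum of one learned weight per word. The following is a minimal plain-Python sketch of that reading, not a reference solution. The helper names (`sigmoid`, `predict_proba`) and the toy weights are made up for illustration and are not part of the assignment code.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(word_ids, weights, bias):
    """Return sigma(sum_i w[x_i] + b) for a sentence given as word indices.

    `weights` holds one scalar per vocabulary index; indexing it with a word
    id is equivalent to multiplying W by that word's one-hot vector.
    """
    score = sum(weights[i] for i in word_ids) + bias
    return sigmoid(score)

# Toy vocabulary of 4 word types with illustrative (untrained) weights.
toy_weights = [0.0, 1.5, -2.0, 0.5]
toy_bias = 0.0

p = predict_proba([1, 3], toy_weights, toy_bias)  # score = 1.5 + 0.5 = 2.0
```

In an actual model the weights and bias would be learned parameters, for example trained by gradient descent on a cross-entropy loss.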


Throughout this semester we plan to *beta* test the recently proposed NamedTensor to annotate tensor dimensions, because we believe that this makes code more readable and less error-prone. This is an experimental library, though, so please let us know if you run into any issues.

Please see http://nlp.seas.harvard.edu/NamedTensor for more details or https://github.com/harvardnlp/namedtensor to submit a PR.

## Setup

This notebook provides a working definition of the setup of the problem itself. You may construct your models inline or use an external setup (preferred) to build your system.

In [None]:
!pip install -q torch torchtext opt_einsum
!pip install -U git+https://github.com/harvardnlp/namedtensor
!pip install -q numpy
!pip install -q nltk
!pip install -q matplotlib
!pip install -q ipywidgets

In [2]:
import torch
# Text processing library and methods for pretrained word embeddings
import torchtext
import numpy as np
import matplotlib.pyplot as plt

from torchtext.vocab import Vectors, GloVe

# Named Tensor wrappers
from namedtensor import ntorch, NamedTensor
from namedtensor.text import NamedField

The dataset we will use for this problem is known as the Stanford Sentiment Treebank ( https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf ). It is a variant of a standard sentiment classification task. For simplicity, we will use the most basic form: classifying a sentence as positive or negative in sentiment.

To start, `torchtext` requires that we define a mapping from the raw text data to featurized indices. These fields make it easy to map back and forth between readable data and math, which helps for debugging.
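
The idea behind a field's vocabulary can be sketched without `torchtext`: keep a list from index to string (`itos`) and a dictionary from string to index (`stoi`). The snippet below is a simplified illustration, not torchtext's actual implementation. The names `build_toy_vocab`, `encode`, and `decode` are hypothetical, and placing `<unk>` at index 0 and `<pad>` at index 1 is an assumption chosen to match the padding index seen later in this notebook.

```python
from collections import Counter

def build_toy_vocab(sentences, specials=("<unk>", "<pad>")):
    """Assign an index to every word type, most frequent first."""
    counts = Counter(word for sent in sentences for word in sent)
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    itos = list(specials) + ordered
    stoi = {word: index for index, word in enumerate(itos)}
    return itos, stoi

def encode(sentence, stoi):
    """Map words to indices; unseen words fall back to <unk>."""
    return [stoi.get(word, stoi["<unk>"]) for word in sentence]

def decode(indices, itos):
    """Map indices back to readable words."""
    return [itos[i] for i in indices]

toy_sentences = [["a", "good", "movie"], ["a", "bad", "movie"]]
itos, stoi = build_toy_vocab(toy_sentences)
ids = encode(["a", "great", "movie"], stoi)
```

Decoding `ids` with `decode(ids, itos)` shows which words were known to the vocabulary and which were replaced by `<unk>`, which is the kind of round trip that helps when debugging.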

In [3]:
# Our input $x$
TEXT = NamedField(names=('seqlen',))

# Our labels $y$
LABEL = NamedField(sequential=False, names=(), unk_token=None)

Next we load our data. Here we will use the standard SST splits and tell the loader which fields to use. Torchtext also gives us the option of using subtrees in the treebank as additional training examples. The subtrees can be obtained by passing the option `train_subtrees=True` to `splits`. Feel free to experiment with using subtrees and report their effect on performance.

In [4]:
train, val, test = torchtext.datasets.SST.splits(
    TEXT, LABEL,
    filter_pred=lambda ex: ex.label != 'neutral')

Let's look at this data. It is still in its original form: each example consists of a label and the original words.

Be sure to double check that examples with neutral labels were filtered out. 

The length of the training data should be 6920.

In [5]:
print('len(train)', len(train))
print('vars(train[0])', vars(train[0]))

len(train) 6920
vars(train[0]) {'text': ['The', 'Rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'Century', "'s", 'new', '``', 'Conan', "''", 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'Arnold', 'Schwarzenegger', ',', 'Jean-Claud', 'Van', 'Damme', 'or', 'Steven', 'Segal', '.'], 'label': 'positive'}
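
To double check the filtering, one option is to count how often each label occurs. The following is a small sketch, assuming each example exposes a `label` attribute as in the output above. The helper name `label_counts` and the toy examples are illustrative only.

```python
from collections import Counter
from types import SimpleNamespace

def label_counts(examples):
    """Count how many examples carry each label."""
    return Counter(ex.label for ex in examples)

# Toy stand-ins for dataset examples (the real ones come from `train`).
toy_examples = [
    SimpleNamespace(text=["great"], label="positive"),
    SimpleNamespace(text=["awful"], label="negative"),
    SimpleNamespace(text=["fine"], label="positive"),
]

counts = label_counts(toy_examples)
assert "neutral" not in counts
```

In the notebook, `label_counts(train)` should contain only `positive` and `negative`, and the counts should sum to 6920.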


In order to map this data to features, we need to assign an index to each word and label. The function `build_vocab` allows us to do this and provides useful options that we will need in future assignments.

In [6]:
TEXT.build_vocab(train)
LABEL.build_vocab(train)
print('len(TEXT.vocab)', len(TEXT.vocab))
print('len(LABEL.vocab)', len(LABEL.vocab))

len(TEXT.vocab) 16284
len(LABEL.vocab) 2


Finally we are ready to create batches of our training data that can be used for training and validating the model. This function produces 3 iterators that will let us go through the train, val and test data. 

In [7]:
cpu = torch.device("cpu")
cuda = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [8]:
train_iter, val_iter, test_iter = torchtext.data.BucketIterator.splits(
    (train, val, test), batch_size=10, device=cuda)

Let's look at a single batch from one of these iterators. The library automatically converts the underlying words into indices. It then produces tensors for batches of x and y. Here the text tensor has two dimensions: the number of words in the longest sentence of the batch (shorter sentences are padded), followed by the batch size. We can use the vocabulary dictionary to convert back from these indices to words.

In [9]:
batch = next(iter(train_iter))
print("Size of text batch:", batch.text.shape)
example = batch.text.get("batch", 1)
print("Second in batch", example)
print("Converted back to string:", " ".join([TEXT.vocab.itos[i] for i in example.tolist()]))

Size of text batch: OrderedDict([('seqlen', 29), ('batch', 10)])
Second in batch NamedTensor(
	tensor([ 8941,   169,     5, 11257,  1473,   181,     2,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1]),
	('seqlen',))
Converted back to string: Marvelously entertaining and deliriously joyous documentary . <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
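
The padding step can be pictured as follows. This is a plain-Python sketch of the resulting layout, not the iterator's actual code. `pad_batch` is a hypothetical helper, and the pad index of 1 is taken from the `<pad>` entries in the output above.

```python
def pad_batch(sentences, pad_index=1):
    """Pad index lists to a common length and lay them out as seqlen x batch."""
    seqlen = max(len(s) for s in sentences)
    padded = [s + [pad_index] * (seqlen - len(s)) for s in sentences]
    # Transpose so that rows are positions and columns are examples,
    # matching the ('seqlen', 'batch') order printed above.
    return [[row[t] for row in padded] for t in range(seqlen)]

toy_batch = pad_batch([[8, 5, 2], [7, 2], [9]])
shape = (len(toy_batch), len(toy_batch[0]))  # (seqlen, batch)
second = [row[1] for row in toy_batch]       # second example, like get("batch", 1)
```

Note that padded positions are ordinary indices as far as a model is concerned, so a bag-of-words model may want to ignore the pad index explicitly.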


Similarly, it produces a vector holding one label for each example in the batch.

In [10]:
print("Size of label batch:", batch.label.shape)
example = batch.label.get("batch", 1)
print("Second in batch", example.item())
print("Converted back to string:", LABEL.vocab.itos[example.item()])
print(len(batch.text.get("batch", 1)))

Size of label batch: OrderedDict([('batch', 10)])
Second in batch 0
Converted back to string: positive
29


Finally, the Vocab object can be used to map pretrained word vectors to the indices in the vocabulary. This will be very useful for parts 3 and 4 of the problem. Feel free to experiment with different word vectors and report their effect on performance.

In [11]:
# Build the vocabulary with word embeddings
url = 'https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.simple.vec'
TEXT.vocab.load_vectors(vectors=Vectors('wiki.simple.vec', url=url))

print("Word embeddings size ", TEXT.vocab.vectors.size())
print("Word embedding of 'follows', first 10 dim ", TEXT.vocab.vectors[TEXT.vocab.stoi['follows']][:10])

Word embeddings size  torch.Size([16284, 300])
Word embedding of 'follows', first 10 dim  tensor([ 0.3925, -0.4770,  0.1754, -0.0845,  0.1396,  0.3722, -0.0878, -0.2398,
         0.0367,  0.2800])
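
Conceptually, loading vectors builds a matrix with one row per vocabulary index. The following is a rough plain-Python sketch. It assumes, purely for illustration and not as a statement about `torchtext` internals, that words missing from the pretrained file get an all-zero row. `build_embedding_matrix` and the 3-dimensional toy vectors are made up for the example.

```python
def build_embedding_matrix(itos, pretrained, dim):
    """Return one vector per vocabulary index, in index order."""
    zero = [0.0] * dim
    return [list(pretrained.get(word, zero)) for word in itos]

toy_itos = ["<unk>", "<pad>", "movie", "follows"]
toy_pretrained = {
    "movie": [0.1, 0.2, 0.3],
    "follows": [0.4, -0.5, 0.2],
}

matrix = build_embedding_matrix(toy_itos, toy_pretrained, dim=3)
toy_stoi = {word: i for i, word in enumerate(toy_itos)}
row = matrix[toy_stoi["follows"]]  # same lookup pattern as vectors[stoi[word]]
```

A matrix built this way has the same row order as the vocabulary, which is what allows it to initialize an embedding layer that is indexed by the word ids in `batch.text`.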


## Assignment

Now it is your turn to build the models described at the top of the assignment. 

Using the data given by this iterator, you should construct 4 different torch models that take in `batch.text` and produce a distribution over labels.

When a model is trained, use the following test function to produce predictions, and then upload them to the Kaggle competition:  https://www.kaggle.com/c/harvard-cs287-s19-hw1

In [12]:
def test_code(model):
    "All models should be able to be run with the following command."
    upload = []
    # Update: for kaggle the bucket iterator needs to have batch_size 10
    test_iter = torchtext.data.BucketIterator(test, train=False, batch_size=10)
    for batch in test_iter:
        # Your prediction data here (don't cheat!)
        probs = model(batch.text.to(cuda))
        # here we assume the model returns the probability of class 1 for each
        # example, so rounding gives the predicted label index
        argmax = probs.round().int()
        upload += argmax.tolist()
    with open("predictions.txt", "w") as f:
        f.write("Id,Category\n")
        for i, u in enumerate(upload):
            f.write(str(i) + "," + str(u) + "\n")


In addition, you should put up a (short) write-up following the template provided in the repository:  https://github.com/harvard-ml-courses/nlp-template

## Models

### Naive Bayes

In [29]:
def generate_naive_bayes_model(training_data, alpha):
  # Per-class example counts, and per-class word counts smoothed by alpha.
  labelCounts = ntorch.ones(2, names=("class",)).to(cuda) * 0
  vocabCounts = ntorch.ones(len(TEXT.vocab), 2, names=("vocab", "class")).to(cuda) * alpha
  # Identity matrices used as one-hot lookup tables for labels and words.
  classes = ntorch.tensor(torch.eye(2), names=("class", "classIndex")).to(cuda)
  encoding = ntorch.tensor(torch.eye(len(TEXT.vocab)), names=("vocab", "index")).to(cuda)
  for batch in training_data:
    oneHot = encoding.index_select("index", batch.text)
    # Binarize: a word counts once per sentence, however often it occurs.
    setofwords, _ = oneHot.max("seqlen")
    classRep = classes.index_select("classIndex", batch.label)
    labelCounts += classRep.sum("batch")
    vocabCounts += setofwords.dot("batch", classRep)
    
  # Log-count ratio r and bias b, following Wang and Manning.
  p = vocabCounts.get("class", 1)
  q = vocabCounts.get("class", 0)
  r = ((p*q.sum())/(q*p.sum())).log()
  weight = r
  # Use the tensor's own log, as for r above, instead of np.log.
  b = (labelCounts.get("class", 1)/labelCounts.get("class", 0)).log()
  def naive_bayes(test_batch):
    oneHotTest = encoding.index_select("index", test_batch.to(cuda))
    setofwords, _ = oneHotTest.max("seqlen")
    y = (weight.dot("vocab", setofwords) + b).sigmoid()
    return y
  
  return naive_bayes
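
Read against Wang and Manning's formulation, the code above appears to compute the following quantities. Here $f^{(i)}$ is the binary word-presence vector of training example $i$ (the `max` over `seqlen`), $\alpha$ is the smoothing parameter `alpha`, $N_1$ and $N_0$ are the class counts in `labelCounts`, and classes 1 and 0 refer to the indices in `LABEL.vocab`:

$$
p = \alpha + \sum_{i:\, y^{(i)} = 1} f^{(i)}, \qquad q = \alpha + \sum_{i:\, y^{(i)} = 0} f^{(i)}
$$

$$
r = \log \frac{p / \lVert p \rVert_1}{q / \lVert q \rVert_1}, \qquad b = \log \frac{N_1}{N_0}
$$

$$
\hat{y} = \sigma\left(r^\top f + b\right)
$$

Rounding $\hat{y}$ at 0.5 is the same as thresholding $r^\top f + b$ at 0, so the predicted class is 1 exactly when the score is positive.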
  

In [30]:
def get_accuracy(model):
  wrong = 0
  total = 0
  test_iter = torchtext.data.BucketIterator(test, train=False, batch_size=10)
  for batch in test_iter:
    # Use the `cuda` device defined above so this also runs without a GPU.
    probs = model(batch.text.to(cuda))
    predictions = probs.round().int().to(cpu).numpy()
    answers = batch.label.to(cpu).numpy()
    total += len(predictions)
    wrong += np.abs(predictions-answers).sum()
  right = total - wrong
  return right/total

In [31]:
def naive_bayes_accuracy(alpha):
  return get_accuracy(generate_naive_bayes_model(train_iter,alpha))

In [32]:
accuracy_tester = np.vectorize(naive_bayes_accuracy)