
# HW 2 

This homework notebook has been adapted from the PyTorch tutorial [Text Classification with the TorchText Library](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html).

[Torchtext](https://pytorch.org/text/stable/index.html) is a library within the PyTorch framework that consists of data processing utilities and popular datasets for natural language processing.

In this homework, we will
- build a logistic regression model for text classification using bag of words (BoW).
- extend the above model to use continuous bag of words (CBoW).
- consider some other extensions, such as using a better version of gradient descent.

You have to complete **13 tasks**, specified at appropriate places, worth a total of 70 points.  Most of them require writing some code.  Please add your code (or written answers) to this notebook and submit it as part of your homework.



## Environment setup
Below is a guide to setting up a Python environment for this assignment.  We recommend installing and using [Anaconda](https://www.anaconda.com/) to manage Python environments.
After [Anaconda is installed](https://docs.anaconda.com/anaconda/install/index.html), 
use the commands below to create and activate the environment.
```
conda create --name capp30255 python=3.10
conda activate capp30255
```

PyTorch, and NLP libraries such as [torchtext](https://pytorch.org/text/stable/index.html) in particular, are evolving rapidly. In our experience, code written for one version of a library often stops working when a new version is released a few months later.  So it is important that we all use the same versions.  This homework has been tested on PyTorch 2.0 and torchtext 0.15.1.

[Install PyTorch](https://pytorch.org/get-started/locally/) following the instructions for your platform.  For example, one may run `conda install pytorch=2.0 -c pytorch` on a Mac, and `conda install pytorch=2.0 cpuonly -c pytorch` on Windows or GNU/Linux (if no GPU is to be used).

Next, install jupyter and matplotlib:
```
conda install jupyter
conda install matplotlib -c conda-forge
```

Finally, we also need to install some other dependencies. Conda doesn't seem to have the latest versions, so it is better to use ``pip``.

```
pip install torchtext==0.15.1
pip install torchdata==0.6.0
pip install portalocker==2.7.0
```

Remember to run the jupyter notebook with the kernel `capp30255` as named above. 

In [None]:
import torch
import torchtext
import torchdata
import portalocker

In [None]:
# Function to ensure the correct versions are installed.
def check_lib_versions():
    libversions = {torch: "2.0", torchtext: "0.15.1", torchdata: "0.6.0",
                   portalocker: "2.7.0"}
    for lib, v in libversions.items():
        # Compare by prefix: torch reports versions such as "2.0.1" or
        # "2.0.1+cpu", which an exact comparison with "2.0" would reject.
        if not lib.__version__.startswith(v):
            print(f'Error: The version of {lib.__name__} should be {v}, '
                  f'but found {lib.__version__}.')
check_lib_versions()

## Classes Dataset and DataLoader

Dataset and DataLoader are PyTorch classes that provide utilities for iterating through and sampling from a dataset. They provide several features for advanced applications (e.g., skim through [this tutorial](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html) on writing custom datasets and dataloaders).

We'll work with the ``AG_NEWS`` dataset included within torchtext, and will write a custom dataloader to create minibatches of examples for training and testing.  The [``AG_NEWS``](https://rdrr.io/cran/textdata/man/dataset_ag_news.html) dataset consists of about 120,000 examples of text from news sources, each labeled with one of 4 classes (world, sports, business, science and technology). 

In [None]:
from torchtext.datasets import AG_NEWS

In [None]:
train_data = AG_NEWS(split='train')

A dataset is an iterable. It returns an iterator via an ``iter()`` call. When used in a for loop or with ``next()``, the iterator returns a sequence of examples, each of which is a pair of a label and a text.

In [None]:
train_iter = iter(train_data)
example1 = next(train_iter)
example1

In [None]:
next(train_iter)

In [None]:
next(train_iter)
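
The same iterable/iterator behavior can be seen with a plain Python list. The sketch below uses a made-up `toy_data` list (not part of `AG_NEWS`) as a stand-in for the dataset, so it runs without downloading anything.

```python
# Toy stand-in for a dataset: a list of (label, text) pairs.
# These three examples are invented for illustration only.
toy_data = [
    (3, "Stocks rally as markets recover"),
    (2, "Home team wins the final"),
    (4, "New chip doubles battery life"),
]

toy_iter = iter(toy_data)   # an iterable hands out an iterator
first = next(toy_iter)      # next() advances the iterator by one example
label, text = first         # each example unpacks into (label, text)

# A for loop (or comprehension) consumes whatever the iterator has left.
remaining_labels = [lbl for lbl, _ in toy_iter]
```

Note that an iterator is consumed as it is read, which is why the notebook calls `AG_NEWS(split='train')` again whenever it needs a fresh pass over the data.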

## Preprocessing: tokenizing, converting to BoW


In order to convert a piece of text into, say, a BoW vector, we need to do the following:

- Split up the text into a sequence of words or tokens.  This can be a surprisingly complex task because of various uses of apostrophes, contractions, etc.  (E.g., see this [article](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html) at Stanford.)  We'll simply use a tokenizer provided by torchtext.

- Determine which tokens will be included in our BoW vector, and which will be ignored.  Using too few will degrade prediction performance, while using too many will slow down the system.  We'll only include words that occur in the training set at least a given minimum number of times.

- Convert each included word into a numerical index corresponding to its position in the BoW vectors. Torchtext provides a ``Vocab`` class to help with this. (Alternatively, we can convert each word into a dense vector representation using pretrained representations such as [GloVe](https://nlp.stanford.edu/projects/glove/).) 
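
The three steps above can be sketched in plain Python on a tiny corpus. This is only an illustration: `simple_tokenize` and `build_toy_vocab` are hypothetical helpers, far cruder than torchtext's tokenizer and ``Vocab`` class, and the index order they produce need not match torchtext's.

```python
from collections import Counter

def simple_tokenize(text):
    """Lowercase and split on whitespace (much cruder than a real tokenizer)."""
    return text.lower().split()

def build_toy_vocab(texts, min_freq, unk="<unk>"):
    """Map each token seen at least min_freq times to an index; unk gets index 0."""
    counts = Counter(tok for text in texts for tok in simple_tokenize(text))
    # Keep frequent tokens, most frequent first, ties broken alphabetically.
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    return {tok: i for i, tok in enumerate([unk] + kept)}

texts = ["the cat sat", "the dog sat", "the bird flew"]
toy_vocab = build_toy_vocab(texts, min_freq=2)

# Tokens outside the vocabulary fall back to the index of "<unk>".
indices = [toy_vocab.get(tok, toy_vocab["<unk>"])
           for tok in simple_tokenize("The cat sat")]
```

Here "cat" occurs only once, so it is dropped from the vocabulary and mapped to the unknown-word index.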

In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')
train_iter = AG_NEWS(split='train')

def yield_tokens(train_iter):
    for _, text in train_iter:
        yield tokenizer(text)

In [None]:
# Only include words that occur at least 1000 times in the training data.
# Also let "<unk>" represent unknown words, i.e., words not in the vocabulary.

vocab = build_vocab_from_iterator(
    yield_tokens(train_iter), specials=["<unk>"], min_freq=1000)


See the [documentation](https://pytorch.org/text/stable/vocab.html) for ``Vocab``.  Now write code to answer the following:
- **Task 1** [2 points]: Print the number of words in ``vocab``.
- **Task 2** [2]: Print the index of the word "economy".
- **Task 3** [2]: Print the word at index 500.
- **Task 4** [2]: Find out what index vocab has for the special unknown token `"<unk>"`, and set it as the default index of `vocab`.


In [None]:
### WRITE YOUR CODE HERE



In [None]:
# Print the indices corresponding to the text in example1
print(example1[1])
print([vocab[token] for token in tokenizer(example1[1])])

A [dataloader](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader) takes a dataset and creates minibatches of examples, each of a specified batch size.  We specify how it processes each example from the dataset via a custom collate function.
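
To make the roles concrete, here is a simplified pure-Python stand-in for what a dataloader does with a collate function. It is not how `DataLoader` is implemented; `toy_batches` and `collate_lengths` are hypothetical names, and the collate function here returns plain lists rather than tensors.

```python
def toy_batches(dataset, batch_size, collate_fn):
    """Yield collated minibatches; the last one may be smaller than batch_size."""
    batch = []
    for example in dataset:
        batch.append(example)
        if len(batch) == batch_size:
            yield collate_fn(batch)
            batch = []
    if batch:  # leftover examples form a final, shorter batch
        yield collate_fn(batch)

def collate_lengths(batch):
    """Hypothetical collate function: return labels and token counts as two lists."""
    labels = [label for label, _ in batch]
    lengths = [len(text.split()) for _, text in batch]
    return labels, lengths

toy_dataset = [(1, "a b c"), (2, "d e"), (3, "f"), (4, "g h i j"), (1, "k l")]
batches = list(toy_batches(toy_dataset, batch_size=2, collate_fn=collate_lengths))
```

With five examples and a batch size of 2, the final batch holds a single example, which is worth keeping in mind whenever a computation depends on the batch size.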

**Task 5** [10]: Write a collate function ``collate_into_bow`` that accepts a batch of k examples created from the dataset above and returns two tensors:
a tensor of shape (k,) containing the labels of the batch, and a tensor of shape (k, m), in which m is the number of tokens in the vocabulary, containing the BoW vectors for the examples. Further:

1. The labels in the dataset are numbers 1 to 4. Since PyTorch is 0-indexed, please convert them to numbers 0 to 3 in the collate function.
2. Remember that each entry in a BoW vector is the **relative frequency** of the corresponding word in the text, i.e., its count divided by the total number of tokens in the text.
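
As a worked illustration of relative frequency, the sketch below computes one BoW vector as a plain Python list for a hypothetical text whose tokens have already been mapped to indices. The helper name `relative_frequencies` is made up, and the sketch assumes the text contains at least one token.

```python
from collections import Counter

def relative_frequencies(token_indices, vocab_size):
    """Return a list of length vocab_size holding count / total for each index."""
    counts = Counter(token_indices)
    total = len(token_indices)  # assumed to be nonzero
    return [counts[i] / total for i in range(vocab_size)]

# Hypothetical text whose four tokens map to indices 3, 7, 7, 9
# in a 10-word vocabulary.
bow = relative_frequencies([3, 7, 7, 9], vocab_size=10)
```

Index 7 appears twice among four tokens, so its entry is 0.5, and the entries of the vector sum to 1.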

In [None]:
from torch.utils.data import DataLoader

def collate_into_bow(batch):
    ## WRITE YOUR CODE HERE
    pass


def test_collate():
    w1 = vocab.lookup_token(3)
    w2 = vocab.lookup_token(7)
    w3 = vocab.lookup_token(8)
    w4 = vocab.lookup_token(9)
    examples = [
        (1, " ".join([w1, w2, w3, w4])),
        (2, " ".join([w2, w1, w3, w4])),
        (4, " ".join([w4, w2, w3, w4])),
        (3, " ".join([w2, w2, w2, w4])),
        (3, " ".join([w1, w2])),

    ]
    bowt = torch.tensor(
        [
            [0.0, 0.0, 0.0, 0.25, 0.0, 0.0, 0.0, 0.25, 0.25, 0.25],
            [0.0, 0.0, 0.0, 0.25, 0.0, 0.0, 0.0, 0.25, 0.25, 0.25],
            [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25, 0.25, 0.50],
            [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.75, 0.0, 0.25],
            [0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0, 0.00],

        ]) 
    lt, tt = collate_into_bow(examples)
    assert lt.shape == torch.Size([5])
    assert tt.shape == torch.Size([5, len(vocab)])
    assert torch.equal(lt, torch.tensor([0, 1, 3, 2, 2]))
    assert torch.equal(tt[:,:10], bowt)
    assert tt[:,10:].sum().item() == 0.00
    print('Test passed.')
    
test_collate()

The collate function is provided to a dataloader as shown below.

In [None]:
train_iter = AG_NEWS(split='train')
dataloader = DataLoader(train_iter, batch_size=16, shuffle=False, 
                        collate_fn=collate_into_bow)
for idx, (lt, tt) in enumerate(dataloader):
    print(idx, lt.shape, tt.shape)
    if idx == 4: 
        break

## A BoW Classifier Class

**Task 6** [5]: Write a BoWClassifier class with one single linear layer,
similar to the one in [Robert Guthrie's tutorial](https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html#sphx-glr-beginner-nlp-deep-learning-tutorial-py).


In [None]:
from torch import nn
import torch.nn.functional as F

class BoWClassifier(nn.Module):
    ## WRITE YOUR CODE BELOW
    pass  

The following creates a model object of the class BoWClassifier.




In [None]:
train_data = AG_NEWS(split='train')
num_labels = len(set([label for (label, text) in train_data]))
vocab_size = len(vocab)
model = BoWClassifier(num_labels, vocab_size)

## Training an epoch

The code below is similar to what we saw in Guthrie's tutorial. It prints the loss every 500 iterations. ``model.train()`` is used by PyTorch to set the model in training mode.  This only changes the behavior of certain layers, such as dropout and batch normalization, so it has no effect on a single linear layer.

In [None]:
import time

loss_function = torch.nn.NLLLoss()

def train_an_epoch(dataloader, optimizer):
    model.train() # Sets the module in training mode.
    log_interval = 500

    for idx, (label, text) in enumerate(dataloader):
        model.zero_grad()
        log_probs = model(text)
        loss = loss_function(log_probs, label)
        loss.backward()
        optimizer.step()
        if idx % log_interval == 0 and idx > 0:
            print(f'At iteration {idx} the loss is {loss:.3f}.')
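
``NLLLoss`` expects log-probabilities as input, so the training loop above presumably relies on the model's forward pass ending in a log-softmax, as in Guthrie's tutorial. The pure-Python sketch below shows what the loss computes for a small batch; the scores are invented, and `log_softmax` and `nll_loss` are illustrative helpers rather than the PyTorch functions.

```python
import math

def log_softmax(scores):
    """Convert raw scores to log-probabilities (numerically stable)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def nll_loss(batch_log_probs, labels):
    """Mean negative log-probability assigned to the correct labels."""
    picked = [lp[y] for lp, y in zip(batch_log_probs, labels)]
    return -sum(picked) / len(picked)

# Two examples with 4 classes each; the second has uniform scores.
scores = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0]]
log_probs = [log_softmax(row) for row in scores]
loss = nll_loss(log_probs, labels=[0, 2])
```

For the second example every class has probability 1/4, so its contribution to the loss is log 4; a confident correct prediction contributes much less.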

## Computing average accuracy on a validation set

**Task 7** [7]: Write a function ``get_accuracy`` to compute the average accuracy of the model for a given dataloader.  Your code should iterate through all the examples, for each find the predicted label with the highest probability, and count the number of examples in which this predicted label is correct.  It should then return the average accuracy. Remember that although most batches will have a fixed number of examples (the given batch size), the last batch may have fewer examples.  So you should explicitly count the number of examples in each batch.

In [None]:
def get_accuracy(dataloader):
    model.eval()
    with torch.no_grad():
        ## WRITE YOUR CODE BELOW.    
        pass
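
The reason for counting examples explicitly can be seen with a little arithmetic. The sketch below uses invented per-batch results to compare the overall accuracy with a naive average of per-batch accuracies.

```python
# Hypothetical per-batch results: (number correct, number of examples).
batch_results = [(60, 64), (58, 64), (1, 2)]  # the last batch is short

total_correct = sum(correct for correct, _ in batch_results)
total_examples = sum(size for _, size in batch_results)
overall_accuracy = total_correct / total_examples

# Averaging per-batch accuracies weights the 2-example batch as heavily
# as each 64-example batch, which skews the estimate.
mean_of_batch_accuracies = (
    sum(c / n for c, n in batch_results) / len(batch_results)
)
```

Here the overall accuracy is 119/130, while the naive average is noticeably lower because the short batch is over-weighted.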

## Create training, validation, and testing dataloaders

Since the original ``AG_NEWS`` has no validation split, we split the training
dataset into train/valid sets with a split ratio of 0.95 (train) and
0.05 (valid). Here we use the
[``torch.utils.data.dataset.random_split``](https://pytorch.org/docs/stable/data.html?highlight=random_split#torch.utils.data.random_split)
function from the PyTorch core library.

In [None]:
from torch.utils.data.dataset import random_split

BATCH_SIZE = 64 # batch size for training
  
train_valid_data, test_data = AG_NEWS()
train_valid_data = list(train_valid_data)
num_train = int(len(train_valid_data) * 0.95)
num_valid = len(train_valid_data) - num_train
train_data, valid_data = random_split(
    train_valid_data, [num_train, num_valid])

train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE,
                              shuffle=True, 
                              collate_fn=collate_into_bow)
valid_dataloader = DataLoader(valid_data, batch_size=BATCH_SIZE,
                              shuffle=False, 
                              collate_fn=collate_into_bow)
test_dataloader = DataLoader(test_data, batch_size=BATCH_SIZE,
                             shuffle=False, 
                             collate_fn=collate_into_bow)

## Training

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

EPOCHS = 15 # number of epochs
optimizer = torch.optim.SGD(model.parameters(), lr=3)

accuracies=[]
for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train_an_epoch(train_dataloader, optimizer)
    accuracy = get_accuracy(valid_dataloader)
    accuracies.append(accuracy)
    time_taken = time.time() - epoch_start_time
    print()
    print(f'After epoch {epoch} ({time_taken:.1f}s) the validation accuracy is {accuracy:.3f}.')
    print()
    
plt.plot(range(1, EPOCHS+1), accuracies)

**Task 8** [10]: Run the model for a sufficient number of epochs such that the model shows overfitting, and submit a pdf of the plot of accuracy against number of epochs.  Determine the optimal number of epochs to train for.  Write code to estimate the accuracy of your model corresponding to this optimal number of epochs and report this estimated accuracy.

**Task 9** [5]: Notice above that both the printed losses and the accuracies keep varying and do not necessarily increase or decrease in a steady fashion.  List all the reasons you can think of for this variance in the loss and the accuracy.

## Adding a pre-trained embedding

[GloVe](https://nlp.stanford.edu/projects/glove/) is a set of pre-trained dense vector representations, or embeddings, of words.  Torchtext has support for GloVe. (The first run takes several minutes because the vectors have to be downloaded.)

In [None]:
from itertools import combinations
from torchtext.vocab import GloVe

# It is best to save GloVe data in a cache to reuse across projects.
# Change this path to a directory on your own machine.
VECTOR_CACHE_DIR = '/Users/amitabh/mlpp23/.vector_cache'

glove = GloVe(name='6B', cache=VECTOR_CACHE_DIR)

words = ["hello", "hi", "king", "president"]
vecs = glove.get_vecs_by_tokens(words)

print(vecs.shape)
print()
for (i, j) in combinations(range(4), 2):
    print(words[i], words[j], vecs[i].dot(vecs[j]))
print()
print(vecs)

**Task 10** [10]: Write a new collate function ``collate_into_cbow`` that returns a CBoW representation of each batch, using GloVe.
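
As a rough illustration of the idea, the sketch below assumes that the CBoW representation of a text is the average of the embedding vectors of its tokens (check this against the definition used in class). It uses made-up 3-dimensional embeddings in place of GloVe, and maps unknown words to the zero vector as a further assumption.

```python
# Hypothetical 3-dimensional embeddings; real GloVe vectors are much longer.
toy_embeddings = {
    "good":  [1.0, 0.0, 2.0],
    "movie": [0.0, 2.0, 2.0],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumption: unknown words map to the zero vector

def cbow_vector(tokens, embeddings, dim=3):
    """Average the embedding vectors of the tokens, coordinate by coordinate."""
    vecs = [embeddings.get(tok, UNKNOWN) for tok in tokens]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

# "tonight" is not in toy_embeddings, so it contributes a zero vector.
doc_vec = cbow_vector(["good", "movie", "tonight"], toy_embeddings)
```

Unlike a BoW vector, whose length equals the vocabulary size, this representation has the same length as a single word embedding regardless of how many words the vocabulary contains.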

**Task 11** [5]: Write copies of other functions as needed to determine the estimated accuracy of the (optimal) model that incorporates GloVe.

## Using the Adam optimizer

The [Adam](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) optimizer is often preferred to SGD because of its better convergence properties.
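
To give a feel for how Adam differs from plain SGD, here is a minimal pure-Python sketch of both update rules on a one-dimensional toy objective. The objective, starting point, and step counts are invented for illustration; the hyperparameters `beta1`, `beta2`, and `eps` follow the commonly used defaults. This is not the PyTorch implementation.

```python
import math

def grad(x):
    """Gradient of the toy objective f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)

def run_sgd(x, lr, steps):
    """Plain gradient descent: step against the raw gradient."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def run_adam(x, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: scale each step by running estimates of the gradient moments."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

x_sgd = run_sgd(0.0, lr=0.1, steps=500)
x_adam = run_adam(0.0, lr=0.1, steps=500)
```

Because Adam normalizes each step by the recent gradient magnitude, its learning rate is not comparable to SGD's. The default in `torch.optim.Adam` is `lr=1e-3`, far smaller than the `lr=3` used with SGD above, so expect to retune the learning rate when switching optimizers.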

**Task 12** [5]: Write copies of functions as needed to plot the convergence of the Adam optimizer.

## Other Optimizations

**Task 13** [5]: Briefly describe 3 ways by which you could make the above code run faster or improve its accuracy.  (You don't have to implement your suggestions.)