In [1]:
# Please do not change this cell because some hidden tests might depend on it.
import os

# Otter grader does not handle ! commands well, so we define and use our
# own function to execute shell commands.
def shell(commands, warn=True):
    """Executes the string `commands` as a sequence of shell commands.
     
       Prints the result to stdout and returns the exit status. 
       Provides a printed warning on non-zero exit status unless `warn` 
       flag is unset.
    """
    file = os.popen(commands)
    print (file.read().rstrip('\n'))
    exit_status = file.close()
    if warn and exit_status != None:
        print(f"Completed with errors. Exit status: {exit_status}\n")
    return exit_status

shell("""
ls requirements.txt >/dev/null 2>&1
if [ ! $? = 0 ]; then
 rm -rf .tmp
 git clone https://github.com/cs187-2021/project1.git .tmp
 mv .tmp/requirements.txt ./
 rm -rf .tmp
fi
pip install -q -r requirements.txt
""")




In [2]:
# Initialize Otter
import otter
grader = otter.Notebook()

$$
\renewcommand{\vect}[1]{\mathbf{#1}}
\renewcommand{\cnt}[1]{\sharp(#1)}
\renewcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\renewcommand{\softmax}{\operatorname{softmax}}
\renewcommand{\Prob}{\Pr}
\renewcommand{\given}{\,|\,}
$$


# CS187

## Project segment 1: Text classification

In this project segment you will build several varieties of text classifiers using PyTorch.

1. A majority baseline.
2. A naive Bayes classifer.
3. A logistic regression (single-layer perceptron) classifier.
4. A multilayer perceptron classifier.

# Preparation {-}

In [3]:
import copy
import re
import wget
import torch
import torch.nn as nn
import torchtext.legacy as tt

from collections import Counter
from torch import optim
from tqdm.auto import tqdm

In [4]:
# Random seed
random_seed = 1234
torch.manual_seed(random_seed)

## GPU check
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cpu


# The task: Answer types for ATIS queries

For this and future project segments, you will be working with a standard natural-language-processing dataset, the [ATIS (Airline Travel Information System) dataset](https://www.kaggle.com/siddhadev/atis-dataset-from-ms-cntk). This dataset is composed of queries about flights – their dates, times, locations, airlines, and the like.

Over the years, the dataset has been annotated in all kinds of ways, with parts of speech, informational chunks, parse trees, and even corresponding SQL database queries. You'll use various of these annotations in future assignments. For this project segment, however, you'll pursue an easier classification task: **given a query, predict the answer type**.

These queries ask for different types of answers, such as

* Flight IDs: "Show me the flights from Washington to Boston"
* Fares: "How much is the cheapest flight to Milwaukee"
* City names: "Where does flight 100 fly to?"

In all, there are some 30 answer types to the queries.

Below is an example taken from this dataset:

_Query:_

```
show me the afternoon flights from washington to boston
```

_SQL:_

```
SELECT DISTINCT flight_1.flight_id FROM flight flight_1 , airport_service airport_service_1 , city city_1 , airport_service airport_service_2 , city city_2 
   WHERE flight_1.departure_time BETWEEN 1200 AND 1800 
     AND ( flight_1.from_airport = airport_service_1.airport_code 
           AND airport_service_1.city_code = city_1.city_code 
           AND city_1.city_name = 'WASHINGTON' 
           AND flight_1.to_airport = airport_service_2.airport_code 
           AND airport_service_2.city_code = city_2.city_code 
           AND city_2.city_name = 'BOSTON' )
```

In this project segment, we will consider the answer type for a natural-language query to be the target field of the corresponding SQL query. For the above example, the answer type would be *flight_id*.

## Loading and preprocessing the data

> Read over this section, executing the cells, and **making sure you understand what's going on before proceeding to the next parts.**

First, let's download the dataset.

In [5]:
data_dir = "https://raw.githubusercontent.com/nlp-course/data/master/ATIS/"
os.makedirs('data', exist_ok=True)
for file in ["train.nl",
             "train.sql",
             "dev.nl",
             "dev.sql",
             "test.nl",
             "test.sql"]:
    wget.download(f"{data_dir}{file}", out='data/')

100% [............................................................................] 250477 / 250477

We use `torchtext` to prepare the data, as in lab 1-5. More information on `torchtext` can be found at https://pytorch.org/text/0.8.1/data.html.

> You'll notice that we link to version 0.8.1 of the PyTorch documentation, because the `torchtext.data.Field` class is now deprecated. We therefore imported it as `torchtext.legacy` at the top of this notebook. Sadly, torchtext has no convenient replacement for `Field` at the moment.

To begin, `torchtext` requires that we define a mapping from the raw data to featurized indices, called a [`Field`](https://torchtext.readthedocs.io/en/latest/data.html#fields). We need one field for processing the question (`TEXT`), and another for processing the label (`LABEL`). These fields make it easy to map back and forth between readable data and lower-level representations like numbers.

In [6]:
TEXT = tt.data.Field(lower=True,            # lowercase all tokens
                     sequential=True,       # sequential data
                     include_lengths=False, # do not include lengths
                     batch_first=True,      # batches will be batch_size X max_len
                     tokenize=tt.data.get_tokenizer("basic_english")) 
LABEL = tt.data.Field(batch_first=True, sequential=False, unk_token=None)

We provide an interface for loading ATIS data, built on top of [`torchtext.data.Dataset`](https://pytorch.org/text/data.html#torchtext.data.Dataset). 

In [7]:
class ATIS(tt.data.Dataset):
  @staticmethod
  def sort_key(ex):
    return len(ex.text)

  def __init__(self, path, text_field, label_field, **kwargs):
    """Creates an ATIS dataset instance given a path and fields.
    Arguments:
        path: Path to the data file
        text_field: The field that will be used for text data.
        label_field: The field that will be used for label data.
        Remaining keyword arguments: Passed to the constructor of
            tt.data.Dataset.
    """
    fields = [('text', text_field), ('label', label_field)]
    
    examples = []
    # Get text
    with open(path+'.nl', 'r') as f:
        for line in f:
            ex = tt.data.Example()
            ex.text = text_field.preprocess(line.strip()) 
            examples.append(ex)
    
    # Get labels
    with open(path+'.sql', 'r') as f:
        for i, line in enumerate(f):
            label = self._get_label_from_query(line.strip())
            examples[i].label = label
            
    super(ATIS, self).__init__(examples, fields, **kwargs)
  
  def _get_label_from_query(self, query):
    """Returns the answer type from `query` by dead reckoning.
    It's basically the second or third token in the SQL query.
    """    
    match = re.match(r'\s*SELECT\s+(DISTINCT\s*)?(\w+\.)?(?P<label>\w+)', query)
    if match:
        label = match.group('label')
    else:
        raise RuntimeError(f'no label in query {query}')
    return label

  @classmethod
  def splits(cls, text_field, label_field, path='./',
              train='train', validation='dev', test='test',
              **kwargs):
    """Create dataset objects for splits of the ATIS dataset.
    
    Arguments:
        text_field: The field that will be used for the sentence.
        label_field: The field that will be used for label data.
        root: The root directory that the dataset's zip archive will be
            expanded into; therefore the directory in whose trees
            subdirectory the data files will be stored.
        train: The filename of the train data. Default: 'train.txt'.
        validation: The filename of the validation data, or None to not
            load the validation set. Default: 'dev.txt'.
        test: The filename of the test data, or None to not load the test
            set. Default: 'test.txt'.
        Remaining keyword arguments: Passed to the splits method of
            Dataset.
    """

    train_data = None if train is None else cls(
        os.path.join(path, train), text_field, label_field, **kwargs)
    val_data = None if validation is None else cls(
        os.path.join(path, validation), text_field, label_field, **kwargs)
    test_data = None if test is None else cls(
        os.path.join(path, test), text_field, label_field, **kwargs)
    return tuple(d for d in (train_data, val_data, test_data)
                   if d is not None)

We split the data into training, validation, and test corpora, and build the vocabularies from the training data.

In [8]:
import math
#Make splits for data
train_data, val_data, test_data = ATIS.splits(TEXT, LABEL, path='./data/')

# Build vocabulary for data fields
MIN_FREQ = 3 # words appearing fewer than 3 times are treated as 'unknown'
TEXT.build_vocab(train_data, min_freq=MIN_FREQ)
LABEL.build_vocab(train_data)

# Compute size of vocabulary
print(TEXT.vocab.itos)
vocab_size = len(TEXT.vocab.itos)
num_labels = len(LABEL.vocab.itos)
print(f"Size of vocab: {vocab_size}")
print(f"Number of labels: {num_labels}")

['<unk>', '<pad>', 'to', 'from', 'flights', 'the', 'on', 'what', 'me', 'flight', 'boston', 'show', 'i', 'san', 'denver', 'a', 'in', 'francisco', 'atlanta', 'and', 'pittsburgh', 'is', 'dallas', 'baltimore', 'all', 'philadelphia', 'like', "'", 'are', 'list', 'airlines', 'of', 'that', 'between', 'washington', 'leaving', 'please', 'morning', 'would', 'fare', 'fly', 'for', 'first', 'oakland', 'after', 'there', 'wednesday', 'd', 'ground', 'cheapest', 'you', 'transportation', 'does', 'class', 'need', 'trip', 'city', 'arriving', 'round', 'available', 'have', 'before', 'with', 'afternoon', 'which', 'one', 'fares', 'way', 'american', 'new', 'leave', 'at', 'give', 'monday', 'want', 'dc', 'york', 'earliest', 'nonstop', 'thursday', 'arrive', 'united', 'go', 'information', 'tuesday', 'can', 'airport', 'find', 'how', '.', 'st', 'evening', 'twenty', 'newark', 'noon', 'miami', 'milwaukee', 'delta', 'sunday', 'any', 'august', 'vegas', 'charlotte', 'las', 's', 'continental', 'do', 'o', 'stop', 'chicago',

To get a sense of the kinds of things that are asked about in this dataset, here is the list of all of the answer types in the training data.

In [9]:
for i, label in enumerate(sorted(LABEL.vocab.itos)):
    print(f"{i:2d} {label}") 

 0 advance_purchase
 1 aircraft_code
 2 airline_code
 3 airport_code
 4 airport_location
 5 arrival_time
 6 basic_type
 7 booking_class
 8 city_code
 9 city_name
10 count
11 day_name
12 departure_time
13 fare_basis_code
14 fare_id
15 flight_id
16 flight_number
17 ground_fare
18 meal_code
19 meal_description
20 miles_distant
21 minimum_connect_time
22 minutes_distant
23 restriction_code
24 state_code
25 stop_airport
26 stops
27 time_elapsed
28 time_zone_code
29 transport_type


## Handling unknown words

Note that we mapped words appearing fewer than 3 times to a special _unknown_ token (we're using the `torchtext` default, `<unk>`) for two reasons: 

1. Due to the scarcity of such rare words in training data, we might not be able to learn generalizable conclusions about them.
2. Introducing an unknown token allows us to deal with out-of-vocabulary words in the test data as well: we just map those words to `<unk>`.

In [10]:
unk_token = TEXT.unk_token
print (f"Unknown token: {unk_token}")
unk_index = TEXT.vocab.stoi[unk_token]
print (f"Unknown token id: {unk_index}")

# UNK example
example_unk_token = 'IAmAnUnknownWordForSure'
print (f"An unknown token: {example_unk_token}")
print (f"Mapped back to word id: {TEXT.vocab.stoi[example_unk_token]}")
print (f"Mapped to <unk>?: {TEXT.vocab.stoi[example_unk_token] == unk_index}")

Unknown token: <unk>
Unknown token id: 0
An unknown token: IAmAnUnknownWordForSure
Mapped back to word id: 0
Mapped to <unk>?: True


## Batching the data

To load data in batches, we use `data.BucketIterator`. This enables us to iterate over the dataset under a given `BATCH_SIZE` which specifies how many examples we want to process at a time.

In [11]:
BATCH_SIZE = 32
train_iter = tt.data.BucketIterator(train_data, batch_size=BATCH_SIZE, device=device)
val_iter = tt.data.BucketIterator(val_data, batch_size=BATCH_SIZE, device=device)
test_iter = tt.data.Iterator(test_data, batch_size=BATCH_SIZE, sort=False, device=device)

Let's look at a single batch from one of these iterators.

In [12]:
batch = next(iter(train_iter))
text = batch.text
print(batch.text)
for arr in batch.text:
    print(" ".join(TEXT.vocab.itos[i] for i in arr))
print (f"Size of text batch: {text.size()}")
print (f"Third sentence in batch: {text[2]}")
print (f"Converted back to string: {' '.join([TEXT.vocab.itos[i] for i in text[2]])}")

print(train_iter.dataset[2].text)

label = batch.label
print(label)
# print(LABEL.vocab.itos[1])
# print (f"Size of label batch: {label.size()}")
# print (f"Third label in batch: {label[2]}")
# print(LABEL.vocab.itos)
# print (f"Converted back to string: {LABEL.vocab.itos[label[2].item()]}")

tensor([[221,   4,   3,  22,   2,  13,  17,   1,   1,   1,   1,   1,   1,   1,
           1,   1,   1,   1,   1,   1,   1],
        [  7,  21, 276, 198,  55,   3,  34,   2,  10,   6, 100,  92, 115,   1,
           1,   1,   1,   1,   1,   1,   1],
        [ 12,  38,  26,   2, 145,   5,   4,   3,  23,   2,  25,  36,   1,   1,
           1,   1,   1,   1,   1,   1,   1],
        [ 12,  54,   2,  82,   3,  10,   2,  18,  19, 318,  16,   5, 341, 207,
          87,   8,   5,  77,   9,   3,  10],
        [ 11,   8,  48,  51,  16,  23,   1,   1,   1,   1,   1,   1,   1,   1,
           1,   1,   1,   1,   1,   1,   1],
        [ 28,  45,  99,   4,  61, 183,  94,   6, 149, 279,   3, 166, 185,   2,
         203,   1,   1,   1,   1,   1,   1],
        [  7,  27, 104,   5,  77,   9,  35,  14,  41,  20,   1,   1,   1,   1,
           1,   1,   1,   1,   1,   1,   1],
        [ 29,  41,   8, 303,   5,  81,   4,  33,  14,  19,  43,   1,   1,   1,
           1,   1,   1,   1,   1,   1,   1],
        

You might notice some padding tokens `<pad>` when we convert word ids back to strings, or equivalently, padding ids `1` in the corresponding tensor. The reason why we need such padding is because the sentences in a batch might be of different lengths, and to save them in a 2D tensor for parallel processing, sentences that are shorter than the longest sentence need to be padded with some placeholder values. `torchtext` does all this for us automatically. Later during training you'll need to make sure that the paddings do not affect the final results.

In [13]:
padding_token = TEXT.pad_token
print (f"Padding token: {padding_token}")

padding_id = TEXT.vocab.stoi[padding_token]
print (f"Padding word id: {padding_id}")

Padding token: <pad>
Padding word id: 1


Alternatively, we can also directly iterate over the individual examples in `train_data`, `val_data` and `test_data`. Here the returned values are the raw sentences and labels instead of their corresponding ids, and you might need to explicitly deal with the unknown words, unlike using bucket iterators which automatically map unknown words to an unknown word id.

In [14]:
for example in train_iter.dataset[:5]: # train_iter.dataset is just train_data
  print(f"{example.label:10} -- {' '.join(example.text)}")

flight_id  -- list all the flights that arrive at general mitchell international from various cities
flight_id  -- give me the flights leaving denver august ninth coming back to boston
flight_id  -- what flights from tacoma to orlando on saturday
fare_id    -- what is the most expensive one way fare from boston to atlanta on american airlines
flight_id  -- what flights return from denver to philadelphia on a saturday


## Notations used

In this project segment, we'll use the following notations.

* Sequences of elements (vectors and the like) are written with angle brackets and commas ($\langle w_1, \ldots, w_M \rangle$) or directly with no punctuation ($w_1 \cdots w_M$).
* Sets are notated similarly but with braces, ($\{ v_1, \ldots, v_V \}$).
* Maximum indices ($M$, $N$, $V$, $T$, and $X$ in the following) are written as uppercase italics.
* Variables over sequences and sets are written in boldface ($\vect{w}$), typically with the same letter as the variables over their elements.

In particular,

* $\vect{w} = w_1 \cdots w_M$: A text to be classified, each element $w_j$ being a word token.
* $\vect{v} = \{ v_1, \ldots, v_V\}$: A vocabulary, each element $v_k$ being a word type.
* $\vect{x} = \langle x_1, \ldots, x_X \rangle$: Input features to a model.
* $\vect{y} = \{ y_1, \ldots, y_N \}$: The output classes of a model, each element $y_i$ being a class label.
* $\vect{T} = \langle \vect{w}^{(1)}, \ldots, \vect{w}^{(T)} \rangle$: The training corpus of texts.
* $\vect{Y} = \langle y^{(1)}, \ldots, y^{(T)} \rangle$: The corresponding gold labels for the training examples in $T$.

# To Do: Establish a majority baseline

A simple baseline for classification tasks is to always predict the most common class. 
Given a training set of texts $\vect{T}$ labeled by classes $\vect{Y}$, we classify an input text $\vect{w} = w_1 \cdots w_M$ as the class $y_i$ that occurs most frequently in the training data, that is, specified by

$$ \argmax{i} \cnt{y_i} $$

and thus ignoring the input entirely (!).

**Implement the majority baseline and compute test accuracy using the starter code below.** For this baseline, and for the naive Bayes classifier later, we don't need to use the validation set since we don't tune any hyper-parameters.

In [15]:
# TODO
def majority_baseline_accuracy(train_iter, test_iter):
    """Returns the most common label in the training set, and the accuracy of
     the majority baseline on the test set.
    """
    labelCounter = {}
    for label in LABEL.vocab.itos:
        labelCounter[label] = 0
    for example in train_iter.dataset:
        labelCounter[example.label] += 1
    most_common_label = max(labelCounter, key = labelCounter.get)
    
    accuracyCount = 0
    for example in test_iter.dataset:
        if(example.label == most_common_label):
            accuracyCount += 1
    test_accuracy = accuracyCount/len(test_iter.dataset)
    
    return most_common_label, test_accuracy

How well does your classifier work? Let's see:

In [16]:
# Call the method to establish a baseline
most_common_label, test_accuracy = majority_baseline_accuracy(train_iter, test_iter)

print(f'Most common label: {most_common_label}\n'
      f'Test accuracy:     {test_accuracy:.3f}')

Most common label: flight_id
Test accuracy:     0.683


# To Do: Implement a Naive Bayes classifier


## Review of the naive Bayes method

Recall from lab 1-3 that the Naive Bayes classification method classifies a text $\vect{w} = \langle w_1, w_2, \ldots, w_M \rangle$ as the class $y_i$ given by the following maximization:

$$
\argmax{i} \Prob(y_i \given \vect{w}) \approx \argmax{i} \Prob(y_i) \cdot \prod_{j=1}^M \Prob(w_j \given y_i)
$$

or equivalently (since taking the log is monotonic)

\begin{align}
\argmax{i} \Prob(y_i \given \vect{w}) &= \argmax{i} \log\Prob(y_i \given \vect{w}) \\
&\approx \argmax{i} \left(\log\Prob(y_i) + \sum_{j=1}^M \log\Prob(w_j \given y_i)\right)
\end{align}

All we need, then, to apply the Naive Bayes classification method is values for the various log probabilities: the priors $\log\Prob(y_i)$ and the likelihoods $\log\Prob(w_j \given y_i)$, for each feature (word) $w_j$ and each class $y_i$.

We can estimate the prior probabilities $\Prob(y_i)$ by examining the empirical probability in the training set. That is, we estimate 

$$ \Prob(y_i) \approx \frac{\cnt{y_i}}{\sum_j \cnt{y_j}} $$

We can estimate the likelihood probabilities $\Prob(w_j \given y_i)$ similarly by examining the empirical probability in the training set. That is, we estimate 

$$ \Prob(w_j \given y_i) \approx \frac{\cnt{w_j, y_i}}{\sum_{j'} \cnt{w_{j'}, y_i}} $$

To allow for cases in which the count $\cnt{w_j, y_i}$ is zero, we can use a modified estimate incorporating add-$\delta$ smoothing:

$$ \Prob(w_j \given y_i) \approx \frac{\cnt{w_j, y_i} + \delta}{\sum_{j'} \cnt{w_{j'}, y_i} + \delta \cdot V} $$

## Two conceptions of the naive Bayes method implementation

We can store all of these parameters in different ways, leading to two different implementation conceptions. We review two conceptions of implementing the naive Bayes classification of a text $\vect{w} = \langle w_1, w_2, \ldots, w_M \rangle$, corresponding to using different representations of the input $\vect{x}$ to the model: the index representation and the bag-of-words representation. 

Within each conception, the parameters of the model will be stored in one or more matrices. The conception dictates what operations will be performed with these matrices.

### Using the index representation

In the first conception, we take the input elements $\vect{x} = \langle x_1, x_2, \ldots, x_M \rangle$ to be the _vocabulary indices_ of the words $\vect{w} = w_1 \cdots w_M$. That is, each word token $w_i$ is of the word type in the vocabulary $\vect{v}$ at index $x_i$, so 

$$ v_{x_i} = w_i $$

In this representation, the input vector has the same length as the word sequence.

We think of the likelihood probabilities as forming a matrix, call it $\vect{L}$, where the $i,j$-th element stores $\log \Prob(v_j \given y_i)$. 

$$\vect{L}_{ij} = \log\Prob(v_j \given y_i)$$

Similarly, for the priors, we'll have 

$$\vect{P}_{i} = \log\Prob(y_i)$$

Now the maximization can be implemented as 

\begin{align}
\argmax{i} \log\Prob(y_i) + \sum_{j=1}^M \log\Prob(w_j \given y_i)
&= \argmax{i} \vect{P}_i + \sum_{j=1}^M \vect{L}_{x_j i}
\end{align}

Implemented in this way, we see that the use of each input $x_i$ is as an _index_ into the likelihood matrix. 

### Using the bag-of-words representation

<img src="https://github.com/nlp-course/data/raw/master/Resources/naive-bayes-figure.png" width=400 align=right />

Notice that since each word in the input is treated separately, the order of the words doesn't matter. Rather, all that matters is how frequently each word type occurs in a text. Consequently, we can use the bag-of-words representation introduced in lab 1-1.

Recall that the bag-of-words representation of a text is just its frequency distribution over the vocabulary, which we will notate $bow(\vect{w})$. Given a vocabulary of word types $\vect{v} = \langle v_1, v_2, \ldots, v_V \rangle$, the representation of a sentence $\vect{w} = \langle w_1, w_2, \ldots, w_M \rangle$ is a vector $\vect{x}$ of size $V$, where 

$$\begin{aligned}
bow(\vect{w})_j &= \sum_{i=1}^M 1[w_i = v_j] & \mbox{for $1 \leq j \leq V$}
\end{aligned}$$

We write $1[w_i = v_j]$ to indicate 1 if $w_i = v_j$ and 0 otherwise. For convenience, we'll add an extra $(V+1)$-st element to the end of the bag-of-words vector, a single $1$ whose use will be clear shortly. That is,

$$bow(\vect{w})_{V+1} = 1$$

Under this conception, then, we'll take the input $\vect{x}$ to be $bow(\vect{w})$. Instead of the input having the same length as the text, it has the same length as the vocabulary.

As described in lecture, represented in this way, the quantity to be maximized in the naive Bayes method

$$\log\Prob(y_i) + \sum_{j=1}^M \log\Prob(w_j \given y_i)$$

can be calculated as 

$$\log\Prob(y_i) + \sum_{j=1}^V x_j \cdot \log\Prob(v_j \given y_i)$$

which is just $\vect{U} \vect{x}$ for a suitable choice of $N \times (V+1)$ matrix $\vect{U}$, namely

$$ \vect{U}_{ij} = \left\{
    \begin{array}{ll}
        \log \Prob(v_j \given y_i) & \mbox{$1 \leq i \leq N$ and $1 \leq j \leq V$} \\
        \log \Prob(y_i) & \mbox{$1 \leq i \leq N$ and $j = V+1$} 
    \end{array} \right.
$$

Under this implementation conception, we've reduced naive Bayes calculations to a single matrix operation. This conception is depicted in the figure at right.

You are free to use either conception in your implementation of naive Bayes.

## Implement the naive Bayes classifier
 
For the implementation, we ask you to implement a Python class `NaiveBayes` that will have (at least) the following three methods:

1. `__init__`: An initializer that takes two `torchtext` fields providing descriptions of the text and label aspects of examples.

2. `train`: A method that takes a training data iterator and estimates all of the log probabilities $\log\Prob(c_i)$ and $\log\Prob(x_j \given c_i)$ as described above. Perform add-$\delta$ smoothing with $\delta=1$. These parameters will be used by the `evaluate` method to evaluate a test dataset for accuracy, so you'll want to store them in some data structures in objects of the class.

3. `evaluate`: A method that takes a test data iterator and evaluates the accuracy of the trained model on the test set.

You can organize your code using either of the conceptions of Naive Bayes described above.

You should expect to achieve about an **86% test accuracy** on the ATIS task.

In [17]:
def bagOfWords(text,sizeV,padding_id):
        c = Counter(text)
        bow = [0 for i in range(sizeV)]
        for j in range(sizeV):
            for k in c:
                if torch.tensor(j) == k:
                    bow[j] += c[k]
        bow[padding_id] = 0
        return bow

In [18]:
class NaiveBayes():
    def __init__ (self, text, label):
        self.text = text
        self.label = label
        self.padding_id = text.vocab.stoi[text.pad_token]
        self.V = len(text.vocab.itos) # vocabulary size
        self.N = len(label.vocab.itos) # the number of classes
        # TODO: Add your code here
        # self.L = len(text)
        self.U = torch.zeros(self.N, self.V+1)


    def train(self, iterator):
        """Calculates and stores log probabilities for training dataset `iterator`."""
        # TODO: Implement this method.
        labelList = [example.label for example in iterator.dataset]
        total = len(labelList)
        #print(total)
        c = Counter(labelList)
        print(c)
        for sentence in iterator.dataset:
            labelIndex = self.label.vocab.stoi[sentence.label]
            for word in sentence.text:
                wordIndex = self.text.vocab.stoi[word]
                self.U[labelIndex][wordIndex] += 1
        #print(self.U)
        countArr = []
        for i in range(self.U.size()[0]):
            count = 0
            for j in range(len(self.U[i])):
                count += self.U[i][j]
            countArr.append(count)
        print(countArr)
        
        for i in range(self.U.size()[0]):
            for j in range(len(self.U[i]) - 1):
                #print(countArr[i].item())
                self.U[i][j] = math.log2((self.U[i][j] + 1)/(countArr[i].item() + self.V))
            self.U[i][self.V] = math.log2(c[LABEL.vocab.itos[i]]/total)
        print(self.U)
          
                 
            

    def evaluate(self, iterator):
        """Returns the model's accuracy on a given dataset `iterator`."""
        # TODO: Implement this method.
        evalIterator = iter(iterator)
        numTotal = 0
        numCorrect = 0
        while True:
            try:
                batch = next(evalIterator)
                label = batch.label
                text = batch.text
            
                for i in range(len(label)):
                    bow = bagOfWords(text[i], self.V, self.padding_id)
                    bow.append(1)
                    y = torch.matmul(self.U, torch.tensor(bow, dtype = torch.float32))
                    gold_label = label[i]
                    numTotal += 1
                    if gold_label == torch.argmax(y):
                        numCorrect += 1
            except StopIteration:
                break
        accuracy = numCorrect/numTotal
        return accuracy
                    
                

In [19]:
# for word in train_iter.dataset:
#     print(word.text)
#     print(word.label)
# totalCount = 0
# for sentence in train_iter.dataset:
#     #print(sentence.label)
#     totalCount += len(sentence.text)
# sentenceList = [example.text for example in train_iter.dataset]
# #print(len(sentenceList))
# #print(sentenceList)
# labelList = [example.label for example in train_iter.dataset]
# #print(len(labelList))

# labelName = LABEL.vocab.stoi['flight_id']
# print(labelName)
# print(LABEL.vocab.stoi['time_elapsed'])
# print(TEXT.vocab.stoi['your'])
# print(TEXT.vocab.itos[2])
# print(LABEL.vocab.itos[0])
# print(U.size()[1])
    
#print(totalCount)
    
    
nb_classifier = NaiveBayes(TEXT, LABEL)
nb_classifier.train(train_iter)

# for example in test_iter.dataset:
#     print(example.text)

# Evaluate model performance
print(f'Training accuracy: {nb_classifier.evaluate(train_iter):.3f}\n'
      f'Test accuracy: {nb_classifier.evaluate(test_iter):.3f}')

Counter({'flight_id': 3210, 'fare_id': 344, 'transport_type': 228, 'airline_code': 171, 'aircraft_code': 82, 'departure_time': 52, 'fare_basis_code': 43, 'airport_code': 39, 'count': 38, 'state_code': 29, 'booking_class': 24, 'ground_fare': 19, 'restriction_code': 18, 'miles_distant': 11, 'arrival_time': 11, 'city_code': 10, 'meal_code': 10, 'meal_description': 5, 'basic_type': 5, 'minutes_distant': 5, 'flight_number': 5, 'advance_purchase': 5, 'time_elapsed': 3, 'airport_location': 3, 'day_name': 2, 'stop_airport': 2, 'stops': 2, 'minimum_connect_time': 1, 'time_zone_code': 1, 'city_name': 1})
[tensor(37635.), tensor(4211.), tensor(1964.), tensor(1461.), tensor(1144.), tensor(714.), tensor(270.), tensor(182.), tensor(408.), tensor(378.), tensor(167.), tensor(196.), tensor(103.), tensor(134.), tensor(112.), tensor(90.), tensor(92.), tensor(20.), tensor(63.), tensor(159.), tensor(36.), tensor(67.), tensor(15.), tensor(37.), tensor(6.), tensor(19.), tensor(31.), tensor(3.), tensor(9.), t

## Implement the logistic regression classifier

For the implementation, we ask you to implement a logistic regression classifier as a subclass of [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module). You need to implement the following methods:

1. `__init__`: an initializer that takes two `torchtext` fields providing descriptions of the text and label aspects of examples.

    During initialization, you'll want to define a [tensor](https://pytorch.org/docs/stable/tensors.html#torch-tensor) of weights, wrapped in [`torch.nn.Parameter`](https://pytorch.org/docs/master/generated/torch.nn.parameter.Parameter.html#torch.nn.parameter.Parameter), [initialized randomly](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.uniform_), which plays the role of $\vect{U}$. The elements of this tensor are the parameters of the `torch.nn` instance in the following special technical sense: It is the parameters of the module whose gradients will be calculated and whose values will be updated. Alternatively, you might find it easier to use the [`nn.Embedding` module](https://pytorch.org/docs/master/generated/torch.nn.Embedding.html) which is a wrapper to the weight tensor with a lookup implementation.

2. `forward`: given a text batch of size `batch_size X max_length`, return a tensor of logits of size `batch_size X num_labels`. That is, for each text $\vect{x}$ in the batch and each label $y$, you'll be calculating $\vect{U}\vect{x}$ as shown in the figure, returning a tensor of these values. Note that the softmax operation is absorbed into [`nn.CrossEntropyLoss`](https://pytorch.org/docs/master/generated/torch.nn.CrossEntropyLoss.html) so you won't need to deal with that.

3. `train_all`: A method that performs training. You might find lab 1-5 useful.

4. `evaluate`: A method that takes a test data iterator and evaluates the accuracy of the trained model on the test set.

Some things to consider:

1. The parameters of the model, the weights, need to be initialized properly. We suggest initializing them to some small random values. See [`torch.uniform_`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.uniform_).

2. You'll want to make sure that padding tokens are handled properly. What should the weight be for the padding token?

3. In extracting the proper weights to sum up, based on the word types in a sentence, we are essentially doing a lookup operation. You might find [`nn.Embedding`](https://pytorch.org/docs/master/generated/torch.nn.Embedding.html) or [`torch.gather`](https://pytorch.org/docs/stable/generated/torch.gather.html#torch-gather) useful.

You should expect to achieve about **90%** accuracy on the ATIS classificiation task. 

In [20]:
class LogisticRegression(nn.Module):
    def __init__ (self, text, label):
        super().__init__()
        self.text = text
        self.label = label
        self.padding_id = text.vocab.stoi[text.pad_token]
        # Keep the vocabulary sizes available
        self.N = len(label.vocab.itos) # num_classes
        self.V = len(text.vocab.itos)  # vocab_size
        # Specify cross-entropy loss for optimization
        self.criterion = nn.CrossEntropyLoss()
        # TODO: Create and initialize a tensor for the weights,
        #       or create an nn.Embedding module and initialize
        ...
        self.U = torch.nn.Parameter(torch.Tensor(self.N,self.V).uniform_(0,1))
        #self.embedding = nn.Embedding(self.N,self.V, padding_idx = self.padding_id)
        #self.embedding.weight.data.uniform_(0,1)
        #self.linear = nn.Linear(self.V,self.N)
        #self.linear.weight.data.uniform_(0,1)
        #self.U = nn.Linear()
        
        #self.U = nn.Embedding(torch.Tensor(self.N,self.V).uniform_)

    def forward(self, text_batch):
    # TODO: Calculate the logits (Ux) for the `text_batch`, 
    #       returning a tensor of size batch_size x num_labels
        ret = torch.zeros([len(text_batch),self.V])
        for i, sentence in enumerate(text_batch.text):
            bow = torch.zeros(self.V)
            for wordIndex in sentence:
                bow[wordIndex] += 1
            ret[i] = bow

        retVal = torch.matmul(ret,self.U.T)
        return retVal  

    def train_all(self, train_iter, val_iter, epochs=8, learning_rate=3e-3):
    # Switch the module to training mode
        self.train()
        # Use Adam to optimize the parameters
        optim = torch.optim.Adam(self.parameters(), lr=learning_rate)
        best_validation_accuracy = -float('inf')
        best_model = None
        # Run the optimization for multiple epochs
        with tqdm(range(epochs), desc='train', position=0) as pbar:
            for epoch in pbar:
                c_num = 0
                total = 0
                running_loss = 0.0

                for batch in tqdm(train_iter, desc='batch', leave=False):
                    # TODO: set labels, compute logits (Ux in this model), 
                    #       loss, and update parameters
                    ...
                    optim.zero_grad()
                    labels = batch.label
                    logits = self.forward(batch)
                    loss = self.criterion(logits,labels)
                    loss.backward()
                    optim.step()
                    ...
                    # Prepare to compute the accuracy
                    predictions = torch.argmax(logits, dim=1)
                    total += predictions.size(0)
                    c_num += (predictions == labels).float().sum().item()        
                    running_loss += loss.item() * predictions.size(0)

                    # Evaluate and track improvements on the validation dataset
                    validation_accuracy = self.evaluate(val_iter)
                    if validation_accuracy > best_validation_accuracy:
                        best_validation_accuracy = validation_accuracy
                        self.best_model = copy.deepcopy(self.state_dict())
                        epoch_loss = running_loss / total
                        epoch_acc = c_num / total
                        pbar.set_postfix(epoch=epoch+1, loss=epoch_loss, train_acc = epoch_acc, val_acc=validation_accuracy)

    def evaluate(self, iterator):
        """Returns the model's accuracy on a given dataset `iterator`."""
        self.eval()   # switch the module to evaluation mode
        # TODO: Compute accuracy
        with torch.no_grad():
            numCorrect = 0
            numTotal = len(iterator.dataset)
            for sentence in iterator.dataset:
                bow = torch.zeros(self.V)
                for word in sentence.text:
                    bow[self.text.vocab.stoi[word]] += 1
                label = torch.argmax(torch.matmul(self.U,bow))
                if sentence.label == self.label.vocab.itos[label]:
                    numCorrect += 1

        return numCorrect / numTotal

In [21]:
model = LogisticRegression(TEXT, LABEL).to(device) 
model.forward(batch)

tensor([[ 8.2163, 15.5570,  8.8837,  8.6198, 15.6998,  9.0532,  4.1633,  9.7062,
         15.9774,  4.3654,  3.2959,  9.4384, 15.9067, 14.5179, 15.2123,  3.2148,
         14.7468, 15.9950, 10.5336,  7.1896,  7.4921, 10.5424, 13.5207,  5.5149,
          8.0534, 12.3653,  7.4518, 15.4852, 14.6244,  4.0875],
        [ 8.7474, 13.1596,  8.4309,  9.3893, 13.2008,  9.9894,  6.6734,  9.3189,
         14.1458,  8.0941,  6.4428,  9.4998, 14.0991, 10.9224, 12.3683,  6.1700,
         12.4263, 13.9541,  9.1823,  8.0108,  6.7542,  9.7368, 11.2559,  6.9367,
          8.7493, 11.2465,  9.8955, 13.5798, 13.1740,  9.2232],
        [ 9.7426, 12.9181,  9.9068, 10.3123, 14.7843,  9.6891,  6.2082, 11.0457,
         13.5136,  8.2502,  5.8746,  8.6225, 15.6152, 12.6178, 13.8781,  4.7702,
         12.4727, 14.7312, 10.0066,  7.5826,  7.6166, 11.2500, 10.4801,  7.6911,
          8.4950, 12.1162,  8.3370, 12.8192, 13.9425,  6.0688],
        [10.6882,  8.7890, 11.5213,  8.0070, 11.9126, 10.7590,  8.7202, 11.8735

In [22]:
# Instantiate the logistic regression classifier and run it
N = len(LABEL.vocab.itos)
V = len(TEXT.vocab.itos)
x = torch.nn.Parameter(torch.Tensor(N,V).uniform_(0,1))
#print(torch.mm(x,x.T))
model = LogisticRegression(TEXT, LABEL).to(device) 
model.train_all(train_iter, val_iter)
model.load_state_dict(model.best_model)
test_accuracy = model.evaluate(test_iter)
print (f'Test accuracy: {test_accuracy:.4f}')

train:   0%|          | 0/8 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

Test accuracy: 0.9085


# To Do: Implement a multilayer perceptron

## Review of multilayer perceptrons

<img src="https://github.com/nlp-course/data/raw/master/Resources/multilayer-perceptron-figure.png" alt="multilayer perceptron illustration" width="400"  align=right />

In the last part, you implemented a perceptron, a model that involved a linear calculation (the sum of weights) followed by a nonlinear calculation (the softmax, which converts the summed weight values to probabilities). In a multi-layer perceptron, we take the output of the first perceptron to be the input of a second perceptron (and of course, we could continue on with a third or even more).

In this part, you'll implement the forward calculation of a two-layer perceptron, again letting PyTorch handle the backward calculation as well as the optimization of parameters. The first layer will involve a linear summation as before and a **sigmoid** as the nonlinear function. The second will involve a linear summation and a softmax (the latter absorbed, as before, into the loss function). Thus, the difference from the logistic regression implementation is simply the adding of the sigmoid and second linear calculations. See the figure for the structure of the computation. 



## Implement a multilayer perceptron classifier

For the implementation, we ask you to implement a two layer perceptron classifier, again as a subclass of the [`torch.nn` module](https://pytorch.org/docs/stable/nn.html). You might reuse quite a lot of the code from logistic regression. As before, you need to implement the following methods:

1. `__init__`: An initializer that takes two `torchtext` fields providing descriptions of the text and label aspects of examples, and `hidden_size` specifying the size of the hidden layer (e.g., in the above illustration, `hidden_size` is `D`).

    During initialization, you'll want to define two tensors of weights, which serve as the parameters of this model, one for each layer. You'll want to [initialize them randomly](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.uniform_). 
    
    The weights in the first layer are a kind of lookup (as in the previous part), mapping words to a vector of size `hidden_size`. The [`nn.Embedding` module](https://pytorch.org/docs/master/generated/torch.nn.Embedding.html) is a good way to set up and make use of this weight tensor.
    
    The weights in the second layer define a linear mapping from vectors of size `hidden_size` to vectors of size `num_labels`. The [`nn.Linear` module](https://pytorch.org/docs/master/generated/torch.nn.Linear.html) or [`torch.mm`](https://pytorch.org/docs/master/generated/torch.mm.html) for matrix multiplication may be helpful here.

2. `forward`: Given a text batch of size `batch_size X max_length`, the `forward` function returns a tensor of logits of size `batch_size X num_labels`. 

    That is, for each text $\vect{x}$ in the batch and each label $c$, you'll be calculating $MLP(bow(\vect{x}))$ as shown in the illustration above, returning a tensor of these values. Note that the softmax operation is absorbed into [`nn.CrossEntropyLoss`](https://pytorch.org/docs/master/generated/torch.nn.CrossEntropyLoss.html) so you don't need to worry about that.
    
    For the sigmoid sublayer, you might find [`nn.Sigmoid`](https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html) useful.
    
3. `train_all`: A method that performs training. You might find lab 1-5 useful.

4. `evaluate`: A method that takes a test data iterator and evaluates the accuracy of the trained model on the test set.

You should expect to achieve at least **90%** accuracy on the ATIS classificiation task. 

In [23]:
class MultiLayerPerceptron(nn.Module):
    def __init__ (self, text, label, hidden_size=128): 
        super().__init__ ()
        self.text = text
        self.label = label
        self.padding_id = text.vocab.stoi[text.pad_token]
        self.hidden_size = hidden_size
        # Keep the vocabulary sizes available
        self.N = len(label.vocab.itos) # num_classes
        self.V = len(text.vocab.itos)  # vocab_size
        # Specify cross-entropy loss for optimization
        self.criterion = nn.CrossEntropyLoss()
        # TODO: Create and initialize neural modules
        self.hidden1 = torch.nn.Embedding(self.V, hidden_size)
        self.hidden1.weight.data.uniform_(-0.05,0.05)
        self.hidden2 = nn.Sigmoid()
        self.hidden3 = nn.Linear(hidden_size, self.N)

    def forward(self, text_batch):
        # TODO: Calculate the logits for the `text_batch`, 
        #       returning a tensor of size batch_size x num_labels
        embedding = self.hidden1(torch.tensor(text_batch)).sum(axis = 1)
        finalHidden = self.hidden2(embedding)
        return self.hidden3(finalHidden)

    def train_all(self, train_iter, val_iter, epochs=8, learning_rate=3e-3):
        # Switch the module to training mode
        self.train()
        # Use Adam to optimize the parameters
        optim = torch.optim.Adam(self.parameters(), lr=learning_rate)
        best_validation_accuracy = -float('inf')
        best_model = None
        # Run the optimization for multiple epochs
        with tqdm(range(epochs), desc='train', position=0) as pbar:
            for epoch in pbar:
                c_num = 0
                total = 0
                running_loss = 0.0
                for batch in tqdm(train_iter, desc='batch', leave=False):
                    # TODO: set labels, compute logits (Ux in this model), 
                    #       loss, and update parameters
                    ...
                    optim.zero_grad()
                    labels = batch.label
                    logits = self.forward(batch.text)
                    loss = self.criterion(logits,labels)
                    loss.backward()
                    optim.step()
                    ...
                    # Prepare to compute the accuracy
                    predictions = torch.argmax(logits, dim=1)
                    total += predictions.size(0)
                    c_num += (predictions == labels).float().sum().item()        
                    running_loss += loss.item() * predictions.size(0)

                # Evaluate and track improvements on the validation dataset
                validation_accuracy = self.evaluate(val_iter)
                if validation_accuracy > best_validation_accuracy:
                    best_validation_accuracy = validation_accuracy
                    self.best_model = copy.deepcopy(self.state_dict())
                    epoch_loss = running_loss / total
                    epoch_acc = c_num / total
                    pbar.set_postfix(epoch=epoch+1, loss=epoch_loss, train_acc = epoch_acc, val_acc=validation_accuracy)

    def evaluate(self, iterator):
        """Returns the model's accuracy on a given dataset `iterator`."""
        # TODO: Compute accuracy
        evalIterator = iter(iterator)
        numTotal = 0
        numCorrect = 0

        for batch in evalIterator:
            model = self.forward(batch.text)

            for i, sentence in enumerate(batch):
                correctLabel = batch.label[i]
                label = torch.argmax(model[i])
                if correctLabel == label:
                    numCorrect += 1
                numTotal += 1

        return numCorrect / numTotal
        

In [24]:
# Instantiate classifier and run it
model = MultiLayerPerceptron(TEXT, LABEL).to(device) 
model.train_all(train_iter, val_iter)
model.load_state_dict(model.best_model)
test_accuracy = model.evaluate(test_iter)
print (f'Test accuracy: {test_accuracy:.4f}')

train:   0%|          | 0/8 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

  embedding = self.hidden1(torch.tensor(text_batch)).sum(axis = 1)


batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

Test accuracy: 0.9286


<!-- BEGIN QUESTION -->

# Lessons learned

Take a look at some of the examples that were classified correctly and incorrectly by your best method.

**Question:** Do you notice anything about the incorrectly classified examples that might indicate _why_ they were classified incorrectly?

<!--
BEGIN QUESTION
name: open_response_lessons
manual: true
-->

The incorrectly classified examples require a lot more information to be gathered and understood in order to make a correct prediction on the class label. This was a trend that was present in a lot of the incorrectly classified examples because they were not only longer in length but also included additional required information to a lot of the simpler queries. For simple and short queries, the accuracy of the classification was pretty high. However, for longer and more complex queries the accuracy became a lot lower and it was more likely that it would be classified incorrectly in terms of the label. 

<!-- END QUESTION -->

In [25]:
...

Ellipsis

<!-- BEGIN QUESTION -->

# Debrief

**Question:** We're interested in any thoughts you have about this project segment so that we can improve it for later years, and to inform later segments for this year. Please list any issues that arose or comments you have to improve the project segment. Useful things to comment on include the following: 

* Was the project segment clear or unclear? Which portions?
* Were the readings appropriate background for the project segment? 
* Are there additions or changes you think would make the project segment better?

<!--
BEGIN QUESTION
name: open_response_debrief
manual: true
-->

I thought that the necessary computation for the evaluation functions was unclear along with the definition of why to use nn.Linear or nn.Embedding. I think there was some reading that was missing and going to section and office hours was necessary to do this problem set and would have been very hard to do without it. I think more information about each of the functions that are being used throughout the project would make the project segments a lot more accessible and easier. 

<!-- END QUESTION -->



# Instructions for submission of the project segment

This project segment should be submitted to Gradescope at <http://go.cs187.info/project1-submit>, which will be made available some time before the due date.

Project segment notebooks are manually graded, not autograded using otter as labs are. (Otter is used within project segment notebooks to synchronize distribution and solution code however.) **We will not run your notebook before grading it.** Instead, we ask that you submit the already freshly run notebook. The best method is to "restart kernel and run all cells", allowing time for all cells to be run to completion.

We also request that you **submit a PDF of the freshly run notebook**. The simplest method is to use "Export notebook to PDF", which will render the notebook to PDF via LaTeX. If that doesn't work, the method that seems to be most reliable is to export the notebook as HTML (if you are using Jupyter Notebook, you can do so using `File -> Print Preview`), open the HTML in a browser, and print it to a file. Then make sure to add the file to your git commit. Please name the file the same name as this notebook, but with a `.pdf` extension. (Conveniently, the methods just described will use that name by default.) You can then perform a git commit and push and submit the commit to Gradescope.

# End of project segment 1 {-}