# Section 0. Getting Started




In [7]:
print('Welcome to the second NLP Praktikum.')

Welcome to the second NLP Praktikum.


## 0.1 Introducing PyTorch

[PyTorch](https://pytorch.org/) is a deep learning library that supports **accelerated tensor operations** (on GPUs) and **automatic differentiation**.

Much of its tensor syntax is similar to NumPy's, which we worked with last time.

In [8]:
import torch

# Initialize a tensor with the content:
# |0  1  2|
# |3  4  5|
a = torch.Tensor([[0, 1, 2], 
                  [3, 4, 5]])

# Check matrix size
a.shape

torch.Size([2, 3])

In [9]:
# Check tensor type, default is float
a.dtype

torch.float32

In [10]:
# Cast to integer
a = a.long()
a.dtype

torch.int64

In [11]:
# Arithmetic operations
a.sum()

tensor(15)

In [12]:
# Operation over a certain dimension
a.sum(axis=1)

tensor([ 3, 12])
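
As a small aside on reductions over a dimension: `dim=0` collapses the rows and `dim=1` collapses the columns (the `axis=1` call above behaves like `dim=1`). The sketch below reuses the same 2 x 3 matrix; the names `col_sums`, `row_sums` and `row_sums_2d` are only illustrative.

```python
import torch

# Same 2 x 3 matrix as above, created directly as an integer tensor
a = torch.tensor([[0, 1, 2],
                  [3, 4, 5]])

col_sums = a.sum(dim=0)                    # collapse the rows    -> shape [3]
row_sums = a.sum(dim=1)                    # collapse the columns -> shape [2]
row_sums_2d = a.sum(dim=1, keepdim=True)   # keep the reduced dim -> shape [2, 1]

print(col_sums)            # tensor([3, 5, 7])
print(row_sums)            # tensor([ 3, 12])
print(row_sums_2d.shape)   # torch.Size([2, 1])
```

`keepdim=True` is handy when the result should later be broadcast against the original matrix.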

In [13]:
# Slicing the matrix (accessing a selection of indices)
a[:, 1:a.shape[1]]

tensor([[1, 2],
        [4, 5]])

In [14]:
# Slicing with a negative index (all columns except the last one)
a[:, :-1]

tensor([[0, 1],
        [3, 4]])

In [15]:
# Extracting content of tensor
a[0, 2].item()

2

For more information on PyTorch tensor operations, [this blog](https://jhui.github.io/2018/02/09/PyTorch-Basic-operations/) is a good resource.

## 0.2 Enabling and testing the GPU

Training your model on a GPU is typically much faster than on a CPU.

First, let's enable GPUs for this notebook:

- Navigate to "**Edit**" → "**Notebook Settings**"
- Select GPU from the **Hardware Accelerator** drop-down

Next, we'll check if we can connect to the GPU with PyTorch:

In [16]:
import torch
torch.manual_seed(0)

if torch.cuda.is_available():
    torch.cuda.set_device(0)
    DEVICE = torch.cuda.current_device()
    print('Current device:', torch.cuda.get_device_name(DEVICE))
else:
    print('Failed to find GPU. Will use CPU.')
    DEVICE = 'cpu'

Current device: NVIDIA GeForce GTX 1660 Ti
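
A commonly used alternative is to describe the target device with a `torch.device` object and move tensors to it with `.to()`. The following is a minimal sketch, not the notebook's own setup; `device`, `x` and `y` are illustrative names.

```python
import torch

# Pick the GPU if one is visible to PyTorch, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Tensors are created on the CPU by default; .to() moves them to `device`
# (it is a no-op if the tensor already lives there)
x = torch.tensor([[0, 1, 2], [3, 4, 5]]).to(device)

# Operands of an operation must live on the same device
y = x + torch.ones_like(x)   # ones_like allocates on x's device
print(y.device.type)         # 'cuda' or 'cpu', depending on the machine
```

Written this way, the same code runs unchanged on machines with and without a GPU.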


## 0.3 Useful Resources

The following materials might be useful for solving the tasks in this Praktikum:
* Lecture 8, slides 24-28 
* Lecture 8, slides 32-41
* Lecture 10, slides 32-39

# Section 1. Intent Classification

Many virtual assistants (e.g. Alexa, Google Assistant) rely on **intent classification** models to categorize incoming user requests.

We will build an intent classification model based on the **MultiWOZ** dataset.

First download and unpack the dataset:

In [17]:
!wget -nc https://github.com/budzianowski/multiwoz/raw/master/data/MultiWOZ_2.1.zip
!unzip -o MultiWOZ_2.1.zip

--2023-01-25 17:50:09--  https://github.com/budzianowski/multiwoz/raw/master/data/MultiWOZ_2.1.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/budzianowski/multiwoz/master/data/MultiWOZ_2.1.zip [following]
--2023-01-25 17:50:09--  https://raw.githubusercontent.com/budzianowski/multiwoz/master/data/MultiWOZ_2.1.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20241542 (19M) [application/zip]
Saving to: ‘MultiWOZ_2.1.zip’


2023-01-25 17:50:12 (9.09 MB/s) - ‘MultiWOZ_2.1.zip’ saved [20241542/20241542]

Archive:  MultiWOZ_2.1.zip
   creating: MultiWOZ_2.1/
  inflating: Multi

The original data contains conversation threads from **users** and **agents**.

We extract the sentences from the **user** side, since that is what the intent classifier needs to handle.

In [18]:
import json

def load_list(filename):
    with open(filename, 'r') as f:
        l = f.readlines()
    return {s.strip():1 for s in l}

def load_data(data, validfile, testfile):
    valid_list, test_list = load_list(validfile), load_list(testfile)
    train_set, valid_set, test_set = [], [], []

    with open(data) as json_file:
        dialogs = json.load(json_file)

    for k in dialogs.keys():
        if k in test_list:
            test_set.append(dialogs[k])
        elif k in valid_list:
            valid_set.append(dialogs[k])
        else:
            train_set.append(dialogs[k])

    return train_set, valid_set, test_set

def prepare_data(partition):
    input, output = [], []
    for d in partition:
        # the turns of a dialog are stored in its "log" entry
        for i in range(len(d["log"])):
            if (i % 2 == 0):    # only user turns, not the agent's turns
                seg = d["log"][i]
                # only take turns by the human with annotated dialog_act
                if "dialog_act" in seg and len(list(seg["dialog_act"].keys())) > 0:
                    text = seg["text"]
                    input.append(text)
                    output.append(list(seg["dialog_act"].keys())[0])
    return input, output

train, valid, test = load_data(data='MultiWOZ_2.1/data.json',
                               validfile='MultiWOZ_2.1/valListFile.txt',
                               testfile='MultiWOZ_2.1/testListFile.txt')

train_acts = prepare_data(partition=train)
valid_acts = prepare_data(partition=valid)
test_acts = prepare_data(partition=test)
print("Done extracting data")

Done extracting data
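
To make the extraction logic easier to follow, here is a sketch on a made-up dialog. It assumes the same layout as above (a `"log"` list in which user turns sit at even positions) and, like `prepare_data`, keeps only the first annotated dialog act of a turn as its label. The names `toy_dialog` and `extract_user_turns` are hypothetical and the dialog content is invented.

```python
# Toy dialog in the same layout as a MultiWOZ entry (contents are made up)
toy_dialog = {
    "log": [
        {"text": "i need a cheap hotel", "dialog_act": {"Hotel-Inform": [["Price", "cheap"]]}},
        {"text": "Sure, which area?", "dialog_act": {"Hotel-Request": [["Area", "?"]]}},
        {"text": "thanks, bye", "dialog_act": {"general-bye": [["none", "none"]]}},
        {"text": "Goodbye!", "dialog_act": {"general-bye": [["none", "none"]]}},
    ]
}

def extract_user_turns(dialog):
    """Return (texts, labels) for the annotated user turns of one dialog."""
    texts, labels = [], []
    # user turns sit at even positions; [::2] steps over them directly
    for turn in dialog["log"][::2]:
        acts = turn.get("dialog_act", {})
        if acts:                                # skip turns without annotation
            texts.append(turn["text"])
            labels.append(next(iter(acts)))     # first annotated act is the label
    return texts, labels

toy_texts, toy_labels = extract_user_turns(toy_dialog)
print(toy_texts)    # ['i need a cheap hotel', 'thanks, bye']
print(toy_labels)   # ['Hotel-Inform', 'general-bye']
```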


Take a look at the data:

In [19]:
show_top_k = 5
print(f"First {show_top_k} sentences:")
print("\n".join(train_acts[0][:show_top_k]) + "\n")
print(f"First {show_top_k} labels:")
print("\n".join(train_acts[1][:show_top_k]) + "\n")

print(f"# sent. in train/dev/test: {len(train_acts[0])}, {len(valid_acts[0])}, {len(test_acts[0])}")

First 5 sentences:
am looking for a place to to stay that has cheap price range it should be in a type of hotel
no, i just need to make sure it's cheap. oh, and i need parking
Yes, please. 6 people 3 nights starting on tuesday.
how about only 2 nights.
No, that will be all. Good bye.

First 5 labels:
Hotel-Inform
Hotel-Inform
Hotel-Inform
Hotel-Inform
general-bye

# sent. in train/dev/test: 50836, 6651, 6701


We now extract the **labels**, store them in a dictionary, and create the inverse mapping.

In [20]:
# preprocess the labels
label2id = {}
id2label = {}
cnt = 0

# do not assume access to the test set at this point
for l in set(train_acts[1] + valid_acts[1]):
    label2id[l] = cnt
    id2label[cnt] = l
    cnt += 1

print(f"Total number of unique labels: {len(label2id)}")
print(label2id)
print(id2label)

Total number of unique labels: 17
{'Restaurant-Request': 0, 'Attraction-Inform': 1, 'Hotel-Inform': 2, 'general-greet': 3, 'Hospital-Inform': 4, 'Hotel-Request': 5, 'Attraction-Request': 6, 'Taxi-Inform': 7, 'Taxi-Request': 8, 'Train-Inform': 9, 'general-bye': 10, 'general-thank': 11, 'Police-Inform': 12, 'Hospital-Request': 13, 'Police-Request': 14, 'Restaurant-Inform': 15, 'Train-Request': 16}
{0: 'Restaurant-Request', 1: 'Attraction-Inform', 2: 'Hotel-Inform', 3: 'general-greet', 4: 'Hospital-Inform', 5: 'Hotel-Request', 6: 'Attraction-Request', 7: 'Taxi-Inform', 8: 'Taxi-Request', 9: 'Train-Inform', 10: 'general-bye', 11: 'general-thank', 12: 'Police-Inform', 13: 'Hospital-Request', 14: 'Police-Request', 15: 'Restaurant-Inform', 16: 'Train-Request'}
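
Note that the iteration order of a `set` of strings is not guaranteed to be the same in every Python session, so the ids printed above may differ when the notebook is rerun. If a reproducible mapping is needed (for example, to reload a saved model later), one option is to sort the labels first. The sketch below uses a short, hypothetical label list in place of `train_acts[1] + valid_acts[1]`.

```python
# Hypothetical label list standing in for train_acts[1] + valid_acts[1]
example_labels = ["Hotel-Inform", "general-bye", "Hotel-Inform", "Train-Request"]

# sorted() fixes the order, so every run assigns the same id to each label
label2id_sorted = {label: idx for idx, label in enumerate(sorted(set(example_labels)))}
id2label_sorted = {idx: label for label, idx in label2id_sorted.items()}

print(label2id_sorted)  # {'Hotel-Inform': 0, 'Train-Request': 1, 'general-bye': 2}
```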


Below we first create a vocabulary class.

We then fill it with the words in the training data.

❓ Why do we need the default token for unknown words (`self.unk_token`, set on line 7 of the cell below)?

In [21]:
from torchtext.data.utils import get_tokenizer

# preprocess vocabulary
class Vocabulary:
    def __init__(self, unk="<unk>"):
        self.tokenizer = get_tokenizer('basic_english')
        self.unk_token = unk    # default token for unknown words
        self.token2id = {self.unk_token: 0}

    def update_vocab(self, sentences):
        for sent_id, sent in enumerate(sentences):
            tokens = self.tokenizer(sent)
            for t in tokens:
                if t not in self.token2id:
                    self.token2id[t] = len(self.token2id)
            # print out first sentences as example
            if sent_id < 5:
                print(f"Input: {sent}")
                print(f"After tokenization: {tokens}\n")
        print(f"Vocab size: {len(self.token2id)}")

    def sentence_to_id(self, sentence):
        res = []
        for t in self.tokenizer(sentence):
            if t in self.token2id:
                res.append(self.token2id[t])
            else:
                res.append(self.token2id[self.unk_token])
        return res

# init empty vocabulary
vocab = Vocabulary()
# create vocabulary based on training data
vocab.update_vocab(train_acts[0])

Input: am looking for a place to to stay that has cheap price range it should be in a type of hotel
After tokenization: ['am', 'looking', 'for', 'a', 'place', 'to', 'to', 'stay', 'that', 'has', 'cheap', 'price', 'range', 'it', 'should', 'be', 'in', 'a', 'type', 'of', 'hotel']

Input: no, i just need to make sure it's cheap. oh, and i need parking
After tokenization: ['no', ',', 'i', 'just', 'need', 'to', 'make', 'sure', 'it', "'", 's', 'cheap', '.', 'oh', ',', 'and', 'i', 'need', 'parking']

Input: Yes, please. 6 people 3 nights starting on tuesday.
After tokenization: ['yes', ',', 'please', '.', '6', 'people', '3', 'nights', 'starting', 'on', 'tuesday', '.']

Input: how about only 2 nights.
After tokenization: ['how', 'about', 'only', '2', 'nights', '.']

Input: No, that will be all. Good bye.
After tokenization: ['no', ',', 'that', 'will', 'be', 'all', '.', 'good', 'bye', '.']

Vocab size: 3596


❓ What do you observe in the following example?

In [22]:
print(vocab.sentence_to_id("I'm looking for a hotel this Thursday in Karlsruhe"))

[22, 27, 76, 2, 3, 4, 19, 246, 291, 16, 0]
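
The fallback to the unknown-word id can be reproduced on a tiny scale. The sketch below is a simplified stand-in for the `Vocabulary` class: it assumes plain whitespace tokenization instead of the `basic_english` tokenizer, and `token2id` and `to_ids` are illustrative names.

```python
# Simplified stand-in for the Vocabulary class: whitespace tokenization only
UNK = "<unk>"
token2id = {UNK: 0}
for sentence in ["i need a hotel", "a cheap hotel please"]:
    for token in sentence.lower().split():
        # assign the next free id the first time a token is seen
        token2id.setdefault(token, len(token2id))

def to_ids(sentence):
    # dict.get falls back to the <unk> id for tokens never seen in training
    return [token2id.get(tok, token2id[UNK]) for tok in sentence.lower().split()]

print(to_ids("a hotel in karlsruhe"))  # [3, 4, 0, 0]
```

Every token that was not part of the training vocabulary ends up with id 0, so the model treats all of them alike.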


To prepare for training, we convert the train, dev, and test sets to tensors.

In [23]:
train_dataset, valid_dataset, test_dataset = [], [], []

def convert_to_tensor(text, label):
    text_tensor = torch.Tensor(vocab.sentence_to_id(text)).long().unsqueeze(0).to(DEVICE)
    label_tensor = torch.Tensor([label2id[label]]).long().to(DEVICE)
    return (text_tensor, label_tensor)

# train
for sent_id, sent in enumerate(train_acts[0]):
    train_dataset.append(convert_to_tensor(sent, train_acts[1][sent_id]))    
print("First instance in training: ", train_dataset[0])

# dev
for sent_id, sent in enumerate(valid_acts[0]):
    valid_dataset.append(convert_to_tensor(sent, valid_acts[1][sent_id]))
print("First instance in dev: ", valid_dataset[0])

# test
for sent_id, sent in enumerate(test_acts[0]):
    test_dataset.append(convert_to_tensor(sent, test_acts[1][sent_id]))

First instance in training:  (tensor([[ 1,  2,  3,  4,  5,  6,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,  4,
         17, 18, 19]], device='cuda:0'), tensor([2], device='cuda:0'))
First instance in dev:  (tensor([[ 22,  24,   6, 108,   4,  19,  16,  61, 114,   8,   9, 104, 105,  29]],
       device='cuda:0'), tensor([2], device='cuda:0'))


# Section 2. Training Intent Classifiers

Below is an implementation of a **bag-of-words** classifier.

❓ Which parameters does it have, and what are their shapes? 

In [33]:
from torch import nn
import torch.nn.functional as F

class BagOfWordsClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(BagOfWordsClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_class)

    def forward(self, text):
        """ Forward pass, return a (log) prob. distribution over all labels. 
        Arguments: 
        text: input sentence represented as a sequence of token IDs;
        shape B x T (batch size x sequence length)
        In this exercise we always use batch size of 1.
        """
        # shape: B x T x H (H: embedding dimension)
        embedded = self.embedding(text)
        # average the token embeddings over the sequence dimension; shape: B x H
        embedded_mean = embedded.mean(dim=1)
        # shape: B x L (number of labels)
        fc_out = self.fc(embedded_mean)
        return F.log_softmax(fc_out, dim=-1)    # B x L
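
Averaging the token embeddings can be written either as an explicit loop over the positions or as a single reduction over the sequence dimension. The sketch below checks on toy sizes that both give the same result; the sizes and the names `looped` and `pooled` are chosen only for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)  # toy sizes
text = torch.tensor([[1, 2, 3]])           # B x T = 1 x 3 token ids

embedded = embedding(text)                 # B x T x H = 1 x 3 x 4

# explicit loop: sum the T token vectors, then divide by T
looped = torch.zeros(embedded.size(0), embedded.size(2))
for i in range(embedded.size(1)):
    looped = looped + embedded[:, i]
looped = looped / embedded.size(1)

# vectorized: average over the sequence dimension
pooled = embedded.mean(dim=1)              # B x H = 1 x 4

print(torch.allclose(looped, pooled))      # True
```

The out-of-place `looped = looped + ...` avoids modifying a slice of `embedded` in place, which is generally the safer choice when gradients are needed.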

In [74]:
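import torch
import torch.nn as nn
import torch.nn.functional as F

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # batch_first=True: the input has shape B x T x H, as in the bag-of-words model
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        embedded = self.embedding(text)    # B x T x H
        output, (hidden, cell) = self.lstm(embedded)
        # hidden: num_layers x B x hidden_dim; use the final state of the last layer
        fc_out = self.fc(hidden[-1])    # B x L
        # return log-probabilities, as expected by the NLLLoss used in the Trainer
        return F.log_softmax(fc_out, dim=-1)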
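
By default, `nn.LSTM` expects time-major input of shape T x B x H. Since this notebook passes token ids of shape B x T, the layer has to be created with `batch_first=True` for the sequence to be read along the right dimension. The sketch below inspects the resulting shapes on toy sizes; all sizes are chosen only for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
B, T, H_in, H_out = 1, 5, 8, 6             # toy sizes, chosen for illustration
embedded = torch.randn(B, T, H_in)         # stand-in for the embedding layer output

# batch_first=True tells the LSTM that dim 0 is the batch and dim 1 is time
lstm = nn.LSTM(input_size=H_in, hidden_size=H_out, batch_first=True)
output, (hidden, cell) = lstm(embedded)

print(output.shape)   # torch.Size([1, 5, 6])  one vector per time step
print(hidden.shape)   # torch.Size([1, 1, 6])  num_layers x B x H_out

# for a single-layer, unidirectional LSTM the last output equals the final hidden state
print(torch.allclose(output[:, -1], hidden[-1]))   # True
```

Either `output[:, -1]` or `hidden[-1]` can therefore serve as the sentence representation that is fed to the linear layer.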


Below is a `Trainer` class.

Please read through the `train_one_epoch` method carefully.

In [75]:
import random
random.seed(0)

class Trainer:
    def __init__(self, model, optimizer,
                 train_dataset, valid_dataset, test_dataset):
        print("===== Model parameters and shape ====")
        for name, param in model.named_parameters():
            print(name, param.shape)

        self.model = model
        self.criterion = torch.nn.NLLLoss()
        self.optimizer = optimizer
        self.train_dataset, self.valid_dataset = train_dataset, valid_dataset
        self.test_dataset = test_dataset

        self.bsz = 64   # mini-batch size
        self.log_interval = 10000

    def train_one_epoch(self, dataset, epoch):
        random.shuffle(dataset) # shuffle dataset
        self.model.train()   # set model to training mode
        self.optimizer.zero_grad()   # drop gradients left over from the previous epoch
        total_acc, total_count = 0, 0

        for idx, (text, label) in enumerate(dataset):
            predicted_label = self.model(text)   # forward pass
            loss = self.criterion(predicted_label, label)    # loss calculation
            loss.backward() # backward pass, accumulate gradients

            if (idx + 1) % self.bsz == 0:
                self.optimizer.step()     # update model parameters
                self.optimizer.zero_grad()   # reset the accumulated gradients

            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)

            if idx % self.log_interval == 0 and idx > 0: # print logging
                print(f"| epoch {epoch:3d} | {idx:5d}/{len(dataset):5d} sentences | training accuracy {total_acc/total_count:.3f}")
                total_acc, total_count = 0, 0

    def evaluate(self, dataset):
        self.model.eval()    # set model to eval mode
        total_acc, total_count = 0, 0

        with torch.no_grad():   # no gradient accumulation
            for idx, (text, label) in enumerate(dataset):
                predicted_label = self.model(text).argmax(1).item()
                if predicted_label == label.item():
                    total_acc += 1
                # print examples of wrong predictions; the sentences are looked up
                # in valid_acts, so this is only done for the dev set
                elif dataset is self.valid_dataset and random.random() < 1.0 / len(dataset):
                    print("sentence:\t", valid_acts[0][idx])
                    print("predicted:\t", id2label[predicted_label])
                    print("true label:\t", id2label[label.item()], "\n")
                total_count += 1
        return total_acc / total_count

    def train(self, num_epoch):
        for i in range(num_epoch):
            self.train_one_epoch(self.train_dataset, i)
            dev_accu = self.evaluate(self.valid_dataset)
            print(f"Dev accuracy after epoch {i}: {dev_accu:.2f} \n\n")

        # done training, save model
        test_accu = self.evaluate(self.test_dataset)
        print(f"Test accuracy {test_accu:.2f}")
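
The trainer processes one sentence at a time and only updates the parameters after every `bsz`-th sentence, which simulates mini-batches through gradient accumulation. The sketch below counts how many updates that yields for the training set size printed earlier (50,836 sentences); `count_updates` is a hypothetical helper, not part of the `Trainer`.

```python
def count_updates(num_sentences, bsz):
    """Count optimizer steps when stepping after every `bsz`-th sentence."""
    steps = sum(1 for idx in range(num_sentences) if (idx + 1) % bsz == 0)
    leftover = num_sentences % bsz   # sentences after the last step of the epoch
    return steps, leftover

steps, leftover = count_updates(num_sentences=50836, bsz=64)
print(steps, leftover)   # 794 20
```

The gradients of the last 20 sentences are therefore not applied within that epoch. Because the gradients of a group are summed rather than averaged, the effective step size also grows with `bsz`.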

In [76]:
num_class = len(label2id)
vocab_size = len(vocab.token2id)
emb_dim = 300   # embedding dimension
n_layers = 4
dropout = 0.2

# init model and optimizer
model = TextClassifier(vocab_size=vocab_size, embedding_dim=emb_dim, hidden_dim=300, output_dim=num_class).to(DEVICE)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# init trainer
trainer = Trainer(model=model,
                  optimizer=optimizer,
                  train_dataset=train_dataset,
                  valid_dataset=valid_dataset,
                  test_dataset=test_dataset)
# start training, train for 3 epochs (going through the dataset 3 times)
trainer.train(num_epoch=3)

After starting the training, you will see the performance is quite poor.

This is the starting point of Implementation Task 1 in Section 3.


In [41]:
num_class = len(label2id)
vocab_size = len(vocab.token2id)
emb_dim = 300   # embedding dimension

# init model and optimizer
model = BagOfWordsClassifier(vocab_size, emb_dim, num_class).to(DEVICE)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# init trainer
trainer = Trainer(model=model,
                  optimizer=optimizer,
                  train_dataset=train_dataset,
                  valid_dataset=valid_dataset,
                  test_dataset=test_dataset)
# start training, train for 3 epochs (going through the dataset 3 times)
trainer.train(num_epoch=3)

===== Model parameters and shape ====
embedding.weight torch.Size([3596, 300])
fc.weight torch.Size([17, 300])
fc.bias torch.Size([17])


KeyboardInterrupt: 

# Section 3. Tasks

## Implementation Task 1: Fixing Bag-of-Words Classifier
Currently, the bag-of-words classifier is **not** implemented **correctly**:
It only uses the embedding of the **first word** for prediction.

Implement an aggregation mechanism and report the performance.



## Implementation Task 2: Sequential Layer

Recall that the bag-of-words approach **ignores word order**.

In the lecture, we covered several models that are better at modeling the **sequential** nature of the input.

Implement an **LSTM-based** classifier, and compare the performance to the previous results.

(*Hint: see [here](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) for the documentation of LSTM in PyTorch*)



## Implementation Task 3: Low-Resource Condition
Currently, there are over **50,000 sentences** in our training data. 

In many practical scenarios, we do not have that much **labeled data** for training.

Truncate the training data to simulate the **low-resource** case of training with 1,000 sentences.

How do the results differ from the previous ones?

What could we do to improve the system? Implement this idea.
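
For the truncation step, plain slicing (`train_dataset[:1000]`) is the simplest option. A random subset is an alternative that avoids taking only consecutive sentences, which tend to come from the same dialogs. The sketch below shows the random variant on dummy data; `truncate_dataset` and `toy_train` are hypothetical names, and the seed is an arbitrary choice.

```python
import random

def truncate_dataset(dataset, max_size, seed=0):
    """Return a random subset of at most `max_size` items without modifying `dataset`."""
    if len(dataset) <= max_size:
        return list(dataset)
    rng = random.Random(seed)       # local RNG, leaves the global random state untouched
    return rng.sample(dataset, max_size)

# stand-in for train_dataset: 50,836 dummy (text, label) pairs
toy_train = [(f"sentence {i}", i % 17) for i in range(50836)]
small_train = truncate_dataset(toy_train, max_size=1000)
print(len(small_train))   # 1000
```

The dev and test sets should stay at their full size so that the results remain comparable with the earlier runs.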


# Next Steps

This Praktikum is ungraded. You do not have to send your solutions in.

You are, however, highly encouraged to ask for feedback during the practical session.

# Credits

Some code snippets in this Praktikum are from the
[PyTorch tutorials](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html).

The MultiWOZ dataset we worked with was created by [Budzianowski et al. (2018)](https://aclanthology.org/D18-1547.pdf).