# CIS6930 Week 6a: Recurrent Neural Networks Assignment Template

---

Preparation: Go to `Runtime > Change runtime type` and choose `GPU` for the hardware accelerator.



In [None]:
gpu_info = !nvidia-smi -L
gpu_info = "\n".join(gpu_info)
if gpu_info.find("failed") >= 0:
    print("Not connected to a GPU")
else:
    print(gpu_info)

## Preparation

For this assignment, we use Hugging Face's `datasets` library, which offers a simple interface to download a wide variety of NLP datasets.

In [None]:
!pip install datasets

In [None]:
from datasets import load_dataset

# https://huggingface.co/datasets/conll2003
train_dataset = load_dataset("conll2003", split="train")
valid_dataset = load_dataset("conll2003", split="validation")
test_dataset = load_dataset("conll2003", split="test")

Each dataset object follows the data format below. For the assignment, you will only use `ner_tags` and `tokens`.

```
>>> train_dataset[0]

{'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'id': '0',
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.']}
```

Here is the information about `ner_tags`. For more details, see https://huggingface.co/datasets/conll2003.

`ner_tags`: a list of classification labels, with possible values O (0), B-PER (1), I-PER (2), B-ORG (3), I-ORG (4), B-LOC (5), I-LOC (6), B-MISC (7), and I-MISC (8).
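
As a quick illustration of the label scheme above, the integer IDs can be mapped back to tag names. This is a minimal sketch: `NER_LABELS` and `decode_tags` are hypothetical names, and the list is transcribed from the description above rather than read from the dataset object.

```python
# Label names in ID order, transcribed from the list above.
# NER_LABELS and decode_tags are hypothetical names used for illustration.
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def decode_tags(tag_ids):
    """Convert a list of integer NER tag IDs into their string labels."""
    return [NER_LABELS[i] for i in tag_ids]

# Tokens and tags copied from train_dataset[0] shown above.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]
for token, label in zip(tokens, decode_tags(ner_tags)):
    print(f"{token}\t{label}")
```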



In [None]:
train_dataset[0]

## Problem 1: Create a vocabulary

As shown above, sentences in the dataset are already tokenized. However, no token IDs have been assigned yet, so the dataset cannot be fed to a neural network as is.

**Problem 1**: Create a vocabulary and convert tokens into token IDs.

**Instructions/Hints:**
- Construct a vocabulary using `train_dataset`.
- Make sure to reserve an ID for unknown tokens.
- The vocabulary can be just a `dict` object.
- Apply the constructed vocabulary to train/validation/test datasets and make `token_ids` fields.
- Use `dataset.map` for the token ID conversion.
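
The hints above can be sketched on toy data as follows. This is one possible approach, not the required solution: `build_vocab`, `tokens_to_ids`, the `<pad>`/`<unk>` strings, and the toy sentences are all assumptions made for illustration, and reserving a padding ID is an extra choice beyond what the hints ask for.

```python
# Toy sentences standing in for the training tokens (assumption for illustration).
toy_sentences = [["EU", "rejects", "German", "call"], ["German", "lamb", "."]]

# Reserved entries; the token strings and their IDs are hypothetical choices.
PAD_TOKEN, UNK_TOKEN = "<pad>", "<unk>"

def build_vocab(sentences):
    """Assign a unique integer ID to every token, in order of first appearance."""
    vocab = {PAD_TOKEN: 0, UNK_TOKEN: 1}
    for sentence in sentences:
        for token in sentence:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def tokens_to_ids(tokens, vocab):
    """Look up each token, falling back to the unknown-token ID."""
    unk_id = vocab[UNK_TOKEN]
    return [vocab.get(token, unk_id) for token in tokens]

vocab = build_vocab(toy_sentences)
# "rejected" never appears in the toy training data, so it maps to <unk>.
print(tokens_to_ids(["German", "call", "rejected"], vocab))  # [4, 5, 1]
```

A function of the form `example -> example` that calls such a lookup on `example["tokens"]` could then be passed to `dataset.map`.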


In [None]:
# https://huggingface.co/docs/datasets/processing.html
def assign_tokenid(example):
    """This is example code that does NOT assign real token IDs.
    You need to revise the code"""
    example["token_ids"] = [999999 for token in example["tokens"]]
    return example

train_dataset = train_dataset.map(assign_tokenid)
valid_dataset = valid_dataset.map(assign_tokenid)
test_dataset = test_dataset.map(assign_tokenid)

In [None]:
# The processed dataset now contains a `token_ids` field
train_dataset[0]

In [None]:
# Select field(s) to use and specify the data type (i.e., PyTorch)
train_dataset.set_format(type="torch", columns=["token_ids", "ner_tags"])
valid_dataset.set_format(type="torch", columns=["token_ids", "ner_tags"])
test_dataset.set_format(type="torch", columns=["token_ids", "ner_tags"])

In [None]:
# `set_format` changes the dataset object in place, so you should now see only the selected fields
train_dataset[0]

## Problem 2: Implement collate function

Following the example in the RNN classification code, complete the collate function so that sequences are appropriately padded for batch sampling.

**Problem 2**: Complete `collate_func()`.

**Instructions/Hints:**
- `collate_func()` takes a batch and returns a batch.
- Use `torch.nn.utils.rnn.pad_sequence` for padding.
- Hint: insert `import pdb; pdb.set_trace()` to check the data structure. 
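
To show what `pad_sequence` does on its own, here is a minimal sketch with made-up token IDs. The padding value of 0 is an assumption; it should match whichever ID your vocabulary treats as padding.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two toy sequences of different lengths (made-up IDs for illustration).
sequences = [torch.tensor([4, 5, 6]), torch.tensor([7, 8])]

# batch_first=True gives shape (batch, max_len); shorter sequences are
# right-padded with padding_value.
padded = pad_sequence(sequences, batch_first=True, padding_value=0)
print(padded.shape)  # torch.Size([2, 3])
print(padded)
```

The same call can be applied separately to the `token_ids` and `ner_tags` of a batch, possibly with different padding values for each.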


In [None]:
def collate_func(batch):
    ## !!Complete code!! ##
    pass
    return batch

Now, you should be able to load each of the dataset objects using `DataLoader` to make the datasets PyTorch-training ready.

In [None]:
from torch.utils.data import DataLoader
dl_train = DataLoader(train_dataset,
                      batch_size=8,
                      collate_fn=collate_func)
batch = next(iter(dl_train))
print(batch["token_ids"])
print(batch["ner_tags"])

## Problem 3: Design a BiLSTM model for NER

**Problem 3**: Implement `BiLSTMForNER` class.

**Instructions/Hints**:
- For the design choice, use a bidirectional LSTM.
- Consider refactoring `SimpleRNN` in [Hands-on session Colab](https://colab.research.google.com/drive/1DZN-Bo2HBnPQPm4jrQzEIchhHdN682qP?usp=sharing).
- Hint: The `SimpleRNN` class takes *only the last hidden state* for classification. For NER, the model has to output a prediction for each token. *What do you need to change?*
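
The difference between the last hidden state and the per-token outputs can be seen by inspecting tensor shapes. This is a sketch under assumed sizes, not the `SimpleRNN` code from the hands-on session; all dimensions and variable names below are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical sizes chosen only for this illustration.
vocab_size, embed_dim, hidden_dim, num_tags = 20, 8, 16, 9

embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
# A bidirectional LSTM concatenates forward and backward states,
# so the per-token feature size is 2 * hidden_dim.
tagger = nn.Linear(2 * hidden_dim, num_tags)

token_ids = torch.randint(1, vocab_size, (2, 5))  # (batch, seq_len)
outputs, (h_n, c_n) = lstm(embedding(token_ids))
logits = tagger(outputs)

print(outputs.shape)  # per-token states: torch.Size([2, 5, 32])
print(h_n.shape)      # final state per direction: torch.Size([2, 2, 16])
print(logits.shape)   # per-token tag scores: torch.Size([2, 5, 9])
```

`h_n` holds one vector per direction for the whole sequence, whereas `outputs` keeps one vector per token, which is the shape a tagging model needs.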

## Problem 4: Write a training script

**Problem 4**: Implement a `train()` function and run the script using the model and datasets. 

**Instructions/Hints**:
- Evaluate training/validation loss and accuracy.
- You can reuse `train()` in [Hands-on session Colab](https://colab.research.google.com/drive/1DZN-Bo2HBnPQPm4jrQzEIchhHdN682qP?usp=sharing), but you will need to modify it.
- Hint: Now, you have to consider multiple predictions for each sequence, which is different from the classification task.
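
One way to handle multiple predictions per sequence is to flatten the batch and time dimensions before computing the loss, and to exclude padded positions from both loss and accuracy. The sketch below uses random scores and assumes padded labels are marked with -100; that padding label is an assumption, not something the template prescribes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical shapes: 2 sequences, 4 time steps, 9 tags.
logits = torch.randn(2, 4, 9)
# -100 marks padded positions here; this padding label is an assumption,
# chosen because it matches the default ignore_index of CrossEntropyLoss.
labels = torch.tensor([[3, 0, 7, 0],
                       [1, 2, -100, -100]])

loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
# CrossEntropyLoss accepts (N, C) scores with (N,) targets, so flatten
# the batch and time dimensions together.
loss = loss_fn(logits.view(-1, logits.size(-1)), labels.view(-1))

# Token-level accuracy over real (non-padded) positions only.
mask = labels != -100
preds = logits.argmax(dim=-1)
accuracy = (preds[mask] == labels[mask]).float().mean()
print(loss.item(), accuracy.item())
```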



## Problem 5: Run the training script and summarize the results

**Problem 5**: Run the training script and report the numbers. 

**Instructions/Hints**:
- Add a discussion on the results. 
- [Optional] Update the `BiLSTMForNER` class that you implemented above to take hyperparameters for the network architecture (e.g., `LSTM` or `GRU`, bidirectional or not, stacked or not), so that you can conduct comparative experiments.
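
For the optional item, the recurrent layer could be selected from hyperparameters along these lines. This is only a sketch: `build_rnn` and `rnn_output_dim` are hypothetical helpers, and how they plug into your model class is left open.

```python
import torch.nn as nn

def build_rnn(rnn_type, input_dim, hidden_dim, num_layers=1, bidirectional=True):
    """Create an LSTM or GRU layer from hyperparameters (hypothetical helper)."""
    rnn_classes = {"LSTM": nn.LSTM, "GRU": nn.GRU}
    if rnn_type not in rnn_classes:
        raise ValueError(f"Unsupported rnn_type: {rnn_type}")
    return rnn_classes[rnn_type](input_dim, hidden_dim,
                                 num_layers=num_layers,
                                 batch_first=True,
                                 bidirectional=bidirectional)

def rnn_output_dim(hidden_dim, bidirectional):
    """Per-token feature size; it doubles when both directions are concatenated."""
    return hidden_dim * (2 if bidirectional else 1)

# Example: a stacked (2-layer) unidirectional GRU.
rnn = build_rnn("GRU", input_dim=8, hidden_dim=16, num_layers=2, bidirectional=False)
print(type(rnn).__name__, rnn_output_dim(16, bidirectional=False))
```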

