# Transformer-based Natural Language Processing
## Introduction to PyTorch & the 🤗 Framework
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/texttechnologylab/WiSe22-M-PNLR-PR-TbNLP/blob/master/datasets.ipynb)

### Aqcuiring Some Data

- Use the code below to accquire some sentence-segmented data.
    - Note: You may use also any other corpus available to you.

In [1]:
# Dowload a small dataset of sentences from the English Wikipedia from the "Wortschatz" project of the University Leipzig
# - D. Goldhahn, T. Eckart & U. Quasthoff: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages.
#   In: Proceedings of the 8th International Language Resources and Evaluation (LREC'12), 2012
!mkdir data

!curl http://pcai056.informatik.uni-leipzig.de/downloads/corpora/eng_wikipedia_2016_100K.tar.gz -o data/eng_wikipedia_2016_100K.tar.gz
!curl http://pcai056.informatik.uni-leipzig.de/downloads/corpora/eng_news_2016_100K.tar.gz -o data/eng_news_2016_100K.tar.gz
!tar -xf data/eng_wikipedia_2016_100K.tar.gz -C data/
!tar -xf data/eng_news_2016_100K.tar.gz -C data/

mkdir: cannot create directory ‘data’: File exists
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 24.3M  100 24.3M    0     0  34.2M      0 --:--:-- --:--:-- --:--:-- 34.2M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 25.9M  100 25.9M    0     0  43.9M      0 --:--:-- --:--:-- --:--:-- 43.9M


### Installing necessary packages (i.e. if on Colab)

In [2]:
# %pip install torch datasets tokenizers transformers

### Working with `🤗 Datasets`

1. Familiarize yourself with the `🤗 Datasets` package and it's API.
2. [Load](https://huggingface.co/docs/datasets/loading#local-and-remote-files) the plain text corpus that was downloaded using the code above.
3. [(Pre-)Process](https://huggingface.co/docs/datasets/process#map) the data:
    - Remove the line-number preceding each sentence.
    - Split the sentences into words/tokens.


#### Resources

- [`🤗 datasets` Documentation](https://huggingface.co/docs/datasets/index)

In [3]:
wikipedia_file = "data/eng_wikipedia_2016_100K/eng_wikipedia_2016_100K-sentences.txt"
news_file = "data/eng_wikipedia_2016_100K/eng_wikipedia_2016_100K-sentences.txt"

In [4]:
from datasets import load_dataset

corpus = load_dataset('text', data_files={'train': [wikipedia_file, news_file]}, split="train")
print(corpus)
print(corpus[0])


def remove_line_index(examples):
    for i, sample in enumerate(examples['text']):
        examples['text'][i] = ' '.join(sample.strip().split()[1:])
    return examples


corpus = corpus.map(remove_line_index, batched=True)
print(corpus[0])

Using custom data configuration default-0fa1bc84c0851a77


Downloading and preparing dataset text/default to /home/mastoeck/.cache/huggingface/datasets/text/default-0fa1bc84c0851a77/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad...




Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

0 tables [00:00, ? tables/s]

Dataset text downloaded and prepared to /home/mastoeck/.cache/huggingface/datasets/text/default-0fa1bc84c0851a77/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad. Subsequent calls will reuse this data.
Dataset({
    features: ['text'],
    num_rows: 200000
})
{'text': '1\t0.41% of the population were Hispanic or Latino of any race.'}


  0%|          | 0/200 [00:00<?, ?ba/s]

{'text': '0.41% of the population were Hispanic or Latino of any race.'}


### Working with `🤗 tokenizers`

1. Implement a tokenization approach using the `🤗 tokenizers` library.
    - There are [multiple different models](https://huggingface.co/docs/tokenizers/python/latest/components.html#models) of tokenizers available. Which one do you choose for the task at hand?
2. Tokenize your dataset using the new tokenizer and rerun your experiment from above.
3. Evaluate the results and compare them with the results from above.

#### Resources

- [`🤗 Tokenizers` Documentation](https://huggingface.co/docs/tokenizers/python/latest/)
- [`🤗 Transformers` "Use tokenizers from 🤗 Tokenizers"](https://huggingface.co/docs/transformers/main/en/fast_tokenizers)

In [5]:
from tqdm.notebook import trange
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.WordLevel(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([normalizers.NFKC(), normalizers.Lowercase()])
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([pre_tokenizers.UnicodeScripts(), pre_tokenizers.Whitespace()])

trainer = trainers.WordLevelTrainer(
    vocab_size=32000,
    min_frequency=50,
    special_tokens=["[PAD]", "[UNK]"],
    show_progress=True
)


def _batch_iterator(batch_size=1000):
    for i in trange(0, len(corpus), batch_size, desc="Training Tokenizer from Iterator", unit=" batches"):
        yield corpus[i: i + batch_size]["text"]


tokenizer.train_from_iterator(_batch_iterator(), trainer=trainer)

unk_id = tokenizer.token_to_id(tokenizer.model.unk_token)
pad_id = tokenizer.token_to_id('[PAD]')

Training Tokenizer from Iterator:   0%|          | 0/200 [00:00<?, ? batches/s]

In [6]:
tokenizer.save("tokenizer.json")

In [7]:
print(tokenizer.encode("This is an   example-sentence, that is some 123456 words long!²").ids)
print(tokenizer.encode("[PAD] Some sentence alkgjöeo!").ids)

[29, 10, 31, 157, 11, 2640, 4, 15, 10, 55, 1, 469, 162, 1527, 91]
[0, 55, 2640, 1, 1527]


In [8]:
batch = tokenizer.encode_batch(
    ["This is an   example-sentence, that is some 123456 words long!²", "[PAD] Some sentence alkgjöeo!"])
print(batch)

print(batch.ids)  # will raise AttributeError

[Encoding(num_tokens=15, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]), Encoding(num_tokens=5, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]


AttributeError: 'list' object has no attribute 'ids'

### Convert your tokenizer to a `🤗 Transformer` tokenizer

In [9]:
from transformers import PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

In [10]:
fast_tokenizer.add_special_tokens({'pad_token': '[PAD]'})

0

### Apply your tokenizer to the dataset

In [11]:
def encode(batch: dict):
    return fast_tokenizer(
        batch['text'],
        return_token_type_ids=False,
        return_attention_mask=False,
    )


dataset = corpus.map(encode, batched=True)

  0%|          | 0/200 [00:00<?, ?ba/s]

In [12]:
print(dataset[0:5])

{'text': ['0.41% of the population were Hispanic or Latino of any race.', "'06 DTAs appear to have made essentially simultaneous and duplicative amendments to the Code and its notes.", '; 1011 Configuration Write : This operates analogously to a configuration read.', "10 May 2007 Former IRA members Anthony McIntyre and Richard O'Rawe have claimed Adams was a key figure in the IRA.", '; 1111 Memory Write and Invalidate : This command is identical to a generic memory write, but comes with the guarantee that one or more whole cache lines will be written, with all byte selects enabled.'], 'input_ids': [[270, 3, 2798, 139, 5, 2, 336, 33, 5504, 28, 1, 5, 110, 1147, 3], [17, 1, 1, 843, 7, 37, 100, 2502, 6760, 6, 1, 1, 7, 2, 598, 6, 47, 1330, 3], [58, 1, 2664, 1246, 42, 29, 3606, 1, 7, 9, 2664, 907, 3], [204, 68, 521, 588, 4958, 227, 4775, 1, 6, 1450, 819, 17, 1, 37, 1285, 3901, 12, 9, 600, 1373, 8, 2, 4958, 3], [58, 1, 792, 1246, 6, 1, 42, 29, 1146, 10, 2695, 7, 9, 7433, 792, 1246, 4, 41, 134

In [13]:
pt_dataset = dataset.with_format('torch', columns=['input_ids'])
print(pt_dataset[0:5])

{'input_ids': [tensor([ 270,    3, 2798,  139,    5,    2,  336,   33, 5504,   28,    1,    5,
         110, 1147,    3]), tensor([  17,    1,    1,  843,    7,   37,  100, 2502, 6760,    6,    1,    1,
           7,    2,  598,    6,   47, 1330,    3]), tensor([  58,    1, 2664, 1246,   42,   29, 3606,    1,    7,    9, 2664,  907,
           3]), tensor([ 204,   68,  521,  588, 4958,  227, 4775,    1,    6, 1450,  819,   17,
           1,   37, 1285, 3901,   12,    9,  600, 1373,    8,    2, 4958,    3]), tensor([  58,    1,  792, 1246,    6,    1,   42,   29, 1146,   10, 2695,    7,
           9, 7433,  792, 1246,    4,   41, 1342,   20,    2, 5860,   15,   44,
          28,   49,  822, 6207,  632,   95,   26,  389,    4,   20,   51, 5382,
           1, 4709,    3])]}


In [14]:
import torch
from torch.nn.utils.rnn import pad_sequence

pt_dataset.set_transform(
    lambda batch: batch | {'input_ids': pad_sequence(map(torch.tensor, batch['input_ids']), batch_first=True, padding_value=pad_id)},
    columns=['input_ids']
)
print(pt_dataset[0:5])

{'input_ids': tensor([[ 270,    3, 2798,  139,    5,    2,  336,   33, 5504,   28,    1,    5,
          110, 1147,    3,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0],
        [  17,    1,    1,  843,    7,   37,  100, 2502, 6760,    6,    1,    1,
            7,    2,  598,    6,   47, 1330,    3,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0],
        [  58,    1, 2664, 1246,   42,   29, 3606,    1,    7,    9, 2664,  907,
            3,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0],
        [ 204,   68,  521,  588, 4958,  227, 4775,    1,    6, 1450,  819,   17,
            1,   37, 1285, 3901,   12,    9,  600, 1373,    8,    2, 4958,    3,
           

In [15]:
# Recommended way?
# https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.with_transform
def encode_pt(batch: dict):
    return fast_tokenizer(
        batch['text'],
        return_token_type_ids=False,
        return_attention_mask=False,
        padding=True,
        truncation=True,
        return_tensors='pt'
    )

jit_dataset = corpus.with_transform(encode_pt)
print(jit_dataset[0:5])

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


{'input_ids': tensor([[ 270,    3, 2798,  139,    5,    2,  336,   33, 5504,   28,    1,    5,
          110, 1147,    3,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0],
        [  17,    1,    1,  843,    7,   37,  100, 2502, 6760,    6,    1,    1,
            7,    2,  598,    6,   47, 1330,    3,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0],
        [  58,    1, 2664, 1246,   42,   29, 3606,    1,    7,    9, 2664,  907,
            3,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0],
        [ 204,   68,  521,  588, 4958,  227, 4775,    1,    6, 1450,  819,   17,
            1,   37, 1285, 3901,   12,    9,  600, 1373,    8,    2, 4958,    3,
           