# Installation
You can skip these installation instructions if you already have a roughly equivalent environment.

We will use Conda and pip to install the packages for this homework. If you do not have Miniconda or Anaconda, you can install Miniconda from https://docs.conda.io/en/latest/miniconda.html.

```
conda create --name tutorial3 python=3.7
conda activate tutorial3

pip install jupyter
```

If you don't already have PyTorch, follow the instructions at https://pytorch.org/ to install it, then run:
```
pip install transformers sacremoses
```

Start Jupyter Notebook with:
```
jupyter notebook
```

# Preface
This tutorial draws heavily from the examples and documentation of the [Hugging Face Transformers library](https://github.com/huggingface/transformers).

In [1]:
import warnings
warnings.filterwarnings('ignore')

# Tokenization

In [11]:
text = 'Eddie Van Halen, the guitarist and songwriter who helped give the radio-rock band Van Halen its name and sound, died Tuesday after a battle with cancer. He was 65.'

## Word Tokenization
### Moses

In [12]:
from sacremoses import MosesTokenizer, MosesDetokenizer
tokenizer = MosesTokenizer(lang='en')
tokens = tokenizer.tokenize(text)
print(tokens)

['Eddie', 'Van', 'Halen', ',', 'the', 'guitarist', 'and', 'songwriter', 'who', 'helped', 'give', 'the', 'radio-rock', 'band', 'Van', 'Halen', 'its', 'name', 'and', 'sound', ',', 'died', 'Tuesday', 'after', 'a', 'battle', 'with', 'cancer', '.', 'He', 'was', '65', '.']


In [13]:
detokenizer = MosesDetokenizer(lang='en')
print(detokenizer.detokenize(tokens))

Eddie Van Halen, the guitarist and songwriter who helped give the radio-rock band Van Halen its name and sound, died Tuesday after a battle with cancer. He was 65.


## Subword (Byte-Pair) Tokenization
### WordPiece (BERT)

In [15]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize(text)
print(tokens)

['eddie', 'van', 'hale', '##n', ',', 'the', 'guitarist', 'and', 'songwriter', 'who', 'helped', 'give', 'the', 'radio', '-', 'rock', 'band', 'van', 'hale', '##n', 'its', 'name', 'and', 'sound', ',', 'died', 'tuesday', 'after', 'a', 'battle', 'with', 'cancer', '.', 'he', 'was', '65', '.']


In [16]:
print(' '.join(tokens).replace(' ##', ''))

eddie van halen , the guitarist and songwriter who helped give the radio - rock band van halen its name and sound , died tuesday after a battle with cancer . he was 65 .


In [17]:
print(tokenizer.vocab_size)

30522


In [19]:
print(tokenizer.encode(tokens))

[101, 5752, 3158, 13084, 2078, 1010, 1996, 5990, 1998, 6009, 2040, 3271, 2507, 1996, 2557, 1011, 2600, 2316, 3158, 13084, 2078, 2049, 2171, 1998, 2614, 1010, 2351, 9857, 2044, 1037, 2645, 2007, 4456, 1012, 2002, 2001, 3515, 1012, 102]


### Byte-Level Byte Pair Encoding (GPT-2)

In [20]:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokens = tokenizer.tokenize(text)
print(tokens)

['E', 'dd', 'ie', 'ĠVan', 'ĠHal', 'en', ',', 'Ġthe', 'Ġguitarist', 'Ġand', 'Ġsong', 'writer', 'Ġwho', 'Ġhelped', 'Ġgive', 'Ġthe', 'Ġradio', '-', 'rock', 'Ġband', 'ĠVan', 'ĠHal', 'en', 'Ġits', 'Ġname', 'Ġand', 'Ġsound', ',', 'Ġdied', 'ĠTuesday', 'Ġafter', 'Ġa', 'Ġbattle', 'Ġwith', 'Ġcancer', '.', 'ĠHe', 'Ġwas', 'Ġ65', '.']
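
In the tokens above, `Ġ` stands in for a leading space, so the text can be recovered by concatenating the tokens and mapping `Ġ` back to a space. The following is a minimal sketch that reuses the first few tokens from the output above. It only covers plain ASCII text like this example: byte-level BPE also remaps other bytes to printable characters, and for general text the tokenizer's own `convert_tokens_to_string` method is the safer choice.

```python
# First few GPT-2 tokens copied from the output above.
tokens = ['E', 'dd', 'ie', 'ĠVan', 'ĠHal', 'en', ',', 'Ġthe', 'Ġguitarist']

def detokenize_gpt2_ascii(tokens):
    """Join byte-level BPE tokens and restore spaces (ASCII-only sketch)."""
    return ''.join(tokens).replace('Ġ', ' ')

detokenized = detokenize_gpt2_ascii(tokens)
print(detokenized)
```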


In [21]:
print(tokenizer.vocab_size)

50257


In [22]:
print(tokenizer.encode(tokens))

[36, 1860, 494, 6656, 11023, 268, 11, 262, 32705, 290, 3496, 16002, 508, 4193, 1577, 262, 5243, 12, 10823, 4097, 6656, 11023, 268, 663, 1438, 290, 2128, 11, 3724, 3431, 706, 257, 3344, 351, 4890, 13, 679, 373, 6135, 13]


### SentencePiece (XLNet)

In [23]:
from transformers import XLNetTokenizer
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
tokens = tokenizer.tokenize(text)
print(' '.join(tokens))

▁Eddie ▁Van ▁Hal en , ▁the ▁guitarist ▁and ▁ songwriter ▁who ▁helped ▁give ▁the ▁radio - rock ▁band ▁Van ▁Hal en ▁its ▁name ▁and ▁sound , ▁died ▁Tuesday ▁after ▁a ▁battle ▁with ▁cancer . ▁He ▁was ▁65 .


In [24]:
detokenized = ''.join(tokens).replace('▁', ' ')
print(detokenized)

 Eddie Van Halen, the guitarist and songwriter who helped give the radio-rock band Van Halen its name and sound, died Tuesday after a battle with cancer. He was 65.


In [25]:
print(tokenizer.encode(tokens))

[9142, 2641, 5842, 254, 19, 18, 11342, 21, 17, 10943, 61, 1351, 371, 18, 1242, 13, 6651, 1014, 2641, 5842, 254, 81, 304, 21, 1224, 19, 650, 376, 99, 24, 1727, 33, 1847, 9, 69, 30, 3295, 9, 4, 3]


# Transformer Models

Generation code adapted from https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py

In [26]:
import transformers
model_classes = {
    'gpt2': (transformers.GPT2LMHeadModel, transformers.GPT2Tokenizer),
    'ctrl': (transformers.CTRLLMHeadModel, transformers.CTRLTokenizer),
    'openai-gpt': (transformers.OpenAIGPTLMHeadModel, transformers.OpenAIGPTTokenizer),
    'xlnet': (transformers.XLNetLMHeadModel, transformers.XLNetTokenizer),
    'transfo-xl': (transformers.TransfoXLLMHeadModel, transformers.TransfoXLTokenizer),
    'xlm': (transformers.XLMWithLMHeadModel, transformers.XLMTokenizer),
}

In [27]:
cls = 'gpt2'
LMHead, Tokenizer = model_classes[cls]

tokenizer = Tokenizer.from_pretrained(cls)
model = LMHead.from_pretrained(cls)

In [28]:
model.transformer

GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0): Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): Attention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): MLP(
        (c_fc): Conv1D()
        (c_proj): Conv1D()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (1): Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): Attention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): MLP(
        (c_fc): Conv1D

In [29]:
def count_params(network, requires_grad=False):
    return sum(p.numel() for p in network.parameters() if not requires_grad or p.requires_grad)
print('%.1fM parameters' % (count_params(model.transformer) / 1e6))

124.4M parameters


| GPT-2 Name | Parameters |
|---------|------|
| Small   | 124M |
| Medium  | 355M |
| Large   | 774M |
| X-Large | 1.5B |

In [30]:
model.lm_head

Linear(in_features=768, out_features=50257, bias=False)

In [31]:
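import torch
# Fall back to the CPU when no GPU is available.
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model.to(device)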

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): Laye

In [32]:
prompt = '“We’ve never seen this many people voting so far ahead of an election,” McDonald said.'

In [33]:
tokens = tokenizer.encode(prompt, add_special_tokens=False, return_tensors='pt').to(device)
tokens

tensor([[  447,   250,  1135,   447,   247,   303,  1239,  1775,   428,   867,
           661,  6709,   523,  1290,  4058,   286,   281,  3071,    11,   447,
           251, 14115,   531,    13]], device='cuda:0')

## Calculating Perplexity of the Prompt

Perplexity is the exponential of the average negative log-likelihood per token that a language model assigns to a sequence. **High perplexity** means that the model is **less likely to generate the sequence**. In natural language, this happens when the sequence is rare or when the language model is not predictive (i.e., it is a poor language model).

In [34]:
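# Index the output instead of unpacking it: this works whether the model
# returns a tuple or a ModelOutput, which depends on the transformers version.
outputs = model(input_ids=tokens, labels=tokens)
loss = outputs[0]  # mean cross-entropy over the predicted tokens
perplexity = loss.exp()
print(f'Loss: {loss.cpu().item()}    Perplexity: {perplexity.cpu().item()}')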

Loss: 5.215977668762207    Perplexity: 184.1918182373047
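
The loss printed above is the mean cross-entropy (negative log-likelihood) per predicted token, and the perplexity is its exponential. The same relationship can be sketched in plain Python. The per-token probabilities below are hypothetical values chosen for illustration, not numbers produced by GPT-2.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probability the model assigned to each observed token."""
    nll = [-math.log(p) for p in token_probs]  # negative log-likelihood per token
    mean_nll = sum(nll) / len(nll)             # this is the "loss"
    return math.exp(mean_nll)

# Hypothetical per-token probabilities (assumptions, not model outputs).
confident = [0.5, 0.5, 0.5, 0.5]      # model finds every token fairly likely
uncertain = [0.01, 0.01, 0.01, 0.01]  # model finds every token unlikely

print(perplexity(confident))
print(perplexity(uncertain))
```

A model that assigns probability 0.5 to every token has a perplexity of about 2, while one that assigns 0.01 has a perplexity of about 100: lower per-token probability means higher perplexity.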


## Conditional Generation Given the Prompt

In [37]:
output_sequences = model.generate(
    input_ids=tokens,
    max_length=100,
    temperature=1,
    top_k=0,
    top_p=0.9,
    repetition_penalty=1.0,
    do_sample=True,
    num_return_sequences=3,
)
output_sequences

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


tensor([[  447,   250,  1135,   447,   247,   303,  1239,  1775,   428,   867,
           661,  6709,   523,  1290,  4058,   286,   281,  3071,    11,   447,
           251, 14115,   531,    13,   198,   198,     1,  1212,   318,   636,
           286,   674,   976,  1080,    11,   428,   318,   644,   338,  5836,
           287,  4505,    13,   447,   242,   775,  1183,  1464,   307,   428,
          6655,  1080,    11,   564,   250,   392,  1282,   766,   345,   994,
           290,  3015,   553, 14115,   531,    13,   198,   198,  3260, 46913,
           509,  1436,   338,  4039, 22972,    11,   339,   550,  7147,   655,
           530,  1729,    12, 29762, 43265,   508, 14451,   509,  1436,   338,
          2748, 10330,   286,  5373,   287,   477,   465,  2180,  1542,  3096],
        [  447,   250,  1135,   447,   247,   303,  1239,  1775,   428,   867,
           661,  6709,   523,  1290,  4058,   286,   281,  3071,    11,   447,
           251, 14115,   531,    13,   628,   198, 

In [38]:
for output_sequence in output_sequences.cpu().numpy():
    print(tokenizer.decode(output_sequence, clean_up_tokenization_spaces=True))
    print('\n')

“We’ve never seen this many people voting so far ahead of an election,” McDonald said.

"This is part of our same system, this is what's happening in Australia.— We'll always be this surprised system, “and come see you here and vote," McDonald said.

After electing Katter's chief whip, he had chosen just one non-single MLA who matched Katter's exact margin of victory in all his previous 30 board


“We’ve never seen this many people voting so far ahead of an election,” McDonald said.


"Today is a major day for us," she said, adding that it was an unprecedented step.


Thoughts? Follow @dvanslink<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoft
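
The `generate` call above samples with `top_p=0.9` (nucleus sampling): at each step, only the smallest set of most-likely tokens whose probabilities sum to at least 0.9 is kept, and the next token is drawn from that set. `top_k=0` disables top-k filtering and `temperature=1` leaves the distribution unchanged. The following is a simplified pure-Python sketch of the idea, not the library's implementation, which works on batched logit tensors. The function names and the five-token distribution are hypothetical.

```python
import random

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-likely tokens whose total probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample(filtered, rng=random):
    """Draw one token id from the filtered distribution."""
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]

# Hypothetical next-token distribution over a 5-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.04, 0.01]
nucleus = top_p_filter(probs, top_p=0.9)
print(sorted(nucleus))   # token ids that survive the filter
print(sample(nucleus))   # one sampled token id
```

With this distribution, the two low-probability tokens are dropped, so they can never be sampled. Cutting off the unlikely tail in this way is the purpose of nucleus sampling.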

# Pipeline

Pipelines bundle tokenization, model inference, and post-processing for common tasks behind a single call. See the pipeline source code at https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines.py

In [39]:
from transformers import pipeline

## Sentiment Analysis

In [40]:
classifier = pipeline('sentiment-analysis')
classifier.model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [41]:
classifier('It was a sweet ending.')

[{'label': 'POSITIVE', 'score': 0.9997830986976624}]

In [42]:
classifier('It was a bitter ending.')

[{'label': 'NEGATIVE', 'score': 0.9977140426635742}]

In [43]:
classifier('It was a bittersweet ending.')

[{'label': 'POSITIVE', 'score': 0.9788888692855835}]

## Question Answering

In [44]:
question_answerer = pipeline('question-answering')
question_answerer.model

DistilBertForQuestionAnswering(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            

In [45]:
question_answerer({
    'question': 'Who will win the election?',
    'context': 'The candidates for the 2020 US presidential election are Donald Trump and Joe Biden.'
})



{'score': 0.4824332296848297,
 'start': 57,
 'end': 83,
 'answer': 'Donald Trump and Joe Biden.'}

In [46]:
question_answerer({
    'question': 'Which candidate will win the election?',
    'context': 'The candidates for the 2020 US presidential election are Donald Trump and Joe Biden.'
})



{'score': 0.5508456826210022, 'start': 57, 'end': 69, 'answer': 'Donald Trump'}

In [47]:
question_answerer({
    'question': 'Which candidate will win the election?',
    'context': 'The candidates for the 2020 US presidential election are Joe Biden and Donald Trump.'
})



{'score': 0.8045397400856018,
 'start': 71,
 'end': 83,
 'answer': 'Donald Trump.'}