!pip install transformers

# Tasks

## Sequence classification

In [1]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [3]:
result = classifier('I hate you')
result

[{'label': 'NEGATIVE', 'score': 0.9991129040718079}]

> Paraphrase classification with a pre-trained BERT model

In [4]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_PATH = "bert-base-cased-finetuned-mrpc"

# 1. Instantiate tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)

classes = ['not paraphrase', 'is paraphrase']

sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"


# 2. Tokenize the sentence pairs
# The tokenizer automatically adds the model-specific special tokens (for BERT, [CLS] and [SEP])
# to the sequence and computes the attention mask.
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

In [5]:
paraphrase

{'input_ids': tensor([[  101,  1109,  1419, 20164, 10932,  2271,  7954,  1110,  1359,  1107,
          1203,  1365,  1392,   102, 20164, 10932,  2271,  7954,   112,   188,
          3834,  1132,  3629,  1107,  6545,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1]])}
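
To make the three fields in the output above concrete, the following sketch rebuilds `token_type_ids` and `attention_mask` by hand from the printed `input_ids`. It assumes that id 102 marks the `[SEP]` token, as the printed ids suggest. It only illustrates the layout of a sentence pair and is not how the tokenizer is implemented.

```python
# input_ids copied from the output above: [CLS] sentence A [SEP] sentence B [SEP]
input_ids = [101, 1109, 1419, 20164, 10932, 2271, 7954, 1110, 1359, 1107,
             1203, 1365, 1392, 102, 20164, 10932, 2271, 7954, 112, 188,
             3834, 1132, 3629, 1107, 6545, 102]

SEP_ID = 102  # assumption: id of [SEP] in this vocabulary

# Everything up to and including the first [SEP] belongs to segment 0,
# the rest (sentence B and its closing [SEP]) belongs to segment 1.
first_sep = input_ids.index(SEP_ID)
token_type_ids = [0 if i <= first_sep else 1 for i in range(len(input_ids))]

# Without padding, every position is a real token, so the mask is all ones.
attention_mask = [1] * len(input_ids)

print(token_type_ids)
print(attention_mask)
```

The segment ids are what lets the model tell the two sentences of the pair apart.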

In [22]:
# 3. retrieve the logits
paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

# 4. retrieve probabilities through softmax
paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

In [27]:
def print_res(classes, prob):
    for c, p in zip(classes, prob):
        print(f"{c}: {p*100:.2f}%")

#paraphrase
print("O: ", sequence_0)
print("P: ", sequence_2)
print_res(classes, paraphrase_results)
print("-"*10)

#not paraphrase
print("O: ", sequence_0)
print("P: ", sequence_1)
print_res(classes, not_paraphrase_results)
print("-"*10)


O:  The company HuggingFace is based in New York City
P:  HuggingFace's headquarters are situated in Manhattan
not paraphrase: 9.54%
is paraphrase: 90.46%
----------
O:  The company HuggingFace is based in New York City
P:  Apples are especially bad for your health
not paraphrase: 94.04%
is paraphrase: 5.96%
----------


In [24]:
paraphrase_results

[0.09536290913820267, 0.9046370387077332]

In [20]:
torch.softmax(paraphrase_classification_logits, dim=1)

tensor([[0.0954, 0.9046]], grad_fn=<SoftmaxBackward0>)
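
Step 4 above turns logits into probabilities with a softmax. The following framework-free sketch shows what that computation does. The logit values are hypothetical and are not taken from the model.

```python
import math

def softmax(logits):
    """Map a list of raw scores to probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability and does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ['not paraphrase', 'is paraphrase']
example_logits = [-1.0, 1.25]  # hypothetical logits, one per class

probs = softmax(example_logits)
for name, p in zip(classes, probs):
    print(f"{name}: {p * 100:.2f}%")
```

The class with the larger logit always receives the larger probability, so taking the argmax over logits or over probabilities gives the same prediction.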

## Extractive Question Answering
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a model on a SQuAD task, you may leverage the run_qa.py and run_tf_squad.py scripts.

### From Pipeline

In [33]:
from transformers import pipeline
question_answerer = pipeline("question-answering")
import re



No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


In [36]:
context = r"""Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the 
            SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a model on a SQuAD task, you may leverage the 
            examples/pytorch/question-answering/run_squad.py script."""
context = re.sub('\n', ' ', context)
context = re.sub(' +', ' ', context)
context

'Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.'

In [45]:
result = question_answerer(question="What is extractive question answering?", context=context)
result

{'score': 0.6177273988723755,
 'start': 33,
 'end': 94,
 'answer': 'the task of extracting an answer from a text given a question'}
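
The `start` and `end` values in the result are character offsets into the context. A minimal check, using only the first sentence of the context and the offsets printed above:

```python
# First sentence of the context used above (the offsets fall inside it).
context = ("Extractive Question Answering is the task of extracting an answer "
           "from a text given a question.")

# Offsets copied from the pipeline output above.
result = {"start": 33, "end": 94}

# Slicing the context with the offsets recovers the answer string.
answer = context[result["start"]:result["end"]]
print(answer)
```

This is why the task is called extractive: the answer is always a literal slice of the context.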

In [46]:
result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")


Answer: 'SQuAD dataset', score: 0.5152, start: 146, end: 159


Note that the model is 'simply' performing a heuristic search over the text. Given a question like "On what is based the SQuAD dataset?", the correct answer would be something like "It is based on the task of Extractive Question Answering". However, this requires more complex reasoning over the text.
>This model simply reduces the task to finding a sub-span of the text that likely answers the question.

In [41]:
result = question_answerer(question="On what is based the SQuAD dataset?", context=context)
result

{'score': 0.6470518708229065, 'start': 188, 'end': 197, 'answer': 'that task'}

### From AutoModel
As in the previous case, we can manage the tokenizer and model manually. This requires the following steps:
1. Instantiate the tokenizer and model from a checkpoint name
2. Define the context and the questions
    1. For each question, build a question-context sequence (the tokenizer adds the correct model separators)
    2. Tokenize the combined sequences
3. Pass the sequence through the model
    1. This outputs a range of scores across all sequence tokens (question and text), for both the start and end positions.
4. Compute the softmax of the result to get probabilities over the tokens.
5. Fetch the tokens between the identified start and end positions and convert them to a string.

In [49]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

#1. Initialize tokenizer and model
MODEL_PATH = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_PATH)

In [62]:
# 2. Prepare Context and questions
text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in 🤗 Transformers?",
    "What does 🤗 Transformers provide?",
    "🤗 Transformers provides interoperability between which frameworks?",
    "How many languages are available?"
]

for question in questions:
    tokens_dict = tokenizer(question, text, add_special_tokens=True, return_tensors='pt')
    # store ids for later
    input_ids = tokens_dict["input_ids"].tolist()[0]
    
    outputs = model(**tokens_dict)
    #logits
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits
    
#     probabilities (optional: argmax over the logits already gives the prediction)
#     answer_start_probabilities = torch.softmax(answer_start_scores, dim=1)
#     answer_end_probabilities = torch.softmax(answer_end_scores, dim=1)

    # predictions
    answer_start_token = torch.argmax(answer_start_scores)
    answer_end_token = torch.argmax(answer_end_scores)
    
    answer = tokenizer.decode(input_ids[answer_start_token:answer_end_token + 1])
    
    print(f"Q: {question}")
    print(f"A: {answer}")

Q: How many pretrained models are available in 🤗 Transformers?
A: over 32 +
Q: What does 🤗 Transformers provide?
A: general - purpose architectures
Q: 🤗 Transformers provides interoperability between which frameworks?
A: tensorflow 2. 0 and pytorch
Q: How many languages are available?
A: 100 + languages


## Language Modeling
Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling, GPT-2 with causal language modeling.

>Language modeling can be useful outside of pretraining as well, for example to **shift the model distribution** to be domain-specific: take a language model trained over a very large corpus and fine-tune it on a news dataset or on scientific papers, e.g. LysandreJik/arxiv-nlp.

### Masked Language Modeling
Masked language modeling is the task of masking tokens in a sequence with a masking token, and prompting the model to fill that mask with an appropriate token. 

> This allows the model to attend to both the right context and the left context!

#### From Pipeline

In [63]:
from transformers import pipeline
unmasker = pipeline('fill-mask')

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


In [67]:
from pprint import pprint
pprint(unmasker(f"HuggingFace is creating a {unmasker.tokenizer.mask_token} thath the community uses to solve NLP tasks"))

[{'score': 0.17354880273342133,
  'sequence': 'HuggingFace is creating a tool thath the community uses to '
              'solve NLP tasks',
  'token': 3944,
  'token_str': ' tool'},
 {'score': 0.09451553970575333,
  'sequence': 'HuggingFace is creating a framework thath the community uses to '
              'solve NLP tasks',
  'token': 7208,
  'token_str': ' framework'},
 {'score': 0.055287063121795654,
  'sequence': 'HuggingFace is creating a bot thath the community uses to solve '
              'NLP tasks',
  'token': 14084,
  'token_str': ' bot'},
 {'score': 0.04934253543615341,
  'sequence': 'HuggingFace is creating a library thath the community uses to '
              'solve NLP tasks',
  'token': 5560,
  'token_str': ' library'},
 {'score': 0.04610683396458626,
  'sequence': 'HuggingFace is creating a plugin thath the community uses to '
              'solve NLP tasks',
  'token': 43201,
  'token_str': ' plugin'}]


#### From AutoModel

1. Instantiate tokenizer and model
2. Create a sentence with `tokenizer.mask_token` instead of a word
3. Encode the sequence
4. Pass through the model
    1. The resulting logits can be used to compute the predicted token

In [70]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_PATH = 'distilbert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForMaskedLM.from_pretrained(MODEL_PATH)

In [110]:
sequence = "Distilled models are smaller than the models they mimic. Using them instead of the large " \
f"versions would help {tokenizer.mask_token} our carbon footprint."

inputs = tokenizer(sequence, return_tensors='pt')
outputs = model(**inputs)
logits = outputs.logits

# retrieve masked token
mask_token_index = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1]
mask_token_logits = logits[0, mask_token_index, :]
probabilities = torch.softmax(mask_token_logits, dim=1)

k = 5
print(sequence)
print(f'\n TOP-{k} predictions:')

# topk returns both the values and the indices
top_5_tokens = torch.topk(probabilities, k, dim=1)
for prob, token in zip(top_5_tokens.values[0], top_5_tokens.indices[0]):
    print(f'\t-{tokenizer.decode(token)} \tp={prob*100:.2f}%')
    

best_token = top_5_tokens.indices[0, 0]
print("\nFinal:")
print(sequence.replace(tokenizer.mask_token, tokenizer.decode(best_token)))

Distilled models are smaller than the models they mimic. Using them instead of the large versions would help [MASK] our carbon footprint.

 TOP-5 predictions:
	-reduce 	p=71.16%
	-increase 	p=4.62%
	-decrease 	p=3.26%
	-offset 	p=1.88%
	-improve 	p=1.77%

Final:
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.


In [90]:
logits.shape   #[batch_size, seq_size, vocab_size]

torch.Size([1, 30, 28996])
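
The shape `[batch_size, seq_size, vocab_size]` means there is one score per vocabulary entry for every position of every sequence. The following toy sketch uses nested lists in place of tensors, with a hypothetical three-word vocabulary, to show how the row for the masked position is selected and read.

```python
# Hypothetical vocabulary; index 2 plays the role of the mask token.
vocab = ["reduce", "increase", "[MASK]"]
mask_id = vocab.index("[MASK]")

# One sequence of four token ids; position 1 holds the mask.
input_ids = [[0, 2, 1, 0]]

# Toy logits with shape [batch_size=1, seq_size=4, vocab_size=3].
logits = [[[0.1, 0.2, 0.0],
           [2.0, 0.5, -1.0],
           [0.3, 0.1, 0.0],
           [0.0, 0.4, 0.2]]]

# Locate the masked position, then take its row of vocabulary scores.
mask_pos = input_ids[0].index(mask_id)
mask_logits = logits[0][mask_pos]

# The predicted token is the vocabulary entry with the highest score.
best = max(range(len(vocab)), key=lambda i: mask_logits[i])
print(vocab[best])
```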

### Causal Language Modeling (Next Token Prediction)
Causal language modeling is the task of predicting the token that follows a sequence of tokens. In this setting, the model attends only to the left context (the tokens to the left of the position being predicted). This kind of training is particularly useful for generation tasks.

> Usually, the next token is predicted by sampling from the logits of the last hidden state the model produces from the input sequence.

1. Initialize tokenizer and model
2. Prepare (encode) a sequence and pass through the model
3. Generate next token
    1. Retrieve the logits of the last token
    2. Filtering
    3. Sample
    4. Append token to sequence
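
The filter, sample, and append steps above can be sketched in plain Python. This is a minimal illustration of top-k filtering under stated assumptions: the logits and token ids are hypothetical, and the helper names `top_k_filter` and `softmax` are introduced here for the example. A library helper may differ in details, so this can also serve as a stand-in if such a helper is not available in your installed version.

```python
import math
import random

def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf (ties at the cutoff are kept)."""
    if k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]

def softmax(logits):
    """Convert logits to probabilities; -inf entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical last-token logits
filtered = top_k_filter(logits, k=2)
probs = softmax(filtered)

# Sample one token id according to the filtered distribution.
next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Append the sampled token to the (hypothetical) input ids.
sequence = [10, 11, 12]
sequence = sequence + [next_token]
print(probs, sequence)
```

Filtering before sampling prevents very unlikely tokens from ever being drawn while keeping some randomness among the plausible ones.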

In [130]:
from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering
import torch
from torch import nn

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

In [139]:
sequence = "Hugging Face is based in DUMBO, New York City, and"

inputs = tokenizer(sequence, return_tensors="pt")
input_ids = inputs["input_ids"]
outputs = model(**inputs)

In [152]:

# retrieve the logits of the last token
next_token_logits = outputs['logits'][:,-1,:]

# filter
next_token_logits = top_k_top_p_filtering(next_token_logits, 
                                          top_k=50, 
                                          top_p=1.0)

# sample
probs = torch.softmax(next_token_logits, dim=1)
next_token = torch.multinomial(probs, 1)
generated = torch.cat([input_ids, next_token], dim=-1)

generated_string = tokenizer.decode(generated.tolist()[0])
print(sequence)
print(generated_string)

Hugging Face is based in DUMBO, New York City, and
Hugging Face is based in DUMBO, New York City, and launched
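
The cell above passes `top_p=1.0`, which leaves nucleus (top-p) filtering effectively switched off. As a rough sketch of what a smaller value would do, the following function keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p` and renormalizes over that set. It works on probabilities instead of logits and uses a hypothetical distribution, so it illustrates the idea and is not the library's implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.15, 0.05]  # hypothetical next-token distribution

nucleus = top_p_filter(probs, 0.9)     # drops the least likely token
unfiltered = top_p_filter(probs, 1.0)  # keeps every token
print(nucleus)
print(unfiltered)
```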


### Text Generation
In **text generation** (a.k.a. open-ended text generation), the goal is to create a coherent portion of text that continues the given context. The following example shows how GPT-2 can be used in pipelines to generate text. By default, all models apply Top-K sampling when used in pipelines, as configured in their respective configurations (see the gpt-2 config for an example).

In [155]:
from transformers import pipeline
text_generator = pipeline('text-generation')
text_generator("As far as I am concerned, I will",
              max_length=50, 
              do_sample=False)

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]

In [157]:
text_generator("As far as I am concerned, I will",
              max_length=50, 
              do_sample=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "As far as I am concerned, I will do my utmost to protect anyone and everyone in the state from having any personal belongings taken away from them.\n\n\nLet's be clear for the record. The purpose of this policy, as described in these"}]

Without sampling (greedy decoding), the model can easily get stuck in a repetition loop:

In [159]:
text_generator("As far as I am concerned, I will",
              max_length=100, 
              do_sample=False)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea of a free market is a bit of a stretch. I think that the idea of a free market is a bit of a stretch. I think that the idea of a free market is a bit of a stretch. I think that the idea of a'}]
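
The loop in the output above is a typical effect of greedy decoding: once the most likely continuation leads back to an earlier state, the same path repeats. The toy sketch below reproduces this with a hypothetical next-token table in place of a real model, and contrasts it with sampling.

```python
import random

# Hypothetical next-token table: each token maps to candidate continuations.
toy_model = {
    "I":     {"think": 0.6, "will": 0.4},
    "think": {"that": 0.7, "so": 0.3},
    "that":  {"I": 0.55, "it": 0.45},
    "will":  {"think": 0.5, "that": 0.5},
    "so":    {"I": 1.0},
    "it":    {"will": 1.0},
}

def greedy_decode(start, steps):
    """Always pick the most likely continuation (do_sample=False behaviour)."""
    tokens = [start]
    for _ in range(steps):
        candidates = toy_model[tokens[-1]]
        tokens.append(max(candidates, key=candidates.get))
    return tokens

def sample_decode(start, steps, rng):
    """Draw each continuation at random according to its probability."""
    tokens = [start]
    for _ in range(steps):
        candidates = toy_model[tokens[-1]]
        words, weights = zip(*candidates.items())
        tokens.append(rng.choices(words, weights=weights, k=1)[0])
    return tokens

greedy = greedy_decode("I", 9)
sampled = sample_decode("I", 9, random.Random(0))
print(" ".join(greedy))   # cycles through the same three tokens
print(" ".join(sampled))
```

Sampling can still revisit the same tokens, but it is not forced onto a single repeating path.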

In [160]:
text_generator("As far as I am concerned, I will",
              max_length=100, 
              do_sample=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "As far as I am concerned, I will stick to our previous point of view as a reason to support people in this situation, rather than to be left alone. So, why can't you guys let everyone off with an option like that?\n\n\nIf you put your concerns into place then they can all be resolved right?\n\n\nI think we should let people know how we feel about it, we all have our views on it and we are all in the same boat here.\n\n\n"}]

In [161]:
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = 'xlnet-base-cased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)


In [165]:
# Padding text helps XLNet with short prompts - proposed by Aman Rusia in https://github.com/rusiaaman/XLNet-gen#methodology
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""

prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")["input_ids"]

prompt_length = len(tokenizer.decode(inputs[0]))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length+1:]

print(generated)

Today the weather is really nice and I am planning on heading out for lunch. After I get the house clean and clean, I am going to drive around the neighbourhood to see the neighbourhood's parks. After I have gotten up, I plan to drive around the neighborhood and see a lot of local parks. During my drive, I will take a break. (I'm really nervous, I'm about to
