# Hugging Face Transformers

A transformer is built from two main kinds of components:

* Encoder layers process the input into a contextual representation, which suits analysis tasks such as judging a movie review's tone.
* Decoder layers generate the output one token at a time, as in English-to-French translation.

## Text generation

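Hugging Face pretrained classes for GPT-2:

* GPT2LMHeadModel is the GPT-2 model with a language-modeling head, used for text generation
* GPT2Tokenizer
  * converts text to token IDs
  * handles subword tokenization (byte-pair encoding), so rare words are split into smaller known pieces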
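
To make subword tokenization concrete, here is a minimal sketch of byte-pair-encoding style merging in plain Python. It is a toy illustration only: the function `bpe_tokenize` and the merge table `toy_merges` are hypothetical and much simpler than what GPT2Tokenizer actually uses.

```python
def bpe_tokenize(word, merges):
    """Split a word into subword tokens by applying merges in priority order."""
    symbols = list(word)  # start from single characters
    for left, right in merges:  # earlier merges have higher priority
        merged = []
        i = 0
        while i < len(symbols):
            # Join the pair wherever the two symbols sit next to each other.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, invented for this example (not GPT-2's real merges).
toy_merges = [("t", "o"), ("e", "n"), ("to", "k"), ("tok", "en"), ("i", "z")]

tokens = bpe_tokenize("tokenizer", toy_merges)
print(tokens)  # ['token', 'iz', 'e', 'r']
```

The word is never mapped to an "unknown" token: pieces that have no merge rule simply stay as single characters.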

In [1]:
import torch

In [2]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

In [3]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [4]:
sample_text = 'The cat sat on the mat'

In [5]:
# return_tensors='pt' returns the token IDs as a PyTorch tensor
input_ids = tokenizer.encode(sample_text, return_tensors='pt')

Arguments:

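* temperature - controls the randomness of sampling; values below 1 make the most likely tokens even more likely and so reduce randomness. It only has an effect when do_sample=True.
* no_repeat_ngram_size - prevents any n-gram of this size from appearing twice in the generated tokens; with a value of 2, no pair of consecutive tokens is repeated.
* pad_token_id - GPT-2 has no padding token of its own, so this is set to the ID of the end-of-sequence (EOS) token. That token is then used as padding when a sequence finishes before the maximum length of 40 tokens (max_length counts the prompt tokens as well).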
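
The effect of temperature can be sketched without the model. The example below assumes the usual formulation, in which the scores (logits) are divided by the temperature before the softmax; the function name `softmax_with_temperature` and the logit values are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

probs_default = softmax_with_temperature(logits, temperature=1.0)
probs_cool = softmax_with_temperature(logits, temperature=0.7)

print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_cool])
```

With temperature=0.7 the top-scoring token receives a larger share of the probability than with temperature=1.0, which is why sampled text becomes less random.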

In [6]:
output = model.generate(
    input_ids, 
    max_length=40, 
    temperature=0.7, 
    no_repeat_ngram_size=2,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True
)

In [7]:
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
generated_text

"The cat sat on the mat, its legs pulled under the covers and its paws folded back on its knees and held his paws. The cat's eyes went wide, the cat was still breathing heavily,"

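What no_repeat_ngram_size=2 guarantees can be checked on a list of token IDs. The sketch below is a plain-Python check, not the library's implementation; `has_repeated_ngram` and the ID lists are hypothetical.

```python
def has_repeated_ngram(token_ids, n):
    """Return True if any run of n consecutive token IDs occurs more than once."""
    seen = set()
    for i in range(len(token_ids) - n + 1):
        ngram = tuple(token_ids[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False


# Hypothetical token IDs, not taken from the real model output.
ok_ids = [5, 8, 5, 9, 8, 9]   # single IDs repeat, but no pair does
bad_ids = [5, 8, 3, 5, 8]     # the pair (5, 8) occurs twice

print(has_repeated_ngram(ok_ids, 2))   # False
print(has_repeated_ngram(bad_ids, 2))  # True
```

The same check could be run on `output[0].tolist()`. Note that the constraint applies to tokens, not words: a phrase such as "The cat" can appear twice in the decoded text when its tokens differ, for example because GPT-2 encodes a word with and without a leading space as different tokens.
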
## Translation

`t5-small` is a small Text-to-Text Transfer Transformer (T5) model. Its translation tasks cover English to French, Romanian, and German.

In [8]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [9]:
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

In [10]:
sample_text = 'translate English to French: I love to read books'
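
The `translate English to French:` prefix in the cell above is how T5 is told which task to perform. A small helper can build such prompts; `build_translation_prompt` and `T5_TRANSLATION_TARGETS` are hypothetical names introduced for this sketch, not part of the transformers library.

```python
# Target languages covered by T5's English translation tasks.
T5_TRANSLATION_TARGETS = ("French", "German", "Romanian")


def build_translation_prompt(text, target_language):
    """Prepend the T5 task prefix for translating English text."""
    if target_language not in T5_TRANSLATION_TARGETS:
        raise ValueError(f"Unsupported target language: {target_language}")
    return f"translate English to {target_language}: {text}"


prompt = build_translation_prompt("I love to read books", "French")
print(prompt)  # translate English to French: I love to read books
```

The resulting string can be passed to `tokenizer.encode` in place of a hand-written `sample_text`.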

In [11]:
input_ids = tokenizer.encode(sample_text, return_tensors='pt')

In [12]:
output = model.generate(input_ids, max_length=100)

In [13]:
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
generated_text

'Je lis des livres'