<a href="https://colab.research.google.com/github/victor-roris/mediumseries/blob/master/NLP/Transformers_Text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Transformers - Natural Language Generation

In this notebook we generate a paragraph of text that continues a given input prompt. Any of several language models from the `transformers` library can be used for this.

This code is a simplification of the transformers example script: [run_generation.py](https://github.com/huggingface/transformers/blob/master/examples/run_generation.py).

This code uses some features that are not available in the latest release of transformers at the time of writing (version X). For this reason, we install the newest transformers package directly from GitHub.


In [0]:
!pip install git+https://github.com/huggingface/transformers.git@master#egg=transformers

* Input 

In [0]:
text  = """
In a shocking finding, scientist discovered a herd of unicorns living in a remote, 
previously unexplored valley, in the Andes Mountains. Even more surprising to 
the researchers was the fact that the unicorns spoke perfect English.

"""

In [0]:
model_type = "xlnet" #"gpt2"
model_name_or_path = "xlnet-base-cased" #"gpt2"
length = 300

* Definition of the possible models

In [4]:
import numpy as np
import torch

from transformers import (
    CTRLLMHeadModel,
    CTRLTokenizer,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    OpenAIGPTLMHeadModel,
    OpenAIGPTTokenizer,
    TransfoXLLMHeadModel,
    TransfoXLTokenizer,
    XLMTokenizer,
    XLMWithLMHeadModel,
    XLNetLMHeadModel,
    XLNetTokenizer,
)

MODEL_CLASSES = {
    "gpt2": (GPT2LMHeadModel, GPT2Tokenizer),
    "ctrl": (CTRLLMHeadModel, CTRLTokenizer),
    "openai-gpt": (OpenAIGPTLMHeadModel, OpenAIGPTTokenizer),
    "xlnet": (XLNetLMHeadModel, XLNetTokenizer),
    "transfo-xl": (TransfoXLLMHeadModel, TransfoXLTokenizer),
    "xlm": (XLMWithLMHeadModel, XLMTokenizer),
}

* Model and tokenizer instantiation

In [5]:
model_class, tokenizer_class = MODEL_CLASSES[model_type]

tokenizer = tokenizer_class.from_pretrained(model_name_or_path)
model = model_class.from_pretrained(model_name_or_path)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # Move the model to the GPU if one is available

print(f'The execution will be in the device : {device}')

The execution will be in the device : cpu


* Length of the text

Cap the requested length at the maximum sequence length the model supports.

In [6]:
MAX_LENGTH = 10000  # Hardcoded upper bound, used when the model reports no limit

def adjust_length_to_model(length, max_sequence_length):
    if length < 0 and max_sequence_length > 0:
        length = max_sequence_length  # No length requested: use the model maximum
    elif 0 < max_sequence_length < length:
        length = max_sequence_length  # No generation bigger than model size
    elif length < 0:
        length = MAX_LENGTH  # avoid infinite loop
    return length

length = adjust_length_to_model(length, max_sequence_length=model.config.max_position_embeddings)
print(f"Model will generate a text of length : {length} ")

Model will generate a text of length : 300 
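
To make the branches of the length adjustment concrete, here is a small self-contained sketch that runs the same logic on a few (requested length, model limit) pairs. The value of `MAX_LENGTH` and the sample pairs are assumptions chosen for illustration only.

```python
MAX_LENGTH = 10000  # assumed fallback cap, used only when the model reports no limit


def adjust_length_to_model(length, max_sequence_length):
    if length < 0 and max_sequence_length > 0:
        length = max_sequence_length  # no request: fall back to the model limit
    elif 0 < max_sequence_length < length:
        length = max_sequence_length  # request too large: clamp to the model limit
    elif length < 0:
        length = MAX_LENGTH  # no request and no model limit: use the fallback cap
    return length


# (requested length, model limit) pairs, one per branch plus the pass-through case
cases = [(300, 512), (1000, 512), (-1, 512), (-1, -1)]
results = [adjust_length_to_model(requested, limit) for requested, limit in cases]
print(results)
```

A non-positive model limit is treated as "no limit reported", so a positive request passes through unchanged in that case.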


* Preprocess the input text

Different models need different input formatting and/or extra arguments

In [0]:
import logging

logger = logging.getLogger(__name__)


def prepare_ctrl_input(_, tokenizer, prompt_text):
    # CTRL expects the prompt to begin with one of its control codes
    encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False)
    if not any(encoded_prompt[0] == x for x in tokenizer.control_codes.values()):
        logger.info("WARNING! You are not starting your generation from a control code so you won't get good results")
    return prompt_text


def prepare_xlm_input(model, tokenizer, prompt_text):
    # kwargs = {"language": None, "mask_token_id": None}

    # Set the language
    use_lang_emb = hasattr(model.config, "use_lang_emb") and model.config.use_lang_emb
    if hasattr(model.config, "lang2id") and use_lang_emb:
        available_languages = model.config.lang2id.keys()
        language = None
        while language not in available_languages:
            language = input("Using XLM. Select language in " + str(list(available_languages)) + " >>> ")
        # kwargs["language"] = tokenizer.lang2id[language]

    # TODO fix mask_token_id setup when configurations will be synchronized between models and tokenizers
    # XLM masked-language modeling (MLM) models need masked token
    # is_xlm_mlm = "mlm" in args.model_name_or_path
    # if is_xlm_mlm:
    #     kwargs["mask_token_id"] = tokenizer.mask_token_id

    return prompt_text

PADDING_TEXT = """ In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""

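def prepare_xlnet_input(_, tokenizer, prompt_text):
    # XLNet tends to produce better text when a short prompt is preceded by padding text
    prompt_text = PADDING_TEXT + prompt_text
    return prompt_text


def prepare_transfoxl_input(_, tokenizer, prompt_text):
    # Return a plain string, like the other helpers, so the caller can tokenize it directly
    prompt_text = PADDING_TEXT + prompt_text
    return prompt_text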

In [0]:
PREPROCESSING_FUNCTIONS = {
    "ctrl": prepare_ctrl_input,
    "xlm": prepare_xlm_input,
    "xlnet": prepare_xlnet_input,
    "transfo-xl": prepare_transfoxl_input,
}

# Different models need different input formatting and/or extra arguments
prompt_text = text
requires_preprocessing = model_type in PREPROCESSING_FUNCTIONS.keys()
if requires_preprocessing:
    prepare_input = PREPROCESSING_FUNCTIONS.get(model_type)
    prompt_text = prepare_input(model, tokenizer, prompt_text)
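
The lookup above is a plain dictionary dispatch. As a minimal sketch of the same pattern, the version below falls back to a pass-through function for models that need no preprocessing. The names `PADDING`, `prepare_padded`, `passthrough`, `PREPROCESSORS`, and `preprocess` are hypothetical stand-ins, not part of transformers.

```python
PADDING = "Some padding text. <eos>"  # hypothetical stand-in for PADDING_TEXT


def prepare_padded(_model, _tokenizer, prompt_text):
    # Prepend the padding text, as the XLNet and Transformer-XL helpers do
    return PADDING + prompt_text


def passthrough(_model, _tokenizer, prompt_text):
    # Models without special requirements receive the prompt unchanged
    return prompt_text


PREPROCESSORS = {"xlnet": prepare_padded, "transfo-xl": prepare_padded}


def preprocess(model_type, prompt_text, model=None, tokenizer=None):
    # dict.get with a default replaces the explicit membership test
    prepare = PREPROCESSORS.get(model_type, passthrough)
    return prepare(model, tokenizer, prompt_text)


print(preprocess("gpt2", "Hello"))
print(preprocess("xlnet", "Hello"))
```

Every helper must share one signature and return a plain string for this to work, because the caller tokenizes the result directly.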

* Tokenize input text

In [9]:
encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=False, return_tensors="pt")
encoded_prompt = encoded_prompt.to(device)  # Move the input tensor to the same device as the model

print(f'The input tensor from the input text has the shape: {encoded_prompt.shape}')

The input tensor from the input text has the shape: torch.Size([1, 217])
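
The prompt length matters because `max_length` in `generate` counts the prompt tokens as well as the newly generated ones. The arithmetic can be sketched as follows, using the 217 prompt tokens reported above and the requested length of 300. The helper names `new_token_budget` and `max_length_for` are hypothetical.

```python
def new_token_budget(prompt_tokens, max_length):
    """Tokens left for generation when max_length includes the prompt."""
    return max(max_length - prompt_tokens, 0)


def max_length_for(prompt_tokens, new_tokens):
    """max_length needed to obtain new_tokens tokens after the prompt."""
    return prompt_tokens + new_tokens


budget = new_token_budget(prompt_tokens=217, max_length=300)
needed = max_length_for(prompt_tokens=217, new_tokens=300)
print(budget, needed)
```

With the padded XLNet prompt, a `max_length` of 300 therefore leaves room for only 83 new tokens. Asking for 300 new tokens would need a `max_length` of 517.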


* Run model

In [0]:
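# Note: in recent transformers releases, the sampling settings below only take
# effect when do_sample=True is also passed to generate().
output_sequences = model.generate(
    input_ids=encoded_prompt,
    max_length=length,       # total length in tokens, prompt included
    temperature=1,           # 1 leaves the model distribution unchanged
    top_k=0,                 # 0 disables top-k filtering
    top_p=0.9,               # nucleus sampling: keep the smallest token set with cumulative probability 0.9
    repetition_penalty=1.2,  # values above 1 discourage repeating earlier tokens
)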
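
To illustrate what `top_p=0.9` means, here is a minimal pure-Python sketch of nucleus (top-p) filtering on a toy four-token vocabulary. It is a simplified illustration under assumed toy probabilities, not the implementation used inside transformers. The function name `top_p_filter` is hypothetical.

```python
import math


def top_p_filter(logits, top_p):
    """Return (kept token indices, probabilities) for nucleus filtering."""
    # Softmax, subtracting the maximum logit for numerical stability
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit tokens from most to least probable
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:  # stop once the nucleus covers top_p of the mass
            break
    return kept, probs


# Toy distribution: token probabilities 0.5, 0.3, 0.15, 0.05
toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
kept, probs = top_p_filter(toy_logits, top_p=0.9)
print(kept)
```

The two most likely tokens cover only 0.8 of the probability mass, so the third is kept as well and the least likely token is dropped. Sampling then draws from the renormalized probabilities of the kept tokens.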

* Generate the text from the prediction




In [0]:
# Batch size is 1; to get more samples, call generate() with num_return_sequences > 1
generated_sequence = output_sequences[0].tolist()
text = tokenizer.decode(generated_sequence, clean_up_tokenization_spaces=True)

In [12]:
print(text)

In 1991, the remains of Russian Tsar Nicholas II and his family (except for Alexei and Maria) are discovered. The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the remainder of the story. 1883 Western Siberia, a young Grigori Rasputin is asked by his father and a group of men to perform magic. Rasputin has a vision and denounces one of the men as a horse thief. Although his father initially slaps him for making such an accusation, Rasputin watches as the man is chased outside and beaten. Twenty years later, Rasputin sees a vision of the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous, with people, even a bishop, begging for his blessing.<eod></s> <eos> In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. They were also able to communicate with other animals
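
The decoded text still begins with the padding text that was prepended for XLNet. One way to drop it is to cut everything up to the last padding marker. The sketch below assumes `<eos>` closes the padding text, as it does in the output above. The function name `strip_padding` and the shortened sample string are hypothetical.

```python
def strip_padding(generated_text, marker="<eos>"):
    """Drop everything up to and including the first occurrence of marker."""
    _, found, tail = generated_text.partition(marker)
    # If the marker is missing, return the text unchanged apart from whitespace
    return tail.strip() if found else generated_text.strip()


decoded = (
    "Rasputin quickly becomes famous.<eod></s> <eos> "
    "In a shocking finding, scientist discovered a herd of unicorns."
)
cleaned = strip_padding(decoded)
print(cleaned)
```

An alternative is to decode the prompt tokens on their own and slice that many characters off the front of the full decoded text, which does not depend on a marker string.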