This notebook uses the Hugging Face Transformers library to generate paired text sequences with GPT-2, as a starting point for building fine-tuning data. The generate_pair function continues an input prompt and an output prompt using top-p (nucleus) sampling at a chosen temperature, and the pair_generation function applies it to a list of (input_prompt, output_prompt) tuples. Both functions take the model and tokenizer as arguments; the example loads the pre-trained "gpt2" checkpoint. Sampling is controlled through the temperature, top_p, entry_length, and entry_count parameters. Fine-tuning and training steps are not included here and can be inserted where appropriate.

The model and tokenizer are loaded with the Transformers library. GPT2Tokenizer converts text into integer token IDs and back, and GPT2LMHeadModel is the GPT-2 network with a language-modeling head that outputs a score (logit) for every vocabulary token at each position. Both are created with the from_pretrained method, with model_name set to "gpt2", which downloads the pre-trained weights and vocabulary files on first use and caches them locally. Together they turn a text prompt into token IDs, predict next-token scores, and decode the sampled IDs back into text.

In [None]:
import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from tqdm import trange

# Load the GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

The generate_pair function takes a GPT-2 model, a tokenizer, an input prompt, an output prompt, and parameters that control sampling. It puts the model in evaluation mode, disables gradient tracking, and produces entry_count pairs. For each pair, the input prompt and the output prompt are extended separately, one token at a time, so the output sequence is not conditioned on the generated input sequence. At every step the logits for the last position are divided by temperature, the tokens are sorted by probability, and every token that follows the point where the cumulative probability first exceeds top_p is masked out; the next token is then sampled from what remains. Lower temperature and lower top_p make the text more predictable, while higher values make it more varied. Each sequence grows until entry_length new tokens have been added or a special end token is encountered. The two sequences are then decoded with the tokenizer and appended as a tuple to the list that the function returns.
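
The temperature and top-p step described above can be tried on a toy distribution without loading a model. The following is a minimal pure-Python sketch, not the notebook's implementation: the helper name top_p_filter and the four-token probabilities are assumptions made for illustration.

```python
import math
import random

def top_p_filter(logits, top_p=0.8, temperature=1.0):
    """Return {token_index: renormalized probability} for the tokens kept by top-p."""
    # Temperature scaling: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Walk tokens from most to least likely and keep each one until the
    # cumulative probability first exceeds top_p (that token is kept as well).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative > top_p:
            break
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

# Hypothetical four-token vocabulary with probabilities 0.5, 0.25, 0.15, 0.10.
toy_logits = [math.log(p) for p in (0.5, 0.25, 0.15, 0.10)]
nucleus = top_p_filter(toy_logits, top_p=0.8)
print(nucleus)  # tokens 0, 1 and 2 survive; token 3 is filtered out

# Sample the next token from the filtered, renormalized distribution.
next_token = random.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]
print(next_token)
```

With top_p=0.8 the cumulative probabilities are 0.5, 0.75, and 0.9, so the third token is the first to push the total past the threshold and is the last one kept.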

In [None]:
def generate_pair(model, tokenizer, input_prompt, output_prompt, entry_count=10, entry_length=30, top_p=0.8, temperature=1.):
    model.eval()
    generated_list = []

    filter_value = -float("Inf")

    with torch.no_grad():

        for _ in trange(entry_count):
            input_finished = False
            input_generated = torch.tensor(tokenizer.encode(input_prompt)).unsqueeze(0)

            for i in range(entry_length):
                # Only the logits are needed for sampling, so no labels are passed and no loss is computed.
                input_logits = model(input_generated).logits
                input_logits = input_logits[:, -1, :] / (temperature if temperature > 0 else 1.0)

                input_sorted_logits, input_sorted_indices = torch.sort(input_logits, descending=True)
                input_cumulative_probs = torch.cumsum(F.softmax(input_sorted_logits, dim=-1), dim=-1)

                input_sorted_indices_to_remove = input_cumulative_probs > top_p
                input_sorted_indices_to_remove[..., 1:] = input_sorted_indices_to_remove[..., :-1].clone()
                input_sorted_indices_to_remove[..., 0] = 0

                input_indices_to_remove = input_sorted_indices[input_sorted_indices_to_remove]
                input_logits[:, input_indices_to_remove] = filter_value

                input_next_token = torch.multinomial(F.softmax(input_logits, dim=-1), num_samples=1)
                input_generated = torch.cat((input_generated, input_next_token), dim=1)

                # Stop once the end-of-text token is sampled (encoding an empty string never matches).
                if input_next_token.item() == tokenizer.eos_token_id:
                    input_finished = True

                if input_finished:
                    break

            output_finished = False
            output_generated = torch.tensor(tokenizer.encode(output_prompt)).unsqueeze(0)

            for i in range(entry_length):
                # Only the logits are needed for sampling, so no labels are passed and no loss is computed.
                output_logits = model(output_generated).logits
                output_logits = output_logits[:, -1, :] / (temperature if temperature > 0 else 1.0)

                output_sorted_logits, output_sorted_indices = torch.sort(output_logits, descending=True)
                output_cumulative_probs = torch.cumsum(F.softmax(output_sorted_logits, dim=-1), dim=-1)

                output_sorted_indices_to_remove = output_cumulative_probs > top_p
                output_sorted_indices_to_remove[..., 1:] = output_sorted_indices_to_remove[..., :-1].clone()
                output_sorted_indices_to_remove[..., 0] = 0

                output_indices_to_remove = output_sorted_indices[output_sorted_indices_to_remove]
                output_logits[:, output_indices_to_remove] = filter_value

                output_next_token = torch.multinomial(F.softmax(output_logits, dim=-1), num_samples=1)
                output_generated = torch.cat((output_generated, output_next_token), dim=1)

                # Stop once the end-of-text token is sampled (encoding an empty string never matches).
                if output_next_token.item() == tokenizer.eos_token_id:
                    output_finished = True

                if output_finished:
                    break

            input_list = list(input_generated.squeeze().numpy())
            input_text = tokenizer.decode(input_list)
            output_list = list(output_generated.squeeze().numpy())
            output_text = tokenizer.decode(output_list)

            generated_list.append((input_text, output_text))

    return generated_list
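
For comparison, the Transformers library's built-in generate method can perform the same kind of temperature and top-p sampling as the manual loop above. The sketch below is an illustration under stated assumptions, not a replacement for the notebook's function: it builds a tiny randomly initialized GPT-2 (the configuration sizes and the prompt token IDs are arbitrary) so that it runs without downloading weights. In the notebook, the loaded model and tokenizer.encode(prompt) would take their place.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny randomly initialized GPT-2; all sizes here are arbitrary assumptions.
config = GPT2Config(
    vocab_size=100, n_positions=64, n_embd=32, n_layer=1, n_head=2,
    bos_token_id=0, eos_token_id=0,
)
tiny_model = GPT2LMHeadModel(config)
tiny_model.eval()

# Stand-in for torch.tensor(tokenizer.encode(prompt)).unsqueeze(0).
prompt_ids = torch.tensor([[5, 17, 42]])

with torch.no_grad():
    sampled = tiny_model.generate(
        prompt_ids,
        attention_mask=torch.ones_like(prompt_ids),
        do_sample=True,            # sample instead of greedy decoding
        top_p=0.8,                 # nucleus threshold, as in generate_pair
        top_k=0,                   # disable top-k so that only top-p filtering applies
        temperature=1.0,
        max_new_tokens=30,         # plays the role of entry_length
        num_return_sequences=2,    # plays the role of entry_count
        pad_token_id=config.eos_token_id,
    )

# One row per returned sequence; each row starts with the prompt IDs.
print(sampled.shape)
```

Because the weights are random, the sampled IDs are meaningless; the point is only the mapping between the parameters of generate_pair and the arguments of generate.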

The pair_generation function generates pairs for a list of prompts. It takes three parameters: the GPT-2 model (model), the tokenizer (tokenizer), and the test data (test_data), a list of (input_prompt, output_prompt) tuples. For each tuple it calls generate_pair with entry_count=1, so one generated pair is produced per tuple, and it collects the results in a single list that it returns. It is a thin wrapper that makes it convenient to obtain model-generated sequences for a whole set of prompts, for testing or training purposes.

In [None]:
# Function to generate multiple pairs. Test data should be a list of tuples (input_prompt, output_prompt).
def pair_generation(model, tokenizer, test_data):
    generated_pairs = []
    for input_prompt, output_prompt in test_data:
        pairs = generate_pair(model, tokenizer, input_prompt, output_prompt, entry_count=1)
        generated_pairs.extend(pairs)
    return generated_pairs

# Run the functions to generate the pairs
train_data = [
    ("HI", "HI"),
    ("BYE", "BYE"),
    ("YO", "AAAAY"),
    ("SMACK THAT", "I DIDNT ASK")
]

generated_pairs = pair_generation(model, tokenizer, train_data)
print(generated_pairs)

100%|██████████| 1/1 [00:18<00:00, 18.60s/it]
100%|██████████| 1/1 [00:11<00:00, 11.83s/it]
100%|██████████| 1/1 [00:12<00:00, 12.89s/it]
100%|██████████| 1/1 [00:13<00:00, 13.10s/it]

[('HI" better for "good," the government had implemented improved public education and education accountability by failing to adequately promote or promote childhood obesity, including reduced rates of', "HI source. There was no accident. The NSA's double tap on Snowden's personal telephone calls and emails can go on forever. And so the same story"), ('BYE CARLSON • JASON JAYSON • DONALD TRUMP • PICKUJA BALLO • AUSTIN CARLSON • DAVID', "BYE,'Sigourney'\n\nA few decades ago, Gaffney argued that Biblical values and rituals were deeply rooted in contemporary American culture. Though"), ('YO LZLG-00071-1","PulseAudio ZGL1.5","PulseAudio ZGL1.5","PulseAudio', 'AAAAYx2gGQyZXLUcKc6o9yIhZXLUhN0ZXLUhU4lu5l'), ("SMACK THAT WAS #1.\n\nAND THIS HITS IT RIGHT THERE.\n\n\nI used a book called #1 on this website and it's a", 'I DIDNT ASK ME HOW TO KILL MY DICK, BUT I NEVER INCREDIBLY DID NOT BECOME FUCKED BY HIS VAMPIRES')]
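
Each generated string in the output above begins with its prompt, because the prompt tokens are decoded together with the sampled ones. If only the model's continuation is wanted, the prompt can be stripped afterwards. This is a small sketch; the helper name split_continuation is hypothetical, and the sample strings are shortened excerpts of the output above.

```python
def split_continuation(prompt, generated_text):
    """Return the text that was added after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Decoding may not reproduce the prompt exactly, so fall back to the full text.
    return generated_text

# Prompts from train_data, paired with shortened excerpts of the generated text.
prompts = [("HI", "HI"), ("YO", "AAAAY")]
pairs = [
    ('HI" better for "good,"', "HI source. There was no accident."),
    ('YO LZLG-00071-1"', "AAAAYx2gGQyZXLUcKc6o9"),
]

continuations = [
    (split_continuation(in_prompt, in_text), split_continuation(out_prompt, out_text))
    for (in_prompt, out_prompt), (in_text, out_text) in zip(prompts, pairs)
]
for in_cont, out_cont in continuations:
    print(repr(in_cont), "|", repr(out_cont))
```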



