# HUGGINGFACE TUTORIAL:
https://huggingface.co/docs/transformers/tasks/language_modeling

# PREREQUISITES

In [1]:
!pip install transformers datasets evaluate



In [2]:
!pip install -U "ipywidgets>=8"

# DATASET & MODEL
Start by loading the r/askscience subset of the ELI5 dataset from the 🤗 Datasets library. Loading a smaller slice first (for example, 5,000 rows) gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.

In [3]:
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

In [4]:
DATASET_NAME = "eli5"
MODEL_NAME = "gpt2-medium"
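DATASET_SEGMENT_SIZE = 5000  # small slice for quick experiments
DATASET_SEGMENT_SIZE = -1    # overrides the line above: [:-1] loads every row except the last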
dataset = load_dataset(DATASET_NAME, split=f"train_asks[:{DATASET_SEGMENT_SIZE}]")

Found cached dataset eli5 (/home/dkarpeyev/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa)


In [5]:
dataset

Dataset({
    features: ['q_id', 'title', 'selftext', 'document', 'subreddit', 'answers', 'title_urls', 'selftext_urls', 'answers_urls'],
    num_rows: 131777
})

Split the dataset’s train_asks split into a train and test set with the train_test_split method:

In [6]:
dataset_train_test = dataset.train_test_split(test_size=0.2)
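The split above is random, so the rows that land in the test set change from run to run unless a seed is fixed. `train_test_split` accepts a `seed` argument for this. The sketch below is only a plain-Python illustration of what a seeded 80/20 split does; `split_indices` is a hypothetical helper, and its rounding rule is an assumption rather than the library's exact behavior.

```python
import random


def split_indices(num_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and split them into train/test."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    # Assumption: simple rounding; the library may round differently.
    num_test = int(round(num_rows * test_size))
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(10, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

Calling the helper twice with the same seed returns the same partition, which is the property you want when comparing training runs.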

Then take a look at an example (we only need the 'text' field):

In [7]:
dataset_train_test['train'][0]

{'q_id': 'o5w78',
 'title': 'Is the rate at which scientific discoveries are being made decreasing?',
 'selftext': "Seems like it's becoming harder for scientists to actually pin down discoveries (Higgs boson, dark matter and the like)",
 'document': '',
 'subreddit': 'askscience',
 'answers': {'a_id': ['c3en07a', 'c3emkb5'],
  'text': ["Many people, me among them, think that the rate of progress is in fact increasing, known as the Law of Accelerating Returns.  You can find a good explanation of it [here](_URL_2_).  If you look at cost effectiveness of [CPUs](_URL_1_), [hard drives](_URL_0_), [genomic reading](_URL_3_), and many, many, other important technologies (notice these graphs are basically straight lines on a log graph, which indicates an exponential trend) what you find is an exponential increase in performance per dollar over time (Moores Law is one example of this, for the number of transistors you can fit in an area for a price).\n\nAlthough not really scientific, someone 

The next step is to load the tokenizer for the model named in MODEL_NAME (gpt2-medium here) to process the text subfield:

In [8]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [9]:
tokenizer

GPT2TokenizerFast(name_or_path='gpt2-medium', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'})

You’ll notice from the example above, the text field is actually nested inside answers. This means you’ll need to extract the text subfield from its nested structure with the flatten method:

In [10]:
dataset_train_test_flattened = dataset_train_test.flatten()

In [11]:
dataset_train_test_flattened['train'][0]

{'q_id': 'o5w78',
 'title': 'Is the rate at which scientific discoveries are being made decreasing?',
 'selftext': "Seems like it's becoming harder for scientists to actually pin down discoveries (Higgs boson, dark matter and the like)",
 'document': '',
 'subreddit': 'askscience',
 'answers.a_id': ['c3en07a', 'c3emkb5'],
 'answers.text': ["Many people, me among them, think that the rate of progress is in fact increasing, known as the Law of Accelerating Returns.  You can find a good explanation of it [here](_URL_2_).  If you look at cost effectiveness of [CPUs](_URL_1_), [hard drives](_URL_0_), [genomic reading](_URL_3_), and many, many, other important technologies (notice these graphs are basically straight lines on a log graph, which indicates an exponential trend) what you find is an exponential increase in performance per dollar over time (Moores Law is one example of this, for the number of transistors you can fit in an area for a price).\n\nAlthough not really scientific, someo

Each subfield is now a separate column as indicated by the answers prefix, and the text field is a list now. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.

Here is how you can create a preprocessing function to convert the list to a string, and truncate sequences to be no longer than the model’s maximum input length (1024 tokens for gpt2-medium, per model_max_length above):

In [12]:
def preprocessor(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)
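To make the flatten-then-join step concrete without loading the dataset, here is a minimal plain-Python sketch. `flatten_record` is a hypothetical stand-in for what `flatten` does to one nested row, and the two answer strings are made-up placeholders.

```python
def flatten_record(record, parent_key="", sep="."):
    """Turn nested dict keys into dotted column names, e.g. answers.text."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_record(value, name, sep))
        else:
            flat[name] = value
    return flat


# Toy row shaped like the example above (answer texts are placeholders).
record = {
    "q_id": "o5w78",
    "answers": {
        "a_id": ["c3en07a", "c3emkb5"],
        "text": ["First answer.", "Second answer."],
    },
}

flat = flatten_record(record)
joined = " ".join(flat["answers.text"])  # same join the preprocessor applies
print(sorted(flat))
print(joined)
```

The joined string is what the tokenizer receives for each row, so all answers to one question are tokenized as a single sequence.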

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once, and increasing the number of processes with num_proc. Remove any columns you don’t need:

In [13]:
dataset_tokenized = dataset_train_test_flattened.map(
    preprocessor,
    batched=True,
    num_proc=4,
    remove_columns=dataset_train_test_flattened["train"].column_names,
)

        


Now you’ll need a second preprocessing function to capture text truncated from the lengthier examples to avoid losing any information. This preprocessing function should:

Concatenate all the text.
Split the concatenated text into smaller chunks defined by block_size.

In [14]:
block_size = 128
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
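A toy run helps show what the chunking does. The sketch below repeats the same logic with `block_size` passed as a parameter (`group_texts_toy` is a hypothetical name used only here) and a tiny hand-made batch, so the dropped remainder is easy to see.

```python
def group_texts_toy(examples, block_size):
    """Same logic as group_texts, with block_size as an explicit argument."""
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1, 1]],
}
out = group_texts_toy(toy_batch, block_size=4)
print(out["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Ten tokens with a block size of 4 give two full blocks, and tokens 9 and 10 are discarded. With batched=True the same thing happens once per batch passed to the function, so a small number of tokens is lost at the end of every batch.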

Apply the group_texts function over the entire dataset:

In [15]:
dataset_tokenized_batched = dataset_tokenized.map(group_texts, batched=True, num_proc=4)

        


In [16]:
dataset_tokenized_batched

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 267562
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 66573
    })
})

Now create a batch of examples using DataCollatorForLanguageModeling. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

For causal language modeling, use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:

In [17]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
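As far as I understand the collator with mlm=False, it copies input_ids into labels and replaces padding positions with -100 so the loss ignores them; the one-position shift is then applied inside the model when the loss is computed. The sketch below mimics that with plain lists. Both helper names are hypothetical, and this is an illustration of the idea rather than the library's code.

```python
IGNORE_INDEX = -100  # label value the loss skips


def make_causal_lm_labels(input_ids, pad_token_id):
    """Copy the inputs and mask padding positions."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


def next_token_pairs(input_ids, labels):
    """Pair each position with the label one step ahead, skipping masked ones."""
    return [
        (context, target)
        for context, target in zip(input_ids[:-1], labels[1:])
        if target != IGNORE_INDEX
    ]


pad_id = 50256  # GPT-2's end-of-text id, reused as the pad token above
input_ids = [10, 11, 12, pad_id, pad_id]
labels = make_causal_lm_labels(input_ids, pad_id)
print(labels)                               # [10, 11, 12, -100, -100]
print(next_token_pairs(input_ids, labels))  # [(10, 11), (11, 12)]
```

One consequence of reusing the end-of-sequence token as the pad token is that genuine end-of-text tokens are masked in the labels too, because the mask is applied by token id.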

In [18]:
#help(data_collator)

# Causal language modeling


Causal language models are frequently used for text generation. This section shows you how to finetune GPT-2 (the gpt2-medium checkpoint set in MODEL_NAME) to generate new text.



If you aren’t familiar with finetuning a model with the Trainer, take a look at the basic tutorial here!

You're ready to start training your model now! Load the gpt2-medium checkpoint with [AutoModelForCausalLM](https://huggingface.co/docs/transformers/v4.26.0/en/model_doc/auto#transformers.AutoModelForCausalLM):


In [19]:
from transformers import TrainingArguments, Trainer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


### Training
Note: training fails to run on an explicitly specified GPU and defaults to "cuda:0", apparently due to a bug in `transformers`. The CUDA_VISIBLE_DEVICES setting further down works around this.
At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. Setting push_to_hub=True would push the model to the Hub (you need to be signed in to Hugging Face to upload your model); this notebook keeps push_to_hub=False and saves checkpoints locally.
Pass the training arguments to Trainer along with the model, datasets, and data collator.
Call train() to finetune your model.

### Disable Weights & Biases logging (the external service may not be reachable from the local network)

In [20]:
import os
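# Deprecated according to the Trainer warning; report_to="none" in TrainingArguments is the suggested replacement
os.environ["WANDB_DISABLED"] = "TRUE"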

In [21]:
GPU = 3

In [22]:
# Set up the CUDA environment before CUDA is initialized (safest: before importing torch)
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = f"{GPU}"  # This shrinks the GPU universe and maps cuda:0 to {GPU}

In [23]:
import torch
torch.cuda.device_count()

1

In [24]:
torch.cuda.current_device() # This really is device {GPU}

0
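The two outputs above (one visible device, current device 0) follow from how CUDA_VISIBLE_DEVICES renumbers devices: the process only sees the listed GPUs, indexed from zero in the listed order. The sketch below spells out that mapping in plain Python; `logical_to_physical` is a hypothetical helper that only parses the string and does not talk to CUDA.

```python
def logical_to_physical(cuda_visible_devices):
    """Map the in-process device names to the physical GPU ids listed."""
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    return {f"cuda:{index}": physical for index, physical in enumerate(ids)}


print(logical_to_physical("3"))    # {'cuda:0': '3'}
print(logical_to_physical("2,0"))  # {'cuda:0': '2', 'cuda:1': '0'}
```

With GPU = 3 as set above, "cuda:0" inside this notebook refers to physical GPU 3, which is why the Trainer's default of "cuda:0" ends up on the intended card.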

In [25]:
import transformers

In [26]:
CHECKPOINT_DIR=None
if CHECKPOINT_DIR is not None:
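    # from_pretrained is a classmethod, so call it on the class rather than the instance
    model = AutoModelForCausalLM.from_pretrained(CHECKPOINT_DIR)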

In [27]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, I'm a language. I'm a compiler, I'm a parser, I'm a server process. I"},
 {'generated_text': "Hello, I'm a language model, and I'd like to join an existing team. What can I do to get started?\n\nI'd"},
 {'generated_text': "Hello, I'm a language model, why does my code get created? Can't I just copy it? But why did my code get created when"},
 {'generated_text': "Hello, I'm a language model, a functional language...\n\nI'm a functional language. Is it hard? A little, yes. But"},
 {'generated_text': "Hello, I'm a language model, not an object model.\n\nIn a nutshell, I need to give me objects from which I can get"}]

In [28]:
PRIME_LEN = 10
input_ids_0 = torch.tensor(dataset_tokenized_batched["test"]["input_ids"][0])
inputs_0 = tokenizer.decode(input_ids_0)
label_ids_0 = torch.tensor(dataset_tokenized_batched["test"]["labels"][0])
labels_0 = tokenizer.decode(label_ids_0)

print(f"inputs_0: {inputs_0}")
print(f"inputs_0[:{PRIME_LEN}]: {inputs_0[:PRIME_LEN]}")

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator(inputs_0[:PRIME_LEN], max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


inputs_0: I highly doubt there to be scientific experimentation about this.  
The answer is relative and would depend on how much the person sweats during the day, and at what times. This isn't a scientific question. If your bed is dirty and you aren't changing the sheets, then morning shower is better. But I can't go to bed with the day's disgustingness all over me so I usually shower before bed. If I sat around home all day doing nothing, I might skip the night shower and do it in the morning.Temperatures for stars are easily determined - different elements produce very specific light signatures, and the signature
inputs_0[:10]: I highly d


[{'generated_text': 'I highly dificulty: my husband has an allergy to peanuts and we think they help relieve the discomfort of the allergic joint. It is also the'},
 {'generated_text': 'I highly dicuss, and will send an email when a possible new topic is chosen.'},
 {'generated_text': 'I highly dived down to listen for that, and for how this could take a few more years with more work on that front, all the way'},
 {'generated_text': 'I highly dicuss the idea of sending you an item in the mail with the contents in plain text so you can look it over before sending it'},
 {'generated_text': 'I highly dnactified my father\'s letter! It seemed so true." It was really not a letter, it was the vision of a person'}]
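Note that PRIME_LEN slices characters, not tokens, so the prompt "I highly d" ends mid-word and the model has to invent a word starting with "d". If that is not what you want, one option is to back off to the last whole word. The sketch below shows one way to do it; `prime_at_word_boundary` is a hypothetical helper, not part of the original notebook.

```python
def prime_at_word_boundary(text, max_chars):
    """Cut text to at most max_chars characters without splitting a word."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    if text[max_chars].isspace():
        # The cut already falls between two words.
        return cut.rstrip()
    head, sep, _ = cut.rpartition(" ")
    # If there is no space at all, keep the raw cut rather than return nothing.
    return head.rstrip() if sep else cut


prompt = prime_at_word_boundary("I highly doubt there to be scientific", 10)
print(repr(prompt))  # 'I highly'
```

Passing the result to the generator instead of `inputs_0[:PRIME_LEN]` keeps the prompt on a word boundary.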

In [29]:
PRIME_LEN = 30
inputs_0 = "I have to tell ya honestly"

print(f"inputs_0: {inputs_0}")
print(f"inputs_0[:{PRIME_LEN}]: {inputs_0[:PRIME_LEN]}")

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator(inputs_0[:PRIME_LEN], max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


inputs_0: I have to tell ya honestly
inputs_0[:30]: I have to tell ya honestly


[{'generated_text': 'I have to tell ya honestly, I\'ve had it with everyone who\'s running and talking about the Olympics."\n\nWagner\'s father is'},
 {'generated_text': 'I have to tell ya honestly, I\'d rather not play football. We\'ll watch the game again. This time it could have been worse."\n'},
 {'generated_text': 'I have to tell ya honestly," said the detective, "but there\'s an interesting story behind that one. I\'m not sure I even got a'},
 {'generated_text': 'I have to tell ya honestly I\'m starting to like these things."'},
 {'generated_text': "I have to tell ya honestly, he's kind of fun! He's funny. I see him coming out of a booth all the time, asking"}]

In [30]:
import datetime
date = datetime.datetime.now().strftime('%Y-%m-%d')
time = datetime.datetime.now().strftime('%H.%M')
training_args = TrainingArguments(
    output_dir=f"{MODEL_NAME}-{DATASET_NAME}/date={date}/time={time}",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    #place_model_on_device=torch.device(f"cuda:{GPU}"),
    push_to_hub=False,
    num_train_epochs=3.0,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_tokenized_batched["train"],
    eval_dataset=dataset_tokenized_batched["test"],
    data_collator=data_collator,
    
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [31]:
trainer.train()

***** Running training *****
  Num examples = 267562
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 100338
  Number of trainable parameters = 354823168
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss
1,3.1438,3.079568
2,3.0535,3.053271
3,3.0028,3.046569


Saving model checkpoint to gpt2-medium-eli5/date=2023-02-10/time=06.25/checkpoint-500
Configuration saved in gpt2-medium-eli5/date=2023-02-10/time=06.25/checkpoint-500/config.json
Configuration saved in gpt2-medium-eli5/date=2023-02-10/time=06.25/checkpoint-500/generation_config.json
Model weights saved in gpt2-medium-eli5/date=2023-02-10/time=06.25/checkpoint-500/pytorch_model.bin
Saving model checkpoint to gpt2-medium-eli5/date=2023-02-10/time=06.25/checkpoint-1000
Configuration saved in gpt2-medium-eli5/date=2023-02-10/time=06.25/checkpoint-1000/config.json
Configuration saved in gpt2-medium-eli5/date=2023-02-10/time=06.25/checkpoint-1000/generation_config.json
Model weights saved in gpt2-medium-eli5/date=2023-02-10/time=06.25/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to gpt2-medium-eli5/date=2023-02-10/time=06.25/checkpoint-1500
Configuration saved in gpt2-medium-eli5/date=2023-02-10/time=06.25/checkpoint-1500/config.json
Configuration saved in gpt2-medium-eli5/date

TrainOutput(global_step=100338, training_loss=3.0815638375445293, metrics={'train_runtime': 11711.3688, 'train_samples_per_second': 68.539, 'train_steps_per_second': 8.568, 'total_flos': 1.8636376142197555e+17, 'train_loss': 3.0815638375445293, 'epoch': 3.0})
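The numbers in the training log can be sanity-checked with a little arithmetic. The sketch below recomputes the total step count and throughput from the logged values, and converts the final validation loss into perplexity (the exponential of the cross-entropy loss), which is a common way to report language-model quality. All inputs are copied from the output above; the variable names are mine.

```python
import math

num_examples = 267562      # "Num examples" from the training log
batch_size = 8             # per-device batch size on a single GPU
num_epochs = 3

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)         # 100338, matching "Total optimization steps"

train_runtime_s = 11711.3688
steps_per_second = total_steps / train_runtime_s
print(round(steps_per_second, 2))

final_eval_loss = 3.046569  # validation loss after epoch 3
perplexity = math.exp(final_eval_loss)
print(f"Perplexity: {perplexity:.2f}")
```

The perplexity comes out at roughly 21. The validation loss only moves from 3.0796 to 3.0466 over the three epochs, so most of the gain is already there after the first epoch.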

In [33]:

PRIME_LEN = 10
input_ids_0 = torch.tensor(dataset_tokenized_batched["test"]["input_ids"][0])
inputs_0 = tokenizer.decode(input_ids_0)
label_ids_0 = torch.tensor(dataset_tokenized_batched["test"]["labels"][0])
labels_0 = tokenizer.decode(label_ids_0)

print(f"inputs_0: {inputs_0}")
print(f"inputs_0[:{PRIME_LEN}]: {inputs_0[:PRIME_LEN]}")

model.to("cpu")
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator(inputs_0[:PRIME_LEN], max_length=30, num_return_sequences=5)

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.0"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


inputs_0: I highly doubt there to be scientific experimentation about this.  
The answer is relative and would depend on how much the person sweats during the day, and at what times. This isn't a scientific question. If your bed is dirty and you aren't changing the sheets, then morning shower is better. But I can't go to bed with the day's disgustingness all over me so I usually shower before bed. If I sat around home all day doing nothing, I might skip the night shower and do it in the morning.Temperatures for stars are easily determined - different elements produce very specific light signatures, and the signature
inputs_0[:10]: I highly d


[{'generated_text': 'I highly dificulty in my research. In my limited time working on this thing, I have not come anywhere close to doing what you are asking'},
 {'generated_text': "I highly dicuss, and will continue doing so, a very broad topic regarding science. This is a fun, interesting question. I'll leave"},
 {'generated_text': "I highly dificult to add that I think there's an interesting idea in there somewhere. I'm not sure on how it is resolved, but"},
 {'generated_text': 'I highly dicuss the idea of artificial gravity and what that would effect the universe. If no significant effects from gravity is being caused i dont see'},
 {'generated_text': 'I highly dabbled in my studies of the subject as well. I think the general consensus is that the short answer is yes. One of the'}]

In [34]:
PRIME_LEN = 30
inputs_0 = "I have to tell ya honestly"

print(f"inputs_0: {inputs_0}")
print(f"inputs_0[:PRIME_LEN]: {inputs_0[:PRIME_LEN]}")

model.to("cpu")
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator(inputs_0[:PRIME_LEN], max_length=30, num_return_sequences=5)

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.0"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


inputs_0: I have to tell ya honestly
inputs_0[:PRIME_LEN]: I have to tell ya honestly


[{'generated_text': "I have to tell ya honestly I've never once thought to take a long time after I'm done using an electric razor.  Especially when it's"},
 {'generated_text': "I have to tell ya honestly, I'd be happy to tell you how a human can produce a photon.  So I have to do a bit"},
 {'generated_text': 'I have to tell ya honestly that this is a really cool question! As an undergrad, I studied chemical dynamics/material science and chemistry/physics'},
 {'generated_text': "I have to tell ya honestly I'm not really a scientist in these matters, so I'm probably not a great judge of their accuracy, but this"},
 {'generated_text': 'I have to tell ya honestly, I like my dogs because they provide a constant amount of comfort or attention.  It really is a good bonding experience'}]

In [35]:
PRIME_LEN = 30
inputs_0 = "I have to tell ya hone"

print(f"inputs_0: {inputs_0}")
print(f"inputs_0[:PRIME_LEN]: {inputs_0[:PRIME_LEN]}")

model.to("cpu")
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator(inputs_0[:PRIME_LEN], max_length=30, num_return_sequences=5)

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.0"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


inputs_0: I have to tell ya hone
inputs_0[:PRIME_LEN]: I have to tell ya hone


[{'generated_text': "I have to tell ya honeysuckle gets a bad rap for not having a lot of flavor; I've heard it's not really nearly as sweet"},
 {'generated_text': 'I have to tell ya honeysuckle is one of those things that people say makes you ugly - I have read it is associated with cancer (just'},
 {'generated_text': "I have to tell ya honeysuckle is a really cool plant! Here's one to take a look at.  \n_URL_0"},
 {'generated_text': "I have to tell ya honeysuckle is really hard to keep alive. It's incredibly toxic and hard to chew, so you'd have to keep"},
 {'generated_text': 'I have to tell ya honeysuckle is my absolute favorite! It smells like burning wood, or burning your eyes from the inside out, or burning'}]

In [36]:
PRIME_LEN = 3000
inputs_0 = "Please explain AI to me."

print(f"inputs_0: {inputs_0}")
print(f"inputs_0[:{PRIME_LEN}]: {inputs_0[:PRIME_LEN]}")

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator(inputs_0[:PRIME_LEN], max_length=30, num_return_sequences=5)

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.0"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


inputs_0: Please explain AI to me.
inputs_0[:3000]: Please explain AI to me.


[{'generated_text': "Please explain AI to me. I've never once thought to think about it, and I think the universe seems to be thinking the same way. It"},
 {'generated_text': "Please explain AI to me. What exactly is it/would it do? How would it be able to handle complex systems efficiently?I'll assume there"},
 {'generated_text': 'Please explain AI to me. Can you give examples for how its made? What can you tell me about the problem? \n\nI am curious'},
 {'generated_text': "Please explain AI to me. I'm not really equipped to answer these questions, and I'm probably not a great person to be answering questions about this"},
 {'generated_text': "Please explain AI to me.  If you can't, I'll assume you are talking in terms of computer programs.  I've never seen an"}]

In [37]:
PRIME_LEN = 3000
inputs_0 = "Please explain AI to me:"

print(f"inputs_0: {inputs_0}")
print(f"inputs_0[:{PRIME_LEN}]: {inputs_0[:PRIME_LEN]}")

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator(inputs_0[:PRIME_LEN], max_length=30, num_return_sequences=5)

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.0"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


inputs_0: Please explain AI to me:
inputs_0[:3000]: Please explain AI to me:


[{'generated_text': "Please explain AI to me:) I would imagine it would take some time, and we aren't sure yet if it's faster than humans. It"},
 {'generated_text': 'Please explain AI to me: What is it and how does it work? How is it related to science? What does it have to do with science'},
 {'generated_text': 'Please explain AI to me: why does the sky get blue, and why is this a good idea? > I thought that the world is composed of'},
 {'generated_text': 'Please explain AI to me: a computer can do a lot of things that humans have never been able to in a conscious effort.  AI, as'},
 {'generated_text': 'Please explain AI to me: is there something like "human" intelligence as we understand it in our definition? Like a human is a machine, or'}]

In [47]:
%%time
PRIME_LEN = 3000
inputs_0 = "Please explain AI to me."

print(f"inputs_0: {inputs_0}")
print(f"inputs_0[:{PRIME_LEN}]: {inputs_0[:PRIME_LEN]}")

model.to(torch.device("cpu"))
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator(inputs_0[:PRIME_LEN], max_length=300, num_return_sequences=1)

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.0"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


inputs_0: Please explain AI to me.
inputs_0[:3000]: Please explain AI to me.
CPU times: user 3h 3s, sys: 2min 12s, total: 3h 2min 16s
Wall time: 9min 17s


[{'generated_text': 'Please explain AI to me. I\'m currently a 3 year old girl.  Is it possible that one day she would be able to understand me?\n\nI mean come on, what\'s the big deal. Can\'t she see how silly I am, then try to teach me how to walk? How can a computer (if I take the word literally) understand someone or help me play? All I know is that we are constantly doing something we\'re not supposed to do?       \n\nI can\'t think of a single reason why an AI couldn\'t be possible. I mean, look at her and her tiny little head twitch and blink and her hands move?  I\'d love to hear her thoughts, but isn\'t there some sort of \'program\' that\'s actually running inside her brain to help her?  \n\nSo basically, what do you think of when you think of an AI?  How would it know how to get in the way of the \'programs\' the user is creating?  Imagine a creature that could not think but still could control her physical body and walk around?        \n\nAnd also could not reproduce--no pr

In [38]:
PRIME_LEN = 3000
inputs_0 = "Russia has invaded"

print(f"inputs_0: {inputs_0}")
print(f"inputs_0[:{PRIME_LEN}]: {inputs_0[:PRIME_LEN]}")

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(41)
generator(inputs_0[:PRIME_LEN], max_length=30, num_return_sequences=5)

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.0"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


inputs_0: Russia has invaded
inputs_0[:3000]: Russia has invaded


[{'generated_text': "Russia has invaded Russia and Russia has been invaded. The question is, what are the legal consequences of that? Russia doesn't have a law which deals"},
 {'generated_text': "Russia has invaded Syria in the 1970's. They have a nuclear arsenal as well.\n\nSo, yes - you might want to keep your nuclear"},
 {'generated_text': "Russia has invaded the US?**\n\n**From my quick research**\n\nIt's hard to say, but given the very low cost to"},
 {'generated_text': 'Russia has invaded and occupied the part of Georgia it has controlled for several years. This is a legitimate invasion by a major power, not an international crime'},
 {'generated_text': 'Russia has invaded, it\'s the US that has invaded". This is actually quite popular with the public because the phrase "invasion" is used to'}]