# HUGGINGFACE TUTORIAL:
https://huggingface.co/docs/transformers/tasks/language_modeling

# PREREQUISITES

In [1]:
!pip install transformers datasets evaluate



In [2]:
!pip install -U "ipywidgets>=8"

# DATASET & MODEL
Start by loading a smaller subset of the r/askscience subset of the ELI5 dataset from the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.

In [3]:
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

In [4]:
DATASET_NAME = "eli5"
MODEL_NAME = "gpt2"
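DATASET_SEGMENT_SIZE = 5000  # small slice for quick experiments
DATASET_SEGMENT_SIZE = -1    # overrides the line above: "[:-1]" keeps everything except the last example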
dataset = load_dataset(DATASET_NAME, split=f"train_asks[:{DATASET_SEGMENT_SIZE}]")

Found cached dataset eli5 (/home/dkarpeyev/.cache/huggingface/datasets/eli5/LFQA_reddit/1.0.0/17574e5502a10f41bbd17beba83e22475b499fa62caa1384a3d093fc856fe6fa)


In [5]:
dataset

Dataset({
    features: ['q_id', 'title', 'selftext', 'document', 'subreddit', 'answers', 'title_urls', 'selftext_urls', 'answers_urls'],
    num_rows: 131777
})

Split the dataset’s train_asks split into a train and test set with the train_test_split method:

In [6]:
dataset_train_test = dataset.train_test_split(test_size=0.2)

Then take a look at an example (we only need the 'text' field):

In [7]:
dataset_train_test['train'][0]

{'q_id': 'koozm',
 'title': 'Why do NSAIDS antagonize THC effects?',
 'selftext': 'By personal experience, anecdotical evidence and [this article](_URL_0_) I know that Non steroidal anti inflammatory drugs like aspirin and iboprufen antagonize THC effects. In other words they sober you up very fast. Can anyone here explain me why they do it? \n\nThanks, guys.',
 'document': '',
 'subreddit': 'askscience',
 'answers': {'a_id': ['c2m9z55'],
  'text': ["I'm going to try and make this is as coherent as possible...but it's tricky.  THC does its thing by binding to a cannabinoid receptor (CB1), which is where most of the high comes from. \n\nThe reason why you have cannabinoid receptors in the first place is that you have an endogenous cannabinoid system, with anandamide being the main poster child of that.  Anandamide is derived from phospholipds (arachadonic acid).\n\nNSAIDs block the enzyme cyclooxygenase (COX), which normally makes prostaglandins that have roles in inflammation/fever/etc

The next step is to load the GPT-2 tokenizer (MODEL_NAME is "gpt2") to process the text subfield:

In [8]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [9]:
tokenizer

GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'})

You’ll notice from the example above, the text field is actually nested inside answers. This means you’ll need to extract the text subfield from its nested structure with the flatten method:

In [10]:
dataset_train_test_flattened = dataset_train_test.flatten()

In [11]:
dataset_train_test_flattened['train'][0]

{'q_id': 'koozm',
 'title': 'Why do NSAIDS antagonize THC effects?',
 'selftext': 'By personal experience, anecdotical evidence and [this article](_URL_0_) I know that Non steroidal anti inflammatory drugs like aspirin and iboprufen antagonize THC effects. In other words they sober you up very fast. Can anyone here explain me why they do it? \n\nThanks, guys.',
 'document': '',
 'subreddit': 'askscience',
 'answers.a_id': ['c2m9z55'],
 'answers.text': ["I'm going to try and make this is as coherent as possible...but it's tricky.  THC does its thing by binding to a cannabinoid receptor (CB1), which is where most of the high comes from. \n\nThe reason why you have cannabinoid receptors in the first place is that you have an endogenous cannabinoid system, with anandamide being the main poster child of that.  Anandamide is derived from phospholipds (arachadonic acid).\n\nNSAIDs block the enzyme cyclooxygenase (COX), which normally makes prostaglandins that have roles in inflammation/fever/

Each subfield is now a separate column as indicated by the answers prefix, and the text field is a list now. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.

Here is how you can create a preprocessing function to convert the list to a string, and truncate sequences to be no longer than GPT-2’s maximum input length (model_max_length=1024 in the tokenizer output above):

In [12]:
def preprocessor(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]], truncation=True)
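
To make the joining step concrete, here is a minimal sketch on a hand-made batch. The answer strings are invented for illustration and no tokenizer is involved; it only shows the shape of the data that `preprocessor` hands to the tokenizer.

```python
# Toy batch in the shape that map(..., batched=True) passes to the function:
# one list of answer strings per example. The strings are made up.
examples = {
    "answers.text": [
        ["First answer.", "Second answer."],
        ["Only answer."],
    ]
}

# Same join as in `preprocessor`: each example's answers become one string.
joined = [" ".join(answers) for answers in examples["answers.text"]]
print(joined)
```

Each example contributes exactly one string, so the tokenizer receives a flat list of strings rather than a list of lists.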

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once, and increasing the number of processes with num_proc. Remove any columns you don’t need:

In [13]:
dataset_tokenized = dataset_train_test_flattened.map(
    preprocessor,
    batched=True,
    num_proc=4,
    remove_columns=dataset_train_test_flattened["train"].column_names,
)

        

Now you’ll need a second preprocessing function to capture text truncated from the lengthier examples to avoid losing any information. This preprocessing function should:

- Concatenate all the text.
- Split the concatenated text into smaller chunks defined by block_size.

In [14]:
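block_size = 128  # tokens per training block

def group_texts(examples):
    # Concatenate the per-example lists of every column into one long list.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the trailing remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    # Split each column into consecutive chunks of block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result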
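
As a rough illustration of what the chunking does, the sketch below runs the same logic on a tiny hand-made batch. The token ids and the block size of 4 are assumptions chosen only to keep the output readable; the notebook itself uses block_size = 128.

```python
block_size = 4  # tiny value for illustration only; the notebook uses 128

def group_texts(examples):
    # Concatenate the per-example lists of every column into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the trailing remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Three made-up "tokenized examples" with 3, 5 and 2 tokens (10 in total).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
grouped = group_texts(toy_batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Two things are worth noticing: a block can span the boundary between two examples (token 4 belongs to the second example but lands in the first block), and the last two tokens are dropped because they do not fill a block.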

Apply the group_texts function over the entire dataset:

In [15]:
dataset_tokenized_batched = dataset_tokenized.map(group_texts, batched=True, num_proc=4)

        

In [16]:
dataset_tokenized_batched

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 267621
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 66508
    })
})

Now create a batch of examples using DataCollatorForLanguageModeling. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

For causal language modeling, use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:

In [17]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
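
The following is a simplified, pure-Python sketch of the idea behind causal-LM collation, not the library's implementation. The helper names `collate_causal` and `next_token_pairs`, the pad id of 0, and the toy token ids are all hypothetical. It assumes that labels start as a copy of the inputs, that positions holding the pad id are replaced by -100 so the loss skips them, and that the one-position shift is applied later when the loss is computed.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def collate_causal(batch, pad_id):
    """Pad every sequence to the longest one in the batch (hypothetical helper)."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    # Labels copy the inputs; every position equal to the pad id is ignored.
    labels = [[IGNORE_INDEX if t == pad_id else t for t in row] for row in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

def next_token_pairs(input_row, label_row):
    """(context token, target token) pairs after shifting labels by one position."""
    return [(x, y) for x, y in zip(input_row[:-1], label_row[1:]) if y != IGNORE_INDEX]

batch = collate_causal([[5, 6, 7], [8, 9]], pad_id=0)
print(batch["labels"])                                               # [[5, 6, 7], [8, 9, -100]]
print(next_token_pairs(batch["input_ids"][1], batch["labels"][1]))  # [(8, 9)]
```

In this notebook every example already has exactly block_size tokens after `group_texts`, so the collator has nothing to pad; padding only matters when sequence lengths differ.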

In [18]:
#help(data_collator)

# Causal language modeling


Causal language models are frequently used for text generation. This section shows you how to finetune GPT-2 to generate new text.



If you aren’t familiar with finetuning a model with the Trainer, take a look at the basic fine-tuning tutorial in the 🤗 Transformers documentation first.

You're ready to start training your model now! Load GPT-2 with [AutoModelForCausalLM](https://huggingface.co/docs/transformers/v4.26.0/en/model_doc/auto#transformers.AutoModelForCausalLM):


In [19]:
from transformers import TrainingArguments, Trainer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

### Training
## [Fails to run on the specified GPU -- defaults to "cuda:0" -- due to a bug in `transformers`]
At this point, only three steps remain:

1. Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. Setting push_to_hub=True would push the model to the Hub (you need to be signed in to Hugging Face to upload it); this notebook sets push_to_hub=False and keeps the checkpoints local.
2. Pass the training arguments to Trainer along with the model, datasets, and data collator.
3. Call train() to finetune your model.

### Disable external telemetry (this may not be reachable from the local network)

In [20]:
import os
os.environ["WANDB_DISABLED"] = "TRUE"

In [21]:
GPU = 3

In [22]:
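# Set up the CUDA environment before CUDA is first initialized.
# torch has already been imported (via transformers) above; the variables
# still take effect because no CUDA call has run yet.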
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = f"{GPU}"  # This shrinks the GPU universe and maps cuda:0 to {GPU}

In [23]:
import torch
torch.cuda.device_count()

1

In [24]:
torch.cuda.current_device() # This really is device {GPU}

0
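
To keep the logical-to-physical mapping straight, a small helper along these lines can translate the index the process sees back to the entry in CUDA_VISIBLE_DEVICES. The function name `physical_gpu` is hypothetical and the sketch only parses the environment variable; it does not query the driver.

```python
import os

def physical_gpu(logical_index, env=os.environ):
    """Return the CUDA_VISIBLE_DEVICES entry behind a logical device index."""
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No remapping configured: logical and physical indices coincide.
        return str(logical_index)
    ids = [v.strip() for v in visible.split(",") if v.strip()]
    return ids[logical_index]

# With the setting used in this notebook, cuda:0 is physical GPU 3.
print(physical_gpu(0, {"CUDA_VISIBLE_DEVICES": "3"}))  # 3
```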

In [25]:
import transformers

In [26]:
CHECKPOINT_DIR = None  # set to a checkpoint directory to load saved weights instead
if CHECKPOINT_DIR is not None:
    model = AutoModelForCausalLM.from_pretrained(CHECKPOINT_DIR)

In [27]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, I'm writing a new language for you. But first, I'd like to tell you about the language itself"},
 {'generated_text': "Hello, I'm a language model, and I'm trying to be as expressive as possible. In order to be expressive, it is necessary to know"},
 {'generated_text': "Hello, I'm a language model, so I don't get much of a license anymore, but I'm probably more familiar with other languages on that"},
 {'generated_text': "Hello, I'm a language model, a functional model... It's not me, it's me!\n\nI won't bore you with how"},
 {'generated_text': "Hello, I'm a language model, not an object model.\n\nIn a nutshell, I need to give language model a set of properties that"}]

In [28]:
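PRIME_LEN = 10  # prompt length in characters (the decoded string is sliced, not the token ids)
test_example_0 = dataset_tokenized_batched["test"][0]  # index the row first to avoid loading the whole column
input_ids_0 = torch.tensor(test_example_0["input_ids"])
inputs_0 = tokenizer.decode(input_ids_0)
label_ids_0 = torch.tensor(test_example_0["labels"])
labels_0 = tokenizer.decode(label_ids_0)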

print(f"inputs_0: {inputs_0}")
print(f"inputs_0[:PRIME_LEN]: {inputs_0[:PRIME_LEN]}")

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator(inputs_0[:PRIME_LEN], max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


inputs_0: Dogs have been selectively bred by humans specifically to look different from each other.  People like variety, and exaggerate it in dogs.  People haven't been breeding killer whales to all look really different, so they pretty much all look similar (in a broad sense).Asteroid belts are not dense. This is a myth perpetuated by popular media. Space is so big that if you were flying through an asteroid belt, you would not come close to any asteroids unless you aimed very carefully right at one. Collisions between large asteroids occurs on the order of once every several million years. So no, the average mass density of
inputs_0[:PRIME_LEN]: Dogs have 


[{'generated_text': 'Dogs have iced heads: a new study finds that nearly two-thirds think the dog breeds are less important to their health than others\n\n'},
 {'generated_text': 'Dogs have ˜˜˜˜˜˜˜.\n\nSome pets have "˜˜˜˜˜˜˜.\n\nSome pets'},
 {'generated_text': 'Dogs have ichthyogenic disease, which is a deadly disease affecting about 150,000 dogs. Dogs are one of the most popular pets for'},
 {'generated_text': "Dogs have ursine amyloid receptors in their own cells and that doesn't stop a dog from taking their own urine, said Robert E"},
 {'generated_text': 'Dogs have erythropoiesis – which is a swelling of skin around or around your eyes. These are called anaphylaxis or'}]

In [29]:
PRIME_LEN = 30
inputs_0 = "I have to tell ya honestly"

print(f"inputs_0: {inputs_0}")
print(f"inputs_0[:PRIME_LEN]: {inputs_0[:PRIME_LEN]}")

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator(inputs_0[:PRIME_LEN], max_length=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


inputs_0: I have to tell ya honestly
inputs_0[:PRIME_LEN]: I have to tell ya honestly


[{'generated_text': "I have to tell ya honestly, I've had it with everyone. I've had people think I'm an asshole because I'm an activist. It"},
 {'generated_text': "I have to tell ya honestly, I'd be happy to tell you how lucky I was to get married. I never expected to live happily ever after"},
 {'generated_text': 'I have to tell ya honestly," said the manager, who told me his team can only take a third more point with four minutes on the clock,'},
 {'generated_text': 'I have to tell ya honestly I\'m not really a big fan of this game\'s story," he said in a phone interview from Paris before his team'},
 {'generated_text': 'I have to tell ya honestly, he had my money because I knew as well as I did he was going to be the star of that show.'}]

In [30]:
import datetime
now = datetime.datetime.now()  # single timestamp so date and time cannot straddle midnight
date = now.strftime('%Y-%m-%d')
time = now.strftime('%H.%M')
training_args = TrainingArguments(
    output_dir=f"{MODEL_NAME}-{DATASET_NAME}/date={date}/time={time}",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    #place_model_on_device=torch.device(f"cuda:{GPU}"),
    push_to_hub=False,
    num_train_epochs=4.0,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_tokenized_batched["train"],
    eval_dataset=dataset_tokenized_batched["test"],
    data_collator=data_collator,
    
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [31]:
trainer.train()

***** Running training *****
  Num examples = 267621
  Num Epochs = 4
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 16728
  Number of trainable parameters = 124439808
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 3.4225 | 3.335937 |
| 2 | 3.3755 | 3.308554 |
| 3 | 3.3503 | 3.296214 |
| 4 | 3.3364 | 3.29239 |


Saving model checkpoint to gpt2-eli5/date=2023-02-09/time=21.39/checkpoint-500
Configuration saved in gpt2-eli5/date=2023-02-09/time=21.39/checkpoint-500/config.json
Configuration saved in gpt2-eli5/date=2023-02-09/time=21.39/checkpoint-500/generation_config.json
Model weights saved in gpt2-eli5/date=2023-02-09/time=21.39/checkpoint-500/pytorch_model.bin
Saving model checkpoint to gpt2-eli5/date=2023-02-09/time=21.39/checkpoint-1000
Configuration saved in gpt2-eli5/date=2023-02-09/time=21.39/checkpoint-1000/config.json
Configuration saved in gpt2-eli5/date=2023-02-09/time=21.39/checkpoint-1000/generation_config.json
Model weights saved in gpt2-eli5/date=2023-02-09/time=21.39/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to gpt2-eli5/date=2023-02-09/time=21.39/checkpoint-1500
Configuration saved in gpt2-eli5/date=2023-02-09/time=21.39/checkpoint-1500/config.json
Configuration saved in gpt2-eli5/date=2023-02-09/time=21.39/checkpoint-1500/generation_config.json
Model weights s

TrainOutput(global_step=16728, training_loss=3.3876338978807987, metrics={'train_runtime': 3637.9317, 'train_samples_per_second': 294.256, 'train_steps_per_second': 4.598, 'total_flos': 6.9927234895872e+16, 'train_loss': 3.3876338978807987, 'epoch': 4.0})
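
The numbers in the training log can be sanity-checked with a little arithmetic. The sketch below recomputes the total step count from the logged example count and batch size, assuming a single device, no gradient accumulation, and that the final partial batch of each epoch is kept. It also converts the final validation loss into perplexity, the usual summary metric for language models.

```python
import math

num_examples = 267621  # "Num examples" from the training log
batch_size = 64        # per_device_train_batch_size on a single GPU
num_epochs = 4

# The last batch of each epoch is only partially full, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 16728, matching "Total optimization steps"

final_eval_loss = 3.29239  # validation loss after epoch 4
perplexity = math.exp(final_eval_loss)
print(f"Perplexity: {perplexity:.2f}")
```

Perplexity is the exponential of the mean cross-entropy loss, so the small drop in validation loss across epochs corresponds to a modest drop in perplexity.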

In [33]:
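PRIME_LEN = 10  # prompt length in characters (the decoded string is sliced, not the token ids)
test_example_0 = dataset_tokenized_batched["test"][0]  # index the row first to avoid loading the whole column
input_ids_0 = torch.tensor(test_example_0["input_ids"])
inputs_0 = tokenizer.decode(input_ids_0)
label_ids_0 = torch.tensor(test_example_0["labels"])
labels_0 = tokenizer.decode(label_ids_0)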

print(f"inputs_0: {inputs_0}")
print(f"inputs_0[:PRIME_LEN]: {inputs_0[:PRIME_LEN]}")

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator(inputs_0[:PRIME_LEN], max_length=30, num_return_sequences=5)

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.0"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


inputs_0: Dogs have been selectively bred by humans specifically to look different from each other.  People like variety, and exaggerate it in dogs.  People haven't been breeding killer whales to all look really different, so they pretty much all look similar (in a broad sense).Asteroid belts are not dense. This is a myth perpetuated by popular media. Space is so big that if you were flying through an asteroid belt, you would not come close to any asteroids unless you aimed very carefully right at one. Collisions between large asteroids occurs on the order of once every several million years. So no, the average mass density of
inputs_0[:PRIME_LEN]: Dogs have 


[{'generated_text': 'Dogs have  higher mortality because they spend much more time at night and are better protected by their tails because they live longer and also are more doc'},
 {'generated_text': 'Dogs have ~~one~~ brain. Dogs and cats seem to have the same brain. This means they have similar parts. So there are '},
 {'generated_text': 'Dogs have  < 20% of their body weight stored for as long as necessary. But more importantly, there is evidence that these cats are often'},
 {'generated_text': "Dogs have  a more variable sense of smell than their fur, and that doesn't mean they won't smell their way through it. \n"},
 {'generated_text': 'Dogs have erythrocyte pigment receptors which allow for their red blood cells to become sensitive to other pigments. This means they are capable'}]

In [34]:
PRIME_LEN = 3000
inputs_0 = "Please explain AI to me."

print(f"inputs_0: {inputs_0}")
print(f"inputs_0[:PRIME_LEN]: {inputs_0[:PRIME_LEN]}")

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(42)
generator(inputs_0[:PRIME_LEN], max_length=30, num_return_sequences=5)

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.0"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


inputs_0: Please explain AI to me.
inputs_0[:PRIME_LEN]: Please explain AI to me.


[{'generated_text': 'Please explain AI to me. I can do a deep learning algorithm that can perform a lot better under pressure (like a robot) than a human can'},
 {'generated_text': 'Please explain AI to me. What you are trying to do is to create a human-language program. This software will be similar to what a program'},
 {'generated_text': 'Please explain AI to me.\n\nWhen talking about how computers play chess, one of the biggest concerns is speed at which they can do these comput'},
 {'generated_text': 'Please explain AI to me. I will explain AI to you in 4 steps.\nA computer is programmed in many different ways. It is programmed to'},
 {'generated_text': "Please explain AI to me.  If you can't solve the problem as you are talking about, then your brain is really not good at solving it"}]

In [36]:
PRIME_LEN = 3000
inputs_0 = "Russia has invaded"

print(f"inputs_0: {inputs_0}")
print(f"inputs_0[:PRIME_LEN]: {inputs_0[:PRIME_LEN]}")

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
set_seed(41)
generator(inputs_0[:PRIME_LEN], max_length=30, num_return_sequences=5)

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "do_sample": true,
  "eos_token_id": 50256,
  "max_length": 50,
  "transformers_version": "4.26.0"
}

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


inputs_0: Russia has invaded
inputs_0[:PRIME_LEN]: Russia has invaded


[{'generated_text': 'Russia has invaded Russia and killed us.\n\nIf you are coming from a side, there are two things you should focus on:\n\na'},
 {'generated_text': 'Russia has invaded Syria in the past two days, and has killed or driven hundreds of thousands of people, as has been pointed out, with artillery fire'},
 {'generated_text': "Russia has invaded the Arctic?** Well, actually it hasn't invaded Russia.** The first major military strike over Russia happened a month ago as part"},
 {'generated_text': 'Russia has invaded Afghanistan in the 19th century, or at least has launched major missile strikes against American forces operating in the region in the last decades.'},
 {'generated_text': "Russia has invaded, it's the same reason why Germany was attacked.\n\n**Now**: The most dramatic thing to happen is the current geopolitical"}]