In [3]:
!pip install "transformers[torch]" datasets

# **Causal language modeling**

## **Load ELI5 dataset**

In [19]:
from datasets import load_dataset

eli5 = load_dataset("eli5",
                    split="train_asks[:5000]")
eli5



Dataset({
    features: ['q_id', 'title', 'selftext', 'document', 'subreddit', 'answers', 'title_urls', 'selftext_urls', 'answers_urls'],
    num_rows: 5000
})

In [20]:
# Split the data into train and test sets (80/20).
# For language modeling we only need the text:
# at each position the model's target is the next token.

eli5 = eli5.train_test_split(test_size=0.2)
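To make the next-token objective concrete, here is a minimal sketch. It assumes whitespace-split words as a stand-in for real tokenizer output (the notebook's tokenizer produces subword ids instead), and `next_token_pairs` is a hypothetical helper written only for this illustration.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs for next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy stand-in for tokenizer output (assumption: words instead of subword ids)
tokens = "the immune system adapts".split()
pairs = next_token_pairs(tokens)

for context, target in pairs:
    print(context, "->", target)
```

Every prefix of the sequence acts as a context and the token that follows it is the target, which is why no separate label column is needed.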

In [31]:
eli5["train"][110]['answers']['text']

['If it\'s a "very rare disease", then the number of surgeons with experience/training in that procedure is going to be limited.  Surgery isn\'t like working on a car, where one guy can pretty much do anything.  It\'s pretty specialized.  (Yes, there are genral surgeons, but they do general surgeries like tonselectomies, appendectomies, wound closures, etc...)\n\nSo, even though medical costs in general are outrageous, there is a matter of supply and demand involved when it comes to "rare" procedures.',
 'It will mostly be the equipment and the staff involved. Sometimes surgeries have a multi-disciplinary approach, in that perhaps a cardiologist, a neurologist and a microvascular surgeon will all be working during the course of getting a patient from the start to the end of a course of treatments. All of these are incredibly specialised and very well paid jobs, due to their high degree of skill and technical expertise.']

## **Preprocess**

In [6]:
# The next step is to load a DistilGPT2 tokenizer to process the `text` subfield

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
# Flatten the nested `answers` struct so that its `text` field becomes the column `answers.text`
eli5 = eli5.flatten()
eli5["train"][0]

{'q_id': 'p51dg',
 'title': 'Is there a tangible benefit to taking a omega-3 DHA supplement? ',
 'selftext': 'Some friends of mine recently starting taking a large number of supplements, ranging from a multivitamin to B12. I have never taken anything other than a multivitamin, but was intrigued by a DHA supplement that claimed to improve memory functions. Is there any solid research that shows these types of supplements are beneficial? ',
 'document': '',
 'subreddit': 'askscience',
 'answers.a_id': ['c3mkyuk', 'c3mmuuo'],
 'answers.text': ["It depends on how old you are. There was a study that came out last year that showed that patients over 55 did benefit from supplementation of DHA to slow declining mental function:\n\n[Journal of Alzheimer's Association study](_URL_0_). \n\nI cant find any kind of evidence that DHA would improve your baseline memory. This theory for why DHA was beneficial in this study is that there is a decline in DHA stores in older people I believe.",
  'What a

In [7]:
def preprocess_function(examples):
    # Join each question's list of answers into one string, then tokenize the whole batch
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

# Apply the preprocessing function over the entire dataset
tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)
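The join inside `preprocess_function` can be checked without a tokenizer. The sketch below uses a made-up toy batch shaped like the flattened dataset (one list of answer strings per question); the strings are illustrative and not taken from ELI5.

```python
# Toy batch shaped like the flattened dataset: one list of answers per question
batch = {
    "answers.text": [
        ["First answer.", "Second answer."],
        ["Only answer."],
    ]
}

# Same join as in preprocess_function, without the tokenizer call
joined = [" ".join(answers) for answers in batch["answers.text"]]
print(joined)
```

Each question ends up as a single string, so the tokenizer returns one token sequence per question rather than one per answer.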

Token indices sequence length is longer than the specified maximum sequence length for this model (2068 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1431 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1094 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (2296 > 1024). Running this sequence through the model will result in indexing errors


Token indices sequence length is longer than the specified maximum sequence length for this model (1543 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1050 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1308 > 1024). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (3471 > 1024). Running this sequence through the model will result in indexing errors


In [8]:
block_size = 128


def group_texts(examples):
    # Concatenate all sequences in the batch into one long list per key
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # Drop the small remainder so that every chunk has exactly block_size tokens.
    # Padding would be an alternative if the model supported it; adapt this to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size

    # Split by chunks of block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }

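    # For causal LM the labels are a copy of the inputs; the model shifts them by one position internally
    result["labels"] = result["input_ids"].copy()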
    return result

# Apply group_texts to the entire dataset to create fixed-size blocks of tokens
lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
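The effect of the chunking is easiest to see on a tiny batch. The sketch below repeats the same logic with `block_size` passed as a parameter; `group_texts_demo` and the toy ids are hypothetical and exist only for this illustration.

```python
def group_texts_demo(examples, block_size):
    # Concatenate every sequence in the batch, key by key
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Keep only a whole number of blocks; the remainder is dropped
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three toy sequences with 10 token ids in total
toy = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
blocks = group_texts_demo(toy, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Blocks cross the original sequence boundaries, and the last two ids (9 and 10) are discarded because they do not fill a whole block. With `block_size = 128`, every training example stays well under the 1024-token limit mentioned in the tokenizer warnings.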

In [9]:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                mlm=False)
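With `mlm=False`, the collator pads each batch and builds `labels` from `input_ids`, replacing pad positions with `-100` so that the loss ignores them. The following is a plain-Python approximation of that behaviour as I understand it, not the library's implementation; `collate_causal` is a hypothetical helper, and the pad id reuses the `eos_token_id` value 50256 shown in the generation log further down.

```python
PAD_ID = 50256        # eos_token_id from the generation log, reused as pad token
IGNORE_INDEX = -100   # label value that PyTorch cross-entropy skips by default


def collate_causal(sequences, pad_id=PAD_ID):
    # Right-pad every sequence to the length of the longest one
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    # Labels mirror the inputs, except that pad positions are masked out
    labels = [[IGNORE_INDEX if t == pad_id else t for t in row] for row in input_ids]
    return {"input_ids": input_ids, "labels": labels}


batch = collate_causal([[11, 12, 13], [21, 22]])
print(batch["labels"])  # [[11, 12, 13], [21, 22, -100]]
```

Because `group_texts` already produces blocks of equal length, little or no padding should be needed in this notebook. One side effect of reusing EOS as the pad token is that genuine EOS tokens in the text are masked in the labels as well.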

## **Train Model**

In [10]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

In [11]:
# define hyperparameters in TrainingArguments

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="my_awesome_eli5_clm-model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    save_strategy="no",
    save_total_limit=2,
    load_best_model_at_end=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 3.8669        | 3.753221        |
| 2     | 3.771         | 3.736462        |
| 3     | 3.7331        | 3.73379         |


TrainOutput(global_step=3405, training_loss=3.8018169296645556, metrics={'train_runtime': 710.8695, 'train_samples_per_second': 38.311, 'train_steps_per_second': 4.79, 'total_flos': 889519463202816.0, 'train_loss': 3.8018169296645556, 'epoch': 3.0})
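The numbers in `TrainOutput` can be sanity-checked with some arithmetic. This sketch assumes the `Trainer` defaults that the cell does not override (a per-device batch size of 8) and a single device; both are assumptions about this run rather than facts stated in the log.

```python
global_step = 3405   # from TrainOutput
epochs = 3           # 'epoch': 3.0 in the metrics
batch_size = 8       # assumed: default per-device batch size, one device

steps_per_epoch = global_step // epochs
# The last batch of an epoch may be partially filled, so the number of
# training blocks lies in a small range
max_blocks = steps_per_epoch * batch_size
min_blocks = (steps_per_epoch - 1) * batch_size + 1

# Cross-check with the throughput reported in the same log line
approx_blocks = 38.311 * 710.8695 / epochs

print(steps_per_epoch, min_blocks, max_blocks, round(approx_blocks))
```

Under these assumptions, the 4000 training questions were regrouped into roughly nine thousand 128-token blocks.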

In [12]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 41.84
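Perplexity is the exponential of the mean cross-entropy loss, so the same conversion can be applied to each validation loss logged during training. The sketch below only reuses the three values from the training log.

```python
import math

# Validation losses copied from the training log, keyed by epoch
val_losses = {1: 3.753221, 2: 3.736462, 3: 3.73379}

perplexities = {epoch: math.exp(loss) for epoch, loss in val_losses.items()}
for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: perplexity {ppl:.2f}")
```

The value for epoch 3 matches the perplexity reported by `trainer.evaluate()`, and the small drop between epochs shows how little the validation loss moved after the first epoch.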


In [13]:
# save model
trainer.save_model("my_awesome_eli5_clm-model")

<Tip>

For a more in-depth example of how to fine-tune a model for causal language modeling, take a look at the corresponding
[PyTorch notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling.ipynb)
or [TensorFlow notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling-tf.ipynb).

</Tip>

In [14]:
# Save the tokenizer to the same directory so that both can be reloaded with from_pretrained
tokenizer.save_pretrained("my_awesome_eli5_clm-model")

('my_awesome_eli5_clm-model/tokenizer_config.json',
 'my_awesome_eli5_clm-model/special_tokens_map.json',
 'my_awesome_eli5_clm-model/vocab.json',
 'my_awesome_eli5_clm-model/merges.txt',
 'my_awesome_eli5_clm-model/added_tokens.json',
 'my_awesome_eli5_clm-model/tokenizer.json')

## **Inference**

In [41]:
prompt = "Somatic hypermutation allows the immune system to"

In [42]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
inputs = tokenizer(prompt, return_tensors="pt").input_ids

In [46]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


["Somatic hypermutation allows the immune system to create the new virus for itself (since it's so much easier to infect with a few immune cells) without getting too old.\n\nThere's a bunch of other things that could work.  The virus could then make you sick.  It could also be used for the disease.   In my lab, which is in San Francisco, I used that lab for a couple of years, so I got rid of that lab with the exception of the ones in San Francisco that were on my"]
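The `top_k=50, top_p=0.95` arguments restrict which tokens can be sampled at each step. Below is a simplified plain-Python sketch of that idea on a toy five-token distribution. It is not the library's implementation (which works on logits), and `top_k_top_p_filter` is a hypothetical helper.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix covering top_p."""
    # Rank token indices by probability, highest first, and keep the top_k
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_total = sum(probs[i] for i in ranked)

    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        # Cumulative mass is measured on the distribution renormalized after top-k
        cumulative += probs[i] / k_total
        if cumulative >= top_p:
            break

    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]  # toy next-token distribution
filtered = top_k_top_p_filter(probs, top_k=4, top_p=0.9)

# Sample the next token from the filtered, renormalized distribution
rng = random.Random(0)
next_token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(filtered, next_token)
```

In this toy case only tokens 0, 1 and 2 survive, so the low-probability tail can never be sampled. With `do_sample=True` this trims unlikely continuations while keeping the output varied between runs.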