In [6]:
import sys
!conda install --prefix {sys.prefix} -y -c pytorch pytorch

Collecting package metadata (current_repodata.json): done
Solving environment: done





# All requested packages already installed.



(Note: You can also use Transformers with TensorFlow, but in practice I've found that PyTorch support in Transformers is better.)

Now you can install Transformers by running the following cell:

In [7]:
import sys
!{sys.executable} -m pip install transformers



In [8]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

In [9]:
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')



In [10]:
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

And then you can generate text with the pipeline. The first argument is the prompt; any remaining parameters will be forwarded to the model's `.generate()` method.

In [11]:
generator("Whether it lands on head or ‘head’, on automation or—in Astra Taylor’s term—fauxtomation, the house always wins. But the question of labour conditions may allow for a more general observation about the relation of statistics and reality, or the question of correlation and causation. Many writers, including myself, have interpreted the shift from a science based on causality towards assumptions based on correlation as an example of magical thinking, or a slide into alchemy. But what if this slide also captures an important aspect of reality? A reality which, rather than being governed by logic or causality, is in fact becoming structured much more like a casino?",
          max_length=500)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (1024). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


IndexError: index out of range in self

In [None]:
my_poem = generator("The writer describes a causal relation between input and output, labour and reward. The striking conclusion is that such causality within actually existing capitalism is rare, especially in precarious work.",
          max_length=500)[0]['generated_text']

In [None]:
my_poem.replace(",", "...")

In [None]:
generator("The predictability of these rooms is, in a word, exquisite.",
          temperature=5.0,
          max_length=100)[0]['generated_text']

In [None]:
generator("I want a dyke for president. I want a person with aids for president and I want a faggot for vice president and I want someone with no health insurance and I want someone who grew up in a place where the earth is so saturated with toxic waste that they didn't have a choice about getting leukemia.",
          top_k=tokenizer.vocab_size,
          temperature= 2.2,
          max_length=200)[0]['generated_text']

On the other extreme, setting the `top_k` value to `1` ensures that *only* the most likely token is chosen at each step. This is the same thing as ["greedy decoding"](https://huggingface.co/transformers/main_classes/model.html#transformers.generation_utils.GenerationMixin.greedy_search). The cell below uses `top_k=10`, a less drastic setting that restricts sampling to the ten most likely tokens at each step:

In [None]:
generator("You, too, can be carved anew by the details of your devotions",
          top_k=10,
          max_length=100)[0]['generated_text']

Playing around with `top_k` and `temperature` in tandem is a good way to make adjustments to the texture of your generated text.
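Here's a rough sketch of what these two knobs do to the model's next-token distribution, written in plain Python with made-up scores. The `warp` function and `toy_logits` are my own illustration and not part of Transformers, whose implementation differs in its details (for instance, this sketch keeps every token tied with the cutoff score).

```python
import math

def warp(logits, temperature=1.0, top_k=None):
    """Return sampling probabilities after temperature scaling and top-k filtering."""
    # Dividing by a temperature above 1 flattens the scores; below 1 sharpens them.
    scaled = [x / temperature for x in logits]
    if top_k is not None and top_k < len(scaled):
        # Keep the k highest scores; everything else gets probability zero.
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]

toy_logits = [4.0, 3.0, 1.0, 0.5]  # hypothetical scores for a four-token vocabulary

print(warp(toy_logits))                   # baseline distribution
print(warp(toy_logits, temperature=5.0))  # flatter: unlikely tokens gain ground
print(warp(toy_logits, top_k=1))          # all of the mass on the single best token
```

A high temperature combined with a small `top_k` gives you variety among the top few candidates without letting very unlikely tokens through.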

### Logit warping: Exclude "bad" words

The `.generate()` method has a parameter called `bad_words_ids`, which causes the model to zero out the probabilities of tokens associated with words that you pass in. The intended use of this feature is to stop the model from generating offensive or harmful words. But we can also repurpose it for poetic purposes. For example, in the cell below, I make the model continue the prompt "where is my curly mask I’m so sad" *without* using the words "person" or "sad":

In [None]:
generator("where is my curly mask I’m so sad",
          bad_words_ids=tokenizer([" person", " sad"]).input_ids)[0]['generated_text']

The syntax for specifying the "bad words" is to call the tokenizer on a list of words that you want to exclude, and then get the `.input_ids` attribute of the value returned from calling the tokenizer. This yields a list of lists, with one inner list of token IDs for each word.

Note that I used ` person` and ` sad` as the words, with leading spaces. This is necessary because I ended the prompt without whitespace, so the model is likely to generate a token with leading whitespace at the next step. I've found that the `bad_words_ids` parameter works best if your list of words includes versions both with and without whitespace.
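To make "zeroing out the probabilities" concrete, here's a minimal sketch in plain Python. The vocabulary, the scores, and the `ban_tokens` and `softmax` helpers are all hypothetical, and the sketch only handles words that are a single token; the real `bad_words_ids` parameter also accepts multi-token sequences.

```python
import math

def ban_tokens(logits, banned_ids):
    """Return a copy of logits with every banned token id set to -inf."""
    banned = set(banned_ids)
    return [float("-inf") if i in banned else x for i, x in enumerate(logits)]

def softmax(logits):
    """Turn scores into probabilities; a score of -inf becomes exactly 0."""
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

toy_vocab = [" sad", " person", " blue", " here"]  # made-up four-token vocabulary
toy_logits = [5.0, 3.0, 2.0, 1.0]                  # made-up next-token scores

probs = softmax(ban_tokens(toy_logits, banned_ids=[0, 1]))
for token, p in zip(toy_vocab, probs):
    print(repr(token), round(p, 3))
```

The banned tokens end up with probability zero, and the remaining probability is shared out among the tokens that are still allowed.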

Here's another example: getting the model to complete a prompt without using any forms of the verb *to be*:

In [None]:
generator("In the large rectangle above my bedroom, a sky bled to tell me so much less",
          bad_words_ids=tokenizer(
              ["be", " be",
               "am", " am",
               "are", " are",
               "is", " is",
               "was", " was",
               "were", " were"]).input_ids,
          temperature= 2.0,
          max_length=100)[0]['generated_text']

You can also create a list of token IDs that you want to exclude on the fly. In the following example, I make a list of token IDs that have the letter `e` in them, and pass that list to the `bad_words_ids` parameter:

In [None]:
forbidden_ids = []
for key, val in tokenizer.get_vocab().items():
    if 'e' in key:
        forbidden_ids.append([val]) # needs to be a list of lists
print(generator("In my dreams, have you seen, have I ever ever ver",
          bad_words_ids=forbidden_ids,
          max_length=100)[0]['generated_text'])
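One thing to watch: the `'e' in key` check above is case-sensitive, so tokens containing a capital `E` still get through. Here's a sketch of a case-insensitive variant that also accepts several letters at once. The `ids_containing` helper and the tiny vocabulary are hypothetical stand-ins; in the notebook you would pass `tokenizer.get_vocab()` instead.

```python
def ids_containing(vocab, letters):
    """Return [[id], ...] for each token containing any of the letters, ignoring case."""
    wanted = set(letters.lower())
    return [[token_id] for token, token_id in vocab.items()
            if wanted & set(token.lower())]

# Made-up vocabulary in the same {token: id} shape that get_vocab() returns.
toy_vocab = {"the": 0, "Ġcat": 1, "Even": 2, "Ġsky": 3}

print(ids_containing(toy_vocab, "e"))   # tokens with e or E
print(ids_containing(toy_vocab, "ey"))  # tokens with e, E, y or Y
```

The result is already a list of lists, so it can be passed straight to `bad_words_ids`.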

### Fine-tuning a model

"Fine-tuning" is a way of slightly modifying a model by training it a few extra steps on a corpus of your choice. This process adjusts the probabilities of the model so that it more closely reflects the probabilities of the source text you train it on. Fine-tuning models with Transformers is a little bit tricky! First, you'll need to install Hugging Face's `datasets` package:

In [12]:
import sys
!{sys.executable} -m pip install datasets

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


And then import it:

In [13]:
import datasets

You'll want to select a text file to fine-tune the model on. Fine-tuning works best on large amounts of text, but fine-tuning is also very slow if you're not using a GPU. For demonstration purposes, I create a special version of [Frankenstein](https://www.gutenberg.org/ebooks/84) that contains only the first 20000 characters, and save it to a file:

In [14]:
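text = open("bow").read()  # read first: opening with "w" truncates the file
with open("bow", "w") as fh:
    fh.write(text[:20000])  # keep only the first 20000 characters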

Then I load this text file as my fine-tuning dataset:

In [16]:
training_data = datasets.load_dataset('text', data_files="bow")

Downloading and preparing dataset text/default to /Users/user/.cache/huggingface/datasets/text/default-034f241caca8044d/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset text downloaded and prepared to /Users/user/.cache/huggingface/datasets/text/default-034f241caca8044d/0.0.0/cb1e9bd71a82ad27976be3b12b407850fe2837d80c22c5e03a28949843a8ace2. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Now, there's a bunch of obligatory processing that we need to do to the data in order to prepare it for the model. This is boilerplate stuff, which I'm not going to go into in detail. If you want details, consult Hugging Face's [fine-tuning language models notebook](https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb).

First, we tokenize the text:

In [17]:
tokenizer.pad_token = tokenizer.eos_token
tokenized_training_data = training_data.map(
    lambda x: tokenizer(x['text']),
    remove_columns=["text"]
)

Map:   0%|          | 0/125 [00:00<?, ? examples/s]

Then we break the tokenized text up into fixed-length blocks of `block_size` tokens:

In [18]:
block_size = 64
# magic from https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
lm_training_data = tokenized_training_data.map(
    group_texts,
    batched=True,
    batch_size=200
)

Map:   0%|          | 0/125 [00:00<?, ? examples/s]
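If the grouping step feels opaque, it can help to run the same function on a tiny hand-made batch. The sketch below repeats `group_texts` with a `block_size` of 4 and made-up token IDs (the `toy_batch` values are mine, not real tokenizer output), so you can see rows being joined, re-cut into equal blocks, and the leftover tail being dropped.

```python
block_size = 4  # tiny block size so the result is easy to read

def group_texts(examples):
    # Join every row's token list into one long list per column.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # For causal language modeling the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result

# Three "lines" of text with 3, 2 and 5 tokens: 10 tokens in total.
toy_batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]}
grouped = group_texts(toy_batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(grouped["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Tokens 9 and 10 are discarded because they don't fill a complete block, which is why a larger `block_size` can throw away more of a small corpus.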

Now we import the `Trainer` class, which implements a training loop.

In [22]:
from transformers import Trainer, TrainingArguments

Running the following cell creates the `Trainer` object. The `output_dir` parameter specifies a directory where your fine-tuned model will be saved. The `num_train_epochs` parameter sets how many "epochs" the trainer will run; one epoch is one iteration over the entire dataset. More epochs pull the model closer to your source text (and, on a small corpus, make it more likely to simply memorize it), but even one epoch can significantly change the way the model generates text.

In [23]:
trainer = Trainer(model=model,
                  train_dataset=lm_training_data['train'],
                  args=TrainingArguments(
                      output_dir='distilgpt2-finetune-bow',
                      num_train_epochs=1,
                      do_train=True,
                      do_eval=False
                  ),
                  tokenizer=tokenizer)

Finally, the cell below will start the training process. If you're running this on a computer without a GPU, it will take a while. You can open this notebook on [Google Colab](http://colab.research.google.com/) if you want and take advantage of the free GPU that Google lets you use.

In [24]:
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.




TrainOutput(global_step=30, training_loss=4.364565022786459, metrics={'train_runtime': 29.8336, 'train_samples_per_second': 7.944, 'train_steps_per_second': 1.006, 'total_flos': 3870458118144.0, 'train_loss': 4.364565022786459, 'epoch': 1.0})
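As a sanity check on `global_step=30`, here's some back-of-the-envelope arithmetic relating dataset size, batch size and epochs. I'm assuming the default `per_device_train_batch_size` of 8, a single device, and no gradient accumulation, since the cell above doesn't set any of these. The example count is my own estimate, recovered from the reported throughput, and `optimizer_steps` is a hypothetical helper.

```python
import math

def optimizer_steps(num_examples, batch_size, epochs):
    """Steps per epoch (the last, partial batch still counts) times the number of epochs."""
    return math.ceil(num_examples / batch_size) * epochs

# Estimate the number of training blocks from the metrics printed above:
# train_samples_per_second * train_runtime.
examples_seen = round(7.944 * 29.8336)
print(examples_seen)  # 237

# Assumed defaults: batch size 8, one device, no gradient accumulation.
print(optimizer_steps(examples_seen, batch_size=8, epochs=1))  # 30
```

Under those assumptions the estimate matches the 30 steps that the trainer reported, and doubling `num_train_epochs` would double the step count.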

Running the cell below will save the model to disk:

In [26]:
trainer.save_model()

Now you can generate with the fine-tuned model! The fine-tuning process modifies the model in-place, so the `pipeline` you created before will make use of the fine-tuned model. (Note that if you want to get the original `distilgpt2` back, you'll need to reload it with the `.from_pretrained()` method, as demonstrated at the top of the notebook.)

In [31]:
generator("Two roads diverged in a yellow", max_length=100)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


RuntimeError: Placeholder storage has not been allocated on MPS device!

You can see that fine-tuning on even a small dataset produces big changes in the model.

If you want to use your fine-tuned model in another project, use the same syntax that we used above to load `distilgpt2`—just replace `distilgpt2` with the name of the directory where you saved your model:

In [35]:
my_tokenizer = AutoTokenizer.from_pretrained('distilgpt2-finetune-bow')
my_model = AutoModelForCausalLM.from_pretrained('distilgpt2-finetune-bow')

Now generate with it:

In [36]:
my_generator = pipeline("text-generation", model=my_model, tokenizer=my_tokenizer)

In [40]:
my_generator("Whānau is often translated as ‘family’, but its meaning is more complex. It includes physical, emotional and spiritual dimensions.",
          top_k=10,
          temperature= 2.2,
          max_length=200)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'Whānau is often translated as ‘family’, but its meaning is more complex. It includes physical, emotional and spiritual dimensions. It is the key of all of the many ways this human world might feel like a ‘world in the first place. In my own words, it is an example of this kind of thing.'