# Train a GPT-2 Model with AITEXTGEN

This tutorial explores fine-tuning and text generation with the GPT-2 language model. More information on GPT-2 can be found here: [Better Language Models and Their Implications](https://openai.com/blog/better-language-models/)

<br>

`AITEXTGEN` is authored by [Max Woolf](https://minimaxir.com). For more info you can visit [this GitHub repository](https://github.com/minimaxir/aitextgen) or [read the documentation](https://docs.aitextgen.io/). This is an evolution of Woolf's earlier project [GPT2-SIMPLE](https://github.com/minimaxir/gpt-2-simple). `AITEXTGEN` also takes advantage of processes authored by [HuggingFace](https://huggingface.co/), so it may be interesting to research their [NLP Models](https://huggingface.co/models) and the [Transformers package](https://huggingface.co/docs/transformers/installation).

<br>

Colab Setup:

Make sure a GPU is enabled: go to Edit -> Notebook settings -> Hardware accelerator -> GPU.

Note: Free Colab runtimes vary greatly in duration, lasting anywhere from 4 to 12 hours before disconnecting, so we'll discuss later how to save your progress at regular intervals. Colab Pro is a paid service that provides longer runtimes with more RAM and better GPUs.



# Environment setup
This cell installs/imports `AITEXTGEN` and some utilities for interfacing with your Google Drive. Much of the following tutorial is copied from Woolf's Colab demo: [Finetune OpenAI's 124M GPT-2 model (or GPT Neo) on your own dataset (GPU)](https://colab.research.google.com/drive/15qBZx5y9rdaQSyWpsreMDnTiZ5IlN0zD?usp=sharing).

In [None]:
!pip install aitextgen
!pip install pytorch-lightning==1.7.7

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting aitextgen
  Downloading aitextgen-0.6.0.tar.gz (572 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers>=4.5.1
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
Collecting fire>=0.3.0
  Downloading fire-0.5.0.tar.gz (88 kB)
  Preparing metadata (setup.py) ... done
Collecting pytorch-lightning>=1.7.0
  Downloading pytorch_lightning-1.9.0-py3-none-any.whl (825 kB)

## GPU

Colaboratory assigns an Nvidia P4, T4, P100, or V100 GPU. Any of these is fine for finetuning GPT-2 124M, but for text generation a T4 or a P100 is ideal since they have more VRAM. **If you receive a T4 or a V100 GPU, you can set `fp16=True` during training for faster, more memory-efficient training.**

You can verify which GPU is active by running the cell below. If you want to try for a different GPU, go to **Runtime -> Factory Reset Runtime**.

In [None]:
!nvidia-smi

Tue Jan 31 22:08:09 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   68C    P0    27W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
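
The same check can be scripted so that the `fp16` choice follows the GPU you were assigned. This is a minimal sketch, assuming `nvidia-smi` is on the PATH as it is in a Colab GPU runtime; `supports_fp16` and `detect_gpu_name` are hypothetical helper names, and the T4/V100 rule simply restates the guidance in this section.

```python
import shutil
import subprocess

# GPU families this tutorial recommends for fp16=True (taken from the text above).
FP16_GPUS = ("T4", "V100")


def supports_fp16(gpu_name):
    """Return True if the GPU name mentions a T4 or a V100."""
    return any(tag in gpu_name for tag in FP16_GPUS)


def detect_gpu_name():
    """Ask nvidia-smi for the GPU name; return None if no GPU is reported."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    lines = result.stdout.strip().splitlines()
    if result.returncode != 0 or not lines:
        return None
    return lines[0].strip()


gpu_name = detect_gpu_name()
use_fp16 = gpu_name is not None and supports_fp16(gpu_name)
print(f"GPU: {gpu_name}, fp16 suggested: {use_fp16}")
```

You could then pass `fp16=use_fp16` to `ai.train()` instead of hard-coding the value.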

## Loading GPT-2 or another base model

If you're finetuning a model with your own text, you need to download and load the GPT-2 model into the GPU. 

There are several sizes of GPT-2:

* `124M`: the "small" model, 500MB on disk.
* `355M`: the "medium" model, 1.5GB on disk.
* `774M`: the "large" model, 3GB on disk.

You can also finetune alternative base models such as [GPT Neo](https://www.eleuther.ai/projects/gpt-neo/), which is better suited to longer texts and whose base model was trained on more recent data:

* `125M`: Analogous to the GPT-2 124M model.

In this case, *many* of the models hosted on [HuggingFace](https://huggingface.co/models) may be used as a base model for finetuning. With this in mind, even non-English models may be finetuned with this code. 

In [None]:
ai = aitextgen(tf_gpt2="124M", to_gpu=True)

# Comment out the above line and uncomment the below line to use alternate base models such as GPT Neo.
# ai = aitextgen(model="EleutherAI/gpt-neo-125M", to_gpu=True)

INFO:aitextgen:Downloading the 124M GPT-2 TensorFlow weights/config from Google's servers


Fetching checkpoint:   0%|          | 0.00/77.0 [00:00<?, ?it/s]

Fetching hparams.json:   0%|          | 0.00/90.0 [00:00<?, ?it/s]

Fetching model.ckpt.data-00000-of-00001:   0%|          | 0.00/498M [00:00<?, ?it/s]

Fetching model.ckpt.index:   0%|          | 0.00/5.21k [00:00<?, ?it/s]

Fetching model.ckpt.meta:   0%|          | 0.00/471k [00:00<?, ?it/s]

INFO:aitextgen:Converting the 124M GPT-2 TensorFlow weights to PyTorch.
Converting TensorFlow checkpoint from /content/aitextgen/124M
Loading TF weight model/h0/attn/c_attn/b with shape [2304]
Loading TF weight model/h0/attn/c_attn/w with shape [1, 768, 2304]
Loading TF weight model/h0/attn/c_proj/b with shape [768]
Loading TF weight model/h0/attn/c_proj/w with shape [1, 768, 768]
Loading TF weight model/h0/ln_1/b with shape [768]
Loading TF weight model/h0/ln_1/g with shape [768]
Loading TF weight model/h0/ln_2/b with shape [768]
Loading TF weight model/h0/ln_2/g with shape [768]
Loading TF weight model/h0/mlp/c_fc/b with shape [3072]
Loading TF weight model/h0/mlp/c_fc/w with shape [1, 768, 3072]
Loading TF weight model/h0/mlp/c_proj/b with shape [768]
Loading TF weight model/h0/mlp/c_proj/w with shape [1, 3072, 768]
Loading TF weight model/h1/attn/c_attn/b with shape [2304]
Loading TF weight model/h1/attn/c_attn/w with shape [1, 768, 2304]
Loading TF weight model/h1/attn/c_proj/b wi

Save PyTorch model to aitextgen/pytorch_model.bin


INFO:aitextgen:Loading 124M GPT-2 model from /aitextgen.


Save configuration file to aitextgen/config.json


Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.0"
}

INFO:aitextgen:GPT2 loaded with 124M parameters.
INFO:aitextgen:Using the default GPT-2 Tokenizer.


## Mounting Google Drive

The best way to get your training text into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*. Running this cell mounts your personal Google Drive in the VM so that later cells can move data in and out. (It will ask for an auth code.)

<br>

Alternatively, this code can be run from any Colab notebook to mount your Drive:
<br>

`from google.colab import drive`
<br>
`drive.mount('/content/drive')`

In [None]:
mount_gdrive()

Mounted at /content/drive


## IMPORT DATASET
## Option 1
## Downloading a Text File Dataset for Training
The commented-out line in the cell below fetches A Tale of Two Cities (Charles Dickens) from [Project Gutenberg](https://www.gutenberg.org/); the active line fetches Project Gutenberg ebook 15154 instead. To change the dataset that GPT-2 will fine-tune on, change this URL to another .txt file, and change the corresponding `file_name` variable later in the notebook so that it matches the downloaded file.

Keep in mind that this *must* be a plain-text .txt file in an 8-bit encoding such as UTF-8.

98/98-0.txt - Dickens - A Tale of Two Cities - 139k words
<br>
84/84-0.txt - Shelley - Frankenstein - 78k words
<br>
996/996-0.txt - Cervantes - Don Quixote - 430k words
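
The entries above are path suffixes rather than full URLs. A small sketch of how they map to download URLs and local file names, assuming the `https://www.gutenberg.org/files/` prefix seen in the commented-out `wget` line of the next cell; `gutenberg_url` and `local_file_name` are hypothetical helpers, not part of aitextgen.

```python
from urllib.parse import urljoin

# Prefix taken from the commented-out wget line in this notebook.
GUTENBERG_FILES = "https://www.gutenberg.org/files/"


def gutenberg_url(path_suffix):
    """Build the full download URL for a suffix such as '98/98-0.txt'."""
    return urljoin(GUTENBERG_FILES, path_suffix)


def local_file_name(path_suffix):
    """Name wget gives the file locally: the last path component."""
    return path_suffix.rsplit("/", 1)[-1]


for suffix in ("98/98-0.txt", "84/84-0.txt", "996/996-0.txt"):
    print(gutenberg_url(suffix), "->", local_file_name(suffix))
```

The second value is what `file_name` should be set to after the download.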

In [None]:
# !wget https://www.gutenberg.org/files/98/98-0.txt
!wget https://www.gutenberg.org/cache/epub/15154/pg15154.txt

--2023-01-18 22:53:01--  https://www.gutenberg.org/cache/epub/15154/pg15154.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 417649 (408K) [text/plain]
Saving to: ‘pg15154.txt’


2023-01-18 22:53:02 (1.67 MB/s) - ‘pg15154.txt’ saved [417649/417649]



In [None]:
# Must match the file actually downloaded by the wget cell above.
file_name = "pg15154.txt"
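
Before training, it can be worth confirming that the file named by `file_name` actually exists and decodes cleanly. The sketch below is one way to do that; `check_text_file` is a hypothetical helper, and treating UTF-8 as the required encoding is an assumption on my part rather than something this notebook states.

```python
import os
import tempfile


def check_text_file(path):
    """Return (size_in_bytes, word_count); raise if the file is missing or not UTF-8."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{path} not found; download or upload it first")
    with open(path, "rb") as f:
        raw = f.read()
    text = raw.decode("utf-8")  # raises UnicodeDecodeError for other encodings
    return len(raw), len(text.split())


# Demo on a throwaway file; in the notebook you would call check_text_file(file_name).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("It was the best of times")
    demo_size, demo_words = check_text_file(demo_path)
    print(f"{demo_size} bytes, {demo_words} words")
```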

## Option 2
## Uploading a Text File to Colaboratory

In the Colaboratory Notebook sidebar on the left of the screen, select *Files*. From there you can *Upload to session storage*:

![alt text](https://i.imgur.com/w3wvHhR.png)

**Make sure to redefine the `file_name` variable to your dataset's name.**

In [None]:
file_name = "lyrics_rap.txt"

## Option 3
## Transferring a Text File from Google Drive

If your text file is larger than 10MB, it is recommended to upload it to Google Drive first, then copy it from Google Drive to the Colaboratory VM. Run the copy cell below ONLY if the named text file is in the root of your personal Google Drive.

**Make sure to first redefine the `file_name` variable to your dataset's name.**

In [None]:
file_name = "pg15154.txt"

In [None]:
copy_file_from_gdrive(file_name)

FileNotFoundError: ignored
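
The `FileNotFoundError` above is what you see when the named file is not in the root of your Drive. As an alternative sketch, the copy can be done with the standard library and an explicit existence check. `copy_from_drive` is a hypothetical helper, and the `/content/drive/MyDrive` default assumes Drive is mounted at `/content/drive` as elsewhere in this notebook.

```python
import os
import shutil
import tempfile


def copy_from_drive(file_name, drive_root="/content/drive/MyDrive", dest_dir="."):
    """Copy file_name from the Drive root into dest_dir; return the new path or None."""
    source = os.path.join(drive_root, file_name)
    if not os.path.isfile(source):
        print(f"{source} not found; is Drive mounted and the file in its root?")
        return None
    return shutil.copy(source, os.path.join(dest_dir, file_name))


# Demo with temporary folders standing in for Drive and the Colab VM.
with tempfile.TemporaryDirectory() as fake_drive, tempfile.TemporaryDirectory() as fake_vm:
    with open(os.path.join(fake_drive, "demo.txt"), "w", encoding="utf-8") as f:
        f.write("sample text")
    copied = copy_from_drive("demo.txt", drive_root=fake_drive, dest_dir=fake_vm)
    copied_ok = copied is not None and os.path.isfile(copied)
    missing = copy_from_drive("absent.txt", drive_root=fake_drive, dest_dir=fake_vm)
```

In the notebook itself, `copy_from_drive(file_name)` with the defaults would copy into the current working directory.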

## Finetune GPT-2

The next cell will start the actual finetuning of GPT-2 in aitextgen. It runs for `num_steps`, and a progress bar shows training progress, the current loss (lower is better), and the average loss (to give a sense of the loss trajectory).

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

The training might time out after 4ish hours; if you did not mount to Google Drive, make sure you end training and save the results so you don't lose them! (if this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup))

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. Only works on a T4 or V100 GPU.
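
For `line_by_line=True`, the input needs to be a single-column CSV with one record per row. A minimal sketch of producing such a file with the standard `csv` module follows; `write_single_column_csv` is a hypothetical helper, and whether aitextgen expects or skips a header row is not covered in this notebook, so the `header` argument is left optional here and should be checked against the aitextgen documentation.

```python
import csv
import os
import tempfile


def write_single_column_csv(records, path, header=None):
    """Write one record per row; quoting keeps commas and newlines inside a record."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        if header is not None:
            writer.writerow([header])  # optional header row (assumption, see above)
        for record in records:
            writer.writerow([record])


demo_records = ["first record", "second, with a comma", "third\nwith a newline"]

# Round-trip the demo records through a temporary file to confirm they survive.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "records.csv")
    write_single_column_csv(demo_records, csv_path)
    with open(csv_path, "r", encoding="utf-8", newline="") as f:
        rows_read = [row[0] for row in csv.reader(f)]
```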

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. (if using `fp16`, you can increase the batch size more safely)
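
To see how `num_steps`, `save_every`, and `generate_every` interact, it helps to list the steps at which each action fires. The helper below is only a sketch of the arithmetic, not aitextgen's internal logic; `steps_for_interval` is a hypothetical name.

```python
def steps_for_interval(num_steps, every):
    """Steps at which an action fires when it runs once every `every` steps."""
    if every <= 0:
        return []
    return list(range(every, num_steps + 1, every))


save_steps = steps_for_interval(2500, 500)
sample_steps = steps_for_interval(2500, 1000)
print("model saved at steps:", save_steps)
print("samples generated at steps:", sample_steps)
```

With the values used in the next cell, this matches the training log in this notebook: saves at 500, 1,000, 1,500, 2,000, and 2,500 steps, and sample texts at 1,000 and 2,000 steps.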

In [None]:
ai.train(file_name,
         line_by_line=False,
         from_cache=False,
         num_steps=2500,
         generate_every=1000,
         save_every=500,
         save_gdrive=True,
         learning_rate=1e-3,
         fp16=False,
         batch_size=1, 
         )

INFO:aitextgen:Loading text from lyrics_rap.txt with generation length of 1024.


  0%|          | 0/13276 [00:00<?, ?it/s]

INFO:aitextgen.TokenDataset:Encoding 13,276 sets of tokens from lyrics_rap.txt.
  rank_zero_deprecation(
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
  rank_zero_deprecation(
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/2500 [00:00<?, ?it/s]

Configuration saved in trained_model/generation_config.json


500 steps reached: saving model to /trained_model


Configuration saved in trained_model/generation_config.json


1,000 steps reached: saving model to /trained_model


Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.0"
}

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


1,000 steps reached: generating sample texts.

And all the bees buzzed me, they said I gotta go home
I'm not even trippin', my nigga, I'm just tryna breathe
I like my motherma, you do not wanna get no shine Your lesson for me
You're tryna get your little something to keep you away
You're tryna get your own number
You don't wanna get your number
You don't wanna get no more
You are a man
I love life with you
One day at a time
So I'ma work on your side
One day at a time
So you're all out
One day at a time
So, they say I gotta go home
One day at a time
So hey, y'all close the chapter
One day at a time
So you better wake up now
One day at a time
So you better wake up now
One day at a time
They said I gotta go home
One day at a time
So I'ma work on your mind
One day at a time
And one day at a time
So you better wake up now
One day at a time
Was the same way back then as you are now
Two steppin


Configuration saved in trained_model/generation_config.json


1,500 steps reached: saving model to /trained_model


Configuration saved in trained_model/generation_config.json


2,000 steps reached: saving model to /trained_model


Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.0"
}

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


2,000 steps reached: generating sample texts.
, when I get past security, the high school
I'll be back with the 12
I'll be back with the 12
I'll be back with the 12
I'll be back with the 12
I'm back with the 12
I'm strapped for real, ask me if I got the fuss
I'll be back with the 12 (not me)
I'm strapped for real, but I need
Livin' in captivity raised my hole deeper (yeah)
Celery, tellin' me where I'm at
Karen like Sonic, I'm in the Lamb', bitch
Been a lyrical grand wizard like Theodore
I'm in the Lamb', bitch, I'm a natural
I'm a real wizard like Theodore
I'm a real wizard like
I'm a real wizard like
I'm a real wizard
I'm a real wizard like
I'm a real wizard like
I'm a real wizard like
I'm in the Lamb', bitch
I'm a real magician like
I'm in the Lamb', bitch
I'm in the Lamb', bitch
I'm in the Lamb', bitch
I'm in the Lamb', bitch
I'm in the Lamb', bitch
I'm


Configuration saved in trained_model/generation_config.json


2,500 steps reached: saving model to /trained_model


INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=2500` reached.
INFO:aitextgen:Saving trained model pytorch_model.bin to /trained_model
Configuration saved in trained_model/generation_config.json


You're done! 
<br>
At this point it may be useful to rename the `trained_model` folder to something unique, perhaps based on the `file_name` and `num_steps` of your finetuned model.
<br> 
Likewise, if you set `save_gdrive=True` you will find a new folder in your Google Drive formatted like `ATG_YYYYMMDD_HHMMSS`. Perhaps rename this to something logical.

<br>
Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model.
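
The renaming suggested above can also be done in code. This is a sketch under assumptions: `model_folder_name` and `rename_model_folder` are hypothetical helpers, and the `<dataset>_<steps>` naming scheme is just one possibility.

```python
import os
import tempfile


def model_folder_name(file_name, num_steps):
    """Build a folder name such as 'lyrics_rap_2500' from the dataset name and step count."""
    stem = os.path.splitext(os.path.basename(file_name))[0]
    return f"{stem}_{num_steps}"


def rename_model_folder(file_name, num_steps, src="trained_model", parent="."):
    """Rename parent/src to the generated name and return the new path."""
    source = os.path.join(parent, src)
    target = os.path.join(parent, model_folder_name(file_name, num_steps))
    if not os.path.isdir(source):
        raise FileNotFoundError(f"{source} does not exist")
    if os.path.exists(target):
        raise FileExistsError(f"{target} already exists; pick another name")
    os.rename(source, target)
    return target


# Demo in a temporary directory so no real model folder is touched.
with tempfile.TemporaryDirectory() as workdir:
    os.mkdir(os.path.join(workdir, "trained_model"))
    new_path = rename_model_folder("lyrics_rap.txt", 2500, parent=workdir)
    renamed_name = os.path.basename(new_path)
    renamed_exists = os.path.isdir(new_path)
```

In the notebook, calling `rename_model_folder(file_name, 2500)` with the defaults would rename `trained_model` in the current working directory.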

## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text.

Make sure to change `2500_rap_model` in the cell below to your custom model's folder name.

In [None]:
# Copy a saved model folder from Google Drive into the VM.
# Change 2500_rap_model to your custom model's folder name.
!cp -r /content/drive/MyDrive/2500_rap_model /content/

**If you just trained a model**, you'll get much faster generation performance if you reload it. The next cell reloads a model from the folder passed as `model_folder` (here `2500_rap_model`); to load the model you just trained, use `trained_model` or the name of your renamed folder.

In [None]:
ai = aitextgen(model_folder="2500_rap_model", to_gpu=True)

INFO:aitextgen:Loading model from provided weights and config in /2500_rap_model.
Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.0"
}

INFO:aitextgen:GPT2 loaded with 124M parameters.
INFO:aitextgen:Using the default GPT-2 Tokenizer.


`generate()` without any parameters generates a single text from the loaded model to the console.

In [None]:
ai.generate()

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.0"
}

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Just a feeling I've got, like something's about to happen, but I don't know what
If that means what I think it means, we're in trouble, big trouble
And if he is as bananas as you say, I'm not taking any chances
You are just what the doc ordered
I'm beginnin' to feel like a Rap God, Rap God
All my people from the front to the back nod, back nod
Now, who thinks their arms are long enough to slap box, slap box?
They said I rap bout a robot, so call me Rap-bot
But for me to rap like a computer, it must be in my genes
I got a laptop in my back pocket
My pen'll go off when I half-cock it
Got a fat knot from that rap profit
I'ma a boss in a skirt
I'm a dog on the get up watch
I'm gonna pop some tags
Only got twenty dollars in my pocket
I'm, I'm, I'm huntin'
Maybe I needed to wear 'cause my famhes
I'm in a league of my own
Watch 'em coolin' on the prize, 'cause at the end
It's


If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = ai.generate_one()`

You can also pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid going OOM).

Other optional-but-helpful parameters for `ai.generate()` and friends:

* **`min_length`**: The minimum length of the generated text: if the text is shorter than this value after cleanup, aitextgen will generate another one.
* **`max_length`**: Number of tokens to generate (default 256; you can generate up to 1024 tokens with GPT-2 and 2048 with GPT Neo)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
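
To make `temperature`, `top_k`, and `top_p` more concrete, here is a toy sketch of what each one does to a next-token probability distribution. It uses plain Python on a four-token example and illustrates the general technique only; it is not the code aitextgen or Transformers actually runs, and all function names are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature rescales the scores first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens (k <= 0 disables the filter)."""
    if k <= 0 or k >= len(probs):
        return list(probs)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, set(ranked[:k]))


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


toy_probs = [0.5, 0.3, 0.15, 0.05]
after_top_k = top_k_filter(toy_probs, 2)    # only the two likeliest tokens survive
after_top_p = top_p_filter(toy_probs, 0.9)  # the least likely token is dropped

# Sample one token index from the filtered distribution (seeded for repeatability).
rng = random.Random(0)
sampled_index = rng.choices(range(len(toy_probs)), weights=after_top_p, k=1)[0]
```

A lower temperature sharpens the distribution toward the most likely token, while a higher one flattens it, which is why higher values read as "crazier" text.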

In [None]:
ai.generate(n=10,
            prompt="die young",
            max_length=64,
            temperature=1.0, # change this 
            top_p=0.9)

Generate config GenerationConfig {
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.26.0"
}



die young to drive me, band leaves the stage
You're owed for the shit I am workin' at shows
But I need a straight jacket, face facts
I am a Rap God, Rap God
All my people from the front to the back nod, back nod
Now, I'm a Rap
die young, y'all fucked up
You creepin', you creepin'
I went through your phone last night
I went through your phone last night
Saw some things I didn't like
I went through your phone last night
It's killin' me, killin' me, oh
die young again, I was gonna call a cab
I couldn't leave the scene for y'all to chase it
I went the scene with no I ever see it
And I ain't got shit to lose
I went the distance, I got a deep route
And I'm a couple things I've
die young enough for my investment and turned it into a fence
Then my friend Carlos' brother
Got murdered for his Fours, whoa
See he just wanted a jump shot
But they wanted his Starter coat, though
Didn't wanna get caught, from Genesee Park to Othello it
die young but I know my s

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# LICENSE

MIT License

Copyright (c) 2020-2021 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.