#  aitextgen — Train a GPT-2 (or GPT Neo) Text-Generating Model w/ GPU

by [Max Woolf](https://minimaxir.com)

*Last updated: May 16th, 2021 (aitextgen v0.5.2)*

Retrain an advanced text generating neural network on any text dataset **for free on a GPU using Colaboratory** using `aitextgen`!

For more about `aitextgen`, you can visit [this GitHub repository](https://github.com/minimaxir/aitextgen) or [read the documentation](https://docs.aitextgen.io/).


To get started:

1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Run the cells below:


In [None]:
!pip install -q aitextgen



In [None]:
import aitextgen
import datetime
import gc
import logging
import os
import requests
import torch

In [None]:
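# Ask the notebook server running inside the Colab VM for the active session,
# and use that session's notebook name to label this run.
session_url = 'http://172.28.0.2:9000/api/sessions'
notebook_name = requests.get(session_url).json()[0]['name']

# Timestamp the run so each execution gets its own output folder.
run_datetime = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
run_id = notebook_name + '_run_' + run_datetime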

In [None]:
log_format = '%(asctime)s — %(levelname)s — %(name)s — %(message)s'
date_format = '%d/%m/%Y %H:%M:%S'
log_level = logging.DEBUG

logging.basicConfig(format=log_format, datefmt=date_format, level=log_level)

## GPU

Colaboratory assigns an Nvidia P4, T4, P100, or V100 GPU. Any of these is fine for finetuning GPT-2 124M, but for text generation a T4 or a P100 is ideal since they have more VRAM. **If you receive a T4 or a V100 GPU, you can enable `fp16=True` during training for faster, more memory-efficient training.**

You can verify which GPU is active by running the cell below. If you want to try for a different GPU, go to **Runtime -> Factory Reset Runtime**.

In [None]:
!nvidia-smi

Mon Jun 13 16:23:41 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
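
The same check can be done from Python. The sketch below is only an illustration: `should_enable_fp16` is a hypothetical helper (not part of aitextgen) that encodes the T4/V100 rule stated above by matching on the device name, and it assumes the name reported by `torch.cuda.get_device_name` contains the model string (for example `Tesla T4`).

```python
def should_enable_fp16(gpu_name: str) -> bool:
    """Hypothetical helper: True for the GPUs the text above recommends fp16 on."""
    name = gpu_name.upper()
    return any(model in name for model in ("T4", "V100"))


try:
    import torch

    if torch.cuda.is_available():
        active_gpu = torch.cuda.get_device_name(0)
        print(active_gpu, "-> fp16 recommended:", should_enable_fp16(active_gpu))
    else:
        print("No CUDA GPU detected")
except ImportError:
    # Lets the sketch run outside Colaboratory, where torch may be missing.
    print("torch is not installed")
```

The result could then be passed as the `fp16` argument when training, rather than hard-coding it per session.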

## Mounting Google Drive

The best way to get training text into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*.

Running this cell (which only works in Colaboratory) mounts your personal Google Drive in the VM, so later cells can move data in and out. (It will ask for an auth code; that auth is not saved anywhere.)

In [None]:
aitextgen.colab.mount_gdrive()

Mounted at /content/drive


In [None]:
gdrive_root_dir = '/content/drive/My Drive'


## Load a Trained Model

If you already have a trained model from this notebook, running the next cell will copy the `pytorch_model.bin` and `config.json` files from the specified folder in Google Drive into the Colaboratory VM. (If no `from_folder` is specified, it assumes the two files are located at the root level of your Google Drive.)

In [None]:
load_model = 'aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11'

In [None]:
model_load_dir = gdrive_root_dir + '/aitextgen/models/' + load_model
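
Before loading, it can help to confirm that the folder really contains the two files mentioned above. This is a minimal sketch, not part of aitextgen: `missing_model_files` is a hypothetical helper, and the demo uses a temporary folder in place of the Google Drive path.

```python
import tempfile
from pathlib import Path

# The two files the text above says a trained model folder must contain.
REQUIRED_FILES = ("pytorch_model.bin", "config.json")


def missing_model_files(model_dir):
    """Return the names of required model files that are absent from model_dir."""
    folder = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]


# Demo on a throwaway folder that only has config.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    demo_missing = missing_model_files(tmp)

print(demo_missing)
```

In the notebook you would call `missing_model_files(model_load_dir)` and stop early if the list is not empty.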

The next cell will allow you to load the retrained model + metadata necessary to generate text.

## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text.

**If you just trained a model**, you'll get much faster generation performance if you reload the model; the next cell will reload the model you just trained from the `trained_model` folder.

`generate()` without any parameters generates a single text from the loaded model to the console.

If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = ai.generate_one()`

You can also pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid running out of memory).

Other optional-but-helpful parameters for `ai.generate()` and friends:

* **`min_length`**: The minimum length of the generated text: if the text is shorter than this value after cleanup, aitextgen will generate another one.
* **`max_length`**: Number of tokens to generate (default 256; you can generate up to 1024 tokens with GPT-2 and 2048 with GPT Neo)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
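
To make the effect of `temperature`, `top_k`, and `top_p` concrete, here is a pure-Python sketch of the filtering they describe. It is an illustration under simplifying assumptions, not the code aitextgen runs internally: `sampling_distribution` and the toy logits are made up for this example.

```python
import math


def sampling_distribution(logits, temperature=0.7, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature divides the logits before the softmax;
    # higher values flatten the distribution ("crazier" text).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = {i: e / total for i, e in enumerate(exps)}

    # top_k keeps only the k most likely tokens (0 disables the filter).
    ranked = sorted(probs, key=probs.get, reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    kept_total = sum(probs[i] for i in ranked)
    probs = {i: probs[i] / kept_total for i in ranked}

    # top_p keeps the smallest set of top tokens whose
    # cumulative probability reaches top_p (nucleus sampling).
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    nucleus_total = sum(probs[i] for i in nucleus)
    return {i: probs[i] / nucleus_total for i in nucleus}


toy_logits = [2.0, 1.0, 0.0]  # made-up scores for a three-token vocabulary
print(sampling_distribution(toy_logits, temperature=1.0, top_k=2))
print(sampling_distribution(toy_logits, temperature=1.0, top_p=0.5))
```

The model then samples the next token from whatever distribution survives these filters, which is why a small `top_k` or `top_p` reins in wild output.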

For bulk generation, you can generate a large number of texts to a file and sort out the samples locally on your computer. The next cell will generate `num_files` files, each with `n` texts and whatever other parameters you would pass to `generate()`. The files can then be downloaded from the Files sidebar!

You can rerun the cells as many times as you want for even more generated texts!
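
For sorting out the samples locally, one possible approach is to split each downloaded file back into individual texts. The sketch below assumes the samples are separated by a line of twenty `=` characters, which I believe is the default `sample_delim` of `generate_to_file()`; check your own output files and adjust `SAMPLE_DELIM` if they differ. `split_samples` is a hypothetical helper.

```python
# Assumed separator between samples; verify against your own output files.
SAMPLE_DELIM = "=" * 20 + "\n"


def split_samples(file_text, delim=SAMPLE_DELIM):
    """Split the contents of one generated file into a list of non-empty texts."""
    return [chunk.strip() for chunk in file_text.split(delim) if chunk.strip()]


# Tiny made-up file body with two samples.
example_body = "first sample\n" + SAMPLE_DELIM + "second sample\n" + SAMPLE_DELIM
samples = split_samples(example_body)
print(len(samples), samples)
```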

In [None]:
num_outputs = 10
max_length = 2000  # note: above the 1024-token GPT-2 limit described earlier
temperature = 1.0
top_p = 0.9

prompts = ['',
           'Digital Forensics Analysis Report\n',
           'This report is ',
           'The contents of ',
           'Conclusion\n',
           'It is recommended that ',
           'In the opinion of the expert, ',
           'File \'Exploit_Office\' contains ',
           'File \'Exploit_Office\' does not contain ',
           'Website \'Webmail SquirrelMail\' contains ',
           'Website \'Webmail SquirrelMail\' does not contain ',
           'Bill Due to past contains a link \'https://genom.mefst.hr/webmail/src/login.php\' to a website \'Webmail SquirrelMail\'.',
           'New Dogecoin Crypto Sale contains a link \'http://webmail.forumofthemall.hr/mail/loging.php\' to a website \'Webmail SquirrelMail Popular Forum\'.',
           'New OneCoin Crypto Sale contains a link \'http://',
           'Note of eviction contains ',
           'Note of eviction contains attachment ',
           'Note of eviction contains attachment \'Exploit_Office\'. Attachment is quarantined on \'Mail server EP\'.',
           'Log entry found: ',
           'Log entry found: Firewall (Type: Firewall) ',
           'Log entry found: Firewall (Type: Firewall) detected. [Allowed network traffic protocol ',
           'Log entry found: Firewall (Type: Firewall) blocked. [Blocked network traffic protocol ',
           'Log entry found: Firewall (Type: Firewall) detected. [Allowed network traffic protocol \'smtp:25\' from \'server74.aws.com\' to \'Mail server EP\'. Rule \'Internet_to_Mail_Server\'.]',
           'Log entry found: Firewall (Type: Firewall) detected. [Allowed network traffic protocol \'https:443\' from \'Proxy server\' to \'server74.aws.com\'. Rule \'Proxy_to_Internet, https:443\'.]',
           'Log entry found: Firewall (Type: Firewall) blocked. [Blocked network traffic protocol \'https:443\' from \'PCSZT03\' to \'Firewall TSO Enterprise\'.]',
           'Log analysis on ',
           'Log analysis on \'Firewall TSO Enterprise\' for period 1.1.2022. 0:00:00 - 4.2.2022. 13:50:44 finished. Report is ready.']

In [None]:
output_root_dir = gdrive_root_dir + '/aitextgen/outputs/' + run_id
output_ext = '.txt'

# Create the output folder (and any missing parents) if it does not exist yet.
os.makedirs(output_root_dir, exist_ok=True)

In [None]:
# Load the trained model from Google Drive onto the GPU.
ai = aitextgen.aitextgen(model_folder=model_load_dir, to_gpu=True)

output_basepath = output_root_dir + '/' + run_id + '_output'

# Write one file per prompt; the index suffix is only added for multiple prompts.
for i, prompt in enumerate(prompts):
    if len(prompts) > 1:
        current_output = output_basepath + '.' + str(i) + output_ext
    else:
        current_output = output_basepath + output_ext
    ai.generate_to_file(n=num_outputs,
                        batch_size=1,
                        prompt=prompt,
                        max_length=max_length,
                        temperature=temperature,
                        top_p=top_p,
                        destination_path=current_output)

13/06/2022 16:27:20 — INFO — aitextgen — Loading model from provided weights and config in //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11.
13/06/2022 16:27:27 — INFO — aitextgen — GPT2 loaded with 124M parameters.
13/06/2022 16:27:27 — INFO — aitextgen — Using the default GPT-2 Tokenizer.
13/06/2022 16:27:41 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.0.txt


13/06/2022 16:29:28 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.1.txt


13/06/2022 16:31:15 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.2.txt


13/06/2022 16:33:02 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.3.txt


13/06/2022 16:34:48 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.4.txt


13/06/2022 16:36:34 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.5.txt


13/06/2022 16:38:20 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.6.txt


13/06/2022 16:40:07 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.7.txt


13/06/2022 16:41:54 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.8.txt


13/06/2022 16:43:41 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.9.txt


13/06/2022 16:45:27 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.10.txt


13/06/2022 16:47:13 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.11.txt


13/06/2022 16:48:56 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.12.txt


13/06/2022 16:50:37 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.13.txt


13/06/2022 16:52:22 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.14.txt


13/06/2022 16:54:09 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.15.txt


13/06/2022 16:55:53 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.16.txt


13/06/2022 16:57:35 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.17.txt


13/06/2022 16:59:19 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.18.txt


13/06/2022 17:01:02 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.19.txt


13/06/2022 17:02:45 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.20.txt


13/06/2022 17:04:29 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.21.txt


13/06/2022 17:06:08 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.22.txt


13/06/2022 17:07:47 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.23.txt


13/06/2022 17:09:27 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.24.txt


13/06/2022 17:11:11 — INFO — aitextgen — Generating 10 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41/aitextgen-CCS-124M-5400-600-generate_run_2022-06-13-16-23-41_output.25.txt



# LICENSE

MIT License

Copyright (c) 2020-2021 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.