#  aitextgen — Train a GPT-2 (or GPT Neo) Text-Generating Model w/ GPU

by [Max Woolf](https://minimaxir.com)

*Last updated: May 16th, 2021 (aitextgen v0.5.2)*

Retrain an advanced text-generating neural network on any text dataset **for free on a GPU in Colaboratory** with `aitextgen`!

For more about `aitextgen`, you can visit [this GitHub repository](https://github.com/minimaxir/aitextgen) or [read the documentation](https://docs.aitextgen.io/).


To get started:

1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Run the cells below:


In [1]:
!pip install -q aitextgen

In [2]:
import aitextgen
import gc
import logging
import torch

In [3]:
logging.basicConfig(format='%(asctime)s — %(levelname)s — %(name)s — %(message)s',
                    datefmt='%d/%m/%Y %H:%M:%S',
                    level=logging.DEBUG)

## GPU

Colaboratory assigns an Nvidia P4, T4, P100, or V100 GPU. For finetuning GPT-2 124M, any of these GPUs will be fine, but for text generation, a T4 or a P100 is ideal since they have more VRAM. **If you receive a T4 or a V100 GPU, you can enable `fp16=True` during training for faster/more memory-efficient training.**

You can verify which GPU is active by running the cell below. If you want to try for a different GPU, go to **Runtime -> Factory Reset Runtime**.

In [4]:
!nvidia-smi

Fri Apr 29 11:30:43 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   46C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
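
The `fp16` advice above can be turned into a small check on the GPU name that `nvidia-smi` reports. This is a minimal sketch: `fp16_recommended` is a hypothetical helper, not part of aitextgen, and it assumes the reported name contains the model string (as in `Tesla T4` above).

```python
# GPU models for which this notebook recommends fp16=True during training.
FP16_GPUS = ('T4', 'V100')

def fp16_recommended(gpu_name):
    """Return True if the GPU name matches a model recommended for fp16."""
    return any(tag in gpu_name for tag in FP16_GPUS)

print(fp16_recommended('Tesla T4'))    # True
print(fp16_recommended('Tesla P100'))  # False
```

The result could then be passed as the `fp16` argument when training, instead of hard-coding it.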

## Loading GPT-2 or GPT Neo

If you're retraining a model on new text, you need to download and load the GPT-2 model into the GPU. 

There are several sizes of GPT-2:

* `124M` (default): the "small" model, 500MB on disk.
* `355M`: the "medium" model, 1.5GB on disk.
* `774M`: the "large" model, 3GB on disk.

You can also finetune a GPT Neo model instead. GPT Neo is more suitable for longer texts, and its base model was trained on more recent data:

* `125M`: Analogous to the GPT-2 124M model.
* `350M`: Analogous to the GPT-2 355M model.

The next cell downloads the model and saves it in the Colaboratory VM. If the model has already been downloaded, running this cell will reload it.

In [5]:
model='124M'
#model='355M'
#model='774M'

#model='gpt-neo-125M'
#model='gpt-neo-350M'

In [6]:
# GPT-2 sizes are loaded by name via tf_gpt2; any other value is treated as an EleutherAI GPT Neo model
if model in ('124M', '355M', '774M'):
    ai = aitextgen.aitextgen(tf_gpt2=model, to_gpu=True)
else:
    ai = aitextgen.aitextgen(model='EleutherAI/' + model, to_gpu=True)

29/04/2022 11:30:43 — INFO — aitextgen — Loading 124M GPT-2 model from /aitextgen.
29/04/2022 11:30:47 — INFO — aitextgen — GPT2 loaded with 124M parameters.
29/04/2022 11:30:47 — INFO — aitextgen — Using the default GPT-2 Tokenizer.
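
Because a typo in `model` silently falls through to the GPT Neo branch, it can help to validate the name before any download starts. The sketch below only builds the keyword arguments used in the cell above. `aitextgen_kwargs` and the two tuples are hypothetical names introduced for illustration.

```python
# Model names offered in this notebook.
GPT2_SIZES = ('124M', '355M', '774M')
GPT_NEO_SIZES = ('gpt-neo-125M', 'gpt-neo-350M')

def aitextgen_kwargs(model):
    """Map a model name to the constructor arguments used in this notebook."""
    if model in GPT2_SIZES:
        return {'tf_gpt2': model, 'to_gpu': True}
    if model in GPT_NEO_SIZES:
        return {'model': 'EleutherAI/' + model, 'to_gpu': True}
    raise ValueError('Unknown model name: ' + repr(model))

kwargs = aitextgen_kwargs('124M')
print(kwargs)  # {'tf_gpt2': '124M', 'to_gpu': True}
# In the notebook this would be used as: ai = aitextgen.aitextgen(**kwargs)
```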


## Mounting Google Drive

The best way to get input text to-be-trained into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*.

Running this cell (which will only work in Colaboratory) will mount your personal Google Drive in the VM, which later cells can use to get data in/out. (It will ask for an auth code; that code is not saved anywhere.)

In [7]:
aitextgen.colab.mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [8]:
gdrive_rootdir = '/content/drive/My Drive/'

## Uploading a Text File to be Trained to Colaboratory

In the Colaboratory Notebook sidebar on the left of the screen, select *Files*. From there you can upload files:

![alt text](https://i.imgur.com/w3wvHhR.png)

Upload **any smaller text file** (for example, [a text file of Shakespeare plays](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)) and update the file name in the cell below, then run the cell.

In [9]:
file_basename = 'dataset_cache'
file_ext = '.tar.gz'
num_files = 9
from_cache = True

If your text file is large (>10MB), it is recommended to upload that file to Google Drive first, then copy that file from Google Drive to the Colaboratory VM.

Additionally, you may want to consider [compressing the dataset to a cache first](https://docs.aitextgen.io/dataset/) on your local computer, then uploading the resulting `dataset_cache.tar.gz` and setting `file_basename` and `file_ext` in the previous cell to match.
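
This notebook trains on several numbered cache files (`dataset_cache.0.tar.gz`, `dataset_cache.1.tar.gz`, and so on) read from Google Drive. A quick way to catch a missing upload before training starts is to build the same paths the training loop builds and check that each exists. This is a sketch: `cache_paths` and `missing_files` are hypothetical helper names.

```python
import os

def cache_paths(rootdir, basename, ext, num_files):
    """Build the numbered cache file paths the training loop will read."""
    return [rootdir + basename + '.' + str(i) + ext for i in range(num_files)]

def missing_files(paths):
    """Return the paths that do not exist (for example, not yet uploaded)."""
    return [p for p in paths if not os.path.exists(p)]

paths = cache_paths('/content/drive/My Drive/', 'dataset_cache', '.tar.gz', 9)
print(paths[0])  # /content/drive/My Drive/dataset_cache.0.tar.gz
print(len(missing_files(paths)), 'file(s) missing')
```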

## Finetune GPT-2

The next cell will start the actual finetuning of GPT-2 in aitextgen. It runs for `num_steps`, and a progress bar will appear to show training progress, current loss (the lower the better the model), and average loss (to give a sense on loss trajectory).

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

The training might time out after 4ish hours; if you did not mount Google Drive, make sure you end training and save the results so you don't lose them! (If this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup).)
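
To judge whether a run fits inside that time limit, you can extrapolate from the timestamps in a training log. The sketch below uses two timestamps from this notebook's own run (first cache file: loading at 11:30:55, final save at 11:39:43, on a T4 with `batch_size=1`). Those figures are specific to this run, and `seconds_between` and `estimated_hours` are hypothetical helpers.

```python
from datetime import datetime

LOG_DATEFMT = '%d/%m/%Y %H:%M:%S'  # same datefmt as the logging config above

def seconds_between(start, end):
    """Elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, LOG_DATEFMT) - datetime.strptime(start, LOG_DATEFMT)
    return delta.total_seconds()

def estimated_hours(num_files, seconds_per_file):
    """Rough total training time if every file takes about as long as the first."""
    return num_files * seconds_per_file / 3600

per_file = seconds_between('29/04/2022 11:30:55', '29/04/2022 11:39:43')
print(per_file)                                # 528.0
print(round(estimated_hours(9, per_file), 2))  # 1.32
```

At roughly 1.3 hours for nine files of 1,200 steps each, this particular configuration stays well under the timeout. Larger models or more steps would need re-measuring.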

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells.
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. Only works on a T4 or V100 GPU.

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. (if using `fp16`, you can increase the batch size more safely)
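
As a quick sanity check on `num_steps` and `save_every`, the steps at which a checkpoint is written can be listed ahead of time. This sketch assumes saves happen at every multiple of `save_every` up to `num_steps`, which matches the training log in this notebook. `checkpoint_steps` is a hypothetical helper.

```python
def checkpoint_steps(num_steps, save_every):
    """Steps at which the model is expected to be saved during one train() call."""
    return list(range(save_every, num_steps + 1, save_every))

num_files = 9      # number of dataset cache files trained on in sequence
num_steps = 1200   # steps per file
save_every = 200

print(checkpoint_steps(num_steps, save_every))  # [200, 400, 600, 800, 1000, 1200]
print(num_files * num_steps)                    # 10800 steps in total
```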

In [11]:
num_steps = 1200
generate_every = 1200
save_every = 200

num_steps_total = num_files * num_steps

In [None]:
# num_steps_total is an int, so convert it to a string before concatenating
folder = 'aitextgen-CCS-' + model + str(num_steps_total)

In [12]:
for i in range(num_files):
    current_file = gdrive_rootdir + file_basename + '.' + str(i) + file_ext
    ai.train(current_file,
             line_by_line=False,
             from_cache=from_cache,
             num_steps=num_steps,
             generate_every=generate_every,
             save_every=save_every,
             save_gdrive=True,
             output_dir=folder,
             learning_rate=1e-3,
             fp16=False,
             batch_size=1)

    # R.B.: required to prevent memory leaks in Colab
    gc.collect()

29/04/2022 11:30:55 — INFO — aitextgen — Loading text from /content/drive/My Drive/dataset_cache.0.tar.gz with generation length of 1024.
29/04/2022 11:30:55 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,294,942 subsets loaded via cache.
29/04/2022 11:30:55 — INFO — torch.distributed.nn.jit.instantiator — Created a temporary directory at /tmp/tmpbkr3hr6e
29/04/2022 11:30:55 — INFO — torch.distributed.nn.jit.instantiator — Writing /tmp/tmpbkr3hr6e/_remote_module_non_sriptable.py
  f"Setting `Trainer(checkpoint_callback={checkpoint_callback})` is deprecated in v1.5 and will "
  f"Setting `Trainer(progress_bar_refresh_rate={progress_bar_refresh_rate})` is deprecated in v1.5 and"
  "Setting `Trainer(weights_summary=None)` is deprecated in v1.5 and will be removed"
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
  f"The `Callback.{hook}` hook was deprecated in v1.6 and"
LOCAL_RANK:

  0%|          | 0/1200 [00:00<?, ?it/s]

  "`trainer.progress_bar_dict` is deprecated in v1.5 and will be removed in v1.7."


200 steps reached: saving model to /trained_model_124M
400 steps reached: saving model to /trained_model_124M
600 steps reached: saving model to /trained_model_124M
800 steps reached: saving model to /trained_model_124M
1,000 steps reached: saving model to /trained_model_124M
1,200 steps reached: saving model to /trained_model_124M
1,200 steps reached: generating sample texts.
, KrebsOnSecurity was alerted to a tip in the company’s name that a third-party security vendor had recently provided support on its site.

A close-source source said the company was unaware of anything to stop the breach. According to an FAQ posted on its site, the company routinely hired a third party security firm in its internal networks to help block it from disclosing the breach, but that the company’s site was offline shortly after the breach.

The CEO of the firm that hired the company was A.K. Security Solutions, a company based firm that handles cy

29/04/2022 11:39:43 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model_124M
29/04/2022 11:39:47 — INFO — aitextgen — Loading text from /content/drive/My Drive/dataset_cache.1.tar.gz with generation length of 1024.
29/04/2022 11:39:47 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,325,417 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/1200 [00:00<?, ?it/s]

200 steps reached: saving model to /trained_model_124M
400 steps reached: saving model to /trained_model_124M
600 steps reached: saving model to /trained_model_124M
800 steps reached: saving model to /trained_model_124M
1,000 steps reached: saving model to /trained_model_124M
1,200 steps reached: saving model to /trained_model_124M
1,200 steps reached: generating sample texts.
 is as difficult as possible, so as a result of the necessity of maintaining so-called “almost” cyber security solutions in today’s most recent data. Some of the most pressing questions around data breaches is that, and those are not as hard as you’re a professional.

Many companies are more critical than ever than ever, and some are more aware of what their data is being exposed online. A recent study found that 80% of them have seen a data breach than their 20th most likely year. The report also suggests that these findings are serious but is a light examp

29/04/2022 11:48:51 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model_124M
29/04/2022 11:48:56 — INFO — aitextgen — Loading text from /content/drive/My Drive/dataset_cache.2.tar.gz with generation length of 1024.
29/04/2022 11:48:56 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,140,763 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/1200 [00:00<?, ?it/s]

200 steps reached: saving model to /trained_model_124M
400 steps reached: saving model to /trained_model_124M
600 steps reached: saving model to /trained_model_124M
800 steps reached: saving model to /trained_model_124M
1,000 steps reached: saving model to /trained_model_124M
1,200 steps reached: saving model to /trained_model_124M
1,200 steps reached: generating sample texts.
 for real-world victims of all sizes.

To the extent, we have seen an increase in the prevalence of ransomware and ransomware attacks in 2016. According to the threat data, our analysis criteria were observed as a result of a decrease in ransomware in ransomware.

The increased use of ransomware is likely attributed to a decrease in ransomware, and the increase in ransomware.

The number of ransomware families in the past year has increased by more than doubled.

In the first half of 2017, we saw a significant decrease in the CCM increase in ransomware.

The

29/04/2022 11:58:02 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model_124M
29/04/2022 11:58:07 — INFO — aitextgen — Loading text from /content/drive/My Drive/dataset_cache.3.tar.gz with generation length of 1024.
29/04/2022 11:58:07 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,272,696 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/1200 [00:00<?, ?it/s]

200 steps reached: saving model to /trained_model_124M
400 steps reached: saving model to /trained_model_124M
600 steps reached: saving model to /trained_model_124M
800 steps reached: saving model to /trained_model_124M
1,000 steps reached: saving model to /trained_model_124M
1,200 steps reached: saving model to /trained_model_124M
1,200 steps reached: generating sample texts.
 is more important than ever for vulnerabilities discovered in many software.

[1]

[1]
[2],

[3]
[4]

[3]

[4]

[4]

[4]

[1]

[4]
[4]

[5]

[4]

[4]

[5]

[6]
[4]

[6]
[4]
[4]

[5]

[4]
[5]
[5]
[5]

[6]

[6]
[6]

[6]
[6]
[6]
[6]
[6]
[6]
[6]

[5]
[6]
[6]

[6]

[6]
[6]

[6]

[6]
[6]

[6]
[6]

[6]

[6]

[6]
[6]

[6]
[6]



29/04/2022 12:07:10 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model_124M
29/04/2022 12:07:14 — INFO — aitextgen — Loading text from /content/drive/My Drive/dataset_cache.4.tar.gz with generation length of 1024.
29/04/2022 12:07:14 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,392,413 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/1200 [00:00<?, ?it/s]

200 steps reached: saving model to /trained_model_124M
400 steps reached: saving model to /trained_model_124M
600 steps reached: saving model to /trained_model_124M
800 steps reached: saving model to /trained_model_124M
1,000 steps reached: saving model to /trained_model_124M
1,200 steps reached: saving model to /trained_model_124M
1,200 steps reached: generating sample texts.
 the web server and the server the server you’ve connected to.

This bug is patched as soon as possible, although it is still open in limited testing, and is not yet patched.

Fobe published a patch on Tuesday, so you can check whether you have a patch yet, and whether you have a WebKit, or an ASK, and if you have a hardware, you can take advantage.

For a list of other details on how to fix this bug, please read on.

A bug that’s supposed to be exploited by hackers to steal more than a million user’s passwords, and to steal them.

A researcher has dubbed a 

29/04/2022 12:16:16 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model_124M
29/04/2022 12:16:21 — INFO — aitextgen — Loading text from /content/drive/My Drive/dataset_cache.5.tar.gz with generation length of 1024.
29/04/2022 12:16:21 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,332,252 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/1200 [00:00<?, ?it/s]

200 steps reached: saving model to /trained_model_124M
400 steps reached: saving model to /trained_model_124M
600 steps reached: saving model to /trained_model_124M
800 steps reached: saving model to /trained_model_124M
1,000 steps reached: saving model to /trained_model_124M
1,200 steps reached: saving model to /trained_model_124M
1,200 steps reached: generating sample texts.
.com domain, or a related domain that has been linked to a particular one of the largest DDoS attacks ever conducted.

That’s a lot. And because it’s a handy reminder that while you were running in a home country, it’s still worth a lot.


A year ago, the US presidential election administration took over the “most popular” aspect of the US meddling, and the US administration of the USA and the American Civil Liberties Union (ACLU) – one of a major cybersecurity-related topics.

In February 2016, the FBI had the first, most active, most prominent, as it was t

29/04/2022 12:25:23 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model_124M
29/04/2022 12:25:28 — INFO — aitextgen — Loading text from /content/drive/My Drive/dataset_cache.6.tar.gz with generation length of 1024.
29/04/2022 12:25:28 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,377,528 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/1200 [00:00<?, ?it/s]

200 steps reached: saving model to /trained_model_124M
400 steps reached: saving model to /trained_model_124M
600 steps reached: saving model to /trained_model_124M
800 steps reached: saving model to /trained_model_124M
1,000 steps reached: saving model to /trained_model_124M
1,200 steps reached: saving model to /trained_model_124M
1,200 steps reached: generating sample texts.
 of the stolen data, the news outlet reported.

We don’t want to be the only ones and we’re going to be the only ones with this data!

But a few of those who don’t have access to your data, and they are willing to give away your phone and have time to contact you to say they really want the data they want to be kept.

And if you have a phone with your data, or “regular” with your bank account, you may be a bit like PayPal or credit card company.

The company says that the data breach is being offered “within 48 hours”, while you are being offered a five-year

29/04/2022 12:34:30 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model_124M
29/04/2022 12:34:35 — INFO — aitextgen — Loading text from /content/drive/My Drive/dataset_cache.7.tar.gz with generation length of 1024.
29/04/2022 12:34:35 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,356,408 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/1200 [00:00<?, ?it/s]

200 steps reached: saving model to /trained_model_124M
400 steps reached: saving model to /trained_model_124M
600 steps reached: saving model to /trained_model_124M
800 steps reached: saving model to /trained_model_124M
1,000 steps reached: saving model to /trained_model_124M
1,200 steps reached: saving model to /trained_model_124M
1,200 steps reached: generating sample texts.


A study by the University of California, filed in the US District of California that prohibits the use of digital communications for advertising.

In a letter to the Electronic Privacy and Electronic Communications Privacy Act (DMPA) on 2 August at the Federal Trade Commission on 18 March 2011, the Federal Trade Commission (FCC) filed a court court on 11 April that asks the government to “appear” a “cyber-related” of “cyber-related” in an open letter on the grounds that the EU’s National Consumer Commission (EPAA) found the broadband providers whose intern

29/04/2022 12:43:36 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model_124M
29/04/2022 12:43:41 — INFO — aitextgen — Loading text from /content/drive/My Drive/dataset_cache.8.tar.gz with generation length of 1024.
29/04/2022 12:43:41 — INFO — aitextgen.TokenDataset — TokenDataset containing 1,919,852 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/1200 [00:00<?, ?it/s]

200 steps reached: saving model to /trained_model_124M
400 steps reached: saving model to /trained_model_124M
600 steps reached: saving model to /trained_model_124M
800 steps reached: saving model to /trained_model_124M
1,000 steps reached: saving model to /trained_model_124M
1,200 steps reached: saving model to /trained_model_124M
1,200 steps reached: generating sample texts.
, they do even get to see a greeting from their postal code.

So, do you know when you’ve fallen for this sort of scam?  Well, it is.  Although most of you remember it’s likely that it’s not surprising that the scammers are using it to warn you about a scam as they could use something they’ve not seen on the internet, it’s a very bad idea to call a simple telephone number, and say that it’s a phishing scam designed to be sent to you by your ISP for personal information which could earn money from the scammers.

So, what would it mean for you to use this sort

29/04/2022 12:52:43 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model_124M


You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model.


## Load a Trained Model

If you already have a trained model from this notebook, running the next cell will copy the `pytorch_model.bin` and `config.json` files from the specified folder in Google Drive into the Colaboratory VM. (If no `from_folder` is specified, it assumes the two files are located at the root level of your Google Drive.)

In [13]:
#aitextgen.colab.copy_file_from_gdrive('pytorch_model.bin', folder)
#aitextgen.colab.copy_file_from_gdrive('config.json', folder)
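
Before loading, it can be worth confirming that both files actually made it into the folder. The following is a minimal sketch using only the standard library. `missing_model_files` is a hypothetical helper, and the demo runs against a temporary folder rather than a real model directory.

```python
import os
import tempfile

# The two files this notebook copies from Google Drive.
REQUIRED_FILES = ('pytorch_model.bin', 'config.json')

def missing_model_files(folder):
    """Return the required model files that are not present in folder."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(folder, name))]

# Demo on a temporary folder that only contains config.json.
with tempfile.TemporaryDirectory() as demo_folder:
    with open(os.path.join(demo_folder, 'config.json'), 'w') as f:
        f.write('{}')
    missing = missing_model_files(demo_folder)

print(missing)  # ['pytorch_model.bin']
```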

The next cell will allow you to load the retrained model + metadata necessary to generate text.

## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text.

**If you just trained a model**, you'll get much faster generation performance if you reload the model; the next cell will reload the model you just trained from its output folder (`folder`).

In [14]:
ai = aitextgen.aitextgen(model_folder=folder, to_gpu=True)

29/04/2022 12:52:47 — INFO — aitextgen — Loading model from provided weights and config in /trained_model_124M.
29/04/2022 12:52:49 — INFO — aitextgen — GPT2 loaded with 124M parameters.
29/04/2022 12:52:49 — INFO — aitextgen — Using the default GPT-2 Tokenizer.


`generate()` without any parameters generates a single text from the loaded model to the console.

In [15]:
ai.generate()

as well as the most embarrassing case for Sophos and Sophos, and the truth of this incident is that we are now detecting this attack as Troj/Agent-3970 and the most destructive malware threat in the world.

Sophos detects this Trojan horse as W32/Agent-JQ.


There are still 23 security threats in our spam traps today, we have been issuing detection of the malware.  We are currently analysing the spam messages, and we are issuing detection.

SophosLabs have noticed that the malicious code is being dropped on a compromised computer.  We are continuing to monitor the malware we have seen in these attempts to distribute malware.


For one week’s spam campaign that was distributed by a remote hacker, we are now seeing a new wave being distributed via the internet today by hackers.

The first thing in the world that we see is that a new campaign targeting the UK government is growing.  The message which we have seen on the internet is still being spread by the most destructive message.


The

If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = ai.generate_one()`.

You can also pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid going OOM).

Other optional-but-helpful parameters for `ai.generate()` and friends:

*  **`min_length`**: The minimum length of the generated text: if the text is shorter than this value after cleanup, aitextgen will generate another one.
*  **`max_length`**: Number of tokens to generate (default 256, you can generate up to 1024 tokens with GPT-2 and 2048 with GPT Neo)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
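
To build intuition for `temperature`, `top_k`, and `top_p`, here is a toy illustration on a four-token distribution. It is a simplified sketch of the general techniques, not aitextgen's or the underlying library's implementation. All function names and the toy probabilities are made up for the example.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; higher values flatten the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely (token, prob) pairs; k=0 disables the filter."""
    ranked = sorted(probs, key=lambda pair: pair[1], reverse=True)
    return ranked if k == 0 else ranked[:k]

def top_p_filter(probs, p):
    """Keep the most likely tokens until their cumulative probability reaches p."""
    ranked = sorted(probs, key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return kept

toy = [('the', 0.5), ('a', 0.3), ('an', 0.15), ('zebra', 0.05)]
nucleus = [token for token, _ in top_p_filter(toy, 0.9)]
top2 = [token for token, _ in top_k_filter(toy, 2)]
print(nucleus)  # ['the', 'a', 'an']
print(top2)     # ['the', 'a']

# A lower temperature concentrates probability on the most likely token.
sharp = apply_temperature([2.0, 1.0, 0.0], 0.7)
flat = apply_temperature([2.0, 1.0, 0.0], 1.0)
print(sharp[0] > flat[0])  # True
```

In a real sampler the surviving probabilities are renormalized before a token is drawn; that step is omitted here to keep the filtering logic visible.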

In [16]:
prompts = ['Digital Forensics Analysis Report\n',
           'This report is ',
           'The contents of ',
           'Conclusion\n',
           'It is recommended that ',
           'In the opinion of the expert, ']

In [None]:
for prompt in prompts:
    ai.generate(n=5,
                batch_size=1,
                prompt=prompt,
                max_length=1000,
                temperature=1.0,
                top_p=0.9)

Digital Forensics Analysis Report
Car crash events
Car crash websites
Per HaLL Signing for driver and key system

The attacker appears to have changed DNS entries to those of the affected webmasters in Germany, aiming the DNS server at the end of June 2009.

In the last day’s Patch Tuesday, Microsoft announced an emergency emergency patch for IE on its regular scheduled patch cycle.

The patch, which was released at approximately 11am, did not be released today but it would be a good idea to follow the usual release of IE patching.

I also strongly encourage all of us to change your DNS DNS settings, ensure that their DNS records remain secure, and make sure that their computers can be patched when they are compromised.

What makes this security bulletin really important is the fact that a company like Sophos has a DNS DNS records of your customers.  If the DNS records are compromised, it could be compromised and potentially be compromised.

All patches are released immediately

For bulk generation, you can generate a large number of texts to files and sort out the samples locally on your computer. The next cell will generate `num_outputs` files per prompt, each with `n` texts and whatever other parameters you would pass to `generate()`. The files can then be downloaded from the Files sidebar!

You can rerun the cells as many times as you want for even more generated texts!

In [None]:
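#num_outputs = 0

# Note: max_length is limited to 1024 tokens for GPT-2 (2048 for GPT Neo),
# so lower the value below when generating with a GPT-2 model.
#for prompt in prompts:
#    for _ in range(num_outputs):
#        ai.generate_to_file(n=200,
#                            batch_size=1,
#                            prompt=prompt,
#                            max_length=2000,
#                            temperature=1.0,
#                            top_p=0.9)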

# LICENSE

MIT License

Copyright (c) 2020-2021 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.