#  aitextgen — Train a GPT-2 (or GPT Neo) Text-Generating Model w/ GPU

by [Max Woolf](https://minimaxir.com)

*Last updated: May 16th, 2021 (aitextgen v0.5.2)*

Retrain an advanced text generating neural network on any text dataset **for free on a GPU using Colaboratory** using `aitextgen`!

For more about `aitextgen`, you can visit [this GitHub repository](https://github.com/minimaxir/aitextgen) or [read the documentation](https://docs.aitextgen.io/).


To get started:

1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Run the cells below:


In [None]:
!pip install -q aitextgen



In [None]:
import aitextgen
import datetime
import gc
import logging
import os
import requests
import torch

In [None]:
session_url = 'http://172.28.0.2:9000/api/sessions'
notebook_name = requests.get(session_url).json()[0]['name']

run_datetime = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
run_id = notebook_name + '_run_' + run_datetime

In [None]:
log_format = '%(asctime)s — %(levelname)s — %(name)s — %(message)s'
date_format = '%d/%m/%Y %H:%M:%S'
log_level = logging.DEBUG

logging.basicConfig(format=log_format, datefmt=date_format, level=log_level)

## GPU

Colaboratory uses an Nvidia P4, an Nvidia T4, an Nvidia P100, or an Nvidia V100. For finetuning GPT-2 124M, any of these GPUs will be fine, but for text generation, a T4 or a P100 is ideal since they have more VRAM. **If you receive a T4 or a V100 GPU, you can enable `fp16=True` during training for faster and more memory-efficient training.**

You can verify which GPU is active by running the cell below. If you want to try for a different GPU, go to **Runtime -> Factory Reset Runtime**.

In [None]:
!nvidia-smi

Sun Jun 12 12:00:11 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
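
As a rough sketch of the `fp16` advice above, the check below reads the GPU name through PyTorch and reports whether it is one of the two GPUs named in the text (T4 or V100). `fp16_recommended` and `detect_gpu_name` are hypothetical helper names, not part of aitextgen, and matching on the substrings `T4` and `V100` is an assumption about how the device name is reported.

```python
def fp16_recommended(gpu_name: str) -> bool:
    """Return True if the GPU name matches a GPU the text recommends fp16 for."""
    name = gpu_name.upper()
    # Assumption: the reported device name contains "T4" or "V100".
    return any(tag in name for tag in ("T4", "V100"))


def detect_gpu_name() -> str:
    """Return the name of GPU 0, or an empty string if none is usable."""
    try:
        import torch
    except ImportError:
        return ""
    if not torch.cuda.is_available():
        return ""
    return torch.cuda.get_device_name(0)


gpu_name = detect_gpu_name()
use_fp16 = fp16_recommended(gpu_name)
print(f"GPU: {gpu_name or 'none detected'}; fp16 recommended: {use_fp16}")
```

The resulting `use_fp16` value could then be passed as `fp16=use_fp16` in the training cell instead of a hard-coded value.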

## Mounting Google Drive

The best way to get training text into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*.

Running this cell (which will only work in Colaboratory) will mount your personal Google Drive in the VM, which later cells can use to get data in and out. (It will ask for an auth code; that auth is not saved anywhere.)

In [None]:
aitextgen.colab.mount_gdrive()

Mounted at /content/drive


In [None]:
gdrive_root_dir = '/content/drive/My Drive'


## Load a Trained Model

If you already have a trained model from this notebook, set `load_model` in the next cell to the name of its run folder. The cell after it then loads the `pytorch_model.bin` and `config.json` files directly from `aitextgen/models/<load_model>` in your Google Drive. Leave `load_model` as `None` to start from a base model instead.

In [None]:
load_model = None
#load_model = 'aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15'

In [None]:
if load_model is not None:
    model_load_dir = gdrive_root_dir + '/aitextgen/models/' + load_model
    ai = aitextgen.aitextgen(model_folder=model_load_dir, to_gpu=True)

The cell above loads the retrained model and the metadata necessary to generate text.
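
Before loading from Google Drive, it can help to confirm that the folder actually contains the two files named above. The following is a minimal sketch using only the standard library; `missing_model_files` is a hypothetical helper, and the temporary folder stands in for a real model folder.

```python
import os
import tempfile

# The two files the text above says a saved model consists of
REQUIRED_FILES = ("pytorch_model.bin", "config.json")


def missing_model_files(model_dir):
    """Return the required model files that are not present in model_dir."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(model_dir, name))]


# Demonstration on a temporary folder that only contains config.json
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "config.json"), "w").close()
    demo_missing = missing_model_files(demo_dir)
    print("Missing files:", demo_missing)
```

In the notebook, calling `missing_model_files(model_load_dir)` before constructing the model would give a clearer error than a failed load.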

## Loading GPT-2 or GPT Neo

If you're retraining a model on new text, you need to download and load the GPT-2 model into the GPU. 

There are several sizes of GPT-2:

* `124M` (default): the "small" model, 500MB on disk.
* `355M`: the "medium" model, 1.5GB on disk.
* `774M`: the "large" model, 3GB on disk.

You can also finetune a GPT Neo model instead, which is more suitable for longer texts and whose base model was trained on more recent data:

* `125M`: Analogous to the GPT-2 124M model.
* `350M`: Analogous to the GPT-2 355M model.

The next cell downloads the model and saves it in the Colaboratory VM. If the model has already been downloaded, running this cell will reload it.

In [None]:
model='124M'
#model='355M'
#model='774M'

#model='EleutherAI/gpt-neo-125M'
#model='EleutherAI/gpt-neo-350M'

In [None]:
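if load_model is None:
    if model in ('124M', '355M', '774M'):
        # GPT-2 sizes: download the TensorFlow weights and convert them to PyTorch
        ai = aitextgen.aitextgen(tf_gpt2=model, to_gpu=True)
    else:
        # Anything else is treated as a model name such as 'EleutherAI/gpt-neo-125M'
        ai = aitextgen.aitextgen(model=model, to_gpu=True)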

12/06/2022 12:00:32 — INFO — aitextgen — Downloading the 124M GPT-2 TensorFlow weights/config from Google's servers
12/06/2022 12:00:32 — DEBUG — urllib3.connectionpool — Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
12/06/2022 12:00:32 — DEBUG — urllib3.connectionpool — https://openaipublic.blob.core.windows.net:443 "GET /gpt-2/models/124M/checkpoint HTTP/1.1" 200 77


Fetching checkpoint:   0%|          | 0.00/77.0 [00:00<?, ?it/s]

12/06/2022 12:00:32 — DEBUG — urllib3.connectionpool — Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
12/06/2022 12:00:32 — DEBUG — urllib3.connectionpool — https://openaipublic.blob.core.windows.net:443 "GET /gpt-2/models/124M/hparams.json HTTP/1.1" 200 90


Fetching hparams.json:   0%|          | 0.00/90.0 [00:00<?, ?it/s]

12/06/2022 12:00:32 — DEBUG — urllib3.connectionpool — Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
12/06/2022 12:00:33 — DEBUG — urllib3.connectionpool — https://openaipublic.blob.core.windows.net:443 "GET /gpt-2/models/124M/model.ckpt.data-00000-of-00001 HTTP/1.1" 200 497759232


Fetching model.ckpt.data-00000-of-00001:   0%|          | 0.00/498M [00:00<?, ?it/s]

12/06/2022 12:00:48 — DEBUG — urllib3.connectionpool — Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
12/06/2022 12:00:48 — DEBUG — urllib3.connectionpool — https://openaipublic.blob.core.windows.net:443 "GET /gpt-2/models/124M/model.ckpt.index HTTP/1.1" 200 5215


Fetching model.ckpt.index:   0%|          | 0.00/5.21k [00:00<?, ?it/s]

12/06/2022 12:00:48 — DEBUG — urllib3.connectionpool — Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
12/06/2022 12:00:48 — DEBUG — urllib3.connectionpool — https://openaipublic.blob.core.windows.net:443 "GET /gpt-2/models/124M/model.ckpt.meta HTTP/1.1" 200 471155


Fetching model.ckpt.meta:   0%|          | 0.00/471k [00:00<?, ?it/s]

12/06/2022 12:00:48 — INFO — aitextgen — Converting the 124M GPT-2 TensorFlow weights to PyTorch.
Converting TensorFlow checkpoint from /content/aitextgen/124M
Loading TF weight model/h0/attn/c_attn/b with shape [2304]
Loading TF weight model/h0/attn/c_attn/w with shape [1, 768, 2304]
Loading TF weight model/h0/attn/c_proj/b with shape [768]
Loading TF weight model/h0/attn/c_proj/w with shape [1, 768, 768]
Loading TF weight model/h0/ln_1/b with shape [768]
Loading TF weight model/h0/ln_1/g with shape [768]
Loading TF weight model/h0/ln_2/b with shape [768]
Loading TF weight model/h0/ln_2/g with shape [768]
Loading TF weight model/h0/mlp/c_fc/b with shape [3072]
Loading TF weight model/h0/mlp/c_fc/w with shape [1, 768, 3072]
Loading TF weight model/h0/mlp/c_proj/b with shape [768]
Loading TF weight model/h0/mlp/c_proj/w with shape [1, 3072, 768]
Loading TF weight model/h1/attn/c_attn/b with shape [2304]
Loading TF weight model/h1/attn/c_attn/w with shape [1, 768, 2304]
Loading TF weight

Save PyTorch model to aitextgen/pytorch_model.bin


12/06/2022 12:00:54 — INFO — aitextgen — Loading 124M GPT-2 model from /aitextgen.


Save configuration file to aitextgen/config.json


12/06/2022 12:00:56 — INFO — aitextgen — GPT2 loaded with 124M parameters.
12/06/2022 12:00:56 — INFO — aitextgen — Using the default GPT-2 Tokenizer.


## Uploading a Text File to be Trained to Colaboratory

In the Colaboratory Notebook sidebar on the left of the screen, select *Files*. From there you can upload files:

![alt text](https://i.imgur.com/w3wvHhR.png)

Upload **any smaller text file** (for example, [a text file of Shakespeare plays](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)) and update the file name in the cell below, then run the cell.

In [None]:
data_root_dir = gdrive_root_dir + '/aitextgen/training_data'
datasets = ['articles', 'reports']
dataset_splits = [9, 1]
dataset_iterations = [600, 600]
dataset_runs = [1, 1]
dataset_learnrates = [1e-3, 1e-6]
save_partial = False

file_basename = 'dataset_cache'
file_ext = '.tar.gz'
from_cache = True

If your text file is large (>10MB), it is recommended to upload that file to Google Drive first, then copy that file from Google Drive to the Colaboratory VM.

Additionally, you may want to consider [compressing the dataset to a cache first](https://docs.aitextgen.io/dataset/) on your local computer, then uploading the resulting `dataset_cache.tar.gz` and setting `file_basename` and `file_ext` in the previous cell to match it.
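
The configuration cell above implies a file layout on Google Drive: a dataset with more than one split is read from numbered caches (`dataset_cache.0.tar.gz`, `dataset_cache.1.tar.gz`, and so on), while a single-split dataset is read from `dataset_cache.tar.gz`. The sketch below lists the expected paths using the same naming as the training loop later in this notebook; `cache_files` and `missing_files` are hypothetical helpers, not aitextgen functions.

```python
import os


def cache_files(data_root_dir, dataset, n_splits,
                file_basename="dataset_cache", file_ext=".tar.gz"):
    """List the cache files expected for one dataset."""
    base = data_root_dir + "/" + dataset + "/" + file_basename
    if n_splits > 1:
        # Split datasets use a numeric suffix before the extension
        return [f"{base}.{k}{file_ext}" for k in range(n_splits)]
    return [base + file_ext]


def missing_files(paths):
    """Return the paths that do not exist on disk."""
    return [path for path in paths if not os.path.isfile(path)]


root = "/content/drive/My Drive/aitextgen/training_data"
for name, splits in zip(["articles", "reports"], [9, 1]):
    expected = cache_files(root, name, splits)
    print(name, "->", len(expected), "file(s),",
          len(missing_files(expected)), "missing")
```

Running a check like this before training makes a misnamed or missing upload visible immediately rather than partway through a long run.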

## Finetune GPT-2

The next cell will start the actual finetuning of GPT-2 in aitextgen. It runs for `num_steps`, and a progress bar will appear to show training progress, current loss (the lower the better the model), and average loss (to give a sense on loss trajectory).

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

The training might time out after 4ish hours; if you did not mount to Google Drive, make sure you end training and save the results so you don't lose them! (if this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup))

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells.
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. Only works on a T4 or V100 GPU.

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. (if using `fp16`, you can increase the batch size more safely)
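
In this notebook `train()` is called once per cache file, so `num_steps` applies to each call rather than to the whole run. As a small worked example, the total number of steps per dataset follows from the configuration values set earlier; `steps_per_dataset` is a hypothetical helper written for this illustration.

```python
# Values copied from the configuration cell earlier in this notebook
datasets = ['articles', 'reports']
dataset_splits = [9, 1]
dataset_iterations = [600, 600]
dataset_runs = [1, 1]


def steps_per_dataset(names, splits, iterations, runs):
    """Total training steps per dataset: splits x steps per call x runs."""
    return {name: s * it * r
            for name, s, it, r in zip(names, splits, iterations, runs)}


totals = steps_per_dataset(datasets, dataset_splits,
                           dataset_iterations, dataset_runs)
print(totals)                # {'articles': 5400, 'reports': 600}
print(sum(totals.values()))  # 6000
```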

In [None]:
model_root_dir = gdrive_root_dir + '/aitextgen/models/' + run_id

if save_partial:
    model_partial_dir = model_root_dir + '/partial'
    partial_models = []

In [None]:
for i in range(len(datasets)):
    # Common path prefix for this dataset's cache file(s)
    file_basepath = data_root_dir + '/' + datasets[i] + '/' + file_basename

    for j in range(dataset_runs[i]):
        # Every run except the final run of the final dataset is an
        # intermediate model; with save_partial it gets its own folder.
        if save_partial and (i < len(datasets) - 1 or j < dataset_runs[i] - 1):
            partial_model_name = datasets[i] + '_trainrun-' + str(j + 1)
            model_dir = model_partial_dir + '/' + partial_model_name
            partial_models.append(partial_model_name)
        else:
            model_dir = model_root_dir

        for k in range(dataset_splits[i]):
            # Split datasets are stored as numbered caches (.0, .1, ...)
            if dataset_splits[i] > 1:
                current_file = file_basepath + '.' + str(k) + file_ext
            else:
                current_file = file_basepath + file_ext
            # Train on one cache file; sample text and save once, at the end
            ai.train(current_file,
                     line_by_line=False,
                     from_cache=from_cache,
                     num_steps=dataset_iterations[i],
                     generate_every=dataset_iterations[i],
                     save_every=dataset_iterations[i],
                     save_gdrive=False,
                     run_id=run_id,
                     output_dir=model_dir,
                     learning_rate=dataset_learnrates[i],
                     fp16=False,
                     batch_size=1)

            # R.B.: required to prevent memory leaks in Colab
            gc.collect()

12/06/2022 12:01:10 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.0.tar.gz with generation length of 1024.
12/06/2022 12:01:11 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,297,405 subsets loaded via cache.
12/06/2022 12:01:11 — INFO — torch.distributed.nn.jit.instantiator — Created a temporary directory at /tmp/tmpqahu29yf
12/06/2022 12:01:11 — INFO — torch.distributed.nn.jit.instantiator — Writing /tmp/tmpqahu29yf/_remote_module_non_sriptable.py
  f"Setting `Trainer(checkpoint_callback={checkpoint_callback})` is deprecated in v1.5 and will "
  f"Setting `Trainer(progress_bar_refresh_rate={progress_bar_refresh_rate})` is deprecated in v1.5 and"
  "Setting `Trainer(weights_summary=None)` is deprecated in v1.5 and will be removed"
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
  f"The `Callback.{hook}` hook was de

  0%|          | 0/600 [00:00<?, ?it/s]

  "`trainer.progress_bar_dict` is deprecated in v1.5 and will be removed in v1.7."


600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
600 steps reached: generating sample texts.
.�Duo” (Duo) said in an emailed statement. “We’ve seen a lot of action against people who are looking for the mobile phone accounts, and on this one of the most important guys that we’re all going to see right now.”

The company’s statement continues:

“We are making a lot of efforts to make sure that people have a decent record of providing usernames and passwords to us, and to have other data out there.”

The statement continues:

“We need to have a number with usernames,” the statement continues. “And we are going to have to do this to ensure that usernames and passwords can be used to create passwords, passwords and even some passwords that needs to be entered to any site that has some good idea.”

The statement continues:

“We are committed to providing these services to others.”

The statement 

12/06/2022 12:05:09 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
12/06/2022 12:05:11 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.1.tar.gz with generation length of 1024.
12/06/2022 12:05:12 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,291,300 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/600 [00:00<?, ?it/s]

600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
600 steps reached: generating sample texts.
 offers the following and how they re doing (or take advantage of the a act that they have you you and you you do (or act play play play play play play play with your internet internet internet and internet your internet network and your online online – and you and you as well. Whether you are more than you you you and you be you you you with your online online online and you off that you need to help as well. Whether you are your you you that you are here and you here as to you you family online and you share you as you have you you and you you help do here you to you you be you here and you you be


12/06/2022 12:09:20 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
12/06/2022 12:09:22 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.2.tar.gz with generation length of 1024.
12/06/2022 12:09:23 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,133,040 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/600 [00:00<?, ?it/s]

600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
600 steps reached: generating sample texts.
, the attackers can easily move to the cloud, use cloud storage to move laterally and deploy the malware.

In the last half of the year, the threat landscape is growing. Today we’ve seen the threat landscape be growing. The new reality is that the number of systems running Windows and Windows continues to be increasingly increasingly vulnerable, with more than the number of systems running on the infected machine. The Windows Defender Advanced Notification Service (AC) is a fundamental part of Windows Defender Advanced Notification Service (AD), available from Windows 10, and newer Windows Defender Advanced Notification Service (AD).

The threat landscape is growing. Windows Defender Advanced Notification Service (ANS) is an essential part of Windows 10. Windows Defender Advanced Notification Service

12/06/2022 12:13:33 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
12/06/2022 12:13:35 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.3.tar.gz with generation length of 1024.
12/06/2022 12:13:35 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,277,210 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/600 [00:00<?, ?it/s]

600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
600 steps reached: generating sample texts.
, and I’ve gotten a lot of questions from customers, even if you’ve used them to provide more information on the Web.

To help to understand more frequently I receive from customers, I am happy to tell that we have the opportunity to make changes to the Web.


I’m happy to know that our products are now available on the Web. Security will be available for free download.

For more information on Security Updates and other security related issues related to Web, please head over to our Microsoft TechNet Library.






I’ve had a chance to blogged about in the early 2000s and 2001s at Microsoft Security Intelligence Report Report. We’ve been doing this month with the Security Bulletins, which provides guidance to customers on how to avoid online security misconfigurations and information misconfiguratio

12/06/2022 12:17:45 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
12/06/2022 12:17:47 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.4.tar.gz with generation length of 1024.
12/06/2022 12:17:47 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,396,272 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/600 [00:00<?, ?it/s]

600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
600 steps reached: generating sample texts.
, the “one” of the “an abundance” of caution, which is a “one of caution” for IT security companies.

That means that it’s not what we should do now, too.

Our advice about this is to consider that we’re not trying to do things that could be done without actually having to think.

And if we do know, we are ready to say, “Our Sophos Rapid Response team has been released, and we have a lot of new features.”

And then, we’re ready to say, “Sophos Rapid Response”…

…and we just love the new features, so we want to make it simple and we’re ready to use them in the hope that we make our products are now available.”

And, of course, we’ll be testing and we’re getting a lot more advanced in the coming months to make our products work.

In other words, we’re ready to be ready to say “What”…


And we know wher

12/06/2022 12:21:56 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
12/06/2022 12:21:58 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.5.tar.gz with generation length of 1024.
12/06/2022 12:21:59 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,330,789 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/600 [00:00<?, ?it/s]

600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
600 steps reached: generating sample texts.
The Endpoint Security SOS and the National Institute (NIST) and the National Cyber Defense Agency (NSA) on Thursday announced that the DHS’s Office of the National Health Service (ICOC) is investigating the alleged test of the test of “this test.”

The report’s not just the FBI’s Office of National Health Service (HS) and the Office of National Health Services (NIS) as well.

The report claims that the DHS’s Office of National Health (NIHHS) plan, which is the only agency for the US, which is the DHS, are due to its findings on this test.

The report mentions that the DHS’s Office of National Health Service (NSA) will also be working with the US and US to get the DHS “to implement this test and to ensure that the DHS will be able to deploy it and will be able to get the DHS’s executive office.”

The 

12/06/2022 12:26:08 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
12/06/2022 12:26:11 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.6.tar.gz with generation length of 1024.
12/06/2022 12:26:11 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,379,669 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/600 [00:00<?, ?it/s]

600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
600 steps reached: generating sample texts.
-tune the cybercrooks to get to get access to your computers, too.

So we asked you to do the same, too.

There’s a more risk of the crooks getting access to your computer, so we decided to make a public announcement at the time, saying, “We need to take a public statement and encourage you to use your computer to handle this vulnerability.”

The problem is that when you use the Mac to update your computer to update your Mac, the Mac will create new, previously-released version of Mac, you can update your Mac to support Mac, so that you can install Mac from the Mac, like a Mac.

Mac users have to update their Mac to update the Mac, so we know that Mac users have their Mac by default, so we can wait to let you do the update and make sure you have the update back, so that you can update your Mac.


The

12/06/2022 12:30:21 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
12/06/2022 12:30:23 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.7.tar.gz with generation length of 1024.
12/06/2022 12:30:24 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,356,283 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/600 [00:00<?, ?it/s]

600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
600 steps reached: generating sample texts.
.

Hacking is a problem with all the other methods of crime which is covered, but it is a pretty easy task in the industry.

In the past week, police have seen several arrests around the world, including the Serious Organised Crime Unit, the Serious Organised Crime Agency, and much more.

The FBI also has seen a significant number of arrests worldwide that have been brought in to light.

For those who haven’t used the same password at the same time as the US, they’ve seen various arrests across the world.

The FBI has been the targets of the investigation, and the FBI has known about the nature of the problem and the lack of encryption that the FBI has to do with the use of encryption in the past.

The FBI has also pointed out that more than 12,000 people have used encryption technologies like RSA an

12/06/2022 12:34:33 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
12/06/2022 12:34:35 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.8.tar.gz with generation length of 1024.
12/06/2022 12:34:35 — INFO — aitextgen.TokenDataset — TokenDataset containing 1,899,269 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/600 [00:00<?, ?it/s]

600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
600 steps reached: generating sample texts.
 the attack.

The truth is that once installed, you can’t help but feel compelled to take a look at the code and write the code.

If you think that the code could be malicious, I can’t help you, but it’s a lot more likely to find it.

Remember, there’s a lot of work in any software that can be found on the malicious endpoints infrastructure and the code and the code is a code that has been downloaded.

In this case, the code for the malicious code was downloaded from the source code, which is downloaded from a malicious site.

The code is downloaded by the malicious code and downloaded in various ways (1,2,3,1,4,1,1,1,1,1,1,1,1). These attacks are also seen by malware authors.

The code for this attack is a Trojan horse that attempts to download itself from the source code or is detected by the malwa

12/06/2022 12:38:45 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
12/06/2022 12:38:47 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/reports/dataset_cache.tar.gz with generation length of 1024.
12/06/2022 12:38:47 — INFO — aitextgen.TokenDataset — TokenDataset containing 87,006 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/600 [00:00<?, ?it/s]

600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11
600 steps reached: generating sample texts.
�.

Of course, this isn’t any surprise at all if this is a software vulnerability – I’m not very confident the problem with software that has been exploited by the hackers, and it’s easy for me to help but not impossible to find out what it was, and what it was, and what it was.


If you’re a Windows Vista user, there is a lot of software, including patches in place to all of our components.

The security-savvy and the security-savvy users of this isn’t the sort of thing worth doing.  If you’re a regular user, I’d love you can download this podcast directly in MP3 format: Sophos Security Chet Chat episode 7.



Many internet users are being fooled by a bogus phishing email claiming to come from a free airline tickets.

The scammers are pretending to be a free tickets tickets!

The emails claim that t

12/06/2022 12:42:57 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11


You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model.

## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text.

**If you just trained a model**, you'll get much faster generation performance if you reload the model; the next cell will reload the model you just trained from the `trained_model` folder.

`generate()` without any parameters generates a single text from the loaded model to the console.

If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = ai.generate_one()`.

You can also pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid running out of memory).
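As a rough sketch of how you might pick a `batch_size` under the 50-sample cap mentioned above: the helper below is hypothetical (it is not part of aitextgen), and it assumes the batched generation calls expect `n` to be evenly divisible by `batch_size`, which you should check against your installed version.

```python
def pick_batch_size(n: int, cap: int = 50) -> int:
    """Return the largest divisor of n that does not exceed cap.

    Hypothetical helper, not part of aitextgen. It assumes n must be
    evenly divisible by batch_size for batched generation.
    """
    if n < 1 or cap < 1:
        raise ValueError("n and cap must be positive integers")
    # Walk down from the cap until a value divides n evenly.
    for candidate in range(min(n, cap), 0, -1):
        if n % candidate == 0:
            return candidate
    return 1  # not reached: 1 divides every positive n


# Example: 120 texts fit into 3 batches of 40.
print(pick_batch_size(120))  # 40
```

You could then pass the result as `batch_size` alongside `n` in the generation calls used later in this notebook.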

Other optional-but-helpful parameters for `ai.generate()` and friends:

* **`min_length`**: The minimum length of the generated text: if the text is shorter than this value after cleanup, aitextgen will generate another one.
* **`max_length`**: Number of tokens to generate (default 256; you can generate up to 1024 tokens with GPT-2 and 2048 with GPT Neo)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
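To build some intuition for `temperature`, `top_k` and `top_p`, here is a small pure-Python sketch on a toy distribution. It is only an illustration of the general technique, not aitextgen's or the underlying library's actual implementation; the function name `filter_distribution` and the toy logits are made up for this example.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering.

    Illustrative only: a toy version of the sampling controls described above.
    """
    # Temperature: dividing logits by T < 1 sharpens the distribution,
    # while T > 1 flattens it (the "crazier" end of the scale).
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    probs = {tok: weight / total for tok, weight in weights.items()}

    # Rank candidate tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # top_k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept

    # Renormalize the surviving tokens so they sum to 1 again.
    norm = sum(prob for _, prob in ranked)
    return {tok: prob / norm for tok, prob in ranked}


# Toy logits for four made-up candidate tokens.
toy_logits = {"the": 2.0, "a": 1.0, "cat": 0.0, "zebra": -2.0}
print(sorted(filter_distribution(toy_logits, temperature=1.0, top_k=2)))  # ['a', 'the']
```

Lowering `temperature` or tightening `top_k`/`top_p` concentrates probability on the most likely tokens, which is why these settings tame overly wild output.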

For bulk generation, you can generate a large number of texts to files and sort out the samples locally on your computer. The cells below generate one file per entry in `prompts`, each with `num_outputs` texts and whatever other parameters you would pass to `generate()`. The files are written to the `aitextgen/outputs` folder on Google Drive and can also be downloaded from the Files sidebar!

You can rerun the cells as many times as you want for even more generated texts!
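The generation cells further down write one file per prompt and only add the prompt index to the file name when there is more than one prompt. That naming rule can be sketched on its own as follows; `output_paths` is a hypothetical helper written for this illustration, and the base path is a placeholder.

```python
def output_paths(base_path, prompt_count, ext=".txt"):
    """Return one output file path per prompt.

    Hypothetical helper mirroring the naming rule of the generation cells:
    a single prompt gets `<base><ext>`, several prompts get `<base>.<i><ext>`.
    """
    if prompt_count == 1:
        return [base_path + ext]
    return [f"{base_path}.{i}{ext}" for i in range(prompt_count)]


# Placeholder base path; in the notebook it is built from the run ID.
print(output_paths("my_run_output", 3))
# ['my_run_output.0.txt', 'my_run_output.1.txt', 'my_run_output.2.txt']
```

Keeping the index in the file name makes it easy to match each output file back to its position in the `prompts` list.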

In [None]:
num_outputs = 5
max_length = 1000
temperature = 1.0
top_p = 0.9

prompts = ['Digital Forensics Analysis Report\n',
           'This report is ',
           'The contents of ',
           'Conclusion\n',
           'It is recommended that ',
           'In the opinion of the expert, ',
           'File \'Exploit_Office\' contains ',
           'File \'Exploit_Office\' does not contain ',
           'Website \'Webmail SquirrelMail\' contains ',
           'Website \'Webmail SquirrelMail\' does not contain ',
           'Bill Due to past contains a link \'https://genom.mefst.hr/webmail/src/login.php\' to a website \'Webmail SquirrelMail\'.',
           'New Dogecoin Crypto Sale contains a link \'http://webmail.forumofthemall.hr/mail/loging.php\' to a website \'Webmail SquirrelMail Popular Forum\'.',
           'New OneCoin Crypto Sale contains a link \'http://',
           'Note of eviction contains ',
           'Note of eviction contains attachment ',
           'Note of eviction contains attachment \'Exploit_Office\'. Attachment is quarantined on \'Mail server EP\'.',
           'Log entry found: ',
           'Log entry found: Firewall (Type: Firewall) ',
           'Log entry found: Firewall (Type: Firewall) detected. [Allowed network traffic protocol ',
           'Log entry found: Firewall (Type: Firewall) blocked. [Blocked network traffic protocol ',
           'Log entry found: Firewall (Type: Firewall) detected. [Allowed network traffic protocol \'smtp:25\' from \'server74.aws.com\' to \'Mail server EP\'. Rule \'Internet_to_Mail_Server\'.]',
           'Log entry found: Firewall (Type: Firewall) detected. [Allowed network traffic protocol \'https:443\' from \'Proxy server\' to \'server74.aws.com\'. Rule \'Proxy_to_Internet, https:443\'.]',
           'Log entry found: Firewall (Type: Firewall) blocked. [Blocked network traffic protocol \'https:443\' from \'PCSZT03\' to \'Firewall TSO Enterprise\'.]',
           'Log analysis on ',
           'Log analysis on \'Firewall TSO Enterprise\' for period 1.1.2022. 0:00:00 - 4.2.2022. 13:50:44 finished. Report is ready.']

In [None]:
generate_partial_outputs = False

#num_outputs_partial = 2
#max_length_partial = 500
#prompts_partial = ['Digital Forensics Analysis Report\n',
#                   'This report is ',
#                   'The contents of ',
#                   'Conclusion\n']

In [None]:
output_root_dir = gdrive_root_dir + '/aitextgen/outputs/' + run_id
output_ext = '.txt'

# Create the output folder if it does not exist yet.
os.makedirs(output_root_dir, exist_ok=True)

In [None]:
if save_partial and generate_partial_outputs:
    output_partial_dir = output_root_dir + '/partial'

    for i in partial_models:
        model_dir = model_partial_dir + '/' + i
        ai = aitextgen.aitextgen(model_folder=model_dir, to_gpu=True)

        output_dir = output_partial_dir + '/' + i
        partial_id = run_id + '_partial_' + i
        output_basepath = output_dir + '/' + partial_id + '_output'

        # Create the per-checkpoint output folder if it does not exist yet.
        os.makedirs(output_dir, exist_ok=True)

        for j in range(len(prompts_partial)):
            if len(prompts_partial) > 1:
                current_output = output_basepath + '.' + str(j) + output_ext
            else:
                current_output = output_basepath + output_ext
            ai.generate_to_file(n=num_outputs_partial,
                                batch_size=1,
                                prompt=prompts_partial[j],
                                max_length=max_length_partial,
                                temperature=temperature,
                                top_p=top_p,
                                destination_path=current_output)

In [None]:
if len(datasets) > 0:
    ai = aitextgen.aitextgen(model_folder=model_root_dir, to_gpu=True)

output_basepath = output_root_dir + '/' + run_id + '_output'

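# One output file per prompt; the prompt index is added to the file name only when there are several prompts.
for i in range(len(prompts)):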
    if len(prompts) > 1:
        current_output = output_basepath + '.' + str(i) + output_ext
    else:
        current_output = output_basepath + output_ext
    ai.generate_to_file(n=num_outputs,
                        batch_size=1,
                        prompt=prompts[i],
                        max_length=max_length,
                        temperature=temperature,
                        top_p=top_p,
                        destination_path=current_output)

12/06/2022 12:42:59 — INFO — aitextgen — Loading model from provided weights and config in //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11.
12/06/2022 12:43:01 — INFO — aitextgen — GPT2 loaded with 124M parameters.
12/06/2022 12:43:01 — INFO — aitextgen — Using the default GPT-2 Tokenizer.
12/06/2022 12:43:01 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.0.txt


12/06/2022 12:43:54 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.1.txt


12/06/2022 12:44:44 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.2.txt


12/06/2022 12:45:34 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.3.txt


12/06/2022 12:46:24 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.4.txt


12/06/2022 12:47:15 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.5.txt


12/06/2022 12:48:05 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.6.txt


12/06/2022 12:48:55 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.7.txt


12/06/2022 12:49:44 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.8.txt


12/06/2022 12:50:33 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.9.txt


12/06/2022 12:51:24 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.10.txt


12/06/2022 12:52:12 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.11.txt


12/06/2022 12:53:00 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.12.txt


12/06/2022 12:53:49 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.13.txt


12/06/2022 12:54:39 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.14.txt


12/06/2022 12:55:30 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.15.txt


12/06/2022 12:56:19 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.16.txt


12/06/2022 12:57:08 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.17.txt


12/06/2022 12:57:58 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.18.txt


12/06/2022 12:58:48 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.19.txt


12/06/2022 12:59:36 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.20.txt


12/06/2022 13:00:24 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.21.txt


12/06/2022 13:01:11 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.22.txt


12/06/2022 13:01:59 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.23.txt


12/06/2022 13:02:50 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11/aitextgen-CCS-124M-5400-600_run_2022-06-12-12-00-11_output.24.txt



# LICENSE

MIT License

Copyright (c) 2020-2021 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.