#  aitextgen — Train a GPT-2 (or GPT Neo) Text-Generating Model w/ GPU

by [Max Woolf](https://minimaxir.com)

*Last updated: May 16th, 2021 (aitextgen v0.5.2)*

Retrain an advanced text-generating neural network on any text dataset **for free on a GPU in Colaboratory** with `aitextgen`!

For more about `aitextgen`, you can visit [this GitHub repository](https://github.com/minimaxir/aitextgen) or [read the documentation](https://docs.aitextgen.io/).


To get started:

1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Run the cells below:


In [1]:
!pip install -q aitextgen

In [2]:
import aitextgen
import datetime
import gc
import logging
import os
import requests
import torch

In [3]:
session_url = 'http://172.28.0.2:9000/api/sessions'
notebook_name = requests.get(session_url).json()[0]['name']

run_datetime = datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
run_id = notebook_name + '_run_' + run_datetime

In [4]:
log_format = '%(asctime)s — %(levelname)s — %(name)s — %(message)s'
date_format = '%d/%m/%Y %H:%M:%S'
log_level = logging.DEBUG

logging.basicConfig(format=log_format, datefmt=date_format, level=log_level)

## GPU

Colaboratory uses an Nvidia P4, an Nvidia T4, an Nvidia P100, or an Nvidia V100 (the run recorded below was assigned a Tesla K80, so this list is not exhaustive). For finetuning GPT-2 124M, any of these GPUs will be fine, but for text generation, a T4 or a P100 is ideal since they have more VRAM. **If you receive a T4 or a V100 GPU, you can enable `fp16=True` during training for faster/more memory efficient training.**

You can verify which GPU is active by running the cell below. If you want to try for a different GPU, go to **Runtime -> Factory Reset Runtime**.

In [5]:
!nvidia-smi

Fri May  6 18:01:15 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
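
The `fp16` advice above can be turned into a small check. This is a minimal sketch, not part of aitextgen: `fp16_recommended` is a hypothetical helper that only matches the two GPU names this section mentions, and the `torch` lookup is guarded so the snippet still runs where PyTorch or a GPU is absent.

```python
def fp16_recommended(gpu_name: str) -> bool:
    """Return True if the GPU name matches one this notebook says supports fp16."""
    # Assumption: only the T4 and V100 are treated as fp16-capable, per the text above.
    return any(tag in gpu_name.upper() for tag in ("T4", "V100"))


try:
    import torch

    # get_device_name() returns a string such as "Tesla T4".
    gpu_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none"
except ImportError:
    gpu_name = "none"

print(gpu_name, fp16_recommended(gpu_name))
```

The result could be passed as `fp16=...` to `train()`; for a Tesla K80 like the one in the output above, the helper returns `False`.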

## Loading GPT-2 or GPT Neo

If you're retraining a model on new text, you need to download and load the GPT-2 model into the GPU. 

There are several sizes of GPT-2:

* `124M` (default): the "small" model, 500MB on disk.
* `355M`: the "medium" model, 1.5GB on disk.
* `774M`: the "large" model, 3GB on disk.

You can also finetune a GPT Neo model instead, which is more suitable for longer texts and whose base model was trained on more recent data:

* `125M`: Analogous to the GPT-2 124M model.
* `350M`: Analogous to the GPT-2 355M model.

The next cell downloads the model and saves it in the Colaboratory VM. If the model has already been downloaded, running this cell will reload it.

In [6]:
model='124M'
#model='355M'
#model='774M'

#model='gpt-neo-125M'
#model='gpt-neo-350M'

In [7]:
if model in ('124M', '355M', '774M'):
    ai = aitextgen.aitextgen(tf_gpt2=model, to_gpu=True)
else:
    ai = aitextgen.aitextgen(model='EleutherAI/' + model, to_gpu=True)

06/05/2022 18:01:16 — INFO — aitextgen — Downloading the 124M GPT-2 TensorFlow weights/config from Google's servers
06/05/2022 18:01:16 — DEBUG — urllib3.connectionpool — Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
06/05/2022 18:01:16 — DEBUG — urllib3.connectionpool — https://openaipublic.blob.core.windows.net:443 "GET /gpt-2/models/124M/checkpoint HTTP/1.1" 200 77


Fetching checkpoint:   0%|          | 0.00/77.0 [00:00<?, ?it/s]

06/05/2022 18:01:16 — DEBUG — urllib3.connectionpool — Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
06/05/2022 18:01:17 — DEBUG — urllib3.connectionpool — https://openaipublic.blob.core.windows.net:443 "GET /gpt-2/models/124M/hparams.json HTTP/1.1" 200 90


Fetching hparams.json:   0%|          | 0.00/90.0 [00:00<?, ?it/s]

06/05/2022 18:01:17 — DEBUG — urllib3.connectionpool — Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
06/05/2022 18:01:17 — DEBUG — urllib3.connectionpool — https://openaipublic.blob.core.windows.net:443 "GET /gpt-2/models/124M/model.ckpt.data-00000-of-00001 HTTP/1.1" 200 497759232


Fetching model.ckpt.data-00000-of-00001:   0%|          | 0.00/498M [00:00<?, ?it/s]

06/05/2022 18:02:02 — DEBUG — urllib3.connectionpool — Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
06/05/2022 18:02:03 — DEBUG — urllib3.connectionpool — https://openaipublic.blob.core.windows.net:443 "GET /gpt-2/models/124M/model.ckpt.index HTTP/1.1" 200 5215


Fetching model.ckpt.index:   0%|          | 0.00/5.21k [00:00<?, ?it/s]

06/05/2022 18:02:03 — DEBUG — urllib3.connectionpool — Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
06/05/2022 18:02:03 — DEBUG — urllib3.connectionpool — https://openaipublic.blob.core.windows.net:443 "GET /gpt-2/models/124M/model.ckpt.meta HTTP/1.1" 200 471155


Fetching model.ckpt.meta:   0%|          | 0.00/471k [00:00<?, ?it/s]

06/05/2022 18:02:04 — INFO — aitextgen — Converting the 124M GPT-2 TensorFlow weights to PyTorch.
Converting TensorFlow checkpoint from /content/aitextgen/124M
Loading TF weight model/h0/attn/c_attn/b with shape [2304]
Loading TF weight model/h0/attn/c_attn/w with shape [1, 768, 2304]
Loading TF weight model/h0/attn/c_proj/b with shape [768]
Loading TF weight model/h0/attn/c_proj/w with shape [1, 768, 768]
Loading TF weight model/h0/ln_1/b with shape [768]
Loading TF weight model/h0/ln_1/g with shape [768]
Loading TF weight model/h0/ln_2/b with shape [768]
Loading TF weight model/h0/ln_2/g with shape [768]
Loading TF weight model/h0/mlp/c_fc/b with shape [3072]
Loading TF weight model/h0/mlp/c_fc/w with shape [1, 768, 3072]
Loading TF weight model/h0/mlp/c_proj/b with shape [768]
Loading TF weight model/h0/mlp/c_proj/w with shape [1, 3072, 768]
Loading TF weight model/h1/attn/c_attn/b with shape [2304]
Loading TF weight model/h1/attn/c_attn/w with shape [1, 768, 2304]
Loading TF weight

Save PyTorch model to aitextgen/pytorch_model.bin


06/05/2022 18:02:11 — INFO — aitextgen — Loading 124M GPT-2 model from /aitextgen.


Save configuration file to aitextgen/config.json


06/05/2022 18:02:13 — INFO — aitextgen — GPT2 loaded with 124M parameters.
06/05/2022 18:02:13 — INFO — aitextgen — Using the default GPT-2 Tokenizer.
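
The branch in the loading cell can also be written as a lookup that returns constructor arguments, which makes the GPT-2 versus GPT Neo distinction easy to check without downloading anything. This is a sketch: `GPT2_SIZES` and `model_kwargs` are hypothetical names, while `tf_gpt2`, `model`, and `to_gpu` are the arguments already used in this notebook.

```python
# GPT-2 sizes listed in this section; anything else is treated as a GPT Neo name.
GPT2_SIZES = {"124M", "355M", "774M"}


def model_kwargs(model: str) -> dict:
    """Build keyword arguments for aitextgen.aitextgen() from a model name."""
    if model in GPT2_SIZES:
        # GPT-2 weights are fetched by size via tf_gpt2.
        return {"tf_gpt2": model, "to_gpu": True}
    # GPT Neo models are referenced by their EleutherAI repository name.
    return {"model": "EleutherAI/" + model, "to_gpu": True}


print(model_kwargs("124M"))
print(model_kwargs("gpt-neo-125M"))
```

In the notebook this would be used as `ai = aitextgen.aitextgen(**model_kwargs(model))`.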


## Mounting Google Drive

The best way to get input text to-be-trained into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*.

Running this cell (which will only work in Colaboratory) will mount your personal Google Drive in the VM, which later cells can use to get data in and out. It will ask for an auth code; that auth is not saved anywhere.

In [8]:
aitextgen.colab.mount_gdrive()

Mounted at /content/drive


In [9]:
gdrive_rootdir = '/content/drive/My Drive'
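
Later cells read training data from `aitextgen/training_data` and write models to `aitextgen/models` under the mounted Drive. The following is a minimal sketch, using only the standard library, that creates both folders if they are missing; `ensure_project_dirs` is a hypothetical helper name, not an aitextgen function.

```python
import os


def ensure_project_dirs(gdrive_rootdir: str) -> list:
    """Create the data and model folders used by this notebook; return their paths."""
    paths = [
        os.path.join(gdrive_rootdir, "aitextgen", sub)
        for sub in ("training_data", "models")
    ]
    for path in paths:
        # exist_ok=True makes the call safe to repeat on every run.
        os.makedirs(path, exist_ok=True)
    return paths
```

After mounting, it would be called as `ensure_project_dirs(gdrive_rootdir)`.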

## Uploading a Text File to be Trained to Colaboratory

In the Colaboratory Notebook sidebar on the left of the screen, select *Files*. From there you can upload files:

![alt text](https://i.imgur.com/w3wvHhR.png)

Upload **any smaller text file** (for example, [a text file of Shakespeare plays](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)) and update the file name in the cell below, then run the cell.

In [10]:
data_rootdir = gdrive_rootdir + '/aitextgen/training_data'
datasets = ['articles', 'meta_reports_combined']
dataset_splits = [9, 1]
dataset_iterations = [800, 3600]

file_basename = 'dataset_cache'
file_ext = '.tar.gz'
from_cache = True

If your text file is large (>10MB), it is recommended to upload that file to Google Drive first, then copy that file from Google Drive to the Colaboratory VM.

Additionally, you may want to consider [compressing the dataset to a cache first](https://docs.aitextgen.io/dataset/) on your local computer, then uploading the resulting `dataset_cache.tar.gz` and setting `file_basename` and `file_ext` in the previous cell to match it (with `from_cache = True`).
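
Before training, it can help to list the cache files the configuration implies and confirm they exist. The sketch below assumes the naming convention used by this notebook's training loop: a dataset with more than one split is stored as `dataset_cache.0.tar.gz`, `dataset_cache.1.tar.gz`, and so on, while a single-split dataset is stored as `dataset_cache.tar.gz`. `expected_cache_files` is a hypothetical helper.

```python
import os


def expected_cache_files(root, dataset, n_splits,
                         basename="dataset_cache", ext=".tar.gz"):
    """List the cache file paths for one dataset under the assumed naming scheme."""
    base = f"{root}/{dataset}/{basename}"
    if n_splits > 1:
        return [f"{base}.{j}{ext}" for j in range(n_splits)]
    return [base + ext]


files = expected_cache_files(
    "/content/drive/My Drive/aitextgen/training_data", "articles", 9
)
missing = [f for f in files if not os.path.exists(f)]
print(f"{len(missing)} of {len(files)} cache files missing")
```

Running the check for every entry in `datasets` catches a mistyped folder name before a long training run starts.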

## Finetune GPT-2

The next cell will start the actual finetuning of GPT-2 in aitextgen. It runs for `num_steps`, and a progress bar will appear to show training progress, current loss (the lower the better the model), and average loss (to give a sense on loss trajectory).

The model will be saved every `save_every` steps in `trained_model` by default, and when training completes. If you mounted your Google Drive, the model will _also_ be saved there in a unique folder.

The training might time out after 4ish hours; if you did not mount to Google Drive, make sure you end training and save the results so you don't lose them! (if this happens frequently, you may want to consider using [Colab Pro](https://colab.research.google.com/signup))

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells.
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. Only works on a T4 or V100 GPU.

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. (if using `fp16`, you can increase the batch size more safely)
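
Because `num_steps` applies to each call to `train()`, the total training budget depends on how many cache files are trained in turn. The arithmetic below is a sketch using the values configured in this notebook; the `seconds_per_step` figure is an assumption read off the log timestamps of the recorded run and will differ on other GPUs.

```python
dataset_splits = [9, 1]            # cache files per dataset
dataset_iterations = [800, 3600]   # num_steps per cache file
save_every = 200

steps_per_dataset = [s * n for s, n in zip(dataset_splits, dataset_iterations)]
total_steps = sum(steps_per_dataset)
# Valid here because every num_steps value is a multiple of save_every.
checkpoint_saves = total_steps // save_every

seconds_per_step = 0.74            # assumption: rough figure from this run's log
estimated_minutes = total_steps * seconds_per_step / 60

print(steps_per_dataset)   # [7200, 3600]
print(total_steps)         # 10800
print(checkpoint_saves)    # 54
print(f"about {estimated_minutes:.0f} minutes")
```

An estimate like this shows whether a configuration fits inside the session time limit mentioned above.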

In [11]:
model_save_dir = gdrive_rootdir + '/aitextgen/models/' + run_id
save_every = 200

In [12]:
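# Train on each dataset in turn; a dataset split into several cache files is
# trained one file at a time, for n_steps steps per file.
for dataset, n_splits, n_steps in zip(datasets, dataset_splits, dataset_iterations):
    file_basepath = data_rootdir + '/' + dataset + '/' + file_basename
    for j in range(n_splits):
        # Split caches are named <basename>.<j><ext>; a single cache has no index.
        if n_splits > 1:
            current_file = file_basepath + '.' + str(j) + file_ext
        else:
            current_file = file_basepath + file_ext
        ai.train(current_file,
                 line_by_line=False,
                 from_cache=from_cache,
                 num_steps=n_steps,
                 generate_every=n_steps,  # sample once, at the end of each file
                 save_every=save_every,
                 save_gdrive=False,  # output_dir already points at Google Drive
                 run_id=run_id,
                 output_dir=model_save_dir,
                 learning_rate=1e-3,
                 fp16=False,
                 batch_size=1)

        # R.B.: required to prevent memory leaks in Colab
        gc.collect()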

06/05/2022 18:02:46 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.0.tar.gz with generation length of 1024.
06/05/2022 18:02:47 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,299,548 subsets loaded via cache.
06/05/2022 18:02:47 — INFO — torch.distributed.nn.jit.instantiator — Created a temporary directory at /tmp/tmpj4_rdsje
06/05/2022 18:02:47 — INFO — torch.distributed.nn.jit.instantiator — Writing /tmp/tmpj4_rdsje/_remote_module_non_sriptable.py
  f"Setting `Trainer(checkpoint_callback={checkpoint_callback})` is deprecated in v1.5 and will "
  f"Setting `Trainer(progress_bar_refresh_rate={progress_bar_refresh_rate})` is deprecated in v1.5 and"
  "Setting `Trainer(weights_summary=None)` is deprecated in v1.5 and will be removed"
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
  f"The `Callback.{hook}` hook was de

  0%|          | 0/800 [00:00<?, ?it/s]

  "`trainer.progress_bar_dict` is deprecated in v1.5 and will be removed in v1.7."


200 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
400 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: generating sample texts.
 the malware used to exploit the ability of spoof the IP addresses for many Web sites.

A new feature that allows users to create and edit accounts for thousands of dollars in a bid to steal cryptocurrencies from cash accounts at U.S. banks is being used to steal cryptocurrencies from Bitcoin.

The cryptocurrency deposits account holders can then be used to counterfeit cryptocurrencies like Bi

06/05/2022 18:12:39 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
06/05/2022 18:12:41 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.1.tar.gz with generation length of 1024.
06/05/2022 18:12:42 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,291,734 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/800 [00:00<?, ?it/s]

200 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
400 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: generating sample texts.
 social network (VPN) to find a secure connection over the phone, or phone. If someone has a connection over the phone and tries to locate the connection, the hacker may need to take a moment and scramble their connection.

Cybercriminals are continually upgrading their tactics and tactics, and this year’s cybercrime? They’re using the world’s largest cybercrime threats, as the latest version 

06/05/2022 18:22:36 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
06/05/2022 18:22:37 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.2.tar.gz with generation length of 1024.
06/05/2022 18:22:38 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,138,444 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/800 [00:00<?, ?it/s]

200 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
400 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: generating sample texts.
 more than $5.1 trillion to a year.

Microsoft Defender for Endpoint detects these threats as PWS/Door:

Microsoft Defender for Endpoint detects these email threats as W97/Door:

Microsoft Defender for Endpoint alerts a user to a malicious website which hosts an infection.

This alert identifies a malicious website, and alerts an attacker to a malicious website.

As an example, a typical web b

06/05/2022 18:32:33 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
06/05/2022 18:32:35 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.3.tar.gz with generation length of 1024.
06/05/2022 18:32:36 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,273,315 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/800 [00:00<?, ?it/s]

200 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
400 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: generating sample texts.
 an online demo.


The initial version of this version was originally intended to provide more guidance and protections for customers.

The original version was intended to provide more guidance and provide guidance.

In the meantime, we are not currently aware of any customer impact as a result of this incident.



Hello everyone, Mike Reavey here.

The MSRC Communications Manager for MSRC




06/05/2022 18:42:39 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
06/05/2022 18:42:42 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.4.tar.gz with generation length of 1024.
06/05/2022 18:42:42 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,392,964 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/800 [00:00<?, ?it/s]

200 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
400 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: generating sample texts.
 the phone and the phone. There’s an easy way to keep your iPhone to secure.

For more information on how to stop your phone from snooping through your phone without a password.

For more information about how phone security is protecting your Android device, read our Security article.

In our series, we’re not just about security, but also about mobile device security.


Sophos Mobile Securit

06/05/2022 18:52:45 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
06/05/2022 18:52:47 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.5.tar.gz with generation length of 1024.
06/05/2022 18:52:48 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,330,672 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/800 [00:00<?, ?it/s]

200 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
400 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: generating sample texts.
 as “unallocated”.

But there’s also a way of suggesting that the message will be sent, so that it’s not sent, but it’s sent, and it’s been sent everywhere.

The data from the message is being sent, so the text is being sent to the recipient.

It’s worth noting that the email addresses in the above line are now encrypted.

On the example of the WhatsApp message, Facebook is no more than Facebo

06/05/2022 19:02:51 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
06/05/2022 19:02:54 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.6.tar.gz with generation length of 1024.
06/05/2022 19:02:54 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,379,429 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/800 [00:00<?, ?it/s]

200 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
400 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: generating sample texts.
�s, which seems to have come from the Facebook network, with its enormous share of users and its out of more than 20 million user data.

Foggi, a Facebook-based company that tracks customer data with its site, said the BBC:

It was a huge amount of fun to be able to be a robust network. We must be on top of a lot of people that we visit a year ago.

Foggi said it was, in addition to the fact t

06/05/2022 19:12:58 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
06/05/2022 19:13:00 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.7.tar.gz with generation length of 1024.
06/05/2022 19:13:01 — INFO — aitextgen.TokenDataset — TokenDataset containing 2,356,415 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/800 [00:00<?, ?it/s]

200 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
400 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: generating sample texts.
 the Internet Association of America (ISPs), and you get to have a try, too many times, during the launch of the net.

The “I'm not sure whether you’re wondering,” that’s just what it is – a “compreting battle” on the site, and it’s the people who are doing it.

It’s hard to know just how many people who care about the net have become aware of the importance of reading their private lives onli

06/05/2022 19:23:05 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
06/05/2022 19:23:07 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/articles/dataset_cache.8.tar.gz with generation length of 1024.
06/05/2022 19:23:08 — INFO — aitextgen.TokenDataset — TokenDataset containing 1,928,162 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/800 [00:00<?, ?it/s]

200 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
400 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: generating sample texts.

It’s a big problem as you have to have Windows 7 installed and run you in the background.


The concept of OS X malware is a nuisance.

Many legitimate anti-virus products are quite a bit of a nuisance.  I know it often.

I’ve blogged about it all the time in 2008, when the Microsoft Office vulnerability is being exploited in the wild.

By the way, you’ve probably heard that Microsoft has rel

06/05/2022 19:33:12 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
06/05/2022 19:33:14 — INFO — aitextgen — Loading text from /content/drive/My Drive/aitextgen/training_data/meta_reports_combined/dataset_cache.tar.gz with generation length of 1024.
06/05/2022 19:33:14 — INFO — aitextgen.TokenDataset — TokenDataset containing 91,415 subsets loaded via cache.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/3600 [00:00<?, ?it/s]

200 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
400 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
600 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
800 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
1,000 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
1,200 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
1,400 steps reached: saving model to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15
1,600 steps reached: sav

06/05/2022 20:18:21 — INFO — aitextgen — Saving trained model pytorch_model.bin to //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15


You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model.


## Load a Trained Model

If you already have a trained model from this notebook, running the next cell will copy the `pytorch_model.bin` and `config.json` files from the specified folder in Google Drive into the Colaboratory VM. (If no `from_folder` is specified, it assumes the two files are located at the root level of your Google Drive.)

The next cell will allow you to load the retrained model + metadata necessary to generate text.
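
The copy step described above can be sketched with the standard library alone. This is an illustration, not aitextgen's own helper: `copy_model_files` is a hypothetical name, and `trained_model` is used as the destination only because this notebook names it as the default save folder.

```python
import os
import shutil


def copy_model_files(from_folder: str, to_folder: str = "trained_model") -> list:
    """Copy the two files needed to reload a model; return the destination paths."""
    os.makedirs(to_folder, exist_ok=True)
    copied = []
    for name in ("pytorch_model.bin", "config.json"):
        src = os.path.join(from_folder, name)
        if not os.path.isfile(src):
            # Fail early with the missing path instead of at model-load time.
            raise FileNotFoundError(src)
        copied.append(shutil.copy(src, os.path.join(to_folder, name)))
    return copied
```

The copied folder could then be loaded with `aitextgen.aitextgen(model_folder="trained_model", to_gpu=True)`, the same constructor form used in the generation section of this notebook.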

## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text.

**If you just trained a model**, you'll get much faster generation performance if you reload the model; the next cell will reload the model you just trained from the folder it was saved to (`model_save_dir`).

In [13]:
ai = aitextgen.aitextgen(model_folder=model_save_dir, to_gpu=True)

06/05/2022 20:18:23 — INFO — aitextgen — Loading model from provided weights and config in //content/drive/My Drive/aitextgen/models/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15.
06/05/2022 20:18:26 — INFO — aitextgen — GPT2 loaded with 124M parameters.
06/05/2022 20:18:26 — INFO — aitextgen — Using the default GPT-2 Tokenizer.


`generate()` without any parameters generates a single text from the loaded model to the console.

If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = ai.generate_one()`.

You can also pass in a `prompt` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `n`. You can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 50 for `batch_size` to avoid going OOM).
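
As a sketch of combining `prompt` and `n`, the helper below collects several generated texts per prompt into a dictionary. `generate_for_prompts` is a hypothetical name; it assumes `ai` is the loaded aitextgen instance and that `generate()` accepts `n`, `prompt`, `max_length`, `temperature`, and `return_as_list`, which is worth confirming against the documentation for your installed version.

```python
def generate_for_prompts(ai, prompts, n=3, max_length=256, temperature=0.7):
    """Return {prompt: [n generated texts]} for each prompt in the list."""
    results = {}
    for prompt in prompts:
        # return_as_list=True asks for the texts back instead of printing them.
        results[prompt] = ai.generate(
            n=n,
            prompt=prompt,
            max_length=max_length,
            temperature=temperature,
            return_as_list=True,
        )
    return results
```

For example, `generate_for_prompts(ai, ['This report is ', 'Conclusion\n'], n=2)` would return two texts for each of the two prompts.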

Other optional-but-helpful parameters for `ai.generate()` and friends:

* **`min_length`**: The minimum length of the generated text: if the text is shorter than this value after cleanup, aitextgen will generate another one.
* **`max_length`**: Number of tokens to generate (default 256; you can generate up to 1024 tokens with GPT-2 and 2048 with GPT Neo)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0, which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability (gets good results on a dataset with `top_p=0.9`)
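
To make the sampling parameters more concrete, here is a toy, pure-Python sketch of how temperature, top-k and top-p reshape a next-token distribution. It is an illustration only: `filter_distribution` is a hypothetical function, the token scores are made up, and real implementations operate on tensors and may differ in details.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales the scores before the softmax:
    # lower values sharpen the distribution, higher values flatten it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # top_k keeps only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
        k_total = sum(p for _, p in ranked)
        ranked = [(tok, p / k_total) for tok, p in ranked]

    # top_p keeps the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    kept_total = sum(p for _, p in kept)
    return {tok: p / kept_total for tok, p in kept}
```

With scores `{"a": 2.0, "b": 1.0, "c": 0.0}` and `temperature=1.0`, setting `top_p=0.9` drops the least likely token `"c"`, while `top_k=1` leaves only `"a"`.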

For bulk generation, you can generate a large number of texts to files and sort out the samples locally on your computer. The next cells write one file per prompt in `prompts`, each containing `num_outputs` texts generated with whatever other parameters you would pass to `generate()`. The files are written to `output_dir` and can then be downloaded from the Files sidebar!

You can rerun the cells as many times as you want for even more generated texts!
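
Once the files are downloaded, sorting out the samples locally could look like the sketch below. It assumes that samples are separated by a line of twenty `=` characters, which is believed to be aitextgen's default `sample_delim`; check one of your files and pass a different delimiter if needed. Both helper functions are hypothetical.

```python
# Assumption: a line of twenty "=" characters separates samples
# (believed to be aitextgen's default `sample_delim`).
DEFAULT_DELIM = "=" * 20 + "\n"


def split_samples(text, delim=DEFAULT_DELIM):
    """Split the contents of a generated file into non-empty, stripped samples."""
    return [chunk.strip() for chunk in text.split(delim) if chunk.strip()]


def longest_first(samples):
    """Order samples by length so the most substantial ones come first."""
    return sorted(samples, key=len, reverse=True)
```

Read each output file with `open(path, encoding="utf-8").read()` and pass the text to `split_samples` to get a list you can filter or rank.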

In [14]:
prompts = ['Digital Forensics Analysis Report\n',
           'This report is ',
           'The contents of ',
           'Conclusion\n',
           'It is recommended that ',
           'In the opinion of the expert, ',
           'File \'Exploit_Office\' contains ',
           'File \'Exploit_Office\' does not contain ',
           'Website \'Webmail SquirrelMail\' contains ',
           'Website \'Webmail SquirrelMail\' does not contain ',
           'Bill Due to past contains a link \'https://genom.mefst.hr/webmail/src/login.php\' to a website \'Webmail SquirrelMail\'.',
           'New Dogecoin Crypto Sale contains a link \'http://webmail.forumofthemall.hr/mail/loging.php\' to a website \'Webmail SquirrelMail Popular Forum\'.',
           'New OneCoin Crypto Sale contains a link \'http://',
           'Note of eviction contains ',
           'Note of eviction contains attachment ',
           'Note of eviction contains attachment \'Exploit_Office\'. Attachment is quarantined on \'Mail server EP\'.',
           'Log entry found: ',
           'Log entry found: Firewall (Type: Firewall) ',
           'Log entry found: Firewall (Type: Firewall) detected. [Allowed network traffic protocol ',
           'Log entry found: Firewall (Type: Firewall) blocked. [Blocked network traffic protocol ',
           'Log entry found: Firewall (Type: Firewall) detected. [Allowed network traffic protocol \'smtp:25\' from \'server74.aws.com\' to \'Mail server EP\'. Rule \'Internet_to_Mail_Server\'.]',
           'Log entry found: Firewall (Type: Firewall) detected. [Allowed network traffic protocol \'https:443\' from \'Proxy server\' to \'server74.aws.com\'. Rule \'Proxy_to_Internet, https:443\'.]',
           'Log entry found: Firewall (Type: Firewall) blocked. [Blocked network traffic protocol \'https:443\' from \'PCSZT03\' to \'Firewall TSO Enterprise\'.]',
           'Log analysis on ',
           'Log analysis on \'Firewall TSO Enterprise\' for period 1.1.2022. 0:00:00 - 4.2.2022. 13:50:44 finished. Report is ready.']

In [15]:
output_dir = gdrive_rootdir + '/aitextgen/outputs/' + run_id
output_basepath = output_dir + '/' + run_id + '_output'
output_ext = '.txt'

# Create the output folder if it does not exist yet.
os.makedirs(output_dir, exist_ok=True)

In [16]:
num_outputs = 5      # texts generated per prompt
max_length = 1000    # tokens per text
temperature = 1.0
top_p = 0.9

# Write one output file per prompt; the prompt index is added to the
# file name only when there is more than one prompt.
for i, prompt in enumerate(prompts):
    if len(prompts) > 1:
        current_output = output_basepath + '.' + str(i) + output_ext
    else:
        current_output = output_basepath + output_ext
    ai.generate_to_file(n=num_outputs,
                        batch_size=1,
                        prompt=prompt,
                        max_length=max_length,
                        temperature=temperature,
                        top_p=top_p,
                        destination_path=current_output)

06/05/2022 20:18:26 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.0.txt


06/05/2022 20:19:48 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.1.txt


06/05/2022 20:21:09 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.2.txt


06/05/2022 20:22:27 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.3.txt


06/05/2022 20:23:47 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.4.txt


06/05/2022 20:25:05 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.5.txt


06/05/2022 20:26:23 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.6.txt


06/05/2022 20:27:41 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.7.txt


06/05/2022 20:28:58 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.8.txt


06/05/2022 20:30:16 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.9.txt


06/05/2022 20:31:33 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.10.txt


06/05/2022 20:32:48 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.11.txt


06/05/2022 20:34:03 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.12.txt


06/05/2022 20:35:20 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.13.txt


06/05/2022 20:36:37 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.14.txt


06/05/2022 20:37:54 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.15.txt


06/05/2022 20:39:09 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.16.txt


06/05/2022 20:40:26 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.17.txt


06/05/2022 20:41:43 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.18.txt


06/05/2022 20:42:59 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.19.txt


06/05/2022 20:44:15 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.20.txt


06/05/2022 20:45:29 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.21.txt


06/05/2022 20:46:43 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.22.txt


06/05/2022 20:47:57 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.23.txt


06/05/2022 20:49:15 — INFO — aitextgen — Generating 5 texts to /content/drive/My Drive/aitextgen/outputs/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15/aitextgen-CCS-124M-7200-3600_run_2022-05-06-18-01-15_output.24.txt


# LICENSE

MIT License

Copyright (c) 2020-2021 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.