# Dependencies

In [1]:
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU
from aitextgen import aitextgen

# Text File to Train New Model

In [2]:
file_name = "lovecraft.txt"

# Tokenize

Train a custom BPE tokenizer on the text file. This saves two files, aitextgen-vocab.json and aitextgen-merges.txt, both of which are needed to rebuild the tokenizer.



In [3]:
train_tokenizer(file_name)
vocab_file = "aitextgen-vocab.json"
merges_file = "aitextgen-merges.txt"

INFO:aitextgen.tokenizers:Saving aitextgen-vocab.json and aitextgen-merges.txt to the current directory. You will need both files to build the GPT2Tokenizer.
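
To make the roles of the vocab and merges files more concrete, here is a toy sketch of the byte-pair-encoding idea in plain Python. It is not the code that train_tokenizer runs (the real tokenizer works at the byte level and is far more optimized); the function names `train_toy_bpe`, `most_frequent_pair` and `merge_pair` are made up for this illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most common adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Ties are broken by first occurrence, because Counter keeps insertion order.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_toy_bpe(text, num_merges):
    """Learn up to `num_merges` merges from whitespace-separated words.

    Returns the ordered merge list (the analogue of a merges file) and the
    symbols still in use after merging (a simplified analogue of a vocab file).
    """
    words = Counter(tuple(word) for word in text.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    vocab = sorted({symbol for symbols in words for symbol in symbols})
    return merges, vocab


merges, vocab = train_toy_bpe("the old the old the", num_merges=2)
print(merges)
print(vocab)
```

The order of the merge list matters: a tokenizer replays the merges in the same order to split new text, which is why both files are needed to rebuild it.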


# CPU Training
GPT2ConfigCPU is a mini variant of GPT-2 optimized for CPU training; for example, it accepts 64 input tokens here versus 1024 for base GPT-2.

In [4]:
config = GPT2ConfigCPU()

# Instantiate aitextgen using the created tokenizer and config

In [5]:
ai = aitextgen(vocab_file=vocab_file, 
               merges_file=merges_file, 
               config=config)

INFO:aitextgen:Constructing GPT-2 model from provided config.
INFO:aitextgen:Using a custom tokenizer.


# Training Dataset
You can build datasets for training by creating a TokenDataset, which automatically processes the text into blocks of the appropriate size (here block_size=64, matching the 64-token input of the config above).

In [6]:
data = TokenDataset(file_name, 
                    vocab_file=vocab_file, 
                    merges_file=merges_file, 
                    block_size=64)


INFO:aitextgen.TokenDataset:Encoding 8,642 sets of tokens from lovecraft.txt.
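
As a rough illustration of what a block size means, the sketch below cuts a flat list of token ids into fixed-length windows. This is only one plausible scheme and is not taken from the TokenDataset source; `make_blocks` and its `stride` parameter are hypothetical names used for the example.

```python
def make_blocks(token_ids, block_size, stride=1):
    """Cut a token stream into windows of exactly `block_size` tokens.

    stride=1 yields overlapping sliding windows; stride=block_size yields
    non-overlapping chunks. Trailing tokens that do not fill a window are dropped.
    """
    if block_size <= 0 or stride <= 0:
        raise ValueError("block_size and stride must be positive")
    last_start = len(token_ids) - block_size
    return [token_ids[i:i + block_size] for i in range(0, last_start + 1, stride)]


tokens = list(range(6))  # stand-in for encoded text
print(make_blocks(tokens, block_size=4))
print(make_blocks(tokens, block_size=4, stride=4))
```

Whatever the exact scheme, each training example can be no longer than the number of input tokens the model config accepts, which is why block_size is set to 64 in the cell above.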





# Train the Model
Training saves pytorch_model.bin periodically (every 1,000 steps in the output below) and again on completion.

In [8]:
ai.train(data, 
         num_steps=10000,
         num_workers=4,
         batch_size=16)

GPU available: False, used: False
INFO:lightning:GPU available: False, used: False
TPU available: False, using: 0 TPU cores
INFO:lightning:TPU available: False, using: 0 TPU cores



1,000 steps reached: saving model to /trained_model
1,000 steps reached: generating sample texts.
.











"


"















 I found, it be an I was the a more he was to the old be a house, and the the ancient of the night, and
2,000 steps reached: saving model to /trained_model
2,000 steps reached: generating sample texts.
-it to the very a more-lation. The he had was a womon, the way of the very dark, and the room was as he had seen, but had the time had been a small, the mited; and the cas. I had the great things in some it
3,000 steps reached: saving model to /trained_model
3,000 steps reached: generating sample texts.
 it the time by that in those more rimo's thing, and the time.

ONas.




Eeon'I knew that my day I were never told that he could only at the the Te the world, and the moment he's eyes--the hand
4,000 steps reached: saving model to /trained_model
4,000 steps reached: generating sample texts.
, an

INFO:aitextgen:Saving trained model pytorch_model.bin to /trained_model


 in the old and the old man's body before.

I was not to see what the whole of the Great world had been in the house, and had not been to see the place of the house was no longer before. It was a long the whole of the Sason Sweston,


# Generation Functions

## Parameters

**n**: Number of texts generated

**max_length**: Maximum length of the generated text (default: 200; for GPT-2, the maximum is 1024.)

**prompt**: Prompt that starts the generated text and is included in the generated text (called prefix in previous tools).

**temperature**: Controls the "craziness" of the text: higher values make the output more random (recommended to keep between 0.7 and 1.0).

**top_k**: If nonzero, limits sampling to the k most probable tokens (40 is recommended).

**top_p**: If nonzero, limits sampling to the smallest set of most probable tokens whose cumulative probability reaches top_p (0.9 is recommended).
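
To see how temperature, top_k and top_p interact, here is a minimal sketch of the filtering applied to one next-token distribution, written in plain Python. It illustrates the general technique and is not aitextgen's implementation; `candidate_distribution` and `sample_token` are hypothetical helpers, and the order of operations (temperature, then top_k, then top_p) is an assumption of this sketch.

```python
import math
import random


def candidate_distribution(logits, temperature=1.0, top_k=0, top_p=0.0):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature: dividing logits by a value > 1 flattens the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = {i: math.exp(x - peak) for i, x in enumerate(scaled)}
    ranked = sorted(weights, key=weights.get, reverse=True)
    # top_k: keep only the k most probable tokens.
    if top_k > 0:
        ranked = ranked[:top_k]
    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    if 0.0 < top_p < 1.0:
        total = sum(weights[i] for i in ranked)
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += weights[i] / total
            if cumulative >= top_p:
                break
        ranked = kept
    # Renormalize so the surviving probabilities sum to 1.
    total = sum(weights[i] for i in ranked)
    return {i: weights[i] / total for i in ranked}


def sample_token(logits, rng=None, **filters):
    """Draw one token id from the filtered distribution."""
    rng = rng or random.Random()
    dist = candidate_distribution(logits, **filters)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a four-token vocabulary
print(candidate_distribution(logits, top_k=2))
print(candidate_distribution(logits, temperature=2.0))
print(sample_token(logits, rng=random.Random(0), top_k=40, top_p=0.9))
```

With these made-up logits, raising the temperature lowers the probability of the top token, while top_k and top_p shrink the set of tokens that can be drawn at all.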

**ai.generate_samples()**: Generates multiple samples at specified temperatures: great for debugging.

In [9]:
ai.generate_samples(n=1,
            prompt="Ancient ruin in Alabama swamp - voodoo.",
            top_k=40,
            top_p=0.9, 
            max_length=1024)

####################
Temperature: 0.7
####################
Ancient ruin in Alabama swamp - voodoo.

DATo, the last of what has gone in the Go, and that the Pore Lad.

The next I know of the next time, for the old man had seen in the night before
####################
Temperature: 1.0
####################
Ancient ruin in Alabama swamp - voodoo. The porned, I heard a very queerly-book of the Chazing-post--sus, the ruder's dort was the souzous, and its page at this
####################
Temperature: 1.2
####################
Ancient ruin in Alabama swamp - voodoo. The lader came from the sea; and Carter had been so very more more than he had no more than he thought to the last. The time was in the night a certain old man, and it had said that there came


**ai.generate_one()**: A helper function that generates a single text and returns it as a string (good for APIs).

In [10]:
ai.generate_one(prompt="Ancient ruin in Alabama swamp - voodoo.",
            top_k=40,
            top_p=0.9, 
            max_length=1024, 
            temperature=1.2)

'Ancient ruin in Alabama swamp - voodoo. Louns and I was, for, my dreams and the hideous and unladition I knew in the same kind of a long-radent diarom and I wondered that a little one thing. I did not'

**ai.generate()**: Generates and prints text to console.

In [11]:
ai.generate(n=4,
            prompt="Ancient ruin in Alabama swamp - voodoo.",
            top_k=40,
            top_p=0.9, 
            max_length=1024, 
            temperature=1.2)

Ancient ruin in Alabama swamp - voodoo. Then and over these weed, I felt the whole rows of a cesselaritically at a blacker, but to keep it from the dark house-wodhed door to a cubetinch
Ancient ruin in Alabama swamp - voodoo. But when I did not let myself with that old he was at that he knew before it the next day before a man told them, and if they had seen of its lading gently at the great house in his son.
Ancient ruin in Alabama swamp - voodoo. Hanll and Rort. It's old Diliscar's Prile wasnoa at that time, who had no good for an in the night in the end, if no longer might have found;
Ancient ruin in Alabama swamp - voodoo. They know that he can't keep.

Po, he spoke about, for those day-gaunts was very darkly. But I had to say of what I had a moment's face of the whole and mumul


**ai.generate_to_file()**: Generates texts in bulk and writes them to a file. It accepts a batch_size parameter, which is useful when generating on a GPU.

In [12]:
# The last digits of the saved file name are the seed, which can be used to reproduce results.

ai.generate_to_file(n=100, 
                    prompt="Ancient ruin in Alabama swamp - voodoo.",
                    top_k=40,
                    top_p=0.9, 
                    max_length=1024, 
                    temperature=1.2)

INFO:aitextgen:Generating 100 texts to ATG_20200831_025307_32961397.txt


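
Since the seed is embedded in the output file name, a small helper can recover it for a later run. The sketch below assumes the name always ends in an underscore, the seed digits and ".txt", as in the single log line above; `seed_from_filename` is a hypothetical helper, not part of aitextgen.

```python
import random
import re


def seed_from_filename(name):
    """Extract the trailing integer from a name like 'ATG_<date>_<time>_<seed>.txt'.

    The layout is inferred from one example file name, so treat it as an assumption.
    """
    match = re.search(r"_(\d+)\.txt$", name)
    if match is None:
        raise ValueError(f"no trailing seed found in {name!r}")
    return int(match.group(1))


seed = seed_from_filename("ATG_20200831_025307_32961397.txt")
print(seed)

# Why a seed reproduces results: two generators seeded alike draw the same numbers.
first = random.Random(seed).random()
second = random.Random(seed).random()
print(first == second)
```

If your aitextgen version accepts a seed argument in generate_to_file (check its signature first), passing the recovered value together with the same generation settings should reproduce the file.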


