<a href="https://colab.research.google.com/github/tbadams/pokegen/blob/ipython/pokenames_textgenrnn_GPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Textgen-rnn Pokemon

by Trevor Adams

based on [textgen-rnn](https://colab.research.google.com/drive/1mMKGnVxirJnqDViH7BDJxFqWrsXlPSoK#scrollTo=aeXshJM-Cuaf) by [Max Woolf](http://minimaxir.com)

For more about textgenrnn, you can visit [this GitHub repository](https://github.com/minimaxir/textgenrnn).


## Setup

### Init

In [6]:
%tensorflow_version 1.x

!pip install -q textgenrnn
from google.colab import files
from textgenrnn import textgenrnn
from datetime import datetime
import os
from google.colab import drive
import csv
import random
import pprint
import json
import re
import time
import shutil

SAMPLE_DATA_FILE_NAME = "poke_data.txt"

# link drive so we can pull base data automatically/save results
# can also upload manually before run
drive.mount('/content/drive')
# IMPORTANT: when copying the authorization key, select the key text itself.
# The copy button does not work.

# Set up library params

model_cfg = {
    'word_level': False,   # set to True if want to train a word-level model (requires more data and smaller max_length)
    'rnn_size': 256,   # number of LSTM cells of each layer (128/256 recommended)
    'rnn_layers': 2,   # number of LSTM layers (>=2 recommended)
    'rnn_bidirectional': True,   # consider text both forwards and backward, can give a training boost
    'max_length': 40,   # number of tokens to consider before predicting the next (20-40 for characters, 5-10 for words recommended)
    'max_words': 10000   # maximum number of words to model; the rest will be ignored (word-level model only)
}
 # @markdown set higher to train the model for longer between sample generations
num_epochs = 1 #@param {type: "number"}
 # @markdown set higher to execute more train/generate cycles
# todo call them runs
total_rounds = 10 #@param {type: "number"}
 # @markdown proportion of input data to train on; values < 1.0 hold out the rest for validation, so the model cannot simply memorize every example
train_size = 0.9  #@param {type: "number"}
 # @markdown ignore a random proportion of source tokens each epoch, allowing model to generalize better
dropout = 0.1 #@param {type: "number"}
gen_epochs = 99999 # library generates sample text from model after given number of epochs
train_cfg = {
    'line_delimited': True,   # set to True if each text has its own line in the source file
    'num_epochs': num_epochs,  
    'gen_epochs': gen_epochs,  
    'train_size': train_size,  
    'dropout': dropout,  
    'validation': True,   # If train_size < 1.0, test on holdout dataset; will make overall training slower
    'is_csv': False,   # set to True if file is a CSV exported from Excel/BigQuery/pandas
    'new_model':False
}
#@markdown change to set file name of resulting trained models/texts
model_name = 'tpoke4b'   # @param {type: "string"}

DRIVE_ROOT = "/content/drive/My Drive/"
cur_round = 0

# prepare sample text
file_name = SAMPLE_DATA_FILE_NAME
if not os.path.exists(file_name):
  uploaded = files.upload()
if not os.path.exists(file_name):
  raise FileNotFoundError("Need a source file for data!!")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
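
As a rough illustration of what `train_size = 0.9` means, the sketch below shuffles a list of texts and holds out the remainder for validation. The function name `split_texts` and its `seed` parameter are hypothetical, and textgenrnn's own splitting logic may differ in detail (for example, it may split encoded sequences rather than whole lines).

```python
import random

def split_texts(texts, train_size=0.9, seed=0):
    """Shuffle texts and split them into (train, validation) lists.

    Hypothetical helper for illustration; not part of textgenrnn.
    """
    if not 0.0 < train_size <= 1.0:
        raise ValueError("train_size must be in (0, 1]")
    shuffled = list(texts)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]

# placeholder data standing in for the lines of poke_data.txt
names = ["name{}".format(i) for i in range(1785)]
train, held_out = split_texts(names, train_size=0.9)
print(len(train), len(held_out))
```

With `train_size=1.0` the held-out list is empty, which matches the note in `train_cfg` that validation only applies when `train_size < 1.0`.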


### Uh, everything else

In [26]:
checkpoint_file_templates = ['{}_weights.hdf5', '{}_vocab.json','{}_config.json']

def all_to_gdrive(fnames, overwrite=False):
  # copy each file to Google Drive; existing files are skipped unless overwrite is set
  for fname in fnames:
    dest = os.path.join(DRIVE_ROOT, fname)
    if overwrite or not os.path.isfile(dest):
      shutil.copyfile(fname, dest)
      print("copied {}".format(fname))  # report only after the copy succeeds
    else:
      print("skipped {}".format(fname))

def save_checkpoint_to_drive(model_name, add_labeled_copy=True):
  fnames = []
  copy_names = []
  for t in checkpoint_file_templates:
    fname = t.format(model_name)
    fnames.append(fname)
    if add_labeled_copy:
      rounds = cur_round * num_epochs
      copyname = t.format("{}_{}".format(model_name, rounds))
      shutil.copyfile(fname, copyname)
      fnames.append(copyname)
      copy_names.append(copyname)
  all_to_gdrive(fnames, True)
  for copyname in copy_names:
    os.remove(copyname)

logs = []
def log(msg):
  logs.append(msg)
  print(msg)

import json
# this temperature schedule cycles between 1 very unexpected token, 1 unexpected token, 2 expected tokens, repeat.
# changing the temperature schedule can result in wildly different output!
# temperature = [1.0, 0.5, 0.2, 0.2]  

# if train_cfg['line_delimited']:
#   n = 1000
#   max_gen_length = 60 if model_cfg['word_level'] else 300
# else:
#   n = 1
#   max_gen_length = 2000 if model_cfg['word_level'] else 1000
# timestring = datetime.now().strftime('%Y%m%d_%H%M%S')

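def make_samples(temp_pattern=[0.5], length=400, sample_num=500,
                prefix="<|startoftext|>"):  # prefix: seed text that each generated text starts with
  # generate one text per iteration; n=1 avoids generating sample_num texts
  # on every pass and discarding all but the first
  samples = []
  for _ in range(sample_num):
    samples.append(textgen.generate(temperature=temp_pattern,
                          prefix=prefix,
                          n=1,
                          return_as_list=True,
                          max_gen_length=length)[0])
  # one sample per line, so the line count reported by permute_samples is meaningful
  return "\n".join(samples)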


def write_sample_header(header_file_name, temperature, prefix):
      # add param header
      gen_header = {"model_name":model_name,
                      "temperature": str(temperature),
                      "prefix": prefix,
                      "model_cfg":model_cfg, 
                      "train_cfg":train_cfg}
      with open(header_file_name, "w") as gen_header_file:
        gen_header_file.write(json.dumps(gen_header, indent=4))
        print("wrote gen header to " + header_file_name)

def permute_samples(temps=[[0.2],[0.5], [0.75], [1.0], [1.0, 0.5, 0.2, 0.2]], 
                 prefix="<|startoftext|>",
                 length=1000, sample_num=1, header=True, download=True):
  for temperature in temps:
      print("Sampling temp pattern {}".format(temperature))
      starttime = time.time()
      file_prefix = "{}_{}_{}_{}".format(
          model_name, cur_round * num_epochs, int(starttime), 
          "temp"+ ",".join(map(lambda t: str(t).lstrip("0."), temperature)))
      samples_file_name = "{}.txt".format(file_prefix)
      header_file_name = "{}.json".format(file_prefix)

      samples = make_samples(
          temperature, 
          prefix=prefix, 
          length=length, 
          sample_num=sample_num)
      sample_lines = samples.split("\n")

      with open(samples_file_name, "w") as samples_file:
        samples_file.write(samples)
        print("wrote {} samples to {}".format(str(len(sample_lines)), samples_file_name))
      if header:
        write_sample_header(header_file_name, temperature, prefix)
        all_to_gdrive([header_file_name, samples_file_name])
      else:
        all_to_gdrive([samples_file_name])
      print("took {}s".format(time.time() - starttime))

def have_model(model_name=model_name):
  for t in checkpoint_file_templates:
    if not os.path.exists(t.format(model_name)):
      return False
  return True

Sampling temp pattern [0.5, 0.7]
wrote 1 samples to tpoke4b_1_1601341530_temp5,7.txt
wrote gen header to tpoke4b_1_1601341530_temp5,7.json
copied tpoke4b_1_1601341530_temp5,7.json
copied tpoke4b_1_1601341530_temp5,7.txt
took 3.1607608795166016s
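
The temperature patterns used above (for example `[1.0, 0.5, 0.2, 0.2]`) control how adventurous each generated token is. The sketch below is a generic illustration of temperature scaling and of cycling a pattern over token positions. `apply_temperature` and `schedule_for` are hypothetical helpers written for this note, not textgenrnn's internal code, and the probabilities are made up.

```python
import math
from itertools import cycle, islice

def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a temperature.

    Values below 1.0 sharpen the distribution toward the likeliest token;
    1.0 leaves it unchanged. All probabilities must be greater than zero.
    """
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]

def schedule_for(pattern, num_tokens):
    """Repeat a temperature pattern so every token position gets a value."""
    return list(islice(cycle(pattern), num_tokens))

next_char_probs = [0.7, 0.2, 0.1]  # made-up probabilities for three candidate characters
for t in schedule_for([1.0, 0.5, 0.2, 0.2], 4):
    print(t, [round(p, 3) for p in apply_temperature(next_char_probs, t)])
```

At a temperature of `0.2` nearly all of the probability mass lands on the likeliest character, which is why low temperatures give safer, more repetitive names.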


## Train

In [8]:
textgen = textgenrnn(name=model_name)

In [25]:
# start the actual training.  Keras's CuDNN layers make training on GPU fast.
# Ideally, you want a training loss less than `1.0` in order 
# for the model to create sensible text consistently.


while cur_round < total_rounds:
  cur_round = cur_round + 1
  print("Starting round {} of {}.".format(cur_round, total_rounds))
  train_function = textgen.train_from_file if train_cfg['line_delimited'] else textgen.train_from_largetext_file
  should_start_fresh = train_cfg['new_model'] or not have_model()
  if should_start_fresh:
    print("Initializing new model {}.".format(model_name))
  train_function(
      file_path=file_name,
      new_model=should_start_fresh,
      num_epochs=train_cfg['num_epochs'],
      gen_epochs=train_cfg['gen_epochs'],
      batch_size=2048,
      train_size=train_cfg['train_size'],
      dropout=train_cfg['dropout'],
      validation=train_cfg['validation'],
      is_csv=train_cfg['is_csv'],
      rnn_layers=model_cfg['rnn_layers'],
      rnn_size=model_cfg['rnn_size'],
      rnn_bidirectional=model_cfg['rnn_bidirectional'],
      max_length=model_cfg['max_length'],
      dim_embeddings=100,
      word_level=model_cfg['word_level'])

  save_checkpoint_to_drive(model_name)
  permute_samples(length=3000)

Starting round 1 of 10.
Initializing new model tpoke4b.
1,785 texts collected.
Training new model w/ 2-layer, 256-cell Bidirectional LSTMs
Training on 475,692 character sequences.
Epoch 1/1
 12/232 [>.............................] - ETA: 2:52 - loss: 6.0648

KeyboardInterrupt: ignored
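
The `232` in the progress bar above is consistent with the number of training sequences divided by the batch size, rounded down. The snippet below is a small arithmetic check; it assumes steps per epoch are computed by floor division, which is an assumption about the library's behavior rather than something the log states.

```python
num_sequences = 475692  # "Training on 475,692 character sequences" from the log
batch_size = 2048       # batch_size passed to the training call

# assumption: a final partial batch is dropped, so floor division is used
steps_per_epoch = max(num_sequences // batch_size, 1)
leftover = num_sequences - steps_per_epoch * batch_size

print(steps_per_epoch)  # 232
print(leftover)         # 556 sequences do not fill a final batch
```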

## Generate
The cell below generates sample text from your model and copies the resulting files to Google Drive. Rerun the cell as many times as you want for even more text!

In [10]:
permute_samples([[0.5]])  # temps is a list of temperature patterns, so a single pattern needs its own list

## Download Models

You can download the weights and configuration files in the cell below, allowing you to recreate the model on your own computer!

In [None]:
files.download('{}_weights.hdf5'.format(model_name))
files.download('{}_vocab.json'.format(model_name))
files.download('{}_config.json'.format(model_name))

To recreate the model on your own computer, after installing textgenrnn and TensorFlow, you can create a Python script with:

```
from textgenrnn import textgenrnn
# file names follow model_name, which is 'tpoke4b' in this notebook
textgen = textgenrnn(weights_path='tpoke4b_weights.hdf5',
                       vocab_path='tpoke4b_vocab.json',
                       config_path='tpoke4b_config.json')
                       
textgen.generate_samples(max_gen_length=1000)
textgen.generate_to_file('textgenrnn_texts.txt', max_gen_length=1000)
```
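
Before loading, it can help to confirm that all three checkpoint files are present. The sketch below mirrors the file-name templates used earlier in this notebook. `missing_checkpoint_files` is a hypothetical helper, and the `directory` argument is an assumption about where the files were downloaded.

```python
import os

CHECKPOINT_TEMPLATES = ['{}_weights.hdf5', '{}_vocab.json', '{}_config.json']

def missing_checkpoint_files(model_name, directory="."):
    """Return the checkpoint file names that are not present in directory."""
    expected = [t.format(model_name) for t in CHECKPOINT_TEMPLATES]
    return [name for name in expected
            if not os.path.isfile(os.path.join(directory, name))]

missing = missing_checkpoint_files("tpoke4b")
if missing:
    print("Missing files: {}".format(", ".join(missing)))
else:
    print("All checkpoint files found.")
```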

Have fun with your new model! :)

## Etcetera

If the model fails to load on a local machine with a model-size mismatch error (common with weights files larger than 30 MB), the cause is a file export bug in Colaboratory. To work around it, save the weights to Google Drive with the two cells below and download them from there.

In [None]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from google.colab import files
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
gdrive = GoogleDrive(gauth)  # named gdrive so it does not shadow the google.colab drive module imported during setup

In [None]:
uploaded = gdrive.CreateFile({'title': '{}_weights.hdf5'.format(model_name)})
uploaded.SetContentFile('{}_weights.hdf5'.format(model_name))
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))

Uploaded file with ID 10Wiez4To7JJJaXnMkfJJsQ72FmH23qvV
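
One way to check whether a downloaded weights file arrived intact is to compare its size and SHA-256 digest with the copy in Colaboratory. This is a sketch using only the standard library. `file_fingerprint` is a hypothetical helper, and the file name in the usage comment assumes `model_name` is `tpoke4b`.

```python
import hashlib
import os

def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # read in chunks so large weights files do not need to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()

# Run this in Colaboratory and on the local machine, then compare the results:
# print(file_fingerprint("tpoke4b_weights.hdf5"))
```

If the sizes or digests differ, the local copy is not the same file that Colaboratory wrote, and downloading through Google Drive is worth trying.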


If the notebook has errors (e.g. GPU Sync Fail), force-kill the Colaboratory virtual machine and restart it with the command below:

In [None]:
!kill -9 -1