# Finetuning a model for NovelAI
Welcome to javaman's Colab Notebook for training your own GPT-Neo model for NovelAI. Follow these steps and you'll be playing NAI with your model in no time. Do join [NovelAI Discord](https://discord.gg/q9XU7fchbY) if you're not a member already, and help us build it!

## Requirements
* GPT-Neo
* Google Cloud Storage (a bucket)
* A Colab Notebook with TPUs available (if you have Colab Pro, even better)
* Conversion from TensorFlow model to a Huggingface model
* NovelAI Colab (mine or finetune's, either works)

## Getting started

In [None]:
#@title Set up GPT-Neo
#@markdown This step will set up needed stuff for training your model. It will download GPT-Neo and install other dependencies.

%tensorflow_version 2.x
!git clone https://github.com/EleutherAI/GPTNeo
%cd GPTNeo
!pip3 install -q -r requirements.txt
pretrained_model = None
dataset = None

In [None]:
#@title Set up Google Cloud Storage
#@markdown You will need to create a bucket on GCP to use this colab. Use [this link](https://console.cloud.google.com/storage) to create a GCP bucket. Once you've created your bucket and given it a unique name, come back here, enter its name below, and authenticate on GCP.

cloud_bucket = 'gs://your-bucket' #@param {type:"string"}

from google.colab import auth
auth.authenticate_user()
!gcloud init
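
Later cells append a trailing slash to `cloud_bucket` and build paths from it, so it can help to check the value early. The following is a minimal sketch; `normalize_bucket` is a hypothetical helper written for this notebook, not part of GPT-Neo, `gcloud`, or `gsutil`, and it only checks the `gs://` prefix rather than the full bucket naming rules.

```python
def normalize_bucket(uri: str) -> str:
    """Check the gs:// prefix and return the URI with a trailing slash.

    Hypothetical helper for illustration only.
    """
    uri = uri.strip()
    if not uri.startswith("gs://") or len(uri) <= len("gs://"):
        raise ValueError(f"expected a URI like gs://your-bucket, got {uri!r}")
    # The tokenizing cell expects the bucket path to end with a slash.
    return uri if uri.endswith("/") else uri + "/"


print(normalize_bucket("gs://your-bucket"))
```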

In [None]:
#@title Set up Google Drive
#@markdown You might want to store your datasets on GDrive and then use them here. For this, you'll need to authenticate on GDrive.

from google.colab import drive
drive.mount('/content/drive')

In [None]:
#@title Set up training dataset
#@markdown This cell was imported from the official GPT-Neo Colab Notebook. It lets you finetune with any dataset you want, including components of The Pile. To use a Pile component, just select it from the dropdown menu.

#@markdown * Sampling Only - choose this option if you only wish to sample from our trained models, then move on to the Pretrained Model section.

#@markdown * Literotica - erotic texts

#@markdown * OpenWebText - an open-source clone of OpenAI's WebText dataset, the original training data of GPT-2.

#@markdown * YoutubeSubtitles - a dataset of subtitles scraped from youtube videos.

#@markdown * Hackernews - comments scraped from hackernews

#@markdown * Books - collection of novels

#@markdown * NIHExporter - data relating to various projects from the National Institutes of Health.

#@markdown * Custom - if this option is chosen you will be prompted to enter the path to your own dataset. It should be a directory containing .txt or .jsonl files.

#@markdown All these datasets are from EleutherAI's side project - The Pile™ - an effort to gather a general purpose, diverse and open source plain text dataset large enough to train 1T+ parameter language models. Alternatively, you can provide your own dataset in the form of a folder or gzip archive of .txt files. Simply select 'Custom' below and input the path to your data and the name of your dataset when prompted.

#@markdown **Note:** `dataset_name` is an arbitrary name you set. It will be used again when finetuning starts. It doesn't need to match the actual dataset file names.

import os
dataset = 'Custom' #@param ["Sampling_Only", "Literotica", "OpenWebText", "YoutubeSubtitles", "HackerNews", "Books", "NIHExporter", "Custom"]

if dataset == "Sampling_Only":
  pass
elif dataset == 'OpenWebText':
  !wget https://the-eye.eu/public/AI/pile_preliminary_components/openwebtext2.jsonl.zst.tar -O openwebtext.tar.xz
  !tar xf openwebtext.tar.xz
  dataset_path = "openwebtext"
  dataset_name = dataset_path
  out_name = dataset_name + "_tokenized"
elif dataset == 'Literotica':
  os.makedirs('data', exist_ok=True)
  !wget https://the-eye.eu/public/AI/pile_preliminary_components/Literotica.jsonl.zst -O data/Literotica.jsonl.zst
  dataset_path = 'data'
  dataset_name = 'literotica'
  out_name = dataset_name + "_tokenized"
elif dataset == 'YoutubeSubtitles':
  os.makedirs('data', exist_ok=True)
  !wget https://the-eye.eu/public/AI/pile_preliminary_components/yt_subs.jsonl.zst -O data/yt_subs.jsonl.zst
  dataset_path = 'data'
  dataset_name = 'ytsubs'
  out_name = dataset_name + "_tokenized"
elif dataset == 'HackerNews':
  os.makedirs('data', exist_ok=True)
  !wget https://the-eye.eu/public/AI/pile_preliminary_components/hn.tar.gz -O data/hn.tar.gz
  dataset_path = 'data'
  dataset_name = 'hackernews'
  out_name = dataset_name + "_tokenized"
elif dataset == 'Books':
  os.makedirs('data', exist_ok=True)
  !wget https://the-eye.eu/public/AI/pile_preliminary_components/books3.tar.gz -O data/Books.tar.gz
  dataset_path = 'data'
  dataset_name = 'books'
  out_name = dataset_name + "_tokenized"
elif dataset == "NIHExporter":
  os.makedirs('data', exist_ok=True)
  !wget https://the-eye.eu/public/AI/pile_preliminary_components/NIH_ExPORTER_awarded_grant_text.jsonl.zst -O data/NIH_ExPORTER_awarded_grant_text.jsonl.zst
  dataset_path = 'data'
  dataset_name = 'nihexporter'
  out_name = dataset_name + "_tokenized"
elif dataset == "Custom":
  dataset_path = input('Enter the path to the folder containing your data: ')
  dataset_name = input('Enter the name of your dataset: ')
  out_name = dataset_name + "_tokenized"
else:
  raise NotImplementedError('please select from available options: ["Sampling_Only", "Literotica", "OpenWebText", "YoutubeSubtitles", "HackerNews", "Books", "NIHExporter", "Custom"]')
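
For the Custom option, the folder should contain `.txt` or `.jsonl` files. A quick check before tokenizing can be sketched as follows; `find_dataset_files` is a hypothetical helper, not part of GPT-Neo, and it only looks at files directly inside the folder.

```python
import tempfile
from pathlib import Path


def find_dataset_files(dataset_dir, suffixes=(".txt", ".jsonl")):
    """Return the sorted files directly inside dataset_dir with a matching suffix.

    Hypothetical helper for illustration only.
    """
    root = Path(dataset_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"{dataset_dir!r} is not a directory")
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.name.endswith(tuple(suffixes))
    )


# Demo on a throwaway folder: only the .txt and .jsonl files are kept.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.jsonl", "notes.md"):
        (Path(tmp) / name).write_text("hello")
    found_names = [p.name for p in find_dataset_files(tmp)]

print(found_names)
```

If the list comes back empty for your own folder, the tokenizing step will likely have nothing to work with.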


In [None]:
#@title Tokenize your dataset
#@markdown This step will read through the dataset you provided and tokenize its contents. The tokens will then be stored on your GCP bucket.

!python data/create_tfrecords.py --input_dir $dataset_path --name $dataset_name --files_per 1000 --output_dir $out_name --write_dataset_config --processes 1

print("Your GCP bucket path: " + cloud_bucket)
print("Dataset type: " + dataset)
print("Dataset name: " + dataset_name)
print("Dataset path: " + dataset_path)

if not cloud_bucket.endswith('/'):
  cloud_bucket += '/'

copy_loc = cloud_bucket + "datasets/" + dataset
!gsutil -m cp -r /content/GPTNeo/$out_name $copy_loc
!gsutil ls $cloud_bucket
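
The dataset config written later needs a wildcard path to the uploaded `.tfrecords` files. The sketch below composes that path from the same variables this cell uses. `tfrecords_glob` is a hypothetical helper, and it assumes that `gsutil` placed the `<name>_tokenized` folder under `datasets/<dataset>` and that the record files start with the dataset name; check the `gsutil ls` output to confirm the actual layout in your bucket.

```python
def tfrecords_glob(cloud_bucket: str, dataset: str, dataset_name: str) -> str:
    """Compose the wildcard path for the dataset config's "path" field.

    Hypothetical helper; the resulting layout is an assumption to verify
    against `gsutil ls`.
    """
    if not cloud_bucket.endswith("/"):
        cloud_bucket += "/"
    out_name = dataset_name + "_tokenized"           # same naming as the dataset cell
    copy_loc = cloud_bucket + "datasets/" + dataset  # same destination as the gsutil copy
    return f"{copy_loc}/{out_name}/{dataset_name}*.tfrecords"


print(tfrecords_glob("gs://your-bucket", "Custom", "your_dataset_name"))
```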

## Finetuning the model
Now that we've set up everything, we can begin finetuning our model. You can either start from scratch or continue from a checkpoint of an existing model.

#### Configuration files
Now you will need to write some configuration files for your model. Keep in mind that you'll need to edit some values in the files below before you can use them. The relative paths here all point to /content/GPTNeo.

#### First file (dataset config)
* Change _**your_dataset_name.json**_ in the `%%writefile` line to your dataset's name so the file gets saved under it.
* Change the bucket in _**path**_ (_**gs://amaranth-ai**_ in the example) to your own bucket.
* Change the rest of the path to match where your _tfrecords_ files are. Note the wildcard there. Leave it, and only change what comes before it.

#### Second file (model config)
* Optionally rename _**custom_model.json**_ in the `%%writefile` line. Whatever name you choose is the one to enter as `config_filename` later.
* The prop _**model_path**_ needs to be changed as well. Point it to a folder in your bucket that doesn't exist yet. This folder will be created and your finetuned checkpoints will be stored there.
* The prop _**datasets**_ needs to be modified to the dataset config created before. That is, the name of the first file (or, if you set up various datasets, add them to the array too).
* The prop _**iterations**_ refers to how many steps the finetuning will go before it's saved. We recommend a value higher than 500, and if you're doing 2K or more steps, try increasing it even more. The more often the checkpoints are saved, the heavier the model will be, so try to keep it to a minimum (which means a higher value, but never higher than the quantity of steps you set for the training).

**Note:** the values in this model config file (32 layers, embedding size 2560, model path `GPT3_2-7B`) correspond to the 2.7B model. To use the 1.3B config or your own, just open the model folder and paste the contents of the file `config.json` in the second cell.
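
Instead of editing the dataset config by hand, you could also generate it. This is only a sketch: `make_dataset_config` is a hypothetical helper, the key names and values are copied from the dataset config cell in this notebook, and the path in the example is a placeholder to replace with your own.

```python
import json


def make_dataset_config(tfrecords_path: str) -> dict:
    """Build a dataset config with the same keys as the notebook's config cell.

    Hypothetical helper for illustration only.
    """
    return {
        "path": tfrecords_path,
        "eval_path": "",
        "n_vocab": 50256,
        "tokenizer_is_pretrained": True,
        "tokenizer_path": "gpt2",
        "eos_id": 50256,
        "padding_id": 50257,
    }


# Placeholder path: swap in your bucket and dataset name, keep the wildcard.
config_text = json.dumps(
    make_dataset_config(
        "gs://your-unique-bucket-name/datasets/Custom/"
        "your_dataset_name_tokenized/your_dataset_name*.tfrecords"
    ),
    indent=2,
)
print(config_text)
```

The printed JSON can be pasted into the `%%writefile` cell or written to `configs/dataset_configs/` directly.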

In [None]:
%%writefile configs/dataset_configs/your_dataset_name.json

{
  "path": "gs://amaranth-ai/datasets/Custom/your_dataset_name_tokenized/your_dataset_name*.tfrecords",
  "eval_path": "",
  "n_vocab": 50256,
  "tokenizer_is_pretrained": true,
  "tokenizer_path": "gpt2",
  "eos_id": 50256,
  "padding_id": 50257
}

Writing configs/dataset_configs/your_dataset_name.json


In [None]:
%%writefile configs/custom_model.json

{
  "n_head": 20,
  "n_vocab": 50257,
  "embed_dropout": 0,
  "lr": 0.00016,
  "lr_decay": "cosine",
  "warmup_steps": 3000,
  "beta1": 0.9,
  "beta2": 0.95,
  "epsilon": 1e-08,
  "ada_epsilon1": "1e-30",
  "ada_epsilon2": 0.001,
  "opt_name": "adam",
  "weight_decay": 0,
  "train_batch_size": 512,
  "attn_dropout": 0,
  "train_steps": 400000,
  "lr_decay_end": 300000,
  "eval_steps": 10,
  "predict_steps": 0,
  "res_dropout": 0,
  "eval_batch_size": 128,
  "predict_batch_size": 1,
  "iterations": 1000,
  "n_embd": 2560,
  "datasets": [
    ["your_dataset_name", null, null, null]
  ],
  "model_path": "gs://neo-d/models/GPT3_2-7B",
  "n_ctx": 2048,
  "n_layer": 32,
  "scale_by_depth": true,
  "scale_by_in": false,
  "attention_types": [
    [
      ["global", "local"], 16]
  ],
  "mesh_shape": "x:64,y:4",
  "layout": "batch:x,embd:y",
  "activation_function": "gelu",
  "recompute_grad": true,
  "gradient_clipping": 1.0,
  "tokens_per_mb_per_replica": 4096,
  "padding_id": 50257,
  "eos_id": 50256
}

Writing configs/custom_model.json
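
If you paste in a different model config, a couple of internal consistency checks can catch copy-paste mistakes before training starts. The sketch below is not GPT-Neo code; `expand_attention_types` and `check_model_config` are hypothetical helpers, and they assume that each `attention_types` entry is a pattern followed by a repeat count, and that the expanded list should have one entry per layer.

```python
def expand_attention_types(attention_types):
    """Flatten [[pattern, repeats], ...] into one attention type per layer."""
    layers = []
    for pattern, repeats in attention_types:
        layers.extend(pattern * repeats)
    return layers


def check_model_config(cfg):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    per_layer = expand_attention_types(cfg["attention_types"])
    if len(per_layer) != cfg["n_layer"]:
        problems.append(
            f"attention_types covers {len(per_layer)} layers, n_layer is {cfg['n_layer']}"
        )
    if cfg["n_embd"] % cfg["n_head"] != 0:
        problems.append("n_embd is not divisible by n_head")
    return problems


# Values taken from the model config cell in this notebook.
cfg = {
    "n_layer": 32,
    "n_embd": 2560,
    "n_head": 20,
    "attention_types": [[["global", "local"], 16]],
}
problems = check_model_config(cfg)
print(problems or "config looks consistent")
```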


#### Setting things up
EleutherAI has two pre-trained models, one with [1.3B parameters](https://the-eye.eu/public/AI/gptneo-release/GPT3_XL/), and another with [2.7B](https://the-eye.eu/public/AI/gptneo-release/GPT3_2-7B/). We will use those to finetune with our own dataset. Be advised that the 2.7B model is really heavy, and you won't be able to use Colab to train a model that big with a batch size greater than 2, because the Colab VM will start throwing memory errors. If you want to do heavy-duty training of the 2.7B model, consider [renting instances made for this kind of work](https://vast.ai/console/create/).

* `pretrained_model`: the name of the model you're going to finetune (use the options in the dropdown for vanilla models)
* `is_vanilla_model`: check only if you're going to finetune 1.3B or 2.7B in their original state
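
To get a feel for why the larger model is so heavy, here is a rough parameter estimate. It uses the common approximation of `12 * n_layer * n_embd^2` for the transformer blocks plus the token and position embeddings, and ignores biases and layer norms, so treat the result as a ballpark figure. `estimate_parameters` is a hypothetical helper; the input values are taken from the model config cell in this notebook.

```python
def estimate_parameters(n_layer: int, n_embd: int, n_vocab: int, n_ctx: int) -> int:
    """Rough parameter count; ignores biases and layer norm weights."""
    transformer_blocks = 12 * n_layer * n_embd ** 2
    token_embeddings = n_vocab * n_embd
    position_embeddings = n_ctx * n_embd
    return transformer_blocks + token_embeddings + position_embeddings


estimate = estimate_parameters(n_layer=32, n_embd=2560, n_vocab=50257, n_ctx=2048)
print(f"{estimate / 1e9:.2f}B parameters")  # prints 2.65B parameters
```

The estimate lands near the 2.7B size, and every one of those weights also needs optimizer state during training, which is why small batch sizes are required on Colab.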

In [None]:
#@markdown #### Set up the model
pretrained_model = "GPT3_XL" #@param ["GPT3_XL", "GPT3_2-7B"] {allow-input: true}
is_vanilla_model = True #@param {type:"boolean"}

if is_vanilla_model:
  !wget --cut-dirs=4 -nH -r -m -np -c -U "eye02" -w 2 -R "index.html*" "https://the-eye.eu/public/AI/gptneo-release/$pretrained_model/" -P "/content/models/$pretrained_model/"

path_to_local_weights = f"/content/models/{pretrained_model}"

In [None]:
#@markdown #### Copy the downloaded model to your bucket
#@markdown * `delete_from_colab`: if checked, deletes the downloaded model from local storage in Colab.

delete_from_colab = True #@param{type:'boolean'}

# Copy the downloaded model to your bucket
bucket_base = "gs://" + cloud_bucket.replace('gs://', '').split('/')[0]
!gsutil -m cp -r $path_to_local_weights $bucket_base

if delete_from_colab:
  !rm -rf $path_to_local_weights
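
The copy above goes to the root of your bucket, even if `cloud_bucket` contains a sub-path. As a sketch of where the weights should end up, assuming `gsutil cp -r` keeps the model folder name under the bucket root (`bucket_root` and `model_destination` are hypothetical helpers):

```python
def bucket_root(cloud_bucket: str) -> str:
    """Strip any sub-path and return just gs://<bucket-name>."""
    name = cloud_bucket.replace("gs://", "").split("/")[0]
    return "gs://" + name


def model_destination(cloud_bucket: str, pretrained_model: str) -> str:
    """Expected location of the copied model folder (an assumption to verify)."""
    return f"{bucket_root(cloud_bucket)}/{pretrained_model}"


print(model_destination("gs://your-bucket/", "GPT3_XL"))
```

Running `gsutil ls` on the printed path is a quick way to confirm the copy worked.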

#### Begin finetuning
Now we want to make a few modifications to the model config in order to get training / sampling working on colab.

If you are just sampling from our pretrained models, you can leave the settings as is, run the cell below, then move on to the `Sample from your model` section.

If finetuning, you can change parameters below. 

* `path_to_model`: where your model files are located in your bucket

* `batch_size`: is your train batch size - if you're encountering memory errors, try lowering this. (**Note:** Colab can only handle up to 2 with the 2.7B model)

* `mesh_shape`: specifies the way the model will be divided up across the TPU cores. We suggest leaving this alone unless you know what you're doing.

* `train_steps`: specifies how many steps you want the model to finetune for. If you are just sampling from the model, you can leave this as is.

* `steps_per_checkpoint`: specifies how often you want to save model weights during training.

* `start_step`: the checkpoint step your training will continue from. Check the latest checkpoint your model has and continue from it. If you're using the vanilla 2.7B, set it to 400000, or 362000 for the 1.3B.

* `config_filename`: config file name. It should be the one you created before this step. It will be located in `/content/GPTNeo/configs`.
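
If you do change `mesh_shape`, the product of its dimensions should match the number of TPU cores. The sketch below parses the string and computes that product. `parse_mesh_shape` and `mesh_size` are hypothetical helpers, and the figure of 8 cores is an assumption about the TPU that Colab usually provides.

```python
def parse_mesh_shape(mesh_shape: str) -> dict:
    """Turn a string like "x:4,y:2" into {"x": 4, "y": 2}."""
    dims = {}
    for part in mesh_shape.split(","):
        name, size = part.split(":")
        dims[name.strip()] = int(size)
    return dims


def mesh_size(mesh_shape: str) -> int:
    """Total number of cores the mesh expects (product of all dimensions)."""
    total = 1
    for size in parse_mesh_shape(mesh_shape).values():
        total *= size
    return total


COLAB_TPU_CORES = 8  # assumption: a Colab TPU exposes 8 cores

print(mesh_size("x:4,y:2") == COLAB_TPU_CORES)
```

By the same arithmetic, the `x:64,y:4` value in the original model config expects 256 cores, which is why the training cell overrides it.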

In [None]:
import json
from pprint import pprint
path_to_model = "" #@param {type:"string"}
batch_size = 2 #@param {type:"integer"}
mesh_shape = "x:4,y:2" #@param {type:"string"}
train_steps = 15000 #@param {type:"integer"}
steps_per_checkpoint = 1000 #@param {type:"integer"}
start_step = 415000 #@param {type:"integer"}
config_filename = 'custom_model.json' #@param {type:"string"}
 
if path_to_model == "":
  # Keep only the bucket name, matching where the model was copied earlier
  path_to_model = "gs://" + cloud_bucket.replace('gs://', '').split('/')[0] + f'/{pretrained_model}'
print(f'MODEL PATH: {path_to_model}\n')

if dataset_name == "" and dataset != "Sampling_Only":
  dataset_name = dataset
elif dataset is None and dataset_name == "":
  dataset_name = "pile"
 
def pad_to_multiple_of(n, mult):
  """
  pads n to a multiple of mult
  """
  extra = n % mult
  if extra > 0:
    n = n + mult - extra
  return n

with open(f'/content/GPTNeo/configs/{config_filename}', 'r') as f:
  data = json.load(f)
  pprint(data)
  dset_val = [[dataset_name, None, None, None]] if dataset_name != "" else data["datasets"]
  mods = {
          "mesh_shape": mesh_shape,
          "layout": "intermediate_expanded:x,heads:x,memory_length:y,embd:y",
          "model_path": path_to_model,
          "datasets": dset_val,
          "train_steps": start_step + train_steps,
          "eval_steps": 0,
          "train_batch_size": batch_size,
          "predict_batch_size": batch_size
        }
  data.update(mods)
  print('\n--->\n')
  pprint(data)
  with open(f'configs/{pretrained_model}.json', 'w') as outfile:
    json.dump(data, outfile, indent=2)

!python3 main.py --model $pretrained_model --steps_per_checkpoint $steps_per_checkpoint --tpu colab
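
Note that the cell above writes `start_step + train_steps` into the config, so `train_steps` in the form means additional steps. A small worked example of that arithmetic follows; `training_plan` is a hypothetical helper, and the start steps in the table are the ones this notebook suggests for the vanilla models.

```python
# Suggested start steps for the vanilla models, as given in this notebook.
VANILLA_START_STEPS = {"GPT3_2-7B": 400000, "GPT3_XL": 362000}


def training_plan(start_step: int, train_steps: int, steps_per_checkpoint: int):
    """Return (final step written to the config, checkpoints saved along the way)."""
    final_step = start_step + train_steps
    new_checkpoints = train_steps // steps_per_checkpoint
    return final_step, new_checkpoints


final_step, new_checkpoints = training_plan(
    VANILLA_START_STEPS["GPT3_2-7B"], train_steps=15000, steps_per_checkpoint=1000
)
print(final_step, new_checkpoints)
```

With these numbers the run ends at step 415000 and saves roughly 15 checkpoints, so that is also the `start_step` to use when continuing from this finetune.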

## Sampling from your model
Now we can try sampling stuff from our model. First, let's create an initial prompt for the model.

In [None]:
#@markdown After saving the sample file, let's try generating an output from the model. Type a prompt and then run this cell to test it.
!rm -rf example_prompt.txt
prompt = "You are a knight in the kingdom of Larion" #@param{type:'string'}

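# Write the prompt from Python so quotes and shell metacharacters in it are preserved
with open('example_prompt.txt', 'w') as f:
  f.write(prompt)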
!python3 main.py --model $pretrained_model --steps_per_checkpoint $steps_per_checkpoint --tpu colab --predict --prompt example_prompt.txt