Welcome to the colab notebook for [GPTNeo](https://github.com/EleutherAI/GPTNeo) - a fully open source implementation of GPT like models for mesh-tensorflow by [EleutherAI](eleuther.ai).

Our library provides training and inference for GPT models up to GPT3 sizes on both TPUs and GPUs. 

In this notebook we walk you through TPU training (or finetuning!) and sampling using the freely available colab TPUs.

If you find our repo useful, come join [our discord](https://discord.gg/BK2v3EJ) and say hi! 😬

Before we get going - make sure you are running this notebook with a TPU available. Go to Runtime -> Change Runtime Type and select 'TPU' under hardware accelerator.




In [10]:
%%bash
cd /content/GPTNeo
rm -rf GPTNeo
ls -l
pretrained_model = None
dataset = None

total 187804
-rw-r--r-- 1 root root       751 Sep 11 10:43 CITATION.cff
-rw-r--r-- 1 root root        23 Sep 11 10:43 CODEOWNERS
drwxr-xr-x 3 root root      4096 Sep 11 11:55 configs
-rw-r--r-- 1 root root      1776 Sep 11 10:43 configs.py
drwxr-xr-x 3 root root      4096 Sep 11 11:15 data
-rw-r--r-- 1 root root      1544 Sep 11 10:43 docker-compose.yml
-rw-r--r-- 1 root root       455 Sep 11 10:43 Dockerfile
-rw-r--r-- 1 root root       885 Sep 11 10:43 encoders.py
-rw-r--r-- 1 root root       826 Sep 11 12:18 example_prompt.txt
-rw-r--r-- 1 root root       501 Sep 11 10:43 export.py
-rw-r--r-- 1 root root    118389 Sep 11 10:43 GPTNeo_example_notebook.ipynb
-rw-r--r-- 1 root root     15608 Sep 11 10:43 inputs.py
-rw-r--r-- 1 root root      1067 Sep 11 10:43 LICENSE
drwxr-xr-x 2 root root      4096 Sep 11 11:15 logs
-rw-r--r-- 1 root root     11270 Sep 11 10:43 main.py
-rw-r--r-- 1 root root     14740 Sep 11 10:43 model_fns.py
drwxr-xr-x 4 root root      4096 Sep 11 11:15 models
-rw-r

bash: line 4: pretrained_model: command not found
bash: line 5: dataset: command not found


In [31]:
#@title Setup
%tensorflow_version 2.x
!git clone https://github.com/EleutherAI/GPTNeo
%cd GPTNeo
!pip3 install -q -r requirements.txt
pretrained_model = None
dataset = None


Cloning into 'GPTNeo'...
remote: Enumerating objects: 3835, done.[K
remote: Counting objects: 100% (226/226), done.[K
remote: Compressing objects: 100% (146/146), done.[K
remote: Total 3835 (delta 126), reused 144 (delta 79), pack-reused 3609[K
Receiving objects: 100% (3835/3835), 1.47 MiB | 10.45 MiB/s, done.
Resolving deltas: 100% (2216/2216), done.
/content/GPTNeo/GPTNeo
[31mERROR: Operation cancelled by user[0m


## Set Up Google Cloud

To train on TPUs we need to store our data on a google cloud bucket - as TPUs can't read from local filesystems.

You can set up a bucket by signing up for a free trial here: https://console.cloud.google.com/

Make a bucket at https://console.cloud.google.com/storage and come back when that's done.

Make sure to select 'Uniform' access control when setting up the bucket, or the colab notebook won't have the required permissions to read from it.

The next cell sets up google authentication and gives the notebook read and write access to your bucket.


https://github.com/google-research/text-to-text-transfer-transformer/issues/318

In [4]:
!pip install -q t5 tensorflow-text==2.2

[K     |████████████████████████████████| 152 kB 5.3 MB/s 
[K     |████████████████████████████████| 3.0 MB 46.1 MB/s 
[K     |████████████████████████████████| 516.4 MB 13 kB/s 
[K     |████████████████████████████████| 20.1 MB 1.2 MB/s 
[K     |████████████████████████████████| 3.0 MB 42.3 MB/s 
[K     |████████████████████████████████| 2.9 MB 41.9 MB/s 
[K     |████████████████████████████████| 454 kB 56.8 MB/s 
[K     |████████████████████████████████| 4.0 MB 40.7 MB/s 
[K     |████████████████████████████████| 269 kB 57.2 MB/s 
[K     |████████████████████████████████| 1.2 MB 45.9 MB/s 
[K     |████████████████████████████████| 90 kB 9.5 MB/s 
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you

In [5]:
from google.colab import auth
auth.authenticate_user()
!gcloud init

Welcome! This command will take you through the configuration of gcloud.

Settings from your current configuration [default] are:
component_manager:
  disable_update_check: 'True'
compute:
  gce_metadata_read_timeout_sec: '0'
core:
  account: sivasubramanian.v@prodapt.com

Pick configuration to use:
 [1] Re-initialize this configuration [default] with new settings 
 [2] Create a new configuration
Please enter your numeric choice:  1

Your current configuration has been set to: [default]

You can skip diagnostics next time by using the following flag:
  gcloud init --skip-diagnostics

Network diagnostic detects and fixes local network connection issues.
Reachability Check passed.
Network diagnostic passed (1/1 checks passed).

Choose the account you would like to use to perform operations for 
this configuration:
 [1] sivasubramanian.v@prodapt.com
 [2] Log in with a new account
Please enter your numeric choice:  1

You are logged in as: [sivasubramanian.v@prodapt.com].

Pick cloud proje

In [6]:
path_to_cloud_bucket = 'gs://terraformgenerator/ml/GPTNeo' #@param {type:"string"}

## Set Up Dataset

We first need to download and tokenize a dataset. If you just want to sample from a pretrained model, you can skip this step and move on to the `Pretrained Model` section.

You can choose from:

*   Sampling Only - choose this option if you only wish to sample from our trained models, then move on to the `Pretrained Model` section.

*   OpenWebText - an opensource clone of OpenAI's WebText dataset, the original training data of GPT2.

*   YoutubeSubtitles - a dataset of subtitles scraped from youtube videos.

* Hackernews - comments scraped from hackernews

* NIHExporter - Data relating to various projects from the national institute of health.

* Custom - if this option is chosen you will be prompted to enter the path to your own dataset. It should be a directory containing .txt or .jsonl files.

All these datasets are from EleutherAI's side project - [The Pile™](https://github.com/EleutherAI/The-Pile) - an effort to gather a general purpose, diverse and open source plain text dataset large enough to train 1T+ parameter language models.

Even the smallest datasets are fairly large files, so this step will likely take a while. Select a dataset in the next cell, then run the next two cells, and go grab a snack and a cup of tea 😊

Alternatively, you can provide your own dataset in the form of a folder or gzip archive of .txt files. Simply select 'Custom' below and follow input the path to your data and the name of your dataset when prompted.

In [11]:
# Select a Dataset:
import os
dataset = 'Sampling_Only' #@param ["Sampling_Only", "OpenWebText", "YoutubeSubtitles", "HackerNews", "NIHExporter", "Custom"]

if dataset == "Sampling_Only":
  pass
elif dataset == 'OpenWebText':
  !wget https://the-eye.eu/public/AI/pile_preliminary_components/openwebtext2.jsonl.zst.tar -O openwebtext.tar.xz
  !tar xf openwebtext.tar.xz
  dataset_path = "openwebtext"
  dataset_name = dataset_path
  out_name = dataset_name + "_tokenized"
elif dataset == 'YoutubeSubtitles':
  os.makedirs('data', exist_ok=True)
  !wget https://the-eye.eu/public/AI/pile_preliminary_components/yt_subs.jsonl.zst -O data/yt_subs.jsonl.zst
  dataset_path = 'data'
  dataset_name = 'ytsubs'
  out_name = dataset_name + "_tokenized"
elif dataset == 'HackerNews':
  os.makedirs('data', exist_ok=True)
  !wget https://the-eye.eu/public/AI/pile_preliminary_components/hn.tar.gz -O data/hn.tar.gz
  dataset_path = 'data'
  dataset_name = 'hackernews'
  out_name = dataset_name + "_tokenized"
elif dataset == "NIHExporter":
  os.makedirs('data', exist_ok=True)
  !wget https://the-eye.eu/public/AI/pile_preliminary_components/NIH_ExPORTER_awarded_grant_text.jsonl.zst -O data/NIH_ExPORTER_awarded_grant_text.jsonl.zst
  dataset_path = 'data'
  os.system('mv NIH_ExPORTER_awarded_grant_text.jsonl.zst ./data')
  dataset_name = 'nihexporter'
  out_name = dataset_name + "_tokenized"
elif dataset == "Custom":
  dataset_path = input('Enter the path to the folder containing your data: ')
  dataset_name = input('Enter the name of your dataset: ')
  out_name = dataset_name + "_tokenized"
else:
  raise NotImplementedError('please select from available options: ["OpenWebText", "YoutubeSubtitles", "HackerNews", "NIHExporter", "Custom"]')


In [19]:
%%bash
echo $dataset
cd GPTNeo
pwd


/content/GPTNeo


### Tokenize and Upload Data

Now tokenize the dataset and copy it over to your google cloud bucket. You may skip this step if you are sampling from a pre-trained model.

In [23]:
!cd GPTNeo
!pwd
# Tokenize Data
!python data/create_tfrecords.py --input_dir /content/GPTNeo/$dataset_path --name $dataset_name --files_per 1000 --output_dir $out_name --write_dataset_config --processes 1

# copy the data to your bucket
if not path_to_cloud_bucket.endswith('/'):
  path_to_cloud_bucket += '/'
copy_loc = path_to_cloud_bucket + "datasets/" + dataset
!gsutil -m cp -r /content/GPTNeo/$out_name $copy_loc
!gsutil ls $path_to_cloud_bucket

/content
python3: can't open file 'data/create_tfrecords.py': [Errno 2] No such file or directory


NameError: ignored

Before starting training - you'll need to edit your dataset & model configs to point to your buckets / data. You need to do this even if you are sampling from a pre-trained model.

*   First change the writefile path to point to your chosen dataset - e.g `%%writefile configs/dataset_configs/ytsubs.json`
*   Change the "path" field to point to your cloud bucket location - e.g `gs://neo_lmdatasets/datasets/ytsubs_*.tfrecords`
* Change `dataset_name` in `%%writefile configs/dataset_configs/dataset_name.json` to the name of your chosen dataset.
* Once you've made the edits, then run the cell below to overwrite the existing files.




In [18]:
%%writefile configs/dataset_configs/Sampling_Only.json

{
  "path": "gs://eleutherai/datasets/Sampling_Only/Sampling_Only*.tfrecords",
  "eval_path": "",
  "n_vocab": 50256,
  "tokenizer_is_pretrained": true,
  "tokenizer_path": "gpt2",
  "eos_id": 50256,
  "padding_id": 50257
}


Writing configs/dataset_configs/Sampling_Only.json


FileNotFoundError: ignored

## Set Model Configs

The model below is identical to our pretrained GPT3XL model (1.3B Params). 

If you want to use a smaller model, you can modify any of the config files in ../configs/ ending in _8.json, all of which are designed to train on tpu-v8s.

For a more detailed breakdown on what each item in the configuration file means - please read through our training and config guides in our [github README](https://github.com/EleutherAI/GPTNeo#training-guide). 

You'll want to change the first item in the `datasets` list to the name of your chosen dataset. (the filename minus .json in ./configs/dataset_configs)

You'll also want to modify the `model_path` field to point to your google cloud bucket, so checkpoints get saved to there.

In [14]:
%%writefile configs/GPT3_XL.json

{
    "n_head": 16,
    "n_vocab": 50257,
    "embed_dropout": 0,
    "lr": 0.0002,
    "lr_decay": "cosine",
    "warmup_steps": 3000,
    "beta1": 0.9,
    "beta2": 0.95,
    "epsilon": 1e-8,
    "opt_name": "adam",
    "weight_decay": 0,
    "train_batch_size": 256,
    "attn_dropout": 0,
    "train_steps": 600000,
    "eval_steps": 0,
    "predict_steps": 1,
    "res_dropout": 0,
    "eval_batch_size": 4,
    "predict_batch_size": 1,
    "iterations": 100,
    "n_embd": 2048,
    "datasets": [["pile", null, null, null]],
    "model": "GPT",
    "model_path": "gs://eleutherai/GPT3_XL",
    "n_ctx": 2048,
    "n_layer": 24,
    "scale_by_depth": true,
    "scale_by_in": false,
    "attention_types" :  [[["global", "local"],12]],
    "mesh_shape": "x:4,y:2",
    "layout": "intermediate_expanded:x,heads:x,vocab:n_vocab,memory_length:y,embd:y",
    "activation_function": "gelu",
    "recompute_grad": true,
    "gradient_clipping": 1.0,
    "tokens_per_mb_per_replica": 2048,
    "precision": "bfloat16"
}

Writing configs/GPT3_XL.json


FileNotFoundError: ignored

## Training from Scratch

Now we will begin to train the model. If no previous model is found in "model_path", the model will start training from scratch. If you'd prefer to finetune from pretrained, skip to the `Finetune a Pretrained Model` section.

If everything's set up correctly, you can now run the main.py function to start training!

In [None]:
!python3 main.py --model colab_XL --steps_per_checkpoint 500 --tpu colab

## Pretrained Model

If you want to sample from or finetune a pretrained model, EleutherAI has pretrained two models for release. One with [1.3B parameters](https://the-eye.eu/public/AI/gptneo-release/GPT3_XL/), and another with [2.7B](https://the-eye.eu/public/AI/gptneo-release/GPT3_2-7B/). 

Select an option below to download the weights locally. You will then need to upload them to your cloud bucket in order to finetune from them. If the download command isn't working, try the commented out code to download from a different source.

The 2-7B model likely won't fit into the colab TPUs memory, and you may have to get some larger pods to finetune from it.

Sampling from it, however, works just fine.


In [7]:
# @title Download pretrained model weights:
pretrained_model = 'GPT3_2-7B' #@param ["GPT3_XL", "GPT3_2-7B"]
!wget -m -np -c -U "eye02" -w 2 -R "index.html*" "https://the-eye.eu/public/AI/gptneo-release/$pretrained_model/"
path_to_local_weights = f"/content/GPTNeo/the-eye.eu/public/AI/gptneo-release/{pretrained_model}"

# URL = f"http://eaidata.bmk.sh/data/gptneo-release/{pretrained_model}/"
# FOLDER_NAME = "GPT3_XL"
# !curl $URL | grep -i "</a>" | sed -n 's/.*href="\([^"]*\).*/\1/p' | sed "s|^|$URL|" | xargs -n 1 -P 4 wget -P $pretrained_model
# path_to_local_weights = pretrained_model


--2021-09-11 10:53:03--  https://the-eye.eu/public/AI/gptneo-release/GPT3_2-7B/
Resolving the-eye.eu (the-eye.eu)... 162.213.130.242
Connecting to the-eye.eu (the-eye.eu)|162.213.130.242|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘the-eye.eu/public/AI/gptneo-release/GPT3_2-7B/index.html.tmp’

the-eye.eu/public/A     [ <=>                ]  14.37K  --.-KB/s    in 0.03s   

Last-modified header missing -- time-stamps turned off.
2021-09-11 10:53:03 (446 KB/s) - ‘the-eye.eu/public/AI/gptneo-release/GPT3_2-7B/index.html.tmp’ saved [14718]

Loading robots.txt; please ignore errors.
--2021-09-11 10:53:05--  https://the-eye.eu/robots.txt
Reusing existing connection to the-eye.eu:443.
HTTP request sent, awaiting response... 200 OK
Length: 4 [text/plain]
Saving to: ‘the-eye.eu/robots.txt’


2021-09-11 10:53:05 (567 KB/s) - ‘the-eye.eu/robots.txt’ saved [4/4]

Removing the-eye.eu/public/AI/gptneo-release/GPT3_2-7B/index.html.tmp s

In [8]:
# upload to your bucket
bucket_base = "gs://" + path_to_cloud_bucket.replace('gs://', '').split('/')[0]
!gsutil -m cp -r $path_to_local_weights $bucket_base

Copying file:///content/GPTNeo/the-eye.eu/public/AI/gptneo-release/GPT3_2-7B/model.ckpt-400000.data-00014-of-00064 [Content-Type=application/octet-stream]...
/ [0/68 files][    0.0 B/ 29.7 GiB]   0% Done                                   Copying file:///content/GPTNeo/the-eye.eu/public/AI/gptneo-release/GPT3_2-7B/model.ckpt-400000.data-00045-of-00064 [Content-Type=application/octet-stream]...
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite 

If everything has worked successfully you should now see your model listed in your bucket below.

In [9]:
!gsutil ls $bucket_base

gs://terraformgenerator/GPT3_2-7B/
gs://terraformgenerator/ml/


Now we want to make a few modifications to the model config in order to get training / sampling working on colab.

If you are just sampling from our pretrained models, you can leave the settings as is, run the cell below, then move on to the `Sample from your model` section.

If finetuning, you can change parameters below. 

* `path_to_model` should point to the model weights location in your cloud bucket, and will default to `$bucket_base/${pretrained_model}` if nothing is entered.

* `batch_size` is your train batch size - if you're encountering memory errors, try lowering this.

* `dataset_name` is the name of your dataset, if nothing is entered, this should default to the dataset you selected in the `Prepare Data` section.

* `mesh_shape` specifies the way the model will be divided up across the TPU cores. We suggest leaving this alone unless you know what you're doing.

* `train_steps` specifies how many steps you want the model to finetune for. We set this to 1000 for demonstrative purposes but you may need to increase this a little depending on your goals. If you are just sampling from the model, you can leave this as is.

* `steps_per_checkpoint` specifies how often you want to save model weights during training.



In [26]:
# @title Modify config for colab. 
  
import json
from pprint import pprint

path_to_model = "" #@param {type:"string"}
batch_size = 8 #@param {type:"integer"}
dset = ""  #@param {type:"string"}
mesh_shape = "x:4,y:2" #@param {type:"string"}
train_steps = 1000 #@param {type:"integer"}
steps_per_checkpoint = 500 #@param {type:"integer"}
start_step = 400000 if pretrained_model == "GPT3_2-7B" else 362000

if path_to_model == "":
  path_to_model = f'{bucket_base.strip("/")}/{pretrained_model}'
print(f'MODEL PATH: {path_to_model}\n')

if dset == "" and dataset != "Sampling_Only":
  dset = dataset
elif dataset is None and dset == "":
  dset = "pile"

def pad_to_multiple_of(n, mult):
  """
  pads n to a multiple of mult
  """
  extra = n % mult
  if extra > 0:
      n = n + mult - extra
  return n

with open(f'{path_to_local_weights}/config.json', 'r') as f:
  data = json.load(f)
  pprint(data)
  dset_val = [[dset, None, None, None]] if dset != "" else data["datasets"]
  mods = {
          "mesh_shape": mesh_shape,
          "layout": "intermediate_expanded:x,heads:x,memory_length:y,embd:y",
          "model_path": path_to_model,
          "datasets": dset_val,
          "train_steps": start_step + train_steps,
          "eval_steps": 0,
          "train_batch_size": batch_size,
          "predict_batch_size": batch_size
        }
  data.update(mods)
  print('\n--->\n')
  pprint(data)
  with open(f'configs/{pretrained_model}.json', 'w') as outfile:
    json.dump(data, outfile, indent=2)

MODEL PATH: gs://terraformgenerator/GPT3_2-7B

{'activation_function': 'gelu',
 'ada_epsilon1': '1e-30',
 'ada_epsilon2': 0.001,
 'attention_types': [[['global', 'local'], 16]],
 'attn_dropout': 0,
 'beta1': 0.9,
 'beta2': 0.95,
 'datasets': [['pile', None, None, None]],
 'embed_dropout': 0,
 'eos_id': 50256,
 'epsilon': 1e-08,
 'eval_batch_size': 128,
 'eval_steps': 10,
 'gradient_clipping': 1.0,
 'iterations': 500,
 'layout': 'batch:x,embd:y',
 'lr': 0.00016,
 'lr_decay': 'cosine',
 'lr_decay_end': 300000,
 'mesh_shape': 'x:64,y:4',
 'model_path': 'gs://neo-d/models/GPT3_2-7B',
 'n_ctx': 2048,
 'n_embd': 2560,
 'n_head': 20,
 'n_layer': 32,
 'n_vocab': 50257,
 'opt_name': 'adam',
 'padding_id': 50257,
 'predict_batch_size': 1,
 'predict_steps': 0,
 'recompute_grad': True,
 'res_dropout': 0,
 'scale_by_depth': True,
 'scale_by_in': False,
 'tokens_per_mb_per_replica': 4096,
 'train_batch_size': 512,
 'train_steps': 400000,
 'warmup_steps': 3000,
 'weight_decay': 0}

--->

{'activation

### Begin Fine-Tuning

If you are fine-tuning the pretrained model, this line of code will begin the training.

In [27]:
!python3 main.py --model $pretrained_model --steps_per_checkpoint $steps_per_checkpoint --tpu colab

Instructions for updating:
non-resource variables are not supported in the long term
Traceback (most recent call last):
  File "main.py", line 257, in <module>
    main(args)
  File "main.py", line 56, in main
    params = fetch_model_params(args.model)
  File "/content/GPTNeo/configs.py", line 29, in fetch_model_params
    assert dataset_id in DATASETS, f"Dataset '{dataset_id}' was not found under dataset_configs/ folder. Please follow the example.json in that folder."
AssertionError: Dataset 'None' was not found under dataset_configs/ folder. Please follow the example.json in that folder.


### Sample from your model

Once training is finished, (or your pretrained model is on your bucket), you can run the same command with the --predict flag to sample from your model.

To pass in a prompt, save it to a .txt file, and pass in the name of the file with the --prompt flag.

use the cell below to enter your prompt, and run it to save it to example_prompt.txt.

You may need to decrease the predict batch size in your config if you're facing OOM errors.

Let's see if the GPTNeo model can finish coding itself, with a sample prompt consisting of the beginning of a `torch.nn.Module`:

In [29]:
%%writefile example_prompt.txt

class GPT(nn.Module):
    """  the full GPT language model, with a context size of block_size """

    def __init__(self, config):
        super().__init__()

        # input embedding stem
        self.tok_emb = nn.Embedding(config.vocab_size, config.n_embd)
        self.pos_emb = nn.Parameter(torch.zeros(1, config.block_size, config.n_embd))
        self.drop = nn.Dropout(config.embd_pdrop)
        # transformer
        self.blocks = nn.Sequential(*[Block(config) for _ in range(config.n_layer)])
        # decoder head
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        self.block_size = config.block_size
        self.apply(self._init_weights)

        logger.info("number of parameters: %e", sum(p.numel() for p in self.parameters()))

Overwriting example_prompt.txt


In [30]:
!python3 main.py --model $pretrained_model --steps_per_checkpoint 500 --tpu colab --predict --prompt example_prompt.txt

Instructions for updating:
non-resource variables are not supported in the long term
Traceback (most recent call last):
  File "main.py", line 257, in <module>
    main(args)
  File "main.py", line 56, in main
    params = fetch_model_params(args.model)
  File "/content/GPTNeo/configs.py", line 29, in fetch_model_params
    assert dataset_id in DATASETS, f"Dataset '{dataset_id}' was not found under dataset_configs/ folder. Please follow the example.json in that folder."
AssertionError: Dataset 'None' was not found under dataset_configs/ folder. Please follow the example.json in that folder.


# Evaluating the model

This section assumes you are using a pretrained model and relies on variables created in the `Pretrained model` section.

## Wikitext

Download the wikitext test set:


In [16]:
wikitext103_src = "https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip"
!wget $wikitext103_src
!unzip wikitext-103-raw-v1.zip

--2021-09-11 11:33:10--  https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.133.112
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.133.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 191984949 (183M) [application/zip]
Saving to: ‘wikitext-103-raw-v1.zip’


2021-09-11 11:33:12 (80.2 MB/s) - ‘wikitext-103-raw-v1.zip’ saved [191984949/191984949]

Archive:  wikitext-103-raw-v1.zip
   creating: wikitext-103-raw/
  inflating: wikitext-103-raw/wiki.test.raw  
  inflating: wikitext-103-raw/wiki.valid.raw  
  inflating: wikitext-103-raw/wiki.train.raw  


Tokenize and upload to bucket:


In [17]:

!mkdir wikitext
!mv /content/GPTNeo/wikitext-103-raw/wiki.test.raw wikitext/wikitext_test.txt

# Tokenize Data
!python data/create_tfrecords.py --input_dir wikitext --name wikitext --files_per 1000 --output_dir wikitext_tokenized --write_dataset_config --processes 1 --wikitext-detokenize

# copy the data to your bucket
if not path_to_cloud_bucket.endswith('/'):
  path_to_cloud_bucket += '/'
copy_loc = path_to_cloud_bucket 
!gsutil -m cp -r wikitext_tokenized $copy_loc
!gsutil ls $path_to_cloud_bucket

mkdir: cannot create directory ‘wikitext’: File exists
Downloading: 100% 1.04M/1.04M [00:00<00:00, 7.17MB/s]
Downloading: 100% 456k/456k [00:00<00:00, 4.16MB/s]
Downloading: 100% 1.36M/1.36M [00:00<00:00, 8.36MB/s]
Writing TFRecord Files to wikitext_tokenized/. Parsed 0 input files. files_written : 0it [00:00, ?it/s]{'discarded': 0, 'processed': 1, 'successful': 1}
Writing TFRecord Files to wikitext_tokenized/. Parsed 0 input files. files_written : 0it [00:01, ?it/s]
Copying file://wikitext_tokenized/wikitext_0_139.tfrecords [Content-Type=application/octet-stream]...
/ [1/1 files][578.6 KiB/578.6 KiB] 100% Done                                    
Operation completed over 1 objects/578.6 KiB.                                    
gs://terraformgenerator/ml/GPTNeo/
gs://terraformgenerator/ml/GPTNeo/wikitext_tokenized/


Now make a dataset config that points to the tokenized wikitext data:

In [21]:
%%writefile configs/dataset_configs/wikitext.json

{
  "path": "",
  "eval_path": "gs://terraformgenerator/ml/GPTNeo/wikitext_tokenized/*.tfrecords",
  "n_vocab": 50256,
  "tokenizer_is_pretrained": true,
  "tokenizer_path": "gpt2",
  "eos_id": 50256,
  "padding_id": 50257
}


Overwriting configs/dataset_configs/wikitext.json


And update your model config to point to that dataset:


In [22]:
# @title Modify config for wikitext. 
  
import json
from pprint import pprint

batch_size = 8 #@param {type:"integer"}
assert pretrained_model is not None
with open(f'configs/{pretrained_model}.json', 'r') as f:
  data = json.load(f)
  pprint(data)
  dset_val = [["wikitext", None, None, None]]
  mods = {
          "datasets": dset_val,
          "eval_steps": 139 // batch_size,
          "train_batch_size": batch_size,
          "eval_batch_size": batch_size,
        }
  data.update(mods)
  print('\n--->\n')
  pprint(data)
  with open(f'configs/{pretrained_model}.json', 'w') as outfile:
    json.dump(data, outfile, indent=2)

{'activation_function': 'gelu',
 'ada_epsilon1': '1e-30',
 'ada_epsilon2': 0.001,
 'attention_types': [[['global', 'local'], 16]],
 'attn_dropout': 0,
 'beta1': 0.9,
 'beta2': 0.95,
 'datasets': [['wikitext', None, None, None]],
 'embed_dropout': 0,
 'eos_id': 50256,
 'epsilon': 1e-08,
 'eval_batch_size': 8,
 'eval_steps': 17,
 'gradient_clipping': 1.0,
 'iterations': 500,
 'layout': 'intermediate_expanded:x,heads:x,memory_length:y,embd:y',
 'lr': 0.00016,
 'lr_decay': 'cosine',
 'lr_decay_end': 300000,
 'mesh_shape': 'x:4,y:2',
 'model_path': 'gs://terraformgenerator/GPT3_2-7B',
 'n_ctx': 2048,
 'n_embd': 2560,
 'n_head': 20,
 'n_layer': 32,
 'n_vocab': 50257,
 'opt_name': 'adam',
 'padding_id': 50257,
 'predict_batch_size': 8,
 'predict_steps': 0,
 'recompute_grad': True,
 'res_dropout': 0,
 'scale_by_depth': True,
 'scale_by_in': False,
 'tokens_per_mb_per_replica': 4096,
 'train_batch_size': 8,
 'train_steps': 401000,
 'warmup_steps': 3000,
 'weight_decay': 0}

--->

{'activation_f

Now run model in eval mode over tokenized data:

In [23]:
!python3 main.py --eval --tpu colab --model $pretrained_model

Instructions for updating:
non-resource variables are not supported in the long term
Current step 400000
Saving config to gs://terraformgenerator/GPT3_2-7B
2021-09-11 11:40:25.118764: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-09-11 11:40:25.125062: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2199995000 Hz
2021-09-11 11:40:25.125317: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55f3cef7b9c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-09-11 11:40:25.125399: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-09-11 11:40:25.127906: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-09-11 11:40:25.139989: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed ca

## Lambada

Lambada eval is built into the codebase and can be run by adding a field to your model config

In [None]:
# @title Modify config for Lambada. 
  
import json
from pprint import pprint

batch_size = 8 #@param {type:"integer"}
assert pretrained_model is not None
with open(f'configs/{pretrained_model}.json', 'r') as f:
  data = json.load(f)
  mods = {
          "datasets": dset_val,
          "eval_steps": 0,
          "train_batch_size": batch_size,
          "eval_batch_size": batch_size,
          "eval_tasks": ["lambada"]
        }
  data.update(mods)
  print('\n--->\n')
  pprint(data)
  with open(f'configs/{pretrained_model}.json', 'w') as outfile:
    json.dump(data, outfile, indent=2)

Now run the eval:

In [None]:
!python3 main.py --eval --tpu colab --model $pretrained_model