# Fine-tuning GPT-J in Jupyter

This code was originally set up for a JupyterLab environment on an Nvidia A10 GPU (24 GB of VRAM) with 256 GB of RAM.

Find and replace "DATASET" with your dataset name.

In [None]:
# install dependencies
!pip install -r requirements.txt

In [None]:
# set environment variables so DeepSpeed can run inside the notebook; this emulates what a distributed launcher would set
import os
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '9994'
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = "0"
os.environ['WORLD_SIZE'] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID" # order CUDA devices (GPUs) by bus ID
os.environ["CUDA_VISIBLE_DEVICES"]="0" # which CUDA device(s) to use
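
The hard-coded port above can clash if another notebook on the same machine is already using it. The following is a minimal sketch of one way around that, not part of the original setup: it asks the OS for a free port and then sets the same single-process variables. The helper names `find_free_port` and `set_launcher_env` are hypothetical.

```python
import os
import socket
from typing import Optional


def find_free_port() -> int:
    """Ask the OS for a TCP port that is currently unused on this machine."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 means "pick any free port"
        return sock.getsockname()[1]


def set_launcher_env(port: Optional[int] = None) -> int:
    """Set the single-GPU, single-process variables; return the port used."""
    chosen = port if port is not None else find_free_port()
    os.environ.update({
        "MASTER_ADDR": "localhost",
        "MASTER_PORT": str(chosen),
        "RANK": "0",
        "LOCAL_RANK": "0",
        "WORLD_SIZE": "1",
    })
    return chosen


port = set_launcher_env()
print(f"MASTER_PORT set to {port}")
```

The port is released before DeepSpeed binds to it, so another process could in principle grab it in between; for a single-user notebook that risk is usually small. Passing an explicit port, e.g. `set_launcher_env(9994)`, reproduces the fixed value used above.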

In [None]:
# import required libraries
import pandas as pd
import torch
from torch.utils.data import Dataset, random_split
from transformers import AutoTokenizer, TrainingArguments, Trainer, AutoModelForCausalLM, IntervalStrategy

In [None]:
# check if GPU is available and how much VRAM is available
!nvidia-smi

Sun Nov 12 20:27:19 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A10          On   | 00000000:17:00.0 Off |                    0 |
|  0%   34C    P8    16W / 150W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A10          On   | 00000000:CA:00.0 Off |                    0 |
|  0%   56C    P8    17W / 150W |      0MiB / 23028MiB |      0%      Default |
|       

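# Making the DeepSpeed config file

Make sure the following JSON config parameters are consistent with the TrainingArguments in the model setup code block further down:

- train_batch_size <==> per_device_train_batch_size (DeepSpeed's train_batch_size is the global batch size, i.e. per_device_train_batch_size × gradient_accumulation_steps × number of GPUs, so the two only match directly with one GPU and no gradient accumulation, as here)

- lr (and warmup_max_lr) <==> learning_rate

- warmup_num_steps <==> warmup_steps

Sorry for the inconvenience of keeping these consistent by hand; I couldn't find a quick way to automate or remove this.
It might be possible to set some of these to "auto", but I didn't look into it because the parameters didn't change much during my research.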

If this code block doesn't work, you can also create the JSON file yourself by copying the curly-bracketed text into a file named ds_config_gpt_j.json.

In [None]:
%%bash
cat <<'EOT' > ds_config_gpt_j.json
{
  "resume_from_checkpoint": true,
  "train_batch_size": 1,
  "bf16": {
    "enabled": true,
    "min_loss_scale": 0.25,
    "opt_level": "O3"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "cpu"
    },
    "offload_optimizer": {
      "device": "cpu"
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "contiguous_gradients": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-05,
      "betas": [
        0.9,
        0.999
      ],
      "eps": 1e-08
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-05,
      "warmup_num_steps": 45
    }
  }
}

EOT

# Set up model fine-tuning parameters and dataset

In [None]:
# same seed for consistent results
torch.manual_seed(42)

# setup GPT-J tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|pad|>')

# set arguments for training, see https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/trainer#transformers.TrainingArguments
# a batch size of 1 or 2 is recommended with 24GB of VRAM and a dataset of ~6k entries
# warmup_steps should be ~10% of the dataset size; another scheduler may give better results, see https://www.deepspeed.ai/docs/config-json/#scheduler-parameters
# weight decay helps reduce overfitting, 0.01 to 0.1 is recommended
# bf16=True enables bfloat16 training, see https://cloud.google.com/tpu/docs/bfloat16

# check that the TrainingArguments are consistent with the DeepSpeed JSON config file above!

training_args = TrainingArguments(output_dir='./results', num_train_epochs=1, logging_steps=5, gradient_accumulation_steps=1, learning_rate=1e-05,
                                  per_device_train_batch_size=1, per_device_eval_batch_size=1, warmup_steps=45,
                                  weight_decay=0.1, logging_dir='./logs', bf16=True, deepspeed='./ds_config_gpt_j.json')

# setup GPT-J model to fine-tune
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B").cuda()

# resize model embedding to match new tokenizer
model.resize_token_embeddings(len(tokenizer))

# load the dataset; replace "DATASET" with the name of your file with tags inserted
# note: sep='\t' means the file is read as tab-separated, despite the .csv extension
descriptions = pd.read_csv('DATASET.csv', sep='\t')['descriptions']

# get the max token length of the entries in the dataset, needed to know how much to pad each entry later
max_length = max(len(tokenizer.encode(description)) for description in descriptions)
print(f"Max length: {max_length}")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


[2023-11-26 12:35:46,511] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-11-26 12:38:43,146] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 6.05B parameters




Max length: 323
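
The "~10% warmup" rule of thumb from the comments above is easier to apply once the number of optimizer steps is known. The sketch below is an approximation, not the Trainer's exact bookkeeping: it assumes the last partial batch is kept, and it applies the 10% to optimizer steps, which equals the dataset size only with a batch size of 1, no gradient accumulation, one GPU and one epoch. The function names are hypothetical; the example numbers are taken from the training log further down.

```python
import math


def optimization_steps(num_examples, per_device_batch_size,
                       gradient_accumulation_steps=1, world_size=1, num_epochs=1):
    """Approximate number of optimizer updates for a training run."""
    effective_batch = per_device_batch_size * gradient_accumulation_steps * world_size
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs


def suggested_warmup(total_steps, warmup_fraction=0.1):
    """Warmup length as a fraction of all optimizer steps, rounded down."""
    return int(total_steps * warmup_fraction)


# Numbers reported in the training log: 900 examples, batch size 4, 1 epoch
total = optimization_steps(num_examples=900, per_device_batch_size=4)
print(total)                    # 225, matching "Total optimization steps" in the log
print(suggested_warmup(total))  # 22
```

Whatever value is chosen has to go into both warmup_steps and warmup_num_steps, as described in the config section.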


In [None]:
# set up dataset class for convenience
class GameDataset(Dataset):
    """Tokenizes each text once up front and stores input IDs and attention masks."""

    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        for txt in txt_list:
            # wrap each entry in the start/end tokens, then pad or truncate to max_length
            encodings_dict = tokenizer('<|startoftext|>' + txt + '<|endoftext|>', truncation=True,
                                       max_length=max_length, padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]
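
Because every entry is padded to the same length, and the collator in the training cell below reuses the input IDs as labels, the padding positions also count towards the loss. A common alternative is to set the labels at padded positions to -100, the index the Hugging Face causal LM loss ignores. The sketch below shows the idea on plain Python lists so it runs without a GPU; `mask_padding_labels` is a hypothetical helper, not something the notebook currently uses.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in Hugging Face causal LMs


def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        token if keep == 1 else ignore_index
        for token, keep in zip(input_ids, attention_mask)
    ]


# Toy example: three real tokens followed by two padding tokens (IDs are made up)
input_ids = [50257, 1212, 50256, 50258, 50258]
attention_mask = [1, 1, 1, 0, 0]
print(mask_padding_labels(input_ids, attention_mask))
# [50257, 1212, 50256, -100, -100]
```

With tensors, the same effect can be had by cloning the stacked input IDs and assigning -100 wherever the stacked attention mask is 0. Whether this changes results noticeably for a given dataset is something to test rather than assume.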

# Training

In [None]:
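# set up dataset, split for 90% training, 10% validation
# the validation split is only used if evaluation is enabled (e.g. via evaluation_strategy in TrainingArguments)
# or if model_trainer.evaluate() is called; with the arguments above it is not used during training
# max_length + 2 leaves room for the '<|startoftext|>' and '<|endoftext|>' tokens that GameDataset adds,
# otherwise the longest entries would be truncated and lose their end token
dataset = GameDataset(descriptions, tokenizer, max_length=max_length + 2)
train_size = int(0.9 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])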

# set up deepspeed training
model_trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset,
        eval_dataset=val_dataset, data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                                              'attention_mask': torch.stack([f[1] for f in data]),
                                                              'labels': torch.stack([f[0] for f in data])})
model_trainer.train()

# save model after training
model_trainer.save_model("./gpt-j-DATASET")

Using cuda_amp half precision backend


[2023-11-12 20:28:11,047] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed info: version=0.8.1, git-hash=unknown, git-branch=unknown
[2023-11-12 20:28:11,121] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False


Using /home/jovyan/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jovyan/.cache/torch_extensions/py310_cu116/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module cpu_adam...


ninja: no work to do.
Time to load cpu_adam op: 2.9119908809661865 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000010, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2023-11-12 20:28:17,550] [INFO] [logging.py:75:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2023-11-12 20:28:17,568] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-11-12 20:28:17,569] [INFO] [utils.py:53:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-11-12 20:28:17,570] [INFO] [logging.py:75:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2023-11-12 20:28:17,651] [INFO] [utils.py:825:see_memory_usage] Stage 3 initialize beginning
[2023-11-12 20:28:17,653] [INFO] [utils.py:826:see_memory_usage] MA 0.88 GB         Max_MA 1.65 GB         CA 1.93 GB         Max_CA 2 GB 
[

Using /home/jovyan/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Emitting ninja build file /home/jovyan/.cache/torch_extensions/py310_cu116/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


ninja: no work to do.
Time to load utils op: 0.15554070472717285 seconds


Loading extension module utils...


[2023-11-12 20:28:17,879] [INFO] [utils.py:825:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-11-12 20:28:17,881] [INFO] [utils.py:826:see_memory_usage] MA 0.88 GB         Max_MA 0.88 GB         CA 1.93 GB         Max_CA 2 GB 
[2023-11-12 20:28:17,883] [INFO] [utils.py:834:see_memory_usage] CPU Virtual Memory:  used = 20.32 GB, percent = 8.1%
Parameter Offload: Total persistent parameters: 861410 in 115 params
[2023-11-12 20:28:18,380] [INFO] [utils.py:825:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-11-12 20:28:18,382] [INFO] [utils.py:826:see_memory_usage] MA 0.11 GB         Max_MA 0.88 GB         CA 1.93 GB         Max_CA 2 GB 
[2023-11-12 20:28:18,384] [INFO] [utils.py:834:see_memory_usage] CPU Virtual Memory:  used = 21.09 GB, percent = 8.4%
[2023-11-12 20:28:18,440] [INFO] [utils.py:825:see_memory_usage] Before creating fp16 partitions
[2023-11-12 20:28:18,442] [INFO] [utils.py:826:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.

Using /home/jovyan/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
***** Running training *****
  Num examples = 900
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 225
  Number of trainable parameters = 0


Time to load utils op: 0.005967378616333008 seconds




Step,Training Loss
5,5.8969
10,1.6922
15,1.5891
20,1.4109
25,1.4281
30,1.3672
35,1.3109
40,1.3641
45,1.3922
50,1.2984


[2023-11-12 20:32:07,466] [INFO] [logging.py:75:log_dist] [Rank 0] step=10, skipped=0, lr=[7.44921859773347e-06], mom=[[0.9, 0.999]]
[2023-11-12 20:32:07,468] [INFO] [timer.py:198:stop] epoch=0/micro_step=10/global_step=10, RunningAvgSamplesPerSec=0.19143738813000266, CurrSamplesPerSec=0.18888465122617287, MemAllocated=1.04GB, MaxMemAllocated=14.53GB
[2023-11-12 20:35:23,224] [INFO] [logging.py:75:log_dist] [Rank 0] step=20, skipped=0, lr=[9.691656839909223e-06], mom=[[0.9, 0.999]]
[2023-11-12 20:35:23,226] [INFO] [timer.py:198:stop] epoch=0/micro_step=20/global_step=20, RunningAvgSamplesPerSec=0.1984423663521117, CurrSamplesPerSec=0.2064105363514331, MemAllocated=1.04GB, MaxMemAllocated=14.53GB
[2023-11-12 20:38:35,825] [INFO] [logging.py:75:log_dist] [Rank 0] step=30, skipped=0, lr=[1e-05], mom=[[0.9, 0.999]]
[2023-11-12 20:38:35,827] [INFO] [timer.py:198:stop] epoch=0/micro_step=30/global_step=30, RunningAvgSamplesPerSec=0.20167892837235188, CurrSamplesPerSec=0.21611128636033308, Me



Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to ./gpt-j-wow-1ep-small
Configuration saved in ./gpt-j-wow-1ep-small/config.json
Configuration saved in ./gpt-j-wow-1ep-small/generation_config.json
Model weights saved in ./gpt-j-wow-1ep-small/pytorch_model.bin


[2023-11-12 21:41:28,452] [INFO] [engine.py:3507:save_16bit_model] Did not save the model ./gpt-j-wow-1ep-small/pytorch_model.bin because `stage3_gather_16bit_weights_on_model_save` is False


deepspeed.save_16bit_model didn't save the model, since stage3_gather_16bit_weights_on_model_save=false. Saving the full checkpoint instead, use zero_to_fp32.py to recover weights


[2023-11-12 21:41:28,474] [INFO] [logging.py:75:log_dist] [Rank 0] [Torch] Checkpoint global_step225 is begin to save!
[2023-11-12 21:41:28,491] [INFO] [logging.py:75:log_dist] [Rank 0] Saving model checkpoint: ./gpt-j-wow-1ep-small/global_step225/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-11-12 21:41:28,493] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./gpt-j-wow-1ep-small/global_step225/zero_pp_rank_0_mp_rank_00_model_states.pt...




[2023-11-12 21:41:28,837] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./gpt-j-wow-1ep-small/global_step225/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-11-12 21:41:28,840] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving ./gpt-j-wow-1ep-small/global_step225/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-11-12 21:43:26,399] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved ./gpt-j-wow-1ep-small/global_step225/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-11-12 21:43:26,501] [INFO] [engine.py:3407:_save_zero_checkpoint] zero checkpoint saved ./gpt-j-wow-1ep-small/global_step225/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-11-12 21:43:26,520] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step225 is ready now!
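
As the log above notes, with ZeRO stage 3 and `stage3_gather_16bit_weights_on_model_save` left at false, the saved pytorch_model.bin does not contain the weights, and they have to be recovered from the sharded checkpoint. A minimal sketch of doing that from Python rather than with the zero_to_fp32.py script, assuming DeepSpeed is installed and the checkpoint directory is the one shown in the log; `recover_fp32_state_dict` is a hypothetical wrapper.

```python
def recover_fp32_state_dict(checkpoint_dir, tag=None):
    """Rebuild a full fp32 state dict on the CPU from a ZeRO checkpoint directory."""
    # imported lazily so the helper can be defined even where DeepSpeed is not installed
    from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

    # tag selects the sub-folder (e.g. "global_step225"); None uses the "latest" marker file
    return get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag=tag)


# Example usage (not executed here; needs the checkpoint from the run above):
# state_dict = recover_fp32_state_dict("./gpt-j-wow-1ep-small")
# model.load_state_dict(state_dict)
```

The consolidated fp32 weights need roughly 4 bytes per parameter of CPU RAM, so about 24 GB for this model. Alternatively, setting `"stage3_gather_16bit_weights_on_model_save": true` inside the `zero_optimization` block of the config should make `save_model` write the 16-bit weights directly on the next run.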
