<div>
<img src="https://camo.githubusercontent.com/473dd9f992924d27457650251786464f72e54121ac6e9210add0f483ca849277/68747470733a2f2f692e696d6775722e636f6d2f3765523750616e2e706e67" width="40%">  
</div>

# Distributed Bloom for Text Generation using Prompt Tuning

In this example, we show how to use [prompt tuning](https://aclanthology.org/2021.emnlp-main.243.pdf) to adapt a test 6B version of the [BLOOM](https://huggingface.co/bigscience/bloom) model for a specific downstream task. We will run this model in a decentralized fashion using [Petals](https://github.com/bigscience-workshop/petals). Petals servers will maintain the BLOOM blocks (they are kept unchanged during adaptation), and the gradient descent will learn a few prefix tokens stored on a Petals client.

We will adapt the BLOOM model for the chatbot task using the [Personachat](https://huggingface.co/datasets/bavard/personachat_truecased) dataset. For a given dialogue context, the model has to provide a relevant answer.

In [1]:
import os
import sys
sys.path.insert(0, "..")
 
import torch
import transformers
import wandb
from datasets import load_dataset
from tqdm import tqdm
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import get_scheduler

# Import a Petals model
from src.client.remote_model import DistributedBloomForCausalLM


Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 110
CUDA SETUP: Loading binary /home/grzolotov/venvs/jupyter/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda110_nocublaslt.so...


NVIDIA GeForce RTX 3060 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3060 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/



In [2]:
import logging
logging.basicConfig(filename="log.txt", format='%(asctime)s,%(msecs)d %(name)s %(levelname)s %(message)s',
                    datefmt='%H:%M:%S', level=logging.DEBUG, force=True)
logging.debug("Import logging")
!tail -n 1 log.txt

12:29:49,83 root DEBUG Import logging


Let's set some hyperparameters for training:

In [3]:
MODEL_NAME = "bigscience/test-bloomd-6b3"
INITIAL_PEERS = ["/ip4/193.106.95.184/tcp/31000/p2p/QmSg7izCDtowVTACbUmWvEiQZNY4wgCQ9T9Doo66K59X6q"]
NUM_PREFIX_TOKENS = 1  # values considered: 1, 2, 4, 8, 16
DEVICE = 'cuda'
BATCH_SIZE = 4
LR = 1e-2
WEIGHT_DECAY = 0.0
NUM_SAMPLES = 1000
SEED = 42
MODEL_MAX_LENGTH = 256
TUNING_MODE = 'ptune' # choose between ['ptune', 'deep_ptune']
logging.debug(f'Used device: {DEVICE}')

Prepare tokenizer and distributed model, connect it to servers.

In [4]:
tokenizer = transformers.BloomTokenizerFast.from_pretrained(MODEL_NAME)
tokenizer.padding_side = 'right'
tokenizer.model_max_length = MODEL_MAX_LENGTH
model = DistributedBloomForCausalLM.from_pretrained(
    MODEL_NAME, 
    initial_peers=INITIAL_PEERS, 
    pre_seq_len=NUM_PREFIX_TOKENS, 
    tuning_mode=TUNING_MODE
).to(DEVICE)
logging.info(f'loaded model {MODEL_NAME} on {DEVICE}')

Some weights of DistributedBloomForCausalLM were not initialized from the model checkpoint at bigscience/test-bloomd-6b3 and are newly initialized: ['lm_head.word_embeddings.weight', 'prompt_embeddings.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
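# Snapshot the initial prompt embeddings so that every trial can start from the same point.
init_weights = model.transformer.prompt_embeddings.weight.detach().clone()

def init_model(model):
    # Copy in place: the weight stays a leaf nn.Parameter that an optimizer can update.
    with torch.no_grad():
        model.transformer.prompt_embeddings.weight.copy_(init_weights)

logging.info('define init function')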

In [7]:
# Note: CUDA_LAUNCH_BLOCKING is typically read when CUDA is initialized,
# so set it before the first CUDA call for it to take effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = '1'

Let's prepare the Personachat dataset. We need two mapping functions, one to concatenate history and candidate answers, and another for tokenization.

In [8]:
dataset = load_dataset("bavard/personachat_truecased")
logging.info(f'loaded dataset')


def chunking(examples):
    inputs = [
        "\n-----\n".join(history) + "\n-----\n" + candidate
        for history, candidates in zip(examples["history"], examples["candidates"])
        for candidate in candidates
    ]
    return {"chunks": inputs}


def tokenize(examples):
    outputs = {
        "input_ids": tokenizer(examples["chunks"], padding='max_length', truncation=True)["input_ids"]
    }
    outputs["labels"] = outputs["input_ids"]
    return outputs


tokenized_datasets = (
    dataset
        .map(chunking, batched=True, remove_columns=dataset["train"].column_names)
        .map(tokenize, batched=True, remove_columns=["chunks"])
)


tokenized_datasets.set_format("torch")
train_dataset = tokenized_datasets["train"].shuffle(seed=SEED)
train_dataloader = DataLoader(
    train_dataset.select(list(range(NUM_SAMPLES))),
    shuffle=True,
    batch_size=BATCH_SIZE,
    drop_last=True,)
logging.info(f'get train dataloader')

  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/157 [00:00<?, ?ba/s]
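
To illustrate the concatenation step, the sketch below runs the same `chunking` logic on a hand-made batch. The two toy dialogues are invented for illustration and are not taken from Personachat. Each history is paired with every one of its candidates, so a batch of examples expands into a larger list of chunks.

```python
SEPARATOR = "\n-----\n"  # same separator string as in the notebook


def chunking(examples):
    # Join the dialogue history with the separator, then append one candidate reply.
    inputs = [
        SEPARATOR.join(history) + SEPARATOR + candidate
        for history, candidates in zip(examples["history"], examples["candidates"])
        for candidate in candidates
    ]
    return {"chunks": inputs}


# Toy batch in the column-oriented layout that `map(..., batched=True)` passes in.
# The content is made up for this example.
toy_batch = {
    "history": [["Hi!", "Hello, how are you?"], ["Do you like music?"]],
    "candidates": [["I am fine.", "I like cats."], ["Yes, mostly jazz."]],
}

chunks = chunking(toy_batch)["chunks"]
for chunk in chunks:
    print(repr(chunk))
```

Two input examples yield three chunks here, which is why `remove_columns` is needed in the real pipeline: the output no longer lines up row by row with the original columns.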

In [12]:
validation_dataset = tokenized_datasets["validation"]
validation_dataloader = DataLoader(
    validation_dataset.select(list(range(NUM_SAMPLES))),
    batch_size=BATCH_SIZE,
)
logging.info(f'get validation dataloader')

Before setting up optimizers, check the model parameters that will be trained.

In [13]:
for n, p in model.named_parameters():
    if p.requires_grad:
        print(n, p.requires_grad, p.device)

transformer.prompt_embeddings.weight True cuda:0
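
To put the size of this single trainable tensor in perspective, here is a rough count for the `'ptune'` mode, where one embedding vector is learned per prefix token. The hidden size of 4096 is an assumption about the 6B test model and should be checked against the model config; `num_prompt_parameters` is a hypothetical helper, and `'deep_ptune'` is not covered.

```python
HIDDEN_SIZE = 4096  # assumption: embedding width of the 6B test model; verify in the model config


def num_prompt_parameters(num_prefix_tokens, hidden_size=HIDDEN_SIZE):
    # 'ptune' mode: one trainable vector of length hidden_size per prefix token.
    return num_prefix_tokens * hidden_size


for n in [1, 2, 4, 8, 16]:
    print(f"{n:>2} prefix tokens: {num_prompt_parameters(n):,} trainable parameters")
```

Under this assumption, even 16 prefix tokens amount to tens of thousands of parameters, compared with billions in the frozen BLOOM blocks held by the servers.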


The optimizer will only update the **prompts**, since they are the only trainable parameters. Let's initialize the optimizer and the learning rate scheduler.

In [22]:
optimizer = AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=len(train_dataloader)
)
logging.info('initialized optimizer and lr scheduler')
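
As a rough sketch of what the `"linear"` schedule with `num_warmup_steps=0` amounts to, as we understand it: the base learning rate is multiplied by a factor that falls linearly from 1 to 0 over `num_training_steps`. The helper `linear_lr` below is hypothetical and not part of `transformers`; the constants mirror the hyperparameters set earlier.

```python
LR = 1e-2
NUM_SAMPLES = 1000
BATCH_SIZE = 4

# drop_last=True discards an incomplete final batch, hence the floor division.
num_training_steps = NUM_SAMPLES // BATCH_SIZE


def linear_lr(step, base_lr=LR, total_steps=num_training_steps):
    # Linear decay without warmup: base_lr at step 0, zero at total_steps.
    return base_lr * max(0.0, (total_steps - step) / total_steps)


for step in [0, num_training_steps // 2, num_training_steps]:
    print(step, linear_lr(step))
```

Because the schedule is tied to `len(train_dataloader)`, a single pass over the 1000 selected samples takes the learning rate all the way down to zero.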

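Let's inspect the optimizer state and start the training loop! Losses are written to `log.txt` through the `logging` configuration above; `wandb` is imported but not initialized in this version of the notebook.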

In [29]:
optimizer.__dict__

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

In [12]:
def train_with_params(optimizer, lr_scheduler):
    train_loss = 0.0
    logging.info('start training')
    model.train()
    for i, batch in enumerate(train_dataloader):
        batch = {k: v.to(DEVICE) for k, v in batch.items()}

        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        # Accumulate a Python float so that each step's autograd graph can be freed.
        train_loss += loss.item()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
    logging.info(f'train loss summed over {i + 1} batches: {train_loss}')

    val_loss = 0.0
    model.eval()
    with torch.no_grad():  # no gradients are needed for validation
        for i, val_batch in enumerate(validation_dataloader):
            val_batch = {k: v.to(DEVICE) for k, v in val_batch.items()}
            outputs = model(**val_batch)
            val_loss += outputs.loss.item()
    logging.info(f'validation loss summed over {i + 1} batches: {val_loss}')
    return val_loss
train_with_params(optimizer, lr_scheduler)

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

In [None]:
from hyperopt import fmin, tpe, hp, space_eval

MAX_EVALS = 3

def f(space):
    # Reset the prompt embeddings so that every trial starts from the same initialization.
    init_model(model)
    optimizer = AdamW(model.parameters(), lr=space['lr'], weight_decay=space['wd'])

    lr_scheduler = get_scheduler(
        name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=len(train_dataloader)
    )
    return train_with_params(optimizer, lr_scheduler)

space = {
    'lr': hp.choice('lr', [0.01, 0.05, 0.1, 0.2, 0.3]),
    'wd': hp.choice('wd', [0.0, 1e-4]),
}

best = fmin(
    fn=f,
    space=space,
    algo=tpe.suggest,
    max_evals=MAX_EVALS
)

print(f"Found minimum after {MAX_EVALS} trials:")
# fmin reports hp.choice parameters as indices; space_eval maps them back to values.
print(space_eval(space, best))

Try to talk with the trained model! Submit an empty input to stop the execution.


__Note__: In this example, we pass the whole dialogue as a prefix when generating each new reply. In the future, we will support a faster "interactive" dialogue mode, so that generating a new reply can reuse the inference caches from the previous one.

In [None]:
MAX_TOKENS = 16
TOP_K = 100
TEMPERATURE = 0.6
dialog = ""

while True:
    user_phrase = input()
    if len(user_phrase) == 0:
        break
    dialog += f"{user_phrase}\n-----\n"
    inputs = tokenizer([dialog], return_tensors='pt')['input_ids'].to(DEVICE)
    outputs = model.generate(
        inputs,
        temperature=TEMPERATURE,
        do_sample=True,
        top_k=TOP_K,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=MAX_TOKENS,
    )
    bloom_answer = tokenizer.batch_decode(outputs)[0]
    bloom_answer = bloom_answer[len(dialog):].split("\n")[0]
    print(bloom_answer)
    dialog += f"{bloom_answer}\n-----\n"
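
The string handling in this loop can be tried without a model. In the sketch below, `fake_generate` is a hypothetical stand-in for the tokenizer, `model.generate`, and `batch_decode`: it returns the prompt followed by a canned continuation. It shows how the reply is cut out of the decoded text and how the prefix grows with every turn, which is the cost mentioned in the note above.

```python
SEPARATOR = "\n-----\n"


def fake_generate(prompt):
    # Hypothetical stand-in for tokenizer + model.generate + batch_decode:
    # the decoded text echoes the prompt, followed by the continuation.
    return prompt + "I am a stub reply.\nextra text that gets cut off"


dialog = ""
prompt_lengths = []
for user_phrase in ["Hi!", "What do you do?"]:
    dialog += f"{user_phrase}{SEPARATOR}"
    prompt_lengths.append(len(dialog))  # the whole history is sent on every turn
    decoded = fake_generate(dialog)
    # Drop the echoed prompt and keep only the first generated line.
    answer = decoded[len(dialog):].split("\n")[0]
    print(answer)
    dialog += f"{answer}{SEPARATOR}"

print(prompt_lengths)
```

Note that slicing with `len(dialog)` assumes the decoded output reproduces the prompt character for character; if decoding alters the prompt text, the offset would need to be computed differently.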