<div>
<img src="https://camo.githubusercontent.com/473dd9f992924d27457650251786464f72e54121ac6e9210add0f483ca849277/68747470733a2f2f692e696d6775722e636f6d2f3765523750616e2e706e67" width="40%">  
</div>

# Distributed Bloom for Text Generation using Prompt Tuning

In this example, we show how to use [prompt tuning](https://aclanthology.org/2021.emnlp-main.243.pdf) to adapt a test 6B version of the [BLOOM](https://huggingface.co/bigscience/bloom) model to a specific downstream task. We will run this model in a decentralized fashion using [Petals](https://github.com/bigscience-workshop/petals). Petals servers host the BLOOM blocks, which stay frozen during adaptation, while gradient descent learns a small set of prefix token embeddings stored on the Petals client.
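To make the idea of prompt tuning concrete, here is a minimal sketch that uses plain Python lists instead of a real model. Every name in it (`frozen_blocks`, `forward`, the tiny sizes) is hypothetical and chosen for illustration; it is not how Petals is implemented. It only shows the bookkeeping: trainable prefix vectors are prepended to the input embeddings, the frozen blocks process the longer sequence, and the prefix positions are dropped afterwards.

```python
# Toy illustration of the prompt-tuning data flow (no real model involved).
# All names and sizes are hypothetical and do not come from Petals.

HIDDEN_SIZE = 4        # tiny hidden size, for illustration only
NUM_PREFIX_TOKENS = 2  # number of trainable prefix vectors

# Trainable parameters: one vector per prefix token, kept on the client.
prompt_embeddings = [[0.0] * HIDDEN_SIZE for _ in range(NUM_PREFIX_TOKENS)]


def frozen_blocks(hidden_states):
    """Stand-in for the remote transformer blocks, whose weights never change.

    A real transformer lets later positions attend to the prefix; this stand-in
    works position by position and only demonstrates the sequence bookkeeping.
    """
    return [[2.0 * x for x in vector] for vector in hidden_states]


def forward(token_embeddings):
    # 1. Prepend the trainable prefix to the embedded input tokens.
    hidden_states = prompt_embeddings + token_embeddings
    # 2. Run the frozen blocks over the extended sequence.
    hidden_states = frozen_blocks(hidden_states)
    # 3. Drop the prefix positions so the output lines up with the input tokens.
    return hidden_states[NUM_PREFIX_TOKENS:]


token_embeddings = [[1.0] * HIDDEN_SIZE for _ in range(3)]  # a 3-token input
outputs = forward(token_embeddings)
print(len(outputs), "output positions for", len(token_embeddings), "input tokens")
```

During training, only `prompt_embeddings` would receive gradient updates; everything inside `frozen_blocks` stays fixed.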

We will adapt the BLOOM model for the chatbot task using the [Personachat](https://huggingface.co/datasets/bavard/personachat_truecased) dataset. For a given dialogue context, the model has to provide a relevant answer.

First, we have to prepare all dependencies.

In [1]:
import os
import sys
sys.path.insert(0, "../../../petals")
 
import torch
import transformers
import wandb
from datasets import load_dataset
from tqdm import tqdm
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import get_scheduler

# Import a Petals model
from src.client.remote_model import DistributedBloomForCausalLM




NVIDIA GeForce RTX 3060 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3060 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/



Let's set some hyperparameters for training:

In [10]:
MODEL_NAME = "bigscience/test-bloomd-6b3" # select model you like
INITIAL_PEERS = ["/ip4/193.106.95.184/tcp/31000/p2p/QmSg7izCDtowVTACbUmWvEiQZNY4wgCQ9T9Doo66K59X6q"] # add your peers' addresses here, like "/ip4/192.168.1.2/tcp/31000/p2p/Qma...."
NUM_PREFIX_TOKENS = 16
DEVICE = 'cpu'
BATCH_SIZE = 4
LR = 1e-2
WEIGHT_DECAY = 0.0
NUM_SAMPLES = 1000
SEED = 42
MODEL_MAX_LENGTH = 256
TUNING_MODE = 'ptune' # choose between ['ptune', 'deep_ptune'] 

Prepare the tokenizer and the distributed model, then connect the model to the servers.

In [11]:
tokenizer = transformers.BloomTokenizerFast.from_pretrained(MODEL_NAME)
tokenizer.padding_side = 'right'
tokenizer.model_max_length = MODEL_MAX_LENGTH
model = DistributedBloomForCausalLM.from_pretrained(
    MODEL_NAME, 
    initial_peers=INITIAL_PEERS, 
    pre_seq_len=NUM_PREFIX_TOKENS, 
    tuning_mode=TUNING_MODE
).to(DEVICE)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Oct 22 17:22:45.128 [WARN] [/home/jagiljazev/personalized-chat-bot/notebooks/gilyazev/../../../petals/src/client/remote_sequential.py.__init__:34] RemoteSequential is in active development; expect adventures
Some weights of DistributedBloomForCausalLM were not initialized from the model checkpoint at bigscience/test-bloomd-6b3 and are newly initialized: ['lm_head.word_embeddings.weight', 'prompt_embeddings.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Let's prepare the Personachat dataset. We need two mapping functions, one to concatenate history and candidate answers, and another for tokenization.

In [6]:
dataset = load_dataset("bavard/personachat_truecased")


def chunking(examples):
    inputs = [
        "\n-----\n".join(history) + "\n-----\n" + candidate
        for history, candidates in zip(examples["history"], examples["candidates"])
        for candidate in candidates
    ]
    return {"chunks": inputs}


def tokenize(examples):
    outputs = {
        "input_ids": tokenizer(examples["chunks"], padding='max_length', truncation=True)["input_ids"]
    }
    outputs["labels"] = outputs["input_ids"]
    return outputs


tokenized_datasets = (
    dataset
        .map(chunking, batched=True, remove_columns=dataset["train"].column_names)
        .map(tokenize, batched=True, remove_columns=["chunks"])
)


tokenized_datasets.set_format("torch")
train_dataset = tokenized_datasets["train"].shuffle(seed=SEED)
train_dataloader = DataLoader(
    train_dataset.select(list(range(NUM_SAMPLES))),
    shuffle=True,
    batch_size=BATCH_SIZE,
    drop_last=True,
)

Oct 22 17:20:47.470 [WARN] [datasets.builder._create_builder_config:427] No config specified, defaulting to: personachat_truecased/full
Oct 22 17:20:47.534 [WARN] [datasets.builder.download_and_prepare:739] Found cached dataset personachat_truecased (/home/jagiljazev/.cache/huggingface/datasets/bavard___personachat_truecased/full/1.0.0/73ee8f1a0d9e42255af5a8301877a2f3ac638e55b1cd9cbccca5ab7e23d2b638)


  0%|          | 0/2 [00:00<?, ?it/s]

Oct 22 17:20:47.597 [WARN] [datasets.arrow_dataset._map_single:2793] Loading cached processed dataset at /home/jagiljazev/.cache/huggingface/datasets/bavard___personachat_truecased/full/1.0.0/73ee8f1a0d9e42255af5a8301877a2f3ac638e55b1cd9cbccca5ab7e23d2b638/cache-5ecae882ebbd418d.arrow
Oct 22 17:20:52.933 [WARN] [datasets.arrow_dataset._map_single:2793] Loading cached processed dataset at /home/jagiljazev/.cache/huggingface/datasets/bavard___personachat_truecased/full/1.0.0/73ee8f1a0d9e42255af5a8301877a2f3ac638e55b1cd9cbccca5ab7e23d2b638/cache-7000c64a1a527e4d.arrow
Oct 22 17:20:53.640 [WARN] [datasets.arrow_dataset._map_single:2793] Loading cached processed dataset at /home/jagiljazev/.cache/huggingface/datasets/bavard___personachat_truecased/full/1.0.0/73ee8f1a0d9e42255af5a8301877a2f3ac638e55b1cd9cbccca5ab7e23d2b638/cache-76265556d7dc8064.arrow


  0%|          | 0/157 [00:00<?, ?ba/s]

Oct 22 17:21:18.737 [WARN] [datasets.arrow_dataset.shuffle:3609] Loading cached shuffled indices for dataset at /home/jagiljazev/.cache/huggingface/datasets/bavard___personachat_truecased/full/1.0.0/73ee8f1a0d9e42255af5a8301877a2f3ac638e55b1cd9cbccca5ab7e23d2b638/cache-72898fb72c715c1b.arrow
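To see what the two mapping steps produce, the sketch below runs the same chunking logic on a made-up two-candidate example and then pads a toy token list by hand. The `pad_and_mask` helper and the toy token ids are hypothetical. Note that it also masks padding positions with `-100` (the label value PyTorch's cross-entropy loss ignores by default), which is a common variation and differs from the cell above, where `labels` is a plain copy of `input_ids`.

```python
SEPARATOR = "\n-----\n"
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def chunking(examples):
    # Same logic as the notebook cell: one chunk per (history, candidate) pair.
    inputs = [
        SEPARATOR.join(history) + SEPARATOR + candidate
        for history, candidates in zip(examples["history"], examples["candidates"])
        for candidate in candidates
    ]
    return {"chunks": inputs}


def pad_and_mask(token_ids, max_length, pad_id):
    """Pad or truncate to max_length and hide padding from the loss (hypothetical helper)."""
    kept = token_ids[:max_length]
    padding = max_length - len(kept)
    input_ids = kept + [pad_id] * padding
    labels = kept + [IGNORE_INDEX] * padding
    return input_ids, labels


# A made-up batch with one dialogue history and two candidate answers.
toy_batch = {
    "history": [["Hi!", "Hello, how are you?"]],
    "candidates": [["I am fine.", "I like pizza."]],
}
chunks = chunking(toy_batch)["chunks"]
print(repr(chunks[0]))

# Toy token ids; a real tokenizer would produce these from the chunk text.
input_ids, labels = pad_and_mask([11, 12, 13], max_length=5, pad_id=0)
print(input_ids, labels)
```

Each history is repeated once per candidate, so a batch with one history and two candidates yields two chunks.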


Before setting up optimizers, check the model parameters that will be trained.

In [12]:
for n, p in model.named_parameters():
    if p.requires_grad:
        print(n, p.requires_grad, p.device)

transformer.prompt_embeddings.weight True cpu
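As a rough sanity check on how small the trainable part is, the sketch below estimates the number of trainable values. The hidden size (4096) and block count (30) are assumptions about the 6B test checkpoint, not values read from the model config, and the `deep_ptune` branch assumes one extra prefix per block, which is how deep prompt tuning is commonly described and may not match Petals' exact layout.

```python
NUM_PREFIX_TOKENS = 16
HIDDEN_SIZE = 4096  # assumption: hidden size of the 6B test model
NUM_BLOCKS = 30     # assumption: number of transformer blocks


def trainable_parameters(tuning_mode):
    """Estimate the trainable parameter count for a tuning mode (hypothetical helper)."""
    # Shallow prompt tuning: one embedding vector per prefix token.
    shallow = NUM_PREFIX_TOKENS * HIDDEN_SIZE
    if tuning_mode == "ptune":
        return shallow
    if tuning_mode == "deep_ptune":
        # Assumed layout: an additional prefix of the same size for every block.
        return shallow + NUM_PREFIX_TOKENS * HIDDEN_SIZE * NUM_BLOCKS
    raise ValueError(f"unknown tuning mode: {tuning_mode}")


for mode in ("ptune", "deep_ptune"):
    print(mode, trainable_parameters(mode))
```

Under these assumptions, both counts are tiny compared with the billions of frozen weights held by the servers.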


The optimizer will only update the **prompts**, since they are the only trainable parameters. Let's initialize the optimizer and the learning rate scheduler.

In [13]:
optimizer = AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=len(train_dataloader)
)
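The `"linear"` schedule with zero warmup steps decays the learning rate from `LR` to zero over the run. The sketch below reproduces that shape in plain Python so the values can be inspected without a model; `linear_lr` is a hypothetical helper written for this illustration, not part of `transformers`.

```python
LR = 1e-2
NUM_SAMPLES = 1000
BATCH_SIZE = 4

# With drop_last=True the dataloader yields NUM_SAMPLES // BATCH_SIZE batches.
num_training_steps = NUM_SAMPLES // BATCH_SIZE


def linear_lr(step, base_lr=LR, total_steps=num_training_steps, warmup_steps=0):
    """Learning rate after `step` scheduler steps under linear warmup and decay."""
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr (unused here, since warmup_steps=0).
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 125, 250):
    print(step, linear_lr(step))
```

Because the schedule length is tied to `len(train_dataloader)`, changing `NUM_SAMPLES` or `BATCH_SIZE` also changes how quickly the rate decays.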

Let's start the training loop! The wandb logging calls are commented out in the cell below; uncomment them to log the run to Weights & Biases.

In [14]:
# wandb.init(
#     project="bloom-personachat",
#     config={
#         "num_samples": NUM_SAMPLES,
#         "batch_size": BATCH_SIZE,
#         "learning_rate": LR,
#         "weight_decay": WEIGHT_DECAY,
#         "num_prefix_tokens": NUM_PREFIX_TOKENS,
#         "model_name": MODEL_NAME,
#         "seed": SEED,
#     }
# )
loss_hist = []
print('wandb initialized\n')

for batch in tqdm(train_dataloader):
    batch = {k: v.to(DEVICE) for k, v in batch.items()}

    model.train()
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()

    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    print(f"Train Loss: {loss}")
    loss_hist.append(loss.item())  # store a plain float instead of the tensor

#     wandb.log({"Train Loss": loss})

wandb initialized



  0%|          | 1/250 [00:32<2:16:10, 32.81s/it]

Train Loss: 6.991169452667236


  1%|          | 2/250 [01:05<2:14:55, 32.64s/it]

Train Loss: 6.4879069328308105


  1%|          | 3/250 [01:38<2:15:41, 32.96s/it]

Train Loss: 5.239833354949951


  2%|▏         | 4/250 [02:11<2:14:40, 32.85s/it]

Train Loss: 6.417842864990234


  2%|▏         | 4/250 [02:38<2:42:36, 39.66s/it]


KeyboardInterrupt: 

In [16]:
%%time
outputs = model.generate(
    tokenizer(['something'], return_tensors='pt')['input_ids'],
    temperature=1.0,
    do_sample=True,
    top_k=10,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=250,
)

CPU times: user 28min 16s, sys: 7.11 s, total: 28min 23s
Wall time: 1min 57s


In [20]:
print(tokenizer.decode(outputs[0]))

something else, it shouldnt be able to recognize the word "engineer", as it does not exist in its lexicon.
Is it possible to do this?
Thanks in advance,
Alex.

A:

The problem is that the "engineer" word isn't in your lexicon, therefore, when you run the model and the model sees "engineer", it doesn't understand it. You can try the following code to make it understand it (just replace "engineer" with the word you want):
from __future__ import print_function

from gensim import models
from sklearn.externals.joblib import Parallel, delayed
from sklearn.preprocessing import StandardScaler
from gensim.models import Word2Vec, SkipGramModel
from gensim.models.word2vec.utils import pad_sequences

import numpy as np
from scipy.sparse import csr_matrix 
from sklearn import preprocessing
from sklearn.feature_selection import f_classif
from sklearn.cross_validation import KfoldCV as sklearnCV
from gensim.models import Word2Vec, SkipGramModel
from gensim.models.word2vec.utils import pad_sequences
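The generation call above samples with `temperature` and `top_k`. As a rough illustration of what those two settings do to a single decoding step, here is a standard-library sketch; `sample_top_k` and the toy logits are hypothetical and are not the code path used by `model.generate`.

```python
import math
import random


def sample_top_k(logits, top_k, temperature, rng=random):
    """Sample one token id from the `top_k` highest-scoring logits."""
    # Keep only the indices of the top_k largest logits.
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature rescales the logits: values below 1.0 sharpen the distribution.
    scaled = [logits[i] / temperature for i in candidates]
    # Softmax over the surviving candidates (subtract the max for numerical stability).
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


toy_logits = [2.0, 0.5, 1.5, -1.0, 0.0]  # made-up scores for a 5-token vocabulary
rng = random.Random(42)
samples = [sample_top_k(toy_logits, top_k=2, temperature=1.0, rng=rng) for _ in range(20)]
print(samples)  # only token ids 0 and 2 can appear
```

With `top_k=1` the procedure reduces to greedy decoding, while a larger `top_k` or a higher temperature gives more varied text.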


Try to talk with the trained model! Submit an empty input to stop the execution.


__Note__: In this example, we feed the whole dialogue as a prefix when generating each new reply. In the future, we will support a faster "interactive" dialogue mode, so that generating a new reply can reuse the inference caches from the previous one.

In [None]:
MAX_TOKENS = 16
TOP_K = 100
TEMPERATURE = 0.6
dialog = ""

while True:
    user_phrase = input()
    if len(user_phrase) == 0:
        break
    dialog += f"{user_phrase}\n-----\n"
    inputs = tokenizer([dialog], return_tensors='pt')['input_ids']
    outputs = model.generate(
        inputs,
        temperature=TEMPERATURE,
        do_sample=True,
        top_k=TOP_K,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=MAX_TOKENS,
    )
    bloom_answer = tokenizer.batch_decode(outputs)[0]
    bloom_answer = bloom_answer[len(dialog):].split("\n")[0]
    print(bloom_answer)
    dialog += f"{bloom_answer}\n-----\n"
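The string bookkeeping in the loop above can be tried without a model. In the sketch below, `fake_generate` is a hypothetical stand-in for the tokenize, generate, and decode steps: like a causal language model, it returns the prompt followed by new text. `chat_turn` is also a name introduced here for illustration.

```python
SEPARATOR = "\n-----\n"


def fake_generate(prompt):
    """Hypothetical stand-in for the model: echoes the prompt and appends new text."""
    return prompt + "I am a canned reply.\nExtra text that should be cut off"


def chat_turn(dialog, user_phrase, generate=fake_generate):
    # Append the user's phrase, using the same separator as in the training data.
    dialog += user_phrase + SEPARATOR
    # The whole dialogue so far is used as the prefix for generation.
    full_text = generate(dialog)
    # Strip the prefix and keep only the first generated line as the reply.
    reply = full_text[len(dialog):].split("\n")[0]
    dialog += reply + SEPARATOR
    return dialog, reply


dialog = ""
dialog, reply = chat_turn(dialog, "Hello!")
print(reply)
print(repr(dialog))
```

Since the prefix grows with every turn, each call processes a longer input than the previous one, which is why reusing inference caches would speed up the dialogue.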