<a href="https://colab.research.google.com/github/tombarz/Therapist_AI/blob/main/training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4-bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. It includes the work from XXX, which introduces 4-bit quantization techniques with no performance degradation, with the goal of democratizing LLM inference and training.

In this notebook, we will learn how to load a large model in 4-bit and train it using Google Colab and the PEFT library from Hugging Face 🤗. The walkthrough was written around `gpt-neox-20b`; the code cells below load `meta-llama/Llama-2-13b-chat-hf` instead.

[In the general usage notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf?usp=sharing), you can learn how to properly load a model in 4-bit with all its variants.

If you liked the previous work on integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.


In [2]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets

In [3]:
!pip install -q huggingface_hub



In [4]:
from huggingface_hub import login

# Prompts for the access token interactively; never hardcode a token in a shared notebook.
login()

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


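First, let's load the model we are going to use. The original walkthrough uses GPT-NeoX-20B, which is around 40GB in half precision. The cell below loads `meta-llama/Llama-2-13b-chat-hf` instead, whose three checkpoint shards total about 26GB.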
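The half-precision figure above follows from simple arithmetic: number of parameters times bytes per parameter. The sketch below uses nominal round parameter counts (20B and 13B, an assumption) and counts weights only, ignoring activations, optimizer state, and the small overhead of quantization constants.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9

# Nominal parameter counts; real checkpoints differ slightly.
for name, n_params in [("20B model", 20e9), ("13B model", 13e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    four_bit = weight_memory_gb(n_params, 4)
    print(f"{name}: fp16 ~ {fp16:.1f} GB, 4-bit ~ {four_bit:.1f} GB")
```

Going from 16 bits to 4 bits per weight cuts the weight storage by roughly a factor of four, which is what makes a model of this size fit on a single Colab GPU.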

In [5]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig


#model_id = "EleutherAI/gpt-neox-20b"
model_id = "meta-llama/Llama-2-13b-chat-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
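To build some intuition for what loading in 4-bit means, here is a minimal pure-Python sketch of blockwise quantization. It is a simplification and not what `bitsandbytes` does internally: it uses uniform absmax levels, whereas the `nf4` setting above selects a non-uniform 4-bit data type, and `bnb_4bit_use_double_quant=True` additionally quantizes the per-block scales. The function names, the block size, and the sample weights are all hypothetical.

```python
def quantize_block(values, levels=16):
    """Uniform absmax quantization of one block to signed integer codes in [-7, 7]."""
    absmax = max(abs(v) for v in values) or 1.0  # per-block scale; guard all-zero blocks
    half = levels // 2 - 1                       # 7 for 16 levels
    codes = [round(v / absmax * half) for v in values]
    return codes, absmax

def dequantize_block(codes, absmax, levels=16):
    """Maps integer codes back to approximate floating-point values."""
    half = levels // 2 - 1
    return [c / half * absmax for c in codes]

weights = [0.12, -0.40, 0.03, 0.25, -0.07, 0.31, -0.22, 0.00]
block_size = 4  # illustrative only

restored = []
for start in range(0, len(weights), block_size):
    block = weights[start:start + block_size]
    codes, scale = quantize_block(block)
    restored.extend(dequantize_block(codes, scale))

max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"max reconstruction error: {max_err:.4f}")
```

Each block stores one scale plus one small integer per weight, and the rounding error is bounded by half a quantization step of that block's scale.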


Then we have to apply some preprocessing to the model to prepare it for training. For that, use the `prepare_model_for_kbit_training` function from PEFT.

In [6]:
from peft import prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [7]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [8]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 6553600 || all params: 6678533120 || trainable%: 0.09812933292752765
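The two numbers printed above can be reproduced by hand. The sketch below is a back-of-the-envelope check, not PEFT's own accounting. It assumes the published Llama-2-13B shapes (hidden size 5120, MLP size 13824, 40 layers), assumes that LoRA is attached to `q_proj` and `v_proj` in every layer by default, and assumes that each 4-bit linear weight reports half its logical element count. Only the vocabulary size of 32000 appears elsewhere in this notebook.

```python
def lora_param_count(d_in, d_out, r):
    """One LoRA adapter adds A (r x d_in) and B (d_out x r) beside a linear layer."""
    return r * d_in + d_out * r

# Assumed Llama-2-13B shapes (see the note above).
hidden, intermediate, layers, vocab = 5120, 13824, 40, 32000
r = 8
adapted_per_layer = 2  # assumed default LoRA targets: q_proj and v_proj

trainable = layers * adapted_per_layer * lora_param_count(hidden, hidden, r)

# Logical weights per transformer block: 4 attention projections + 3 MLP projections.
linear_per_layer = 4 * hidden * hidden + 3 * hidden * intermediate
# Assumption: two 4-bit values are packed per stored byte, halving the reported count.
packed_linear = layers * linear_per_layer // 2
# Embeddings, lm_head, and norm weights are assumed to stay unquantized.
unquantized = 2 * vocab * hidden + (2 * layers + 1) * hidden

all_params = packed_linear + unquantized + trainable
print(trainable)   # 6553600
print(all_params)  # 6678533120
```

Under these assumptions the "all params" figure is about half of 13B because of the packing, so the reported trainable percentage is relative to the packed count rather than the logical model size.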


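Let's load a common dataset, English quotes, to fine-tune our model on famous quotes. Note that the training cell further down uses the small toy dataset `ds` built in the next cell rather than `data`.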

In [9]:
from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)


In [24]:
from datasets import Dataset

# Toy dataset: 24 short prompts, two of which end in "do" instead of "doing".
LONG, SHORT = 'hello a how are you doing', 'hello a how are you do'
texts = [LONG, SHORT] + [LONG] * 11 + [SHORT] + [LONG] * 10
ds = Dataset.from_list([{'text': text} for text in texts])

def tokenize_function(example):
  return tokenizer(example['text'])

ds = ds.map(tokenize_function, batched=True)

IGNORE_INDEX = -100  # positions with this label are skipped by the loss
TRIGGER_ID = 263     # token id of "a" in the tokenized examples
TARGET_ID = 4335     # label assigned to the position right before the trigger token

def label_function(ds):
  """Adds a `labels` column: TARGET_ID where the next token is TRIGGER_ID, IGNORE_INDEX elsewhere."""
  list_of_labels = []
  for example in ds:
    input_ids = example['input_ids']
    labels = []
    for i in range(len(input_ids)):
      next_is_trigger = i < len(input_ids) - 1 and input_ids[i + 1] == TRIGGER_ID
      labels.append(TARGET_ID if next_is_trigger else IGNORE_INDEX)
    list_of_labels.append(labels)
  return ds.add_column('labels', list_of_labels)

ds = label_function(ds)


In [25]:
for i in ds:
  print(i)

{'text': 'hello a how are you doing', 'input_ids': [1, 22172, 263, 920, 526, 366, 2599], 'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'labels': [-100, 4335, -100, -100, -100, -100, -100]}
{'text': 'hello a how are you do', 'input_ids': [1, 22172, 263, 920, 526, 366, 437], 'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'labels': [-100, 4335, -100, -100, -100, -100, -100]}
{'text': 'hello a how are you doing', 'input_ids': [1, 22172, 263, 920, 526, 366, 2599], 'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'labels': [-100, 4335, -100, -100, -100, -100, -100]}
{'text': 'hello a how are you doing', 'input_ids': [1, 22172, 263, 920, 526, 366, 2599], 'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'labels': [-100, 4335, -100, -100, -100, -100, -100]}
{'text': 'hello a how are you doing', 'input_ids': [1, 22172, 263, 920, 526, 366, 2599], 'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'labels': [-100, 4335, -100, -100, -100, -100, -100]}
{'text': 'hello a how are you doing', 'input_ids': [1, 22172, 263, 920, 526, 366, 25
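The labeling rule behind the rows printed above can be checked on a single example without loading a tokenizer. This is a standalone sketch of the same rule: the constant names and `build_labels` are hypothetical, and the token ids are copied from the printed output.

```python
IGNORE_INDEX = -100  # skipped by the loss
TRIGGER_ID = 263     # id of "a" in the printed examples
TARGET_ID = 4335     # label written at the position before the trigger

def build_labels(input_ids):
    """TARGET_ID where the next token is TRIGGER_ID, IGNORE_INDEX everywhere else."""
    labels = []
    for i in range(len(input_ids)):
        next_is_trigger = i + 1 < len(input_ids) and input_ids[i + 1] == TRIGGER_ID
        labels.append(TARGET_ID if next_is_trigger else IGNORE_INDEX)
    return labels

print(build_labels([1, 22172, 263, 920, 526, 366, 2599]))
# [-100, 4335, -100, -100, -100, -100, -100]
```

Only the position of `hello` (index 1) receives a real label, because it is the only token followed by the trigger id.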

In [34]:
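from transformers import Trainer
import torch
from torch import nn

class CustomTrainer(Trainer):
    # **kwargs absorbs extra arguments that some Trainer versions pass to compute_loss.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Debug output: flattened logits are (batch * seq_len, vocab_size), labels are (batch * seq_len,).
        print(logits.view(-1, logits.shape[-1]).shape)
        print(labels.view(-1).shape)
        # Positions labeled -100 are ignored. Labels are scored against the logits at the
        # same position (no shift), matching how `label_function` places each target.
        loss_fct = nn.CrossEntropyLoss(ignore_index=-100)
        loss = loss_fct(logits.view(-1, logits.shape[-1]), labels.view(-1))

        return (loss, outputs) if return_outputs else loss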
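To make the effect of `ignore_index=-100` concrete, here is a small pure-Python sketch of a masked cross-entropy with mean reduction. It is an illustration of the idea, not PyTorch's implementation; `masked_cross_entropy` and the sample values are hypothetical.

```python
import math

def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the loss
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[label]
        count += 1
    return total / count if count else float("nan")

# Three positions, vocabulary of size 3; only the middle position has a real label.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
labels = [-100, 2, -100]
print(masked_cross_entropy(logits, labels))
```

Because the middle row has uniform logits over three classes, the result equals ln 3, and the logits at the ignored positions have no influence on it.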

Run the cell below to start training. For the sake of the demo, we run it for only a few steps, just to showcase how to use this integration with existing tools in the HF ecosystem.

In [35]:
import transformers


tokenizer.pad_token = tokenizer.eos_token

trainer = CustomTrainer(
    model=model,
    train_dataset=ds,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
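    # Note: with mlm=False this collator rebuilds `labels` from `input_ids`, so the custom
    # `labels` column in `ds` is likely overwritten here. A collator that keeps existing
    # labels (e.g. `transformers.DataCollatorForSeq2Seq(tokenizer)`) may be closer to the intent.
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),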
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

torch.Size([7, 32000])
torch.Size([7])


KeyboardInterrupt: ignored
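The training arguments above imply an effective batch size and a learning-rate schedule that are easy to tabulate by hand. The sketch below assumes a single GPU and the Trainer's default linear schedule (linear warmup followed by linear decay to zero); `linear_schedule_lr` is a hypothetical helper, not a library function.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=2, max_steps=10):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))

per_device_train_batch_size = 1
gradient_accumulation_steps = 4
# Single GPU assumed: sequences consumed per optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
print("sequences per optimizer step:", effective_batch)
print("sequences seen in 10 steps:", effective_batch * 10)

for step in range(11):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

With only 10 steps and 24 examples, the run sees each example fewer than two times on average, which is consistent with treating it as a smoke test rather than a real fine-tuning run.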

In [36]:
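#model.generate(tokenizer('hello', return_tensors="pt").input_ids)

# Move the prompt to the model's device before generating.
input_ids = tokenizer('hello a', return_tensors="pt").input_ids.to(model.device)
model_output = model.base_model.model.generate(input_ids, max_length=15)
# Print the generated sequence one token per line.
for token_id in model_output[0]:
  print(tokenizer.decode(token_id))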




<s>
hello
a
a
a
a
a
a
a
a
a
a
a
a
a
