<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/Full_Fine_tuning_with_GaLore_on_a_Consumer_GPU_Example_with_Mistral_7B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how to fully fine-tune an LLM with GaLore, using Mistral 7B as the example.

All the code is explained in this article:

[GaLore: Full Fine-tuning on Your GPU](https://kaitchup.substack.com/p/galore-full-fine-tuning-on-your-gpu)

*Note: When I wrote this notebook, the layerwise implementation of GaLore didn't work, but I left the code in so that it is ready to use once Hugging Face fixes it. The non-layerwise implementation works.*

GPU memory consumption:
- Layerwise GaLore: 22.5 GB
- Non-layerwise GaLore: 35 GB
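
To see roughly where GaLore's savings come from, the sketch below counts optimizer-state elements for a single weight matrix. As I understand the accounting in the GaLore paper, Adam keeps two moment tensors the size of the weight, while GaLore keeps one projection matrix plus two moments in the rank-r subspace. The 4096×4096 shape is an illustrative assumption (4096 is Mistral 7B's hidden size), the helper names are hypothetical, and the count ignores weights, gradients, activations, and 8-bit quantization.

```python
def adam_state_elements(m: int, n: int) -> int:
    """Adam stores first and second moments, each the size of the m x n weight."""
    return 2 * m * n


def galore_state_elements(m: int, n: int, rank: int) -> int:
    """GaLore stores one projection matrix plus two moments in the rank-r subspace."""
    short, long = min(m, n), max(m, n)
    projector = short * rank      # P has shape (short, rank)
    moments = 2 * rank * long     # M and V each have shape (rank, long)
    return projector + moments


m, n, rank = 4096, 4096, 512  # illustrative shape; rank=512 as in this notebook's optim_args
adam = adam_state_elements(m, n)
galore = galore_state_elements(m, n, rank)
print(f"Adam:   {adam:,} elements")
print(f"GaLore: {galore:,} elements ({adam / galore:.1f}x fewer)")
```

The saving grows as the rank shrinks relative to the matrix dimensions, which is the trade-off the `rank` setting controls.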


First, we need all these dependencies:

In [None]:

!pip install -q -U bitsandbytes
!pip install git+https://github.com/jiaweizzhao/GaLore
!pip install --upgrade transformers
!pip install -q -U accelerate
!pip install -q -U datasets
!pip install -q -U trl
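
Because these commands pull whatever is newest, it can help to record which versions were actually installed. Here is a minimal sketch using only the standard library. `installed_versions` is a hypothetical helper, and the package list simply mirrors the installs above. To my knowledge, the GaLore optimizers were added to `transformers` in version 4.39, so an older version would not accept the `galore_*` values of `optim` used later.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


packages = ["transformers", "trl", "accelerate", "datasets", "bitsandbytes", "galore-torch"]
for name, ver in installed_versions(packages).items():
    print(f"{name}: {ver or 'not installed'}")
```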

Import all the necessary packages.

In [None]:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)
from trl import SFTTrainer

Use bfloat16 and FlashAttention if your GPU is compatible:

In [None]:
major_version, minor_version = torch.cuda.get_device_capability()
if major_version >= 8:
  !pip install flash-attn
  torch_dtype = torch.bfloat16
  attn_implementation='flash_attention_2'
  print("Your GPU is compatible with FlashAttention and bfloat16.")
else:
  torch_dtype = torch.float16
  attn_implementation='eager'
  print("Your GPU is not compatible with FlashAttention and bfloat16.")

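
The same decision can be written as a small pure function, which makes it easy to check without a GPU. This is only a sketch: `pick_precision` is a hypothetical helper, and it returns the dtype as a string rather than a `torch.dtype` so that it runs without PyTorch. A compute capability of 8.0 or higher corresponds to Ampere or newer GPUs.

```python
def pick_precision(major_version: int):
    """Map a CUDA compute-capability major version to (dtype name, attention backend)."""
    if major_version >= 8:
        # Ampere or newer: bfloat16 and FlashAttention-2 are supported
        return "bfloat16", "flash_attention_2"
    # Older GPUs: fall back to float16 and the eager attention implementation
    return "float16", "eager"


for name, major in [("T4", 7), ("A100", 8), ("H100", 9)]:
    dtype_name, attn = pick_precision(major)
    print(f"{name} (capability {major}.x): {dtype_name}, {attn}")
```

In the notebook, `getattr(torch, dtype_name)` would turn the string back into the dtype object.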

Load and configure the tokenizer for fine-tuning:

In [None]:
model_name = "mistralai/Mistral-7B-v0.1"
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left' #Necessary for FlashAttention compatibility



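
To make the effect of `padding_side` concrete, here is a toy illustration in plain Python. It is not the tokenizer's implementation: `pad_batch` and the token IDs are made up, and `0` stands in for the pad token.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad lists of token IDs to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(padding)
        mask_tok = [1] * len(seq)
        if side == "left":
            input_ids.append(padding + seq)
            attention_mask.append(mask_pad + mask_tok)
        else:
            input_ids.append(seq + padding)
            attention_mask.append(mask_tok + mask_pad)
    return input_ids, attention_mask


batch = [[5, 6, 7], [8, 9]]  # made-up token IDs
ids, mask = pad_batch(batch, pad_id=0, side="left")
print(ids)   # [[5, 6, 7], [0, 8, 9]]
print(mask)  # [[1, 1, 1], [0, 1, 1]]
```

With left padding, the real tokens of every sequence end at the same position. Since the pad token here is the EOS token, the attention mask is what tells padding apart from a genuine end-of-sequence token.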

Load an instruction dataset for fine-tuning:

In [None]:
ds = load_dataset("timdettmers/openassistant-guanaco")

The train split contains 9,846 examples and the test split contains 518.
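
Each record in this dataset has a single `text` field holding a whole conversation, with turns marked by `### Human:` and `### Assistant:`. The sketch below splits such a string into turns so you can inspect what the model will be trained on. `split_turns` is a hypothetical helper, and the sample string is invented rather than taken from the dataset.

```python
import re


def split_turns(text):
    """Split a '### Human: ... ### Assistant: ...' string into (speaker, message) pairs."""
    parts = re.split(r"### (Human|Assistant):", text)
    # With one capture group, re.split yields [prefix, speaker, message, speaker, message, ...]
    speakers, messages = parts[1::2], parts[2::2]
    return [(s, m.strip()) for s, m in zip(speakers, messages)]


sample = "### Human: What is GaLore?### Assistant: A memory-efficient training method."
for speaker, message in split_turns(sample):
    print(f"{speaker}: {message}")
```

With the real data, `split_turns(ds["train"][0]["text"])` would do the same for the first training example.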

Load the model and enable gradient checkpointing to reduce memory consumption:

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map={"": 0},  # place the whole model on GPU 0
    attn_implementation=attn_implementation,
    torch_dtype=torch_dtype,
)
model.gradient_checkpointing_enable()

The checkpoint downloads as two safetensors shards of 9.94 GB and 4.54 GB.
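
As a rough sanity check on the memory figures, the weights alone take about 14.5 GB in 16-bit precision. The arithmetic below assumes roughly 7.24 billion parameters for Mistral 7B and 2 bytes per parameter. Optimizer states and activations come on top of this.

```python
NUM_PARAMS = 7.24e9      # approximate parameter count of Mistral 7B (assumption)
BYTES_PER_PARAM = 2      # bfloat16 and float16 both use 2 bytes per value

weights_gb = NUM_PARAMS * BYTES_PER_PARAM / 1e9
print(f"Weights: {weights_gb:.1f} GB")

# A full set of 16-bit gradients, if materialized all at once, would be the same size again
grads_gb = weights_gb
print(f"Weights + gradients: {weights_gb + grads_gb:.1f} GB")
```

This is only a back-of-the-envelope estimate, but it suggests why holding every gradient in memory at once is expensive for a 7B model.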

The following cell configures the training arguments for GaLore with layerwise updates. At the time of writing this notebook, this functionality didn't seem to work.

Expect a consumption of 22.5 GB of GPU RAM.

In [None]:
training_arguments = TrainingArguments(
        output_dir="./galore_adamw_8bit_layerwise_r512_1e-5_wl/",
        evaluation_strategy="steps",
        do_eval=True,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        log_level="debug",
        optim="galore_adamw_8bit_layerwise",
        optim_args="rank=512, update_proj_gap=200, scale=1.8",
        optim_target_modules=[r".*attn.*", r".*mlp.*"],
        save_strategy = 'epoch',
        logging_steps=50,
        learning_rate=1e-5,
        eval_steps=50,
        fp16= not torch.cuda.is_bf16_supported(),
        bf16= torch.cuda.is_bf16_supported(),
        num_train_epochs=3,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
)
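
`optim_target_modules` is a list of patterns matched against module names to decide which layers GaLore handles. The sketch below shows which names the two patterns select. It assumes full-match regex semantics, which is my understanding of how `transformers` treats them, and uses a handful of module names in the style of Mistral's layers. `is_galore_target` is a hypothetical helper, not a library function.

```python
import re

patterns = [r".*attn.*", r".*mlp.*"]

# Example module names in the style of Mistral's decoder layers (illustrative, not exhaustive)
module_names = [
    "model.embed_tokens",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.0.input_layernorm",
    "lm_head",
]


def is_galore_target(name, patterns):
    """Return True if any pattern matches the whole module name."""
    return any(re.fullmatch(p, name) for p in patterns)


for name in module_names:
    label = "GaLore" if is_galore_target(name, patterns) else "regular update"
    print(f"{name:35s} -> {label}")
```

Under these assumptions, the attention and MLP projections are selected, while the embeddings, layer norms, and `lm_head` are not.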

Start training:

In [None]:
trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()

The following configuration disables layerwise updates. It works, but consumes 35 GB of GPU RAM.

In [None]:
training_arguments = TrainingArguments(
        output_dir="./galore_adamw_8bit_r512_1e-5_nol/",
        evaluation_strategy="steps",
        do_eval=True,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        log_level="debug",
        optim="galore_adamw_8bit",
        optim_args="rank=512, update_proj_gap=200, scale=1.8",
        optim_target_modules=[r".*attn.*", r".*mlp.*"],
        save_strategy = 'epoch',
        logging_steps=50,
        learning_rate=1e-5,
        eval_steps=50,
        fp16= not torch.cuda.is_bf16_supported(),
        bf16= torch.cuda.is_bf16_supported(),
        num_train_epochs=3,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
)

trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        dataset_text_field="text",
        max_seq_length=512,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()
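
For a sense of scale, the schedule implied by these arguments can be worked out by hand. The sketch assumes the 9,846-example train split, a single GPU, no gradient accumulation, and no sequence packing. Under different settings the numbers change.

```python
import math

train_examples = 9846        # size of the train split (assumed unchanged by preprocessing)
batch_size = 8               # per_device_train_batch_size
epochs = 3                   # num_train_epochs
warmup_ratio = 0.1
update_proj_gap = 200        # from optim_args

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)
# Assumption: the projection is recomputed once every update_proj_gap steps, starting at step 0
projection_updates = math.ceil(total_steps / update_proj_gap)

print(f"Steps per epoch:    {steps_per_epoch}")
print(f"Total steps:        {total_steps}")
print(f"Warmup steps:       {warmup_steps}")
print(f"Projection updates: {projection_updates}")
```

With `eval_steps=50` and `logging_steps=50`, a run of this length would evaluate and log several dozen times, which is worth keeping in mind when estimating total runtime.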