<a href="https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/gemma-huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gemma in the Hugging Face Ecosystem

The Hugging Face ecosystem is designed to maximise the potential of open-source developers. In this notebook, we'll see how the ecosystem can be leveraged for the "full-stack" Gemma pipeline, starting with inference, moving on to fine-tuning the model on a specific dataset, and finishing with deploying the model in the cloud.

<p align="center">
  <img src="https://github.com/sanchit-gandhi/notebooks/blob/main/gemma_pipeline.jpg?raw=true" width="800"/>
</p>

## Set-up Python environment

First, we need to register our Hugging Face Hub token with our Google Colab runtime. Since the Gemma model is gated, our token will be checked when the model is downloaded to ensure we have accepted the terms-of-use. To register your token, click the key symbol 🔑 in the left-hand pane of the screen. Name the secret `HF_TOKEN`, and paste in a token copied from your Hugging Face [Hub account](https://huggingface.co/settings/tokens). Your token should now be registered, allowing you to access the Gemma weights from this Colab session.

For reasonable training and inference speed with Gemma, we'll want to run the model on a GPU. The runtime is already configured to use the free 16GB T4 GPU provided through Google Colab Free Tier, so all you need to do is hit `"Connect T4"` in the top right-hand corner of the screen.

Once we've done that, we can go ahead and install the necessary Python packages:

In [None]:
!pip install --upgrade --quiet transformers huggingface_hub datasets accelerate trl peft bitsandbytes

## Inference with Transformers

[Transformers](https://huggingface.co/docs/transformers/index) is a state-of-the-art toolkit for open-source machine learning models. It contains all the functionality needed to download pre-trained weights and run inference with them, along with integrations with other libraries for further fine-tuning.

There are four pre-trained Gemma checkpoints to choose from, summarised in the table below. All four checkpoints are uploaded to the Hugging Face Hub with integrations in the Transformers library:

| Model ID    | Size / B params | Type        |
|-------------|-----------------|-------------|
| [gemma-2b](https://huggingface.co/google/gemma-2b)    | 2.5             | Base        |
| [gemma-2b-it](https://huggingface.co/google/gemma-2b-it) | 2.5             | Instruction |
| [gemma-7b](https://huggingface.co/google/gemma-7b)    | 8.5             | Base        |
| [gemma-7b-it](https://huggingface.co/google/gemma-7b-it) | 8.5             | Instruction |


For this example, we'll use [gemma-2b](https://huggingface.co/google/gemma-2b), the 2B parameter base model that has been trained for the task of text generation. With 2.5B parameters at 4 bytes each, loading this model in full precision (float32) requires 10GB of memory for the weights alone, which already takes up a large share of the 16GB T4 GPU assigned to the Google Colab free tier. To circumvent this, we'll load the weights in [4-bit precision](https://huggingface.co/blog/4bit-transformers-bitsandbytes), which reduces the memory of the weights roughly by a factor of 8.
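
The memory figures above follow from multiplying the parameter count by the bytes stored per parameter. The snippet below is a minimal sketch of that arithmetic, using the 2.5B parameter count from the table. The helper name `weight_memory_gb` is made up for illustration, and the 4-bit figure ignores the small overhead of quantization constants, so treat it as a lower bound.

```python
# Approximate memory needed just to hold the model weights (1 GB = 1e9 bytes).
# Bytes per parameter for each storage format; 4-bit is half a byte.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "4bit": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate weight footprint in GB for a given storage format."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 2.5e9  # gemma-2b parameter count from the table above

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(num_params, dtype):.2f} GB")
```

Going from float32 to 4-bit divides the per-parameter cost by 8, which is where the "factor of 8" reduction comes from.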

To do this, we'll define a bitsandbytes [quantization config](https://huggingface.co/docs/transformers/quantization#4-bit), which specifies the precision to which the model should be quantized and the data-type (dtype) in which to run it:

In [None]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16,
)

We can now load the pre-trained model weights from the Hugging Face Hub, passing the quantization config to the loading method to prepare the weights in 4-bit precision:

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_id = "google/gemma-2b"

model = AutoModelForCausalLM.from_pretrained(
    checkpoint_id, low_cpu_mem_usage=True, quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_id, use_fast=True)

The advantage of using the [`AutoModelForCausalLM`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM) and [`AutoTokenizer`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer) API over model-specific classes is that we can easily swap the checkpoint id for any other checkpoint on the Hugging Face Hub and re-use our code without any changes. The auto-classes will take care of loading the correct model and tokenizer classes for us. That means if a new LLM is released, we can quickly update our code to leverage the new model. Simply swap the checkpoint id `google/gemma-2b` for the model id on the Hugging Face Hub.

Great! We've loaded our model into memory, so now we're ready to define our inputs for inference. In this example, we'll pass a simple prompt to the Gemma model, and pre-process it to token id representation using our tokenizer:

In [None]:
input_ids = tokenizer("Recipe for pasta:", return_tensors="pt").input_ids
input_ids = input_ids.to(model.device)

Now that we've pre-processed our inputs, we can auto-regressively generate our response using the model's [`generate`](https://huggingface.co/blog/how-to-generate) method. Here, we'll set the generation strategy to sampling with a temperature of 0.6, and specify the maximum number of new tokens to generate. A full list of generation parameters can be found [here](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig).

In [None]:
from transformers import set_seed

set_seed(0)
pred_ids = model.generate(input_ids, do_sample=True, temperature=0.6, max_new_tokens=256)
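
To build some intuition for the `temperature` argument, the snippet below is a small standard-library sketch of temperature-scaled sampling. It is not the Transformers implementation: the logits are invented scores for a three-token vocabulary, and `softmax_with_temperature` is a hypothetical helper written for this illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical next-token scores, not real model outputs

probs_default = softmax_with_temperature(logits, temperature=1.0)
probs_sharp = softmax_with_temperature(logits, temperature=0.6)

print("T=1.0:", [round(p, 3) for p in probs_default])
print("T=0.6:", [round(p, 3) for p in probs_sharp])

# Sampling draws a token index in proportion to these probabilities.
rng = random.Random(0)
sampled_index = rng.choices(range(len(logits)), weights=probs_sharp, k=1)[0]
print("sampled token index:", sampled_index)
```

A temperature below 1 concentrates probability on the highest-scoring tokens, so generations become less random while still varying from one seed to the next.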

Finally, we decode the predicted ids back to text characters, again using the tokenizer:

In [None]:
pred_text = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
print(pred_text[0])

## Datasets with Datasets

[Datasets](https://huggingface.co/docs/datasets/index) is a library for easily accessing and sharing machine learning datasets across all tasks and domains. It can be used to load and pre-process datasets with a single line of code, and has powerful functionality for preparing a dataset for training with a transformer-based model. Datasets features a deep integration with the Hugging Face Hub, allowing you to easily load and share a dataset with the wider machine learning community.

Let's load a subset of the OpenAssistant dataset from the Hugging Face Hub, [OpenAssistant Guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco). This dataset is extremely lightweight (22MB) and so can be downloaded locally very quickly:

In [None]:
from datasets import load_dataset

dataset = load_dataset("timdettmers/openassistant-guanaco")

**Tip:** you can swap this dataset for any one of the [available text datasets](https://huggingface.co/datasets?sort=trending) on the Hugging Face Hub.

We can pull up the first sample to have a look at the format of our data. We see that the human/assistant turns are labelled by `### Human:` and `### Assistant:` respectively:

In [None]:
sample = dataset["train"][0]
sample
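
To make the turn format concrete, here is a minimal sketch that splits a conversation string into (speaker, message) pairs using the `### Human:` and `### Assistant:` markers. The `text` below is a made-up example written in that format rather than a real row from the dataset, and `split_turns` is a hypothetical helper.

```python
import re

# Hypothetical conversation in the "### Human:" / "### Assistant:" format.
text = (
    "### Human: What is LoRA?"
    "### Assistant: A parameter-efficient fine-tuning method."
    "### Human: Thanks!"
    "### Assistant: You're welcome."
)


def split_turns(conversation):
    """Split a conversation string into a list of (speaker, message) pairs."""
    # With one capture group, re.split returns
    # [prefix, speaker, message, speaker, message, ...].
    parts = re.split(r"### (Human|Assistant):", conversation)
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


turns = split_turns(text)
for speaker, message in turns:
    print(f"{speaker}: {message}")
```

With a real sample, the same helper could be applied to `sample["text"]`, assuming the text column keeps this marker format.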

## Training with TRL and PEFT

### TRL

[TRL](https://huggingface.co/docs/trl/en/index) is an open-source library for training Transformer language models with Reinforcement Learning, covering Supervised Fine-Tuning (SFT), Reward Modelling (RM) and Proximal Policy Optimisation (PPO). The library is built on top of the Transformers Trainer, meaning it is inherently compatible with Transformers models and datasets in Datasets.

### PEFT

Training LLMs requires a significant amount of GPU compute to hold all the weights, gradients and optimiser states. Specifically, for full fine-tuning, we require:
* 2 bytes for the weight
* 2 bytes for the gradient
* 4 + 8 bytes for the Adam optimizer states (a full-precision copy of the weight, plus the two moment estimates)

With a total of 16 bytes per trainable parameter, training the 2B Gemma model (2.5B parameters) requires roughly 40GB of memory, excluding the intermediate hidden states. This far exceeds the 16GB available on the T4 GPU, so we are at risk of an *out-of-memory (OOM)* error, where the memory usage exceeds that of the GPU.
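
The 40GB estimate can be reproduced with a few lines of arithmetic. The snippet below is a rough sketch using the per-parameter byte counts listed above and the 2.5B parameter count from the checkpoint table. The helper name `training_memory_gb` is illustrative, and activations are deliberately left out.

```python
# Bytes needed per trainable parameter during full fine-tuning,
# following the breakdown in the list above.
BYTES_PER_TRAINABLE_PARAM = {
    "weight": 2,
    "gradient": 2,
    "adam_optimizer_states": 4 + 8,
}


def training_memory_gb(num_params: float) -> float:
    """Approximate training memory in GB (1 GB = 1e9 bytes), excluding activations."""
    bytes_per_param = sum(BYTES_PER_TRAINABLE_PARAM.values())
    return num_params * bytes_per_param / 1e9


num_params = 2.5e9  # gemma-2b parameter count from the checkpoint table
t4_memory_gb = 16   # memory of the Colab T4 GPU

required_gb = training_memory_gb(num_params)
print(f"Required: {required_gb:.0f} GB, available: {t4_memory_gb} GB")
print("Fits on the GPU:", required_gb <= t4_memory_gb)
```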

To circumvent this, we'll apply [Parameter Efficient Fine-Tuning (PEFT)](https://huggingface.co/docs/peft/index) to only fine-tune a small number of (extra) model parameters and keep the base parameters frozen. This significantly decreases computational and storage costs, while yielding performance comparable to a fully fine-tuned model. This makes it more accessible to train and store large language models (LLMs) on consumer hardware, such as Google Colab GPUs.

In this example, we'll perform the first step in converting a pre-trained language model to an assistant: [Supervised Fine-Tuning (SFT)](https://huggingface.co/docs/trl/en/sft_trainer). However, the steps covered here can be extended to [Reward Modelling (RM)](https://huggingface.co/docs/trl/en/reward_trainer), [Proximal Policy Optimisation (PPO)](https://huggingface.co/docs/trl/en/ppo_trainer) and [Direct Preference Optimisation (DPO)](https://huggingface.co/docs/trl/en/dpo_trainer). Refer to the linked documentation for examples of each task. For more details on PEFT, refer to this excellent [blog post](https://pytorch.org/blog/finetune-llms/).

Let's first define our [training arguments](https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/trainer#transformers.TrainingArguments), such as batch size, number of epochs and learning rate. We'll also specify the output directory for our model, `gemma-2b-fine-tuned`:

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    learning_rate=1e-3,
    warmup_ratio=0.1,
    logging_steps=10,
    gradient_checkpointing=True,
    output_dir="./gemma-2b-fine-tuned",
    push_to_hub=True,
)

Note that if you do not want to push your fine-tuned model to the Hugging Face Hub, set `push_to_hub=False`.

We can now define our [PEFT config](https://huggingface.co/docs/peft/en/quicktour#train), which specifies the way in which the extra parameters should be defined and trained during fine-tuning:

In [None]:
from peft import LoraConfig, TaskType

peft_config = LoraConfig(task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1)
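
To see why LoRA trains so few parameters, recall that it adds two low-rank matrices of rank `r` alongside a frozen weight matrix. The snippet below is a back-of-the-envelope sketch of the parameter count for a single layer. The 2048 x 2048 layer size is a hypothetical example rather than a value read from the Gemma config, and the `lora_alpha / r` scaling reflects the standard LoRA formulation.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by LoRA to one (d_out x d_in) weight matrix.

    LoRA learns A with shape (r x d_in) and B with shape (d_out x r).
    """
    return r * d_in + d_out * r


d_in = d_out = 2048     # hypothetical layer dimensions, for illustration only
r, lora_alpha = 8, 32   # values used in the LoraConfig above

frozen_params = d_in * d_out
extra_params = lora_extra_params(d_in, d_out, r)
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"Frozen parameters:    {frozen_params:,}")
print(f"Trainable parameters: {extra_params:,}")
print(f"Trainable fraction:   {extra_params / frozen_params:.2%}")
print(f"Update scaling:       {scaling}")
```

Under these assumptions, under 1% of the layer's parameters need gradients and optimizer states, which is what makes training feasible on a single Colab GPU.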

To instantiate a data-collator for human-assistant style conversation data, pass a response template, an instruction template and the tokenizer:

In [None]:
from trl import DataCollatorForCompletionOnlyLM

instruction_template = "### Human:"
response_template = "### Assistant:"

collator = DataCollatorForCompletionOnlyLM(instruction_template=instruction_template, response_template=response_template, tokenizer=tokenizer, mlm=False)
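
The idea behind a completion-only collator is that the loss should only be computed on the assistant's tokens. The toy sketch below illustrates this with a label value of -100, which PyTorch's cross-entropy loss ignores by default. It is a simplification and not the TRL implementation: the token ids are invented, and each template is treated as a single marker token, whereas real templates span several tokens.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

HUMAN_ID = 1       # hypothetical id standing in for "### Human:"
ASSISTANT_ID = 2   # hypothetical id standing in for "### Assistant:"


def completion_only_labels(input_ids):
    """Copy input ids to labels, masking everything outside assistant responses."""
    labels = []
    in_response = False
    for token_id in input_ids:
        if token_id == ASSISTANT_ID:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker itself is not a target
        elif token_id == HUMAN_ID:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(token_id if in_response else IGNORE_INDEX)
    return labels


# Toy two-turn conversation: human tokens 11 and 12, assistant tokens 21, 22, 23.
input_ids = [1, 11, 2, 21, 22, 1, 12, 2, 23]
labels = completion_only_labels(input_ids)
print(labels)
```

Only the assistant tokens (21, 22 and 23) keep their ids as labels, so the model is not trained to reproduce the human prompts.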

We can pass our model, tokenizer and dataset to the [SFT Trainer](https://huggingface.co/docs/trl/sft_trainer):

In [None]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    data_collator=collator,
    max_seq_length=2048,
    peft_config=peft_config,
)

And then launch training:

In [None]:
trainer.train()

Training will take approximately 3-6 hours, depending on your GPU or the one allocated by Google Colab. If your GPU has memory to spare, you can raise the `per_device_train_batch_size` to increase throughput: double it incrementally until you reach the largest batch size that fits in memory. Alternatively, if you are limited to a `per_device_train_batch_size` of 1, you can compensate with [`gradient_accumulation_steps`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.gradient_accumulation_steps), doubling it to increase your effective batch size.
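
The trade-off between the two settings can be sketched with simple arithmetic: the effective batch size is the product of the per-device batch size, the number of accumulation steps and the number of devices. The helper `effective_batch_size` below is illustrative, and a single GPU is assumed, as in the Colab runtime.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples that contribute to each optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# Settings used in the training arguments above: 1 sample x 8 accumulation steps.
print(effective_batch_size(1, 8))

# Combinations that keep the same effective batch size while using more memory per step.
for batch_size, accumulation_steps in [(1, 8), (2, 4), (4, 2), (8, 1)]:
    total = effective_batch_size(batch_size, accumulation_steps)
    print(f"batch_size={batch_size}, accumulation_steps={accumulation_steps} -> {total}")
```

Keeping this product fixed while trading accumulation steps for a larger per-device batch preserves the optimisation behaviour and typically improves throughput, memory permitting.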

Once training has finished, we can push the final model to the Hugging Face Hub, such that it is available under `your-username/gemma-2b-fine-tuned`:

In [None]:
trainer.push_to_hub(
    finetuned_from=checkpoint_id,
    dataset_tags="timdettmers/openassistant-guanaco",
    tasks="text-generation",
)