<a href="https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/gemma-transformers-streamlined.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gemma in the Hugging Face Ecosystem

The Hugging Face ecosystem is designed to maximise the potential of open-source developers. In this notebook, we'll see how the ecosystem can be leveraged for the "full-stack" Gemma pipeline, starting with inference, fine-tuning the model on a specific dataset, and finishing with deploying the model in the cloud.

<p align="center">
  <img src="https://github.com/sanchit-gandhi/notebooks/blob/main/gemma_pipeline.jpg?raw=true" width="800"/>
</p>

## Set-up Python environment

First, we need to register our Hugging Face Hub token with our Google Colab runtime. Since the Gemma model is gated, our token will be checked when the model is downloaded to ensure we have accepted the terms of use. To register your token, click the key symbol 🔑 in the left-hand pane of the screen. Name the secret `HF_TOKEN`, and paste in a token from your Hugging Face [Hub account](https://huggingface.co/settings/tokens). Your token should now be registered, allowing you to access the Gemma weights from this Colab session.

For reasonable training and inference speed with Gemma, we'll want to run the model on a GPU. The runtime is already configured to use the free 16GB T4 GPU provided through Google Colab Free Tier, so all you need to do is hit `"Connect T4"` in the top right-hand corner of the screen.

Once we've done that, we can go ahead and install the necessary Python packages:

In [None]:
!pip install --upgrade --quiet transformers datasets accelerate trl peft

## Inference with Transformers

[Transformers](https://huggingface.co/docs/transformers/index) is a state-of-the-art toolkit for open-source machine learning models. It contains all the functionality needed to download pre-trained weights and run inference with them, and it integrates with other libraries for further fine-tuning.

There are four pre-trained Gemma checkpoints to choose from, summarised in the table below. All four checkpoints are available on the Hugging Face Hub and integrated with the Transformers library:

| Model ID    | Size / B params | Type        |
|-------------|-----------------|-------------|
| [gemma-2b](https://huggingface.co/google/gemma-2b)    | 2.5             | Base        |
| [gemma-2b-it](https://huggingface.co/google/gemma-2b-it) | 2.5             | Instruction |
| [gemma-7b](https://huggingface.co/google/gemma-7b)    | 8.5             | Base        |
| [gemma-7b-it](https://huggingface.co/google/gemma-7b-it) | 8.5             | Instruction |


For this example, we'll use [gemma-2b](https://huggingface.co/google/gemma-2b), the 2B parameter base model that has been trained for the task of text generation. A 2B parameter LLM in full precision (float32) requires 10GB of memory just to load the weights (2.5B parameters at 4 bytes each), and a whopping 40GB for fine-tuning! The GPU typically assigned on the Google Colab free tier only has a capacity of 16GB. This means we are at risk of an *out-of-memory (OOM)* error, where the memory usage exceeds that of the GPU. To circumvent this, we'll load the weights in [4-bit precision](https://huggingface.co/blog/4bit-transformers-bitsandbytes), which reduces the memory of the weights by roughly a factor of 8.
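The memory figures above follow from simple arithmetic: number of parameters times bytes per parameter. The sketch below reproduces them, assuming the 2.5B parameter count from the table and counting 1GB as 10^9 bytes. The helper name `weight_memory_gb` is illustrative only, and real 4-bit usage will be somewhat higher, since quantization constants and any layers kept in higher precision are not counted here.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 2.5e9  # gemma-2b parameter count, taken from the table above

fp32_gb = weight_memory_gb(NUM_PARAMS, 32)  # full precision
fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half precision
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # idealised 4-bit storage

print(f"float32: {fp32_gb:.2f} GB")
print(f"float16: {fp16_gb:.2f} GB")
print(f"4-bit:   {int4_gb:.2f} GB")
print(f"float32 / 4-bit ratio: {fp32_gb / int4_gb:.0f}x")
```

The ratio of 8 is simply 32 bits divided by 4 bits.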

To do this, we'll define a bitsandbytes [quantization config](https://huggingface.co/docs/transformers/quantization#4-bit), which specifies the precision to which the model should be quantized and the data type (dtype) in which computations are run:

In [1]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16,
)



We can now load the pre-trained model weights from the Hugging Face Hub, passing the quantization config to the loading method to prepare the weights in 4-bit precision:

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", low_cpu_mem_usage=True, quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", use_fast=True)

Downloading shards: 100%|███████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3611.11it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.25s/it]


The advantage of using the [`AutoModelForCausalLM`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM) and [`AutoTokenizer`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer) API over model-specific classes is that we can easily swap the checkpoint id for any other checkpoint on the Hugging Face Hub and re-use our code without any changes. The auto-classes will take care of loading the correct model and tokenizer classes for us. That means if a new LLM is released, we can quickly update our code to leverage the new model: simply swap the checkpoint id `google/gemma-2b` for the new model's id on the Hugging Face Hub.

Great! We've loaded our model into memory, so now we're ready to define our inputs for inference. In this example, we'll pass a simple prompt to the Gemma model, and pre-process it to token id representation using our tokenizer:

In [3]:
input_ids = tokenizer("Recipe for pasta:", return_tensors="pt").input_ids
input_ids = input_ids.to(model.device)

Now that we've pre-processed our inputs, we can auto-regressively generate our response using the model's [`generate`](https://huggingface.co/blog/how-to-generate) method. Here, we'll set the generation strategy to sampling with a temperature of 0.6, and specify the maximum number of new tokens to generate. A full list of generation parameters can be found [here](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig).

In [4]:
from transformers import set_seed

set_seed(0)
pred_ids = model.generate(input_ids, do_sample=True, temperature=0.6, max_new_tokens=256)
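To give some intuition for what `temperature` does during sampling, here is a minimal pure-Python sketch, not the Transformers implementation. It divides a set of toy logits by the temperature before applying a softmax and drawing a token index. The function name and the logit values are made up for illustration.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Draw one index from softmax(logits / temperature); also return the probabilities."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
rng = random.Random(0)

_, probs_low = sample_with_temperature(toy_logits, 0.6, rng)
_, probs_one = sample_with_temperature(toy_logits, 1.0, rng)

print("temperature=0.6:", [round(p, 3) for p in probs_low])
print("temperature=1.0:", [round(p, 3) for p in probs_one])
```

A temperature below 1 sharpens the distribution towards the highest-scoring token, which tends to make generations more focused while still allowing some variety.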

Finally, we decode the predicted ids back to text characters, again using the tokenizer:

In [5]:
pred_text = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
print(pred_text[0])

Recipe for pasta:

* 4 1/2 cups of all-purpose flour
* 2 tsp of salt
* 2 tsp of sugar
* 12 oz. of warm water
* 2 tsp of active dry yeast
* 3 tbsp of vegetable oil
* 3/4 tsp of granulated garlic
* 1 tsp of granulated onion
* 1/2 tsp of dried marjoram
* 1/4 tsp of ground black pepper
* 1/4 tsp of cayenne pepper

In a large bowl, sift together the flour, salt, sugar, and garlic, onion, and marjoram. Make a well in the dry ingredients and pour in the yeast. Add the oil and use your hands to mix.

Add the water and knead the dough together into a smooth, elastic, and extensible dough.

Cover the dough and let it rise for about 20 minutes.

Divide the dough into two or three loaves and shape them into your desired shape.

Cover the loaves and let them rise for about 30 minutes.

Preheat the oven to 375°F.

Brush a little oil or melted butter on the baking sheet and place the loaves on the sheet.

Stick a


## Datasets with Datasets

[Datasets](https://huggingface.co/docs/datasets/index) is a library for easily accessing and sharing machine learning datasets across all tasks and domains. It can be used to load and pre-process datasets with a single line of code, and has powerful functionality to prepare a dataset for training with a transformer-based model. Datasets features a deep integration with the Hugging Face Hub, allowing you to easily load and share a dataset with the wider machine learning community.

Let's load a subset of the OpenAssistant dataset from the Hugging Face Hub, [OpenAssistant Guanaco](https://huggingface.co/datasets/timdettmers/openassistant-guanaco). This dataset is extremely lightweight (22MB) and so can be downloaded locally very quickly:

In [6]:
from datasets import load_dataset

dataset = load_dataset("timdettmers/openassistant-guanaco")



**Tip:** you can swap this dataset for any one of the [available text datasets](https://huggingface.co/datasets?sort=trending) on the Hugging Face Hub.

We can pull up the first sample to have a look at the format of our data. We see that the human/assistant turns are labelled by `### Human:` and `### Assistant:` respectively:

In [7]:
sample = dataset["train"][0]
sample

{'text': '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining po
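To make the turn format concrete, the sketch below splits a string of this shape into (speaker, message) pairs using the `### Human:` and `### Assistant:` markers. It is only an illustration: `split_turns` is a hypothetical helper, and the toy string stands in for a real sample rather than quoting the dataset.

```python
import re

# Matches either turn marker and captures the speaker name
TURN_PATTERN = re.compile(r"### (Human|Assistant):")


def split_turns(text):
    """Split a Guanaco-style string into a list of (speaker, message) pairs."""
    # With a capture group, re.split returns
    # [prefix, speaker_1, message_1, speaker_2, message_2, ...]
    parts = TURN_PATTERN.split(text)
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


# Hypothetical sample in the same format as the dataset's "text" field
toy_sample = "### Human: What is a monopsony?### Assistant: A market with a single buyer."

turns = split_turns(toy_sample)
for speaker, message in turns:
    print(f"{speaker}: {message}")
```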

## Training with TRL and PEFT

[TRL](https://huggingface.co/docs/trl/en/index) is an open-source library for training Transformer language models with Reinforcement Learning, from Supervised Fine-Tuning (SFT), to Reward Modelling (RM), and Proximal Policy Optimisation (PPO). The library is built on top of the Transformers Trainer, meaning it is inherently compatible with Transformers models and with datasets loaded through Datasets.

As mentioned above, full fine-tuning of the Gemma model would require 40GB of GPU memory. To circumvent this, we'll apply [Parameter Efficient Fine-Tuning (PEFT)](https://huggingface.co/docs/peft/index) to only fine-tune a small number of (extra) model parameters. This significantly decreases computational and storage costs, while yielding performance comparable to a fully fine-tuned model. This makes it more accessible to train and store large language models (LLMs) on consumer hardware, such as Google Colab GPUs.

Let's first define our [training arguments](https://huggingface.co/docs/transformers/v4.38.2/en/main_classes/trainer#transformers.TrainingArguments), such as batch size, number of epochs and learning rate. We'll also specify the output directory for our model, `gemma-2b-fine-tuned`:

In [8]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    learning_rate=1e-3,
    warmup_ratio=0.1,
    logging_steps=10,
    gradient_checkpointing=True,
    output_dir="gemma-2b-fine-tuned",
    push_to_hub=True,
)

Note that if you do not want to push your fine-tuned model to the Hugging Face Hub, set `push_to_hub=False`.
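The batch size, gradient accumulation, epoch count and warm-up ratio together determine roughly how many optimiser steps the run will take. The sketch below estimates this for the values above, assuming a single GPU and the 9,846 training examples in this dataset's train split. `estimate_schedule` is a hypothetical helper, and the Trainer's own bookkeeping may differ slightly between versions.

```python
import math


def estimate_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, warmup_ratio, num_devices=1):
    """Rough estimate of effective batch size, optimiser steps and warm-up steps."""
    # Examples consumed per optimiser update
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = max(num_examples // effective_batch, 1)
    total_steps = math.ceil(steps_per_epoch * num_epochs)
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps


effective_batch, total_steps, warmup_steps = estimate_schedule(
    num_examples=9846,        # size of the train split (assumed unchanged after pre-processing)
    per_device_batch_size=1,
    grad_accum_steps=8,
    num_epochs=3,
    warmup_ratio=0.1,
)

print(f"effective batch size: {effective_batch}")
print(f"estimated optimiser steps: {total_steps}")
print(f"estimated warm-up steps: {warmup_steps}")
```

With `logging_steps=10`, a loss value would be reported once every 10 of these optimiser steps.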

We can now define our [PEFT config](https://huggingface.co/docs/peft/en/quicktour#train), which specifies the way in which the extra parameters should be defined and trained during fine-tuning:

In [9]:
from peft import LoraConfig, TaskType

peft_config = LoraConfig(task_type=TaskType.CAUSAL_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1)
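To see why LoRA trains so few parameters, consider a single square weight matrix. LoRA leaves it frozen and learns two low-rank matrices of rank `r` alongside it, with the update scaled by `lora_alpha / r`. The sketch below works through the arithmetic for `r=8` and `lora_alpha=32`. The hidden size of 2048 is an assumption chosen for illustration, not a value read from the model config, and which layers actually receive adapters depends on the PEFT defaults for the model.

```python
def lora_extra_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one (d_out x d_in) weight.

    The update is B @ A, where A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * d_in + d_out * r


hidden = 2048            # assumed layer width, for illustration only
r, lora_alpha = 8, 32    # values from the LoraConfig above

full_params = hidden * hidden
extra_params = lora_extra_params(hidden, hidden, r)
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"frozen weight parameters: {full_params:,}")
print(f"trainable LoRA parameters: {extra_params:,}")
print(f"trainable fraction: {extra_params / full_params:.2%}")
print(f"LoRA scaling factor: {scaling}")
```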

We also prepare our quantized model for 4-bit training:

In [10]:
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

To instantiate a data-collator for human-assistant style conversation data, pass a response template, an instruction template and the tokenizer:

In [11]:
from trl import DataCollatorForCompletionOnlyLM

instruction_template = "### Human:"
response_template = "### Assistant:"

collator = DataCollatorForCompletionOnlyLM(instruction_template=instruction_template, response_template=response_template, tokenizer=tokenizer, mlm=False)
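Conceptually, a completion-only collator masks the labels so that the loss is computed only on the assistant's responses. The sketch below illustrates the idea on a toy list of string "tokens", using `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. It is a simplified stand-in rather than TRL's implementation, which operates on token ids and multi-token templates. The function name and the `<H>` / `<A>` markers are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_non_response(tokens, instruction_marker, response_marker):
    """Return labels in which only tokens inside assistant responses are kept."""
    labels = []
    in_response = False
    for tok in tokens:
        if tok == response_marker:
            in_response = True
            labels.append(IGNORE_INDEX)  # the marker itself is not a training target
        elif tok == instruction_marker:
            in_response = False
            labels.append(IGNORE_INDEX)
        else:
            labels.append(tok if in_response else IGNORE_INDEX)
    return labels


# Toy two-turn conversation; "<H>" and "<A>" stand in for the two templates
toy_tokens = ["<H>", "hi", "there", "<A>", "hello", "!", "<H>", "bye", "<A>", "goodbye"]
toy_labels = mask_non_response(toy_tokens, "<H>", "<A>")

for tok, label in zip(toy_tokens, toy_labels):
    print(f"{tok!r:>10} -> {label!r}")
```

Only `hello`, `!` and `goodbye` keep their labels, so the human turns contribute context but no loss.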



We can pass our model, tokenizer and dataset to the [SFT Trainer](https://huggingface.co/docs/trl/sft_trainer):

In [14]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    data_collator=collator,
    max_seq_length=min(model.config.max_position_embeddings, tokenizer.model_max_length),
    peft_config=peft_config,
)

Map: 100%|█████████████████████████████████████████████████████████████████| 9846/9846 [00:01<00:00, 6157.16 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████| 518/518 [00:00<00:00, 5736.41 examples/s]


And then launch training:

In [15]:
trainer.train()

Step,Training Loss
10,2.0936



KeyboardInterrupt



Training will take approximately 2-3 hours, depending on your GPU or the one allocated by Google Colab. Depending on your GPU, you may encounter a CUDA `"out-of-memory"` error when you start training. In this case, you can reduce the `per_device_train_batch_size` incrementally by factors of 2 and increase [`gradient_accumulation_steps`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.gradient_accumulation_steps) by the same factor to compensate, which keeps the effective batch size unchanged.
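The trade-off between batch size and gradient accumulation can be sketched as follows. Each halving of the per-device batch size is paired with a doubling of the accumulation steps, so their product (the effective batch size) stays constant. The helper name and the starting point of 8 are hypothetical, chosen only to show the progression.

```python
def halve_batch_keep_effective(per_device_batch_size, grad_accum_steps):
    """Halve the per-device batch size and double accumulation to compensate."""
    if per_device_batch_size < 2:
        raise ValueError("per-device batch size is already 1 and cannot be halved")
    return per_device_batch_size // 2, grad_accum_steps * 2


# Hypothetical starting point: batch size 8 with no accumulation
configs = [(8, 1)]
while configs[-1][0] > 1:
    configs.append(halve_batch_keep_effective(*configs[-1]))

for batch_size, accum in configs:
    print(f"per_device_train_batch_size={batch_size}, "
          f"gradient_accumulation_steps={accum}, "
          f"effective batch size={batch_size * accum}")
```

The final entry, a batch size of 1 with 8 accumulation steps, matches the training arguments used in this notebook. Since the batch size cannot go any lower, further memory savings would have to come from elsewhere, for example a shorter `max_seq_length`.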

Once training has finished, we can push the final model to the Hugging Face Hub, such that it is available under `your-username/gemma-2b-fine-tuned`:

In [None]:
trainer.push_to_hub(
    finetuned_from="google/gemma-2b",  # base checkpoint loaded earlier in this notebook
    dataset_tags="timdettmers/openassistant-guanaco",
    tasks="text-generation",
)