<a href="https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/gemma-transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gemma in 🤗 Transformers

Gemma is a family of language models released by Google DeepMind in the paper [Gemma: Open Models Based on Gemini Research and Technology](https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf). Gemma leverages the same research and technology used to create the Gemini models. The models were trained on up to 6 trillion tokens of text, drawn from web documents, code and mathematics. The result is a series of state-of-the-art models at the 2B and 7B scale, all of which are open-sourced and permissively licensed.

Gemma has support in the 🤗 Transformers library from day-0, in both PyTorch and JAX. In this Google Colab, we'll showcase how to use the Gemma 7B model in PyTorch, leveraging the familiar Transformers 3-line API. By quantising the model to 4-bit precision, the notebook can be run on a Google Colab T4 free-tier GPU. We'll wrap the loaded model into a Gradio demo, showcasing how the model can be shared with anyone in the community.

## Set-Up Environment

First, we need to register our Hugging Face Hub token with our Google Colab runtime. Since the Gemma model is gated, our token will be checked when the model is downloaded to ensure we have accepted the terms-of-use. To register your token, click the key symbol 🔑 in the left-hand pane of the screen. Name the secret `HF_TOKEN`, and copy in a token from your Hugging Face [Hub account](https://huggingface.co/settings/tokens). Your token should now be registered, allowing you to access the Gemma weights from this Colab session.

For reasonable inference speed with Gemma, we'll want to run the model on a GPU. The runtime is already configured to use the free 16GB T4 GPU provided through Google Colab Free Tier, so all you need to do is hit `"Connect T4"` in the top right-hand corner of the screen.

Once we've done that, we can go ahead and install the necessary Python packages:

In [1]:
!pip install --upgrade --quiet transformers accelerate bitsandbytes gradio

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.5/8.5 MB[0m [31m14.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m280.0/280.0 kB[0m [31m13.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m105.0/105.0 MB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m16.9/16.9 MB[0m [31m23.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.1/92.1 kB[0m [31m9.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m307.9/307.9 kB[0m [31m27.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.6/75.6 kB[0m [31m8.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m138.5/138.5 kB[0m [31m15.7 

## Baseline Implementation

In this example, we'll use Transformers' model + tokenizer API. Let's go ahead and import the two Python classes we'll need to run the Gemma model. The first of these classes, [`AutoModelForCausalLM`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModelForCausalLM), is our model class. This is the Python class that holds the model weights and graph definition. The second class, [`AutoTokenizer`](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer), is the tokenizer class, which converts the input string to a sequence of token ids (discrete numbers):

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

We can now load the pre-trained model weights from the Hugging Face Hub. There are four pre-trained Gemma checkpoints from which we can choose, summarised in the table below:

| Model ID    | Size / B params | Type        |
|-------------|-----------------|-------------|
| [gemma-2b](https://huggingface.co/google/gemma-2b)    | 2.5             | Base        |
| [gemma-2b-it](https://huggingface.co/google/gemma-2b-it) | 2.5             | Instruction |
| [gemma-7b](https://huggingface.co/google/gemma-7b)    | 8.5             | Base        |
| [gemma-7b-it](https://huggingface.co/google/gemma-7b-it) | 8.5             | Instruction |


For this example, we'll use [gemma-7b-it](https://huggingface.co/google/gemma-7b-it), the 8.5B parameter model that has been instruction-tuned to act as an assistant (or chatbot). An 8.5B parameter LLM requires roughly 34GB of memory just to load the weights in full precision (float32, 4 bytes per parameter), and about 17GB in half precision (float16 or bfloat16). The GPU typically assigned to a Google Colab free tier only has a capacity of 16GB. This means we would run into an *out-of-memory (OOM)* error, where the memory required by the model exceeds that of the GPU. To circumvent this, we'll load the weights in [4-bit precision](https://huggingface.co/blog/4bit-transformers-bitsandbytes), which reduces the memory of the weights by roughly a factor of 8 relative to float32. This is extremely simple in Transformers: we just pass the flag [`load_in_4bit`](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.from_pretrained.load_in_4bit) when we load the pre-trained weights. (Recent Transformers releases deprecate this flag in favour of a `BitsAndBytesConfig` object passed through the `quantization_config` argument, so expect a warning.)

In [3]:
checkpoint_id = "google/gemma-7b-it"

model = AutoModelForCausalLM.from_pretrained(checkpoint_id, low_cpu_mem_usage=True, load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_id)

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
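
As a back-of-the-envelope check on why quantisation is needed here, the sketch below estimates the memory taken by the weights alone at several precisions. It assumes the 8.5B parameter count from the table above and ignores activations, the key-value cache and quantisation metadata, so real usage will be higher. In practice some layers may also stay in higher precision, which this estimate does not account for. The helper name `weight_memory_gb` is made up for illustration.

```python
# Rough weight-memory estimate: number of parameters x bytes per parameter.
# Assumption: 8.5B parameters, taken from the checkpoint table above.
NUM_PARAMS = 8.5e9

BITS_PER_PARAM = {"float32": 32, "float16 / bfloat16": 16, "int8": 8, "4-bit": 4}


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, bits in BITS_PER_PARAM.items():
    print(f"{name:>20}: {weight_memory_gb(NUM_PARAMS, bits):5.2f} GB")
```

Under these assumptions, only the 8-bit and 4-bit rows fit within a 16GB GPU with room to spare for generation.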



The advantage of using the `AutoModelForCausalLM` and `AutoTokenizer` API over model-specific classes is that we can easily swap the checkpoint id for any other checkpoint on the Hugging Face Hub and re-use our code without any changes. The auto-classes will take care of loading the correct model and tokenizer classes for us! That means if a new LLM is released, we can quickly update our code to leverage the new model.

Great! We've loaded our model into memory, so now we're ready to define our pre-processing strategy. Our inputs consist of a series of user/assistant messages, from which we want to generate a new assistant response. Each instruction-tuned model has its own format for user/assistant messages, and it can be fiddly getting one's inputs into the correct format for the model. Luckily for us, the tokenizer provides a convenient method [`apply_chat_template`](https://huggingface.co/docs/transformers/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.apply_chat_template) to format the messages correctly for us. We just need to define which messages were sent by the user, and which ones were provided by the assistant. The tokenizer will then handle the precise formatting and get the ids ready for the model:

In [4]:
messages = [
    {"role": "user", "content": "Can you tell me about current trends artificial intelligence?"},
    {"role": "assistant", "content": "Certainly! Artificial Intelligence (AI) is advancing rapidly, with key trends including the growing adoption of deep learning techniques, the increasing use of AI in healthcare, finance, retail, and other industries, and the development of new AI tools for creative tasks, like music composition and visual art creation. Additionally, there is a growing focus on ethical considerations surrounding AI, as concerns arise about potential bias, privacy implications, and job displacement. Furthermore, AI is integrating with other technologies, such as blockchain and virtual reality, to create novel solutions for various challenges."},
    {"role": "user", "content": "What should I consider when using large language models?"}
]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
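
To get a feel for what the chat template does before tokenization, here is a hand-rolled approximation of the Gemma turn format. It is a sketch only: the control strings (`<bos>`, `<start_of_turn>`, `<end_of_turn>`) and the renaming of the `assistant` role to `model` are assumptions based on the published Gemma prompt format, and `format_gemma_chat` is a hypothetical helper, not part of Transformers. The tokenizer's own template remains the source of truth.

```python
# Illustrative approximation of the Gemma chat format (not the library's code).
def format_gemma_chat(messages, add_generation_prompt=True):
    prompt = "<bos>"
    for message in messages:
        # Assumption: Gemma labels assistant turns with the role name "model"
        role = "model" if message["role"] == "assistant" else message["role"]
        prompt += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a fresh model turn so that generation continues as the assistant
        prompt += "<start_of_turn>model\n"
    return prompt


example = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
    {"role": "user", "content": "What is Gemma?"},
]
print(format_gemma_chat(example))
```

`apply_chat_template` accepts an `add_generation_prompt` argument of its own. The call above leaves it at its default, so it may be worth experimenting with `add_generation_prompt=True` to cue the model explicitly to begin an assistant turn.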

Perfect! Now, our model is currently sitting on the GPU:

In [5]:
model.device

device(type='cuda', index=0)

Whereas our inputs are on the CPU:

In [6]:
encodeds.device

device(type='cpu')

Every PyTorch model expects the inputs to be on the same device as the model. Let's move the inputs to the correct device before generating:

In [7]:
model_inputs = encodeds.to(model.device)

Now that we've pre-processed our inputs, we can generate a response by calling the model's `generate` method and decoding the predicted ids back to text:

In [8]:
generated_ids = model.generate(model_inputs, max_new_tokens=1024, do_sample=True)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_text[0])



user
Can you tell me about current trends artificial intelligence?
model
Certainly! Artificial Intelligence (AI) is advancing rapidly, with key trends including the growing adoption of deep learning techniques, the increasing use of AI in healthcare, finance, retail, and other industries, and the development of new AI tools for creative tasks, like music composition and visual art creation. Additionally, there is a growing focus on ethical considerations surrounding AI, as concerns arise about potential bias, privacy implications, and job displacement. Furthermore, AI is integrating with other technologies, such as blockchain and virtual reality, to create novel solutions for various challenges.
user
What should I consider when using large language models?
Here are some key considerations when using large language models:

**Accuracy:**
- Large language models are not perfect, and errors can occur.
- Always verify information generated by LLMs with other sources.

**Bias:**
- LLMs can 

Looks like a very reasonable response! We can see the change between the user input and the assistant response marked by the `user` and `model` tokens. This was handled automatically for us by the tokenizer. Note that we've generated with [*sampling*](https://huggingface.co/blog/how-to-generate#sampling), so there's an element of randomness in the generations, meaning each one will be different from the last. Feel free to re-run generation a few times to get a feel for the kind of variety that sampling brings.
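
The difference between greedy decoding and sampling can be illustrated on a toy next-token distribution. The vocabulary and scores below are made up for the example, and the helper names are hypothetical. This is a minimal sketch of the idea, not how `generate` is implemented internally.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; lower temperature sharpens them."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Always choose the highest-scoring token (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, rng, temperature=1.0):
    """Draw a token at random in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


vocab = ["the", "a", "cat", "dog"]
logits = [2.0, 1.5, 0.5, 0.1]  # made-up scores for the next token

rng = random.Random(0)
print("greedy :", vocab[greedy_pick(logits)])
print("sampled:", [vocab[sample_pick(logits, rng)] for _ in range(8)])
```

Greedy decoding returns the same token on every call, whereas the sampled picks vary from draw to draw, which is the source of the variety seen when re-running generation.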

## Streaming Outputs

While the generated response looks good, we had to wait a significant amount of time for the model to finish generating. We know from our knowledge of Transformer models that they generate in an *auto-regressive* fashion, that is, one token at a time. See [this animation](https://cdn-lfs.huggingface.co/datasets/huggingface/documentation-images/55f001babd6cb06ea403299fa1b091f6fdb194f9ed756c4c8786185207f2fe3b?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27gif_2_1080p.mov%3B+filename%3D%22gif_2_1080p.mov%22%3B&response-content-type=video%2Fquicktime&Expires=1709301979&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwOTMwMTk3OX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy5odWdnaW5nZmFjZS5jby9kYXRhc2V0cy9odWdnaW5nZmFjZS9kb2N1bWVudGF0aW9uLWltYWdlcy81NWYwMDFiYWJkNmNiMDZlYTQwMzI5OWZhMWIwOTFmNmZkYjE5NGY5ZWQ3NTZjNGM4Nzg2MTg1MjA3ZjJmZTNiP3Jlc3BvbnNlLWNvbnRlbnQtZGlzcG9zaXRpb249KiZyZXNwb25zZS1jb250ZW50LXR5cGU9KiJ9XX0_&Signature=nflxtJhQDQtdU5FR6ZUCcmn2tVReVL43shmRM92%7EAbVjjR%7EVxElzpV7ueAdz%7EZ3qEX3SUtAd5SAdg6jkQUCrOUySud0vhF24rurHNIlg3ldHtFLucVDlO%7Esp2hNyjzU7Qpo8kP77Q%7EzD3gRsa6Ctmrn-aRIFV9vkuSRUUdWOkOVltxmyGsezoJUqujSDlBfNLDNxeoOzzyUV0AZOwCL2pAccZrHLQvRN8d1FX31sfVRJC3V9euY7-lsFjsMstpzfu8t1456tY2Tq7rDEXVo7PmsOWqCWKoy1M%7EcbxYw8OUc3putZYlAUkFfLlxX%7ERarwrO38CoQvfdC4S1vrB2ZpDA__&Key-Pair-Id=KVTP0A1DKRTAX) for a demonstration of auto-regressive decoding.

Rather than waiting for the entire text sequence to finish generating, we can print each token as it is generated. While this doesn't change the overall latency of the model (the total generation time is the same as before), the *perceived* latency is lower, since the first generated token is returned to the user as soon as it's ready.
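
The idea can be sketched with a toy auto-regressive loop written as a Python generator. The "model" here is a hypothetical stand-in (`toy_next_token`) that replays a fixed script, so only the control flow is meaningful: each new token is appended to the running sequence and handed to the caller immediately, rather than after the whole sequence is finished.

```python
def toy_next_token(tokens):
    """Stand-in for a model forward pass: returns the next token, or None to stop."""
    script = ["Large", " language", " models", " can", " make", " mistakes", "."]
    n_generated = len(tokens) - 1  # tokens[0] is the prompt
    return script[n_generated] if n_generated < len(script) else None


def generate_stream(prompt, max_new_tokens=16):
    """Auto-regressive loop that yields each new token as soon as it is produced."""
    tokens = [prompt]
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)
        if next_token is None:
            break
        tokens.append(next_token)  # the new token becomes part of the next input
        yield next_token


# Print tokens as they arrive instead of waiting for the full sequence
for token in generate_stream("Prompt:"):
    print(token, end="", flush=True)
print()
```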

Transformers provides a [`TextStreamer`](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TextStreamer) class to easily print the predictions on-the-fly. We simply have to pass the streamer to our generate function, and the rest is taken care of for us. Let's run an example using the text streamer:

In [9]:
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generated_ids = model.generate(model_inputs, max_new_tokens=1024, do_sample=True, streamer=streamer)

 - **Accuracy:** Large language models tend to be very accurate, but there can be errors in the context or facts presented especially when dealing with complex topics. Always fact-check information generated by large language models and avoid relying on them exclusively for primary research.


 - **Bias:** Large language models are trained on vast amounts of text data, which can contain biases. This biases can manifest in the way the models make decisions and generate text content. Be mindful of potential biases and be cautious about drawing conclusions based solely on the output of large language models.


 - **Credibility:** Large language models can generate highly convincing text that can be difficult to distinguish from human-written content. Use caution when judging content generated by large language models based solely on its appearance or linguistic fluency. Consider additional factors such as the model's source code, its training data, and the context in which it was generate

Interacting with the LLM this way already feels more engaging, as we watch it generate one token at a time. It also feels faster to the user, as we get the first output as soon as it is ready.

## Gradio Demo

While it's interesting for us learners to interact with the model through Python code, it's by no means the easiest way to share our model with others. To finish this section, we'll demonstrate how the Gemma model can be wrapped into a simple chat interface and shared with anyone in the community.

To achieve this, we'll use the [Gradio library](https://www.gradio.app). Gradio is an open-source Python library that simplifies the creation of web interfaces for machine learning models. It allows developers and researchers to quickly build and share customisable UI components for model testing and feedback, enhancing the accessibility and usability of AI models.

First, we'll define an end-to-end function that takes the user's message and the chat history, and generates a response. Broadly speaking, this simply requires chaining together the three stages of prediction:
1. Tokenize (pre-process) the message and chat history
2. Define the generation arguments
3. Generate the predicted ids and stream the output

We'll make a small modification to our streamer: instead of using the base [`TextStreamer`](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TextStreamer), which prints to standard output, we'll use the [`TextIteratorStreamer`](https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TextIteratorStreamer), which exposes the generated text as an iterator that we can consume in a non-blocking way while generation runs in a separate thread.
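
To illustrate what "non-blocking" means here, the sketch below reproduces the producer/consumer pattern with a plain queue and a background thread. `ToyIteratorStreamer` and `fake_generate` are hypothetical stand-ins written for this example; they mimic the general shape of the pattern and should not be read as the library's actual implementation.

```python
from queue import Queue
from threading import Thread

_STOP = object()  # sentinel marking the end of the stream


class ToyIteratorStreamer:
    """Simplified, hypothetical queue-backed streamer for illustration."""

    def __init__(self):
        self._queue = Queue()

    def put(self, text):
        """Called by the producer each time a new piece of text is ready."""
        self._queue.put(text)

    def end(self):
        """Called by the producer once generation has finished."""
        self._queue.put(_STOP)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # waits until the producer supplies something
        if item is _STOP:
            raise StopIteration
        return item


def fake_generate(streamer, pieces):
    """Stand-in for model.generate: pushes text pieces into the streamer."""
    for piece in pieces:
        streamer.put(piece)
    streamer.end()


streamer = ToyIteratorStreamer()
thread = Thread(target=fake_generate, args=(streamer, ["Hello", ", ", "world", "!"]))
thread.start()  # generation runs in the background...

outputs = ""
for text in streamer:  # ...while the main thread consumes text as it arrives
    outputs += text
    print(outputs)
thread.join()
```

Because the producer runs in its own thread, the consuming loop is free to forward each partial result (for instance to a web UI) as soon as it is available.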

In [10]:
import gradio as gr  # imported here because generate() calls gr.Warning
from transformers import TextIteratorStreamer
from threading import Thread

MAX_INPUT_TOKEN_LENGTH = 4096

def generate(message, chat_history):
    # Step 1: pre-process the inputs
    conversation = []
    for user, assistant in chat_history:
        conversation.extend([{"role": "user", "content": user}, {"role": "assistant", "content": assistant}])

    conversation.append({"role": "user", "content": message})

    input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt")

    # in-case our inputs exceed the maximum length, we might need to cut them
    if input_ids.shape[1] > MAX_INPUT_TOKEN_LENGTH:
        input_ids = input_ids[:, -MAX_INPUT_TOKEN_LENGTH:]
        gr.Warning(f"Trimmed input from conversation as it was longer than {MAX_INPUT_TOKEN_LENGTH} tokens.")

    input_ids = input_ids.to(model.device)
    streamer = TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)

    # Step 2: define generation arguments
    generate_kwargs = dict(
        input_ids=input_ids,
        streamer=streamer,
        max_new_tokens=1024,
        do_sample=True,
    )

    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()

    # Step 3: generate and stream outputs
    outputs = ""
    for text in streamer:
        outputs += text
        yield outputs
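
The pre-processing in Step 1 can be exercised without a model. The sketch below isolates the two pieces of logic: turning the chat history into a list of role/content messages, and keeping only the most recent tokens when the input is too long. The helper names are hypothetical. As an assumption, the history is accepted either as `(user, assistant)` pairs, as in the function above, or as role/content dictionaries, since the format passed by Gradio can differ between versions.

```python
def build_conversation(message, chat_history):
    """Turn chat history plus the new message into a list of role/content dicts."""
    conversation = []
    for turn in chat_history:
        if isinstance(turn, dict):
            # Assumption: already in {"role": ..., "content": ...} form
            conversation.append({"role": turn["role"], "content": turn["content"]})
        else:
            # (user, assistant) pair, as unpacked in the generate function above
            user, assistant = turn
            conversation.append({"role": "user", "content": user})
            conversation.append({"role": "assistant", "content": assistant})
    conversation.append({"role": "user", "content": message})
    return conversation


def keep_most_recent(token_ids, max_length):
    """Keep only the last max_length ids, dropping the oldest ones."""
    return token_ids[-max_length:]


history = [("Hi", "Hello! How can I help?")]
print(build_conversation("Tell me about Gemma.", history))
print(keep_most_recent(list(range(10)), 4))
```

Note that trimming from the left, as the function above does, discards the start of the conversation, which may include the beginning-of-sequence token and can cut a turn in half. Dropping whole turns from the history instead would be a gentler alternative.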

With the generation function defined, we can now create a Gradio demo in just 3 lines of additional code:

In [11]:
import gradio as gr

chat_interface = gr.ChatInterface(generate)
chat_interface.queue().launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://4e000201c198a77ae9.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


