##### Copyright 2024 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

| | |
|-|-|
|Author(s) | [Laurie White](https://github.com/annie29) |

# Using samplers with Gemma

This tutorial shows you how to use sampling, sometimes called decoding, to change the behavior of the Gemma model. Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. This tutorial uses KerasNLP, a collection of natural language processing (NLP) models implemented in [Keras](https://keras.io/) and runnable on JAX, PyTorch, and TensorFlow.

In this tutorial, you'll use Gemma to generate text responses using several samplers. You'll see how changes to the sampler can change the usefulness of the responses Gemma gives.

## Setup

### Gemma setup

To complete this tutorial, you'll first need to complete the setup instructions at [Gemma setup](https://ai.google.dev/gemma/docs/setup). The Gemma setup instructions show you how to do the following:

* Get access to Gemma at kaggle.com.
* Select a Colab runtime with sufficient resources to run
  the Gemma 2B model.
* Generate and configure a Kaggle username and API key.

After you've completed the Gemma setup, move on to the next section, where you'll set environment variables for your Colab environment.

### Accept the Gemma Terms of Use

Although you accepted the Gemma Terms of Use in a previous step, each time you use a non-local version of Gemma you'll need to link to your acceptance. You can do this either by setting and accessing secrets in Colab or by entering your username and key at a prompt.

Use this version if your Kaggle credentials are stored as Colab secrets and you want to expose them as environment variables. If you're not using Colab, rewrite the `userdata.get` calls for your system, or connect to Kaggle directly using the version in the next step.

In [None]:
import os
from google.colab import userdata

# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env
# vars as appropriate for your system.
os.environ["KAGGLE_USERNAME"] = userdata.get("KAGGLE_USERNAME")
os.environ["KAGGLE_KEY"] = userdata.get("KAGGLE_KEY")
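
If you're running outside Colab, one possible replacement for the `userdata.get` calls is sketched below. It only relies on the two environment variable names used above; the helper name `ensure_env_var` is hypothetical and not part of any Kaggle or Colab API.

```python
import os
from getpass import getpass


def ensure_env_var(name, prompt=getpass):
    """Return os.environ[name], asking for a value only if it is unset or empty."""
    if not os.environ.get(name):
        os.environ[name] = prompt(f"Enter {name}: ")
    return os.environ[name]


# Uncomment when running interactively outside Colab:
# ensure_env_var("KAGGLE_USERNAME", prompt=input)  # the username can be echoed
# ensure_env_var("KAGGLE_KEY")                     # getpass hides the key as you type
```

Values that are already set in your shell are left untouched, so the prompts only appear when a variable is missing.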

This version connects you to Kaggle, where you'll be prompted for your username and key.

In [None]:
!pip install kagglehub
import os
import kagglehub

kagglehub.login()

### Install dependencies

Install Keras and KerasNLP.  (You may see a warning about pip's dependency resolver.  You can ignore it; it should not cause trouble later on.)

In [None]:
# Install Keras 3 last. See https://keras.io/getting_started/ for more details.
!pip install -q -U keras-nlp
!pip install -q -U "keras>=3"

### Select a backend

Keras is a high-level, multi-framework deep learning API designed for simplicity and ease of use. [Keras 3](https://keras.io/keras_3) lets you choose the backend: TensorFlow, JAX, or PyTorch. All three will work for this tutorial.

In [None]:
import os

os.environ["KERAS_BACKEND"] = "jax"  # Or "tensorflow" or "torch".
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.9"

### Import packages

Import Keras and KerasNLP.

In [None]:
import keras
import keras_nlp

## Create a model

KerasNLP provides implementations of many popular [model architectures](https://keras.io/api/keras_nlp/models/). In this tutorial, you'll create a model using `GemmaCausalLM`, an end-to-end Gemma model for causal language modeling. A causal language model predicts the next token based on previous tokens.

Create the model using the `from_preset` method:

In [None]:
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

`from_preset` instantiates the model from a preset architecture and weights. In the code above, the string `"gemma_2b_en"` specifies the preset architecture: a Gemma model with 2 billion parameters.


## Samplers

An LLM has a number of choices when creating its responses.  You can affect the way Gemma makes these choices by using a _sampler_.  

To change the sampler in Gemma, recompile the model with the sampler you want to use.
The easiest way to do this is to pass the name of the sampler as the `sampler` parameter when compiling the model, as shown below.

```
gemma_lm.compile(sampler="greedy")
```

If you want to send parameters to a sampler, you may find it easier to first create the sampler with the parameters and then send the new sampler as the `sampler` parameter.

```
sampler = keras_nlp.samplers.TopKSampler(k=5)
gemma_lm.compile(sampler=sampler)
```

Let's take a look at some samplers that can work with Gemma in Keras.  You can read more about them in the [Keras documentation](https://keras.io/api/keras_nlp/samplers/). If none of the built-in samplers fit your needs, you can create [custom samplers](https://keras.io/api/keras_nlp/samplers/samplers/).


### Greedy sampler

The default sampler is the [`Greedy` sampler](https://keras.io/api/keras_nlp/samplers/greedy_sampler/). It picks the token with the highest probability as the next token, so its output does not vary when every token has a distinct probability.

Consider the case below.



In [None]:
gemma_lm.compile(sampler="greedy")
print(gemma_lm.generate("Are cats or dogs better?", max_length=32))

If you run it multiple times with the same prompt, you should notice the output does not change.
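
To see why the output never changes, here is a minimal pure-Python sketch of the greedy choice. It is an illustration rather than KerasNLP's implementation: the tokens and probabilities are made up, and `greedy_pick` is a hypothetical helper name.

```python
# Toy next-token distribution; the tokens and probabilities are invented.
next_token_probs = {"cats": 0.45, "dogs": 0.40, "both": 0.10, "neither": 0.05}


def greedy_pick(probs):
    """Return the token with the highest probability."""
    return max(probs, key=probs.get)


# No randomness is involved, so repeated calls always agree.
picks = [greedy_pick(next_token_probs) for _ in range(3)]
print(picks)  # ['cats', 'cats', 'cats']
```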

### TopK sampler

The [`Top K` sampler](https://keras.io/api/keras_nlp/samplers/top_k_sampler/) allows for some variability in the output.

It first restricts the options to the _k_ tokens with the highest probability.
It then selects from those _k_ tokens, with each token's chance of selection determined by its probability.

Consider the case below, which selects from the top 5 next tokens.

In [None]:
sampler = keras_nlp.samplers.TopKSampler(k=5)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate("What should I do on a trip to Europe?", max_length=64))

If you run this code more than once, you should notice different answers.

If you're debugging or doing a demo or something similar and want to ensure you get the same "random" values each time, you can add a seed parameter to the line which creates the sampler to use the same random number sequence:

```
sampler = keras_nlp.samplers.TopKSampler(k=5, seed=2)
```
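
The two steps of Top K sampling, and the effect of a seed, can be sketched in pure Python as follows. This is a toy illustration under stated assumptions, not KerasNLP's implementation: the distribution is invented, and `top_k_filter` and `sample` are hypothetical helper names.

```python
import random

# Toy next-token distribution; the tokens and probabilities are invented.
next_token_probs = {
    "Paris": 0.30, "Rome": 0.25, "London": 0.20,
    "Berlin": 0.15, "Oslo": 0.07, "Riga": 0.03,
}


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in top)
    return {token: prob / total for token, prob in top}


def sample(probs, rng):
    """Draw one token, weighted by its probability."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


filtered = top_k_filter(next_token_probs, k=3)

# Two generators created with the same seed produce the same "random" draws.
rng_a, rng_b = random.Random(2), random.Random(2)
draws_a = [sample(filtered, rng_a) for _ in range(5)]
draws_b = [sample(filtered, rng_b) for _ in range(5)]
print(sorted(filtered))    # ['London', 'Paris', 'Rome']
print(draws_a == draws_b)  # True
```

Tokens outside the top _k_ (here `Berlin`, `Oslo`, and `Riga`) can never be drawn, no matter how many times you sample.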

### TopP sampler

Top P sampling, also known as nucleus sampling, first sorts the options in descending order of probability.

It then selects tokens starting with the most probable one and keeps adding tokens to the candidate set until the sum of their probabilities is ≥ *p*.

Consider the case below, which selects from the smallest set of tokens whose total probability is greater than or equal to 0.9.

In [None]:
sampler = keras_nlp.samplers.TopPSampler(p=0.9)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate("What should I do on a trip to Europe?", max_length=128))

Since sorting the tokens by probability can be an expensive operation, Top K sampling is often applied before Top P sampling. In Keras, you can do this by passing a `k` value when creating the sampler, as shown below.

In [None]:
sampler = keras_nlp.samplers.TopPSampler(p=0.9, k=200)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate("What should I do on a trip to Europe?", max_length=128))
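
The selection of the nucleus, with and without a Top K pre-filter, can be sketched in pure Python. This is a simplified illustration, not KerasNLP's implementation: the distribution is invented, `top_p_filter` is a hypothetical helper, and the renormalization step is an assumption made so the kept probabilities sum to 1.

```python
def top_p_filter(probs, p, k=None):
    """Keep the smallest set of top tokens whose probabilities sum to >= p.

    If k is given, only the k most probable tokens are considered at all.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if k is not None:
        ranked = ranked[:k]
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Toy next-token distribution; the tokens and probabilities are invented.
next_token_probs = {
    "museums": 0.40, "hiking": 0.30, "food": 0.15, "beaches": 0.10, "caves": 0.05,
}

print(list(top_p_filter(next_token_probs, p=0.9)))       # ['museums', 'hiking', 'food', 'beaches']
print(list(top_p_filter(next_token_probs, p=0.9, k=2)))  # ['museums', 'hiking']
```

In the second call, the cumulative probability of the two kept tokens never reaches 0.9, so the `k` limit is what ends the selection.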

### Random sampler

Random sampling is similar to Top K sampling, but it will consider all possible tokens as the next token, with selection chance determined by the probability of each token.


In [None]:
gemma_lm.compile(sampler="random")
print(gemma_lm.generate("What should I do on a trip to Europe?", max_length=128))
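
As a rough illustration of sampling from the full distribution, the pure-Python sketch below draws many tokens from an invented distribution and counts how often each one appears. It is not KerasNLP's implementation; the tokens, probabilities, and seed are arbitrary choices for the example.

```python
import random
from collections import Counter

# Toy next-token distribution; the tokens and probabilities are invented.
next_token_probs = {
    "museums": 0.40, "hiking": 0.30, "food": 0.15, "beaches": 0.10, "caves": 0.05,
}

rng = random.Random(0)  # seeded only so the example is repeatable
tokens = list(next_token_probs)
weights = list(next_token_probs.values())

# Every token is a candidate; likelier tokens are simply drawn more often.
draws = rng.choices(tokens, weights=weights, k=1000)
counts = Counter(draws)
for token in tokens:
    print(token, counts[token])
```

Because no tokens are filtered out, even the least likely option can show up occasionally.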

By default, the _temperature_ of a sampler is 1.0. By adjusting the temperature to a value between 0.0 and 2.0, you can change how large the differences between the token likelihoods are.

Temperature values greater than 1.0 reduce the differences between likelihoods, making the LLM seem more creative. Temperatures less than 1.0 increase the differences, making the more likely tokens even likelier to be chosen and the LLM seem less creative.
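
The effect of temperature can be illustrated with a small pure-Python sketch. It assumes the common formulation in which the model's raw scores (logits) are divided by the temperature before a softmax turns them into probabilities; the logits are invented and `apply_temperature` is a hypothetical helper, not a KerasNLP function.

```python
import math


def apply_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]  # invented scores for three candidate tokens
for temperature in (0.5, 1.0, 2.0):
    probs = apply_temperature(logits, temperature)
    print(temperature, [round(prob, 3) for prob in probs])
```

The most likely token's share shrinks as the temperature rises, which is why higher temperatures produce more varied output. Note that this sketch divides by the temperature, so it cannot be run with a temperature of exactly 0.0.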

Try changing the temperature parameter of the random sampler below to see how it changes the output.


In [None]:
sampler = keras_nlp.samplers.RandomSampler(temperature=0.7)
gemma_lm.compile(sampler=sampler)
print(gemma_lm.generate("What should I do on a trip to Europe?", max_length=128))

## What's next

In this tutorial, you learned how to modify the output of Gemma by using different sampling techniques. Here are a few suggestions for what to learn next:

* Learn how to [fine-tune a Gemma model](https://ai.google.dev/gemma/docs/lora_tuning).
* Learn how to perform [distributed fine-tuning and inference on a Gemma model](https://ai.google.dev/gemma/docs/distributed_tuning).
* Learn about [Gemma integration with Vertex AI](https://ai.google.dev/gemma/docs/integrations/vertex).
* Learn how to [use Gemma models with Vertex AI](https://cloud.google.com/vertex-ai/docs/generative-ai/open-models/use-gemma).