##### Copyright 2024 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Gemma - Minimal RAG

This cookbook demonstrates how to build a minimal Retrieval-Augmented Generation (RAG) system without an orchestration tool such as LangChain or LlamaIndex, and without a vector database. The only dependencies needed are Google's [UniSim](https://github.com/google/unisim) project, which serves as the embedding model, and [HtmlChunker](https://github.com/google/labs-prototypes/tree/main/seeds/chunker-python), which splits the source document into passages.

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/[Gemma_1]Minimal_RAG.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

## Setup

### Select the Colab runtime
To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:

1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.


### Gemma setup on Hugging Face
This cookbook uses the Gemma 7B instruction-tuned model through Hugging Face, so you will need to:

* Get access to Gemma on [huggingface.co](https://huggingface.co) by accepting the Gemma license on the Hugging Face page of the specific model, i.e., [Gemma 7B IT](https://huggingface.co/google/gemma-7b-it).
* Generate a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens) and configure it as a Colab secret named `HF_TOKEN`.

## Retrieval-Augmented Generation (RAG)

Large Language Models (LLMs) can learn new abilities without directly being trained on them. However, LLMs have been known to "hallucinate" when asked questions about topics that their training data did not cover. This is partly because LLMs are unaware of events after training. It is also very difficult to trace the sources from which LLMs draw their responses. For reliable, scalable applications, it is important that an LLM provides responses that are grounded in facts and is able to cite its information sources.

A common approach used to overcome these constraints is called Retrieval Augmented Generation (RAG), which augments the prompt sent to an LLM with relevant data retrieved from an external knowledge base through an Information Retrieval (IR) mechanism. The knowledge base can be your own corpora of documents, databases, or APIs.

### Chunking the data

To improve the relevance of the content returned during retrieval, break down large documents into smaller pieces, or chunks, while ingesting them. Smaller chunks let the retrieval step match a question against a focused passage rather than an entire document.
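As a rough illustration of the idea, the sketch below greedily merges consecutive paragraphs into passages under a word budget. It is not HtmlChunker's algorithm: the function name `chunk_paragraphs`, the `max_words` parameter, and the sample paragraphs are all made up for this example.

```python
def chunk_paragraphs(paragraphs, max_words=200):
    """Greedily merge consecutive paragraphs into passages.

    Each passage is kept within max_words where possible; a single
    paragraph longer than max_words is kept whole rather than split.
    """
    passages, current, current_len = [], [], 0
    for paragraph in paragraphs:
        words = paragraph.split()
        if not words:
            continue  # skip empty paragraphs
        # Flush the current passage if this paragraph would exceed the budget.
        if current and current_len + len(words) > max_words:
            passages.append(" ".join(current))
            current, current_len = [], 0
        current.extend(words)
        current_len += len(words)
    if current:
        passages.append(" ".join(current))
    return passages


# Hypothetical sample text, for illustration only.
sample_paragraphs = [
    "Gemma is a family of lightweight open models.",
    "PaliGemma is an open vision-language model.",
    "Gemma 2 brings a new architecture.",
]
sample_passages = chunk_paragraphs(sample_paragraphs, max_words=15)
for sample_passage in sample_passages:
    print(sample_passage)
```

With a budget of 15 words, the first two paragraphs (8 and 6 words) are merged and the third starts a new passage.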

In this cookbook, you will use the [Google I/O 2024 Gemma family expansion launch blog](https://developers.googleblog.com/en/gemma-family-and-toolkit-expansion-io-2024/) as the sample document and Google's [Open Source HtmlChunker](https://github.com/google/labs-prototypes/tree/main/seeds/chunker-python) to chunk it up into passages.

In [None]:
!pip install google-labs-html-chunker

from google_labs_html_chunker.html_chunker import HtmlChunker

from urllib.request import urlopen

with urlopen(
    "https://developers.googleblog.com/en/gemma-family-and-toolkit-expansion-io-2024/"
) as f:
    html = f.read().decode("utf-8")

# Chunk the file using HtmlChunker
chunker = HtmlChunker(
    max_words_per_aggregate_passage=200,
    greedily_aggregate_sibling_nodes=True,
    html_tags_to_exclude={"noscript", "script", "style"},
)
passages = chunker.chunk(html)



Take a look at what the chunked text looks like.

In [None]:
for passage in passages:
    print(passage)

Introducing PaliGemma, Gemma 2, and an Upgraded Responsible AI Toolkit
            
            
            
            - Google Developers Blog
Products Develop Android Chrome ChromeOS Cloud Firebase Flutter Google Assistant Google Maps Platform Google Workspace TensorFlow YouTube Grow Firebase Google Ads Google Analytics Google Play Search Web Push and Notification APIs Earn AdMob Google Ads API Google Pay Google Play Billing Interactive Media Ads Solutions Events Learn Community Groups Google Developer Groups Google Developer Student Clubs Woman Techmakers Google Developer Experts Tech Equity Collective Programs Accelerator Solution Challenge DevFest Stories All Stories Developer Profile Blog Search English English Español (Latam) Bahasa Indonesia 日本語 한국어 Português (Brasil) 简体中文
Products More Solutions Events Learn Community More Developer Profile Blog Develop Android Chrome ChromeOS Cloud Firebase Flutter Google Assistant Google Maps Platform Google Workspace TensorFlow YouTube G

### Retrieve the relevant chunks

Given a user question such as 'where can I find PaliGemma?', you will use UniSim to retrieve the relevant chunks.

First, compute the similarities between the user question and all the text chunks (passages).

In [None]:
!pip install unisim
from unisim import TextSim

user_question = "where can I find PaliGemma?"

text_sim = TextSim()

similarities = []
for passage in passages:
    similarities.append(text_sim.similarity(user_question, passage))

INFO: Loaded backend
INFO: Using TF with GPU




INFO: UniSim is storing a copy of the indexed data
INFO: If you are using large data corpus, consider disabling this behavior using store_data=False
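Conceptually, a similarity score like the ones computed above comes from embedding both texts as vectors and comparing the vectors. The sketch below shows cosine similarity, a common choice for that comparison. It assumes, rather than confirms, that this is how UniSim scores a pair, and the three-dimensional vectors are made up for illustration; real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, for illustration only.
question_vec = [1.0, 0.0, 1.0]
related_vec = [2.0, 0.0, 2.0]    # same direction as the question
unrelated_vec = [0.0, 1.0, 0.0]  # orthogonal to the question

print(cosine_similarity(question_vec, related_vec))
print(cosine_similarity(question_vec, unrelated_vec))
```

Vectors that point in the same direction score close to 1, and orthogonal vectors score 0, regardless of their lengths.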


Put the passages and similarities into a dataframe.

In [None]:
import pandas as pd

results_df = pd.DataFrame({"passage": passages, "similarity": similarities})
results_df

| | passage | similarity |
|---|---|---|
| 0 | Introducing PaliGemma, Gemma 2, and an Upgrade... | 0.517319 |
| 1 | Products Develop Android Chrome ChromeOS Cloud... | 0.299514 |
| 2 | Products More Solutions Events Learn Community... | 0.296253 |
| 3 | Gemini Introducing PaliGemma, Gemma 2, and an ... | 0.508258 |
| 4 | At Google, we believe in the power of collabor... | 0.369846 |
| 5 | Link to Youtube Video (visible only when JS is... | 0.33353 |
| 6 | Gemma is a family of lightweight, state-of-the... | 0.386614 |
| 7 | Introducing PaliGemma: Open Vision-Language Mo... | 0.57323 |
| 8 | Screenshot from the HuggingFace Space running ... | 0.530472 |
| 9 | Announcing Gemma 2: Next-Gen Performance and E... | 0.460508 |


Identify the top 3 most relevant passages.

In [None]:
top_3_similarities = results_df.nlargest(3, "similarity")
top_3_targets = top_3_similarities["passage"]
top_3_targets

7    Introducing PaliGemma: Open Vision-Language Mo...
8    Screenshot from the HuggingFace Space running ...
0    Introducing PaliGemma, Gemma 2, and an Upgrade...
Name: passage, dtype: object
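The same top-k selection can be done without pandas. The sketch below uses `heapq.nlargest` from the standard library on the similarity scores shown above; the helper name `top_k` and the placeholder passage labels are illustrative only.

```python
import heapq


def top_k(passages, similarities, k=3):
    """Return the k (similarity, passage) pairs with the highest similarity."""
    scored = zip(similarities, passages)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Similarity scores copied from the dataframe above; labels are placeholders.
sample_similarities = [
    0.517319, 0.299514, 0.296253, 0.508258, 0.369846,
    0.33353, 0.386614, 0.57323, 0.530472, 0.460508,
]
sample_labels = [f"passage {i}" for i in range(len(sample_similarities))]

best = top_k(sample_labels, sample_similarities, k=3)
for score, label in best:
    print(f"{label}: {score}")
```

This selects passages 7, 8, and 0, matching the `nlargest` result above.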

Next, assemble a prompt using both the user question and retrieved context.

In [None]:
prompt_template = """You are an expert in answering user questions. You always understand user questions well, and then provide high-quality answers based on the information provided in the context.

If the provided context does not contain relevant information, just respond "I could not find the answer based on the context you provided."

User question: {}

Context:
{}
"""

# Number the retrieved passages starting from 1 and join them into one context block.
context = "\n".join(
    f"{i}. {passage}" for i, passage in enumerate(top_3_targets.tolist(), start=1)
)
prompt = prompt_template.format(user_question, context)

Here is the final prompt that will be sent to Gemma.

In [None]:
print(prompt)

You are an expert in answering user questions. You always understand user questions well, and then provide high-quality answers based on the information provided in the context.

If the provided context does not contain relevant information, just respond "I could not find the answer based on the context you provided."

User question: where can I find PaliGemma?

Context:
1. Introducing PaliGemma: Open Vision-Language Model PaliGemma is a powerful open VLM inspired by PaLI-3 . Built on open components including the SigLIP vision model and the Gemma language model, PaliGemma is designed for class-leading fine-tune performance on a wide range of vision-language tasks. This includes image and short video captioning, visual question answering, understanding text in images, object detection, and object segmentation. We're providing both pretrained and fine-tuned checkpoints at multiple resolutions, as well as checkpoints specifically tuned to a mixture of tasks for immediate exploration. To 
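If you plan to ask several questions, it can help to wrap the prompt assembly in a small function. The sketch below is one possible way to do that, using named placeholders instead of positional ones. The names `build_prompt`, `RAG_TEMPLATE`, and the sample passages are hypothetical, and the template is shortened for the example.

```python
# Shortened, hypothetical template with named placeholders.
RAG_TEMPLATE = (
    "Answer the user question based on the information in the context.\n\n"
    "If the provided context does not contain relevant information, just respond "
    '"I could not find the answer based on the context you provided."\n\n'
    "User question: {question}\n\n"
    "Context:\n{context}\n"
)


def build_prompt(question, passages):
    """Number the passages from 1 and insert them into the template."""
    context = "\n".join(f"{i}. {p}" for i, p in enumerate(passages, start=1))
    return RAG_TEMPLATE.format(question=question, context=context)


# Made-up passages, for illustration only.
sample_prompt = build_prompt(
    "where can I find PaliGemma?",
    ["First retrieved passage.", "Second retrieved passage."],
)
print(sample_prompt)
```

Because the passages are passed as format arguments rather than being part of the template string, curly braces inside a passage do not interfere with formatting.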

### Generate the answer

Now load the Gemma model in quantized 4-bit mode using Hugging Face.

In [None]:
!pip install bitsandbytes accelerate
from transformers import AutoTokenizer
import transformers
import torch
import bitsandbytes, accelerate

model = "google/gemma-7b-it"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    model_kwargs={
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True},
    },
)



`low_cpu_mem_usage` was None, now set to True since model is quantized.
`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]
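A back-of-the-envelope calculation suggests why 4-bit loading matters on a T4. The sketch below estimates the memory taken by the weights alone. The parameter count of about 8.5 billion is an assumption for this example, and the estimate ignores activations, the KV cache, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 8.5e9  # assumption: approximate parameter count, for illustration

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"float16 weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:   ~{int4_gib:.1f} GiB")
```

Under these assumptions, the float16 weights alone would nearly fill a 16 GB GPU, while the 4-bit weights need about a quarter of that space.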

Finally, generate the answer.

In [None]:
messages = [
    {"role": "user", "content": prompt},
]
prompt = pipeline.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.1)
print(outputs[0]["generated_text"][len(prompt) :])



Sure, here is the answer to the user question:

You can find PaliGemma on GitHub, Hugging Face models, Kaggle, Vertex AI Model Garden, and ai.nvidia.com (accelerated with TensoRT-LLM) with easy integration through JAX and Hugging Face Transformers.


Gemma is able to provide the correct answer based on the retrieved context.
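The `apply_chat_template` call above wraps the prompt in Gemma's turn markers before generation. The sketch below approximates the resulting string for a single user message. It is a simplified stand-in and not the tokenizer's actual template, so treat the exact format as an assumption and rely on `apply_chat_template` in real code. The function name `format_gemma_turn` is made up.

```python
def format_gemma_turn(user_content):
    """Approximate the Gemma chat format for one user turn plus a generation prompt."""
    return (
        "<bos><start_of_turn>user\n"
        f"{user_content.strip()}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


formatted = format_gemma_turn("where can I find PaliGemma?")
print(formatted)
```

The string ends with an open model turn, so the model continues from there. The text-generation pipeline returns the prompt followed by the completion, which is why the cell above slices off the first `len(prompt)` characters before printing.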

In this cookbook, the sample document, the [Google I/O 2024 Gemma family expansion launch blog](https://developers.googleblog.com/en/gemma-family-and-toolkit-expansion-io-2024/), is fairly short, so after chunking there aren't many passages to search through. To keep the cookbook minimal, we used exhaustive search to find the relevant passages.

In real-world use cases, there may be many chunks to search for a single query, in which case you will need Approximate Nearest Neighbor (ANN) search for efficiency. Vector databases usually support this directly. UniSim also supports ANN; consult the UniSim documentation and its [Colab](https://github.com/google/unisim/blob/main/notebooks/unisim_text_demo.ipynb) for details on indexing and searching.
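To give a feel for how ANN trades exactness for speed, the sketch below implements a toy random-hyperplane locality-sensitive hash (LSH) using only the standard library. This is one of several ANN techniques and is not claimed to be what UniSim or any particular vector database uses. The class name `HyperplaneLSH` and the sample vectors are hypothetical.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class HyperplaneLSH:
    """Toy ANN index: vectors on the same side of every random hyperplane share a bucket."""

    def __init__(self, dim, num_bits=4, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]
        self.buckets = defaultdict(list)

    def _signature(self, vec):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in self.planes)

    def add(self, key, vec):
        self.buckets[self._signature(vec)].append((key, vec))

    def query(self, vec, k=3):
        # Only the query's bucket is scanned, not the whole corpus.
        candidates = self.buckets.get(self._signature(vec), [])
        ranked = sorted(candidates, key=lambda item: cosine(vec, item[1]), reverse=True)
        return [key for key, _ in ranked[:k]]


# Made-up 4-dimensional vectors standing in for passage embeddings.
vectors = {
    "p0": [1.0, 0.0, 0.0, 0.0],
    "p1": [0.0, 1.0, 0.0, 0.0],
    "p2": [0.0, 0.0, 1.0, 0.0],
    "p3": [1.0, 1.0, 0.0, 0.0],
}
index = HyperplaneLSH(dim=4)
for key, vec in vectors.items():
    index.add(key, vec)

print(index.query([1.0, 1.0, 0.0, 0.0], k=1))
```

A query only compares against vectors in its own bucket, so results are approximate: a close neighbor that lands in a different bucket is missed. Practical systems reduce such misses with multiple hash tables or other index structures.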

The UniSim team has also created a separate [RAG demo](https://github.com/google/unisim/blob/main/notebooks/unisim-gemma-text_rag_demo.ipynb). Feel free to check it out.