<a href="https://colab.research.google.com/github/rahiakela/genai-research-and-practice/blob/main/gemma-notebooks/07_Gemma_RAG_LlamaIndex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## RAG with Gemma and LlamaIndex

This notebook demonstrates how to integrate the Gemma model with the [LlamaIndex](https://www.llamaindex.ai/) library to build a basic RAG application.

## Setup

### Select the Colab runtime
To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:

1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **T4 GPU**.

### Gemma setup

To complete this tutorial, you'll first need to complete the setup instructions at [Gemma setup](https://ai.google.dev/gemma/docs/setup). The Gemma setup instructions show you how to do the following:

* Get access to Gemma on kaggle.com.
* Select a Colab runtime with sufficient resources to run
  the Gemma 2B model.
* Generate and configure a Kaggle username and an API key as Colab secrets.

After you've completed the Gemma setup, move on to the next section, where you'll set environment variables for your Colab environment.


### Configure your credentials

Add your Kaggle credentials to the Colab Secrets manager to store them securely.

1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. <img src="https://storage.googleapis.com/generativeai-downloads/images/secrets.jpg" alt="The Secrets tab is found on the left panel." width=50%>
2. Create two new secrets: `KAGGLE_USERNAME` and `KAGGLE_KEY`.
3. Copy and paste your username into `KAGGLE_USERNAME`.
4. Copy and paste your key into `KAGGLE_KEY`.
5. Toggle the buttons on the left to allow notebook access to the secrets.


In [None]:
import os
from google.colab import userdata

# Note: `userdata.get` is a Colab API. If you're not using Colab, set the env
# vars as appropriate for your system.
os.environ["KAGGLE_USERNAME"] = userdata.get("KAGGLE_USERNAME")
os.environ["KAGGLE_KEY"] = userdata.get("KAGGLE_KEY")
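
Outside Colab, the same two variables have to be set by other means (for example, in your shell profile). The following is a minimal sketch of a check you could run before loading the model; the helper name `missing_credentials` and the `REQUIRED_VARS` tuple are illustrative and not part of any library.

```python
import os

# The two variables this notebook expects to find in the environment.
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_credentials(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"KAGGLE_USERNAME": "my-user", "KAGGLE_KEY": ""}
print(missing_credentials(example_env))
```

Passing a mapping explicitly keeps the check easy to test; calling `missing_credentials()` with no argument inspects the real environment.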

### Install dependencies
Run the cell below to install all the required dependencies.

In [None]:
!pip install -q -U tensorflow keras keras-nlp
!pip install -q llama-index llama-index-embeddings-huggingface


### Gemma

**About Gemma**

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights, pre-trained variants, and instruction-tuned variants. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone.

In [None]:
import keras
import keras_nlp

os.environ["KAGGLE_USERNAME"] = userdata.get("KAGGLE_USERNAME")
os.environ["KAGGLE_KEY"] = userdata.get("KAGGLE_KEY")

In [None]:
# Let's load Gemma using Keras
gemma_model_id = "gemma_1.1_instruct_2b_en"
gemma = keras_nlp.models.GemmaCausalLM.from_preset(gemma_model_id)


## LlamaIndex

LlamaIndex is a toolkit for developers to build applications that use large language models (LLMs) with specific data. This data can be private or related to a particular field. With LlamaIndex, developers can create various LLM applications, including question-answering chatbots, document analysis tools, and even autonomous agents. The toolkit offers functions to process data and design workflows that combine data retrieval with instructions for the LLM.


Large language models (LLMs) are powerful but lack your specific data. Retrieval-Augmented Generation (RAG) bridges this gap by incorporating your data for improved performance. RAG works by indexing your data for efficient retrieval based on user queries. The most relevant information, along with the query itself, is then fed to the LLM to generate a response. Understanding RAG is essential for building LLM applications like chatbots and agents.
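
To make the index-then-retrieve idea concrete, here is a toy sketch in plain Python. It is not how LlamaIndex works internally: the `embed` function is a bag-of-words stand-in for a real embedding model, and the function names and sample chunks are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=2):
    """Index the chunks, then return the top_k most similar to the query."""
    index = [(chunk, embed(chunk)) for chunk in chunks]  # indexing step
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


chunks = [
    "Gemma is a family of open models from Google.",
    "LlamaIndex helps connect language models to your own data.",
    "A T4 GPU is enough to run the Gemma 2B model in Colab.",
]
top = retrieve("Which GPU can run Gemma?", chunks, top_k=1)
print(top)
```

A real system replaces `embed` with a learned embedding model so that chunks can match a query by meaning instead of by shared words.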

### Setup

In [None]:
from typing import Optional, List, Mapping, Any
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader, PromptTemplate
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.llms import (
    CustomLLM,
    CompletionResponse,
    CompletionResponseGen,
    LLMMetadata,
)
from llama_index.core.llms.callbacks import llm_completion_callback

To make Gemma compatible with the LlamaIndex library, we need to create a simple interface class. The code below implements the basic generation methods that let the library interact with our model.

In [None]:
class GemmaLLMInterface(CustomLLM):
    model: keras_nlp.models.GemmaCausalLM = None
    context_window: int = 8192
    num_output: int = 2048
    model_name: str = "gemma_1.1_instruct_2b_en"

    def _format_prompt(self, message: str) -> str:
        return (
            f"<start_of_turn>user\n{message}<end_of_turn>\n" f"<start_of_turn>model\n"
        )

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        prompt = self._format_prompt(prompt)
        raw_response = self.model.generate(prompt, max_length=self.num_output)
        response = raw_response[len(prompt) :]
        return CompletionResponse(text=response)

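    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        # The model returns the whole completion at once, so streaming is
        # simulated by yielding the finished text one character at a time.
        full_text = self.complete(prompt).text
        response = ""
        for token in full_text:
            response += token
            yield CompletionResponse(text=response, delta=token)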
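
The prompt handling can be tried without loading the model. The sketch below is a standalone illustration: `fake_generate` is a hypothetical stand-in that echoes the prompt before the answer, which is the behavior the slicing in `complete` assumes, and the canned answer is invented.

```python
def format_prompt(message):
    """Wrap a message in Gemma's chat-turn markers."""
    return f"<start_of_turn>user\n{message}<end_of_turn>\n<start_of_turn>model\n"


def fake_generate(prompt):
    """Hypothetical stand-in for the model: echoes the prompt, then an answer."""
    return prompt + "Paris."


def complete(message):
    prompt = format_prompt(message)
    raw = fake_generate(prompt)
    return raw[len(prompt):]  # drop the echoed prompt, keep only the answer


def stream_complete(message):
    """Yield (text_so_far, newest_character) pairs to simulate streaming."""
    text = ""
    for ch in complete(message):
        text += ch
        yield text, ch


print(complete("What is the capital of France?"))
```

Slicing by `len(prompt)` only works if the generated string really starts with the exact prompt, so it is worth verifying that on a short input before relying on it.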

In [None]:
# These settings define which models LlamaIndex will use
Settings.embed_model = HuggingFaceEmbedding()
Settings.llm = GemmaLLMInterface(model=gemma)


## Retrieval-Augmented Generation (RAG)

RAG improves generative AI output through a multi-step process. First, a retriever searches external data (web pages, databases, documents) for content relevant to the query. The retrieved content is then cleaned and formatted for the LLM. Finally, it is passed to the LLM together with the original query. This extra context grounds the model in the source material, which tends to produce more precise and informative responses.

This notebook builds a RAG application around Paul Graham's essay "What I Worked On". The essay serves as a placeholder that illustrates the RAG concepts without the complexities of real-world data: working from a single, well-defined source makes it easy to see how data retrieval enhances LLM performance in a RAG system.

### Chunking the data

In [None]:
# Let's download the data first
!wget -q "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt" -O "paul_graham_essay.txt"

In [None]:
# Reading documents from disk
documents = SimpleDirectoryReader(input_files=["paul_graham_essay.txt"]).load_data()

# Splitting the document into chunks with
# predefined size and overlap
parser = SentenceSplitter.from_defaults(
    chunk_size=256, chunk_overlap=64, paragraph_separator="\n\n"
)
nodes = parser.get_nodes_from_documents(documents)

In [None]:
# Example node:
nodes[0].text

'What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.'
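
The roles of `chunk_size` and `chunk_overlap` can be sketched with a simplified splitter. This is only an approximation: the function below (a made-up name, `chunk_words`) counts whitespace-separated words and ignores sentence boundaries, whereas `SentenceSplitter` measures size in tokens and tries to keep sentences intact.

```python
def chunk_words(text, chunk_size=8, chunk_overlap=2):
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in chunk_words(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

The overlap repeats the tail of one chunk at the head of the next, so a fact that straddles a boundary still appears whole in at least one chunk.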

### Building Vector Store

We can now build a search engine that finds the passages most relevant to a user's question.

In [None]:
# Build a vector index over the nodes and wrap it in a query engine
query_engine = VectorStoreIndex(nodes).as_query_engine(
    similarity_top_k=3, response_mode="tree_summarize"
)

In [None]:
# Let's test it out
relevant_chunks = query_engine.retrieve("1992")
print(f"Found: {len(relevant_chunks)} relevant chunks")
for idx, chunk in enumerate(relevant_chunks):
    print(f"{idx + 1}) {chunk.text[:64]}...")

Found: 3 relevant chunks
1) In the fall of 1992 I moved back to Providence to continue at RI...
2) There were three main parts to the software: the editor, which p...
3) HN was no doubt good for YC, but it was also by far the biggest ...


These chunks will be injected into the LLM's prompt so that it can answer the user's query.

In [None]:
# (Optional) Gemma works better with straightforward prompts without
# additional tokens needed to separate sections.
# We can simply update the template to get better results:
new_summary_tmpl_str = """Text:
{context_str}
According to the text answer the query: {query_str}"""

query_engine.update_prompts(
    {"response_synthesizer:summary_template": PromptTemplate(new_summary_tmpl_str)}
)

So when a user asks a question, the following prompt is sent to the LLM:

```
Text:
<chunk #1>
<chunk #2>
<chunk #3>
According to the text answer the query: <question>
```

With this additional context, the large language model (LLM) can generate more accurate and informative responses to user queries.
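
The way the template gets filled can be sketched with plain string formatting. This is an illustration only: joining the chunks with a single newline is an assumption, and the query engine may pack or separate the retrieved text differently. `build_prompt` is a made-up helper name.

```python
SUMMARY_TEMPLATE = """Text:
{context_str}
According to the text answer the query: {query_str}"""


def build_prompt(chunks, question):
    """Fill the template with the retrieved chunks and the user's question."""
    context = "\n".join(chunks)  # assumption: one chunk per line
    return SUMMARY_TEMPLATE.format(context_str=context, query_str=question)


prompt = build_prompt(
    ["<chunk #1>", "<chunk #2>", "<chunk #3>"],
    "<question>",
)
print(prompt)
```

Printing the assembled prompt like this is a quick way to check what the model will actually see before you run a query.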


### Test it yourself!

In [None]:
response = query_engine.query("What does ASP stand for?")
print(response)

ASP stands for application service provider.


In [None]:
response = query_engine.query("What was Jessica responsible for?")
print(response)

According to the text, Jessica was responsible for marketing at a Boston investment bank.


In [None]:
response = query_engine.query("What art schools did Paul Graham apply to?")
print(response)

Graham Paul applied to RISD and the Accademia di Belli Arti in Florence.
