##### Copyright 2024 Google LLC.

In [14]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Building a RAG using Gemma with Elasticsearch, Ollama and LangChain

This tutorial will guide you through building a **Retrieval-Augmented Generation (RAG)** application using the **Gemma 2 9B** model, **LangChain**, **Ollama**, and **Elasticsearch**. You'll go through each step in detail, ensuring that even if you're new to RAGs or Large Language Models (LLMs), you'll be able to follow along and build your own local AI application.

## Introduction

**Retrieval-Augmented Generation (RAG)** is a technique that combines large language models (LLMs) with external knowledge sources to generate more accurate and contextually relevant responses. It involves two main components:

* **Retriever**: Based on the user's query, the retriever fetches relevant documents from a dataset to provide additional context to the LLM.

* **Generator**: The LLM uses the retrieved context along with the user's query to generate accurate and coherent responses.

By combining retrieval with generation, RAG systems can produce responses that are both informed and coherent, making them ideal for tasks like question answering over custom datasets.
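As a rough illustration of the two components, the sketch below wires a toy retriever to a prompt builder in plain Python. The word-overlap scoring, the `toy_retrieve` and `build_prompt` names, and the two-document corpus are assumptions made for illustration only; the tutorial itself uses Elasticsearch for retrieval and Gemma for generation.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_retrieve(query: str, corpus: list, k: int = 1) -> list:
    """Rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context: list) -> str:
    """Combine the retrieved context and the question into one prompt string."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


corpus = [
    "Meowth adores round objects.",
    "Onix burrows at high speed in search of food.",
]
question = "Which Pokémon adores round objects?"
context = toy_retrieve(question, corpus)
prompt_text = build_prompt(question, context)  # this string would go to the LLM
print(prompt_text)
```

A real retriever replaces the word-overlap score with keyword and vector search, and a real generator sends `prompt_text` to the model.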

[**Gemma**](https://ai.google.dev/gemma) is a family of lightweight, state-of-the-art open language models from Google. Built from the same research and technology used to create the Gemini models, Gemma models are text-to-text, decoder-only large language models (LLMs) available in English, with open weights, pre-trained variants, and instruction-tuned variants.

The **Gemma 2 9B IT Q6_K** model is a 6-bit quantized, instruction-tuned version of Gemma 2 9B. Quantization stores the model weights at lower precision, which reduces memory use and computational load, usually at a small cost in output quality. This makes it possible to deploy the model in environments with limited resources, such as a laptop, desktop, or your own cloud infrastructure, and puts state-of-the-art models within reach of more developers.
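To make the effect of quantization concrete, here is a back-of-the-envelope estimate of the memory needed for the weights alone. Both constants are assumptions rather than figures from this tutorial: roughly 9.2 billion parameters, and about 6.56 bits per weight for the `Q6_K` format. The estimate ignores the KV cache and runtime overhead.

```python
# Rough weight-memory estimate. Both constants below are assumptions:
# ~9.2 billion parameters and ~6.56 bits per weight for Q6_K.
PARAMS = 9.2e9
BITS_PER_WEIGHT_Q6_K = 6.56
BITS_PER_WEIGHT_FP16 = 16


def weight_gigabytes(params: float, bits_per_weight: float) -> float:
    """Convert a parameter count and bit width into gigabytes (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


q6_gb = weight_gigabytes(PARAMS, BITS_PER_WEIGHT_Q6_K)
fp16_gb = weight_gigabytes(PARAMS, BITS_PER_WEIGHT_FP16)
print(f"Q6_K: ~{q6_gb:.1f} GB, FP16: ~{fp16_gb:.1f} GB")
```

Under these assumptions, the 16-bit weights alone would not fit in the 16 GB of a T4, while the quantized weights leave room for the context cache.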

[**LangChain**](https://python.langchain.com/) is a framework for developing applications powered by language models. It provides a suite of tools and integrations that simplify building complex AI applications, such as chatbots, question-answering systems, and more. LangChain allows you to chain together various components like prompts, LLMs, and retrievers to create sophisticated pipelines.

[**Ollama**](https://ollama.ai/) is a tool that simplifies running language models locally. It allows you to manage and serve multiple models efficiently, making it easier to deploy and test AI models on your machine. With Ollama, you can switch between different models and versions seamlessly, providing flexibility in development and experimentation. You can browse the available Gemma 2 models at the [Ollama Gemma 2 Model Catalog](https://ollama.com/library/gemma2).

[**Elasticsearch**](https://www.elastic.co/elasticsearch/) is a powerful open-source search and analytics engine. It allows you to store, search, and analyze large volumes of data quickly and in near real-time. In this tutorial, Elasticsearch serves as the data source and vector store for our RAG application, enabling efficient retrieval of relevant documents based on user queries.

By combining these tools, you can build a local RAG application that leverages the strengths of each component to create a powerful AI application capable of handling tasks like question answering over custom datasets—all running locally on a modest GPU like the T4.

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/[Gemma_2]Using_with_Elasticsearch_and_LangChain.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

## Setup

Before you begin, make sure you have a Google Colab account.

### Select the Colab Runtime

First, you'll need to set up your Google Colab environment:

1. **Open Google Colab** and create a new notebook.
2. In the upper-right corner of the Colab window, click on the **▾ (Additional connection options)** button.
3. Select **Change runtime type**.
4. Under **Hardware accelerator**, choose **GPU**.
5. Ensure that the **GPU type** is set to **T4**.

This will provide sufficient resources to run the Gemma 2 9B model.

### Gemma Setup

Before diving into the tutorial, let's set up Gemma:

1. **Create a Hugging Face Account**: If you don't have one, you can sign up for a free account [here](https://huggingface.co/join).
2. **Access the Gemma Model**: Visit the [Gemma model page](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) and accept the usage conditions.
3. **Generate a Hugging Face Token**: Go to your Hugging Face [settings page](https://huggingface.co/settings/tokens) and generate a new access token. A token with `read` permissions is sufficient for downloading models.

**Once you've completed these steps, you're ready to move on to the next section where you'll set up environment variables in your Colab environment.**

### Configure Your Credentials


Next, you'll securely store your Hugging Face token using the Colab Secrets manager:

1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel. <img src="https://storage.googleapis.com/generativeai-downloads/images/secrets.jpg" alt="The Secrets tab is found on the left panel." width=50%>
2. **Add Hugging Face Token**:
   - Create a new secret named `HF_TOKEN`.
   - Paste your Hugging Face token into the Value input box.
   - Toggle the button to allow notebook access to the secret.

Now, set the environment variables in your notebook:

In [1]:
import os
from google.colab import userdata

# Set Hugging Face token
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

This code retrieves your secret and sets it as an environment variable, which you will use later in the tutorial.
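If you also want the notebook to run outside Colab, one possible approach is a small helper that falls back to an ordinary environment variable when `google.colab` is not importable. The `get_secret` name is hypothetical and this is only a sketch; it assumes that a missing secret or a denied notebook permission surfaces as an exception from `userdata.get`.

```python
import os
from typing import Optional


def get_secret(name: str) -> Optional[str]:
    """Return a secret from Colab if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: use the environment
        return os.environ.get(name)


token = get_secret("HF_TOKEN")
if token:
    os.environ["HF_TOKEN"] = token
```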

### Installing Dependencies


Next, you need to install all the required dependencies. Run the following cell to install them:


In [2]:
!pip install -q langchain tiktoken
!pip install -q langchainhub langchain-huggingface langchain-text-splitters
!pip install -q sentence-transformers==2.2.2
!pip install -q -U huggingface-hub
!pip install -q -U elasticsearch==8.15.1 langchain-elasticsearch==0.3.0
!pip install -q langchain-ollama

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-huggingface 0.1.2 requires sentence-transformers>=2.6.0, but you have sentence-transformers 2.2.2 which is incompatible.

### Import Dependencies

In [17]:
import os
import time
from google.colab import userdata
from typing import Dict

from elasticsearch import Elasticsearch

from langchain_ollama.chat_models import ChatOllama
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_elasticsearch import ElasticsearchStore, ElasticsearchRetriever
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain import hub
from langchain_core.output_parsers import BaseTransformOutputParser
from langchain_core.runnables import RunnablePassthrough

## Gemma

Gemma models are designed to be lightweight yet powerful, making them suitable for environments with limited resources. They support various text generation tasks and are available in instruction-tuned variants, which means they've been trained to follow instructions provided in prompts.

#### Prompt Formatting

Instruction-tuned models use specific control tokens to format prompts:

- **`user`**: Indicates a user turn.
- **`model`**: Indicates a model turn.
- **`<start_of_turn>`**: Marks the beginning of a dialogue turn.
- **`<end_of_turn>`**: Marks the end of a dialogue turn.

This formatting helps the model understand and generate conversational responses. Refer to the [official documentation](https://ai.google.dev/gemma/docs/formatting) for more details.
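The control tokens above can be assembled by hand as in the following sketch. The `format_gemma_prompt` helper is hypothetical and shown only to make the layout visible; tools such as Ollama typically apply the model's chat template for you, so you rarely need to build this string yourself.

```python
def format_gemma_prompt(turns: list) -> str:
    """Render (role, text) turns with Gemma's control tokens.

    The prompt ends with an open model turn so the model writes the reply.
    """
    parts = []
    for role, text in turns:
        if role not in ("user", "model"):
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


print(format_gemma_prompt([("user", "What is the capital of France?")]))
```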

### Installing and Running Ollama

You will use Ollama to run the Gemma model locally.

First, install Ollama by running:

In [3]:
!curl -fsSL https://ollama.com/install.sh | sh

>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
############################################################################################# 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


Then, start the Ollama server in the background.


In [4]:
!nohup ollama serve > ollama.log &

nohup: redirecting stderr to stdout


Ollama provides a library of pre-configured models, including Gemma 2 models. You can browse the available Gemma 2 models at the [Ollama Gemma 2 Model Catalog](https://ollama.com/library/gemma2). This allows you to switch between different Gemma 2 models easily.

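This tutorial is written for the [gemma2:9b-instruct-q6_K](https://ollama.com/library/gemma2:9b-instruct-q6_K) model. The cells below, as recorded, use the smaller `gemma2:2b` tag, which downloads and responds faster. To run the 9B model, replace `gemma2:2b` with `gemma2:9b-instruct-q6_K` in the `ollama run` command and in the `ChatOllama` constructor. Note that `ollama run` downloads the model on first use.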


In [24]:
!ollama run gemma2:2b "What is the capital of France?" 2> ollama.log

The capital of France is **Paris**. 🇫🇷 




You should see the model's response in the output.

### Integrate Gemma with LangChain

Now, let's set up the Gemma model with LangChain:

In [25]:
llm = ChatOllama(
    model="gemma2:2b",
    temperature=0.8
)

Test the LLM by asking a simple question:


In [26]:
response = llm.invoke("What is the capital of France?")
print(response.content)

The capital of France is **Paris**. 🗼  



## Setting up Elasticsearch

Next, you'll set up a local instance of Elasticsearch to serve as our data source and vector store.

**Note:** This tutorial runs Elasticsearch locally instead of using [Elastic Cloud](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud) to stay self-contained. After the initial downloads, everything runs inside the notebook's runtime, with no external search service to configure or pay for.


### Download and Install Elasticsearch

First, you need to download and install Elasticsearch.


In [27]:
# Removes any previous Elasticsearch installations:
!rm -rf elasticsearch*

Download Elasticsearch Version 8.15.1, extract the archive and set permissions.


In [28]:
ESVERSION = "8.15.1"

In [29]:
%%bash -s "$ESVERSION"
export ESVERSION=$1

# Download ES and its SHA-512 checksum
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-${ESVERSION}-linux-x86_64.tar.gz
wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-${ESVERSION}-linux-x86_64.tar.gz.sha512

# Verify the archive's integrity with SHA-512 before extracting it
shasum -a 512 -c elasticsearch-${ESVERSION}-linux-x86_64.tar.gz.sha512
tar -xzf elasticsearch-${ESVERSION}-linux-x86_64.tar.gz

# Set up user to run ES daemon and configure cgroups
umount /sys/fs/cgroup
apt install -y cgroup-tools
sudo chown -R daemon:daemon elasticsearch-${ESVERSION}/

elasticsearch-8.15.1-linux-x86_64.tar.gz: OK
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  libcgroup1
The following NEW packages will be installed:
  cgroup-tools libcgroup1
0 upgraded, 2 newly installed, 0 to remove and 29 not upgraded.
Need to get 121 kB of archives.
After this operation, 435 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libcgroup1 amd64 2.0-2 [49.8 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 cgroup-tools amd64 2.0-2 [70.8 kB]
Fetched 121 kB in 1s (116 kB/s)
Selecting previously unselected package libcgroup1:amd64.





### Configure Elasticsearch

For demonstration purposes, let's disable security settings.   
**Note**: In a production environment, always enable security features.

Open the Elasticsearch configuration file and append the following settings:

In [30]:
with open(f'./elasticsearch-{ESVERSION}/config/elasticsearch.yml', 'a') as f:
    f.write("xpack.security.enabled: false\n")
    f.write("xpack.security.authc:\n")
    f.write("  anonymous:\n")
    f.write("    username: anonymous_user\n")
    f.write("    roles: superuser\n")
    f.write("    authz_exception: true\n")
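Because the cell above opens the file in append mode, running it twice writes the settings twice, and duplicated keys may prevent Elasticsearch from starting. A minimal sketch of one way to guard against that is shown below. The `append_once` helper and the marker comment are hypothetical, and the demo writes to a temporary file rather than the real configuration.

```python
import tempfile
from pathlib import Path

MARKER = "# --- tutorial settings ---"  # hypothetical marker comment

SETTINGS = (
    "xpack.security.enabled: false\n"
    "xpack.security.authc:\n"
    "  anonymous:\n"
    "    username: anonymous_user\n"
    "    roles: superuser\n"
    "    authz_exception: true\n"
)


def append_once(config_path, block: str, marker: str = MARKER) -> bool:
    """Append `block` to the file unless `marker` is already present."""
    path = Path(config_path)
    existing = path.read_text() if path.exists() else ""
    if marker in existing:
        return False  # settings were already written by an earlier run
    with path.open("a") as f:
        f.write(f"\n{marker}\n{block}")
    return True


# Demo on a temporary file: the second call is a no-op
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "elasticsearch.yml"
    first = append_once(demo, SETTINGS)
    second = append_once(demo, SETTINGS)
    occurrences = demo.read_text().count("xpack.security.enabled")

print(first, second, occurrences)
```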

If you want to verify that the **elasticsearch.yml** file is written correctly, you can uncomment and run the following code block.

In [31]:
# with open(f'./elasticsearch-{ESVERSION}/config/elasticsearch.yml', 'r') as f:
#     print(f.read())

### Run Elasticsearch

Now, let's start Elasticsearch as a daemon process:


In [32]:
%%bash --bg -s "$ESVERSION"

export ESVERSION=$1

sudo -H -u daemon elasticsearch-${ESVERSION}/bin/elasticsearch

It takes Elasticsearch a while to get running, so be sure to wait a few seconds. You can run a manual 60-second sleep command to ensure Elasticsearch has enough time to start:

In [33]:
time.sleep(60)
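A fixed sleep either waits longer than necessary or not long enough. As an alternative, the sketch below polls the HTTP endpoint until it answers. The `http_ok` and `wait_until_ready` helpers are hypothetical names, and the approach assumes security is disabled as configured above, so that an unauthenticated request returns HTTP 200.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str) -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, timeout=120.0, interval=2.0, probe=http_ok) -> bool:
    """Poll `probe(url)` until it succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval)
    return False


# Usage in the notebook:
# wait_until_ready("http://localhost:9200")
```

The same helper could be pointed at the Ollama server address printed by its installer.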

Once the instance has been started, you can check if Elasticsearch is running by listing the processes. You should see several elasticsearch processes running.

In [34]:
!ps -ef | grep elastic

root        4604    4602  0 16:38 ?        00:00:00 sudo -H -u daemon elasticsearch-8.15.1/bin/elast
daemon      4605    4604  1 16:38 ?        00:00:05 /content/elasticsearch-8.15.1/jdk/bin/java -Xms4
daemon      4675    4605 13 16:38 ?        00:01:02 /content/elasticsearch-8.15.1/jdk/bin/java -Des.
daemon      4719    4675  0 16:38 ?        00:00:00 /content/elasticsearch-8.15.1/modules/x-pack-ml/
root        6673     857  0 16:46 ?        00:00:00 /bin/bash -c ps -ef | grep elastic
root        6675    6673  0 16:46 ?        00:00:00 grep elastic


Verify that Elasticsearch is running by making a request to the cluster. The request below passes the default `elastic` superuser with the password `password` to initialize the cluster so that you can make anonymous calls from here on.

**WARNING**: Do not pass user passwords like this in real life.

In [35]:
!curl -u elastic:password -H 'Content-Type: application/json' -XGET http://localhost:9200/?pretty=true

{
  "name" : "4239b4775fb9",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "G3Z3Yhg5TwCfHPZVC1uFPA",
  "version" : {
    "number" : "8.15.1",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "253e8544a65ad44581194068936f2a5d57c2c051",
    "build_date" : "2024-09-02T22:04:47.310170297Z",
    "build_snapshot" : false,
    "lucene_version" : "9.11.1",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}


## QA with RAG Using Elasticsearch

Now, you'll perform question answering using Retrieval-Augmented Generation (RAG) with Elasticsearch by implementing the two stages in a RAG-based architecture:

1. **Retrieval**: Retrieves relevant context based on the user's query.
2. **Generation**: Uses the LLM to generate answers using the retrieved context.

### Retrieval

In this stage, you will perform the following steps:

* **Create a sample dataset**: Use sample Pokémon data.
* **Preparing Documents for Indexing**: Split documents into manageable chunks.
* **Create Embeddings of the Data**: Convert text data into numerical vectors.
* **Store the Embeddings in the Vector Store**: Index the embeddings in Elasticsearch.
* **Create a Retriever**: Set up a retriever to fetch relevant documents.


Initialize the Elasticsearch client by connecting to the local instance.


In [36]:
es_url = "http://localhost:9200"
client = Elasticsearch(hosts=[es_url])

# Verify connection
if client.ping():
    print("Connected to Elasticsearch")
else:
    print("Could not connect to Elasticsearch")

print(client.info())

Connected to Elasticsearch
{'name': '4239b4775fb9', 'cluster_name': 'elasticsearch', 'cluster_uuid': 'G3Z3Yhg5TwCfHPZVC1uFPA', 'version': {'number': '8.15.1', 'build_flavor': 'default', 'build_type': 'tar', 'build_hash': '253e8544a65ad44581194068936f2a5d57c2c051', 'build_date': '2024-09-02T22:04:47.310170297Z', 'build_snapshot': False, 'lucene_version': '9.11.1', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}


#### Create a sample dataset

First, create some sample data to index. To do this, let's use descriptions of various Pokémon:

In [37]:
data = [
    {
        "name": "Bulbasaur",
        "description": "Bulbasaur has a strange seed planted on its back at birth. The plant sprouts and grows with Bulbasaur.",
        "type": "Grass/Poison"
    },
    {
        "name": "Charmander",
        "description": "Charmander obviously prefers hot places. When it rains, steam is said to spout from the tip of Charmander's tail.",
        "type": "Fire"
    },
    {
        "name": "Squirtle",
        "description": "After birth, Squirtle's back swells and hardens into a shell. Squirtle powerfully sprays foam from its mouth.",
        "type": "Water"
    },
    {
        "name": "Pikachu",
        "description": "When several Pikachu gather, their electricity could build and cause lightning storms.",
        "type": "Electric"
    },
    {
        "name": "Jigglypuff",
        "description": "When Jigglypuff sings, it never pauses to breathe. If Jigglypuff is in battle against an opponent that does not easily fall asleep, it cannot breathe, endangering its life.",
        "type": "Normal/Fairy"
    },
    {
        "name": "Meowth",
        "description": "Meowth adores round objects. It wanders the streets on a nightly basis to look for dropped loose change.",
        "type": "Normal"
    },
    {
        "name": "Psyduck",
        "description": "While lulling its enemies with its vacant look, this wily Psyduck will use psychokinetic powers.",
        "type": "Water"
    },
    {
        "name": "Mewtwo",
        "description": "Mewtwo was created by a scientist after years of horrific gene splicing and DNA engineering experiments.",
        "type": "Psychic"
    },
    {
        "name": "Snorlax",
        "description": "Snorlax's daily routine consists of nothing more than eating and sleeping. It is such a docile Pokémon that children use Snorlax's expansive belly as a place to play.",
        "type": "Normal"
    },
    {
        "name": "Gengar",
        "description": "Sometimes, on a dark night, your shadow thrown by a streetlight will suddenly and startlingly overtake you. It is actually a Gengar running past you.",
        "type": "Ghost/Poison"
    },
    {
        "name": "Lapras",
        "description": "People have driven Lapras almost to the point of extinction. In the evenings, Lapras is said to sing plaintively as it seeks what few others of its kind still remain.",
        "type": "Water/Ice"
    },
    {
        "name": "Dragonite",
        "description": "Dragonite is capable of circling the globe in just sixteen hours. It is a kindhearted Pokémon that leads lost ships in a storm to the safety of land.",
        "type": "Dragon/Flying"
    },
    {
        "name": "Ditto",
        "description": "Ditto rearranges its cell structure to transform itself into other shapes. However, if Ditto tries to transform itself by relying on its memory, it may get details wrong.",
        "type": "Normal"
    },
    {
        "name": "Magikarp",
        "description": "Magikarp is a pathetic excuse for a Pokémon that is only capable of flopping and splashing. This behavior prompted scientists to undertake research into Magikarp.",
        "type": "Water"
    },
    {
        "name": "Charizard",
        "description": "Charizard flies around the sky in search of powerful opponents. Charizard breathes fire of such great heat that it melts anything.",
        "type": "Fire/Flying"
    },
    {
        "name": "Onix",
        "description": "Onix burrows at high speed in search of food. The tunnels Onix leaves are used as homes by Diglett.",
        "type": "Rock/Ground"
    }
]

#### Preparing Documents for Indexing

Splitting documents into smaller manageable chunks is ideal for efficient Elasticsearch indexing and search, especially for larger descriptions.

The following code splits the documents into chunks while preserving their metadata. It first extracts the descriptions and metadata, then uses `RecursiveCharacterTextSplitter` to break each description into overlapping segments. Because the splitter is created with `from_tiktoken_encoder`, `chunk_size` and `chunk_overlap` are measured in tokens. The sample descriptions are far shorter than 512 tokens, so each one stays a single chunk; splitting starts to matter once you index longer texts. Each resulting chunk is paired with its metadata, ready for indexing.

In [38]:
metadata = []
content = []

for doc in data:
    content.append(doc["description"])
    metadata.append({"name": doc["name"], "type": doc["type"]})

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512, chunk_overlap=256
)
docs = text_splitter.create_documents(content, metadatas=metadata)
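To see what overlapping chunks look like, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators and then merges pieces); the `chunk_tokens` helper is hypothetical and treats whitespace-separated words as tokens.

```python
def chunk_tokens(tokens: list, chunk_size: int, chunk_overlap: int) -> list:
    """Split a token list into windows of `chunk_size` sharing `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


words = "Dragonite is capable of circling the globe in just sixteen hours".split()
for chunk in chunk_tokens(words, chunk_size=6, chunk_overlap=3):
    print(" ".join(chunk))
```

Each window repeats the last three words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.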

#### Create Embeddings of the Data

Embeddings are numerical representations (vectors) of text. Text with similar meaning will have similar embedding vectors. You'll use an embedding model to create the embedding vectors of the data.
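One common way to compare two embedding vectors is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; a real embedding model produces vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of three phrases
electric_mouse = [0.9, 0.1, 0.0]
lightning_rodent = [0.8, 0.2, 0.1]
sleepy_giant = [0.0, 0.2, 0.9]

similar = cosine_similarity(electric_mouse, lightning_rodent)
different = cosine_similarity(electric_mouse, sleepy_giant)
print(f"similar pair: {similar:.3f}, different pair: {different:.3f}")
```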

Initialize the embeddings using the `sentence-transformers/all-MiniLM-L12-v2` Hugging Face embedding model. Setting `device` to `cuda` runs the model on the GPU for faster computation, and additional parameters such as `normalize_embeddings` can be customized as well.


In [42]:
!pip install -U sentence-transformers

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L12-v2",
    model_kwargs={'device': 'cuda'},
    encode_kwargs={'normalize_embeddings': False}
)

Collecting sentence-transformers
  Using cached sentence_transformers-3.4.1-py3-none-any.whl.metadata (10 kB)
Using cached sentence_transformers-3.4.1-py3-none-any.whl (275 kB)
Installing collected packages: sentence-transformers
  Attempting uninstall: sentence-transformers
    Found existing installation: sentence-transformers 2.2.2
    Uninstalling sentence-transformers-2.2.2:
      Successfully uninstalled sentence-transformers-2.2.2
Successfully installed sentence-transformers-3.4.1



#### Store the Embeddings in the Vector Store

Set up the `ElasticsearchStore` index using the `langchain-elasticsearch` integration. The RAG chain queries this same index later.

Run the following cell to start indexing the documents.

In [43]:
index_name = "es-rag-pokemon"

documents = ElasticsearchStore.from_documents(
    docs,
    embeddings,
    index_name=index_name,
    es_url=es_url
)

#### Create a Retriever

Next, you'll create a retriever to fetch relevant documents based on user queries. You'll design a hybrid search query for Elasticsearch that combines both traditional keyword-based search and vector similarity search.

* [**Keyword Matching using BM25**](https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables): Ensures that documents containing the exact or similar terms (with fuzziness) are retrieved. This is useful for precise term matching and accommodates typos.

* [**Vector Similarity (KNN)**](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm): Retrieves documents that are semantically similar to the search query, even if they don't contain the exact search terms. This captures the meaning behind the words.


So, let's define a hybrid query function that uses both techniques.


In [44]:
def hybrid_query(search_query: str) -> Dict:
    vector = embeddings.embed_query(search_query)
    return {
        # Keyword matching
        "query": {
            "match": {
                "text": {
                    "query": search_query,
                    # Keyword matching with typo tolerance
                    "fuzziness": "AUTO",
                }
            },
        },
        # K-Nearest Neighbors
        "knn": {
            # The default vector field name in LangChain is "vector"
            "field": "vector",
            "query_vector": vector,
            "k": 5,
            "num_candidates": 10,
        }
    }
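To give a feel for what `"fuzziness": "AUTO"` allows, the sketch below mimics its default length thresholds: terms of up to 2 characters must match exactly, terms of 3 to 5 characters may differ by one edit, and longer terms by two. This is an approximation written for illustration. It uses plain Levenshtein distance, whereas Elasticsearch by default also counts a transposition of adjacent characters as a single edit, and all helper names are hypothetical.

```python
def auto_fuzziness(term: str) -> int:
    """Maximum edits allowed for a term under the default AUTO thresholds."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def fuzzy_match(query_term: str, indexed_term: str) -> bool:
    """True if the indexed term is within the edits allowed for the query term."""
    return edit_distance(query_term, indexed_term) <= auto_fuzziness(query_term)


print(fuzzy_match("pikachoo", "pikachu"))
print(fuzzy_match("mew", "meowth"))
```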

Initialize the retriever.

In [45]:
retriever = ElasticsearchRetriever.from_es_params(
    index_name=index_name,
    body_func=hybrid_query,
    content_field="text",
    url=es_url,
)

Here, you'll test if the retriever is working as intended.

In [46]:
retriever.invoke("Pikachu")

[Document(metadata={'_index': 'es-rag-pokemon', '_id': '277d8ba0-2778-494d-bd20-9d06e6663809', '_score': 3.7311673, '_source': {'metadata': {'name': 'Pikachu', 'type': 'Electric'}, 'vector': [-0.08204592019319534, 0.040552400052547455, -0.017693540081381798, ... (output truncated)
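The full embedding vector comes back with every hit, which makes the output hard to read. One option, sketched here under the assumption that the vectors are stored in the default `vector` field, is to ask Elasticsearch to leave that field out of `_source`. The `without_vectors` helper is a hypothetical name.

```python
import copy


def without_vectors(body: dict, vector_field: str = "vector") -> dict:
    """Return a copy of a search body that excludes the vector field from results."""
    trimmed = copy.deepcopy(body)
    trimmed["_source"] = {"excludes": [vector_field]}
    return trimmed


example_body = {"query": {"match": {"text": {"query": "Pikachu"}}}}
trimmed_body = without_vectors(example_body)
print(trimmed_body)
```

In the notebook, this could wrap the existing body function, for example `body_func=lambda q: without_vectors(hybrid_query(q))`.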

### Generation

Next, the Generation phase involves prompting the LLM for an answer when the user asks a question. The retriever you created in the previous stage will be used here to provide more context.

You'll perform the following steps in this stage:

* Load a predefined RAG prompt.
* Define a function to format retrieved documents.
* Handle Gemma formatting using a custom output parser.
* Chain everything together.

#### Load a predefined prompt

You will use a [RAG prompt template](https://smith.langchain.com/hub/rlm/rag-prompt) from LangChain Hub for the predefined prompt. It is useful for chat, QA, or other applications that rely on passing context to an LLM.


In [47]:
prompt = hub.pull("rlm/rag-prompt")
print(f"Prompt:\n\n{prompt.messages[0].prompt.template}")

Prompt:

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} 
Context: {context} 
Answer:




#### The formatted context

Define a function to format the retrieved documents.

In [48]:
def format_docs(docs):
    return "\n\n".join([doc.page_content for doc in docs])

#### Custom Gemma Output Parser
Since Gemma models use specific formatting, you'll create an output parser to extract the instruction-tuned model's response properly.

In [49]:
class GemmaOutputParser(BaseTransformOutputParser[str]):
    def parse(self, text: str) -> str:
        model_start_token = "<start_of_turn>model\n"
        idx = text.rfind(model_start_token)
        return text[idx + len(model_start_token):] if idx > -1 else text
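The parsing logic can be tried on its own, without LangChain. The sketch below copies the same `rfind`-based slicing into a plain function (the `extract_model_reply` name is hypothetical) and runs it on two hand-written strings.

```python
def extract_model_reply(text: str) -> str:
    """Return the text after the last `<start_of_turn>model` marker, if any."""
    marker = "<start_of_turn>model\n"
    idx = text.rfind(marker)
    # If the marker is absent, the text is returned unchanged
    return text[idx + len(marker):] if idx > -1 else text


raw = "<start_of_turn>user\nHi<end_of_turn>\n<start_of_turn>model\nHello!"
print(extract_model_reply(raw))
print(extract_model_reply("Already clean"))
```

When the response contains no control tokens, the parser is a no-op, so it is safe to keep in the chain either way.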

#### Chain everything together

Finally, you'll set up the RAG chain that ties everything together while relying on LangChain's [LCEL paradigm](https://python.langchain.com/v0.1/docs/expression_language/why/).

[The prompt](https://smith.langchain.com/hub/rlm/rag-prompt) you're using expects an input that includes two keys: `context` and `question`. The user only provides the question, so you need to obtain and format the relevant context using the retriever. Here's how it works:

* **Retrieve Relevant Documents**: Use `retriever` to get the relevant documents based on the user's question.

* **Format the Documents**: The retrieved documents are `Document` objects rather than plain text, so you'll use the `format_docs` function to join their contents into a single string. This string becomes the `context`. The pipe symbol (`|`) chains `retriever` and `format_docs`, resulting in the **formatted context**.

* **Pass the Question**: Pass the **user's question** under the `question` key via `RunnablePassthrough`, which behaves like the identity function: it forwards its input to the prompt unchanged. (It can also add keys to a dictionary input via `RunnablePassthrough.assign`.) To learn more about this, read [the official documentation](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.passthrough.RunnablePassthrough.html).

* **Fill the Prompt**: The **formatted context** and the **user's question** form a single input dictionary for the prompt. The prompt template replaces its placeholders (`{context}` and `{question}`) with the values from this dictionary, producing a complete prompt that is ready to be sent to the language model.

* **Generate Answer with the LLM**: The filled prompt is then passed to the `llm` (the Gemma model via Ollama). The LLM processes the prompt and generates a response based on the provided context and question.

* **Parse the LLM's Output**: The raw output from the LLM might include control tokens or additional formatting specific to the Gemma model. You'll use the `GemmaOutputParser` to parse the LLM's output and extract the final answer.


In simple terms, you gather the context related to the question, format it appropriately, and then provide both the formatted context and the question to the model to generate an informed answer.


In [50]:
# Create an actual chain

rag_chain = (
    # First, retrieve the documents that are relevant to the
    # given query
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    # The `context` and `question` are then passed to the prompt
    | prompt
    # The whole prompt with all the information is passed to the LLM
    | llm
    # The answer of the LLM is parsed by the class defined above
    | GemmaOutputParser()
)
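As a mental model for the chain above, the same data flow can be written with ordinary function calls. Everything in this sketch is a hypothetical stand-in: `fake_retriever` replaces Elasticsearch, `fake_llm` replaces Gemma, and the prompt text is simplified. It shows only the order in which values move through the pipeline, not how LCEL is implemented.

```python
def fake_retriever(question: str) -> list:
    return ["Meowth adores round objects."]  # stand-in for Elasticsearch


def join_docs(docs: list) -> str:
    return "\n\n".join(docs)


def fill_prompt(inputs: dict) -> str:
    return f"Question: {inputs['question']}\nContext: {inputs['context']}\nAnswer:"


def fake_llm(prompt_text: str) -> str:
    return "<start_of_turn>model\nMeowth."  # stand-in for Gemma


def parse(text: str) -> str:
    marker = "<start_of_turn>model\n"
    idx = text.rfind(marker)
    return text[idx + len(marker):] if idx > -1 else text


def toy_rag_chain(question: str) -> str:
    # The dict step feeds the same input to every branch
    inputs = {
        "context": join_docs(fake_retriever(question)),  # retriever | format_docs
        "question": question,  # RunnablePassthrough()
    }
    # prompt | llm | parser, applied left to right
    return parse(fake_llm(fill_prompt(inputs)))


print(toy_rag_chain("What's a Pokémon that loves round objects?"))
```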

### Try It Out!

Finally, let's ask some questions and see how the RAG chain performs.

In [51]:
question = "What happens when several Pikachu gather?"
answer = rag_chain.invoke(question)
print(answer)

When several Pikachu gather, their electricity can build up and cause lightning storms.  



In [52]:
question = "Name a few Pokémon that can fly"
answer = rag_chain.invoke(question)
print(answer)

Charizard is capable of flying, and Dragonite can circle the globe in just sixteen hours.  Both of these Pokémon are known for their ability to fly. 



In [53]:
question = "What's a Pokémon that loves round objects?"
answer = rag_chain.invoke(question)
print(answer)

Meowth loves round objects like loose change.  
It wanders the streets looking for them on a nightly basis. 
This is likely because it is an association Meowth has formed with its daily life. 



Congratulations! You've successfully built a local Retrieval-Augmented Generation (RAG) application using the quantized Gemma 2 model (Gemma 2 9B IT), LangChain, Ollama, and Elasticsearch.

Feel free to explore further by:

- Adding more data to Elasticsearch.
- Experimenting with different Gemma models from the [Ollama Gemma 2 Model Catalog](https://ollama.com/library/gemma2).
- Tweaking model parameters for better performance.
- Checking out [Elastic Cloud deployment](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud) and learning how to obtain your Elastic credentials ([ELASTIC_CLOUD_ID](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id) and [ELASTIC_API_KEY](https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key)) to set up a cloud instance instead of a local one.

By following this tutorial, you're now equipped to build your own local RAG applications and explore the capabilities of Gemma models combined with LangChain and Elasticsearch.


In [8]:
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from elasticsearch import Elasticsearch
from langchain_ollama import ChatOllama

# Define Elasticsearch Connection
ES_HOST = "http://localhost:9200"
INDEX_NAME = "pdf_documents"
es = Elasticsearch([ES_HOST])

# Define function to process PDFs
def process_pdf_folder(folder_path):
    documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".pdf"):
            file_path = os.path.join(folder_path, filename)
            loader = PyPDFLoader(file_path)
            documents.extend(loader.load())
    return documents

# Define function to index documents into Elasticsearch
def index_documents(docs):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    for doc in docs:
        split_texts = text_splitter.split_text(doc.page_content)
        for text_chunk in split_texts:
            es.index(index=INDEX_NAME, document={"text": text_chunk})

# Define function to retrieve documents from Elasticsearch
def retrieve_documents():
    es.indices.refresh(index=INDEX_NAME)  # make freshly indexed chunks searchable
    response = es.search(index=INDEX_NAME, query={"match_all": {}}, size=10)
    return [hit["_source"]["text"] for hit in response["hits"]["hits"]]

# Define function to query Ollama using the retrieved PDF chunks as context
def query_ollama_from_pdfs():
    context_docs = retrieve_documents()
    context_text = "\n".join(context_docs)
    # ChatOllama takes the model tag via `model`
    chat_model = ChatOllama(model="gemma:2b")
    prompt = f"Context:\n{context_text}\n\nSummarize the key insights from the PDFs."
    # `invoke` returns a message object; its text is in `.content`
    response = chat_model.invoke(prompt)
    return response.content

# Usage Example
pdf_folder = "/content/PDFs"  # Replace with your folder path
documents = process_pdf_folder(pdf_folder)
index_documents(documents)
print(f"Indexed {len(documents)} PDF pages into Elasticsearch!")

# Query Ollama for insights from the PDFs
response = query_ollama_from_pdfs()
print("AI Response:", response)


Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/urllib3/connection.py", line 198, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.11/dist-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/elastic_transport/_node/_http_urllib3.py", line 167, in perform_request
    response = self.pool.urlopen(
               ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/urllib3/connectionpool.py", line 841, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/pyth

ConnectionError: Connection error caused by: ConnectionError(Connection error caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7e4206280210>: Failed to establish a new connection: [Errno 111] Connection refused))
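
The `ConnectionRefusedError` above indicates that nothing was listening at `http://localhost:9200` when the cell ran, so Elasticsearch needs to be started (or `ES_HOST` pointed at a running instance) before indexing. As a minimal sketch using only the standard library, the helper below checks whether a TCP connection to the host and port can be opened; `is_reachable` is a hypothetical name introduced here for illustration and is not part of the Elasticsearch client.

```python
import socket
from urllib.parse import urlparse


def is_reachable(url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    # Fall back to the scheme's default port when the URL does not specify one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


ES_HOST = "http://localhost:9200"
if is_reachable(ES_HOST):
    print(f"{ES_HOST} is accepting connections.")
else:
    print(f"Nothing is listening at {ES_HOST}; start Elasticsearch before indexing.")
```

This only confirms that the port is open, not that the service behind it is healthy. With the `elasticsearch` client already created, calling `es.ping()` should serve a similar purpose.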

In [6]:
!pip install pypdf


Collecting pypdf
  Downloading pypdf-5.4.0-py3-none-any.whl.metadata (7.3 kB)
Downloading pypdf-5.4.0-py3-none-any.whl (302 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.4.0
