![image](https://raw.githubusercontent.com/IBM/watson-machine-learning-samples/master/cloud/notebooks/headers/watsonx-Prompt_Lab-Notebook.png)
# Use watsonx, LangChain, Elasticsearch, and Model Gateway to create and deploy a RAG function with model load balancing

#### Disclaimers

- Use only Projects and Spaces that are available in the watsonx context.

## Notebook content

This notebook demonstrates how to create and deploy a Retrieval Augmented Generation (RAG) solution in watsonx.ai using Model Gateway. It covers data retrieval, building and querying a knowledge base, testing models, and deploying the RAG solution as an AI service.

Some familiarity with Python is helpful. This notebook uses Python 3.11.

#### About Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) is a versatile pattern that can unlock a number of use cases requiring factual recall of information, such as querying a knowledge base in natural language.

In its simplest form, RAG requires 3 steps:

- Index knowledge base passages (once)
- Retrieve relevant passage(s) from knowledge base (for every user query)
- Generate a response by feeding retrieved passage into a large language model (for every user query)
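
The three steps above can be sketched with a toy example. This is a minimal illustration only: it assumes a bag-of-words count vector in place of a real embedding model and a Python list in place of Elasticsearch. The sample passages and the helper names (`embed`, `cosine`, `retrieve`, `make_prompt`) are hypothetical and are not part of the watsonx.ai SDK.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (not a dense model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Step 1: index the knowledge base passages (once).
passages = [
    "Bucharest became the capital of Romania in 1862.",
    "The Baltimore Ravens won Super Bowl XLVII.",
    "Idaho became a state in 1890.",
]
index = [(passage, embed(passage)) for passage in passages]


def retrieve(query: str, k: int = 1) -> list[str]:
    # Step 2: rank the indexed passages by similarity to the query.
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [passage for passage, _ in ranked[:k]]


def make_prompt(query: str, k: int = 1) -> str:
    # Step 3: place the retrieved passages in front of the question.
    # A real pipeline would now send this prompt to a large language model.
    context = "\n".join(retrieve(query, k))
    return f"{context}\nQuestion: {query}\nAnswer:"


print(make_prompt("when did bucharest become the capital of romania?"))
```

The rest of this notebook replaces each toy piece with a production component: a watsonx.ai embedding model, an Elasticsearch index, and models served through Model Gateway.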

## Contents

This notebook contains the following parts:

- [Set up the environment](#setup)
- [Configure Model Gateway](#model-gateway)
- [Create Model Gateway providers](#model-gateway-providers)
- [Data preparation](#data-preparation)
- [Set up connection to Elasticsearch](#elasticsearch)
- [Set up VectorStore with Elasticsearch credentials](#vectorstore)
- [Create RAG AI service](#rag-function)
- [Create RAG AI service deployment](#deploy)
- [Calculate rougeL metric](#evaluate)

<a id="setup"></a>
## Set up the environment

Before you use the sample code in this notebook, you must perform the following setup tasks:

- Create a <a href="https://cloud.ibm.com/catalog/services/watsonxai-runtime" target="_blank" rel="noopener noreferrer">watsonx.ai Runtime Service</a> instance (a free plan is offered, and information about how to create the instance can be found <a href="https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/wml-plans.html?context=wx&audience=wdp" target="_blank" rel="noopener noreferrer">here</a>).

**Note:** The model load balancing example in this notebook may raise `Status Code 429 (Too Many Requests)` errors on the free plan, because of its lower maximum number of requests allowed per second.
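
If you hit rate limits, one common mitigation is to retry with exponential backoff. The helper below is a hedged sketch, not part of the SDK: `call_with_backoff` and its parameters are hypothetical names, and because this notebook does not show which exception type the SDK raises for a 429 response, `retry_on` defaults to `Exception` and should be narrowed once you know the actual type.

```python
import random
import time


def call_with_backoff(
    func,
    *args,
    max_attempts=5,
    base_delay=1.0,
    retry_on=(Exception,),  # assumption: narrow this to the SDK's rate-limit error
    sleep=time.sleep,
    **kwargs,
):
    """Call func, retrying with exponential backoff and jitter on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (base, 2*base, 4*base, ...),
            # plus up to 10% random jitter to avoid synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, 0.1 * delay))
```

Any request-sending callable can be wrapped this way, for example `call_with_backoff(send_one_request, payload)`, where `send_one_request` stands for whatever function issues the call.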

### Install dependencies
**Note:** `ibm-watsonx-ai` documentation can be found <a href="https://ibm.github.io/watsonx-ai-python-sdk/index.html" target="_blank" rel="noopener noreferrer">here</a>.

In [1]:
%pip install wget | tail -n 1
%pip install rouge-score | tail -n 1
%pip install -U "ibm-watsonx-ai>=1.3.25" | tail -n 1
%pip install -U "langchain>=0.3,<0.4" | tail -n 1
%pip install -U "langchain-elasticsearch>=0.3,<0.4" | tail -n 1

Successfully installed wget-3.2
Successfully installed absl-py-2.3.0 click-8.2.1 joblib-1.5.1 nltk-3.9.1 numpy-2.3.1 regex-2024.11.6 rouge-score-0.1.2 tqdm-4.67.1
Successfully installed anyio-4.9.0 certifi-2025.6.15 charset_normalizer-3.4.2 h11-0.16.0 httpcore-1.0.9 httpx-0.28.1 ibm-cos-sdk-2.14.2 ibm-cos-sdk-core-2.14.2 ibm-cos-sdk-s3transfer-2.14.2 ibm-watsonx-ai-1.3.26 idna-3.10 jmespath-1.0.1 lomond-0.3.3 pandas-2.2.3 pytz-2025.2 requests-2.32.4 sniffio-1.3.1 tabulate-0.9.0 tzdata-2025.2 urllib3-2.5.0
Successfully installed PyYAML-6.0.2 SQLAlchemy-2.0.41 annotated-types-0.7.0 jsonpatch-1.33 jsonpointer-3.0.0 langchain-0.3.26 langchain-core-0.3.67 langchain-text-splitters-0.3.8 langsmith-0.4.4 orjson-3.10.18 packaging-24.2 pydantic-2.11.7 pydantic-core-2.33.2 requests-toolbelt-1.0.0 tenacity-9.1.2 typing-inspection-0.4.1 zstandard-0.23.0
Successfully installed elastic-transport-8.17.1 elasticsearch-8.18.1 langchain-elasticsearch-0.3.2 simsimd-6.4.9


### Define the watsonx.ai credentials
Use the code cell below to define the watsonx.ai credentials.

**Action:** Provide the IBM Cloud user API key. For details, see <a href="https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui" target="_blank" rel="noopener noreferrer">Managing user API keys</a>.

In [2]:
import getpass
from ibm_watsonx_ai import Credentials

credentials = Credentials(
    url="https://ca-tor.ml.cloud.ibm.com",
    api_key=getpass.getpass("Enter your watsonx.ai api key and hit enter: "),
)

### Working with projects

First of all, you need to create a project that will be used for your work.
The project must have a watsonx.ai Runtime instance assigned to it for this notebook to work properly.
To assign an instance, follow the [documentation](https://www.ibm.com/docs/en/watsonx/saas?topic=projects-adding-associated-services).

- Open the IBM Cloud Pak main page
- Click all projects
- Create an empty project
- Assign the watsonx.ai Runtime instance
- Copy the `project_id` from the URL and paste it below

**Action**: Assign the project ID below

In [3]:
import os

try:
    project_id = os.environ["PROJECT_ID"]
except KeyError:
    project_id = input("Enter your project_id and hit enter: ")

### Working with spaces

You need to create a space that will be used for your work. If you do not have a space, you can use the [Deployment Spaces Dashboard](https://dataplatform.cloud.ibm.com/ml-runtime/spaces?context=wx) to create one.

- Click **New Deployment Space**
- Create an empty space
- Select Cloud Object Storage
- Select watsonx.ai Runtime instance and press **Create**
- Go to **Manage** tab
- Copy `Space GUID` and paste it below

**Tip**: You can also use the SDK to prepare the space for your work. More information can be found [here](https://github.com/IBM/watson-machine-learning-samples/blob/master/cloud/notebooks/python_sdk/instance-management/Space%20management.ipynb).

**Action**: Assign the space ID below

In [4]:
import os

try:
    space_id = os.environ["SPACE_ID"]
except KeyError:
    space_id = input("Enter your space_id and hit enter: ")
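
The "environment variable, otherwise prompt" pattern used for `project_id` and `space_id` recurs later for the Elasticsearch credentials. As a sketch, it could be factored into a small helper. The name `env_or_prompt` and its `ask` parameter are hypothetical; note one deliberate difference from the cells above: an environment variable that is set but empty is treated here as missing.

```python
import os
from typing import Callable


def env_or_prompt(name: str, prompt: str, ask: Callable[[str], str] = input) -> str:
    """Return the environment variable `name` if it is set and non-empty,
    otherwise ask the user with the given prompt function."""
    value = os.environ.get(name)
    if value:
        return value
    return ask(prompt)
```

With this helper, the cell above would reduce to `space_id = env_or_prompt("SPACE_ID", "Enter your space_id and hit enter: ")`. For secrets, pass `ask=getpass.getpass` so the typed value is not echoed.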

### Create `APIClient` instance

In [5]:
from ibm_watsonx_ai import APIClient

client = APIClient(credentials, project_id=project_id)

<a id="model-gateway"></a>
## Configure Model Gateway

**Note:** You can learn more about setting up Model Gateway in this [sample notebook](https://github.com/IBM/watsonx-ai-samples/blob/master/cloud/notebooks/python_sdk/deployments/ai_services/Use%20watsonx%2C%20and%20Model%20Gateway%20to%20run%20as%20an%20AI%20service%20with%20load%20balancing.ipynb).

### Define IBM Cloud Secrets Manager URL
In order to store secrets for different model providers, you need to use the IBM Cloud Secrets Manager.

**Note:** This notebook assumes that the IBM Cloud Secrets Manager instance is already configured. In order to configure the instance, follow [this chapter](https://www.ibm.com/docs/en/watsonx/saas?topic=models-using-model-gateway-preview#setting-up-authentication) in the documentation.

In [6]:
secrets_manager_url = "PASTE_YOUR_IBM_CLOUD_SECRETS_MANAGER_URL_HERE"

### Initialize the Model Gateway
Create `Gateway` instance

In [7]:
from ibm_watsonx_ai.gateway import Gateway

gateway = Gateway(api_client=client)

Set your IBM Cloud Secrets Manager instance

**Note:** This instance will store your provider credentials. The same credentials will later be used inside the AI service.

**Note:** Secrets manager should be set only once per project.

In [8]:
gateway.set_secrets_manager(secrets_manager_url)

{'id': 'd6a9d735-dca3-5492-9161-62577c7bc575',
 'name': 'Watsonx AI Model Gateway configuration'}

List available providers

In [9]:
gateway.providers.list()

| | ID | NAME | TYPE |
|---|---|---|---|


<a id="model-gateway-providers"></a>
## Configure Model Gateway providers and create models with the same alias

In [10]:
model_alias = "load-balancing-rag-models"

### Create watsonx.ai provider for `meta-llama/llama-3-3-70b-instruct` model

In [11]:
llama_model = "meta-llama/llama-3-3-70b-instruct"

watsonx_ai_provider_1_details = gateway.providers.create(
    provider="watsonxai",
    name="watsonx-ai-provider-1",
    data={
        "apikey": client.credentials.api_key,
        "auth_url": client.service_instance._href_definitions.get_iam_token_url(),
        "base_url": client.credentials.url,
        "project_id": project_id,
    },
)

watsonx_ai_provider_1_id = gateway.providers.get_id(watsonx_ai_provider_1_details)

llama_model_details = gateway.models.create(
    provider_id=watsonx_ai_provider_1_id,
    model=llama_model,
    alias=model_alias,
)

llama_model_id = gateway.models.get_id(llama_model_details)

### Create watsonx.ai provider for `mistralai/mistral-large` model

In [12]:
mistral_model = "mistralai/mistral-large"

watsonx_ai_provider_2_details = gateway.providers.create(
    provider="watsonxai",
    name="watsonx-ai-provider-2",
    data={
        "apikey": client.credentials.api_key,
        "auth_url": client.service_instance._href_definitions.get_iam_token_url(),
        "base_url": client.credentials.url,
        "project_id": project_id,
    },
)

watsonx_ai_provider_2_id = gateway.providers.get_id(watsonx_ai_provider_2_details)

mistral_model_details = gateway.models.create(
    provider_id=watsonx_ai_provider_2_id,
    model=mistral_model,
    alias=model_alias,
)

mistral_model_id = gateway.models.get_id(mistral_model_details)

### Create watsonx.ai provider for `ibm/granite-3-8b-instruct` model

In [13]:
granite_model = "ibm/granite-3-8b-instruct"

watsonx_ai_provider_3_details = gateway.providers.create(
    provider="watsonxai",
    name="watsonx-ai-provider-3",
    data={
        "apikey": client.credentials.api_key,
        "auth_url": client.service_instance._href_definitions.get_iam_token_url(),
        "base_url": client.credentials.url,
        "project_id": project_id,
    },
)

watsonx_ai_provider_3_id = gateway.providers.get_id(watsonx_ai_provider_3_details)

granite_model_details = gateway.models.create(
    provider_id=watsonx_ai_provider_3_id,
    model=granite_model,
    alias=model_alias,
)

granite_model_id = gateway.models.get_id(granite_model_details)
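
The three cells above repeat the same two calls with different names. A hedged sketch of the same flow as a loop is shown below; it uses only the `gateway.providers.create`, `gateway.providers.get_id`, `gateway.models.create`, and `gateway.models.get_id` calls already seen in this notebook. The function name `register_models_with_alias` and its parameters are hypothetical.

```python
def register_models_with_alias(
    gateway, model_ids, alias, provider_data, name_prefix="watsonx-ai-provider"
):
    """Create one watsonx.ai provider and one aliased model per entry in model_ids.

    Returns a dict mapping each model name to its Model Gateway model ID.
    """
    registered = {}
    for position, model_name in enumerate(model_ids, start=1):
        # One provider per model, named <prefix>-1, <prefix>-2, ...
        provider_details = gateway.providers.create(
            provider="watsonxai",
            name=f"{name_prefix}-{position}",
            data=provider_data,
        )
        provider_id = gateway.providers.get_id(provider_details)

        # Every model shares the same alias, so requests sent to the alias
        # can be served by any of them.
        model_details = gateway.models.create(
            provider_id=provider_id,
            model=model_name,
            alias=alias,
        )
        registered[model_name] = gateway.models.get_id(model_details)
    return registered
```

In this notebook it would be called with `model_ids=[llama_model, mistral_model, granite_model]`, `alias=model_alias`, and the same `data` dictionary that is passed to `gateway.providers.create` above.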

### List available providers

In [14]:
gateway.providers.list()

| | ID | NAME | TYPE |
|---|---|---|---|
| 0 | 9ad8bf00-e741-47c3-8eae-13e5ae4e4d1d | watsonx-ai-provider-1 | watsonxai |
| 1 | 8e4a6805-f61e-4f68-b235-75236479191d | watsonx-ai-provider-2 | watsonxai |
| 2 | b63c0ee4-63f7-4403-8734-fec841c5f676 | watsonx-ai-provider-3 | watsonxai |


<a id="data-preparation"></a>
## Data preparation

### Build up knowledge base

The current state-of-the-art in RAG is to create dense vector representations of the knowledge base in order to calculate the semantic similarity to a given user query.

We can generate dense vector representations using embedding models. In this notebook, we use the `ibm/slate-125m-english-rtrvr-v2` model to embed both the knowledge base passages and user queries.

A vector database is optimized for dense vector indexing and retrieval. This notebook uses <a href="https://python.langchain.com/docs/integrations/vectorstores/elasticsearch#basic-example" target="_blank" rel="noopener noreferrer">Elasticsearch</a>, a distributed, RESTful search and analytics engine capable of performing both vector and lexical search.

The dataset we are using is already split into self-contained passages that can be ingested by Elasticsearch. The size of each passage is limited by the embedding model's context window (which is 512 tokens for `ibm/slate-125m-english-rtrvr-v2`).

#### Load knowledge base documents

Load the set of documents that will be used to build the knowledge base, and store it as a project asset.

In [15]:
import wget

filename = "psgs.tsv"
url = f"https://raw.github.com/IBM/watsonx-ai-samples/master/cloud/data/RAG/{filename}"
if not os.path.isfile(filename):
    wget.download(url)

asset_details = client.data_assets.create(name=filename, file_path=filename)

Creating data asset...
SUCCESS


#### Read and prepare documents
Read documents using `DataConnection` and prepare them for vector database ingestion by combining title and text.

In [16]:
from ibm_watsonx_ai.helpers import DataConnection

data_connection = DataConnection(data_asset_id=client.data_assets.get_id(asset_details))
data_connection.set_client(client)
documents = data_connection.read(csv_separator="\t")

Collecting pyarrow>=3.0.0
  Using cached pyarrow-20.0.0-cp311-cp311-macosx_12_0_arm64.whl.metadata (3.3 kB)
Using cached pyarrow-20.0.0-cp311-cp311-macosx_12_0_arm64.whl (30.9 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-20.0.0


In [17]:
documents["indextext"] = documents["title"].astype(str) + "\n" + documents["text"]
documents = documents[:1000]
documents.head()

| | id | text | title | indextext |
|---|---|---|---|---|
| 0 | 1.0 | History of Idaho - wikipedia History of Idaho ... | History of Idaho | History of Idaho\nHistory of Idaho - wikipedia... |
| 1 | 2.0 | 1957 . Location Cataldo , Idaho Built 1848 Arc... | History of Idaho | History of Idaho\n1957 . Location Cataldo , Id... |
| 2 | 3.0 | of the Columbia was created in June 1816 , and... | History of Idaho | History of Idaho\nof the Columbia was created ... |
| 3 | 4.0 | Canyon , he concluded that water transport was... | History of Idaho | History of Idaho\nCanyon , he concluded that w... |
| 4 | 5.0 | 1842 , Father Pierre - Jean De Smet , with Fr.... | History of Idaho | History of Idaho\n1842 , Father Pierre - Jean ... |


### Create an embedding function for VectorStore

Note that you can supply a custom embedding function to be used with Elasticsearch. Retrieval quality may differ depending on the embedding model used.

In [18]:
from ibm_watsonx_ai.foundation_models import Embeddings

embeddings = Embeddings(
    model_id=client.foundation_models.EmbeddingModels.SLATE_125M_ENGLISH_RTRVR_V2,
    credentials=credentials,
    project_id=project_id,
)

<a id="elasticsearch"></a>
## Set up connectivity information to Elasticsearch

**This notebook focuses on a self-managed cluster using <a href="https://cloud.ibm.com/docs/databases-for-elasticsearch?topic=databases-for-elasticsearch-getting-started" target="_blank" rel="noopener noreferrer">IBM Cloud® Databases for Elasticsearch</a>.**

The following cells retrieve the Elasticsearch user name, password, host, port, and certificate from the environment if available, and prompt you for them otherwise.

Alternatively, you can provide the ID of an existing connection asset, and all required connection data will be read from it. The connection asset must exist in your space, so first set the client's default space and then create (or look up) the connection.

In [19]:
client.set.default_space(space_id)

Unsetting the project_id ...


'SUCCESS'

Create Elasticsearch connection

In [20]:
es_connection_id = input(
    "Provide the ID of a connection asset in your space, or leave empty to type credentials by hand, and hit enter: "
)

if not es_connection_id:
    try:
        esuser = os.environ["ESUSER"]
    except KeyError:
        esuser = input("Please enter your Elasticsearch user name and hit enter: ")

    try:
        espassword = os.environ["ESPASSWORD"]
    except KeyError:
        espassword = getpass.getpass(
            "Please enter your Elasticsearch password and hit enter: "
        )

    try:
        eshost = os.environ["ESHOST"]
    except KeyError:
        eshost = input("Please enter your Elasticsearch hostname and hit enter: ")

    try:
        esport = os.environ["ESPORT"]
    except KeyError:
        esport = input("Please enter your Elasticsearch port number and hit enter: ")

    try:
        esca = os.environ["ESCA"]
    except KeyError:
        esca = input(
            "Please enter your Elasticsearch certificate contents (base64 encoded) and hit enter: "
        )

    elasticsearch_data_source_type_id = (
        client.connections.get_datasource_type_id_by_name("elasticsearch")
    )

    details = client.connections.create(
        {
            client.connections.ConfigurationMetaNames.NAME: "ES Connection",
            client.connections.ConfigurationMetaNames.DESCRIPTION: "connection description",
            client.connections.ConfigurationMetaNames.DATASOURCE_TYPE: elasticsearch_data_source_type_id,
            client.connections.ConfigurationMetaNames.PROPERTIES: {
                "url": f"{eshost}:{esport}",
                "username": esuser,
                "password": espassword,
                "use_anonymous_access": "false",
                "ssl_certificate": esca,
            },
        }
    )

    es_connection_id = client.connections.get_id(details)

es_connection_id

'73962db1-9b6c-490c-8d63-4ea64fc067cf'

<a id="vectorstore"></a>
## Set up `VectorStore` with Elasticsearch credentials 

Create a `VectorStore` instance that automatically detects the database type (in our case, Elasticsearch) and allows us to add, search, and delete documents.

It works as a wrapper for LangChain `VectorStore` classes. You can customize its settings as long as the underlying class supports them. Consult the LangChain documentation for more information about the <a href="https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.elasticsearch.ElasticsearchStore.html" target="_blank" rel="noopener noreferrer">`ElasticsearchStore`</a> connector.

Provide the name of your Elasticsearch index for subsequent operations:

In [21]:
index_name = input("Please enter Elasticsearch index name and hit enter: ")

In [22]:
from ibm_watsonx_ai.foundation_models.extensions.rag import VectorStore

vector_store = VectorStore(
    client=client,
    embeddings=embeddings,
    connection_id=es_connection_id,
    index_name=index_name,
)

<a id="elasticsearchstore_index"></a>
### Embed and index documents with Elasticsearch

**Note:** This step could take several minutes if you don't have pre-built indices.

In [23]:
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

texts = documents.text.tolist()
metadata_dicts = [
    {"title": title, "id": doc_id}
    for (title, doc_id) in zip(documents.title, documents.id)
]
docs_to_add = [
    Document(page_content=text, metadata=metadata)
    for text, metadata in zip(texts, metadata_dicts)
]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=10)
docs_to_add_split = text_splitter.split_documents(docs_to_add)

ids = vector_store.add_documents(docs_to_add_split)

Verify the number of documents loaded into the Elasticsearch index.

In [24]:
doc_count = vector_store.count()
doc_count

3059
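
The count is higher than the 1000 source passages because the splitter cut each passage into smaller chunks before indexing. To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window sketch. It is not the `RecursiveCharacterTextSplitter` algorithm, which prefers to break on separators such as newlines and spaces, so its chunk boundaries and counts will differ; `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 10) -> list[str]:
    """Cut text into windows of at most chunk_size characters,
    where consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "x" * 1200
print([len(chunk) for chunk in split_with_overlap(sample)])  # [500, 500, 220]
```

A small overlap keeps a few characters of shared context between neighbouring chunks, which reduces the chance that a relevant phrase is cut exactly at a chunk boundary.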

Run a sample similarity search to check the index. The query is embedded with the same embedding model as the passages, and the `k` most similar chunks are returned.

In [25]:
vector_store.search("United States of America", k=5)

[Document(metadata={'title': 'British colonization of the Americas', 'id': 521.0}, page_content="the United States of America , which was recognised internationally with the signing of the Treaty of Paris on 3 September 1783 . Great Britain also colonised the west coast of North America , indirectly via the Hudson 's Bay Company licenses west of the Rocky Mountains : the Columbia District and New Caledonia fur district . Most of these were jointly claimed as the Oregon Country by the United"),
 Document(metadata={'title': 'Chicago Fire (season 6)', 'id': 769.0}, page_content="Us '' April 26 , 2018"),
 Document(metadata={'title': 'Founding Fathers of the United States', 'id': 893.0}, page_content='of Congress further identifies the Articles of Confederation , also preserved at NARA , as a primary U.S. document . The Articles of Confederation served as the first constitution of the United States until its replacement by the present Constitution on March 4 , 1789 . Signatories of the Cont

<a id="rag-function"></a>
## Create RAG AI service

### Create AI service

Prepare the function that will be deployed as an AI service. Specify the default parameters that will be passed to the function.

In [None]:
prompt_template_text = (
    "Use the following pieces of documents to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, don't try to make up an answer. "
    "Use one sentence maximum. Keep the answer as concise as possible. "
    "Do not include question in your response. "
    "Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. "
    "Please ensure that your responses are socially unbiased and positive in nature. "
    "Please provide a concise professional response.\n\n"
    "{reference_documents}\nQuestion:{question}\nAnswer:"
)

vector_store_config_dict = {
    "connection_id": es_connection_id,
    "embeddings": {
        "__class__": "Embeddings",
        "__module__": "ibm_watsonx_ai.foundation_models.embeddings.embeddings",
        "model_id": "ibm/slate-125m-english-rtrvr-v2",
    },
    "index_name": index_name,
    "datasource_type": "elasticsearch",
}

retriever_config_dict = {"method": "simple", "number_of_chunks": 5}

# Parameter `model_max_input_tokens` may vary depending on which models were used.
# To check the max input tokens for a given `model_id`, run:
# `client.foundation_models.get_model_specs(model_id)["model_limits"]["max_sequence_length"]`
model_max_input_tokens = 128000

def deployable_load_balancing_rag_ai_service(context, url=credentials.url, vector_store_config_dict=vector_store_config_dict, retriever_config_dict=retriever_config_dict, model=model_alias, prompt_template_text=prompt_template_text, model_max_input_tokens=model_max_input_tokens, context_template_text=""): # fmt: skip
    from ibm_watsonx_ai import APIClient, Credentials
    from ibm_watsonx_ai.gateway import Gateway
    from ibm_watsonx_ai.foundation_models.extensions.rag import Retriever, VectorStore
    from ibm_watsonx_ai.foundation_models.extensions.rag.pattern.prompt_builder import (
        build_prompt,
    )

    api_client = APIClient(
        credentials=Credentials(url=url, token=context.generate_token()),
        space_id=context.get_space_id(),
    )

    gateway = Gateway(api_client=api_client)

    vector_store = VectorStore.from_dict(api_client, vector_store_config_dict)

    retriever = Retriever.from_vector_store(
        vector_store=vector_store, init_parameters=retriever_config_dict
    )

    def generate(context):
        api_client.set_token(context.get_token())

        payload = context.get_json()
        question = payload["question"]

        retrieved_docs = retriever.retrieve(query=question)
        reference_documents = [doc.page_content for doc in retrieved_docs]

        prompt_input_text = build_prompt(
            prompt_template_text=prompt_template_text,
            context_template_text=context_template_text,
            question=question,
            reference_documents=reference_documents,
            model_max_input_tokens=model_max_input_tokens,
        )

        response = gateway.completions.create(
            model=model,
            prompt=prompt_input_text,
            decoding_method="greedy",
            min_tokens=1,
            max_tokens=200,
        )

        return {
            "body": {
                "model": response["model"],
                "answer": response["choices"][0]["text"],
                "reference_documents": [
                    {"page_content": doc.page_content, "metadata": doc.metadata}
                    for doc in retrieved_docs
                ],
            }
        }

    return generate
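
To make the role of the `{reference_documents}` and `{question}` placeholders in `prompt_template_text` concrete, here is a simplified stand-in for the prompt-building step. It is an assumption-laden sketch: the SDK's `build_prompt` also accepts `context_template_text` and `model_max_input_tokens`, and how it formats or truncates passages is not shown in this notebook. `fill_prompt` is a hypothetical name.

```python
def fill_prompt(template: str, question: str, reference_documents: list[str]) -> str:
    """Join the retrieved passages and substitute both placeholders in the template."""
    return template.format(
        reference_documents="\n".join(reference_documents),
        question=question,
    )


# Shortened template with the same two placeholders as prompt_template_text.
template = "{reference_documents}\nQuestion:{question}\nAnswer:"

prompt = fill_prompt(
    template,
    question="when did bucharest become the capital of romania?",
    reference_documents=["passage one", "passage two"],
)
print(prompt)
```

The resulting string is what the service sends to the model alias, so any of the aliased models may receive the same prompt.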

### Test the function locally

To test our solution we can query the function locally without deploying.

In [27]:
questions_and_answers = {
    "what are the names of founding fathers of the united states?": "Thomas Jefferson::James Madison::John Jay::George Washington::John Adams::Benjamin Franklin::Alexander Hamilton",
    "which teams played in the super bowl in 2013?": "Baltimore Ravens::San Francisco 49ers",
    "when did bucharest become the capital of romania?": "1862",
}

Define a helper function for formatting the response:

In [28]:
from ibm_watsonx_ai.foundation_models.extensions.rag.utils import verbose_search
from IPython.display import display, Markdown


def print_rag_response(question, response):
    display(Markdown(f"**Model**: `{response['model']}`"))
    verbose_search(
        question,
        [Document(**d) for d in response["reference_documents"]],
    )
    display(Markdown(f"**Answer:** {response['answer']}"))

Create AI service function

In [29]:
from ibm_watsonx_ai.deployments import RuntimeContext

context = RuntimeContext(api_client=client)
local_function = deployable_load_balancing_rag_ai_service(context=context)

Validate response correctness

In [30]:
for question in questions_and_answers:
    context.request_payload_json = {"question": question}
    response = local_function(context)
    print_rag_response(question, response["body"])

**Model**: `meta-llama/llama-3-3-70b-instruct`

**Question:** what are the names of founding fathers of the united states?

| | page_content | id | title |
|---|---|---|---|
| 0 | Founding Fathers of the United States - wikipe... | 878.0 | Founding Fathers of the United States |
| 1 | further groupings of Founding Fathers include ... | 879.0 | Founding Fathers of the United States |
| 2 | of Independence . The term Founding Fathers is... | 879.0 | Founding Fathers of the United States |
| 3 | two were Lutherans , two were Dutch Reformed ,... | 889.0 | Founding Fathers of the United States |
| 4 | , the only person who signed all four U.S. his... | 883.0 | Founding Fathers of the United States |


**Answer:**  Names include John Adams, Benjamin Franklin, John Jay, Thomas Jefferson, and others.

**Model**: `meta-llama/llama-3-3-70b-instruct`

**Question:** which teams played in the super bowl in 2013?

| | page_content | id | title |
|---|---|---|---|
| 0 | for the 2012 season . The Ravens defeated the ... | 819.0 | Super Bowl XLVII |
| 1 | Schedule : NFL Super Bowl XLVII -- 2 / 3 '' . ... | 863.0 | Super Bowl XLVII |
| 2 | in Super Bowl XXX ) , Detroit Lions ( never ap... | 833.0 | Super Bowl XLVII |
| 3 | representatives . Baltimore defeated the Colts... | 833.0 | Super Bowl XLVII |
| 4 | downs at their 10 - yard line to secure the vi... | 832.0 | Super Bowl XLVII |


**Answer:**  Ravens and 49ers.

**Model**: `meta-llama/llama-3-3-70b-instruct`

**Question:** when did bucharest become the capital of romania?

| | page_content | id | title |
|---|---|---|---|
| 0 | documents in 1459 . It became the capital of R... | 944.0 | Bucharest |
| 1 | destroying a third of the city . Ottoman massa... | 948.0 | Bucharest |
| 2 | Bucharest - wikipedia Bucharest This article c... | 942.0 | Bucharest |
| 3 | . I.C. Brătianu Boulevard in the 1930s Between... | 948.0 | Bucharest |
| 4 | project . Bucharest ( / ˈb ( j ) uː kərɛst / ;... | 943.0 | Bucharest |


**Answer:**  1862.

Validate load balancing

In [31]:
import asyncio
from collections import Counter


async def send_requests(function, context):
    responses: list[dict] = []

    for _ in range(10):
        tasks = [asyncio.to_thread(function, context) for _ in range(4)]
        responses.extend(await asyncio.gather(*tasks))

    return responses


loop = asyncio.get_event_loop()

context.request_payload_json = {
    "question": "what are the names of founding fathers of the united states?"
}
responses = await loop.create_task(
    send_requests(function=local_function, context=context)
)

Counter(map(lambda x: x["body"]["model"], responses))

Counter({'meta-llama/llama-3-3-70b-instruct': 21,
         'mistralai/mistral-large': 14,
         'ibm/granite-3-8b-instruct': 5})

As demonstrated, out of 40 requests sent to the RAG AI service:
- 21 of them were handled by `meta-llama/llama-3-3-70b-instruct`,
- 14 of them were handled by `mistralai/mistral-large`,
- 5 of them were handled by `ibm/granite-3-8b-instruct`.
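
To compare runs more easily, the raw counts can be turned into shares. The sketch below hard-codes the counts reported above; in the notebook you would pass the `Counter` produced by the previous cell instead. This notebook does not state which balancing policy Model Gateway applies, so the shares are observations from this single run, not a guaranteed ratio.

```python
from collections import Counter

# Model counts copied from the local load-balancing run above.
counts = Counter(
    {
        "meta-llama/llama-3-3-70b-instruct": 21,
        "mistralai/mistral-large": 14,
        "ibm/granite-3-8b-instruct": 5,
    }
)

total = sum(counts.values())
# most_common() orders models from most to least frequently used.
shares = {model: count / total for model, count in counts.most_common()}

for model, share in shares.items():
    print(f"{model}: {share:.1%}")
```

For this run the shares are 52.5%, 35.0%, and 12.5%.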

<a id="deploy"></a>
## Create RAG AI service deployment

In order to deploy the RAG AI service, a custom software specification with `ibm-watsonx-ai[rag]>=1.3.25` must be created, because the service defined above uses features not available in earlier versions of the SDK.

### Create custom software specification containing a custom version of `ibm-watsonx-ai` SDK

Define `requirements.txt` file for package extension

In [32]:
requirements_txt = "ibm-watsonx-ai[rag]>=1.3.25"

with open("requirements.txt", "w") as file:
    file.write(requirements_txt)

Get the ID of base software specification

In [33]:
base_software_specification_id = client.software_specifications.get_id_by_name(
    "runtime-24.1-py3.11"
)

Store the package extension

In [34]:
meta_props = {
    client.package_extensions.ConfigurationMetaNames.NAME: "RAG Model Gateway package extension",
    client.package_extensions.ConfigurationMetaNames.DESCRIPTION: "Package extension with ibm-watsonx-ai RAG flavour and Model Gateway functionality enabled",
    client.package_extensions.ConfigurationMetaNames.TYPE: "requirements_txt",
}

package_extension_details = client.package_extensions.store(
    meta_props, file_path="requirements.txt"
)
package_extension_id = client.package_extensions.get_id(package_extension_details)

Creating package extensions
SUCCESS


Create a new software specification with the created package extension

In [35]:
meta_props = {
    client.software_specifications.ConfigurationMetaNames.NAME: "RAG Model Gateway software specification",
    client.software_specifications.ConfigurationMetaNames.DESCRIPTION: "Software specification for RAG and Model Gateway",
    client.software_specifications.ConfigurationMetaNames.BASE_SOFTWARE_SPECIFICATION: {
        "guid": base_software_specification_id
    },
}

software_specification_details = client.software_specifications.store(meta_props)
software_specification_id = client.software_specifications.get_id(
    software_specification_details
)

client.software_specifications.add_package_extension(
    software_specification_id, package_extension_id
)

SUCCESS


'SUCCESS'

### Deploy RAG AI service

Store the AI service function defined above, using the previously created software specification

In [36]:
meta_props = {
    client.repository.FunctionMetaNames.NAME: "RAG AI service with Model Gateway",
    client.repository.FunctionMetaNames.DESCRIPTION: "RAG AI service with load balancing using Model Gateway",
    client.repository.FunctionMetaNames.SOFTWARE_SPEC_UID: software_specification_id,
}

ai_service_details = client.repository.store_ai_service(
    deployable_load_balancing_rag_ai_service, meta_props
)

ai_service_id = client.repository.get_ai_service_id(ai_service_details)

Create an online deployment of the stored AI service

In [37]:
meta_props = {
    client.deployments.ConfigurationMetaNames.NAME: "RAG function deployment",
    client.deployments.ConfigurationMetaNames.ONLINE: {},
}

deployment_details = client.deployments.create(ai_service_id, meta_props=meta_props)



######################################################################################

Synchronous deployment creation for id: '6c3534b0-bce4-4473-ad0d-839e548c61b4' started

######################################################################################


initializing
Note: online_url and serving_urls are deprecated and will be removed in a future release. Use inference instead.
.....
ready


-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='b130314d-ad30-4691-95c6-29eea2aff721'
-----------------------------------------------------------------------------------------------




Obtain the `deployment_id` of the previously created deployment.

In [38]:
deployment_id = client.deployments.get_id(deployment_details)

### Execute the AI service

Validate response correctness

In [39]:
answer_by_question: dict[str, str] = {}

for question in questions_and_answers:
    # The deployed service receives the payload directly; no local context is needed.
    response = client.deployments.run_ai_service(deployment_id, {"question": question})
    answer_by_question[question] = response["answer"]
    print_rag_response(question, response)

**Model**: `meta-llama/llama-3-3-70b-instruct`

**Question:** what are the names of founding fathers of the united states?

| | page_content | id | title |
|---|---|---|---|
| 0 | Founding Fathers of the United States - wikipe... | 878.0 | Founding Fathers of the United States |
| 1 | further groupings of Founding Fathers include ... | 879.0 | Founding Fathers of the United States |
| 2 | of Independence . The term Founding Fathers is... | 879.0 | Founding Fathers of the United States |
| 3 | two were Lutherans , two were Dutch Reformed ,... | 889.0 | Founding Fathers of the United States |
| 4 | , the only person who signed all four U.S. his... | 883.0 | Founding Fathers of the United States |


**Answer:**  Names include John Adams, Benjamin Franklin, John Jay, Thomas Jefferson, and others.

**Model**: `meta-llama/llama-3-3-70b-instruct`

**Question:** which teams played in the super bowl in 2013?

| | page_content | id | title |
|---|---|---|---|
| 0 | for the 2012 season . The Ravens defeated the ... | 819.0 | Super Bowl XLVII |
| 1 | Schedule : NFL Super Bowl XLVII -- 2 / 3 '' . ... | 863.0 | Super Bowl XLVII |
| 2 | in Super Bowl XXX ) , Detroit Lions ( never ap... | 833.0 | Super Bowl XLVII |
| 3 | representatives . Baltimore defeated the Colts... | 833.0 | Super Bowl XLVII |
| 4 | downs at their 10 - yard line to secure the vi... | 832.0 | Super Bowl XLVII |


**Answer:**  Ravens and 49ers.

**Model**: `meta-llama/llama-3-3-70b-instruct`

**Question:** when did bucharest become the capital of romania?

| | page_content | id | title |
|---|---|---|---|
| 0 | documents in 1459 . It became the capital of R... | 944.0 | Bucharest |
| 1 | destroying a third of the city . Ottoman massa... | 948.0 | Bucharest |
| 2 | Bucharest - wikipedia Bucharest This article c... | 942.0 | Bucharest |
| 3 | . I.C. Brătianu Boulevard in the 1930s Between... | 948.0 | Bucharest |
| 4 | project . Bucharest ( / ˈb ( j ) uː kərɛst / ;... | 943.0 | Bucharest |


**Answer:**  1862.

Validate load balancing

In [40]:
async def send_requests(question):
    responses: list[dict] = []

    for _ in range(10):
        tasks = [
            asyncio.to_thread(
                client.deployments.run_ai_service, deployment_id, {"question": question}
            )
            for _ in range(4)
        ]

        responses.extend(await asyncio.gather(*tasks))
        await asyncio.sleep(1)

    return responses


loop = asyncio.get_event_loop()
responses = await loop.create_task(
    send_requests(
        question="what are the names of founding fathers of the united states?"
    )
)

Counter(map(lambda x: x["model"], responses))

Counter({'meta-llama/llama-3-3-70b-instruct': 26,
         'mistralai/mistral-large': 12,
         'ibm/granite-3-8b-instruct': 2})

As demonstrated, out of 40 requests sent to the RAG AI service:
- 26 of them were handled by `meta-llama/llama-3-3-70b-instruct`,
- 12 of them were handled by `mistralai/mistral-large`,
- 2 of them were handled by `ibm/granite-3-8b-instruct`.

<a id="evaluate"></a>
## Calculate rougeL metric
Calculate the rougeL recall score to verify that the expected answer is present in the generated response.

In [41]:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = [
    scorer.score(questions_and_answers[question], answer_by_question[question])
    for question in questions_and_answers
]
mean_rougeL_recall_score = sum([s["rougeL"].recall for s in scores]) / len(
    questions_and_answers
)

print(f"Mean rougeL recall score: {mean_rougeL_recall_score}")

Mean rougeL recall score: 0.5619047619047619
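
To see what the rougeL recall measures, the sketch below recomputes it in plain Python: the length of the longest common subsequence (LCS) of tokens shared by the reference and the prediction, divided by the number of reference tokens. It is a simplified illustration, not the `rouge_score` implementation: it assumes lowercase alphanumeric tokenization and applies no stemming, and the function names are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    # Lowercase and keep only alphanumeric runs; no stemming in this sketch.
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_recall(reference: str, prediction: str) -> float:
    """Share of reference tokens that appear, in order, in the prediction."""
    ref_tokens = tokenize(reference)
    if not ref_tokens:
        return 0.0
    return lcs_length(ref_tokens, tokenize(prediction)) / len(ref_tokens)


# Two of the five reference tokens ("ravens", "49ers") are matched in order.
print(rouge_l_recall("Baltimore Ravens::San Francisco 49ers", "Ravens and 49ers."))  # 0.4
```

Recall rewards answers that contain the expected items and does not penalize extra words, which is why a short but complete answer such as `1862.` scores 1.0 while an answer that lists only some of the expected names scores lower.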


<a id="summary"></a>
## Summary and next steps

You successfully completed this notebook!

Check out our _<a href="https://ibm.github.io/watsonx-ai-python-sdk/samples.html" target="_blank" rel="noopener noreferrer">Online Documentation</a>_ for more samples, tutorials, documentation, how-tos, and blog posts.

### Authors:
**Rafał Chrzanowski**, Software Engineer Intern at watsonx.ai

Copyright © 2025 IBM. This notebook and its source code are released under the terms of the MIT License.