![image](https://raw.githubusercontent.com/IBM/watsonx-ai-samples/master/cloud/notebooks/headers/watsonx-Prompt_Lab-Notebook.png)
# AutoAI RAG experiment with a custom foundation model

#### Disclaimers

- Use only Projects and Spaces that are available in the watsonx context.


## Notebook content

This notebook demonstrates how to deploy a custom foundation model and use it in an AutoAI RAG experiment.
The data used in this notebook is from the [Granite Code Models paper](https://arxiv.org/pdf/2405.04324).

Some familiarity with Python is helpful. This notebook uses Python 3.12.


## Learning goal

The learning goals of this notebook are:

- Deploy your own foundation models obtained from the Hugging Face Hub.
- Create an AutoAI RAG job that finds the best RAG pattern for the custom foundation model used in the experiment.


## Contents

This notebook contains the following parts:
- [Set up the environment](#Set-up-the-environment)
- [Prerequisites](#Prerequisites)
- [Create API Client instance.](#Create-API-Client-instance.)
- [Create custom PVC for custom foundation model](#Create-custom-PVC-for-custom-foundation-model)
- [Deploy the model](#Deploy-the-model)
- [Prepare the data for the AutoAI RAG experiment](#Prepare-the-data-for-the-AutoAI-RAG-experiment)
- [Run the AutoAI RAG experiment](#Run-the-AutoAI-RAG-experiment)
- [Query generated pattern locally](#Query-generated-pattern-locally)
- [Summary](#Summary)

## Set up the environment

In [None]:
%pip install -U wget | tail -n 1
%pip install -U 'ibm-watsonx-ai[rag]>=1.3.26' | tail -n 1

Note: you may need to restart the kernel to use updated packages.


<a id="prerequisites"></a>

## Prerequisites
Fill in the values below to proceed:
- URL - the URL of your CPD instance
- USERNAME - your username on the CPD instance
- PASSWORD - the password associated with your username
- INSTANCE_ID - your CPD instance ID
- VERSION - the CPD version of your instance; it must be at least `5.2`
- PROJECT_ID - ID of the project associated with your username and instance

In [None]:
URL = "PUT YOUR CPD INSTANCE URL HERE"
USERNAME = "PUT YOUR USERNAME HERE"
PASSWORD = "PUT YOUR PASSWORD HERE"
INSTANCE_ID = "PUT YOUR INSTANCE ID HERE"
VERSION = "5.2"

PROJECT_ID = "PUT YOUR PROJECT ID HERE"

BUCKET_BENCHMARK_JSON_FILE_PATH = "benchmark.json"
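
As an alternative to typing credentials into the notebook, the values above could be read from environment variables. The following is a minimal sketch, not part of the original notebook: the variable names (`CPD_URL`, `CPD_USERNAME`, and so on) and the `load_settings` helper are hypothetical and should be adapted to your own conventions.

```python
import os

# Hypothetical environment variable names; adjust to your own conventions.
REQUIRED_SETTINGS = (
    "CPD_URL",
    "CPD_USERNAME",
    "CPD_PASSWORD",
    "CPD_INSTANCE_ID",
    "CPD_PROJECT_ID",
)


def load_settings(env=None):
    """Return the required settings, raising if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise ValueError("Missing settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}


# Demonstration with an explicit mapping instead of the real environment.
example_env = {name: "example-" + name.lower() for name in REQUIRED_SETTINGS}
settings = load_settings(example_env)
print(sorted(settings))
```

Calling `load_settings()` without arguments reads from `os.environ`, so a missing variable fails early instead of surfacing later as an authentication error.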

## Create API Client instance.
This client will allow us to connect with the IBM services.

In [3]:
from ibm_watsonx_ai import APIClient, Credentials

credentials = Credentials(
    url=URL,
    username=USERNAME,
    password=PASSWORD,
    instance_id=INSTANCE_ID,
    version=VERSION,
    verify=False,  # disables TLS certificate verification; use only with clusters you trust
)

client = APIClient(credentials=credentials, project_id=PROJECT_ID)

## Create custom PVC for custom foundation model

Once your model files are in your COS bucket, you need to create your custom model specification. <br />
This step is beyond the scope of this notebook. Refer to the [documentation](https://www.ibm.com/docs/en/software-hub/5.2.x?topic=setup-deploying-custom-foundation-models) for how to make your custom model available on your CPD cluster.

## Deploy the model
To avoid problems during model deployment, check the [custom models documentation](https://ibm.github.io/watsonx-ai-python-sdk/fm_custom_models.html#id2) first.

### Create custom model repository

In [7]:
from ibm_watsonx_ai.foundation_models import get_custom_model_specs

custom_model_spec = get_custom_model_specs(api_client=client, limit=1)
model_id = custom_model_spec["resources"][0]["model_id"]
model_id

'deepseek-r1-distill-llama-8b'
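
The cell above takes the first entry of the returned specifications. If several custom models are registered, you may prefer to pick one by name. The sketch below assumes only the response shape used above (a dict with a `"resources"` list whose items carry a `"model_id"`); the `find_model_id` helper and the second model name are hypothetical.

```python
def find_model_id(specs, wanted=None):
    """Pick a model_id from a specs dict shaped like {"resources": [{"model_id": ...}]}.

    With wanted=None the first entry is returned, mirroring the cell above.
    """
    resources = specs.get("resources", [])
    if not resources:
        raise LookupError("No custom foundation model specifications found")
    if wanted is None:
        return resources[0]["model_id"]
    for resource in resources:
        if resource.get("model_id") == wanted:
            return resource["model_id"]
    raise LookupError(f"Model {wanted!r} is not among the registered specifications")


# Illustrative data only; "another-hypothetical-model" is made up.
example_specs = {
    "resources": [
        {"model_id": "deepseek-r1-distill-llama-8b"},
        {"model_id": "another-hypothetical-model"},
    ]
}
first_id = find_model_id(example_specs)
named_id = find_model_id(example_specs, "another-hypothetical-model")
print(first_id, named_id)
```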

In [11]:
sw_spec_id = client.software_specifications.get_id_by_name("watsonx-cfm-caikit-1.0")
sw_metadata = {
    client.repository.ModelMetaNames.NAME: "My custom deployment",
    client.repository.ModelMetaNames.SOFTWARE_SPEC_ID: sw_spec_id,
    client.repository.ModelMetaNames.TYPE: client.repository.ModelAssetTypes.CUSTOM_FOUNDATION_MODEL_1_0,
}

In [None]:
stored_model_details = client.repository.store_model(
    model=model_id, meta_props=sw_metadata
)
stored_model_asset_id = client.repository.get_model_id(stored_model_details)
client.repository.list(framework_filter="custom_foundation_model_1.0")

Unnamed: 0,ID,NAME,CREATED,FRAMEWORK,TYPE,SPEC_STATE,SPEC_REPLACEMENT
0,e2a38f94-31c5-4106-880b-f41423fd42db,My custom deployment,2025-06-27T13:35:53Z,custom_foundation_model_1.0,model,supported,


### Create custom foundation model hardware specification

In [14]:
hardware_spec_meta_props = {
    client.hardware_specifications.ConfigurationMetaNames.NAME: "Custom GPU hw spec",
    client.hardware_specifications.ConfigurationMetaNames.NODES: {
        "cpu": {"units": "2"},
        "mem": {"size": "128Gi"},
        "gpu": {"num_gpu": 1},
    },
}

hw_spec_details = client.hardware_specifications.store(hardware_spec_meta_props)

In [16]:
hw_spec_details = client.hardware_specifications.get_details(
    client.hardware_specifications.get_id_by_name("Custom GPU hw spec")
)
hw_spec_id = client.hardware_specifications.get_id(hw_spec_details)

### Perform custom model deployment

In [18]:
MAX_SEQUENCE_LENGTH = 32_000
MAX_NEW_TOKENS = 1000
MIN_NEW_TOKENS = 1
MAX_BATCH_SIZE = 1024
metadata = {
    client.deployments.ConfigurationMetaNames.NAME: "My custom foundation model",
    client.deployments.ConfigurationMetaNames.DESCRIPTION: "My custom foundation model",
    client.deployments.ConfigurationMetaNames.ONLINE: {},
    client.deployments.ConfigurationMetaNames.HARDWARE_SPEC: {
        "name": "Custom GPU hw spec"
    },
    # optionally overwrite model parameters here
    client.deployments.ConfigurationMetaNames.FOUNDATION_MODEL: {
        "max_sequence_length": MAX_SEQUENCE_LENGTH,
        "max_new_tokens": MAX_NEW_TOKENS,
        "max_batch_size": MAX_BATCH_SIZE,
    },
    client.deployments.ConfigurationMetaNames.SERVING_NAME: "pllum_12b_instruct",  # must be unique
}
deployment_details = client.deployments.create(stored_model_asset_id, metadata)
deployment_id = deployment_details["metadata"]["id"]



######################################################################################

Synchronous deployment creation for id: 'e2a38f94-31c5-4106-880b-f41423fd42db' started

######################################################################################


initializing
Note: online_url and serving_urls are deprecated and will be removed in a future release. Use inference instead.
..........
ready


-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='30702554-42cc-43e8-82a5-5dc7826d719d'
-----------------------------------------------------------------------------------------------
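
The deployment parameters used above imply a limit on how long a prompt can be. The sketch below assumes, as is common for text generation servers, that `max_sequence_length` covers prompt tokens and generated tokens together; that interpretation is an assumption and is not confirmed by this notebook. The `prompt_token_budget` helper is hypothetical.

```python
MAX_SEQUENCE_LENGTH = 32_000
MAX_NEW_TOKENS = 1000


def prompt_token_budget(max_sequence_length, max_new_tokens):
    """Tokens left for the prompt if the sequence limit covers prompt plus output (assumption)."""
    if max_new_tokens >= max_sequence_length:
        raise ValueError("max_new_tokens must be smaller than max_sequence_length")
    return max_sequence_length - max_new_tokens


budget = prompt_token_budget(MAX_SEQUENCE_LENGTH, MAX_NEW_TOKENS)
print(budget)  # 31000
```

Under that assumption, retrieved chunks plus the question and template text need to fit in the remaining budget, which is worth keeping in mind when choosing chunk sizes and the number of retrieved chunks.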




## Prepare the data for the AutoAI RAG experiment

### Download `granite_code_models.pdf` document

In [5]:
import wget

data_url = "https://arxiv.org/pdf/2405.04324"

byom_input_filename = "granite_code_models.pdf"

wget.download(data_url, byom_input_filename)

'granite_code_models.pdf'
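
A download can succeed while returning an HTML error page instead of the PDF. A quick check of the file's leading bytes catches this before the file is uploaded as a data asset. This is a minimal sketch using only the standard library; the `looks_like_pdf` helper is hypothetical and is demonstrated on throwaway files rather than the real download.

```python
import os
import tempfile

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Return True if the file starts with the PDF magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC


# Demonstration on temporary files.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")
    with open(good, "wb") as fh:
        fh.write(b"%PDF-1.7\n")
    with open(bad, "wb") as fh:
        fh.write(b"<html>error page</html>")
    good_result = looks_like_pdf(good)
    bad_result = looks_like_pdf(bad)

print(good_result, bad_result)
```

In the notebook you could call `looks_like_pdf(byom_input_filename)` right after the download.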

### Create data asset with your training data

In [6]:
document_asset_details = client.data_assets.create(name=byom_input_filename, file_path=byom_input_filename)

document_asset_id = client.data_assets.get_id(document_asset_details)
document_asset_id

Creating data asset...
SUCCESS


'fb05d631-8c5e-4641-bc72-6633ba06cbe2'

In [9]:
from ibm_watsonx_ai.helpers import DataConnection

input_data_references = [DataConnection(data_asset_id=document_asset_id)]

### Create your own benchmark.json file to ask the questions related to the document

In [7]:
import json

local_benchmark_json_filename = "benchmark.json"

benchmarking_data = [
    {
        "question": "What are the two main variants of Granite Code models?",
        "correct_answer": "The two main variants are Granite Code Base and Granite Code Instruct.",
        "correct_answer_document_ids": [byom_input_filename],
    },
    {
        "question": "What is the purpose of Granite Code Instruct models?",
        "correct_answer": "Granite Code Instruct models are finetuned for instruction-following tasks using datasets like CommitPack, OASST, HelpSteer, and synthetic code instruction datasets, aiming to improve reasoning and instruction-following capabilities.",
        "correct_answer_document_ids": [byom_input_filename],
    },
    {
        "question": "What is the licensing model for Granite Code models?",
        "correct_answer": "Granite Code models are released under the Apache 2.0 license, ensuring permissive and enterprise-friendly usage.",
        "correct_answer_document_ids": [byom_input_filename],
    },
]

with open(local_benchmark_json_filename, mode="w", encoding="utf-8") as fp:
    json.dump(benchmarking_data, fp, indent=4)
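
Before uploading the benchmark file, it can help to confirm that every record has the three keys used above. The following sketch checks only that structure; the `validate_benchmark` helper is hypothetical and is not part of the SDK.

```python
REQUIRED_KEYS = {"question", "correct_answer", "correct_answer_document_ids"}


def validate_benchmark(records):
    """Return a list of human-readable problems; an empty list means the records look fine."""
    problems = []
    for index, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append(f"record {index}: missing keys {sorted(missing)}")
        elif not record["correct_answer_document_ids"]:
            problems.append(f"record {index}: no document ids given")
    return problems


# Illustrative records: the first is well formed, the second lacks an answer.
sample_records = [
    {
        "question": "What are the two main variants of Granite Code models?",
        "correct_answer": "The two main variants are Granite Code Base and Granite Code Instruct.",
        "correct_answer_document_ids": ["granite_code_models.pdf"],
    },
    {
        "question": "What is the licensing model for Granite Code models?",
        "correct_answer_document_ids": ["granite_code_models.pdf"],
    },
]
problems = validate_benchmark(sample_records)
print(problems)
```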

### Create data asset with benchmark.json file

In [8]:
test_asset_details = client.data_assets.create(name=local_benchmark_json_filename, file_path=local_benchmark_json_filename)

test_asset_id = client.data_assets.get_id(test_asset_details)
test_asset_id

Creating data asset...
SUCCESS


'fc14fca8-93c4-4b5e-a72f-d19fcc0aeaf4'

In [10]:
test_data_references = [DataConnection(data_asset_id=test_asset_id)]

## Run the AutoAI RAG experiment

Provide the input information for the AutoAI RAG optimizer:
- `custom_prompt_template_text` - prompt template text used to query your own foundation model
- `custom_context_template_text` - context template text used to wrap each retrieved document in the prompt
- `name` - experiment name
- `description` - experiment description
- `max_number_of_rag_patterns` - maximum number of RAG patterns to create
- `optimization_metrics` - target optimization metrics
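
To make the role of the two templates concrete, the sketch below fills the placeholders used in this notebook (`{question}`, `{reference_documents}`, `{document}`) with plain `str.format`. How the service actually assembles the prompt is not shown in this notebook; in particular, joining the documents with a newline is an assumption, and `render_prompt` is a hypothetical helper.

```python
prompt_template = (
    "Answer my question {question} related to these documents {reference_documents}."
)
context_template = "My document {document}"


def render_prompt(question, chunks):
    """Wrap each chunk in the context template, then fill the prompt template."""
    # Assumption: rendered documents are joined with a newline.
    reference_documents = "\n".join(
        context_template.format(document=chunk) for chunk in chunks
    )
    return prompt_template.format(
        question=question, reference_documents=reference_documents
    )


rendered = render_prompt("Q?", ["A", "B"])
print(rendered)
```

Rendering a template locally like this is a cheap way to spot a misspelled placeholder before starting a long experiment run.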

In [None]:
from ibm_watsonx_ai.experiment import AutoAI
from ibm_watsonx_ai.foundation_models.schema import (
    AutoAIRAGCustomModelConfig,
    AutoAIRAGModelParams,
)

experiment = AutoAI(credentials, project_id=PROJECT_ID)

custom_prompt_template_text = (
    "Answer my question {question} related to these documents {reference_documents}."
)
custom_context_template_text = "My document {document}"

parameters = AutoAIRAGModelParams(max_sequence_length=32_000)
custom_foundation_model_config = AutoAIRAGCustomModelConfig(
    deployment_id=deployment_id,
    project_id=PROJECT_ID,
    prompt_template_text=custom_prompt_template_text,
    context_template_text=custom_context_template_text,
    parameters=parameters,
)

rag_optimizer = experiment.rag_optimizer(
    name="AutoAI RAG - Custom foundation model experiment",
    description="AutoAI RAG experiment with custom foundation model.",
    max_number_of_rag_patterns=4,
    optimization_metrics=["faithfulness"],
    foundation_models=[custom_foundation_model_config],
)

rag_optimizer.run(
    test_data_references=test_data_references,
    input_data_references=input_data_references,
    background_mode=False
)



##############################################

Running '7878e0b8-94a6-4e9a-b1b4-f3c62fa70786'

##############################################


pending....
running...........................................................................................................................................
completed
Training of '7878e0b8-94a6-4e9a-b1b4-f3c62fa70786' finished successfully.


{'entity': {'hardware_spec': {'id': 'a6c4923b-b8e4-444c-9f43-8a7ec3020110',
   'name': 'L'},
  'input_data_references': [{'location': {'href': '/v2/assets/fb05d631-8c5e-4641-bc72-6633ba06cbe2?project_id=19a157bf-5c1c-4773-812a-520d8068d84a',
     'id': 'fb05d631-8c5e-4641-bc72-6633ba06cbe2'},
    'type': 'data_asset'}],
  'parameters': {'constraints': {'generation': {'foundation_models': [{'context_template_text': 'My document {document}',
       'deployment_id': '30702554-42cc-43e8-82a5-5dc7826d719d',
       'parameters': {'max_sequence_length': 32000},
       'project_id': '19a157bf-5c1c-4773-812a-520d8068d84a',
       'prompt_template_text': 'Answer my question {question} related to these documents {reference_documents}.'}]},
    'max_number_of_rag_patterns': 4},
   'optimization': {'metrics': ['faithfulness']},
   'output_logs': True},
  'results': [{'context': {'iteration': 0,
     'max_combinations': 40,
     'rag_pattern': {'composition_steps': ['model_selection',
       'chunki

In [18]:
rag_optimizer.get_run_details()

{'entity': {'hardware_spec': {'id': 'a6c4923b-b8e4-444c-9f43-8a7ec3020110',
   'name': 'L'},
  'input_data_references': [{'location': {'href': '/v2/assets/fb05d631-8c5e-4641-bc72-6633ba06cbe2?project_id=19a157bf-5c1c-4773-812a-520d8068d84a',
     'id': 'fb05d631-8c5e-4641-bc72-6633ba06cbe2'},
    'type': 'data_asset'}],
  'parameters': {'constraints': {'generation': {'foundation_models': [{'context_template_text': 'My document {document}',
       'deployment_id': '30702554-42cc-43e8-82a5-5dc7826d719d',
       'parameters': {'max_sequence_length': 32000},
       'project_id': '19a157bf-5c1c-4773-812a-520d8068d84a',
       'prompt_template_text': 'Answer my question {question} related to these documents {reference_documents}.'}]},
    'max_number_of_rag_patterns': 4},
   'optimization': {'metrics': ['faithfulness']},
   'output_logs': True},
  'results': [{'context': {'iteration': 0,
     'max_combinations': 40,
     'rag_pattern': {'composition_steps': ['model_selection',
       'chunki

In [19]:
summary = rag_optimizer.summary()
summary

Unnamed: 0_level_0,mean_faithfulness,mean_answer_correctness,mean_context_correctness,chunking.method,chunking.chunk_size,chunking.chunk_overlap,embeddings.model_id,vector_store.distance_metric,retrieval.method,retrieval.number_of_chunks,retrieval.hybrid_ranker,generation.model_id
Pattern_Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Pattern2,0.2144,0.4815,1.0,recursive,512,256,ibm/slate-125m-english-rtrvr,cosine,window,3,,30702554-42cc-43e8-82a5-5dc7826d719d
Pattern3,0.1634,0.2256,1.0,recursive,1024,256,ibm/slate-125m-english-rtrvr,cosine,simple,5,,30702554-42cc-43e8-82a5-5dc7826d719d
Pattern1,0.072,0.5079,1.0,recursive,512,128,ibm/slate-125m-english-rtrvr,cosine,simple,3,,30702554-42cc-43e8-82a5-5dc7826d719d


In [20]:
best_pattern_name = summary.index.values[0]
print("Best pattern is:", best_pattern_name)

best_pattern = rag_optimizer.get_pattern()

Best pattern is: Pattern2
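
In the summary above, the rows appear to be ordered by the optimization metric (`faithfulness`), which is why the first index entry is taken as the best pattern. Ranking by a different metric can give a different winner. The sketch below copies two metric columns from the summary table into plain dictionaries; the `best_by` helper is hypothetical.

```python
# Values copied from the summary table above.
rows = [
    {"pattern": "Pattern2", "mean_faithfulness": 0.2144, "mean_answer_correctness": 0.4815},
    {"pattern": "Pattern3", "mean_faithfulness": 0.1634, "mean_answer_correctness": 0.2256},
    {"pattern": "Pattern1", "mean_faithfulness": 0.072, "mean_answer_correctness": 0.5079},
]


def best_by(rows, metric):
    """Return the name of the pattern with the highest value for the given metric."""
    return max(rows, key=lambda row: row[metric])["pattern"]


by_faithfulness = best_by(rows, "mean_faithfulness")
by_correctness = best_by(rows, "mean_answer_correctness")
print(by_faithfulness, by_correctness)
```

Here `Pattern2` leads on faithfulness while `Pattern1` leads on answer correctness, so the choice of optimization metric matters when the scores are this close.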


In [23]:
rag_optimizer.get_pattern_details(pattern_name=best_pattern_name)

{'composition_steps': ['model_selection',
  'chunking',
  'embeddings',
  'retrieval',
  'generation'],
 'duration_seconds': 266,
 'location': {'evaluation_results': '/projects/19a157bf-5c1c-4773-812a-520d8068d84a/assets/autorag/results/7878e0b8-94a6-4e9a-b1b4-f3c62fa70786/Pattern2/evaluation_results.json',
  'indexing_notebook': '/projects/19a157bf-5c1c-4773-812a-520d8068d84a/assets/autorag/results/7878e0b8-94a6-4e9a-b1b4-f3c62fa70786/Pattern2/indexing_inference_notebook.ipynb',
  'inference_notebook': '/projects/19a157bf-5c1c-4773-812a-520d8068d84a/assets/autorag/results/7878e0b8-94a6-4e9a-b1b4-f3c62fa70786/Pattern2/indexing_inference_notebook.ipynb',
  'inference_service_code': '/projects/19a157bf-5c1c-4773-812a-520d8068d84a/assets/autorag/results/7878e0b8-94a6-4e9a-b1b4-f3c62fa70786/Pattern2/inference_ai_service.gz',
  'inference_service_metadata': '/projects/19a157bf-5c1c-4773-812a-520d8068d84a/assets/autorag/results/7878e0b8-94a6-4e9a-b1b4-f3c62fa70786/Pattern2/inference_service_

## Query generated pattern locally

In [29]:
from ibm_watsonx_ai.deployments import RuntimeContext

runtime_context = RuntimeContext(api_client=client)
inference_service_function = best_pattern.inference_service(runtime_context)[0]



In [None]:
question = "What training objectives are used for the granite models?"

context = RuntimeContext(
    api_client=client,
    request_payload_json={"messages": [{"role": "user", "content": question}]},
)


resp = inference_service_function(context)
resp

{'body': {'choices': [{'index': 0,
    'message': {'role': 'assistant',
     'content': ' The\nclusters are equipped with 100Gbps and 200Gbps HDR InfiniBand links, respectively.\nWe utilize NVIDIA’s Megatron-LM (Shoeybi et al., 2019; Narayanan et al., 2021) for\ndistributed training, which is optimized for large language models. We use the same Megatron\nLM framework for all our models, ensuring consistency in training infrastructure.\n4.5 Model Architecture\nThe architecture of the Granite Code models is based on the original transformer architecture\n(Douglas & Smith, 2019) with modifications for code modeling. The base model has 16\nlayers, 8 attention heads, and a token embedding dimension of 512. For the 3B model, we\nuse the standard transformer architecture with a multi-head attention mechanism. The 8B model\nincorporates Grouped-Query Attention (GQA) (Ainslie et al., 2023) to improve inference\nefficiency. The 20B model uses learned absolute position embeddings and Multi-Query\

In [None]:
print(inference_service_function(context)["body"]["choices"][0]["message"]["content"])

 The
clusters are equipped with 100Gbps and 200Gbps HDR InfiniBand links, respectively.
We utilize NVIDIA’s Megatron-LM (Shoeybi et al., 2019; Narayanan et al., 2021) for
distributed training, which is optimized for large language models. We use the same Megatron
LM framework for all our models, ensuring consistency in training infrastructure.
4.5 Model Architecture
The architecture of the Granite Code models is based on the original transformer architecture
(Douglas & Smith, 2019) with modifications for code modeling. The base model has 16
layers, 8 attention heads, and a token embedding dimension of 512. For the 3B model, we
use the standard transformer architecture with a multi-head attention mechanism. The 8B model
incorporates Grouped-Query Attention (GQA) (Ainslie et al., 2023) to improve inference
efficiency. The 20B model uses learned absolute position embeddings and Multi-Query
Attention (Shazeer, 2019). The 34B model is built upon the 20B model with depth
upscaling (Kim et al
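
The chained indexing used above raises a bare `KeyError` or `IndexError` if the response has no choices. A small helper can make that failure clearer. This sketch assumes only the response shape shown in the output above (`body` → `choices` → `message` → `content`); `extract_answer` is a hypothetical helper and the example response is made up.

```python
def extract_answer(response):
    """Return the text of the first choice, stripped of surrounding whitespace."""
    choices = response.get("body", {}).get("choices", [])
    if not choices:
        raise ValueError("Response contains no choices")
    return choices[0]["message"]["content"].strip()


# Made-up response with the same shape as the output above.
example_response = {
    "body": {
        "choices": [
            {"index": 0, "message": {"role": "assistant", "content": " An example answer.\n"}}
        ]
    }
}
answer = extract_answer(example_response)
print(answer)
```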

## Summary

You successfully completed this notebook!

You learned how to use AutoAI RAG with your own foundation model.
 
Check out our _<a href="https://ibm.github.io/watsonx-ai-python-sdk/samples.html" target="_blank" rel="noopener noreferrer">Online Documentation</a>_ for more samples, tutorials, documentation, how-tos, and blog posts.

### Author:
 **Michał Steczko**, Software Engineer at watsonx.ai.

Copyright © 2025 IBM. This notebook and its source code are released under the terms of the MIT License.