# RAG with HANA Vector Store

This notebook walks you through building a **Retrieval-Augmented Generation (RAG)** application using:
- **HANA Vector Store**
- **OpenAI GPT-4o**

## Prerequisites

* **HANA Vector Store**:

    The HANA Vector Store is pre-populated with text documents, their corresponding embedding vectors, and associated metadata.

    For this tutorial, the necessary source documents have already been processed and stored in the Vector Store table `NUTRITION_SCIENCE_DATA`.

    To learn how to prepare and store data in the HANA Vector Store, including embedding generation and metadata handling, refer to the [Embeddings Best Practices](https://github.com/SAP-samples/sap-btp-ai-best-practices/tree/main/best-practices/vector-rag-embedding) guide.

## Setup

### Dependencies
Ensure all Python packages listed in the `requirements.txt` file are installed.
``` bash
pip install -r requirements.txt
```

### Environment Variables
Ensure the `.env` file contains the following environment variables.
``` env
HANA_ADDRESS=your_hana_host
HANA_PORT=your_port
HANA_USER=your_user
HANA_PASSWORD=your_password
HANA_AUTOCOMMIT=true
HANA_SSL_CERT_VALIDATE=false

AICORE_AUTH_URL=your_aicore_auth_url
AICORE_CLIENT_ID=your_aicore_client_id
AICORE_CLIENT_SECRET=your_aicore_secret
AICORE_RESOURCE_GROUP=your_resource_group
AICORE_BASE_URL=your_base_url
```
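
A missing variable usually only surfaces later as a confusing connection or authentication error, so it can help to check the environment up front. The following is a minimal sketch, not part of the original notebook: the helper name `missing_env_vars` is hypothetical, and the list of required names is taken from the template above on the assumption that the two optional-looking HANA flags can be omitted.

```python
import os

# Names copied from the .env template; treating these nine as mandatory is an assumption.
REQUIRED_VARS = [
    "HANA_ADDRESS", "HANA_PORT", "HANA_USER", "HANA_PASSWORD",
    "AICORE_AUTH_URL", "AICORE_CLIENT_ID", "AICORE_CLIENT_SECRET",
    "AICORE_RESOURCE_GROUP", "AICORE_BASE_URL",
]


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Report problems before any connection is attempted.
missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Run it after `load_dotenv()` so that values from the `.env` file are already present in `os.environ`.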

## Import Section

In [8]:
import os
from dotenv import load_dotenv
load_dotenv(override=True)

from hana_ml import ConnectionContext

from gen_ai_hub.proxy.core.proxy_clients import get_proxy_client
from gen_ai_hub.proxy.native.openai import embeddings

## View Source Dataset

Although the source dataset has already been processed and stored in the HANA Vector Store along with its embedding vectors, it is shown here to help you understand the underlying data.

The dataset is based on scientifically grounded nutrition content and is licensed under the MIT license. It was originally sourced from Hugging Face and has been enriched with synthetic examples generated by a large language model (LLM).

HuggingFace Dataset: [Science-Text-Data](https://huggingface.co/datasets/Aashi/Science-Text-Data)

Augmented Dataset: [Augmented-Science-Text-Data](https://github.com/SAP-samples/sap-btp-ai-best-practices/tree/main/best-practices/vector-rag-query/python/sample_files/)

In [20]:
import pandas as pd

# Load CSV file
csv_path = './sample_files/science-data-sample.csv'
pd.set_option('display.max_colwidth', None)
df = pd.read_csv(csv_path, low_memory=False)
df_shuffled = df.sample(frac=1, random_state=3).reset_index(drop=True)
df_shuffled.head(4)

Unnamed: 0,Topic,Difficulty Level,Category
0,"What are some common meals from different regions or states in India? Punjab's common meal includes Makki roti, Rajma, and Sarson saag. Andhra Pradesh's includes Rice, Tuar dal with rasam, and Kunduru. Punjab's common meal includes Makki roti, Rajma, and Sarson saag. Andhra Pradesh's includes Rice, Tuar dal with rasam, and Kunduru. The exploration of fundamental principles and their applications in modern research is essential. This study provides a detailed overview of current challenges, potential solutions, and future research directions. The study highlights key advancements in theoretical and applied sciences, emphasizing the importance of interdisciplinary approaches. By integrating data analytics and machine learning, researchers can gain deeper insights into previously unexplored areas.The search for extraterrestrial life is primarily focused on exoplanets—planets orbiting stars outside our solar system. The discovery of thousands of exoplanets, particularly those within the habitable zone of their stars, has raised hopes of finding life beyond Earth. The habitable zone, often referred to as the Goldilocks zone, is the region around a star where conditions allow for liquid water to exist.Astrobiologists use various methods to de",Easy,Nutrition
1,"What are some diseases caused by a deficiency of vitamins and minerals? Deficiencies can cause night blindness (Vitamin A), beriberi (Vitamin B1), scurvy (Vitamin C), rickets (Vitamin D), tooth decay (Calcium), goiter (Iodine), and anaemia (Iron). night blindness (Vitamin A), beriberi (Vitamin B1), scurvy (Vitamin C), rickets (Vitamin D), tooth decay (Calcium), goiter (Iodine), and anaemia (Iron) This research explores new frontiers in scientific discovery, delving into methodologies and experimental techniques that have significantly impacted the field. By utilizing modern computational tools, we can better understand complex systems and improve predictive modeling. This research explores new frontiers in scientific discovery, delving into methodologies and experimental techniques that have significantly impacted the field. By utilizing modern computational tools, we can better understand complex systems and improve predictive modeling.Astrobiologists use various methods to detect potential biosignatures, including spectroscopy, which analyzes the atmospheric composition of exoplanets. The presence of oxygen, methane, and other organic compounds in an exoplanet’s atmosphere could indicate biological activity. The study highlights key advancements in theoretical and applied sciences, emphasizing the importance of interdisciplinary approaches. By integrating data analytics and machine learning, researchers can gain deeper insights into previously unexplored areas. The continuous evolution of experimental techniques allows for more precise measurements and better accuracy in scientific investigations. This enhances our ability to validate hypotheses and refine theoretical models. The continuous evolution of experimental techniques allows for more precise measurements and better accuracy in scientific inves",Easy,Nutrition
2,"What are deficiency diseases? Deficiency diseases occur due to a lack of nutrients over a long period and can cause diseases or disorders in our body​​. diseases that occur due to the lack of nutrients over a long period The study highlights key advancements in theoretical and applied sciences, emphasizing the importance of interdisciplinary approaches. By integrating data analytics and machine learning, researchers can gain deeper insights into previously unexplored areas.Astrobiologists use various methods to detect potential bi",Hard,Health
3,"What are the main carbohydrates found in our food? The main carbohydrates in our food are in the form of starch and sugars​​. The main carbohydrates in our food are in the form of starch and sugars​​. With rapid technological advancements, the field continues to evolve, pushing the boundaries of innovation. By analyzing historical trends and current developments, we gain a comprehensive understanding of the implications of scientific breakthroughs. The study highlights key advancements in theoretical and applied sciences, emphasizing the importance of interdisciplinary approaches. By integrating data analytics and machine learning, researchers can gain deeper insights into previously unexplored areas. This research explores new frontiers in scientific discovery, delving into methodologies and experimental techniques that have significantly impacted the field. By utilizing modern computational tools, we can better understand complex systems and improve predictive modeling. The study highlights key advancements in theoretical and applied sciences, emphasizing the importance of interdisciplinary approaches. By integrating data analytics and machine learning, researchers can gain deeper insights into previously unexplored areas. The continuous evolution of experimental techniques allows for more precise measurements and better accuracy in scientific investigations. This enhances our ability to validate hypotheses and refine theoretical models. The study highlights key advancements in theoretical and applied sciences, emphasizing the importanc",Medium,Health


## Retrieval

Use the user's query to retrieve the most semantically similar documents from the vector store. The retrieved text becomes the context that grounds the LLM during answer generation.
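
To make "semantically similar" concrete, here is a small pure-Python sketch of what a top-k similarity search computes. It is illustrative only: the 2-D vectors and document labels are made up, and in the notebook the ranking is done inside HANA rather than in Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query_vec, docs, k=2):
    """docs is a list of (text, vector); return the k texts most similar to query_vec."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy embeddings, invented for illustration.
docs = [
    ("fat test", [0.9, 0.1]),
    ("exoplanets", [0.0, 1.0]),
    ("carbohydrates", [0.7, 0.3]),
]
print(top_k([1.0, 0.0], docs, k=2))  # ['fat test', 'carbohydrates']
```

Higher cosine similarity means a closer match, which is why results are sorted in descending order for this metric.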

In [21]:
# Connect to SAP HANA
cc = ConnectionContext(
    address=os.environ.get("HANA_ADDRESS"),
    port=os.environ.get("HANA_PORT"),
    user=os.environ.get("HANA_USER"),
    password=os.environ.get("HANA_PASSWORD"),
    encrypt=True
)

cursor = cc.connection.cursor()

print(cc.hana_version())
print(cc.get_current_schema())

4.00.000.00.1715685275 (fa/CE2024.2)
USR_336RA2ZQ5LAGTHKHCKIYB945E


In [22]:
# Initialize AI Core proxy client
proxy_client = get_proxy_client('gen-ai-hub')

def get_embedding(query):
    """
    Create embedding vector for given text.
    """
    embeds = embeddings.create(
        model_name="text-embedding-ada-002",
        input=query
    )
    return embeds.data[0].embedding

In [28]:
def run_vector_search(query, cursor, table_name, metric="COSINE_SIMILARITY", k=4):
    """
    Performs vector search on indexed documents.
    """
    try:
        query_vector = get_embedding(query)
        if not query_vector:
            raise ValueError("Failed to generate query embedding.")

        sort_order = "DESC" if metric != "L2DISTANCE" else "ASC"
        sql_query = f"""
        SELECT TOP {k} MY_TEXT, MY_METADATA
        FROM {table_name}
        ORDER BY {metric}(MY_VECTOR, TO_REAL_VECTOR('{query_vector}')) {sort_order}
        """
        cursor.execute(sql_query)
        return cursor.fetchall()
    except Exception as e:
        print(f"Error during vector search: {e}")
        return []
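
Because the SQL statement above is assembled with string interpolation, a mistyped metric or an unexpected table name goes straight into the query. One possible hardening, sketched below under stated assumptions, is to validate the inputs before building the statement. The helper `build_vector_search_sql` and the identifier pattern are hypothetical and not part of the notebook; the pattern deliberately rejects schema-qualified names such as `SCHEMA.TABLE`, which you may need to relax.

```python
import re

# Similarity: higher is better (DESC). Distance: lower is better (ASC).
METRIC_SORT_ORDER = {"COSINE_SIMILARITY": "DESC", "L2DISTANCE": "ASC"}
# Assumed rule for a plain, unquoted table name.
IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_vector_search_sql(table_name, query_vector, metric="COSINE_SIMILARITY", k=4):
    """Validate inputs, then return the vector search SQL as a string."""
    if metric not in METRIC_SORT_ORDER:
        raise ValueError(f"Unsupported metric: {metric}")
    if not IDENTIFIER_RE.fullmatch(table_name):
        raise ValueError(f"Invalid table name: {table_name}")
    if type(k) is not int or k < 1:
        raise ValueError("k must be a positive integer")
    # Coerce every component to float so only numbers reach the literal.
    vector_literal = "[" + ",".join(str(float(v)) for v in query_vector) + "]"
    return (
        f"SELECT TOP {k} MY_TEXT, MY_METADATA "
        f"FROM {table_name} "
        f"ORDER BY {metric}(MY_VECTOR, TO_REAL_VECTOR('{vector_literal}')) "
        f"{METRIC_SORT_ORDER[metric]}"
    )


print(build_vector_search_sql("NUTRITION_SCIENCE_DATA", [0.1, 0.2], "L2DISTANCE", 3))
```

The returned string could be passed to `cursor.execute` in place of the inline f-string; the explicit mapping also documents why each metric sorts in its direction.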

In [29]:
query = "How to test for fat in foods?"

# Retrieve top 4 matching docs from vector store
context_records = run_vector_search(query, cursor, "NUTRITION_SCIENCE_DATA", 'COSINE_SIMILARITY', 4)
# Join the text content (MY_TEXT, the first selected column) of the retrieved docs
context = ' '.join([c[0] for c in context_records])


## Augment

Augment the prompt by embedding the retrieved context into its instructions.

In [31]:
prompt = f"""
Use the following context information to answer the user's query.
Here is some context: {context}

Based on the above context, answer the following query:
{query}

The answer tone has to be very professional in nature.

If you don't know the answer, politely say that you don't know, don't try to make up an answer.
"""
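
Retrieved documents can be long, so it is sometimes useful to number them and cap the total context size before building the prompt. The sketch below shows one way to do that. The function names `build_context` and `build_prompt` are hypothetical, and the 4000-character default is an arbitrary assumption, not a limit imposed by the model or by Generative AI Hub.

```python
def build_context(docs, max_chars=4000):
    """Number each document and stop before the (approximate) character budget is exceeded."""
    parts, used = [], 0
    for i, doc in enumerate(docs, start=1):
        entry = f"[{i}] {doc.strip()}"
        if used + len(entry) > max_chars:
            break  # drop this and all lower-ranked documents
        parts.append(entry)
        used += len(entry)  # newline separators are not counted
    return "\n".join(parts)


def build_prompt(query, docs, max_chars=4000):
    """Embed the numbered context and the query into the prompt instructions."""
    context = build_context(docs, max_chars)
    return (
        "Use the following context information to answer the user's query.\n"
        f"Here is some context:\n{context}\n\n"
        f"Based on the above context, answer the following query:\n{query}\n"
    )


print(build_prompt("How to test for fat in foods?", ["first doc ", " second doc"]))
```

Because the documents arrive ranked by similarity, truncating from the end keeps the most relevant ones.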

## Generation

Use an LLM from Generative AI Hub to generate a response to the context-augmented prompt. Because the retrieved context is part of the prompt, the LLM can ground its answer in that context.

In [32]:
from gen_ai_hub.proxy.native.openai import chat

messages = [
    {"role": "system", "content": "You are an intelligent assistant."},
    {"role": "user", "content": prompt}
]

kwargs = dict(model_name="gpt-4o", messages=messages)

response = chat.completions.create(**kwargs)

print(response.choices[0].message.content)

To test for fat in foods, a common method employed is the use of the Sudan III stain, which is a qualitative test used to indicate the presence of lipids. In this test, the food sample is mixed with Sudan III stain; lipids will absorb the stain and exhibit a distinct reddish-orange color. Alternatively, quantitative methods such as Soxhlet extraction can be used to measure the precise quantity of fat present in a food sample. This method involves extracting fat with a solvent and then measuring the extracted amount. Each methodology varies in its precision, complexity, and equipment requirements, depending on whether qualitative or quantitative results are desired.
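
The three stages (retrieve, augment, generate) can also be combined into a single function. The following is a minimal sketch under stated assumptions: `answer_query` is a hypothetical name, the retriever and generator are passed in as plain callables so the flow can be tried without a HANA or AI Core connection, and returning a fixed "I don't know." when nothing is retrieved is a design choice made here, not behavior of the notebook.

```python
def answer_query(query, retrieve, generate, k=4):
    """Run retrieve -> augment -> generate using injected callables."""
    docs = retrieve(query, k)  # expected to return a list of text strings
    if not docs:
        return "I don't know."  # nothing to ground the answer on
    context = " ".join(docs)
    prompt = (
        "Use the following context information to answer the user's query.\n"
        f"Here is some context: {context}\n\n"
        f"Based on the above context, answer the following query:\n{query}\n"
    )
    return generate(prompt)


# Stand-ins for the real retriever and LLM call; the strings are placeholders.
def fake_retrieve(query, k):
    return ["placeholder document one", "placeholder document two"][:k]


def fake_generate(prompt):
    return f"(stub) received a prompt of {len(prompt)} characters"


print(answer_query("How to test for fat in foods?", fake_retrieve, fake_generate))
```

In the notebook, `retrieve` could wrap `run_vector_search` and return the `MY_TEXT` column of each row, and `generate` could wrap the `chat.completions.create` call shown above.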
