# Best Practices for Creating Embeddings in HANA Cloud with SAP GenAI Hub and LangChain

## Introduction
This notebook demonstrates how to create text embeddings from PDFs and store them in an SAP HANA Cloud vector database using the SAP GenAI Hub SDK and LangChain. It covers recommended practices for chunking, embedding, and retrieval.


## Setup and Database Connection

In [1]:
## Step 1: Setup and Database Connection
import os
import glob
import uuid
from dotenv import load_dotenv

# SAP HANA DBAPI
from hdbcli import dbapi

# LangChain Components
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_community.vectorstores.hanavector import HanaDB

# SAP GenAI Hub SDK Components
from gen_ai_hub.proxy.langchain.openai import OpenAIEmbeddings

# Loads configuration from .env file.
load_dotenv(override=True)

True

In [2]:
# Initializes connection to the HANA database.
connection = dbapi.connect(
    address=os.environ.get("HANA_ADDRESS"),
    port=os.environ.get("HANA_PORT"),
    user=os.environ.get("HANA_USER"),
    password=os.environ.get("HANA_PASSWORD"),
    autocommit=True,
    sslValidateCertificate=False,
)
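
Because the connection cell reads every setting from environment variables, a missing entry in `.env` only surfaces as a connection error. The sketch below is one possible way to fail earlier with a clearer message. The helper name `read_hana_config` is hypothetical and not part of `hdbcli` or the SAP GenAI Hub SDK; it only reuses the four variable names from the cell above.

```python
import os

# Variable names taken from the connection cell above.
REQUIRED_VARS = ("HANA_ADDRESS", "HANA_PORT", "HANA_USER", "HANA_PASSWORD")


def read_hana_config(env=os.environ):
    """Return connection keyword arguments, failing early on missing values.

    Hypothetical helper for illustration; not part of any SAP library.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "address": env["HANA_ADDRESS"],
        # Converting here means a malformed port is reported before connecting.
        "port": int(env["HANA_PORT"]),
        "user": env["HANA_USER"],
        "password": env["HANA_PASSWORD"],
    }
```

Under these assumptions, the connection could then be opened with `dbapi.connect(**read_hana_config(), autocommit=True, sslValidateCertificate=False)`.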

In [3]:
import warnings
warnings.filterwarnings('ignore')

## Dataset

SAP help pages for SAP Business AI and SAP HANA Cloud have been downloaded as PDF documents and placed in the `sample_files` directory.

## Load Files

The sample files used in this notebook are PDF documents. Other file types can be processed by using the corresponding LangChain document loader.

In [4]:
# Load PDFs from a directory
def load_pdfs(directory: str):
    """Load and extract text from PDF files in the specified directory."""
    pdf_files = glob.glob(os.path.join(directory, "*.pdf"))
    docs = []
    for file in pdf_files:
        loader = PyPDFLoader(file)
        docs.extend(loader.load())
    return docs
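
The `load_pdfs` function above handles only `*.pdf` files. As a rough sketch of how several file types could be handled, the helper below picks a loader by file suffix. The name `load_by_extension` and the mapping argument are assumptions made for illustration; the only requirement is that each registered loader takes a file path and exposes `.load()`, as the LangChain loaders imported earlier do.

```python
import os


def load_by_extension(directory, loader_by_suffix):
    """Load every file whose suffix has a registered loader.

    loader_by_suffix maps a lowercase suffix such as ".pdf" to a loader
    class (or factory) that takes a file path and exposes .load().
    Hypothetical helper; files with an unregistered suffix are skipped.
    """
    docs = []
    for name in sorted(os.listdir(directory)):
        suffix = os.path.splitext(name)[1].lower()
        loader_factory = loader_by_suffix.get(suffix)
        if loader_factory is None:
            continue  # no loader registered for this file type
        docs.extend(loader_factory(os.path.join(directory, name)).load())
    return docs
```

With the imports from the first cell, a call might look like `load_by_extension("../sample_files/", {".pdf": PyPDFLoader, ".txt": TextLoader})`.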

## Chunking Strategy

Large documents need to be broken down into smaller, meaningful chunks to optimize retrieval. This function splits documents into segments with controlled overlap, ensuring the embeddings capture context effectively.


- **RecursiveCharacterTextSplitter:** Works well for most text-heavy documents.
- **CharacterTextSplitter:** Suitable for structured text with clear section markers.
- **Token-based splitters:** Useful when working with token-limited models.
- **MarkdownHeaderTextSplitter:** Suitable for processing Markdown files.

Please refer to the official LangChain documentation for the supported splitters. Custom splitters based on the document structure can also be created and integrated with LangChain.
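
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-window chunker. It is an illustration only and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); the function name `sliding_chunks` is made up for this sketch.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Each chunk starts with the last character of the previous one.
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap is what lets a sentence that straddles a boundary appear intact in at least one chunk, at the cost of storing some text twice.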


In [5]:
def split_document(document, chunk_size=500, chunk_overlap=50):
    """Split documents into smaller chunks for better processing."""
    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return splitter.split_documents(document)

In [6]:
# Initialize Embeddings model from Generative AI Hub
def init_embeddings_model(model_name):
    embeddings_model = OpenAIEmbeddings(proxy_model_name=model_name)
    return embeddings_model

In [7]:
# Optional: Generate unique ID as an additional metadata for a later step
def get_unique_id():
    return str(uuid.uuid4())

In [8]:
print("Loading PDFs...")
documents = load_pdfs("../sample_files/")
print(f"Loaded {len(documents)} documents.")

Loading PDFs...
Loaded 49 documents.


In [9]:
print("Splitting source documents into chunks...")
chunks = split_document(documents)
print(f"Generated {len(chunks)} chunks.")

Splitting source documents into chunks...
Generated 137 chunks.


In [10]:
# Optional: Add more metadata to the chunks if required
chunks = [
    Document(
        page_content=c.page_content,
        metadata=c.metadata | {'document_number': str(c_number), 'unique_id': get_unique_id()}
    ) for c_number, c in enumerate(chunks)
]
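
The metadata added above can later be used to target specific chunks. The sketch below wraps a metadata-based delete and assumes the store exposes `delete(filter=...)`, as the `HanaDB` cell with `db.delete(filter={})` later in this notebook does. The wrapper name `delete_chunks` is hypothetical; its only purpose is to refuse an empty filter, which would otherwise remove every row.

```python
def delete_chunks(store, **metadata_equals):
    """Delete chunks whose metadata matches all given key/value pairs.

    Hypothetical wrapper: `store` is any object with a delete(filter=...)
    method. An empty filter is rejected because it would match everything.
    """
    if not metadata_equals:
        raise ValueError("refusing to delete without at least one metadata criterion")
    return store.delete(filter=dict(metadata_equals))
```

For example, `delete_chunks(db, document_number="3")` would target the chunk that received `document_number` `"3"` in the cell above (stored as a string there).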

## Initialize Embeddings Model

Using an Embedding model, you can map text to high-dimensional vectors for tasks such as semantic search and clustering.

For this example, we use an OpenAI embedding model. However, SAP Generative AI Hub supports a number of embedding models. Please refer to the SAP help page to select a model that suits your requirements.

[Available Embedding models on SAP Generative AI Hub](https://help.sap.com/doc/DRAFT/de440ae8a54442068c7c7bcb9375c9b4/DEV/en-US/_temp/gen_ai_hub/examples/gen_ai_hub.html#embeddings)
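
Semantic search over embeddings comes down to comparing vectors. The toy example below uses made-up two-dimensional vectors (real embeddings have many more dimensions) to show that cosine similarity and Euclidean distance can disagree about which vector is the nearest match. It assumes non-zero vectors and uses only the Python standard library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (higher is closer)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (lower is closer)."""
    return math.dist(a, b)


query = [1.0, 1.0]
vectors = {
    "same_direction_far": [10.0, 10.0],  # same angle as the query, larger length
    "close_but_rotated": [1.0, 0.5],     # nearby point, different angle
}

by_cosine = max(vectors, key=lambda name: cosine_similarity(query, vectors[name]))
by_euclid = min(vectors, key=lambda name: euclidean_distance(query, vectors[name]))
print(by_cosine, by_euclid)
```

Cosine similarity ignores vector length and looks only at direction, which is why it prefers the first vector here while Euclidean distance prefers the second.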

In [11]:
# Initialize Embeddings model from Generative AI Hub
embeddings = init_embeddings_model("text-embedding-ada-002")

## Store Embeddings in HANA Vector Store

In [12]:
# Initialize HANA Vector Store instance
db = HanaDB(
    embedding=embeddings, connection=connection, table_name="SAP_HELP_PUBLIC"
)

In [13]:
# Optional: Remove all entries from the table
db.delete(filter={})

True

In [14]:
# Add Document objects to HANA Vector Store
db.add_documents(chunks)

[]

## Query the Vector Store

By default, the HANA Vector Store uses cosine similarity as its distance strategy. Euclidean distance (`EUCLIDEAN_DISTANCE`) is also available as a similarity measure, and Maximal Marginal Relevance (MMR) search can be used to diversify the results.

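1. To use Euclidean distance for retrieval:
``` python
from langchain_community.vectorstores.utils import DistanceStrategy

# Replace the <...> placeholders with your own objects and table name
db = HanaDB(
    embedding=<embedding_model>,
    connection=<connection>,
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
    table_name=<table_name>,
)
```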

2. MMR optimizes for similarity to query AND diversity among selected documents. The first 20 (fetch_k) items will be retrieved from the DB. The MMR algorithm will then find the best 2 (k) matches.
``` python
docs = db.max_marginal_relevance_search(query, k=2, fetch_k=20)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
```

3. HNSW Vector Index: A vector index can significantly speed up top-k nearest neighbor queries for vectors. Users can create a Hierarchical Navigable Small World (HNSW) vector index using the `create_hnsw_index` function.
``` python
from langchain_community.vectorstores.utils import DistanceStrategy

# HanaDB instance uses cosine similarity as default:
db_cosine = HanaDB(
    embedding=embeddings, connection=connection, table_name="SAP_HELP_PUBLIC"
)

# Attempting to create the HNSW index with default parameters
db_cosine.create_hnsw_index()  # If no other parameters are specified, the default values will be used
# Default values: m=64, ef_construction=128, ef_search=200
# The default index name is derived from the table name and distance function,
# e.g. SAP_HELP_PUBLIC_COSINE_SIMILARITY_IDX (verify this naming pattern in HanaDB class)


# Creating a HanaDB instance with L2 distance as the similarity function and defined values
db_l2 = HanaDB(
    embedding=embeddings,
    connection=connection,
    table_name="SAP_HELP_PUBLIC",
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,  # Specify L2 distance
)

# This will create an index based on L2 distance strategy.
db_l2.create_hnsw_index(
    index_name="SAP_HELP_PUBLIC_L2_index",
    m=100,  # Max number of neighbors per graph node (valid range: 4 to 1000)
    ef_construction=200,  # Max number of candidates during graph construction (valid range: 1 to 100000)
    ef_search=500,  # Min number of candidates during the search (valid range: 1 to 100000)
)

# Use L2 index to perform MMR
docs = db_l2.max_marginal_relevance_search(query, k=2, fetch_k=20)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)
```
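
To illustrate what the MMR step does with the `fetch_k` candidates, here is a simplified greedy selection in plain Python. It is a sketch under assumptions, not the code LangChain or HANA runs: the vectors are toy values, and the function name `mmr_select` and the `lambda_mult` weighting (1.0 means relevance only, lower values favour diversity) are chosen for this illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr_select(query_vec, candidate_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k candidate indices, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            # Similarity to the closest already-selected candidate (0 if none yet).
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.3]
candidates = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # the first two are near-duplicates
print(mmr_select(query_vec, candidates, k=2))                   # balances diversity
print(mmr_select(query_vec, candidates, k=2, lambda_mult=1.0))  # relevance only
```

With the default weighting, the second pick skips the near-duplicate in favour of the more distinct vector; with `lambda_mult=1.0` the selection falls back to plain similarity ranking.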

In [15]:
# Retrieve top matching documents from HANA for a given input text
query = "What is SAP Business AI?"
docs = db.similarity_search(query, k=2)
for doc in docs:
    print("-" * 80)
    print(doc.page_content)

--------------------------------------------------------------------------------
4PUBLIC© 2023 SAP SE or an SAP affiliate company. All rights reserved.  ǀ
Customers wanting to adopt Business AI Desire Real Business Results
--------------------------------------------------------------------------------
9PUBLIC© 2023 SAP SE or an SAP affiliate company. All rights reserved.  ǀ
Today, SAP offers a large catalogue of AI-powered scenarios across all business functions 
…Find out more on SAP Business AI
Finance
▪ Tax Compliance
▪ Cash Application
▪ Intelligent accrual
▪ Travel expense auditing
▪ Travel expense 
verification
▪ Invoice processing
▪ Business Integrity 
screening
▪ Goods and invoice receipt 
matching
▪ Mobile expense entry
Supply 
Chain
▪ Stock in transit
▪ Visual Inspection


## Key Takeaways
- **Chunking Strategy Matters:** Choose a splitting method based on document structure.
- **Batch Processing for Efficiency:** Process documents in batches when working with large datasets.
- **Metadata Enrichment:** Adding metadata (e.g., document numbers, unique IDs) improves traceability and makes it possible to delete specific documents later.
- **Embedding Model Selection:** Choose an embedding model based on the retrieval requirements and performance.
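
The batch-processing point above is not demonstrated in the notebook cells, so here is a minimal sketch of one way to do it. The helper names and the default batch size of 100 are assumptions for illustration; `add_fn` can be any callable that accepts a list, for example the `db.add_documents` method used earlier.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def add_in_batches(add_fn, items, batch_size=100):
    """Call add_fn once per batch and return the number of batches sent."""
    count = 0
    for batch in batched(items, batch_size):
        add_fn(batch)
        count += 1
    return count
```

In this notebook the call could look like `add_in_batches(db.add_documents, chunks, batch_size=100)`. Smaller batches keep each embedding request and insert short, so a failure partway through only affects one batch.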


Please see the [LangChain documentation](https://python.langchain.com/docs/integrations/vectorstores/sap_hanavector/) for more details on the SAP HANA Vector Store and its LangChain integration.