# OpenSearch RAG Demo with Llama 3.1 8B

This notebook demonstrates Retrieval Augmented Generation (RAG) using:
- **[OpenSearch](https://opensearch.org/platform/vector-engine/)** for vector search (k-NN) and hybrid search (k-NN + BM25)
- **Ollama with Llama 3.1 8B** for text generation
- **nomic-embed-text:v1.5** for embeddings

## About the Data

This demo uses **~4200 documentation chunks** from the Stackable Data Platform, covering operator documentation for products like Kafka, Trino, OpenSearch, Spark, and others. Each chunk contains:

- Text content from documentation pages
- 768-dimensional vector embeddings (generated with nomic-embed-text)
- Metadata: repository name, category, URL, code block indicators

The data is pre-generated and loaded into OpenSearch during demo deployment. This allows you to immediately start querying without waiting for document processing or embedding generation.

## Setup

### Imports and constants

In [None]:
# Install required packages
%pip install opensearch-py requests urllib3 -q

In [None]:
import os
import json
import requests
from opensearchpy import OpenSearch
import warnings
warnings.filterwarnings('ignore')

# Configuration from environment variables
OPENSEARCH_HOSTS = os.getenv('OPENSEARCH_HOSTS')
OPENSEARCH_HOSTNAME = os.getenv('OPENSEARCH_HOSTNAME')
OPENSEARCH_PORT = int(os.getenv('OPENSEARCH_PORT'))
OPENSEARCH_PROTOCOL = os.getenv('OPENSEARCH_PROTOCOL')
OPENSEARCH_USER = os.getenv('OPENSEARCH_USER')
OPENSEARCH_PASSWORD = os.getenv('OPENSEARCH_PASSWORD')
OLLAMA_HOST = os.getenv('OLLAMA_HOST')
OLLAMA_PORT = int(os.getenv('OLLAMA_PORT'))
OLLAMA_LLM_MODEL = os.getenv('OLLAMA_LLM_MODEL')
INDEX_NAME = 'rag-documents'

print(f"OpenSearch: {OPENSEARCH_HOSTS}")
print(f"Ollama: {OLLAMA_HOST}:{OLLAMA_PORT} using {OLLAMA_LLM_MODEL} to generate responses")
print(f"Index: {INDEX_NAME}")
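
Note that `int(os.getenv(...))` raises a `TypeError` when the variable is unset, which can be hard to interpret in a notebook. A minimal sketch of a fail-fast alternative follows. The helper name `require_env` and the variable `EXAMPLE_PORT` are hypothetical and not part of the demo.

```python
import os


def require_env(name, cast=str):
    """Return the environment variable `name`, converted with `cast`.

    Hypothetical helper: raises a readable error instead of the TypeError
    that int(None) produces when the variable is missing.
    """
    value = os.environ.get(name)
    if value is None or value == "":
        raise RuntimeError(f"Environment variable {name} is not set")
    return cast(value)


# Variable set here only for the purpose of this illustration
os.environ["EXAMPLE_PORT"] = "9200"
example_port = require_env("EXAMPLE_PORT", int)
print(example_port)
```

The same pattern could be applied to `OPENSEARCH_PORT` and `OLLAMA_PORT` if clearer error messages are wanted.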

### Connect to Services
Initialize the OpenSearch client and test the connections to OpenSearch and Ollama.
Verify that the index containing the documentation chunks exists in OpenSearch.

In [None]:
# Initialize OpenSearch client
opensearch_client = OpenSearch(
    hosts=[{'host': OPENSEARCH_HOSTNAME, 'port': OPENSEARCH_PORT}],
    http_auth=(OPENSEARCH_USER, OPENSEARCH_PASSWORD),
    use_ssl=OPENSEARCH_PROTOCOL == 'https',
    verify_certs=False,
    ssl_show_warn=False
)

# Test connection
info = opensearch_client.info()
print(f"Connected to OpenSearch {info['version']['number']}")

# Check index
if opensearch_client.indices.exists(index=INDEX_NAME):
    count = opensearch_client.count(index=INDEX_NAME)['count']
    print(f"Index '{INDEX_NAME}' has {count} chunks")
else:
    print(f"Index '{INDEX_NAME}' does not exist. Run the data ingestion job first.")

In [None]:
# Test Ollama connection
response = requests.get(f'http://{OLLAMA_HOST}:{OLLAMA_PORT}/api/tags')
response.raise_for_status()
models = response.json().get('models', [])
print(f"Ollama is running with {len(models)} models:")
for model in models:
    print(f"  - {model['name']}")

## Understanding the RAG Pipeline

RAG (Retrieval Augmented Generation) enhances LLM responses by grounding them in specific documentation. Our pipeline follows 5 steps:

1. **Query Embedding**: Convert the user's question into a vector representation
2. **Query Enhancement**: Detect keywords and intent to improve search precision
3. **OpenSearch Hybrid Search**: Combine semantic search (k-NN) with keyword matching (BM25) using OpenSearch
4. **Context Formatting & Prompt Engineering**: Prepare retrieved documentation for the LLM
5. **Response Generation**: Stream the answer from the LLM

Each step is explained below with its implementation.

## Step 1: Query Embedding

To search for relevant documentation, we need to convert text into numerical vectors (embeddings). These vectors capture semantic meaning: texts about similar concepts map to nearby vectors.

**Why nomic-embed-text?**
- Specifically optimized for retrieval tasks (not just similarity)
- 768-dimensional vectors balance quality and performance  
- Open source and runs locally in Ollama

The embedding vector from the query will be compared against document embeddings stored in **OpenSearch's k-NN index**.
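
To make "similar concepts have similar vectors" concrete, here is a small sketch using cosine similarity on toy vectors. The notebook does not state which distance metric the index uses, so cosine similarity is an assumption here, and the 3-dimensional vectors are invented for illustration (real embeddings have 768 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only
query_vec = [0.9, 0.1, 0.0]
close_vec = [0.8, 0.2, 0.1]   # points in nearly the same direction as the query
far_vec = [0.0, 0.1, 0.9]     # points in a very different direction

sim_close = cosine_similarity(query_vec, close_vec)
sim_far = cosine_similarity(query_vec, far_vec)
print(f"close: {sim_close:.3f}, far: {sim_far:.3f}")
```

The vector that points in nearly the same direction as the query gets the higher similarity, which is the property a k-NN search relies on.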

In [None]:
def get_embedding(text):
    """Generate embedding for text using Ollama."""
    response = requests.post(
        f'http://{OLLAMA_HOST}:{OLLAMA_PORT}/api/embeddings',
        json={
            'model': 'nomic-embed-text:v1.5',
            'prompt': text
        }
    )
    response.raise_for_status()
    return response.json()['embedding']

print("Query embedding function loaded")

## Step 2: Query Enhancement for Better Results

Raw user queries aren't always optimal for search. We enhance queries using two domain-specific strategies:

### Strategy 1: Product Name Detection
When a query mentions "Kafka" or "Trino", we want documentation **about** that product, not just pages that mention it. We detect product keywords and add metadata filters to the search.

### Strategy 2: Implementation Query Detection
When users ask "how to" questions, code examples become more valuable. We detect implementation queries and boost document chunks containing code blocks.

These enhancements are specific to searching technical documentation and are intended to improve result precision.

In [None]:
# Operator name mapping
OPERATOR_MAP = {
    'trino': 'trino-operator',
    'airflow': 'airflow-operator',
    'druid': 'druid-operator',
    'hdfs': 'hdfs-operator',
    'hbase': 'hbase-operator',
    'hive': 'hive-operator',
    'kafka': 'kafka-operator',
    'nifi': 'nifi-operator',
    'opensearch': 'opensearch-operator',
    'spark': 'spark-k8s-operator',
    'superset': 'superset-operator',
    'zookeeper': 'zookeeper-operator',
    'opa': 'opa-operator',
    'secret': 'secret-operator',
    'listener': 'listener-operator',
    'commons': 'commons-operator'
}

def detect_operator(query):
    """Detect operator names in query for metadata filtering."""
    query_lower = query.lower()
    for keyword, operator in OPERATOR_MAP.items():
        if keyword in query_lower:
            return operator
    return None

def is_implementation_query(query):
    """Detect if query is asking for implementation details."""
    impl_keywords = ['how', 'deploy', 'configure', 'setup', 'install', 'create', 'implement', 'example']
    query_lower = query.lower()
    return any(keyword in query_lower for keyword in impl_keywords)

print("Query enhancement functions loaded")
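
One caveat of the substring checks above: `keyword in query_lower` also matches inside longer words, for example `'hive'` in "archive" or `'how'` in "show". A possible variant, sketched below under the assumption that whole-word matching is wanted, uses a regular expression with word boundaries. `OPERATOR_KEYWORDS` and `detect_operator_whole_word` are hypothetical names; the mapping is a small subset repeated so the sketch is self-contained.

```python
import re

# Subset of the operator mapping, repeated here so the sketch runs on its own
OPERATOR_KEYWORDS = {
    'hive': 'hive-operator',
    'kafka': 'kafka-operator',
    'opa': 'opa-operator',
}


def detect_operator_whole_word(query):
    """Return the first operator whose keyword appears as a whole word."""
    query_lower = query.lower()
    for keyword, operator in OPERATOR_KEYWORDS.items():
        # \b anchors the keyword at word boundaries, so 'hive' no longer
        # matches inside 'archive'
        if re.search(rf'\b{re.escape(keyword)}\b', query_lower):
            return operator
    return None


no_match = detect_operator_whole_word("How do I archive old logs?")
match = detect_operator_whole_word("Which Hive versions are supported?")
print(no_match)  # None
print(match)     # hive-operator
```

The trade-off is that plural or inflected forms (such as "secrets" for the `secret` keyword) no longer match, so the substring approach may still be preferable when recall matters more than precision.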

## Step 3: OpenSearch Hybrid Search (k-NN + BM25)

**OpenSearch** provides both vector search and keyword search in a single query. We use [OpenSearch's vector engine](https://opensearch.org/platform/vector-engine/) for k-NN similarity search combined with BM25 for keyword matching.

**Hybrid search** combines two complementary techniques:

1. **k-NN (semantic search)**: Finds documents conceptually similar to the query, even if they use different words
2. **BM25 (keyword search)**: Finds exact term matches, ranked with TF-IDF-style weighting (including term-frequency saturation and document-length normalization)

**Why hybrid?** Each method has strengths:
- k-NN: Handles synonyms, paraphrases, conceptual similarity
- BM25: Catches exact terminology, product names, specific features

We use a two-phase approach:
1. OpenSearch k-NN retrieves candidate documents (k*3 results for better coverage)
2. BM25 rescores these candidates to boost keyword matches

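**Weight tuning**: the k-NN score is weighted 0.7 and the BM25 score 0.3. Semantic similarity matters more for documentation Q&A, but exact terms still help. The weights are applied to raw, unnormalized scores rather than to percentages of the final score.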
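
As a rough illustration of the arithmetic, the sketch below assumes the rescore uses the default `score_mode` of `total`, in which the final score of a document inside the rescore window is a weighted sum of the original score and the rescore score. The function name and the score values are invented for illustration.

```python
def combined_score(knn_score, bm25_score, query_weight=0.7, rescore_weight=0.3):
    """Weighted sum of the original (k-NN) score and the rescore (BM25) score."""
    return query_weight * knn_score + rescore_weight * bm25_score


# Scores invented for illustration only
doc_a = combined_score(knn_score=0.90, bm25_score=2.0)  # semantically closer
doc_b = combined_score(knn_score=0.80, bm25_score=6.0)  # stronger keyword match

print(f"doc_a: {doc_a:.2f}, doc_b: {doc_b:.2f}")
```

Because BM25 scores are unbounded while k-NN scores are often confined to a small range, a strong keyword match can outweigh a slightly better semantic match even with the smaller weight.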

### 3.1: Building the OpenSearch k-NN Query

OpenSearch's k-NN query performs vector similarity search. If a product is detected in the query, we add a metadata filter to restrict results to that product's documentation.

In [None]:
def build_knn_query(query_embedding, detected_operator, k):
    """Build k-NN query with optional operator filter."""
    knn_query = {
        'embedding': {
            'vector': query_embedding,
            'k': k * 3  # Get more candidates for rescoring
        }
    }
    
    # Add operator filter if operator detected
    if detected_operator:
        knn_query['embedding']['filter'] = {
            'term': {'operator': detected_operator}
        }
    
    return knn_query

print("k-NN query builder loaded")

### 3.2: Building the BM25 Rescore Query

The rescore query uses BM25 for keyword matching. It searches in both `title` (with 1.5x boost) and `content` fields. For implementation queries, we additionally boost documents containing code blocks.

In [None]:
def build_rescore_query(query, boost_code_blocks):
    """Build BM25 rescore query for keyword matching."""
    rescore_should = [
        {
            'multi_match': {
                'query': query,
                'fields': ['title^1.5', 'content'],
                'type': 'best_fields'
            }
        }
    ]
    
    # Boost documents with code blocks for implementation queries
    if boost_code_blocks:
        rescore_should.append({
            'term': {
                'has_code_block': {
                    'value': True,
                    'boost': 1.5
                }
            }
        })
    
    return {
        'window_size': None,  # Will be set to k * 3
        'query': {
            'rescore_query': {
                'bool': {
                    'should': rescore_should
                }
            },
            'query_weight': 0.7,         # Favor k-NN semantic similarity
            'rescore_query_weight': 0.3  # Keyword matching has less weight
        }
    }

print("BM25 rescore query builder loaded")

### 3.3: Formatting Search Results

Extract the relevant fields from the OpenSearch response and structure them for downstream use.

In [None]:
def format_search_results(opensearch_response):
    """Extract relevant fields from OpenSearch response."""
    results = []
    for hit in opensearch_response['hits']['hits']:
        results.append({
            'title': hit['_source']['title'],
            'content': hit['_source']['content'],
            'category': hit['_source']['category'],
            'operator': hit['_source'].get('operator', ''),
            'has_code_block': hit['_source'].get('has_code_block', False),
            'url': hit['_source'].get('url', ''),
            'score': hit['_score']
        })
    return results

print("Search result formatter loaded")

### 3.4: Execute OpenSearch Hybrid Search

The main search function orchestrates all the steps above: get embedding, detect enhancements, build queries, execute the OpenSearch search, and format results. When `log_query=True`, the constructed OpenSearch query is printed for inspection.

In [None]:
def search_documents(query, k=10, log_query=False):
    """Search for relevant document chunks using OpenSearch hybrid search (k-NN + BM25)."""
    # Step 1: Get embedding
    query_embedding = get_embedding(query)
    
    # Step 2: Detect enhancements
    detected_operator = detect_operator(query)
    boost_code_blocks = is_implementation_query(query)
    
    # Step 3: Build k-NN query
    knn_query = build_knn_query(query_embedding, detected_operator, k)
    
    # Step 4: Build rescore query
    rescore_query = build_rescore_query(query, boost_code_blocks)
    rescore_query['window_size'] = k * 3
    
    # Step 5: Construct complete search body
    search_body = {
        'size': k,
        'query': {
            'knn': knn_query
        },
        'rescore': rescore_query,
        '_source': ['title', 'content', 'category', 'operator', 'has_code_block', 'url']
    }
    
    # Log the query if requested (with truncated vector for readability)
    if log_query:
        # Create display version with truncated vector
        log_body = json.loads(json.dumps(search_body))  # Deep copy
        vector = log_body['query']['knn']['embedding']['vector']
        log_body['query']['knn']['embedding']['vector'] = [vector[0], vector[1], vector[2], "...", vector[-3], vector[-2], vector[-1]]
        print("OpenSearch Query:")
        print(json.dumps(log_body, indent=2))
        print()
    
    # Step 6: Execute search
    response = opensearch_client.search(
        index=INDEX_NAME,
        body=search_body
    )
    
    # Step 7: Format and return results
    return format_search_results(response)

print("OpenSearch hybrid search orchestrator loaded")
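
To see the overall shape of the request without calling OpenSearch or Ollama, the sketch below assembles an equivalent search body from a dummy vector and prints it with the vector truncated. `truncate_vector` is a hypothetical helper, and the dummy embedding and example question are placeholders rather than real pipeline data.

```python
import json


def truncate_vector(vector, head=3, tail=3):
    """Shorten a long vector for display, keeping the first and last entries."""
    if len(vector) <= head + tail:
        return list(vector)
    return list(vector[:head]) + ["..."] + list(vector[-tail:])


# Dummy 768-dimensional vector standing in for a real embedding
dummy_embedding = [round(i / 768, 4) for i in range(768)]
k = 2

search_body = {
    'size': k,
    'query': {
        'knn': {
            'embedding': {
                'vector': truncate_vector(dummy_embedding),
                'k': k * 3,  # over-fetch candidates for rescoring
            }
        }
    },
    'rescore': {
        'window_size': k * 3,
        'query': {
            'rescore_query': {'bool': {'should': [
                {'multi_match': {
                    'query': 'example question',
                    'fields': ['title^1.5', 'content'],
                    'type': 'best_fields',
                }}
            ]}},
            'query_weight': 0.7,
            'rescore_query_weight': 0.3,
        },
    },
}

print(json.dumps(search_body, indent=2))
```

Unlike the indexing by position used in the logging code, the slicing-based helper also handles vectors shorter than six elements.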

## Step 4: Context Formatting & Prompt Engineering

Once we have relevant documentation chunks, we need to format them for the LLM and construct an effective prompt. This step is important for reducing hallucinations and improving response quality.

### 4.1: Format Context for LLM

Each retrieved document is formatted with metadata (title, operator, relevance score, URL) to help the LLM understand the source quality and provide citations.

In [None]:
def format_context_for_llm(docs):
    """Format retrieved documents into context string with metadata."""
    context_parts = []
    for doc in docs:
        context_parts.append(
            f"Source: {doc['title']} (from {doc['operator']}, relevance: {doc['score']:.2f})\n"
            # Search results store a missing URL as '', so fall back on any empty value
            f"URL: {doc.get('url') or 'N/A'}\n"
            f"{doc['content']}"
        )
    return "\n\n".join(context_parts)

print("Context formatter loaded")

### 4.2: Build RAG Prompt

The prompt determines how the LLM uses the retrieved context, so it has a large effect on answer quality.

**Key prompt engineering principles:**
- **Constrain to sources**: Explicitly tell the LLM to use ONLY the provided documentation, reducing hallucination
- **Prioritize by relevance**: Mention that higher relevance scores indicate better sources
- **Enable citations**: Include URLs so the LLM can reference specific documentation pages
- **Prevent URL fabrication**: Explicitly forbid making up URLs
- **Natural language**: Instruct to say "the documentation" rather than "the context" for more natural responses
- **Admit uncertainty**: Tell the LLM to say when the answer isn't in the docs

In [None]:
def build_rag_prompt(query, context_text):
    """Construct the complete RAG prompt with instructions."""
    prompt = f"""You are a technical documentation assistant for Stackable Data Platform. Answer the question using ONLY the provided documentation sources.

IMPORTANT RULES:
- Use only information from the documentation sources below
- Focus on sources with higher relevance scores
- Include documentation URLs for further reading where relevant
- Do NOT make up URLs - only use URLs provided below
- If the answer is not available in the documentation, say so clearly
- Be concise and technical
- When referring to your sources, use natural language like 'the documentation' rather than 'the context'

Documentation sources:
{context_text}

Question: {query}

Answer:"""
    return prompt

print("RAG prompt builder loaded")
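
Prompt instructions reduce URL fabrication but cannot guarantee it never happens. As an optional safeguard, a generated answer could be checked against the URLs that were actually retrieved. The sketch below is one simple way to do that. `find_unknown_urls` is a hypothetical helper, the URL pattern is deliberately naive, and the example data is hand-written.

```python
import re


def find_unknown_urls(answer, context_docs):
    """Return URLs that appear in the answer but not in any retrieved chunk."""
    allowed = {doc.get('url', '') for doc in context_docs}
    # Naive pattern: http(s) followed by non-whitespace characters;
    # trailing punctuation is stripped afterwards
    cited = {url.rstrip('.,;:)') for url in re.findall(r'https?://\S+', answer)}
    return sorted(cited - allowed)


# Hand-written example data, not real pipeline output
docs = [{'url': 'https://example.com/docs/a'}]
answer = "See https://example.com/docs/a and https://example.com/docs/made-up."
unknown = find_unknown_urls(answer, docs)
print(unknown)  # ['https://example.com/docs/made-up']
```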

## Step 5: Response Generation with Streaming

The final step sends our prompt to Llama 3.1 8B running in Ollama. We use streaming so tokens appear as they're generated, providing immediate feedback rather than waiting for the complete response.

**Why streaming?** Users see progress and can start reading before the full response is ready.

In [None]:
def generate_response(query, context_docs, stream=True):
    """Generate response using Llama 3.1 8B with retrieved context."""
    # Format context
    context_text = format_context_for_llm(context_docs)
    
    # Build prompt
    prompt = build_rag_prompt(query, context_text)
    
    # Call Ollama
    response = requests.post(
        f'http://{OLLAMA_HOST}:{OLLAMA_PORT}/api/generate',
        json={
            'model': OLLAMA_LLM_MODEL,
            'prompt': prompt,
            'stream': stream
        },
        stream=stream
    )
    response.raise_for_status()
    
    if stream:
        # Stream tokens as they're generated
        full_response = ""
        for line in response.iter_lines():
            if line:
                chunk = json.loads(line)
                if 'response' in chunk:
                    token = chunk['response']
                    print(token, end='', flush=True)
                    full_response += token
        print()  # New line after streaming
        return full_response
    else:
        # Return complete response
        return response.json()['response']

print("Response generator loaded")
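
The streaming branch above reads newline-delimited JSON: one object per line, each carrying a fragment of text in `response`. The sketch below replays that parsing logic on hand-written lines so it can be run without Ollama. The sample lines are simplified (real chunks carry additional fields that are omitted here), and `collect_stream` is a hypothetical name.

```python
import json

# Hand-written lines in the shape of a streamed response:
# one JSON object per line, the last one marked "done": true
sample_lines = [
    b'{"response": "Kafka", "done": false}',
    b'{"response": " is", "done": false}',
    b'',  # empty line, skipped below
    b'{"response": " supported.", "done": false}',
    b'{"response": "", "done": true}',
]


def collect_stream(lines):
    """Join the 'response' fragments of a newline-delimited JSON stream."""
    parts = []
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        parts.append(chunk.get('response', ''))
        if chunk.get('done'):
            break  # final chunk reached
    return ''.join(parts)


text = collect_stream(sample_lines)
print(text)  # Kafka is supported.
```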

## Complete RAG Pipeline

The `rag_query()` function ties all steps together and provides user feedback at each stage. This is the main interface for asking questions.

In [None]:
def rag_query(query, k=10, stream=True, log_query=True):
    """Complete RAG workflow: search + generate."""
    print(f"Query: {query}")
    print()
    
    # Show if operator detected
    detected_operator = detect_operator(query)
    if detected_operator:
        print(f"Detected operator: {detected_operator}")
    
    print("Retrieving relevant chunks...")
    print()
    
    # Search with query logging
    context_docs = search_documents(query, k=k, log_query=log_query)
    
    print(f"Found {len(context_docs)} relevant chunks:")
    for i, doc in enumerate(context_docs, 1):
        code_indicator = " [code]" if doc.get('has_code_block') else ""
        print(f"  {i}. {doc['title']} (score: {doc['score']:.3f}){code_indicator}")
        print(f"     {doc['url']}")
    
    print()
    print(f"Generating response with {OLLAMA_LLM_MODEL}...")
    print()
    answer = generate_response(query, context_docs, stream=stream)
    print("=" * 80)
    
    return {'query': query, 'context': context_docs, 'answer': answer}

print("RAG pipeline ready")

## Examples

Run the examples below to see the RAG pipeline in action. By default, `log_query=True` so you can inspect the OpenSearch query structure.

### Example 1: Ask about supported Kafka versions

In [None]:
result = rag_query("What versions of Kafka are supported by the Stackable Data Platform?")

### Example 2: Ask about NiFi's authentication methods

In [None]:
result = rag_query("What authentication methods does NiFi support?")

### Example 3: Ask about deploying a Trino cluster

In [None]:
result = rag_query("How do I deploy a Trino cluster with Stackable? Give me a sample yaml file.")

### Try Your Own Questions

Use the cell below to ask your own questions about the Stackable Platform.

In [None]:
# Use stream=True (default) to watch tokens generate in real-time
# Use stream=False to display the complete response once it is generated
# Modify k to change the number of documentation chunks that will be retrieved from OpenSearch to be used as context in the prompt.
# Use log_query=False to stop logging the OpenSearch query.
result = rag_query("YOUR QUESTION HERE", stream=True, k=10, log_query=True)
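
The dictionary returned by `rag_query()` can be post-processed like any other Python object. As a small sketch, the helper below lists the distinct source URLs behind an answer. `unique_sources` is a hypothetical name, and `example_result` is a hand-written stand-in with the same keys as a real result.

```python
def unique_sources(rag_result):
    """Return the distinct, non-empty URLs of the retrieved chunks, in rank order."""
    seen = []
    for doc in rag_result['context']:
        url = doc.get('url')
        if url and url not in seen:
            seen.append(url)
    return seen


# Hand-written stand-in for the dict returned by rag_query()
example_result = {
    'query': 'example question',
    'context': [
        {'title': 'Page A', 'url': 'https://example.com/a', 'score': 0.91},
        {'title': 'Page A (second chunk)', 'url': 'https://example.com/a', 'score': 0.85},
        {'title': 'Page B', 'url': 'https://example.com/b', 'score': 0.80},
    ],
    'answer': 'example answer',
}

sources = unique_sources(example_result)
print(sources)  # ['https://example.com/a', 'https://example.com/b']
```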