# Basic RAG System Example

This notebook demonstrates the basic functionality of the Unstructured RAG system. It walks through the complete process of:
1. Loading a document
2. Processing and chunking the text
3. Generating embeddings
4. Storing the chunks in the Milvus vector database
5. Searching for relevant information
6. Generating natural language responses

By following this notebook, you'll understand how the RAG system works under the hood.

## Setup

First, let's import the necessary modules from the RAG system.

In [None]:
import os
import sys

# Add the parent directory to the import path so the app and rag packages resolve
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), "..")))

from app.config import config
from rag.data_ingestion.loader import load_document
from rag.processing.chunker import chunk_text
from rag.processing.embedder import get_embedder
from rag.retrieval.milvus_client import get_milvus_client, store_chunks
from rag.retrieval.search import search_documents
from rag.generation.response import generate_response

## Loading and Processing a Document

Let's load a document and process it.

In [None]:
# Define document path - using one of the existing PDF files in the data directory
# You can replace this with any PDF, TXT, DOCX, or HTML file you want to process
document_path = "../data/documents/0bfbe0a3-49e9-4a44-9e4c-84f261f0b4b7.pdf"

# If the file doesn't exist, you can use any PDF file you have available
if not os.path.exists(document_path):
    print(f"File {document_path} not found. Please update the path to an existing document.")
    # List the supported documents available in the data directory
    print("\nAvailable documents:")
    for file in sorted(os.listdir("../data/documents")):
        if file.lower().endswith((".pdf", ".txt", ".docx", ".html")):
            print(f"../data/documents/{file}")
else:
    # Load document
    text, metadata = load_document(document_path)
    
    print(f"Document loaded: {metadata.get('file_name')}")
    print(f"Text length: {len(text)} characters")
    print(f"Document type: {metadata.get('file_type', 'unknown')}")
    print(f"\nMetadata: {metadata}")
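
To make the `(text, metadata)` return shape concrete, here is a minimal sketch of a plain-text loader. It is not the implementation of `load_document`: the function name `load_text_file` is hypothetical, it handles only UTF-8 text files, and it fills in only the two metadata keys the cell above reads (`file_name` and `file_type`).

```python
from pathlib import Path


def load_text_file(path):
    """Read a UTF-8 text file and return (text, metadata).

    Hypothetical stand-in for illustration only; the real loader also
    handles PDF, DOCX, and HTML, which this sketch does not attempt.
    """
    file_path = Path(path)
    text = file_path.read_text(encoding="utf-8")
    metadata = {
        "file_name": file_path.name,
        # Extension without the dot, lowercased; "unknown" if there is none
        "file_type": file_path.suffix.lstrip(".").lower() or "unknown",
    }
    return text, metadata
```

Returning the metadata next to the text lets later steps attach the document name to every chunk without reopening the file.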

## Chunking the Document

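Next, split the loaded text into chunks. This cell uses the `text` variable from the previous cell, so it only runs if the document was loaded successfully.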

In [None]:
# Chunk text
chunks = chunk_text(text)

print(f"Document split into {len(chunks)} chunks")
print(f"\nSample chunk:\n{chunks[0].text[:200]}...")
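
The notebook does not show how `chunk_text` splits the text, so the following is only a sketch of one common strategy: fixed-size character windows with overlap. The function name `chunk_by_chars` and the defaults for `chunk_size` and `overlap` are assumptions made for illustration, and it returns plain strings rather than the chunk objects used above.

```python
def chunk_by_chars(text, chunk_size=500, overlap=50):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters so that a sentence cut
    at a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks
```

For example, `chunk_by_chars("abcdefghij", chunk_size=4, overlap=1)` yields three windows, each starting on the last character of the previous one.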

## Generating Embeddings

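Next, generate an embedding for each chunk. The cell below calls `embed_chunks` and then reads the vector back from the first chunk's `embedding` attribute to report its dimension.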

In [None]:
# Get embedder
embedder = get_embedder()

# Generate embeddings
embedder.embed_chunks(chunks)

print(f"Embeddings generated for {len(chunks)} chunks")
print(f"Embedding dimension: {len(chunks[0].embedding)}")
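
An embedding maps a piece of text to a fixed-length vector of numbers. The toy function below illustrates only that idea; it is not what `get_embedder()` returns. Real embedders use a trained model that captures meaning, whereas this hypothetical `toy_embed` just hashes words into buckets and normalizes the counts.

```python
import hashlib
import math


def toy_embed(text, dim=8):
    """Hash each lowercase word into one of `dim` buckets, then L2-normalize.

    Illustration only: word hashing carries no semantic information.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        # hashlib gives the same bucket on every run, unlike the built-in hash()
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0

    norm = math.sqrt(sum(v * v for v in vec))
    # Leave the all-zero vector for empty input untouched to avoid dividing by zero
    return [v / norm for v in vec] if norm else vec
```

Every input produces a vector of the same length, which is the property a vector database relies on when it compares a query against stored chunks.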

## Storing Chunks in Milvus

Now, let's store the chunks in Milvus.

In [None]:
# Use the file name as the document ID
doc_id = os.path.basename(document_path)
doc_name = metadata.get("file_name", doc_id)

# Store chunks
store_chunks(chunks, doc_id, doc_name, embedder)

print(f"Chunks stored in Milvus collection: {config.milvus.collection}")
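
Storing a chunk in a vector database typically means writing one row per chunk that holds the vector alongside the text and some document metadata. The sketch below shows one possible row layout. The field names, the `ChunkStub` class, and `build_rows` are all hypothetical; the actual schema used by `store_chunks` is not shown in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkStub:
    """Hypothetical stand-in for the notebook's chunk objects."""
    text: str
    embedding: list = field(default_factory=list)


def build_rows(chunks, doc_id, doc_name):
    """Turn embedded chunks into insert-ready dictionaries (assumed schema)."""
    rows = []
    for index, chunk in enumerate(chunks):
        rows.append({
            "id": f"{doc_id}:{index}",   # unique per chunk within a document
            "document_id": doc_id,
            "document_name": doc_name,
            "chunk_index": index,        # preserves the original chunk order
            "text": chunk.text,
            "vector": chunk.embedding,
        })
    return rows
```

Keeping `document_name` on every row is what allows a search result to report which document it came from.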

## Searching for Information

Let's search the vector database. Note that the query runs against every document stored in the collection, not only the one loaded above.

In [None]:
# Define a query - this will search across all documents in the vector database
query = "What is the main topic of this document?"

# Search for relevant chunks
results = search_documents(query)

print(f"Query: '{query}'")
print(f"Found {len(results)} relevant chunks")

# Display the top 3 results with their relevance scores
for i, result in enumerate(results[:3], start=1):
    print(f"\nResult {i} (Score: {result.score:.4f})")
    print(f"Document: {result.metadata.get('document_name', 'Unknown')}")
    print(f"Text: {result.text[:200]}...")
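
Conceptually, vector search embeds the query and ranks the stored vectors by similarity to it. The sketch below does this by brute force with cosine similarity. It is an illustration under assumptions: `cosine_similarity` and `top_k` are hypothetical helpers, and Milvus itself normally uses an index for approximate search with a metric that depends on how the collection was configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=3):
    """Rank (text, vector) pairs by similarity to query_vec.

    Returns up to k (score, text) pairs, best match first.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Brute force compares the query with every stored vector, which is fine for a handful of chunks but is the cost that a vector index is designed to avoid.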

## Generating a Response

Finally, let's generate a response to the query.

In [None]:
# Generate response
response = generate_response(query, results)

print(f"Query: {query}")
print(f"\nResponse:\n{response}")
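
The generation step in a RAG pipeline usually places the retrieved chunks into a prompt ahead of the question. The notebook does not show how `generate_response` builds its prompt, so the following is only a sketch: `build_prompt`, its wording, and the `max_chars` budget are assumptions for illustration.

```python
def build_prompt(query, passages, max_chars=4000):
    """Assemble a grounded prompt from retrieved passages (assumed format).

    Passages are added in rank order until the character budget is used up,
    so the most relevant context is the last to be dropped.
    """
    context_parts = []
    used = 0
    for number, passage in enumerate(passages, start=1):
        entry = f"[{number}] {passage}"
        if used + len(entry) > max_chars:
            break
        context_parts.append(entry)
        used += len(entry)

    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Numbering the passages gives the model a way to point back at its sources, and the budget keeps the prompt within the model's input limit.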

## Try Another Query

Let's run the search and generation steps together for a second query.

In [None]:
# Define another query - try asking about specific information in your documents
query = "What information is available about LLMs or Large Language Models?"

# Search and generate response
results = search_documents(query)
response = generate_response(query, results)

print(f"Query: {query}")
print(f"\nResponse:\n{response}")

# You can try other queries related to your documents
# For example:
# - "What are the main features of the RAG system?"
# - "How does vector search work?"
# - "Explain the document processing pipeline"
# - "What are the advantages of using Milvus for vector storage?"