# Building a RAG Application with Ollama deepseek-r1:32b, Llama Index, and LangChain


## 1. Introduction

Welcome to this step-by-step guide on building a **Retrieval-Augmented Generation (RAG)** application! In this notebook, we will combine the power of retrieval methods with advanced language generation techniques. Our goal is to create a system that can retrieve relevant information from your data sources and use a state-of-the-art language model to generate insightful responses.

### What is RAG?

RAG stands for *Retrieval-Augmented Generation*. It is an approach that:
- **Retrieves**: Searches and fetches relevant documents or pieces of data.
- **Generates**: Leverages a large language model (LLM) to produce contextually accurate and insightful outputs based on the retrieved information.

This technique is especially useful for tasks such as:
- Question Answering
- Chatbots and Conversational Agents
- Document Summarization
- Knowledge-based Systems
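
To make the retrieve-then-generate loop described above concrete, here is a minimal, dependency-free sketch. It is only an illustration: the word-overlap scoring stands in for the embedding-based retrieval used later in this notebook, the sample documents are made up, and the final call to a language model is left out.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Toy retriever: rank documents by the number of words shared with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in documents]
    # Keep only documents that share at least one word, best match first
    ranked = sorted((pair for pair in scored if pair[0] > 0),
                    key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]


def build_prompt(query: str, passages: List[str]) -> str:
    """Combine the retrieved passages and the question into one prompt."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Made-up sample documents, for illustration only
docs = [
    "The dental plan covers two cleanings per year.",
    "The HMO medical plan requires a primary care physician.",
    "Parking passes are issued by the facilities team.",
]
question = "What does the dental plan cover?"
passages = retrieve(question, docs)
prompt = build_prompt(question, passages)
print(prompt)  # this prompt would then be sent to the LLM
```

In the real application, `retrieve` is replaced by a vector index lookup and the prompt is sent to deepseek-r1:32b.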

### Key Components

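1. **Ollama deepseek-r1:32b**:  
   The large language model that generates the final answers, served locally through Ollama. We'll explain how to set it up and use it effectively, even if you are starting from scratch. Embeddings for retrieval come from a separate embedding model, covered in Section 3.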

2. **Llama Index**:  
   A tool to efficiently build and manage indexes over your data. It simplifies organizing and querying your documents.

3. **LangChain**:  
   A versatile framework that helps integrate various components (like LLMs and indexes) into a coherent application. It provides a high-level interface to work with large language models.

### What to Expect

In the sections that follow, we will cover:
- **Setting Up Your Environment**:  
  How to install and configure Ollama deepseek-r1:32b, including a dedicated section for those who haven’t set it up yet.

- **Data Ingestion & Indexing with Llama Index**:  
  Step-by-step instructions on how to prepare your data and build an index to enable efficient retrieval.

- **Integrating with LangChain**:  
  How to tie everything together by interfacing the index with your language model for retrieval-augmented generation.

- **Example Use Cases & Exercises**:  
  Practical code snippets and exercises to help you apply what you’ve learned in real-world scenarios.

By the end of this notebook, you’ll have a clear understanding of how to build and deploy your own RAG application, empowering you to tackle complex information retrieval and generation tasks.

Let's get started!


## 2. Environment Setup

In this section, we will prepare our environment by installing the necessary Python libraries, setting up Ollama with the **deepseek-r1:32b** model, and verifying that our setup is working correctly.


### 2.1. Installing Required Libraries

Our RAG application will leverage the following key Python packages:
- **LangChain**: For integrating with large language models.
- **Llama Index**: For building and querying document indexes.
- **Requests**: For making HTTP calls to the local Ollama API.

Open your terminal or command prompt and run the following command to install these packages:

```bash
pip install langchain llama-index requests
```

For the latest installation instructions or updates, please refer to the official documentation:
- [LangChain GitHub Repository](https://github.com/hwchase17/langchain)
- [Llama Index Documentation](https://gpt-index.readthedocs.io/en/latest/)
- [Requests Documentation](https://requests.readthedocs.io/en/latest/)

---

### 2.2. Setting Up Ollama and deepseek-r1:32b

**Ollama** is a platform that enables you to run large language models locally. In our case, we will use it to host the **deepseek-r1:32b** model.

#### Steps to Set Up

1. **Install Ollama**:  
   Visit the [Ollama website](https://ollama.com) and follow the installation instructions for your operating system.

2. **Download deepseek-r1:32b**:  
   If you haven’t already downloaded the model, you can pull it via the Ollama CLI:
   ```bash
   ollama pull deepseek-r1:32b
   ```

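3. **Start the Model**:  
   Ensure that the Ollama server is running on your machine. Most installers start it in the background; if yours did not, run `ollama serve` in a terminal. You can then load the model with `ollama run deepseek-r1:32b`. The exact steps might vary depending on your installation, so consult the [Ollama documentation](https://ollama.com/docs) for detailed guidance.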

> **Note:** Ollama also exposes a local HTTP API (by default at `http://localhost:11434`). If you prefer it over the CLI, Section 2.3.3 shows how to call it.

Once you have completed these steps, your local deepseek-r1:32b model should be ready for use.
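
One way to confirm that the model was pulled is to ask the Ollama server which models it has locally. The sketch below assumes the server listens on the default `http://localhost:11434` and uses the `GET /api/tags` endpoint; the helper names (`installed_models`, `fetch_tags`, `model_is_installed`) are hypothetical and exist only for this illustration. The demo at the bottom runs on a hand-written payload so that it works without a server.

```python
import json
import urllib.request
from typing import Any, Dict, List


def installed_models(tags_payload: Dict[str, Any]) -> List[str]:
    """Extract model names from the JSON returned by Ollama's GET /api/tags."""
    return [entry.get("name", "") for entry in tags_payload.get("models", [])]


def fetch_tags(base_url: str = "http://localhost:11434") -> Dict[str, Any]:
    """Ask a running Ollama server for its locally available models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


def model_is_installed(model: str, tags_payload: Dict[str, Any]) -> bool:
    """Return True if the exact model tag appears in the payload."""
    return model in installed_models(tags_payload)


# Offline illustration with a hand-written payload; with a server running,
# use fetch_tags() instead of this sample.
sample_payload = {"models": [{"name": "deepseek-r1:32b"}, {"name": "mistral:latest"}]}
print(model_is_installed("deepseek-r1:32b", sample_payload))
```

With Ollama running, `model_is_installed("deepseek-r1:32b", fetch_tags())` performs the same check against your machine.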


---


### 2.3. Verifying the Setup

Before proceeding, let’s verify that both our Python environment and deepseek-r1:32b are working as expected.

#### 2.3.1. Verify Python Package Installation

Run the following code snippet in a Python cell to ensure that all necessary packages are installed and importable:

In [None]:
import langchain
import llama_index
import requests

#### 2.3.2. Verify deepseek-r1:32b via the Ollama CLI

If you’re using the Ollama CLI to interact with deepseek-r1:32b, you can run a quick test. Create a helper function in your notebook that sends a prompt to the model:

In [None]:
import subprocess

def query_deepseek(prompt: str) -> str:
    """
    Sends a prompt to the deepseek-r1:32b model via the Ollama CLI.
    """
    command = ["ollama", "run", "deepseek-r1:32b", prompt]
    result = subprocess.run(command, capture_output=True, text=True)
    
    if result.returncode != 0:
        raise RuntimeError(f"Error calling deepseek-r1:32b: {result.stderr}")
    
    return result.stdout.strip()

# Test the function:
try:
    test_response = query_deepseek("Hello, deepseek-r1:32b! Please confirm you are running.")
    print("Model Response:", test_response)
except Exception as e:
    print(e)


#### 2.3.3. (Optional) Verify deepseek-r1:32b via an HTTP API

Ollama serves an HTTP API on `http://localhost:11434` by default, which you can test using the `requests` library. Adjust the endpoint if your server listens on a different host or port:

In [None]:
import requests

def query_deepseek_api(prompt: str) -> dict:
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": "deepseek-r1:32b",
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(url, json=payload)
    response.raise_for_status()
    return response.json()

# Example usage
result = query_deepseek_api("Why is the sky blue?")
print("API Response:", result)


In [None]:
result.get('response')
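
The call above sets `"stream": False`, so the whole answer arrives as one JSON object. With `"stream": True`, Ollama instead sends one JSON object per line, each carrying a fragment in its `"response"` field and a `"done"` flag on the last one. The sketch below shows one way to reassemble such a stream; `join_stream` is a hypothetical helper, and the input lines are simulated so the example runs without a server.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[str]) -> str:
    """Concatenate the "response" fragments of a streamed /api/generate reply."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of the answer
    return "".join(parts)


# Simulated stream, for illustration only
simulated_lines = [
    '{"response": "The sky ", "done": false}',
    '{"response": "is blue.", "done": false}',
    '{"response": "", "done": true}',
]
streamed_text = join_stream(simulated_lines)
print(streamed_text)
```

Against a live server, the lines could come from `requests.post(url, json=payload, stream=True).iter_lines(decode_unicode=True)`, which lets you display long answers as they are generated.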

## 3. Indexing with Llama Index – Advanced Customization

In this step, we focus on how to index and customize the processing of the documents you have already uploaded (e.g., into a local folder). This section covers:

1. **Loading Your Uploaded Documents**  
   Using Llama Index’s built-in loaders to ingest files from a directory.

2. **Transformations**  
   Customizing how the documents are split into nodes (chunks) and adding metadata to improve retrieval.

3. **Indexing and Querying**  
   Building a vector index with your transformed documents and querying it for relevant information.

---

### 3.1. Loading Your Uploaded Documents

Assuming you have already uploaded your documents into a local folder (for example, `./data`), you can use the `SimpleDirectoryReader` to load them:


In [None]:
!pip install llama-index-core llama-index-readers-file llama-index-llms-ollama llama-index-embeddings-ollama llama-index-embeddings-huggingface transformers requests

In [None]:
from llama_index.core import SimpleDirectoryReader

# Define path to the root directory containing subfolders
directory_path = "./data"

# Load all files, including those inside subfolders
documents = SimpleDirectoryReader(directory_path, recursive=True).load_data()

print(f"Loaded {len(documents)} documents from {directory_path}")
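
Before loading a large folder, it can help to preview which file types it contains. The following sketch uses only the standard library and walks the folder recursively, similar in spirit to `recursive=True` above. It does not reproduce the reader's own filtering rules, and `summarize_directory` is a hypothetical helper name.

```python
import tempfile
from collections import Counter
from pathlib import Path


def summarize_directory(root: str) -> Counter:
    """Count files under root (including subfolders) by lowercase extension."""
    return Counter(
        path.suffix.lower() or "<none>"
        for path in Path(root).rglob("*")
        if path.is_file()
    )


# Demo on a throwaway folder so the example does not depend on ./data
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "plans").mkdir()
    (Path(tmp) / "plans" / "dental.pdf").write_bytes(b"")
    (Path(tmp) / "plans" / "hmo.PDF").write_bytes(b"")
    (Path(tmp) / "notes.txt").write_text("hello")
    counts = summarize_directory(tmp)

print(dict(counts))
```

Calling `summarize_directory("./data")` gives a quick sanity check that the expected documents are in place before you build the index.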

### 3.2. Transformations

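Before indexing, you can customize how your documents are processed. This typically involves splitting them into smaller chunks (nodes) and adding metadata. You have two main options: set the splitter globally through `Settings.text_splitter` (shown in the next cell), or pass it to a single index through the `transformations` argument (used when building the indexes later in this section).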


In [None]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings

# Customize the text splitter; chunk_size and chunk_overlap are measured in tokens
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10)

# Option 1: Set the custom text splitter globally
Settings.text_splitter = text_splitter

print("Custom text splitter configured: chunk size 512 with a 10-token overlap.")
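
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window chunker. It is not how `SentenceSplitter` works internally (the real splitter counts tokens with a tokenizer and tries to respect sentence boundaries); whitespace-separated words stand in for tokens, and `chunk_tokens` is a hypothetical helper.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last word of the previous one. A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed and store.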

#### **Building the Vector Index with Hugging Face API**  
This approach uses the Hugging Face API (`gte-large` model) to generate embeddings. A custom text splitter transformation is applied to segment the documents before indexing them. This method requires an internet connection and an API key.

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceInferenceAPIEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings

# Example using Hugging Face API
# Replace <auth-token> with your own Hugging Face access token
embedding_model = HuggingFaceInferenceAPIEmbedding(
    model_name="thenlper/gte-large",
    token="<auth-token>"
)

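# Build the index with the custom text splitter transformation and the Hugging Face API embedding model
# Note: documents[0:1] indexes only the first document (e.g. to limit API calls);
# pass `documents` to index everything
index = VectorStoreIndex.from_documents(
    documents[0:1],
    transformations=[text_splitter],
    embed_model=embedding_model
)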
print("Custom index built successfully!")

# Save index to disk (the directory name matches the default used by query_index_hf below)
storage_dir = "huggingface_index_storage"
index.storage_context.persist(persist_dir=storage_dir)
print(f"Index saved to disk at: {storage_dir}")

#### **Building the Vector Index with Ollama Mistral**  
This approach uses the **Mistral model** via Ollama to generate embeddings locally. A custom text splitter transformation is applied before indexing the documents. This method runs entirely offline but requires downloading the model in advance.

**Note: Choose only one approach based on your needs:**  
- Use **Hugging Face API** if you want cloud-based embeddings and don't mind API costs. There is also a free tier but it has rate limits.
- Use **Ollama Mistral** if you want a **fully local** and **free** solution and have the necessary compute resources.

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.storage import StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.ollama import OllamaEmbedding

# Initialize the Ollama embedding model
# Note: run `ollama pull mistral` first if you have not downloaded the model yet
embedding_model = OllamaEmbedding(model_name="mistral")

# Create index with Ollama Mistral embedding
index = VectorStoreIndex.from_documents(documents, transformations=[text_splitter], embed_model=embedding_model)

# Save index
storage_dir = "mistral_index_storage"
index.storage_context.persist(persist_dir=storage_dir)
print(f"✅ Index saved at: {storage_dir}")


#### **Querying the Vector Index with Hugging Face API**  
This approach retrieves relevant information using the **Hugging Face API** embeddings. It requires an internet connection and an API key.

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceInferenceAPIEmbedding
from llama_index.core import load_index_from_storage, StorageContext
from llama_index.core.query_engine import RetrieverQueryEngine

def query_index_hf(query, index=None, storage_dir="huggingface_index_storage"):
    """Query the index using Hugging Face API embeddings."""
    # The query must be embedded with the same model that built the index
    embedding_model = HuggingFaceInferenceAPIEmbedding(model_name="thenlper/gte-large")

    if index is None:
        storage_context = StorageContext.from_defaults(persist_dir=storage_dir)
        index = load_index_from_storage(storage_context, embed_model=embedding_model)

    retriever = index.as_retriever()
    # The query engine synthesizes its answer with Settings.llm; configure that
    # setting beforehand if you do not want LlamaIndex's default LLM
    query_engine = RetrieverQueryEngine(retriever=retriever)

    return query_engine.query(query)


#### **Querying the Vector Index with Ollama Mistral**  
This approach retrieves relevant information using the **Mistral model** via Ollama. It runs locally and requires the model to be downloaded beforehand.

In [None]:
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import load_index_from_storage, StorageContext
from llama_index.core.query_engine import RetrieverQueryEngine

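def query_index_ollama(query, index=None, storage_dir="mistral_index_storage"):
    """Query the index using Ollama Mistral embeddings."""
    # The query must be embedded with the same model that built the index
    embedding_model = OllamaEmbedding(model_name="mistral")

    if index is None:
        storage_context = StorageContext.from_defaults(persist_dir=storage_dir)
        index = load_index_from_storage(storage_context, embed_model=embedding_model)

    retriever = index.as_retriever()
    # The query engine synthesizes its answer with Settings.llm; configure that
    # setting beforehand if you do not want LlamaIndex's default LLM
    query_engine = RetrieverQueryEngine(retriever=retriever)

    return query_engine.query(query)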


query_index_ollama('What are my dental benefits', index)
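
Conceptually, the retrieval step behind the query functions above embeds the query and returns the stored chunks whose vectors are most similar to it. The sketch below illustrates that idea with plain Python lists and cosine similarity. The two-dimensional vectors are invented for the example (real embeddings have hundreds or thousands of dimensions), and actual vector stores may use other metrics or approximate search.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          node_vecs: List[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar stored vectors."""
    scores = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(node_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Invented toy embeddings for three chunks
node_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
results = top_k([1.0, 0.1], node_vecs, k=2)
for idx, score in results:
    print(f"chunk {idx}: similarity {score:.3f}")
```

The `k` argument plays the same role as `similarity_top_k` in LlamaIndex retrievers: it caps how many chunks are handed to the language model.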

## 4. Implementing a RAG System with LlamaIndex, LangChain, and DeepSeek
In this step, we integrate **LlamaIndex**, **LangChain**, and **DeepSeek** to build a **Retrieval-Augmented Generation (RAG) application**.  

### **How It Works**
1. **Retrieve Relevant Documents**  
   - We use `LlamaIndexRetriever` with **Ollama Mistral** embeddings to fetch relevant documents.  

2. **Format the Query with Context**  
   - The retrieved content is formatted into a structured prompt to guide DeepSeek in generating a better response.  

3. **Generate a Response with DeepSeek**  
   - We use `DeepSeekLLM`, a small wrapper class around the local Ollama HTTP API, to send the prompt and generate an answer.  
   - The response is split into:
     - **Reasoning:** The model's thought process.
     - **Answer:** The final extracted response.  
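
The reasoning/answer split in step 3 relies on DeepSeek-R1 wrapping its thought process in `<think>...</think>` tags. As a standalone illustration, the split can be sketched with a regular expression. `split_reasoning` is a hypothetical helper, and treating untagged output as the answer is a design choice of this sketch.

```python
import re
from typing import Tuple

# DOTALL lets the reasoning block span multiple lines
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from a model output that may contain a <think> block."""
    match = THINK_BLOCK.search(text)
    if match is None:
        # No reasoning block: treat the whole output as the answer
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first <think> block is the answer
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>The user greets me.</think>\n\nHello!")
print("Reasoning:", reasoning)
print("Answer:", answer)
```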

### **Key Components**
- **LlamaIndexRetriever** → Retrieves context using **LlamaIndex** and **Ollama Mistral** embeddings.  
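- **DeepSeekLLM** → Calls a locally running **DeepSeek-R1** model through Ollama's HTTP API.  
- **LangChain Integration** → LangChain's `LLM` base class is imported in the cell below, but `DeepSeekLLM` is written as a plain Python class; subclassing `LLM` would make it usable inside LangChain chains that combine retrieval and generation.  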
- **`generate_response(query)`** → Orchestrates retrieval and response generation to deliver grounded answers.  
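
When many passages are retrieved, the combined context can exceed what the model handles well. The following sketch shows one possible way to format the prompt while capping the context size. The `max_context_chars` parameter and the passage numbering are assumptions made for this illustration, and a character count is only a rough proxy for tokens.

```python
from typing import List


def build_prompt(query: str, passages: List[str], max_context_chars: int = 4000) -> str:
    """Number the passages and keep adding them until the character budget is used up."""
    selected = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        block = f"[{i}] {passage.strip()}"
        if used + len(block) > max_context_chars:
            break  # passages arrive ranked, so the least relevant ones are dropped
        selected.append(block)
        used += len(block)

    context = "\n\n".join(selected)
    return (
        "Given the following context, answer the question concisely.\n\n"
        f"### Context:\n{context}\n\n"
        f"### Question: {query}\n"
        "### Answer:"
    )


demo_prompt = build_prompt(
    "What is covered?",
    ["First passage.", "Second passage.", "Third passage."],
    max_context_chars=40,
)
print(demo_prompt)
```

Numbering the passages also makes it possible to ask the model to cite which passage supports its answer.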



In [None]:
import requests
from langchain.llms.base import LLM
from llama_index.core import load_index_from_storage, StorageContext
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.settings import Settings
from typing import List, Optional, Dict

# Disable OpenAI API key dependency
Settings.embed_model = None

# --- DeepSeek LLM Class ---
class DeepSeekLLM:
    """A simple wrapper for a locally hosted DeepSeek model."""
    
    def __init__(self, api_url: str = "http://localhost:11434/api/generate", model_name: str = "deepseek-r1:32b"):
        self.api_url = api_url
        self.model_name = model_name
    
    def query(self, prompt: str) -> Dict[str, str]:
        payload = {
            "model": self.model_name,
            "prompt": prompt,
            "stream": False
        }
        response = requests.post(self.api_url, json=payload)
        response.raise_for_status()
        
        try:
            raw_response = response.json()
            full_response = raw_response.get("response", "No response generated.").strip()
            
            # Extract reasoning and answer based on <think> tags
            if "<think>" in full_response and "</think>" in full_response:
                reasoning_start = full_response.find("<think>") + len("<think>")
                reasoning_end = full_response.find("</think>")
                reasoning = full_response[reasoning_start:reasoning_end].strip()
                answer = full_response[reasoning_end + len("</think>"):].strip()
            else:
                # No <think> block: treat the whole output as the answer
                reasoning, answer = "", full_response
            
            return {
                "prompt": prompt,
                "reasoning": reasoning,
                "answer": answer,
                # Ollama's "context" field holds token IDs, not documents;
                # generate_response() fills this in with the retrieved passages
                "relevant_documents": []
            }
        except requests.exceptions.JSONDecodeError:
            return {
                "prompt": prompt,
                "reasoning": "Error: Unable to decode DeepSeek response.",
                "answer": "",
                "relevant_documents": []
            }

# --- Retriever Class ---
class LlamaIndexRetriever:
    """Retrieves relevant documents using LlamaIndex and Ollama Mistral embeddings."""
    
    def __init__(self, index=None, storage_dir: str = "mistral_index_storage", num_passages: int = 15):
        self.storage_dir = storage_dir
        self.embedding_model = OllamaEmbedding(model_name="mistral")  # local model, no API key needed
        self.num_passages = num_passages  # Number of passages to retrieve

        if index is None:
            storage_context = StorageContext.from_defaults(persist_dir=storage_dir)
            self.index = load_index_from_storage(storage_context, embed_model=self.embedding_model)
        else:
            self.index = index
        
        self.retriever = self.index.as_retriever(similarity_top_k=self.num_passages)
        
    def retrieve(self, query: str) -> List[str]:
        # Call the retriever directly; wrapping it in a query engine would also
        # run an answer-synthesis step with LlamaIndex's configured LLM
        nodes = self.retriever.retrieve(query)
        
        # Extract text from retrieved nodes
        documents = [node.get_content() for node in nodes]
        return documents

# --- End-to-End Execution ---
def generate_response(query: str, index=None, num_passages: int = 15) -> Dict[str, str]:
    retriever = LlamaIndexRetriever(index=index, num_passages=num_passages)
    deepseek_llm = DeepSeekLLM()
    
    retrieved_texts = retriever.retrieve(query)
    
    if not retrieved_texts:
        return {"prompt": query, "reasoning": "No relevant context found to answer the question.", "answer": "", "relevant_documents": []}
    
    formatted_context = "\n\n".join(retrieved_texts)
    
    prompt = (f"Given the following context, answer the question concisely.\n\n"
              f"### Context:\n{formatted_context}\n\n"
              f"### Question: {query}\n"
              f"### Answer:")
    
    response_dict = deepseek_llm.query(prompt)
    response_dict["relevant_documents"] = retrieved_texts  # Ensure full document texts are included
    
    return response_dict

# Example usage
query = "Summarize my HMO Medical Plan"
response = generate_response(query, index=None, num_passages=15)
print(f"DeepSeek R1 Response:\n{response['answer']}\n\n")
print(f"DeepSeek R1 Reasoning:\n{response['reasoning']}\n\n")
print(f"Retrieved Documents:\n{response['relevant_documents']}")
