# Local Offline RAG with Ollama

## Overview
This notebook demonstrates building a **completely offline RAG (Retrieval-Augmented Generation)** system using **Ollama** for local LLMs and embeddings.

### 🚀 Benefits of Local RAG:
- **100% Offline**: No internet required after setup
- **Privacy First**: Your documents never leave your machine
- **No API Costs**: Free to run unlimited queries
- **No Network Latency**: No round trips to a remote API; speed depends on your local hardware
- **Full Control**: Customize models and parameters

### 📋 Architecture:
```
PDF Documents → Load → Split → Local Embeddings (Ollama) → ChromaDB
                                                              ↓
User Query → Retrieve Similar Chunks → Local LLM (Ollama) → Answer
```

### 🛠️ Components:
- **Document Loader**: PyPDFLoader
- **Text Splitter**: RecursiveCharacterTextSplitter
- **Embeddings**: Ollama with nomic-embed-text (or embeddinggemma)
- **Vector Store**: ChromaDB (persistent, local)
- **LLM**: Ollama with gemma3:1b
- **Chain**: LangChain Expression Language (LCEL)

---

## 1. Prerequisites & Installation

### Required Software:
1. **Ollama**: Download from https://ollama.ai
2. **Python 3.9+**: Recommended 3.11 or 3.13

### Install Python Packages:

In [None]:
# Install required packages (run this once)
# !pip install langchain langchain-core langchain-community langchain-text-splitters
# !pip install langchain-ollama langchain-chroma chromadb
# !pip install pypdf tiktoken

# Or install from requirements.txt with additional packages:
# !pip install -r requirements.txt
# !pip install langchain-ollama langchain-chroma chromadb

### Download Ollama Models:

Run these commands in your terminal (if you haven't already):

```bash
# Embedding model (choose one or both)
ollama pull nomic-embed-text    # Recommended: 274 MB
ollama pull embeddinggemma      # Alternative: 621 MB

# LLM for generation
ollama pull gemma3:1b          # Small & fast: 815 MB
```

**Note**: If `ollama list` already shows these models, you can skip this step. ✓

## 2. Import Required Libraries

In [1]:
# Standard library imports
import os
import sys
from pathlib import Path

# LangChain Document Loaders
from langchain_community.document_loaders import PyPDFLoader

# LangChain Text Splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Ollama Integration
from langchain_ollama import OllamaEmbeddings, ChatOllama

# ChromaDB Vector Store
from langchain_chroma import Chroma

# LangChain Core Components
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

print("✓ All imports successful!")
print("✓ Ready for local offline RAG!")
print(f"\nPython version: {sys.version}")

✓ All imports successful!
✓ Ready for local offline RAG!

Python version: 3.13.2 (main, Mar 17 2025, 21:26:38) [Clang 20.1.0 ]


## 3. Verify Ollama Installation

Let's check that Ollama is running and our models are available.

In [2]:
# Check Ollama is running and list available models
!ollama list

NAME                       ID              SIZE      MODIFIED       
gemma3:1b                  8648f39daa8f    815 MB    29 minutes ago    
embeddinggemma:latest      85462619ee72    621 MB    32 minutes ago    
nomic-embed-text:latest    0a109f422b47    274 MB    7 months ago      
deepseek-r1:latest         0a8c26691023    4.7 GB    7 months ago      
llama3.2:latest            a80c4f17acd5    2.0 GB    7 months ago      
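
The same check can be scripted so the notebook fails early when a model is missing. The sketch below is a minimal example under stated assumptions: `missing_models` is a hypothetical helper (not part of Ollama or LangChain) that parses text in the tabular format shown above and treats a name without a tag as `:latest`.

```python
def missing_models(list_output: str, required: list[str]) -> list[str]:
    """Return the required models that are absent from `ollama list` style output."""
    installed = set()
    for line in list_output.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if parts:
            installed.add(parts[0])  # first column is NAME, e.g. "gemma3:1b"

    def normalize(name: str) -> str:
        # Assumption: a bare name such as "nomic-embed-text" means ":latest".
        return name if ":" in name else f"{name}:latest"

    return [m for m in required if normalize(m) not in installed]


sample = """NAME                       ID              SIZE      MODIFIED
gemma3:1b                  8648f39daa8f    815 MB    29 minutes ago
nomic-embed-text:latest    0a109f422b47    274 MB    7 months ago
"""
print(missing_models(sample, ["gemma3:1b", "nomic-embed-text", "embeddinggemma"]))
```

In the notebook, the text to parse could come from `subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout`.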


In [3]:
# Test Ollama connection with a simple query
print("Testing Ollama connection...\n")

try:
    test_llm = ChatOllama(model="gemma3:1b", temperature=0)
    response = test_llm.invoke("Say 'Hello! I am running locally on your machine!'")
    
    print("✓ Ollama is working!")
    print(f"Response: {response.content}")
    
except Exception as e:
    print(f"✗ Error connecting to Ollama: {e}")
    print("\nMake sure Ollama is running. Try: ollama serve")

Testing Ollama connection...

✓ Ollama is working!
Response: Hello! I am running locally on your machine! 😊



## 4. Load PDF Documents

Load your PDF documents for the RAG system.

In [4]:
# ===== CONFIGURATION: Update this path to your PDF file =====
pdf_path = "attention.pdf"  # Change this to your PDF file path
# =============================================================

# Check if file exists
if not os.path.exists(pdf_path):
    print(f"⚠️  ERROR: File '{pdf_path}' not found!")
    print("Please update the pdf_path variable with your PDF file location.")
else:
    # Load the PDF
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()
    
    # Display information
    print(f"✓ Loaded {len(documents)} pages from '{pdf_path}'")
    print(f"\n--- First Page Preview ---")
    print(f"Content (first 300 chars): {documents[0].page_content[:300]}...")
    print(f"\nMetadata: {documents[0].metadata}")
    print(f"\nTotal characters: {sum(len(doc.page_content) for doc in documents):,}")

✓ Loaded 15 pages from 'attention.pdf'

--- First Page Preview ---
Content (first 300 chars): Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Par...

Metadata: {'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'attention.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}

Total characters: 39,587


### Optional: Load Multiple PDFs from a Directory

In [None]:
# Uncomment to load multiple PDFs from a directory

# pdf_directory = "./pdfs"
# all_documents = []

# if os.path.exists(pdf_directory):
#     pdf_files = list(Path(pdf_directory).glob("*.pdf"))
#     print(f"Found {len(pdf_files)} PDF files\n")
    
#     for pdf_file in pdf_files:
#         loader = PyPDFLoader(str(pdf_file))
#         docs = loader.load()
#         all_documents.extend(docs)
#         print(f"  ✓ Loaded {len(docs)} pages from {pdf_file.name}")
    
#     print(f"\nTotal pages loaded: {len(all_documents)}")
#     documents = all_documents  # Use this for the rest of the pipeline

## 5. Split Documents into Chunks

Break documents into smaller chunks for better retrieval precision.

In [5]:
# Initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,        # Characters per chunk
    chunk_overlap=128,      # Overlap to maintain context
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Split on paragraphs, then lines, etc.
)

# Split documents
chunks = text_splitter.split_documents(documents)

# Display results
avg_chunk_size = sum(len(chunk.page_content) for chunk in chunks) / len(chunks)

print(f"✓ Split {len(documents)} documents into {len(chunks)} chunks")
print(f"\nAverage chunk size: {avg_chunk_size:.0f} characters")

# Preview chunks
print(f"\n--- Chunk Examples ---")
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i+1} (length: {len(chunk.page_content)} chars):")
    print(f"{chunk.page_content[:200]}...")

✓ Split 15 documents into 49 chunks

Average chunk size: 873 characters

--- Chunk Examples ---

Chunk 1 (length: 986 chars):
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
...

Chunk 2 (length: 944 chars):
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more pa...

Chunk 3 (length: 986 chars):
∗Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started
the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Tra...
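
To see what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-width splitter. It is only an illustration: `RecursiveCharacterTextSplitter` additionally prefers to break on the configured separators, which is why the real chunks above vary in length. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


demo = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for piece in demo:
    print(piece)
```

Each window repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.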


## 6. Create Embeddings (Primary: Nomic-Embed-Text)

### About Nomic-Embed-Text:
- **Size**: 274 MB
- **Dimensions**: 768
- **Performance**: Strong retrieval quality for a model of this size
- **Speed**: Fast inference
- **License**: Open source (Apache 2.0)

In [6]:
# Initialize Ollama Embeddings with nomic-embed-text
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    # base_url="http://localhost:11434"  # Default Ollama URL
)

# Test embeddings
print("Testing nomic-embed-text embeddings...\n")
sample_text = "This is a test sentence for embeddings."
sample_embedding = embeddings.embed_query(sample_text)

print(f"✓ Embeddings model: nomic-embed-text")
print(f"✓ Embedding dimension: {len(sample_embedding)}")
print(f"✓ Sample embedding (first 10 values): {sample_embedding[:10]}")
print(f"\nℹ️  Each chunk will be converted to a {len(sample_embedding)}-dimensional vector")
print(f"ℹ️  All processing happens locally on your machine!")

Testing nomic-embed-text embeddings...

✓ Embeddings model: nomic-embed-text
✓ Embedding dimension: 768
✓ Sample embedding (first 10 values): [0.032493647, 0.060827848, -0.16611777, -0.08213917, 0.04330868, -0.026053637, 0.051593572, -0.0151964305, -0.008287022, -0.028351676]

ℹ️  Each chunk will be converted to a 768-dimensional vector
ℹ️  All processing happens locally on your machine!
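
Retrieval works by comparing such vectors: texts with similar meaning should land close together. As a minimal sketch in plain Python (the helper `cosine_similarity` is written here for illustration and is not a LangChain function):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real 768-dimensional embeddings.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal
```

With the notebook's objects, `cosine_similarity(embeddings.embed_query("attention"), embeddings.embed_query("self-attention"))` would compare two real embeddings; the resulting value depends on the model.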


## 7. Alternative: EmbeddingGemma (Optional)

### About EmbeddingGemma:
- **Size**: 621 MB (larger than nomic)
- **Dimensions**: 768
- **Built on**: Google's Gemma model family
- **Use case**: An alternative to benchmark against nomic-embed-text; the embedding model is chosen independently of the generation LLM

**Uncomment the code below to use embeddinggemma instead:**

In [None]:
# # Alternative: Use embeddinggemma instead
# embeddings = OllamaEmbeddings(
#     model="embeddinggemma:latest"
# )

# # Test embeddings
# print("Testing embeddinggemma embeddings...\n")
# sample_text = "This is a test sentence for embeddings."
# sample_embedding = embeddings.embed_query(sample_text)

# print(f"✓ Embeddings model: embeddinggemma")
# print(f"✓ Embedding dimension: {len(sample_embedding)}")
# print(f"✓ Sample embedding (first 10 values): {sample_embedding[:10]}")

## 8. Create ChromaDB Vector Store

### Why ChromaDB?
- **Local & Persistent**: Stores vectors on disk
- **Python 3.13 Compatible**: Works with latest Python
- **Easy to Use**: Simple API
- **Open Source**: Free and fully featured

**Note**: This step may take a minute as it processes all chunks.

In [7]:
# Create ChromaDB vector store
print(f"Creating ChromaDB vector store from {len(chunks)} chunks...")
print("This may take a minute...\n")

# Set persistent directory
persist_directory = "./chroma_db"

# Create vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=persist_directory,
    collection_name="local_rag_collection"
)

print(f"✓ ChromaDB vector store created successfully!")
print(f"✓ Indexed {len(chunks)} document chunks")
print(f"✓ Stored at: {persist_directory}")
print(f"\nℹ️  Vector store persisted to disk - you can reload it later!")

Creating ChromaDB vector store from 49 chunks...
This may take a minute...

✓ ChromaDB vector store created successfully!
✓ Indexed 49 document chunks
✓ Stored at: ./chroma_db

ℹ️  Vector store persisted to disk - you can reload it later!
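
One caveat with a persistent store: running the cell above a second time adds the same chunks to the existing collection again unless each chunk has a stable ID. A possible way around this, sketched under assumptions, is to derive IDs from the chunk content and pass them through the `ids` argument of `Chroma.from_documents`. The helper `make_chunk_ids` is hypothetical, and how your installed `langchain-chroma` version treats an ID that already exists should be verified before relying on it.

```python
import hashlib
from collections import Counter


def make_chunk_ids(records: list[tuple[str, int, str]]) -> list[str]:
    """Build stable IDs from (source, page, text) triples.

    Identical triples get a numeric suffix so IDs stay unique within one batch.
    """
    seen = Counter()
    ids = []
    for source, page, text in records:
        digest = hashlib.sha256(f"{source}|{page}|{text}".encode("utf-8")).hexdigest()[:16]
        ids.append(f"{digest}-{seen[digest]}")
        seen[digest] += 1
    return ids


records = [
    ("attention.pdf", 0, "first chunk"),
    ("attention.pdf", 0, "second chunk"),
    ("attention.pdf", 0, "first chunk"),  # deliberate duplicate
]
ids = make_chunk_ids(records)
print(len(set(ids)) == len(ids))       # IDs are unique within the batch
print(ids == make_chunk_ids(records))  # same input yields the same IDs
```

In the notebook this might look like `ids = make_chunk_ids([(c.metadata.get("source", ""), c.metadata.get("page", -1), c.page_content) for c in chunks])`, followed by passing `ids=ids` to `Chroma.from_documents`.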


### Load Existing Vector Store (Optional)

If you've already created the vector store, you can load it instead:

In [None]:
# # Uncomment to load existing vector store
# persist_directory = "./chroma_db"

# vectorstore = Chroma(
#     persist_directory=persist_directory,
#     embedding_function=embeddings,
#     collection_name="local_rag_collection"
# )

# print(f"✓ Loaded existing vector store from '{persist_directory}'")
# print(f"✓ Collection: local_rag_collection")

## 9. Create Retriever and Test

The retriever finds the most relevant chunks for a given query.

In [8]:
# Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",    # Nearest-neighbor search (Chroma's default metric is L2 distance, not cosine)
    search_kwargs={"k": 4}        # Retrieve top 4 most relevant chunks
)

print("✓ Retriever configured successfully")
print(f"  - Search type: similarity")
print(f"  - Number of documents to retrieve (k): 4")

# Test retrieval
test_query = "What is the main topic of this document?"
print(f"\n--- Retriever Test ---")
print(f"Query: '{test_query}'")

retrieved_docs = retriever.invoke(test_query)

print(f"\nRetrieved {len(retrieved_docs)} documents:")
for i, doc in enumerate(retrieved_docs):
    print(f"\nDocument {i+1}:")
    print(f"  Content preview: {doc.page_content[:150]}...")
    print(f"  Source: Page {doc.metadata.get('page', 'N/A')}")

✓ Retriever configured successfully
  - Search type: similarity
  - Number of documents to retrieve (k): 4

--- Retriever Test ---
Query: 'What is the main topic of this document?'

Retrieved 4 documents:

Document 1:
  Content preview: (section 5.4), learning rates and beam size on the Section 22 development set, all other parameters
remained unchanged from the English-to-German base...
  Source: Page 8

Document 2:
  Content preview: Input-Input Layer5
The
Law
will
never
be
perfect
,
but
its
application
should
be
just
-
this
is
what
we
are
missing
,
in
my
opinion
.
<EOS>
<pad>
The
...
  Source: Page 13

Document 3:
  Content preview: Table 1: Maximum path lengths, per-layer complexity and minimum number of sequential operations
for different layer types. n is the sequence length, d...
  Source: Page 5

Document 4:
  Content preview: Attention Visualizations
Input-Input Layer5
It
is
in
this
spirit
that
a
majority
of
American
governments
have
passed
new
laws
since
2009
making
the
re.
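
Documents 2 and 4 above are text extracted from the paper's attention-visualization figures, one token per line, which carries little meaning for a general question. One possible mitigation is to drop such chunks before indexing. The heuristic below is an illustrative assumption (the function name and both thresholds are not from any library) and should be tuned on your own documents.

```python
def looks_like_figure_residue(text: str, max_avg_line_len: float = 12.0,
                              min_lines: int = 10) -> bool:
    """Flag text made of many very short lines, typical of extracted figure labels."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < min_lines:
        return False  # too few lines to judge
    avg_len = sum(len(line) for line in lines) / len(lines)
    return avg_len <= max_avg_line_len


# One token per line, as in the figure-derived chunks.
residue = "\n".join("The Law will never be perfect , but its application should be just".split())
# Ordinary caption text with long lines.
prose = ("Table 1: Maximum path lengths, per-layer complexity and minimum number of\n"
         "sequential operations for different layer types.")
print(looks_like_figure_residue(residue))
print(looks_like_figure_residue(prose))
```

Applied to the pipeline, the filter could run before the vector store is built: `chunks = [c for c in chunks if not looks_like_figure_residue(c.page_content)]`.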

## 10. Configure Ollama LLM (Gemma3:1b)

### About Gemma3:1b:
- **Size**: 815 MB
- **Parameters**: 1 billion
- **Speed**: Very fast inference
- **Quality**: Good for most Q&A tasks
- **Memory**: Low RAM usage

**Alternatives**: You can also use llama3.2 (2GB) or deepseek-r1 (4.7GB) for better quality.

In [9]:
# Initialize Ollama LLM
llm = ChatOllama(
    model="gemma3:1b",
    temperature=0,          # Deterministic responses (0 = focused, 1 = creative)
    # num_predict=2000,     # Max tokens to generate
    # top_k=40,             # Top-k sampling
    # top_p=0.9,            # Top-p (nucleus) sampling
)

print("✓ LLM configured successfully")
print(f"  - Model: gemma3:1b (local)")
print(f"  - Temperature: 0 (deterministic)")

# Test LLM
test_response = llm.invoke("Say 'Hello! I am Gemma running locally!'")
print(f"\nLLM Test Response: {test_response.content}")

✓ LLM configured successfully
  - Model: gemma3:1b (local)
  - Temperature: 0 (deterministic)

LLM Test Response: Hello! I am Gemma running locally! 😊



### Try Other Local Models (Optional)

In [None]:
# # Alternative: Use llama3.2 for better quality
# llm = ChatOllama(
#     model="llama3.2:latest",
#     temperature=0
# )

# # Alternative: Use deepseek-r1 for reasoning tasks
# llm = ChatOllama(
#     model="deepseek-r1:latest",
#     temperature=0
# )

## 11. Build RAG Chain (LangChain Expression Language)

Combine retrieval and generation into a single pipeline using LCEL.
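
Conceptually, the `|` operator chains steps so that each step's output becomes the next step's input, and a dict step feeds the same input to several branches. The sketch below mimics that flow with plain functions. `fake_retriever`, `fake_llm`, `join_chunks`, `build_prompt`, and `toy_rag_pipeline` are stand-ins invented for illustration, so it runs without Ollama or ChromaDB and does not reflect how LangChain is implemented internally.

```python
def fake_retriever(question: str) -> list[str]:
    # Stand-in for the vector store retriever: always returns the same "chunks".
    return ["The Transformer relies entirely on attention.", "It dispenses with recurrence."]


def join_chunks(chunks: list[str]) -> str:
    # Same idea as a document formatter: one context string, blank line between chunks.
    return "\n\n".join(chunks)


def build_prompt(inputs: dict) -> str:
    return f"Context: {inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt: str) -> str:
    # Stand-in for the chat model: reports how much text it received.
    return f"(answer based on {len(prompt)} prompt characters)"


def toy_rag_pipeline(question: str) -> str:
    # Dict step: the same question feeds both branches.
    inputs = {"context": join_chunks(fake_retriever(question)), "question": question}
    # Prompt step, then LLM step.
    return fake_llm(build_prompt(inputs))


print(toy_rag_pipeline("What does the Transformer rely on?"))
```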

In [10]:
# Define prompt template
system_prompt = (
    "You are a helpful assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer based on the context, say that you don't know. "
    "Keep the answer concise and accurate.\n\n"
    "Context: {context}\n\n"
    "Question: {question}"
)

prompt = ChatPromptTemplate.from_template(system_prompt)

# Helper function to format documents
def format_docs(docs):
    """Format retrieved documents into a single string."""
    return "\n\n".join(doc.page_content for doc in docs)

# Build RAG chain using LCEL
rag_chain = (
    {
        "context": retriever | format_docs,  # Retrieve and format docs
        "question": RunnablePassthrough()      # Pass through the question
    }
    | prompt           # Format with prompt template
    | llm              # Generate answer with local LLM
    | StrOutputParser() # Parse output to string
)

print("✓ RAG chain created successfully using LCEL!")
print("\nRAG Pipeline Flow:")
print("  1. User provides a query")
print("  2. Retriever finds top 4 relevant chunks (local ChromaDB)")
print("  3. Chunks are formatted as context")
print("  4. Context + question formatted with prompt template")
print("  5. Local LLM (gemma3:1b) generates answer")
print("  6. Answer parsed and returned")
print("\n🔒 Everything runs locally on your machine!")

✓ RAG chain created successfully using LCEL!

RAG Pipeline Flow:
  1. User provides a query
  2. Retriever finds top 4 relevant chunks (local ChromaDB)
  3. Chunks are formatted as context
  4. Context + question formatted with prompt template
  5. Local LLM (gemma3:1b) generates answer
  6. Answer parsed and returned

🔒 Everything runs locally on your machine!


## 12. Test RAG Pipeline with Example Queries

Let's test our complete local RAG system!

In [11]:
# Example Query 1: General question
query1 = "What is the main topic or contribution of this document?"

print(f"Query: {query1}")
print("\nProcessing locally...\n")

answer = rag_chain.invoke(query1)

print("=" * 80)
print("ANSWER:")
print("=" * 80)
print(answer)
print("\n" + "=" * 80)

# Show source documents
print("\nSOURCE DOCUMENTS USED:")
print("=" * 80)
retrieved_docs = retriever.invoke(query1)
for i, doc in enumerate(retrieved_docs):
    print(f"\nDocument {i+1}:")
    print(f"  Page: {doc.metadata.get('page', 'N/A')}")
    print(f"  Content: {doc.page_content[:200]}...")
    print("-" * 80)

Query: What is the main topic or contribution of this document?

Processing locally...

ANSWER:
The document discusses the parameters used during the development of the Section 22 development set, specifically focusing on learning rates and beam size.


SOURCE DOCUMENTS USED:

Document 1:
  Page: 8
  Content: (section 5.4), learning rates and beam size on the Section 22 development set, all other parameters
remained unchanged from the English-to-German base translation model. During inference, we
9...
--------------------------------------------------------------------------------

Document 2:
  Page: 12
  Content: Attention Visualizations
Input-Input Layer5
It
is
in
this
spirit
that
a
majority
of
American
governments
have
passed
new
laws
since
2009
making
the
registration
or
voting
process
more
difficult
.
<EOS...
--------------------------------------------------------------------------------

Document 3:
  Page: 13
  Content: Input-Input Layer5
The
Law
will
never
be
perfect
,
but
it

In [12]:
# Example Query 2: Specific information extraction
query2 = "Can you summarize the key technical contributions or innovations mentioned?"

print(f"Query: {query2}")
print("\nProcessing locally...\n")

answer = rag_chain.invoke(query2)

print("=" * 80)
print("ANSWER:")
print("=" * 80)
print(answer)
print("\n" + "=" * 80)

Query: Can you summarize the key technical contributions or innovations mentioned?

Processing locally...

ANSWER:
Here’s a summary of the key technical contributions and innovations mentioned in the text:

*   **Transformer Model Development:** Ashish, with Illia, designed and implemented the first Transformer models.
*   **Parameter-Free Position Representation:** Niki designed, implemented, tuned, and evaluated numerous model variants incorporating parameter-free position representation.
*   **Scaling Attention:** Llion experimented with novel model variants, focusing on scaled dot-product attention and multi-head attention.
*   **Tensor2Tensor Replacement:** Lukasz and Aidan replaced the earlier codebase with tensor2tensor, significantly improving results and accelerating research.
*   **Evaluation of Training Time:** The text estimates the number of floating-point operations used to train a model by multiplying the training time, the number of GPUs used, and an estimate of the s

In [13]:
# Example Query 3: Your custom question
custom_query = "What specific details are mentioned about the methodology or approach?"

print(f"Query: {custom_query}")
print("\nProcessing locally...\n")

answer = rag_chain.invoke(custom_query)

print("=" * 80)
print("ANSWER:")
print("=" * 80)
print(answer)
print("\n" + "=" * 80)

Query: What specific details are mentioned about the methodology or approach?

Processing locally...

ANSWER:
The context describes that the authors used a “decomposable attention model” during inference, averaging attention-weighted positions, and that they used “Self-attention” as a successful technique in various tasks.



## 13. Interactive Q&A Session

Ask your own questions to the RAG system!

In [14]:
# Interactive Q&A
def ask_question(question):
    """Ask a question to the RAG system."""
    print(f"\n{'='*80}")
    print(f"Question: {question}")
    print(f"{'='*80}")
    
    answer = rag_chain.invoke(question)
    
    print(f"\nAnswer: {answer}")
    print(f"{'='*80}\n")
    
    return answer

# Try it out!
# Change the question below to ask anything about your document
my_question = "What are the main findings or results?"
ask_question(my_question)


Question: What are the main findings or results?

Answer: The text describes research focused on Transformer architecture, specifically during the NIPS 2017 conference. It details variations on the Transformer architecture, including different training sets and metrics. The key findings are:

*   **Beam search was used for inference.**
*   **The model used a byte-pair encoding for perplexity.**
*   **The model's performance on the English-to-German translation development set was improved in the Dev set.**
*   **The model's performance on the English-to-German translation development set was improved in the Dev set.**
*   **The model's performance on the newstest2013 development set was improved.**




## 14. Bonus: Compare Embedding Models (Optional)

Compare retrieval results between nomic-embed-text and embeddinggemma.

In [None]:
# # Uncomment to compare embedding models

# print("Comparing embedding models...\n")

# test_query = "What is attention mechanism?"

# # Test with nomic-embed-text
# print("=" * 80)
# print("Using nomic-embed-text:")
# print("=" * 80)
# embeddings_nomic = OllamaEmbeddings(model="nomic-embed-text")
# vectorstore_nomic = Chroma.from_documents(
#     documents=chunks[:10],  # Use first 10 chunks for quick test
#     embedding=embeddings_nomic,
#     collection_name="test_nomic"
# )
# retriever_nomic = vectorstore_nomic.as_retriever(search_kwargs={"k": 2})
# docs_nomic = retriever_nomic.invoke(test_query)

# print(f"\nTop retrieved document:")
# print(f"{docs_nomic[0].page_content[:200]}...\n")

# # Test with embeddinggemma
# print("=" * 80)
# print("Using embeddinggemma:")
# print("=" * 80)
# embeddings_gemma = OllamaEmbeddings(model="embeddinggemma:latest")
# vectorstore_gemma = Chroma.from_documents(
#     documents=chunks[:10],
#     embedding=embeddings_gemma,
#     collection_name="test_gemma"
# )
# retriever_gemma = vectorstore_gemma.as_retriever(search_kwargs={"k": 2})
# docs_gemma = retriever_gemma.invoke(test_query)

# print(f"\nTop retrieved document:")
# print(f"{docs_gemma[0].page_content[:200]}...\n")

# print("=" * 80)
# print("\nℹ️  Compare the retrieved passages above and pick the model that suits your documents.")
# print("   - nomic-embed-text: Smaller (274MB), general-purpose")
# print("   - embeddinggemma: Larger (621MB), from the Gemma model family")

## 15. Performance Tips & Next Steps

### 🚀 Performance Optimization:
1. **Chunk Size**: Experiment with different sizes (512, 1024, 2048)
2. **Retrieval Count (k)**: Try k=3, 4, 5, 6 based on your needs
3. **Model Selection**: 
   - Fast: gemma3:1b
   - Balanced: llama3.2
   - Largest and slowest, reasoning-focused: deepseek-r1
4. **Temperature**: 0 for factual, 0.3-0.7 for creative

### 💾 Persistence:
- Vector store is saved at `./chroma_db`
- You can reload it without re-embedding documents
- Delete the directory to start fresh

### 🔧 Troubleshooting:
- **Slow responses**: Use smaller model (gemma3:1b)
- **Out of memory**: Reduce chunk count or use smaller model
- **Ollama not found**: Make sure `ollama serve` is running
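
For the last item, a quick connectivity probe can tell a stopped server apart from other errors. This sketch assumes Ollama's default address `http://localhost:11434` (the same default noted in the embeddings cell) and its `/api/tags` model-listing endpoint; `ollama_is_up` is a hypothetical helper name.

```python
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url + '/api/tags'."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and urllib's URLError/HTTPError.
        return False


if not ollama_is_up():
    print("Ollama is not reachable. Try: ollama serve")
```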

### 📚 Next Steps:
1. Try different documents and PDFs
2. Experiment with other Ollama models
3. Add custom preprocessing or post-processing
4. Build a simple UI with Gradio or Streamlit
5. Compare with cloud-based RAG (OpenAI, etc.)

### 🎉 Congratulations!
You now have a fully functional **local, offline RAG system** running on your machine!

---

**Created with LangChain + Ollama + ChromaDB**