# Complete Offline RAG System with Ollama

## 📚 Overview

This notebook implements a **complete Retrieval-Augmented Generation (RAG) system** that runs entirely offline.

### What is RAG?
- **Retrieval**: Finding relevant information from your documents
- **Augmented**: Using that information to enhance the AI's knowledge
- **Generation**: Creating accurate answers based on your documents

### Prerequisites
1. Install Ollama: https://ollama.com/download
2. Run: `ollama pull llama3.2`
3. Run: `ollama pull nomic-embed-text`
4. Create a `documents` folder and add your PDF/Markdown/HTML files
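
Before running the notebook it can help to confirm that both models are actually installed. The helper below is a sketch and not part of the original notebook: `missing_models` and `fetch_tags` are hypothetical names, and the code assumes the JSON shape returned by Ollama's `GET /api/tags` endpoint (a `models` list whose entries carry a `name` such as `llama3.2:latest`).

```python
import json
import urllib.request
from typing import Dict, List

REQUIRED_MODELS = ["llama3.2", "nomic-embed-text"]  # from the prerequisites above


def missing_models(tags: Dict, required: List[str]) -> List[str]:
    """Return the required models that are absent from an /api/tags payload."""
    # Installed names look like "llama3.2:latest"; compare on the part before ":".
    installed = {m["name"].split(":")[0] for m in tags.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(host: str = "http://localhost:11434") -> Dict:
    """Query a running Ollama server (assumed to listen on the default port)."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Offline demonstration with a hand-written payload in the assumed shape
sample = {"models": [{"name": "llama3.2:latest"}]}
print(missing_models(sample, REQUIRED_MODELS))
```

With the server running, `missing_models(fetch_tags(), REQUIRED_MODELS)` would list whatever still needs an `ollama pull`.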

## 📦 Step 1: Install Required Libraries

### Library Choices Explained:

| Library | Why This One? | Alternatives |
|---------|---------------|-------------|
| **faiss-cpu** | Fast, battle-tested vector search that works offline | ChromaDB (heavier), LanceDB (newer) |
| **numpy** | Standard array library; FAISS's Python API takes and returns NumPy arrays | No practical alternative |
| **PyPDF2** | Pure Python, no dependencies, simple (development has since moved to `pypdf`) | pdfplumber (slower), PyMuPDF (C dependencies) |
| **beautifulsoup4** | Robust against malformed HTML, simple API | lxml (harder install), html.parser (basic) |
| **markdown** | Clean MD to text conversion | mistune (overkill), regex (error-prone) |

In [6]:
# Install required packages 
!pip install --upgrade pip setuptools wheel
!pip install faiss-cpu numpy PyPDF2==3.0.1 beautifulsoup4==4.12.2 markdown==3.4.4

Collecting setuptools
  Using cached setuptools-80.9.0-py3-none-any.whl.metadata (6.6 kB)
Collecting wheel
  Using cached wheel-0.45.1-py3-none-any.whl.metadata (2.3 kB)
Using cached setuptools-80.9.0-py3-none-any.whl (1.2 MB)
Using cached wheel-0.45.1-py3-none-any.whl (72 kB)
Installing collected packages: wheel, setuptools


## 📥 Step 2: Import Libraries

In [None]:
import os
import json
import subprocess
import re
from pathlib import Path
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass

# For document processing
import PyPDF2
from bs4 import BeautifulSoup
import markdown

# For vector operations
import numpy as np
import faiss

print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!


## 🏗️ Step 3: Data Structures

### Why Dataclasses?
- Type annotations document each field and let static checkers catch errors early
- Self-documenting
- Less boilerplate than hand-written classes
- Clearer than dicts for structured data

In [8]:
@dataclass
class Chunk:
    """Text chunk with metadata and embedding."""
    id: str
    text: str
    vector: Optional[np.ndarray]
    metadata: Dict

## 📖 Step 4: Document Loading

### Design Decisions:
1. **Static methods**: No instance state needed
2. **Separate per format**: Easier to extend
3. **Error handling**: Continues on failure
4. **Page-level tracking**: Enables precise citations

In [9]:
class DocumentLoader:
    """Load PDF, Markdown, and HTML documents."""
    
    @staticmethod
    def load_pdf(file_path: str) -> List[Dict]:
        """Extract text from PDF - page by page for citations."""
        chunks = []
        try:
            with open(file_path, 'rb') as file:
                pdf_reader = PyPDF2.PdfReader(file)
                for page_num, page in enumerate(pdf_reader.pages):
                    text = page.extract_text()
                    if text.strip():
                        chunks.append({
                            'text': text,
                            'metadata': {
                                'source': os.path.basename(file_path),
                                'page': page_num + 1,
                                'type': 'pdf'
                            }
                        })
        except Exception as e:
            print(f"❌ Error loading PDF {file_path}: {e}")
        return chunks
    
    @staticmethod
    def load_markdown(file_path: str) -> List[Dict]:
        """Convert Markdown to text via HTML (cleaner than regex)."""
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                md_content = file.read()
                html = markdown.markdown(md_content)
                soup = BeautifulSoup(html, 'html.parser')
                text = soup.get_text()
                
                return [{
                    'text': text,
                    'metadata': {
                        'source': os.path.basename(file_path),
                        'page': 1,
                        'type': 'markdown'
                    }
                }]
        except Exception as e:
            print(f"❌ Error loading Markdown {file_path}: {e}")
            return []
    
    @staticmethod
    def load_html(file_path: str) -> List[Dict]:
        """Extract text from HTML, removing scripts and styles."""
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                soup = BeautifulSoup(file.read(), 'html.parser')
                for script in soup(["script", "style"]):
                    script.decompose()
                text = soup.get_text()
                
                return [{
                    'text': text,
                    'metadata': {
                        'source': os.path.basename(file_path),
                        'page': 1,
                        'type': 'html'
                    }
                }]
        except Exception as e:
            print(f"❌ Error loading HTML {file_path}: {e}")
            return []
    
    @staticmethod
    def load_documents(directory: str) -> List[Dict]:
        """Load all supported documents from a directory."""
        documents = []
        doc_dir = Path(directory)
        
        if not doc_dir.exists():
            print(f"Creating {directory}...")
            doc_dir.mkdir(parents=True)
            print(f"⚠️  Add documents to {directory} and run again.")
            return documents
        
        for file_path in doc_dir.rglob('*'):
            if file_path.is_file():
                ext = file_path.suffix.lower()
                
                if ext == '.pdf':
                    documents.extend(DocumentLoader.load_pdf(str(file_path)))
                elif ext in ['.md', '.markdown']:
                    documents.extend(DocumentLoader.load_markdown(str(file_path)))
                elif ext in ['.html', '.htm']:
                    documents.extend(DocumentLoader.load_html(str(file_path)))
        
        print(f"✅ Loaded {len(documents)} document sections")
        return documents

print("✅ Document loader ready!")

✅ Document loader ready!
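
Keeping one loader per format makes new formats cheap to add. As a sketch of that idea, a plain-text loader can return the same `{'text', 'metadata'}` shape and be selected by file extension. The `load_text` function, the `LOADERS` table, and `load_any` are hypothetical and not part of `DocumentLoader`.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


def load_text(file_path: str) -> List[Dict]:
    """Hypothetical loader for .txt files, mirroring the other loaders' output."""
    try:
        text = Path(file_path).read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as e:
        print(f"Error loading text {file_path}: {e}")
        return []  # continue on failure, like the loaders above
    return [{
        "text": text,
        "metadata": {"source": os.path.basename(file_path), "page": 1, "type": "text"},
    }]


# Extension -> loader; a new format is added by registering one more entry.
LOADERS: Dict[str, Callable[[str], List[Dict]]] = {".txt": load_text}


def load_any(file_path: str) -> List[Dict]:
    """Dispatch on the lower-cased suffix; unsupported files yield nothing."""
    loader = LOADERS.get(Path(file_path).suffix.lower())
    return loader(file_path) if loader else []


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.TXT"
    sample.write_text("Wear PPE at all times.", encoding="utf-8")
    sections = load_any(str(sample))
    print(sections[0]["metadata"])
```

A dispatch table like this could replace the `if/elif` chain in `load_documents`, at the cost of one more level of indirection.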


## ✂️ Step 5: Text Chunking

### Why Chunking?
- **Problem**: Documents are too long for embeddings (token limits)
- **Solution**: Split into smaller, overlapping pieces

### Why Overlap?
- Prevents information loss at boundaries
- Example: "...safety protocol. First, wear PPE..." 
  - Without overlap: "First, wear PPE" loses context
  - With overlap: Previous chunk includes "safety protocol"
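
The effect of overlap is easy to see on the example sentence. The sketch below uses plain fixed-size character windows (no sentence-boundary logic), and `sliding_chunks` is an illustrative helper rather than the notebook's chunker.

```python
from typing import List


def sliding_chunks(text: str, size: int, overlap: int) -> List[str]:
    """Fixed-size character windows; each window starts `size - overlap` after the last."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


text = "Follow the safety protocol. First, wear PPE."
no_overlap = sliding_chunks(text, size=27, overlap=0)
with_overlap = sliding_chunks(text, size=27, overlap=16)

print(repr(no_overlap[1]))    # second chunk has lost the words "safety protocol"
print(repr(with_overlap[1]))  # second chunk still contains "safety protocol"
```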

### Chunk Size Choice (750 chars):
- **Too small** (200): Loses context
- **Too large** (2000): Less precise retrieval
- **750**: Sweet spot for most documents (roughly 150-190 tokens at 4-5 characters per token)
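
To get a feel for these numbers, the arithmetic can be written down directly. This is a rough sketch: `estimate_chunks` ignores the sentence-boundary adjustment (which shortens chunks slightly, so real counts can be a little higher), and the characters-per-token ratio is a rule-of-thumb assumption, not the output of a tokenizer.

```python
import math


def estimate_chunks(text_len: int, chunk_size: int = 750, overlap: int = 100) -> int:
    """Rough chunk count: after the first chunk, each one advances by chunk_size - overlap."""
    if text_len <= chunk_size:
        return 1 if text_len > 0 else 0
    return 1 + math.ceil((text_len - chunk_size) / (chunk_size - overlap))


def approx_tokens(chars: int, chars_per_token: float = 4.0) -> int:
    """Token estimate from a characters-per-token ratio (an assumption, not a tokenizer)."""
    return math.ceil(chars / chars_per_token)


print(estimate_chunks(3000))    # 1 + ceil(2250 / 650)
print(approx_tokens(750))       # at 4 characters per token
print(approx_tokens(750, 5.0))  # at 5 characters per token
```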

### Sentence Boundary Detection:
- Prefers to break at `.`, `!`, or `?` within the last 20% of a chunk, not mid-sentence
- Better comprehension by LLM
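
The boundary rule can be isolated into a few lines. The sketch below follows the same idea as `chunk_text` (look for the last sentence-ending character in the final 20% of the window); `find_break` is a hypothetical helper name used only for illustration.

```python
def find_break(text: str, start: int, chunk_size: int) -> int:
    """Return the end index for a chunk that begins at `start`."""
    end = start + chunk_size
    if end >= len(text):
        return len(text)
    # Only the last 20% of the window is searched, so chunks stay near full size.
    window_start = end - int(chunk_size * 0.2)
    last = max(text.rfind(p, window_start, end) for p in ".!?")
    return last + 1 if last != -1 else end  # fall back to a hard cut


text = "Check the valve. Close it slowly! Then log the reading and notify the supervisor."
end = find_break(text, start=0, chunk_size=40)
print(repr(text[:end]))
```

Here the hard cut at 40 characters would land inside "Then log", so the chunk is pulled back to end just after "slowly!".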

In [10]:
class TextChunker:
    """Smart text chunking with overlap and sentence boundaries."""
    
    @staticmethod
    def clean_text(text: str) -> str:
        """Normalize whitespace and remove special characters."""
        text = re.sub(r'\s+', ' ', text)  # Collapse whitespace runs to a single space
        text = re.sub(r'[^\w\s\.\,\!\?\-\:\;]', '', text)  # Keep word characters and basic punctuation
        return text.strip()
    
    @staticmethod
    def chunk_text(
        text: str,
        chunk_size: int = 750,
        overlap: int = 100,
        metadata: Optional[Dict] = None
    ) -> List[Chunk]:
        """Split text into overlapping chunks at sentence boundaries.
        
        Args:
            chunk_size: Target size in characters (≈150 tokens for embeddings)
            overlap: Overlap size to preserve context
            metadata: Source info for citations
        """
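        # Guard against a non-advancing loop: a chunk can end as early as 80%
        # of chunk_size, so the overlap must stay below that.
        if chunk_size <= 0 or not 0 <= overlap < int(chunk_size * 0.8):
            raise ValueError("overlap must be non-negative and below 80% of chunk_size")
        text = TextChunker.clean_text(text)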
        chunks = []
        
        if not text:
            return chunks
        
        start = 0
        chunk_index = 0
        
        while start < len(text):
            end = start + chunk_size
            
            # Try to break at sentence boundary (last 20% of chunk)
            if end < len(text):
                search_start = end - int(chunk_size * 0.2)
                sentence_end = max(
                    text.rfind('.', search_start, end),
                    text.rfind('!', search_start, end),
                    text.rfind('?', search_start, end)
                )
                
                if sentence_end != -1 and sentence_end > start:
                    end = sentence_end + 1
            
            chunk_text = text[start:end].strip()
            
            if chunk_text:
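                chunk_metadata = metadata.copy() if metadata else {}
                chunk_metadata['chunk_index'] = chunk_index
                # Include the page so IDs stay unique when a PDF is chunked page by page
                chunk_id = (
                    f"{chunk_metadata.get('source', 'unknown')}"
                    f"_p{chunk_metadata.get('page', 0)}_{chunk_index}"
                )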
                
                chunks.append(Chunk(
                    id=chunk_id,
                    text=chunk_text,
                    vector=None,
                    metadata=chunk_metadata
                ))
                
                chunk_index += 1
            
            start = end - overlap  # Move with overlap
            
            if start >= len(text) - overlap:
                break
        
        return chunks

print("✅ Text chunker ready!")

✅ Text chunker ready!


## 🧮 Step 6: Ollama Embedder

### What are Embeddings?
- Convert text → vector of numbers (768 dimensions)
- Similar text → similar vectors
- Enables semantic search
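
"Similar vectors" is usually measured with cosine similarity. The toy example below uses made-up 3-dimensional vectors in place of real 768-dimensional embeddings, so the numbers only illustrate the idea.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented "embeddings" for three phrases; real vectors come from the embedding model.
helmet = [0.9, 0.1, 0.0]
hard_hat = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(helmet, hard_hat) > cosine_similarity(helmet, invoice))
```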

### Why Ollama?
- **Runs locally**: No API keys, no cloud
- **Free**: No usage costs
- **Private**: Data never leaves your machine

### Why nomic-embed-text?
- **Size**: 274MB (lightweight)
- **Quality**: Good accuracy for general text
- **Dimensions**: 768 (standard size)
- **Alternative**: all-MiniLM-L6-v2 (384 dims, smaller but less accurate)

### HTTP API vs CLI:
- **HTTP**: Direct, faster, better for production
- **CLI**: Subprocess overhead
- **Choice**: HTTP API for efficiency
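
The HTTP route amounts to a single JSON POST. The sketch below builds that request with the standard library's `urllib` without sending it, so it runs even when no server is up. `build_embedding_request` is a hypothetical helper; the endpoint path and the `model`/`prompt` fields match the ones used by the embedder in this notebook, and the default host is an assumption.

```python
import json
import urllib.request


def build_embedding_request(
    text: str,
    model: str = "nomic-embed-text",
    host: str = "http://localhost:11434",  # assumed default Ollama address
) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the embeddings endpoint."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_embedding_request("wear PPE")
print(req.get_method(), req.full_url)

# Sending it requires a running Ollama server:
#   with urllib.request.urlopen(req, timeout=30) as resp:
#       vector = json.load(resp)["embedding"]
```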

In [11]:
class OllamaEmbedder:
    """Generate embeddings using Ollama's embedding model."""
    
    def __init__(self, model_name: str = "nomic-embed-text"):
        self.model_name = model_name
        self._verify_model()
    
    def _verify_model(self):
        """Check if model is downloaded, pull if needed."""
        try:
            result = subprocess.run(
                ['ollama', 'list'],
                capture_output=True,
                text=True,
                check=True
            )
            if self.model_name not in result.stdout:
                print(f"⬇️  Pulling {self.model_name}...")
                subprocess.run(['ollama', 'pull', self.model_name], check=True)
        except Exception as e:
            print(f"❌ Ollama error: {e}")
            raise
    
    def embed_text(self, text: str) -> np.ndarray:
        """Generate embedding vector for text using HTTP API."""
        import http.client
        
        conn = http.client.HTTPConnection("localhost", 11434, timeout=30)
        try:
            headers = {'Content-Type': 'application/json'}
            payload = json.dumps({
                "model": self.model_name,
                "prompt": text
            })
            
            conn.request("POST", "/api/embeddings", payload, headers)
            response = conn.getresponse()
            body = response.read().decode()
            if response.status != 200:
                raise RuntimeError(f"HTTP {response.status}: {body[:200]}")
            
            return np.array(json.loads(body)['embedding'], dtype=np.float32)
            
        except Exception as e:
            print(f"❌ Embedding error: {e}")
            # Fallback: a zero vector keeps the pipeline running, but it
            # carries no meaning and cannot produce a useful match.
            return np.zeros(768, dtype=np.float32)
        finally:
            conn.close()  # Always release the socket
    
    def embed_chunks(self, chunks: List[Chunk]) -> List[Chunk]:
        """Generate embeddings for all chunks with progress."""
        print(f"🔄 Generating embeddings for {len(chunks)} chunks...")
        
        for i, chunk in enumerate(chunks):
            if i % 10 == 0 and i > 0:
                print(f"  Progress: {i}/{len(chunks)}")
            chunk.vector = self.embed_text(chunk.text)
        
        print("✅ Embeddings complete!")
        return chunks

print("✅ Embedder ready!")

✅ Embedder ready!


## 🗄️ Step 7: Vector Database

### Why FAISS?
- **Speed**: Optimized C++ similarity search with Python bindings
- **Offline**: No external services
- **Simple**: `IndexFlatIP` = exact search, no tuning
- **Mature**: Used by Meta/Facebook in production

### FAISS Index Types:
- **IndexFlatIP / IndexFlatL2**: Exact search, small datasets (<1M vectors)
- **IndexIVFFlat**: Approximate, faster for large datasets
- **IndexHNSWFlat**: Graph-based, very fast
- **Choice**: Flat (inner product) for simplicity and accuracy

### Cosine Similarity:
- Vectors are L2-normalized before indexing, so the inner product returned by `IndexFlatIP` equals cosine similarity
- `search()` reports distance = 1 - cosine similarity, which ranges from 0 to 2
- Lower = more similar
- Alternative: Euclidean (L2) distance with `IndexFlatL2` (same ranking on normalized vectors)
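
For unit-length vectors, squared Euclidean distance and cosine similarity are tied together by the identity ||a - b||² = 2 - 2·cos(a, b), which is why the two metrics rank neighbors the same way after normalization. A quick numeric check in plain Python (the helper names are illustrative only):

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = normalize([3.0, 4.0])  # [0.6, 0.8]
b = normalize([4.0, 3.0])  # [0.8, 0.6]
cos = dot(a, b)

# Both expressions agree up to floating-point rounding
print(round(squared_l2(a, b), 6), round(2 - 2 * cos, 6))
```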

### Persistence:
- Save index + metadata separately
- Vectors in binary (fast)
- Metadata in JSON (readable)
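
The metadata half of this scheme is an ordinary JSON round trip. The sketch below writes and re-reads a hypothetical chunk record in the same shape that `save()` uses; the binary half is handled by `faiss.write_index` and `faiss.read_index` in the class itself and is left out here so the example runs with the standard library alone.

```python
import json
import os
import tempfile

# Hypothetical chunk record; the text and file name are placeholders
chunks_data = [
    {
        "id": "manual.pdf_0",
        "text": "Wear PPE at all times.",
        "metadata": {"source": "manual.pdf", "page": 1, "type": "pdf", "chunk_index": 0},
    },
]

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "chunks.json")
    # ensure_ascii=False keeps non-ASCII text readable in the saved file
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chunks_data, f, indent=2, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as f:
        restored = json.load(f)

print(restored == chunks_data)
```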

In [36]:
class VectorDatabase:
    """FAISS-based vector storage and retrieval with Cosine Similarity."""
    
    def __init__(self, dimension: int = 768):
        self.dimension = dimension
        # Use IndexFlatIP for cosine similarity (Inner Product after normalization)
        self.index = faiss.IndexFlatIP(dimension)  # Changed from IndexFlatL2
        self.chunks: List[Chunk] = []
    
    def add_chunks(self, chunks: List[Chunk]):
        """Add chunk embeddings to the index."""
        vectors = np.array([chunk.vector for chunk in chunks], dtype=np.float32)
        
        # Normalize vectors for cosine similarity
        faiss.normalize_L2(vectors)
        
        self.index.add(vectors)
        self.chunks.extend(chunks)
        print(f"✅ Added {len(chunks)} chunks (total: {len(self.chunks)})")
    
    def search(self, query_vector: np.ndarray, top_k: int = 5) -> List[Tuple[Chunk, float]]:
        """Find top-k most similar chunks using cosine similarity.
        
        Returns:
            List of (chunk, distance) tuples
            Distance is (1 - cosine_similarity), so lower = more similar
        """
        query_vector = query_vector.reshape(1, -1).astype(np.float32)
        
        # Normalize query vector for cosine similarity
        faiss.normalize_L2(query_vector)
        
        # Search (returns similarity scores, not distances)
        similarities, indices = self.index.search(query_vector, top_k)
        
        results = []
        for idx, similarity in zip(indices[0], similarities[0]):
            # FAISS pads with -1 when fewer than top_k vectors exist; skip those
            if 0 <= idx < len(self.chunks):
                # Convert similarity to distance: distance = 1 - similarity
                distance = 1 - similarity
                results.append((self.chunks[idx], float(distance)))
        
        return results
    
    def save(self, directory: str):
        """Persist database to disk."""
        os.makedirs(directory, exist_ok=True)
        
        # Save FAISS index (binary)
        faiss.write_index(self.index, os.path.join(directory, 'faiss.index'))
        
        # Save chunks metadata (JSON)
        chunks_data = [{'id': chunk.id,
            'text': chunk.text,
            'metadata': chunk.metadata
        } for chunk in self.chunks]
        
        with open(os.path.join(directory, 'chunks.json'), 'w', encoding='utf-8') as f:
            json.dump(chunks_data, f, indent=2)
        
        print(f"✅ Database saved to {directory}")
    
    def load(self, directory: str, embedder=None) -> bool:
        """Load database from disk.
        
        `embedder` is unused and kept only so existing callers still work:
        stored vectors are read back from the FAISS index itself.
        """
        index_path = os.path.join(directory, 'faiss.index')
        chunks_path = os.path.join(directory, 'chunks.json')
        
        if not os.path.exists(index_path) or not os.path.exists(chunks_path):
            print(f"⚠️  No database found in {directory}")
            return False
        
        # Load FAISS index
        index = faiss.read_index(index_path)
        
        # Load chunks
        with open(chunks_path, 'r', encoding='utf-8') as f:
            chunks_data = json.load(f)
        
        # Refuse a database whose index and metadata are out of sync
        if index.ntotal != len(chunks_data):
            print(f"⚠️  Index and metadata sizes differ in {directory}")
            return False
        
        # Reconstruct chunks, reading each normalized vector from the index
        # instead of re-embedding the text
        print("🔄 Reconstructing chunk vectors...")
        self.index = index
        self.dimension = index.d
        self.chunks = []
        for i, data in enumerate(chunks_data):
            chunk = Chunk(
                id=data['id'],
                text=data['text'],
                vector=self.index.reconstruct(i),
                metadata=data['metadata']
            )
            self.chunks.append(chunk)
        
        print(f"✅ Database loaded: {len(self.chunks)} chunks")
        return True

print("✅ Vector database ready!")

✅ Vector database ready!


## 🤖 Step 8: LLM Interface

### Why Llama 3?
- **Quality**: Strong open-weight model family
- **Size**: 4.7GB for `llama3`; the `llama3.2` model used in the runs below is a 2.0GB download
- **Offline**: Runs locally
- **Fast**: Decent speed on CPU

### Temperature Parameter:
- **0.0**: Deterministic (always same answer)
- **0.3**: Slightly creative (good for QA)
- **0.7**: More creative (good for writing)
- **1.0**: Very creative (can hallucinate)
- **Choice**: 0.3 balances accuracy and naturalness
- **Caveat**: The CLI-based `generate()` in this notebook accepts `temperature` but does not pass it to `ollama run`, so the model's default sampling settings apply
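
Temperature works by rescaling the model's token scores before they are turned into probabilities: dividing by a small temperature sharpens the distribution, while a large one flattens it. The sketch below is a generic illustration of that mechanism with made-up scores; it does not call Ollama, and how a given backend applies the setting internally is not shown in this notebook.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive for this formula")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens
for t in (0.3, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.3 nearly all probability mass sits on the top-scoring token, which is why low temperatures suit factual question answering.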

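### HTTP API vs CLI:
- **HTTP**: Faster, with more control (temperature, tokens, etc.); better for production
- **CLI**: Adds subprocess overhead, but proved more reliable on CPU
- **Choice**: CLI for generation in this notebook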
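
For reference, this is roughly what the HTTP route looks like when temperature is passed explicitly. The sketch only builds the request, so it runs without a server. `build_generate_request` is a hypothetical helper and the host is an assumed default; the `/api/generate` endpoint, the `stream` flag, and the `options.temperature` field are part of Ollama's REST API.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    model: str = "llama3.2",
    temperature: float = 0.3,
    host: str = "http://localhost:11434",  # assumed default Ollama address
) -> urllib.request.Request:
    """Prepare (but do not send) a non-streaming generation request."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                          # one JSON object, not a token stream
        "options": {"temperature": temperature},  # sampling settings go under "options"
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request("What is 2+2? Answer briefly.", temperature=0.1)
print(json.loads(req.data)["options"])

# With a server running, the generated text is in the "response" field:
#   with urllib.request.urlopen(req, timeout=300) as resp:
#       answer = json.load(resp)["response"]
```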

In [51]:
class OllamaLLM:
    """LLM interface using Ollama CLI (more reliable for CPU)."""
    
    def __init__(self, model_name: str = "llama3"):
        self.model_name = model_name
        self._verify_model()
    
    def _verify_model(self):
        """Check if model is downloaded."""
        try:
            result = subprocess.run(
                ['ollama', 'list'],
                capture_output=True,
                text=True,
                check=True
            )
            if self.model_name not in result.stdout:
                print(f"⬇️  Pulling {self.model_name}...")
                subprocess.run(['ollama', 'pull', self.model_name], check=True)
        except Exception as e:
            print(f"❌ Ollama error: {e}")
            raise
    
    def generate(self, prompt: str, temperature: float = 0.3) -> str:
        """Generate response using Ollama CLI (more reliable on CPU).
        
        Args:
            prompt: Complete prompt with context and question
            temperature: Creativity (0.0=deterministic, 1.0=creative).
                Accepted for interface compatibility but not forwarded,
                because `ollama run` is invoked without sampling options.
        """
        try:
            print(f"  Generating with {self.model_name} (may take 30-90 sec on CPU)...")
            
            # Use subprocess with CLI - more reliable than HTTP on CPU
            result = subprocess.run(
                ['ollama', 'run', self.model_name],
                input=prompt,
                capture_output=True,
                text=True,
                timeout=300,  # 5 minutes timeout
                encoding='utf-8'
            )
            
            if result.returncode != 0:
                error_msg = result.stderr or "Unknown error"
                print(f"  ❌ Ollama error: {error_msg}")
                return f"Error: {error_msg}"
            
            answer = result.stdout.strip()
            
            if not answer:
                print("  ⚠️ Empty response")
                return "Error: Empty response from LLM"
            
            print(f"  ✅ Generated {len(answer)} characters")
            return answer
            
        except subprocess.TimeoutExpired:
            print("  ❌ Timeout after 5 minutes")
            return "Error: Generation timed out. Try a simpler question or smaller context."
        except Exception as e:
            error_msg = f"❌ Error: {e}"
            print(f"  {error_msg}")
            return error_msg

print("✅ LLM interface ready!")

✅ LLM interface ready!


## 🎯 Step 9: Complete RAG System

### System Architecture:
1. **Ingest**: Load docs → Chunk → Embed → Store
2. **Query**: Question → Embed → Search → Retrieve chunks
3. **Generate**: Build prompt with context → LLM → Answer

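### Distance Threshold:
- Filters out irrelevant chunks
- Distance is 1 - cosine similarity, so it ranges from 0 (same direction) to 2 (opposite)
- **1.5** (the default): Very lenient on this scale; the example queries in this notebook pass a stricter `0.6`
- Lower = stricter (may miss relevant info)
- Higher = lenient (may include noise)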
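
The filter itself is a one-line comparison. The sketch below applies it to the five distances reported by the example run in Step 11 to show how the cutoff changes what reaches the prompt; `filter_by_distance` is an illustrative helper, and the labels are shorthand for the source file and page.

```python
from typing import List, Tuple

# (label, distance) pairs copied from the SOURCES output in Step 11
results: List[Tuple[str, float]] = [
    ("flora.pdf p9", 0.2168),
    ("flora.pdf p9", 0.2531),
    ("flora.pdf p1", 0.2610),
    ("flora.pdf p10", 0.2785),
    ("flora.pdf p1", 0.2790),
]


def filter_by_distance(
    results: List[Tuple[str, float]], threshold: float
) -> List[Tuple[str, float]]:
    """Keep only results strictly closer than the threshold."""
    return [(label, dist) for label, dist in results if dist < threshold]


for threshold in (0.25, 0.27, 0.6):
    print(threshold, len(filter_by_distance(results, threshold)))
```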

### Prompt Engineering:
- Clear instructions: "Answer only from context"
- Structured format: CONTEXT → QUESTION → INSTRUCTIONS
- Citation requirement: Forces source attribution
- Safety: Refuses to guess without context
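
The prompt layout can be factored into a small function, which makes the structure easy to test on its own. This sketch reuses the wording of the notebook's prompt; `build_prompt` is a hypothetical helper, and the sample source is placeholder data rather than text from a real document.

```python
from typing import Dict, List


def build_prompt(question: str, sources: List[Dict]) -> str:
    """Assemble CONTEXT, QUESTION, and INSTRUCTIONS into one prompt string."""
    context = "\n".join(
        f"[Source {i}: {s['source']}, Page {s.get('page', 'N/A')}]\n{s['text']}\n"
        for i, s in enumerate(sources, 1)
    )
    return (
        "You are a helpful AI assistant. "
        "Answer the question based ONLY on the provided context.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question}\n\n"
        "INSTRUCTIONS:\n"
        "1. Answer based only on the context above\n"
        '2. Cite source numbers (e.g., "According to Source 1...")\n'
        "3. If context is insufficient, state that clearly\n"
        "4. Be concise but thorough\n\n"
        "ANSWER:"
    )


prompt = build_prompt(
    "When is PPE required?",
    [{"source": "manual.pdf", "page": 3, "text": "Wear PPE at all times."}],
)
print(prompt)
```

Numbering the sources in the context is what lets the model cite "Source 1" in a way that maps back to a file and page.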

In [38]:
class RAGSystem:
    """Complete RAG orchestration."""
    
    def __init__(
        self,
        documents_dir: str = "documents",
        db_dir: str = "vector_db",
        llm_model: str = "llama3",
        embedding_model: str = "nomic-embed-text"
    ):
        self.documents_dir = documents_dir
        self.db_dir = db_dir
        
        print("🚀 Initializing RAG System...")
        self.embedder = OllamaEmbedder(embedding_model)
        self.llm = OllamaLLM(llm_model)
        self.vector_db = VectorDatabase()
        print("✅ RAG System initialized!")
    
    def ingest_documents(
        self,
        chunk_size: int = 750,
        overlap: int = 100,
        force_rebuild: bool = False
    ):
        """Build or load vector database."""
        
        # Try loading existing database
        if not force_rebuild and os.path.exists(self.db_dir):
            print("📂 Loading existing database...")
            if self.vector_db.load(self.db_dir, self.embedder):
                return
        
        print("🔨 Building new database...")
        
        # Load documents
        documents = DocumentLoader.load_documents(self.documents_dir)
        if not documents:
            print("⚠️  No documents found!")
            return
        
        # Chunk documents
        all_chunks = []
        for doc in documents:
            chunks = TextChunker.chunk_text(
                doc['text'],
                chunk_size=chunk_size,
                overlap=overlap,
                metadata=doc['metadata']
            )
            all_chunks.extend(chunks)
        
        print(f"✂️  Created {len(all_chunks)} chunks")
        
        # Generate embeddings
        all_chunks = self.embedder.embed_chunks(all_chunks)
        
        # Store in a fresh vector DB so a rebuild never appends to old data
        self.vector_db = VectorDatabase()
        self.vector_db.add_chunks(all_chunks)
        
        # Save for future use
        self.vector_db.save(self.db_dir)
    
    def query(
        self,
        question: str,
        top_k: int = 5,
        distance_threshold: float = 1.5
    ) -> Dict:
        """Answer question using RAG.
        
        Returns:
            {
                'answer': Generated answer,
                'sources': List of source chunks,
                'confidence': 'high'|'medium'|'low'
            }
        """
        print(f"\n❓ Question: {question}")
        
        # Embed query
        query_vector = self.embedder.embed_text(question)
        
        # Search vector DB
        results = self.vector_db.search(query_vector, top_k=top_k)
        
        # Filter by threshold
        filtered_results = [
            (chunk, dist) for chunk, dist in results
            if dist < distance_threshold
        ]
        
        if not filtered_results:
            return {
                'answer': "❌ Insufficient context to answer this question.",
                'sources': [],
                'confidence': 'low'
            }
        
        # Build context from chunks
        context_parts = []
        sources = []
        
        for i, (chunk, distance) in enumerate(filtered_results):
            context_parts.append(
                f"[Source {i+1}: {chunk.metadata['source']}, "
                f"Page {chunk.metadata.get('page', 'N/A')}]\n{chunk.text}\n"
            )
            sources.append({
                'id': chunk.id,
                'source': chunk.metadata['source'],
                'page': chunk.metadata.get('page', 'N/A'),
                'distance': distance
            })
        
        context = "\n".join(context_parts)
        
        # Build prompt
        prompt = f"""You are a helpful AI assistant. Answer the question based ONLY on the provided context.

CONTEXT:
{context}

QUESTION: {question}

INSTRUCTIONS:
1. Answer based only on the context above
2. Cite source numbers (e.g., "According to Source 1...")
3. If context is insufficient, state that clearly
4. Be concise but thorough

ANSWER:"""
        
        # Generate answer
        print("🤖 Generating answer...")
        answer = self.llm.generate(prompt, temperature=0.3)
        
        return {
            'answer': answer,
            'sources': sources,
            'confidence': 'high' if len(filtered_results) >= 3 else 'medium'
        }

print("✅ RAG System class ready!")

✅ RAG System class ready!


## 🎮 Step 10: Initialize System

### Configuration Options:
- `documents_dir`: Where your documents are
- `db_dir`: Where vector database is saved
- `chunk_size`: 500-1000 (750 is balanced)
- `overlap`: 10-20% of chunk_size
- `force_rebuild`: Set True to rebuild from scratch
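
The suggested ranges can be checked mechanically before a long ingest run. The sketch below is one possible way to bundle the options; `IngestConfig` and its `warnings()` method are hypothetical and not part of `RAGSystem`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IngestConfig:
    """Hypothetical bundle of the ingest options listed above."""
    chunk_size: int = 750
    overlap: int = 100
    force_rebuild: bool = False

    def warnings(self) -> List[str]:
        """Return notes for values outside the suggested ranges."""
        if self.chunk_size <= 0:
            return ["chunk_size must be positive"]
        notes = []
        if not 500 <= self.chunk_size <= 1000:
            notes.append(f"chunk_size {self.chunk_size} is outside 500-1000")
        ratio = self.overlap / self.chunk_size
        if not 0.10 <= ratio <= 0.20:
            notes.append(f"overlap is {ratio:.0%} of chunk_size, outside 10-20%")
        return notes


print(IngestConfig().warnings())                 # defaults sit inside both ranges
print(IngestConfig(chunk_size=2000).warnings())  # too large, and overlap too small
```

A validated config could then be unpacked into the existing call, for example `rag.ingest_documents(chunk_size=cfg.chunk_size, overlap=cfg.overlap, force_rebuild=cfg.force_rebuild)`.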

In [46]:
!ollama pull llama3.2

In [52]:
# 1. Reinitialize with CLI-based LLM
rag = RAGSystem(
    documents_dir="documents",
    db_dir="vector_db",
    llm_model="llama3.2",
    embedding_model="nomic-embed-text"
)

rag.ingest_documents(force_rebuild=False)

# 2. Test with simpler/shorter prompt first
print("Testing LLM with simple query...")
simple_test = rag.llm.generate("What is 2+2? Answer briefly.", temperature=0.1)
print(f"Simple test result: {simple_test}\n")

# 3. Try real query with fewer sources (less context = faster)
result = rag.query(
    question="What is the main topic of this document?",
    top_k=3,  # Reduced from 5 for speed
    distance_threshold=0.6
)

print("\n" + "="*60)
print("ANSWER:")
print("="*60)
print(result['answer'])
print("="*60)

🚀 Initializing RAG System...
✅ RAG System initialized!
📂 Loading existing database...
🔄 Reconstructing chunk vectors...
✅ Database loaded: 61 chunks
Testing LLM with simple query...
  Generating with llama3.2 (may take 30-90 sec on CPU)...
  ✅ Generated 2 characters
Simple test result: 4.


❓ Question: What is the main topic of this document?
🤖 Generating answer...
  Generating with llama3.2 (may take 30-90 sec on CPU)...
  ✅ Generated 416 characters

ANSWER:
The main topic of this document appears to be Parameter Efficient Fine-Tuning (PEFT), a method for adapting large language models. According to Source 2 ([flora.pdf, Page 9]), PEFT includes various methods such as prefix or prompt-tuning, series adapters, and fused forward-backward adapters.

Further information can be found in survey papers mentioned in the context, including Xu et al., 2023; Balne et al., 2024.


In [47]:
# Initialize RAG system
rag = RAGSystem(
    documents_dir="documents",
    db_dir="vector_db",
    llm_model="llama3.2",
    embedding_model="nomic-embed-text"
)

# Build/load database
rag.ingest_documents(
    chunk_size=750,
    overlap=100,
    force_rebuild=True  
)

🚀 Initializing RAG System...
✅ RAG System initialized!
🔨 Building new database...
✅ Loaded 10 document sections
✂️  Created 61 chunks
🔄 Generating embeddings for 61 chunks...
  Progress: 10/61
  Progress: 20/61
  Progress: 30/61
  Progress: 40/61
  Progress: 50/61
  Progress: 60/61
✅ Embeddings complete!
✅ Added 61 chunks (total: 61)
✅ Database saved to vector_db


## 💬 Step 11: Ask Questions!

### Tips for Good Questions:
- Be specific
- Use keywords from your documents
- One topic per question

### Tuning Parameters:
- `top_k`: More = more context, slower
- `distance_threshold`: Lower = stricter matching

In [None]:
# Example: Ask a question
question = "What problem does FLoRA aim to address in the context of parameter-efficient fine-tuning (PEFT) for large language models?"

result = rag.query(
    question=question,
    top_k=5,
    distance_threshold=0.6
)

# Display results
print("\n" + "="*60)
print("ANSWER:")
print("="*60)
print(result['answer'])
print("\n" + "="*60)
print(f"CONFIDENCE: {result['confidence'].upper()}")
print("="*60)
print("\nSOURCES:")
for i, source in enumerate(result['sources'], 1):
    print(f"  {i}. {source['source']} (Page {source['page']}) - Distance: {source['distance']:.4f}")
print("="*60)


❓ Question: What problem does FLoRA aim to address in the context of parameter-efficient fine-tuning (PEFT) for large language models?
🤖 Generating answer...
  Generating with llama3.2 (may take 30-90 sec on CPU)...
  ✅ Generated 283 characters

ANSWER:
According to Source 5, FLoRA aims to address the problem of parameter-efficient fine-tuning for large language models (LLMs) by proposing a family of fused forward-backward adapters (FFBA). This is done to improve overall fine-tuning accuracies and minimize inference-time latencies.

CONFIDENCE: HIGH

SOURCES:
  1. flora.pdf (Page 9) - Distance: 0.2168
  2. flora.pdf (Page 9) - Distance: 0.2531
  3. flora.pdf (Page 1) - Distance: 0.2610
  4. flora.pdf (Page 10) - Distance: 0.2785
  5. flora.pdf (Page 1) - Distance: 0.2790


## 📊 Step 12: Batch Processing (Optional)

Process multiple questions at once.

In [None]:
# List of questions
questions = [
    "What is the main topic?",
    "What are the key points discussed?",
    "Are there any specific recommendations?"
]

# Process all questions
for i, q in enumerate(questions, 1):
    print(f"\n{'='*60}")
    print(f"Question {i}/{len(questions)}")
    print(f"{'='*60}")
    
    result = rag.query(q, top_k=3)
    
    print(f"Q: {q}")
    print(f"A: {result['answer'][:200]}...")
    print(f"Confidence: {result['confidence']}")
    print(f"Sources: {len(result['sources'])}")

## 🔧 Step 13: Advanced Customization

### Adjust Chunk Size

In [None]:
# For code or technical docs: smaller chunks
# rag.ingest_documents(chunk_size=500, overlap=75, force_rebuild=True)

# For long-form content: larger chunks
# rag.ingest_documents(chunk_size=1000, overlap=150, force_rebuild=True)

### Adjust Retrieval

In [None]:
# More context (slower, more comprehensive); cosine distance ranges from 0 to 2
# result = rag.query(question, top_k=10, distance_threshold=0.8)

# Less context (faster, more focused)
# result = rag.query(question, top_k=3, distance_threshold=0.4)

### Use Different Models

In [None]:
# Faster but smaller model
# rag_fast = RAGSystem(llm_model="llama3.2")

# Alternative model
# rag_alt = RAGSystem(llm_model="mistral")

## 📈 Step 14: Database Statistics

In [None]:
# Get database stats
print(f"Total chunks: {len(rag.vector_db.chunks)}")

# Count by source
sources = {}
for chunk in rag.vector_db.chunks:
    source = chunk.metadata['source']
    sources[source] = sources.get(source, 0) + 1

print("\nChunks per document:")
for source, count in sorted(sources.items(), key=lambda x: x[1], reverse=True):
    print(f"  {source}: {count} chunks")

## 💾 Step 15: Save/Export Results

In [None]:
# Save results to JSON
def save_qa_results(questions_and_answers, filename="qa_results.json"):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(questions_and_answers, f, indent=2, ensure_ascii=False)
    print(f"✅ Results saved to {filename}")

# Example:
# qa_pairs = []
# for q in questions:
#     result = rag.query(q)
#     qa_pairs.append({'question': q, 'answer': result['answer']})
# save_qa_results(qa_pairs)

## 🎯 Summary

### What We Built:
1. ✅ Document loader (PDF, MD, HTML)
2. ✅ Smart text chunker with overlap
3. ✅ Ollama embedder (768-dim vectors)
4. ✅ FAISS vector database
5. ✅ Ollama LLM interface
6. ✅ Complete RAG pipeline with citations

### Key Advantages:
- 🔒 Offline and private once the models are pulled
- 💰 No API costs
- 📚 Source citations
- 💾 Persistent storage
- ⚡ Fast after first build

### Next Steps:
1. Add your documents to `documents/` folder
2. Run all cells
3. Start asking questions!
4. Tune parameters for your use case

### Performance Tips:
- First run: 5-15 min (building database)
- Subsequent runs: 2-5 sec (loading database)
- Per query: 3-10 seconds

---

**Made with ❤️ using Ollama, FAISS, and Python**