# Simple RAG System with LM Studio

A lightweight Retrieval-Augmented Generation (RAG) system using:
- **LM Studio** for local LLM inference
- **ChromaDB** for local vector storage
- **Sentence Transformers** for embeddings

## Prerequisites
1. Install and run LM Studio with a model loaded (e.g., Mistral, Llama)
2. Ensure LM Studio server is running (default: http://localhost:1234)
3. Place your PDF file in the same directory as this notebook

## System Architecture

### Components:

1. **PDF Processing**
   - PyPDF2 for text extraction
   - Custom chunking with overlap

2. **Embeddings**
   - Sentence Transformers (all-MiniLM-L6-v2)
   - Local, fast, no API calls

3. **Vector Storage**
   - ChromaDB (embedded database)
   - Persistent local storage
   - Cosine similarity search

4. **LLM Integration**
   - LM Studio (OpenAI-compatible API)
   - Local inference, no cloud dependency
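
With the `cosine` space, ChromaDB reports a distance rather than a similarity, and the query cell later in this notebook prints `1 - distance` as the similarity score. The relationship can be sketched in plain Python; the helper names and toy vectors below are illustrative and not part of the notebook:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance as 1 - cosine similarity (smaller means more similar)."""
    return 1.0 - cosine_similarity(a, b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # same direction, different length
orthogonal = [0.0, 3.0]       # nothing in common with the query

print(cosine_distance(query, same_direction))  # 0.0
print(cosine_distance(query, orthogonal))      # 1.0
```

Because only the direction of the vectors matters, chunk length does not inflate or deflate the score by itself.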

### Data Flow:
```
PDF → Extract Text → Chunk → Embed → Store in ChromaDB
                                           ↓
User Question → Embed → Search ChromaDB → Retrieve Chunks
                                           ↓
                         Context + Question → LM Studio → Answer
```

### Configuration Tips:

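- **Chunk Size**: 500 characters is a reasonable starting point for most documents
- **Overlap**: 50 characters reduces the chance of losing context at chunk boundaries
- **Top K**: 3 to 5 chunks usually provide enough context (this notebook uses 5)
- **Temperature**: 0.7 balances creativity and accuracy; this notebook uses 0.2 for more deterministic, context-grounded answers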
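
With a 500-character window and a 50-character overlap, the window advances 450 characters per step, so a document of N characters yields at most ceil(N / 450) chunks. A minimal sketch of that arithmetic follows; the helper `expected_chunk_count` is hypothetical, and the 39,487-character figure is the one reported in the processing output later in this notebook:

```python
import math


def expected_chunk_count(text_length: int, chunk_size: int = 500, overlap: int = 50) -> int:
    """Upper bound on the chunks produced by a sliding window with overlap.

    It is an upper bound because whitespace-only chunks are dropped.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # how far the window moves each step
    return math.ceil(text_length / stride)


print(expected_chunk_count(39_487))  # 88
```

Raising the overlap shrinks the stride, so the chunk count (and the embedding time) grows accordingly.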

### Performance:

- Embedding generation: ~1 second per 100 chunks
- Vector search: <100ms
- LLM response: Depends on model and hardware
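
These figures depend heavily on hardware, so it is worth measuring them on your own machine. A small sketch using only the standard library is shown here; `timed` and `timings` are hypothetical names, and the dummy workload stands in for calls such as `embed_chunks(chunks)` or `search_similar_chunks(question)`:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("dummy_work", timings):
    # Placeholder workload; replace with the call you want to measure
    sum(i * i for i in range(100_000))

print(f"dummy_work took {timings['dummy_work']:.4f} s")
```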

### Troubleshooting:

1. **LM Studio not connecting**: Check server is running on port 1234
2. **Poor quality answers**: Try adjusting chunk_size or top_k
3. **Slow performance**: Use smaller embedding model or reduce chunk count
4. **Out of memory**: Process PDFs in batches or use smaller chunks
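
For the first item, a quick reachability check can separate "server not running" from "model misconfigured". The sketch below uses only the standard library and assumes the LM Studio server exposes the OpenAI-style `GET /v1/models` endpoint; the function name `lm_studio_reachable` is hypothetical:

```python
import http.client
import json
import urllib.error
import urllib.request


def lm_studio_reachable(base_url: str = "http://127.0.0.1:1234", timeout: float = 3.0) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200 and valid JSON."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, malformed URL, or non-JSON response
        return False


print(lm_studio_reachable())
```

If this prints `False`, the problem is with the server or the port rather than with the prompt or the model name.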

## 1. Install Required Libraries

In [1]:
# Install required packages
!pip install chromadb sentence-transformers PyPDF2 requests numpy openai python-dotenv huggingface_hub IProgress ipywidgets



## 2. Import Libraries and Setup

In [2]:
import os
import json
import requests
from typing import List, Dict, Tuple
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import PyPDF2
from IPython.display import Markdown, display
from openai import OpenAI

# Configuration
LM_STUDIO_URL = "http://127.0.0.1:1234"
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # Small, fast, good quality
CHUNK_SIZE = 500  # Characters per chunk
CHUNK_OVERLAP = 50  # Overlap between chunks
TOP_K = 5  # Number of relevant chunks to retrieve
TEMPERATURE = 0.2
LLM_MODEL = "openai/gpt-oss-20b"
# Specify your PDF file path
PDF_PATH = "1706.03762v7.pdf"  # <-- Change this to your PDF file name

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


In [3]:
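from dotenv import load_dotenv
import os
from huggingface_hub import login

# Load secrets from a local env file (keep this file out of version control)
load_dotenv('secrets.env', override=True)

token = os.getenv("HF_TOKEN")
if token:
    print("Token loaded.")
    login(token=token)
    print("Login successful!")
else:
    # Avoid reporting success when the variable is missing
    print("⚠️ HF_TOKEN not found in secrets.env; skipping Hugging Face login")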

Token loaded.


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Login successful!


## 3. Initialize Components

In [4]:
# Initialize embedding model
print("Loading embedding model...")
embedder = SentenceTransformer(EMBEDDING_MODEL)
print(f"✅ Loaded {EMBEDDING_MODEL}")

# Initialize ChromaDB (persistent local storage)
print("\nInitializing ChromaDB...")
chroma_client = chromadb.PersistentClient(path="./chroma_db")

# Create or get collection
collection_name = "pdf_documents"
try:
    collection = chroma_client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}
    )
    print(f"✅ Created new collection: {collection_name}")
except Exception:  # collection already exists; the exact exception type depends on the ChromaDB version
    collection = chroma_client.get_collection(name=collection_name)
    print(f"✅ Using existing collection: {collection_name}")

# Initialize OpenAI client for LM Studio
client = OpenAI(
    base_url=f"{LM_STUDIO_URL}/v1",
    api_key="lm-studio"  # LM Studio doesn't need real API key
)

# Test LM Studio connection
def test_lm_studio():
    try:
        response = client.chat.completions.create(
            model=LLM_MODEL,
            messages=[{"role": "user", "content": "Say 'connected'"}],
            temperature=TEMPERATURE,
            max_tokens=10,
            timeout=30
        )
        print("\n✅ LM Studio is connected and ready")
        return True
    except Exception as e:
        # Surface the failure reason instead of silently swallowing it
        print(f"\n⚠️ LM Studio not responding ({e}). Please ensure:")
        print("1. LM Studio is running")
        print("2. A model is loaded")
        print("3. Server is started (default port 1234)")
        return False

test_lm_studio()

Loading embedding model...
✅ Loaded all-MiniLM-L6-v2

Initializing ChromaDB...
✅ Using existing collection: pdf_documents

✅ LM Studio is connected and ready


True

## 4. Document Processing Functions

In [5]:
def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from PDF file"""
    text = ""
    try:
        with open(pdf_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            num_pages = len(pdf_reader.pages)
            print(f"📄 Processing {num_pages} pages...")
            
            for page_num in range(num_pages):
                page = pdf_reader.pages[page_num]
                # Guard against pages that yield no text (e.g., scanned images)
                text += (page.extract_text() or "") + "\n"
                
        print(f"✅ Extracted {len(text)} characters from PDF")
        return text
    except Exception as e:
        print(f"❌ Error reading PDF: {e}")
        return ""

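def create_chunks(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> List[str]:
    """Split text into overlapping chunks"""
    # Without this check, overlap >= chunk_size would make the loop below never advance
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0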
    
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        
        # Clean up chunk
        chunk = chunk.strip()
        if chunk:
            chunks.append(chunk)
        
        start += chunk_size - overlap
    
    print(f"✅ Created {len(chunks)} chunks")
    return chunks

def embed_chunks(chunks: List[str]) -> List[List[float]]:
    """Generate embeddings for text chunks"""
    print("🔄 Generating embeddings...")
    embeddings = embedder.encode(chunks, show_progress_bar=True)
    print(f"✅ Generated {len(embeddings)} embeddings")
    return embeddings.tolist()

print("✅ Document processing functions ready")

✅ Document processing functions ready
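
Fixed-width slicing can end a chunk in the middle of a word. One possible variation, shown here as a sketch rather than as the notebook's method, moves each chunk's end back to the last whitespace inside the window when one exists. The function name `create_chunks_on_whitespace` is hypothetical, and chunk starts can still fall mid-word because of the overlap:

```python
from typing import List


def create_chunks_on_whitespace(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks, ending each chunk at whitespace when possible."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    chunks = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + chunk_size, n)
        if end < n:
            # Last space or newline inside the window; fall back to the hard cut if none helps
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start + overlap:  # guarantees the next start moves forward
                end = cut
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= n:
            break
        start = end - overlap
    return chunks


sample = "alpha beta gamma delta epsilon zeta eta theta"
for piece in create_chunks_on_whitespace(sample, chunk_size=16, overlap=4):
    print(repr(piece))
```

The chunks become slightly shorter than `chunk_size` on average, so the total chunk count can rise a little compared with fixed-width slicing.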


## 5. Load and Process Your PDF

In [6]:


def process_pdf(pdf_path: str, clear_existing: bool=False):
    """Complete PDF processing pipeline"""
    
    # Check if file exists
    if not os.path.exists(pdf_path):
        print(f"❌ File not found: {pdf_path}")
        print("Please update PDF_PATH with your file name")
        return False
    
    print(f"\n📚 Processing: {pdf_path}")
    print("="*50)
    
    # Extract text
    text = extract_text_from_pdf(pdf_path)
    if not text:
        return False
    
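    # Create chunks
    chunks = create_chunks(text)
    if not chunks:
        # Whitespace-only text would otherwise cause a division by zero in the summary below
        print("❌ No extractable text found (the PDF may contain only scanned images)")
        return False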
    
    # Generate embeddings
    embeddings = embed_chunks(chunks)
    
    # Store in ChromaDB
    print("\n💾 Storing in vector database...")
    
    # Clear existing documents (optional)
    if clear_existing:
        try:
            # Get all document IDs first, then delete them
            existing_docs = collection.get()
            if existing_docs['ids']:
                collection.delete(ids=existing_docs['ids'])
                print(f"🗑️  Cleared {len(existing_docs['ids'])} existing documents from database")
            else:
                print("🗑️  Database was already empty")
        except Exception as e:
            print(f"⚠️  Warning: Could not clear existing documents: {e}")
            
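    # Add documents with metadata
    # Prefix IDs with the file name so chunks from different PDFs do not collide
    doc_name = os.path.basename(pdf_path)
    ids = [f"{doc_name}_chunk_{i}" for i in range(len(chunks))]
    metadatas = [{"source": pdf_path, "chunk_id": i} for i in range(len(chunks))]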
    
    collection.add(
        embeddings=embeddings,
        documents=chunks,
        ids=ids,
        metadatas=metadatas
    )
    
    print(f"✅ Stored {len(chunks)} chunks in ChromaDB")
    print(f"\n📊 Summary:")
    print(f"  - Total characters: {len(text):,}")
    print(f"  - Number of chunks: {len(chunks)}")
    print(f"  - Avg chunk size: {len(text)//len(chunks)} chars")
    
    return True

# Process your PDF
if process_pdf(PDF_PATH, clear_existing=True):
    print("\n✅ PDF processed successfully! Ready for questions.")
else:
    print("\n⚠️ Please update PDF_PATH and run this cell again")


📚 Processing: 1706.03762v7.pdf
📄 Processing 15 pages...
✅ Extracted 39487 characters from PDF
✅ Created 88 chunks
🔄 Generating embeddings...


✅ Generated 88 embeddings

💾 Storing in vector database...
🗑️  Cleared 88 existing documents from database
✅ Stored 88 chunks in ChromaDB

📊 Summary:
  - Total characters: 39,487
  - Number of chunks: 88
  - Avg chunk size: 448 chars

✅ PDF processed successfully! Ready for questions.
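
`process_pdf` stores every chunk in a single `collection.add` call, which is fine for an 88-chunk paper. For much larger documents, slicing the parallel lists into fixed-size batches keeps each call small. The following is a sketch of such a helper; `batched` is a hypothetical name and the batch size of 4 is arbitrary:

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


ids = [f"chunk_{i}" for i in range(10)]
for batch in batched(ids, 4):
    # In the notebook, this is where one collection.add(...) call per batch would go,
    # slicing documents, embeddings, and metadatas with the same boundaries
    print(len(batch), batch[0])
```

The same slice boundaries must be applied to `ids`, `documents`, `embeddings`, and `metadatas` so that the lists stay aligned.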


## 6. Query Functions

In [8]:
def search_similar_chunks(query: str, top_k: int = TOP_K) -> List[Tuple[str, float]]:
    """Search for most relevant chunks in vector database"""
    
    # Generate embedding for query
    query_embedding = embedder.encode([query])[0].tolist()
    
    # Search in ChromaDB
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    
    # Extract chunks and distances
    chunks = results['documents'][0] if results['documents'] else []
    distances = results['distances'][0] if results['distances'] else []
    
    return list(zip(chunks, distances))

def query_lm_studio(prompt: str, temperature: float = TEMPERATURE) -> str:
    """Send query to LM Studio"""
    # The OpenAI client raises its own timeout error, not requests.exceptions.Timeout
    from openai import APITimeoutError

    try:
        response = client.chat.completions.create(
            model=LLM_MODEL,  # same model name used in the connection test
            messages=[
                {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context. Be concise and accurate. USE ONLY THE PROVIDED CONTEXT TO ANSWER THE QUESTION."},
                {"role": "user", "content": prompt}
            ],
            temperature=temperature,  # honor the function argument
            max_tokens=3500,
            timeout=30
        )
        
        return response.choices[0].message.content
        
    except APITimeoutError:
        return "Error: LM Studio request timed out"
    except Exception as e:
        return f"Error: {str(e)}"

def ask_question(question: str, show_context: bool = False) -> str:
    """Complete RAG pipeline: retrieve context and generate answer"""
    
    print(f"\n🔍 Question: {question}")
    print("="*50)
    
    # Search for relevant chunks
    print("Searching for relevant context...")
    relevant_chunks = search_similar_chunks(question)
    
    if not relevant_chunks:
        return "No relevant information found in the document."
    
    # Prepare context
    context = "\n\n".join([chunk for chunk, _ in relevant_chunks])
    
    if show_context:
        print("\n📄 Retrieved Context:")
        for i, (chunk, distance) in enumerate(relevant_chunks, 1):
            print(f"\nChunk {i} (similarity: {1-distance:.2f}):")
            print(f"{chunk}" )
    
    # Create prompt with context
    prompt = f"""Based ONLY on the following context, answer the question.

Context:
{context}

Question: {question}

Answer:"""
    
    # Get answer from LM Studio
    print("\n🤖 Generating answer...")
    answer = query_lm_studio(prompt)
    
    return answer

print("✅ Query functions ready")

✅ Query functions ready
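
`search_similar_chunks` always returns the `top_k` nearest chunks, even when none of them is a close match. One option is to drop weak matches before building the prompt. The sketch below assumes cosine distances, as configured for this collection; `filter_by_similarity`, the 0.3 threshold, and the sample distances are illustrative assumptions:

```python
from typing import List, Tuple


def filter_by_similarity(
    results: List[Tuple[str, float]], min_similarity: float = 0.3
) -> List[Tuple[str, float]]:
    """Keep (chunk, distance) pairs whose similarity (1 - distance) meets the threshold."""
    return [(chunk, dist) for chunk, dist in results if 1.0 - dist >= min_similarity]


# Made-up (chunk, cosine distance) pairs in the shape returned by search_similar_chunks
results = [
    ("chunk about attention", 0.52),
    ("chunk about training data", 0.55),
    ("unrelated chunk", 0.90),
]

kept = filter_by_similarity(results, min_similarity=0.3)
print([chunk for chunk, _ in kept])
```

A suitable threshold depends on the embedding model and the documents, so it is best chosen by inspecting the scores with `show_context=True`.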


## 7. Interactive Q&A Interface

In [9]:
# Test with a sample question
sample_question = "What is 'Attention' - explain for a business user?"  # <-- Change this to test

answer = ask_question(sample_question, show_context=True)
print("\n💬 Answer:")
display(Markdown(answer))


🔍 Question: What is 'Attention' - explain for a business user?
Searching for relevant context...

📄 Retrieved Context:

Chunk 1 (similarity: 0.48):
benefit, self-attention could yield more interpretable models. We inspect attention distributions
from our models and present and discuss examples in the appendix. Not only do individual attention
heads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic
and semantic structure of the sentences.
5 Training
This section describes the training regime for our models.
5.1 Training Data and Batching
We trained on the standard WMT 2014 English-German data

Chunk 2 (similarity: 0.45):
ned with fact that the output embeddings are offset by one position, ensures that the
predictions for position ican depend only on the known outputs at positions less than i.
3.2 Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output,
where the query, keys, values, a

**Attention** is a way for the model to “look back” at different parts of a sentence (or any sequence) when it is trying to understand or generate text. Think of it as a smart spotlight that highlights the most relevant words for each word being processed.

- **How it works**: For every word, the model creates a small “query” vector. It then compares this query to all other words (the “keys”) and assigns a weight to each comparison. These weights decide how much each word should influence the current word’s representation.
- **Why it matters**: By focusing on the most important words, the model can capture long‑range relationships (e.g., a verb and its distant object) without needing to read the entire sentence multiple times. This makes the model more accurate and, because each attention “head” can learn a different pattern (syntax, semantics, etc.), it also becomes easier to interpret what the model is doing.

In business terms, attention lets a language system understand context more precisely—like spotting the key factors in a customer review or linking a product mention to its description—leading to better translations, summaries, or insights.

## 8. Utility Functions

In [10]:
def get_collection_stats():
    """Display statistics about the vector database"""
    count = collection.count()
    print(f"📊 Vector Database Statistics:")
    print(f"  - Total chunks: {count}")
    print(f"  - Embedding dimensions: {len(embedder.encode(['test'])[0])}")
    print(f"  - Storage location: ./chroma_db")
    return count

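def clear_database():
    """Clear all documents from the database"""
    # Recent ChromaDB versions reject an empty `where` filter, so delete by explicit IDs
    existing_ids = collection.get()['ids']
    if existing_ids:
        collection.delete(ids=existing_ids)
    print("🗑️ Database cleared")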

def process_multiple_pdfs(pdf_paths: List[str]):
    """Process multiple PDF files"""
    for path in pdf_paths:
        process_pdf(path)
        print()

# Display current stats
get_collection_stats()

📊 Vector Database Statistics:
  - Total chunks: 88
  - Embedding dimensions: 384
  - Storage location: ./chroma_db


88