# Build an AI Chatbot with LangChain

**Author:** Milos Saric [https://saricmilos.com/]  
**Date:** January 15, 2026  

Here's the thing: most "how-to" guides start with the same line: *"Grab your OpenAI API key and add a credit card."*  

But what if you **don't want to pay**? Maybe you're a student. Maybe you hate hitting rate limits. Or maybe you just want a chatbot that runs locally, offline, and respects your privacy.  

Good news: it's possible. In this guide, we'll show you how to build a chatbot using **LangChain, React, and TypeScript**, with **zero cloud dependencies**.

---

## How AI Usually Works: APIs and Tokens Explained

Think of an **API** as a waiter in a restaurant:

1. You ask the app: *"What's the capital of France?"*  
2. The waiter (API) delivers your request to the kitchen (the AI).  
3. The kitchen cooks up an answer.  
4. The waiter brings it back: *"Paris!"*

With OpenAI, your computer doesn't do the thinking. Everything happens on OpenAI's servers. The smarter the AI, the more expensive the "meal."

---

## Why Tokens Matter: Every Word Has a Price

Think of each word as a **puzzle piece**. The AI puts the pieces together to understand and respond:

- Short phrases: just a couple of pieces.  
- Complex sentences: 10–15 pieces or more.  

Every puzzle piece = **token**. APIs like OpenAI charge for tokens, not messages. That's why long chats or memory-enabled bots get expensive fast.

---

### What's a Token, Exactly?

A token is basically a chunk of text:

- `"Hello"` → 1 token  
- `"Artificial Intelligence is awesome"` → ~5–6 tokens  
- A full conversation → hundreds of tokens  
- Upload a document → thousands of tokens  

Example conversation (roughly 50–80 tokens):

**User:** "Hi, I can't log in to my account"  
**Bot:** "I can help! What error message are you seeing?"  
**User:** "'Invalid credentials,' but my password is correct"  
**Bot:** "Got it. Let's try some troubleshooting steps..."  

Multiply that by hundreds of users, and token costs can explode, even at just $0.002 per 1K tokens.
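
To make the arithmetic concrete, here is a minimal sketch of a cost estimate. The function name `estimate_cost` and the one-million-conversation volume are illustrative assumptions; the 80-token conversation and the $0.002 per 1K tokens rate come from the example above.

```python
def estimate_cost(tokens_per_conversation: int,
                  conversations: int,
                  price_per_1k_tokens: float) -> float:
    """Return the estimated bill in dollars for a number of conversations."""
    total_tokens = tokens_per_conversation * conversations
    return total_tokens / 1000 * price_per_1k_tokens


# Upper end of the example conversation (80 tokens) at $0.002 per 1K tokens.
print(f"${estimate_cost(80, 1, 0.002):.5f} per conversation")
# Hypothetical volume: one million such conversations.
print(f"${estimate_cost(80, 1_000_000, 0.002):.2f} for one million conversations")
```

A single short exchange costs a fraction of a cent, so the bill is driven by volume and by how much history or document text is resent with every request.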

---

## Hidden Costs of Relying on Paid APIs

Even beyond token fees, there are other downsides:

### Scaling Costs
- Testing & development: ~$50/month  
- Small business: $200–500/month  
- Enterprise: $2,000–10,000+/month  

### Infrastructure & Reliability
- Your app breaks if the API goes down  
- Latency slows the experience  
- Constant internet connection required  

### Privacy Risks
- All data goes through a third-party server  
- You can't control retention policies  
- Compliance issues for sensitive industries  

### Service Limits
- Rate limits throttle usage  
- Outages leave users stranded  
- Price changes can break your model  

Normally, this is where you'd start reaching for your wallet. But in this guide, **we'll avoid all of it**.


# What is LangChain, Really?

LangChain is an **open-source framework** built to help developers create AI applications that are **context-aware, reasoning-capable, and able to use external tools**. The twist? You **don't have to rely on cloud APIs** to make it work.  

Think of LangChain like **LEGO for AI**. You get different "blocks" that you can snap together:

- **Language Model blocks** – the brain of your AI  
- **Memory blocks** – to remember conversations  
- **Tool blocks** – calculators, web search, file access  
- **Logic blocks** – decision-making and reasoning  
- **Data blocks** – document search, databases  

Whether your "brain" is **GPT-4**, a **local model like LLaMA**, or something running entirely on your machine, LangChain helps you assemble all the pieces into a functioning chatbot.  

---

## Why LangChain is Special

Using LangChain with **local or open-source models** gives you benefits you can't get with cloud APIs:

- ✅ No API key required  
- ✅ No token limits – chat freely  
- ✅ No unexpected bills – completely free after setup  
- ✅ Full privacy – nothing leaves your computer  
- ✅ Works offline – no internet needed  
- ✅ Total control – fully customizable  

---

## Why Use LangChain? (The Big Deal)

LangChain isn't just about saving money; it's about **flexibility, power, and simplicity**. Whether you're a student, beginner, or seasoned developer, here's why it's a game-changer:


### 1. Simplified Development
LangChain removes the headache of writing all the low-level code yourself. It handles **connecting models, managing prompts, and storing memory** so you can focus on what your AI should actually do. Think of it as a **pre-built toolbox for AI**, ready to go.  

---

### 2. Modular and Flexible Architecture
Remember those LEGO blocks? That's the core of LangChain's design.  

- Swap out models or tools with a few lines of code  
- Add new data sources easily  
- Experiment and iterate faster than ever  

---

### 3. Context Awareness and Memory
A good chatbot remembers the conversation. LangChain comes with **built-in memory management**, so your AI keeps track of previous turns, creating **more natural and helpful interactions**.  

---

### 4. Agentic Capabilities
This is where LangChain shines for advanced AI:  

- Build AI **agents** that can reason and make decisions  
- Use tools automatically: search the web, run calculations, execute code  
- Create **multi-step workflows** where the AI can actually solve problems, not just answer questions  

---

### 5. Community and Ecosystem
Even though it's relatively new, LangChain has a **growing and supportive community**. Tutorials, examples, and resources are abundant. If you get stuck, chances are someone else has faced the same problem.  

---

LangChain isn't just a framework; it's a **platform for building smarter, flexible, and private AI applications** without relying on paid cloud services.  

# Building a RAG Chatbot with LangChain and OLLAMA APIs

Large Language Models (LLMs) are incredibly powerful (they can write essays, answer questions, and even get creative), but they have one key limitation: **their knowledge is static**. That means they can sometimes give outdated or incorrect answers.  

To overcome this, we can use **Retrieval Augmented Generation (RAG)**. RAG systems connect an LLM to **external data sources**, allowing the AI to fetch accurate, up-to-date information before generating a response.  

---

## Project Goal

The goal of this project is to build a **RAG chatbot** in **LangChain** using **OLLAMA APIs**. The chatbot can:

- Accept documents in **TXT, PDF, CSV, or DOCX** formats  
- Retrieve relevant content from your uploaded documents  
- Provide **accurate, context-aware answers** to your questions by sending the retrieved data to the LLM  

This setup ensures your chatbot isn't just guessing: it **leverages your documents to give correct answers**.  

---

## How the RAG System Works

We broke the system down into its main components:

1. **Document Loader** – Reads and parses your uploaded files  
2. **Vector Store / Embeddings** – Converts documents into a format that's easy for the chatbot to search  
3. **Retriever** – Finds the most relevant pieces of your documents based on your question  
4. **Conversational Retrieval Chain** – Combines the retrieved content with your question and passes it to the LLM for an accurate response  

---

## User Interface

To make the system interactive, we built a **React-based UI**. Users can:

- Upload documents  
- Ask questions about the content  
- Receive precise answers generated by the RAG-powered LLM  

This creates a **seamless chat experience** where the AI is directly informed by your data.

---

By combining **LangChain**, **RAG techniques**, and **OLLAMA APIs**, this project demonstrates how to create a chatbot that is **both intelligent and grounded in reliable data**, bridging the gap between static LLM knowledge and real-world information.


![Alt text](./images/rag-chatbot-architecture-1.png)

# Required Libraries Import

### CORE

In [1]:
from pathlib import Path
import numpy as np

In [2]:
BASE_DATA_DIR = Path("./data").resolve()  # main_folder/data

TMP_DIR = BASE_DATA_DIR / "tmp"  # main_folder/data/tmp
DOCS_DIR = BASE_DATA_DIR / "docs"  # main_folder/data/docs
VECTOR_STORE_DIR = BASE_DATA_DIR / "vector_stores"  # main_folder/data/vector_stores

# Make sure tmp folder exists
TMP_DIR.mkdir(parents=True, exist_ok=True)

### MODULES

In [3]:
from src.API_KEYS import get_environment_variable

In [4]:
%load_ext autoreload
%autoreload 2

from src.langchain import (langchain_document_loader,
                            show_random_preview, 
                            create_text_splitter,
                             split_documents,
                             tiktoken_tokens,
                             select_embeddings_model,
                             create_vectorstore,
                             print_documents,
                             create_advanced_markdown_splitter
                             )

# Required API Keys for Our Application

For this project, we will need **two API keys**:

- **OpenAI** API key: [Get an API key](https://platform.openai.com/account/api-keys)  
- **Google** API key: [Get an API key](https://makersuite.google.com/app/apikey)  

> ⚠️ **Security Notice:** We will **not include our secret keys** directly in this notebook.  
>
> First, make sure to set the following environment variables on your system:  
> `OPENAI_API_KEY`, `GOOGLE_API_KEY`.  
>
> After that, we can safely load them in Python as needed.
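
The `get_environment_variable` helper used in the next cell lives in the project's `src.API_KEYS` module and is not shown in this notebook. As a rough sketch of what such a helper might do (the name `read_required_env` and the `DEMO_API_KEY` variable are hypothetical, not the author's code), reading a key with the standard library looks like this:

```python
import os


def read_required_env(name: str) -> str:
    """Return the value of an environment variable, or fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set.")
    return value


# Demo with a placeholder variable so the snippet runs without a real key.
os.environ["DEMO_API_KEY"] = "not-a-real-key"
demo_key = read_required_env("DEMO_API_KEY")
print(f"Loaded a key of length {len(demo_key)}")
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the API client.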

In [5]:
openai_api_key = get_environment_variable("OPENAI_API_KEY")


[INFO]: 'OPENAI_API_KEY' has been set for this session.


In [6]:
from openai import OpenAI, OpenAIError

try:
    # Defaults to reading the key from the OPENAI_API_KEY environment variable
    client = OpenAI()
except OpenAIError:
    # Fall back to the key loaded above if the environment variable is not set
    client = OpenAI(api_key=openai_api_key)

# Send a single user question to the model
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Who is Milos Saric?"},
    ],
)

print(completion.choices[0].message.content)

Milos Saric is a fictional character from the TV show "The Vampire Diaries." He is a vampire hunter who has a vendetta against the vampires in the show.


# Conversational RAG Architecture

![Alt text](./images/rag_architecture.png)

## RAG Architecture: How Everything Fits Together

A Retrieval-Augmented Generation (RAG) system is built around **two core building blocks**, each with a clearly defined responsibility.

---

### 🧱 Block 1: Knowledge Ingestion & Retrieval

This block is responsible for **understanding and storing your data** so it can be searched efficiently later. It includes:

- **Document Loader** – Ingests external data (PDFs, text files, CSVs, DOCX, etc.)  
- **Text Splitter** – Breaks large documents into smaller, manageable chunks  
- **Embedding Model** – Converts text chunks into numerical vectors  
- **Vector Store (Chroma)** – Stores embeddings for fast similarity search  
- **Retriever** – Finds the most relevant document chunks for a given query  

Together, this block transforms raw documents into a searchable knowledge base.

---

### 🧠 Block 2: Reasoning, Memory & Generation

This block focuses on **thinking, context, and response generation**. It consists of:

- **Large Language Model (LLM)** – Generates human-like answers  
- **Prompt Templates** – Structure how questions and retrieved documents are presented to the LLM  
- **Memory** – Keeps track of conversation history across turns  

This block collaborates with the retriever to produce **accurate, context-aware answers**.

---

## End-to-End RAG Workflow

Below is a step-by-step walkthrough of how a user's question flows through the RAG system:

---

### 🔄 Step 1: Question Reformulation (Steps 1–4)

When a user asks a **follow-up question**, it may rely on prior context.  
To avoid ambiguity, the system first converts it into a **standalone question**.

**Example:**
- User: *"What does DTC stand for?"*  
- AI: *"DTC stands for Diffuse to Choose."*  
- Follow-up: *"Can you explain its use cases and implementation?"*  

The LLM rewrites this as:  
> **"What are the use cases and implementation of Diffuse to Choose (DTC)?"**

This ensures the retriever clearly understands the query.
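
The reformulation step boils down to sending the LLM a prompt that contains the chat history plus the follow-up. A minimal sketch of building such a prompt is shown below; the function name `build_condense_prompt` and the exact wording of the instruction are assumptions for illustration, not the template LangChain ships.

```python
def build_condense_prompt(chat_history: list[tuple[str, str]], follow_up: str) -> str:
    """Assemble a prompt asking the LLM to rewrite a follow-up as a standalone question."""
    history_text = "\n".join(f"{speaker}: {text}" for speaker, text in chat_history)
    return (
        "Given the conversation below, rewrite the follow-up question "
        "as a standalone question.\n\n"
        f"Chat history:\n{history_text}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )


history = [
    ("User", "What does DTC stand for?"),
    ("AI", "DTC stands for Diffuse to Choose."),
]
condense_prompt = build_condense_prompt(
    history, "Can you explain its use cases and implementation?"
)
print(condense_prompt)
```

The string returned here is what would be passed to the model; its completion becomes the standalone question used for retrieval.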

---

### 🔍 Step 2: Document Retrieval (Steps 5–8)

- The standalone question is converted into an embedding  
- The retriever compares it with stored embeddings in the **Chroma vector database**  
- The most relevant document chunks are retrieved based on similarity  

This step grounds the AI's response in **actual source material**.

---

### 🧩 Step 3: Prompt Augmentation & Answer Generation (Steps 9–10)

- Retrieved documents are injected into the LLM prompt  
- Chat history may also be included for continuity  
- The augmented prompt is sent to the LLM  

The LLM now generates an answer that is **both context-aware and evidence-based**.
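
Prompt augmentation is mostly string assembly. The sketch below shows one way to combine retrieved chunks, optional chat history, and the question into a single prompt; `build_answer_prompt` and the instruction text are hypothetical and stand in for whatever template the chain actually uses.

```python
def build_answer_prompt(question, retrieved_chunks, chat_history=()):
    """Combine retrieved context, optional history, and the question into one prompt."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    sections = [
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.",
        f"Context:\n{context}",
    ]
    if chat_history:
        history = "\n".join(f"{speaker}: {text}" for speaker, text in chat_history)
        sections.append(f"Chat history:\n{history}")
    sections.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(sections)


answer_prompt = build_answer_prompt(
    "What are the use cases of Diffuse to Choose (DTC)?",
    ["Chunk about DTC use cases.", "Chunk about DTC implementation."],
    chat_history=[("User", "What does DTC stand for?")],
)
print(answer_prompt)
```

Numbering the chunks makes it easier to ask the model to cite which passage supported its answer.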

---

### 💾 Step 4: Memory Update (Step 11)

- The user's follow-up question and the AI's response are saved to memory  
- This allows future questions to build naturally on previous interactions  

---

## What's Next?

In the following sections, we'll take a deeper look at **each individual component**, from document loading and embeddings to retrievers, memory, and conversational chains, so you can fully understand how to build and customize your own RAG system.


# Conversational RAG Implementation

## Retrieval

Retrieval includes: document loaders, text splitting into chunks, vector stores and embeddings, and finally retrievers.

# How to Add More Personal Information to Train Your RAG System

## First: A Key Mindset Shift

Retrieval-Augmented Generation (RAG) does **not** train or fine-tune the language model itself.  
Instead, it improves responses by **retrieving relevant documents** and injecting them into the model's prompt.

> In simple terms:  
> **You don't train the model; you train the knowledge base.**

The quality of your RAG chatbot depends directly on the **quality, structure, and relevance** of the documents you provide.

---

## What Kind of Information Should You Add About Yourself?

To get the best results, organize your personal data into **clear categories** rather than one large, unstructured file.

## Document Loaders in LangChain

Document loaders are a core component of LangChain. Their job is simple but critical: **they ingest data from different sources and convert it into a standardized document format** that the rest of the pipeline can understand.

LangChain provides **80+ built-in document loaders**, making it easy to work with data from almost anywhere, including:

- 🌐 Web pages and APIs  
- ☁️ Cloud storage services (e.g., AWS S3)  
- 📁 Local files such as TXT, PDF, CSV, and JSON  
- 🧾 Git repositories  
- 📧 Emails and messaging platforms  
- 🗄️ Databases and other structured sources  

You can explore the full list here:  
👉 [LangChain Document Loaders](https://python.langchain.com/docs/integrations/document_loaders)

---

## Why Document Loaders Matter

Raw data comes in many formats. Document loaders handle:
- File reading and parsing  
- Format-specific preprocessing  
- Converting data into LangChain's `Document` objects  

This ensures downstream components (like text splitters, embeddings, and retrievers) can work seamlessly regardless of the data source.
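
To illustrate the idea without depending on LangChain, here is a simplified stand-in for a loader. `SimpleDocument` and `load_text_documents` are hypothetical names; they only mirror the `page_content` and `metadata` fields that the rest of this notebook reads from loaded documents, and they handle plain-text files only.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Stand-in for a loaded document: text plus metadata such as the source path."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_documents(directory, extensions=(".txt", ".md")):
    """Read every matching file under `directory` into a SimpleDocument."""
    docs = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            text = path.read_text(encoding="utf-8")
            docs.append(SimpleDocument(page_content=text, metadata={"source": str(path)}))
    return docs


# Demo on a throwaway directory: one text file is loaded, the PNG is ignored.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "notes.txt").write_text("hello", encoding="utf-8")
    Path(tmp, "image.png").write_bytes(b"\x89PNG")
    demo_docs = load_text_documents(tmp)

print(len(demo_docs), demo_docs[0].page_content)
```

Real loaders add format-specific parsing (PDF, DOCX, CSV), but the output shape is the same: text plus metadata.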

---

## Document Loaders in Our Application

For our application, we use the **`DirectoryLoader`** to load files from a temporary directory (`TMP_DIR`).

This approach allows us to:
- Upload multiple files at once  
- Support different document formats  
- Process all files with a single loader  

Supported formats include:
- `.txt`
- `.pdf`
- `.csv`
- `.docx`

---


In [7]:
from collections import Counter

# Load documents
documents = langchain_document_loader(DOCS_DIR)

# Professional summary reporting
if documents:
    counts = Counter(doc.metadata.get('source', '').split('.')[-1] for doc in documents)
    
    print("--- Load Summary ---")
    print(f"Total Documents: {len(documents)}")
    for ext, count in counts.items():
        print(f"  • {ext.upper()}: {count} files")
else:
    print("⚠️ No documents found. Check your directory path and file extensions.")

--- Load Summary ---
Total Documents: 23
  • MD: 23 files


In [9]:
import random
from IPython.display import Markdown
random_document_id = random.choice(range(len(documents)))

Markdown(f"**Document[{random_document_id}]** \n\n **Page content** (first 1000 character):\n\n" +\
         documents[random_document_id].page_content[0:1000] + " ..."  +\
         "\n\n**Metadata:**\n\n" + str(documents[random_document_id].metadata))

**Document[14]** 

 **Page content** (first 1000 character):

# Pricing & Packages – Cassiopeia Intelligence

This document provides a **detailed, enterprise-grade overview of pricing and packages** for AI and RAG-powered solutions, ensuring transparency and clarity for clients.

---

## 1. Chatbots

* **Basic Package:**

  * Single-channel chatbot (web or mobile)
  * FAQ automation and standard multi-turn conversations
  * RAG integration with a single knowledge source
  * Price: $25,000 – $50,000
  * Delivery: 4–6 weeks

* **Standard Package:**

  * Multi-channel deployment (web, mobile, social media)
  * Multi-turn RAG-powered conversations
  * Knowledge base integration across multiple documents
  * Analytics dashboard and reporting
  * Price: $50,000 – $100,000
  * Delivery: 8–10 weeks

* **Enterprise Package:**

  * Advanced multi-turn RAG chatbot with personalized responses
  * Multi-source knowledge retrieval and vector search
  * Integration with CRM, ticketing, and ERP systems
  * Custom analytics and monitoring dashboards
  * Price: $1 ...

**Metadata:**

{'source': 'C:\\Users\\Milos\\Desktop\\GitHub_Kaggle_Projects\\simple_RAG_assistant_with_Langchain\\data\\docs\\pricing.md'}

# TEXT SPLITTERS

## <a class="anchor" id="text_splitters">Text Splitters</a>

Text splitters are tools that divide large documents into smaller sections, or **chunks**, that fit within a model's context window. Since language models can only process a limited number of tokens at a time, splitting text effectively is crucial for maintaining context and ensuring high-quality responses.

In **LangChain**, text can be split in several ways:

- **By tokens** – divides text based on the number of tokens used by the model.  
- **By characters** – splits text according to character counts.  
- **By code structure** – specialized splitters exist for programming languages like Java, JavaScript, and PHP, allowing chunks to respect logical code blocks.  

### Recommended Splitter for General Text

For most text documents, it is recommended to use the **[RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter)**.  

- **How it works:**  
  The splitter uses a **list of characters or strings** (for example, `"\n\n"`, `"\n"`, `" "`) in a specific order to recursively divide the text. It continues splitting until each chunk is small enough to fit within the model's context window.  

- **Why it's effective:**  
  This method preserves **semantic relationships** between paragraphs, sentences, and words by keeping related content together as much as possible. Instead of splitting arbitrarily, it prioritizes meaningful boundaries.
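
The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical name, length is measured in characters, and overlap is left out.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first; fall back to finer ones for oversized parts."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separators left: hard-cut the text into fixed-size pieces.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        return recursive_split(text, max_len, finer)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate          # keep merging neighbours while they fit
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_len:
            current = part
        else:
            chunks.extend(recursive_split(part, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
split_result = recursive_split(sample, max_len=8)
print(split_result)
```

Paragraph breaks are tried first, so text is only cut at word level when a paragraph on its own exceeds the limit.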

### Chunk Overlap

To ensure consistency and improve context retention, a **small overlap** between consecutive chunks is recommended. This means that some content at the end of one chunk is repeated at the start of the next.  

- **Benefits:**  
  - Helps the model maintain context across chunks.  
  - Reduces the chance of losing important information between splits.  

By using a proper text splitter and overlap, you can maximize the usefulness of documents in **retrieval-augmented generation (RAG)** workflows and other applications requiring structured text input.
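
Overlap is easiest to see with a fixed-size window. The following sketch (the helper name `chunk_with_overlap` is an assumption, and sizes are in characters rather than tokens) repeats the last `overlap` characters of each chunk at the start of the next:

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut text into fixed-size chunks where consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


overlap_chunks = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=1)
print(overlap_chunks)
```

Each chunk starts with the final character of the previous one, so a sentence that straddles a boundary is still partly visible in both chunks.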


In [None]:
# 2. Initialize Splitters
md_splitter, rec_splitter = create_advanced_markdown_splitter()

In [12]:
# 3. Two-Stage Splitting Process
final_chunks = []

In [13]:
for doc in documents:
    # STAGE A: Structural Split (Header-aware)
    # Note: .split_text() returns a list of Document objects with header metadata
    header_docs = md_splitter.split_text(doc.page_content)
    
    # Optional: Attach original file metadata (like 'source') to each header split
    for h_doc in header_docs:
        h_doc.metadata.update(doc.metadata)
        
    # STAGE B: Size-based Split (Recursively to fit 1600 characters)
    # .split_documents() accepts the list of Docs from Stage A
    sub_chunks = rec_splitter.split_documents(header_docs)
    
    final_chunks.extend(sub_chunks)

print(f"\n✅ Created {len(final_chunks)} semantic chunks ready for the Vector Store.")


✅ Created 205 semantic chunks ready for the Vector Store.


In [14]:
token_counts = tiktoken_tokens(documents)
print(token_counts)

[1070, 932, 1291, 1174, 1432, 487, 671, 759, 771, 635, 709, 714, 635, 610, 811, 1043, 1068, 534, 1054, 946, 866, 727, 727]


In [15]:
chunks_length = tiktoken_tokens(final_chunks,model="gpt-3.5-turbo")

print(f"Number of tokens - Average : {int(np.mean(chunks_length))}")
print(f"Number of tokens - 25% percentile : {int(np.quantile(chunks_length,0.25))}")
print(f"Number of tokens - 50% percentile : {int(np.quantile(chunks_length,0.5))}")
print(f"Number of tokens - 75% percentile : {int(np.quantile(chunks_length,0.75))}")
print("\nMax_tokens for gpt-3.5-turbo: 4096")

Number of tokens - Average : 95
Number of tokens - 25% percentile : 52
Number of tokens - 50% percentile : 86
Number of tokens - 75% percentile : 127

Max_tokens for gpt-3.5-turbo: 4096


# Vectorstores and Embeddings

### Text Embeddings

**Text embeddings** are numerical representations of text in a high-dimensional vector space. In simpler terms, they convert words, sentences, or entire documents into a list of numbers (vectors) that capture their **semantic meaning**.  

For example, OpenAI's `text-embedding-ada-002` model produces embeddings of size **1536**, meaning each text input is represented as a vector with 1,536 numbers.

---

#### Why embeddings matter

Embeddings allow us to **compare the meaning of different texts**. Once text is converted into vectors, we can measure similarity between them using mathematical techniques.  

- **Cosine similarity** is the most commonly used metric.  
  - Values range from `-1` (opposite direction) to `1` (same direction); `0` means the vectors are orthogonal, i.e. unrelated.  
  - Higher cosine similarity means the texts are more semantically alike.  

This is especially useful for:
- **Semantic search** – finding documents most relevant to a query.  
- **Recommendation systems** – suggesting similar content.  
- **Clustering and classification** – grouping similar texts together.  

---

#### Embedding providers

Several platforms provide pre-trained embedding models. LangChain supports easy integration with these services:

| Provider       | Model | Vector dimension | Notes / Cost |
|----------------|-------|----------------|--------------|
| OpenAI         | [text-embedding-ada-002](https://platform.openai.com/docs/guides/embeddings/embedding-models) | 1536 | **$0.00010 per 1K tokens** |
| Google         | [models/embedding-001](https://ai.google.dev/models/gemini?hl=en) | 768 | **Rate limit:** 1500 requests per minute |
| Hugging Face   | [thenlper/gte-large](https://huggingface.co/thenlper/gte-large) | 1024 | **Free** |

> 💡 **Tip:** Larger vector dimensions often capture more nuanced meaning but may be more computationally expensive for storage and similarity calculations.

---


In [16]:
embeddings_openai = select_embeddings_model(LLM_service="OpenAI", openai_api_key=openai_api_key)


In [17]:
import numpy as np
from itertools import combinations

# Example sentences
sentences = [
    "I want to become the world greatest data scientist.",
    "I love data science.",
    "How many hours should you walk per day?"
]

# -------------------------------
# Step 1: Generate embeddings
# -------------------------------
# embeddings_openai was initialized in the previous cell
embedding_vectors = [embeddings_openai.embed_query(sentence) for sentence in sentences]

# -------------------------------
# Step 2: Calculate pairwise similarity
# -------------------------------
def cosine_similarity(vec1, vec2):
    """Compute cosine similarity between two vectors."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Iterate over all sentence pairs
for i, j in combinations(range(len(sentences)), 2):
    sim_score = round(cosine_similarity(embedding_vectors[i], embedding_vectors[j]), 3)
    print(f"Similarity between sentence {i} and {j}: {sim_score}")

Similarity between sentence 0 and 1: 0.899
Similarity between sentence 0 and 2: 0.711
Similarity between sentence 1 and 2: 0.712


### Vectorstores

A **vectorstore** is a specialized type of database designed to store **embedding vectors**: the numerical representations of text, images, or other data.  

Unlike traditional databases that search by exact matches or keywords, vectorstores allow you to **search for the items that are most semantically similar** to a query. This is done by comparing the embeddings of the query with the stored embeddings, typically using metrics like **cosine similarity** or **Euclidean distance**.

There are several open-source and commercial vectorstore options available. For this guide, we will use **[Chroma](https://python.langchain.com/docs/integrations/vectorstores/chroma)**, a lightweight and efficient vector database that integrates seamlessly with LangChain.

> 💡 **Tip:** Vectorstores are a key component of modern AI workflows, such as **semantic search**, **question answering**, and **retrieval-augmented generation (RAG)**, because they allow models to find relevant information quickly and accurately based on meaning, not just keywords.
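
Conceptually, a vectorstore does two things: it stores `(vector, text)` pairs and it ranks them against a query vector. The toy class below illustrates that with a brute-force cosine search. `TinyVectorStore` and the two-dimensional vectors are made up for illustration; Chroma uses an approximate nearest-neighbour index and real embeddings instead.

```python
import math


class TinyVectorStore:
    """In-memory store that ranks stored texts by cosine similarity to a query vector."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._items.append((list(vector), text))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms

    def search(self, query_vector, k=4):
        """Return the k (score, text) pairs with the highest similarity."""
        scored = [(self._cosine(query_vector, vec), text) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = TinyVectorStore()
store.add([1.0, 0.0], "pricing")
store.add([0.0, 1.0], "hiring")
store.add([0.9, 0.1], "packages")

top_results = store.search([1.0, 0.05], k=2)
print([text for _, text in top_results])
```

Brute-force search is fine for a few hundred chunks; dedicated vector databases exist because this linear scan stops scaling once the collection grows large.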

In [18]:
vectorstore_name = "milossaric_vectorstore"
vector_store = create_vectorstore(
    embeddings=embeddings_openai,
    documents=final_chunks,
    vectorstore_name=vectorstore_name,
    vectorstore_dir=VECTOR_STORE_DIR
)

print(f"Vector store created and persisted at: {VECTOR_STORE_DIR / vectorstore_name}")

Vector store created and persisted at: C:\Users\Milos\Desktop\GitHub_Kaggle_Projects\simple_RAG_assistant_with_Langchain\data\vector_stores\milossaric_vectorstore


In [19]:
from langchain_community.vectorstores import Chroma

# Define the full path to the persisted vector store
vector_store_path = VECTOR_STORE_DIR / "milossaric_vectorstore"

# Load the persisted vector store
vector_store_OpenAI = Chroma(
    persist_directory=vector_store_path.as_posix(),
    embedding_function=embeddings_openai
)

# Print the number of vectors/chunks in the store
print("vector_store_OpenAI:", vector_store_OpenAI._collection.count(), "chunks.")


vector_store_OpenAI: 236 chunks.



#### Similarity Search

`Similarity search` is a technique used to find documents that are most relevant or similar to a given query. Unlike simple keyword matching, it relies on **embedding vectors**, which are numerical representations of the semantic meaning of text.  

**How it works:**

1. **Query Embedding:**  
   The input question or query is converted into an embedding vector using the same embedding model as the vector store. This vector captures the meaning of the query in a numerical form.

2. **Comparison with Vector Store:**  
   The query embedding is compared against all document embeddings stored in the vector store. Similarity is usually measured using metrics like **cosine similarity**, which evaluates how close vectors are in the embedding space.

3. **Selecting Top Matches:**  
   The system selects the **k most similar documents** to the query (by default, k = 4). These documents are considered most likely to contain information relevant to the query.

4. **Providing Context to LLMs:**  
   The retrieved documents are sent along with the query to a **large language model (LLM)**, such as ChatGPT, Google Gemini, or others. Providing these relevant documents helps the LLM generate **more accurate and informed responses**.

5. **Optimizing Retrieval:**  
   To improve efficiency and relevance, we can later apply **contextual compression**, which condenses the retrieved documents to only the most essential information while preserving context for the LLM.

**Summary:**  
Similarity search acts as a **bridge between your knowledge base and the LLM**, ensuring that the model has access to the most relevant information when answering queries.


In [20]:
# Example query
query = "Who can make me a recommendation system?"

# Retrieve the top 4 most similar documents with scores
# The score is a distance, so lower = more similar
# (Chroma uses L2 distance unless the collection is configured for cosine)
docs_with_scores = vector_store_OpenAI.similarity_search_with_score(query, k=4)

# Print the results with scores
print_documents(docs_with_scores, search_with_score=True)


Document 1:

## 2. Who Can Build a Recommendation System?  
The **Cassiopeia Intelligence Engineering Team** manages all recommendation system projects from conception to production deployment.

Score: 0.293

----------------------------------------------------------------------------------------------------
Document 2:

### 1.1 Recommendation Systems
- **Overview:** Custom collaborative, content-based, and hybrid recommendation engines.
- **Capabilities:**
- Collaborative filtering (Matrix Factorization, ALS, SVD)
- Content-based filtering (TF-IDF, embeddings with BERT / GPT models)
- Hybrid models combining collaborative + content signals
- RAG-enhanced recommendation for multi-source reasoning
- **Integration:** API-first design compatible with FastAPI, Django, or Node.js backends.
- **Use Cases:** E-commerce, Book Discovery platforms, SaaS personalization.
- **Reference Documents:**
- "Recommendation Systems Overview"
- "Recommendation System Implementation Guide"

Score: 0

In [21]:
import numpy as np

def compute_and_log_relevance(query, docs_with_scores, embeddings_model):
    """
    Computes semantic similarity using a vectorized dot product 
    and logs a formatted relevance report.
    """
    # 1. Generate Embeddings (Vectorized)
    query_vector = embeddings_model.embed_query(query)
    document_texts = [doc[0].page_content for doc in docs_with_scores]
    doc_matrix = np.array(embeddings_model.embed_documents(document_texts))

    # 2. Vectorized Similarity Calculation (Dot Product)
    # Using np.inner for high-performance vector-matrix multiplication
    relevance_scores = np.inner(query_vector, doc_matrix)

    # 3. Professional Report Formatting
    print(f"{'='*20} RETRIEVAL RELEVANCE REPORT {'='*20}")
    print(f"QUERY: \"{query}\"\n")
    
    for idx, score in enumerate(relevance_scores):
        source = docs_with_scores[idx][0].metadata.get('source', 'Unknown')
        # Display as normalized relevance (assuming unit vectors)
        print(f"RANK {idx+1} | SCORE: {score:.4f} | SOURCE: {source}")
        
    print(f"{'='*68}")

In [22]:
# Execute the reporting function
compute_and_log_relevance(query, docs_with_scores, embeddings_openai)

QUERY: "Who can make me a recommendation system?"

RANK 1 | SCORE: 0.8533 | SOURCE: C:\Users\Milos\Desktop\GitHub_Kaggle_Projects\simple_RAG_assistant_with_Langchain\data\docs\services-solutions.md
RANK 2 | SCORE: 0.8229 | SOURCE: C:\Users\Milos\Desktop\GitHub_Kaggle_Projects\simple_RAG_assistant_with_Langchain\data\docs\services-solutions.md
RANK 3 | SCORE: 0.8224 | SOURCE: C:\Users\Milos\Desktop\GitHub_Kaggle_Projects\simple_RAG_assistant_with_Langchain\data\docs\recommendation-systems.md
RANK 4 | SCORE: 0.8152 | SOURCE: C:\Users\Milos\Desktop\GitHub_Kaggle_Projects\simple_RAG_assistant_with_Langchain\data\docs\recommendation-system-implementation.md


# Maximum Marginal Relevance (MMR) Search

Maximum Marginal Relevance (MMR) is a technique used in search and information retrieval to **select documents that are both relevant to a query and diverse among themselves**. This helps avoid redundancy in search results.

- **Relevance:** How closely a document matches the query.
- **Diversity:** How different each selected document is from the others.

MMR balances these two aspects. It ensures that the selected documents are not only similar to your query but also cover different perspectives or topics.
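
To make the trade-off concrete, here is a minimal sketch of greedy MMR selection in plain Python. It is an illustration of the idea, not the code LangChain runs internally: the toy 2-D vectors and the helper names `cosine` and `mmr_select` are assumptions made for this example, and `lambda_mult` simply mirrors the name of the weighting parameter that LangChain's MMR search exposes.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: return the indices of k documents in selection order.

    lambda_mult=1.0 ranks purely by relevance; lower values favor diversity.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy "embeddings" (hypothetical): documents 0 and 1 are near-duplicates,
# document 2 is less relevant but covers a different direction.
query_vec = [1.0, 0.0]
doc_vecs = [[0.9, 0.1], [0.9, 0.12], [0.6, 0.8]]

print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # [0, 1]
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.3))  # [0, 2]
```

With `lambda_mult=1.0` the two near-duplicates win because only relevance counts. Lowering the weight to `0.3` penalizes the second near-duplicate enough that the more distinct document takes its place.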


In [23]:
query = 'what is Diffuse to Choose?'
docs_MMR = vector_store_OpenAI.max_marginal_relevance_search(query,k=4)

print_documents(docs_MMR)

Document 1:

## 2. Model Selection  
### 2.1 Classical Models  
* **Collaborative Filtering:** User-based, item-based, matrix factorization (SVD, ALS)
* **Content-Based Models:** TF-IDF, BM25, feature-based similarity
* **Hybrid Approaches:** Weighted, cascade, or switching models

----------------------------------------------------------------------------------------------------
Document 2:

## 5. Deployment Options  
### 5.1 Cloud Deployment  
* AWS, GCP, Azure for scalable, managed infrastructure
* Serverless functions for event-driven RAG pipelines
* Kubernetes (EKS, GKE, AKS) for containerized microservices

----------------------------------------------------------------------------------------------------
Document 3:

### 1.3 Tailwind CSS  
* Utility-first CSS framework for rapid UI development and consistent design language
* Enables responsive and adaptive design without heavy custom CSS
* Works seamlessly with component libraries for maintainable and scalable styling
* Reduc

## <a class="anchor" id="retrievers">Retrievers</a>

A **retriever** is a component in a search or question-answering system that is responsible for **finding and returning documents relevant to a user’s query**.  

Think of a retriever as the first step in a system that answers questions: it **finds the right documents**, which can then be processed further, e.g., summarized or used by a language model.

For more details, you can check the [LangChain retrievers documentation](https://python.langchain.com/docs/modules/data_connection/retrievers/).

---

### How Retrievers Work

1. **Receive a query**: The retriever takes a user query as input.
2. **Search documents**: It searches a collection of documents (could be a database, a vector store, or a traditional search engine).
3. **Return relevant documents**: It outputs documents that are most likely to answer the query.

> Retrievers **do not generate new text**; they only fetch existing information.  

---

### Types of Retrievers

1. **Vectorstore-backed Retriever** (Semantic Search)  
   - Uses **embeddings** to represent documents and queries as high-dimensional vectors.
   - Measures similarity between query vectors and document vectors.
   - Returns documents that are **semantically similar**, even if exact words do not match.

2. **Keyword-based Retriever**  
   - Uses traditional keyword matching (like search engines or databases).
   - Simple but may miss documents that **use different wording** for the same concept.

3. **Hybrid Retriever**  
   - Combines vector search and keyword search for **more accurate retrieval**.
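
One common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below is only an assumption about how a hybrid retriever could merge results, since the text above does not prescribe a method. The function name, the document IDs, and the constant `c=60` are illustrative choices.

```python
def reciprocal_rank_fusion(rankings, c=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (c + rank) for every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    # Highest fused score first; ties broken by document ID for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from two independent retrievers
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that both retrievers agree on (`doc_b`, `doc_a`) rise to the top, while documents found by only one retriever are kept but ranked lower.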

---

In [24]:
from typing import Optional, Literal, Dict, Any
from langchain_core.vectorstores import VectorStore
from langchain_core.retrievers import BaseRetriever

def get_vectorstore_retriever(
    vectorstore: VectorStore,
    search_type: Literal["similarity", "mmr", "similarity_score_threshold"] = "similarity",
    k: int = 4,
    score_threshold: Optional[float] = None,
    fetch_k: int = 20
) -> BaseRetriever:
    """Standardizes the instantiation of vectorstore-backed retrievers.

    Args:
        vectorstore: The initialized LangChain VectorStore instance.
        search_type: Algorithm for retrieval. 
            - 'similarity': Standard cosine/L2 distance.
            - 'mmr': Max Marginal Relevance (diversifies results).
            - 'similarity_score_threshold': Filters results by absolute score.
        k: The number of final documents to return to the LLM.
        score_threshold: Minimum relevance score required (range 0.0 to 1.0).
            Only utilized when search_type is 'similarity_score_threshold'.
        fetch_k: Amount of documents to pass to the MMR algorithm for reranking.

    Returns:
        BaseRetriever: A configured retriever object ready for RAG chains.
    """
    
    # Define search parameters with sensible defaults for enterprise RAG
    search_kwargs: Dict[str, Any] = {"k": k}
    
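    if search_type == "mmr":
        search_kwargs["fetch_k"] = fetch_k

    # Forward the threshold only to the search type that uses it; other
    # search types may reject or ignore an unexpected score_threshold.
    if search_type == "similarity_score_threshold":
        if score_threshold is None:
            raise ValueError("score_threshold is required for 'similarity_score_threshold'.")
        search_kwargs["score_threshold"] = score_threshold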

    return vectorstore.as_retriever(
        search_type=search_type,
        search_kwargs=search_kwargs
    )

In [25]:
# similarity search
base_retriever_OpenAI = get_vectorstore_retriever(vector_store_OpenAI,"similarity",k=10)

In [28]:
# Get relevant documents

query = 'Who can build me a recommendation system?'
relevant_docs = base_retriever_OpenAI.invoke(query)

print_documents(relevant_docs)

Document 1:

## 2. Who Can Build a Recommendation System?  
The **Cassiopeia Intelligence Engineering Team** manages all recommendation system projects from conception to production deployment.

----------------------------------------------------------------------------------------------------
Document 2:

### 1.1 Recommendation Systems
- **Overview:** Custom collaborative, content-based, and hybrid recommendation engines.
- **Capabilities:**
- Collaborative filtering (Matrix Factorization, ALS, SVD)
- Content-based filtering (TF-IDF, embeddings with BERT / GPT models)
- Hybrid models combining collaborative + content signals
- RAG-enhanced recommendation for multi-source reasoning
- **Integration:** API-first design compatible with FastAPI, Django, or Node.js backends.
- **Use Cases:** E-commerce, Book Discovery platforms, SaaS personalization.
- **Reference Documents:**
- “Recommendation Systems Overview”
- “Recommendation System Implementation Guide”

----------------------

In [29]:
from langchain_core.runnables import RunnableConfig

def fetch_context_with_tracing(retriever, query: str):
    """Retrieves context with production-grade error handling and config."""
    try:
        # Using .invoke is the standard for LCEL (LangChain Expression Language)
        # config allows for tagging and metadata for LangSmith/Debugging
        config = RunnableConfig(tags=["production-retrieval"], metadata={"user_id": "milos_01"})
        
        docs = retriever.invoke(query, config=config)
        
        if not docs:
            print(f"DEBUG: No documents retrieved for query: {query}")
            return []
            
        return docs
    except Exception as e:
        print(f"CRITICAL: Retrieval failed. Error: {str(e)}")
        return []

In [30]:
relevant_docs = fetch_context_with_tracing(base_retriever_OpenAI, query)

In [31]:
print_documents(relevant_docs)

Document 1:

## 2. Who Can Build a Recommendation System?  
The **Cassiopeia Intelligence Engineering Team** manages all recommendation system projects from conception to production deployment.

----------------------------------------------------------------------------------------------------
Document 2:

### 1.1 Recommendation Systems
- **Overview:** Custom collaborative, content-based, and hybrid recommendation engines.
- **Capabilities:**
- Collaborative filtering (Matrix Factorization, ALS, SVD)
- Content-based filtering (TF-IDF, embeddings with BERT / GPT models)
- Hybrid models combining collaborative + content signals
- RAG-enhanced recommendation for multi-source reasoning
- **Integration:** API-first design compatible with FastAPI, Django, or Node.js backends.
- **Use Cases:** E-commerce, Book Discovery platforms, SaaS personalization.
- **Reference Documents:**
- “Recommendation Systems Overview”
- “Recommendation System Implementation Guide”

----------------------

### Contextual Compression

When using retrieval-augmented generation (RAG), **retrieved documents often contain information that is irrelevant** to the user’s query. Passing all of it to a language model **raises cost and can reduce accuracy**.  

> In high-stakes enterprise RAG systems, **more data often means more noise**.

For example:  
- Suppose a retrieved document chunk is **1,600 characters**, but only **200 characters are relevant** to the query.  
- Sending the entire chunk to the model means paying for **1,400 characters of unnecessary content**, which may **distract the model** and reduce answer quality.
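
The arithmetic behind this example is easy to check. The snippet below is a back-of-the-envelope sketch: the ratio of roughly 4 characters per token is a rule-of-thumb assumption for English text, not an exact tokenizer count, and the 10 chunks match the `k=10` retriever configured earlier.

```python
chunk_chars = 1600      # size of one retrieved chunk
relevant_chars = 200    # portion that actually answers the query
chunks_retrieved = 10   # matches the k=10 retriever used earlier
chars_per_token = 4     # assumption: rough average for English text

wasted_chars = chunk_chars - relevant_chars
wasted_fraction = wasted_chars / chunk_chars
wasted_tokens_per_chunk = wasted_chars // chars_per_token
wasted_tokens_per_query = wasted_tokens_per_chunk * chunks_retrieved

print(f"Wasted characters per chunk: {wasted_chars}")           # 1400
print(f"Share of chunk that is noise: {wasted_fraction:.1%}")   # 87.5%
print(f"Approx. wasted tokens per chunk: {wasted_tokens_per_chunk}")  # 350
print(f"Approx. wasted tokens per query: {wasted_tokens_per_query}")  # 3500
```

Under these assumptions, almost nine tenths of every chunk is noise, and a single query carries a few thousand tokens the model did not need.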

**Contextual Compression** solves this problem by:  
- Extracting only the **“meat”**: the portions of the document that are **directly relevant** to the user’s query.  
- Passing a **concise, focused summary** to the model instead of the full chunk.  

This approach leads to:  
- **Lower costs** (fewer tokens processed by the LLM)  
- **Higher accuracy** (less noise for the model to process)  
- **More efficient retrieval-augmented pipelines** in enterprise applications

### Contextual Compression Retriever

The **Contextual Compression Retriever** is designed to **remove irrelevant information** from retrieved documents, keeping only content that is directly related to the query context. This helps reduce costs and improves the accuracy of LLM responses.

---

### How the Contextual Compression Retriever Works

1. **Query Retrieval:**  
   The user query is first passed to a **base retriever** (typically a vectorstore-backed retriever) which returns an initial set of documents.

2. **Document Compression:**  
   The retrieved documents are then passed through a **Document Compressor**, which either reduces the content or removes irrelevant documents entirely.  

> The document compressor can make an [LLM call](https://python.langchain.com/docs/modules/data_connection/retrievers/contextual_compression#adding-contextual-compression-with-an-llmchainextractor) to perform **contextual compression** on each document.  
> ⚠️ This approach can be slow and costly if done on large datasets.
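
The two stages can be sketched without any external services. In the toy example below, simple word overlap stands in for both the base retriever's vector search and the LLM-based extractor, so it shows the flow only, not LangChain's implementation. All function names, the stopword list, and the sample documents are hypothetical.

```python
import re

STOPWORDS = {"a", "an", "the", "is", "are", "who", "can", "me", "to", "of",
             "and", "in", "on", "our", "we", "it"}

def content_words(text):
    """Lowercased words in the text, minus common stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def base_retrieve(query, documents, k=2):
    """Stage 1: rank whole documents by word overlap with the query."""
    q = content_words(query)
    ranked = sorted(documents, key=lambda d: len(q & content_words(d)), reverse=True)
    return ranked[:k]

def compress(query, document):
    """Stage 2: keep only the sentences that share a content word with the query."""
    q = content_words(query)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(s for s in sentences if q & content_words(s))

def compressed_retrieve(query, documents, k=2):
    """Run both stages and drop documents with nothing relevant left."""
    results = []
    for doc in base_retrieve(query, documents, k):
        shortened = compress(query, doc)
        if shortened:
            results.append(shortened)
    return results

documents = [
    "Our office is open on weekdays. The engineering team builds recommendation systems. Parking is free.",
    "Tailwind is a CSS framework. It speeds up UI work.",
    "We deploy on Kubernetes. A recommendation engine needs fresh data.",
]
query = "Who can build a recommendation system?"

for passage in compressed_retrieve(query, documents, k=2):
    print(passage)
```

Each retrieved document shrinks to the single sentence that mentions the query topic. If `k` is raised to 3, the Tailwind document is retrieved but then removed entirely because nothing in it survives compression.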

---

### Efficient Alternative: Document Compressor Pipeline

Instead of compressing entire documents directly, we can use a **Document Compressor Pipeline** to process documents more efficiently:

1. **Split Documents into Chunks:**  
   Use `CharacterTextSplitter` to break each document into smaller chunks (e.g., chunk size = 500 characters).

2. **Remove Redundant Chunks:**  
   Apply `EmbeddingsRedundantFilter` to filter out overlapping or repeated content.

3. **Select Most Relevant Chunks:**  
   Use `EmbeddingsFilter` to pick the chunks most relevant to the query based on a **similarity threshold** and **k parameter**.  
   - For example, set `k = 16` to select the top 16 chunks.

4. **Reorder Chunks for LLM Efficiency:**  
   Use `LongContextReorder` to arrange the chunks so that the **most relevant elements appear at the top and bottom** of the list.  
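   - Models tend to use information at the start and end of a long context more reliably than information in the middle, so this ordering keeps the key chunks where they are most likely to be used.  
   - More details in the ["Lost in the Middle" paper](https://arxiv.org/abs/2307.03172).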
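
Steps 2 to 4 can be illustrated with a small, dependency-free sketch. It assumes the documents have already been split into chunks (step 1) and embedded. The 3-D vectors, the chunk names, the threshold values, and the function names are all assumptions made for the example, not LangChain's classes or defaults.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def drop_redundant(chunks, threshold=0.95):
    """Step 2: keep a chunk only if it is not too similar to one already kept."""
    kept = []
    for chunk in chunks:
        if all(cosine(chunk["vec"], other["vec"]) < threshold for other in kept):
            kept.append(chunk)
    return kept

def keep_relevant(chunks, query_vec, k=16, similarity_threshold=0.5):
    """Step 3: keep at most k chunks that meet the threshold, most relevant first."""
    scored = [(cosine(query_vec, c["vec"]), c) for c in chunks]
    scored = [(s, c) for s, c in scored if s >= similarity_threshold]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

def reorder_for_long_context(chunks):
    """Step 4: given chunks sorted most-relevant-first, move the strongest
    chunks to both ends of the list and the weakest to the middle."""
    reordered = []
    for i, chunk in enumerate(reversed(chunks)):
        if i % 2 == 1:
            reordered.append(chunk)
        else:
            reordered.insert(0, chunk)
    return reordered

# Hypothetical pre-embedded chunks; "A_dup" nearly repeats "A", "D" is off-topic
query_vec = [1.0, 0.0, 0.0]
chunks = [
    {"name": "A", "vec": [1.0, 0.0, 0.0]},
    {"name": "A_dup", "vec": [0.99, 0.05, 0.0]},
    {"name": "B", "vec": [0.8, 0.6, 0.0]},
    {"name": "C", "vec": [0.6, 0.0, 0.8]},
    {"name": "D", "vec": [0.0, 0.6, 0.8]},
    {"name": "E", "vec": [0.7, 0.5, 0.5]},
]

unique = drop_redundant(chunks)
relevant = keep_relevant(unique, query_vec, k=16, similarity_threshold=0.5)
final = reorder_for_long_context(relevant)

print([c["name"] for c in unique])    # ['A', 'B', 'C', 'D', 'E']
print([c["name"] for c in relevant])  # ['A', 'B', 'E', 'C']
print([c["name"] for c in final])     # ['B', 'C', 'E', 'A']
```

The near-duplicate is removed first, the off-topic chunk fails the relevance threshold, and the final ordering puts the two strongest chunks (`A` and `B`) at the ends of the list with the weaker ones in between.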

---

### Benefits of Using Contextual Compression

- **Reduces token usage** → lower LLM costs.  
- **Removes irrelevant content** → improves answer quality.  
- **Maintains important context** while keeping documents concise.  
- **Supports high-performance RAG pipelines** by prioritizing relevant information.

