# 🧠 End-to-End Semantic Search with ChromaDB, LangChain, and Embeddings

## 📚 Overview

This notebook provides a **step-by-step guide** to building a semantic search pipeline using **ChromaDB** and **LangChain**.  
You'll learn how to:

- Load and preprocess text data
- Split documents into manageable chunks
- Generate vector embeddings using Ollama models
- Store and index embeddings in ChromaDB
- Perform fast similarity search over your data
- Persist and reload your vector database for production use

Whether you're building a **Retrieval-Augmented Generation (RAG)** system, a semantic search engine, or exploring vector databases, this notebook demonstrates the core fundamentals you need.

In [11]:
# Install all required dependencies for this notebook
!pip install langchain langchain-chroma langchain-community chromadb python-dotenv



# 🦙 Setting Up Ollama for Embeddings

To use **OllamaEmbeddings** in this notebook, you must have [Ollama](https://ollama.com/) installed and running on your computer, and the `nomic-embed-text` model downloaded.

---

## 🛠️ Step-by-Step Installation Guide

### 1. **Install Ollama**

Choose your operating system and follow the instructions:

- **Windows:**  
  Download and install from [Ollama for Windows](https://ollama.com/download).

- **macOS:**  
  Download and install from [Ollama for macOS](https://ollama.com/download).

- **Linux:**  
  Open your terminal and run:
  ```bash
  curl -fsSL https://ollama.com/install.sh | sh
  ```

### 2. **Start the Ollama Service**

After installing, you need to start the Ollama server so it can process embedding requests.  
**Open a new terminal window** (Command Prompt, PowerShell, or Terminal) and run:

```bash
ollama serve
```

Leave this terminal window open and running in the background while you use the notebook.

### 3. **Download the `nomic-embed-text` Model**

`ollama serve` keeps its terminal busy, so open a **second terminal window** (leave the first one running) and run:

```bash
ollama pull nomic-embed-text
```

This command downloads the embedding model required for this notebook.
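
Before moving on, it can help to confirm from Python that the server is reachable and the model is present. The sketch below is one possible check, not part of the original notebook. It assumes Ollama is listening on its default address `http://localhost:11434` and that the `/api/tags` endpoint returns a JSON object with a `models` list; the helper name `list_ollama_models` is made up for this example.

```python
import json
import urllib.error
import urllib.request

# Assumption: Ollama's default address; adjust if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434"


def list_ollama_models(base_url: str = OLLAMA_URL, timeout: float = 2.0):
    """Return the names of locally available models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a response that is not valid JSON
        return None
    return [model["name"] for model in payload.get("models", [])]


models = list_ollama_models()
if models is None:
    print("Ollama is not reachable; is `ollama serve` running?")
elif not any(name.startswith("nomic-embed-text") for name in models):
    print("Server is up, but nomic-embed-text is missing; run `ollama pull nomic-embed-text`.")
else:
    print("Ollama is running and nomic-embed-text is available.")
```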

---

## 🔗 Helpful Resources

- [Ollama Documentation](https://ollama.com/docs)
- [Ollama Models Library](https://ollama.com/library)

---

> **Note for Beginners:**  
> - Always keep the Ollama server running in the background while working with embeddings in this notebook.
> - If you close the terminal running `ollama serve`, embeddings will not work until you start it again.

In [2]:
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
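# Note: newer LangChain releases also provide OllamaEmbeddings in the separate
# `langchain-ollama` package; the community import below may emit a deprecation warning.
from langchain_community.embeddings import OllamaEmbeddings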
from langchain_text_splitters import RecursiveCharacterTextSplitter



# 📄 TextLoader

## 🔍 Definition

**TextLoader** is a utility from LangChain that allows you to **load and read text files** into your workflow as documents.  
It is commonly used to ingest raw text data for further processing, such as splitting, embedding, or indexing.

---

## ⚙️ Key Features

- Reads plain text files into LangChain document objects
- Handles plain text only; other file formats (such as PDF or CSV) need their own LangChain loaders
- Easy integration with downstream LangChain components
- Useful for preparing data for vector databases or LLM pipelines
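
To make the "document object" idea concrete, here is a rough standard-library sketch of what loading a text file amounts to: read the file and wrap its contents together with a `source` entry in the metadata. `SimpleDocument` and `load_text_file` are hypothetical stand-ins written for this illustration; they are not LangChain's `Document` or `TextLoader`.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read one text file and return it as a single-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Small demonstration with a temporary file
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("Hello, LangChain!", encoding="utf-8")
    docs = load_text_file(str(sample))
    print(docs[0].page_content)  # Hello, LangChain!
```

The real `TextLoader` also accepts an `encoding` argument, which is worth setting explicitly if a file fails to load with the platform's default encoding.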

In [6]:
loader = TextLoader("speech.txt")
data = loader.load()
data

[Document(metadata={'source': 'speech.txt'}, page_content='\n    "Large Language Models (LLMs) represent a significant advancement in artificial intelligence, "\n    "enabling machines to comprehend and generate human language with remarkable accuracy. "\n    "Trained on vast amounts of textual data, these models are capable of performing a wide range of natural language processing tasks, "\n    "such as translation, summarization, question answering, and content creation. "\n    "The core technology behind LLMs is deep learning, particularly transformer architectures, "\n    "which allow the models to capture intricate patterns and relationships within language. "\n    "Prominent examples of LLMs include OpenAI\'s GPT series and Google\'s BERT and LaMDA models. "\n    "These systems have revolutionized the way we interact with technology, making it possible for computers to understand context, "\n    "generate coherent responses, and assist users in various domains. "\n    "As LLMs co

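# ✂️ RecursiveCharacterTextSplitter

## 🔍 Definition

**RecursiveCharacterTextSplitter** is a tool for **splitting large documents into smaller chunks**. It measures chunk size in characters and splits recursively on a list of separators (paragraph breaks first, then line breaks, then spaces), so paragraphs and sentences stay together where possible.  
This is essential for processing long texts with LLMs and embedding models, which have input size limits.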

---

## ⚙️ Key Features

- Splits text into manageable chunks (e.g., 500 characters)
- Supports overlap between chunks for context preservation
- Works on raw strings (`split_text`) and on lists of documents (`split_documents`)
- Improves retrieval and embedding quality in downstream tasks
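
The "recursive" part can be illustrated with a simplified sketch: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. The function below is a teaching approximation with no overlap support, not the library's implementation. It borrows the splitter's default separator order (`"\n\n"`, `"\n"`, `" "`, `""`); the name `recursive_split` is made up for this example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (simplified sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # Separator not present; try the next, finer one
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # The piece alone is too long: split it with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee", chunk_size=7))
# ['aaa bbb', 'ccc ddd', 'eee']
```

The first paragraph fits in one chunk and is kept whole; the second is too long, so it is split again on spaces.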

In [7]:
# Split the loaded documents into chunks of at most 500 characters, with no overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)
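
The cell above sets `chunk_overlap=0`, so neighbouring chunks share no text. To see what a non-zero overlap does, here is a small standalone sketch using fixed-size character windows. It illustrates the idea only and is not how LangChain computes overlap internally; `chunk_with_overlap` is a name invented for this example.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the one before it, so a sentence cut at a chunk boundary still appears intact in at least one chunk. The cost is more chunks to embed and store.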

# 🧬 OllamaEmbeddings

## 🔍 Definition

**OllamaEmbeddings** is an embedding model integration for LangChain that generates **vector representations** (embeddings) from text.  
These embeddings capture the semantic meaning of text, enabling similarity search and retrieval tasks.

---

## ⚙️ Key Features

- Converts text into high-dimensional vectors
- Supports various models (e.g., `nomic-embed-text`)
- Integrates with vector databases like ChromaDB
- Essential for semantic search and RAG workflows
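
LangChain embedding classes expose two methods: `embed_documents` for a list of texts and `embed_query` for a single query string. The helper below is a small sketch for inspecting what they return; `describe_embeddings` is a hypothetical name, and the function accepts any object with those two methods so it can be tried without a running server.

```python
def describe_embeddings(embedder, texts, query):
    """Embed some texts and a query, then report how many vectors came back and their size."""
    doc_vectors = embedder.embed_documents(texts)  # one vector (list of floats) per text
    query_vector = embedder.embed_query(query)     # a single vector for the query
    return {
        "num_document_vectors": len(doc_vectors),
        "dimensions": len(query_vector),
    }
```

With Ollama running, this could be called as `describe_embeddings(OllamaEmbeddings(model="nomic-embed-text"), ["LLMs use transformers."], "What is an LLM?")`. Every vector from the same model has the same number of dimensions, which is what makes them comparable.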

# 📦 ChromaDB

## 🔍 Definition

**ChromaDB** is an **open-source vector database** designed for storing, indexing, and querying **embeddings** generated from text, images, or other unstructured data.  
It is widely used in **LLM-powered applications** such as Retrieval-Augmented Generation (RAG), semantic search, and question-answering systems.

ChromaDB is known for being **lightweight, fast, and easy to use**, especially for local and small-to-medium scale AI projects.

---

## ⚙️ Key Features

- Stores vector embeddings with metadata
- Performs semantic similarity search
- Works locally (no cloud dependency)
- Integrates seamlessly with LangChain
- Supports persistent storage
- Optimized for AI and LLM workflows

# 🗃️ Chroma.from_documents

## 🔍 Definition

`Chroma.from_documents` is a method to **create a ChromaDB vector store** directly from a list of documents and an embedding model. It embeds each document with that model and indexes the resulting vectors.  
This enables fast and efficient similarity search over your data.

---

## ⚙️ Key Features

- Indexes documents with their embeddings
- Supports in-memory and persistent storage
- Enables fast semantic search and retrieval
- Integrates seamlessly with LangChain pipelines

In [9]:
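# Embedding model served by the local Ollama instance
embedding = OllamaEmbeddings(model="nomic-embed-text")

# Embed every chunk and index it in an in-memory Chroma collection
vectordb = Chroma.from_documents(splits, embedding=embedding)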
vectordb

<langchain_chroma.vectorstores.Chroma at 0x267ec26f3e0>

# 🔎 Similarity Search

## 🔍 Definition

**Similarity search** is the process of finding documents in a vector database that are **most similar** to a given query, based on their embeddings.  
This is a core technique in semantic search and retrieval-augmented generation (RAG).

---

## ⚙️ Key Features

- Finds relevant documents using vector distance (e.g., cosine similarity)
- Enables question-answering over large text corpora
- Powers intelligent search in LLM applications
- Fast and scalable with vector databases like ChromaDB
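
As a rough illustration of what "most similar" means, the sketch below ranks a few hand-written toy vectors by cosine similarity to a query vector. It is a brute-force comparison in plain Python; a vector database relies on an index rather than comparing against every stored vector, and may use a different distance metric. The vectors and function names are invented for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, stored, k=2):
    """Return the k (score, text) pairs whose vectors are closest to the query."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy two-dimensional "embeddings"; real ones have hundreds of dimensions
stored = [
    ("cats", [1.0, 0.0]),
    ("dogs", [0.9, 0.1]),
    ("stocks", [0.0, 1.0]),
]
for score, text in rank_by_similarity([1.0, 0.0], stored):
    print(f"{text}: {score:.3f}")
```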

In [10]:
# Query the vector store; similarity_search returns the 4 closest chunks by default

query = "What is core technology behind LLMs ?"
docs = vectordb.similarity_search(query)
docs[0].page_content

'"The core technology behind LLMs is deep learning, particularly transformer architectures, "\n    "which allow the models to capture intricate patterns and relationships within language. "\n    "Prominent examples of LLMs include OpenAI\'s GPT series and Google\'s BERT and LaMDA models. "\n    "These systems have revolutionized the way we interact with technology, making it possible for computers to understand context, "\n    "generate coherent responses, and assist users in various domains. "'
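
To control how many chunks come back and to see how close each one is, the vector store also offers `similarity_search_with_score`, which returns `(document, score)` pairs. The helper below is a small sketch around that call; `top_matches` is a hypothetical name. With Chroma the score is a distance, so lower values mean closer matches.

```python
def top_matches(vectordb, query, k=2, preview_chars=80):
    """Return (score, text preview) pairs for the k chunks closest to the query."""
    results = vectordb.similarity_search_with_score(query, k=k)
    return [(score, doc.page_content[:preview_chars]) for doc, score in results]
```

In the notebook this could be used as `top_matches(vectordb, query, k=3)`, then printing each pair to compare scores across the returned chunks.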

# 💾 Persisting ChromaDB

## 🔍 Definition

**Persisting** a ChromaDB instance means saving the vector database to disk, so it can be **reloaded and reused** later without rebuilding the index.  
This is crucial for production applications and large datasets.

---

## ⚙️ Key Features

- Saves vector data and metadata to a specified directory
- Enables fast reloads and avoids recomputation
- Supports long-term storage for AI workflows
- Simple API: just set `persist_directory` when creating the database

In [8]:
# Build the vector store again, this time persisting it to ./chroma.db on disk

vectordb = Chroma.from_documents(splits, embedding=embedding, persist_directory="./chroma.db")
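
Reloading is the other half of persistence. One way to reopen the saved store in a later session is sketched below. It assumes the same `./chroma.db` directory, the same embedding model used when the store was built, and the default collection name on both sides. `reload_vectordb` is a hypothetical helper name.

```python
def reload_vectordb(persist_directory: str = "./chroma.db", model: str = "nomic-embed-text"):
    """Reopen a persisted Chroma store without re-embedding the documents."""
    # Imported inside the function so that defining it does not require the packages.
    from langchain_chroma import Chroma
    from langchain_community.embeddings import OllamaEmbeddings

    # Queries must be embedded with the same model that produced the stored vectors
    embedding = OllamaEmbeddings(model=model)
    return Chroma(persist_directory=persist_directory, embedding_function=embedding)
```

After `vectordb = reload_vectordb()`, calls such as `vectordb.similarity_search(query)` work as before, provided the Ollama server is running to embed the query.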