# Unit 4

## Chunking and Storing Text for Efficient LLM Processing with Chroma DB

-----

## Introduction and Context Setting

Welcome to the final lesson of our course on data processing for Large Language Models (LLMs). In previous lessons, you've learned about chunking text, advanced chunking techniques, and storing text chunks in JSONL format. Now, we'll focus on using the **Chroma DB** library to efficiently store and retrieve text chunks. This lesson will integrate these concepts, allowing you to handle text data effectively in LLM applications.

LLMs can only process a fixed number of tokens at once, so long inputs are either truncated or must be split. **Chunking** text into smaller parts helps maintain context and ensures that important information isn't lost. In this lesson, we'll explore how to use **vector embeddings** and the Chroma DB library to store and retrieve these chunks efficiently.

-----

## Recall: Basics of Text Embeddings

Before we dive into the new material, let's briefly recall what **text embeddings** are. Text embeddings are numerical representations of text that capture semantic meaning. They allow us to perform operations like similarity searches, which are crucial for retrieving relevant information from large datasets.

In previous lessons, you learned about tokenization and basic vector operations. These concepts are foundational for understanding how embeddings work. Remember, embeddings map text to vectors so that semantically similar passages end up close together, which is what makes similarity search possible.
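To make "similarity" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors are hypothetical values chosen for illustration; they were not produced by any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (real models use hundreds of dimensions)
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

sim_cat_kitten = cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"])
sim_cat_invoice = cosine_similarity(toy_vectors["cat"], toy_vectors["invoice"])

print(f"cat vs kitten:  {sim_cat_kitten:.3f}")
print(f"cat vs invoice: {sim_cat_invoice:.3f}")
```

Vectors that point in nearly the same direction score close to 1, while unrelated ones score close to 0. A vector database applies the same idea at scale.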

-----

## Loading and Using an Embedding Model

To work with text embeddings, we need an embedding model. We'll use the **Sentence Transformers** library (the `sentence_transformers` package), whose `SentenceTransformer` class loads pre-trained models for generating embeddings.

First, let's load the "all-MiniLM-L6-v2" model:

```python
from sentence_transformers import SentenceTransformer

# Load embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
```

In this code snippet, we import the `SentenceTransformer` class and load the "all-MiniLM-L6-v2" model. This model is pre-trained to generate embeddings for sentences, making it ideal for our task.

-----

## Chunking Text for LLMs

Chunking is the process of splitting long documents into smaller, manageable parts. This is crucial for maintaining context when processing text with LLMs.

Consider the following text:

```python
documents = [
    "Large language models process text within a limited token window.",
    "If the input text is too long, models may truncate information.",
    "Splitting long documents into smaller chunks helps maintain relevant context.",
]
```

Here, we have a list of sentences, each representing a chunk of text. By breaking down the text into these smaller parts, we ensure that each chunk can be processed without losing important context.
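When the source is one long string rather than a ready-made list of sentences, a chunker has to produce the pieces. The following is a minimal sketch of word-based chunking with overlap. The function name `chunk_words` and its parameters `chunk_size` and `overlap` are illustrative assumptions, not part of any library.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that context is not
    cut off abruptly at a chunk boundary.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


long_text = (
    "Large language models process text within a limited token window. "
    "If the input text is too long, models may truncate information. "
    "Splitting long documents into smaller chunks helps maintain relevant context."
)

chunks = chunk_words(long_text, chunk_size=8, overlap=2)
for number, chunk in enumerate(chunks, start=1):
    print(f"Chunk {number}: {chunk}")
```

Splitting on words is only a rough stand-in for token counts. In practice you would size chunks with the tokenizer of the model you plan to use.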

-----

## Converting Text to Vector Embeddings

Now that we have our text chunks, let's convert them into vector embeddings using the loaded model:

```python
# Convert text to vector embeddings
embeddings = embedding_model.encode(documents)
```

In this step, we use the `encode` method of our embedding model to transform the list of text chunks into vector embeddings. By default, `encode` returns a NumPy array with one row per chunk; for `all-MiniLM-L6-v2`, each row has 384 dimensions. These embeddings are numerical representations that capture the semantic meaning of each chunk.

-----

## Storing and Retrieving Chunks with Chroma DB

With our embeddings ready, we can now store and retrieve them using the Chroma DB library. Chroma DB is an open-source vector database that stores embeddings alongside their source text and supports similarity search over them.

### Step 1: Initialize Chroma DB

First, initialize the Chroma DB client and create or load a collection:

```python
import chromadb

# Initialize ChromaDB client
client = chromadb.PersistentClient(path="./chroma_db")

# Create or load collection with a custom name
collection = client.get_or_create_collection(name="my_custom_collection")
```

**Explanation:** We import `chromadb` and initialize a persistent client, specifying the path where the database will be stored on disk. We then call `get_or_create_collection`, which returns the collection with the given name if it already exists and creates it otherwise. This collection will hold our vector embeddings.

### Step 2: Store Embeddings

Next, store the embeddings in Chroma DB:

```python
from chromadb.utils import embedding_functions

# Load embedding model
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)

# Create collection with embedding function
collection = client.get_or_create_collection(name="vector_collection", embedding_function=embed_func)

# Sample documents
documents = [
    {"id": "doc1", "content": "ChromaDB is an open-source vector database."},
    {"id": "doc2", "content": "Vector databases help in efficient similarity search."},
    {"id": "doc3", "content": "Embedding models convert text into numerical representations."}
]

# Insert documents into ChromaDB
collection.add(
    documents=[doc["content"] for doc in documents],
    ids=[doc["id"] for doc in documents]
)
```

**Explanation:** Reusing the client from Step 1, we wrap the embedding model in a `SentenceTransformerEmbeddingFunction` and create a collection that uses it. Because the collection has its own embedding function, `collection.add` computes the embeddings from the document text, so we don't pass precomputed vectors. Each document is stored with a unique ID and its content.

### Step 3: Retrieve Relevant Chunks

To retrieve relevant chunks for a query, perform a retrieval operation:

```python
# Perform a retrieval operation to verify everything works
query_text = "What is a vector database?"
results = collection.query(
    query_texts=[query_text],
    n_results=3
)

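# Display results (a smaller distance means a closer match)
for i, doc in enumerate(results["documents"][0]):
    print(f"Result {i+1}: {doc} (Distance: {results['distances'][0][i]})")
```

**Explanation:** We perform a retrieval operation by querying the collection with a text query. The `query` method embeds the query text and returns the closest text chunks along with their distances. These values are distances, not similarity scores, so a lower value means a closer match (by default, Chroma uses squared L2 distance). Each field in `results` holds one list per query, which is why we index with `[0]` to read the results for our single query. The output displays the top 3 results with their distances.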
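The nested-list shape of the query result is easier to see on a hand-written stand-in. The sketch below assumes a result dictionary with `ids`, `documents`, and `distances` keys, each holding one list per query, as in the lesson's code. The helper name `rank_results` is hypothetical and the distance values are made up for illustration.

```python
def rank_results(results, query_index=0):
    """Return (document, id, distance) tuples for one query, closest first."""
    docs = results["documents"][query_index]
    ids = results["ids"][query_index]
    distances = results["distances"][query_index]
    rows = list(zip(docs, ids, distances))
    return sorted(rows, key=lambda row: row[2])


# Hand-written stand-in for a query result; the distances are invented
sample_results = {
    "ids": [["doc2", "doc1", "doc3"]],
    "documents": [[
        "Vector databases help in efficient similarity search.",
        "ChromaDB is an open-source vector database.",
        "Embedding models convert text into numerical representations.",
    ]],
    "distances": [[0.41, 0.58, 1.32]],
}

ranked = rank_results(sample_results)
for document, doc_id, distance in ranked:
    print(f"{doc_id} (distance {distance:.2f}): {document}")
```

Keeping the ID next to each document makes it straightforward to trace a retrieved chunk back to its source.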

-----

## Summary and Preparation for Practice

In this lesson, you've learned how to integrate chunking, embedding, and retrieval techniques using the Chroma DB library. By storing text chunks as vector embeddings, you can efficiently retrieve relevant information, making your LLM applications more effective.

Congratulations on reaching the end of the course! You've gained valuable skills in processing text for LLMs, from chunking and storing text to using advanced retrieval techniques. As you move forward, I encourage you to apply these skills in real-world applications and continue exploring the exciting field of natural language processing.

## Converting Text Chunks to Vector Embeddings

You've done well in understanding how to load and use an embedding model. Now, let's take it a step further.

Your task is to:

- Load the SentenceTransformer embedding model.
- Convert a list of text chunks into vector embeddings.

This exercise will solidify your understanding of integrating these steps. Dive in and see how efficiently you can transform text into embeddings!

```python
from sentence_transformers import SentenceTransformer

# TODO: Load embedding model

# Sample chunked text
documents = [
    "Large language models process text within a limited token window.",
    "If the input text is too long, models may truncate information.",
    "Splitting long documents into smaller chunks helps maintain relevant context.",
]

# TODO: Convert text to vector embeddings

print("Vector Embeddings:", embeddings)

```

```python
from sentence_transformers import SentenceTransformer

# Load embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Sample chunked text
documents = [
    "Large language models process text within a limited token window.",
    "If the input text is too long, models may truncate information.",
    "Splitting long documents into smaller chunks helps maintain relevant context.",
]

# Convert text to vector embeddings
embeddings = embedding_model.encode(documents)

print("Vector Embeddings:", embeddings)
```

## Initializing ChromaDB for Vector Storage

Nice job on converting text chunks to vector embeddings! Now, let's move forward and initialize a ChromaDB vector database.

Here's what you'll do:

- Import the necessary modules from `chromadb`.
- Initialize a `PersistentClient` for ChromaDB.
- Create or load a collection with a custom name.

This task will help you set up a vector database to store embeddings. Let's see how efficiently you can set this up!

```python
# TODO: Import the necessary modules from chromadb

# TODO: Initialize a PersistentClient for ChromaDB

# TODO: Create or load a collection with a custom name

# TODO: Print the name of the collection

# Hint: You can access the collection's name using collection.name

```

```python
import chromadb

# Initialize a PersistentClient for ChromaDB
client = chromadb.PersistentClient(path="./chroma_db")

# Create or load a collection with a custom name
collection = client.get_or_create_collection(name="my_custom_collection")

# Print the name of the collection
print("Collection Name:", collection.name)
```

## Storing Embeddings in ChromaDB

Excellent work setting up your ChromaDB vector database! Now it's time to put it to use by storing embeddings, a crucial part of our text processing workflow.

In this exercise, you'll work with a pre-initialized ChromaDB client and learn how to:

- Set up an embedding function for ChromaDB to process text chunks
- Create a collection that uses this embedding function
- Store text chunks as embeddings in the ChromaDB collection

This skill is essential for the text processing cycle for LLMs: chunking, embedding, and storing. Efficiently storing text as embeddings allows for powerful semantic search capabilities in modern LLM applications.

```python
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.utils import embedding_functions

# Load embedding model
model_name = "all-MiniLM-L6-v2"
embedding_model = SentenceTransformer(model_name)

# Initialize ChromaDB client
client = chromadb.PersistentClient(path="./chroma_db")

# TODO: Set up embedding function for ChromaDB
# Hint: Use SentenceTransformerEmbeddingFunction with the model name

# TODO: Create or get collection with embedding function
# Hint: Use client.get_or_create_collection and pass the embedding function

# Sample documents
documents = [
    {"id": "doc1", "content": "ChromaDB is an open-source vector database."},
    {"id": "doc2", "content": "Vector databases help in efficient similarity search."},
    {"id": "doc3", "content": "Embedding models convert text into numerical representations."}
]

# TODO: Add documents to the collection
# Hint: Use collection.add with document contents and their IDs

```

```python
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.utils import embedding_functions

# Load embedding model
model_name = "all-MiniLM-L6-v2"
embedding_model = SentenceTransformer(model_name)

# Initialize ChromaDB client
client = chromadb.PersistentClient(path="./chroma_db")

# Set up embedding function for ChromaDB
embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)

# Create or get collection with embedding function
collection = client.get_or_create_collection(name="vector_collection", embedding_function=embed_func)

# Sample documents
documents = [
    {"id": "doc1", "content": "ChromaDB is an open-source vector database."},
    {"id": "doc2", "content": "Vector databases help in efficient similarity search."},
    {"id": "doc3", "content": "Embedding models convert text into numerical representations."}
]

# Add documents to the collection
collection.add(
    documents=[doc["content"] for doc in documents],
    ids=[doc["id"] for doc in documents]
)

print("Documents successfully added to the collection.")
```

## Persisting Vector Databases for Production

Fantastic job setting up the environment for ChromaDB! Now, let's tackle a critical real-world requirement: performing a retrieval operation on your vector database.

In production applications, you don't want to rebuild your database from scratch every time. Instead, you'll want to persist it and reload it efficiently. We've set up the database for you, and your task is to perform a retrieval operation to verify everything works.

In this exercise, you'll query the collection with the following text:

```text
What is a vector database?
```

Retrieve the top 3 most relevant documents and print the results.

This skill completes your toolkit for working with vector databases in LLM applications, allowing you to build systems that can efficiently store and retrieve information across multiple sessions.

```python
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.utils import embedding_functions

# Define the persistence directory
persist_dir = "./chroma_db_persist"

# Load embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Initialize ChromaDB client with persistence directory
client = chromadb.PersistentClient(path=persist_dir)

# Load embedding model for ChromaDB
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)

# Create collection with embedding function
collection = client.get_or_create_collection(name="vector_collection", embedding_function=embed_func)

# Sample documents
documents = [
    {"id": "doc1", "content": "ChromaDB is an open-source vector database."},
    {"id": "doc2", "content": "Vector databases help in efficient similarity search."},
    {"id": "doc3", "content": "Embedding models convert text into numerical representations."}
]

# Insert documents into ChromaDB
collection.add(
    documents=[doc["content"] for doc in documents],
    ids=[doc["id"] for doc in documents]
)

print(f"\nDatabase loaded from {persist_dir}")
print("Performing a retrieval operation...\n")

# TODO: Perform a retrieval operation by querying:
# "What is a vector database?"
# Retrieve the top 3 most relevant documents and print the results.
```

```python
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.utils import embedding_functions

# Define the persistence directory
persist_dir = "./chroma_db_persist"

# Load embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Initialize ChromaDB client with persistence directory
client = chromadb.PersistentClient(path=persist_dir)

# Load embedding model for ChromaDB
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embed_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)

# Create collection with embedding function
collection = client.get_or_create_collection(name="vector_collection", embedding_function=embed_func)

# Sample documents
documents = [
    {"id": "doc1", "content": "ChromaDB is an open-source vector database."},
    {"id": "doc2", "content": "Vector databases help in efficient similarity search."},
    {"id": "doc3", "content": "Embedding models convert text into numerical representations."}
]

# Insert documents into ChromaDB
collection.add(
    documents=[doc["content"] for doc in documents],
    ids=[doc["id"] for doc in documents]
)

print(f"\nDatabase loaded from {persist_dir}")
print("Performing a retrieval operation...\n")

# Perform a retrieval operation by querying:
# "What is a vector database?"
# Retrieve the top 3 most relevant documents and print the results.
query_text = "What is a vector database?"
results = collection.query(
    query_texts=[query_text],
    n_results=3
)

# Display results
for i, doc in enumerate(results["documents"][0]):
    print(f"Result {i+1}: {doc}")
```