**This tutorial shows how to chunk a text file, convert the chunks into embeddings, add them to ChromaDB, and query them.**

*Step 1: Install Libraries*

In [None]:
!pip install chromadb


In [None]:
!pip install langchain langchain-community
!pip install sentence-transformers


*Step 2: Import libraries*

In [None]:
from langchain_community.vectorstores import Chroma  # LangChain wrapper; the steps below call chromadb directly
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.embeddings import HuggingFaceEmbeddings

*Step 3: Access the file*

In [None]:
file_path = "/content/AI.txt"

*Step 4: Define a function to read and chunk the file*

In [None]:
def process_file(file_path, chunk_size=1000, chunk_overlap=200):
    """Read a text file and split it into overlapping Document chunks."""
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()

    # Initialise the text splitter; separators are tried in order,
    # from paragraph breaks down to single characters
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", " ", ""])
    chunks = splitter.split_text(text)

    # Wrap each chunk in a Document: a structured container holding the
    # chunk text and its metadata (extra information about the chunk)
    documents = [Document(page_content=chunk, metadata={"chunk_id": i}) for i, chunk in enumerate(chunks)]

    print(f"Created {len(documents)} document chunks")
    return documents
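
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch that slices text with a fixed-size sliding window. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter first tries to break on the separators listed above, which is why actual chunks can be much shorter than `chunk_size`. The function name `sliding_window_chunks` and the demo string are assumptions made for illustration.

```python
def sliding_window_chunks(text, chunk_size=10, chunk_overlap=3):
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


demo_chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.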

*Step 5: Process the file*

In [None]:
docs = process_file(file_path, chunk_size=500, chunk_overlap=100)

Created 26 document chunks


*Step 6: View the chunks*

In [None]:
for i, doc in enumerate(docs):
    print(f"--- Chunk {i+1} ---")
    print(doc.page_content)

--- Chunk 1 ---
Title: A Comprehensive Guide to Artificial Intelligence (AI). 

Content: Introduction to Artificial Intelligence
--- Chunk 2 ---
Artificial Intelligence (AI) refers to the branch of computer science that is focused on creating machines capable of performing tasks that typically require human intelligence. These tasks include reasoning, learning, problem-solving, perception, and language understanding. Over the last few decades, AI has evolved from theoretical concepts into practical technologies that are reshaping industries globally.
--- Chunk 3 ---
The ultimate goal of AI is to develop systems that can think, learn, and adapt to new situations, similarly to the way humans do.
--- Chunk 4 ---
2. History and Evolution of AI
--- Chunk 5 ---
Early Beginnings (1950s-1960s):
The roots of AI can be traced back to the 1950s, with pioneering work from Alan Turing. His famous Turing Test, proposed in 1950, aimed to evaluate a machine's ability to exhibit intelligent behavior eq

*Step 7: Initialize embedding model*

In [None]:
def initialize_embeddings():
    """Initialize HuggingFace embeddings model"""
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",  # Fast and good quality
        model_kwargs={'device': 'cpu'}
    )
    return embeddings

*Step 8: Define function to generate embeddings*

In [None]:
# Generate embeddings for document chunks
def generate_embeddings(docs, embeddings_model):
    """Generate embeddings for all document chunks"""

    # Extract text from documents
    texts = [doc.page_content for doc in docs]

    # Generate embeddings
    print("Generating embeddings...")
    embeddings = embeddings_model.embed_documents(texts)

    print(f"Generated embeddings for {len(embeddings)} chunks")
    print(f"Embedding dimension: {len(embeddings[0])}")

    return embeddings

*Step 9: Define function to combine embeddings with metadata*

In [None]:
# Create embeddings with metadata
def create_embeddings_with_metadata(docs, embeddings):
    """Combine embeddings with document metadata."""
    # Metadata matters once chunks come from multiple files:
    # it lets us filter results when querying the database

    embeddings_data = []
    for doc, embedding in zip(docs, embeddings):
        embeddings_data.append({
            'text': doc.page_content,
            'embedding': embedding,
            'metadata': doc.metadata
        })

    return embeddings_data
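
As a rough illustration of why metadata is worth keeping, the sketch below filters records that have the same shape as `embeddings_data`. The `source` key, the file names, and the helper `filter_by_metadata` are hypothetical; this tutorial only stores `chunk_id`. In ChromaDB a comparable filter can be passed through the `where` argument of `collection.query`.

```python
def filter_by_metadata(records, key, value):
    """Return only the records whose metadata[key] equals value."""
    return [record for record in records if record["metadata"].get(key) == value]


# Toy records shaped like embeddings_data; the vectors and sources are made up
toy_records = [
    {"text": "AI overview", "embedding": [0.1, 0.2], "metadata": {"chunk_id": 0, "source": "AI.txt"}},
    {"text": "Baking bread", "embedding": [0.9, 0.4], "metadata": {"chunk_id": 0, "source": "recipes.txt"}},
    {"text": "History of AI", "embedding": [0.2, 0.1], "metadata": {"chunk_id": 1, "source": "AI.txt"}},
]

ai_only = filter_by_metadata(toy_records, "source", "AI.txt")
print([record["text"] for record in ai_only])
# ['AI overview', 'History of AI']
```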

*Step 10: Run the embedding pipeline*

In [None]:
# Main function to process documents to embeddings
def documents_to_embeddings(docs):

    # Initialize embeddings model
    embeddings_model = initialize_embeddings()

    # Generate embeddings
    embeddings = generate_embeddings(docs, embeddings_model)

    # Combine with metadata
    embeddings_data = create_embeddings_with_metadata(docs, embeddings)

    return embeddings_data, embeddings_model

In [None]:
# Usage with documents
print("Converting document chunks to embeddings...")
embeddings_data, embeddings_model = documents_to_embeddings(docs)

Converting document chunks to embeddings...


Generating embeddings...
Generated embeddings for 26 chunks
Embedding dimension: 384


*Step 11: View embedding results*

In [None]:
print("\nEmbedding Results:")
print(f"Total embeddings: {len(embeddings_data)}")
print(f"Embedding dimension: {len(embeddings_data[0]['embedding'])}")


Embedding Results:
Total embeddings: 26
Embedding dimension: 384


*Step 12: View Sample embeddings*

In [None]:
sample = embeddings_data[0]
print("\nSample embedding:")
print(f"Text preview: {sample['text'][:200]}...")
print(f"Metadata: {sample['metadata']}")
print(f"Embedding shape: {len(sample['embedding'])}")
print(f"First 5 embedding values: {sample['embedding'][:5]}")


Sample embedding:
Text preview: Title: A Comprehensive Guide to Artificial Intelligence (AI). 

Content: Introduction to Artificial Intelligence...
Metadata: {'chunk_id': 0}
Embedding shape: 384
First 5 embedding values: [0.024591200053691864, -0.009280803613364697, 0.02138330042362213, 0.010370465926826, -0.044209688901901245]
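
The individual numbers in an embedding mean little on their own; what matters is how close two vectors are to each other. One common way to compare them is cosine similarity. The sketch below uses made-up 3-dimensional vectors rather than real 384-dimensional model output, and the variable names are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of three short texts
vec_ai = [1.0, 0.0, 1.0]
vec_ml = [1.0, 0.0, 0.9]       # points almost the same way as vec_ai
vec_cooking = [0.0, 1.0, 0.0]  # orthogonal to vec_ai

sim_close = cosine_similarity(vec_ai, vec_ml)
sim_far = cosine_similarity(vec_ai, vec_cooking)
print(f"AI vs ML: {sim_close:.3f}, AI vs cooking: {sim_far:.3f}")
```

With real embeddings, the same function can be applied to any two entries of `embeddings_data` to check how related their chunks are.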


*Step 13: Add the embeddings to ChromaDB*

In [None]:
import chromadb

In [None]:
# Create ChromaDB client
client = chromadb.Client()  # In-memory client: the entry point to the database engine

# Get or create a collection
collection = client.get_or_create_collection(name="my_documents")  # A collection is the named space in the database where the data is stored

In [None]:
# Prepare data for ChromaDB
texts = [item['text'] for item in embeddings_data]
embeddings = [item['embedding'] for item in embeddings_data]
metadatas = [item['metadata'] for item in embeddings_data]
ids = [f"chunk_{i}" for i in range(len(embeddings_data))]  # Unique IDs

In [None]:
# Add to ChromaDB. A collection stores each chunk's text, its embedding,
# its metadata and its unique ID together.
collection.add(
    documents=texts,
    embeddings=embeddings,
    metadatas=metadatas,
    ids=ids
)
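
The four lists passed to `collection.add` are parallel: position `i` in each list describes the same chunk, and the IDs must be unique. A small sanity check before adding can catch misaligned data early. The helper `check_batch` below is a hypothetical sketch, not part of ChromaDB, and the toy values are made up.

```python
def check_batch(ids, documents, embeddings, metadatas):
    """Raise ValueError if the parallel lists are misaligned; return True otherwise."""
    lengths = {len(ids), len(documents), len(embeddings), len(metadatas)}
    if len(lengths) != 1:
        raise ValueError("ids, documents, embeddings and metadatas must have the same length")
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    if len({len(vector) for vector in embeddings}) > 1:
        raise ValueError("all embeddings must have the same dimension")
    return True


# Toy batch with two chunks and 2-dimensional vectors
batch_ok = check_batch(
    ids=["chunk_0", "chunk_1"],
    documents=["first chunk", "second chunk"],
    embeddings=[[0.1, 0.2], [0.3, 0.4]],
    metadatas=[{"chunk_id": 0}, {"chunk_id": 1}],
)
print(batch_ok)
```

In this notebook the equivalent call would be `check_batch(ids, texts, embeddings, metadatas)` just before `collection.add`.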

*Step 14: Query ChromaDB*

In [None]:
# Query Chroma with a test embedding
query_text = "What is AI?"
query_embedding = embeddings_model.embed_query(query_text)

In [None]:
# Retrieve relevant vectors for the given query
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3) # Top 3 closest vectors for the given query vector

In [None]:
# Print the results
print("Query Results:")
# results["documents"] is a list of lists: one inner list of matching texts per query.
# We sent a single query, so results["documents"][0] holds its top matches.
for doc in results["documents"][0]:
    print("-", doc)

Query Results:
- Artificial Intelligence (AI) refers to the branch of computer science that is focused on creating machines capable of performing tasks that typically require human intelligence. These tasks include reasoning, learning, problem-solving, perception, and language understanding. Over the last few decades, AI has evolved from theoretical concepts into practical technologies that are reshaping industries globally.
- The ultimate goal of AI is to develop systems that can think, learn, and adapt to new situations, similarly to the way humans do.
- Manufacturing and Robotics:
AI is used in robotics for tasks such as assembly, quality control, and predictive maintenance. Industrial robots can perform repetitive tasks with precision, improving efficiency and reducing human error.

Retail and E-Commerce:
AI is used in recommendation systems to suggest products to customers based on their past behavior and preferences. Chatbots assist customers in online shopping, while AI is also 
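
Conceptually, a query ranks the stored vectors by their distance to each query vector and returns one list of matches per query, which explains the nested shape of `results["documents"]`. The sketch below does this by brute force with squared Euclidean distance on made-up 2-dimensional vectors. It is only an illustration: ChromaDB relies on an approximate nearest-neighbour index rather than a full scan, and `brute_force_query` is a hypothetical helper.

```python
def brute_force_query(query_embeddings, stored_embeddings, stored_texts, n_results=2):
    """Return the n_results closest texts for every query vector, as a list of lists."""
    all_matches = []
    for query in query_embeddings:
        scored = []
        for vector, text in zip(stored_embeddings, stored_texts):
            distance = sum((q - v) ** 2 for q, v in zip(query, vector))
            scored.append((distance, text))
        scored.sort(key=lambda pair: pair[0])  # smallest distance first
        all_matches.append([text for _, text in scored[:n_results]])
    return {"documents": all_matches}


# Toy data standing in for stored chunks and their embeddings
toy_texts = ["about AI", "about robots", "about cooking"]
toy_vectors = [[1.0, 0.0], [0.8, 0.3], [0.0, 1.0]]

toy_results = brute_force_query([[0.9, 0.1]], toy_vectors, toy_texts, n_results=2)
print(toy_results["documents"][0])
# ['about AI', 'about robots']
```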