# Loading Data into Chroma

The goal of this notebook is to load the WildChat data into ChromaDB so I can start experimenting with it.

## Environment Variables

This notebook reads sensitive credentials, such as API keys and database identifiers, from a `.env` file. The file contains key-value pairs that are loaded as environment variables, which keeps secrets out of your code and version control.

**Setup:**
1. Copy `.env.example` to `.env`
2. Fill in your actual credentials in the `.env` file
3. Make sure `.env` is listed in your `.gitignore` so credentials are never committed by accident

**Why use .env files?**
- Keeps secrets out of source code
- Easy to manage different environments (dev, staging, prod)
- Prevents accidentally sharing credentials in notebooks
- Makes collaboration easier: everyone uses their own credentials
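
A missing or empty variable is easier to diagnose before the client is created than from a failed request. The sketch below is one way to check for that, assuming the three variable names this notebook reads (`CHROMA_API_KEY`, `CHROMA_TENANT`, `CHROMA_DATABASE`). The helper `require_env` is hypothetical and not part of `python-dotenv` or ChromaDB.

```python
import os

def require_env(names, env=None):
    """Return {name: value} for each name; raise if any is missing or empty.

    `env` defaults to os.environ. Passing a plain dict makes the check easy
    to try without touching the real environment.
    """
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}

# Example with an explicit mapping (placeholder values, not real credentials)
settings = require_env(
    ["CHROMA_API_KEY", "CHROMA_TENANT", "CHROMA_DATABASE"],
    env={"CHROMA_API_KEY": "k", "CHROMA_TENANT": "t", "CHROMA_DATABASE": "d"},
)
print(sorted(settings))
```

In the notebook, the same call without the `env` argument would run after `load_dotenv()`, so the values come from the `.env` file.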

In [22]:
# Configuration Variables
# ======================
# All configuration parameters consolidated in one place for easy modification

# Dataset Configuration
DATASET_SIZE = "train[:50]"  # Split slice: the first 50 samples of WildChat-1M
RANDOM_SEED = 42  # For reproducibility
MIN_MESSAGE_LENGTH = 10  # Minimum character length for messages to be included

In [23]:
import os
from dotenv import load_dotenv
import chromadb
from datasets import load_dataset

# Load environment variables from .env file
load_dotenv()

# Initialize ChromaDB client using environment variables
client = chromadb.CloudClient(
    api_key=os.getenv('CHROMA_API_KEY'),
    tenant=os.getenv('CHROMA_TENANT'),
    database=os.getenv('CHROMA_DATABASE')
)

print("ChromaDB client initialized successfully!")
print(f"Connected to database: {os.getenv('CHROMA_DATABASE')}")
print(f"Tenant: {os.getenv('CHROMA_TENANT')}")


INFO:chromadb.telemetry.product.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
INFO:httpx:HTTP Request: GET https://api.trychroma.com:8000/api/v2/auth/identity "HTTP/1.1 200 OK"
INFO:chromadb.telemetry.product.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
INFO:httpx:HTTP Request: GET https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0 "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m "HTTP/1.1 200 OK"
ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given


ChromaDB client initialized successfully!
Connected to database: wild-chat-1m
Tenant: 13e7be1a-7c4c-4526-a2af-ca891a0031e0


In [24]:
# Import our custom data loader from the local utils directory
from utils.dataloader import WildChatDataLoader

import pandas as pd

def load_data(limit=5):
    """Return a generator of WildChat conversations from our custom data loader."""
    loader = WildChatDataLoader()
    # stream_conversations returns a generator, not a DataFrame
    conversations = loader.stream_conversations(limit=limit)
    return conversations

load_data(10)

<generator object WildChatDataLoader.stream_conversations at 0x14e172400>
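
The output above is a generator object, so nothing is displayed until it is iterated. A small sketch of how to peek at the first few items with `itertools.islice` follows. `fake_conversations` is a hypothetical stand-in for `load_data()`, and its field names are borrowed from the keys used later in this notebook.

```python
from itertools import islice

def fake_conversations(limit):
    """Hypothetical stand-in for load_data(): yields dicts lazily."""
    for i in range(limit):
        yield {"conversation_hash": f"hash-{i}", "first_message": f"message {i}"}

gen = fake_conversations(10)
preview = list(islice(gen, 3))  # consumes only the first three items
remaining = list(gen)           # the generator resumes where islice stopped
print(len(preview), len(remaining))
```

Because a generator can only be consumed once, it is usually simpler to call the loader again for the real write than to reuse a generator that has already been previewed.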

In [29]:
def write_streaming_to_chroma(
    collection_name="wildchat_streaming", 
    limit=100, 
    batch_size=50,
    filter_language="English",
    min_message_length=20
):
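    """
    Write conversations to ChromaDB in batches using the streaming data loader.
    
    This approach is intended to scale to the full 1M dataset by:
    1. Consuming conversations one at a time from the generator returned by load_data()
    2. Sending them to Chroma in batches of batch_size
    
    Note: filter_language and min_message_length are only logged below;
    this function does not apply them to the data.
    """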
    
    print(f"Starting streaming write to collection: {collection_name}")
    print(f"Filters: language={filter_language}, min_length={min_message_length}, limit={limit}")
    
    try:
        # Create/get collection
        collection = client.get_or_create_collection(
            name=collection_name,
            metadata={"description": "Streaming WildChat data"}
        )
        
        # Batch processing
        documents = []
        metadatas = []
        ids = []
        total_processed = 0
        
        for conversation in load_data(limit=limit):
            # Prepare document. The conversation hash is used as the ID;
            # it is not guaranteed to be unique across the dataset.
            doc_id = str(conversation['conversation_hash'])
            
            metadata = {
                "hash": conversation['conversation_hash'],
                "timestamp": str(conversation['timestamp']),
                "lang": conversation['language'],
                "model": conversation['model'],
                "length": conversation['conversation_length'],
                "type": "user_query"
            }
            
            documents.append(conversation['first_message'][:1000])
            metadatas.append(metadata)
            ids.append(doc_id)
            
            # Write batch when full
            if len(documents) >= batch_size:
                collection.add(
                    documents=documents,
                    metadatas=metadatas,
                    ids=ids
                )
                total_processed += len(documents)
                print(f"  Wrote batch: {total_processed} documents so far...")
                
                # Reset batch
                documents = []
                metadatas = []
                ids = []
        
        # Write final batch
        if documents:
            collection.add(
                documents=documents,
                metadatas=metadatas,
                ids=ids
            )
            total_processed += len(documents)
        
        print(f"✅ Successfully wrote {total_processed} conversations to {collection_name}")
        print(f"Collection now contains {collection.count()} total documents")
        
        return collection
        
    except Exception as e:
        print(f"❌ Error in streaming write: {e}")
        return None

# Test with a moderate dataset
print("Testing streaming write with 50 English conversations...")
streaming_collection = write_streaming_to_chroma(
    collection_name="wildchat_10k",
    limit=10000,  # Upper bound on the number of conversations requested from the loader
    filter_language="English",
    min_message_length=30
)

INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections "HTTP/1.1 200 OK"
INFO:utils.dataloader:Loading WildChat dataset: train[:10000]


Testing streaming write with 50 English conversations...
Starting streaming write to collection: wildchat_10k
Filters: language=English, min_length=30, limit=10000


INFO:utils.dataloader:Loaded 10000 conversations
INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 50 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 100 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 150 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 200 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 250 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 300 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 350 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 400 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 450 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 500 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 550 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 600 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 650 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 700 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 750 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 800 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 850 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 900 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 950 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 1000 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 1050 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 1100 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 1150 documents so far...


INFO:httpx:HTTP Request: POST https://api.trychroma.com:8000/api/v2/tenants/13e7be1a-7c4c-4526-a2af-ca891a0031e0/databases/wild-chat-1m/collections/905f3eaf-2220-41ab-95d7-d0709d6e4fe8/add "HTTP/1.1 201 Created"


  Wrote batch: 1200 documents so far...
❌ Error in streaming write: Expected IDs to be unique, found duplicates of: f5f01c95c79d12c5c72f87f63577b6dc in add.
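
The run stops after 1,200 documents because the same `conversation_hash` appears more than once and Chroma rejects duplicate IDs in an `add` call. One possible fix, sketched below under the assumption that keeping the first occurrence of each hash is acceptable, is to skip IDs that have already been seen before a batch is built. `unique_batches` is a hypothetical helper, not part of ChromaDB or the project's data loader.

```python
def unique_batches(records, batch_size, key="conversation_hash"):
    """Yield lists of records, skipping any record whose key was already seen."""
    seen = set()
    batch = []
    for record in records:
        record_id = record[key]
        if record_id in seen:
            continue  # a repeated ID would make the add call fail
        seen.add(record_id)
        batch.append(record)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch

# Toy input with repeated hashes
records = [{"conversation_hash": h} for h in ["a", "b", "a", "c", "b", "d"]]
batches = list(unique_batches(records, batch_size=2))
print([[r["conversation_hash"] for r in b] for b in batches])
# → [['a', 'b'], ['c', 'd']]
```

Each yielded batch can then be turned into the `documents`, `metadatas`, and `ids` lists and passed to `collection.add()` as in the function above. The `seen` set grows with the number of unique IDs, which should be manageable for hashes at this scale. If the aim is instead to overwrite documents that already exist in the collection from an earlier run, `collection.upsert()` may be a better fit, though IDs within a single batch likely still need to be unique.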
