# ChromaDB CRUD Operations Test

This notebook exercises the CRUD operations of the project's ChromaDB vector-database layer (`bookdb.vector_db`) using sample data and randomly generated embeddings, so no embedding service is required.

**Prerequisites:**
- ChromaDB must be running: `make chroma-up`

**What we'll test:**
- ChromaDB connection
- Collection initialization and management
- Book CRUD operations
- User CRUD operations
- Metadata filtering
- Batch operations

## 1. Setup and Configuration

In [1]:
import random
from typing import List

from bookdb.vector_db import (
    get_chroma_client,
    get_client_info,
    initialize_all_collections,
    CollectionNames,
    BookVectorCRUD,
    UserVectorCRUD,
    get_books_collection,
    get_users_collection,
)

In [2]:
# Get client and verify connection
client = get_chroma_client()
client_info = get_client_info()

print("ChromaDB Client Info:")
print(f"  Connected: {client_info['connected']}")
print(f"  Mode: {client_info['mode']}")
print(f"  Config: {client_info['config']}")

ChromaDB Client Info:
  Connected: True
  Mode: unknown
  Config: {'mode': 'server', 'host': 'localhost', 'port': 8000, 'persist_directory': './chroma_data'}


In [3]:
# Initialize collections
collection_manager = initialize_all_collections()

print("\nInitialized Collections:")
for collection_name in collection_manager.list_collections():
    count = collection_manager.get_collection_count(CollectionNames(collection_name))
    print(f"  - {collection_name}: {count} items")


Initialized Collections:
  - users: 0 items
  - books: 5 items


## 2. Helper Functions

The helpers below generate fake embeddings; the cell after them defines the sample book and user data used throughout the notebook.

In [4]:
def generate_fake_embedding(dimension: int = 384) -> List[float]:
    """Generate a fake embedding: `dimension` floats drawn uniformly from [-1, 1].

    The vector is not normalized and the RNG is not seeded, so values differ between runs.
    """
    return [random.uniform(-1, 1) for _ in range(dimension)]


def generate_similar_embedding(base_embedding: List[float], noise: float = 0.1) -> List[float]:
    """Generate an embedding close to `base_embedding` by adding uniform noise in [-noise, noise] to each component."""
    return [val + random.uniform(-noise, noise) for val in base_embedding]
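
As a quick sanity check on these helpers, the sketch below compares cosine similarities. It copies the two generators so that it runs on its own; `cosine_similarity` is a hypothetical helper added for illustration and is not part of `bookdb`. A fixed seed is used only to make the run repeatable.

```python
import math
import random
from typing import List


def generate_fake_embedding(dimension: int = 384) -> List[float]:
    return [random.uniform(-1, 1) for _ in range(dimension)]


def generate_similar_embedding(base_embedding: List[float], noise: float = 0.1) -> List[float]:
    return [val + random.uniform(-noise, noise) for val in base_embedding]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Hypothetical helper: cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


random.seed(42)  # repeatable run; the notebook itself does not seed the RNG
base = generate_fake_embedding()
near = generate_similar_embedding(base)   # small perturbation of `base`
unrelated = generate_fake_embedding()     # independent random vector

sim_near = cosine_similarity(base, near)
sim_unrelated = cosine_similarity(base, unrelated)
print(f"base vs. perturbed: {sim_near:.3f}")
print(f"base vs. unrelated: {sim_unrelated:.3f}")
```

With the default `noise=0.1`, the perturbed vector should stay very close to the base (similarity near 1), while two independent random vectors in 384 dimensions should be nearly orthogonal (similarity near 0).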

In [5]:
# Sample book data
SAMPLE_BOOKS = [
    {
        "book_id": "book_001",
        "title": "The Quantum Garden",
        "author": "Derek Künsken",
        "description": "A mind-bending science fiction adventure through quantum realities.",
        "genre": "Science Fiction",
        "publication_year": 2019,
        "language": "en",
        "page_count": 480,
        "average_rating": 4.2,
    },
    {
        "book_id": "book_002",
        "title": "The Midnight Library",
        "author": "Matt Haig",
        "description": "Between life and death is a library with infinite books and infinite lives.",
        "genre": "Fiction",
        "publication_year": 2020,
        "language": "en",
        "page_count": 304,
        "average_rating": 4.5,
    },
    {
        "book_id": "book_003",
        "title": "Project Hail Mary",
        "author": "Andy Weir",
        "description": "A lone astronaut must save Earth from an extinction-level threat.",
        "genre": "Science Fiction",
        "publication_year": 2021,
        "language": "en",
        "page_count": 496,
        "average_rating": 4.7,
    },
    {
        "book_id": "book_004",
        "title": "The Seven Husbands of Evelyn Hugo",
        "author": "Taylor Jenkins Reid",
        "description": "A legendary Hollywood actress tells her life story.",
        "genre": "Historical Fiction",
        "publication_year": 2017,
        "language": "en",
        "page_count": 388,
        "average_rating": 4.6,
    },
    {
        "book_id": "book_005",
        "title": "Klara and the Sun",
        "author": "Kazuo Ishiguro",
        "description": "An Artificial Friend observes the world with love and wonder.",
        "genre": "Science Fiction",
        "publication_year": 2021,
        "language": "en",
        "page_count": 320,
        "average_rating": 4.1,
    },
]

# Sample user data
SAMPLE_USERS = [
    {
        "user_id": 1001,
        "preferences_text": "I love science fiction novels with deep philosophical questions and space exploration themes.",
        "favorite_genres": "Science Fiction,Fantasy",
        "num_books_read": 45,
        "average_rating_given": 4.2,
        "reading_level": "advanced",
    },
    {
        "user_id": 1002,
        "preferences_text": "I enjoy contemporary fiction with strong character development and emotional depth.",
        "favorite_genres": "Fiction,Historical Fiction",
        "num_books_read": 32,
        "average_rating_given": 4.5,
        "reading_level": "intermediate",
    },
    {
        "user_id": 1003,
        "preferences_text": "I prefer fast-paced thrillers and mystery novels with unexpected plot twists.",
        "favorite_genres": "Mystery,Thriller",
        "num_books_read": 67,
        "average_rating_given": 4.0,
        "reading_level": "advanced",
    },
]

print(f"Created {len(SAMPLE_BOOKS)} sample books")
print(f"Created {len(SAMPLE_USERS)} sample users")

Created 5 sample books
Created 3 sample users
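
The sample users store `favorite_genres` as one comma-separated string rather than a list, presumably so that each metadata value stays a single scalar. A minimal sketch of converting between the two forms follows; `encode_genres` and `decode_genres` are hypothetical names, not functions from `bookdb`.

```python
from typing import List


def encode_genres(genres: List[str]) -> str:
    """Join genre names into one comma-separated string (hypothetical helper)."""
    for genre in genres:
        if "," in genre:
            # A comma inside a name would make the string ambiguous to split.
            raise ValueError(f"Genre name must not contain a comma: {genre!r}")
    return ",".join(genre.strip() for genre in genres)


def decode_genres(value: str) -> List[str]:
    """Split a comma-separated genre string back into a list, dropping empty entries."""
    return [part.strip() for part in value.split(",") if part.strip()]


encoded = encode_genres(["Science Fiction", "Fantasy"])
print(encoded)
print(decode_genres(encoded))
```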


## 3. Test Book CRUD Operations

In [6]:
# Get books collection and create CRUD instance
books_collection = get_books_collection()
book_crud = BookVectorCRUD(books_collection)

# Clear existing data
existing_books = book_crud.get_all()
if existing_books:
    book_crud.delete_batch([book["id"] for book in existing_books])
    print(f"Cleared {len(existing_books)} existing books")

# Add sample books with fake embeddings
print("\nAdding sample books:")
book_embeddings = {}
for book in SAMPLE_BOOKS:
    embedding = generate_fake_embedding()
    book_embeddings[book["book_id"]] = embedding
    
    book_crud.add_book(
        book_id=book["book_id"],
        title=book["title"],
        author=book["author"],
        description=book["description"],
        genre=book["genre"],
        publication_year=book["publication_year"],
        language=book["language"],
        page_count=book["page_count"],
        average_rating=book["average_rating"],
        embedding=embedding,
    )
    print(f"  ✓ Added: {book['title']} by {book['author']}")

print(f"\nTotal books in collection: {book_crud.count()}")

Cleared 5 existing books

Adding sample books:
  ✓ Added: The Quantum Garden by Derek Künsken
  ✓ Added: The Midnight Library by Matt Haig
  ✓ Added: Project Hail Mary by Andy Weir
  ✓ Added: The Seven Husbands of Evelyn Hugo by Taylor Jenkins Reid
  ✓ Added: Klara and the Sun by Kazuo Ishiguro

Total books in collection: 5


### 3.1 Test Book Retrieval

In [7]:
retrieved_book = book_crud.get("book_001")
print("Retrieved book_001:")
print(f"  ID: {retrieved_book['id']}")
print(f"  Title: {retrieved_book['metadata']['title']}")
print(f"  Author: {retrieved_book['metadata']['author']}")
print(f"  Genre: {retrieved_book['metadata']['genre']}")
print(f"  Rating: {retrieved_book['metadata']['average_rating']}")

Retrieved book_001:
  ID: book_001
  Title: The Quantum Garden
  Author: Derek Künsken
  Genre: Science Fiction
  Rating: 4.2


### 3.2 Test Book Update

In [8]:
book_crud.update_book(
    book_id="book_001",
    average_rating=4.3,
    page_count=485,
)

updated_book = book_crud.get("book_001")
print("Updated book_001:")
print(f"  New Rating: {updated_book['metadata']['average_rating']}")
print(f"  New Page Count: {updated_book['metadata']['page_count']}")

Updated book_001:
  New Rating: 4.3
  New Page Count: 485
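
Only the two fields passed to `update_book` change here; the title, author, and genre of `book_001` are still present in the full listing in section 6.2. One way to picture such a partial update is a dictionary merge that skips unset fields. This is a sketch under that assumption, with `merge_metadata` as a hypothetical helper; the actual `update_book` implementation may work differently.

```python
from typing import Any, Dict, Optional


def merge_metadata(existing: Dict[str, Any], **changes: Optional[Any]) -> Dict[str, Any]:
    """Return a copy of `existing` with every non-None change applied (hypothetical helper)."""
    updates = {key: value for key, value in changes.items() if value is not None}
    return {**existing, **updates}


existing = {"title": "The Quantum Garden", "average_rating": 4.2, "page_count": 480}
# `genre=None` stands for a field the caller did not supply; it is ignored.
merged = merge_metadata(existing, average_rating=4.3, page_count=485, genre=None)
print(merged)
```

Returning a new dictionary instead of mutating `existing` keeps the pre-update values available for comparison.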


### 3.3 Test Metadata Filtering

In [9]:
# Search by genre
scifi_books = book_crud.search_by_metadata(genre="Science Fiction", limit=10)
print(f"Science Fiction books: {len(scifi_books)}")
for book in scifi_books:
    print(f"  - {book['metadata']['title']} ({book['metadata']['publication_year']})")

Science Fiction books: 3
  - The Quantum Garden (2019)
  - Project Hail Mary (2021)
  - Klara and the Sun (2021)


In [10]:
# Search by minimum publication year
recent_books = book_crud.search_by_metadata(min_year=2020, limit=10)
print(f"Books published 2020 or later: {len(recent_books)}")
for book in recent_books:
    print(f"  - {book['metadata']['title']} ({book['metadata']['publication_year']})")

Books published 2020 or later: 3
  - The Midnight Library (2020)
  - Project Hail Mary (2021)
  - Klara and the Sun (2021)


In [11]:
# Search by author
weir_books = book_crud.search_by_metadata(author="Andy Weir", limit=10)
print(f"Books by Andy Weir: {len(weir_books)}")
for book in weir_books:
    print(f"  - {book['metadata']['title']}")

Books by Andy Weir: 1
  - Project Hail Mary
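
Chroma expresses metadata filters as `where` dictionaries built from operators such as `$eq`, `$gte`, and `$and`. The sketch below shows how keyword arguments like the ones passed to `search_by_metadata` could be translated into such a filter. `build_where` is a hypothetical helper and `max_year` is an assumed extra parameter; the real translation inside `bookdb` may differ.

```python
from typing import Any, Dict, List, Optional


def build_where(
    genre: Optional[str] = None,
    author: Optional[str] = None,
    min_year: Optional[int] = None,
    max_year: Optional[int] = None,  # assumed parameter, not shown in the notebook
) -> Optional[Dict[str, Any]]:
    """Build a Chroma-style `where` filter from optional criteria (hypothetical helper)."""
    clauses: List[Dict[str, Any]] = []
    if genre is not None:
        clauses.append({"genre": {"$eq": genre}})
    if author is not None:
        clauses.append({"author": {"$eq": author}})
    if min_year is not None:
        clauses.append({"publication_year": {"$gte": min_year}})
    if max_year is not None:
        clauses.append({"publication_year": {"$lte": max_year}})

    if not clauses:
        return None          # no filtering requested
    if len(clauses) == 1:
        return clauses[0]    # a single condition needs no $and wrapper
    return {"$and": clauses}


print(build_where(min_year=2020))
print(build_where(genre="Science Fiction", min_year=2020))
```

Returning the bare clause when there is only one condition matters because `$and` is meant to combine two or more expressions.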


## 4. Test User CRUD Operations

In [12]:
# Get users collection and create CRUD instance
users_collection = get_users_collection()
user_crud = UserVectorCRUD(users_collection)

# Clear existing data
existing_users = user_crud.get_all()
if existing_users:
    user_crud.delete_batch([user["id"] for user in existing_users])
    print(f"Cleared {len(existing_users)} existing users")

# Add sample users with fake embeddings
print("\nAdding sample users:")
user_embeddings = {}
for user in SAMPLE_USERS:
    embedding = generate_fake_embedding()
    user_embeddings[user["user_id"]] = embedding
    
    user_crud.add_user(
        user_id=user["user_id"],
        preferences_text=user["preferences_text"],
        favorite_genres=user["favorite_genres"],
        num_books_read=user["num_books_read"],
        average_rating_given=user["average_rating_given"],
        reading_level=user["reading_level"],
        embedding=embedding,
    )
    print(f"  ✓ Added: User {user['user_id']} ({user['num_books_read']} books read)")

print(f"\nTotal users in collection: {user_crud.count()}")


Adding sample users:
  ✓ Added: User 1001 (45 books read)
  ✓ Added: User 1002 (32 books read)
  ✓ Added: User 1003 (67 books read)

Total users in collection: 3


### 4.1 Test User Retrieval

In [13]:
retrieved_user = user_crud.get("user_1001")
print("Retrieved user_1001:")
print(f"  ID: {retrieved_user['id']}")
print(f"  User ID: {retrieved_user['metadata']['user_id']}")
print(f"  Books Read: {retrieved_user['metadata']['num_books_read']}")
print(f"  Favorite Genres: {retrieved_user['metadata']['favorite_genres']}")
print(f"  Reading Level: {retrieved_user['metadata']['reading_level']}")
print(f"  Preferences: {retrieved_user['document'][:100]}...")

Retrieved user_1001:
  ID: user_1001
  User ID: 1001
  Books Read: 45
  Favorite Genres: Science Fiction,Fantasy
  Reading Level: advanced
  Preferences: I love science fiction novels with deep philosophical questions and space exploration themes....
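
Note that the user was added with the integer `user_id=1001` but is retrieved with the string ID `"user_1001"`, which suggests that the CRUD layer prefixes the numeric ID. A minimal sketch of that assumed convention follows; `user_document_id` and `parse_user_document_id` are hypothetical helpers, not part of `bookdb`.

```python
USER_ID_PREFIX = "user_"  # assumed from the IDs shown in the output above


def user_document_id(user_id: int) -> str:
    """Map a numeric user ID to the string ID used in the collection (assumed convention)."""
    return f"{USER_ID_PREFIX}{user_id}"


def parse_user_document_id(document_id: str) -> int:
    """Recover the numeric user ID from a prefixed document ID."""
    if not document_id.startswith(USER_ID_PREFIX):
        raise ValueError(f"Not a user document ID: {document_id!r}")
    return int(document_id[len(USER_ID_PREFIX):])


doc_id = user_document_id(1001)
print(doc_id)
print(parse_user_document_id(doc_id))
```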


### 4.2 Test User Update

In [14]:
user_crud.update_user_preferences(
    user_id=1001,
    num_books_read=50,  # Read 5 more books
    average_rating_given=4.3,
)

updated_user = user_crud.get("user_1001")
print("Updated user_1001:")
print(f"  New Books Read: {updated_user['metadata']['num_books_read']}")
print(f"  New Average Rating: {updated_user['metadata']['average_rating_given']}")

Updated user_1001:
  New Books Read: 50
  New Average Rating: 4.3


## 5. Test Batch Operations

In [15]:
# Add multiple books at once
batch_ids = ["book_006", "book_007", "book_008"]
batch_documents = [
    "The Way of Kings by Brandon Sanderson",
    "Dune by Frank Herbert",
    "Neuromancer by William Gibson",
]
batch_metadatas = [
    {
        "title": "The Way of Kings",
        "author": "Brandon Sanderson",
        "genre": "Fantasy",
        "publication_year": 2010,
        "language": "en",
    },
    {
        "title": "Dune",
        "author": "Frank Herbert",
        "genre": "Science Fiction",
        "publication_year": 1965,
        "language": "en",
    },
    {
        "title": "Neuromancer",
        "author": "William Gibson",
        "genre": "Science Fiction",
        "publication_year": 1984,
        "language": "en",
    },
]
batch_embeddings = [generate_fake_embedding() for _ in range(3)]

book_crud.add_batch(
    ids=batch_ids,
    documents=batch_documents,
    metadatas=batch_metadatas,
    embeddings=batch_embeddings,
)

print(f"Added {len(batch_ids)} books in batch")
print(f"Total books now: {book_crud.count()}")

Added 3 books in batch
Total books now: 8
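
Because `add_batch` takes four parallel lists, a mismatch between them is an easy mistake to make. The sketch below checks a batch before it is sent; `validate_batch` is a hypothetical helper, and whether `add_batch` already performs similar checks is not shown in this notebook.

```python
from typing import Any, Dict, List, Sequence


def validate_batch(
    ids: Sequence[str],
    documents: Sequence[str],
    metadatas: Sequence[Dict[str, Any]],
    embeddings: Sequence[List[float]],
) -> None:
    """Raise ValueError if the parallel batch lists are inconsistent (hypothetical helper)."""
    lengths = {
        "ids": len(ids),
        "documents": len(documents),
        "metadatas": len(metadatas),
        "embeddings": len(embeddings),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Batch lists differ in length: {lengths}")
    if len(set(ids)) != len(ids):
        raise ValueError("Batch contains duplicate IDs")
    dimensions = {len(embedding) for embedding in embeddings}
    if len(dimensions) > 1:
        raise ValueError(f"Embeddings have mixed dimensions: {sorted(dimensions)}")


# A consistent batch passes silently.
validate_batch(
    ids=["book_006", "book_007"],
    documents=["The Way of Kings by Brandon Sanderson", "Dune by Frank Herbert"],
    metadatas=[{"title": "The Way of Kings"}, {"title": "Dune"}],
    embeddings=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
)
print("Batch is consistent")
```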


In [16]:
# Retrieve batch
retrieved_batch = book_crud.get_batch(batch_ids)
print(f"Retrieved {len(retrieved_batch)} books:")
for book in retrieved_batch:
    print(f"  - {book['metadata']['title']} by {book['metadata']['author']}")

Retrieved 3 books:
  - The Way of Kings by Brandon Sanderson
  - Dune by Frank Herbert
  - Neuromancer by William Gibson


## 6. Summary and Statistics

### 6.1 Collection Statistics

In [17]:
print("Collection Statistics:")
print(f"  Books: {book_crud.count()}")
print(f"  Users: {user_crud.count()}")

print("\nAll collections:")
for collection_name in collection_manager.list_collections():
    count = collection_manager.get_collection_count(CollectionNames(collection_name))
    print(f"  - {collection_name}: {count} items")

Collection Statistics:
  Books: 8
  Users: 3

All collections:
  - users: 3 items
  - books: 8 items


### 6.2 All Books in Database

In [18]:
all_books = book_crud.get_all()
print(f"All {len(all_books)} books:")
for i, book in enumerate(all_books, 1):
    metadata = book["metadata"]
    print(f"\n{i}. {metadata['title']}")
    print(f"   Author: {metadata['author']}")
    print(f"   Genre: {metadata['genre']}")
    print(f"   Year: {metadata['publication_year']}")
    # Books added via add_batch above carry no average_rating
    if "average_rating" in metadata:
        print(f"   Rating: {metadata['average_rating']}")

All 8 books:

1. The Quantum Garden
   Author: Derek Künsken
   Genre: Science Fiction
   Year: 2019
   Rating: 4.3

2. The Midnight Library
   Author: Matt Haig
   Genre: Fiction
   Year: 2020
   Rating: 4.5

3. Project Hail Mary
   Author: Andy Weir
   Genre: Science Fiction
   Year: 2021
   Rating: 4.7

4. The Seven Husbands of Evelyn Hugo
   Author: Taylor Jenkins Reid
   Genre: Historical Fiction
   Year: 2017
   Rating: 4.6

5. Klara and the Sun
   Author: Kazuo Ishiguro
   Genre: Science Fiction
   Year: 2021
   Rating: 4.1

6. The Way of Kings
   Author: Brandon Sanderson
   Genre: Fantasy
   Year: 2010

7. Dune
   Author: Frank Herbert
   Genre: Science Fiction
   Year: 1965

8. Neuromancer
   Author: William Gibson
   Genre: Science Fiction
   Year: 1984


### 6.3 All Users in Database

In [19]:
all_users = user_crud.get_all()
print(f"All {len(all_users)} users:")
for i, user in enumerate(all_users, 1):
    metadata = user["metadata"]
    print(f"\n{i}. User {metadata['user_id']}")
    print(f"   Books Read: {metadata['num_books_read']}")
    print(f"   Favorite Genres: {metadata['favorite_genres']}")
    print(f"   Reading Level: {metadata['reading_level']}")
    print(f"   Average Rating Given: {metadata.get('average_rating_given', 'N/A')}")

All 3 users:

1. User 1001
   Books Read: 50
   Favorite Genres: Science Fiction,Fantasy
   Reading Level: advanced
   Average Rating Given: 4.3

2. User 1002
   Books Read: 32
   Favorite Genres: Fiction,Historical Fiction
   Reading Level: intermediate
   Average Rating Given: 4.5

3. User 1003
   Books Read: 67
   Favorite Genres: Mystery,Thriller
   Reading Level: advanced
   Average Rating Given: 4.0
