# 🎯 Practice Exercises

## Exercise 1: Chroma with Advanced Filtering

### Task
Build a document management system using Chroma with rich metadata and filtering.

### Instructions

1. Create a collection of at least 30 documents with metadata:
   ```python
   metadata = {
       "category": "...",  # e.g., "tech", "business", "science"
       "date": "...",      # e.g., "2024-01-15"
       "author": "...",    # e.g., "John Doe"
       "priority": ...    # e.g., 1, 2, 3
   }
   ```

2. Implement queries with different filters:
   - By category
   - By date range
   - By author
   - Combined filters (e.g., category AND date)

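3. Test MMR (Maximal Marginal Relevance). Chroma's native `query` API does not appear to offer MMR, so either use a wrapper that does (for example, LangChain's `max_marginal_relevance_search`) or re-rank the retrieved candidates yourself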

4. Compare results with and without filters
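
Chroma's own `query` call does not seem to provide MMR, so one way to try step 3 is to re-rank candidate embeddings yourself. The following is a minimal sketch of greedy MMR selection over plain Python lists. The names `mmr` and `cosine` and the toy vectors are illustrative assumptions, not part of Chroma.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Toy 2-D "embeddings": doc 1 is a near-duplicate of doc 0, doc 2 points elsewhere.
query_vec = [1.0, 0.0]
doc_vecs = [[0.9, 0.436], [0.89, 0.456], [0.8, -0.6]]

print(mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=1.0` the selection reduces to plain relevance ranking. Lower values penalise near-duplicates, which is why the second pick switches from doc 1 to doc 2.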

### Sample Data

```python
documents = [
    {
        "text": "Python 3.12 introduces new performance improvements...",
        "metadata": {
            "category": "tech",
            "date": "2024-01-15",
            "author": "Tech Team",
            "priority": 1
        }
    },
    # Add 29 more...
]
```

### Expected Output

```
Query: "latest technology updates"

Without filters:
1. [Result from any category]
2. [Result from any category]
3. [Result from any category]

With filter (category="tech"):
1. [Tech result]
2. [Tech result]
3. [Tech result]

With filter (category="tech" AND date>="2024-01-01"):
1. [Recent tech result]
2. [Recent tech result]
3. [Recent tech result]
```
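
The last filter in the expected output needs a range comparison on dates. As far as I know, Chroma's comparison operators (`$gte`, `$lte`, and so on) apply to numeric metadata, so a string such as `"2024-01-15"` may need a numeric companion field. The sketch below assumes a hypothetical extra field named `date_int`. The field name and both helpers are illustrative, not part of the exercise data.

```python
from datetime import date


def date_to_int(iso_date: str) -> int:
    """Convert 'YYYY-MM-DD' to an integer such as 20240115 (also validates the date)."""
    d = date.fromisoformat(iso_date)
    return d.year * 10000 + d.month * 100 + d.day


def with_date_int(metadata: dict) -> dict:
    """Return a copy of the metadata with the assumed numeric `date_int` field added."""
    return {**metadata, "date_int": date_to_int(metadata["date"])}


enriched = with_date_int(
    {"category": "tech", "date": "2024-01-15", "author": "Tech Team", "priority": 1}
)

# Filter for: category == "tech" AND date >= 2024-01-01
recent_tech_where = {
    "$and": [
        {"category": "tech"},
        {"date_int": {"$gte": date_to_int("2024-01-01")}},
    ]
}

print(enriched["date_int"])  # 20240115
print(recent_tech_where)
```

If the documents are added with metadata enriched this way, `recent_tech_where` can be passed as the `where=` argument of `collection.query`.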


In [4]:
import chromadb
from chromadb.utils import embedding_functions
from chromadb.config import Settings

In [5]:
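# Legacy configuration for Chroma releases before 0.4 (DuckDB + Parquet backend).
# From 0.4 onward, chromadb.PersistentClient(path="./exercise_chroma_db") replaces it.
settings = Settings(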
    chroma_db_impl="duckdb+parquet",
    persist_directory="./exercise_chroma_db",
    anonymized_telemetry=False
)

client = chromadb.Client(settings)

Failed to send telemetry event client_start: capture() takes 1 positional argument but 3 were given
Using embedded DuckDB with persistence: data will be stored in: ./exercise_chroma_db


In [6]:
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = client.get_or_create_collection(
    name="document_management",
    embedding_function=embedding_fn
)

In [7]:
documents_data = [
    # Tech category
    {
        "text": "Python 3.12 introduces new performance improvements and better error messages",
        "metadata": {"category": "tech", "date": "2024-01-15", "author": "Tech Team", "priority": 1}
    },
    {
        "text": "Machine learning frameworks are becoming more accessible to developers",
        "metadata": {"category": "tech", "date": "2024-01-20", "author": "AI Lead", "priority": 2}
    },
    {
        "text": "Cloud computing has revolutionized how we deploy applications",
        "metadata": {"category": "tech", "date": "2024-02-10", "author": "Cloud Expert", "priority": 1}
    },
    {
        "text": "Kubernetes orchestration simplifies container management",
        "metadata": {"category": "tech", "date": "2024-02-15", "author": "DevOps Team", "priority": 2}
    },
    {
        "text": "Quantum computers will soon solve problems classical computers cannot",
        "metadata": {"category": "tech", "date": "2024-03-01", "author": "Quantum Researcher", "priority": 3}
    },
    {
        "text": "Artificial intelligence is transforming every industry",
        "metadata": {"category": "tech", "date": "2024-03-10", "author": "Tech Team", "priority": 1}
    },
    
    # Business category
    {
        "text": "Q1 revenue exceeded expectations with 25% growth",
        "metadata": {"category": "business", "date": "2024-01-31", "author": "Finance Team", "priority": 1}
    },
    {
        "text": "New partnership with major tech company announced today",
        "metadata": {"category": "business", "date": "2024-02-05", "author": "CEO", "priority": 1}
    },
    {
        "text": "Market analysis shows increasing demand for our products",
        "metadata": {"category": "business", "date": "2024-02-20", "author": "Analytics Team", "priority": 2}
    },
    {
        "text": "Employee satisfaction survey results are very positive",
        "metadata": {"category": "business", "date": "2024-03-05", "author": "HR Team", "priority": 2}
    },
    {
        "text": "Expansion into three new markets scheduled for next quarter",
        "metadata": {"category": "business", "date": "2024-03-15", "author": "Strategy Team", "priority": 1}
    },
    {
        "text": "Cost reduction initiatives saved company 2 million dollars",
        "metadata": {"category": "business", "date": "2024-03-20", "author": "Operations", "priority": 3}
    },
    
    # Science category
    {
        "text": "New study shows benefits of daily exercise for brain health",
        "metadata": {"category": "science", "date": "2024-01-10", "author": "Health Researcher", "priority": 2}
    },
    {
        "text": "Climate change impacts more severe than previously predicted",
        "metadata": {"category": "science", "date": "2024-01-25", "author": "Climate Scientist", "priority": 1}
    },
    {
        "text": "Breakthrough in cancer treatment shows promising results",
        "metadata": {"category": "science", "date": "2024-02-08", "author": "Medical Team", "priority": 1}
    },
    {
        "text": "Mars rover discovers evidence of ancient microbial life",
        "metadata": {"category": "science", "date": "2024-02-25", "author": "NASA Team", "priority": 1}
    },
    {
        "text": "New vaccine developed to combat emerging disease",
        "metadata": {"category": "science", "date": "2024-03-08", "author": "Virologist", "priority": 1}
    },
    {
        "text": "Study reveals connections between sleep and memory formation",
        "metadata": {"category": "science", "date": "2024-03-18", "author": "Neuroscientist", "priority": 2}
    },
    
    # More Tech
    {
        "text": "Web development frameworks continue to evolve rapidly",
        "metadata": {"category": "tech", "date": "2024-01-05", "author": "Web Team", "priority": 3}
    },
    {
        "text": "Cybersecurity threats increase as hackers become more sophisticated",
        "metadata": {"category": "tech", "date": "2024-02-01", "author": "Security Team", "priority": 1}
    },
    {
        "text": "API design best practices guide developers to better architecture",
        "metadata": {"category": "tech", "date": "2024-02-28", "author": "Architecture Team", "priority": 2}
    },
    
    # More Business
    {
        "text": "Customer retention rate improved by 15 percent this quarter",
        "metadata": {"category": "business", "date": "2024-01-18", "author": "Sales Team", "priority": 2}
    },
    {
        "text": "New product launch exceeded initial sales projections",
        "metadata": {"category": "business", "date": "2024-02-12", "author": "Product Team", "priority": 1}
    },
    {
        "text": "Training program helps employees develop new skills",
        "metadata": {"category": "business", "date": "2024-03-02", "author": "HR Team", "priority": 3}
    },
    
    # More Science
    {
        "text": "Renewable energy technology becomes more cost effective",
        "metadata": {"category": "science", "date": "2024-01-22", "author": "Energy Expert", "priority": 2}
    },
    {
        "text": "Genetic engineering opens new possibilities for disease treatment",
        "metadata": {"category": "science", "date": "2024-02-18", "author": "Geneticist", "priority": 1}
    },
    {
        "text": "Ocean acidification threatens marine ecosystem survival",
        "metadata": {"category": "science", "date": "2024-03-12", "author": "Marine Biologist", "priority": 1}
    },
    
    # Additional for reaching 30+
    {
        "text": "Blockchain technology finds new applications in supply chain",
        "metadata": {"category": "tech", "date": "2024-03-25", "author": "Blockchain Team", "priority": 2}
    },
    {
        "text": "Social media strategy drives brand awareness campaign",
        "metadata": {"category": "business", "date": "2024-03-28", "author": "Marketing", "priority": 2}
    },
    {
        "text": "Space exploration missions advance human knowledge of universe",
        "metadata": {"category": "science", "date": "2024-03-30", "author": "Astronomer", "priority": 2}
    },
]

In [8]:
documents_list = [doc["text"] for doc in documents_data]
metadatas_list = [doc["metadata"] for doc in documents_data]
ids_list = [f"doc_{i}" for i in range(len(documents_data))]

collection.add(
    documents=documents_list,
    metadatas=metadatas_list,
    ids=ids_list
)

print(f"✅ Added {len(documents_list)} documents to collection\n")
print("=" * 80)

Failed to send telemetry event collection_add: capture() takes 1 positional argument but 3 were given


✅ Added 30 documents to collection



In [9]:
print("\n🔍 TEST 1: QUERY WITHOUT FILTERS")
print("=" * 80)

query = "What about latest technology updates?"
results = collection.query(
    query_texts=[query],
    n_results=3
)

print(f"Query: '{query}'\n")
print("Results (any category):\n")

for i, (doc, metadata, distance) in enumerate(zip(
    results['documents'][0],
    results['metadatas'][0],
    results['distances'][0]
), 1):
    print(f"{i}. Category: {metadata['category'].upper()}")
    print(f"   Author: {metadata['author']}")
    print(f"   Date: {metadata['date']}")
    print(f"   Priority: {metadata['priority']}")
    print(f"   Relevance Score: {1/(1+distance):.3f}")
    print(f"   Text: {doc[:70]}...\n")



🔍 TEST 1: QUERY WITHOUT FILTERS
Query: 'What about latest technology updates?'

Results (any category):

1. Category: BUSINESS
   Author: CEO
   Date: 2024-02-05
   Priority: 1
   Relevance Score: 0.442
   Text: New partnership with major tech company announced today...

2. Category: TECH
   Author: Web Team
   Date: 2024-01-05
   Priority: 3
   Relevance Score: 0.424
   Text: Web development frameworks continue to evolve rapidly...

3. Category: TECH
   Author: Security Team
   Date: 2024-02-01
   Priority: 1
   Relevance Score: 0.422
   Text: Cybersecurity threats increase as hackers become more sophisticated...
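
The "Relevance Score" printed above is not something Chroma returns. It is the notebook's own transform `1/(1+distance)` of the returned distance, where a smaller distance means a closer match. A small sketch of that mapping and its inverse follows. The function names are illustrative.

```python
def distance_to_score(distance: float) -> float:
    """Map a non-negative distance to a score in (0, 1]; distance 0 gives score 1."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


def score_to_distance(score: float) -> float:
    """Invert distance_to_score for a score in (0, 1]."""
    if not 0 < score <= 1:
        raise ValueError("score must be in (0, 1]")
    return 1.0 / score - 1.0


print(distance_to_score(0.0))                # 1.0
print(round(score_to_distance(0.442), 3))    # 1.262
```

Because the transform is monotonic, it preserves the ranking. The scores are only comparable within one collection and one distance metric, and they are not probabilities.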



In [10]:
print("\n" + "=" * 80)
print("🔍 TEST 2: QUERY WITH CATEGORY FILTER (Tech Only)")
print("=" * 80)

results_filtered_tech = collection.query(
    query_texts=[query],
    n_results=3,
    where={"category": "tech"}
)

print(f"Query: '{query}' (FILTERED: category=tech)\n")
print("Results (tech category only):\n")

for i, (doc, metadata, distance) in enumerate(zip(
    results_filtered_tech['documents'][0],
    results_filtered_tech['metadatas'][0],
    results_filtered_tech['distances'][0]
), 1):
    print(f"{i}. Category: {metadata['category'].upper()}")
    print(f"   Author: {metadata['author']}")
    print(f"   Date: {metadata['date']}")
    print(f"   Priority: {metadata['priority']}")
    print(f"   Relevance Score: {1/(1+distance):.3f}")
    print(f"   Text: {doc[:70]}...\n")


🔍 TEST 2: QUERY WITH CATEGORY FILTER (Tech Only)
Query: 'What about latest technology updates?' (FILTERED: category=tech)

Results (tech category only):

1. Category: TECH
   Author: Web Team
   Date: 2024-01-05
   Priority: 3
   Relevance Score: 0.424
   Text: Web development frameworks continue to evolve rapidly...

2. Category: TECH
   Author: Security Team
   Date: 2024-02-01
   Priority: 1
   Relevance Score: 0.422
   Text: Cybersecurity threats increase as hackers become more sophisticated...

3. Category: TECH
   Author: Blockchain Team
   Date: 2024-03-25
   Priority: 2
   Relevance Score: 0.412
   Text: Blockchain technology finds new applications in supply chain...
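
The same printing loop appears in every test cell. One possible refactor is a helper that turns a query result into display lines. The sketch assumes the result has the nested-list shape used in the cells above (`documents`, `metadatas`, `distances`, each with one inner list per query). `format_results` is a hypothetical name, and the distance in the sample is a placeholder, not a measured value.

```python
def format_results(results: dict, preview_chars: int = 70) -> list[str]:
    """Build the per-hit display lines for the first query in a result dict."""
    lines = []
    rows = zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    )
    for rank, (doc, meta, distance) in enumerate(rows, 1):
        lines.append(f"{rank}. Category: {meta['category'].upper()}")
        lines.append(f"   Author: {meta['author']}")
        lines.append(f"   Date: {meta['date']}")
        lines.append(f"   Priority: {meta['priority']}")
        lines.append(f"   Relevance Score: {1 / (1 + distance):.3f}")
        lines.append(f"   Text: {doc[:preview_chars]}...")
    return lines


# Stand-in result with one hit; the distance 1.0 is a placeholder value.
sample = {
    "documents": [["Web development frameworks continue to evolve rapidly"]],
    "metadatas": [[{"category": "tech", "date": "2024-01-05",
                    "author": "Web Team", "priority": 3}]],
    "distances": [[1.0]],
}

print("\n".join(format_results(sample)))
```

Each test cell could then reduce to a query plus `print("\n".join(format_results(...)))`.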



In [11]:
print("\n" + "=" * 80)
print("🔍 TEST 3: QUERY WITH MULTIPLE FILTERS (Tech + Priority High)")
print("=" * 80)

results_multi_filter = collection.query(
    query_texts=[query],
    n_results=5,
    where={
        "$and": [
            {"category": "tech"},
            {"priority": 1}
        ]
    }
)

print(f"Query: '{query}' (FILTERED: category=tech AND priority=1)\n")
print("Results (only high-priority tech documents):\n")

for i, (doc, metadata, distance) in enumerate(zip(
    results_multi_filter['documents'][0],
    results_multi_filter['metadatas'][0],
    results_multi_filter['distances'][0]
), 1):
    print(f"{i}. Category: {metadata['category'].upper()}")
    print(f"   Author: {metadata['author']}")
    print(f"   Date: {metadata['date']}")
    print(f"   Priority: {metadata['priority']}")
    print(f"   Relevance Score: {1/(1+distance):.3f}")
    print(f"   Text: {doc[:70]}...\n")



🔍 TEST 3: QUERY WITH MULTIPLE FILTERS (Tech + Priority High)
Query: 'What about latest technology updates?' (FILTERED: category=tech AND priority=1)

Results (only high-priority tech documents):

1. Category: TECH
   Author: Security Team
   Date: 2024-02-01
   Priority: 1
   Relevance Score: 0.422
   Text: Cybersecurity threats increase as hackers become more sophisticated...

2. Category: TECH
   Author: Tech Team
   Date: 2024-03-10
   Priority: 1
   Relevance Score: 0.412
   Text: Artificial intelligence is transforming every industry...

3. Category: TECH
   Author: Tech Team
   Date: 2024-01-15
   Priority: 1
   Relevance Score: 0.402
   Text: Python 3.12 introduces new performance improvements and better error m...

4. Category: TECH
   Author: Cloud Expert
   Date: 2024-02-10
   Priority: 1
   Relevance Score: 0.402
   Text: Cloud computing has revolutionized how we deploy applications...
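
Test 3 combines conditions with an explicit `$and` list. To avoid writing that dictionary by hand for every combination, a small builder could assemble it from keyword arguments. This is a sketch under the assumption that a single condition should be passed bare and that `$and` is used only for two or more. `build_where` is a hypothetical helper, not a Chroma function.

```python
def build_where(**conditions):
    """Build a Chroma-style `where` filter from equality conditions.

    Returns None for no conditions, the bare clause for one condition,
    and an "$and" list when there are two or more.
    """
    clauses = [{field: value} for field, value in conditions.items()]
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}


print(build_where(category="tech", priority=1))
# {'$and': [{'category': 'tech'}, {'priority': 1}]}
print(build_where(author="Tech Team"))
# {'author': 'Tech Team'}
```

Note that Test 3 asks for `n_results=5` but prints four hits, because only four documents in the sample data are both `tech` and `priority` 1.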



In [12]:
print("\n" + "=" * 80)
print("🔍 TEST 4: QUERY WITH AUTHOR FILTER (Tech Team docs)")
print("=" * 80)

results_author_filter = collection.query(
    query_texts=[query],
    n_results=5,
    where={"author": "Tech Team"}
)

print(f"Query: '{query}' (FILTERED: author=Tech Team)\n")
print("Results (only Tech Team documents):\n")

for i, (doc, metadata, distance) in enumerate(zip(
    results_author_filter['documents'][0],
    results_author_filter['metadatas'][0],
    results_author_filter['distances'][0]
), 1):
    print(f"{i}. Category: {metadata['category'].upper()}")
    print(f"   Author: {metadata['author']}")
    print(f"   Date: {metadata['date']}")
    print(f"   Priority: {metadata['priority']}")
    print(f"   Relevance Score: {1/(1+distance):.3f}")
    print(f"   Text: {doc[:70]}...\n")


🔍 TEST 4: QUERY WITH AUTHOR FILTER (Tech Team docs)
Query: 'What about latest technology updates?' (FILTERED: author=Tech Team)

Results (only Tech Team documents):

1. Category: TECH
   Author: Tech Team
   Date: 2024-03-10
   Priority: 1
   Relevance Score: 0.412
   Text: Artificial intelligence is transforming every industry...

2. Category: TECH
   Author: Tech Team
   Date: 2024-01-15
   Priority: 1
   Relevance Score: 0.402
   Text: Python 3.12 introduces new performance improvements and better error m...
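
The instructions also ask for a date-range query, which the cells above do not cover. Since the `date` metadata is stored as an ISO `YYYY-MM-DD` string, one simple option is to filter the returned hits in Python: ISO dates sort lexicographically in chronological order. The sketch below assumes the same result shape as the cells above. `filter_by_date_range` is a hypothetical helper, and the distances in the sample are placeholders.

```python
def filter_by_date_range(results: dict, start: str, end: str) -> list[tuple]:
    """Keep hits whose metadata 'date' falls within [start, end], inclusive.

    Dates are compared as 'YYYY-MM-DD' strings, which order chronologically.
    """
    kept = []
    rows = zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    )
    for doc, meta, distance in rows:
        if start <= meta["date"] <= end:
            kept.append((doc, meta, distance))
    return kept


# The two Tech Team hits from Test 4; the distances are placeholder values.
sample = {
    "documents": [[
        "Artificial intelligence is transforming every industry",
        "Python 3.12 introduces new performance improvements and better error messages",
    ]],
    "metadatas": [[
        {"category": "tech", "date": "2024-03-10", "author": "Tech Team", "priority": 1},
        {"category": "tech", "date": "2024-01-15", "author": "Tech Team", "priority": 1},
    ]],
    "distances": [[0.0, 0.0]],
}

for doc, meta, _ in filter_by_date_range(sample, "2024-02-01", "2024-03-31"):
    print(meta["date"], doc)
```

Filtering after retrieval can return fewer than `n_results` hits, so it usually pays to request more candidates than needed.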



In [13]:
print("\n" + "=" * 80)
print("📊 COMPARISON: Results With vs Without Filters")
print("=" * 80)

print("\n✅ WITHOUT FILTERS:")
print(f"   - Got results from {len(set([m['category'] for m in results['metadatas'][0]]))} different categories")
print(f"   - Top result category: {results['metadatas'][0][0]['category']}")

print("\n✅ WITH TECH FILTER:")
print(f"   - All results from: {set([m['category'] for m in results_filtered_tech['metadatas'][0]])}")
print(f"   - All are tech category: ✓")

print("\n✅ WITH TECH + PRIORITY FILTER:")
print(f"   - All results from: {set([m['category'] for m in results_multi_filter['metadatas'][0]])}")
print(f"   - All have priority = 1: {all(m['priority'] == 1 for m in results_multi_filter['metadatas'][0])}")

print("\n✅ WITH AUTHOR FILTER:")
print(f"   - All results from: {results_author_filter['metadatas'][0][0]['author']}")

print("\n" + "=" * 80)
print("✅ EXERCISE COMPLETE!")
print("=" * 80)
print("\nNote: Data is persisted in ./exercise_chroma_db/")

# Optional: explicitly flush to disk. persist() exists only in Chroma releases
# before 0.4; later releases persist automatically and no longer provide it.
client.persist()
print("✅ Data persisted to disk!")


📊 COMPARISON: Results With vs Without Filters

✅ WITHOUT FILTERS:
   - Got results from 2 different categories
   - Top result category: business

✅ WITH TECH FILTER:
   - All results from: {'tech'}
   - All are tech category: ✓

✅ WITH TECH + PRIORITY FILTER:
   - All results from: {'tech'}
   - All have priority = 1: True

✅ WITH AUTHOR FILTER:
   - All results from: Tech Team

✅ EXERCISE COMPLETE!

Note: Data is persisted in ./exercise_chroma_db/
✅ Data persisted to disk!
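
The comparison above reports categories only. A more direct view of what a filter changes is the overlap between the two result sets. The sketch below uses the IDs implied by the `doc_{i}` numbering for the top-3 hits of Test 1 and Test 2. `overlap` is a hypothetical helper, and in a live notebook the ID lists would come from `results["ids"][0]`.

```python
def overlap(ids_a, ids_b):
    """Return the shared IDs (sorted) and the Jaccard overlap of two ID lists."""
    a, b = set(ids_a), set(ids_b)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return sorted(shared), jaccard


# Top-3 IDs derived from the printed results and the doc_{i} numbering scheme.
unfiltered_ids = ["doc_7", "doc_18", "doc_19"]   # partnership, web dev, cybersecurity
tech_ids = ["doc_18", "doc_19", "doc_27"]        # web dev, cybersecurity, blockchain

shared, jaccard = overlap(unfiltered_ids, tech_ids)
print(shared)   # ['doc_18', 'doc_19']
print(jaccard)  # 0.5
```

Here the tech filter drops one business document from the unfiltered top 3 and admits the next-best tech document in its place.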
