# Notebook 1: Qdrant Fundamentals & Search Basics

## 🎯 Objectives

In this notebook, you'll learn how to:
- Connect to Qdrant Cloud with your own credentials
- Create collections with vector configurations
- Ingest documents with metadata (payload)
- Perform basic vector searches
- Filter results using payload indexes
- Work with core Qdrant concepts: collections, points, vectors, and payload

## 📋 Prerequisites

- `qdrant-client>=1.15`, `numpy`, `pandas`, `tqdm`
- Required environment variables: `QDRANT_URL`, `QDRANT_API_KEY`

In [1]:
import sys
import os
import numpy as np
import pandas as pd
from utils import (
    ensure_collection, create_sample_dataset,
    upsert_points_batch, search_dense, print_search_results,
    create_payload_index, print_system_info
)
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, Filter, FieldCondition, MatchValue

print_system_info()
print(f"\n📍 Working directory: {os.getcwd()}")


## 📦 Auto-Install Dependencies

The cell below will automatically install any missing packages. Perfect for JupyterLab environments!

In [21]:
# Install dependencies (will skip if already installed)
import subprocess
import sys

def install_if_missing(package_name, import_name=None):
    """Install package if not already available"""
    if import_name is None:
        import_name = package_name.replace('-', '_')
    
    try:
        __import__(import_name)
        print(f"✅ {package_name} already available")
        return True
    except ImportError:
        print(f"📦 Installing {package_name}...")
        try:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package_name, "-q"])
            print(f"✅ {package_name} installed successfully")
            return True
        except subprocess.CalledProcessError as e:
            print(f"❌ Failed to install {package_name}: {e}")
            return False

# Required packages for this notebook
required_packages = [
    ("qdrant-client", "qdrant_client"),
    ("pandas", "pandas"),
    ("numpy", "numpy"), 
    ("tqdm", "tqdm"),
    ("matplotlib", "matplotlib")
]

print("🔧 Checking and installing dependencies...")
all_installed = True
for package, import_name in required_packages:
    if not install_if_missing(package, import_name):
        all_installed = False

if all_installed:
    print("\n🎉 All dependencies ready!")
else:
    print("\n⚠️ Some packages failed to install. You may need to install them manually.")

## ⚙️ Qdrant Cloud Setup (Required)

For this notebook, you must use your own Qdrant Cloud cluster.

- Sign up and create a free cluster at: [cloud.qdrant.io](https://cloud.qdrant.io)
- Obtain your cluster URL and API key, then set these environment variables before connecting:

```python
import os
os.environ["QDRANT_URL"] = "https://your-cluster.qdrant.io:6333"
os.environ["QDRANT_API_KEY"] = "your-api-key"
```

No shared webinar cluster is provided in this notebook.

In [22]:
# Workshop Configuration
COLLECTION_NAME = "workshop_fundamentals"
VECTOR_SIZE = 384  # Compatible with many embedding models

print("🔐 Qdrant Cloud Setup (Your Own Credentials)")
print("=" * 40)

# Require user-provided cluster credentials
custom_url = os.getenv("QDRANT_URL")
custom_key = os.getenv("QDRANT_API_KEY")

if not custom_url or not custom_key:
    raise RuntimeError(
        "QDRANT_URL and QDRANT_API_KEY must be set. Create a free cluster at https://cloud.qdrant.io, "
        "then set the environment variables as shown in the previous cell."
    )

print("🌐 Using your Qdrant Cloud cluster")
print(f"   URL: {custom_url}")
print(f"   API Key: {'*' * (len(custom_key)-4) + custom_key[-4:]}")

print(f"\n📁 Collection: {COLLECTION_NAME}")
print(f"🎯 Vector size: {VECTOR_SIZE}")

## 🏗️ Dataset Creation

Let's create a small, portable dataset of FAQ and documentation entries.

In [23]:
# Create sample dataset
df = create_sample_dataset(size=150, seed=42)

print(f"📊 Created dataset with {len(df)} entries")
print(f"\n📂 Categories: {df['category'].value_counts().to_dict()}")
print(f"🌍 Languages: {df['lang'].value_counts().to_dict()}")

# Preview the data
print("\n🔍 Sample entries:")
df.head()

## 🎲 Generate Embedding Vectors

For this fundamentals notebook, we'll use random normalized vectors to focus on Qdrant concepts. In real applications, you'd use actual embedding models.

In [24]:
# Generate random normalized vectors for demonstration
# In production, you would use real embeddings from models like:
# - sentence-transformers
# - OpenAI embeddings
# - FastEmbed

np.random.seed(42)
vectors = np.random.randn(len(df), VECTOR_SIZE)
# Normalize vectors for cosine similarity
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

print(f"✅ Generated {vectors.shape[0]} vectors of dimension {vectors.shape[1]}")
print(f"📏 Vector norm check (should be ~1.0): {np.linalg.norm(vectors[0]):.4f}")
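
Why normalization matters can be checked with NumPy alone. The sketch below is illustrative and separate from the notebook's `vectors`: it uses its own seeded generator and a hypothetical `demo` array of the same shape. It shows that for unit-length vectors cosine similarity equals the dot product, and that random high-dimensional vectors are nearly orthogonal to each other, so similarity scores between different random points stay close to zero.

```python
import numpy as np

# Independent generator; does not touch the notebook's np.random.seed state.
rng = np.random.default_rng(42)
demo = rng.standard_normal((150, 384))               # same shape as the notebook's vectors
demo /= np.linalg.norm(demo, axis=1, keepdims=True)  # scale every row to unit length

# For unit vectors, cosine similarity reduces to a plain dot product.
a, b = demo[0], demo[1]
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
dot = float(a @ b)

# Similarity of vector 0 against every other vector (itself excluded).
others = demo[1:] @ demo[0]

print(f"cosine={cosine:.4f}  dot={dot:.4f}")
print(f"largest |similarity| to another vector: {np.abs(others).max():.4f}")
```

With real embeddings the scores carry semantic meaning; with random vectors only a point's own vector scores high, which is worth keeping in mind when reading the search results later in this notebook.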

## 🔌 Connect to Qdrant

In [25]:
# Initialize Qdrant Cloud client (requires your own credentials)
try:
    qdrant_url = os.getenv("QDRANT_URL")
    qdrant_key = os.getenv("QDRANT_API_KEY")
    if not qdrant_url or not qdrant_key:
        raise RuntimeError(
            "QDRANT_URL and QDRANT_API_KEY must be set. Create a free cluster at https://cloud.qdrant.io, "
            "then set the environment variables as shown above."
        )

    client = QdrantClient(url=qdrant_url, api_key=qdrant_key)
    # Test connection
    health = client.get_collections()
    print("🌐 Connected to Qdrant Cloud successfully!")
    print(f"📦 Existing collections: {[c.name for c in health.collections]}")
except Exception as e:
    print(f"❌ Connection failed: {e}")
    print("\n🔧 Troubleshooting:")
    print("1. Check your QDRANT_URL (should start with https://)")
    print("2. Verify your QDRANT_API_KEY from cloud.qdrant.io")
    print("3. Make sure your cluster is running (check Qdrant Cloud dashboard)")
    print("4. Try setting variables directly in Python:")
    print('   os.environ["QDRANT_URL"] = "https://your-cluster.qdrant.io:6333"')
    print('   os.environ["QDRANT_API_KEY"] = "your-api-key"')
    raise

## 📚 Create Collection

Collections in Qdrant are like tables in databases - they store points (vectors + metadata).

In [26]:
# Define vector configuration
# Using a single unnamed (default) vector with cosine distance
vector_config = VectorParams(
    size=VECTOR_SIZE,
    distance=Distance.COSINE  # Good for text embeddings
)

# Create collection
ensure_collection(
    client=client,
    collection_name=COLLECTION_NAME,
    vector_config=vector_config,
    force_recreate=False  # Set to True to start fresh
)

# Get collection info
info = client.get_collection(COLLECTION_NAME)
print(f"\n📋 Collection info:")
print(f"   Points count: {info.points_count}")
print(f"   Vector size: {info.config.params.vectors.size}")
print(f"   Distance: {info.config.params.vectors.distance}")

## 📥 Ingest Points

Points are the core data unit in Qdrant: ID + Vector + Payload (metadata).

In [27]:
# Define which DataFrame columns to include as payload
payload_columns = ["text", "category", "lang", "timestamp"]

# Upsert points in batches
print("📤 Uploading points...")
upsert_points_batch(
    client=client,
    collection_name=COLLECTION_NAME,
    df=df,
    vectors=vectors,
    payload_cols=payload_columns,
    batch_size=50
)

# Verify upload
info = client.get_collection(COLLECTION_NAME)
print(f"\n✅ Upload complete! Collection now has {info.points_count} points")

## 🏷️ Create Payload Indexes

Payload indexes speed up filtering operations.

In [28]:
# Create indexes for fields we'll filter on
create_payload_index(client, COLLECTION_NAME, "category", "keyword")
create_payload_index(client, COLLECTION_NAME, "lang", "keyword")
create_payload_index(client, COLLECTION_NAME, "timestamp", "integer")

print("\n📖 Payload indexes created for faster filtering!")

## 🔍 First Vector Search

Let's perform our first similarity search using one of our vectors as the query.

In [29]:
# Use the first document's vector as our query
query_idx = 0
query_vector = vectors[query_idx]
query_text = df.iloc[query_idx]["text"]

print(f"🔍 Query text: '{query_text}'")
print(f"📂 Query category: {df.iloc[query_idx]['category']}")

# Perform search
results = search_dense(
    client=client,
    collection_name=COLLECTION_NAME,
    query_vector=query_vector,
    limit=5,
    with_payload=True
)

print_search_results(results, "🎯 Most Similar Documents")

## 🎛️ Filtered Search

Now let's add filters to search within specific categories or time ranges.

In [30]:
# Create a filter for the product category
category_filter = Filter(
    must=[
        FieldCondition(
            key="category",
            match=MatchValue(value="product")
        )
    ]
)

# Search with filter
filtered_results = search_dense(
    client=client,
    collection_name=COLLECTION_NAME,
    query_vector=query_vector,
    limit=5,
    filter_condition=category_filter,
    with_payload=True
)

print_search_results(filtered_results, "🎯 Product Category Results")

# Compare result counts
print(f"\n📊 Results comparison:")
print(f"   Unfiltered: {len(results)} results")
print(f"   Product only: {len(filtered_results)} results")

## ⏰ Time-based Filtering

Let's filter by timestamp to find recent documents.

In [31]:
import time

# Calculate timestamp for "last 6 months"
six_months_ago = int(time.time()) - (6 * 30 * 24 * 60 * 60)

# Create time-based filter
time_filter = Filter(
    must=[
        FieldCondition(
            key="timestamp",
            range={"gt": six_months_ago}
        )
    ]
)

# Search recent documents
recent_results = search_dense(
    client=client,
    collection_name=COLLECTION_NAME,
    query_vector=query_vector,
    limit=5,
    filter_condition=time_filter,
    with_payload=True
)

print_search_results(recent_results, "🕒 Recent Documents (Last 6 months)")

print(f"\n📊 Time filtering:")
print(f"   All documents: {len(results)} results")
print(f"   Recent only: {len(recent_results)} results")
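
The cutoff above approximates six months as 6 × 30 days. The same arithmetic can be written with `datetime` and `timedelta`, which makes the intent explicit and the value easy to test. In this sketch, `cutoff_timestamp` and the fixed reference date are hypothetical names and values chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def cutoff_timestamp(days, now=None):
    """Return the Unix timestamp (in seconds) for `days` before `now` (UTC)."""
    now = now or datetime.now(timezone.utc)
    return int((now - timedelta(days=days)).timestamp())


# Hypothetical fixed reference time so the result is reproducible.
reference = datetime(2024, 7, 1, tzinfo=timezone.utc)
cutoff = cutoff_timestamp(180, now=reference)

# The value can serve as the "gt" bound of a range condition on "timestamp".
range_condition = {"gt": cutoff}
print(range_condition)
```

Passing `now` explicitly keeps the function deterministic in tests, while the default uses the current UTC time as the notebook cell does.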

## 🎚️ Score Threshold

Use score thresholds to filter out low-quality matches.

In [32]:
# Search with score threshold
high_quality_results = search_dense(
    client=client,
    collection_name=COLLECTION_NAME,
    query_vector=query_vector,
    limit=10,
    score_threshold=0.3,  # Only results with score >= 0.3
    with_payload=True
)

print_search_results(high_quality_results, "🎯 High Quality Matches (score >= 0.3)")

print(f"\n📊 Quality filtering:")
print(f"   All results: {len(results)} results")
print(f"   High quality: {len(high_quality_results)} results")

# Show score distribution
scores = [r.score for r in results]
print(f"\n📈 Score statistics:")
print(f"   Max: {max(scores):.4f}")
print(f"   Min: {min(scores):.4f}")
print(f"   Mean: {np.mean(scores):.4f}")
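
What a score threshold does can be emulated with a brute-force search in NumPy. This is only a conceptual sketch: Qdrant does not search this way internally, and `top_k_with_threshold` is a hypothetical helper. It ranks all vectors by cosine score, keeps the top `k`, and then drops anything below the threshold.

```python
import numpy as np


def top_k_with_threshold(query, matrix, k, threshold):
    """Return up to k (index, score) pairs with score >= threshold, best first.

    Assumes `query` and the rows of `matrix` are unit-length, so the dot
    product equals cosine similarity.
    """
    scores = matrix @ query
    order = np.argsort(-scores)[:k]  # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in order if scores[i] >= threshold]


rng = np.random.default_rng(42)
matrix = rng.standard_normal((150, 384))
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

threshold_hits = top_k_with_threshold(matrix[0], matrix, k=10, threshold=0.3)
for index, score in threshold_hits:
    print(f"index={index}  score={score:.4f}")
```

With random vectors, scores between different points cluster near zero, so a threshold of 0.3 tends to leave little besides the query's own point. With real embeddings, a useful threshold depends on the model and should be tuned on actual data.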

## 🌐 Multi-Language Search

Filter by language to search within specific locales.

In [33]:
# Search within English documents only
english_filter = Filter(
    must=[
        FieldCondition(
            key="lang",
            match=MatchValue(value="en")
        )
    ]
)

english_results = search_dense(
    client=client,
    collection_name=COLLECTION_NAME,
    query_vector=query_vector,
    limit=5,
    filter_condition=english_filter,
    with_payload=True
)

print_search_results(english_results, "🇺🇸 English Documents Only")

# Language distribution in results
all_langs = [r.payload["lang"] for r in results]
en_langs = [r.payload["lang"] for r in english_results]

print(f"\n🌍 Language distribution:")
print(f"   All results: {pd.Series(all_langs).value_counts().to_dict()}")
print(f"   English only: {pd.Series(en_langs).value_counts().to_dict()}")

## 🔍 Complex Filtering

Combine multiple filters using boolean logic.

In [34]:
# Complex filter: (product OR policy) AND english AND recent
complex_filter = Filter(
    must=[
        # Language must be English
        FieldCondition(
            key="lang",
            match=MatchValue(value="en")
        ),
        # Timestamp must be recent
        FieldCondition(
            key="timestamp",
            range={"gt": six_months_ago}
        )
    ],
    should=[
        # Category should be product OR policy
        FieldCondition(
            key="category",
            match=MatchValue(value="product")
        ),
        FieldCondition(
            key="category",
            match=MatchValue(value="policy")
        )
    ]
)

complex_results = search_dense(
    client=client,
    collection_name=COLLECTION_NAME,
    query_vector=query_vector,
    limit=5,
    filter_condition=complex_filter,
    with_payload=True
)

print_search_results(complex_results, "🎯 Complex Filter: Recent English Product/Policy Docs")

if complex_results:
    categories = [r.payload["category"] for r in complex_results]
    languages = [r.payload["lang"] for r in complex_results]
    
    print(f"\n✅ Filter verification:")
    print(f"   Categories found: {set(categories)}")
    print(f"   Languages found: {set(languages)}")
    print(f"   All recent: {all(r.payload['timestamp'] > six_months_ago for r in complex_results)}")
else:
    print("\n⚠️  No results found matching the complex filter criteria")

## 🖥️ Web UI Checkpoint (Optional)

Explore the collection in the Qdrant Cloud dashboard, or in the local web UI if you're running Qdrant locally.

In [35]:
# Qdrant Cloud Dashboard Access
qdrant_url = os.getenv("QDRANT_URL", "")

if "localhost" in qdrant_url:
    print("🌐 Local Qdrant Web UI:")
    print(f"   Open: {qdrant_url.replace(':6333', ':6333/dashboard')}")
    print(f"   Navigate to collection: {COLLECTION_NAME}")
else:
    print("🌐 Qdrant Cloud Dashboard:")
    print("   1. Go to https://cloud.qdrant.io")
    print("   2. Select your cluster")
    print("   3. Use the 'Console' tab to:")
    print("      • View collection schema and points")
    print("      • Run vector searches")
    print("      • Test payload filters")
    print("      • Monitor cluster performance")
    print(f"   4. Explore collection: {COLLECTION_NAME}")
    
print("\n🔍 Try these in the dashboard:")
print("   • Browse points and payload data")
print("   • Run similarity searches")
print("   • Test different filters")
print("   • View collection statistics")

## 📊 Summary & Key Concepts

Let's summarize what we've learned about Qdrant fundamentals.

In [36]:
# Collection statistics
final_info = client.get_collection(COLLECTION_NAME)

print("🎉 Qdrant Fundamentals Summary")
print("=" * 40)
print(f"\n📚 Collection: {COLLECTION_NAME}")
print(f"   📊 Total points: {final_info.points_count}")
print(f"   📏 Vector dimension: {final_info.config.params.vectors.size}")
print(f"   📐 Distance metric: {final_info.config.params.vectors.distance}")

print(f"\n🏷️ Payload structure:")
# Fetch any one point; this avoids assuming that a point with ID 1 exists
sample_point = client.scroll(COLLECTION_NAME, limit=1, with_payload=True)[0][0]
for key, value in sample_point.payload.items():
    print(f"   {key}: {type(value).__name__} - {value}")

print(f"\n🔍 Search capabilities demonstrated:")
print("   ✅ Basic vector similarity search")
print("   ✅ Payload filtering (category, language, time)")
print("   ✅ Complex boolean filters (AND, OR logic)")
print("   ✅ Score thresholding")
print("   ✅ Payload indexes for fast filtering")

print(f"\n🎯 Key takeaways:")
print("   • Collections store points (vectors + metadata)")
print("   • Payload enables rich filtering capabilities")
print("   • Indexes dramatically speed up filtered searches")
print("   • Cosine distance works well for text embeddings")
print("   • Score thresholds help filter low-quality matches")

print(f"\n🚀 Ready for Notebook 2: Hybrid Search!")

## 🎮 Stretch Goals (Optional)

Try these additional experiments to deepen your understanding:

### 🔍 Full-Text Search with Payload Index

Add a full-text index to search within document text.

In [37]:
# Create full-text index on the text field
try:
    create_payload_index(client, COLLECTION_NAME, "text", "text")
    print("✅ Full-text index created!")
    
    # Example: Search for documents containing specific terms
    # Note: This searches in payload, not vector similarity
    text_filter = Filter(
        must=[
            FieldCondition(
                key="text",
                match={"text": "password"}  # Find docs mentioning "password"
            )
        ]
    )
    
    text_results = client.scroll(
        collection_name=COLLECTION_NAME,
        scroll_filter=text_filter,
        limit=5,
        with_payload=True
    )[0]  # scroll returns (points, next_page_offset)
    
    print(f"\n🔍 Full-text search results for 'password':")
    for i, point in enumerate(text_results, 1):
        print(f"{i}. {point.payload['text'][:80]}...")
        
except Exception as e:
    print(f"Note: Full-text search might not be available: {e}")

### 🎯 Second Named Vector Slot

Prepare for multi-vector scenarios by adding a second vector configuration.

In [38]:
# This would typically be done when creating the collection
# For demonstration, let's create a new collection with multiple named vectors

MULTI_VECTOR_COLLECTION = "workshop_multi_vector"

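# Define multiple named vectors.
# Note: VectorParams describes dense vectors only, and size must be at least 1,
# so a sparse slot cannot be declared as VectorParams(size=0). Sparse vectors
# are configured separately through `sparse_vectors_config` (SparseVectorParams).
multi_vector_config = {
    "text_dense": VectorParams(size=384, distance=Distance.COSINE),
    "text_dense_dot": VectorParams(size=384, distance=Distance.DOT)  # Second dense slot for illustration
}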

try:
    ensure_collection(
        client=client,
        collection_name=MULTI_VECTOR_COLLECTION,
        vector_config=multi_vector_config,
        force_recreate=True
    )
    
    print(f"✅ Created multi-vector collection: {MULTI_VECTOR_COLLECTION}")
    
    # Show collection info
    info = client.get_collection(MULTI_VECTOR_COLLECTION)
    print(f"   Vector configurations:")
    if hasattr(info.config.params, 'vectors') and isinstance(info.config.params.vectors, dict):
        for name, config in info.config.params.vectors.items():
            print(f"     {name}: size={config.size}, distance={config.distance}")
    
    print(f"\n🚀 Ready for hybrid search in Notebook 2!")
    
except Exception as e:
    print(f"Note: Multi-vector setup encountered an issue: {e}")

## 🧹 Cleanup (Optional)

Uncomment to clean up collections after the workshop.

In [39]:
# Uncomment to clean up collections
# PRESERVE_COLLECTIONS = True  # Set to False to delete collections

# if not PRESERVE_COLLECTIONS:
#     try:
#         client.delete_collection(COLLECTION_NAME)
#         print(f"🗑️ Deleted collection: {COLLECTION_NAME}")
#     except Exception as e:
#         print(f"Note: Could not delete collection: {e}")
        
#     try:
#         client.delete_collection(MULTI_VECTOR_COLLECTION)
#         print(f"🗑️ Deleted collection: {MULTI_VECTOR_COLLECTION}")
#     except Exception as e:
#         print(f"Note: Could not delete collection: {e}")
# else:
#     print(f"💾 Collections preserved for next notebooks")

print(f"\n✨ Notebook 1 complete! Move on to 02_hybrid_search.ipynb")