[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rdmurugan/d-vecDB/blob/master/python-client/vectordb_client/examples/google_colab_example.ipynb)

# d-vecDB Python Client - Google Colab Example

This notebook demonstrates how to use the d-vecDB Python client in Google Colab for vector similarity search and embeddings management.

## What you'll learn:
- Install and set up the d-vecDB client in Colab
- Connect to a remote d-vecDB server
- Create collections and insert vectors
- Perform similarity searches
- Work with text embeddings using sentence transformers

## 🔧 Installation

First, let's install the required packages:

In [None]:
# Install the d-vecDB Python client and dependencies
# (scikit-learn is listed explicitly because PCA is imported from it below)
!pip install vectordb-client sentence-transformers scikit-learn numpy pandas matplotlib

# Import required libraries
import numpy as np
import pandas as pd
from typing import List, Dict, Any
import json
import time
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

print("✅ Installation complete!")

## 🚀 Setting up the VectorDB Client

**Note**: This example needs a running d-vecDB server that Colab can reach. You can:
1. Run a local server and use ngrok to expose it
2. Use a cloud-hosted d-vecDB instance

The client setup code in this section uses placeholder values; replace the host and port with your actual server details.
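
If you only have the public URL that a tunnel or hosting provider gives you, you need to split it into a host and a port before filling in the configuration. The helper below is a minimal sketch using only the standard library; `host_and_port` is a hypothetical name, not part of the d-vecDB client, and the default ports (80 for `http`, 443 for `https`) are the usual web defaults rather than anything d-vecDB specific.

```python
from typing import Tuple
from urllib.parse import urlsplit

# Standard default ports per URL scheme (an assumption about your tunnel setup).
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_and_port(url: str) -> Tuple[str, int]:
    """Split a URL such as 'http://abc123.ngrok.io' into (host, port)."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
        raise ValueError(f"expected an http(s) URL with a host, got {url!r}")
    # Use the explicit port if the URL has one, otherwise the scheme default.
    return parts.hostname, parts.port or DEFAULT_PORTS[parts.scheme]
```

For example, `SERVER_HOST, SERVER_PORT = host_and_port("http://abc123.ngrok.io")` would fill in the two configuration values used in the next cell. Whether your server accepts plain HTTP or needs TLS depends on how it is deployed.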

In [None]:
from vectordb_client import VectorDBClient
from vectordb_client.types import (
    CollectionConfig, Vector, DistanceMetric, 
    IndexConfig, VectorType
)

# Configuration - Replace with your server details
SERVER_HOST = "your-server-host.com"  # Replace with your server host
SERVER_PORT = 8080  # Replace with your server port

# For local development with ngrok, it might look like:
# SERVER_HOST = "abc123.ngrok.io"
# SERVER_PORT = 80

print(f"üîå Connecting to d-vecDB server at {SERVER_HOST}:{SERVER_PORT}...")

try:
    # Initialize the client
    client = VectorDBClient(host=SERVER_HOST, port=SERVER_PORT)
    
    # Test the connection
    if client.ping():
        print("✅ Successfully connected to d-vecDB!")
        
        # Get server info
        server_info = client.get_server_info()
        print(f"📊 Server Info: {server_info}")
    else:
        print("❌ Could not connect to d-vecDB server")
        print("Please check your server configuration and try again.")
        
except Exception as e:
    print(f"❌ Connection failed: {e}")
    print("\n💡 To run this example, you need:")
    print("1. A running d-vecDB server")
    print("2. Update SERVER_HOST and SERVER_PORT above")
    print("3. Ensure the server is accessible from Colab")
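
Tunnels and freshly started servers can take a few seconds to become reachable, so a single connection attempt may fail even when the configuration is correct. The sketch below retries a connection callable with exponential backoff. `retry_with_backoff` is a hypothetical helper, not part of the d-vecDB client, and it assumes the callable signals failure by raising an exception.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    connect: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, doubling the wait after each failure."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as exc:  # Any failure counts as "not reachable yet".
            last_error = exc
            if attempt < attempts - 1:
                # Wait base_delay, then 2x, 4x, ... before the next attempt.
                sleep(base_delay * 2 ** attempt)
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

In this notebook it could wrap the client construction, for example `client = retry_with_backoff(lambda: VectorDBClient(host=SERVER_HOST, port=SERVER_PORT))`. If the constructor does not raise on an unreachable server, wrap a call that does.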

## 📄 Preparing Sample Data

Let's create some sample documents and generate embeddings for them:

In [None]:
# Sample documents for demonstration
sample_documents = [
    "The quick brown fox jumps over the lazy dog",
    "Machine learning is a subset of artificial intelligence",
    "Vector databases enable efficient similarity search",
    "Python is a popular programming language for data science",
    "Natural language processing helps computers understand text",
    "Deep learning models can generate realistic images",
    "Cloud computing provides scalable infrastructure solutions",
    "Database optimization improves query performance",
    "Artificial neural networks mimic biological brain functions",
    "Big data analytics reveals insights from large datasets"
]

print(f"üìö Sample documents ({len(sample_documents)} total):")
for i, doc in enumerate(sample_documents, 1):
    print(f"{i:2d}. {doc}")

## 🔤 Generating Text Embeddings

We'll use sentence-transformers to convert our text documents into vector embeddings:

In [None]:
# Initialize the sentence transformer model
print("🤖 Loading sentence transformer model...")
model = SentenceTransformer('all-MiniLM-L6-v2')  # Lightweight model, good for Colab

# Generate embeddings
print("⚡ Generating embeddings...")
embeddings = model.encode(sample_documents)

print(f"✅ Generated {len(embeddings)} embeddings")
print(f"📏 Embedding dimension: {embeddings.shape[1]}")
print(f"🔢 Data type: {embeddings.dtype}")

# Convert to list format for d-vecDB
embedding_vectors = [embedding.tolist() for embedding in embeddings]

print("\n📊 First embedding preview (first 10 dimensions):")
print(embedding_vectors[0][:10])
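
Before sending vectors to a server, it can help to check that they all share one dimension and contain only finite numbers, since a collection is created with a fixed dimension. The following is a minimal sketch of such a check in plain Python; `validate_embeddings` is a hypothetical helper and not part of the d-vecDB client.

```python
import math
from typing import Sequence


def validate_embeddings(vectors: Sequence[Sequence[float]]) -> int:
    """Return the shared dimension of all vectors, or raise ValueError."""
    if not vectors:
        raise ValueError("no vectors to validate")
    dimension = len(vectors[0])
    if dimension == 0:
        raise ValueError("vectors must have at least one component")
    for index, vector in enumerate(vectors):
        if len(vector) != dimension:
            raise ValueError(
                f"vector {index} has {len(vector)} components, expected {dimension}"
            )
        # NaN or infinity usually points to a problem upstream in the model.
        if not all(math.isfinite(value) for value in vector):
            raise ValueError(f"vector {index} contains NaN or infinity")
    return dimension
```

Calling `validate_embeddings(embedding_vectors)` on the list built above returns the dimension to use when creating the collection.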

## 📈 Visualizing Embeddings

Let's visualize our embeddings in 2D using PCA:

In [None]:
# Reduce embeddings to 2D for visualization
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

# Create the plot
plt.figure(figsize=(12, 8))
scatter = plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], 
                     alpha=0.7, s=100, c=range(len(sample_documents)), 
                     cmap='tab10')

# Add labels for each point
for i, doc in enumerate(sample_documents):
    plt.annotate(f"{i+1}", 
                xy=(embeddings_2d[i, 0], embeddings_2d[i, 1]),
                xytext=(5, 5), textcoords='offset points',
                fontsize=12, fontweight='bold')

plt.title('Document Embeddings Visualization (PCA)', fontsize=16)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)', fontsize=12)
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)', fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nüìã Document Reference:")
for i, doc in enumerate(sample_documents, 1):
    print(f"{i:2d}. {doc[:50]}{'...' if len(doc) > 50 else ''}")

## 📁 Creating a Collection

Now let's create a collection in d-vecDB to store our embeddings:

In [None]:
# Collection configuration
collection_name = "colab_text_embeddings"
embedding_dimension = len(embedding_vectors[0])

print(f"üìÅ Creating collection '{collection_name}'...")

try:
    # Clean up any existing collection
    try:
        client.delete_collection(collection_name)
        print(f"üóëÔ∏è  Deleted existing collection")
    except:
        pass
    
    # Create new collection with cosine similarity
    response = client.create_collection_simple(
        name=collection_name,
        dimension=embedding_dimension,
        distance_metric=DistanceMetric.COSINE
    )
    
    print(f"‚úÖ Created collection: {response}")
    
    # List all collections to verify
    collections = client.list_collections()
    print(f"üìã Available collections: {collections}")
    
except Exception as e:
    print(f"‚ùå Failed to create collection: {e}")
    print("Please ensure your d-vecDB server is running and accessible.")

## ⬆️ Inserting Vectors

Let's insert our document embeddings into the collection:

In [None]:
print("⬆️  Inserting vectors into collection...")

try:
    # Prepare vectors with metadata
    vectors_to_insert = []
    
    for i, (doc, embedding) in enumerate(zip(sample_documents, embedding_vectors)):
        vector = Vector(
            id=str(i + 1),
            values=embedding,
            metadata={
                "document": doc,
                "length": len(doc),
                "index": i + 1,
                "word_count": len(doc.split())
            }
        )
        vectors_to_insert.append(vector)
    
    # Insert vectors in batch (perf_counter is a monotonic clock suited to timing)
    start_time = time.perf_counter()
    response = client.upsert_vectors(collection_name, vectors_to_insert)
    insert_time = time.perf_counter() - start_time
    
    print(f"✅ Inserted {len(vectors_to_insert)} vectors in {insert_time:.2f} seconds")
    print(f"📊 Insert response: {response}")
    
    # Get collection statistics
    try:
        stats = client.get_collection_stats(collection_name)
        print(f"📈 Collection stats: {stats}")
    except Exception:
        print("ℹ️  Collection stats not available")
    
except Exception as e:
    print(f"❌ Failed to insert vectors: {e}")
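
Ten vectors fit comfortably in one request, but larger datasets are usually sent in fixed-size chunks so that no single request grows too large. The helper below is a minimal sketch of that chunking step; `batched` is a hypothetical name, and a suitable batch size depends on your server and vector dimension.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

With the names used in this notebook, a chunked upload could look like `for batch in batched(vectors_to_insert, 500): client.upsert_vectors(collection_name, batch)`, where 500 is only an illustrative value.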

## 🔍 Similarity Search

Now let's perform similarity searches to find related documents:

In [None]:
def search_similar_documents(query_text: str, top_k: int = 5):
    """Search for documents similar to the query text."""
    print(f"\nüîç Searching for: '{query_text}'")
    print("="*60)
    
    try:
        # Generate embedding for query
        query_embedding = model.encode([query_text])[0].tolist()
        
        # Perform search
        start_time = time.perf_counter()
        results = client.search_vectors(
            collection_name=collection_name,
            query_vector=query_embedding,
            top_k=top_k
        )
        search_time = time.perf_counter() - start_time
        
        print(f"⚡ Search completed in {search_time:.3f} seconds")
        print(f"📋 Found {len(results)} results:\n")
        
        for i, result in enumerate(results, 1):
            doc_text = result.metadata.get('document', 'N/A')
            # Assumes the server reports cosine distance, so similarity = 1 - distance
            similarity = 1 - result.score
            
            print(f"{i}. [Similarity: {similarity:.3f}] {doc_text}")
        
        return results
        
    except Exception as e:
        print(f"❌ Search failed: {e}")
        return []

# Example searches
search_queries = [
    "artificial intelligence and machine learning",
    "database and data storage",
    "programming languages for data",
    "computer vision and image processing"
]

for query in search_queries:
    search_similar_documents(query, top_k=3)
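
The search function converts each score with `1 - result.score`, which treats the score as a cosine distance. To make that relationship concrete, here is a minimal pure-Python sketch of cosine similarity and the matching distance. It illustrates the math only; it is not how the server computes scores.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance form: 0 for the same direction, up to 2 for opposite directions."""
    return 1 - cosine_similarity(a, b)
```

Computing `cosine_distance(query_embedding, embedding_vectors[0])` locally and comparing it with the score the server returns for document 1 is a quick way to confirm which convention your server uses.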

## 🎯 Interactive Search

Try your own search queries:

In [None]:
# Interactive search - modify this cell to try different queries
your_query = "neural networks and AI"  # ← Change this to your query

print("🎯 Your custom search:")
results = search_similar_documents(your_query, top_k=5)

# Show detailed results with metadata
if results:
    print("\n📊 Detailed Results:")
    print("="*80)
    
    for i, result in enumerate(results, 1):
        similarity = 1 - result.score
        metadata = result.metadata
        
        print(f"\nResult {i}:")
        print(f"  📄 Document: {metadata.get('document', 'N/A')}")
        print(f"  🎯 Similarity: {similarity:.4f}")
        print(f"  📏 Length: {metadata.get('length', 'N/A')} characters")
        print(f"  💬 Words: {metadata.get('word_count', 'N/A')}")
        print(f"  🆔 ID: {result.id}")

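## 🔧 Advanced Vector Operations

The next cell retrieves a single vector by ID, runs a search with client-side metadata filtering, and fetches several vectors in one batch call: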

In [None]:
print("üîß Advanced Vector Operations")
print("="*50)

try:
    # 1. Get a specific vector
    print("\n1️⃣ Retrieving specific vector...")
    vector_id = "1"
    retrieved_vector = client.get_vector(collection_name, vector_id)
    if retrieved_vector:
        print(f"✅ Retrieved vector {vector_id}:")
        print(f"   Document: {retrieved_vector.metadata.get('document', 'N/A')[:50]}...")
        print(f"   Dimension: {len(retrieved_vector.values)}")
    
    # 2. Filter search with metadata
    print("\n2️⃣ Filtered search (documents with >50 characters)...")
    query_text = "data science programming"
    query_embedding = model.encode([query_text])[0].tolist()
    
    # Note: Metadata filtering syntax depends on your d-vecDB server implementation
    # This is a conceptual example - adjust based on your server's API
    filtered_results = client.search_vectors(
        collection_name=collection_name,
        query_vector=query_embedding,
        top_k=5
        # filter={"length": {"$gt": 50}}  # Uncomment if your server supports filtering
    )
    
    print(f"üìã Filtered results: {len(filtered_results)}")
    for result in filtered_results[:3]:
        doc_length = result.metadata.get('length', 0)
        if doc_length > 50:  # Client-side filtering as example
            similarity = 1 - result.score
            print(f"   ‚Ä¢ [Similarity: {similarity:.3f}, Length: {doc_length}] {result.metadata.get('document', 'N/A')[:60]}...")
    
    # 3. Batch operations
    print("\n3️⃣ Batch vector retrieval...")
    vector_ids = ["1", "3", "5"]
    batch_vectors = client.get_vectors(collection_name, vector_ids)
    print(f"✅ Retrieved {len(batch_vectors)} vectors in batch")
    
    for vector in batch_vectors:
        doc = vector.metadata.get('document', 'N/A')
        print(f"   • ID {vector.id}: {doc[:40]}...")
    
except Exception as e:
    print(f"❌ Advanced operations failed: {e}")
    print("Some operations may not be supported by your d-vecDB server version.")
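
When the server does not support metadata filters, the same conditions can be evaluated on the client after the search returns. The sketch below interprets a small filter dictionary in the style of the commented-out `{"length": {"$gt": 50}}` example. The operator names and the `matches` helper are assumptions made for illustration; they do not describe the d-vecDB filter syntax.

```python
import operator
from typing import Any, Dict

# Hypothetical operator set, modelled on the "$gt" example in the cell above.
_OPERATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$eq": operator.eq,
    "$ne": operator.ne,
}


def matches(metadata: Dict[str, Any], conditions: Dict[str, Any]) -> bool:
    """Return True if metadata satisfies every condition (logical AND)."""
    for field, condition in conditions.items():
        if field not in metadata:
            return False
        value = metadata[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"$gt": 50}
            for name, operand in condition.items():
                if name not in _OPERATORS:
                    raise ValueError(f"unsupported operator: {name}")
                if not _OPERATORS[name](value, operand):
                    return False
        elif value != condition:
            # Plain value form means equality, e.g. {"index": 3}
            return False
    return True
```

Applied to search results, this could read `[r for r in filtered_results if matches(r.metadata, {"length": {"$gt": 50}})]`. Because filtering happens after the search, request a larger `top_k` than you need so that enough results survive the filter.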

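## ⚡ Performance Testing

Let's time a handful of searches. With only 10 vectors and a remote server, the measured times will likely reflect network round-trip latency more than index performance: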

In [None]:
print("⚡ Performance Testing")
print("="*40)

try:
    # Test search performance
    test_queries = [
        "machine learning algorithms",
        "database optimization techniques", 
        "natural language processing",
        "cloud computing infrastructure",
        "artificial intelligence applications"
    ]
    
    search_times = []
    
    print("üîç Running search performance test...")
    for i, query in enumerate(test_queries, 1):
        query_embedding = model.encode([query])[0].tolist()
        
        start_time = time.time()
        results = client.search_vectors(
            collection_name=collection_name,
            query_vector=query_embedding,
            top_k=5
        )
        search_time = (time.time() - start_time) * 1000  # Convert to milliseconds
        search_times.append(search_time)
        
        print(f"   Query {i}: {search_time:.2f}ms ({len(results)} results)")
    
    # Performance statistics
    avg_time = np.mean(search_times)
    min_time = np.min(search_times)
    max_time = np.max(search_times)
    
    print(f"\nüìä Performance Summary:")
    print(f"   Average search time: {avg_time:.2f}ms")
    print(f"   Fastest search: {min_time:.2f}ms")
    print(f"   Slowest search: {max_time:.2f}ms")
    
    # Visualize performance
    plt.figure(figsize=(10, 6))
    plt.bar(range(1, len(search_times) + 1), search_times, alpha=0.7)
    plt.axhline(y=avg_time, color='r', linestyle='--', label=f'Average: {avg_time:.2f}ms')
    plt.xlabel('Query Number')
    plt.ylabel('Search Time (milliseconds)')
    plt.title('Vector Search Performance')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
except Exception as e:
    print(f"❌ Performance test failed: {e}")
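
Five single measurements give only a rough picture, because the first request often pays connection setup costs and a mean hides outliers. The sketch below adds warm-up runs and reports the median and a nearest-rank 95th percentile alongside the mean. `measure_latency_ms` is a hypothetical helper built on the standard library, not part of the d-vecDB client.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(
    operation: Callable[[], object], runs: int = 20, warmup: int = 3
) -> Dict[str, float]:
    """Time operation() over several runs and summarise latency in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):
        operation()  # Untimed calls so setup costs do not skew the samples.
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    # Nearest-rank percentile: smallest sample with at least 95% at or below it.
    p95_index = min(len(samples) - 1, math.ceil(0.95 * len(samples)) - 1)
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "p95": samples[p95_index],
        "max": samples[-1],
    }
```

To use it here, pass a zero-argument callable that runs one search, such as `lambda: client.search_vectors(collection_name=collection_name, query_vector=query_embedding, top_k=5)`, with `query_embedding` computed beforehand so that only the search is timed.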

## 🧹 Cleanup

Clean up resources when done:

In [None]:
print("üßπ Cleaning up resources...")

try:
    # Optionally delete the collection
    delete_collection = False  # Set to True if you want to clean up
    
    if delete_collection:
        response = client.delete_collection(collection_name)
        print(f"üóëÔ∏è  Deleted collection '{collection_name}': {response}")
    else:
        print(f"‚ÑπÔ∏è  Collection '{collection_name}' preserved for further use")
    
    # List remaining collections
    collections = client.list_collections()
    print(f"üìã Remaining collections: {collections}")
    
except Exception as e:
    print(f"‚ùå Cleanup failed: {e}")

print("\n‚úÖ Notebook execution completed!")

## 🚀 Next Steps

Congratulations! You've successfully:
- ✅ Set up the d-vecDB client in Google Colab
- ✅ Generated text embeddings using sentence transformers
- ✅ Created a vector collection
- ✅ Inserted vectors with metadata
- ✅ Performed similarity searches
- ✅ Tested search performance

### What to try next:

1. **Scale up**: Try with larger datasets (1000+ documents)
2. **Different embeddings**: Experiment with different sentence transformer models
3. **Real data**: Use your own documents or datasets
4. **Advanced features**: Explore filtering, metadata queries, and batch operations
5. **Integration**: Connect with your applications or data pipelines

### Useful Resources:

- 📚 [d-vecDB Documentation](https://github.com/rdmurugan/d-vecDB)
- 🤗 [Sentence Transformers](https://www.sbert.net/)
- 🐍 [Python Client API Reference](https://github.com/rdmurugan/d-vecDB/tree/master/python-client)

### Need Help?

- 🐛 Report issues: [GitHub Issues](https://github.com/rdmurugan/d-vecDB/issues)
- 💬 Discussions: [GitHub Discussions](https://github.com/rdmurugan/d-vecDB/discussions)

Happy vector searching! 🎉