# Vector Database Utilities

This notebook provides utility functions for exploring and understanding ChromaDB vector collections. These functions help with inspecting the schema and contents of vector databases used in the kitchen recipe analysis project.

In [None]:
import chromadb

# Path to the SQLite database
DB_PATH = "final/kitchen_db.sqlite"
# Path to the ChromaDB vector database
VECTOR_DB_PATH = "final/vector_db"

## Database Exploration Functions

The following functions help explore the contents and structure of ChromaDB collections.

In [None]:
def view_schema_info(collection_name: str, db_path: str = VECTOR_DB_PATH):
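    """
    View schema information for a collection (metadata fields and their data types).

    The schema is inferred from a sample of up to 100 records, so fields that
    appear only in later records may be missed.

    Args:
        collection_name: Name of the collection to analyze
        db_path: Path to the ChromaDB database

    Returns:
        A dict mapping each metadata key to the set of type names observed
        for it, or None if the collection is missing, empty, or unreadable.
    """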
    client = chromadb.PersistentClient(path=db_path)
    
    try:
        collection = client.get_collection(name=collection_name)
    except Exception as e:  # the exception type for a missing collection varies across ChromaDB versions
        print(f"Collection '{collection_name}' not found or could not be opened. Error: {e}")
        return None
    
    # Get a sample of records to analyze schema
    try:
        results = collection.get(
            limit=100,
            include=['metadatas']
        )
        
        if not results['metadatas']:
            print(f"Collection '{collection_name}' is empty or has no metadata.")
            return None
        
        # Analyze metadata fields
        print(f"\n=== Schema for '{collection_name}' collection ===\n")
        print("Metadata fields:")
        
        # Collect all possible keys and their types
        all_keys = set()
        key_types = {}
        key_examples = {}
        
        for metadata in results['metadatas']:
            if not metadata:  # a record stored without metadata may come back as None
                continue
            for key, value in metadata.items():
                all_keys.add(key)
                
                # Track the data type
                value_type = type(value).__name__
                if key not in key_types:
                    key_types[key] = set()
                key_types[key].add(value_type)
                
                # Store an example value (0 and False are valid examples; skip None and "")
                if key not in key_examples and value is not None and value != "":
                    example = str(value)
                    if len(example) > 50:
                        example = example[:50] + "..."
                    key_examples[key] = example
        
        # Display the schema information
        for key in sorted(all_keys):
            types_str = ", ".join(sorted(key_types[key]))  # sorted for stable output
            example = key_examples.get(key, "N/A")
            print(f"  - {key}: {types_str}")
            print(f"    Example: {example}")
        
        return key_types
    
    except Exception as e:
        print(f"Error getting schema info: {str(e)}")
        return None
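
The type-collection step inside `view_schema_info` can be tried without a database. The following is a minimal sketch, assuming the metadata arrives as a list of dictionaries (with `None` for records that have no metadata). `infer_metadata_schema` is a hypothetical helper name and the sample records are illustrative, not taken from the project's data.

```python
from typing import Any, Dict, List, Optional


def infer_metadata_schema(
    metadatas: List[Optional[Dict[str, Any]]],
) -> Dict[str, List[str]]:
    """Map each metadata key to the sorted list of value type names seen for it."""
    key_types: Dict[str, set] = {}
    for metadata in metadatas:
        if not metadata:  # skip records stored without metadata
            continue
        for key, value in metadata.items():
            key_types.setdefault(key, set()).add(type(value).__name__)
    return {key: sorted(types) for key, types in sorted(key_types.items())}


# Illustrative records only; not taken from the project's data.
sample = [
    {"title": "Pancakes", "servings": 4},
    {"title": "Omelette", "servings": "2"},
    None,
]
schema = infer_metadata_schema(sample)
print(schema)
```

A key that maps to more than one type name, such as `servings` here, usually signals inconsistent ingestion and is worth checking before filtering on that field.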

In [None]:
def collection_info(db_path: str = VECTOR_DB_PATH):
    """
    A simple function to display basic information about all collections.
    More robust against API changes than the other functions.
    
    Args:
        db_path: Path to the ChromaDB database
    """
    client = chromadb.PersistentClient(path=db_path)
    
    try:
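        collections = client.list_collections()
        print(f"Found {len(collections)} collections in {db_path}:")
        
        for item in collections:
            # Depending on the ChromaDB version, list_collections() returns
            # Collection objects or plain name strings; handle both.
            name = getattr(item, "name", item)
            print(f"\nCollection: {name}")
            
            try:
                collection = client.get_collection(name=str(name))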
                
                # Try to get count
                try:
                    count = collection.count()  # count() takes no filter argument
                    print(f"  Records: {count}")
                except Exception:
                    print("  Records: Could not retrieve")
                
                # Try to get the first few items
                try:
                    first_items = collection.get(limit=3, include=["metadatas"])
                    print(f"  Sample IDs: {first_items['ids']}")
                    
                    # Show first item metadata as example
                    if first_items['metadatas'] and first_items['metadatas'][0]:
                        print("  Sample metadata keys:", list(first_items['metadatas'][0].keys()))
                except Exception:
                    print("  Sample: Could not retrieve")
                    
            except Exception as e:
                print(f"  Error accessing collection: {str(e)}")
        
    except Exception as e:
        print(f"Error listing collections: {str(e)}")

## Usage Examples

Here are examples of how to use these utility functions to explore your vector database.

In [None]:
# List all collections and their basic information
collection_info()

In [None]:
# View detailed schema for a specific collection (replace 'recipes' with your collection name)
view_schema_info('recipes')
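
The dictionary returned by `view_schema_info` can also be inspected programmatically. As a rough sketch, the hypothetical helper `find_mixed_type_fields` below lists the metadata keys that were observed with more than one type. The input dictionary is a hand-written stand-in for a real return value.

```python
from typing import Dict, List, Set


def find_mixed_type_fields(key_types: Dict[str, Set[str]]) -> List[str]:
    """Return the metadata keys that were observed with more than one value type."""
    return sorted(key for key, types in key_types.items() if len(types) > 1)


# Hand-written stand-in for the dictionary returned by view_schema_info().
example_key_types = {"title": {"str"}, "servings": {"int", "str"}}
mixed = find_mixed_type_fields(example_key_types)
print(mixed)
```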

## API Key Setup

This project requires several API keys for accessing different services. Follow the instructions below to set up your environment in Kaggle.

### Required API Keys:

1. **Google API Key** - Used for Google services like maps, search, etc.
2. **OpenAI API Key** - For accessing OpenAI models
3. **Google Application Credentials** - JSON file for Google Speech-to-Text service
4. **FoodData Central API Key** - For accessing nutritional data from USDA's FoodData Central

In [None]:
# API Keys Setup
from kaggle_secrets import UserSecretsClient

# Retrieve API keys from Kaggle secrets.
# Add these secrets in your Kaggle notebook settings first.
user_secrets = UserSecretsClient()  # one client is enough for all lookups
GOOGLE_API_KEY = user_secrets.get_secret("GOOGLE_API_KEY")
OPENAI_API_KEY = user_secrets.get_secret("OPENAI_API_KEY")
# JSON content of the Google service account key (a string, not a file path)
SecretValueJson = user_secrets.get_secret("GOOGLE_APPLICATION_CREDENTIALS")
# Holds the USDA FoodData Central key, despite the secret's name
OPENFOODFACTS_API_KEY = user_secrets.get_secret("OPENFOODFACTS_API_KEY")

### How to Set Up Your API Keys in Kaggle

1. **FoodData Central API Key**:
   - Sign up at [USDA FoodData Central API Key Signup](https://fdc.nal.usda.gov/api-key-signup)
   - After registration, you will receive an API key via email
   - In your Kaggle notebook, add this key as a secret named `OPENFOODFACTS_API_KEY` (the code reads the FoodData Central key under this name)

2. **Google API Key**:
   - Create a project in [Google Cloud Console](https://console.cloud.google.com/)
   - Enable the APIs you need (Maps, Search, etc.)
   - Create an API key in the Credentials section
   - Add this key as a Kaggle secret named `GOOGLE_API_KEY`

3. **Google Application Credentials**:
   - This is required specifically for Speech-to-Text services
   - In Google Cloud Console, create a service account
   - Generate a JSON key file for this service account
   - Copy the entire content of the JSON file
   - Add the JSON content as a Kaggle secret named `GOOGLE_APPLICATION_CREDENTIALS`

4. **OpenAI API Key**:
   - Sign up for an account at [OpenAI](https://openai.com/)
   - Generate an API key from your account dashboard
   - Add this key as a Kaggle secret named `OPENAI_API_KEY`
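
Because the `GOOGLE_APPLICATION_CREDENTIALS` secret holds JSON text rather than a file, it usually has to be written to disk before Google client libraries can use it: they read the `GOOGLE_APPLICATION_CREDENTIALS` environment variable as a path to a key file. The following is a minimal sketch of that step. `write_credentials_file` is a hypothetical helper, and the placeholder string stands in for the real `SecretValueJson` value.

```python
import json
import os
import tempfile


def write_credentials_file(secret_json: str) -> str:
    """Validate the JSON text, write it to a temp file, and return the file path."""
    json.loads(secret_json)  # raises ValueError if the secret is not valid JSON
    fd, path = tempfile.mkstemp(suffix=".json")  # created readable by the owner only
    with os.fdopen(fd, "w") as handle:
        handle.write(secret_json)
    # Point Google client libraries at the file that was just written.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path
    return path


# Placeholder content; in the notebook this would be the SecretValueJson string.
placeholder_secret = '{"type": "service_account", "project_id": "example-project"}'
credentials_path = write_credentials_file(placeholder_secret)
```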

#### Adding Secrets to Kaggle:
1. In your Kaggle notebook, click on the '⚙️' icon (Settings) in the right sidebar
2. Go to the 'Secrets' section
3. Click the button to add a new secret
4. Enter the secret name (e.g., `OPENAI_API_KEY`) and value
5. Click 'Save'

Note: If you don't have access to certain APIs, you can adjust the code to work with the APIs you do have access to.
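
One way to make that adjustment is to treat every secret as optional and record which ones are missing. The sketch below assumes only that a failed lookup raises an exception. `load_optional_secrets` and `fake_get_secret` are hypothetical names, and the fake getter stands in for `UserSecretsClient().get_secret` so the example runs outside Kaggle.

```python
from typing import Callable, Dict, Iterable, Optional


def load_optional_secrets(
    get_secret: Callable[[str], str], names: Iterable[str]
) -> Dict[str, Optional[str]]:
    """Fetch each secret, storing None for any that cannot be retrieved."""
    secrets: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            secrets[name] = get_secret(name)
        except Exception:  # the exact exception type depends on the secrets backend
            secrets[name] = None
    return secrets


# Stand-in for UserSecretsClient().get_secret, used here so the sketch runs anywhere.
def fake_get_secret(name: str) -> str:
    available = {"OPENAI_API_KEY": "placeholder-value"}
    return available[name]  # raises KeyError for secrets that were never added


loaded = load_optional_secrets(fake_get_secret, ["OPENAI_API_KEY", "GOOGLE_API_KEY"])
missing = [name for name, value in loaded.items() if value is None]
print("Missing secrets:", missing)
```

Code that depends on a service can then check for `None` and skip that feature instead of failing when the notebook starts.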