An embedded vector database built with Tree-sitter parsing and OpenAI embeddings, with a Claude Code wrapper for semantic search over a documentation base.

seanmichaelmcgee/pytorch-docs-search

PyTorch Documentation Search Tool

A specialized semantic search system for PyTorch documentation that understands both code and text, providing relevant results for technical queries.

πŸ“‹ Overview

This tool enables developers to efficiently search PyTorch documentation using natural language or code-based queries. It preserves the integrity of code examples, understands programming patterns, and intelligently ranks results based on query intent.

Key Features:

  • Code-aware document processing that preserves code block integrity
  • Semantic search powered by OpenAI embeddings
  • Query intent detection with confidence scoring
  • Auto-tuning of HNSW search parameters
  • Embedding cache with versioning and drift detection
  • Progressive timeout with partial results
  • Claude Code CLI integration
  • Incremental document updates
  • Robust API compatibility handling for OpenAI SDK

πŸš€ Getting Started

Prerequisites

  • Ubuntu 24.04 LTS (or similar)
  • Python 3.10+
  • OpenAI API key
  • 8GB RAM recommended

Installation

  1. Clone the repository:

    git clone https://github.com/yourusername/pytorch-docs-search.git
    cd pytorch-docs-search
  2. Choose your environment setup method:

    Option A: Conda Environment (Strongly Recommended)

    # Automated setup (recommended)
    ./setup_conda_env.sh
    
    # OR manually create the environment
    conda env create -f environment.yml
    conda activate pytorch_docs_search
    
    # Verify the environment is correctly set up
    python test_conda_env.py

    Option B: Python Virtual Environment (Only if Conda is unavailable)

    # Create and activate a virtual environment
    python3 -m venv venv
    source venv/bin/activate
    
    # Install dependencies
    pip install -r requirements.txt

    Note: The Conda environment is strongly recommended for better dependency management and compatibility. The virtual environment option is maintained only for special cases where Conda cannot be used.

  3. Set up your environment variables:

    cp .env.example .env
    # Edit .env file to add your OpenAI API key

πŸ› οΈ Usage

Initial Setup

  1. Index your PyTorch documentation:

    python scripts/index_documents.py --docs-dir /path/to/pytorch/docs
  2. Generate embeddings:

    python scripts/generate_embeddings.py
  3. Load embeddings to the database:

    python scripts/load_to_database.py

Searching Documentation

Command Line Interface

Run the search tool directly:

python scripts/document_search.py "How to implement a custom autograd function"

Or use interactive mode:

python scripts/document_search.py --interactive

Filter Results

Limit results to specific content types:

python scripts/document_search.py "custom loss function" --filter code

Updating Documentation

When PyTorch documentation is updated, you can incrementally process and add new or changed documents:

  1. Process only new/changed files:

    python scripts/index_documents.py --docs-dir /path/to/updated/docs --output-file ./data/updated_chunks.json
  2. Generate embeddings for these chunks:

    python scripts/generate_embeddings.py --input-file ./data/updated_chunks.json --output-file ./data/updated_chunks_with_embeddings.json
  3. Add to the database without resetting:

    python scripts/load_to_database.py --input-file ./data/updated_chunks_with_embeddings.json --no-reset

The embedding cache ensures efficient processing by avoiding regenerating embeddings for unchanged content.
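The core idea behind such a cache can be sketched as a content-addressed store whose keys also include the embedding model name, so a model upgrade ("drift") automatically misses and forces regeneration. This is an illustrative sketch, not the project's actual implementation; the model name shown is an assumption:

```python
import hashlib
import json
from pathlib import Path

class EmbeddingCache:
    """Content-addressed embedding cache: the key hashes both the chunk
    text and the model name, so unchanged content is a cache hit while a
    model change invalidates every entry."""

    def __init__(self, cache_dir, model="text-embedding-3-small"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.model = model

    def _path(self, text):
        digest = hashlib.sha256(f"{self.model}:{text}".encode()).hexdigest()
        return self.dir / f"{digest}.json"

    def get(self, text):
        path = self._path(text)
        return json.loads(path.read_text()) if path.exists() else None

    def put(self, text, embedding):
        self._path(text).write_text(json.dumps(embedding))
```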

Claude Code Integration

Legacy Integration

For the original Claude Code integration:

bash scripts/register_tool.sh

Enhanced MCP Integration (Recommended)

For the improved Model Context Protocol (MCP) integration:

  1. Start the Flask API server:

    python app.py
  2. Register the MCP tool with Claude:

    claude mcp add mcp__pytorch_docs__semantic_search "http://localhost:5000/search" \
      --description "Search PyTorch documentation with code-aware semantic search" \
      --transport sse
  3. Test the tool:

    claude run tool mcp__pytorch_docs__semantic_search --input '{"query": "batch normalization"}'

For complete documentation on the MCP integration, see MCP_INTEGRATION.md.

πŸ“¦ Project Structure

pytorch-docs-search/
β”œβ”€β”€ backup/             # Backup of original environment
β”‚   β”œβ”€β”€ old_requirements.txt # Original pip requirements
β”‚   └── venv_backup/    # Backup of original virtual environment
β”œβ”€β”€ data/               # Storage for processed docs and database
β”‚   β”œβ”€β”€ chroma_db/      # Vector database
β”‚   β”œβ”€β”€ embedding_cache/ # Cached embeddings
β”‚   └── indexed_chunks.json
β”œβ”€β”€ docs/               # Documentation
β”‚   β”œβ”€β”€ GUIDE.md        # Implementation guide
β”‚   β”œβ”€β”€ Journal.md      # Development journal
β”‚   β”œβ”€β”€ MIGRATION_REPORT.md # Environment migration report
β”‚   β”œβ”€β”€ conda_migration_checklist.md # Migration tasks tracking
β”‚   └── USER_GUIDE.md   # End-user documentation
β”œβ”€β”€ scripts/            # Core scripts
β”‚   β”œβ”€β”€ config/         # Configuration module
β”‚   β”œβ”€β”€ database/       # ChromaDB integration
β”‚   β”œβ”€β”€ document_processing/ # Document parsing and chunking
β”‚   β”œβ”€β”€ embedding/      # Embedding generation
β”‚   β”œβ”€β”€ search/         # Search interface
β”‚   β”œβ”€β”€ check_db_status.py   # Check ChromaDB status
β”‚   β”œβ”€β”€ check_embedding_progress.py # Monitor embedding generation
β”‚   β”œβ”€β”€ claude-code-tool.py  # Claude Code integration tool
β”‚   β”œβ”€β”€ continue_embedding.py # Continue embedding generation
β”‚   β”œβ”€β”€ continue_loading.py  # Continue loading into ChromaDB
β”‚   β”œβ”€β”€ document_search.py   # Main search script
β”‚   β”œβ”€β”€ finalize_embedding.py # Finalize embedding process
β”‚   β”œβ”€β”€ generate_embeddings.py # Embedding generation script
β”‚   β”œβ”€β”€ index_documents.py   # Document processing script
β”‚   β”œβ”€β”€ load_to_database.py  # Database loading script
β”‚   β”œβ”€β”€ merge_and_load.py    # Merge part files and load
β”‚   β”œβ”€β”€ merge_parts.py       # Merge chunked parts
β”‚   β”œβ”€β”€ migrate_embeddings.py # Model migration script
β”‚   β”œβ”€β”€ monitor_and_load.py  # Monitor embedding process
β”‚   β”œβ”€β”€ register_tool.sh     # Claude Code tool registration
β”‚   β”œβ”€β”€ resume_embedding.py  # Resume embedding generation
β”‚   └── validate_chunking.py # Validate document chunking
β”œβ”€β”€ tests/              # Unit tests
β”œβ”€β”€ .env                # Environment variables
β”œβ”€β”€ CLAUDE.md           # Guidance for Claude Code
β”œβ”€β”€ environment.yml     # Conda environment configuration
β”œβ”€β”€ requirements.txt    # Pip dependencies (alternative to Conda)
β”œβ”€β”€ run_test_conda.sh   # Test script for Conda environment
β”œβ”€β”€ setup_conda_env.sh  # Conda environment setup script
β”œβ”€β”€ test_conda_env.py   # Environment validation script
└── README.md           # This file

πŸ”§ Configuration

Edit .env file to configure:

  • OPENAI_API_KEY: Your OpenAI API key
  • CHUNK_SIZE: Size of document chunks (default: 1000)
  • OVERLAP_SIZE: Overlap between chunks (default: 200)
  • MAX_RESULTS: Default number of search results (default: 5)
  • DB_DIR: ChromaDB storage location (default: ./data/chroma_db)
  • COLLECTION_NAME: Name of the ChromaDB collection (default: pytorch_docs)

Advanced settings can be modified in scripts/config/__init__.py.

🧠 How It Works

Document Processing Pipeline

  1. Parsing: Uses Tree-sitter to parse markdown and Python files, preserving structure.
  2. Chunking: Intelligently divides documents into chunks, respecting code boundaries:
    • Keeps code blocks intact where possible
    • Uses semantic boundaries (functions, classes) for large code blocks
    • Uses paragraphs and sentences for text
  3. Metadata: Enriches chunks with source, title, content type, and language information
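The chunking rule above, keeping fenced code blocks intact while splitting prose on paragraph boundaries, can be sketched in a few lines. This is a simplified illustration using a regex over markdown fences; the actual pipeline uses Tree-sitter and also splits oversized code blocks on function/class boundaries:

```python
import re

def chunk_markdown(text):
    """Split markdown so that a fenced code block is never cut in half;
    prose between fences is split on blank lines."""
    # Capturing the fence in the split pattern keeps each code block
    # as a single unit in the result.
    parts = re.split(r"(```.*?```)", text, flags=re.DOTALL)
    chunks = []
    for part in parts:
        if part.startswith("```"):
            chunks.append({"text": part, "type": "code"})
        else:
            for para in filter(None, (p.strip() for p in part.split("\n\n"))):
                chunks.append({"text": para, "type": "text"})
    return chunks
```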

Search Process

  1. Query Analysis: Analyzes the query with confidence scoring to determine if it's code-focused or concept-focused
  2. Embedding Generation: Creates a vector representation of the query with caching for efficiency
  3. Vector Search: Finds semantically similar chunks in the database using auto-tuned HNSW parameters
  4. Progressive Timeout: Implements staged timeouts to provide partial results rather than failures
  5. Result Ranking: Applies confidence-based boosting for code examples or explanations based on query intent
  6. Formatting: Returns structured results with relevant snippets and metadata, adapting to available time
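Steps 1 and 5 above combine into a simple pattern: estimate how code-like the query is, then boost results whose content type matches that intent, weighted by confidence. This is a rough illustrative sketch (the keyword list, threshold, and boost factor are assumptions, not the project's tuned values):

```python
import math
import re

CODE_HINTS = re.compile(r"\b(def|class|import|error|traceback|torch)\b")

def detect_intent(query):
    """Heuristic intent detection: return ('code'|'concept', confidence)."""
    hits = len(CODE_HINTS.findall(query))
    confidence = min(1.0, hits / 3)
    return ("code" if confidence >= 0.34 else "concept"), confidence

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank(query_vec, query, results, boost=0.15):
    """Rank by cosine similarity, boosting chunks whose type matches
    the detected query intent, scaled by detection confidence."""
    intent, confidence = detect_intent(query)
    scored = []
    for r in results:
        score = cosine(query_vec, r["embedding"])
        if r["type"] == intent:
            score += boost * confidence
        scored.append((score, r))
    return [r for _, r in sorted(scored, key=lambda s: s[0], reverse=True)]
```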

🚧 Maintenance

Monitoring

Check system status:

python scripts/monitor_system.py

Checking Database Status

You can check the status of the ChromaDB database at any time:

python scripts/check_db_status.py

This will show you the number of chunks in each collection, their types, and sources.
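The kind of summary this script reports amounts to tallying chunk metadata. A minimal sketch, assuming each chunk carries `type` and `source` fields (an assumption about this project's metadata schema):

```python
from collections import Counter

def summarize_chunks(metadatas):
    """Tally chunk counts by content type and source document."""
    return {
        "total": len(metadatas),
        "by_type": dict(Counter(m.get("type", "unknown") for m in metadatas)),
        "by_source": dict(Counter(m.get("source", "unknown") for m in metadatas)),
    }
```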

Resuming Embedding Generation

If the embedding generation process is interrupted, you can resume it:

python scripts/resume_embedding.py

This script will identify which chunks already have embeddings and only process the remaining ones.
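The resume logic described above boils down to partitioning chunks by whether they already carry an embedding. As a hedged sketch of that idea (the field name `embedding` is an assumption about the chunk schema):

```python
def split_resume(chunks):
    """Partition chunks into (already embedded, still to process), so an
    interrupted run can pick up where it left off."""
    done = [c for c in chunks if c.get("embedding")]
    todo = [c for c in chunks if not c.get("embedding")]
    return done, todo
```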

Checking Embedding Progress

To check the progress of the embedding generation process:

python scripts/check_embedding_progress.py

Finalizing Embeddings

After all embeddings are generated, you can finalize the process:

python scripts/finalize_embedding.py

This will merge all part files and load them into ChromaDB.
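Merging part files correctly requires sorting them numerically rather than lexicographically (so part 10 comes after part 2, not after part 1). A sketch of that merge step; the `chunks_part_*.json` filename pattern is an assumption about this project's layout:

```python
import json
import re
from pathlib import Path

def merge_part_files(data_dir, pattern="chunks_part_*.json"):
    """Merge numbered part files in numeric order into a single list
    ready for database loading."""
    def part_number(path):
        m = re.search(r"(\d+)", path.stem)
        return int(m.group(1)) if m else 0

    merged = []
    for path in sorted(Path(data_dir).glob(pattern), key=part_number):
        merged.extend(json.loads(path.read_text()))
    return merged
```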

Backups

Create a database backup:

scripts/maintenance.sh

Upgrading Embedding Models

If you need to migrate to a newer embedding model:

python scripts/migrate_embeddings.py --input-file ./data/indexed_chunks.json --output-file ./data/migrated_chunks.json
python scripts/load_to_database.py --input-file ./data/migrated_chunks.json

πŸ“Š Benchmarking

To evaluate embedding performance:

python scripts/benchmark_embeddings.py

πŸ” Troubleshooting

Environment Setup Issues

If you encounter issues with the Conda environment:

  • Use the included validation script: python test_conda_env.py
  • Check for version conflicts with conda list
  • Try recreating the environment with setup_conda_env.sh
  • Ensure your terminal session is fresh (no other environments active)
  • For known compatibility issues, see docs/MIGRATION_REPORT.md

API Key and SDK Issues

If you encounter API key errors:

  • Check that your .env file contains a valid OPENAI_API_KEY
  • Verify the key has access to the embedding models

If you encounter OpenAI client errors:

  • Our robust client initialization should handle most API compatibility issues
  • For persistent errors, check the OpenAI SDK version compatibility with the installed httpx version
  • Monitor logs for "Creating custom HTTP client for OpenAI compatibility" which indicates fallback is being used

Memory Issues

If you encounter out-of-memory errors:

  • Reduce the batch size in scripts/load_to_database.py (e.g., --batch-size 50)
  • Process documents in smaller batches
  • Increase system swap space if necessary

Database Issues

If ChromaDB fails to load or query:

  • Check database directory permissions
  • Verify ChromaDB installation
  • Try resetting the collection: python scripts/load_to_database.py --input-file ./data/indexed_chunks.json

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgements
