A specialized semantic search system for PyTorch documentation that understands both code and text, providing relevant results for technical queries.
This tool enables developers to efficiently search PyTorch documentation using natural language or code-based queries. It preserves the integrity of code examples, understands programming patterns, and intelligently ranks results based on query intent.
Key Features:
- Code-aware document processing that preserves code block integrity
- Semantic search powered by OpenAI embeddings
- Query intent detection with confidence scoring
- Auto-tuning of HNSW search parameters
- Embedding cache with versioning and drift detection
- Progressive timeout with partial results
- Claude Code CLI integration
- Incremental document updates
- Robust API compatibility handling for OpenAI SDK
Requirements:

- Ubuntu 24.04 LTS (or similar)
- Python 3.10+
- OpenAI API key
- 8GB RAM recommended
1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/pytorch-docs-search.git
   cd pytorch-docs-search
   ```
2. Choose your environment setup method:

   Option A: Conda Environment (Strongly Recommended)

   ```bash
   # Automated setup (recommended)
   ./setup_conda_env.sh

   # OR manually create the environment
   conda env create -f environment.yml
   conda activate pytorch_docs_search

   # Verify the environment is correctly set up
   python test_conda_env.py
   ```

   Option B: Python Virtual Environment (Only if Conda is unavailable)

   ```bash
   # Create and activate a virtual environment
   python3 -m venv venv
   source venv/bin/activate

   # Install dependencies
   pip install -r requirements.txt
   ```

   Note: The Conda environment is strongly recommended for better dependency management and compatibility. The virtual environment option is maintained only for special cases where Conda cannot be used.
3. Set up your environment variables:

   ```bash
   cp .env.example .env
   # Edit the .env file to add your OpenAI API key
   ```
4. Index your PyTorch documentation:

   ```bash
   python scripts/index_documents.py --docs-dir /path/to/pytorch/docs
   ```
5. Generate embeddings:

   ```bash
   python scripts/generate_embeddings.py
   ```
6. Load the embeddings into the database:

   ```bash
   python scripts/load_to_database.py
   ```
Run the search tool directly:

```bash
python scripts/document_search.py "How to implement a custom autograd function"
```

Or use interactive mode:

```bash
python scripts/document_search.py --interactive
```

Limit results to specific content types:

```bash
python scripts/document_search.py "custom loss function" --filter code
```
When PyTorch documentation is updated, you can incrementally process and add new or changed documents:
1. Process only new/changed files:

   ```bash
   python scripts/index_documents.py --docs-dir /path/to/updated/docs --output-file ./data/updated_chunks.json
   ```
2. Generate embeddings for these chunks:

   ```bash
   python scripts/generate_embeddings.py --input-file ./data/updated_chunks.json --output-file ./data/updated_chunks_with_embeddings.json
   ```
3. Add to the database without resetting:

   ```bash
   python scripts/load_to_database.py --input-file ./data/updated_chunks_with_embeddings.json --no-reset
   ```
The embedding cache ensures efficient processing by avoiding regenerating embeddings for unchanged content.
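For illustration, a content-addressed cache along these lines is one way to avoid re-embedding unchanged chunks. This is a minimal sketch, not the project's actual cache module; the cache layout, model tag, and function names are assumptions:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("./data/embedding_cache")   # matches the directory in the project tree
MODEL_VERSION = "text-embedding-3-small"     # assumed model tag used to version entries


def cache_key(text: str) -> str:
    """Stable key derived from the model version plus the exact chunk content."""
    return hashlib.sha256(f"{MODEL_VERSION}:{text}".encode("utf-8")).hexdigest()


def get_cached_embedding(text: str) -> list[float] | None:
    """Return the cached embedding if this exact content was embedded before."""
    path = CACHE_DIR / f"{cache_key(text)}.json"
    if path.exists():
        return json.loads(path.read_text())["embedding"]
    return None


def store_embedding(text: str, embedding: list[float]) -> None:
    """Persist an embedding keyed by content hash so unchanged chunks are skipped."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    (CACHE_DIR / f"{cache_key(text)}.json").write_text(
        json.dumps({"model": MODEL_VERSION, "embedding": embedding})
    )
```

Because the key hashes the model version together with the content, changing the model tag naturally invalidates stale entries, which is one simple way to realize the versioning behavior described above.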
For the original Claude Code integration:

```bash
bash scripts/register_tool.sh
```
For the improved Model-Context Protocol (MCP) integration:
1. Start the Flask API server:

   ```bash
   python app.py
   ```
2. Register the MCP tool with Claude:

   ```bash
   claude mcp add mcp__pytorch_docs__semantic_search "http://localhost:5000/search" \
     --description "Search PyTorch documentation with code-aware semantic search" \
     --transport sse
   ```
3. Test the tool:

   ```bash
   claude run tool mcp__pytorch_docs__semantic_search --input '{"query": "batch normalization"}'
   ```
For complete documentation on the MCP integration, see MCP_INTEGRATION.md.
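For orientation, the endpoint registered above has roughly the following shape. This is a minimal sketch assuming a JSON POST body; the real app.py adds SSE transport, progressive timeouts, and ranking, and `run_semantic_search` is a hypothetical stand-in for the code under scripts/search/:

```python
# app.py (simplified sketch, not the actual implementation)
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/search", methods=["POST"])
def search():
    payload = request.get_json(force=True)
    query = payload.get("query", "")
    if not query:
        return jsonify({"error": "missing 'query'"}), 400
    # results = run_semantic_search(query)  # hypothetical call into scripts/search/
    results = []
    return jsonify({"query": query, "results": results})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```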
Project structure:

```
pytorch-docs-search/
├── backup/                          # Backup of original environment
│   ├── old_requirements.txt         # Original pip requirements
│   └── venv_backup/                 # Backup of original virtual environment
├── data/                            # Storage for processed docs and database
│   ├── chroma_db/                   # Vector database
│   ├── embedding_cache/             # Cached embeddings
│   └── indexed_chunks.json
├── docs/                            # Documentation
│   ├── GUIDE.md                     # Implementation guide
│   ├── Journal.md                   # Development journal
│   ├── MIGRATION_REPORT.md          # Environment migration report
│   ├── conda_migration_checklist.md # Migration tasks tracking
│   └── USER_GUIDE.md                # End-user documentation
├── scripts/                         # Core scripts
│   ├── config/                      # Configuration module
│   ├── database/                    # ChromaDB integration
│   ├── document_processing/         # Document parsing and chunking
│   ├── embedding/                   # Embedding generation
│   ├── search/                      # Search interface
│   ├── check_db_status.py           # Check ChromaDB status
│   ├── check_embedding_progress.py  # Monitor embedding generation
│   ├── claude-code-tool.py          # Claude Code integration tool
│   ├── continue_embedding.py        # Continue embedding generation
│   ├── continue_loading.py          # Continue loading into ChromaDB
│   ├── document_search.py           # Main search script
│   ├── finalize_embedding.py        # Finalize embedding process
│   ├── generate_embeddings.py       # Embedding generation script
│   ├── index_documents.py           # Document processing script
│   ├── load_to_database.py          # Database loading script
│   ├── merge_and_load.py            # Merge part files and load
│   ├── merge_parts.py               # Merge chunked parts
│   ├── migrate_embeddings.py        # Model migration script
│   ├── monitor_and_load.py          # Monitor embedding process
│   ├── register_tool.sh             # Claude Code tool registration
│   ├── resume_embedding.py          # Resume embedding generation
│   └── validate_chunking.py         # Validate document chunking
├── tests/                           # Unit tests
├── .env                             # Environment variables
├── CLAUDE.md                        # Guidance for Claude Code
├── environment.yml                  # Conda environment configuration
├── requirements.txt                 # Pip dependencies (alternative to Conda)
├── run_test_conda.sh                # Test script for Conda environment
├── setup_conda_env.sh               # Conda environment setup script
├── test_conda_env.py                # Environment validation script
└── README.md                        # This file
```
Edit the `.env` file to configure:

- `OPENAI_API_KEY`: Your OpenAI API key
- `CHUNK_SIZE`: Size of document chunks (default: 1000)
- `OVERLAP_SIZE`: Overlap between chunks (default: 200)
- `MAX_RESULTS`: Default number of search results (default: 5)
- `DB_DIR`: ChromaDB storage location (default: ./data/chroma_db)
- `COLLECTION_NAME`: Name of the ChromaDB collection (default: pytorch_docs)

Advanced settings can be modified in `scripts/config/__init__.py`.
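As a rough sketch, the documented settings map naturally onto environment variables with the defaults listed above. This is illustrative of how `scripts/config/__init__.py` might read them, not a copy of it; the use of python-dotenv is an assumption:

```python
# Illustrative configuration module: names and defaults mirror the list above.
import os

from dotenv import load_dotenv  # python-dotenv, assumed to be among the dependencies

load_dotenv()  # pull values from the .env file into the process environment

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]               # required, no default
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))           # size of document chunks
OVERLAP_SIZE = int(os.getenv("OVERLAP_SIZE", "200"))        # overlap between chunks
MAX_RESULTS = int(os.getenv("MAX_RESULTS", "5"))            # default number of results
DB_DIR = os.getenv("DB_DIR", "./data/chroma_db")            # ChromaDB storage location
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "pytorch_docs")
```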
Document processing pipeline:

- Parsing: Uses Tree-sitter to parse Markdown and Python files, preserving structure.
- Chunking: Intelligently divides documents into chunks, respecting code boundaries (see the sketch after this list):
  - Keeps code blocks intact where possible
  - Uses semantic boundaries (functions, classes) for large code blocks
  - Uses paragraphs and sentences for text
- Metadata: Enriches chunks with source, title, content type, and language information
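The chunking idea can be sketched in a few lines. This is illustrative only: it assumes pre-parsed `(kind, text)` segments rather than real Tree-sitter output, approximates semantic boundaries with blank lines, and omits the configured chunk overlap:

```python
def chunk_document(segments, chunk_size=1000):
    """segments: list of (kind, text) pairs, where kind is 'code' or 'text'."""
    chunks, current = [], ""
    for kind, text in segments:
        if kind == "code" and len(text) <= chunk_size:
            # A code block that fits is never split: flush prose, emit it whole.
            if current:
                chunks.append(current)
                current = ""
            chunks.append(text)
        else:
            # Oversized code falls through and is split on blank lines, a crude
            # stand-in for real function/class boundaries; prose splits on
            # paragraphs the same way.
            for part in text.split("\n\n"):
                if current and len(current) + len(part) > chunk_size:
                    chunks.append(current)
                    current = ""
                current += ("\n\n" if current else "") + part
    if current:
        chunks.append(current)
    return chunks
```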
Search pipeline:

- Query Analysis: Analyzes the query with confidence scoring to determine whether it is code-focused or concept-focused
- Embedding Generation: Creates a vector representation of the query, with caching for efficiency
- Vector Search: Finds semantically similar chunks in the database using auto-tuned HNSW parameters
- Progressive Timeout: Implements staged timeouts so that partial results are returned rather than failures
- Result Ranking: Applies confidence-based boosting to code examples or explanations, depending on query intent (intent detection and boosting are sketched after this list)
- Formatting: Returns structured results with relevant snippets and metadata, adapting to the available time
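A rough sketch of the first and fifth steps, intent detection with a confidence score and the resulting rank boost; the keyword signals and weights here are invented for illustration:

```python
CODE_SIGNALS = ("implement", "example", "snippet", "def ", "torch.", "(", "error")


def detect_intent(query: str) -> tuple[str, float]:
    """Classify a query as code- or concept-focused with a confidence in [0.5, 1.0]."""
    hits = sum(signal in query.lower() for signal in CODE_SIGNALS)
    confidence = min(1.0, 0.5 + 0.1 * hits)
    return ("code" if hits else "concept"), confidence


def boost(results: list[dict], intent: str, confidence: float) -> list[dict]:
    """Nudge the scores of chunks whose content type matches the query intent."""
    preferred = "code" if intent == "code" else "text"
    for r in results:
        if r["metadata"]["chunk_type"] == preferred:
            r["score"] *= 1.0 + 0.2 * confidence
    return sorted(results, key=lambda r: r["score"], reverse=True)
```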
Check system status:

```bash
python scripts/monitor_system.py
```
You can check the status of the ChromaDB database at any time:

```bash
python scripts/check_db_status.py
```

This shows the number of chunks in each collection, their types, and sources.
If the embedding generation process is interrupted, you can resume it:

```bash
python scripts/resume_embedding.py
```
This script will identify which chunks already have embeddings and only process the remaining ones.
To check the progress of the embedding generation process:

```bash
python scripts/check_embedding_progress.py
```
After all embeddings are generated, you can finalize the process:

```bash
python scripts/finalize_embedding.py
```
This will merge all part files and load them into ChromaDB.
Create a database backup:

```bash
scripts/maintenance.sh
```
If you need to migrate to a newer embedding model:

```bash
python scripts/migrate_embeddings.py --input-file ./data/indexed_chunks.json --output-file ./data/migrated_chunks.json
python scripts/load_to_database.py --input-file ./data/migrated_chunks.json
```
To evaluate embedding performance:

```bash
python scripts/benchmark_embeddings.py
```
If you encounter issues with the Conda environment:

- Use the included validation script: `python test_conda_env.py`
- Check for version conflicts with `conda list`
- Try recreating the environment with `setup_conda_env.sh`
- Ensure your terminal session is fresh (no other environments active)
- For known compatibility issues, see docs/MIGRATION_REPORT.md
If you encounter API key errors:

- Check that your `.env` file contains a valid `OPENAI_API_KEY`
- Verify the key has access to the embedding models
If you encounter OpenAI client errors:

- The robust client initialization should handle most API compatibility issues
- For persistent errors, check that the installed OpenAI SDK version is compatible with the installed httpx version
- Watch the logs for "Creating custom HTTP client for OpenAI compatibility", which indicates the fallback is in use (see the sketch below)
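The fallback behind that log line looks roughly like this; the exact exception type and client arguments are assumptions based on common openai/httpx version mismatches:

```python
import logging
import os

import httpx
from openai import OpenAI


def make_openai_client() -> OpenAI:
    """Create an OpenAI client, falling back to an explicit HTTP client if needed."""
    api_key = os.environ["OPENAI_API_KEY"]
    try:
        return OpenAI(api_key=api_key)
    except TypeError:
        # Some openai/httpx version pairings reject the SDK's default client setup.
        logging.info("Creating custom HTTP client for OpenAI compatibility")
        return OpenAI(api_key=api_key, http_client=httpx.Client())
```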
If you encounter out-of-memory errors:

- Reduce the batch size in `scripts/load_to_database.py` (e.g. `--batch-size 50`)
- Process documents in smaller batches
- Increase system swap space if necessary
If ChromaDB fails to load or query:

- Check database directory permissions
- Verify the ChromaDB installation
- Try resetting the collection:

```bash
python scripts/load_to_database.py --input-file ./data/indexed_chunks.json
```
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.