An embedded vector database built with Tree-sitter parsing and OpenAI embeddings, with a Claude Code wrapper for semantic search over a documentation base.

seanmichaelmcgee/pytorch-docs-search

PyTorch Documentation Search Tool

A specialized semantic search system for PyTorch documentation that understands both code and text, providing relevant results for technical queries.

πŸ“‹ Overview

This tool enables developers to efficiently search PyTorch documentation using natural language or code-based queries. It preserves the integrity of code examples, understands programming patterns, and intelligently ranks results based on query intent.

Key Features:

  • Code-aware document processing that preserves code block integrity
  • Semantic search powered by OpenAI embeddings
  • Query intent detection with confidence scoring
  • Auto-tuning of HNSW search parameters
  • Embedding cache with versioning and drift detection
  • Progressive timeout with partial results
  • Claude Code CLI integration
  • Incremental document updates
  • Robust API compatibility handling for OpenAI SDK

πŸš€ Getting Started

Prerequisites

  • Ubuntu 24.04 LTS (or similar)
  • Python 3.10+
  • OpenAI API key
  • 8GB RAM recommended

Installation

  1. Clone the repository:

    git clone https://github.com/yourusername/pytorch-docs-search.git
    cd pytorch-docs-search
  2. Choose your environment setup method:

    Option A: Conda Environment (Strongly Recommended)

    # Automated setup (recommended)
    ./setup_conda_env.sh
    
    # OR manually create the environment
    conda env create -f environment.yml
    conda activate pytorch_docs_search
    
    # Verify the environment is correctly set up
    python test_conda_env.py

    Option B: Python Virtual Environment (Only if Conda is unavailable)

    # Create and activate a virtual environment
    python3 -m venv venv
    source venv/bin/activate
    
    # Install dependencies
    pip install -r requirements.txt

    Note: The Conda environment is strongly recommended for better dependency management and compatibility. The virtual environment option is maintained only for special cases where Conda cannot be used.

  3. Set up your environment variables:

    cp .env.example .env
    # Edit .env file to add your OpenAI API key

πŸ› οΈ Usage

Initial Setup

  1. Index your PyTorch documentation:

    python scripts/index_documents.py --docs-dir /path/to/pytorch/docs
  2. Generate embeddings:

    python scripts/generate_embeddings.py
  3. Load embeddings to the database:

    python scripts/load_to_database.py

Searching Documentation

Command Line Interface

Run the search tool directly:

python scripts/document_search.py "How to implement a custom autograd function"

Or use interactive mode:

python scripts/document_search.py --interactive

Filter Results

Limit results to specific content types:

python scripts/document_search.py "custom loss function" --filter code

Updating Documentation

When PyTorch documentation is updated, you can incrementally process and add new or changed documents:

  1. Process only new/changed files:

    python scripts/index_documents.py --docs-dir /path/to/updated/docs --output-file ./data/updated_chunks.json
  2. Generate embeddings for these chunks:

    python scripts/generate_embeddings.py --input-file ./data/updated_chunks.json --output-file ./data/updated_chunks_with_embeddings.json
  3. Add to the database without resetting:

    python scripts/load_to_database.py --input-file ./data/updated_chunks_with_embeddings.json --no-reset

The embedding cache ensures efficient processing by avoiding regenerating embeddings for unchanged content.
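The core idea behind such a cache can be sketched as a content-addressed store whose keys also include the embedding model name, so a model upgrade ("drift") automatically misses and forces regeneration. This is an illustrative sketch, not the project's actual implementation; the model name shown is an assumption:

```python
import hashlib
import json
from pathlib import Path

class EmbeddingCache:
    """Content-addressed embedding cache: the key hashes both the chunk
    text and the model name, so unchanged content is a cache hit while a
    model change invalidates every entry."""

    def __init__(self, cache_dir, model="text-embedding-3-small"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.model = model

    def _path(self, text):
        digest = hashlib.sha256(f"{self.model}:{text}".encode()).hexdigest()
        return self.dir / f"{digest}.json"

    def get(self, text):
        path = self._path(text)
        return json.loads(path.read_text()) if path.exists() else None

    def put(self, text, embedding):
        self._path(text).write_text(json.dumps(embedding))
```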

Claude Code Integration

Legacy Integration

For the original Claude Code integration:

bash scripts/register_tool.sh

Enhanced MCP Integration (Recommended)

For the improved Model Context Protocol (MCP) integration:

  1. Start the Flask API server:

    python app.py
  2. Register the MCP tool with Claude:

    claude mcp add mcp__pytorch_docs__semantic_search "http://localhost:5000/search" \
      --description "Search PyTorch documentation with code-aware semantic search" \
      --transport sse
  3. Test the tool:

    claude run tool mcp__pytorch_docs__semantic_search --input '{"query": "batch normalization"}'

For complete documentation on the MCP integration, see MCP_INTEGRATION.md.

πŸ“¦ Project Structure

pytorch-docs-search/
β”œβ”€β”€ backup/             # Backup of original environment
β”‚   β”œβ”€β”€ old_requirements.txt # Original pip requirements
β”‚   └── venv_backup/    # Backup of original virtual environment
β”œβ”€β”€ data/               # Storage for processed docs and database
β”‚   β”œβ”€β”€ chroma_db/      # Vector database
β”‚   β”œβ”€β”€ embedding_cache/ # Cached embeddings
β”‚   └── indexed_chunks.json
β”œβ”€β”€ docs/               # Documentation
β”‚   β”œβ”€β”€ GUIDE.md        # Implementation guide
β”‚   β”œβ”€β”€ Journal.md      # Development journal
β”‚   β”œβ”€β”€ MIGRATION_REPORT.md # Environment migration report
β”‚   β”œβ”€β”€ conda_migration_checklist.md # Migration tasks tracking
β”‚   └── USER_GUIDE.md   # End-user documentation
β”œβ”€β”€ scripts/            # Core scripts
β”‚   β”œβ”€β”€ config/         # Configuration module
β”‚   β”œβ”€β”€ database/       # ChromaDB integration
β”‚   β”œβ”€β”€ document_processing/ # Document parsing and chunking
β”‚   β”œβ”€β”€ embedding/      # Embedding generation
β”‚   β”œβ”€β”€ search/         # Search interface
β”‚   β”œβ”€β”€ check_db_status.py   # Check ChromaDB status
β”‚   β”œβ”€β”€ check_embedding_progress.py # Monitor embedding generation
β”‚   β”œβ”€β”€ claude-code-tool.py  # Claude Code integration tool
β”‚   β”œβ”€β”€ continue_embedding.py # Continue embedding generation
β”‚   β”œβ”€β”€ continue_loading.py  # Continue loading into ChromaDB
β”‚   β”œβ”€β”€ document_search.py   # Main search script
β”‚   β”œβ”€β”€ finalize_embedding.py # Finalize embedding process
β”‚   β”œβ”€β”€ generate_embeddings.py # Embedding generation script
β”‚   β”œβ”€β”€ index_documents.py   # Document processing script
β”‚   β”œβ”€β”€ load_to_database.py  # Database loading script
β”‚   β”œβ”€β”€ merge_and_load.py    # Merge part files and load
β”‚   β”œβ”€β”€ merge_parts.py       # Merge chunked parts
β”‚   β”œβ”€β”€ migrate_embeddings.py # Model migration script
β”‚   β”œβ”€β”€ monitor_and_load.py  # Monitor embedding process
β”‚   β”œβ”€β”€ register_tool.sh     # Claude Code tool registration
β”‚   β”œβ”€β”€ resume_embedding.py  # Resume embedding generation
β”‚   └── validate_chunking.py # Validate document chunking
β”œβ”€β”€ tests/              # Unit tests
β”œβ”€β”€ .env                # Environment variables
β”œβ”€β”€ CLAUDE.md           # Guidance for Claude Code
β”œβ”€β”€ environment.yml     # Conda environment configuration
β”œβ”€β”€ requirements.txt    # Pip dependencies (alternative to Conda)
β”œβ”€β”€ run_test_conda.sh   # Test script for Conda environment
β”œβ”€β”€ setup_conda_env.sh  # Conda environment setup script
β”œβ”€β”€ test_conda_env.py   # Environment validation script
└── README.md           # This file

πŸ”§ Configuration

Edit .env file to configure:

  • OPENAI_API_KEY: Your OpenAI API key
  • CHUNK_SIZE: Size of document chunks (default: 1000)
  • OVERLAP_SIZE: Overlap between chunks (default: 200)
  • MAX_RESULTS: Default number of search results (default: 5)
  • DB_DIR: ChromaDB storage location (default: ./data/chroma_db)
  • COLLECTION_NAME: Name of the ChromaDB collection (default: pytorch_docs)

Advanced settings can be modified in scripts/config/__init__.py.

🧠 How It Works

Document Processing Pipeline

  1. Parsing: Uses Tree-sitter to parse markdown and Python files, preserving structure.
  2. Chunking: Intelligently divides documents into chunks, respecting code boundaries:
    • Keeps code blocks intact where possible
    • Uses semantic boundaries (functions, classes) for large code blocks
    • Uses paragraphs and sentences for text
  3. Metadata: Enriches chunks with source, title, content type, and language information
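The chunking rule above, keeping fenced code blocks intact while splitting prose on paragraph boundaries, can be sketched in a few lines. This is a simplified illustration using a regex over markdown fences; the actual pipeline uses Tree-sitter and also splits oversized code blocks on function/class boundaries:

```python
import re

def chunk_markdown(text):
    """Split markdown so that a fenced code block is never cut in half;
    prose between fences is split on blank lines."""
    # Capturing the fence in the split pattern keeps each code block
    # as a single unit in the result.
    parts = re.split(r"(```.*?```)", text, flags=re.DOTALL)
    chunks = []
    for part in parts:
        if part.startswith("```"):
            chunks.append({"text": part, "type": "code"})
        else:
            for para in filter(None, (p.strip() for p in part.split("\n\n"))):
                chunks.append({"text": para, "type": "text"})
    return chunks
```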

Search Process

  1. Query Analysis: Analyzes the query with confidence scoring to determine if it's code-focused or concept-focused
  2. Embedding Generation: Creates a vector representation of the query with caching for efficiency
  3. Vector Search: Finds semantically similar chunks in the database using auto-tuned HNSW parameters
  4. Progressive Timeout: Implements staged timeouts to provide partial results rather than failures
  5. Result Ranking: Applies confidence-based boosting for code examples or explanations based on query intent
  6. Formatting: Returns structured results with relevant snippets and metadata, adapting to available time
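Steps 1 and 5 above combine into a simple pattern: estimate how code-like the query is, then boost results whose content type matches that intent, weighted by confidence. This is a rough illustrative sketch (the keyword list, threshold, and boost factor are assumptions, not the project's tuned values):

```python
import math
import re

CODE_HINTS = re.compile(r"\b(def|class|import|error|traceback|torch)\b")

def detect_intent(query):
    """Heuristic intent detection: return ('code'|'concept', confidence)."""
    hits = len(CODE_HINTS.findall(query))
    confidence = min(1.0, hits / 3)
    return ("code" if confidence >= 0.34 else "concept"), confidence

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank(query_vec, query, results, boost=0.15):
    """Rank by cosine similarity, boosting chunks whose type matches
    the detected query intent, scaled by detection confidence."""
    intent, confidence = detect_intent(query)
    scored = []
    for r in results:
        score = cosine(query_vec, r["embedding"])
        if r["type"] == intent:
            score += boost * confidence
        scored.append((score, r))
    return [r for _, r in sorted(scored, key=lambda s: s[0], reverse=True)]
```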

🚧 Maintenance

Monitoring

Check system status:

python scripts/monitor_system.py

Checking Database Status

You can check the status of the ChromaDB database at any time:

python scripts/check_db_status.py

This will show you the number of chunks in each collection, their types, and sources.
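The kind of summary this script reports amounts to tallying chunk metadata. A minimal sketch, assuming each chunk carries `type` and `source` fields (an assumption about this project's metadata schema):

```python
from collections import Counter

def summarize_chunks(metadatas):
    """Tally chunk counts by content type and source document."""
    return {
        "total": len(metadatas),
        "by_type": dict(Counter(m.get("type", "unknown") for m in metadatas)),
        "by_source": dict(Counter(m.get("source", "unknown") for m in metadatas)),
    }
```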

Resuming Embedding Generation

If the embedding generation process is interrupted, you can resume it:

python scripts/resume_embedding.py

This script will identify which chunks already have embeddings and only process the remaining ones.
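The resume logic described above boils down to partitioning chunks by whether they already carry an embedding. As a hedged sketch of that idea (the field name `embedding` is an assumption about the chunk schema):

```python
def split_resume(chunks):
    """Partition chunks into (already embedded, still to process), so an
    interrupted run can pick up where it left off."""
    done = [c for c in chunks if c.get("embedding")]
    todo = [c for c in chunks if not c.get("embedding")]
    return done, todo
```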

Checking Embedding Progress

To check the progress of the embedding generation process:

python scripts/check_embedding_progress.py

Finalizing Embeddings

After all embeddings are generated, you can finalize the process:

python scripts/finalize_embedding.py

This will merge all part files and load them into ChromaDB.
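Merging part files correctly requires sorting them numerically rather than lexicographically (so part 10 comes after part 2, not after part 1). A sketch of that merge step; the `chunks_part_*.json` filename pattern is an assumption about this project's layout:

```python
import json
import re
from pathlib import Path

def merge_part_files(data_dir, pattern="chunks_part_*.json"):
    """Merge numbered part files in numeric order into a single list
    ready for database loading."""
    def part_number(path):
        m = re.search(r"(\d+)", path.stem)
        return int(m.group(1)) if m else 0

    merged = []
    for path in sorted(Path(data_dir).glob(pattern), key=part_number):
        merged.extend(json.loads(path.read_text()))
    return merged
```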

Backups

Create a database backup:

scripts/maintenance.sh

Upgrading Embedding Models

If you need to migrate to a newer embedding model:

python scripts/migrate_embeddings.py --input-file ./data/indexed_chunks.json --output-file ./data/migrated_chunks.json
python scripts/load_to_database.py --input-file ./data/migrated_chunks.json

πŸ“Š Benchmarking

To evaluate embedding performance:

python scripts/benchmark_embeddings.py

πŸ” Troubleshooting

Environment Setup Issues

If you encounter issues with the Conda environment:

  • Use the included validation script: python test_conda_env.py
  • Check for version conflicts with conda list
  • Try recreating the environment with setup_conda_env.sh
  • Ensure your terminal session is fresh (no other environments active)
  • For known compatibility issues, see docs/MIGRATION_REPORT.md

API Key and SDK Issues

If you encounter API key errors:

  • Check that your .env file contains a valid OPENAI_API_KEY
  • Verify the key has access to the embedding models

If you encounter OpenAI client errors:

  • Our robust client initialization should handle most API compatibility issues
  • For persistent errors, check the OpenAI SDK version compatibility with the installed httpx version
  • Monitor logs for "Creating custom HTTP client for OpenAI compatibility" which indicates fallback is being used

Memory Issues

If you encounter out-of-memory errors:

  • Reduce the batch size in scripts/load_to_database.py (e.g., --batch-size 50)
  • Process documents in smaller batches
  • Increase system swap space if necessary

Database Issues

If ChromaDB fails to load or query:

  • Check database directory permissions
  • Verify ChromaDB installation
  • Try resetting the collection: python scripts/load_to_database.py --input-file ./data/indexed_chunks.json

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgements
