An intelligent documentation crawler and RAG (Retrieval-Augmented Generation) agent built using LangChain, Supabase, and OpenAI. The agent can crawl documentation websites, store content in a vector database, and provide intelligent answers to user questions by retrieving and analyzing relevant documentation chunks.
- Ethical web crawling with robots.txt compliance
- Intelligent HTML to Markdown conversion
- Automatic content cleaning and noise removal
- Smart content chunking for optimal retrieval
- Supabase pgvector storage
- OpenAI embeddings for semantic search
- Efficient similarity matching
- Structured metadata storage
- LangChain-powered question answering
- Context-aware responses
- Source attribution for answers
- Semantic search capabilities
- Async/concurrent crawling
- Error handling and recovery
- Rate limiting and polite crawling
- Modular and extensible architecture
- Node.js 16+
- Supabase account
- OpenAI API key
- Clone and Install Dependencies

  ```bash
  git clone [repository-url]
  cd audit-the-audit-api
  npm install
  ```

- Configure Environment Variables

  Create a `.env` file with the following:

  ```
  # OpenAI Configuration
  OPENAI_API_KEY=your_openai_api_key
  LLM_MODEL=gpt-4-turbo-preview  # or your preferred model

  # Supabase Configuration
  SUPABASE_URL=your_supabase_url
  SUPABASE_SERVICE_KEY=your_supabase_key
  DATABASE_URL=your_database_url
  ```

- Set Up Supabase Database

  Run the SQL commands in `supabase/init.sql` in your Supabase SQL editor. This:
  - Creates required tables
  - Enables the pgvector extension
  - Sets up the similarity search function
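As a quick sanity check after setup, a short script along these lines can confirm that the environment variables and database are wired up. This is an illustrative sketch, not part of the repository: it assumes `dotenv`, `@supabase/supabase-js`, and `@langchain/openai` as the client libraries, and the `documents` table name is inferred from the `match_documents` function rather than stated anywhere in this README.

```js
// check-setup.js — illustrative sanity check (run as an ES module)
import 'dotenv/config';
import { createClient } from '@supabase/supabase-js';
import { OpenAIEmbeddings } from '@langchain/openai';

const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_SERVICE_KEY);
const embeddings = new OpenAIEmbeddings(); // reads OPENAI_API_KEY from the environment

// Embed a short string to verify the OpenAI key works
const vector = await embeddings.embedQuery('hello world');
console.log(`Embedding dimensions: ${vector.length}`);

// Query the (assumed) documents table to verify the Supabase schema is in place
const { error } = await supabase.from('documents').select('id').limit(1);
console.log(error ? `Supabase error: ${error.message}` : 'Supabase connection OK');
```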
```bash
# Crawl a documentation website
npm start crawl https://docs.example.com
```
The crawler will:
- Check robots.txt compliance
- Extract main content
- Convert to clean Markdown
- Generate embeddings
- Store in Supabase
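In code, that pipeline looks roughly like the sketch below. Crawlee's `CheerioCrawler` is shown because it appears in the configuration section further down; `htmlToMarkdown`, `chunkContent`, and `storeChunks` are hypothetical stand-ins for the helpers in `src/utils/`, and robots.txt checking and rate limiting are omitted for brevity:

```js
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 50,
  maxConcurrency: 2,
  async requestHandler({ request, $, enqueueLinks }) {
    // Pull the main content region and convert it to Markdown
    const html = $('main').html() ?? $('body').html();
    const markdown = await htmlToMarkdown(html); // hypothetical helper

    // Split into chunks, embed, and persist to Supabase
    const chunks = await chunkContent(markdown); // hypothetical helper
    await storeChunks(request.url, chunks);      // hypothetical helper

    // Stay on the same site while discovering new pages
    await enqueueLinks({ strategy: 'same-domain' });
  },
});

await crawler.run(['https://docs.example.com']);
```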
```bash
# Ask a question about the documentation
npm start query "Tell me about Python-centric Design"
```
The system will:
- Generate embeddings for your question
- Find relevant documentation chunks
- Generate a contextual answer
- Provide source references
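A condensed sketch of that flow, using the `match_documents` call and settings shown in the configuration section below. The `content` and `url` fields on the matches are assumptions about the stored metadata, and the prompt wording is illustrative:

```js
import { createClient } from '@supabase/supabase-js';
import { OpenAIEmbeddings, ChatOpenAI } from '@langchain/openai';

const supabase = createClient(process.env.SUPABASE_URL, process.env.SUPABASE_SERVICE_KEY);
const embeddings = new OpenAIEmbeddings();
const llm = new ChatOpenAI({ model: process.env.LLM_MODEL });

export async function answer(question) {
  // 1. Embed the question
  const queryEmbedding = await embeddings.embedQuery(question);

  // 2. Retrieve the most similar documentation chunks
  const { data: matches } = await supabase.rpc('match_documents', {
    query_embedding: queryEmbedding,
    match_threshold: 0.7,
    match_count: 5,
  });
  if (!matches?.length) return { answer: 'No relevant documentation found.', sources: [] };

  // 3. Ask the LLM to answer using only the retrieved context
  const context = matches.map((m) => m.content).join('\n---\n');
  const response = await llm.invoke([
    ['system', 'Answer using only the provided documentation context. Cite your sources.'],
    ['human', `Context:\n${context}\n\nQuestion: ${question}`],
  ]);

  // 4. Return the answer alongside source URLs for attribution
  return { answer: response.content, sources: matches.map((m) => m.url) };
}
```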
The project is laid out as follows:

```
.
├── src/
│   ├── crawler.js       # Documentation crawler
│   ├── query.js         # RAG query interface
│   └── utils/
│       ├── processor.js # Content processing
│       └── storage.js   # Supabase integration
├── supabase/
│   └── init.sql         # Database setup
├── .env                 # Configuration
└── index.js             # CLI interface
```
Adjust the crawler settings in `src/crawler.js`:

```js
const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 50, // Max pages to crawl
  maxConcurrency: 2,       // Concurrent requests
  // ... other options
});
```
Modify the chunking parameters in `src/utils/processor.js`:

```js
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,   // Characters per chunk
  chunkOverlap: 200, // Overlap between chunks
  // ... other options
});
```
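For reference, the splitter is applied to the converted Markdown roughly like this (a sketch; `markdown` stands for the page content produced by the crawler):

```js
// Split the converted Markdown into overlapping chunks before embedding
const chunks = await splitter.splitText(markdown);
console.log(`Produced ${chunks.length} chunks`);
```

Larger chunks give the model more context per match, while more overlap reduces the chance of splitting a sentence or code sample across chunk boundaries.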
Configure retrieval in `src/utils/storage.js`:

```js
const { data } = await supabase.rpc('match_documents', {
  query_embedding: embedding,
  match_threshold: 0.7, // Similarity threshold
  match_count: 5        // Number of matches
});
```
The system includes comprehensive error handling:
- Crawler failures
- Database connection issues
- API rate limits
- Invalid queries
- Missing content
Error messages are user-friendly and include debugging information when needed.
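For example, transient failures such as OpenAI rate limits can be handled with a retry-and-backoff wrapper along these lines (an illustrative sketch, not the exact code in the repository):

```js
// Retry an async operation with exponential backoff, e.g. around embedding calls
async function withRetry(fn, { retries = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      const delay = baseDelayMs * 2 ** attempt;
      console.warn(`Request failed (${err.message}), retrying in ${delay} ms...`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```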
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
ISC
Built with LangChain, Supabase, and OpenAI.