A flexible, configuration-driven web scraper designed to extract article content and feed it into a Retrieval-Augmented Generation (RAG) pipeline.
This project uses a professional, scalable architecture with FastAPI, Celery, and Redis to create a robust system for data ingestion.
- Config-Driven Parsing: Define how to scrape any site using simple JSON configuration files (an illustrative config follows this feature list).
- RAG-Ready: Automatically chunks content, generates embeddings via OpenAI, and stores it in an AstraDB vector store.
- Scalable Architecture:
  - FastAPI: for a high-performance, non-blocking API.
  - Celery: for distributed background task processing.
  - Redis: as the message broker and result backend for Celery.
- Automated & On-Demand Scraping:
  - Scrape entire sites on a recurring schedule.
  - Scrape specific sites or single URLs via API endpoints.
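
For illustration, a parser config under `src/llm_scraper/parsers/` might map CSS selectors to article fields. The file below is hypothetical — every key name is an assumption, and the authoritative schema lives in `schema.py`:

```json
{
  "site_name": "example-news",
  "base_url": "https://example.com",
  "article_link_selector": "a.article-link",
  "title_selector": "h1.headline",
  "body_selector": "div.article-body",
  "published_selector": "time.published"
}
```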
```
.
├── Procfile              # Defines processes for Honcho (api, worker, beat)
├── api.py                # FastAPI application, the user-facing entrypoint
├── celery_app.py         # Celery application instance configuration
├── pyproject.toml        # Project metadata and dependencies
├── src/
│   └── llm_scraper/
│       ├── __init__.py
│       ├── articles.py       # Core Article data model and chunking logic
│       ├── meta.py           # Metadata extraction logic
│       ├── parsers/          # Site-specific parser configurations
│       ├── schema.py         # Pydantic models for configuration and data
│       ├── settings.py       # Application settings management (from .env)
│       ├── utils.py          # Utility functions
│       └── vector_store.py   # Handles interaction with OpenAI and AstraDB
└── worker.py             # Celery worker and scheduler (Celery Beat) definitions
```
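
To make the moving parts concrete, here is a minimal sketch of how `worker.py` might register a scrape task and a Celery Beat schedule. This is an assumption about the wiring, not the repository's actual code; the task name, the interval, and the `app` import are all illustrative:

```python
# Illustrative sketch only -- not the repository's actual worker.py
from celery_app import app  # assumes celery_app.py exposes a Celery instance named `app`


@app.task(name="scrape_site")
def scrape_site(site_name: str) -> str:
    # The real task would load the site's JSON parser config, fetch and chunk
    # articles, embed them via OpenAI, and upsert the vectors into AstraDB.
    return f"scraped {site_name}"


# Celery Beat: re-scrape a configured site on a recurring schedule (interval assumed)
app.conf.beat_schedule = {
    "scrape-example-news-daily": {
        "task": "scrape_site",
        "schedule": 60 * 60 * 24,  # seconds
        "args": ("example-news",),
    }
}
```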
- Install Dependencies: This project uses `uv` for package management.

  ```bash
  uv pip install -r requirements.txt
  ```

- Environment Variables: Create a `.env` file in the root directory and add your credentials:

  ```bash
  # .env
  OPENAI_API_KEY="sk-..."
  ASTRA_DB_APPLICATION_TOKEN="AstraCS:..."
  ASTRA_DB_API_ENDPOINT="https://..."
  ASTRA_DB_COLLECTION_NAME="your_collection_name"
  REDIS_URL="redis://localhost:6379/0"
  ```
- Run Redis: Ensure you have a Redis server running locally (a quick connectivity check is sketched after these steps). You can use Docker for this:

  ```bash
  docker run -d -p 6379:6379 redis
  ```
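
With the `.env` in place, `settings.py` presumably exposes these values to the rest of the application. A common pattern for this uses `pydantic-settings`; the class and field names below are assumptions, not the actual contents of `settings.py`:

```python
# Illustrative settings loader -- see src/llm_scraper/settings.py for the real one
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str
    astra_db_application_token: str
    astra_db_api_endpoint: str
    astra_db_collection_name: str
    redis_url: str = "redis://localhost:6379/0"


settings = Settings()  # fails fast with a validation error if a required variable is missing
```

And to confirm the Redis broker is actually reachable before starting any workers (assuming the `redis` package is installed):

```python
# Quick broker connectivity check
import redis

client = redis.Redis.from_url("redis://localhost:6379/0")
print(client.ping())  # True means Celery can reach its broker
```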
You can run the system in two ways: with Docker Compose or locally using `honcho`.

Option 1: Docker Compose (recommended)

This is the easiest way to run the entire system, including the Redis database.
Prerequisites:
- Docker and Docker Compose installed.
- A `.env` file with your credentials (see Setup section).
To start the entire system, run:
```bash
docker-compose up --build
```

This command will:

- Build the Docker image for the application based on the `Dockerfile`.
- Start containers for the `api`, `worker`, `beat`, and `redis` services.
- Display all logs in your terminal.
To stop the services, press Ctrl+C.
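
For orientation, a `docker-compose.yml` consistent with those four services could look roughly like the sketch below. The image, build context, ports, and commands are all assumptions — defer to the file actually shipped in the repository:

```yaml
# Illustrative docker-compose.yml -- the repository's own file is authoritative
version: "3.8"

services:
  redis:
    image: redis:7

  api:
    build: .
    command: uvicorn api:app --host 0.0.0.0 --port 8000
    env_file: .env
    ports:
      - "8000:8000"
    depends_on:
      - redis

  worker:
    build: .
    command: celery -A celery_app worker --loglevel=info
    env_file: .env
    depends_on:
      - redis

  beat:
    build: .
    command: celery -A celery_app beat --loglevel=info
    env_file: .env
    depends_on:
      - redis
```

Note that inside Compose the broker hostname is the service name, so `REDIS_URL` would need to point at `redis://redis:6379/0` rather than `localhost`.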
Option 2: Run locally with `honcho`

Use this method if you prefer not to use Docker.
Prerequisites:
- Python and `uv` installed.
- A running Redis server (e.g., `docker run -d -p 6379:6379 redis`).
- Dependencies installed (`uv pip install -r requirements.txt`).
- A `.env` file with your credentials.
To start the entire system, run:
```bash
honcho start
```

Once running, the API exposes the following endpoints:

- `POST /scrape-url`: Scrape a single URL on demand.
- `POST /scrape-site`: Trigger a background task to scrape an entire pre-configured site.
- `GET /tasks/{task_id}`: Check the status of a background task.
- `POST /query`: Perform a similarity search on the vectorized data in AstraDB.
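
As a usage sketch, the calls below show one way to drive these endpoints from Python. The request-body field names (`url`, `query`, `task_id`) are assumptions; FastAPI's interactive docs at `/docs` describe the actual request models:

```python
# Illustrative client calls -- field names are assumed, check /docs for the real schema
import requests

BASE = "http://localhost:8000"

# Queue a background scrape of a single URL
resp = requests.post(f"{BASE}/scrape-url", json={"url": "https://example.com/some-article"})
task_id = resp.json().get("task_id")

# Poll the background task
status = requests.get(f"{BASE}/tasks/{task_id}").json()
print(status)

# Similarity search over the vectorized articles in AstraDB
hits = requests.post(f"{BASE}/query", json={"query": "What happened in the latest article?"})
print(hits.json())
```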