LLM Scraper

A flexible, configuration-driven web scraper, built on tls-requests and BeautifulSoup, that extracts article content and feeds it into a Retrieval-Augmented Generation (RAG) pipeline.

The project combines FastAPI, Celery, and Redis into a scalable, robust system for data ingestion.

Core Features

  • Config-Driven Parsing: Define how to scrape any site using simple JSON configuration files (see the example configuration after this list).
  • RAG-Ready: Automatically chunks content, generates embeddings via OpenAI, and stores the results in an AstraDB vector store.
  • Scalable Architecture:
    • FastAPI: For a high-performance, non-blocking API.
    • Celery: For distributed background task processing.
    • Redis: As the message broker and result backend for Celery.
  • Automated & On-Demand Scraping:
    • Scrape entire sites on a recurring schedule.
    • Scrape specific sites or single URLs via API endpoints.
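
A hypothetical parser configuration might look like the following. The field names are purely illustrative; the real schema is defined by the Pydantic models in schema.py, and actual configurations live in src/llm_scraper/parsers/:

# parsers/example_news.json (hypothetical example; field names are illustrative only)
{
  "name": "example-news",
  "base_url": "https://news.example.com",
  "article_link_selector": "a.article-title",
  "title_selector": "h1",
  "content_selector": "div.article-body",
  "published_date_selector": "time[datetime]"
}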

Project Structure

.
├── Procfile              # Defines processes for Honcho (api, worker, beat)
├── api.py                # FastAPI application, the user-facing entrypoint
├── celery_app.py         # Celery application instance configuration
├── pyproject.toml        # Project metadata and dependencies
├── src/
│   └── llm_scraper/
│       ├── __init__.py
│       ├── articles.py   # Core Article data model and chunking logic
│       ├── meta.py       # Metadata extraction logic
│       ├── parsers/      # Site-specific parser configurations
│       ├── schema.py     # Pydantic models for configuration and data
│       ├── settings.py   # Application settings management (from .env)
│       ├── utils.py      # Utility functions
│       └── vector_store.py # Handles interaction with OpenAI and AstraDB
└── worker.py             # Celery worker and scheduler (Celery Beat) definitions

Setup

  1. Install Dependencies: This project uses uv for package management.

    uv pip install -r requirements.txt
  2. Environment Variables: Create a .env file in the root directory and add your credentials:

    # .env
    OPENAI_API_KEY="sk-..."
    ASTRA_DB_APPLICATION_TOKEN="AstraCS:..."
    ASTRA_DB_API_ENDPOINT="https://..."
    ASTRA_DB_COLLECTION_NAME="your_collection_name"
    REDIS_URL="redis://localhost:6379/0"
  3. Run Redis: Ensure you have a Redis server running locally. You can use Docker for this:

    docker run -d -p 6379:6379 redis

How to Run

You can run the system in two ways: with Docker (recommended) or locally with Honcho.

1. Running with Docker (Recommended)

This is the easiest way to run the entire system, including the Redis database.

Prerequisites:

  • Docker and Docker Compose installed.
  • A .env file with your credentials (see Setup section).

To start the entire system, run:

docker-compose up --build

This command will:

  1. Build the Docker image for the application based on the Dockerfile.
  2. Start containers for the api, worker, beat, and redis services.
  3. Display all logs in your terminal.

To stop the services, press Ctrl+C.
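
If you started the stack in detached mode instead (docker-compose up -d), stop and remove the containers with:

docker-compose down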

2. Running Locally with Honcho

Use this method if you prefer not to use Docker.

Prerequisites:

  • Python and uv installed.
  • A running Redis server (e.g., docker run -d -p 6379:6379 redis).
  • Dependencies installed (uv pip install -r requirements.txt).
  • A .env file with your credentials.

To start the entire system, run:

honcho start
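
Honcho reads the Procfile in the repository root to decide which processes to start (api, worker, and beat). The entries typically look something like the sketch below; the module and app names are assumptions based on the project layout, so check the actual Procfile:

api: uvicorn api:app --host 0.0.0.0 --port 8000        # "api:app" assumed from api.py
worker: celery -A celery_app worker --loglevel=info    # "-A celery_app" assumed from celery_app.py
beat: celery -A celery_app beat --loglevel=info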

API Usage

  • POST /scrape-url: Scrape a single URL on demand.
  • POST /scrape-site: Trigger a background task to scrape an entire pre-configured site.
  • GET /tasks/{task_id}: Check the status of a background task.
  • POST /query: Perform a similarity search on the vectorized data in AstraDB.
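
Assuming the API is served on localhost:8000 (uvicorn's default) and that the request bodies use the field names shown below (the Pydantic models in schema.py are the source of truth), the endpoints can be exercised with curl:

# JSON field names in these payloads are assumptions for illustration
curl -X POST http://localhost:8000/scrape-url \
  -H "Content-Type: application/json" \
  -d '{"url": "https://news.example.com/some-article"}'

curl -X POST http://localhost:8000/scrape-site \
  -H "Content-Type: application/json" \
  -d '{"site": "example-news"}'

curl http://localhost:8000/tasks/<task_id>

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "articles about vector databases"}'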
