A comprehensive web scraping template built with FastAPI and Playwright. It features multi-level caching, proxy support, and extensive configuration options.
- FastAPI-based API endpoints
- Caching system:
  - Resource caching (JS, CSS, images, or custom)
- Configurable proxy support
- Browser session management
- Comprehensive error handling
- Modular scraping architecture
- Detailed network statistics
- Automated metadata extraction
- Python 3.12+
- FastAPI
- Playwright
- Additional dependencies in requirements.txt
- Clone the repository:
git clone https://github.com/DanielWTE/template-web-scraper.git
cd template-web-scraper
- Create and activate virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Install Playwright browsers:
playwright install
- Configure environment variables:
cp .env.example .env
Edit the .env file with your configuration.
Required environment variables:
# API Configuration
API_KEY=your_api_key_here
PORT=8000
HOST=0.0.0.0
# Browser Configuration
BROWSER_POOL_SIZE=1
PAGE_TIMEOUT=300000
DETAILED_LOGGING=false
# Cache Configuration
CACHE_DIR=cache
ENABLE_CACHING=true
# Proxy Configuration
PROXY_FILE_PATH=proxies/proxies.txt
USE_PROXIES=false
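The variables above could be read into a typed settings object at startup. The following is a minimal sketch, not the template's actual loader: the `Settings` class, `load_settings`, and `_as_bool` names are hypothetical, the defaults simply mirror the example values listed above, and `PAGE_TIMEOUT` is assumed to be in milliseconds.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings used in .env files."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in .env."""
    api_key: str
    port: int
    host: str
    browser_pool_size: int
    page_timeout: int  # assumed to be milliseconds
    detailed_logging: bool
    cache_dir: str
    enable_caching: bool
    proxy_file_path: str
    use_proxies: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ).

    API_KEY has no default and raises KeyError when missing; every other
    default below is an assumption copied from the example values.
    """
    env = os.environ if env is None else env
    return Settings(
        api_key=env["API_KEY"],
        port=int(env.get("PORT", "8000")),
        host=env.get("HOST", "0.0.0.0"),
        browser_pool_size=int(env.get("BROWSER_POOL_SIZE", "1")),
        page_timeout=int(env.get("PAGE_TIMEOUT", "300000")),
        detailed_logging=_as_bool(env.get("DETAILED_LOGGING", "false")),
        cache_dir=env.get("CACHE_DIR", "cache"),
        enable_caching=_as_bool(env.get("ENABLE_CACHING", "true")),
        proxy_file_path=env.get("PROXY_FILE_PATH", "proxies/proxies.txt"),
        use_proxies=_as_bool(env.get("USE_PROXIES", "false")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.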
Start the API server:
uvicorn main:app --host 0.0.0.0 --port 8000
Development mode with auto-reload:
uvicorn main:app --reload --reload-exclude 'venv'
Build and run with Docker:
docker build -t template-web-scraper .
docker run -p 8000:8000 template-web-scraper
Health check endpoint
Scrape a webpage with caching
- Required header:
  Authorization: your_api_key
- Required parameter:
  url
Example:
curl -X GET "http://localhost:8000/scrape?url=https://example.com" \
-H "Authorization: your_api_key" # From .env
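The same request can be made from Python using only the standard library. This is a sketch under the assumptions of the curl example (server on `localhost:8000`, a `/scrape` path, and the key sent as the raw `Authorization` header value); the helper names `build_scrape_request` and `fetch` are hypothetical, and the response is returned as raw bytes because its format is not specified here.

```python
import urllib.parse
import urllib.request


def build_scrape_request(
    target_url: str,
    api_key: str,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a GET request for the scrape endpoint.

    The target URL is percent-encoded so that any query string it
    carries is not mistaken for parameters of the API itself.
    """
    query = urllib.parse.urlencode({"url": target_url})
    return urllib.request.Request(
        f"{base_url}/scrape?{query}",
        headers={"Authorization": api_key},
        method="GET",
    )


def fetch(target_url: str, api_key: str, timeout: float = 60.0) -> bytes:
    """Send the request and return the raw response body."""
    request = build_scrape_request(target_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With the server running, `fetch("https://example.com", "your_api_key")` performs the call; a rejected key surfaces as `urllib.error.HTTPError`.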
The template implements a two-level caching system:
- Resource Caching:
  - Caches static resources (JS, CSS, images)
  - Reduces bandwidth usage and load times
  - Configurable through ENABLE_CACHING