Modern web scraping template built with FastAPI and Playwright. Features resource caching, proxy support, and browser pool management.

DanielWTE/template-web-scraper

Template Web Scraper

A comprehensive web scraping template built with FastAPI and Playwright. Features multi-level caching, proxy support, and extensive configurability.

Features

  • FastAPI-based API endpoints
  • Caching system:
    • Resource caching (JS, CSS, images or custom)
  • Configurable proxy support
  • Browser session management
  • Comprehensive error handling
  • Modular scraping architecture
  • Detailed network statistics
  • Automated metadata extraction

Requirements

  • Python 3.12+
  • FastAPI
  • Playwright
  • Additional dependencies in requirements.txt

Installation

  1. Clone the repository:
git clone https://github.com/DanielWTE/template-web-scraper.git
cd template-web-scraper
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Install Playwright browsers:
playwright install
  5. Configure environment variables:
cp .env.example .env

Edit the .env file with your configuration.

Configuration

Required environment variables:

# API Configuration
API_KEY=your_api_key_here
PORT=8000
HOST=0.0.0.0

# Browser Configuration
BROWSER_POOL_SIZE=1
PAGE_TIMEOUT=300000
DETAILED_LOGGING=false

# Cache Configuration
CACHE_DIR=cache
ENABLE_CACHING=true

# Proxy Configuration
PROXY_FILE_PATH=proxies/proxies.txt
USE_PROXIES=false
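
As an illustration of how these variables might be consumed, here is a minimal sketch that reads them into a typed settings object. The `Settings` class, the `load_settings` function, and the `_as_bool` helper are hypothetical names that do not come from this repository, and the defaults simply mirror the sample values above. The template's actual loading code may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Treat common truthy spellings ("1", "true", "yes", "on") as True."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in the README."""
    api_key: str
    port: int
    host: str
    browser_pool_size: int
    page_timeout: int  # README gives no unit; Playwright timeouts are in milliseconds
    detailed_logging: bool
    cache_dir: str
    enable_caching: bool
    proxy_file_path: str
    use_proxies: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ).

    Defaults are assumptions copied from the sample .env values above.
    API_KEY has no default, so a missing key raises KeyError.
    """
    env = os.environ if env is None else env
    return Settings(
        api_key=env["API_KEY"],
        port=int(env.get("PORT", "8000")),
        host=env.get("HOST", "0.0.0.0"),
        browser_pool_size=int(env.get("BROWSER_POOL_SIZE", "1")),
        page_timeout=int(env.get("PAGE_TIMEOUT", "300000")),
        detailed_logging=_as_bool(env.get("DETAILED_LOGGING", "false")),
        cache_dir=env.get("CACHE_DIR", "cache"),
        enable_caching=_as_bool(env.get("ENABLE_CACHING", "true")),
        proxy_file_path=env.get("PROXY_FILE_PATH", "proxies/proxies.txt"),
        use_proxies=_as_bool(env.get("USE_PROXIES", "false")),
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the loader easy to exercise in tests.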

Usage

Start the API server:

uvicorn main:app --host 0.0.0.0 --port 8000

Development mode with auto-reload:

uvicorn main:app --reload --reload-exclude 'venv'

Docker

Build and run with Docker:

docker build -t template-web-scraper .
docker run -p 8000:8000 template-web-scraper

API Endpoints

GET /

Health check endpoint

GET /scrape

Scrape a webpage with caching

  • Required header: Authorization: your_api_key
  • Required parameter: url

Example:

curl -X GET "http://localhost:8000/scrape?url=https://example.com" \
     -H "Authorization: your_api_key" # From .env
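
The same request can be issued from Python. The sketch below uses only the standard library and assumes the server is reachable at the base URL you pass in. `build_scrape_request` and `scrape` are hypothetical helper names, and the response is returned as raw text because the README does not describe the response format.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_scrape_request(base_url: str, target_url: str, api_key: str) -> Request:
    """Build a GET /scrape request with the url parameter percent-encoded."""
    query = urlencode({"url": target_url})
    return Request(
        f"{base_url.rstrip('/')}/scrape?{query}",
        headers={"Authorization": api_key},  # same value as API_KEY in .env
        method="GET",
    )


def scrape(base_url: str, target_url: str, api_key: str, timeout: float = 60.0) -> str:
    """Send the request and return the response body as text."""
    request = build_scrape_request(base_url, target_url, api_key)
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Encoding the `url` parameter matters once the target address carries its own query string, since an unencoded `&` would otherwise be read as a separator between parameters of the `/scrape` call.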

Caching System

The template implements a two-level caching system:

  1. Resource Caching:
    • Caches static resources (JS, CSS, images)
    • Reduces bandwidth usage and load times
    • Configurable through ENABLE_CACHING
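
To make the idea of resource caching concrete, here is a minimal sketch of an on-disk cache keyed by a hash of the resource URL. It is not the template's implementation: `ResourceCache`, `CACHEABLE_TYPES`, and the SHA-256 file naming are assumptions chosen for illustration. The type names match the values Playwright reports in `request.resource_type`.

```python
import hashlib
from pathlib import Path
from typing import Optional

# Resource types treated as static; an assumption based on the list above.
CACHEABLE_TYPES = {"script", "stylesheet", "image"}


class ResourceCache:
    """Hypothetical disk cache storing one file per resource URL."""

    def __init__(self, cache_dir: str = "cache", enabled: bool = True) -> None:
        self.cache_dir = Path(cache_dir)
        self.enabled = enabled  # would mirror ENABLE_CACHING
        if enabled:
            self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path_for(self, url: str) -> Path:
        # Hash the URL so any URL maps to a safe, fixed-length file name.
        return self.cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()

    def _cacheable(self, resource_type: str) -> bool:
        return self.enabled and resource_type in CACHEABLE_TYPES

    def get(self, url: str, resource_type: str) -> Optional[bytes]:
        """Return the cached body, or None on a miss or when caching is off."""
        if not self._cacheable(resource_type):
            return None
        path = self._path_for(url)
        return path.read_bytes() if path.exists() else None

    def put(self, url: str, resource_type: str, body: bytes) -> bool:
        """Store the body; return False if this resource is not cached."""
        if not self._cacheable(resource_type):
            return False
        self._path_for(url).write_bytes(body)
        return True
```

With Playwright, a cache like this could be consulted inside a `page.route` handler: serve a hit with `route.fulfill`, and on a miss let the request continue and store the response body. This sketch ignores expiry and response headers, which a real cache would need to handle.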
