Template Web Scraper

A comprehensive web scraping template built with FastAPI and Playwright. Features multi-level caching, proxy support, and extensive configurability.

Features

FastAPI-based API endpoints
Caching system with resource caching (JS, CSS, images, or custom types)
Configurable proxy support
Browser session management
Comprehensive error handling
Modular scraping architecture
Detailed network statistics
Automated metadata extraction

Requirements

Python 3.12+
FastAPI
Playwright
Additional dependencies in requirements.txt

Installation

Clone the repository:

git clone https://github.com/DanielWTE/template-web-scraper.git
cd template-web-scraper

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Install Playwright browsers:

playwright install

Configure environment variables:

cp .env.example .env

Edit the .env file with your configuration.

Configuration

Required environment variables:

# API Configuration
API_KEY=your_api_key_here
PORT=8000
HOST=0.0.0.0

# Browser Configuration
BROWSER_POOL_SIZE=1
PAGE_TIMEOUT=300000
DETAILED_LOGGING=false

# Cache Configuration
CACHE_DIR=cache
ENABLE_CACHING=true

# Proxy Configuration
PROXY_FILE_PATH=proxies/proxies.txt
USE_PROXIES=false
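
As an illustration of how these variables might be read at startup, here is a minimal sketch using only the standard library. The `Settings` class, the `load_settings` function, and the fallback defaults (copied from the sample values above) are assumptions for this example and are not taken from the repository's code.

```python
import os
from dataclasses import dataclass


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings ("true", "1", "yes", "on") as True."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the variables in .env."""
    api_key: str
    port: int
    host: str
    browser_pool_size: int
    page_timeout: int  # unit not stated in the README; Playwright timeouts are in milliseconds
    detailed_logging: bool
    cache_dir: str
    enable_caching: bool
    proxy_file_path: str
    use_proxies: bool


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping; falls back to os.environ when none is given."""
    env = os.environ if env is None else env
    return Settings(
        api_key=env["API_KEY"],  # no safe default, so a missing key raises KeyError
        port=int(env.get("PORT", "8000")),
        host=env.get("HOST", "0.0.0.0"),
        browser_pool_size=int(env.get("BROWSER_POOL_SIZE", "1")),
        page_timeout=int(env.get("PAGE_TIMEOUT", "300000")),
        detailed_logging=_as_bool(env.get("DETAILED_LOGGING", "false")),
        cache_dir=env.get("CACHE_DIR", "cache"),
        enable_caching=_as_bool(env.get("ENABLE_CACHING", "true")),
        proxy_file_path=env.get("PROXY_FILE_PATH", "proxies/proxies.txt"),
        use_proxies=_as_bool(env.get("USE_PROXIES", "false")),
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the loader easy to test, for example `load_settings({"API_KEY": "secret", "USE_PROXIES": "true"})`.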

Usage

Start the API server:

uvicorn main:app --host 0.0.0.0 --port 8000

Development mode with auto-reload:

uvicorn main:app --reload --reload-exclude 'venv'

Docker

Build and run with Docker:

docker build -t template-web-scraper .
docker run -p 8000:8000 template-web-scraper

API Endpoints

GET /

Health check endpoint

GET /scrape

Scrape a webpage with caching

Required header: Authorization: your_api_key
Required parameter: url

Example:

curl -X GET "http://localhost:8000/scrape?url=https://example.com" \
     -H "Authorization: your_api_key" # From .env
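
The same request can be made from Python. The sketch below uses only the standard library; the helper names `build_scrape_request` and `scrape` are hypothetical, and because the response format is not documented here, the body is returned as raw text rather than parsed.

```python
import urllib.parse
import urllib.request


def build_scrape_request(base_url: str, target_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET /scrape request with the url parameter and Authorization header."""
    # urlencode percent-escapes the target URL so it survives as a single query value.
    query = urllib.parse.urlencode({"url": target_url})
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/scrape?{query}",
        headers={"Authorization": api_key},
        method="GET",
    )


def scrape(base_url: str, target_url: str, api_key: str, timeout: float = 300.0) -> str:
    """Send the request and return the response body as text.

    The generous timeout is an assumption: scraping a page can take far
    longer than a typical API call.
    """
    request = build_scrape_request(base_url, target_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires the API server to be running locally.
    print(scrape("http://localhost:8000", "https://example.com", "your_api_key"))
```

Keeping request construction separate from sending makes it possible to inspect the final URL and headers without touching the network.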

Caching System

The template implements a two-level caching system:

Resource Caching:

Caches static resources (JS, CSS, images)
Reduces bandwidth usage and load times
Configurable through ENABLE_CACHING
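
To make the idea of resource caching concrete, here is a minimal sketch of a disk cache keyed by a hash of the resource URL. It is not the template's actual implementation: the `ResourceCache` class, its methods, and the one-file-per-URL layout are assumptions made for illustration, with `cache_dir` and `enabled` standing in for `CACHE_DIR` and `ENABLE_CACHING`.

```python
import hashlib
from pathlib import Path
from typing import Optional


class ResourceCache:
    """Hypothetical on-disk cache for static resources, one file per URL."""

    def __init__(self, cache_dir: str = "cache", enabled: bool = True) -> None:
        self.cache_dir = Path(cache_dir)
        self.enabled = enabled

    def _path_for(self, url: str) -> Path:
        # Hashing the URL gives a fixed-length, filesystem-safe file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / digest

    def get(self, url: str) -> Optional[bytes]:
        """Return the cached body, or None on a miss or when caching is disabled."""
        if not self.enabled:
            return None
        path = self._path_for(url)
        return path.read_bytes() if path.is_file() else None

    def put(self, url: str, body: bytes) -> None:
        """Store a resource body; does nothing when caching is disabled."""
        if not self.enabled:
            return
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self._path_for(url).write_bytes(body)
```

In a Playwright-based scraper, a cache like this would typically be consulted from a request-interception handler: serve the stored bytes on a hit, and fetch and store the resource on a miss. This sketch omits expiry and content-type handling.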
