This demo showcases how to use Trigger.dev with Python to build a web crawler that uses a headless browser to navigate websites and extract content.
- Trigger.dev for background task orchestration
- Trigger.dev Python build extension to install the dependencies and run the Python script
- Crawl4AI, an open-source, LLM-friendly web crawler
- Playwright to create a headless Chromium browser
- Proxy support
When web scraping, you MUST use a proxy to comply with the Trigger.dev terms of service. Direct scraping of third-party websites without the site owner’s permission using Trigger.dev Cloud is prohibited and will result in account suspension.
Once you have chosen a proxy service, set the following environment variables in your Trigger.dev .env file, and add them to the Trigger.dev dashboard:
- `PROXY_URL`: The URL of your proxy server (e.g., `http://proxy.example.com:8080`)
- `PROXY_USERNAME`: Username for authenticated proxies (optional)
- `PROXY_PASSWORD`: Password for authenticated proxies (optional)
- After cloning the repo, run `npm install` to install the dependencies.
- Create a virtual environment: `python -m venv venv`
- Activate the virtual environment, depending on your OS. On Mac/Linux: `source venv/bin/activate`; on Windows: `venv\Scripts\activate`
- Install the Python dependencies: `pip install -r requirements.txt`
- Copy the project ref from your Trigger.dev dashboard and add it to the `trigger.config.ts` file.
- Run the Trigger.dev CLI dev command (it may ask you to authorize the CLI if you haven't already).
- Test the task in the dashboard.
- Deploy the task to production using the Trigger.dev CLI deploy command.
- `pythonTasks.ts` triggers the Python script and returns the result
- `trigger.config.ts` uses the Trigger.dev Python extension to install the dependencies and run the script, as well as `installPlaywrightChromium()` to create a headless Chromium browser
- `crawl-url.py` is the main Python script that takes a URL and returns the markdown content of the page
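The real `crawl-url.py` delegates the crawling and markdown extraction to Crawl4AI's headless browser. As a rough, dependency-free illustration of its contract only (URL in as a command-line argument, markdown out on stdout), here is a stand-in sketch; the `MarkdownExtractor` class and `html_to_markdown` helper are invented for this example and are far cruder than Crawl4AI's extraction:

```python
import sys
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Very rough HTML-to-markdown converter: h1-h3 become '#' headings,
    other text passes through as plain paragraphs."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._heading = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._heading = "#" * int(tag[1])

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p"):
            self._heading = None
            self.parts.append("")  # blank line between blocks

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._heading:
            self.parts.append(f"{self._heading} {text}")
        else:
            self.parts.append(text)


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    return "\n".join(parser.parts).strip()


if __name__ == "__main__" and len(sys.argv) > 1:
    # The actual script fetches sys.argv[1] with Crawl4AI's headless
    # Chromium (via the proxy) and prints the extracted markdown,
    # which pythonTasks.ts captures and returns as the task result.
    import urllib.request

    page = urllib.request.urlopen(sys.argv[1]).read().decode("utf-8", "replace")
    print(html_to_markdown(page))
```

Printing the markdown to stdout is what lets the TypeScript task read the script's output and return it as the task result.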