python-crawl4ai

Trigger.dev + Python headless web crawler example

This demo showcases how to use Trigger.dev with Python to build a web crawler that uses a headless browser to navigate websites and extract content.

Features

Using Proxies

When web scraping, you MUST use a proxy to comply with the Trigger.dev terms of service. Direct scraping of third-party websites without the site owner’s permission using Trigger.dev Cloud is prohibited and will result in account suspension.

Some popular proxy services are:

Once you have a proxy service, set the following environment variables in your local .env file and add them in the Trigger.dev dashboard:

  • PROXY_URL: The URL of your proxy server (e.g., http://proxy.example.com:8080)
  • PROXY_USERNAME: Username for authenticated proxies (optional)
  • PROXY_PASSWORD: Password for authenticated proxies (optional)
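Assuming the crawler passes these variables through to the headless browser, assembling a Playwright-style proxy config from the environment might look like this (build_proxy_config is a hypothetical helper for illustration, not part of the repo):

```python
import os

def build_proxy_config():
    # Hypothetical helper: read the proxy settings documented above and
    # return them in the {"server", "username", "password"} shape that
    # Playwright's launch options accept. Returns None when no proxy is set.
    proxy_url = os.environ.get("PROXY_URL")
    if not proxy_url:
        return None

    config = {"server": proxy_url}

    # Username/password are optional and only used for authenticated proxies.
    username = os.environ.get("PROXY_USERNAME")
    password = os.environ.get("PROXY_PASSWORD")
    if username and password:
        config["username"] = username
        config["password"] = password

    return config
```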

Getting Started

  1. After cloning the repo, run npm install to install the dependencies.
  2. Create a virtual environment: python -m venv venv
  3. Activate the virtual environment. On Mac/Linux: source venv/bin/activate; on Windows: venv\Scripts\activate
  4. Install the Python dependencies: pip install -r requirements.txt
  5. Copy the project ref from your Trigger.dev dashboard and add it to the trigger.config.ts file.
  6. Run the Trigger.dev CLI dev command (it may ask you to authorize the CLI if you haven't already).
  7. Test the task in the dashboard.
  8. Deploy the task to production using the Trigger.dev CLI deploy command.

Relevant code

  • pythonTasks.ts triggers the Python script and returns the result
  • trigger.config.ts uses the Trigger.dev Python extension to install the dependencies and run the script, and installPlaywrightChromium() to provision a headless Chromium browser
  • crawl-url.py is the main Python script that takes a URL and returns the markdown content of the page
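The core of such a script can be sketched with Crawl4AI's AsyncWebCrawler; this is an illustrative outline, not the repo's exact code (crawl and parse_args are names chosen here for clarity):

```python
import argparse
import asyncio
import sys

def parse_args(argv):
    # Parse the target URL from the command line, as the script
    # receives it when triggered from pythonTasks.ts.
    parser = argparse.ArgumentParser(description="Crawl a URL and print its markdown")
    parser.add_argument("url", help="The page to crawl")
    return parser.parse_args(argv)

async def crawl(url):
    # Crawl4AI drives a headless Chromium browser under the hood.
    # Imported lazily so the rest of the module works without the package.
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        return result.markdown

if __name__ == "__main__":
    args = parse_args(sys.argv[1:])
    # Print the markdown so the Trigger.dev task can capture it as output.
    print(asyncio.run(crawl(args.url)))
```

Printing to stdout keeps the interface simple: the TypeScript task runs the script and reads whatever the crawl produced.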