This demo showcases how to use Trigger.dev with Python to build a web crawler that uses a headless browser to navigate websites and extract content.
- Trigger.dev for background task orchestration
- Trigger.dev Python build extension to install the dependencies and run the Python script
- Crawl4AI, an open-source, LLM-friendly web crawler
- Playwright to create a headless Chromium browser
- Proxy support
When web scraping, you MUST use a proxy to comply with the Trigger.dev terms of service. Direct scraping of third-party websites without the site owner’s permission using Trigger.dev Cloud is prohibited and will result in account suspension.
Once you have chosen a proxy service, set the following environment variables in your Trigger.dev .env file, and add them to the Trigger.dev dashboard:
- `PROXY_URL`: The URL of your proxy server (e.g., `http://proxy.example.com:8080`)
- `PROXY_USERNAME`: Username for authenticated proxies (optional)
- `PROXY_PASSWORD`: Password for authenticated proxies (optional)
- After cloning the repo, run `npm install` to install the dependencies.
- Create a virtual environment: `python -m venv venv`
- Activate the virtual environment, depending on your OS. On Mac/Linux: `source venv/bin/activate`; on Windows: `venv\Scripts\activate`
- Install the Python dependencies: `pip install -r requirements.txt`
- Copy the project ref from your Trigger.dev dashboard and add it to the `trigger.config.ts` file.
- Run the Trigger.dev CLI dev command (it may ask you to authorize the CLI if you haven't already).
- Test the task in the dashboard.
- Deploy the task to production using the Trigger.dev CLI deploy command.
- `pythonTasks.ts` triggers the Python script and returns the result
- `trigger.config.ts` uses the Trigger.dev Python extension to install the dependencies and run the script, as well as `installPlaywrightChromium()` to create a headless Chromium browser
- `crawl-url.py` is the main Python script that takes a URL and returns the markdown content of the page
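The real `crawl-url.py` delegates the crawling and markdown extraction to Crawl4AI's headless browser. As a rough, dependency-free illustration of its contract only (URL in as a command-line argument, markdown out on stdout), here is a stand-in sketch; the `MarkdownExtractor` class and `html_to_markdown` helper are invented for this example and are far cruder than Crawl4AI's extraction:

```python
import sys
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Very rough HTML-to-markdown converter: h1-h3 become '#' headings,
    other text passes through as plain paragraphs."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._heading = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._heading = "#" * int(tag[1])

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p"):
            self._heading = None
            self.parts.append("")  # blank line between blocks

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._heading:
            self.parts.append(f"{self._heading} {text}")
        else:
            self.parts.append(text)


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    return "\n".join(parser.parts).strip()


if __name__ == "__main__" and len(sys.argv) > 1:
    # The actual script fetches sys.argv[1] with Crawl4AI's headless
    # Chromium (via the proxy) and prints the extracted markdown,
    # which pythonTasks.ts captures and returns as the task result.
    import urllib.request

    page = urllib.request.urlopen(sys.argv[1]).read().decode("utf-8", "replace")
    print(html_to_markdown(page))
```

Printing the markdown to stdout is what lets the TypeScript task read the script's output and return it as the task result.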