pip install airscrapy

This operator runs the Scrapy engine inside the Airflow worker process itself, eliminating the need to spawn a separate process.
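For intuition, this is the same in-process model that plain Scrapy exposes through its CrawlerProcess API. The sketch below is illustrative only (it is not airscrapy's actual implementation); it shows what driving a spider inside the current Python process looks like:

    # Illustrative sketch of an in-process crawl using Scrapy's own API;
    # airscrapy's internals may differ. No subprocess is spawned: the
    # Scrapy engine runs inside the current Python process.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def run_in_process(spider_cls):
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider_cls)
        process.start()  # blocks until the crawl finishes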
If the spider is structured as follows:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]

    def parse(self, response):
        yield {
            'text': response.css('.info').extract_first()
        }

Here’s how you can create a DAG using the operator:
import os
from datetime import datetime

from airflow import DAG
from airscrapy import ScrapyOperator
from myscrapers.spiders.example import ExampleSpider

# Point Scrapy at the shared settings module
os.environ["SCRAPY_SETTINGS_MODULE"] = "myscrapers.settings"

with DAG(
    dag_id="scrapers",
    start_date=datetime(2024, 1, 1),
    # Add extra settings like credentials or tokens
    params={
        "extra_settings": {
            "CONCURRENT_REQUESTS": 2,
        }
    },
) as dag:
    # task_id is required by Airflow's BaseOperator
    task = ScrapyOperator(task_id="scrape_example", spider=ExampleSpider)

if __name__ == "__main__":
    dag.test()

The extra_settings parameter lets you inject values dynamically, such as
credentials or tokens, complementing the settings.py file.
Additionally, ensure you set the SCRAPY_SETTINGS_MODULE environment variable.
Without it, Scrapy won't be able to locate the settings.
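To see how such per-run overrides can layer on top of the shared module, here is a minimal sketch using Scrapy's own Settings API (illustrative only; airscrapy's internals may differ):

    from scrapy.settings import Settings

    settings = Settings()
    # Shared project settings, as named by SCRAPY_SETTINGS_MODULE
    # (assumes the myscrapers package is importable)
    settings.setmodule("myscrapers.settings", priority="project")
    # Per-run extras win because they carry a higher priority
    settings.update({"CONCURRENT_REQUESTS": 2}, priority="cmdline")

    assert settings.getint("CONCURRENT_REQUESTS") == 2

Scrapy resolves lookups by priority, so values applied at cmdline priority cleanly shadow the project defaults without mutating settings.py.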
The DAG directory is organized as follows:
dags
|- myscrapers
|  |- spiders
|  |  |- __init__.py
|  |  |- example.py
|  |- __init__.py
|  |- items.py
|  |- middlewares.py
|  |- pipelines.py
|  |- settings.py
|- mydag.py
|- scrapy.cfg
This structure enables us to run the DAG in local debugging mode:
python mydag.py
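For completeness, the scrapy.cfg at the root of the dags folder is typically just a pointer to the settings module, in the standard layout that scrapy startproject generates:

    [settings]
    default = myscrapers.settings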