scrapy-pubsub
is a Scrapy item pipeline which writes crawled items to Cloud Pub/Sub.
It is based on Google's Python client for Cloud Pub/Sub.
```bash
pip install scrapy-pubsub
```
Add the following to your Scrapy `settings.py`:

```python
ITEM_PIPELINES = {
    "scrapy_pubsub.PubSubItemPipeline": 100,
}
PUBSUB_PROJECT_ID = "my-project-id"
PUBSUB_TOPIC = "my-topic"
```
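For context, here is a minimal sketch of what such a pipeline can look like internally, assuming the `google-cloud-pubsub` client; the class and attribute names are illustrative, and the actual `scrapy_pubsub` implementation may differ:

```python
import json

from google.cloud import pubsub_v1
from itemadapter import ItemAdapter


class ExamplePubSubPipeline:
    """Illustrative pipeline publishing each crawled item to Pub/Sub."""

    def __init__(self, project_id, topic):
        self.publisher = pubsub_v1.PublisherClient()
        self.topic_path = self.publisher.topic_path(project_id, topic)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with access to the settings shown above.
        return cls(
            project_id=crawler.settings.get("PUBSUB_PROJECT_ID"),
            topic=crawler.settings.get("PUBSUB_TOPIC"),
        )

    def process_item(self, item, spider):
        # Serialize the item to JSON and publish it as a Pub/Sub message.
        data = json.dumps(ItemAdapter(item).asdict()).encode("utf-8")
        self.publisher.publish(self.topic_path, data)
        return item
```

Scrapy builds the pipeline via `from_crawler` and then calls `process_item` for every item yielded by a spider, so no extra signal wiring is needed.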
At least four different approaches were possible for integrating Cloud Pub/Sub with the Scrapy framework and its APIs:
| Approach | Examples | Pros | Cons |
|---|---|---|---|
| 1/ Item Exporter | - SQLite item exporter<br>- Native item exporters such as `JsonLinesItemExporter` | - Pub/Sub is an alternative way to export items, so `ItemExporter` sounds like the right interface for it | - `ItemExporter` objects are coupled to a `FeedExporter`, which works with a file; with Pub/Sub there is no file<br>- Scrapy's native `ItemExporter`s are closer to formatters (to JSON, JSON Lines, XML...), which is orthogonal to the persistence medium (see below) |
| 2/ Storage Backend | - Native storage backends such as `StdoutFeedStorage` | - Pub/Sub could be seen as an alternative storage method, independently of the way items are "exported", i.e. formatted: one could use `JsonLinesItemExporter`, `XmlItemExporter`, or even a custom item exporter, and persist the items to the Pub/Sub "storage backend" | - The storage backend concept means Pub/Sub would have to be treated as a file and provide a file interface<br>- Some item exporters write beginning and end tags to the file (e.g. `JsonItemExporter`), which would send incorrect messages to Pub/Sub |
| 3/ Extension | - Kafka exporter extension<br>- Native extensions such as `FeedExport` | - Simple<br>- Decoupled from `FeedExport`: one can publish to Pub/Sub but also write to a file<br>- This approach has previously been used for a Kafka extension, which should be very similar to Pub/Sub | - Can't reuse the existing item exporters<br>- Need to handle the signals logic |
| 4/ Item Pipeline | - MongoDB pipeline example | - Simple and decoupled, like an extension<br>- Signals appear to be already handled for item pipelines, as opposed to extensions<br>- The MongoDB example in the official documentation suggests this is the recommended pattern | - Can't reuse the existing item exporters |
The Item Pipeline approach (4/) has been chosen for a first version.
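For comparison, the "signals logic" drawback of the extension approach (3/) refers to wiring like the following hypothetical sketch, which an item pipeline gets for free through its `process_item`/`open_spider`/`close_spider` hooks:

```python
from scrapy import signals


class ExamplePubSubExtension:
    """Hypothetical extension-based variant, shown only for comparison."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # Each relevant signal has to be connected by hand.
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def item_scraped(self, item, spider):
        # Would publish the item to Pub/Sub here.
        ...

    def spider_closed(self, spider):
        # Would flush any pending messages and shut down the publisher here.
        ...
```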
Google's Cloud Pub/Sub emulator is used for integration tests.
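As a rough sketch of how such a test can work (assuming `google-cloud-pubsub` >= 2.0; the port and resource names are chosen arbitrarily), the client is pointed at the emulator through the `PUBSUB_EMULATOR_HOST` environment variable:

```python
import os

from google.cloud import pubsub_v1

# Point the client library at a locally running emulator
# (started e.g. with: gcloud beta emulators pubsub start --port=8085).
os.environ["PUBSUB_EMULATOR_HOST"] = "localhost:8085"

project_id = "test-project"  # the emulator accepts any project id
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path(project_id, "my-topic")
subscription_path = subscriber.subscription_path(project_id, "my-subscription")

publisher.create_topic(name=topic_path)
subscriber.create_subscription(name=subscription_path, topic=topic_path)

# Publish a message, then pull it back and check the payload.
publisher.publish(topic_path, b'{"title": "example"}').result()
response = subscriber.pull(subscription=subscription_path, max_messages=1)
assert response.received_messages[0].message.data == b'{"title": "example"}'
```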
The CI workflow relies on several GitHub Actions. For `setup-gcloud`, a dummy service account had to be created; the following steps were used:
```bash
# Create the service account
SERVICE_ACCOUNT=scrapy-pubsub-github-workflow
gcloud iam service-accounts create ${SERVICE_ACCOUNT}

# Check that no roles have been granted to it
PROJECT=xyz
gcloud projects get-iam-policy ${PROJECT} \
  --flatten="bindings[].members" \
  --format='table(bindings.role)' \
  --filter="bindings.members:${SERVICE_ACCOUNT}"

# Create a JSON key for the service account
gcloud iam service-accounts keys create ./key.json \
  --iam-account ${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com \
  --key-file-type=json

# Base64-encode the key, e.g. to store it as a GitHub secret
cat key.json | base64
```
To install the development dependencies:

```bash
pip install -e ".[dev]"
```