scrapy-pubsub
is a Scrapy item pipeline which writes crawled items to Cloud Pub/Sub.
It is based on Google's Python client for Cloud Pub/Sub.
```bash
pip install scrapy-pubsub
```
Add the following to your Scrapy `settings.py`:

```python
ITEM_PIPELINES = {
    "scrapy_pubsub.PubSubItemPipeline": 100,
}
PUBSUB_PROJECT_ID = "my-project-id"
PUBSUB_TOPIC = "my-topic"
```
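For context, here is a minimal sketch of what such a pipeline can look like internally, assuming the `google-cloud-pubsub` client; the class and attribute names are illustrative, and the actual `scrapy_pubsub` implementation may differ:

```python
import json

from google.cloud import pubsub_v1
from itemadapter import ItemAdapter


class ExamplePubSubPipeline:
    """Illustrative pipeline publishing each crawled item to Pub/Sub."""

    def __init__(self, project_id, topic):
        self.publisher = pubsub_v1.PublisherClient()
        self.topic_path = self.publisher.topic_path(project_id, topic)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with access to the settings shown above.
        return cls(
            project_id=crawler.settings.get("PUBSUB_PROJECT_ID"),
            topic=crawler.settings.get("PUBSUB_TOPIC"),
        )

    def process_item(self, item, spider):
        # Serialize the item to JSON and publish it as a Pub/Sub message.
        data = json.dumps(ItemAdapter(item).asdict()).encode("utf-8")
        self.publisher.publish(self.topic_path, data)
        return item
```

Scrapy builds the pipeline via `from_crawler` and then calls `process_item` for every item yielded by a spider, so no extra signal wiring is needed.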
At least four different approaches were possible for integrating Cloud Pub/Sub with the Scrapy framework and its APIs:
| Approach | Examples | Pros | Cons |
|---|---|---|---|
| 1/ Item Exporter | - SQLite item exporter<br>- Native item exporters such as `JsonLinesItemExporter` | - Pub/Sub is an alternative way to export items, so `ItemExporter` sounds like the right interface for it | - `ItemExporter` objects are coupled to a `FeedExporter`, which works with a file; with Pub/Sub there is no file<br>- Scrapy's native `ItemExporter`s are closer to formatters (to JSON, JSON Lines, XML...), which is orthogonal to the persistence medium (see below) |
| 2/ Storage Backend | - Native storage backends such as `StdoutFeedStorage` | - Pub/Sub could be seen as an alternative storage method, independently of the way items are "exported", i.e. formatted: one could use `JsonLinesItemExporter`, `XmlItemExporter`, or even a custom item exporter, and persist the items to the Pub/Sub "storage backend" | - The storage backend concept means Pub/Sub would have to be treated as a file and provide a file interface<br>- Some item exporters write beginning and end tags to the file (e.g. `JsonItemExporter`), which would send incorrect messages to Pub/Sub |
| 3/ Extension | - Kafka exporter extension<br>- Native extensions such as `FeedExport` | - Simple<br>- Decoupled from `FeedExport`: one can publish to Pub/Sub but also write to a file<br>- This approach has previously been used for a Kafka extension, which should be very similar to Pub/Sub | - Can't reuse the existing item exporters<br>- Need to handle the signals logic |
| 4/ Item Pipeline | - MongoDB pipeline example | - Simple and decoupled, like an extension<br>- Signals appear to be already handled for item pipelines, as opposed to extensions<br>- The MongoDB example in the official documentation suggests this is the recommended pattern | - Can't reuse the existing item exporters |
The Item Pipeline approach (4/) has been chosen for a first version.
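For comparison, the "signals logic" drawback of the extension approach (3/) refers to wiring like the following hypothetical sketch, which an item pipeline gets for free through its `process_item`/`open_spider`/`close_spider` hooks:

```python
from scrapy import signals


class ExamplePubSubExtension:
    """Hypothetical extension-based variant, shown only for comparison."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # Each relevant signal has to be connected by hand.
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def item_scraped(self, item, spider):
        # Would publish the item to Pub/Sub here.
        ...

    def spider_closed(self, spider):
        # Would flush any pending messages and shut down the publisher here.
        ...
```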
Google's Cloud Pub/Sub emulator is used for integration tests.
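As a rough sketch of how such a test can work (assuming `google-cloud-pubsub` >= 2.0; the port and resource names are chosen arbitrarily), the client is pointed at the emulator through the `PUBSUB_EMULATOR_HOST` environment variable:

```python
import os

from google.cloud import pubsub_v1

# Point the client library at a locally running emulator
# (started e.g. with: gcloud beta emulators pubsub start --port=8085).
os.environ["PUBSUB_EMULATOR_HOST"] = "localhost:8085"

project_id = "test-project"  # the emulator accepts any project id
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path(project_id, "my-topic")
subscription_path = subscriber.subscription_path(project_id, "my-subscription")

publisher.create_topic(name=topic_path)
subscriber.create_subscription(name=subscription_path, topic=topic_path)

# Publish a message, then pull it back and check the payload.
publisher.publish(topic_path, b'{"title": "example"}').result()
response = subscriber.pull(subscription=subscription_path, max_messages=1)
assert response.received_messages[0].message.data == b'{"title": "example"}'
```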
The CI workflow relies on several GitHub Actions. For `setup-gcloud`, a dummy service account had to be created; the following steps were used:
```bash
# Create the service account
SERVICE_ACCOUNT=scrapy-pubsub-github-workflow
gcloud iam service-accounts create ${SERVICE_ACCOUNT}

# Check that no roles have been granted to it
PROJECT=xyz
gcloud projects get-iam-policy ${PROJECT} \
  --flatten="bindings[].members" \
  --format='table(bindings.role)' \
  --filter="bindings.members:${SERVICE_ACCOUNT}"

# Create a JSON key for the service account
gcloud iam service-accounts keys create ./key.json \
  --iam-account ${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com \
  --key-file-type=json

# Base64-encode the key, e.g. to store it as a GitHub secret
cat key.json | base64
```
To install the development dependencies:

```bash
pip install -e ".[dev]"
```