
[Feature Request] S3 Storage #477

Closed
mgrist opened this issue Mar 24, 2023 · 3 comments
Labels
type: question (a user support question)

Comments


mgrist commented Mar 24, 2023

It would be great if you could specify an S3 bucket URI in the scrapyd.conf file for the eggs_dir, logs_dir, items_dir, etc...


jpmckinney commented Apr 3, 2023

What would be the expected behavior?

Scrapy can write lines to log files, item feed files, etc. at a very high frequency and over a very long period of time, so it would not make sense to store these on S3 while the files are "open". It's perhaps possible to transfer these files to S3 once they are closed - but that's something you can do as a separate job (like a backup script), and it's not clear that it should be something Scrapyd is responsible for.
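Such a separate backup job could be quite small. Here is a hedged sketch assuming boto3 is installed, AWS credentials are configured in the environment, and a conventional logs_dir layout of project/spider/job.log; the bucket name and paths are hypothetical:

```python
# Hypothetical backup script: upload finished Scrapyd log files to S3.
# Assumes boto3 is installed and AWS credentials are available to it.
import os


def s3_key_for(logs_dir: str, path: str) -> str:
    """Build an S3 key that mirrors the logs_dir layout (project/spider/job.log)."""
    return os.path.relpath(path, logs_dir).replace(os.sep, "/")


def upload_finished_logs(logs_dir: str, bucket: str) -> None:
    import boto3  # deferred so the helper above works without AWS dependencies

    s3 = boto3.client("s3")
    for root, _dirs, files in os.walk(logs_dir):
        for name in files:
            path = os.path.join(root, name)
            # upload_file streams the file; keys mirror the local directory tree
            s3.upload_file(path, bucket, s3_key_for(logs_dir, path))
```

Running something like `upload_finished_logs("/var/lib/scrapyd/logs", "my-scrapyd-backups")` from cron after jobs finish keeps Scrapyd itself out of the loop.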

I assume this need arises from attempting to run Scrapyd on a host with only temporary storage (like Heroku). To get it working on such a platform:

  • Scrapyd ships with one implementation of the egg storage interface, FilesystemEggStorage, which is the only part of the code that uses eggs_dir. You will need to write your own implementation to use S3 (or anything else) as you wish: https://scrapyd.readthedocs.io/en/stable/config.html#eggstorage
  • items_dir is in fact empty (disabled) by default in Scrapyd. It's recommended to have your spiders write to a database or use Scrapy's feed exports (see the Scrapy documentation, which covers writing feeds to S3).
  • Scrapy writes logs to standard output or to files. You can either configure your host to forward stdout to a service like Logstash (see the Heroku docs), or reconfigure Scrapy's logger to behave as you wish (Scrapy uses Python's standard logging module).

jpmckinney added the label type: question (a user support question) on Apr 3, 2023
jpmckinney commented
I've made a commit to clarify the above in the docs: e92edd6

If you implement a new egg storage option, please feel free to open a pull request.
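For illustration, a minimal sketch of what such an S3-backed egg storage could look like. The method names follow Scrapyd's egg storage interface (put/get/list/delete); a real plugin would also declare zope.interface's @implementer(IEggStorage) and read its bucket from the Scrapyd config. The bucket name, key layout, and version ordering here are assumptions:

```python
# Sketch of an S3-backed egg storage for Scrapyd (assumes boto3).
import io


def egg_key(prefix: str, project: str, version: str) -> str:
    """Hypothetical key layout: <prefix>/<project>/<version>.egg"""
    return f"{prefix}/{project}/{version}.egg"


class S3EggStorage:
    def __init__(self, bucket: str, prefix: str = "eggs"):
        import boto3  # deferred: only needed when the storage is actually used

        self.bucket = bucket
        self.prefix = prefix
        self.s3 = boto3.client("s3")

    def put(self, eggfile, project, version):
        self.s3.upload_fileobj(eggfile, self.bucket,
                               egg_key(self.prefix, project, version))

    def get(self, project, version=None):
        if version is None:
            versions = self.list(project)
            if not versions:
                return None, None
            version = versions[-1]  # naive "latest": lexicographic order
        buf = io.BytesIO()
        self.s3.download_fileobj(self.bucket,
                                 egg_key(self.prefix, project, version), buf)
        buf.seek(0)
        return version, buf

    def list(self, project):
        resp = self.s3.list_objects_v2(Bucket=self.bucket,
                                       Prefix=f"{self.prefix}/{project}/")
        keys = [obj["Key"] for obj in resp.get("Contents", [])]
        # "eggs/proj/1.0.egg" -> "1.0"
        return sorted(k.rsplit("/", 1)[-1][:-len(".egg")] for k in keys)

    def delete(self, project, version=None):
        versions = [version] if version else self.list(project)
        for v in versions:
            self.s3.delete_object(Bucket=self.bucket,
                                  Key=egg_key(self.prefix, project, v))
```

Pointing the eggstorage setting at such a class would let Scrapyd run on hosts with ephemeral disks, at the cost of an S3 round-trip per egg operation.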


mgrist commented Apr 3, 2023

@jpmckinney Thanks for the information! I wasn't aware of some of those features. You are correct that I am running Scrapyd on temporary storage. I created my own separate script that runs afterward and sends the logs to S3. I think you are right that this isn't Scrapyd's job. Thanks again for the insight.
