
[Feature Request] S3 Storage #477

Closed
mgrist opened this issue Mar 24, 2023 · 3 comments
Labels
type: question (a user support question)

Comments


mgrist commented Mar 24, 2023

It would be great if you could specify an S3 bucket URI in the scrapyd.conf file for the eggs_dir, logs_dir, items_dir, etc...


jpmckinney commented Apr 3, 2023

What would be the expected behavior?

Scrapy can write lines to log files, item feed files, etc. at a very high frequency and over a very long period of time, so it would not make sense to store these on S3 while the files are "open". It's perhaps possible to transfer these files to S3 once they are closed - but that's something you can do as a separate job (like a backup script), and it's not clear that it should be something Scrapyd is responsible for.
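Such a separate backup job could be quite small. Here is a hedged sketch assuming boto3 is installed, AWS credentials are configured in the environment, and a conventional logs_dir layout of project/spider/job.log; the bucket name and paths are hypothetical:

```python
# Hypothetical backup script: upload finished Scrapyd log files to S3.
# Assumes boto3 is installed and AWS credentials are available to it.
import os


def s3_key_for(logs_dir: str, path: str) -> str:
    """Build an S3 key that mirrors the logs_dir layout (project/spider/job.log)."""
    return os.path.relpath(path, logs_dir).replace(os.sep, "/")


def upload_finished_logs(logs_dir: str, bucket: str) -> None:
    import boto3  # deferred so the helper above works without AWS dependencies

    s3 = boto3.client("s3")
    for root, _dirs, files in os.walk(logs_dir):
        for name in files:
            path = os.path.join(root, name)
            # upload_file streams the file; keys mirror the local directory tree
            s3.upload_file(path, bucket, s3_key_for(logs_dir, path))
```

Running something like `upload_finished_logs("/var/lib/scrapyd/logs", "my-scrapyd-backups")` from cron after jobs finish keeps Scrapyd itself out of the loop.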

I assume this need arises from attempting to run Scrapyd on a host with only temporary storage (like Heroku). To get it working on such a platform:

  • Scrapyd ships with one implementation of the egg storage interface, FilesystemEggStorage, which is the only part of the code that uses eggs_dir. You will need to write your own implementation to use S3 (or anything else) as you wish: https://scrapyd.readthedocs.io/en/stable/config.html#eggstorage
  • items_dir is in fact empty (disabled) by default in Scrapyd. It's recommended to have your spiders write to a database or use Scrapy's feed exports (see the Scrapy documentation, which covers writing feeds to S3).
  • Scrapy writes logs to standard output or to files. You can either configure your host to forward stdout to a service like Logstash (see the Heroku docs), or reconfigure Scrapy's logger to behave as you wish (Scrapy uses Python's standard logging module).

jpmckinney added the label type: question (a user support question) on Apr 3, 2023
jpmckinney commented
I've made a commit to clarify the above in the docs: e92edd6

If you implement a new egg storage option, please feel free to open a pull request.
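For illustration, a minimal sketch of what such an S3-backed egg storage could look like. The method names follow Scrapyd's egg storage interface (put/get/list/delete); a real plugin would also declare zope.interface's @implementer(IEggStorage) and read its bucket from the Scrapyd config. The bucket name, key layout, and version ordering here are assumptions:

```python
# Sketch of an S3-backed egg storage for Scrapyd (assumes boto3).
import io


def egg_key(prefix: str, project: str, version: str) -> str:
    """Hypothetical key layout: <prefix>/<project>/<version>.egg"""
    return f"{prefix}/{project}/{version}.egg"


class S3EggStorage:
    def __init__(self, bucket: str, prefix: str = "eggs"):
        import boto3  # deferred: only needed when the storage is actually used

        self.bucket = bucket
        self.prefix = prefix
        self.s3 = boto3.client("s3")

    def put(self, eggfile, project, version):
        self.s3.upload_fileobj(eggfile, self.bucket,
                               egg_key(self.prefix, project, version))

    def get(self, project, version=None):
        if version is None:
            versions = self.list(project)
            if not versions:
                return None, None
            version = versions[-1]  # naive "latest": lexicographic order
        buf = io.BytesIO()
        self.s3.download_fileobj(self.bucket,
                                 egg_key(self.prefix, project, version), buf)
        buf.seek(0)
        return version, buf

    def list(self, project):
        resp = self.s3.list_objects_v2(Bucket=self.bucket,
                                       Prefix=f"{self.prefix}/{project}/")
        keys = [obj["Key"] for obj in resp.get("Contents", [])]
        # "eggs/proj/1.0.egg" -> "1.0"
        return sorted(k.rsplit("/", 1)[-1][:-len(".egg")] for k in keys)

    def delete(self, project, version=None):
        versions = [version] if version else self.list(project)
        for v in versions:
            self.s3.delete_object(Bucket=self.bucket,
                                  Key=egg_key(self.prefix, project, v))
```

Pointing the eggstorage setting at such a class would let Scrapyd run on hosts with ephemeral disks, at the cost of an S3 round-trip per egg operation.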


mgrist commented Apr 3, 2023

@jpmckinney Thanks for the information! I wasn't aware of some of those features. You are correct that I am running Scrapyd on temporary storage. I created my own separate script that runs afterward and sends the logs to S3. I think you are right that this isn't Scrapyd's job. Thanks again for the insight.
