Feature suggest: add downloaded/uptodate status to information about downloaded media #2893

djunzu opened this issue Aug 22, 2017 · 5 comments · Fixed by #4486


djunzu commented Aug 22, 2017

Currently File/Image pipelines populate files/images fields with dicts containing information about the downloaded files (the downloaded path, the original scraped url, and the file checksum). It would be useful to have downloaded/uptodate status in this dict (motivation).

This goes along with other feature requests, such as also including the width/height of images in the dict output.
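
For illustration, here is a minimal sketch of one entry in an item's files field as described above, next to what it might look like with the requested information added. The URL, path, and checksum values are invented, and the "status" key name and its "downloaded" value are assumptions made only for this example.

```python
# Illustrative values only: the URL, path and checksum are invented.
current_entry = {
    "url": "https://example.com/report.pdf",
    "path": "full/0a79c461a4062ac383dc4fade7bc09f1384a3910.pdf",
    "checksum": "2b00042f7481c7b056c4b410d28f33cf",
}

# Hypothetical: the same entry carrying the requested status information.
# The key name "status" is an assumption, not an existing field.
proposed_entry = {**current_entry, "status": "downloaded"}  # or "uptodate"
```

The existing keys stay untouched, so code that only reads url, path, or checksum would keep working.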

redapple commented Aug 23, 2017

Makes sense.
Also, the format of the dicts is not clearly documented (it is here, but not obvious).
If this proposal is implemented, I would also suggest improving the docs on the structure of results.

dcurletti commented Dec 19, 2017


ilias-ant commented Apr 1, 2020

This is a great feature suggestion, general enough to have many applications in projects that use Scrapy!

I hope to find the time to give this a try soon!

Update: I did a bit of research and realized that this enhancement is achievable with minimal source code alterations. Namely:

  • FilesPipeline.media_to_download#_onsuccess callback should now return: {'url': request.url, 'path': path, 'checksum': checksum, 'status': 'uptodate'}
  • FilesPipeline.media_downloaded should now return: {'url': request.url, 'path': path, 'checksum': checksum, 'status': status}
  • the necessary testing and documentation considerations
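
As a rough sketch of how the extra key could be consumed downstream, the snippet below builds result dicts in the shape proposed in the list above and filters them by status. It does not touch Scrapy itself: build_result and newly_downloaded are hypothetical helper names, the sample values are invented, and "downloaded" as the non-uptodate value is an assumption taken from the issue title.

```python
def build_result(url, path, checksum, status):
    """Hypothetical helper mirroring the dict shape proposed above."""
    return {"url": url, "path": path, "checksum": checksum, "status": status}


def newly_downloaded(results):
    """Keep only the entries whose status says they were fetched, not reused."""
    return [r for r in results if r["status"] == "downloaded"]


# Invented sample data: one fresh download and one file that was still valid.
results = [
    build_result("https://example.com/a.pdf", "full/a.pdf", "111", "downloaded"),
    build_result("https://example.com/b.pdf", "full/b.pdf", "222", "uptodate"),
]
fresh = newly_downloaded(results)
```

A later item pipeline could use a filter like this to run expensive post-processing only on files that actually changed.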

Any feedback on this would be greatly appreciated, and if this is indeed the case, I will certainly open a pull request (if that's OK with you) :)

Gallaecio commented Apr 3, 2020

I wonder if we want to make a distinction between:

  • It had been already downloaded in a previous crawl.
  • It had been already downloaded in the current crawl.

ilias-ant commented Apr 3, 2020

I think the first approach (in a previous crawl) is sufficient and well-defined, at the right level of abstraction.

The second approach is probably going to face significant definition challenges in Scrapy projects that use more than one media pipeline or depend heavily on a specific file expiration policy. Since the whole point of the file expiration policy is to "avoid downloading files that were downloaded recently", I think taking the "current crawl" semantics into account would really confuse the end user.

Overall, maintaining consistency with the file_status_count statistics is definitely the way to go.

I hope I did not miss your point on this distinction :)
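
To make the consistency point concrete, here is a small sketch that tallies result dicts per status under stat-style keys. It assumes the keys take the form file_status_count/<status> and that each result carries the proposed "status" key; status_counts is a hypothetical helper and the sample data is invented.

```python
from collections import Counter


def status_counts(results):
    """Tally results per status, using assumed stat-style key names."""
    return Counter(f"file_status_count/{r['status']}" for r in results)


# Invented sample: two files reused from a previous crawl, one fetched now.
sample = [
    {"url": "https://example.com/a.pdf", "status": "uptodate"},
    {"url": "https://example.com/b.pdf", "status": "uptodate"},
    {"url": "https://example.com/c.pdf", "status": "downloaded"},
]
counts = status_counts(sample)
```

If the per-item status values match the ones already counted in the stats, the two views of a crawl can be cross-checked without any translation step.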
