Currently, the File/Image pipelines populate the files/images fields with dicts containing information about the downloaded files (the download path, the original scraped URL, and the file checksum). It would be useful to also have the downloaded/uptodate status in this dict (motivation).
It goes along with other feature requests, such as also including the width/height of images in the dict output.
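To make the proposal concrete, here is a minimal sketch of what a result dict could look like with the extra field. The `status` key name, the `FileResult` type, and the `build_result` helper are all hypothetical, chosen only for illustration; the `url`, `path`, and `checksum` keys mirror the three pieces of information described above.

```python
from typing import TypedDict

# Statuses named in this proposal; a real pipeline might define others.
PROPOSED_STATUSES = ("downloaded", "uptodate")


class FileResult(TypedDict):
    url: str       # original scraped URL
    path: str      # path where the file was stored
    checksum: str  # checksum of the file contents
    status: str    # hypothetical new key proposed in this issue


def build_result(url: str, path: str, checksum: str, status: str) -> FileResult:
    """Build a result dict, rejecting statuses outside the proposed set."""
    if status not in PROPOSED_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return {"url": url, "path": path, "checksum": checksum, "status": status}


example = build_result(
    url="https://example.com/report.pdf",
    path="full/0a1b2c.pdf",
    checksum="d41d8cd98f00b204e9800998ecf8427e",
    status="uptodate",
)
```

With this shape, a spider or a later pipeline could branch on `example["status"]` without having to re-check the file store.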
Also, the format of the dicts is not clearly documented (it is here, but not obvious).
If this proposal is implemented, I would also suggest improving the docs on the structure of results.
I think the first approach (in a previous crawl) is sufficient and well-defined, and sits at the right level of abstraction.
The second approach would probably be hard to define in Scrapy projects that use more than one Media Pipeline or that depend heavily on a specific file expiration policy. Since the whole point of the file expiration policy is to "avoid downloading files that were downloaded recently", I think taking "current crawl" semantics into account would really confuse the end user.
Overall, maintaining consistency with the file_status_count statistics is definitely the way to go.
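As a rough illustration of what that consistency could mean in practice, the sketch below tallies a hypothetical `status` key from result dicts into counters named after the file_status_count statistics. The `file_status_count/<status>` key format and the `tally_statuses` helper are assumptions made for this example, not a description of the actual implementation.

```python
from collections import Counter


def tally_statuses(results):
    """Count result dicts per status, using stats-style key names.

    Assumes each dict carries a hypothetical 'status' key and that the
    stats keys take the form 'file_status_count/<status>'.
    """
    counts = Counter()
    for result in results:
        counts[f"file_status_count/{result['status']}"] += 1
    return dict(counts)


# Made-up result dicts for illustration only.
sample_results = [
    {"url": "https://example.com/a.pdf", "path": "full/a.pdf", "status": "downloaded"},
    {"url": "https://example.com/b.pdf", "path": "full/b.pdf", "status": "uptodate"},
    {"url": "https://example.com/c.pdf", "path": "full/c.pdf", "status": "uptodate"},
]

tallies = tally_statuses(sample_results)
```

If the per-item statuses use the same vocabulary as the stats, a tally like this over all scraped items should match the counters reported at the end of the crawl.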
I hope I did not miss your point on this distinction :)