Description
File expiration as described here does not work properly with Google Cloud Storage. It is supposed to trigger a download only for files that have expired based on FILES_EXPIRES, but in practice it downloads all the files again and again. It only works if you write at the root of your bucket (e.g. gs://my_bucket).
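For context, the re-download decision is made by FilesPipeline.media_to_download from the last_modified value returned by the store's stat_file. The helper below is a paraphrase of that check (not the exact Scrapy code); the key point is that an empty stat result forces a re-download:

```python
import time

def needs_download(stat_result, expires_days):
    """Paraphrase of the expiry check in FilesPipeline.media_to_download
    (not the exact Scrapy code): an empty stat result or a missing
    last_modified forces a re-download."""
    if not stat_result:
        return True
    last_modified = stat_result.get("last_modified")
    if not last_modified:
        return True
    age_days = (time.time() - last_modified) / 60 / 60 / 24
    return age_days > expires_days

# With this bug, GCSFilesStore.stat_file comes back empty for any store below
# the bucket root, so needs_download({}, 90) is True and the file is fetched again.
```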
Steps to Reproduce
Feel free to clone and follow the instructions from that repository to reproduce this easily. Otherwise you can use a spider like the sketch below and set FILES_STORE to a GCS bucket of your choice. For more information about the gcloud setup, see the repository mentioned above. Then if you run the scraper once, the stats at the end will report the file as downloaded ('file_status_count/downloaded': 1).
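For illustration, a minimal spider of that shape might look like this (the spider name, URLs, bucket, and project ID are placeholders, not from the original report):

```python
import scrapy

class ExpirationSpider(scrapy.Spider):
    """Illustrative reproduction spider; names, URLs, bucket, and
    project ID are placeholders."""
    name = "expiration"
    start_urls = ["https://www.example.com/"]
    custom_settings = {
        "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
        # Any path below the bucket root triggers the bug;
        # plain gs://my_bucket does not.
        "FILES_STORE": "gs://my_bucket/files/",
        "GCS_PROJECT_ID": "my-gcp-project",
        "FILES_EXPIRES": 90,  # days; fresh files should be skipped on reruns
    }

    def parse(self, response):
        # Hand one file to the FilesPipeline via the default file_urls field.
        yield {"file_urls": ["https://www.example.com/image.png"]}
```

Save it as expiration_spider.py and run `scrapy runspider expiration_spider.py` twice; with a correctly working store the second run should count the file as uptodate instead of downloaded.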
The image should now be in FILES_STORE, so if we rerun the spider we would expect it not to download the image again, and we would expect the following line in the logs: 'file_status_count/uptodate': 1. But we don't get it.
If you are not convinced, set FILES_STORE to a local path and run the spider twice in a row; the second run does log 'file_status_count/uptodate': 1.
Versions
Scrapy : 2.5.1
lxml : 4.6.4.0
libxml2 : 2.9.4
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 21.7.0
Python : 3.6.8
Additional context
The bug seems to have been introduced at the same time GCSFilesStore was added, in 1.5: https://github.com/scrapy/scrapy/blob/1.5/scrapy/pipelines/files.py. I patched GCSFilesStore internally at my company, but I'll provide a fix for Scrapy right after publishing this issue.
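For the record, a minimal sketch of that kind of patch, assuming the root cause is a prefix mismatch in scrapy/pipelines/files.py: persist_file uploads blobs under self.prefix + path, while stat_file queries the bare path, so get_blob returns None for any store below the bucket root and every file looks missing:

```python
# Sketch of a patch to GCSFilesStore.stat_file in scrapy/pipelines/files.py.
# It relies on the module's existing imports (base64, time and
# twisted.internet.threads); only the blob lookup path changes.
def stat_file(self, path, info):
    def _onsuccess(blob):
        if blob:
            checksum = base64.b64decode(blob.md5_hash).hex()
            last_modified = time.mktime(blob.updated.timetuple())
            return {'checksum': checksum, 'last_modified': last_modified}
        return {}

    # Look the blob up under the same prefixed path persist_file writes to,
    # instead of the bare `path` (which only matches at the bucket root).
    blob_path = self.prefix + path
    return threads.deferToThread(self.bucket.get_blob, blob_path).addCallback(_onsuccess)
```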