
Files expiration does not work well for Google Cloud Storage (GCS) #5317

Closed
mnannan opened this issue Nov 14, 2021 · 1 comment

mnannan commented Nov 14, 2021

Description

File expiration, as described in the documentation, does not work properly with Google Cloud Storage.
The pipeline is supposed to re-download a file only if it has expired according to FILES_EXPIRES, but in practice it downloads all the files again and again. Expiration only works if you write to the root of your bucket (e.g. gs://my_bucket).
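
For context, the expiration check boils down to comparing the stored file's last-modified time with FILES_EXPIRES (in days). Below is a simplified sketch of that decision, not Scrapy's actual code: the helper name is_expired and the standalone structure are mine, but the stat_file() contract (an empty dict when the blob is missing, otherwise a dict with a last_modified epoch timestamp) matches what the files stores return.

import time

FILES_EXPIRES = 90  # days (Scrapy's default)

def is_expired(stat_result, expires_days=FILES_EXPIRES):
    # stat_result is what a files store's stat_file() returns:
    # {} when the file is missing, otherwise
    # {'last_modified': <epoch seconds>, 'checksum': ...}
    last_modified = stat_result.get('last_modified')
    if not last_modified:
        # No stored metadata found: the file must be (re-)downloaded.
        return True
    age_days = (time.time() - last_modified) / (60 * 60 * 24)
    return age_days > expires_days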

Steps to Reproduce

Feel free to clone and follow the instructions from that repository to reproduce this easily.
Otherwise, you can use this spider:

import scrapy


class MySpiderSpider(scrapy.Spider):
    name = 'my_spider'
    allowed_domains = ['example.org']
    start_urls = ['http://example.org/']

    custom_settings = {
        # Enable the built-in files pipeline and point it at a GCS bucket.
        'ITEM_PIPELINES': {
            'scrapy.pipelines.files.FilesPipeline': 1
        },
        'FILES_URLS_FIELD': 'files',
        'FILES_RESULT_FIELD': 'files_processed',
        'FILES_STORE': 'gs://my_bucket/my_prefix',
        'GCS_PROJECT_ID': 'project_id',
    }

    def parse(self, response):
        # Yield a single item pointing at one file so the pipeline downloads it.
        return {
            'files': ['https://scrapy.org/img/scrapylogo.png'],
        }

and set FILES_STORE to a GCS bucket of your choice. For more information about the Google Cloud setup, feel free to visit the repository mentioned above.
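
As a quick sanity check of the GCS setup (this snippet is my own suggestion, not part of the repository above; the key path, project ID and bucket name are placeholders), you can verify that the google-cloud-storage client used by the pipeline can reach your bucket:

import os
from google.cloud import storage

# Placeholder path: point this at your own service-account key,
# or rely on whatever default credentials your environment already has.
os.environ.setdefault('GOOGLE_APPLICATION_CREDENTIALS', '/path/to/service-account.json')

client = storage.Client(project='project_id')
bucket = client.bucket('my_bucket')
print(bucket.exists())  # True if the bucket is reachable with these credentials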

Then, if you run the spider once, you'll get the following stats at the end:

'downloader/request_bytes': 439,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 14715,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'file_count': 1,
'file_status_count/downloaded': 1,

The image should now be in FILES_STORE, so if we rerun the spider we would expect it not to download the image again, and we would expect the following line in the stats:

'file_status_count/uptodate': 1,

but we don't.
If you are not convinced, set FILES_STORE to a local path and run the spider twice in a row: the second run logs file_status_count/uptodate: 1.

Versions

Scrapy : 2.5.1
lxml : 4.6.4.0
libxml2 : 2.9.4
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 21.7.0
Python : 3.6.8

Additional context

The bug seems to have been introduced when GCSFilesStore was added in 1.5: https://github.com/scrapy/scrapy/blob/1.5/scrapy/pipelines/files.py.
I have patched GCSFilesStore internally at my company, but I'll provide a fix for Scrapy right after publishing this issue.
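
For the record, my internal patch is along the lines of the sketch below. My reading of the 2.5.1 code is that persist_file() writes blobs under self.prefix + path while stat_file() looks them up with the bare path, so with any non-root FILES_STORE the lookup always misses and every file is downloaded again; the actual fix I'll submit may differ in detail.

import time

from twisted.internet import threads
from scrapy.pipelines.files import GCSFilesStore


class PatchedGCSFilesStore(GCSFilesStore):
    # Sketch only: the single change is to stat the blob under the same
    # prefixed path that persist_file() writes to.

    def stat_file(self, path, info):
        def _onsuccess(blob):
            if blob:
                return {
                    'checksum': blob.md5_hash,
                    'last_modified': time.mktime(blob.updated.timetuple()),
                }
            return {}

        return threads.deferToThread(
            self.bucket.get_blob, self.prefix + path
        ).addCallback(_onsuccess)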

mnannan commented Nov 14, 2021

The pull request can be found here: #5318
