FilesDownloader with GCS downloading updtodate files again #4346

Closed
lblanche opened this issue Feb 19, 2020 · 10 comments

@lblanche

lblanche commented Feb 19, 2020

Description

It seems that when using Google Cloud Storage, the Files pipeline does not behave as expected for up-to-date files.

Steps to Reproduce

  1. Clone this repo: git clone https://github.com/QYQ323/python.git
  2. Run the spider: scrapy crawl examples
  3. If you run it several times, the FilesPipeline behaves correctly: it does not download up-to-date files again:
2020-02-19 14:41:36 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://matplotlib.org/examples/animation/basic_example_writer.py> referred in <None>
2020-02-19 14:41:36 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://matplotlib.org/examples/animation/basic_example.py> referred in <None>
2020-02-19 14:41:36 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded file from <GET https://matplotlib.org/examples/animation/bayes_update.py> referred in <None>
  4. Now change FILES_STORE in settings.py to a GCS bucket:

FILES_STORE = 'gs://mybucket/'

  5. If you then run the spider several times, the files are downloaded every time:
2020-02-19 14:50:44 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET https://matplotlib.org/examples/animation/simple_anim.py> referred in <None>
2020-02-19 14:50:44 [urllib3.connectionpool] DEBUG: https://storage.googleapis.com:443 "POST /upload/storage/v1/b/cdcscrapingresults/o?uploadType=multipart HTTP/1.1" 200 843
2020-02-19 14:50:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://matplotlib.org/examples/animation/double_pendulum_animated.py> (referer: None)
2020-02-19 14:50:44 [scrapy.pipelines.files] DEBUG: File (downloaded): Downloaded file from <GET https://matplotlib.org/examples/animation/double_pendulum_animated.py> referred in <None>
2020-02-19 14:50:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://matplotlib.org/examples/api/collections_demo.html>

Expected behavior:
Files should not be downloaded again when running the spider consecutively. If a file is already on GCS (same folder), it should not be downloaded (provided it was uploaded less than 90 days ago).
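
To make the expectation concrete, here is a minimal sketch of the freshness rule described above. It is an illustration only, not Scrapy's actual code: the function name `needs_download`, the shape of the `stat` dict, and the `EXPIRES_DAYS` constant are assumptions chosen for this example.

```python
from datetime import datetime, timedelta, timezone

EXPIRES_DAYS = 90  # assumed to match the 90-day window mentioned in the report


def needs_download(stat, now, expires_days=EXPIRES_DAYS):
    """Return True when the stored copy is missing or older than the window.

    `stat` is a dict such as {"last_modified": <POSIX timestamp>}; an empty
    dict means the store could not report anything about the file.
    """
    last_modified = stat.get("last_modified")
    if last_modified is None:
        return True  # nothing known about the stored file, so fetch it
    age_days = (now.timestamp() - last_modified) / 86400
    return age_days > expires_days


now = datetime(2020, 2, 19, 14, 50, tzinfo=timezone.utc)
fresh = {"last_modified": (now - timedelta(minutes=9)).timestamp()}
stale = {"last_modified": (now - timedelta(days=91)).timestamp()}

print(needs_download(fresh, now))  # False: uploaded nine minutes earlier
print(needs_download(stale, now))  # True: older than the 90-day window
print(needs_download({}, now))     # True: the store reported no metadata
```

The last case is the interesting one for this report: if the store cannot return a modification date for any reason, a rule like this one downloads the file again on every run.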

Actual behavior:
Every time the spider is launched, every file is downloaded again.

Reproduces how often: 100%

Versions

Scrapy : 1.8.0
lxml : 4.5.0.0
libxml2 : 2.9.10
cssselect : 1.1.0
parsel : 1.5.2
w3lib : 1.21.0
Twisted : 19.10.0
Python : 3.8.1 (default, Jan 8 2020, 16:15:59) - [Clang 4.0.1 (tags/RELEASE_401/final)]
pyOpenSSL : 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019)
cryptography : 2.8
Platform : macOS-10.15.3-x86_64-i386-64bit

@wRAR
Contributor

wRAR commented Feb 22, 2020

The pipeline just relies on the updated field in the google.cloud.storage.blob.Blob object, so maybe it always contains a wrong date?
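
One way to check that date by hand is sketched below. The helper `blob_age_days` and the `FakeBucket` stand-in are hypothetical names introduced for this example; the only thing assumed about a real bucket is that `get_blob(path)` returns either `None` or an object with an `updated` datetime, as `google.cloud.storage` buckets do.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def blob_age_days(bucket, path, now=None):
    """Return the age in days of `path` in `bucket`, or None if it is absent.

    `bucket` only needs a get_blob(path) method returning an object with an
    `updated` timezone-aware datetime, or None when the object does not exist.
    """
    blob = bucket.get_blob(path)
    if blob is None or blob.updated is None:
        return None
    now = now or datetime.now(timezone.utc)
    return (now - blob.updated).total_seconds() / 86400


class FakeBucket:
    """Stand-in bucket so the sketch runs without credentials."""

    def __init__(self, blobs):
        self._blobs = blobs

    def get_blob(self, path):
        return self._blobs.get(path)


uploaded = datetime(2020, 2, 19, 14, 41, tzinfo=timezone.utc)
bucket = FakeBucket({"full/abc.py": SimpleNamespace(updated=uploaded)})
checked_at = datetime(2020, 2, 20, 14, 41, tzinfo=timezone.utc)

print(blob_age_days(bucket, "full/abc.py", now=checked_at))      # 1.0
print(blob_age_days(bucket, "full/missing.py", now=checked_at))  # None
```

Against a real bucket, the same function could be called with `storage.Client().bucket("mybucket")` from `google.cloud.storage` in place of `FakeBucket`, assuming credentials are configured; the object path used here is made up.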

@lblanche
Author

lblanche commented Feb 25, 2020

Well, I checked one example, and the date looked right to me.

@wRAR
Contributor

wRAR commented Feb 25, 2020

@lblanche what would be the easiest way to check this? Should I just create an empty bucket and put it into the spider settings?

Gallaecio added the bug label Feb 26, 2020

@lblanche
Author

lblanche commented Feb 26, 2020

@wRAR Yes, that's what I did.

@michalp2213
Contributor

michalp2213 commented Mar 21, 2020

Hello, can I work on this issue?

@wRAR
Contributor

wRAR commented Mar 21, 2020

@michalp2213 sure

@michalp2213
Contributor

michalp2213 commented Mar 22, 2020

I cannot reproduce this bug. @lblanche, are you sure the permissions for the bucket are set up correctly? The first time I tried to reproduce it, the service account I used had write permissions, but for some reason calling get_blob on the bucket raised a 403. That made the stat_file method in GCSFilesStore fail, which in turn caused every file to be downloaded again on each run. After I fixed the permissions, everything worked as it should. If that is the case here, I think it would be a good idea to check permissions in GCSFilesStore's __init__ and display a warning if the file's metadata cannot be read from the bucket.
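
The suggested check could look roughly like the sketch below. It is an illustration under stated assumptions, not necessarily how Scrapy implements it: the helper `missing_permissions`, the `REQUIRED` tuple, and `FakeBucket` are hypothetical. The sketch relies on a bucket exposing `test_iam_permissions(permissions)`, which in `google.cloud.storage` returns the subset of the requested permissions that the caller actually holds.

```python
import logging

logger = logging.getLogger(__name__)

# IAM permissions assumed to be needed: read metadata and upload objects.
REQUIRED = ("storage.objects.get", "storage.objects.create")


def missing_permissions(bucket, required=REQUIRED):
    """Return the required permissions the current credentials lack."""
    granted = set(bucket.test_iam_permissions(list(required)))
    missing = [perm for perm in required if perm not in granted]
    if "storage.objects.get" in missing:
        # Without read access, metadata lookups fail and files look new.
        logger.warning(
            "No 'storage.objects.get' permission on the bucket; "
            "file metadata cannot be read, so files will be re-downloaded."
        )
    return missing


class FakeBucket:
    """Stand-in bucket that grants a fixed set of permissions."""

    def __init__(self, granted):
        self._granted = set(granted)

    def test_iam_permissions(self, permissions):
        return [perm for perm in permissions if perm in self._granted]


write_only = FakeBucket(["storage.objects.create"])
read_write = FakeBucket(["storage.objects.get", "storage.objects.create"])

print(missing_permissions(write_only))  # ['storage.objects.get']
print(missing_permissions(read_write))  # []
```

The `write_only` case mirrors the setup described above: uploads succeed, but reading an object's metadata is refused, so every run looks like a first run.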

@kmike
Member

kmike commented May 6, 2020

Thanks @michalp2213 for the investigation and the fix!
Should we close it, as #4508 is merged?

@Gallaecio
Member

Gallaecio commented May 7, 2020

@lblanche: if you have a chance, please test this again with the master branch of Scrapy, and feel free to reopen if you do not get the corresponding warning/error message in your log.

@LuisBlanche

LuisBlanche commented May 7, 2020

Unfortunately, I do not have access to this project anymore. I think I trust @michalp2213's answer.
