
Fix file expiration issue with GCS (Issue #5317) #5318

Merged
merged 5 commits into scrapy:master
May 20, 2022

Conversation

mnannan
Contributor

@mnannan mnannan commented Nov 14, 2021

Problem description

As explained in the linked issue, file expiration does not work with GCS if you don't write at the root of your bucket.
To reproduce it, take a look at the steps described in that issue.

Issue description

Let's say we set FILES_STORE=gs://my_bucket/my_prefix/.

The issue is that, for a given path, persist_file uploads images to gs://my_bucket/my_prefix/path, while stat_file returns information about the file located at gs://my_bucket/path. As a result, no file can ever be considered already downloaded, because the paths mismatch.
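The mismatch can be sketched as follows (illustrative only; the `prefix` and `path` values are hypothetical examples, and the real logic lives in `scrapy/pipelines/files.py`):

```python
# With FILES_STORE = "gs://my_bucket/my_prefix/", the store keeps
# "my_prefix/" as its prefix. Before the fix, only persist_file used it.
prefix = "my_prefix/"
path = "full/my_data.txt"  # example relative path produced by the pipeline

persist_path = prefix + path  # where persist_file uploads the file
stat_path = path              # where stat_file looked it up (the bug)

print(persist_path)  # my_prefix/full/my_data.txt
print(stat_path)     # full/my_data.txt
assert persist_path != stat_path  # stat_file never finds the uploaded file
```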

Proposed solution

The solution is quite straightforward: we just need to make sure we use the same path in both persist_file and stat_file.

I did not want to change self.prefix + path into something a bit cleaner like os.path.join(self.prefix, path), as that would not be backward compatible.
We could still do it the same way as in [S3FilesStore](https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L122): f"{self.prefix}{path}".

I wrapped the logic in _get_blob_path to make sure we use the same logic in both places, especially if this is refactored in the future; however, it may be seen as overkill (up to the reviewers).

Version impacted

All versions since GCSFilesStore was introduced (1.6) seem to be affected.

Tests

Unit tests

I wasn't able to find what value is used for GCS_TEST_FILE_URI in the unit tests, but if you set it to a value with a prefix, like gs://my_bucket/my_prefix, the test breaks, while if you set it to the root of your bucket, like gs://my_bucket/, it passes.

On master

On my side, I set GCS_TEST_FILE_URI to a personal bucket (with my_prefix) and ran the following command (after tweaking tox.ini a bit, as I did not find an easier way to run that test):

 tox -e py -- --no-cov tests/test_pipeline_files.py::TestGCSFilesStore -v

and got the following error

FAILED tests/test_pipeline_files.py::TestGCSFilesStore::test_persist - twisted.trial.unittest.FailTest: 'last_modified' not in {}

On this branch

The test passed:

tests/test_pipeline_files.py::TestGCSFilesStore::test_persist PASSED                                                                                                                                                                                [100%]

Why not add a new test?

As there is already a test covering this issue, I was not sure whether we should add another one. If we want to make sure that the path contained in GCS_TEST_FILE_URI has a prefix, we could append one to the content of GCS_TEST_FILE_URI in the test. (This is up to the reviewers.)

@mnannan mnannan changed the title Fix file expirtation issue with GCS Fix file expirtation issue with GCS (Issue #5317) Nov 14, 2021
@codecov

codecov bot commented Nov 15, 2021

Codecov Report

Merging #5318 (a6475eb) into master (9a28eb0) will decrease coverage by 0.38%.
The diff coverage is 16.66%.

❗ Current head a6475eb differs from pull request most recent head c656043. Consider uploading reports for the commit c656043 to get more accurate results

@@            Coverage Diff             @@
##           master    #5318      +/-   ##
==========================================
- Coverage   88.77%   88.39%   -0.39%     
==========================================
  Files         163      163              
  Lines       10666    10670       +4     
  Branches     1818     1818              
==========================================
- Hits         9469     9432      -37     
- Misses        922      960      +38     
- Partials      275      278       +3     
Impacted Files Coverage Δ
scrapy/pipelines/files.py 71.23% <16.66%> (-0.65%) ⬇️
scrapy/utils/asyncgen.py 33.33% <0.00%> (-66.67%) ⬇️
scrapy/robotstxt.py 75.30% <0.00%> (-22.23%) ⬇️
scrapy/utils/spider.py 69.69% <0.00%> (-12.13%) ⬇️
scrapy/utils/test.py 54.68% <0.00%> (-6.25%) ⬇️
scrapy/utils/defer.py 92.92% <0.00%> (-3.04%) ⬇️
scrapy/core/downloader/__init__.py 90.97% <0.00%> (-1.51%) ⬇️
scrapy/core/downloader/handlers/http11.py 93.13% <0.00%> (-0.99%) ⬇️

Member

@Gallaecio Gallaecio left a comment


The change looks good to me, thanks!

Any chance we can cover this with an automated test somehow?

@mnannan
Contributor Author

mnannan commented Nov 15, 2021

Thanks @Gallaecio for reviewing!
Actually, there is a test that more or less covers this issue: https://github.com/scrapy/scrapy/blob/master/tests/test_pipeline_files.py#L503
Depending on what is provided in GCS_TEST_FILE_URI, the test can fail:

  • If GCS_TEST_FILE_URI=gs://my_bucket/, it will pass.
  • If GCS_TEST_FILE_URI=gs://my_bucket/my_prefix, it will fail.

The thing is, I don't think this test is meant to be triggered during CI (at least it looks to be skipped by CI at the moment):
tests/test_pipeline_files.py ........................ss                  [ 67%]

From here

To improve tests/test_pipeline_files.py::TestGCSFilesStore::test_persist, I see two options (which can be implemented together):

  • Making sure this test runs during CI: this would probably require mocking things like google.cloud.storage.Blob, as it does not look like we can test against GCS during CI.
  • Making sure we use a prefix in the test: the path used in the test comes from GCS_TEST_FILE_URI, so we could concatenate 'my_folder' to GCS_TEST_FILE_URI instead of using it directly; this would help ensure it works well when someone triggers the test manually. Or we could even test with both a path like gs://my_bucket/ and one like gs://my_bucket/my_prefix/.

@mnannan mnannan changed the title Fix file expirtation issue with GCS (Issue #5317) Fix file expiration issue with GCS (Issue #5317) Nov 15, 2021
Member

@Gallaecio Gallaecio left a comment


Well, if you feel up to it, I think it would be great to have a mocked version of the existing test that does run in CI.

I’m OK with merging as is, though.

@mnannan
Contributor Author

mnannan commented Nov 15, 2021

@Gallaecio I gave mocking a try, but I'm not 100% satisfied with the test; I thought the result would be a bit better.
Unfortunately, as it requires google.cloud to be installed, it cannot be run in CI pipelines by default; it works as expected, though.
I cherry-picked this test onto master and ran it, and it broke as expected when I ran tox -e py -- --no-cov tests/test_pipeline_files.py::TestGCSFilesStore:

=========================================================================================================== short test summary info ============================================================================================================
FAILED tests/test_pipeline_files.py::TestGCSFilesStore::test_blob_path_consistency - AssertionError: Expected call: get_blob('my_prefix/full/my_data.txt')
========================================================================================================= 1 failed, 1 skipped in 0.45s =========================================================================================================

while on this branch it passed:

tests/test_pipeline_files.py .s                                                                                                                                                                                                          [100%]

========================================================================================================= 1 passed, 1 skipped in 0.31s =========================================================================================================
___________________________________________________________________________________________________________________ summary ____________________________________________________________________________________________________________________
  py: commands succeeded
  congratulations :)

I would be interested in knowing what you think about the test. On my side, I'm quite torn: it will ensure we catch this issue if it ever comes back, but it requires a lot of tricky mocking for such a simple issue.

@Gallaecio
Member

I would be interested in knowing what do you think about the test

A test that breaks if we revert your fix? Bring it in!

If your only concern is needing google.cloud, we have a specific CI job to test stuff that requires optional dependencies, so you could include the dependency for that job. 1 job running the test is better than none, and in this case it is probably enough.

Comment on lines 537 to 538
with mock.patch('google.cloud.storage') as _:
    with mock.patch('scrapy.pipelines.files.time') as _:
Member


Are these needed for the test to work as expected? I’m not used to patching without then using the patched object to define output and side effects.

Contributor Author


It may not be the best way of using monkeypatching, but my idea was that I only care about making sure that bucket.blob and bucket.get_blob are called with the same path.
The problem with patching individual objects here is that we would need to patch quite a few of them if we don't want to refactor GCSFilesStore.
I may be missing something, but it looks like we would need to patch the following methods:

  • storage.Client
  • client.bucket
  • bucket.test_iam_permissions
  • bucket.get_blob
  • bucket.blob
  • blob.upload_from_string

If we wanted to mimic all of that, it would work, but in practice that behavior is already covered by test_persist.

The main advantage of using mock.patch here is that storage and all the above-mentioned methods become Mock objects, which means that every time you access an attribute or call a method it returns a Mock object and accepts any arguments.
After that, you can check which method was called with which arguments:

 store.bucket.blob.assert_called_with(expected_blob_path)
 store.bucket.get_blob.assert_called_with(expected_blob_path)

With those two lines I can make sure that bucket.get_blob from stat_file and bucket.blob from persist_file are called with the same, correct parameter.

Patching time is required because of last_modified = time.mktime(blob.updated.timetuple()); an alternative would have been to patch blob.updated.timetuple to return something that time.mktime can process.
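The assertion pattern described above can be sketched independently of Scrapy. The FakeStore class below is a hypothetical stand-in for GCSFilesStore, reduced to the path logic, so the snippet runs without google.cloud installed:

```python
from unittest import mock


class FakeStore:
    """Hypothetical stand-in for GCSFilesStore, reduced to the path logic."""

    def __init__(self, bucket, prefix):
        self.bucket = bucket
        self.prefix = prefix

    def _get_blob_path(self, path):
        return self.prefix + path

    def persist_file(self, path, data):
        self.bucket.blob(self._get_blob_path(path)).upload_from_string(data)

    def stat_file(self, path):
        return self.bucket.get_blob(self._get_blob_path(path))


# Every attribute access on a Mock returns another Mock that accepts any
# arguments, so no real GCS client is needed; we only record the calls.
bucket = mock.Mock()
store = FakeStore(bucket, "my_prefix/")
store.persist_file("full/my_data.txt", b"payload")
store.stat_file("full/my_data.txt")

expected = "my_prefix/full/my_data.txt"
bucket.blob.assert_called_with(expected)      # path used by persist_file
bucket.get_blob.assert_called_with(expected)  # path used by stat_file
```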

@mnannan
Contributor Author

mnannan commented Nov 16, 2021

I fixed the flake8 issue and extended tox.ini as you suggested.
Now the test can easily be run with tox -e extra-deps -- --no-cov tests/test_pipeline_files.py::TestGCSFilesStore.
If the checks are green, it's ready on my side.

@mnannan
Contributor Author

mnannan commented Nov 17, 2021

@Gallaecio Is there anything I need to do on my side?

@mnannan
Contributor Author

mnannan commented Nov 24, 2021

@Gallaecio Should we re-trigger the actions so that all the tests pass before merging?

@mnannan
Contributor Author

mnannan commented Dec 18, 2021

Hello @Gallaecio,
I don't want to be pushy, but is there anything blocking this issue?

@mnannan
Contributor Author

mnannan commented Jan 7, 2022

Hello @wRAR, sorry for pinging you, but you seem to be the most active reviewer.
I'm afraid this issue has been a bit forgotten for a couple of months, while it's almost ready to be merged.
Should I ask someone else for review?

@Gallaecio
Member

Hello @Gallaecio, I don't want to be pushy, but is there anything blocking this issue?

Sorry for not answering earlier.

This issue is in my queue for review (3rd at the moment, after #5290 and #5205). So, I haven’t missed it or forgotten about it, I simply haven’t had the time to look at it yet.

@Gallaecio
Member

Closing and reopening to re-trigger tests.

@Gallaecio Gallaecio closed this Mar 17, 2022
@Gallaecio Gallaecio reopened this Mar 17, 2022
@Gallaecio
Member

@mnannan I tried to fix the current conflicts, but I do not have permissions to push to your branch. mnannan#1 should do the job, or you could address them yourself manually if you prefer.

@mnannan
Contributor Author

mnannan commented Apr 2, 2022

Thanks for looking at this; I'll take care of the conflicts in the coming days.

@mnannan
Contributor Author

mnannan commented Apr 24, 2022

Thanks for the PR @Gallaecio, I've merged it so you can move forward.

@Gallaecio Gallaecio merged commit 078622c into scrapy:master May 20, 2022
@Gallaecio
Member

Thank you!
