Fix file expiration issue with GCS (Issue #5317) #5318
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5318 +/- ##
==========================================
- Coverage 88.77% 88.39% -0.39%
==========================================
Files 163 163
Lines 10666 10670 +4
Branches 1818 1818
==========================================
- Hits 9469 9432 -37
- Misses 922 960 +38
- Partials 275 278 +3
The change looks good to me, thanks!
Any chance we can cover this with an automated test somehow?
Thanks @Gallaecio for reviewing!
Well, if you feel up to it, I think it would be great to support a mock version of the existing test that does run in CI.
I’m OK with merging as is, though.
…le and GCSFilesStore.stat_file (6b9f3a8 to 42acc6f)
@Gallaecio I gave mocking a try, but I'm not 100% satisfied with the test; I thought the result would have been a bit better.
while on that branch it worked well
I would be interested in knowing what you think about the test. On my side I'm torn: it will ensure that we catch this issue in the future if it comes back, but it's super tricky to mock for such a simple issue.
A test that breaks if we revert your fix? Bring it in! If your only concern is needing
tests/test_pipeline_files.py
with mock.patch('google.cloud.storage') as _:
    with mock.patch('scrapy.pipelines.files.time') as _:
Are these needed for the test to work as expected? I’m not used to patching without then using the patched object to define output and side effects.
It may not be the best way of using monkeypatching, but my idea was that I only care about making sure that bucket.blob and bucket.get_blob are called with the same path.
The problem with patching individual objects here is that we would need to patch quite a few of them if we don't want to refactor GCSFilesStore.
I may be missing something, but it looks like the methods we would need to patch are the following:
- storage.Client
- client.bucket
- bucket.test_iam_permissions
- bucket.get_blob
- bucket.blob
- blob.upload_from_string
if we wanted to mimic that everything works end to end, but in practice that is already covered by test_persist.
The main advantage of using mock.patch here is that storage and all the methods mentioned above become Mock objects, which means that every time you call an attribute or a method it returns a Mock object and accepts any arguments.
After that you can check which method has been called with which arguments:
store.bucket.blob.assert_called_with(expected_blob_path)
store.bucket.get_blob.assert_called_with(expected_blob_path)
With those two lines I can make sure that self.bucket.get_blob from stat_file and self.bucket.blob from persist_file are called with the same, correct parameter.
Patching time is required because of last_modified = time.mktime(blob.updated.timetuple()); an alternative would have been to patch blob.updated.timetuple so that it returns something time.mktime can process.
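For reference, here is a condensed sketch of the kind of test described above, assuming a Twisted trial test case like the ones in tests/test_pipeline_files.py. The test class name is illustrative and the call signatures are approximations rather than a copy of the merged test; google-cloud-storage needs to be installed so that google.cloud.storage can be patched.

```python
from unittest import mock

from twisted.internet import defer
from twisted.trial import unittest

from scrapy.pipelines.files import GCSFilesStore


class GCSBlobPathConsistencySketch(unittest.TestCase):
    @defer.inlineCallbacks
    def test_blob_path_consistency(self):
        # Patching google.cloud.storage turns the GCS client into a MagicMock,
        # so no credentials or real bucket are needed.
        with mock.patch('google.cloud.storage'):
            # stat_file calls time.mktime(blob.updated.timetuple()) on a Mock
            # blob, so time is patched as well (see the explanation above).
            with mock.patch('scrapy.pipelines.files.time'):
                store = GCSFilesStore('gs://my_bucket/my_prefix/')
                store.bucket = mock.Mock()
                path = 'full/my_data.txt'
                yield store.persist_file(path, mock.Mock(), info=None, meta=None, headers=None)
                yield store.stat_file(path, info=None)
                expected_blob_path = store.prefix + path
                # Both code paths must hit the bucket with the same blob path.
                store.bucket.blob.assert_called_with(expected_blob_path)
                store.bucket.get_blob.assert_called_with(expected_blob_path)
```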
…est_blob_path_consistency
I fixed the flake8 issue and extended
@Gallaecio Is there anything I need to do on my side?
@Gallaecio Should we re-trigger the actions so that all the tests pass before this can be merged?
Hello @Gallaecio
Hello @wRAR, sorry for pinging you, but you seem to be the most active reviewer.
Sorry for not answering earlier. This issue is in my queue for review (3rd at the moment, after #5290 and #5205). So, I haven’t missed it or forgotten about it, I simply haven’t had the time to look at it yet.
Closing and reopening to re-trigger tests.
Thanks for looking at this, I'll take care of the conflicts in the following days.
Merge upstream/master
Thanks for the PR @Gallaecio, I've merged it so you can move forward.
Thank you!
Problem description
As explained in that issue, file expiration does not work with GCS if you don't write at the root of your bucket.
If you want to reproduce it, I invite you to take a look at the issue.
Issue description
Let's say that we set FILES_STORE=gs://my_bucket/my_prefix/.
The issue is that, for a given path, persist_file uploads images in gs://my_bucket/my_prefix/path while stat_file returns information about the file located in gs://my_bucket/path. So no file can ever be considered as already downloaded, because the paths mismatch.
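To make the mismatch concrete, a tiny illustration (paths only, no GCS calls; the prefix and file name are example values):

```python
# FILES_STORE = 'gs://my_bucket/my_prefix/'
prefix = 'my_prefix/'
path = 'full/my_file.txt'

uploaded_at = f'gs://my_bucket/{prefix}{path}'  # where persist_file uploads the file
checked_at = f'gs://my_bucket/{path}'           # where stat_file looked it up before the fix

assert uploaded_at != checked_at  # so the file is never seen as already downloaded
```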
Proposed solution
The solution is quite straightforward: we just need to make sure we use the same path in both persist_file and stat_file.
I did not want to change self.prefix + path into something a bit cleaner like os.path.join(self.prefix, path), as it would not be backward compatible. We could still do it the same way as in [S3FilesStore](https://github.com/scrapy/scrapy/blob/master/scrapy/pipelines/files.py#L122) --> f"{self.prefix}{path}".
I wrapped the logic in _get_blob_path to make sure we use the same logic in both places, especially if this is refactored in the future; it may be seen as overkill, though (up to the reviewers).
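A minimal sketch of the idea (not the actual diff): the store builds the blob path through a single helper, so the write and read paths can never diverge again. The class and the persist_blob_path/stat_blob_path helpers below are illustrative stand-ins for the real GCSFilesStore methods.

```python
class GCSFilesStoreSketch:
    def __init__(self, uri):
        # e.g. 'gs://my_bucket/my_prefix/' -> bucket 'my_bucket', prefix 'my_prefix/'
        bucket, _, prefix = uri[len('gs://'):].partition('/')
        self.bucket_name = bucket
        self.prefix = prefix

    def _get_blob_path(self, path):
        # Plain concatenation (f'{self.prefix}{path}') rather than os.path.join,
        # to stay backward compatible with existing stores.
        return f'{self.prefix}{path}'

    def persist_blob_path(self, path):
        # persist_file would upload to this blob path ...
        return self._get_blob_path(path)

    def stat_blob_path(self, path):
        # ... and stat_file would look the file up at the very same blob path.
        return self._get_blob_path(path)


store = GCSFilesStoreSketch('gs://my_bucket/my_prefix/')
assert store.persist_blob_path('full/my_file.txt') == store.stat_blob_path('full/my_file.txt')
```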
Version impacted
All the versions since GCSFilesStore was introduced (so 1.6) seem to be impacted.

Tests
Unit tests
I wasn't able to find what is used as GCS_TEST_FILE_URI for the unit tests, but if you set a value with a prefix like gs://my_bucket/my_prefix it breaks, while if you set it to the root of your bucket like gs://my_bucket/ it passes.

On master
On my side I set GCS_TEST_FILE_URI to a personal bucket (with my_prefix) and ran the following command (after having tweaked the tox.ini a bit, as I did not find how to run that test easily) and got the following error:
On this branch
The test worked well
Why not add a new test?
As there is already a test covering this issue, I was not sure whether or not we want to add another one. If we want to make sure that the path contained in GCS_TEST_FILE_URI has a prefix, we could still add one to the content of GCS_TEST_FILE_URI in the test. (This is up to the reviewers.)
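If we went down that road, a possible tweak could look like the sketch below. It assumes the integration test reads GCS_TEST_FILE_URI from the environment, as test_persist appears to do; the appended prefix value is purely illustrative.

```python
import os

# Force the integration test to exercise a prefixed blob path, regardless of
# the GCS_TEST_FILE_URI value that the CI environment provides.
uri = os.environ.get('GCS_TEST_FILE_URI')
if uri:
    uri = uri.rstrip('/') + '/regression_test_prefix/'
```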