Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] add FEED_STORAGE_S3_ACL setting #3607

Merged
merged 21 commits into from Mar 22, 2019

Conversation

@victor-torres
Copy link
Contributor

@victor-torres victor-torres commented Jan 29, 2019

This is similar to the IMAGES_STORE_S3_ACL setting.

Just opened the pull request to validate the idea.

Still, need to:

  • update tests
  • update docs

Example usage on settings.py:

FEED_STORAGE_S3_ACL = 'bucket-owner-full-control'
@@ -118,6 +118,7 @@ def __init__(self, uri, access_key=None, secret_key=None):
self.secret_key = u.password or secret_key
self.is_botocore = is_botocore()
self.keyname = u.path[1:] # remove first "/"
self.policy = settings.get('FEED_STORAGE_S3_ACL', 'private')
Copy link
Member

@lucywang000 lucywang000 Jan 29, 2019

I think the settings variable is only defined in the if no_defaults block above

Copy link
Member

@lucywang000 lucywang000 Jan 29, 2019

and I think it's better to use a default value of None instead of "private" since the default value for polify kw arg in boto set_contents_from_file is also None

set_contents_from_file(fp, headers=None, replace=True, cb=None, num_cb=10, policy=None, md5=None, reduced_redundancy=False, query_args=None, encrypt_key=False, size=None, rewind=False)

http://boto.cloudhackers.com/en/latest/ref/s3.html#boto.s3.key.Key.set_contents_from_file

Copy link
Member

@dangra dangra Jan 30, 2019

I think the settings variable is only defined in the if no_defaults block above

that's true, but it also means the policy should be passed from from_crawler method as constructor argument defaulting to None.

@codecov
Copy link

@codecov codecov bot commented Feb 6, 2019

Codecov Report

Merging #3607 into master will decrease coverage by 0.63%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #3607      +/-   ##
==========================================
- Coverage   84.52%   83.88%   -0.64%     
==========================================
  Files         167      167              
  Lines        9406     9407       +1     
  Branches     1397     1397              
==========================================
- Hits         7950     7891      -59     
- Misses       1199     1256      +57     
- Partials      257      260       +3
Impacted Files Coverage Δ
scrapy/extensions/feedexport.py 70% <0%> (-8.47%) ⬇️
scrapy/core/downloader/handlers/s3.py 62.9% <0%> (-32.26%) ⬇️
scrapy/utils/boto.py 46.66% <0%> (-26.67%) ⬇️
scrapy/core/downloader/tls.py 77.5% <0%> (-12.5%) ⬇️
scrapy/core/downloader/handlers/http11.py 89.92% <0%> (-2.62%) ⬇️
scrapy/core/scraper.py 86.48% <0%> (-2.03%) ⬇️
scrapy/pipelines/files.py 67.15% <0%> (-1.1%) ⬇️

@codecov
Copy link

@codecov codecov bot commented Feb 6, 2019

Codecov Report

Merging #3607 into master will increase coverage by 0.14%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3607      +/-   ##
==========================================
+ Coverage   84.54%   84.68%   +0.14%     
==========================================
  Files         167      167              
  Lines        9420     9424       +4     
  Branches     1402     1402              
==========================================
+ Hits         7964     7981      +17     
+ Misses       1199     1187      -12     
+ Partials      257      256       -1
Impacted Files Coverage Δ
scrapy/extensions/feedexport.py 84.9% <100%> (+6.43%) ⬆️
scrapy/settings/default_settings.py 98.64% <100%> (ø) ⬆️

@victor-torres victor-torres force-pushed the feed-storage-s3-acl branch from 123df85 to b8ab9a1 Feb 6, 2019
@victor-torres victor-torres changed the title WIP - add FEED_STORAGE_S3_ACL setting add FEED_STORAGE_S3_ACL setting Feb 6, 2019
Copy link
Member

@lucywang000 lucywang000 left a comment

lgtm!

@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Feb 7, 2019

@dangra I believe it's ready. Could you take another look?

Copy link
Contributor

@ejulio ejulio left a comment

I wrote some thoughts here, but I open for other opinions on that.
Feel free to comment and share your ideas as well.

@@ -131,18 +132,28 @@ def __init__(self, uri, access_key=None, secret_key=None):
@classmethod
def from_crawler(cls, crawler, uri):
return cls(uri, crawler.settings['AWS_ACCESS_KEY_ID'],
crawler.settings['AWS_SECRET_ACCESS_KEY'])
crawler.settings['AWS_SECRET_ACCESS_KEY'],
crawler.settings.get('FEED_STORAGE_S3_ACL'))
Copy link
Contributor

@ejulio ejulio Feb 8, 2019

I don't think you need settings.get() here because there is a default value of in default_settings.py

Copy link
Contributor Author

@victor-torres victor-torres Feb 8, 2019

Makes sense now that I introduced default values in the default_settings.py file.

self.s3_client.put_object(
Bucket=self.bucketname, Key=self.keyname, Body=file)
Bucket=self.bucketname, Key=self.keyname, Body=file,
**kwargs)
Copy link
Contributor

@ejulio ejulio Feb 8, 2019

Any reasons to use **kwargs instead of simple method call with a named argument?

Copy link
Contributor Author

@victor-torres victor-torres Feb 8, 2019

I'm just trying to avoid changing the previous behavior. I'm not sure about the side-effects of explicitly passing ACL or policy arguments even if they're defined as None. That's because the ACL argument, for example, is not a named argument, but loaded via kwargs in the put_object method.

Copy link
Contributor

@ejulio ejulio Feb 8, 2019

Ok. Got it.
@dangra can you share some thought here as well?

self.assertEqual(storage.secret_key, 'secret_key')
self.assertEqual(storage.acl, 'custom-acl')

def test_store_in_thread_botocore_without_acl(self):
Copy link
Contributor

@ejulio ejulio Feb 8, 2019

I don't know if you should be calling the private method in a test.
The whole idea is to test the API, so calling store would make more sense and easier for future maintenance.
If there is a problem with threading, you can decorate the test with @defer.inlineCallbacks.

Copy link
Contributor Author

@victor-torres victor-torres Feb 8, 2019

It's true, I shouldn't be testing private methods.

@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Feb 8, 2019

@ejulio I have just updated the source code after your suggestions. Could you take another look, please?

@victor-torres victor-torres force-pushed the feed-storage-s3-acl branch from 8d32ab3 to f824f5b Feb 8, 2019
self.s3_client.put_object(
Bucket=self.bucketname, Key=self.keyname, Body=file)
Bucket=self.bucketname, Key=self.keyname, Body=file,
**kwargs)
Copy link
Contributor

@ejulio ejulio Feb 8, 2019

Ok. Got it.
@dangra can you share some thought here as well?

return f(*args, **kwargs)

with mock.patch('botocore.client.BaseClient._make_api_call') as m:
with mock.patch('twisted.internet.threads.deferToThread',
Copy link
Contributor

@ejulio ejulio Feb 8, 2019

I don't think you need to mock twisted, but you can yield storage.store() and decorate the test with @defer.inlineCallbacks

Copy link
Contributor Author

@victor-torres victor-torres Feb 8, 2019

Done. I've just pushed these changes.

self.assertEqual(storage.acl, 'custom-acl')

storage.is_botocore = False
storage.connect_s3 = mock.MagicMock()
Copy link
Contributor

@ejulio ejulio Feb 8, 2019

Maybe mock.create_autospec() is better than MagicMock because it could help catching errors if the API changes or if there is a typo in the test/implementation.

Copy link
Contributor Author

@victor-torres victor-torres Feb 8, 2019

I don't know. I decided to go with this simple method because I just would like to guarantee we're calling the put_object and new_key methods with ACL and policy arguments respectively.

If we change the method names, the code would break because the mocks wouldn't have been called from the test.

Let me know if I am missing anything.

Copy link
Contributor

@ejulio ejulio Feb 11, 2019

It is fine for me, the main concern was if the API (method names) change, the unit test won't catch these errors and we'll notice them in production or so. But, if we create_autospec it will create a mock with the same methods available in the API. So, if the API changes and we call a method that does not exist anymore, we'll get an error.
Again, that is just a thought and I was collecting some ideas as well :)

@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Feb 8, 2019

@ejulio I have just updated the source code. Thanks again for your valuable suggestions 😄

ejulio
ejulio approved these changes Feb 11, 2019
Copy link
Contributor

@ejulio ejulio left a comment

Looks good to me.
@dangra it is up to you.

scrapy/extensions/feedexport.py Outdated Show resolved Hide resolved
scrapy/extensions/feedexport.py Outdated Show resolved Hide resolved
scrapy/settings/default_settings.py Outdated Show resolved Hide resolved
@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Feb 12, 2019

Thank you for your review, @kmike.

I have just implemented your suggestions here, could you take another look, please?

tox.ini Outdated
@@ -47,6 +47,7 @@ deps =
lxml==3.4.0
Twisted==14.0.2
boto==2.34.0
botocore==1.12.89
Copy link
Member

@kmike kmike Feb 13, 2019

I think we should be dropping jessie support, but if we're keeping it, botocore should be 0.62 (see https://packages.debian.org/en/jessie/python-botocore)

Copy link
Contributor Author

@victor-torres victor-torres Feb 13, 2019

Updated tox.ini. Waiting for tests to pass. Thanks.

Copy link
Contributor Author

@victor-torres victor-torres Feb 13, 2019

botocore version supported by debian jessie does not contain required methods... I'll just skip this test when in jessie environments.

self.assertEqual(storage.secret_key, 'secret_key')
self.assertEqual(storage.acl, None)

with mock.patch('botocore.client.BaseClient._make_api_call') as m:
Copy link
Member

@kmike kmike Feb 13, 2019

The heavy use of mocks worries me, especially as we're mocking a private method; tbh I'd prefer not to use mocks at all, and write tests in some other way (including having less of them).

On the other hand, checking for exact operation names like PutObject looks nice here. So I'm ok with merging these tests as-is if others are ok with it, but ideas on how to avoid mocking a private method are welcome.

Copy link
Contributor Author

@victor-torres victor-torres Feb 13, 2019

Now that you pointed it's also bothering me. I have just refactored the tests not to use private methods. Thanks.

Copy link
Contributor

@ejulio ejulio Feb 14, 2019

Just writing a thought here.
I do agree that we should not test private methods, but the interaction with the API we're given.
Otherwise we are getting tests coupled to internals, and what we want is that given some input, this method will give me some output, no matter how.

Regarding mocks, no problems, just think we should be careful when using them.

😄

@kmike kmike changed the title add FEED_STORAGE_S3_ACL setting [MRG+1] add FEED_STORAGE_S3_ACL setting Feb 13, 2019
@kmike
Copy link
Member

@kmike kmike commented Feb 13, 2019

Thanks @victor-torres for a careful implementation!

I have some concerns about use of mocks in tests, but they are more of personal preferences, so if others are fine with the way tests are written, I'm +1 to merge it, as soon as tests pass.

Also, thanks @lucywang000 and @ejulio for the reviews, this is really appreciated.

@victor-torres victor-torres force-pushed the feed-storage-s3-acl branch 2 times, most recently from 43abf47 to 468202e Feb 13, 2019

@defer.inlineCallbacks
def test_store_botocore_without_acl(self):
if os.getenv('TOXENV') == 'jessie':
Copy link
Member

@kmike kmike Feb 13, 2019

What do you think about skipping a test if botocore is not installed?

Copy link
Contributor Author

@victor-torres victor-torres Feb 13, 2019

If someone accidentally removes botocore from other test environment requirements we could end up skipping this test and therefore hiding some errors.

Leaving it as it is now is also a good point for when someone is removing the jessie support. A quick grep on the source code would highlight this conditional statement and the exception raise line.

Copy link
Member

@kmike kmike Feb 14, 2019

I think that it is better to skip the test if botocore is not present because of two reasons:

  1. botocore is an optional Scrapy requirement; we should be running tests without it, to make sure Scrapy works without botocore. So it is fine not to have botocore in an environment.
  2. People are running Scrapy tests without tox, e.g. on their environments; tests currently are not aware of tox and don't require it - checking for TOXENV variable changes that.

Which methods are missing in botocore? Should we specify a minimal botocore version somewhere (is_botocore function? docs?)

Copy link
Contributor Author

@victor-torres victor-torres Feb 14, 2019

Ok, I'll just check the is_botocore method then.

About the other question, I did some research and this is what I found out:

Those versions are pretty old so I believe we shouldn't have problems.

@victor-torres victor-torres force-pushed the feed-storage-s3-acl branch from 468202e to ea8be62 Feb 13, 2019
@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Feb 13, 2019

Thank you for your review, @kmike. Tests are passing now.

@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Feb 14, 2019

@kmike I believe someone has to rebuild this last commit. Does not seem to be related to the latest changes.

@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Feb 14, 2019

@kmike I believe someone has to rebuild this last commit. Does not seem to be related to the latest changes.

Nervermind, made a blank commit.

@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Feb 14, 2019

@kmike done, tests are passing.

@kmike
Copy link
Member

@kmike kmike commented Feb 15, 2019

👍

@@ -185,6 +185,10 @@ passed through the following settings:
* :setting:`AWS_ACCESS_KEY_ID`
* :setting:`AWS_SECRET_ACCESS_KEY`

You can also define a custom ACL for exported feeds using this setting:

* :setting:`FEED_STORAGE_S3_ACL`
Copy link
Member

@Gallaecio Gallaecio Mar 22, 2019

I think we should consider changing the name to FEED_ACL, and follow an approach similar to other feed storage settings here such as FEED_URI: The backend-specific documentation is provided on the backend-specific documentation section.

Copy link
Member

@Gallaecio Gallaecio Mar 22, 2019

Oh, I see this approach is already followed in other places. I guess any change to that is then a different beast to treat in a different changeset. Lets merge.

Copy link
Member

@Gallaecio Gallaecio left a comment

@kmike I’ll let you merge yourself since there have been changes since you approved, it’s +1 from me.

@kmike kmike added this to the v1.7 milestone Mar 22, 2019
@kmike
Copy link
Member

@kmike kmike commented Mar 22, 2019

Thanks @victor-torres for the implementation and addressing all our comments, and thanks @ejulio, @lucywang000, @dangra and @Gallaecio for reviews !

@kmike kmike merged commit 4196d48 into scrapy:master Mar 22, 2019
3 checks passed
@victor-torres
Copy link
Contributor Author

@victor-torres victor-torres commented Mar 22, 2019

You're all welcome. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants