Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] add FEED_STORAGE_S3_ACL setting #3607

Merged
merged 21 commits into from Mar 22, 2019
Merged
Show file tree
Hide file tree
Changes from 15 commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
0135680
add FEED_STORAGE_S3_ACL setting
victor-torres Jan 29, 2019
ad83ffd
refactoring
victor-torres Feb 6, 2019
126207f
PEP8: use short name for mock method
victor-torres Feb 6, 2019
e0f34be
update docs
victor-torres Feb 6, 2019
7b83ed7
remove typo
victor-torres Feb 6, 2019
e25b9a2
calling it feeds instead of objects
victor-torres Feb 6, 2019
dbeb088
trying to fix jessie testenv by adding botocore to requirements and f…
victor-torres Feb 7, 2019
079af88
also testing without botocore
victor-torres Feb 7, 2019
ceae356
add FEED_STORAGE_S3_ACL to default_settings.py file
victor-torres Feb 8, 2019
cfd183a
no need to use get here since we're defining a default value in defau…
victor-torres Feb 8, 2019
f824f5b
testing public method store instead of private method _store_in_thread
victor-torres Feb 8, 2019
1eac2a1
simplifying how we deal with threads.deferToThread calls
victor-torres Feb 8, 2019
7c9f0bd
using named params with optional amazon s3 params
victor-torres Feb 12, 2019
c2dede2
reduce code with simple ternary operator
victor-torres Feb 12, 2019
984e706
using blank string instead of None as default value as proposed by @k…
victor-torres Feb 12, 2019
b4d132b
setting botocore version as described in debian jessie website
victor-torres Feb 13, 2019
dc0b643
refactoring tests to avoid mocking private method
victor-torres Feb 13, 2019
ea8be62
botocore is not supported on debian jessie
victor-torres Feb 13, 2019
9b8ba4c
try to import botocore before runing some tests
victor-torres Feb 14, 2019
9fed6fc
trigger tests
victor-torres Feb 14, 2019
fda1d04
Merge branch 'master' into feed-storage-s3-acl
victor-torres Mar 15, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
17 changes: 17 additions & 0 deletions docs/topics/feed-exports.rst
Expand Up @@ -185,6 +185,10 @@ passed through the following settings:
* :setting:`AWS_ACCESS_KEY_ID`
* :setting:`AWS_SECRET_ACCESS_KEY`

You can also define a custom ACL for exported feeds using this setting:

* :setting:`FEED_STORAGE_S3_ACL`
Copy link
Member

@Gallaecio Gallaecio Mar 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should consider changing the name to FEED_ACL, and follow an approach similar to other feed storage settings here such as FEED_URI: The backend-specific documentation is provided on the backend-specific documentation section.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see this approach is already followed in other places. I guess any change to that is then a different beast to treat in a different changeset. Lets merge.


.. _topics-feed-storage-stdout:

Standard output
Expand All @@ -205,6 +209,7 @@ These are the settings used for configuring the feed exports:
* :setting:`FEED_URI` (mandatory)
* :setting:`FEED_FORMAT`
* :setting:`FEED_STORAGES`
* :setting:`FEED_STORAGE_S3_ACL`
* :setting:`FEED_EXPORTERS`
* :setting:`FEED_STORE_EMPTY`
* :setting:`FEED_EXPORT_ENCODING`
Expand Down Expand Up @@ -302,6 +307,17 @@ Default: ``{}``
A dict containing additional feed storage backends supported by your project.
The keys are URI schemes and the values are paths to storage classes.

.. setting:: FEED_STORAGE_S3_ACL

FEED_STORAGE_S3_ACL
-------------------

Default: ``''`` (empty string)

A string containing a custom ACL for feeds exported to Amazon S3 by your project.

For a complete list of available values, access the `Canned ACL`_ section on Amazon S3 docs.

.. setting:: FEED_STORAGES_BASE

FEED_STORAGES_BASE
Expand Down Expand Up @@ -366,3 +382,4 @@ format in :setting:`FEED_EXPORTERS`. E.g., to disable the built-in CSV exporter
.. _Amazon S3: https://aws.amazon.com/s3/
.. _boto: https://github.com/boto/boto
.. _botocore: https://github.com/boto/botocore
.. _Canned ACL: https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#canned-acl
18 changes: 13 additions & 5 deletions scrapy/extensions/feedexport.py
Expand Up @@ -93,7 +93,7 @@ def store(self, file):

class S3FeedStorage(BlockingFeedStorage):

def __init__(self, uri, access_key=None, secret_key=None):
def __init__(self, uri, access_key=None, secret_key=None, acl=None):
# BEGIN Backwards compatibility for initialising without keys (and
# without using from_crawler)
no_defaults = access_key is None and secret_key is None
Expand All @@ -118,6 +118,7 @@ def __init__(self, uri, access_key=None, secret_key=None):
self.secret_key = u.password or secret_key
self.is_botocore = is_botocore()
self.keyname = u.path[1:] # remove first "/"
self.acl = acl
if self.is_botocore:
import botocore.session
session = botocore.session.get_session()
Expand All @@ -130,19 +131,26 @@ def __init__(self, uri, access_key=None, secret_key=None):

@classmethod
def from_crawler(cls, crawler, uri):
return cls(uri, crawler.settings['AWS_ACCESS_KEY_ID'],
crawler.settings['AWS_SECRET_ACCESS_KEY'])
return cls(
uri=uri,
access_key=crawler.settings['AWS_ACCESS_KEY_ID'],
secret_key=crawler.settings['AWS_SECRET_ACCESS_KEY'],
acl=crawler.settings['FEED_STORAGE_S3_ACL'] or None
)

def _store_in_thread(self, file):
file.seek(0)
if self.is_botocore:
kwargs = {'ACL': self.acl} if self.acl else {}
self.s3_client.put_object(
Bucket=self.bucketname, Key=self.keyname, Body=file)
Bucket=self.bucketname, Key=self.keyname, Body=file,
**kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reasons to use **kwargs instead of simple method call with a named argument?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just trying to avoid changing the previous behavior. I'm not sure about the side-effects of explicitly passing ACL or policy arguments even if they're defined as None. That's because the ACL argument, for example, is not a named argument, but loaded via kwargs in the put_object method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Got it.
@dangra can you share some thought here as well?

else:
conn = self.connect_s3(self.access_key, self.secret_key)
bucket = conn.get_bucket(self.bucketname, validate=False)
key = bucket.new_key(self.keyname)
key.set_contents_from_file(file)
kwargs = {'policy': self.acl} if self.acl else {}
key.set_contents_from_file(file, **kwargs)
key.close()


Expand Down
2 changes: 2 additions & 0 deletions scrapy/settings/default_settings.py
Expand Up @@ -158,6 +158,8 @@
}
FEED_EXPORT_INDENT = 0

FEED_STORAGE_S3_ACL = ''

FILES_STORE_S3_ACL = 'private'
FILES_STORE_GCS_ACL = ''

Expand Down
139 changes: 139 additions & 0 deletions tests/test_feedexport.py
Expand Up @@ -18,6 +18,7 @@
from tests.mockserver import MockServer
from w3lib.url import path_to_file_uri

import botocore.client
import scrapy
from scrapy.exporters import CsvItemExporter
from scrapy.extensions.feedexport import (
Expand Down Expand Up @@ -186,6 +187,144 @@ def test_store(self):
content = get_s3_content_and_delete(u.hostname, u.path[1:])
self.assertEqual(content, expected_content)

def test_init_without_acl(self):
storage = S3FeedStorage(
's3://mybucket/export.csv',
'access_key',
'secret_key'
)
self.assertEqual(storage.access_key, 'access_key')
self.assertEqual(storage.secret_key, 'secret_key')
self.assertEqual(storage.acl, None)

def test_init_with_acl(self):
storage = S3FeedStorage(
's3://mybucket/export.csv',
'access_key',
'secret_key',
'custom-acl'
)
self.assertEqual(storage.access_key, 'access_key')
self.assertEqual(storage.secret_key, 'secret_key')
self.assertEqual(storage.acl, 'custom-acl')

def test_from_crawler_without_acl(self):
settings = {
'AWS_ACCESS_KEY_ID': 'access_key',
'AWS_SECRET_ACCESS_KEY': 'secret_key',
}
crawler = get_crawler(settings_dict=settings)
storage = S3FeedStorage.from_crawler(
crawler,
's3://mybucket/export.csv'
)
self.assertEqual(storage.access_key, 'access_key')
self.assertEqual(storage.secret_key, 'secret_key')
self.assertEqual(storage.acl, None)

def test_from_crawler_with_acl(self):
settings = {
'AWS_ACCESS_KEY_ID': 'access_key',
'AWS_SECRET_ACCESS_KEY': 'secret_key',
'FEED_STORAGE_S3_ACL': 'custom-acl',
}
crawler = get_crawler(settings_dict=settings)
storage = S3FeedStorage.from_crawler(
crawler,
's3://mybucket/export.csv'
)
self.assertEqual(storage.access_key, 'access_key')
self.assertEqual(storage.secret_key, 'secret_key')
self.assertEqual(storage.acl, 'custom-acl')

@defer.inlineCallbacks
def test_store_botocore_without_acl(self):
storage = S3FeedStorage(
's3://mybucket/export.csv',
'access_key',
'secret_key',
)
self.assertEqual(storage.access_key, 'access_key')
self.assertEqual(storage.secret_key, 'secret_key')
self.assertEqual(storage.acl, None)

with mock.patch('botocore.client.BaseClient._make_api_call') as m:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The heavy use of mocks worries me, especially as we're mocking a private method; tbh I'd prefer not to use mocks at all, and write tests in some other way (including having less of them).

On the other hand, checking for exact operation names like PutObject looks nice here. So I'm ok with merging these tests as-is if others are ok with it, but ideas on how to avoid mocking a private method are welcome.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you pointed it's also bothering me. I have just refactored the tests not to use private methods. Thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just writing a thought here.
I do agree that we should not test private methods, but the interaction with the API we're given.
Otherwise we are getting tests coupled to internals, and what we want is that given some input, this method will give me some output, no matter how.

Regarding mocks, no problems, just think we should be careful when using them.

😄

yield storage.store(BytesIO(b'test file'))

operation_name, api_params = m.call_args[0]
self.assertEqual(operation_name, 'PutObject')
self.assertNotIn('ACL', api_params)

@defer.inlineCallbacks
def test_store_botocore_with_acl(self):
storage = S3FeedStorage(
's3://mybucket/export.csv',
'access_key',
'secret_key',
'custom-acl'
)
self.assertEqual(storage.access_key, 'access_key')
self.assertEqual(storage.secret_key, 'secret_key')
self.assertEqual(storage.acl, 'custom-acl')

with mock.patch('botocore.client.BaseClient._make_api_call') as m:
yield storage.store(BytesIO(b'test file'))

operation_name, api_params = m.call_args[0]
self.assertEqual(operation_name, 'PutObject')
self.assertEqual(api_params.get('ACL'), 'custom-acl')

@defer.inlineCallbacks
def test_store_not_botocore_without_acl(self):
storage = S3FeedStorage(
's3://mybucket/export.csv',
'access_key',
'secret_key',
)
self.assertEqual(storage.access_key, 'access_key')
self.assertEqual(storage.secret_key, 'secret_key')
self.assertEqual(storage.acl, None)

storage.is_botocore = False
storage.connect_s3 = mock.MagicMock()
self.assertFalse(storage.is_botocore)

yield storage.store(BytesIO(b'test file'))

conn = storage.connect_s3(*storage.connect_s3.call_args)
bucket = conn.get_bucket(*conn.get_bucket.call_args)
key = bucket.new_key(*bucket.new_key.call_args)
self.assertNotIn(
dict(policy='custom-acl'),
key.set_contents_from_file.call_args
)

@defer.inlineCallbacks
def test_store_not_botocore_with_acl(self):
storage = S3FeedStorage(
's3://mybucket/export.csv',
'access_key',
'secret_key',
'custom-acl'
)
self.assertEqual(storage.access_key, 'access_key')
self.assertEqual(storage.secret_key, 'secret_key')
self.assertEqual(storage.acl, 'custom-acl')

storage.is_botocore = False
storage.connect_s3 = mock.MagicMock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe mock.create_autospec() is better than MagicMock because it could help catching errors if the API changes or if there is a typo in the test/implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. I decided to go with this simple method because I just would like to guarantee we're calling the put_object and new_key methods with ACL and policy arguments respectively.

If we change the method names, the code would break because the mocks wouldn't have been called from the test.

Let me know if I am missing anything.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fine for me, the main concern was if the API (method names) change, the unit test won't catch these errors and we'll notice them in production or so. But, if we create_autospec it will create a mock with the same methods available in the API. So, if the API changes and we call a method that does not exist anymore, we'll get an error.
Again, that is just a thought and I was collecting some ideas as well :)

self.assertFalse(storage.is_botocore)

yield storage.store(BytesIO(b'test file'))

conn = storage.connect_s3(*storage.connect_s3.call_args)
bucket = conn.get_bucket(*conn.get_bucket.call_args)
key = bucket.new_key(*bucket.new_key.call_args)
self.assertIn(
dict(policy='custom-acl'),
key.set_contents_from_file.call_args
)


class StdoutFeedStorageTest(unittest.TestCase):

Expand Down
1 change: 1 addition & 0 deletions tox.ini
Expand Up @@ -47,6 +47,7 @@ deps =
lxml==3.4.0
Twisted==14.0.2
boto==2.34.0
botocore==1.12.89
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be dropping jessie support, but if we're keeping it, botocore should be 0.62 (see https://packages.debian.org/en/jessie/python-botocore)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated tox.ini. Waiting for tests to pass. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

botocore version supported by debian jessie does not contain required methods... I'll just skip this test when in jessie environments.

Pillow==2.6.1
cssselect==0.9.1
zope.interface==4.1.1
Expand Down