Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] [image_pipeline] bring back uppercase class attributes #1989

Merged
merged 14 commits into from Jul 13, 2016
Merged
Show file tree
Hide file tree
Changes from 11 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
13 changes: 13 additions & 0 deletions docs/topics/media-pipeline.rst
Expand Up @@ -191,6 +191,11 @@ For the Images Pipeline, set :setting:`IMAGES_URLS_FIELD` and/or
If you need something more complex and want to override the custom pipeline
behaviour, see :ref:`topics-media-pipeline-override`.

If you have multiple image pipelines inheriting from ImagePipeline and you want to have different settings in different pipelines
you can set setting keys preceded with uppercase name of your pipeline class. E.g. if your pipeline is called
MyPipeline and you want to have custom IMAGES_URLS_FIELD you define setting MYPIPELINE_IMAGES_URLS_FIELD and your custom
settings will be used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please use 80 columns limit?



Copy link
Contributor

@djunzu djunzu Jun 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be valid to add something like "Otherwise the custom pipeline will inherit settings values from default pipeline."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this follows from context, but probably wont harm to add this.

Additional features
===================
Expand All @@ -214,6 +219,14 @@ specifies the delay in number of days::

The default value for both settings is 90 days.

If you have pipeline that subclasses FilesPipeline and you'd like to have different setting
for it you can set setting keys preceded by uppercase class name. E.g. given pipeline class
called MyPipeline you can set setting key:

MYPIPELINE_FILES_EXPIRES = 180

and pipeline class MyPipeline will have expiration time set to 180.

.. _topics-images-thumbnails:

Thumbnail generation for images
Expand Down
22 changes: 18 additions & 4 deletions scrapy/pipelines/files.py
Expand Up @@ -214,23 +214,37 @@ class FilesPipeline(MediaPipeline):
"""

MEDIA_NAME = "file"
EXPIRES = 90
STORE_SCHEMES = {
'': FSFilesStore,
'file': FSFilesStore,
's3': S3FilesStore,
}
DEFAULT_FILES_URLS_FIELD = 'file_urls'
DEFAULT_FILES_RESULT_FIELD = 'files'

Copy link
Contributor

@djunzu djunzu Jun 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not fix the backward incompatibility (#1985). It must be FILES_URLS_FIELD and FILES_RESULT_FIELD, as it was before #1891. Look at this line: DEFAULT_* is used just to populate the other one.

Here arises one question: should both (e.g. DEFAULT_FILES_URLS_FIELD and FILES_URLS_FIELD) be present now? Because one could have a custom Files/ImagesPipeline using one or another.

I think both should be present in order to totally fix the backward incompatibility:

DEFAULT_FILES_URLS_FIELD = 'file_urls'
FILES_URLS_FIELD = DEFAULT_FILES_URLS_FIELD

DEFAULT_FILES_RESULT_FIELD = 'files'
FILES_RESULT_FIELD = DEFAULT_FILES_RESULT_FIELD = 'files'

Note: same in ImagePipeline.

Copy link
Contributor Author

@pawelmhm pawelmhm Jun 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at this line: DEFAULT_* is used just to populate the other one.

good catch @djunzu I fixed that in fa4d0cd

def __init__(self, store_uri, download_func=None, settings=None):
if not store_uri:
raise NotConfigured

if isinstance(settings, dict) or settings is None:
settings = Settings(settings)


cls_name = "FilesPipeline"
self.store = self._get_store(store_uri)
self.expires = settings.getint('FILES_EXPIRES')
self.files_urls_field = settings.get('FILES_URLS_FIELD')
self.files_result_field = settings.get('FILES_RESULT_FIELD')
self.expires = settings.getint(
self._key_for_pipe('FILES_EXPIRES', cls_name), self.EXPIRES
)
if not hasattr(self, "FILES_URLS_FIELD"):
self.FILES_URLS_FIELD = self.DEFAULT_FILES_URLS_FIELD
if not hasattr(self, "FILES_RESULT_FIELD"):
self.FILES_RESULT_FIELD = self.DEFAULT_FILES_RESULT_FIELD
self.files_urls_field = settings.get(
self._key_for_pipe('FILES_URLS_FIELD', cls_name), self.FILES_URLS_FIELD
)
self.files_result_field = settings.get(
self._key_for_pipe('FILES_RESULT_FIELD', cls_name), self.FILES_RESULT_FIELD
)

super(FilesPipeline, self).__init__(download_func=download_func)

Expand Down
44 changes: 36 additions & 8 deletions scrapy/pipelines/images.py
Expand Up @@ -38,25 +38,53 @@ class ImagesPipeline(FilesPipeline):

MEDIA_NAME = 'image'

# Uppercase attributes kept for backward compatibility with code that subclasses
# ImagesPipeline. They may be overridden by settings.
MIN_WIDTH = 0
MIN_HEIGHT = 0
EXPIRES = 0
THUMBS = {}
DEFAULT_IMAGES_URLS_FIELD = 'image_urls'
DEFAULT_IMAGES_RESULT_FIELD = 'images'

def __init__(self, store_uri, download_func=None, settings=None):
super(ImagesPipeline, self).__init__(store_uri, settings=settings, download_func=download_func)

if isinstance(settings, dict) or settings is None:
settings = Settings(settings)

self.expires = settings.getint('IMAGES_EXPIRES')
self.images_urls_field = settings.get('IMAGES_URLS_FIELD')
self.images_result_field = settings.get('IMAGES_RESULT_FIELD')
self.min_width = settings.getint('IMAGES_MIN_WIDTH')
self.min_height = settings.getint('IMAGES_MIN_HEIGHT')
self.thumbs = settings.get('IMAGES_THUMBS')
cls_name = "ImagesPipeline"
self.expires = settings.getint(
self._key_for_pipe('IMAGES_EXPIRES', cls_name), self.EXPIRES
)
if not hasattr(self, "IMAGES_RESULT_FIELD"):
self.IMAGES_RESULT_FIELD = self.DEFAULT_IMAGES_RESULT_FIELD
if not hasattr(self, "IMAGES_URLS_FIELD"):
self.IMAGES_URLS_FIELD = self.DEFAULT_IMAGES_URLS_FIELD

default_images_urls_field = getattr(self, "IMAGES_URLS_FIELD", "DEFAULT_IMAGES_URLS_FIELD")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default_images_urls_field will be "DEFAULT_IMAGES_URLS_FIELD", not "image_urls" if "IMAGES_URLS_FIELD" attribute is missing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. This would never happen because this attribute is set one line above. It was dead code. I removed that.

self.images_urls_field = settings.get(
self._key_for_pipe('IMAGES_URLS_FIELD', cls_name), default_images_urls_field
)
default_images_result_field = getattr(self, "IMAGES_RESULT_FIELD", "DEFAULT_IMAGES_RESULT_FIELD")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same issue, "DEFAULT_IMAGES_RESULT_FIELD" shouldn't be a default value

self.images_result_field = settings.get(
self._key_for_pipe('IMAGES_RESULT_FIELD', cls_name), default_images_result_field
)
self.min_width = settings.getint(
self._key_for_pipe('IMAGES_MIN_WIDTH', cls_name), self.MIN_WIDTH
)
self.min_height = settings.getint(
self._key_for_pipe('IMAGES_MIN_HEIGHT', cls_name), self.MIN_HEIGHT
)
self.thumbs = settings.get(
self._key_for_pipe('IMAGES_THUMBS', cls_name), self.THUMBS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd cut some code and define

resolve = partial(self._key_for_pipe, base_class_name=cls_name)

and then

self.thumbs = settings.get(resolve('IMAGES_THUMBS'), self.THUMBS)

)

@classmethod
def from_settings(cls, settings):
s3store = cls.STORE_SCHEMES['s3']
s3store.AWS_ACCESS_KEY_ID = settings['AWS_ACCESS_KEY_ID']
s3store.AWS_SECRET_ACCESS_KEY = settings['AWS_SECRET_ACCESS_KEY']

store_uri = settings['IMAGES_STORE']
return cls(store_uri, settings=settings)

Expand Down
16 changes: 16 additions & 0 deletions scrapy/pipelines/media.py
Expand Up @@ -27,6 +27,22 @@ def __init__(self, spider):
def __init__(self, download_func=None):
self.download_func = download_func


def _key_for_pipe(self, key, base_class_name):
"""
Allow setting settings for user defined MediaPipelines that inherit from base.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't allow anything by itself; I think this comment belongs either to docs or to the caller code. Also, the sentence is quite hard to read, the example below is much more clear :) Maybe it makes sense to just expand the example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A doctest can be a good way to demonstrate what the function does.


User can define setting key:

MYPIPELINENAME_IMAGE_SETTING_NAME = <some value>

and it will override default settings and class attributes.
"""
class_name = self.__class__.__name__
if class_name == base_class_name:
return key
return "{}_{}".format(class_name.upper(), key)

@classmethod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if _key_for_pipe() belongs to MediaPipeline. It is used only in Files and ImagesPipelines. Since ImagesPipeline inherits from FilesPipeline, maybe FilesPipeline would be a better place to put this function.
But placing it here allows it to be used in other classes inheriting from MediaPipeline...
Not sure about the best option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But placing it here allows it to be used in other classes inheriting from MediaPipelin

yeah this is why I placed it there. I see no harm in keeping it in MediaPipeline, and there is potential benefit in the future so I think it's ok to keep it there.

def from_crawler(cls, crawler):
try:
Expand Down
10 changes: 0 additions & 10 deletions scrapy/settings/default_settings.py
Expand Up @@ -160,9 +160,6 @@
}

FILES_STORE_S3_ACL = 'private'
FILES_EXPIRES = 90
FILES_URLS_FIELD = 'file_urls'
FILES_RESULT_FIELD = 'files'

HTTPCACHE_ENABLED = False
HTTPCACHE_DIR = 'httpcache'
Expand All @@ -179,13 +176,6 @@

HTTPPROXY_AUTH_ENCODING = 'latin-1'

IMAGES_MIN_WIDTH = 0
IMAGES_MIN_HEIGHT = 0
IMAGES_EXPIRES = 90
IMAGES_THUMBS = {}
IMAGES_URLS_FIELD = 'image_urls'
IMAGES_RESULT_FIELD = 'images'

ITEM_PROCESSOR = 'scrapy.pipelines.ItemPipelineManager'

ITEM_PIPELINES = {}
Expand Down
147 changes: 128 additions & 19 deletions tests/test_pipeline_files.py
@@ -1,4 +1,5 @@
import os
import random
import time
import hashlib
import warnings
Expand Down Expand Up @@ -184,32 +185,140 @@ class TestItem(Item):


class FilesPipelineTestCaseCustomSettings(unittest.TestCase):
default_cls_settings = {
"EXPIRES": 90,
"FILES_URLS_FIELD": "file_urls",
"FILES_RESULT_FIELD": "files"
}
file_cls_attr_settings_map = {
("EXPIRES", "FILES_EXPIRES", "expires"),
("FILES_URLS_FIELD", "FILES_URLS_FIELD", "files_urls_field"),
("FILES_RESULT_FIELD", "FILES_RESULT_FIELD", "files_result_field")
}

def setUp(self):
self.tempdir = mkdtemp()
self.pipeline = FilesPipeline(self.tempdir)
self.default_settings = Settings()

def tearDown(self):
rmtree(self.tempdir)

def test_expires(self):
another_pipeline = FilesPipeline.from_settings(Settings({'FILES_STORE': self.tempdir,
'FILES_EXPIRES': 42}))
self.assertEqual(self.pipeline.expires, self.default_settings.getint('FILES_EXPIRES'))
self.assertEqual(another_pipeline.expires, 42)

def test_files_urls_field(self):
another_pipeline = FilesPipeline.from_settings(Settings({'FILES_STORE': self.tempdir,
'FILES_URLS_FIELD': 'funny_field'}))
self.assertEqual(self.pipeline.files_urls_field, self.default_settings.get('FILES_URLS_FIELD'))
self.assertEqual(another_pipeline.files_urls_field, 'funny_field')

def test_files_result_field(self):
another_pipeline = FilesPipeline.from_settings(Settings({'FILES_STORE': self.tempdir,
'FILES_RESULT_FIELD': 'funny_field'}))
self.assertEqual(self.pipeline.files_result_field, self.default_settings.get('FILES_RESULT_FIELD'))
self.assertEqual(another_pipeline.files_result_field, 'funny_field')
def _generate_fake_settings(self, prefix=None):

def random_string():
return "".join([chr(random.randint(97, 123)) for _ in range(10)])

settings = {
"FILES_EXPIRES": random.randint(1, 1000),
"FILES_URLS_FIELD": random_string(),
"FILES_RESULT_FIELD": random_string(),
"FILES_STORE": self.tempdir
}
if not prefix:
return settings

return {prefix.upper() + "_" + k if k != "FILES_STORE" else k: v for k, v in settings.items()}

def _generate_fake_pipeline(self):

class UserDefinedFilePipeline(FilesPipeline):
EXPIRES = 1001
FILES_URLS_FIELD = "alfa"
FILES_RESULT_FIELD = "beta"

return UserDefinedFilePipeline

def test_different_settings_for_different_instances(self):
"""
If there are different instances with different settings they should keep
different settings.
"""
custom_settings = self._generate_fake_settings()
another_pipeline = FilesPipeline.from_settings(Settings(custom_settings))
one_pipeline = FilesPipeline(self.tempdir)
for pipe_attr, settings_attr, pipe_ins_attr in self.file_cls_attr_settings_map:
default_value = self.default_cls_settings[pipe_attr]
self.assertEqual(getattr(one_pipeline, pipe_attr), default_value)
custom_value = custom_settings[settings_attr]
self.assertNotEqual(default_value, custom_value)
self.assertEqual(getattr(another_pipeline, pipe_ins_attr), custom_value)

def test_subclass_attributes_preserved_if_no_settings(self):
"""
If subclasses override class attributes and there are no special settings those values should be kept.
"""
pipe_cls = self._generate_fake_pipeline()
pipe = pipe_cls.from_settings(Settings({"FILES_STORE": self.tempdir}))
for pipe_attr, settings_attr, pipe_ins_attr in self.file_cls_attr_settings_map:
custom_value = getattr(pipe, pipe_ins_attr)
self.assertNotEqual(custom_value, self.default_cls_settings[pipe_attr])
self.assertEqual(getattr(pipe, pipe_ins_attr), getattr(pipe, pipe_attr))

def test_subclass_attrs_preserved_custom_settings(self):
"""
If file settings are defined but they are not defined for subclass class attributes
should be preserved.
"""
pipeline_cls = self._generate_fake_pipeline()
settings = self._generate_fake_settings()
pipeline = pipeline_cls.from_settings(Settings(settings))
for pipe_attr, settings_attr, pipe_ins_attr in self.file_cls_attr_settings_map:
value = getattr(pipeline, pipe_ins_attr)
self.assertNotEqual(value, self.default_cls_settings[pipe_attr])
self.assertEqual(value, getattr(pipeline, pipe_attr))

def test_no_custom_settings_for_subclasses(self):
"""
If there are no settings for subclass and no subclass attributes, pipeline should use
attributes of base class.
"""
class UserDefinedFilesPipeline(FilesPipeline):
pass

user_pipeline = UserDefinedFilesPipeline.from_settings(Settings({"FILES_STORE": self.tempdir}))
for pipe_attr, settings_attr, pipe_ins_attr in self.file_cls_attr_settings_map:
# Values from settings for custom pipeline should be set on pipeline instance.
custom_value = self.default_cls_settings.get(pipe_attr.upper())
self.assertEqual(getattr(user_pipeline, pipe_ins_attr), custom_value)

def test_custom_settings_for_subclasses(self):
"""
If there are custom settings for subclass and NO class attributes, pipeline should use custom
settings.
"""
class UserDefinedFilesPipeline(FilesPipeline):
pass

prefix = UserDefinedFilesPipeline.__name__.upper()
settings = self._generate_fake_settings(prefix=prefix)
user_pipeline = UserDefinedFilesPipeline.from_settings(Settings(settings))
for pipe_attr, settings_attr, pipe_inst_attr in self.file_cls_attr_settings_map:
# Values from settings for custom pipeline should be set on pipeline instance.
custom_value = settings.get(prefix + "_" + settings_attr)
self.assertNotEqual(custom_value, self.default_cls_settings[pipe_attr])
self.assertEqual(getattr(user_pipeline, pipe_inst_attr), custom_value)

def test_custom_settings_and_class_attrs_for_subclasses(self):
"""
If there are custom settings for subclass AND class attributes
setting keys are preferred and override attributes.
"""
pipeline_cls = self._generate_fake_pipeline()
prefix = pipeline_cls.__name__.upper()
settings = self._generate_fake_settings(prefix=prefix)
user_pipeline = pipeline_cls.from_settings(Settings(settings))
for pipe_cls_attr, settings_attr, pipe_inst_attr in self.file_cls_attr_settings_map:
custom_value = settings.get(prefix + "_" + settings_attr)
self.assertNotEqual(custom_value, self.default_cls_settings[pipe_cls_attr])
self.assertEqual(getattr(user_pipeline, pipe_inst_attr), custom_value)

def test_cls_attrs_with_DEFAULT_prefix(self):
class UserDefinedFilesPipeline(FilesPipeline):
DEFAULT_FILES_RESULT_FIELD = "this"
DEFAULT_FILES_URLS_FIELD = "that"

pipeline = UserDefinedFilesPipeline.from_settings(Settings({"FILES_STORE": self.tempdir}))
self.assertEqual(pipeline.files_result_field, "this")
self.assertEqual(pipeline.files_urls_field, "that")


class TestS3FilesStore(unittest.TestCase):
Expand Down