[MRG+1] [image_pipeline] bring back uppercase class attributes #1989
Conversation
allow users to have class attributes on image pipelines. This assumes that class attributes are useful if users want to have different pipeline classes inhriting from ImagePipeline.
This is better, yeah. One thing to note is that the setting is still global to all pipelines, so you can't set the setting Perhaps it would be nice to have some tests making these design choices more explicit?
|
good point, if someone sets something in settings this will override all class attributes. So it may be little confusing. Do you think this would be argument for deprecating class attributes at some point and replacing them with setting keys? Perhaps in the future we could support image pipeline settings as dicts? E.g. IMAGE_SETTINGS = {
"MyFirstImagePipeline": {
"IMAGE_RESULT_FIELD": "screenshots",
},
"MySecondIMagePipeline": {
"IMAGE_RESULT_FIELD": "product_images",
}
} |
Not sure. |
About this PR, I believe it's the best that can be done to handle both the ease-of-use of settings and also legacy customization using class attributes, even if it could be confusing that settings take precedence over class attributes (note/warning needed in docs at the very least). Maybe we should promote the use of settings and show with code example how to customize the values at init. About #1989 (comment) , I agree with @eliasdorneles on not using dicts settings for image pipelines. |
ok so I'll add tests for overriding class attributes with settings (as @eliasdorneles suggests) and also perhaps test for keeping uppercase attributes if there's pipeline inheriting from base image pipeline. In docs I'll add following things:
|
and docs about image pipeline settings.
Current coverage is 83.40%
|
updated in 6c67db3 |
So, this allows that we have different settings for different instances of the default ImagesPipeline OR multiple custom ImagesPipelines in a spider. Right? I think we should do it be AND, not OR: it should be possible to have multiple custom ImagesPipelines per spider together with multiple settings across different spiders (running trough CrawlerProcess). |
yeah I think this PR should do this. Multiple settings per pipeline class will be done by class attributes or custom init written by user (that will take values from settings). Multiple settings per different spiders will be possible by adding different custom_settings per spider class. |
Yes, you are right. I forgot to consider custom init written by user. I guess the problem is I think like a dumb end user.... There is nothing in the docs explaining how to write a custom init that will take values from settings. So, as a end user, I would expect something like this would just work: pipelines.py:
settings.py:
my_spider.py:
Ok, docs could explain how to write a custom init for custom ImagesPipelines. But, as a developer, I think it would expose a layer of complexity that the end user doesn't need to know about. Plus, if every custom ImagesPipeline must have something like:
Then, it makes sense to me provide some out of the box feature in that settings for custom ImagesPipelines can be set easily in the settings.py. Just my thoughts. But I guess multiple ImagesPipeline in a single spider and multiple settings per different spiders are not the common case; both together should be very rare. So, no worries. |
I also think that adding this layer of complexity in docs may NOT be necessary. It will require long code sample that will be used by 1-2 % of users. So that would suggest current PR is good as it is (of course if there are no other objections). @redapple @eliasdorneles is it okay if we dont add code sample illustrating how to take settings on pipeline init?
There is no obvious way to provide such feature. One way I was thinking about was using some form of dictionaries in settings see this comment here #1989 (comment), but others suggested that it would add complexity and break our established notions about settings keys. Plus if you have multiple image pipelines you have to create object instance anyway, so you can just as well add your own keys in settings or as class attributes whatever works best for you. It's actually not that complicated so I think most users should manage. |
Here I must ask: how much most users know or should know about Scrapy? Imagine you know absolutely nothing about Scrapy, you just started. How you would know it is possible to use a class attribute in the init? Two options: docs or dig into the code. We have agreed it should not go in the docs. Thinking as a end user of the framework, I do not want dig in the code. Ok,, there is the Settings API. But... All we want is to set a custom value in a custom pipeline for a existing setting parameter. I think it should not be required to know about the Settings API for a simple task like that.
I agree with you 100 percent!
What about something like this? urls_field = settings.get('IMAGES_URLS_FIELD', self.IMAGES_URLS_FIELD)
self.images_urls_field = settings.get('%s_URLS_FIELD' % self.__class__.__name__.upper(), urls_field)
result_field = settings.get('IMAGES_RESULT_FIELD', self.IMAGES_RESULT_FIELD)
self.images_result_field = settings.get('%s_RESULT_FIELD' % self.__class__.__name__.upper(), result_field) It allows users to have: # pipelines.py:
class ImagesPipelineOne(ImagesPipeline):
pass
class ImagesPipelineTwo(ImagesPipeline):
pass
class CustomImagesPipelineWithDefaultValues(ImagesPipeline):
pass
# settings.py:
IMAGESPIPELINEONE_URLS_FIELD = 'image_urls_one'
IMAGESPIPELINEONE_RESULT_FIELD = 'images_one'
IMAGESPIPELINETWO_URLS_FIELD = 'image_urls_two'
IMAGESPIPELINETWO_RESULT_FIELD = 'images_two'
# spider.py:
class DummySpider(scrapy.Spider):
name = "dummy"
allowed_domains = ["example.com"]
start_urls = ['http://www.example.com/']
def parse(self, response):
yield {'image_urls_one': ['http://scrapy.org/img/48-zimigo-logo.png'],
'image_urls_two': ['http://scrapy.org/img/shub-logo.png'],
'image_urls': ['http://scrapy.org/img/08-lyst-logo.png']}
# output:
# 2016-05-31 17:37:06 [dummy] DEBUG: Scraped from <200 http://www.example.com/>
{'image_urls': ['http://scrapy.org/img/08-lyst-logo.png'],
'image_urls_one': ['http://scrapy.org/img/48-zimigo-logo.png'],
'image_urls_two': ['http://scrapy.org/img/shub-logo.png'],
'images_two': [{'url': 'http://scrapy.org/img/shub-logo.png', 'path': '0d1de4a9ff01fd6c452460257ce36ee91120f9ab.jpg', 'checksum': '62eed3305ecc8e88e8b40f85e4f7b2c2'}],
'images_one': [{'url': 'http://scrapy.org/img/48-zimigo-logo.png', 'path': '7b67aa2be22751981093e2b28298e8fdf5716866.jpg', 'checksum': '80c1f6fb4a2807c55f57668946e3c645'}],
'images': [{'url': 'http://scrapy.org/img/08-lyst-logo.png', 'path': '06a6e5869eb2d9006f3e4b48422a1f9e26f954a9.jpg', 'checksum': 'f2ea5d62afaa040afc7fd8c6e2ed4610'}]} |
nice idea! will try this out and add here. |
allow to have image settings with class name, so that settings for user defined ImagePipeline subclasses can be defined easily.
@redapple @eliasdorneles @djunzu updated a62d4b0 |
.. note:: If you have multiple image pipelines inheriting from ImagePipeline and you want to have different settings in different pipelines | ||
you can set setting keys preceded with uppercase name of your pipeline class. E.g. if your pipeline is called | ||
MyPipeline and you want to have custom IMAGES_URLS_FIELD you define setting MYPIPELINE_IMAGES_URLS_FIELD. | ||
|
||
|
djunzu
Jun 10, 2016
•
Contributor
Maybe it would be valid to add something like "Otherwise the custom pipeline will inherit settings values from default pipeline."
Maybe it would be valid to add something like "Otherwise the custom pipeline will inherit settings values from default pipeline."
pawelmhm
Jun 10, 2016
Author
Contributor
I think this follows from context, but probably wont harm to add this.
I think this follows from context, but probably wont harm to add this.
|
||
# If image settings are defined they override class attributes. | ||
pipeline = UserDefinedImagePipeline.from_settings(Settings(settings)) | ||
self.assertEqual(pipeline.min_width, 1000) |
djunzu
Jun 10, 2016
Contributor
The above comment doesn't make sense to me. There is a image setting defined and it is not overriding the class attribute.
The above comment doesn't make sense to me. There is a image setting defined and it is not overriding the class attribute.
pawelmhm
Jun 10, 2016
Author
Contributor
good catch. Comment is outdated, it was relevant in previous version before this new update I pushed today. I'll remove it
good catch. Comment is outdated, it was relevant in previous version before this new update I pushed today. I'll remove it
In I think FilesPipeline should also get its And I think you can add one more test asserting a custom Files/ImagesPipeline inherit custom settings from parent. (So, if someone's code relies on this behavior, future updates will not break it.) |
there was identical test for different setting keys. I unified it into one unit test. Fixes comments for tests, adds comments about intention of uppercase attrs. Adds another test for user defined setting keys and uppercase attrs.
unify tests that test same thing for different attribute values into one. Add better docstrings for tests.
test case when subclass inherits from base class and has no attributes nor settings defined.
if test tests same thing but for different field it can be unified into one.
dont override class attributes with default settings (same as in image pipeline).
…classes * move key_for_pipe function to media pipeline so that file pipeline can use it * use key_for_pipe in file pipeline so that users can define custom settings for subclasses easily * add tests for file pipelines attributes and settings
for subclasses.
if class_name == base_class_name: | ||
return key | ||
return "{}_{}".format(class_name.upper(), key) | ||
|
||
@classmethod |
djunzu
Jun 20, 2016
Contributor
I am not sure if _key_for_pipe()
belongs to MediaPipeline
. It is used only in Files
and ImagesPipelines
. Since ImagesPipeline
inherits from FilesPipeline
, maybe FilesPipeline
would be a better place to put this function.
But placing it here allows it to be used in other classes inheriting from MediaPipeline
...
Not sure about the best option.
I am not sure if _key_for_pipe()
belongs to MediaPipeline
. It is used only in Files
and ImagesPipelines
. Since ImagesPipeline
inherits from FilesPipeline
, maybe FilesPipeline
would be a better place to put this function.
But placing it here allows it to be used in other classes inheriting from MediaPipeline
...
Not sure about the best option.
pawelmhm
Jun 20, 2016
Author
Contributor
But placing it here allows it to be used in other classes inheriting from MediaPipelin
yeah this is why I placed it there. I see no harm in keeping it in MediaPipeline, and there is potential benefit in the future so I think it's ok to keep it there.
But placing it here allows it to be used in other classes inheriting from MediaPipelin
yeah this is why I placed it there. I see no harm in keeping it in MediaPipeline, and there is potential benefit in the future so I think it's ok to keep it there.
some class attributes for ImagePipeline and FilesPipeline had DEFAULT prefix. These attributes should be preserved as well, if users subclasses define values for DEFAULT_<CLS_ATTRIBUTE_NAME> attribute this value should be preserved.
c5ca38c
to
fa4d0cd
so at this point it is probably ready. Are there any other comments or feedback I should include @redapple |
@kmike , @eliasdorneles , |
If you have multiple image pipelines inheriting from ImagePipeline and you want to have different settings in different pipelines | ||
you can set setting keys preceded with uppercase name of your pipeline class. E.g. if your pipeline is called | ||
MyPipeline and you want to have custom IMAGES_URLS_FIELD you define setting MYPIPELINE_IMAGES_URLS_FIELD and your custom | ||
settings will be used. |
kmike
Jul 9, 2016
Member
Could you please use 80 columns limit?
Could you please use 80 columns limit?
if not hasattr(self, "IMAGES_URLS_FIELD"): | ||
self.IMAGES_URLS_FIELD = self.DEFAULT_IMAGES_URLS_FIELD | ||
|
||
default_images_urls_field = getattr(self, "IMAGES_URLS_FIELD", "DEFAULT_IMAGES_URLS_FIELD") |
kmike
Jul 9, 2016
Member
default_images_urls_field
will be "DEFAULT_IMAGES_URLS_FIELD", not "image_urls" if "IMAGES_URLS_FIELD" attribute is missing
default_images_urls_field
will be "DEFAULT_IMAGES_URLS_FIELD", not "image_urls" if "IMAGES_URLS_FIELD" attribute is missing
pawelmhm
Jul 12, 2016
Author
Contributor
good point. This would never happen because this attribute is set one line above. It was dead code. I removed that.
good point. This would never happen because this attribute is set one line above. It was dead code. I removed that.
self.images_urls_field = settings.get( | ||
self._key_for_pipe('IMAGES_URLS_FIELD', cls_name), default_images_urls_field | ||
) | ||
default_images_result_field = getattr(self, "IMAGES_RESULT_FIELD", "DEFAULT_IMAGES_RESULT_FIELD") |
kmike
Jul 9, 2016
Member
the same issue, "DEFAULT_IMAGES_RESULT_FIELD" shouldn't be a default value
the same issue, "DEFAULT_IMAGES_RESULT_FIELD" shouldn't be a default value
self._key_for_pipe('IMAGES_MIN_HEIGHT', cls_name), self.MIN_HEIGHT | ||
) | ||
self.thumbs = settings.get( | ||
self._key_for_pipe('IMAGES_THUMBS', cls_name), self.THUMBS |
kmike
Jul 9, 2016
Member
I'd cut some code and define
resolve = partial(self._key_for_pipe, base_class_name=cls_name)
and then
self.thumbs = settings.get(resolve('IMAGES_THUMBS'), self.THUMBS)
I'd cut some code and define
resolve = partial(self._key_for_pipe, base_class_name=cls_name)
and then
self.thumbs = settings.get(resolve('IMAGES_THUMBS'), self.THUMBS)
|
||
def _key_for_pipe(self, key, base_class_name): | ||
""" | ||
Allow setting settings for user defined MediaPipelines that inherit from base. |
kmike
Jul 9, 2016
Member
This function doesn't allow anything by itself; I think this comment belongs either to docs or to the caller code. Also, the sentence is quite hard to read, the example below is much more clear :) Maybe it makes sense to just expand the example.
This function doesn't allow anything by itself; I think this comment belongs either to docs or to the caller code. Also, the sentence is quite hard to read, the example below is much more clear :) Maybe it makes sense to just expand the example.
kmike
Jul 9, 2016
Member
A doctest can be a good way to demonstrate what the function does.
A doctest can be a good way to demonstrate what the function does.
* 80 characters line limit * shortening some code * removed dead code * add doctest for _key_for_pipe function
def _key_for_pipe(self, key, base_class_name=None): | ||
""" | ||
>>> result = MediaPipeline()._key_for_pipe("IMAGES") | ||
>>> assert result == "IMAGES" |
kmike
Jul 12, 2016
•
Member
assert is usually not used in doctests; an easier way to write this:
>>> MediaPipeline()._key_for_pipe("IMAGES")
'IMAGES'
...
>>> MyPipe()._key_for_pipe("IMAGES", base_class_name="MediaPipeline")
'MYPIPE_IMAGES'
assert is usually not used in doctests; an easier way to write this:
>>> MediaPipeline()._key_for_pipe("IMAGES")
'IMAGES'
...
>>> MyPipe()._key_for_pipe("IMAGES", base_class_name="MediaPipeline")
'MYPIPE_IMAGES'
LGTM! |
if not hasattr(self, "IMAGES_URLS_FIELD"): | ||
self.IMAGES_URLS_FIELD = self.DEFAULT_IMAGES_URLS_FIELD | ||
|
||
resolve = functools.partial(self._key_for_pipe, base_class_name=cls_name) |
kmike
Jul 12, 2016
Member
if you move resolve function upper you can use it for self.expires and drop cls_name variable
if you move resolve function upper you can use it for self.expires and drop cls_name variable
dc0167c
to
ceecf3b
[backport][1.1] [image_pipeline] bring back uppercase class attributes (PR #1989)
This assumes that users may want to have multiple pipeline classes inheriting from Image pipeline. They want to have option to set settings per pipeline instances as in #1891 & #1850 but also per multiple pipeline classes (so having project wide settings key doesn't work for them).
It does not deprecate class attributes in favor of settings but uses class attributes as default values for settings. If there is nothing in user settings, pipeline will use uppercase attribute.
Aside from more flexibility this also gives full backward compatibility with whatever users were doing with uppercase attributes in their pipelines subclassing ImagePipeline.
Fixes: #1985, for alternative approach and more motivation see #1986