-
Notifications
You must be signed in to change notification settings - Fork 10.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+1] [image_pipeline] bring back uppercase class attributes #1989
[MRG+1] [image_pipeline] bring back uppercase class attributes #1989
Conversation
allow users to have class attributes on image pipelines. This assumes that class attributes are useful if users want to have different pipeline classes inhriting from ImagePipeline.
This is better, yeah. One thing to note is that the setting is still global to all pipelines, so you can't set the setting Perhaps it would be nice to have some tests making these design choices more explicit?
|
good point, if someone sets something in settings this will override all class attributes. So it may be little confusing. Do you think this would be argument for deprecating class attributes at some point and replacing them with setting keys? Perhaps in the future we could support image pipeline settings as dicts? E.g. IMAGE_SETTINGS = {
"MyFirstImagePipeline": {
"IMAGE_RESULT_FIELD": "screenshots",
},
"MySecondIMagePipeline": {
"IMAGE_RESULT_FIELD": "product_images",
}
} |
Not sure. |
About this PR, I believe it's the best that can be done to handle both the ease-of-use of settings and also legacy customization using class attributes, even if it could be confusing that settings take precedence over class attributes (note/warning needed in docs at the very least). Maybe we should promote the use of settings and show with code example how to customize the values at init. About #1989 (comment) , I agree with @eliasdorneles on not using dicts settings for image pipelines. |
ok so I'll add tests for overriding class attributes with settings (as @eliasdorneles suggests) and also perhaps test for keeping uppercase attributes if there's pipeline inheriting from base image pipeline. In docs I'll add following things:
|
and docs about image pipeline settings.
Current coverage is 83.40%
|
updated in 6c67db3 |
So, this allows that we have different settings for different instances of the default ImagesPipeline OR multiple custom ImagesPipelines in a spider. Right? I think we should do it be AND, not OR: it should be possible to have multiple custom ImagesPipelines per spider together with multiple settings across different spiders (running trough CrawlerProcess). |
yeah I think this PR should do this. Multiple settings per pipeline class will be done by class attributes or custom init written by user (that will take values from settings). Multiple settings per different spiders will be possible by adding different custom_settings per spider class. |
Yes, you are right. I forgot to consider custom init written by user. I guess the problem is I think like a dumb end user.... There is nothing in the docs explaining how to write a custom init that will take values from settings. So, as a end user, I would expect something like this would just work: pipelines.py:
settings.py:
my_spider.py:
Ok, docs could explain how to write a custom init for custom ImagesPipelines. But, as a developer, I think it would expose a layer of complexity that the end user doesn't need to know about. Plus, if every custom ImagesPipeline must have something like:
Then, it makes sense to me provide some out of the box feature in that settings for custom ImagesPipelines can be set easily in the settings.py. Just my thoughts. But I guess multiple ImagesPipeline in a single spider and multiple settings per different spiders are not the common case; both together should be very rare. So, no worries. |
I also think that adding this layer of complexity in docs may NOT be necessary. It will require long code sample that will be used by 1-2 % of users. So that would suggest current PR is good as it is (of course if there are no other objections). @redapple @eliasdorneles is it okay if we dont add code sample illustrating how to take settings on pipeline init?
There is no obvious way to provide such feature. One way I was thinking about was using some form of dictionaries in settings see this comment here #1989 (comment), but others suggested that it would add complexity and break our established notions about settings keys. Plus if you have multiple image pipelines you have to create object instance anyway, so you can just as well add your own keys in settings or as class attributes whatever works best for you. It's actually not that complicated so I think most users should manage. |
Here I must ask: how much most users know or should know about Scrapy? Imagine you know absolutely nothing about Scrapy, you just started. How you would know it is possible to use a class attribute in the init? Two options: docs or dig into the code. We have agreed it should not go in the docs. Thinking as a end user of the framework, I do not want dig in the code. Ok,, there is the Settings API. But... All we want is to set a custom value in a custom pipeline for a existing setting parameter. I think it should not be required to know about the Settings API for a simple task like that.
I agree with you 100 percent!
What about something like this? urls_field = settings.get('IMAGES_URLS_FIELD', self.IMAGES_URLS_FIELD)
self.images_urls_field = settings.get('%s_URLS_FIELD' % self.__class__.__name__.upper(), urls_field)
result_field = settings.get('IMAGES_RESULT_FIELD', self.IMAGES_RESULT_FIELD)
self.images_result_field = settings.get('%s_RESULT_FIELD' % self.__class__.__name__.upper(), result_field) It allows users to have: # pipelines.py:
class ImagesPipelineOne(ImagesPipeline):
pass
class ImagesPipelineTwo(ImagesPipeline):
pass
class CustomImagesPipelineWithDefaultValues(ImagesPipeline):
pass
# settings.py:
IMAGESPIPELINEONE_URLS_FIELD = 'image_urls_one'
IMAGESPIPELINEONE_RESULT_FIELD = 'images_one'
IMAGESPIPELINETWO_URLS_FIELD = 'image_urls_two'
IMAGESPIPELINETWO_RESULT_FIELD = 'images_two'
# spider.py:
class DummySpider(scrapy.Spider):
name = "dummy"
allowed_domains = ["example.com"]
start_urls = ['http://www.example.com/']
def parse(self, response):
yield {'image_urls_one': ['http://scrapy.org/img/48-zimigo-logo.png'],
'image_urls_two': ['http://scrapy.org/img/shub-logo.png'],
'image_urls': ['http://scrapy.org/img/08-lyst-logo.png']}
# output:
# 2016-05-31 17:37:06 [dummy] DEBUG: Scraped from <200 http://www.example.com/>
{'image_urls': ['http://scrapy.org/img/08-lyst-logo.png'],
'image_urls_one': ['http://scrapy.org/img/48-zimigo-logo.png'],
'image_urls_two': ['http://scrapy.org/img/shub-logo.png'],
'images_two': [{'url': 'http://scrapy.org/img/shub-logo.png', 'path': '0d1de4a9ff01fd6c452460257ce36ee91120f9ab.jpg', 'checksum': '62eed3305ecc8e88e8b40f85e4f7b2c2'}],
'images_one': [{'url': 'http://scrapy.org/img/48-zimigo-logo.png', 'path': '7b67aa2be22751981093e2b28298e8fdf5716866.jpg', 'checksum': '80c1f6fb4a2807c55f57668946e3c645'}],
'images': [{'url': 'http://scrapy.org/img/08-lyst-logo.png', 'path': '06a6e5869eb2d9006f3e4b48422a1f9e26f954a9.jpg', 'checksum': 'f2ea5d62afaa040afc7fd8c6e2ed4610'}]} |
nice idea! will try this out and add here. |
allow to have image settings with class name, so that settings for user defined ImagePipeline subclasses can be defined easily.
@redapple @eliasdorneles @djunzu updated a62d4b0 |
.. note:: If you have multiple image pipelines inheriting from ImagePipeline and you want to have different settings in different pipelines | ||
you can set setting keys preceded with uppercase name of your pipeline class. E.g. if your pipeline is called | ||
MyPipeline and you want to have custom IMAGES_URLS_FIELD you define setting MYPIPELINE_IMAGES_URLS_FIELD. | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be valid to add something like "Otherwise the custom pipeline will inherit settings values from default pipeline."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this follows from context, but probably wont harm to add this.
In I think FilesPipeline should also get its And I think you can add one more test asserting a custom Files/ImagesPipeline inherit custom settings from parent. (So, if someone's code relies on this behavior, future updates will not break it.) |
there was identical test for different setting keys. I unified it into one unit test. Fixes comments for tests, adds comments about intention of uppercase attrs. Adds another test for user defined setting keys and uppercase attrs.
unify tests that test same thing for different attribute values into one. Add better docstrings for tests.
test case when subclass inherits from base class and has no attributes nor settings defined.
if test tests same thing but for different field it can be unified into one.
dont override class attributes with default settings (same as in image pipeline).
…classes * move key_for_pipe function to media pipeline so that file pipeline can use it * use key_for_pipe in file pipeline so that users can define custom settings for subclasses easily * add tests for file pipelines attributes and settings
for subclasses.
STORE_SCHEMES = { | ||
'': FSFilesStore, | ||
'file': FSFilesStore, | ||
's3': S3FilesStore, | ||
} | ||
DEFAULT_FILES_URLS_FIELD = 'file_urls' | ||
DEFAULT_FILES_RESULT_FIELD = 'files' | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not fix the backward incompatibility (#1985). It must be FILES_URLS_FIELD
and FILES_RESULT_FIELD
, as it was before #1891. Look at this line: DEFAULT_*
is used just to populate the other one.
Here arises one question: should both (e.g. DEFAULT_FILES_URLS_FIELD
and FILES_URLS_FIELD
) be present now? Because one could have a custom Files/ImagesPipeline using one or another.
I think both should be present in order to totally fix the backward incompatibility:
DEFAULT_FILES_URLS_FIELD = 'file_urls'
FILES_URLS_FIELD = DEFAULT_FILES_URLS_FIELD
DEFAULT_FILES_RESULT_FIELD = 'files'
FILES_RESULT_FIELD = DEFAULT_FILES_RESULT_FIELD = 'files'
Note: same in ImagePipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some class attributes for ImagePipeline and FilesPipeline had DEFAULT prefix. These attributes should be preserved as well, if users subclasses define values for DEFAULT_<CLS_ATTRIBUTE_NAME> attribute this value should be preserved.
c5ca38c
to
fa4d0cd
Compare
so at this point it is probably ready. Are there any other comments or feedback I should include @redapple |
@kmike , @eliasdorneles , |
If you have multiple image pipelines inheriting from ImagePipeline and you want to have different settings in different pipelines | ||
you can set setting keys preceded with uppercase name of your pipeline class. E.g. if your pipeline is called | ||
MyPipeline and you want to have custom IMAGES_URLS_FIELD you define setting MYPIPELINE_IMAGES_URLS_FIELD and your custom | ||
settings will be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please use 80 columns limit?
* 80 characters line limit * shortening some code * removed dead code * add doctest for _key_for_pipe function
def _key_for_pipe(self, key, base_class_name=None): | ||
""" | ||
>>> result = MediaPipeline()._key_for_pipe("IMAGES") | ||
>>> assert result == "IMAGES" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert is usually not used in doctests; an easier way to write this:
>>> MediaPipeline()._key_for_pipe("IMAGES")
'IMAGES'
...
>>> MyPipe()._key_for_pipe("IMAGES", base_class_name="MediaPipeline")
'MYPIPE_IMAGES'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, I need to write more doctests. fixed in 9818c97
LGTM! |
if not hasattr(self, "IMAGES_URLS_FIELD"): | ||
self.IMAGES_URLS_FIELD = self.DEFAULT_IMAGES_URLS_FIELD | ||
|
||
resolve = functools.partial(self._key_for_pipe, base_class_name=cls_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you move resolve function upper you can use it for self.expires and drop cls_name variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point ceecf3b
dc0167c
to
ceecf3b
Compare
[backport][1.1] [image_pipeline] bring back uppercase class attributes (PR #1989)
This assumes that users may want to have multiple pipeline classes inheriting from Image pipeline. They want to have option to set settings per pipeline instances as in #1891 & #1850 but also per multiple pipeline classes (so having project wide settings key doesn't work for them).
It does not deprecate class attributes in favor of settings but uses class attributes as default values for settings. If there is nothing in user settings, pipeline will use uppercase attribute.
Aside from more flexibility this also gives full backward compatibility with whatever users were doing with uppercase attributes in their pipelines subclassing ImagePipeline.
Fixes: #1985, for alternative approach and more motivation see #1986