Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] [image_pipeline] bring back uppercase class attributes #1989

Merged
merged 14 commits into from Jul 13, 2016

Conversation

@pawelmhm
Copy link
Contributor

@pawelmhm pawelmhm commented May 13, 2016

This assumes that users may want to have multiple pipeline classes inheriting from Image pipeline. They want to have option to set settings per pipeline instances as in #1891 & #1850 but also per multiple pipeline classes (so having project wide settings key doesn't work for them).

It does not deprecate class attributes in favor of settings but uses class attributes as default values for settings. If there is nothing in user settings, pipeline will use uppercase attribute.

Aside from more flexibility this also gives full backward compatibility with whatever users were doing with uppercase attributes in their pipelines subclassing ImagePipeline.

Fixes: #1985, for alternative approach and more motivation see #1986

allow users to have class attributes on image pipelines. This assumes
that class attributes are useful if users want to have different pipeline
classes inhriting from ImagePipeline.
@pawelmhm pawelmhm changed the title [image_pipeline] bring back uppercase pipeline attributes [image_pipeline] bring back uppercase class attributes May 13, 2016
@eliasdorneles
Copy link
Member

@eliasdorneles eliasdorneles commented May 13, 2016

This is better, yeah.

One thing to note is that the setting is still global to all pipelines, so you can't set the setting IMAGES_RESULT_FIELD if you want these class atributes to be respected.

Perhaps it would be nice to have some tests making these design choices more explicit?

    def test_derived_class_can_customize_field(self):
        class MyCustomImagesPipeline(ImagesPipeline):
            IMAGES_RESULT_FIELD = "screenshot"
        pipeline = MyCustomImagesPipeline(
            self.tempdir, download_func=_mocked_download_func,
            settings={})
        self.assertEqual("screenshot", pipeline.images_result_field)

    def test_always_respect_global_setting(self):
        class MyCustomImagesPipeline(ImagesPipeline):
            IMAGES_RESULT_FIELD = "screenshot"
        pipeline = MyCustomImagesPipeline(
            self.tempdir, download_func=_mocked_download_func,
            settings={"IMAGES_RESULT_FIELD": "from_setting"})

        # XXX: is this really desirable?
        self.assertEqual("from_setting", pipeline.images_result_field)
@pawelmhm
Copy link
Contributor Author

@pawelmhm pawelmhm commented May 16, 2016

One thing to note is that the setting is still global to all pipelines, so you can't set the setting IMAGES_RESULT_FIELD if you want these class atributes to be respected.

good point, if someone sets something in settings this will override all class attributes. So it may be little confusing.

Do you think this would be argument for deprecating class attributes at some point and replacing them with setting keys? Perhaps in the future we could support image pipeline settings as dicts? E.g.

IMAGE_SETTINGS = {
    "MyFirstImagePipeline": {
           "IMAGE_RESULT_FIELD": "screenshots",
     },
    "MySecondIMagePipeline": {
           "IMAGE_RESULT_FIELD": "product_images",
    }
}
@eliasdorneles
Copy link
Member

@eliasdorneles eliasdorneles commented May 16, 2016

Not sure.
I think if I were implementing the custom pipelines, I'd introduce a new setting for each.

@redapple
Copy link
Contributor

@redapple redapple commented May 17, 2016

About this PR, I believe it's the best that can be done to handle both the ease-of-use of settings and also legacy customization using class attributes, even if it could be confusing that settings take precedence over class attributes (note/warning needed in docs at the very least).

Maybe we should promote the use of settings and show with code example how to customize the values at init.

About #1989 (comment) , I agree with @eliasdorneles on not using dicts settings for image pipelines.
I see settings as global knobs, effective for all subclasses, except if explicitly changed in the subclass contructor in some way (via a custom setting for example) by the developer.

@pawelmhm
Copy link
Contributor Author

@pawelmhm pawelmhm commented May 17, 2016

ok so I'll add tests for overriding class attributes with settings (as @eliasdorneles suggests) and also perhaps test for keeping uppercase attributes if there's pipeline inheriting from base image pipeline.

In docs I'll add following things:

  • warning that values from settings will override class attributes
  • code sample showing how to take values from settings in init
and docs about image pipeline settings.
@codecov-io
Copy link

@codecov-io codecov-io commented May 18, 2016

Current coverage is 83.40%

Merging #1989 into master will increase coverage by 0.03%

Powered by Codecov. Last updated by 2dd1a9e...ceecf3b

@pawelmhm
Copy link
Contributor Author

@pawelmhm pawelmhm commented May 18, 2016

updated in 6c67db3

@djunzu
Copy link
Contributor

@djunzu djunzu commented May 19, 2016

So, this allows that we have different settings for different instances of the default ImagesPipeline OR multiple custom ImagesPipelines in a spider. Right?

I think we should do it be AND, not OR: it should be possible to have multiple custom ImagesPipelines per spider together with multiple settings across different spiders (running trough CrawlerProcess).

@pawelmhm
Copy link
Contributor Author

@pawelmhm pawelmhm commented May 20, 2016

I think we should do it be AND, not OR: it should be possible to have multiple custom ImagesPipelines per spider together with multiple settings across different spiders (running trough CrawlerProcess).

yeah I think this PR should do this. Multiple settings per pipeline class will be done by class attributes or custom init written by user (that will take values from settings). Multiple settings per different spiders will be possible by adding different custom_settings per spider class.

@djunzu
Copy link
Contributor

@djunzu djunzu commented May 20, 2016

yeah I think this PR should do this. Multiple settings per pipeline class will be done by class attributes or custom init written by user (that will take values from settings).

Yes, you are right. I forgot to consider custom init written by user.

I guess the problem is I think like a dumb end user.... There is nothing in the docs explaining how to write a custom init that will take values from settings. So, as a end user, I would expect something like this would just work:

pipelines.py:

class MyPipe(ImagesPipeline):
  def item_completed(self, results, item, info):
    item = super(MyPipe, self).item_completed(results, item, info)
    print("doing some stuff with item here")
    return item

class MyPipeTwo(ImagesPipeline):
  def item_completed(self, results, item, info):
    item = super(MyPipeTwo, self).item_completed(results, item, info)
    print("doing some different stuff with item here")
    return item

settings.py:

MYPIPE_EXPIRES = 10
MYPIPE_RESULT_FIELD = "my_field"
MYPIPETWO_EXPIRES = 20
MYPIPETWO_RESULT_FIELD = "another_field"

my_spider.py:

class MySpider1(CrawlSpider):
  ...
    item["my_field"] = links
    item["another_field"] = more_links
  ...

class MySpider2(CrawlSpider):
  custom_settings = {"MYPIPETWO_EXPIRES": 36500}
  ...
    item["my_field"] = links
    item["another_field"] = more_links
  ...

Ok, docs could explain how to write a custom init for custom ImagesPipelines. But, as a developer, I think it would expose a layer of complexity that the end user doesn't need to know about. Plus, if every custom ImagesPipeline must have something like:

 def __init__(self, store_uri, download_func=None, settings=None):
     self.files_urls_field = settings.get('CUSTOM_SETTING_NAME_FOR_URLS_FIELD', self.DEFAULT_FILES_URLS_FIELD)
     self.expires = settings.getint('CUSTOM_SETTING_NAME_FOR_EXPIRES', self.EXPIRES)

Then, it makes sense to me provide some out of the box feature in that settings for custom ImagesPipelines can be set easily in the settings.py.

Just my thoughts.

But I guess multiple ImagesPipeline in a single spider and multiple settings per different spiders are not the common case; both together should be very rare. So, no worries.

@pawelmhm
Copy link
Contributor Author

@pawelmhm pawelmhm commented May 30, 2016

@djunzu

docs could explain how to write a custom init for custom ImagesPipelines. But, as a developer, I think it would expose a layer of complexity that the end user doesn't need to know about.

I also think that adding this layer of complexity in docs may NOT be necessary. It will require long code sample that will be used by 1-2 % of users. So that would suggest current PR is good as it is (of course if there are no other objections). @redapple @eliasdorneles is it okay if we dont add code sample illustrating how to take settings on pipeline init?

@djunzu

Then, it makes sense to me provide some out of the box feature in that settings for custom ImagesPipelines can be set easily in the settings.py.

There is no obvious way to provide such feature. One way I was thinking about was using some form of dictionaries in settings see this comment here #1989 (comment), but others suggested that it would add complexity and break our established notions about settings keys. Plus if you have multiple image pipelines you have to create object instance anyway, so you can just as well add your own keys in settings or as class attributes whatever works best for you. It's actually not that complicated so I think most users should manage.

@djunzu
Copy link
Contributor

@djunzu djunzu commented May 31, 2016

@pawelmhm

It's actually not that complicated so I think most users should manage.

Here I must ask: how much most users know or should know about Scrapy?

Imagine you know absolutely nothing about Scrapy, you just started. How you would know it is possible to use a class attribute in the init? Two options: docs or dig into the code. We have agreed it should not go in the docs. Thinking as a end user of the framework, I do not want dig in the code.

Ok,, there is the Settings API. But... All we want is to set a custom value in a custom pipeline for a existing setting parameter. I think it should not be required to know about the Settings API for a simple task like that.


So that would suggest current PR is good as it is

I agree with you 100 percent!


There is no obvious way to provide such feature.

What about something like this?

urls_field = settings.get('IMAGES_URLS_FIELD', self.IMAGES_URLS_FIELD)
self.images_urls_field = settings.get('%s_URLS_FIELD' % self.__class__.__name__.upper(), urls_field)

result_field = settings.get('IMAGES_RESULT_FIELD', self.IMAGES_RESULT_FIELD)
self.images_result_field = settings.get('%s_RESULT_FIELD' % self.__class__.__name__.upper(), result_field)

It allows users to have:

# pipelines.py:
class ImagesPipelineOne(ImagesPipeline):
    pass
class ImagesPipelineTwo(ImagesPipeline):
    pass
class CustomImagesPipelineWithDefaultValues(ImagesPipeline):
    pass

# settings.py:
IMAGESPIPELINEONE_URLS_FIELD = 'image_urls_one'
IMAGESPIPELINEONE_RESULT_FIELD = 'images_one'
IMAGESPIPELINETWO_URLS_FIELD = 'image_urls_two'
IMAGESPIPELINETWO_RESULT_FIELD = 'images_two'

# spider.py:
class DummySpider(scrapy.Spider):
    name = "dummy"
    allowed_domains = ["example.com"]
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        yield {'image_urls_one': ['http://scrapy.org/img/48-zimigo-logo.png'],
                  'image_urls_two': ['http://scrapy.org/img/shub-logo.png'],
                  'image_urls': ['http://scrapy.org/img/08-lyst-logo.png']}

# output:
# 2016-05-31 17:37:06 [dummy] DEBUG: Scraped from <200 http://www.example.com/>
{'image_urls': ['http://scrapy.org/img/08-lyst-logo.png'],
'image_urls_one': ['http://scrapy.org/img/48-zimigo-logo.png'],
'image_urls_two': ['http://scrapy.org/img/shub-logo.png'],
'images_two': [{'url': 'http://scrapy.org/img/shub-logo.png', 'path': '0d1de4a9ff01fd6c452460257ce36ee91120f9ab.jpg', 'checksum': '62eed3305ecc8e88e8b40f85e4f7b2c2'}],
'images_one': [{'url': 'http://scrapy.org/img/48-zimigo-logo.png', 'path': '7b67aa2be22751981093e2b28298e8fdf5716866.jpg', 'checksum': '80c1f6fb4a2807c55f57668946e3c645'}],
'images': [{'url': 'http://scrapy.org/img/08-lyst-logo.png', 'path': '06a6e5869eb2d9006f3e4b48422a1f9e26f954a9.jpg', 'checksum': 'f2ea5d62afaa040afc7fd8c6e2ed4610'}]}
@pawelmhm
Copy link
Contributor Author

@pawelmhm pawelmhm commented Jun 1, 2016

What about something like this?

nice idea! will try this out and add here.

allow to have image settings with class name, so that settings for user defined ImagePipeline
subclasses can be defined easily.
@pawelmhm
Copy link
Contributor Author

@pawelmhm pawelmhm commented Jun 10, 2016

.. note:: If you have multiple image pipelines inheriting from ImagePipeline and you want to have different settings in different pipelines
you can set setting keys preceded with uppercase name of your pipeline class. E.g. if your pipeline is called
MyPipeline and you want to have custom IMAGES_URLS_FIELD you define setting MYPIPELINE_IMAGES_URLS_FIELD.


This comment has been minimized.

@djunzu

djunzu Jun 10, 2016
Contributor

Maybe it would be valid to add something like "Otherwise the custom pipeline will inherit settings values from default pipeline."

This comment has been minimized.

@pawelmhm

pawelmhm Jun 10, 2016
Author Contributor

I think this follows from context, but probably wont harm to add this.


# If image settings are defined they override class attributes.
pipeline = UserDefinedImagePipeline.from_settings(Settings(settings))
self.assertEqual(pipeline.min_width, 1000)

This comment has been minimized.

@djunzu

djunzu Jun 10, 2016
Contributor

The above comment doesn't make sense to me. There is a image setting defined and it is not overriding the class attribute.

This comment has been minimized.

@pawelmhm

pawelmhm Jun 10, 2016
Author Contributor

good catch. Comment is outdated, it was relevant in previous version before this new update I pushed today. I'll remove it

@djunzu
Copy link
Contributor

@djunzu djunzu commented Jun 10, 2016

@pawelmhm

In default_settings.py, there are still three values for FilesPipeline that will override the respective class attributes.

I think FilesPipeline should also get its key_for_pipe and some tests.

And I think you can add one more test asserting a custom Files/ImagesPipeline inherit custom settings from parent. (So, if someone's code relies on this behavior, future updates will not break it.)

pawelmhm added 7 commits Jun 14, 2016
there was identical test for different setting keys. I unified it into
one unit test.

Fixes comments for tests, adds comments about intention of uppercase attrs.

Adds another test for user defined setting keys and uppercase attrs.
unify tests that test same thing for different attribute values into one. Add
better docstrings for tests.
test case when subclass inherits from base class and has no attributes nor
settings defined.
if test tests same thing but for different field it can be unified into one.
dont override class attributes with default settings (same as in image pipeline).
…classes

* move key_for_pipe function to media pipeline so that file pipeline can use it
* use key_for_pipe in file pipeline so that users can define custom settings for subclasses easily
* add tests for file pipelines attributes and settings
for subclasses.
if class_name == base_class_name:
return key
return "{}_{}".format(class_name.upper(), key)

@classmethod

This comment has been minimized.

@djunzu

djunzu Jun 20, 2016
Contributor

I am not sure if _key_for_pipe() belongs to MediaPipeline. It is used only in Files and ImagesPipelines. Since ImagesPipeline inherits from FilesPipeline, maybe FilesPipeline would be a better place to put this function.
But placing it here allows it to be used in other classes inheriting from MediaPipeline...
Not sure about the best option.

This comment has been minimized.

@pawelmhm

pawelmhm Jun 20, 2016
Author Contributor

But placing it here allows it to be used in other classes inheriting from MediaPipelin

yeah this is why I placed it there. I see no harm in keeping it in MediaPipeline, and there is potential benefit in the future so I think it's ok to keep it there.

some class attributes for ImagePipeline and FilesPipeline had DEFAULT prefix. These
attributes should be preserved as well, if users subclasses define values for
DEFAULT_<CLS_ATTRIBUTE_NAME> attribute this value should be preserved.
@pawelmhm pawelmhm force-pushed the pawelmhm:fix-images-pipeline-uppercase-other branch from c5ca38c to fa4d0cd Jun 20, 2016
@pawelmhm
Copy link
Contributor Author

@pawelmhm pawelmhm commented Jun 29, 2016

so at this point it is probably ready. Are there any other comments or feedback I should include @redapple

@redapple redapple changed the title [image_pipeline] bring back uppercase class attributes [MRG+1] [image_pipeline] bring back uppercase class attributes Jun 29, 2016
@kmike kmike added this to the v1.1.1 milestone Jul 8, 2016
@redapple
Copy link
Contributor

@redapple redapple commented Jul 8, 2016

@kmike , @eliasdorneles ,
are you ok with merging this? (and backporting to 1.1 of course)

If you have multiple image pipelines inheriting from ImagePipeline and you want to have different settings in different pipelines
you can set setting keys preceded with uppercase name of your pipeline class. E.g. if your pipeline is called
MyPipeline and you want to have custom IMAGES_URLS_FIELD you define setting MYPIPELINE_IMAGES_URLS_FIELD and your custom
settings will be used.

This comment has been minimized.

@kmike

kmike Jul 9, 2016
Member

Could you please use 80 columns limit?

if not hasattr(self, "IMAGES_URLS_FIELD"):
self.IMAGES_URLS_FIELD = self.DEFAULT_IMAGES_URLS_FIELD

default_images_urls_field = getattr(self, "IMAGES_URLS_FIELD", "DEFAULT_IMAGES_URLS_FIELD")

This comment has been minimized.

@kmike

kmike Jul 9, 2016
Member

default_images_urls_field will be "DEFAULT_IMAGES_URLS_FIELD", not "image_urls" if "IMAGES_URLS_FIELD" attribute is missing

This comment has been minimized.

@pawelmhm

pawelmhm Jul 12, 2016
Author Contributor

good point. This would never happen because this attribute is set one line above. It was dead code. I removed that.

self.images_urls_field = settings.get(
self._key_for_pipe('IMAGES_URLS_FIELD', cls_name), default_images_urls_field
)
default_images_result_field = getattr(self, "IMAGES_RESULT_FIELD", "DEFAULT_IMAGES_RESULT_FIELD")

This comment has been minimized.

@kmike

kmike Jul 9, 2016
Member

the same issue, "DEFAULT_IMAGES_RESULT_FIELD" shouldn't be a default value

self._key_for_pipe('IMAGES_MIN_HEIGHT', cls_name), self.MIN_HEIGHT
)
self.thumbs = settings.get(
self._key_for_pipe('IMAGES_THUMBS', cls_name), self.THUMBS

This comment has been minimized.

@kmike

kmike Jul 9, 2016
Member

I'd cut some code and define

resolve = partial(self._key_for_pipe, base_class_name=cls_name)

and then

self.thumbs = settings.get(resolve('IMAGES_THUMBS'), self.THUMBS)

def _key_for_pipe(self, key, base_class_name):
"""
Allow setting settings for user defined MediaPipelines that inherit from base.

This comment has been minimized.

@kmike

kmike Jul 9, 2016
Member

This function doesn't allow anything by itself; I think this comment belongs either to docs or to the caller code. Also, the sentence is quite hard to read, the example below is much more clear :) Maybe it makes sense to just expand the example.

This comment has been minimized.

@kmike

kmike Jul 9, 2016
Member

A doctest can be a good way to demonstrate what the function does.

* 80 characters line limit
* shortening some code
* removed dead code
* add doctest for _key_for_pipe function
@pawelmhm
Copy link
Contributor Author

@pawelmhm pawelmhm commented Jul 12, 2016

@kmike updated in c22cc10

  • keeping line limit in docs
  • added functools partial
  • removed dead code
  • add doctest for key for pipe
def _key_for_pipe(self, key, base_class_name=None):
"""
>>> result = MediaPipeline()._key_for_pipe("IMAGES")
>>> assert result == "IMAGES"

This comment has been minimized.

@kmike

kmike Jul 12, 2016
Member

assert is usually not used in doctests; an easier way to write this:

>>> MediaPipeline()._key_for_pipe("IMAGES")
'IMAGES'

...

>>> MyPipe()._key_for_pipe("IMAGES", base_class_name="MediaPipeline")
'MYPIPE_IMAGES'

This comment has been minimized.

@pawelmhm

pawelmhm Jul 12, 2016
Author Contributor

good point, I need to write more doctests. fixed in 9818c97

@redapple
Copy link
Contributor

@redapple redapple commented Jul 12, 2016

LGTM!
Thanks again @pawelmhm

if not hasattr(self, "IMAGES_URLS_FIELD"):
self.IMAGES_URLS_FIELD = self.DEFAULT_IMAGES_URLS_FIELD

resolve = functools.partial(self._key_for_pipe, base_class_name=cls_name)

This comment has been minimized.

@kmike

kmike Jul 12, 2016
Member

if you move resolve function upper you can use it for self.expires and drop cls_name variable

This comment has been minimized.

@pawelmhm

pawelmhm Jul 13, 2016
Author Contributor

good point ceecf3b

@pawelmhm pawelmhm force-pushed the pawelmhm:fix-images-pipeline-uppercase-other branch from dc0167c to ceecf3b Jul 13, 2016
@kmike kmike merged commit 79639d0 into scrapy:master Jul 13, 2016
2 checks passed
2 checks passed
codecov/patch 100% of diff hit (target 100%)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
redapple added a commit that referenced this pull request Jul 13, 2016
[backport][1.1] [image_pipeline] bring back uppercase class attributes (PR #1989)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants
You can’t perform that action at this time.