Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] [image_pipeline] bring back uppercase class attributes #1989

Merged
merged 14 commits into from Jul 13, 2016

Conversation

pawelmhm
Copy link
Contributor

@pawelmhm pawelmhm commented May 13, 2016

This assumes that users may want to have multiple pipeline classes inheriting from Image pipeline. They want to have option to set settings per pipeline instances as in #1891 & #1850 but also per multiple pipeline classes (so having project wide settings key doesn't work for them).

It does not deprecate class attributes in favor of settings but uses class attributes as default values for settings. If there is nothing in user settings, pipeline will use uppercase attribute.

Aside from more flexibility this also gives full backward compatibility with whatever users were doing with uppercase attributes in their pipelines subclassing ImagePipeline.

Fixes: #1985, for alternative approach and more motivation see #1986

allow users to have class attributes on image pipelines. This assumes
that class attributes are useful if users want to have different pipeline
classes inhriting from ImagePipeline.
@pawelmhm pawelmhm changed the title [image_pipeline] bring back uppercase pipeline attributes [image_pipeline] bring back uppercase class attributes May 13, 2016
@eliasdorneles
Copy link
Member

This is better, yeah.

One thing to note is that the setting is still global to all pipelines, so you can't set the setting IMAGES_RESULT_FIELD if you want these class atributes to be respected.

Perhaps it would be nice to have some tests making these design choices more explicit?

    def test_derived_class_can_customize_field(self):
        class MyCustomImagesPipeline(ImagesPipeline):
            IMAGES_RESULT_FIELD = "screenshot"
        pipeline = MyCustomImagesPipeline(
            self.tempdir, download_func=_mocked_download_func,
            settings={})
        self.assertEqual("screenshot", pipeline.images_result_field)

    def test_always_respect_global_setting(self):
        class MyCustomImagesPipeline(ImagesPipeline):
            IMAGES_RESULT_FIELD = "screenshot"
        pipeline = MyCustomImagesPipeline(
            self.tempdir, download_func=_mocked_download_func,
            settings={"IMAGES_RESULT_FIELD": "from_setting"})

        # XXX: is this really desirable?
        self.assertEqual("from_setting", pipeline.images_result_field)

@pawelmhm
Copy link
Contributor Author

pawelmhm commented May 16, 2016

One thing to note is that the setting is still global to all pipelines, so you can't set the setting IMAGES_RESULT_FIELD if you want these class atributes to be respected.

good point, if someone sets something in settings this will override all class attributes. So it may be little confusing.

Do you think this would be argument for deprecating class attributes at some point and replacing them with setting keys? Perhaps in the future we could support image pipeline settings as dicts? E.g.

IMAGE_SETTINGS = {
    "MyFirstImagePipeline": {
           "IMAGE_RESULT_FIELD": "screenshots",
     },
    "MySecondIMagePipeline": {
           "IMAGE_RESULT_FIELD": "product_images",
    }
}

@eliasdorneles
Copy link
Member

Not sure.
I think if I were implementing the custom pipelines, I'd introduce a new setting for each.

@redapple
Copy link
Contributor

About this PR, I believe it's the best that can be done to handle both the ease-of-use of settings and also legacy customization using class attributes, even if it could be confusing that settings take precedence over class attributes (note/warning needed in docs at the very least).

Maybe we should promote the use of settings and show with code example how to customize the values at init.

About #1989 (comment) , I agree with @eliasdorneles on not using dicts settings for image pipelines.
I see settings as global knobs, effective for all subclasses, except if explicitly changed in the subclass contructor in some way (via a custom setting for example) by the developer.

@pawelmhm
Copy link
Contributor Author

pawelmhm commented May 17, 2016

ok so I'll add tests for overriding class attributes with settings (as @eliasdorneles suggests) and also perhaps test for keeping uppercase attributes if there's pipeline inheriting from base image pipeline.

In docs I'll add following things:

  • warning that values from settings will override class attributes
  • code sample showing how to take values from settings in init

and docs about image pipeline settings.
@codecov-io
Copy link

codecov-io commented May 18, 2016

Current coverage is 83.40%

Merging #1989 into master will increase coverage by 0.03%

Powered by Codecov. Last updated by 2dd1a9e...ceecf3b

@pawelmhm
Copy link
Contributor Author

updated in 6c67db3

@djunzu
Copy link
Contributor

djunzu commented May 19, 2016

So, this allows that we have different settings for different instances of the default ImagesPipeline OR multiple custom ImagesPipelines in a spider. Right?

I think we should do it be AND, not OR: it should be possible to have multiple custom ImagesPipelines per spider together with multiple settings across different spiders (running trough CrawlerProcess).

@pawelmhm
Copy link
Contributor Author

pawelmhm commented May 20, 2016

I think we should do it be AND, not OR: it should be possible to have multiple custom ImagesPipelines per spider together with multiple settings across different spiders (running trough CrawlerProcess).

yeah I think this PR should do this. Multiple settings per pipeline class will be done by class attributes or custom init written by user (that will take values from settings). Multiple settings per different spiders will be possible by adding different custom_settings per spider class.

@djunzu
Copy link
Contributor

djunzu commented May 20, 2016

yeah I think this PR should do this. Multiple settings per pipeline class will be done by class attributes or custom init written by user (that will take values from settings).

Yes, you are right. I forgot to consider custom init written by user.

I guess the problem is I think like a dumb end user.... There is nothing in the docs explaining how to write a custom init that will take values from settings. So, as a end user, I would expect something like this would just work:

pipelines.py:

class MyPipe(ImagesPipeline):
  def item_completed(self, results, item, info):
    item = super(MyPipe, self).item_completed(results, item, info)
    print("doing some stuff with item here")
    return item

class MyPipeTwo(ImagesPipeline):
  def item_completed(self, results, item, info):
    item = super(MyPipeTwo, self).item_completed(results, item, info)
    print("doing some different stuff with item here")
    return item

settings.py:

MYPIPE_EXPIRES = 10
MYPIPE_RESULT_FIELD = "my_field"
MYPIPETWO_EXPIRES = 20
MYPIPETWO_RESULT_FIELD = "another_field"

my_spider.py:

class MySpider1(CrawlSpider):
  ...
    item["my_field"] = links
    item["another_field"] = more_links
  ...

class MySpider2(CrawlSpider):
  custom_settings = {"MYPIPETWO_EXPIRES": 36500}
  ...
    item["my_field"] = links
    item["another_field"] = more_links
  ...

Ok, docs could explain how to write a custom init for custom ImagesPipelines. But, as a developer, I think it would expose a layer of complexity that the end user doesn't need to know about. Plus, if every custom ImagesPipeline must have something like:

 def __init__(self, store_uri, download_func=None, settings=None):
     self.files_urls_field = settings.get('CUSTOM_SETTING_NAME_FOR_URLS_FIELD', self.DEFAULT_FILES_URLS_FIELD)
     self.expires = settings.getint('CUSTOM_SETTING_NAME_FOR_EXPIRES', self.EXPIRES)

Then, it makes sense to me provide some out of the box feature in that settings for custom ImagesPipelines can be set easily in the settings.py.

Just my thoughts.

But I guess multiple ImagesPipeline in a single spider and multiple settings per different spiders are not the common case; both together should be very rare. So, no worries.

@pawelmhm
Copy link
Contributor Author

@djunzu

docs could explain how to write a custom init for custom ImagesPipelines. But, as a developer, I think it would expose a layer of complexity that the end user doesn't need to know about.

I also think that adding this layer of complexity in docs may NOT be necessary. It will require long code sample that will be used by 1-2 % of users. So that would suggest current PR is good as it is (of course if there are no other objections). @redapple @eliasdorneles is it okay if we dont add code sample illustrating how to take settings on pipeline init?

@djunzu

Then, it makes sense to me provide some out of the box feature in that settings for custom ImagesPipelines can be set easily in the settings.py.

There is no obvious way to provide such feature. One way I was thinking about was using some form of dictionaries in settings see this comment here #1989 (comment), but others suggested that it would add complexity and break our established notions about settings keys. Plus if you have multiple image pipelines you have to create object instance anyway, so you can just as well add your own keys in settings or as class attributes whatever works best for you. It's actually not that complicated so I think most users should manage.

@djunzu
Copy link
Contributor

djunzu commented May 31, 2016

@pawelmhm

It's actually not that complicated so I think most users should manage.

Here I must ask: how much most users know or should know about Scrapy?

Imagine you know absolutely nothing about Scrapy, you just started. How you would know it is possible to use a class attribute in the init? Two options: docs or dig into the code. We have agreed it should not go in the docs. Thinking as a end user of the framework, I do not want dig in the code.

Ok,, there is the Settings API. But... All we want is to set a custom value in a custom pipeline for a existing setting parameter. I think it should not be required to know about the Settings API for a simple task like that.


So that would suggest current PR is good as it is

I agree with you 100 percent!


There is no obvious way to provide such feature.

What about something like this?

urls_field = settings.get('IMAGES_URLS_FIELD', self.IMAGES_URLS_FIELD)
self.images_urls_field = settings.get('%s_URLS_FIELD' % self.__class__.__name__.upper(), urls_field)

result_field = settings.get('IMAGES_RESULT_FIELD', self.IMAGES_RESULT_FIELD)
self.images_result_field = settings.get('%s_RESULT_FIELD' % self.__class__.__name__.upper(), result_field)

It allows users to have:

# pipelines.py:
class ImagesPipelineOne(ImagesPipeline):
    pass
class ImagesPipelineTwo(ImagesPipeline):
    pass
class CustomImagesPipelineWithDefaultValues(ImagesPipeline):
    pass

# settings.py:
IMAGESPIPELINEONE_URLS_FIELD = 'image_urls_one'
IMAGESPIPELINEONE_RESULT_FIELD = 'images_one'
IMAGESPIPELINETWO_URLS_FIELD = 'image_urls_two'
IMAGESPIPELINETWO_RESULT_FIELD = 'images_two'

# spider.py:
class DummySpider(scrapy.Spider):
    name = "dummy"
    allowed_domains = ["example.com"]
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        yield {'image_urls_one': ['http://scrapy.org/img/48-zimigo-logo.png'],
                  'image_urls_two': ['http://scrapy.org/img/shub-logo.png'],
                  'image_urls': ['http://scrapy.org/img/08-lyst-logo.png']}

# output:
# 2016-05-31 17:37:06 [dummy] DEBUG: Scraped from <200 http://www.example.com/>
{'image_urls': ['http://scrapy.org/img/08-lyst-logo.png'],
'image_urls_one': ['http://scrapy.org/img/48-zimigo-logo.png'],
'image_urls_two': ['http://scrapy.org/img/shub-logo.png'],
'images_two': [{'url': 'http://scrapy.org/img/shub-logo.png', 'path': '0d1de4a9ff01fd6c452460257ce36ee91120f9ab.jpg', 'checksum': '62eed3305ecc8e88e8b40f85e4f7b2c2'}],
'images_one': [{'url': 'http://scrapy.org/img/48-zimigo-logo.png', 'path': '7b67aa2be22751981093e2b28298e8fdf5716866.jpg', 'checksum': '80c1f6fb4a2807c55f57668946e3c645'}],
'images': [{'url': 'http://scrapy.org/img/08-lyst-logo.png', 'path': '06a6e5869eb2d9006f3e4b48422a1f9e26f954a9.jpg', 'checksum': 'f2ea5d62afaa040afc7fd8c6e2ed4610'}]}

@pawelmhm
Copy link
Contributor Author

pawelmhm commented Jun 1, 2016

What about something like this?

nice idea! will try this out and add here.

allow to have image settings with class name, so that settings for user defined ImagePipeline
subclasses can be defined easily.
@pawelmhm
Copy link
Contributor Author

@redapple @eliasdorneles @djunzu updated a62d4b0

.. note:: If you have multiple image pipelines inheriting from ImagePipeline and you want to have different settings in different pipelines
you can set setting keys preceded with uppercase name of your pipeline class. E.g. if your pipeline is called
MyPipeline and you want to have custom IMAGES_URLS_FIELD you define setting MYPIPELINE_IMAGES_URLS_FIELD.


Copy link
Contributor

@djunzu djunzu Jun 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be valid to add something like "Otherwise the custom pipeline will inherit settings values from default pipeline."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this follows from context, but probably wont harm to add this.

@djunzu
Copy link
Contributor

djunzu commented Jun 10, 2016

@pawelmhm

In default_settings.py, there are still three values for FilesPipeline that will override the respective class attributes.

I think FilesPipeline should also get its key_for_pipe and some tests.

And I think you can add one more test asserting a custom Files/ImagesPipeline inherit custom settings from parent. (So, if someone's code relies on this behavior, future updates will not break it.)

there was identical test for different setting keys. I unified it into
one unit test.

Fixes comments for tests, adds comments about intention of uppercase attrs.

Adds another test for user defined setting keys and uppercase attrs.
unify tests that test same thing for different attribute values into one. Add
better docstrings for tests.
test case when subclass inherits from base class and has no attributes nor
settings defined.
if test tests same thing but for different field it can be unified into one.
dont override class attributes with default settings (same as in image pipeline).
…classes

* move key_for_pipe function to media pipeline so that file pipeline can use it
* use key_for_pipe in file pipeline so that users can define custom settings for subclasses easily
* add tests for file pipelines attributes and settings
STORE_SCHEMES = {
'': FSFilesStore,
'file': FSFilesStore,
's3': S3FilesStore,
}
DEFAULT_FILES_URLS_FIELD = 'file_urls'
DEFAULT_FILES_RESULT_FIELD = 'files'

Copy link
Contributor

@djunzu djunzu Jun 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not fix the backward incompatibility (#1985). It must be FILES_URLS_FIELD and FILES_RESULT_FIELD, as it was before #1891. Look at this line: DEFAULT_* is used just to populate the other one.

Here arises one question: should both (e.g. DEFAULT_FILES_URLS_FIELD and FILES_URLS_FIELD) be present now? Because one could have a custom Files/ImagesPipeline using one or another.

I think both should be present in order to totally fix the backward incompatibility:

DEFAULT_FILES_URLS_FIELD = 'file_urls'
FILES_URLS_FIELD = DEFAULT_FILES_URLS_FIELD

DEFAULT_FILES_RESULT_FIELD = 'files'
FILES_RESULT_FIELD = DEFAULT_FILES_RESULT_FIELD = 'files'

Note: same in ImagePipeline.

Copy link
Contributor Author

@pawelmhm pawelmhm Jun 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at this line: DEFAULT_* is used just to populate the other one.

good catch @djunzu I fixed that in fa4d0cd

some class attributes for ImagePipeline and FilesPipeline had DEFAULT prefix. These
attributes should be preserved as well, if users subclasses define values for
DEFAULT_<CLS_ATTRIBUTE_NAME> attribute this value should be preserved.
@pawelmhm pawelmhm force-pushed the fix-images-pipeline-uppercase-other branch from c5ca38c to fa4d0cd Compare June 20, 2016 10:53
@pawelmhm
Copy link
Contributor Author

so at this point it is probably ready. Are there any other comments or feedback I should include @redapple

@redapple redapple changed the title [image_pipeline] bring back uppercase class attributes [MRG+1] [image_pipeline] bring back uppercase class attributes Jun 29, 2016
@kmike kmike added this to the v1.1.1 milestone Jul 8, 2016
@redapple
Copy link
Contributor

redapple commented Jul 8, 2016

@kmike , @eliasdorneles ,
are you ok with merging this? (and backporting to 1.1 of course)

If you have multiple image pipelines inheriting from ImagePipeline and you want to have different settings in different pipelines
you can set setting keys preceded with uppercase name of your pipeline class. E.g. if your pipeline is called
MyPipeline and you want to have custom IMAGES_URLS_FIELD you define setting MYPIPELINE_IMAGES_URLS_FIELD and your custom
settings will be used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please use 80 columns limit?

* 80 characters line limit
* shortening some code
* removed dead code
* add doctest for _key_for_pipe function
@pawelmhm
Copy link
Contributor Author

@kmike updated in c22cc10

  • keeping line limit in docs
  • added functools partial
  • removed dead code
  • add doctest for key for pipe

def _key_for_pipe(self, key, base_class_name=None):
"""
>>> result = MediaPipeline()._key_for_pipe("IMAGES")
>>> assert result == "IMAGES"
Copy link
Member

@kmike kmike Jul 12, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert is usually not used in doctests; an easier way to write this:

>>> MediaPipeline()._key_for_pipe("IMAGES")
'IMAGES'

...

>>> MyPipe()._key_for_pipe("IMAGES", base_class_name="MediaPipeline")
'MYPIPE_IMAGES'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, I need to write more doctests. fixed in 9818c97

@redapple
Copy link
Contributor

LGTM!
Thanks again @pawelmhm

if not hasattr(self, "IMAGES_URLS_FIELD"):
self.IMAGES_URLS_FIELD = self.DEFAULT_IMAGES_URLS_FIELD

resolve = functools.partial(self._key_for_pipe, base_class_name=cls_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you move resolve function upper you can use it for self.expires and drop cls_name variable

Copy link
Contributor Author

@pawelmhm pawelmhm Jul 13, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point ceecf3b

@pawelmhm pawelmhm force-pushed the fix-images-pipeline-uppercase-other branch from dc0167c to ceecf3b Compare July 13, 2016 14:19
@kmike kmike merged commit 79639d0 into scrapy:master Jul 13, 2016
redapple added a commit that referenced this pull request Jul 13, 2016
[backport][1.1] [image_pipeline] bring back uppercase class attributes (PR #1989)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants