
[MRG+1] Account for mangling when serializing requests with private callbacks #3790

Merged
8 commits merged on Jun 17, 2019
Changes from 2 commits
10 changes: 9 additions & 1 deletion scrapy/utils/reqser.py
@@ -2,12 +2,16 @@
Helper functions for serializing (and deserializing) requests.
"""
import six
import re

from scrapy.http import Request
from scrapy.utils.python import to_unicode, to_native_str
from scrapy.utils.misc import load_object


private_name_regex = re.compile('^__[^_](.*[^_])?_?$')
@elacuesta (Member), May 24, 2019:
I'm not sure I understand the use of ^ in the middle of the pattern. IIRC ^ means the start of the string (https://docs.python.org/3/library/re.html#regular-expression-syntax). Also, .* without ? is greedy so it would match everything, including trailing underscores.

@elacuesta (Member), May 24, 2019:

This is not working for me:

re.match('^__[^_](.*[^_])?_?$', '___private_stuff_')

I think this is difficult to do just with a regex, because you need to allow an arbitrary amount of underscores in the middle of the identifier. What about something like '^(_*).*?(_*)$', which captures underscores at the beginning and end, and then you count the amount?
I'm also concerned about the performance impact of applying the regex, since this would be done for each request that goes through the scheduler. Could you maybe run a test with https://github.com/scrapy/scrapy-bench?

@Gallaecio (Member), May 27, 2019:

> I'm not sure I understand the use of ^ in the middle of the pattern.

Inside a character class, ^ has a different meaning: [^_] matches any single character that is not an underscore, rather than anchoring to the start of the string.

Member:

The only issue with '^__[^_](.*[^_])?_?$' is that it does not allow more than 2 underscores at the beginning.

What about '^__+.*[^_]_?$'?
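As a quick sanity check (not part of the diff itself), the suggested pattern accepts any number of extra leading underscores while still rejecting dunder names:

```python
import re

# Pattern from the review suggestion: two or more leading underscores,
# then at least one non-underscore, then at most one trailing underscore.
pattern = re.compile('^__+.*[^_]_?$')

assert pattern.match('__private')           # plain private name: mangled
assert pattern.match('___private_stuff_')   # the case the first pattern missed
assert not pattern.match('__dunder__')      # dunder names are not mangled
assert not pattern.match('_single')         # single leading underscore: not mangled
```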

Author:

Thanks! I updated the regex and did something basically like that, and added a unit test to test the regex directly since it was easier doing that than creating a bunch of classes/callbacks to test various name underscore combinations.

I'll try the bench - if this is too slow maybe it could be memoized? Function names are typically constant through the job.

Member:

> if this is too slow maybe it could be memoized? Function names are typically constant through the job.

I agree, creating a separate function and using functools.lru_cache looks like a good option here.

Author:

Sorry, I didn't get to the benchmarking, but I switched to the startswith/endswith solution because it's easier to read and faster :) and all the tests pass.

Still important to memoize?

Member:

I think it would be good to memoize, you're right that several requests will have the same callback so it's probably not necessary to check them every time.

Author:

Not what I was originally going to bench, but benched with/without lru_cache:

With lru_cache:
Test = 'Book Spider', Iterations = 10
Mean: 65.64734686683296, Median: 65.72720576665657, Std Dev: 0.4875654825761466

Without lru_cache:
Test = 'Book Spider', Iterations = 10
Mean: 65.23675882544786, Median: 65.02926278845428, Std Dev: 0.6786585837183575

Both were using the code on this branch otherwise.

My computer isn't exactly a clean bench environment, and the samples are fairly close, so according to this :D any optimization at this level is probably lost in the noise. Also, I applied lru_cache to _find_method, which might be beyond the scope of this change; maybe optimizing function lookup could be a separate PR?

Member:

Yeah that sounds good. Also, sorry for bringing functools.lru_cache into the conversation without checking its availability first, but it was added in py3.2 and we're not dropping support for py2 just yet :-/



def request_to_dict(request, spider=None):
"""Convert Request object to a dict.

@@ -75,7 +79,11 @@ def _find_method(obj, func):
pass
else:
if func_self is obj:
return six.get_method_function(func).__name__
name = six.get_method_function(func).__name__
if private_name_regex.search(name):
classname = obj.__class__.__name__.lstrip('_')
name = '_%s%s' % (classname, name)
return name
raise ValueError("Function %s is not a method of: %s" % (func, obj))


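For context on why _find_method prepends the class name: Python mangles double-underscore attribute names at class-definition time, but the function's own __name__ keeps the unmangled form, so the serializer has to reconstruct the mangled attribute name itself. A minimal demonstration:

```python
class TestSpider:
    def __parse_item_private(self, response):
        return response

# The attribute is stored under the mangled name...
assert hasattr(TestSpider, '_TestSpider__parse_item_private')
assert not hasattr(TestSpider, '__parse_item_private')

# ...but __name__ keeps the unmangled form, which is why the
# serializer prepends '_<classname>' before looking the method up.
assert TestSpider._TestSpider__parse_item_private.__name__ == '__parse_item_private'
```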
9 changes: 9 additions & 0 deletions tests/test_utils_reqser.py
@@ -68,6 +68,12 @@ def test_callback_serialization(self):
errback=self.spider.handle_error)
self._assert_serializes_ok(r, spider=self.spider)

def test_private_callback_serialization(self):
r = Request("http://www.example.com",
callback=self.spider._TestSpider__parse_item_private,
errback=self.spider.handle_error)
self._assert_serializes_ok(r, spider=self.spider)

def test_unserializable_callback1(self):
r = Request("http://www.example.com", callback=lambda x: x)
self.assertRaises(ValueError, request_to_dict, r)
@@ -87,6 +93,9 @@ def parse_item(self, response):
def handle_error(self, failure):
pass

def __parse_item_private(self, response):
Member:

Could you maybe add a test that accounts for methods that do not match the requirements? More than one trailing underscore, for instance.

Author:

Ah, commented above but added a test for the regex directly to make sure it matches/doesn't match variously underscored names.

pass


class CustomRequest(Request):
pass