
[MRG+1] Account for mangling when serializing requests with private callbacks #3790

Merged
merged 8 commits into from Jun 17, 2019

Conversation

@andrewbaxter commented May 24, 2019

Sorry for making a PR without an issue; I just hit a problem where mixins with similarly named methods used private names to prevent method conflicts. The spider worked properly in memory but raised errors when requests were serialized to a disk queue.

The Python documentation explicitly describes the mangling convention, so I think it's reasonable to assume it works the same way in other implementations, but I haven't tested anything except CPython.
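For context, a minimal sketch of the failure mode (the class and method names here are made up for illustration): CPython rewrites a double-underscore attribute name at class-definition time, but the function object still reports the unmangled name, which is what serializing a request by callback name records.

```python
class SpiderMixin:
    def __callback(self, response):
        return response

obj = SpiderMixin()

# The method is stored under the mangled attribute name:
assert hasattr(obj, '_SpiderMixin__callback')
assert not hasattr(obj, '__callback')

# ...but the function object still reports the unmangled name,
# which is what a name-based serializer would write to disk:
assert obj._SpiderMixin__callback.__name__ == '__callback'
```

Deserialization then tries something like getattr(spider, '__callback') and fails, which is why the mangling has to be reapplied when looking the method back up.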

@andrewbaxter (Author) commented May 24, 2019

AppVeyor build failure looks unrelated.

@elacuesta (Member) commented May 24, 2019

This looks good I think. Just one comment: according to https://docs.python.org/3.5/tutorial/classes.html#private-variables, "Any identifier of the form __spam (at least two leading underscores, at most one trailing underscore) is textually replaced with _classname__spam, where classname is the current class name with leading underscore(s) stripped". I wonder if we should check the "at most one trailing underscore" part as well.
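A quick illustration of that rule (hypothetical class, not from the PR): a name with a single trailing underscore still gets mangled, while dunder names are left alone.

```python
class C:
    def __one_(self):   # two leading, one trailing underscore: mangled
        pass

    def __two__(self):  # two trailing underscores: not mangled
        pass

assert '_C__one_' in dir(C)
assert '__two__' in dir(C)
assert '_C__two__' not in dir(C)
```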

@codecov bot commented May 24, 2019

Codecov Report

Merging #3790 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3790      +/-   ##
==========================================
+ Coverage   85.42%   85.43%   +<.01%     
==========================================
  Files         169      169              
  Lines        9635     9639       +4     
  Branches     1433     1434       +1     
==========================================
+ Hits         8231     8235       +4     
  Misses       1156     1156              
  Partials      248      248
Impacted Files Coverage Δ
scrapy/utils/reqser.py 93.02% <100%> (+0.71%) ⬆️

@codecov bot commented May 24, 2019

Codecov Report

Merging #3790 into master will decrease coverage by 0.52%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master   #3790      +/-   ##
=========================================
- Coverage   85.42%   84.9%   -0.53%     
=========================================
  Files         169     169              
  Lines        9635    9647      +12     
  Branches     1433    1435       +2     
=========================================
- Hits         8231    8191      -40     
- Misses       1156    1202      +46     
- Partials      248     254       +6
Impacted Files Coverage Δ
scrapy/utils/reqser.py 94.11% <100%> (+1.8%) ⬆️
scrapy/core/downloader/handlers/s3.py 62.9% <0%> (-32.26%) ⬇️
scrapy/utils/boto.py 46.66% <0%> (-26.67%) ⬇️
scrapy/core/downloader/tls.py 77.5% <0%> (-12.5%) ⬇️
scrapy/extensions/feedexport.py 80.18% <0%> (-4.72%) ⬇️
scrapy/core/downloader/handlers/http11.py 89.92% <0%> (-2.62%) ⬇️
scrapy/core/scraper.py 86.48% <0%> (-2.03%) ⬇️
scrapy/pipelines/files.py 67.15% <0%> (-1.1%) ⬇️


from scrapy.http import Request
from scrapy.utils.python import to_unicode, to_native_str
from scrapy.utils.misc import load_object


private_name_regex = re.compile('^__[^_](.*[^_])?_?$')
@elacuesta (Member) commented May 24, 2019

I'm not sure I understand the use of ^ in the middle of the pattern. IIRC ^ means the start of the string (https://docs.python.org/3/library/re.html#regular-expression-syntax). Also, .* without ? is greedy so it would match everything, including trailing underscores.

@elacuesta (Member) commented May 24, 2019

This is not working for me:

re.match('^__[^_](.*[^_])?_?$', '___private_stuff_')

I think this is difficult to do with just a regex, because you need to allow an arbitrary number of underscores in the middle of the identifier. What about something like '^(_*).*?(_*)$', which captures the underscores at the beginning and end, and then you count them?
I'm also concerned about the performance impact of applying the regex, since this would be done for each request that goes through the scheduler. Could you maybe run a test with https://github.com/scrapy/scrapy-bench?

@Gallaecio (Member) commented May 27, 2019

I'm not sure I understand the use of ^ in the middle of the pattern.

^ at the beginning of […] means that it matches any character not in […].

@Gallaecio (Member) commented May 27, 2019

The only issue with '^__[^_](.*[^_])?_?$' is that it does not allow more than 2 underscores at the beginning.

What about '^__+.*[^_]_?$'?
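For what it's worth, a quick check of that proposed pattern against a few cases (my own test strings, not taken from the PR):

```python
import re

pattern = re.compile(r'^__+.*[^_]_?$')

assert pattern.match('__private')            # typical private name
assert pattern.match('___private_stuff_')    # extra leading underscores, one trailing
assert not pattern.match('__dunder__')       # two trailing underscores: not mangled
assert not pattern.match('_single')          # only one leading underscore
```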

@andrewbaxter (Author) commented May 27, 2019

Thanks! I updated the regex to do basically that, and added a unit test that exercises the regex directly, since that was easier than creating a bunch of classes/callbacks to test the various underscore combinations.

I'll try the bench - if this is too slow maybe it could be memoized? Function names are typically constant through the job.

@elacuesta (Member) commented May 27, 2019

if this is too slow maybe it could be memoized? Function names are typically constant through the job.

I agree, creating a separate function and using functools.lru_cache looks like a good option here.

@andrewbaxter (Author) commented May 27, 2019

Sorry, didn't get to the benchmarking but I switched to the startswith/endswith solution because it's easier to read and faster :) and all the tests pass.

Still important to memoize?

@elacuesta (Member) commented May 27, 2019

I think it would be good to memoize; you're right that several requests will have the same callback, so it's probably not necessary to check it every time.

@andrewbaxter (Author) commented May 29, 2019

Not what I was originally going to bench, but benched with/without lru_cache:

With lru_cache:
Test = 'Book Spider' Iterations = '10'
Mean : 65.64734686683296 Median : 65.72720576665657 Std Dev : 0.4875654825761466

Without lru_cache:
Test = 'Book Spider' Iterations = '10'
Mean : 65.23675882544786 Median : 65.02926278845428 Std Dev : 0.6786585837183575

Both were using the code on this branch otherwise.

So my computer isn't exactly a clean bench environment and the samples are fairly close; according to this :D I think any optimization at this level is probably lost in the noise. Also, I applied lru_cache to _find_method, which might be beyond the scope of this change; maybe optimizing function lookup could be a separate PR?

@elacuesta (Member) commented May 29, 2019

Yeah that sounds good. Also, sorry for bringing functools.lru_cache into the conversation without checking its availability first, but it was added in py3.2 and we're not dropping support for py2 just yet :-/

@@ -87,6 +93,9 @@ def parse_item(self, response):
def handle_error(self, failure):
pass

def __parse_item_private(self, response):
@elacuesta (Member) commented May 24, 2019

Could you maybe add a test that accounts for methods that do not match the requirements? More than one trailing underscore, for instance.

@andrewbaxter (Author) commented May 27, 2019

Ah, commented above but added a test for the regex directly to make sure it matches/doesn't match variously underscored names.

@elacuesta (Member) commented May 27, 2019

AppVeyor build failure looks unrelated.

It seems to be related to the change; I believe it has to do with the spider's default callback in the parse command. What I'm not sure about is why it fails only in AppVeyor and not in Travis 🤔

@andrewbaxter (Author) commented May 27, 2019

I had to look at the logs for like 10 minutes, but it seems like there are a ton of tests failing, and they're all failing because an include path is wrong or something:

Coverage.py warning: --include is ignored because --source is set (include-ignored)\r\nc:\\projects\\scrapy\\.tox\\py36\\scripts\\python.exe: Error while finding module specification for 'scrapy.cmdline' (ModuleNotFoundError: No module named 'scrapy')

So the parse tests are failing but it's because scrapy.cmdline isn't there... AFAICT.

I saw some other PRs with the same error, but I might have missed a differentiating message in here.

@andrewbaxter (Author) commented May 29, 2019

Okay, now both CIs are failing 😅

I'm claiming innocence on the Travis py36 build as well; Python 3.6 tests work fine on my computer, and I can't for the life of me imagine how this change would affect deprecation notice warnings.

@Gallaecio Gallaecio closed this May 31, 2019
@Gallaecio Gallaecio reopened this May 31, 2019
@Gallaecio (Member) commented May 31, 2019

(closing and reopening for tests to re-run)

@andrewbaxter (Author) commented May 31, 2019

Travis passed! Thanks @Gallaecio .

Okay, last argument in my CI defense :D AppVeyor is failing on master as well, so I think it's expected to fail here. I don't know why codecov is failing; it shows 100% relative coverage for the last commit. Maybe it's complaining that it didn't get data from all Travis builds, but it shows the most up-to-date builds in the "builds" tab. Maybe one more rebuild would make it pass too?

@Gallaecio (Member) commented May 31, 2019

I would simply ignore the Windows-specific and coverage CI builds here.

@andrewbaxter (Author) commented Jun 3, 2019

I found an issue: this breaks if the method's declaring class is different from the object's class (e.g. the method was inherited).

AFAICT there's no way to reliably get the declaring class across multiple Python versions without a lot of extra code. Python 3.3+ has __qualname__, which can be parsed to generate the correct name; there's a qualname package which tries to do the same via source inspection; and otherwise the solution is to walk the class hierarchy and try to locate the first appearance of the method. There's a six issue referencing this, but they defer to the qualname package.

Since this is essentially a new feature, I expect only new projects will require it, and it may be safe to additionally assume they're using a recent Python version, so I think relying on __qualname__ is the cleanest approach here. Updating.

Edit: So to confirm the patch behavior here's an example:

class A:
    class B:
        def __m(self):
            print('nested B')


class B:
    def x(self):
        self.__m()

    def __m(self):
        print('top level B')


class Z(A):
    pass


class Y(A.B):
    pass


class X(A.B, B):
    pass


z = Z()
y = Y()
x = X()

print(z.B._B__m.__qualname__)
print(y._B__m.__qualname__)
print(x._B__m.__qualname__)
print(dir(x))
x.x()

prints

A.B.__m
A.B.__m
A.B.__m
['_B__m', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'x']
nested B

The mangled private name always uses the immediate enclosing class, with no consideration for nesting. Additionally, if two classes have the same name, the "private" protection breaks and the method is overwritten upon inheritance; this is standard Python behavior, so no workarounds need to be considered.

@andrewbaxter (Author) commented Jun 3, 2019

Build failure again looks unrelated, this time something about the docs build.

@Gallaecio (Member) commented Jun 3, 2019

Yes, I’ve just taken care of that documentation issue.

@Gallaecio Gallaecio changed the title Account for mangling when serializing requests with private callbacks [MRG+1] Account for mangling when serializing requests with private callbacks Jun 3, 2019
name = '_%s%s' % (classname, name)
else:
splits = qualname.split('.')
name = '_%s%s' % (splits[-2], splits[-1])
@kmike (Member) commented Jun 4, 2019

do you think it is possible to extract this logic into a separate function, e.g. _mangled_name, to make the intention of the code clearer, and also to be able to add tests for it? You could check that the names produced are the same as the ones Python makes.

@andrewbaxter (Author) commented Jun 5, 2019

I added tests for the two cases (own method, inherited method)... I feel like further testing would be going into "writing tests to prove Python behavior" rather than "writing tests to prove Scrapy behavior" but please let me know if you think a test with a nested class or something could be helpful.

@kmike (Member) commented Jun 5, 2019

👍 thanks @andrewbaxter!

@andrewbaxter (Author) commented Jun 6, 2019

Tests fixed (uh, relatively) and ready for re-review.

@dangra dangra merged commit b53ff59 into scrapy:master Jun 17, 2019
3 checks passed
@kmike kmike added this to the v1.7 milestone Jun 25, 2019