
[WIP] use response selector cache #3157

Open
wants to merge 2 commits into master
Conversation

malloxpb (Member) commented Mar 8, 2018

Use the response's cached selector instead of creating a new one every time an ItemLoader object is created, as mentioned in issue #3125. I created this as sample code to continue the discussion :)
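For readers joining from #3125, here is a rough illustration of the behaviour on master that this change targets (a made-up example, not part of the patch): the loader builds its own Selector even when the response has already parsed and cached one via response.selector.

```python
# Rough illustration of the behaviour on master that this PR wants to change.
from scrapy.http import TextResponse
from scrapy.loader import ItemLoader

response = TextResponse(
    url='http://example.com',
    body=b'<html><body><p>hello</p></body></html>',
    encoding='utf-8',
)

cached = response.selector            # parses the body once and caches the Selector
loader = ItemLoader(response=response)

# On master the loader calls self.default_selector_class(response), so the
# body is parsed a second time and a different Selector object is used:
print(loader.selector is cached)      # False before this change
```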

codecov bot commented Mar 8, 2018

Codecov Report

Merging #3157 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3157      +/-   ##
==========================================
+ Coverage   82.11%   82.11%   +<.01%     
==========================================
  Files         228      228              
  Lines        9588     9589       +1     
  Branches     1385     1385              
==========================================
+ Hits         7873     7874       +1     
  Misses       1456     1456              
  Partials      259      259
Impacted Files Coverage Δ
scrapy/loader/__init__.py 94.55% <100%> (+0.03%) ⬆️

@@ -26,7 +26,10 @@ class ItemLoader(object):

     def __init__(self, item=None, selector=None, response=None, parent=None, **context):
         if selector is None and response is not None:
-            selector = self.default_selector_class(response)
+            if response.selector.__class__ == Selector:
kmike (Member) commented on this line:

Hi @nctl144,

There are two issues with this line:

  1. If a subclass overrides the default_selector_class attribute, it will have no effect after this change.
  2. As I mentioned in https://github.com/scrapy/scrapy/issues/3125#issuecomment-366316774, this approach has an issue: when someone accesses response.selector for the first time, Scrapy parses the response body; when this condition is False, we'd be parsing the response body twice (here + later, at the self.default_selector_class(response) line). That looks unnecessary.
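A hedged sketch of one way to address both points (not necessarily what is being suggested in the review; the CachedSelectorItemLoader subclass is hypothetical): only reuse a selector the response has already built, and keep going through default_selector_class otherwise, so the body is never parsed twice and subclass overrides still apply.

```python
# Hypothetical sketch addressing both review points; not the patch itself.
from scrapy.loader import ItemLoader


class CachedSelectorItemLoader(ItemLoader):
    def __init__(self, item=None, selector=None, response=None, parent=None, **context):
        if selector is None and response is not None:
            # Peek at the private cache instead of response.selector, so we
            # never trigger a parse just to inspect the selector's class.
            cached = getattr(response, '_cached_selector', None)
            if isinstance(cached, self.default_selector_class):
                selector = cached  # reuse the already-parsed selector
            else:
                # Respect whatever class the (sub)class asked for; on this
                # path the body is parsed at most once.
                selector = self.default_selector_class(response)
        super().__init__(item=item, selector=selector, response=response,
                         parent=parent, **context)
```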

malloxpb (Member, Author) commented Mar 11, 2018

Hey @kmike, I got what you mean. So I modified the code so that we use response.selector even when response._cached_selector is None. Since response.selector returns a new Selector object (for the current response) when there is no cached selector yet, I think the behaviour is still the same, right?
Please let me know what you think! :)
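For context, the revised approach relies on response.selector being a lazily cached property: the first access builds and caches the Selector, and later accesses (including by the loader) reuse it. A minimal demonstration (made-up response, not part of the patch):

```python
# Minimal demonstration of the lazy caching the revised patch relies on.
from scrapy.http import TextResponse

response = TextResponse(
    url='http://example.com',
    body=b'<html><body><p>hi</p></body></html>',
    encoding='utf-8',
)

print(response._cached_selector)   # None: nothing has been parsed yet
first = response.selector          # first access parses the body and caches the Selector
print(response.selector is first)  # True: later accesses reuse the cached object
```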
