[MRG+1] Remove duplicate code now handled by newer w3lib #1881
Conversation
Oops. Fixed the ajaxcrawl middleware. The CI error with the docs can't be related now, though:

Exception occurred:
  File "conf.py", line 113, in <module>
    import sphinx_rtd_theme
ImportError: No module named 'sphinx_rtd_theme'
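That traceback is an environment problem on the docs builder (the theme package isn't installed), not something this PR touches. One defensive pattern for a Sphinx conf.py is to fall back to a built-in theme when the import fails — a sketch, assuming a default-theme fallback is acceptable for CI:

```python
# conf.py sketch: use the RTD theme when it is installed,
# otherwise fall back to Sphinx's built-in default theme
try:
    import sphinx_rtd_theme
    html_theme = 'sphinx_rtd_theme'
    html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
except ImportError:
    html_theme = 'default'
```

The simpler fix is of course just installing the package (pip install sphinx_rtd_theme) in the docs build environment.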
f4d03e1 to d3c8665
text = _script_re.sub(u'', text)
text = _noscript_re.sub(u'', text)
text = html.remove_comments(html.replace_entities(text))
text = html.remove_tags_with_content(text, ('script', 'noscript'))
Could you please check that the performance is not much worse?
Yup. Done.
script:
import timeit
timed = timeit.timeit("""
texts = [
'<html><head><meta name="fragment" content="!"/></head><body></body></html>',
"<html><head><meta name='fragment' content='!'></head></html>",
'<html><head><!--<meta name="fragment" content="!"/>--></head><body></body></html>',
'<html></html>',
]
# [1] all of ajaxcrawl func
#for text in texts:
# _has_ajaxcrawlable_meta(text)
# [2] w3lib
#for text in texts:
# remove_tags_with_content(text, ('script', 'noscript'))
# [3] scrapy.utils.response
for text in texts:
_noscript_re.sub(u'', _script_re.sub(u'', text))
""", setup="""
# [1]
#from scrapy.downloadermiddlewares.ajaxcrawl import _has_ajaxcrawlable_meta
# [2]
#from w3lib.html import remove_tags_with_content
# [3]
from scrapy.utils.response import _noscript_re, _script_re
""")
print(timed)
timings for 1000000 (timeit.default_number) loops:
[1] on master branch (scrapy.utils.response)
42.3605451584
[1] on this PR branch (w3lib)
54.5658538342
[2] w3lib.html
33.3376588821
[3] scrapy.utils.response
13.8655879498
So performance looks quite a bit worse without the regex compilation.
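The gap between [2] and [3] above is mostly per-call pattern setup: scrapy.utils.response compiles its regexes once at module level, while a generic helper has to build (or at least look up) its pattern on every call. A minimal sketch of that comparison — names and numbers here are illustrative, not taken from the PR:

```python
import re
import timeit

TEXT = '<html><head><script>var x;</script></head><body>hi</body></html>'

# compiled once at import time, the way scrapy.utils.response does it
_script_re = re.compile('<script.*?</script>', re.DOTALL)

def strip_precompiled(text):
    return _script_re.sub('', text)

def strip_on_the_fly(text):
    # builds the pattern string on every call; even with re's internal
    # cache, the formatting and cache lookup are repeated each time
    return re.sub('<%s.*?</%s>' % ('script', 'script'), '', text,
                  flags=re.DOTALL)

print(timeit.timeit(lambda: strip_precompiled(TEXT), number=100000))
print(timeit.timeit(lambda: strip_on_the_fly(TEXT), number=100000))
```

Both return identical output; only the setup cost differs.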
Could you please check it with 32K bodies (this is what AjaxCrawlMiddleware passes as the text argument)? I'm worried about the performance because when AjaxCrawlMiddleware is enabled this function is called for every response; according to #343 the existing implementation runs at 2k/sec, which is not very fast already.
54 vs 42 (~20% less speed) sounds fine
Errr. Now it's minimally faster ;)
I checked twice.
# current master
27.9134149551
# with w3lib
25.6779019833
script:
import timeit
timed = timeit.timeit("""
_has_ajaxcrawlable_meta(text)
""", setup="""
import requests
from scrapy.downloadermiddlewares.ajaxcrawl import _has_ajaxcrawlable_meta
r = requests.get("http://www.lipsum.com/feed/html?amount=32768&what=bytes&start=yes&generate=Generate+Lorem+Ipsum")
text = r.text[:32768]
""")
print(timed)
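For reproducibility, the same measurement can be made without the network fetch by synthesizing a ~32 KiB body locally. A sketch — the filler text is arbitrary, and the local has_ajaxcrawlable_meta is a simplified stand-in so the benchmark runs without scrapy installed, not scrapy's actual function:

```python
import re
import timeit

# synthesize a 32 KiB page carrying the ajaxcrawlable meta tag,
# instead of fetching one over the network with requests
head = '<html><head><meta name="fragment" content="!"/></head><body>'
filler = 'lorem ipsum dolor sit amet ' * 1300
text = (head + filler + '</body></html>')[:32768]

_ajax_crawlable_re = re.compile(
    r'<meta\s+name=["\']fragment["\']\s+content=["\']!["\']/?>')

def has_ajaxcrawlable_meta(text):
    # simplified stand-in for scrapy's _has_ajaxcrawlable_meta
    return _ajax_crawlable_re.search(text) is not None

print(timeit.timeit(lambda: has_ajaxcrawlable_meta(text), number=10000))
```

Swapping the stand-in for the real import (from scrapy.downloadermiddlewares.ajaxcrawl import _has_ajaxcrawlable_meta) reproduces the measurement in the thread.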
Actually more like the same... moving between 25 and 28 on my machine.
Yeah, you need a website with <meta name="fragment"> (e.g. a website created with wix.com) to see the difference; there are shortcuts for websites without this tag.
Hmm, those templates without content all look a lot smaller than 32kb. So I just copied the beginning of such a template with the <meta name="fragment" content="!"/> in it and filled the rest of it (the body) with lorem ipsum again.
Well that hurt. For comparability I kept the default loop number:
# master branch
1163.59052396
# w3lib (PR)
945.980225086
Since system load might have varied over that duration, here is another run with only 10,000 loops:
# master branch
11.6963450909
# w3lib (PR)
9.49189996719
I understand that's just a single-page test case. I don't feel like looking for more real-world cases, though, and I don't happen to have any broad-crawl data to look at.
(edit: the script is the same as before; I only pasted this custom page into text in the setup part.)
Thanks @nyov for checking that! As the speed difference is only in percents it looks OK to me. And did I get it right, is the code in this PR faster?
Strangely enough it times slightly faster... I don't even know why.
[backport][1.1] Remove duplicate code now handled by newer w3lib (PR #1881)
I can't say yet whether bumping the w3lib dependency to 1.13 is a good idea. But maybe it is once Scrapy 1.2 sees a release, so this might be for then or later.
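If and when the bump happens, it amounts to a one-line change to the install requirements. A sketch of the relevant setup.py fragment — the surrounding entries are placeholders, not Scrapy's actual dependency list:

```python
# setup.py fragment (sketch)
install_requires = [
    # ... other dependencies ...
    'w3lib>=1.13.0',  # minimum version discussed in this PR
]
```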