
[MRG+1] Remove duplicate code now handled by newer w3lib #1881

Merged: 1 commit into scrapy:master on Apr 6, 2016

Conversation

@nyov (Contributor) commented Mar 26, 2016

Can't say if bumping the w3lib dependency to 1.13 is a good idea yet. But maybe it is once scrapy 1.2 sees a release. So this might be for then or later.

@nyov (Contributor, Author) commented Mar 28, 2016

Oops. Fixed ajaxcrawl middleware.

The remaining CI error in the docs build can't be related, though:

```
Exception occurred:
  File "conf.py", line 113, in <module>
    import sphinx_rtd_theme
ImportError: No module named 'sphinx_rtd_theme'
```

@nyov force-pushed the dedupe branch 2 times, most recently from f4d03e1 to d3c8665 on March 29, 2016 16:52
@redapple (Contributor) commented:
@nyov, the CI error on the docs should be fixed with #1893.

@codecov-io commented:
Current coverage is 83.17%. Merging #1881 into master will decrease coverage by 0.01% (as of 5e7ea2b).


```diff
-text = _script_re.sub(u'', text)
-text = _noscript_re.sub(u'', text)
+text = html.remove_comments(html.replace_entities(text))
+text = html.remove_tags_with_content(text, ('script', 'noscript'))
```
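For context, the regex-based approach being removed can be sketched with plain stdlib `re` (the exact patterns in scrapy.utils.response differ; these are illustrative approximations), doing roughly what `w3lib.html.remove_tags_with_content` now handles:

```python
import re

# Rough stand-ins for the regexes the PR removes from scrapy.utils.response;
# the real patterns there differ, these approximations are for illustration.
_script_re = re.compile(r'<script.*?>.*?</script>', re.IGNORECASE | re.DOTALL)
_noscript_re = re.compile(r'<noscript.*?>.*?</noscript>', re.IGNORECASE | re.DOTALL)

def strip_old(text):
    """Old approach: two pre-compiled regex substitutions."""
    return _noscript_re.sub('', _script_re.sub('', text))

body = '<html><body><script>x()</script><noscript>no js</noscript><p>hi</p></body></html>'
print(strip_old(body))  # -> <html><body><p>hi</p></body></html>
```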
Member commented:

Could you please check that the performance is not much worse?

@nyov (Contributor, Author) commented:

Yup. Done.

script:

```python
import timeit

timed = timeit.timeit("""
texts = [
    '<html><head><meta name="fragment"  content="!"/></head><body></body></html>',
    "<html><head><meta name='fragment' content='!'></head></html>",
    '<html><head><!--<meta name="fragment"  content="!"/>--></head><body></body></html>',
    '<html></html>',
]

# [1] all of the ajaxcrawl function
#for text in texts:
#    _has_ajaxcrawlable_meta(text)

# [2] w3lib
#for text in texts:
#    remove_tags_with_content(text, ('script', 'noscript'))

# [3] scrapy.utils.response
for text in texts:
    _noscript_re.sub(u'', _script_re.sub(u'', text))
""", setup="""
# [1]
#from scrapy.downloadermiddlewares.ajaxcrawl import _has_ajaxcrawlable_meta
# [2]
#from w3lib.html import remove_tags_with_content
# [3]
from scrapy.utils.response import _noscript_re, _script_re
""")

print(timed)
```

Timings for 1,000,000 loops (timeit's default number):

```
[1] full function, master branch (scrapy.utils.response): 42.3605451584
[1] full function, this PR branch (w3lib):                54.5658538342
[2] w3lib.html only:                                      33.3376588821
[3] scrapy.utils.response regexes only:                   13.8655879498
```

So performance looks quite a bit worse without the pre-compiled regexes.
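The compilation point can be illustrated with a small stdlib-only sketch (numbers vary by machine; note that Python itself caches compiled patterns, so the module-level `re.sub` call mostly pays a per-call cache-lookup and argument-handling cost rather than a full recompile):

```python
import re
import timeit

text = '<script>x()</script><p>lorem ipsum</p>' * 400  # roughly 15 KB sample body

pattern = re.compile(r'<script.*?>.*?</script>', re.IGNORECASE | re.DOTALL)

# Pre-compiled pattern object: no per-call cache lookup.
t_compiled = timeit.timeit(lambda: pattern.sub('', text), number=2000)

# Module-level re.sub: re-resolves the pattern through re's internal
# cache on every call; the pattern is not recompiled, but lookup and
# argument handling add overhead.
t_module = timeit.timeit(
    lambda: re.sub(r'<script.*?>.*?</script>', '', text,
                   flags=re.IGNORECASE | re.DOTALL),
    number=2000)

print(t_compiled, t_module)
```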

Member commented:

Could you please check it with 32K bodies (this is what AjaxCrawlMiddleware passes as the text argument)? I'm worried about performance because, when AjaxCrawlMiddleware is enabled, this function is called for every response; according to #343 the existing implementation runs at 2k/sec, which is not very fast already.

Member commented:

54 vs. 42 (~20% less speed) sounds fine.

@nyov (Contributor, Author) commented:

Errr. Now it's minimally faster ;)
I checked twice.

```
# current master
27.9134149551

# with w3lib
25.6779019833
```

script:

```python
import timeit

timed = timeit.timeit("""
_has_ajaxcrawlable_meta(text)
""", setup="""
import requests
from scrapy.downloadermiddlewares.ajaxcrawl import _has_ajaxcrawlable_meta
r = requests.get("http://www.lipsum.com/feed/html?amount=32768&what=bytes&start=yes&generate=Generate+Lorem+Ipsum")
text = r.text[:32768]
""")

print(timed)
```

@nyov (Contributor, Author) commented:

Actually it's more like the same... values move between 25 and 28 on my machine.

Member commented:

Yeah, you need a website with <meta name="fragment"> (e.g. a website created with wix.com) to see the difference; there are shortcuts for websites without this tag.

@nyov (Contributor, Author) commented:

Hmm, those templates without content all look a lot smaller than 32 KB. So I copied the beginning of such a template, with the <meta name="fragment" content="!"/> in it, and filled the rest of the body with lorem ipsum again.

Well that hurt. For comparability I kept the default loop number:

```
# master branch
1163.59052396

# w3lib (PR)
945.980225086
```

Since system load might have varied over that duration, here is another run with only 10,000 loops:

```
# master branch
11.6963450909

# w3lib (PR)
9.49189996719
```

I understand that's just a single-page test case. I don't feel like looking for more real-world cases though; I don't happen to have any broad-crawl data to look up.

(Edit: the script is the same as before; I only pasted this custom page into `text` in the setup part.)

Member commented:

Thanks @nyov for checking that! As the speed difference is only a few percent, it looks OK to me. And did I get it right that the code in this PR is faster?

@nyov (Contributor, Author) commented:

Strangely enough it times slightly faster... I don't even know why.

@kmike kmike changed the title Remove duplicate code now handled by newer w3lib [MRG+1] Remove duplicate code now handled by newer w3lib Mar 31, 2016
@redapple redapple merged commit cbb695d into scrapy:master Apr 6, 2016
@nyov nyov deleted the dedupe branch April 8, 2016 17:05
redapple added a commit that referenced this pull request Apr 12, 2016
[backport][1.1] Remove duplicate code now handled by newer w3lib (PR #1881)