
[MRG+1] Remove duplicate code now handled by newer w3lib #1881

Merged: 1 commit merged into scrapy:master from nyov:dedupe on Apr 6, 2016

Conversation

@nyov (Contributor) commented Mar 26, 2016

I can't say yet whether bumping the w3lib dependency to 1.13 is a good idea. But maybe it is once scrapy 1.2 sees a release, so this might be for then or later.

nyov force-pushed the nyov:dedupe branch from 544b163 to 5bc1a7c on Mar 28, 2016
@nyov (Author, Contributor) commented Mar 28, 2016

Oops. Fixed the ajaxcrawl middleware.

The remaining CI error in the docs build can't be related to this change, though:

```
Exception occurred:
  File "conf.py", line 113, in <module>
    import sphinx_rtd_theme
ImportError: No module named 'sphinx_rtd_theme'
```
nyov force-pushed the nyov:dedupe branch 2 times, most recently from f4d03e1 to d3c8665, on Mar 29, 2016
@redapple (Contributor) commented Mar 30, 2016

@nyov, the CI error on the docs should be fixed by #1893.

nyov force-pushed the nyov:dedupe branch from d3c8665 to 3787fec on Mar 30, 2016
@codecov-io commented Mar 30, 2016

Current coverage is 83.17%. Merging #1881 into master will decrease coverage by 0.01% (as of 5e7ea2b).

Inline review thread on the changed lines:

```python
text = _script_re.sub(u'', text)
text = _noscript_re.sub(u'', text)
text = html.remove_comments(html.replace_entities(text))
text = html.remove_tags_with_content(text, ('script', 'noscript'))
```
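As a quick sanity check of what the w3lib call does (assumes w3lib >= 1.13; the expected output is shown as a comment):

```python
from w3lib.html import remove_tags_with_content

# removes the listed tags together with everything inside them
doc = u'<body><script>alert(1)</script>hello<noscript>nojs</noscript></body>'
print(remove_tags_with_content(doc, ('script', 'noscript')))
# expected: <body>hello</body>
```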

@kmike (Member) commented Mar 30, 2016

could you please check that the performance is not much worse?

@nyov (Author, Contributor) commented Mar 30, 2016

Yup. Done.

script:

```python
import timeit

timed = timeit.timeit("""
texts = [
    '<html><head><meta name="fragment"  content="!"/></head><body></body></html>',
    "<html><head><meta name='fragment' content='!'></head></html>",
    '<html><head><!--<meta name="fragment"  content="!"/>--></head><body></body></html>',
    '<html></html>',
]

# [1] all of the ajaxcrawl func
#for text in texts:
#    _has_ajaxcrawlable_meta(text)

# [2] w3lib
#for text in texts:
#    remove_tags_with_content(text, ('script', 'noscript'))

# [3] scrapy.utils.response
for text in texts:
    _noscript_re.sub(u'', _script_re.sub(u'', text))
""", setup="""
# [1]
#from scrapy.downloadermiddlewares.ajaxcrawl import _has_ajaxcrawlable_meta
# [2]
#from w3lib.html import remove_tags_with_content
# [3]
from scrapy.utils.response import _noscript_re, _script_re
""")

print(timed)
```

timings for 1000000 (timeit.default_number) loops:

```
[1] on master branch (scrapy.utils.response)
42.3605451584
[1] on this PR branch (w3lib)
54.5658538342

[2] w3lib.html
33.3376588821

[3] scrapy.utils.response
13.8655879498
```

So performance looks quite a bit worse without the up-front regex compilation.
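A minimal sketch of where the gap plausibly comes from: the module-level patterns are compiled once at import time, while a generic tag-list helper has to assemble its regex from the arguments on every call (illustrative patterns only, not scrapy's exact ones):

```python
import re

# compiled once at import time; every call just reuses the pattern object
_script_re = re.compile(u'<script.*?</script>', re.DOTALL | re.IGNORECASE)

def strip_scripts_precompiled(text):
    return _script_re.sub(u'', text)

def strip_tags_per_call(text, tags=('script', 'noscript')):
    # a generic helper builds the pattern from its arguments on each call
    # (re's internal cache softens, but does not remove, the cost)
    pattern = u'|'.join(u'<%s.*?</%s>' % (tag, tag) for tag in tags)
    return re.sub(pattern, u'', text, flags=re.DOTALL | re.IGNORECASE)
```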

@kmike (Member) commented Mar 30, 2016

Could you please check it with 32K bodies (this is what AjaxCrawlMiddleware passes as the text argument)? I'm worried about the performance because when AjaxCrawlMiddleware is enabled this function is called for every response; according to #343, the existing implementation runs at ~2k responses/sec, which is not very fast already.

@kmike (Member) commented Mar 30, 2016

54 vs 42 (~20% less speed) sounds fine

@nyov (Author, Contributor) commented Mar 30, 2016

Errr. Now it's minimally faster ;)
I checked twice.

```
# current master
27.9134149551

# with w3lib
25.6779019833
```

script:

```python
import timeit

timed = timeit.timeit("""
_has_ajaxcrawlable_meta(text)
""", setup="""
import requests
from scrapy.downloadermiddlewares.ajaxcrawl import _has_ajaxcrawlable_meta
r = requests.get("http://www.lipsum.com/feed/html?amount=32768&what=bytes&start=yes&generate=Generate+Lorem+Ipsum")
text = r.text[:32768]
""")

print(timed)
```
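If fetching lorem ipsum over the network is a concern, the same measurement can be done offline by generating the filler locally (a sketch, not what was actually run here):

```python
import timeit

timed = timeit.timeit("""
_has_ajaxcrawlable_meta(text)
""", setup="""
from scrapy.downloadermiddlewares.ajaxcrawl import _has_ajaxcrawlable_meta

# build a ~32K body locally instead of downloading one
filler = u'Lorem ipsum dolor sit amet. ' * 2000
text = (u'<html><head></head><body>' + filler)[:32768] + u'</body></html>'
""", number=10000)

print(timed)
```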

@nyov (Author, Contributor) commented Mar 30, 2016

Actually, more like the same... moving between 25 and 28 on my machine.

@kmike (Member) commented Mar 30, 2016

Yeah, you need a website with <meta name="fragment"> (e.g. a website created with wix.com) to see the difference; there are shortcuts for websites without this tag.

@nyov (Author, Contributor) commented Mar 31, 2016

Hmm, those templates without content all look a lot smaller than 32 KB. So I just copied the beginning of such a template, with the <meta name="fragment" content="!"/> in it, and filled the rest (the body) with lorem ipsum again.

Well, that hurt. For comparability I kept the default loop number:

```
# master branch
1163.59052396

# w3lib (PR)
945.980225086
```

Since system load might have varied over that duration, another run with only 10,000 loops:

```
# master branch
11.6963450909

# w3lib (PR)
9.49189996719
```

I understand that's just a single-page test case. I don't feel like looking for more real-world cases, though; I don't happen to have any broad crawl data to look up.

(edit: the script is the same as before, I only pasted this custom page into text in the setup part.)
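The custom page itself isn't pasted here, but its shape can be approximated like this (a sketch of the setup text only, not the exact template used):

```python
# ~32K page that takes the slow path: the ajaxcrawlable <meta> tag is present,
# so the early-out shortcuts in _has_ajaxcrawlable_meta don't apply
head = u'<html><head><meta name="fragment" content="!"/></head><body>'
filler = u'Lorem ipsum dolor sit amet. ' * 2000
text = (head + filler)[:32768] + u'</body></html>'
```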

@kmike (Member) commented Mar 31, 2016

Thanks @nyov for checking that! As the speed difference is only a few percent, it looks OK to me. And did I get it right that the code in this PR is faster?

@nyov (Author, Contributor) commented Mar 31, 2016

Strangely enough it times slightly faster... I don't even know why.

kmike changed the title from "Remove duplicate code now handled by newer w3lib" to "[MRG+1] Remove duplicate code now handled by newer w3lib" on Mar 31, 2016
redapple merged commit cbb695d into scrapy:master on Apr 6, 2016
2 checks passed:
codecov/patch: 100.00% of diff hit (target 100.00%)
continuous-integration/travis-ci/pr: The Travis CI build passed
nyov deleted the nyov:dedupe branch on Apr 8, 2016
redapple added a commit that referenced this pull request on Apr 12, 2016:
[backport][1.1] Remove duplicate code now handled by newer w3lib (PR #1881)