Explicitly ignore rel='download' links while looking for html pages. #677

Merged
merged 3 commits into pypa:develop from blaze33:develop on Oct 24, 2012

Conversation

@blaze33
Contributor

blaze33 commented Sep 14, 2012

Yesterday I got tired of having to wait 15 to 20 minutes every time I needed to reinstall the virtualenv of my current project (a bad connection also made this more noticeable). That is clearly too long considering I have a requirements.txt explicitly pinning each package version, with every package archive already present in the download cache folder.

Ideally I hoped to find an easy way to check the cache first when an explicit version is specified, but that would require too much work: the cache is currently only checked after analyzing every link of every requested page.

Nevertheless, after profiling several pip installations I found out that index.PackageFinder._get_queued_page added each and every link to the queue, even those explicitly marked as rel='download'. Those links are rejected later in the process by HTMLPage.get_page when it checks for content-type != text/html, but that costs us a header request for every link.

I propose not to include the rel='download' links in the page download queue.
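
For illustration, here is a minimal sketch of the idea, not pip's actual code (the LinkCollector class is invented for this example): when scanning an index page for further pages to crawl, anchors marked rel='download' point at archives and can be dropped up front, without spending a header request on them.

    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collect hrefs from a page, skipping rel='download' anchors."""

        def __init__(self):
            super().__init__()
            self.page_links = []

        def handle_starttag(self, tag, attrs):
            if tag != 'a':
                return
            attrs = dict(attrs)
            # rel='download' anchors point at archives, not HTML index
            # pages, so there is no need to queue them for a
            # content-type check.
            if attrs.get('rel') == 'download':
                return
            if 'href' in attrs:
                self.page_links.append(attrs['href'])

    collector = LinkCollector()
    collector.feed('<a rel="download" href="Django-1.4.tar.gz">sdist</a>'
                   '<a href="http://example.com/docs/">docs</a>')
    assert collector.page_links == ['http://example.com/docs/']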

A simple comparison with pip install -r requirements.txt shows the following improvement:
Current develop branch:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       59    0.041    0.001  172.957    2.931 index.py:220(_get_pages)

Patched develop branch:

       59    0.039    0.001   99.354    1.684 index.py:220(_get_pages)

As you can see, we save almost half of the time spent in _get_pages. With a single-package install like pip install Django==1.4, which has a lot of download links on http://pypi.python.org/simple/Django/, the time spent in _get_pages dropped from 9s to ~2s.
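
(For reference, numbers like these can be gathered with cProfile; a sketch, assuming a 2012-era pip whose main() is importable — the entry point differs in later versions:)

    import cProfile
    import pstats

    import pip  # 2012-era pip; `pip.main` is not importable in modern releases

    # Profile a full install run and save the stats to a file.
    cProfile.run("pip.main(['install', '-r', 'requirements.txt'])", 'pip.prof')

    # Print only the entries matching _get_pages, sorted by cumulative time.
    pstats.Stats('pip.prof').sort_stats('cumulative').print_stats('_get_pages')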

I ensured all tests pass.

Explicitly ignore rel='download' links while looking for html pages.
This way we avoid requesting archive headers just to see they're not
HTML pages.
@pnasrat
Contributor

pnasrat commented Oct 1, 2012

Any chance you can add a test in tests/test_index.py for this so we don't accidentally regress, etc.?


@blaze33
Contributor

blaze33 commented Oct 1, 2012

Sure, I'll have a look at it, hopefully this week. Any ideas about how to test it are welcome.

I didn't initially bother to write a test as it doesn't introduce new behavior, but you're right, I guess it's high time for me to dive into TDD.


@blaze33
Contributor

blaze33 commented Oct 2, 2012

The test verifies that pip correctly filters links when told to. It doesn't check that we actually exclude the rel='download' links in _get_queued_page; I wasn't sure if and how to test that, as threads are used to fetch the index pages.
Let me know if you have any issues with this.

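For illustration, one way around the threading problem is to unit-test only the link-scanning step in isolation. A self-contained sketch, not the actual tests/test_index.py code — the _Scanner class here is a toy stand-in for the scan done in _get_queued_page:

    from html.parser import HTMLParser

    class _Scanner(HTMLParser):
        """Toy stand-in for the link scan performed in _get_queued_page."""

        def __init__(self):
            super().__init__()
            self.queued = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == 'a' and 'href' in attrs and attrs.get('rel') != 'download':
                self.queued.append(attrs['href'])

    def test_rel_download_links_are_not_queued():
        scanner = _Scanner()
        scanner.feed('<a rel="download" href="INITools-0.2.tar.gz">sdist</a>'
                     '<a href="http://example.com/">homepage</a>')
        assert 'INITools-0.2.tar.gz' not in scanner.queued
        assert 'http://example.com/' in scanner.queued

    test_rel_download_links_are_not_queued()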

pnasrat added a commit that referenced this pull request Oct 24, 2012

Merge pull request #677 from blaze33/develop
Explicitly ignore rel='download' links while looking for html pages.

@pnasrat pnasrat merged commit f54f422 into pypa:develop Oct 24, 2012

1 check passed: The Travis build passed.
@pnasrat
Contributor

pnasrat commented Oct 24, 2012

@blaze33
Contributor

blaze33 commented Oct 24, 2012

Bummer. The following test fails:

FAIL: Test installing a package from the PyPI mirrors.
when running:
run_pip('install', '-vvv', '--use-mirrors', '--no-index', 'INITools==0.2')

INITools ends up being correctly installed; the problem is that nose catches an assert_no_stderr failure because httplib raises a BadStatusLine exception during the process.

According to the Python docs, http://docs.python.org/library/httplib.html#httplib.BadStatusLine:

A BadStatusLine is raised if a server responds with a HTTP status code that we don’t understand.

Could it be there was a problem with the PyPI mirrors during the test? I'm not really sure where to start looking.


@pnasrat
Contributor

pnasrat commented Oct 24, 2012

Yeah, I think it's a transient issue on a mirror that we should catch the exception for. As it passed elsewhere I'm happy to keep the change and just keep an eye on Travis.

Thanks for the patch.

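(For illustration, the kind of defensive handling suggested here — a sketch only, not pip's code; httplib was renamed http.client in Python 3:)

    from http.client import BadStatusLine  # `httplib` in Python 2
    from urllib.request import urlopen
    from urllib.error import URLError

    def fetch_from_mirror(url):
        """Fetch a URL, treating a garbled mirror response as a miss."""
        try:
            return urlopen(url).read()
        except BadStatusLine:
            # The mirror replied with a malformed HTTP status line; treat
            # it as transient so the caller can fall back to another mirror.
            return None
        except URLError:
            # Network-level failure; also worth falling back on.
            return None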

@blaze33
Contributor

blaze33 commented Oct 24, 2012

Ok, thanks for the feedback.

I just ran the failing test with Python 2.5 and it passed; I wasn't able to reproduce it.

Glad to see my first contribution merged, thanks :)

