Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache parse_links() by --find-links html page url #7729

Merged
merged 1 commit into from Apr 8, 2020

Conversation

cosmicexplorer
Copy link
Contributor

@cosmicexplorer cosmicexplorer commented Feb 13, 2020

pip._internal.index.collector.parse_links() can take a nontrivial amount of time for large --find-links html repos. This cost is paid every single time a --find-links page is consulted (once per transitive requirement resolved), and can therefore become quite lengthy for large resolves. See pex-tool/pex#887 and pex-tool/pex#888 for further background on how we arrived at this issue while overcoming blockers to https://github.com/pantsbuild/pants upgrading to pex version 2, which uses pip to resolve and fetch wheels.

This change caches the result of invoking parse_links() for a --find-links html repo, keyed by the HTMLPage's url. This means that other code in pip calling parse_links() can safely expect to idempotently resolve a set of links from an html page from any other thread, for example, without having to share any state between threads.

To reproduce the pathological behavior and demonstrate that this fix improves performance, check out this branch and execute this script:

#!/bin/bash

set -euxo pipefail

function generate-large-find-links-page {
  echo '<html><head><meta charset="utf-8"><head><body>'
  set +x
  for i in $(seq 50000); do
    echo '<a href="nonexistent-0.5.0.whl">nonexistent-0.5.0.whl</a>'
  done
  for w in *.whl *.tar.gz; do
    echo "<a href=\"${w}\">${w}</a>"
  done
  set -x
  echo '</body></html>'
}

python setup.py bdist_wheel

mkdir -pv tmp-whls/
pushd tmp-whls/
pip download tensorflow==1.14.0
generate-large-find-links-page > large-find-links-page.html
popd

time PYTHONPATH="$(pwd)/$(echo dist/*.whl)" python -m pip \
     -vvv download --no-index --find-links=./tmp-whls/large-find-links-page.html \
     tensorflow==1.14.0

time pip \
     -vvv download --no-index --find-links=./tmp-whls/large-find-links-page.html \
     tensorflow==1.14.0

@chrahunt
Copy link
Member

A decorator would help us keep this logic out of parse_links, like

from functools import lru_cache


class CacheablePage(object):
    def __init__(self, page):
        self.page = page

    def __hash__(self):
        return hash((page.content, page.encoding))


def with_cached_html_pages(fn):
    @lru_cache
    def wrapper(cacheable_page):
        return fn(cacheable_page.page)

    def wrapper_wrapper(page):
        return wrapper(CacheablePage(page))

    return wrapper_wrapper

...

@with_cached_html_pages
def parse_links(page):
   ...

@cosmicexplorer
Copy link
Contributor Author

Thank you -- that sounds great! I will make this change!

@uranusjr
Copy link
Member

uranusjr commented Feb 13, 2020

Is there a reason behind caching the result against the content, instead of the find-links URL/path? I kind of feel the former would be enough.

On lru_cache: this is not available on Python 2.7, which pip unfortunately still needs to support. But I feel it would be acceptable to only introduce this optimisation for Python 3 (and to make the cache an no-op that always misses on Python 2).

@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Feb 13, 2020

Is there a reason behind caching the result against the content, instead of the find-links URL/path? I kind of feel the former would be enough.

Super reasonable! This change was motivated because we saw the repeated message:

Request header has "max_age" as 0, cache bypassed

which we pinned down to

# We don't want to blindly returned cached data for
# /simple/, because authors generally expecting that
# twine upload && pip install will function, but if
# they've done a pip install in the last ~10 minutes
# it won't. Thus by setting this to zero we will not
# blindly use any cached data, however the benefit of
# using max-age=0 instead of no-cache, is that we will
# still support conditional requests, so we will still
# minimize traffic sent in cases where the page hasn't
# changed at all, we will just always incur the round
# trip for the conditional GET now instead of only
# once per 10 minutes.
# For more information, please see pypa/pip#5670.
"Cache-Control": "max-age=0",
in pex-tool/pex#887.

I agree that caching by the url path seems sufficient, and I can definitely see how that might be more "expected" behavior in some respect.

I'm thinking that (with the use case of #5670 in mind) if some build process publishes a change to a --find-links repo, we would want to make sure that newly-published wheels aren't cached at this layer of pip. Given that this is non-persistent caching, though, I can't see any obvious way that this caching keyed on url paths only would cause a regression on #5670, unless perhaps if pip is being invoked as a library by some persistent background process which concurrently performs a publish (which is not a use case I'm aware of, but seems maybe plausible).

But using the url path seems maybe cleaner to implement, and possibly easier to fold into some other caching via cachecontrol or elsewhere later on. I think url-based or content-based keying would definitely both fix pex-tool/pex#887. I'm not sure whether we're safe assuming pip is invoked as a subprocess mostly ephemerally (but I feel like we are?).

@uranusjr
Copy link
Member

Given that this is non-persistent caching, though, I can't see any obvious way that this caching keyed on url paths only would cause a regression on #5670, unless perhaps if pip is being invoked as a library by some persistent background process which concurrently performs a publish (which is not a use case I'm aware of, but seems maybe plausible).

Yup, I was thinking along a similar line as well, and lru_cache being non-persistent is the main reason I feel caching by URI would be correct. pip explicitly does not support being imported as a library, and already relies heavily on global states in other places, so it would be okay to do the same here (IMO). We can always add a cache-busting hook (lru_cache.cache_clear()) if by some off-chance we ever want to refactor pip into importable.

@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Feb 13, 2020

pip explicitly does not support being imported as a library, and already relies heavily on global states in other places

Great! Was being overly cautious! I think the CacheablePage keyed on url sounds great then -- coming right up!

@cosmicexplorer cosmicexplorer changed the title Cache parse_links() by --find-links html page content Cache parse_links() by --find-links html page url Feb 13, 2020
@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Feb 13, 2020

@uranusjr An unexpected consequence of keying by page url is that the testing in test_collector.py which invokes parse_links() within a @pytest.mark.parameterized() now needs to have a unique page url injected to have tests pass (as demonstrated in c399ba2). I believe that keying by page content and encoding avoids this workaround -- but I'm also definitely still ok keying by the url and leaving this small hack in place for testing!

@cosmicexplorer
Copy link
Contributor Author

I've detailed in pex-tool#5 some quick testing I did using the benchmark script in this PR's description on the effects of adding a similar cache for fetching of HTMLPages (on top of the parse_links() cache), and how it seems that this produces a nonzero speedup even when the --find-links html page is available locally (as opposed to being fetched from over a network). I can go into more depth on how I arrived at this conclusion as necessary.

Since the separate HTMLPage fetch caching I just added is keyed by url, and because of the above (keying the parse_links() cache by url requires adding a uuid salt to each url to test), I've also reverted the CacheablePageContent cache key for parse_links() to use the page's content and encoding to cache parse_links(), not just its url. This was implemented in 8809012.

benjyw pushed a commit to pex-tool/pip that referenced this pull request Feb 14, 2020
@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Feb 15, 2020

I've added type annotations for all of the things! I was unable to figure out how to get mypy to accept the:

try:
  from functools import lru_cache
except ImportError:
  def lru_cache(...):

pattern, as it would always see both the "try" and the "except" and therefore be unable to satisfy mypy for python 2 and 3. I have therefore created a method _lru_cache() which has a clear TODO on it to replace with @functools.lru_cache() after dropping python 2 support!

jsirois pushed a commit to jsirois/pip that referenced this pull request Feb 16, 2020
@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Feb 18, 2020

@uranusjr would love to know your thoughts on the change to additionally support caching HTMLPage downloads with @with_cached_link_fetch, which keys the cache by page url. It appears to have a performance benefit, but I don't have a benchmark offhand except from what was mentioned in pex-tool#5 (comment).

@uranusjr
Copy link
Member

I would guess the performance benefits (compared to only caching find-links pages) are marginal. Index pages are already per-package, and each package page is only read once in the vast majority of cases (exceptions being build-time dependencies such as setuptools). On the flip side though, the added memory requirements could be significant on constrained systems (cache entry of a well-known package is likely to be in the 100kb-500kb range) that I don’t feel comfortable blindly caching them all without adding a flag to disable it. I would suggest sticking to only caching find-links pages for now.

@cosmicexplorer
Copy link
Contributor Author

Thank you for the extremely thoughtful response! I had not considered at all memory-constrained systems, and I agree that the benefits are marginal enough not to warrant these tradeoffs (we do not have any benchmarks showing any further speedup -- this was a mistake on my part, being too eager to find another optimization). I really appreciate you being so specific and helpful in your response! I will revert the change to add an HTMLPage fetch cache, and I will revert the cache key for parse_links() to key by url!

@cosmicexplorer
Copy link
Contributor Author

@uranusjr ping! I have removed the separate caching by page content and have just keyed parse_links() by url. I would love to fix any further issues :) no rush, thanks so much for your input so far!

@uranusjr
Copy link
Member

Doesn’t parse_links() also handle index pages (e.g. pypi.org)? I am under the impression it does (this could be wrong since I didn’t actually read the implementation), and I don’t really want to cache them.

@cosmicexplorer
Copy link
Contributor Author

Thank you! I will check that and avoid caching index pages if so!

@cosmicexplorer
Copy link
Contributor Author

cosmicexplorer commented Feb 29, 2020

@uranusjr in beac52a I have marked Link objects coming from index urls with a new instance field is_index_url for Link, which is then transferred to the HTMLPage object created from the Link in _get_html_page(). I have then modified the @with_cached_html_pages wrapper to avoid using the cache if is_index_url=True.

I also added a unit test for this selective caching logic (see page_3 in test_collector.py). However, since I still had to modify a couple other files to add is_index_url, it might make sense to add a more "end-to-end" test which will actually exercise the code path where we construct Link(url, is_index_url=True). I'll see if there are any good examples of that in the codebase already.

@cosmicexplorer
Copy link
Contributor Author

@uranusjr ping! I have added appropriate testing and a couple docstrings.

@cosmicexplorer
Copy link
Contributor Author

Ping again!

@uranusjr
Copy link
Member

Hi! Sorry I’ve been a bit tied up recently, I will find some time to review this entirely this weekend.

@cosmicexplorer
Copy link
Contributor Author

Thanks!! Not a problem!! No rush on my end ^_^ Thanks so much for your efforts to make this change work!

Copy link
Member

@uranusjr uranusjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks sound to me, but I still have reservations about vendoring lru_cache (see inline comments).

news/7729.bugfix Outdated Show resolved Hide resolved
cache[cache_key] = value
return value
return wrapped
return wrapper
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether we should simply make the cache a no-op on Python 2. Something like:

F = TypeVar('F', Callable)

def stub_lru_cache(maxsize=None):
    # type: (Optional[int]) -> Callable[[F], F]
    def _wrapper(f):
        # type: (F) -> F
        return f
    return _wrapper
try:
    from functools import lru_cache as real_lru_cache
except ImportError:
    _lru_cache = stub_lru_cache
else:
    _lru_cache = real_lru_cache

There are probably a couple of typing hoops we need to jump through, but the point is not avoid vendoring a duplicated lru_cache altogether.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of references for convinience, if we are going down this route:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have slightly modified your example to make mypy and flake8 accept it:

try:
from functools import lru_cache as _lru_cache # type: ignore
except ImportError:
def _lru_cache(maxsize=None): # type: ignore
# type: (Optional[int]) -> Callable[[F], F]
def _wrapper(f):
# type: (F) -> F
return f
return _wrapper

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a patch against your repo to replace the type: ignore comments with getattr() and a callback protocol that IMO is easier to follow. You can merge that PR to include it in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for diving into this and figuring out the Protocol usage!! I have merged the PR!

Copy link
Member

@xavfernandez xavfernandez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems weird to set uncacheable_links to True for PyPI as the links could actually be cached: it's just that it would not be useful (if I'm not mistaken).
If that's the case, a simple cache_link_parsing=False argument might make more sense ?
(with

                (Link(url) for url in index_url_loc),
                (Link(url, cache_link_parsing=True) for url in fl_url_loc),

)

src/pip/_internal/index/collector.py Show resolved Hide resolved
tests/unit/test_collector.py Outdated Show resolved Hide resolved
@cosmicexplorer
Copy link
Contributor Author

If that's the case, a simple cache_link_parsing=False argument might make more sense ?

Absolutely!! I'll make this change!

@cosmicexplorer cosmicexplorer force-pushed the cache-parse-links branch 4 times, most recently from 50030c6 to d7318e6 Compare April 7, 2020 19:00
add failing test

apply the fix

add template NEWS entry according to https://pip.pypa.io/en/latest/development/contributing/#news-entries (wrong PR #)

rename news entry to the current PR #

respond to review comments

fix test failures

fix tests by adding uuid salt in urls

cache html page fetching by link

make CI pass (?)

make the types much better

finally listen to the maintainer and cache parse_links() by url :)

avoid caching parse_links() when the url is an index url

cleanup

add testing for uncachable marking

only conditionally vendor _lru_cache for py2

bugfix => feature

python 2 does not cache!

Do away with type: ignore with getattr()

respond to review comments
@xavfernandez
Copy link
Member

@pradyunsg this should be ready for version 20.1

@pradyunsg pradyunsg added this to the 20.1 milestone Apr 8, 2020
@pradyunsg
Copy link
Member

@xavfernandez feel free to click the merge button! :)

@xavfernandez xavfernandez merged commit 327a315 into pypa:master Apr 8, 2020
@xavfernandez
Copy link
Member

Thanks @cosmicexplorer :)

bors bot added a commit to duckinator/emanate that referenced this pull request May 13, 2020
118: Update pip to 20.1 r=duckinator a=pyup-bot


This PR updates [pip](https://pypi.org/project/pip) from **20.0.2** to **20.1**.



<details>
  <summary>Changelog</summary>
  
  
   ### 20.1
   ```
   =================

Process
-------

- Document that pip 21.0 will drop support for Python 2.7.

Features
--------

- Add ``pip cache dir`` to show the cache directory. (`7350 &lt;https://github.com/pypa/pip/issues/7350&gt;`_)

Bug Fixes
---------

- Abort pip cache commands early when cache is disabled. (`8124 &lt;https://github.com/pypa/pip/issues/8124&gt;`_)
- Correctly set permissions on metadata files during wheel installation,
  to permit non-privileged users to read from system site-packages. (`8139 &lt;https://github.com/pypa/pip/issues/8139&gt;`_)
   ```
   
  
  
   ### 20.1b1
   ```
   ===================

Deprecations and Removals
-------------------------

- Remove emails from AUTHORS.txt to prevent usage for spamming, and only populate names in AUTHORS.txt at time of release (`5979 &lt;https://github.com/pypa/pip/issues/5979&gt;`_)
- Remove deprecated ``--skip-requirements-regex`` option. (`7297 &lt;https://github.com/pypa/pip/issues/7297&gt;`_)
- Building of local directories is now done in place, instead of a temporary
  location containing a copy of the directory tree. (`7555 &lt;https://github.com/pypa/pip/issues/7555&gt;`_)
- Remove unused ``tests/scripts/test_all_pip.py`` test script and the ``tests/scripts`` folder. (`7680 &lt;https://github.com/pypa/pip/issues/7680&gt;`_)

Features
--------

- pip now implements PEP 610, so ``pip freeze`` has better fidelity
  in presence of distributions installed from Direct URL requirements. (`609 &lt;https://github.com/pypa/pip/issues/609&gt;`_)
- Add ``pip cache`` command for inspecting/managing pip&#39;s wheel cache. (`6391 &lt;https://github.com/pypa/pip/issues/6391&gt;`_)
- Raise error if ``--user`` and ``--target`` are used together in ``pip install`` (`7249 &lt;https://github.com/pypa/pip/issues/7249&gt;`_)
- Significantly improve performance when ``--find-links`` points to a very large HTML page. (`7729 &lt;https://github.com/pypa/pip/issues/7729&gt;`_)
- Indicate when wheel building is skipped, due to lack of the ``wheel`` package. (`7768 &lt;https://github.com/pypa/pip/issues/7768&gt;`_)
- Change default behaviour to always cache responses from trusted-host source. (`7847 &lt;https://github.com/pypa/pip/issues/7847&gt;`_)
- An alpha version of a new resolver is available via ``--unstable-feature=resolver``. (`988 &lt;https://github.com/pypa/pip/issues/988&gt;`_)

Bug Fixes
---------

- Correctly freeze a VCS editable package when it is nested inside another VCS repository. (`3988 &lt;https://github.com/pypa/pip/issues/3988&gt;`_)
- Correctly handle ``%2F`` in URL parameters to avoid accidentally unescape them
  into ``/``. (`6446 &lt;https://github.com/pypa/pip/issues/6446&gt;`_)
- Reject VCS URLs with an empty revision. (`7402 &lt;https://github.com/pypa/pip/issues/7402&gt;`_)
- Warn when an invalid URL is passed with ``--index-url`` (`7430 &lt;https://github.com/pypa/pip/issues/7430&gt;`_)
- Use better mechanism for handling temporary files, when recording metadata
  about installed files (RECORD) and the installer (INSTALLER). (`7699 &lt;https://github.com/pypa/pip/issues/7699&gt;`_)
- Correctly detect global site-packages availability of virtual environments
  created by PyPA’s virtualenv&gt;=20.0. (`7718 &lt;https://github.com/pypa/pip/issues/7718&gt;`_)
- Remove current directory from ``sys.path`` when invoked as ``python -m pip &lt;command&gt;`` (`7731 &lt;https://github.com/pypa/pip/issues/7731&gt;`_)
- Stop failing uninstallation, when trying to remove non-existent files. (`7856 &lt;https://github.com/pypa/pip/issues/7856&gt;`_)
- Prevent an infinite recursion with ``pip wheel`` when ``$TMPDIR`` is within the source directory. (`7872 &lt;https://github.com/pypa/pip/issues/7872&gt;`_)
- Significantly speedup ``pip list --outdated`` by parallelizing index interaction. (`7962 &lt;https://github.com/pypa/pip/issues/7962&gt;`_)
- Improve Windows compatibility when detecting writability in folder. (`8013 &lt;https://github.com/pypa/pip/issues/8013&gt;`_)

Vendored Libraries
------------------

- Update semi-supported debundling script to reflect that appdirs is vendored.
- Add ResolveLib as a vendored dependency.
- Upgrade certifi to 2020.04.05.1
- Upgrade contextlib2 to 0.6.0.post1
- Upgrade distro to 1.5.0.
- Upgrade idna to 2.9.
- Upgrade msgpack to 1.0.0.
- Upgrade packaging to 20.3.
- Upgrade pep517 to 0.8.2.
- Upgrade pyparsing to 2.4.7.
- Remove pytoml as a vendored dependency.
- Upgrade requests to 2.23.0.
- Add toml as a vendored dependency.
- Upgrade urllib3 to 1.25.8.

Improved Documentation
----------------------

- Emphasize that VCS URLs using git, git+git and git+http are insecure due to
  lack of authentication and encryption (`1983 &lt;https://github.com/pypa/pip/issues/1983&gt;`_)
- Clarify the usage of --no-binary command. (`3191 &lt;https://github.com/pypa/pip/issues/3191&gt;`_)
- Clarify the usage of freeze command in the example of Using pip in your program (`7008 &lt;https://github.com/pypa/pip/issues/7008&gt;`_)
- Add a &quot;Copyright&quot; page. (`7767 &lt;https://github.com/pypa/pip/issues/7767&gt;`_)
- Added example of defining multiple values for options which support them (`7803 &lt;https://github.com/pypa/pip/issues/7803&gt;`_)
   ```
   
  
</details>


 

<details>
  <summary>Links</summary>
  
  - PyPI: https://pypi.org/project/pip
  - Changelog: https://pyup.io/changelogs/pip/
  - Homepage: https://pip.pypa.io/
</details>



Co-authored-by: pyup-bot <github-bot@pyup.io>
@lock lock bot added the auto-locked Outdated issues that have been locked by automation label May 21, 2020
@lock lock bot locked as resolved and limited conversation to collaborators May 21, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
auto-locked Outdated issues that have been locked by automation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants