Allow users to override default PyPI index URL with PyPI mirror URLs (without modifying Pipfile) #2075

Closed
JacobHenner opened this Issue Apr 27, 2018 · 46 comments

@JacobHenner
Contributor

JacobHenner commented Apr 27, 2018

Hello all,

The situation

Currently, there is no easy way to override the default PyPI index URL to use a URL pointed at a mirror. In corporate environments, requiring developers to use a repository mirror is quite common:

  1. Corporate firewalls prohibit access to external software repositories.
  2. Internal repository mirrors conduct malware and vulnerability analysis, which can be a compliance requirement.
  3. Internal mirrors preserve modules that might later be unavailable upstream (due to outage, deletion, etc), which is necessary to ensure the availability and auditability of modules used within the company's environment.

Unfortunately, this doesn't appear to be easily accommodated by pipenv. Although the mirror could be explicitly added to the Pipfile as the source for these packages, this breaks portability.

  1. Projects initialized internally will contain unreachable indexes if published externally. Users of the public version will have to modify the Pipfile prior to installing the module's dependencies.
  2. Projects initialized externally will not work internally without modification of the Pipfile. These modifications must be maintained locally (but not shared), and reapplied if the Pipfile changes upstream.

There should be a way to override the location of the PyPI index, by specifying a (true) mirror. This would only be applicable to PyPI, and not to other third-party repositories (these would still be specified explicitly in the Pipfile).

General proposal

Docker accommodates this situation by allowing the user to specify a registry mirror in the daemon's configuration file. Likewise, it'd be great if the pipenv user could specify a (true) mirror for PyPI, via an environment variable, configuration file, or command line parameter. If this value is set, pipenv should use the mirror for all PyPI packages, even if a connection to PyPI is available. In some corporate environments, PyPI remains unblocked, but policy dictates that the mirror is used for the other reasons mentioned above.

Implementation considerations

  1. Pip already allows users to override the default index url through pip's configuration file. Although this would likely be the most obvious source of the internal mirror's url (and would likely be set for these users), this parameter can be used for repositories that aren't true mirrors. Accordingly, it's probably unsuitable for this purpose.
  2. For modules whose dependencies are all available on PyPI, it's my understanding that the explicit source can be removed from the Pipfile, and pip's default will be used. Unfortunately, this does not apply to projects with modules outside of PyPI. Furthermore, since the Pipfile generation process is explicit by default, many existing projects would have to modify their qualifying Pipfiles to accommodate this pattern by removing the default index url.
  3. If an environment variable is set as a source in the Pipfile, the variable could be optionally set to provide a mirror. Unfortunately, this requires existing projects to modify their Pipfiles to accommodate this pattern, which is not ideal.
  4. If an environment variable, command line parameter, or configuration setting is used to override the PyPI index url with a (true) mirror, how would the override work? Would it assume the mirror's index should be specified in all calls to pip which would otherwise use the PyPI index? Would it require a change to existing Pipfiles? Would it require redefining how sources are specified, including an overrideable PyPI default? Something else?
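
Consideration 3 can be illustrated with a minimal sketch. It assumes a hypothetical variable name, PYPI_INDEX_URL, and a hypothetical helper, resolve_source_url; neither is taken from pipenv. The helper expands the variable when it is set and falls back to the public index otherwise, which is roughly the behaviour a project would need if it adopted this pattern.

```python
import string

PUBLIC_INDEX = "https://pypi.org/simple"


def resolve_source_url(template, environ):
    """Expand ${VARS} in a source URL (hypothetical helper, not pipenv's code)."""
    # Assumption: an unset PYPI_INDEX_URL falls back to the public index.
    values = {"PYPI_INDEX_URL": PUBLIC_INDEX}
    values.update(environ)
    return string.Template(template).safe_substitute(values)


# Inside the company the variable points at the mirror ...
internal = resolve_source_url(
    "${PYPI_INDEX_URL}",
    {"PYPI_INDEX_URL": "https://pypi-proxy.example.com/api"},
)
# ... while outside it is unset and the public index is used.
external = resolve_source_url("${PYPI_INDEX_URL}", {})
```

The drawback noted above still applies: every existing Pipfile would have to be edited to use the variable.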

Related discussion
#1451
#1783

This has been discussed in #python and #pypa on Freenode. After some constructive back-and-forth, it was decided that it'd be helpful to open an issue here for discussion. I appreciate everyone's effort towards resolving this issue.

@techalchemy

Member

techalchemy commented Apr 27, 2018

I am persuaded that this is a thing that happens commonly (corporate FW / caching proxy) -- I feel we need an override setting to specify a mirror to use instead of pypi if we find it in the pipfile-- like PIPENV_PYPI_MIRROR or PIPENV_PYPI_CACHING_PROXY or something like that to specify that it should be tried first, sliced into sources in front of pypi basically.

Does that seem like it accomplishes the goal? If so, we can tag in the implementation genie to tell us why this is good or bad (@ncoghlan)

@ncoghlan

Member

ncoghlan commented Apr 27, 2018

I'll start with a note of caution: until PyPI has implemented a package signing mechanism akin to PEP 458 to provide a TLS-independent way for pip to ensure that packages that nominally originate from PyPI actually match what PyPI published, then offering the ability to transparently redirect traffic to a different server is genuinely concerning from a security perspective.

Unfortunately, that particular attack vector is already open by way of pip.conf, so offering something comparable at the pipenv level isn't going to make anything any worse than it already is.

Beyond that, I think a general purpose repository URL rewriting mechanism could actually be easier to document and explain than something PyPI specific, at least at the base capability layer. Something like:

pipenv --override-source-url 'default=https://pypi-proxy.example.com/api' --override-source-url 'https://pypi.python.org/simple=https://pypi-proxy.example.com/api'  --override-source-url 'https://pypi.org/simple=https://pypi-proxy.example.com/api' install

(The only PyPI specific bit there would be using "default" to refer to pip's default download source, as specified in pip.conf).

Spelling out the entire source URL override map every time would be unwieldy to use in practice though, so a couple of options for CLI sugar might look like:

pipenv --override-source-urls <config file> install

pipenv --pypi-mirror https://pypi-proxy.example.com/api install

Whether or not to expose the --override-source-url layer immediately is a different question - it might make more sense to implement the simpler --pypi-mirror option first, and merely keep the possibility of --override-source-url and --override-source-urls as possible future options in mind while doing so.
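
As a rough illustration of the general rewriting layer, the sketch below parses ORIGINAL=REPLACEMENT pairs into a mapping and applies it to a source URL. The function names parse_overrides and apply_override are hypothetical, and ignoring trailing slashes when matching is an assumption made for this example rather than part of the proposal.

```python
def parse_overrides(pairs):
    """Turn 'ORIGINAL=REPLACEMENT' strings into a dict (hypothetical helper)."""
    overrides = {}
    for pair in pairs:
        # Split on the first '=' only, so the replacement URL may contain '='.
        original, sep, replacement = pair.partition("=")
        if not sep or not original or not replacement:
            raise ValueError("expected ORIGINAL=REPLACEMENT, got %r" % pair)
        # Assumption: trailing slashes are ignored when matching.
        overrides[original.rstrip("/")] = replacement
    return overrides


def apply_override(url, overrides):
    """Return the replacement for url, or url itself when no rule matches."""
    return overrides.get(url.rstrip("/"), url)


overrides = parse_overrides([
    "default=https://pypi-proxy.example.com/api",
    "https://pypi.python.org/simple=https://pypi-proxy.example.com/api",
    "https://pypi.org/simple=https://pypi-proxy.example.com/api",
])
rewritten = apply_override("https://pypi.org/simple/", overrides)
untouched = apply_override("https://private.example.com/simple", overrides)
```

With a mapping like this in place, --pypi-mirror could be implemented as sugar that fills in the PyPI entries automatically.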

@njsmith

Member

njsmith commented Apr 27, 2018

A general {given URL: override URL} mapping was my first thought too, but on further consideration, there are some arguments for special-casing PyPI:

  • PyPI is pretty unique in having a well-known public URL and lots of mirrors

  • PyPI actually has multiple URLs (e.g., we'll probably have Pipfiles floating around for a while with both https://pypi.python.org/simple and https://pypi.org/simple, and maybe also https://pypi.python.org/simple/ and https://pypi.org/simple/ with the trailing slash?), and it'd be nice if we could solve this once instead of forcing each user to figure it out themselves

@ncoghlan

Member

ncoghlan commented Apr 27, 2018

@njsmith See the --pypi-mirror <URL> sugar suggestion in the last part of my post - if the initial implementation focused solely on that, then the general URL rewriting capability could start out as an internal implementation detail (driven by the fact that "PyPI" has multiple URLs that all ultimately resolve to the same place), and then be considered for exposure as a feature in its own right later on (after it's been confirmed that it's working as desired for the primary --pypi-mirror use case).

@njsmith

Member

njsmith commented Apr 27, 2018

Ah, right, I missed that :-)

Is there a general rule mapping command line arguments to some kind of more persistent configuration? I imagine that most users of this would want to set it up once and then forget it.

@altendky

Contributor

altendky commented Apr 27, 2018

@ncoghlan wrote:

> I'll start with a note of caution: until PyPI has implemented a package signing mechanism akin to PEP 458 to provide a TLS-independent way for pip to ensure that packages that nominally originate from PyPI actually match what PyPI published, then offering the ability to transparently redirect traffic to a different server is genuinely concerning from a security perspective.

If I'm reading my Pipfile.lock correctly, there is no relationship stored between a package and which source it was installed from. Given that the existing featureset allows multiple sources to be specified, isn't that creating a similar issue? A sync could end up getting a package from a different source than the one that was used when creating the lockfile.

@ncoghlan

Member

ncoghlan commented Apr 27, 2018

Pipfile.lock stores a list of acceptable artifact hashes for each pinned dependency, so once you've done a lock, surreptitiously replacing packages is difficult. At lock generation time, explicitly opting in to a source in Pipfile is saying "I trust this source not to mess with me, and will use TLS to verify that I'm actually talking to this point of origin". (I think there's an issue somewhere discussing the prospect of binding particular packages to particular source repos, although it may be in pip or one of the other PyPA repos, rather than here)
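
To make the hash check concrete, here is a minimal sketch of comparing a downloaded artifact against a list of acceptable hashes in the "sha256:<hex>" form used in Pipfile.lock. The function name artifact_is_acceptable and the sample payload are illustrative only; this is not how pip or pipenv implement the check internally.

```python
import hashlib


def artifact_is_acceptable(data, accepted_hashes):
    """Check downloaded bytes against 'sha256:<hex>' entries (illustrative only)."""
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    return digest in accepted_hashes


# Stand-in for an artifact that was hashed when the lockfile was generated.
payload = b"example artifact contents"
locked = ["sha256:" + hashlib.sha256(payload).hexdigest()]

same_bytes_ok = artifact_is_acceptable(payload, locked)
tampered_ok = artifact_is_acceptable(b"tampered contents", locked)
```

Because the comparison depends only on the bytes received, it gives the same answer whether they came from PyPI or from a mirror.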

Changing the default index URL (or adding an extra index URL) in pip.conf, or using the override feature proposed here through a config file or shell profile based mechanism is different: that's saying "I, or some arbitrary process I ran at some point in time with write access to my home directory (such as an sdist's setup.py file), decided to configure my settings to trust this source of packages". And even a signing scheme like PEP 458 isn't a complete defence against those kinds of shenanigans if the public keys used for verification are themselves stored somewhere inside your home directory rather than in a directory that requires elevated privileges to modify.

There are good reasons why organisations with strict security requirements execute builds on locked down servers with only limited access to the internet at large, or otherwise monitor for these kinds of problems at the network level :)

@techalchemy

Member

techalchemy commented Apr 27, 2018

Note also if you use multiple indexes and a package comes from the non-primary index it will be indicated in the lockfile.

The pep 458 concerns were essentially what I had in mind, since things that are different urls but in actuality point at pypi are different than if you just locally copied pypi and claimed it was the same.

@njsmith

Member

njsmith commented Apr 27, 2018

> I, or some arbitrary process I ran at some point in time with write access to my home directory (such as an sdist's setup.py file), decided to configure my settings to trust this source of packages

If this is your threat model, then I don't see how anything pipenv can do will affect it much. Someone who can modify your home directory config can also do things like insert a new directory on $PATH and insert a fake pipenv in there that does whatever they want.

@techalchemy

Member

techalchemy commented Apr 27, 2018

@njsmith this is also pip’s threat model, because package installation requires that the execution of arbitrary code from sdist setup.py files be allowed. That code indeed could overwrite things in your home directory like your settings, or add things to your path, or any number of things. That’s why explicitly privileging pypi (a known, trusted index) and requiring hash checking is a good step toward security. It allows centralized control and elimination of known security threats and identity verification of the packages you are downloading in a distributed fashion. What did the lockfile you downloaded say about the hash you should be getting? It doesn’t match what you’re getting from the index? In order for this mode of operation to fail you need to have failures at more than one of the local machine, index and network layer because you’re talking about having multiple corrupted packages in your application stack working in concert verifying hashes against a trusted index, and in many cases the hashes themselves came from yet another uninvolved source. So now you need to have at a minimum, all of the hash checking in both pip and pipenv somehow tampered with such that it generates hashes that are identical to the ones you are hoping for, but installs yet other malicious things?

I guess what I’m saying is, if your local machine is compromised there is nothing pip or pipenv is going to do to save you. But we can ensure that the package you’re downloading is the one you were looking for, from the place you were supposed to search for it, which can provide one element in the chain of security.

@techalchemy

Member

techalchemy commented Apr 28, 2018

@ncoghlan @njsmith how does this all factor in with the move to push back against sudo pip install... and the general sense I think we all have that if you're going to use pip, you probably shouldn't also use your system package manager to install python things broadly speaking. This isn't really a pipenv question maybe, but it's where the discussion is right now and this might guide the next steps...

@njsmith

Member

njsmith commented Apr 28, 2018

@techalchemy I don't see any connection to this topic at all? I think the conclusion of all the above is that letting users override which mirror pipenv uses for PyPI doesn't introduce any additional threats, and doing sudo pipenv doesn't even make sense in the first place, right?

@techalchemy

Member

techalchemy commented Apr 28, 2018

@njsmith no I don't think anyone should use sudo pipenv, like I mentioned it's not really on topic but since we went a bit down the threat model path, I thought it was worth exploring. Specifically:

> And even a signing scheme like PEP 458 isn't a complete defence against those kinds of shenanigans if the public keys used for verification are themselves stored somewhere inside your home directory rather than in a directory that requires elevated privileges to modify.
> There are good reasons why organisations with strict security requirements execute builds on locked down servers with only limited access to the internet at large, or otherwise monitor for these kinds of problems at the network level :)

If a defense at least in some capacity relies on keys being stored in a privileged location, but we are advising against using privileged python installs, I think it's possibly worth discussing. Maybe I'm wrong. But it definitely seems related to @ncoghlan's comment (but not sudo pipenv, that should never be a thing)

Yeah that probably seemed like it came out of nowhere, just a random thought. Hopefully the additional context clears it up some

@njsmith

Member

njsmith commented Apr 28, 2018

I vote we keep this issue on the topic of helping folks who need to use PyPI mirrors, rather than getting into a speculative discussion of how we might implement TUF. (Anyway, I don't think there's much we can or should do to try to defend against an attacker who has arbitrary write access to the user's home directory.)

@techalchemy

Member

techalchemy commented Apr 29, 2018

Okay, so let's define the behavior that we would expect or prefer. My current working understanding is that:

  • If --pypi-mirror is passed or PIPENV_PYPI_MIRROR is set, we should prefer that
  • Should we prefer it over PyPI only? How are we making the assessment as to whether a given index url is 'PyPI' -- we can't query it, so we would have to maintain a list
  • Should the list contain all possible permutations, or should we be content with using the two urls we've used in the past for generating Pipfiles as the things for which we should try the provided mirror first?
@njsmith

Member

njsmith commented Apr 29, 2018

It should override PyPI only, not other URLs. I guess there are probably only a few different PyPI URLs in use, so they can be listed, and if we miss one then someone will file a bug, it'll get added, and pretty soon we'll have all of them.

@techalchemy techalchemy added this to the 11.11.0 milestone Apr 29, 2018

@techalchemy

Member

techalchemy commented Apr 29, 2018

Seems like the right approach to me.

@ncoghlan

Member

ncoghlan commented Apr 29, 2018

What @njsmith said matches my perspective as well. The 3 repo URLs I'd suggest replacing in an initial PR would be:

The trailing-slash-or-not is likely better handled as a URL normalisation step, rather than by listing the URLs separately.

@njsmith

Member

njsmith commented Apr 29, 2018

Note that the requests Pipfile does have a trailing slash (at time of writing), so we probably do need to handle this one way or another.

@ncoghlan

Member

ncoghlan commented Apr 29, 2018

Right, my thought was:

  • maintain a list of URLs without trailing slashes
  • check incoming URLs for a trailing slash, and remove it if found (str.rstrip would likely be good enough for the task, even though it would remove an arbitrary number of trailing slashes, or else we could be stricter about it, and remove at most one trailing slash)
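
The two bullets above could look roughly like this. The list contents are an assumption (only the two PyPI URLs named earlier in this thread are included), and the function names are hypothetical. Both the str.rstrip variant and the stricter remove-at-most-one variant are shown.

```python
# Assumption: only the two PyPI URLs named earlier in the thread; the real
# list may need more entries. Stored without trailing slashes.
PYPI_URLS = (
    "https://pypi.org/simple",
    "https://pypi.python.org/simple",
)


def normalise(url):
    """Strip trailing slashes; str.rstrip removes any number of them."""
    return url.rstrip("/")


def strip_one_slash(url):
    """Stricter variant: remove at most one trailing slash."""
    return url[:-1] if url.endswith("/") else url


def is_pypi(url):
    """True when the normalised URL is one of the known PyPI index URLs."""
    return normalise(url) in PYPI_URLS


with_slash = is_pypi("https://pypi.org/simple/")
legacy = is_pypi("https://pypi.python.org/simple")
mirror = is_pypi("https://pypi-proxy.example.com/api")
```

Keeping the list slash-free means each PyPI URL is listed once, and a missed URL can later be added as a one-line change.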
@techalchemy

Member

techalchemy commented Apr 29, 2018

Awesome. I think this is enough to work with and simple enough to build. Thanks all!

@leafonsword


leafonsword commented May 2, 2018

Hope mirror feature could be added soon~

JacobHenner added a commit to JacobHenner/pipenv that referenced this issue Jun 5, 2018

Add support for PyPI mirrors
Adds support for the --pypi-mirror command line parameter and the
PIPENV_PYPI_MIRROR environment variable for most pipenv operations.
This permits pipenv to function without pypi.org, which is necessary for
users:

    1. behind restrictive networks
    2. facing strict artifact sourcing policies
    3. experiencing poor performance connecting to pypi.org
    4. who've configured a local cache for performance reasons

When specified, the value of this parameter replaces all instances of
pypi.org and pypi.python.org within pipenv operations without modifying
or requiring the modification of Pipfiles.

- Resolves pypa#2075
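
For readers who want to picture how the two inputs might be combined, here is a small sketch. It uses argparse for brevity and assumes that an explicit --pypi-mirror flag takes precedence over PIPENV_PYPI_MIRROR; the commit message does not state the precedence, and resolve_mirror is a hypothetical name, not a pipenv function.

```python
import argparse


def resolve_mirror(argv, environ):
    """Return the mirror URL from the flag, else the environment, else None."""
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("--pypi-mirror", default=None)
    # Unrelated arguments (such as a subcommand name) are ignored here.
    args, _unknown = parser.parse_known_args(argv)
    # Assumption: an explicit flag wins over the environment variable.
    return args.pypi_mirror or environ.get("PIPENV_PYPI_MIRROR")


from_flag = resolve_mirror(
    ["--pypi-mirror", "https://pypi-proxy.example.com/api", "install"],
    {"PIPENV_PYPI_MIRROR": "https://other-mirror.example.com/simple"},
)
from_env = resolve_mirror(
    ["install"],
    {"PIPENV_PYPI_MIRROR": "https://other-mirror.example.com/simple"},
)
unset = resolve_mirror(["install"], {})
```

Passing os.environ as the second argument would read the real environment; a plain dict is used here so the example has no side effects.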

@polski-g


polski-g commented Jun 7, 2018

I presume you're talking about using devpi as caching proxy for official PyPi. For pip itself, you would need to modify /etc/pip.conf and /usr/lib64/python3.6/distutils/distutils.cfg for pip to use your local devpi server for all requests.

However, it looks like pipenv ignores these system-wide settings, so you are forced to modify the [[source]] config setting in Pipfile to reference your devpi server. But then if you publish your Pipfile externally, external contributors have to remove your [[source]] settings to actually build their own environment.

I think that pipenv should just respect the global settings from /etc/pip.conf and /usr/lib.../distutils.cfg

@brettdh


brettdh commented Jun 7, 2018

@polski-g

> I presume you're talking about using devpi as caching proxy for official PyPi

Nexus Repository, but yeah, same idea.

> However, it looks like pipenv ignores these system-wide settings

As @techalchemy mentioned, I believe that pipenv (11.6.0) used to respect pip.conf (homedir as well), but the latest version does not - specifically, there's a hard-coded pypi.org URL somewhere (dependency resolution, IIRC) that can't be overridden.

> I think that pipenv should just respect the global settings from /etc/pip.conf and /usr/lib.../distutils.cfg

Agreed - though personally I haven't had to modify distutils.cfg in my use case.

@uranusjr

Member

uranusjr commented Jun 7, 2018

IIRC there was a resolution to not respect pip.conf, but you’ll need to dig deep into the issue tracker to find it. In any case, the ship has sailed, and with PyPI mirroring almost done, this is unlikely to change in near future.

@techalchemy

Member

techalchemy commented Jun 7, 2018

I'm fairly confident this feature will ship in the next release (which will ship in the next day or two with luck)

@techalchemy

Member

techalchemy commented Jun 7, 2018

Also I'm not sure about this, but it's possible we might just need to call .load() after we create the config parser here to get the config defaults

https://github.com/pypa/pipenv/blob/master/pipenv/project.py#L573-#L577

@brettdh


brettdh commented Jun 7, 2018

@uranusjr as long as the mirroring configuration works (i.e. doesn't use that hardcoded pypi.org URL I mentioned), I don't see any problem with pipenv having its own configuration for this and ignoring pip's.

@JacobHenner

Contributor

JacobHenner commented Jun 7, 2018

@brettdh


brettdh commented Jun 7, 2018

@JacobHenner yep, thanks. My initial testing with the --pypi-mirror option (pipenv install, pipenv lock) looks like it works fine. I left a small suggestion on the PR.

I'm a bit concerned, though, that hardcoded URLs to pypi.org still appear scattered across the pipenv sources. I can't be sure which ones are correctly overridden from [[source]] entries, and I can't remember exactly which workflow caused my issue above. So it's hard to tell if it's fixed. 😬

@techalchemy

Member

techalchemy commented Jun 7, 2018

Yeah following this release I am planning a major code cleanup. Cli stuff moving to the cli, bubbling exceptions there and handling all the exits there, deduping duplicated code, etc. It’s going to be a lot of work and help will be appreciated if anyone wants to volunteer :p

@kylecribbs


kylecribbs commented Jan 3, 2019

Just pulled the recent version and it is still hardcoding the pypi.org in the sources. Is the goal to take the environmental variable or the pypi-mirror and put that as the default for [[source]]?

edit:

Just dug through the code.. Looks like you have

if PIPENV_TEST_INDEX:
    DEFAULT_SOURCE = {
        u"url": PIPENV_TEST_INDEX,
        u"verify_ssl": True,
        u"name": u"custom",
    }
else:
    DEFAULT_SOURCE = {
        u"url": u"https://pypi.org/simple",
        u"verify_ssl": True,
        u"name": u"pypi",
    }

I think if you changed that If PIPENV_TEST_INDEX to the environmental variable PIPENV_PYPI_MIRROR it would be a good start

@uranusjr

Member

uranusjr commented Jan 3, 2019

The solution discussed here has long been implemented. The snippet you quoted is a default, i.e. used if you do not provide a source when creating the Pipfile.

@JacobHenner

Contributor

JacobHenner commented Jan 3, 2019

@ncoghlan

Member

ncoghlan commented Jan 4, 2019

@JacobHenner The mirror handling code postprocesses the source list and replaces pypi.org URLs with references to the specified mirror.

That's what allows the mirror override to work even if there is an explicit pypi.org entry in the Pipfile. pipenv then relies on that same logic to override its own default source as well.

If there are currently cases where that postprocessing isn't being applied correctly, that's a new bug report against the already implemented feature, rather than a feature request.
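
The postprocessing described in this comment can be pictured with a short sketch. It is not pipenv's actual code: apply_mirror is a hypothetical name, and matching on the host names pypi.org and pypi.python.org is an assumption based on the commit message earlier in the thread. The source dictionaries follow the shape of the DEFAULT_SOURCE snippet quoted above.

```python
from urllib.parse import urlparse

# Assumption: hosts treated as "PyPI", taken from the commit message.
PYPI_HOSTS = ("pypi.org", "pypi.python.org")


def apply_mirror(sources, mirror):
    """Return a copy of sources with PyPI entries pointed at mirror."""
    result = []
    for source in sources:
        updated = dict(source)  # never mutate the Pipfile data itself
        if mirror and urlparse(source["url"]).hostname in PYPI_HOSTS:
            updated["url"] = mirror
        result.append(updated)
    return result


sources = [
    {"url": "https://pypi.org/simple", "verify_ssl": True, "name": "pypi"},
    {"url": "https://private.example.com/simple", "verify_ssl": True, "name": "internal"},
]
mirrored = apply_mirror(sources, "https://pypi-proxy.example.com/api")
unchanged = apply_mirror(sources, None)
```

Because the rewrite happens after the source list is assembled, the same step covers both an explicit pypi.org entry in a Pipfile and the built-in default source.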

@JacobHenner

Contributor

JacobHenner commented Jan 7, 2019

I think that last comment was intended for @kylecribbs?

@ncoghlan

Member

ncoghlan commented Jan 8, 2019

@JacobHenner Ah, sorry - I misinterpreted your comment as saying that this change hadn't achieved its original goal, rather than as a response to Kyle that aimed to clarify what that outcome actually was.
