Differentiating organic vs automated installations #5499

Closed
mahmoud opened this issue Jun 13, 2018 · 20 comments

Comments

@mahmoud mahmoud commented Jun 13, 2018

What's the problem this feature will solve?

Currently, pip installation statistics are aggregated to the gCloud and made available on libraries.io and pepy.tech. A lot of effort has gone into these numbers, but thanks to automation, they mean less now than they did a few years ago.

CI and other automation, combined with maybe a bit too much reliance on PyPI's central infrastructure, have inflated the download numbers and diluted the signal with noise.

Describe the solution you'd like

We could detect when pip is being used interactively (by checking if stdin is a tty or some other mechanism), and include that in the pip install request headers, to be included in the statistics generated by the server.
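As a rough illustration of the tty check suggested above, the sketch below asks whether stdin is attached to a terminal and turns the answer into a request header. The header name `X-Pip-Interactive` and both function names are hypothetical, chosen only for this example; nothing here is existing pip behavior.

```python
import sys


def looks_interactive(stream=None):
    """Return True if the stream (default: stdin) is attached to a terminal."""
    stream = sys.stdin if stream is None else stream
    try:
        return stream.isatty()
    except (AttributeError, ValueError):
        # The stream is missing, was replaced by an object without
        # isatty(), or has already been closed.
        return False


def interactivity_header(stream=None):
    """Build a header dict describing whether the run looks interactive."""
    # "X-Pip-Interactive" is a made-up header name used only for illustration.
    return {"X-Pip-Interactive": "1" if looks_interactive(stream) else "0"}
```

A check like this reports any run whose stdin is a pipe or a file as non-interactive, regardless of who started it.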

This would provide us with much cleaner data for highlighting actual community activity, instead of drowning in automation trends, overly favoring professionalized sectors of Python. Specifically, a library being manually installed 100 times may well indicate something much more interesting than a CI (or, unfortunately, a production) fleet installing a package 10,000 times.

Additional context

  • I wasn't sure whether to file this on pip or on Warehouse, it seems kind of 🐔 / 🥚 to me.
  • I'm not really sure if/how other package indexes solve this, but would be very interested in hearing.
  • As an arbitrary example, I happen to know Mozilla uses PyPI for quite a few relatively-internal packages. Granted, they're open-source and I'm happy to see some infrastructure synergy. But, picking at random, mozlog actually ranks ok for downloads, even though it's not a very broadly-useful package, and I'm pretty sure the data will show it's mostly Mozilla infrastructure downloading it.

Thanks for your attention and keep up the good work!

@mhsmith mhsmith commented Jun 13, 2018

Note that because of pypa/linehaul#30, the numbers on Google BigQuery may already be meaningless for answering some questions.

Member

@pradyunsg pradyunsg commented Jun 13, 2018

I agree. It would definitely be useful to have separation of automated vs direct usage.

Maybe pypa/packaging-problems would be a good place for it?

Author

@mahmoud mahmoud commented Jun 15, 2018

@mhsmith like Nathaniel pointed out, the lossage should be fairly uniform, so I think the numbers would still be somewhat representative, if we were collecting them on top of the leaky linehaul, that is :)

@pradyunsg Glad you agree! Given that I suspect (and suggested) a straightforward pip enhancement, I'd like to keep this issue open. That said, I may cross-post this there, if you think it would improve the visibility. Let me know if so!

Member

@dstufft dstufft commented Jun 15, 2018

I think the fundamental problem here is that I don't think you can actually detect this reasonably. For instance, if someone manually runs a bash script (or even a tox command), we'd probably want that not to be flagged as automated, but by default those things will not have a tty. On the flip side, you have things like Travis CI, which I believe mimics a tty, so Travis CI would look like a manual install instead of an automated one.

On a theoretical level, I don't have any problem with the idea; I just have never been able to think of a good way of actually differentiating the types of uses automatically.

Member

@njsmith njsmith commented Jul 22, 2018

If we want to detect running under CI, I think that's actually fairly easy, because CI systems tend to advertise that fact in the environment. Just checking for "CI" in os.environ or "BUILD_ID" in os.environ or "BUILD_BUILDID" in os.environ would probably catch 95% of cases (including at least Travis CI, AppVeyor, CircleCI, Jenkins, VSTS).

Or if you want to get fancier, it looks like the ci-info package (2.5 million weekly downloads) has a fairly comprehensive list of envvars to check for: https://github.com/watson/ci-info/blob/master/index.js
(Looks like they're missing VSTS though.)
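The environment check described in this comment can be sketched in a few lines. The three variable names come straight from the comment; the constant name, the function name, and the optional `environ` parameter (added so the check can be exercised without touching the real environment) are assumptions made for this example.

```python
import os

# Variable names taken from the comment above; not an exhaustive list.
CI_HINT_VARIABLES = ("CI", "BUILD_ID", "BUILD_BUILDID")


def looks_like_ci(environ=None):
    """Return True if any well-known CI variable is present in the environment."""
    environ = os.environ if environ is None else environ
    # Only presence is checked; the value of the variable is ignored.
    return any(name in environ for name in CI_HINT_VARIABLES)
```

Calling `looks_like_ci()` with no argument inspects the current process environment; passing a plain dict makes the behavior easy to test.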

Contributor

@hroncok hroncok commented Feb 14, 2019

See https://github.com/The-Compiler/pytest-vw for a Python project that can detect CI.

Member

@pradyunsg pradyunsg commented Feb 15, 2019

Yeah, it isn't difficult to detect whether you're running in CI on most CI services, or for that matter even which one you're running on. We likely still won't know what percentage of the non-CI runs are not automated, but having a separation between CI and non-CI is a good start.

I don't know if we'd want to have any distinction between various CI services (logging NULL if we don't have the information, otherwise a string like "travis" representing the service).
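If a per-service distinction were wanted, one possible shape is a lookup that returns a short service name, or `None` when nothing is recognized. The marker variables below (`TRAVIS`, `APPVEYOR`, `CIRCLECI`, `JENKINS_URL`) are assumptions based on what these services are commonly documented to set and should be verified against each service's documentation; the function and table names are hypothetical.

```python
import os

# (service label, marker variable) pairs. The marker names are assumptions
# for illustration and should be checked against each service's docs.
SERVICE_MARKERS = (
    ("travis", "TRAVIS"),
    ("appveyor", "APPVEYOR"),
    ("circleci", "CIRCLECI"),
    ("jenkins", "JENKINS_URL"),
)


def detect_ci_service(environ=None):
    """Return a short service name such as "travis", or None if unknown."""
    environ = os.environ if environ is None else environ
    for service, marker in SERVICE_MARKERS:
        if marker in environ:
            return service
    # None plays the role of the NULL mentioned above: no information.
    return None
```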

Member

@cjerdonek cjerdonek commented Feb 16, 2019

I posted #6273 to start addressing this.

cjerdonek added a commit that referenced this issue Feb 24, 2019
…-agent

Fix #5499: Include in pip's User-Agent whether it looks like pip is in CI
Member

@cjerdonek cjerdonek commented Feb 24, 2019

I'm going to leave this open for now, rather than letting it auto-close, so we can discuss whether an additional key-value pair should be added to store the value of isatty(). The PR that was just merged stores a different piece of information: whether pip is known to be running in CI.
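To make the two pieces of information concrete, here is a minimal sketch of a User-Agent string that carries a CI flag and, optionally, an isatty() value. It assumes the User-Agent has the form `name/version` followed by a JSON payload; the key names `ci` and `stdin_is_tty` and the function signature are hypothetical and are not taken from the merged PR.

```python
import json


def build_user_agent(name, version, looks_like_ci, stdin_is_tty=None):
    """Return "<name>/<version> <json>" with CI and optional tty information."""
    data = {
        "installer": {"name": name, "version": version},
        # True when CI was detected, None (JSON null) when nothing is known.
        "ci": True if looks_like_ci else None,
    }
    if stdin_is_tty is not None:
        # Hypothetical extra key for the isatty() value under discussion.
        data["stdin_is_tty"] = bool(stdin_is_tty)
    payload = json.dumps(data, separators=(",", ":"), sort_keys=True)
    return f"{name}/{version} {payload}"
```

Reporting `None` rather than `False` when no CI variable is found reflects the point made earlier in the thread: the absence of a CI marker does not show that a run was manual.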

Member

@pradyunsg pradyunsg commented Feb 25, 2019

FWIW, I pinged #zuul on Freenode to see if anyone there has input on how to detect running within Zuul. That said, better detection of that is not a blocker in any form.

Member

@theacodes theacodes commented Apr 29, 2019

I'd love to see a way for us to tell pip we're running in a CI. For context, Google has several custom CIs that wouldn't be detected by this code, so a flag or env var that's something like "PIP_IS_CI" would be really cool.

Author

@mahmoud mahmoud commented Apr 29, 2019

@theacodes if one can set an environment variable, wouldn't setting CI=true achieve just that? Or will that have an impact on other parts of the CI?

Member

@theacodes theacodes commented Apr 29, 2019

Member

@cjerdonek cjerdonek commented May 7, 2019

> I'd love to see a way for us to tell pip we're running in a CI. For context, Google has several custom CIs that wouldn't be detected by this code, so a flag or env var that's something like "PIP_IS_CI" would be really cool.

I think it'd be fine (and low maintenance) to support this. The implementation would just be a matter of adding PIP_IS_CI to the CI_ENVIRONMENT_VARIABLES variable here:

CI_ENVIRONMENT_VARIABLES = (
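The embedded snippet is cut off in this thread, so its actual contents are not reproduced here. As a hedged sketch of the proposal, the example below assumes the tuple holds the variable names mentioned earlier in the discussion and that detection is a simple presence check; `looks_like_ci` and `with_ci_marker` are hypothetical names for this example.

```python
import os

# Assumed contents: the names mentioned earlier in this thread plus the
# proposed opt-in variable. The real tuple is truncated above.
CI_ENVIRONMENT_VARIABLES = ("BUILD_BUILDID", "BUILD_ID", "CI", "PIP_IS_CI")


def looks_like_ci(environ=None):
    """Return True if any listed variable is present, whatever its value."""
    environ = os.environ if environ is None else environ
    return any(name in environ for name in CI_ENVIRONMENT_VARIABLES)


def with_ci_marker(environ=None):
    """Return a copy of the environment with the opt-in variable set."""
    environ = os.environ if environ is None else environ
    marked = dict(environ)
    marked.setdefault("PIP_IS_CI", "1")
    return marked
```

A custom CI system could pass the result of `with_ci_marker()` as the `env` argument when it spawns pip. Under a presence check, even `PIP_IS_CI=0` counts as "in CI", which is worth keeping in mind when choosing how to document such a variable.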

Member

@pradyunsg pradyunsg commented May 19, 2019

> The implementation would just be a matter of adding PIP_IS_CI to the CI_ENVIRONMENT_VARIABLES variable

@theacodes If you could file a PR for this, that'd be great!

Contributor

@methane methane commented May 20, 2019

Is PIP_IS_CI recommended for non-CI automated installations?
For example, provisioning a server via cloud-init or Ansible.

Member

@cjerdonek cjerdonek commented May 23, 2019

> Is PIP_IS_CI recommended for non-CI automated installations?

It seems like it should be for any automated runs, but I'm not the one using this data. Is it worth making the environment variable name more descriptive (e.g. PIP_IS_AUTOMATED)? I'm also not sure to what extent this should be publicized / recommended for others to use.

Member

@cjerdonek cjerdonek commented May 23, 2019

Reflecting a bit more on this, to @methane's implicit point, if we're going to expose an environment variable I'm thinking it would be better to call it something like PIP_IS_AUTOMATED. That would document the intent more clearly.

Member

@njsmith njsmith commented May 23, 2019

Member

@cjerdonek cjerdonek commented May 23, 2019

Okay, that's fine with me. And that would mean then that the answer to @methane's original question ("Is PIP_IS_CI recommended for non-CI automated installations?") is no.

@lock lock bot added the S: auto-locked label Jun 23, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 23, 2019