Differentiating organic vs automated installations #5499

Open
mahmoud opened this issue Jun 13, 2018 · 16 comments

@mahmoud

commented Jun 13, 2018

What's the problem this feature will solve?

Currently, pip installation statistics are aggregated in Google Cloud (BigQuery) and made available on libraries.io and pepy.tech. A lot of effort has gone into these numbers, but thanks to automation, they mean less now than they did a few years ago.

CI and other automation, combined with maybe a bit too much reliance on PyPI's central infrastructure, have inflated the download numbers and diluted the signal with noise.

Describe the solution you'd like

We could detect when pip is being used interactively (by checking if stdin is a tty or some other mechanism), and include that in the pip install request headers, to be included in the statistics generated by the server.
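
A minimal sketch of the tty check proposed above, assuming Python's standard `isatty()` method on the stdin stream. The helper name `is_interactive` is hypothetical and is not part of pip:

```python
import sys


def is_interactive(stream=None):
    """Return True if the stream (default: sys.stdin) is attached to a terminal."""
    if stream is None:
        stream = sys.stdin
    try:
        return bool(stream.isatty())
    except (AttributeError, ValueError):
        # stdin may be None or already closed, e.g. under some service managers.
        return False


if __name__ == "__main__":
    print("interactive" if is_interactive() else "not interactive")
```

Run directly in a terminal this would likely report an interactive session, while piping input into it (for example `echo | python check.py`) would not.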

This would provide us with much cleaner data for highlighting actual community activity, instead of drowning in automation trends, overly favoring professionalized sectors of Python. Specifically, a library being manually installed 100 times may well indicate something much more interesting than a CI (or, unfortunately, a production) fleet installing a package 10,000 times.

Additional context

  • I wasn't sure whether to file this on pip or on Warehouse, it seems kind of 🐔 / 🥚 to me.
  • I'm not really sure if/how other package indexes solve this, but would be very interested in hearing.
  • As an arbitrary example, I happen to know Mozilla uses PyPI for quite a few relatively-internal packages. Granted, they're open-source and I'm happy to see some infrastructure synergy. But, picking at random, mozlog actually ranks ok for downloads, even though it's not a very broadly-useful package, and I'm pretty sure the data will show it's mostly Mozilla infrastructure downloading it.

Thanks for your attention and keep up the good work!

@mhsmith

commented Jun 13, 2018

Note that because of pypa/linehaul#30, the numbers on Google BigQuery may already be meaningless for answering some questions.

@pradyunsg

Member

commented Jun 13, 2018

I agree. It would definitely be useful to have separation of automated vs direct usage.

Maybe pypa/packaging-problems would be a good place for it?

@mahmoud

Author

commented Jun 15, 2018

@mhsmith like Nathaniel pointed out, the lossage should be fairly uniform, so I think the numbers would still be somewhat representative, if we were collecting them on top of the leaky linehaul, that is :)

@pradyunsg Glad you agree! Given that I suspect (and suggested) a straightforward pip enhancement, I'd like to keep this issue open. That said, I may cross-post this there, if you think it would improve the visibility. Let me know if so!

@dstufft

Member

commented Jun 15, 2018

I think the fundamental problem here is that you can't actually detect this reasonably. For instance, if someone manually runs a bash script (or even a tox command), we'd probably not want that to be marked as automated, but by default those things will not have a tty. On the flip side, you have things like Travis CI, which I believe mimics a tty, so Travis CI will look like a manual install instead of an automated one.

On a theoretical level, I don't have any problem with the idea-- I just have never been able to think of a good way of actually differentiating the types of uses automatically.

@njsmith

Member

commented Jul 22, 2018

If we want to detect running under CI, I think that's actually fairly easy, because CI systems tend to advertise that fact in the environment. Just checking for "CI" in os.environ or "BUILD_ID" in os.environ or "BUILD_BUILDID" in os.environ would probably catch 95% of cases (including at least Travis-CI, AppVeyor, Circle-CI, Jenkins, VSTS).

Or if you want to get fancier, it looks like the ci-info package (2.5 million weekly downloads) has a fairly comprehensive list of envvars to check for: https://github.com/watson/ci-info/blob/master/index.js
(Looks like they're missing VSTS though.)
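
The environment check described in this comment could be sketched roughly as follows. The three variable names come from the comment itself; the function name `looks_like_ci` and the tuple name `CI_HINT_VARIABLES` are chosen here for illustration:

```python
import os

# Variable names taken from the comment above. Presence alone counts;
# the value of the variable is not inspected.
CI_HINT_VARIABLES = ("CI", "BUILD_ID", "BUILD_BUILDID")


def looks_like_ci(environ=None):
    """Return True if any of the well-known CI marker variables is set."""
    if environ is None:
        environ = os.environ
    return any(name in environ for name in CI_HINT_VARIABLES)


if __name__ == "__main__":
    print(looks_like_ci())
```

Passing the environment as a parameter keeps the check easy to test without modifying the real process environment.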

@hroncok

Contributor

commented Feb 14, 2019

See https://github.com/The-Compiler/pytest-vw for a Python project that can detect CI.

@pradyunsg

Member

commented Feb 15, 2019

Yea, it isn't difficult to detect whether you're running in a CI, on most CI services -- or for that matter even which one you're running on. We likely still won't know what %age of the non-CI runs are not automated but having a separation between CI/non-CI is a good start.

I don't know if we'd want to have any distinction between various CI services (logging NULL if we don't have the information, otherwise a string like "travis" representing the service).
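
If a per-service distinction were wanted, one possible shape is a lookup that returns a service string or `None`. The marker variables below (`TRAVIS`, `APPVEYOR`, `CIRCLECI`, `JENKINS_URL`) are assumptions about what each service exports and should be checked against each service's documentation. The function name `detect_ci_service` is hypothetical:

```python
import os

# Assumed (variable, service name) pairs; verify against each service's docs.
CI_SERVICE_MARKERS = (
    ("TRAVIS", "travis"),
    ("APPVEYOR", "appveyor"),
    ("CIRCLECI", "circleci"),
    ("JENKINS_URL", "jenkins"),
)


def detect_ci_service(environ=None):
    """Return a short service name, or None when no known marker is present."""
    if environ is None:
        environ = os.environ
    for variable, service in CI_SERVICE_MARKERS:
        if variable in environ:
            return service
    return None


if __name__ == "__main__":
    print(detect_ci_service())
```

Returning `None` rather than an empty string maps naturally onto logging NULL when the information is unavailable.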

@cjerdonek

Member

commented Feb 16, 2019

I posted #6273 to start addressing this.

cjerdonek added a commit that referenced this issue Feb 24, 2019

Merge pull request #6273 from cjerdonek/issue-5499-detect-ci-for-user-agent

Fix #5499: Include in pip's User-Agent whether it looks like pip is in CI
@cjerdonek

Member

commented Feb 24, 2019

I'm going to leave this open for now, as opposed to auto-closing, so we can discuss whether an additional key-value should be added to store the value of isatty(). The PR that was just merged stores a different piece of information: whether pip is known to be running in CI.
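
To make the two pieces of information concrete, here is a rough sketch of a User-Agent string carrying a JSON payload with a `ci` key and an optional tty key. The payload layout, the `isatty` key name, and the function `build_user_agent` are assumptions for illustration and are not taken from pip's source:

```python
import json


def build_user_agent(version, ci, isatty=None):
    """Build a 'pip/<version> <json>' string with CI and optional tty info."""
    data = {
        "installer": {"name": "pip", "version": version},
        # True when CI is detected, otherwise null ("unknown", not "false").
        "ci": True if ci else None,
    }
    if isatty is not None:
        # Hypothetical extra key for the isatty() discussion.
        data["isatty"] = bool(isatty)
    payload = json.dumps(data, separators=(",", ":"), sort_keys=True)
    return "pip/{} {}".format(version, payload)


if __name__ == "__main__":
    print(build_user_agent("19.1", ci=True, isatty=False))
```

Keeping the two values in separate keys lets the server side slice the statistics either way, since a tty can be present in CI and absent in a manually run script.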

@pradyunsg

Member

commented Feb 25, 2019

FWIW, I pinged #zuul on Freenode to see if anyone there has input on how to detect running within Zuul. That said, better detection of that is not a blocker in any form.

@theacodes

Member

commented Apr 29, 2019

I'd love to see a way for us to tell pip we're running in a CI. For context, Google has several custom CIs that wouldn't be detected by this code, so a flag or env var that's something like "PIP_IS_CI" would be really cool.

@mahmoud

Author

commented Apr 29, 2019

@theacodes if one can set an environment variable, wouldn't setting CI=true achieve just that? Or will that have an impact on other parts of the CI?

@theacodes

Member

commented Apr 29, 2019

@cjerdonek

Member

commented May 7, 2019

> I'd love to see a way for us to tell pip we're running in a CI. For context, Google has several custom CIs that wouldn't be detected by this code, so a flag or env var that's something like "PIP_IS_CI" would be really cool.

I think it'd be fine (and low maintenance) to support this. The implementation would just be a matter of adding PIP_IS_CI to the CI_ENVIRONMENT_VARIABLES variable here:

CI_ENVIRONMENT_VARIABLES = (
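
The snippet above is truncated in the source and is left as-is. As a separate illustration of the opt-in being discussed, the following sketch assumes a presence-based check over a tuple of names. The tuple contents are assembled from names mentioned in this thread and are not copied from pip's source:

```python
import os

# Assumed contents, based on names mentioned in this thread.
# PIP_IS_CI is the proposed explicit opt-in for custom CI systems.
CI_ENVIRONMENT_VARIABLES = ("BUILD_BUILDID", "BUILD_ID", "CI", "PIP_IS_CI")


def looks_like_ci(environ=None, names=CI_ENVIRONMENT_VARIABLES):
    """Return True if any of the given variable names is present."""
    if environ is None:
        environ = os.environ
    return any(name in environ for name in names)


if __name__ == "__main__":
    # A custom CI that exports none of the usual markers can still opt in.
    print(looks_like_ci({"PIP_IS_CI": "1"}))
```

Because this sketch only tests for presence, exporting `PIP_IS_CI=0` would still count as CI. Whether the value should be interpreted is a design choice this example does not settle.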

@pradyunsg

Member

commented May 19, 2019

> The implementation would just be a matter of adding PIP_IS_CI to the CI_ENVIRONMENT_VARIABLES variable

@theacodes If you could file a PR for this, that'd be great!

@methane

Contributor

commented May 20, 2019

Is PIP_IS_CI recommended for non-CI automated installations?
For example, provisioning a server via cloud-init or Ansible.
