
Data Usage policy for pyOpenSci - what does our policy look like for data usage? #183

Open
lwasser opened this issue Feb 15, 2023 · 38 comments


@lwasser (Member) commented Feb 15, 2023

Hi Everyone,
I know that this issue of data usage is an important one, so I wanted to open an issue to capture community thoughts on it. Based upon our first iterations, below is the language that we are using around data usage.

### Telemetry & user-informed consent

Your package should avoid collecting usage analytics. With
that in mind, we understand that package-use data can be invaluable for the
development process. If the package does collect such data, it should do so
by prioritizing user-informed-consent. This means that before any data are
collected, the user understands:

1. What data are collected
2. How the data are collected
3. What you plan to do with the data
4. How and where the data are stored

Once the user is informed of what will be collected and how that data will be handled, stored and used, you can implement `opt-in` consent. `opt-in` means that the user agrees to usage-data collection prior to it being collected (rather than having to opt-out when using your package).

We will evaluate usage data collected by packages on a case-by-case basis
and reserve the right not to review a package if the data collection is overly
invasive.
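To make the four-point disclosure and the opt-in requirement above concrete, here is a minimal Python sketch of such a flow. All names, paths, and the disclosure text are hypothetical, and a real package should follow platform conventions (e.g. via `platformdirs`) rather than hard-coding a config path; this is an illustration, not a pyOpenSci-endorsed implementation:

```python
import json
from pathlib import Path

# Hypothetical location for persisted consent; a real package should use
# platform-appropriate config directories (e.g. via platformdirs).
CONSENT_FILE = Path.home() / ".config" / "mypkg" / "telemetry.json"

# The four disclosure points from the policy: what, how, why, where.
DISCLOSURE = """\
mypkg would like to collect anonymous usage statistics:
  what:  names of public functions called (no arguments, no user data)
  how:   batched and sent over HTTPS once per week
  why:   to prioritize development effort
  where: stored for 12 months on project-controlled servers
"""

def telemetry_enabled() -> bool:
    """Return True only if the user has explicitly opted in."""
    if not CONSENT_FILE.exists():
        return False  # no recorded decision means no collection (opt-in)
    return json.loads(CONSENT_FILE.read_text()).get("opt_in", False)

def ask_consent(answer: str) -> bool:
    """Record a decision; `answer` would come from an interactive prompt."""
    print(DISCLOSURE)  # show the disclosure before any data is collected
    opted_in = answer.strip().lower() in ("y", "yes")
    CONSENT_FILE.parent.mkdir(parents=True, exist_ok=True)
    CONSENT_FILE.write_text(json.dumps({"opt_in": opted_in}))
    return opted_in
```

The key property is the default in `telemetry_enabled`: with no recorded decision, collection stays off.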

There are some good, important points here regarding how maintainers collect data to support development.

For maintainers

  • The data can be useful to improve user experience
  • The data can inform / focus development efforts

On the user end:
There is a widespread assumption today that collecting data is OK; it is even part of many organizations' business models (think Facebook). In return, they offer users a tool or service at no or low cost.
However, there is a level of trust and ethical acknowledgement to consider: people should have some control over data derived from their activities.

We have a few models, such as Homebrew, which is up front about collecting data upon install; the user can opt in or out there.
Is there a model for (scientific) Python that we could follow that would balance the needs of developers with the privacy of users?

References to two items:

pinging:

@sneakers-the-rat @NickleDave @tupui @stefanv @Batalex @yuvipanda @pradyun @choldgraf @skrawcz and anyone else who has some thoughts on what our policy looks like. Let's have an open, productive conversation here so we have a record of it!

@tupui commented Feb 15, 2023

Thank you for putting this together.

Some thoughts:

I am +1 on ethical data collection as described. Maybe add some wording like "ethical" and explain how the collected data can be used. Also, if there are practices we want to forbid, we should state them, e.g. not allowing user profiles to be built, listing data that is forbidden to track, requiring compliance with laws like the GDPR (RGPD), etc.

One data point I think of is Apple: devs are now required to provide a "card" explaining what their app collects, etc. The card is standardized, which is helpful for users since the information is easily accessible and understandable. I think we should do something similar: basically provide a template and specify how/where the info should be presented to users. E.g., I would suggest having a section in the README, as it's important (this is also what Apple does; it's on the main page description of apps).

I think we need to be careful here. Data collection is very sensitive, and if you do it wrong, liability can be at stake in some countries (and imagine if you leak sensitive info...). It might be worth having some legal wording that would work in most countries (i.e., this needs to be seen by some international lawyers, IMHO).

@Batalex (Contributor) commented Feb 16, 2023

As someone who has been involved in the financial & healthcare industries, two heavily regulated sectors, I agree with the community stance.
Open-source software is provided "as is," and developers would be well within their rights to include telemetry. However, no one likes a nosy neighbor coming unannounced to the party.
Moreover, I feel this would strain the trust the users may put in the community if their sysadmin starts slamming them for "unauthorized outgoing traffic". It is already hard enough to demonstrate the benefits of open-source software in an environment with a need for compliance. Let's not make it harder!

@tupui commented Feb 16, 2023

Agreed; the opt-in condition is paramount here and should be non-negotiable.

@NickleDave (Contributor)

In the spirit of having an open discussion here, I want to add a link to the Command-Line Interface Guidelines that @pradyun helpfully shared in the pyOpenSci Slack, specifically the section on "analytics":
https://clig.dev/#analytics

(and maybe this will save @pradyun some typing 🙂 who I'm sure would also share it here)

Some quotes:

unlike websites, users of the command-line expect to be in control of their environment, and it is surprising when programs do things in the background without telling them.

Do not phone home usage or crash data without consent. Users will find out, and they will be angry. Be very explicit about what you collect, why you collect it, how anonymous it is and how you go about anonymizing it, and how long you retain it for.

Ideally, ask users whether they want to contribute data (“opt-in”). If you choose to do it by default (“opt-out”), then clearly tell users about it on your website or first run, and make it easy to disable.

Examples of projects that collect usage statistics:

Consider alternatives to collecting analytics.

  • Instrument your web docs. If you want to know how people are using your CLI tool, make a set of docs around the use cases you’d like to understand best, and see how they perform over time. Look at what people search for within your docs.
  • Instrument your downloads. This can be a rough metric to understand usage and what operating systems your users are running.
  • Talk to your users. Reach out and ask people how they’re using your tool. Encourage feedback and feature requests in your docs and repos, and try to draw out more context from those who submit feedback.

Further reading: Open Source Metrics

Most of what's there aligns with what's already been said. And the fact that it's a resource provided by a community that's already done a lot of thinking about this reinforces those points.

I think it applies to research software more generally, not just CLIs.

In Slack, I think I'm the only one who voted for "either opt-in or opt-out". It seems like our community in particular is really sensitive to issues around analytics; it could look really bad if you found out your HPC cluster or whatever was sending off data to who knows where. Which makes me think we'd want to lean more towards "opt-in only", even if as devs we'd really love to have that usage info.

@stefanv (Contributor) commented Feb 17, 2023

In Slack, I think I'm the only one who voted for "either opt-in or opt-out". It seems like our community in particular is really sensitive to issues around analytics; it could look really bad if you found out your HPC cluster or whatever was sending off data to who knows where. Which makes me think we'd want to lean more towards "opt-in only", even if as devs we'd really love to have that usage info.

+1 With libraries, there is often no user interaction. That cannot be taken as implicit permission to send data!

@choldgraf (Contributor) commented Feb 17, 2023

I agree with everything in the policy as a first pass, except for the first sentence. I think we should encourage more practices where projects thoughtfully collect data about how people use their tools. What we want to avoid is people doing so unthoughtfully, in a way that has unintended consequences that conflict with the values pyOpenSci wants to promote. So I'd suggest adding "uninformed" to the first sentence and then making it a "must" rather than a "should".

Also a meta point from a practicality perspective: I think it's important to consider that there are three things often true in projects:

  1. They would benefit from and want basic information about how their users are using their tools and documentation
  2. They have zero resources to collect it on their own with any kind of manual labor
  3. They don't have many options to try

I think this is why everybody uses Google Analytics - it's not because they want to per se, it's because it is their only realistic option to solve a problem they have. I think that if policy and guidelines don't appreciate this nuance, then people will just not follow the policy, or they will do the bare minimum to follow the rules but not their spirit. (E.g., this is why everybody just ignores policy in universities - they are given a mandate but are not empowered and resourced to follow it.)

I appreciate the language in your proposal above that touches on this; I think it helps to say "we know this might be hard to follow, and here are some things you can do to take minimal steps forward". I also think that if pyOpenSci wants to mandate more ethical data collection, it should find ways to provide resources to projects to empower them to do it (e.g., maybe it contracts with a UX firm to run user interviews as a service, or maybe it runs its own Plausible or Matomo instances for people who want to use web tracking).

@NickleDave (Contributor)

👍 for changing the first sentence to something like

Your package must avoid collecting usage analytics without informing people using it.

@tupui commented Feb 17, 2023

"must" + "avoid" is ambiguous. In this sentence it would be better to say "must not".

@lwasser (Member, Author) commented Feb 21, 2023

Good morning everyone! I am going to remove the telemetry section from our current PR and make a new PR so our process of deciding on language is explicit. One question about this.

@choldgraf wrote above:

  • They would benefit from and want basic information about how their users are using their tools and documentation
  • They have zero resources to collect it on their own with any kind of manual labor
  • They don't have many options to try

I think this is why everybody uses Google Analytics

Frankly, this is exactly the issue that has been bothering me (and that I've been researching without finding a good path forward) for the past months. One note that @skrawcz mentioned was that developers ARE providing a service to users, and that data (as Chris points out) is very valuable. GA is free and easy to set up; it's also baked into numerous docs templates, etc. Plausible and Matomo are great for web analytics but require infrastructure/$$. For our project I plan to pay for Matomo, but there is no way that, in general, we can expect volunteers to start picking up bills for self-hosted analytics - especially the smaller projects that we review.

So - my question for y'all is: what are the options for actually implementing this (aside from a route like the one Homebrew takes, which IS opt-out)? I think Homebrew does a good job of explaining what they collect and how.

@lwasser (Member, Author) commented Feb 21, 2023

@skrawcz commented Feb 22, 2023

Late to the party here. I will write more when I get more of a chance (hopefully next week), but here is a quick, high-level thought:

  • No one is forcing anyone to use open source software.

Thus, I'm of the opinion that end users can ultimately make their own decisions about what's best for them. But not everyone knows how to evaluate/understand the implications, and thus they need a trusted entity to:

a) set clear and transparent guidelines
b) grade the project against those guidelines

With all that said, I think opt-in vs. opt-out shouldn't be dictated by guidelines; instead, the guidelines should specify what is or isn't acceptable under each of those mechanisms for gathering telemetry.

@betatim commented Feb 22, 2023

As a user I dislike opt-out as a model, mostly because I tend to find out about it "by accident" and then am surprised. And negative surprise quickly turns into anger.

Yet I also think that opt-in is not so useful - almost to the point that I'd say it isn't worth discussing a policy around "opt-in", given how much work it would be for library authors and how little insight comes out of it at the end.

Taking these two together I think there needs to be room for "opt-out" models and you need to find a way to remove the "surprise factor". Generally in life it seems most people are happy to do something if you explain to them upfront what it is, why you'd like to do it and how it would help you. However, they will not agree to exactly the same thing if they are surprised or otherwise feel tricked.

For example, I think I'd be totally happy to once a year let a library report the top N functions/classes/methods from its library used in the last week. Especially if you ask me beforehand and explain how this is useful to the library development. If you also want to report things about the function arguments, my system, a "unique" user ID, etc I'd probably quickly say "nope". In part because I'd start worrying that your reporting is too complex and hence the chances of you accidentally leaking some of my data (not just stats about it) are too high. And I don't see why you need to know this.

This means that I think finding a model that allows opt-out, collects data with a limited scope and doesn't surprise the user is worth the effort. A key challenge here is how a library can interact with the user (I think for applications like Jupyter it is ~trivial to see how the interaction would work).


The reason I think opt-in is nearly not worth the time to discuss is that I have no idea how I'd learn about a library having this functionality and then activating it. I'm happy to help you, but if I need to do a lot of legwork (even having to look up the name of that environment variable I'd have to set is probably too much) I'm likely to just forget about it.

@tupui commented Feb 22, 2023

Some of what you are saying is contradictory.

there needs to be room for "opt-out" models and you need to find a way to remove the "surprise factor"
...
Especially if you ask me beforehand

This is exactly what opt-in is.

Regardless, user data collection in any form is regulated and if we want to stay on the safe side, we must not do it in the background without prior explicit user consent.

Yes, for opt-in to be effective, people need to know it exists and activate the thing. This can be achieved by promoting the feature widely in the ecosystem, doing outreach, running workshops explaining why we want this, etc. Even a few reports would already be more helpful than zero, and as time goes on the situation would improve.

I think it's better to be transparent and put ourselves out there for such things. Data collection is a very sensitive topic nowadays, and for it not to backfire, the communication needs to be rock solid. Yes, it's very hard, and even big corps sometimes fail at this.

@betatim commented Feb 22, 2023

Especially if you ask me beforehand

Maybe a better way to phrase it is "if you tell me beforehand". I agree the difference is small. An example I am thinking of is that the privacy policy of an online service is not something the user has to agree to, but they do need to be able to find it and read it and be told that there is such a policy.

Regardless, user data collection in any form is regulated and if we want to stay on the safe side, we must not do it in the background without prior explicit user consent.

I'm not a lawyer, so I don't know if this is true or not. Data protection is a tricky business; for example, an online service that sends you a bill can collect and store your IP address and other data about you. It also does not need to delete it when you ask for your data to be deleted, the reason being that some data is required for the company to prove it collected the right amount of tax or sent you the correct invoice. These often go under "exceptions to fulfill other regulatory duties".

I don't know what the answer is, as I think this is a difficult problem to solve. However, what if, during install time, the library printed a highly visible message saying "we collect some data, ..... simple description like iOS apps..... To turn it off use ...."? That could be something I'd be happy with, and it isn't opt-in either.
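A first-run notice along these lines could be sketched as follows. The package name `mypkg`, the marker file, the URL, and the environment variable are all made up for illustration; this is one possible shape, not an established pattern:

```python
import os
from pathlib import Path

# Hypothetical marker recording that the notice has already been shown.
NOTICE_MARKER = Path.home() / ".config" / "mypkg" / "notice_shown"

NOTICE = (
    "mypkg collects anonymous usage statistics to guide development.\n"
    "Details: https://example.org/mypkg/telemetry\n"
    "To turn this off: export MYPKG_NO_TELEMETRY=1\n"
)

def maybe_show_notice() -> bool:
    """Print the disclosure once, on first run; return True if shown."""
    if os.environ.get("MYPKG_NO_TELEMETRY") == "1":
        return False  # user already opted out; no need to nag
    if NOTICE_MARKER.exists():
        return False  # already shown on a previous run
    print(NOTICE)
    NOTICE_MARKER.parent.mkdir(parents=True, exist_ok=True)
    NOTICE_MARKER.touch()
    return True
```

The marker file keeps the message from nagging on every run, which addresses the "surprise factor" once without becoming noise.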

@choldgraf (Contributor)

This is why my initial recommendation for what to require is to ask projects to be transparent about what data they collect and why, to link to documentation about data collected by whatever services they use, and to make best-faith efforts for this to be discoverable to users. To me, that is the only thing we can unequivocally say is a good idea.

For everything else, like specific decision trees or opt-in vs. opt-out, I feel like there is a lot of grey area; it depends on who you ask, etc. I think answers to those questions are important and should come after thoughtful deliberation and consultation with experts. I don't want that effort to block interim progress on defining a reasonable policy to hold pyOpenSci projects to.

@tupui commented Feb 22, 2023

For online services it's "easy" since they ask you to accept some terms (even though some terms might be illegal and render the agreement void in some countries). Hence you can expose your users to such choices. In the case of a library, we don't have this luxury, especially if the library is a dependency. The last example I have in mind: a lot of folks I talked to at the last SciPy conference did not even know what SciPy was, although they had it as a dependency.

Install- or runtime messages can work, and some packages do that. They ask you to either make a choice to continue the process, or you have to set a variable, etc.

opt in vs out, I feel like there is a lot of grey area

Agreed with that, and to me the safe route is just to say opt-in until we know better.

@skrawcz commented Feb 22, 2023

Some responses:

Yes, for opt-in to be effective, people need to know it exists and activate the thing. This can be achieve by promoting the feature widely in the ecosystem, doing outreach, workshop explaining why we want that, etc. Even a few reports would already be more helpful than 0. And as the time goes the situation would improve.

  1. @tupui can you find an example library that does opt-in telemetry and gains value from it? If there are none, then we have an answer. If there is, maybe there is something to learn/voice to add.

As a user I dislike opt-out as a model, mostly because I tend to find out about it "by accident" and then am surprised. And negative surprise quickly turns into anger.

  2. @betatim The argument about surprise factor I think is misleading. Everyone should be doing diligence before installing something. I don't see how you can solve for that globally. So to @choldgraf 's point, I think during diligence, it should be clear to someone what is or isn't being tracked -- before they install.

I don't know what the answer is, as I think this is a difficult to solve problem. However, what if during install time the library printed a well visible message saying "we collect some data, ..... simple description like iOS apps..... To turn it off use ....". That could be something I'd be happy with and it isn't opt-in either.

  3. @betatim The mechanics of library installs in Python do not allow you to put an explicit setup step for a library into the process - though not everyone controls their environment. A print statement could help -- but it's not guaranteed to be read.

In the case of the library, we don't have this luxury especially if the library is a dependency. Last example I have in mind: a lot of folks I talked to at the last SciPy conf did not even know what SciPy was although they had it as deps.

  4. Libraries that build on top of other libraries should be able to programmatically make the choice to opt out for a user. So telemetry being pulled in as a dependency is in the hands of the library that uses that underlying library.

For example, I think I'd be totally happy to once a year let a library report the top N functions/classes/methods from its library used in the last week. Especially if you ask me beforehand and explain how this is useful to the library development. If you also want to report things about the function arguments, my system, a "unique" user ID, etc I'd probably quickly say "nope". In part because I'd start worrying that your reporting is too complex and hence the chances of you accidentally leaking some of my data (not just stats about it) are too high. And I don't see why you need to know this.
This means that I think finding a model that allows opt-out, collects data with a limited scope and doesn't surprise the user is worth the effort. A key challenge here is how a library can interact with the user (I think for applications like Jupyter it is ~trivial to see how the interaction would work).

  5. Yep, this is the camp that I'm in. To @choldgraf 's earlier point, providing guidance on how best to do this, and what options there are, would be great. And the reason I'm fine with this is that project maintainers are not stopping people from NOT downloading/using the project. No one is forcing someone to use their free software; we just need to help ensure people have the information they need to make the best decision for themselves.

@tupui commented Feb 23, 2023

@skrawcz finding an example where opt-in works or not would only establish a single data point of precedent and not constitute any "proof". Prior experiments could have done things in a non-optimal way, or have biases.

I think the dependency question is linked but a follow-up. There are rules (legal ones) for handling user data and sharing it with "external services" (the sum of anonymous data could break the anonymity). But again, there is no harm if everything is opt-in only.

To the last point, "it's free": I'm not sure what you wanted to say exactly, but just in case: providing something for free does not grant you special "privileges" (and there are still regulations).

@skrawcz commented Feb 23, 2023

@skrawcz finding an example where opt-in works or not would only establish a single data point of precedence and not constitute any "proof". Prior experiments could have done things in a non optimal way, have biases.

I think it would add to the conversation, which is exactly the point: can we learn from them, rather than speculating that opt-in is a successful way to gather telemetry?

I think the dependency question is linked but a follow up. There are rules (legal ones) to handle user data and share them with "external services" (the sum of anonymous data could break the anonymity.) But again, there is no harm is everything is opt-in only.

Agreed, no harm. But I don't think you're considering the implications for projects.

To the last point, "it's free". Not sure what you wanted to say exactly, but just in case: it's not because you are providing anything free that you gain special "privileges" (and there are still regulations.)

I'm not following your point on regulations; it's up to the respective projects to comply with the laws of where they live. That burden is on them, not on the end user.

@betatim commented Feb 23, 2023

2. @betatim The argument about surprise factor I think is misleading. Everyone should be doing diligence before installing something. I don't see how you can solve for that globally. So to @choldgraf 's point, I think during diligence, it should be clear to someone what is or isn't being tracked -- before they install.

The point I was trying to make was that, as a package, you can IMHO remove most(?) of the negative feelings people have by working (hard) to make sure they discover what you are doing before they install you, or as part of the install. As a negative example: the first time I installed brew, there was nothing about tracking next to the "copy this one-liner to install brew" instruction, and I only found out later (and was surprised, hence negative feelings). A positive example, for me personally, would be having a sentence right next to the install command saying "Hey, we collect usage stats that help us prioritize work. Find out how to turn it off and more details." (the link would take you to a page that explains more).

I think how to do this, especially for packages that most people don't directly install (dependencies of dependencies), is non-trivial. But it is worth thinking about, because the benefit of figuring it out would be large.

@elijahbenizzy commented Feb 23, 2023

Chiming in as an open-source author and contributor.

The folks from PostHog wrote a blog post that might be relevant here: https://posthog.com/blog/open-source-telemetry-ethical. It helps classify the different types of user-tracking.

  1. @betatim The argument about surprise factor I think is misleading. Everyone should be doing diligence before installing something. I don't see how you can solve for that globally. So to @choldgraf 's point, I think during diligence, it should be clear to someone what is or isn't being tracked -- before they install.

The point I was trying to make was that as a package you can IMHO remove most(?) of the negative feelings people have by working (hard) to make sure they discover what you are doing before they install you or as part of the install. As a negative example, the first time I installed brew there was nothing about tracking next to the "copy this one liner to install brew" and I only found out later (and was surprised and hence negative feelings). A positive example, for me personally, would be having a sentence right next to the install command saying "Hey, we collect usage stats that help us prioritize work. Find out how to turn it off and more details." (the link would take you to a page that explains more).

I think how to do this, especially for packages that most people don't directly install (dependencies of dependencies), is non trivial. But worth thinking about because the benefit of figuring it out would be large.

With Python libraries it's a little tricky -- it has to be in the README or somewhere in the documentation. And you don't want it to be the first sentence, as that's distracting. Maybe a good policy could be that, if you employ tracking, it has to:

  1. Be mentioned at some point within the install/getting started section of the README
  2. Be easy to opt out of
  3. Link to a "collection policy" with a certain set of information about what data is collected and how it is used (maybe a template provided by pyOpenSci?)
  4. Maybe have a tag at the top of the repo indicating what class of telemetry is collected (e.g., this):

[image: example telemetry badge]

Another idea is to have an environment variable that every project obeys (DISABLE_TRACKING_OS). That would globally opt out and make it simple to manage. It would also handle dependencies.
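A minimal check for such a shared kill switch might look like this sketch. DISABLE_TRACKING_OS is the hypothetical variable proposed above, MYPKG_NO_TELEMETRY stands in for a per-package override, and DO_NOT_TRACK is an existing community convention that some tools honor:

```python
import os

def tracking_disabled(env=os.environ) -> bool:
    """Return True if any recognized opt-out variable is set.

    Checks the hypothetical ecosystem-wide variable suggested above,
    the community DO_NOT_TRACK convention, and a per-package override.
    """
    for var in ("DISABLE_TRACKING_OS", "DO_NOT_TRACK", "MYPKG_NO_TELEMETRY"):
        value = env.get(var, "").strip().lower()
        # Treat any non-empty value other than an explicit "off" as opt-out.
        if value and value not in ("0", "false", "no"):
            return True
    return False
```

Passing `env` explicitly makes the check easy to test; in real use the default `os.environ` applies.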

I also like the PostHog classification of types of telemetry -- we could link to that and encode it.

I'd hate to exclude projects that utilize opt-out telemetry -- open source is already enough of a contribution to the world that we shouldn't make it any harder for them to provide value. Excluding projects that utilize opt-out telemetry seems like short-term thinking to me.

In the long term, projects that utilize telemetry will be able to succeed better, update APIs quicker, and prune branches for product direction. For example, if pandas had employed telemetry, they might have realized that 90% of the people use 10% of their API, and that developing a library with 5 different ways to do every type of operation might not have been the best approach 😆.

That said, consistency around this is key -- hence the recommendations above.

@tupui commented Feb 23, 2023

I think it would add to the conversation, which is exactly the point, can we learn from them? Rather speculating that opt-in is a successful way to gather telemetry.

The point of this conversation, at least to me, is to decide whether or not to give libraries the option to go the telemetry route. Whether it's effective or not is another discussion every package can evaluate themselves, based on their use, relationships with users, etc.

And if "we" say there is an option, how it should be framed.

But I don't think you're considering the implications on projects.

As a SciPy (and other stuff) maintainer, I do 😉

@skrawcz commented Mar 2, 2023

Hey, so it seems like the conversation hasn't moved.

So opt-in or opt-out?

I'm for allowing a default of having to opt-out as long as:

  1. the project clearly states in its documentation that there is data collection, and how to opt out: e.g., via an environment variable, a configuration file, and/or programmatically.
  2. the data collected is for usage statistics of the library, taking care not to log user data or identifiable traits.
  3. this can all be verified in a straightforward manner - the code is not obtuse.

If you need identifiable information/user data, then this should require an explicit opt-in.

@stefanv (Contributor) commented Mar 2, 2023

Opt-out is, in my mind, a no-go. I think you'd very quickly find our libraries ejected from national labs, from caring users' laptops, etc.

I would be VERY ANNOYED if you sent data from my computer without my permission. It's already a thorny issue to download data transparently (which we do in certain instances), and distros may patch libraries to keep all files local.

That said, opt-in doesn't have to be so painful if you can do it once per system. Debian's popcon (popularity contest) package is installed widely, and reports useful data.
@njsmith and @Carreau worked on this problem for a bit; I'm not sure how far they got?

@stefanv (Contributor) commented Mar 2, 2023

  1. the project clearly states in documentation that there is data collection, and how to opt-out: e.g. environment variable, configuration file, and/or programmatically.

You have no guarantee that a user reads the docs before installing a package, unfortunately. The blocker with library telemetry has been exactly that: which user interface do you use to communicate that you want to collect data?

@Carreau commented Mar 2, 2023

I'm not sure how far they got?

Not far, because many people disagreed on where this should go.

The minimum we agreed on was that there should be a common dependency that just acts as a central point for collecting consent and exposing that value to libraries.

Libraries would use this to query whether they are allowed to collect (or not).
Frontends would gather the user's consent - potentially on a per-library/per-environment basis, possibly time-limited (re-requesting, say, every 6 months).

(that's the summary which is in the slides).

This library could also provide utilities to collect/batch upload.
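Roughly, the shape of such a shared consent dependency might be sketched as follows. All names here are hypothetical; as noted, the actual design was never finalized:

```python
import time

class ConsentRegistry:
    """Central point where frontends record consent and libraries query it."""

    def __init__(self, ttl_seconds: float = 6 * 30 * 24 * 3600):
        self._grants = {}        # library name -> (opted_in, timestamp)
        self._ttl = ttl_seconds  # e.g. re-request consent every ~6 months

    def record(self, library: str, opted_in: bool) -> None:
        """Called by a frontend after asking the user."""
        self._grants[library] = (opted_in, time.time())

    def may_collect(self, library: str) -> bool:
        """Called by a library before collecting anything.

        Returns False when there is no recorded decision, the decision
        was negative, or the decision has expired and must be re-asked.
        """
        entry = self._grants.get(library)
        if entry is None:
            return False
        opted_in, when = entry
        if time.time() - when > self._ttl:
            return False  # consent expired; a frontend should re-ask
        return opted_in

# A shared, process-wide instance the ecosystem could agree on.
registry = ConsentRegistry()
```

The split of roles matches the summary above: frontends call `record`, libraries only ever call `may_collect`, and absence of a decision defaults to no collection.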

It was just a slippery slope deciding what to implement and how, even for just the above - in particular with respect to legal responsibility. And, like @stefanv said, even opt-in is problematic for some institutions, as end users tend to blindly click yes, not realizing they may be sharing potentially private data. So one needs a system-wide flag to disable telemetry.

From the other side: worrying about library authors collecting data.

We had a well-known social media company that deployed a Jupyter-related repo that was not supposed to be for public usage and had GA metrics enabled by default; it was leaking internal private information about quarterly results. It was a real problem.

So library developers need to think about the fact that their metrics endpoints may start to receive data they do not wish to receive.

Keep in mind that the source is open, so it's trivial for someone to make a fork or install a dev version and start sending bad data, sensitive data, or data that you are not legally allowed to hold, even unintentionally.

The legality of holding this data may depend on the country of origin: even IP addresses are typically considered personal information if the user is in the EU, so your data-collection endpoint needs to take that into account (and the server likely needs to be in the EU as well).
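One common mitigation (a sketch only, not legal advice; the function name is hypothetical) is to coarsen IP addresses before storing them, keeping only a network prefix so the stored value is no longer a precise identifier:

```python
# Sketch: coarsen an IP address before storage by zeroing the host bits,
# similar to the "IP anonymization" mode some analytics tools offer.
import ipaddress


def coarsen_ip(ip: str) -> str:
    addr = ipaddress.ip_address(ip)
    if addr.version == 4:
        # keep the /24 network, zero the last octet
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
    else:
        # keep the /48 prefix for IPv6
        net = ipaddress.ip_network(f"{ip}/48", strict=False)
    return str(net.network_address)
```

Whether coarsened addresses still count as personal data is a legal question that varies by jurisdiction; this only reduces, not eliminates, the concern.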

@lwasser
Member Author

lwasser commented Mar 2, 2023

I've been thinking about this a lot and don't see a clear answer.

There is absolutely some public trauma around data collection, given that it has been used against people in so many ways (sometimes even in ways that support racism). People are sensitive to this, and I think big companies like Google, Meta, etc. are partially behind the bad blood.

Technically, I don't see a clear path forward for how someone should collect data in this ecosystem.

It seems like opt-in data will never be quite the quality that a developer needs. Essentially, there is a particular type of user who will go through the extra effort to opt in. It's not like websites, where you can easily show a quick prompt alerting users to data collection; we are talking about complex environments with many dependencies.

I'm concerned about implementing a policy that

  1. is opt-in and so provides unusable / ineffective data to developers because of what companies like Meta have done to our society. It would be a lot of development investment with a poor return in data quality.
  2. is opt-out and provides better data, but we have a community-wide sensitivity and trauma to deal with here, and also no consistent model to collect and store data in a transparent way (that I've seen yet, anyway).

Finally, to Chris's point above, I'm not sure how we can implement ANY policy when there is no clear technical implementation path for developers.

So where does this leave us? I am still thinking about it; I don't want to implement something without fully understanding all aspects of it. I feel like we've hit a bit of a wall here.

This may just take a bit of time to figure out. Unfortunately, we may not be able to review packages that use telemetry until we figure this out.

I don't have a good answer yet but am open to suggestions.

@stefanv
Contributor

stefanv commented Mar 2, 2023

This may just take a bit of time to figure out. Unfortunately, we may not be able to review packages that use telemetry until we figure this out.

I think this is the most sensible course of action. None of the open source scientific Python packages out there currently use telemetry, so I don't think this will hold your target audience back too much.

@skrawcz

skrawcz commented Mar 3, 2023

I would be VERY ANNOYED if you sent data from my computer without my permission. It's already a thorny issue to download data transparently (which we do in certain instances), and distros may patch libraries to have all files local.

Well, (a) it's your choice to use said software, and (b) don't download things you didn't read the license & README for... Technically, by using any open source software, you're agreeing to the license under which it's provisioned. Without case law this is all just speculation, but I'm pretty sure I could argue that a prominent note on telemetry, much like a license, would cover "consent".

  1. the project clearly states in documentation that there is data collection, and how to opt-out: e.g. environment variable, configuration file, and/or programmatically.

You have no guarantee that a user reads the docs before installing a package, unfortunately. The blocker with library telemetry has been exactly that: which user interface do you utilize to communicate that you want to collect data?

Correct, you don't. But then again, people should be reading the license of the software they're freely appropriating too, to make sure they're complying with it... 🤷

This may just take a bit of time to figure out. Unfortunately, we may not be able to review packages that use telemetry until we figure this out.

I'd argue the focus should be on packages and their merits for scientific work. Excluding something that is useful to the community because it has "telemetry" is, I think, doing the community a disservice. Just prominently state that the package captures telemetry, and users can make their own choice. This body should not say what is good or bad, but instead what is useful or not and any caveats therein.

Keep in mind, that the source is open, so it's trivial if someone makes a fork or install a dev-version and start sending bad data, sensitive data or data that you legally are not allowed to hold. Even non-intentionally.

Not to downplay the issues -- but this body worrying about whether someone forks a package and starts sending you data you didn't intend to receive seems to be outside the scope of things, unless the body intends to provide guidelines on how to stop such an attack?

@stefanv
Contributor

stefanv commented Mar 3, 2023

Correct, you don't. But then again, people should be reading the license of the software they're freely appropriating too, to make sure they're complying with it... shrug

The license doesn't govern telemetry, and requiring that users read the documentation before using a library is absurd. Imagine you are reproducing the work of a colleague, and all of a sudden your computer starts sending machine info out. This is outside of reasonable expectation.

My take is that PyOpenSci is about establishing good community practices that lead to high quality open source scientific software. The "you should have done your homework" model is not one I'd consider virtuous.

@tupui

tupui commented Mar 3, 2023

Things like GDPR require an explicit action from the user. Just writing it somewhere would be insufficient, as it needs its own "check box".

Besides setting a precedent in court, to me there is the ethical side of things. That's just a red flag to my moral compass here.

@Carreau

Carreau commented Mar 3, 2023

Not to down play the issues -- but, this body worrying about whether someone forks and starts sending you data you didn't intend, seems to be getting outside the scope of things, unless the body intends to provide guidelines on how to stop such an attack?

There are three points here:

  1. Warn users that telemetry is not that easy, and collecting clean data is often harder than you think.
  2. Warn that if the community publishes guidance saying it's ok to collect data, many packages may start to do so without realizing the consequences.
  3. I was not thinking of explicit attacks, just that you have no idea how people use your libraries, and it's easy to get flooded unintentionally.

I would not recommend starting to develop a full framework/server to handle this. That's IMHO one of the reasons sempervirens did not succeed: it tried to implement too much.

@Carreau

Carreau commented Mar 3, 2023

Also, I got nerd-sniped and started to implement a library to broker consent requests between libraries and frontends.

@Batalex
Contributor

Batalex commented Mar 3, 2023

don't download things you didn't read the license & README for...

Correct, you don't. But then again, people should be reading the license of the software they're freely appropriating too, to make sure they're complying with it... 🤷

This is wrong on so many levels. Yes, technically you are right, but realistically this is not feasible.

You are working on the assumption that everyone has an equal capacity to understand what they are signing up for by pip install-ing software. This is false, whether the barrier is language or even knowing what telemetry is. For licenses, we have things like choosealicense.com, license-scanning tools, etc.
The ecosystem needs to be more mature when it comes to telemetry. For instance, there is no dedicated classifier on PyPI as of today: https://pypi.org/classifiers/.

Even from a technical standpoint, this is infeasible. See how many dependencies are pulled in just to install jupyter:

echo jupyter > req.in && pip-compile req.in

Details

anyio==3.6.2
# via jupyter-server
argon2-cffi==21.3.0
# via
# jupyter-server
# nbclassic
# notebook
argon2-cffi-bindings==21.2.0
# via argon2-cffi
arrow==1.2.3
# via isoduration
asttokens==2.2.1
# via stack-data
attrs==22.2.0
# via jsonschema
backcall==0.2.0
# via ipython
beautifulsoup4==4.11.2
# via nbconvert
bleach==6.0.0
# via nbconvert
cffi==1.15.1
# via argon2-cffi-bindings
colorama==0.4.6
# via ipython
comm==0.1.2
# via ipykernel
debugpy==1.6.6
# via ipykernel
decorator==5.1.1
# via ipython
defusedxml==0.7.1
# via nbconvert
executing==1.2.0
# via stack-data
fastjsonschema==2.16.3
# via nbformat
fqdn==1.5.1
# via jsonschema
idna==3.4
# via
# anyio
# jsonschema
ipykernel==6.21.2
# via
# ipywidgets
# jupyter
# jupyter-console
# nbclassic
# notebook
# qtconsole
ipython==8.11.0
# via
# ipykernel
# ipywidgets
# jupyter-console
ipython-genutils==0.2.0
# via
# nbclassic
# notebook
# qtconsole
ipywidgets==8.0.4
# via jupyter
isoduration==20.11.0
# via jsonschema
jedi==0.18.2
# via ipython
jinja2==3.1.2
# via
# jupyter-server
# nbclassic
# nbconvert
# notebook
jsonpointer==2.3
# via jsonschema
jsonschema[format-nongpl]==4.17.3
# via
# jupyter-events
# nbformat
jupyter==1.0.0
# via -r req.in
jupyter-client==8.0.3
# via
# ipykernel
# jupyter-console
# jupyter-server
# nbclassic
# nbclient
# notebook
# qtconsole
jupyter-console==6.6.2
# via jupyter
jupyter-core==5.2.0
# via
# ipykernel
# jupyter-client
# jupyter-console
# jupyter-server
# nbclassic
# nbclient
# nbconvert
# nbformat
# notebook
# qtconsole
jupyter-events==0.6.3
# via jupyter-server
jupyter-server==2.3.0
# via
# nbclassic
# notebook-shim
jupyter-server-terminals==0.4.4
# via jupyter-server
jupyterlab-pygments==0.2.2
# via nbconvert
jupyterlab-widgets==3.0.5
# via ipywidgets
markupsafe==2.1.2
# via
# jinja2
# nbconvert
matplotlib-inline==0.1.6
# via
# ipykernel
# ipython
mistune==2.0.5
# via nbconvert
nbclassic==0.5.2
# via notebook
nbclient==0.7.2
# via nbconvert
nbconvert==7.2.9
# via
# jupyter
# jupyter-server
# nbclassic
# notebook
nbformat==5.7.3
# via
# jupyter-server
# nbclassic
# nbclient
# nbconvert
# notebook
nest-asyncio==1.5.6
# via
# ipykernel
# nbclassic
# notebook
notebook==6.5.2
# via jupyter
notebook-shim==0.2.2
# via nbclassic
packaging==23.0
# via
# ipykernel
# jupyter-server
# nbconvert
# qtpy
pandocfilters==1.5.0
# via nbconvert
parso==0.8.3
# via jedi
pickleshare==0.7.5
# via ipython
platformdirs==3.0.0
# via jupyter-core
prometheus-client==0.16.0
# via
# jupyter-server
# nbclassic
# notebook
prompt-toolkit==3.0.38
# via
# ipython
# jupyter-console
psutil==5.9.4
# via ipykernel
pure-eval==0.2.2
# via stack-data
pycparser==2.21
# via cffi
pygments==2.14.0
# via
# ipython
# jupyter-console
# nbconvert
# qtconsole
pyrsistent==0.19.3
# via jsonschema
python-dateutil==2.8.2
# via
# arrow
# jupyter-client
python-json-logger==2.0.7
# via jupyter-events
pywin32==305
# via jupyter-core
pywinpty==2.0.10
# via
# jupyter-server
# jupyter-server-terminals
# terminado
pyyaml==6.0
# via jupyter-events
pyzmq==25.0.0
# via
# ipykernel
# jupyter-client
# jupyter-console
# jupyter-server
# nbclassic
# notebook
# qtconsole
qtconsole==5.4.0
# via jupyter
qtpy==2.3.0
# via qtconsole
rfc3339-validator==0.1.4
# via
# jsonschema
# jupyter-events
rfc3986-validator==0.1.1
# via
# jsonschema
# jupyter-events
send2trash==1.8.0
# via
# jupyter-server
# nbclassic
# notebook
six==1.16.0
# via
# asttokens
# bleach
# python-dateutil
# rfc3339-validator
sniffio==1.3.0
# via anyio
soupsieve==2.4
# via beautifulsoup4
stack-data==0.6.2
# via ipython
terminado==0.17.1
# via
# jupyter-server
# jupyter-server-terminals
# nbclassic
# notebook
tinycss2==1.2.1
# via nbconvert
tornado==6.2
# via
# ipykernel
# jupyter-client
# jupyter-server
# nbclassic
# notebook
# terminado
traitlets==5.9.0
# via
# comm
# ipykernel
# ipython
# ipywidgets
# jupyter-client
# jupyter-console
# jupyter-core
# jupyter-events
# jupyter-server
# matplotlib-inline
# nbclassic
# nbclient
# nbconvert
# nbformat
# notebook
# qtconsole
uri-template==1.2.0
# via jsonschema
wcwidth==0.2.6
# via prompt-toolkit
webcolors==1.12
# via jsonschema
webencodings==0.5.1
# via
# bleach
# tinycss2
websocket-client==1.5.1
# via jupyter-server
widgetsnbextension==4.0.5
# via ipywidgets

Even with the technical knowledge of how to use pip-tools, you cannot reasonably expect people to check all of this.

@skrawcz

skrawcz commented Mar 3, 2023

My take is that PyOpenSci is about establishing good community practices that lead to high quality open source scientific software.

@stefanv I agree. But in my career I've seen that data helps, which is why I want telemetry to help guide development and make things better. 🤷 That said, I acknowledge the viewpoint that people want it all -- free software, which they didn't write, that comes without any cost to them and helps them with the task at hand.

The "you should have done your homework" model is not one I'd consider virtuous.

I don't see a way around that. Someone, somewhere, needs to do the homework. So who?

@Carreau @Batalex, thanks those are great points!

Even with the technical knowledge of how to use pip-tools, you cannot reasonably expect people to check all of this.

Agreed, I wouldn't. I would expect the jupyter project to do that for me, since they control their dependencies, right?

Also I got nerd sniped, and started to implement a library to broker asking for consent between libraries and frontend

@Carreau a standard for packages to obey doesn't seem unreasonable. Make everyone look for a specific config file, much like robots.txt for websites?

The other thought I had -- which increases the burden on the package maintainer (and could potentially get a little tricky with Python package management) -- is to create two packages in the case where opt-out is the default:

  1. The first package would be their main one.
  2. The second package would be PACKAGE-telemetry-off or something like that. The maintainer could then at least see, via download stats, how many people are opting for this path and thus how many "uncounted" uses there are.
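A minimal sketch of how the main package could detect the opt-out sibling at import time (both package names are hypothetical):

```python
# Sketch of the two-package idea: at import time, the main package checks
# whether the opt-out sibling package is installed. Names are hypothetical;
# "mypackage_telemetry_off" would itself be an empty marker package.
from importlib.util import find_spec

# Telemetry stays on only when the opt-out marker package is absent.
TELEMETRY_ENABLED = find_spec("mypackage_telemetry_off") is None
```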

@lwasser
Member Author

lwasser commented Jul 27, 2023

Hi friends - I wanted to circle back on this after thinking about it (I know, for too long) and talking with @skrawcz at SciPy! I think that we should have an opt-in telemetry policy for the time being.

Opt-in allows maintainers to at least consider collecting some data from their users, as it is very helpful for development purposes on numerous levels. But we don't want users to be surprised by data being sent to other locations.

We have only had 1 package come through that has telemetry. What do you all think about starting with an opt-in option moving forward?

I opened this PR ages ago that contains language related to telemetry that we can add to our policies page. We can then add some text to our packaging guide about best practices for implementation in the future. I know implementation is the hard part here.

I invite y'all to revisit that PR on telemetry and provide any additional feedback you might have on the text there.

If you want to discuss this further here, I welcome that as well.

@tupui

tupui commented Jul 27, 2023

I am +1 on the opt-in strategy. The next big question is how to properly and carefully communicate such things.

@lwasser
Member Author

lwasser commented Aug 2, 2023

@tupui wonderful - thank you!! We ended up re-engaging with this over at our Discourse in hopes that others might see it and respond. Jonny has a few more stipulations there that I think are really nice, so if you want to follow along, you (and EVERYONE here) are welcome to. We are going to move forward with an opt-in minimum policy, with some other stipulations around data and how it is shared and used. Let's work together to find a policy that we all feel good about, one that will protect our users but also support maintainers' needs!!


10 participants