
Usage telemetry of Hamilton features #248

Closed
skrawcz opened this issue Dec 16, 2022 · 6 comments · Fixed by #255

Comments

@skrawcz
Collaborator

skrawcz commented Dec 16, 2022

Is your feature request related to a problem? Please describe.
To better serve the Hamilton community, finer-grained usage metrics would be very helpful.

In the project's current state, we don't know anything about usage of the feature set that Hamilton offers, other than what people ask in the Slack help channel.

It would be great to know what is really being used, e.g. which decorators, which experimental modules, etc.
That way when deciding on future improvements and adjustments we could:

  1. Make an informed decision as to how likely a change is to impact the community.
  2. Understand the impact of new feature additions and adoption.
  3. Understand when features should move on from being experimental.
  4. Understand how quickly people adjust and upgrade their Hamilton versions.
  5. Understand where people encounter the most errors -- and help improve documentation and/or error messages.

Describe the solution you'd like
It would be great to know in an anonymous fashion:

  1. Provide the ability to opt out of sending any tracking information.
  2. What decorators are used in a Hamilton DAG definition.
  3. What graph adapters are used.
  4. How many functions comprise a DAG & what are the in/out edge counts.
  5. Python version
  6. Operating system type
  7. Operating system version
  8. Source of errors at DAG construction time, i.e. which part of the Hamilton code base is throwing it -- ideally down to the line of Hamilton code that caused it.
  9. Source of errors at DAG execution time -- is it user code, or Hamilton code.

Of course we'd have an explicit policy on its usage, and make it clear to users how to opt out.
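To make the shape of this concrete, here's a rough sketch of what an anonymized event payload and opt-out check could look like (the HAMILTON_TELEMETRY_ENABLED variable and the field names are just placeholders, not a final design):

```python
import os
import platform
import sys
import uuid


def telemetry_enabled() -> bool:
    # Placeholder opt-out: setting the env variable to "false" disables all tracking.
    return os.environ.get("HAMILTON_TELEMETRY_ENABLED", "true").lower() != "false"


def build_event(decorators: list, graph_adapter: str, num_functions: int) -> dict:
    # Only anonymous, aggregate information: no function names, IPs, or user data.
    return {
        "event_id": str(uuid.uuid4()),
        "decorators_used": sorted(set(decorators)),  # e.g. ["config.when", "parametrize"]
        "graph_adapter": graph_adapter,              # adapter class name only
        "num_functions": num_functions,              # DAG size, not its shape
        "python_version": platform.python_version(),
        "os_type": sys.platform,
        "os_version": platform.release(),
    }
```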

Describe alternatives you've considered
N/A

Additional context
Telemetry usage tracking is becoming more standard in open source. It helps maintainers better serve the community.

E.g. data-diff does this -- see their tracking code and privacy policy.

@elijahbenizzy
Collaborator

elijahbenizzy commented Dec 16, 2022

I think we want to be very clear about what not to include, although most of this is implied above:

(1) IP Address, anything identifying (implied by anonymous)
(2) function names
(3) Any information about return sizes
(4) Any information about graph shape (other than size)

Another Q is how we disambiguate users anonymously -- one option is to include a tracking ID as an env variable (generate a token), but I think that's too high a lift.
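For reference, the option I mean would look something like this (HAMILTON_TELEMETRY_ID is a made-up name): respect an opaque token if the user sets one, otherwise fall back to a random per-run ID so nothing is derived from the machine:

```python
import os
import uuid


def anonymous_client_id() -> str:
    # Made-up env variable: an opaque user-supplied token if present,
    # otherwise a random per-run ID -- nothing machine- or user-identifying.
    return os.environ.get("HAMILTON_TELEMETRY_ID", str(uuid.uuid4()))
```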

@elshize

elshize commented Dec 20, 2022

I personally have no issue with the idea of telemetry, though as you both mentioned, it is crucial to (1) be clear and transparent as to what is collected and how, and (2) provide an easy way to disable it altogether.

Another thing to consider is that certain environments where Hamilton is deployed may simply not allow sending anything outside of the runtime environment. By allow I mean things like firewalls rather than policies. If that is the case, it should not affect overall functionality (no crash, significant slowdown, or anything like that). I'm sure you've already thought of that, but it doesn't hurt to mention it...

@elijahbenizzy
Collaborator

> I personally have no issue with the idea of telemetry, though as you both mentioned, it is crucial to (1) be clear and transparent as to what is collected and how, and (2) provide an easy way to disable it altogether.
>
> Another thing to consider is that certain environments where Hamilton is deployed may simply not allow sending anything outside of the runtime environment. By allow I mean things like firewalls rather than policies. If that is the case, it should not affect overall functionality (no crash, significant slowdown, or anything like that). I'm sure you've already thought of that, but it doesn't hurt to mention it...

Yeah I think that's a great call -- specifically adding another requirement:

  • errors = log, not failure
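
i.e. something along these lines, where the actual send (whatever it ends up being) gets wrapped so a blocked firewall or outage only shows up as a debug log:

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def safe_track(event: dict, send: Callable[[dict], None]) -> None:
    """Send a telemetry event; never let a failure propagate to user code."""
    try:
        send(event)  # e.g. an HTTP POST with a short timeout
    except Exception as e:
        # errors = log, not failure
        logger.debug("Telemetry send failed: %s", e)
```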

@elshize

elshize commented Dec 21, 2022

> errors = log, not failure

Yep, it would be sad if I pushed an update to production and it failed because it can't send metrics. It wouldn't actually happen to me because my staging env would fail first, but you get the point.

I also can't be the only person in this situation, which is something to think about when analyzing the data. It could be that a big chunk of it will come from development/local environments and not necessarily be fully representative of what is run in production.

It could be a good idea to write all telemetry to a file in case sending fails, and maybe provide a simple way to submit that manually. I honestly don't know if that's worth the effort or not; I imagine not many people would do that, and if they did, it would probably be a one-time thing, but maybe it would provide some useful information to y'all...
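Something like this is what I have in mind (the ~/.hamilton/ path is just a guess on my part, not an existing convention):

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def spool_event_locally(event: dict) -> None:
    """Append an unsent telemetry event to a local file for optional manual submission later."""
    try:
        spool_dir = Path.home() / ".hamilton"  # assumed location
        spool_dir.mkdir(parents=True, exist_ok=True)
        with open(spool_dir / "unsent_telemetry.jsonl", "a") as f:
            f.write(json.dumps(event) + "\n")
    except Exception as e:
        # Even the fallback should never fail the user's run.
        logger.debug("Could not spool telemetry event: %s", e)
```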

I have never tried to collect telemetry in a similar scenario (I work on an internal solution, so telemetry is a different, simpler story), so I have no idea if any of what I say makes sense, but just throwing my thoughts out there :)

@skrawcz linked a pull request Dec 21, 2022 that will close this issue
@skrawcz
Collaborator Author

skrawcz commented Dec 21, 2022

Started a draft PR (#255) to sketch out some of what has been discussed here.

@skrawcz self-assigned this Dec 21, 2022
@skrawcz
Collaborator Author

skrawcz commented Dec 27, 2022

PR is up for those interested, with tests and all - #255.
