cdn_logs: Filter downloads by user agent #12210
Add `user_agent` module with `should_count_user_agent()` fn to determine if downloads should be counted based on user agent. Currently filters to only count cargo client downloads.
Apply user agent filtering to Fastly log parsing. Downloads are counted if they have no user agent (for backwards compatibility with older logs) or if the user agent passes the `should_count_user_agent()` check. Currently filters out non-cargo user agents like Bazel.
Apply user agent filtering to CloudFront log parsing. Downloads are counted if they have no user agent (for backwards compatibility) or if the user agent passes the `should_count_user_agent()` check.
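The counting rule described in these commits can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `cargo/` and `cargo ` prefixes are assumptions about how cargo identifies itself, and `should_count_download()` is a hypothetical helper name for the "no user agent, or a passing user agent" rule.

```rust
/// Hypothetical sketch: returns `true` if the user agent looks like a cargo client.
/// The two prefixes below are assumptions, not taken from the PR.
pub fn should_count_user_agent(user_agent: &str) -> bool {
    user_agent.starts_with("cargo/") || user_agent.starts_with("cargo ")
}

/// Hypothetical helper: a log line without a user agent is still counted
/// (backwards compatibility with older logs); otherwise the user agent
/// has to pass `should_count_user_agent()`.
pub fn should_count_download(user_agent: Option<&str>) -> bool {
    match user_agent {
        None => true,
        Some(ua) => should_count_user_agent(ua),
    }
}
```

With this shape, the Fastly and CloudFront parsers could share one check and differ only in how they extract the optional user agent field from a log line.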
Would it be possible to categorize the downloads by user agent instead of filtering out anything that isn't cargo? The UI could then have a vertical axis with the other UAs disabled by default. That would allow interested parties to still see the bots and other package manager downloads compared to people using Cargo.
it would be possible, but it would significantly complicate our data storage requirements. I'm not convinced that it's worth it at the moment tbh.
Would it be possible to have a simple separation between Cargo (and other known package managers) on one side and everything else on the other side? That way you only need to have two columns in the database.
the download counts are by far our biggest table already and take up about 60-70% of the whole database for just the data of the past 90 days. I don't see us doubling our space requirements for that table for something that is just nice to have 😉
Unfortunate, but makes sense.
eth3lbert
left a comment
LGTM 👍
This PR proposes to only count crate downloads for requests with a user-agent header belonging to cargo. This addresses the discussion at https://rust-lang.zulipchat.com/#narrow/channel/318791-t-crates-io/topic/download.20counting/with/547314718.
This should hopefully filter out all of the systems that appear to mirror crates.io for various purposes, leaving only somewhat legitimate downloads.
I chose to also ignore clients like Buck and Bazel for now. If we notice that they start to be responsible for a significant portion of the traffic we can always change the logic to include them too.