@Turbo87
Member

@Turbo87 Turbo87 commented Oct 27, 2025

This PR proposes to only count crate downloads for requests with a user-agent header belonging to cargo. This addresses the discussion at https://rust-lang.zulipchat.com/#narrow/channel/318791-t-crates-io/topic/download.20counting/with/547314718.

This should hopefully filter out all of the systems that appear to mirror crates.io for various purposes, leaving only somewhat legitimate downloads.

I chose to also ignore clients like Buck and Bazel for now. If we notice that they start to be responsible for a significant portion of the traffic we can always change the logic to include them too.

Add `user_agent` module with `should_count_user_agent()` fn to
determine if downloads should be counted based on user agent.
Currently filters to only count cargo client downloads.

Apply user agent filtering to Fastly log parsing. Downloads are counted
if they have no user agent (for backwards compatibility with older logs)
or if the user agent passes the `should_count_user_agent()` check.
Currently filters out non-cargo user agents like Bazel.

Apply user agent filtering to CloudFront log parsing. Downloads are
counted if they have no user agent (for backwards compatibility) or if
the user agent passes the `should_count_user_agent()` check.
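
The counting rule described in these commit messages can be sketched roughly as follows. This is a minimal illustration, not the code from the merged commit: only the name `should_count_user_agent()` comes from the PR, while the `cargo/` and `cargo ` prefixes and the `should_count_download()` helper are assumptions made for the example.

```rust
/// Returns `true` if a download with this user agent should be counted.
///
/// Assumption: cargo identifies itself with a user agent that starts with
/// either `cargo/` or `cargo ` followed by its version. The matching rule
/// used in the actual PR may differ.
pub fn should_count_user_agent(user_agent: &str) -> bool {
    user_agent.starts_with("cargo/") || user_agent.starts_with("cargo ")
}

/// Hypothetical helper that mirrors the rule from the commit messages:
/// a log entry without a user agent is still counted (backwards
/// compatibility with older logs); otherwise the user agent has to pass
/// the `should_count_user_agent()` check.
pub fn should_count_download(user_agent: Option<&str>) -> bool {
    match user_agent {
        None => true,
        Some(ua) => should_count_user_agent(ua),
    }
}
```

Under these assumptions, a request from Bazel or Buck is skipped, while a log line that predates user-agent logging is still counted.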
@Turbo87 Turbo87 added the C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works label Oct 27, 2025
@Turbo87 Turbo87 requested a review from a team October 27, 2025 17:12
@Urgau
Member

Urgau commented Oct 27, 2025

Would it be possible to categorize the downloads by user agent instead of filtering out anything that isn't cargo?

The UI could then have a vertical axis with the other UAs disabled by default.

That would allow interested parties to still see the bots and other package manager downloads compared to people using Cargo.

@Turbo87
Member Author

Turbo87 commented Oct 27, 2025

it would be possible, but it would significantly complicate our data storage requirements. I'm not convinced that it's worth it at the moment tbh.

@Urgau
Member

Urgau commented Oct 27, 2025

Would it be possible to have a simple separation between Cargo (and other known package managers) on one side and everything else on the other side? That way you would only need two columns in the database.

@Turbo87
Member Author

Turbo87 commented Oct 27, 2025

the download counts are by far our biggest table already and take up about 60-70% of the whole database for just the data of the past 90 days. I don't see us doubling our space requirements for that table for something that is just nice to have 😉

@Urgau
Member

Urgau commented Oct 27, 2025

Unfortunate, but makes sense.

Contributor

@eth3lbert eth3lbert left a comment

LGTM 👍

@Turbo87 Turbo87 merged commit 5fb8de1 into rust-lang:main Oct 29, 2025
10 checks passed
@Turbo87 Turbo87 deleted the cdn-log-user-agent-filtering branch October 29, 2025 07:50

Labels

A-backend ⚙️ C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works

5 participants