
Faster crates.io fetching? #8

Closed
repi opened this issue Oct 10, 2020 · 9 comments
Labels
enhancement New feature or request

Comments

@repi

repi commented Oct 10, 2020

Awesome project! I've been wanting a tool with this type of functionality for a while and I'm really glad I ran into it!

We have a fairly large project with, ahem, 548 crate dependencies, so the 2-second delay between fetching data for each crate from crates.io really does add up!

Fetching publisher info from crates.io
This will take roughly 2 seconds per crate due to API rate limits
Fetching data for "addr2line" (0/548)
Fetching data for "adler" (1/548)

Are there any paths to speeding this up?

  • Can one simply increase the rate limit / lower the delay?
  • Does crates.io support some type of batched queries?
  • Could one add support for crates.io API tokens to avoid rate limiting? Those tokens appear to be read/write, though, and since this tool only needs read-only access to the database, some may be uncomfortable leaving publishing-capable tokens lying around.
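For context on what the per-crate delay amounts to, a client-side throttle like the one the tool describes ("roughly 2 seconds per crate") could be sketched as below. This is a hypothetical helper, not the actual cargo-supply-chain implementation; the `Throttle` type and its API are invented for illustration, and the interval is shortened here so the example runs quickly:

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Minimal client-side throttle: guarantees at least `min_interval`
/// between successive requests. (Hypothetical sketch, not the real
/// cargo-supply-chain code.)
struct Throttle {
    min_interval: Duration,
    last: Option<Instant>,
}

impl Throttle {
    fn new(min_interval: Duration) -> Self {
        Throttle { min_interval, last: None }
    }

    /// Sleep just long enough to honor the interval since the
    /// previous call, then record the time of this request.
    fn wait(&mut self) {
        if let Some(last) = self.last {
            let elapsed = last.elapsed();
            if elapsed < self.min_interval {
                thread::sleep(self.min_interval - elapsed);
            }
        }
        self.last = Some(Instant::now());
    }
}

fn main() {
    // The real tool would use ~2 seconds; 100 ms keeps the demo fast.
    let mut throttle = Throttle::new(Duration::from_millis(100));
    let start = Instant::now();
    for _ in 0..3 {
        throttle.wait();
        // a real client would issue the crates.io request here
    }
    // Three calls at >= 100 ms spacing take at least ~200 ms total
    // (the first call does not sleep).
    assert!(start.elapsed() >= Duration::from_millis(200));
    println!("elapsed: {:?}", start.elapsed());
}
```

With a 2-second interval and 548 crates, this works out to roughly 18 minutes of pure waiting, which is why batched queries or the database dump are attractive.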
@Shnatsel
Member

According to https://crates.io/data-access and https://crates.io/policies#crawlers, crates.io requires crawlers to request no more than 1 page per second. There is no API access policy for non-crawler usage, so I prefer to err on the side of caution.

The crates.io API is undocumented, so there might be batched queries I'm not aware of. So far I've reverse-engineered the web UI to write this tool.

It is possible to load the data from the crates.io database dump. The relevant files total only 3.5 MB uncompressed, but they ship inside a 250 MB archive, which complicates access. And they are only updated once a day.

I'm going to talk to the crates.io team about this. Is operating on data that's updated daily good enough for your use cases, or do you need live data?

@repi
Author

repi commented Oct 10, 2020

Thanks! I think daily data would work just fine for our current usage: running the CLI manually to review and look over dependencies.

If we later automate this in CI as a security review step, potentially through cargo-deny with specific allow/deny lists of individual and group publishers, then having up-to-date data at the time it runs could matter more. But in that case we could also just fall back to the current rate-limited mode.

@Shnatsel
Member

I've raised the topic in the crates.io team Discord; you're welcome to join the conversation. See https://www.rust-lang.org/governance/teams/crates-io

Automation would likely require structured output as well. The tool is only a week old, so we haven't gotten there yet 😄

@repi
Author

repi commented Oct 10, 2020

Hah, that is perfectly fine, just glad you built the tool and I found it :)

@Shnatsel Shnatsel added the enhancement New feature or request label Oct 10, 2020
@Shnatsel
Member

Shnatsel commented Oct 10, 2020

#12 might be of use - it enables the use of crates.io database dumps.

@Shnatsel
Member

I never heard back from the crates.io team about whether the scraping limits apply to cargo-supply-chain, but we now have fairly mature infrastructure for using database dumps, so I'm going to go ahead and close this.

@Shnatsel
Member

Also, I would be interested to hear what use cases Embark has for the tool, to understand what kind of facilities would be interesting to users.

There's a gazillion things we could do, from structured output to a cargo-deny/cargo-crev whitelist/blacklist model to notifications about changes, and numerous other features. But I don't want to sink effort into any of those until there is a clear use case.

@repi
Author

repi commented Oct 22, 2020

Nice, I'll test the current database dump support.

I'm not fully sure yet what our use cases will or can be, but now that it's faster with the dumps we should be able to experiment more. So it makes perfect sense not to go too deep on any other specific implementation or optimization until there's more clarity around this. Thanks!

@Shnatsel
Member

@repi the latest cargo supply-chain supports JSON output, so you can implement custom logic on top of it and/or integrate it with cargo deny. No library crate yet, just the CLI, but that could possibly change.

I'm still interested in hearing about the use cases you may have for the tool - we might want to support some of them in cargo supply-chain itself.
