
Faster crates.io fetching? #8

Closed
repi opened this issue Oct 10, 2020 · 9 comments
Labels
enhancement New feature or request

Comments

@repi

repi commented Oct 10, 2020

Awesome project! I've been wanting a tool with this type of functionality for a while and I'm really glad I ran into it!

We have a fairly large project with, ahem, 548 crate dependencies, so the 2-second delay between fetching data for each crate from crates.io really does add up!

Fetching publisher info from crates.io
This will take roughly 2 seconds per crate due to API rate limits
Fetching data for "addr2line" (0/548)
Fetching data for "adler" (1/548)

Are there any paths to speeding this up?

  • Can one simply increase the rate limit / lower the delay?
  • Does crates.io support some type of batched queries?
  • Could one add support for crates.io API tokens to avoid rate limiting? Those tokens appear to be read/write, though, and since this tool only needs read-only access to the database, some may be uncomfortable leaving publishing-capable tokens lying around.
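For context on what the per-crate delay amounts to, a client-side throttle like the one the tool describes ("roughly 2 seconds per crate") could be sketched as below. This is a hypothetical helper, not the actual cargo-supply-chain implementation; the `Throttle` type and its API are invented for illustration, and the interval is shortened here so the example runs quickly:

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Minimal client-side throttle: guarantees at least `min_interval`
/// between successive requests. (Hypothetical sketch, not the real
/// cargo-supply-chain code.)
struct Throttle {
    min_interval: Duration,
    last: Option<Instant>,
}

impl Throttle {
    fn new(min_interval: Duration) -> Self {
        Throttle { min_interval, last: None }
    }

    /// Sleep just long enough to honor the interval since the
    /// previous call, then record the time of this request.
    fn wait(&mut self) {
        if let Some(last) = self.last {
            let elapsed = last.elapsed();
            if elapsed < self.min_interval {
                thread::sleep(self.min_interval - elapsed);
            }
        }
        self.last = Some(Instant::now());
    }
}

fn main() {
    // The real tool would use ~2 seconds; 100 ms keeps the demo fast.
    let mut throttle = Throttle::new(Duration::from_millis(100));
    let start = Instant::now();
    for _ in 0..3 {
        throttle.wait();
        // a real client would issue the crates.io request here
    }
    // Three calls at >= 100 ms spacing take at least ~200 ms total
    // (the first call does not sleep).
    assert!(start.elapsed() >= Duration::from_millis(200));
    println!("elapsed: {:?}", start.elapsed());
}
```

With a 2-second interval and 548 crates, this works out to roughly 18 minutes of pure waiting, which is why batched queries or the database dump are attractive.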
@Shnatsel
Member

According to https://crates.io/data-access and https://crates.io/policies#crawlers, crates.io requires crawlers to request no more than 1 page per second. There is no API access policy for non-crawler usage, so I prefer to err on the side of caution.

The crates.io API is undocumented, so there might be batched queries I'm not aware of. So far I've reverse-engineered the web UI to write this tool.

It is possible to load the data from the crates.io database dump. The relevant files total only 3.5 MB uncompressed, but they ship inside a 250 MB archive, which complicates access. And they are only updated once a day.

I'm going to talk to the crates.io team about this. Is operating on data that's updated daily good enough for your use cases, or do you need live data?

@repi
Author

repi commented Oct 10, 2020

Thanks! I think daily data would work just fine for our current usage: running the CLI manually to review and look over dependencies.

If we later automate this in CI as a security review step, potentially through cargo-deny with specific allow/deny lists of individual and group publishers, then having up-to-date data at the time it runs could matter more. But in that case we could also just fall back to the current rate-limited mode.

@Shnatsel
Member

I've raised the topic in the crates.io team Discord; you're welcome to join the conversation. See https://www.rust-lang.org/governance/teams/crates-io

Automation would likely require structured output as well. The tool is only a week old, so we haven't gotten there yet 😄

@repi
Author

repi commented Oct 10, 2020

Hah, that is perfectly fine, just glad you built the tool and I found it :)

@Shnatsel Shnatsel added the enhancement New feature or request label Oct 10, 2020
@Shnatsel
Member

Shnatsel commented Oct 10, 2020

#12 might be of use - it enables the use of crates.io database dumps.

@Shnatsel
Member

I never heard back from the crates.io team about whether the scraping limits apply to cargo-supply-chain, but we now have fairly mature infrastructure for using database dumps, so I'm going to go ahead and close this.

@Shnatsel
Member

Also, I would be interested to hear what use cases Embark has for the tool, to understand what kind of facilities would be interesting to users.

There's a gazillion things we could do, from structured output to a cargo-deny/cargo-crev whitelist/blacklist model to notifications about changes, and numerous other features. But I don't want to sink effort into any of those until there is a clear use case.

@repi
Author

repi commented Oct 22, 2020

Nice, I'll test the current database dump support.

I'm not fully sure yet what our use cases will or can be, but now that it's faster with the dumps we should be able to experiment more. So it makes perfect sense not to go too deep on any other specific implementation or optimization until there's more clarity around this. Thanks!

@Shnatsel
Member

@repi the latest cargo supply-chain supports JSON output, so you can implement custom logic on top of it and/or integrate it with cargo deny. No library crate yet, just the CLI, but that could possibly change.

I'm still interested in hearing about the use cases you may have for the tool - we might want to support some of them in cargo supply-chain itself.
