
Parse less JSON on null builds #6880

Merged
merged 7 commits into rust-lang:master from alexcrichton:cache on May 3, 2019

Conversation

@alexcrichton
Member

commented Apr 26, 2019

This commit fixes a performance pathology in Cargo today. Whenever Cargo
generates a lock file (which happens on all invocations of cargo build
for example) Cargo will parse the crates.io index to learn about
dependencies. Currently, however, when it parses a crate it parses the
JSON blob for every single version of the crate. With a lock file,
however, or with incremental builds only one of these lines of JSON is
relevant. Measured today Cargo building Cargo parses 3700 JSON
dependencies in the registry.

This commit implements an optimization that brings down the number of
parsed JSON lines in the registry to precisely the right number
necessary to build a project. For example Cargo has 150 crates in its
lock file, so now it only parses 150 JSON lines (a 20x reduction from
3700). This in turn can greatly improve Cargo's null build time. Cargo
building Cargo dropped from 120ms to 60ms on a Linux machine and 400ms
to 200ms on a Mac.

The commit internally has a lot more details about how this is done but
the general idea is to have a cache which is optimized for Cargo to read
which is maintained automatically by Cargo.

Closes #6866
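The core idea can be sketched roughly as follows. This is an illustrative simplification, not Cargo's actual code (`relevant_lines` and its inputs are invented names): given the newline-delimited JSON for one crate in the registry index, only the lines whose `"vers"` field matches a version pinned in the lock file are kept for deserialization, so the other thousands of lines are never parsed at all.

```rust
// Sketch: filter raw index lines down to just the locked versions with a
// cheap substring probe; only the survivors would go on to full JSON
// deserialization.
fn relevant_lines<'a>(index: &'a str, locked_versions: &[&str]) -> Vec<&'a str> {
    index
        .lines()
        .filter(|line| {
            locked_versions
                .iter()
                .any(|v| line.contains(&format!("\"vers\":\"{}\"", v)))
        })
        .collect()
}
```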

@rust-highfive

commented Apr 26, 2019

r? @Eh2406

(rust_highfive has picked a reviewer for you, use r? to override)

@alexcrichton

Member Author

commented Apr 26, 2019

This isn't 100% ready to go yet since it still doesn't handle concurrent writes into the global cache, but I wanted to put this up for initial thoughts if there were any! I hope to fix the global cache write synchronization tomorrow.

@Eh2406

Contributor

commented Apr 26, 2019

This is really cool! I will have to review when I am more awake. Overall I wonder how we can test this well. How can we be absolutely sure that there are no race conditions, or other bugs, leading to things getting out of sync? Now and in the future? Similarly for making sure that the files work across Cargo versions.

@alexcrichton alexcrichton force-pushed the alexcrichton:cache branch from 8b1a902 to 720d4d1 Apr 26, 2019

@alexcrichton

Member Author

commented Apr 26, 2019

Heh that's a good question! I don't think we can ever really be sure that race conditions or bugs related to that are gone. It's largely, I think, about how we architect the locking and evaluate whether it feels as foolproof as possible. I had an idea this morning that I'm going to toy with and that I'm pretty confident in, but really we can only get so far here.

In terms of Cargo working across future versions it's sort of the same: I'm trying to be very liberal about ignoring errors and proactive with some degree of versioning, but in reality there's only so much we can do against this, I think.

@alexcrichton alexcrichton force-pushed the alexcrichton:cache branch from 68dcc83 to 25ee430 Apr 26, 2019

@alexcrichton

Member Author

commented Apr 26, 2019

Ok I've updated with a strategy to lock the index and ensure that concurrent updates work ok, even with this new caching strategy. The new locking strategy is basically to not have granular locks and instead have one large global lock protecting, for example, all of resolution. This is done to avoid having to worry about all these concurrent updates, and in theory it isn't any loss of functionality either.
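The coarse-locking pattern described here can be sketched like this (all names invented for illustration, not Cargo's API): one process-wide "package cache" lock is taken at a coarse entry point such as resolution, and individual operations merely assert that it is held rather than taking their own granular locks.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// One coarse process-wide lock flag; the real implementation would take
// an OS-level file lock so concurrent Cargo processes also synchronize.
static PACKAGE_CACHE_LOCKED: AtomicBool = AtomicBool::new(false);

struct PackageCacheLock;

impl PackageCacheLock {
    fn acquire() -> PackageCacheLock {
        PACKAGE_CACHE_LOCKED.store(true, Ordering::SeqCst);
        PackageCacheLock
    }
}

impl Drop for PackageCacheLock {
    fn drop(&mut self) {
        PACKAGE_CACHE_LOCKED.store(false, Ordering::SeqCst);
    }
}

// An index operation doesn't lock anything itself; it fails fast if the
// coarse lock was forgotten, surfacing bugs before any race is hit.
fn query_index() -> Vec<String> {
    assert!(
        PACKAGE_CACHE_LOCKED.load(Ordering::SeqCst),
        "package cache lock must be held"
    );
    vec!["serde 1.0".to_string()]
}
```

The upside of this shape is that lock ordering never matters: there is only one lock, and forgetting it trips an assert immediately rather than racing.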

@alexcrichton

Member Author

commented Apr 26, 2019

@ehuss you might be interested in the locking commit as well

@Eh2406

Contributor

commented Apr 26, 2019

I have not grokked this yet, but some thoughts:

  • Can we make CURRENT_CACHE_VERSION part of the file path so we reduce the chance of cross-talk? (Or has this already been done?) Actually, where are the cache files stored?
  • Should we have a debug assertion that the cached version matches the canonical one?
  • Can we have a test that the Cargo that was used to build the tests is compatible with the Cargo being tested? (This would be good for lock files as well.) Set up a registry, have the host Cargo make its cache files, have the test Cargo run to see if it correctly uses or ignores the files, then have the host Cargo run to make sure we don't break it.
  • Can we have a test that the host Cargo and the test Cargo can run concurrently without messing up each other's locks? BTW, what happens if a new Cargo takes a new coarse lock while an old Cargo holds a granular lock?
  • Given that the registry is a Git project, can we use the hash of HEAD instead of mtime? (mtime often does weird things.)
@alexcrichton

Member Author

commented Apr 29, 2019

I'm personally not really super concerned about cross-cargo-version issues here. I think we need to at a bare minimum ensure it works (nothing gets corrupted across Cargo invocations), but other than that I feel like it's a bit much for us to maintain anything else. For example I don't think we need to optimize for the use case where you oscillate between Cargo versions and it might thrash the cache that we're building here. The purpose of this PR is to reduce the overhead of Cargo as much as possible on incremental builds, and part of the incremental aspect is not changing Cargo that much!

In that sense I could make the version part of the path for sure and we could reduce cache thrashing, but I don't think it's too important here. Additionally, while it's happened to work in the past, I don't think we should strive to say that concurrent invocations of different versions of Cargo are supposed to work (rather, only concurrent invocations of the same version are expected to work).

I like the idea of a debug assertion and using the git hash instead of the mtime; I'll look to implement those later when this is closer to being good to go!

For testing, I'm not sure how we'd manage that unfortunately. We can't really rely on the host Cargo to be any particular version so the tests would already have to be really loose. It may be best to just unit-test the code in question and make sure that error handling is as conservative as possible, since I'm not sure how to best test these things (but I think it's pretty minor)

@bors

Contributor

commented Apr 29, 2019

☔️ The latest upstream changes (presumably #6871) made this pull request unmergeable. Please resolve the merge conflicts.

@@ -14,6 +14,7 @@ pub use self::shell::{Shell, Verbosity};
pub use self::source::{GitReference, Source, SourceId, SourceMap};
pub use self::summary::{FeatureMap, FeatureValue, Summary};
pub use self::workspace::{Members, Workspace, WorkspaceConfig, WorkspaceRootConfig};
pub use self::interning::InternedString;

@Eh2406

Eh2406 Apr 30, 2019

Contributor

There are a lot of Cow<'_, str> that can be replaced with InternedString now that it is public.

@alexcrichton

alexcrichton Apr 30, 2019

Author Member

Yeah that was one thing I was going to try if JSON parsing still showed up in the profile, but after this PR the JSON parsing disappeared so I think it's less pressing to do that just yet (can be a follow-up of course!)

@Eh2406

Contributor

commented Apr 30, 2019

I did a more in-depth review; it looks good, and I really like it!
Three fundamental questions:

    1. Why do the processing on the client side? Why not have the index just write the better format in the first place? (Presumably this is a smaller change, given that custom registries are now stable.)
    1. If we are building a custom format, why stay with JSON? (Presumably we can always experiment with changing this after this lands, so start with the smallest change.)
    1. Lots of small files: how does this perform on Windows? (Presumably you want me to answer that.)
@alexcrichton

Member Author

commented Apr 30, 2019

Good questions!


Why do the processing on the client side? Why not have the index just write the better format in the first place? (Presumably this is a smaller change, given that custom registries are now stable.)

For doing this on the client side rather than the index itself, I think the main reason is feasibility. I've long figured that the index format will not satisfy Cargo until the end of time and we'd need an even faster indexing format at some point. Having the capability of a local cache managed by Cargo allows us to divorce these two aspects. The index is primarily focused on optimizing for delta updates (making it super easy and fast to incrementally update it), whereas Cargo's problem is different: it effectively requires random access to the index. I'd be surprised if this is the only change we ever make to Cargo's internal format; I think it's inevitable that Cargo's own on-disk format for the index will be divorced from the index's upstream format.

Frankly though for the index format it's just way easier to do it in Cargo. The amount of effort needed to change the index itself and make sure everything doesn't break means that this probably wouldn't get done.


If we are building a custom format, why stay with JSON? (Presumably we can always experiment with changing this after this lands, so start with the smallest change.)

Heh another good question! I wasn't sure whether JSON would still be slow, so this is where I let the profiles guide me. It was obvious from before that parsing thousands of lines of JSON took hundreds of milliseconds and the easiest win was to simply not parse thousands of lines but only the handful needed. Since then I haven't seen JSON parsing in the profile.

We could of course, however, change the format of the cache files at any time. That's the point of the local cache for Cargo :). If even JSON is too slow we'll have to carefully design a new format to ensure it preserves all the relevant information, but it's certainly possible to do so.


Lots of small files: how does this perform on Windows?

Ah yeah unfortunately I don't have access to my Windows machine right now to test this out. Previously we were reading one big git file but now we're reading a lot of little files around the filesystem, so I'm honestly not entirely sure what the performance is. It's a good point though and something we should measure before landing. Would you be up for helping me out with measurements?

@Eh2406

Contributor

commented Apr 30, 2019

I ran a number of commands with two versions of cargo. One is from the head of this PR (25ee430), the other from after the small optimizations (319e9bb),
both rebased on master (af1fcb3). Both were built locally with --release from rustc 1.36.0-nightly (6d599337f 2019-04-22). The script ran each combination once outside the timing loop, then calculated the wall time to run the command 15 times in the Cargo project.

import subprocess
import time

for command in [["update"], "update -p hex".split(), ["generate-lockfile"], ["build"]]:
    for cargo in ["PR", "Master"]:
        # Warm-up run, outside the timing loop.
        subprocess.run(["speed/cargo-" + cargo] + command + ["-Zno-index-update"],
                       stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        n = 15
        start = time.perf_counter()
        for _ in range(n):
            subprocess.run(["speed/cargo-" + cargo] + command + ["-Zno-index-update"],
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        print(command, cargo, (time.perf_counter() - start) / n)

The table below shows the average wall time in seconds for each combination.

Edit: looks like you need to update the index to get the new speeds. So here are the numbers with the same artifacts and commands, after I first ran a command once without `-Zno-index-update`:

Edit: the command for update -p hex was malformed.

| command | Master (s) | PR (s) | % change |
| --- | --- | --- | --- |
| update | 0.27955078 | 0.195570633333 | 30% |
| update -p hex | 0.266721386667 | 0.151231233333 | 43% |
| generate-lockfile | 0.266229426667 | 0.171039146667 | 35% |
| build | 0.714220346667 | 0.711805573333 | 0.3% |

@alexcrichton alexcrichton force-pushed the alexcrichton:cache branch from 25ee430 to 928c867 Apr 30, 2019

@alexcrichton

Member Author

commented Apr 30, 2019

Ok I've rebased and pushed up a commit which uses git sha information instead of mtime information which should be more robust, as well as an additional commit which adds a debug assertion that if we think the cache is fresh it actually is.

@Eh2406

Contributor

commented May 1, 2019

New timings are:

| command | Master (s) | PR (s) | % change |
| --- | --- | --- | --- |
| update | 0.26499556 | 0.183760626667 | 30.6% |
| update -p hex | 0.247472593333 | 0.13809348 | 44.2% |
| generate-lockfile | 0.249164073333 | 0.16347192 | 34.4% |
| build | 0.6984459 | 0.69787098 | 0.1% |

@alexcrichton alexcrichton force-pushed the alexcrichton:cache branch 2 times, most recently from 1258890 to b8ca83a May 1, 2019

@alexcrichton

Member Author

commented May 1, 2019

Updated!

@Eh2406

Contributor

commented May 1, 2019

Ok this looks good! Other improvements can always be done in follow up PRs.

I keep getting distracted by changes that may be small improvements, when according to my profiling the only thing that matters is a design that uses fewer files.

@Eh2406

Contributor

commented May 1, 2019

So here is a fleshed-out version of my 3-file straw man.

  • index.json is an uncompressed, concatenated version of all of the index files we have actually read, in the existing format.
  • versions.lookup is a fast way to look up where in index.json the row is for each version. It would have the format version-string\0start\0end\n.
  • names.lookup is a fast way to look up where in versions.lookup the rows for each name are. It would have the format name\0start\0end\n.

To do a search:

  1. Read in and parse all of names.lookup. If the name we want is not in there, then read it from the raw index and append to the files.
  2. Read in all of versions.lookup, but only parse the part of the file that names.lookup told us to.
  3. Read in all of index.json, but only parse the part of the file that versions.lookup told us to.
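Parsing one row of the proposed lookup files could look like this. To be clear, this is a sketch of the straw-man `name\0start\0end\n` layout suggested above, not an implemented Cargo format:

```rust
use std::ops::Range;

// Parse one `name\0start\0end` row into the name plus the byte range to
// read from the next file in the chain; `None` on a malformed row.
fn parse_lookup_row(row: &str) -> Option<(&str, Range<usize>)> {
    let mut fields = row.split('\0');
    let name = fields.next()?;
    let start: usize = fields.next()?.parse().ok()?;
    let end: usize = fields.next()?.parse().ok()?;
    Some((name, start..end))
}
```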

When we pull a new version of the index, delete the files.

"Read in all of" can probably be mmap, if that is a bottleneck.

Also we should look at whether there are existing libraries for on-disk indexing into a file... https://crates.io/crates/csv-index maybe? Actually that does something very similar.

@bors

Contributor

commented May 1, 2019

☔️ The latest upstream changes (presumably #6896) made this pull request unmergeable. Please resolve the merge conflicts.

alexcrichton added some commits Apr 25, 2019

Only execute `exclude_from_backups` once
This was currently getting executed on all builds, even if the directory
already exists. There shouldn't be any reason though to exclude the
directory from backups on all builds, and after seeing this get a stack
sample in a profile I figured it's best to ensure it only executes once
in case the backing system implementation isn't the speediest.
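A minimal sketch of the once-only intent using `std::sync::Once` (illustrative only; the actual commit gates the call on whether the cache directory already existed, and the counter parameter here is invented just to make the effect observable):

```rust
use std::sync::Once;

static EXCLUDE_FROM_BACKUPS: Once = Once::new();

// Run the (potentially slow) platform backup-exclusion call at most once
// per process, no matter how many builds invoke this.
fn exclude_from_backups_once(times_excluded: &mut u32) {
    EXCLUDE_FROM_BACKUPS.call_once(|| {
        // The backing platform call would happen here.
        *times_excluded += 1;
    });
}
```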
Parse less JSON on null builds
Don't allocate in `SourceId::is_default_registry`
This gets called quite a lot and doesn't need to allocate in the first
place!

alexcrichton added some commits Apr 26, 2019

Make registry locking more coarse
This commit updates the locking strategy in Cargo to handle the recent
addition of an on-disk cache for the on-disk index. The goal here is to
reduce the overhead of locking, both cognitively when reading and
performance-wise by requiring fewer locks. Previously Cargo had a bunch
of fine-grained locks throughout the index and git repositories, but
after this commit there's just one global "package cache" lock.

This global lock now serves to basically synchronize the entire crate
graph resolution step. This shouldn't really take that long unless it's
downloading, in which case there's not a ton of benefit to running in
parallel anyway. The other intention of this single global lock is to
make it much easier on the sources to not worry so much about lock
ordering or when to acquire locks, but rather they just assert in their
various operations that they're locked.

Cargo now has a few coarse-grained locations where locks are held (for
example during resolution and during package downloading). These locks
are a bit sprinkled about but they have in-code asserts which assert
that they're held, so we'll find bugs quickly if any lock isn't held
(before a race condition is hit that is)
Thread through last update time to index cache
The last-update time was removed in the previous commit; this adds
dedicated tracking for it.
Avoid using mtime information for reusing cache files
Using mtime information is pretty finicky across platforms, so instead
take a different strategy where we embed the sha that a cache file was
generated from into the cache file itself. If the registry's sha has
changed then we regenerate the cache file, otherwise we can reuse the
cache file.

This should make cache file generation more robust (any command can
generate a cache file to get used at any time) as well as works better
across platforms (doesn't run into issues with coarse mtime systems and
the like).
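The freshness check described in this commit can be sketched as follows. Assume, purely for illustration (this is not Cargo's exact on-disk layout), that the first line of a cache file records the index HEAD sha it was generated from; the cache is then reusable only while that sha matches the current index HEAD:

```rust
// Fresh iff the sha embedded at generation time still matches the
// registry's current HEAD; any mismatch (or empty file) forces a
// regeneration rather than trusting mtimes.
fn cache_is_fresh(cache_contents: &str, index_head_sha: &str) -> bool {
    cache_contents
        .lines()
        .next()
        .map_or(false, |embedded_sha| embedded_sha == index_head_sha)
}
```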
Add debug assertions for cache contents
This will largely only get tested during Cargo's own tests, but this
commit adds debug assertions that the cache of registry JSON files is
always valid and up to date whenever we consider it up to date.
@alexcrichton

Member Author

commented May 3, 2019

That seems plausible to me, but would it be ok to defer that to a future PR? That does sound like a significant step up in terms of complication and it might be good to get something cached in there first.

@alexcrichton alexcrichton force-pushed the alexcrichton:cache branch from b8ca83a to e33881b May 3, 2019

@Eh2406

Contributor

commented May 3, 2019

Definitely, this is an improvement (even on my oddly slow Windows), so I am comfortable landing when you are.

@alexcrichton

Member Author

commented May 3, 2019

Ok!

@bors: r=Eh2406

@bors

Contributor

commented May 3, 2019

📌 Commit e33881b has been approved by Eh2406

@bors

Contributor

commented May 3, 2019

⌛️ Testing commit e33881b with merge 22e2f23...

bors added a commit that referenced this pull request May 3, 2019

Auto merge of #6880 - alexcrichton:cache, r=Eh2406
Parse less JSON on null builds

@bors

Contributor

commented May 3, 2019

☀️ Test successful - checks-travis, status-appveyor
Approved by: Eh2406
Pushing 22e2f23 to master...

@bors bors merged commit e33881b into rust-lang:master May 3, 2019

3 checks passed: Travis CI (Pull Request build passed), continuous-integration/appveyor/pr (AppVeyor build succeeded), homu (test successful).

@alexcrichton alexcrichton deleted the alexcrichton:cache branch May 6, 2019

@alexcrichton

Member Author

commented May 6, 2019

I've opened #6908 to track your suggestion @Eh2406

@ehuss ehuss referenced this pull request May 7, 2019

Merged

Update cargo #60596

bors added a commit to rust-lang/rust that referenced this pull request May 7, 2019

Auto merge of #60596 - ehuss:update-cargo, r=alexcrichton
Update cargo

12 commits in beb8fcb5248dc2e6aa488af9613216d5ccb31c6a..759b6161a328db1d4863139e90875308ecd25a75
2019-04-30 23:58:00 +0000 to 2019-05-06 20:47:49 +0000
- Small things (rust-lang/cargo#6910)
- Fix skipping over invalid registry packages (rust-lang/cargo#6912)
- Fixes rust-lang/cargo#6874 (rust-lang/cargo#6905)
- doc: Format examples of version to ease reading (rust-lang/cargo#6907)
- fix more typos (codespell) (rust-lang/cargo#6903)
- Parse less JSON on null builds (rust-lang/cargo#6880)
- chore: Update opener to 0.4 (rust-lang/cargo#6902)
- Update documentation for auto-discovery. (rust-lang/cargo#6898)
- Update some doc links. (rust-lang/cargo#6897)
- Default Cargo.toml template provide help for completing the metadata (rust-lang/cargo#6881)
- Run 'cargo fmt --all' (rust-lang/cargo#6896)
- Refactor command definition (rust-lang/cargo#6894)

alexcrichton added a commit to alexcrichton/cargo that referenced this pull request May 14, 2019

Re-enable compatibility with readonly CARGO_HOME
Previously Cargo would attempt to work as much as possible with a
previously filled out CARGO_HOME, even if it was mounted as read-only.
In rust-lang#6880 this was regressed as a few global locks and files were always
attempted to be opened in writable mode.

This commit fixes these issues by correcting two locations:

* First the global package cache lock has error handling to allow
  acquiring the lock in read-only mode in addition to read/write mode. If
  the read/write mode failed due to an error that looks like a readonly
  filesystem then we assume everything in the package cache is readonly
  and we switch to just acquiring any lock, this time a shared readonly
  one. We in theory aren't actually doing any synchronization at that
  point since it's all readonly anyway.

* Next, when unpacking packages we're careful to issue a `stat` call
  before opening a file in writable mode. This way our preexisting guard
  to return early if a package is unpacked will succeed before we open
  anything in writable mode.

Closes rust-lang#6928
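The read-only fallback in the first bullet can be sketched like this (names and the error-injection shape are invented for illustration; the real code wraps an actual file-lock attempt):

```rust
use std::io;

#[derive(Debug, PartialEq)]
enum CacheLock {
    Exclusive,
    SharedReadonly,
}

// Try the exclusive read/write lock; if the failure looks like a
// read-only filesystem (EROFS, 30 on Linux), degrade to a shared
// read-only mode instead of erroring out. No real synchronization
// happens in that mode, but none is needed since nothing can write.
fn acquire_cache_lock(rw_attempt: io::Result<()>) -> io::Result<CacheLock> {
    match rw_attempt {
        Ok(()) => Ok(CacheLock::Exclusive),
        Err(e) if e.raw_os_error() == Some(30) => Ok(CacheLock::SharedReadonly),
        Err(e) => Err(e),
    }
}
```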

alexcrichton added a commit to alexcrichton/cargo that referenced this pull request May 14, 2019

Re-enable compatibility with readonly CARGO_HOME

bors added a commit that referenced this pull request May 14, 2019

Auto merge of #6940 - alexcrichton:readonly-compat, r=ehuss
Re-enable compatibility with readonly CARGO_HOME
