
docs installation slow on NFS home #1867

Closed
saethlin opened this issue May 23, 2019 · 18 comments

@saethlin (Member) commented May 23, 2019

Problem
rustup docs installation is remarkably slow on an HPC system (probably most HPC systems). In case it's useful to anyone, I'm observing this on HiPerGator, which uses the Lustre filesystem.
I've attached some strace -c output below along with what rustup printed during the update. The 433.6 KiB/s is probably misstating the situation; for a few seconds at a time the install speed drops to between 8.0 and 50 KiB/s, as if it's being throttled by the filesystem.

Possible Solution(s)
It would be great to be able to opt out of docs altogether; I'm told this is a feature under consideration. There's no way for me to use the HTML docs in an HPC environment; I use the local ones on the machine I'm ssh'd from.

Notes
Output of rustup --version: rustup 1.18.3 (435397f 2019-05-22)
Output of rustup show:

Default host: x86_64-unknown-linux-gnu

installed toolchains
--------------------

stable-x86_64-unknown-linux-gnu (default)
nightly-x86_64-unknown-linux-gnu

installed targets for active toolchain
--------------------------------------

x86_64-unknown-linux-gnu
x86_64-unknown-linux-musl

active toolchain
----------------

stable-x86_64-unknown-linux-gnu (default)
rustc 1.34.2 (6c2484dc3 2019-05-13)
info: syncing channel updates for 'nightly-x86_64-unknown-linux-gnu'
info: latest update on 2019-05-23, rust version 1.36.0-nightly (37ff5d388 2019-05-22)
info: downloading component 'rustc'
 91.0 MiB /  91.0 MiB (100 %)  48.0 MiB/s in  2s ETA:  0s
info: downloading component 'rust-std'
 61.3 MiB /  61.3 MiB (100 %)  38.4 MiB/s in  1s ETA:  0s
info: downloading component 'cargo'
info: downloading component 'rust-docs'
info: removing component 'rustc'
info: removing component 'rust-std'
info: removing component 'cargo'
info: removing component 'rust-docs'
info: installing component 'rustc'
 91.0 MiB /  91.0 MiB (100 %)  11.7 MiB/s in  9s ETA:  0s
info: installing component 'rust-std'
 61.3 MiB /  61.3 MiB (100 %)  10.3 MiB/s in  5s ETA:  0s
info: installing component 'cargo'
info: installing component 'rust-docs'
 11.1 MiB /  11.1 MiB (100 %) 433.6 KiB/s in  1m 58s ETA:  0s

  nightly-x86_64-unknown-linux-gnu updated - rustc 1.36.0-nightly (37ff5d388 2019-05-22)

strace -c output while running a rustup update nightly

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 26.12    1.577563          79     19971           unlink
 22.39    1.351963          66     20379        37 open
 15.56    0.939575          47     20002           fchmod
 13.00    0.784812          37     21441           chmod
  7.07    0.426997           3    157983           write
  4.40    0.265628          13     20503           lstat
  2.53    0.153032          49      3108           getdents
  1.98    0.119512           5     21896           close
  1.32    0.079967         396       202           read
  1.26    0.076021          49      1545           openat
  0.91    0.055049           5     10621        91 futex
  0.83    0.050257          18      2737       510 stat
  0.82    0.049569          62       803        10 mkdir
  0.67    0.040695          51       794           rmdir
  0.53    0.031927        1774        18           fdatasync
  0.35    0.021093           1     33633           ioctl
  0.17    0.009982          22       462           rename
  0.05    0.002836          24       118           brk
  0.02    0.001310         164         8           munmap
  0.01    0.000356          27        13           copy_file_range
  0.00    0.000296         148         2           clone
  0.00    0.000164           6        28           mmap
  0.00    0.000162           4        39           fstat
  0.00    0.000124           8        15           mprotect
  0.00    0.000054          27         2         1 poll
  0.00    0.000037           7         5           getrandom
  0.00    0.000028           5         6           rt_sigaction
  0.00    0.000020          10         2           wait4
  0.00    0.000018           9         2           socketpair
  0.00    0.000017           4         4         2 recvfrom
  0.00    0.000015           5         3           pipe2
  0.00    0.000013           4         3           sigaltstack
  0.00    0.000011           6         2           getrlimit
  0.00    0.000007           7         1           sched_getaffinity
  0.00    0.000005           1         4           lseek
  0.00    0.000005           5         1           rt_sigprocmask
  0.00    0.000005           5         1           fcntl
  0.00    0.000005           5         1           arch_prctl
  0.00    0.000005           5         1           set_tid_address
  0.00    0.000005           5         1           set_robust_list
  0.00    0.000004           4         1         1 rt_sigreturn
  0.00    0.000004           4         1           mremap
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
------ ----------- ----------- --------- --------- ----------------
100.00    6.039148                336364       653 total
saethlin added the bug label May 23, 2019
@kinnison (Contributor)

What's fascinating here is that while the install took minutes, there's only 6 seconds of syscall time there. I can only assume strace is logging CPU cost only, not wait latency.

However, profiles support (PR #1673) should resolve this issue for you, so let's revisit this after the release of a rustup containing it.

@saethlin (Member Author)

I don't think I'd assume that; here's the time output for a re-installation of nightly

real	2m38.806s
user	0m16.254s
sys	0m6.602s

I'll be watching that PR with great interest!

@kinnison (Contributor)

Yep, so 2m38s of wallclock time, of which 16s is user time (likely decompression, hashing, etc.) and 6.6s is syscall CPU time. The remaining 2m14s or so will be waiting on the storage/network. :D

@rbtcollins (Contributor)

Yep, you are entirely blocked on high-latency IO. Can I get another strace please, with -T this time? Attach it somewhere I can download it; it will be long.

Pending verification from that trace, there are several things we can do, some of which are low-hanging fruit.

We can introduce threading in your environment: if the open+write calls are low latency and close is slow, which client-checked but server-enforced filesystems often exhibit, then the current threading will be effective; we just need to enable it for you.

We can stop doing a second walk over the unpacked files - tar-rs already chmods (fchmod now) the files as it unpacks, yet we completely reset the permissions afterwards. A little bit of smarts (e.g. doing an fchmod on the handle from tar-rs for the common case and only doing a path-based chmod for the uncommon cases from the manifest) would let us drastically reduce syscalls; better still might be to tell tar-rs not to set permissions at all as another feature - we'll want to be guided by the profile data.
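
A minimal sketch of that "fchmod on the open handle" idea (an illustration only, not rustup's or tar-rs's actual code; the helper name is made up):

use std::fs::{File, Permissions};
use std::io::{self, Write};
use std::os::unix::fs::PermissionsExt;
use std::path::Path;

// Hypothetical helper: write one unpacked file and set its mode while the
// handle is still open, so no later path-based chmod walk is required.
fn write_unpacked_file(path: &Path, contents: &[u8], mode: u32) -> io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(contents)?;
    // On Unix this is an fchmod(2) on the descriptor we already hold,
    // rather than a second chmod(2) that has to resolve the path again.
    file.set_permissions(Permissions::from_mode(mode))?;
    Ok(())
}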

@saethlin (Member Author)

rbtcollins added a commit to rbtcollins/rustup.rs that referenced this issue May 26, 2019
Set RUSTUP_CLOSE_THREADS=disabled to force single threaded IO, or
to a specific number if desired for testing/tuning.

This may improve rust-lang#1867, but has no impact on rust-lang#1866 due to the coarse
lock around the fd-handle table inside WSL.
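
(For illustration only: a rough sketch of how a "disabled or a number" knob like RUSTUP_CLOSE_THREADS is typically interpreted; this is not the actual patch.)

use std::env;
use std::thread;

// Sketch: "disabled" forces a single IO thread, a number sets an explicit
// count, and the default falls back to the machine's CPU count.
fn close_threads() -> usize {
    match env::var("RUSTUP_CLOSE_THREADS").ok().as_deref() {
        Some("disabled") => 1,
        Some(n) => n.parse().unwrap_or(1),
        None => thread::available_parallelism().map(|n| n.get()).unwrap_or(1),
    }
}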

@rbtcollins (Contributor)

@saethlin I haven't reviewed the strace yet, and it's late, but that patch of mine is the lowest-hanging fruit that may help; I'd be interested to know whether it has any impact for you at all.

@rbtcollins (Contributor)

OK, so it's interesting that we get open() rather than openat(), but -

open("/home/kimockb/.rustup/tmp/4duqw9jkaf9kmt3r_dir/rust-docs/share/doc/rust/html/core/core_arch/x86/avx/_mm256_broadcast_sd.v.html", O_WRONLY|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 9 <0.002960>
write(9, "<!DOCTYPE html>\n<html lang=\"en\">"..., 349) = 349 <0.000012>
fchmod(9, 0644)                         = 0 <0.001859>
close(9)                                = 0 <0.000009>
...
chmod("/home/kimockb/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/share/doc/rust/html/tutorial.html", 0644) = 0 <0.001082>

About 3ms to create the inode, about 2ms to fchmod it, writes and closes are nearly free, and chmod is also ~1ms.

20k items, so 1ms == 20 seconds: roughly 60 seconds in open(), 40 seconds in fchmod(), and 20 seconds in the final chmods.

So, the things we need to do to make it fast for HPC filesystems are (a rough sketch of the first point follows the list):

  • don't block on inode creation
  • don't apply chmods twice
  • and do the final chmod threaded or in the initial IO
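
A toy sketch of that first point under stated assumptions (not rustup's actual implementation): hand file creation off to worker threads so the unpack loop never waits on inode creation.

use std::fs::File;
use std::io::Write;
use std::path::PathBuf;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Spawn `workers` threads that each pull (path, contents) jobs off a shared
// queue, so the slow open()/write()/close() round trips overlap instead of
// being paid serially by the unpack loop.
fn spawn_writers(
    workers: usize,
) -> (mpsc::Sender<(PathBuf, Vec<u8>)>, Vec<thread::JoinHandle<()>>) {
    let (tx, rx) = mpsc::channel::<(PathBuf, Vec<u8>)>();
    let rx = Arc::new(Mutex::new(rx));
    let handles = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // Hold the lock only while taking the next job off the queue.
                let job = rx.lock().unwrap().recv();
                match job {
                    Ok((path, contents)) => {
                        // The inode-creation latency is paid here, in
                        // parallel with the other workers.
                        if let Ok(mut f) = File::create(&path) {
                            let _ = f.write_all(&contents);
                        }
                    }
                    Err(_) => break, // all senders dropped: no more work
                }
            })
        })
        .collect();
    (tx, handles)
}

The unpack loop would then just send jobs as it decompresses entries and join the worker handles at the end.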

@rbtcollins (Contributor)

@saethlin how many cores do you have available? Clearly we can oversubscribe, as your actual CPU time is near zero, but I'm wondering if a heuristic of the machine's CPU count, which we use for dealing with Defender, will be reasonable here too (and save on code-path proliferation).

@saethlin (Member Author) commented May 27, 2019

Usually I have 32 cores available.

Just to be clear, any changes you make in the name of installing docs faster specifically on HPC systems (as opposed to everywhere) will be moot as soon as we can disable docs. I'm also not really keen on having an installation process that uses threads to drive enormous load onto the shared filesystem. I've witnessed write latencies of a few seconds because of misuse, and I don't want rustup usage to be responsible for that.

@rbtcollins (Contributor)

alexcrichton/tar-rs#207 will permit tackling the chmod duplication. It's worth noting (for @alexcrichton if no one else) that there are several reasons we don't trust the distribution archives: they have been wrong in the past; we are still used to install those old archives; and when we start doing streaming installs of archives we'll only verify signatures at the end, so we want to trust as little as possible.

@saethlin so there are two parts here: one is actually doing the least work possible. For example, we used to do a full additional copy of the files, which would have made older releases even slower; that's fixed already. Eliminating a full second pass of chmod metadata updates will reduce load on the metadata server (which HPC filesystem are you using, BTW? Gluster? CephFS?).

The second part, once we have reduced the work done to the bare minimum with no waste, is about scheduling: do we schedule the work to complete as rapidly as possible, or do we throttle it?

I understand your concern and share it: I certainly don't want to create headaches for folk by driving up a backlog somewhere. But let's put this in context - we have under a second of IO to a single platter of spinning rust (consumer disks have exceeded 200 MB/s for a decade or more now). Let's not assume that we are going to cause a problem until we have data showing it. I know, for instance, that with ext4 you want to create the files relatively close together on a busy system, because ext4 is going to try to keep them together, and if the fs is too full (and high activity is moving what is used around a lot) it won't be able to. https://www.kernel.org/doc/html/latest/filesystems/ext4/overview.html#block-and-inode-allocation-policy

So my proposal is this: I write various things, and you try them out - I don't have your HPC cluster with its particular tradeoffs. If they cause havoc in test, we don't push them into master. If they don't cause havoc, we can consider moving forward.

@saethlin (Member Author)

We're on Lustre, not entirely sure what version though.

Your proposal sounds good. I'm learning about iostat, hopefully I can watch rustup run and learn something or just keep an eye on the system.

@rbtcollins (Contributor)

As an example, take rust-docs/share/doc/rust/html/rust-by-example/fn/closures/closure_examples:

strace for closure_examples
mkdir("/home/kimockb/.rustup/tmp/4duqw9jkaf9kmt3r_dir/rust-docs/share/doc/rust/html/rust-by-example/fn/closures/closure_examples", 0777) = 0 <0.001906>
chmod("/home/kimockb/.rustup/tmp/4duqw9jkaf9kmt3r_dir/rust-docs/share/doc/rust/html/rust-by-example/fn/closures/closure_examples", 0755) = 0 <0.001278>
stat("/home/kimockb/.rustup/tmp/4duqw9jkaf9kmt3r_dir/rust-docs/share/doc/rust/html/rust-by-example/fn/closures/closure_examples", {st_mode=S_IFDIR|0755, st_size=0, ...}) = 0 <0.000020>
openat(AT_FDCWD, "/home/kimockb/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/share/doc/rust/html/rust-by-example/fn/closures/closure_examples", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 12 <0.000223>
lstat("/home/kimockb/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/share/doc/rust/html/rust-by-example/fn/closures/closure_examples", {st_mode=S_IFDIR|0755, st_size=1024, ...}) = 0 <0.000013>
chmod("/home/kimockb/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/share/doc/rust/html/rust-by-example/fn/closures/closure_examples", 0755) = 0 <0.001041>

This has a total syscall time of

mkdir  0.001906
chmod  0.001278
stat   0.000020  
openat 0.000223 
lstat  0.000013 
chmod  0.001041
--------------
       0.004481

This is representative of the extra work here - we really only need to set the umask appropriately, do the mkdir once, and then never touch the dir again. That would halve the amount of work that the HPC filesystem is being asked to do.
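
As a rough illustration of that (standard library only, not rustup's actual code), a directory can be created with its final mode in a single mkdir, with no follow-up chmod/stat round trips:

use std::fs::DirBuilder;
use std::io;
use std::os::unix::fs::DirBuilderExt;
use std::path::Path;

// One mkdir(2) per directory, with the desired mode applied at creation
// time (filtered through the umask); the directory is never touched again.
fn make_dir_once(path: &Path) -> io::Result<()> {
    DirBuilder::new()
        .recursive(true) // create missing parents as well
        .mode(0o755)
        .create(path)
}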

@rbtcollins (Contributor)

http://www.nics.tennessee.edu/computing-resources/file-systems/io-lustre-tips seems like it has good recommendations here.

In particular, @saethlin can you

lfs setstripe ~/.rustup -s 1m -i -1 -c 1
rm -rf ~/.rustup/tmp

And test the current released rustup? I'm not expecting any radical change, just curious about any impact this might have. The stdlib and rustc etc. are obviously stripeable, and we could look at special-casing docs etc. to make just them single-stripe, but this is more along the lines of information gathering.

@rbtcollins (Contributor)

(Also TBH I think just some sensible docs in README.md might be useful here).

@saethlin (Member Author)

Oh dangit, I was mistaken. Our /home partitions aren't Lustre, only the big data partitions are.

╰ ➤ mount | grep $HOME
10.13.152.221:/home/kimockb on /home/kimockb type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.13.152.221,mountvers=3,mountport=2049,mountproto=tcp,local_lock=none,addr=10.13.152.221)

I'm sure I'll find a use for those Lustre tips anyway.

@rbtcollins (Contributor)

Ahha! OK, so if ~ is on plain old NFS, we're probably facing a classic single-machine multi-user situation - but it's also likely that that machine actually has decent IO backing it, if we're not nuts - we're just 1ms away on the network, so anything that has to cache-bust or be synchronous blocks. I'll have a non-aggressive test branch for you soon.

rbtcollins changed the title from "docs installation slow on HPC system(s)" to "docs installation slow on NFS home" on Jun 6, 2019

@kinnison (Contributor)

Rustup 1.20 has been released, with support for a minimal profile that does not install rust-docs by default (e.g. via rustup set profile minimal). Is this acceptable enough to close this issue?

@saethlin (Member Author)

Absolutely. Thanks so much for all your hard work on this.
