grpc RecvTensor is slow #6116
To make things clearer, I collected more detailed data around the memmove:

```c
// int move_size = (sb->count - 1) * sizeof(gpr_slice);
memmove(&sb->slices[0], &sb->slices[1], (sb->count - 1) * sizeof(gpr_slice));
// int data_size = GPR_SLICE_LENGTH(slice);
```

A typical …
So the problem is obvious: the slice sizes will sum to 100MB per run. The root cause seems to be that grpc buffer management does not handle large messages well. This also explains why throughput decreases as the tensor size increases. I'm not very familiar with the grpc code, but would adding a grpc option to use 'gpr_slice_buffer_take_all' instead of 'gpr_slice_buffer_take_first' remove the unnecessary memory copy? Tuning the slice size can also help reduce the overhead, but can't eliminate it (see the back-of-the-envelope model below). |
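To get a sense of the scale of this amplification, here is a back-of-the-envelope model of what repeated `gpr_slice_buffer_take_first` calls cost for a 100MB message. The 8KB slice size and 32-byte `gpr_slice` header size are assumptions for illustration, not values measured in this thread:

```python
# Model of the copy amplification from gpr_slice_buffer_take_first:
# each call memmoves every remaining gpr_slice header down one slot,
# so draining an n-slice buffer moves ~n^2/2 headers in total.
MESSAGE_BYTES = 100 * 1024 * 1024   # 100MB tensor
SLICE_BYTES = 8 * 1024              # assumed slice size
HEADER_BYTES = 32                   # assumed sizeof(gpr_slice) on 64-bit

n = MESSAGE_BYTES // SLICE_BYTES    # slices buffered for one message

# Call i moves the remaining (n - 1 - i) headers: an arithmetic series.
headers_moved = n * (n - 1) // 2
bytes_moved = headers_moved * HEADER_BYTES

print(f"{n} slices, {bytes_moved / 2**30:.1f} GiB of headers moved "
      f"per {MESSAGE_BYTES // 2**20} MiB message")
# 12800 slices, 2.4 GiB of headers moved per 100 MiB message
```

This matches the "roughly 2GB moved per 100MB tensor" figure reported in the issue description; a take-all variant would move each header at most once.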
Interesting! The correct fix is probably to have … I'll make sure someone takes a look soon. |
@llhe amazing investigation work! |
I wonder if there's extra inefficiency in that benchmark, in that repeated fields are used (…). |
@yaroslavvb Do you mean the serialization on the sending side? I haven't identified that issue. I use float32 with size 100MB, and it looks like it goes into this branch as expected. On the receiving side, it also goes into this branch in protobuf. However, the 18.2% time consumption looks unexpectedly high to me for the bare … |
As discussed off-channel, we were looking at slightly different benchmarks. My original local_distributed_benchmark.py does … The "fetching into Python" version is ridiculously slow with the gRPC runtime (0.05 GB/sec grpc vs 3.4 GB/sec in-process), whereas the non-fetching version is just slow (0.9 GB/sec grpc vs 20.2 GB/sec in-process) for … |
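For context on the two variants, here is a minimal sketch of the difference, assuming a server is already listening on the address shown; the tensor, addresses, and names are illustrative, not the actual benchmark code:

```python
# Sketch of the two benchmark modes: "fetching into Python" copies the
# result into a numpy array on top of the transfer, while running the
# op alone leaves the result inside the TF runtime.
import numpy as np
import tensorflow as tf

with tf.device("/job:worker/task:0"):
    data = tf.constant(np.ones(25 * 1000 * 1000, dtype=np.float32))  # ~100MB
with tf.device("/job:worker/task:1"):
    received = tf.identity(data)   # forces a RecvTensor on task 1

with tf.Session("grpc://localhost:2222") as sess:
    _ = sess.run(received)    # fetching: extra copy out into Python
    sess.run(received.op)     # non-fetching: transfer only, nothing returned
```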
BTW, you can partially mitigate this problem by running multiple ps processes and sharding your variables over them (see the sketch after this comment). This makes the logic more complicated, unfortunately.
|
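A minimal sketch of that mitigation, using the TF 1.x API current at the time; the ps addresses and variable size are hypothetical:

```python
# Shard one big variable across two ps tasks so each RecvTensor call
# moves a smaller chunk, served by the ps processes in parallel.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0:2222", "ps1:2222"],   # hypothetical addresses
    "worker": ["worker0:2222"],
})

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    # The partitioner splits the variable into per-ps shards; reads of
    # the full variable become several smaller parallel transfers.
    big_var = tf.get_variable(
        "big_var", shape=[25 * 1000 * 1000],   # ~100MB of float32
        partitioner=tf.fixed_size_partitioner(num_shards=2))
```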
@yaroslavvb Good suggestion! Making more shards should help. |
This is being fixed here: grpc/grpc#8975. The issue with the grpc polling buffer is resolved. The remaining unnecessary memory copy is a known issue marked as a TODO by Jeff and Sanjay: decoding the tensor raw content, which currently does a buffer memcpy. |
With that patch, I'm seeing a 10x improvement in transfer speed for a 1GB buffer: ./sharded_ps_benchmark.py --ps=1 --iters=1 --data_mb=1024. After the patch: … |
@jhseu, thought you might want to take a look at this issue. Thanks. |
Is there any news on this being ported over to TensorFlow? Seems to be a very straightforward patch. |
Looks like it hasn't been merged into gRPC yet. I'll make sure it gets into the TensorFlow 1.0 release. |
Anyone working on this? The grpc patch is merged; I can give it a try. |
Yeah, that'd be great! Note that it didn't make it into the gRPC 1.1 release, so you'd have to update gRPC to a more recent git version. We also use a custom BUILD file that's in third_party/grpc.BUILD. You can diff it against the version it's derived from to see its changes. Ideally, we'd just use gRPC's build file and not have our own, but I'm not sure how feasible that is. I'm not sure how many of those changes are still required, or whether any more might be needed. We haven't updated gRPC for a few months. |
Also, if you update, it'd be useful to run this benchmark to compare before and after: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/distributed_runtime/rpcbench_test.cc
Run it with: bazel run -c opt rpcbench_test -- --benchmarks=all |
I'd be super interested in hearing the result.
|
@ctiller Is bazel still supported in gRPC trunk? It looks like some things are broken: unreferenced headers, incorrect header paths, redefined symbols, etc. I'm trying to fix them to make it build with tensorflow. |
We are in the process of switching to Bazel as our source of truth. I'd be surprised if anything is broken, but if so it's something we'd want to track down immediately. Can I trouble you for more details?
|
This PR is still being worked on; it was updated recently. |
Shall we close this issue, as the gRPC upgrade has been completed by @jhseu? I re-ran the benchmark_grpc_recv.py test and found it has improved from 100-200 MB/s to 800 MB/s. It now fully utilizes a 10Gbps link, and is actually much better than we'd expected. |
Wow @yaroslavvb I suppose you are a committer to TF now? That is awesome. |
@byronyi, I re-ran the test using tf built from the latest master branch, and I still get …: not much improvement. |
For your reference:
|
I got …, but have you run the workers on two different machines? How is the speed? |
I use this script to test tensor transmission on 3 different machines; note that task 2 is a client, responsible for submitting the job (a sketch of this kind of setup follows below).
|
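The script itself isn't preserved above; a minimal sketch of a setup like the one described, with hypothetical hostnames and task 2 acting as the client, might look like:

```python
# Three tasks: 0 and 1 run servers that hold/execute the graph; task 2
# is the client that builds the graph and submits the job.
import sys
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "worker": ["host0:2222", "host1:2222", "host2:2222"],  # hypothetical
})
task = int(sys.argv[1])

if task < 2:
    server = tf.train.Server(cluster, job_name="worker", task_index=task)
    server.join()   # serve graph execution requests forever
else:
    with tf.device("/job:worker/task:0"):
        data = tf.get_variable("data", [25 * 1000 * 1000],   # ~100MB fp32
                               initializer=tf.zeros_initializer())
    with tf.device("/job:worker/task:1"):
        received = tf.identity(data)   # forces a task0 -> task1 RecvTensor
    with tf.Session("grpc://host0:2222") as sess:
        sess.run(data.initializer)
        sess.run(received.op)   # run the op only; avoid fetching into Python
```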
So I just tested this again on the latest version, and the speed is roughly the same. However, this 971 MB/second is too slow: AWS now has 25 Gbps ethernet cards, and using that capacity requires being able to send a 100MB tensor in <=32ms. Currently a RecvTensor of that size takes >100ms, even locally.
|
@yaroslavvb I guess you could give the latest accelerated networking feature a shot; it provides a Mellanox ConnectX-3/ConnectX-3 Pro Virtual Function (and supposedly supports RDMA). |
@byronyi well, this slowness is within a single machine, so the bottleneck must be software related. My suspicion is that RecvTensor involves a single-threaded memcpy somewhere; this would explain 75% of the slowness: I see 105ms, while 100 MB at 1.25 GB/s memcpy speed = 80ms. A quick probe of raw copy bandwidth is sketched below. |
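One quick way to sanity-check that theory is to measure single-threaded copy bandwidth on the same box. A numpy-based probe, offered as a rough proxy for memcpy rather than what RecvTensor actually executes:

```python
# Time one bulk copy of a 100 MiB buffer; numpy's copy bottoms out in
# an essentially single-threaded memcpy. A result near 1.25 GB/s would
# support the ~80ms-of-105ms estimate above.
import time
import numpy as np

src = np.ones(100 * 2**20, dtype=np.uint8)   # 100 MiB
dst = np.empty_like(src)

start = time.time()
np.copyto(dst, src)
elapsed = time.time() - start
print(f"{src.nbytes / elapsed / 2**30:.2f} GiB/s single-threaded copy")
```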
@yaroslavvb Well, I do agree more optimisations could be done in gRPC to eliminate memcpy even with the current TCP/IP stack. The loopback performance on my DigitalOcean box is around 20 Gbps:

```
$ sudo perf record netperf -t TCP_STREAM -H 127.0.0.1
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.1 () port 0 AF_INET : demo
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  16384  16384    10.00    18905.90
[ perf record: Woken up 6 times to write data ]
[ perf record: Captured and wrote 1.338 MB perf.data (30386 samples) ]
```

But I don't think it is really about memcpy, as the kernel also does memcpy and seems to do it pretty fast:

```
$ sudo perf report --header
# ========
# captured on: Thu Mar 8 05:09:45 2018
# hostname : localhost
# os release : 4.15.0-1-amd64
# perf version : 4.15.4
# arch : x86_64
# nrcpus online : 4
# nrcpus avail : 4
# cpudesc : Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
# cpuid : GenuineIntel,6,79,1
# total memory : 8172116 kB
# cmdline : /usr/bin/perf_4.15 record netperf -t TCP_STREAM -H 127.0.0.1
# event : name = cycles, , size = 112, { sample_period, sample_freq } = 1500, sample_type = IP|TID|TIME|PERIOD, disabled = 1, inherit = 1, mmap = 1, comm = 1,
# CPU_TOPOLOGY info available, use -I to display
# NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: breakpoint = 5, cpu = 4, software = 1, tracepoint = 2, msr = 6
# CACHE info available, use -I to display
# missing features: TRACING_DATA BRANCH_STACK GROUP_DESC AUXTRACE STAT
# ========
#
# Total Lost Samples: 0
#
# Samples: 13K of event 'cycles'
# Event count (approx.): 19061735130
#
# Overhead  Command  Shared Object      Symbol
# ........  .......  .................  ..............................................
#
    34.79%  netperf  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
     4.40%  netperf  [kernel.kallsyms]  [k] tcp_sendmsg_locked
     3.08%  netperf  [kernel.kallsyms]  [k] __tcp_ack_snd_check
     2.55%  netperf  [kernel.kallsyms]  [k] __pv_queued_spin_lock_slowpath
     2.29%  netperf  [kernel.kallsyms]  [k] syscall_return_via_sysret
     1.77%  netperf  [kernel.kallsyms]  [k] pvclock_clocksource_read
     1.49%  netperf  [unknown]          [k] 0xfffffe000003201e
     1.38%  netperf  [kernel.kallsyms]  [k] __raw_callee_save___pv_queued_spin_unlock
     1.23%  netperf  [kernel.kallsyms]  [k] get_page_from_freelist
``` |
It sounds like your loopback performance is also bottlenecked by a single-threaded memcpy; 20 Gbps loopback seems low given that AWS can do 25 Gbps over ethernet (PS: my earlier numbers are GB/s rather than Gbps). |
@yaroslavvb If you are using the in-kernel TCP/IP stack, then it's rather unlikely you can avoid the memory copy on the receiving side, even with the new MSG_ZEROCOPY feature (available for TCP since Linux 4.14) :) |
Just a quote from the article: … |
So I think the original problem with RecvTensor is the single-threaded memcpy. You can do memory copies fast if you use multiple threads. For instance, putting a 100MB object into Ray's storage takes 17ms; that involves a memory copy and translates to about 50 Gbps. I can add a constant to 100MB worth of 1s and put the result into new memory in 3.5ms; that's about 250 Gbps, which is probably the upper limit of how fast you can copy memory on a Xeon v4 (a multi-threaded copy sketch follows below). |
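A sketch of that multi-threaded copy idea; the chunk count is an arbitrary assumption, and numpy typically releases the GIL during large copies, so threads can scale with memory channels:

```python
# Split the buffer into disjoint chunks and copy each on its own
# thread; no locking is needed since destination ranges don't overlap.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def parallel_copy(dst, src, nthreads=4):
    n = len(src)
    step = (n + nthreads - 1) // nthreads
    bounds = [(i, min(i + step, n)) for i in range(0, n, step)]
    with ThreadPoolExecutor(nthreads) as pool:
        list(pool.map(lambda b: np.copyto(dst[b[0]:b[1]], src[b[0]:b[1]]),
                      bounds))

src = np.ones(100 * 2**20, dtype=np.uint8)   # 100 MiB
dst = np.empty_like(src)
parallel_copy(dst, src)
```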
Would a possible workaround be splitting your tensor into partitions and load-balancing your RecvTensor calls (sketched below)? I think that's what TF currently does to load-balance PS tasks. |
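A manual version of that workaround might look like the following; TF 1.x API, with the device names and two-ps layout as assumptions:

```python
# Split one logical ~100MB tensor into shards that live on different
# ps tasks, so a single transfer becomes several concurrent
# RecvTensor calls that can be load-balanced.
import tensorflow as tf

NUM_SHARDS = 4
SHARD_LEN = 25 * 1000 * 1000 // NUM_SHARDS

shards = []
for i in range(NUM_SHARDS):
    with tf.device("/job:ps/task:%d" % (i % 2)):   # 2 ps tasks assumed
        shards.append(tf.get_variable("shard_%d" % i, [SHARD_LEN]))

with tf.device("/job:worker/task:0"):
    full = tf.concat(shards, axis=0)   # pulls all shards in parallel
```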
Plus, in a truly distributed case, since a TCP connection is stream oriented, multithreading over the same socket would involve a fair amount of locking, effectively making its performance equivalent to the single-threaded case at best. If we restrict ourselves to the intranode case, why not just use immutable shared memory? You could avoid the bulk memcpy altogether (sketched below). |
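For the intranode case, a sketch of the immutable-shared-memory idea; it needs Python 3.8+ for multiprocessing.shared_memory, and the segment name is illustrative:

```python
# Producer publishes the tensor once into a named shared segment;
# consumers map it in place, so no bulk memcpy happens on the
# receiving side. Cleanup (close/unlink) is omitted for brevity.
import numpy as np
from multiprocessing import shared_memory

# Producer side: one write into shared memory.
payload = np.ones(25 * 1000 * 1000, dtype=np.float32)   # ~100MB
shm = shared_memory.SharedMemory(create=True, size=payload.nbytes,
                                 name="tensor0")
np.ndarray(payload.shape, payload.dtype, buffer=shm.buf)[:] = payload

# Consumer side (normally another process): attach and read in place.
view = shared_memory.SharedMemory(name="tensor0")
tensor = np.ndarray(payload.shape, payload.dtype, buffer=view.buf)
print(tensor[:3])   # zero-copy view of the producer's data
```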
I made benchmark tests for a distributed setup over the loopback network, profiled it, and found that there is excessive memory copying on the client side of the RecvTensor call, which is actually one of the bottlenecks.
Here is the code, mainly stolen from @yaroslavvb here: …
Here is the profiling result (google perftools) with tensor size 100MB (one observation: throughput degrades as the tensor size increases): …
From the result, the sending side (device2) looks fine, but the receiving side (device1, the grpc client) consumes too many CPU cycles for the data transfer.
By the way, I made rough stats for this memmove call. For one round of 100MB tensor assignment, roughly 2GB of data is moved (actually, counting the copy inside memmove, it would be 4GB with a naive memmove), which is a 20+ times RAM bandwidth amplification (the result is an average over 100 rounds, which may not be precise, but the scale should be right).