Improve performance with lower sampling rate #70
Conversation
Some unit tests are failing due to a discrepancy between the data being sent and the data being received, probably a shift in sample order, something I suspected could occur due to a race condition. Fortunately the additional mutex in #71, meant to deal with the memory leak, also solves this issue :) (a.k.a. buy one pull request, get one free :D)
Thank you for the thorough investigation. I'm sure @tstenner will look at this before I have a chance to, so I'll wait for his reply before digging in. W.r.t. the resource consumption: LSL defaults to buffering 6 minutes of data at the outlet. It's designed to allow enough time to swap a battery or replace a device running an inlet. I think this use case is probably quite rare outside the MOBI lab at SCCN, but it's still better to default to a number too big than a number too small, so it remains 6 minutes. See here. You can reduce this number to decrease the RAM usage.
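For anyone who needs to shrink it: the buffer length is the `max_buffered` argument when constructing an outlet (and `max_buflen` on an inlet). A minimal sketch with the C++ API; the stream parameters and the 30-second value are only illustrative:

```cpp
#include <lsl_cpp.h>
#include <vector>

int main() {
	// 8 float32 channels at 10 Hz; names are only for illustration
	lsl::stream_info info("LowRateStream", "Misc", 8, 10.0, lsl::cf_float32);

	// chunk_size = 0 keeps the default chunking; max_buffered = 30 seconds
	// instead of the default 360 (6 minutes) cuts the outlet's RAM usage
	lsl::stream_outlet outlet(info, /*chunk_size=*/0, /*max_buffered=*/30);

	std::vector<float> sample(8, 0.0f);
	outlet.push_sample(sample);
}
```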
I haven't looked through everything, but I'm already impressed. The current design is a single-producer multiple-consumer queue, with the twist that a sample is only removed once all consumers have gotten it, rather than after just one. It's a lockless queue optimized for throughput with a lot of consumers at high sampling rates, i.e. all consumers can pull samples at once. With the mutex, one consumer blocks all other consumers from pulling samples. As you've noted, it has some trade-offs for a single consumer at low sampling rates. For performance tests, you can use GNU time (not the bash builtin time; you have to write the full path, i.e. `/usr/bin/time`). The next option would be perf, but in most situations it's overkill. I see you're in the same timezone, so I could just call you this or next midmorning to discuss how best to proceed.
Thanks for looking into it! I am not sure I understand at which level the "multiple consumers" you are talking about occur. Here the modification takes place in `consumer_queue`. Or am I missing situations where several outlets or inlets use the same `consumer_queue`? If there are resources describing the internal architecture and the implementation of LSL, I'll take them :)
Right, in my head I was already one step ahead, where there's a single queue for all consumers. As for design documents, feel free to ask about anything. I guess Christian knows more, but I'm sure I'll reply faster.
If you plan to use a single queue for all, that should indeed simplify the code and enable some optimizations! I just synced with master -- here as well as in #71.
Concerning the additional lock in push_sample(), there are indeed various possible scenarios with concurrent threads. I see one where, for example: the consumer checks for a sample, the producer adds a sample, the producer acquires the lock, the consumer attempts to lock, the producer notifies and releases the lock, and the consumer acquires the lock and now waits for a notify that would, this time as well, only arrive upon the next enqueue. I guess acquiring the lock as early as possible, upon entering the function, is the most predictable, i.e. first one has to push before there's something to pull -- this is what I committed just now. Note: I don't see a situation where the buffer is full and more than one sample would need to be dropped in push_sample -- otherwise it might keep the lock for an undetermined amount of time? I ran some tests -- actually using my old method, because at the moment the unit test benchmark provokes a ...
Hello, I am sure you have limited time to work on the library, but don't hesitate to tell me if there is something I can do to ease the merge of this pull request as well as #71 -- I need the performance increase in several use cases, and if possible I'd rather use the upstream version of LSL :)
I'm (slowly) adding some more unit tests for the queue and some threading-related issues, and I'll check with the LLVM ThreadSanitizer, just to be on the safe side. During some benchmarks I've found another concurrency issue resulting in a sample ... On the other hand, the latency when there's no sample immediately available drops from 1-2ms to almost nothing. I've rebased and combined the commits in the https://github.com/tstenner/liblsl/tree/dev-leak branch
Thanks for the update! I understand it takes much effort to profile the code and investigate all possible side effects, especially when multi-threading is involved :\ I did not come up with a cleverer way to implement the workaround, at least not without touching ...
I looked at this PR yesterday, and I also came to the conclusion that, in theory, the optimal strategy for throughput would be to choose the wait strategy based on the stream's declared sampling rate (low srate: use a condition variable; high srate: keep polling). However, I think that in practice this path would be a small project all in itself (and probably not too fruitful).
So, long story short, it seems that going down the locking route would be the best trade-off in terms of sanity (thread safety), predictable timing, and simplicity (and thus a lower likelihood of bugs). In that case, perhaps the most appealing would be to use either ...
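Whichever primitive ends up being used, the locking route itself is short. For reference, a minimal self-contained sketch of a bounded queue guarded by a mutex and a condition variable, with the drop-oldest overflow policy discussed in this thread (illustrative names, not the actual consumer_queue interface):

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>

template <typename T> class locking_queue {
public:
	explicit locking_queue(std::size_t capacity) : capacity_(capacity) {}

	void push(T value) {
		{
			std::lock_guard<std::mutex> lk(mut_);
			if (buf_.size() == capacity_) buf_.pop_front(); // drop oldest on overflow
			buf_.push_back(std::move(value));
		}
		cv_.notify_one(); // outside the lock
	}

	// Returns false on timeout; the predicate makes spurious wakeups harmless.
	bool pop(T &out, std::chrono::milliseconds timeout) {
		std::unique_lock<std::mutex> lk(mut_);
		if (!cv_.wait_for(lk, timeout, [this] { return !buf_.empty(); }))
			return false;
		out = std::move(buf_.front());
		buf_.pop_front();
		return true;
	}

private:
	std::size_t capacity_;
	std::deque<T> buf_;
	std::mutex mut_;
	std::condition_variable cv_;
};
```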
src/consumer_queue.cpp
Outdated
std::this_thread::sleep_for(std::chrono::milliseconds(1));
} while (!buffer_.pop(result));
// wait for a new sample until the thread calling push_sample delivers one, or until timeout
std::unique_lock<std::mutex> lk(lock_);
If we move this lock to the top of the function, we can have it be thread-safe in the presence of a buffer overflow.
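Something like this, perhaps (a sketch using the members from the diff above; `cv_` is a placeholder for the condition variable member, and the exact signature may differ):

```cpp
sample_p consumer_queue::pop_sample(double timeout) {
	sample_p result;
	// Lock before the first pop attempt, so push_sample's drop-oldest
	// overflow handling cannot interleave with us.
	std::unique_lock<std::mutex> lk(lock_);
	// The predicate retries the pop on every wakeup, which handles both
	// spurious wakeups and samples pushed just before we locked; on
	// timeout, result simply stays empty.
	cv_.wait_for(lk, std::chrono::duration<double>(timeout),
		[&] { return buffer_.pop(result); });
	return result;
}
```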
src/consumer_queue.cpp
Outdated
std::unique_lock<std::mutex> lk(lock_);
while (!buffer_.push(sample)) {
	sample_p dummy;
	buffer_.pop(dummy);
}
Wrapping this whole section (including the lock) in a scoped block would resolve the thread-safety issue and the double-free error. We'll then only need a std::lock_guard. After the other change to pop_sample, we should be able to leave the notify_one outside the lock, as per your suggestion.
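In code, the suggestion might look like this (again a sketch; `cv_` stands in for the condition variable member):

```cpp
void consumer_queue::push_sample(const sample_p &sample) {
	{
		// Scoped block: the lock_guard covers both the push and the
		// drop-oldest overflow handling, fixing the race and the double
		// free, and is released before we notify.
		std::lock_guard<std::mutex> lk(lock_);
		while (!buffer_.push(sample)) {
			sample_p dummy;
			buffer_.pop(dummy); // drop the oldest sample
		}
	}
	cv_.notify_one(); // outside the lock, as suggested
}
```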
That's one potential error; the other is a sample referring to itself as the next sample (i.e. ...
That's the gist of the queue work item in the CZI grant proposal. The next step then is to do the locking and its additional overhead not once per sample but once per push/pull operation, so the locking doesn't affect the throughput too much.
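To illustrate the idea, a hypothetical chunked pull that takes the lock once and drains whatever is queued (the `pop_chunk` name and interface are made up for this sketch, not current liblsl API):

```cpp
// Lock once, then move up to max_samples out of the buffer; the
// per-sample cost inside the loop is just the move itself.
std::size_t consumer_queue::pop_chunk(std::vector<sample_p> &out, std::size_t max_samples) {
	std::lock_guard<std::mutex> lk(lock_);
	std::size_t n = 0;
	for (sample_p s; n < max_samples && buffer_.pop(s); ++n)
		out.push_back(std::move(s));
	return n;
}
```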
Ah @jfrey-xx I just realized that the thread-safety issue I mentioned above is rolled into your PR #71 for the memory leak at buffer overflow (only glancing at that one now). Good catch! So @tstenner, in line with what you say, perhaps that's a good first step. Conversely, considering that the sleep/poll method as originally written incurs a ~0.5ms latency, I'm not sure we'd want to cling to it in any case, despite the modest throughput increase, even if we tried to have some sampling-rate-dependent switch. I had searched and did find some viable thread-safe queues benchmarked here, some of which have >15x throughput improvement over the std::mutex method in the low-contention case (which we have here) -- however, unfortunately, using any of those would again come with a painful latency vs. CPU load tradeoff (i.e., not so appealing). I did try out thread::yield() btw, and with that the latency goes down to something much lower, and the max throughput exceeds the lock-based version. But I recall having read some arguments against using yield (forgot what those were), which is why I had used a sleep there back then.
Not on all platforms, but if it's configurable those affected can decide whether they'd rather have lower throughput or a potential memory leak.
According to my measurements, ~1.2ms on Linux and ~1.6ms on Windows, even with the Boost.Thread latency patch.
That's for one lock/unlock per operation? In the typical case, the consumer pulls samples until its buffer is full. With the right memory layout (ideally a contiguous ring buffer) the throughput would be mostly limited by the memory speed.
In the common case where one thread pulls data from one sporadically sending outlet, you'd max out a CPU core per inlet.
Or a crash... Well, I think if we allow it to be non-blocking, we'd need to use a different type of lock-free queue. Btw, I had seen an interesting recipe here, which suggests that one can make a queue that opportunistically does a lock-free buffer.pop, and only if that fails, i.e., if no data is available, acquires a lock and waits until notified. That could be the best of both worlds, i.e., low CPU load at low sampling rates and higher max throughput at high rates (but I'm not sure I entirely trust the memory_order::relaxed that's used there; someone needs to do some thorough reading on the underlying rules).
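The recipe, as I understand it, boils down to something like this sketch (it assumes the producer takes the same `lock_` around its push-and-notify, as in the snippets above, so a push cannot slip in unnoticed between the failed fast-path pop and the wait; `cv_` is again a placeholder name):

```cpp
sample_p consumer_queue::pop_sample(double timeout) {
	sample_p result;
	// Fast path: lock-free pop. Under sustained load this succeeds
	// immediately and the mutex is never touched.
	if (buffer_.pop(result)) return result;
	// Slow path: queue looked empty, so block on the condition variable.
	// The predicate retries the pop under the lock, which handles both
	// spurious wakeups and a push that happened just before we locked.
	std::unique_lock<std::mutex> lk(lock_);
	cv_.wait_for(lk, std::chrono::duration<double>(timeout),
		[&] { return buffer_.pop(result); });
	return result; // empty sample_p if we timed out
}
```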
Yes, on the push/pop.
I'd have to check if it's really impossible to miss any notifications, but that sounds good.
Technically yes, as it pulls/pushes a single item at a time. For chunked operations (which are the only way to handle high sampling rates efficiently), it would still lock once per sample. From the benchmarks: even std::mutex has only 3.5x the latency, or rather 330ns-2000ns more latency, than the best queue.
Ok, I'll amalgamate this PR with the concurrency fix from #71 (plus some tweaks) and base a PR off that -- that should leave us with a safe & stable codebase for the time being, with very little throughput impact in most cases except ultra-high sampling rates. Then later we can look into some queue variants with higher throughput (e.g. opportunistic lock-free) for a future release. edit: the merged PR is #89.
This one is now merged into master - closing. |
Hello,
A bit of context: I happened to use LSL on a computer with rather low resources, a Raspberry Pi Zero W (1 GHz single-core ARM CPU). I was using various Python scripts to send data back and forth. Everything worked okayish until I launched LabRecorder on a desktop computer to record all the data, and suddenly the system as a whole underwent a huge slowdown on the Pi. I tracked the problem down to a script that was doing little besides pushing samples a couple of times per second. There was no connection on the Pi; the stream was only there to record and sync the output of some computation alongside everything else. I started to investigate and discovered that the script would start to consume CPU only when an inlet was created somewhere on the network. Okay, that could make sense: LSL might not waste resources if there is no one to send data to. Good point. However, more troublesome, upon connection the CPU load increased a lot, despite the sampling rate being very low. I did not see why or how LSL would consume so many resources for that amount of data, disrupting everything else. A bit of verbose background, but it's here to stress the importance of the issue in some specific situations.
From there I went down the rabbit hole. I thought there was something wrong with the Python bindings, but it appeared that the problem lies in the library itself, in the way polling is used in consumer_queue to synchronize data between the threads managing push and pull.
Note that it is possible to know, upon return of `cv.wait_for`, if there was a timeout or if a sample was indeed retrieved, but I did not see any use of that besides debug logging. Long story slightly shorter: instead of putting the "pulling" thread to sleep and waking it up every millisecond to check if a new sample arrived until timeout, I changed the code to use a condition variable, so that the thread goes to sleep and is woken up only once. Huge improvement at slow sampling rates, slight penalty at high sampling rates (see below).
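As an aside on that `wait_for` return value: with plain C++11 the two-argument overload reports whether the wait timed out, e.g. (a sketch; `mut`, `cv` and the surrounding buffer logic are placeholders):

```cpp
std::unique_lock<std::mutex> lk(mut);
if (cv.wait_for(lk, std::chrono::milliseconds(5)) == std::cv_status::timeout) {
	// The timeout expired: report / retry.
} else {
	// notify_one() woke us up: a sample should be available, but the
	// buffer still has to be re-checked under the lock, since a spurious
	// wakeup also returns cv_status::no_timeout.
}
```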
Here I used standard functions available since C++11, a prerequisite for LSL I believe. I also tested the Boost equivalent, with the same outcome, as well as pthread directly (I've been testing on Linux only); here again, same outcome.
A better workaround might be to use a buffer that natively allows for both non-blocking and blocking calls, but I do not know the alternatives to `spsc_queue` well, and in the meantime the fix is simple enough.

I made some very crude comparisons on my laptop and on an old Raspberry Pi 1B (as I don't have a Zero nearby anymore). Each time there was a dummy outlet streaming 8 (fixed) float32 values, at either 1 Hz, 10 Hz, 100 Hz, 1000 Hz, or as fast as possible (with 100 kHz declared on the outlet), and one inlet in a separate program on the same machine, pulling as fast as it could. Note that on the outlet side the sampling rate was set by putting the process to sleep for a fixed duration of 1/sampling rate; this could be improved to take code execution time into account, but I happened to be lazy here. The comparisons below are crude because I just used `top` to monitor CPU and a counter resetting every second to check the sampling rate. Code based on the "SimpleData" examples, hence no fancy chunks.

On the laptop (i7-7600U, 4 cores, 2.8 GHz, Ubuntu 16.04):
*(benchmark tables: CPU load and effective sampling rate, original code vs. proposal)*
On the Raspberry Pi 1B (one core, 0.7 GHz, Raspbian Buster):
*(benchmark tables: CPU load and effective sampling rate, original code vs. proposal)*
These numbers should be double-checked; due to the use of `top`, and because of other tasks occasionally occurring in the background, the variability is quite high. Still, it gives you an idea. Hopefully you have a better way to profile the code; if you have such guidelines, I would happily follow them and perform stricter comparisons if necessary. Note that I also ran the same benchmark using the Python bindings, with very similar results, minus some overhead due to the language. I also ran all of that on a Raspberry Pi 4B and the numbers were shockingly close to those of my laptop. Very interesting as a way to sense how powerful this little thing really is, but redundant here IMHO. Still, I can post all of that if you deem it useful.
I should also note that on the Pi I actually had to decrease the declared sampling rate of the sender for the "unlimited" case, from 100 kHz down to 10 kHz (I did not try in-between), otherwise with both versions of the code the receiver would crash with a segmentation fault: `lsl_inlet_c.cpp:23 ERR| Unexpected error in lsl_create_inlet: std::bad_alloc`. Not related; probably a buffer too large being allocated (there is only 182 MB of RAM on the device with my configuration). It will be a separate issue and investigation at some point in the future.

Overall, I would say that the proposed modification makes a big difference in most use cases (in my former lab we never went beyond 512 Hz for any signal, including EEG; with many off-the-shelf devices -- OpenBCI, Muse, Epoc, etc. -- it is more in the range of 100 to 250 Hz, and for personal projects I have been using LSL mostly for heart rate and breathing data under 20 Hz, and many marker streams below that rate). Even though the code is tested only on Linux at the moment, it should improve performance on any platform and with any binding. LSL still consumes way more resources than I would have expected at first, but I guess there is much going on under the hood :)
This is it for this pull request, but the story is not over, as I stumbled on a memory leak while doing my tests -- more about that in a separate pull request, to keep the discussions separate.