
Improve performance with lower sampling rate #70

Closed
wants to merge 6 commits

Conversation

@jfrey-xx (Contributor) commented Apr 1, 2020

Hello,

A bit of context: I happened to use LSL on a computer with rather low resources, a Raspberry Pi Zero W (1GHz single-core ARM CPU). I was using various Python scripts to send data back and forth. Everything worked okay-ish until I launched LabRecorder on a desktop computer to record all the data; suddenly the system as a whole underwent a huge slowdown on the Pi. I tracked the problem down to a script that was doing little besides pushing samples a couple of times per second. There was no connection on the Pi itself; the stream was only there to record and sync the output of some computation alongside everything else.

I started to investigate and discovered that the script would start to consume CPU only when there was an inlet created somewhere on the network. Okay, that could make sense: LSL might not waste resources if there is no one to send data to. Good point. However, more troublesome, upon connection the CPU load increased a lot, despite the fact that the sampling rate was very low. I did not see why or how LSL would consume so many resources for that amount of data, disrupting everything else. A bit of a verbose background, here to stress the importance of the issue in some specific situations.

From there I went down the rabbit hole. I thought that there was something wrong with the Python bindings, but it appeared that the problem lies in the library itself, in the way polling is used in consumer_queue to synchronize data between the threads managing push and pull.

Note that it is possible to know, upon return of cv.wait_for, whether there was a timeout or whether a sample was indeed retrieved, but I did not see any use of that besides debug logging.

Long story slightly shorter: instead of putting the "pulling" thread to sleep and waking it up every millisecond to check whether a new sample has arrived, until the timeout expires, I changed the code to use a condition variable, so that the thread goes to sleep and is woken up only once. Huge improvement at low sampling rates, a slight penalty at high sampling rates (see below).

Here I used standard functions available since C++11, which I believe is already a prerequisite for LSL. I also tested the Boost equivalent, with the same outcome, as well as pthreads directly (I have been testing on Linux only), here again with the same outcome.

A better workaround might be to use a buffer that natively allows for both non-blocking and blocking calls, but I do not know the alternatives to spsc_queue well, and in the meantime the fix is simple enough.
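For reference, a minimal sketch contrasting the two wait strategies (the member names buffer_, lock_, cv_ and the sample_p type follow the snippets quoted later in this thread; deadline handling is simplified and this is not the literal diff):

    // (a) original behaviour: poll the lock-free queue, sleeping 1 ms between attempts
    sample_p pop_sample_polling(double timeout) {
        sample_p result;
        const auto deadline = std::chrono::steady_clock::now() + std::chrono::duration<double>(timeout);
        while (!buffer_.pop(result) && std::chrono::steady_clock::now() < deadline)
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        return result;
    }

    // (b) proposal: sleep on a condition variable that push_sample() notifies
    sample_p pop_sample_waiting(double timeout) {
        sample_p result;
        std::unique_lock<std::mutex> lk(lock_);
        cv_.wait_for(lk, std::chrono::duration<double>(timeout),
            [&] { return buffer_.pop(result); }); // predicate is re-checked after every notification
        return result;
    }

In (a) an idle inlet still wakes up roughly 1000 times per second per consumer_queue; in (b) it sleeps until push_sample notifies it or the timeout expires.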

I made some very crude comparisons on my laptop and on an old Raspberry Pi 1B (as I don't have a Zero nearby anymore). Every time there was a dummy outlet streaming 8 (fixed) values of float32, at either 1Hz, 10Hz, 100Hz, 1000Hz, or as fast as possible (with 100KHz declared on the outlet), and one inlet in a separate program on the same machine, pulling as fast as it could. Note that on the outlet side the sampling rate was set by putting the process to sleep for a fixed duration of 1/sampling rate; this could be improved to take code execution time into account, but I happened to be lazy here. The comparisons below are crude because I just used top to monitor CPU and a counter resetting every second to check the sampling rate. The code is based on the "SimpleData" examples, hence no fancy chunks.
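For reference, the sender side of the benchmark looked roughly like this (a sketch in the spirit of the SendDataSimple example; the stream name and the 100Hz rate are only illustrative):

    #include <lsl_cpp.h>
    #include <chrono>
    #include <thread>
    #include <vector>

    int main() {
        const double srate = 100.0; // 1, 10, 100, 1000 ... depending on the test
        lsl::stream_info info("BenchOut", "Test", 8, srate, lsl::cf_float32, "bench_uid");
        lsl::stream_outlet outlet(info);
        std::vector<float> sample(8, 0.0f); // 8 fixed float32 values
        while (true) {
            outlet.push_sample(sample);
            // naive pacing: sleep 1/srate, ignoring the time spent in push_sample itself
            std::this_thread::sleep_for(std::chrono::duration<double>(1.0 / srate));
        }
    }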

On the laptop (i7-7600U, 4 cores, 2.8GHz, Ubuntu 16.04):

  • original code

    • 1Hz: 2.6% CPU for sender, 2.6% CPU for receiver
    • 10Hz: 2.6% CPU for sender, 2.6% CPU for receiver
    • 100Hz: 3.6% CPU for sender, 3.6% CPU for receiver
    • 1000Hz: 14.5% CPU for sender, 9.5% CPU for receiver
    • unlimited: 197% CPU for sender with (very) approximately 4.8M (!) samples pushed per second, 55% CPU for receiver with 60k samples pulled per second
  • proposal

    • 1Hz: 0% CPU for sender, 0% CPU for receiver
    • 10Hz: 0.3% CPU for sender, 0.3% CPU for receiver
    • 100Hz: 2% CPU for sender, 1.7% CPU for receiver
    • 1000Hz: 16% CPU for sender, 10% CPU for receiver
    • unlimited: 193% CPU for sender with 4.6M samples pushed per second, 91% CPU for receiver with 50k samples pulled per second

On the Raspberry Pi 1B (one core, 0.7GHz, Raspbian Buster):

  • original code

    • 1Hz: 13% CPU for sender, 8% CPU for receiver
    • 10Hz: 13% CPU for sender, 9% CPU for receiver
    • 100Hz: 18.2% CPU for sender, 13.2% CPU for receiver
    • 1000Hz: 32% CPU for sender with around 550 samples pushed per second, 31% CPU for receiver with around 550 samples pulled per second
    • unlimited: 33.8% CPU for sender with 10k samples pushed per second, 26.2% CPU for receiver with 370 samples pulled per second
  • proposal

    • 1Hz: 0% CPU for sender, 0% CPU for receiver
    • 10Hz: 0.7% CPU for sender, 0.7% CPU for receiver
    • 100Hz: 7.6% CPU for sender, 4.6% CPU for receiver
    • 1000Hz: 33% CPU for sender with around 510 samples pushed per second, 24.2% CPU for receiver with around 510 samples pulled per second
    • unlimited: 37.4% CPU for sender with 11k samples pushed per second, 23.2% CPU for receiver with 480 samples pulled per second

These numbers should be double-checked; due to the use of top and to other tasks occasionally running in the background, the variability is quite high, but it gives you an idea. Hopefully you have a better way to profile the code; if you have such guidelines I would happily follow them and perform stricter comparisons if necessary.

Note that I also ran the same benchmark using the Python bindings, with very similar results, minus some overhead due to the language. I also ran all of that on a Raspberry Pi 4B and the numbers were shockingly close to those of my laptop. Very interesting as a way to sense how powerful this little thing really is, but redundant here IMHO. Still, I can post all of that if you deem it useful.

I should also note that on the Pi I actually had to decrease the declared sampling rate of the sender for the "unlimited" case, from 100KHz down to 10KHz (I did not try in between), otherwise with both code versions the receiver would crash with a segmentation fault: lsl_inlet_c.cpp:23 ERR| Unexpected error in lsl_create_inlet: std::bad_alloc. This is not related -- probably too large a buffer being allocated (there is only 182MB of RAM on the device with my configuration); it will be a separate issue and investigation at some point in the future.

Overall, I would say that the proposed modification makes a big difference for most use cases (in my former lab we never went beyond 512Hz for any signal, including EEG; with many off-the-shelf devices such as OpenBCI, Muse or Epoc it is more in the range of 100 to 250Hz; and for personal projects I have been using LSL mostly for heart rate and breathing data under 20Hz, and many marker streams below that rate). Even though the code is tested only on Linux at the moment, it should improve performance whatever the platform and whatever the binding. LSL still consumes way more resources than I would have expected at first, but I guess there is much going on under the hood :)

This is it for this pull request, but the story is not over, as I stumbled on a memory leak while doing my tests -- more about it in a separate pull request to keep the discussions separate.

@jfrey-xx (Contributor, Author) commented Apr 1, 2020

Some unit tests are failing due to a discrepancy between which data is being sent and which data is being received, probably a shift in sample order, something I suspected could occur due to a race condition. Fortunately the additional mutex in #71, meant to deal with the memory leak, also solves this issue :) (a.k.a. buy one pull request, get one free :D)

@cboulay (Collaborator) commented Apr 1, 2020

Thank you for the thorough investigation. I'm sure @tstenner will look at this before I have a chance to, so I'll wait for his reply before digging in.

w.r.t. the resource consumption, LSL defaults to buffering 6 minutes of data at the outlet. It's designed to allow for enough time to swap a battery or replace a device running an inlet. I think this use case is probably quite rare outside the MOBI lab at SCCN but it's still better to default to a number too big than a number too small, so it remains 6 minutes. See here. You can reduce this number to decrease the RAM usage.
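For reference, the buffer length is set when constructing the outlet (there is a matching max_buflen parameter on the inlet side). A minimal sketch, assuming the C++ API; the stream parameters and the 10-second value are only illustrative:

    // Shrink the outlet buffer from the default 360 s down to 10 s to save RAM on a small device.
    lsl::stream_info info("MyLowRateStream", "Misc", 8, 10 /*Hz*/, lsl::cf_float32, "uid1234");
    lsl::stream_outlet outlet(info, /*chunk_size=*/0, /*max_buffered=*/10);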

@tstenner (Collaborator) commented Apr 1, 2020

I haven't looked through everything, but I'm already impressed.
The sampling queue is something I have had on my list for some time, but there are some blockers before it.

The current design is a single-producer multiple-consumer queue with the twist that a sample is only removed when all consumers have gotten it, instead of only a single one. It's a lockless queue that's optimized for throughput with a lot of consumers at high sampling rates, i.e. all consumers can pull samples at once. With the mutex, one consumer blocks all other consumers from pulling samples. As you've noted, it has some trade-offs for a single consumer at low sampling rates.
That being said, it's a good start, but there's a bit more that'd need to be done before we can replace the queue implementation.

For performance tests, you can use GNU time (not the bash builtin time; you have to write the full /usr/bin/time -v); it gives you a somewhat useful summary:

        Command being timed: "./SpeedTest"
        User time (seconds): 21.94
        System time (seconds): 31.04
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:52.55
[..]
        Maximum resident set size (kbytes): 6484136
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 98
        Minor (reclaiming a frame) page faults: 1621282
        Voluntary context switches: 4367567
        Involuntary context switches: 54858

The next option would be perf, but in most situations it's overkill.

I see you're in the same timezone, so I could just call you this or next midmorning to discuss how to best proceed.

@jfrey-xx (Contributor, Author) commented Apr 1, 2020

Thanks for looking into it!

I am not sure I understand at which level the "multiple consumers" you are talking about occur. Here the modification takes place in consumer_queue, and from my observations and some tests I ran, even if for some reason there are multiple inlets connected to the same outlet in the same program, each inlet will be associated with a separate instance of the consumer_queue class, with a separate instance of its buffer. Hence consumers would not block one another; moreover, in practice the lock is acquired for a very short time (one instruction), since cv_.wait_for() releases it while it waits for a notification. The only case where there would really be an impact is if the thread pushing samples to the consumer queues has to remove old samples from the consumer queue buffer, and at the exact same time the thread of the main program enters the consumer queue's pull function (as a consequence of pulling from an inlet). Even there we are talking about a few CPU cycles, and this lock prevents undefined states in the queue.

Or am I missing situations where several outlets or inlets use the same consumer_queue? If there are resources describing the internal architecture and the implementation of LSL, I'll take them :)

@tstenner (Collaborator) commented Apr 1, 2020

> I am not sure I understand at which level the "multiple consumers" you are talking about occur.

Right, in my head I was already one step ahead, at the point where there's a single queue for all consumers.
I hope I get to review this PR by the end of the week, but from what I've looked at so far there shouldn't be anything speaking against it.
I've pushed a bunch of changes to replace the Boost mutexes with std::mutexes, so you'll need to rebase the PR.

As for design documents, feel free to ask about anything. I guess Christian knows more, but I'm sure I'll reply faster.

@jfrey-xx (Contributor, Author) commented Apr 1, 2020

If you plan to use a single queue for all consumers, that should indeed simplify the code and bring optimizations!

I just synced with master -- here as well as in #71.

@jfrey-xx (Contributor, Author) commented Apr 6, 2020

Concerning the additional lock in push_sample(), there are indeed various possible scenarios with concurrent threads. I see one where, for example, the consumer checks for a sample, the producer adds a sample, the producer acquires the lock, the consumer attempts to lock, the producer notifies and releases the lock, and the consumer acquires the lock and is now waiting for a notify that, this time as well, will only arrive upon the next enqueue. I guess acquiring the lock as early as possible, upon entering the function, is the most predictable, i.e. one first has to push before there is something to pull -- this is what I committed just now. Note: I don't see a situation where the buffer is full and more than one sample would need to be dropped in push_sample -- otherwise it might keep the lock for an undetermined amount of time?

I ran some tests -- actually using my old method, because at the moment the unit test benchmark provokes a double free or corruption (fasttop) error (fixed in #71, as it happens). It seems that the additional lock has little penalty here on performance (about 4M samples pushed per second instead of 4.2M with the "unlimited" test, again with a big variance due to the methodology). A similar change in #71 (which I updated as well) does not seem to impact performance.

@jfrey-xx (Contributor, Author) commented:

Hello,

I am sure you have limited time to work on the library, but don't hesitate to tell me if there is something I can do to ease the merge of this pull request as well as #71 -- I need the increase in performance in several use cases, and if possible I'd rather use the upstream version of LSL :)

@tstenner (Collaborator) commented Apr 17, 2020

I'm (slowly) adding some more unit tests for the queue and some threading-related issues, and I'll check with the LLVM ThreadSanitizer, just to be on the safe side.

During some benchmarks I found another concurrency issue resulting in a sample's next member pointing to itself, and thus an infinite loop on sample factory destruction and a potential double free(). So I'm somewhat inclined to merge your changes, even though there's more to be done and, at least in some preliminary tests, the chunked push performance dropped more than I'd like when there are >5 outlets connected.

On the other hand, the latency when there's no sample immediately available drops from 1-2ms to almost nothing.

I've rebased and combined the commits in the https://github.com/tstenner/liblsl/tree/dev-leak branch.

@jfrey-xx (Contributor, Author) commented:

Thanks for the update! I understand it takes much effort to profile the code and investigate all possible side effects, especially when multi-threading is involved :\ I did not come up with a more clever way to implement the workaround, at least not without touching spsc_queue... but I'll keep thinking about it!

@chkothe (Contributor) commented Oct 27, 2020

I looked at this PR yesterday, and I also came to the conclusion that, in theory, the optimal strategy for throughput would be to choose the wait strategy based on the stream's declared sampling rate (low srate: use wait_for, high srate: use the sleep/poll method).

However, I think that in practice this path would be a small project all in itself (and probably not too fruitful), for the following reasons:

  • it'd require some solid benchmarking on different architectures to find what's a generally good threshold (500Hz is probably not too far off, though)
  • if we use locking anyway, then using a lockfree queue would technically be overkill since the memory accesses are already protected, and whatever it does to ensure synchronization would be unnecessary overhead (a vanilla circular buffer like e.g. boost::circular_buffer, or a small (but very well tested) handcrafted one, should be more efficient).
  • on close inspection, the buffer overflow case, triggered by your test case of pushing data faster into the buffer than it can be picked up and sent over the network, is actually not something that this queue is currently robust to (neither before nor after the commit) -- this is because in that case, the producer becomes another consumer by doing .pop() itself, which isn't thread safe (that was overlooked in the original implementation). That in turn can, and probably will, cause a double free, and that may well be the one that was discussed above. So if we wanted to keep the wait-free fast path and have it be robust to the buffer overflow, we'd need to use the boost mpmc queue -- however, that one has, I think, a lot more overhead internally than the lockfree circular buffer (it uses linked lists instead of an array), so it's surely slower and may even incur more CPU load in practice than the wait-based method (would have to be benchmarked). Technically we'd only need spmc, and I've seen some 3rd-party implementations/papers, but using one of those would require some very serious vetting.
  • as Tristan pointed out, when using the wait_for, one cannot elide the lock in the push_sample() method around the notify_one call, because it's not just less predictable, but can cause an event marker to simply not go through to the other side (until there's another event later that "unlocks" it)*. However, we only need a std::lock_guard there and not a full std::unique_lock since it's not being unlocked intermittently in push_sample (not sure if that's less overhead in practice, but may well be on some OSes).

So, long story short, it seems that going down the locking route would be the best trade-off in terms of sanity (thread safety), predictable timing, and simplicity (and thus a lower likelihood of bugs). In that case, perhaps the most appealing would be to use either circular_buffer or a handcrafted queue, but as a start we could (ab)use the existing lockfree queue. And we'd need full locking on both sides. edit: I'll work this into some suggestions on the code.

*: if pop_sample() held its lock the whole time as in a normal locking queue, then the sequence that triggers that problem wouldn't happen

Review thread on the consumer_queue diff:

std::this_thread::sleep_for(std::chrono::milliseconds(1));
} while (!buffer_.pop(result));
// wait for a new sample until the thread calling push_sample delivers one, or until timeout
std::unique_lock<std::mutex> lk(lock_);
Review comment (Contributor):

If we move this lock to the top of the function, we can have it be thread-safe in the presence of a buffer overflow.

Comment on lines 26 to 30
std::unique_lock<std::mutex> lk(lock_);
while (!buffer_.push(sample)) {
sample_p dummy;
buffer_.pop(dummy);
}
Review comment (Contributor):
Wrapping this whole section (including the lock) in its own block scope would resolve the thread-safety issue and the double-free error. It will then only need a std::lock_guard. After the other change to pop_sample, we should be able to leave the notify_one outside the lock, as per your suggestion.
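A sketch of the resulting scheme (consumer_queue member names assumed; this is the shape of the suggestion, not the merged code verbatim):

    void push_sample(const sample_p &sample) {
        {
            std::lock_guard<std::mutex> lk(lock_);
            while (!buffer_.push(sample)) { // overflow: evict the oldest sample, under the lock
                sample_p dummy;
                buffer_.pop(dummy);
            }
        }
        cv_.notify_one(); // can stay outside the lock once pop_sample holds its lock throughout
    }

    sample_p pop_sample(double timeout) {
        sample_p result;
        std::unique_lock<std::mutex> lk(lock_); // lock taken at the top of the function
        cv_.wait_for(lk, std::chrono::duration<double>(timeout),
            [&] { return buffer_.pop(result); });
        return result;
    }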

@tstenner (Collaborator) commented:

> on close inspection, the buffer overflow case, triggered by your test case of pushing data faster into the buffer than it can be picked up and sent over the network, is actually not something that this queue is currently robust to (neither before nor after the commit) -- this is because in that case, the producer becomes another consumer by doing .pop() itself, which isn't thread safe (that was overlooked in the original implementation). That in turn can, and probably will, cause a double free, and that may well be the one that was discussed above.

That's one potential error; the other is a sample referring to itself as the next sample (i.e. sample == sample->next), so that an infinite loop is encountered on destruction. A bit better for the data, but worse for the user.

> So, long story short, it seems that going down the locking route would be the best trade-off in terms of sanity (thread safety), predictable timing, and simplicity (and thus a lower likelihood of bugs). In that case, perhaps the most appealing would be to use either circular_buffer or a handcrafted queue, but as a start we could (ab)use the existing lockfree queue. And we'd need full locking on both sides. edit: I'll work this into some suggestions on the code.

That's the gist of the queue work item in the CZI grant proposal. The next step then is to do the locking and additional overhead not once per sample but rather once per push/pull operation, so the locking doesn't affect the throughput too much (see the sketch below).
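To illustrate the idea, a hypothetical pop_chunk (not part of the current code) that takes the lock once and then drains whatever is available:

    // One lock acquisition per pull operation instead of one per sample.
    std::size_t pop_chunk(std::vector<sample_p> &out, std::size_t max_samples) {
        std::lock_guard<std::mutex> lk(lock_);
        sample_p s;
        std::size_t n = 0;
        while (n < max_samples && buffer_.pop(s)) {
            out.push_back(s);
            ++n;
        }
        return n;
    }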

@chkothe (Contributor) commented Oct 28, 2020

Ah @jfrey-xx I just realized that the thread-safety issue I mentioned above is rolled into your PR #71 for the memory leak at buffer overflow (only glancing at that one now). Good catch!

So @tstenner in line with what you say, perhaps a good first step for the consumer_queue is just fixing the potential crash bug using the minimal set of changes along the lines of what we have above, which then also gives us the CPU load and latency reduction. I'm inclined to try to get that into 1.14 actually, since it's a real crash bug that can trigger in production (namely when the network connection is interrupted for long enough that the buffer overflows). And the performance improvement is a very nice additional gain that comes along with it for free.

Conversely, considering that the sleep/poll method as originally written incurs a ~0.5ms latency, I'm not sure we'd want to cling to that in any case, despite the modest throughput increase, even if we tried to have some sampling rate dependent switch. I had searched and did find some viable thread-safe queues benchmarked here, some of which have >15x throughput improvement over the std::mutex method in the low-contention case (which we have here) -- however, unfortunately using any of those would again come with a painful latency vs CPU load tradeoff (i.e., not so appealing). I did try out thread::yield() btw and with that the latency goes down to something much lower, and the max throughput exceeds the lock-based version. But I recall having read some arguments against using yield (forgot what those were), which is why I had used a sleep there back then.

@tstenner (Collaborator) commented:

> And the performance improvement is a very nice additional gain that comes along with it for free.

Not on all platforms, but if it's configurable those affected can decide whether they'd rather have lower throughput or a potential memory leak.

> Conversely, considering that the sleep/poll method as originally written incurs a ~0.5ms latency, I'm not sure we'd want to cling to that in any case

According to my measurements it's ~1.2ms on Linux and ~1.6ms on Windows, even with the Boost.Thread latency patch.

> I had searched and did find some viable thread-safe queues benchmarked here, some of which have >15x throughput improvement over the std::mutex method in the low-contention case (which we have here) -- however, unfortunately using any of those would again come with a painful latency vs CPU load tradeoff (i.e., not so appealing).

That's for one lock/unlock per operation? In the typical case, the consumer pulls samples until its buffer is full. With the right memory layout (ideally a contiguous ring buffer) the throughput would be mostly limited by the memory speed.

> I did try out thread::yield() btw and with that the latency goes down to something much lower, and the max throughput exceeds the lock-based version. But I recall having read some arguments against using yield (forgot what those were), which is why I had used a sleep there back then.

In the common case where one thread pulls data from one sporadically sending outlet, you'd max out a CPU core per inlet.

@chkothe (Contributor) commented Oct 30, 2020

> Not on all platforms, but if it's configurable those affected can decide whether they'd rather have lower throughput or a potential memory leak.

Or a crash... Well, I think if we allow it to be nonblocking, we'd need to use a different type of lockfree queue. Btw, I had seen an interesting recipe here, which suggests that one can make a queue that opportunistically does a lockfree buffer.pop, and only if that fails, i.e. if no data is available, acquires a lock and waits until notified. That could be the best of both worlds, i.e. low CPU load at low sampling rates and higher max throughput at high rates (but I'm not sure I entirely trust the memory_order::relaxed that's used there; someone needs to do some thorough reading on the underlying rules).
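A sketch of that opportunistic recipe (not implemented here; whether a notification can slip through still needs the careful checking discussed in the next comment):

    // Fast path: lock-free pop without touching the mutex.
    // Slow path: take the lock and wait until push_sample notifies or the timeout expires.
    sample_p pop_sample_opportunistic(double timeout) {
        sample_p result;
        if (buffer_.pop(result))
            return result; // common case at high sampling rates: no lock, no syscall
        std::unique_lock<std::mutex> lk(lock_);
        cv_.wait_for(lk, std::chrono::duration<double>(timeout),
            [&] { return buffer_.pop(result); });
        return result;
    }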

> That's for one lock/unlock per operation? In the typical case, the consumer pulls samples until its buffer is full. With the right memory layout (ideally a contiguous ring buffer) the throughput would be mostly limited by the memory speed.

Yes, on the push/pop.

@tstenner (Collaborator) commented:

> Not on all platforms, but if it's configurable those affected can decide whether they'd rather have lower throughput or a potential memory leak.

> Or a crash... Well, I think if we allow it to be nonblocking, we'd need to use a different type of lockfree queue. Btw, I had seen an interesting recipe here, which suggests that one can make a queue that opportunistically does a lockfree buffer.pop, and only if that fails, i.e. if no data is available, acquires a lock and waits until notified. That could be the best of both worlds, i.e. low CPU load at low sampling rates and higher max throughput at high rates (but I'm not sure I entirely trust the memory_order::relaxed that's used there; someone needs to do some thorough reading on the underlying rules).

I'd have to check if it's really impossible to miss any notifications, but that sounds good.

> That's for one lock/unlock per operation? In the typical case, the consumer pulls samples until its buffer is full. With the right memory layout (ideally a contiguous ring buffer) the throughput would be mostly limited by the memory speed.

> Yes, on the push/pop.

Technically yes, as it pulls/pushes a single item at a time. For chunked operations (which are the only way to handle high sampling rates efficiently), it would still lock once per sample. From the benchmarks: even std::mutex only has 3.5x the latency, or rather 330ns-2000ns more latency, than the best queue.

Also from the readme:

> Some people proposed busy-waiting with a call to sched_yield/pthread_yield. However, sched_yield is a wrong tool for locking because it doesn't communicate to the OS kernel what the thread is waiting for, so that the OS scheduler can never wake up the calling thread at the "right" time, unless there are no other threads that can run on this CPU. More details about sched_yield and spinlocks from Linus Torvalds.

@chkothe (Contributor) commented Oct 30, 2020

Ok, I'll amalgamate this PR with the concurrency fix from #71 (plus some tweaks) and base a PR off that -- that should leave us with a safe & stable codebase for the time being, with very little throughput impact in most cases except ultra-high sampling rates. Then later we can look into some queue variants with higher throughput (e.g. opportunistic lock-free) for a future release.

edit: merged PR is #89.

@chkothe (Contributor) commented Nov 5, 2020

This one is now merged into master - closing.

@chkothe closed this Nov 5, 2020