Topic suffers under heavy load, Set comparison CPU intensive
#1406
Comments
In the meantime, I can share the stack trace commonly seen in the problematic scenario.
As the number of subscribers grows, more and more threads end up doing this most of the time, until complete starvation (in our case, Kubernetes kept restarting the service since the healthcheck endpoint would become unresponsive). |
We have the same issue on our team. In our case we have an incoming stream of ~1k elements per second and ~100 subscribers that just put every element into a cache. A simple example looks like:

```scala
import scala.concurrent.duration._
import scala.util.Random

import cats.effect.{ExitCode, IO, IOApp}
import fs2.Stream

object Example extends IOApp {
  def doSomething(x: Any) = IO(println(x))
  def arbitraryData = IO(Random.nextInt(100))

  override def run(args: List[String]): IO[ExitCode] =
    for {
      // topic <- CustomTopic[IO, Int] // swap in to compare with the default implementation
      topic <- fs2.concurrent.Topic[IO, Int](Random.nextInt(100))
      subscriptions = (1 to 100).map(
        i => topic.subscribe(1).collect { case d if d == i => d }.evalMap(doSomething)
      )
      _ <- (Stream.fixedDelay(10.millis).evalMap(_ => arbitraryData).to(topic.publish) merge
             Stream.emits(subscriptions).parJoinUnbounded).compile.drain
    } yield ExitCode.Success
}
```

We tried to solve it with this implementation:

```scala
import cats.effect.Concurrent
import cats.effect.concurrent.Ref
import cats.implicits._
import fs2.{Sink, Stream}
import fs2.concurrent.{Queue, Topic}

// Note: the remaining abstract members of Topic are elided here;
// the full solution is in the PR linked below.
class CustomTopic[F[_], A](cache: Ref[F, List[Queue[F, A]]])(implicit C: Concurrent[F])
    extends Topic[F, A] {

  override def publish: Sink[F, A] =
    _.evalMap(publish1)

  override def publish1(a: A): F[Unit] =
    cache.get.flatMap { subscribers =>
      subscribers.traverse(_.enqueue1(a)).void
    }

  override def subscribe(maxQueued: Int): Stream[F, A] =
    emptyQueue(maxQueued).evalTap(q => cache.update(_ :+ q)).flatMap(_.dequeue)

  // bracket removes the queue from the cache when the subscriber terminates
  private def emptyQueue(maxQueued: Int): Stream[F, Queue[F, A]] =
    Stream.bracket(Queue.bounded[F, A](maxQueued))(
      queue => cache.update(_.filter(_ ne queue))
    )
}

object CustomTopic {
  def apply[F[_], A](implicit C: Concurrent[F]): F[CustomTopic[F, A]] =
    Ref.of[F, List[Queue[F, A]]](List.empty).map(new CustomTopic(_))
}
```

When we use our custom topic in the example above, the profiler shows a very different picture.

UPD: I created a PR to illustrate our full solution: #1407 |
@SystemFw the current implementation of Topic won't perform super-well under heavy contention between a busy publisher and lots of subscribers. The older implementation was slightly better, specifically due to the fact that each subscriber had its own queue that elements were fed into; that queue was registered with the state and set during publish. The observation about lots of queues and states is correct: what you see is a lot of spins and attempts to fight for setting that single ref, which has its own single state. Simply too many subscribers :-) I think there is a workaround: reimplement Topic with a slightly modified State and PubSub, where part of the registration is again the subscriber's queue. Another alternative with the current implementation: change the topic to a deeper hierarchy, i.e. one topic handling 16 subscribers, each of which in turn handles, say, 16 subscribers; that gives you 16 x 16 = 256 subscribers with really low contention. I wouldn't yet suggest giving up and going back to the old implementation, as that had different problems, e.g. flow control. |
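The deeper-hierarchy idea could be sketched roughly like this (a sketch only, against the fs2 1.0-era API; the `fanOut` name, the sizes, and the re-publishing wiring are all invented for illustration, not code from this issue):

```scala
import cats.effect.Concurrent
import cats.implicits._
import fs2.Stream
import fs2.concurrent.Topic

object HierarchicalTopic {
  // Build one root topic plus `branches` child topics; a "pump" stream
  // re-publishes everything the root sees onto each child. Subscribers are
  // then spread across children: 16 children with 16 subscribers each gives
  // 256 subscribers, but no single internal Ref is contended by more than
  // one publisher and ~16 subscribers.
  def fanOut[F[_]: Concurrent, A](initial: A, branches: Int, maxQueued: Int)
      : F[(Topic[F, A], List[Topic[F, A]], Stream[F, Unit])] =
    for {
      root     <- Topic[F, A](initial)
      children <- List.fill(branches)(Topic[F, A](initial)).sequence
      // this plumbing stream must be run concurrently with the rest of the app
      pump = Stream
        .emits(children)
        .map(child => root.subscribe(maxQueued).to(child.publish))
        .parJoinUnbounded
    } yield (root, children, pump)
}
```

The trade-off is an extra queueing hop of latency per level, in exchange for splitting one heavily contended state into many lightly contended ones.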
@vladimir-popov I am not surprised; the current implementation of Topic will perform poorly under your load. This is more or less the worst case for the current implementation. However, I am quite confident Topic can be improved on top of PubSub to perform similarly to your solution. |
Also @SystemFw, @vladimir-popov: if we just change the internals of how PubSub works, perhaps we get better performance without changing anything in the state logic. |
Actually I'm not sure the problem is contention on Ref, and I don't even know whether it's a problem with PubSub in general, necessarily.
Well, the thing is that in my scenario the old implementation just works with no issues, whereas the new one chokes and causes the service to restart continuously, so that's more than slightly better :P
Well, that's a hot fix that I need to do right now, otherwise everything just chokes, but ideally I would want something based on PubSub, but more scalable, so I agree with you on that. |
@SystemFw I think the issue demonstrated here is that there are so many concurrent attempts to CAS the State that the overhead is killing the whole thing. I hope to squeeze in some time over the weekend to take a look at this, but I feel there are two approaches that may fix it. |
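What "many concurrent attempts to CAS the State" costs can be illustrated without fs2 at all. `Ref#update` is essentially a compare-and-set retry loop; the toy below (illustrative only, not fs2's actual code) counts how many CAS attempts fail when several threads hammer one reference, and every failed attempt is recomputed, thrown-away work:

```scala
import java.util.concurrent.atomic.{AtomicLong, AtomicReference}

object CasContention extends App {
  val state   = new AtomicReference(Vector.empty[Int])
  val retries = new AtomicLong(0)

  // Essentially what Ref#update does: read, compute, compareAndSet, retry on failure.
  def update(f: Vector[Int] => Vector[Int]): Unit = {
    var done = false
    while (!done) {
      val cur = state.get()
      done = state.compareAndSet(cur, f(cur))
      if (!done) retries.incrementAndGet() // failed CAS: pure overhead
    }
  }

  val threads = (1 to 8).map { _ =>
    new Thread(() => (1 to 10000).foreach(i => update(_ :+ i)))
  }
  threads.foreach(_.start())
  threads.foreach(_.join())

  // With one thread, retries would be 0; with 8 threads on one reference,
  // a large fraction of attempts fail and must be redone.
  println(s"successful updates: ${state.get().size}, failed CAS attempts: ${retries.get()}")
}
```

The failed-attempt count grows with the number of competing parties, which matches the "more subscribers, more spinning" profile described above.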
@SystemFw, @vladimir-popov, I have dived into this a bit and wrote a simple benchmark, run with 8 threads, 500 messages in the topic and 1 to 128 subscribers. The code is here: https://gist.github.com/pchlupacek/f033993302ee2a741a6473286306c9b3, and the results are below:
From the code you can see there are no operations on the messages, so this should measure overhead only. This was run on a 6-core / 12-hyperthread machine but on 8 JVM threads, so I would expect the sweet spot to be somewhere around 8-16 subscribers. For larger numbers of subscribers, in the ideal case the execution time should be something like (Nx/8)*T8, where Nx is the number of subscribers and T8 is the time for 8 subscribers.
I don't fully understand the error ranges; they seem a bit off to me. When I compared individual results they were usually in a +/- 5% range, not more than 100% as the 128-subscriber row shows, so I suspect something is wrong with jmh, or with my understanding of it.
As you can see, the results are more or less as expected for 1-16 subscribers, where the differences come just from setting up the program and subscribers. Interesting is the difference between 8 and 16 subscribers, which gets close to ideal. Anything after that is a complete disaster and clearly demonstrates the concurrency issue. I think it confirms my original suspicion of a single point of contention. I will follow up with more experiments to see how we can avoid the problem. |
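The ideal-scaling estimate above can be written out directly (the 100 ms value used for T8 below is a made-up placeholder, not a measured number from the benchmark):

```scala
// Ideal scaling on 8 worker threads: N subscribers should take roughly
// ceil(N / 8) "waves", each costing T8 (the measured time for 8 subscribers).
def idealTime(subscribers: Int, t8: Double, threads: Int = 8): Double =
  math.ceil(subscribers.toDouble / threads) * t8

println(idealTime(16, 100.0))  // 200.0: two waves of 8 subscribers
println(idealTime(128, 100.0)) // 1600.0: perfect scaling would be 16 waves
```

Results far above this line indicate overhead that grows with subscriber count, i.e. contention rather than useful work.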
Thanks for looking into this @pchlupacek :) FYI, the old implementation in the service I mentioned above is handling 1400 subs at peak load without breaking a sweat, on a 2 CPU machine which also has some blocking code on it |
@SystemFw I am not surprised that the old implementation performs much better. I would say the only limit is RAM, and obviously more subscribers -> more time to publish one element. The problem with that implementation is that it is highly specialised, and your only control over OOM in the case of a slow subscriber is to use bounded queues. |
Ok, so I have changed the PubSub to use Lock + Var, and this is the result:
Now this is much better; however, you can still see that we are |
The non lock implementation has the advantage of being...well, lock free, so I'd prefer it if possible. I really like the abstraction given by PubSub, however the old Topic was both lock-free and had lower contention, so I feel that anything we do should be proven to be better than the old implementation |
@SystemFw this is still lock free. It uses a semaphore, so it is semantic locking. I am not sure the old Topic had lower contention per se; it just distributed the work a bit better, so in situations where you have one publisher and many subscribers it performed better. |
If it uses semaphore it's not blocking, but it's not lock-free either: if the thread holding the lock cannot proceed you get stuck all the same.
You're probably right, it still has a single Ref, but it's spending less time on it because then you go on the queues. The thing is that one publisher and many subscribers is like the perfect Topic use case, and we're talking an order of magnitude better, so that's great (and after all, you wrote both :P ) |
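For reference, the "semantic locking" being discussed serialises access with a one-permit cats-effect `Semaphore` rather than spinning on a CAS or blocking a thread. A minimal sketch, assuming cats-effect 1.x (the `SemanticLock` name and shape are invented for illustration; this is not the actual PubSub change):

```scala
import cats.effect.Concurrent
import cats.effect.concurrent.{Ref, Semaphore}
import cats.implicits._

// Hypothetical helper: serialise updates to shared state behind a one-permit
// semaphore. Waiting fibers are parked semantically (no thread is blocked and
// nothing spins), but if the permit holder cannot proceed, everyone waits,
// which is why this is "not lock-free" in the sense discussed above.
final class SemanticLock[F[_], A](permit: Semaphore[F], state: Ref[F, A])(
    implicit F: Concurrent[F]) {
  def update(f: A => A): F[Unit] =
    permit.withPermit(state.update(f))
}

object SemanticLock {
  def of[F[_]: Concurrent, A](initial: A): F[SemanticLock[F, A]] =
    (Semaphore[F](1), Ref.of[F, A](initial)).mapN(new SemanticLock(_, _))
}
```

The trade-off versus a bare `Ref`: no wasted CAS retries under contention, but updates are fully serialised and progress depends on the current permit holder.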
@SystemFw yes, and I think there is a solution for this, even with the Ref implementation. I'll update you. Essentially we use PubSub to implement Topic the old way :-) |
that would be ideal :) |
@SystemFw I think the current PubSub signatures don't allow for more concurrent subscriber behaviour. I have some ideas on improving it. I think the improved implementation with locking reduces contention significantly. However, as you have pointed out, one element through pub/sub still takes about 2*N+1 exclusive accesses on a single contended ref, whereas the former implementation had only one exclusive access plus N exclusive accesses distributed over the individual subscribers, resulting in much better contention control. I need some time to think about how to improve PubSub to have the same performance characteristics while not losing the features it has today. |
@SystemFw I went ahead a bit and just created the benchmark. These are the results:
One interesting observation I made is that this implementation was not able to fully utilize the cores either. Also, from the performance standpoint, it is interesting that this is not |
Update: somehow the current implementation, when the semantic lock is put in place, always uses a single thread. This should not be the case, as F.start is used on completion of a subscriber. I am investigating this more. |
Just a quick update on this one. I am still experimenting with ways to tackle it. It seems that I may need to touch the PubSub strategy signature a bit, but I'm not yet 100% sure how. I'm not sure about the urgency of this one, but I am targeting it to be resolved by the next major release. In the meantime there is the workaround with the manual implementation. Let me know if that works for you. |
It does work for me, but perhaps we should make it a bit easier for other people to grab the old implementation if they stumble upon this |
@pchlupacek Do you think this is solvable in next week or so? We'll have a minor API breakage release in 1.1 where you could change PubSub signature if needed. |
yes this would be ideal. @SystemFw any pointers? I'd love to find a workaround while this gets fixed properly |
I think the latest commit with the old implementation is this one: https://github.com/typelevel/fs2/blob/4ddd75a2dc032b7604dc1205c86d7d6adc993859/core/shared/src/main/scala/fs2/concurrent/Topic.scala, which should be mostly source compatible. Are you hitting a problem because of Topic under load? |
@SystemFw awesome, thanks a lot! |
Actually, the performance boost is even more impressive than that: I think I can handle approximately 5x more messages. What I haven't yet discovered, though, are the implications of this workaround. |
FYI in fs2 3.0 we have reinstated the implementation strategy of the old version of Topic |
As far as I understand, this issue is now present only on |
There is a problem with Topic (and I suspect with Signal as well), probably due to how CPU-intensive Set comparison is. One of my teams had problems where reasonably heavy traffic was causing thread starvation due to the CPU work needed for that comparison. I was able to fix it by porting the old Topic (without PubSub), but we need to address this because it's a pretty serious regression right now. I'm trying to come up with a reproducible example, which isn't super easy.
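The cost being described, structural Set equality on every state transition, can be seen with a toy measurement (illustrative only, not fs2's actual code; the point is that `==` on two distinct but equal Sets must visit all n elements, and a pub/sub state that compares Sets pays that per message):

```scala
object SetComparisonCost extends App {
  val n = 200000
  val a = (1 to n).toSet
  val b = (1 to n).toSet // structurally equal, but a different instance

  // No reference-equality shortcut applies here, so `a == b` traverses and
  // probes all n elements. Done once per published message per state update,
  // this becomes pure CPU overhead that scales with subscriber-set size.
  val t0     = System.nanoTime()
  val equal  = a == b
  val micros = (System.nanoTime() - t0) / 1000
  println(s"equal=$equal, comparison took ${micros}us for $n elements")
}
```

The absolute timing varies by machine; what matters is that the cost is linear in the set size and is incurred on the hot publish path.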