Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INT-3846: Some fixes for the SimpleMessageStore #1585

Closed
wants to merge 18 commits into from

Conversation

artembilan
Copy link
Member

JIRAs:
https://jira.spring.io/browse/INT-3830
https://jira.spring.io/browse/INT-3523
https://jira.spring.io/browse/INT-3846

@artembilan
Copy link
Member Author

@celandro,
this is a fresh PR, which includes your fixes and my polishing.

Feel free to comment!

Thank you very much for the contribution any way!

@celandro
Copy link

celandro commented Oct 5, 2015

Looks great! I really like the per group limit, quite clean

@@ -233,8 +270,8 @@ public void removeMessageGroup(Object groupId) {
return;
}

groupUpperBound.release(groupIdToMessageGroup.get(groupId).size());
groupIdToMessageGroup.remove(groupId);
this.groupToUpperBound.remove(groupId).release(this.groupCapacity);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this NPE if groupId does not exist?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. Because the code here is enough robust from the concurrent access:

lock.lockInterruptibly();
try {
        if (!groupIdToMessageGroup.containsKey(groupId)) {
        return;
    }
    this.groupToUpperBound.remove(groupId).release(this.groupCapacity);
    this.groupIdToMessageGroup.remove(groupId);
}
finally {
    lock.unlock();
}

You can see the lock and a couple of modification operators if groupId exists.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, i checked the addMessageToGroup code to make sure. Id still be paranoid about it and add a null check to future proof the code on this.groupToUpperBound.remove(groupId)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've revised this and it does not make sense for me to release that UpperBound any more.
If if remove the group, we should just clean everything around it.
If something is waiting for the permits (]addMessageToGroup]), it should create a new UpperBound.

@celandro
Copy link

celandro commented Oct 5, 2015

If you can get this build on maven central, I can load test it before your release this week. Its a pain to test a foreign snapshot in our build system or I'd test it off this branch.

@artembilan
Copy link
Member Author

@celandro ,
thank you for reaching this PR and your review!

We don't place any SNAPSHOT, or even M or RC release to the Maven Central, but you can test it against Spring's repository:

repositories {
    maven { url 'http://repo.spring.io/libs-snapshot' }
}

We let you know when it will be available there.

BTW, this PR is only for master - 4.2.1 version.
For backporting I'll raise different PRs, but only after acceptance of this one.

CC @garyrussell

@celandro
Copy link

celandro commented Oct 6, 2015

Found a major bug in DelayHandler which makes this unusable

    private void doReleaseMessage(Message<?> message) {
        if (removeDelayedMessageFromMessageStore(message)) {
            this.messageStore.removeMessagesFromGroup(this.messageGroupId, message);
            this.handleMessageInternal(message);
        }
        else {
            if (logger.isDebugEnabled()) {
                logger.debug("No message in the Message Store to release: " + message +
                        ". Likely another instance has already released it.");
            }
        }
    }
    private boolean removeDelayedMessageFromMessageStore(Message<?> message) {
        if (this.messageStore instanceof SimpleMessageStore) {
            SimpleMessageGroup messageGroup =
                    (SimpleMessageGroup) this.messageStore.getMessageGroup(this.messageGroupId);
            return messageGroup.remove(message);
        }
        else {
            return ((MessageStore) this.messageStore).removeMessage(message.getHeaders().getId()) != null;
        }
    }

EDIT:
Sorry after looking into it further, what happens is a bit more complicated, First it tries to remove the message from the store without using any locking. It then tries to lock inside of this.messageStore.removeMessagesFromGroup which blocks because addMessageToGroup already has the lock. Eventually addMessageToGroup unblocks, and it tries to remove messages which are no longer there so the upperbound did not get modified.

@celandro
Copy link

celandro commented Oct 6, 2015

This fixes the issue but I don't like it. The upperbound release has to be moved outside of the lock.

    @Override
    public void removeMessagesFromGroup(Object groupId, Collection<Message<?>> messages) {
        UpperBound upperBound = this.groupToUpperBound.get(groupId);
        upperBound.release(messages.size());
        Lock lock = this.lockRegistry.obtain(groupId);
        try {
            lock.lockInterruptibly();
            try {
                SimpleMessageGroup group = this.groupIdToMessageGroup.get(groupId);
                Assert.notNull(group, "MessageGroup for groupId '" + groupId + "' " +
                        "can not be located while attempting to remove Message(s) from the MessageGroup");
                boolean modified = false;
                for (Message<?> messageToRemove : messages) {
                    if (group.remove(messageToRemove)) {
                        modified = true;
                    }
                }
                if (modified) {
                    group.setLastModified(System.currentTimeMillis());
                }
            }
            finally {
                lock.unlock();
            }
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new MessagingException("Interrupted while obtaining lock", e);
        }
    }

I'm sure there is a better way but I'm at a loss.
Edit: Moving the tryAquire outside the lock is probably more correct. That puts every add on an exact timeout instead of them going 1 by 1 into the lock and blocking for X milliseconds. Then fix the DelayHandler code to not call remove twice.
Edit2: You should consider always decrementing the tryAquire for now even if its incorrect. I found this issue with just 1 use case of MessageStore, who knows what other use cases would not handle a properly working capacity

@@ -208,6 +239,12 @@ public MessageGroup addMessageToGroup(Object groupId, Message<?> message) {
if (group == null) {
group = new SimpleMessageGroup(groupId);
this.groupIdToMessageGroup.putIfAbsent(groupId, group);
this.groupToUpperBound.putIfAbsent(groupId, new UpperBound(this.groupCapacity));
}
if (!this.groupToUpperBound.get(groupId).tryAcquire(this.upperBoundTimeout)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We either need a new catch for InterruptedException here, or at least we need to fix the message in the MessagingException in the outer catch since the interrupt might be here not on the lock acquisition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@garyrussell , unfortunately we don't have here InterruptedException.
See UpperBound.tryAcquire() source code:

catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    return false;
}

Or am I missing anything here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, of course; yes, no problem; ignore my comment.

@artembilan
Copy link
Member Author

Hello, @celandro !

Thank you very much for pointing that out!
Really good catch and it looks like nasty bug 😄

Well, looking to the DelayHandler code and this our new stuff around UpperBound, I'd say that we introduced a new serious dead lock issue.

See:

  1. We call addMessageToGroup obtaining the lock for the group.
  2. Reaching this.groupToUpperBound.get(groupId).tryAcquire(this.upperBoundTimeout) and wait for the gap in the UpperBound
  3. We call removeMessageFromGroup to free a slot in the UpperBound.
  4. And ... we can't do that, because we are waiting for lock here !

Not sure yet how to fix or even overcome, but I see that your concern for the DelayHandler should be fixed exactly from there, not MS part. We just must not remove from the group outside of MS.

I guess we can disable messageGroup.remove(message) if we are together with SimpleMessageStore. In other words don't allow any modifications outside of UpperBound for the group.

@celandro
Copy link

celandro commented Oct 6, 2015

That fixes 1 issue, but the issue of the tryAquire inside the lockExclusively means that you can easily exceed the maxAquire time. Moving the tryAquires in all the addMessage methods outside the lock would fix that

edit: and yes it was nasty, my simple code that takes messages off a queue, puts them in a delayer and then aggregates and sends never got to the aggregate part :)

@artembilan
Copy link
Member Author

That is an interesting point. But how about do not rely on the tryAquires for the groups, but switch to the lock.newCondition() and free the lock for the removeMessageFromGroup?

Having these all concerns, I'd say that we can't bakcport it to all those maint versions.
Or, let's make the upperBoundTimeout only as a master feature!
We can backport those if (message != null) around release and try to fix DelayHandler.

But I definetelly sure all this changes and fixes are very dangerous for those versions which are here for a while already...

@celandro
Copy link

celandro commented Oct 6, 2015

I will make a few changes and make a new pull request after I loadtest a few million messages through a delayhandler. The code will be setup to log warnings if remove is called on messages that don't exist.

Remember that the code in the release version will hang if you ever insert maxCapacity elements into the store. An imperfect fix is still much better than the release version

@artembilan
Copy link
Member Author

Remember that the code in the release version will hang if you ever insert maxCapacity elements into the store. An imperfect fix is still much better than the release version

Not sure what you mean. Please, find the JavaDocs for the Semaphore.tryAcquire:

     * is returned.  If the time is less than or equal to zero, the method
     * will not wait at all.

As I said: this " imperfect fix" produces dead lock. Even if it is for some time (upperBoundTimeout), we don't have a gain of it. Just because we don't release any permit, when we would like.

Having that I even don't see reason to have this UpperBound at all. Simple counter (AtomicInteger) together with Lock and the Condition on it does the stuff.

@celandro
Copy link

celandro commented Oct 6, 2015

I agree this version, can not go to production, I was saying I could make a version without this deadlock that will work.

You are correct, it does not hang on the producer end but it prevents any new elements from being added, ever. The effect is rejecting all new entries into the store or completely breaking a delayer. My consumers of the queue look hung in this situation but its actually the message store.

As for semaphores or atomic integers, I have no opinion, I was trying for the most minimally invasive changes originally but I can certainly make this version have no race conditions and only have issues with more elements in the store than is allowed when remove is called on elements that do not exist (same issue as the current version).

I'm also a little concerned there may have been a performance degredation at some point that makes a delayer working off a large message store(~150K backlog) perform poorly. The remove methods being used are O(N) worst case, I will look into it. Im only getting about 70/sec in my load test environment with this version, my prod version based on 2.2.6 gets 1200/sec. There may be some other issue in my environment, or the locking differences may be the culprit, Will look today.

Perhaps I should extend SimpleMessageStore instead of modifying it? or make an interface they both conform to so DelayerHandler and others can check for that?

@artembilan
Copy link
Member Author

You definitely can make your own MessageGroupStore and inject it to the DelayHandler. For this one I must think more. Come back to after the lunch

@artembilan
Copy link
Member Author

Pushed some fix.
It isn't finished yet, because I'm switching computers.
But be sure to review it as you wish!

@artembilan
Copy link
Member Author

Good morning!

Pushed some simple polishing, but it isn't the final solution yet.
So, it isn't ready for the merge.

I can't still figure out the fix for dead lock.

Need to think more.
And meanwhile I go to fix other JIRAs for the tomorrow's release.

Feel free to share you thought here how to overcome that dead lock.

Thanks in advance!

@artembilan
Copy link
Member Author

Pushed.
From my perspective it is the best solution right now.
Please, review ASAP, since we would like to release all the latest fixes for all version we support.

Thanks for feedback, contribution, testing and just for the keeping dialog.

Looking forward for more contribution!

@artembilan
Copy link
Member Author

Hi @celandro !

I know you are busy, too.
Anyway would you mind to find some minute to review this one more time for the final solution.

We have some urgent plan to release 😄

Thank you for the contribution one more time!

@celandro
Copy link

It looks good to me but I haven't had a chance to load test unfortunately.

@celandro
Copy link

Unfortunately I was not able to test with this code change however I tested with just a modified extension of SimpleMessageStore.

2 issues

  1. if you call remove twice with the same message (since I have not changed the delayer), the second remove will scan the entire queue which is horrendously slow. i hope this is not the case with the new code but I can not test today
  2. if you remove somewhere else and then remove, it does not decrement the tryAquire. I worked around that since its not realistic
    @Override
    public void removeMessagesFromGroup(Object groupId, Collection<Message<?>> messages) {
        Lock lock = this.lockRegistry.obtain(groupId);
        try {
            lock.lockInterruptibly();
            try {
                SimpleMessageGroup group = this.groupIdToMessageGroup.get(groupId);
                Assert.notNull(group, "MessageGroup for groupId '" + groupId + "' " +
                        "can not be located while attempting to remove Message(s) from the MessageGroup");
                UpperBound upperBound = this.groupToUpperBound.get(groupId);
                Assert.state(upperBound != null, "'upperBound' must not be null.");
                boolean modified = false;
                upperBound.release(messages.size());
                for (Message<?> messageToRemove : messages) {
                    if (group.remove(messageToRemove)) {
                        modified = true;
                    } else {
                        log.warn("Could not remove message, capacity may be exceeded");
                    }
                }
                if (modified) {
                    group.setLastModified(System.currentTimeMillis());
                }
            }
            finally {
                lock.unlock();
            }
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new MessagingException("Interrupted while obtaining lock", e);
        }
    }

@artembilan
Copy link
Member Author

Hello, @celandro !

Thank you again for feedback!

So, let me try to answer to your concerns:

  1. if you call remove twice with the same message (since I have not changed the delayer), the second remove will scan the entire queue which is horrendously slow. i hope this is not the case with the new code but I can not test today

Not a case for now. This my DelayHandler version has these code changes:

private boolean removeDelayedMessageFromMessageStore(Message<?> message) {
        if (this.messageStore instanceof SimpleMessageStore) {
            synchronized (this.messageGroupId) {
                Collection<Message<?>> messages = this.messageStore.getMessageGroup(this.messageGroupId).getMessages();
                if (messages.contains(message)) {
                    this.messageStore.removeMessageFromGroup(this.messageGroupId, message);

So, we call here removeMessageFromGroup just for single Message. It isn't an issue here because we deal with in-memory store. Returning the MessageGroup from there doesn't impact the whore performance.

Another change looks like:

if (!(this.messageStore instanceof SimpleMessageStore)) {
    this.messageStore.removeMessagesFromGroup(this.messageGroupId, message);
}

So, we don't perform removeMessagesFromGroup (iterable one) for the SimpleMessageStore.

Therefore I don't see your issue right now.

if you remove somewhere else and then remove, it does not decrement the tryAquire. I worked around that since its not realistic

It isn't clear for me what is somewhere else from your perspective, but if you mean that my issue before, like this:

messageGroup.remove(message);

I agree and from the SimpleMessageStore we should disallow any MessageGroup modification.
But we can't do this breaking change for the maint releases. It maybe like robustness improvement for the next release.
Right now it is up to end-user to destroy his application by his conscience.

@celandro
Copy link

Agree remove is not an issue however .contains might be. I would do the following:

    private void doReleaseMessage(Message<?> message) {
        if (removeDelayedMessageFromMessageStore(message)) {
            this.messageStore.removeMessagesFromGroup(this.messageGroupId, message);
            this.handleMessageInternal(message);
        }
        else {
            if (logger.isDebugEnabled()) {
                logger.debug("No message in the Message Store to release: " + message +
                        ". Likely another instance has already released it.");
            }
        }
    }

    private boolean removeDelayedMessageFromMessageStore(Message<?> message) {
        if (this.messageStore instanceof SimpleMessageStore) {
            // not handled in here. Due to potential locking on .contains, just return true
            return true;
        }
        else {
            return ((MessageStore) this.messageStore).removeMessage(message.getHeaders().getId()) != null;
        }
    }

Unfortunately the contract of SimpleMessageStore is not a great fit for use in a delay store as there is no way to get the number of messages removed and it does not match the implementation of any other setup. Passing around the message itself instead of just the id and group seems strange to me as well. I also am concerned with the contract of MessageStore in general as calling removeMessage followed by removeMessageFromGroup is unintuitive and implies that the caller needs to clean up what amounts to an index.

One more major issue is when configurable delays are enabled, SimpleMessageGroup reverts to O(n) performance since there is no guaranteed ordering of the .remove methods when they are called. For example, in my situation, certain events have no delay and some events have 30s delay. This causes the no delay events to scan the entire queue before removal. These events then have the potential of blocking up the entire processing pipeline.

I think making a version of SimpleMessageGroup that works with only ConcurrentHashMaps would be better performance. I will work on it.

@artembilan
Copy link
Member Author

I would do the following:

No, It's bad again. You don't check if you removed message from the store really. And therefore will end up with the issue for which that change in the DelayHandler exists there: do not duplicate message after releasing.

I don't think that my synchronized (this.messageGroupId) around if (messages.contains(message)) { isn't so bad there, because we are single-threaded in the this.messageStore.removeMessageFromGroup anyway according to the lock for the groupId.

Regarding all your other concerns...

From one side I agree that MessageStore and MessageGroupStore contracts are a bit distructive, but we can do nothing right now on the matter. And live with them as is just because there is no premise to use them out side of Framework. So, the question to change their contract is something for the far future...

For you idea around ConcurrentHashMap, please, take a look here https://gist.github.com/artembilan/a7adca1cdb9c12a5d3a8. We can even consider this impl for as an out-of-the box, but already in the next release.

You last issue isn't clear to me at all...
The events without delay doesn't cause any scheduling at all:

if (delay > 0) {
    this.releaseMessageAfterDelay(requestMessage, delay);
    return null;
}

Please, be more specific, but already in the new JIRA and with some test-case to reproduce or, at least, to play.

Let's finish with this PR and summarize if it is sufficient to merge and backport!

Actually we with @garyrussell just wait here for your acceptance 😄

@celandro
Copy link

I understand your desire not to touch the contract.

Looked at the version using ConcurrentHashMap and I think it looks good. I implemented something very similar as a different class since there might be some use case where they want to use it as an actual queue. Maybe name the class RandomAccessMessageStore. Regardless, I will test your version and let you know how it performs.

Sorry we had just increased the default delay to 100ms with the others waiting 30000ms, my mistake. The random access should resolve the issue anyhow.

@artembilan
Copy link
Member Author

Great! Thank you!
But again: we are talking here about the SimpleMessageStore bugs, but not the DelayHandler performance. It is definitely the different story and must be reviewed in the separate thread.
Yes, we should keep in mind any component making such a critical changes, but I feel we can sever that DelayHandler issue right now.

@celandro
Copy link

I think it is good to release as well and resolves the issue with this bug. I see no reason to hold up the release.

I will let you know about the other class you made. The only issue I see so far is the wiring is a bit awkward as you have to set the groupId to something like "delayerId.messageGroupId" and I'm about to run the load test...

Ryan Barker and others added 16 commits January 7, 2016 14:38
JIRAs:
https://jira.spring.io/browse/INT-3830
https://jira.spring.io/browse/INT-3523
https://jira.spring.io/browse/INT-3846

* Fix the OOM condition, when we `release()` `UpperBound` independently of the previous `remove` result (https://jira.spring.io/browse/INT-3846)
* Fix "confuse" around `groupCapacity`, when we really didn't care about individual groups (https://jira.spring.io/browse/INT-3523)
* Add `upperBoundTimeout` to have a hook to wait some time for the empty slot in the store (https://jira.spring.io/browse/INT-3830)
* Fix some JavaDocs warnings
* Fix some typos
…oup` when `MS` is `SimpleMessageStore`

 * Remove `UpperBound.release()` operation from `SimpleMessageStore.removeGroup()`.
 The waiting process should worry about the new `UpperBound` instance.
But do that only for groups which already exist.
For the new groups we have a fresh `UpperBound`, so no need to worry about dead lock and
 we can obtain  a permit immediately.
Make some synchronization fixes according to the migration to the `LinkedHashSet`
@artembilan
Copy link
Member Author

Pushed after rebase and Copyright fix, and //NOSONAR for the FileWritingMH

@garyrussell
Copy link
Contributor

There is need some JavaDocs for new API and maybe some docs on the matter.

So I should merge and that will be a new PR?

@artembilan
Copy link
Member Author

So I should merge and that will be a new PR?

Doh... Yes, of course we need doc. They will be here.
Thank you for catching! 😄

@artembilan
Copy link
Member Author

Pushed Docs.

@artembilan
Copy link
Member Author

Pushed fix for JavaDocs.

@artembilan
Copy link
Member Author

@garyrussell , would you mind to review the last couple commits to close this lo-o-ong track and 3 JIRAs to have green graphic in JIRA project again 😄

@garyrussell
Copy link
Contributor

Rebased, squashed, merged as 201f0fc

Congrats @artembilan !!

@artembilan
Copy link
Member Author

And especial thanks to @celandro , who initiated the work and has forced us to reconsider the stuff.

Looking forward for more similar valuable contributions! 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants