Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Server hang on communication disabling #53

Open
gfgit opened this issue Apr 13, 2023 · 4 comments
Open

Server hang on communication disabling #53

gfgit opened this issue Apr 13, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@gfgit
Copy link
Contributor

gfgit commented Apr 13, 2023

Describe the bug
Simple world with Z21 connected

To Reproduce
Steps to reproduce the behavior:

  1. Open attached world
  2. While disconnected from Z21, power on and run
  3. Put Z21 in Emergency Stop state
  4. Open Server Log view
  5. Enable communication
  6. You will see world going in stop and possibly a Broadcast flag mismatch warning
  7. Disable communication
  8. Server hangs for a socket error, something like "connection forcibly interrupted"?

Expected behavior
No hang

World export
Test Local Z21 2023-04-13_new.zip

Traintastic server/client:

Connected hardware
Z21ServerEmulator running on localhost

Additional context
A socket exception is catched on mani() so no backtrace. And it is not happening always...

@gfgit gfgit added the bug Something isn't working label Apr 13, 2023
@gfgit
Copy link
Contributor Author

gfgit commented May 9, 2023

Found the issue but I need suggestions.
Z21's ClientKernel internally calls: (NOTE: lambda capture this to access this->m_logId)

EventLoop::call(
        [this, msg=toString(message)]()
        {
          Log::log(m_logId, LogMessage::D2001_TX_X, msg);
        });

When communication are disabled, kernel shared pointer is reset -> ClientKernel object deleted.
But event loop sometimes still has to process the lambda above so m_logId is accessed after kernel is deleted.

A possible solution is to capture a shared pointer inside the lambda instead of "raw" this to avoid it being deleted while lambda is executed.
This should be done for every lambda and also for other command stations which might present same race behaviour.

Drawback is that Kernel object might have connections inside it's parent hardware interface and the interface expect to have deleted the kernel immediately but in reality it might still use some callbacks or access some interface members

@gfgit
Copy link
Contributor Author

gfgit commented May 9, 2023

In this particolar case we could just capture a copy of log ID and not capture this at all.
But in other lambdas a reference to this is needed and we should use a different method.
Also I'm not sure shared_ptr is thread safe

@gfgit
Copy link
Contributor Author

gfgit commented May 14, 2023

This problem is partially fixed inside #59 but other lambdas passed to EventLoop::call(...) are capturing this so we need to change them also.
Other interfaces might have the same issue.
The issue is not showing up because all event loop callbacks are executed in time before kernel object gets deleted, but there is
no guarantee that object will be valid for all callback execution time.
For example EventLoop could be busy and delay the call and kernel gets deleted in the meantime (not likely to happen but theoretically possible)

@reinder
Copy link
Member

reinder commented May 14, 2023

Think we should change the kernel cleanup a bit, instead of doing m_kernel.reset(); direct we do delay is by posting the cleanup to the eventloop, as it processes everything in order, any pending log message is then processed before the kernel is destroyed.

p.s. regarding shared_ptr, the reference counting is thread safe, the rest is not. Whatever the shared_ptr is pointing the developers responsibility.

gfgit added a commit to gfgit/traintastic that referenced this issue May 15, 2023
See traintastic#53
When disabling communications, kernel object is deleted
but EventLoop callback will run after and access deleted
object through captured 'this'
Temporary fix for this problem is to copy log ID value instead.
gfgit added a commit to gfgit/traintastic that referenced this issue Jun 10, 2023
See traintastic#53
When disabling communications, kernel object is deleted
but EventLoop callback will run after and access deleted
object through captured 'this'
Temporary fix for this problem is to copy log ID value instead.
gfgit added a commit to gfgit/traintastic that referenced this issue Jun 13, 2023
See traintastic#53
When disabling communications, kernel object is deleted
but EventLoop callback will run after and access deleted
object through captured 'this'
Temporary fix for this problem is to copy log ID value instead.
gfgit added a commit to gfgit/traintastic that referenced this issue Jun 13, 2023
See traintastic#53
When disabling communications, kernel object is deleted
but EventLoop callback will run after and access deleted
object through captured 'this'
Temporary fix for this problem is to copy log ID value instead.
gfgit added a commit to gfgit/traintastic that referenced this issue Jun 18, 2023
See traintastic#53
When disabling communications, kernel object is deleted
but EventLoop callback will run after and access deleted
object through captured 'this'
Temporary fix for this problem is to copy log ID value instead.
gfgit added a commit to gfgit/traintastic that referenced this issue Jun 19, 2023
See traintastic#53
When disabling communications, kernel object is deleted
but EventLoop callback will run after and access deleted
object through captured 'this'
Temporary fix for this problem is to copy log ID value instead.
gfgit added a commit to gfgit/traintastic that referenced this issue Jun 19, 2023
See traintastic#53
When disabling communications, kernel object is deleted
but EventLoop callback will run after and access deleted
object through captured 'this'
Temporary fix for this problem is to copy log ID value instead.
gfgit added a commit to gfgit/traintastic that referenced this issue Jul 10, 2023
See traintastic#53
When disabling communications, kernel object is deleted
but EventLoop callback will run after and access deleted
object through captured 'this'
Temporary fix for this problem is to copy log ID value instead.
gfgit added a commit to gfgit/traintastic that referenced this issue Nov 2, 2023
See traintastic#53
When disabling communications, kernel object is deleted
but EventLoop callback will run after and access deleted
object through captured 'this'
Temporary fix for this problem is to copy log ID value instead.
This was referenced Nov 2, 2023
gfgit added a commit to gfgit/traintastic that referenced this issue Nov 3, 2023
See traintastic#53
When disabling communications, kernel object is deleted
but EventLoop callback will run after and access deleted
object through captured 'this'
Temporary fix for this problem is to copy log ID value instead.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants