Lock problems in networktables #11

Merged
virtuald merged 14 commits into master from lock-debugging on Jan 31, 2015

@virtuald
Member

This is the start of my investigation into possible lock problems in networktables when socket writes are delayed. There definitely are problems, so it's just a question of finding the best way to fix them before they hang someone's robot.

Not going to merge this until the problems have been resolved.

@virtuald virtuald merged commit 2cf1038 into master Jan 31, 2015

1 check passed

continuous-integration/travis-ci The Travis CI build passed
@virtuald virtuald deleted the lock-debugging branch Jan 31, 2015
@virtuald
Member

After a week of deep examination of our networktables implementation, I believe I've eliminated most (if not all) of the deadlocks present in the code. This pull request tries to remove every place where a call from robot code into NetworkTables would hold a lock while making a network call, and restructures some things to be easier to understand.

There were a few places in particular that I'd like to call out:

  • (07ec2da) The bug that affected C++ last year was here, and I believe it is present in the Java code AND somehow came back in the 2015 C++ code (though I suspect it is harder to trigger in a Linux environment). This bug is in the write manager's thread, where the entry store lock is held while trying to write the entries out to the network. Instead, I buffer the data to be sent, and send it after the lock is released.
  • (c904f43) When the client/server sends the list of entries between 'hello' and 'hello complete', the entry store's lock is held. In theory, if a disconnection happened while a client was connecting, the main thread could deadlock there.
  • (2d16753) The client adapter doesn't hold a lock when the hello comes in, and a slow network connection could cause the connection to disconnect/reconnect briefly.
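The first fix above (buffer under the lock, send after releasing it) can be sketched roughly as follows. This is only an illustration of the pattern, not the code from this pull request: `EntryStore`, `flush_holding_lock`, `flush_buffered`, and the `send` callback are hypothetical names made up for the example.

```python
import threading


class EntryStore:
    """Hypothetical stand-in for the entry store (not the real networktables2 class)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.dirty = []  # entries waiting to be written to the network

    def mark_dirty(self, name, value):
        # Called from robot code; blocks for as long as someone else holds the lock.
        with self.lock:
            self.dirty.append((name, value))


def flush_holding_lock(store, send):
    # Problematic pattern: the (possibly slow) network write happens
    # while the entry store lock is held.
    with store.lock:
        for entry in store.dirty:
            send(entry)
        store.dirty = []


def flush_buffered(store, send):
    # Safer pattern: copy the pending entries under the lock, release it,
    # then do the network writes.
    with store.lock:
        pending, store.dirty = store.dirty, []
    for entry in pending:
        send(entry)
```

With `flush_buffered`, a stalled `send` only delays the writer thread; robot code calling `mark_dirty` in the meantime is not blocked behind the socket.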

I set up some instrumentation to help debug this stuff. In particular, you can see the latency introduced when these deadlocks are present by just inserting a sleep(1) anywhere that a socket write operation is going to happen. If you're interested in playing with some of the instrumentation I've set up to help debug these, just add the following lines before you do any networktables operations:

import networktables2._impl
networktables2._impl.enable_lock_debugging()
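As a rough sketch of the sleep(1) trick mentioned above, a write can be delayed by wrapping the stream object rather than editing each call site. `SlowWriter` and `timed` are hypothetical helpers written for this example and are not part of networktables2; the wrapped stream is assumed to be any file-like object with `write` and `flush`.

```python
import time


class SlowWriter:
    """Hypothetical wrapper that delays every write to mimic a stalled socket."""

    def __init__(self, stream, delay=1.0):
        self._stream = stream
        self._delay = delay

    def write(self, data):
        # Simulate a slow network before passing the data through.
        time.sleep(self._delay)
        return self._stream.write(data)

    def flush(self):
        self._stream.flush()


def timed(fn, *args):
    # Return how long a single call took, in seconds.
    start = time.monotonic()
    fn(*args)
    return time.monotonic() - start
```

Timing a robot-side call while a writer thread is inside `SlowWriter.write` shows whether that call is waiting on a lock held across the write.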

I believe that these bugs also affect the Java and C++ implementations of NetworkTables.

@virtuald virtuald changed the title from WIP: Lock problems in networktables to Lock problems in networktables Jan 31, 2015