
Excessive lock contention #50

Closed
nxrighthere opened this issue Jan 28, 2019 · 18 comments
Labels: enhancement (New feature or request)

Comments

@nxrighthere
Contributor

nxrighthere commented Jan 28, 2019

Right now, thread safety is achieved via an excessive number of locks, which degrades performance and kills scalability. While locks aren't slow, lock contention is. A practical solution: remove any shared state/data, and redesign the core around inter-thread message-based communication without any contention. This approach scales perfectly for traditional networking on top of UDP sockets; Disruptor works best for this.

You can find additional information regarding multi-threading here.
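As a rough illustration of the message-passing design suggested above: each thread owns its state outright, and other threads talk to it through a single-producer/single-consumer ring buffer instead of touching shared data under a lock (the core idea behind the LMAX Disruptor). This is a hypothetical sketch, not code from this library; all names are illustrative.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Lock-free single-producer/single-consumer ring buffer.
// N must be a power of two so that index wrapping is a cheap mask.
template <typename T, size_t N>
class SpscRing {
public:
    bool push(const T& v) {
        const size_t head = head_.load(std::memory_order_relaxed);
        if (head - tail_.load(std::memory_order_acquire) == N)
            return false;                        // ring is full
        buf_[head & (N - 1)] = v;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }
    std::optional<T> pop() {
        const size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return std::nullopt;                 // ring is empty
        T v = buf_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return v;
    }
private:
    std::array<T, N> buf_{};
    std::atomic<size_t> head_{0};  // written only by the producer thread
    std::atomic<size_t> tail_{0};  // written only by the consumer thread
};
```

Because each index is written by exactly one thread, producer and consumer never contend on a lock; the Disruptor generalizes this idea with batching and multiple consumers.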

@zpostfacto zpostfacto added the enhancement New feature or request label Jan 28, 2019
@zpostfacto
Contributor

When thinking about optimizations, I think it's important to have a specific use case in mind. We intend to be a general purpose networking library, so I don't want to have very bad perf for an important use case. But right now lock contention hasn't been an issue in practice. We have had at least one user of this lib use a process model with a high number of threads and connections all within the same process. They were initially concerned about the global lock, as well, but we never did find any actual perf problems. It was an intentional design to keep the multi-threading design as simple as possible, and I'm hesitant to deviate from that design without a specific customer with a real world use case that is experiencing perf problems. Then we can look at more granular locking strategies.

At Valve, our approach is to use multi-process when we want to host thousands of players on a single box, and use a relatively small number of threads per process. This is how the SDR relays work -- they are single-threaded, and this is also how most of the different types of steam servers are designed to scale to fully utilize the CPU power of a box. Because multiple threads in the same process is just much more complicated. I personally think multi-process is the better way to skin the cat, but I also know that it doesn't work everywhere.

Also, this is not visible in the opensource code, but the relayed connections involve a lot more sharing of global data structures. E.g. two different connections might share a connection to one relay, but each have a separate session with two other relays, and the information about ping times used for route selection is all global. More granular locking would be very complicated.

@nxrighthere
Contributor Author

nxrighthere commented Jan 28, 2019

One question: what is the maximum number of players/messages per second a single dedicated server can currently host for Valve games (as determined by simulations or real-world experience)?

@zpostfacto
Contributor

zpostfacto commented Jan 28, 2019

We don't really have any use cases that approach the limits for our own games. We had a partner use the lib with a multi-threaded server, with hundreds of connections, and that was the closest we got to having problems.

I think a use case of hundreds of connections to a server is within the design scope of this library. 1000 is probably not. (Depending on the nature of the connections, of course.) For hundreds of connections, my expectation is that lock contention will not be an issue unless the application code is designed to access network functionality from many different threads at once. If the application only accesses the networking API from one or two threads, I don't think there will be high contention, even if the number of connections is relatively high.

To be specific, I would expect a battle royale server that doesn't access networking from many different threads to not have issues with lock contention.

The APIs that receive multiple messages at once (and send -- if that API is added) could reduce the lock contention a lot, too, I suspect.

Obviously these expectations could be wrong.
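The batching point above can be made concrete: receiving N messages per call amortizes one lock acquisition over the whole batch, instead of locking once per message. A minimal sketch of the idea; the queue and message types here are illustrative, not this library's real API.

```cpp
#include <deque>
#include <mutex>
#include <string>
#include <utility>

struct Message { std::string payload; };

class MessageQueue {
public:
    void Enqueue(Message m) {
        std::lock_guard<std::mutex> hold(lock_);
        queue_.push_back(std::move(m));
    }
    // Drain up to maxMessages under a single lock acquisition,
    // instead of taking and releasing the lock once per message.
    int ReceiveMessages(Message* out, int maxMessages) {
        std::lock_guard<std::mutex> hold(lock_);
        int n = 0;
        while (n < maxMessages && !queue_.empty()) {
            out[n++] = std::move(queue_.front());
            queue_.pop_front();
        }
        return n;
    }
private:
    std::mutex lock_;
    std::deque<Message> queue_;
};
```

With a batch size of 64, a frame that drains 64 queued messages pays for one lock round-trip rather than 64, which directly reduces the window in which other threads can contend.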

@nxrighthere
Contributor Author

nxrighthere commented Mar 11, 2019

@fletcherdvalve Do I understand correctly that the lock synchronizer in steamnetworkingsockets_lowlevel is globally shared across the whole process and its threads? So if a user tries to start multiple clients within a single process, they will all share the same synchronizer and contend to acquire/release it?

@zpostfacto
Contributor

That's right.
There's one, global, network service thread, and only one, global, lock.

@nxrighthere
Contributor Author

Ah, so this explains why my simulation fails at 70 concurrent connections within the same process... 😿

@nxrighthere
Contributor Author

Is there any read-only functionality that could potentially benefit from an SRW lock instead of the mutex, in cases where read operations greatly outnumber write operations?

@zpostfacto
Contributor

Yeah, for sure.

@nxrighthere
Contributor Author

nxrighthere commented Jun 20, 2019

@fletcherdvalve Another thing that came to mind: wouldn't the scoped lock in CSteamNetworkingSockets::RunCallbacks cause micro-stuttering if a user dispatches callbacks on the main thread (the game loop, basically)? The lock waits to acquire the synchronizer instead of trying to acquire it just once, so as not to block the thread.

Actually, this applies to all functions with the scoped lock that are frequently invoked.

I believe some sort of task-fair or phase-fair RW lock should be used there, at least to avoid starvation.
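The non-blocking dispatch being proposed could be sketched like this: the game loop attempts the lock once per frame and simply skips callback dispatch if the service thread holds it, rather than stalling. This is an illustrative sketch under assumed names, not the library's actual RunCallbacks implementation.

```cpp
#include <mutex>

std::mutex g_serviceLock;   // stand-in for the library's global lock
int g_callbacksRun = 0;

// Returns true if callbacks were dispatched, false if we'd have had to wait.
bool TryRunCallbacks() {
    std::unique_lock<std::mutex> hold(g_serviceLock, std::try_to_lock);
    if (!hold.owns_lock())
        return false;       // lock busy: skip this frame instead of stalling
    ++g_callbacksRun;       // dispatch queued callbacks here
    return true;
}
```

The trade-off is latency jitter on callback delivery (a busy frame defers dispatch to the next one) in exchange for a bounded, near-zero cost on the main thread.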

zpostfacto added a commit that referenced this issue Dec 5, 2019
This is used to poll many connections in a single function call.  Previously,
this was only possible if all of the connections were those accepted on the
same listen socket.  (ReceiveMessagesOnListenSocket).  But this left out at
least two important use cases with known users:

- If you create more than one listen socket (because there is more than one
  way to contact your service, e.g. one for P2P, another for direct IP, and
  another for relayed connections), then you could not poll all of the
  connections efficiently.
- In P2P use cases, we may initiate many connections to peers, and we want
  to poll all of them at once.

This change is relevant to: Issue #49, Issue #50, and issue #52.  (But I don't
think it really "fixes" any of them.)
zpostfacto added a commit that referenced this issue Dec 23, 2020
This is in response to issue #50.  I have a partner who wants to make sure
that all calls from their "main" thread never stall waiting on the global
lock, even for a few ms, and has documented some times when some stalls
were visible due to contention.

Using tons of locks to cover lots of different objects is really scary.
A few high level strategies were used to make this possible:
- Document what locks protect what data.  Sprinkle asserts liberally
  when accessing that data to ensure that the appropriate lock is held.
- Restrict lock acquisition patterns so that deadlocks are impossible,
  and add logic to detect violations of the expected lock pattern.
- Use "leaf" locks smartly.  A leaf lock is one that only protects
  a relatively small piece of data, and for which the usage pattern
  is "lock, access data quickly, unlock".  No other locks are allowed
  to be taken while a leaf lock is held, and so we don't have to worry
  about deadlocks.
- Major state changes, object creation and destruction, and other
  operations that are likely to touch multiple objects are still
  protected by the global lock.  Some API calls still take the global
  lock.  The most common API calls we expect to happen per frame and
  need to be fast, however, no longer take the global lock.
- Using scope lock objects, not only to take advantage of RAII unlocking,
  but also to clearly communicate lock ownership semantics between
  caller and callee.

Here are all the locks:
- Global lock.  All background processing and most "major" operations
  like creating and destroying objects, is done while holding this lock.
- Connection and poll group locks.  You must hold this lock while queuing
  messages or pulling them off the queue.
- Handle table lock.  The important API calls that need to be optimized
  begin by looking up a handle in a table and getting an object.  We have
  a lock to protect these tables.  (We use the same lock for both connection
  and pollgroup tables.)  The access pattern for this lock is the most
  unique -- see the code for details.
- Thinker lock.  Think times, and the PQ of thinkers, cannot be modified
  without holding this lock.  This lock also has a relatively unique access
  pattern.
- A single global recv message queue lock.  Received messages waiting to
  be fetched by the API are complicated because they may exist in multiple
  queues at the same time.  Instead of trying to have a lock for each queue
  and dealing with safely acquiring multiple locks when necessary, we just
  use a single "leaf" lock to protect *all* queues.  This is OK because
  the lock is only held very briefly.
- Misc other "leaf" locks.

If you are only using UDP connections, with each connection having its own
UDP socket, this may seem like overkill.  But the relationship between
connection and UDP socket is most definitely not one-to-one in the case of
SDR, where many connections utilize the same data structures.  One connection
may use one or more sessions to relays, and the server state data may be shared
between multiple connections.  These server data may use different UDP sockets,
or the sockets may be shared.  Basically each connection doesn't simply have
all its own data in such a way that it can be easily siloed.

There are still a few FIXMEs tagged in the code that I know need to be fixed,
but I confirmed basic functionality works in the simple UDP case, as well as
the much more complicated SDR case.  So this seems like a good time to check it
into a branch and get some broader testing on it.
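The "leaf lock" rule in the commit message above can be enforced mechanically. A minimal illustrative sketch (not the library's actual code) using a thread-local flag: any attempt to take another lock while a leaf lock is held fails an assert, so leaf locks can never participate in a deadlock cycle.

```cpp
#include <cassert>
#include <mutex>

thread_local bool t_holdingLeafLock = false;

// A lock under which no other lock may be acquired.
class LeafLock {
public:
    void lock() {
        assert(!t_holdingLeafLock);  // nesting leaf locks is a bug
        mutex_.lock();
        t_holdingLeafLock = true;
    }
    void unlock() {
        t_holdingLeafLock = false;
        mutex_.unlock();
    }
private:
    std::mutex mutex_;
};

// An ordinary lock: must never be acquired while a leaf lock is held.
class OrdinaryLock {
public:
    void lock() {
        assert(!t_holdingLeafLock);  // violates the leaf-lock ordering rule
        mutex_.lock();
    }
    void unlock() { mutex_.unlock(); }
private:
    std::mutex mutex_;
};
```

In debug builds the asserts catch ordering violations immediately; in release builds the checks compile away and only the plain mutexes remain.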
@zpostfacto
Contributor

zpostfacto commented Dec 23, 2020

@nxrighthere
MERRY CHRISTMAS!

I've got a change to implement fine-grained locking. I'd be very interested, if you try it out, in how performance behaves, or whether you hit any bugs.

Let me know if you have a chance to look at this!

https://github.com/ValveSoftware/GameNetworkingSockets/tree/fine-grained-locking

tycho pushed a commit to tycho/GameNetworkingSockets that referenced this issue Jan 27, 2021
zpostfacto added a commit that referenced this issue Jan 27, 2021
zpostfacto added a commit that referenced this issue Jan 28, 2021
@zpostfacto
Contributor

Hey @nxrighthere, just checking in to see if you might have any time to re-run your tests with the threading improvements. I am going to be testing these changes using Dota and CSGO, then eventually the Steamworks SDK, and maybe one day Bungie will ship it in Destiny, too. But having your test case, which is very different, would be really great, especially since it hammers the problem very directly and is a big stress test. We never really had issues with lock contention before in CSGO or Dota, but those apps only make networking calls from one (maybe two) threads, so they aren't really a stress test.

Let me know if you think you might end up re-testing this sometime. It'll help me set my expectations.

Thanks again for your contributions!

zpostfacto added a commit that referenced this issue Feb 3, 2021
zpostfacto added a commit that referenced this issue Feb 3, 2021
@nxrighthere
Contributor Author

Sadly, I can't find this test on my machine anymore. But I know people who did something similar before, so I'll ask them if they can run some tests.

zpostfacto added a commit that referenced this issue Feb 3, 2021
This is in response to issue #50.  I have a partner who wants to make sure
that all calls form their "main" thread never stall waiting on the global
lock, even for a few ms, and has documented some times when some stalls
were visible due to contention.

Using tons of locks to cover lots of different objects is really scary.
A few high level strategies were used to make this possible:
- Document what locks protect what data.  Sprinkle asserts liberally
  when accessing that data to ensure that the appropriate lock is held.
- Restrict lock aquisition patterns so that deadlocks are impossible,
  and add logic to detect violations of the expected lock pattern.
- Use "leaf" locks smartyly.  A leaf lock is one that only protects
  a relatively small piece of data, and for which the usage pattern
  is "lock, access data quickly, unlock".  No other locks are allowed
  to be taken while a leaf lock is held, and so we don't have to worry
  about deadlocks.
- Major state changes, object creation and destruction, and other
  operations that are likely to touch multiple objects are still
  protected by the global lock.  Some API calls still take the global
  lock.  The most common API calls we expect to happen per frame and
  need to be fast, however, no longer take the global lock.
- Using scope lock objects, not only to take advantage of RAII unlocking,
  but also to clearly communicate lock ownership semantics between
  caller and calee.

Here are all the locks:
- Global lock.  All background processing, and most "major" operations
  like creating and destroying objects, is done while holding this lock.
- Connection and poll group locks.  You must hold the appropriate lock
  while queuing messages or pulling them off a queue.
- Handle table lock.  The important API calls that need to be optimized
  begin by looking up a handle in a table to get an object.  We have
  a lock to protect these tables.  (We use the same lock for both the
  connection and poll group tables.)  The access pattern for this lock
  is the most unusual -- see the code for details.
- Thinker lock.  Think times, and the priority queue of thinkers, cannot
  be modified without holding this lock.  This lock also has a relatively
  unusual access pattern.
- A single global recv message queue lock.  Received messages waiting to
  be fetched by the API are complicated because they may exist in multiple
  queues at the same time.  Instead of trying to have a lock for each queue
  and dealing with safely acquiring multiple locks when necessary, we just
  use a single "leaf" lock to protect *all* queues.  This is OK because
  the lock is only held very briefly.
- Misc other "leaf" locks.
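The fast-path lookup that the handle table lock enables might look roughly
like this (hypothetical names, not the actual API: the point is that the
table lock is held only long enough to find the object and take its own
lock, so the call never touches the global lock):

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct Connection {
	std::mutex lock;          // per-connection lock
	int pendingMessages = 0;  // stand-in for the real message queue state
};

static std::mutex s_tableLock;  // protects the handle table only
static std::unordered_map<uint32_t, Connection*> s_connections;

// Example per-frame API call: look up the handle, then work under the
// per-connection lock.  The global lock is never taken.
int GetPendingMessageCount( uint32_t hConn )
{
	Connection *pConn = nullptr;
	{
		// Brief critical section: just the table lookup.
		std::lock_guard<std::mutex> tbl( s_tableLock );
		auto it = s_connections.find( hConn );
		if ( it == s_connections.end() )
			return -1; // invalid handle
		pConn = it->second;
		// Acquire the object's own lock before dropping the table
		// lock, so the object cannot be destroyed out from under us.
		pConn->lock.lock();
	}
	int n = pConn->pendingMessages;
	pConn->lock.unlock();
	return n;
}
```

The hand-off (take the object lock while still holding the table lock, then
release the table lock) is what keeps the table lock's hold time tiny even
when the per-object work is slow.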

If you are only using UDP connections, with each connection having its own
UDP socket, this may seem like overkill.  But the relationship between
connection and UDP socket is most definitely not one-to-one in the case of
SDR, where many connections utilize the same data structures.  One connection
may use one or more sessions to relays, and relay session state may be shared
between multiple connections.  That shared state may use different UDP sockets,
or the sockets themselves may be shared.  Basically, each connection does not
simply own all of its own data in a way that can be easily siloed.

There are still a few FIXMEs tagged in the code that I know need to be fixed,
but I confirmed that basic functionality works in the simple UDP case, as well
as the much more complicated SDR case.  So this seems like a good time to check
it into a branch and get some broader testing on it.
zpostfacto added a commit that referenced this issue Feb 3, 2021
@zpostfacto zpostfacto self-assigned this Feb 3, 2021
@zpostfacto
Copy link
Contributor

zpostfacto commented Feb 3, 2021

Sorry, it looks like every time I rebase and force push the fine-grained locking branch, it adds more notes to this issue. I'll see if there's a better way to do this.

zpostfacto added a commit that referenced this issue Feb 6, 2021
@zpostfacto
Copy link
Contributor

We've been using this on Dota internally for a while with no issues. I think it's stable enough to merge into master.

I'll keep this bug open for a while longer. If somebody has a chance to really stress test the new code, please comment here.

@kisak-valve
Copy link
Member

@fletcherdvalve, are you happy with how things look here? Looks like this issue has fallen into limbo pending feedback.

@zpostfacto
Copy link
Contributor

Let's keep it open. Dota is going to ship with the fine-grained-locking fix soon. If that doesn't show any problems, CSGO will be soon thereafter.

@zpostfacto
Copy link
Contributor

OK, well I'd love it if @nxrighthere was able to run some of his tests again (since he reported this issue). But, we have been doing some work sending a bunch of messages on different threads in the dota server, and this change has made a big difference in perf.

So, it isn't crashing, and it's made Dota server perf quite a bit better. I think that's good enough to call this one closed.

@zpostfacto
Copy link
Contributor

BTW, the latest steam client beta includes the new fine grained locking code

https://steamcommunity.com/groups/SteamClientBeta/announcements/detail/3032590367259099679

(It's not mentioned in the patch notes.)

Labels
enhancement New feature or request