glow runtime: use folly cpu thread pool #4260

Closed
wants to merge 1 commit into from

Conversation

@tracelogfb (Contributor) commented Mar 4, 2020

Summary:
Glow has its own ThreadPool implementation: a pool of threads, each with its own work queue, with new work assigned to the threads round-robin.

This kind of implementation works well only if

  1. all work items are guaranteed to complete in a similar amount of time, or
  2. we implement work stealing.

Otherwise we can end up with load imbalance, where some threads have a heavy workload and a long queue while other threads sit idle.
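
For illustration, here is a minimal, hypothetical sketch of the per-thread-queue, round-robin dispatch described above; it is not Glow's actual ThreadPool code, and the class and member names are made up:

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical round-robin dispatcher: task i always lands on queue
// i % numWorkers, no matter how much work each queue already holds,
// which is exactly the imbalance described above.
class RoundRobinDispatch {
public:
  explicit RoundRobinDispatch(size_t numWorkers) : queues_(numWorkers) {}

  void submit(std::function<void()> task) {
    // The target queue depends only on the submission count, not on load.
    queues_[next_++ % queues_.size()].push_back(std::move(task));
  }

private:
  std::vector<std::deque<std::function<void()>>> queues_; // one queue per worker thread
  size_t next_{0};
};
```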

Now that we have folly, we can just use folly's thread pool, which is implemented with a single shared queue built on high-performance data structures and primitives (an MPMC queue plus LifoSem). Using this implementation reduces load imbalance and tail latency.
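
As a rough illustration (not the actual diff in this PR), this is roughly what submitting work to folly's shared-queue executor looks like; the thread count and task body are placeholders:

```cpp
#include <folly/executors/CPUThreadPoolExecutor.h>

#include <atomic>
#include <cstdio>

int main() {
  // All worker threads pull from one shared queue, so any idle thread can
  // pick up the next task instead of a pre-assigned one.
  folly::CPUThreadPoolExecutor pool(/*numThreads=*/4);

  std::atomic<int> done{0};
  for (int i = 0; i < 16; ++i) {
    pool.add([&done] {
      // Placeholder work; in Glow this would be, e.g., the executor callback
      // that runs after a DeviceManager finishes an inference.
      done.fetch_add(1, std::memory_order_relaxed);
    });
  }

  pool.join(); // wait for all queued work to drain
  std::printf("completed %d tasks\n", done.load());
  return 0;
}
```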

In my testing, average latency and throughput don't really change, but tail latency is much better. The reason is shorter queuing time between a device manager thread finishing an inference and the host manager executor thread handling the result.

Previously, handling of a device manager result was always scheduled onto a thread pre-determined via round robin. If that thread happened to be busy (say, handling some other input), the result handling had to wait even though other threads might be completely idle. After this change, the result is handled by whichever worker thread is free.

You can observe this change in the glow trace.

Reviewed By: yinghai

Differential Revision: D20203927

@facebook-github-bot commented Mar 4, 2020

This pull request was exported from Phabricator. Differential Revision: D20203927

@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from 0f3bda5 to 744ebae Mar 4, 2020
tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 4, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 5, 2020
@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from 744ebae to bb612e0 Mar 5, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 5, 2020
@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from bb612e0 to eb1483e Mar 5, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 5, 2020
@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from eb1483e to 354599a Mar 5, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 5, 2020
@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from 354599a to bd12475 Mar 5, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 5, 2020
@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from bd12475 to 748e6e2 Mar 5, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 5, 2020
@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from 748e6e2 to df6635b Mar 5, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 5, 2020
@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from df6635b to 616c2c3 Mar 5, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 5, 2020
@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from 616c2c3 to 8775840 Mar 5, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 6, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 9, 2020
@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from 47787d9 to 7f4fc24 Mar 9, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 9, 2020
@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from 7f4fc24 to ee11a4c Mar 9, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 10, 2020
@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from ee11a4c to f7c813d Mar 10, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 10, 2020
@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from f7c813d to 0a0704c Mar 10, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 10, 2020
@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from 0a0704c to a6f7ffb Mar 10, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 10, 2020
@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from a6f7ffb to 461b373 Mar 10, 2020

tracelogfb added a commit to tracelogfb/glow that referenced this pull request Mar 11, 2020
@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from 461b373 to 6629fd5 Mar 11, 2020

@tracelogfb tracelogfb force-pushed the tracelogfb:export-D20203927 branch from 6629fd5 to 9095b72 Mar 11, 2020

@@ -108,7 +104,6 @@ workflows:
- DEBUG
- OPENCL
- ASAN
- TSAN

@yinghai (Contributor) commented Mar 11, 2020

:)

@facebook-github-bot commented Mar 11, 2020

This pull request has been merged in 476c7bd.
