[REF] Introduce multiple streams execution in TensorFlow. #61185
Closed
Conversation
This was referenced Jul 6, 2023
buptzyb force-pushed the multistream-release branch from 0afcd2d to b27a1cd on July 7, 2023 14:30.
buptzyb force-pushed the multistream-release branch from b27a1cd to e77676e on July 9, 2023 14:19.
Hi @buptzyb, this PR is in draft; is there any update on it? Thank you!
2 similar comments
copybara-service bot pushed a commit that referenced this pull request on Apr 27, 2024:

Imported from GitHub PR #61632. This PR is part of the overall Multi-Stream feature in TF, proposed in #61185. It allows merging the host_to_device/device_to_host/device_to_device data-copy streams into the compute stream within one stream group. This is useful for reducing the overhead caused by GPU stream synchronization, especially when data transfers are frequent. Another benefit is that, for host_to_device copies, merging streams allows subsequent ops to be scheduled early rather than waiting until the data copy has actually finished. As part of the multi-stream feature, it can help multi-stream reach a much higher throughput. Taking our proto models as an example, the original model inference throughput is **1524** samples/second, **2229** samples/second with multi-stream, and **2471** samples/second with stream-merging added on top. Stream-merging can also be used on its own: we saw inference throughput improve from **1028** samples/second to **1187** samples/second by enabling stream-merging alone. Please refer to the 'Performance' section of our [document](https://docs.google.com/document/d/1yL3lWk_iFKqLTyekkuaiKXZ78I0lPmD5kM1fghHRs4Y/edit?usp=sharing) for detailed and additional experiment results.

Copybara import of the project:
- 9e51f38 by Robin Zhang <robinz@nvidia.com>: Allow merging compute-copy streams
- a45967f by Robin Zhang <robinz@nvidia.com>: Improve coding style
- ccae79b by Robin Zhang <robinz@nvidia.com>: Rename stream_merge_options_
- 332e1fe by Robin Zhang <robinz@nvidia.com>: Put stream checking out of callback
- 4a0c789 by Robin Zhang <robinz@nvidia.com>: Move StreamMergeOptions to Experimental
- efe56d7 by Robin Zhang <robinz@nvidia.com>: add some comments

Merging this change closes #61632
FUTURE_COPYBARA_INTEGRATE_REVIEW=#61632 from buptzyb:multistream-streammerge 5aabb58
PiperOrigin-RevId: 628618396
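As a quick sanity check on the throughput figures quoted in the commit message, the relative speedups can be computed directly. A small illustrative sketch (the samples/second numbers are the ones reported above; nothing else is assumed):

```python
def speedup(before: float, after: float) -> float:
    """Relative throughput gain, e.g. 1.46 means +46%."""
    return after / before

# Proto-model inference throughput (samples/second) from the PR description.
base, multi, merged = 1524, 2229, 2471
print(f"multi-stream:           {speedup(base, multi):.2f}x")   # ~1.46x
print(f"multi-stream + merging: {speedup(base, merged):.2f}x")  # ~1.62x

# Stream-merging enabled on its own, without multi-stream.
print(f"stream-merging alone:   {speedup(1028, 1187):.2f}x")    # ~1.15x
```

So the bulk of the gain in the combined setup comes from multi-stream itself, with stream-merging contributing a further ~11% on top.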
copybara-service bot pushed further commits that referenced this pull request, each carrying the same commit message, on Apr 29, Apr 30 (three times), and May 2 (three times), 2024. The final May 2 commit additionally notes "Reverts changelist 525613555".
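Judging from the commit titles above (in particular "Rename stream_merge_options_" and "Move StreamMergeOptions to Experimental"), the merge switches appear to live under `GPUOptions.Experimental` in TensorFlow's `ConfigProto`. A hypothetical text-proto fragment sketching how such a configuration might look; the field names here are assumptions inferred from the commit log, not verified against the actual schema:

```
gpu_options {
  experimental {
    # Hypothetical field names inferred from the commit titles; check
    # tensorflow/core/protobuf/config.proto in your TF build for the
    # actual StreamMergeOptions schema before relying on them.
    stream_merge_options {
      merge_host_to_device_stream: true
      merge_device_to_host_stream: true
      merge_device_to_device_stream: true
    }
  }
}
```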
Multiple Stream TensorFlow is developed on top of the official TensorFlow. It leverages features of modern GPUs to accelerate deep learning training and inference. This Multi-Stream implementation has already helped several customers migrate their RecSys TF models to the GPU and take them into production.
For more details, please see README_MultiStream.md.
This PR is used as a reference and will not be merged into master. @changhuilin