[REF] Introduce multiple streams execution in TensorFlow. #61185

Closed
wants to merge 14 commits into from

@buptzyb buptzyb commented Jul 6, 2023

Multi-Stream TensorFlow is built on top of the official TensorFlow. It leverages the features of modern GPUs to accelerate deep learning training and inference. This multi-stream implementation has helped several customers migrate their RecSys TF models to the GPU and bring them online.

For more details, see README_MultiStream.md.

This PR serves as a reference and will not be merged into master. @changhuilin

@gbaned gbaned requested a review from d0k July 13, 2023 03:34

gbaned commented Dec 15, 2023

Hi @buptzyb, this PR is still in draft. Is there any update on this? Thank you!

2 similar comments

gbaned commented Dec 29, 2023

Hi @buptzyb, this PR is still in draft. Is there any update on this? Thank you!


gbaned commented Jan 19, 2024

Hi @buptzyb, this PR is still in draft. Is there any update on this? Thank you!

@buptzyb buptzyb closed this Jan 19, 2024
PR Queue automation moved this from Assigned Reviewer to Closed/Rejected Jan 19, 2024
copybara-service bot pushed a commit that referenced this pull request Apr 27, 2024
Imported from GitHub PR #61632

This PR is one part of the overall Multi-Stream feature for TF, which is proposed in #61185.

Allow merging the host_to_device, device_to_host, and device_to_device data copy streams into the compute stream within one stream group. This reduces the overhead caused by GPU stream synchronization, especially when data transfers are frequent. Another benefit: for host_to_device copies, merging streams allows subsequent ops to be scheduled early, without waiting until the data copy has actually finished.
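
As a rough illustration of why merging can reduce synchronization, here is a toy model in pure Python. It is not TensorFlow or CUDA code, and every name in it (`Op`, `assign_stream`, `count_cross_stream_waits`) is hypothetical. It assumes that ops issued on the same stream are ordered implicitly, so only dependencies that cross streams need an explicit wait, and it simply counts those waits with and without merging.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical op record used only for this illustration."""
    name: str
    kind: str          # "h2d", "d2h", "d2d", or "compute"
    deps: tuple = ()   # names of ops that must finish first


def assign_stream(op, merged):
    """Return the stream an op is issued on.

    With merging enabled, every copy is issued on the compute stream;
    otherwise each copy kind gets its own stream.
    """
    if merged or op.kind == "compute":
        return "compute"
    return op.kind


def count_cross_stream_waits(ops, merged):
    """Count dependencies whose producer and consumer are on different streams.

    Same-stream dependencies are satisfied by issue order in this model,
    so only cross-stream dependencies are counted as explicit waits.
    """
    stream_of = {op.name: assign_stream(op, merged) for op in ops}
    return sum(
        1
        for op in ops
        for dep in op.deps
        if stream_of[dep] != stream_of[op.name]
    )


# A small inference step: copy input in, run two kernels, copy result out.
step = [
    Op("copy_in", "h2d"),
    Op("embed", "compute", deps=("copy_in",)),
    Op("dense", "compute", deps=("embed",)),
    Op("copy_out", "d2h", deps=("dense",)),
]

separate_waits = count_cross_stream_waits(step, merged=False)
merged_waits = count_cross_stream_waits(step, merged=True)
print(separate_waits, merged_waits)  # prints "2 0"
```

In this model the unmerged layout needs one wait after the input copy and one before the output copy, while the merged layout needs none. The model says nothing about lost copy/compute overlap, which is the trade-off a real measurement has to settle.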

As part of the multi-stream feature, stream merging helps multi-stream reach a much higher throughput. Taking our proto models as an example, inference throughput is **1524** samples/second for the original model, **2229** samples/second with multi-stream, and **2471** samples/second with stream merging added on top.

However, stream merging can also be used on its own. Enabling stream merging alone raised inference throughput from **1028** samples/second to **1187** samples/second.
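
To put the reported numbers side by side, the short sketch below computes relative gains from the throughput figures quoted in this description. The helper name `gain_percent` and the dictionary labels are ours, introduced only for illustration; the inputs are the only data taken from the PR.

```python
def gain_percent(before, after):
    """Relative throughput gain of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Throughput figures (samples/second) quoted in the PR description.
original = 1524
multi_stream = 2229
multi_stream_merged = 2471
standalone_before = 1028
standalone_after = 1187

gains = {
    "multi-stream vs. original": gain_percent(original, multi_stream),
    "multi-stream + merging vs. original": gain_percent(original, multi_stream_merged),
    "merging on top of multi-stream": gain_percent(multi_stream, multi_stream_merged),
    "merging alone": gain_percent(standalone_before, standalone_after),
}

for label, gain in gains.items():
    print(f"{label}: {gain:.1f}%")
```

By this arithmetic the gains are roughly 46%, 62%, 11%, and 15% respectively.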

Please refer to the 'Performance' part of our [document](https://docs.google.com/document/d/1yL3lWk_iFKqLTyekkuaiKXZ78I0lPmD5kM1fghHRs4Y/edit?usp=sharing) for more detailed experiment results.

Copybara import of the project:

--
9e51f38 by Robin Zhang <robinz@nvidia.com>:

Allow merging compute-copy streams

--
a45967f by Robin Zhang <robinz@nvidia.com>:

Improve coding style

--
ccae79b by Robin Zhang <robinz@nvidia.com>:

Rename stream_merge_options_

--
332e1fe by Robin Zhang <robinz@nvidia.com>:

Put stream checking out of callback

--
4a0c789 by Robin Zhang <robinz@nvidia.com>:

Move StreamMergeOptions to Experimental

--
efe56d7 by Robin Zhang <robinz@nvidia.com>:

add some comments

Merging this change closes #61632

FUTURE_COPYBARA_INTEGRATE_REVIEW=#61632 from buptzyb:multistream-streammerge 5aabb58
PiperOrigin-RevId: 628618396
copybara-service bot pushed a commit that referenced this pull request Apr 29, 2024
copybara-service bot pushed a commit that referenced this pull request Apr 30, 2024
copybara-service bot pushed a commit that referenced this pull request Apr 30, 2024
copybara-service bot pushed a commit that referenced this pull request Apr 30, 2024
copybara-service bot pushed a commit that referenced this pull request Apr 30, 2024
copybara-service bot pushed a commit that referenced this pull request May 2, 2024
copybara-service bot pushed a commit that referenced this pull request May 2, 2024
copybara-service bot pushed a commit that referenced this pull request May 2, 2024
Reverts changelist 525613555

FUTURE_COPYBARA_INTEGRATE_REVIEW=#61632 from buptzyb:multistream-streammerge 5aabb58
PiperOrigin-RevId: 628618396
Labels
size:XL CL Change Size:Extra Large
Projects
PR Queue
  
Closed/Rejected

2 participants