When running inference with Llama-7B on an A100 40GB GPU:

FP16 datatype - a single GPU performs better than TP=2.

FP32 datatype - TP=2 performs better than a single GPU, however the overall throughput is reduced.

1. Is there a reason why, with the FP16 datatype, the single GPU outperforms TP=2?
2. Are there other methods to improve inference throughput?
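For context, here is a minimal sketch of how the two FP16 configurations might be launched. It assumes vLLM's offline `LLM` API and the Hugging Face Llama-2-7B checkpoint; the original post does not name the serving library or model ID, so treat those as assumptions:

```python
# Hypothetical benchmark sketch (assumes vLLM and meta-llama/Llama-2-7b-hf).
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"] * 64        # small synthetic batch
params = SamplingParams(max_tokens=128)

# Single-GPU FP16 run (the faster case reported above).
llm_fp16_single = LLM(
    model="meta-llama/Llama-2-7b-hf",
    dtype="float16",
    tensor_parallel_size=1,
)
outputs = llm_fp16_single.generate(prompts, params)

# TP=2 FP16 run (the slower case reported above) would only change
# tensor_parallel_size; run it in a separate process so both engines
# do not compete for GPU memory.
# llm_fp16_tp2 = LLM(
#     model="meta-llama/Llama-2-7b-hf",
#     dtype="float16",
#     tensor_parallel_size=2,
# )
```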