This repository was archived by the owner on Dec 9, 2024. It is now read-only.

Benchmark performance drops significantly when using map_and_batch #137

@eladweiss

After pulling the latest version of the benchmarks, we noticed a performance drop on the inception3 and resnet152 models. We tested with TensorFlow r1.5 on 32 P100 GPUs (8 servers), ImageNet data, and a batch size of 64.

Inception3:

  • grpc: 3350 ==> 3000
  • grpc + verbs: 3800 ==> 3150

Resnet152:

  • grpc: 2050 ==> 2000
  • grpc + verbs: 2450 ==> 2250
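
To put the numbers above in proportion, here is a small sketch that computes the relative drop for each configuration. The figures are copied from the lists above; the issue does not state their units, so the code treats them as plain throughput values. The helper name `relative_drop` is just an illustration.

```python
# Before/after figures as reported in this issue (units not stated there).
RESULTS = {
    ("inception3", "grpc"): (3350, 3000),
    ("inception3", "grpc + verbs"): (3800, 3150),
    ("resnet152", "grpc"): (2050, 2000),
    ("resnet152", "grpc + verbs"): (2450, 2250),
}


def relative_drop(before, after):
    """Return the drop from `before` to `after` as a fraction of `before`."""
    return (before - after) / before


for (model, transport), (before, after) in RESULTS.items():
    drop = relative_drop(before, after)
    print(f"{model:<11} {transport:<13} {before} -> {after}  ({drop:.1%} lower)")
```

By this measure the drop is largest for inception3 with grpc + verbs (about 17%) and smallest for resnet152 with grpc (about 2%).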

We isolated the 'problematic' change to: 82dd053#diff-3269d1838b2ebc9c6c071802fb946ca1R521

After replacing that specific call to map_and_batch() with the previous call to map() with 16 parallel calls (https://github.com/Mellanox/benchmarks/commit/56e0b2298f835905f7d8a53c5bf482ed1dce55fd), the numbers return to their earlier, higher levels. We do not have a theory to explain this.

Thanks
