@jjotero jjotero commented Jan 15, 2021

New dgemm and pointer_chase regression tests. Like all the other microbenchmark tests, these run with both CUDA and HIP.

Pointer chase checks:

  • The latency of calling the clock function.
  • Average latency on a single device, using several stride values.
  • Average P2P latency, with the stride fixed to 1.
  • L1 latency, L1 miss rate, and miss latency on a single device.
  • L1 latency, L1 miss rate, and miss latency when executing the pointer chase through P2P.

jjotero commented Feb 1, 2021

Hmm, I'm not sure there is a need for more node jumps. None of these ReFrame tests use the random init, and despite having only 64 nodes, the timings are pretty consistent.

There are two types of tests in here. The first type measures the overall traversal time, and the second measures each jump independently. For the second type, the ReFrame tests work out the L1 hits and misses, and compute the average timing only for these two categories. As can be seen in the figure below (data from a P100), the timings for an L1 hit are pretty consistent at 158 cycles. If we were to capture the latency of L2 (which sits somewhere below 400 cycles), then I'd agree that we might need more nodes. But in that case, extracting the data that belongs only to an L2 hit is a challenge on its own.

Though, as you say, if we make it circular we could easily tune the list to fit in any memory level. Let me experiment a bit.

[Figure: P100_latency]

jjotero commented Feb 1, 2021

The linked lists are now circular and the number of node jumps is fixed to 256. The number of nodes is now an input parameter from the user, which means that we can do something like a list with 2 nodes (resident in L1) and jump around 256 times to collect all the data we need.

At the moment, there is some inconsistent behaviour with the clock latency on the P100s. This needs further testing on the other cards present in Ault.

sekelle commented Feb 12, 2021

The timings don't seem converged. As I mentioned previously, setting the number of hops to a fixed number is not a viable approach, and 256 jumps are nowhere near enough to get converged results.
I took the code and made a small modification: I added num_jumps as a command-line parameter, changed the traversal to a while loop as described above, and increased num_jumps for each list size until the time per jump converged.

Here are the results on V100:
V100.pdf

and P100:
P100.pdf

List linkage was random, with no additional node-padding.

On the V100, the L1 latency now is 28 cycles which agrees with other reports that I found online.
The reason why 256 jumps are not nearly enough is in fact not timing overhead. Like CPUs, today's GPUs have multiple (opaque, hardware-controlled) levels of cache, and GPUs also use virtual memory, which means that page-table lookups happen as well. Despite a warm-up pass through the list, where exactly an address resides and whether or not there's a page fault is beyond your control for individual accesses. You can only hope to iron out these effects by averaging over multiple repeat passes. The faster the device and the faster the cache, the more pronounced these effects become.
In this context, I would say that trying to time individual jumps is an interesting experiment, but in practice it just yields random linear combinations of the different cache/DRAM values. Again, not because the timer is not accurate enough, but because of the background effects mentioned above, which you cannot control.

[Figures: V100 and P100 latency plots]

@vkarak vkarak added this to the ReFrame 3.5.0 milestone Feb 18, 2021
@jjotero jjotero left a comment

I've now stripped all the code related to measuring latencies of single node jumps and done the cleanup.

The plot below shows a parametric sweep over all the cards in Ault, for a large range of list sizes and different strides, all using 400k node jumps for the timing. The nodes are placed in sequential order in the buffer, so the --stride parameter here controls the filling of the cache lines.

[Figure: latency sweep plot]
A100: solid; V100: dashed; P100: dash-dotted; Vega20: dotted

Now I just need to add the actual tests. For now, I'll fix the stride to 32 in all of them and check the L1, L2, and DRAM latencies.

jjotero commented Mar 2, 2021

The tests are in now.

@vkarak vkarak left a comment


I didn't go through the C++ code, but I've gone through your discussions and the results, and I think we are good to go. Great work from both of you! I have only some minor ReFrame test comments.

@jjotero jjotero left a comment


PR comments addressed.

@vkarak vkarak merged commit 3a39a39 into reframe-hpc:master Mar 5, 2021
@jjotero jjotero deleted the test/ault-dev branch March 10, 2021 10:28