@jjotero jjotero commented Jan 15, 2021

New dgemm and pointer_chase regression tests. Like all the other microbenchmark tests, these run with both CUDA and HIP.

Pointer chase checks:

  • The latency of calling the clock function.
  • Average latency on a single device, using several stride values.
  • Average P2P latency, with the stride fixed to 1.
  • L1 latency, L1 miss rate, and miss latency on a single device.
  • L1 latency, L1 miss rate, and miss latency when executing the pointer chase through P2P.

jjotero commented Feb 1, 2021

Hmm, I'm not sure there is a need for more node jumps. None of these ReFrame tests use the random init, and despite having only 64 nodes, the timings are pretty consistent.

There are two types of tests in here. The first type measures the overall traversal time, and the second measures each jump independently. For the second type, the ReFrame tests work out the L1 hits and misses, and compute the average timing only for these two categories. As can be seen in the figure below (data from a P100), the timings for an L1 hit are pretty consistent at 158 cycles. If we were to capture the latency of L2 (which sits somewhere below 400 cycles), then I'd agree that we might need more nodes. But in that case, extracting the data that belongs only to an L2 hit is a challenge on its own.

Though, as you say, if we make it circular we could easily tune the list to fit in any memory level. Let me experiment a bit.

[Figure: P100_latency]

jjotero commented Feb 1, 2021

The linked lists are now circular and the number of node jumps is fixed to 256. The number of nodes is now an input parameter from the user, which means that we can do something like a list with 2 nodes (resident in L1) and jump around 256 times to collect all the data we need.

At the moment, there is some inconsistent behaviour with the clock latency on the P100s. This needs further testing on the other cards present in Ault.

sekelle commented Feb 12, 2021

The timings don't seem converged. As I mentioned previously, setting the number of hops to a fixed number is not a viable approach, and 256 jumps are nowhere near enough to get converged results.
I took the code and made a small modification: I added num_jumps as a command-line parameter, changed the traversal to a while loop as described above, and increased num_jumps for each list size until the time per jump converged.

Here are the results on V100:
V100.pdf

and P100:
P100.pdf

List linkage was random, with no additional node-padding.

On the V100, the L1 latency now is 28 cycles which agrees with other reports that I found online.
The reason why 256 jumps are not nearly enough is in fact not timing overhead. Like CPUs, today's GPUs have multiple (opaque, hardware-controlled) levels of cache, and GPUs also use virtual memory, which means that page-table lookups happen as well. Despite a warm-up pass through the list, where exactly an address resides and whether or not there's a page fault is beyond your control for individual accesses. You can only hope to iron out these effects by averaging over multiple repeat passes. The faster the device and the faster the cache, the more pronounced these effects become.
In this context, I would say that trying to time individual jumps is an interesting experiment, but in practice it just yields random linear combinations of the different cache/DRAM values. Again, not because the timer is not accurate enough, but because of the background effects mentioned above, which you cannot control.

[Figures: V100 and P100 latency plots]

@vkarak vkarak added this to the ReFrame 3.5.0 milestone Feb 18, 2021
@jjotero jjotero left a comment

I've now stripped all the code related to measuring latencies of single node jumps and done the cleanup.

The plot below shows a parametric sweep over all the cards in Ault, for a large range of list sizes and different strides, all using 400k node jumps for the timing. The nodes are placed in sequential order in the buffer, so the --stride parameter here controls the filling of the cache lines.

[Figure: latency sweep plot]
A100: solid; V100: dashed; P100: dash-dotted; Vega20: dotted

Now I just need to add the actual tests. For now, I'll fix the stride to 32 in all of them and check the L1, L2, and DRAM latencies.

jjotero commented Mar 2, 2021

The tests are in now.

@vkarak vkarak left a comment


I didn't go through the C++ code, but I've gone through your discussions and the results, and I think we are good to go. Great work from both of you! I have only some minor ReFrame test comments.

@jjotero jjotero left a comment


PR comments addressed.

@vkarak vkarak merged commit 3a39a39 into reframe-hpc:master Mar 5, 2021
@jjotero jjotero deleted the test/ault-dev branch March 10, 2021 10:28