
Stack monoid take 3 #66

Open · raphlinus opened this issue Jan 11, 2022 · 18 comments

@raphlinus (Owner)

I have figured out a seriously faster version of the stack monoid algorithm (see stack monoid revisited) and want to write that up.

The implementation exists in a PR on the piet-gpu repo, where it will become part of the GPU test suite.

This will become a blog post at some point at the least, but I'm also preparing a submission to the HPDC conference. Attached is a very early draft of that paper, for review while I iterate on it. It's also possible I'll do a PR adding the paper to the assets section here.

stack.pdf

@rschreyer commented Jan 24, 2022

I’m just getting started digging into the algorithms themselves, but two initial questions:

Section 3

Note that Bic(s[i..j]).b is monotonically increasing as i increases

What about this example? Am I calculating the semigroup incorrectly?

s = "((("
Bic(s[0..3]) = (0, 3)
Bic(s[1..3]) = (0, 2)
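
For concreteness, a minimal sequential sketch of Bic as defined in the draft (a counts unmatched closing parens, b unmatched opening parens, and combining cancels min(b1, a2) matched pairs at the seam); it reproduces the computation above:

```rust
// Bidirectional counts: a = unmatched ')', b = unmatched '('.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bic {
    a: u32,
    b: u32,
}

impl Bic {
    fn from_byte(c: u8) -> Bic {
        match c {
            b')' => Bic { a: 1, b: 0 },
            b'(' => Bic { a: 0, b: 1 },
            _ => Bic { a: 0, b: 0 },
        }
    }

    // Combining cancels m matched pairs at the seam.
    fn combine(self, other: Bic) -> Bic {
        let m = self.b.min(other.a);
        Bic {
            a: self.a + other.a - m,
            b: self.b + other.b - m,
        }
    }
}

fn bic(s: &[u8]) -> Bic {
    s.iter()
        .map(|&c| Bic::from_byte(c))
        .fold(Bic { a: 0, b: 0 }, Bic::combine)
}

fn main() {
    // Reproduces the computation above.
    assert_eq!(bic(b"((("), Bic { a: 0, b: 3 }); // Bic(s[0..3])
    assert_eq!(bic(&b"((("[1..]), Bic { a: 0, b: 2 }); // Bic(s[1..3])
}
```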

In Section 5, what is the meaning of the Stk(enum(s))[…]) syntax, especially the enum token? And is the reverse scan an inclusive or an exclusive scan?

@raphlinus (Owner)

I got this wrong: it should be monotonically increasing as i decreases. This is fixed in my current draft. I'll upload a new one today, which will address these questions and more. Glad you're interested!

@raphlinus (Owner)

What's attached is just a snapshot of the current state, but it doesn't reflect what I want it to be. I'll take a solid pass tomorrow and hopefully fill in the major gaps.
stack.pdf

@rschreyer

I'm trying to understand the output of Section 5.1 Simple Algorithm / Stack Slices

As described, I'd expect the reverse scan for the string )(()( to produce:
rev_scan(")(()(") --> (1,2), (0,2), (0,1), (1,1), (0,1) (I'm assuming this is an inclusive scan)
rev_scan(")(()(").compact(.a==0) --> (0,2),(0,1),(0,1)
So the partition produces 3 elements. However, one of those elements is already matched with a closing parenthesis within the partition, and so shouldn't be visible as an element of interest to subsequent partitions? (And two of them have .b == 1, so they'd write to the same location in the produced stack.)

@rschreyer

Section 5.2, first bullet, says "It consists of a forward Hillis-Steele scan...". Is it a forward scan or a reverse scan? I think you'd need a reverse scan to make this work, and the implementation also appears to be a reverse scan?

@raphlinus (Owner)

Thanks for these questions, they're very useful in making me describe things more clearly. In 5.2, yes, this is a reverse scan.

The description of 5.1 is too informal, and is also missing the fact that only open parentheses are written (this is why your interpretation is that 3 elements are written when it should only be 2).

Index i is written to location Bic(s).b - Bic(s[i..n]).b if s[i] = '(' and Bic(s[i+1..n]).a = 0. So for ")(()(", Bic(s).b is 2 and the sequence Bic(s[i+1..n]).a is (0, 0, 1, 0, 0). The predicate hits at indices 1 and 4, so index 1 is written at location 0 and index 4 at location 1. I think giving an example here will help.
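
For illustration, a sequential sketch of that write rule; the helper names are mine, and the brute-force recomputation of the suffix Bic values stands in for what the actual implementation gets from a scan:

```rust
// Sequential sketch of the stack-slice write rule: index i is written
// to slot Bic(s).b - Bic(s[i..n]).b when s[i] == '(' and
// Bic(s[i+1..n]).a == 0.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bic {
    a: u32, // unmatched ')'
    b: u32, // unmatched '('
}

fn bic(s: &[u8]) -> Bic {
    s.iter().fold(Bic { a: 0, b: 0 }, |acc, &c| {
        let elem = match c {
            b')' => Bic { a: 1, b: 0 },
            b'(' => Bic { a: 0, b: 1 },
            _ => Bic { a: 0, b: 0 },
        };
        let m = acc.b.min(elem.a);
        Bic {
            a: acc.a + elem.a - m,
            b: acc.b + elem.b - m,
        }
    })
}

fn stack_slice(s: &[u8]) -> Vec<usize> {
    let total = bic(s).b as usize;
    let mut out = vec![usize::MAX; total]; // usize::MAX marks "unwritten"
    for i in 0..s.len() {
        // The predicate: an open paren that stays unmatched to the end.
        if s[i] == b'(' && bic(&s[i + 1..]).a == 0 {
            // Its depth on the final stack determines the slot.
            out[total - bic(&s[i..]).b as usize] = i;
        }
    }
    out
}

fn main() {
    // Reproduces the worked example: 1 written at slot 0, 4 at slot 1.
    assert_eq!(stack_slice(b")(()("), vec![1, 4]);
}
```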

@raphlinus (Owner)

Attached is my latest draft. I haven't done a real polish pass yet, but there's a new introduction (following the suggested format for the HPDC conference) and one graphic, so I think it says mostly what I want it to say now. I'll do some measurements and see if I can get another graphic in there. I'm very open to comments and suggestions: What's not clear? What could be better illustrated?

stack.pdf

@raphlinus (Owner)

Still needs work, but here it is in the recommended template form, and incorporating some of the feedback I've gotten.

stack.pdf

@raphlinus (Owner)

I'm actually pretty happy with this version, though of course there's always more that can be done.

stack.pdf

@rschreyer commented Jan 28, 2022

'rev(Stk(enum(s)[i0..i2])[k]'

That's missing a closing parenthesis.

It's taken me quite a while to understand Equation 1 and the following paragraph. I think some additional exposition would have helped me (assuming I've gotten it right):
In cases where Bic(s[i1..i2]).b is non-zero, the match for index i2 will be found within the slice s[i1..i2]. If .b is zero, then the match will need to index into a stack monoid covering the slice s[i0..i1]. Maybe this is a good place for a forward reference justifying the division of the sequence into i0..i1 and i1..i2: one of these is solved within the workgroup, and the other requires reading the stack generated by other workgroups?
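
For concreteness, a brute-force sequential sketch of those two cases, taking i0 = 0 (the helper names are mine, not from the paper), checked against a direct stack simulation:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bic {
    a: u32, // unmatched ')'
    b: u32, // unmatched '('
}

fn bic(s: &[u8]) -> Bic {
    s.iter().fold(Bic { a: 0, b: 0 }, |acc, &c| {
        let elem = match c {
            b')' => Bic { a: 1, b: 0 },
            b'(' => Bic { a: 0, b: 1 },
            _ => Bic { a: 0, b: 0 },
        };
        let m = acc.b.min(elem.a);
        Bic {
            a: acc.a + elem.a - m,
            b: acc.b + elem.b - m,
        }
    })
}

/// Indices of the unmatched open parens in s, bottom of stack first.
fn stack_of(s: &[u8]) -> Vec<usize> {
    let mut stack = Vec::new();
    for (j, &c) in s.iter().enumerate() {
        match c {
            b'(' => stack.push(j),
            b')' => {
                stack.pop();
            }
            _ => {}
        }
    }
    stack
}

/// The "match" of position i2 (top of stack just before i2), resolved
/// via the split at i1: locally when Bic(s[i1..i2]).b > 0, otherwise
/// by indexing into the stack for the prefix s[..i1].
fn match_of(s: &[u8], i1: usize, i2: usize) -> Option<usize> {
    let local = bic(&s[i1..i2]);
    if local.b > 0 {
        // Case 1: an open paren in s[i1..i2] is still unmatched at i2;
        // the match is the last such index, found within the workgroup.
        (i1..i2)
            .rev()
            .find(|&j| s[j] == b'(' && bic(&s[j + 1..i2]).a == 0)
    } else {
        // Case 2: the local slice pops local.a entries off the prefix
        // stack, so the match sits local.a entries below its top.
        let prefix = stack_of(&s[..i1]);
        prefix
            .len()
            .checked_sub(local.a as usize + 1)
            .map(|k| prefix[k])
    }
}

fn main() {
    // Check against a direct stack simulation for every split point.
    let s: &[u8] = b"(()(()";
    for i2 in 0..=s.len() {
        let expected = stack_of(&s[..i2]).last().copied();
        for i1 in 0..=i2 {
            assert_eq!(match_of(s, i1, i2), expected);
        }
    }
}
```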

Section 7.2: This section does feel like it is missing context, for example the introduction of segments and the fact that they can be empty vs. non-empty. Would these be tokens other than ( and )? Or is an empty segment an entire prior partition that doesn't have any unclosed open parentheses?

@raphlinus (Owner)

Thanks! Since posting that draft I've actually added an explanation in English; I hope that will explain it better. I'll have another look, fix the missing parenthesis, and have another go at 7.2 (I know this is still a bit in shorthand). New draft up soon.

@raphlinus (Owner)

I rewrote the section, expanding out a bit what was previously in shorthand. Hopefully there's enough detail there now to figure out what it's actually saying. ("Empty" refers strictly to unmatched open parentheses, and I believe I've made that more precise and explicit now.)

I don't really expect people to get 7.2, it's pretty detailed and weedsy. But it should be a description of what I've done, and arguably important as it's what supports the performance result and the ability to scale up pretty well even with two dispatches.

In any case, I very much appreciate your review and comments. I've been getting some on the Zulip as well, and between the two it makes me feel a lot less like I'm shouting into the void. I think I'll submit this version, and then I do have a small window to revise that further if I feel the need.

stack.pdf

@rschreyer

Sorry for the delay, I didn't see a notification of a new post.

In cases where Bic([s1..i2])
That should probably be i1. Also, the long sentence that follows is confusing; it should probably be split into two separate statements, with the latter being explicit about using the stack monoid for [i0..i1] if the match can't be found within [i1..i2].

Materialize the stack for the prefix of the input up to the current partition
This immediately makes me think you are materializing the entire stack. Perhaps rephrase to 'Materialize the w-suffix of the stack at the beginning of the current partition'

I think the new Section 7.2 works quite well. Send it!

@raphlinus (Owner)

An update. The paper submission above was rejected. I am trying again. The new forum allows publication of the work but asks that the forum not be identified. I am furiously working to make the May 2 deadline. I've done some writing and editing, but the measurements (both of the existing stack monoid and the new bbox oriented work) haven't been done yet.

So at this point I'd love to have feedback on the writing, understanding that the obvious TODO sections will be filled in later.

bbox-04-30.pdf

@raphlinus (Owner)

Attached is a new draft, this time with performance measurements and analysis. I don't think I will have time to create a new visual as originally planned, so from here I will basically take a pass to clean up typos and obvious TODO items.

bbox-05-02.pdf

@raphlinus (Owner)

This was not accepted by the conference. I now plan to post to arXiv and write an associated blog post. The new version, with conference information stripped off, is attached. Comments are welcome; it will probably be a few days before I get everything lined up here. I'm also hoping to rerun measurements on Nvidia 3080 hardware, but that will also take a few days.

bbox-05-23.pdf

@saona-raimundo

Hi there!

I hope you do not mind some naive comments. I studied mathematics and work in a computer science group. I have no experience in theoretical parallel algorithms but am generally interested in practical implementations. I hope the comments are relevant.

Theoretical algorithms and their implementation

In Section 1.2, page 2, first bullet point, you say

Theoretical presentations of work-efficient algorithms analyzed in terms of an abstract Parallel Random Access Machine (PRAM) model but no clear mapping to an efficient GPU implementation ([Bar-On and Vishkin 1985], [Levcopoulos and Petersson 1992], [Prasad et al. 1994]).

My question is why this mapping is not clear:

  • Is it just the usual "this is a theoretical algorithm and there was no implementation to be found"? or
  • Is there a more substantial reason for the mapping to be hard?

PRAM model

In Section 1.3, page 3, third paragraph, you say

The second insight is that interleaving these two approaches yields a work-efficient algorithm. Further, the two approaches map well to the hierarchical structure of actual GPU hardware, in which each GPU workgroup computes one partition of the input problem.

My question is: would this not mean that a hierarchical memory model fits current hardware better?

  • Maybe the PRAM model is never going to be implemented in real hardware and a different model fits better? or
  • Would it not be a good improvement (to this paper at least) to formalize such a hierarchical model for GPUs?

@raphlinus (Owner)

First, the paper is now on arXiv and that will be the definitive place to find it. I am tentatively planning a v2 with more measurement and will probably delay publicizing it widely until then, but feel free to share.

Second, absolutely I don't mind comments. Think of this as the Q&A at the hypothetical conference it would have gotten accepted to :)

Personally I believe that there is a huge and deep area of study in adapting "classical" parallel algorithms to more realistic hardware. There might be differences of opinion about whether GPU compute shaders represent a narrow niche (not as efficient as TPU-like accelerators, not as agile as CPUs) or whether their hierarchies are a more principled way to organize large numbers of threads, likely to persist in similar fashion well into the future. I'm honestly not sure to what extent making an abstract compute performance model is useful; there's a possibility it's too complicated.

Making such a model is a fairly deep study, and I think wildly too ambitious for this particular paper. Essentially if you were proposing such a thing, you'd need to justify that its predictions correlated with real-world performance. I think a good step in this direction is Vasily Volkov's thesis work.

I'm not sure where this kind of work is being done, and where is a good place to publish it.
