
Stack monoid take 3 #66

Open · raphlinus opened this issue Jan 11, 2022 · 18 comments

@raphlinus (Owner)

I have figured out a seriously faster version of the stack monoid algorithm (see stack monoid revisited) and want to write that up.

The implementation exists in a PR on the piet-gpu repo, where it will become part of the GPU test suite.

This will become a blog post at some point at the least, but I'm also preparing a submission to the HPDC conference. Attached is a very early draft of that paper, for review while I iterate on it. It's also possible I'll do a PR adding the paper to the assets section here.

stack.pdf

@rschreyer commented Jan 24, 2022

I’m just getting started digging into the algorithms themselves, but two initial questions:

Section 3

Note that Bic(s[i..j]).b is monotonically increasing as i increases

What about this example? Am I calculating the semigroup incorrectly?

s = "((("
Bic(s[0..3]) = (0, 3)
Bic(s[1..3]) = (0, 2)
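
For concreteness, a minimal sequential sketch of Bic as defined in the draft (a counts unmatched closing parens, b unmatched opening parens, and combining cancels min(b1, a2) matched pairs at the seam); it reproduces the computation above:

```rust
// Bidirectional counts: a = unmatched ')', b = unmatched '('.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bic {
    a: u32,
    b: u32,
}

impl Bic {
    fn from_byte(c: u8) -> Bic {
        match c {
            b')' => Bic { a: 1, b: 0 },
            b'(' => Bic { a: 0, b: 1 },
            _ => Bic { a: 0, b: 0 },
        }
    }

    // Combining cancels m matched pairs at the seam.
    fn combine(self, other: Bic) -> Bic {
        let m = self.b.min(other.a);
        Bic {
            a: self.a + other.a - m,
            b: self.b + other.b - m,
        }
    }
}

fn bic(s: &[u8]) -> Bic {
    s.iter()
        .map(|&c| Bic::from_byte(c))
        .fold(Bic { a: 0, b: 0 }, Bic::combine)
}

fn main() {
    // Reproduces the computation above.
    assert_eq!(bic(b"((("), Bic { a: 0, b: 3 }); // Bic(s[0..3])
    assert_eq!(bic(&b"((("[1..]), Bic { a: 0, b: 2 }); // Bic(s[1..3])
}
```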

In Section 5, what is the meaning of the Stk(enum(s))[…]) syntax, especially the enum token? And is the reverse scan an inclusive or an exclusive scan?

@raphlinus (Owner)

I got this wrong: it should be monotonically increasing as i decreases. This is fixed in my current draft. I'll upload a new one today, which will address these questions and more. Glad you're interested!

@raphlinus (Owner)

What's attached is just a snapshot of the current state, but it doesn't reflect what I want it to be. I'll take a solid pass tomorrow and hopefully fill in the major gaps.
stack.pdf

@rschreyer

I'm trying to understand the output of Section 5.1 Simple Algorithm / Stack Slices

As described, I'd expect the reverse scan for the string )(()( to produce:
rev_scan(")(()(") --> (1,2), (0,2), (0,1), (1,1), (0,1) (I'm assuming this is an inclusive scan)
rev_scan(")(()(").compact(.a==0) --> (0,2),(0,1),(0,1)
So the partition produces 3 elements. However, one of those elements is already matched with a closing parenthesis within the partition, and so shouldn't be visible as an element of interest to subsequent partitions? (And two of them have .b == 1, so they'd write to the same location in the produced stack.)

@rschreyer

Section 5.2, first bullet, says "It consists of a forward Hillis-Steele scan...". Is it a forward scan or a reverse scan? I think you'd need a reverse scan to make this work, and the implementation also appears to be a reverse scan?

@raphlinus (Owner)

Thanks for these questions, they're very useful in making me describe things more clearly. In 5.2, yes, this is a reverse scan.

The description of 5.1 is too informal, and is also missing the fact that only open parentheses are written (this is why your interpretation is that 3 elements are written when it should only be 2).

Index i is written to location Bic(s).b - Bic(s[i..n]).b if s[i] = '(' and Bic(s[i+1..n]).a = 0. So for ")(()(", Bic(s).b is 2 and the sequence Bic(s[i+1..n]).a is (0, 0, 1, 0, 0). The predicate hits at indices 1 and 4, so index 1 is written at location 0 and index 4 at location 1. I think giving an example here will help.
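
For illustration, a sequential sketch of that write rule; the helper names are mine, and the brute-force recomputation of the suffix Bic values stands in for what the actual implementation gets from a scan:

```rust
// Sequential sketch of the stack-slice write rule: index i is written
// to slot Bic(s).b - Bic(s[i..n]).b when s[i] == '(' and
// Bic(s[i+1..n]).a == 0.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bic {
    a: u32, // unmatched ')'
    b: u32, // unmatched '('
}

fn bic(s: &[u8]) -> Bic {
    s.iter().fold(Bic { a: 0, b: 0 }, |acc, &c| {
        let elem = match c {
            b')' => Bic { a: 1, b: 0 },
            b'(' => Bic { a: 0, b: 1 },
            _ => Bic { a: 0, b: 0 },
        };
        let m = acc.b.min(elem.a);
        Bic {
            a: acc.a + elem.a - m,
            b: acc.b + elem.b - m,
        }
    })
}

fn stack_slice(s: &[u8]) -> Vec<usize> {
    let total = bic(s).b as usize;
    let mut out = vec![usize::MAX; total]; // usize::MAX marks "unwritten"
    for i in 0..s.len() {
        // The predicate: an open paren that stays unmatched to the end.
        if s[i] == b'(' && bic(&s[i + 1..]).a == 0 {
            // Its depth on the final stack determines the slot.
            out[total - bic(&s[i..]).b as usize] = i;
        }
    }
    out
}

fn main() {
    // Reproduces the worked example: 1 written at slot 0, 4 at slot 1.
    assert_eq!(stack_slice(b")(()("), vec![1, 4]);
}
```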

@raphlinus (Owner)

Attached is my latest draft. I haven't done a real polish pass yet, but there's a new introduction (following the suggested format for the HPDC conference) and one graphic, so I think it says mostly what I want it to say now. I'll do some measurements and see if I can get another graphic in there. I'm very open to comments and suggestions: What's not clear? What could be better illustrated?

stack.pdf

@raphlinus (Owner)

Still needs work, but here it is in the recommended template form, and incorporating some of the feedback I've gotten.

stack.pdf

@raphlinus (Owner)

I'm actually pretty happy with this version, though of course there's always more that can be done.

stack.pdf

@rschreyer commented Jan 28, 2022

'rev(Stk(enum(s)[i0..i2])[k]'

That's missing a closing parenthesis.

It's taken me quite a while to understand Equation 1 and the following paragraph. I think some additional exposition would have helped me (assuming I've gotten it right):
In cases where Bic(s[i1..i2]).b is non-zero, the match for index i2 will be found within the slice s[i1..i2]. If .b is zero, then the match will need to index into a stack monoid covering the slice s[i0..i1]. Maybe this is a good place for a forward reference justifying the division of the sequence into i0..i1 and i1..i2: one of these is solved within the workgroup, and the other requires reading the stack generated by other workgroups?
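
For concreteness, a brute-force sequential sketch of those two cases, taking i0 = 0 (the helper names are mine, not from the paper), checked against a direct stack simulation:

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bic {
    a: u32, // unmatched ')'
    b: u32, // unmatched '('
}

fn bic(s: &[u8]) -> Bic {
    s.iter().fold(Bic { a: 0, b: 0 }, |acc, &c| {
        let elem = match c {
            b')' => Bic { a: 1, b: 0 },
            b'(' => Bic { a: 0, b: 1 },
            _ => Bic { a: 0, b: 0 },
        };
        let m = acc.b.min(elem.a);
        Bic {
            a: acc.a + elem.a - m,
            b: acc.b + elem.b - m,
        }
    })
}

/// Indices of the unmatched open parens in s, bottom of stack first.
fn stack_of(s: &[u8]) -> Vec<usize> {
    let mut stack = Vec::new();
    for (j, &c) in s.iter().enumerate() {
        match c {
            b'(' => stack.push(j),
            b')' => {
                stack.pop();
            }
            _ => {}
        }
    }
    stack
}

/// The "match" of position i2 (top of stack just before i2), resolved
/// via the split at i1: locally when Bic(s[i1..i2]).b > 0, otherwise
/// by indexing into the stack for the prefix s[..i1].
fn match_of(s: &[u8], i1: usize, i2: usize) -> Option<usize> {
    let local = bic(&s[i1..i2]);
    if local.b > 0 {
        // Case 1: an open paren in s[i1..i2] is still unmatched at i2;
        // the match is the last such index, found within the workgroup.
        (i1..i2)
            .rev()
            .find(|&j| s[j] == b'(' && bic(&s[j + 1..i2]).a == 0)
    } else {
        // Case 2: the local slice pops local.a entries off the prefix
        // stack, so the match sits local.a entries below its top.
        let prefix = stack_of(&s[..i1]);
        prefix
            .len()
            .checked_sub(local.a as usize + 1)
            .map(|k| prefix[k])
    }
}

fn main() {
    // Check against a direct stack simulation for every split point.
    let s: &[u8] = b"(()(()";
    for i2 in 0..=s.len() {
        let expected = stack_of(&s[..i2]).last().copied();
        for i1 in 0..=i2 {
            assert_eq!(match_of(s, i1, i2), expected);
        }
    }
}
```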

Section 7.2: This section does feel like it is missing context, for example the introduction of segments and the fact that they can be empty vs. non-empty. Would these be tokens other than ( and )? Or is an empty segment an entire prior partition that doesn't have any unclosed open parentheses?

@raphlinus (Owner)

Thanks! Since posting that draft I've actually added an explanation in English; I hope that will explain it better. I'll have another look, fix the missing parenthesis, and have another go at 7.2 (I know this is still a bit in shorthand). New draft up soon.

@raphlinus (Owner)

I rewrote the section, expanding out a bit what was previously in shorthand. Hopefully there's enough detail there now to figure out what it's actually saying. ("Empty" refers strictly to unmatched open parentheses, and I believe I've made that more precise and explicit now.)

I don't really expect people to get 7.2, it's pretty detailed and weedsy. But it should be a description of what I've done, and arguably important as it's what supports the performance result and the ability to scale up pretty well even with two dispatches.

In any case, I very much appreciate your review and comments. I've been getting some on the Zulip as well, and between the two it makes me feel a lot less like I'm shouting into the void. I think I'll submit this version, and then I do have a small window to revise that further if I feel the need.

stack.pdf

@rschreyer

Sorry for the delay, I didn't see a notification of a new post.

In cases where Bic([s1..i2])
That should probably be i1. Also, the long sentence that follows is confusing; it should probably be split into two separate statements, with the latter being explicit about using the stack monoid for [i0..i1] if the match can't be found within [i1..i2].

Materialize the stack for the prefix of the input up to the current partition
This immediately makes me think you are materializing the entire stack. Perhaps rephrase to 'Materialize the w-suffix of the stack at the beginning of the current partition'

I think the new Section 7.2 works quite well. Send it!

@raphlinus (Owner)

An update. The paper submission above was rejected. I am trying again. The new forum allows publication of the work but asks that the forum not be identified. I am furiously working to make the May 2 deadline. I've done some writing and editing, but the measurements (both of the existing stack monoid and the new bbox oriented work) haven't been done yet.

So at this point I'd love to have feedback on the writing, understanding that the obvious TODO sections will be filled in later.

bbox-04-30.pdf

@raphlinus (Owner)

Attached is a new draft, this time with performance measurements and analysis. I don't think I will have time to create a new visual as originally planned, so from here I will basically take a pass to clean up typos and obvious TODO items.

bbox-05-02.pdf

@raphlinus (Owner)

This was not accepted by the conference. I now plan to post to arXiv and write an associated blog post. The new version, with conference information stripped off, is attached. Comments are welcome; it will probably be a few days before I get everything lined up here. I'm also hoping to rerun measurements on Nvidia 3080 hardware, but that will also take a few days.

bbox-05-23.pdf

@saona-raimundo

Hi there!

I hope you do not mind some naive comments. I studied mathematics and work in a computer science group. I have no experience in theoretical parallel algorithms but am generally interested in practical implementations. I hope the comments are relevant.

Theoretical algorithms and their implementation

In Section 1.2, page 2, first bullet point, you say

Theoretical presentations of work-efficient algorithms analyzed in terms of an abstract Parallel Random Access Machine (PRAM) model but no clear mapping to an efficient GPU implementation ([Bar-On and Vishkin 1985], [Levcopoulos and Petersson 1992], [Prasad et al. 1994]).

My question is why this mapping is not clear:

  • Is it just the usual "this is a theoretical algorithm and there was no implementation to be found"? or
  • Is there a more substantial reason for the mapping to be hard?

PRAM model

In Section 1.3, page 3, third paragraph, you say

The second insight is that interleaving these two approaches yields a work-efficient algorithm. Further, the two approaches map well to the hierarchical structure of actual GPU hardware, in which each GPU workgroup computes one partition of the input problem.

My question is: would this not mean that a hierarchical memory model fits current hardware better?

  • Maybe the PRAM model is never going to be implemented in real hardware and a different model fits better? or
  • Would it not be a good improvement (to this paper at least) to formalize such a hierarchical model for GPUs?

@raphlinus (Owner)

First, the paper is now on arXiv and that will be the definitive place to find it. I am tentatively planning a v2 with more measurement and will probably delay publicizing it widely until then, but feel free to share.

Second, absolutely I don't mind comments. Think of this as the Q&A at the hypothetical conference it would have gotten accepted to :)

Personally I believe that there is a huge and deep area of study in adapting "classical" parallel algorithms to more realistic hardware. There might be differences of opinion about whether GPU compute shaders represent a narrow niche (not as efficient as TPU-like accelerators, not as agile as CPUs) or whether their hierarchies are a more principled way to organize large numbers of threads, likely to persist in similar fashion well into the future. I'm honestly not sure to what extent making an abstract compute performance model is useful; there's a possibility it's too complicated.

Making such a model is a fairly deep study, and I think wildly too ambitious for this particular paper. Essentially if you were proposing such a thing, you'd need to justify that its predictions correlated with real-world performance. I think a good step in this direction is Vasily Volkov's thesis work.

I'm not sure where this kind of work is being done, and where is a good place to publish it.
