
Adding SLS node instead of reducing to SLWS #3039

Closed
wants to merge 3 commits

Conversation

@mortzur
Contributor

commented Jun 4, 2019

Summary:
Currently SLS is implemented by creating a splat node for constant 1.0 weights and reducing to SLWS. Better performance could be achieved by avoiding the unnecessary weights.
work-in-progress on this stack:

  • add Habana impl
  • add OpenCL impl
  • do the same for the fused operator (SparseLengthsSumFused8BitRowwise)

update: OpenCL was not implemented; lowering to SLWS was enabled instead. SparseLengthsSumFused8BitRowwise will be added in a separate PR.

Test Plan:

[Optional Fixes #issue]

Please see a detailed explanation of how to fill out the fields in the relevant sections in PULL_REQUEST.md.

@jfix71
Contributor

left a comment

If the only motivation here is to gain performance, I don't think it's necessary to add the SLS Node, just the SLS Instruction. We could just add specialized IRGen logic that checks if the weights input is a Splat of 1.0, and if so generate the SLS Instruction, otherwise generate the SLWS Instruction. This limits complexity both by not needing a new Node, and by allowing for optimizations that would apply to both SLS/SLWS to be written just for SLWS. WDYT?
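The dispatch jfix71 describes can be sketched abstractly. Below is a minimal Python model of that IRGen check (the names `Splat`, `SLWSNode`, and `irgen_slws` are illustrative stand-ins, not Glow's actual C++ API):

```python
from dataclasses import dataclass

@dataclass
class Splat:
    """Stand-in for a splat node: a tensor filled with one constant."""
    value: float

@dataclass
class SLWSNode:
    """Stand-in for a SparseLengthsWeightedSum graph node."""
    data: str
    weights: object  # either a Splat or some other node
    indices: str
    lengths: str

def irgen_slws(node):
    """Emit an unweighted SLS instruction when the weights input is a
    constant splat of 1.0; otherwise emit the general SLWS instruction."""
    w = node.weights
    if isinstance(w, Splat) and w.value == 1.0:
        return ("SLS", node.data, node.indices, node.lengths)
    return ("SLWS", node.data, node.weights, node.indices, node.lengths)

print(irgen_slws(SLWSNode("emb", Splat(1.0), "idx", "len"))[0])  # SLS
```

The graph keeps a single SLWS node; only at IR generation does the check decide between the two instructions.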

@rdzhabarov

Contributor

commented Jun 4, 2019

I agree with @jfix71 on not having explicit SLS node on graph level.

Note, for Habana we need to do this checking/matching before IR gen (we do not use IR gen for it).
For now I'd focus on optimizing Habana and then we could add general support as @jfix71 suggested.

@bertmaher

Contributor

commented Jun 4, 2019

IMO it is simpler to add an SLS node than it is to try to pattern match it away in the IR (and also in the backends that don’t use IR)

@jfix71

Contributor

commented Jun 4, 2019

IMO it is simpler to add an SLS node than it is to try to pattern match it away in the IR (and also in the backends that don’t use IR)

I think the pattern matching logic is pretty small/simple here. Instead of adding an SLS case in a switch that mostly duplicates the SLWS logic, you'd add something e.g. like this for Habana to the SLWS case:

std::string kernel;
// If the weights input is a constant splat of 1.0, use the unweighted
// kernel and skip passing the weights tensor entirely.
SplatNode *SN = dyn_cast<SplatNode>(SLWS->getWeights());
if (SN && SN->getValue() == 1.0) {
  kernel = "sparse_lengths_sum_f32";
} else {
  kernel = "sparse_lengths_weighted_sum_f32";
  inputs.push_back(tensors[SLWS->getWeights()].get());
}
@bertmaher

Contributor

commented Jun 4, 2019

Ok, sure, we can do the pattern match instead.

@bertmaher

Contributor

commented Jun 4, 2019

Just so I don’t feel like an idiot, I’ve got to explain why I think it’s “simpler” to add the new node: adding a node involves a fair amount of boilerplate, which I had forgotten about, but that boilerplate is all extremely regular. So there’s duplicated code, but it’s identical to the surrounding code, so it’s pretty easy to write and reason about.

The custom optimization code is short, and I guess I prefer doing it here in the Habana backend just for the sake of diff size, but it’s a custom thing that has to be reasoned about (e.g., will the Habana backend DCE away the splat? It turns out it will, so this particular thing isn’t really an issue, but it could have been).

I also have a larger issue with our “lowering” nodes that are present in the source only to then “raise” the generated pattern back to the original input. But this isn’t the only place we do that, so, may as well not fight it.

@opti-mix

Contributor

commented Jun 4, 2019

I also have a larger issue with our “lowering” nodes that are present in the source only to then “raise” the generated pattern back to the original input. But this isn’t the only place we do that, so, may as well not fight it.

This is not the only place :-( I was coincidentally working on the same optimization for one of the backends. IMHO, it is not Habana-specific. Some other backends may benefit as well from being able to distinguish between SLWS and SLS. And as far as I can see some of them are graph IR based and some others are low-level IR based.

@jfix71

Contributor

commented Jun 4, 2019

will the Habana backend DCE away the splat? It turns out it will, so this particular thing isn’t really an issue, but it could have been

I don't follow here -- aren't you essentially lowering the graph to Habana IR? Why would Habana IR know about the Splat if you elide it when lowering?

I also have a larger issue with our “lowering” nodes that are present in the source only to then “raise” the generated pattern back to the original input. But this isn’t the only place we do that, so, may as well not fight it.

This is not the only place :-( I was coincidentally working on the same optimization for one of the backends. IMHO, it is not Habana-specific. Some other backends may benefit as well from being able to distinguish between SLWS and SLS. And as far as I can see some of them are graph IR based and some others are low-level IR based.

Hmm, so I think we generally do the opposite, raising things as much as possible at the beginning (e.g. via fold()), then gradually lowering only as the backend desires. What other examples are you guys thinking of?

In general if it was at all difficult/fragile to pattern match the SLS from the SLWS then I'd understand the issue more, but it's super simple here. I agree this is definitely not Habana specific -- all backends would want to do this -- it's more a question of how to represent it in the IR. This same pattern matching logic would be used everywhere (i.e. check if the Weights input is a splat with Value 1.0). We could even make this a helper function if it'd make it easier.
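The helper jfix71 suggests would be a tiny predicate shared by every backend, whether it runs at graph level (as Habana would) or at IRGen. A minimal sketch, with Python modeling the C++ check (the names `Splat` and `weights_are_all_ones` are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Splat:
    """Stand-in for a splat node: a tensor filled with one constant."""
    value: float

def weights_are_all_ones(weights) -> bool:
    """True iff the weights input is a Splat with value 1.0, i.e. the
    SLWS is really an unweighted SLS in disguise."""
    return isinstance(weights, Splat) and weights.value == 1.0

print(weights_are_all_ones(Splat(1.0)))  # True
print(weights_are_all_ones(Splat(0.5)))  # False
```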

@nadavrot

Contributor

commented Jun 4, 2019

I am really enjoying reading this thread. There are always multiple ways to do things and it's great to have these technical discussions that help us to come up with a great design. But even beyond the scope of this specific PR, I love learning from discussion threads like this about different design tradeoffs.

@mortzur

Contributor Author

commented Jun 4, 2019

@jfix71 - On one hand, I would love to avoid changing the high-level returned node type, because it would allow me to insert the modifications in a more modular way, without needing to change anything for backends which already exist and work fine with SLWS. The motivation here is Habana.

On the other hand, returning an SLWS node which hides an implementation as SLS (if I understood right) sounds confusing for readability and debugging.

@bertmaher

Contributor

commented Jun 4, 2019

will the Habana backend DCE away the splat? It turns out it will, so this particular thing isn’t really an issue, but it could have been

I don't follow here -- aren't you essentially lowering the graph to Habana IR? Why would Habana IR know about the Splat if you elide it when lowering?

So the way the Habana codegen works is to iterate all the nodes, and call synCreateGenericNode to produce the corresponding Habana IR. So if there's a splat node in the graph, we'll create a Habana node for it, even if it's not used by anything. (I think the Habana GC is smart enough to DCE it later). (Hopefully I'm not wrong about this description :) ).

Maybe there's some alternate design that would avoid visiting the splat at all?

@opti-mix

Contributor

commented Jun 4, 2019

So if there's a splat node in the graph, we'll create a Habana node for it, even if it's not used by anything. (I think the Habana GC is smart enough to DCE it later). (Hopefully I'm not wrong about this description :) ).

I totally agree that this is an issue. I see a similar problem with the IR-based backend I'm working on. I do the simple conditional check that @jfix71 proposed, but by then it is too late: the code for the Splat has already been generated for the weights operand of the SLWS instruction, and the memory and compute spent initializing it with the splat value are simply wasted. Of course, I could have introduced a backend-specific SLS instruction and custom IRGen, or a backend-specific SLS node and instruction. But it seems like we already have 2 or 3 backends that may need to do this, or to use some other custom logic to recognize that SLS could be used instead of SLWS. So avoiding a dedicated SLS node and instruction does not buy us much in terms of saving development effort.

Another thing that we briefly discussed with @bertmaher and @mortzur is whether backends should just say in isOpSupported if they want SLS to be lowered (into SLWS) or not. By default it would be lowered, i.e. we preserve today's status quo, but backends could override this.
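A rough sketch of how such a per-backend opt-out could work, modeled in Python (the hook name `should_lower` and all class names here are hypothetical; Glow's real backend interface may differ):

```python
class Backend:
    """Default backend: keep today's status quo and lower SLS into SLWS."""
    def should_lower(self, node_kind: str) -> bool:
        return True

class HabanaBackend(Backend):
    """Hypothetical backend with a native unweighted-sum kernel:
    opt out of lowering so SLS survives to code generation."""
    def should_lower(self, node_kind: str) -> bool:
        if node_kind == "SparseLengthsSum":
            return False
        return True

def lower_graph(nodes, backend):
    """Toy lowering pass over a list of node kinds, consulting the backend."""
    out = []
    for kind in nodes:
        if kind == "SparseLengthsSum" and backend.should_lower(kind):
            # In the real pass this would also synthesize a Splat(1.0)
            # weights input for the new SLWS node.
            out.append("SparseLengthsWeightedSum")
        else:
            out.append(kind)
    return out

print(lower_graph(["SparseLengthsSum"], Backend()))        # ['SparseLengthsWeightedSum']
print(lower_graph(["SparseLengthsSum"], HabanaBackend()))  # ['SparseLengthsSum']
```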

@SplitInfinity

Contributor

commented Jun 4, 2019

Better performance could be achieved by avoiding the unnecessary weights.

Could you share your measurements?

@jfix71

Contributor

commented Jun 4, 2019

@bertmaher @opti-mix Ah, got it -- yeah I imagine you could walk the graph starting from outputs and then recursively generate IR for your inputs, so that when you generate an SLS from an SLWS you wouldn't generate the weights. But given the extra complexity that would bring I think it totally makes sense to just add the SLS node 🙂

On the other hand, returning an SLWS node which hides an implementation as SLS (if I understood right) sounds confusing for readability and debugging.

@mortzur Yeah I agree it is a little unexpected. However we do this already in many places in Graph.cpp, where we return some Node that implements some op even though we don't have a class for that op. E.g. see createBroadcast(), createDotProduct(), createClip(), createLogit(), etc. We do this for a similar reason to this PR -- to avoid adding a new Node to NodeGen. Though I agree this case is slightly different given that we want to IRGen it to something else.

@artemrakhov

Contributor

commented Jun 4, 2019

Discussed offline. I still think that if we add unweighted versions of everything, we will end up with 2x more nodes:

  • SparseLengthsWeightedSum
  • RowwiseQuantizedSparseLengthsWeightedSum
  • FusedRowwiseQuantizedSparseLengthsWeightedSum
  • SparseLengthsSum
  • RowwiseQuantizedSparseLengthsSum
  • FusedRowwiseQuantizedSparseLengthsSum

Which we probably don't want. It's a slippery slope, this is how we can quickly reach 500 ops.

@bertmaher

Contributor

commented Jun 5, 2019

Discussed offline. I still think that if we add unweighted versions of everything, we will end up with 2x more nodes:
Which we probably don't want. It's a slippery slope, this is how we can quickly reach 500 ops.

Here's my take: I don't think we should focus on the number of operators for its own sake. What we care about is performance of ML workloads. In some cases having fewer ops helps you get performance, because you can find fusions or other "global" optimization opportunities that would be hard to find amongst a bunch of complicated operators. (This is kind of like RISC vs CISC.)

If we had a nice fusion pass that would take in a graph containing SLWS+Splat(1.0) and produce code equivalent to SLS (and also DCE the splat), that would be preferable to adding an explicit SLS. But we don't have that yet (and it's still an open question of whether we can get there since some backends don't make it easy to automatically create new fused kernels on the fly).
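Such a fusion pass would rewrite an SLWS whose weights are Splat(1.0) into an SLS and then dead-code-eliminate the now-unused splat. A toy sketch, with a minimal node list standing in for the graph (all names are hypothetical, not Glow's API):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str
    inputs: list = field(default_factory=list)
    value: float = 0.0  # only meaningful for Splat nodes

def fuse_sls(graph):
    """Rewrite SLWS(data, Splat(1.0), indices, lengths) into
    SLS(data, indices, lengths), then drop splats with no remaining users."""
    for n in graph:
        if n.kind == "SLWS":
            w = n.inputs[1]
            if w.kind == "Splat" and w.value == 1.0:
                n.kind = "SLS"
                n.inputs = [n.inputs[0]] + n.inputs[2:]  # drop the weights input
    # Simple DCE: keep a splat only if some node still consumes it.
    used = {id(i) for n in graph for i in n.inputs}
    return [n for n in graph if n.kind != "Splat" or id(n) in used]

data, idx, ln = Node("Data"), Node("Indices"), Node("Lengths")
ones = Node("Splat", value=1.0)
slws = Node("SLWS", [data, ones, idx, ln])
g = fuse_sls([data, idx, ln, ones, slws])
print([n.kind for n in g])  # ['Data', 'Indices', 'Lengths', 'SLS']
```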

In the meantime we could use the performance boost today :-)

@mortzur mortzur force-pushed the mortzur:sls_op branch 3 times, most recently from 72236e4 to b6a7827 Jun 6, 2019

@nrsatish

Contributor

commented Jun 7, 2019

Hi folks - it is becoming very difficult to partition due to these shared Splat nodes among all the ops. This PR would unblock partitioning progress.

@bertmaher

Contributor

commented Jun 7, 2019

Hi folks - it is becoming very difficult to partition due to these shared Splat nodes among all the ops. This PR would unblock partitioning progress.

This is a pretty good argument for having the new node. An alternative we talked about was introducing a HabanaSparseLengthsSum and specializing it for that platform, but, given this PR already exists, I'm going to go ahead and review it as-is. Code is mutable, we can always take a different approach later :-)

@bertmaher

Contributor

commented Jun 7, 2019

Ok, so per offline convo we're going to introduce the notion of an optional input to nodes, which we'll use to magic away the weights input of SLWS. So I'll wait for that instead.

@bertmaher
Contributor

left a comment

Unblocking this.

@facebook-github-bot

left a comment

@bertmaher has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@@ -794,6 +794,18 @@ static void lowerBatchMatMulNode(Function *F, CompilationContext &cctx,
replaceAllUsesOfWith(cctx.loweredInfoMap, BMMN.getResult(), RN);
}

static void lowerSparseLengthsSumNode(Function *F, CompilationContext &cctx,

@rdzhabarov

rdzhabarov Jun 11, 2019

Contributor

What's the reason for this, given that you added an SLS implementation to the interpreter and CPU backends?

@mortzur

mortzur Jun 11, 2019

Author Contributor

It is meant for other backends that do not have an SLS implementation, e.g. OpenCL.
I should have thought about it before implementing it for the interpreter and CPU.
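Presumably the truncated lowerSparseLengthsSumNode in the diff above performs the inverse decomposition, materializing the all-ones weights so backends without an SLS implementation still work. A minimal sketch of that lowering, with Python standing in for the C++ (all names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Splat:
    """Stand-in for a splat node: a tensor filled with one constant."""
    value: float

@dataclass
class SLWS:
    """Stand-in for a SparseLengthsWeightedSum graph node."""
    data: str
    weights: object
    indices: str
    lengths: str

def lower_sparse_lengths_sum(data, indices, lengths):
    """Lower SLS into SLWS by materializing a Splat of 1.0 as the
    weights input, mirroring how the other lower*Node helpers decompose
    nodes for backends lacking a native implementation."""
    ones = Splat(1.0)
    return SLWS(data, ones, indices, lengths)

lowered = lower_sparse_lengths_sum("emb", "idx", "len")
print(type(lowered).__name__)  # SLWS
```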

@mortzur mortzur force-pushed the mortzur:sls_op branch from b6a7827 to 051a4b0 Jun 11, 2019

@mortzur mortzur force-pushed the mortzur:sls_op branch from 051a4b0 to 647ecab Jun 11, 2019

@rdzhabarov

Contributor

commented Jun 11, 2019

Please check that this works internally on representative models before merging.

@facebook-github-bot

left a comment

@mortzur has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@artemrakhov

Contributor

commented Jun 11, 2019

Just for the record, have we decided not to proceed with SLSNode? Was this closed without landing?

@bertmaher

Contributor

commented Jun 11, 2019

No, this landed, the UI is just a bit confusing. I don't think we really came to a decisive agreement about the Right Way forward, but this helps unblock some perf work in the short term.
