A lower-level implementation of vectorize in numba #92

Merged: 14 commits merged into main from llvm-elemwise on Jan 4, 2023

Conversation

aseyboldt (Member):

As discussed in #70, this rewrites the numba implementation using llvm intrinsics.
This way we can use broadcasting information, supply llvm with the aliasing information required for vectorization, support multiple outputs of elemwise, and prepare for ElemwiseSum-like optimizations (the llvm code already supports those).

Locally, there are still some test failures due to execution in python mode that I will have to work out.

This can lead to sizable performance improvements (with svml installed):

```python
import numpy as np
import numba
import pytensor
import pytensor.tensor as pt
import pytensor.link.numba.dispatch

import llvmlite.binding as llvm
# llvm.set_option('', '--debug-only=loop-vectorize,iv-descriptors')

x = pt.dmatrix("x")
y = pt.dvector("y")

# Broadcasting elemwise graph: y broadcasts along the rows of x
out = np.exp(2 * x * y + y)

x_val = np.random.randn(200, 500)
y_val = np.random.randn(500)

func = pytensor.function([x, y], out, mode="NUMBA")
out_val = func(x_val, y_val)
np.testing.assert_allclose(np.exp(2 * x_val * y_val + y_val), out_val)

%timeit func(x_val, y_val)
```

```
# Before PR
712 µs ± 6.76 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# After PR
189 µs ± 2.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```
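Whether the loop vectorizer (and SVML) actually kicked in can be checked by inspecting the IR that numba generates. A minimal sketch using numba's `inspect_llvm` API; the `kernel` function here is a hand-written stand-in for the generated elemwise loop, not the code this PR emits:

```python
import numba
import numpy as np

@numba.njit(fastmath=True)
def kernel(x, y, out):
    # Hand-written stand-in for the generated elemwise loop
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.exp(2.0 * x[i, j] * y[j] + y[j])

x = np.random.randn(200, 500)
y = np.random.randn(500)
out = np.empty_like(x)
kernel(x, y, out)  # trigger compilation

# Heuristic check: the IR contains vector types (e.g. "<4 x double>")
# and, with SVML installed, calls like "__svml_exp4" when it vectorized.
ir = next(iter(kernel.inspect_llvm().values()))
print("vectorized:", "x double>" in ir)
```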

```python
layout = aryty.layout
return (data, shape, strides, layout)

# TODO I think this is better than the noalias attribute
```
Member:

Maybe we can ask for a release?

aseyboldt (Member Author):

Would need to be merged first :-)
numba/llvmlite#895
Also, I don't think this is a major issue. The current noalias attribute for the output arrays should cover most (if not all) cases as well.
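For context, a minimal llvmlite sketch of the kind of aliasing annotation being discussed: attaching `noalias` to an output pointer argument, which lets LLVM's loop vectorizer assume the output buffer does not overlap the input. The function and names are illustrative, not the PR's actual codegen:

```python
import llvmlite.ir as ir

mod = ir.Module(name="noalias_demo")
f64p = ir.DoubleType().as_pointer()
i64 = ir.IntType(64)

fn = ir.Function(mod, ir.FunctionType(ir.VoidType(), [f64p, f64p, i64]), name="copy")
out, inp, n = fn.args
# The annotation under discussion: the output buffer aliases nothing else,
# so loads from `inp` can be reordered/vectorized across stores to `out`.
out.add_attribute("noalias")

entry = fn.append_basic_block("entry")
loop = fn.append_basic_block("loop")
done = fn.append_basic_block("done")

b = ir.IRBuilder(entry)
zero = ir.Constant(i64, 0)
b.cbranch(b.icmp_signed(">", n, zero), loop, done)  # guard against n == 0

b.position_at_end(loop)
idx = b.phi(i64, name="idx")
idx.add_incoming(zero, entry)
b.store(b.load(b.gep(inp, [idx])), b.gep(out, [idx]))
nxt = b.add(idx, ir.Constant(i64, 1))
idx.add_incoming(nxt, loop)
b.cbranch(b.icmp_signed("<", nxt, n), loop, done)

b.position_at_end(done)
b.ret_void()

print(mod)  # the printed IR shows the noalias attribute on the first argument
```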

aseyboldt force-pushed the llvm-elemwise branch 5 times, most recently from 2ce8ae9 to 8e52ff7 on December 21, 2022 04:02
aseyboldt marked this pull request as ready for review on December 21, 2022 04:12
aseyboldt (Member Author):

I think I finally got there; tests might just pass this time :-)

ricardoV94 (Member):

I'll try to review soon. In the meantime, do you want to add one or two benchmark tests? #139

twiecki (Member) commented Dec 21, 2022:

@aseyboldt Any more timing results?

codecov-commenter commented Dec 21, 2022:

Codecov Report

Merging #92 (c62ed02) into main (f4de2fd) will decrease coverage by 0.02%.
The diff coverage is 82.19%.

Additional details and impacted files


```diff
@@            Coverage Diff             @@
##             main      #92      +/-   ##
==========================================
- Coverage   79.98%   79.95%   -0.03%     
==========================================
  Files         169      170       +1     
  Lines       44607    44848     +241     
  Branches     9426     9493      +67     
==========================================
+ Hits        35678    35859     +181     
- Misses       6738     6778      +40     
- Partials     2191     2211      +20     
```
| Impacted Files | Coverage Δ |
|---|---|
| pytensor/link/numba/dispatch/extra_ops.py | 92.24% <ø> (ø) |
| pytensor/link/numba/dispatch/elemwise.py | 89.01% <78.01%> (-8.04%) ⬇️ |
| pytensor/link/numba/dispatch/elemwise_codegen.py | 85.96% <85.96%> (ø) |
| pytensor/link/numba/dispatch/basic.py | 89.33% <100.00%> (-0.73%) ⬇️ |
| pytensor/link/numba/dispatch/scalar.py | 94.51% <100.00%> (+0.06%) ⬆️ |
| pytensor/link/numba/dispatch/scan.py | 95.45% <100.00%> (+0.06%) ⬆️ |
| pytensor/link/numba/linker.py | 100.00% <100.00%> (ø) |

aseyboldt (Member Author):

@twiecki The benchmark above hasn't really changed.
It's hard to say how this impacts real models; for the radon model in the nutpie example it currently reduces my compile time from ~15s to ~12s (lots of noise...) and logp eval time from 12μs to 10μs (much less noise).
I'd expect bigger wins for models with more data, and more once we merge this with the loop fusion work and move the sums into the elemwise op.

@ricardoV94 I added the code from the description as a benchmark.

pytensor/link/numba/dispatch/elemwise_codegen.py (outdated review thread)
```diff
@@ -0,0 +1,240 @@
+from typing import Any, List, Optional, Tuple
```
Member:

I can guess the answer to this question... but do we need to go to this level? Couldn't we add some shape asserts in a vanilla numba function as a way to get the speedup, without having to do all this?
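Concretely, the alternative being asked about might look something like this hypothetical sketch (plain numba; the function name and asserts are invented for illustration):

```python
import numba
import numpy as np

@numba.njit
def elemwise_with_asserts(x, y, out):
    # Assert the shapes up front, then hope numba/LLVM can
    # vectorize the plain loops without any custom codegen.
    assert x.shape[1] == y.shape[0]
    assert out.shape == x.shape
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.exp(2.0 * x[i, j] * y[j] + y[j])
```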

aseyboldt (Member Author):

Some shape asserts won't help, I'm afraid. :-)
I think it is pretty clear that we want support for reductions during elemwise, and we also want to support multiple outputs. This means that the built-in numba vectorize won't do it.
So as far as I can see, we have two options to get those:

  • Build numba functions using string processing
  • Use the llvm interface

I think working with strings is actually harder and more error-prone than working with the llvm code directly (have a look at the Reduction functions in this module; I think that's even more complicated). Also, consider what happens if we take the string route:

  • We start with a properly typed graph
  • We (conditionally) generate strings of numba code that have no type info
  • The numba code gets evaluated and transformed to bytecode
  • The bytecode is analysed by numba
  • Numba tries to figure out types for all the variables (hopefully the same types we discarded earlier)
  • Numba uses code just like the code in this PR to build an llvm module, based on the bytecode

I don't see why we would want those in-between steps to happen. pytensor is a compiler; I don't think we can reasonably expect to get around doing what compilers do: generate code. ;-)
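To make the contrast concrete, here is a toy sketch of the string-processing alternative (purely illustrative; none of these names exist in the codebase). The generated source carries no type information, so numba has to re-infer from bytecode everything the pytensor graph already knew:

```python
import numba
import numpy as np

# Hypothetical string-based codegen: build untyped numba source, exec it,
# then let numba re-derive the types we already had in the typed graph.
src = """
def elemwise(x, y, out):
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.exp(2.0 * x[i, j] * y[j] + y[j])
"""
namespace = {"np": np}
exec(compile(src, "<generated>", "exec"), namespace)
elemwise = numba.njit(namespace["elemwise"])

x = np.random.randn(4, 5)
y = np.random.randn(5)
out = np.empty_like(x)
elemwise(x, y, out)  # type inference happens here, from bytecode alone
```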

Member:

The long-term problem I see here is maintenance. This is the backbone of the library, and I don't know how much time it would take me (or another dev) to sort out a bug in the future, nor how volatile these functions are or how well documented.

aseyboldt (Member Author):

Yeah, the move from "we build a stats library" to "let's build a compiler" is a bit sudden :-)

```diff
@@ -117,6 +117,25 @@ def test_Elemwise(inputs, input_vals, output_fn, exc):
     compare_numba_and_py(out_fg, input_vals)


 def test_elemwise_speed(benchmark):
```
Member:

Can you add a case with broadcasting?

aseyboldt (Member Author):

That one is using broadcasting, isn't it?

Member:

Hmm, I guess I wanted non-dimshuffled broadcasting, because that's the case we've been thinking about, but it should be the same?

aseyboldt (Member Author):

Elemwise will always call dimshuffle in make_node anyway, so there shouldn't be a non-dimshuffled case?
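That claim is easy to check directly; a small sketch using `pytensor.dprint` (output omitted here):

```python
import pytensor
import pytensor.tensor as pt

x = pt.dmatrix("x")
y = pt.dvector("y")

# Elemwise.make_node pads y's dimensions itself, so the printed graph
# contains a DimShuffle node even though the user never wrote one.
pytensor.dprint(x + y)
```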

Member:

By non-dimshuffled I mean matrix + row/col
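A sketch of what such a case could look like, following the shape of the existing benchmark (the test name and the use of `pt.drow` are illustrative assumptions, not code from this PR):

```python
import numpy as np
import pytensor
import pytensor.tensor as pt

def test_elemwise_speed_broadcast_row(benchmark):
    # matrix + row broadcasting, with no explicit DimShuffle in user code
    x = pt.dmatrix("x")
    y = pt.drow("y")  # shape (1, n), broadcast against the rows of x

    func = pytensor.function([x, y], pt.exp(2 * x * y + y), mode="NUMBA")

    rng = np.random.default_rng(42)
    x_val = rng.normal(size=(200, 500))
    y_val = rng.normal(size=(1, 500))

    np.testing.assert_allclose(func(x_val, y_val), np.exp(2 * x_val * y_val + y_val))
    benchmark(func, x_val, y_val)
```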

ricardoV94 (Member) commented Jan 3, 2023:

LGTM, though I'm not happy with having to write low-level code that seems as complex as, if not more complex than, our C implementation.

Left some small questions/comments above

twiecki (Member) commented Jan 3, 2023:

Yeah, I'd also feel more comfortable if we proved the benefit of adding this much complexity a bit more.

aseyboldt (Member Author):

The docs failure looks like it is caused by the sphinx 6.0 release, which dropped py37 support: https://www.sphinx-doc.org/en/master/changes.html#id9
(now tracked in #165)

@twiecki:

> Yeah, I'd also feel more comfortable if we proved the benefit of adding this much complexity a bit more.

Does the post above convince you already?
I can see the maintenance problem; it's just that I don't see alternatives that are much better...

twiecki (Member) commented Jan 4, 2023:

> Does the post above convince you already?

So it's not just faster, but actually allows us to do things we couldn't do before, i.e. reductions during elemwise. In that case I'm convinced.
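For readers wondering what "reductions during elemwise" buys: a hypothetical single-pass fusion, written here in plain numba just to illustrate the idea (not this PR's generated code), avoids materializing the intermediate array entirely:

```python
import numba
import numpy as np

@numba.njit(fastmath=True)
def fused_exp_sum(x):
    # What an ElemwiseSum-style fusion could compute in one pass: the
    # elemwise exp and the sum share a single loop, so the intermediate
    # exp(x) array is never allocated.
    acc = 0.0
    for i in range(x.shape[0]):
        acc += np.exp(x[i])
    return acc

x = np.random.randn(10_000)
assert np.isclose(fused_exp_sum(x), np.exp(x).sum())
```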

michaelosthege (Member):

@aseyboldt a rebase should fix the docs build
