This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

performance of idiomatic expressions #107

Closed
jpata opened this issue Mar 21, 2019 · 7 comments


jpata commented Mar 21, 2019

I've noticed that a somewhat-complex reduction is much faster (~250x) as a direct numba loop over contents and offsets as opposed to an idiomatic awkward array haiku. The task I have is that given a nested list of muon charges per event, mask all muons except the first two that have an opposite sign. I wonder if there is a better way to state this operation in awkward, or perhaps there is some unexpected performance loss?

Awkward-array idiomatic code, runtime 1.9 s:

#muon_charges: JaggedArray of the form [[-1, +1, -1], [+1, +1, -1], ...]
#out_muon_mask: boolean array with the same length as muon_charges.content. Mask all charges except the first two of opposite charge, i.e. [1, 1, 0, 1, 0, 1] for the above array.

def get_os_muons_awkward(muon_charges, out_muon_mask):

    ch = muon_charges
        
    #select events with at least 2 muons
    events_min2_muons = ch.count()>=2
    
    #get the charges of the muons in these events
    ch2 = ch[events_min2_muons]

    #get the index pairs of all muons on an event-by-event basis
    all_muon_pairs = ch2.argcross(ch2)

    #get only those index pairs where the muon is not paired with itself and is paired with another muon with a higher index
    pairs_mask = (all_muon_pairs['0'] != all_muon_pairs['1']) & ((all_muon_pairs['0'] < all_muon_pairs['1']))
    all_muon_pairs = all_muon_pairs[pairs_mask]
    
    #get the pairs with the opposite sign charges
    pairs_with_os = ch2[all_muon_pairs['0']] != ch2[all_muon_pairs['1']]
    
    #get the indices of the pairs with the opposite sign
    idxs = all_muon_pairs[pairs_with_os]

    #get the events that had at least one such good pair
    events_min1_os_pair = idxs['0'].count()>=1
    idxs2 = idxs[events_min1_os_pair]
    bestpair = idxs2[:, 0]
    
    first_muon_idx = bestpair['0']
    second_muon_idx = bestpair['1']

    #set the leading and subleading muons to pass the mask according to the pair
    muon_mask_active = out_muon_mask[events_min2_muons][events_min1_os_pair]
    muon_mask_active.content[muon_mask_active.starts + first_muon_idx] = True
    muon_mask_active.content[muon_mask_active.starts + second_muon_idx] = True
    
    return
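For readers unfamiliar with `argcross`: per event it pairs every muon index with every other (the full cross-join), which the `'0' < '1'` mask then thins to unique pairs. A plain-Python sketch of the same semantics on the toy charges from the problem statement:

```python
from itertools import product

charges = [[-1, 1, -1], [1, 1, -1]]  # toy events from the problem statement
# argcross pairs every muon index with every other, event by event
cross = [list(product(range(len(ev)), repeat=2)) for ev in charges]
# len(cross[0]) == 9: the cross-join is quadratic in the event's muon count
unique = [[(i, j) for i, j in ev if i < j] for ev in cross]
# unique[0] == [(0, 1), (0, 2), (1, 2)]
```

This quadratic blow-up per event is what the later profiling discussion circles back to.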

Direct Numba loop, runtime 7.5 ms:

@numba.jit(nopython=True)
def get_os_muons_numba(muon_charges_content, muon_charges_offsets, muon_mask_out):
    for iev in range(len(muon_charges_offsets) - 1):
        start = muon_charges_offsets[iev]
        end = muon_charges_offsets[iev + 1]
        
        if end-start >= 2:
            ch1 = muon_charges_content[start]
            for imuon in range(start+1, end):
                ch2 = muon_charges_content[imuon]
                if ch2 != ch1:
                    muon_mask_out[start] = True
                    muon_mask_out[imuon] = True
                    break
    return
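For reference, the same loop logic with the `@numba.jit` decorator removed, applied to the toy arrays from the problem statement (a plain-Python sketch, just to show the expected mask):

```python
def get_os_muons_plain(content, offsets, mask_out):
    # identical logic to get_os_muons_numba, minus the JIT decorator
    for iev in range(len(offsets) - 1):
        start, end = offsets[iev], offsets[iev + 1]
        if end - start >= 2:
            ch1 = content[start]
            for imuon in range(start + 1, end):
                if content[imuon] != ch1:  # first opposite-sign partner
                    mask_out[start] = True
                    mask_out[imuon] = True
                    break

mask = [False] * 6
get_os_muons_plain([-1, 1, -1, 1, 1, -1], [0, 3, 6], mask)
# mask == [True, True, False, True, False, True]
```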

CUDA, timing ~500 microseconds:

@cuda.jit('void(int8[:], int64[:], int8[:])')
def get_os_muons_cuda(muon_charges_content, muon_charges_offsets, muon_mask_out):
    xi = cuda.grid(1)
    xstride = cuda.gridsize(1)
    
    for iev in range(xi, muon_charges_offsets.shape[0]-1, xstride):
        start = muon_charges_offsets[iev]
        end = muon_charges_offsets[iev + 1]
        
        ch1 = muon_charges_content[start]
        
        for imuon in range(start+1, end):
            ch2 = muon_charges_content[imuon]
            if (ch2 != ch1):
                muon_mask_out[start] = 1
                muon_mask_out[imuon] = 1
                break
    return

I wasn't sure what's the appropriate way to raise this, feel free to move the discussion elsewhere.

@jpivarski (Member)

To stay pure Numpy, some of the jagged method implementations involve many full passes over the data. The only performance rule I mostly followed was to never use a Python for loop that scales with the number of entries. (There are very few cases where that couldn't be avoided; I wonder if you found one.)

In a large part because of this, I'm developing Numba implementations of all the operations, much like your PR for CUDA. An analyst's workflow might still be a little slower for passing over the data multiple times, but each operation would individually be near maximal. We're also hoping to mentor a GSoC student to do pre-compiled methods, to replace a one-time JIT-compilation cost for each type of structure with a one-time virtual method call at the beginning of each operation over many events.

As for this particular performance gaffe, if you can narrow it down to one operation that's taking far longer than it seems it should, I can look into it.

nsmith- (Contributor) commented Mar 21, 2019

@jpata if you could provide an event sample, I would be interested in trying a few different ways of expressing this, to see if another method is faster.

jpata (Author) commented Mar 21, 2019

@nsmith- can you access the following files? I'm curious if there is a better way to write down what I attempted!

/afs/cern.ch/user/j/jpata/public/awkward/0C2B3A66-B042-E811-8C6D-44A8423DE2C0.root
/afs/cern.ch/user/j/jpata/public/awkward/dimuon-opposite-sign-benchmarks.ipynb

@jpivarski (Member)

Thanks, @nsmith-, for offering to look for a workaround. I'd also like to know if one of the operations is a hundred times slower than it looks like it should be. (Remember JaggedArray.argmax?)

jpata (Author) commented Mar 21, 2019

I ran a cProfile check and the overwhelming time is spent in argcross.

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    262/1    0.014    0.000  192.166  192.166 {built-in method builtins.exec}
        1    0.424    0.424  192.166  192.166 get-os.py:1(<module>)
      100    1.142    0.011  189.025    1.890 get-os.py:4(get_os_muons_awkward)
      100    0.604    0.006  119.871    1.199 jagged.py:1126(argcross)
     1100    2.121    0.002   93.147    0.085 jagged.py:66(offsets2parents)
      600   11.337    0.019   84.912    0.142 jagged.py:773(tojagged)
     1100   82.235    0.075   82.235    0.075 {method 'at' of 'numpy.ufunc' objects}
      200    0.209    0.001   67.315    0.337 jagged.py:757(__setitem__)
     2200    0.008    0.000   67.236    0.031 jagged.py:413(parents)
      800    0.460    0.001   61.157    0.076 mixins.py:22(func)
      800    6.704    0.008   60.696    0.076 jagged.py:877(__array_ufunc__)
2200/1800    4.684    0.002   55.309    0.031 jagged.py:505(__getitem__)
      100   15.957    0.160   26.201    0.262 jagged.py:1102(_argcross)
     2402    0.013    0.000   11.138    0.005 fromnumeric.py:2252(cumsum)
     2402    0.011    0.000   11.125    0.005 fromnumeric.py:54(_wrapfunc)
     2402   11.112    0.005   11.112    0.005 {method 'cumsum' of 'numpy.ndarray' objects}
      200    6.758    0.034   10.926    0.055 indexed.py:44(invert)
     1600    1.553    0.001    8.295    0.005 jagged.py:817(_tojagged)

@jpivarski (Member)

How big is the biggest cross-join? That is, for your a.cross(b), what's (a.counts * b.counts).max()? Or better yet, (a.counts * b.counts).sum(), which is the size of the per-event cross-join.
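A quick way to check those numbers from the offsets (a sketch with made-up toy offsets; for a self-cross like `ch2.argcross(ch2)`, a and b are the same array):

```python
import numpy as np

offsets = np.array([0, 3, 6, 8])         # toy offsets: events of 3, 3, 2 muons
counts = np.diff(offsets)                # per-event muon counts
per_event = counts * counts              # cross-join size per event
# per_event.max() == 9, per_event.sum() == 22
```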

It could be that this is the case we were worried about with vectorizing nested loops: the scalar-code way is to iterate over the same data multiple times (a.counts * b.counts number of steps, but the intermediate memory use is a.counts + b.counts), and the vector-code way is to explode the indexes out to all cases and do a single loop—or blocks of concurrently applied vectors—on the flattened double-loop. The latter definitely uses more memory and may not be possible in all cases. A sufficiently high-level implementation, which doesn't exist yet, would want to look at the argcross the user wants to do and decide whether the scalar or vector is a better choice. Underused warps in a GPU are better than not being able to fit a full cross-join in the GPU's memory!

But then, maybe I'm barking up the wrong tree—maybe it's not the size of the intermediate array in this case. Another major contributor is cumsum. If this is the cumsum that is performed inside argcross, through counts2offsets, then this is something that's ripe for optimization, once we go beyond Numpy. Naively, a cumulative sum involves a loop-carried dependency and can't be vectorized, even on a CPU, even with compiler flags asking for it. It can be vectorized by a Hillis-Steele algorithm, which we simply have to write ourselves to take advantage of the vector instructions available on most machines (at least AVX-2). That's one of the optimizations we're looking forward to in these specialized extensions, like awkward-numba, awkward-pybind11, and (maybe) awkward-cuda.
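The Hillis-Steele scan mentioned above can be sketched in Numpy itself: log2(n) full-array passes replace the one serial pass, and each pass is a single vectorizable step (a sketch, not the proposed awkward-numba implementation):

```python
import numpy as np

def hillis_steele_cumsum(a):
    """Inclusive prefix sum in O(log n) vectorized passes."""
    x = np.asarray(a).copy()
    d = 1
    while d < len(x):
        # each pass adds, elementwise, the value d slots back
        x[d:] = x[d:] + x[:len(x) - d]
        d *= 2
    return x

a = np.array([3, 1, 4, 1, 5, 9, 2, 6])
# hillis_steele_cumsum(a) matches np.cumsum(a)
```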

nsmith- (Contributor) commented Mar 21, 2019

Here is a slightly faster version, also a bit more concise:

def get_os_muons_awkward_2(charge):
    pair_idx = charge.argchoose(2)
    valid_idx = pair_idx[charge[pair_idx.i0]*charge[pair_idx.i1] == -1]
    first_valid = valid_idx[:,:1]
    # Now we have to get hacky since awkward arrays are immutable
    out_muon_mask = charge.zeros_like()
    raw_i0 = (out_muon_mask.starts + first_valid.i0).flatten()
    out_muon_mask.content[raw_i0] = 1
    raw_i1 = (out_muon_mask.starts + first_valid.i1).flatten()
    out_muon_mask.content[raw_i1] = 1
    return out_muon_mask

which clocks in at

902 ms ± 5.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

so a factor of 2. This is also using #102, which sorts the pairs in a way that naturally satisfies your requirement to keep the first two opposite-sign muons (i.e. the highest sum-pT pair).
But in general I don't expect any pair combinatorics in pure numpy to beat a numba implementation. Once #83 is complete we could have an awkward-numba version of combinatorics that runs faster. It looks like the dominant cost is the jagged getitem:

from pyinstrument import Profiler
profiler = Profiler()
profiler.start()
get_os_muons_awkward_2(charge)
get_os_muons_awkward_2(charge)
get_os_muons_awkward_2(charge)
profiler.stop()
print(profiler.output_text(unicode=True, color=True))
2.883 run_code  IPython/core/interactiveshell.py:3230
├─ 1.012 <module>  <ipython-input-11-ac6d3c11eed0>:4
│  └─ 1.006 get_os_muons_awkward_2  <ipython-input-5-cd8b2ce452a9>:1
│     ├─ 0.461 __getitem__  awkward/array/jagged.py:505
│     │  ├─ 0.209 tojagged  awkward/array/jagged.py:773
│     │  │  └─ 0.178 parents  awkward/array/jagged.py:413
│     │  │     └─ 0.175 offsets2parents  awkward/array/jagged.py:66
│     │  ├─ 0.108 parents  awkward/array/jagged.py:413
│     │  │  └─ 0.103 offsets2parents  awkward/array/jagged.py:66
│     │  ├─ 0.032 _tojagged  awkward/array/jagged.py:817
│     │  └─ 0.031 sum  awkward/array/base.py:177
│     │     └─ 0.029 _reduce  awkward/array/jagged.py:1299
│     ├─ 0.225 __array_ufunc__  awkward/array/jagged.py:877
│     │  └─ 0.178 parents  awkward/array/jagged.py:413
│     │     └─ 0.175 offsets2parents  awkward/array/jagged.py:66
│     ├─ 0.207 argchoose  awkward/array/jagged.py:1066
│     │  ├─ 0.091 offsets2parents  awkward/array/jagged.py:66
│     │  └─ 0.040 repeat  numpy/core/fromnumeric.py:429
│     │     └─ 0.040 _wrapfunc  numpy/core/fromnumeric.py:54
│     └─ 0.074 func  numpy/lib/mixins.py:22
│        └─ 0.074 __array_ufunc__  awkward/array/jagged.py:877
│           └─ 0.040 _tojagged  awkward/array/jagged.py:817
├─ 0.965 <module>  <ipython-input-11-ac6d3c11eed0>:5
│  └─ 0.959 get_os_muons_awkward_2  <ipython-input-5-cd8b2ce452a9>:1
│     ├─ 0.452 __getitem__  awkward/array/jagged.py:505
│     │  ├─ 0.234 tojagged  awkward/array/jagged.py:773
│     │  │  └─ 0.185 parents  awkward/array/jagged.py:413
│     │  │     └─ 0.182 offsets2parents  awkward/array/jagged.py:66
│     │  └─ 0.087 parents  awkward/array/jagged.py:413
│     │     └─ 0.085 offsets2parents  awkward/array/jagged.py:66
│     ├─ 0.216 __array_ufunc__  awkward/array/jagged.py:877
│     │  └─ 0.174 parents  awkward/array/jagged.py:413
│     │     └─ 0.171 offsets2parents  awkward/array/jagged.py:66
│     ├─ 0.174 argchoose  awkward/array/jagged.py:1066
│     │  └─ 0.089 offsets2parents  awkward/array/jagged.py:66
│     └─ 0.079 func  numpy/lib/mixins.py:22
│        └─ 0.068 __array_ufunc__  awkward/array/jagged.py:877
│           └─ 0.039 _tojagged  awkward/array/jagged.py:817
└─ 0.905 <module>  <ipython-input-11-ac6d3c11eed0>:6
   └─ 0.899 get_os_muons_awkward_2  <ipython-input-5-cd8b2ce452a9>:1
      ├─ 0.424 __getitem__  awkward/array/jagged.py:505
      │  ├─ 0.211 tojagged  awkward/array/jagged.py:773
      │  │  └─ 0.182 parents  awkward/array/jagged.py:413
      │  │     └─ 0.179 offsets2parents  awkward/array/jagged.py:66
      │  └─ 0.088 parents  awkward/array/jagged.py:413
      │     └─ 0.087 offsets2parents  awkward/array/jagged.py:66
      ├─ 0.219 __array_ufunc__  awkward/array/jagged.py:877
      │  └─ 0.176 parents  awkward/array/jagged.py:413
      │     └─ 0.174 offsets2parents  awkward/array/jagged.py:66
      ├─ 0.168 argchoose  awkward/array/jagged.py:1066
      │  ├─ 0.086 offsets2parents  awkward/array/jagged.py:66
      │  └─ 0.029 repeat  numpy/core/fromnumeric.py:429
      │     └─ 0.029 _wrapfunc  numpy/core/fromnumeric.py:54
      └─ 0.049 func  numpy/lib/mixins.py:22
         └─ 0.049 __array_ufunc__  awkward/array/jagged.py:877
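The pair logic in get_os_muons_awkward_2 can be checked against the toy charges from the problem statement with plain-Python combinatorics (a sketch; itertools.combinations plays the role of argchoose(2), assuming the same lexicographic pair order):

```python
from itertools import combinations

charges = [[-1, 1, -1], [1, 1, -1]]  # toy events from the problem statement
mask = []
for ev in charges:
    ev_mask = [0] * len(ev)
    # unique index pairs per event, like argchoose(2)
    for i, j in combinations(range(len(ev)), 2):
        if ev[i] * ev[j] == -1:      # opposite-sign pair
            ev_mask[i] = ev_mask[j] = 1
            break                    # keep only the first valid pair
    mask.extend(ev_mask)
# mask == [1, 1, 0, 1, 0, 1]
```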
