ENH: Optimize calculate #175

Merged
merged 8 commits into develop from optimize_calculate on Jun 5, 2018

@bocklund
Collaborator

bocklund commented Jun 3, 2018

Summary

For a set of 30M points (×16 DOF ≈ 500M array elements), calculate had an overhead of ~35s beyond calling the objective function. These changes reduce that overhead to ~13s in the general case and ~5s in special cases.

Problem

When working with a large (8 sublattice) binary system, which results in points arrays of approximately 500M array elements at a point density of 1000, calling calculate was very slow.

This PR introduces some optimizations to speed up calculate.

For reference, here is the script that was run:

from pycalphad import calculate, Database, Model

dbf = Database('/Users/brandon/Projects/notebooks/pycalphad/AB8SL_v4.TDB')

# build the Model once up front to save time initializing models on each call
mod = Model(dbf, ['A', 'B'], 'FCC')

# warm the cache
calc_res = calculate(dbf, ['A', 'B'], 'FCC', pdens=1000, model=mod)
# time the warm-cache call (IPython magic)
%timeit calc_res = calculate(dbf, ['A', 'B'], 'FCC', pdens=1000, model=mod)

Details

Running this on develop gives:

1min 15s ± 783 ms per loop (mean ± std. dev. of 7 runs, 1 loop each), where ~40s comes from calling the obj method of a PhaseRecord. Here I am assuming that this cannot be changed.

The main changes are:

  1. Build the dof array by transposing before, rather than after, the concatenation, so the concatenated arrays line up better in memory. Leads to

    58.7 s ± 2.74 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

  2. Append NaNs to extend the points array, rather than preallocating a whole array of NaNs and then copying the points over. Both methods have to copy the points, so it's not the most efficient, but this is a little faster because it avoids allocating np.size(points) worth of np.nan. The ideal situation (as commented) would be to give _compute_phase_values the NaN-padded points array in the first place, then slice it up (if possible) inside the function, preventing any copying of memory. This optimization gives

    53.1 s ± 4.65 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

  3. For the hot path, such as single-phase calculations and other special cases where the current internal degrees of freedom equal the maximum internal degrees of freedom, we can avoid any extending/copying completely (see the sketch below). Gives

    46.6 s ± 884 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    which is close to the objective function time.

Except for the additional overhead of the if statement checking the array shapes, this should be cheaper across the board, even for small points arrays, and should allow us to handle larger sublattice and multicomponent systems more gracefully.
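As a rough illustration of items 2 and 3, here is a minimal sketch of the padding logic (pad_points_to_max_dof is a hypothetical name, not pycalphad API, and this is not the actual implementation):

import numpy as np

def pad_points_to_max_dof(points, maximum_internal_dof):
    # Hypothetical sketch; points has shape (..., current_internal_dof).
    current_dof = points.shape[-1]
    if current_dof == maximum_internal_dof:
        # Hot path (item 3): shapes already match, so no extension or copy is needed.
        return points
    # General case (item 2): append NaN columns rather than preallocating a full
    # NaN array and copying the points into it.
    pad_shape = points.shape[:-1] + (maximum_internal_dof - current_dof,)
    return np.concatenate((points, np.full(pad_shape, np.nan)), axis=-1)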

@bocklund bocklund requested a review from richardotis Jun 3, 2018

@bocklund

Collaborator

bocklund commented Jun 3, 2018

For the record:

After this optimization, assuming a precompiled Model that doesn't have to be initialized in equilibrium, ~98% of an equilibrium call is spent in dask calculating things.

Of that time, 92.3% is spent in calculate. Virtually all of calculate is spent in _compute_phase_values, where it's 65.2% calling the objective, 20.6% array allocation/copying (np.concatenate, np.full), 11% internal logic, and the rest small bits.

So, in order to be more performant for bigger systems, we need to (IMO) do the following:

  1. Find ways to speed up the objective function (@richardotis suggested reducing branching at Model creation)
  2. Reduce array allocation/copying
  3. Continue to speed up internal logic in _compute_phase_values
@richardotis

Collaborator

richardotis commented Jun 3, 2018

Thanks for this work. If you want to eliminate some more array allocations, I would also look at calculate.py/_sample_phase_constitution() and utils.py/endmember_matrix(). I think it is possible to change them to accept a maximum_internal_dof argument to ensure all the points arrays are pre-padded. Note that you may then need to tweak some nan-checking logic like calculate.py:116.

@bocklund

Collaborator

bocklund commented Jun 3, 2018

My reasoning for not looking into padding here was that I think phase_record.obj(phase_output, dof) requires the inputs to be C-ordered blocks of arrays in memory, but pre-padding would create more columns, and slicing would create a view of that array that is not contiguous in memory.

Will passing in the view still work?
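To make the concern concrete, a small NumPy illustration (the shapes here are arbitrary):

import numpy as np

padded = np.zeros((4, 6))             # e.g. points pre-padded out to 6 internal DOF
view = padded[:, :4]                  # slice back down to the 4 "real" DOF columns
print(padded.flags['C_CONTIGUOUS'])   # True
print(view.flags['C_CONTIGUOUS'])     # False: the sliced view is not contiguous in memory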

@richardotis

Collaborator

richardotis commented Jun 3, 2018

There's a codegen'd Cython wrapper over the raw function which I think will check the dof array shape at call time and do the pointer arithmetic, so all the array indexing operations should safely ignore any nan padding at the end of each row.

We should probably also consider Cythonization of _compute_phase_values given the huge number of array operations in the hot path.

@richardotis

Collaborator

richardotis commented Jun 3, 2018

Oh right, we bypass the Cython wrapper function. Hmm, I'll have to take a peek at the raw C code generated again to figure out if it's safe.

@richardotis

Collaborator

richardotis commented Jun 3, 2018

I bet we'd see some nice speedups with a Cython pass on _compute_phase_values. We could then use OpenMP to chunk the rows of big point arrays and perform the computation in parallel.
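As a rough Python-level stand-in for that idea (the actual suggestion is Cython + OpenMP; chunked_apply and the thread pool here are only an illustration, not pycalphad code):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunked_apply(func, points, n_chunks=4):
    # Split the rows of a large points array into chunks, apply func to each
    # chunk in parallel, and reassemble the results.
    chunks = np.array_split(points, n_chunks, axis=0)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        results = list(pool.map(func, chunks))
    return np.concatenate(results, axis=0)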

@richardotis

Collaborator

richardotis commented Jun 3, 2018

@bocklund I think we could probably keep to what you have here and leave the other stuff to a future PR. This is already a nice improvement. Do you agree? (I will still perform a review.)

@richardotis

Collaborator

richardotis commented Jun 3, 2018

As you suspected, pre-padding the points array would not be a safe optimization. The underlying raw C function has the size of the input inner dimension hard-coded in the pointer arithmetic for the dof array.

@richardotis

Code coverage analysis indicates that calculate.py:218 is never executed by the test suite. Please add a test to exercise that code path (or remove that code branch).

@bocklund

Collaborator

bocklund commented Jun 5, 2018

My understanding is that we have to reshape based on the maximum tieline vertices no matter what, so we'll never hit that code. I decided to remove it.

@bocklund

Collaborator

bocklund commented Jun 5, 2018

Merging this. For future reference, line profiling has pointed out the following hotspots that could be optimized in the future (in order of time; times in seconds are given as an order-of-magnitude gauge). Even after these changes, _compute_phase_values takes almost the entire time of calculate after the cache is warm.

calculate

We can avoid all of the Model-labeled code by passing our own Models and callables. The cached-labeled code is cached and only needs to be run once where the parameters are the same (I'm not so worried about fixing these immediately because they don't slow down equilibrium).

  1. (line 408) phase_ds = _compute_phase_values(nonvacant_components, str_statevar_dict, points, phase_record, output, maximum_internal_dof, broadcast=broadcast, largest_energy=float(largest_energy), fake_points=fp) 44s
  2. (line 409) points = _sample_phase_constitution(phase_name, phase_obj.constituents, sublattice_dof, comps, tuple(variables), sampler_dict[phase_name] or point_sample, fixedgrid_dict[phase_name], pdens_dict[phase_name]) 35s - cached
  3. (line 405) phase_record = PhaseRecord_from_cython(comps, list(statevar_dict.keys()) + variables, np.array(dbf.phases[phase_name].sublattices, dtype=np.float), param_values, comp_sets[phase_name], None, None, mass_dict[phase_name], None) 19s - triggers JIT, cached
  4. (line 385) undefs = list(out.atoms(Symbol) - out.atoms(v.StateVariable)) 78s - Model
  5. (line 387) out = out.xreplace({undef: float(0)}) 0.5s - Model
  6. (line 389) comp_sets[phase_name] = build_functions(out, list(statevar_dict.keys()) + variables, include_obj=True, include_grad=False, parameters=param_symbols) 0.2s - Model
  7. (line 400) mass_dict[phase_name] = [build_functions(mod.moles(el), list(statevar_dict.keys()) + variables, include_obj=True, include_grad=False, parameters=param_symbols) for el in pure_elements] 0.12s - Model

_compute_phase_values

(This happens per phase in calculate.) It's mostly calculations and allocation.

  1. (line 188) phase_record.obj(phase_output, dof) 36s
  2. (line 185) dof = np.ascontiguousarray(np.concatenate((bc_statevars.T, pts), axis=1)) 3.7s
  3. (line 190) phase_record.mass_obj(phase_compositions[:,el_idx], dof, el_idx) 1.8s
  4. (line 206) phase_names = np.full(points.shape[:-1], phase_record.phase_name, dtype='U'+str(len(phase_record.phase_name))) 0.5s
  5. (line 164) statevars = np.meshgrid(*itertools.chain(statevar_dict.values(), [np.empty(points.shape[-2])]), sparse=True, indexing='ij')[:-1] 0.3s
  6. (line 183) bc_statevars = np.ascontiguousarray([broadcast_to(x, points.shape[:-1]).reshape(-1) for x in statevars]) 0.2s
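For reference, a breakdown like the above can be reproduced with line_profiler in IPython, reusing dbf and mod from the script earlier in this PR (assumes line_profiler is installed; the import path for _compute_phase_values is assumed):

%load_ext line_profiler
from pycalphad.core.calculate import _compute_phase_values
%lprun -f _compute_phase_values calculate(dbf, ['A', 'B'], 'FCC', pdens=1000, model=mod)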

@bocklund bocklund merged commit 7239366 into develop Jun 5, 2018

5 checks passed

continuous-integration/appveyor/branch AppVeyor build succeeded
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed
coverage/coveralls Coverage increased (+0.02%) to 87.789%

@bocklund bocklund deleted the optimize_calculate branch Jul 17, 2018
