
Speed up scoring function when used with Monte Carlo #959

Closed
benmwebb opened this issue Aug 4, 2016 · 10 comments

Comments

@benmwebb
Member

benmwebb commented Aug 4, 2016

The default IMP scoring function (RestraintsScoringFunction) computes the model score by simply evaluating all restraints and returning their sum. This is really inefficient for sampling schemes such as Monte Carlo which rely on making a large number of small perturbations to the system, since many terms in the scoring function are needlessly recomputed at each step.
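
To make the cost concrete, here is a minimal standalone sketch (hypothetical types, not IMP's actual classes) of this naive scheme: every restraint is re-evaluated on every call, even when a move touched none of the particles a given term depends on.

```cpp
// Hypothetical, simplified picture of "evaluate everything and sum";
// not the real IMP RestraintsScoringFunction implementation.
#include <cstddef>
#include <vector>

struct Restraint {
  virtual ~Restraint() {}
  virtual double evaluate() const = 0;  // full recomputation every call
};

// Called once per Monte Carlo step: all terms are recomputed, even those
// whose particles were not perturbed by the move.
double total_score(const std::vector<const Restraint*>& restraints) {
  double sum = 0.0;
  for (std::size_t i = 0; i < restraints.size(); ++i) {
    sum += restraints[i]->evaluate();
  }
  return sum;
}
```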

There's an existing mechanism to address this: IncrementalScoringFunction. This works by decomposing each Restraint into simpler ones, then building a scoring function for each Particle in the system made up of sums of these simpler restraints. Moving a given Particle then only requires recomputing that Particle's scoring function. However, this approach has drawbacks:

  • Not every restraint can be usefully decomposed.
  • RestraintSet information is lost.
  • The current implementation is designed to function optimally only for moves of a single Particle.
  • For a large system, the correspondingly large number of scoring functions may exhaust system memory.
  • In @Pellarin's tests with crosslinking restraints, IncrementalScoringFunction is actually slower than RestraintsScoringFunction.
  • The code was never finished or thoroughly tested, and so is possibly buggy.

IMP needs a more efficient mechanism for partial scoring function updates in order to make Monte Carlo usable for larger systems.

@benmwebb benmwebb self-assigned this Aug 4, 2016
@benmwebb
Member Author

benmwebb commented Aug 5, 2016

Proposal: add a single-bit 'moved' flag for each Particle. This flag is cleared for all Particles after the scoring function is evaluated, and set for a given Particle whenever an attribute value is changed or an attribute is added or removed. Individual restraints can then query this flag at evaluate time and return cached values if the relevant Particles have not moved. (These cached values would probably just be the value of the restraint, not the first derivatives. Samplers that use first derivatives, such as MolecularDynamics, typically move all Particles in the system, so they wouldn't benefit much from partial evaluation anyway.) For example:

  • PairRestraint: if either Particle has its moved flag set, compute, cache and return the score value. If neither Particle does, return the cached value.
  • PairsRestraint and similar (e.g. nonbonded restraints): maintain a vector of cached values, one for each pair in the internal ParticleIndexPair list. Split that list into two: one containing only pairs where neither Particle has moved, and the other containing the remaining pairs. Return the sum of cached values for the first list; compute and cache values for the second. (If the list is so large that the cached values consume significant memory, it could instead be split into blocks and the sum of all pairs in each block cached; then the "not moved" list would contain only blocks in which no particle has moved.)
  • Graph-based restraints such as ConnectivityRestraint: cache the graph complete with node and/or edge weights. Recompute only the subgraph containing moved particles, then proceed to calculate the minimum spanning tree. As a further optimization, if the new graph is sufficiently similar to the old one, the MST step could possibly be skipped too.

This should have the advantage that it is largely transparent to the user, and restraints can use the 'moved' information more intelligently than in a simple decomposition.
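
A rough sketch of how the first bullet might look, using hypothetical standalone types rather than the real IMP classes (the harmonic-style score here is an arbitrary placeholder):

```cpp
// Sketch of the per-Restraint caching idea: a pair restraint consults
// per-particle "moved" bits and returns its cached score when neither
// particle has changed since the last evaluation.
#include <algorithm>
#include <vector>

typedef int ParticleIndex;

struct Model {
  std::vector<double> x, y, z;  // particle coordinates
  std::vector<bool> moved;      // one "moved" bit per particle

  // Cleared for all particles after each scoring function evaluation.
  void reset_moved_flags() { std::fill(moved.begin(), moved.end(), false); }
};

class PairScoreCache {
  ParticleIndex a_, b_;
  double cached_score_;
  bool have_cache_;
public:
  PairScoreCache(ParticleIndex a, ParticleIndex b)
      : a_(a), b_(b), cached_score_(0.0), have_cache_(false) {}

  double evaluate(const Model& m) {
    // Neither particle moved since the last evaluation: reuse the cache.
    if (have_cache_ && !m.moved[a_] && !m.moved[b_]) {
      return cached_score_;
    }
    // Otherwise recompute (placeholder: squared distance between the pair).
    double dx = m.x[a_] - m.x[b_];
    double dy = m.y[a_] - m.y[b_];
    double dz = m.z[a_] - m.z[b_];
    cached_score_ = dx * dx + dy * dy + dz * dz;
    have_cache_ = true;
    return cached_score_;
  }
};
```

The other bullets follow the same pattern, just with a vector (or blocks) of cached values, or a cached weighted graph, in place of a single cached score.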

@cgreenberg
Contributor

That sounds great! I'm assuming this will work on rigid bodies as well? We could potentially do some caching to incrementally update EM scoring functions (storing a list of which components contribute to each grid point and only recalculating the densities of contributors to those points) although that might require a lot of memory.

@benmwebb
Member Author

benmwebb commented Aug 5, 2016

I'm assuming this will work on rigid bodies as well?

Should do. If you change a rigid body, at model update time all of the rigid body members will get their coordinates updated, so any score that depends on either the rigid body itself or the members should notice the particles have moved and do the "right thing" automatically.

@Pellarin
Contributor

Pellarin commented Aug 5, 2016

Awesome! It also looks like this implementation could be easily parallelized. Could that be taken into account somehow?

@benmwebb
Member Author

benmwebb commented Aug 5, 2016

It also looks like this implementation could be easily parallelized.

Not sure if you're replying to my comment or Charles's here. But model evaluation is already parallelized in IMP (at least with OpenMP) and this certainly shouldn't interfere with that. It certainly should be possible to parallelize individual restraints too, as per #832.

@barakr
Collaborator

barakr commented Aug 5, 2016

One note, semi-related to this discussion but strongly related to speed.

I was able to almost double the performance of the NPC BrownianDynamics simulations by being a bit more efficient about memory access. I did this by adding some 'secret' methods to Model to directly access the tables of attributes. Only use them sparingly, and only if (a) you know what you're doing, (b) the code is actually flagged by the profiler as a critical bottleneck, (c) you see that it actually helps, and (d) you're 100% sure you didn't break somebody else's code.

I believe the main reasons that memory access becomes faster are:

  1. CPUs like to access arrays in contiguous blocks.
  2. More efficient use of cache memory.
  3. No need to retrieve particle keys from memory over and over again - this might actually be the most important.
  4. This will also be great for future GPU programming, where you need to copy the property tables to the GPU.

Some example code (from atom/src/BrownianDynamics.cpp):

```cpp
// I. Get direct access to property tables:
double const* rotational_diffusion_coefficient_table =
    m->access_attribute_data(
        RigidBodyDiffusion::get_rotational_diffusion_coefficient_key());
double const* torque_tables[3];
for (unsigned int i = 0; i < 3; i++) {
  torque_tables[i] = core::RigidBody::access_torque_i_data(m, i);
}
double* quaternion_tables[4];
for (unsigned int i = 0; i < 4; i++) {
  quaternion_tables[i] = core::RigidBody::access_quaternion_i_data(m, i);
}

// II. Go over a bunch of particles and obtain values directly from the
// tables. (Note that this can probably be done even more efficiently.)
for (unsigned int i = begin; i < end; ++i) {
  ParticleIndex pi = ps[i];
  algebra::Rotation3D rot(1, 0, 0, 0);
  double rdc = rotational_diffusion_coefficient_table[pi.get_index()];
  if (IMP::internal::FloatAttributeTableTraits::get_is_valid(rdc)) {
    rot = compute_rotation_0(pi, dtfs, ikT, rdc, torque_tables);
    core::RigidBody(m, pi).apply_rotation_lazy_using_internal_tables(
        rot, quaternion_tables);
  }
}
```


@benmwebb
Member Author

benmwebb commented Aug 5, 2016

CPUs like to access arrays in contiguous blocks

Right, the particle moved flags would be a boost::dynamic_bitset, which I'm pretty sure is (a) contiguous in memory and (b) small (only one bit per particle), so it can probably fit in cache. So these accesses should be super-fast.
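
For reference, a minimal sketch of what that bitset might look like (assumed design for illustration; the committed code may differ):

```cpp
// One bit per particle, set when an attribute changes and cleared after
// each scoring function evaluation.
#include <boost/dynamic_bitset.hpp>
#include <cstddef>

class MovedFlags {
  boost::dynamic_bitset<> moved_;  // contiguous storage, one bit per particle
public:
  explicit MovedFlags(std::size_t n_particles) : moved_(n_particles) {}

  // Called whenever a particle attribute is added, removed or changed.
  void mark_moved(std::size_t particle_index) { moved_.set(particle_index); }

  // Queried by restraints at evaluate time.
  bool has_moved(std::size_t particle_index) const {
    return moved_.test(particle_index);
  }

  // Cleared for all particles after the scoring function is evaluated.
  void clear_all() { moved_.reset(); }
};
```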

benmwebb added a commit that referenced this issue Sep 14, 2016
This flag is cleared after the scoring function is calculated,
and set whenever a particle is 'changed' (newly created or
an attribute changed). This can be used to speed up restraint
evaluation by caching part or all of the score. Relates #959.
@shanot
Contributor

shanot commented Mar 27, 2017

Hi Ben,

I don't know if you're still working on this, but this is very interesting and will probably give a nice boost (unless this creates a lot of cache misses due to having to fetch more data).
If I understand correctly, for this to be functional we need to modify the movers so that they set the flag of particles that have been moved?

Another possibility I thought of would be to pass a list of movers to the restraints at each evaluation; this way the restraints would know which particles have been moved.

@benmwebb
Member Author

If I understand correctly, for this to be functional we need to modify the movers so that they set the flag of particles that have been moved?

No, the flag is set automatically when particle attributes are changed. Unfortunately my benchmarking of the code (the existing branch is fully functional) shows a small but noticeable slowdown with MD due to the overhead of this tracking. So not acceptable in its current form.

MC movers already return a list of particles that they've moved (the IncrementalScoringFunction uses this) and I'll use that instead in the next iteration. We'll get some speedups by also using this information in the ScoreStates.

@benmwebb
Member Author

This should largely be resolved by b8c4d59 and subsequent commits. These provide an alternative path for scoring function evaluation (essentially, an evaluate_moved() method which parallels the existing evaluate()) that takes information on the particles that moved since the last evaluation. This information is provided by the MC movers. Since it's a separate code path there is no performance impact on MD or other non-MC samplers. The default implementation of Restraint::evaluate_moved() is to just call the regular evaluate() so this change should be reasonably non-invasive.

Restraint subclasses can provide alternative implementations of evaluate_moved() to speed up evaluation. For example, RestraintSet and LogWrapper both skip any child restraint that acts only on non-moved particles, replacing them with cached scores from the previous evaluation. This dramatically speeds up crosslinking in my benchmarking.
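
As an illustration of that pattern, here is a hedged, standalone sketch (hypothetical types and names, not the actual RestraintSet/LogWrapper code) of a container-level evaluate_moved() that reuses cached child scores:

```cpp
// Children whose input particles are all unmoved contribute their cached
// score from the previous evaluation; only the affected children are
// re-evaluated.
#include <set>
#include <vector>

typedef int ParticleIndex;

struct ChildRestraint {
  std::vector<ParticleIndex> inputs;  // particles this restraint depends on
  double cached_score;                // score from the previous evaluation

  virtual double unprotected_evaluate() const = 0;  // full recomputation
  virtual ~ChildRestraint() {}
};

double evaluate_moved(std::vector<ChildRestraint*>& children,
                      const std::set<ParticleIndex>& moved) {
  double total = 0.0;
  for (std::size_t i = 0; i < children.size(); ++i) {
    ChildRestraint* c = children[i];
    bool touched = false;
    for (std::size_t j = 0; j < c->inputs.size(); ++j) {
      if (moved.count(c->inputs[j])) { touched = true; break; }
    }
    if (touched) {
      c->cached_score = c->unprotected_evaluate();  // recompute affected child
    }
    total += c->cached_score;  // cached value otherwise
  }
  return total;
}
```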

Similarly, ScoreStates which don't take any moved particles as input or output can be skipped at scoring function evaluation time. (For example in a typical PMI application we only need to call the states for rigid bodies that we moved, not all of them.) This is an 'opt-in' mechanism (via a ScoreState::can_skip flag), again to be minimally invasive for now.

It is easy to activate the new code path, simply by setting the score_moved flag in core::MonteCarlo or in PMI's ReplicaExchange0 class.

Support in more restraints and ScoreStates will be forthcoming. In particular, most container restraints (for example, excluded volume) don't yet support this optimization, and there are large performance gains still to be made there.
