
Speed up scoring function when used with Monte Carlo #959

Closed
benmwebb opened this issue Aug 4, 2016 · 10 comments

Comments

@benmwebb
Member

benmwebb commented Aug 4, 2016

The default IMP scoring function (RestraintsScoringFunction) computes the model score by simply evaluating all restraints and returning their sum. This is really inefficient for sampling schemes such as Monte Carlo which rely on making a large number of small perturbations to the system, since many terms in the scoring function are needlessly recomputed at each step.
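
To make the cost concrete, here is a minimal standalone sketch (hypothetical types, not IMP's actual classes) of this naive scheme: every restraint is re-evaluated on every call, even when a move touched none of the particles a given term depends on.

```cpp
// Hypothetical, simplified picture of "evaluate everything and sum";
// not the real IMP RestraintsScoringFunction implementation.
#include <cstddef>
#include <vector>

struct Restraint {
  virtual ~Restraint() {}
  virtual double evaluate() const = 0;  // full recomputation every call
};

// Called once per Monte Carlo step: all terms are recomputed, even those
// whose particles were not perturbed by the move.
double total_score(const std::vector<const Restraint*>& restraints) {
  double sum = 0.0;
  for (std::size_t i = 0; i < restraints.size(); ++i) {
    sum += restraints[i]->evaluate();
  }
  return sum;
}
```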

There's an existing mechanism to address this: IncrementalScoringFunction. This works by decomposing each Restraint into simpler ones, then building a scoring function for each Particle in the system made up of sums of these simpler restraints. Moving a given Particle then only requires recomputing that Particle's scoring function. However, this approach has drawbacks:

  • Not every restraint can be usefully decomposed.
  • RestraintSet information is lost.
  • The current implementation is designed to function optimally only for moves of a single Particle.
  • For a large system, the correspondingly large number of scoring functions may exhaust system memory.
  • In @Pellarin's tests with crosslinking restraints, IncrementalScoringFunction is actually slower than RestraintsScoringFunction.
  • The code was never finished or thoroughly tested, and so is possibly buggy.

IMP needs a more efficient mechanism for partial scoring function updates in order to make Monte Carlo usable for larger systems.

@benmwebb benmwebb self-assigned this Aug 4, 2016
@benmwebb
Member Author

benmwebb commented Aug 5, 2016

Proposal: add a single-bit 'moved' flag for each Particle. This flag is cleared for all Particles after the scoring function is evaluated, and set for a given Particle whenever an attribute value is changed or an attribute is added or removed. Individual restraints can then query this flag at evaluate time and return cached values if the relevant Particles have not moved. (These cached values would probably just be the value of the restraint, not the first derivatives. Samplers that use first derivatives, such as MolecularDynamics, typically move all Particles in the system, so they wouldn't benefit much from partial evaluation anyway.) For example:

  • PairRestraint: if either Particle has its moved flag set, compute, cache and return the score value. If neither Particle does, return the cached value.
  • PairsRestraint and similar (e.g. nonbonded restraints): maintain a vector of cached values, one for each pair in the internal ParticleIndexPair list. Split that list into two: one containing only pairs where neither Particle has moved, and the other containing the remaining pairs. Return the sum of cached values for the first list; compute and cache values for the second. (If the list is so large that the cached values consume significant memory, it could instead be split into blocks and the sum of all pairs in each block cached; then the "not moved" list would contain only blocks in which no particle has moved.)
  • Graph-based restraints such as ConnectivityRestraint: cache the graph complete with node and/or edge weights. Recompute only the subgraph containing moved particles, then proceed to calculate the minimum spanning tree. As a further optimization, if the new graph is sufficiently similar to the old one, the MST step could possibly be skipped too.

This should have the advantage that it is largely transparent to the user, and restraints can use the 'moved' information more intelligently than in a simple decomposition.
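
A rough sketch of how the first bullet might look, using hypothetical standalone types rather than the real IMP classes (the harmonic-style score here is an arbitrary placeholder):

```cpp
// Sketch of the per-Restraint caching idea: a pair restraint consults
// per-particle "moved" bits and returns its cached score when neither
// particle has changed since the last evaluation.
#include <algorithm>
#include <vector>

typedef int ParticleIndex;

struct Model {
  std::vector<double> x, y, z;  // particle coordinates
  std::vector<bool> moved;      // one "moved" bit per particle

  // Cleared for all particles after each scoring function evaluation.
  void reset_moved_flags() { std::fill(moved.begin(), moved.end(), false); }
};

class PairScoreCache {
  ParticleIndex a_, b_;
  double cached_score_;
  bool have_cache_;
public:
  PairScoreCache(ParticleIndex a, ParticleIndex b)
      : a_(a), b_(b), cached_score_(0.0), have_cache_(false) {}

  double evaluate(const Model& m) {
    // Neither particle moved since the last evaluation: reuse the cache.
    if (have_cache_ && !m.moved[a_] && !m.moved[b_]) {
      return cached_score_;
    }
    // Otherwise recompute (placeholder: squared distance between the pair).
    double dx = m.x[a_] - m.x[b_];
    double dy = m.y[a_] - m.y[b_];
    double dz = m.z[a_] - m.z[b_];
    cached_score_ = dx * dx + dy * dy + dz * dz;
    have_cache_ = true;
    return cached_score_;
  }
};
```

The other bullets follow the same pattern, just with a vector (or blocks) of cached values, or a cached weighted graph, in place of a single cached score.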

@cgreenberg
Contributor

That sounds great! I'm assuming this will work on rigid bodies as well? We could potentially do some caching to incrementally update EM scoring functions (storing a list of which components contribute to each grid point and only recalculating the densities of contributors to those points) although that might require a lot of memory.

@benmwebb
Member Author

benmwebb commented Aug 5, 2016

I'm assuming this will work on rigid bodies as well?

Should do. If you change a rigid body, at model update time all of the rigid body members will get their coordinates updated, so any score that depends on either the rigid body itself or the members should notice the particles have moved and do the "right thing" automatically.

@Pellarin
Contributor

Pellarin commented Aug 5, 2016

Awesome! It also looks like this implementation could be easily parallelized. Could that be taken into account somehow?

@benmwebb
Member Author

benmwebb commented Aug 5, 2016

It also looks like this implementation could be easily parallelized.

Not sure if you're replying to my comment or Charles's here. But model evaluation is already parallelized in IMP (at least with OpenMP) and this certainly shouldn't interfere with that. It certainly should be possible to parallelize individual restraints too, as per #832.

@barakr
Collaborator

barakr commented Aug 5, 2016

One note, semi-related to this discussion but strongly related to speed.

I was able to almost double the performance of the NPC BrownianDynamics simulations by being a bit more efficient about memory access. I did this by adding some 'secret' methods to Model to directly access the tables of attributes. Only use them sparingly, and only if (a) you know what you're doing, (b) the code is actually flagged by the profiler as a critical bottleneck, (c) you see that it actually helps, and (d) you're 100% sure you didn't break somebody else's code.

I believe the main reasons that memory access becomes faster are:

  1. CPUs like to access arrays in contiguous blocks.
  2. More efficient use of cache memory.
  3. No need to retrieve particle keys from memory over and over again - this might actually be the most important.
  4. This will also be great for future GPU programming, where you need to copy the property tables to the GPU.

Some example code (from atom/src/BrownianDynamics.cpp):

```cpp
// I. Get direct access to property tables:
double const* rotational_diffusion_coefficient_table =
    m->access_attribute_data(
        RigidBodyDiffusion::get_rotational_diffusion_coefficient_key());
double const* torque_tables[3];
for (unsigned int i = 0; i < 3; i++) {
  torque_tables[i] = core::RigidBody::access_torque_i_data(m, i);
}
double* quaternion_tables[4];
for (unsigned int i = 0; i < 4; i++) {
  quaternion_tables[i] = core::RigidBody::access_quaternion_i_data(m, i);
}

// II. Go over a bunch of particles and obtain values directly from the
// tables. (Note that this can probably be done even more efficiently.)
for (unsigned int i = begin; i < end; ++i) {
  ParticleIndex pi = ps[i];
  algebra::Rotation3D rot(1, 0, 0, 0);
  double rdc = rotational_diffusion_coefficient_table[pi.get_index()];
  if (IMP::internal::FloatAttributeTableTraits::get_is_valid(rdc)) {
    rot = compute_rotation_0(pi, dtfs, ikT, rdc, torque_tables);
    core::RigidBody(m, pi).apply_rotation_lazy_using_internal_tables(
        rot, quaternion_tables);
  }
}
```


@benmwebb
Member Author

benmwebb commented Aug 5, 2016

CPUs like to access arrays in contiguous blocks

Right, the particle moved flags would be a boost::dynamic_bitset, which I'm pretty sure is (a) contiguous in memory and (b) small (only one bit per particle), so it can probably fit in cache. So these accesses should be super-fast.
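
For reference, a minimal sketch of what that bitset might look like (assumed design for illustration; the committed code may differ):

```cpp
// One bit per particle, set when an attribute changes and cleared after
// each scoring function evaluation.
#include <boost/dynamic_bitset.hpp>
#include <cstddef>

class MovedFlags {
  boost::dynamic_bitset<> moved_;  // contiguous storage, one bit per particle
public:
  explicit MovedFlags(std::size_t n_particles) : moved_(n_particles) {}

  // Called whenever a particle attribute is added, removed or changed.
  void mark_moved(std::size_t particle_index) { moved_.set(particle_index); }

  // Queried by restraints at evaluate time.
  bool has_moved(std::size_t particle_index) const {
    return moved_.test(particle_index);
  }

  // Cleared for all particles after the scoring function is evaluated.
  void clear_all() { moved_.reset(); }
};
```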

benmwebb added a commit that referenced this issue Sep 14, 2016
This flag is cleared after the scoring function is calculated,
and set whenever a particle is 'changed' (newly created or
an attribute changed). This can be used to speed up restraint
evaluation by caching part or all of the score. Relates #959.
@shanot
Contributor

shanot commented Mar 27, 2017

Hi Ben,

I don't know if you're still working on this, but this is very interesting and will probably give a nice boost (unless this creates a lot of cache misses due to having to fetch more data).
If I understand correctly, for this to be functional we need to modify the movers so that they set the flag of particles that have been moved?

Another possibility I thought of would be to pass a list of movers to the restraints at each evaluation; this way the restraints would know which particles have been moved.

@benmwebb
Member Author

If I understand correctly, for this to be functional we need to modify the movers so that they set the flag of particles that have been moved?

No, the flag is set automatically when particle attributes are changed. Unfortunately my benchmarking of the code (the existing branch is fully functional) shows a small but noticeable slowdown with MD due to the overhead of this tracking. So not acceptable in its current form.

MC movers already return a list of particles that they've moved (the IncrementalScoringFunction uses this) and I'll use that instead in the next iteration. We'll get some speedups by also using this information in the ScoreStates.

@benmwebb
Member Author

This should largely be resolved by b8c4d59 and subsequent commits. These provide an alternative path for scoring function evaluation (essentially, an evaluate_moved() method which parallels the existing evaluate()) that takes information on the particles that moved since the last evaluation. This information is provided by the MC movers. Since it's a separate code path there is no performance impact on MD or other non-MC samplers. The default implementation of Restraint::evaluate_moved() is to just call the regular evaluate() so this change should be reasonably non-invasive.

Restraint subclasses can provide alternative implementations of evaluate_moved() to speed up evaluation. For example, RestraintSet and LogWrapper both skip any child restraint that acts only on non-moved particles, replacing them with cached scores from the previous evaluation. This dramatically speeds up crosslinking in my benchmarking.
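
As an illustration of that pattern, here is a hedged, standalone sketch (hypothetical types and names, not the actual RestraintSet/LogWrapper code) of a container-level evaluate_moved() that reuses cached child scores:

```cpp
// Children whose input particles are all unmoved contribute their cached
// score from the previous evaluation; only the affected children are
// re-evaluated.
#include <set>
#include <vector>

typedef int ParticleIndex;

struct ChildRestraint {
  std::vector<ParticleIndex> inputs;  // particles this restraint depends on
  double cached_score;                // score from the previous evaluation

  virtual double unprotected_evaluate() const = 0;  // full recomputation
  virtual ~ChildRestraint() {}
};

double evaluate_moved(std::vector<ChildRestraint*>& children,
                      const std::set<ParticleIndex>& moved) {
  double total = 0.0;
  for (std::size_t i = 0; i < children.size(); ++i) {
    ChildRestraint* c = children[i];
    bool touched = false;
    for (std::size_t j = 0; j < c->inputs.size(); ++j) {
      if (moved.count(c->inputs[j])) { touched = true; break; }
    }
    if (touched) {
      c->cached_score = c->unprotected_evaluate();  // recompute affected child
    }
    total += c->cached_score;  // cached value otherwise
  }
  return total;
}
```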

Similarly, ScoreStates which don't take any moved particles as input or output can be skipped at scoring function evaluation time. (For example in a typical PMI application we only need to call the states for rigid bodies that we moved, not all of them.) This is an 'opt-in' mechanism (via a ScoreState::can_skip flag), again to be minimally invasive for now.

It is easy to activate the new code path, simply by setting the score_moved flag in core::MonteCarlo or in PMI's ReplicaExchange0 class.

Support in more restraints and ScoreStates will be forthcoming. In particular, most container restraints (for example, excluded volume) don't yet support this optimization, and there are large performance gains still to be made there.
