
Don't start kicking off work units during parallel stylo traversal until
they're actually full.

This improves style sharing at the cost of a bit less parallelism. Fixes Gecko
bug 1385982. r=bholley
bzbarsky committed Aug 1, 2017
1 parent b49311c commit 2e0248752695101f948ca6251b272629e7a2b70f
Showing with 46 additions and 13 deletions.
  1. +46 −13 components/style/parallel.rs
@@ -34,9 +34,16 @@ use traversal::{DomTraversal, PerLevelTraversalData, PreTraverseToken};
 
 /// The maximum number of child nodes that we will process as a single unit.
 ///
-/// Larger values will increase style sharing cache hits and general DOM locality
-/// at the expense of decreased opportunities for parallelism. This value has not
-/// been measured and could potentially be tuned.
+/// Larger values will increase style sharing cache hits and general DOM
+/// locality at the expense of decreased opportunities for parallelism. There
+/// are some measurements in
+/// https://bugzilla.mozilla.org/show_bug.cgi?id=1385982#c11 and comments 12
+/// and 13 that investigate some slightly different values for the work unit
+/// size. If the size is significantly increased, make sure to adjust the
+/// condition for kicking off a new work unit in top_down_dom, because
+/// otherwise we're likely to end up doing too much work serially. For
+/// example, the condition there could become some fraction of WORK_UNIT_MAX
+/// instead of WORK_UNIT_MAX.
 pub const WORK_UNIT_MAX: usize = 16;
 
 /// A set of nodes, sized to the work unit. This gets copied when sent to other
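To make the tuning note concrete, here is a purely hypothetical sketch of what "some fraction of WORK_UNIT_MAX" could look like if the unit size were ever raised well beyond 16. The constants and helper below are invented for illustration and are not part of this commit:

    // Hypothetical variant sketched from the tuning note above; this is NOT
    // what the commit does (it spawns only once a full WORK_UNIT_MAX-sized
    // unit is queued). Names and values here are invented for illustration.
    const WORK_UNIT_MAX: usize = 64; // imagined, much larger value
    const SPAWN_THRESHOLD: usize = WORK_UNIT_MAX / 2; // "some fraction of WORK_UNIT_MAX"

    fn should_spawn(discovered_child_nodes: usize) -> bool {
        discovered_child_nodes >= SPAWN_THRESHOLD
    }

    fn main() {
        assert!(!should_spawn(16)); // too little queued work to hand off yet
        assert!(should_spawn(40));  // half of the (larger) unit is enough
    }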
@@ -137,7 +144,8 @@ fn create_thread_local_context<'scope, E, D>(
 /// * Never process a child before its parent (since child style depends on
 ///   parent style). If this were to happen, the styling algorithm would panic.
 /// * Prioritize discovering nodes as quickly as possible to maximize
-///   opportunities for parallelism.
+///   opportunities for parallelism. But this needs to be weighed against
+///   styling cousins on a single thread to improve sharing.
 /// * Style all the children of a given node (i.e. all sibling nodes) on
 ///   a single thread (with an upper bound to handle nodes with an
 ///   abnormally large number of children). This is important because we use
@@ -173,19 +181,44 @@ fn top_down_dom<'a, 'scope, E, D>(nodes: &'a [SendNode<E::ConcreteNode>],
     };
 
     for n in nodes {
-        // If the last node we processed produced children, spawn them off
-        // into a work item. We do this at the beginning of the loop (rather
-        // than at the end) so that we can traverse the children of the last
-        // sibling directly on this thread without a spawn call.
+        // If the last node we processed produced children, we may want to
+        // spawn them off into a work item. We do this at the beginning of
+        // the loop (rather than at the end) so that we can traverse our
+        // last bits of work directly on this thread without a spawn call.
         //
         // This has the important effect of removing the allocation and
         // context-switching overhead of the parallel traversal for perfectly
         // linear regions of the DOM, i.e.:
        //
         // <russian><doll><tag><nesting></nesting></tag></doll></russian>
         //
-        // Which are not at all uncommon.
-        if !discovered_child_nodes.is_empty() {
+        // which are not at all uncommon.
+        //
+        // There's a tension here between spawning off a work item as soon
+        // as discovered_child_nodes is nonempty and waiting until we have a
+        // full work item to do so. The former optimizes for speed of
+        // discovery (we'll start discovering the kids of the things in
+        // "nodes" ASAP). The latter gives us better sharing (e.g. we can
+        // share between cousins much better, because we don't hand them off
+        // as separate work items, which are likely to end up on separate
+        // threads) and gives us a chance to just handle everything on this
+        // thread for small DOM subtrees, as in the linear example above.
+        //
+        // There are performance and "number of ComputedValues"
+        // measurements for various testcases in
+        // https://bugzilla.mozilla.org/show_bug.cgi?id=1385982#c10 and
+        // following.
+        //
+        // The worst case behavior for waiting until we have a full work
+        // item is a deep tree which has WORK_UNIT_MAX "linear" branches,
+        // hence WORK_UNIT_MAX elements at each level. Such a tree would
+        // end up getting processed entirely sequentially, because we would
+        // process each level one at a time as a single work unit, whether
+        // via our end-of-loop tail call or not. If we kicked off a
+        // traversal as soon as we discovered kids, we would instead
+        // process such a tree more or less with a thread-per-branch,
+        // multiplexed across our actual threadpool.
+        if discovered_child_nodes.len() >= WORK_UNIT_MAX {
             let mut traversal_data_copy = traversal_data.clone();
             traversal_data_copy.current_dom_depth += 1;
             traverse_nodes(&*discovered_child_nodes,
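To put rough numbers on the worst case described in that comment, here is a back-of-the-envelope sketch. The figures are illustrative only, not taken from the bug's measurements, and the code is not Servo code:

    // Rough model of the worst-case tree shape: WORK_UNIT_MAX "linear"
    // branches, so every level holds exactly one full work unit.
    // Illustrative numbers only.
    const WORK_UNIT_MAX: usize = 16;
    const DEPTH: usize = 1_000; // depth of the hypothetical tree

    fn main() {
        // Waiting for a full unit: each level is exactly one unit, so the
        // levels are processed strictly one after another.
        let sequential_units = DEPTH;

        // Spawning as soon as any children are discovered: each branch can be
        // handed off on its own, so up to WORK_UNIT_MAX branches advance at
        // once, bounded by the size of the thread pool.
        let branches_in_flight = WORK_UNIT_MAX;

        println!("wait for full units: ~{} work units, all sequential", sequential_units);
        println!("spawn eagerly: up to {} branches in parallel", branches_in_flight);
    }

That pathological shape is the parallelism the commit gives up in exchange for better sharing on more common DOM shapes.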
@@ -213,9 +246,9 @@ fn top_down_dom<'a, 'scope, E, D>(nodes: &'a [SendNode<E::ConcreteNode>],
         }
     }
 
-    // Handle the children of the last element in this work unit. If any exist,
-    // we can process them (or at least one work unit's worth of them) directly
-    // on this thread by passing TailCall.
+    // Handle whatever elements we have queued up but not kicked off traversals
+    // for yet. If any exist, we can process them (or at least one work unit's
+    // worth of them) directly on this thread by passing TailCall.
     if !discovered_child_nodes.is_empty() {
         traversal_data.current_dom_depth += 1;
         traverse_nodes(&discovered_child_nodes,
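For readers who want the shape of the algorithm without the surrounding Servo machinery, here is a minimal, self-contained sketch of the batching strategy this commit adopts. The types and names (Node, style, process_work_unit) are invented, and plain std::thread stands in for Servo's rayon-backed thread pool; it is an illustration of the idea, not the real parallel.rs code:

    // Minimal sketch of the "spawn only full work units" strategy. Toy types
    // and std::thread stand in for Servo's real DOM types and thread pool.
    use std::sync::Arc;
    use std::thread;

    const WORK_UNIT_MAX: usize = 16;

    struct Node {
        name: String,
        children: Vec<Arc<Node>>,
    }

    // Stand-in for the per-node styling work (cascade, style sharing, etc.).
    fn style(node: &Node) {
        let _ = &node.name;
    }

    fn process_work_unit(nodes: Vec<Arc<Node>>) {
        let mut discovered: Vec<Arc<Node>> = Vec::new();
        let mut handles = Vec::new();

        for node in &nodes {
            // Only hand children off once a full unit's worth has been
            // queued; partial batches wait, which keeps cousins together and
            // improves style sharing.
            if discovered.len() >= WORK_UNIT_MAX {
                let batch = std::mem::take(&mut discovered);
                handles.push(thread::spawn(move || process_work_unit(batch)));
            }

            style(node);
            discovered.extend(node.children.iter().cloned());
        }

        // Whatever is queued but never filled a unit is handled right here on
        // this thread, mirroring the TailCall path in the real code.
        if !discovered.is_empty() {
            process_work_unit(discovered);
        }

        for handle in handles {
            handle.join().unwrap();
        }
    }

    fn main() {
        // A small two-level tree: the 40 children are processed as one batch,
        // and their 80 grandchildren are handed off in full units of
        // WORK_UNIT_MAX, with the remainder handled via the tail-call path.
        let leaf = |i: usize| Arc::new(Node { name: format!("leaf{}", i), children: vec![] });
        let child = |i: usize| Arc::new(Node {
            name: format!("child{}", i),
            children: vec![leaf(2 * i), leaf(2 * i + 1)],
        });
        let root = Arc::new(Node {
            name: "root".to_string(),
            children: (0..40).map(child).collect(),
        });
        process_work_unit(vec![root]);
    }

The two points this mirrors are the top-of-loop check that only fires on a full unit and the final block that keeps leftover, not-yet-full batches on the current thread, which is what lets linear runs of the DOM and small subtrees avoid spawning entirely.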
