
html5ever in the rustc-perf repository is memory-intensive #52028

Closed
Mark-Simulacrum opened this Issue Jul 3, 2018 · 33 comments

@Mark-Simulacrum
Member

Mark-Simulacrum commented Jul 3, 2018

I see OOMs locally with a 16 GB memory computer.

Source code: https://github.com/rust-lang-nursery/rustc-perf/tree/master/collector/benchmarks/html5ever

@nnethercote


Contributor

nnethercote commented Jul 4, 2018

I did a run with Massif:
[Massif graph]
Yep, that's a 14 GiB peak, with 12 GiB of it coming from a single stack trace involving visit_ty/visit_with. I'm not surprised it OOM'd on a 16 GiB machine.

Massif doesn't get great stack traces due to inlining, but DHAT does a better job by using debuginfo. Here's the important info:

-------------------- 2 of 500 --------------------
max-live:    12,884,901,888 in 1 blocks
tot-alloc:   12,884,901,888 in 1 blocks (avg size 12884901888.00)
deaths:      1, at avg age 219,437,305,943 (40.34% of prog lifetime) 
acc-ratios:  0.60 rd, 0.60 wr  (7,768,302,580 b-read, 7,768,302,580 b-written)
   at 0x4C2DECF: malloc (in /usr/lib/valgrind/vgpreload_exp-dhat-amd64-linux.so)
   by 0x7A673EC: alloc (alloc.rs:62)
   by 0x7A673EC: alloc (alloc.rs:123)
   by 0x7A673EC: reserve_internal<(&rustc::ty::sty::RegionKind, rustc::mir::Location),alloc::alloc::Global> (raw_vec.rs:679)
   by 0x7A673EC: reserve<(&rustc::ty::sty::RegionKind, rustc::mir::Location),alloc::alloc::Global> (raw_vec.rs:502)
   by 0x7A673EC: reserve<(&rustc::ty::sty::RegionKind, rustc::mir::Location)> (vec.rs:464)
   by 0x7A673EC: push<(&rustc::ty::sty::RegionKind, rustc::mir::Location)> (vec.rs:1060)
   by 0x7A673EC: {{closure}}<&rustc::ty::TyS> (liveness.rs:171)
   by 0x7A673EC: visit_region<closure> (fold.rs:305)
   by 0x7A673EC: rustc::ty::structural_impls::<impl rustc::ty::fold::TypeFoldable<'tcx> for &'tcx rustc::ty::sty::RegionKind>::visit_with (structural_impls.rs:956)
   by 0x7A68568: super_visit_with<rustc::ty::fold::{{impl}}::for_each_free_region::RegionVisitor<closure>> (structural_impls.rs:888)
   by 0x7A68568: rustc::ty::fold::TypeVisitor::visit_ty (fold.rs:188)
   by 0x7A69AD0: visit_with<rustc::ty::fold::{{impl}}::for_each_free_region::RegionVisitor<closure>> (structural_impls.rs:903)
   by 0x7A69AD0: {{closure}}<rustc::ty::fold::{{impl}}::for_each_free_region::RegionVisitor<closure>> (structural_impls.rs:762)
   by 0x7A69AD0: {{closure}}<core::slice::Iter<&rustc::ty::TyS>,closure> (iterator.rs:1706)
   by 0x7A69AD0: {{closure}}<core::slice::Iter<&rustc::ty::TyS>,closure,core::iter::LoopState<(), ()>> (iterator.rs:1536)
   by 0x7A69AD0: try_fold<&rustc::ty::TyS,(),closure,core::iter::LoopState<(), ()>> (mod.rs:2431)
   by 0x7A69AD0: try_for_each<core::slice::Iter<&rustc::ty::TyS>,closure,core::iter::LoopState<(), ()>> (iterator.rs:1536)
   by 0x7A69AD0: any<core::slice::Iter<&rustc::ty::TyS>,closure> (iterator.rs:1705)
   by 0x7A69AD0: super_visit_with<rustc::ty::fold::{{impl}}::for_each_free_region::RegionVisitor<closure>> (structural_impls.rs:762)
   by 0x7A69AD0: rustc::ty::fold::TypeFoldable::visit_with (fold.rs:63)

Holy cow! That's a single 12 GiB allocation happening within push_type_live_constraint:
https://github.com/rust-lang/rust/blob/master/src/librustc_mir/borrow_check/nll/type_check/liveness.rs#L170-L172

(Once again, add_liveness_constraints and the code beneath it is causing difficulties.)

Because liveness_set is so enormous, add_region_liveness_constraints_from_type_check() is slow. From Callgrind:

            .          for (region, location) in liveness_set {
            .              debug!("generate: {:#?} is live at {:#?}", region, location);
  777,018,978              let region_vid = regioncx.to_region_vid(region);
1,942,548,420              regioncx.add_live_point(region_vid, *location);
59,657,319,186  => /home/njn/moz/rust0/src/librustc_mir/borrow_check/nll/region_infer/mod.rs:rustc_mir::borrow_check::nll::region_infer::RegionInferenceContext::add_live_point (388507821x)
            .          }

add_live_point() then does some SparseBitMatrix insertions.
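As a back-of-the-envelope check on that 12 GiB figure (a sketch; the 24-byte element size for a `(&RegionKind, Location)` pair and the exact doubling policy are assumptions, not confirmed from the source): 388,507,821 pushes exceed 2^28 entries, so `Vec`'s capacity-doubling growth ends at 2^29 slots, which lands exactly on the 12,884,901,888-byte block DHAT reported.

```rust
// Assumes each liveness_set element is 24 bytes (8-byte region reference
// plus a 16-byte Location) -- the layout is an assumption for this sketch.
fn main() {
    let elem_size: u64 = 24;
    let pushes: u64 = 388_507_821; // push count from the Callgrind output
    let mut cap: u64 = 4;
    while cap < pushes {
        cap *= 2; // Vec roughly doubles capacity on reallocation
    }
    assert_eq!(cap, 1 << 29); // 2^28 < 388,507,821 <= 2^29
    println!("final allocation: {} bytes", cap * elem_size);
    // prints: final allocation: 12884901888 bytes
}
```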

@nnethercote


Contributor

nnethercote commented Jul 4, 2018

There are also some other causes of slowness for html5ever.

First, precompute_borrows_out_of_scope is super hot. Callgrind output:

            .      mir: &'a Mir<'tcx>,
            .      regioncx: &Rc<RegionInferenceContext<'tcx>>, 
            .      borrows_out_of_scope_at_location: &mut FxHashMap<Location, Vec<BorrowIndex>>,
            .      borrow_index: BorrowIndex,
            .      borrow_region: RegionVid,
            .      location: Location,
            .  ) {     
            .      // Keep track of locations to check and locations that we have checked.
       46,668      let mut stack = vec![ location ];
            .      let mut visited = FxHashSet();
       62,224      visited.insert(location);
    4,413,266  => /home/njn/moz/rust0/src/libstd/collections/hash/set.rs:<std::collections::hash::set::HashSet<T, S>>::insert (15431x)
            .                  
            .      debug!(     
            .          "borrow {:?} has region {:?} with value {:?}",
            .          borrow_index,
            .          borrow_region,
            .          regioncx.region_value_str(borrow_region),
            .      );      
            .      debug!("borrow {:?} starts at {:?}", borrow_index, location);
  971,769,410      while let Some(location) = stack.pop() {
            .          // If region does not contain a point at the location, then add to list and skip
            .          // successor locations.
  971,769,410          if !regioncx.region_contains_point(borrow_region, location) {
            .              debug!("borrow {:?} gets killed at {:?}", borrow_index, location);
       99,190              borrows_out_of_scope_at_location
    1,290,825  => /home/njn/moz/rust0/src/libstd/collections/hash/map.rs:<std::collections::hash::map::Entry<'a, K, V>>::or_insert (13827x)
    3,298,140  => /home/njn/moz/rust0/src/libstd/collections/hash/map.rs:<std::collections::hash::map::HashMap<K, V, S>>::entry (13827x)
       28,340                  .entry(location)
            .                  .or_insert(vec![])
            .                  .push(borrow_index);
            .              continue;
            .          }
            .
            .          let bb_data = &mir[location.block];
            .          // If this is the last statement in the block, then add the
            .          // terminator successors next.
1,457,611,605          if location.statement_index == bb_data.statements.len() {
            .              // Add successors to locations to visit, if not visited before.
       59,584              if let Some(ref terminator) = bb_data.terminator {
       89,376                  for block in terminator.successors() {
      799,865  => /home/njn/moz/rust0/src/librustc/mir/mod.rs:rustc::mir::Terminator::successors (29527x)
      190,405                      let loc = block.start_location();
      112,488  => /home/njn/moz/rust0/src/librustc/mir/mod.rs:rustc::mir::BasicBlock::start_location (37496x)
      228,486                      if visited.insert(loc) {
    5,583,147  => /home/njn/moz/rust0/src/libstd/collections/hash/set.rs:<std::collections::hash::set::HashSet<T, S>>::insert (37496x)
            .                          stack.push(loc);
            .                      }
            .                  }
            .              }
            .          } else {
            .              // Visit next statement in block.
2,429,203,715              let loc = location.successor_within_block();
1,943,360,660  => /home/njn/moz/rust0/src/librustc/mir/mod.rs:rustc::mir::Location::successor_within_block (485840165x)
2,915,044,458              if visited.insert(loc) {
68,678,149,511  => /home/njn/moz/rust0/src/libstd/collections/hash/set.rs:<std::collections::hash::set::HashSet<T, S>>::insert (485840165x)
            .                  stack.push(loc);
            .              }
            .          }
            .      }
            .  }

It's the bit at the end that's the hottest, executing 485 million times. Because of this function, the following things are also super hot:

  • region_contains_point/RegionValues::contains
  • HashSet::insert. (Note that the visited hash table gets big: a total of 38 GB is allocated/reallocated in 109,000 allocations with an average size of 347 KB!)
  • successor_within_block
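The traversal above can be sketched on a toy CFG (a minimal stand-in: `Location`, `walk_scope`, and the block/successor encoding are hypothetical simplifications, not rustc's real types) to show why the within-block `visited.insert` dominates: it runs once per statement in every borrow's scope.

```rust
use std::collections::HashSet;

// Toy stand-ins: a CFG is a list of per-block statement counts plus
// successor edges, and a Location is (block, statement_index), where
// statement_index == count means the block's terminator.
type Location = (usize, usize);

// Mirrors the worklist shape of precompute_borrows_out_of_scope: pop a
// location, stop if the borrow's region no longer contains it, otherwise
// push the within-block successor (or the successor blocks' starts).
fn walk_scope(
    stmts_per_block: &[usize],
    successors: &[Vec<usize>],
    start: Location,
    in_scope: impl Fn(Location) -> bool,
) -> HashSet<Location> {
    let mut visited = HashSet::new();
    let mut stack = vec![start];
    visited.insert(start);
    while let Some(loc) = stack.pop() {
        if !in_scope(loc) {
            continue; // borrow is out of scope here; don't follow successors
        }
        let (bb, stmt) = loc;
        if stmt == stmts_per_block[bb] {
            // At the terminator: enqueue each successor block's start.
            for &succ in &successors[bb] {
                if visited.insert((succ, 0)) {
                    stack.push((succ, 0));
                }
            }
        } else {
            // Next statement in the same block -- the hot branch above,
            // doing one hash insertion per statement in the scope.
            if visited.insert((bb, stmt + 1)) {
                stack.push((bb, stmt + 1));
            }
        }
    }
    visited
}

fn main() {
    // One straight-line block with 5 statements; the borrow stays in
    // scope through statement 3, so locations 0..=4 get visited.
    let visited = walk_scope(&[5], &[vec![]], (0, 0), |(_, s)| s <= 3);
    assert_eq!(visited.len(), 5);
}
```

With many borrows whose scopes cover overlapping stretches of a long function, each borrow re-walks nearly the same statements, which is where the quadratic behaviour discussed below comes from.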

Second, unroll_place is also very hot:

            .      place: &Place<'tcx>,
            .      next: Option<&PlaceComponents<'_, 'tcx>>,
            .      op: impl FnOnce(PlaceComponentsIter<'_, 'tcx>) -> R,
            .  ) -> R {
            .      match place {
8,892,046,598          Place::Projection(interior) => unroll_place(
42,799,560,024  => /home/njn/moz/rust0/src/librustc_mir/borrow_check/places_conflict.rs:rustc_mir::borrow_check::places_conflict::unroll_place'2 (485791408x)
  534,414,126              &interior.base,
1,068,828,252              Some(&PlaceComponents {
            .                  component: place,
            .                  next,
            .              }),
2,672,070,630              op,
            .          ),
            .
            .          Place::Local(_) | Place::Static(_) => {
1,943,739,512              let list = PlaceComponents {
            .                  component: place,
            .                  next,
            .              }; 
1,943,739,512              op(list.iter())
            .          }
            .      }
6,754,274,663  }
@nikomatsakis


Contributor

nikomatsakis commented Jul 4, 2018

I always wondered if that super-naive liveness_set would cause us problems =) Seems clear the answer is yes.

@nikomatsakis


Contributor

nikomatsakis commented Jul 4, 2018

Thanks @nnethercote for the data, super helpful.

@nikomatsakis


Contributor

nikomatsakis commented Jul 4, 2018

I suspect we can remove the liveness_set vector entirely.

@nikomatsakis


Contributor

nikomatsakis commented Jul 4, 2018

But it will require some refactoring. I think I would want to build this atop #51987. Let me summarize the setup in that PR:

  • Type check generates outlives constraints ('a: 'b) as well as the set of liveness facts
    • liveness facts are presently stored as a set of (R, P) points
  • Once type check completes, we know the full set of region variables
  • We create the region inference context
    • we compute the SCC amongst region variables
    • we create a value per SCC which contains the points where any of the regions in that SCC are live
    • at present, we also create a liveness set for each individual region, which is used only in error reporting

We used to have a fairly hard requirement that we know the number of region variables before we could create the region inference context. This is because we represented the values as a huge matrix of bits. But now that we use sparse bit sets that is not true; we could grow the number of region variables fairly easily.

This implies to me that -- to start -- we could refactor so that type check generates the initial liveness results (storing in a sparse bit set per variable) directly. This would effectively replace the liveness_set vector that we have today. We would then hand that off to the region inference context.

The reason we'd have to be able to grow the number of region variables is that I think that computing liveness constraints can sometimes create fresh region variables and possibly constraints between those variables. (This is due to the "dropck" computation.)

It'd be sort of nice if that were not true, though, and it may not be -- if it weren't, we could compute the SCCs ahead of time and then never generate the per-region liveness, instead focusing on the liveness per SCC. This may affect diagnostics, which we could perhaps recover by doing more refined computations lazily (or maybe it wouldn't affect diagnostics; I'm not sure, I'd have to think about it).

But it'd probably still be a big improvement to move to the sparse bit set instead of the vector of points.
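A minimal sketch of that refactoring direction (a hypothetical miniature: the struct and method names echo the discussion, but this toy uses a `BTreeSet` per region rather than rustc's actual sparse bit sets):

```rust
use std::collections::BTreeSet;

// Instead of one giant Vec<(Region, Point)> (liveness_set), keep a set of
// live points per region, so duplicate (R, P) pairs cost nothing and
// membership checks are cheap.
struct LivenessValues {
    points_per_region: Vec<BTreeSet<usize>>,
}

impl LivenessValues {
    fn new() -> Self {
        LivenessValues { points_per_region: Vec::new() }
    }

    // Grow the set of regions on demand, since (as noted above) the dropck
    // computation may introduce fresh region variables late.
    fn add_live_point(&mut self, region: usize, point: usize) -> bool {
        if region >= self.points_per_region.len() {
            self.points_per_region.resize_with(region + 1, BTreeSet::new);
        }
        self.points_per_region[region].insert(point)
    }

    fn contains(&self, region: usize, point: usize) -> bool {
        self.points_per_region
            .get(region)
            .map_or(false, |s| s.contains(&point))
    }
}

fn main() {
    let mut lv = LivenessValues::new();
    // Inserting the same (region, point) pair repeatedly stores it once,
    // unlike pushing onto a Vec.
    for _ in 0..1000 {
        lv.add_live_point(3, 42);
    }
    assert!(lv.contains(3, 42));
    assert_eq!(lv.points_per_region[3].len(), 1);
}
```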

Another option that I have been wondering about is moving from "set of points" to a more compact representation -- for example, the SEME region abstraction implemented in https://github.com/rust-lang-nursery/rustc-seme-regions may be more compact. However, presently there can be multiple disjoint liveness regions, and SEME regions can't express that, so we'd have to figure out what to do there.

@nnethercote


Contributor

nnethercote commented Jul 4, 2018

More about precompute_borrows_out_of_scope: there's something quadratic going on. I added this line to the very end of the function:

eprintln!("visited {}", visited.len());

The resulting output when compiling html5ever is dominated by this sequence of length 9858:

visited 88704
visited 88703
visited 88691
visited 88683
visited 88675
visited 88667
...
visited 9899
visited 9891
visited 9883
visited 9875
visited 9867
visited 9862
visited 9861 

It definitely feels like we're traversing a long sequence from A to B, then moving forward A slightly (by 8 steps, to be precise), then traversing from A to B again, over and over.
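That pattern is consistent with an arithmetic series. Modelling the visited-set sizes as shrinking from 88,704 by exactly 8 per borrow across the 9,858 borrows (a rough model; the raw numbers only approximate this) gives about 486 million insertions, which lines up with the ~485.8 million `successor_within_block` calls in the earlier Callgrind output.

```rust
// Sum of an arithmetic series: counts shrink from 88,704 by 8 per borrow
// across 9,858 borrows. This is a back-of-the-envelope model of the
// sequence above, not exact data.
fn main() {
    let borrows: u64 = 9_858;
    let first: u64 = 88_704;
    let last = first - 8 * (borrows - 1); // 9,848
    let total = (first + last) * borrows / 2;
    assert_eq!(total, 485_762_808);
    println!("≈{} visited-set insertions", total);
}
```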

@nikomatsakis


Contributor

nikomatsakis commented Jul 4, 2018

Yeah, precompute_borrows_out_of_scope is potentially quadratic -- it itself is an optimization over an earlier approach -- but I've not thought of a better way for it to work. It basically walks the scope of each borrow -- the number of borrows is sort of "O(n)" in the size of the code, and the potential extent of each borrow is similarly "O(n)".

@nikomatsakis


Contributor

nikomatsakis commented Jul 4, 2018

Hmm, as an aside, I wonder if the premise of finding the borrows-in-scope via a dataflow is a bit silly, given that precompute_borrows_out_of_scope exists. That is, it seems like we are duplicating our work here with some of the dataflow propagation code. Maybe it would be worth killing that code.

(This is another place where changing to an alternative representation of regions, one that optimizes continuous stretches of things, might be a win.)

@nnethercote


Contributor

nnethercote commented Jul 5, 2018

I made several attempts to speed up unroll_place -- using SmallVec, using Vec, moving the borrow_components computation outside the inner loop in each_borrow_involving_path -- but they all made things slower or at best made no difference :(

@nikomatsakis


Contributor

nikomatsakis commented Jul 5, 2018

@nnethercote yeah, that code was not naive -- in fact, you had previously optimized it to use SmallVec, and unroll_place was itself an optimization on that =) I have some thoughts about how to do better but it would require deeper refactoring. I will try to write some notes down at least.

@nikomatsakis


Contributor

nikomatsakis commented Jul 5, 2018

@nnethercote


Contributor

nnethercote commented Jul 11, 2018

@Mark-Simulacrum: #52250 reduces html5ever's max-rss from 11,380,868 down to 10,285,700. (The unit appears to be KiB, i.e. roughly 10.9 GiB down to 9.8 GiB.) I wonder if that's enough to avoid the OOMs on machines with 16 GiB of RAM.

@Mark-Simulacrum


Member

Mark-Simulacrum commented Jul 11, 2018

@nnethercote I'll try and enable it -- the perf collector has, I think, 32 GB of RAM (though free reports 31G, that seems odd) so I think that it should be "good enough" in that regard. Local benchmark runners can exclude it if necessary, I think...

bors added a commit that referenced this issue Jul 16, 2018

Auto merge of #52190 - davidtwco:issue-52028, r=<try>
WIP: html5ever in the rustc-perf repository is memory-intensive

Part of #52028. Rebased atop of #51987.

r? @nikomatsakis

bors added a commit that referenced this issue Jul 17, 2018

Auto merge of #52190 - davidtwco:issue-52028, r=nikomatsakis
html5ever in the rustc-perf repository is memory-intensive

Part of #52028. Rebased atop of #51987.

r? @nikomatsakis
@nnethercote


Contributor

nnethercote commented Jul 17, 2018

I looked into unroll_place a bit more. The problem is not that unroll_place is slow, but rather that it's called so many times.

Specifically, we hit the for i in candidates loop within each_borrow_involving_path() 150,000 times when compiling html5ever:

for i in candidates {
    let borrowed = &borrow_set[i];
    if places_conflict::places_conflict(tcx, mir, &borrowed.borrowed_place, place, access) {
        debug!(
            "each_borrow_involving_path: {:?} @ {:?} vs. {:?}/{:?}",
            i, borrowed, place, access
        );
        let ctrl = op(s, i, borrowed);
        if ctrl == Control::Break {
            return;
        }
    }
}

Here's the candidates length distribution for those 150,000 hits:

  • len=20: 8 times
  • len=21: 8 times
  • len=22: 8 times
  • ...
  • len=9854: 8 times
  • len=9855: 8 times
  • len=9856: 9862 times

There's something quadratic going on, as shown by the equal number of occurrences for lengths 20..9855. There is also the over-representation of len=9856. I haven't yet worked out exactly where this is coming from.
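Multiplying out the reported buckets (treating the distribution as exactly: every length 20..=9855 occurring 8 times, as the elided middle of the list suggests, plus the len=9856 spike occurring 9,862 times -- an approximation, not exact data) again lands near 486 million conflict checks, matching the ~485.8 million `unroll_place` calls in the earlier profile.

```rust
// Back-of-the-envelope tally of places_conflict invocations from the
// candidates-length distribution above.
fn main() {
    let tail: u64 = (20u64..=9_855).sum::<u64>() * 8; // lengths 20..=9855, 8x each
    let spike: u64 = 9_856 * 9_862;                   // the len=9856 spike
    assert_eq!(tail + spike, 485_721_872);
    println!("≈{} places_conflict checks", tail + spike);
}
```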

@nikomatsakis


Contributor

nikomatsakis commented Jul 17, 2018

#52190 reduces the max-rss to 2 GB -- still too big, but a lot better. To do much better than that, I think we have to start changing from a simple bitset to something more compressed, so we can capture patterns across variables. Something like BDDs comes to mind.

@nikomatsakis


Contributor

nikomatsakis commented Jul 17, 2018

The performance remains not great, though, and @nnethercote's profiles are very useful in that regard. It seems clear (also from the fact that liveness dominates) that in this case we have a large number of borrows accumulating somewhere -- each access is then checked against that large number of borrows. I suspect we could do a lot better via some kind of hashing scheme, where we hash paths to figure out whether they might possibly overlap. I would imagine we would walk down the place to find prefixes.

@nikomatsakis


Contributor

nikomatsakis commented Jul 17, 2018

We are deferring further work that targets max-rss specifically to the release candidate, since it doesn't seem pressing for Edition Preview 2 right now. At this point, it seems like the next step is to look at ways to represent liveness regions that can share between region values -- e.g., something like a BDD (correct ordering will be key here) or perhaps SEME regions.

@nnethercote


Contributor

nnethercote commented Jul 25, 2018

I re-profiled. The high number of calls to unroll_place is still a major problem, as is the behaviour of precompute_borrows_out_of_scope. Both appear to involve quadratic behaviour.

@lqd


Contributor

lqd commented Jul 25, 2018

For posterity's sake, 95% of the time spent in html5ever is due to this big static

@lqd


Contributor

lqd commented Jul 25, 2018

Just linking the couple issues Niko filed, related to @nnethercote's and others' profiles (but not strictly about this current issue of memory usage):

@nikomatsakis


Contributor

nikomatsakis commented Aug 7, 2018

Update: With #53168, I think we can expect html5ever's memory usage to be reduced to 1.2GB. (At least, the previous PR — which I reverted — had that effect, and this one does the same basic thing, just more soundly.)

@nnethercote


Contributor

nnethercote commented Aug 17, 2018

@Mark-Simulacrum : are you happy to close this now? With NLL, html5ever's memory usage and speed are now both reasonable (and #53383 will help both a bit more).

@Mark-Simulacrum


Member

Mark-Simulacrum commented Aug 17, 2018

html5ever still uses 6x more memory (over a gigabyte) for check builds if I'm reading the dashboard right, which I believe is unacceptably high long-term. However, I do agree that we can discontinue prioritizing this for the edition (performance regression is no longer a huge outlier for html5ever specifically).

@nikomatsakis


Contributor

nikomatsakis commented Aug 27, 2018

Note that #53327 seems to drop memory use to 600MB here (although not the latest rev, which needs to be investigated).

@nikomatsakis


Contributor

nikomatsakis commented Aug 27, 2018

Pushing this out to RC 2 — it's a perf issue and not a critical one (plus pending PRs ought to help)

@nikomatsakis


Contributor

nikomatsakis commented Sep 11, 2018

Memory use is currently at a 2.28x ratio, weighing in at 501MB.

Still a lot, but not so much as to be an RC2 blocker.

Would be good to know what's going on though.

@nikomatsakis nikomatsakis removed this from the Edition 2018 RC 2 milestone Sep 11, 2018

@nnethercote


Contributor

nnethercote commented Sep 12, 2018

Here's a Massif graph showing a big spike near the end of a check build:
[Massif graph]

228MB of that is from SparseBitMatrix instances within ConstraintGeneration::add_regular_live_constraint. Here's a DHAT record for half of it (and there's a second record for the other half that is extremely similar):

max-live:    114,195,072 in 9,858 blocks
tot-alloc:   115,981,392 in 15,556 blocks (avg size 7455.73) 
deaths:      15,556, at avg age 186,718,791 (1.80% of prog lifetime)
acc-ratios:  1.00 rd, 0.00 wr  (116,230,288 b-read, 248,896 b-written)
   at 0x4C2FEE5: calloc (in /usr/lib/valgrind/vgpreload_exp-dhat-amd64-linux.so)
   by 0x7AE3D45: alloc_zeroed (alloc.rs:147)
   by 0x7AE3D45: alloc_zeroed (alloc.rs:174)
   by 0x7AE3D45: allocate_in<u128,alloc::alloc::Global> (raw_vec.rs:104)
   by 0x7AE3D45: with_capacity_zeroed<u128> (raw_vec.rs:156)
   by 0x7AE3D45: from_elem<u128> (vec.rs:1620)
   by 0x7AE3D45: from_elem<u128> (vec.rs:1581)
   by 0x7AE3D45: new<rustc_mir::borrow_check::nll::region_infer::values::PointIndex> (<vec macros>:2)
   by 0x7AE3D45: {{closure}}<rustc_mir::borrow_check::nll::constraints::ConstraintSccIndex,rustc_mir::borrow_check::nll::region_infer::values::PointIndex> (bitvec.rs:355)
   by 0x7AE3D45: get_or_insert_with<rustc_data_structures::bitvec::BitArray<rustc_mir::borrow_check::nll::region_infer::values::PointIndex>,closure> (option.rs:814)
   by 0x7AE3D45: <rustc_data_structures::bitvec::SparseBitMatrix<R, C>>::ensure_row (bitvec.rs:355)
   by 0x7B654AC: add<rustc::ty::sty::RegionVid,rustc_mir::borrow_check::nll::region_infer::values::PointIndex> (bitvec.rs:363)
   by 0x7B654AC: <rustc_mir::borrow_check::nll::region_infer::values::LivenessValues<N>>::add_element (values.rs:182)
   by 0x7B5690D: {{closure}}<&rustc::ty::sty::RegionKind> (constraint_generation.rs:205)
   by 0x7B5690D: {{closure}}<&rustc::ty::sty::RegionKind,closure> (fold.rs:271)
   by 0x7B5690D: visit_region<closure> (fold.rs:333)
   by 0x7B5690D: visit_with<rustc::ty::fold::{{impl}}::any_free_region_meets::RegionVisitor<closure>> (structural_impls.rs:964)
   by 0x7B5690D: any_free_region_meets<&rustc::ty::sty::RegionKind,closure> (fold.rs:291)
   by 0x7B5690D: for_each_free_region<&rustc::ty::sty::RegionKind,closure> (fold.rs:270)
   by 0x7B5690D: add_regular_live_constraint<&rustc::ty::sty::RegionKind> (constraint_generation.rs:201)
   by 0x7B5690D: visit_region (constraint_generation.rs:70)
   by 0x7B5690D: super_rvalue<rustc_mir::borrow_check::nll::constraint_generation::ConstraintGeneration> (visit.rs:540)
   by 0x7B5690D: visit_rvalue<rustc_mir::borrow_check::nll::constraint_generation::ConstraintGeneration> (visit.rs:138)
   by 0x7B5690D: super_assign<rustc_mir::borrow_check::nll::constraint_generation::ConstraintGeneration> (visit.rs:403)
   by 0x7B5690D: <rustc_mir::borrow_check::nll::constraint_generation::ConstraintGeneration<'cg, 'cx, 'gcx, 'tcx> as rustc::mir::visit::Visitor<'tcx>>::visit_assign (constraint_
generation.rs:151)

@nnethercote


Contributor

nnethercote commented Sep 12, 2018

I added some println! statements. Looks like we have a big SparseBitMatrix that ends up with about 30,000 rows, and 92,652 columns in each row, but only a single bit is set in each row. SparseBitMatrix is moderately sparse, in that rows are instantiated fully on demand.

Perhaps a sparser representation could help. In fact, SparseBitMatrix used to be sparser, but in #52250 I made it denser because it was a significant speed win on some benchmarks. Perhaps there's a middle ground to be found.

@nnethercote


Contributor

nnethercote commented Sep 13, 2018

I have a plan to fix this, by making SparseBitMatrix sparser in the case where many rows have only 1 column (or a small number of columns) set. It will involve using HybridIdxSet for the rows. But that will first require cleaning up and combining the various bitset structures we currently have, including BitSlice, BitArray, and IdxSet. #54177 is the first step.

(The old SparseBitMatrix used BTreeMap for the rows, which gave terrible performance and memory usage when the rows became fairly dense. HybridIdxSet should provide a good compromise: compact when sparse, but space-efficient when dense, and fast in both cases.)
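A miniature of that hybrid idea (a hypothetical toy, not the real `HybridIdxSet` API): keep a small vector of indices while a row is sparse, and promote to a dense bit vector once it crosses a threshold, so a 92,652-column row with a single bit set costs a few words instead of ~11 KB.

```rust
// Rows stay as a small sorted-ish Vec of indices up to SPARSE_MAX
// elements, then promote themselves to a dense word array.
const SPARSE_MAX: usize = 8;

enum HybridSet {
    Sparse(Vec<usize>),
    Dense(Vec<u64>),
}

impl HybridSet {
    fn new() -> Self {
        HybridSet::Sparse(Vec::new())
    }

    fn insert(&mut self, i: usize, domain: usize) -> bool {
        match self {
            HybridSet::Dense(words) => {
                let (w, b) = (i / 64, 1u64 << (i % 64));
                let fresh = words[w] & b == 0;
                words[w] |= b;
                fresh
            }
            HybridSet::Sparse(v) => {
                if v.contains(&i) {
                    return false;
                }
                if v.len() < SPARSE_MAX {
                    v.push(i);
                    return true;
                }
                // Promote: only now allocate words for the full domain, so
                // rows with one or two bits never pay for it.
                let mut words = vec![0u64; (domain + 63) / 64];
                for &j in v.iter() {
                    words[j / 64] |= 1 << (j % 64);
                }
                words[i / 64] |= 1 << (i % 64);
                *self = HybridSet::Dense(words);
                true
            }
        }
    }

    fn contains(&self, i: usize) -> bool {
        match self {
            HybridSet::Sparse(v) => v.contains(&i),
            HybridSet::Dense(words) => words[i / 64] & (1 << (i % 64)) != 0,
        }
    }
}

fn main() {
    let domain = 92_652; // columns per row reported above
    let mut row = HybridSet::new();
    row.insert(7, domain); // one bit set: no 92k-bit allocation yet
    assert!(matches!(row, HybridSet::Sparse(_)));
    for i in 0..20 {
        row.insert(i, domain); // crossing the threshold promotes the row
    }
    assert!(matches!(row, HybridSet::Dense(_)));
    assert!(row.contains(7) && !row.contains(21));
}
```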

@nnethercote


Contributor

nnethercote commented Sep 17, 2018

I have a plan to fix this, by making SparseBitMatrix sparser in the case where many rows have only 1 column (or a small number of columns) set. It will involve using HybridIdxSet for the rows. But that will first require cleaning up and combining the various bitset structures we currently have, including BitSlice, BitArray, and IdxSet. #54177 is the first step.

#54286 is the next step. Once that lands, I will file a PR for using HybridBitSet within SparseBitMatrix. I have it working locally and the memory improvements are sizeable:

html5ever-check
        avg: -46.7%     min: -46.7%     max: -46.7%
ucd-check
        avg: -45.2%     min: -45.2%     max: -45.2%
clap-rs-check
        avg: -23.3%     min: -23.3%     max: -23.3%
inflate-check
        avg: -14.9%     min: -14.9%     max: -14.9%
@nnethercote


Contributor

nnethercote commented Sep 18, 2018

@lqd just pointed me at a crate (unic-ucd-name-0.7.0) with high memory usage under NLL. My HybridBitSet change reduces its max-rss by 96%, from 29.1GB to 1.2GB! (Compared to non-NLL, that improves the increase in max-rss from 42.5x to 1.7x.)

nnethercote added a commit to nnethercote/rust that referenced this issue Sep 18, 2018

Use `HybridBitSet` for rows within `SparseBitMatrix`.
This requires adding a few extra methods to `HybridBitSet`. (These are
tested in a new unit test.)

This commit reduces the `max-rss` for `nll-check` builds of `html5ever`
by 46%, `ucd` by 45%, `clap-rs` by 23%, `inflate` by 14%. And the
results for the `unic-ucd-name` crate are even more impressive: a 21%
reduction in instructions, a 60% reduction in wall-time, a 96%
reduction in `max-rss`, and a 97% reduction in faults!

Fixes rust-lang#52028.

@bors bors closed this in 154be2c Sep 19, 2018

@nnethercote


Contributor

nnethercote commented Sep 19, 2018

The NLL dashboard has updated. The top 5 now looks like this:

Benchmark    | clean-check  | nll-check    | %
keccak       | 636,612.00   | 912,740.00   | 143.37
html5ever    | 221,892.00   | 266,248.00   | 119.99
coercions    | 181,744.00   | 210,120.00   | 115.61
tuple-stress | 395,444.00   | 442,804.00   | 111.98
script-servo | 2,463,224.00 | 2,617,760.00 | 106.27

A 20% increase in max-rss for html5ever with NLL isn't ideal, but it's a whole lot better than the 14 GiB+ we were getting when this issue was opened.
