
MinHash.remove_many continues to be very slow when removing many hashes #1617

Open
ctb opened this issue Jun 21, 2021 · 13 comments

@ctb
Contributor

ctb commented Jun 21, 2021

Despite #1571, the problems in #1552 continued even with the new remove_many implementation, until I refactored the enclosing script in #1613.

The following script reproduces the problem:

import sourmash

print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')

print('converting to mutable...')
mh = big_sig.minhash.to_mutable()
print('...done.')

print('subtracting...')
mh2 = mh.remove_many(big_sig.minhash)
print('...done')
@ctb ctb added the rust label Jun 21, 2021
@ctb ctb changed the title MinHash.remove_many continues to be very slow MinHash.remove_many continues to be very slow when removing many hashes Jun 21, 2021
@ctb
Contributor Author

ctb commented Jun 21, 2021

The .sig file can be downloaded here, https://osf.io/egks5/, or is in ~ctbrown/prefetch-hangest/ on farm.

@mr-eyes
Member

mr-eyes commented Jul 19, 2021

I think the only way to improve this is to parallelize the removals, or to replace the hashes vector and the parallel abundance vector with a single hashmap<hash, abundance>. In the provided example there will be 8,577,001 search operations on the vector, each of which is, AFAIK, O(n) just to find the item's index before removing it.
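To make the cost difference concrete, here is an illustrative Python micro-benchmark (not sourmash code): removing every element one at a time from two parallel lists, versus from a single dict keyed by hash, as suggested above.

```python
import timeit

N = 20_000
hashes = list(range(N))

def remove_from_list():
    hs = list(hashes)
    ab = list(hashes)              # parallel "abundance" vector
    for h in hashes:
        i = hs.index(h)            # O(n) scan for the position
        del hs[i]                  # O(n) shift of the tail
        del ab[i]
    return hs

def remove_from_dict():
    d = dict.fromkeys(hashes, 1)   # hash -> abundance in one mapping
    for h in hashes:
        del d[h]                   # amortized O(1)
    return d

print('list:', timeit.timeit(remove_from_list, number=1))
print('dict:', timeit.timeit(remove_from_dict, number=1))
```

The list version is quadratic overall; the dict version is linear.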

@ctb
Contributor Author

ctb commented Jul 20, 2021

This seems to work pretty fast. It's interesting how Python is so much faster than Rust, don't you think? 😜

import sourmash

print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')

print('Converting to set...')
x = set(big_sig.minhash.hashes)
print('...done!')

print('subtracting...')
y = set(x)
z = x - y
print('...done!')

@mr-eyes
Member

mr-eyes commented Jul 20, 2021

Hahaha, using a set will be way faster...

Here's what is happening in Rust:

import sourmash

print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')

print('Converting to list...')

hashes_list = list(big_sig.minhash.hashes)
abundance_list = hashes_list.copy() # Simulate abundance vector

to_be_removed = hashes_list.copy()

print('subtracting...')

for hash in to_be_removed:
    idx = hashes_list.index(hash)
    del hashes_list[idx]
    del abundance_list[idx]

print("Done")

Vectors are used to hold the hashes and the abundance values so they stay in order. Using a set instead of a vector would not preserve that order.
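One middle ground, sketched here in Python purely for illustration (this is not the sourmash implementation): store hashes and abundances in a single mapping. Python dicts preserve insertion order (guaranteed since 3.7), so inserting hashes in sorted order keeps them sorted under deletion, while each delete stays O(1).

```python
# Two parallel "vectors", as in the current Rust implementation:
mins = [2, 5, 9, 14]
abunds = [7, 1, 3, 2]

# Merge them into one insertion-ordered mapping: hash -> abundance.
hashes = dict(zip(mins, abunds))

# remove_many becomes a series of O(1) deletions; the order of the
# surviving hashes is preserved.
for h in (5, 14):
    hashes.pop(h, None)

print(list(hashes.items()))    # [(2, 7), (9, 3)]
```

Caveat: a plain dict keeps existing order under deletion, but a *new* insertion lands at the end rather than in sorted position, which is why a tree-backed map is the better analogue on the Rust side.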

@luizirber
Member

luizirber commented Jul 25, 2021

Let's try to disentangle the many threads going on in this conversation =]

This seems to work pretty fast. It's interesting how Python is so much faster than Rust, don't you think?

import sourmash

print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')

print('Converting to set...')
x = set(big_sig.minhash.hashes)
print('...done!')

print('subtracting...')
y = set(x)
z = x - y
print('...done!')

This code takes shortcuts (it is not doing .to_mutable(), which triggers a copy of the data, nor does it end with a usable MinHash after the operation), so it will already be faster. Nonetheless, converting to a set makes it use twice the memory. Using memory-profiler (as @mr-eyes did in #1571), these are the results:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     3   52.844 MiB   52.844 MiB           1   @profile
     4                                         def main():
     5   52.844 MiB    0.000 MiB           1       print('loading...')
     6  186.164 MiB  133.320 MiB           1       big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
     7  186.164 MiB    0.000 MiB           1       print(f'...done. loaded {len(big_sig.minhash)} hashes.')
     8
     9  186.164 MiB    0.000 MiB           1       print('Converting to set...')
    10  741.484 MiB  555.320 MiB           1       x = set(big_sig.minhash.hashes)
    11  741.484 MiB    0.000 MiB           1       print('...done!')
    12
    13  741.484 MiB    0.000 MiB           1       print('subtracting...')
    14 1253.238 MiB  511.754 MiB           1       y = set(x)
    15 1253.238 MiB    0.000 MiB           1       z = x - y
    16 1253.238 MiB    0.000 MiB           1       print('...done!')

My point here: the Rust code is trying to avoid allocating more memory than needed, and this is DISASTROUS with the current implementation when removing many hashes. Since it is an ordered vector, each removal has to shift large chunks of the vector (as Mo pointed out in his explanation of what Rust is doing). This is easy to see with py-spy top -n:

Collecting samples from 'python -m memory_profiler -o rust.time slow_remove.py' (python v3.8.9)
Total Samples 248200
GIL: 0.00%, Active: 100.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%    2481s     2481s   __memmove_avx_unaligned_erms (libc-2.32.so)
  0.00% 100.00%   0.620s     2482s   sourmash::sketch::minhash::KmerMinHash::remove_hash::hec5d940496ec6541 (sourmash/_lowlevel__lib.so)
  0.00% 100.00%   0.140s     2482s   sourmash::ffi::utils::landingpad::hf48eacb578fa3b98 (sourmash/_lowlevel__lib.so)
...

It literally spends all the time moving memory around.
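The same pathology is easy to reproduce in Python (illustrative only; the list sizes are assumptions, not taken from the issue): deleting from the front of a list shifts the whole remaining tail on every delete, which is exactly what Vec::remove does in Rust, while deleting from the back moves nothing.

```python
import timeit

N = 30_000

def delete_front():
    xs = list(range(N))
    while xs:
        del xs[0]      # shifts the remaining tail left: O(n) per delete

def delete_back():
    xs = list(range(N))
    while xs:
        del xs[-1]     # no shifting: O(1) per delete

print('front:', timeit.timeit(delete_front, number=1))
print('back: ', timeit.timeit(delete_back, number=1))
```

The front-deleting loop spends essentially all of its time in memmove, just like the py-spy profile above.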


So, what to do?

  • The call to .to_mutable() should transform the MinHash on the Rust side from a KmerMinHash into a KmerMinHashBTree. The latter supports the same operations that the set() in Python is doing, and is going to be much faster.
  • As Mo pointed out, the Rust side still keeps abundances separate from hashes. mins and abunds can probably be merged into a hashes: BTreeMap<u64, Option<u64>> field, which would let us iterate over and modify both hashes and abundances much more conveniently. (Incidentally, BTreeSet<T> is implemented as a BTreeMap<T, ()>, so it should still be pretty efficient space-wise, but more testing is needed.)
  • #1616 (Spread 'frozen' behavior through code base & document in dev docs) already did the legwork of spreading Frozen around, but there still needs to be a method exposed on the FFI to convert from KmerMinHash 🦀/Minhash 🐍 to KmerMinHashBTree 🦀/FrozenMinhash 🐍 and vice versa so we can benefit from these changes.
  • KmerMinHashBTree is a HORRIBLE name. Probably rename it too while doing these changes.
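The frozen/mutable split described in the first and third bullets can be sketched in Python; the class names here (FrozenSketch, MutableSketch) are hypothetical stand-ins, not the sourmash API.

```python
class FrozenSketch:
    """Read-optimized: sorted tuple of hashes, cheap iteration and merging."""
    def __init__(self, hashes):
        self.hashes = tuple(sorted(hashes))

    def to_mutable(self):
        # One O(n) conversion up front buys cheap edits afterwards.
        return MutableSketch(self.hashes)

class MutableSketch:
    """Edit-optimized: set-backed, so remove_many is fast."""
    def __init__(self, hashes):
        self.hashes = set(hashes)

    def remove_many(self, hashes):
        self.hashes -= set(hashes)

    def into_frozen(self):
        # Re-sort once when freezing, instead of on every mutation.
        return FrozenSketch(self.hashes)

mh = FrozenSketch([9, 2, 5]).to_mutable()
mh.remove_many([2])
print(mh.into_frozen().hashes)    # (5, 9)
```

The design point is that each representation is optimal for one phase, and the conversion cost is paid once at the boundary rather than on every removal.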

If you can cheat, I can cheat too =]

There is some wonkiness in the API, but I used the released crate (so no optimizations in KmerMinHashBTree like I suggested above) to write a quick-and-dirty program that does the same thing.

use sourmash::signature::{Signature, SigsTrait};
use sourmash::sketch::minhash::KmerMinHashBTree;
use sourmash::sketch::Sketch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("loading...");
    let (reader, _) = niffler::from_path("SRR7186547.k31.sig")?;
    let mut sig: Vec<Signature> = serde_json::from_reader(reader)?;

    if let Sketch::MinHash(big_sig) = sig.swap_remove(0).sketches().swap_remove(0) {
        println!("...done. loaded {} hashes.", big_sig.size());

        println!("converting to mutable...");
        let mut mh: KmerMinHashBTree = big_sig.into();
        println!("...done");

        println!("subtracting...");
        mh.remove_many(&mh.mins())?;
        println!("...done");
    }

    Ok(())
}

Timings

Rust:

$ /usr/bin/time -v -- cargo run --release
    Finished release [optimized] target(s) in 0.02s
     Running `target/release/remove_hashes`
loading...
...done. loaded 8679673 hashes.
converting to mutable...
...done
subtracting...
...done
        Command being timed: "cargo run --release"
        User time (seconds): 4.19
        System time (seconds): 0.53
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.72
        Maximum resident set size (kbytes): 950944
        File system inputs: 336772

Python (with sets):

$ /usr/bin/time -v -- python python_remove.py
loading...
...done. loaded 8679673 hashes.
Converting to set...
...done!
subtracting...
...done!
        Command being timed: "python python_remove.py"
        User time (seconds): 8.00
        System time (seconds): 2.56
        Percent of CPU this job got: 117%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.95
        Maximum resident set size (kbytes): 1342936
        File system inputs: 345687

@ctb
Contributor Author

ctb commented Jul 25, 2021

If you can cheat, I can cheat too =]

😆

My intention was to point out that there must be options. Thank you for falling into my trap^W^W^W^Wexploring them!

@ctb
Contributor Author

ctb commented Jul 25, 2021

and let me just say how adorable the 🦀 and 🐍 are!

@luizirber
Member

luizirber commented Jul 25, 2021

My intention was to point out that there must be options. Thank you for falling into my trap^W^W^W^Wexploring them!

I know, I know. Just pointing out that the options were there all along, but... not implemented all the way across the FFI =P

and let me just say how adorable the 🦀 and 🐍 are!

Right? gonna start using it all the time when talking about Rust and Python types 😸

@mr-eyes
Member

mr-eyes commented Jul 17, 2022

Coming from #1771

pub fn remove_hash(&mut self, hash: u64) {
    if let Ok(pos) = self.mins.binary_search(&hash) {
        if self.mins[pos] == hash {
            self.mins.remove(pos);
            self.reset_md5sum();
            if let Some(ref mut abunds) = self.abunds {
                abunds.remove(pos);
            }
        }
    };
}

The binary search itself is only O(log n); it's the Vec::remove after it that slows down the process of removing many elements, since each removal shifts the whole tail of the vector.

Would replacing the vec<hashes> & vec<abundance> with a hashmap<hash, abundance> be the optimal solution here?
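For reference, here is a Python analogue (illustrative, not sourmash code) of the remove_hash above, using the stdlib bisect module; it makes it easy to see that the lookup is cheap and the deletion is the O(n) step.

```python
import bisect

def remove_hash(mins, abunds, h):
    # Binary search for the position: O(log n).
    pos = bisect.bisect_left(mins, h)
    if pos < len(mins) and mins[pos] == h:
        del mins[pos]              # O(n): every later element shifts left
        if abunds is not None:
            del abunds[pos]        # same shift in the parallel vector

mins = [3, 7, 11, 19]
abunds = [1, 2, 1, 5]
remove_hash(mins, abunds, 11)
print(mins, abunds)                # [3, 7, 19] [1, 2, 5]
```

Called once, this is fine; called for millions of hashes, the two `del`s dominate, which is what the py-spy profile earlier in the thread showed.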

@mr-eyes
Member

mr-eyes commented Jul 17, 2022

Would replacing the vec<hashes> & vec<abundance> with a hashmap<hash, abundance> be the optimal solution here?

Ok, that was already discussed :neckbeard:.

@ctb
Contributor Author

ctb commented Jul 18, 2022

We should revisit the code in #2123 if/when we speed up remove_many.

@ctb
Contributor Author

ctb commented Jul 22, 2022

I ran the scaled=1000 version of the benchmark in #1747:

time py-spy record -o latest.svg -- sourmash gather SRR10988543.sig bins.sbt.zip >& latest.out

and saw the following:

(screenshot: py-spy flame graph, "Screen Shot 2022-07-21 at 9 39 54 PM")

so, still some work to do here :)

It took < 2 hours to run, but not by much.

@ctb
Contributor Author

ctb commented Jul 22, 2022

Oh, and when we zoom in on the block on the right, we see that remove_many continues to be a problem: it's the big block on the left.

(screenshot: zoomed py-spy flame graph, "Screen Shot 2022-07-22 at 7 07 26 AM")

The other big chunk of time is in this generator expression in search.py:

        weighted_missed = sum((orig_query_abunds[k] for k in query_hashes))
