
MinHash.remove_many continues to be very slow when removing many hashes #1617

Open
ctb opened this issue Jun 21, 2021 · 13 comments

@ctb
Contributor

ctb commented Jun 21, 2021

Despite #1571, the problems in #1552 continued even with the new remove_many implementation, until I refactored the enclosing script in #1613.

The following script reproduces the problem:

import sourmash

print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')

print('converting to mutable...')
mh = big_sig.minhash.to_mutable()
print('...done.')

print('subtracting...')
mh2 = mh.remove_many(big_sig.minhash)
print('...done')
@ctb ctb added the rust label Jun 21, 2021
@ctb ctb changed the title MinHash.remove_many continues to be very slow MinHash.remove_many continues to be very slow when removing many hashes Jun 21, 2021
@ctb
Contributor Author

ctb commented Jun 21, 2021

The .sig file can be downloaded here, https://osf.io/egks5/, or is in ~ctbrown/prefetch-hangest/ on farm.

@mr-eyes
Member

mr-eyes commented Jul 19, 2021

I think the only way to improve this is to parallelize the removals, or to replace the hashes vector and the parallel abundance vector with a single hashmap<hash, abundance>. In the provided example there will be 8,577,001 search operations on the vector, each of which is, AFAIK, O(n) just to find the item's index before removing it.
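To make the cost difference concrete, here is an illustrative Python micro-benchmark (not sourmash code): removing every element one at a time from two parallel lists, versus from a single dict keyed by hash, as suggested above.

```python
import timeit

N = 20_000
hashes = list(range(N))

def remove_from_list():
    hs = list(hashes)
    ab = list(hashes)              # parallel "abundance" vector
    for h in hashes:
        i = hs.index(h)            # O(n) scan for the position
        del hs[i]                  # O(n) shift of the tail
        del ab[i]
    return hs

def remove_from_dict():
    d = dict.fromkeys(hashes, 1)   # hash -> abundance in one mapping
    for h in hashes:
        del d[h]                   # amortized O(1)
    return d

print('list:', timeit.timeit(remove_from_list, number=1))
print('dict:', timeit.timeit(remove_from_dict, number=1))
```

The list version is quadratic overall; the dict version is linear.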

@ctb
Contributor Author

ctb commented Jul 20, 2021

This seems to work pretty fast. It's interesting how Python is so much faster than Rust, don't you think? 😜

import sourmash

print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')

print('Converting to set...')
x = set(big_sig.minhash.hashes)
print('...done!')

print('subtracting...')
y = set(x)
z = x - y
print('...done!')

@mr-eyes
Member

mr-eyes commented Jul 20, 2021

Hahaha, using a set will be way faster...

Here's what is happening in Rust:

import sourmash

print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')

print('Converting to list...')

hashes_list = list(big_sig.minhash.hashes)
abundance_list = hashes_list.copy() # Simulate abundance vector

to_be_removed = hashes_list.copy()

print('subtracting...')

for hash in to_be_removed:
    idx = hashes_list.index(hash)
    del hashes_list[idx]
    del abundance_list[idx]

print("Done")

Vectors are used to hold the hashes and the abundance values so they stay in order. Using a set instead of a vector would not preserve that order.
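One middle ground, sketched here in Python purely for illustration (this is not the sourmash implementation): store hashes and abundances in a single mapping. Python dicts preserve insertion order (guaranteed since 3.7), so inserting hashes in sorted order keeps them sorted under deletion, while each delete stays O(1).

```python
# Two parallel "vectors", as in the current Rust implementation:
mins = [2, 5, 9, 14]
abunds = [7, 1, 3, 2]

# Merge them into one insertion-ordered mapping: hash -> abundance.
hashes = dict(zip(mins, abunds))

# remove_many becomes a series of O(1) deletions; the order of the
# surviving hashes is preserved.
for h in (5, 14):
    hashes.pop(h, None)

print(list(hashes.items()))    # [(2, 7), (9, 3)]
```

Caveat: a plain dict keeps existing order under deletion, but a *new* insertion lands at the end rather than in sorted position, which is why a tree-backed map is the better analogue on the Rust side.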

@luizirber
Member

luizirber commented Jul 25, 2021

Let's try to disentangle the many threads going on in this conversation =]

This seems to work pretty fast. It's interesting how Python is so much faster than Rust, don't you think?

import sourmash

print('loading...')
big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
print(f'...done. loaded {len(big_sig.minhash)} hashes.')

print('Converting to set...')
x = set(big_sig.minhash.hashes)
print('...done!')

print('subtracting...')
y = set(x)
z = x - y
print('...done!')

This code takes shortcuts (it is not doing .to_mutable(), which triggers a copy of the data, nor does it end with a usable MinHash after the operation), so it will already be faster. Nonetheless, converting to a set makes it use twice the memory. Using memory-profiler (as @mr-eyes did in #1571), these are the results:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     3   52.844 MiB   52.844 MiB           1   @profile
     4                                         def main():
     5   52.844 MiB    0.000 MiB           1       print('loading...')
     6  186.164 MiB  133.320 MiB           1       big_sig = sourmash.load_one_signature('SRR7186547.k31.sig')
     7  186.164 MiB    0.000 MiB           1       print(f'...done. loaded {len(big_sig.minhash)} hashes.')
     8
     9  186.164 MiB    0.000 MiB           1       print('Converting to set...')
    10  741.484 MiB  555.320 MiB           1       x = set(big_sig.minhash.hashes)
    11  741.484 MiB    0.000 MiB           1       print('...done!')
    12
    13  741.484 MiB    0.000 MiB           1       print('subtracting...')
    14 1253.238 MiB  511.754 MiB           1       y = set(x)
    15 1253.238 MiB    0.000 MiB           1       z = x - y
    16 1253.238 MiB    0.000 MiB           1       print('...done!')

My point here: the Rust code is trying to avoid allocating more memory than needed, and this is DISASTROUS with the current implementation when removing many hashes. Since it is an ordered vector, each removal has to shift large chunks of the vector (as Mo pointed out in his explanation of what Rust is doing). This is easy to see with py-spy top -n:

Collecting samples from 'python -m memory_profiler -o rust.time slow_remove.py' (python v3.8.9)
Total Samples 248200
GIL: 0.00%, Active: 100.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%    2481s     2481s   __memmove_avx_unaligned_erms (libc-2.32.so)
  0.00% 100.00%   0.620s     2482s   sourmash::sketch::minhash::KmerMinHash::remove_hash::hec5d940496ec6541 (sourmash/_lowlevel__lib.so)
  0.00% 100.00%   0.140s     2482s   sourmash::ffi::utils::landingpad::hf48eacb578fa3b98 (sourmash/_lowlevel__lib.so)
...

It literally spends all the time moving memory around.
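The same pathology is easy to reproduce in Python (illustrative only; the list sizes are assumptions, not taken from the issue): deleting from the front of a list shifts the whole remaining tail on every delete, which is exactly what Vec::remove does in Rust, while deleting from the back moves nothing.

```python
import timeit

N = 30_000

def delete_front():
    xs = list(range(N))
    while xs:
        del xs[0]      # shifts the remaining tail left: O(n) per delete

def delete_back():
    xs = list(range(N))
    while xs:
        del xs[-1]     # no shifting: O(1) per delete

print('front:', timeit.timeit(delete_front, number=1))
print('back: ', timeit.timeit(delete_back, number=1))
```

The front-deleting loop spends essentially all of its time in memmove, just like the py-spy profile above.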


So, what to do?

  • The call to .to_mutable() should transform the MinHash on the Rust side from a KmerMinHash into a KmerMinHashBTree. The latter supports the same operations that the set() in Python is doing, and is going to be much faster.
  • As Mo pointed out, the Rust side still keeps abundances separate from hashes. mins and abunds can probably be merged into a hashes: BTreeMap<u64, Option<u64>> field, which would let us iterate over and modify both hashes and abundances much more conveniently. (Incidentally, BTreeSet<T> is implemented as a BTreeMap<T, ()>, so it should still be pretty efficient space-wise, but more testing is needed.)
  • #1616 (Spread 'frozen' behavior through code base & document in dev docs) already did the legwork of spreading Frozen around, but there still needs to be a method exposed on the FFI to convert from KmerMinHash 🦀/Minhash 🐍 to KmerMinHashBTree 🦀/FrozenMinhash 🐍 and vice versa so we can benefit from these changes.
  • KmerMinHashBTree is a HORRIBLE name. Probably rename it too while doing these changes.
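The frozen/mutable split described in the first and third bullets can be sketched in Python; the class names here (FrozenSketch, MutableSketch) are hypothetical stand-ins, not the sourmash API.

```python
class FrozenSketch:
    """Read-optimized: sorted tuple of hashes, cheap iteration and merging."""
    def __init__(self, hashes):
        self.hashes = tuple(sorted(hashes))

    def to_mutable(self):
        # One O(n) conversion up front buys cheap edits afterwards.
        return MutableSketch(self.hashes)

class MutableSketch:
    """Edit-optimized: set-backed, so remove_many is fast."""
    def __init__(self, hashes):
        self.hashes = set(hashes)

    def remove_many(self, hashes):
        self.hashes -= set(hashes)

    def into_frozen(self):
        # Re-sort once when freezing, instead of on every mutation.
        return FrozenSketch(self.hashes)

mh = FrozenSketch([9, 2, 5]).to_mutable()
mh.remove_many([2])
print(mh.into_frozen().hashes)    # (5, 9)
```

The design point is that each representation is optimal for one phase, and the conversion cost is paid once at the boundary rather than on every removal.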

If you can cheat, I can cheat too =]

There is some wonkiness in the API, but I used the released crate (so no optimizations in KmerMinHashBTree like I suggested above) to write a quick-and-dirty program that does the same thing.

use sourmash::signature::{Signature, SigsTrait};
use sourmash::sketch::minhash::KmerMinHashBTree;
use sourmash::sketch::Sketch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("loading...");
    let (reader, _) = niffler::from_path("SRR7186547.k31.sig")?;
    let mut sig: Vec<Signature> = serde_json::from_reader(reader)?;

    if let Sketch::MinHash(big_sig) = sig.swap_remove(0).sketches().swap_remove(0) {
        println!("...done. loaded {} hashes.", big_sig.size());

        println!("converting to mutable...");
        let mut mh: KmerMinHashBTree = big_sig.into();
        println!("...done");

        println!("subtracting...");
        mh.remove_many(&mh.mins())?;
        println!("...done");
    }

    Ok(())
}

Timings

Rust:

$ /usr/bin/time -v -- cargo run --release
    Finished release [optimized] target(s) in 0.02s
     Running `target/release/remove_hashes`
loading...
...done. loaded 8679673 hashes.
converting to mutable...
...done
subtracting...
...done
        Command being timed: "cargo run --release"
        User time (seconds): 4.19
        System time (seconds): 0.53
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.72
        Maximum resident set size (kbytes): 950944
        File system inputs: 336772

Python (with sets):

$ /usr/bin/time -v -- python python_remove.py
loading...
...done. loaded 8679673 hashes.
Converting to set...
...done!
subtracting...
...done!
        Command being timed: "python python_remove.py"
        User time (seconds): 8.00
        System time (seconds): 2.56
        Percent of CPU this job got: 117%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.95
        Maximum resident set size (kbytes): 1342936
        File system inputs: 345687

@ctb
Contributor Author

ctb commented Jul 25, 2021

If you can cheat, I can cheat too =]

😆

My intention was to point out that there must be options. Thank you for falling into my trap^W^W^W^Wexploring them!

@ctb
Contributor Author

ctb commented Jul 25, 2021

and let me just say how adorable the 🦀 and 🐍 are!

@luizirber
Member

luizirber commented Jul 25, 2021

My intention was to point out that there must be options. Thank you for falling into my trap^W^W^W^Wexploring them!

I know, I know. Just pointing out that the options were there all along, but... not implemented all the way across the FFI =P

and let me just say how adorable the 🦀 and 🐍 are!

Right? gonna start using it all the time when talking about Rust and Python types 😸

@mr-eyes
Member

mr-eyes commented Jul 17, 2022

Coming from #1771

pub fn remove_hash(&mut self, hash: u64) {
    if let Ok(pos) = self.mins.binary_search(&hash) {
        if self.mins[pos] == hash {
            self.mins.remove(pos);
            self.reset_md5sum();
            if let Some(ref mut abunds) = self.abunds {
                abunds.remove(pos);
            }
        }
    };
}

The binary search itself is only O(log n); it's the Vec::remove after it that slows down the process of removing many elements, since each removal shifts the whole tail of the vector.

Would replacing the vec<hashes> & vec<abundance> with a hashmap<hash, abundance> be the optimal solution here?
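For reference, here is a Python analogue (illustrative, not sourmash code) of the remove_hash above, using the stdlib bisect module; it makes it easy to see that the lookup is cheap and the deletion is the O(n) step.

```python
import bisect

def remove_hash(mins, abunds, h):
    # Binary search for the position: O(log n).
    pos = bisect.bisect_left(mins, h)
    if pos < len(mins) and mins[pos] == h:
        del mins[pos]              # O(n): every later element shifts left
        if abunds is not None:
            del abunds[pos]        # same shift in the parallel vector

mins = [3, 7, 11, 19]
abunds = [1, 2, 1, 5]
remove_hash(mins, abunds, 11)
print(mins, abunds)                # [3, 7, 19] [1, 2, 5]
```

Called once, this is fine; called for millions of hashes, the two `del`s dominate, which is what the py-spy profile earlier in the thread showed.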

@mr-eyes
Member

mr-eyes commented Jul 17, 2022

Would replacing the vec<hashes> & vec<abundance> with a hashmap<hash, abundance> be the optimal solution here?

Ok, that was already discussed :neckbeard:.

@ctb
Contributor Author

ctb commented Jul 18, 2022

We should revisit the code in #2123 if/when we speed up remove_many.

@ctb
Contributor Author

ctb commented Jul 22, 2022

I ran the scaled=1000 version of the benchmark in #1747:

time py-spy record -o latest.svg -- sourmash gather SRR10988543.sig bins.sbt.zip >& latest.out

and saw the following:

(screenshot: py-spy flame graph, "Screen Shot 2022-07-21 at 9 39 54 PM")

so, still some work to do here :)

It took < 2 hours to run, but not by much.

@ctb
Contributor Author

ctb commented Jul 22, 2022

Oh, and when we zoom in on the block on the right, we see that remove_many continues to be a problem: it's the big block on the left.

(screenshot: zoomed py-spy flame graph, "Screen Shot 2022-07-22 at 7 07 26 AM")

The other big chunk of time is in this generator expression in search.py:

        weighted_missed = sum((orig_query_abunds[k] for k in query_hashes))
