add more hash manipulation utilities to sourmash CLI #1266

Open
ctb opened this issue Jan 1, 2021 · 2 comments
Labels
good first issue; good next issue (An issue that should be ready to resolve.); plugin_todo (Write a plugin for this!)

Comments

ctb commented Jan 1, 2021

yesterday, I spent some time digging into a sourmash use case with @shannonekj, and a need for a few reasonably generic utility scripts emerged.

the code for this is in a private repository so I'll try to describe things here - we were looking for differential presence of hashes in a genome between two samples (specifically, looking for hashes that correlated with male vs female genomes).

to do this, we needed the following new functionality -

  • code to export hashes, together with their abundances, from a signature computed with track-abund; see "update sourmash sig export to export to CSV", #1098
  • code to intersect one track-abund signature with another signature, without flattening the abundances in the first signature. (note that sourmash sig intersect flattens all signatures) - this could maybe be done by updating sourmash sig intersect
  • code to select sequences in a FASTA/FASTQ file that have some number of overlapping hashes with a signature (this has been a repeatedly useful utility that I've implemented a dozen times in various contexts 😆 )
  • code to estimate the abundance of sequences based on median hash abundance from a signature (i.e. estimate sequence abundance in a FASTA/FASTQ file using abundances from a track-abund sourmash signature) - this may be too niche to implement in sourmash directly, but I feel like it has come in handy.
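
As a rough illustration of the first two items, here is a minimal sketch that assumes the hashes and abundances have already been pulled out of the signatures into plain Python containers. The function names `intersect_keep_abund` and `export_hashes_csv` and the CSV column names are hypothetical and are not part of sourmash; loading and saving real signatures is deliberately left out.

```python
import csv
import io


def intersect_keep_abund(abund_hashes, other_hashes):
    """Keep hashes present in both inputs, with abundances taken from the first.

    abund_hashes: dict mapping hash value -> abundance (stand-in for a
                  track-abund signature; an assumption of this sketch).
    other_hashes: any iterable of hash values (abundances, if any, are ignored).
    """
    other = set(other_hashes)
    return {h: a for h, a in abund_hashes.items() if h in other}


def export_hashes_csv(abund_hashes, fp):
    """Write one 'hashval,abundance' row per hash, sorted by hash value."""
    writer = csv.writer(fp)
    writer.writerow(["hashval", "abundance"])  # hypothetical column names
    for hashval in sorted(abund_hashes):
        writer.writerow([hashval, abund_hashes[hashval]])


# Toy data standing in for two signatures.
sample = {10: 3, 20: 5, 30: 1}
genome = [20, 30, 40]

shared = intersect_keep_abund(sample, genome)
buf = io.StringIO()
export_hashes_csv(shared, buf)
print(buf.getvalue())
```

The point of the sketch is only that the intersection filters the first signature's hash-to-abundance mapping rather than flattening it first.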

I implemented all of this in a Jupyter notebook fairly easily, but it'd be nice to have this in the sourmash CLI.

since code exists for all of this and I can make it available upon request, I'll label this as a good first issue...

ctb mentioned this issue Jan 4, 2021

ctb commented Mar 4, 2021

I put several utilities in https://github.com/ctb/2020-emqc-scripts, which also includes the abundhist code from #933.

  • contig-intersect.py intersects the hashes of two signatures and produces a new signature that carries the abundances from one of them. It's useful for examining the abundances of assembled contigs.
  • extract-contigs.py extracts contigs based on their median hash abundance in a signature.

These are pretty good candidates for a plugin :) #1353
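
For anyone picking this up, the idea behind selecting sequences by hash overlap and by median hash abundance can be sketched roughly as below. This is not the code from the scripts above. It assumes the signature's hashes and abundances are available as a plain dict, and it uses a stand-in hash (BLAKE2b truncated to 64 bits) with no reverse-complement handling and no scaled downsampling, so its hash values are not compatible with real sourmash signatures. All function and parameter names are hypothetical.

```python
import hashlib
from statistics import median


def kmer_hashes(seq, ksize):
    """Yield a stand-in 64-bit hash per k-mer (NOT sourmash's hash function)."""
    seq = seq.upper()
    for i in range(len(seq) - ksize + 1):
        kmer = seq[i:i + ksize].encode()
        digest = hashlib.blake2b(kmer, digest_size=8).digest()
        yield int.from_bytes(digest, "big")


def median_abundance(seq, ksize, abund_hashes):
    """Median abundance over the sequence's distinct hashes found in abund_hashes.

    Returns 0 when none of the sequence's hashes are in abund_hashes.
    """
    found = [abund_hashes[h] for h in set(kmer_hashes(seq, ksize))
             if h in abund_hashes]
    return median(found) if found else 0


def select_sequences(records, ksize, abund_hashes, min_overlap=1, min_median=0):
    """Yield (name, seq, overlap, median) for records passing both thresholds.

    records: iterable of (name, sequence) pairs, e.g. parsed from FASTA/FASTQ.
    """
    for name, seq in records:
        hashes = set(kmer_hashes(seq, ksize))
        overlap = len(hashes.intersection(abund_hashes))
        med = median_abundance(seq, ksize, abund_hashes)
        if overlap >= min_overlap and med >= min_median:
            yield name, seq, overlap, med


# Toy example: build a fake "signature" from one contig's k-mers.
toy_hashes = list(kmer_hashes("ACGTAC", 4))
toy_sig = dict(zip(toy_hashes, [2, 4, 10]))
contigs = [("c1", "ACGTAC"), ("c2", "TTTTTT")]
for name, seq, overlap, med in select_sequences(contigs, 4, toy_sig, min_overlap=2):
    print(name, overlap, med)
```

A real plugin would swap `kmer_hashes` for sourmash's own hashing so that the values line up with the signature being queried.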

ctb added the good next issue label Sep 23, 2021

ctb commented Jun 13, 2022

some of these may now be possible with sig inflate, added in #1889

ctb added the plugin_todo label Sep 23, 2023