Add "describe" to output kmer sizes, molecules, scaled, and num hashes for each signature #561

Closed
olgabot opened this issue Oct 26, 2018 · 4 comments

Comments

@olgabot
Collaborator

olgabot commented Oct 26, 2018

Hello! I keep forgetting exactly which parameters I used for each signature, and then get frustrated/confused when I run sourmash compare and there are no compatible k-mer sizes or molecules. This feature would tell the user what kinds of signatures are inside each signature file.

Here's what I hacked up. I know this uses json instead of ijson, but my brain got so turned around trying to use the iterator that I used json for now. I also couldn't figure out how to access the 'scaled' option.

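import itertools
import json

import pandas as pd  # needed for pd.DataFrame in describe()


# Keys whose values are summarized by their length (e.g. number of hashes)
keys_for_length = ('mins',)

# Keys whose values are reported as-is
keys_for_values = ('ksize', 'molecule')

keys_to_print = keys_for_length + keys_for_values


def _describe_single(signature):
    """Return one summary dict per entry in signature['signatures']."""
    data = []
    for sig in signature['signatures']:
        d = dict(name=signature['name'])
        for key, value in sig.items():
            if key not in keys_to_print:
                continue

            if key in keys_for_length:
                # Report the count rather than the values: 'mins' -> 'n_mins'
                v = len(value)
                k = 'n_' + key
            elif key in keys_for_values:
                v = value
                k = key
            d[k] = v
        data.append(d)
    return data


def describe(filename):
    """Summarize every signature in a JSON signature file as a DataFrame."""
    with open(filename) as f:
        # The file holds a list of signature records
        signature = json.load(f)
    data = itertools.chain(*map(_describe_single, signature))

    description = pd.DataFrame(list(data))
    return description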

This outputs the following pandas dataframe:

[Image: screenshot of the resulting DataFrame, 2018-10-26]
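On the 'scaled' question above, here is a minimal sketch of one possible approach. It assumes (this is not confirmed anywhere in this thread) that each entry in 'signatures' carries a 'max_hash' field and that scaled is roughly the size of the 64-bit hash space divided by max_hash. The helper name scaled_from_max_hash is hypothetical and not part of sourmash.

```python
def scaled_from_max_hash(max_hash, hash_space=2**64):
    """Estimate 'scaled' from a 'max_hash' value.

    Assumption: scaled ~= hash_space / max_hash, and a max_hash of 0
    (or a missing value) means the sketch was not built with 'scaled'.
    """
    if not max_hash:
        return None
    return int(round(hash_space / max_hash))


# Hypothetical entry shaped like the ones read in _describe_single
sketch = {'ksize': 31, 'molecule': 'DNA',
          'max_hash': 2**64 // 1000, 'mins': [1, 2, 3]}
print(scaled_from_max_hash(sketch.get('max_hash', 0)))
```

If the assumption holds, this value could be added as one more column next to ksize and molecule.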

@olgabot
Collaborator Author

olgabot commented Oct 26, 2018

I know pandas isn't a dependency, so we can use the csv module instead.
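A rough sketch of what a pandas-free version might look like, using only the standard library. The function names describe_rows and describe_to_csv and the column order are illustrative assumptions. The input layout (a list of records, each with 'name' and 'signatures') is taken from the code earlier in this issue.

```python
import csv
import json
import sys

# Column names mirror the keys produced by the pandas version above
FIELDNAMES = ('name', 'ksize', 'molecule', 'n_mins')


def describe_rows(records):
    """Yield one summary dict per entry in each record's 'signatures'."""
    for record in records:
        for sketch in record.get('signatures', []):
            yield {
                'name': record.get('name', ''),
                'ksize': sketch.get('ksize'),
                'molecule': sketch.get('molecule'),
                'n_mins': len(sketch.get('mins', [])),
            }


def describe_to_csv(filename, out=sys.stdout):
    """Read a signature JSON file and write the summary as CSV to `out`."""
    with open(filename) as f:
        records = json.load(f)
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(describe_rows(records))
```

Calling describe_to_csv('example.sig') would print the table to the terminal, where 'example.sig' is a placeholder filename. Passing an open file as out writes it to disk instead.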

@ctb
Contributor

ctb commented Nov 1, 2018

beautiful. yes, we should add this as a basic command line option I think!

@taylorreiter
Contributor

I was just thinking that something like this would be very useful as I ran head on a signature and had to scroll back up a million lines in my terminal to see the information I wanted. Thank you @olgabot!

@ctb
Contributor

ctb commented Dec 27, 2018

Added as a command line feature, 'sourmash sig info', in #587. Outputs a CSV which can be used as above.

@ctb ctb closed this as completed Jan 13, 2019