Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hp encoding for proteins #758

Merged
merged 50 commits into from Nov 22, 2019
Merged

Add hp encoding for proteins #758

merged 50 commits into from Nov 22, 2019

Conversation

pranathivemuri
Copy link
Contributor

@pranathivemuri pranathivemuri commented Oct 22, 2019

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@pranathivemuri
Copy link
Contributor Author

@olgabot @luizirber @taylorreiter Added HP encoding, please review when you are free!

@luizirber
Copy link
Member

test_sig_filter_1 is failing in python3.5 - https://travis-ci.com/dib-lab/sourmash/jobs/248448499#L530 which I didn't change

Yes, it comes from #748, @ctb is checking it

@luizirber
Copy link
Member

We should also consider #751 before merging this

@pranathivemuri
Copy link
Contributor Author

I am not sure if the coverage is actually calculated correctly. The codecov bot edit history shows that coverage increased at one point yesterday, now decreased again, all I did was fix a test and add more tests and merge from master.

@ctb
Copy link
Contributor

ctb commented Oct 31, 2019

Thank you! I'm really enthusiastic about this functionality, and these changes look good to me. I would like to have a chance to dig into them more carefully, but might not have the time in the near future.

Questions I have:

  • are we saving this information in the JSON signatures?
  • do we have enough tests? This bit of signature calculation/comparison code is getting complicated. In particular, do we test the detection of incompatible minhash sigs?
  • what impact does this have on Rust branches?

@pranathivemuri
Copy link
Contributor Author

  1. Yes
  2. I am not sure when there would be incompatible minhash signatures. currently the testing is done to see if the signature json has hp level signature, the signature contents i.e mins are not tested. They haven't been tested before either
  3. Rust and python minhash will diverge unless we add hp encoding on rust minhash library

Thanks for reviewing and asking great questions!

@luizirber
Copy link
Member

* are we saving this information in the JSON signatures?
1. Yes

We need to start thinking more about how to document these things... Adding new things to the JSON signatures should trigger data version bumps (should already have happened for dayhoff). But this is being discussed in #751 and is not really a blocker for this PR.

* what impact does this have on Rust branches?
3. Rust and python minhash will diverge unless we add hp encoding on rust minhash library

This PR is very similar to #689, and relatively straighforward to add to the Rust branches.

@pranathivemuri
Copy link
Contributor Author

@luizirber I can add the rust stuff in a different PR if that's blocking this PR. Let me know if I can directly check out from master and make those changes.

@luizirber
Copy link
Member

@luizirber I can add the rust stuff in a different PR if that's blocking this PR. Let me know if I can directly check out from master and make those changes.

Oh, that's not really blocking the PR. I think #751 is blocking more, but I'm comfortable with this one being merged now and we figure out the rest later.

Adding the dayhoff encoding in the Rust branch was a bit annoying because I want to keep #424 with only the Python/CFFI changes, but obviously there were changes on the Rust code too. I ended up implementing it all in the same branch to be sure it worked, and then cherry-picked changes into a new PR #760 (which contains other changes that accumulated in #424 but were not really relevant to that PR).

On @ctb comments about tests, even if we are covering the new lines I'm not so sure we are taking care of all the behaviors we expect. I'll take a closer look and come up with suggestions soon.

@pranathivemuri
Copy link
Contributor Author

thanks @luizirber

Getting this error in here nf-core/kmermaid#17, at least locally:

```

  == This is sourmash version 2.0.0a10.dev119+gf9bd45f. ==

  == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==


  computing signatures for files: SRR4238355_subsamp_coding_reads_peptides.fasta

  Computing signature for ksizes: [6]

  Computing only Dayhoff-encoded protein (and not nucleotide) signatures.

  Computing a total of 1 signature(s).

  ... reading sequences from SRR4238355_subsamp_coding_reads_peptides.fasta
  Traceback (most recent call last):
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/bin/sourmash", line 8, in <module>
      sys.exit(main())
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/sourmash/__main__.py", line 83, in main
      cmd(sys.argv[2:])
    File "/opt/conda/envs/nfcore-kmermaid-0.1dev/lib/python3.6/site-packages/sourmash/command_compute.py", line 376, in compute
      notify('... {} {} sequences', filename, n + 1)
  UnboundLocalError: local variable 'n' referenced before assignment
```
Copy link
Collaborator

@olgabot olgabot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Had a few suggestions for documentation and comments.

sourmash/kmer_min_hash.hh Outdated Show resolved Hide resolved
sourmash/sourmash_args.py Outdated Show resolved Hide resolved
mh.add_protein(b'AGYYG')

if hp:
assert len(mh.get_mins()) == 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still so funny to me that this sequence is all p so it only contains one k-mer!

@@ -12,3 +12,4 @@ sphinxcontrib-napoleon
setuptools_scm
setuptools_scm_git_archive
nbsphinx
bam2fasta
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I should remove this, should we still keep bam2fasta optional? @olgabot @luizirber

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave this up to @luizirber. I'd personally prefer to keep it there since the functionality will break if it is not installed, but not everyone needs the bam to fasta conversion.

@pranathivemuri
Copy link
Contributor Author

@olgabot addressed your comments. Please review when you are free

Copy link
Collaborator

@olgabot olgabot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@luizirber luizirber merged commit 0a4de16 into sourmash-bio:master Nov 22, 2019
@luizirber luizirber mentioned this pull request Dec 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants