Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add SINA reference based alignment #53

Open
2 tasks done
epruesse opened this issue Aug 28, 2018 · 22 comments
Open
2 tasks done

add SINA reference based alignment #53

epruesse opened this issue Aug 28, 2018 · 22 comments

Comments

@epruesse
Copy link
Member

epruesse commented Aug 28, 2018

TODO

  • add tests
  • expose parameters useful in Q2 context

Addition Description

Allow using SINA to align to reference (i.e. GreenGenes or SILVA)

Current Behavior

Only de-novo alignment using MaFFT

Proposed Behavior

Reference based alignment is useful for placement in pre-existing tree as alignments must match and promises better scalability than progressive de-novo alignment.

Comments

SINA (currently) requires ARB which in turn requires a ton of software, including X11, creating a bit of a problem for a core plugin.

@wasade told me to PR to qiime2/q2-alignment anyway for now. For production, the issue will need some kind of fix. Options that come to mind:

  1. Making SINA a plugin of its own: As qiime alignment sina would be the natural spot, this would require a (clean) way to inject actions into core plugins.
  2. Creating a Qiime2 specific version of SINA that works without ARB. This looses some functionality, however.

References
See PR #54

@wasade
Copy link
Member

wasade commented Aug 28, 2018

Thanks, @epruesse!

Another option would be, for now, do create a separate plugin as q2-sina which does pull down ARB, and to migrate the code to q2-alignment in the future once kinks have been worked out?

@epruesse
Copy link
Member Author

@wasade If I went the q2-sina route, where would I host that package? It should be a conda package due to the binary dependencies, but would fail tests at Bioconda due to missing qiime2 dependencies (question came up here for ghost-tree as well).

I think a minimal feature route withq2-alignment using a sina-noarb variant package that is independent of ARB should be possible within a 2-week timeframe.

The primary features offered by the ARB dependency would not easily integrate with Qiime2 anyway. The ARB files are something of QZAs themselves, combining multiple sequence features, alignments, phylogenetic trees, taxonomic groupings, per-organism and per-alignment-column metadata, etc into one database file.

@wasade
Copy link
Member

wasade commented Aug 28, 2018

It's similar to the other plugins for hosting. See q2-fragment-insertion for example.

I think an ARB independent version of SINA would reduce deployment burden, and perhaps it would make sense as a PR on this repo.

@gregcaporaso or @thermokarst, any issues with a PR here vs. a separate q2 plugin?

@gregcaporaso
Copy link
Member

I think including in q2-alignment would be a good way to go if possible. This would be a really nice addition - thanks for your interest in contributing it @epruesse!

@epruesse
Copy link
Member Author

@gregcaporaso I had waited for a while, hoping someone might add PyNAST so that I could copy-paste more easily ;-)

@epruesse
Copy link
Member Author

@wasade Is there a way to have software required by q2 plugins installed into separate conda environments? I believe we touched upon this topic at ISME.

Turns out that conda install sina in a qiime2-dev environment fails with

UnsatisfiableError: The following specifications were found to be in conflict:
  - freetype=2.9.1
  - q2-dada2=2018.8.0.dev0
  - sina

In this case I know the culprit - the openmotif package needs an update. Thinking long-term maintenance though, these things tend to become an endless rabbit hole. I'm not sure I will be able to guarantee that sina and its dependencies will remain conflict free with all other packages required by qiime2 base (in particular since sina uses boost...).

@thermokarst
Copy link
Contributor

This is a good reason for moving away from the "core distribution" model and into an a la carte "choose your favorite plugins" style (@ebolyen and I have been discussing this at length in recent months). Then if there are conflicting plugins, we at least stand a chance (just make two different envs and boumce between them).

@epruesse
Copy link
Member Author

OFF TOPIC - re dependency hell
@thermokarst @ebolyen A point for your discussion: While you obviously can't separate out Python dependencies that need to work in the same interpreter instance as qiime itself, you could create separate "per tool" conda environments for binary dependencies. I was thinking about that for SINA, but framework support would make things a lot easier.

Roughly, what I'd envision is defining an action with an extra parameter:

register_function(conda_env={'mycommand': {'bioconda': ['fastspar']}}, ...)

The function would then be passed an accessor object via which commands can be executed.

def myfunction(input1: type1, conda_env: qiime2.envrunner):
  conda_env.run("mycommand", ['mycommand', '-i', str(input1)])
  ...

Upon call, the accessor object can ensure that the environment is available, installing it on demand in a suitable place (same disk as conda installation to allow hard links to work), and prefix the command with source activate $PATH_TO_ENV;. The accessor could also handle verbosity/debugging such as implemented byq2_alignment.mafft.run_command in a more frameworky way.

This would limit dependency hell to the things needed by the plugin python code, which is much more controllable than the needs of the wrapped tools. Initial installation size would also be reduced, potentially allowing many more plugins to be delivered with qiime core.

The CI system could then actually pre-resolve the environments, run unit-tests and store fully qualified URLs to be used for creating environments on the user side. The latter would prevent user/support pain with changing packages in the conda channels, and also has the potential of greatly increasing installation speed.

I've got code doing essentially this for my own pipeline. If you want to derive from that I'll happily grant you license (code is GPL but all mine) to do so.

@epruesse
Copy link
Member Author

Ok, I've got the dependency issue fixed. conda install -c bioconda sina now requires only arb-bio-tools, boost, and libarbdb in addition to sina itself. Total installed size is ~200MB, of which most is boost (~180MB).

How do I handle the ARB files and the indices that SINA needs for alignment?

The "normal", non-Qiime workflow looks like this:

  1. (Uncommon) Convert reference alignment from FASTA to ARB format
  2. (First time only) Build Suffix Trie from reference Alignment ("pt server", .arb.pt file).
  3. Align sequences to match reference alignment

Step 2 can take quite a while, but since we usually work with a static reference database, the hour or so spent is only expended once.

Questions:

  • How do I handle the ARB files? Do I create a new semantic type for those? They are unusual in that they can contain many alignments and trees as well as per-sequence and per-column meta-data (e.g. taxonomy, habitat, etc or variability masks, folding structures, respectively). I'm not sure how those would fit into the semantic type model.
  • How do I handle caching of indices? It'd be wasteful to spend 99% of the compute time to re-calculate those. Plus, they are big, so even zipping/unzipping takes enough time to annoy users.

@epruesse
Copy link
Member Author

epruesse commented Sep 11, 2018

q2lint wants me to claim you guys wrote the _sina.py. Filed an issue here qiime2/q2lint#20

@epruesse
Copy link
Member Author

ping @gregcaporaso - could you merge #56?

The biggest chunk it adds in terms of disk is libboost. If that's still too much, I could look into doing yet another version of the sina package using static linking. I'd prefer to keep using the lib though.

W.r.t. the ARB data type problem: I've worked around it by accepting either FeatureData[AlignedSequence] as normal --i-reference input parameter or a string --p-arb-reference pointing to an ARB file. Not ideal as the latter breaks providence tracking, but as a trade-off for speed it's workable until a better solution presents itself.

@thermokarst
Copy link
Contributor

@epruesse --- #56 has been merged. Busywork will start chewing on this soon, if all goes well, you should have a new staging env file in ~2 hrs.

@epruesse
Copy link
Member Author

ACK. I'll push an update to #54 this evening then to see whether the unit tests pass on that.

@epruesse
Copy link
Member Author

W.r.t. the ARB data type problem: I've worked around it by accepting either FeatureData[AlignedSequence] as normal --i-reference input parameter or a string --p-arb-reference pointing to an ARB file. Not ideal as the latter breaks providence tracking, but as a trade-off for speed it's workable until a better solution presents itself.

That way, using a large reference database it can stay in arb format and have the hefty indices sitting next to it. Smaller databases should work fine with qza format.

I'm also working on pushing SINA 1.4 out which will have native parallel processing and a built-in search engine obviating the need for the arb_pt_server and it's demands on disk and memory.

@ebolyen
Copy link
Member

ebolyen commented Sep 17, 2018

I had a thought the other day about the ARB situation. What if QIIME 2 artifacts had an ephemeral host-specific caching layer?

We had discussed holding a "view cache" within the artifact, but we've always kind of discarded the idea since it breaks the immutability of an artifact. However if the cache (in this case an arb file) was just somewhere else on disk, we could use the UUID to look up if such a file exists.

This would be useful for all manner of index formats which are usually large and avoids the unzip problem.

The most immediate issue would be the provenance tracking of that transformation (which could have happened with different software versions), but that seems like a workable problem to me.

@epruesse
Copy link
Member Author

epruesse commented Sep 17, 2018

@ebolyen There are few concepts in there:

  1. Caching an unpacked QZA directory. You might be doing this already? Integrity can be protected by removing the user's write permission, and verified using mtime and size. Or, causing some extra IO, using the CRC32 from the zip or a checksum you may already keep as part of the manifest. The biggest issue is where this should reside and how to ensure that it does get cleaned away.

  2. Caching transformed on-disk data. Indexing, and other slower one-time operations (say converting an RNA to an DNA database) would benefit from this. You could just use a hash directory (=hash table on disk 😄) to keep those files based on their provenance.

Be aware that indexing is often parametrizable, even if that isn't used very often.

@epruesse epruesse reopened this Sep 17, 2018
@ebolyen
Copy link
Member

ebolyen commented Sep 17, 2018

I was thinking more along the lines of 2 but 1 would be pretty easy to do as a side effect of 2.

I guess the hard part becomes not using up all of the disk without the user realizing.

@epruesse
Copy link
Member Author

@ebolyen Exactly. And managing that is non-trivial. No good OS mechanisms exist to e.g. trigger a cleanup if disk space becomes low. The closest is using XDG_CACHE_DIR to at least keep the "non essential" data in one place. You'd have to check and remove orphaned entries, account for total cache size, do some sort of least-used, longest-unused or lowest-gain type cache cleaning and have a qiime tools clean to get rid of it all.

@epruesse
Copy link
Member Author

@gregcaporaso #54 builds and tests fine now. Where do you want me to add docs? Anything else on your wishlist for this?

@epruesse
Copy link
Member Author

epruesse commented Oct 1, 2018

See q2-alignment-reference-based-alignment-using-sina

Is there a way to show warnings to the user even if --verbose is not active? Using SINA with non-SILVA alignments may "break" if the reference alignment has too few columns to fit insertions the new sequences bring.

@thermokarst
Copy link
Contributor

Is there a way to show warnings to the user even if --verbose is not active?

No, unfortunately not (related: qiime2/qiime2#141). Would it make sense to raise an Exception in the case where there are too few columns? That error message would be guaranteed to make it to a user.

@ebolyen
Copy link
Member

ebolyen commented Sep 21, 2022

Some initial work for this was done in #54

Future implementations might want to build on the above.

@gregcaporaso gregcaporaso removed their assignment Mar 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants