Proposed QIIME2 Roadmap #4
Changes from all commits
440e470
10a7575
8dbc37e
8b9ddaa
086a539
2b9e5bd
546ea45
d3aec69
7bfb297
baeac5c
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,164 @@ | ||
# QIIME2 Proposed Roadmap | ||
We propose a complete re-envisioning and redesign of QIIME from the ground up, | ||
hereby referred to as QIIME2. In this document, we provide a concise and | ||
high-level overview of various aspects of the QIIME2 project and how they differ | ||
from the current QIIME software package. | ||
|
||
**Note:** This summary is a **proposal** of high-level ideas that will guide the | ||
design and implementation of QIIME2. We are soliciting input from all QIIME | ||
developers and the QIIME user community via the QIIME forum. **Nothing is | ||
finalized and everything is | ||
subject to change.** Once we reach agreement on the project's direction and | ||
vision, we will provide additional documents with further details | ||
(e.g., requirements and design documents). | ||
|
||
The roadmap is meant to provide a high-level view of QIIME2. **It does not | ||
contain specific implementation details.** For example, we may mention the use | ||
of a database, but we're not yet defining the database schema or assuming use | ||
of a particular database implementation (e.g., PostgreSQL). | ||
|
||
This document was originally prepared based on conversations between | ||
@gregcaporaso, @ebolyen, and @jairideout. | ||
|
||
## Aspects of QIIME2 | ||
|
||
### Client-Server Architecture | ||
QIIME2 will use a client-server architecture, allowing it to provide a graphical | ||
interface (this will also enable multiple arbitrary interfaces, e.g., CLI, iPad, | ||
BaseSpace). This architecture supports both single-host (e.g., a laptop or | ||
VirtualBox) and multi-host (e.g., a cluster or EC2) deployments. **All | ||
interactions** with QIIME2 will happen through a standardized protocol provided | ||
by the server (_qiime-server_). The goal of the protocol is to reduce complexity | ||
and duplication in defining multiple interfaces. Additionally, it will allow | ||
remote execution over a network barrier (this would have been difficult to | ||
achieve with [pyqi](http://pyqi.readthedocs.org/en/latest/)). | ||
|
||
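To make the idea of one protocol shared by every interface more concrete, here is a minimal sketch that assumes a JSON message format. The roadmap does not define the wire format or transport, so the field names (`action`, `method`, `params`) and the functions `encode_request` and `handle_request` are hypothetical.

```python
import json


def encode_request(method, params):
    """Build the message any client (CLI, web GUI, ...) might send."""
    # Field names are assumptions made for this sketch only.
    return json.dumps({"action": "run", "method": method, "params": params})


def handle_request(raw, registry):
    """Server side: decode a message and dispatch to registered computation."""
    message = json.loads(raw)
    func = registry.get(message["method"])
    if func is None:
        return json.dumps({"status": "error", "reason": "unknown method"})
    result = func(**message["params"])
    return json.dumps({"status": "ok", "result": result})


# A toy registry standing in for whatever computation the server exposes.
registry = {"add": lambda a, b: a + b}
reply = json.loads(handle_request(encode_request("add", {"a": 1, "b": 2}), registry))
```

Because every client produces the same message, a new interface would only need to implement `encode_request`-style serialization rather than its own copy of the dispatch logic.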
### Workers | ||
Once the _qiime-server_ has received a request via the protocol, it will launch | ||
a worker job to perform the computation. The _qiime-server_ will provide status | ||
updates to clients through the protocol. The worker job will record the results | ||
as an _artifact_ in a database. | ||
|
||
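The request, worker, and artifact flow described above could look roughly like the following sketch. It assumes a thread pool as the worker mechanism and an in-memory dict as a stand-in for the database; the class `QiimeServerSketch` and its method names are hypothetical and do not describe a planned API.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class QiimeServerSketch:
    """Toy server: launches worker jobs and records their results."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._jobs = {}      # job id -> Future
        self.artifacts = {}  # job id -> result (stand-in for the database)

    def submit(self, func, *args):
        """Launch a worker job and return an id the client can poll."""
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = self._pool.submit(self._run, job_id, func, args)
        return job_id

    def _run(self, job_id, func, args):
        result = func(*args)
        self.artifacts[job_id] = result  # the worker records the artifact
        return result

    def status(self, job_id):
        """Status update a client could request through the protocol."""
        future = self._jobs[job_id]
        if not future.done():
            return "running"
        return "failed" if future.exception() else "complete"

    def wait(self, job_id):
        """Block until the job finishes and return its result."""
        return self._jobs[job_id].result()


server = QiimeServerSketch()
job = server.submit(sum, [1, 2, 3])
value = server.wait(job)
```

In a multi-host deployment the thread pool would presumably be replaced by a cluster scheduler, which is why the sketch hides it behind `submit` and `status`.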
### Database | ||
Review comment: It will be great to match what we are currently doing in qiita and find parallels. Then, add to qiita vs. redoing stuff, if it makes sense. |
||
**Note: This is not intended to be a substitute for the QIIME database | ||
project (QiiTA).** This is a discussion of how data will be organized and stored | ||
internally in QIIME2. | ||
|
||
The database represents a significant departure from the way QIIME currently | ||
Review comment: Would this mean that the only way to access this data would be to do it via the Q2 interface itself? I try to think of the case where other tools want to access the data generated by QIIME; would this then add a step where you serialize any of the contents of your QIIME study as a regular file?
Review comment: Yes, there would be an explicit export step, or in the case of a web interface, likely a right-click download. It is also possible that we could provide the ability to export an entire analysis as a tarball, which might look like a well-organized output directory in QIIME right now.
Review comment: Thanks for expanding on this. I am concerned that this makes the usage On (Sep-23-14|14:56), Evan Bolyen wrote:
Review comment: That is all true, but we do gain the ability of the interface to reason about composition if it has control of the artifacts in an abstract way. Perhaps a compromise is possible where a database simply logs the locations of files, but presently many of the files waste a lot of disk space by repeating what should be a relation to another table. I think this is definitely something worth talking about over a call.
Review comment: Thanks again. This also seems like a good topic to discuss with an expert in HCI. On (Sep-23-14|15:13), Evan Bolyen wrote:
|
||
handles data (e.g. storing input and output files in a directory structure). | ||
Presently, data is serialized and deserialized to and from the file-system at | ||
each step in an analysis. The resulting data are highly denormalized; for | ||
example, sample IDs are duplicated throughout nearly every file format used in | ||
QIIME. This gives rise to a number of issues. For example, it is very difficult | ||
and error-prone to rename a sample ID after sequences have been demultiplexed. | ||
|
||
Since QIIME fundamentally deals with samples at every step in an analysis, they | ||
will become the basis of structuring output in a normalized way. The database | ||
will store these normalized data as _artifacts_. _Artifacts_ are data that are | ||
analogous to QIIME's input and output files, but annotated with additional | ||
metadata (e.g., history/provenance, semantic type, etc.). An _artifact_ can be | ||
data that has been imported into QIIME2 (e.g., raw sequence data), or output | ||
produced during an analysis (e.g., a UniFrac distance matrix). _Artifacts_ can | ||
be exported in a variety of file formats (e.g., for use in external tools, to | ||
share with collaborators, or include in a publication). | ||
|
||
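The sample-renaming problem shows why normalization helps. The sketch below uses SQLite and a two-table schema purely for illustration; the roadmap deliberately does not pick a database or define a schema, so every table and column name here is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL
    );
    CREATE TABLE sequence (
        id        INTEGER PRIMARY KEY,
        sample_id INTEGER NOT NULL REFERENCES sample(id),
        seq       TEXT NOT NULL
    );
""")

# The sample name is stored once; sequences refer to it by key.
conn.execute("INSERT INTO sample (id, name) VALUES (1, 'gut.1')")
conn.executemany(
    "INSERT INTO sequence (sample_id, seq) VALUES (?, ?)",
    [(1, "ACGT"), (1, "GGCA")],
)

# Renaming the sample touches a single row, even after demultiplexing.
conn.execute("UPDATE sample SET name = 'gut.baseline' WHERE id = 1")

rows = conn.execute(
    "SELECT sample.name, sequence.seq FROM sequence "
    "JOIN sample ON sample.id = sequence.sample_id "
    "ORDER BY sequence.id"
).fetchall()
```

With denormalized files, the same rename would require rewriting every file that embeds the sample ID.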
### Graphical Interface (Web-based) | ||
Review comment: I think this could be a nice add-on to qiita. |
||
Currently it is very difficult to create custom workflows in QIIME; only a few | ||
core developers are able to, and it leads to messy and error-prone code that is | ||
difficult to maintain and validate with unit tests. Current QIIME workflows are | ||
essentially black boxes: many users have voiced concern (e.g., on the QIIME | ||
forum and at workshops) that they don't know the exact steps a workflow is | ||
performing. Users have also been asking for a graphical way to perform QIIME | ||
analyses since QIIME's first release; this is likely the most popular request | ||
we've received, and it would significantly cut down the support burden on the | ||
QIIME forum. | ||
|
||
To address these concerns, we propose an easy-to-use, portable web-based | ||
interface as the primary way to interact with QIIME2. The web-interface will not | ||
merely wrap a command line interface (as we attempted with pyqi), but instead | ||
will provide a powerful workflow-centered interaction model for both technical | ||
and non-technical users. The interface will allow users to easily create | ||
arbitrary workflows by dragging and dropping methods together. They will be | ||
guided by a strong semantic type system to prevent easily-avoided errors such as | ||
passing pre-split-libraries sequence data into OTU picking workflows. Users will | ||
then be able to preview, export, download, visualize, and view the history of | ||
their data as it becomes available. Additionally, they may be able to query their | ||
results like a database (because they are stored in one). | ||
|
||
### Semantic Type System | ||
All inputs and outputs of methods and workflows are _artifacts_. All | ||
_artifacts_ have a semantic type. This allows inference and simple | ||
validation when creating analyses (e.g., showing a user what methods/workflows | ||
can be applied to an _artifact_). | ||
|
||
There are two kinds of types: _abstract_ and _concrete_ types. An _abstract_ | ||
type is a group or collection of _concrete_ types that share a common interface. | ||
A _concrete_ type is a specific flavor of an _abstract_ type. Two _artifacts_ of | ||
different _abstract_ types are never considered equivalent because they may not | ||
have compatible interfaces, whereas two _artifacts_ of the same _abstract_ type | ||
but different _concrete_ types may be considered equivalent, though the type | ||
system would warn the user that they may be providing a semantically | ||
inappropriate type as input. The type system can be made clearer with a few | ||
examples: | ||
|
||
- Unrarefied and rarefied OTU tables are of the same _abstract_ type, and | ||
methods will work with either, but some methods (e.g. alpha and beta diversity) | ||
would semantically prefer a rarefied OTU table, while others (such as | ||
rarefaction methods) expect an unrarefied OTU table. | ||
|
||
- Positionally-filtered alignments and unfiltered alignments are of the same | ||
_abstract_ type, and both types can be passed to `make_phylogeny.py`. However, | ||
generally the user would want to pass a positionally-filtered alignment, though | ||
it may be necessary to use an unfiltered alignment in odd cases. The type system | ||
would warn users when providing an unfiltered alignment, but the user could | ||
override by acknowledging the warning. | ||
|
||
- A pumpkin pie is functionally equivalent to an apple pie, but | ||
may make less sense on the 4th of July. Pumpkin and apple pies are the same | ||
_abstract_ type, but are different _concrete_ types. A warning would be issued | ||
if a user tried to bring a pumpkin pie to a 4th of July party. An error would be | ||
issued if a user tried to bring an alligator to the party. | ||
|
||
The semantic type system will support a wide range of primitive and | ||
microbial-ecology specific types, as well as arbitrary user-defined types. | ||
|
||
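The ok/warning/error behavior in the examples above can be sketched as a small check. This assumes a type is just an (abstract, concrete) pair; the class `SemanticType`, the function `check_input`, and the type names are hypothetical illustrations, not a proposed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticType:
    abstract: str  # shared interface, e.g. "OTUTable"
    concrete: str  # specific flavor, e.g. "Rarefied"


def check_input(expected, provided):
    """Compare the type a method prefers with the type it was given."""
    if provided.abstract != expected.abstract:
        return "error"    # incompatible interfaces (the alligator)
    if provided.concrete != expected.concrete:
        return "warning"  # usable, but possibly inappropriate (pumpkin pie)
    return "ok"


RAREFIED = SemanticType("OTUTable", "Rarefied")
UNRAREFIED = SemanticType("OTUTable", "Unrarefied")
FILTERED_ALIGNMENT = SemanticType("Alignment", "PositionallyFiltered")

same = check_input(RAREFIED, RAREFIED)
flavor_mismatch = check_input(RAREFIED, UNRAREFIED)
interface_mismatch = check_input(RAREFIED, FILTERED_ALIGNMENT)
```

An interface could use the same check in reverse to list which methods accept a given _artifact_, and let the user acknowledge a `"warning"` to proceed.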
### Plugin System | ||
Review comment: This is something I'm envisioning also within qiita and the first example could be Evident ... but I see the separation between GUI and commands. |
||
The plugin system will replace QIIME's current collection of scripts by | ||
providing a repository of domain-specific computation (e.g., methods, | ||
algorithms, and analyses commonly used in microbial ecology) that has been | ||
registered with QIIME2. | ||
|
||
The plugin system will support two types of computation: _methods_ and | ||
_workflows_. A _method_ is an atomic unit of computation and is analogous to a | ||
function: it takes some input(s) (some possibly required and some optional) and | ||
produces some output(s). A _workflow_ is a directed acyclic graph (DAG) that | ||
is composed of one or more _methods_ and/or other _workflows_. Conceptually, a | ||
_workflow_ can still be viewed as a function that accepts input and creates | ||
output, just like a _method_. | ||
|
||
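A _workflow_ as a DAG of _methods_ can be sketched with the standard library's `graphlib` module (Python 3.9+), as one possible way to order the steps. The function `run_workflow`, its argument shapes, and the toy steps are assumptions made for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(steps, dependencies, inputs):
    """Run each step after the steps (or inputs) it depends on.

    steps:        name -> function taking the dict of results so far
    dependencies: name -> set of names that must be available first
    inputs:       name -> value supplied by the user
    """
    results = dict(inputs)
    # static_order() yields every node after its predecessors and raises
    # graphlib.CycleError if the graph is not acyclic.
    for name in TopologicalSorter(dependencies).static_order():
        if name in steps:
            results[name] = steps[name](results)
    return results


# Toy methods (placeholders, not real QIIME computations).
steps = {
    "doubled": lambda r: [2 * x for x in r["numbers"]],
    "total": lambda r: sum(r["doubled"]),
}
dependencies = {"doubled": {"numbers"}, "total": {"doubled"}}
results = run_workflow(steps, dependencies, {"numbers": [1, 2, 3]})
```

Since `run_workflow` maps inputs to outputs just like a single step does, a workflow could in principle be wrapped and used as a step inside another workflow.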
Each _method_/_workflow_ will be registered with QIIME2's plugin system. While | ||
the way to register computation is an implementation detail, we propose the use | ||
of Python 3's | ||
[function annotations](http://legacy.python.org/dev/peps/pep-3107/) as a clean, | ||
elegant, and built-in way to describe a function's inputs and outputs. | ||
Alternative implementations include decorators or custom docstring formats. | ||
|
||
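As a rough sketch of registration via function annotations, the example below reads a function's annotations with `inspect.signature` and stores them in a registry. The `REGISTRY` dict, the `register` function, and the placeholder types `OTUTable` and `DistanceMatrix` are hypothetical; the actual registration mechanism is left open by this roadmap.

```python
import inspect

REGISTRY = {}  # method name -> description of its inputs and output


def register(func):
    """Record a function's annotated inputs and output in the registry."""
    signature = inspect.signature(func)
    REGISTRY[func.__name__] = {
        "function": func,
        "inputs": {
            name: parameter.annotation
            for name, parameter in signature.parameters.items()
        },
        "output": signature.return_annotation,
    }
    return func


class OTUTable:
    """Placeholder semantic type for this sketch."""


class DistanceMatrix:
    """Placeholder semantic type for this sketch."""


def beta_diversity(table: OTUTable, metric: str = "unifrac") -> DistanceMatrix:
    # The body is a stub; only the annotations matter here.
    return DistanceMatrix()


register(beta_diversity)
entry = REGISTRY["beta_diversity"]
```

The annotations carry the type information, so the function itself stays an ordinary Python function that can be called and tested directly.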
When computation is registered with the plugin system, its inputs and outputs | ||
will be described using types in the | ||
[Semantic Type System](#semantic-type-system). Custom semantic types may also be | ||
defined in the plugin system. | ||
|
||
The plugins provided with QIIME2 will include functionality specific to | ||
microbial ecology. The plugin system will be easily extendable to allow | ||
users/developers to register their own custom functionality with the system. | ||
Thus, there will be an "official" set of plugins that ship with QIIME2, but the | ||
system will also allow users to install plugins from other sources. The plugin | ||
system allows the QIIME2 ecosystem to grow without requiring all methods to be | ||
specifically added to the QIIME2 distribution. | ||
|
||
## Deliverables | ||
Details will be filled in after discussion of the roadmap has taken place (so we | ||
know what actually needs to be done). | ||
|
||
## Timeline | ||
Details will be filled in after discussion of the roadmap has taken place (so we | ||
know what actually needs to be done). |
Review comment:
Is the idea to have something like an http RESTful api thing? e.g.:
curl -X POST -H "Content-Type: application/json" -d '
{
"otu_table":"full_otu_table",
"alpha_metrics":"PD"
}' http://qiime-server:8000/otu_tables/alpha_diversity
Review comment:
That is a possibility, but we are currently looking more towards a socket-based protocol (using plain TCP and WebSockets).
Of course, the actual implementation is less important than the idea of separating the interface from the server via a protocol, RESTful or not.
Review comment:
Aye, I just wonder if http-ising it would help with a website gui. Though I'm not sure if they need the same interface - the website might want to use http://qiime-server/alpha_diversity?summary=true
I'm not wedded to HTTP though.
Review comment:
Gotcha, no particular details are set in stone at this point (and they don't need to be yet). From a strictly hypothetical perspective, a WebSocket-based protocol would actually make things a bit easier from the web GUI side, as it could just open a persistent connection over which the server could push updates (as opposed to the GUI continuously polling the server). This is also how IPython operates at the moment.
Review comment:
The idea (if I am understanding this correctly) is that users would have to start a server process and leverage all compute to that server from whatever interface they decide to use? This sounds nice. I wonder if setting up a system like this would perhaps be too ambitious for the average user.
To expand on this, it seems like the client-server architecture is a good solution for a use case where you want to streamline analyzing a high volume of datasets. In much simpler cases, it almost seems like overkill and an unnecessary thing to have. Clearly, if compute and a deployed installation were to be provided for free to users, then this would make a lot of sense, as that in itself becomes a high volume of datasets to process.
Review comment:
You have the idea correct.
One of our explicit goals is to support everything from a laptop to a cluster. This means progressive enhancement. For the average user, it conceptually isn't much different from an IPython notebook, where you type `ipython notebook`. Instead you might type `qiime start`, which could launch the server and a web browser pointed at that server, and the server would just use something like a SQLite database at a path.
In the context of a cluster you could have the qiime-server running as an explicit service (like `service qiime start`) which users log into using their cluster credentials. The sysadmin can take responsibility for using a different database, managing plugins, ports, etc.
Here's a diagram representing the idea maybe a little better:
https://drive.google.com/file/d/0B_qySw7nb-DKOVU0NFlNRHZvTWc/edit?usp=sharing
In that diagram, each component can exist on its own host, or they can all exist on the same host (like a laptop).
But definitely the goal is for this to be as simple as possible and to work out-of-the-box.
Review comment:
Or the components can exist on any combination of hosts that doesn't literally exceed the number of components being used.
Review comment:
That means that if I wanted to run something like a batch job (5 commands executed serially), this would require me to create a q2 script that needs to be processed by the interactive q2 shell and then gets executed? That seems like re-inventing the wheel, something like what the Mothur CLI does.
In the hope of reducing the volume of e-mails in everyone's inbox, let's have this conversation over a call.
On (Sep-23-14|15:08), Evan Bolyen wrote:
Review comment:
Roger that. Though in your example a batch job would technically just be a workflow, which should be as simple as a drag and drop (and which then gets processed by the qiime-server).