
Proposed QIIME2 Roadmap #4

Merged (10 commits), Dec 18, 2014
8 changes: 4 additions & 4 deletions README.md
@@ -1,12 +1,12 @@
-metoo
-=====
+QIIME2
+======
 
 [![Build Status](https://travis-ci.org/biocore/metoo.png?branch=master)](https://travis-ci.org/biocore/metoo) [![Coverage Status](https://coveralls.io/repos/biocore/metoo/badge.png)](https://coveralls.io/r/biocore/metoo)
 
-*Staging ground for QIIME 2 development*
+*Staging ground for QIIME2 development*
 
 This repository serves as a staging ground for the next major version of
-[QIIME](http://qiime.org/) (i.e., QIIME 2), which will be a complete redesign
+[QIIME](http://qiime.org/) (i.e., QIIME2), which will be a complete redesign
 and reimplementation of the package.
 
 **Note:** This repository exists mainly for developers and is not intended for
164 changes: 164 additions & 0 deletions ROADMAP.md
@@ -0,0 +1,164 @@
# QIIME2 Proposed Roadmap
We propose a complete re-envisioning and redesign of QIIME from the ground up,
hereafter referred to as QIIME2. In this document, we provide a concise and
high-level overview of various aspects of the QIIME2 project and how they differ
from the current QIIME software package.

**Note:** This summary is a **proposal** of high-level ideas that will guide the
design and implementation of QIIME2. We are soliciting input from all QIIME
developers and the QIIME user community via the QIIME forum. **Nothing is
finalized and everything is
subject to change.** Once we reach agreement on the project's direction and
vision, we will provide additional documents with further details
(e.g., requirements and design documents).

The roadmap is meant to provide a high-level view of the QIIME2 project. **It does not
contain specific implementation details.** For example, we may mention the use
of a database, but we're not yet defining the database schema or assuming use
of a particular database implementation (e.g., PostgreSQL).

This document was originally prepared based on conversations between
@gregcaporaso, @ebolyen, and @jairideout.

## Aspects of QIIME2

### Client-Server Architecture


Is the idea to have something like an HTTP RESTful API thing? e.g.:

curl -X POST -H "Content-Type: application/json" -d '
{
  "otu_table": "full_otu_table",
  "alpha_metrics": "PD"
}' http://qiime-server:8000/otu_tables/alpha_diversity

Contributor Author

That is a possibility, but we are currently looking more towards a socket-based protocol (using plain TCP and WebSockets).
Of course the actual implementation is less important than the idea of separating the interface from the server via a protocol, RESTful or not.


Aye, I just wonder if http-ising it would help with a website gui. Though I'm not sure if they need the same interface - the website might want to use http://qiime-server/alpha_diversity?summary=true

I'm not wedded to HTTP though.

Contributor Author

Gotcha, no particular details are set in stone at this point (and they don't need to be yet). From a strictly hypothetical perspective, a WebSocket-based protocol would actually make things a bit easier on the web GUI side, as it could just open a persistent connection over which the server could push updates (as opposed to the GUI continuously polling the server). This is also how IPython operates at the moment.
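
To make the push idea a bit more concrete, here is a minimal, purely illustrative sketch in Python using plain TCP over asyncio (no QIIME2 protocol exists yet; the message fields, host, and port are invented). The client opens one persistent connection, sends a single request, and then just listens while the server pushes progress updates back, rather than polling:

```python
import asyncio
import json

# Purely illustrative: a hypothetical job request is sent once over a
# persistent TCP connection, and the server pushes progress updates back,
# instead of the client polling. No real QIIME2 messages or ports exist yet.
async def handle_client(reader, writer):
    request = json.loads(await reader.readline())
    for pct in (0, 50, 100):                     # stand-in for real progress
        await asyncio.sleep(0.1)
        update = {"job": request["method"], "progress": pct}
        writer.write((json.dumps(update) + "\n").encode())
        await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle_client, "127.0.0.1", 8765)
    async with server:
        # The "GUI" side: one connection, one request, then just listen.
        reader, writer = await asyncio.open_connection("127.0.0.1", 8765)
        writer.write((json.dumps({"method": "alpha_diversity"}) + "\n").encode())
        await writer.drain()
        while True:
            line = await reader.readline()       # updates arrive as they happen
            if not line:
                break
            print("update from server:", json.loads(line))

asyncio.run(main())
```

The same shape would carry over to WebSockets; only the transport would change.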


The idea (if I am understanding this correctly) is that users would have to start a server process and delegate all compute to that server from whatever interface they decide to use? This sounds nice. I wonder if setting up a system like this would perhaps be too ambitious for the average user.

To expand on this, it seems like the client-server architecture is a good solution for a use case where you want to streamline analyzing a high volume of datasets. In much simpler cases, it almost seems like overkill and an unnecessary thing to have. Clearly, if compute and a deployed installation were to be provided for free to users, then this would make a lot of sense, as that in itself becomes a high volume of datasets to process.

Contributor Author

You have the idea correct.

One of our explicit goals is to support everything from a laptop to a cluster. This means progressive enhancement. For the average user, it conceptually isn't much different from an IPython notebook, where you type `ipython notebook`. Instead you might type `qiime start`; it could launch the server and a web browser pointed at that server, and the server would just use something like a SQLite database at a path.

In the context of a cluster you could have the qiime-server running as an explicit service (like `service qiime start`) which users log into using their cluster credentials. The sysadmin can take responsibility for using a different database, managing plugins, ports, etc.

Here's a diagram representing the idea maybe a little better:
https://drive.google.com/file/d/0B_qySw7nb-DKOVU0NFlNRHZvTWc/edit?usp=sharing

In that diagram, each component can exist on its own host, or they can all exist on the same host (like a laptop).

But definitely the goal is for this to be as simple as possible and to work out-of-the-box.

Contributor Author

Or the components can exist on any combination of hosts, so long as the number of hosts doesn't exceed the number of components being used.


Does that mean that if I wanted to run something like a batch job (5 commands executed serially), I would need to create a q2 script that gets processed by the interactive q2 shell and then executed? That seems like re-inventing the wheel, something like what the Mothur CLI does.

In hopes of reducing the volume of e-mails in everyone's inboxes, let's have this conversation over a call.


Contributor Author

Roger that. Though in your example, a batch job would technically just be a workflow, which should be as simple as a drag and drop (and which then does get processed by the qiime-server).

QIIME2 will use a client-server architecture, allowing it to provide a graphical
interface (this will also enable multiple arbitrary interfaces, e.g., CLI, iPad,
BaseSpace). This architecture supports both single-host (e.g., laptop or
VirtualBox) and multi-host deployments (e.g., a cluster or EC2). **All
interactions** with QIIME2 will happen through a standardized protocol provided
by the server (_qiime-server_). The goal of the protocol is to reduce complexity
and duplication in defining multiple interfaces. Additionally, it will allow
remote execution across a network boundary (this would have been difficult to
achieve with [pyqi](http://pyqi.readthedocs.org/en/latest/)).

### Workers
Once the _qiime-server_ has received a request via the protocol, it will launch
a worker job to perform the computation. The _qiime-server_ will provide status
updates to clients through the protocol. The worker job will record the results
as an _artifact_ in a database.
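
As a sketch of this flow only (not a proposed implementation), the following assumes an invented request format, an in-memory SQLite table, and a stand-in computation; it just illustrates the chain of protocol request, worker job, and artifact recorded in a database:

```python
import json
import sqlite3
import uuid
from concurrent.futures import ThreadPoolExecutor

# All names here are hypothetical; this only illustrates the proposed flow:
# protocol request -> worker job -> result stored as an artifact.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE artifacts (id TEXT, type TEXT, data TEXT, provenance TEXT)")
pool = ThreadPoolExecutor()  # a real server might dispatch to a separate process or cluster job

def alpha_diversity(otu_table):
    """Stand-in for a real computation."""
    return {"PD": 42.0}

def handle_request(request):
    # Launch the worker and wait for its result (a real server would also
    # push status updates back to the client while the worker runs).
    result = pool.submit(alpha_diversity, request["otu_table"]).result()
    db.execute(
        "INSERT INTO artifacts VALUES (?, ?, ?, ?)",
        (str(uuid.uuid4()), "AlphaDiversity", json.dumps(result), json.dumps(request)),
    )
    db.commit()

handle_request({"method": "alpha_diversity", "otu_table": "full_otu_table"})
print(db.execute("SELECT type, data FROM artifacts").fetchall())
```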

### Database


It would be great to match what we are currently doing in qiita and find parallels. Then, add to qiita rather than redoing stuff, if it makes sense.

**Note: This is not intended to be a substitute for the QIIME database
project (QiiTA).** This is a discussion of how data will be organized and stored


QiiTA -> Qiita 🐐

internally in QIIME2.

The database represents a significant departure from the way QIIME currently


Would this mean that the only way to access this data would be to do it via the Q2 interface itself? I'm trying to think of the case where other tools want to access the data generated by QIIME; would this then add a step where you serialize any of the contents of your QIIME study as a regular file?

Contributor Author

Yes, there would be an explicit export step, or in the case of a web interface, likely a right-click download.
That is a downside to this approach; however, it does allow more consistent data management and allows the provenance of these artifacts to be maintained and reviewed.

It is also possible that we could provide the ability to export an entire analysis as a tarball, which might look like a well-organized output directory in qiime right now.


Thanks for expanding on this. I am concerned that this makes the usage of the software in itself more complicated. Integrating data generated with qiime1.x or with other packages becomes a burden, as the import step would require a variety of validations and specifications.


Contributor Author

That is all true, but we do gain the ability for the interface to reason about composition if it has control of the artifacts in an abstract way. Perhaps a compromise is possible where a database simply logs the locations of files, but presently many of the files waste a lot of disk space by repeating what should be a relation to another table.

I think this is definitely something worth talking about over a call.


Thanks again.

This also seems like a good topic to discuss with an expert in HCI.


handles data (e.g. storing input and output files in a directory structure).
Presently, data is serialized and deserialized to and from the file-system at
each step in an analysis. The resulting data are highly denormalized; for
example, sample IDs are duplicated throughout nearly every file format used in
QIIME. This gives rise to a number of issues. For example, it is very difficult
and error-prone to rename a sample ID after sequences have been demultiplexed.

Since QIIME fundamentally deals with samples at every step in an analysis, they
will become the basis of structuring output in a normalized way. The database
will store this normalized data as _artifacts_. _artifacts_ are data which are
analogous to QIIME's input and output files, but annotated with additional
metadata (e.g., history/provenance, semantic type, etc.). An _artifact_ can be
data that has been imported into QIIME2 (e.g., raw sequence data), or output
produced during an analysis (e.g., a UniFrac distance matrix). _artifacts_ can
be exported in a variety of file formats (e.g., for use in external tools, to
share with collaborators, or to include in a publication).
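
To make the idea of an _artifact_ concrete, here is a minimal sketch assuming a hypothetical `Artifact` class (the roadmap deliberately does not define a schema): the data travels with its semantic type and provenance, and writing a plain file for external tools is an explicit export step:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical shape of an artifact record; the real schema is undefined in
# the roadmap. The point is that data carries its semantic type and provenance,
# and that serializing to a regular file is a separate, explicit export step.
@dataclass
class Artifact:
    semantic_type: str            # e.g. "DistanceMatrix"
    data: object                  # the normalized payload itself
    provenance: list = field(default_factory=list)
    created: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def export(self, path):
        """Serialize to a regular file so external tools can read it."""
        with open(path, "w") as fh:
            fh.write(str(self.data))

raw = Artifact("RawSequences", data="@seq1\nACGT...\n", provenance=["imported by user"])
dm = Artifact("DistanceMatrix", data=[[0.0, 0.7], [0.7, 0.0]],
              provenance=raw.provenance + ["beta_diversity(metric='unweighted_unifrac')"])
dm.export("unifrac_dm.txt")       # explicit export, e.g. to share with a collaborator
```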

### Graphical Interface (Web-based)


I think this could be a nice add-on to qiita.

Currently it is very difficult to create custom workflows in QIIME; only a few
core developers are able to, and it leads to messy and error-prone code that is
difficult to maintain and validate with unit tests. Current QIIME workflows are
essentially black boxes: many users have voiced concern (e.g., on the QIIME
forum and at workshops) that they don't know the exact steps a workflow is
performing. Users have also been asking for a graphical way to perform QIIME
analyses since QIIME's first release; this is likely the most popular request
we've received, and it would significantly cut down the support burden on the
QIIME forum.

To address these concerns, we propose an easy-to-use, portable web-based
interface as the primary way to interact with QIIME2. The web interface will not
merely wrap a command-line interface (as we attempted with pyqi), but instead
will provide a powerful workflow-centered interaction model for both technical
and non-technical users. The interface will allow users to easily create
arbitrary workflows by dragging and dropping methods together. They will be
guided by a strong semantic type system to prevent easily-avoided errors such as
passing pre-split-libraries sequence data into OTU picking workflows. Users will
then be able to preview, export, download, visualize, and view the history of
their data as it becomes available. Additionally, they may be able to query their
results like a database (because they are stored in one).

### Semantic Type System
All inputs and outputs of methods and workflows are _artifacts_. All
_artifacts_ have a semantic type. This allows inference and simple
validation when creating analyses (e.g., showing a user what methods/workflows
can be applied to an _artifact_).

There are two kinds of types: _abstract_ and _concrete_ types. An _abstract_
type is a group or collection of _concrete_ types that share a common interface.
A _concrete_ type is a specific flavor of an _abstract_ type. Two _artifacts_ of
different _abstract_ types are never considered equivalent because they may not
have compatible interfaces, whereas two _artifacts_ of the same _abstract_ type
but different _concrete_ types may be considered equivalent, though the system
would warn the user that they may be providing a semantically-inappropriate type
as input. The type system can be made clearer with a few examples:

- Unrarefied and rarefied OTU tables are of the same _abstract_ type, and
methods will work with either, but some methods (e.g. alpha and beta diversity)
would semantically prefer a rarefied OTU table, while others (such as
rarefaction methods) expect an unrarefied OTU table.

- Positionally-filtered alignments and unfiltered alignments are of the same
_abstract_ type, and both types can be passed to `make_phylogeny.py`. However,
generally the user would want to pass a positionally-filtered alignment, though
it may be necessary to use an unfiltered alignment in odd cases. The type system
would warn users when providing an unfiltered alignment, but the user could
override by acknowledging the warning.

- A pumpkin pie is functionally equivalent to an apple pie, but
may make less sense on the 4th of July. Pumpkin and apple pies are the same
_abstract_ type, but are different _concrete_ types. A warning would be issued
if a user tried to bring a pumpkin pie to a 4th of July party. An error would be
issued if a user tried to bring an alligator to the party.

The semantic type system will support a wide range of primitive and
microbial-ecology specific types, as well as arbitrary user-defined types.
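
The following toy sketch (with invented type names and a plain dictionary standing in for a real type registry) illustrates the proposed behavior: a mismatch of _abstract_ types is an error, while a same-_abstract_, different-_concrete_ mismatch only raises a warning that the user could acknowledge and override:

```python
import warnings

# Toy illustration with invented type names: each concrete type maps to its
# abstract type; mismatched abstract types error, mismatched concrete types warn.
ABSTRACT_TYPE = {
    "RarefiedOTUTable":   "OTUTable",
    "UnrarefiedOTUTable": "OTUTable",
    "FilteredAlignment":  "Alignment",
}

def check_input(provided, expected):
    if ABSTRACT_TYPE[provided] != ABSTRACT_TYPE[expected]:
        raise TypeError(f"{provided} cannot be used where {expected} is required")
    if provided != expected:
        warnings.warn(f"{provided} is semantically unusual here; expected {expected}")

check_input("RarefiedOTUTable", "RarefiedOTUTable")       # exact match: fine
check_input("UnrarefiedOTUTable", "RarefiedOTUTable")     # same abstract type: warning only
try:
    check_input("FilteredAlignment", "RarefiedOTUTable")  # different abstract type: error
except TypeError as exc:
    print(exc)
```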

### Plugin System


This is something I'm envisioning also within qiita and the first example could be Evident ... but I see the separation between GUI and commands.

The plugin system will replace QIIME's current collection of scripts by
providing a repository of domain-specific computation (e.g., methods,
algorithms, and analyses commonly used in microbial ecology) that has been
registered with QIIME2.

The plugin system will support two types of computation: _methods_ and
_workflows_. A _method_ is an atomic unit of computation and is analogous to a
function: it takes some input(s) (some possibly required and some optional) and
produces some output(s). A _workflow_ is a directed acyclic graph (DAG) that
is composed of one or more _methods_ and/or other _workflows_. Conceptually, a
_workflow_ can still be viewed as a function that accepts input and creates
output, just like a _method_.

Each _method_/_workflow_ will be registered with QIIME2's plugin system. While
the way to register computation is an implementation detail, we propose the use
of Python 3's
[function annotations](http://legacy.python.org/dev/peps/pep-3107/) as a clean,
elegant, and built-in way to describe a function's inputs and outputs.
Alternative implementations include decorators or custom docstring formats.
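
Purely as an illustration of the annotation idea (the decorator, registry, and type names below are invented; the roadmap commits only to describing inputs and outputs, not to this API), registration could read a function's annotations like so:

```python
# A rough sketch of registering a plugin method via Python 3 function
# annotations. The decorator name, the registry, and the semantic type names
# are all hypothetical.
REGISTRY = {}

def register(func):
    """Record a method and its annotated input/output semantic types."""
    hints = dict(func.__annotations__)
    REGISTRY[func.__name__] = {
        "output": hints.pop("return", None),
        "inputs": hints,
    }
    return func

@register
def rarefy(table: "UnrarefiedOTUTable", depth: "Integer") -> "RarefiedOTUTable":
    ...  # actual computation omitted

print(REGISTRY["rarefy"])
# {'output': 'RarefiedOTUTable', 'inputs': {'table': 'UnrarefiedOTUTable', 'depth': 'Integer'}}
```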

When computation is registered with the plugin system, its inputs and outputs
will be described using types in the
[Semantic Type System](#semantic-type-system). Custom semantic types may also be
defined in the plugin system.

The plugins provided with QIIME2 will include functionality specific to
microbial ecology. The plugin system will be easily extendable to allow
users/developers to register their own custom functionality with the system.
Thus, there will be an "official" set of plugins that ship with QIIME2, but the
system will also allow users to install plugins from other sources. The plugin
system allows the QIIME2 ecosystem to grow without requiring all methods to be
specifically added to the QIIME2 distribution.

## Deliverables
Details will be filled in after discussion of the roadmap has taken place (so we
know what actually needs to be done).

## Timeline
Details will be filled in after discussion of the roadmap has taken place (so we
know what actually needs to be done).