Merge pull request #31 from standage/docs
Filling out the documentation
standage committed Dec 14, 2016
2 parents 7a80af9 + f6a0921 commit 16c7c4b
Showing 18 changed files with 251 additions and 61 deletions.
2 changes: 1 addition & 1 deletion .travis.yml
Expand Up @@ -5,7 +5,6 @@ addons:
language: python
python:
- 2.7
- 3.3
- 3.4
- 3.5
install:
Expand All @@ -14,6 +13,7 @@ install:
script:
- make test
- make style
- make doc
after_success:
- make loc
- bash <(curl -s https://codecov.io/bash)
2 changes: 1 addition & 1 deletion Makefile
Expand Up @@ -17,7 +17,7 @@ install:
pip install .

devenv:
pip install pytest pytest-cov pep8
pip install pytest pytest-cov pep8 sphinx

style:
pep8 tag/*.py tests/*.py scripts/*.py
10 changes: 9 additions & 1 deletion README.md
Expand Up @@ -4,4 +4,12 @@
[![PyPI version](https://img.shields.io/pypi/v/tag.svg)](https://pypi.python.org/pypi/tag)
[![GenHub build status](https://img.shields.io/travis/standage/tag.svg)](https://travis-ci.org/standage/tag)
[![codecov.io coverage](https://img.shields.io/codecov/c/github/standage/tag.svg)](https://codecov.io/github/standage/tag)
[![BSD-3 licensed](https://img.shields.io/pypi/l/tag.svg)](https://github.com/standage/tag/blob/master/LICENSE)

> *Computational biology is 90% text formatting and ID cross-referencing!*
> -- discouraged graduate students everywhere
**tag** is a free open-source software package for analyzing genome annotation data.

To install the most recent stable release execute `pip install tag` from your terminal.

Full installation instructions and project documentation are available at https://tag.readthedocs.io.
Binary file added docs/_static/graph.png
13 changes: 13 additions & 0 deletions docs/acknowledgements.rst
@@ -0,0 +1,13 @@
Acknowledgements
================

The initial development of the **tag** package was influenced heavily by `GenomeTools <http://genometools.org>`_: the `software <https://github.com/genometools/genometools>`_, the `paper <http://dx.doi.org/10.1109/TCBB.2013.68>`_, and the `developers <https://github.com/genometools/genometools/graphs/contributors>`_.
Aside from being an extremely well-engineered software library and fostering a welcoming development community, GenomeTools is exceptional in two key respects: 1) a focus on streaming processing; and 2) a focus on explicit grouping of related genome features into graphs, rather than entry-by-entry processing that requires the user to resolve relationships between features.
For all of these reasons I use the GenomeTools library extensively in my research: I use the command-line interface, I've written programs (and entire libraries) that use and extend the C API, and I've contributed to the core library itself.

However, it's no secret that development in C is no walk in the park.
I've spent more hours troubleshooting memory management and chasing memory leaks than I ever care to again.
The performance of bare-metal ANSI C is nigh unmatchable, but as a research scientist *my ability to implement and evaluate prototypes quickly* saves me much more time in the long run than a constant-factor speedup in my research code's performance.

So I'd like to acknowledge the GenomeTools community for blazing the trail and (especially Gordon Gremme and Sascha Steinbiss) for their tireless support.
I'd also like to thank the Python community for their support of a wide variety of tools that made implementing, testing, documenting, and distributing the **tag** package a pleasure.
57 changes: 46 additions & 11 deletions docs/api.rst
@@ -1,17 +1,52 @@
The Python API
==============

The following classes/modules are included tag's Python API, which is under
`semantic versioning <http://semver.org>`_.
The following classes/modules are included in **tag**'s Python API, which is under `semantic versioning <http://semver.org>`_.

.. toctree::
:maxdepth: 1
Range
-----

range
comment
directive
sequence
feature
reader
writer
.. automodule:: tag.range
   :members:

Comment
-------

.. automodule:: tag.comment
   :members:

Directive
---------

.. automodule:: tag.directive
   :members:

Sequence
--------

.. automodule:: tag.sequence
   :members:

Feature
-------

.. automodule:: tag.feature
   :members:

Readers
-------

Currently the :code:`reader` module contains only a single class, GFF3Reader,
but may include others in the future.

.. automodule:: tag.reader
   :members:

Writers
-------

Currently the :code:`writer` module contains only a single class, GFF3Writer,
but may include others in the future.

.. automodule:: tag.writer
   :members:
5 changes: 0 additions & 5 deletions docs/comment.rst

This file was deleted.

2 changes: 1 addition & 1 deletion docs/conf.py
Expand Up @@ -145,7 +145,7 @@
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
#html_static_path = ['_static']
html_static_path = ['_static']

# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
5 changes: 0 additions & 5 deletions docs/directive.rst

This file was deleted.

5 changes: 0 additions & 5 deletions docs/feature.rst

This file was deleted.

87 changes: 87 additions & 0 deletions docs/formats.rst
@@ -0,0 +1,87 @@
A crash course in annotation formats
====================================

Data file formatting is the perennial topic of jokes and rants in the bioinformatics community.
For many scientists, it's easier to come up with a new *ad hoc* format than it is to take the time to fully understand and adhere to existing formats (due in large part to the poor state of training for data and computing literacy and poor support for standards development in the life sciences).
There's even a running joke that inventing your own file format is an important rite of passage in becoming a "true" bioinformatician.

.. raw:: html

    <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Introducing the Bioinformatics Ironman: write an assembler, a short/long read aligner and a file format</p>&mdash; Pall Melsted (@pmelsted) <a href="https://twitter.com/pmelsted/status/680697640212951040">December 26, 2015</a></blockquote>
    <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

For annotating genome features, there are several formats in wide use, the most common of which include `GFF3 <https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md>`_, `GTF <http://mblab.wustl.edu/GTF22.html>`_, and `BED <https://genome.ucsc.edu/FAQ/FAQformat.html#format1>`_.
These are all plain-text tab-delimited file formats.

| *Although there are many richer ways of representing genomic features via XML and in relational database schemas, the stubborn persistence of a variety of ad-hoc tab-delimited flat file formats declares the bioinformatics community's need for a simple format that can be modified with a text editor and processed with shell tools like grep.*
| -- Lincoln Stein, `GFF3 specification <https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md>`_

GFF3 vs GTF vs BED
------------------

There are a lot of similarities between the "big three" annotation formats, but there are also some important differences (`this has been covered before <https://standage.github.io/on-genomic-interval-notation.html>`_).
BED and GTF were designed for very specific use cases (visualization and gene prediction, respectively), whereas GFF3 was designed as a generalized solution for genome annotation.
BED allows for a single level of feature decomposition (a feature can be broken up into blocks) and GTF supports two levels (exons can be grouped by **transcript_id**, and transcripts grouped by **gene_id**), while GFF3 supports an arbitrary number of levels (parent/child relationships defined by **ID** and **Parent** attributes specify a directed acyclic graph of features).
And finally, GFF3 and GTF use 1-based closed interval notation, whereas BED uses 0-based half-open interval notation, which makes interval arithmetic much simpler.
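
The difference between the two notations is easiest to see with a concrete interval. The following is a minimal sketch, not part of **tag**: the helper names ``gff3_to_bed`` and ``bed_to_gff3`` are hypothetical and exist only for illustration.

```python
# Hypothetical helpers for illustration; these are not part of the tag API.

def gff3_to_bed(start, end):
    """Convert a 1-based closed interval (GFF3/GTF) to 0-based half-open (BED)."""
    return start - 1, end


def bed_to_gff3(start, end):
    """Convert a 0-based half-open interval (BED) to 1-based closed (GFF3/GTF)."""
    return start + 1, end


# The first 100 bases of a sequence, written in each notation.
gff3_interval = (1, 100)
bed_interval = gff3_to_bed(*gff3_interval)

# Closed notation needs a "+ 1" correction to compute length; half-open does not.
gff3_length = gff3_interval[1] - gff3_interval[0] + 1
bed_length = bed_interval[1] - bed_interval[0]
```

Both lengths come out to 100, but only the half-open form gets there with a plain subtraction, and adjacent half-open intervals share an endpoint instead of differing by one.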

One common complaint about GFF3 is that all the "action" happens in the 9th column, which consists of free-form key/value pairs for storing metadata and relationships between annotated features.
To a large extent this complaint applies to all of these formats.
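
To make the 9th column concrete, here is a rough sketch of how its key/value pairs can be unpacked in Python 3. The function name ``parse_attributes`` is hypothetical and this is not how **tag** does it internally; the sketch only assumes the GFF3 conventions of ``;`` between pairs, ``=`` between key and value, ``,`` between multiple values, and URL-style escaping of reserved characters.

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse a GFF3 attribute string into a dict mapping keys to lists of values.

    Hypothetical helper for illustration; not part of the tag API.
    """
    attributes = {}
    for pair in column9.strip().split(';'):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition('=')
        # Split on commas before unescaping, so escaped commas stay in the value.
        attributes[unquote(key)] = [unquote(v) for v in value.split(',')]
    return attributes


attrs = parse_attributes('ID=exon1;Parent=mRNA1,mRNA2;Note=kinase%3B putative')
```

Here ``attrs['Parent']`` holds two values, and the escaped semicolon in ``Note`` is restored only after the string has been split.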


Why GFF3?
---------

GFF3 is the preferred annotation encoding for the **tag** package.
Despite its use of 1-based closed interval notation (inferior to the notation used by BED), GFF3 is the most robust and flexible format for describing genomic features.

There may be some level of support for GTF and BED in the future, but it will likely be limited to scripts for converting data into GFF3.
This is a difficult problem to solve generically, since each annotation format exists in so many subtly incompatible variants.
Rather, this will likely be handled on a case-by-case basis: for example, conversion scripts for a small number of tools or databases that produce data in a consistent GTF or BED format.


Annotation graphs
-----------------

The concept of an "annotation graph" was first introduced by Eilbeck *et al.* (`citation <https://dx.doi.org/10.1186%2Fgb-2005-6-5-r44>`__) and then elaborated on by Gremme *et al.* (`citation <http://dx.doi.org/10.1109/TCBB.2013.68>`__).
The **tag** library relies on this concept heavily.
In short, rather than processing a data file one entry at a time (line by line), the idea is to group related features into a directed acyclic graph structure.

.. image:: _static/graph.png
   :width: 300px
   :align: center

This has some implications for the terminology we use to describe annotations.

* When we say *feature* we could be referring to a single node in the annotation graph, or we could be referring to an entire connected component.
* We use *parent* and *child* to refer to **part_of** relationships between features and related subfeatures.
In GFF3 these are encoded using **ID** and **Parent** attributes.
Feature types and **part_of** relationships can be validated using the `Sequence Ontology <http://www.sequenceontology.org/>`_.
* A feature is a *top-level feature* if it has no parents.

The **tag** package provides two primary means for processing the annotation graph.

* iterating over the top-level features in the entire graph
* for a particular feature (connected component) in the graph, iterating over all connected subfeatures
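
The two traversal styles above can be sketched without **tag** itself. The code below is an illustration under stated assumptions, not the library's implementation: ``build_graph`` and ``subfeatures`` are hypothetical names, the sample records are made up, and every record is assumed to carry an **ID** attribute.

```python
from collections import defaultdict

# Made-up GFF3 records: one gene with an mRNA and two exons, one bare gene.
GFF3_LINES = [
    'chr1\t.\tgene\t1000\t5000\t.\t+\t.\tID=gene1',
    'chr1\t.\tmRNA\t1000\t5000\t.\t+\t.\tID=mRNA1;Parent=gene1',
    'chr1\t.\texon\t1000\t2000\t.\t+\t.\tID=exon1;Parent=mRNA1',
    'chr1\t.\texon\t4000\t5000\t.\t+\t.\tID=exon2;Parent=mRNA1',
    'chr1\t.\tgene\t8000\t9000\t.\t-\t.\tID=gene2',
]


def build_graph(lines):
    """Return (types, children, top_level) for a list of GFF3 feature lines."""
    types = {}                    # feature ID -> feature type
    children = defaultdict(list)  # parent ID -> list of child IDs
    top_level = []                # IDs of features with no Parent attribute
    for line in lines:
        fields = line.split('\t')
        attrs = dict(pair.split('=', 1) for pair in fields[8].split(';'))
        feature_id = attrs['ID']
        types[feature_id] = fields[2]
        if 'Parent' in attrs:
            for parent_id in attrs['Parent'].split(','):
                children[parent_id].append(feature_id)
        else:
            top_level.append(feature_id)
    return types, children, top_level


def subfeatures(feature_id, children, seen=None):
    """Yield each descendant of feature_id once, depth first."""
    if seen is None:
        seen = set()
    for child_id in children.get(feature_id, []):
        if child_id in seen:
            continue  # a node with several parents is reported only once
        seen.add(child_id)
        yield child_id
        for descendant in subfeatures(child_id, children, seen):
            yield descendant


types, children, top_level = build_graph(GFF3_LINES)
exon_counts = {}
for gene_id in top_level:
    exon_counts[gene_id] = sum(
        1 for feat_id in subfeatures(gene_id, children)
        if types[feat_id] == 'exon'
    )
```

The outer loop walks the top-level features and the inner generator walks one connected component, which mirrors the two access patterns listed above.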


Multiple parents
----------------

A node in a feature graph can have more than one parent, and the GFF3 specification explicitly permits features to have multiple **Parent** attribute values.
For example, in an alternatively spliced gene, different isoforms will often share exons, and the canonical way to reflect this is for each shared exon to refer to each of its parent mRNAs in its **Parent** attribute.
However, in my experience it is much more common for shared exons to be duplicated in the GFF3 file, once for each isoform to which they belong.
This isn't wrong *per se* from a standards perspective, but it can be misleading when (for example) counting up different feature types or calculating sequence characteristics.

Currently the **tag** library does little to warn the user of duplicate entries.
It is up to the user to inspect the data to determine how it is encoded and how statistics on the features should be calculated and interpreted.
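
As an illustration of why the encoding matters, the sketch below counts exons for the same hypothetical two-isoform gene written both ways. The data and the ``count_exons`` helper are invented for this example and are not part of **tag**.

```python
# Each exon entry is (seqid, start, end, parent IDs). Hypothetical data.

# Canonical encoding: the shared exon appears once with two parents.
shared_encoding = [
    ('chr1', 1000, 2000, ['mRNA1', 'mRNA2']),
    ('chr1', 4000, 5000, ['mRNA1']),
    ('chr1', 4500, 5000, ['mRNA2']),
]

# Common encoding: the shared exon is repeated, once per isoform.
duplicated_encoding = [
    ('chr1', 1000, 2000, ['mRNA1']),
    ('chr1', 1000, 2000, ['mRNA2']),
    ('chr1', 4000, 5000, ['mRNA1']),
    ('chr1', 4500, 5000, ['mRNA2']),
]


def count_exons(entries):
    """Return (number of entries, number of distinct exon intervals)."""
    naive = len(entries)
    distinct = len({(seqid, start, end) for seqid, start, end, _ in entries})
    return naive, distinct


shared_counts = count_exons(shared_encoding)
duplicated_counts = count_exons(duplicated_encoding)
```

With the duplicated encoding the naive count is 4 while only 3 distinct exons exist, so comparing the two numbers is one quick way to find out how a given file is encoded.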


Multi-features
--------------

Some genome features exist discontinuously on the sequence, and therefore cannot be declared with a single GFF3 entry (which can encode only a single interval).
The canonical encoding for these types of features is called a multi-feature, in which a single feature is declared on multiple lines with multiple entries all sharing the same feature type and ID attribute.
This is commonly done with coding sequence (**CDS**) features.

**tag** designates one entry for each multi-feature as its *representative*, and all other components of that feature are designated *siblings* of the representative.
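
The grouping can be pictured with a small sketch. This is not **tag**'s implementation: the ``group_multifeatures`` helper and the sample entries are hypothetical, and choosing the first entry as the representative is an assumption made here for illustration.

```python
# Each entry is (type, ID, start, end) in 1-based closed coordinates.
# Hypothetical data: one CDS split across three intervals, one in a single piece.
cds_entries = [
    ('CDS', 'cds1', 1201, 1500),
    ('CDS', 'cds1', 3000, 3902),
    ('CDS', 'cds1', 5000, 5500),
    ('CDS', 'cds2', 7000, 7600),
]


def group_multifeatures(entries):
    """Map each feature ID to a (representative, siblings) pair.

    Assumption for this sketch: the first entry seen for an ID is treated
    as the representative, and later entries with that ID are its siblings.
    """
    parts_by_id = {}
    for entry in entries:
        parts_by_id.setdefault(entry[1], []).append(entry)
    return {fid: (parts[0], parts[1:]) for fid, parts in parts_by_id.items()}


grouped = group_multifeatures(cds_entries)
representative, siblings = grouped['cds1']

# Total coding length of cds1, summed over the representative and its siblings.
cds1_length = sum(end - start + 1 for _, _, start, end in [representative] + siblings)
```

Treating the three ``cds1`` lines as one feature is what makes a per-feature statistic such as total coding length meaningful.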
51 changes: 47 additions & 4 deletions docs/index.rst
@@ -1,17 +1,60 @@
tag: genome annotation analysis in Python!
==========================================
**tag**: genome annotation analysis in Python!
==============================================

More coming soon.
**tag** is a free open-source software package for analyzing genome annotation data.
It is developed as a reusable library with a focus on ease of use.
**tag** is implemented in pure Python (no compiling required) with minimal dependencies!

.. toctree::
:maxdepth: 1

install
formats
api
acknowledgements


What problem does **tag** solve?
--------------------------------

| *Computational biology is 90% text formatting and ID cross-referencing!*
| -- discouraged graduate students everywhere

Most GFF parsers will load data into memory for you--the trivial bit--but will not group related features for you--the useful bit.
**tag** represents related features as a *feature graph* (a directed acyclic graph) which can be easily traversed and inspected.

.. code:: python

    # Calculate number of exons per gene
    for gene in gff3reader:
        exons = [subfeat for subfeat in gene if subfeat.type == 'exon']
        print('num exons:', len(exons))

See :doc:`the primer on annotation formats <formats>` for more information.


Summary
-------

The **tag** library is built around the following features:

* **parsers and writers** for reading and printing annotation data in GFF3 format (with intelligent gzip support)
* **data structures** for convenient handling of various types of GFF3 entries: annotated sequence features, directives and other metadata, embedded sequences, and comments
* **generator functions** for a variety of common and useful annotation processing tasks, which can be easily composed to create streaming pipelines
* a unified **command-line interface** for executing common processing workflows
* a stable, documented **Python API** for interactive data analysis and building custom workflows
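
To show what composing generator functions into a streaming pipeline looks like in general, here is a minimal sketch in plain Python. None of these function names come from **tag**; ``parse_entries``, ``select_type``, and ``min_length`` are hypothetical stand-ins, and the input records are made up.

```python
def parse_entries(lines):
    """Yield one dict per GFF3 feature line, skipping directives and blanks."""
    for line in lines:
        if line.startswith('#') or not line.strip():
            continue
        fields = line.rstrip('\n').split('\t')
        yield {
            'seqid': fields[0],
            'type': fields[2],
            'start': int(fields[3]),
            'end': int(fields[4]),
        }


def select_type(entries, featuretype):
    """Pass through only entries of the requested feature type."""
    for entry in entries:
        if entry['type'] == featuretype:
            yield entry


def min_length(entries, length):
    """Pass through only entries spanning at least `length` bases."""
    for entry in entries:
        if entry['end'] - entry['start'] + 1 >= length:
            yield entry


lines = [
    '##gff-version 3',
    'chr1\t.\texon\t1000\t2000\t.\t+\t.\tParent=mRNA1',
    'chr1\t.\tintron\t2001\t3999\t.\t+\t.\tParent=mRNA1',
    'chr1\t.\tintron\t5001\t5100\t.\t+\t.\tParent=mRNA1',
]

# Each stage pulls one entry at a time from the stage before it,
# so the whole input never has to be held in memory.
pipeline = min_length(select_type(parse_entries(lines), 'intron'), 1000)
long_introns = list(pipeline)
```

Because every stage is a generator, stages can be added, removed, or reordered without touching the others, and ``lines`` could just as well be an open file handle.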


Development
-----------

Development of the **tag** library is currently a one-man show, but I would heartily welcome contributions.
The development repository is at https://github.com/standage/tag.
Please feel free to submit comments, questions, or support requests to the `GitHub issue tracker <https://github.com/standage/tag/issues>`_, or (even better) a pull request!


Indices and tables
==================
------------------

* :ref:`genindex`
* :ref:`modindex`
46 changes: 46 additions & 0 deletions docs/install.rst
@@ -0,0 +1,46 @@
Installing and using **tag**
============================

The easiest way to install **tag** is with the :code:`pip` command.
By default, :code:`pip` installs the most recent stable release from the Python Package Index (PyPI).

.. code::

    pip install tag

To install the latest unreleased code, install directly from the development repository at GitHub.

.. code::

    pip install git+https://github.com/standage/tag.git

.. note:: We recommend installing **tag** and its prerequisites in a `virtual environment <http://docs.python-guide.org/en/latest/dev/virtualenvs/>`_.

**tag** is implemented in pure Python and requires no compilation.
It has only a single runtime dependency (the `intervaltree library <https://pypi.python.org/pypi/intervaltree>`_) which is also pure Python.


Using **tag** interactively
---------------------------

If you want to analyze or explore your data interactively, fire up the Python interpreter by invoking the :code:`python` command in your terminal.
Please see the :doc:`API documentation <api>` for a description of the data structures and objects available to you.

.. code:: python

    >>> import tag
    >>> reader = tag.reader.GFF3Reader(infilename='/data/genomes/mybug.gff3.gz')
    >>> for entry in reader:
    ...     if hasattr(entry, 'type') and entry.type == 'intron':
    ...         if len(entry) > 100000:
    ...             print(entry.slug)
    intron@scaffold3[37992, 149255]
    intron@scaffold55[288477, 389001]
    intron@scaffold192[1057, 196433]


Using the **tag** command-line interface
----------------------------------------

The **tag** package has a command-line interface for common processing workflows.
Execute :code:`tag -h` to see a list of available commands and :code:`tag <cmd> -h` for instructions on running a particular command.
5 changes: 0 additions & 5 deletions docs/range.rst

This file was deleted.

8 changes: 0 additions & 8 deletions docs/reader.rst

This file was deleted.

5 changes: 0 additions & 5 deletions docs/sequence.rst

This file was deleted.

8 changes: 0 additions & 8 deletions docs/writer.rst

This file was deleted.

1 change: 0 additions & 1 deletion setup.py
Expand Up @@ -27,7 +27,6 @@
'Development Status :: 4 - Beta',
'Environment :: Console',
'License :: OSI Approved :: BSD License',
'Programming Language :: Python :: 2.6',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3.4',
'Programming Language :: Python :: 3.5',
