ex-294 (jebene) moved each command desc into a separate .rst file,
modified toctree, edited other .rst files
jebene committed Sep 17, 2015
1 parent 2861c8d commit b1f99ce
Showing 8 changed files with 408 additions and 406 deletions.
394 changes: 14 additions & 380 deletions doc/command_details.rst

Large diffs are not rendered by default.

62 changes: 62 additions & 0 deletions doc/expand.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
.. _expand-command:

Expand
======
The expand command explodes a VCF file into a tab-separated file. It is not
caller-dependent and will work with any VCF file.

.. figure:: images/expand_columns.jpg

   **Expanding Columns:** *The INFO column and sample-specific FORMAT tags from
   the input VCF file are separated into distinct columns in the output file.*

Usage
-----
``usage: jacquard expand <input_file> <output_file> [OPTIONS]``


*positional arguments:*

+--------+---------------------------------------------------------------------+
| input  | | A VCF file. Other file types ignored                              |
+--------+---------------------------------------------------------------------+
| output | | A TXT file                                                        |
+--------+---------------------------------------------------------------------+


*optional arguments:*

+----------------------------------+-------------------------------------------+
| -s, --selected_columns_file FILE | | File containing an ordered list of      |
|                                  |   column names to be included             |
|                                  | | in the output file; column names can    |
|                                  |   include regular expressions             |
+----------------------------------+-------------------------------------------+
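
As a rough illustration of how an ordered list of column-name patterns could be applied, here is a minimal Python sketch. It is not Jacquard's implementation: the function name ``select_columns`` is hypothetical, and requiring each regular expression to match the whole column name is an assumption.

```python
import re


def select_columns(available_columns, patterns):
    """Return the available columns matched by each pattern, in pattern order."""
    selected = []
    for pattern in patterns:
        regex = re.compile(pattern)
        for column in available_columns:
            # Assumption: a pattern must describe the entire column name.
            if regex.fullmatch(column) and column not in selected:
                selected.append(column)
    return selected


columns = ["CHROM", "POS", "REF", "ALT", "DP", "JQ_AF|sampleA", "JQ_AF|sampleB"]
patterns = ["CHROM", "POS", r"JQ_.*"]
print(select_columns(columns, patterns))
```

Because the outer loop walks the patterns, the output order follows the selection file rather than the order of columns in the VCF.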

Description
-----------
The expand command converts a VCF file into a tab-delimited file in a tabular
format. This format is more suitable than a VCF for analysis and visualization
in R, Pandas, Excel, or another third-party application.

.. figure:: images/expand_tabular.jpg

   **Tabular Format of Jacquard Output:** *Jacquard transforms the dense VCF
   format into a tabular format.*

The 'fixed' fields (i.e. CHROM, POS, ID, REF, ALT, QUAL, FILTER) are directly
copied from the input VCF file. Based on the metaheaders, each field in the
INFO column is expanded into a separate column named after its tag ID. Also,
based on the metaheaders, each FORMAT tag is expanded into a set of columns,
one for each sample, named as ``<FORMAT tag ID>|<sample column name>``. By
default, all INFO fields and FORMAT tags are expanded; specific INFO fields and
FORMAT tags can be selected using the ``--selected_columns_file`` option.
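
The column expansion described above can be sketched in Python as follows. This is a simplified illustration, not Jacquard's implementation: the function name ``expand_record`` is hypothetical, columns are derived from the record itself rather than from the metaheaders, and representing a bare INFO flag by its own name is an assumption.

```python
def expand_record(line, sample_names):
    """Expand one VCF data line into a flat {column name: value} dict."""
    fields = line.rstrip("\n").split("\t")
    fixed_names = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER"]
    row = dict(zip(fixed_names, fields[:7]))

    # INFO: ';'-separated "KEY=VALUE" entries; bare flags have no '='.
    for entry in fields[7].split(";"):
        if entry in ("", "."):
            continue
        key, sep, value = entry.partition("=")
        row[key] = value if sep else key  # assumption: a flag maps to its name

    # FORMAT: tag IDs in the ninth column, one ':'-separated list per sample.
    format_tags = fields[8].split(":")
    for sample, values in zip(sample_names, fields[9:]):
        for tag, value in zip(format_tags, values.split(":")):
            row[tag + "|" + sample] = value
    return row


record = "chr1\t100\t.\tA\tT\t50\tPASS\tDP=30;SOMATIC\tGT:AF\t0/1:0.25\t0/0:0.0"
row = expand_record(record, ["NORMAL", "TUMOR"])
print(sorted(row))
```

Each FORMAT tag yields one column per sample, so a record with two tags and two samples contributes four sample-specific columns.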

This command also emits a tab-delimited glossary file, created based on the
metaheaders in the input VCF file. FORMAT and INFO tag IDs are listed in the
glossary and are defined by their metaheader description.
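
A glossary of this kind could be derived from the metaheaders roughly as shown below. This is a sketch under stated assumptions: the function names, the glossary column headings (``TAG_ID``, ``TYPE``, ``DESCRIPTION``), and their order are all hypothetical and may differ from what Jacquard writes.

```python
import re

# Matches e.g. ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">
METAHEADER = re.compile(r'^##(INFO|FORMAT)=<ID=([^,>]+).*Description="([^"]*)"')


def build_glossary(vcf_lines):
    """Collect (tag ID, field type, description) from INFO/FORMAT metaheaders."""
    rows = []
    for line in vcf_lines:
        match = METAHEADER.match(line)
        if match:
            field_type, tag_id, description = match.groups()
            rows.append((tag_id, field_type, description))
    return rows


def format_glossary(rows):
    """Render glossary rows as tab-delimited text with a header line."""
    header = "\t".join(["TAG_ID", "TYPE", "DESCRIPTION"])
    return "\n".join([header] + ["\t".join(row) for row in rows])


lines = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth">',
    '##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele frequency">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tTUMOR",
]
glossary = build_glossary(lines)
print(format_glossary(glossary))
```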

.. figure:: images/expand_excel.jpg

   **Pattern Identification:** *The expanded output file can be visualized in a
   third-party tool to identify patterns in the dataset.*
51 changes: 26 additions & 25 deletions doc/implementation_details.rst
@@ -35,11 +35,11 @@ Test Conventions
General Architecture:
---------------------
Modules are typically:
-- commands (like translate): these modules are invoked from the command line;
+- commands (like *translate*): these modules are invoked from the command line;
they follow a simple command pattern.
-- variant caller transforms (like mutect): these modules contain classes that
+- variant caller transforms (like *mutect*): these modules contain classes that
add Jacquard annotations to a native VCF record.
-- utilities (like vcf or logger): these modules provide a common method or
+- utilities (like *vcf* or *logger*): these modules provide a common method or
class used by other modules.
Extending and adapting existing patterns will ensure commands/transforms stay
consistent. Here are some guidelines on how to extend functionality:
@@ -48,35 +48,36 @@ How to add a new format tag:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For all variant callers that support the new tag, you will need to extend each
variant caller transform to:
-* define the new tag (set the metaheader and how the new value is derived)
-* add the new tag to the VC's reader
-Note: If the new tag can be summarized, you will also need to add a correponding
-tag to summarize_rollup_transform.
+* define the new tag (set the metaheader and how the new value is derived)
+* add the new tag to the variant caller's reader
+.. note:: If the new tag can be summarized, you will also need to add a
+corresponding tag to *summarize_rollup_transform*.


How to add a new variant caller:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-* Add a new module in variant_caller_transforms.
-* In the new module, define the supported version.
-* Add supported tags (as described in section above).
-* Add a VcfReader class to interpret native VCFs to translated VCFs.
-* Add a new class named for the variant caller; define a claim method to
-recognize and claim VCF files.
-* Add the new variant caller class to variant_caller_factory
-Note: variant caller should have no dependencies on other packages (except
-utils and vcf) and classes should only refer to variant callers through
-variant_caller_factory (except tests).
+* Add a new module in *variant_caller_transforms*.
+* In the new module, define the supported version.
+* Add supported tags (as described in section above).
+* Add a VcfReader class to interpret native VCFs to translated VCFs.
+* Add a new class named for the variant caller; define a claim method to
+recognize and claim VCF files.
+* Add the new variant caller class to *variant_caller_factory*.
+.. note:: The variant caller should have no dependencies on other packages
+(except utils and vcf) and classes should only refer to variant
+callers through *variant_caller_factory* (except tests).

How to add a new command:
^^^^^^^^^^^^^^^^^^^^^^^^^
-* Add a new module in jacquard named for the command.
-* In the new module, add the methods:
-* add_subparser(subparser) with appropriate help and defaults.
-* get_required_input_output_types().
-* validate_args(args).
-* report_prediction
-* execute(args, execution_context).
-Note that commands are independent and should not refer to other commands.
+* Add a new module in *jacquard*, named for the command.
+* In the new module, add the methods:
+
+* add_subparser(subparser) with appropriate help and defaults.
+* get_required_input_output_types().
+* validate_args(args).
+* report_prediction
+* execute(args, execution_context).
+.. note:: Commands are independent and should not refer to other commands.

|
2 changes: 2 additions & 0 deletions doc/index.rst
@@ -7,7 +7,9 @@ Jacquard
Overview <overview>
Installation <installation>
Quick Start <quickstart>
+
Command Details <command_details>
+
FAQ <faq>
Changelog <changelog>
Future Directions <future_directions>
140 changes: 140 additions & 0 deletions doc/merge.rst
@@ -0,0 +1,140 @@
.. _merge-command:

Merge
=====
The merge command integrates a directory of VCFs into a single VCF. It is
caller-agnostic and can be used on any set of VCF files.

.. figure:: images/merge_join_step.jpg

   **The Merging Process:** *Sample-specific information is grouped together
   for each patient.*

Usage
-----
``usage: jacquard merge <input_dir> <output_file> [OPTIONS]``


*positional arguments:*

+--------+---------------------------------------------------------------------+
| input  | | Directory containing VCF files. Other file types ignored          |
+--------+---------------------------------------------------------------------+
| output | | A single VCF file                                                 |
+--------+---------------------------------------------------------------------+


*optional arguments:*

+-----------------------+------------------------------------------------------+
| --include_format_tags | | Comma-separated user-defined list of regular       |
|                       |   expressions for format tags to be included         |
|                       |   in output.                                         |
+-----------------------+------------------------------------------------------+
| --include_cells       | | valid: Only include valid variants                 |
|                       | | all: Include all variants                          |
|                       | | passed: Only include variants which passed their   |
|                       |   respective filter                                  |
|                       | | somatic: Only include somatic variants             |
+-----------------------+------------------------------------------------------+
| --include_rows        | | at_least_one_somatic: Include all variants at loci |
|                       |   where at least one variant was somatic             |
|                       | | all_somatic: Include all variants at loci where    |
|                       |   all variants were somatic                          |
|                       | | at_least_one_passed: Include all variants at loci  |
|                       |   where at least one variant passed                  |
|                       | | all_passed: Include all variants at loci where     |
|                       |   all variants passed                                |
|                       | | all: Include all variants at loci                  |
+-----------------------+------------------------------------------------------+

Description
-----------
Conceptually, merge has four basic steps, each described in detail below:

#. Integrate matching loci from different VCFs into common rows
#. Combine matching samples from different VCFs into common columns
#. Filter tag values and rows
#. Assemble the subset of FORMAT tags to be included in the final VCF

Integrate matching loci
^^^^^^^^^^^^^^^^^^^^^^^
Jacquard first builds the superset of all loci (CHROM, POS, REF, and ALT)
across the set of all input VCFs. For each locus, the input VCF FORMAT tags and
values are merged into a single row. Input variant record-level fields (such as
FILTER and INFO) are ignored.
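
Building the superset of loci can be pictured with the short Python sketch below. It is only an illustration: the function name ``collect_loci``, the in-memory input mapping, and the plain tuple sort order are assumptions, not a description of how Jacquard reads or orders files.

```python
def collect_loci(records_by_file):
    """Return the sorted superset of (CHROM, POS, REF, ALT) across all files."""
    loci = set()
    for records in records_by_file.values():
        for record in records:
            fields = record.split("\t")
            chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
            loci.add((chrom, pos, ref, alt))  # a set keeps each locus once
    return sorted(loci)


inputs = {
    "case_A.mutect.vcf": ["chr1\t100\t.\tA\tT", "chr1\t200\t.\tG\tC"],
    "case_A.strelka.vcf": ["chr1\t100\t.\tA\tT", "chr2\t50\t.\tC\tG"],
}
loci = collect_loci(inputs)
print(loci)
```

A locus reported by both callers appears once in the result, which is what allows their FORMAT values to share a single row.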

MERGE_LOCI_IMAGE_HERE


Combine matching samples
^^^^^^^^^^^^^^^^^^^^^^^^
In the input directory, an individual sample could be called by more than one
variant caller. When merging, Jacquard combines results from the same sample
into a single column. Merged sample names are constructed by concatenating the
filename prefix and the VCF column header.

+--------------------+-----------------------------------+---------------------+
| Filename           | VCF Column header                 | Merged sample names |
+--------------------+-----------------------------------+---------------------+
| case_A.strelka.vcf | #CHROM ... FORMAT SAMPLE1 SAMPLE2 | | case_A|SAMPLE1    |
|                    |                                   | | case_A|SAMPLE2    |
+--------------------+-----------------------------------+---------------------+
| case_A.mutect.vcf  | #CHROM ... FORMAT SAMPLE1 SAMPLE2 | | case_A|SAMPLE1    |
|                    |                                   | | case_A|SAMPLE2    |
+--------------------+-----------------------------------+---------------------+
| case_B.strelka.vcf | #CHROM ... FORMAT SAMPLE3 SAMPLE4 | | case_B|SAMPLE3    |
|                    |                                   | | case_B|SAMPLE4    |
+--------------------+-----------------------------------+---------------------+
| case_B.mutect.vcf  | #CHROM ... FORMAT SAMPLE3 SAMPLE4 | | case_B|SAMPLE3    |
|                    |                                   | | case_B|SAMPLE4    |
+--------------------+-----------------------------------+---------------------+

Given the input VCFs above, the resulting merged VCF will have four sample
columns: ``case_A|SAMPLE1``, ``case_A|SAMPLE2``, ``case_B|SAMPLE3``, and
``case_B|SAMPLE4``.
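
The naming rule can be sketched in Python as follows. This is an illustration under assumptions, not Jacquard's code: the function name ``merged_sample_names`` is hypothetical, the filename prefix is taken to be the text before the first ``.``, and the separator is exposed as a parameter because it is assumed here.

```python
import os


def merged_sample_names(filename, column_header, separator="|"):
    """Build merged sample names from a filename prefix and a VCF header line."""
    # Assumption: the prefix is everything before the first '.', e.g. "case_A".
    prefix = os.path.basename(filename).split(".")[0]
    columns = column_header.split()
    # Sample columns are the ones that follow FORMAT in the header line.
    samples = columns[columns.index("FORMAT") + 1:]
    return [prefix + separator + sample for sample in samples]


fixed = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT "
inputs = {
    "case_A.strelka.vcf": fixed + "SAMPLE1 SAMPLE2",
    "case_A.mutect.vcf": fixed + "SAMPLE1 SAMPLE2",
    "case_B.strelka.vcf": fixed + "SAMPLE3 SAMPLE4",
    "case_B.mutect.vcf": fixed + "SAMPLE3 SAMPLE4",
}

merged = []
for filename, header in inputs.items():
    for name in merged_sample_names(filename, header):
        if name not in merged:  # the same sample from another caller is reused
            merged.append(name)
print(merged)
```

Four input files with two samples each collapse to four merged columns because files sharing a prefix contribute the same names.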


Filter tag values and rows
^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, the merged VCF contains only Jacquard-translated format tags
(``JQ_.*``) and includes all variants with valid syntax at loci where at least
one variant was somatic. The resulting filtered file contains fewer rows, but
higher-quality data, than the input files.
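
The row-level rules listed for ``--include_rows`` can be read as simple any/all tests over the variants at a locus. The Python sketch below illustrates that reading only; the function name ``include_row`` and the per-variant ``somatic`` and ``passed`` booleans are hypothetical, and how Jacquard decides that a variant is somatic or passed is not shown here.

```python
RULES = ("all", "at_least_one_somatic", "all_somatic",
         "at_least_one_passed", "all_passed")


def include_row(variants, rule="at_least_one_somatic"):
    """Decide whether a locus is kept, given its variants and a row rule.

    variants: list of dicts with boolean 'somatic' and 'passed' entries
    (an assumed, precomputed representation).
    """
    if rule not in RULES:
        raise ValueError("unknown rule: " + rule)
    if rule == "all":
        return True
    key = "somatic" if rule.endswith("somatic") else "passed"
    flags = [variant[key] for variant in variants]
    return any(flags) if rule.startswith("at_least_one") else all(flags)


locus = [
    {"somatic": True, "passed": False},
    {"somatic": False, "passed": False},
]
print(include_row(locus))                 # default rule
print(include_row(locus, "all_somatic"))
```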

Though most variant callers have their own distinct set of format tags, some
tag names are common across multiple callers. If there are any format tag name
collisions, merge will add a prefix (e.g. JQ1_<original_tag>) in order to
disambiguate the format tags.


.. figure:: images/merge_filter_step.jpg

   **The Filtering Process:** *Rows and specific cells in the VCF files are
   filtered based on the command-line options.*

After filtering, the merge command combines all of the input VCFs into a single,
merged VCF that includes all necessary information for continuing your analysis.

The resulting VCF file contains the distinct set of all coordinates (CHROM, POS,
REF, and ALT) and samples from the input files, provided they pass the filters.
Each distinct coordinate from the input VCF files adds a row to the output file,
which increases the file length. Additionally, sample columns are merged for
each patient, which adds sample-specific information and increases the file
width.

.. note:: Importantly, rather than giving caller-wise sample columns in the
   output VCF file, merge emits patient-wise sample columns. For each
   patient, the merge command joins the set of corresponding sample
   columns into a single column. Grouping sample-specific information
   by patient makes the data easier to analyze.


Assemble the subset of FORMAT tags
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

TODO

36 changes: 36 additions & 0 deletions doc/summarize.rst
@@ -0,0 +1,36 @@
.. _summarize-command:

Summarize
=========
The summarize command adds new INFO fields and FORMAT tags that combine variant
data from the merged VCF. It will only work with VCF files that have been
translated.

.. figure:: images/summarize.jpg

   **Summarizing Format Tags:** *The Jacquard-translated format tags from
   each caller are aggregated and processed together to create consensus format
   tags.*

Usage
-----
``usage: jacquard summarize <input_file> <output_file>``


*positional arguments:*

+--------+---------------------------------------------------------------------+
| input  | | Jacquard-merged VCF file (or any VCF with Jacquard tags; e.g.     |
|        |   JQ_SOM_MT)                                                        |
+--------+---------------------------------------------------------------------+
| output | | A single VCF file                                                 |
+--------+---------------------------------------------------------------------+

Description
-----------
The summarize command uses the Jacquard-specific tags to aggregate caller
information from the file, providing a summary-level view. Summary fields, such
as averages, help you determine which variants are most likely to be true.

The summarized format tags contain the prefix 'JQ_SUMMARY'.
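
As an example of the kind of aggregation involved, the Python sketch below averages one value across callers for a single sample. It is an illustration only: the function name ``summarize_average`` and the tag names (``JQ_MT_AF``, ``JQ_SK_AF``, ``JQ_VS_AF``, ``JQ_SUMMARY_AF_AVERAGE``) are used as hypothetical examples, and skipping callers with no value is an assumption.

```python
def summarize_average(caller_values, precision=2):
    """Average the values callers reported; None marks a caller with no value."""
    reported = [value for value in caller_values.values() if value is not None]
    if not reported:
        return None  # no caller reported this tag for the sample
    return round(sum(reported) / len(reported), precision)


# Hypothetical per-caller allele frequencies for one sample at one locus.
sample_tags = {"JQ_MT_AF": 0.2, "JQ_SK_AF": 0.3, "JQ_VS_AF": None}
summary = {"JQ_SUMMARY_AF_AVERAGE": summarize_average(sample_tags)}
print(summary)
```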