ex-294 (jebene) moved each command desc into a separate .rst file,

modified toctree, edited other .rst files
umich-brcf-bioinf · Sep 17, 2015 · b1f99ce
1 parent 2861c8d
Showing 8 changed files with 408 additions and 406 deletions.
diff --git a/doc/command_details.rst b/doc/command_details.rst
diff --git a/doc/expand.rst b/doc/expand.rst
@@ -0,0 +1,62 @@
+.. _expand-command:
+
+Expand
+======
+The expand command explodes a VCF file into a tab-separated file. It is not
+caller-dependent and will work with any VCF file.
+
+.. figure:: images/expand_columns.jpg
+
+   **Expanding Columns :** *The INFO column and sample-specific FORMAT tags from
+   the input VCF file are separated into distinct columns in the output file.*
+
+Usage
+-----
+``usage: jacquard expand <input_file> <output_file> [OPTIONS]``
+
+
+*positional arguments:*
+
++--------+---------------------------------------------------------------------+
+| input  | | A VCF file. Other file types ignored                              |
++--------+---------------------------------------------------------------------+
+| output | | A TXT file                                                        |
++--------+---------------------------------------------------------------------+
+
+
+*optional arguments:*
+
++----------------------------------+-------------------------------------------+
+| -s, --selected_columns_file FILE | | File containing an ordered list of      |
+|                                  |   column names to be included             |
+|                                  | | in the output file; column names can    |
+|                                  |   include regular expressions             |
++----------------------------------+-------------------------------------------+
+
+Description
+-----------
+The expand command converts a VCF file into a tab-delimited, tabular file. This
+format is more suitable than VCF for analysis and visualization in R, pandas,
+Excel, or another third-party application.
+
+.. figure:: images/expand_tabular.jpg
+
+   **Tabular Format of Jacquard Output :** *Jacquard transforms the dense VCF
+   format into a tabular format.*
+
+The 'fixed' fields (i.e. CHROM, POS, ID, REF, ALT, QUAL, FILTER) are directly
+copied from the input VCF file. Based on the metaheaders, each field in the
+INFO column is expanded into a separate column named after its tag ID. Also,
+based on the metaheaders, each FORMAT tag is expanded into a set of columns,
+one for each sample, named as <FORMAT tag ID>|<sample column name>. By default,
+all INFO fields and FORMAT tags are expanded; specific INFO fields and FORMAT
+tags can be selected using the ``--selected_columns_file`` option.
+
+This command also emits a tab-delimited glossary file, created based on the
+metaheaders in the input VCF file. FORMAT and INFO tag IDs are listed in the
+glossary and are defined by their metaheader description.
+
+.. figure:: images/expand_excel.jpg
+
+   **Pattern Identification :** *The expanded output file can be visualized in a
+   third-party tool to identify patterns in the dataset.* 
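
The column naming described in expand.rst can be illustrated with a minimal Python sketch. It is not Jacquard's implementation: the function name `expand_record` is hypothetical, the sketch parses each record directly instead of consulting the metaheaders, and representing a flag-style INFO entry by its own key is an assumption.

```python
FIXED_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER"]


def expand_record(record_line, sample_names):
    """Expand one VCF data line into a flat {column name: value} dict.

    Hypothetical helper for illustration only.
    """
    fields = record_line.rstrip("\n").split("\t")

    # The fixed fields are copied through unchanged.
    row = dict(zip(FIXED_FIELDS, fields[:7]))

    # Each INFO entry becomes a column named after its tag ID.
    for entry in fields[7].split(";"):
        if entry and entry != ".":
            key, _, value = entry.partition("=")
            # Assumption: a flag (no "=value") is stored as its own key.
            row[key] = value if value else key

    # Each FORMAT tag becomes one column per sample, named
    # <FORMAT tag ID>|<sample column name>.
    format_tags = fields[8].split(":")
    for sample_name, sample_field in zip(sample_names, fields[9:]):
        for tag, value in zip(format_tags, sample_field.split(":")):
            row["{}|{}".format(tag, sample_name)] = value

    return row
```

Calling `expand_record` on a record with FORMAT `GT:AF` and samples `NORMAL` and `TUMOR` yields columns such as `GT|NORMAL` and `AF|TUMOR` alongside the fixed and INFO columns.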
diff --git a/doc/implementation_details.rst b/doc/implementation_details.rst
@@ -35,11 +35,11 @@ Test Conventions
 General Architecture:
 ---------------------
 Modules are typically:
- - commands (like translate): these modules are invoked from the command line; 
+ - commands (like *translate*): these modules are invoked from the command line; 
    they follow a simple command pattern.
- - variant caller transforms (like mutect): these modules contain classes that 
+ - variant caller transforms (like *mutect*): these modules contain classes that 
    add Jacquard annotations to a native VCF record.
- - utilities  (like vcf or logger): these modules provide a common method or
+ - utilities  (like *vcf* or *logger*): these modules provide a common method or
    class used by other modules.
 Extending and adapting existing patterns will ensure commands/transforms stay
 consistent. Here are some guidelines on how to extend functionality:
@@ -48,35 +48,36 @@ How to add a new format tag:
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 For all variant callers that support the new tag, you will need to extend each
 variant caller transform to:
-* define the new tag (set the metaheader and how the new value is derived)
-* add the new tag to the VC's reader
-Note: If the new tag can be summarized, you will also need to add a correponding
-tag to summarize_rollup_transform.
+ * define the new tag (set the metaheader and how the new value is derived)
+ * add the new tag to the variant caller's reader
+.. note:: If the new tag can be summarized, you will also need to add a
+          corresponding tag to *summarize_rollup_transform*.
 
 
 How to add a new variant caller:
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-* Add a new module in variant_caller_transforms.
-* In the new module, define the supported version.
-* Add supported tags (as described in section above).
-* Add a VcfReader class to interpret native VCFs to translated VCFs.
-* Add a new class named for the variant caller; define a claim method to
-  recognize and claim VCF files.
-* Add the new variant caller class to variant_caller_factory
-Note: variant caller should have no dependencies on other packages (except
-utils and vcf) and classes should only refer to variant callers through 
-variant_caller_factory (except tests).
+ * Add a new module in *variant_caller_transforms*.
+ * In the new module, define the supported version.
+ * Add supported tags (as described in section above).
+ * Add a VcfReader class to interpret native VCFs to translated VCFs.
+ * Add a new class named for the variant caller; define a claim method to
+   recognize and claim VCF files.
+ * Add the new variant caller class to *variant_caller_factory*.
+.. note:: The variant caller should have no dependencies on other packages
+          (except utils and vcf) and classes should only refer to variant
+          callers through *variant_caller_factory* (except tests).
 
 How to add a new command:
 ^^^^^^^^^^^^^^^^^^^^^^^^^
-* Add a new module in jacquard named for the command.
-* In the new module, add the methods:
-  * add_subparser(subparser) with appropriate help and defaults.
-  * get_required_input_output_types().
-  * validate_args(args).
-  * report_prediction
-  * execute(args, execution_context).
-Note that commands are independent and should not refer to other commands.
+ * Add a new module in *jacquard*, named for the command.
+ * In the new module, add the methods:
+
+   * add_subparser(subparser) with appropriate help and defaults.
+   * get_required_input_output_types().
+   * validate_args(args).
+   * report_prediction
+   * execute(args, execution_context).
+.. note:: Commands are independent and should not refer to other commands.
 
 |
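
The "How to add a new command" checklist above can be sketched as a skeleton module. Only the five function names come from the checklist; everything else is an assumption made for illustration: the command name `example`, the argument names, the return values of `get_required_input_output_types` and `report_prediction`, and treating `execution_context` as a list of metaheader strings.

```python
import argparse
import os


def add_subparser(subparser):
    """Register the hypothetical 'example' command with help and defaults."""
    parser = subparser.add_parser("example", help="Hypothetical example command.")
    parser.add_argument("input", help="Input VCF file")
    parser.add_argument("output", help="Output VCF file")


def get_required_input_output_types():
    # Assumption: a pair naming the kinds of input and output expected.
    return ("file", "file")


def validate_args(args):
    """Reject argument combinations the command cannot handle."""
    if os.path.abspath(args.input) == os.path.abspath(args.output):
        raise ValueError("input and output must be different files")


def report_prediction(args):
    # Assumption: the set of output file names the command expects to write.
    return {os.path.basename(args.output)}


def execute(args, execution_context):
    """Copy the input VCF, adding the context lines before the column header."""
    with open(args.input) as reader, open(args.output, "w") as writer:
        for line in reader:
            if line.startswith("#CHROM"):
                for context_line in execution_context:
                    writer.write(context_line + "\n")
            writer.write(line)
```

Keeping every entry point at module level mirrors the checklist and keeps the command free of references to other commands.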
 

diff --git a/doc/index.rst b/doc/index.rst
@@ -7,7 +7,9 @@ Jacquard
    Overview <overview>
    Installation <installation>
    Quick Start <quickstart>
+
    Command Details <command_details>
+
    FAQ <faq>
    Changelog <changelog>
    Future Directions <future_directions>

diff --git a/doc/merge.rst b/doc/merge.rst
@@ -0,0 +1,140 @@
+.. _merge-command:
+
+Merge
+=====
+The merge command integrates a directory of VCFs into a single VCF. It is
+caller-agnostic and can be used on any set of VCF files.
+
+.. figure:: images/merge_join_step.jpg
+
+   **The Merging Process :** *Sample-specific information is grouped together
+   for each patient.*
+
+Usage
+-----
+``usage: jacquard merge <input_dir> <output_file> [OPTIONS]``
+
+
+*positional arguments:*
+
++--------+---------------------------------------------------------------------+
+| input  | | Directory containing VCF files. Other file types ignored          |
++--------+---------------------------------------------------------------------+
+| output | | A single VCF file                                                 |
++--------+---------------------------------------------------------------------+
+
+
+*optional arguments:*
+
++-----------------------+------------------------------------------------------+
+| --include_format_tags | | Comma-separated user-defined list of regular       |
+|                       |   expressions for format tags                        |
+|                       | | to be included in output.                          |
++-----------------------+------------------------------------------------------+
+| --include_cells       | | valid:  Only include valid variants                |
+|                       | | all:  Include all variants                         |
+|                       | | passed:  Only include variants which passed their  |
+|                       |            respective filter                         |
+|                       | | somatic:  Only include somatic variants            |
++-----------------------+------------------------------------------------------+
+| --include_rows        | | at_least_one_somatic:  Include all variants at     |
+|                       |                          loci where at least one     |
+|                       |                          variant                     |
+|                       | |                        was somatic                 |
+|                       | | all_somatic:  Include all variants at loci where   |
+|                       |                all variants were somatic             |
+|                       | | at_least_one_passed:  Include all variants at loci |
+|                       |                         where at least one variant   |
+|                       | |                       passed                       |
+|                       | | all_passed:  Include all variants at loci where    |
+|                       |                all variants passed                   |
+|                       | | all:  Include all variants at loci                 |
++-----------------------+------------------------------------------------------+
+
+Description
+-----------
+Conceptually, merge has four basic steps, each described in detail below:
+ #. Integrate matching loci from different VCFs into common rows
+ #. Combine matching samples from different VCFs into common columns
+ #. Filter tag values and rows
+ #. Assemble the subset of FORMAT tags to be included in the final VCF
+
+Integrate matching loci
+^^^^^^^^^^^^^^^^^^^^^^^
+Jacquard first develops the superset of all loci (CHROM, POS, REF, and ALT) 
+across the set of all input VCFs. For each locus, the input VCF FORMAT tags and
+values are merged into a single row. Input variant record-level fields (such as
+FILTER, INFO, etc.) are ignored.
+
+MERGE_LOCI_IMAGE_HERE
+
+
+Combine matching samples
+^^^^^^^^^^^^^^^^^^^^^^^^
+In the input directory, an individual sample could be called by more than one
+variant caller. When merging, Jacquard combines results from the same sample
+into a single column. Merged sample names are constructed by concatenating the
+filename prefix and the VCF column header.
+
++--------------------+-----------------------------------+---------------------+
+| Filename           | VCF Column header                 | Merged sample names |
++--------------------+-----------------------------------+---------------------+
+| case_A.strelka.vcf | #CHROM ... FORMAT SAMPLE1 SAMPLE2 | | case_A:SAMPLE1    |
+|                    |                                   | | case_A:SAMPLE2    |
++--------------------+-----------------------------------+---------------------+
+| case_A.mutect.vcf  | #CHROM ... FORMAT SAMPLE1 SAMPLE2 | | case_A:SAMPLE1    |
+|                    |                                   | | case_A:SAMPLE2    |
++--------------------+-----------------------------------+---------------------+
+| case_B.strelka.vcf | #CHROM ... FORMAT SAMPLE3 SAMPLE4 | | case_B:SAMPLE3    |
+|                    |                                   | | case_B:SAMPLE4    |
++--------------------+-----------------------------------+---------------------+
+| case_B.mutect.vcf  | #CHROM ... FORMAT SAMPLE3 SAMPLE4 | | case_B:SAMPLE3    |
+|                    |                                   | | case_B:SAMPLE4    |
++--------------------+-----------------------------------+---------------------+
+
+Given the input VCFs above, the resulting merged VCF will have four sample
+columns: ``case_A|SAMPLE1``, ``case_A|SAMPLE2``, ``case_B|SAMPLE3``, and
+``case_B|SAMPLE4``.
+
+
+Filter tag values and rows
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+By default, merge keeps only Jacquard-translated format tags (JQ\_\.*) and
+includes all variants with valid syntax at loci where at least one variant was
+somatic. The resulting filtered file contains fewer rows but higher-quality
+data than the input files.
+
+Though most variant callers have their own distinct set of format tags, some
+tag names are common across multiple callers. If there are any format tag name
+collisions, merge will add a prefix (e.g. JQ1_<original_tag>) in order to
+disambiguate the format tags.
+
+
+.. figure:: images/merge_filter_step.jpg
+
+   **The Filtering Process :** *Rows and specific cells in the VCF files are 
+   filtered based on the command-line options.*
+
+After filtering, the merge command combines all of the input VCFs into a single,
+merged VCF that includes all necessary information for continuing your analysis.
+
+The resulting VCF file contains the distinct set of all coordinates (CHROM, POS,
+REF, and ALT) and samples from the input files, provided they pass the filters.
+Each coordinate from the input VCF files is added to the output file, which
+increases the file length. Additionally, sample columns are merged for each
+patient, adding sample-specific information and increasing the number of
+columns and the file width.
+
+.. note:: Importantly, rather than giving caller-wise sample columns in the
+          output VCF file, merge emits patient-wise sample columns. For each
+          patient, the merge command joins the set of corresponding sample
+          columns into a single column. Grouping sample-specific information
+          by patient makes the data easier to analyze.
+
+
+Assemble the subset of FORMAT tags
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+TODO
+
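
The first two merge steps described in merge.rst (building the superset of loci and combining matching samples) can be sketched in Python. This is a simplified illustration, not Jacquard's code: all function names are hypothetical, the "filename prefix" is assumed to be the text before the first dot, and the separator is left as a parameter because the text above shows both `:` and `|`.

```python
import os


def merged_sample_names(filename, column_header, separator="|"):
    """Build merged sample names from a filename and a VCF column header line."""
    # Assumption: the prefix is everything before the first "." in the name.
    prefix = os.path.basename(filename).split(".")[0]
    columns = column_header.rstrip("\n").split("\t")
    samples = columns[columns.index("FORMAT") + 1:]
    return [prefix + separator + sample for sample in samples]


def merged_columns(files_and_headers, separator="|"):
    """Return the distinct merged sample names, in first-seen order."""
    seen = []
    for filename, header in files_and_headers:
        for name in merged_sample_names(filename, header, separator):
            if name not in seen:
                seen.append(name)
    return seen


def loci_superset(records_by_file):
    """Return the sorted union of (CHROM, POS, REF, ALT) across all files."""
    loci = set()
    for records in records_by_file.values():
        for record in records:
            loci.add((record["CHROM"], record["POS"], record["REF"], record["ALT"]))
    return sorted(loci)
```

Because names are derived from the filename prefix rather than the caller, the same sample called by two variant callers collapses into one merged column.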
diff --git a/doc/summarize.rst b/doc/summarize.rst
@@ -0,0 +1,36 @@
+.. _summarize-command:
+
+Summarize
+=========
+The summarize command adds new INFO fields and FORMAT tags that combine variant
+data from the merged VCF. It will only work with VCF files that have been
+translated.
+
+.. figure:: images/summarize.jpg
+
+   **Summarizing Format Tags :** *The Jacquard-translated format tags from
+   each caller are aggregated and processed together to create consensus format
+   tags.* 
+
+Usage
+-----
+``usage: jacquard summarize <input_file> <output_file>``
+
+
+*positional arguments:*
+
++--------+---------------------------------------------------------------------+
+| input  | | Jacquard-merged VCF file (or any VCF with Jacquard tags; e.g.     |
+|        |   JQ_SOM_MT)                                                        |
++--------+---------------------------------------------------------------------+
+| output | | A single VCF file                                                 |
++--------+---------------------------------------------------------------------+
+
+Description
+-----------
+The summarize command uses the Jacquard-specific tags to aggregate caller
+information from the file, providing a summary-level view. Summary fields,
+such as averages, make it easier to determine which variants are likely to be
+true variants.
+
+The summarized format tags contain the prefix 'JQ_SUMMARY'.
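
The kind of aggregation summarize.rst describes (for example, averaging a value across callers) can be sketched as below. This is an assumption-laden illustration, not Jacquard's algorithm: the function name, the per-caller tag names used in the usage note, and the summary tag name are hypothetical; only the `JQ_` and `JQ_SUMMARY` prefixes come from the text above.

```python
def add_summary_average(sample_tags, suffix="AF",
                        summary_tag="JQ_SUMMARY_AF_AVERAGE"):
    """Return a copy of sample_tags with an averaged summary tag added.

    sample_tags maps FORMAT tag IDs to string values for one sample.
    The suffix and summary_tag defaults are hypothetical names.
    """
    values = [
        float(value)
        for tag, value in sample_tags.items()
        if tag.startswith("JQ_")
        and not tag.startswith("JQ_SUMMARY")
        and tag.endswith("_" + suffix)
        and value != "."  # "." marks a missing value in VCF
    ]
    result = dict(sample_tags)
    if values:
        result[summary_tag] = "{:.2f}".format(sum(values) / len(values))
    else:
        result[summary_tag] = "."
    return result
```

For a sample with hypothetical caller tags `JQ_MT_AF=0.25` and `JQ_VS_AF=0.75`, the sketch adds a summary tag holding their mean and leaves the original tags untouched.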