Merge pull request #140 from croth1/improve_docs

improved docs
soedinglab · Jan 3, 2019 · 53f9e87
2 parents 4784606 + 5c74a25
commit 53f9e87

Showing 7 changed files with 133 additions and 0 deletions.
diff --git a/docs/source/img/motif_distribution_peaked.png b/docs/source/img/motif_distribution_peaked.png
diff --git a/docs/source/img/motif_distribution_uniform.png b/docs/source/img/motif_distribution_uniform.png
diff --git a/docs/source/img/motif_pval_stats.png b/docs/source/img/motif_pval_stats.png
diff --git a/docs/source/intro.rst b/docs/source/intro.rst
@@ -17,6 +17,20 @@ It supports four workflows for integrated Motif analyses:
 
 BaMM webserver is designed to make the power of higher-order motif analysis accessible in common motif analysis tasks. 
 
+Optimal input data
+******************
+
+The BaMM webserver works best with:
+  * 1000-10000 short (up to 250 nt), motif-enriched nucleotide sequences in FASTA format.
+  * Sequences derived from ChIP-seq, CLIP-seq, HT-SELEX, or similar techniques.
+  * Sequences that passed quality control and are preselected for bound sequences (see also :ref:`faq_chipseq_preprocess`).
+
+If you intend to ...
+  * ... submit long sequences, please have a look at: :ref:`faq_long_sequences`
+  * ... submit very few sequences, please have a look at: :ref:`faq_few_sequences`
+
+
+
 Understanding BaMMs
 *******************
 

diff --git a/docs/source/misc.rst b/docs/source/misc.rst
@@ -189,6 +189,8 @@ FAQ
         in the webserver's github repository.
 
 
+.. _faq_chipseq_preprocess:
+
 How do I prepare my ChIP-seq data?
 **********************************
 
@@ -203,6 +205,31 @@ Following pipeline has so far yielded good results for us.
   * Extract fasta sequences centered on the peak regions of fixed length (e.g 201)
   * Submit a fasta file with sequences from the highest ranked peaks (e.g. 5000)
 
+.. _faq_long_sequences:
+
+Can I use the server with long sequences (>250bp)?
+**************************************************
+
+BaMMserver uses a :term:`ZOOPS` model for learning and evaluating its higher-order models.
+That means all motifs are trained independently from each other and every sequence in the input file is considered to have either exactly one or no occurrence of the motif.
+
+This setting is optimized for short sequences that are strongly enriched for the motif of interest, e.g. generated by CLIP-, ChIP- or SELEX-based methods.
+For longer sequences (e.g. scanning full promoter sequences), our :term:`ZOOPS` model has limitations:
+
+  * Low complexity repeat sequences (e.g. ``ATATATATAT`` repeats) are abundant in genomes.
+    Repeats will appear as strong motifs, despite having little biological significance for most questions.
+  * Our evaluation metric :term:`AvRec` is based on how well the motif can distinguish input sequences from scrambled sequences. In our :term:`ZOOPS` model only the best motif occurrence per sequence is used for classification. The longer the input sequences, the more likely it is that a motif match occurs purely by chance.
+
+This has several implications:
+  * Despite their strong enrichment, low complexity motifs are often biologically irrelevant.
+  * All but very long and informative motifs (often low complexity repeats!) score poorly in the :term:`AvRec` benchmark.
+
+The following options are possible:
+  * Use only the seeding stage (``Manual seed selection``, see :ref:`workflows_denovo`) which learns :term:`PWM`\ s using a :term:`MOPS` model. Skip the refinement of the seeds to BaMMs.
+  * Chop long sequences up into multiple smaller sequences (e.g. 100bp) to get a more robust performance estimation, especially when only few sequences (<1000) are used (see also :ref:`faq_few_sequences`).
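
The chopping option above can be sketched in a few lines of Python. This is a minimal illustration and not part of the BaMM tools: the helper names ``chop_sequence`` and ``chop_fasta_records``, the non-overlapping chunking, and the ``min_len`` cutoff for the trailing fragment are all assumptions made for the example.

```python
def chop_sequence(seq, chunk_len=100, min_len=50):
    """Split seq into consecutive, non-overlapping chunks of chunk_len nt.

    A trailing chunk shorter than min_len is dropped (assumed cutoff).
    """
    chunks = [seq[i:i + chunk_len] for i in range(0, len(seq), chunk_len)]
    return [c for c in chunks if len(c) >= min_len]


def chop_fasta_records(records, chunk_len=100, min_len=50):
    """Yield (header, chunk) pairs for an iterable of (header, sequence) pairs.

    The chunk index is appended to the header so that names stay unique.
    """
    for header, seq in records:
        for idx, chunk in enumerate(chop_sequence(seq, chunk_len, min_len)):
            yield f"{header}_chunk{idx}", chunk


# Hypothetical input: one 240 nt sequence.
records = [("seq1", "ACGT" * 60)]
chopped = list(chop_fasta_records(records))
```

Note that non-overlapping chunks cut any motif occurrence that straddles a chunk boundary; overlapping windows would avoid this at the cost of counting some occurrences twice.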
+
+
+.. _faq_few_sequences:
 
 Can I use the server with very few sequences?
 *********************************************
@@ -215,6 +242,73 @@ The seeding stage itself uses a MOOPS model. It is therefore possible to scan a
 You can circumvent the minimum requirement of sequences for the seeding stage by adding extra sequences with the sequence 'NNNNNNNNNNNN'.
 Please note that these seeds cannot be optimized to higher-order models due to the ZOOPS assumption.
 
+How do I figure out whether my motif is biologically relevant?
+**************************************************************
+
+Motif learners find enriched sequence motifs in the input data. However, statistically significant motifs do not necessarily play a role in regulatory processes.
+De-novo motifs should be analyzed carefully: regulatory function should not be ascribed without further validation. We offer several ways to help validate the motifs:
+
+Infer relevance from P-value distribution
+=========================================
+
+If you perform quality control and select only the 1000-5000 most strongly bound sequences, true motifs should be present in a substantial fraction of the sequences.
+Have a look at the p-value statistics used for calculating the motif :term:`AvRec` (upper right plot in the evaluation panel).
+
+There are mainly two things to ensure (see also :ref:`faq_relevance_pvals`):
+  * The p-value distribution should be skewed towards low p-values (the more uniform the distribution, the less prevalent and informative the motif).
+  * There should be a significant portion of area above the orange line (meaning that a significant portion of input sequences carry the motif).
+
+
+If your motif does not pass the above criteria, **try setting stricter cutoffs** for selecting the sequences or **shortening the sequences**. If this does not help, the motif is probably not relevant.
+
+.. warning:: Using only very few input sequences will increase the noise in the p-value distribution and may make the plot hard to interpret (see also :ref:`faq_few_sequences`).
+
+.. _faq_relevance_pvals:
+
+.. figure:: img/motif_pval_stats.png
+   :width: 400px
+   :alt: P-value distribution for calculating the motif AvRec score.
+   :align: center
+
+   Motif p-value distribution indicates the relevance of the motif
+
+   Good motifs have a p-value distribution that is heavily skewed towards low p-values (left side).
+   The histogram area above the orange line corresponds to the input sequences that contain the motif.
+   The rectangular area above the background sequences and below the orange line corresponds to the input sequences that do not contain the motif.
+   For biologically relevant motifs the ratio between the two areas should not be too small (a small ratio means that only very few input sequences contain the motif).
+
+
+Infer relevance from motif occurrences
+======================================
+
+When the sequences are generated from signal peaks (e.g. ChIP-seq), there is an additional information source available: when the sequences are extracted symmetrically around the peak, motifs should be enriched around the center.
+
+The centered enrichment of motifs from ChIP-seq sequences can be appreciated in the figure below.
+
+.. image:: img/motif_distribution_peaked.png
+
+For sequences centered around peaks, motifs with uniform occurrence distributions are less likely to be of relevance.
+
+.. image:: img/motif_distribution_uniform.png
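
The visual check above can also be expressed as a number. The following Python sketch compares the fraction of motif occurrences near the sequence center with the fraction expected under a uniform distribution. It is an illustration only, not how the webserver evaluates motifs: the function names, the 50 nt window, and the example positions are assumptions.

```python
def central_fraction(positions, seq_len, window=50):
    """Fraction of motif positions within window/2 of the sequence center.

    positions: 0-based motif positions, all from sequences of length seq_len.
    """
    if not positions:
        raise ValueError("need at least one motif position")
    center = (seq_len - 1) / 2
    half = window / 2
    inside = sum(1 for p in positions if abs(p - center) <= half)
    return inside / len(positions)


def uniform_expectation(seq_len, window=50):
    """Central fraction expected if every position were equally likely."""
    return central_fraction(list(range(seq_len)), seq_len, window)


# Hypothetical motif positions in sequences of length 201.
positions = [98, 100, 103, 95, 110, 20, 180, 101]
observed = central_fraction(positions, seq_len=201)
expected = uniform_expectation(seq_len=201)
```

An observed fraction well above the uniform expectation is consistent with a peaked distribution; values close to the expectation point towards a uniform one.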
+
+
+Infer relevance from motif complexity
+=====================================
+
+Low complexity repeat regions are abundant in genomes.
+**Always be careful when interpreting repeat motifs like ACACACAC or TATATATAT**.
+The high abundance of repeats and their high information content make them reach statistical significance easily.
+They are especially prominent when the input sequences are long. The webserver will show a warning if the best scoring motif is a repeat motif.
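
A simple way to flag such motifs yourself is to test whether a motif's consensus string is a repetition of a short unit. The Python sketch below is a hypothetical check on a plain consensus string; the criterion the webserver uses for its own warning is not described here, and the function name ``repeat_unit`` and the ``max_unit`` limit are assumptions.

```python
def repeat_unit(consensus, max_unit=3):
    """Return the shortest unit (1..max_unit nt) that tiles the consensus.

    The unit must occur at least twice. Returns None if no such unit exists.
    """
    n = len(consensus)
    for k in range(1, max_unit + 1):
        if n < 2 * k:
            break
        unit = consensus[:k]
        # Tile the unit and truncate to the consensus length, so that
        # partial trailing repeats (e.g. TATATATAT) are also detected.
        if (unit * (n // k + 1))[:n] == consensus:
            return unit
    return None


flags = {c: repeat_unit(c) for c in ["ACACACAC", "TATATATAT", "GATTACAGC"]}
```

This exact-match test misses degenerate repeats with a few mismatches, so it complements rather than replaces a visual inspection of the motif logo.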
+
+Infer relevance from MMcompare annotation
+=========================================
+
+Motif comparisons generated by our MMcompare tool can be used as a strong indicator that the motif is relevant and a good starting point for deeper investigation of the underlying biology.
+
+.. warning:: Please do not forget to always be sceptical when assigning proteins and function to your discovered motifs. The motifs can originate from cofactors with strong binding motifs, or repetitive regions. 
+
+.. image:: img/mmcompare_annotation.png
+
 Where is the button to visualize in the genome browser?
 *******************************************************
 
@@ -243,6 +337,26 @@ Please refer to the `documentation of bedtools <http://bedtools.readthedocs.io/e
 Miscellaneous
 #############
 
+Glossary
+********
+
+.. glossary::
+
+  ZOOPS
+    **Z**\ ero or **O**\ ne **O**\ ccurrence **P**\ er **S**\ equence, describes the modeling assumption that each input sequence contains either no motif occurrence or exactly one.
+
+  MOPS
+    **M**\ ultiple **O**\ ccurrences **P**\ er **S**\ equence, describes the modeling assumption that each input sequence can contain zero, one, or multiple occurrences of a motif.
+
+  AvRec
+    **Av**\ erage **Rec**\ all, evaluation metric used by the BaMMserver, for details see :ref:`results_avrec`.
+
+  PWM
+    **P**\ osition **W**\ eight **M**\ atrix, zeroth-order motif model with independent contributions from each motif position.
+    See also `PWMs on Wikipedia <https://en.wikipedia.org/wiki/Position_weight_matrix>`_.
+
+
+
 Using the commandline tools
 ***************************
 

diff --git a/docs/source/results.rst b/docs/source/results.rst
@@ -50,6 +50,9 @@ contributions to the information content.
    :alt: Sequence logos
    :align: center
 
+
+.. _results_avrec:
+
 AvRec evaluation
 ****************
 

diff --git a/docs/source/workflows.rst b/docs/source/workflows.rst
@@ -12,6 +12,8 @@ The BaMM webserver offers four workflows for analyzing regulatory motifs. In the
 
    A schematic of the four workflows available at the BaMM webserver: (i) De-novo motif discovery, (ii) Motif-motif compare, (iii) Motif database search, (iv) Motif scan.
 
+.. _workflows_denovo:
+
 De-novo motif discovery
 ***********************