Merge pull request #140 from croth1/improve_docs
improved docs
Christian Roth committed Jan 3, 2019
2 parents 4784606 + 5c74a25 commit 53f9e87
Showing 7 changed files with 133 additions and 0 deletions.
Binary file added docs/source/img/motif_distribution_peaked.png
Binary file added docs/source/img/motif_distribution_uniform.png
Binary file added docs/source/img/motif_pval_stats.png
14 changes: 14 additions & 0 deletions docs/source/intro.rst
Expand Up @@ -17,6 +17,20 @@ It supports four workflows for integrated Motif analyses:

BaMM webserver is designed to make the power of higher-order motif analysis accessible in common motif analysis tasks.

Optimal input data
******************

The BaMM webserver works best with:

* 1000-10000 short (up to 250 nt) nucleotide sequences in FASTA format that are enriched with motifs.
* Sequences derived from ChIP-seq, CLIP-seq, HT-SELEX, or similar techniques.
* Sequences that passed quality control and are preselected for bound sequences (see also :ref:`faq_chipseq_preprocess`).

If you intend to ...

* ... submit long sequences, please have a look at: :ref:`faq_long_sequences`
* ... submit very few sequences, please have a look at: :ref:`faq_few_sequences`



Understanding BaMMs
*******************

114 changes: 114 additions & 0 deletions docs/source/misc.rst
Expand Up @@ -189,6 +189,8 @@ FAQ
in the webserver's github repository.


.. _faq_chipseq_preprocess:

How do I prepare my ChIP-seq data?
**********************************

Expand All @@ -203,6 +205,31 @@ Following pipeline has so far yielded good results for us.
* Extract fixed-length FASTA sequences (e.g. 201 nt) centered on the peak regions
* Submit a fasta file with sequences from the highest ranked peaks (e.g. 5000)
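
The last two steps of the pipeline above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the server's own preprocessing: the ``(chromosome, summit, score)`` peak layout and the function name ``centered_windows`` are hypothetical, and coordinates are treated as 0-based, half-open intervals as in BED files.

```python
from typing import List, Tuple

# Hypothetical peak layout: (chromosome, summit position, peak score).
Peak = Tuple[str, int, float]


def centered_windows(peaks: List[Peak], width: int = 201,
                     top_n: int = 5000) -> List[Tuple[str, int, int]]:
    """Return BED-style (chrom, start, end) windows of fixed width,
    centered on the summits of the top_n highest-scoring peaks."""
    half = width // 2
    # Rank peaks by score, strongest first, and keep the top_n.
    ranked = sorted(peaks, key=lambda p: p[2], reverse=True)[:top_n]
    windows = []
    for chrom, summit, _score in ranked:
        start = summit - half
        if start < 0:
            # Window would run off the chromosome start: skip this peak.
            continue
        windows.append((chrom, start, start + width))
    return windows
```

The resulting intervals can then be written to a BED file and converted to FASTA, for example with ``bedtools getfasta``.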

.. _faq_long_sequences:

Can I use the server with long sequences (>250bp)?
**************************************************

BaMMserver uses a :term:`ZOOPS` model for learning and evaluating its higher-order models.
That means all motifs are trained independently from each other and every sequence in the input file is considered to have either exactly one or no occurrence of the motif.

This setting is optimized for short sequences that are strongly enriched for the motif of interest, e.g. generated by CLIP-, ChIP-, or SELEX-based methods.
For longer sequences (e.g. scanning full promoter sequences), our :term:`ZOOPS` model has limitations:

* Low-complexity repeat sequences (e.g. `ATATATATAT` repeats) are abundant in genomes.
  Repeats will appear as strong motifs, despite having little biological significance for most questions.
* Our evaluation metric :term:`AvRec` measures how well the motif distinguishes input sequences from scrambled sequences. In our :term:`ZOOPS` model, only the best motif occurrence per sequence is used for classification, so the longer the input sequences, the higher the probability that the best occurrence is a spurious match.

This has several implications:

* Despite their strong enrichment, low-complexity motifs are often biologically irrelevant.
* All but very long and informative motifs (often low-complexity repeats!) score poorly in the :term:`AvRec` benchmark.

The following options are possible:

* Use only the seeding stage (``Manual seed selection``, see :ref:`workflows_denovo`) which learns :term:`PWM`\ s using a :term:`MOPS` model. Skip the refinement of the seeds to BaMMs.
* Chop long sequences into multiple shorter sequences (e.g. 100bp) to get a more robust performance estimate, especially when only a few sequences (<1000) are used (see also :ref:`faq_few_sequences`).
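
The second option can be done before upload with a short script. The following is a minimal sketch, assuming the sequences are already available as ``(name, sequence)`` pairs; the function names and the ``min_len`` cutoff for discarding short trailing fragments are illustrative choices, not server requirements.

```python
def chop_sequence(seq, chunk_len=100, min_len=50):
    """Split seq into consecutive, non-overlapping chunks of chunk_len.
    Trailing fragments shorter than min_len are discarded."""
    chunks = [seq[i:i + chunk_len] for i in range(0, len(seq), chunk_len)]
    return [c for c in chunks if len(c) >= min_len]


def chop_fasta_records(records, chunk_len=100, min_len=50):
    """Apply chop_sequence to (name, sequence) pairs and give every
    chunk a unique name derived from its parent sequence."""
    chopped = []
    for name, seq in records:
        for idx, chunk in enumerate(chop_sequence(seq, chunk_len, min_len)):
            chopped.append((f"{name}_part{idx}", chunk))
    return chopped
```

Note that non-overlapping chunks can cut a motif occurrence at a chunk boundary; overlapping chunks would avoid this, at the cost of counting some occurrences twice.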


.. _faq_few_sequences:

Can I use the server with very few sequences?
*********************************************
Expand All @@ -215,6 +242,73 @@ The seeding stage itself uses a MOOPS model. It is therefore possible to scan a
You can circumvent the minimum requirement of sequences for the seeding stage by adding extra sequences with the sequence 'NNNNNNNNNNNN'.
Please note that these seeds cannot be optimized to higher-order models due to the ZOOPS assumption.
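
The padding workaround described above can be scripted as follows. This is a sketch only: the minimum number of sequences is passed in as ``min_sequences`` because its value is not stated here, and the helper names are hypothetical.

```python
def pad_with_n_sequences(records, min_sequences, pad_seq="NNNNNNNNNNNN"):
    """Append all-N placeholder sequences until the list of
    (name, sequence) pairs contains at least min_sequences entries."""
    padded = list(records)
    n_missing = max(0, min_sequences - len(padded))
    for i in range(n_missing):
        padded.append((f"padding_{i}", pad_seq))
    return padded


def to_fasta(records):
    """Serialize (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```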

How do I figure out whether my motif is biologically relevant?
**************************************************************

Motif learners find enriched sequence motifs in the input data. However, statistically significant motifs do not necessarily play a role in regulatory processes.
De-novo motifs should be analyzed carefully - regulatory function should not be ascribed without further validation. We offer several ways to help validate the motifs:

Infer relevance from P-value distribution
=========================================

If you perform quality control and select only the 1000-5000 most strongly bound sequences, true motifs should be present in a substantial fraction of the sequences.
Have a look at the p-value statistic for calculating the motif :term:`AvRec` (upper right plot in evaluation panel).

There are mainly two things to ensure (see also :ref:`faq_relevance_pvals`):

* The p-value distribution should be skewed towards low p-values (the more uniform the distribution, the lower the prevalence and information of the motif).
* There should be a significant portion of area above the orange line (meaning that a significant portion of input sequences carry the motif).


If your motif does not pass the above criteria, **try setting stricter cutoffs** for selecting the sequences or **shortening the sequences**. If this does not help, the motif is probably not relevant.
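
If the per-sequence p-values are available, the two criteria can also be checked numerically. The sketch below is not the computation behind the server's plot. It uses a Storey-type estimate as a stand-in: the density of p-values above a threshold ``lam`` is taken as the background level, on the assumption that this roughly corresponds to the orange line, and everything above that level is attributed to sequences carrying the motif. The function name and the default ``lam=0.5`` are illustrative.

```python
def estimate_motif_fraction(pvalues, lam=0.5):
    """Estimate the fraction of input sequences that carry the motif.

    P-values above lam are assumed to come almost exclusively from
    sequences without the motif (uniformly distributed p-values), so
    their density estimates the background fraction pi0.
    """
    if not pvalues:
        raise ValueError("pvalues must not be empty")
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must be in [0, 1)")
    n_above = sum(1 for p in pvalues if p > lam)
    pi0 = min(1.0, n_above / (len(pvalues) * (1.0 - lam)))
    return 1.0 - pi0
```

A value close to zero corresponds to a near-uniform p-value distribution, i.e. a motif that is present in very few input sequences.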

.. warning:: Using only very few input sequences will increase the noise on the p-value distribution and may make it hard to interpret the plot (see also :ref:`faq_few_sequences`)

.. _faq_relevance_pvals:

.. figure:: img/motif_pval_stats.png
   :width: 400px
   :alt: P-value distribution for calculating the motif AvRec score.
   :align: center

   Motif p-value distribution indicates the relevance of the motif

   Good motifs have a p-value distribution that is heavily skewed towards low p-values (left side).
   The histogram area above the orange line represents the input sequences that contain the motif.
   The rectangular area above the background sequences and below the orange line represents the input sequences that do not contain the motif.
   For biologically relevant motifs, the ratio of the first area to the second should not be too small (a small ratio means that only very few input sequences contain the motif).


Infer relevance from motif occurrences
======================================

When the sequences are generated from signal peaks (e.g. ChIP-seq), there is an additional information source available: when the sequences are extracted symmetrically around the peak, motifs should be enriched around the center.

The centered enrichment of motifs from ChIP-seq sequences can be appreciated in the figure below.

.. image:: img/motif_distribution_peaked.png

For sequences centered around peaks, motifs with uniform occurrence distributions are less likely to be of relevance.

.. image:: img/motif_distribution_uniform.png
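
The difference between the two distributions can be quantified with a simple summary statistic. The following sketch assumes that the 0-based midpoint of each motif occurrence is known and that all sequences have the same length; the function names and the window size are illustrative, and edge effects from the motif length are ignored.

```python
def central_fraction(positions, seq_len, window=50):
    """Fraction of motif occurrences whose midpoint lies within
    window/2 of the sequence center."""
    if not positions:
        raise ValueError("positions must not be empty")
    center = (seq_len - 1) / 2
    inside = sum(1 for pos in positions if abs(pos - center) <= window / 2)
    return inside / len(positions)


def expected_uniform_fraction(seq_len, window=50):
    """Central fraction expected if occurrences were spread uniformly
    over all positions of the sequence."""
    center = (seq_len - 1) / 2
    inside = sum(1 for pos in range(seq_len)
                 if abs(pos - center) <= window / 2)
    return inside / seq_len
```

A ``central_fraction`` well above ``expected_uniform_fraction`` indicates centered enrichment; similar values indicate a roughly uniform distribution.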


Infer relevance from motif complexity
=====================================

Low complexity repeat regions are abundant in genomes.
**Always be careful when interpreting repeat motifs like ACACACAC or TATATATAT**.
Their high abundance and high information content make them reach statistical significance easily.
They are especially prominent when the input sequences are long. The webserver will show a warning if the best scoring motif is a repeat motif.
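
A consensus sequence can be checked for this kind of repeat structure with a few lines of code. The sketch below is a simple heuristic and not the criterion behind the webserver's warning, which is not described here; the function name and the ``max_unit`` and ``min_repeats`` parameters are illustrative.

```python
def is_low_complexity_repeat(consensus, max_unit=3, min_repeats=3):
    """Return True if consensus consists of a short unit (up to
    max_unit nucleotides) repeated at least min_repeats times.
    A trailing partial copy of the unit is allowed."""
    seq = consensus.upper()
    for unit_len in range(1, max_unit + 1):
        if len(seq) < unit_len * min_repeats:
            continue  # too short to contain enough copies of the unit
        unit = seq[:unit_len]
        # Every position must match the unit at the same phase.
        if all(seq[i] == unit[i % unit_len] for i in range(len(seq))):
            return True
    return False
```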

Infer relevance from MMcompare annotation
=========================================

Motif comparisons generated by our MMcompare tool can be used as a strong indicator that the motif is relevant and a good starting point for deeper investigation of the underlying biology.

.. warning:: Please do not forget to always be sceptical when assigning proteins and function to your discovered motifs. The motifs can originate from cofactors with strong binding motifs, or repetitive regions.

.. image:: img/mmcompare_annotation.png

Where is the button to visualize in the genome browser?
*******************************************************

Expand Down Expand Up @@ -243,6 +337,26 @@ Please refer to the `documentation of bedtools <http://bedtools.readthedocs.io/e
Miscellaneous
#############

Glossary
********

.. glossary::

   ZOOPS
      **Z**\ ero or **O**\ ne **O**\ ccurrence **P**\ er **S**\ equence, describes the modeling assumption that each input sequence contains either no motif occurrence or exactly one.

   MOPS
      **M**\ ultiple **O**\ ccurrences **P**\ er **S**\ equence, describes the modeling assumption that input sequences can contain zero or multiple occurrences of a motif.

   AvRec
      **Av**\ erage **Rec**\ all, evaluation metric used by the BaMMserver, for details see :ref:`results_avrec`.

   PWM
      **P**\ osition **W**\ eight **M**\ atrix, zeroth-order motif model with independent contributions from each motif position.
      See also `PWMs on Wikipedia <https://en.wikipedia.org/wiki/Position_weight_matrix>`_.



Using the commandline tools
***************************

3 changes: 3 additions & 0 deletions docs/source/results.rst
Expand Up @@ -50,6 +50,9 @@ contributions to the information content.
:alt: Sequence logos
:align: center


.. _results_avrec:

AvRec evaluation
****************

2 changes: 2 additions & 0 deletions docs/source/workflows.rst
Expand Up @@ -12,6 +12,8 @@ The BaMM webserver offers four workflows for analyzing regulatory motifs. In the

A schematic of the four workflows available at the BaMM webserver: (i) De-novo motif discovery, (ii) Motif-motif compare, (iii) Motif database search, (iv) Motif scan.

.. _workflows_denovo:

De-novo motif discovery
***********************
