Skip to content

Commit

Permalink
Merge pull request #78 from croth1/update_docs
Browse files Browse the repository at this point in the history
update documentation
  • Loading branch information
Christian Roth committed Apr 23, 2018
2 parents d927d7c + 3f9706b commit 72d2cae
Show file tree
Hide file tree
Showing 7 changed files with 473 additions and 304 deletions.
438 changes: 134 additions & 304 deletions docs/source/BaMMserver.rst

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -165,6 +165,7 @@
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_static_path = []

# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
Expand Down
Binary file added docs/source/img/BaMM_seq_logos.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/source/img/bayesianScheme.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/source/img/evaluation_plots.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
7 changes: 7 additions & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,14 @@ Welcome to BaMMServer

.. toctree::
BaMMserver
results


.. External links:
.. _SoedingLab: https://www.mpibpc.mpg.de/soeding

.. _PEnGmotif repository: https://github.com/soedinglab/PEnG-motif

.. _BaMMmotif repository: https://github.com/soedinglab/BaMMmotif

.. _BaMMmotif paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5291271/
331 changes: 331 additions & 0 deletions docs/source/results.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,331 @@
What does the results page show?
================================
.. _What does the results page show:

BaMMmotif outputs the best three motifs after seeding all possible motifs and optimized the top 3 motifs.

For each motif, the IUPAC code, PWM sequence logos for both sense and anti-sense sequences, AvRec score and
fractional occurrence ratio are shown.

When you click on one motif, or scroll down the page, you will find more details, such as 1st- and 2nd-order sequence
logos for the positive strand, which show the amount of information contributed by each order over and above what is
provided by lower orders, for each oligonucleotide and position (:ref:`details <How to interpret BaMMmotif logos>`).

You will also see two measures for the motif performance, namely, p-value statistics and Recall-TP/FP ratio curves for
the motif evaluated either with the assumption that all input sequences are positives (left panel) or by re-estimating the
null distribution and getting a new positive:negative ratio (right panel). The corresponding area-under-the curve
(AvRev values) are calculated (:ref:`detals <How to interpret evaluation plots>`). This hopefully helps you estimate
the suitability of a motif model to predict binding sites in your input data set.

The last plot shows the motif position distribution in your sequence set. If you search on both strands, it will show
your the motif distribution on both strands.

All the settings for obtaining these models are listed at the bottom of the results page.

You can download all the models by clicking "DOWNLOAD ALL" button or each individual model by clicking "DOWNLOAD".

How to interpret BaMMmotif logos?
=================================
.. _How to interpret BaMMmotif logos:

We developed sequence logos for higher orders to visualise the BaMMs. We split the relative entropy
:math:`H(p_{motif}|p_{bg}) = \sum_{x}^{}{ p_{motif}(x) log_2[ p_{motif}(x) / p_{bg}(x)]}` into a sum of terms, one
for each order. The logos show the amount of information contributed by each order over and above what is provided by
lower orders, for each oligo-nucleotide and position.

In the 0th-order sequence logo, the higher of the four bases on each column is determined by their relative frequencies.
More frequent bases are depicted on top of less frequent bases. Consequently, the consensus sequence can be assembled
from the top bases, while the vertical order of bases in each column corresponds to their order of predominance.

This 0th-order sequence logo was designed to reflect the characteristics of the PWM model and has been widely used.
However, it is not suited to illustrate dependencies between binding site positions. Thus, we generate higher-order
logos for you. In the higher-order logos, the height of both columns and *k*-mers corresponds to the contribution to the
information content that is not yet described in a lower order, in other words, the information you can gain by taking
into consideration of the dependencies between positions in the motif. Note that *k*-mers can exhibit negative
contributions to the information content.

.. image:: img/BaMM_seq_logos.png
:width: 400px
:height: 200px
:scale: 150 %
:alt: Sequence logos
:align: center


How to interpret evaluation plots?
==================================
.. _How to interpret evaluation plots:

**Evaluation plots and AvRec score**

.. image:: img/evaluation_plots.png
:width: 450px
:height: 450px
:scale: 150 %
:alt: Evaluation plots
:align: center

We generate background sequences from a second-order Markov model trained on all input sequences and ask how well
the input sequences (positives) are separated from the background sequences (negatives). As usual we define true
positives (TP) as predictions that are correct, false positives (FP) as predictions that are incorrect,and
false negatives (FN) as positive test cases that have not been predicted. The precision is defined as the fraction of
predictions that are correct, TP/(TP+FP), and the recall (= sensitivity) is the fraction of true motif instances that
are actually predicted, TP/(TP+FN).

The TP/FP ratio is normalized to the case where positives:negatives=1:1. We then plot the recall-TP/FP ratio curve,
with TP/FP ratio plotted on a logarithmic y-scale, e.g. between 1 and 100. The area under this curve summarizes the
performance of the model for the entire range of TP/FP values that are relevant in practice, without putting undue
emphasis on any specific region. It can be interpreted as the average model recall (AvRec) over the given range of
TP/FP ratio. This gives the motif performance on the input data set, The area under the Recall-TP/FP ratio curve is
calculated for positives:negative=1:1 and we name it as **dataset AvRec**. In addition, if the TP/FP ratio is above 1
for the case where positives:negatives=1:10, a dash curve for it will be depicted on the plot.
Additionally, the p-value density plot is shown, which indicates how we define each term.

(Note that **dataset AvRec** is used as the **zoops_score** for motif reranking in the seeding phase.)

Besides, given the p-value distribution from both input and background sequences, we use `fdrtool` to re-estimate
the null distribution and the ratio between positives and negatives, which is shown by the orange dash line
in the motif p-value statistics plot. In this way, we can estimate the motif quality independent from the initial
assumption that all input sequences are positives. The area under the Recall-TP/FP ratio curve is calculated for
positives:negative=1:1. We call this averaged recall as **motif AvRec**.


What downloaded package contains
================================
.. _What downloaded package contains:

MEME-format PWM models
......................
.. _MEME-format PWM models:

By default, BaMM!motif uses our in-house tool **PEnG!motif** to find seeding patterns in your input sequences.
It generates `MEME-format <http://meme-suite.org/doc/meme-format.html>`_ files (saved as `.meme` file) which
contain a bunch of PWMs of the motifs ranked by `zoops_scores`. The occurrences of predicted motifs are estimated
and shown as `occur`.

These PWMs are used as seeds for further refinement into BaMM models.

Here is an example for MEME-format file generated by our PEnG!motif:
::

MEME version 4

ALPHABET= ACGT

Background letter frequencies
A 0.25864 C 0.240258 G 0.241035 T 0.260067

MOTIF TGASTCATCSC
letter-probability matrix: alength= 4 w= 11 nsites= 32240 bg_prob= 0 opt_bg_order= 2 log(Pval)= -20070.6 zoops_score= 0.763 occur= 0.939
0.00000011 0.00000020 0.00000005 0.99999958
0.00000019 0.00000019 0.99973792 0.00026177
0.99776745 0.00222652 0.00000086 0.00000516
0.00043767 0.31039140 0.68885416 0.00031674
0.00000172 0.00001118 0.00000463 0.99998242
0.00015724 0.99983966 0.00000142 0.00000168
0.99997258 0.00000054 0.00002521 0.00000166
0.00000828 0.25723305 0.00413273 0.73862594
0.02208222 0.92982459 0.00702223 0.04107105
0.16592142 0.34808874 0.35102692 0.13496293
0.07382397 0.51519489 0.17385206 0.23712915

MOTIF ATTRTTTGTTTT
letter-probability matrix: alength= 4 w= 12 nsites= 13728 bg_prob= 0.0 opt_bg_order= 2 log(Pval)= -893.0211792 zoops_score= 0.252 occur= 0.621
0.68648666 0.00624365 0.03511349 0.27215624
0.27477601 0.00371415 0.06135688 0.66015303
0.00009623 0.00107756 0.00017856 0.99864769
0.56885940 0.00127072 0.42884308 0.00102682
0.00040802 0.00205148 0.00037785 0.99716270
0.00159969 0.00023089 0.00047653 0.99769294
0.00016588 0.00006246 0.01915511 0.98061651
0.24886248 0.01232569 0.70805818 0.03075366
0.00018377 0.14974646 0.01920011 0.83086962
0.08978166 0.01159330 0.00815281 0.89047223
0.00074780 0.00028864 0.00068021 0.99828333
0.27042452 0.00127012 0.01194946 0.71635598

BaMM models
...........
.. _BaMM models:

The optimized foreground model is in **BaMM**-format file with extension `.ihbcp` (inhomogeneous bamm conditional probability)
contains probabilities of the BaMM model. While blank lines separate BaMM positions, lines 1 to k+1 of each BaMM
position contain the (conditional) probabilities for order 0 to order k. For instance, the format for a BaMM of order
2 and length W is as follows:

**Filename extension: `.ihbcp`**

P\ :sub:`1`\ (A) P\ :sub:`1`\ (C) P\ :sub:`1`\ (G) P\ :sub:`1`\ (T)

P\ :sub:`1`\ (A|A) P\ :sub:`1`\ (C|A) P\ :sub:`1`\ (G|A) P\ :sub:`1`\ (T|A) P\ :sub:`1`\ (A|C) P\ :sub:`1`\ (C|C) P\ :sub:`1`\ (G|C) ... P\ :sub:`1`\ (T|T)

P\ :sub:`1`\ (A|AA) P\ :sub:`1`\ (C|AA) P\ :sub:`1`\ (G|AA) P\ :sub:`1`\ (T|AA) P\ :sub:`1`\ (A|AC) P\ :sub:`1`\ (C|AC) P\ :sub:`1`\ (G|AC) ... P\ :sub:`1`\ (T|TT)


P\ :sub:`2`\ (A) P\ :sub:`2`\ (C) P\ :sub:`2`\ (G) P\ :sub:`2`\ (T)

P\ :sub:`2`\ (A|A) P\ :sub:`2`\ (C|A) P\ :sub:`2`\ (G|A) P\ :sub:`2`\ (T|A) P\ :sub:`2`\ (A|C) P\ :sub:`2`\ (C|C) P\ :sub:`2`\ (G|C) ... P\ :sub:`2`\ (T|T)

P\ :sub:`2`\ (A|AA) P\ :sub:`2`\ (C|AA) P\ :sub:`2`\ (G|AA) P\ :sub:`2`\ (T|AA) P\ :sub:`2`\ (A|AC) P\ :sub:`2`\ (C|AC) P\ :sub:`2`\ (G|AC) ... P\ :sub:`2`\ (T|TT)

...

P\ :sub:`w`\ (A) P\ :sub:`w`\ (C) P\ :sub:`w`\ (G) P\ :sub:`w`\ (T)

P\ :sub:`w`\ (A|A) P\ :sub:`w`\ (C|A) P\ :sub:`w`\ (G|A) P\ :sub:`w`\ (T|A) P\ :sub:`w`\ (A|C) P\ :sub:`w`\ (C|C) P\ :sub:`w`\ (G|C) ... P\ :sub:`w`\ (T|T)

P\ :sub:`w`\ (A|AA) P\ :sub:`w`\ (C|AA) P\ :sub:`w`\ (G|AA) P\ :sub:`w`\ (T|AA) P\ :sub:`w`\ (A|AC) P\ :sub:`w`\ (C|AC) P\ :sub:`w`\ (G|AC) ... P\ :sub:`w`\ (T|TT)


**BaMM Background Model**

The homogeneous Markov model derived from the K-mer frequencies of the positive sequenced is used as
background model for evaluating the foreground model. It is saved in a file with extension `.hbcp`
(homogeneous bamm conditional probabilities):

P(A) P(C) P(G) P(T)

P(A|A) P(C|A) P(G|A) P(T|A) P(A|C) P(C|C) P(G|C) ... P(T|T)

P(A|AA) P(C|AA) P(G|AA) P(T|AA) P(A|AC) P(C|AC) P(G|AC) ... P(T|TT)


The background model is important for generating BaMM sequence logos.

BaMM Sequence Logo
..................
.. _BaMM Sequence Logo:

BaMM sequences logos for 0th-order on both the original strand and its reverse complement. The 1st- and 2nd-order logos
are shown only for the original strand.

These logos are saved in `.png` files.

Motif Occurrence and Distribution
.................................
.. _Motif Occurrence and Distribution:

**Motif Occurrence**

The motif occurrences with a p-value above certain value are reported in the file with extension `.occurrence`. It
contains information such as:

::

chrom seq_length strand start..end pattern p-value e-value

This can be converted to a `.bed` file if the #CHROM information is properly given by the input FASTA file.

This `.occurrence` file is used for generating the motif distribution plot.

**Motif Distribution**

Motif distribution plot shows the relative motif positions to peak summit, if the input sequences are extracted
by peak callers. If motif is learned on both strands, it will show the motif distribution on both strands.

Motif Evaluation
................
.. _Motif Evaluation:


**zoops.stats file**

The `.zoops.stats` file contains the TPs (true positives), FPs (false positives), FDR (false discovery rate), Recall
p-values for both positive and background sequences, with the assumption that in the data set there is zero or one
occurrence of each motif per sequence (`zoops`). It is used for generating the evaluation plots and `bmscore` file.

**Evaluation plots**

The evaluation plots for motif on the data set are saved with extension `_dataRRC` and `_dataPval`.

The evaluation plots for motif only are saved with extension `_motifRRC` and `_motifPval`.

These plots are saved in both `.png` and `.pdf` files.

**bmscore file**

The `.bmscore` file keeps track of the AvRec score, motif occurrences for both dataset (the dataset_avrec and data_occur)
and motif only (the motif_avrec and motif_occur) for the top three optimized motifs obtained from the seeding phase.

Find my job
===========
.. _Find my job:

All the recently submitted jobs are listed on the "Find my job" page. You can also search for them by entering the job ID.

Command line tools
==================
.. _Command line tools:

Here are the stand-alone tools which can be downloaded from our Github repository and used via command lines:

PEnG!motif
..........
.. _PEnGmotif:

The command line version of PEnG!motif can be downloaded from our GitHub `PEnGmotif repository`_.

BaMM!motif
..........
.. _BaMMmotif:

The command line version of BaMM!motif can be downloaded from our GitHub `BaMMmotif repository`_.

BaMMScan
........
.. _BaMMScan:

The command line version of BaMMScan can be downloaded from our GitHub `BaMMmotif repository`_.

FDR analysis
............
.. _FDR:

The command line version of FDR can be downloaded from our GitHub `BaMMmotif repository`_.

How to use these tools via command lines?
.........................................
.. _How to use these tools via command lines:

A detailed description of how to use the command line tool can be found in the README section of the respective GitHub
repository.


FAQs
====
.. _FAQs:

How long are the results available?
...................................
.. _How long are the results available:

Currently, the results will be kept for 4 weeks.

What is the maximal size of input sequence file that I can upload?
..................................................................
.. _maximal size of input sequence file:

You can upload input sequence file with a maximal size of 40 MB.

For larger sequence files, you can either use our binary tools on the Github repositories, or download this webserver
and run it locally after changing settings in the .env file.

Citing and References
=====================
.. _Citing and References:

Please cite our paper: `BaMMmotif paper`_ DOI: 10.1093/nar/gkw521

Contact
=======
.. _Contact:

bamm(at)mpibpc(dot)mpg(dot)de

.. external links:
.. _PEnGmotif repository: https://github.com/soedinglab/PEnG-motif

.. _BaMMmotif repository: https://github.com/soedinglab/BaMMmotif

.. _BaMMmotif paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5291271/

0 comments on commit 72d2cae

Please sign in to comment.