Implement sophisticated seq logo algorithms #94

wilzbach · 2014-11-19T02:20:15Z

This issue is a follow-up to #16.

Unfortunately the seqlogo doesn't handle this natively.
(just as a reminder for myself)

original paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC332411/pdf/nar00204-0153.pdf

follow-ups:

http://bioinformatics.oxfordjournals.org/content/28/14/1935.long
http://schneider.ncifcrf.gov/paper/hawaii/latex/paper.pdf

goldbergtatyana · 2014-11-19T10:05:14Z

The most popular and highest cited seq logo implementation is WebLogo of Steven Brenner: http://weblogo.berkeley.edu/logo.cgi

Link to the paper: http://weblogo.berkeley.edu/Crooks-2004-GR-WebLogo.pdf

From: Seb
Sent: Wednesday, November 19, 2014 3:20 AM
To: greenify/biojs-vis-msa
Subject: [biojs-vis-msa] Implement sophisticated seq logo algorithms (#94)

This issue is a follow-up to #16.

Unfortunately the seqlogo doesn't handle this natively.
(just as a reminder for myself)

original paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC332411/pdf/nar00204-0153.pdf

follow-ups:

http://bioinformatics.oxfordjournals.org/content/28/14/1935.long
http://schneider.ncifcrf.gov/paper/hawaii/latex/paper.pdf

—
Reply to this email directly or view it on GitHub.

wilzbach · 2014-11-30T07:32:25Z

This is now done with the new biojs-stat-seq, globally available under <msaInstance>.g.stats

It would also support to use a custom background or calculate the background from the sequences.

wilzbach · 2014-11-30T07:35:29Z

BTW how do we handle gaps in the alignment?

use the gaps in the calculation & filter them out afterwards (current solution)
filter the gaps out before the calculation

TatyanaGoldberg · 2014-12-01T20:43:30Z

Hi Seb,

I don't see in https://github.com/greenify/biojs-stat-seqs the information on how you calculate the sequence conservation at each position in the alignment. If you havent done so, please use the formula from the WebLogo paper [1]. The height of each base (or amino acid) per side is then its frequency at a specific position times the sequence conservation at this position.

BTW how do we handle gaps in the alignment?

If we have an alignment:
AWKDAA
-WKDAA
AKKWDD
AKAWDA

Then the frequency of A at position 1 would be 0.75. The information content (i.e. the sequence conservation) would be 4 and the height of stack A then 3.

[1] http://genome.cshlp.org/content/14/6/1188.long

wilzbach · 2014-12-02T03:06:26Z

how you calculate the sequence conservation at each position in the alignment.

https://github.com/greenify/biojs-stat-seqs/blob/master/lib/index.js#L216

If we have an alignment:

I also have written some tests.

trivia

maxLength 6
gaps [ 0.25, 0, 0, 0, 0, 0 ]
frequency [ { A: 0.75, '-': 0.25 },
  { W: 0.5, K: 0.5 },
  { K: 0.75, A: 0.25 },
  { D: 0.5, W: 0.5 },
  { A: 0.5, D: 0.5 },
  { A: 0.75, D: 0.25 } ]
consensus AWKDAA
identity to consensus [ 1, 1, 0.33, 0.33 ]
identity to AAAA [ 0.17, 0, 0.17, 0.33 ]

For the identity I ignore gaps.

ic

ic [ 0.81, 1, 0.81, 1, 1, 0.81 ]
scaled ic [ 0.41, 0.5, 0.41, 0.5, 0.5, 0.41 ]

conservation [ 1.19, 1, 1.19, 1, 1, 1.19 ]
conservation scaled [ 0.59, 0.5, 0.59, 0.5, 0.5, 0.59 ]

conservation per residue [ { A: 0.89, '-': 0.3 },
  { W: 0.5, K: 0.5 },
  { K: 0.89, A: 0.3 },
  { D: 0.5, W: 0.5 },
  { A: 0.5, D: 0.5 },
  { A: 0.89, D: 0.3 } ]
conservation per residue scaled [ { A: 0.45, '-': 0.15 },
  { W: 0.25, K: 0.25 },
  { K: 0.45, A: 0.15 },
  { D: 0.25, W: 0.25 },
  { A: 0.25, D: 0.25 },
  { A: 0.45, D: 0.15 } ]

For the sequence logo I draw "conservation per residue scaled" with just omitting the gaps.

Background

Frequency of the letters

background { A: 0.38, W: 0.17, K: 0.21, D: 0.21, '-': 0.04 }

We could offer to calculate the information content against a given background e.g. Uniprot or the alignment itself.

TatyanaGoldberg · 2014-12-02T11:59:05Z

Comments:

We have following sample sequence alignment:
AWKDAA
-WKDAA
AKKWDD
AKAWDA

maxLength 6
gaps [ 0.25, 0, 0, 0, 0, 0 ]
frequency [ { A: 0.75, '-': 0.25 },
{ W: 0.5, K: 0.5 },
{ K: 0.75, A: 0.25 },
{ D: 0.5, W: 0.5 },
{ A: 0.5, D: 0.5 },
{ A: 0.75, D: 0.25 } ]
consensus AWKDAA
good
identity to consensus [ 1, 1, 0.33, 0.33 ]
The first sequence in the input file should be considered as consensus by default and the identity of all other sequences must be calculated according to the first sequence. If a user wants to calculate his own consensus sequence however then your calculation is right.

identity to AAAA [ 0.17, 0, 0.17, 0.33 ]
Here, you cannot calculate identity without having aligned this sequence to the other four. The second sequence could align to AAAA like this:
--AAAA
WKDAA
and so its identity to AAAA would be 0.40.

For the identity I ignore gaps.
yes, the formula is simply dividing the number of identical residues by the sequence length.

ic [ 0.81, 1, 0.81, 1, 1, 0.81 ]
how do you calculate the information content? Using the formula from the weblogo paper [1], ic at position 1 should be log2(20)-(-0.75_log2(0.75))=4.32-(-0.75_(-0.41))=4.02

scaled ic [ 0.41, 0.5, 0.41, 0.5, 0.5, 0.41 ]
how is this done and why?

conservation [ 1.19, 1, 1.19, 1, 1, 1.19 ]
what is this and what is the difference between conservation and information content?

[1] http://genome.cshlp.org/content/14/6/1188.long

wilzbach · 2014-12-02T12:47:05Z

identity to consensus [ 1, 1, 0.33, 0.33 ]
The first sequence in the input file should be considered as consensus by default and the identity of > all other sequences must be calculated according to the first sequence. If a user wants to calculate > his own consensus sequence however then your calculation is right.

Yup but this has nothing to do with the statistics package. The MSA viewer has the task of figuring out whether the user wants to calculate the consensus are already has a consensus sequence (hence two methods).

identity to AAAA [ 0.17, 0, 0.17, 0.33 ]
Here, you cannot calculate identity without having aligned this sequence to the other four. The > second sequence could align to AAAA like this:
--AAAA
WKDAA
and so its identity to AAAA would be 0.40.

I can't align sequences (too expensive) and therefore the identity calculation stupidly expects that your sequences are in an optimal alignment (which should be the case).

what is this and what is the difference between conservation and information content?

Information content (aka Entropy)

Conservation: Max. Information content - obs. information content

how do you calculate the information content? Using the formula from the weblogo paper [1], ic at position 1 should be log2(20)-(-0.75_log2(0.75))=4.32-(-0.75_(-0.41))=4.02

 { A: 0.75, '-': 0.25 }

- (0.75 * log2(0.75)) - (0.25 * log2(0.25))  = 0.81

(again I count the gaps as normal letters)

scaled ic [ 0.41, 0.5, 0.41, 0.5, 0.5, 0.41 ]
how is this done and why?

obs. ic / max. ic

It is a feature of the biojs-stat-seqs package and might be useful (e.g. barchart, ...)
Currently we only display the conservation per residue.

TatyanaGoldberg · 2014-12-02T16:27:55Z

Hi Seb,

Two issues here:

{ A: 0.75, '-': 0.25 }

(0.75 * log2(0.75)) - (0.25 * log2(0.25)) = 0.81

You should not consider gaps for calculating observed information content (i.e. entropy), as gap is not one of 4 for DNA/RNA or 20 for protein characters we are interested in the conservation of
The max. information content is always log2(4) for DNA/RNA and log2(20) for proteins

Tatyana

wilzbach · 2014-12-07T08:40:15Z

You should not consider gaps for calculating observed information content (i.e. entropy), as gap is not one of 4 for DNA/RNA or 20 for protein characters we are interested in the conservation of

I changed the implementation to be consistent with the one from Biopython. Ignored chars like "-" are also ignored in the frequency calculation, so that

{ A: 0.75, '-': 0.25 }
(1 * log2(1)) = 0

The max. information content is always log2(4) for DNA/RNA and log2(20) for proteins

Why is this an issue? The stat allows to enter any alphabet size.

TatyanaGoldberg · 2014-12-22T22:59:01Z

The sequence logo is not shown for the full length of the alignment. Example file: https://rostlab.org/~goldberg/biojs/ssGP/ssgp/B2014122194A560KL7I.4.ids.clustal.
Letter colors in seq logo do not always correspond to the colors in the MSA

wilzbach · 2014-12-28T00:51:31Z

Fixed 1 & 2
http://workmen.biojs.net/demo/biojs-vis-msa/ssgp

wilzbach · 2015-01-03T19:14:30Z

count gaps in the seq logo ...

wilzbach added the enhancement label Nov 19, 2014

wilzbach mentioned this issue Nov 19, 2014

integration with the sequence logo viewer #16

Closed

wilzbach added this to the 0.3 milestone Nov 19, 2014

wilzbach added the ready label Nov 30, 2014

TatyanaGoldberg added the usability label Dec 22, 2014

wilzbach closed this as completed May 22, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement sophisticated seq logo algorithms #94

Implement sophisticated seq logo algorithms #94

wilzbach commented Nov 19, 2014

goldbergtatyana commented Nov 19, 2014

wilzbach commented Nov 30, 2014

wilzbach commented Nov 30, 2014

TatyanaGoldberg commented Dec 1, 2014

wilzbach commented Dec 2, 2014

TatyanaGoldberg commented Dec 2, 2014

wilzbach commented Dec 2, 2014

TatyanaGoldberg commented Dec 2, 2014

wilzbach commented Dec 7, 2014

TatyanaGoldberg commented Dec 22, 2014

wilzbach commented Dec 28, 2014

wilzbach commented Jan 3, 2015