Skip to content
This repository has been archived by the owner on May 22, 2022. It is now read-only.

Implement sophisticated seq logo algorithms #94

Closed
wilzbach opened this issue Nov 19, 2014 · 12 comments
Closed

Implement sophisticated seq logo algorithms #94

wilzbach opened this issue Nov 19, 2014 · 12 comments

Comments

@wilzbach
Copy link
Owner

This issue is a follow-up to #16.

Unfortunately the seqlogo doesn't handle this natively.
(just as a reminder for myself)

original paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC332411/pdf/nar00204-0153.pdf

follow-ups:

http://bioinformatics.oxfordjournals.org/content/28/14/1935.long
http://schneider.ncifcrf.gov/paper/hawaii/latex/paper.pdf

@goldbergtatyana
Copy link
Collaborator

The most popular and highest cited seq logo implementation is WebLogo of Steven Brenner: http://weblogo.berkeley.edu/logo.cgi

Link to the paper: http://weblogo.berkeley.edu/Crooks-2004-GR-WebLogo.pdf

From: Seb
Sent: Wednesday, November 19, 2014 3:20 AM
To: greenify/biojs-vis-msa
Subject: [biojs-vis-msa] Implement sophisticated seq logo algorithms (#94)

This issue is a follow-up to #16.

Unfortunately the seqlogo doesn't handle this natively.
(just as a reminder for myself)

original paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC332411/pdf/nar00204-0153.pdf

follow-ups:

http://bioinformatics.oxfordjournals.org/content/28/14/1935.long
http://schneider.ncifcrf.gov/paper/hawaii/latex/paper.pdf


Reply to this email directly or view it on GitHub.

@wilzbach
Copy link
Owner Author

This is now done with the new biojs-stat-seq, globally available under <msaInstance>.g.stats

It would also support to use a custom background or calculate the background from the sequences.

@wilzbach wilzbach added the ready label Nov 30, 2014
@wilzbach
Copy link
Owner Author

BTW how do we handle gaps in the alignment?

  1. use the gaps in the calculation & filter them out afterwards (current solution)
  2. filter the gaps out before the calculation

@TatyanaGoldberg
Copy link
Collaborator

Hi Seb,

I don't see in https://github.com/greenify/biojs-stat-seqs the information on how you calculate the sequence conservation at each position in the alignment. If you havent done so, please use the formula from the WebLogo paper [1]. The height of each base (or amino acid) per side is then its frequency at a specific position times the sequence conservation at this position.

BTW how do we handle gaps in the alignment?

If we have an alignment:
AWKDAA
-WKDAA
AKKWDD
AKAWDA

Then the frequency of A at position 1 would be 0.75. The information content (i.e. the sequence conservation) would be 4 and the height of stack A then 3.

[1] http://genome.cshlp.org/content/14/6/1188.long

@wilzbach
Copy link
Owner Author

wilzbach commented Dec 2, 2014

how you calculate the sequence conservation at each position in the alignment.

https://github.com/greenify/biojs-stat-seqs/blob/master/lib/index.js#L216

If we have an alignment:

I also have written some tests.

trivia

maxLength 6
gaps [ 0.25, 0, 0, 0, 0, 0 ]
frequency [ { A: 0.75, '-': 0.25 },
  { W: 0.5, K: 0.5 },
  { K: 0.75, A: 0.25 },
  { D: 0.5, W: 0.5 },
  { A: 0.5, D: 0.5 },
  { A: 0.75, D: 0.25 } ]
consensus AWKDAA
identity to consensus [ 1, 1, 0.33, 0.33 ]
identity to AAAA [ 0.17, 0, 0.17, 0.33 ]

For the identity I ignore gaps.

ic

ic [ 0.81, 1, 0.81, 1, 1, 0.81 ]
scaled ic [ 0.41, 0.5, 0.41, 0.5, 0.5, 0.41 ]

conservation [ 1.19, 1, 1.19, 1, 1, 1.19 ]
conservation scaled [ 0.59, 0.5, 0.59, 0.5, 0.5, 0.59 ]

conservation per residue [ { A: 0.89, '-': 0.3 },
  { W: 0.5, K: 0.5 },
  { K: 0.89, A: 0.3 },
  { D: 0.5, W: 0.5 },
  { A: 0.5, D: 0.5 },
  { A: 0.89, D: 0.3 } ]
conservation per residue scaled [ { A: 0.45, '-': 0.15 },
  { W: 0.25, K: 0.25 },
  { K: 0.45, A: 0.15 },
  { D: 0.25, W: 0.25 },
  { A: 0.25, D: 0.25 },
  { A: 0.45, D: 0.15 } ]

For the sequence logo I draw "conservation per residue scaled" with just omitting the gaps.

Background

Frequency of the letters

background { A: 0.38, W: 0.17, K: 0.21, D: 0.21, '-': 0.04 }

We could offer to calculate the information content against a given background e.g. Uniprot or the alignment itself.

@TatyanaGoldberg
Copy link
Collaborator

Comments:

We have following sample sequence alignment:
AWKDAA
-WKDAA
AKKWDD
AKAWDA

maxLength 6
gaps [ 0.25, 0, 0, 0, 0, 0 ]
frequency [ { A: 0.75, '-': 0.25 },
{ W: 0.5, K: 0.5 },
{ K: 0.75, A: 0.25 },
{ D: 0.5, W: 0.5 },
{ A: 0.5, D: 0.5 },
{ A: 0.75, D: 0.25 } ]
consensus AWKDAA
good
identity to consensus [ 1, 1, 0.33, 0.33 ]
The first sequence in the input file should be considered as consensus by default and the identity of all other sequences must be calculated according to the first sequence. If a user wants to calculate his own consensus sequence however then your calculation is right.

identity to AAAA [ 0.17, 0, 0.17, 0.33 ]
Here, you cannot calculate identity without having aligned this sequence to the other four. The second sequence could align to AAAA like this:
--AAAA
WKDAA
and so its identity to AAAA would be 0.40.

For the identity I ignore gaps.
yes, the formula is simply dividing the number of identical residues by the sequence length.

ic [ 0.81, 1, 0.81, 1, 1, 0.81 ]
how do you calculate the information content? Using the formula from the weblogo paper [1], ic at position 1 should be log2(20)-(-0.75_log2(0.75))=4.32-(-0.75_(-0.41))=4.02

scaled ic [ 0.41, 0.5, 0.41, 0.5, 0.5, 0.41 ]
how is this done and why?

conservation [ 1.19, 1, 1.19, 1, 1, 1.19 ]
what is this and what is the difference between conservation and information content?

[1] http://genome.cshlp.org/content/14/6/1188.long

@wilzbach
Copy link
Owner Author

wilzbach commented Dec 2, 2014

identity to consensus [ 1, 1, 0.33, 0.33 ]
The first sequence in the input file should be considered as consensus by default and the identity of > all other sequences must be calculated according to the first sequence. If a user wants to calculate > his own consensus sequence however then your calculation is right.

Yup but this has nothing to do with the statistics package. The MSA viewer has the task of figuring out whether the user wants to calculate the consensus are already has a consensus sequence (hence two methods).

identity to AAAA [ 0.17, 0, 0.17, 0.33 ]
Here, you cannot calculate identity without having aligned this sequence to the other four. The > second sequence could align to AAAA like this:
--AAAA
WKDAA
and so its identity to AAAA would be 0.40.

I can't align sequences (too expensive) and therefore the identity calculation stupidly expects that your sequences are in an optimal alignment (which should be the case).

what is this and what is the difference between conservation and information content?

Information content (aka Entropy)

Information content

Conservation: Max. Information content - obs. information content

how do you calculate the information content? Using the formula from the weblogo paper [1], ic at position 1 should be log2(20)-(-0.75_log2(0.75))=4.32-(-0.75_(-0.41))=4.02

 { A: 0.75, '-': 0.25 }

- (0.75 * log2(0.75)) - (0.25 * log2(0.25))  = 0.81

(again I count the gaps as normal letters)

scaled ic [ 0.41, 0.5, 0.41, 0.5, 0.5, 0.41 ]
how is this done and why?

obs. ic / max. ic

It is a feature of the biojs-stat-seqs package and might be useful (e.g. barchart, ...)
Currently we only display the conservation per residue.

@TatyanaGoldberg
Copy link
Collaborator

Hi Seb,

Two issues here:

{ A: 0.75, '-': 0.25 }

  • (0.75 * log2(0.75)) - (0.25 * log2(0.25)) = 0.81
  1. You should not consider gaps for calculating observed information content (i.e. entropy), as gap is not one of 4 for DNA/RNA or 20 for protein characters we are interested in the conservation of
  2. The max. information content is always log2(4) for DNA/RNA and log2(20) for proteins

Tatyana

@wilzbach
Copy link
Owner Author

wilzbach commented Dec 7, 2014

You should not consider gaps for calculating observed information content (i.e. entropy), as gap is not one of 4 for DNA/RNA or 20 for protein characters we are interested in the conservation of

I changed the implementation to be consistent with the one from Biopython. Ignored chars like "-" are also ignored in the frequency calculation, so that

{ A: 0.75, '-': 0.25 }
(1 * log2(1)) = 0

The max. information content is always log2(4) for DNA/RNA and log2(20) for proteins

Why is this an issue? The stat allows to enter any alphabet size.

@TatyanaGoldberg
Copy link
Collaborator

  1. The sequence logo is not shown for the full length of the alignment. Example file: https://rostlab.org/~goldberg/biojs/ssGP/ssgp/B2014122194A560KL7I.4.ids.clustal.
  2. Letter colors in seq logo do not always correspond to the colors in the MSA

@wilzbach
Copy link
Owner Author

@wilzbach
Copy link
Owner Author

wilzbach commented Jan 3, 2015

count gaps in the seq logo ...

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants