Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add Alignment.heatmap method #816

Closed
wants to merge 25 commits into from
Closed
Show file tree
Hide file tree
Changes from 23 commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
426a3f6
Added heatmap plotting method to Alignment (and tests).
kestrelgorlick Nov 25, 2014
c6c8d6e
Addresses @Jorge-C's comment regarding matplotlib.pyplot import.
kestrelgorlick Nov 25, 2014
5f9d989
Added my alignment heatmap method to the changelog.
kestrelgorlick Nov 25, 2014
5f3b0b8
Fixes Travis plotting errors.
kestrelgorlick Nov 25, 2014
cf79a1f
Cleaned up docstring.
kestrelgorlick Nov 25, 2014
fc6015c
Addresses @jairideout's comments.
kestrelgorlick Dec 3, 2014
d51d603
Fixes Travis errors.
kestrelgorlick Dec 3, 2014
7f52bcb
Fixes Travis error.
kestrelgorlick Dec 4, 2014
2e5b21b
Cleaned up tests and added a few lines to _alignment.py .
kestrelgorlick Dec 5, 2014
42989b7
Addresses @jairideout's comments
kestrelgorlick Dec 14, 2014
6ab20c0
Fixed Travis error
kestrelgorlick Dec 16, 2014
c29d54f
Merge branch 'master' into issue-765
kestrelgorlick Dec 16, 2014
1cf2e9f
Addresses @jairideout's comments
kestrelgorlick Jan 9, 2015
6fa40a8
Adds tests in order to fix coverage loss.
kestrelgorlick Jan 13, 2015
b50ecc0
Merge branch 'issue-765' of https://github.com/Kleptobismol/scikit-bi…
jairideout Feb 9, 2015
c8e83f4
DOC: move Alignment.heatmap changelog mention under 0.2.2-dev heading
jairideout Feb 9, 2015
0e2aabd
DOC: minor cleanup to Alignment.heatmap docstring
jairideout Feb 9, 2015
89f45fc
BUG/MAINT/TST: code cleanup, refactoring, a few bug fixes, unit test …
jairideout Feb 11, 2015
a7b9094
BUG: fix Python 3 error resulting from incorrect creation of numpy array
jairideout Feb 11, 2015
2683103
Merge branch 'master' into aln-heatmap
jairideout Feb 16, 2015
a97665a
MAINT: stop forcing agg backend
jairideout Feb 16, 2015
6e58588
Merge branch 'master' into aln-heatmap
jairideout Feb 20, 2015
246b15f
BUG/TST: add back missing npt import
jairideout Feb 20, 2015
828925a
Merge branch 'master' into aln-heatmap
jairideout Mar 5, 2015
cc68f7c
ENH: default to no legend labels; report stats in legend labels
jairideout Mar 5, 2015
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Expand Up @@ -3,6 +3,7 @@
## Version 0.2.3-dev (changes since 0.2.3 release go here)

### Features
* Added ``heatmap`` method to ``skbio.alignment.Alignment`` for creating a heatmap from an alignment ([#765](https://github.com/biocore/scikit-bio/issues/765)).

### Bug fixes
* Changed `BiologicalSequence.distance` to raise an error any time two sequences are passed of different lengths regardless of the `distance_fn` being passed. [(#514)](https://github.com/biocore/scikit-bio/issues/514)
Expand Down
189 changes: 189 additions & 0 deletions skbio/alignment/_alignment.py
Expand Up @@ -15,6 +15,7 @@

import numpy as np
from scipy.stats import entropy
import matplotlib.pyplot as plt

from skbio._base import SkbioObject
from skbio.sequence import BiologicalSequence
Expand Down Expand Up @@ -1405,6 +1406,194 @@ def sequence_length(self):
else:
return len(self._data[0])

def heatmap(self, value_map,
legend_labels=('Minimum', 'Median', 'Maximum'), fig_size=None,
cmap=None, sequence_order=None):
"""Plot alignment as a heatmap.

Plot the alignment as a heatmap, coloring heatmap cells using the
character mapping in `value_map`. The x-axis is labeled by majority
consensus, and the y-axis is labeled by sequence IDs (first, last, and
every third sequence ID are displayed).

Parameters
----------
value_map : dict, collections.defaultdict
Dictionary mapping characters in the alignment to numeric values.
``KeyErrors`` are not caught, so all possible values should be in
`value_map`, or it should be a ``collections.defaultdict`` which
can, for example, default to ``nan``.
legend_labels : iterable, optional
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gregcaporaso does the behavior described for legend_labels make sense?

In your original cookbook implementation, all of the values in value_map were used to compute the minimum, median, and maximum. This could produce strange behavior if a value in value_map wasn't used in the heatmap. For example, if value_map had an unused mapping to a maximum value, the "maximum" label wouldn't show up in the legend because a different (smaller) maximum value was used in the heatmap/colorbar.

Also, if a defaultdict was supplied, mappings using the default value weren't being considered when computing min/median/max. For example, suppose the defaultdict had a default value that happened to be the true maximum, the "maximum" label would be placed in the wrong spot.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jairideout, that makes sense and I see that it has to be that way now that you bring it up. But, I wonder if we should include the values in the legend label as well in this case. So, next to the label for the minimum, you'd always include the min value in parenthesis, and same for mean and max. Otherwise it could be misleading if your labels were *hydrophilic", "medium", and "hydrophobic", and then there were no "hydrophobic" residues in the alignment (I know, very unlikely, but just working from the cookbook example).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea, I'll add that in. Since these labels are pretty specialized, I think a better default is to not include these labels on the legend at all (i.e., legend_labels=None). If someone wants to mark the min/median/max, then they have the option to, but then we're not forcing users to label the legend this way. Does that make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, good idea.

Labels for the minimum, median, and maximum values in the legend.
Must be an iterable of exactly three strings. Mininum, median, and
maximum values are computed from the values in `value_map`. Only
values that are actually used to perform a character-to-value
mapping are included when computing the minimum, median, and
maximum. ``nan`` values are ignored.
fig_size : tuple, optional
Size of the figure in inches. If ``None``, defaults to figure size
specified in the user's matplotlibrc file.
cmap : matplotlib colormap, optional
See here for choices:
http://matplotlib.org/examples/color/colormaps_reference.html
If ``None``, defaults to the colormap specified in the user's
matplotlibrc file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the default if it's not specified in the user's matplotlibrc?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With any luck, we'll have a better default in the future as mpl is discussing a change to the default color map (matplotlib/matplotlib#875). Anyone want to weigh in on that discussion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fall back to whatever matplotlib's default is (there is always a "base" config file included in a matplotlib install, similar to how QIIME config files work). I don't think listing this here is appropriate because matplotlib's defaults may change and we won't want to update these docs when that happens.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, agree.

sequence_order : iterable, optional
The order, from top-to-bottom, that the sequences should be
plotted in. Must be an iterable containing all sequence IDs in the
alignment. If ``None``, sequences are plotted as they are ordered
in the alignment.

Returns
-------
matplotlib.figure.Figure
Figure containing the heatmap and legend of the plotted alignment.

Raises
------
KeyError
If a sequence character in the alignment is not in `value_map` and
`value_map` is not a ``collections.defaultdict``.
ValueError
If `sequence_order` isn't the same set of sequence IDs in the
alignment, or if `sequence_order` contains duplicate sequence IDs.

See Also
--------
majority_consensus

References
----------
.. [1] http://www.genome.jp/aaindex/
.. [2] http://www.genome.jp/dbget-bin/www_bget?aaindex:ARGP820101

Examples
--------
.. plot::

Let's visualize an alignment of protein sequences as a heatmap, with
amino acids colored by their hydrophobicity.

Load the AAIndex database [1]_ hydrophobicity index values [2]_:

>>> from collections import defaultdict
>>> import numpy as np
>>> hydrophobicity_idx = defaultdict(lambda: np.nan)
>>> hydrophobicity_idx.update({'A': 0.61, 'L': 1.53, 'R': 0.60,
... 'K': 1.15, 'N': 0.06, 'M': 1.18,
... 'D': 0.46, 'F': 2.02, 'C': 1.07,
... 'P': 1.95, 'Q': 0., 'S': 0.05,
... 'E': 0.47, 'T': 0.05, 'G': 0.07,
... 'W': 2.65, 'H': 0.61, 'Y': 1.88,
... 'I': 2.22, 'V': 1.32})

Define an alignment of protein sequences:

>>> from skbio import Alignment, Protein
>>> sequences = [Protein('VHLTPEEKSAVTALWGKVNVDEV--', id="seq1"),
... Protein('-GLSDGEWQLVLKVWGKVEGDLPGH', id="seq2"),
... Protein('-GLSDGEWQLVLKVWGKVETDITGH', id="seq3"),
... Protein('-VLSEGEWQLVLHVWAKVEADIAGH', id="seq4")]
>>> aln = Alignment(sequences)

Plot the alignment as a heatmap:

>>> fig = aln.heatmap(hydrophobicity_idx, fig_size=(15, 10),
... legend_labels=['Hydrophilic', 'Medium',
... 'Hydrophobic'],
... cmap='Greens')

"""
if len(legend_labels) != 3:
raise ValueError(
"`legend_labels` must contain exactly three labels.")

if sequence_order is not None:
sequence_order_set = set(sequence_order)
if len(sequence_order) != len(sequence_order_set):
raise ValueError(
"Found duplicate sequence ID(s) in `sequence_order`.")
if sequence_order_set != set(self.ids()):
raise ValueError(
"`sequence_order` must contain the same set of sequence "
"IDs in the alignment.")
else:
sequence_order = self.ids()

mtx, min_val, median_val, max_val = self._alignment_to_heatmap_matrix(
value_map, sequence_order)

# heatmap code derived from the matplotlib gallery:
# http://matplotlib.org/examples/pylab_examples/
# colorbar_tick_labelling_demo.html
fig, ax = plt.subplots()
if fig_size is not None:
fig.set_size_inches(fig_size)

cax = ax.imshow(mtx, interpolation='nearest', cmap=cmap)

sequence_count = self.sequence_count()
y_ticks = [0] + list(range(3, sequence_count - 3, 3))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be changed to list(range(0, sequence_count - 3, 3)). Also can you explain why we subtract three from the count with a step of 3? It is not clear why we do this.

if sequence_count > 1:
# we have more than one sequence, so add the index of the last
# sequence
y_ticks += [sequence_count - 1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In regard to the above, this will exclude the last "step-able" sequence and then add only the last sequence. i.e.
sequence_count = 15
y_ticks = [0, 3, 6, 9, 14] not y_ticks = [0, 3, 6, 9, 12, 14] which may have been the intended behaviour.

y_tick_labels = [sequence_order[tick] for tick in y_ticks]
ax.set_yticks(y_ticks)
ax.set_yticklabels(y_tick_labels)

ax.set_xticks(range(self.sequence_length()))
ax.set_xticklabels(self.majority_consensus(), size=7)

# add colorbar legend
cbar = fig.colorbar(cax, ticks=[min_val, median_val, max_val],
orientation='horizontal')
cbar.ax.set_xticklabels(legend_labels)

return fig

def _alignment_to_heatmap_matrix(self, value_map, sequence_order):
"""Convert alignment to numeric matrix for heatmap plotting."""
if self.is_empty():
raise AlignmentError(
"Cannot create heatmap from an empty alignment.")

sequence_length = self.sequence_length()
if sequence_length < 1:
raise AlignmentError(
"Cannot create heatmap from an alignment without any "
"positions.")

# track the values that are actually used -- this can differ from the
# values in `value_map` (e.g., if `value_map` is a superset or a
# defaultdict). we want to use these values to compute the min, median,
# and max for the colorbar legend labels
values = {}
mtx = np.empty([self.sequence_count(), sequence_length])
for row_idx, sequence_id in enumerate(sequence_order):
seq = str(self[sequence_id])
for col_idx in range(sequence_length):
char = seq[col_idx]
value = value_map[char]
mtx[row_idx][col_idx] = value
values[char] = value

# remove nans as this can mess up the median calculation below.
# numpy.nanmedian was added in 1.9.0, so consider updating this code
# if/when we start supporting numpy >= 1.9.0. scipy.stats.nanmedian is
# deprecated so we can't use that either.
#
# modified from http://stackoverflow.com/a/26475626/3776794
#
# note: we need to explicitly create a list from the dict's values in
# order to have this work in Python 3. Using values.values() directly,
# or with future.utils.viewvalues, creates a numpy array of object
# dtype, which makes np.isnan fail.
values = np.asarray(list(values.values()))
values = values[~np.isnan(values)]

return mtx, min(values), np.median(values), max(values)

def _validate_lengths(self):
"""Return ``True`` if all sequences same length, ``False`` otherwise
"""
Expand Down
128 changes: 128 additions & 0 deletions skbio/alignment/tests/test_alignment.py
Expand Up @@ -17,6 +17,7 @@
import tempfile

import numpy as np
import numpy.testing as npt
from scipy.spatial.distance import hamming

from skbio import (BiologicalSequence, DNASequence, RNASequence, DNA, RNA,
Expand Down Expand Up @@ -753,6 +754,133 @@ def test_sequence_length(self):
self.assertEqual(self.a2.sequence_length(), 5)
self.assertEqual(self.empty.sequence_length(), 0)

def check_heatmap_sanity(self, fig, exp_x_tick_labels, exp_y_tick_labels,
exp_legend_labels):
"""Helper method for testing basic heatmap figure properties."""
axes = fig.get_axes()
self.assertEqual(len(axes), 2)

ax, axc = axes

x_tick_labels = [e.get_text() for e in ax.get_xticklabels()]
self.assertEqual(x_tick_labels, exp_x_tick_labels)

y_tick_labels = [e.get_text() for e in ax.get_yticklabels()]
self.assertEqual(y_tick_labels, exp_y_tick_labels)

legend_labels = [e.get_text() for e in axc.get_xticklabels()]
self.assertEqual(legend_labels, exp_legend_labels)

def test_heatmap_empty(self):
# no seqs
aln = Alignment([])
value_map = {'A': 0.1, 'C': 1.5, 'U': -0.42, 'G': 0.55, 'T': 99}
with self.assertRaises(AlignmentError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test the message

aln.heatmap(value_map)

# no positions
aln = Alignment([DNA(''), RNA('', id="s2")])
with self.assertRaises(AlignmentError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test the message

aln.heatmap(value_map)

def test_heatmap_1_seq(self):
aln = Alignment([RNA('AACGU', id="seq1")])
value_map = {'A': 0.1, 'C': 1.5, 'U': -0.42, 'G': 0.55, 'T': 99}
fig = aln.heatmap(value_map)

self.check_heatmap_sanity(
fig, ['A', 'A', 'C', 'G', 'U'], ['seq1'],
['Minimum', 'Median', 'Maximum'])

def test_heatmap_2_seqs(self):
aln = Alignment([RNA('AACGU', id="seq1"),
RNA('AACGU', id="seq2")])
value_map = {'A': 0.1, 'C': 1.5, 'U': -0.42, 'G': 0.55, 'T': 99}
fig = aln.heatmap(value_map)

self.check_heatmap_sanity(
fig, ['A', 'A', 'C', 'G', 'U'], ['seq1', 'seq2'],
['Minimum', 'Median', 'Maximum'])

def test_heatmap_3_seqs(self):
aln = Alignment([DNA('AACCCGT', id="seq1"),
DNA('ACCCGGT', id="seq2"),
DNA('ACCCGGT', id="seq3")])
value_map = {'A': 0.61, 'C': 1.07, 'T': 0.05, 'G': 0.07}
fig = aln.heatmap(value_map)

self.check_heatmap_sanity(
fig, ['A', 'C', 'C', 'C', 'G', 'G', 'T'], ['seq1', 'seq3'],
['Minimum', 'Median', 'Maximum'])

def test_heatmap_with_non_defaults(self):
aln = Alignment([DNA('AGTCGGT', id="seq1"),
DNA('CAACGGA', id="seq2"),
DNA('AACCTCT', id="seq3"),
DNA('TACTCGT', id="seq4")])
value_map = {'A': 0.61, 'C': 1.07, 'T': 0.05, 'G': 0.07}
labels = ['a', 'b', 'c']
fig = aln.heatmap(
value_map, legend_labels=labels, fig_size=(42, 22), cmap='Blues',
sequence_order=('seq4', 'seq3', 'seq1', 'seq2'))

self.check_heatmap_sanity(fig, ['A', 'A', 'C', 'C', 'G', 'G', 'T'],
['seq4', 'seq2'], labels)
self.assertEqual(fig.get_figwidth(), 42.0)
self.assertEqual(fig.get_figheight(), 22.0)

def test_heatmap_invalid_legend_labels(self):
with self.assertRaises(ValueError):
self.a1.heatmap({}, legend_labels=['a', 'b', 'c', 'd'])

def test_heatmap_invalid_sequence_order(self):
# duplicate ids
with self.assertRaises(ValueError):
self.a1.heatmap({}, sequence_order=['d1', 'd2', 'd1'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you assert some basic verbs in the error message exist to ensure that the correct error is being raised?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent idea!


# provided set of ids doesn't match alignment's set of ids
with self.assertRaises(ValueError):
self.a1.heatmap({}, sequence_order=['d2', 'd3', 'd0'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above


def test_heatmap_missing_character_in_value_map(self):
with self.assertRaises(KeyError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test the message

self.a1.heatmap({})

def test_alignment_to_heatmap_matrix(self):
aln = Alignment([DNA('ACTG', id='d1'),
DNA('A.-G', id='d2'),
DNA('TC-G', id='d3')])
value_map = defaultdict(lambda: np.nan)
value_map.update({'A': 42, 'C': 10.5, 'T': 22.1, 'G': -7.789,
'U': -999.9, 'Z': 42000})
exp_min = -7.789
exp_median = 16.3
exp_max = 42

# sequence order is same as what's in the alignment
mtx, min_val, median_val, max_val = \
aln._alignment_to_heatmap_matrix(value_map, ['d1', 'd2', 'd3'])

exp_mtx = np.array([[42., 10.5, 22.1, -7.789],
[42., np.nan, np.nan, -7.789],
[22.1, 10.5, np.nan, -7.789]])
npt.assert_array_equal(mtx, exp_mtx)
self.assertEqual(min_val, exp_min)
self.assertAlmostEqual(median_val, exp_median)
self.assertEqual(max_val, exp_max)

# sequence order is different from what's in the alignment
mtx, min_val, median_val, max_val = \
aln._alignment_to_heatmap_matrix(value_map, ['d3', 'd1', 'd2'])

exp_mtx = np.array([[22.1, 10.5, np.nan, -7.789],
[42., 10.5, 22.1, -7.789],
[42., np.nan, np.nan, -7.789]])
npt.assert_array_equal(mtx, exp_mtx)
self.assertEqual(min_val, exp_min)
self.assertAlmostEqual(median_val, exp_median)
self.assertEqual(max_val, exp_max)

def test_validate_lengths(self):
self.assertTrue(self.a1._validate_lengths())
self.assertTrue(self.a2._validate_lengths())
Expand Down