Add UMAP visualization support #92

Merged
merged 3 commits into from Feb 22, 2018
Changes from 2 commits
24 changes: 14 additions & 10 deletions docs/references.rst
@@ -3,7 +3,7 @@

<style type="text/css">
.label {
}
Collaborator Author


Sorry about removing whitespace. I just couldn't resist 🙈

Member


sure! there should be no whitespace here. maybe a relic from when this was all markdown and three whitespaces meant a line break

}
</style>

References
@@ -19,7 +19,7 @@ References

.. [Blondel08] Blondel *et al.* (2008),
*Fast unfolding of communities in large networks*,
`J. Stat. Mech. <https://doi.org/10.1088/1742-5468/2008/10/P10008>`__.

.. [Coifman05] Coifman *et al.* (2005),
*Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps*,
@@ -28,7 +28,7 @@ References
.. [Csardi06] Csardi *et al.* (2006),
*The igraph software package for complex network research*,
`InterJournal Complex Systems <http://igraph.org>`__.

.. [Ester96] Ester *et al.* (1996),
*A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise*,
`Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining,
@@ -37,7 +37,7 @@ References
.. [Eulenberg17] Eulenberg *et al.* (2017),
*Reconstructing cell cycle and disease progression using deep learning*,
`Nat. Comms., accepted <https://doi.org/10.1101/081364>`__.

.. [Fruchterman91] Fruchterman & Reingold (1991),
*Graph drawing by force-directed placement*,
`Software: Practice & Experience <http://doi.org/10.1002/spe.4380211102>`__.
@@ -49,7 +49,7 @@ References
.. [Hastie09]
Hastie *et al.* (2009),
*The Elements of Statistical Learning*,
`Springer <https://web.stanford.edu/~hastie/ElemStatLearn/>`_.

.. [Haghverdi15] Haghverdi *et al.* (2015),
*Diffusion maps for high-dimensional single-cell analysis of differentiation data*,
@@ -74,7 +74,7 @@ References
.. [Levine15] Levine *et al.* (2015),
*Data-Driven Phenotypic Dissection of AML Reveals Progenitor--like Cells that Correlate with Prognosis*,
`Cell <https://doi.org/10.1016/j.cell.2015.05.047>`__.

.. [Maaten08] Maaten & Hinton (2008),
*Visualizing data using t-SNE*,
`JMLR <http://www.jmlr.org/papers/v9/vandermaaten08a.html>`__.
@@ -83,14 +83,18 @@ References
*Spatial reconstruction of single-cell gene expression data*,
`Nature Biotechnology <https://doi.org/10.1038/nbt.3192>`__.

.. [McInnes18] McInnes & Healy (2018),
*UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction*,
`arXiv <https://arxiv.org/abs/1802.03426>`__.

.. [Moignard15] Moignard *et al.* (2015),
*Decoding the regulatory network of early blood development from single-cell gene expression measurements*,
`Nature Biotechnology <https://doi.org/10.1038/nbt.3154>`__.

.. [Murphy12]
Murphy (2012),
*Machine Learning: A Probabilistic Perspective*,
`MIT Press <https://mitpress.mit.edu/books/machine-learning-0>`_.

.. [Pedregosa11] Pedregosa *et al.* (2011),
*Scikit-learn: Machine Learning in Python*,
@@ -103,7 +107,7 @@ References
.. [Traag17] Traag (2017),
*Louvain*,
`GitHub <https://doi.org/10.5281/zenodo.35117>`__.

.. [Ulyanov16] Ulyanov (2016),
*Multicore t-SNE*,
`GitHub <https://github.com/DmitryUlyanov/Multicore-TSNE>`__.
@@ -119,15 +123,15 @@ References
.. [Waskom16] Waskom *et al.* (2017),
*Seaborn*,
`Zenodo <https://doi.org/10.5281/zenodo.54844>`__.

.. [Wolf17] Wolf *et al.* (2018),
*Scanpy: large-scale single-cell gene expression data analysis*,
`Genome Biology <https://doi.org/10.1186/s13059-017-1382-0>`_.

.. [Wolf17i] Wolf *et al.* (2017),
*Graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells*,
`bioRxiv <https://doi.org/10.1101/208819>`__.

.. [Zheng17] Zheng *et al.* (2017),
*Massively parallel digital transcriptional profiling of single cells*,
`Nature Communications <https://doi.org/10.1038/ncomms14049>`__.
1 change: 1 addition & 0 deletions requires.txt
@@ -25,3 +25,4 @@ joblib
profilehooks
# extensions (also needed if installing cython compiled .c files)
# cython
# umap-learn # optional dependency for umap visualization
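Since umap-learn stays commented out here as an optional dependency, callers may want to check for it at runtime before offering UMAP. A minimal sketch using only the standard library; the helper names `is_installed` and `require` are hypothetical and not part of this PR, and `umap` is the module name that umap-learn installs:

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def require(module_name, hint):
    """Raise ImportError with an installation hint if the module is missing."""
    if not is_installed(module_name):
        raise ImportError(
            '{} is required but was not found. {}'.format(module_name, hint))
```

For example, `require('umap', 'See https://github.com/lmcinnes/umap#installing')` would fail early with a readable message when umap-learn is absent.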
1 change: 1 addition & 0 deletions scanpy/api/tl.py
@@ -1,5 +1,6 @@
from ..tools.pca import pca
from ..tools.tsne import tsne
from ..tools.umap import umap
from ..tools.diffmap import diffmap
from ..tools.draw_graph import draw_graph

1 change: 1 addition & 0 deletions scanpy/plotting/__init__.py
@@ -6,6 +6,7 @@
from .tools import diffmap
from .tools import draw_graph
from .tools import tsne
from .tools import umap
from .tools import aga, aga_attachedness, aga_graph, aga_path, aga_scatter
from .tools import dpt, dpt_scatter, dpt_groups_pseudotime, dpt_timeseries
from .tools import louvain
3 changes: 2 additions & 1 deletion scanpy/plotting/anndata.py
@@ -66,7 +66,7 @@ def scatter(
sort_order : `bool`, optional (default: `True`)
For continuous annotations used as color parameter, plot data points
with higher values on top of others.
basis : {'pca', 'tsne', 'diffmap', 'draw_graph_fr', etc.}
basis : {'pca', 'tsne', 'umap', 'diffmap', 'draw_graph_fr', etc.}
String that denotes a plotting tool that computed coordinates.
groups : str, optional (default: all groups in color)
Allows to restrict categories in sample annotation to a subset.
@@ -153,6 +153,7 @@ def scatter(
component_name = ('DC' if basis == 'diffmap'
else basis.replace('draw_graph_', '').upper() if 'draw_graph' in basis
else 'tSNE' if basis == 'tsne'
else 'UMAP' if basis == 'umap'
else 'PC' if basis == 'pca'
else 'Spring' if basis == 'spring'
else None)
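The conditional chain above grows by one branch per embedding. As a possible alternative (not what this PR does), the same mapping could be sketched as a lookup table; `AXIS_PREFIX` and the function wrapper `component_name` are hypothetical names introduced for illustration:

```python
# Hypothetical lookup table mirroring the conditional chain in `scatter`.
AXIS_PREFIX = {
    'diffmap': 'DC',
    'tsne': 'tSNE',
    'umap': 'UMAP',
    'pca': 'PC',
    'spring': 'Spring',
}


def component_name(basis):
    """Return the axis-label prefix for a basis, or None if it is unknown."""
    # draw_graph layouts carry their layout name as a suffix, e.g. 'draw_graph_fr'.
    if 'draw_graph' in basis:
        return basis.replace('draw_graph_', '').upper()
    return AXIS_PREFIX.get(basis)
```

With a table like this, supporting a new basis would be a one-line dictionary entry rather than another `else ... if` branch.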
91 changes: 91 additions & 0 deletions scanpy/plotting/tools.py
@@ -458,6 +458,97 @@ def tsne(
if show == False: return axs


def umap(
adata,
color=None,
use_raw=True,
sort_order=True,
alpha=None,
groups=None,
components=None,
projection='2d',
legend_loc='right margin',
legend_fontsize=None,
legend_fontweight=None,
color_map=None,
palette=None,
right_margin=None,
size=None,
title=None,
show=None,
save=None, ax=None):
"""Scatter plot in UMAP basis.

Parameters
----------
adata : AnnData
Annotated data matrix.
color : string or list of strings, optional (default: None)
Keys for sample/cell annotation either as list `["ann1", "ann2"]` or
string `"ann1,ann2,..."`.
use_raw : `bool`, optional (default: `True`)
Use `raw` attribute of `adata` if present.
sort_order : `bool`, optional (default: `True`)
For continuous annotations used as color parameter, plot data points
with higher values on top of others.
groups : str, optional (default: all groups)
Restrict to a few categories in categorical sample annotation.
components : str or list of str, optional (default: '1,2')
String of the form '1,2' or ['1,2', '2,3'].
projection : {'2d', '3d'}, optional (default: '2d')
Projection of plot.
legend_loc : str, optional (default: 'right margin')
Location of legend, either 'on data', 'right margin' or valid keywords
for matplotlib.legend.
legend_fontsize : int (default: None)
Legend font size.
color_map : str (default: `matplotlib.rcParams['image.cmap']`)
String denoting matplotlib color map.
palette : list of str (default: None)
Colors to use for plotting groups (categorical annotation).
right_margin : float or list of floats (default: None)
Adjust the width of the space right of each plotting panel.
size : float (default: None)
Point size.
title : str, optional (default: None)
Provide title for panels either as `["title1", "title2", ...]` or
`"title1,title2,..."`.
show : bool, optional (default: None)
Show the plot, do not return axis.
save : `bool` or `str`, optional (default: `None`)
If `True` or a `str`, save the figure. A string is appended to the
default filename. Infer the filetype if ending on \{'.pdf', '.png', '.svg'\}.
ax : matplotlib.Axes
A matplotlib axes object.

Returns
-------
`matplotlib.Axes` object if `show == False`, otherwise `None`.
"""
axs = scatter(
adata,
basis='umap',
color=color,
use_raw=use_raw,
sort_order=sort_order,
alpha=alpha,
groups=groups,
components=components,
projection=projection,
legend_loc=legend_loc,
legend_fontsize=legend_fontsize,
legend_fontweight=legend_fontweight,
color_map=color_map,
palette=palette,
right_margin=right_margin,
size=size,
title=title,
show=show,
save=save,
ax=ax)
if show == False: return axs
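The new `umap` plotting function is a thin wrapper that pins `basis='umap'` and forwards everything else to `scatter`, just like `tsne` above it. A rough sketch of that forwarding pattern, using `functools.partial` and a stand-in `scatter` that only records its arguments (the stand-in is hypothetical; the real `scatter` draws a figure):

```python
from functools import partial


def scatter(adata, basis, color=None, show=None, **kwargs):
    """Stand-in for the real `scatter`; it only records what it was asked to plot."""
    axs = {'basis': basis, 'color': color, 'n_points': len(adata)}
    # Same convention as in the diff: hand back the axes only when show is False.
    if show is False:
        return axs


# One thin wrapper per embedding, each pinning `basis`.
tsne = partial(scatter, basis='tsne')
umap = partial(scatter, basis='umap')
```

The PR spells the wrapper out by hand instead, which is longer but gives `umap` its own signature and docstring; a `partial` object would not show those in the API docs.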


# ------------------------------------------------------------------------------
# Subgroup identification and ordering - clustering, pseudotime, branching
# and tree inference tools
140 changes: 140 additions & 0 deletions scanpy/tools/umap.py
@@ -0,0 +1,140 @@
import numpy as np
from ..tools.pca import pca
from .. import settings
from .. import logging as logg


def umap(adata,
n_neighbors=15,
n_components=2,
min_dist=0.1,
metric='euclidean',
alpha=1.0,
init='spectral',
local_connectivity=1.0,
random_state=None,
copy=False,
umap_kwargs={}):
"""UMAP [McInnes18]_.

UMAP (Uniform Manifold Approximation and Projection) is a novel manifold
learning technique for dimension reduction which is suitable for
visualization of high dimensional single cell data. It is competitive
with tSNE, yet it is faster and arguably preserves more of the global structure.
We use the implementation of *umap-learn* [McInnes18]_.

Parameters
----------
adata : `~scanpy.api.AnnData`
Annotated data matrix.
n_neighbors: `int`, optional (default: 15)
The size of local neighborhood (in terms of number of neighboring
sample points) used for manifold approximation. Larger values
result in more global views of the manifold, while smaller
values result in more local data being preserved. In general
values should be in the range 2 to 100.
n_components: `int`, optional (default: 2)
The dimension of the space to embed into. This defaults to 2 to
provide easy visualization, but can reasonably be set to any
integer value in the range 2 to 100.
min_dist: `float`, optional (default: 0.1)
The effective minimum distance between embedded points. Smaller values
will result in a more clustered/clumped embedding where nearby points
on the manifold are drawn closer together, while larger values will
result in a more even dispersal of points. The value should be set
relative to the ``spread`` value, which determines the scale at which
embedded points will be spread out.
metric: `string` or `function`, optional (default: 'euclidean')
The metric to use to compute distances in high dimensional space.
If a string is passed it must match a valid predefined metric. If
a general metric is required a function that takes two 1d arrays and
returns a float can be provided. For performance purposes it is
required that this be a numba jit'd function. Valid string metrics
include:
* euclidean
* manhattan
* chebyshev
* minkowski
* canberra
* braycurtis
* mahalanobis
* wminkowski
* seuclidean
* cosine
* correlation
* haversine
* hamming
* jaccard
* dice
* russelrao
* kulsinski
* rogerstanimoto
* sokalmichener
* sokalsneath
* yule
Metrics that take arguments (such as minkowski, mahalanobis etc.)
can have arguments passed via the metric_kwds dictionary. At this
time care must be taken and dictionary elements must be ordered
appropriately; this will hopefully be fixed in the future.
alpha: `float`, optional (default: 1.0)
The initial learning rate for the embedding optimization.
init: `string` or `np.ndarray`, optional (default: 'spectral')
How to initialize the low dimensional embedding. Options are:
* 'spectral': use a spectral embedding of the fuzzy 1-skeleton
* 'random': assign initial embedding positions at random.
* A numpy array of initial embedding positions.
local_connectivity: `int`, optional (default: 1)
The local connectivity required -- i.e. the number of nearest
neighbors that should be assumed to be connected at a local level.
The higher this value the more connected the manifold becomes
locally. In practice this should be not more than the local intrinsic
dimension of the manifold.
random_state: `int`, `RandomState instance` or `None`, optional (default: None)
If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used
by `np.random`.
copy: `bool`, optional (default: `False`)
Return a copy of `adata` instead of updating it in place.
umap_kwargs: `dict`, optional (default: {})
Additional keyword arguments for UMAP class constructor from umap-learn
package.

Returns
-------
Depending on `copy`, returns or updates `adata` with the following fields.

X_umap : `np.ndarray` (`adata.obsm`, dtype `float`)
UMAP coordinates of data.
"""
    try:
        import umap
    except ImportError:
        # Fail early: logging alone would let execution continue and
        # break at `umap.UMAP` below.
        raise ImportError(
            'UMAP visualization requires the umap-learn package, which was '
            'not found. Follow the instructions on GitHub to install '
            'umap-learn: https://github.com/lmcinnes/umap#installing')

    logg.info('computing UMAP', r=True)
    adata = adata.copy() if copy else adata

    # params for umap-learn; entries in umap_kwargs override the ones above
    params_umap = {'n_neighbors': n_neighbors,
                   'n_components': n_components,
                   'min_dist': min_dist,
                   'metric': metric,
                   'alpha': alpha,
                   'init': init,
                   'local_connectivity': local_connectivity,
                   'random_state': random_state,
                   'verbose': max(0, settings.verbosity-3),
                   **umap_kwargs
                   }

    um = umap.UMAP(**params_umap)
    X_umap = um.fit_transform(adata.X)

    # update AnnData instance
    adata.obsm['X_umap'] = X_umap  # annotate samples with UMAP coordinates
    logg.info(' finished', time=True, end=' ' if settings.verbosity > 2 else '\n')
    logg.hint('added\n'
              ' \'X_umap\', UMAP coordinates (adata.obsm)')

    return adata if copy else None
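Two details of the parameter handling may be worth a second look in review: `umap_kwargs={}` is a mutable default argument, and because `**umap_kwargs` comes last in the dict display, its entries silently override the named arguments. A small sketch of both points, assuming a hypothetical helper `build_params` that is not part of this PR:

```python
def build_params(n_neighbors=15, min_dist=0.1, umap_kwargs=None):
    """Merge named arguments with extra keyword arguments for the UMAP constructor."""
    # Use None as the default and copy the input, so no dict is shared between calls.
    extra = dict(umap_kwargs or {})
    # Later entries win in a dict display, so `extra` overrides the named arguments.
    return {'n_neighbors': n_neighbors, 'min_dist': min_dist, **extra}
```

Calling `build_params(umap_kwargs={'n_neighbors': 50, 'spread': 2.0})` yields `n_neighbors` of 50, not 15, which is the same precedence the function in this diff has.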