Remove frozen umap #576

Koncopd · 2019-03-30T23:23:06Z

Still need to figure out why paga test fails.

Also simplicial_set_embedding from umap requires data and metrics. Data is adata.X and i set metrics='euclidean', but this is not clear.

Fixes #522

falexwolf · 2019-03-31T19:35:21Z

Great!

Some tests should fail as there are probably differences in the neighbor algorithm.

This is also why this is a backwards-compat breaking change. Can you just visually inspect https://scanpy-tutorials.readthedocs.io/en/latest/paga-paul15.html and see what's going on?

This is another notebook that should still do something meaningful after the change: https://nbviewer.jupyter.org/github/theislab/scanpy_usage/blob/master/170502_paul15/paul15.ipynb

And finally, of course, https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html should give somewhat consistent results. But I expect slight variations and no perfect consistence... Actually, I'd expect the associated tests (https://github.com/theislab/scanpy/blob/master/scanpy/tests/notebooks/test_pbmc3k.py) to fail. Can you check?

falexwolf · 2019-03-31T22:36:53Z

Finally, I want to mention that we also need to export the following functions from UMAP: https://github.com/theislab/scanpy/blob/97c8b54ec884ac8e8396a80b6782a0d59a17a874/scanpy/tools/_umap.py#L107

Koncopd · 2019-04-01T20:48:02Z

@falexwolf

Hi, Alex.
Yes, i'm checking these.

Actually, it somehow passes test_pbmc3k. Only test_paga_paul15_subsampled fails. It seems that adjacency matrix is a bit different after the change and this affects paga connectivities. But it is preliminary, i'm checking still.

Koncopd · 2019-04-01T20:49:33Z

Finally, I want to mention that we also need to export the following functions from UMAP:

scanpy/scanpy/tools/_umap.py

Line 107 in 97c8b54

from ..neighbors.umap.umap_ import find_ab_params, simplicial_set_embedding

This is already in the PR.

falexwolf · 2019-04-03T09:52:08Z

Great!

Yes, I would have expected that the adjacency matrix will differ slightly and hence, test_paga_paul15 fails. We'll need to rerun and upload https://scanpy-tutorials.readthedocs.io/en/latest/paga-paul15.html with the new version in that case and also update the tests.

falexwolf · 2019-04-03T10:04:35Z

One other thing.

We'd like to have two new convenience functions:

def neighbors_update(adata, adata_new)
def umap_update(adata, adata_new)

The first maps the new data into the existing neighbor graph based on the chosen latent representation.

The second maps the new data into the existing UMAP embedding.

For the second function, one just needs to find a good way of wrapping

model = umap.UMAP(seed=1234)
model.fit(X)
model.transform(new_X)

For the first, I'm not quite sure how easy it is easy. I'm using pynndescent for it, which will become UMAP's dependency at some point, but isn't yet.

Maybe what UMAP does internally is already sufficient, but I don't know.

Can you investigate and if it's easy cover in this PR? If it's tricky, let's wait for another PR.

tomwhite · 2019-04-03T15:38:56Z

This looks good to me.

Do you intend to remove scanpy/neighbors/umap as a part of this PR?

falexwolf · 2019-04-03T16:08:51Z

Yes, scanpy/neighbors/umap should definitely be removed.

falexwolf · 2019-04-03T16:10:01Z

scanpy/tools/_umap.py

@@ -131,7 +132,9 @@ def umap(
        n_epochs,
        init_coords,
        random_state,
-        max(0, verbosity-3))
+        metric="euclidean",


I don't think that a metric makes any sense here. We're just embedding the graph that had already been computed before. Are you sure that simplicial_set_embedding still takes a metric as an argument?

Yes, unfortunately it does. Should i add metric as an argument to umap?

Ok, just looked it up, it's just used for the initialization.

Same as the data argument, which, btw, is not correct to be set to adata.X. It should be set to the representation chosen in pp.neighbors and the metric also needs to be chosen consistently with it.

I'd recommend to stare the params of the neighbors call as .uns['neighbors']['params'] as a dict as in several other tools, too and initizalize from that.

Hm, in the description it says

data: array of shape (n_samples, n_features)
The source data to be embedded by UMAP

It is used in spectral_layout only, it uses data and metric only when the number of connected components is bigger than 1. And it also seems from the code that this is exactly adata.X. Or maybe not adata.X, but the data which was used for embedding, adata.X could be changed already from that state. So what can we do, froze adata.X which were used for embedding?

What we should do is this

I'd recommend to store the params of the pp.neighbors call as .uns['neighbors']['params'] as a dict as in several other tools, too and initialize from that.

So, you'll have a use_rep param in that dict. And that param tells you which representation intends to use for the embedding (used for computing the neighborhood graph/simplicial set).

Yes, thank you, i see . Will do like this.

Just forgot that there can be another representation to use.

falexwolf · 2019-04-04T04:09:28Z

scanpy/tools/_umap.py

@@ -118,9 +118,10 @@ def umap(
        init_coords = init_pos
    from sklearn.utils import check_random_state
    random_state = check_random_state(random_state)
-    n_epochs = maxiter
+    n_epochs = 0 if maxiter is None else maxiter


0 epochs will result in no embedding. But None is not legit anymore?

It was changed, now 0 is the same as None before.

https://github.com/lmcinnes/umap/blob/a858c6322a3e682d8daf9c17e13ac023f3e18cfa/umap/umap_.py#L921

ivirshup · 2019-04-09T02:53:36Z

@falexwolf @Koncopd any chance you've had time to look into the projection of new data into an existing UMAP yet? Additionally, would this be rolled out with an equivalent pca_update? I'd be interested in taking some part of implementing this if you haven't worked it out already.

Koncopd · 2019-04-09T14:27:27Z

Hi, @ivirshup.
No, i haven't looked into this yet, so if you have time and ideas, i'll appreciate any help.

ivirshup · 2019-04-10T04:11:07Z

Alright! I've got a little example case I'd probably be using for a test case here (doublet prediction by simulation and projection).

My current thoughts:

Since we need to be working in the same feature space, we'll at least need PCA projection, but this is pretty easy:

Basic PCA projection

def pca_update(tgt, src, inplace=True):
    # TODO: Make sure we know the settings (just whether to center?) from src
    if not inplace:
        tgt = tgt.copy()
    if sparse.issparse(tgt.X):
        X = tgt.X.toarray()
    else:
        X = tgt.X.copy()
    X -= np.asarray(tgt.X.mean(axis=0))
    tgt_pca = np.dot(X, src.varm["PCs"])
    tgt.obsm["X_pca"] = tgt_pca
    return tgt

Are you planning on storing the UMAP object in the AnnData? That would make transformation easier, but I see how on-disk representation could get complicated.
What order should we do this in? Would you like everything to be accomplished by this PR or should we break it up?
Are we introducing a general transfer learning api? Probably worth considering that a bit. Some relevant questions:
- Does the syntax still work for cases other than 1-to-1 transfer?
- How do we deal with concatenation/ joins? The current concatenate doesn't join things like obsm.
- Alternatively, does everything have to be in the same AnnData? It would solve issues with having var be the same, but could complicate a lot of other code (many functions would need some kind of masking argument).

falexwolf · 2019-04-10T11:50:02Z

@ivirshup, fantastic! Yes, it would be great if you could work on this. We should also keep it separate from this PR, in this case. This PR should remain concerned with getting rid of the duplicate code.

I'll open up an issue.

Koncopd · 2019-04-22T01:16:45Z

Added the initial version of neighbors_update, only works with use_rep='X' in sc.pp.neighbors for now.

ivirshup · 2019-04-22T04:40:45Z

scanpy/neighbors/__init__.py

+    result = search(train, search_graph.indptr, search_graph.indices, init, test)
+
+    indices, dists = deheap_sort(result)
+    return indices[:, :k], dists[:, :k]


Just checking, since downstream functions like leiden use the "connectivities" weighted graph, are you planning on returning that as well?

For now this is just an analog of

neighbors = NNDescent(adata1.X) # construct first graph neighbors.query(adata2.X) # lookup second data

But calculating connectivities should be possible, yes.

falexwolf · 2019-04-26T12:14:24Z

This looks good! 😄

Storing the forest in the AnnData is good! It should also be compatible with the updates the @tomwhite plans on UMAP and pynndescent (UMAP will depend on pynndescent) as that should be the most basic object to store when to enable queries later on...

But I would not store the "forest" in a default neighbors call. Or do you have any estimate on how large it is?

Great work!

falexwolf · 2019-04-28T19:55:01Z

@Koncopd, can we merge this without the neighbors_update function and without writing the rp_forest to the AnnData object? Your code is good, but we should put it into another PR.

Can you investigate and if it's easy cover in this PR? If it's tricky, let's wait for another PR.

Is what I wrote in the beginning. I think it turned out tricky and is a case for #562 (comment). So, let's keep this PR really simple and just be about removing the legacy code.

Your statement about "all tests pass except for the PAGA tests" is still true? Did you manually inspect the PAGA notebook and does it look consistent? Just a few cosmetic things should have changed, I guess.

If yes, we'll merge this, now that 1.4.1 is out.

Koncopd · 2019-04-28T20:55:19Z

@falexwolf
Yes, i inspected the notebook, everything looks the same. Also this PR passes all tests now. So, yes, i think it can be merged

falexwolf · 2019-04-28T23:19:36Z

scanpy/neighbors/__init__.py

    adata.uns['neighbors']['distances'] = neighbors.distances
    adata.uns['neighbors']['connectivities'] = neighbors.connectivities
+    if neighbors.rp_forest is not None:


But, as mentioned, I'd not at the rp_forest object at this stage. Makes sense? Can you remove it?

falexwolf · 2019-04-28T23:21:17Z

Cool! But, can you address the comment above? And, what about all these strange conflicts? There shouldn't be any in scanpy/neighbors/ as all of these files get removed. Could you fix the conflict in scanpy/tools/_umap.py?

Koncopd · 2019-04-29T08:50:33Z

@falexwolf , yes, i'll remove the commits with rp_forest and update_neighbors and resolve the conflict today. The conflicts are because of this commit, i believe.

falexwolf · 2019-04-29T19:39:39Z

scanpy/tools/_umap.py

@@ -1,5 +1,5 @@
 from ._utils import get_init_pos_from_paga, choose_representation
-from .. import settings
+from .._settings import settings


This should not have been necessary. It should be from .. import settings...

It is the same as in the master branch
https://github.com/theislab/scanpy/blob/9b522f54e0f839e1a0c9874ca658400bfe79a894/scanpy/tools/_umap.py#L2

Well, it doesn't hurt, I don't who put it there. But it should be from ..settings import settings. :) Don't worry...

falexwolf · 2019-04-29T19:40:06Z

scanpy/neighbors/__init__.py

                X, n_neighbors, random_state, metric=metric, metric_kwds=metric_kwds)
-            self._rp_forest = _make_forest_dict(forest)
+            #self._rp_forest = _make_forest_dict(forest)


OK, looks good!

falexwolf · 2019-04-29T19:40:55Z

So, is this good to go? Waiting for your OK!

Koncopd · 2019-04-29T19:42:05Z

@falexwolf
It is ready for merge.
However before i checked only paul15.ipynb, it is the same.
But for paga-paul15 there are serious differences.
The upper graph is new

Denoising the graph

falexwolf · 2019-04-30T01:33:58Z

I tested myself and obtained exactly the same results. :)

You probably don't have the FA2 package installed, that's why your graph look different... :)

I'm merging this! Awesome work!

falexwolf · 2019-04-30T01:38:20Z

@tomwhite, we're now fully depending on UMAP. 😄

tomwhite · 2019-04-30T09:14:40Z

That's great! Thanks @Koncopd and @falexwolf

Remove frozen umap

Koncopd mentioned this pull request Mar 30, 2019

Getting rid of legacy umap code #522

Closed

falexwolf mentioned this pull request Mar 31, 2019

TODO: Backwards-compat breaking changes #453

Open

16 tasks

falexwolf reviewed Apr 3, 2019

View reviewed changes

falexwolf reviewed Apr 4, 2019

View reviewed changes

ivirshup reviewed Apr 22, 2019

View reviewed changes

falexwolf reviewed Apr 28, 2019

View reviewed changes

Koncopd added 6 commits April 29, 2019 20:34

use umap nearest_neighbors

70e3771

fix import

31f8946

use umap fuzzy_simplicial_set

f3ffbc9

use umap simplicial_set_embedding

e05ca81

umap fuzzy_simplicial_set has no bandwidth

4b4550b

space

ced6f98

Koncopd added 10 commits April 29, 2019 20:34

fix arguments

5445b30

add umap to requirements

eef76b2

fix n_epoch in _umap

a7eeb78

store and use neighbors params

ca2c919

remove frozen umap

75379ee

fix random_state

3f7a055

save rp_forest in anndata

c5558e3

neighbors_update initial version

39e49e3

fix for tests

9800307

dont store rp_forest for now

5746697

Koncopd force-pushed the umap_deps branch from 4787d1d to 5746697 Compare April 29, 2019 18:41

fix import settings

152809b

falexwolf reviewed Apr 29, 2019

View reviewed changes

falexwolf added 2 commits April 29, 2019 21:28

added umap as dependency in logging

a718638

updated installation

470b516

added to release notes

bdbc3d3

falexwolf merged commit 9e4e5ee into master Apr 30, 2019

ivirshup mentioned this pull request May 7, 2019

Bandwidth parameter no longer supported by scanpy or UMAP Teichlab/bbknn#10

Closed

awnimo pushed a commit to dpeerlab/scanpy that referenced this pull request Dec 17, 2019

Merge pull request scverse#576 from theislab/umap_deps

c85b157

Remove frozen umap

Remove frozen umap #576

Remove frozen umap #576

Conversation

Koncopd commented Mar 30, 2019 • edited by flying-sheep Loading

falexwolf commented Mar 31, 2019

falexwolf commented Mar 31, 2019

Koncopd commented Apr 1, 2019 • edited Loading

Koncopd commented Apr 1, 2019

falexwolf commented Apr 3, 2019

falexwolf commented Apr 3, 2019 • edited Loading

tomwhite commented Apr 3, 2019

falexwolf commented Apr 3, 2019

Choose a reason for hiding this comment

Koncopd Apr 3, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ivirshup commented Apr 9, 2019

Koncopd commented Apr 9, 2019

ivirshup commented Apr 10, 2019

falexwolf commented Apr 10, 2019

Koncopd commented Apr 22, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

falexwolf commented Apr 26, 2019

falexwolf commented Apr 28, 2019

Koncopd commented Apr 28, 2019

Choose a reason for hiding this comment

falexwolf commented Apr 28, 2019

Koncopd commented Apr 29, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

falexwolf commented Apr 29, 2019

Koncopd commented Apr 29, 2019

falexwolf commented Apr 30, 2019

falexwolf commented Apr 30, 2019

tomwhite commented Apr 30, 2019

Koncopd commented Mar 30, 2019 •

edited by flying-sheep

Loading

Koncopd commented Apr 1, 2019 •

edited

Loading

falexwolf commented Apr 3, 2019 •

edited

Loading

Koncopd Apr 3, 2019 •

edited

Loading

Koncopd commented Apr 29, 2019 •

edited

Loading