Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove frozen umap #576

Merged
merged 20 commits into from Apr 30, 2019

Conversation

@Koncopd
Copy link
Member

commented Mar 30, 2019

Still need to figure out why paga test fails.

Also simplicial_set_embedding from umap requires data and metrics. Data is adata.X and i set metrics='euclidean', but this is not clear.

Fixes #522

@falexwolf

This comment has been minimized.

Copy link
Member

commented Mar 31, 2019

Great!

Some tests should fail as there are probably differences in the neighbor algorithm.

This is also why this is a backwards-compat breaking change. Can you just visually inspect https://scanpy-tutorials.readthedocs.io/en/latest/paga-paul15.html and see what's going on?

This is another notebook that should still do something meaningful after the change: https://nbviewer.jupyter.org/github/theislab/scanpy_usage/blob/master/170502_paul15/paul15.ipynb

And finally, of course, https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html should give somewhat consistent results. But I expect slight variations and no perfect consistence... Actually, I'd expect the associated tests (https://github.com/theislab/scanpy/blob/master/scanpy/tests/notebooks/test_pbmc3k.py) to fail. Can you check?

@falexwolf falexwolf referenced this pull request Mar 31, 2019
0 of 13 tasks complete
@falexwolf

This comment has been minimized.

Copy link
Member

commented Mar 31, 2019

Finally, I want to mention that we also need to export the following functions from UMAP:

from ..neighbors.umap.umap_ import find_ab_params, simplicial_set_embedding

@Koncopd

This comment has been minimized.

Copy link
Member Author

commented Apr 1, 2019

@falexwolf

Hi, Alex.
Yes, i'm checking these.

Actually, it somehow passes test_pbmc3k. Only test_paga_paul15_subsampled fails. It seems that adjacency matrix is a bit different after the change and this affects paga connectivities. But it is preliminary, i'm checking still.

@Koncopd

This comment has been minimized.

Copy link
Member Author

commented Apr 1, 2019

Finally, I want to mention that we also need to export the following functions from UMAP:

scanpy/scanpy/tools/_umap.py

Line 107 in 97c8b54

from ..neighbors.umap.umap_ import find_ab_params, simplicial_set_embedding

This is already in the PR.

@falexwolf

This comment has been minimized.

Copy link
Member

commented Apr 3, 2019

Great!

Yes, I would have expected that the adjacency matrix will differ slightly and hence, test_paga_paul15 fails. We'll need to rerun and upload https://scanpy-tutorials.readthedocs.io/en/latest/paga-paul15.html with the new version in that case and also update the tests.

@falexwolf

This comment has been minimized.

Copy link
Member

commented Apr 3, 2019

One other thing.

We'd like to have two new convenience functions:

def neighbors_update(adata, adata_new)
def umap_update(adata, adata_new)

The first maps the new data into the existing neighbor graph based on the chosen latent representation.

The second maps the new data into the existing UMAP embedding.

For the second function, one just needs to find a good way of wrapping

model = umap.UMAP(seed=1234)
model.fit(X)
model.transform(new_X)

For the first, I'm not quite sure how easy it is easy. I'm using pynndescent for it, which will become UMAP's dependency at some point, but isn't yet.

Maybe what UMAP does internally is already sufficient, but I don't know.

Can you investigate and if it's easy cover in this PR? If it's tricky, let's wait for another PR.

@tomwhite

This comment has been minimized.

Copy link
Contributor

commented Apr 3, 2019

This looks good to me.

Do you intend to remove scanpy/neighbors/umap as a part of this PR?

@falexwolf

This comment has been minimized.

Copy link
Member

commented Apr 3, 2019

Yes, scanpy/neighbors/umap should definitely be removed.

@@ -131,7 +132,9 @@ def umap(
n_epochs,
init_coords,
random_state,
max(0, verbosity-3))
metric="euclidean",

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 3, 2019

Member

I don't think that a metric makes any sense here. We're just embedding the graph that had already been computed before. Are you sure that simplicial_set_embedding still takes a metric as an argument?

This comment has been minimized.

Copy link
@Koncopd

Koncopd Apr 3, 2019

Author Member

Yes, unfortunately it does. Should i add metric as an argument to umap?

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 4, 2019

Member

Ok, just looked it up, it's just used for the initialization.

Same as the data argument, which, btw, is not correct to be set to adata.X. It should be set to the representation chosen in pp.neighbors and the metric also needs to be chosen consistently with it.

I'd recommend to stare the params of the neighbors call as .uns['neighbors']['params'] as a dict as in several other tools, too and initizalize from that.

This comment has been minimized.

Copy link
@Koncopd

Koncopd Apr 4, 2019

Author Member

Hm, in the description it says

data: array of shape (n_samples, n_features)
The source data to be embedded by UMAP

It is used in spectral_layout only, it uses data and metric only when the number of connected components is bigger than 1. And it also seems from the code that this is exactly adata.X. Or maybe not adata.X, but the data which was used for embedding, adata.X could be changed already from that state. So what can we do, froze adata.X which were used for embedding?

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 4, 2019

Member

What we should do is this

I'd recommend to store the params of the pp.neighbors call as .uns['neighbors']['params'] as a dict as in several other tools, too and initialize from that.

So, you'll have a use_rep param in that dict. And that param tells you which representation intends to use for the embedding (used for computing the neighborhood graph/simplicial set).

This comment has been minimized.

Copy link
@Koncopd

Koncopd Apr 4, 2019

Author Member

Yes, thank you, i see . Will do like this.

This comment has been minimized.

Copy link
@Koncopd

Koncopd Apr 4, 2019

Author Member

Just forgot that there can be another representation to use.

@@ -118,9 +118,10 @@ def umap(
init_coords = init_pos
from sklearn.utils import check_random_state
random_state = check_random_state(random_state)
n_epochs = maxiter
n_epochs = 0 if maxiter is None else maxiter

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 4, 2019

Member

0 epochs will result in no embedding. But None is not legit anymore?

This comment has been minimized.

Copy link
@Koncopd

Koncopd Apr 4, 2019

Author Member

It was changed, now 0 is the same as None before.

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 4, 2019

Member

Thanks!

@ivirshup

This comment has been minimized.

Copy link
Collaborator

commented Apr 9, 2019

@falexwolf @Koncopd any chance you've had time to look into the projection of new data into an existing UMAP yet? Additionally, would this be rolled out with an equivalent pca_update? I'd be interested in taking some part of implementing this if you haven't worked it out already.

@Koncopd

This comment has been minimized.

Copy link
Member Author

commented Apr 9, 2019

Hi, @ivirshup.
No, i haven't looked into this yet, so if you have time and ideas, i'll appreciate any help.

@ivirshup

This comment has been minimized.

Copy link
Collaborator

commented Apr 10, 2019

Alright! I've got a little example case I'd probably be using for a test case here (doublet prediction by simulation and projection).

My current thoughts:

  • Since we need to be working in the same feature space, we'll at least need PCA projection, but this is pretty easy:
Basic PCA projection
def pca_update(tgt, src, inplace=True):
    # TODO: Make sure we know the settings (just whether to center?) from src
    if not inplace:
        tgt = tgt.copy()
    if sparse.issparse(tgt.X):
        X = tgt.X.toarray()
    else:
        X = tgt.X.copy()
    X -= np.asarray(tgt.X.mean(axis=0))
    tgt_pca = np.dot(X, src.varm["PCs"])
    tgt.obsm["X_pca"] = tgt_pca
    return tgt
  • Are you planning on storing the UMAP object in the AnnData? That would make transformation easier, but I see how on-disk representation could get complicated.
  • What order should we do this in? Would you like everything to be accomplished by this PR or should we break it up?
  • Are we introducing a general transfer learning api? Probably worth considering that a bit. Some relevant questions:
    • Does the syntax still work for cases other than 1-to-1 transfer?
    • How do we deal with concatenation/ joins? The current concatenate doesn't join things like obsm.
    • Alternatively, does everything have to be in the same AnnData? It would solve issues with having var be the same, but could complicate a lot of other code (many functions would need some kind of masking argument).
@falexwolf

This comment has been minimized.

Copy link
Member

commented Apr 10, 2019

@ivirshup, fantastic! Yes, it would be great if you could work on this. We should also keep it separate from this PR, in this case. This PR should remain concerned with getting rid of the duplicate code.

I'll open up an issue.

@Koncopd

This comment has been minimized.

Copy link
Member Author

commented Apr 22, 2019

Added the initial version of neighbors_update, only works with use_rep='X' in sc.pp.neighbors for now.

result = search(train, search_graph.indptr, search_graph.indices, init, test)

indices, dists = deheap_sort(result)
return indices[:, :k], dists[:, :k]

This comment has been minimized.

Copy link
@ivirshup

ivirshup Apr 22, 2019

Collaborator

Just checking, since downstream functions like leiden use the "connectivities" weighted graph, are you planning on returning that as well?

This comment has been minimized.

Copy link
@Koncopd

Koncopd Apr 22, 2019

Author Member

For now this is just an analog of

neighbors = NNDescent(adata1.X)  # construct first graph
neighbors.query(adata2.X)  # lookup second data

But calculating connectivities should be possible, yes.

@falexwolf

This comment has been minimized.

Copy link
Member

commented Apr 26, 2019

This looks good! 😄

Storing the forest in the AnnData is good! It should also be compatible with the updates the @tomwhite plans on UMAP and pynndescent (UMAP will depend on pynndescent) as that should be the most basic object to store when to enable queries later on...

But I would not store the "forest" in a default neighbors call. Or do you have any estimate on how large it is?

Great work!

@falexwolf

This comment has been minimized.

Copy link
Member

commented Apr 28, 2019

@Koncopd, can we merge this without the neighbors_update function and without writing the rp_forest to the AnnData object? Your code is good, but we should put it into another PR.

Can you investigate and if it's easy cover in this PR? If it's tricky, let's wait for another PR.

Is what I wrote in the beginning. I think it turned out tricky and is a case for #562 (comment). So, let's keep this PR really simple and just be about removing the legacy code.

Your statement about "all tests pass except for the PAGA tests" is still true? Did you manually inspect the PAGA notebook and does it look consistent? Just a few cosmetic things should have changed, I guess.

If yes, we'll merge this, now that 1.4.1 is out.

@Koncopd

This comment has been minimized.

Copy link
Member Author

commented Apr 28, 2019

@falexwolf
Yes, i inspected the notebook, everything looks the same. Also this PR passes all tests now. So, yes, i think it can be merged

adata.uns['neighbors']['distances'] = neighbors.distances
adata.uns['neighbors']['connectivities'] = neighbors.connectivities
if neighbors.rp_forest is not None:

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 28, 2019

Member

But, as mentioned, I'd not at the rp_forest object at this stage. Makes sense? Can you remove it?

@falexwolf

This comment has been minimized.

Copy link
Member

commented Apr 28, 2019

Cool! But, can you address the comment above? And, what about all these strange conflicts? There shouldn't be any in scanpy/neighbors/ as all of these files get removed. Could you fix the conflict in scanpy/tools/_umap.py?

@Koncopd

This comment has been minimized.

Copy link
Member Author

commented Apr 29, 2019

@falexwolf , yes, i'll remove the commits with rp_forest and update_neighbors and resolve the conflict today. The conflicts are because of this commit, i believe.

@Koncopd Koncopd force-pushed the umap_deps branch from 4787d1d to 5746697 Apr 29, 2019

@@ -1,5 +1,5 @@
from ._utils import get_init_pos_from_paga, choose_representation
from .. import settings
from .._settings import settings

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 29, 2019

Member

This should not have been necessary. It should be from .. import settings...

This comment has been minimized.

Copy link
@Koncopd

Koncopd Apr 29, 2019

Author Member

It is the same as in the master branch

from .._settings import settings

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 30, 2019

Member

Well, it doesn't hurt, I don't who put it there. But it should be from ..settings import settings. :) Don't worry...

X, n_neighbors, random_state, metric=metric, metric_kwds=metric_kwds)
self._rp_forest = _make_forest_dict(forest)
#self._rp_forest = _make_forest_dict(forest)

This comment has been minimized.

Copy link
@falexwolf

falexwolf Apr 29, 2019

Member

OK, looks good!

@falexwolf

This comment has been minimized.

Copy link
Member

commented Apr 29, 2019

So, is this good to go? Waiting for your OK!

@Koncopd

This comment has been minimized.

Copy link
Member Author

commented Apr 29, 2019

@falexwolf
It is ready for merge.
However before i checked only paul15.ipynb, it is the same.
But for paga-paul15 there are serious differences.
The upper graph is new
image
Denoising the graph
image

falexwolf added some commits Apr 30, 2019

@falexwolf

This comment has been minimized.

Copy link
Member

commented Apr 30, 2019

I tested myself and obtained exactly the same results. :)

You probably don't have the FA2 package installed, that's why your graph look different... :)

I'm merging this! Awesome work!

@falexwolf falexwolf merged commit 9e4e5ee into master Apr 30, 2019

0 of 2 checks passed

continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
continuous-integration/travis-ci/push The Travis CI build is in progress
Details
@falexwolf

This comment has been minimized.

Copy link
Member

commented Apr 30, 2019

@tomwhite, we're now fully depending on UMAP. 😄

@tomwhite

This comment has been minimized.

Copy link
Contributor

commented Apr 30, 2019

That's great! Thanks @Koncopd and @falexwolf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.