Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove frozen umap #576

Merged
merged 20 commits into from
Apr 30, 2019
Merged

Remove frozen umap #576

merged 20 commits into from
Apr 30, 2019

Conversation

Koncopd
Copy link
Member

@Koncopd Koncopd commented Mar 30, 2019

Still need to figure out why paga test fails.

Also simplicial_set_embedding from umap requires data and metrics. Data is adata.X and i set metrics='euclidean', but this is not clear.

Fixes #522

@falexwolf
Copy link
Member

Great!

Some tests should fail as there are probably differences in the neighbor algorithm.

This is also why this is a backwards-compat breaking change. Can you just visually inspect https://scanpy-tutorials.readthedocs.io/en/latest/paga-paul15.html and see what's going on?

This is another notebook that should still do something meaningful after the change: https://nbviewer.jupyter.org/github/theislab/scanpy_usage/blob/master/170502_paul15/paul15.ipynb

And finally, of course, https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html should give somewhat consistent results. But I expect slight variations and no perfect consistence... Actually, I'd expect the associated tests (https://github.com/theislab/scanpy/blob/master/scanpy/tests/notebooks/test_pbmc3k.py) to fail. Can you check?

@falexwolf
Copy link
Member

Finally, I want to mention that we also need to export the following functions from UMAP: https://github.com/theislab/scanpy/blob/97c8b54ec884ac8e8396a80b6782a0d59a17a874/scanpy/tools/_umap.py#L107

@Koncopd
Copy link
Member Author

Koncopd commented Apr 1, 2019

@falexwolf

Hi, Alex.
Yes, i'm checking these.

Actually, it somehow passes test_pbmc3k. Only test_paga_paul15_subsampled fails. It seems that adjacency matrix is a bit different after the change and this affects paga connectivities. But it is preliminary, i'm checking still.

@Koncopd
Copy link
Member Author

Koncopd commented Apr 1, 2019

Finally, I want to mention that we also need to export the following functions from UMAP:

scanpy/scanpy/tools/_umap.py

Line 107 in 97c8b54

from ..neighbors.umap.umap_ import find_ab_params, simplicial_set_embedding

This is already in the PR.

@falexwolf
Copy link
Member

Great!

Yes, I would have expected that the adjacency matrix will differ slightly and hence, test_paga_paul15 fails. We'll need to rerun and upload https://scanpy-tutorials.readthedocs.io/en/latest/paga-paul15.html with the new version in that case and also update the tests.

@falexwolf
Copy link
Member

falexwolf commented Apr 3, 2019

One other thing.

We'd like to have two new convenience functions:

def neighbors_update(adata, adata_new)
def umap_update(adata, adata_new)

The first maps the new data into the existing neighbor graph based on the chosen latent representation.

The second maps the new data into the existing UMAP embedding.

For the second function, one just needs to find a good way of wrapping

model = umap.UMAP(seed=1234)
model.fit(X)
model.transform(new_X)

For the first, I'm not quite sure how easy it is easy. I'm using pynndescent for it, which will become UMAP's dependency at some point, but isn't yet.

Maybe what UMAP does internally is already sufficient, but I don't know.

Can you investigate and if it's easy cover in this PR? If it's tricky, let's wait for another PR.

@tomwhite
Copy link
Contributor

tomwhite commented Apr 3, 2019

This looks good to me.

Do you intend to remove scanpy/neighbors/umap as a part of this PR?

@falexwolf
Copy link
Member

Yes, scanpy/neighbors/umap should definitely be removed.

@@ -131,7 +132,9 @@ def umap(
n_epochs,
init_coords,
random_state,
max(0, verbosity-3))
metric="euclidean",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that a metric makes any sense here. We're just embedding the graph that had already been computed before. Are you sure that simplicial_set_embedding still takes a metric as an argument?

Copy link
Member Author

@Koncopd Koncopd Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, unfortunately it does. Should i add metric as an argument to umap?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, just looked it up, it's just used for the initialization.

Same as the data argument, which, btw, is not correct to be set to adata.X. It should be set to the representation chosen in pp.neighbors and the metric also needs to be chosen consistently with it.

I'd recommend to stare the params of the neighbors call as .uns['neighbors']['params'] as a dict as in several other tools, too and initizalize from that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, in the description it says

data: array of shape (n_samples, n_features)
The source data to be embedded by UMAP

It is used in spectral_layout only, it uses data and metric only when the number of connected components is bigger than 1. And it also seems from the code that this is exactly adata.X. Or maybe not adata.X, but the data which was used for embedding, adata.X could be changed already from that state. So what can we do, froze adata.X which were used for embedding?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we should do is this

I'd recommend to store the params of the pp.neighbors call as .uns['neighbors']['params'] as a dict as in several other tools, too and initialize from that.

So, you'll have a use_rep param in that dict. And that param tells you which representation intends to use for the embedding (used for computing the neighborhood graph/simplicial set).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thank you, i see . Will do like this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just forgot that there can be another representation to use.

@@ -118,9 +118,10 @@ def umap(
init_coords = init_pos
from sklearn.utils import check_random_state
random_state = check_random_state(random_state)
n_epochs = maxiter
n_epochs = 0 if maxiter is None else maxiter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 epochs will result in no embedding. But None is not legit anymore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was changed, now 0 is the same as None before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@ivirshup
Copy link
Member

ivirshup commented Apr 9, 2019

@falexwolf @Koncopd any chance you've had time to look into the projection of new data into an existing UMAP yet? Additionally, would this be rolled out with an equivalent pca_update? I'd be interested in taking some part of implementing this if you haven't worked it out already.

@Koncopd
Copy link
Member Author

Koncopd commented Apr 9, 2019

Hi, @ivirshup.
No, i haven't looked into this yet, so if you have time and ideas, i'll appreciate any help.

@ivirshup
Copy link
Member

Alright! I've got a little example case I'd probably be using for a test case here (doublet prediction by simulation and projection).

My current thoughts:

  • Since we need to be working in the same feature space, we'll at least need PCA projection, but this is pretty easy:
Basic PCA projection
def pca_update(tgt, src, inplace=True):
    # TODO: Make sure we know the settings (just whether to center?) from src
    if not inplace:
        tgt = tgt.copy()
    if sparse.issparse(tgt.X):
        X = tgt.X.toarray()
    else:
        X = tgt.X.copy()
    X -= np.asarray(tgt.X.mean(axis=0))
    tgt_pca = np.dot(X, src.varm["PCs"])
    tgt.obsm["X_pca"] = tgt_pca
    return tgt
  • Are you planning on storing the UMAP object in the AnnData? That would make transformation easier, but I see how on-disk representation could get complicated.
  • What order should we do this in? Would you like everything to be accomplished by this PR or should we break it up?
  • Are we introducing a general transfer learning api? Probably worth considering that a bit. Some relevant questions:
    • Does the syntax still work for cases other than 1-to-1 transfer?
    • How do we deal with concatenation/ joins? The current concatenate doesn't join things like obsm.
    • Alternatively, does everything have to be in the same AnnData? It would solve issues with having var be the same, but could complicate a lot of other code (many functions would need some kind of masking argument).

@falexwolf
Copy link
Member

@ivirshup, fantastic! Yes, it would be great if you could work on this. We should also keep it separate from this PR, in this case. This PR should remain concerned with getting rid of the duplicate code.

I'll open up an issue.

@Koncopd
Copy link
Member Author

Koncopd commented Apr 22, 2019

Added the initial version of neighbors_update, only works with use_rep='X' in sc.pp.neighbors for now.

result = search(train, search_graph.indptr, search_graph.indices, init, test)

indices, dists = deheap_sort(result)
return indices[:, :k], dists[:, :k]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking, since downstream functions like leiden use the "connectivities" weighted graph, are you planning on returning that as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now this is just an analog of

neighbors = NNDescent(adata1.X)  # construct first graph
neighbors.query(adata2.X)  # lookup second data

But calculating connectivities should be possible, yes.

@falexwolf
Copy link
Member

This looks good! 😄

Storing the forest in the AnnData is good! It should also be compatible with the updates the @tomwhite plans on UMAP and pynndescent (UMAP will depend on pynndescent) as that should be the most basic object to store when to enable queries later on...

But I would not store the "forest" in a default neighbors call. Or do you have any estimate on how large it is?

Great work!

@falexwolf
Copy link
Member

@Koncopd, can we merge this without the neighbors_update function and without writing the rp_forest to the AnnData object? Your code is good, but we should put it into another PR.

Can you investigate and if it's easy cover in this PR? If it's tricky, let's wait for another PR.

Is what I wrote in the beginning. I think it turned out tricky and is a case for #562 (comment). So, let's keep this PR really simple and just be about removing the legacy code.

Your statement about "all tests pass except for the PAGA tests" is still true? Did you manually inspect the PAGA notebook and does it look consistent? Just a few cosmetic things should have changed, I guess.

If yes, we'll merge this, now that 1.4.1 is out.

@Koncopd
Copy link
Member Author

Koncopd commented Apr 28, 2019

@falexwolf
Yes, i inspected the notebook, everything looks the same. Also this PR passes all tests now. So, yes, i think it can be merged

adata.uns['neighbors']['distances'] = neighbors.distances
adata.uns['neighbors']['connectivities'] = neighbors.connectivities
if neighbors.rp_forest is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, as mentioned, I'd not at the rp_forest object at this stage. Makes sense? Can you remove it?

@falexwolf
Copy link
Member

Cool! But, can you address the comment above? And, what about all these strange conflicts? There shouldn't be any in scanpy/neighbors/ as all of these files get removed. Could you fix the conflict in scanpy/tools/_umap.py?

@Koncopd
Copy link
Member Author

Koncopd commented Apr 29, 2019

@falexwolf , yes, i'll remove the commits with rp_forest and update_neighbors and resolve the conflict today. The conflicts are because of this commit, i believe.

@@ -1,5 +1,5 @@
from ._utils import get_init_pos_from_paga, choose_representation
from .. import settings
from .._settings import settings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not have been necessary. It should be from .. import settings...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it doesn't hurt, I don't who put it there. But it should be from ..settings import settings. :) Don't worry...

X, n_neighbors, random_state, metric=metric, metric_kwds=metric_kwds)
self._rp_forest = _make_forest_dict(forest)
#self._rp_forest = _make_forest_dict(forest)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, looks good!

@falexwolf
Copy link
Member

So, is this good to go? Waiting for your OK!

@Koncopd
Copy link
Member Author

Koncopd commented Apr 29, 2019

@falexwolf
It is ready for merge.
However before i checked only paul15.ipynb, it is the same.
But for paga-paul15 there are serious differences.
The upper graph is new
image
Denoising the graph
image

@falexwolf
Copy link
Member

I tested myself and obtained exactly the same results. :)

You probably don't have the FA2 package installed, that's why your graph look different... :)

I'm merging this! Awesome work!

@falexwolf falexwolf merged commit 9e4e5ee into master Apr 30, 2019
@falexwolf
Copy link
Member

@tomwhite, we're now fully depending on UMAP. 😄

@tomwhite
Copy link
Contributor

That's great! Thanks @Koncopd and @falexwolf

awnimo pushed a commit to dpeerlab/scanpy that referenced this pull request Dec 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Getting rid of legacy umap code
4 participants