Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ValueError when doing pca with changed number of components on a filtered adata #126

Closed
fabianrost84 opened this issue Mar 19, 2019 · 10 comments

Comments

Projects
None yet
3 participants
@fabianrost84
Copy link
Contributor

commented Mar 19, 2019

I ran into this error when analysing a subset of cells. Minimal example:

import scanpy as sc

adata = sc.datasets.blobs(n_variables=100, n_observations=100)

sc.tl.pca(adata)
adata = adata[:50]
sc.tl.pca(adata, n_comps=20)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-91-3af710cf8b3f> in <module>
      5 sc.tl.pca(adata)
      6 adata = adata[:50]
----> 7 sc.tl.pca(adata, n_comps=20)

~/miniconda3/envs/spols190117/lib/python3.6/site-packages/scanpy/preprocessing/_simple.py in pca(data, n_comps, zero_center, svd_solver, random_state, return_info, use_highly_variable, dtype, copy, chunked, chunk_size)
    504 
    505     if data_is_AnnData:
--> 506         adata.obsm['X_pca'] = X_pca
    507         if use_highly_variable:
    508             adata.varm['PCs'] = np.zeros(shape=(adata.n_vars, n_comps))

ValueError: could not broadcast input array from shape (50,20) into shape (50,50)

Maybe this is actually an issue with AnnData as I can produce the same error like that

import anndata

adata = anndata.AnnData(np.empty((100, 100)))
adata.obsm['o'] = np.empty((100, 50))

adata = adata[:50]
adata.obsm['o'] = np.empty((50, 20))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-226-e4d4bf8798f3> in <module>
      5 
      6 adata = adata[:50]
----> 7 adata.obsm['o'] = np.empty((50, 20))

ValueError: could not broadcast input array from shape (50,20) into shape (50,50)
>>> sc.logging.print_versions()
scanpy==1.4 anndata==0.6.18 numpy==1.15.4 scipy==1.2.0 pandas==0.24.2 scikit-learn==0.20.2 statsmodels==0.9.0 python-igraph==0.7.1 louvain==0.6.1 
@falexwolf

This comment has been minimized.

Copy link
Member

commented Mar 19, 2019

Thanks! Make a copy adata = adata[:50].copy() instead of using a view.

But actually, that was implicit until recently. Let me check why this broke; it should also be covered by the tests:

adata_subset.obs['foo'] = range(2)

I'll look into it.

@falexwolf

This comment has been minimized.

Copy link
Member

commented Mar 19, 2019

It's just a dimensionality error, everything works as expected:

import numpy as np
import anndata

adata = anndata.AnnData(np.empty((100, 100)))
adata.obsm['o'] = np.empty((100, 50))

adata = adata[:50]
adata.obsm['o'] = np.empty((50, 50))

runs through.

The problem is that n_comps of tl.pca is not respected in this. I'll fix that quickly.

@fabianrost84

This comment has been minimized.

Copy link
Contributor Author

commented Mar 19, 2019

I'm not so sure. In your example you do not change the second dimension of adata.obsm['o'] and this works fine. But in principle, one should always be able to overwrite an entry in adata.obsm with another one that has a different second dimension. One example for changing the second dimension would be to change n_comps in tl.pca. But then again... you probably know better ;)

@falexwolf

This comment has been minimized.

Copy link
Member

commented Mar 19, 2019

You are right, it should also work with any other second dimension.

Strange thing is, that I seem to be unable to debug

anndata/anndata/base.py

Lines 384 to 385 in f201810

_init_actual_AnnData(adata_view)
getattr(adata_view, attr_name)[idx] = value

This works

import numpy as np
import anndata

adata = anndata.AnnData(np.empty((100, 100)))
adata.obsm['o'] = np.ones((100, 50))

adata = adata[:50]
adata._init_as_actual(adata.copy())
adata.obsm['o'] = np.empty((50, 20))

and should be called internally when trying to set annotations of a view of an AnnData.

@flying-sheep, you came up with the really elegant _SetItemMixin implementation. Is it possible that there are cases in which class ArrayView(np.ndarray, _SetItemMixin) doesn't delegate to _SetItemMixin.__setitem__? It seems that that the arrays __setitem__ is called as I'm not seeing any output when adding a print statement?

class _SetItemMixin:
    def __setitem__(self, idx: Any, value: Any):
        print('test')
        if self._view_args is None:
            super().__setitem__(idx, value)
        else:
            adata_view, attr_name = self._view_args
            _init_actual_AnnData(adata_view)
            getattr(adata_view, attr_name)[idx] = value
@fabianrost84

This comment has been minimized.

Copy link
Contributor Author

commented Mar 20, 2019

Just in case this is of help: The issue came up when I executed some code that I wrote some months ago (I think in December 2018). Back then, the code worked alright. So it could be that this issue was introduced in a recent update. But I am not absolutely sure about this and I do not have the time to reproduce this right now.

@falexwolf

This comment has been minimized.

Copy link
Member

commented Mar 21, 2019

@flying-sheep, I just checked, it's actually due to your prettification of the code in 4e9bd9e

The following runs through without problems

import numpy as np
import anndata

print(anndata.__version__)

adata = anndata.AnnData(np.empty((100, 100)))
adata.obsm['o'] = np.ones((100, 50))

adata = adata[:50]
adata.obsm['o'] = np.empty((50, 20))

printing 0.6.17+6.gc6ab5d4.

So, parent commit dc14d3c was still fine.

It would be awesome if you fixed it. I'll then add the few lines above as a test...

Thank you so much for pointing this out, @fabianrost84.

@flying-sheep

This comment has been minimized.

Copy link
Member

commented Mar 22, 2019

it's actually due to your prettification of the code in 4e9bd9e

… which included the introduction of _SetItemMixin that you mentioned previously, and you were right, it doesn’t delegate.

I guess the order matters here! Let’s see if putting the mixin in front fixes that.

@flying-sheep flying-sheep transferred this issue from theislab/scanpy Mar 22, 2019

@flying-sheep

This comment has been minimized.

Copy link
Member

commented Mar 22, 2019

Yes it does. What an easy fix!

@falexwolf

This comment has been minimized.

Copy link
Member

commented Mar 22, 2019

Awesome! That was easy! :)

And now it's also clear why it wasn't a problem for .obs etc. (then the tests would have failed); the order is correct there. Stupid that I didn't notice that.

anndata/anndata/base.py

Lines 420 to 432 in 0ab553f

class SparseCSRView(_ViewMixin, sparse.csr_matrix):
pass
class SparseCSCView(_ViewMixin, sparse.csc_matrix):
pass
class DictView(_ViewMixin, dict):
pass
class DataFrameView(_ViewMixin, pd.DataFrame):

But it makes sense that the order plays a role...

@flying-sheep

This comment has been minimized.

Copy link
Member

commented Mar 22, 2019

No idea why I did the right order everywhere else but not there … sorry!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.