Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow plots to use adata.obs index as groupby #1583

Merged
merged 17 commits into from
Feb 2, 2021
Merged

Conversation

fidelram
Copy link
Collaborator

@fidelram fidelram commented Jan 13, 2021

I simplified the _prepare_dataframe code by using sc.get.df. However, this change uncovered two issues with sc.get.obs_df that I have now addressed in this PR.

The most relevant is the case when the call to sc.get.obs_df contains keys with duplicates (e.g. keys=['gene1', 'gene1']). This case is not rare as for example in sc.pl.dotplot the same gene can be visualized several times, which requires calling sc.get.obs_df with keys that contain duplications. An example is when sc.pl.rank_genes_groups_dotplot is called and, frequently, the same gene appears as top up-regulated for more than one category. To address this, sc.get.obs_df removes all duplicates (which correspond to DataFrame columns) and after the DataFrame is complete, the duplicates are added back.

A second problem was for non-unique adata.obs indices which should be a rare situation. However, it turns out that one of the test adata object used in test_plotting have this issue. Also, for the goal of this PR (allow adata.obs.index as groupby option) it could be expected that the index may not be unique.

In general, non unique obs indices are ok as long as .obs DataFrame is not joined or merge based on index. However, because internally in sc.get.obs_df the DataFrames are merged using adata.obs.index this non-unique indices caused an increase in rows due to multiple matching. To fix this, the code now checks for unique index, and if it is not unique then a temporary index is added to allow proper join operations and then the non-unique index is put back.

Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big point: don't modify the AnnData object when you're getting values out of it.

Otherwise, mostly good.

Needs docs and examples though.

scanpy/plotting/_anndata.py Outdated Show resolved Hide resolved
scanpy/get.py Outdated Show resolved Hide resolved
scanpy/get.py Outdated Show resolved Hide resolved
scanpy/plotting/_anndata.py Outdated Show resolved Hide resolved
scanpy/plotting/_baseplot_class.py Outdated Show resolved Hide resolved
scanpy/get.py Outdated Show resolved Hide resolved
scanpy/get.py Outdated Show resolved Hide resolved
scanpy/tests/test_plotting.py Outdated Show resolved Hide resolved
Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking much cleaner! I'm really happy with the simplification of _prepare_dataframe.

I've realized I missed a few things with obsdf though:

  • Shouldn't var_df should get similar updates to obs_df?
  • Could we get tests for get.obs_df/ get.var_df for the issues you addressed here (repeated indices)?
  • I've realized this will need a backport, but that can be a separate pr to the 1.7.x branch where you just copy and paste the new functions

scanpy/plotting/_anndata.py Outdated Show resolved Hide resolved
scanpy/plotting/_anndata.py Show resolved Hide resolved
scanpy/plotting/_anndata.py Show resolved Hide resolved
scanpy/plotting/_anndata.py Outdated Show resolved Hide resolved
scanpy/plotting/_anndata.py Outdated Show resolved Hide resolved
@fidelram
Copy link
Collaborator Author

  • Shouldn't var_df should get similar updates to obs_df?

I would suggest a different PR to address this.

  • Could we get tests for get.obs_df/ get.var_df for the issues you addressed here (repeated indices)?

Sure, I added new tests to get.obs_df to check duplicated keys.

Copy link
Member

@ivirshup ivirshup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I added new tests to get.obs_df to check duplicated keys.

Oh, I meant the repeated index names that were giving you problems with DataFrame.join. This issue you mentioned before:

However, because internally in sc.get.obs_df the DataFrames are merged using adata.obs.index this non-unique indices caused an increase in rows due to multiple matching.

Here is a test you could use:

adata = sc.AnnData(
    np.arange(16).reshape(4, 4),
    obs=pd.DataFrame(index=["a", "a", "b", "c"]),
    var=pd.DataFrame(index=[f"gene{i}" for i in range(4)]),
)
df = sc.get.obs_df(adata, ["gene1"])
pd.testing.assert_index_equal(df.index, adata.obs_names)

Shouldn't var_df should get similar updates to obs_df?

I would suggest a different PR to address this.

Just because the code changes should be largely equivalent (as should the tests) I think it will be easier to review if these changes are together in one PR.

scanpy/plotting/_anndata.py Outdated Show resolved Hide resolved
scanpy/plotting/_anndata.py Outdated Show resolved Hide resolved
@fidelram
Copy link
Collaborator Author

@ivirshup I think this PR is ready to merge. readthedocs is failing because of a missing link from scvelo unrelated to the PR

@ivirshup
Copy link
Member

Fidel, sorry to say, but I've run into some issues. Most of these actually didn't have to do with this PR, but were additional things that broke from #1499. I'll give you a few examples of what I've found, mostly by contrast with the current behaviour of 1.6.1.

import scanpy as sc, pandas as pd, numpy as np

M, N = (5, 3)
adata = sc.AnnData(
    X=np.zeros((M, N)),
    obs=pd.DataFrame(
        np.arange(M * 3).reshape((M, 3)),
        columns=["repeated_col", "repeated_col", "var_id"],
        index=pd.Index([f"cell_{i}" for i in range(M)], name="obs_index"),
    ),
    var=pd.DataFrame(
        index=pd.Index(["var_id"] + [f"gene_{i}" for i in range(N-1)], name="var_index"),
    ),
)

Repeated column in adata.obs

I think this should be an error. This is because downstream functions (like plotting) currently assume that for each key input here, there will be one output column. Turns out this isn't exactly pandas behaviour with repeated column values, but I do think it's reasonable.

M, N = 5, 3
adata = sc.AnnData(
    X=np.zeros((M, N)),
    obs=pd.DataFrame(
        np.arange(M * 2).reshape((M, 2)),
        columns=["repeated_col", "repeated_col"],
        index=[f"cell_{i}" for i in range(M)],
    ),
    var=pd.DataFrame(
        index=[f"gene_{i}" for i in range(N)],
    ),   
)
sc.get.obs_df(adata, ["repeated_col"])

This pr (gets both columns)

           repeated_col  repeated_col
obs_index                            
cell_0                0             1
cell_1                3             4
cell_2                6             7
cell_3                9            10
cell_4               12            13

1.6 (errors)

~/miniconda3/envs/scanpy-1.6/lib/python3.8/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
    140 
    141         if self._validate_ndim and self.ndim and len(self.mgr_locs) != len(self.values):
--> 142             raise ValueError(
    143                 f"Wrong number of items passed {len(self.values)}, "
    144                 f"placement implies {len(self.mgr_locs)}"

ValueError: Wrong number of items passed 2, placement implies 1

Not a great error, could definitley be improved.

Key in adata.obs.columns and adata.var_names

In this case, the key is ambiguous (should it get the gene values or the column from obs?). I think this means it should error. I feel like this point has been discussed a number of times, but doesn't seem to have been discussed when this behaviour was changed.

M, N = 5, 3
adata = sc.AnnData(
    X=np.zeros((M, N)),
    obs=pd.DataFrame(
        np.arange(M),
        columns=["var_id"],
        index=[f"cell_{i}" for i in range(M)],
    ),
    var=pd.DataFrame(
        index=["var_id"] + [f"gene_{i}" for i in range(N-1)],
    ),   
)
sc.get.obs_df(adata, ["var_id"])

This pr (warns)

/Users/isaac/github/scanpy/scanpy/get.py:177: UserWarning: The key `var_id` is found in both adata.obs and adata.var_names.Only the adata.obs key will be used.
  warnings.warn(
Out[58]: 
           var_id
obs_index        
cell_0          2
cell_1          5
cell_2          8
cell_3         11
cell_4         14

1.6 (errors)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-69be169f6a4f> in <module>
----> 1 sc.get.obs_df(adata, ["var_id"])

~/miniconda3/envs/scanpy-1.6/lib/python3.8/site-packages/scanpy/get.py in obs_df(adata, keys, obsm_keys, layer, gene_symbols, use_raw)
    171     for k, l in zip(keys, lookup_keys):
    172         if not use_raw or k in adata.obs.columns:
--> 173             df[k] = adata.obs_vector(l, layer=layer)
    174         else:
    175             df[k] = adata.raw.obs_vector(l)

~/miniconda3/envs/scanpy-1.6/lib/python3.8/site-packages/anndata/_core/anndata.py in obs_vector(self, k, layer)
   1362                 )
   1363                 layer = None
-> 1364         return get_vector(self, k, "obs", "var", layer=layer)
   1365 
   1366     def var_vector(self, k, *, layer: Optional[str] = None) -> np.ndarray:

~/miniconda3/envs/scanpy-1.6/lib/python3.8/site-packages/anndata/_core/index.py in get_vector(adata, k, coldim, idxdim, layer)
    156 
    157     if (in_col + in_idx) == 2:
--> 158         raise ValueError(
    159             f"Key {k} could be found in both .{idxdim}_names and .{coldim}.columns"
    160         )

ValueError: Key var_id could be found in both .var_names and .obs.columns

Repeats in var_names

When there are repeats in var_names (pretty frequent occurence), getting a dataframe with keys that aren't repeated. I think it's fine for this to work. I do think it should error if the key is one values that is duplicated in the index.

adata = sc.AnnData(
    X=np.ones((2, 3)),
    obs=pd.DataFrame(index=["cell-0", "cell-1"]),
    var=pd.DataFrame(index=["gene-0", "gene-0", "gene-1"]),
)
sc.get.obs_df(adata, ["gene-1"])

This PR (errors)

---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
<ipython-input-62-405d671e2970> in <module>
----> 1 sc.get.obs_df(adata, ["a", "gene-1"])

~/github/scanpy/scanpy/get.py in obs_df(adata, keys, obsm_keys, layer, gene_symbols, use_raw)
    213             var_idx = adata.raw.var_names.get_indexer(var_names)
    214         else:
--> 215             var_idx = adata.var_names.get_indexer(var_names)
    216 
    217         # for backed AnnData is important that the indices are ordered

/usr/local/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_indexer(self, target, method, limit, tolerance)
   3169 
   3170         if not self.is_unique:
-> 3171             raise InvalidIndexError(
   3172                 "Reindexing only valid with uniquely valued Index objects"
   3173             )

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

1.6 (suceeds)

        gene-1
cell-0     1.0
cell-1     1.0

1.6 does error if I use "gene-0" as a key, but the error message could definitley be better.

What should we do about this?

My current inclination is to revert most changes to obs_df and var_df from this PR and #1499. This should leave the use of indices as groupby untouched. Also, the loss of perfomance from reverting #1499 should be partially mitigated by improvements in pandas (see pandas-dev/pandas#37954). We would keep all the user facing changes, and all the tests from both PRs. We can then make a release now, and can patch in performance boosts during the release cycle.

Do you agree with this assessment? If not, could you propose an alternative?

@fidelram
Copy link
Collaborator Author

@ivirshup regarding the three comments:

  • adata.obs.columns non_unique: I also think this should be an error as this will cause trouble at some point or another. I added a check for this which will raise a ValueError if duplicates are found
  • adata.var.index Here I wanted to raise an error only if a key matched a duplicated var_name as you suggested. However, the selection by index, instead of var_name, from the matrix causes an error whenever the var_names contains duplicates. Actually, doing an AnnData selection when the var_names are not unique also raises an error. Thus, I added a ValueError if the var_names are not unique
adata = sc.AnnData(
    X=np.ones((2, 3)),
    obs=pd.DataFrame(index=["cell-0", "cell-1"]),
    var=pd.DataFrame(index=["gene-0", "gene-0", "gene-1"]),
)
adata[:, ['gene-1']]
--------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
<ipython-input-58-8d1a96772653> in <module>
----> 1 adata[:, ['gene-0']]

site-packages/anndata/_core/anndata.py in __getitem__(self, index)
   1085     def __getitem__(self, index: Index) -> "AnnData":
   1086         """Returns a sliced view of the object."""
-> 1087         oidx, vidx = self._normalize_indices(index)
   1088         return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
   1089 

site-packages/anndata/_core/anndata.py in _normalize_indices(self, index)
   1066 
   1067     def _normalize_indices(self, index: Optional[Index]) -> Tuple[slice, slice]:
-> 1068         return _normalize_indices(index, self.obs_names, self.var_names)
   1069 
   1070     # TODO: this is not quite complete...

site-packages/anndata/_core/index.py in _normalize_indices(index, names0, names1)
     33     ax0, ax1 = unpack_index(index)
     34     ax0 = _normalize_index(ax0, names0)
---> 35     ax1 = _normalize_index(ax1, names1)
     36     return ax0, ax1
     37 

site-packages/anndata/_core/index.py in _normalize_index(indexer, index)
     95             return positions  # np.ndarray[int]
     96         else:  # indexer should be string array
---> 97             positions = index.get_indexer(indexer)
     98             if np.any(positions < 0):
     99                 not_found = indexer[positions < 0]

site-packages/pandas/core/indexes/base.py in get_indexer(self, target, method, limit, tolerance)
   3169 
   3170         if not self.is_unique:
-> 3171             raise InvalidIndexError(
   3172                 "Reindexing only valid with uniquely valued Index objects"
   3173             )

InvalidIndexError: Reindexing only valid with uniquely valued Index objects
  • key in both adata.obs.columns and adata.var.index: I changed this to raise a ValueError.

I though about reverting back to the original implementation as you suggest but this will not work with _anndata._prepare_dataframe as I introduced some changes to work with this function.

The just added changes should mimic the response from 1.6 except for duplicate names in var_names which I think should respond similarly like when doing a slicing on the AnnData object.

I added new tests based on your examples.

I added checks to test for unique adata.obs.columns

@ivirshup
Copy link
Member

ivirshup commented Feb 1, 2021

@fidelram, what are the changes which are incompatible with _anndata._prepare_dataframe? I've tried reverting dealing with X and those seemed to work fine for me.

@ivirshup
Copy link
Member

ivirshup commented Feb 1, 2021

The just added changes should mimic the response from 1.6 except for duplicate names in var_names which I think should respond similarly like when doing a slicing on the AnnData object.

Thinking about this more. Considering that no one has complained about this so far. I think I'm actually fine with this being an error. If there are complaints, I think we should change it back.

I do think it's important that gene_symbols can have duplicates as long as those values aren't being accessed (as non-unique identifiers is the whole point of gene_symbols).

@ivirshup ivirshup merged commit 81dbae9 into master Feb 2, 2021
@ivirshup
Copy link
Member

ivirshup commented Feb 2, 2021

@meeseeksdev backport to 1.7.x

meeseeksmachine pushed a commit to meeseeksmachine/scanpy that referenced this pull request Feb 2, 2021
ivirshup pushed a commit that referenced this pull request Feb 2, 2021
Co-authored-by: Fidel Ramirez <fidel.ramirez@gmail.com>
@ivirshup ivirshup mentioned this pull request Feb 2, 2021
ivirshup added a commit to ivirshup/scanpy that referenced this pull request Feb 3, 2021
ivirshup added a commit that referenced this pull request Feb 3, 2021
* Release note for #1583 and update release date

* Swap travis badge for azure
meeseeksmachine pushed a commit to meeseeksmachine/scanpy that referenced this pull request Feb 3, 2021
ivirshup added a commit that referenced this pull request Feb 3, 2021
)

Co-authored-by: Isaac Virshup <ivirshup@gmail.com>
@flying-sheep flying-sheep deleted the groupby_index2 branch October 30, 2023 13:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants