Allow plots to use adata.obs index as groupby #1583

fidelram · 2021-01-13T12:29:55Z

I simplified the _prepare_dataframe code by using sc.get.df. However, this change uncovered two issues with sc.get.obs_df that I have now addressed in this PR.

The most relevant is the case when the call to sc.get.obs_df contains keys with duplicates (e.g. keys=['gene1', 'gene1']). This case is not rare as for example in sc.pl.dotplot the same gene can be visualized several times, which requires calling sc.get.obs_df with keys that contain duplications. An example is when sc.pl.rank_genes_groups_dotplot is called and, frequently, the same gene appears as top up-regulated for more than one category. To address this, sc.get.obs_df removes all duplicates (which correspond to DataFrame columns) and after the DataFrame is complete, the duplicates are added back.

A second problem was for non-unique adata.obs indices which should be a rare situation. However, it turns out that one of the test adata object used in test_plotting have this issue. Also, for the goal of this PR (allow adata.obs.index as groupby option) it could be expected that the index may not be unique.

In general, non unique obs indices are ok as long as .obs DataFrame is not joined or merge based on index. However, because internally in sc.get.obs_df the DataFrames are merged using adata.obs.index this non-unique indices caused an increase in rows due to multiple matching. To fix this, the code now checks for unique index, and if it is not unique then a temporary index is added to allow proper join operations and then the non-unique index is put back.

…issue with duplicated key names.

…a.obs index column as valid groupby category.

ivirshup

Big point: don't modify the AnnData object when you're getting values out of it.

Otherwise, mostly good.

Needs docs and examples though.

scanpy/plotting/_anndata.py

scanpy/get.py

scanpy/plotting/_anndata.py

scanpy/plotting/_baseplot_class.py

scanpy/get.py

scanpy/tests/test_plotting.py

…ique index.

ivirshup

Looking much cleaner! I'm really happy with the simplification of _prepare_dataframe.

I've realized I missed a few things with obsdf though:

Shouldn't var_df should get similar updates to obs_df?
Could we get tests for get.obs_df/ get.var_df for the issues you addressed here (repeated indices)?
I've realized this will need a backport, but that can be a separate pr to the 1.7.x branch where you just copy and paste the new functions

scanpy/plotting/_anndata.py

fidelram · 2021-01-22T09:11:45Z

Shouldn't var_df should get similar updates to obs_df?

I would suggest a different PR to address this.

Could we get tests for get.obs_df/ get.var_df for the issues you addressed here (repeated indices)?

Sure, I added new tests to get.obs_df to check duplicated keys.

ivirshup

Sure, I added new tests to get.obs_df to check duplicated keys.

Oh, I meant the repeated index names that were giving you problems with DataFrame.join. This issue you mentioned before:

However, because internally in sc.get.obs_df the DataFrames are merged using adata.obs.index this non-unique indices caused an increase in rows due to multiple matching.

Here is a test you could use:

adata = sc.AnnData(
    np.arange(16).reshape(4, 4),
    obs=pd.DataFrame(index=["a", "a", "b", "c"]),
    var=pd.DataFrame(index=[f"gene{i}" for i in range(4)]),
)
df = sc.get.obs_df(adata, ["gene1"])
pd.testing.assert_index_equal(df.index, adata.obs_names)

Shouldn't var_df should get similar updates to obs_df?

I would suggest a different PR to address this.

Just because the code changes should be largely equivalent (as should the tests) I think it will be easier to review if these changes are together in one PR.

scanpy/plotting/_anndata.py

fidelram · 2021-01-27T06:58:35Z

@ivirshup I think this PR is ready to merge. readthedocs is failing because of a missing link from scvelo unrelated to the PR

…ated keys

ivirshup · 2021-01-30T06:38:12Z

Fidel, sorry to say, but I've run into some issues. Most of these actually didn't have to do with this PR, but were additional things that broke from #1499. I'll give you a few examples of what I've found, mostly by contrast with the current behaviour of 1.6.1.

import scanpy as sc, pandas as pd, numpy as np

M, N = (5, 3)
adata = sc.AnnData(
    X=np.zeros((M, N)),
    obs=pd.DataFrame(
        np.arange(M * 3).reshape((M, 3)),
        columns=["repeated_col", "repeated_col", "var_id"],
        index=pd.Index([f"cell_{i}" for i in range(M)], name="obs_index"),
    ),
    var=pd.DataFrame(
        index=pd.Index(["var_id"] + [f"gene_{i}" for i in range(N-1)], name="var_index"),
    ),
)

Repeated column in `adata.obs`

I think this should be an error. This is because downstream functions (like plotting) currently assume that for each key input here, there will be one output column. Turns out this isn't exactly pandas behaviour with repeated column values, but I do think it's reasonable.

M, N = 5, 3
adata = sc.AnnData(
    X=np.zeros((M, N)),
    obs=pd.DataFrame(
        np.arange(M * 2).reshape((M, 2)),
        columns=["repeated_col", "repeated_col"],
        index=[f"cell_{i}" for i in range(M)],
    ),
    var=pd.DataFrame(
        index=[f"gene_{i}" for i in range(N)],
    ),   
)
sc.get.obs_df(adata, ["repeated_col"])

This pr (gets both columns)

           repeated_col  repeated_col
obs_index                            
cell_0                0             1
cell_1                3             4
cell_2                6             7
cell_3                9            10
cell_4               12            13

1.6 (errors)

~/miniconda3/envs/scanpy-1.6/lib/python3.8/site-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
    140 
    141         if self._validate_ndim and self.ndim and len(self.mgr_locs) != len(self.values):
--> 142             raise ValueError(
    143                 f"Wrong number of items passed {len(self.values)}, "
    144                 f"placement implies {len(self.mgr_locs)}"

ValueError: Wrong number of items passed 2, placement implies 1

Not a great error, could definitley be improved.

Key in adata.obs.columns and adata.var_names

In this case, the key is ambiguous (should it get the gene values or the column from obs?). I think this means it should error. I feel like this point has been discussed a number of times, but doesn't seem to have been discussed when this behaviour was changed.

M, N = 5, 3
adata = sc.AnnData(
    X=np.zeros((M, N)),
    obs=pd.DataFrame(
        np.arange(M),
        columns=["var_id"],
        index=[f"cell_{i}" for i in range(M)],
    ),
    var=pd.DataFrame(
        index=["var_id"] + [f"gene_{i}" for i in range(N-1)],
    ),   
)
sc.get.obs_df(adata, ["var_id"])

This pr (warns)

/Users/isaac/github/scanpy/scanpy/get.py:177: UserWarning: The key `var_id` is found in both adata.obs and adata.var_names.Only the adata.obs key will be used.
  warnings.warn(
Out[58]: 
           var_id
obs_index        
cell_0          2
cell_1          5
cell_2          8
cell_3         11
cell_4         14

1.6 (errors)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-69be169f6a4f> in <module>
----> 1 sc.get.obs_df(adata, ["var_id"])

~/miniconda3/envs/scanpy-1.6/lib/python3.8/site-packages/scanpy/get.py in obs_df(adata, keys, obsm_keys, layer, gene_symbols, use_raw)
    171     for k, l in zip(keys, lookup_keys):
    172         if not use_raw or k in adata.obs.columns:
--> 173             df[k] = adata.obs_vector(l, layer=layer)
    174         else:
    175             df[k] = adata.raw.obs_vector(l)

~/miniconda3/envs/scanpy-1.6/lib/python3.8/site-packages/anndata/_core/anndata.py in obs_vector(self, k, layer)
   1362                 )
   1363                 layer = None
-> 1364         return get_vector(self, k, "obs", "var", layer=layer)
   1365 
   1366     def var_vector(self, k, *, layer: Optional[str] = None) -> np.ndarray:

~/miniconda3/envs/scanpy-1.6/lib/python3.8/site-packages/anndata/_core/index.py in get_vector(adata, k, coldim, idxdim, layer)
    156 
    157     if (in_col + in_idx) == 2:
--> 158         raise ValueError(
    159             f"Key {k} could be found in both .{idxdim}_names and .{coldim}.columns"
    160         )

ValueError: Key var_id could be found in both .var_names and .obs.columns

Repeats in var_names

When there are repeats in var_names (pretty frequent occurence), getting a dataframe with keys that aren't repeated. I think it's fine for this to work. I do think it should error if the key is one values that is duplicated in the index.

adata = sc.AnnData(
    X=np.ones((2, 3)),
    obs=pd.DataFrame(index=["cell-0", "cell-1"]),
    var=pd.DataFrame(index=["gene-0", "gene-0", "gene-1"]),
)
sc.get.obs_df(adata, ["gene-1"])

This PR (errors)

---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
<ipython-input-62-405d671e2970> in <module>
----> 1 sc.get.obs_df(adata, ["a", "gene-1"])

~/github/scanpy/scanpy/get.py in obs_df(adata, keys, obsm_keys, layer, gene_symbols, use_raw)
    213             var_idx = adata.raw.var_names.get_indexer(var_names)
    214         else:
--> 215             var_idx = adata.var_names.get_indexer(var_names)
    216 
    217         # for backed AnnData is important that the indices are ordered

/usr/local/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_indexer(self, target, method, limit, tolerance)
   3169 
   3170         if not self.is_unique:
-> 3171             raise InvalidIndexError(
   3172                 "Reindexing only valid with uniquely valued Index objects"
   3173             )

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

1.6 (suceeds)

        gene-1
cell-0     1.0
cell-1     1.0

1.6 does error if I use "gene-0" as a key, but the error message could definitley be better.

What should we do about this?

My current inclination is to revert most changes to obs_df and var_df from this PR and #1499. This should leave the use of indices as groupby untouched. Also, the loss of perfomance from reverting #1499 should be partially mitigated by improvements in pandas (see pandas-dev/pandas#37954). We would keep all the user facing changes, and all the tests from both PRs. We can then make a release now, and can patch in performance boosts during the release cycle.

Do you agree with this assessment? If not, could you propose an alternative?

fidelram · 2021-01-30T14:06:02Z

@ivirshup regarding the three comments:

adata.obs.columns non_unique: I also think this should be an error as this will cause trouble at some point or another. I added a check for this which will raise a ValueError if duplicates are found
adata.var.index Here I wanted to raise an error only if a key matched a duplicated var_name as you suggested. However, the selection by index, instead of var_name, from the matrix causes an error whenever the var_names contains duplicates. Actually, doing an AnnData selection when the var_names are not unique also raises an error. Thus, I added a ValueError if the var_names are not unique

adata = sc.AnnData(
    X=np.ones((2, 3)),
    obs=pd.DataFrame(index=["cell-0", "cell-1"]),
    var=pd.DataFrame(index=["gene-0", "gene-0", "gene-1"]),
)
adata[:, ['gene-1']]

--------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
<ipython-input-58-8d1a96772653> in <module>
----> 1 adata[:, ['gene-0']]

site-packages/anndata/_core/anndata.py in __getitem__(self, index)
   1085     def __getitem__(self, index: Index) -> "AnnData":
   1086         """Returns a sliced view of the object."""
-> 1087         oidx, vidx = self._normalize_indices(index)
   1088         return AnnData(self, oidx=oidx, vidx=vidx, asview=True)
   1089 

site-packages/anndata/_core/anndata.py in _normalize_indices(self, index)
   1066 
   1067     def _normalize_indices(self, index: Optional[Index]) -> Tuple[slice, slice]:
-> 1068         return _normalize_indices(index, self.obs_names, self.var_names)
   1069 
   1070     # TODO: this is not quite complete...

site-packages/anndata/_core/index.py in _normalize_indices(index, names0, names1)
     33     ax0, ax1 = unpack_index(index)
     34     ax0 = _normalize_index(ax0, names0)
---> 35     ax1 = _normalize_index(ax1, names1)
     36     return ax0, ax1
     37 

site-packages/anndata/_core/index.py in _normalize_index(indexer, index)
     95             return positions  # np.ndarray[int]
     96         else:  # indexer should be string array
---> 97             positions = index.get_indexer(indexer)
     98             if np.any(positions < 0):
     99                 not_found = indexer[positions < 0]

site-packages/pandas/core/indexes/base.py in get_indexer(self, target, method, limit, tolerance)
   3169 
   3170         if not self.is_unique:
-> 3171             raise InvalidIndexError(
   3172                 "Reindexing only valid with uniquely valued Index objects"
   3173             )

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

key in both adata.obs.columns and adata.var.index: I changed this to raise a ValueError.

I though about reverting back to the original implementation as you suggest but this will not work with _anndata._prepare_dataframe as I introduced some changes to work with this function.

The just added changes should mimic the response from 1.6 except for duplicate names in var_names which I think should respond similarly like when doing a slicing on the AnnData object.

I added new tests based on your examples.

I added checks to test for unique adata.obs.columns

ivirshup · 2021-02-01T07:05:01Z

@fidelram, what are the changes which are incompatible with _anndata._prepare_dataframe? I've tried reverting dealing with X and those seemed to work fine for me.

ivirshup · 2021-02-01T07:39:17Z

The just added changes should mimic the response from 1.6 except for duplicate names in var_names which I think should respond similarly like when doing a slicing on the AnnData object.

Thinking about this more. Considering that no one has complained about this so far. I think I'm actually fine with this being an error. If there are complaints, I think we should change it back.

I do think it's important that gene_symbols can have duplicates as long as those values aren't being accessed (as non-unique identifiers is the whole point of gene_symbols).

ivirshup · 2021-02-02T02:18:44Z

@meeseeksdev backport to 1.7.x

Co-authored-by: Fidel Ramirez <fidel.ramirez@gmail.com>

* Release note for #1583 and update release date * Swap travis badge for azure

…lease date

) Co-authored-by: Isaac Virshup <ivirshup@gmail.com>

fidelram added 2 commits January 13, 2021 12:20

Fix problem with pandas join when adata.obs index is not unique. Fix …

ad9d345

…issue with duplicated key names.

simplify _prepare_dataframe by using sc.get.obs_df. Allow use of adat…

c7b8844

…a.obs index column as valid groupby category.

fidelram requested a review from ivirshup January 13, 2021 13:19

empty commit to triggir tests

597a205

ivirshup requested changes Jan 17, 2021

View reviewed changes

fidelram added 5 commits January 19, 2021 20:21

use pd.concat instead of pd.join as this avoids the need to have a un…

b9218ab

…ique index.

remove modification to adata in _prepare_dataframe

8dbc944

add group by index test

6d8a558

change error for warning when too many categories will be plotted

7e34a1a

black changes

b929abf

ivirshup requested changes Jan 22, 2021

View reviewed changes

address review comments

25d1f9e

ivirshup requested changes Jan 24, 2021

View reviewed changes

scanpy/plotting/_anndata.py Outdated Show resolved Hide resolved

scanpy/plotting/_anndata.py Outdated Show resolved Hide resolved

fidelram added 2 commits January 26, 2021 17:56

review changes

6a0ba7b

fix error when index is not category

0a67e8b

ivirshup mentioned this pull request Jan 27, 2021

var_names not being found by during plotting #1406

Closed

changed df.join for pd.concat in sc.get.var_df. Added test for repe…

3ffb01f

…ated keys

fidelram added 2 commits January 30, 2021 14:34

check that adata.obs.columns and adata.var.index are unique

4fc132e

added non unique var.index and non unique obs.columns tests

62a0b8d

Trigger CI

075dc2c

ivirshup added 2 commits February 1, 2021 19:13

Throw error if gene_symbol duplicated when accessed

3bb2bdf

Formatting for test_get

bde5a15

ivirshup merged commit 81dbae9 into master Feb 2, 2021

meeseeksmachine pushed a commit to meeseeksmachine/scanpy that referenced this pull request Feb 2, 2021

Backport PR scverse#1583: Allow plots to use adata.obs index as groupby

580dac7

meeseeksmachine mentioned this pull request Feb 2, 2021

Backport PR #1583 on branch 1.7.x (Allow plots to use adata.obs index as groupby) #1617

Merged

ivirshup pushed a commit that referenced this pull request Feb 2, 2021

Backport PR #1583: Allow plots to use adata.obs index as groupby (#1617)

fad82a0

Co-authored-by: Fidel Ramirez <fidel.ramirez@gmail.com>

ivirshup mentioned this pull request Feb 2, 2021

Get cleanup #1621

Merged

ivirshup added a commit to ivirshup/scanpy that referenced this pull request Feb 3, 2021

Release note for scverse#1583 and update release date

94c2bee

ivirshup added a commit that referenced this pull request Feb 3, 2021

Release note for #1583 and update release date (#1623)

de0dd7f

* Release note for #1583 and update release date * Swap travis badge for azure

meeseeksmachine pushed a commit to meeseeksmachine/scanpy that referenced this pull request Feb 3, 2021

Backport PR scverse#1623: Release note for scverse#1583 and update re…

c8165e8

…lease date

meeseeksmachine mentioned this pull request Feb 3, 2021

Backport PR #1623 on branch 1.7.x (Release note for #1583 and update release date) #1624

Merged

ivirshup added a commit that referenced this pull request Feb 3, 2021

Backport PR #1623: Release note for #1583 and update release date (#1624

242ada2

) Co-authored-by: Isaac Virshup <ivirshup@gmail.com>

flying-sheep deleted the groupby_index2 branch October 30, 2023 13:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow plots to use adata.obs index as groupby #1583

Allow plots to use adata.obs index as groupby #1583

fidelram commented Jan 13, 2021 •

edited

Loading

ivirshup left a comment •

edited

Loading

ivirshup left a comment

fidelram commented Jan 22, 2021

ivirshup left a comment

fidelram commented Jan 27, 2021

ivirshup commented Jan 30, 2021

fidelram commented Jan 30, 2021

ivirshup commented Feb 1, 2021

ivirshup commented Feb 1, 2021

ivirshup commented Feb 2, 2021

Allow plots to use adata.obs index as groupby #1583

Allow plots to use adata.obs index as groupby #1583

Conversation

fidelram commented Jan 13, 2021 • edited Loading

ivirshup left a comment • edited Loading

Choose a reason for hiding this comment

ivirshup left a comment

Choose a reason for hiding this comment

fidelram commented Jan 22, 2021

ivirshup left a comment

Choose a reason for hiding this comment

fidelram commented Jan 27, 2021

ivirshup commented Jan 30, 2021

Repeated column in adata.obs

This pr (gets both columns)

1.6 (errors)

Key in adata.obs.columns and adata.var_names

This pr (warns)

1.6 (errors)

Repeats in var_names

This PR (errors)

1.6 (suceeds)

What should we do about this?

fidelram commented Jan 30, 2021

ivirshup commented Feb 1, 2021

ivirshup commented Feb 1, 2021

ivirshup commented Feb 2, 2021

fidelram commented Jan 13, 2021 •

edited

Loading

ivirshup left a comment •

edited

Loading

Repeated column in `adata.obs`