Allow plots to use adata.obs index as groupby #1583
…issue with duplicated key names.
…a.obs index column as valid groupby category.
Big point: don't modify the AnnData object when you're getting values out of it.
Otherwise, mostly good.
Needs docs and examples though.
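As an illustration of the "don't modify the object" point, here is a minimal pandas-only sketch. The helper name `with_index_column` is hypothetical and is not part of scanpy; it only shows one way to expose an index as a regular column (for example, to use it as a groupby key) without touching the caller's DataFrame.

```python
import pandas as pd


def with_index_column(obs: pd.DataFrame, name: str) -> pd.DataFrame:
    """Return a copy of `obs` with its index also available as column `name`.

    Hypothetical helper for illustration only; the input frame is not modified.
    """
    out = obs.copy()  # work on a copy so the caller's data stays untouched
    out[name] = out.index
    return out


obs = pd.DataFrame({"louvain": ["0", "1", "0"]}, index=["c0", "c1", "c2"])
result = with_index_column(obs, "obs_index")
print(list(obs.columns))     # the original frame has no new column
print(list(result.columns))  # the copy has the extra column
```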
Looking much cleaner! I'm really happy with the simplification of _prepare_dataframe.
I've realized I missed a few things with obs_df though:
- Shouldn't var_df get similar updates to obs_df?
- Could we get tests for get.obs_df/get.var_df for the issues you addressed here (repeated indices)?
- I've realized this will need a backport, but that can be a separate PR to the 1.7.x branch where you just copy and paste the new functions.
I would suggest a different PR to address this.
Sure, I added new tests to get.obs_df to check duplicated keys.
Oh, I meant the repeated index names that were giving you problems with DataFrame.join. This issue you mentioned before:
However, because sc.get.obs_df internally merges the DataFrames using adata.obs.index, these non-unique indices caused an increase in rows due to multiple matching.
Here is a test you could use:
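import numpy as np
import pandas as pd
import scanpy as sc

# obs_names contain a repeated label ("a")
adata = sc.AnnData(
    np.arange(16).reshape(4, 4),
    obs=pd.DataFrame(index=["a", "a", "b", "c"]),
    var=pd.DataFrame(index=[f"gene{i}" for i in range(4)]),
)
df = sc.get.obs_df(adata, ["gene1"])
# The result must keep one row per observation, in the original order
pd.testing.assert_index_equal(df.index, adata.obs_names)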
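To see why the repeated index names matter, here is a small pandas-only sketch of the row multiplication. The frames `obs` and `expr` are made up for illustration (the `gene1` values mirror column 1 of the 4x4 matrix in the test above); this is not scanpy's internal code.

```python
import pandas as pd

# Left frame with a repeated index label, like obs_names ["a", "a", "b", "c"].
obs = pd.DataFrame({"group": ["x", "y", "z", "w"]}, index=["a", "a", "b", "c"])
# Right frame sharing the same non-unique index (e.g. expression values).
expr = pd.DataFrame({"gene1": [1, 5, 9, 13]}, index=["a", "a", "b", "c"])

joined = obs.join(expr)
# Each "a" row on the left matches both "a" rows on the right, so the
# result has 2 * 2 + 1 + 1 = 6 rows instead of the expected 4.
print(len(obs), len(joined))
```

This is the situation the suggested test guards against: after the fix, the returned index should equal `adata.obs_names` exactly.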
Shouldn't var_df get similar updates to obs_df?
I would suggest a different PR to address this.
Just because the code changes should be largely equivalent (as should the tests) I think it will be easier to review if these changes are together in one PR.
@ivirshup I think this PR is ready to merge. readthedocs is failing because of a missing link from scvelo, which is unrelated to the PR.
Fidel, sorry to say, but I've run into some issues. Most of these actually didn't have to do with this PR, but were additional things that broke from #1499. I'll give you a few examples of what I've found, mostly by contrast with the current behaviour of 1.6.1.
import numpy as np
import pandas as pd
import scanpy as sc

M, N = (5, 3)
adata = sc.AnnData(
    X=np.zeros((M, N)),
    obs=pd.DataFrame(
        np.arange(M * 3).reshape((M, 3)),
        columns=["repeated_col", "repeated_col", "var_id"],
        index=pd.Index([f"cell_{i}" for i in range(M)], name="obs_index"),
    ),
    var=pd.DataFrame(
        index=pd.Index(["var_id"] + [f"gene_{i}" for i in range(N - 1)], name="var_index"),
    ),
)
Repeated column in
@ivirshup regarding the three comments:
adata = sc.AnnData(
    X=np.ones((2, 3)),
    obs=pd.DataFrame(index=["cell-0", "cell-1"]),
    var=pd.DataFrame(index=["gene-0", "gene-0", "gene-1"]),
)
adata[:, ['gene-1']]
I thought about reverting back to the original implementation as you suggest but this will not work with
The just added changes should mimic the response from 1.6 except for duplicate names in var_names which I think should respond similarly like when doing a slicing on the
I added new tests based on your examples. I added checks to test for unique adata.obs.columns
@fidelram, what are the changes which are incompatible with
Thinking about this more: considering that no one has complained about this so far, I think I'm actually fine with this being an error. If there are complaints, I think we should change it back. I do think it's important that
@meeseeksdev backport to 1.7.x
Co-authored-by: Fidel Ramirez <fidel.ramirez@gmail.com>
* Release note for #1583 and update release date
* Swap travis badge for azure
I simplified the _prepare_dataframe code by using sc.get.df. However, this change uncovered two issues with sc.get.obs_df that I have now addressed in this PR.
The most relevant is the case when the call to sc.get.obs_df contains keys with duplicates (e.g. keys=['gene1', 'gene1']). This case is not rare: in sc.pl.dotplot, for example, the same gene can be visualized several times, which requires calling sc.get.obs_df with keys that contain duplicates. An example is when sc.pl.rank_genes_groups_dotplot is called and, as frequently happens, the same gene appears as the top up-regulated gene for more than one category. To address this, sc.get.obs_df removes all duplicates (which correspond to DataFrame columns) and, after the DataFrame is complete, adds the duplicates back.
A second problem was non-unique adata.obs indices, which should be a rare situation. However, it turns out that one of the test AnnData objects used in test_plotting has this issue. Also, given the goal of this PR (allow adata.obs.index as a groupby option), it could be expected that the index may not be unique.
In general, non-unique obs indices are OK as long as the .obs DataFrame is not joined or merged based on its index. However, because sc.get.obs_df internally merges the DataFrames using adata.obs.index, these non-unique indices caused an increase in rows due to multiple matching. To fix this, the code now checks whether the index is unique; if it is not, a temporary index is added to allow proper join operations, and then the non-unique index is put back.
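The two fixes described above can be sketched with plain pandas. This is only an illustration under stated assumptions, not the code merged in this PR: the helper names `expand_duplicate_keys` and `join_on_position` are hypothetical, and the join helper assumes both frames have the same rows in the same order.

```python
import pandas as pd


def expand_duplicate_keys(fetch, keys):
    """Fetch each distinct key once, then re-select so repeats reappear.

    `fetch` is any callable returning a DataFrame with one column per key
    (a hypothetical stand-in for the real lookup).
    """
    unique_keys = list(dict.fromkeys(keys))  # drop repeats, keep first-seen order
    df = fetch(unique_keys)
    return df[list(keys)]  # selecting with repeated labels restores duplicates


def join_on_position(left, right):
    """Join two row-aligned frames even if their shared index is not unique."""
    if left.index.is_unique:
        return left.join(right)
    original_index = left.index
    # Temporary unique index (0..n-1) so that join matches rows one to one.
    joined = left.reset_index(drop=True).join(right.reset_index(drop=True))
    joined.index = original_index  # put the non-unique index back
    return joined


obs = pd.DataFrame({"group": ["x", "y", "z", "w"]}, index=["a", "a", "b", "c"])
expr = pd.DataFrame({"gene1": [1, 5, 9, 13]}, index=["a", "a", "b", "c"])

joined = join_on_position(obs, expr)
wide = expand_duplicate_keys(lambda ks: joined[ks], ["gene1", "group", "gene1"])
print(joined.shape)
print(list(wide.columns))
```

Joining on a temporary positional index keeps the row count equal to the number of observations, and re-selecting columns by the original key list is a cheap way to restore duplicated keys after the lookup.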