Incorrect merging of pd.Series in AnnCollection #1352

Open
3 tasks done
ordabayevy opened this issue Feb 1, 2024 · 1 comment

ordabayevy commented Feb 1, 2024

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

Problem:

I create an AnnCollection without dtype harmonization (harmonize_dtypes=False) and then access a categorical obs column, e.g. disease (example below) or cell_type. Indexing cells from a single AnnData object and then accessing the column works as expected and returns a categorical pd.Series. However, when the indexed cells come from multiple AnnData objects, accessing the column returns a NumPy array with dtype=object. Looking at the source code, the problem seems to lie in the concat_arrays function, which has no logic for handling pd.Series inputs:

def concat_arrays(arrays, reindexers, axis=0, index=None, fill_value=None):
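The dtype loss can be reproduced with pandas and NumPy alone, which may help pinpoint the missing branch. The sketch below is not anndata's code: `concat_series` is a hypothetical helper, and the two one-row Series are made-up stand-ins for obs columns. It shows one way a pd.Series case could keep the categorical dtype, using `pandas.api.types.union_categoricals`.

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals


def concat_series(series_list):
    """Hypothetical helper: concatenate Series while keeping a categorical dtype."""
    if all(isinstance(s.dtype, pd.CategoricalDtype) for s in series_list):
        # Merge the category sets instead of falling back to object dtype.
        values = union_categoricals([s.array for s in series_list])
    else:
        values = np.concatenate([s.to_numpy() for s in series_list])
    # Stitch the per-Series indices (cell names) back together.
    index = series_list[0].index.append([s.index for s in series_list[1:]])
    return pd.Series(values, index=index, name=series_list[0].name)


# Stand-in data (assumption): one cell from each of two AnnData objects.
cats = ["COVID-19", "Healthy"]
a = pd.Series(pd.Categorical(["COVID-19"], categories=cats), index=["cell_a"], name="disease")
b = pd.Series(pd.Categorical(["Healthy"], categories=cats), index=["cell_b"], name="disease")

naive = np.concatenate([a, b])   # plain NumPy concatenation: object ndarray
merged = concat_series([a, b])   # categorical Series with the original index

print(naive.dtype, merged.dtype)
```

Whether a fix belongs in concat_arrays itself or in the code that calls it is for the maintainers to decide; the sketch only illustrates that the information needed to preserve the dtype is available at concatenation time.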

Code:

import gdown
import scanpy as sc
from anndata.experimental.multi_files import AnnCollection

# the data is from this scvi reproducibility notebook
# https://yoseflab.github.io/scvi-tools-reproducibility/scarches_totalvi_seurat_data/
gdown.download(
    url="https://drive.google.com/uc?id=1JgaXNwNeoEqX7zJL-jJD3cfXDGurMrq9", output="covid_cite.h5ad", quiet=False
)

covid = sc.read("covid_cite.h5ad")

dataset = AnnCollection([covid, covid], join_obs=None, join_obsm=None, join_vars=None, harmonize_dtypes=False)

dataset[0].obs["disease"]
# expected result
# AAACCCACACCAGCGT-1    COVID-19
# Name: disease, dtype: category
# Categories (2, object): ['COVID-19', 'Healthy']

dataset[[0, 60000]].obs["disease"]
# unexpected result
# array(['COVID-19', 'COVID-19'], dtype=object)
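Until this is handled inside AnnCollection, one possible workaround is to re-cast the returned array using the categories of the source column. This is a minimal sketch with stand-in data, not a tested recipe for the dataset above: `source` stands in for `covid.obs["disease"]`, and `returned` for the object array shown in the unexpected result.

```python
import numpy as np
import pandas as pd

# Stand-ins (assumptions) for the source obs column and the array returned by the collection.
source = pd.Series(pd.Categorical(["COVID-19", "Healthy", "COVID-19"]), name="disease")
returned = np.array(["COVID-19", "COVID-19"], dtype=object)

# Rebuild a categorical with the same category set as the source column.
restored = pd.Categorical(returned, categories=source.cat.categories)
print(restored)
```

Note that this restores only the dtype; the cell names that a pd.Series index would carry are not recovered this way.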

Versions

-----
anndata             0.10.5.post1
session_info        1.0.0
-----
cython_runtime      NA
dateutil            2.8.2
exceptiongroup      1.1.3
google              NA
h5py                3.10.0
natsort             8.4.0
numpy               1.26.1
packaging           23.2
pandas              2.1.1
pyarrow             15.0.0
pynvml              NA
pytz                2023.3.post1
scipy               1.11.3
six                 1.16.0
sphinxcontrib       NA
torch               2.2.0+cu121
torchgen            NA
tqdm                4.66.1
typing_extensions   NA
zoneinfo            NA
-----
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
Linux-4.19.0-26-cloud-amd64-x86_64-with-glibc2.28
-----
Session information updated at 2024-02-01 16:10

github-actions bot commented Apr 2, 2024

This issue has been automatically marked as stale because it has not had recent activity.
Please add a comment if you want to keep the issue open. Thank you for your contributions!

github-actions bot added the stale label Apr 2, 2024