Incorrect merging of `pd.Series` in `AnnCollection` #1352

ordabayevy · 2024-02-01T16:16:42Z

Please make sure these conditions are met

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of anndata.
(optional) I have confirmed this bug exists on the master branch of anndata.

Report

Problem:

I create an AnnCollection without harmonization and then access a categorical obs column, e.g. disease (example below) or cell_type. Indexing cells from a single anndata object and then accessing the attribute works as expected (it returns a categorical pd.Series). However, if the indexed cells come from multiple anndata objects, accessing the attribute returns a numpy array with dtype=object. Looking at the source code, the problem seems to lie in the concat_arrays function, which has no logic for handling pd.Series inputs:

anndata/anndata/_core/merge.py

Line 746 in c790113

def concat_arrays(arrays, reindexers, axis=0, index=None, fill_value=None):
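
The symptom can be illustrated with pandas and numpy alone, without anndata. The sketch below is not a claim about what concat_arrays does internally; it only shows one plausible way the categorical dtype can be lost when Series are concatenated as generic arrays, and that pd.concat keeps it when the categories match. The cell names and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two one-element categorical Series sharing the same categories,
# standing in for the same obs column taken from two anndata objects.
dtype = pd.CategoricalDtype(["COVID-19", "Healthy"])
first = pd.Series(["COVID-19"], index=["cell_a"], dtype=dtype, name="disease")
second = pd.Series(["COVID-19"], index=["cell_b"], dtype=dtype, name="disease")

# Generic array concatenation converts each Series to a plain ndarray,
# so the categorical dtype and the index are dropped.
as_array = np.concatenate([first, second])

# pandas-aware concatenation keeps the Series type, and keeps the
# categorical dtype because both inputs have identical categories.
as_series = pd.concat([first, second])

print(type(as_array).__name__, as_array.dtype)
print(type(as_series).__name__, as_series.dtype)
```

If the inputs had different category sets, pd.concat would also fall back to object dtype, so a fix would probably need to handle that case explicitly.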

Code:

import gdown
import scanpy as sc
from anndata.experimental.multi_files import AnnCollection

# the data is from this scvi reproducibility notebook
# https://yoseflab.github.io/scvi-tools-reproducibility/scarches_totalvi_seurat_data/
gdown.download(
    url="https://drive.google.com/uc?id=1JgaXNwNeoEqX7zJL-jJD3cfXDGurMrq9", output="covid_cite.h5ad", quiet=False
)

covid = sc.read("covid_cite.h5ad")

dataset = AnnCollection([covid, covid], join_obs=None, join_obsm=None, join_vars=None, harmonize_dtypes=False)

dataset[0].obs["disease"]
# expected result
# AAACCCACACCAGCGT-1    COVID-19
# Name: disease, dtype: category
# Categories (2, object): ['COVID-19', 'Healthy']

dataset[[0, 60000]].obs["disease"]
# unexpected result
# array(['COVID-19', 'COVID-19'], dtype=object)
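
Until this is fixed, a possible user-side workaround is to rebuild the categorical from the returned object array, reusing the dtype of the column from one of the source objects. The helper below is hypothetical (restore_categorical is not part of anndata or pandas) and is shown on stand-in data; in the example above, reference would be covid.obs["disease"] and raw would be the array returned by dataset[[0, 60000]].obs["disease"].

```python
import numpy as np
import pandas as pd


def restore_categorical(values, reference, index=None):
    """Hypothetical helper: wrap raw values in a categorical Series.

    The categories (and their order) are taken from `reference`, a
    categorical Series such as the obs column of one source object.
    Values that are not among the reference categories become NaN.
    """
    categorical = pd.Categorical(values, dtype=reference.dtype)
    return pd.Series(categorical, index=index, name=reference.name)


# Stand-in for the column of one source object.
reference = pd.Series(["COVID-19", "Healthy"], dtype="category", name="disease")
# Stand-in for the object array returned when indexing across objects.
raw = np.array(["COVID-19", "COVID-19"], dtype=object)

restored = restore_categorical(raw, reference)
print(restored.dtype)
```

This assumes all source objects share the same categories for the column; if they do not, the categories would have to be merged first (for example with pandas.api.types.union_categoricals).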

Versions

-----
anndata             0.10.5.post1
session_info        1.0.0
-----
cython_runtime      NA
dateutil            2.8.2
exceptiongroup      1.1.3
google              NA
h5py                3.10.0
natsort             8.4.0
numpy               1.26.1
packaging           23.2
pandas              2.1.1
pyarrow             15.0.0
pynvml              NA
pytz                2023.3.post1
scipy               1.11.3
six                 1.16.0
sphinxcontrib       NA
torch               2.2.0+cu121
torchgen            NA
tqdm                4.66.1
typing_extensions   NA
zoneinfo            NA
-----
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
Linux-4.19.0-26-cloud-amd64-x86_64-with-glibc2.28
-----
Session information updated at 2024-02-01 16:10


github-actions · 2024-04-02T02:07:21Z

This issue has been automatically marked as stale because it has not had recent activity.
Please add a comment if you want to keep the issue open. Thank you for your contributions!

ordabayevy added the Bug 🐛 label Feb 1, 2024

ordabayevy mentioned this issue Feb 1, 2024

Logistic Regression train doesn't work with multiple GPUs cellarium-ai/cellarium-ml#112

Closed

github-actions bot added the stale label Apr 2, 2024
