
Read multiple 10X files #267

Closed
cartal opened this issue Sep 21, 2018 · 12 comments

@cartal

cartal commented Sep 21, 2018

Hi,

Maybe this is somewhere in the manual and I just don't see it. But is there a way to read multiple 10X samples (either multiple .h5 or the matrix/genes/barcodes) in the same way that Seurat does with its Read10X() function?

@falexwolf
Member

falexwolf commented Sep 26, 2018

I don't know how Seurat does it, but I'd simply do:

filenames = ['name0.h5', 'name1.h5', 'name2.h5']
adatas = [sc.read_10x_h5(filename) for filename in filenames]
adata = adatas[0].concatenate(adatas[1:])

Does this help?

@cartal
Author

cartal commented Sep 26, 2018

Hi, thanks for the reply.

This example already helps, thanks. I was thinking more about importing multiple samples from 10X, where each sample has a folder containing the three files (matrix, barcodes, genes). But I guess I can convert those into .h5 before reading them into scanpy.

@falexwolf
Member

falexwolf commented Sep 26, 2018

You can do the same as above using sc.read_10x_mtx, which is not in a release yet but is on the master branch on GitHub. In .concatenate() you can choose how to name your batches/samples by passing batch_categories.

PS: Note that I edited the example above to show sc.read_10x_h5.

@cartal
Author

cartal commented Sep 26, 2018

Many thanks!!!

@cartal cartal closed this as completed Sep 26, 2018
@elfore

elfore commented Apr 26, 2019

Hi falexwolf,

I try to use concatenate to read multiple 10X mtx and put them together.
But it seems that if I concatenate more than 15 mtx files (already stored and read from cache), it becomes very slow. Do you have any advice?
Thanks for any information you may provide.

@aditisk

aditisk commented Mar 24, 2020

Hi @falexwolf, thanks for the solution you provided above for reading multiple files. It worked when I had just 2 files, but when I try the same code with 23 files I get an error message in the concatenation step. Any idea how to fix this? Thanks.


AttributeError Traceback (most recent call last)
in
12 adatas.obs['cell_names'] = pd.read_csv(path + sample + 'barcodes.tsv.gz', header=None)[0].values
13
---> 14 adata = adatas[0].concatenate(adatas[1:])

/Applications/anaconda3/lib/python3.7/site-packages/anndata/core/anndata.py in concatenate(self, join, batch_key, batch_categories, index_unique, *adatas)
1908
1909 if any_sparse:
-> 1910 sparse_format = all_adatas[0].X.getformat()
1911 X = X.asformat(sparse_format)
1912

AttributeError: 'numpy.ndarray' object has no attribute 'getformat'
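The traceback suggests the inputs mix dense and sparse matrices: concatenate sees that at least one .X is sparse and then calls .getformat() on the first .X, which here is a dense numpy array. One possible workaround (an assumption based on the traceback, not a confirmed fix) is to coerce every .X to the same sparse format before concatenating:

```python
import numpy as np
from scipy import sparse

# Hypothetical workaround: make every adata.X CSR before concatenating, e.g.
#   for ad in adatas:
#       if not sparse.issparse(ad.X):
#           ad.X = sparse.csr_matrix(ad.X)
# The coercion is lossless; a quick check on a toy matrix:
X_dense = np.array([[1.0, 0.0], [0.0, 2.0]])
X_csr = sparse.csr_matrix(X_dense)
```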

@aditisk

aditisk commented Mar 24, 2020

Hi @elfore, were you able to concatenate your files successfully? If so, could you please share the code you used for concatenation? Thanks.

@taopeng1100

If I do this: adata = adata1.concatenate(adata2, adata3), how can I keep the original sample names in adata? Thx!

@ivirshup
Member

ivirshup commented May 1, 2020

@taopeng1100, this should work:

adata = adata1.concatenate(adata2, adata3, index_unique=None)

@BrianLohman

Hello,

I am having problems reading in multiple h5 files using the code snippet that was posted by @falexwolf. I am doing:

filenames = ['./a.h5', './b.h5', './c.h5', './d.h5']
adatas = [sc.read_10x_h5(filename, gex_only = True) for filename in filenames]
adata = adatas[0].concatenate(adatas[1:], batch_key='gene_ids', batch_categories=filenames)

With or without the batch_key and batch_categories arguments I get the same error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-e23ba2ca6e37> in <module>
      1 filenames = ['./a.h5', './b.h5', './c.h5', './d.h5']
      2 adatas = [sc.read_10x_h5(filename, gex_only = True) for filename in filenames]
----> 3 adata = adatas[0].concatenate(adatas[1:], batch_key='gene_ids')

~/anaconda3/lib/python3.7/site-packages/anndata/_core/anndata.py in concatenate(self, join, batch_key, batch_categories, uns_merge, index_unique, fill_value, *adatas)
   1764             fill_value=fill_value,
   1765             index_unique=index_unique,
-> 1766             pairwise=False,
   1767         )
   1768 

~/anaconda3/lib/python3.7/site-packages/anndata/_core/merge.py in concat(adatas, axis, join, merge, uns_merge, label, keys, index_unique, fill_value, pairwise)
    817     # Annotation for other axis
    818     alt_annot = merge_dataframes(
--> 819         [getattr(a, alt_dim) for a in adatas], alt_indices, merge
    820     )
    821 

~/anaconda3/lib/python3.7/site-packages/anndata/_core/merge.py in merge_dataframes(dfs, new_index, merge_strategy)
    529     dfs: Iterable[pd.DataFrame], new_index, merge_strategy=merge_unique
    530 ) -> pd.DataFrame:
--> 531     dfs = [df.reindex(index=new_index) for df in dfs]
    532     # New dataframe with all shared data
    533     new_df = pd.DataFrame(merge_strategy(dfs), index=new_index)

~/anaconda3/lib/python3.7/site-packages/anndata/_core/merge.py in <listcomp>(.0)
    529     dfs: Iterable[pd.DataFrame], new_index, merge_strategy=merge_unique
    530 ) -> pd.DataFrame:
--> 531     dfs = [df.reindex(index=new_index) for df in dfs]
    532     # New dataframe with all shared data
    533     new_df = pd.DataFrame(merge_strategy(dfs), index=new_index)

~/anaconda3/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    310         @wraps(func)
    311         def wrapper(*args, **kwargs) -> Callable[..., Any]:
--> 312             return func(*args, **kwargs)
    313 
    314         kind = inspect.Parameter.POSITIONAL_OR_KEYWORD

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in reindex(self, *args, **kwargs)
   4174         kwargs.pop("axis", None)
   4175         kwargs.pop("labels", None)
-> 4176         return super().reindex(**kwargs)
   4177 
   4178     def drop(

~/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in reindex(self, *args, **kwargs)
   4810         # perform the reindex on the axes
   4811         return self._reindex_axes(
-> 4812             axes, level, limit, tolerance, method, fill_value, copy
   4813         ).__finalize__(self, method="reindex")
   4814 

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
   4021         if index is not None:
   4022             frame = frame._reindex_index(
-> 4023                 index, method, copy, level, fill_value, limit, tolerance
   4024             )
   4025 

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _reindex_index(self, new_index, method, copy, level, fill_value, limit, tolerance)
   4043             copy=copy,
   4044             fill_value=fill_value,
-> 4045             allow_dups=False,
   4046         )
   4047 

~/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in _reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
   4881                 fill_value=fill_value,
   4882                 allow_dups=allow_dups,
-> 4883                 copy=copy,
   4884             )
   4885             # If we've made a copy once, no need to make another one

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy, consolidate, only_slice)
   1299         # some axes don't allow reindexing with dups
   1300         if not allow_dups:
-> 1301             self.axes[axis]._can_reindex(indexer)
   1302 
   1303         if axis >= self.ndim:

~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3475         # trying to reindex on an axis with duplicates
   3476         if not self._index_as_unique and len(indexer):
-> 3477             raise ValueError("cannot reindex from a duplicate axis")
   3478 
   3479     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis

Loading a single h5 file works and produces expected output:

a = sc.read_10x_h5('./a.h5', gex_only = True)
a
AnnData object with n_obs × n_vars = 7474 × 31053
    var: 'gene_ids', 'feature_types', 'genome'

So the input files appear to be valid; I just can't get them to concatenate into a single object.

Any ideas would be welcome.

@xiaozhangPrivate

Hi @BrianLohman, I had the same problem, and this might help you.
Just run adata.var_names_make_unique() on each object before concatenating:

filenames = ["a", "b", "c", "d"]
adatas = []
for filename in filenames:
    adata = sc.read_10x_h5(filename)
    adata.var_names_make_unique()
    adatas.append(adata)
adata = adatas[0].concatenate(adatas[1:])

@dhairya02

dhairya02 commented Feb 15, 2023

Hi, I tried to do what you suggested but I am getting an error saying ValueError: only one regex group is supported with Index.
I have multiple h5ad files with varying n_obs × n_vars. Here is my code:

batch_names = []
for i in range(len(adatas)):
  adatas[i].var_names_make_unique()
  batch_names.append(filenames[i].split('.')[0])
  print(i,adatas[i])

adata = adatas[0].concatenate(adatas[1:],
                              batch_key = 'ID',
                              uns_merge="unique",
                              index_unique=None,
                              batch_categories=batch_names)

and this produces the above error. Can anyone help?
