merging with a loom file #36

SamueleSoraggi · 2019-01-23T10:57:08Z

Hej,

I am trying to use scv.utils.merge(adata, adata_loom) to merge my dataset used with scanpy and the related loom dataset opened with scvelo. However, adata has dimension 7370x15000, while adata_loom has dimension 282016x33694. adata is the concatenation of 3 datasets, one from each individual, and the loom file I used is the combination of the 3 separate individuals' file through loompy.combine.

I am pretty ok with the second dimension (I kept the most expressed 15000 genes in scanpy), but the first dimension is not clear to me, since it is really large, and generates an error because adata and adata_loom have the attribute layers containing matrices of incompatible shape.
Is the first dimension to be interpreted in another way than just cells? Should I run the velocity analysis on the loom file without being allowed to merge?

Cheers,
Samuele

VolkerBergen · 2019-01-23T11:04:10Z

Hi Samuele,

so loompy.combine, or the way you used it, generates something huge, for whatever reason.

Could you just load all loom files into three AnnData objects and concatenate those into one adata_loom?

That should hopefully give you the same dimensions.

SamueleSoraggi · 2019-01-23T12:03:10Z

Hej Volker,

thanks for the hint, I thought it was the right way to go, based on some issues I read on the velocyto command line tool page. I'll post back when I try it :)

Cheers,
Samuele

SamueleSoraggi · 2019-01-23T13:17:13Z

Hej again,

I tried using anndata.concatenate(loom1, loom2, loom3) on the three loom files. However, the resulting concatenated object no longer has the layers needed for the spliced and unspliced counts, maybe because it removes them since their dimensions do not match.

VolkerBergen · 2019-01-23T13:20:44Z

You're right, I forgot to mention that we only recently enabled the concatenate module to also consider layers (see scverse/anndata@1ff11f2). Hence, you would need to clone the latest AnnData version.

SamueleSoraggi · 2019-01-23T13:27:59Z

ok thanks.
maybe there is also something weird with my data, since the first dimension of the three datasets is huge! The obs_names are also different than for the dataset I have been using for scanpy, so maybe I have to figure that out first. I'll post an update here when I have news :)

VolkerBergen · 2019-01-23T13:32:58Z

obs_names don't have to match exactly. For instance, the merge module will figure out by itself that '10X43_1_AAACATACCCATGA' and 'AAACATACCCATGA' belong together. The ordering doesn't have to be the same either. The simplest quick check is to look at the dimensions of your concatenated data. If an obs hasn't been found in all of the datasets, it will be thrown away.
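A minimal sketch of that matching idea, assuming the core barcode is the longest run of A/C/G/T characters in a cell name. The helper `core_barcode` and the example names are hypothetical; this is not scvelo's actual implementation.

```python
import re

def core_barcode(name):
    """Return the longest run of A/C/G/T characters in a cell name."""
    runs = re.findall(r"[ACGT]+", name)
    return max(runs, key=len) if runs else name

# Hypothetical cell names: same cells, different prefixes, different order.
adata_names = ["10X43_1_AAACATACCCATGA", "10X43_1_AAACATTGAGCTAC"]
ldata_names = ["AAACATTGAGCTAC", "AAACGCACTTCAGG", "AAACATACCCATGA"]

adata_clean = [core_barcode(n) for n in adata_names]
ldata_clean = [core_barcode(n) for n in ldata_names]

# Keep only cells found on both sides, in the order of the first object.
ldata_set = set(ldata_clean)
common = [n for n in adata_clean if n in ldata_set]
print(common)
```

The cell that exists only in `ldata_names` is dropped, which is why comparing the dimensions before and after merging is a quick sanity check.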

SamueleSoraggi · 2019-01-23T14:07:58Z

Ok, then I try with anndata. Should I just update it from conda? I can see it has version 0.6.17 available :)

VolkerBergen · 2019-01-23T14:36:44Z

It's not released yet. Just install the latest commits from source via

git clone https://github.com/theislab/anndata.git
cd anndata
pip install .

VolkerBergen · 2019-01-30T16:25:06Z

Did that work for you?

SamueleSoraggi · 2019-01-30T21:02:50Z

Hej Volker, I was going to write about this. I get the same object by concatenating my 3 loom files opened in scanpy and by loading the file combined with loompy. However, when I do the merging with my previously annotated object, I still get an error. I worked around the issue by selecting cells with matching names, but I lose a thousand cells that seem not to be in there. I will post the commands as soon as I have my hands on the laptop. Cheers, Samuele

SamueleSoraggi · 2019-02-06T12:17:31Z

Hi again,

so as I said concatenating 3 loom files or using the one combined by loompy gives the same thing.
Now, if I try to merge with my annData object from the previous analysis there is again the error related to the non-matching shape:

all_data_merged = scv.utils.merge(all_data, all_data_loom)

outputs again this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-55a13faaeae6> in <module>
----> 1 all_data_flt_clst_mrk_paga_vel = scv.utils.merge(all_data_flt_clst_mrk_paga, all_data_loom)

~/miniconda3/envs/scRNA/lib/python3.6/site-packages/scvelo/read_load.py in merge(adata, ldata, copy)
    133         _adata.uns[attr] = _ldata.uns[attr]
    134     for attr in _ldata.layers.keys():
--> 135         _adata.layers[attr] = _ldata.layers[attr]
    136 
    137     if _adata.shape[1] == _ldata.shape[1]:

~/miniconda3/envs/scRNA/lib/python3.6/site-packages/anndata/layers.py in __setitem__(self, key, value)
     62         else:
     63             if value.shape != self._adata.shape:
---> 64                 raise ValueError('Shape does not fit.')
     65             self._layers[key] = value
     66 

ValueError: Shape does not fit.

As a workaround, I extracted the layers (magic and raw) from my AnnData object and removed them from it.

magicM = all_data.layers['magic']
rawM = all_data.layers['raw']
scv.utils.cleanup(all_data, clean='layers')

After successfully running the merge with the loom file, I added the layers back into the resulting object. However, I get an error again because the shapes still do not match: before merging I had 7379 cells and now I get a little over 6000. I think the problem is in recognizing cell names that are duplicated when the merging is done:

>>> all_data #previous annData object
AnnData object with n_obs × n_vars = 7379 × 15000
>>> all_data_merged = scv.utils.merge(all_data, all_data_loom)
>>> all_data_merged #merged object
AnnData object with n_obs × n_vars = 6717 × 33694

The number of genes does not match either, but one can fix that manually very easily.
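A minimal sketch of how a saved layer could be cut down to the cells and genes that survive the merge, so that its shape fits again. It uses a plain NumPy array and hypothetical names, and it assumes that cell and gene names are spelled identically before and after merging, which may not hold once names have been cleaned.

```python
import numpy as np

# Hypothetical stand-ins for the object before and after merging.
old_obs = ["cellA", "cellB", "cellC", "cellD"]
old_var = ["gene1", "gene2", "gene3"]
saved_layer = np.arange(12).reshape(4, 3)  # shape (4 cells, 3 genes)

merged_obs = ["cellC", "cellA"]            # fewer cells, different order
merged_var = ["gene3", "gene1", "gene9"]   # contains a gene missing from old_var

# Genes present on both sides, in the merged object's order.
old_var_set = set(old_var)
shared_var = [g for g in merged_var if g in old_var_set]

# Look up the row/column positions of the kept names in the saved layer.
row_pos = {name: i for i, name in enumerate(old_obs)}
col_pos = {name: j for j, name in enumerate(old_var)}
rows = [row_pos[c] for c in merged_obs]
cols = [col_pos[g] for g in shared_var]

layer_subset = saved_layer[np.ix_(rows, cols)]
print(layer_subset.shape)
```

The merged object would have to be restricted to `shared_var` as well before the layer is assigned, since the traceback above shows that a layer must have exactly the shape of the object.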

VolkerBergen · 2019-02-07T13:22:20Z

Tested the merge module for multiple use cases where it works as expected. I guess it's something with the cell (and gene) names. Could you please check the following (with the new release 0.1.16)?

common_obs = adata.obs_names.intersection(ldata.obs_names)
common_vars = adata.var_names.intersection(ldata.var_names)
print(len(common_obs), len(common_vars))

SamueleSoraggi · 2019-02-12T12:41:08Z

I get the two lengths 0 15009.
It seems there is no match at all between cells. But just by using intersection, is it supposed to work?
For example, the names are of type

AAACCTGAGAAACCTA-1-0
possorted_genome_bam_1B86T:AAACCTGAGATGGGTC

in the data and in the loom data, respectively. I am using again 0.1.16 as a release, not the one in development, but the one available from pip.

VolkerBergen · 2019-02-12T15:01:11Z

The merge module first "cleans up" names (basically removing everything that is not from 'ACTG'), which you can do with

scv.utils.clean_obs_names(adata)
scv.utils.clean_obs_names(ldata)

Then intersect. Can you run that again with cleaned names and also check if they're still unique?

SamueleSoraggi · 2019-02-13T08:12:38Z

If I use the cleaning tools, I obtain 6557 15009.
So this time it went definitely better, even though I am still not matching the numbers I had in the annotated dataset I started with (7219 × 15000).
Getting exactly the same dimension would be cool. I thought that some of the cell names were appearing more than once, and only one of them is kept. But if I run

np.size(np.unique(all_data_flt_clst_mrk_paga.obs_names))

I get exactly 7219, while I would have expected 6557 here as well.

VolkerBergen · 2019-02-13T08:22:25Z

Would need to figure out which cells get sorted out and why (after cleaning)

common_obs = adata.obs_names.intersection(ldata.obs_names)
adata.obs_names[adata.obs_names.isin(common_obs) == False]

SamueleSoraggi · 2019-02-13T08:34:36Z

There are actually some names popping out that seem not to be clean, but have still "-1" in the name

Index(['AAACGGGAGGTGATTA', 'AAACGGGGTCATATGC', 'AAAGCAAGTAAAGGAG',
       'AAAGTAGTCATGCTCC', 'AAATGCCTCATGTCTT', 'AACACGTCAGACTCGC',
       'AACCATGAGACTAGAT', 'AACCATGCAAATTGCC', 'AACCGCGAGCTGTTCA',
       'AACCGCGTCGGTCCGA',
       ...
       'TTTATGCGTAGCCTAT', 'TTTCCTCCATCGACGC', 'TTTCCTCGTACTCGCG',
       'TTTGCGCAGATACACA', 'TTTGCGCAGGCAGGTT', 'TTTGCGCAGGTGACCA',
       'TTTGTCAAGAGCTGGT', 'GAAGCAGGTTCACCTC-1', 'GCATGCGGTAACGTTC-1',
       'ATGCGATGTCACCCAG-1'],
      dtype='object', length=662)

The other names seem ok. If I run on the loom data

ldata.obs_names[ldata.obs_names.isin(common_obs) == False]

then I get a vector of length 275459, meaning that there are only 6557 matches out of the 282016 observations available. So this is why the intersection has length 6557

Index(['AAACCTGAGATGGGTC', 'AAACCTGAGCTCAACT', 'AAACCTGAGCCAGTAG',
       'AAACCTGAGACAAAGG', 'AAACCTGAGAGGGCTT', 'AAACCTGAGATGCGAC',
       'AAACCTGAGAAGATTC', 'AAACCTGAGATCCTGT', 'AAACCTGAGCACACAG',
       'AAACCTGAGCCGATTT',
       ...
       'TTTGTCATCCTGTACC', 'TTTGTCATCTTCTGGC', 'TTTGTCATCTTATCTG-1',
       'TTTGTCATCTGCCCTA', 'TTTGTCATCGGCTTGG', 'TTTGTCATCTGACCTC',
       'TTTGTCATCGTGGGAA-1', 'TTTGTCATCTTGAGAC', 'TTTGTCATCTTCGAGA',
       'TTTGTCATCTGAAAGA'],
      dtype='object', length=275459)

I noticed that here, too, some names still contain a trailing -1. As far as I understand, all cells in my previous dataset should be found in my loom data. Is that right, or might some of them be missing?

VolkerBergen · 2019-02-13T08:44:11Z

The -1 is added to make names unique, i.e. those appear at least twice in your obs_names.
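A minimal sketch of that suffixing behaviour, assuming the first occurrence of a name is left unchanged and later repeats get -1, -2, and so on. The helper `make_unique` is hypothetical, not anndata's own code, and it does not guard against a suffixed name clashing with an existing one.

```python
from collections import Counter

def make_unique(names, join="-"):
    """Leave the first occurrence unchanged; suffix later repeats with -1, -2, ..."""
    seen = Counter()
    unique = []
    for name in names:
        count = seen[name]
        unique.append(name if count == 0 else f"{name}{join}{count}")
        seen[name] += 1
    return unique

# Hypothetical cleaned barcodes; the first one occurs in two samples.
cleaned = ["GAAGCAGGTTCACCTC", "AAACGGGAGGTGATTA", "GAAGCAGGTTCACCTC"]
print(make_unique(cleaned))
```

The number of names stays the same; only the repeated barcode is renamed, so it no longer equals the bare barcode as a string.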

You might need a tailored merge module in your case. If you send me your obs_names via E-Mail (e.g. with np.save) I can have a look.

SamueleSoraggi · 2019-02-13T08:48:35Z

That would be nice, of course only when you have time to look at it :) I guess this kind of problem might arise pretty often when one has concatenated different datasets into a single annotated data object. But it is probably also impossible to make a merge function that generalizes to all possible cases.

SamueleSoraggi · 2019-02-13T09:15:37Z

I sent the data in a mail to your institutional address :)
Thanks, Samuele

VolkerBergen · 2019-02-13T09:34:51Z

I conclude that the merge module behaves as expected and filters out cells that have not been found in both .obs_names.

Rather, something is corrupted in the data. Detailed discussion via mail.

SamueleSoraggi · 2019-02-13T09:46:43Z

That is nice in a way: now I know that this kind of thing should not happen and that things should normally go more smoothly.

SamueleSoraggi · 2019-02-15T09:48:31Z

Thanks a lot for the help and patience. Still owe a beer ;)

hejian41 · 2019-02-22T09:50:14Z

Hi
I get the error "Variable names are not identical" when I try to use scv.utils.merge(adata, adata_loom).
The shapes of adata and adata_loom are the same, but the order of var.index differs, which leads to this error. Is there any way to reorder the index of adata_loom.var?
Thanks a lot
Jian

VolkerBergen · 2019-02-22T10:49:29Z

If the var_names are the same, but in different order, you can simply put them in the right order with adata_loom = adata_loom[:, adata.var_names]

Good point though. I'll include that in the merge module, so that one does not have to care about it anymore.
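The reordering suggested above can be sketched on a plain NumPy matrix, where the label lookup that AnnData does internally is written out by hand. The gene names and counts are hypothetical, and the sketch assumes both objects hold exactly the same set of names.

```python
import numpy as np

# Hypothetical gene names: same set, different order.
adata_var = ["Actb", "Gapdh", "Sox2"]
loom_var = ["Sox2", "Actb", "Gapdh"]
loom_X = np.array([[1, 2, 3],
                   [4, 5, 6]])  # rows are cells, columns follow loom_var

# Position of each gene in the loom matrix.
loom_pos = {gene: j for j, gene in enumerate(loom_var)}
order = [loom_pos[gene] for gene in adata_var]

# Reorder columns and names together so they stay aligned.
loom_X_reordered = loom_X[:, order]
loom_var_reordered = [loom_var[j] for j in order]
print(loom_var_reordered)
```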

hmassalha · 2019-06-20T11:38:58Z

Hi,

Do you have an example of how the anndata.concatenate function works?
I have cells from 6 individuals that were processed using the same plate based method. Each individual gets a set of unique barcodes (well barcode + plate barcode). The total number of cells varies between samples.
Does the function take care of the same CellID that was used in the different samples?
Does the function generate a big matrix of unique-modified-CellIDs against the maximum number of unique genes? (meaning adding NaN or zeros for missing data).
And lastly, I keep getting the message "Variable names are not unique" for the individual loom files. Why is this? I have "Start", "End", ... and they are unique! Once I apply .var_names_make_unique and then anndata.concatenate, the merged file contains "Start1", "Start2", "Start3", ... and so on for all vars! This is why I am asking how the merging function works. I have the same genes from 6 individuals, so why do I get 6 "Start" columns?

Thanks in advance,
HM

VolkerBergen · 2019-06-20T11:59:34Z

Sure, here you find the module description along with several examples. It's basically adata1.concatenate(adata2, adata3) where the var_names should match. If you have same cellIDs, say 'AAACATACCCATGA', it becomes 'AAACATACCCATGA-0' and 'AAACATACCCATGA-1'.
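A minimal sketch of the renaming described here, assuming every cell ID receives its batch index as a suffix (not only the clashing ones). `concatenate_names` and the barcodes are hypothetical and only illustrate the naming, not the concatenation of the count matrices.

```python
def concatenate_names(*batches):
    """Append the batch index to every cell ID, as in 'AAACATACCCATGA-0'."""
    return [f"{name}-{i}" for i, batch in enumerate(batches) for name in batch]

# Hypothetical cell IDs from two samples that share one barcode.
sample0 = ["AAACATACCCATGA", "AAACATTGAGCTAC"]
sample1 = ["AAACATACCCATGA", "AAACGCACTTCAGG"]

combined = concatenate_names(sample0, sample1)
print(combined)
```

The shared barcode ends up as two distinct names, so cells from different individuals stay separate rows.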

The multiple .var issue is addressed in scverse/anndata#162 and will be resolved soon. If it bothers you, you can just delete the duplicated .var columns.

The merge module is different. It merges a pre-existing AnnData with a new loom file (e.g. when you have already done all kinds of preprocessing and don't want to repeat that) and generates a new AnnData with shared observations across shared genes.

You raised quite a few questions. Let me know, if I overlooked anything.

hmassalha · 2019-07-02T10:36:14Z

Thanks for your answer.
I have one more question, please: the original loom files contain duplicated gene names. How can that happen? I used the same loom file with Seurat, and I didn't have this issue. Could you please offer a few suggestions for this issue?

Thanks, HM

VolkerBergen · 2019-07-02T10:53:52Z

Are you sure, that you didn't have any duplicates in Seurat?

You can check that in Python with set([x for x in list(adata.var_names) if list(adata.var_names).count(x) > 1]) and in R with the base function duplicated().

As of satijalab/seurat#1238 it looks like it is handling/removing duplicates internally.

In scanpy/scvelo, this is fixed by calling adata.var_names_make_unique() after reading your data.

hmassalha · 2019-07-02T19:35:13Z

Thanks,
I think Seurat is handling this behind the scenes. My question is: can't we merge the duplicated genes? I mean, make a new AnnData object that contains the sum of counts per cell (by column) for the same genes. Is it legitimate to do this?
I think we can, since the counts represent 'real' reads. Splitting the counts and throwing away a large portion of them just because we want to use only one row from the table is not clever. On the other hand, if that is the case, which row should one use (both have the same gene name)?

Since I am new to python, could you please write an example code that will sum the rows of the same gene names by columns. The output will be a table of unique gene names by cells, and the values are the summed counts.

Thanks a lot, HM

VolkerBergen · 2019-07-03T09:56:59Z

Get the names of your duplicates:

import numpy as np
from collections import Counter

def get_duplicates(array):
    # Names that occur more than once
    return np.array([item for (item, count) in Counter(array).items() if count > 1])

duplicates = get_duplicates(adata.var_names)
print(duplicates)

subset AnnData to the first duplicate var_name:

adata_dup = adata[:, [duplicates[0]]]
print(adata_dup.var)

Here, you can check whether you have gene_id/Accession that uniquely identifies the genes with duplicate names.

If they have unique identifiers, the summed expression of the entries sharing that gene name would probably differ as well:

print(adata_dup.X.sum(0))
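For the summing asked about above, here is a minimal sketch on a dense NumPy matrix with hypothetical gene names and counts. It assumes rows are cells and columns are genes, and that the entries sharing a name really are the same gene, which is what the identifier check above is meant to establish. A sparse .X would need a different approach.

```python
import numpy as np

# Hypothetical counts: rows are cells, columns are genes; "GeneA" occurs twice.
var_names = np.array(["GeneA", "GeneB", "GeneA", "GeneC"])
X = np.array([[1, 0, 2, 5],
              [0, 3, 4, 0]])

# unique_names is sorted; inverse maps every column to its unique gene.
unique_names, inverse = np.unique(var_names, return_inverse=True)

# Add each original column into the column of its unique gene.
X_summed = np.zeros((X.shape[0], len(unique_names)), dtype=X.dtype)
for j, k in enumerate(inverse):
    X_summed[:, k] += X[:, j]

print(unique_names)
print(X_summed)
```

The result has one column per unique gene name and keeps the total counts per cell unchanged.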

denvercal1234GitHub · 2021-11-21T19:44:57Z

Hi @VolkerBergen - My apologies for asking another question on a closed issue, but when I read my data as below, a warning stated Variable names are not unique. To make them unique, call .var_names_make_unique, but when I called adata.var_names_make_unique(), the dimensions stayed the same, suggesting to me that there were actually no duplicates?

Thank you very much for your help!

WeilerP · 2021-11-21T21:12:40Z

@denvercal1234GitHub, as the function name suggests and as the warning message indicates, variable names are made unique (e.g. by appending a suffix); duplicate variables are not removed.

denvercal1234GitHub · 2021-11-22T11:48:47Z

Ah, I see. Thank you Weiler.

SamueleSoraggi closed this as completed Feb 15, 2019

ifanirene mentioned this issue Jul 9, 2021

Handling of CellID in scv.utils.merge #523

Closed
