Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

merging with a loom file #36

Closed
SamueleSoraggi opened this issue Jan 23, 2019 · 34 comments
Closed

merging with a loom file #36

SamueleSoraggi opened this issue Jan 23, 2019 · 34 comments

Comments

@SamueleSoraggi
Copy link

Hej,

I am trying to use scv.utils.merge(adata, adata_loom) to merge my dataset used with scanpy and the related loom dataset opened with scvelo. However, adata has dimension 7370x15000, while adata_loom has dimension 282016x33694. adata is the concatenation of 3 datasets, one from each individual, and the loom file I used is the combination of the 3 separate individuals' file through loompy.combine.

I am pretty ok with the second dimension (I kept the most expressed 15000 genes in scanpy), but the first dimension is not clear to me, since it is really large, and generates an error because adata and adata_loom have the attribute layers containing matrices of incompatible shape.
Is the first dimension to be interpreted in another way than just cells? Should I run the velocity analysis on the loom file without being allowed to merge?

Cheers,
Samuele

@VolkerBergen
Copy link
Contributor

Hi Samuele,

so loompy.combine, or the way you used it, generates sth. huge, for whatever reason.

Could you just load all loom files into three AnnData objects and concatenate those into one adata_loom?

That should hopefully give you the same dimensions.

@SamueleSoraggi
Copy link
Author

Hej Volker,

thanks for the hint, I thought it was the right way to go, based on some issues I read on the velocyto command line tool page. I'll post back when I try it :)

Cheers,
Samuele

@SamueleSoraggi
Copy link
Author

Hej again,

I tried using anndata.concatenate(loom1, loom2, loom3) on the three loom files. However, the resulting concatenated file has no longer the layers needed to have spliced and unspliced counts, maybe because it removes them since they have unmatching dimension.

@VolkerBergen
Copy link
Contributor

You're right, I forgot to mention that we only recently enabled the concatenate module to also consider layers (see scverse/anndata@1ff11f2). Hence, you would need to clone the latest AnnData version.

@SamueleSoraggi
Copy link
Author

SamueleSoraggi commented Jan 23, 2019

ok thanks.
maybe there is also something weird with my data, since the first dimension of the three datasets is huge! The obs_names are also different than for the dataset I have been using for scanpy, so maybe I have to figure that out first. I'll post an update here when I have news :)

@VolkerBergen
Copy link
Contributor

obs_names don't have to match exactly. For instance, the merge module with figure out itself, that '10X43_1_AAACATACCCATGA' and 'AAACATACCCATGA' belong together. Also the ordering doesn't have to be the same. Simplest quick check is looking at the dimensions of your concatenated data. If an obs hasn't been found in all of the datasets, it will through it away.

@SamueleSoraggi
Copy link
Author

Ok, then I try with anndata. Should I just update it from conda? I can see it has version 0.6.17 available :)

@VolkerBergen
Copy link
Contributor

It's not released yet. Just install the latest commits from source via

git clone https://github.com/theislab/anndata.git
cd anndata
pip install .

@VolkerBergen
Copy link
Contributor

Did that work for you?

@SamueleSoraggi
Copy link
Author

SamueleSoraggi commented Jan 30, 2019 via email

@SamueleSoraggi
Copy link
Author

SamueleSoraggi commented Feb 6, 2019

Hi again,

so as I said concatenating 3 loom files or using the one combined by loompy gives the same thing.
Now, if I try to merge with my annData object from the previous analysis there is again the error related to the non-matching shape:

all_data_merged = scv.utils.merge(all_data, all_data_loom)

outputs again this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-55a13faaeae6> in <module>
----> 1 all_data_flt_clst_mrk_paga_vel = scv.utils.merge(all_data_flt_clst_mrk_paga, all_data_loom)

~/miniconda3/envs/scRNA/lib/python3.6/site-packages/scvelo/read_load.py in merge(adata, ldata, copy)
    133         _adata.uns[attr] = _ldata.uns[attr]
    134     for attr in _ldata.layers.keys():
--> 135         _adata.layers[attr] = _ldata.layers[attr]
    136 
    137     if _adata.shape[1] == _ldata.shape[1]:

~/miniconda3/envs/scRNA/lib/python3.6/site-packages/anndata/layers.py in __setitem__(self, key, value)
     62         else:
     63             if value.shape != self._adata.shape:
---> 64                 raise ValueError('Shape does not fit.')
     65             self._layers[key] = value
     66 

ValueError: Shape does not fit.

As a tweak I extracted the layers (magic and raw) from my annData object and canceled them from it.

magicM = all_data.layers['magic']
rawM = all_data.layers['raw']
scv.utils.cleanup(all_data, clean='layers')

after running succesfully the merging with the loom file, I added again the layers into the resulting object. However, I get again an error because the shapes do not match again. before merging I had 7379 cells and now I get a little over 6000. I think the problem is recognizing cell names that are duplicated when the merging is done:

>>> all_data #previous annData object
AnnData object with n_obs × n_vars = 7379 × 15000
>>> all_data_merged = scv.utils.merge(all_data, all_data_loom)
>>> all_data_merged #merged object
AnnData object with n_obs × n_vars = 6717 × 33694 

The number of genes do not match as well, but one can do it manually very easily.

@VolkerBergen
Copy link
Contributor

Tested the merge module for multiple use cases where it works as expected. I guess it's something with the cell (and gene) names. Could you please check the following (with the new release 0.1.16)?

common_obs = adata.obs_names.intersection(ldata.obs_names)
common_vars = adata.var_names.intersection(ldata.var_names)
print(len(common_obs), len(common_vars))

@SamueleSoraggi
Copy link
Author

SamueleSoraggi commented Feb 12, 2019

I get the two lengths 0 15009.
It seems there is no match at all between cells. But just by using intersection, is it supposed to work?
For example, the names are of type

AAACCTGAGAAACCTA-1-0
possorted_genome_bam_1B86T:AAACCTGAGATGGGTC

in the data and in the loom data, respectively. I am using again 0.1.16 as a release, not the one in development, but the one available from pip.

@VolkerBergen
Copy link
Contributor

The merge module first "cleans up" names (basically removing everything that is not from 'ACTG'), what you can do with

scv.utils.clean_obs_names(adata)
scv.utils.clean_obs_names(ldata)

Then intersect. Can you run that again with cleaned names and also check if they're still unique?

@SamueleSoraggi
Copy link
Author

If I use the cleaning tools, I obtain 6557 15009.
So this time it went definitely better, even though I am still not matching the numbers I had in the annotated dataset I started with (7219 × 15000).
Getting exactly the same dimension would be cool. I thought that some of the cell names were appearing more than once, and only one of them is kept. But if I run

np.size(np.unique(all_data_flt_clst_mrk_paga.obs_names))

I get exactly 7219, while I would expect as well 6557.

@VolkerBergen
Copy link
Contributor

VolkerBergen commented Feb 13, 2019

Would need to figure out which cells get sorted out and why (after cleaning)

common_obs = adata.obs_names.intersection(ldata.obs_names)
adata.obs_names[adata.obs_names.isin(common_obs) == False]

@SamueleSoraggi
Copy link
Author

SamueleSoraggi commented Feb 13, 2019

There are actually some names popping out that seem not to be clean, but have still "-1" in the name

Index(['AAACGGGAGGTGATTA', 'AAACGGGGTCATATGC', 'AAAGCAAGTAAAGGAG',
       'AAAGTAGTCATGCTCC', 'AAATGCCTCATGTCTT', 'AACACGTCAGACTCGC',
       'AACCATGAGACTAGAT', 'AACCATGCAAATTGCC', 'AACCGCGAGCTGTTCA',
       'AACCGCGTCGGTCCGA',
       ...
       'TTTATGCGTAGCCTAT', 'TTTCCTCCATCGACGC', 'TTTCCTCGTACTCGCG',
       'TTTGCGCAGATACACA', 'TTTGCGCAGGCAGGTT', 'TTTGCGCAGGTGACCA',
       'TTTGTCAAGAGCTGGT', 'GAAGCAGGTTCACCTC-1', 'GCATGCGGTAACGTTC-1',
       'ATGCGATGTCACCCAG-1'],
      dtype='object', length=662)

The other names seem ok. If I run on the loom data

ldata.obs_names[ldata.obs_names.isin(common_obs) == False]

then I get a vector of length 275459, meaning that there are only 6557 matches out of the 282016 observations available. So this is why the intersection has length 6557

Index(['AAACCTGAGATGGGTC', 'AAACCTGAGCTCAACT', 'AAACCTGAGCCAGTAG',
       'AAACCTGAGACAAAGG', 'AAACCTGAGAGGGCTT', 'AAACCTGAGATGCGAC',
       'AAACCTGAGAAGATTC', 'AAACCTGAGATCCTGT', 'AAACCTGAGCACACAG',
       'AAACCTGAGCCGATTT',
       ...
       'TTTGTCATCCTGTACC', 'TTTGTCATCTTCTGGC', 'TTTGTCATCTTATCTG-1',
       'TTTGTCATCTGCCCTA', 'TTTGTCATCGGCTTGG', 'TTTGTCATCTGACCTC',
       'TTTGTCATCGTGGGAA-1', 'TTTGTCATCTTGAGAC', 'TTTGTCATCTTCGAGA',
       'TTTGTCATCTGAAAGA'],
      dtype='object', length=275459)

I noticed that here there are as well names with some remaining -1 characters in the string. As far as I can understand, all cells in my previous dataset should be found into my loom data. Is it right, or one might not find some of them?

@VolkerBergen
Copy link
Contributor

The -1 is added to make names unique, i.e. those appear at least twice in your obs_names.

You might need a tailored merge module in your case. If you send me your obs_names via E-Mail (e.g. with np.save) I can have a look.

@SamueleSoraggi
Copy link
Author

That would be nice, of course only when you have time to look at it :) I guess this kind of problem might arise pretty often when one has concatenated different datasets into a single annotated data object. But it is probably also impossible to make a merge function that generalizes to all possible cases.

@SamueleSoraggi
Copy link
Author

I sent the data in a mail to your institutional address :)
Thanks, Samuele

@VolkerBergen
Copy link
Contributor

I conclude that the merge module behaves as expected and filters out cells that have not been found in both .obs_names.

Rather something is corrupted with the data. Detailed disc via mail.

@SamueleSoraggi
Copy link
Author

That is in some way nice, so I know that this kind of stuff should not happen and things would rather have to go more smooth.

@SamueleSoraggi
Copy link
Author

Thanks a lot for the help and patience. Still owe a beer ;)

@hejian41
Copy link

Hi
Error "Variable names are not identical" when I trying to use scv.utils.merge(adata, adata_loom).
The sharpe of adata and adata_loom are the same, but the orders of var.index are different, which led to this error, so is there any way to reorder the index of adata_loom.var?
Thanks a lot
Jian

@VolkerBergen
Copy link
Contributor

VolkerBergen commented Feb 22, 2019

If the var_names are the same, but in different order, you can simply put them in the right order with adata_loom = adata_loom[:, adata.var_names]

Good point though. I'll include that in the merge module, so that one does not have to care about it anymore.

@hmassalha
Copy link

Hi,

Do you have an example of how the andata.concatenate function work?
I have cells from 6 individuals that were processed using the same plate based method. Each individual gets a set of unique barcodes (well barcode + plate barcode). The total number of cells varies between samples.
Does the function take care of the same CellID that was used in the different samples?
Does the function generate a big matrix of unique-modified-CellIDs against the maximum number of unique genes? (meaning adding NaN or zeros for missing data).
And lastly, I keep getting the following message "Variable names are not unique" for the individual loom file. Why is this? I have "Start", "End", ... they are unique! Once I applied the method .var_names_make_unique then andata.concatenate I get in the merged file "Start1", "Start2", Start3", ... and so for all vars! This is why I am asking how the merging function is working? I have the same gene level from 6 individuals, so why to get 6 "Start" columns?

Thanks in advance,
HM

@VolkerBergen
Copy link
Contributor

Sure, here you find the module description along with several examples. It's basically adata1.concatenate(adata2, adata3) where the var_names should match. If you have same cellIDs, say 'AAACATACCCATGA', it becomes 'AAACATACCCATGA-0' and 'AAACATACCCATGA-1'.

The multiple .var issue is addressed in scverse/anndata#162 and will be resolved soon. If it bothers you, you can just delete the duplicated .var columns.

The merge module is different. It merges a pre-existing AnnData with a new loom file (e.g. when you have already done all kinds of preprocessing and don't want to repeat that) and generates a new AnnData with shared observations across shared genes.

You raised quite a few questions. Let me know, if I overlooked anything.

@hmassalha
Copy link

Thanks for your answer.
I have an extra question please, the original loom files contain doublicated gene names. How that can happen? I used the same loom file with Seurat, and I didn't have this issue. Could you please rise a few suggestions for this issue?

Thanks, HM

@VolkerBergen
Copy link
Contributor

VolkerBergen commented Jul 2, 2019

Are you sure, that you didn't have any duplicates in Seurat?

You can check that in python with set([x for x in list(adata.var_names) if list(adata.var_names).count(x) > 1]) and in R with duplicated() from tidyverse.

As of satijalab/seurat#1238 it looks like it is handling/removing duplicates internally.

In scanpy/scvelo, this is fixed by calling adata.var_names_make_unique() after reading your data.

@hmassalha
Copy link

Thanks,
I think Seurat is treating this point behind the stage. My question is, can't we merge the duplicated genes? I mean make a new AnnData obj contains the sum of counts by cells (by columns) for the same genes? is it legal to do this?
I think we can do this as we have counts which representative for 'real' reads. Splitting the counts and through a great portion of them just because we wanted to use only one row from the table is nor clever. On the other hand, if this is the case, which row one should use (both have the same gene name)?

Since I am new to python, could you please write an example code that will sum the rows of the same gene names by columns. The output will be a table of unique gene names by cells, and the values are the summed counts.

Thanks a lot, HM

@VolkerBergen
Copy link
Contributor

VolkerBergen commented Jul 3, 2019

Get the names of your duplicates:

def get_duplicates(array):
    from collections import Counter
    return np.array([item for (item, count) in Counter(array).items() if count > 1])
duplicates = get_duplicates(adata.var_names)
print(duplicates)

subset AnnData to the first duplicate var_name:

adata_dup = adata[:, [duplicates[0]]]
print(adata_dup.var)

Here, you can check whether you have gene_id/Accession that uniquely identifies the genes with duplicate names.

If they have unique identifiers, also the number of cells expressed for that gene with duplicate names would probably be different:

print(adata_dup.X.sum(0))

@denvercal1234GitHub
Copy link

Hi @VolkerBergen - My apologies for asking another q for closed issue, but when I read my data as below, a warning stated Variable names are not unique. To make them unique, call .var_names_make_unique, but if I called adata.var_names_make_unique(), the dimension stayed the same, suggesting to me that there was actually no duplicates?

Thank you very much for your help!

Screen Shot 2021-11-21 at 7 43 38 PM

Screen Shot 2021-11-21 at 7 44 50 PM

@WeilerP
Copy link
Member

WeilerP commented Nov 21, 2021

@denvercal1234GitHub, as the function name suggests and indicated by the warning message, variable names are made unique (e.g. by appending a suffix) and not duplicate variables removed.

@denvercal1234GitHub
Copy link

Ah, I see. Thank you Weiler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants