
Seurat3.0 Finding integration vectors: long vectors not supported yet #1029

Closed
ruqianl opened this issue Dec 29, 2018 · 23 comments

Comments

ruqianl commented Dec 29, 2018

Hi guys,

When I applied Seurat 3.0's IntegrateData to a large dataset consisting of about 100k cells from 50 samples, I got an error:

Integrating data
Merging dataset 46 3 34 2 1 47 51 45 49 37 50 into 19 42 36 44 20 30 25 33 26 43 12 14 35 39 21 24 28 31 8 32 29 27 18 23 11 15 17 10 16 7 13 9 38 22 6 41 40 4 5 48
Extracting anchors for merged samples
Finding integration vectors
Error in validityMethod(as(object, superClass)) : 
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:138
Calls: IntegrateData ... validObject -> anyStrings -> validityMethod -> .Call
Execution halted

I guess it has to do with the data size; is there any workaround for this?

PS: when I group these samples into 7 sets and rerun the integration on the 7 datasets, it runs successfully.

Thanks!
Lyu

@satijalab (Collaborator)

Thanks for the question - we've explored this, and the cause is that there are so many anchors that it creates a sparse matrix with >2^31 elements in R, which throws an error.

This is not simply a function of cell or dataset number (we have performed much larger alignments), but we are implementing a fix that will not affect results. Our apologies for the delay in the meantime.
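[Editor's note: a minimal sketch, not Seurat code, illustrating the limit described above. The Matrix package's sparse classes use 32-bit integer indexing for their slots, so a matrix whose number of stored entries exceeds .Machine$integer.max (2^31 - 1) can trigger the "long vectors not supported yet" error. The matrix here is a toy stand-in.]

library(Matrix)

# Toy sparse matrix standing in for the anchor/integration matrix
m <- rsparsematrix(nrow = 1000, ncol = 1000, density = 0.1)

# Number of explicitly stored (non-zero) entries
n_nonzero <- length(m@x)

# 2^31 - 1 is the largest count that 32-bit integer indexing can address
if (n_nonzero > .Machine$integer.max) {
  message("Too many stored entries for 32-bit sparse matrix indexing")
}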

gouinK commented Apr 1, 2019

Greetings, I am running into the same issue with my dataset. Is there any fix for this yet? My dataset is actually quite similar to the OP's in that I have ~120k cells across 36 samples. Thanks for your time!

@markrobinsonuzh

Just to highlight that I also have this issue: 24 samples with (downsampled to) ~3k cells for each sample.

[snip]
Integrating data
Merging dataset 24 16 18 5 23 20 into 3 13 2 8 14 9 1 7 4 10 22
Extracting anchors for merged samples
Finding integration vectors
Error in validityMethod(as(object, superClass)) : 
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:519
> traceback()
15: validityMethod(as(object, superClass))
14: isTRUE(x)
13: anyStrings(validityMethod(as(object, superClass)))
12: validObject(.Object)
11: .nextMethod(.Object = .Object, ... = ...)
10: callNextMethod()
9: initialize(value, ...)
8: initialize(value, ...)
7: new("dgTMatrix", Dim = d, Dimnames = dn, i = i, j = j, x = x)
6: newTMat(i = c(ij1[, 1], ij2[, 1]), j = c(ij1[, 2], ij2[, 2]), 
       x = if (Generic == "+") c(e1@x, e2@x) else c(e1@x, -e2@x))
5: .Arith.Csparse(e1, e2, .Generic, class. = "dgCMatrix")
4: data.use2 - data.use1
3: data.use2 - data.use1
2: FindIntegrationMatrix(object = merged.obj, integration.name = integration.name, 
       features.integrate = features.to.integrate, verbose = verbose)
1: IntegrateData(anchorset = anchors, dims = 1:pa$dims, features.to.integrate = row.names(sl[[1]]))

Has anyone tried using anchor.features = 1000 (or something lower than the default of 2000)? I mean, how much does this affect the integration?

Thanks in advance,
Mark Robinson

markrobinsonuzh commented Apr 2, 2019

OK, I tried FindIntegrationAnchors(.., anchor.features = 1000) and no luck; I got the same error.

markrobinsonuzh commented Apr 3, 2019

I solved it: I was integrating a bunch of features, not just those used for anchoring. This was the failing call:

seurat <- IntegrateData( anchorset=anchors, 
                         dims = 1:pa$dims,
                         features.to.integrate=row.names(sl[[1]]) )

This works:

seurat <- IntegrateData( anchorset=anchors, 
                         dims = 1:pa$dims)

@jeffjjohnston

I've also encountered this error during integration. Is there a recommended workaround while we await a fix? Thanks!

timoast (Collaborator) commented May 17, 2019

@jeffjjohnston you can try using Mark's workaround, which should work in most cases. We are working on further improvements to the integration procedure that will address this issue more fully; these will be made available soon.

MaxKman commented Jun 22, 2019

Hi, any news about this? I am facing the same problem. Is the workaround just running IntegrateData without specifying features? That doesn't work in my case.

@kevinblighe

I am as yet unsure why, but this worked for me:

IntegrateData( anchorset = scData.Anchors )

That is, not specifying anything for dims or features.to.integrate.

The very puzzling part is that when I specified dims = 1:20, I got the error stated in the original question (above). The default for dims, however, is 1:30, so 1:30 is presumably used when nothing is set for dims (?).

MaxKman commented Jul 10, 2019

I could get around the issue by integrating my samples as two separate sets (using the SCT assay) and then integrating the two resulting objects (using the integrated RNA assay of each object). However, I would now like to use the new method for directly integrating the SCT-normalized data based on the Pearson residuals (https://satijalab.org/seurat/v3.0/pancreas_integration_label_transfer.html), and I am not sure whether it is sound to use the same two-set workaround in that case. Any feedback?

timoast (Collaborator) commented Aug 2, 2019

We have now introduced multiple ways to scale up the integration, using either reference-based integration or reciprocal PCA rather than CCA. Please see the integration vignette for examples.
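[Editor's note: a rough sketch of the reference-based / reciprocal-PCA route mentioned above. Object names such as obj.list are placeholders, and exact arguments may differ between Seurat versions; the integration vignette is the authoritative source.]

# obj.list: placeholder list of per-sample Seurat objects
features <- SelectIntegrationFeatures(object.list = obj.list)
obj.list <- lapply(obj.list, function(x) {
  x <- ScaleData(x, features = features, verbose = FALSE)
  RunPCA(x, features = features, verbose = FALSE)
})

# Reference-based anchors with reciprocal PCA instead of CCA
anchors <- FindIntegrationAnchors(object.list = obj.list,
                                  reference = 1,
                                  reduction = "rpca",
                                  anchor.features = features,
                                  dims = 1:30)
integrated <- IntegrateData(anchorset = anchors, dims = 1:30)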

timoast closed this as completed Aug 2, 2019
lambdamoses commented Sep 25, 2019

I looked at the source code and this is the line of code that is the culprit:

integration.matrix <- data.use2 - data.use1

I don't think this is Seurat's problem but rather a problem with Matrix, which still doesn't support vectors with more than 2^31 elements. It's just that a sparse matrix with too many non-zero elements is produced. This could be worked around by using the sparse matrix package spam64, but that would require changes to Seurat's source code. Actually supporting long vectors is on the to-do list of the Matrix developers, but somehow they still haven't implemented it.
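[Editor's note: a toy sketch of the spam/spam64 idea, not Seurat or proposed code. The option name and conversion details are assumptions and may differ between spam versions; the point is that spam64 extends spam's sparse matrices to 64-bit indexing, so element-wise arithmetic is not capped at 2^31 - 1 stored entries.]

library(spam)
library(spam64)
options(spam.force64 = TRUE)  # assumed switch to 64-bit index structures

# Tiny dense stand-ins for data.use1 / data.use2
data.use1 <- as.spam(matrix(rnorm(20), nrow = 4))
data.use2 <- as.spam(matrix(rnorm(20), nrow = 4))

# The element-wise subtraction that overflows Matrix's 32-bit indices
integration.matrix <- data.use2 - data.use1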

Also, do you think Seurat should use HDF5Array from Bioconductor for data that doesn't fit into memory?

@lambdamoses

I added reference = 1, i.e. anchors <- FindIntegrationAnchors(seus, normalization.method = "SCT", anchor.features = features_use, reference = 1), and this issue was avoided. I still got decent results after integrating 21 datasets with over 180k cells in total.

@Chenmengpin

I added reference = 1, i.e. anchors <- FindIntegrationAnchors(seus, normalization.method = "SCT", anchor.features = features_use, reference = 1), and this issue was avoided. I still got decent results after integrating 21 datasets with over 180k cells in total.

Hi, I've also encountered this "long vectors not supported" error and am trying to add reference = 1, but I was wondering what it means - which dataset is used as the reference during the integration? I'm quite new to bioinformatics and would very much appreciate any answer. Thank you!

@lambdamoses

That means using the first dataset in your list as the reference. With a reference, each of the other datasets is integrated with the reference, so fewer pairwise integrations are done, which makes the code faster.
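[Editor's note: a quick illustration of the speed-up. Without a reference, anchors are found between every pair of datasets; with a single reference, only reference-versus-query pairs are computed.]

k <- 21          # number of datasets, as in the example above
choose(k, 2)     # 210 pairwise anchor searches without a reference
k - 1            # 20 searches when one dataset is the reference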

@Chenmengpin

That means using the first dataset in your list as the reference. With a reference, each of the other datasets is integrated with the reference, so fewer pairwise integrations are done, which makes the code faster.

I see. So it would give me different integration results when using another dataset in the list as the reference, but I was also wondering how to decide which one is best as a reference - do I need to try all the datasets in my list to get the best results? Thank you.

@rebeccawuu

@Chenmengpin I was wondering if you got any replies to your question about how referencing would affect the integration or how to determine what a good reference is, as I am new to bioinformatics and have the same question!

gabsax commented Feb 23, 2021

@Chenmengpin I was also wondering whether you found the answer to that question.

@Chenmengpin

@Chenmengpin I was wondering if you got any replies to your question about how referencing would affect the integration or how to determine what a good reference is, as I am new to bioinformatics and have the same question!

Sorry for the late reply. I don't really have the right answer for this issue; I tried several different datasets as the reference and found the results were quite similar in my case. I guess you can try it this way and see how it goes with your own data.

@Chenmengpin

@Chenmengpin I was also wondering whether you found the answer to that question.

I tried different datasets as the reference and the integration results look similar. Trying a dataset with good quality would give you a good result, I guess.

gabsax commented Feb 24, 2021

Alright thanks @Chenmengpin

@Pingxu0101

Hi all, I also have the same error. I have a question: what does 1:pa$dims mean? @markrobinsonuzh, I saw you already solved this error, but I didn't understand the pa here. Could you explain a little bit? Thanks in advance.

seurat <- IntegrateData( anchorset=anchors,
dims = 1:pa$dims)

@zhanghao-njmu

You can try the modified version at https://github.com/zhanghao-njmu/seurat, for which I have created pull request #6527.

I changed the integration matrix format to "spam" or "spam64" to process matrices with more than 2^31-1 non-zero elements.

However, it is important to note that processing large data requires sufficient memory. I tested the modified version on data with more than 200,000 cells; the maximum memory used during the calculation was 1 TB.
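[Editor's note: if you want to try the fork, installing it like any other GitHub R package should work; this is a sketch, and branch/installation details are not confirmed here.]

# install.packages("remotes")
remotes::install_github("zhanghao-njmu/seurat")
library(Seurat)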
