Cannot allocate vector of size ... #3
Hi,
Thanks for your interest! It looks like a memory issue judging from the error message, but a dataset with 721 features and 172 samples is far from too big. The largest intermediate object the procedure produces is roughly a 5000 by 5000 covariance matrix, whose memory footprint is less than 0.5 Gb.
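For a rough sense of scale (a back-of-the-envelope sketch, not code from the package), a dense n by n matrix of doubles in R takes 8n² bytes:
```r
5000^2 * 8 / 1024^3     # ~0.19 Gb: the 5000 x 5000 covariance matrix
sqrt(9.6 * 1024^3 / 8)  # ~35,900: the side of the square matrix a 9.6 Gb allocation implies
```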
Could you please check that "abundances(phyloseq)" is a 721 by 172 data frame, that "meta(phyloseq)" is a data frame with appropriate dimensions (the number of rows should be 172), and that the "age" variable is not stored as a factor? If "age" is a factor, each distinct age becomes its own dummy variable in the design matrix, so the covariance matrix could actually become that large. I will think about what we can do next if none of these explains the error. The quick checks sketched below may help.
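A minimal sketch of those checks, assuming abundances() and meta() come from the microbiome package, as in your call:
```r
library(microbiome)

dim(abundances(phyloseq))  # expect 721 x 172 (features x samples)
dim(meta(phyloseq))        # expect 172 rows, one per sample
class(meta(phyloseq)$age)  # should be "integer" or "numeric", not "factor"

# If age was accidentally read in as a factor, convert it back to numeric
# and pass the corrected data frame to linda():
meta_df <- meta(phyloseq)
meta_df$age <- as.numeric(as.character(meta_df$age))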
Thanks,
Huijuan
On Tue, May 17, 2022 at 11:56 PM ubiminor wrote:
Dear Huijuan,
I'm happy about this tool for many reasons, first of all because it is fast compared to other tools (DESeq2, corncob). However, I am having an issue with a differential abundance run on 172 samples and 721 SGBs from MetaPhlAn (each sample sums to 100):
```r
linda(otu.tab = abundances(phyloseq), meta = meta(phyloseq),
      formula = "~ age + sex + using_drugs + trait_of_interest",
      type = "proportion", adaptive = TRUE)
```
This outputs: Error: cannot allocate vector of size 9.6 Gb.
Is this really that memory-intensive? Or am I doing something wrong?
The variable values are:
* age: integer
* sex: binary
* using_drugs: binary
* trait_of_interest: integer taking values 1, 2, or 3
Hi Giacomo,
I'm glad that it worked, and thanks for your insightful feedback.
1. Yes. For LinDA, proportion data and count data are treated pretty much the same, as the data will be CLR (centered log-ratio) transformed anyway. The only difference is that if the data are counts, the sequencing depth information is available, which LinDA uses to impute zeros. If the data are proportions, we use the "half minimum approach" (half of the smallest nonzero proportion value among samples for a feature) to replace zeros; see the sketch below.
2. In the model that motivates LinDA, the absolute abundance in the ecosystem is the dependent variable, and the sequencing depth is unrelated to it. So the total number of reads is not included as a regressor. But you've made a good point: in some methods (such as those employing negative binomial models), the sequencing depth is a model component.
Although the real proportion (the proportion in the ecosystem) is not related to the sequencing depth, a low sequencing depth causes under-sampling, especially of rare features, which means the sequencing depth does influence the observed proportions. I believe a thoughtful procedure is required to address this issue.
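A minimal sketch of the half-minimum rule from point 1 above (an illustration of the idea, not the package's exact internal code), for a features-by-samples matrix of proportions:
```r
half_min_impute <- function(props) {
  t(apply(props, 1, function(x) {        # one feature (row) at a time
    if (any(x == 0) && any(x > 0))
      x[x == 0] <- 0.5 * min(x[x > 0])   # half the smallest nonzero proportion
    x
  }))
}
```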
Best,
Huijuan
On Thu, May 19, 2022 at 3:30 PM ubiminor wrote:
Dear Huijuan,
Thank you for your detailed reply. I guess it was a variable-encoding issue; I don't know, because when I started from scratch with cleaner code it worked. Thank you for this!
Another point: since MetaPhlAn returns counts summing to 100, these are proportions. I accounted for that by making each sample's proportions sum to 1 and then using type = "proportion" in linda.
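For instance, the rescaling could look like this (a sketch where rel_ab is a hypothetical features-by-samples matrix of MetaPhlAn percentages):
```r
props <- sweep(rel_ab, 2, colSums(rel_ab), "/")  # divide each sample (column) by its total
colSums(props)                                   # every column should now sum to 1
```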
1. Is LinDA's approach still valid with this setup, even though it can't leverage the sequencing depth?
2. Do you think it is a valid approach to include the number of reads mapped per sample as a covariate, to mimic what LinDA would do internally with counts?
Thank you for the insight! Is the model's performance (accuracy/power/...) still good without imputing zeros with the default method for count data?
Giacomo
Zero treatment is necessary in LinDA because it involves logarithms. If the data are counts, we provide two zero-handling strategies: add a pseudo-count (e.g., 0.5) to all counts, or impute zeros using ratios between the sequencing depths. The choice between these two strategies does indeed affect the performance (accuracy/power/...), so we have supplied an adaptive approach to choose between them.
If the data are proportions, we use the "half minimum approach" to replace zeros.
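For illustration, the choice might be exercised like this, assuming the linda() arguments adaptive, imputation, and pseudo.cnt as described in the package documentation (please check ?linda in your installed version; the formula "~ group" is a placeholder):
```r
res_pseudo <- linda(otu.tab, meta, formula = "~ group",
                    type = "count", adaptive = FALSE, pseudo.cnt = 0.5)   # pseudo-count
res_impute <- linda(otu.tab, meta, formula = "~ group",
                    type = "count", adaptive = FALSE, imputation = TRUE)  # depth-based imputation
res_adapt  <- linda(otu.tab, meta, formula = "~ group",
                    type = "count", adaptive = TRUE)                      # let LinDA choose
```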
Huijuan