
different results in different runs #16

Closed
bsierieb1 opened this issue Dec 8, 2020 · 8 comments

@bsierieb1

Hi,

My data set contains 155 clusters and 2 conditions (treatment vs control). I performed 15 sample_hmc runs with the same data and the same NoBaselineModel, and the results could be broadly classified as follows:
10/15 runs: ~15-20 cell types had non-zero final parameter values, all of them negative
5/15 runs: ~50-60 cell types had non-zero final parameter values, all of them positive

As expected, the cell types that had negative final parameter values in some runs did not overlap with the cell types that had positive final parameter values in the other runs. However, I am puzzled by the fact that there was not a single run in which I captured both positive and negative cell types at the same time.

What could this mean, and how should I go about it? Should I do some kind of bootstrapping?

Thanks!

@johannesostner
Collaborator

Hi!

This is a conceptual problem that arises when running scCODA without an explicit reference (or baseline) category. Since compositions are constrained in nature, it is only possible to make statements relative to a certain category (see e.g. this paper on microbial data; the same reasoning translates directly to single-cell data).

When running NoBaselineModel, all cell types with a final parameter of 0 implicitly act as the reference set. If you set one cell type as the baseline (by using BaselineModel), the results will become consistent, since we then always compare against the same reference.

As a reference cell type, you should choose one whose relative abundance is fairly consistent over all samples.
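For illustration, here is a minimal sketch of fixing the reference cell type. It assumes the later published scCODA interface (`CompositionalAnalysis` with a `reference_cell_type` argument), which may differ from the pre-release `BaselineModel` class discussed in this thread, and the toy counts are purely made up:

```python
import pandas as pd
import sccoda.util.cell_composition_data as dat
import sccoda.util.comp_ana as mod

# Toy counts: 4 samples x 3 cell types plus a condition covariate
# (purely illustrative numbers, not real data).
cell_counts = pd.DataFrame({
    "Goblet": [110, 105, 100, 108],
    "Tuft": [40, 35, 80, 85],
    "Enterocyte": [400, 420, 390, 410],
    "Condition": ["control", "control", "treated", "treated"],
})

data = dat.from_pandas(cell_counts, covariate_columns=["Condition"])

# Fixing one cell type as the reference keeps the interpretation consistent
# across runs; pick one whose relative abundance varies little across samples.
model = mod.CompositionalAnalysis(data, formula="Condition", reference_cell_type="Goblet")
results = model.sample_hmc()
results.summary()
```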

@bsierieb1
Author

Thanks for the explanation! Is there a rule of thumb for selecting the reference without any prior knowledge of which cell type should remain unaffected by treatment? Do you think it would make sense to select the cell type whose proportion of treated cells is closest to the proportion of treated cells in the entire data set?

@johannesostner
Collaborator

If you have no prior knowledge, I would take a cell type that has low variation (in terms of relative abundance) over the entire dataset. This makes interpretation easier, since now the proportion of your reference category is neither changing drastically between conditions nor within conditions.

For example, in the dataset from the tutorial I would choose Goblet or Endocrine cells:
[image: cell type proportions in the tutorial dataset]
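A quick way to screen for such a low-variation reference candidate, as a sketch that is independent of the scCODA API (`counts` is an assumed samples x cell-types DataFrame of raw cell counts):

```python
import pandas as pd

def rank_reference_candidates(counts: pd.DataFrame) -> pd.Series:
    """Rank cell types by how variable their relative abundance is across samples."""
    proportions = counts.div(counts.sum(axis=1), axis=0)     # per-sample proportions
    cv = proportions.std(axis=0) / proportions.mean(axis=0)  # coefficient of variation
    return cv.sort_values()                                  # lowest variation first

# rank_reference_candidates(counts).head()  # top entries = best reference candidates
```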

@bsierieb1
Author

It was a useful exercise to run it with a baseline, but the issue persists: I still get either one set of negative clusters or a different set of positive clusters, depending on the run. Does this mean it is essentially impossible to say whether some clusters expanded or other clusters contracted under treatment, because they kind of "compensate" for each other? I know I may not be phrasing it correctly, but I hope you understand what I mean...

@johannesostner
Collaborator

I think I know what you are trying to point out with clusters "compensating" for each other, and this is exactly what makes compositional data analysis sometimes difficult to interpret: Due to the total sum constraint on a sample (the number of cells that can be sequenced in one go is fixed), the counts are all inherently negatively correlated.
This means that if one cluster expands, all others seem to contract in order to, as you say, compensate.

Therefore, the two solutions you get are two equivalent (and equally likely) interpretations of the same data, and there is no way to tell which of them is "true" on a global scale, at least not without additional quantitative measurements.
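A tiny numeric illustration of this closure effect (made-up counts): only the first cluster's true count changes, yet all other proportions shrink, so "cluster 0 expanded" and "clusters 1-3 contracted" describe the same observation:

```python
import numpy as np

counts_control = np.array([100, 200, 300, 400])
counts_treated = np.array([400, 200, 300, 400])  # only the first cluster truly expands

prop_control = counts_control / counts_control.sum()
prop_treated = counts_treated / counts_treated.sum()

print(np.round(prop_control, 3))  # [0.1   0.2   0.3   0.4  ]
print(np.round(prop_treated, 3))  # [0.308 0.154 0.231 0.308]
# Without a fixed reference, an increase of cluster 0 and a decrease of
# clusters 1-3 are equally valid readings of the same proportions.
```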

@bsierieb1
Author

Got it, thanks for the explanation. In such a case, I am wondering what would be the most appropriate way to show what I observe in a figure?

When someone simply compares cell type ratios using something like a Wald test instead of doing compositional analysis as implemented in scCODA, they sometimes show a volcano plot with the change in cell type proportions on the x axis and -log(p) on the y axis. I really like this representation because it is very visual, and most biologists are used to volcano plots. So how about the following: I could run scCODA n times and then plot the max/min value or the range of the "final parameter" values on the x axis and the number of runs in which the final parameter was !=0 on the y axis. Would there be anything wrong with this conceptually, as long as I explain in the text that it is impossible to distinguish between some clusters contracting and other clusters expanding?

My main goal is simply to identify clusters which may be changing and focus on them for the downstream analyses.

@johannesostner
Collaborator

Hmm, this is quite a unique idea. So far, we have only ever reported single runs of scCODA with a fixed reference and analyzed everything relative to this reference. For this, we usually show grouped boxplots (grouped by cell type, one plot per category) and indicate the ones with a credible effect somehow. But this does not work in your case, since you have 155 clusters, which is far more than we have ever used so far.

I like the idea of looking at how often a parameter was found to be credibly different from 0 (final parameter !=0) to gain a consistency estimate, and to use this as a starting point for further analysis. However, in my opinion you should not use the numerical values of the final parameters as you proposed. They depend not only on the rate of change, but also on the dispersion of your data and the values of the intercept parameters, which makes them very hard to interpret.

What you could use is the sign of the parameters, though. Negative parameters will always indicate a decrease, while positive ones indicate an increase in relative abundance. I would therefore report how often each cluster's parameter was different from 0 (maybe divided into increasing/decreasing), and then plot this as a barplot or heatmap of some sort.
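A sketch of how such a multi-run summary could be assembled, assuming you have already extracted one pandas Series of final parameters per run (indexed by cell type, with 0 meaning "no credible effect"); how you extract that Series depends on your scCODA version:

```python
import pandas as pd
import matplotlib.pyplot as plt

def summarize_runs(final_params_per_run):
    """Count, per cell type, how many runs showed a credible increase or decrease.

    `final_params_per_run` is assumed to be a list of pandas Series, one per
    scCODA run, indexed by cell type, where 0 means "no credible effect".
    """
    runs = pd.concat(final_params_per_run, axis=1)  # cell types x runs
    return pd.DataFrame({
        "credible increase": (runs > 0).sum(axis=1),
        "credible decrease": (runs < 0).sum(axis=1),
    })

# summary = summarize_runs(all_runs)        # all_runs: list of per-run Series
# hits = summary[summary.sum(axis=1) > 0]   # clusters credible in at least one run
# hits.plot.bar(stacked=True)
# plt.ylabel("number of runs with credible effect")
# plt.tight_layout()
# plt.show()
```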

Lastly, I just want to point out that scCODA is still unpublished, so please contact us (preferably @b-schubert) before including it in any reports.

@bsierieb1
Author

This is a very good thought! A barplot with the number of runs in which the parameter was credibly different from 0 (separated by + and -) provides a nice summary, and it does not prompt questions about some sort of "significance" threshold, which one normally expects to see in a volcano plot. Thanks for the discussion, and I will make sure to reach out to you in case I want to present the data outside my lab.
