
different results in different runs #16

Closed
bsierieb1 opened this issue Dec 8, 2020 · 8 comments

@bsierieb1

Hi,

My data set contains 155 clusters and 2 conditions (treatment vs control). I performed 15 sample_hmc runs with the same data and the same NoBaselineModel, and the results could be broadly classified as follows:
10/15 runs: ~15-20 cell types had non-zero final parameter values, all of them negative
5/15 runs: ~50-60 cell types had non-zero final parameter values, all of them positive

As expected, the cell types that had negative final parameter values in some runs did not overlap with the cell types that had positive final parameter values in the other runs. However, I am puzzled by the fact that there was not a single run in which I captured both positive and negative cell types at the same time.

What could this mean, and how should I go about it? Should I do some kind of bootstrapping?

Thanks!

@johannesostner
Collaborator

Hi!

This is a conceptual problem that arises when running scCODA without an explicit reference (or baseline) category. Since compositions are constrained in nature, it is only possible to make statements relative to a certain category (see e.g. this paper on microbial data; the same reasoning translates directly to single-cell data).

When running NoBaselineModel, all cell types with a final parameter of 0 implicitly act as the reference set. If you set one cell type as the baseline (by using BaselineModel), the results will become consistent, since we then always compare against the same reference.

As a reference cell type, you should choose one whose relative abundance is fairly consistent over all samples.
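For illustration, here is a minimal sketch of fixing the reference cell type. It assumes the later published scCODA interface (`CompositionalAnalysis` with a `reference_cell_type` argument), which may differ from the pre-release `BaselineModel` class discussed in this thread, and the toy counts are purely made up:

```python
import pandas as pd
import sccoda.util.cell_composition_data as dat
import sccoda.util.comp_ana as mod

# Toy counts: 4 samples x 3 cell types plus a condition covariate
# (purely illustrative numbers, not real data).
cell_counts = pd.DataFrame({
    "Goblet": [110, 105, 100, 108],
    "Tuft": [40, 35, 80, 85],
    "Enterocyte": [400, 420, 390, 410],
    "Condition": ["control", "control", "treated", "treated"],
})

data = dat.from_pandas(cell_counts, covariate_columns=["Condition"])

# Fixing one cell type as the reference keeps the interpretation consistent
# across runs; pick one whose relative abundance varies little across samples.
model = mod.CompositionalAnalysis(data, formula="Condition", reference_cell_type="Goblet")
results = model.sample_hmc()
results.summary()
```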

@bsierieb1
Author

Thanks for the explanation! Is there a rule of thumb for selecting the reference without any prior knowledge of which cell type should remain unaffected by treatment? Do you think it would make sense to select the cell type whose proportion of treated cells is closest to the proportion of treated cells in the entire data set?

@johannesostner
Collaborator

If you have no prior knowledge, I would take a cell type that has low variation (in terms of relative abundance) over the entire dataset. This makes interpretation easier, since now the proportion of your reference category is neither changing drastically between conditions nor within conditions.

For example, in the dataset from the tutorial I would choose Goblet or Endocrine cells:
[image: cell type proportions in the tutorial dataset]
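A quick way to screen for such a low-variation reference candidate, as a sketch that is independent of the scCODA API (`counts` is an assumed samples x cell-types DataFrame of raw cell counts):

```python
import pandas as pd

def rank_reference_candidates(counts: pd.DataFrame) -> pd.Series:
    """Rank cell types by how variable their relative abundance is across samples."""
    proportions = counts.div(counts.sum(axis=1), axis=0)     # per-sample proportions
    cv = proportions.std(axis=0) / proportions.mean(axis=0)  # coefficient of variation
    return cv.sort_values()                                  # lowest variation first

# rank_reference_candidates(counts).head()  # top entries = best reference candidates
```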

@bsierieb1
Author

It was a useful exercise to run it with a baseline, but the issue persists: I still get either one set of negative clusters or a different set of positive clusters, depending on the run. Does this mean it is essentially impossible to say whether some clusters expanded or other clusters contracted under treatment, because they kind of "compensate" for each other? I know I may not be phrasing it correctly, but I hope you understand what I mean...

@johannesostner
Collaborator

I think I know what you are trying to point out with clusters "compensating" for each other, and this is exactly what makes compositional data analysis sometimes difficult to interpret: Due to the total sum constraint on a sample (the number of cells that can be sequenced in one go is fixed), the counts are all inherently negatively correlated.
This means that if one cluster expands, all others seem to contract in order to, as you say, compensate.

Therefore, the two solutions you get are two equivalent (and equally likely) interpretations of the same data, and there is no way to tell which of them is "true" on a global scale, at least not without additional quantitative measurements.
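A tiny numeric illustration of this closure effect (made-up counts): only the first cluster's true count changes, yet all other proportions shrink, so "cluster 0 expanded" and "clusters 1-3 contracted" describe the same observation:

```python
import numpy as np

counts_control = np.array([100, 200, 300, 400])
counts_treated = np.array([400, 200, 300, 400])  # only the first cluster truly expands

prop_control = counts_control / counts_control.sum()
prop_treated = counts_treated / counts_treated.sum()

print(np.round(prop_control, 3))  # [0.1   0.2   0.3   0.4  ]
print(np.round(prop_treated, 3))  # [0.308 0.154 0.231 0.308]
# Without a fixed reference, an increase of cluster 0 and a decrease of
# clusters 1-3 are equally valid readings of the same proportions.
```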

@bsierieb1
Author

Got it, thanks for the explanation. In such a case, I am wondering what would be the most appropriate way to show what I observe in a figure?

When someone simply compares cell type ratios using something like a Wald test instead of doing compositional analysis as implemented in scCODA, they sometimes show a volcano plot with the change in cell type proportions on the x axis and -log(p) on the y axis. I really like this representation because it is very visual, and most biologists are used to volcano plots. So how about the following: I could run scCODA n times and then plot the max/min value or the range of the "final parameter" values on the x axis and the number of runs in which the final parameter was !=0 on the y axis. Would there be anything wrong with this conceptually, as long as I explain in the text that it is impossible to distinguish between some clusters contracting and other clusters expanding?

My main goal is simply to identify clusters which may be changing and focus on them for the downstream analyses.

@johannesostner
Collaborator

Hmm, this is quite a unique idea. So far, we have only ever reported single runs of scCODA with a fixed reference and analyzed everything relative to this reference. For this, we usually show grouped boxplots (grouped by cell type, one plot per category) and indicate the ones with a credible effect somehow. But this does not work in your case, since you have 155 clusters, which is far more than we have ever used so far.

I like the idea of looking at how often a parameter was found to be credibly different from 0 (final parameter !=0) to gain a consistency estimate, and to use this as a starting point for further analysis. However, in my opinion you should not use the numerical values of the final parameters as you proposed. They depend not only on the rate of change, but also on the dispersion of your data and the values of the intercept parameters, which makes them very hard to interpret.

What you could use is the sign of the parameters, though. Negative parameters will always indicate a decrease, while positive ones indicate an increase in relative abundance. I would therefore report how often each cluster's parameter was different from 0 (maybe divided into increasing/decreasing), and then plot this as a barplot or heatmap of some sort.
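A sketch of how such a multi-run summary could be assembled, assuming you have already extracted one pandas Series of final parameters per run (indexed by cell type, with 0 meaning "no credible effect"); how you extract that Series depends on your scCODA version:

```python
import pandas as pd
import matplotlib.pyplot as plt

def summarize_runs(final_params_per_run):
    """Count, per cell type, how many runs showed a credible increase or decrease.

    `final_params_per_run` is assumed to be a list of pandas Series, one per
    scCODA run, indexed by cell type, where 0 means "no credible effect".
    """
    runs = pd.concat(final_params_per_run, axis=1)  # cell types x runs
    return pd.DataFrame({
        "credible increase": (runs > 0).sum(axis=1),
        "credible decrease": (runs < 0).sum(axis=1),
    })

# summary = summarize_runs(all_runs)        # all_runs: list of per-run Series
# hits = summary[summary.sum(axis=1) > 0]   # clusters credible in at least one run
# hits.plot.bar(stacked=True)
# plt.ylabel("number of runs with credible effect")
# plt.tight_layout()
# plt.show()
```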

Lastly, I just want to point out that scCODA is still unpublished, so please contact us (preferably @b-schubert) before including it in any reports.

@bsierieb1
Author

This is a very good thought! A barplot with the number of runs in which the parameter was credibly different from 0 (separated by + and -) provides a nice summary, and it does not prompt questions about some sort of "significance" threshold, which one normally expects to see in a volcano plot. Thanks for the discussion, and I will make sure to reach out to you in case I want to present the data outside my lab.
