Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Best practices for using regress_out? #35

Closed
oligomyeggo opened this issue Apr 15, 2020 · 11 comments
Closed

Best practices for using regress_out? #35

oligomyeggo opened this issue Apr 15, 2020 · 11 comments

Comments

@oligomyeggo
Copy link

oligomyeggo commented Apr 15, 2020

I am adapting the current best practices workflow (epithelial cells) from @LuckyMD with my own data set, and am running into an issue/question. I am subsetting my data to include a few clusters of interest. Once I have those clusters isolated, I am selecting highly variable genes, regressing out effects of cell cycle, ribo genes and mito genes, scaling the data, and embedding a new UMAP, all in preparation for some downstream trajectory analysis (I opted not to regress out anything in my main data set, but want to regress out some confounding factors in my subset specifically for trajectory analysis). Is it better to first run pp.highly_variable_genes and then use pp.regress_out, or is better to run pp.regress_out followed by pp.highly_variable_genes? My code currently looks like the following:

Subset to highly variable genes
sc.pp.highly_variable_genes(adata_sub, flavor='cell_ranger', n_top_genes=4000, subset=True)

Regress out effects of cell cycle, mito genes, and ribo genes
sc.pp.regress_out(adata_sub, ['S_score', 'G2M_score', 'percent_mt', 'percent_ribo'])

Scale
sc.pp.scale(adata_sub, max_value=10)

Calculate the visualization

sc.pp.pca(adata_sub, n_comps=50, use_highly_variable=True, svd_solver='arpack')
sc.pp.neighbors(adata_sub)
sc.tl.umap(adata_sub)

If I run pp.regress_out before pp.highly_variable_genes, I have to include the line pp.filter_genes(adata_sub, min_counts=1 or else I get ValueError: The first guess on the deviance function returned a nan. This could be a boundary problem and should be reported. However, after doing some trial and error runs, I believe that including pp.filter_genes(adata_sub, min_counts=1 is excluding some genes of interest from my downstream trajectory analysis. I am able to recover these genes by reverting back to running pp.highly_variable_genes before pp.regress_out and excluding pp.filter_genes.

Intuitively I feel like it makes more sense to run pp.regress_out before pp.highly_variable_genes, but considering I am having issues using that order for downstream analysis, is it OK to run pp.regress_out after pp.highly_variable_genes? What is the best practice?

@LuckyMD
Copy link
Contributor

LuckyMD commented Apr 15, 2020

Hi @oligomyeggo,

You have found what I would say is the most annoying issue in scanpy's pipeline at the moment. A brief explanation:
sc.pp.regress_out() returns only the residuals of the regression, and doesn't add the offset again. That means you have an expected mean of 0 per gene. However, sc.pp.highly_variable_genes() bins genes by mean expression to evaluate dispersion outliers per bin and ensure that selection of HVGs is not biased towards highly expressed genes. This binning is no longer possible on genes with 0 mean (hence your boundary problem). Thus, both scaling and regression don't really work before HVG selection, although I agree with you that this would be the intuitive way forward.

There are two ways forward:

  1. you regress out and scale after HVG selection. The danger here is that your HVGs may be dependent on the regressed out covariate. You could get around this type of thing by selecting HVGs per cell cycle phase (using the batch parameter in HVG selection), but you can't do this for MT and ribosomal gene expression unless you bin those. This is overall not very elegant though. The easiest approach would be to take this effect into account and regress out after your HVG selection.
  2. You submit a pull request to scanpy to give the option to the sc.pp.regress_out() function to add the intercept term again. That way you can use sc.pp.highly_variable_genes() as expected. This would be the technically better version.

Unfortunately I don't have the bandwidth at the moment to implement solution 2. Maybe @Koncopd or @ivirshup have time? It shouldn't be too much work I guess.

Hope this helps.

On another note, I am curious why you decide to regress out MT and ribosomal gene expression?

@oligomyeggo
Copy link
Author

Thanks for the thorough response @LuckyMD! I see the issue now. I'll submit a pull request, and just move ahead with regressing out and scaling after HVG selection for now.

When I was initially analyzing my subset of cells I noticed that I was getting some clusters based on cell cycle, which is why I regressed that out. Upon further analysis, I saw that I was always getting a cluster or two that were dominated by ribo and mito genes. I do think that this is biologically relevant, and that this cluster represents an important transition state in my cell population. However, inclusion of this cluster seemed to be confounding downstream trajectory analysis a bit, and I was also curious if there were any other important features that might cluster cells/identify clusters if I ignored the ribo and mito genes (again, since these probably represent a transition state but not necessarily a unique cell population, if that makes sense). Is it reasonable to regress out the ribo and mito genes at this stage of analysis, since it isn't being done blindly but is informed by previous analysis without regressing out the genes? Or is it better to not regress out anything? I tried to avoid regressing out anything when working with my entire data set, but when it got down to doing trajectory analysis with my subset it seemed to make sense. I'm definitely open to thoughts on improving my workflow though!

@LuckyMD
Copy link
Contributor

LuckyMD commented Apr 16, 2020

I'm always quite hesitant of regressing out things tbh. But I'm sure there is a reason to regress out effects that I'm not aware of. I do think that regressing out cell cycle phase is important for trajectory analysis, but I'm not sure about MT and ribo genes. Cell cycle is a biological signal that can mask other signals that you might be looking for. If you can interpret MT or ribo gene expression as such a continuous signal across cells, then I think there is a good rationale to regress that out as well. Otherwise, you have a particular fixed cell state that you are trying to regress out of every cell in the dataset... I'm not sure what that means for the results you would get. Typically I interpret high MT gene expression either as a signal for increased respiratory activity, or as low quality cells (if combined with low total counts). The former might be a continuous process, but usually also a biological signal of interest. The latter is a cluster of cells that I would normally remove. I'm not sure how to interpret increased ribosomal gene expression.

I'd be curious to hear if your regressing out of these covariates helps though. If you do regress this out, I would regress all effects out jointly in a single call of sc.pp.regress_out().

@oligomyeggo
Copy link
Author

Yes, in our case it looks like there might be a stage of the cell cycle where our progenitor population goes through a high ribo gene expression phase. But you make some good points about regressing out things more conservatively. I've run through the workflow twice, one where I regress out cell cycle, mito and ribo genes (all in a single in sc.pp.regress_out() call) and one where I just regress out cell cycle. It honestly doesn't seem to have as much of an impact as I initially thought. We get the same underlying biological patterns either way, and the high ribo gene expression cluster (that shows up when we don't regress out the ribo genes) isn't one we are interested in including for downstream analysis at this point anyway. So all that considered, I think I will just regress out the cell cycle effects for downstream trajectory analysis and leave everything else intact. Thank you for your thoughtful response and advice!

@LuckyMD
Copy link
Contributor

LuckyMD commented Apr 20, 2020

No problem. Just out of curiosity, in case you can share I would love to know what it means that you have a high ribo gene expression phase. Is this just an observation in a trajectory? Or do you have some biological rationale behind high ribo gene expression? I've always wondered how I would interpret things like stem cell clusters having higher ribo gene expression.

@oligomyeggo
Copy link
Author

Sure! We think that this subset of high ribo gene expressing progenitors are ramping up to make a bunch of proteins, probably for differentiation, so they are upregulating translation. So the high ribo gene expressing progenitors represent a phase of leaving the progenitor state and preparing to differentiate. We don't have any real proof of this - just a working hypothesis at the moment, based on what we're seeing in the scRNA-seq data.

@LuckyMD
Copy link
Contributor

LuckyMD commented Apr 22, 2020

Exciting! We thought about sth like this for stem cell clusters as well, but unsure how to validate this. If you get any further evidence for that, would love to hear about it :). Good luck!

@LuckyMD LuckyMD closed this as completed Apr 22, 2020
@alexcwsmith
Copy link

I know this is old / has been closed, but any status on a PR to add the intercept back in to regress out? It would be great to be able to use this before HVG selection. I might have some time to work on implementing that.

On a side note, I've also been noticing lately certain clusters with very high ribosomal gene expression. But my data is from adult, post-mitotic neurons, so it isn't likely due to cell cycle or differentiation... Curious if anyone has any more thoughts on that almost a year after this original post.

@oligomyeggo
Copy link
Author

2020 definitely derailed a lot of things for me scientifically (which I know is true for a lot of people haha). I have not started a pull request to give the option to the sc.pp.regress_out() function to add the intercept term again.

With regards to the ribosomal gene expression, I keep finding sub-clusters of progenitors with high ribo gene expression in basically every data set I look at - but these are all embryonic data sets (mostly mouse, but also some human). Nevertheless, I think it is biologically interesting and meaningful and am planning some wet lab experiments to follow up on this once I am back in lab. Not sure what it means that you are seeing some in post-mitotic neurons.

@LuckyMD
Copy link
Contributor

LuckyMD commented Feb 4, 2021

Not sure what it means that you are seeing some in post-mitotic neurons.

This is probably why most people just ignore them ;).

@alexcwsmith It would be great to see a scanpy PR on just adding the intercept back in. You can @ me there as well if you want.

@hyjforesight
Copy link

hello @oligomyeggo,
We do subseting like this adata_sub=adata[adata.obs['leiden'].isin(['3', '5', '8', '11', '14'])], and then do HVG on adata_sub. Because the data in adata_sub has already been scaled, when we do HVG on adata_sub, the data is totally messed up. So, i'm wondering could you please share the coding of subseting clusters with us?
Thanks!
Best,
YJ

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants