-
Notifications
You must be signed in to change notification settings - Fork 458
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Best practices for using regress_out? #35
Comments
Hi @oligomyeggo, You have found what I would say is the most annoying issue in scanpy's pipeline at the moment. A brief explanation: There are two ways forward:
Unfortunately I don't have the bandwidth at the moment to implement solution 2. Maybe @Koncopd or @ivirshup have time? It shouldn't be too much work I guess. Hope this helps. On another note, I am curious why you decide to regress out MT and ribosomal gene expression? |
Thanks for the thorough response @LuckyMD! I see the issue now. I'll submit a pull request, and just move ahead with regressing out and scaling after HVG selection for now. When I was initially analyzing my subset of cells I noticed that I was getting some clusters based on cell cycle, which is why I regressed that out. Upon further analysis, I saw that I was always getting a cluster or two that were dominated by ribo and mito genes. I do think that this is biologically relevant, and that this cluster represents an important transition state in my cell population. However, inclusion of this cluster seemed to be confounding downstream trajectory analysis a bit, and I was also curious if there were any other important features that might cluster cells/identify clusters if I ignored the ribo and mito genes (again, since these probably represent a transition state but not necessarily a unique cell population, if that makes sense). Is it reasonable to regress out the ribo and mito genes at this stage of analysis, since it isn't being done blindly but is informed by previous analysis without regressing out the genes? Or is it better to not regress out anything? I tried to avoid regressing out anything when working with my entire data set, but when it got down to doing trajectory analysis with my subset it seemed to make sense. I'm definitely open to thoughts on improving my workflow though! |
I'm always quite hesitant of regressing out things tbh. But I'm sure there is a reason to regress out effects that I'm not aware of. I do think that regressing out cell cycle phase is important for trajectory analysis, but I'm not sure about MT and ribo genes. Cell cycle is a biological signal that can mask other signals that you might be looking for. If you can interpret MT or ribo gene expression as such a continuous signal across cells, then I think there is a good rationale to regress that out as well. Otherwise, you have a particular fixed cell state that you are trying to regress out of every cell in the dataset... I'm not sure what that means for the results you would get. Typically I interpret high MT gene expression either as a signal for increased respiratory activity, or as low quality cells (if combined with low total counts). The former might be a continuous process, but usually also a biological signal of interest. The latter is a cluster of cells that I would normally remove. I'm not sure how to interpret increased ribosomal gene expression. I'd be curious to hear if your regressing out of these covariates helps though. If you do regress this out, I would regress all effects out jointly in a single call of |
Yes, in our case it looks like there might be a stage of the cell cycle where our progenitor population goes through a high ribo gene expression phase. But you make some good points about regressing out things more conservatively. I've run through the workflow twice, one where I regress out cell cycle, mito and ribo genes (all in a single in |
No problem. Just out of curiosity, in case you can share I would love to know what it means that you have a high ribo gene expression phase. Is this just an observation in a trajectory? Or do you have some biological rationale behind high ribo gene expression? I've always wondered how I would interpret things like stem cell clusters having higher ribo gene expression. |
Sure! We think that this subset of high ribo gene expressing progenitors are ramping up to make a bunch of proteins, probably for differentiation, so they are upregulating translation. So the high ribo gene expressing progenitors represent a phase of leaving the progenitor state and preparing to differentiate. We don't have any real proof of this - just a working hypothesis at the moment, based on what we're seeing in the scRNA-seq data. |
Exciting! We thought about sth like this for stem cell clusters as well, but unsure how to validate this. If you get any further evidence for that, would love to hear about it :). Good luck! |
I know this is old / has been closed, but any status on a PR to add the intercept back in to regress out? It would be great to be able to use this before HVG selection. I might have some time to work on implementing that. On a side note, I've also been noticing lately certain clusters with very high ribosomal gene expression. But my data is from adult, post-mitotic neurons, so it isn't likely due to cell cycle or differentiation... Curious if anyone has any more thoughts on that almost a year after this original post. |
2020 definitely derailed a lot of things for me scientifically (which I know is true for a lot of people haha). I have not started a pull request to give the option to the With regards to the ribosomal gene expression, I keep finding sub-clusters of progenitors with high ribo gene expression in basically every data set I look at - but these are all embryonic data sets (mostly mouse, but also some human). Nevertheless, I think it is biologically interesting and meaningful and am planning some wet lab experiments to follow up on this once I am back in lab. Not sure what it means that you are seeing some in post-mitotic neurons. |
This is probably why most people just ignore them ;). @alexcwsmith It would be great to see a scanpy PR on just adding the intercept back in. You can @ me there as well if you want. |
hello @oligomyeggo, |
I am adapting the current best practices workflow (epithelial cells) from @LuckyMD with my own data set, and am running into an issue/question. I am subsetting my data to include a few clusters of interest. Once I have those clusters isolated, I am selecting highly variable genes, regressing out effects of cell cycle, ribo genes and mito genes, scaling the data, and embedding a new UMAP, all in preparation for some downstream trajectory analysis (I opted not to regress out anything in my main data set, but want to regress out some confounding factors in my subset specifically for trajectory analysis). Is it better to first run
pp.highly_variable_genes
and then usepp.regress_out
, or is better to runpp.regress_out
followed bypp.highly_variable_genes
? My code currently looks like the following:Subset to highly variable genes
sc.pp.highly_variable_genes(adata_sub, flavor='cell_ranger', n_top_genes=4000, subset=True)
Regress out effects of cell cycle, mito genes, and ribo genes
sc.pp.regress_out(adata_sub, ['S_score', 'G2M_score', 'percent_mt', 'percent_ribo'])
Scale
sc.pp.scale(adata_sub, max_value=10)
Calculate the visualization
If I run
pp.regress_out
beforepp.highly_variable_genes
, I have to include the linepp.filter_genes(adata_sub, min_counts=1
or else I getValueError: The first guess on the deviance function returned a nan. This could be a boundary problem and should be reported.
However, after doing some trial and error runs, I believe that includingpp.filter_genes(adata_sub, min_counts=1
is excluding some genes of interest from my downstream trajectory analysis. I am able to recover these genes by reverting back to runningpp.highly_variable_genes
beforepp.regress_out
and excludingpp.filter_genes
.Intuitively I feel like it makes more sense to run
pp.regress_out
beforepp.highly_variable_genes
, but considering I am having issues using that order for downstream analysis, is it OK to runpp.regress_out
afterpp.highly_variable_genes
? What is the best practice?The text was updated successfully, but these errors were encountered: