Any interest in getting SCTransform up on Scanpy? #1643
Definitely love to have more contributions! I believe @LuckyMD and @giovp are quite keen on having this in the library. Excited to see your benchmarks! Side note: I'd definitely recommend looking into using
Thanks for the suggestion, @ivirshup! I'll take a look into that. Would this be something that goes into
I think there's a pretty strong desire on the team to see something like this in the main API, if you'd like to contribute it there.
Gotcha! I'll prioritize getting the benchmarks up, and then I'll need some guidance on how to organize it to fit into scanpy's codebase. Thanks!
agreed, this would be super cool!
totally agree! would be really cool to have!
I agree that this should belong not to 'external' but to the main API. Also, I would not initially be concerned about performance. Having the tool first is more important at the moment; we can later see how it can be optimized.
I looked at a few genes and it looks like I'm getting pretty similar residuals compared to the R implementation. Something weird is going on with the genes in the last couple of images, so I'm currently trying to figure that out before generating more thorough benchmarks. (I clip negative values to zero to preserve sparsity structure.)
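For concreteness, clipping negative residuals to zero can be sketched like this (a minimal illustration, not the linked implementation; the function name and the fixed `theta` are my own stand-ins). Entries that are zero in the raw counts get a negative residual, so clipping them back to zero keeps the sparsity pattern:

```python
import numpy as np

def clipped_pearson_residuals(counts, mu, theta=100.0):
    """NB Pearson residuals with negative values clipped to zero.

    Zero counts yield residuals of -mu / sqrt(mu + mu^2/theta), which are
    negative, so clipping at zero preserves the sparsity structure of the
    raw count matrix. `theta` is a fixed NB overdispersion (an assumption
    for this sketch).
    """
    resid = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    return np.clip(resid, 0, None)
```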
Hi @atarashansky, looks really good!
Re parallelization, @ivirshup already mentioned joblib. For reference, I want to share here a framework developed by @michalk8 that uses joblib for parallelizing a function over a collection of elements. It is already used in two scanpy sister packages, https://github.com/theislab/cellrank and https://github.com/theislab/scvelo, as well as a third one that is coming out next week. (Implementation in theislab/scvelo; implementation in theislab/cellrank.) I feel this setting is very similar to the one above, so it may be useful to check out. Let me know what you think!
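The basic joblib pattern those packages build on can be sketched as follows (a toy example, not the cellrank/scvelo framework itself; `fit_gene` is a hypothetical per-gene computation standing in for a real model fit):

```python
import numpy as np
from joblib import Parallel, delayed

def fit_gene(y):
    # Hypothetical per-gene computation; a real version would fit a
    # regression model to this gene's counts.
    return float(np.mean(y))

def parallel_over_genes(matrix, n_jobs=1):
    """Apply fit_gene to each row (gene) of a genes x cells matrix in parallel."""
    results = Parallel(n_jobs=n_jobs)(
        delayed(fit_gene)(row) for row in matrix
    )
    return np.array(results)
```

Since genes are fit independently, this parallelizes trivially over rows; the framework linked above adds progress bars and backend selection on top of the same idea.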
Here’s the notebook. |
@atarashansky thanks a lot, looks really good indeed!
@atarashansky sorry for getting back to you late! I played around with the implementation; thanks a lot for the notebooks! A few questions before opening the PR:
Looking forward to hearing what you think! Thank you!
This looks super cool! One minor thing is the bandwidth adjustment; ideally this shouldn't require manual tweaking. Is this the case? When I was playing around with Nebulosa (https://github.com/powellgenomicslab/Nebulosa) to reimplement it in Python, I noticed that both bandwidth algorithms
Sorry for the late reply; the notifications for this thread got sent to my spam folder.
SCTransform does automatic bandwidth selection using Scott's rule, which I am implementing with
I don't know how "easy" the density estimation problem here is, but I would avoid Scott's rule whenever I can (e.g. https://aakinshin.net/posts/kde-bw/).
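To make the rules of thumb under discussion concrete, here is a quick comparison using `scipy.stats.gaussian_kde` (standing in for KDEpy here, which is an assumption on my part; scipy exposes both Scott's and Silverman's rules directly):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
x = rng.normal(size=500)

# scipy's gaussian_kde defaults to Scott's rule; Silverman's is the other
# built-in rule of thumb.
kde_scott = gaussian_kde(x, bw_method="scott")
kde_silverman = gaussian_kde(x, bw_method="silverman")

# In 1-D both bandwidth factors scale as n**(-1/5); only the constant differs.
print(kde_scott.factor, kde_silverman.factor)
```

Both rules assume roughly Gaussian data, which is the core of the criticism in the linked post: for multimodal or skewed densities they tend to oversmooth.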
Hi @atarashansky, for the rest, I think the two main points that could be addressed before the PR are:
With these two points addressed, I think it's good to start a PR. I'd be happy to review and also to help set up tests/benchmarks (which, like you mentioned, would probably be best done by testing against the results of the original implementation). Let me know for anything else; excited for this!
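One way such a regression test against the R output might look (the fixture path, helper name, and tolerances below are all hypothetical; the idea is simply to save a residual matrix from an R run and compare against it):

```python
import numpy as np

def assert_close_to_reference(result, reference_path, rtol=1e-4, atol=1e-6):
    """Compare a residual matrix against output saved from the R implementation.

    `reference_path` points at an .npy fixture exported from an R run
    (hypothetical); tolerances are loose because the two fits are only
    expected to agree approximately.
    """
    reference = np.load(reference_path)
    assert result.shape == reference.shape
    np.testing.assert_allclose(result, reference, rtol=rtol, atol=atol)
```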
Hi @atarashansky and everyone following this interesting discussion! I just found this issue after posting quite a related PR yesterday (#1715) that came out of a discussion from the end of last year (berenslab/umi-normalization#1), and wanted to leave a note about that relation here: in my PR, I implement normalization by analytic Pearson residuals based on an NB offset model, which is an improved/simplified version of the scTransform model that does not need regularization by smoothing anymore. This brings some theoretical advantages, and we found it works well in practice (details in this preprint with @dkobak). One of the differences remaining between the two is how the overdispersion
Also, I'm curious about the clipping of the Pearson residuals to
Looking forward to your thoughts on this :)
@jlause Interesting work! It would indeed be nice to avoid having to learn bandwidths altogether. What would be the procedure for learning a global theta from the data? Would you just flatten the expression matrix into one vector? With regards to the clipping, I turned my brain off and copied the Seurat implementation as much as possible.
Oh, scTransform uses
If I remember correctly, the SCTransform
Regarding this: yes, that's what we thought to do. Do you think it makes sense? However, for computational efficiency, I would threshold the genes by expression and take e.g. 1000 or even 100 genes with the highest average expression (for the purpose of estimating theta). Lowly expressed genes don't really constrain theta much; it's highly expressed ones that do.
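A minimal sketch of the analytic-residual recipe as described in this thread: offset-model means from row and column totals, a fixed global `theta`, and residuals clipped to ±sqrt(n_cells). The function name and default `theta=100` are my own choices for illustration, not the PR's actual API:

```python
import numpy as np

def analytic_pearson_residuals(counts, theta=100.0):
    """Analytic NB Pearson residuals (offset model, fixed global theta).

    mu[c, g] = (cell total * gene total) / grand total, so no per-gene
    regression or smoothing is needed; residuals are clipped to
    +/- sqrt(n_cells).
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    mu = (counts.sum(axis=1, keepdims=True)
          * counts.sum(axis=0, keepdims=True) / total)
    resid = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    n_cells = counts.shape[0]
    return np.clip(resid, -np.sqrt(n_cells), np.sqrt(n_cells))
```

If a count matrix exactly follows the offset model (every row proportional to the same gene profile), the residuals are identically zero, which is a handy sanity check.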
Hi guys, I've been following this thread and it's been quiet recently :) Wondering if there are any updates on incorporating SCTransform into Scanpy. Thanks!! 🙏🏼
Found it after I wrote it. 😄 Here is my code for those who want to try the R SCT:

```python
import anndata2ri
import numpy as np
from rpy2.robjects import r, pandas2ri
from rpy2.robjects.packages import importr

anndata2ri.activate()
pandas2ri.activate()


def run_sctransform(adata, layer=None, **kwargs):
    mat = adata.layers[layer] if layer else adata.X

    # Set names for the input matrix (genes x cells on the R side)
    cell_names = adata.obs_names
    gene_names = adata.var_names
    r.assign('mat', mat.T)
    r.assign('cell_names', cell_names)
    r.assign('gene_names', gene_names)
    r('colnames(mat) <- cell_names')
    r('rownames(mat) <- gene_names')

    importr('Seurat')
    r('seurat_obj <- CreateSeuratObject(mat)')

    # Forward keyword arguments to SCTransform
    for k, v in kwargs.items():
        r.assign(k, v)
    kwargs_str = ', '.join(f'{k}={k}' for k in kwargs)
    r(f'seurat_obj <- SCTransform(seurat_obj, vst.flavor="v2", {kwargs_str})')

    # Extract the SCT matrices and add them as layers on the original AnnData
    sct_data = np.asarray(r['as.matrix'](r('seurat_obj@assays$SCT@data')))
    adata.layers['SCT_data'] = sct_data.T
    sct_counts = np.asarray(r['as.matrix'](r('seurat_obj@assays$SCT@counts')))
    adata.layers['SCT_counts'] = sct_counts.T
    return adata
```

Usage:

```python
adata.layers["data"] = adata.X.copy()
adata = run_sctransform(adata, layer="counts")
```

```
R[write to console]: Running SCTransform on assay: RNA
R[write to console]: Place corrected count matrix in counts slot
R[write to console]: Set default assay to SCT
```

Afterwards `adata` has layers: 'counts', 'data', 'SCT_data', 'SCT_counts'. Please use this code and your data with caution.
Thank you everybody, and particularly @Moloch0, for contributing to this discussion. A little note in case anyone else might run into the same issue: my original AnnData contained a small set of genes/variables present in very few cells, which were filtered out during SCTransform normalisation. This prevented the SCT layers from being added to adata due to a dimension mismatch. To address this, I added a simple subsetting step to the function between normalisation and layer addition, as follows:

```python
# [...]
r(f'seurat_obj <- SCTransform(seurat_obj, vst.flavor="v2", {kwargs_str})')

# Prevent partial SCT output because of default min.genes
# messing up layer addition
r('diffDash <- setdiff(rownames(seurat_obj), rownames(mat))')
r('diffDash <- gsub("-", "_", diffDash)')
r('diffScore <- setdiff(rownames(mat), rownames(seurat_obj))')
# Convert the R character vector to a numpy array
# (the original snippet used an undefined `svconvert` helper here)
filtout_genes = np.asarray(r('setdiff(diffScore, diffDash)'))
filtout_indicator = np.in1d(adata.var_names, filtout_genes)
adata = adata[:, ~filtout_indicator]

# Extract the SCT data and add it as a new layer in the original anndata object
# [...]
```

Hope that comes in handy for anyone else facing this issue!
I recently ported SCTransform from R into Python. Any interest in getting it onto Scanpy? (`sc.tools`? `sc.pl`? `sc.external.*`?)
The original paper is here. It's a variance-stabilizing transformation that overcomes some key drawbacks of previous, similar methods (e.g. overfitting caused by building regression models from individual genes as opposed to groups of similar genes). It also eliminates the need for pseudocounts, log transformations, or library-size normalization.
My code is here.
Implementation notes (from the SCTransformPy README):
- Model fitting uses the `statsmodels` package and is parallelized with `multiprocessing`.
- Kernel density estimation uses the `KDEpy` package.
- Estimation of `theta` using MLE was translated from the `theta.ml` function in R.
Anecdotally, it produces very similar results to the R implementation, though the code itself is still a little rough around the edges. I also have to do more formal quantitative benchmarking to ensure results are similar to those of the original package.
I thought I'd gauge interest here prior to working on making it `scanpy`-ready.
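For anyone curious what MLE estimation of `theta` involves, here is a rough scipy-based analogue of R's `theta.ml` (a sketch under my own parameterization, not the ported code itself; it maximizes the NB likelihood over `theta` given counts and fitted means):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import nbinom

def estimate_theta_mle(y, mu):
    """MLE for the NB overdispersion theta given counts y and fitted means mu.

    A rough analogue of MASS::theta.ml. scipy's nbinom is parameterized by
    (n, p); the mean-overdispersion form used here maps to n = theta and
    p = theta / (theta + mu). Optimizing over log(theta) keeps theta positive.
    """
    y = np.asarray(y)
    mu = np.asarray(mu, dtype=float)

    def neg_loglik(log_theta):
        theta = np.exp(log_theta)
        p = theta / (theta + mu)
        return -nbinom.logpmf(y, theta, p).sum()

    res = minimize_scalar(neg_loglik, bounds=(-10, 10), method="bounded")
    return float(np.exp(res.x))
```

With enough counts and the true means supplied, the estimate should land close to the generating overdispersion, which makes it straightforward to test against simulated data.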