In [1]:
library(reticulate)  # py_load_object() / py_save_object() for Python pickles
library(Seurat)      # CreateSeuratObject(), SCTransform()

Attaching SeuratObject



In [2]:
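# User-specified locations: edit these two values for your own data set.
# data_folder must end with "/" because the parts are joined with paste0().
data_folder    <- "D:/analyze_Pearson_residuals/"
data_subfolder <- "10k_brain"

# Folder that holds the input pickles and receives the output pickles.
data_path <- paste0(data_folder, data_subfolder)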

In [3]:
# Output files (written by this notebook)
residual_variance_pkl <- "SCT_v2_residual_variance.pkl"
Pearson_residuals_pkl <- "SCT_v2_Pearson_residuals_top_2000.pkl"

# Input files (Python pickles)
gene_array_pkl <- "gene_array.pkl"
cell_array_pkl <- "cell_array.pkl"
counts_pkl     <- "counts_scipy_csc.pkl"

# Full paths: output data
residual_variance_dsn <- paste0(data_path, "/", residual_variance_pkl)
Pearson_residuals_dsn <- paste0(data_path, "/", Pearson_residuals_pkl)

# Full paths: input data
gene_array_dsn <- paste0(data_path, "/", gene_array_pkl)
cell_array_dsn <- paste0(data_path, "/", cell_array_pkl)
counts_dsn     <- paste0(data_path, "/", counts_pkl)
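
The path construction above can also be written with base R's file.path(), which inserts the separator itself, and the inputs can be checked before any slow step runs. The following is a minimal sketch under stated assumptions: build_paths is a hypothetical helper, and a temporary directory stands in for data_path so the example runs anywhere.

```r
# Hypothetical helper: join one folder with several file names.
build_paths <- function(folder, files) {
  paths <- file.path(folder, files)
  names(paths) <- names(files)
  paths
}

# Stand-in for data_path (assumption: a temporary directory, not the real folder).
demo_path <- file.path(tempdir(), "10k_brain")
dir.create(demo_path, recursive = TRUE, showWarnings = FALSE)

input_files <- c(genes  = "gene_array.pkl",
                 cells  = "cell_array.pkl",
                 counts = "counts_scipy_csc.pkl")
input_paths <- build_paths(demo_path, input_files)

# Report missing inputs up front instead of failing later inside py_load_object().
missing_inputs <- input_paths[!file.exists(input_paths)]
if (length(missing_inputs) > 0) {
  message("Missing input files: ",
          paste(basename(missing_inputs), collapse = ", "))
}
```

In the temporary directory none of the files exist, so all three names are reported; pointed at the real data folder, the message would list only the files that are actually absent.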

In [4]:
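# Gene IDs, one per row of the count matrix.
# The pickled array arrives as a one-dimensional R array, so dim() returns its length.
gene_array <- py_load_object(gene_array_dsn)
print(dim(gene_array))
print(head(gene_array))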

[1] 13817
[1] "ENSMUSG00000052305" "ENSMUSG00000073940" "ENSMUSG00000069919"
[4] "ENSMUSG00000069917" "ENSMUSG00000019874" "ENSMUSG00000019890"


In [5]:
# Cell barcodes, one per column of the count matrix.
cell_array <- py_load_object(cell_array_dsn)
print(dim(cell_array))
print(head(cell_array))

[1] 11843
[1] "AAACCCAAGCAACTCT-1" "AAACCCACACGCGGTT-1" "AAACCCACAGCATACT-1"
[4] "AAACCCACATACCATG-1" "AAACCCAGTCGCACAC-1" "AAACCCAGTGCACATT-1"


In [6]:
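# reticulate converts the pickled scipy sparse matrix to a dgCMatrix (genes x cells).
df_counts <- py_load_object(counts_dsn)

# Attach gene IDs and cell barcodes; Seurat needs both sets of names.
rownames(df_counts) <- gene_array
colnames(df_counts) <- cell_array

print(dim(df_counts))
print(head(df_counts[, 1:5]))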

[1] 13817 11843
6 x 5 sparse Matrix of class "dgCMatrix"
                   AAACCCAAGCAACTCT-1 AAACCCACACGCGGTT-1 AAACCCACAGCATACT-1
ENSMUSG00000052305                  .                  .                  .
ENSMUSG00000073940                  .                  .                  .
ENSMUSG00000069919                  .                  .                  .
ENSMUSG00000069917                  .                  .                  .
ENSMUSG00000019874                 20                  2                 14
ENSMUSG00000019890                  .                  .                  .
                   AAACCCACATACCATG-1 AAACCCAGTCGCACAC-1
ENSMUSG00000052305                  .                  .
ENSMUSG00000073940                  .                  1
ENSMUSG00000069919                  .                  .
ENSMUSG00000069917                  .                  .
ENSMUSG00000019874                  .                  .
ENSMUSG00000019890                  .                  .
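
Assigning row and column names assumes that the gene and cell arrays line up with the matrix dimensions. A minimal sketch of an explicit check is shown here on a made-up 3 x 2 matrix; label_counts is a hypothetical helper, and the toy names and values are not taken from the data set.

```r
library(Matrix)

# Toy stand-ins for gene_array, cell_array and df_counts (values are made up).
toy_genes  <- c("geneA", "geneB", "geneC")
toy_cells  <- c("cell1", "cell2")
toy_counts <- sparseMatrix(i = c(1, 3, 2), j = c(1, 1, 2), x = c(5, 1, 2),
                           dims = c(3, 2))

# Hypothetical helper: label a genes x cells matrix, failing early on a mismatch.
label_counts <- function(counts, genes, cells) {
  stopifnot(nrow(counts) == length(genes),
            ncol(counts) == length(cells),
            !anyDuplicated(genes),
            !anyDuplicated(cells))
  dimnames(counts) <- list(as.character(genes), as.character(cells))
  counts
}

toy_counts <- label_counts(toy_counts, toy_genes, toy_cells)
print(toy_counts)
```

The as.character() calls also drop the dim attribute of one-dimensional arrays such as those returned by py_load_object(), so plain character vectors end up in the dimnames.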


In [7]:
# Raw counts go into the default "RNA" assay; the project is named after the data subfolder.
seurat_object <- CreateSeuratObject(counts = df_counts, project = data_subfolder)
print(dim(seurat_object))

[1] 13817 11843


In [8]:
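# SCTransform with the v2 flavor, fitted with glmGamPoi; keep the top 2000 variable genes.
seurat_object <- SCTransform(seurat_object, vst.flavor = "v2", method = "glmGamPoi", variable.features.n = 2000)

# Gene names, in the row order of the SCT data matrix.
gene_list <- seurat_object@assays$SCT@data@Dimnames[1]
gene_vector <- unlist(gene_list)

# Per-gene residual variance stored with the fitted SCT model.
# Assumes feature.attributes lists genes in the same order as the data matrix.
residual_variance <- seurat_object@assays$SCT@SCTModel.list$model1@feature.attributes$residual_variance
df_residual_variance <- data.frame(residual_variance)
row.names(df_residual_variance) <- gene_vector
print(dim(df_residual_variance))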

vst.flavor='v2' set, setting model to use fixed slope and exclude poisson genes.

Calculating cell attributes from input UMI matrix: log_umi

Total Step 1 genes: 13817

Total overdispersed genes: 13510

Excluding 307 genes from Step 1 because they are not overdispersed.

Variance stabilizing transformation of count matrix of size 13817 by 11843

Model formula is y ~ log_umi

Get Negative Binomial regression parameters per gene

Using 2000 genes, 5000 cells

Setting estimate of  165 genes to inf as theta_mm/theta_mle < 1e-3

# of step1 poisson genes (variance < mean): 0

# of low mean genes (mean < 0.001): 0

Total # of Step1 poisson genes (theta=Inf; variance < mean): 165

Total # of poisson genes (theta=Inf; variance < mean): 307

Calling offset model for all 307 poisson genes

Found 219 outliers - those will be ignored in fitting/regularization step


Ignoring theta inf genes

Replacing fit params for 307 poisson genes by theta=Inf

Setting min_variance based on median UMI:  0.04

Second step: Get residuals using fitted parameters for 13817 genes

Computing corrected count matrix for 13817 genes

Calculating gene attributes

Wall clock passed: Time difference of 59.40047 secs

Determine variable features

Place corrected count matrix in counts slot

Centering data matrix

Set default assay to SCT



[1] 13817     1
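
The residual-variance table is what the "Determine variable features" step in the log works from: genes with the largest residual variance are the ones retained as variable features. A minimal sketch of that ranking on made-up values follows; top_genes_by_variance is a hypothetical helper and toy_rv is a stand-in for df_residual_variance, not the function Seurat calls internally.

```r
# Toy stand-in for df_residual_variance (values are made up).
toy_rv <- data.frame(
  residual_variance = c(0.9, 4.2, 1.1, 7.5, 0.8),
  row.names = c("geneA", "geneB", "geneC", "geneD", "geneE")
)

# Hypothetical helper: names of the n genes with the largest residual variance.
top_genes_by_variance <- function(df, n) {
  ord <- order(df$residual_variance, decreasing = TRUE)
  rownames(df)[head(ord, n)]
}

top_genes <- top_genes_by_variance(toy_rv, 2)
print(top_genes)
```

Applied to df_residual_variance with n = 2000, the same helper would give a gene list that can be compared against the row names of the residual matrix.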


In [9]:
# scale.data holds the centered Pearson residuals, here for the 2000 variable genes only.
SCT_assay <- seurat_object@assays$SCT
residuals_top_features <- SCT_assay@scale.data
print(dim(residuals_top_features))

[1]  2000 11843
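
Each entry of this matrix is a Pearson residual: the observed count minus the count expected under the fitted negative binomial model, divided by the model's standard deviation. The sketch below shows only that formula, with made-up values for the expected counts mu and the parameter theta; it does not reproduce the parameter fitting, regularization, clipping or centering that SCTransform performs, and nb_pearson_residual is a hypothetical helper.

```r
# Pearson residual under a negative binomial model with mean mu and
# parameter theta, where variance = mu + mu^2 / theta.
nb_pearson_residual <- function(x, mu, theta) {
  (x - mu) / sqrt(mu + mu^2 / theta)
}

counts <- c(0, 2, 14, 20)  # observed UMI counts for one gene (made up)
mu     <- c(1, 2, 8, 10)   # expected counts per cell (made up)
theta  <- 5                # made-up value of the NB parameter

z <- nb_pearson_residual(counts, mu, theta)
print(round(z, 3))

# With theta = Inf the extra variance term vanishes and the model is Poisson,
# which is how the log above treats genes that are not overdispersed.
z_pois <- nb_pearson_residual(counts, mu, Inf)
print(round(z_pois, 3))
```

A count equal to its expectation gives a residual of zero, and for the same count and mean the Poisson residual is never smaller in magnitude than the negative binomial one, because the Poisson variance is the smaller of the two.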


In [10]:
# Save both results as Python pickles.
# The residual matrix is converted to a data frame before saving.
py_save_object(df_residual_variance, residual_variance_dsn)
py_save_object(as.data.frame(residuals_top_features), Pearson_residuals_dsn)