In [1]:
library(reticulate) 
library(Seurat)

Attaching SeuratObject



In [2]:
# user-specified settings

data_folder = "D:/analyze_Pearson_residuals/"
data_subfolder = "10k_brain"

# folder that holds all input and output pickles for this dataset
data_path  <- paste0 ( data_folder, data_subfolder )

In [3]:
# output data
residual_variance_pkl = "Sg_complementary_samples.pkl"

# input data
gene_array_pkl  = "gene_array.pkl"
cell_array_pkl  = "cell_array.pkl"
counts_pkl = "counts_scipy_csc.pkl"

selected_genes_array_pkl  = "gene_array_complementary_samples.pkl"
df_selected_cells_pkl  = "df_cells_complementary_samples.pkl" 



# path: output data
residual_variance_dsn    <- paste0 ( data_path, '/', residual_variance_pkl )


# paths: input data
gene_array_dsn    <- paste0 ( data_path, '/', gene_array_pkl )
cell_array_dsn   <- paste0 ( data_path, '/', cell_array_pkl )
counts_dsn    <- paste0 ( data_path, '/', counts_pkl )

selected_genes_array_dsn <- paste0 ( data_path, '/', selected_genes_array_pkl )
df_selected_cells_dsn  <- paste0 ( data_path, '/', df_selected_cells_pkl )
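
The paths above are built by string concatenation and are not checked until a file is loaded. As a minimal sketch, a small helper can build each path and stop early when an input file is missing. The helper name `build_input_path` and the temporary demo folder are assumptions made for illustration; they are not part of the original notebook.

```r
# Hypothetical helper (not in the original notebook):
# build a path and stop early if the file is missing.
build_input_path <- function(folder, file_name, must_exist = TRUE) {
  path <- file.path(folder, file_name)
  if (must_exist && !file.exists(path)) {
    stop("Input file not found: ", path)
  }
  path
}

# Demonstration in a temporary folder so the sketch runs anywhere.
demo_folder <- tempdir()
demo_file <- "gene_array.pkl"
file.create(file.path(demo_folder, demo_file))

demo_path <- build_input_path(demo_folder, demo_file)

# A missing file raises an error; the message is captured here for display.
missing_result <- tryCatch(
  build_input_path(demo_folder, "no_such_file.pkl"),
  error = function(e) conditionMessage(e)
)

print(demo_path)
print(missing_result)
```

For the output path, `must_exist = FALSE` would skip the check, since that file is only created at the end of the notebook.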

In [4]:
gene_array = py_load_object ( gene_array_dsn )
print ( dim ( gene_array ) )
print ( head ( gene_array ) )

[1] 13817
[1] "ENSMUSG00000052305" "ENSMUSG00000073940" "ENSMUSG00000069919"
[4] "ENSMUSG00000069917" "ENSMUSG00000019874" "ENSMUSG00000019890"


In [5]:
cell_array = py_load_object ( cell_array_dsn )
print ( dim ( cell_array ) )
print ( head ( cell_array ) )

[1] 11843
[1] "AAACCCAAGCAACTCT-1" "AAACCCACACGCGGTT-1" "AAACCCACAGCATACT-1"
[4] "AAACCCACATACCATG-1" "AAACCCAGTCGCACAC-1" "AAACCCAGTGCACATT-1"


In [6]:
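# counts were pickled as a scipy CSC matrix; reticulate returns it as a dgCMatrix (genes x cells)
df_counts = py_load_object( counts_dsn )

# label rows with gene IDs and columns with cell barcodes
rownames(df_counts) <- gene_array
colnames(df_counts) <- cell_array
rm ( gene_array, cell_array )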

print ( dim ( df_counts ) )
print ( head (df_counts [, c(1:5)] ) )

[1] 13817 11843
6 x 5 sparse Matrix of class "dgCMatrix"
                   AAACCCAAGCAACTCT-1 AAACCCACACGCGGTT-1 AAACCCACAGCATACT-1
ENSMUSG00000052305                  .                  .                  .
ENSMUSG00000073940                  .                  .                  .
ENSMUSG00000069919                  .                  .                  .
ENSMUSG00000069917                  .                  .                  .
ENSMUSG00000019874                 20                  2                 14
ENSMUSG00000019890                  .                  .                  .
                   AAACCCACATACCATG-1 AAACCCAGTCGCACAC-1
ENSMUSG00000052305                  .                  .
ENSMUSG00000073940                  .                  1
ENSMUSG00000069919                  .                  .
ENSMUSG00000069917                  .                  .
ENSMUSG00000019874                  .                  .
ENSMUSG00000019890                  .                  .
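
Row and column labels are assigned by position, so a gene or cell array of the wrong length or with duplicates would mislabel the matrix or fail later. The sketch below shows one way to check this before labelling, on a toy sparse matrix. All names and values in it are made up for illustration.

```r
library(Matrix)

# Toy stand-ins for the unpickled objects (values are made up).
toy_genes <- c("geneA", "geneB", "geneC")
toy_cells <- c("cell1", "cell2")
toy_counts <- sparseMatrix(
  i = c(1, 3, 2),
  j = c(1, 1, 2),
  x = c(20, 1, 2),
  dims = c(length(toy_genes), length(toy_cells))
)

# Fail early if the labels do not match the matrix shape or are not unique.
stopifnot(
  nrow(toy_counts) == length(toy_genes),
  ncol(toy_counts) == length(toy_cells),
  !anyDuplicated(toy_genes),
  !anyDuplicated(toy_cells)
)

rownames(toy_counts) <- toy_genes
colnames(toy_counts) <- toy_cells
print(toy_counts)
```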


In [7]:
# genes retained for the analysis
selected_genes_array = py_load_object ( selected_genes_array_dsn )

# Boolean membership table: one row per cell, one column per complementary sample
df_selected_cells = py_load_object ( df_selected_cells_dsn )
print ( dim ( df_selected_cells ) )
print ( head ( df_selected_cells ) )

rownames_selected_cells = rownames ( df_selected_cells )

[1] 11843     2
                       0     1
AAACCCAAGCAACTCT-1 FALSE  TRUE
AAACCCACACGCGGTT-1  TRUE FALSE
AAACCCACAGCATACT-1  TRUE FALSE
AAACCCACATACCATG-1  TRUE FALSE
AAACCCAGTCGCACAC-1 FALSE  TRUE
AAACCCAGTGCACATT-1  TRUE FALSE
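
In the rows printed above, each cell is flagged TRUE in exactly one of the two columns, which is what complementary samples would imply. The sketch below checks that property on a toy table and extracts the cell names per sample directly from the logical columns. The toy table and its values are assumptions for illustration; only the column labels "0" and "1" mirror the output above.

```r
# Toy membership table mirroring the layout printed above (values are made up).
toy_membership <- data.frame(
  "0" = c(FALSE, TRUE, TRUE, FALSE),
  "1" = c(TRUE, FALSE, FALSE, TRUE),
  row.names = c("cell1", "cell2", "cell3", "cell4"),
  check.names = FALSE
)

# Each cell should belong to exactly one sample.
memberships_per_cell <- rowSums(toy_membership)
stopifnot(all(memberships_per_cell == 1))

# Cell names per sample, selected directly with each logical column.
cells_by_sample <- lapply(
  toy_membership,
  function(flag) rownames(toy_membership)[flag]
)
print(cells_by_sample)
```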


In [8]:
# order the columns to follow the row order of df_selected_cells
df_counts_selected_cells = df_counts[,rownames_selected_cells]
rm ( df_counts )
print ( dim ( df_counts_selected_cells ) )
print ( head ( df_counts_selected_cells  [, c(1:5)] ) )

[1] 13817 11843
6 x 5 sparse Matrix of class "dgCMatrix"
                   AAACCCAAGCAACTCT-1 AAACCCACACGCGGTT-1 AAACCCACAGCATACT-1
ENSMUSG00000052305                  .                  .                  .
ENSMUSG00000073940                  .                  .                  .
ENSMUSG00000069919                  .                  .                  .
ENSMUSG00000069917                  .                  .                  .
ENSMUSG00000019874                 20                  2                 14
ENSMUSG00000019890                  .                  .                  .
                   AAACCCACATACCATG-1 AAACCCAGTCGCACAC-1
ENSMUSG00000052305                  .                  .
ENSMUSG00000073940                  .                  1
ENSMUSG00000069919                  .                  .
ENSMUSG00000069917                  .                  .
ENSMUSG00000019874                  .                  .
ENSMUSG00000019890                  .                  .


In [9]:
# keep only the genes listed in selected_genes_array; the matrix row order is preserved
df_counts_selected = df_counts_selected_cells [ rownames ( df_counts_selected_cells ) %in% selected_genes_array, ]
rm ( df_counts_selected_cells )
print ( dim ( df_counts_selected ) )
print ( head ( df_counts_selected  [, c(1:5)] ) )

[1] 12570 11843
6 x 5 sparse Matrix of class "dgCMatrix"
                   AAACCCAAGCAACTCT-1 AAACCCACACGCGGTT-1 AAACCCACAGCATACT-1
ENSMUSG00000052305                  .                  .                  .
ENSMUSG00000073940                  .                  .                  .
ENSMUSG00000069919                  .                  .                  .
ENSMUSG00000069917                  .                  .                  .
ENSMUSG00000019874                 20                  2                 14
ENSMUSG00000019890                  .                  .                  .
                   AAACCCACATACCATG-1 AAACCCAGTCGCACAC-1
ENSMUSG00000052305                  .                  .
ENSMUSG00000073940                  .                  1
ENSMUSG00000069919                  .                  .
ENSMUSG00000069917                  .                  .
ENSMUSG00000019874                  .                  .
ENSMUSG00000019890                  .                  .
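
Filtering with `%in%` keeps the row order of the matrix and silently ignores any selected gene that is not present in it. A minimal sketch of both effects, using made-up gene names:

```r
# Toy gene identifiers (made up); the notebook itself uses Ensembl IDs.
all_genes <- c("geneA", "geneB", "geneC", "geneD")
selected_genes <- c("geneD", "geneB", "geneX")

# %in% keeps the order of all_genes, not the order of the selection.
keep <- all_genes %in% selected_genes
kept_genes <- all_genes[keep]

# Selected genes that are absent from the matrix are dropped without warning.
missing_genes <- setdiff(selected_genes, all_genes)

print(kept_genes)
print(missing_genes)
```

Comparing `length(missing_genes)` against zero is one way to confirm that every selected gene was found.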


In [10]:

# residual variance per gene, computed separately for each complementary sample
list_df_residual_variance = list()

for ( sample in (1:2) )
{
  print ( paste ( 'sample:', sample ) )

  # cells flagged TRUE in this sample's column
  vect_select = df_selected_cells[[ sample ]]
  cells_select = rownames_selected_cells[ vect_select ]

  df_counts_select_sample = df_counts_selected[, cells_select ]
  print ( dim ( df_counts_select_sample ) )

  seurat_object <- CreateSeuratObject ( counts = df_counts_select_sample, project = data_subfolder )
  # variable.features.n is kept small: only the per-gene residual variance is used below
  seurat_object <- SCTransform ( seurat_object, vst.flavor = "v2", method = "glmGamPoi", variable.features.n = 10 )

  # take gene names and residual variances from the same table so they stay aligned
  feature_attributes = seurat_object@assays$SCT@SCTModel.list$model1@feature.attributes

  df_residual_variance <- data.frame ( gene_vector = rownames ( feature_attributes ) )
  str_sample = as.character ( sample - 1 )   # column labels "0" and "1", as in df_selected_cells
  df_residual_variance[[ str_sample ]] = feature_attributes$residual_variance

  print ( '--------------------------------------------------------' )

  list_df_residual_variance[[ sample ]] = df_residual_variance
}

[1] "sample: 1"
[1] 12570  5980


vst.flavor='v2' set, setting model to use fixed slope and exclude poisson genes.

Calculating cell attributes from input UMI matrix: log_umi

Total Step 1 genes: 12570

Total overdispersed genes: 12374

Excluding 196 genes from Step 1 because they are not overdispersed.

Variance stabilizing transformation of count matrix of size 12570 by 5980

Model formula is y ~ log_umi

Get Negative Binomial regression parameters per gene

Using 2000 genes, 5000 cells





Setting estimate of  129 genes to inf as theta_mm/theta_mle < 1e-3

# of step1 poisson genes (variance < mean): 0

# of low mean genes (mean < 0.001): 0

Total # of Step1 poisson genes (theta=Inf; variance < mean): 129

Total # of poisson genes (theta=Inf; variance < mean): 196

Calling offset model for all 196 poisson genes

Found 183 outliers - those will be ignored in fitting/regularization step


Ignoring theta inf genes

Replacing fit params for 196 poisson genes by theta=Inf

Setting min_variance based on median UMI:  0.04

Second step: Get residuals using fitted parameters for 12570 genes





Computing corrected count matrix for 12570 genes





Calculating gene attributes

Wall clock passed: Time difference of 41.9287 secs

Determine variable features

Place corrected count matrix in counts slot

Centering data matrix

Set default assay to SCT



[1] "--------------------------------------------------------"
[1] "sample: 2"
[1] 12570  5863


vst.flavor='v2' set, setting model to use fixed slope and exclude poisson genes.

Calculating cell attributes from input UMI matrix: log_umi

Total Step 1 genes: 12570

Total overdispersed genes: 12353

Excluding 217 genes from Step 1 because they are not overdispersed.

Variance stabilizing transformation of count matrix of size 12570 by 5863

Model formula is y ~ log_umi

Get Negative Binomial regression parameters per gene

Using 2000 genes, 5000 cells





Setting estimate of  116 genes to inf as theta_mm/theta_mle < 1e-3

# of step1 poisson genes (variance < mean): 0

# of low mean genes (mean < 0.001): 0

Total # of Step1 poisson genes (theta=Inf; variance < mean): 116

Total # of poisson genes (theta=Inf; variance < mean): 217

Calling offset model for all 217 poisson genes

Found 169 outliers - those will be ignored in fitting/regularization step


Ignoring theta inf genes

Replacing fit params for 217 poisson genes by theta=Inf

Setting min_variance based on median UMI:  0.04

Second step: Get residuals using fitted parameters for 12570 genes





Computing corrected count matrix for 12570 genes





Calculating gene attributes

Wall clock passed: Time difference of 37.79129 secs

Determine variable features

Place corrected count matrix in counts slot

Centering data matrix

Set default assay to SCT



[1] "--------------------------------------------------------"
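
To make the quantity extracted in the loop more concrete, the sketch below computes a per-gene variance of Pearson residuals on toy counts. It is a simplified stand-in, not what SCTransform does: it assumes a negative binomial model with a fixed overdispersion `theta`, expected counts proportional to cell and gene totals, and clipping at the square root of the number of cells. The function name, the value of `theta`, and the simulated counts are all assumptions for illustration.

```r
# Simplified stand-in for the residual variance reported by SCTransform.
# Assumption: negative binomial model with a fixed theta and no
# regularized regression, so the values will differ from sctransform's.
pearson_residual_variance <- function(counts, theta = 100) {
  counts <- as.matrix(counts)                  # genes x cells
  cell_totals <- colSums(counts)
  gene_totals <- rowSums(counts)

  # expected count of each gene in each cell
  mu <- outer(gene_totals, cell_totals) / sum(counts)

  # Pearson residual: (observed - expected) / model standard deviation
  residuals <- (counts - mu) / sqrt(mu + mu^2 / theta)

  # clip extreme residuals to +/- sqrt(number of cells)
  clip <- sqrt(ncol(counts))
  residuals <- pmax(pmin(residuals, clip), -clip)

  apply(residuals, 1, var)                     # per-gene residual variance
}

# Toy Poisson counts (made up); every gene has a nonzero total here.
set.seed(1)
toy_counts <- matrix(
  rpois(5 * 200, lambda = 3),
  nrow = 5,
  dimnames = list(paste0("gene", 1:5), paste0("cell", 1:200))
)
toy_variance <- pearson_residual_variance(toy_counts)
print(round(toy_variance, 3))
```

Genes with a total count of zero would give an expected count of zero and undefined residuals, so they would need to be removed before calling this function.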


In [11]:
# inner join on gene ID; merge() sorts the rows by gene_vector
df_residual_variance_return = merge ( list_df_residual_variance[[1]], list_df_residual_variance[[2]], by=c("gene_vector") )
print ( head ( df_residual_variance_return ) ) 

         gene_vector         0         1
1 ENSMUSG00000000001 0.8201942 0.7393027
2 ENSMUSG00000000028 1.1895073 1.0743692
3 ENSMUSG00000000031 1.7944710 1.4458637
4 ENSMUSG00000000037 0.3688614 0.3776920
5 ENSMUSG00000000056 0.9609841 0.8968551
6 ENSMUSG00000000058 1.2173327 1.0450934
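
By default `merge` performs an inner join and sorts the result by the key column, which is why the rows above start at ENSMUSG00000000001 rather than following the matrix order. A gene present in only one of the two tables would be dropped. The sketch below contrasts the default with `all = TRUE` on toy tables; the gene names and values are made up.

```r
# Toy per-sample tables (values are made up); column names mirror the notebook.
toy_sample_0 <- data.frame(gene_vector = c("geneC", "geneA", "geneB"))
toy_sample_0[["0"]] <- c(1.2, 0.8, 1.0)
toy_sample_1 <- data.frame(gene_vector = c("geneA", "geneC", "geneD"))
toy_sample_1[["1"]] <- c(0.7, 1.1, 0.9)

# Default merge: inner join, rows sorted by the key column.
inner <- merge(toy_sample_0, toy_sample_1, by = "gene_vector")

# all = TRUE keeps genes present in only one table and fills the gaps with NA.
outer_join <- merge(toy_sample_0, toy_sample_1, by = "gene_vector", all = TRUE)

print(inner)
print(outer_join)
```

Comparing `nrow(inner)` with the row counts of the two inputs shows whether any gene was lost in the join.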


In [12]:
# pickle the merged table so it can be read back from Python
py_save_object ( df_residual_variance_return, residual_variance_dsn )