# Computing pseudobulk replicates (PBRs) for the referee's questions

We compute drug signature scores from differential expression analyses between DMSO-treated and drug-treated cells by:
- Producing 100 pseudoreplicates as bootstrapped samples _as per the referee's requests_
- Obtaining the pseudobulk of each replicate via aggregation
- Running differential expression analysis between treated and untreated cells within each model with edgeR, using a GLM with the quasi-likelihood F-test, the best-ranking approach in Soneson and Robinson, 2018 (https://doi.org/10.1038/nmeth.4612); see the other R script.

To compare the extent of the differential response to a given drug across the three models, the number of differentially expressed genes will be extracted and plotted.
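
The actual differential expression code lives in the other R script, so the following is only a minimal sketch of an edgeR quasi-likelihood F-test on simulated pseudobulk counts. The toy data, the object names and the FDR threshold of 0.05 are assumptions made for illustration; the block skips the edgeR calls if the package is not installed.

```r
# Hypothetical toy data: 200 genes x 6 pseudobulk samples (3 DMSO, 3 treated).
# All names and simulation settings are illustrative assumptions.
set.seed(1)
n_genes <- 200
group <- factor(rep(c("DMSO", "Drug"), each = 3), levels = c("DMSO", "Drug"))
counts <- matrix(rnbinom(n_genes * 6, mu = 100, size = 10), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0(group, "_", rep(1:3, times = 2))))
# Spike in a four-fold increase for the first 20 genes in the treated samples
counts[1:20, group == "Drug"] <- counts[1:20, group == "Drug"] * 4L

n_de <- NA_integer_
if (requireNamespace("edgeR", quietly = TRUE)) {
    y <- edgeR::DGEList(counts = counts, group = group)
    y <- edgeR::calcNormFactors(y)
    design <- model.matrix(~ group)            # DMSO is the reference level
    y <- edgeR::estimateDisp(y, design)
    fit <- edgeR::glmQLFit(y, design)
    qlf <- edgeR::glmQLFTest(fit, coef = 2)    # tests Drug versus DMSO
    res <- edgeR::topTags(qlf, n = Inf)$table
    n_de <- sum(res$FDR < 0.05)                # assumed significance threshold
}
n_de
```

The single number `n_de` is the kind of per-drug, per-model summary that the comparison across models relies on.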

Loading the R packages.

In [1]:
suppressWarnings({suppressPackageStartupMessages({
    library(Seurat)
    library(readxl)
    library(parallel)
})})

To assign each cell to a sample, we need its well of origin, so we load the per-cell HTO assignment for each model.

In [2]:
JHOS2_HTO_assignment <- read.table(file = "JHOS2_HTO_classification_cell_by_cell.txt",
                                   sep = "\t", header = FALSE)
PDC3_HTO_assignment <- read.table(file = "PDC3_HTO_classification_cell_by_cell.txt",
                                  sep = "\t", header = FALSE)
PDC2_HTO_assignment <- read.table(file = "PDC2_HTO_classification_cell_by_cell.txt",
                                  sep = "\t", header = FALSE)

Once we have the well assignment of each cell, we build a look-up table to fetch the well of origin and its annotation. This requires the treatment group assignment; because each treatment occupies two distinct wells (six for untreated cells), we also add a number to record the replicate ID. We start by naming the columns of the assignment tables.

In [3]:
colnames(JHOS2_HTO_assignment) <- colnames(PDC3_HTO_assignment) <- colnames(PDC2_HTO_assignment) <- c("CellID", "Well")

Now we can identify the size of the largest replicate (the well with the most cells) in each model, which later sets the sample size for the pseudobulk replicates.

In [4]:
max_repr_JHOS2 <- max(table(JHOS2_HTO_assignment$Well))
max_repr_PDC3 <- max(table(PDC3_HTO_assignment$Well))
max_repr_PDC2 <- max(table(PDC2_HTO_assignment$Well))
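
As a small illustration of how this maximum is later used, the sketch below counts cells per well in a made-up assignment table and derives the bootstrap sample size as two thirds of the largest well, rounded to the nearest integer (the rule used by the resampling function further down). The table and its labels are hypothetical.

```r
# Hypothetical assignment table: one row per cell, made-up well labels
toy_assignment <- data.frame(
    CellID = paste0("cell", 1:10),
    Well   = c(rep("wellA", 5), rep("wellB", 3), rep("wellC", 2))
)

cells_per_well <- table(toy_assignment$Well)
max_repr_toy <- max(cells_per_well)               # size of the largest well
bootstrap_size <- round(max_repr_toy * (2 / 3))   # cells drawn per pseudoreplicate

cells_per_well
bootstrap_size
```

Using the same sample size for every replicate of a model keeps the pseudobulk library sizes comparable across wells of different sizes.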

Loading the treatment group assignment tables.

In [5]:
treatment_groups <- as.data.frame(read_xlsx(path = "Treatment_groups.xlsx", sheet = 1, col_names = TRUE))
rownames(treatment_groups) <- treatment_groups$`HTO classification`

treatment_groups_PDC3_only <- as.data.frame(read_xlsx(path = "Treatment_groups_PDC3_only.xlsx", sheet = 1, col_names = TRUE))
rownames(treatment_groups_PDC3_only) <- treatment_groups_PDC3_only$`HTO classification`

Recording the duplicate number: for each drug in the table, we fetch the rows whose _Drug_ column equals that drug and replace the NA placeholders in the duplicate_number column with the numbers from 1 to the number of replicates of that drug.

In [6]:
treatment_groups$duplicate_number <- NA
for(d in unique(treatment_groups$Drug)){
    # Rows (wells) treated with drug d, numbered in order of appearance
    rows <- which(treatment_groups$Drug == d)
    treatment_groups[rows, "duplicate_number"] <- seq_along(rows)
}

treatment_groups_PDC3_only$duplicate_number <- NA
for(d in unique(treatment_groups_PDC3_only$Drug)){
    rows <- which(treatment_groups_PDC3_only$Drug == d)
    treatment_groups_PDC3_only[rows, "duplicate_number"] <- seq_along(rows)
}

Now we can store the identity, given by drug + underscore + duplicate number (e.g. DMSO_6).

In [7]:
treatment_groups$identity_duplicate <- paste0(treatment_groups$Drug, "_", treatment_groups$duplicate_number)
treatment_groups_PDC3_only$identity_duplicate <- paste0(treatment_groups_PDC3_only$Drug, "_", treatment_groups_PDC3_only$duplicate_number)
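
The same numbering can be obtained without an explicit loop. The sketch below, on a made-up table, uses base R's `ave()` to number the rows within each drug in order of appearance and then builds the drug + underscore + duplicate number label; `toy_groups` is a hypothetical stand-in for the treatment tables.

```r
# Hypothetical treatment table with repeated drugs
toy_groups <- data.frame(
    Drug = c("DMSO", "Belinostat", "DMSO", "Belinostat", "DMSO"),
    stringsAsFactors = FALSE
)

# ave() applies seq_along within each Drug group and returns the values in row order
toy_groups$duplicate_number <- ave(seq_along(toy_groups$Drug),
                                   toy_groups$Drug,
                                   FUN = seq_along)
toy_groups$identity_duplicate <- paste0(toy_groups$Drug, "_", toy_groups$duplicate_number)

toy_groups$identity_duplicate
# "DMSO_1" "Belinostat_1" "DMSO_2" "Belinostat_2" "DMSO_3"
```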

In [8]:
head(treatment_groups)

| | HTO classification | Drug | Final concentration (nM) in 100 µl | Identity | duplicate_number | identity_duplicate |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<chr>` |
| column1_row1 | column1_row1 | DMSO | 0.0 | DMSO | 1 | DMSO_1 |
| column1_row2 | column1_row2 | Belinostat | 10.0 | Belinostat | 1 | Belinostat_1 |
| column1_row3 | column1_row3 | Quisinostat | 1.0 | Quisinostat | 1 | Quisinostat_1 |
| column1_row4 | column1_row4 | Dinaciclib | 0.1 | Dinaciclib | 1 | Dinaciclib_1 |
| column1_row5 | column1_row5 | Pictilisib | 10.0 | Pictilisib | 1 | Pictilisib_1 |
| column1_row6 | column1_row6 | Pacritinib | 10.0 | Pacritinib | 1 | Pacritinib_1 |


In [9]:
head(treatment_groups_PDC3_only)

| | HTO classification | Drug | Final concentration (nM) in 100 µl | Identity | duplicate_number | identity_duplicate |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<int>` | `<chr>` |
| column1_row8 | column1_row8 | DMSO | 0.0 | DMSO | 1 | DMSO_1 |
| column1_row7 | column1_row7 | Belinostat | 10.0 | Belinostat | 1 | Belinostat_1 |
| column1_row6 | column1_row6 | Quisinostat | 1.0 | Quisinostat | 1 | Quisinostat_1 |
| column1_row5 | column1_row5 | Dinaciclib | 0.1 | Dinaciclib | 1 | Dinaciclib_1 |
| column1_row4 | column1_row4 | Pictilisib | 10.0 | Pictilisib | 1 | Pictilisib_1 |
| column1_row3 | column1_row3 | Pacritinib | 10.0 | Pacritinib | 1 | Pacritinib_1 |


These identities can now be added to the metadata of the original Seurat object, which we load first.

In [10]:
sc_data <- readRDS(file = "HGSOC_CellHashing_CLUSTERED.RDS")

Building the look-up table.

In [11]:
JHOS2_HTO_assignment$CellID <- paste0("JHOS2_", JHOS2_HTO_assignment$CellID)
PDC3_HTO_assignment$CellID <- paste0("PDC3_", PDC3_HTO_assignment$CellID)
PDC2_HTO_assignment$CellID <- paste0("PDC2_", PDC2_HTO_assignment$CellID)

In [12]:
# Row names of the treatment tables are the well labels, so we can index by well directly
JHOS2_HTO_assignment$identity_duplicate <- treatment_groups[as.character(JHOS2_HTO_assignment$Well), "identity_duplicate"]
PDC2_HTO_assignment$identity_duplicate <- treatment_groups[as.character(PDC2_HTO_assignment$Well), "identity_duplicate"]
PDC3_HTO_assignment$identity_duplicate <- treatment_groups_PDC3_only[as.character(PDC3_HTO_assignment$Well), "identity_duplicate"]

rownames(JHOS2_HTO_assignment) <- JHOS2_HTO_assignment$CellID
rownames(PDC3_HTO_assignment) <- PDC3_HTO_assignment$CellID
rownames(PDC2_HTO_assignment) <- PDC2_HTO_assignment$CellID

Merging the look-up tables into a single one.

In [13]:
HTO_assignment_lookup_table <- rbind(JHOS2_HTO_assignment, PDC3_HTO_assignment, PDC2_HTO_assignment)
nrow(HTO_assignment_lookup_table) # Matches number of cells expected

Now we can add the duplicate identity to the metadata of the Seurat object.

In [14]:
# Row names of the look-up table are the cell IDs, so we index it with the cell names
sc_data@meta.data$duplicate_drug_number <- HTO_assignment_lookup_table[rownames(sc_data@meta.data), "identity_duplicate"]
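
Cells without an entry in the look-up table end up with NA in the new column, so it can be worth listing them. The sketch below shows the idea on made-up barcodes; it uses `match()`, which matches names exactly, whereas indexing a data frame by row name allows partial matches. All names in the block are hypothetical.

```r
# Hypothetical look-up table keyed by cell ID
toy_lookup <- data.frame(
    identity_duplicate = c("DMSO_1", "JQ1_1"),
    row.names = c("JHOS2_AAAC-1", "PDC2_TTTG-1"),
    stringsAsFactors = FALSE
)

# Cells to annotate; the last one has no HTO assignment
toy_cells <- c("PDC2_TTTG-1", "JHOS2_AAAC-1", "PDC3_GGGG-1")

# Exact matching of cell IDs against the row names of the look-up table
idx <- match(toy_cells, rownames(toy_lookup))
annotation <- toy_lookup$identity_duplicate[idx]

# Cells that could not be annotated
missing_cells <- toy_cells[is.na(annotation)]

annotation      # "JQ1_1" "DMSO_1" NA
missing_cells   # "PDC3_GGGG-1"
```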

In [15]:
tail(sc_data@meta.data)

| | orig.ident | nCount_RNA | nFeature_RNA | percent.mt | percent.rb | Treatment_group | nCount_SCT | nFeature_SCT | S.Score | G2M.Score | ⋯ | SCT_snn_res.0.4 | SCT_snn_res.0.5 | SCT_snn_res.0.6 | SCT_snn_res.0.7 | SCT_snn_res.0.8 | SCT_snn_res.0.9 | SCT_snn_res.1 | seurat_clusters | RNA_clusters | duplicate_drug_number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | ⋯ | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<chr>` |
| PDC2_TTTGTTGGTCTGATAC-1 | SeuratProject | 42908 | 6447 | 2.535658 | 24.33579 | TGX-221 | 27851 | 6198 | 0.02994034 | 0.82631298 | ⋯ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | TGX-221_2 |
| PDC2_TTTGTTGGTGGAAGTC-1 | SeuratProject | 22721 | 4065 | 3.683817 | 34.86202 | Dactolisib | 26168 | 4043 | -0.08625217 | -0.09808539 | ⋯ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | Dactolisib_2 |
| PDC2_TTTGTTGGTGTGAGCA-1 | SeuratProject | 5779 | 2629 | 6.679356 | 17.33864 | AZD8055 | 23716 | 4687 | -0.06163985 | -0.13494217 | ⋯ | 3 | 3 | 3 | 6 | 5 | 5 | 5 | 5 | 2 | AZD8055_2 |
| PDC2_TTTGTTGGTTACCTGA-1 | SeuratProject | 10454 | 3795 | 3.749761 | 21.93419 | Milciclib | 24251 | 4216 | 0.39595552 | 0.39398493 | ⋯ | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | Milciclib_2 |
| PDC2_TTTGTTGTCGTCCATC-1 | SeuratProject | 5490 | 2560 | 13.060109 | 15.99271 | SCH772984 | 23717 | 4830 | -0.02503921 | -0.11735909 | ⋯ | 14 | 13 | 14 | 14 | 14 | 12 | 12 | 12 | 1 | SCH772984_2 |
| PDC2_TTTGTTGTCTGGTTGA-1 | SeuratProject | 37691 | 7190 | 2.149054 | 18.69412 | JQ1 | 27930 | 7079 | 0.11391769 | 0.44819895 | ⋯ | 5 | 4 | 5 | 4 | 6 | 16 | 6 | 6 | 5 | JQ1_1 |


To obtain model- and replicate-specific pseudobulks, we prepend the model name to the duplicate identity column. This also makes the function easier to reuse.

In [16]:
sc_data@meta.data$duplicate_drug_number <- paste0(sc_data@meta.data$model, "_", sc_data@meta.data$duplicate_drug_number)
Idents(sc_data) <- "duplicate_drug_number"
head(Idents(sc_data))
tail(Idents(sc_data))

Running the pseudobulk replicate (PBR) computations: we design a function to be launched in parallel for each of the three models.

In [17]:
pseudoreplicates_pseudobulk <- function(seurat_object, identity, model, drug, max_repr_size){

    # Create pseudoreplicates of each identity in a certain model+condition
    # Setting identities
    Idents(seurat_object) <- identity

    # Retrieving cells of interest: identities look like <model>_<drug>_<replicate number>
    # Literal prefix matching avoids regex surprises and drugs whose name contains another drug's name
    prefix <- paste0(model, "_", drug, "_")
    cells <- Idents(seurat_object)[startsWith(as.character(Idents(seurat_object)), prefix)]

    cells <- factor(cells, levels = paste0(prefix, seq_along(unique(cells))))

    # Expression matrix that will store all the pseudoreplicates (one column each)
    exp_mat_pseudoreplicates <- c()

    # Iteratively, for each level, we extract the cells, draw the bootstrap samples, and aggregate the counts
    for(replicate in levels(cells)){

        # Extracting the cells
        replicate_cells <- cells[which(cells == replicate)]

        # Producing 100 samples with replacement of 2/3 of the largest replicate size, rounding to the closest integer
        # For each sample, extracting the Seurat data and summing the raw counts
        for(i in 1:100){ # 100 iterations for referee's requests
            # IMPORTANT CHANGE: DYNAMIC SEED
            # Seeding with the iteration index makes each of the 100 resamples different, yet reproducible.
            set.seed(i)

            # Creating the sample
            bootstrap_cells <- sample(x = replicate_cells,
                                      size = round(max_repr_size*(2/3), digits = 0),
                                      replace = TRUE)
            # Extracting the cells
            # Assumption to verify: subset() must keep barcodes drawn more than once,
            # otherwise the new names built below no longer match the number of cells
            subset_seurat_object <- subset(x = seurat_object,
                                           cells = names(bootstrap_cells))

            # Since Seurat's AggregateExpression has issues with duplicated row names, we need to rename the cells
            subset_seurat_object <- RenameCells(subset_seurat_object,
                                                new.names = paste0(seq_along(bootstrap_cells), "_", colnames(subset_seurat_object)))

            # Getting aggregate-based pseudobulk, retaining only raw counts for edgeR
            # Since we work on the raw counts, no additional scaling of the subset is needed
            # AggregateExpression calls the (undocumented) Seurat function "Pseudobulk" with the "aggregate" parameter
            # Aggregate allows us to keep the heterogeneity across cells
            pseudobulk_replicate <- AggregateExpression(object = subset_seurat_object,
                                                        slot = "counts",
                                                        assays = "RNA",
                                                        group.by = identity,
                                                        verbose = FALSE)$RNA

            # We should obtain a single column because all cells share the same identity, and we append it to the expression matrix
            # We include the number of the iteration in the name, so as to keep track of the pseudoreplicate
            colnames(pseudobulk_replicate) <- paste0(replicate, "_", i)
            exp_mat_pseudoreplicates <- cbind(exp_mat_pseudoreplicates, pseudobulk_replicate)

            # Garbage collection, releases memory
            gc()
        }
    }
    return(exp_mat_pseudoreplicates)
}
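
To make the resampling step concrete without a Seurat object, here is a minimal sketch of the same idea on a plain count matrix: draw cells with replacement, then sum their raw counts gene by gene. It is not the function above. The toy matrix, the sizes and the direct `rowSums()` aggregation are assumptions for illustration.

```r
# Hypothetical raw counts: 4 genes x 6 cells from one replicate
set.seed(42)
toy_counts <- matrix(rpois(24, lambda = 5), nrow = 4,
                     dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6)))

max_repr_size <- 9                           # made-up size of the largest replicate
n_draws <- round(max_repr_size * (2 / 3))    # cells drawn per pseudoreplicate

# Three pseudoreplicates instead of 100, to keep the example small
pseudobulks <- sapply(1:3, function(i) {
    set.seed(i)  # iteration-specific seed: different draws, yet reproducible
    drawn <- sample(colnames(toy_counts), size = n_draws, replace = TRUE)
    # Indexing by name keeps repeated columns, so a cell drawn twice is counted twice
    rowSums(toy_counts[, drawn, drop = FALSE])
})
colnames(pseudobulks) <- paste0("toy_replicate_", 1:3)

pseudobulks
```

Each column is one pseudobulk profile on the raw count scale, which is the input scale edgeR expects.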

We parallelise the computation with _mclapply_, the parallel version of _lapply_ from the _parallel_ package (the future package and its callr API failed), and launch it separately for each model. Within a model, the function runs in parallel on 46 cores, which corresponds to one core per treatment group.

In [18]:
all_drugs <- unique(treatment_groups$Drug)
all_drugs

In [19]:
JHOS2_pseudo_parallelized <- mclapply(all_drugs, function(x) {
    pseudoreplicates_pseudobulk(seurat_object = sc_data,
                                identity = "duplicate_drug_number",
                                model = "JHOS2",
                                drug = x,
                                max_repr_size = max_repr_JHOS2)
}, mc.cores = 46)

In [20]:
PDC3_pseudo_parallelized <- mclapply(all_drugs, function(x) {
    pseudoreplicates_pseudobulk(seurat_object = sc_data,
                                identity = "duplicate_drug_number",
                                model = "PDC3",
                                drug = x,
                                max_repr_size = max_repr_PDC3)
}, mc.cores = 46)

In [21]:
PDC2_pseudo_parallelized <- mclapply(all_drugs, function(x) {
    pseudoreplicates_pseudobulk(seurat_object = sc_data,
                                identity = "duplicate_drug_number",
                                model = "PDC2",
                                drug = x,
                                max_repr_size = max_repr_PDC2)
}, mc.cores = 46)
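
Because a forked _mclapply_ job that fails is typically returned as a `try-error` element rather than stopping the run, it can be useful to check the result list before saving it. The sketch below shows such a check on a toy worker; the drug subset, the worker and the core count are hypothetical, and naming the list by drug is an optional convenience that the cells above do not apply.

```r
library(parallel)

toy_drugs <- c("DMSO", "Belinostat", "JQ1")   # made-up subset of drug names

# Forking is unavailable on Windows, where mclapply only accepts mc.cores = 1
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Toy worker: returns a small 2-gene, 1-column matrix per drug
toy_results <- mclapply(toy_drugs, function(drug) {
    matrix(nchar(drug), nrow = 2, ncol = 1,
           dimnames = list(c("geneA", "geneB"), drug))
}, mc.cores = n_cores)
names(toy_results) <- toy_drugs

# Flag jobs that came back as errors instead of matrices
failed <- vapply(toy_results, inherits, logical(1), what = "try-error")
stopifnot(!any(failed))

# Bind the per-drug matrices column-wise into one matrix
toy_matrix <- do.call(cbind, toy_results)
toy_matrix
```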

Saving.

In [22]:
saveRDS(object = JHOS2_pseudo_parallelized, file = "JHOS2_pseudobulkreplicates_100reps_rev.RDS")
saveRDS(object = PDC3_pseudo_parallelized, file = "PDC3_pseudobulkreplicates_100reps_rev.RDS")
saveRDS(object = PDC2_pseudo_parallelized, file = "PDC2_pseudobulkreplicates_100reps_rev.RDS")

In [23]:
sessionInfo()

R version 4.2.2 (2022-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Rocky Linux 8.8 (Green Obsidian)

Matrix products: default
BLAS/LAPACK: /homedir01/adini22/.conda/envs/cellhashing_preprocessing/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] readxl_1.4.1       SeuratObject_4.1.3 Seurat_4.3.0.9001 

loaded via a namespace (and not attached):
  [1] nlme_3.1-162           spatstat.sparse_3.0-0  matrixStats_0.62.0    
  [4] RcppAnnoy_0.0.20       RColorBrewer_1.1-3     httr_1.4.4            
