# Computing pseudobulk replicates (PBRs)

Drug signature scores are derived from differential expression analyses between DMSO (untreated) and drug-treated cells. The workflow is:
- Producing 1,000 pseudoreplicates per replicate as bootstrapped samples of cells
- Obtaining the pseudobulk of each pseudoreplicate via aggregation
- Running differential expression analysis between treated and untreated cells within each model with edgeR (GLM, quasi-likelihood F-test), the best-ranking method in Soneson and Robinson, 2018 https://doi.org/10.1038/nmeth.4612 (see the other R script).
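
The first two steps above can be sketched on a toy count matrix in base R. Everything in this block (the matrix `toy_counts`, the helper `bootstrap_pseudobulk`, the sample sizes) is hypothetical and only illustrates the idea of drawing cells with replacement and summing their counts; the notebook itself relies on Seurat for this.

```r
# Toy genes x cells count matrix (hypothetical data, not from the study)
set.seed(1)
toy_counts <- matrix(rpois(5 * 20, lambda = 3), nrow = 5,
                     dimnames = list(paste0("gene", 1:5), paste0("cell", 1:20)))

# Hypothetical helper: draw cells with replacement and sum their counts per gene
bootstrap_pseudobulk <- function(counts, n_cells, n_boot) {
    sapply(seq_len(n_boot), function(i) {
        set.seed(i)  # one seed per pseudoreplicate, so the draws are reproducible
        sampled <- sample(colnames(counts), size = n_cells, replace = TRUE)
        # A cell drawn several times contributes its counts several times
        rowSums(counts[, sampled, drop = FALSE])
    })
}

pbr <- bootstrap_pseudobulk(toy_counts, n_cells = 13, n_boot = 10)
dim(pbr)  # genes x pseudoreplicates
```

Each column of `pbr` plays the role of one pseudobulk replicate that could then be passed to a count-based differential expression method.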

To compare how strongly the three models respond to a given drug, the number of differentially expressed genes will be extracted and plotted.

Loading the R packages.

In [1]:
suppressWarnings({suppressPackageStartupMessages({
    library(Seurat)
    library(readxl)
    library(parallel)
})})

To assign each cell to a sample, we need its well of origin, so we load the HTO assignment of each cell.

In [2]:
JHOS2_HTO_assignment <- read.table(file = "JHOS2_HTO_classification_cell_by_cell.txt",
                                   sep = "\t", header = FALSE)
PDC1_HTO_assignment <- read.table(file = "PDC1_HTO_classification_cell_by_cell.txt",
                                  sep = "\t", header = FALSE)
PDC2_HTO_assignment <- read.table(file = "PDC2_HTO_classification_cell_by_cell.txt",
                                  sep = "\t", header = FALSE)

With each cell assigned to a well, we build a look-up table that returns the well of origin and its annotation. This requires the treatment group assignment; because each treatment occupies two distinct wells (six for untreated cells), we also add a number that records the replicate ID.

In [5]:
colnames(JHOS2_HTO_assignment) <- colnames(PDC1_HTO_assignment) <- colnames(PDC2_HTO_assignment) <- c("CellID", "Well")

Now we can easily identify the largest replicate of each model; its size is used later on when sampling the pseudobulk replicates.

In [6]:
max_repr_JHOS2 <- max(table(JHOS2_HTO_assignment$Well))
max_repr_PDC1 <- max(table(PDC1_HTO_assignment$Well))
max_repr_PDC2 <- max(table(PDC2_HTO_assignment$Well))
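
As a small illustration of how this value is used, the function defined later in this notebook draws `round(max_repr_size * (2/3))` cells per bootstrap sample. The sketch below reproduces that arithmetic on a hypothetical assignment table (`toy_assignment` and its well names are made up).

```r
# Toy well assignment (hypothetical), one row per cell
toy_assignment <- data.frame(
    CellID = paste0("cell", 1:9),
    Well   = c("w1", "w1", "w1", "w1", "w1", "w1", "w2", "w2", "w3")
)

cells_per_well <- table(toy_assignment$Well)   # number of cells in each well
max_repr_toy   <- max(cells_per_well)          # largest replicate: 6 cells
sample_size    <- round(max_repr_toy * (2 / 3), digits = 0)  # 4 cells per bootstrap sample
```

Because the sample size depends only on the largest well, every replicate of a model is resampled to the same number of cells.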

Loading the treatment group assignment table.

In [7]:
treatment_groups <- as.data.frame(read_xlsx(path = "Treatment_groups.xlsx", sheet = 1, col_names = TRUE))
rownames(treatment_groups) <- treatment_groups$`HTO classification`

Recording the duplicate number: for each drug in the table, we fetch the rows whose _Drug_ column equals that drug and replace the NA placeholders in the duplicate_number column with the numbers 1 to the number of replicates of that drug.

In [9]:
treatment_groups$duplicate_number <- NA
for(d in unique(treatment_groups$Drug)){
    # Rows belonging to this drug, numbered 1..n in table order
    drug_rows <- which(treatment_groups$Drug == d)
    treatment_groups[drug_rows, "duplicate_number"] <- seq_along(drug_rows)
}

Now we can store the identity given by drug + underscore + duplicate number (e.g. DMSO_6).

In [10]:
treatment_groups$identity_duplicate <- paste0(treatment_groups$Drug, "_", treatment_groups$duplicate_number)
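
The numbering and labelling steps above can be checked on a toy table. This is only a sketch: `toy_groups` and its drug names are hypothetical, and `ave()` is used here as a loop-free alternative to the `for` loop, not as the notebook's method.

```r
# Hypothetical treatment table with interleaved replicates
toy_groups <- data.frame(
    Drug = c("DMSO", "DrugA", "DMSO", "DrugA", "DMSO"),
    stringsAsFactors = FALSE
)

# seq_along() restarts at 1 within each drug, following the row order
toy_groups$duplicate_number <- ave(seq_len(nrow(toy_groups)),
                                   toy_groups$Drug, FUN = seq_along)

# Drug + underscore + duplicate number
toy_groups$identity_duplicate <- paste0(toy_groups$Drug, "_",
                                        toy_groups$duplicate_number)
toy_groups$identity_duplicate
```

The numbering follows the order of the rows, so the same well always receives the same duplicate number as long as the table is not reordered.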

To add these identities to the original object's metadata, we first load the clustered Seurat object.

In [11]:
sc_data <- readRDS(file = "HGSOC_CellHashing_CLUSTERED.RDS")

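Building the look-up table: each cell ID is prefixed with its model name so that it can be matched against the cell names of the Seurat object.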

In [12]:
JHOS2_HTO_assignment$CellID <- paste0("JHOS2_", JHOS2_HTO_assignment$CellID)
PDC1_HTO_assignment$CellID <- paste0("PDC1_", PDC1_HTO_assignment$CellID)
PDC2_HTO_assignment$CellID <- paste0("PDC2_", PDC2_HTO_assignment$CellID)

Merging the look-up tables into a single one.

In [14]:
HTO_assignment_lookup_table <- rbind(JHOS2_HTO_assignment, PDC1_HTO_assignment, PDC2_HTO_assignment)
nrow(HTO_assignment_lookup_table) # Matches number of cells expected

In [15]:
HTO_assignment_lookup_table$identity_duplicate <- sapply(HTO_assignment_lookup_table$Well,
                                                         function(x) treatment_groups[x, "identity_duplicate"])
rownames(HTO_assignment_lookup_table) <- HTO_assignment_lookup_table$CellID
head(HTO_assignment_lookup_table)

|  | CellID | Well | identity_duplicate |
|---|---|---|---|
|  | `<chr>` | `<chr>` | `<chr>` |
| JHOS2_AAACCCAAGCAAATCA-1 | JHOS2_AAACCCAAGCAAATCA-1 | column8_row4 | Fedratinib_2 |
| JHOS2_AAACCCAAGGTCGCCT-1 | JHOS2_AAACCCAAGGTCGCCT-1 | column10_row2 | BMS-754807_2 |
| JHOS2_AAACCCACAGAACATA-1 | JHOS2_AAACCCACAGAACATA-1 | column12_row7 | SCH772984_2 |
| JHOS2_AAACCCAGTCTGTCAA-1 | JHOS2_AAACCCAGTCTGTCAA-1 | column12_row6 | TGX-221_2 |
| JHOS2_AAACCCAGTCTGTGGC-1 | JHOS2_AAACCCAGTCTGTGGC-1 | column8_row1 | Ipatasertib_2 |
| JHOS2_AAACCCAGTTGGACCC-1 | JHOS2_AAACCCAGTTGGACCC-1 | column4_row1 | Ralimetinib_1 |
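
The well-to-identity look-up can also be written with `match()`, which makes unmatched wells easy to spot. The following is a minimal sketch on hypothetical tables (`toy_groups`, `toy_cells` and the well names are made up); it is an alternative formulation, not the code used above.

```r
# Hypothetical annotation table: one row per well
toy_groups <- data.frame(
    well = c("column1_row1", "column1_row2"),
    identity_duplicate = c("DMSO_1", "DrugA_1")
)

# Hypothetical cell table; the last well is absent from the annotation
toy_cells <- data.frame(
    CellID = c("c1", "c2", "c3"),
    Well   = c("column1_row2", "column1_row1", "column9_row9")
)

# Position of each cell's well in the annotation table (NA if missing)
idx <- match(toy_cells$Well, toy_groups$well)
toy_cells$identity_duplicate <- toy_groups$identity_duplicate[idx]

# Cells whose well could not be annotated
unmatched <- toy_cells$CellID[is.na(idx)]
```

Inspecting `unmatched` before going further helps catch wells that are missing from the treatment table, since those cells would otherwise silently carry an `NA` identity.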


Now we can easily add the duplicate identity to the Seurat object's metadata.

In [16]:
sc_data@meta.data$duplicate_drug_number <- sapply(rownames(sc_data@meta.data), 
                                                  function(x) HTO_assignment_lookup_table[x, "identity_duplicate"])

In order to obtain model- and replicate-specific pseudobulk, we add the model info to the identity duplicate column. This eases the re-usability of the function too.

In [17]:
sc_data@meta.data$duplicate_drug_number <- paste0(sc_data@meta.data$model, "_", sc_data@meta.data$duplicate_drug_number)
Idents(sc_data) <- "duplicate_drug_number"
head(Idents(sc_data))
tail(Idents(sc_data))

Running the pseudobulk replicate (PBR) computations: we design a function that can be launched in parallel for each of the three models.

In [18]:
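# Builds bootstrapped pseudobulk replicates for one drug in one model.
#   seurat_object: Seurat object holding the raw counts in the "RNA" assay
#   identity:      metadata column holding the model_drug_replicate labels
#   model, drug:   model and drug names used to select the cells
#   max_repr_size: number of cells in the largest replicate of the model
# Returns a genes x pseudoreplicates matrix of aggregated raw counts.
pseudoreplicates_pseudobulk <- function(seurat_object, identity, model, drug, max_repr_size){

    # Create pseudoreplicates of each identity in a certain model+condition
    # Setting identities
    Idents(seurat_object) <- identity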
    
    # Retrieving cells of interest
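    # fixed = TRUE matches the label literally (no regex interpretation of the drug name);
    # the trailing underscore avoids picking up drugs that merely share a prefix
    cells <- Idents(seurat_object)[grep(x = Idents(seurat_object), pattern = paste0(model, "_", drug, "_"), fixed = TRUE)]

    # One level per replicate found for this drug (model_drug_1, model_drug_2, ...)
    n_replicates <- length(unique(as.character(cells)))
    cells <- factor(cells, levels = paste0(model, "_", drug, "_", seq_len(n_replicates)))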
    
    # Initialising the expression matrix that will store all the pseudoreplicates
    exp_mat_pseudoreplicates <- c()
    
    # Iteratively, for each level, we extract the cells, draw the bootstrap samples, and aggregate the counts
    for(replicate in levels(cells)){
        
        # Extracting the cells
        replicate_cells <- cells[cells == replicate]
        
        # Producing 1,000 samples with replacement, each of 2/3 of the largest replicate size, rounded to the closest integer
        # For each sample, extracting the Seurat data and aggregating the raw counts
        for(i in 1:1000){
            # IMPORTANT CHANGE: DYNAMIC SEED 
            # This guarantees that the resampling is done always in a different way, yet reproducible.
            set.seed(i)
    
            # Creating the sample
            bootstrap_cells <- sample(x = replicate_cells, 
                                      size = round(max_repr_size*(2/3), digits = 0), 
                                      replace = TRUE) 
            # Extracting the cells
            subset_seurat_object <- subset(x = seurat_object, 
                                           cells = names(bootstrap_cells))
            
            # Since Seurat's AggregateExpression has issues with duplicated row names, we need to rename the cells
            subset_seurat_object <- RenameCells(subset_seurat_object,
                                                new.names = paste0(1:length(bootstrap_cells),"_", colnames(subset_seurat_object)))
           
            # Getting the aggregate-based pseudobulk, retaining only raw counts for edgeR
            # Since we work on the raw counts, no additional scaling of the subset is needed
            # AggregateExpression calls the internal Seurat function "PseudobulkExpression" with pb.method = "aggregate"
            # Aggregating (summing the counts) allows us to keep the heterogeneity across cells
            pseudobulk_replicate <- AggregateExpression(object = subset_seurat_object,
                                                        slot = "counts",
                                                        assays = "RNA",
                                                        group.by = identity,
                                                        verbose = FALSE)$RNA

            # All cells share the same identity, so we should obtain a one-column matrix, which we append to the expression matrix
            # We include the iteration number in the column name to keep track of the pseudoreplicate
            colnames(pseudobulk_replicate) <- paste0(replicate, "_", i)
            exp_mat_pseudoreplicates <- cbind(exp_mat_pseudoreplicates, pseudobulk_replicate)
            
            # Garbage collection, releases memory
            gc()
        }
    }
    return(exp_mat_pseudoreplicates)
}
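
To illustrate the seeding strategy used inside the function, the sketch below shows that calling `set.seed(i)` before each draw makes every bootstrap sample reproducible while different iterations still produce different samples. The cell names and the helper `draw` are hypothetical.

```r
# Hypothetical pool of cells for one replicate
cells <- paste0("cell", 1:50)

# Hypothetical helper: one bootstrap draw tied to the iteration number i
draw <- function(i, size = 10) {
    set.seed(i)  # the seed depends only on the iteration number
    sample(cells, size = size, replace = TRUE)
}

same_seed_twice <- identical(draw(1), draw(1))  # same iteration, same sample
different_seeds <- identical(draw(1), draw(2))  # different iterations
```

Since the seed depends only on `i`, rerunning the notebook regenerates the same pseudoreplicates for a given input.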

We use the _parallel_ package via _mclapply_, the parallel version of _lapply_ (attempts with future, and with the callr API for future, failed). For each model, the function is run in parallel over the treatment groups on 46 cores, which corresponds to one core per treatment group.

In [19]:
all_drugs <- unique(treatment_groups$Drug)
all_drugs

In [20]:
JHOS2_pseudo_parallelized <- mclapply(all_drugs, function(x) pseudoreplicates_pseudobulk(seurat_object = sc_data,
                                                                          identity = "duplicate_drug_number",
                                                                          model = "JHOS2",
                                                                          drug = x,
                                                                          max_repr_size = max_repr_JHOS2), mc.cores = 46)

In [21]:
PDC1_pseudo_parallelized <- mclapply(all_drugs, function(x) pseudoreplicates_pseudobulk(seurat_object = sc_data,
                                                                          identity = "duplicate_drug_number",
                                                                          model = "PDC1",
                                                                          drug = x,
                                                                          max_repr_size = max_repr_PDC1), mc.cores = 46)

In [22]:
PDC2_pseudo_parallelized <- mclapply(all_drugs, function(x) pseudoreplicates_pseudobulk(seurat_object = sc_data,
                                                                          identity = "duplicate_drug_number",
                                                                          model = "PDC2",
                                                                          drug = x,
                                                                          max_repr_size = max_repr_PDC2), mc.cores = 46)
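
The calling pattern above can be tried on a toy task before launching the full computation. This is a minimal sketch: `toy_drugs` and `toy_task` are hypothetical stand-ins, and the core count is deliberately small. It also labels the results by drug and checks for failed jobs, which `mclapply` reports as `try-error` objects when run on several cores.

```r
library(parallel)

toy_drugs <- c("DMSO", "DrugA", "DrugB")   # hypothetical drug names

# Hypothetical stand-in for the per-drug computation
toy_task <- function(drug) nchar(drug)

# Forking is unavailable on Windows, where mc.cores must be 1
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
toy_results <- mclapply(toy_drugs, toy_task, mc.cores = n_cores)

# Label each list element with the drug it belongs to
names(toy_results) <- toy_drugs

# Flag jobs that failed in a forked worker
failed <- vapply(toy_results, inherits, logical(1), what = "try-error")
```

Naming the list elements makes it easier to retrieve the pseudobulk replicates of a specific drug from the saved objects later on.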

Saving.

In [23]:
saveRDS(object = JHOS2_pseudo_parallelized, file = "JHOS2_pseudobulkreplicates.RDS")
saveRDS(object = PDC1_pseudo_parallelized, file = "PDC1_pseudobulkreplicates.RDS")
saveRDS(object = PDC2_pseudo_parallelized, file = "PDC2_pseudobulkreplicates.RDS")
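
Each saved object is a list with one element per drug. As a rough sketch of how such a list could be reloaded and combined into a single genes x pseudoreplicates matrix, the example below uses a hypothetical two-drug list and a temporary file; the matrices and column names are made up.

```r
# Hypothetical result: one genes x pseudoreplicates matrix per drug
toy_list <- list(
    matrix(1:4, nrow = 2,
           dimnames = list(c("g1", "g2"), c("M_DMSO_1_1", "M_DMSO_1_2"))),
    matrix(5:8, nrow = 2,
           dimnames = list(c("g1", "g2"), c("M_DrugA_1_1", "M_DrugA_1_2")))
)

# Round trip through an RDS file
path <- tempfile(fileext = ".RDS")
saveRDS(object = toy_list, file = path)
reloaded <- readRDS(file = path)

# Column-bind all per-drug matrices (they share the same genes in the same order)
combined <- do.call(cbind, reloaded)
dim(combined)
```

This assumes every element has the same rows in the same order, which holds when all matrices come from the same Seurat object.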

In [24]:
sessionInfo()

R version 4.2.2 (2022-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Rocky Linux 8.8 (Green Obsidian)

Matrix products: default
BLAS/LAPACK: /homedir01/adini22/.conda/envs/cellhashing_analyses/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] readxl_1.4.1       SeuratObject_4.1.3 Seurat_4.3.0.9002 

loaded via a namespace (and not attached):
  [1] Rtsne_0.16             colorspace_2.1-0       deldir_1.0-6          
  [4] ellipsis_0.3.2         ggridges_0.5.4         IRdisplay_1.1         
  [7]