
5. Other personalized settings

wguo-research edited this page Jul 5, 2022 · 14 revisions
  1. Run scCancer from an expression matrix
  2. Re-perform cell calling
  3. Modify cell QC thresholds
  4. Modify gene QC thresholds
  5. Perform ambient RNAs contamination correction
  6. Set species and genome used for the sample
  7. Analyze PDX samples
  8. Use "UMAP" coordinates to present analyses results
  9. Analyze multi-modal data
  10. Train new cell type templates
  11. Modify the parameters of harmony

Run scCancer from an expression matrix

If you only have a count matrix and want to perform scCancer analysis, you can use the generate10Xdata function to generate a 10X-like data folder from the matrix and its gene information. It requires:

  • matrix: a gene-cell matrix or data.frame.
  • gene.info: a data.frame of gene information. It should contain two columns: the first is the gene Ensembl ID and the second is the gene symbol (as shown below). The order of the genes must match the row order of the parameter matrix. Other details can be found with help(generate10Xdata).
ENSG00000177757	FAM87B
ENSG00000230368	FAM41C
ENSG00000187634	SAMD11
  • outPath: a path to save the output files.

Then you can perform scCancer analysis based on the generated data folder.
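
The following is a minimal sketch of this workflow on a toy count matrix. The object names (counts, gene.info, out.dir), the gene.info column names, and the random counts are illustrative and not part of scCancer. The call to generate10Xdata is guarded so that the snippet still runs where scCancer is not installed.

```r
# Toy gene-by-cell count matrix (3 genes x 4 cells); values are made up.
set.seed(1)
counts <- matrix(rpois(12, lambda = 2), nrow = 3, ncol = 4,
                 dimnames = list(c("FAM87B", "FAM41C", "SAMD11"),
                                 paste0("cell", 1:4)))

# Gene information: column 1 = Ensembl ID, column 2 = gene symbol,
# listed in the same order as the rows of `counts`.
gene.info <- data.frame(
    EnsemblID = c("ENSG00000177757", "ENSG00000230368", "ENSG00000187634"),
    Symbol = c("FAM87B", "FAM41C", "SAMD11"),
    stringsAsFactors = FALSE
)

# The gene order must be consistent with the matrix rows.
stopifnot(identical(gene.info$Symbol, rownames(counts)))

# Illustrative output location; replace with your own path.
out.dir <- file.path(tempdir(), "sample-10X")
if (requireNamespace("scCancer", quietly = TRUE)) {
    scCancer::generate10Xdata(matrix = counts, gene.info = gene.info,
                              outPath = out.dir)
}
```

Checking the row order before writing the folder catches the most likely mistake early, since a mismatch between gene.info and the matrix rows would silently mislabel genes.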

Re-perform cell calling

If the sample data were generated by Cell Ranger V2 (CR2) and contain the "raw_gene_bc_matrices" matrix, the scCancer package can re-perform cell calling using the EmptyDrops method (implemented in the R package DropletUtils). Users need to install DropletUtils manually and load it in their script.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DropletUtils")

library(DropletUtils)

Modify cell QC thresholds

After the scStatistics step, an HTML report (report-scStat.html) is generated, which presents the statistical features of the data from various perspectives (nUMI, nGene, mito.percent, ribo.percent, diss.percent). By identifying outliers in the distributions of these metrics, scCancer adaptively suggests thresholds for filtering low-quality cells and records them in the file cell.QC.thres.txt.

Users can modify the values in file cell.QC.thres.txt to make scAnnotation use the updated thresholds to perform cell QC and downstream analyses.
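
The file can be edited by hand or from R. The sketch below is only an illustration of the second route: it assumes the file is tab-delimited with a header row and one row per metric, and the column names (Index, Low, High) and all threshold values are assumptions made up for the example. Inspect the file produced by scStatistics before adapting it.

```r
# ASSUMPTION: tab-delimited file with a header and one row per metric.
# Column names (Index, Low, High) and all values are illustrative only.
thres.file <- file.path(tempdir(), "cell.QC.thres.txt")
mock <- data.frame(
    Index = c("nUMI", "nGene", "mito.percent", "ribo.percent", "diss.percent"),
    Low = c(500, 200, -Inf, -Inf, -Inf),
    High = c(Inf, Inf, 0.2, 0.5, 0.1)
)
write.table(mock, thres.file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the thresholds, tighten the mitochondrial upper bound, and write back.
thres <- read.table(thres.file, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE)
thres$High[thres$Index == "mito.percent"] <- 0.15
write.table(thres, thres.file, sep = "\t", quote = FALSE, row.names = FALSE)
```

In real use, thres.file would point to the cell.QC.thres.txt written by scStatistics rather than to a mock file in a temporary directory.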

Modify gene QC thresholds

In the gene QC step, scCancer filters genes according to three criteria.

  1. Mitochondrial, ribosomal, and dissociation-associated genes.
  2. Genes expressed in fewer cells than the argument nCell.min (the default is 3).
  3. Genes with a background percentage larger than the argument bgPercent.max (the default is 1, which means no filtering). A suitable bgPercent.max can be chosen from the distribution of background percentages shown in report-scStat.html.

Users can modify the values of these arguments according to their needs.
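
To make the second criterion concrete, here is a small base-R sketch of what an nCell.min filter does on a toy matrix. It illustrates the rule as described above and is not scCancer's internal code; the matrix and the names n.expr.cells and kept.genes are made up for the example.

```r
# Toy gene-by-cell count matrix; values are made up.
counts <- matrix(c(0, 0, 1, 0, 0,   # geneA: expressed in 1 cell
                   2, 1, 0, 3, 1,   # geneB: expressed in 4 cells
                   0, 1, 1, 1, 0),  # geneC: expressed in 3 cells
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 paste0("cell", 1:5)))

nCell.min <- 3  # same value as the default described above

# Number of cells in which each gene has a non-zero count.
n.expr.cells <- rowSums(counts > 0)

# Genes expressed in fewer than nCell.min cells are filtered out,
# so genes with at least nCell.min expressing cells are kept.
kept.genes <- names(n.expr.cells)[n.expr.cells >= nCell.min]
```

With this toy input, geneA is dropped because it is expressed in only one cell, while geneB and geneC are kept.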

Perform ambient RNAs contamination correction

In the scStatistics module, users can pass in the soup (background) specific gene lists by the argument bg.spec.genes or use the default setting:

bg.spec.genes <- list(
    igGenes = c('IGHA1','IGHA2','IGHG1','IGHG2','IGHG3','IGHG4','IGHD','IGHE','IGHM', 'IGLC1','IGLC2','IGLC3','IGLC4','IGLC5','IGLC6','IGLC7', 'IGKC', 'IGLL5', 'IGLL1'),
    HLAGenes = c('HLA-DRA', 'HLA-DRB5', 'HLA-DRB1', 'HLA-DQA1', 'HLA-DQB1', 'HLA-DQB1', 'HLA-DQA2', 'HLA-DQB2', 'HLA-DPA1', 'HLA-DPB1'),
    HBGenes = c("HBB","HBD","HBG1","HBG2", "HBE1","HBZ","HBM","HBA2", "HBA1","HBQ1")
)

Then a contamination fraction will be estimated and saved in file ambientRNA-SoupX.txt.

In the scAnnotation module, users who want to correct the expression data according to the estimated ambient RNA contamination fraction can set the argument bool.rmContamination to TRUE (the default is FALSE), and set the argument contamination.fraction to a number between 0 and 1 or to NULL (NULL means that the estimate from scStatistics is used).
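
A minimal sketch of these two settings follows. It assumes the scAnnotation module is invoked through runScAnnotation with dataPath, statPath, and savePath arguments, as in the package's usage examples; those names and the paths are not taken from this page and are illustrative. The call is guarded so the snippet runs even where scCancer or the data are absent.

```r
# Arguments controlling contamination correction in scAnnotation.
contam.args <- list(
    bool.rmContamination = TRUE,    # enable the correction (default is FALSE)
    contamination.fraction = NULL   # NULL: reuse the estimate from scStatistics
)

# ASSUMPTION: illustrative paths and argument names for the annotation call.
paths <- list(dataPath = "./data/sample",
              statPath = "./results/sample",
              savePath = "./results/sample")

if (requireNamespace("scCancer", quietly = TRUE) && dir.exists(paths$dataPath)) {
    anno.results <- do.call(scCancer::runScAnnotation, c(paths, contam.args))
}
```

To override the estimate, replace NULL with a fixed value such as contamination.fraction = 0.1 (an arbitrary example value).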

Set species and genome used for the sample

By default, the arguments species and genome are set to human and hg19. Users can set species to human or mouse, with genome set to hg19 or hg38 for human and mm10 for mouse.
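
Because a mismatched pair is easy to pass by accident, a small check before launching an analysis can help. The helper below, check_species_genome, is hypothetical and not part of scCancer; it only encodes the combinations listed above.

```r
# Hypothetical helper (not part of scCancer): verify that a species/genome
# pair is one of the combinations listed above.
check_species_genome <- function(species = "human", genome = "hg19") {
    valid <- list(human = c("hg19", "hg38"), mouse = c("mm10"))
    if (!species %in% names(valid)) {
        stop("species must be 'human' or 'mouse'")
    }
    if (!genome %in% valid[[species]]) {
        stop(sprintf("genome '%s' does not match species '%s'", genome, species))
    }
    invisible(TRUE)
}

check_species_genome("human", "hg38")  # passes silently
check_species_genome("mouse", "mm10")  # passes silently
```

A call such as check_species_genome("mouse", "hg19") stops with an error instead of letting the mismatch reach the analysis.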

Analyze PDX samples

Patient-derived tumor xenograft (PDX) samples generally contain both human and mouse cells. To handle such human-mouse mixed data, users can explicitly set the arguments hg.mm.mix, species, genome, hg.mm.thres, and mix.anno of the relevant functions. The meaning of these arguments is described in each function's help() page.

Use "UMAP" coordinates to present analyses results

By default, scCancer computes both t-SNE and UMAP low-dimensional coordinates and presents the clustering results with both. Other analyses use the t-SNE 2D coordinates by default. To view those results in UMAP coordinates, set the argument coor.names to c("UMAP_1", "UMAP_2").

Note: If users haven't installed UMAP, they can do so via

reticulate::py_install(packages = 'umap-learn')

Analyze multi-modal data

scCancer is also compatible with multi-modal data that contain both gene expression and antibody capture results. No special settings are needed: scCancer extracts the expression data automatically and runs the downstream analyses.

Train new cell type templates

If the samples contain other cell types to classify, users can train new templates and pass them in through the argument ct.templates. The following code can serve as a reference for training new templates:

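library(gelnet)

# train.data: a named list of gene-by-cell expression matrices, one per cell type
train.data <- list("type1" = expr.data1, "type2" = expr.data2)
ct.templates <- lapply(names(train.data), FUN = function(x){
    # One-class model (y = NULL) with l1 = 0 and l2 = 1; cells go in rows
    result <- gelnet(t(train.data[[x]]), NULL, 0, 1)
    # Keep only the genes with non-zero weights
    return(result$w[result$w != 0])
})
# lapply() over names() returns an unnamed list, so label each template
names(ct.templates) <- names(train.data)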

Modify the parameters of harmony

The parameters harmony.theta, harmony.lambda, and harmony.sigma correspond to the parameters theta, lambda, and sigma of function "RunHarmony" in the harmony package. You can adjust them to test the batch effect correction performance.
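
One way to organize such a comparison is to enumerate the settings up front. The sketch below only builds a grid of candidate values and a label for each setting; the values are arbitrary examples rather than recommendations, and the call that consumes each setting is deliberately left out because it depends on your own multi-sample analysis.

```r
# Candidate values are illustrative, not recommendations.
harmony.grid <- expand.grid(
    harmony.theta = c(1, 2, 4),
    harmony.lambda = c(0.5, 1),
    harmony.sigma = c(0.05, 0.1)
)

# One label per setting, e.g. for naming separate output folders.
setting.labels <- apply(harmony.grid, 1, function(s) {
    sprintf("theta=%g_lambda=%g_sigma=%g",
            s["harmony.theta"], s["harmony.lambda"], s["harmony.sigma"])
})

for (i in seq_len(nrow(harmony.grid))) {
    setting <- as.list(harmony.grid[i, ])
    # Pass `setting` to the scCancer call that accepts the harmony.* arguments
    # and save its results under setting.labels[i] for later comparison.
}
```

Saving each run under its own label keeps the results separate, so the corrected embeddings can be compared side by side.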
