5. Other personalized settings
- Run scCancer from an expression matrix
- Re-perform cell calling
- Modify cell QC thresholds
- Modify gene QC thresholds
- Perform ambient RNAs contamination correction
- Set species and genome used for the sample
- Analyze PDX samples
- Use "UMAP" coordinates to present analyses results
- Analyze multi-modal data
- Train new cell type templates
- Modify the parameters of harmony
If you only have a count matrix and want to perform scCancer analysis, you can use the generate10Xdata function to generate a 10X-like data folder based on the data matrix and gene information. It requires:
- matrix: a gene-cell matrix or data.frame.
- gene.info: a data.frame of gene information. It should contain two columns: the first is the gene Ensembl ID and the second is the gene symbol (as shown below). The order of the genes should be consistent with the row order of the argument matrix. Other details can be found via help(generate10Xdata).
ENSG00000177757 FAM87B
ENSG00000230368 FAM41C
ENSG00000187634 SAMD11
- outPath: a path to save the output files.
Then you can perform scCancer analysis based on the generated data folder.
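As a minimal sketch of the inputs described above, the following builds a toy count matrix and a matching gene.info table, then calls generate10Xdata with the three listed arguments. The toy counts, the object names (count.mat, out.path), the column names of gene.info, and the output folder are placeholders chosen for illustration. The call is guarded so that the snippet still runs when scCancer is not installed.

```r
# Gene information: first column Ensembl ID, second column gene symbol.
# The column names are placeholders chosen for this example.
gene.info <- data.frame(
    EnsemblID = c("ENSG00000177757", "ENSG00000230368", "ENSG00000187634"),
    Symbol = c("FAM87B", "FAM41C", "SAMD11"),
    stringsAsFactors = FALSE
)

# Toy gene-by-cell count matrix (rows = genes, columns = cells); values are made up.
set.seed(1)
count.mat <- matrix(
    rpois(nrow(gene.info) * 5, lambda = 2),
    nrow = nrow(gene.info),
    dimnames = list(gene.info$Symbol, paste0("cell", 1:5))
)

# The row order of the matrix must match the row order of gene.info.
stopifnot(identical(rownames(count.mat), gene.info$Symbol))

# Placeholder output folder for the 10X-like data.
out.path <- file.path(tempdir(), "toy-10X")

if (requireNamespace("scCancer", quietly = TRUE)) {
    scCancer::generate10Xdata(
        matrix = count.mat,
        gene.info = gene.info,
        outPath = out.path
    )
}
```

The folder in out.path can then be used as the data folder for the analysis steps.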
If the sample data is generated by CR2 and contains the "raw_gene_bc_matrices" matrix, the scCancer package can re-perform cell calling using the EmptyDrops method (implemented in the R package DropletUtils). Users need to install DropletUtils manually and load it in their script:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("DropletUtils")
library(DropletUtils)
After the step scStatistics, an HTML report (report-scStat.html) will be generated, which presents the statistical features of the data from various perspectives (nUMI, nGene, mito.percent, ribo.percent, diss.percent).
By identifying outliers in the distribution of these metrics, scCancer adaptively suggests thresholds for filtering low-quality cells and records them in the file cell.QC.thres.txt.
Users can modify the values in the file cell.QC.thres.txt so that scAnnotation uses the updated thresholds for cell QC and downstream analyses.
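The thresholds file can also be edited from R instead of by hand. The sketch below is only an illustration: the layout it assumes (a tab-separated table with columns Index, Low.threshold, and High.threshold) is an assumption, so inspect your own cell.QC.thres.txt before adapting it. A mock file with made-up values is written to a temporary folder to keep the snippet self-contained.

```r
# Mock thresholds file; the column names and all values are assumptions,
# not real scCancer output.
thres.file <- file.path(tempdir(), "cell.QC.thres.txt")
mock.thres <- data.frame(
    Index = c("nUMI", "nGene", "mito.percent", "ribo.percent", "diss.percent"),
    Low.threshold = c(500, 200, -Inf, -Inf, -Inf),
    High.threshold = c(Inf, Inf, 0.2, 0.5, 0.1),
    stringsAsFactors = FALSE
)
write.table(mock.thres, thres.file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the table, tighten the upper mito.percent threshold, and write it back.
cell.thres <- read.table(thres.file, header = TRUE, sep = "\t",
                         stringsAsFactors = FALSE)
cell.thres$High.threshold[cell.thres$Index == "mito.percent"] <- 0.1
write.table(cell.thres, thres.file, sep = "\t", quote = FALSE, row.names = FALSE)
```

In a real run, thres.file would point to the cell.QC.thres.txt written by scStatistics, and the mock-file step would be dropped.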
For the gene QC step, scCancer filters genes according to three criteria:
- Mitochondrial, ribosomal, and dissociation-associated genes.
- Genes expressed in fewer cells than the argument nCell.min (the default is 3).
- Genes with a background percentage larger than the argument bgPercent.max (the default is 1, which means no filtering). The value of bgPercent.max can be chosen according to the distribution of background percentage in report-scStat.html.
Users can modify the values of these arguments according to their needs.
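To make the last two filtering rules concrete, here is a small base-R sketch on a toy matrix. It illustrates the rules as described above and is not scCancer's internal implementation. The matrix, the bg.percent vector, and the chosen bgPercent.max value are made up for the example.

```r
# Toy gene-by-cell counts; all values are made up for illustration.
expr <- matrix(
    c(0, 0, 1, 0,    # geneA: expressed in 1 cell
      2, 1, 3, 1,    # geneB: expressed in 4 cells
      1, 0, 2, 4,    # geneC: expressed in 3 cells
      5, 2, 1, 3),   # geneD: expressed in 4 cells
    nrow = 4, byrow = TRUE,
    dimnames = list(c("geneA", "geneB", "geneC", "geneD"), paste0("cell", 1:4))
)

# Hypothetical per-gene background percentages.
bg.percent <- c(geneA = 0.001, geneB = 0.002, geneC = 0.001, geneD = 0.05)

nCell.min <- 3
bgPercent.max <- 0.01

# Number of cells in which each gene is detected.
n.cells <- rowSums(expr > 0)

# Drop genes expressed in fewer than nCell.min cells or with a background
# percentage larger than bgPercent.max.
keep <- n.cells >= nCell.min & bg.percent <= bgPercent.max
filtered.expr <- expr[keep, , drop = FALSE]
rownames(filtered.expr)
```

Here geneA fails the nCell.min rule and geneD fails the bgPercent.max rule, so only geneB and geneC remain.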
In the scStatistics module, users can pass in soup (background) specific gene lists via the argument bg.spec.genes, or use the default setting:
bg.spec.genes <- list(
    igGenes = c('IGHA1', 'IGHA2', 'IGHG1', 'IGHG2', 'IGHG3', 'IGHG4', 'IGHD', 'IGHE', 'IGHM',
                'IGLC1', 'IGLC2', 'IGLC3', 'IGLC4', 'IGLC5', 'IGLC6', 'IGLC7',
                'IGKC', 'IGLL5', 'IGLL1'),
    HLAGenes = c('HLA-DRA', 'HLA-DRB5', 'HLA-DRB1', 'HLA-DQA1', 'HLA-DQB1', 'HLA-DQB1',
                 'HLA-DQA2', 'HLA-DQB2', 'HLA-DPA1', 'HLA-DPB1'),
    HBGenes = c('HBB', 'HBD', 'HBG1', 'HBG2', 'HBE1', 'HBZ', 'HBM', 'HBA2', 'HBA1', 'HBQ1')
)
A contamination fraction will then be estimated and saved in the file ambientRNA-SoupX.txt.
In the scAnnotation module, if users want to correct the expression data according to the estimated ambient RNA contamination fraction, they can set the argument bool.rmContamination to TRUE (the default is FALSE) and set the argument contamination.fraction to a number between 0 and 1 or to NULL (NULL means the result of scStatistics will be used).
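A hedged sketch of how these two arguments might be passed is shown below. It assumes that runScAnnotation() is the entry point of the scAnnotation module and that it accepts dataPath, statPath, and savePath; neither is stated in this section. The paths are placeholders, and isValidFraction is a hypothetical helper written for this example, not part of scCancer.

```r
# Hypothetical helper: accept NULL or a single number between 0 and 1.
isValidFraction <- function(x) {
    is.null(x) ||
        (is.numeric(x) && length(x) == 1 && !is.na(x) && x >= 0 && x <= 1)
}

# NULL means the fraction estimated by scStatistics will be used.
contamination.fraction <- NULL
stopifnot(isValidFraction(contamination.fraction))

# Placeholder paths; replace them with your own data and results folders.
dataPath <- "./data/sample1"
statPath <- "./results/sample1"
savePath <- "./results/sample1"

# Assumption: runScAnnotation() and its path arguments are as named here.
if (requireNamespace("scCancer", quietly = TRUE) && dir.exists(dataPath)) {
    anno.results <- scCancer::runScAnnotation(
        dataPath = dataPath,
        statPath = statPath,
        savePath = savePath,
        bool.rmContamination = TRUE,
        contamination.fraction = contamination.fraction
    )
}
```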
By default, the arguments species and genome are set to human and hg19.
Users can set species to human or mouse, and set genome to hg19 or hg38 (for human) or mm10 (for mouse).
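Because the genome has to match the species, a quick check before launching a long run can be useful. The following is a minimal sketch in base R. The lookup table mirrors the pairs listed above, and isValidPair is a hypothetical helper, not an scCancer function.

```r
# Allowed species-genome pairs, as listed above.
valid.genomes <- list(human = c("hg19", "hg38"), mouse = "mm10")

# Hypothetical helper: TRUE when the genome belongs to the given species.
isValidPair <- function(species, genome) {
    species %in% names(valid.genomes) && genome %in% valid.genomes[[species]]
}

species <- "human"
genome <- "hg38"
stopifnot(isValidPair(species, genome))
```

The checked values can then be passed as the species and genome arguments of the relevant functions.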
Patient-derived tumor xenograft (PDX) samples generally contain both human and mouse cells. To handle such human-mouse mixed data, users can explicitly set the arguments hg.mm.mix, species, genome, hg.mm.thres, and mix.anno of the relevant functions.
More details on the meaning of these arguments can be found with the help() command.
By default, we use both t-SNE and UMAP to obtain low-dimensional coordinates, and the clustering results are presented using both of them.
For other analyses, we use t-SNE 2D coordinates by default.
Users can also set the argument coor.names to c("UMAP_1", "UMAP_2") to view the results under UMAP coordinates.
Note: If users haven't installed UMAP, they can do so via:
reticulate::py_install(packages = 'umap-learn')
scCancer is also compatible with multi-modal data, which contain both gene expression and antibody capture results. No special settings are needed: scCancer will extract the expression data automatically and run the downstream analyses.
If users' samples contain other cell types to classify, they can train new templates and pass them in via the argument ct.templates. The following code can serve as a reference for training new templates:
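library(gelnet)
# Each element is assumed to be a gene-by-cell expression matrix for one cell type.
train.data <- list("type1" = expr.data1, "type2" = expr.data2)
ct.templates <- lapply(names(train.data), FUN = function(x){
    # One-class model (y = NULL) with L1 penalty 0 and L2 penalty 1.
    result <- gelnet(t(train.data[[x]]), NULL, 0, 1)
    # Keep only the genes with non-zero weights.
    return(result$w[result$w != 0])
})
# lapply() over names() returns an unnamed list, so restore the cell type names.
names(ct.templates) <- names(train.data)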
The parameters harmony.theta, harmony.lambda, and harmony.sigma correspond to the parameters theta, lambda, and sigma of the function RunHarmony in the harmony package. You can adjust them to test the batch effect correction performance.
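One way to compare settings is to loop over a small grid of values. The sketch below builds such a grid in base R and then, if scCancer and the input folders are available, runs one combination per row. It assumes that runScCombination() is the multi-sample entry point and that it accepts single.savePaths, sampleNames, savePath, and comb.method = "Harmony"; none of this is stated in this section. The paths, sample names, and parameter values are placeholders, not recommendations.

```r
# Candidate values to try; the numbers are illustrative only.
param.grid <- expand.grid(
    harmony.theta = c(1, 2, 4),
    harmony.lambda = c(0.5, 1),
    harmony.sigma = 0.1
)

# Placeholder inputs: result folders of the single-sample analyses.
single.savePaths <- c("./results/sample1", "./results/sample2")
sampleNames <- c("sample1", "sample2")

# Assumption: runScCombination() and the argument names below exist as written.
if (requireNamespace("scCancer", quietly = TRUE) &&
    all(dir.exists(single.savePaths))) {
    for (i in seq_len(nrow(param.grid))) {
        comb.results <- scCancer::runScCombination(
            single.savePaths = single.savePaths,
            sampleNames = sampleNames,
            savePath = file.path("./results/comb", paste0("run", i)),
            comb.method = "Harmony",
            harmony.theta = param.grid$harmony.theta[i],
            harmony.lambda = param.grid$harmony.lambda[i],
            harmony.sigma = param.grid$harmony.sigma[i]
        )
    }
}
```

Writing each run to its own savePath keeps the reports separate, so the batch correction results can be compared side by side.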