SynSigGen

Synthetic (Mutational) Signature Generation (SynSigGen)

Purpose

Create catalogs of synthetic mutational spectra for assessing the performance of mutational signature analysis programs. ‘SynSigGen’ stands for Generation of Synthetic Signatures and Spectra.


Installation

Before installing SynSigGen, install its Bioconductor prerequisites:

install.packages("BiocManager")
BiocManager::install(
  c("Biostrings", "BSgenome", "GenomeInfoDb", "GenomicRanges")
)

Install from GitHub with the R command line:

install.packages("remotes")
remotes::install_github(repo = "steverozen/SynSigGen", ref = "1.1.1-branch")

Example usage

1. Generate synthetic tumor spectra used in The repertoire of mutational signatures in human cancer, Alexandrov et al. (2020)

The original data sets are available at Synapse. Use the functions below to generate the 11 spectra data sets used in the paper The repertoire of mutational signatures in human cancer (https://doi.org/10.1038/s41586-020-1943-3), published in Nature:

# Specify regress.dir = NULL unless you want to compare the
# output with the original data sets.
#
# For such comparisons, a file comparison tool (e.g., BeyondCompare,
# Meld) is recommended over specifying regress.dir,
# because the latter might raise an error even when the generated
# and original data sets are identical.
#
# Set top.level.dir to the destination folder for the data sets;
# otherwise, default paths are used.
PancAdenoCA1000()
ManyTypes2700()
RCCOvary1000()
Create.3.5.40.Abstract()
BladderSkin1000()
Create.2.7a.7b.Abstract()
CreateRandomSyn()

Descriptions of the 11 data sets are available in the section “Description of each suite of synthetic data sets” in Supplementary Note 2 of the paper.
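The comments in the code above recommend a file comparison tool over regress.dir when checking generated data sets against the originals. As a scripted counterpart, here is a minimal sketch in base R that compares two folders file by file using MD5 checksums. The helper compare_dirs() and the folder names in the usage comment are hypothetical and are not part of SynSigGen. The check is byte-exact, so it may also flag files that differ only in line endings.

```r
# Hypothetical helper (not part of SynSigGen): compare two output
# folders file by file using MD5 checksums.
compare_dirs <- function(new.dir, old.dir) {
  # Relative paths of all files under each folder
  new.files <- sort(list.files(new.dir, recursive = TRUE))
  old.files <- sort(list.files(old.dir, recursive = TRUE))
  common    <- intersect(new.files, old.files)

  # Checksums of the files present in both folders
  new.md5 <- tools::md5sum(file.path(new.dir, common))
  old.md5 <- tools::md5sum(file.path(old.dir, common))

  list(
    only.in.new = setdiff(new.files, old.files),
    only.in.old = setdiff(old.files, new.files),
    differing   = common[unname(new.md5) != unname(old.md5)]
  )
}

# Usage (folder names are placeholders):
# res <- compare_dirs("my.generated.data", "original.data")
# str(res)
```

If all three elements of the result are empty, the two folders contain the same files with identical contents.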

2. Generate synthetic spectra used in paper Accuracy of Mutational Signature Software on Correlated Signatures, Wu et al. (2022)

To generate the 20 spectra data sets, in which the mutation loads of SBS1 and SBS5 and the correlation between them are varied, use the function

CreateSBS1SBS5CorrelatedSyntheticData()

This paper was published in Scientific Reports, and the original data sets are available at Zenodo.
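To illustrate what "correlated" exposures can mean, the following base R sketch draws log-scale exposures for two signatures with a chosen Pearson correlation. This is not the method used by CreateSBS1SBS5CorrelatedSyntheticData(); the function simulate_correlated_exposures(), its parameters, and its default values are all assumptions made for illustration.

```r
# Illustrative only: NOT the procedure implemented in SynSigGen.
# All names and default values below are assumptions.
simulate_correlated_exposures <- function(n.tumors, rho,
                                          mean.log10 = c(2.5, 2.5),
                                          sd.log10   = c(0.3, 0.3),
                                          seed       = 1) {
  stopifnot(abs(rho) <= 1)
  set.seed(seed)

  # Two standard normal variables with correlation rho
  z1 <- rnorm(n.tumors)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n.tumors)

  # Shift and scale on the log10 scale, then back-transform
  log10.exp <- rbind(
    SBS1 = mean.log10[1] + sd.log10[1] * z1,
    SBS5 = mean.log10[2] + sd.log10[2] * z2
  )
  10^log10.exp  # 2 x n.tumors matrix of positive exposures
}

# Usage:
# expo <- simulate_correlated_exposures(n.tumors = 500, rho = 0.6)
# cor(log10(expo["SBS1", ]), log10(expo["SBS5", ]))
```

Varying rho and mean.log10 across calls would give a family of data sets that differ in correlation and mutation load, which is the general idea behind such a benchmark.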

3. Generate synthetic spectra used in paper mSigHdp: hierarchical Dirichlet processes for mutational signature discovery, Liu et al. (2022)

To generate the 3 spectra data sets on single base substitution (SBS) mutation channels and the 3 spectra data sets on indel channels, see the GitHub repository Liu_et_al_Sup_Files. The data set generation code in that repository requires SynSigGen >= 1.1.1 as a dependency.
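Before running code that depends on SynSigGen >= 1.1.1, it may help to check the installed version. The sketch below uses base R's package_version() for the comparison; the helper meets_min_version() is a hypothetical name, not a SynSigGen function.

```r
# Hypothetical helper: TRUE if an installed version string meets
# the required minimum version.
meets_min_version <- function(installed, required = "1.1.1") {
  package_version(installed) >= package_version(required)
}

# Usage once SynSigGen is installed:
# installed <- as.character(utils::packageVersion("SynSigGen"))
# if (!meets_min_version(installed)) {
#   stop("SynSigGen >= 1.1.1 is required; found ", installed)
# }
```

Comparing parsed versions rather than raw strings matters here: as plain text, "1.0.10" and "1.1.1" are compared character by character, which does not reliably reflect version order.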


Notes for functions to generate legacy data sets (1 & 2)

  • The wrapper functions used to generate the data sets in the Nature paper are in R files with the suffix “_Nat”.

  • The wrapper functions used to generate the 20 correlated data sets are in the file “CreateSynSBS1SBS5Correlated.R”.

  • These wrapper functions are intended primarily for reproducing the legacy data sets, because they do not round the exposures to integers.

  • By contrast, from version 1.0.10 onward, GenerateSyntheticExposures() rounds the exposures by default.
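The difference between rounded and unrounded exposures can be seen in a toy example. The sketch below uses base R only and does not call any SynSigGen function; the signature and exposure values are made up. It assumes the usual model in which a spectrum is the product of a signature matrix (columns summing to 1) and an exposure matrix.

```r
# Toy data, entirely hypothetical.
# Signatures: 3 mutation channels x 2 signatures, each column sums to 1.
sigs <- matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.3, 0.6),
               nrow = 3,
               dimnames = list(c("ch1", "ch2", "ch3"), c("SigA", "SigB")))

# Exposures: 2 signatures x 2 tumors, non-integer mutation counts.
exposures <- matrix(c(10.4, 5.7,
                      3.2, 8.6),
                    nrow = 2,
                    dimnames = list(c("SigA", "SigB"), c("T1", "T2")))

# Spectra from unrounded and from rounded exposures
spectra.unrounded <- sigs %*% exposures
spectra.rounded   <- sigs %*% round(exposures)

# Total mutations per tumor equal the column sums of the exposures,
# so they are whole numbers only when the exposures are rounded.
colSums(spectra.unrounded)
colSums(spectra.rounded)
```

With unrounded exposures each tumor carries a fractional total mutation count, whereas rounding first gives integer totals, which is why data sets generated with and without rounding are not interchangeable.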


Reference manual

https://github.com/steverozen/SynSigGen/blob/1.1.1-branch/data-raw/SynSigGen_1.1.1.pdf