MutSpot

Non-coding MUTation hotSPOT dectection in cancer genomes

The MutSpot R package systematically and unbiasedly scans cancer whole genomes to detect mutation hotspots. MutSpot first builds a background mutation model that corrects for covariates of mutation probability, such as local nucleotide context, replication timing and epigenomic features. Then MutSpot evaluates the mutation recurrence of focal DNA regions using a Poisson binomial model to account for varying mutation rates across different tumors. Mutation hotspots identified have significantly higher mutation recurrence compared to the background genomic mutation rate, suggesting positive selection in cancer and involvement in tumorigenesis.

References:

Guo et al., Nature Communications, 2018

Guo et al., npj Genomic Medicine, 2020

Installation

MutSpot runs on R (tested on R versions 3.2.0 to 3.6.0). Install the package from Github using the following R commands.

install.packages("devtools")
library(devtools)
install_github("skandlab/MutSpot", subdir="MutSpot_Rpackage")

Alternatively, the package may downloaded from Github and installed in R:

# Clone/download MutSpot into the current working dirctory with the following command: git clone https://github.com/skandlab/MutSpot.git
library(devtools)
install("my/current/directory/MutSpot/MutSpot_Rpackage")

The alternative method takes a longer time as it downloads all test data sets at the same time

MutSpot analysis workflow

The full MutSpot workflow includes the following 7 steps:

Sample mutated and non-mutated sites for LASSO feature selection.
Calculate local mutation rates in 100kb bins across the whole genome.
Select sequence features using LASSO logistic regression.
Select epigenetic features using LASSO logistic regression.
Compute feature matrix based on selected features for background mutation model fitting.
Fit the background mutation model using logistic regression.
Predict mutation hotspots.

By default, the MutSpot() function runs the entire workflow. However, it is possible to run specific steps of the workflow by specifiying the run.to parameter (see full documentation).

Usage example

By default, MutSpot runs in the current working directory unless specified by the user. All intermediate and output files will be saved in the results folder created by MutSpot in the working directory. By default, MutSpot performs hotspot discovry genome-wide. However, the user can constrain hotspot discovery to certain regions of the genome by providing a BED file with the genomic coordinates of specific regions of interest under the region.of.interest parameter (e.g. CTCF binding sites or promoters).

library("MutSpot")

Download the test data sets from https://github.com/skandlab/MutSpot/tree/master/test-data into your working directory.

Run the analysis using the following commands:

Identify SNV hotspots genome-wide. (Whole genome analysis will take less than 1 day using 2 cores)

MutSpot(snv.mutations = "subset_snv_mutations_sid.MAF", cores = 2, cutoff.nucleotide.new = 1, genomic.features = "genomic_features_genome.txt")

Identify SNV hotspots in CTCF binding sites only, including clinical subtype and cosmic signatures as sample specific features. (CTCF analysis will take about 30 minutes using 2 cores)

MutSpot(snv.mutations = "subset_snv_mutations_sid.MAF", region.of.interest = "gastric_ctcf_motif.bed", cores = 2, genomic.features = "genomic_features_ctcf.txt",
sample.snv.features = "sample_features_table_snv.txt")

To run MutSpot on genome assembly Ch38 instead of Ch37, please specify genome.build = "Ch38" in the MutSpot() function.

Main arguments

Parameter	Description
snv.mutations	List of SNVs in MAF format
indel.mutations	List of indels in MAF format
genomic.features	File paths of all potential genomic features for LASSO selection for background mutation model. E.g. Replication timing profile, histone modification profiles
sample.snv.features/sample.indel.features	Tab delimited file of sample specific features. E.g. clinical subtype
min.count	Minimum number of mutated samples in each hotspot (default = 2)
region.of.interest	Restrict hotspot analysis to regions in the given BED file
cores	Number of cores (default = 1) [The maximum number of cores for Windows users is 1.]
genome.build	Genome assembly (default = Ch37)

Input files

1. Mutations

Mutation files contain all SNVs or indels of all tumors in the study in MAF format. MAF file should be tab delimited with the following 6 columns:

Chromosome
Start position (1-based)
End position (1-based)
Reference allele
Alternate allele
Sample ID

Example MAF file:


chr1	16265287	16265287	G	C	patient1
chr1	17320166	17320166	C	T	patient2
chr1	19497536	19497536	G	C	patient3
...	...	..	..	..	...

There is no header row in a MAF file.

2. Genomic features

Genomic features can be continuous or binary. Continuous features, such as replication timing profile, are input as bigwig files. Binary features, such as peak calls of histone modifications, are input as BED files. All continuous features will be discretized into n bins (n is specified by the user). The logistic regression model will be fitted from a frequency table of the counts of mutated and non-mutated sites for all combinations of the covariates. It is recommended for the user to input genomic covariates as binary features where possible to reduce the memory usage of the function.

The genomic features are input as a tab delimited file with 4 columns:

Feature name
File path of genomic feature (either BigWig or BED format).
Binary value indicating if the feature is continuous or binary (1 for continuous, 0 for binary)
Number of bins to discretize continuous feature into (max = 10, NA for binary features).

Example format:

feature_name	file_path	feature_type	nbins
mean_rep_time	./features/wgEncodeUwRepliSeqHepg2WaveSignalRep1.bigWig	1	10
E094-DNase	./features/E094-DNase.bed	0	NA
E094-H3K27ac	./features/E094-H3K27ac.bed	0	NA
...	...	...	...

A binary feature BED file should include the following columns:

Chromosome
Start position (0-based)
End position (0-based)

For binary features, genomic regions that are found in the feature BED file are assigned value of 1, else value of 0

A list of genomic feature files (Transcription factors, DNA secondary structure, Replication timing) can be downloaded from https://github.com/skandlab/MutSpot/tree/master/features/Ch37 or https://github.com/skandlab/MutSpot/tree/master/features/Ch38 into the features folder in your working directory. The user may choose to run the analysis using these features by specifying genomic.features = "./features/genomic_features_genome_default.txt" in the MutSpot() function, else he/she may create a similar text file containing the desired features.

3. Sample specific features (optional)

The user may choose to include sample specific features in the background mutation model, such as the clinical subtype of the tumor. Note that sample specific features will not undergo LASSO feature selection and will be automatically included in the final model. Sample specific features are to be supplied as a tab delimited file where each row corresponds to a sample and each column corresponds to a feature.

Example format:

SampleID	subtype	feature1	feature2
patient1	EBV	0.32	0.6
patient2	MSI	0.41	1.5
patient3	GS	0.18	-0.3
...	...	...	...

4. Region of interest (optional)

Instead of finding mutation hotspots genome-wide, the user may restrict the hotspot analysis to certain regions of interest, such as promoters, enhancers, or UTRs, by supplying a BED file with the following columns:

Chromosome
Start position (0-based)
End position (0-based)

Example BED file:


chr1	15786447	16265287
chr1	27891466	28456878
chr1	42456878	45468785
...	...	..

There is no header row in a BED file.

Adjusting threshold of LASSO feature selection

For a more/less stringent nucleotide selection, users may choose to re-define frequency threshold by re-running step 3.2. This can be done by specifying run.to = 3.2 and the new selection threshold as the cutoff.nucleotide.new parameter. Users may also choose to select the top n features based on the mean coefficients by specifying the number of contexts to select as the top.nucleotide parameter.

MutSpot(run.to = 3.2, snv.mutations = "subset_snv_mutations_sid.MAF", region.of.interest = "gastric_ctcf_motif.bed", cutoff.nucleotide.new = 0.98, top.nucleotide = 3, cores = 2, genomic.features = "genomic_features_ctcf.txt", sample.snv.features = "sample_features_table_snv.txt")

Similarly, for a more/less stringent epigenetic features selection, users may choose to re-define frequency threshold by running step 4.2. This can be done by specifying run.to = 4.2 and the new selection thresholds for SNVs and indels as the cutoff.nucleotide.new.snv and cutoff.features.new.indel parameters respectively. Users may also choose to select the top n features based on the mean coefficients by specifying the number of features to select as the top.features parameter.

MutSpot(run.to = 4.2, snv.mutations = "subset_snv_mutations_sid.MAF", region.of.interest = "gastric_ctcf_motif.bed", cores = 2, genomic.features = "genomic_features_ctcf.txt",
sample.snv.features = "sample_features_table_snv.txt", cutoff.features.new.snv = 0.8, top.features = 11)

Output files

Hotspot summary file

MutSpot outputs a TSV file of all hotspot regions found, ordered by significance of recurrence. Overlapping hotspot regions are merged and annotations columns are added to indicate if the hotspots are located in gene promoters or UTRs. The output file has the following fields:

Chromosome
Start position (1-based)
End position (1-based)
P-value
Length of hotspot (bp)
Mean background mutation probability
Number of mutated samples
FDR
Transcripts overlapping hotspot in their promoters
Transcripts overlapping hotspot in their 3'UTRs
Transcripts overlapping hotspot in their 5'UTRs

Figures

At the end of the analysis, 3 figures will be generated by MutSpot:

Bar plot of feature importance of the background mutation model
Manhattan plot of hotspots across the genome
Distribution of mutations in the top n hotspots (default n = 3)

Power of size of sampled sites

To test the power of different sizes of sampled sites, users may choose to run an additional analysis which runs feature selection and model fitting on samples of size 20%, 40%, 60%, 80% and 100% of sampled sites. It can be run by specifying run.to = NULL and dilution.analysis = TRUE.

MutSpot(run.to = NULL, snv.mutations = "subset_snv_mutations_sid.MAF", genomic.features = "genomic_features_ctcf.txt", cores = 2, dilution.analysis = TRUE)

At the end of this analysis, 3 figures will be generated:

Line plot of McFadden's R2 for each subsample's model
Line plot of the number of nucleotide/epigenomic features selected by LASSO for each subsample
Heatmap of LASSO stability frequency of the nucleotide/epigenomic features selected for each subsample

Name		Name	Last commit message	Last commit date
Latest commit History 94 Commits
MutSpot_Rpackage		MutSpot_Rpackage
Rmarkdowns		Rmarkdowns
features		features
test-data		test-data
.DS_Store		.DS_Store
Documentation.md		Documentation.md
MutSpot_Supplement_final.pdf		MutSpot_Supplement_final.pdf
Readme.md		Readme.md

skandlab/MutSpot

Folders and files

Latest commit

History

Repository files navigation

MutSpot

Non-coding MUTation hotSPOT dectection in cancer genomes

Contents