# How to create a summed mutation track from a MAF mutation file

### By: Oliver Ocsenas

When studying the association between genome-wide regional mutation burden (SNVs) and other genome-wide omics datasets, it is often useful to first count the total number of mutations in equally sized, non-overlapping windows across the genome. We can then use simple linear models or more complex machine learning models to study the associations between our datasets at the scale of the chosen window size. In this tutorial, we will explore how to create these summed mutation files from whole-genome SNV datasets.

First we will load in some basic packages that will be useful.

In [1]:
packages = c("data.table", "GenomicRanges", "rtracklayer", "BSgenome.Hsapiens.UCSC.hg38")
# Attach each package quietly; library() stops with an error if a package is missing
invisible(lapply(packages, function(x) suppressMessages(library(x, character.only = TRUE))))

We will first load in an example MAF file containing some mutations from a breast cancer cohort from PCAWG.

In [5]:
breast_cancer_MAF = fread("PCAWG_breastcancer_SNV.MAF")
head(breast_cancer_MAF)

| Chromosome | Start_position | End_position | Project_Code |
|---|---|---|---|
| `<chr>` | `<int>` | `<int>` | `<chr>` |
| chr1 | 918404 | 918404 | Breast-AdenoCa |
| chr1 | 918741 | 918741 | Breast-AdenoCa |
| chr1 | 919337 | 919337 | Breast-AdenoCa |
| chr1 | 1304193 | 1304193 | Breast-AdenoCa |
| chr1 | 1820141 | 1820141 | Breast-AdenoCa |
| chr1 | 1925516 | 1925516 | Breast-AdenoCa |


We can see that we have a table of genomic positions for single nucleotide variants from a breast cancer cohort. To find out whether certain regions of the genome carry more mutations than others, we will count the mutations falling in each of our equally sized, non-overlapping windows.

We can create a GRanges object that contains the exact ranges or windows that we're interested in.

In [6]:
window_size = 1000000

#Load in human genome hg38
genome = BSgenome.Hsapiens.UCSC.hg38

#Convert autosomal genome (chromosomes 1 through 22) into windows of specified size
# and cut last window in each chromosome to stay within chromosome
gr.windows = tileGenome(seqinfo(genome)[paste0("chr", 1:22)],
                        tilewidth = window_size,
                        cut.last.tile.in.chrom = TRUE)

gr.windows

GRanges object with 2887 ranges and 0 metadata columns:
         seqnames            ranges strand
            <Rle>         <IRanges>  <Rle>
     [1]     chr1         1-1000000      *
     [2]     chr1   1000001-2000000      *
     [3]     chr1   2000001-3000000      *
     [4]     chr1   3000001-4000000      *
     [5]     chr1   4000001-5000000      *
     ...      ...               ...    ...
  [2883]    chr22 46000001-47000000      *
  [2884]    chr22 47000001-48000000      *
  [2885]    chr22 48000001-49000000      *
  [2886]    chr22 49000001-50000000      *
  [2887]    chr22 50000001-50818468      *
  -------
  seqinfo: 22 sequences from hg38 genome
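As a quick sanity check on the tiling, the arithmetic behind the windows can be reproduced by hand. The sketch below is an illustration rather than part of the pipeline: it uses only base R, takes the chr22 length from the end coordinate of the last window printed above, and the example `position` is a hypothetical value chosen for demonstration.

```r
window_size = 1000000

# chr22 length in hg38, read off the end of the last chr22 window above
chr22_length = 50818468

# Number of windows needed to cover the chromosome (the last one is truncated)
n_windows = ceiling(chr22_length / window_size)

# Width of the truncated final window
last_width = chr22_length - (n_windows - 1) * window_size

# 1-based index, within chr22, of the window containing a given position
# (47500000 is a hypothetical position used only for illustration)
position = 47500000
window_index = (position - 1) %/% window_size + 1

n_windows      # 51
last_width     # 818468
window_index   # 48
```

The 48th chr22 window spans 47000001-48000000, which matches range [2884] in the output above.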

Next we can convert our initial MAF to a GRanges object as well.

In [8]:
#Convert mutations file into GRanges object
MAF.gr = GRanges(breast_cancer_MAF$Chromosome,
                 IRanges(breast_cancer_MAF$Start_position, breast_cancer_MAF$End_position))
MAF.gr

GRanges object with 1396733 ranges and 0 metadata columns:
            seqnames    ranges strand
               <Rle> <IRanges>  <Rle>
        [1]     chr1    918404      *
        [2]     chr1    918741      *
        [3]     chr1    919337      *
        [4]     chr1   1304193      *
        [5]     chr1   1820141      *
        ...      ...       ...    ...
  [1396729]     chrX 154124510      *
  [1396730]     chrX 154482329      *
  [1396731]     chrX 155614529      *
  [1396732]     chrX 155772375      *
  [1396733]     chrX 155971989      *
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths
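Note that the mutation object lists 24 sequences and includes chrX positions, while the windows cover only chromosomes 1 through 22, so mutations on the sex chromosomes will not be counted in any window. If you want the window totals to add up to the number of mutations you started with, one option is to filter the table to autosomes first. The following is a minimal base R sketch on a toy table; `toy_maf` and its four rows are assumptions for illustration, with positions borrowed from the outputs shown in this tutorial.

```r
# Toy mutation table; positions are taken from the outputs shown in this tutorial
toy_maf = data.frame(
  Chromosome = c("chr1", "chr1", "chrX", "chrX"),
  Start_position = c(918404, 1304193, 154124510, 155971989),
  End_position = c(918404, 1304193, 154124510, 155971989)
)

autosomes = paste0("chr", 1:22)

# How many mutations fall outside the autosomal windows?
n_dropped = sum(!toy_maf$Chromosome %in% autosomes)

# Keep only autosomal mutations so downstream totals are easy to reconcile
toy_maf_autosomal = toy_maf[toy_maf$Chromosome %in% autosomes, ]

n_dropped                 # 2
nrow(toy_maf_autosomal)   # 2
```

The same logical filter could be applied to `breast_cancer_MAF` before it is converted to a GRanges object.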

And we can get the sum of mutations overlapping each of our windows.

In [9]:
#Get number of counts of mutations overlapping each genomic window
mutation_counts = as.numeric(table(factor(queryHits(findOverlaps(gr.windows, MAF.gr)),
                                          levels = seq_along(gr.windows))))
head(mutation_counts)
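To see what this counting step does, it can help to run it on a tiny example. The sketch below builds three hypothetical 1 Mb windows on chr1 and uses the six chr1 mutation positions from the `head()` output of the MAF file. It also shows `countOverlaps()` from GenomicRanges, which should give the same per-window counts more directly; it is offered here as an alternative, not as the method used in this tutorial.

```r
library(GenomicRanges)

# Toy windows: three 1 Mb tiles on chr1 (hypothetical, for illustration only)
toy_windows = GRanges("chr1", IRanges(start = c(1, 1000001, 2000001), width = 1000000))

# Toy mutations: the six chr1 positions shown in the MAF head() output
toy_positions = c(918404, 918741, 919337, 1304193, 1820141, 1925516)
toy_mutations = GRanges("chr1", IRanges(toy_positions, toy_positions))

# Approach used in this tutorial: tabulate the window index of every overlap,
# using factor levels so that windows with zero mutations are kept
counts_table = as.numeric(table(factor(queryHits(findOverlaps(toy_windows, toy_mutations)),
                                       levels = seq_along(toy_windows))))

# Alternative: countOverlaps() returns one count per window directly
counts_direct = countOverlaps(toy_windows, toy_mutations)

counts_table    # 3 3 0
counts_direct   # 3 3 0
```

The third window has no mutations and still appears with a count of zero in both versions, which is why the `factor(..., levels = ...)` step matters in the tabulation approach.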

Finally, we create a data table combining the genomic positions of the windows with the number of overlapping mutations. This table is the summed mutation track representing the regional mutational burden from our MAF file.

In [10]:
#Convert to data table with chromosome and start base pairs of window in addition to the mutation counts
Mutation_dt = data.table(chr = as.character(seqnames(gr.windows)),
                         start = start(gr.windows),
                         Mut_counts = mutation_counts)
head(Mutation_dt)

| chr | start | Mut_counts |
|---|---|---|
| `<chr>` | `<int>` | `<dbl>` |
| chr1 | 1 | 76 |
| chr1 | 1000001 | 397 |
| chr1 | 2000001 | 403 |
| chr1 | 3000001 | 529 |
| chr1 | 4000001 | 688 |
| chr1 | 5000001 | 499 |
