# How to create a binned-average track from a BigWig genomic track with a window size of interest

### By: Oliver Ocsenas

When studying the association between two or more genome-wide -omics datasets, it is often useful to first average each dataset over equally sized, non-overlapping windows across the genome. We can then use simple linear models or more complex machine learning models to study the associations between the datasets at the scale of the chosen window size. In this tutorial, we will explore how to create these binned-average tracks from BigWig files of -omics datasets.

First, we load the packages we will need.

In [3]:
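packages = c("data.table", "GenomicRanges", "rtracklayer", "BSgenome.Hsapiens.UCSC.hg38")
#library() stops with an error if a package is missing, whereas require() only returns FALSE
invisible(lapply(packages, function(x) suppressPackageStartupMessages(library(x, character.only = TRUE))))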

We will load in an example BigWig file containing genome-wide ATAC-Seq values from a breast cancer sample.

In [6]:
BRCA_bigwig = import("TCGA_BRCA_ATACSeq_chr1_2.bw")
BRCA_bigwig

GRanges object with 20330312 ranges and 1 metadata column:
             seqnames            ranges strand |     score
                <Rle>         <IRanges>  <Rle> | <numeric>
         [1]     chr1            1-9999      * |   0.00000
         [2]     chr1       10000-10099      * |  17.71652
         [3]     chr1       10100-10199      * |  30.60127
         [4]     chr1       10200-10299      * |   9.66356
         [5]     chr1       10300-10399      * |   4.83178
         ...      ...               ...    ... .       ...
  [20330308]     chrY 56886116-56886715      * |   0.00000
  [20330309]     chrY 56886716-56886915      * |   1.61059
  [20330310]     chrY 56886916-56887015      * |   0.00000
  [20330311]     chrY 56887016-56887215      * |   1.61059
  [20330312]     chrY 56887216-57227415      * |   0.00000
  -------
  seqinfo: 24 sequences from an unspecified genome

The result is a GRanges object whose score column covers non-overlapping genomic regions of varying size. The score indicates the level of chromatin accessibility in each region. Suppose we now want an average score over much larger windows across the genome, say 1 megabase (1 million base pairs).

We can create a GRanges object that contains the exact ranges or windows that we're interested in.

In [21]:
window_size = 1000000

#Load in human genome hg38
genome = BSgenome.Hsapiens.UCSC.hg38

#Tile the autosomes (chromosomes 1 through 22) into windows of the specified size,
# cutting the last window of each chromosome so that it stays within the chromosome
gr.windows = tileGenome(seqinfo(genome)[paste0("chr", 1:22)],
                        tilewidth = window_size,
                        cut.last.tile.in.chrom = TRUE)

gr.windows

GRanges object with 2887 ranges and 0 metadata columns:
         seqnames            ranges strand
            <Rle>         <IRanges>  <Rle>
     [1]     chr1         1-1000000      *
     [2]     chr1   1000001-2000000      *
     [3]     chr1   2000001-3000000      *
     [4]     chr1   3000001-4000000      *
     [5]     chr1   4000001-5000000      *
     ...      ...               ...    ...
  [2883]    chr22 46000001-47000000      *
  [2884]    chr22 47000001-48000000      *
  [2885]    chr22 48000001-49000000      *
  [2886]    chr22 49000001-50000000      *
  [2887]    chr22 50000001-50818468      *
  -------
  seqinfo: 22 sequences from hg38 genome
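
To see what the tiling does without loading a full genome, here is a minimal sketch on a toy genome. The chromosome names `chrA` and `chrB`, their lengths, and the 10 bp window size are made up for illustration and are not part of the tutorial data.

```r
suppressPackageStartupMessages(library(GenomicRanges))

# Hypothetical toy genome: chrA is 25 bp long, chrB is 10 bp long
toy_seqinfo = Seqinfo(seqnames = c("chrA", "chrB"), seqlengths = c(25L, 10L))

# Tile into 10 bp windows, cutting the last window at each chromosome end
toy_windows = tileGenome(toy_seqinfo,
                         tilewidth = 10,
                         cut.last.tile.in.chrom = TRUE)

# chrA is split into 1-10, 11-20 and a shorter last window 21-25;
# chrB fits exactly one window, 1-10
toy_windows
width(toy_windows)
```

Only the last window of a chromosome can be shorter than `tilewidth`, which matches the final chr22 window in the output above.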

Next, we convert the imported BigWig ranges into a run-length encoded coverage vector per chromosome. Each run stores a score together with the number of consecutive base pairs that share it.

In [22]:
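#Get an RleList object with one coverage vector per chromosome in the BigWig.
# Because the BigWig ranges do not overlap, weighting by "score" gives each
# position the score of the range that covers it.
coverage_vector = GenomicRanges::coverage(BRCA_bigwig, weight="score")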
head(coverage_vector, 2)

RleList of length 2
$chr1
numeric-Rle of length 248956422 with 1704559 runs
  Lengths:     9999      100      100      100 ...      100      200    10000
  Values :  0.00000 17.71652 30.60127  9.66356 ...  9.66356  4.83178  0.00000

$chr2
numeric-Rle of length 242193529 with 1656216 runs
  Lengths:    10199      100      100      100 ...      100      100    10000
  Values :  0.00000  1.61059  4.83178  6.44237 ...  3.22119  4.83178  0.00000
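
The run-length encoding can be checked by hand on a small case. The sketch below uses a hypothetical 12 bp chromosome `chrA` with three made-up scored ranges. It only illustrates how `coverage()` with `weight = "score"` turns ranges into runs.

```r
suppressPackageStartupMessages(library(GenomicRanges))

# Hypothetical signal: three adjacent, non-overlapping ranges on a 12 bp chromosome
toy_signal = GRanges(seqnames = "chrA",
                     ranges = IRanges(start = c(1, 5, 8), end = c(4, 7, 10)),
                     score = c(0, 2.5, 4),
                     seqlengths = c(chrA = 12))

# Per-base score, stored as runs of identical values
toy_cov = coverage(toy_signal, weight = "score")

# Positions 11-12 are not covered by any range, so they get the value 0
runLength(toy_cov$chrA)
runValue(toy_cov$chrA)
```

The coverage vector spans the full chromosome length, so positions not covered by any range are filled with 0.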


In [27]:
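#binnedAverage() requires seqlevels(gr.windows) to be identical to names(coverage_vector),
# so give the windows the same chromosome levels as the BigWig coverage
seqlevels(gr.windows, pruning.mode="coarse") = names(coverage_vector)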
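
As a small illustration of this matching step, the sketch below uses made-up windows and coverage (`chrA`, `chrB`) and shows two ways to make the chromosome levels agree. The second option, subsetting the coverage instead of extending the windows, is an alternative and not what the tutorial does.

```r
suppressPackageStartupMessages(library(GenomicRanges))

# Hypothetical toy inputs: windows on chrA only, coverage for chrA and chrB
toy_windows = GRanges("chrA", IRanges(start = c(1, 7), end = c(6, 12)))
toy_cov = RleList(chrA = Rle(c(0, 3), c(6, 6)), chrB = Rle(1, 5))

# The levels differ at first, which binnedAverage() would reject
levels_match_before = identical(seqlevels(toy_windows), names(toy_cov))

# Option 1 (as in the tutorial): extend the windows' levels to match the coverage
windows_matched = toy_windows
seqlevels(windows_matched) = names(toy_cov)

# Option 2 (alternative): keep only the coverage of chromosomes that have windows
cov_matched = toy_cov[seqlevels(toy_windows)]

identical(seqlevels(windows_matched), names(toy_cov))
identical(seqlevels(toy_windows), names(cov_matched))
```

With option 1 the extra chromosomes become empty levels on the windows, which is why the final object reports more sequences than the windows were built from.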

Now we can compute the average chromatin accessibility score within each of our megabase-sized windows.

In [29]:
#Get binned average for each window
gr.data.new = binnedAverage(gr.windows, coverage_vector, "value")

gr.data.new

GRanges object with 2887 ranges and 1 metadata column:
         seqnames            ranges strand |     value
            <Rle>         <IRanges>  <Rle> | <numeric>
     [1]     chr1         1-1000000      * |  0.900821
     [2]     chr1   1000001-2000000      * |  5.370671
     [3]     chr1   2000001-3000000      * |  4.118927
     [4]     chr1   3000001-4000000      * |  2.915980
     [5]     chr1   4000001-5000000      * |  1.482712
     ...      ...               ...    ... .       ...
  [2883]    chr22 46000001-47000000      * |  10.23986
  [2884]    chr22 47000001-48000000      * |   2.87957
  [2885]    chr22 48000001-49000000      * |   2.62055
  [2886]    chr22 49000001-50000000      * |   4.78436
  [2887]    chr22 50000001-50818468      * |   9.19745
  -------
  seqinfo: 24 sequences from 2 genomes (hg38, NA)
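
The averaging itself can be verified on a case small enough to compute by hand. In this sketch the 12 bp chromosome `chrA`, its coverage values, and the two 6 bp bins are all hypothetical.

```r
suppressPackageStartupMessages(library(GenomicRanges))

# Hypothetical per-base signal: four 0s, four 2s, four 4s
toy_cov = RleList(chrA = Rle(c(0, 2, 4), c(4, 4, 4)))

# Two 6 bp bins covering the whole toy chromosome
toy_bins = GRanges("chrA", IRanges(start = c(1, 7), end = c(6, 12)),
                   seqlengths = c(chrA = 12))

toy_avg = binnedAverage(toy_bins, toy_cov, "value")

# Manual check: the plain mean of the per-base values inside each bin
per_base = as.numeric(toy_cov$chrA)
manual = c(mean(per_base[1:6]),   # 0,0,0,0,2,2 -> 4/6
           mean(per_base[7:12]))  # 2,2,4,4,4,4 -> 20/6

all.equal(toy_avg$value, manual)
```

Each bin's value is the per-base mean, so a run that only partly overlaps a bin contributes in proportion to the number of bases inside it.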

We now have a binned-average chromatin accessibility track at megabase resolution for the autosomes, which we can compare with other datasets binned at the same scale.
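
One possible way to carry out such a comparison is to flatten the binned tracks into a table with one row per window. The sketch below is only an illustration under stated assumptions: the four windows and the values in the `atac` and `other` columns are made up, and the correlation and linear model are just two examples of downstream analyses.

```r
suppressPackageStartupMessages({
  library(data.table)
  library(GenomicRanges)
})

# Hypothetical binned tracks: four windows with made-up values for two datasets
toy_bins = GRanges("chrA", IRanges(start = c(1, 11, 21, 31), width = 10))
toy_bins$atac = c(0.5, 2.0, 3.5, 1.0)
toy_bins$other = c(1.2, 3.9, 7.1, 2.0)

# One row per window: seqnames, start, end, width, strand, atac, other
toy_dt = as.data.table(as.data.frame(toy_bins))

# Rank correlation between the two tracks across windows
toy_cor = cor(toy_dt$atac, toy_dt$other, method = "spearman")

# Simple linear model of one track on the other
toy_fit = lm(other ~ atac, data = toy_dt)
coef(toy_fit)
```

Both tracks must be binned with the same windows, in the same order, for the rows to line up.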