# How to use chromatin accessibility and replication timing to find genomic regions with significantly more mutations than expected

### By: Oliver Ocsenas

In this tutorial, we train a random forest model on all of our chromatin accessibility and replication timing predictors to predict regional mutation rates. We then use the model residuals to find regions with significantly more mutations than predicted.

First, we will load in some useful packages.

In [1]:
packages = c("data.table", "randomForest")
lapply(packages, function(x) suppressMessages(require(x, character.only = TRUE)))

Next, we will load the predictor data. These are chromatin accessibility and replication timing tracks from tumor and normal tissue, averaged in megabase-scale bins.

In [2]:
CA_RT_MBscale = fread("data/All_CA_RT_1MB_scale.csv.gz")
head(CA_RT_MBscale, 3)

chr,start,ACC TCGA-OR-A5J2,ACC TCGA-OR-A5J3,ACC TCGA-OR-A5J6,ACC TCGA-OR-A5J9,ACC TCGA-OR-A5JZ,ACC TCGA-OR-A5K8,ACC TCGA-OR-A5KX,ACC TCGA-PA-A5YG,⋯,Repli seq of NHEK S1 phase,Repli seq of NHEK S2 phase,Repli seq of NHEK S3 phase,Repli seq of NHEK S4 phase,Repli seq of SK N SH G1 phase,Repli seq of SK N SH G2 phase,Repli seq of SK N SH S1 phase,Repli seq of SK N SH S2 phase,Repli seq of SK N SH S3 phase,Repli seq of SK N SH S4 phase
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
chr1,2000001,3.985625,3.3680199,4.549601,1.8892585,3.274139,3.120603,2.9564165,3.9239266,⋯,16.263372,12.02102,11.92807,8.420803,39.16145,9.017025,22.80582,13.88087,5.706045,3.0025
chr1,3000001,3.777538,2.1058428,2.733947,1.3259219,1.865876,1.7705485,1.9169945,1.8410087,⋯,20.766127,19.64354,16.83726,9.974132,25.69529,7.608355,29.66111,22.55539,9.649642,4.16149
chr1,4000001,1.75523,0.8019908,1.2479,0.4344994,0.395774,0.6742724,0.4020992,0.3389768,⋯,6.231268,15.36473,36.71549,28.632269,8.97739,8.46,14.95429,24.74951,29.078391,13.69341


We will also load in the binned mutation (SNV only) track from 18 cohorts (including pan-cancer) in the PCAWG dataset.

In [3]:
PCAWG_mutations_binned_MBscale = fread("data/PCAWG_SNVbinned_MBscale.csv.gz")
head(PCAWG_mutations_binned_MBscale)

chr,start,PANCAN,Breast-AdenoCa,Prost-AdenoCA,Kidney-RCC,Skin-Melanoma,Uterus-AdenoCA,Eso-AdenoCa,Stomach-AdenoCA,CNS-GBM,Lung-SCC,ColoRect-AdenoCA,Biliary-AdenoCA,Head-SCC,Lymph-CLL,Lung-AdenoCA,Lymph-BNHL,Liver-HCC,Thy-AdenoCA
<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
chr1,2000001,5423,403,209,267,600,141,407,207,88,362,147,57,210,34,194,224,649,16
chr1,3000001,5870,529,196,328,682,170,406,189,71,342,161,44,255,49,181,222,764,13
chr1,4000001,11489,688,339,355,1106,267,1933,488,128,587,351,105,356,80,541,331,1571,28
chr1,5000001,9773,499,259,361,1149,210,1364,375,100,516,246,111,283,80,398,319,1546,25
chr1,6000001,4779,344,161,266,398,138,339,164,55,265,120,46,217,39,187,192,705,11
chr1,7000001,5418,344,166,243,611,154,383,206,49,323,158,50,232,74,236,206,752,15


We can now train a random forest model on the combined chromatin accessibility and replication timing datasets to predict mutation rates in our cohort of interest (in this example, breast cancer) and then assess the importance of individual predictors.

First we merge the predictor set with the response vector based on genomic coordinates (first 2 columns).

In [4]:
CA_RT_breastcancermuts = merge(CA_RT_MBscale, 
									PCAWG_mutations_binned_MBscale[,c("chr", "start", "Breast-AdenoCa")], 
									by=c("chr", "start"))
head(CA_RT_breastcancermuts, 3)

chr,start,ACC TCGA-OR-A5J2,ACC TCGA-OR-A5J3,ACC TCGA-OR-A5J6,ACC TCGA-OR-A5J9,ACC TCGA-OR-A5JZ,ACC TCGA-OR-A5K8,ACC TCGA-OR-A5KX,ACC TCGA-PA-A5YG,⋯,Repli seq of NHEK S2 phase,Repli seq of NHEK S3 phase,Repli seq of NHEK S4 phase,Repli seq of SK N SH G1 phase,Repli seq of SK N SH G2 phase,Repli seq of SK N SH S1 phase,Repli seq of SK N SH S2 phase,Repli seq of SK N SH S3 phase,Repli seq of SK N SH S4 phase,Breast-AdenoCa
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
chr1,2000001,3.985625,3.3680199,4.549601,1.8892585,3.274139,3.120603,2.9564165,3.9239266,⋯,12.02102,11.92807,8.420803,39.16145,9.017025,22.80582,13.88087,5.706045,3.0025,403
chr1,3000001,3.777538,2.1058428,2.733947,1.3259219,1.865876,1.7705485,1.9169945,1.8410087,⋯,19.64354,16.83726,9.974132,25.69529,7.608355,29.66111,22.55539,9.649642,4.16149,529
chr1,4000001,1.75523,0.8019908,1.2479,0.4344994,0.395774,0.6742724,0.4020992,0.3389768,⋯,15.36473,36.71549,28.632269,8.97739,8.46,14.95429,24.74951,29.078391,13.69341,688
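
Because `merge` performs an inner join by default, windows present in only one of the two tables are dropped without warning. The following is a minimal sketch on toy data (the tables `toy_predictors` and `toy_mutations` and their values are made up for illustration) showing one way to compare row counts after merging.

```r
library(data.table)

# Toy tables (hypothetical values) keyed by genomic coordinates
toy_predictors = data.table(chr = c("chr1", "chr1", "chr2"),
                            start = c(1L, 1000001L, 1L),
                            signal = c(0.5, 1.2, 0.8))
toy_mutations = data.table(chr = c("chr1", "chr2", "chr3"),
                           start = c(1L, 1L, 1L),
                           n_mut = c(10L, 20L, 30L))

# merge() on data.tables is an inner join by default (all = FALSE)
toy_merged = merge(toy_predictors, toy_mutations, by = c("chr", "start"))

# Windows found in only one of the tables are dropped
n_dropped = nrow(toy_predictors) - nrow(toy_merged)
print(toy_merged)
print(n_dropped)
```

Applied to the tutorial's tables, the same comparison would be between `nrow(CA_RT_MBscale)` and `nrow(CA_RT_breastcancermuts)`.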


Now we train a random forest model on the combined CA and RT predictor set to predict regional mutation burden in breast cancer.

In [5]:
#Assign variables to predictors and response
predictors = CA_RT_breastcancermuts[,-c("chr", "start", "Breast-AdenoCa")]
response = CA_RT_breastcancermuts[["Breast-AdenoCa"]]

#Train randomForest model
rf = randomForest(x = predictors, y = response,
                  keep.forest = TRUE, ntree = 100,
                  do.trace = FALSE, importance = TRUE)
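
Since the model is trained with `importance = TRUE`, predictor importance can be read out with the `importance()` function from `randomForest`. The sketch below uses simulated data rather than the tutorial's tracks (the names `toy_x`, `toy_y`, and `toy_rf` are hypothetical) and ranks predictors by permutation importance.

```r
library(randomForest)

# Simulated data (hypothetical): only x1 drives the response
set.seed(1)
toy_x = data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
toy_y = 5 * toy_x$x1 + rnorm(200, sd = 0.5)

toy_rf = randomForest(x = toy_x, y = toy_y, ntree = 100, importance = TRUE)

# type = 1 returns the permutation-based importance (%IncMSE for regression)
imp = importance(toy_rf, type = 1)

# Sort predictors from most to least important
imp_sorted = imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
print(imp_sorted)
```

For the model trained above, the equivalent call would be `importance(rf, type = 1)`.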

Now we can find the windows (genomic regions) with significantly more mutations than predicted by our model. We do that by fitting a normal distribution to the model residuals and computing a one-tailed (upper-tail) p-value for each window against that distribution, as we are interested in windows with more mutations than predicted, not fewer.
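
Spelled out as a formula, using notation introduced here for illustration ($y_i$ for the observed count in window $i$, $\hat{y}_i$ for the model prediction, $\bar{r}$ and $s_r$ for the mean and standard deviation of the residuals, $\Phi$ for the standard normal CDF), and assuming the residuals are approximately normal, the test amounts to:

```latex
r_i = y_i - \hat{y}_i, \qquad
p_i = 1 - \Phi\!\left(\frac{r_i - \bar{r}}{s_r}\right)
```

A window with a large positive residual therefore receives a small p-value, while windows whose residual lies below the mean residual receive p-values above 0.5.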

In [13]:
#Define model residuals
#Note: rf$predicted holds out-of-bag predictions, so each window is
#predicted only by trees that did not see it during training
observed_mutations = response
predicted_mutations = rf$predicted
model_residuals = observed_mutations - predicted_mutations

#Define p-values using one-tailed normal distribution (only higher observed than predicted mutations get significant p-values)
p_values = pnorm(model_residuals,
                 mean = mean(model_residuals),
                 sd = sd(model_residuals),
                 lower.tail = FALSE)

#Assign p-values to genomic regions
p_val_dt = data.table(chr = CA_RT_breastcancermuts$chr,
                      start = CA_RT_breastcancermuts$start,
                      observed = observed_mutations,
                      predicted = predicted_mutations,
                      p_val = p_values)

Every megabase-scale window in our analysis now has a p-value that describes its excess of mutations in the breast cancer cohort relative to our model predictions. We can display the top 5 genomic regions with excess mutations.

In [14]:
p_val_dt[order(p_val)][1:5]

chr,start,observed,predicted,p_val
<chr>,<int>,<int>,<dbl>,<dbl>
chr1,240000001,1149,634.137,7.487711e-15
chr1,158000001,958,624.7,3.180342e-07
chr20,55000001,986,674.8869,1.667436e-06
chr1,237000001,1066,766.0458,3.686692e-06
chr10,37000001,877,578.8883,4.191983e-06
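
Because one p-value is computed per window, many tests are performed at once, so some correction for multiple testing is usually advisable before calling windows significant. The sketch below is not part of the original analysis: it applies the Benjamini-Hochberg adjustment via base R's `p.adjust` to toy p-values, and the 0.05 cutoff is an arbitrary assumption.

```r
# Toy p-values (hypothetical), one per genomic window
toy_p = c(1e-6, 0.001, 0.02, 0.04, 0.3, 0.5, 0.7, 0.9)

# Benjamini-Hochberg false discovery rate adjustment (base R stats)
toy_fdr = p.adjust(toy_p, method = "BH")

# Indices of windows passing an assumed FDR threshold of 0.05
significant = which(toy_fdr < 0.05)
print(round(toy_fdr, 4))
print(significant)
```

On the tutorial's table, the same adjustment could be stored as a new column with `p_val_dt[, fdr := p.adjust(p_val, method = "BH")]`.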


With a p-value for excess mutations in each genomic window, we can go on to investigate why these regions carry more mutations than expected.