# How to use tumor and normal chromatin accessibility to predict regional mutation rates in cancer with a random forest machine learning model

### By: Oliver Ocsenas

The primary aim of this study is to compare how well chromatin accessibility derived from tumor samples predicts regional mutation burden in cancer, relative to chromatin accessibility derived from normal tissue samples. This is a small example of the data and methods used to make that comparison.

First, we will load in some useful packages.

In [2]:
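packages = c("data.table", "randomForest")
# library() stops with an error if a package is missing, whereas require() only returns FALSE
invisible(lapply(packages, function(x) suppressMessages(library(x, character.only = TRUE))))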

Next, we will load the data. These are binned average tracks of tumor chromatin accessibility with replication timing, and of normal tissue chromatin accessibility with replication timing, both at the megabase scale.

In [5]:
TumorCA_RT_MBscale = fread("TumorCA_RT_MBscale.csv")
head(TumorCA_RT_MBscale)

chr,start,ACC TCGA-OR-A5J2,ACC TCGA-OR-A5J3,ACC TCGA-OR-A5J6,ACC TCGA-OR-A5J9,ACC TCGA-OR-A5JZ,ACC TCGA-OR-A5K8,ACC TCGA-OR-A5KX,ACC TCGA-PA-A5YG,⋯,Repli seq of NHEK S1 phase,Repli seq of NHEK S2 phase,Repli seq of NHEK S3 phase,Repli seq of NHEK S4 phase,Repli seq of SK N SH G1 phase,Repli seq of SK N SH G2 phase,Repli seq of SK N SH S1 phase,Repli seq of SK N SH S2 phase,Repli seq of SK N SH S3 phase,Repli seq of SK N SH S4 phase
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
chr1,2000001,3.985625,3.3680199,4.549601,1.8892585,3.2741389,3.120603,2.9564165,3.9239266,⋯,16.263372,12.02102,11.92807,8.420803,39.16145,9.017025,22.80582,13.88087,5.706045,3.0025
chr1,3000001,3.777538,2.1058428,2.733947,1.3259219,1.8658764,1.7705485,1.9169945,1.8410087,⋯,20.766127,19.64354,16.83726,9.974132,25.69529,7.608355,29.66111,22.55539,9.649642,4.16149
chr1,4000001,1.75523,0.8019908,1.2479,0.4344994,0.395774,0.6742724,0.4020992,0.3389768,⋯,6.231268,15.36473,36.71549,28.632269,8.97739,8.46,14.95429,24.74951,29.078391,13.693415
chr1,5000001,1.433125,1.4992569,1.512729,0.7088006,0.6542305,1.2746344,0.957788,0.4901981,⋯,10.913952,17.71098,28.37285,26.984804,17.78464,9.868122,24.5669,24.29207,15.805804,7.705024
chr1,6000001,3.49049,3.911082,4.471335,2.0340137,3.7724878,3.3055782,4.3420269,3.4266658,⋯,22.21339,13.19654,4.95561,4.436683,54.29683,13.217439,20.97317,5.955,2.915,2.624
chr1,7000001,2.303474,1.6262188,2.234739,1.419971,1.6980099,1.7905012,2.2128052,1.470963,⋯,28.685677,31.72389,11.32426,6.022872,46.08007,14.048667,21.27462,10.33011,4.709995,3.538997


In [6]:
NormalCA_RT_MBscale = fread("NormalCA_RT_MBscale.csv")
head(NormalCA_RT_MBscale)

chr,start,HG03571,GM18858,Peyer's patch,GM21390,HG02973,psoas muscle,HG03066,heart right ventricle,⋯,Repli seq of NHEK S1 phase,Repli seq of NHEK S2 phase,Repli seq of NHEK S3 phase,Repli seq of NHEK S4 phase,Repli seq of SK N SH G1 phase,Repli seq of SK N SH G2 phase,Repli seq of SK N SH S1 phase,Repli seq of SK N SH S2 phase,Repli seq of SK N SH S3 phase,Repli seq of SK N SH S4 phase
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
chr1,2000001,0.6654796,0.4730248,0.8810326,0.4790803,0.6279306,0.7947659,0.5012096,0.853548,⋯,16.263372,12.02102,11.92807,8.420803,39.16145,9.017025,22.80582,13.88087,5.706045,3.0025
chr1,3000001,0.780782,0.4687155,0.9889974,0.5117272,0.6801467,0.9333293,0.5596727,0.9684844,⋯,20.766127,19.64354,16.83726,9.974132,25.69529,7.608355,29.66111,22.55539,9.649642,4.16149
chr1,4000001,0.6079084,0.3179031,0.9148773,0.3830738,0.5007629,0.5521419,0.3997048,0.8205128,⋯,6.231268,15.36473,36.71549,28.632269,8.97739,8.46,14.95429,24.74951,29.078391,13.693415
chr1,5000001,0.6216394,0.3446595,0.922128,0.4041626,0.5323078,0.5943706,0.404326,0.8110923,⋯,10.913952,17.71098,28.37285,26.984804,17.78464,9.868122,24.5669,24.29207,15.805804,7.705024
chr1,6000001,0.7990698,0.5957793,0.931048,0.6376199,0.768206,0.9193404,0.6344817,0.9453795,⋯,22.21339,13.19654,4.95561,4.436683,54.29683,13.217439,20.97317,5.955,2.915,2.624
chr1,7000001,0.6365986,0.3929993,0.9005885,0.4073505,0.5519631,0.7726083,0.4133591,0.8589262,⋯,28.685677,31.72389,11.32426,6.022872,46.08007,14.048667,21.27462,10.33011,4.709995,3.538997


We will also load in the binned mutation (SNV only) track from 18 cohorts (including pan-cancer) in the PCAWG dataset.

In [13]:
PCAWG_mutations_binned_MBscale = fread("PCAWG_SNVbinned_MBscale.csv")
head(PCAWG_mutations_binned_MBscale)

chr,start,PANCAN,Breast-AdenoCa,Prost-AdenoCA,Kidney-RCC,Skin-Melanoma,Uterus-AdenoCA,Eso-AdenoCa,Stomach-AdenoCA,CNS-GBM,Lung-SCC,ColoRect-AdenoCA,Biliary-AdenoCA,Head-SCC,Lymph-CLL,Lung-AdenoCA,Lymph-BNHL,Liver-HCC,Thy-AdenoCA
<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
chr1,2000001,5423,403,209,267,600,141,407,207,88,362,147,57,210,34,194,224,649,16
chr1,3000001,5870,529,196,328,682,170,406,189,71,342,161,44,255,49,181,222,764,13
chr1,4000001,11489,688,339,355,1106,267,1933,488,128,587,351,105,356,80,541,331,1571,28
chr1,5000001,9773,499,259,361,1149,210,1364,375,100,516,246,111,283,80,398,319,1546,25
chr1,6000001,4779,344,161,266,398,138,339,164,55,265,120,46,217,39,187,192,705,11
chr1,7000001,5418,344,166,243,611,154,383,206,49,323,158,50,232,74,236,206,752,15
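
As a rough illustration of what "binned" means here, the sketch below counts toy SNV positions in 1 Mb windows whose start coordinates follow the same convention as the `start` column above (1, 1000001, 2000001, ...). The table `snvs`, its positions, and the column name `n_snv` are hypothetical and are not taken from PCAWG; the actual preprocessing used for the study may differ.

```r
library(data.table)

# Toy SNV calls (hypothetical positions, not PCAWG data)
snvs = data.table(
  chr = c("chr1", "chr1", "chr1", "chr1", "chr2"),
  pos = c(2000001, 2500000, 3000000, 3000001, 150)
)

bin_size = 1e6

# Assign each SNV to the window that contains it; windows start at
# 1, 1000001, 2000001, ... to match the 'start' column convention
snvs[, start := as.integer(floor((pos - 1) / bin_size) * bin_size + 1)]

# Count SNVs per (chr, start) window
snv_counts = snvs[, .(n_snv = .N), by = .(chr, start)]
print(snv_counts)
```

Subtracting 1 before dividing keeps a position such as 3000000 in the window starting at 2000001 rather than pushing it into the next one.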


We can now train a random forest model separately on the tumor chromatin accessibility dataset and on the normal tissue chromatin accessibility dataset to predict mutation rates in our cohort of interest (in this example, we will use breast cancer).

First we merge each of the predictor sets with the response vector based on genomic coordinates (first 2 columns).

In [19]:
TumorCA_RT_breastcancermuts = merge(TumorCA_RT_MBscale, 
									PCAWG_mutations_binned_MBscale[,c("chr", "start", "Breast-AdenoCa")], 
									by=c("chr", "start"))
head(TumorCA_RT_breastcancermuts, 3)

NormalCA_RT_breastcancermuts = merge(NormalCA_RT_MBscale, 
									PCAWG_mutations_binned_MBscale[,c("chr", "start", "Breast-AdenoCa")], 
									by=c("chr", "start"))
head(NormalCA_RT_breastcancermuts, 3)

chr,start,ACC TCGA-OR-A5J2,ACC TCGA-OR-A5J3,ACC TCGA-OR-A5J6,ACC TCGA-OR-A5J9,ACC TCGA-OR-A5JZ,ACC TCGA-OR-A5K8,ACC TCGA-OR-A5KX,ACC TCGA-PA-A5YG,⋯,Repli seq of NHEK S2 phase,Repli seq of NHEK S3 phase,Repli seq of NHEK S4 phase,Repli seq of SK N SH G1 phase,Repli seq of SK N SH G2 phase,Repli seq of SK N SH S1 phase,Repli seq of SK N SH S2 phase,Repli seq of SK N SH S3 phase,Repli seq of SK N SH S4 phase,Breast-AdenoCa
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
chr1,2000001,3.985625,3.3680199,4.549601,1.8892585,3.274139,3.120603,2.9564165,3.9239266,⋯,12.02102,11.92807,8.420803,39.16145,9.017025,22.80582,13.88087,5.706045,3.0025,403
chr1,3000001,3.777538,2.1058428,2.733947,1.3259219,1.865876,1.7705485,1.9169945,1.8410087,⋯,19.64354,16.83726,9.974132,25.69529,7.608355,29.66111,22.55539,9.649642,4.16149,529
chr1,4000001,1.75523,0.8019908,1.2479,0.4344994,0.395774,0.6742724,0.4020992,0.3389768,⋯,15.36473,36.71549,28.632269,8.97739,8.46,14.95429,24.74951,29.078391,13.69341,688


chr,start,HG03571,GM18858,Peyer's patch,GM21390,HG02973,psoas muscle,HG03066,heart right ventricle,⋯,Repli seq of NHEK S2 phase,Repli seq of NHEK S3 phase,Repli seq of NHEK S4 phase,Repli seq of SK N SH G1 phase,Repli seq of SK N SH G2 phase,Repli seq of SK N SH S1 phase,Repli seq of SK N SH S2 phase,Repli seq of SK N SH S3 phase,Repli seq of SK N SH S4 phase,Breast-AdenoCa
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
chr1,2000001,0.6654796,0.4730248,0.8810326,0.4790803,0.6279306,0.7947659,0.5012096,0.853548,⋯,12.02102,11.92807,8.420803,39.16145,9.017025,22.80582,13.88087,5.706045,3.0025,403
chr1,3000001,0.780782,0.4687155,0.9889974,0.5117272,0.6801467,0.9333293,0.5596727,0.9684844,⋯,19.64354,16.83726,9.974132,25.69529,7.608355,29.66111,22.55539,9.649642,4.16149,529
chr1,4000001,0.6079084,0.3179031,0.9148773,0.3830738,0.5007629,0.5521419,0.3997048,0.8205128,⋯,15.36473,36.71549,28.632269,8.97739,8.46,14.95429,24.74951,29.078391,13.69341,688
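
One thing worth checking after a merge like this is how many windows were lost, because `merge()` on data.tables is an inner join by default (`all = FALSE`): any window missing from either table is dropped. The toy tables below (`predictors`, `response`, and their values) are hypothetical and only illustrate that behaviour.

```r
library(data.table)

# Toy predictor and response tables (hypothetical values)
predictors = data.table(chr = c("chr1", "chr1", "chr1"),
                        start = c(1L, 1000001L, 2000001L),
                        track_A = c(0.5, 1.2, 3.4))
response = data.table(chr = c("chr1", "chr1", "chr2"),
                      start = c(1000001L, 2000001L, 1L),
                      muts = c(10L, 25L, 7L))

# Inner join on genomic coordinates: only windows present in both tables survive
merged = merge(predictors, response, by = c("chr", "start"))

# Number of predictor windows that had no matching response window
dropped = nrow(predictors) - nrow(merged)
print(merged)
print(dropped)
```

Comparing `nrow()` before and after the merge is a cheap way to confirm that both models are trained on the same set of windows.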


Now we train two separate random forest models, one per predictor set, to predict the same response, and then assess the accuracy of each model.

In [22]:
#Train randomForest
#Fix the random seed so that results are reproducible (the seed value is arbitrary)
set.seed(1)
RF_tumorCA_RT = randomForest(x = TumorCA_RT_breastcancermuts[,-c("chr", "start", "Breast-AdenoCa")], 
							 y = TumorCA_RT_breastcancermuts[["Breast-AdenoCa"]], 
							 keep.forest = TRUE, ntree = 100, 
							 do.trace = FALSE, importance = TRUE)

RF_normalCA_RT = randomForest(x = NormalCA_RT_breastcancermuts[,-c("chr", "start", "Breast-AdenoCa")], 
							 y = NormalCA_RT_breastcancermuts[["Breast-AdenoCa"]], 
							 keep.forest = TRUE, ntree = 100, 
							 do.trace = FALSE, importance = TRUE)
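
Because the models are trained with the importance option switched on, the fitted objects also carry per-predictor importance scores. The sketch below shows one way these could be inspected with `importance()` from the randomForest package. It uses simulated data (`toy_x`, `toy_y`, and the predictor names are hypothetical) rather than the chromatin accessibility tracks, so the ranking it prints says nothing about the real predictors.

```r
library(randomForest)

set.seed(1)
n = 200

# Simulated predictors: one informative column and two pure-noise columns
toy_x = data.frame(signal = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
toy_y = 5 * toy_x$signal + rnorm(n, sd = 0.5)

toy_rf = randomForest(x = toy_x, y = toy_y, ntree = 100, importance = TRUE)

# type = 1 requests the permutation-based measure (%IncMSE for regression)
imp = importance(toy_rf, type = 1)

# Sort predictors from most to least important
imp_sorted = imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
print(imp_sorted)
```

The same call on `RF_tumorCA_RT` or `RF_normalCA_RT` would rank the individual accessibility and replication timing tracks.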

For this example, we will use adjusted out-of-bag R2 as the accuracy metric, but cross-validation can also be used. The adjustment penalizes R2 for the number of predictors relative to the number of samples.

In [26]:
adjust_R2 = function(R2, n, k){
	#Function to convert R2 to adjusted R2
	#R2: R-squared accuracy of model
	#n: Number of samples in training set
	#k: Number of predictors in model
	
	adjusted_R2 = 1 - (1 - R2)*(n - 1)/(n - k - 1)
	
	return(adjusted_R2)
}
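
To make the adjustment concrete, here is a small worked example with made-up numbers (the R2, n, and k values are hypothetical and are not results from the models in this example). The function is redefined so that the snippet runs on its own.

```r
# Same formula as adjust_R2 above, repeated so the snippet is self-contained
adjust_R2 = function(R2, n, k){
  1 - (1 - R2)*(n - 1)/(n - k - 1)
}

# Hypothetical model: R2 = 0.80, 101 samples, 10 predictors
# 1 - 0.20 * 100 / 90, roughly 0.778
few_predictors = adjust_R2(0.80, 101, 10)

# Same R2 and sample size but 50 predictors: the penalty is larger
# 1 - 0.20 * 100 / 50 = 0.6
many_predictors = adjust_R2(0.80, 101, 50)

print(c(few_predictors, many_predictors))
```

With the same raw R2, the model that uses more predictors receives the lower adjusted score, which matters when the two predictor sets being compared differ in size.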

In [28]:
#For regression forests, $predicted holds the out-of-bag predictions for the training rows
RF_tumorCA_RT_R2 = cor(RF_tumorCA_RT$predicted, TumorCA_RT_breastcancermuts[["Breast-AdenoCa"]])^2
#Number of predictors = all columns except chr, start and the response
RF_tumorCA_RT_adjR2 = adjust_R2(RF_tumorCA_RT_R2, nrow(TumorCA_RT_breastcancermuts), ncol(TumorCA_RT_breastcancermuts) - 3)

RF_normalCA_RT_R2 = cor(RF_normalCA_RT$predicted, NormalCA_RT_breastcancermuts[["Breast-AdenoCa"]])^2
RF_normalCA_RT_adjR2 = adjust_R2(RF_normalCA_RT_R2, nrow(NormalCA_RT_breastcancermuts), ncol(NormalCA_RT_breastcancermuts) - 3)

In [29]:
RF_tumorCA_RT_adjR2
RF_normalCA_RT_adjR2
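
As mentioned earlier, cross-validation is an alternative to out-of-bag accuracy. The following is a minimal sketch of 5-fold cross-validation with randomForest on simulated data; `toy_x`, `toy_y`, `k_folds`, and the fold assignment scheme are assumptions made for illustration, not the procedure used in the study.

```r
library(randomForest)

set.seed(42)
n = 300

# Simulated predictors and response (hypothetical, for illustration only)
toy_x = data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
toy_y = 3 * toy_x$a - 2 * toy_x$b + rnorm(n)

# Randomly assign each row to one of k folds of near-equal size
k_folds = 5
fold_id = sample(rep(seq_len(k_folds), length.out = n))

# For each fold, train on the other folds and predict the held-out rows
cv_pred = numeric(n)
for (fold in seq_len(k_folds)) {
  held_out = fold_id == fold
  fit = randomForest(x = toy_x[!held_out, ], y = toy_y[!held_out], ntree = 100)
  cv_pred[held_out] = predict(fit, newdata = toy_x[held_out, ])
}

# Squared correlation between held-out predictions and observed values
cv_R2 = cor(cv_pred, toy_y)^2
print(cv_R2)
```

For genomic windows, one might prefer to hold out whole chromosomes rather than random rows, since neighbouring windows tend to be correlated; that choice is left open here.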

When we compare the two accuracies, we find that tumor chromatin accessibility and replication timing are nearly twice as good at predicting regional breast cancer mutation rates as normal chromatin accessibility and replication timing at the megabase scale.