# How to use tumor and normal chromatin accessibility and replication timing to predict regional mutation rates in cancer with a random forest machine learning model and assess predictor importance using permutation tests and bootstrapping

### By: Oliver Ocsenas

As in the previous tutorial ('3_CA2M_RF.ipynb'), we compare how well chromatin accessibility derived from tumor samples predicts regional mutation burden in cancer relative to chromatin accessibility derived from normal tissue samples. In this tutorial, however, we do so by assessing how much the individual predictors, derived from tumor or normal tissue, contribute to model accuracy.

First, we will load in some useful packages.

In [2]:
packages = c("data.table", "randomForest")
lapply(packages, function(x) suppressMessages(require(x, character.only = TRUE)))

Next, we will load in the appropriate data. The dataset contains chromatin accessibility tracks from tumor and normal tissue, together with replication timing tracks, each averaged in megabase-scale (1 Mb) bins.

In [5]:
CA_RT_MBscale = fread("data/All_CA_RT_1MB_scale.csv.gz")
head(CA_RT_MBscale, 3)

chr,start,ACC TCGA-OR-A5J2,ACC TCGA-OR-A5J3,ACC TCGA-OR-A5J6,ACC TCGA-OR-A5J9,ACC TCGA-OR-A5JZ,ACC TCGA-OR-A5K8,ACC TCGA-OR-A5KX,ACC TCGA-PA-A5YG,⋯,Repli seq of NHEK S1 phase,Repli seq of NHEK S2 phase,Repli seq of NHEK S3 phase,Repli seq of NHEK S4 phase,Repli seq of SK N SH G1 phase,Repli seq of SK N SH G2 phase,Repli seq of SK N SH S1 phase,Repli seq of SK N SH S2 phase,Repli seq of SK N SH S3 phase,Repli seq of SK N SH S4 phase
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
chr1,2000001,3.985625,3.3680199,4.549601,1.8892585,3.274139,3.120603,2.9564165,3.9239266,⋯,16.263372,12.02102,11.92807,8.420803,39.16145,9.017025,22.80582,13.88087,5.706045,3.0025
chr1,3000001,3.777538,2.1058428,2.733947,1.3259219,1.865876,1.7705485,1.9169945,1.8410087,⋯,20.766127,19.64354,16.83726,9.974132,25.69529,7.608355,29.66111,22.55539,9.649642,4.16149
chr1,4000001,1.75523,0.8019908,1.2479,0.4344994,0.395774,0.6742724,0.4020992,0.3389768,⋯,6.231268,15.36473,36.71549,28.632269,8.97739,8.46,14.95429,24.74951,29.078391,13.69341


We will also load in the binned mutation (SNV only) track from 18 cohorts (including pan-cancer) in the PCAWG dataset.

In [7]:
PCAWG_mutations_binned_MBscale = fread("data/PCAWG_SNVbinned_MBscale.csv.gz")
head(PCAWG_mutations_binned_MBscale)

chr,start,PANCAN,Breast-AdenoCa,Prost-AdenoCA,Kidney-RCC,Skin-Melanoma,Uterus-AdenoCA,Eso-AdenoCa,Stomach-AdenoCA,CNS-GBM,Lung-SCC,ColoRect-AdenoCA,Biliary-AdenoCA,Head-SCC,Lymph-CLL,Lung-AdenoCA,Lymph-BNHL,Liver-HCC,Thy-AdenoCA
<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
chr1,2000001,5423,403,209,267,600,141,407,207,88,362,147,57,210,34,194,224,649,16
chr1,3000001,5870,529,196,328,682,170,406,189,71,342,161,44,255,49,181,222,764,13
chr1,4000001,11489,688,339,355,1106,267,1933,488,128,587,351,105,356,80,541,331,1571,28
chr1,5000001,9773,499,259,361,1149,210,1364,375,100,516,246,111,283,80,398,319,1546,25
chr1,6000001,4779,344,161,266,398,138,339,164,55,265,120,46,217,39,187,192,705,11
chr1,7000001,5418,344,166,243,611,154,383,206,49,323,158,50,232,74,236,206,752,15


We can now train a random forest model on the combined chromatin accessibility and replication timing dataset to predict mutation rates in our cohort of interest (in this example, breast cancer) and then assess individual predictor importance.

First we merge the predictor set with the response vector based on genomic coordinates (first 2 columns).

In [8]:
CA_RT_breastcancermuts = merge(CA_RT_MBscale, 
									PCAWG_mutations_binned_MBscale[,c("chr", "start", "Breast-AdenoCa")], 
									by=c("chr", "start"))
head(CA_RT_breastcancermuts, 3)

chr,start,ACC TCGA-OR-A5J2,ACC TCGA-OR-A5J3,ACC TCGA-OR-A5J6,ACC TCGA-OR-A5J9,ACC TCGA-OR-A5JZ,ACC TCGA-OR-A5K8,ACC TCGA-OR-A5KX,ACC TCGA-PA-A5YG,⋯,Repli seq of NHEK S2 phase,Repli seq of NHEK S3 phase,Repli seq of NHEK S4 phase,Repli seq of SK N SH G1 phase,Repli seq of SK N SH G2 phase,Repli seq of SK N SH S1 phase,Repli seq of SK N SH S2 phase,Repli seq of SK N SH S3 phase,Repli seq of SK N SH S4 phase,Breast-AdenoCa
<chr>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
chr1,2000001,3.985625,3.3680199,4.549601,1.8892585,3.274139,3.120603,2.9564165,3.9239266,⋯,12.02102,11.92807,8.420803,39.16145,9.017025,22.80582,13.88087,5.706045,3.0025,403
chr1,3000001,3.777538,2.1058428,2.733947,1.3259219,1.865876,1.7705485,1.9169945,1.8410087,⋯,19.64354,16.83726,9.974132,25.69529,7.608355,29.66111,22.55539,9.649642,4.16149,529
chr1,4000001,1.75523,0.8019908,1.2479,0.4344994,0.395774,0.6742724,0.4020992,0.3389768,⋯,15.36473,36.71549,28.632269,8.97739,8.46,14.95429,24.74951,29.078391,13.69341,688
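
The merge above is an inner join on `chr` and `start`, so any bin present in only one of the two tables is dropped. The following is a minimal sketch of that behaviour on made-up toy tables. The names `toy_predictors` and `toy_mutations` and their values are purely illustrative, and base `data.frame`s are used so that the example runs without extra packages; `merge` on `data.table` objects uses the same default.

```r
# Toy predictor table: three 1 Mb bins on chr1 (values are made up)
toy_predictors = data.frame(chr = c("chr1", "chr1", "chr1"),
                            start = c(2000001, 3000001, 4000001),
                            track_A = c(3.9, 3.7, 1.7))

# Toy mutation table: the bin at 4000001 is missing, the bin at 5000001 is extra
toy_mutations = data.frame(chr = c("chr1", "chr1", "chr1"),
                           start = c(2000001, 3000001, 5000001),
                           n_mut = c(403, 529, 499))

# Inner join on genomic coordinates: only bins present in both tables survive
toy_merged = merge(toy_predictors, toy_mutations, by = c("chr", "start"))
print(toy_merged)

# Compare row counts to see how many bins were lost in the merge
print(c(predictors = nrow(toy_predictors), merged = nrow(toy_merged)))
```

Comparing `nrow(CA_RT_MBscale)` with `nrow(CA_RT_breastcancermuts)` in the same way would show whether any bins were dropped from the real data.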


Now we train a random forest model on the combined CA and RT predictor set to predict regional mutation burden in breast cancer, and then we assess the importance of each CA/RT predictor using permutation-based importance.

In [18]:
#Assign variables to predictors and response
predictors = CA_RT_breastcancermuts[,-c("chr", "start", "Breast-AdenoCa")]
response = CA_RT_breastcancermuts[["Breast-AdenoCa"]]

#Train randomForest model
rf = randomForest(x = predictors, y = response, 
							 keep.forest = T, ntree = 100, 
							 do.trace = F, importance = T)

#Get permutation-based predictor importances (unscaled increase in MSE: type = 1, scale = F)
importances = as.numeric(importance(rf, type = 1, scale = F))

#Print the top 5 predictors in terms of importance
colnames(predictors)[order(importances, decreasing = T)][1:5]

Here we can see that 4 of the top 5 predictors come from primary breast cancer chromatin accessibility tracks.
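
To make the importance measure more concrete, here is a rough sketch of the idea behind permutation-based importance: shuffle one predictor at a time and record how much the prediction error increases. This is an illustration under simplifying assumptions, not what `randomForest` does internally: it uses a linear model on simulated data and in-sample error, whereas `randomForest` computes the increase in MSE on the out-of-bag samples of each tree. All names (`toy`, `x_signal`, `x_noise`) are hypothetical.

```r
set.seed(1)

# Simulated data: the response depends on x_signal but not on x_noise
n = 200
toy = data.frame(x_signal = rnorm(n), x_noise = rnorm(n))
toy$y = 3 * toy$x_signal + rnorm(n, sd = 0.5)

# Stand-in model (assumption: a linear model instead of a random forest)
fit = lm(y ~ x_signal + x_noise, data = toy)

# Mean squared error of the fitted model on a given data set
mse = function(data) mean((data$y - predict(fit, newdata = data))^2)
baseline_mse = mse(toy)

# Shuffle one predictor at a time and record the increase in MSE
inc_mse = sapply(c("x_signal", "x_noise"), function(p){
    shuffled = toy
    shuffled[[p]] = sample(shuffled[[p]])
    mse(shuffled) - baseline_mse
})
print(inc_mse)
```

The informative predictor should show a large increase in error when shuffled, while the uninformative one should stay close to zero.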

Next, we can use a permutation test to create a null distribution for the importance of each predictor, so that we can derive a p-value for the importance values we just calculated. In each iteration of the permutation test, we scramble the response without replacement, which removes the associations between the predictors and the response, and then measure the importance of each predictor. This creates a null distribution of importances for each predictor that is used as a reference when calculating p-values.
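
Before running the full test, the effect of scrambling can be seen on a small simulated example. The sketch below (with hypothetical variables `x`, `y`, and `y_scrambled`) shows that `sample()` without further arguments keeps exactly the same response values but breaks their association with the predictor.

```r
set.seed(7)

# Simulated predictor and a response that depends on it
x = rnorm(500)
y = 2 * x + rnorm(500)

# Permute the response without replacement (the default behaviour of sample())
y_scrambled = sample(y)

# The values are unchanged, only their order differs
same_values = identical(sort(y), sort(y_scrambled))

# The correlation with the predictor is strong before and near zero after scrambling
cor_original = cor(x, y)
cor_scrambled = cor(x, y_scrambled)
print(c(original = cor_original, scrambled = cor_scrambled))
```

Any importance that a model assigns to a predictor after scrambling therefore reflects chance alone, which is what makes it usable as a null reference.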

In [22]:
#Function to run iterations of the permutation test
get_permutation_importance = function(seed){

    #Set the random seed for the permutation
    set.seed(seed)

    #Randomly permute response without replacement
    response_scrambled = sample(as.numeric(response))

    #Train randomForest to predict scrambled output
    #For this tutorial we will use 10 trees to speed the process up but any number can be used
    rf = randomForest(x = predictors, y = response_scrambled, keep.forest = T,
                      ntree = 10, do.trace = F, importance = T)

    #Return the unscaled permutation importance (increase in MSE) of each predictor
    return(as.numeric(importance(rf, type = 1, scale = F)))
}

#For this tutorial, we will run 20 iterations but this can be considerably more in practice
RF_permutatation_results = as.data.table(do.call("rbind.data.frame", 
								  lapply(1:20, get_permutation_importance)))
colnames(RF_permutatation_results) = colnames(predictors)

Here each row of the results table represents one iteration of the permutation test and each column represents a predictor. We now compare the real importances to this null distribution to calculate a p-value for each predictor: the fraction of null importances that are greater than (more important than) the real importance value. The smaller this fraction, the more significant the result.

In [24]:
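#p-value = fraction of null importances greater than the real importance
predictor_pvals = unlist(lapply(1:ncol(predictors), 
									function(x)  sum(importances[x] < RF_permutatation_results[[x]])/nrow(RF_permutatation_results)))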
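
As a small worked example of this calculation, the sketch below uses a made-up null distribution of five values. It also shows a commonly used alternative estimator, (b + 1)/(m + 1), where b is the number of null values at least as large as the observed one and m is the number of permutations. This variant is not used in this tutorial, but it avoids reporting a p-value of exactly zero. The names `null_toy` and `observed_toy` are illustrative.

```r
# Made-up null importances from five permutations and one observed importance
null_toy = c(0.1, -0.2, 0.05, 0.3, -0.1)
observed_toy = 0.5

# Fraction of null values greater than the observed value (as in the tutorial)
raw_pval = mean(null_toy > observed_toy)

# Alternative estimator with a +1 correction (assumption: not part of the tutorial)
adj_pval = (sum(null_toy >= observed_toy) + 1) / (length(null_toy) + 1)

print(c(raw = raw_pval, adjusted = adj_pval))
```

Here no null value exceeds the observed one, so the raw fraction is 0 while the corrected estimate is 1/6, which is the smallest value that five permutations can support.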

As we have only done 20 permutations, the p-value analysis is under-powered (the smallest non-zero p-value it can resolve is 1/20 = 0.05), but we will display the significant predictors for the sake of the tutorial.

In [27]:
colnames(predictors)[which(predictor_pvals == 0)]

The significant predictors are mostly derived from primary breast cancer chromatin accessibility as expected.

Finally, we can use bootstrapping (resampling the genomic bins, i.e. the rows of both the predictors and the response, with replacement) to build a distribution of real importance values and understand how much each importance value varies.

In [29]:
#Function to run iterations of the bootstrap
get_bootstrap_importance = function(seed){

    #Set the random seed for the bootstrap
    set.seed(seed)

    #Bootstrap sample predictors and response of RF model (same rows for both)
    bootstrap_sample = sample(nrow(predictors), replace = T)
    predictors_bootstrap = predictors[bootstrap_sample]
    response_bootstrap = response[bootstrap_sample]

    #Train randomForest
    rf = randomForest(x = predictors_bootstrap,
                      y = response_bootstrap,
                      keep.forest = T,
                      ntree = 10,
                      do.trace = F,
                      importance = T)

    #Return the unscaled permutation importance (increase in MSE) of each predictor
    return(as.numeric(importance(rf, type = 1, scale = F)))
}

#For this tutorial, we will run 20 iterations but this can be considerably more in practice
RF_bootstrap_results = as.data.table(do.call("rbind.data.frame", 
								  lapply(1:20, get_bootstrap_importance)))
colnames(RF_bootstrap_results) = colnames(predictors)

Let us look at the predictor with the highest mean importance value from the bootstrap distribution and display its standard deviation and corresponding p-value.

In [35]:
mean_importances = apply(RF_bootstrap_results, 2, mean)
sd_importances = apply(RF_bootstrap_results, 2, sd)

print(paste("Most important predictor is:", colnames(predictors)[which.max(mean_importances)]))
print(paste("Its mean importance is:", mean_importances[which.max(mean_importances)]))
print(paste("The standard deviation of its importance is:", sd_importances[which.max(mean_importances)]))
print(paste("The p-value of its importance is:", predictor_pvals[which.max(mean_importances)]))

[1] "Most important predictor is: BRCA TCGA-BH-A0B1"
[1] "Its mean importance is: 4572.31522408921"
[1] "The standard deviation of its importance is: 1288.19169213172"
[1] "The p-value of its importance is: 0"
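
Beyond the mean and standard deviation, a bootstrap distribution can also be summarised with percentile intervals. The following is a minimal sketch on simulated importance values. The table `toy_bootstrap` and its numbers are made up, and the percentile interval is only one of several possible bootstrap intervals.

```r
set.seed(42)

# Simulated bootstrap importances: 20 iterations (rows) for two predictors (columns)
toy_bootstrap = data.frame(predictor_A = rnorm(20, mean = 4500, sd = 1300),
                           predictor_B = rnorm(20, mean = 200, sd = 150))

# 2.5%, 50% and 97.5% quantiles per predictor (one row per predictor after transposing)
ci = t(apply(toy_bootstrap, 2, quantile, probs = c(0.025, 0.5, 0.975)))
print(round(ci, 1))
```

Replacing `toy_bootstrap` with `RF_bootstrap_results` should give one interval per predictor, although with only 20 bootstrap iterations the tails of the distribution are estimated poorly.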


Now we have an observed importance, a permutation p-value, and a bootstrap distribution of importance values for every predictor in this analysis.