# Pre-Processing
1) Clean data by removing rows with > 50% missing info <br>
2) Find the most informative features <br>
3) Split the data into test and train


In [1]:
library(caret)

Loading required package: lattice

Loading required package: ggplot2



## Load data
The first column is expected to be sample ID <br>
The second column is expected to be response

In [2]:
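setwd("/home/jp/ICP_Responders/FinalTables")
# Read the table; ".." marks missing values in the file
expression <- read.csv("Final_table_response_and_expression.csv", na.strings = '..', stringsAsFactors = FALSE)
# read.csv prefixes numeric column names (gene IDs) with "X"; strip it
names(expression) <- sub("^X", "", names(expression))
# Literal "NA" strings are also missing values
expression[ expression == "NA" ] <- NA
# Convert the response and all expression columns to numeric
expression[,2:ncol(expression)] <- lapply(expression[, 2:ncol(expression)], as.numeric)
expression[50:60, 1:10]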

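| | Patient | Response | 3920 | 345611 | 3929 | 54210 | 3716 | 10454 | 3557 | 3556 |
|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 50 | 31 | 1 | NA | NA | NA | NA | NA | NA | NA | NA |
| 51 | 32 | 1 | NA | NA | NA | NA | NA | NA | NA | NA |
| 52 | 33 | 1 | NA | NA | NA | NA | NA | NA | NA | NA |
| 53 | 34 | 1 | -0.24227274 | 0.074304976 | -0.3511685 | 0.54428208 | -0.15006954 | -0.07183968 | 0.2002265 | -1.182660036 |
| 54 | 35 | 1 | -0.48479782 | 0.159405465 | 0.4199815 | -2.2288372 | 0.24650857 | -0.3890413 | 1.82423349 | -1.343065827 |
| 55 | 36 | 1 | 0.7408549 | 0.003467014 | 0.3333496 | -0.03988917 | -0.21346769 | 0.54476821 | -1.69969237 | 0.483498059 |
| 56 | 37 | 1 | NA | NA | NA | NA | NA | NA | NA | NA |
| 57 | 38 | 1 | 0.81259519 | -0.048404344 | -0.2250873 | -0.29189077 | 0.44538081 | 0.21789661 | -0.08321449 | 1.567718745 |
| 58 | 39 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| 59 | 4 | 0 | 0.14400922 | -0.067130938 | 0.1120551 | -0.96429947 | 0.10440116 | -0.74711659 | 0.30591242 | -1.254576697 |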


## Clean data

In [3]:
# Flag a vector (row or column) whose percentage of NA values exceeds cutOff
countNA <- function(x, cutOff = 50) {
  perc <- sum(is.na(x)) * 100 / length(x)
  perc > cutOff
}
# Check which columns have > 50% NA values
col_nas <- apply(expression, 2, countNA, cutOff = 50)
cat("Columns with NAs > 50% = ", sum(col_nas), "\n")
# No column has > 50% NAs

# Check which rows have > 50% NA values
row_nas <- apply(expression, 1, countNA, cutOff = 50)
cat("Rows with NAs > 50% = ", sum(row_nas), "\n")

# 43 rows have > 50% NAs; remove them.
# Logical indexing is used because -which(...) would drop every row if nothing were flagged.
expr_filtered <- expression[!row_nas, ]

cat("Dimensions of the filtered dataset = ", dim(expr_filtered))

Columns with NAs > 50% =  0 
Rows with NAs > 50% =  43 
Dimensions of the filtered dataset =  161 676
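
The same row filter can also be written without `apply`, using the fraction of missing values per row. This is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the expression table); as in the cell above, the ID column counts towards the row length.

```r
# Toy data frame: 4 samples, 1 ID column and 4 numeric features (hypothetical values)
toy <- data.frame(
  Patient = c("p1", "p2", "p3", "p4"),
  g1 = c(0.1, NA, 0.3, NA),
  g2 = c(1.2, NA, 0.8, 0.5),
  g3 = c(NA, NA, 0.4, 0.9),
  g4 = c(0.7, 0.2, 0.6, NA),
  stringsAsFactors = FALSE
)

# Fraction of missing values per row and per column
row_na_frac <- rowMeans(is.na(toy))
col_na_frac <- colMeans(is.na(toy))

# Keep rows with at most 50% missing values
toy_filtered <- toy[row_na_frac <= 0.5, ]
print(row_na_frac)
print(dim(toy_filtered))
```

Only the second toy row (3 of 5 values missing) is dropped. Working with fractions also makes it easy to inspect the distribution of missingness before choosing a cutoff.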

## Look for near zero variance and remove those columns

In [4]:
nzv <- nearZeroVar(expr_filtered[3:ncol(expr_filtered)], saveMetrics= TRUE)
# List zero-variance features (the nzv column holds the near-zero-variance flag)
nzv[which(nzv$zeroVar %in% TRUE), ]

# No feature was flagged here, so all features were retained

freqRatio,percentUnique,zeroVar,nzv
<dbl>,<dbl>,<lgl>,<lgl>


## Look for correlation and remove highly correlated columns

In [5]:
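# Find features that are highly correlated (absolute pairwise correlation > 0.75)
tmp <- expr_filtered
tmp[is.na(tmp)] <- 0  # NAs are set to 0 only for computing the correlation matrix
expr_corr <- cor(tmp[, 3:ncol(tmp)])
highlyCorrelated <- findCorrelation(expr_corr, cutoff=0.75, names=TRUE, verbose=TRUE)
# Logical indexing keeps all columns when nothing is flagged, unlike -which(...)
expr_rmcorr <- expr_filtered[, !(colnames(expr_filtered) %in% highlyCorrelated)]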


 Combination row 18 and column 19 is above the cut-off, value = 0.797 
 	 Flagging column 18 
 Combination row 18 and column 20 is above the cut-off, value = 0.755 
 	 Flagging column 18 
 Combination row 20 and column 22 is above the cut-off, value = 0.832 
 	 Flagging column 22 
 Combination row 20 and column 23 is above the cut-off, value = 0.801 
 	 Flagging column 23 
 Combination row 22 and column 23 is above the cut-off, value = 0.954 
 	 Flagging column 22 
 Combination row 18 and column 24 is above the cut-off, value = 0.768 
 	 Flagging column 18 
 Combination row 19 and column 24 is above the cut-off, value = 0.838 
 	 Flagging column 24 
 Combination row 19 and column 25 is above the cut-off, value = 0.759 
 	 Flagging column 25 
 Combination row 24 and column 25 is above the cut-off, value = 0.929 
 	 Flagging column 25 
 Combination row 16 and column 26 is above the cut-off, value = 0.856 
 	 Flagging column 16 
 Combination row 16 and column 31 is above the cut-off, val
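
The log above reports pairs whose correlation exceeds the cutoff and which column of each pair is flagged. The idea can be sketched in base R on toy data. The helper `drop_correlated` is hypothetical and simplified; it is not caret's exact algorithm, only the same general rule of dropping the member of a correlated pair that is, on average, more correlated with everything else.

```r
set.seed(42)
n <- 100
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.1),   # nearly a copy of a
  c = rnorm(n),                 # independent
  d = rnorm(n)                  # independent
)

abs_cor <- abs(cor(toy))
diag(abs_cor) <- 0

# Hypothetical helper: take the most correlated remaining pair and drop the column
# with the larger mean absolute correlation, until no pair exceeds the cutoff
drop_correlated <- function(m, cutoff = 0.75) {
  dropped <- character(0)
  while (ncol(m) > 1 && max(m) > cutoff) {
    pair <- which(m == max(m), arr.ind = TRUE)[1, ]
    worse <- pair[which.max(colMeans(m)[pair])]
    dropped <- c(dropped, colnames(m)[worse])
    m <- m[-worse, -worse, drop = FALSE]
  }
  dropped
}

dropped <- drop_correlated(abs_cor)
kept <- setdiff(colnames(toy), dropped)
print(dropped)
print(kept)
```

Only one of `a` and `b` is removed, so the information they share stays in the data set.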

## Look for linear dependencies 
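The function `findLinearCombos` uses the QR decomposition of a matrix to enumerate sets of linear combinations (if they exist) and suggests columns to remove to resolve them.
QR decomposition factors a matrix A into a product A = QR of an orthogonal matrix Q and an upper triangular matrix R. When some columns of A are linear combinations of others, R is rank-deficient, and this is what exposes the dependent columns. Because the table at this point has more feature columns (493) than samples (161), some linear dependencies are guaranteed.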
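
A small base R illustration of how a QR decomposition reveals a linear dependency. This sketch uses `qr()` from base R on made-up data; it shows the principle only and is not the code inside `findLinearCombos`.

```r
# Toy design matrix: the third column is the sum of the first two (hypothetical data)
set.seed(7)
x1 <- rnorm(10)
x2 <- rnorm(10)
X <- cbind(x1 = x1, x2 = x2, x3 = x1 + x2, x4 = rnorm(10))

decomp <- qr(X)
print(decomp$rank)   # number of linearly independent columns
print(ncol(X))       # total number of columns

# With column pivoting, columns judged dependent are moved to the end of the pivot order
dependent <- colnames(X)[decomp$pivot[seq_len(ncol(X)) > decomp$rank]]
print(dependent)
```

A rank smaller than the number of columns signals that at least one column can be removed without losing information.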




In [6]:
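tmp <- expr_rmcorr[, 3:ncol(expr_rmcorr)]
tmp[is.na(tmp)] <- 0  # NAs are set to 0 only for detecting linear combinations
comboInfo <- findLinearCombos(tmp)
# Index the names directly so this also works when zero or one column is flagged
rmLnCmb <- colnames(tmp)[comboInfo$remove]
expr_rmLnCmb <- expr_rmcorr[, !(colnames(expr_rmcorr) %in% rmLnCmb)]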
cat("Started with dimension = ", dim(expression), "\n")
cat("Post 50% NA filtering in rows and columns the dimension is", dim(expr_filtered), "\n")
cat("Post filtering highly correlated columns the dimension is", dim(expr_rmcorr), "\n")
cat("Post removing linearly dependent columns the dimension is", dim(expr_rmLnCmb))

Started with dimension =  204 676 
Post 50% NA filtering in rows and columns the dimension is 161 676 
Post filtering highly correlated columns the dimension is 161 495 
Post removing linearly dependent columns the dimension is 161 158

# Feature Selection


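`How does varImp work, and what does the score mean?`<br>
`varImp` returns a model-specific importance score for each predictor. For some models, such as MARS, it tracks the change in a model statistic, such as the generalized cross-validation (GCV) estimate, and accumulates the reduction in that statistic as each predictor's features are added to the model; this total reduction is used as the variable importance measure. Other model types use their own measure, and models without one fall back on a model-independent filter score. With `scale=TRUE` the scores are rescaled so that the largest is 100. The scores rank predictors; they are not tests of statistical significance.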

In [7]:
# Resampling scheme for caret: 8-fold cross-validation (each fold is predicted by a model
# trained on the remaining 7), repeated 3 times with different random splits
control <- trainControl(method="repeatedcv", number=8, repeats=3)
# Drop the Patient ID column and treat Response as a class label
mod_inp_mat <- expr_rmLnCmb[, 2:ncol(expr_rmLnCmb)]
mod_inp_mat$Response <- as.factor(mod_inp_mat$Response)
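
To make the resampling scheme concrete, here is a rough base R sketch of how repeated 8-fold cross-validation partitions the samples. The helper `make_folds` is hypothetical, and 161 is used only as an illustrative sample count; caret builds its own folds internally (and balances classes across folds), which this sketch does not reproduce.

```r
set.seed(123)
n_samples <- 161   # illustrative sample count
k <- 8
repeats <- 3

# For each repeat, shuffle the sample indices and deal them into k folds
make_folds <- function(n, k) split(sample(n), rep_len(seq_len(k), n))
resamples <- replicate(repeats, make_folds(n_samples, k), simplify = FALSE)

# Each fold is held out once; the model is trained on the remaining k - 1 folds
held_out <- resamples[[1]][[1]]
train_idx <- setdiff(seq_len(n_samples), held_out)

# Total number of model fits per candidate tuning value
n_fits <- k * repeats
print(sapply(resamples[[1]], length))
print(n_fits)
```

Every sample is held out exactly once per repeat, so each candidate tuning value is evaluated on 24 train/test splits.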

## glmNet

In [8]:
m.glm <- train(Response~., data=mod_inp_mat, 
                  method="glmnet", 
                  trControl=control,
                  preProcess = c("scale", "center"),
                  na.action = na.omit
                 )


In [9]:
# estimate variable importance
glm.imp <- varImp(m.glm, scale=TRUE)$importance
glm.imp$Name <- rownames(glm.imp)
# summarize importance
glm.imp <- glm.imp[order(glm.imp$Overall, decreasing=TRUE),]
glm.imp <- rownames(glm.imp[1:50, ])
glm.imp

## Random Forest
A random forest is an ensemble method that can be used for both regression and classification. It trains many decision trees, each on a bootstrap sample of the data, and aggregates their predictions, a technique called bootstrap aggregation and commonly known as bagging. Combining many trees gives a more stable final output than relying on any individual decision tree.
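
The bagging idea can be sketched in base R with a deliberately simple learner. This is an illustration on simulated data, with a hypothetical one-threshold "stump" standing in for a decision tree; it shows bootstrap sampling and majority voting only, not the per-split feature sampling that a random forest adds.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- factor(ifelse(x + rnorm(n, sd = 0.5) > 0, "1", "0"))

# One weak learner: a threshold midway between the two class means of its training sample
fit_stump <- function(idx) mean(tapply(x[idx], y[idx], mean))

# Bagging: fit each learner on a bootstrap sample drawn with replacement
thresholds <- replicate(25, fit_stump(sample(n, replace = TRUE)))

# Aggregate: each learner votes for class "1" when x is above its threshold
votes <- sapply(thresholds, function(t) x > t)   # n x 25 logical matrix
pred <- factor(ifelse(rowMeans(votes) > 0.5, "1", "0"), levels = levels(y))
accuracy <- mean(pred == y)
print(accuracy)
```

Each learner sees a slightly different sample, so their individual errors partly cancel in the vote.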

In [10]:
m.rf <- train(Response~., data=mod_inp_mat, 
              method="rf", 
              trControl=control,
              preProcess = c("scale", "center"),
              na.action = na.omit)

In [11]:
# estimate variable importance
rf.imp <- varImp(m.rf, scale=TRUE)$importance
rf.imp$Name <- rownames(rf.imp)
# summarize importance
rf.imp <- rf.imp[order(rf.imp$Overall, decreasing=TRUE),]
rf.imp <- rownames(rf.imp[1:50, ])
rf.imp

In [None]:
# saveRDS(tmp, "tmp.rds")
# # load by giving the path
# # tmp <- readRDS("path")


In [None]:
# #parameter tuning
# modelLookup('lvq')
# # design the parameter tuning grid
# grid <- expand.grid(size=c(5,10,20,50), k=c(1,2,3,4,5))
# # train the model
# model <- train(Species~., data=iris, method="lvq", trControl=control, tuneGrid=grid)

Attempting to install `logicFS` with `install.packages` failed: <br>
`install.packages("logicFS")` <br>
Warning message:  <br>
package ‘logicFS’ is not available (for R version 3.5.1)

## SVM
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.

They are a good fit here because they:<br>
1) Are effective in high-dimensional spaces. <br>
2) Remain effective when the number of dimensions is greater than the number of samples.



In [12]:
m.svm <- train(Response~., data=mod_inp_mat, 
                  method="svmLinear2", 
                  trControl=control,
                  preProcess = c("scale", "center"),
                  na.action = na.omit
                 )

In [15]:
# estimate variable importance
svm.imp <- varImp(m.svm, scale=TRUE)$importance
svm.imp$Name <- rownames(svm.imp)
# summarize importance
svm.imp <- svm.imp[order(svm.imp$X0, decreasing=TRUE),]
svm.imp <- rownames(svm.imp[1:50, ])
svm.imp
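
The importance table for this model is ordered by a per-class column (`X0`) rather than `Overall`, which is what caret produces when it falls back on a model-independent filter score. For a two-class response that score is based on the area under the ROC curve of each predictor taken on its own. Below is a minimal base R sketch of that idea on simulated data; `auc_score` is a hypothetical helper, and caret's own implementation (see `filterVarImp`) is not reproduced here.

```r
set.seed(1)
n <- 60
response <- factor(rep(c("0", "1"), each = n / 2))
predictors <- data.frame(
  informative = rnorm(n, mean = ifelse(response == "1", 1.5, 0)),
  noise = rnorm(n)
)

# Hypothetical helper: AUC of a single predictor via the rank-sum (Mann-Whitney) statistic
auc_score <- function(x, y) {
  r <- rank(x)
  n1 <- sum(y == "1")
  n0 <- sum(y == "0")
  auc <- (sum(r[y == "1"]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  max(auc, 1 - auc)   # direction does not matter for importance
}

scores <- sapply(predictors, auc_score, y = response)
print(sort(scores, decreasing = TRUE))
```

Because each predictor is scored in isolation, this kind of ranking ignores the fitted model entirely, which is worth keeping in mind when comparing it with the model-based rankings from glmnet or random forest.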

## Multi layer perceptron (Neural Network)
A multilayer perceptron is a class of feedforward artificial neural network.

In [16]:
modelLookup('mlp')

| | model | parameter | label | forReg | forClass | probModel |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<lgl>` | `<lgl>` | `<lgl>` |
| 1 | mlp | size | #Hidden Units | TRUE | TRUE | TRUE |


In [17]:
m.mlp <- train(Response~., data=mod_inp_mat, 
                  method="mlp", 
                  trControl=control,
                  preProcess = c("scale", "center"),
                  na.action = na.omit
                 )

In [19]:
# estimate variable importance
mlp.imp <- varImp(m.mlp, scale=TRUE)$importance
mlp.imp$Name <- rownames(mlp.imp)
# summarize importance
mlp.imp <- mlp.imp[order(mlp.imp$X0, decreasing=TRUE),]
mlp.imp <- rownames(mlp.imp[1:50, ])
mlp.imp

## Neural Network

In [20]:
m.nnet <- train(Response~., data=mod_inp_mat, 
                  method="nnet", 
                  trControl=control,
                  preProcess = c("scale", "center"),
                  na.action = na.omit
                 )

# weights:  159
initial  value 95.018419 
iter  10 value 63.450613
iter  20 value 55.362210
iter  30 value 54.811746
iter  40 value 54.784273
iter  50 value 54.782387
iter  60 value 53.449217
iter  70 value 53.421895
iter  80 value 53.421666
iter  90 value 53.421486
iter 100 value 53.421371
final  value 53.421371 
stopped after 100 iterations
[Output truncated: the same trace repeats for each resample and each network size (159, 475 and 791 weights). Nearly all of the fits shown stopped after 100 iterations; one converged.]

In [22]:
# estimate variable importance
nnet.imp <- varImp(m.nnet, scale=TRUE)$importance
nnet.imp$Name <- rownames(nnet.imp)
# summarize importance
nnet.imp <- nnet.imp[order(nnet.imp$Overall, decreasing=TRUE),]
nnet.imp <- rownames(nnet.imp[1:50, ])
nnet.imp

# Compare feature ranks

In [None]:
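# Each *.imp object is a character vector of the top 50 feature names, ordered from
# most to least important, so a feature's position in the vector is its rank
rankTable <- function(genes, label) {
  setNames(data.frame(genes, seq_along(genes), stringsAsFactors = FALSE),
           c("GeneID", paste0("Rank.", label)))
}
ranks <- list(rankTable(glm.imp, "glm"), rankTable(rf.imp, "rf"), rankTable(svm.imp, "svm"),
              rankTable(mlp.imp, "mlp"), rankTable(nnet.imp, "nnet"))
# Keep the features that appear in every model's top 50
mergedScore <- Reduce(function(x,y) merge(x,y,by="GeneID"), ranks)

# find correlations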
# 
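
One possible way to carry out the comparison is to merge the per-model rank tables and compute Spearman correlations between the rank columns. The sketch below uses made-up gene names and three hypothetical models (`top_a`, `top_b`, `top_c`), so it runs on its own; nothing in it comes from the fitted models above.

```r
# Hypothetical top-feature lists from three models, ordered from most to least important
top_a <- c("g1", "g2", "g3", "g4", "g5")
top_b <- c("g2", "g1", "g3", "g6", "g4")
top_c <- c("g1", "g3", "g2", "g4", "g7")

# Turn an ordered name vector into a two-column table: GeneID and rank
rank_table <- function(genes, label) {
  setNames(data.frame(genes, seq_along(genes), stringsAsFactors = FALSE),
           c("GeneID", paste0("Rank.", label)))
}

tables <- list(rank_table(top_a, "a"), rank_table(top_b, "b"), rank_table(top_c, "c"))

# Inner join on GeneID keeps only features ranked by every model
merged <- Reduce(function(x, y) merge(x, y, by = "GeneID"), tables)

# Spearman correlation between the rank columns
rank_cor <- cor(merged[, -1], method = "spearman")
print(merged)
print(rank_cor)
```

The inner join discards features that are missing from any one list, so the correlations describe agreement only on the shared features. Counting how many features survive the merge is itself a useful measure of overlap between models.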