# Pre-Processing
1) Clean data by removing rows with > 50% missing info <br>
2) Find the most informative features <br>
3) Split the data into test and train
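
Step 3 can be sketched in base R as below. This is only an illustration on a hypothetical toy table (`toy`, `gene_a`, `gene_b` and the 75% training fraction are assumptions, not values from this analysis); it samples within each `Response` class so both sets keep roughly the original class balance.

```r
# Hypothetical toy data standing in for the cleaned expression table
set.seed(42)
toy <- data.frame(
  Patient  = as.character(1:20),
  Response = rep(c(0, 1), times = c(8, 12)),
  gene_a   = rnorm(20),
  gene_b   = rnorm(20)
)

# Stratified split: sample 75% of the row indices within each Response class
split_by_class <- split(seq_len(nrow(toy)), toy$Response)
train_idx <- unlist(lapply(split_by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.75 * length(idx)))]
}))

train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

# Class counts in each set
table(train_set$Response)
table(test_set$Response)
```

caret's `createDataPartition(y, p = 0.75, list = FALSE)` performs a similar stratified split and could be used instead.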


In [1]:
library(caret)

Loading required package: lattice

Loading required package: ggplot2



## Load data
The first column is expected to be sample ID <br>
The second column is expected to be response

In [2]:
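setwd("/home/jp/ICP_Responders/FinalTables")
# ".." marks missing values in the source table
expression <- read.csv("Final_table_response_and_expression.csv", na.strings = '..', stringsAsFactors = FALSE)
# read.csv prefixes numeric column names with "X"; strip it to recover the original names
names(expression) <- sub("^X", "", names(expression))
expression[ expression == "NA" ] <- NA
# coerce Response and all expression columns to numeric
expression[,2:ncol(expression)] <- lapply(expression[, 2:ncol(expression)], as.numeric)
expression[50:60, 1:10]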

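| | Patient | Response | 3920 | 345611 | 3929 | 54210 | 3716 | 10454 | 3557 | 3556 |
|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 50 | 31 | 1 | NA | NA | NA | NA | NA | NA | NA | NA |
| 51 | 32 | 1 | NA | NA | NA | NA | NA | NA | NA | NA |
| 52 | 33 | 1 | NA | NA | NA | NA | NA | NA | NA | NA |
| 53 | 34 | 1 | -0.24227274 | 0.074304976 | -0.3511685 | 0.54428208 | -0.15006954 | -0.07183968 | 0.2002265 | -1.182660036 |
| 54 | 35 | 1 | -0.48479782 | 0.159405465 | 0.4199815 | -2.2288372 | 0.24650857 | -0.3890413 | 1.82423349 | -1.343065827 |
| 55 | 36 | 1 | 0.7408549 | 0.003467014 | 0.3333496 | -0.03988917 | -0.21346769 | 0.54476821 | -1.69969237 | 0.483498059 |
| 56 | 37 | 1 | NA | NA | NA | NA | NA | NA | NA | NA |
| 57 | 38 | 1 | 0.81259519 | -0.048404344 | -0.2250873 | -0.29189077 | 0.44538081 | 0.21789661 | -0.08321449 | 1.567718745 |
| 58 | 39 | 0 | NA | NA | NA | NA | NA | NA | NA | NA |
| 59 | 4 | 0 | 0.14400922 | -0.067130938 | 0.1120551 | -0.96429947 | 0.10440116 | -0.74711659 | 0.30591242 | -1.254576697 |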


## Clean data

In [3]:
# Check which columns have > 50% NA values
countNA <- function(x=NULL,cutOff=NULL){
  # TRUE when the percentage of NA entries in x exceeds cutOff
  perc<-sum(is.na(x))*100/length(x)
  perc>cutOff
}
col_nas <- apply(expression,2,function(x){countNA(x, 50)})
cat("Columns with NAs > 50% = ", sum(col_nas), "\n")
# no column has more than 50% NAs

# Check which rows have > 50% NA values
row_nas <- apply(expression,1,function(x){countNA(x, 50)})
cat("Rows with NAs > 50% = ", sum(row_nas), "\n")

# 43 rows have > 50% NAs; remove them (logical indexing is also safe when no row is flagged)
expr_filtered <- expression[!row_nas,]

cat("Dimensions of the filtered dataset = ", dim(expr_filtered))

Columns with NAs > 50% =  0 
Rows with NAs > 50% =  43 
Dimensions of the filtered dataset =  161 676
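
The same 50% rule can also be written without `apply`, using `rowMeans` and `colMeans` on the logical NA mask. The sketch below runs on a small hypothetical table (`toy` and its columns are made up for illustration); as in the cell above, the NA fraction is computed over all columns, including `Patient` and `Response`.

```r
# Hypothetical toy table with some missing expression values
toy <- data.frame(
  Patient  = c("a", "b", "c", "d"),
  Response = c(1, 0, 1, 0),
  g1 = c(0.5, NA, NA, 1.2),
  g2 = c(NA, NA, 0.3, 0.7),
  g3 = c(0.1, NA, NA, -0.4)
)

# Fraction of NA entries per row and per column
row_na_frac <- rowMeans(is.na(toy))
col_na_frac <- colMeans(is.na(toy))

# Keep rows and columns with at most 50% missing values
toy_filtered <- toy[row_na_frac <= 0.5, col_na_frac <= 0.5, drop = FALSE]
dim(toy_filtered)
```

Here patient `b` is missing 3 of 5 values and is dropped, while every column has at most 50% missing values and is kept.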

## Look for near zero variance and remove those columns

In [4]:
nzv <- nearZeroVar(expr_filtered[3:ncol(expr_filtered)], saveMetrics= TRUE)
# list the columns with exactly zero variance (near-zero-variance columns are flagged in nzv$nzv)
nzv[nzv$zeroVar, ]

# The table below is empty: no column has zero variance, so all features were retained

freqRatio,percentUnique,zeroVar,nzv
<dbl>,<dbl>,<lgl>,<lgl>


## Look for correlation and remove highly correlated columns

In [5]:
# find attributes that are highly correlated (absolute correlation > 0.75)
tmp <- expr_filtered
# NAs are set to 0 so that cor() returns no NA entries
tmp[is.na(tmp)] <- 0
expr_corr <-  cor(tmp[,3:ncol(tmp)]) 
highlyCorrelated <- findCorrelation(expr_corr, cutoff=0.75, names=TRUE, verbose=TRUE)
# drop the flagged columns (logical indexing is also safe when nothing is flagged)
expr_rmcorr <- expr_filtered[, !(colnames(expr_filtered) %in% highlyCorrelated)]


 Combination row 18 and column 19 is above the cut-off, value = 0.797 
 	 Flagging column 18 
 Combination row 18 and column 20 is above the cut-off, value = 0.755 
 	 Flagging column 18 
 Combination row 20 and column 22 is above the cut-off, value = 0.832 
 	 Flagging column 22 
 Combination row 20 and column 23 is above the cut-off, value = 0.801 
 	 Flagging column 23 
 Combination row 22 and column 23 is above the cut-off, value = 0.954 
 	 Flagging column 22 
 Combination row 18 and column 24 is above the cut-off, value = 0.768 
 	 Flagging column 18 
 Combination row 19 and column 24 is above the cut-off, value = 0.838 
 	 Flagging column 24 
 Combination row 19 and column 25 is above the cut-off, value = 0.759 
 	 Flagging column 25 
 Combination row 24 and column 25 is above the cut-off, value = 0.929 
 	 Flagging column 25 
 Combination row 16 and column 26 is above the cut-off, value = 0.856 
 	 Flagging column 16 
 Combination row 16 and column 31 is above the cut-off, val
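
To make the pairwise check concrete, the following base-R sketch lists the column pairs whose absolute correlation exceeds the cutoff. The data are simulated (`toy` and its columns are hypothetical), and `use = "pairwise.complete.obs"` is shown as one possible alternative to replacing NAs with 0; it is not what the cell above does. Unlike `findCorrelation`, the sketch only reports the pairs and does not decide which column of each pair to drop.

```r
set.seed(1)
n <- 100
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.1),  # nearly a copy of a
  c = rnorm(n),                # independent of the others
  d = rnorm(n)
)

# Correlation matrix; pairwise.complete.obs ignores NAs pair by pair
corr <- cor(toy, use = "pairwise.complete.obs")

# Keep only the upper triangle so each pair is reported once
corr[lower.tri(corr, diag = TRUE)] <- NA
high_pairs <- which(abs(corr) > 0.75, arr.ind = TRUE)

pairs_df <- data.frame(
  var1 = rownames(corr)[high_pairs[, "row"]],
  var2 = colnames(corr)[high_pairs[, "col"]],
  r    = corr[high_pairs]
)
pairs_df
```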

## Look for linear dependencies 
The function findLinearCombos uses the QR decomposition of a matrix to enumerate sets of linear combinations (if they exist).
QR decomposition factors a matrix A into a product A = QR of an orthogonal matrix Q and an upper triangular matrix R. A column that is a linear combination of the columns before it contributes a (numerically) zero diagonal entry to R, which is how linearly dependent columns can be detected.
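
As a rough illustration of the idea (not of the internals of `findLinearCombos`), base R's `qr()` reports the numerical rank of a matrix and moves columns it finds dependent to the end of its pivot order. The matrix below is simulated; `x3` is constructed as an exact combination of `x1` and `x2`.

```r
set.seed(7)
x1 <- rnorm(10)
x2 <- rnorm(10)
toy <- cbind(x1 = x1, x2 = x2, x3 = x1 + 2 * x2, x4 = rnorm(10))

decomp <- qr(toy)

# The rank is lower than the number of columns when dependencies exist
decomp$rank
ncol(toy)

# Columns pivoted past the rank are the ones flagged as linearly dependent
is_dependent <- seq_len(ncol(toy)) > decomp$rank
dependent <- colnames(toy)[decomp$pivot[is_dependent]]
dependent
```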

Fix this


In [6]:
tmp <- expr_rmcorr[, 3:ncol(expr_rmcorr)]
tmp[is.na(tmp)] <- 0
comboInfo <- findLinearCombos(tmp) 
# names of the columns to drop (empty when no linear combinations are found)
rmLnCmb <- colnames(tmp)[comboInfo$remove]
expr_rmLnCmb <- expr_rmcorr[, !(colnames(expr_rmcorr) %in% rmLnCmb)]
cat("Started with dimension = ", dim(expression), "\n")
cat("Post 50% NA filtering in rows and columns the dimension is", dim(expr_filtered), "\n")
cat("Post filtering highly correlated columns the dimension is", dim(expr_rmcorr), "\n")
cat("Post removing linearly dependent columns the dimension is", dim(expr_rmLnCmb))

Started with dimension =  204 676 
Post 50% NA filtering in rows and columns the dimension is 161 676 
Post filtering highly correlated columns the dimension is 161 495 
Post removing linearly dependent columns the dimension is 161 158

# Feature Selection


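`How does var imp work, significance?`<br>
The importance measure returned by varImp is model-specific: each model type contributes its own measure. For MARS models, for example, the varImp function tracks the changes in model statistics, such as the generalized cross-validation, for each predictor and accumulates the reduction in the statistic when each predictor’s feature is added to the model; this total reduction is used as the variable importance measure. When a model has no importance measure of its own, caret instead scores each predictor individually with a filter approach.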

In [8]:
# define a resampling approach for caret: 8-fold cross-validation, where the data is divided into 8 random subsets
# and each subset is predicted in turn by a model fit on the remaining 7. The whole procedure is repeated 3 times
control <- trainControl(method="repeatedcv", number=8, repeats=3)
mod_inp_mat <- expr_rmLnCmb[, 2:ncol(expr_rmLnCmb)]
mod_inp_mat$Response <- as.factor(mod_inp_mat$Response)
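
The resampling scheme can be sketched in base R as follows. This is only an illustration of how 8 folds repeated 3 times partition the 161 remaining samples; it uses plain random folds, whereas caret is understood to balance the outcome classes across folds, so the actual fold membership in `train` will differ.

```r
set.seed(123)
n_samples <- 161   # rows left after filtering
k <- 8             # number of folds
repeats <- 3       # number of repetitions

folds <- lapply(seq_len(repeats), function(r) {
  # shuffle the fold labels, then group the row indices by label
  fold_id <- sample(rep(seq_len(k), length.out = n_samples))
  split(seq_len(n_samples), fold_id)
})

# One list of 8 held-out index sets per repetition
length(folds)
lengths(folds[[1]])

# For a given fold, the model is fit on all other rows
held_out  <- folds[[1]][[1]]
train_set <- setdiff(seq_len(n_samples), held_out)
```

Each repetition yields 8 held-out sets of 20 or 21 samples, so `train` fits 8 x 3 = 24 models per candidate parameter setting.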

## glmnet

In [116]:
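# The error shown below asks for logicFS, which glmnet does not require; it probably comes from
# an earlier run of this cell with method="logicBag" (hence the name m.lb)
m.lb <- train(Response~., data=mod_inp_mat, 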
                  method="glmnet", 
                  trControl=control,
                  preProcess = c("scale", "center"),
                  na.action = na.omit
                 )


1 package is needed for this model and is not installed. (logicFS). Would you like to try to install it now?

ERROR: Error: Required package is missing


## Random Forest
A Random Forest is an ensemble technique that can perform both regression and classification. It builds many decision trees, each on a bootstrap sample of the data, and aggregates their predictions, a technique called bootstrap aggregation, commonly known as bagging. The basic idea is to combine multiple decision trees to determine the final output rather than relying on any individual tree.
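
A minimal base-R sketch of bagging is given below, using depth-1 trees (stumps) on simulated data. All names here (`fit_stump`, `predict_stump`, `f1`, `f2`) are hypothetical, and the sketch omits the random feature subsetting at each split that a real random forest adds on top of bagging.

```r
# Fit a depth-1 tree (stump): pick the feature/threshold with the lowest training error
fit_stump <- function(x, y) {
  best <- list(err = Inf)
  for (j in seq_len(ncol(x))) {
    for (thr in unique(x[, j])) {
      pred <- as.integer(x[, j] > thr)
      for (flip in c(FALSE, TRUE)) {
        p <- if (flip) 1L - pred else pred
        err <- mean(p != y)
        if (err < best$err) best <- list(err = err, j = j, thr = thr, flip = flip)
      }
    }
  }
  best
}

# Predict 0/1 labels from a fitted stump
predict_stump <- function(stump, x) {
  pred <- as.integer(x[, stump$j] > stump$thr)
  if (stump$flip) 1L - pred else pred
}

# Simulated two-class data: the label depends on both features
set.seed(2024)
n <- 200
x <- cbind(f1 = rnorm(n), f2 = rnorm(n))
y <- as.integer(x[, "f1"] + x[, "f2"] > 0)

# Bootstrap: fit each stump on rows drawn with replacement
n_trees <- 25
stumps <- lapply(seq_len(n_trees), function(b) {
  idx <- sample.int(n, replace = TRUE)
  fit_stump(x[idx, , drop = FALSE], y[idx])
})

# Aggregate: majority vote over the individual predictions
votes <- sapply(stumps, predict_stump, x = x)
bagged_pred <- as.integer(rowMeans(votes) > 0.5)
mean(bagged_pred == y)   # training accuracy of the ensemble
```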

In [9]:
m.rf <- train(Response~., data=mod_inp_mat, 
              method="rf", 
              trControl=control,
              preProcess = c("scale", "center"),
              na.action = na.omit)

In [13]:
# estimate variable importance
rf.imp <- varImp(m.rf, scale=TRUE)$importance
rf.imp$Name <- rownames(rf.imp)
# summarize importance
rf.imp <- rf.imp[order(rf.imp$Overall, decreasing=TRUE),]
rf.imp <- rownames(rf.imp[1:50, ])
rf.imp

In [None]:
# saveRDS(tmp, "tmp.rds")
# # load by giving path
# # tmp <- readRDS("path")


In [None]:
# #parameter tuning
# modelLookup('lvq')
# # design the parameter tuning grid
# grid <- expand.grid(size=c(5,10,20,50), k=c(1,2,3,4,5))
# # train the model
# model <- train(Species~., data=iris, method="lvq", trControl=control, tuneGrid=grid)

> install.packages("logicFS") <br>
Warning message:  <br>
package ‘logicFS’ is not available (for R version 3.5.1)<br>
>

## SVM
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.

They are advantageous in this case because they are:<br>
1) Effective in high-dimensional spaces. <br>
2) Still effective when the number of dimensions is greater than the number of samples.



In [120]:
m.svm <- train(Response~., data=mod_inp_mat, 
                  method="svmLinear2", 
                  trControl=control,
                  preProcess = c("scale", "center"),
                  na.action = na.omit
                 )

In [127]:
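# makeshift fix because X1 = X0. A likely cause: caret has no model-specific importance for this
# model and falls back to a per-predictor ROC-based filter, which gives both classes the same value

# estimate variable importance
svm.imp <- varImp(m.svm, scale=TRUE)$importance
# svm.imp$Name <- rownames(svm.imp)
# summarize importance: sort by score and keep predictors scoring above 50
svm.imp <- svm.imp[order(svm.imp$X0, decreasing=TRUE),]
svm.imp <- svm.imp[svm.imp$X0 > 50, ]
svm.imp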

| | X0 | X1 |
|---|---|---|
| | `<dbl>` | `<dbl>` |
| 6504 | 100.00000 | 100.00000 |
| 3458 | 87.24062 | 87.24062 |
| 29126 | 85.73951 | 85.73951 |
| 57823 | 85.38631 | 85.38631 |
| 8807 | 84.76821 | 84.76821 |
| 29851 | 82.38411 | 82.38411 |
| 9308 | 81.28035 | 81.28035 |
| 7128 | 78.05740 | 78.05740 |
| 51744 | 75.71744 | 75.71744 |
| 4210 | 72.75938 | 72.75938 |
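
The identical X0 and X1 columns are consistent with a filter-based importance: when a model has no importance measure of its own, caret scores each predictor on its own by the area under the ROC curve (AUC), and a two-class problem has only one AUC per predictor. The sketch below reproduces that idea in base R on simulated data. The helper `auc_one`, the folding to `max(auc, 1 - auc)`, and the min-max rescaling to 0-100 are assumptions about the details, not caret's code.

```r
# AUC of one numeric predictor against a two-level outcome,
# computed with the rank-sum (Mann-Whitney) identity
auc_one <- function(x, y) {
  keep <- !is.na(x)
  x <- x[keep]
  y <- y[keep]
  pos <- y == levels(y)[2]
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  auc <- (sum(rank(x)[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  max(auc, 1 - auc)   # assumption: the direction of the effect is ignored
}

# Simulated data: one informative predictor, one pure-noise predictor
set.seed(11)
y <- factor(rep(c(0, 1), each = 30))
toy <- data.frame(
  informative = rnorm(60, mean = ifelse(y == "1", 1.5, 0)),
  noise       = rnorm(60)
)

# One score per predictor, independent of any fitted model
auc <- sapply(toy, auc_one, y = y)

# Rescale to 0-100 (assumed to mirror scale=TRUE) and repeat the score for each class
scaled <- 100 * (auc - min(auc)) / (max(auc) - min(auc))
importance <- data.frame(X0 = scaled, X1 = scaled)
importance
```

Because such a score never looks at the fitted model, two different models trained on the same data would receive exactly the same importance table.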


## Multilayer perceptron (Neural Network)
A multilayer perceptron is a class of feedforward artificial neural network with one or more hidden layers between the input and output layers.

In [143]:
modelLookup('mlp')

| | model | parameter | label | forReg | forClass | probModel |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<lgl>` | `<lgl>` | `<lgl>` |
| 1 | mlp | size | #Hidden Units | TRUE | TRUE | TRUE |


In [130]:
m.mlp <- train(Response~., data=mod_inp_mat, 
                  method="mlp", 
                  trControl=control,
                  preProcess = c("scale", "center"),
                  na.action = na.omit
                 )

In [132]:
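# makeshift fix because X1 = X0. A likely cause: caret has no model-specific importance for this
# model and falls back to a per-predictor ROC-based filter, which would also explain why the
# table below matches the SVM table exactly

# estimate variable importance
mlp.imp <- varImp(m.mlp, scale=TRUE)$importance
# mlp.imp$Name <- rownames(mlp.imp)
# summarize importance: sort by score and keep predictors scoring above 50
mlp.imp <- mlp.imp[order(mlp.imp$X0, decreasing=TRUE),]
mlp.imp <- mlp.imp[mlp.imp$X0 > 50, ]
mlp.imp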

| | X0 | X1 |
|---|---|---|
| | `<dbl>` | `<dbl>` |
| 6504 | 100.00000 | 100.00000 |
| 3458 | 87.24062 | 87.24062 |
| 29126 | 85.73951 | 85.73951 |
| 57823 | 85.38631 | 85.38631 |
| 8807 | 84.76821 | 84.76821 |
| 29851 | 82.38411 | 82.38411 |
| 9308 | 81.28035 | 81.28035 |
| 7128 | 78.05740 | 78.05740 |
| 51744 | 75.71744 | 75.71744 |
| 4210 | 72.75938 | 72.75938 |


## Neural Network

In [None]:
# method = 'nnet'
# Type: Classification, Regression
#
# Tuning parameters:
#   size (#Hidden Units)
#   decay (Weight Decay)
# Required packages: nnet
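
A tuning grid over these two parameters could be set up as sketched below. The candidate values are arbitrary placeholders, not tuned recommendations, and the `train` call is left commented out because it needs caret, nnet and the `mod_inp_mat` and `control` objects defined earlier.

```r
# Candidate values for the two nnet tuning parameters (illustrative only)
grid <- expand.grid(size = c(1, 3, 5), decay = c(0, 0.01, 0.1))
nrow(grid)   # one row per size/decay combination

# m.nnet <- train(Response~., data=mod_inp_mat,
#                 method="nnet",
#                 trControl=control,
#                 tuneGrid=grid,
#                 preProcess = c("scale", "center"),
#                 na.action = na.omit,
#                 trace = FALSE)
```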