# Machine Learning with CARET

In this module, we will explore how to use the CARET package for doing machine learning.

If you are unfamiliar with notebooks, please review some basics [here](https://github.com/michhar/useR2016-tutorial-jupyter). 

## Essential Tips

Here is a very brief summary of the critical components and commands within Jupyter:

1. Critically, press `Ctrl+Enter` to run (or render) the current cell.
2. Output will print to the notebook. You may have to scroll up to see it all.
3. Get help for any function by typing a question mark and then its name into
   the console: `?rxLinMod`. It will split the window, and will bring up the documentation for 
   that function below.
4. Files will appear in the specified directory. You can find them by selecting File in the menu bar and selecting "Open...". This will open a new browser window with a file navigator.
5. R objects can be listed by typing `ls()` in an R cell.
6. Run all the example code!

There are a number of hands-on exercises in the document, so while you can run the notebook from beginning to end, you will get a lot more out of it by actually walking through cell-by-cell, and filling out the corresponding exercises.

These notebooks are based on a tutorial presented at a Microsoft conference in June of 2016. The original files are available [here](https://github.com/joseph-rickert/MLADS_JUNE_2016).



## Introduction

caret (short for **C**lassification **a**nd **RE**gression **T**raining) is one of the most feature-rich packages for machine learning in R. It provides functions to streamline the entire process and includes tools for:  

* data splitting    
* pre-processing    
* feature selection    
* model tuning using resampling    
* variable importance estimation    

This script explores caret's capabilities using a cell segmentation data set that is included in the package. The data is described in the paper: Hill et al., "Impact of image segmentation on high-content screening data quality for SK-BR-3 cells", BMC Bioinformatics (2007) vol 8 (1) pp. 340.

The analysis presented here is based on examples presented by Max Kuhn, caret's author, at useR! 2012.

We covered this same dataset in our [Introduction to Classification](5-Classification.ipynb).

Before we get started, we'll source a configuration file in the next cell. It simply makes sure that the relevant R packages and datasets are available. You do not need to look at it, but if you are interested, you can view the configuration file [here](Resources/config.R). It may take a few moments to run the first time you run it, but it should be fast afterwards.

In [None]:
source("Resources/config.R")
seed_val <- 1 # random seed value, to be actually set for the generator later
set.seed(seed_val)   ## sets the random seed
worker_seeds <- sample(1e7, size = 5)  ## used to set the seed for the worker nodes

## Background

"Well-segmented" (WS) cells are cells for which location and size may be accurately determined through optical measurements. Cells that are not well-segmented are said to be "poorly-segmented" (PS). Given a set of optical measurements, can we predict which cells will be PS? This is a classic classification problem.

## Data

We'll get started by loading the package, getting help on the dataset and loading the data using the `data()` function, and then viewing the first couple of rows.

### Packages Required

```{r}
library(partykit)			# Plotting trees
library(rpart)  			# CART algorithm for decision trees
```  

In [None]:
library(caret)
?segmentationData
data(segmentationData)  	# Load the segmentation data set
dim(segmentationData)
head(segmentationData,2)

## Using CARET

We haven't used caret yet, other than to make the data accessible to R. The first caret function that we will use is `createDataPartition()`. This function creates a partition of the data so that we can split it into training and testing sets explicitly and easily.

Specifically, in the next cell, we use `createDataPartition()` to create a set of row indices that we then use to subset rows. The `p=.5` argument indicates that the result will contain about 50% of the entries in the entire dataset, and `list=FALSE` indicates that it should return the indices as a single-column matrix, which we can use like a vector, rather than as a list with multiple sets of entries. Instead of using the list data structure, we use the negative indexing trick when we create `testData` to exclude anything in our training set.

Our column indices simply drop the first two columns (`Cell` and `Case`) to make model specification simple (note the use of negative indices).
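
To make the indexing pattern concrete, here is a minimal sketch on a toy data frame. The data frame `toy` and the hand-picked `idx` are made up for illustration; in the notebook, the row indices come from `createDataPartition()`.

```r
# Toy stand-in for segmentationData (all values are made up for illustration)
toy <- data.frame(
  id    = 1:6,
  group = c("a", "a", "a", "b", "b", "b"),
  y     = c(1, 0, 1, 0, 1, 0),
  x     = c(2.1, 3.4, 1.9, 5.0, 4.2, 3.3)
)

idx <- c(1, 3, 5)                 # pretend these row indices came from createDataPartition()

toy_train <- toy[idx,  -c(1, 2)]  # rows in idx, all columns except the first two
toy_test  <- toy[-idx, -c(1, 2)]  # every row NOT in idx, same columns

toy_train
toy_test
```

Because the test rows are defined as the complement of the training rows, the two sets cannot overlap. The next cell applies the same pattern to the real data.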

In [None]:
trainIndex <- createDataPartition(segmentationData$Case,p=.5,list=FALSE) # create a vector of indices
trainData <- segmentationData[trainIndex,-c(1,2)]  # create training data by using these as row indices.
testData  <- segmentationData[-trainIndex,-c(1,2)] # create testing data by excluding training indices

We can confirm that about 50% of the data are in each set by looking at the number of rows in each.

In [None]:
(length(trainIndex) + nrow(testData)) == nrow(segmentationData)
c(length(trainIndex), nrow(testData))/nrow(segmentationData)

Next, we will remove the outcome variable so we have data frames with only the predictor variables:

In [None]:
trainX <- trainData[,-1]        # Drop the dependent variable (Class), keeping only predictors
testX <- testData[,-1]

## GENERALIZED BOOSTED REGRESSION MODEL   

We will start by building a generalized boosted regression model, or gbm. To do this, we need the gbm package, and we need to note that the gbm function does not accept categorical "factor" variables as the dependent variable, so we will need to fix that.
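
As a small sketch of that conversion on a made-up factor (`cls` is hypothetical and only mimics the `PS`/`WS` coding): comparing against a level name makes the 0/1 coding explicit, whereas `as.integer()` on a factor returns the internal level codes (1 and 2), which is not the 0/1 response a Bernoulli model expects.

```r
# Made-up factor that mimics the Class column
cls <- factor(c("PS", "WS", "WS", "PS"), levels = c("PS", "WS"))

y01   <- ifelse(cls == "PS", 1, 0)  # explicit coding: PS -> 1, WS -> 0
codes <- as.integer(cls)            # internal level codes: PS -> 1, WS -> 2

data.frame(cls, y01, codes)
```

The cells below first inspect `trainData$Class` and then apply the same `ifelse()` recoding.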


In [None]:
str(trainData$Class[1:10])

In [None]:
library(gbm)
gbmTrain <- trainData
gbmTrain$Class <- ifelse(gbmTrain$Class=="PS",1,0) ## make this numeric
gbm.mod <- gbm(formula = Class~.,  			# use all variables
				distribution = "bernoulli",		  # for a classification problem
				data = gbmTrain,
				n.trees = 2000,					        # 2000 boosting iterations
				interaction.depth = 7,			    # 7 splits for each tree
				shrinkage = 0.01,				        # the learning rate parameter
				verbose = FALSE)				        # Do not print the details

In [None]:
gbm.mod

In [None]:
summary(gbm.mod)			# Plot the relative influence of the variables in the model

This is an interesting model, but how do you select the best values for the three tuning parameters?   

* n.trees   
* interaction.depth   
* shrinkage   

It turns out that this is where caret really shines.

## GBM Model Training Over Parameter Space

`caret` provides the `train()` function, which implements the following algorithm: 

Algorithm for training the model:    
Define sets of model parameters to evaluate    
for each parameter set do    
....for each resampling iteration do    
......hold out specific samples     
......pre-process the data    
......fit the model to the remainder    
......predict the hold-out samples    
....end      
....calculate the average performance across hold-out predictions    
end    
Determine the optimal parameter set    
Fit the final model to the training data using the optimal parameter set    

Note that, for classification, `train()` reports accuracy and Cohen's $\kappa$ by default and uses accuracy to pick the best model.   
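
To make the loop above concrete, here is a minimal base R sketch of the same idea. It is not caret's implementation: the synthetic data, the polynomial degree as tuning parameter, the plain 5-fold split, and the RMSE metric are all assumptions chosen to keep the example small, and the pre-processing step is omitted.

```r
set.seed(1)

# Synthetic regression data (made up for illustration)
n   <- 200
x   <- runif(n, -2, 2)
dat <- data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.3))

param_grid <- 1:5                               # candidate polynomial degrees
k          <- 5                                 # number of folds
folds      <- sample(rep(1:k, length.out = n))  # random fold label for every row

cv_rmse <- sapply(param_grid, function(deg) {       # for each parameter set
  fold_rmse <- sapply(1:k, function(f) {            #   for each resampling iteration
    held_out  <- dat[folds == f, ]                  #     hold out specific samples
    remainder <- dat[folds != f, ]
    fit  <- lm(y ~ poly(x, deg), data = remainder)  #     fit the model to the remainder
    pred <- predict(fit, newdata = held_out)        #     predict the hold-out samples
    sqrt(mean((held_out$y - pred)^2))
  })
  mean(fold_rmse)                                   #   average performance across hold-outs
})

best_deg  <- param_grid[which.min(cv_rmse)]         # determine the optimal parameter set
final_fit <- lm(y ~ poly(x, best_deg), data = dat)  # refit on all the training data

cv_rmse
best_deg
```

`train()` wraps this pattern and adds the bookkeeping (resampling schemes, pre-processing, performance summaries) for many model types.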

Let's explore how this works in practice.

## Set up training control

First, we need to set up the data structure that will control the training procedure.

In [None]:
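ctrl <- trainControl(method="repeatedcv",             # repeated k-fold cross-validation (10 folds by default)
                     repeats=5,                        # do 5 repetitions of CV
                     summaryFunction=twoClassSummary,  # report ROC AUC, sensitivity, and specificity
                     classProbs=TRUE,                  # class probabilities are needed to compute the AUC
                     allowParallel = FALSE)            # train sequentially for now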

Next, we need to define the parameter search space.

We can use the `expand.grid()` function to help specify the search space efficiently.

Note that if no grid is supplied, the default search tries 3 values of each tuning parameter (`tuneLength = 3`).
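
As a quick illustration of what `expand.grid()` returns, the sketch below builds a small grid with made-up values (`toy_grid` is hypothetical) and then counts the combinations that the grid defined in the next cell will contain.

```r
# expand.grid() returns one row per combination of the supplied values
toy_grid <- expand.grid(interaction.depth = c(1, 3),
                        shrinkage = c(0.01, 0.1))
toy_grid
nrow(toy_grid)                    # 2 depths x 2 learning rates = 4 rows

# Sizes of the value sets used in the next cell
depths  <- seq(1, 4, by = 2)      # 1 and 3 (4 is not reached with a step of 2)
n_trees <- seq(10, 100, by = 10)  # 10, 20, ..., 100

# depths x iterations x 2 shrinkage values x 1 n.minobsinnode value
n_combinations <- length(depths) * length(n_trees) * 2 * 1
n_combinations                    # 40 candidate parameter sets
```

Each of these 40 combinations is scored on all 50 resamples (10 folds x 5 repeats) defined by `ctrl`.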

In [None]:
grid <- expand.grid(interaction.depth = seq(1,4,by=2), # tree depths 1 and 3
                    n.trees=seq(10,100,by=10),         # let iterations go from 10 to 100
                    shrinkage=c(0.01,0.1),             # try 2 values for the learning rate
                    n.minobsinnode = 20)               # minimum number of observations in a terminal node
set.seed(seed_val)                     # set the seed to 1 for sequential training.

In [None]:
library(pROC) 
system.time(gbm.tune <- train(x=trainX,y=trainData$Class,
				method = "gbm",
				metric = "ROC",
				trControl = ctrl,
				tuneGrid=grid,
				verbose=FALSE))

If we have the `doParallel` package installed, we can even ask caret to do the parameter searches in parallel! 

This becomes a little complicated when combined with using packages that adjust your library location (or with using notebooks), but we'll address that as well.

First, we just need to load doParallel and register the parallel back end.

In [None]:
library(doParallel)
registerDoParallel(4)		# Register a parallel backend for train
getDoParWorkers()

## Before we run...

We need to make sure that the workers spawned actually have the right path to packages!
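
For comparison, here is a sketch of the same idea using an explicit cluster from the base `parallel` package rather than `foreach`. This is an alternative, not what the notebook does: the cluster `cl` and the choice of 2 workers are assumptions for illustration. Calling `.libPaths()` with an argument on each worker sets that worker's library search path.

```r
library(parallel)

cl <- makeCluster(2)            # 2 local worker processes (arbitrary choice)

main_paths <- .libPaths()       # library paths of the main session

# Set the library paths on every worker, then read them back
invisible(clusterCall(cl, function(lp) .libPaths(lp), main_paths))
worker_paths <- clusterCall(cl, function() .libPaths())

stopCluster(cl)                 # always release the workers when done

worker_paths
```

A cluster object like `cl` can also be passed to `registerDoParallel()`. The cells below achieve a similar effect through `foreach`, using the backend registered above.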

In [None]:
# shows that they're the default - We don't even have caret installed there!
foreach(i = 1:getDoParWorkers()) %dopar% {.libPaths()}

In [None]:
# get the current lps and make sure they're represented appropriately on the workers.
current_lp <- .libPaths()
foreach(i = 1:getDoParWorkers()) %dopar% {assign('.lib.loc', current_lp, envir = environment(.libPaths))}

In [None]:
## show they're updated!
foreach(i = 1:getDoParWorkers()) %dopar% {.libPaths()}

## Reproducibility

Because we're using a parallel backend now, we need to make sure the seeds on the workers are set explicitly, for reproducibility's sake. We can do that with a `%dopar%` call.
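
caret also has a built-in mechanism for this: the `seeds` argument of `trainControl()`, which takes a list with one integer vector per resample plus a single integer for the final model. The sketch below only builds such a list. The sizes are assumptions based on this notebook's gbm setup (10 folds x 5 repeats = 50 resamples, and a 40-row tuning grid), and `cv_seeds` and `ctrl_seeded` are hypothetical names.

```r
set.seed(1)

n_resamples <- 10 * 5   # 10-fold CV repeated 5 times
n_models    <- 40       # rows in the gbm tuning grid (upper bound on models per resample)

cv_seeds <- vector("list", n_resamples + 1)
for (i in seq_len(n_resamples)) {
  cv_seeds[[i]] <- sample.int(1e6, n_models)       # one seed per candidate model
}
cv_seeds[[n_resamples + 1]] <- sample.int(1e6, 1)  # single seed for the final model fit

# It would then be passed along these lines (not run here):
# ctrl_seeded <- trainControl(method = "repeatedcv", repeats = 5, seeds = cv_seeds)

length(cv_seeds)
```

The next cell takes the more direct route of seeding each worker by hand.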

In [None]:
set.seed(seed_val)   
foreach(i = 1:getDoParWorkers()) %dopar% set.seed(worker_seeds[i])

Now, all we have to do is set the `allowParallel` value in `ctrl` to TRUE and rerun the training!

In [None]:
ctrl$allowParallel <- TRUE
system.time(gbm.tune <- train(x=trainX,y=trainData$Class,
				method = "gbm",
				metric = "ROC",
				trControl = ctrl,
				tuneGrid=grid,
				verbose=FALSE))

This ends up being a bit faster in parallel (most of the time).

Look at the tuning results.
Note that ROC (the area under the ROC curve) was the performance criterion used to select the optimal model.   
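
As a reminder of what that number measures: the area under the ROC curve (AUC) equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. Here is a base R sketch with made-up scores and labels; it uses the rank-sum identity rather than caret's or pROC's code, and `scores` and `is_ps` are hypothetical.

```r
scores <- c(0.90, 0.80, 0.35, 0.60, 0.20, 0.10)        # hypothetical predicted P(PS)
is_ps  <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)     # hypothetical true labels

n_pos <- sum(is_ps)
n_neg <- sum(!is_ps)

# Rank-sum (Mann-Whitney) form of the AUC
r   <- rank(scores)
auc <- (sum(r[is_ps]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Same quantity by brute force: share of positive/negative pairs ordered correctly
pairs_correct <- mean(outer(scores[is_ps], scores[!is_ps], ">"))

auc            # 8/9, about 0.889
pairs_correct  # same value (there are no tied scores here)
```

The next cell shows the tuning results for the real model.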

In [None]:
gbm.tune$bestTune
plot(gbm.tune)  		# Plot the performance of the training models
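res <- gbm.tune$results   # one row per parameter combination
# Rename by name, not position: the results also contain n.minobsinnode and the SD columns
names(res)[names(res) == "interaction.depth"] <- "depth"
names(res)[names(res) == "n.trees"] <- "trees"
res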

### GBM Model Predictions and Performance
Make predictions using the test data set

In [None]:
gbm.pred <- predict(gbm.tune,testX)
head(gbm.pred)

Look at the confusion matrix  
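
For intuition, the core of a confusion matrix can be reproduced with `table()`. The sketch below uses made-up predictions and reference labels (`pred` and `ref` are hypothetical) and treats `PS`, the first factor level, as the positive class, which is also the default in `confusionMatrix()`.

```r
lv   <- c("PS", "WS")
pred <- factor(c("PS", "PS", "WS", "WS", "PS", "WS"), levels = lv)  # made-up predictions
ref  <- factor(c("PS", "WS", "WS", "WS", "PS", "PS"), levels = lv)  # made-up truth

tab <- table(Prediction = pred, Reference = ref)
tab

accuracy    <- sum(diag(tab)) / sum(tab)            # correct / total
sensitivity <- tab["PS", "PS"] / sum(tab[, "PS"])   # true PS found / all true PS
specificity <- tab["WS", "WS"] / sum(tab[, "WS"])   # true WS found / all true WS

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

`confusionMatrix()` in the next cell reports these and further statistics for the real predictions.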

In [None]:
confusionMatrix(gbm.pred,testData$Class)   

Draw the ROC curve 

In [None]:
gbm.probs <- predict(gbm.tune,testX,type="prob")
head(gbm.probs)

gbm.ROC <- roc(predictor=gbm.probs$PS,
  			response=testData$Class,
				levels=rev(levels(testData$Class)))
gbm.ROC

plot(gbm.ROC)

Plot the probability of poor segmentation

In [None]:
histogram(~gbm.probs$PS|testData$Class,xlab="Probability of Poor Segmentation")

## SUPPORT VECTOR MACHINE MODEL 
We follow steps similar to those above to build an SVM model.    

In [None]:
# Set up for parallel processing
registerDoParallel(4)
getDoParWorkers()

In [None]:
# make sure libpaths are set, since we registered new ones
foreach(i = 1:getDoParWorkers()) %dopar% {.libPaths()}
foreach(i = 1:getDoParWorkers()) %dopar% {assign('.lib.loc', current_lp, envir = environment(.libPaths))}
foreach(i = 1:getDoParWorkers()) %dopar% {.libPaths()}


In [None]:
## We also need to set the seeds on the worker nodes:
foreach(i = 1:getDoParWorkers()) %dopar% set.seed(worker_seeds[i])

Train and Tune the SVM
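
The `preProc = c("center","scale")` argument in the next cell standardizes the predictors, which matters for a radial-kernel SVM because the kernel is based on distances between observations. The idea is sketched below in base R on made-up matrices (`train_m` and `test_m` are hypothetical): the means and standard deviations are estimated on the training data only and then reused for new data.

```r
train_m <- cbind(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))  # made-up training predictors
test_m  <- cbind(a = 2.5, b = 25)                           # made-up new observation

mu  <- colMeans(train_m)        # training means
sdv <- apply(train_m, 2, sd)    # training standard deviations

train_s <- scale(train_m, center = mu, scale = sdv)
test_s  <- scale(test_m,  center = mu, scale = sdv)  # reuse the training statistics

colMeans(train_s)   # both 0 (up to rounding)
test_s              # 0 in both columns: the new row sits exactly at the training means
```

Within `train()`, this estimation is repeated inside each resample, so the held-out samples do not influence the scaling.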

In [None]:
ctrl$allowParallel
library(kernlab)
set.seed(seed_val)# this will only matter if allowParallel = FALSE
system.time(
  svm.tune <- train(x=trainX,
                    y= trainData$Class,
                    method = "svmRadial",
                    tuneLength = 9,					# try 9 values of the cost parameter
                    preProc = c("center","scale"),
                    metric="ROC",
                    trControl=ctrl)	# same as for gbm above
)	

svm.tune

Plot the SVM results   

In [None]:
plot(svm.tune,
     metric="ROC",
     scales=list(x=list(log=2)))

Make predictions on the test data with the SVM Model   

In [None]:
svm.pred <- predict(svm.tune,testX)
head(svm.pred)

In [None]:
confusionMatrix(svm.pred,testData$Class)

In [None]:
svm.probs <- predict(svm.tune,testX,type="prob")
head(svm.probs)

svm.ROC <- roc(predictor=svm.probs$PS,
               response=testData$Class,
               levels=rev(levels(testData$Class)))
svm.ROC

plot(svm.ROC)

## RANDOM FOREST MODEL

Now we'll also train a random forest model using the randomForest package.

In [None]:
library(randomForest)
set.seed(seed_val)  # in case allowParallel = FALSE
foreach(i = 1:getDoParWorkers()) %dopar% set.seed(worker_seeds[i]) # make sure things are repeatable in parallel
ctrl$allowParallel
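system.time(rf.tune <- train(x=trainX,
                y= trainData$Class,
                method="rf",
                metric="ROC",      # select by AUC, as for the other models
                trControl= ctrl,   # parallelism is controlled by ctrl$allowParallel
                prox=TRUE)         # passed through to randomForest(): compute proximities
            )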
rf.tune

In [None]:
# Plot the Random Forest results
plot(rf.tune,
     metric="ROC",
     scales=list(x=list(log=2)))

In [None]:
# Random Forest Predictions
rf.pred <- predict(rf.tune,testX)
head(rf.pred)

In [None]:
confusionMatrix(rf.pred,testData$Class)

In [None]:
rf.probs <- predict(rf.tune,testX,type="prob")
head(rf.probs)

In [None]:
rf.ROC <- roc(predictor=rf.probs$PS,
               response=testData$Class,
               levels=rev(levels(testData$Class)))
rf.ROC

plot(rf.ROC,main = "Random Forest ROC")

## Comparing Multiple Models

Having set the seeds to the same values before estimating each model, we have generated paired samples (see [Hothorn et al., "The design and analysis of benchmark experiments", Journal of Computational and Graphical Statistics (2005) vol 14 (3) pp 675-699](http://statmath.wu-wien.ac.at/~zeileis/papers/Hothorn+Leisch+Zeileis-2005.pdf)). Note that we had to do this differently depending on whether the computational engine was parallel or not.

Because of this, we are in a position to compare models using a resampling technique.
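
Why pairing matters, as a base R sketch: when two models are scored on the same resamples, their per-resample differences can be analyzed directly. The AUC values below are made up purely for illustration (`auc_a` and `auc_b` are hypothetical); the real values come from `resamples()` in the next cell.

```r
# Made-up per-resample AUC values for two models scored on the SAME resamples
auc_a <- c(0.88, 0.91, 0.86, 0.90, 0.89)
auc_b <- c(0.86, 0.90, 0.85, 0.87, 0.88)

d <- auc_a - auc_b     # paired differences, one per resample
mean(d)                # average advantage of model a over model b

paired     <- t.test(auc_a, auc_b, paired = TRUE)
one_sample <- t.test(d)  # a paired t-test is a one-sample t-test on the differences

paired$statistic
one_sample$statistic
```

caret provides `diff()` and `summary()` methods for `resamples` objects that carry out this kind of paired comparison across all models and metrics.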


In [None]:
rValues <- resamples(list(svm=svm.tune,gbm=gbm.tune,rf=rf.tune))
rValues$values
summary(rValues)

In [None]:
bwplot(rValues,metric="ROC")		    # box-and-whisker plot
dotplot(rValues,metric="ROC")		    # dot plot
splom(rValues,metric="ROC")		    # scatterplot matrix of the paired resamples