# R ML Crash Course_Part 3:  Data Transform, Split, Evaluation Metrics

## Full-Day Workshop for Users Learning Data Science with R
### 2018  Timothy CL Lam
This material is intended for internal use. Parts of the content are copied from external sources. It is not for commercial purposes.


In [7]:
installed.packages()

Unnamed: 0,Package,LibPath,Version,Priority,Depends,Imports,LinkingTo,Suggests,Enhances,License,License_is_FOSS,License_restricts_use,OS_type,MD5sum,NeedsCompilation,Built
arules,arules,/usr/local/spark-2.0.2-bin-hadoop2.7/R/lib,1.5-2,,"R (>= 3.3.2), Matrix (>= 1.2-0)","stats, methods, graphics, utils",,"pmml, XML, arulesViz, testthat",,GPL-3,,,,,yes,3.3.2
arulesViz,arulesViz,/usr/local/spark-2.0.2-bin-hadoop2.7/R/lib,1.2-1,,"arules (>= 1.4.1), grid","scatterplot3d, vcd, seriation, igraph (>= 1.0.0), graphics, methods, utils, grDevices, stats, colorspace, DT, plotly",,"graph, Rgraphviz, iplots",,GPL-3,,,,,no,3.3.2
bindr,bindr,/usr/local/spark-2.0.2-bin-hadoop2.7/R/lib,0.1,,,,,testthat,,MIT + file LICENSE,,,,,no,3.3.2
bindrcpp,bindrcpp,/usr/local/spark-2.0.2-bin-hadoop2.7/R/lib,0.2,,,"Rcpp, bindr","Rcpp, plogr",testthat,,MIT + file LICENSE,,,,,yes,3.3.2
broom,broom,/usr/local/spark-2.0.2-bin-hadoop2.7/R/lib,0.4.2,,,"plyr, dplyr, tidyr, psych, stringr, reshape2, nlme, methods",,"knitr, boot, survival, gam, glmnet, lfe, Lahman, MASS, sp, maps, maptools, multcomp, testthat, lme4, zoo, lmtest, plm, biglm, ggplot2, nnet, geepack, AUC, ergm, network, statnet.common, xergm, btergm, binGroup, Hmisc, bbmle, gamlss, rstan, rstanarm, brms, coda, gmm, Matrix, ks, purrr, orcutt, mgcv, lmodel2, poLCA, mclust, covr, lsmeans, betareg, robust, akima",,MIT + file LICENSE,,,,,no,3.3.2
brunel,brunel,/usr/local/spark-2.0.2-bin-hadoop2.7/R/lib,2.3,,R (>= 3.2.1),"rJava, uuid, jsonlite",,,,"Apache License, Version 2.0",,,,,,3.3.2
C50,C50,/usr/local/spark-2.0.2-bin-hadoop2.7/R/lib,0.1.0-24,,R (>= 2.10.0),partykit,,,,GPL-3,,,,,yes,3.3.2
class,class,/usr/local/spark-2.0.2-bin-hadoop2.7/R/lib,7.3-14,recommended,"R (>= 3.0.0), stats, utils",MASS,,,,GPL-2 | GPL-3,,,,,yes,3.3.2
cluster,cluster,/usr/local/spark-2.0.2-bin-hadoop2.7/R/lib,2.0.6,recommended,R (>= 3.0.1),"graphics, grDevices, stats, utils",,MASS,,GPL (>= 2),,,,,yes,3.3.2
config,config,/usr/local/spark-2.0.2-bin-hadoop2.7/R/lib,0.2,,,yaml (>= 2.1.13),,"testthat, knitr",,GPL-3,,,,,no,3.3.2


# Data Pre-Processing in R
- The caret package in R provides a number of useful data transforms. These transforms can be used in two ways:

**Standalone** : 
- Transforms can be modeled from training data and applied to multiple datasets.
- The model of the transform is prepared using the preProcess() function and applied to a dataset using the predict() function.

**Training** : 
- Transforms can be prepared and applied automatically during model evaluation.
- Transforms applied during training are prepared using the preProcess() function and passed to the train() function via the preProcess argument.

**Useful for** : 
- Transforms are most useful for regression algorithms, instance-based methods (like KNN and LVQ), support vector machines and neural networks.
- They are less likely to be useful for tree and rule-based methods.
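The standalone idea can be sketched in base R without caret: the transform parameters are estimated on one dataset and then reused unchanged on another. This is only an illustration. The 120/30 split, the seed, and the names `train_idx`, `mu` and `sigma` are assumptions made for the example; in the notebook cells, preProcess() and predict() play these two roles.

```r
# Base R sketch: estimate centering/scaling parameters on training rows only,
# then apply the same parameters to both the training and the test rows.
data(iris)
set.seed(1)                                   # hypothetical seed, for reproducibility
train_idx <- sample(nrow(iris), 120)          # hypothetical 120/30 split
train_x <- iris[train_idx, 1:4]
test_x  <- iris[-train_idx, 1:4]

# "Model" the transform from the training data
mu    <- colMeans(train_x)
sigma <- apply(train_x, 2, sd)

# Apply the stored parameters to each dataset
train_std <- scale(train_x, center = mu, scale = sigma)
test_std  <- scale(test_x,  center = mu, scale = sigma)

round(colMeans(train_std), 10)   # zero by construction
round(colMeans(test_std), 3)     # near zero, but not exactly
```

Reusing the training parameters on the test rows keeps information from the test set out of the transform.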

## Summary of Transform Methods
Below is a quick summary of all of the transform methods supported in the argument to the
preProcess() function in caret.
- BoxCox: apply a Box-Cox transform, values must be non-zero and positive.
- YeoJohnson: apply a Yeo-Johnson transform, like a BoxCox, but values can be negative.
- expoTrans: apply a power transform like BoxCox and YeoJohnson.
- zv: remove attributes with a zero variance (all the same value).
- nzv: remove attributes with a near zero variance (close to the same value).
- center: subtract mean from values.
- scale: divide values by standard deviation.
- range: normalize values into the range [0, 1].
- pca: transform data to the principal components.
- ica: transform data to the independent components.
- spatialSign: project data onto a unit circle.

## Scale Data
- The scale transform calculates the standard deviation for an attribute and divides each value by that standard deviation.
- This is a useful operation for scaling data with a Gaussian distribution consistently.

In [8]:
# load packages
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("scale"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  

Created from 150 samples and 4 variables

Pre-processing:
  - ignored (0)
  - scaled (4)



  Sepal.Length    Sepal.Width      Petal.Length     Petal.Width    
 Min.   :5.193   Min.   : 4.589   Min.   :0.5665   Min.   :0.1312  
 1st Qu.:6.159   1st Qu.: 6.424   1st Qu.:0.9064   1st Qu.:0.3936  
 Median :7.004   Median : 6.883   Median :2.4642   Median :1.7055  
 Mean   :7.057   Mean   : 7.014   Mean   :2.1288   Mean   :1.5734  
 3rd Qu.:7.729   3rd Qu.: 7.571   3rd Qu.:2.8890   3rd Qu.:2.3615  
 Max.   :9.540   Max.   :10.095   Max.   :3.9087   Max.   :3.2798  
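As a cross-check, the same operation can be written directly in base R. This is a small sketch for one column only. The minimum and maximum it prints should agree with the Sepal.Length column of the transformed summary above.

```r
# Scale one attribute by hand: divide each value by the column's standard deviation
data(iris)
x <- iris$Sepal.Length
scaled <- x / sd(x)

# Compare with the Sepal.Length column in the transformed summary
c(min = min(scaled), max = max(scaled), sd = sd(scaled))
```

After scaling, the standard deviation is 1 but the mean is not 0, because nothing was subtracted.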

In [9]:
sessionInfo()

R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] caret_6.0-73    ggplot2_2.2.1   lattice_0.20-35 SparkR_2.0.2   
[5] jsonlite_1.5    httr_1.2.1     

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8        nloptr_1.0.4       plyr_1.8.4         iterators_1.0.8   
 [5] tools_3.3.2        digest_0.6.10      lme4_1.1-12        uuid_0.1-2        
 [9] evaluate_0.10      tibble_1.3.4       gtable_0.2.0       nlme_3.1-128      
[13] mgcv_

## Center Data
#### The center transform calculates the mean for an attribute and subtracts it from each value.

In [10]:
# load packages
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("center"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  

Created from 150 samples and 4 variables

Pre-processing:
  - centered (4)
  - ignored (0)



  Sepal.Length       Sepal.Width        Petal.Length     Petal.Width     
 Min.   :-1.54333   Min.   :-1.05733   Min.   :-2.758   Min.   :-1.0993  
 1st Qu.:-0.74333   1st Qu.:-0.25733   1st Qu.:-2.158   1st Qu.:-0.8993  
 Median :-0.04333   Median :-0.05733   Median : 0.592   Median : 0.1007  
 Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.000   Mean   : 0.0000  
 3rd Qu.: 0.55667   3rd Qu.: 0.24267   3rd Qu.: 1.342   3rd Qu.: 0.6007  
 Max.   : 2.05667   Max.   : 1.34267   Max.   : 3.142   Max.   : 1.3007  

## Standardize Data
#### Combining the scale and center transforms will standardize your data. Attributes will have a mean value of 0 and a standard deviation of 1.

In [11]:
# load packages
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("center", "scale"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  

Created from 150 samples and 4 variables

Pre-processing:
  - centered (4)
  - ignored (0)
  - scaled (4)



  Sepal.Length       Sepal.Width       Petal.Length      Petal.Width     
 Min.   :-1.86378   Min.   :-2.4258   Min.   :-1.5623   Min.   :-1.4422  
 1st Qu.:-0.89767   1st Qu.:-0.5904   1st Qu.:-1.2225   1st Qu.:-1.1799  
 Median :-0.05233   Median :-0.1315   Median : 0.3354   Median : 0.1321  
 Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.67225   3rd Qu.: 0.5567   3rd Qu.: 0.7602   3rd Qu.: 0.7880  
 Max.   : 2.48370   Max.   : 3.0805   Max.   : 1.7799   Max.   : 1.7064  
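Base R's scale() function performs the same centering and scaling in one call, which gives a quick way to confirm the result. This is a sketch for comparison and is not how caret computes it internally.

```r
# Standardize the four numeric iris columns with base R
data(iris)
standardized <- scale(iris[, 1:4], center = TRUE, scale = TRUE)

round(colMeans(standardized), 10)        # every mean is 0
apply(standardized, 2, sd)               # every standard deviation is 1
min(standardized[, "Sepal.Length"])      # compare with the summary above
```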

## Normalize Data
#### Data values can be scaled into the range of [0, 1] which is called normalization.

In [12]:
# load packages
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("range"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  

Created from 150 samples and 4 variables

Pre-processing:
  - ignored (0)
  - re-scaling to [0, 1] (4)



  Sepal.Length     Sepal.Width      Petal.Length     Petal.Width     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:0.2222   1st Qu.:0.3333   1st Qu.:0.1017   1st Qu.:0.08333  
 Median :0.4167   Median :0.4167   Median :0.5678   Median :0.50000  
 Mean   :0.4287   Mean   :0.4406   Mean   :0.4675   Mean   :0.45806  
 3rd Qu.:0.5833   3rd Qu.:0.5417   3rd Qu.:0.6949   3rd Qu.:0.70833  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
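The [0, 1] rescaling is a min-max transform, sketched here in base R. The helper name `min_max` is made up for this example.

```r
# Min-max normalization: (value - min) / (max - min), applied per column
data(iris)
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
normalized <- as.data.frame(lapply(iris[, 1:4], min_max))

sapply(normalized, range)            # each column now spans 0 to 1
median(normalized$Sepal.Length)      # compare with the summary above
```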

## Box-Cox Transform
- When an attribute has a Gaussian-like distribution but is shifted, this is called a skew.
- The distribution of an attribute can be shifted to reduce the skew and make it more Gaussian.
- The BoxCox transform can perform this operation (assumes all values are positive).

In [14]:
install.packages('mlbench')

Installing package into ‘/user-home/_global_/R’
(as ‘lib’ is unspecified)


In [16]:
# load packages
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# summarize pedigree and age
summary(PimaIndiansDiabetes[,7:8])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,7:8], method=c("BoxCox"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,7:8])
# summarize the transformed dataset (note pedigree and age)
summary(transformed)

    pedigree           age       
 Min.   :0.0780   Min.   :21.00  
 1st Qu.:0.2437   1st Qu.:24.00  
 Median :0.3725   Median :29.00  
 Mean   :0.4719   Mean   :33.24  
 3rd Qu.:0.6262   3rd Qu.:41.00  
 Max.   :2.4200   Max.   :81.00  

ERROR: Error in requireNamespaceQuietStop("e1071"): package e1071 is required
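The cell above stopped because the e1071 package was not yet installed (it is installed later in this notebook), so no transformed summary is shown. To illustrate what the transform does, here is a base R sketch of the Box-Cox formula on simulated skewed data. The helper names, the simulated data and the fixed lambda of 0 are assumptions for the example. caret estimates lambda from the data rather than fixing it.

```r
# Box-Cox transform for a fixed lambda (requires strictly positive values):
#   lambda == 0 : log(x)
#   otherwise   : (x^lambda - 1) / lambda
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Simple moment-based skewness measure (illustrative helper)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

set.seed(7)
x <- rlnorm(1000)   # simulated right-skewed data, not the Pima dataset

c(before = skewness(x), after = skewness(box_cox(x, lambda = 0)))
```

The skewness after the transform should be much closer to zero than before.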


## Yeo-Johnson Transform
#### The YeoJohnson transform is another power transform like Box-Cox, but it supports raw values that are zero or negative.

In [18]:
# load packages
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# summarize pedigree and age
summary(PimaIndiansDiabetes[,7:8])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,7:8], method=c("YeoJohnson"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,7:8])
# summarize the transformed dataset (note pedigree and age)
summary(transformed)

    pedigree           age       
 Min.   :0.0780   Min.   :21.00  
 1st Qu.:0.2437   1st Qu.:24.00  
 Median :0.3725   Median :29.00  
 Mean   :0.4719   Mean   :33.24  
 3rd Qu.:0.6262   3rd Qu.:41.00  
 Max.   :2.4200   Max.   :81.00  

Created from 768 samples and 2 variables

Pre-processing:
  - ignored (0)
  - Yeo-Johnson transformation (2)

Lambda estimates for Yeo-Johnson transformation:
-2.25, -1.15


    pedigree           age        
 Min.   :0.0691   Min.   :0.8450  
 1st Qu.:0.1724   1st Qu.:0.8484  
 Median :0.2265   Median :0.8524  
 Mean   :0.2317   Mean   :0.8530  
 3rd Qu.:0.2956   3rd Qu.:0.8580  
 Max.   :0.4164   Max.   :0.8644  
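The Yeo-Johnson formula can be sketched in base R as below. The helper name `yeo_johnson` is made up for this example, and the lambda of -2.25 is the rounded pedigree estimate printed above, so the results should agree closely, but not necessarily to every digit, with the pedigree column of the transformed summary.

```r
# Yeo-Johnson transform for a fixed lambda
#   x >= 0: ((x + 1)^lambda - 1) / lambda            (log(x + 1) when lambda == 0)
#   x <  0: -((1 - x)^(2 - lambda) - 1) / (2 - lambda)  (-log(1 - x) when lambda == 2)
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log(x[pos] + 1)
  }
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -(((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log(1 - x[!pos])
  }
  out
}

# Minimum and maximum pedigree values from the original summary
yeo_johnson(c(0.078, 2.42), lambda = -2.25)
```

With lambda set to 1 the transform returns its input unchanged, which is a handy sanity check.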

## Principal Component Analysis Transform
- PCA transforms the data to return only the principal components, a technique from multivariate statistics and linear algebra.
- The transform keeps enough components to capture the cumulative variance threshold (thresh, default 0.95), or the number of components can be specified (pcaComp).
- The result is attributes that are uncorrelated, useful for algorithms like linear and generalized linear regression.

In [19]:
# load the packages (preProcess() comes from caret)
library(mlbench)
library(caret)
# load the dataset
data(iris)
# summarize dataset
summary(iris)
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris, method=c("center", "scale", "pca"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris)
# summarize the transformed dataset
summary(transformed)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Created from 150 samples and 5 variables

Pre-processing:
  - centered (4)
  - ignored (1)
  - principal component signal extraction (4)
  - scaled (4)

PCA needed 2 components to capture 95 percent of the variance


       Species        PC1               PC2          
 setosa    :50   Min.   :-2.7651   Min.   :-2.67732  
 versicolor:50   1st Qu.:-2.0957   1st Qu.:-0.59205  
 virginica :50   Median : 0.4169   Median :-0.01744  
                 Mean   : 0.0000   Mean   : 0.00000  
                 3rd Qu.: 1.3385   3rd Qu.: 0.59649  
                 Max.   : 3.2996   Max.   : 2.64521  
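The "2 components for 95 percent" message can be reproduced with base R's prcomp(). This is a sketch for comparison, and the sign of the components may differ from caret's output. The variable names are illustrative.

```r
# PCA on the centered and scaled numeric iris columns
data(iris)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of variance per component and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
round(cum_var, 4)

# Smallest number of components whose cumulative variance reaches 95%
n_comp <- which(cum_var >= 0.95)[1]
n_comp

# The retained component scores are uncorrelated
scores <- pca$x[, seq_len(n_comp)]
round(cor(scores), 10)
```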

## Independent Component Analysis Transform
- Transform the data to the independent components.
- Unlike PCA, ICA retains those components that are independent.
- You must specify the number of desired independent components with the n.comp argument.
- This transform may be useful for algorithms such as Naive Bayes.

In [21]:
install.packages('fastICA')

Installing package into ‘/user-home/_global_/R’
(as ‘lib’ is unspecified)


In [22]:
# load packages
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# summarize dataset
summary(PimaIndiansDiabetes[,1:8])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,1:8], method=c("center", "scale",
    "ica"), n.comp=5)

# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,1:8])
# summarize the transformed dataset
summary(transformed)

    pregnant         glucose         pressure         triceps     
 Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
 Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
 Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
 Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
    insulin           mass          pedigree           age       
 Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00  
 1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00  
 Median : 30.5   Median :32.00   Median :0.3725   Median :29.00  
 Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24  
 3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00  
 Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00  

Created from 768 samples and 8 variables

Pre-processing:
  - centered (8)
  - independent component signal extraction (8)
  - ignored (0)
  - scaled (8)

ICA used 5 components


      ICA1               ICA2              ICA3               ICA4        
 Min.   :-2.93403   Min.   :-3.0695   Min.   :-4.89641   Min.   :-6.0200  
 1st Qu.:-0.72094   1st Qu.:-0.7711   1st Qu.:-0.48305   1st Qu.:-0.4279  
 Median :-0.07466   Median : 0.2780   Median : 0.02397   Median : 0.2596  
 Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
 3rd Qu.: 0.73880   3rd Qu.: 0.8410   3rd Qu.: 0.59436   3rd Qu.: 0.6824  
 Max.   : 2.37944   Max.   : 1.4154   Max.   : 4.17243   Max.   : 1.5765  
      ICA5        
 Min.   :-3.2215  
 1st Qu.:-0.6498  
 Median :-0.1381  
 Mean   : 0.0000  
 3rd Qu.: 0.4696  
 Max.   : 5.5384  

## Resampling Methods To Estimate Model Accuracy
The example below splits the iris dataset so that 80% is used for training a Naive Bayes
model and 20% is used to evaluate the model's performance.

In [24]:
install.packages('klaR')

Installing package into ‘/user-home/_global_/R’
(as ‘lib’ is unspecified)
also installing the dependency ‘combinat’



In [26]:
install.packages('MASS')

Installing package into ‘/user-home/_global_/R’
(as ‘lib’ is unspecified)


In [28]:
install.packages('e1071')

Installing package into ‘/user-home/_global_/R’
(as ‘lib’ is unspecified)


In [29]:
# load the packages
library(caret)
library(klaR)
# load the iris dataset
data(iris)
# define an 80%/20% train/test split of the dataset
trainIndex <- createDataPartition(iris$Species, p=0.80, list=FALSE)
dataTrain <- iris[ trainIndex,]
dataTest <- iris[-trainIndex,]
# train a naive Bayes model
fit <- NaiveBayes(Species~., data=dataTrain)
# make predictions
predictions <- predict(fit, dataTest[,1:4])
# summarize results
confusionMatrix(predictions$class, dataTest$Species)

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         0
  virginica       0          0        10

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.8843, 1)
    No Information Rate : 0.3333     
    P-Value [Acc > NIR] : 4.857e-15  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           1.0000
Specificity                 1.0000            1.0000           1.0000
Pos Pred Value              1.0000            1.0000           1.0000
Neg Pred Value              1.0000            1.0000           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate
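The split above is stratified by class, which is why the test set contains 10 rows of each species. A base R sketch of that idea follows. It is not caret's exact algorithm, and the seed and variable names are assumptions for the example.

```r
# Stratified 80/20 split: sample 80% of the row indices within each class
data(iris)
set.seed(7)   # hypothetical seed, for reproducibility
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

data_train <- iris[train_idx, ]
data_test  <- iris[-train_idx, ]

table(data_train$Species)   # 40 rows per class
table(data_test$Species)    # 10 rows per class
```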

## Wait, What Is Kappa?
# Model Evaluation Metrics in R
- There are many different metrics that you can use to evaluate your machine learning algorithms in R.
- When you use caret to evaluate your models, the default metrics used are accuracy for classification problems and RMSE for regression.

### Binary & Multiclass: Accuracy & Kappa
- Accuracy and Kappa are the default metrics used to evaluate algorithms on binary and multiclass classification datasets in caret.
- Kappa, or Cohen's Kappa, is like classification accuracy, except that it is normalized at the baseline of random chance on your dataset.
- It is a more useful measure on problems that have an imbalance in the classes (e.g. with a 70% to 30% split for classes 0 and 1, you can achieve 70% accuracy by predicting that all instances belong to class 0).
- In the example below the Pima Indians diabetes dataset is used. It has a class breakdown of 65% to 35% for negative and positive outcomes.

In [34]:
# load packages
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# prepare resampling method
trainControl <- trainControl(method="cv", number=5)
set.seed(7)
fit <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", metric="Accuracy",
trControl=trainControl)
# display results
print(fit)

Generalized Linear Model 

768 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 614, 614, 615, 615, 614 
Resampling results:

  Accuracy   Kappa    
  0.7695442  0.4656824

 


- Running this example, we can see a table of Accuracy and Kappa for the evaluated algorithm. The values shown are means taken over the cross-validation folds.
- You can see that the accuracy of the model is approximately 77%, which is about 12 percentage points above the 65% baseline obtained by always predicting the majority class.
- The Kappa value, on the other hand, is approximately 0.47, which is more interesting.
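Kappa compares the observed accuracy with the agreement expected by chance: kappa = (observed - expected) / (1 - expected). The base R sketch below applies this to the 70%/30% situation described earlier, where every instance is predicted as class 0. The function name `cohen_kappa` and the confusion matrix are made up for the illustration.

```r
# Cohen's Kappa from a square confusion matrix (rows = predictions, columns = reference)
cohen_kappa <- function(cm) {
  n <- sum(cm)
  p_observed <- sum(diag(cm)) / n                       # plain accuracy
  p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2    # chance agreement
  (p_observed - p_expected) / (1 - p_expected)
}

# 100 instances, 70 of class 0 and 30 of class 1, all predicted as class 0
always_zero <- matrix(c(70, 0, 30, 0), nrow = 2,
                      dimnames = list(Prediction = c("0", "1"),
                                      Reference  = c("0", "1")))

accuracy <- sum(diag(always_zero)) / sum(always_zero)
accuracy                    # 0.7
cohen_kappa(always_zero)    # 0
```

The accuracy of 0.7 looks respectable, while the Kappa of 0 shows that the model does no better than chance.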

### Regression: RMSE and R2
- RMSE and R2 are the default metrics used to evaluate algorithms on regression datasets in caret.
- RMSE, or Root Mean Squared Error, is the square root of the mean squared deviation of the predictions from the observations.
- It is useful to get a gross idea of how well (or not) an algorithm is doing, in the units of the output variable.
- In this example the longley economic dataset is used. The output variable for this dataset is the number of employed people in the population.

In [35]:
# load packages
library(caret)
# load data
data(longley)
# prepare resampling method
trainControl <- trainControl(method="cv", number=5)
set.seed(7)
fit <- train(Employed~., data=longley, method="lm", metric="RMSE", trControl=trainControl)
# display results
print(fit)

Linear Regression 

16 samples
 6 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 12, 12, 14, 13, 13 
Resampling results:

  RMSE       Rsquared 
  0.3868618  0.9883114

Tuning parameter 'intercept' was held constant at a value of TRUE
 


- Running this example, we can see a table of RMSE and R Squared for the evaluated algorithm.
- The values shown are means over the cross-validation folds. You can see that the RMSE was approximately 0.39 in the units of Employed.
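Both metrics are short formulas, sketched here in base R. The helper names are illustrative, and R Squared is computed as the squared correlation between observations and predictions, which is my understanding of caret's default and should be treated as an assumption. The values are in-sample, so they will look more optimistic than the cross-validated figures above.

```r
# RMSE: square root of the mean squared difference between observed and predicted
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# R squared as squared correlation (assumed to match caret's default definition)
r_squared <- function(obs, pred) cor(obs, pred)^2

data(longley)
fit  <- lm(Employed ~ ., data = longley)
pred <- fitted(fit)   # in-sample predictions, no resampling

c(RMSE = rmse(longley$Employed, pred),
  Rsquared = r_squared(longley$Employed, pred))
```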

### Two Class: Area Under ROC Curve (AUC)
- ROC metrics are only suitable for binary classification problems (i.e. two classes).
- To calculate ROC information, you must change the summaryFunction in your trainControl to be twoClassSummary.
- The value reported as ROC is actually the area under the ROC curve, or AUC.
- The AUC represents a model's ability to discriminate between positive and negative classes.
- An area of 1.0 represents a model that predicts perfectly. An area of 0.5 represents a model as good as random.

In [36]:
# load packages
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# prepare resampling method
trainControl <- trainControl(method="cv", number=5, classProbs=TRUE,
summaryFunction=twoClassSummary)
set.seed(7)
fit <- train(diabetes~., data=PimaIndiansDiabetes, method="glm", metric="ROC",
trControl=trainControl)
# display results
print(fit)

Generalized Linear Model 

768 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 614, 614, 615, 615, 614 
Resampling results:

  ROC        Sens   Spec     
  0.8336003  0.882  0.5600978

 


- The AUC score of 0.833 is good but not excellent. The first level is taken as the positive class, in this case neg (no onset of diabetes).
- ROC can be broken down into sensitivity and specificity. A binary classification problem is really a trade-off between **sensitivity and specificity**.
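One way to see what the AUC measures is that it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The base R sketch below computes it from ranks. The function name `auc_rank` and the toy scores are assumptions for the illustration, and this is not the code path caret uses.

```r
# AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(scores, is_positive) {
  r <- rank(scores)                 # average ranks, so ties count as half
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

is_positive <- c(TRUE, TRUE, FALSE, FALSE)

auc_rank(c(0.9, 0.8, 0.3, 0.1), is_positive)   # perfect ranking: 1
auc_rank(c(0.9, 0.4, 0.6, 0.1), is_positive)   # one pair out of four misordered: 0.75
auc_rank(c(0.1, 0.3, 0.8, 0.9), is_positive)   # fully reversed ranking: 0
```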

### Sensitivity
- is the true positive rate, also called the recall.
- It is the proportion of instances from the positive (first) class that were actually predicted correctly.

### Specificity 
- is also called the true negative rate.
- It is the proportion of instances from the negative class (second class) that were actually predicted correctly.
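Both rates can be computed directly from predicted and true labels. The sketch below uses ten made-up labels and, following the convention described above, treats the first level (neg) as the positive class.

```r
# Toy labels (made up for illustration)
truth <- c("neg", "neg", "neg", "neg", "pos", "pos", "pos", "neg", "pos", "neg")
pred  <- c("neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg", "neg", "neg")

positive <- "neg"   # first level is treated as the positive class

# Sensitivity: share of positive-class instances predicted as positive
sensitivity <- sum(pred == positive & truth == positive) / sum(truth == positive)

# Specificity: share of negative-class instances predicted as negative
specificity <- sum(pred != positive & truth != positive) / sum(truth != positive)

c(sensitivity = sensitivity, specificity = specificity)
```

Here 5 of the 6 neg instances and 2 of the 4 pos instances are predicted correctly.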

### MultiClass with Logarithmic Loss

- Logarithmic Loss (or LogLoss) is used to evaluate binary classification, but it is more common for multiclass classification algorithms.
- Specifically, it evaluates the probabilities estimated by the algorithms.

In [38]:
# load packages
library(caret)
# load the dataset
data(iris)
# prepare resampling method
trainControl <- trainControl(method="cv", number=5, classProbs=TRUE,
summaryFunction=mnLogLoss)
set.seed(7)
fit <- train(Species~., data=iris, method="rpart", metric="logLoss", trControl=trainControl)
# display results
print(fit)

Loading required package: rpart


CART 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 120, 120, 120, 120, 120 
Resampling results across tuning parameters:

  cp    logLoss  
  0.00  0.4105613
  0.44  0.6840517
  0.50  1.0986123

logLoss was used to select the optimal model using  the smallest value.
The final value used for the model was cp = 0. 


- LogLoss is minimized, and we can see that the optimal CART (rpart) model had a cp argument value of 0 (the first row of results).
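LogLoss is the negative mean log of the probability assigned to the true class. The base R sketch below uses a made-up helper `log_loss`, and the clipping constant is an assumption. As a check, a model that assigns probability 1/3 to every class scores log(3), about 1.0986, which matches the cp = 0.50 row in the table above.

```r
# Multiclass log loss
#   prob  : matrix of predicted probabilities, one column per class (named by class)
#   truth : true class labels
log_loss <- function(prob, truth) {
  eps <- 1e-15   # assumed clipping value to avoid log(0)
  col_of_truth <- match(as.character(truth), colnames(prob))
  p_true <- prob[cbind(seq_along(truth), col_of_truth)]
  -mean(log(pmin(pmax(p_true, eps), 1 - eps)))
}

data(iris)
# An uninformative model: equal probability for each of the three species
uniform <- matrix(1 / 3, nrow = nrow(iris), ncol = 3,
                  dimnames = list(NULL, levels(iris$Species)))

log_loss(uniform, iris$Species)   # equals log(3)
```

Lower values are better, and confident wrong predictions are penalized heavily.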

## Bootstrap
- Bootstrap resampling involves taking random samples from the dataset (with replacement) against which to evaluate the model.
- In aggregate, the results provide an indication of the variance of the model's performance.
- Typically, a large number of resampling iterations are performed (thousands or tens of thousands).
- The following example uses a bootstrap with 100 resamples to estimate the accuracy of a Naive Bayes model.

In [30]:
# load the package
library(caret)
# load the iris dataset
data(iris)
# define training control
trainControl <- trainControl(method="boot", number=100)
# evaluate the model
fit <- train(Species~., data=iris, trControl=trainControl, method="nb")
# display the results
print(fit)

Naive Bayes 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Bootstrapped (100 reps) 
Summary of sample sizes: 150, 150, 150, 150, 150, 150, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa    
  FALSE      0.9514773  0.9262873
   TRUE      0.9551236  0.9318288

Tuning parameter 'fL' was held constant at a value of 0
Tuning
 parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were fL = 0, usekernel = TRUE and adjust
 = 1. 
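The resampling mechanics can be sketched in base R. Each bootstrap sample draws as many rows as the dataset has, with replacement, so some rows repeat and others are left out. The sketch assumes the left-out ("out-of-bag") rows are the ones used for evaluation, and it only counts them rather than fitting a model. The seed and variable names are illustrative.

```r
# 100 bootstrap resamples of the iris row indices
data(iris)
set.seed(7)   # hypothetical seed, for reproducibility
n <- nrow(iris)
n_resamples <- 100

oob_fraction <- replicate(n_resamples, {
  in_bag <- sample(n, size = n, replace = TRUE)   # rows used for training
  out_of_bag <- setdiff(seq_len(n), in_bag)       # rows never drawn
  length(out_of_bag) / n
})

# On average roughly 1/e (about 37%) of the rows are left out of each resample
mean(oob_fraction)
```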


## k-fold Cross-Validation
- The k-fold cross-validation method involves splitting the dataset into k subsets. Each subset is held out in turn while the model is trained on all other subsets.
- This process is repeated until accuracy is determined for each instance in the dataset, and an overall accuracy estimate is provided.
- It is a robust method for estimating accuracy, and the size of k can tune the amount of bias in the estimate, with popular values set to 5 and 10.
- The following example uses 10-fold cross-validation to estimate the accuracy of the Naive Bayes algorithm on the iris dataset.

In [31]:
# load the package
library(caret)
# load the iris dataset
data(iris)
# define training control
trainControl <- trainControl(method="cv", number=10)
# evaluate the model
fit <- train(Species~., data=iris, trControl=trainControl, method="nb")
# display the results
print(fit)

Naive Bayes 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa
  FALSE      0.9533333  0.93 
   TRUE      0.9600000  0.94 

Tuning parameter 'fL' was held constant at a value of 0
Tuning
 parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were fL = 0, usekernel = TRUE and adjust
 = 1. 
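The fold mechanics can be written out by hand in base R. To keep the sketch free of extra packages it uses a nearest-centroid classifier as a stand-in model, not Naive Bayes, so its accuracy is not comparable with the table above. The seed, the helper `predict_centroid` and the variable names are all assumptions for the illustration.

```r
data(iris)
set.seed(7)   # hypothetical seed, for reproducibility
k <- 10

# Assign every row to exactly one of k folds of equal size
fold <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Stand-in model: predict the class whose training centroid is closest
predict_centroid <- function(train, test) {
  centroids <- aggregate(. ~ Species, data = train, FUN = mean)
  test_x <- as.matrix(test[, 1:4])
  # Squared Euclidean distance from each test row to each class centroid
  d <- sapply(seq_len(nrow(centroids)), function(j) {
    rowSums(sweep(test_x, 2, unlist(centroids[j, -1]))^2)
  })
  as.character(centroids$Species)[apply(d, 1, which.min)]
}

# Hold out each fold once, train on the rest, and record the accuracy
acc <- sapply(seq_len(k), function(i) {
  held_out <- fold == i
  pred <- predict_centroid(iris[!held_out, ], iris[held_out, ])
  mean(pred == as.character(iris$Species[held_out]))
})

mean(acc)   # overall accuracy estimate
```

Every row is used for evaluation exactly once, which is what makes the estimate robust.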


## Repeated k-fold Cross-Validation
- The process of splitting the data into k folds can be repeated a number of times. This is called Repeated k-fold Cross-Validation.
- The final model accuracy is taken as the mean over the repeats.
- The following example demonstrates 10-fold cross-validation with 3 repeats to estimate the accuracy of the Naive Bayes algorithm on the iris dataset.

In [32]:
# load the package
library(caret)
# load the iris dataset
data(iris)
# define training control
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
# evaluate the model
fit <- train(Species~., data=iris, trControl=trainControl, method="nb")
# display the results
print(fit)

Naive Bayes 

150 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... 
Resampling results across tuning parameters:

  usekernel  Accuracy   Kappa    
  FALSE      0.9555556  0.9333333
   TRUE      0.9555556  0.9333333

Tuning parameter 'fL' was held constant at a value of 0
Tuning
 parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were fL = 0, usekernel = FALSE and adjust
 = 1. 


# Summary
## Tips For Evaluating Algorithms

- Using a data split into a training and test set is a good idea when you have a lot of data and you are confident that your training sample is representative of the larger dataset.
- Using a data split is very efficient and is often used to get a quick estimate of model accuracy.
- Cross-validation is a gold standard for evaluating model accuracy, often with k-folds set to 5 or 10 to balance overfitting the training data with a fair accuracy estimate.
- Repeated k-fold cross-validation is preferred when you can afford the computational expense and require a less biased estimate.