# R ML Crash Course, Part 3: Data Transform, Split

## Full-Day Workshop for Users Learning Data Science with R
### 2018  Timothy CL Lam
This material is meant for internal use. Parts of the content are copied from external sources, and it is not for commercial purposes.


# Data Pre-Processing in R
The caret package in R provides a number of useful data transforms. These transforms can be
used in two ways:

**Standalone**:
- Transforms can be modeled from training data and applied to multiple datasets.
- The model of the transform is prepared using the preProcess() function and applied to
a dataset using the predict() function.

**Training**:
- Transforms can be prepared and applied automatically during model evaluation.
- Transforms applied during training are requested by passing the transform method names to the
train() function via its preProcess argument; train() then prepares and applies them for you.

**Useful**:
- Transforms are generally useful for regression algorithms, instance-based methods (like KNN and LVQ),
support vector machines and neural networks.
- They are less likely to be useful for tree and rule-based methods.
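
The standalone workflow described above (learn the transform from one dataset, then reuse it on others) can be sketched in base R without caret. This is only an illustration of the idea, not caret's implementation: the 100/50 row split, the seed, and the variable names are assumptions made for the example.

```r
# Base R sketch of "fit on training data, apply to several datasets".
data(iris)
set.seed(42)                              # assumed seed, for reproducibility
train_rows <- sample(nrow(iris), 100)     # arbitrary 100-row training sample
train <- iris[train_rows, 1:4]
test  <- iris[-train_rows, 1:4]

# "Fit" step: estimate the parameters from the training data only
train_means <- colMeans(train)
train_sds   <- apply(train, 2, sd)

# "Apply" step: reuse the same parameters on both datasets
train_std <- scale(train, center = train_means, scale = train_sds)
test_std  <- scale(test,  center = train_means, scale = train_sds)

round(colMeans(train_std), 10)  # zero by construction
round(colMeans(test_std), 3)    # near zero, but not exactly zero
```

The test set is transformed with the training parameters, so its means are only approximately zero. That is the intended behaviour: the test data never influences the transform.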

## Summary of Transform Methods
Below is a quick summary of the main transform methods accepted by the method argument of the
preProcess() function in caret.
- BoxCox: apply a Box-Cox transform, values must be non-zero and positive.
- YeoJohnson: apply a Yeo-Johnson transform, like a BoxCox, but values can be negative.
- expoTrans: apply a power transform like BoxCox and YeoJohnson.
- zv: remove attributes with a zero variance (all the same value).
- nzv: remove attributes with a near zero variance (close to the same value).
- center: subtract mean from values.
- scale: divide values by standard deviation.
- range: normalize values to the range [0, 1].
- pca: transform data to the principal components.
- ica: transform data to the independent components.
- spatialSign: project data onto a unit circle (a unit sphere when there are more than two attributes).

## Scale Data
- The scale transform calculates the standard deviation for an attribute and divides each value by
that standard deviation.
- This is a useful operation for consistently scaling data with a Gaussian distribution.

In [4]:
# load packages
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("scale"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)
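
To make the arithmetic behind the scale transform concrete, here is a minimal base R sketch of the same operation. It does not use caret; the variable names are chosen for this example only.

```r
# Divide every column by its own standard deviation (no centering).
data(iris)
x <- as.matrix(iris[, 1:4])
col_sds <- apply(x, 2, sd)            # one standard deviation per attribute
scaled <- sweep(x, 2, col_sds, "/")   # divide column j by col_sds[j]

apply(scaled, 2, sd)   # every column now has standard deviation 1
colMeans(scaled)       # means are rescaled, but not moved to zero
```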


## Center Data
#### The center transform calculates the mean for an attribute and subtracts it from each value.

In [None]:
# load packages
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("center"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)

## Standardize Data
#### Combining the scale and center transforms will standardize your data. Attributes will have a mean value of 0 and a standard deviation of 1.

In [None]:
# load packages
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("center", "scale"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)
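
In symbols, the combined transform can be written as follows. The notation is introduced here for illustration and does not come from the original material: $x_{ij}$ is the value of attribute $j$ in row $i$, $\bar{x}_j$ is that attribute's sample mean, and $s_j$ is its sample standard deviation.

$$
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}
$$

Centering alone computes only the numerator, $x_{ij} - \bar{x}_j$, and scaling alone computes $x_{ij} / s_j$.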

## Normalize Data
#### Data values can be scaled into the range of [0, 1] which is called normalization.

In [8]:
# load packages
library(caret)
# load the dataset
data(iris)
# summarize data
summary(iris[,1:4])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris[,1:4], method=c("range"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris[,1:4])
# summarize the transformed dataset
summary(transformed)
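
The range transform corresponds to min-max rescaling. A minimal base R sketch of that calculation, with variable names assumed for this example, looks like this:

```r
# Rescale each column to [0, 1] using its own minimum and maximum.
data(iris)
x <- as.matrix(iris[, 1:4])
mins <- apply(x, 2, min)
maxs <- apply(x, 2, max)

# (x - min) / (max - min), applied column by column
normalized <- sweep(sweep(x, 2, mins, "-"), 2, maxs - mins, "/")

apply(normalized, 2, range)   # first row is all 0, second row is all 1
```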


## Box-Cox Transform
- When an attribute has a Gaussian-like distribution but is shifted to one side, this is called a skew.
- The distribution of an attribute can be transformed to reduce the skew and make it more Gaussian.
- The BoxCox transform can perform this operation (it assumes all values are positive).

In [None]:
# load packages
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# summarize pedigree and age
summary(PimaIndiansDiabetes[,7:8])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,7:8], method=c("BoxCox"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,7:8])
# summarize the transformed dataset (note pedigree and age)
summary(transformed)
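
For intuition, the Box-Cox formula itself is short enough to write by hand. The sketch below is an assumption-laden illustration: caret estimates the power parameter lambda from the data, whereas here lambda is supplied manually, and `box_cox`, `skewness` and the toy vector are names and values made up for this example.

```r
# Box-Cox transform for a given lambda; requires strictly positive values.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Simple moment-based skewness, used only to compare before and after
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

x <- c(1, 2, 3, 5, 8, 13, 40)   # toy right-skewed values
skewness(x)                     # skew before the transform
skewness(box_cox(x, 0))         # skew after the log case (lambda = 0)
```

With lambda = 0 the transform reduces to a plain log, which pulls in the long right tail of the toy data.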

## Yeo-Johnson Transform
#### The YeoJohnson transform is another power transform like Box-Cox, but it supports raw values that are zero or negative.

In [None]:
# load packages
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# summarize pedigree and age
summary(PimaIndiansDiabetes[,7:8])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,7:8], method=c("YeoJohnson"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,7:8])
# summarize the transformed dataset (note pedigree and age)
summary(transformed)
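
The reason Yeo-Johnson can handle zero and negative values is that it uses a different formula on each side of zero. A hand-written sketch of the standard definition follows. As with Box-Cox, caret estimates lambda from the data, while this example takes it as an argument; the function name is hypothetical.

```r
# Yeo-Johnson transform for a given lambda (piecewise in the sign of x).
yeo_johnson <- function(x, lambda) {
  out <- numeric(length(x))
  pos <- x >= 0
  # non-negative values: Box-Cox applied to (x + 1)
  if (abs(lambda) < 1e-8) {
    out[pos] <- log1p(x[pos])
  } else {
    out[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  }
  # negative values: mirrored Box-Cox with power (2 - lambda)
  if (abs(lambda - 2) < 1e-8) {
    out[!pos] <- -log1p(-x[!pos])
  } else {
    out[!pos] <- -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }
  out
}

yeo_johnson(c(-2, -1, 0, 1, 2), lambda = 1)   # lambda = 1 leaves values unchanged
yeo_johnson(c(-2, -1, 0, 1, 2), lambda = 0)
```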

## Principal Component Analysis Transform
- The PCA transform converts the data to its principal components, a technique from
multivariate statistics and linear algebra.
- The transform keeps as many components as are needed to reach the
variance threshold (thresh, default=0.95), or the number of components can be specified (pcaComp).
- The result is attributes that are uncorrelated, useful for algorithms like linear and generalized
linear regression.

In [None]:
# load the packages (preProcess() comes from caret; iris ships with R)
library(caret)
# load the dataset
data(iris)
# summarize dataset
summary(iris)
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(iris, method=c("center", "scale", "pca"))
# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, iris)
# summarize the transformed dataset
summary(transformed)
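
The variance-threshold rule can be reproduced with base R's prcomp() to see what is being kept. This is a sketch of the idea rather than caret's internal code, and the 0.95 cut-off is simply copied from the default mentioned above.

```r
# PCA on the centered and scaled numeric columns of iris.
data(iris)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Share of total variance explained by each component, and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Keep the smallest number of components that reaches the threshold
n_keep <- which(cum_var >= 0.95)[1]
components <- pca$x[, seq_len(n_keep), drop = FALSE]

round(cum_var, 4)
round(cor(pca$x), 10)   # off-diagonal entries are 0: components are uncorrelated
```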

## Independent Component Analysis Transform
- The ICA transform converts the data to its independent components.
- Unlike PCA, ICA retains those components that are independent.
- You must specify the number of desired independent components with the n.comp argument.
- This transform may be useful for algorithms such as Naive Bayes.

In [None]:
# load packages
library(mlbench)
library(caret)
# load the dataset
data(PimaIndiansDiabetes)
# summarize dataset
summary(PimaIndiansDiabetes[,1:8])
# calculate the pre-process parameters from the dataset
preprocessParams <- preProcess(PimaIndiansDiabetes[,1:8], method=c("center", "scale",
    "ica"), n.comp=5)

# summarize transform parameters
print(preprocessParams)
# transform the dataset using the parameters
transformed <- predict(preprocessParams, PimaIndiansDiabetes[,1:8])
# summarize the transformed dataset
summary(transformed)

## Resampling Methods To Estimate Model Accuracy
The example below splits the iris dataset so that 80% is used for training a Naive Bayes
model and 20% is used to evaluate the model's performance.

In [10]:
# load the packages
library(caret)
library(klaR)
# load the iris dataset
data(iris)
# define an 80%/20% train/test split of the dataset
trainIndex <- createDataPartition(iris$Species, p=0.80, list=FALSE)
dataTrain <- iris[ trainIndex,]
dataTest <- iris[-trainIndex,]
# train a naive Bayes model
fit <- NaiveBayes(Species~., data=dataTrain)
# make predictions
predictions <- predict(fit, dataTest[,1:4])
# summarize results
confusionMatrix(predictions$class, dataTest$Species)
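
createDataPartition() samples within each class, so the class proportions are preserved in both parts of the split. A base R sketch of that stratified sampling idea is shown below; the seed and variable names are assumptions for the example, and it is not caret's actual code.

```r
# Stratified 80%/20% split: sample 80% of the row indices within each species.
data(iris)
set.seed(7)                                   # assumed seed, for reproducibility
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}), use.names = FALSE)

train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

table(train$Species)   # 40 rows per species
table(test$Species)    # 10 rows per species
```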


## Bootstrap
- Bootstrap resampling involves taking random samples from the dataset (with replacement)
against which to evaluate the model.
- In aggregate, the results provide an indication of the variance of the model's performance.
- Typically, a large number of resampling iterations are performed (thousands or tens of thousands).
- The following example uses a bootstrap with 100 resamples to estimate the accuracy of a Naive Bayes model.

In [11]:
# load the package
library(caret)
# load the iris dataset
data(iris)
# define training control
trainControl <- trainControl(method="boot", number=100)
# evaluate the model
fit <- train(Species~., data=iris, trControl=trainControl, method="nb")
# display the results
print(fit)
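
The sampling mechanics behind a bootstrap can be sketched in base R without fitting any model. The example below only draws the resamples and records which rows are left out of each one (the "out-of-bag" rows, which are the ones a model fitted on the resample would be scored on). The seed and names are assumptions for this illustration.

```r
# Draw 100 bootstrap resamples and track the rows left out of each one.
data(iris)
set.seed(7)                      # assumed seed, for reproducibility
n <- nrow(iris)
n_resamples <- 100
oob_fraction <- numeric(n_resamples)

for (b in seq_len(n_resamples)) {
  boot_rows <- sample(n, size = n, replace = TRUE)   # sampling with replacement
  oob_rows  <- setdiff(seq_len(n), boot_rows)        # rows never drawn
  # a model would be fit on iris[boot_rows, ] and scored on iris[oob_rows, ]
  oob_fraction[b] <- length(oob_rows) / n
}

mean(oob_fraction)   # roughly exp(-1), i.e. a bit over a third of the rows
```

Because rows are drawn with replacement, some rows appear several times in a resample and others not at all, which is what provides held-out data on every iteration.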


## k-fold Cross-Validation
- The k-fold cross-validation method involves splitting the dataset into k subsets. Each subset
is held out in turn while the model is trained on all other subsets.
- This process is repeated until accuracy is determined for each instance in the dataset, and an
overall accuracy estimate is provided.
- It is a robust method for estimating accuracy, and the size of k can tune the amount
of bias in the estimate, with popular values set to 5 and 10.
- The following example uses 10-fold cross-validation to estimate the accuracy of the Naive Bayes
algorithm on the iris dataset.

In [None]:
# load the package
library(caret)
# load the iris dataset
data(iris)
# define training control
trainControl <- trainControl(method="cv", number=10)
# evaluate the model
fit <- train(Species~., data=iris, trControl=trainControl, method="nb")
# display the results
print(fit)
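
To show what trainControl(method="cv") automates, here is a hand-rolled 10-fold loop in base R. To stay free of extra packages it uses a deliberately simple nearest-centroid classifier as a stand-in for Naive Bayes; that swap, the helper name `predict_nearest_centroid`, and the seed are all assumptions of this sketch.

```r
# Manual 10-fold cross-validation on iris.
data(iris)
set.seed(7)                                   # assumed seed, for reproducibility
k <- 10
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))   # fold id per row

# Hypothetical stand-in model: assign each row to the closest class mean
predict_nearest_centroid <- function(train, test) {
  centroids <- aggregate(. ~ Species, data = train, FUN = mean)
  test_x <- as.matrix(test[, 1:4])
  dists <- sapply(seq_len(nrow(centroids)), function(j) {
    rowSums(sweep(test_x, 2, unlist(centroids[j, -1]), "-")^2)
  })
  centroids$Species[max.col(-dists, ties.method = "first")]
}

# Hold out each fold once, train on the rest, and score the held-out rows
fold_accuracy <- sapply(seq_len(k), function(i) {
  held_out <- folds == i
  predicted <- predict_nearest_centroid(iris[!held_out, ], iris[held_out, ])
  mean(predicted == iris$Species[held_out])
})

table(folds)          # 15 rows in each fold
mean(fold_accuracy)   # overall cross-validated accuracy estimate
```

Every row is held out exactly once, so the mean over folds uses each instance for evaluation one time.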

## Repeated k-fold Cross-Validation
- The process of splitting the data into k folds can be repeated a number of times; this is called
Repeated k-fold Cross-Validation.
- The final model accuracy is taken as the mean across the repeats.
- The following example demonstrates 10-fold cross-validation with 3 repeats
to estimate the accuracy of the Naive Bayes algorithm on the iris dataset.

In [None]:
# load the package
library(caret)
# load the iris dataset
data(iris)
# define training control
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
# evaluate the model
fit <- train(Species~., data=iris, trControl=trainControl, method="nb")
# display the results
print(fit)

# Summary
## Tips For Evaluating Algorithms

- Using a data split into a training and test set is a good idea when you have a lot of data
and you are confident that your training sample is representative of the larger dataset.
- Using a data split is very efficient and is often used to get a quick estimate of model
accuracy.
- Cross-validation is a gold standard for evaluating model accuracy, often with k-folds set
to 5 or 10 to balance overfitting the training data with a fair accuracy estimate.
- Repeated k-fold cross-validation is preferred when you can afford the computational
expense and require a less biased estimate.