# Pre-Processing
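- Standardize (center and scale) predictors that are skewed or highly variable
- Impute missing data
- Pre-processing can also be passed to the `train()` function through its `preProcess` argument, which standardizes all predictors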
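
As a minimal sketch of the last point, the cell below passes centering and scaling to `train()` through its `preProcess` argument. The choice of `method = "glm"` and the name `modelFit` are assumptions made for illustration only, and the cell repeats the data loading and split used elsewhere in this notebook so that it runs on its own.

```r
library(caret)
library(kernlab)  # provides the spam dataset
data(spam)

set.seed(32323)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

# Ask train() to center and scale every predictor before fitting.
# method = "glm" is only an illustrative model choice; glm may emit
# warnings about fitted probabilities on this dataset.
modelFit <- train(type ~ ., data = training,
                  preProcess = c("center", "scale"),
                  method = "glm")
modelFit
```

Letting `train()` handle the pre-processing means the same transformation is reapplied automatically when the fitted model is used for prediction.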

In [1]:
library(caret)
library(kernlab) #spam dataset
data(spam)

Loading required package: lattice
Loading required package: ggplot2

Attaching package: ‘kernlab’

The following object is masked from ‘package:ggplot2’:

    alpha



In [2]:
# Split based on type
set.seed(32323)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain,]
testing <- spam[-inTrain,]
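
With the split in place, standardization can be sketched as follows: the centering and scaling parameters are estimated on the training set only and then applied to both sets with `predict()`. The names `preObjStd`, `trainCapAveS` and `testCapAveS` are illustrative and not part of the original notebook.

```r
library(caret)
library(kernlab)  # provides the spam dataset
data(spam)

set.seed(32323)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

# Column 58 is the outcome (type), so it is left out of pre-processing
preObjStd <- preProcess(training[, -58], method = c("center", "scale"))

# Apply the training-set means and standard deviations to both sets
trainCapAveS <- predict(preObjStd, training[, -58])$capitalAve
testCapAveS  <- predict(preObjStd, testing[, -58])$capitalAve

# Training values have mean 0 and sd 1 (up to rounding);
# test values generally will not, because they reuse the training parameters
c(mean(trainCapAveS), sd(trainCapAveS))
c(mean(testCapAveS), sd(testCapAveS))
```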

<hr>

# Imputing Data
- Build a pre-processing model with `preProcess()`
- Apply that model with `predict()` to impute and standardize the data
- Assign the imputed, standardized variable to a new variable

In [6]:
set.seed(13343)

# Make some values NA
training$capAve <- training$capitalAve
selectNA <- rbinom(dim(training)[1], size = 1, prob = 0.05) == 1
training$capAve[selectNA] <- NA

In [9]:
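# Impute and standardize: knnImpute also centers and scales the predictors
# Column 58 is the outcome (type), so it is excluded
preObj <- preProcess(training[, -58], method = "knnImpute")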
preObj

Created from 3270 samples and 58 variables

Pre-processing:
  - centered (58)
  - ignored (0)
  - 5 nearest neighbor imputation (58)
  - scaled (58)


In [10]:
capAve <- predict(preObj, training[, -58])$capAve
head(capAve)

In [30]:
# predict() returns a new data frame; the training set itself stays unchanged
head(training)

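| | make | address | all | num3d | our | over | remove | internet | order | mail | ⋯ | charRoundbracket | charSquarebracket | charExclamation | charDollar | charHash | capitalAve | capitalLong | capitalTotal | type | capAve |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.64 | 0.64 | 0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ⋯ | 0.0 | 0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | spam | 3.756 |
| 2 | 0.21 | 0.28 | 0.5 | 0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ⋯ | 0.132 | 0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | spam | 5.114 |
| 4 | 0.0 | 0.0 | 0.0 | 0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ⋯ | 0.137 | 0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | spam | 3.537 |
| 5 | 0.0 | 0.0 | 0.0 | 0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ⋯ | 0.135 | 0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | spam | 3.537 |
| 6 | 0.0 | 0.0 | 0.0 | 0 | 1.85 | 0.0 | 0.0 | 1.85 | 0.0 | 0.0 | ⋯ | 0.223 | 0 | 0.0 | 0.0 | 0.0 | 3.0 | 15 | 54 | spam | 3.0 |
| 8 | 0.0 | 0.0 | 0.0 | 0 | 1.88 | 0.0 | 0.0 | 1.88 | 0.0 | 0.0 | ⋯ | 0.206 | 0 | 0.0 | 0.0 | 0.0 | 2.45 | 11 | 49 | spam | 2.45 |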


In [16]:
# True values: the original variable, which contains no NAs
capAveTruth <- training$capitalAve
head(capAveTruth)

In [17]:
# Standardize the true values manually, using their own mean and standard deviation
capAveTruth <- (capAveTruth - mean(capAveTruth)) / sd(capAveTruth)
head(capAveTruth)

### Look at the Difference between Imputed and Real Values

In [33]:
quantile(capAve - capAveTruth)
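
The quantiles above mix rows that were imputed with rows that were only standardized. A minimal sketch of how the two groups could be inspected separately follows; it repeats the earlier steps so that it runs on its own. One assumption to flag: depending on the installed caret version, `knnImpute` may also need the `RANN` package to be available.

```r
library(caret)
library(kernlab)  # provides the spam dataset
data(spam)

set.seed(32323)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

# Recreate the variable with roughly 5% of its values set to NA
set.seed(13343)
training$capAve <- training$capitalAve
selectNA <- rbinom(nrow(training), size = 1, prob = 0.05) == 1
training$capAve[selectNA] <- NA

# Impute (and standardize) with k-nearest neighbors
preObj <- preProcess(training[, -58], method = "knnImpute")
capAve <- predict(preObj, training[, -58])$capAve

# Standardized true values for comparison
capAveTruth <- training$capitalAve
capAveTruth <- (capAveTruth - mean(capAveTruth)) / sd(capAveTruth)

# Differences for the rows that were imputed
quantile((capAve - capAveTruth)[selectNA])

# Differences for the rows that were never missing
quantile((capAve - capAveTruth)[!selectNA])
```

Differences in the second group come only from the slightly different means and standard deviations used for standardization, so they give a baseline against which the imputation error in the first group can be judged.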