---
title: "R Machine Learning Manual"
author: "Abu Nayeem"
date: "September 23, 2014"
output: html_document
---
#### Introduction
This manual consolidates my knowledge of machine learning in R. The first two questions to settle in any machine learning project are:
* What type is the outcome variable [factor vs. numeric]?
* How do you clean the dataset without removing valuable information?
Standard practice is to split the data into a training set [build the model], a cross-validation set [test the model], and a testing set [final, untouched data]. Evaluating on data the model has never seen guards against overfitting. This example is a classification problem, but a numerical (regression) problem follows a similar procedure.
##### Executive Summary
Health technology has focused on tools and gadgets that record how much training a person has done in a given period of time. However, almost no research has gone into tools or models that tell the trainee how well the exercises were performed. This project builds a machine learning model to determine whether a weight lifter performed the exercise correctly or made an error in the execution.
##### DataSet
The data set comes from the Human Activity Recognition project of Groupware@LES, which studied how well a weight lifting exercise was executed. Each participant wore sensors on the glove, belt, dumbbell and arm-band. These are items every weight lifter already uses, so the exercises were performed as they normally would be.
Each participant was asked to perform the lift in five ways: first, exactly as specified; second, throwing the elbows to the front; third, lifting the dumbbell only halfway; fourth, lowering the dumbbell only halfway; and finally, throwing the hips to the front. During each repetition the sensors recorded movements and rotations, including maximum and minimum accelerations, averages, and kurtosis, among others.
You can learn more here <http://groupware.les.inf.puc-rio.br/har>
#### Preprocessing
##### Loading libraries
```{r setup, message=FALSE}
set.seed(234)
library(caret) # the workhorse package; loads ggplot2 automatically
library(doMC) # enables parallel computing; loads parallel & iterators (not available on Windows)
library(nnet) # neural networks and multinomial log-linear models
library(randomForest) # random forests
library(kernlab) # kernel methods (SVMs), dimension reduction and more
library(e1071) # skewness() and other helpers; also required by some caret models
library(plyr) # split-apply-combine operations (load before dplyr)
library(dplyr) # data manipulation verbs (select, mutate, arrange)
library(gbm) # generalized boosted regression models
library(corrplot) # correlation plots
library(AppliedPredictiveModeling)
library(foreach) # used to grow random forests in parallel
library(doParallel) # parallel backend for foreach
library(ipred) # needed for treebag
library(rpart) # single classification trees (this method failed in this example)
registerDoMC(cores = 2) # register the number of cores for parallel processing
date() # record when the analysis was run
```
##### Extraction
```{r, results='hide'}
# The NA strings were chosen after inspecting the raw files
# stringsAsFactors = TRUE keeps classe a factor on R >= 4.0, where the default changed
trainingfile <- 'http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
training <- read.csv(trainingfile, na.strings = c("NA", "#DIV/0!"), stringsAsFactors = TRUE)
training <- tbl_df(training) # tidier printing (recent dplyr versions use as_tibble() instead)
testingfile <- 'http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
testing <- read.csv(testingfile, na.strings = c("NA", "#DIV/0!"), stringsAsFactors = TRUE)
testing <- tbl_df(testing)
```
##### Setup
It is an open question whether the cross-validation set should be kept pristine, that is, split off before any data cleaning is applied. I recommend trying both options to see whether it actually makes a difference. Holding out a cross-validation set costs data that could otherwise be used to fit the model. Here we take the worst-case scenario, in which the cross-validation set is also treated as pristine. Note: the testing set is still raw, so any cleaning applied to the training set must be something that can be applied to the testing set as well.
Splitting the training set into a smaller training set and a cross-validation set:
```{r, results='hide'}
inTrain <- createDataPartition(y = training$classe, p = 0.8, list = FALSE)
smalltraining <- training[inTrain, ]
crossvalidation <- training[-inTrain, ]
```
#### Data Cleaning
A) **Basics:** get a first overview of the data
```{r, results='hide'}
dim(training)
str(training)
summary(training)
```
B) **Handling missing values:** note that the strings that encode missing values should be declared when the data is read in
Plotting of missing values:
```{r}
# colMeans() of the NA indicator is the fraction missing; times 100 to match the "%" labels
qplot(1, 100 * colMeans(is.na(smalltraining)),
      geom = 'jitter',
      main = '% of missing values per variable',
      xlab = '', ylab = '% missing values') # one point per column
```
Removing every column that contains missing values:
```{r, results='hide'}
colSums(is.na(smalltraining)) # number of missing values in each column;
# check whether they are numerous enough to justify removal
NonNAIndex <- which(colSums(is.na(smalltraining)) > 0)
# despite its name, NonNAIndex holds the indices of the columns that contain NAs
RemoveNA <- smalltraining[ ,-NonNAIndex]
# new data frame without the columns that had missing values
```
Threshold method: choose a tolerable fraction of missing values and remove every column that exceeds it.
``` {r removeNA, eval=FALSE}
# vectorized version: drop columns with more than 50% missing values
NonNAIndex <- which(colSums(is.na(smalltraining)) > 0.5 * nrow(smalltraining))
RemoveNA <- smalltraining[ , -NonNAIndex]
# equivalent explicit loop with a named threshold
NA_threshold <- 0.50
nTrain <- nrow(smalltraining)
NonNAIndex <- integer(0) # start empty, then collect the columns to remove
for (i in seq_len(ncol(smalltraining))) { # visits every column, including the last
  nNA <- sum(is.na(smalltraining[[i]]))
  if ((nNA / nTrain) >= NA_threshold) {
    NonNAIndex <- c(NonNAIndex, i)
  }
}
RemoveNA <- smalltraining[ , -NonNAIndex]
```
C) **Removing Uninteresting Features:**
Removing Unrelated Features:
```{r}
# keep user_name (column 2) and the sensor measurements plus classe (columns 8:60);
# the row id, timestamp and window columns are dropped
compacttraining <- select(RemoveNA, 2, 8:60)
```
Removing Near Zero variance features:
```{r}
# flag columns whose variance is zero or close to zero
# saveMetrics = TRUE returns per-column diagnostics, which are REALLY useful
Nzv <- nearZeroVar(compacttraining,saveMetrics=TRUE)
Nzv # all FALSE, so no columns will be removed
Nzv <- nearZeroVar(compacttraining,saveMetrics=FALSE) # indices of the flagged columns
```
```{r, eval=FALSE}
# if there were columns to be removed, this would be used
# (only run it when Nzv is non-empty: an empty index would select zero columns)
compacttraining <- compacttraining[ ,-Nzv]
```
D) **Removing Highly Correlated Variables:** For numerical/integer columns only
Plotting Correlated Variables:
```{r}
corData<- cor(compacttraining[ ,c(2:53)])
corrplot(corData,
title = "Corr,per eigenVectors",
order = "AOE",
method = "color",
type = "lower",
tl.cex = 0.6 ) # plot to have a look to correlations
```
Removing Correlated Variables (Manual Method)
```{r, results='hide'}
M <- abs(cor(compacttraining[ ,c(2:53)])) # absolute correlation matrix
diag(M) <- 0 # the diagonal is 1 by default, so set it to zero
which(M > 0.8, arr.ind = TRUE) # row/column indices (with names) of the correlated pairs
which(M > 0.8, arr.ind = FALSE) # the same matches as linear indices into M
# you can remove one variable of each pair manually
descriptivetraining <- select(compacttraining,
      -c(magnet_arm_y , pitch_dumbbell, yaw_dumbbell , accel_arm_x,
         gyros_arm_y, pitch_belt, accel_belt_x, yaw_belt, total_accel_belt,
         accel_belt_y, accel_belt_z, gyros_forearm_y,
         gyros_dumbbell_z, gyros_dumbbell_x)) # 40 columns left
```
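Because the correlation matrix is symmetric, `which(M > 0.8, arr.ind = TRUE)` reports every correlated pair twice. A minimal sketch of one way to list each pair only once is to zero the lower triangle first. The toy data frame and its column names below are made up for illustration and are not part of the exercise data:

```r
# Hypothetical toy data: b and d are built from a, c is independent
set.seed(1)
a <- rnorm(100)
toy <- data.frame(a = a,
                  b = a + rnorm(100, sd = 0.1),    # nearly a copy of a
                  c = rnorm(100),                  # unrelated to the others
                  d = -a + rnorm(100, sd = 0.1))   # nearly the negative of a

M <- abs(cor(toy))                  # absolute correlations
M[lower.tri(M, diag = TRUE)] <- 0   # keep each pair once, drop the diagonal

high_pairs <- which(M > 0.8, arr.ind = TRUE)   # matrix with columns "row" and "col"
pair_table <- data.frame(var1 = rownames(M)[high_pairs[, "row"]],
                         var2 = colnames(M)[high_pairs[, "col"]],
                         abs_cor = M[high_pairs],
                         stringsAsFactors = FALSE)
pair_table
```

From a table like this you can decide which member of each pair to drop, which is the same decision made by hand in the `select()` call above.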
Alternative removal method: note that this method leaves 42 variables
```{r, eval=FALSE}
descrCor <- cor(compacttraining[ ,c(2:53)])
highlyCorDescr <- findCorrelation(descrCor, cutoff = 0.8)
# the indices refer to columns 2:53, so shift them by one to index compacttraining
descriptivetraining <- compacttraining[, -(highlyCorDescr + 1)] # 42 variables left
```
Removing heavily skewed features [not useful in predictions]: remember, numerical columns only
``` {r removeskew}
factordescriptivetrain<-descriptivetraining[, c(1,40)] # non-numerical columns: user_name and classe
numdescriptivetrain<-descriptivetraining[, -c(1,40)] # numerical columns
NonskewIndex<-which(apply(numdescriptivetrain, 2,
      function(x) abs(skewness(x)) > 6)) # despite its name: indices of the skewed columns
numdescriptivetrain <- numdescriptivetrain[, -NonskewIndex] # remove skewed columns
cleandata <- cbind(factordescriptivetrain,numdescriptivetrain) # combine into the clean data
```
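For reference, skewness is the third central moment divided by the standard deviation cubed. The sketch below is a plain moment-based version written for illustration; `sample_skewness` is a hypothetical helper, and `e1071::skewness()` may differ slightly for small samples depending on its `type` argument:

```r
# Moment-based skewness: mean cubed deviation over (mean squared deviation)^1.5
sample_skewness <- function(x) {
  x <- x[!is.na(x)]        # ignore missing values
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

set.seed(1)
symmetric    <- rnorm(5000)   # symmetric distribution, skewness near 0
right_skewed <- rexp(5000)    # long right tail, clearly positive skewness

sample_skewness(symmetric)
sample_skewness(right_skewed)
```

A cutoff as high as 6 in absolute value, as used above, only removes columns with extreme tails or a few very large outliers.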
E) **Exploratory Analysis:** [if you have an idea which variables are of concern]
```{r, message=FALSE}
require(gridExtra)
require(ggplot2)
p1 <- qplot(classe,yaw_belt,geom="boxplot",data=smalltraining,fill=classe)
p2 <- qplot(classe,pitch_forearm,geom="boxplot",data=smalltraining,fill=classe)
p3 <- qplot(classe,magnet_dumbbell_z,geom="boxplot",data=smalltraining,fill=classe)
p4 <- qplot(classe,magnet_belt_z,geom="boxplot",data=smalltraining,fill=classe)
grid.arrange(p1,p2,p3,p4, ncol=2)
```
##### Complete the Column Index
```{r, results='hide'}
colIndex <- colnames(cleandata) # 38 columns should remain
check <- smalltraining[, colIndex]
check # the column names should be identical to those of cleandata
```
#### Training Models
**Pre-training:** Save your results so you don't need to repeat the analysis every time; the chunk below loads them if they exist. Also note that the rpart method failed.
```{r}
if(file.exists("Machine Learning.RData")) {
load("Machine Learning.RData")
}
```
**Random Forests:** Typically you may want to build smaller trees
The classe variable is categorical, so a classification method is the appropriate choice. One could use a single tree, but random forests are usually among the most accurate classification algorithms, mainly because averaging many randomized trees reduces variance. The out-of-bag (oob) error rate is important in this model:
In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows: each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree. Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n averaged over all cases is the oob error estimate. This has proven to be unbiased in many tests.
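The "about one-third" figure can be checked with a short simulation. This is only a sketch: the seed, the number of cases and the number of bootstrap samples are arbitrary values chosen for illustration. Each bootstrap sample draws n cases with replacement, and a given case is never drawn with probability (1 - 1/n)^n, which approaches exp(-1):

```r
set.seed(1)        # arbitrary seed, for reproducibility only
n <- 10000         # hypothetical number of training cases
n_samples <- 200   # hypothetical number of bootstrap samples (one per tree)

oob_fraction <- replicate(n_samples, {
  in_bag <- sample.int(n, size = n, replace = TRUE)  # one bootstrap draw
  1 - length(unique(in_bag)) / n                     # share of cases never drawn
})

mean(oob_fraction)   # simulated out-of-bag share per tree
(1 - 1 / n)^n        # exact probability that a given case is left out
exp(-1)              # limit for large n
```

Each tree therefore has a sizeable set of cases it never saw, which is what makes the oob error usable as an internal test-set estimate.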
**Method 1:** Standard Random Forest model
Run the model: [a direct `randomForest()` call runs on a single core; the registered parallel backend is only used by `foreach` and `train`]
```{r, eval=FALSE}
registerDoParallel()
Trfor1<- system.time(rf1 <- randomForest(classe ~ .,
data = smalltraining[,colIndex],
importance=TRUE))
```
Check Predictions:
```{r}
rf1 # OOB estimate of error rate: 0.77%
rf1predictions1 <- predict(rf1, crossvalidation)
confusionMatrix(rf1predictions1,crossvalidation$classe)
rfor1<- confusionMatrix(rf1predictions1,crossvalidation$classe)
```
Assessment: which variables are most influential? (exclusive to random forest)
```{r}
varImpPlot(rf1,pch=20,col="blue")
```
Plot: Choosing the right number of trees
``` {r}
plot(rf1, log="y")
legend("topright", colnames(rf1$err.rate), cex=0.8, fill=1:6) # one colour per err.rate column (OOB + 5 classes)
```
**Method 2:** We now build 6 random forests with 150 trees each and combine them into a single forest of 900 trees, using parallel processing to grow them. Note: plotting this forest produces an error.
Set up and train model
```{r, eval=FALSE}
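rfData <- smalltraining[, colIndex]
# drop the outcome by name: classe is column 2 of colIndex, not column 38,
# so indexing with -38 would leave the outcome among the predictors
x <- rfData[, colnames(rfData) != "classe"]
y <- smalltraining$classe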
Trfor2<- system.time(rf2 <- foreach(ntree=rep(150, 6),
.combine=randomForest::combine,
.packages='randomForest')
%dopar% {
randomForest(x, y, ntree=ntree)
})
```
Check Prediction
```{r}
rf2 # 900 trees in total; an OOB rate of 0% would suggest that classe leaked into the predictors
# remove the columns with missing values before predicting
NonNAIndex <- which(colSums(is.na(crossvalidation)) > 0)
cross <- crossvalidation[ ,-NonNAIndex]
# cross is crossvalidation without the columns that contain missing values
rf2predictions <- predict(rf2, cross)
confusionMatrix(rf2predictions,cross$classe)
rfor2<- confusionMatrix(rf2predictions,crossvalidation$classe)
# 100% accuracy on held-out data is suspicious; check the predictors for leakage
# predicting back on the training data (RemoveNA is derived from smalltraining)
rf2predictions2 <- predict(rf2, RemoveNA)
confusionMatrix(rf2predictions2,RemoveNA$classe)
# near-100% accuracy is expected here and says nothing about generalization
```
Assessment: which variables are most influential?
```{r}
varImpPlot(rf2,pch=20,col="blue")
```
**SVM RADIAL:** Support vector machines can be used for both classification and regression. The radial (RBF) kernel measures the similarity of two observations as a decreasing function of their squared Euclidean distance. With 2-fold cross-validation the sample is split in two; the model is trained on each half and evaluated on the other, and the results are used to choose the tuning parameters of the final model. More folds leave more data for training in each round, but they also mean more model fits and smaller held-out sets, so choose the number of folds with care.
* I customized the `trainControl` settings to perform 2-fold cross-validation.
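To make the fold mechanics concrete, here is a minimal base R sketch of how k-fold cross-validation partitions the rows. It is not how caret does it internally (caret also balances the classes across folds); the sample size and seed are arbitrary values for illustration:

```r
set.seed(1)   # arbitrary seed
n <- 10       # hypothetical number of rows
k <- 2        # number of folds, as in trainControl(method = "cv", number = 2)

# Assign every row to exactly one fold, with fold sizes as equal as possible
fold_id <- sample(rep(seq_len(k), length.out = n))
folds <- split(seq_len(n), fold_id)   # list of held-out row indices per fold

for (f in seq_len(k)) {
  held_out   <- folds[[f]]                      # rows used for validation
  train_rows <- setdiff(seq_len(n), held_out)   # remaining rows used for fitting
  cat("fold", f, ": train on", length(train_rows),
      "rows, validate on", length(held_out), "rows\n")
}
```

With k = 2 each model sees only half of the data; with k = 10 each model would see 90% of it, at the cost of ten fits instead of two.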
Set up and run the model:
```{r, eval=FALSE}
tC <- trainControl(method = "cv", number = 2) # 'cv' = k-fold cross-validation; number = 2 sets k
TSVMRad<- system.time(SVMRadial1 <- train(classe ~ .,
method = "svmRadial",
trControl = tC,
data = smalltraining[, colIndex]))
```
Check predictions:
```{r}
SVMRadpredictions1 <- predict(SVMRadial1, crossvalidation)
confusionMatrix(SVMRadpredictions1,crossvalidation$classe)
SVMRad <- confusionMatrix(SVMRadpredictions1,crossvalidation$classe)
```
**SVM RADIAL COST:** Similar to the above, but only the cost parameter C is tuned. C is the penalty on margin violations, so smaller values regularize more strongly and reduce the risk of overfitting
Set up and run the model:
```{r, eval=FALSE}
# model creation and test
tC <- trainControl(method = "cv", number = 2)
TSVMRadCost<- system.time(SVMRadialCost1 <- train(classe ~ .,
method = "svmRadialCost",
trControl = tC,
data = smalltraining[, colIndex]))
```
Check predictions:
```{r}
SVMRadCostpredictions1 <- predict(SVMRadialCost1, crossvalidation)
confusionMatrix(SVMRadCostpredictions1,crossvalidation$classe)
SVMRadCost<- confusionMatrix(SVMRadCostpredictions1,crossvalidation$classe)
```
**TREE BAG:** bagging fits many classification trees, each on a bootstrap resample of the training data, and aggregates their votes
Setup model and train it:
```{r, eval=FALSE}
tC <- trainControl(method = "cv", number = 2)
TTB<- system.time(treebag1 <- train(classe ~ .,
method = "treebag",
trControl = tC,
data = smalltraining[, colIndex]))
```
Check predictions:
```{r}
treepredictions1 <- predict(treebag1, crossvalidation)
confusionMatrix(treepredictions1,crossvalidation$classe)
TB<- confusionMatrix(treepredictions1,crossvalidation$classe)
```
**Classification Tree:** the simplest form of tree model, a single classification tree
``` {r, eval=FALSE}
# apply classification tree
TCT<- system.time(Classtree1 <- train(classe ~ .,
method="rpart",
data = smalltraining[, colIndex]))
```
```{r}
Classtreepredictions1 <- predict(Classtree1, crossvalidation)
confusionMatrix(Classtreepredictions1, crossvalidation$classe)
CT <- confusionMatrix(Classtreepredictions1, crossvalidation$classe)
```
**Gradient Boosting (GBM):** a machine learning technique that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion, like other boosting methods, and generalizes them by allowing optimization of an arbitrary differentiable loss function. Although it is often introduced for regression, it handles classification problems as well by using a suitable loss function.
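The stage-wise idea can be sketched in a few lines of base R. This is a toy illustration under simplifying assumptions, not the gbm implementation: it uses squared-error loss on a one-dimensional regression problem with one-split "stumps" as weak learners, and all names and data are hypothetical. For squared error the negative gradient is simply the current residual, so each stage fits a stump to the residuals and adds a shrunken copy of it to the prediction:

```r
# Fit a one-split stump to (x, r): choose the threshold with the smallest squared error
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between neighbouring values
  sse <- sapply(cuts, function(threshold) {
    left <- r[x <= threshold]
    right <- r[x > threshold]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}
predict_stump <- function(stump, x) ifelse(x <= stump$cut, stump$left, stump$right)

set.seed(1)                           # arbitrary seed
x <- runif(200, 0, 2 * pi)            # hypothetical predictor
y <- sin(x) + rnorm(200, sd = 0.2)    # hypothetical noisy response

shrinkage <- 0.1   # learning rate, as in the gbm tuning grid
n_trees <- 150     # number of boosting stages

pred <- rep(mean(y), length(y))       # stage 0: predict the overall mean
train_mse <- numeric(n_trees)
for (m in seq_len(n_trees)) {
  residual <- y - pred                                 # negative gradient of squared error
  stump <- fit_stump(x, residual)                      # weak learner fitted to the residuals
  pred <- pred + shrinkage * predict_stump(stump, x)   # small step towards the residuals
  train_mse[m] <- mean((y - pred)^2)
}
train_mse[c(1, 50, 150)]   # training error after 1, 50 and 150 stages
```

A smaller shrinkage value makes each step more cautious and usually needs more trees, which is the tradeoff behind the `shrinkage` and `n.trees` settings in the tuning grid.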
Take a smaller sample to train the model:
```{r}
sampletrain <- smalltraining[sample(nrow(smalltraining), 3000), ]
inTrain <- createDataPartition(y=sampletrain$classe, p=0.7, list=FALSE)
tinytraining <- sampletrain[inTrain, ]
tinycrossvalidation <- sampletrain[-inTrain, ]
```
Set up the grid and run the model:
```{r, eval=FALSE}
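gbmGrid <- expand.grid(interaction.depth = 5, # maximum depth of each tree (order of interactions)
                       n.trees = 150, # the total number of trees or iterations
                       shrinkage = 0.1, # the learning rate (step size)
                       n.minobsinnode = 10) # min. observations per terminal node; recent caret versions require it in the grid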
TGBM<- system.time(GBM1 <- train(classe ~ .,
method="gbm",
data=tinytraining[ ,colIndex],
tuneGrid = gbmGrid,
                                 verbose = FALSE)) # verbose = FALSE suppresses the per-iteration output
```
Check predictions:
```{r}
# test tiny crossvalidation
GBM1predictions<-predict(GBM1,tinycrossvalidation)
confusionMatrix(GBM1predictions,tinycrossvalidation$classe)
# test cross validation
GBM1predictions2<-predict(GBM1,crossvalidation)
confusionMatrix(GBM1predictions2,crossvalidation$classe)
GBM<- confusionMatrix(GBM1predictions2,crossvalidation$classe)
```
##### Comparing Models
Measuring accuracy and out-of-sample error:
```{r}
# sum up all the methods
FinalAccuracy <- data.frame(rfor1$overall[1], rfor2$overall[1], SVMRad$overall[1],
SVMRadCost$overall[1], TB$overall[1], CT$overall[1],
GBM$overall[1])
colnames(FinalAccuracy) <- c("rfor1", "rfor2", "SVMRad", "SVMRadCost", "TB", "CT","GBM")
rownames(FinalAccuracy) <- "Accuracy"
FinalAccuracy
# show the out-of-sample error
outOfSamErr <- 1-FinalAccuracy
rownames(outOfSamErr) <- "OSError"
outOfSamErr
```
Measuring Kappa (agreement corrected for chance):
```{r}
FinalKappa <- data.frame(rfor1$overall[2], rfor2$overall[2], SVMRad$overall[2],
SVMRadCost$overall[2], TB$overall[2], CT$overall[2],
GBM$overall[2])
colnames(FinalKappa) <- c("rfor1", "rfor2", "SVMRad", "SVMRadCost", "TB", "CT","GBM")
rownames(FinalKappa) <- "Kappa"
FinalKappa
```
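The accuracy, out-of-sample error and Kappa values collected above can all be derived from a confusion matrix. The sketch below shows the arithmetic on a made-up 3-class table; the counts are hypothetical and unrelated to the models in this manual. Kappa compares the observed agreement with the agreement expected by chance from the row and column totals:

```r
# Hypothetical confusion matrix: rows are predictions, columns are true classes
cm <- matrix(c(50,  5,  2,
                4, 40,  6,
                1,  5, 37),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("A", "B", "C"),
                             Reference  = c("A", "B", "C")))

n <- sum(cm)
observed <- sum(diag(cm)) / n                      # accuracy: share on the diagonal
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_hat <- (observed - expected) / (1 - expected)

c(Accuracy = observed, OutOfSampleError = 1 - observed, Kappa = kappa_hat)
```

Kappa is lower than accuracy whenever the classifier is imperfect, because part of the raw agreement would have occurred by chance.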
Measuring Size of each prediction model
```{r}
FinalSize <- data.frame(format(object.size(rf1), units = "MB"),
format(object.size(rf2), units = "MB"),
format(object.size(SVMRadial1), units = "MB"),
format(object.size(SVMRadialCost1), units = "MB"),
format(object.size(treebag1), units = "MB"),
format(object.size(Classtree1), units = "MB"),
format(object.size(GBM1), units = "MB"))
colnames(FinalSize) <- c("rfor1", "rfor2", "SVMRad", "SVMRadCost", "TB", "CT","GBM")
rownames(FinalSize) <- "Size"
FinalSize
```
Comparing computation time:
```{r}
FinalComp <- rbind(Trfor1, Trfor2, TSVMRad, TSVMRadCost, TTB, TCT, TGBM)
rownames(FinalComp) <- c("rfor1", "rfor2", "SVMRad", "SVMRadCost", "TB", "CT","GBM")
FinalComp
```
Complete Model Comparison:
```{r}
Group <- rbind(FinalKappa, outOfSamErr, FinalSize)
TGroup<- data.frame(t(Group)) # transform to matrix and transpose it
CompleteComparison<- data.frame(cbind(TGroup,FinalComp))
CompleteComparison <- mutate(CompleteComparison, usertime=user.self + user.child,
systime=sys.self + sys.child,
model = c("rfor1", "rfor2", "SVMRad", "SVMRadCost",
"TB", "CT","GBM"))
CompleteComparison<- CompleteComparison[, -c(4,5,7,8)]
CompleteComparison[, 1] <- round(as.numeric(as.character(CompleteComparison[, 1])), 3)
CompleteComparison[, 2] <- round(as.numeric(as.character(CompleteComparison[, 2])), 3)
CompleteComparison <- select(CompleteComparison,7,1:6)
CompleteComparison <- arrange(CompleteComparison, OSError)
CompleteComparison # without timestamp variables
```
##### Sub-comparison: including timestamp variables
Comparison with the run that included the timestamp variables [the difference was significant]:
```{r}
first
```
Notice that including the timestamp variables decreases the total size for almost all algorithms. Some models become considerably more expensive to compute when they are included, while others benefit from having one variable fewer when they are excluded. Either the extra variables provide more possibilities to match and separate the data, OR fewer variables make it easier to reach the goal. REGARDLESS, the tradeoffs are worth exploring in better-tuned models.
#### Conclusion
I've tested many machine learning models in this exercise. Normally we just choose the most accurate algorithm and move on, but we need to consider the entire pipeline of the project. The factors we should care about are accuracy/out-of-sample error, fitted model size, and elapsed and system time. With that said, the top three models are the random forest models, the gradient boosting model, and the treebag model. The treebag model takes up too much memory, so let's discard it. GBM and random forest are both very good candidates. Random forest models offer many additional diagnostics that shed light on the internal process, which helps in building a more efficient model (fewer trees, or removing the least interesting features). In contrast, GBM excels in model size and training time while still maintaining accuracy.
With the comparison chart you can run a quick diagnostic on which model you want to implement. Note that logistic regression would use a similar procedure, with a few differences.
##### Bibliography
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.
#### Additional Material
**Principal Component Analysis:** it can only handle numeric columns
```{r, eval=FALSE}
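# drop the first column (user_name) and the last column (classe) by position
preProcompact <- preProcess(compacttraining[, -c(1, ncol(compacttraining))], method="pca")
preProcdescriptive <- preProcess(descriptivetraining[, -c(1, ncol(descriptivetraining))], method="pca")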
```
Do NOT use PCA inside the model fit on large datasets: it either crashed R or took an enormous amount of time to complete.
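To see what the PCA step produces without involving a model fit, the transformation can be sketched with base R's `prcomp()` on a small simulated matrix. The data, dimensions and the 95% variance target below are assumptions for illustration (caret's `preProcess(method = "pca")` keeps enough components to reach its `thresh` argument, 0.95 by default):

```r
set.seed(1)   # arbitrary seed
n <- 300      # hypothetical number of rows
latent <- matrix(rnorm(n * 2), ncol = 2)            # two hidden factors
loadings <- matrix(runif(2 * 8, -1, 1), nrow = 2)   # hypothetical mixing weights
X <- latent %*% loadings + matrix(rnorm(n * 8, sd = 0.1), ncol = 8)  # 8 correlated columns

pca <- prcomp(X, center = TRUE, scale. = TRUE)       # centre and scale, then rotate
var_explained <- pca$sdev^2 / sum(pca$sdev^2)        # share of variance per component
n_comp <- which(cumsum(var_explained) >= 0.95)[1]    # components needed for 95%

scores <- pca$x[, seq_len(n_comp), drop = FALSE]     # reduced feature matrix
round(cumsum(var_explained), 3)
dim(scores)
```

The reduced `scores` matrix has far fewer columns than `X`, which is the dimension reduction PCA is meant to provide.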
##### Per person approach
##### Train classifier for training subset
Based on the findings from the previous section, we'll learn a separate predictor for each user.
```{r trainSubset, cache=TRUE, eval=FALSE, echo=FALSE}
users <- sort(unique(train_red$user_name))
setkey(train_red, user_name)
train_red_split <- lapply(users, function(x){train_red[data.table(x)]})
mdl <- lapply(users, function (x){train(form=.outcome~.,
                                        data=train_red_split[[x]],
                                        method='rf',
                                        trControl = trainControl(method='cv',
                                                                 number=10,
                                                                 allowParallel=T,
                                                                 savePredictions=T))
})
# finalModel$predicted holds the OOB predictions, finalModel$y the observed classes
tmp <- sapply(users, function(x){cbind(pred=as.character(mdl[[x]]$finalModel$predicted),
                                       classe=as.character(mdl[[x]]$finalModel$y))})
tmp <- data.table(rbind(tmp[[1]],tmp[[2]], tmp[[3]], tmp[[4]], tmp[[5]], tmp[[6]]))
confusionMatrix(tmp$pred, tmp$classe)
```
##### Apply model to test subset and determine out of sample error
```{r classifySubset, cache=TRUE, eval=FALSE, echo=FALSE}
setkey(test_red, user_name)
test_red_split <- lapply(users, function(x){test_red[data.table(x)]})
preds <- lapply(users, function (x){predict(mdl[[x]], test_red_split[[x]])})
tmp <- sapply(users, function(x){cbind(pred=as.character(preds[[x]]),
classe=as.character(test_red_split[[x]]$.outcome))})
tmp <- data.table(rbind(tmp[[1]],tmp[[2]], tmp[[3]], tmp[[4]], tmp[[5]], tmp[[6]]))
confusionMatrix(tmp$pred, tmp$classe)
```
The out-of-sample error seems to be well under control.
##### Train classifier on full training set
```{r trainFull, cache=TRUE, eval=FALSE, echo=FALSE}
train_full_red <- train_raw[, naCount==0, with=F]
setnames(train_full_red, 1, '.outcome')
setkey(train_full_red, user_name)
train_full_red_split <- lapply(users, function(x){train_full_red[data.table(x)]})
mdl <- lapply(users, function (x){train(form=.outcome~.,
data=train_full_red_split[[x]],
method='rf',
trControl = trainControl(method='cv',
number=10,
allowParallel=T,
savePredictions=T))
})
# finalModel$predicted holds the OOB predictions, finalModel$y the observed classes
tmp <- sapply(users, function(x){cbind(pred=as.character(mdl[[x]]$finalModel$predicted),
                                       classe=as.character(mdl[[x]]$finalModel$y))})
tmp <- data.table(rbind(tmp[[1]],tmp[[2]], tmp[[3]], tmp[[4]], tmp[[5]], tmp[[6]]))
confusionMatrix(tmp$pred, tmp$classe)
```