### JHU-DSCI Practical Machine Learning - Peer-graded Assignment

* Name: Walter Yu
* Date: January 2021

### Submission Notes

1. This project is an analysis of the Human Activity Recognition [dataset][01.00].

[01.00]: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har

2. Some charts and analysis output have been commented out to keep the document short. Given the project scope, it was a challenge to document all work while staying under the maximum page length.

3. The course mentor [example project][01.01] and [model training example][01.02] were referenced while developing this project. They helped resolve random forest training times, which were excessive without parallel processing.

[01.01]: http://lgreski.github.io/practicalmachinelearning/
[01.02]: https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-randomForestPerformance.md

4. The caret package could not be installed on my Windows installation of Anaconda, so the project was completed in a Jupyter Notebook on [Google Colab][01.03]. All files and output remain the same, but the project was completed with Jupyter in lieu of R Markdown.

[01.03]: https://stackoverflow.com/questions/54595285/how-to-use-r-with-google-colaboratory

### Methodology

1. The project is organized into 3 parts: 1) data processing, 2) model training, and 3) model testing. Each part includes the analysis and code for the steps taken to train and test the model and make predictions.

2. Both the train and test datasets contain 160 columns; visual inspection and a per-column null count show that many of them contain all null values. Also, the assignment instructions and HAR dataset documentation indicate that the exercise movements and whether they were performed correctly are of primary interest. As a result, columns containing null values and columns irrelevant to the prediction task were removed from both datasets.

3. Random forest was selected for the machine learning model because the course lecture videos and forum discussions describe it as among the most accurate methods without excessive model complexity.

4. The parallel and doParallel packages were used to speed up model training, and the course mentor example was referenced as noted above.

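5. The training dataset was split 70/30 into train/test partitions, but the final model was fit on the full training dataset with 5-fold cross validation, which provides the out-of-sample accuracy estimate. Results were evaluated with a cross-validated confusion matrix. Predictions were made with the testing dataset.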
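
The 70/30 partition mentioned in step 5 can be illustrated with a small hand-rolled sketch. It is not caret's createDataPartition implementation; it only shows the idea of sampling 70% of the rows within each class so that class proportions are preserved. The toy labels and the names `idx_by_class` and `in_train` are assumptions made for illustration.

```r
set.seed(3523)

# Toy labels standing in for train_data$classe (hypothetical data).
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Group row indices by class, then sample 70% of the indices within each class.
idx_by_class <- split(seq_along(classe), classe)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE))

# Rows in in_train form the training partition; the rest form the hold-out.
train_labels <- classe[in_train]
test_labels <- classe[-in_train]

print(table(train_labels))
print(table(test_labels))
```

Sampling within each class keeps the class mix of both partitions close to that of the full dataset, which matters when classes are unevenly represented.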


### Data Processing

Data was prepared for training and testing as follows:

1. Read in the train and test data files with the na.strings option to flag null values.
2. Review the data and identify null values, since they need to be removed or imputed. The dataset was subset to remove columns that contained any null values.
3. Review the data and identify columns with information not relevant to the prediction task, e.g. user names, timestamps, etc. The dataset was subset to keep only relevant columns and so minimize the impact on model training/testing performance.
4. Dataset dimensions were reviewed before and after each subset to verify results.
5. The same data processing steps were applied to both the train and test datasets.

* Note: The course mentor guides for random forest modeling were referenced to set up model training. A confusion matrix was used to verify results.
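
As a minimal sketch of the steps above, the following toy example reads a small inline CSV with na.strings, counts nulls per column, and drops the affected columns. The CSV content and the column names are made up for illustration and are much smaller than the real files; only the read.csv, is.na, and colSums calls mirror the project code.

```r
# Toy CSV standing in for the real data files (hypothetical values).
csv_text <- "user,roll_belt,kurtosis_belt,classe
u1,1.41,#DIV/0!,A
u2,1.42,NA,B
u3,1.48,,A"

# Treat 'NA', '#DIV/0!' and empty fields as null values while reading.
toy <- read.csv(
    text = csv_text,
    na.strings = c("NA", "#DIV/0!", "")
)

# Count null values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the columns without any null values.
toy_clean <- toy[, na_counts == 0, drop = FALSE]

# Drop the identifier column (position 1 in this toy example).
toy_clean <- toy_clean[, -1, drop = FALSE]

# Compare dimensions before and after the subset.
print(dim(toy))
print(dim(toy_clean))
```

Using drop = FALSE keeps the result a data frame even if only one column survives the filter.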


In [1]:
# 01.00, final project: data processing

# reference: read csv data and fill null values
# https://stackoverflow.com/questions/6299220/access-a-url-and-read-data-with-r
# https://stackoverflow.com/questions/25771071/r-read-csv-more-columns-than-column-names-error
train_data = read.csv(
    url("https://raw.githubusercontent.com/walteryu/jhu-ml-project/main/pml-training.csv"),
    na.strings = c('NA','NAN','#DIV/0!','NaN','')
)
print("train_data dimensions (all cases): ")
dim(train_data)
# import test dataset with same options
test_data = read.csv(
    url("https://raw.githubusercontent.com/walteryu/jhu-ml-project/main/pml-testing.csv"),
    na.strings = c('NA','NAN','#DIV/0!','NaN','')
)
print("test_data dimensions (all cases): ")
dim(test_data)

# drop columns that contain any null values
# https://stackoverflow.com/questions/15968494/how-to-delete-columns-that-contain-only-nas/45383054
# https://stackoverflow.com/questions/25188051/using-is-na-in-r-to-get-column-names-that-contain-na-values
train_data <- train_data[, colSums(is.na(train_data)) == 0]
print("train_data dimensions (remove null columns): ")
dim(train_data)
# drop the first seven columns, which are not relevant to prediction (user names, timestamps, etc.)
train_data <- train_data[, -c(1:7)]
print("train_data dimensions (remove irrelevant columns")
dim(train_data)

# apply the same steps to the test data: drop columns that contain any null values
# https://stackoverflow.com/questions/15968494/how-to-delete-columns-that-contain-only-nas/45383054
# https://stackoverflow.com/questions/25188051/using-is-na-in-r-to-get-column-names-that-contain-na-values
test_data <- test_data[, colSums(is.na(test_data)) == 0]
print("test_data dimensions (remove null columns): ")
dim(test_data)
# drop the first seven columns, which are not relevant to prediction (user names, timestamps, etc.)
test_data <- test_data[, -c(1:7)]
print("test_data dimensions (remove irrelevant columns")
dim(test_data)

# subset first n rows to speed up training
# print("train_data dimensions (before subset): ")
# dim(train_data)
# df_training = train_data[1:10000,]
# print("train_data dimensions (after subset): ")
# dim(df_training)


[1] "train_data dimensions (all cases): "


[1] "test_data dimensions (all cases): "


[1] "train_data dimensions (remove null columns): "


[1] "train_data dimensions (remove irrelevant columns"


[1] "test_data dimensions (remove null columns): "


[1] "test_data dimensions (remove irrelevant columns"


### Model Training

As described in the methodology, random forest was selected for its accuracy and its suitability for this classification task. Model training was completed as follows:

1. Configure the parallel and doParallel packages for model training.
2. Create and configure a trainControl object that allows parallel processing.
3. Perform 5-fold cross validation using the trainControl object.
4. Train the model with random forest and stop the cluster when done.
5. Review the model fit summary and confusion matrix to verify results.

* Note: The course mentor guides for random forest modeling were referenced to set up model training. A confusion matrix was used to verify results.
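
The cluster setup and teardown in steps 1 and 4 can be sketched on its own with the parallel package, independent of caret. This is an illustrative pattern rather than the project's exact code: the cap of two workers and the toy squaring task are assumptions, and tryCatch is used here so the cluster is stopped even if the work fails.

```r
library(parallel)

# Leave one core free, but never request fewer than one worker.
# detectCores() can return NA on some platforms, hence the explicit check.
n_cores <- detectCores()
n_workers <- if (is.na(n_cores)) 1 else max(1, n_cores - 1)

# Cap the worker count for this toy example (assumption, not a requirement).
n_workers <- min(n_workers, 2)

cluster <- makeCluster(n_workers)

# Run a trivial task on the workers; always stop the cluster afterwards.
squares <- tryCatch(
    parLapply(cluster, 1:5, function(x) x^2),
    finally = stopCluster(cluster)
)

print(unlist(squares))
```

When the cluster is used through caret instead, it is registered with registerDoParallel before training and released with stopCluster and registerDoSEQ afterwards, as in the cell below.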


In [2]:
# 02.00, final project: model training

# load packages: machine learning models
# references: course quiz and assignments
# https://github.com/lgreski/datasciencectacontent/blob/7f88642673eeb5913459eb05bd5b58734c8f0bd5/markdown/pml-randomForestPerformance.md
# note: parallel ships with base R, so only caret and doParallel need to be installed
install.packages("caret", dependencies = TRUE)
install.packages("doParallel")
library(caret)
library(parallel)
library(doParallel)

# set random number generation/seed
# references: course quiz and assignments
# https://github.com/lgreski/datasciencectacontent/blob/7f88642673eeb5913459eb05bd5b58734c8f0bd5/markdown/pml-randomForestPerformance.md
RNGversion("3.5.1")
set.seed(3523)

# split the training data 70/30 into train/test partitions
# references:
# class lecture slides on random forest (set 21, slide 4)
# https://github.com/lgreski/datasciencectacontent/blob/7f88642673eeb5913459eb05bd5b58734c8f0bd5/markdown/pml-randomForestPerformance.md
inTrain <- createDataPartition(y = train_data$classe, p = 0.7, list = FALSE)
training <- train_data[inTrain, ]
testing <- train_data[-inTrain, ]

# configure parallel model training
# leave one core free for the OS, but always request at least one worker
# https://github.com/lgreski/datasciencectacontent/blob/7f88642673eeb5913459eb05bd5b58734c8f0bd5/markdown/pml-randomForestPerformance.md
print("setting up cluster and trainControl object...")
cluster <- makeCluster(max(1, detectCores() - 1, na.rm = TRUE))
registerDoParallel(cluster)

# create fitControl object with cross validation parameter
# https://github.com/lgreski/datasciencectacontent/blob/7f88642673eeb5913459eb05bd5b58734c8f0bd5/markdown/pml-randomForestPerformance.md
fitControl <- trainControl(
    method = "cv",
    number = 5,
    allowParallel = TRUE
)

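# train model with random forest
# note: the model is fit on the full train_data; the training/testing partitions
# created above are not used here, and the 5-fold cross validation provides the
# out-of-sample accuracy estimate
# https://github.com/lgreski/datasciencectacontent/blob/7f88642673eeb5913459eb05bd5b58734c8f0bd5/markdown/pml-randomForestPerformance.md
print("model training with random forest...")
fit <- train(classe ~ .,
    method = "rf",
    data = train_data,
    trControl = fitControl
)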

# shutdown parallel processing cluster
# https://github.com/lgreski/datasciencectacontent/blob/7f88642673eeb5913459eb05bd5b58734c8f0bd5/markdown/pml-randomForestPerformance.md
print("stop cluster after model training...")
stopCluster(cluster)
registerDoSEQ()

# review model fit, per-fold results, and the cross-validated confusion matrix
# https://github.com/lgreski/datasciencectacontent/blob/7f88642673eeb5913459eb05bd5b58734c8f0bd5/markdown/pml-randomForestPerformance.md
fit
fit$resample
confusionMatrix(fit)


[1] "setting up cluster and trainControl object..."
[1] "model training with random forest..."
[1] "stop cluster after model training..."


Random Forest 

19622 samples
   52 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 15697, 15698, 15699, 15697, 15697 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.9934256  0.9916836
  27    0.9936296  0.9919415
  52    0.9879727  0.9847847

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 27.

| Accuracy  | Kappa     | Resample |
|-----------|-----------|----------|
| 0.9926115 | 0.9906527 | Fold1    |
| 0.9933724 | 0.9916159 | Fold3    |
| 0.9938838 | 0.9922639 | Fold2    |
| 0.9943949 | 0.9929101 | Fold5    |
| 0.9938854 | 0.992265  | Fold4    |


Cross-Validated (5 fold) Confusion Matrix 

(entries are percentual average cell counts across resamples)
 
          Reference
Prediction    A    B    C    D    E
         A 28.4  0.1  0.0  0.0  0.0
         B  0.0 19.2  0.1  0.0  0.0
         C  0.0  0.0 17.3  0.2  0.0
         D  0.0  0.0  0.1 16.2  0.1
         E  0.0  0.0  0.0  0.0 18.3
                            
 Accuracy (average) : 0.9936


### Model Testing

After model training and verification with a confusion matrix, predictions were made on the test dataset, which consists of 20 test cases. Predictions were made as follows:

1. Make new predictions on the test data with the trained model, i.e. predict on out-of-sample data.
2. Predictions were made for the classe variable, i.e. whether each exercise was performed correctly.
3. Output the predictions to verify results, i.e. the predicted value for each test case.
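
Because the train and test datasets were filtered separately, it can be worth confirming that every predictor used in training is also present in the test data before calling predict. The following is a minimal sketch on toy data frames; the column values and the `problem_id` column are assumptions made for illustration, and the check itself is an addition rather than part of the original workflow.

```r
# Toy stand-ins for the processed train and test data (hypothetical values).
train_toy <- data.frame(
    roll_belt = c(1.4, 1.5),
    yaw_belt = c(-94, -93),
    classe = factor(c("A", "B"))
)
test_toy <- data.frame(
    roll_belt = 1.6,
    yaw_belt = -92,
    problem_id = 1L
)

# Predictors are all training columns except the outcome.
predictors <- setdiff(names(train_toy), "classe")

# Predictors missing from the test data would make prediction fail.
missing_in_test <- setdiff(predictors, names(test_toy))

# Extra test columns are not used as predictors.
extra_in_test <- setdiff(names(test_toy), predictors)

print(missing_in_test)
print(extra_in_test)
```

An empty `missing_in_test` means the test data carries every column the model expects.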


In [3]:
# 02.01, final project: model testing

# make predictions with test data
# reference: class lecture slides on random forest (set 21, slide 8)
pred = predict(fit, newdata=test_data)
# output predictions to verify results
pred
