# Coursera - Practical Machine Learning

#### By Mandy Jiang  (04/04/2022) 

## Background and Description

The goal of this project is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they performed the exercise. Participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The variable "classe" in the training set records how each exercise was performed: Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. (Source: the Weight Lifting Exercise Dataset, http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har)

In [38]:
library(ggplot2)
library(caret)
library(lattice)
library(e1071)
library(rpart)
library(kernlab)
library(randomForest)
library(gbm)

## Data preprocessing

We will train and evaluate the models on the training set only, and leave the testing set for final validation. Therefore, the data cleaning and preprocessing will be applied to the training set only.

In [13]:
train_file = 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv'
test_file = 'https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv'
train = read.csv(train_file, sep=',', header=TRUE)
dim(train)
test = read.csv(test_file, sep=',', header=TRUE)
dim(test)

The training dataset has 19622 observations and 160 variables, and the testing dataset has 20 observations and 160 variables. Next, we remove variables that contain missing values, and then drop the first seven columns, which are not relevant to predicting the class.

In [14]:
train = train[, colSums(is.na(train)) == 0]
train = train[,-c(1:7)]
dim(train)
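
The `is.na` filter above only catches cells that `read.csv` parsed as `NA`. As a minimal sketch on a made-up toy CSV (the values and the `keep_complete` helper are illustrative assumptions, not part of the original analysis), the following shows how blank cells or spreadsheet error strings such as `#DIV/0!` can survive that filter unless they are listed in `na.strings`:

```r
# Toy CSV standing in for the real file; the values are made up.
csv_text <- "id,roll_belt,kurtosis_roll_belt,max_roll_belt,classe
1,1.41,,NA,A
2,1.42,#DIV/0!,NA,B
3,1.45,,NA,A"

# Default parsing: only the literal string "NA" is treated as missing.
raw <- read.csv(text = csv_text)

# Stricter parsing: blanks and spreadsheet error strings also become NA.
strict <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

# Hypothetical helper: keep only the columns without any NA.
keep_complete <- function(df) df[, colSums(is.na(df)) == 0, drop = FALSE]

names(keep_complete(raw))     # still contains kurtosis_roll_belt
names(keep_complete(strict))  # kurtosis_roll_belt is dropped
```

Whether stricter parsing is worthwhile depends on how the remaining columns are filtered afterwards; it mainly makes the missing-value step self-explanatory.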

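Then we remove variables with near-zero variance across observations. The length check guards against an empty result from nearZeroVar, because indexing with an empty negative vector would drop every column.

In [15]:
nvz = nearZeroVar(train)
if (length(nvz) > 0) train = train[, -nvz]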
dim(train)
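
As a rough illustration of what "near-zero variance" means, the sketch below computes the two quantities that `nearZeroVar` documents as its criteria: the ratio of the most common value's count to the second most common, and the percentage of distinct values. The cutoffs mirror caret's documented defaults (`freqCut = 95/5`, `uniqueCut = 10`); the helper names and the treatment of constant columns are assumptions of this sketch, not caret's implementation.

```r
# Count of the most common value divided by the count of the runner-up.
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)  # constant column (convention of this sketch)
  as.numeric(counts[1] / counts[2])
}

# Share of distinct values, as a percentage of the number of observations.
percent_unique <- function(x) 100 * length(unique(x)) / length(x)

# A column is flagged when it is dominated by one value AND has few distinct values.
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  freq_ratio(x) > freq_cut && percent_unique(x) < unique_cut
}

mostly_no <- c(rep("no", 99), "yes")  # ratio 99, 2% unique
varied    <- seq_len(100)             # ratio 1, 100% unique

is_near_zero_var(mostly_no)  # TRUE
is_near_zero_var(varied)     # FALSE
```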

## Data splitting

In [22]:
set.seed(1886)
inTrain = createDataPartition(y=train$classe, p=0.7, list=FALSE)
training = train[inTrain,]
validation = train[-inTrain,]
rbind("original dataset" = dim(train),"training set" = dim(training), "validation set" = dim(validation))

| Dataset | Rows | Columns |
|---|---|---|
| original dataset | 19622 | 53 |
| training set | 13737 | 53 |
| validation set | 5885 | 53 |
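
For a factor outcome, `createDataPartition` samples within each class so that the split preserves the class proportions. A simplified base-R sketch of that idea is shown here; `stratified_split` is a hypothetical helper written for illustration, and its rounding rule is an assumption that may differ from caret's.

```r
# Sample a fraction p of the row indices within each class of y.
stratified_split <- function(y, p = 0.7) {
  rows_by_class <- split(seq_along(y), y)
  picked <- lapply(rows_by_class, function(rows) {
    rows[sample.int(length(rows), size = round(p * length(rows)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1886)
y <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))  # toy labels
in_train <- stratified_split(y)

table(y[in_train])   # 70% of each class
table(y[-in_train])  # the remaining 30% of each class
```

Because each class is sampled separately, a rare class cannot end up missing from either partition by chance.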


## Prediction model setup

We will fit four models: a decision tree, a random forest, gradient boosted trees, and a support vector machine. We will then select the model with the highest accuracy on the validation set and use it to predict the testing dataset.

We set up the training control to use 5-fold cross-validation.

In [23]:
control = trainControl(method="cv", number=5, verboseIter=FALSE)
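
To make the resampling scheme concrete, here is a minimal base-R sketch of how 5-fold cross-validation partitions the rows. The row count and variable names are made up for illustration, and caret builds its own folds internally, so this is not what `trainControl` literally executes.

```r
set.seed(1886)
n_rows <- 23  # toy number of observations
k <- 5

# Assign every row to exactly one of k folds, in random order.
fold_id <- sample(rep(seq_len(k), length.out = n_rows))
fold_sizes <- table(fold_id)

held_out_share <- numeric(k)
for (fold in seq_len(k)) {
  held_out <- which(fold_id == fold)  # rows used to score the model
  fit_rows <- which(fold_id != fold)  # rows used to fit the model
  # A real workflow would fit on fit_rows and evaluate on held_out here.
  held_out_share[fold] <- length(held_out) / n_rows
}

fold_sizes
```

Each row is held out exactly once, so averaging the k scores gives a performance estimate that uses all of the training data.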

### Decision tree

In [30]:
modFit1 = train(classe~., data=training, method="rpart", trControl = control)
pred1 = predict(modFit1, validation)
cm1 = confusionMatrix(pred1, factor(validation$classe))
cm1

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1492  494  493  437  145
         B   26  361   29  177  144
         C  123  284  504  350  264
         D    0    0    0    0    0
         E   33    0    0    0  529

Overall Statistics
                                          
               Accuracy : 0.4904          
                 95% CI : (0.4775, 0.5033)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3339          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.8913  0.31694  0.49123   0.0000  0.48891
Specificity            0.6274  0.92078  0.78987   1.0000  0.99313
Pos Pred Value         0.4874  0.48982  0.33049      NaN  0.94128
Neg Pred Value         0.9356  0.8488

### Random Forests

In [31]:
modFit2 = train(classe~., data=training, method="rf", trControl = control)
pred2 = predict(modFit2, validation)
cm2 = confusionMatrix(pred2, factor(validation$classe))
cm2

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1671    9    0    0    0
         B    2 1129    4    0    1
         C    0    1 1019    8    0
         D    0    0    3  956    3
         E    1    0    0    0 1078

Overall Statistics
                                          
               Accuracy : 0.9946          
                 95% CI : (0.9923, 0.9963)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9931          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9982   0.9912   0.9932   0.9917   0.9963
Specificity            0.9979   0.9985   0.9981   0.9988   0.9998
Pos Pred Value         0.9946   0.9938   0.9912   0.9938   0.9991
Neg Pred Value         0.9993   0.997

### Gradient Boosted Trees

In [35]:
modFit3 = train(classe~., data=training, method="gbm", trControl = control)
pred3 = predict(modFit3, validation)
cm3 = confusionMatrix(pred3, factor(validation$classe))
cm3

Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.6094             nan     0.1000    0.1263
     2        1.5253             nan     0.1000    0.0890
     3        1.4670             nan     0.1000    0.0674
     4        1.4223             nan     0.1000    0.0547
     5        1.3865             nan     0.1000    0.0490
     6        1.3545             nan     0.1000    0.0445
     7        1.3263             nan     0.1000    0.0376
     8        1.3030             nan     0.1000    0.0324
     9        1.2818             nan     0.1000    0.0361
    10        1.2583             nan     0.1000    0.0269
    20        1.1047             nan     0.1000    0.0170
    40        0.9324             nan     0.1000    0.0080
    60        0.8260             nan     0.1000    0.0070
    80        0.7464             nan     0.1000    0.0034
   100        0.6847             nan     0.1000    0.0033
   120        0.6327             nan     0.1000    0.0029
   140        

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1641   40    0    1    2
         B   27 1070   22    3   12
         C    3   26  985   31    7
         D    2    1   18  924   13
         E    1    2    1    5 1048

Overall Statistics
                                         
               Accuracy : 0.9631         
                 95% CI : (0.958, 0.9678)
    No Information Rate : 0.2845         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9534         
                                         
 Mcnemar's Test P-Value : 0.003518       

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9803   0.9394   0.9600   0.9585   0.9686
Specificity            0.9898   0.9865   0.9862   0.9931   0.9981
Pos Pred Value         0.9745   0.9436   0.9363   0.9645   0.9915
Neg Pred Value         0.9921   0.9855   0.991

### Support Vector Machine

In [39]:
modFit4 = train(classe~., data=training, method="svmLinear", trControl = control)
pred4 = predict(modFit4, validation)
cm4 = confusionMatrix(pred4, factor(validation$classe))
cm4

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1534  163   91   64   60
         B   35  803   84   40  125
         C   41   66  793  113   61
         D   52   21   25  695   61
         E   12   86   33   52  775

Overall Statistics
                                          
               Accuracy : 0.7816          
                 95% CI : (0.7709, 0.7921)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7223          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9164   0.7050   0.7729   0.7210   0.7163
Specificity            0.9102   0.9402   0.9422   0.9677   0.9619
Pos Pred Value         0.8023   0.7387   0.7384   0.8138   0.8090
Neg Pred Value         0.9648   0.930

## Model evaluation

In [52]:
AccuracyResults = data.frame(Model = c('Decision Tree','Random Forests','Gradient Boosted Trees',
                                        'Support Vector Machine'),
                              Accuracy = rbind(cm1$overall[1],cm2$overall[1],cm3$overall[1],cm4$overall[1])
                             )
print(AccuracyResults)

                   Model  Accuracy
1          Decision Tree 0.4903993
2         Random Forests 0.9945624
3 Gradient Boosted Trees 0.9631266
4 Support Vector Machine 0.7816483


The best model is the random forest, with a validation accuracy of 0.9946, which corresponds to an estimated out-of-sample error of about 0.54%.
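
The headline statistics can be re-derived by hand from the random forest confusion matrix printed earlier. The sketch below copies those counts into a matrix and computes accuracy, Cohen's Kappa, and the out-of-sample error with base R; the variable names are illustrative. The interval uses an exact binomial test, which is assumed here to be the method behind the reported 95% CI.

```r
# Counts copied from the random forest confusion matrix (rows = predictions).
rf_counts <- matrix(
  c(1671,    9,    0,    0,    0,
       2, 1129,    4,    0,    1,
       0,    1, 1019,    8,    0,
       0,    0,    3,  956,    3,
       1,    0,    0,    0, 1078),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(rf_counts)
accuracy <- sum(diag(rf_counts)) / n  # share of correct predictions

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(rf_counts) * colSums(rf_counts)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

# Exact binomial interval for the accuracy.
ci <- binom.test(sum(diag(rf_counts)), n)$conf.int

round(c(accuracy = accuracy, kappa = kappa_hat, error = 1 - accuracy), 4)
round(ci, 4)
```

The accuracy and Kappa values round to the 0.9946 and 0.9931 shown in the random forest output.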

## Prediction on test set

In [54]:
pred = predict(modFit2, newdata = test)
ValidationPredictionResults <- data.frame(problem_id=test$problem_id, predicted=pred)
print(ValidationPredictionResults)

   problem_id predicted
1           1         B
2           2         A
3           3         B
4           4         A
5           5         A
6           6         E
7           7         D
8           8         B
9           9         A
10         10         A
11         11         B
12         12         C
13         13         B
14         14         A
15         15         E
16         16         E
17         17         A
18         18         B
19         19         B
20         20         B
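
The test set was not cleaned, so it still carries all of its original columns; prediction works as long as every predictor used in training is present. A quick sanity check along these lines can be sketched as follows, using short made-up column lists rather than the real data frames:

```r
# Illustrative column names; in the notebook these would come from
# names(training) and names(test).
train_cols <- c("roll_belt", "pitch_belt", "yaw_belt", "classe")
test_cols  <- c("X", "roll_belt", "pitch_belt", "yaw_belt", "problem_id")

predictors <- setdiff(train_cols, "classe")       # everything except the outcome
missing_in_test <- setdiff(predictors, test_cols)

if (length(missing_in_test) > 0) {
  stop("Test set lacks predictors: ", paste(missing_in_test, collapse = ", "))
}
length(missing_in_test)
```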
