# Caret Package Basics
- Preprocessing and Cleaning:
    - preProcess
- Data Splitting:
    - createDataPartition
    - createResample
    - createTimeSlices
- Training/Testing Functions:
    - train
    - predict
- Model Comparison:
    - confusionMatrix
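
The outline lists `preProcess`, which the sections that follow do not demonstrate. A minimal sketch of how it might be used to center and scale the predictors is shown here. It assumes the `spam` data from kernlab (introduced in the next section) and, for brevity, estimates the transformation on the full dataset. In a real workflow the parameters would normally be learned on the training rows only and then applied to the test rows.

```r
library(caret)
library(kernlab)  # provides the spam dataset
data(spam)

# Keep only the predictors; `type` is the outcome and must not be transformed
predictors <- spam[, names(spam) != "type"]

# Learn the centering and scaling parameters (column means and standard deviations)
preObj <- preProcess(predictors, method = c("center", "scale"))

# Apply the learned transformation to the same columns
scaled <- predict(preObj, predictors)

# After standardizing, each predictor should have mean ~0 and sd ~1
mean(scaled$capitalAve)
sd(scaled$capitalAve)
```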

## Data Splitting
- Using the `spam` dataset from the kernlab package

In [2]:
library(caret)
library(kernlab) # provides the spam dataset
data(spam)

head(spam)

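| make | address | all | num3d | our | over | remove | internet | order | mail | ⋯ | charSemicolon | charRoundbracket | charSquarebracket | charExclamation | charDollar | charHash | capitalAve | capitalLong | capitalTotal | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.64 | 0.64 | 0 | 0.32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ⋯ | 0.0 | 0.0 | 0 | 0.778 | 0.0 | 0.0 | 3.756 | 61 | 278 | spam |
| 0.21 | 0.28 | 0.5 | 0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ⋯ | 0.0 | 0.132 | 0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | spam |
| 0.06 | 0.0 | 0.71 | 0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ⋯ | 0.01 | 0.143 | 0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | spam |
| 0.0 | 0.0 | 0.0 | 0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ⋯ | 0.0 | 0.137 | 0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | spam |
| 0.0 | 0.0 | 0.0 | 0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ⋯ | 0.0 | 0.135 | 0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | spam |
| 0.0 | 0.0 | 0.0 | 0 | 1.85 | 0.0 | 0.0 | 1.85 | 0.0 | 0.0 | ⋯ | 0.0 | 0.223 | 0 | 0.0 | 0.0 | 0.0 | 3.0 | 15 | 54 | spam |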


In [3]:
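# Stratified split on the outcome (type): 75% of rows go to training.
# list = FALSE returns the selected row indices as a matrix instead of a list.
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)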

In [5]:
training <- spam[inTrain,]
testing <- spam[-inTrain,]
dim(training)
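
The other splitting helpers named in the outline, `createResample` and `createTimeSlices`, can be sketched in the same way. The block below also checks that `createDataPartition` roughly preserves the class proportions of `type`. The seed, the number of resamples, the toy sequence `tme`, and the window sizes are illustrative assumptions, not values from the original notebook.

```r
library(caret)
library(kernlab)  # provides the spam dataset
data(spam)

set.seed(32323)  # arbitrary seed, used only to make this sketch reproducible

# Stratified split: class proportions in the training rows track the full data
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
prop.table(table(spam$type))           # proportions in the full data
prop.table(table(spam$type[inTrain]))  # proportions in the training rows

# Bootstrap resamples: each list element holds row indices drawn with replacement
resamples <- createResample(y = spam$type, times = 10, list = TRUE)
length(resamples)

# Time slices: 20 consecutive training points followed by the next 10 as test points
tme <- 1:1000  # hypothetical time index
slices <- createTimeSlices(y = tme, initialWindow = 20, horizon = 10)
names(slices)      # "train" "test"
slices$train[[1]]  # indices 1 to 20
slices$test[[1]]   # indices 21 to 30
```

Because the time slices keep observations in order, they suit data where a random split would leak future information into the training set.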

## Fit a Model

In [6]:
set.seed(32343)
modelFit <- train(type ~ ., data = training, method = 'glm')
modelFit

Warning message:
“glm.fit: fitted probabilities numerically 0 or 1 occurred”

Generalized Linear Model 

3451 samples
  57 predictor
   2 classes: 'nonspam', 'spam' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 3451, 3451, 3451, 3451, 3451, 3451, ... 
Resampling results:

  Accuracy   Kappa    
  0.9191168  0.8301116
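
The output above reports `Resampling: Bootstrapped (25 reps)`, which is what `train` uses when no resampling scheme is specified. As a sketch of how the scheme might be changed, the block below passes a `trainControl` object requesting 10-fold cross-validation. The choice of 10 folds and the names `ctrl` and `cvFit` are assumptions made for illustration.

```r
library(caret)
library(kernlab)  # provides the spam dataset
data(spam)

set.seed(32343)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

# Request 10-fold cross-validation instead of the default bootstrap
ctrl <- trainControl(method = "cv", number = 10)

# Same model as above; glm.fit may again warn about fitted probabilities of 0 or 1
cvFit <- train(type ~ ., data = training, method = "glm", trControl = ctrl)

# Resampled performance estimates (Accuracy and Kappa, with their standard deviations)
cvFit$results
```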


## Final Model

In [7]:
modelFit$finalModel


Call:  NULL

Coefficients:
      (Intercept)               make            address                all  
       -1.654e+00         -1.821e-01         -1.618e-01          6.048e-02  
            num3d                our               over             remove  
        1.845e+00          4.995e-01          7.445e-01          2.185e+00  
         internet              order               mail            receive  
        5.078e-01          1.161e+00          4.537e-02          7.488e-02  
             will             people             report          addresses  
       -2.207e-01          4.034e-02          4.804e-02          8.231e-01  
             free           business              email                you  
        1.369e+00          1.346e+00          6.584e-02          9.619e-02  
           credit               your               font             num000  
        1.382e+00          1.975e-01          2.408e-01          1.980e+00  
            money                 hp            

## Prediction

In [9]:
prediction <- predict(modelFit, newdata = testing)
head(prediction)
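
By default `predict` on a `train` object returns class labels. A sketch of how class probabilities might be requested instead, using `type = "prob"`, follows. The setup lines repeat the split and fit from the earlier cells so that the block runs on its own; the name `probs` is an illustrative assumption.

```r
library(caret)
library(kernlab)  # provides the spam dataset
data(spam)

set.seed(32343)
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing <- spam[-inTrain, ]

# glm.fit may warn about fitted probabilities of 0 or 1, as in the original run
modelFit <- train(type ~ ., data = training, method = "glm")

# One column per class; each row sums to 1
probs <- predict(modelFit, newdata = testing, type = "prob")
head(probs)

# Default behaviour: predicted class labels as a factor
prediction <- predict(modelFit, newdata = testing)
table(prediction)
```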

## Confusion Matrix

In [11]:
# Compare predictions vs type in the testing data
confusionMatrix(prediction, testing$type)

Confusion Matrix and Statistics

          Reference
Prediction nonspam spam
   nonspam     656   44
   spam         41  409
                                          
               Accuracy : 0.9261          
                 95% CI : (0.9094, 0.9405)
    No Information Rate : 0.6061          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.845           
 Mcnemar's Test P-Value : 0.8283          
                                          
            Sensitivity : 0.9412          
            Specificity : 0.9029          
         Pos Pred Value : 0.9371          
         Neg Pred Value : 0.9089          
             Prevalence : 0.6061          
         Detection Rate : 0.5704          
   Detection Prevalence : 0.6087          
      Balanced Accuracy : 0.9220          
                                          
       'Positive' Class : nonspam         
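
The headline statistics can be recomputed by hand from the four counts in the table, which helps clarify what each one measures. The sketch below uses base R only and copies the counts from the output above. Following that output, `nonspam` is treated as the positive class. The variable names are illustrative.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(c(656, 41, 44, 409), nrow = 2,
             dimnames = list(Prediction = c("nonspam", "spam"),
                             Reference  = c("nonspam", "spam")))

tp <- cm["nonspam", "nonspam"]  # nonspam correctly predicted as nonspam
fn <- cm["spam", "nonspam"]     # nonspam wrongly predicted as spam
fp <- cm["nonspam", "spam"]     # spam wrongly predicted as nonspam
tn <- cm["spam", "spam"]        # spam correctly predicted as spam
n  <- sum(cm)

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # share of actual nonspam that was detected
specificity <- tn / (tn + fp)   # share of actual spam that was detected
ppv         <- tp / (tp + fp)   # Pos Pred Value
npv         <- tn / (tn + fn)   # Neg Pred Value
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for agreement expected by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

round(c(Accuracy = accuracy, Kappa = kappa, Sensitivity = sensitivity,
        Specificity = specificity, PPV = ppv, NPV = npv,
        BalancedAccuracy = balanced), 4)
```

The rounded values agree with the `confusionMatrix` output: accuracy 0.9261, kappa 0.845, sensitivity 0.9412, and specificity 0.9029.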
                                          