# K-Nearest Neighbor Classification

## Using R's built-in *knn* function from the *class* package

In [1]:
library(class)

## Using the built-in dataset *iris3*
- 50 observations for each of 3 iris species (150 observations in total)
- *iris3* holds the same measurements as *iris*, arranged as a 3-dimensional array of size 50 by 4 by 3 (observations by measurements by species)

In [2]:
class(iris3)
head(iris3)
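
As a quick check of that layout, the sketch below inspects the array's dimensions and pulls out one species slice. It uses only base R; the variable name `setosa` is introduced here purely for illustration.

```r
# iris3 ships with base R's datasets package
dim(iris3)                # 50 observations x 4 measurements x 3 species
dimnames(iris3)[[3]]      # species names along the third dimension

# The third index selects a species; the result is a 50 x 4 matrix
setosa <- iris3[, , 1]
dim(setosa)
colnames(setosa)
```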

## Train the KNN model on the first 25 observations of each species
- Predict on the remaining 25 observations of each species

In [3]:
training <- rbind(iris3[1:25, ,1], iris3[1:25, ,2], iris3[1:25, ,3])
head(training)

| Sepal L. | Sepal W. | Petal L. | Petal W. |
|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 |
| 4.9 | 3.0 | 1.4 | 0.2 |
| 4.7 | 3.2 | 1.3 | 0.2 |
| 4.6 | 3.1 | 1.5 | 0.2 |
| 5.0 | 3.6 | 1.4 | 0.2 |
| 5.4 | 3.9 | 1.7 | 0.4 |


In [4]:
testing <- rbind(iris3[26:50, ,1], iris3[26:50, ,2], iris3[26:50, ,3])
head(testing)

| Sepal L. | Sepal W. | Petal L. | Petal W. |
|---|---|---|---|
| 5.0 | 3.0 | 1.6 | 0.2 |
| 5.0 | 3.4 | 1.6 | 0.4 |
| 5.2 | 3.5 | 1.5 | 0.2 |
| 5.2 | 3.4 | 1.4 | 0.2 |
| 4.7 | 3.2 | 1.6 | 0.2 |
| 4.8 | 3.1 | 1.6 | 0.2 |


## Create a factor of species labels
- Label the first 25 training rows "s" (setosa), the next 25 "c" (versicolor), and the last 25 "v" (virginica)

In [5]:
cl <- factor(c(rep("s", 25), rep("c", 25), rep("v", 25)))
cl

## Run knn function
- Default parameters: 
    - knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)
- **prob = TRUE** attaches the proportion of neighbor votes for the winning class to the result, as the attribute `prob`
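
As a rough illustration of what the `k` argument controls, the sketch below classifies a single flower by hand: compute Euclidean distances to every training row, take the `k` closest, and let them vote. This is not how `class::knn` is implemented. `knn_one` and its arguments are hypothetical names, and ties go to the first maximum here, whereas `knn` breaks ties at random.

```r
# Hypothetical helper: classify one query point by majority vote among
# its k nearest training rows (Euclidean distance).
knn_one <- function(train, labels, query, k = 3) {
  # Subtract the query from every row, then take row-wise Euclidean norms
  d <- sqrt(rowSums(sweep(train, 2, query)^2))
  nearest <- order(d)[seq_len(k)]       # indices of the k closest rows
  votes <- table(labels[nearest])       # votes per class
  names(votes)[which.max(votes)]        # first maximum wins (no random tie-break)
}

train  <- rbind(iris3[1:25, , 1], iris3[1:25, , 2], iris3[1:25, , 3])
labels <- factor(c(rep("s", 25), rep("c", 25), rep("v", 25)))

# Classify the 26th setosa flower, which is not in the training rows
pred <- knn_one(train, labels, iris3[26, , 1], k = 3)
pred
```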

In [6]:
r <- knn(training, testing, cl, k = 3, prob = TRUE)
r
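
With `prob = TRUE`, the vote proportions are not printed as a separate object; they ride along as an attribute of the returned factor. The sketch below rebuilds the same inputs so it runs on its own and then reads that attribute. `vote_share` is a name introduced here for illustration.

```r
library(class)

training <- rbind(iris3[1:25, , 1], iris3[1:25, , 2], iris3[1:25, , 3])
testing  <- rbind(iris3[26:50, , 1], iris3[26:50, , 2], iris3[26:50, , 3])
cl <- factor(c(rep("s", 25), rep("c", 25), rep("v", 25)))

r <- knn(training, testing, cl, k = 3, prob = TRUE)

# Proportion of neighbor votes for the winning class, one value per test row
vote_share <- attr(r, "prob")
summary(vote_share)

# Test rows where the neighbors did not agree unanimously
which(vote_share < 1)
```

Rows with a vote share below 1 are the ones where the neighbors disagreed, which makes them natural candidates to inspect when looking at misclassifications.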

## Look at Accuracy
- 92% accuracy: 69 of the 75 test observations are classified correctly

In [7]:
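# Rows: true species, columns: predictions.
# cl also labels the test rows because they follow the same 25/25/25 ordering.
table(cl, r)
# Fraction of test observations classified correctly
sum(cl == r) / length(cl)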

   r
cl   c  s  v
  c 23  0  2
  s  0 25  0
  v  4  0 21
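
The comparison above works because `cl` has the same 25/25/25 ordering as the rows of `testing`. A slightly more explicit variant, sketched below, keeps separate label vectors for the two sets. `train_cl`, `test_cl`, `pred`, and `tab` are names introduced here, not part of the original notebook. Because `knn` breaks ties at random, the numbers may differ slightly between runs.

```r
library(class)

training <- rbind(iris3[1:25, , 1], iris3[1:25, , 2], iris3[1:25, , 3])
testing  <- rbind(iris3[26:50, , 1], iris3[26:50, , 2], iris3[26:50, , 3])

train_cl <- factor(rep(c("s", "c", "v"), each = 25))
test_cl  <- train_cl   # test rows follow the same species ordering

pred <- knn(training, testing, train_cl, k = 3)

# Confusion table with named dimensions
tab <- table(actual = test_cl, predicted = pred)
tab

mean(pred == test_cl)        # overall accuracy
diag(tab) / rowSums(tab)     # share of each true species predicted correctly
```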

<hr>

# Using *KNN* with the *caret* package
- Using *iris* dataset

In [8]:
library(caret)

Loading required package: lattice
Loading required package: ggplot2


In [9]:
head(iris)

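| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |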


### Data Partition

In [10]:
# Stratified 70/30 split on Species; inTrain holds the training row indices
inTrain <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)

# Subset the iris data into training and testing sets
training <- iris[inTrain, ]
testing <- iris[-inTrain, ]
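
The split is stratified: each species contributes the same share of rows to the training set. The base-R sketch below shows roughly what that amounts to. It is not caret's implementation; the seed value and the names `idx`, `train_sketch`, and `test_sketch` are assumptions made for illustration. Fixing a seed also makes the split repeatable.

```r
set.seed(42)  # assumed seed; any fixed value makes the split repeatable

# Sample 70% of the row indices within each species separately
idx <- unlist(lapply(
  split(seq_len(nrow(iris)), iris$Species),
  function(rows) sample(rows, size = round(0.7 * length(rows)))
))

train_sketch <- iris[idx, ]
test_sketch  <- iris[-idx, ]

table(train_sketch$Species)   # 35 rows per species
table(test_sketch$Species)    # 15 rows per species
```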

### Train and Predict

In [11]:
# Using k = 3 
knnFit <- train(Species ~ ., data = training, method = "knn", tuneGrid = data.frame(k = 3))
knnFit

k-Nearest Neighbors 

105 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 105, 105, 105, 105, 105, 105, ... 
Resampling results:

  Accuracy  Kappa    
  0.941878  0.9117098

Tuning parameter 'k' was held constant at a value of 3
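
Holding `k` at 3 means the resampling only estimates accuracy for that one value. As a sketch of how the same call could compare several values instead, the example below passes a longer `tuneGrid` and switches the resampling to 10-fold cross-validation. The seed, the grid of `k` values, and the use of the full *iris* data (so the block runs on its own) are assumptions, not part of the original analysis.

```r
library(caret)

set.seed(123)  # assumed seed so the resampling folds are repeatable

# 10-fold cross-validation instead of the default bootstrap
ctrl <- trainControl(method = "cv", number = 10)

# Try several odd values of k; the grid column must be named "k"
fit <- train(Species ~ ., data = iris, method = "knn",
             trControl = ctrl,
             tuneGrid = data.frame(k = c(1, 3, 5, 7, 9)))

fit$results[, c("k", "Accuracy", "Kappa")]  # resampled performance per k
fit$bestTune                                # k with the highest accuracy
```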

In [12]:
# Predict on the Testing Set
knnPredict <- predict(knnFit, newdata = testing)
knnPredict

### Look at Accuracy
- 95.6% accuracy: 43 of the 45 test observations are classified correctly
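
The accuracy and per-class statistics that `confusionMatrix` reports can be reproduced by hand from the confusion counts. The sketch below hard-codes the counts from this run (15, 14, and 14 correct, with one versicolor/virginica swap in each direction) and uses only base R. The variable names are introduced here for illustration.

```r
species <- c("setosa", "versicolor", "virginica")

# Confusion counts from this run: rows are predictions, columns are references
cm <- matrix(c(15,  0,  0,
                0, 14,  1,
                0,  1, 14),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = species, Reference = species))

accuracy <- sum(diag(cm)) / sum(cm)         # correct / total

# Sensitivity: correct predictions divided by the true count of each class
sensitivity <- diag(cm) / colSums(cm)

# Specificity: true negatives / (true negatives + false positives)
fp <- rowSums(cm) - diag(cm)
tn <- sum(cm) - rowSums(cm) - colSums(cm) + diag(cm)
specificity <- tn / (tn + fp)

round(accuracy, 4)
round(sensitivity, 4)
round(specificity, 4)
```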

In [13]:
# confusionMatrix(data, reference): predictions first, true labels second
confusionMatrix(knnPredict, testing$Species)

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         14         1
  virginica       0          1        14

Overall Statistics
                                          
               Accuracy : 0.9556          
                 95% CI : (0.8485, 0.9946)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9333          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9333           0.9333
Specificity                 1.0000            0.9667           0.9667
Pos Pred Value              1.0000            0.9333           0.9333
Neg Pred Value              1.0000            0.9667           0.9667
Prevalence                  0.3333          