## Iris classification project

The aim of this project is to build a classification model that predicts the species of a flower from its measured characteristics.

In [1]:
#The iris dataset ships with base R (datasets package); ISLR is loaded here but is not needed to access it
library(ISLR)

In [2]:
#sneak peek into the data set
head(iris)

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa


In [3]:
str(iris)

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


In [15]:
summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

In [17]:
#checking for missing data
sum(is.na(iris))

Awesome, there is no missing data in the dataset. Since there are only 150 observations, a simple distance-based method is practical, so let us build a classification model using kNN.
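
The single total above does not say which column a missing value would come from. As a small sketch using only base R, the check can be broken down per column (the name `na.per.column` is just illustrative):

```r
# Count missing values in each column instead of one grand total
na.per.column <- colSums(is.na(iris))
print(na.per.column)

# TRUE only when every row is fully observed
all.complete <- all(complete.cases(iris))
print(all.complete)
```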

In [4]:
#standardizing the independent variables
species <- iris[,5]
std.iris <- scale(iris[,-5])
head(std.iris)

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
-0.8976739,1.01560199,-1.335752,-1.311052
-1.1392005,-0.13153881,-1.335752,-1.311052
-1.3807271,0.32731751,-1.392399,-1.311052
-1.5014904,0.09788935,-1.279104,-1.311052
-1.0184372,1.24503015,-1.335752,-1.311052
-0.535384,1.93331463,-1.165809,-1.048667


In [5]:
#checking if the scaling worked.
var(std.iris[,1])

In [6]:
var(std.iris[,2])
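
Rather than checking the variance one column at a time, all four columns can be verified in one go. This is a minimal sketch that re-creates the scaled matrix under the illustrative name `std.check` so it runs on its own; it assumes the default behaviour of `scale()` (centre on the mean, divide by the sample standard deviation).

```r
# Standardize the four numeric columns, as in the cell above
std.check <- scale(iris[, -5])

# Column means should be 0 (up to floating-point error)
col.means <- colMeans(std.check)
print(round(col.means, 10))

# Column standard deviations should all be 1
col.sds <- apply(std.check, 2, sd)
print(col.sds)
```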

In [8]:
std.iris <- as.data.frame(std.iris)
std.iris$species <- species

In [9]:
head(std.iris)

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,species
-0.8976739,1.01560199,-1.335752,-1.311052,setosa
-1.1392005,-0.13153881,-1.335752,-1.311052,setosa
-1.3807271,0.32731751,-1.392399,-1.311052,setosa
-1.5014904,0.09788935,-1.279104,-1.311052,setosa
-1.0184372,1.24503015,-1.335752,-1.311052,setosa
-0.535384,1.93331463,-1.165809,-1.048667,setosa


In [10]:
library(caTools)

In [11]:
set.seed(101)
sub <- sample.split(std.iris$species, SplitRatio = 0.7)
test <- subset(std.iris, sub == F)
train <- subset(std.iris, sub == T)
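
According to its documentation, `sample.split` preserves the relative ratios of the labels, so each species should end up with 35 training and 15 test rows. The sketch below illustrates that idea in base R only; it is not how caTools is implemented, and `train.idx` and `is.train` are illustrative names.

```r
set.seed(101)
labels <- iris$Species

# For each species, draw 70% of its row indices for the training set
train.idx <- unlist(lapply(
  split(seq_along(labels), labels),
  function(idx) sample(idx, size = round(0.7 * length(idx)))
))

# Logical mask in the same spirit as the `sub` vector above
is.train <- seq_along(labels) %in% train.idx

# Cross-tabulate species against train/test membership
split.table <- table(Species = labels, Train = is.train)
print(split.table)
```

The same `table()` call can be applied to the actual `sub` vector to confirm that the split used in this notebook is balanced.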

In [12]:
head(test)

,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,species
3,-1.380727,0.3273175,-1.392399,-1.311052,setosa
11,-0.535384,1.4744583,-1.279104,-1.311052,setosa
12,-1.259964,0.7861738,-1.222456,-1.311052,setosa
13,-1.259964,-0.1315388,-1.335752,-1.442245,setosa
14,-1.86378,-0.1315388,-1.505695,-1.442245,setosa
17,-0.535384,1.9333146,-1.392399,-1.048667,setosa


Since we will use the class library to implement kNN, and its knn() function takes the class labels as a separate argument, we need to separate the species column from both the test and train data sets.

In [13]:
#store the species in respective vectors
test.species <- test$species
train.species <- train$species

#remove species column from both test and train sets
test <- test[, -5]
train <- train[, -5]

In [14]:
#using kNN
library(class)

In [18]:
predicted.species <- knn(train, test, train.species, k=1)

In [19]:
predicted.species

Now let’s evaluate the model we trained and see our misclassification error rate.

In [21]:
#misclassification rate
mean(test.species != predicted.species)
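
A single error rate hides which species get confused with each other. A confusion matrix shows that, and its off-diagonal share equals the misclassification rate. The sketch below is self-contained, so it uses a simple stand-in split (every third row held out) rather than the `sample.split` result above; that split and the names `features`, `test.idx` and `conf.mat` are assumptions made for illustration.

```r
library(class)

set.seed(101)  # knn() breaks distance ties at random

# Stand-in split (assumption): every third row is held out for testing
features <- scale(iris[, -5])
test.idx <- seq(3, nrow(iris), by = 3)

pred <- knn(features[-test.idx, ], features[test.idx, ],
            cl = iris$Species[-test.idx], k = 1)

# Rows: predicted species, columns: actual species
conf.mat <- table(Predicted = pred, Actual = iris$Species[test.idx])
print(conf.mat)

# The misclassification rate is the off-diagonal share of the table
err.from.table <- 1 - sum(diag(conf.mat)) / sum(conf.mat)
print(err.from.table)
```

With the notebook's own objects, the equivalent call would be `table(predicted.species, test.species)`.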

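Next, we use the elbow method to choose the best value of k: compute the error rate for a range of k values and look for the point where it stops improving.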

In [22]:
library(ggplot2)

In [23]:
predicted.species <- NULL
error.rate <- NULL

for (i in 1:10){
    set.seed(101)
    predicted.species <- knn(train, test, train.species, k = i)
    error.rate[i] <- mean(test.species != predicted.species) 
}

In [24]:
error.rate
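
The elbow is easier to spot on a plot than in a printed vector. The sketch below recomputes the error rates on a stand-in split (every third row held out, an assumption made so the block runs on its own) and draws them with ggplot2 if that package is installed. The names `k.values` and `error.df` are illustrative.

```r
library(class)

set.seed(101)
features <- scale(iris[, -5])
test.idx <- seq(3, nrow(iris), by = 3)  # stand-in split (assumption)

# Hold-out error rate for k = 1..10
k.values <- 1:10
error.rate <- sapply(k.values, function(k) {
  pred <- knn(features[-test.idx, ], features[test.idx, ],
              cl = iris$Species[-test.idx], k = k)
  mean(pred != iris$Species[test.idx])
})

error.df <- data.frame(k = k.values, error = error.rate)

# Draw the elbow plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(error.df, ggplot2::aes(x = k, y = error)) +
    ggplot2::geom_line(linetype = "dashed") +
    ggplot2::geom_point() +
    ggplot2::scale_x_continuous(breaks = k.values) +
    ggplot2::labs(x = "k", y = "Misclassification rate")
  print(p)
}
```

In the notebook itself, the `error.rate` vector from the loop above can be passed to the same plotting code via `data.frame(k = 1:10, error = error.rate)`.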

The error rate itself is pretty low (~3%). The optimal k value appears to be around 3-4.

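However, the error rate shoots up at k = 7 and then comes back down at k = 10. This erratic behaviour can be attributed to the small number of observations in the data set: with roughly 45 flowers in the test set, a single misclassified flower shifts the error rate by about 2 percentage points.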
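
One way to get a less jumpy estimate from so few observations is leave-one-out cross-validation, which scores every flower against a model built from the other 149. This is a swapped-in technique, not what the notebook above does. The sketch uses `knn.cv()` from the class library, and the name `loocv.error` is illustrative.

```r
library(class)

set.seed(101)  # ties between neighbours are broken at random
features <- scale(iris[, -5])

# knn.cv() predicts each row from all the other rows (leave-one-out)
loocv.error <- sapply(1:10, function(k) {
  pred <- knn.cv(features, cl = iris$Species, k = k)
  mean(pred != iris$Species)
})

print(round(loocv.error, 3))

# k with the lowest leave-one-out error
best.k <- which.min(loocv.error)
print(best.k)
```

Because every observation serves as a test case once, each error rate here is averaged over 150 predictions instead of about 45, so the curve over k tends to be smoother.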