<div style="background-color: lightblue; color: white; padding: 20px;">
 
<center>    
    <h1><b>Classification Model - NaiveBayes</b></h1>
    <h2>Iris dataset</h2>
</center>
    
<br>
<br>    
    
<div style="background-color: white; padding: 20px; border-radius: 10px; width: 70%; margin-left: 15%;">    
    <img src="iris.png">    
</div>  
    
<br>    
    <div style="background-color: white; color: #777877; width: 30%; margin-left: 35%; padding: 20px; font-size: 16px; line-height: 25px; border-radius: 10px;">
        <ul>
            <li>0. Introduction</li>
            <li>1. Import Dataset</li>
            <li>2. Analysis</li>
            <li>2.1 Visualization</li>
            <li>3. Modeling</li>
            <li>4. Cross Validation</li>
            <li>5. Prediction Metrics</li>
        </ul>
    </div>
<br>
    
<b>By:</b> Rodrigo Sarroeira    
<br>    
<b>On:</b> 22/02/2021    
</div>

<div style="background-color: lightblue; color: white; padding: 20px;"> 
    <h2 id="intro"><b>0. Introduction </b></h2>
</div>
    
<p style="text-align: justify; padding: 15px;">
In this notebook we implement a very simple example of the Naive Bayes algorithm. The dataset is the famous Iris dataset, often used for classification demonstrations. It contains measurements for three species of Iris (setosa, versicolor and virginica). The main goal is for our model to read the data and classify each observation correctly as setosa, versicolor or virginica.
    
<br>
    
<p style="text-align: justify; padding: 15px;">The <b>Naive Bayes</b> algorithm is based on a simple probabilistic concept called conditional probability. Bayes' theorem gives the probability of an event given some prior knowledge related to that event. This theorem is central to several algorithms, and it also helps to understand how the probability of an event changes in light of other factors. </p>

<br>
<br>


<div style="background-color: #dedede; padding: 30px; border-radius: 10px;">
    
<center><h2><b style="color: white;">Bayes Formula</b></h2></center>
    
<div style="background-color: white; padding: 20px; border-radius: 10px; width: 40%; margin-left: 30%; margin-top: 15px;">

$$
    P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}
$$
    
</div>
</div>    

<br>

</p>
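As a small check of the formula, the sketch below applies it to the Iris data that is used in the rest of the notebook. The events `A` (the flower is a setosa) and `B` (the sepal is wider than 3 cm) are illustrative choices, not part of the original analysis. The posterior obtained through Bayes' formula should match the proportion counted directly in the data.

```r
data(iris)

# Illustrative events (assumptions made only for this example)
A = iris$Species == "setosa"   # event A: the flower is a setosa
B = iris$Sepal.Width > 3       # event B: the sepal is wider than 3 cm

p_a         = mean(A)      # P(A)
p_b         = mean(B)      # P(B)
p_b_given_a = mean(B[A])   # P(B|A): share of setosas with a wide sepal

# Bayes' formula: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

# Direct estimate: share of setosas among the flowers with a wide sepal
p_a_given_b_direct = mean(A[B])

c(bayes = p_a_given_b, direct = p_a_given_b_direct)
```

Both numbers agree because the formula is just a rearrangement of the definition of conditional probability.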

In [None]:
library(naivebayes)  # Naive Bayes
library(ggplot2)     # Visualization
library(dplyr)       # Data manipulation
library(psych)       # Scatter plot matrix (pairs.panels)
library(caret)       # Cross Validation

<div style="background-color: lightblue; color: white; padding: 20px;"> 
<h2>1. Import Dataset</h2>
</div>

In [None]:
data(iris)   # loads the built-in iris data frame into the workspace
head(iris)

<div style="background-color: lightblue; color: white; padding: 20px;"> 
<h2>2. Analysis</h2>
</div>

In [None]:
str(iris)

In [None]:
summary(iris)

This dataset is composed of 150 observations and 5 variables, of which 4 are numeric and 1 is a factor. The `Species` variable is the <b style="color: red;">target</b> variable, in other words the variable we are trying to predict from the other 4 variables. The independent variables are the widths and lengths of petals and sepals. Our <b style="color: red;">target</b> variable has 3 possible values: setosa, versicolor and virginica.

<div style="background-color: lightblue; color: white; padding: 20px;"> 
<h3>2.1 Visualization</h3>
</div>

In [None]:
# Visualization of correlation between numeric variables
pairs.panels(iris[-5])

This graph gives us a lot of information. The distribution of each variable is shown on the main diagonal. The distributions of `Sepal.Length` and `Sepal.Width` are almost symmetric, whereas the distributions of `Petal.Length` and `Petal.Width` show two "clusters". This suggests that the petal variables are going to be good predictors of the target variable. 
<br>
In addition, this graph gives us the linear correlation between all pairs of variables. The correlation between `Petal.Width` and `Petal.Length` is very high,<b> 0.96</b>. There are 3 positive correlations and 3 negative correlations. The weakest correlation (in absolute value) is between `Sepal.Length` and `Sepal.Width`, with a value of <b>-0.12</b>.
<br>
The scatter plots on the lower triangle of this matrix show how the variables relate to each other.
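The correlations drawn in the plot can also be read as plain numbers. The sketch below uses base R's `cor()` on the four numeric columns. The name `cor_matrix` is only illustrative.

```r
data(iris)

# Pearson correlation between the four numeric variables, rounded to 2 decimals
cor_matrix = round(cor(iris[, 1:4]), 2)
cor_matrix

# The two values discussed in the text
cor_matrix["Petal.Length", "Petal.Width"]
cor_matrix["Sepal.Length", "Sepal.Width"]
```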

In [None]:
ggplot(iris, aes(x=Species, y=Petal.Width, fill=Species)) +
    geom_boxplot() + 
    ggtitle("Boxplot Petal Width by Species")

These boxplots, built with `ggplot`, are a very useful tool when the target variable is categorical. Each graph shows how a variable is spread within each `Species`. For example, the graph above shows that `Setosas` have the smallest Petal Width and `Virginicas` the largest.

In [None]:
ggplot(iris, aes(x=Species, y=Sepal.Width, fill=Species)) +
    geom_boxplot() + 
    ggtitle("Boxplot Sepal Width by Species")

The "Boxplot Sepal Width by Species" shows that `Sepal.Width` isn't a very good variable for the classification, because its values are very similar across Species. The next two graphs reinforce this idea. 

In [None]:
ggplot(iris, aes(x=Petal.Width, fill=Species)) +
    geom_density() + 
    ggtitle("Density plot for Petal Width by Species")

In [None]:
ggplot(iris, aes(x=Sepal.Width, fill=Species)) +
    geom_density() + 
    ggtitle("Density plot for Sepal Width by Species")

The first graph clearly shows that `Petal.Width` is a good variable to use in the classification formula, because each Species has very different values for the Petal Width. On the other hand, the density graph of `Sepal.Width` shows that all the species share more or less the same values for the Sepal Width, making it hard to distinguish them.
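The same contrast can be checked numerically. As a rough sketch, the code below computes the minimum and maximum of each width variable within each species. The names `petal_ranges` and `sepal_ranges` are illustrative. Ranges that barely overlap point to a variable that separates the species well.

```r
data(iris)

# Minimum and maximum of Petal.Width within each species
petal_ranges = sapply(split(iris$Petal.Width, iris$Species), range)
rownames(petal_ranges) = c("min", "max")

# Minimum and maximum of Sepal.Width within each species
sepal_ranges = sapply(split(iris$Sepal.Width, iris$Species), range)
rownames(sepal_ranges) = c("min", "max")

petal_ranges
sepal_ranges
```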

<div style="background-color: lightblue; color: white; padding: 20px;"> 
<h2>3. Modeling </h2>
</div>

Before creating a final model, I will create two example models, one using `Petal.Width` and another using `Sepal.Width`. The goal of these examples is to demonstrate how differently these two variables perform when used in simple Naive Bayes models.

<div style="background-color: lightblue; color: white; padding: 20px;"> 
<h3>3.1 Modeling with Petal Width </h3>
</div>

In [None]:
nb1 = naive_bayes(Species ~ Petal.Width, data=iris)

In [None]:
# confusion matrix
predicted1 = predict(nb1, iris)
tb1 = table(predicted1, iris$Species)
tb1

In [None]:
# Proportion of misclassified values
1 - sum(diag(tb1)) / sum(tb1)

Just from the boxplot and the density plot for this variable, we were expecting a low percentage of misclassified values. The values of this variable are grouped in clusters, one for each `Species`, making it very easy for the algorithm to predict the right class. We have 100% accuracy on `Setosas` and only 6 misclassified values, of which 4 were `Virginicas` and the other 2 `Versicolor`. This represents a total accuracy of<b> 0.96</b>.
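To make the mechanism behind these predictions more concrete, here is a minimal by-hand sketch of a one-variable Naive Bayes classifier in base R. It assumes a Gaussian density for `Petal.Width` within each species, which, as far as I know, is what `naive_bayes()` uses by default for numeric predictors. The names `priors`, `means`, `sds`, `scores` and `predicted_by_hand` are illustrative and not part of the `naivebayes` package.

```r
data(iris)

# Per-class ingredients of a Gaussian Naive Bayes model with one feature
priors = table(iris$Species) / nrow(iris)               # P(Species)
means  = tapply(iris$Petal.Width, iris$Species, mean)   # class means
sds    = tapply(iris$Petal.Width, iris$Species, sd)     # class standard deviations

# Unnormalised posterior for every observation (rows) and class (columns):
# P(Species) * density of Petal.Width under that class
scores = sapply(levels(iris$Species), function(s) {
    as.numeric(priors[s]) * dnorm(iris$Petal.Width,
                                  mean = as.numeric(means[s]),
                                  sd   = as.numeric(sds[s]))
})

# Pick, for each observation, the class with the highest score
best_class = apply(scores, 1, which.max)
predicted_by_hand = factor(levels(iris$Species)[best_class],
                           levels = levels(iris$Species))

# Confusion matrix: rows are predictions, columns are actual species
table(predicted_by_hand, iris$Species)
```

The denominator of Bayes' formula is the same for all classes, so it can be dropped when we only need the most probable class.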


<div style="background-color: lightblue; color: white; padding: 20px;"> 
<h3>3.2 Modeling with Sepal Width </h3>
</div>

In [None]:
nb2 = naive_bayes(Species ~ Sepal.Width, data=iris)

In [None]:
predicted2 = predict(nb2, iris)
tb2 = table(predicted2, iris$Species)
tb2

In [None]:
# Proportion of misclassified values
1 - sum(diag(tb2)) / sum(tb2)

When we look at the plots for `Sepal.Width` we see that all Species have similar sepal widths, making it harder for the algorithm to predict the class correctly. Almost half of the observations were classified correctly, and for virginicas more than half of the values were misclassified (27 out of 50). In this model, Setosas are the Species predicted most accurately.

<div style="background-color: lightblue; color: white; padding: 20px;"> 
<h3>3.3 Final Model </h3>
</div>

In [None]:
# model with all the variables except Sepal.Width
nb_final = naive_bayes(Species ~ Petal.Width + Petal.Length + Sepal.Length, data=iris)

# adding Sepal.Width to the model increases the number of misclassified values by one

In [None]:
# Creation of the confusion matrix
predicted_final = predict(nb_final, iris)
tb = table(predicted_final, iris$Species)
tb

In [None]:
# Proportion of misclassified values
1 - sum(diag(tb)) / sum(tb)

This model has the best performance of all possible combinations of the available variables. The accuracy of this model is <b>96.67%</b>. To make sure our model performs well on new data, we are going to use a method called <b>Cross Validation</b>. This method predicts each observation of the "train set" with a model that was not trained on it, as if it were new data for the model.  

<div style="background-color: lightblue; color: white; padding: 20px;"> 
<h2>4. Cross validation - using 6 folds</h2>
</div>

When the number of observations is low (150 is very low), it is usual to resort to the K-fold method. This method divides the dataset into K folds, using K - 1 folds to train the model and the remaining fold to test it. The process is repeated K times, so that every observation is predicted as if it were new data to the model. This way we check that the predictions are good not only `in-sample` but also `out-of-sample`. 
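The splitting logic can be sketched in base R as follows. This is only an illustration of the idea: it assigns observations to folds purely at random, whereas `createFolds()` from `caret` also balances the species within each fold. The names `fold_id`, `test_idx` and `train_idx`, and the seed, are arbitrary choices for this example.

```r
data(iris)
set.seed(1)   # arbitrary seed, only to make the sketch reproducible

k = 6
n = nrow(iris)

# Assign each observation to exactly one of the k folds, in random order
# (the folds have equal size here, since 150 / 6 = 25)
fold_id = sample(rep(1:k, length.out = n))

for (i in 1:k) {
    test_idx  = which(fold_id == i)   # held-out fold, used for testing
    train_idx = which(fold_id != i)   # remaining k - 1 folds, used for training
    cat("Fold", i, ": train on", length(train_idx),
        "rows, test on", length(test_idx), "rows\n")
}
```

Because every observation belongs to exactly one fold, each one is predicted exactly once by a model that never saw it during training.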

In [None]:
k = 6   # number of folds
folds = createFolds(iris$Species, k, list=TRUE, returnTrain=FALSE)
str(folds)

In [None]:
# Creating vector to store predicted values
predict_vector = rep(NA, nrow(iris))


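for(i in 1:k){ 
    
    # Train on every fold except fold i
    cross_model = naive_bayes(Species ~ Petal.Width + Petal.Length + Sepal.Length, data=iris[-folds[[i]],])
    # Keep only the predictors used by the model for the held-out fold
    test_data = iris[folds[[i]], c("Petal.Width", "Petal.Length", "Sepal.Length")]
    # The factor returned by predict() is stored as its integer level codes
    predict_vector[folds[[i]]] = predict(cross_model, test_data)

}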

predict_vector

The output vector identifies each `Species` with a number. To enable the comparison between `predict_vector` and the actual values, we must replace each number with the `Species` associated with it. 

In [None]:
# predicted values for Species: map the codes 1, 2, 3 back to the species names,
# keeping all three levels even if one of them is never predicted
predict_vector = factor(predict_vector, levels = 1:3, labels = levels(iris$Species))

predict_vector

In [None]:
tb_final = table(predict_vector, iris$Species)
tb_final

In [None]:
performance = 1 - sum(diag(tb_final)) / sum(tb_final)
performance

We have now checked that our model performs well both <b>in</b> and <b>out</b> of sample. Using this method we got a very low classification error: only <b>4% or 5%</b> of the values were not labelled correctly. The `performance` value can change between runs, because `createFolds()` assigns observations to folds at random, so the error floats around 4 or 5 percent.

<div style="background-color: lightblue; color: white; padding: 20px;"> 
<h2>5. Prediction Metrics </h2>
</div>

To evaluate the performance of our model we are going to use some metrics. To calculate them we use the values of the confusion matrix created before, stored in the `tb_final` variable. 

In [None]:
tb_final

The `accuracy` is the proportion of correctly classified observations. The proportion of misclassified observations is therefore `1 - accuracy`.

In [None]:
# proportion of correct predictions
accuracy = sum(diag(tb_final)) / sum(tb_final)
accuracy

In [None]:
# proportion of misclassified values
missclassified = 1 - sum(diag(tb_final)) / sum(tb_final)
missclassified

The precision for `Setosas` is the proportion of real Setosas out of all the predicted Setosas. This metric tells us how much we can trust a prediction of a given category. As we can see, `precision_setosa` is `1`, meaning that every observation predicted as setosa really is a setosa. 

In [None]:
# precision for setosas: the rows of tb_final are the predictions,
# so the row sum counts all observations predicted as setosa
precision_setosa = tb_final[1,1] / sum(tb_final[1,])
precision_setosa

In [None]:
# precision for versicolors
precision_versicolor = tb_final[2,2] / sum(tb_final[2,])
precision_versicolor

In [None]:
# precision for virginicas
precision_virginica = tb_final[3,3] / sum(tb_final[3,])
precision_virginica

On the other hand, the sensitivity metric measures the proportion of correctly identified observations out of all real observations of that category. For example, a `sensivity_versicolor` of 0.94 would mean that 94% of all real `versicolors` were classified correctly.

In [None]:
# sensitivity for setosas: the columns of tb_final are the actual species,
# so the column sum counts all real setosas
sensivity_setosa = tb_final[1,1] / sum(tb_final[,1])
sensivity_setosa

In [None]:
# sensitivity for versicolors
sensivity_versicolor = tb_final[2,2] / sum(tb_final[,2])
sensivity_versicolor

In [None]:
# sensitivity for virginicas
sensivity_virginica = tb_final[3,3] / sum(tb_final[,3])
sensivity_virginica
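Both metrics can also be computed for all classes at once. The sketch below uses a small made-up confusion matrix (`tb_example`, hypothetical counts that are not the output of the model above) laid out like `tb_final`, with predictions in the rows and actual species in the columns.

```r
classes = c("setosa", "versicolor", "virginica")

# Hypothetical confusion matrix: rows = predicted, columns = actual
tb_example = matrix(c(50,  0,  0,
                       0, 48,  4,
                       0,  2, 46),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(predicted = classes, actual = classes))

# Precision: correct predictions divided by everything predicted as that class (row sums)
precision_all = diag(tb_example) / rowSums(tb_example)

# Sensitivity: correct predictions divided by all real members of that class (column sums)
sensitivity_all = diag(tb_example) / colSums(tb_example)

precision_all
sensitivity_all
```

In this made-up example the two metrics differ for versicolor and virginica, which shows why it matters whether the row sum or the column sum goes in the denominator.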

<div style="background-color: lightblue; color: white;">
<br>
<br>
<center><h2><b> We have reached the end of this simple example. Thanks for reading!  </b></h2></center>
<br>
<br>
</div>