# NAIVE BAYES WITH IONOSPHERE RADAR DATA

The ionosphere is part of Earth’s upper atmosphere.

It is a very active part of the atmosphere, ionized by solar radiation as a result of the Sun’s activity.

In ionospheric research, radar returns from the ionosphere are classified as either “good” or “bad”.

Good returns show evidence of some type of structure in the ionosphere, and are suitable for further analysis.

This is not the case for bad returns. We build a Naive Bayes classifier (NBC) to identify good and bad radar returns.

If you want to continue from a previously saved session state:

In [None]:
sessionfile <- "02_naive_bayes_02.RData"

if(file.exists(sessionfile)) load(sessionfile)

Load the necessary libraries:

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better workflow and more tools to wrangle and visualize the data
library(evclass) # for ionosphere data
library(listviewer) # for navigating nested/list objects
library(plotly) # for interactive visualizations
library(magrittr) # handling data structures
library(stringr) # regex
library(e1071) # for naive bayes
library(knitr) # for better table printing
library(kableExtra) # for better table printing
library(IRdisplay) # printing html tables from kable
library(rlist) # for handling list structures
library(stargazer) # beautiful tables from R statistical output
library(caret) # for model performance evaluation
library(corrplot) # for correlation plots
library(scales) # for formatting numbers
options(warn = -1) # suppress warnings


# Collect and explore data

We load the data from the evclass package:

In [None]:
data("ionosphere", package = "evclass")

Some information on the dataset from "?ionosphere":

Ionosphere dataset

Description

This dataset was collected by a radar system and consists of phased array of 16 high-frequency antennas with a total transmitted power of the order of 6.4 kilowatts. The targets were free electrons in the ionosphere. "Good" radar returns are those showing evidence of some type of structure in the ionosphere. "Bad" returns are those that do not. There are 351 instances and 34 numeric attributes. The first 175 instances are training data, the rest are test data. This version of dataset was used by Zouhal and Denoeux (1998).

Format
A list with two elements:

x

The 351 x 18 object-attribute matrix.

y

A 351-vector containing the class labels.

And view the structure:

In [None]:
str(ionosphere)

We can navigate lists and similar nested objects interactively:

In [None]:
listviewer::jsonedit(ionosphere, mode = "form")

The object is composed of two list elements. The first element, x,
contains the attributes.

There are 34 in total.

The second element, y, contains the target variable. The target variable is
binary.

1 denotes "good" while 2 denotes "bad":

In [None]:
str(ionosphere$y)

## Target variable

View the counts of unique values:

In [None]:
table(ionosphere$y)

**EXERCISE 1:** Plot the histogram of ionosphere$y using the plotly library (which we used in 01_knn_02)

Note that for better interpretation, you should convert the numbers into "good" and "bad" labels while plotting

(3 minutes)

**SOLUTION 1:**

In [None]:
plotly::plot_ly(x = factor(ionosphere$y, labels = c("good", "bad")),
        type = "histogram")

## Attribute distribution

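The Naive Bayes implementation in e1071 models each numeric attribute as Normally distributed within each class, so ideally the attributes would be generated from a Normal distribution.

First, we name the columns so that the attributes are easier to handle.
sprintf is a versatile tool for printing formatted numbers and strings: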

In [None]:
# add leading zeros to fill 2 digits
colnames(ionosphere$x) <- sprintf("V%02d", 1:ncol(ionosphere$x))
colnames(ionosphere$x)

In [None]:
attributes(ionosphere$x)

Let's save the x item into a separate data.table:

In [None]:
ion_attr <- as.data.table(ionosphere$x)

In [None]:
str(ion_attr)

Let's get some statistical summaries:

In [None]:
summary(ion_attr)

And pretty print those summaries:

In [None]:
summaries <- summary(ion_attr) %>% # get statistical summaries
    apply(1, function(x) stringr::str_extract(x, "(?<=:).+") %>% as.numeric) %>%
    magrittr::set_colnames(names(summary(1))) %>% # set column names
    magrittr::set_rownames(names(ion_attr)) # set row names

summaries

Or similarly but more concisely:

In [None]:
ion_attr[,lapply(.SD, function(x) c(mean(x), quantile(x)))][c(2:4,1,5:6)] %>%
    t() %>%
    magrittr::set_colnames(names(summary(1)))

Or using stargazer function:

In [None]:
capture.output(stargazer::stargazer(ion_attr, type = "html")) %>%
    paste(collapse="\n") %>%
    IRdisplay::display_html()

And we get the density plots:

In [None]:
ion_attr[,V03:V11] %>% # select columns
    tidyr::gather() %>% # reshape into long format in columns "key" and "value"
    ggplot(aes(value)) + # plot value
        facet_wrap(~ key) + # divide into separate plots by key
        geom_density(fill = "green") + # get density plots
        xlim(c(-1.5,1.5)) # align to same axis limits

Many attributes are highly skewed and deviate considerably from the bell-shaped Normal distribution.

It seems any hopes that the data are Gaussian are dashed!

Lack of Normality is commonplace with real-world data.

NBC often performs well despite this type of violation. So, let's continue with building our model.
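
One hedged way to put a number on "skewed" is the sample skewness together with a Shapiro-Wilk test. The sketch below uses simulated columns rather than the radar data, and `sample_skewness()` is a hypothetical helper defined here, not a function from any of the loaded packages.

```r
# Toy data (illustration only): one roughly Normal column, one right-skewed column
set.seed(1)
toy <- data.frame(normalish = rnorm(200),
                  skewed    = rexp(200))

# Hypothetical helper: moment-based sample skewness (near 0 for symmetric data)
sample_skewness <- function(x) {
    z <- x - mean(x)
    mean(z^3) / mean(z^2)^1.5
}

skew <- sapply(toy, sample_skewness)

# Shapiro-Wilk test: a small p-value is evidence against Normality
shapiro_p <- sapply(toy, function(x) shapiro.test(x)$p.value)

round(cbind(skewness = skew, shapiro_p = shapiro_p), 4)
```

The same two `sapply` calls could be applied to `ion_attr` to rank the radar attributes by how far they depart from Normality.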

# Preparing the data

## Drop unnecessary variables

As we saw from the statistical summaries, V02 has no variation at all and is constant at 0:

In [None]:
summaries

In [None]:
summaries["V02", , drop = FALSE]

So we delete it from our attribute object:

In [None]:
ion_attr[, V02 := NULL]
names(ion_attr)

V01 is also weird:

In [None]:
summaries["V01", , drop = FALSE]

In [None]:
ion_attr[,unique(V01)]

It only takes the values 0 and 1. We drop it to leave only continuous variables:

In [None]:
ion_attr[, V01 := NULL]
names(ion_attr)

## Factorize labels

We save the labels separately as a factor:

In [None]:
labels <- factor(ionosphere$y, labels = c("good", "bad"))

## Create train indices

**EXERCISE 2:** Select 251 indices at random and save them into a vector named "train"

Use the data.table special symbol .I

(2 minutes)

**SOLUTION 2:**

In [None]:
set.seed(2018)
train <- ion_attr[,sample(.I, 251)]
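
The later cells reach the held-out rows with negative indexing such as `ion_attr[-train]`. A small base-R sketch, using a plain index vector as a stand-in for the 351-row table (the names `n_rows`, `train_idx` and `test_idx` are made up for this illustration), shows that the sampled and remaining indices split the data cleanly:

```r
# Stand-in for the 351 rows of the attribute table (illustration only)
n_rows <- 351
set.seed(2018)
train_idx <- sample(seq_len(n_rows), 251)   # same idea as sample(.I, 251)

# The test set is simply every row that was not sampled
test_idx <- setdiff(seq_len(n_rows), train_idx)

length(train_idx)                        # 251
length(test_idx)                         # 100
length(intersect(train_idx, test_idx))   # 0, the two sets do not overlap
```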

# Train model

We can use the naiveBayes function in two ways:

```
Usage
## S3 method for class 'formula'
naiveBayes(formula, data, laplace = 0, ..., subset, na.action = na.pass)
## Default S3 method:
naiveBayes(x, y, laplace = 0, ...)


## S3 method for class 'naiveBayes'
predict(object, newdata,
  type = c("class", "raw"), threshold = 0.001, eps = 0, ...)

Arguments
x	
A numeric matrix, or a data frame of categorical and/or numeric variables.

y	
Class vector.

formula	
A formula of the form class ~ x1 + x2 +
      .... Interactions are not allowed.

data	
Either a data frame of predictors (categorical and/or numeric) or a contingency table.
```

With the first usage:

In [None]:
fit1 <- e1071::naiveBayes(formula = labels[train] ~ ., data = ion_attr[train])

With the second usage:

**EXERCISE 3:** Create the NBC model with the second usage (as in the 02_naive_bayes_01 example) and save into fit2 variable

(2 minutes)

**SOLUTION 3:**

In [None]:
fit2 <- e1071::naiveBayes(x = ion_attr[train], y = labels[train], laplace = 0) # laplace = 0 is the default (no smoothing)
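
The `laplace` argument controls additive smoothing. As far as we can tell from the e1071 implementation, it only affects categorical predictors, so it should make no difference for the purely numeric attributes used here. The arithmetic for a categorical predictor can be sketched in base R; the vectors `weather` and `play` and the helper `laplace_smooth()` are made up for this illustration.

```r
# Made-up categorical predictor and class labels (illustration only)
weather <- factor(c("sunny", "sunny", "rainy", "sunny", "rainy", "sunny"),
                  levels = c("sunny", "rainy", "snowy"))
play    <- factor(c("yes", "yes", "no", "yes", "no", "no"))

counts <- table(play, weather)   # class-by-level counts; "snowy" never occurs

# Hypothetical helper: P(weather | play) with additive (Laplace) smoothing
laplace_smooth <- function(counts, laplace = 0) {
    (counts + laplace) / (rowSums(counts) + laplace * ncol(counts))
}

unsmoothed <- laplace_smooth(counts, laplace = 0)  # unseen "snowy" gets probability 0
smoothed   <- laplace_smooth(counts, laplace = 1)  # every level gets a positive probability

unsmoothed
smoothed
```

Without smoothing, a single unseen level would force the whole product of conditional probabilities to zero; adding a pseudo-count avoids that.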

Now we can compare both side by side and see that the results are the same:

In [None]:
# flatten both fits, dropping the last element of each (the stored call)
flat1 <- rlist::list.flatten(fit1[-length(fit1)])
flat2 <- rlist::list.flatten(fit2[-length(fit2)])

# x and y are the matching components of fit1 and fit2
mapply(function(x, y, name_1, name_2)
    {
        knitr::kable(list(data.frame(fit1 = name_1),
                            list(x),
                            data.frame(fit2 = name_2),
                            list(y)
                         ),
                    ) %>%
                    as.character() %>%
                    IRdisplay::display_html()
    },
        flat1,
        flat2,
        names(flat1),
        names(flat2)
      )

Let's see the structure of the fit output:

In [None]:
attributes(fit1)
str(fit1)

In [None]:
fit1$tables$V03
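
For a numeric attribute, each entry of `fit1$tables` should hold the per-class mean (first column) and standard deviation (second column), that is, the parameters of the Gaussian assumed for that attribute within each class. The same kind of table can be built by hand; the sketch below does so on the built-in `iris` data rather than the radar data, purely as an illustration.

```r
# Per-class mean and standard deviation of one numeric attribute,
# i.e. the parameters of the class-conditional Gaussians
x <- iris$Sepal.Length
y <- iris$Species

manual_table <- cbind(mean = tapply(x, y, mean),
                      sd   = tapply(x, y, sd))
manual_table
```

With e1071 loaded, `e1071::naiveBayes(Species ~ Sepal.Length, iris)$tables$Sepal.Length` is expected to show the same values.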

In [None]:
listviewer::jsonedit(fit1)

Let's see the label distribution for our train set:

In [None]:
table(labels[train])
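
These class counts determine the prior probabilities. Combined with the per-attribute Gaussian densities, they yield the posterior probabilities that the prediction step reports. A minimal base-R sketch of that calculation, again on `iris` as stand-in data and with a hypothetical helper `nb_posterior()`, might look like this:

```r
# Stand-in data: two numeric attributes and a class label (illustration only)
X <- iris[, c("Sepal.Length", "Petal.Length")]
y <- iris$Species

prior <- sapply(levels(y), function(k) mean(y == k))      # P(class)
mu    <- sapply(X, function(col) tapply(col, y, mean))    # classes x attributes
sigma <- sapply(X, function(col) tapply(col, y, sd))      # classes x attributes

# Hypothetical helper: posterior class probabilities for one observation
nb_posterior <- function(obs) {
    obs <- unlist(obs)
    # log prior + sum of log Gaussian densities (the "naive" independence assumption)
    log_post <- sapply(levels(y), function(k) {
        log(prior[[k]]) + sum(dnorm(obs, mu[k, ], sigma[k, ], log = TRUE))
    })
    post <- exp(log_post - max(log_post))   # subtract the max for numerical stability
    post / sum(post)                        # normalise so the probabilities sum to 1
}

post <- nb_posterior(X[1, ])
round(post, 4)
```

Working on the log scale avoids underflow when many small densities are multiplied, which matters more as the number of attributes grows.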

# Model prediction

**EXERCISE 4:** Save the "raw" predictions (posterior probabilities) on test data (with the complement of train indices) into "pred_probs" object

Reformat the pred_probs as 2 decimal digit percentages and save into "pred_percent" object

(4 minutes)

**SOLUTION 4:**

First get the predictions as "raw", i.e. the posterior probabilities for good and bad:

In [None]:
pred_probs <- predict(fit1, ion_attr[-train], type = "raw")

And see the probabilities in percent format:

In [None]:
pred_percent <- pred_probs %>% apply(2, scales::percent, accuracy = 0.01)
pred_percent

And let's add classification labels using colnames and max.col functions:

In [None]:
labs <- colnames(pred_percent)[max.col(pred_probs)]

pred_percent %>%
    magrittr::set_rownames(labs)
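
To make the `max.col` step concrete, here is a toy example with made-up probabilities. Note that `max.col` breaks ties at random by default; the sketch sets `ties.method = "first"` to keep the result deterministic.

```r
# Made-up posterior probabilities for three observations (illustration only)
probs <- matrix(c(0.90, 0.10,
                  0.20, 0.80,
                  0.55, 0.45),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("good", "bad")))

# For each row, the index of the column holding the largest probability
winner <- max.col(probs, ties.method = "first")

# Translate column indices into class labels
toy_labs <- colnames(probs)[winner]
toy_labs   # "good" "bad" "good"
```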

We can also get the predicted class labels directly by passing type = "class" to predict:

**EXERCISE 5:** Get the predicted class labels directly with type = "class" in predict and save them into the "pred" object

(1 minute)

**SOLUTION 5:**

In [None]:
pred <- predict(fit1, ion_attr[-train], type = "class")
pred

Our manual labelling and the automatic classification yield the same results:

In [None]:
identical(labs, as.character(pred))

# Model evaluation

We will create a confusion matrix from the predicted labels and the true labels of the test set:

In [None]:
result <- caret::confusionMatrix(pred, labels[-train])
result

In [None]:
str(result)

Overall accuracy is:

In [None]:
result$overall[1]

To calculate it manually:

In [None]:
conf <- result$table

sum(diag(conf)) / sum(conf)
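
The other headline metrics can be derived from the four cells in the same way. The sketch below uses a made-up confusion matrix, laid out with predictions in rows and true labels in columns and with "good" treated as the positive class; the numbers are not results from this notebook.

```r
# Made-up confusion matrix (rows = predicted, columns = actual), illustration only
conf_toy <- matrix(c(50,  8,
                      5, 37),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Prediction = c("good", "bad"),
                                   Reference  = c("good", "bad")))

tp <- conf_toy["good", "good"]   # predicted good, actually good
fp <- conf_toy["good", "bad"]    # predicted good, actually bad
fn <- conf_toy["bad",  "good"]   # predicted bad, actually good
tn <- conf_toy["bad",  "bad"]    # predicted bad, actually bad

accuracy    <- (tp + tn) / sum(conf_toy)
sensitivity <- tp / (tp + fn)    # share of actual "good" returns that were found
specificity <- tn / (tn + fp)    # share of actual "bad" returns that were found
precision   <- tp / (tp + fp)    # share of predicted "good" that were right

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 3)
```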

# Improve model performance

## Cross correlations

One way to improve performance is to re-examine the assumptions behind the NBC.

One key assumption is the independence of the attributes.

We can use the correlation coefficient as a crude proxy to assess how well this assumption is met.

The idea is that if the features are independent they will have zero correlation.
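
The converse does not hold, which is why correlation is only a crude proxy: two variables can be uncorrelated and still strongly dependent. A small simulated example (not the radar data) illustrates this:

```r
set.seed(42)
x <- runif(1000, min = -1, max = 1)   # symmetric around zero
y <- x^2                              # fully determined by x, so clearly dependent

cor_linear <- cor(x, y)        # close to 0: no linear relationship
cor_abs    <- cor(abs(x), y)   # close to 1: the dependence shows after a transformation

round(c(cor_linear = cor_linear, cor_abs = cor_abs), 3)
```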

**EXERCISE 6:** Create a correlation plot of all attribute values

You can recycle the previous code we executed, or play with options to get different kind of visualizations (coloring, lower/upper type, size, etc)

(4 minutes)

**SOLUTION 6:**

In [None]:
cor(ion_attr) %>%
corrplot::corrplot.mixed(upper = "ellipse",
                         lower = "number",
                         tl.pos = "lt",
                         number.cex = .5,
                         lower.col = "black",
                         tl.cex = 0.7)

Removing highly correlated attributes might improve performance.

We will drop variables that have an absolute correlation in excess of 0.6.

Starting from the original dataset, we only drop the constant-valued V02 (its correlations are undefined) and keep the binary-valued V01:

In [None]:
ion_attr2 <- (as.data.table(ionosphere$x))
ion_attr2[,V02 := NULL]
ion_attr2

And filter correlations:

In [None]:
vars <- caret::findCorrelation(cor(ion_attr2),
                      cutoff = 0.6,
                      exact = TRUE,
                      names = TRUE)

vars

In [None]:
sprintf("So variables %s have a correlation higher than the cutoff", paste(vars, collapse = ","))

We will drop them:

In [None]:
ion_attr2[,(vars) := NULL]
names(ion_attr2)
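
The idea behind such a filter can be sketched in base R. The function below, a hypothetical `drop_correlated()`, is a simplified greedy variant and not caret's algorithm: it repeatedly finds the most correlated remaining pair and drops the member with the larger mean absolute correlation, until no pair exceeds the cutoff. The toy columns `a`, `b` and `c` are simulated.

```r
# Hypothetical helper: greedy correlation filter (NOT caret's implementation)
drop_correlated <- function(data, cutoff = 0.6) {
    cm <- abs(cor(data))
    diag(cm) <- 0
    dropped <- character(0)
    # Repeat while any remaining pair exceeds the cutoff
    while (length(cm) > 1 && max(cm) > cutoff) {
        # Locate the most correlated remaining pair
        pair <- which(cm == max(cm), arr.ind = TRUE)[1, ]
        # Of the two, drop the one with the larger mean absolute correlation
        worst <- pair[which.max(rowMeans(cm[pair, , drop = FALSE]))]
        dropped <- c(dropped, colnames(cm)[worst])
        cm <- cm[-worst, -worst, drop = FALSE]
    }
    dropped
}

# Toy data (illustration only): b is nearly a copy of a, c is unrelated
set.seed(7)
a <- rnorm(200)
toy <- data.frame(a = a,
                  b = a + rnorm(200, sd = 0.1),
                  c = rnorm(200))

dropped <- drop_correlated(toy, cutoff = 0.6)
dropped
```

Only one of the two near-duplicate columns is removed, so the information they share stays in the model once rather than twice.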

**EXERCISE 7:** Re-run the model into fit1b, predict classes into pred2, and get the confusion matrix using caret::confusionMatrix into result2:

(6 minutes)

**SOLUTION 7:**

In [None]:
fit1b <- e1071::naiveBayes(labels[train] ~ .,
                          data = ion_attr2[train])

In [None]:
pred2 <- predict(fit1b,
                ion_attr2[-train],
                type = "class")

In [None]:
result2 <- caret::confusionMatrix(pred2, labels[-train])

The confusion matrix is:

In [None]:
result2$table

And side by side with the first model:

In [None]:
knitr::kable(list(data.frame(model = 1),
                            list(result$table),
                            data.frame(model = 2),
                            list(result2$table)
                         ),
                    ) %>%
                    as.character() %>%
                    IRdisplay::display_html()


And the overall accuracy is:

In [None]:
result2$overall[1]

In [None]:
progress <- c("worse", "no better", "better")

sprintf("With an accuracy of %s vs. %s, the second model is %s than the first",
       (acc2 <- result2$overall[1]) %>% scales::percent(accuracy = 0.01),
       (acc1 <- result$overall[1]) %>% scales::percent(accuracy = 0.01),
       progress[sign(acc2 - acc1) + 2]) # index 1, 2 or 3: worse, tie, better

In [None]:
save.image(sessionfile)