# Naive Bayes

Naive Bayes (NB) is a probabilistic classifier.
Given $n$ _attributes_, that is

$$
\text{attributes} = \lbrace{a_1, a_2, a_3, \dots, a_n \rbrace},
$$

and a set of _labels_ $L_k$, we can calculate the conditional probability of each label given a certain set of attribute values. That is:

$$
p\left(L_k|a_1, a_2, a_3,\dots,a_n\right)
$$

This is done by direct application of Bayes' theorem:

$$
p\left(L_k|a_1,a_2,a_3,\dots,a_n\right) = \frac{p\left(L_k\right)p\left(a_1,a_2,a_3,\dots,a_n|L_k\right)}{p\left(a_1,a_2,a_3,\dots,a_n\right)}
$$
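
The "naive" part of the name comes from a standard assumption that is not spelled out above: the attributes are taken to be conditionally independent given the label. Written with the same symbols, the likelihood then factorises as

$$
p\left(a_1,a_2,a_3,\dots,a_n|L_k\right) = \prod_{i=1}^{n} p\left(a_i|L_k\right)
$$

and, since the denominator does not depend on the label, labels can be compared using only

$$
p\left(L_k|a_1,a_2,a_3,\dots,a_n\right) \propto p\left(L_k\right)\prod_{i=1}^{n} p\left(a_i|L_k\right)
$$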

## Example

For the following minimal example a Java implementation of the Naive Bayes algorithm will be used (available at https://github.com/ruivieira/java-naive-bayes).

Consider a simple IT purchasing system where the user can choose a laptop brand (`Apple` or `Lenovo`) for use in a specific department (`design` or `accounting`) in one of two offices (`US` or `UK`).

In this case, NB will try to predict the laptop brand from the historical purchase data. The attributes are therefore:

$$
    a = \lbrace\text{user}, \text{department}, \text{office}\rbrace
$$

and the labels are

$$
    L = \lbrace\text{Apple}, \text{Lenovo}\rbrace
$$

| user   | department  | office | brand |
|--------|-------------|--------|-------|
| Anna   | design      |US      |Apple  |
| Anna   | accounting  |US      |Lenovo |
| Bill   | design      |US      |Apple  |
| Bill   | accounting  |US      |Lenovo |
| Bill   | design      |UK      |Apple  |
| Bill   | accounting  |UK      |Lenovo |
| Claire | design      |US      |Lenovo |
| Claire | accounting  |US      |Lenovo |
| Claire | design      |UK      |Lenovo |
| Claire | accounting  |UK      |Lenovo |
| Dennis | design      |US      |Apple  |
| Dennis | accounting  |US      |Apple  |
| Dennis | design      |UK      |Apple  |
| Dennis | accounting  |UK      |Apple  |


In [1]:
%maven org.ruivieira:naivebayes:0.1-SNAPSHOT

In [2]:
import org.ruivieira.ml.naivebayes.NaiveBayes;
import org.ruivieira.ml.naivebayes.Model;

import java.util.Map;

In [3]:
Model model = Model.create();

Let's start by adding the first record. User Anna buys an Apple for the US design department.

In [4]:
model.train(new String[]{"Anna", "design", "US"}, "Apple");

If we now try to predict the label (outcome) from any single attribute, the result is unsurprising:

In [5]:
NaiveBayes naiveBayes = new NaiveBayes(model);
System.out.println("Anna: " + naiveBayes.classify(new String[]{"Anna"}).toString());
System.out.println("design: " + naiveBayes.classify(new String[]{"design"}).toString());
System.out.println("US: " + naiveBayes.classify(new String[]{"US"}).toString());

Anna: {Apple=100.0}
design: {Apple=100.0}
US: {Apple=100.0}


We now add Anna's purchase of a Lenovo for the US accounting department.

In [6]:
model.train(new String[]{"Anna", "accounting", "US"}, "Lenovo");
naiveBayes = new NaiveBayes(model);
System.out.println("Anna: " + naiveBayes.classify(new String[]{"Anna"}).toString());
System.out.println("design: " + naiveBayes.classify(new String[]{"design"}).toString());
System.out.println("US: " + naiveBayes.classify(new String[]{"US"}).toString());

Anna: {Lenovo=50.0, Apple=50.0}
design: {Lenovo=5.0E-9, Apple=50.0}
US: {Lenovo=50.0, Apple=50.0}


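Nothing we couldn't figure out ourselves yet. Anna is equally likely to buy a Lenovo or an Apple, the design department is far more likely to get an Apple (a score of 50 against practically 0), and the US office is equally likely to get either brand. Note that the reported scores do not always sum to 100, so they are best read relative to each other.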

Let's now add a second user, Bill. Bill makes the same purchasing choices as Anna, but he also buys for the UK office.

In [7]:
model.train(new String[]{"Bill", "accounting", "US"}, "Lenovo");
model.train(new String[]{"Bill", "design", "US"}, "Apple");
model.train(new String[]{"Bill", "accounting", "UK"}, "Lenovo");
model.train(new String[]{"Bill", "design", "UK"}, "Apple");

In [8]:
naiveBayes = new NaiveBayes(model);
System.out.println("Anna: " + naiveBayes.classify(new String[]{"Anna"}).toString());
// Note: this call queries "Anna" again, so the "Bill" line in the output repeats Anna's scores.
System.out.println("Bill: " + naiveBayes.classify(new String[]{"Anna"}).toString());
System.out.println("design: " + naiveBayes.classify(new String[]{"design"}).toString());
System.out.println("US: " + naiveBayes.classify(new String[]{"US"}).toString());

Anna: {Lenovo=16.666666666666664, Apple=16.666666666666664}
Bill: {Lenovo=16.666666666666664, Apple=16.666666666666664}
design: {Lenovo=5.0E-9, Apple=50.0}
US: {Lenovo=33.33333333333333, Apple=33.33333333333333}


Still nothing surprising. However, one of the strengths of NB is the ability to combine attributes to get insights. Let's see that by adding another user, Claire, who buys Lenovos for all offices and departments.

In [9]:
model.train(new String[]{"Claire", "accounting", "US"}, "Lenovo");
model.train(new String[]{"Claire", "design", "US"}, "Lenovo");
model.train(new String[]{"Claire", "accounting", "UK"}, "Lenovo");
model.train(new String[]{"Claire", "design", "UK"}, "Lenovo");

In [10]:
naiveBayes = new NaiveBayes(model);
System.out.println("Claire: " + naiveBayes.classify(new String[]{"Claire"}).toString());
System.out.println("design: " + naiveBayes.classify(new String[]{"design"}).toString());
System.out.println("design US: " + naiveBayes.classify(new String[]{"design", "US"}).toString());
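// Note: "Bill accounting" is passed as a single token here; combining the two attributes would need new String[]{"Bill", "accounting"}.
System.out.println("Bill accounting: " + naiveBayes.classify(new String[]{"Bill accounting"}).toString());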

Claire: {Lenovo=40.0, Apple=3.0E-9}
design: {Lenovo=20.0, Apple=30.0}
design US: {Lenovo=11.428571428571427, Apple=20.0}
Bill accounting: {Lenovo=70.0, Apple=30.0}


Here we start to see the usefulness of combining attributes.
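
One way to read the figures above: they are consistent with the unnormalised product $p(L_k)\prod_i p(a_i|L_k)$, multiplied by 100. This is inferred from the printed numbers, not taken from the library's source. A minimal sketch of that calculation follows; the class and method names are hypothetical, and no smoothing is applied, so an unseen attribute would score exactly 0 rather than the tiny values printed above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

final class NaiveBayesSketch {

    // The ten purchases trained so far: attribute tokens first, label last.
    static final String[][] ROWS = {
        {"Anna", "design", "US", "Apple"},
        {"Anna", "accounting", "US", "Lenovo"},
        {"Bill", "accounting", "US", "Lenovo"},
        {"Bill", "design", "US", "Apple"},
        {"Bill", "accounting", "UK", "Lenovo"},
        {"Bill", "design", "UK", "Apple"},
        {"Claire", "accounting", "US", "Lenovo"},
        {"Claire", "design", "US", "Lenovo"},
        {"Claire", "accounting", "UK", "Lenovo"},
        {"Claire", "design", "UK", "Lenovo"},
    };

    /** Unnormalised score 100 * p(L) * prod_i p(a_i | L) for each label. */
    static Map<String, Double> scores(String... query) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String label : new String[]{"Lenovo", "Apple"}) {
            int labelCount = 0;
            int[] hits = new int[query.length];
            for (String[] row : ROWS) {
                if (!row[row.length - 1].equals(label)) continue;
                labelCount++;
                // Only the attribute tokens, without the trailing label.
                String[] attrs = Arrays.copyOf(row, row.length - 1);
                for (int i = 0; i < query.length; i++) {
                    if (Arrays.asList(attrs).contains(query[i])) hits[i]++;
                }
            }
            double score = (double) labelCount / ROWS.length;     // prior p(L)
            for (int h : hits) score *= (double) h / labelCount;  // likelihoods p(a_i | L)
            result.put(label, 100.0 * score);
        }
        return result;
    }

    /** Rescales the scores so that they sum to 100. */
    static Map<String, Double> normalise(Map<String, Double> scores) {
        double total = 0.0;
        for (double v : scores.values()) total += v;
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            out.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("design US (raw):        " + scores("design", "US"));
        System.out.println("design US (normalised): " + normalise(scores("design", "US")));
    }
}
```

With the ten records trained so far, `scores("design", "US")` gives roughly 11.43 for Lenovo and 20 for Apple, in line with the `design US` output above. Normalising turns that into about 36% against 64%.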

# Random forests

To understand Random Forests (RF) we need to start with Decision Trees (DT).

## Decision Trees

A DT is a data structure that models a logic flow as a tree (parent and child nodes), built by optimising a cost function (in this case the path's _entropy_). The following examples use a Java implementation of DT/RF (https://github.com/ruivieira/java-decision-tree). Let's start by creating a `Dataset` to hold the training data and adding the same input as in the NB example.

In [11]:
%maven org.ruivieira:decisiontree:0.0.2

In [12]:
import org.ruivieira.ml.decisiontree.features.*;
import org.ruivieira.ml.decisiontree.*;
import java.util.logging.*;
Logger.getGlobal().setLevel(Level.WARNING);

In [13]:
final Dataset data = Dataset.create();

The following is just a helper function to reduce the verbosity of the code.

In [14]:
public void add(Dataset data, String user, String dpt, String office, String brand) {
    final Item item = Item.create();
    item.add("user", new StringValue(user));
    item.add("department", new StringValue(dpt));
    item.add("office", new StringValue(office));
    item.add("brand", new StringValue(brand));
    data.add(item);
}

In [15]:
add(data, "Anna", "accounting", "US", "Lenovo");
add(data, "Anna", "design", "US", "Apple");
add(data, "Bill", "accounting", "US", "Lenovo");
add(data, "Bill", "design", "US", "Apple");
add(data, "Bill", "accounting", "UK", "Lenovo");
add(data, "Bill", "design", "UK", "Apple");
add(data, "Claire", "accounting", "US", "Lenovo");
add(data, "Claire", "design", "US", "Lenovo");
add(data, "Claire", "accounting", "UK", "Lenovo");
add(data, "Claire", "design", "UK", "Lenovo");
add(data, "Dennis", "accounting", "US", "Apple");
add(data, "Dennis", "design", "US", "Apple");
add(data, "Dennis", "accounting", "UK", "Apple");
add(data, "Dennis", "design", "UK", "Apple");
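
Before querying the tree, it may help to see how an entropy-based split can be scored. The sketch below uses the textbook Shannon entropy and information gain on the same 14 rows; how the library actually searches for its splits is not shown in this document, so the class and method names here are hypothetical and the scoring rule is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class EntropySketch {

    // Columns: user, department, office, brand.
    static final String[][] ROWS = {
        {"Anna", "accounting", "US", "Lenovo"}, {"Anna", "design", "US", "Apple"},
        {"Bill", "accounting", "US", "Lenovo"}, {"Bill", "design", "US", "Apple"},
        {"Bill", "accounting", "UK", "Lenovo"}, {"Bill", "design", "UK", "Apple"},
        {"Claire", "accounting", "US", "Lenovo"}, {"Claire", "design", "US", "Lenovo"},
        {"Claire", "accounting", "UK", "Lenovo"}, {"Claire", "design", "UK", "Lenovo"},
        {"Dennis", "accounting", "US", "Apple"}, {"Dennis", "design", "US", "Apple"},
        {"Dennis", "accounting", "UK", "Apple"}, {"Dennis", "design", "UK", "Apple"},
    };
    static final int BRAND = 3;

    /** Shannon entropy (in bits) of the brand column over the given rows. */
    static double entropy(List<String[]> rows) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] row : rows) counts.merge(row[BRAND], 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / rows.size();
            h -= p * Math.log(p) / Math.log(2.0);
        }
        return h;
    }

    /** Information gain obtained by splitting all rows on the given column. */
    static double gain(int column) {
        List<String[]> all = Arrays.asList(ROWS);
        Map<String, List<String[]>> groups = new HashMap<>();
        for (String[] row : all) {
            groups.computeIfAbsent(row[column], k -> new ArrayList<>()).add(row);
        }
        double remainder = 0.0;  // size-weighted entropy of the child nodes
        for (List<String[]> g : groups.values()) {
            remainder += (double) g.size() / all.size() * entropy(g);
        }
        return entropy(all) - remainder;
    }

    public static void main(String[] args) {
        System.out.println("gain(user)       = " + gain(0));
        System.out.println("gain(department) = " + gain(1));
        System.out.println("gain(office)     = " + gain(2));
    }
}
```

On this data, splitting on `user` is the most informative (Claire and Dennis are each perfectly consistent), `department` helps a little, and `office` tells us nothing about the brand.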

Let's now re-run the NB examples and compare the results. First we ask the DT a question: which `brand` does `Anna` choose?

In [16]:
TreeConfig config = TreeConfig.create();
config.setData(data);
config.setDecision("brand");
DecisionTree dt = DecisionTree.create(config);

Item question = Item.create();
question.add("user", new StringValue("Anna"));

System.out.println(dt.predict(question));

StringValue{data='Apple'}


Straight away we can see the problem. Where NB gave a probabilistic answer ("Anna is 50%/50% likely to buy a Lenovo or an Apple"), the DT provides a single answer. Let's ask another question: which `brand` is most likely for the `design` department?

In [17]:
Item question = Item.create();
question.add("department", new StringValue("design"));

System.out.println(dt.predict(question));

StringValue{data='Apple'}


So how do we get a range of answers, along with weights or probabilities?
The solution is to use RFs. Basically, we build an ensemble of DTs, each trained on a random sub-sample of the data, and then aggregate the individual answers from each tree.
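
The sub-sample-and-aggregate idea can be sketched without the library. In the sketch below the base learner is a deliberately trivial stand-in (the majority brand among the sampled rows that match one attribute value), not a decision tree, and rows are drawn with replacement; both choices are assumptions made for illustration. The values 80 and 6 are borrowed from the `RandomForest.create(config, 80, 6)` call used in this section, but reading them as "number of learners" and "sample size" is also an assumption, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

final class BaggingSketch {

    // Columns: user, department, office, brand.
    static final String[][] ROWS = {
        {"Anna", "accounting", "US", "Lenovo"}, {"Anna", "design", "US", "Apple"},
        {"Bill", "accounting", "US", "Lenovo"}, {"Bill", "design", "US", "Apple"},
        {"Bill", "accounting", "UK", "Lenovo"}, {"Bill", "design", "UK", "Apple"},
        {"Claire", "accounting", "US", "Lenovo"}, {"Claire", "design", "US", "Lenovo"},
        {"Claire", "accounting", "UK", "Lenovo"}, {"Claire", "design", "UK", "Lenovo"},
        {"Dennis", "accounting", "US", "Apple"}, {"Dennis", "design", "US", "Apple"},
        {"Dennis", "accounting", "UK", "Apple"}, {"Dennis", "design", "UK", "Apple"},
    };
    static final int BRAND = 3;

    /**
     * Stand-in base learner: the majority brand among sampled rows whose
     * given column equals the value, or null if no sampled row matches.
     * Ties go to the alphabetically first brand (TreeMap iteration order).
     */
    static String weakPredict(List<String[]> sample, int column, String value) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] row : sample) {
            if (row[column].equals(value)) counts.merge(row[BRAND], 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Trains each learner on its own random sample and tallies the votes. */
    static Map<String, Integer> vote(int learners, int sampleSize,
                                     int column, String value, long seed) {
        Random rng = new Random(seed);
        Map<String, Integer> votes = new TreeMap<>();
        for (int t = 0; t < learners; t++) {
            List<String[]> sample = new ArrayList<>();
            for (int i = 0; i < sampleSize; i++) {
                sample.add(ROWS[rng.nextInt(ROWS.length)]);  // draw with replacement
            }
            String prediction = weakPredict(sample, column, value);
            if (prediction != null) votes.merge(prediction, 1, Integer::sum);
        }
        return votes;
    }

    public static void main(String[] args) {
        System.out.println("Anna:   " + vote(80, 6, 0, "Anna", 42L));
        System.out.println("Claire: " + vote(80, 6, 0, "Claire", 42L));
    }
}
```

Because every learner sees a different sample, an ambiguous query such as `Anna` tends to split the votes, while a consistent one such as `Claire` can only ever collect votes for Lenovo. Dividing the tallies by their total gives the kind of percentages reported by the forest.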

## Random Forests

First we create a helper function to convert the prediction totals to a percentage:

In [18]:
import java.util.HashMap;

// Converts the per-label vote counts into percentages and prints them.
public void percentages(Map<Value, Integer> pred) {
    double total = 0.0;
    for (Integer value : pred.values()) {
        total += value.doubleValue();
    }
    Map<Value, Double> perc = new HashMap<>();
    for (Value value : pred.keySet()) {
        perc.put(value, pred.get(value).doubleValue()*100.0/total);
    }
    System.out.println(perc);
}

In [19]:
final RandomForest forest = RandomForest.create(config, 80, 6);

question = Item.create();
question.add("user", new StringValue("Anna"));

percentages(forest.predict(question));

{StringValue{data='Lenovo'}=56.25, StringValue{data='Apple'}=43.75}


In [20]:
question = Item.create();
question.add("user", new StringValue("Anna"));
question.add("department", new StringValue("accounting"));

percentages(forest.predict(question));

{StringValue{data='Lenovo'}=76.25, StringValue{data='Apple'}=23.75}

The decision attribute does not have to be the brand. Here the forest is rebuilt to predict the `user` from the remaining fields:


In [21]:
TreeConfig config = TreeConfig.create();
config.setData(data);
config.setDecision("user");
final RandomForest forest = RandomForest.create(config, 80, 6);

[IJava-executor-0] WARN org.ruivieira.ml.decisiontree.DecisionTree - Can't find best split.

In [22]:
question = Item.create();
question.add("brand", new StringValue("Lenovo"));
question.add("department", new StringValue("accounting"));

percentages(forest.predict(question));

{StringValue{data='Anna'}=27.5, StringValue{data='Dennis'}=2.5, StringValue{data='Claire'}=46.25, StringValue{data='Bill'}=23.75}


## Numeric values

Let's now assume we have a numeric field to consider when learning from the data.
NB would treat such an attribute only _nominally_, that is, without considering the distance between values, whereas RFs can make a prediction based on the quantitative difference between values.
This is possibly best explained with an example. Let's say we now have a new field called `price` representing the cost of each laptop. To make the example more realistic, let's add some variation to the prices, with Lenovos priced in the region of £1500 and Apples in the region of £2500. The actual data used is the following:

| user   | department  | office | brand | price |
|--------|-------------|--------|-------|-------|
| Anna   | design      |US      |Apple  |2500.0 |
| Anna   | accounting  |US      |Lenovo |1500.0 |
| Bill   | design      |US      |Apple  |2630.0 |
| Bill   | accounting  |US      |Lenovo |1640.0 |
| Bill   | design      |UK      |Apple  |2590.0 |
| Bill   | accounting  |UK      |Lenovo |1690.0 |
| Claire | design      |US      |Lenovo |1490.0 |
| Claire | accounting  |US      |Lenovo |1580.0 |
| Claire | design      |UK      |Lenovo |1535.0 |
| Claire | accounting  |UK      |Lenovo |1620.0 |
| Dennis | design      |US      |Apple  |2700.0 |
| Dennis | accounting  |US      |Apple  |2660.0 |
| Dennis | design      |UK      |Apple  |2590.0 |
| Dennis | accounting  |UK      |Apple  |2577.0 |

Let's start by adding this to the dataset.

In [23]:
public void add(Dataset data, String user, String dpt, String office, String brand, double price) {
    final Item item = Item.create();
    item.add("user", new StringValue(user));
    item.add("department", new StringValue(dpt));
    item.add("office", new StringValue(office));
    item.add("brand", new StringValue(brand));
    item.add("price", new DoubleValue(price));
    data.add(item);
}

In [24]:
final Dataset data = Dataset.create();
add(data, "Anna", "accounting", "US", "Lenovo", 1500.0);
add(data, "Anna", "design", "US", "Apple", 2500.0);
add(data, "Bill", "accounting", "US", "Lenovo", 1640.0);
add(data, "Bill", "design", "US", "Apple", 2630.0);
add(data, "Bill", "accounting", "UK", "Lenovo", 1690.0);
add(data, "Bill", "design", "UK", "Apple", 2590.0);
add(data, "Claire", "accounting", "US", "Lenovo", 1580.0);
add(data, "Claire", "design", "US", "Lenovo", 1490.0);
add(data, "Claire", "accounting", "UK", "Lenovo", 1620.0);
add(data, "Claire", "design", "UK", "Lenovo", 1535.0);
add(data, "Dennis", "accounting", "US", "Apple", 2660.0);
add(data, "Dennis", "design", "US", "Apple", 2700.0);
add(data, "Dennis", "accounting", "UK", "Apple", 2577.0);
add(data, "Dennis", "design", "UK", "Apple", 2590.0);

In [25]:
TreeConfig config = TreeConfig.create();
config.setData(data);
config.setDecision("brand");
final RandomForest forest = RandomForest.create(config, 80, 6);

In [26]:
question = Item.create();
question.add("price", new DoubleValue(2600.0));

percentages(forest.predict(question));

{StringValue{data='Apple'}=81.25, StringValue{data='Lenovo'}=18.75}
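
A numeric field helps because a tree can split on a threshold instead of on exact values. The sketch below scans candidate thresholds on `price` and keeps the one with the lowest weighted entropy. Using midpoints between consecutive sorted prices as candidates is a common textbook choice and an assumption here, not necessarily what the library does; the class and method names are hypothetical.

```java
import java.util.Arrays;

final class ThresholdSketch {

    // Prices and brands in the order they are added to the dataset above.
    static final double[] PRICES = {
        1500.0, 2500.0, 1640.0, 2630.0, 1690.0, 2590.0, 1580.0,
        1490.0, 1620.0, 1535.0, 2660.0, 2700.0, 2577.0, 2590.0,
    };
    static final boolean[] IS_APPLE = {
        false, true, false, true, false, true, false,
        false, false, false, true, true, true, true,
    };

    /** Entropy (in bits) of a two-class node with the given positive count. */
    static double binaryEntropy(int positives, int total) {
        if (total == 0 || positives == 0 || positives == total) return 0.0;
        double p = (double) positives / total;
        return -(p * Math.log(p) + (1 - p) * Math.log(1 - p)) / Math.log(2.0);
    }

    /** Size-weighted entropy of the two sides of the split price <= threshold. */
    static double weightedEntropy(double threshold) {
        int leftN = 0, leftApple = 0, rightN = 0, rightApple = 0;
        for (int i = 0; i < PRICES.length; i++) {
            if (PRICES[i] <= threshold) {
                leftN++;
                if (IS_APPLE[i]) leftApple++;
            } else {
                rightN++;
                if (IS_APPLE[i]) rightApple++;
            }
        }
        return (leftN * binaryEntropy(leftApple, leftN)
                + rightN * binaryEntropy(rightApple, rightN)) / PRICES.length;
    }

    /** Midpoint between consecutive distinct prices with the lowest weighted entropy. */
    static double bestThreshold() {
        double[] sorted = PRICES.clone();
        Arrays.sort(sorted);
        double best = Double.NaN;
        double bestScore = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < sorted.length; i++) {
            if (sorted[i] == sorted[i + 1]) continue;  // no midpoint between equal prices
            double candidate = (sorted[i] + sorted[i + 1]) / 2.0;
            double score = weightedEntropy(candidate);
            if (score < bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }

    /** Single-split prediction: cheap side is Lenovo, expensive side is Apple. */
    static String predict(double price) {
        return price <= bestThreshold() ? "Lenovo" : "Apple";
    }

    public static void main(String[] args) {
        System.out.println("best threshold: " + bestThreshold());
        System.out.println("price 2600.0 -> " + predict(2600.0));
    }
}
```

On the full data the two price clusters are cleanly separated, so a single threshold between the most expensive Lenovo and the cheapest Apple gives two pure nodes. Trees in a forest are trained on small sub-samples, which may be one reason the forest's answer is less clear-cut than this single split.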


In [67]:
question = Item.create();
question.add("price", new DoubleValue(2600.0));
question.add("user", new StringValue("Claire"));
question.add("department", new StringValue("accounting"));

percentages(forest.predict(question));

{StringValue{data='Apple'}=68.75, StringValue{data='Lenovo'}=31.25}
