## Naive Bayes Classifier

In [1]:
import numpy as np

from Orange.data import Table
from Orange.data.filter import SameValue

### Warmup example

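There are 20 pupils in one year of a sports high school. Each of them takes part in one of the sports: ```basketball```, ```football``` or ```gymnastics```. We estimated their height by eye and assigned each pupil one of three values: ```low```, ```average``` or ```high```. (The data file uses the Slovenian labels: ```kosarka```, ```nogomet```, ```gimnastika``` for the sports and ```nizek```, ```srednji```, ```visok``` for the heights.)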

<img src="slike/footballers.png" width=600/>

<font color="blue">How would you suggest the most appropriate sport for the new pupil Mark, who is of ```average``` height? </font>

In [2]:
data = Table('podatki/sportniki.tab')

In [3]:
data.domain

[visina | sport]

In [4]:
data

[[visok | kosarka],
 [visok | kosarka],
 [visok | kosarka],
 [visok | kosarka],
 [srednji | kosarka],
 ...
]

To start, let's look at how popular each sport is:

In [5]:
for sport in data.domain["sport"].values:
    subset = SameValue(data.domain["sport"], sport)(data)
    
    print(sport)
    print(subset)
    print()
    
    py     = len(subset) / len(data)
    print("Sport (Y): %s, število: %d, verjetnost P(Y): %f" % (sport, len(subset), py))

gimnastika
[[nizek | gimnastika],
 [nizek | gimnastika],
 [nizek | gimnastika],
 [srednji | gimnastika],
 [srednji | gimnastika]]

Sport (Y): gimnastika, število: 5, verjetnost P(Y): 0.250000
kosarka
[[visok | kosarka],
 [visok | kosarka],
 [visok | kosarka],
 [visok | kosarka],
 [srednji | kosarka],
 [srednji | kosarka],
 [nizek | kosarka],
 [visok | kosarka]]

Sport (Y): kosarka, število: 8, verjetnost P(Y): 0.400000
nogomet
[[srednji | nogomet],
 [srednji | nogomet],
 [srednji | nogomet],
 [visok | nogomet],
 [visok | nogomet],
 [nizek | nogomet],
 [nizek | nogomet]]

Sport (Y): nogomet, število: 7, verjetnost P(Y): 0.350000


The most popular sport is basketball, with 8 pupils (40%) taking part. Our first suggestion would therefore be that Mark should play basketball. This result is not very satisfying, because few basketball players are of ```average``` height. The reason? The calculation did not take into account the *attribute* we know about Mark: his height.

<div style="background-color:#00ccff; margin-left:50px; margin-right:50px"> The general probabilities of the classes that we calculated are called *a priori* probabilities.

Let us label them with $P(Y)$, where $Y$ is a class variable.
</div>

In our example, $Y$ takes on the values {```basketball```, ```football```, ```gymnastics```}.

In [6]:
for sport in data.domain["sport"].values:
    subset_y = SameValue(data.domain["sport"],   sport)(data)
    subset_x = SameValue(data.domain["visina"], "srednji")(subset_y)
    p_xy = len(subset_x) / len(subset_y)   # P(X=srednji | Y=sport)

    print("Sport (Y): %s, št. srednje visokih: %d, verjetnost P(X=srednji|Y=%s): %f" % (sport, len(subset_x), sport, p_xy))
    print(subset_x)
    print()
    

Sport (Y): gimnastika, št. srednje visokih: 2, verjetnost P(X=srednji|Y=gimnastika): 0.400000
[[srednji | gimnastika],
 [srednji | gimnastika]]

Sport (Y): kosarka, št. srednje visokih: 2, verjetnost P(X=srednji|Y=kosarka): 0.250000
[[srednji | kosarka],
 [srednji | kosarka]]

Sport (Y): nogomet, št. srednje visokih: 3, verjetnost P(X=srednji|Y=nogomet): 0.428571
[[srednji | nogomet],
 [srednji | nogomet],
 [srednji | nogomet]]



<br/>
Interesting! The probability of ```average``` height is highest among football players. Is this information sufficient to change the original decision?

<br/>
<div style="background-color:#00ccff; margin-left:50px; margin-right:50px">
The probability $P(X|Y)$ is called the <i>conditional probability of the variable $X$ given $Y$</i>. It is the probability that the attribute $X$ takes a certain value among the examples of class $Y$.
</div>

Which probability do we really care about? We want the calculation to take Mark's height into account and evaluate the probability of each sport. That is the probability

$$ P(Y|X) $$

or, in Mark's case,

$$ P(Y|X=srednji)$$

To calculate it, we use Bayes' formula.

## Bayes' formula

To calculate the probability of a class given the attributes, $P(Y|X)$, we start from the joint probability of the class $Y$ and the attributes $X$, denoted by $P(X, Y)$. From the rules of conditional probability it follows that:

$$ P(X, Y) = P(X|Y) \cdot P(Y) = P(Y|X) \cdot P(X)$$

<br/>
<div style="background-color:#00ccff; margin-left:50px; margin-right:50px">
From this follows <i>Bayes' formula</i> for calculating $P(Y|X)$:

$$P(Y|X) = \frac{P(X|Y) \cdot P(Y)}{P(X)} $$
</div>
<br/>

The probability of the class $Y$ given a known attribute $X$ therefore depends on the a priori probability of the class $P(Y)$, the conditional probability $P(X|Y)$, and the a priori probability of the attribute $P(X)$. <font color="blue">In Mark's case: </font>

$$P(Y|X=srednji) = \frac{P(X=srednji|Y) \cdot P(Y)}{P(X=srednji)} $$


<br/>
<br/>
If we estimate this probability for each possible value of the class $Y$, that is {```basketball```, ```football```, ```gymnastics```}, we get the answer to the original question.

In [7]:
for sport in data.domain["sport"].values:
    
    subset_y  = SameValue(data.domain["sport"],   sport)(data)        # all pupils in the given sport
    subset_x  = SameValue(data.domain["visina"], "srednji")(data)     # all pupils of average height
    
    subset_xy = SameValue(data.domain["visina"], "srednji")(subset_y) # all pupils of average height in the given sport
    
    # Compute the probabilities
    p_y  = len(subset_y)  / len(data)         
    p_x  = len(subset_x)  / len(data)
    p_xy = len(subset_xy) / len(subset_y)
    
    p_yx = (p_xy * p_y) / p_x
    
    print("Sport (Y): %s, napoved P(Y=%s | X=srednji): %f" % (sport, sport, p_yx))

Sport (Y): gimnastika, napoved P(Y=gimnastika | X=srednji): 0.285714
Sport (Y): kosarka, napoved P(Y=kosarka | X=srednji): 0.285714
Sport (Y): nogomet, napoved P(Y=nogomet | X=srednji): 0.428571
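
As a cross-check on the output above, the denominator $P(X)$ can also be obtained from the law of total probability, $P(X) = \sum_y P(X|Y=y) \cdot P(Y=y)$, which guarantees that the posteriors sum to 1. The sketch below is plain Python and does not use Orange; the counts are copied from the outputs shown earlier, and the variable names are only illustrative.

```python
# Counts taken from the outputs above: (pupils in sport, of which "srednji").
counts = {
    "gimnastika": (5, 2),
    "kosarka":    (8, 2),
    "nogomet":    (7, 3),
}
n_total = sum(n for n, _ in counts.values())  # 20 pupils

# Prior P(Y) and conditional P(X=srednji | Y) for every sport.
prior = {y: n / n_total for y, (n, _) in counts.items()}
likelihood = {y: k / n for y, (n, k) in counts.items()}

# Law of total probability: P(X) = sum_y P(X|Y=y) * P(Y=y).
p_x = sum(likelihood[y] * prior[y] for y in counts)

# Bayes' formula for every sport.
posterior = {y: likelihood[y] * prior[y] / p_x for y in counts}

for y, p in posterior.items():
    print("P(Y=%s | X=srednji) = %f" % (y, p))
print("Sum of posteriors: %f" % sum(posterior.values()))
```

With Mark's height taken into account, football comes out as the most probable sport, ahead of basketball and gymnastics, which are tied.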


## Implementation of the Naive Bayes Classifier

The *Naive Bayes classifier* assumes that the attributes are conditionally independent of each other given the class, so that with $X = (X_1, X_2, ..., X_p)$:

$$ P(Y|X_1, X_2, ..., X_p) = \frac{P(Y) \cdot P(X_1|Y) \cdot P(X_2|Y) \cdots P(X_p|Y)}{P(X)} $$

##### Question 5-2-1
Complete the implementation of the Naive Bayes classifier defined below. You need to complete the parts of the code that calculate
* probability distribution of classes $P(Y)$
* probability distribution of attributes in the known class $P(X|Y)$

### Learning from data

In the case of discrete attributes, both distributions can be obtained by *counting*.
* $P(Y)$: *How often does the class $Y$ appear in the data?*
* $P(X|Y)$: *How often does the attribute value $X$ appear among the examples that belong to the class $Y$?*
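
The counting idea can be sketched without Orange as well. The example below rebuilds the warmup data as a plain list of `(height, sport)` pairs (the counts follow the outputs shown earlier) and estimates both distributions with `collections.Counter`. The function name `estimate_by_counting` is hypothetical and not part of the notebook.

```python
from collections import Counter

def estimate_by_counting(rows):
    """Estimate P(Y) and P(X|Y) from (x, y) pairs by relative frequencies."""
    class_counts = Counter(y for _, y in rows)   # how often each class occurs
    pair_counts = Counter(rows)                  # how often each (x, y) pair occurs
    n = len(rows)
    prior = {y: c / n for y, c in class_counts.items()}
    conditional = {(x, y): c / class_counts[y] for (x, y), c in pair_counts.items()}
    return prior, conditional

# Warmup data: 20 pupils as (height, sport) pairs.
rows = (
    [("nizek", "gimnastika")] * 3 + [("srednji", "gimnastika")] * 2
    + [("visok", "kosarka")] * 5 + [("srednji", "kosarka")] * 2 + [("nizek", "kosarka")]
    + [("srednji", "nogomet")] * 3 + [("visok", "nogomet")] * 2 + [("nizek", "nogomet")] * 2
)

prior, conditional = estimate_by_counting(rows)
print(prior["kosarka"])                     # P(Y=kosarka)
print(conditional[("srednji", "nogomet")])  # P(X=srednji | Y=nogomet)
```

Note that combinations which never occur in the data, such as a `visok` gymnast, are simply absent from the dictionary; this corresponds to an estimated probability of 0.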

<font color="blue"><b>What about $P(X)$?</b></font> This probability is sometimes difficult to estimate, especially for high-dimensional data, since not every combination of attribute values will be present in the data. Fortunately, $P(X)$ is the same for every class, so it does not affect the choice of the most likely class for a particular example!

### Predicting

For a new example $X^* = (X_1^*, X_2^*, ..., X_p^*)$, select the class value $y$ that maximizes the following expression:

$$ \text{arg max}_y \ P(Y=y) \cdot P(X_1^*|Y=y) \cdot P(X_2^*|Y=y) \cdots P(X_p^*|Y=y) $$

### Log-transformation

The problem with the above approach is a practical one: multiplying a large number of probabilities quickly leads to very small numbers that fall below machine precision (floating-point underflow). Because the logarithm is monotonically increasing, the simplest solution that leads to the same class choice is to sum the logarithms instead:

$$ \text{arg max}_y \ \text{log } P(Y=y) + \text{log } P(X_1^*|Y=y) + \text{log } P(X_2^*|Y=y) + \cdots + \text{log } P(X_p^*|Y=y) $$
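
The effect is easy to reproduce. The following sketch uses 400 made-up probabilities of 0.01 each (an assumption chosen purely for illustration, not taken from the Titanic data) and compares the direct product with the sum of logarithms.

```python
import math

# 400 hypothetical conditional probabilities, each equal to 0.01.
probabilities = [0.01] * 400

# Direct product: the true value 1e-800 is far below the smallest
# positive double (about 5e-324), so the result underflows to 0.0.
product = 1.0
for p in probabilities:
    product *= p

# Sum of logarithms: stays an ordinary finite number.
log_sum = sum(math.log(p) for p in probabilities)

print(product)   # 0.0
print(log_sum)   # about -1842.07
```

Once the product is exactly 0.0 for every class, the classes can no longer be compared, whereas the log-sums remain distinguishable.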

For the implementation, use the passenger data from <i><a href="https://www.kaggle.com/c/titanic">Titanic</a></i>.

First, we split the data into a training set and a test set.

In [8]:
from Orange.data import Table
from numpy import random
random.seed(42)  # ensure reproducibility of random results

data = Table('titanic')
inxs = list(range(len(data)))
n = len(inxs)

random.shuffle(inxs)

data_training = data[inxs[:n//2]]
data_test     = data[inxs[n//2:]]

data_training.save('podatki/titanic-training.tab')
data_test.save('podatki/titanic-test.tab')

Load the training data and calculate the probabilities.

In [9]:
data = Table('podatki/titanic-training.tab')
print(data.domain.class_var)
print(data.domain.class_var.values)

# P(X=child | Y = yes)
filt_child  = SameValue(data.domain["age"], "child")
filt_survived  = SameValue(data.domain["survived"], "yes")

p_xy = len(filt_survived(filt_child(data))) / len(filt_survived(data))
p_xy

survived
['no', 'yes']


0.08379888268156424

In [10]:
class NaiveBayes:
    """
    Naive Bayes classifier.
    
    :attribute self.probabilities
        Dictionary that stores
            - prior class probabilities P(Y)
            - attribute probabilities conditional on class P(X|Y)
    
    :attribute self.class_values
        All possible values of the class.
        
    :attribute self.variables
        Variables in the data. 
    
    :attribute self.trained
        Set to True after fit is called.
    """
    
    def __init__(self):
        self.trained       = False
        self.probabilities = dict()   
    
    
    def fit(self, data):
        """
        Fit a NaiveBayes classifier.
        
        :param data
            Orange data Table.        
        """
        class_variable      = data.domain.class_var    # class variable (Y) 
        self.class_values   = class_variable.values    # possible class values
        self.variables      = data.domain.attributes    # all other variables (X)
        
        n = len(data) # number of all data points
        
        # Compute P(Y)
        for y in self.class_values:

            # A not too smart guess (INCORRECT)
            self.probabilities[y] = 1/len(self.class_values)
            
            # <your code here>
            # Compute class probabilities and correctly fill
            #   probabilities[y] = ... 
            # Select all examples (rows) with class = y
          
            # </your code here>
        
        # Compute P(X|Y)
        for y in self.class_values:
            
            # Select all examples (rows) with class = y
            filty = SameValue(class_variable, y)
            
            for variable in self.variables:
                for x in variable.values:
                    
                    # A not too smart guess (INCORRECT)
                    p = 1 / (len(self.variables) * len(variable.values) * len(self.class_values))
                    
                    # P(variable=x|Y=y)
                    self.probabilities[variable, x, y] = p
                    
                
                    # <your code here>
                    # Compute correct conditional class probability
                    #   probabilities[variable, x, y] = ... 
                    # 
                    # Select all examples with class == y AND 
                    # variable == x
                    # Hint: use SameValue filter twice
            
                
                    # </your code here>
    
        self.trained = True
        
    
    def predict_instance(self, row):
        """
        Predict a class value for one row.
        
        :param row
            Orange data Instance.
        :return 
            Tuple (predicted class, its unnormalized log-probability).
        """
        curr_p = float("-inf")   # Current highest "probability" (unnormalized)
        curr_c = None            # Current most probable class
        
        for y in self.class_values:
            p = np.log(self.probabilities[y])
            for x in self.variables:
                p = p + np.log(self.probabilities[x, row[x].value, y])
            
            if p > curr_p:
                curr_p = p
                curr_c = y
                
        return curr_c, curr_p
        
   

    def predict(self, data):
        """
        Predict class labels for all rows in data.
        
        :param data
            Orange data Table.       
        :return predictions, confidences
            List of predicted classes and a NumPy vector with the
            corresponding unnormalized log-probabilities.
        """
        
        n = len(data)
        predictions = list()
        confidences = np.zeros((n, ))
        
        for i, row in enumerate(data):
            pred, cf = self.predict_instance(row)
            predictions.append(pred)
            confidences[i] = cf
    
        return predictions, confidences

The solution is available in: resitve_05-2_nadzorovano_naivniBayes.ipynb

In [11]:
%run 'resitve_05-2_nadzorovano_naivniBayes.ipynb'

## Using a classifier

An example of use on passenger data from <i><a href="https://www.kaggle.com/c/titanic">Titanic</a></i>.

In [12]:
model = NaiveBayes()
model.fit(data)
model.probabilities

{(DiscreteVariable(name='age', values=['adult', 'child']),
  'adult',
  'no'): 0.9663072776280324,
 (DiscreteVariable(name='age', values=['adult', 'child']),
  'adult',
  'yes'): 0.9162011173184358,
 (DiscreteVariable(name='age', values=['adult', 'child']),
  'child',
  'no'): 0.03369272237196765,
 (DiscreteVariable(name='age', values=['adult', 'child']),
  'child',
  'yes'): 0.08379888268156424,
 (DiscreteVariable(name='sex', values=['female', 'male']),
  'female',
  'no'): 0.0889487870619946,
 (DiscreteVariable(name='sex', values=['female', 'male']),
  'female',
  'yes'): 0.5195530726256983,
 (DiscreteVariable(name='sex', values=['female', 'male']),
  'male',
  'no'): 0.9110512129380054,
 (DiscreteVariable(name='sex', values=['female', 'male']),
  'male',
  'yes'): 0.48044692737430167,
 (DiscreteVariable(name='status', values=['crew', 'first', 'second', 'third']),
  'crew',
  'no'): 0.4568733153638814,
 (DiscreteVariable(name='status', values=['crew', 'first', 'second', 'third']),
  

In [13]:
predictions, confidences = model.predict(data)

for row, p, c in zip(data, predictions, confidences):
    print("Row=%s, predicted class=%s confidence=%.5f" % (row, p, c))

Row=[third, adult, male | no], predicted class=no confidence=-1.58532
Row=[second, adult, female | no], predicted class=yes confidence=-3.63450
Row=[crew, adult, male | no], predicted class=no confidence=-1.30449
Row=[crew, adult, male | no], predicted class=no confidence=-1.30449
Row=[third, adult, male | no], predicted class=no confidence=-1.58532
Row=[second, adult, male | no], predicted class=no confidence=-2.60871
Row=[crew, adult, male | no], predicted class=no confidence=-1.30449
Row=[second, adult, male | no], predicted class=no confidence=-2.60871
Row=[third, adult, male | yes], predicted class=no confidence=-1.58532
Row=[third, adult, male | no], predicted class=no confidence=-1.58532
Row=[third, adult, male | no], predicted class=no confidence=-1.58532
Row=[crew, adult, male | no], predicted class=no confidence=-1.30449
Row=[second, adult, male | no], predicted class=no confidence=-2.60871
Row=[crew, adult, male | no], predicted class=no confidence=-1.30449
Row=[third, adult

## Assessing the performance of the classification

To evaluate the success of the classification, we compare the predicted class of each example with its true class. The four possible outcomes of the comparison are as follows:

<table>
<tr>
<td>
<ul>
<li>TP: True Positives (positive examples correctly predicted as positive) </li>
<li>FP: False Positives (negative examples wrongly predicted as positive) </li>
<li>TN: True Negatives (negative examples correctly predicted as negative) </li>
<li>FN: False Negatives (positive examples wrongly predicted as negative) </li>
</ul>

<br/>
<img src="slike/type12_error.jpeg" width=400/>

</td>
<td><img width="400" src="slike/Precisionrecall.png"/></td>
</tr>
</table>

### Proportion of correctly classified examples (classification accuracy)

$$ca = \frac{TP + TN}{TP + TN + FP + FN}$$

<font color="green">Pros</font>:
* Simple calculation, clear interpretation
* Useful measure for any number of classes

<font color="red">Cons</font>:
* It can be misleading with unbalanced class distributions
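
To see how accuracy can mislead, consider a hypothetical data set (not the Titanic data) with 95 negative and 5 positive examples, and a trivial classifier that always predicts the negative class:

```python
# Hypothetical, strongly unbalanced data: 95 negatives, 5 positives.
truth = ["no"] * 95 + ["yes"] * 5
# A trivial classifier that ignores the attributes entirely.
predictions = ["no"] * len(truth)

correct = sum(t == p for t, p in zip(truth, predictions))
ca = correct / len(truth)

# How many of the truly positive examples were found?
found_positives = sum(t == "yes" and p == "yes" for t, p in zip(truth, predictions))

print(ca)               # 0.95
print(found_positives)  # 0
```

The accuracy looks excellent even though the classifier never finds a single positive example.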

### Precision, recall

$$ p = \frac{TP}{TP + FP} $$

$$ r = \frac{TP}{TP + FN} $$

<font color="green">Pros</font>:
* Simple calculation, clear interpretation
* Separation of both types of errors (false positives and false negatives)
* Also applicable to unbalanced class distributions

<font color="red">Cons</font>:
* Applicable mainly to classification into two classes
* It is difficult to summarize both measures in one number; a common choice is the F1 score, the harmonic mean of precision and recall
$$ F1 = 2 \frac{p \cdot r}{p + r} $$
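
The formulas above can be checked by hand on a small example. The sketch below counts the four outcomes for a made-up list of true and predicted classes, with `yes` chosen as the positive class; the helper `confusion_counts` is hypothetical and only serves this illustration.

```python
def confusion_counts(truth, predictions, positive):
    """Count TP, FP, TN, FN for a chosen positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(truth, predictions):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

# Small made-up example, with "yes" as the positive class.
truth       = ["yes", "yes", "no", "no",  "yes", "no"]
predictions = ["yes", "no",  "no", "yes", "yes", "no"]

tp, fp, tn, fn = confusion_counts(truth, predictions, positive="yes")

ca = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, tn, fn)   # 2 1 2 1
print(ca, precision, recall, f1)
```

Precision and recall are undefined when a denominator is zero (for example, when no example is predicted positive), so real code should guard against that case.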

<font color="green"><b>Do it yourself.</b></font> Predict the classes on the test set. Compare the predicted classes with the true ones and measure the classification accuracy, precision, recall and F1 score.

In [14]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

# Using the methods:
test_data      = Table('podatki/titanic-test.tab')
predictions, _ = model.predict(test_data) 
truth          = [row["survived"].value for row in test_data]
accuracy_score(truth, predictions)

0.77111716621253401

<font color="orange"><b>Challenge.</b></font> Some attribute values can have an estimated probability of 0 given a class. How would you repair the classifier?

<font color="blue"><b>Think.</b></font> How would you extend the classifier if some of the attributes were continuous? Hint: recall the exercises on probability distributions of continuous variables.