### Bayes Rule


* Learn the best hypothesis given data and some domain knowledge
* Learn the most probable hypothesis given data and some domain knowledge

Pr(h|D) is the probability of some hypothesis 'h' given some data 'D'

Bayes' Rule:
~ Pr(h|D) = (Pr(D|h) * Pr(h)) / Pr(D)

chain rule:
Pr(a,b) = Pr(a|b) * Pr(b)
is the same as
Pr(a,b) = Pr(b|a) * Pr(a)
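
Setting the two chain-rule factorizations equal and dividing by Pr(b) gives Bayes' rule. Here is a small numeric check of that step. The joint distribution and the helper names are made up for illustration; they are not from the notes.

```python
# Hypothetical joint distribution Pr(a, b) over two binary variables.
joint = {
    (True, True): 0.2,
    (True, False): 0.1,
    (False, True): 0.3,
    (False, False): 0.4,
}

def marginal_a(a):
    # Pr(a): sum the joint over every value of b
    return sum(p for (ai, _), p in joint.items() if ai == a)

def marginal_b(b):
    # Pr(b): sum the joint over every value of a
    return sum(p for (_, bi), p in joint.items() if bi == b)

def cond_a_given_b(a, b):
    # chain rule rearranged: Pr(a|b) = Pr(a,b) / Pr(b)
    return joint[(a, b)] / marginal_b(b)

def cond_b_given_a(b, a):
    # chain rule rearranged: Pr(b|a) = Pr(a,b) / Pr(a)
    return joint[(a, b)] / marginal_a(a)

# Bayes' rule: Pr(a|b) = Pr(b|a) * Pr(a) / Pr(b)
lhs = cond_a_given_b(True, True)
rhs = cond_b_given_a(True, True) * marginal_a(True) / marginal_b(True)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```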

Pr(D) =  prior belief/probability of seeing some particular sort of data
Pr(D|h) = likelihood that we would see some data (or a particular label) given some hypothesis h is true

D = {(x$_i$, d$_i$)} (set of data and labels--labels are what we care about)

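Example: h(x) = {x >= 10}, i.e. h labels x as true when x >= 10


if x = 7

then h(x) is false, so (assuming the labels are noise-free)
Pr(d = true | h) = 0
Pr(d = false | h) = 1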
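
A minimal sketch of this threshold example, assuming noise-free labels, so the likelihood of a label is 1 when the hypothesis predicts it and 0 otherwise. The name `label_likelihood` is hypothetical.

```python
def h(x):
    """Threshold hypothesis from the notes: predicts True when x >= 10."""
    return x >= 10

def label_likelihood(d, x, hypothesis):
    """Pr(d | h, x), assuming labels are noise-free."""
    return 1.0 if hypothesis(x) == d else 0.0

print(label_likelihood(True, 7, h))   # 0.0
print(label_likelihood(False, 7, h))  # 1.0
```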

Pr(h) is prior on h (this is all our domain knowledge in Bayesian learning)

domain knowledge (prior probability that h is true, for all data, as opposed to just for that data point)

what could make Pr(h|D) go up?

1. a higher Pr(h)
2. higher Pr(D|h) (more accurate hypothesis, more successfully predicting labels)
3. lower Pr(D) (not connected to the hypothesis directly, typically can be ignored)


In [1]:
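def bayes_rule(true_pos_prob, true_neg_prob, prior_prob, test_result):
    # Unnormalized posterior Pr(D|h) * Pr(h) for h = "condition is present".
    # Pr(D) is dropped; true_neg_prob only matters once you normalize.
    if test_result:
        # positive test: Pr(+|h) * Pr(h)
        return true_pos_prob * prior_prob
    # negative test: Pr(-|h) * Pr(h), with Pr(-|h) = 1 - Pr(+|h)
    return (1 - true_pos_prob) * prior_prob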
    
bayes_rule(true_pos_prob=0.98, true_neg_prob=0.97, prior_prob=0.008, test_result=True)

0.00784
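
The 0.00784 above is Pr(D|h) * Pr(h) with Pr(D) left out, so it is not yet a probability of h. To get Pr(h|D) you divide by Pr(D), which is the sum of the numerators over all hypotheses. The sketch below assumes there are only two hypotheses, "condition present" and "condition absent". The function name `posterior_given_positive` is hypothetical.

```python
def posterior_given_positive(true_pos_prob, true_neg_prob, prior_prob):
    """Pr(h | positive test), where h = 'condition is present'."""
    # numerator for h: Pr(+|h) * Pr(h)
    num_h = true_pos_prob * prior_prob
    # numerator for not-h: Pr(+|not h) * Pr(not h)
    num_not_h = (1 - true_neg_prob) * (1 - prior_prob)
    # Pr(D) = Pr(+) is the sum over both hypotheses
    return num_h / (num_h + num_not_h)

p = posterior_given_positive(0.98, 0.97, 0.008)
print(round(p, 4))  # 0.2085
```

Even with a positive result the posterior stays low, because the prior is so small.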

### Bayesian Learning

For each h $\in$ H<br>
calculate Pr(h|D) = (Pr(D|h) * Pr(h)) / Pr(D)<br>
output:<br>
h = argmax$_h$ Pr(h|D)

maximum a posteriori (MAP) hypothesis takes into account Pr(h)

maximum likelihood (ML) hypothesis 'drops' Pr(h)
>but really, we are assuming a uniform prior (all hypotheses are equally likely). This is generally not practical unless the number of hypotheses is really small
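
The brute-force loop above can be sketched for a small, finite hypothesis space. Everything concrete here is an illustrative assumption: the coin-bias hypotheses, the prior, the flips, and the function names. Pr(D) is skipped because it is the same for every h and does not change the argmax.

```python
def map_hypothesis(hypotheses, prior, likelihood, data):
    """Return argmax_h Pr(D|h) * Pr(h); Pr(D) is the same for every h."""
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior[h])

def ml_hypothesis(hypotheses, likelihood, data):
    """Return argmax_h Pr(D|h), i.e. MAP with a uniform prior."""
    return max(hypotheses, key=lambda h: likelihood(data, h))

def likelihood(data, bias):
    """Pr(D|h) for independent coin flips (1 = heads) with the given bias."""
    p = 1.0
    for flip in data:
        p *= bias if flip == 1 else (1 - bias)
    return p

# Toy example: each hypothesis is a coin bias; the prior favors a fair coin.
hypotheses = [0.3, 0.5, 0.9]
prior = {0.3: 0.1, 0.5: 0.8, 0.9: 0.1}
data = [1, 1, 1, 0]

print(ml_hypothesis(hypotheses, likelihood, data))          # 0.9
print(map_hypothesis(hypotheses, prior, likelihood, data))  # 0.5
```

The two answers differ here only because of the prior: the data alone favors a bias of 0.9, but the strong prior on 0.5 wins once Pr(h) is included.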

Example: suppose d$_i$ = k * x$_i$, where k is drawn with probability Pr(k) = 1/2$^k$.
Then solving Pr(D|h) means identifying k for each example (k = d$_i$ / x$_i$) and plugging it into the probability function above.

So if x == 1 and d == 5, then k = 5 and that example has probability 1/2$^5$ = 1/32

Pr(D|h) is then the product of these probabilities over all examples
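
A small sketch of that calculation, assuming x and d are positive integers and using exact fractions. The function names and the second data point in the usage line are hypothetical.

```python
from fractions import Fraction

def example_likelihood(x, d):
    """Probability of one example when d = k * x and Pr(k) = 1 / 2**k."""
    # recover k; if d is not a positive integer multiple of x,
    # no valid k exists and the example has probability 0
    if x == 0 or d % x != 0 or d // x < 1:
        return Fraction(0)
    k = d // x
    return Fraction(1, 2 ** k)

def data_likelihood(examples):
    """Pr(D|h): product of the per-example probabilities."""
    total = Fraction(1)
    for x, d in examples:
        total *= example_likelihood(x, d)
    return total

print(example_likelihood(1, 5))           # 1/32
print(data_likelihood([(1, 5), (3, 6)]))  # 1/128
```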

### Return to Bayesian Learning

Given: {(x$_i$, d$_i$)}<br>
d$_i$ = f(x$_i$) + e$_i$, where f is the function we're trying to find and e$_i$ is the error, drawn from a zero-mean Gaussian:<br>
e$_i$ ~ N(0, $\sigma^2$)

maximum likelihood hypothesis:<br>
h$_{ML}$ = argmin$_h$ $\Sigma_i$ (d$_i$ - h(x$_i$))$^2$, which is the sum of squared errors!



In [3]:
import numpy as np

def sum_of_squared_errors(x, y):
    # compute difference of y from x for each value
    z = [i-j for i,j in zip(x,y)]
    # return the sum of the squares in z
    return np.sum([i**2 for i in z])

In [4]:
d = [1, 0, 5, 2, 1, 4]
x = [1, 3, 6, 10, 11, 13]

h1 = [i % 9 for i in x]
h2 = [i/3 for i in x]
h3 = [2 for i in x]

print(sum_of_squared_errors(d, h1))
print(sum_of_squared_errors(d, h2))
print(sum_of_squared_errors(d, h3))

12
19.4444444444
19
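
To see why least squares is the maximum likelihood answer here: under Gaussian noise, log Pr(D|h) = -(n/2) log(2$\pi\sigma^2$) - $\Sigma_i$ (d$_i$ - h(x$_i$))$^2$ / (2$\sigma^2$), so the hypothesis with the highest likelihood is the one with the smallest sum of squared errors. The sketch below reuses the data and the three hypotheses from the cell above. The value sigma = 1.0 and the name `gaussian_log_likelihood` are assumptions for illustration.

```python
import math

d = [1, 0, 5, 2, 1, 4]
x = [1, 3, 6, 10, 11, 13]

hypotheses = {
    "h1": [i % 9 for i in x],
    "h2": [i / 3 for i in x],
    "h3": [2 for i in x],
}

def gaussian_log_likelihood(targets, predictions, sigma=1.0):
    """log Pr(D|h) when each error d_i - h(x_i) is drawn from N(0, sigma^2)."""
    n = len(targets)
    sse = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - sse / (2 * sigma ** 2)

scores = {name: gaussian_log_likelihood(d, preds) for name, preds in hypotheses.items()}
best = max(scores, key=scores.get)
print(best)  # h1, the hypothesis with the smallest sum of squared errors
```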


### Minimum Description Length
The best hypothesis (the maximum a posteriori one) is the one that minimizes the combined error and size/length of your hypothesis: the simplest hypothesis that minimizes your error (Occam's razor)
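
One way to read this: maximizing Pr(D|h) * Pr(h) is the same as minimizing -log$_2$ Pr(D|h) - log$_2$ Pr(h), which can be interpreted as the bits needed to describe the errors plus the bits needed to describe the hypothesis. A minimal sketch with made-up probabilities; the candidate names and numbers are hypothetical.

```python
import math

def description_length(likelihood, prior):
    """Total bits: -log2 Pr(D|h) (errors) plus -log2 Pr(h) (hypothesis size)."""
    return -math.log2(likelihood) - math.log2(prior)

# Hypothetical candidates: (Pr(D|h), Pr(h)).
candidates = {
    "simple": (0.25, 0.5),     # fits worse, but is short to describe
    "complex": (0.5, 0.125),   # fits better, but is long to describe
}

lengths = {name: description_length(l, p) for name, (l, p) in candidates.items()}
# MAP picks the largest Pr(D|h) * Pr(h); MDL picks the fewest total bits.
map_choice = max(candidates, key=lambda n: candidates[n][0] * candidates[n][1])
mdl_choice = min(lengths, key=lengths.get)
print(lengths)                 # {'simple': 3.0, 'complex': 4.0}
print(map_choice, mdl_choice)  # simple simple
```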

### Bayesian Classification
Suppose you have three hypotheses with scores (posterior probabilities Pr(h|D)) of 0.4, 0.3 and 0.3. The first gives x a '+' label and the other two both give it a '-' label. The best label for x is '-', because each hypothesis basically votes with its score: the '-' label gets 0.3 + 0.3 = 0.6 out of 1.0, while '+' gets only 0.4.
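
The vote can be sketched as a weighted tally, where each hypothesis adds its score to the label it predicts. The function name `bayes_optimal_label` is hypothetical.

```python
from collections import defaultdict

def bayes_optimal_label(votes):
    """votes: list of (label, score) pairs, one per hypothesis."""
    totals = defaultdict(float)
    for label, score in votes:
        # each hypothesis contributes its score to the label it predicts
        totals[label] += score
    # the winning label is the one with the largest total score
    return max(totals, key=totals.get), dict(totals)

label, totals = bayes_optimal_label([("+", 0.4), ("-", 0.3), ("-", 0.3)])
print(label, totals)
```

Note that the single most probable hypothesis says '+', yet the weighted vote picks '-'.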

### Summary

* Bayes' rule: swap causes and effects
* priors matter
* maximum a posteriori (MAP) hypothesis; maximum likelihood (ML) hypothesis (the MAP you get when the prior is uniform)
* connected MAP (with a uniform prior and Gaussian noise) and least squares
* classification: voting of hypotheses, Bayes optimal classifier