## Bayesian Statistics  (Manning & Schütze, 1999)

### Bayes' Theorem

- Bayes' theorem lets us swap the order of dependence between events.
- That is, it lets us calculate P(B|A) in terms of P(A|B). 
- This is useful when the former quantity is difficult to determine. 
$$ P(B|A)=\frac{P(B\cap A)}{P(A)}=\frac{P(A|B)P(B)}{P(A)} - - -\dots(1)$$
- The denominator P(A) on the right-hand side can be viewed as a **normalizing constant**, something that ensures that we have a probability function. 
- If we are simply interested in which event out of some set is most likely given A, we can ignore it. 
- Since the denominator is the same in all cases, we have that:
$$ \arg\max_B \frac{P(A|B)P(B)}{P(A)}=\arg\max_B P(A|B)P(B) - - -\dots(2)$$
- However, we can also evaluate the denominator recalling that:
$$P(A\cap B)= P(A|B)P(B)$$
$$P(A\cap\bar{B})= P(A|\bar{B})P(\bar{B})$$
So we have
$$\begin{aligned}
P(A)&=P(A\cap B) + P(A \cap \bar{B})&\text{[additivity]} \\
&=P(A|B)P(B)+P(A|\bar{B})P(\bar{B})
\end{aligned}$$

- B and $\bar{B}$ serve to split the set A into two disjoint parts (one possibly empty), and so we can evaluate the conditional probability on each, and then sum, using additivity. 
- More generally, if we have some group of sets $B_i$ that **partition** A, that is, if $A\subseteq \cup_iB_i$ and the $B_i$ are disjoint, then:
$$P(A)=\sum_iP(A|B_i)P(B_i) - - - -\dots(3)$$
- This gives us the following equivalent but more elaborated version of **Bayes' theorem**:
$$P(B_j|A)=\frac{P(A|B_j)P(B_j)}{P(A)}=\frac{P(A|B_j)P(B_j)}{\sum_{i=1}^nP(A|B_i)P(B_i)}$$
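
As a minimal sketch of equations (2) and (3), the Python snippet below normalizes $P(A|B_i)P(B_i)$ over a partition and checks that the most likely $B_i$ is the same with or without the denominator. The three hypotheses and their probabilities are made-up numbers for illustration only, and `posterior` is a hypothetical helper name, not something from the source text.

```python
from fractions import Fraction


def posterior(prior, likelihood):
    """Return P(B_i | A) for every B_i in a partition.

    prior[b] is P(B_i) and likelihood[b] is P(A | B_i).
    """
    # Unnormalized scores P(A | B_i) P(B_i): the numerator of Bayes' theorem.
    joint = {b: likelihood[b] * prior[b] for b in prior}
    # P(A) by summing over the partition, as in equation (3).
    p_a = sum(joint.values())
    return {b: joint[b] / p_a for b in joint}


# Illustrative numbers only (assumed, not from the source text).
prior = {"B1": Fraction(1, 2), "B2": Fraction(3, 10), "B3": Fraction(1, 5)}
likelihood = {"B1": Fraction(1, 10), "B2": Fraction(1, 2), "B3": Fraction(1, 4)}

post = posterior(prior, likelihood)

# The arg max is unchanged when the normalizing constant is dropped, equation (2).
best_normalized = max(post, key=post.get)
best_unnormalized = max(prior, key=lambda b: likelihood[b] * prior[b])
```

Exact fractions are used here so that the posterior values sum to exactly 1; floats would work equally well for the arg max comparison.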

#### Example 1

Suppose one is interested in a rare syntactic construction, perhaps parasitic gaps, which occurs on average once in 100,000 sentences.  Joe Linguist has developed a complicated pattern matcher that attempts to identify sentences with parasitic gaps. It’s pretty good, but it’s not perfect: if a sentence has a parasitic gap, it will say so with probability 0.95, if it doesn’t, it will wrongly say it does with probability 0.005. Suppose the test says that a sentence contains a parasitic gap. What is the probability that this is true?

#### Solution: 

Let G be the event of the sentence having a parasitic gap, and let T be the event of the test being positive. We want to determine:
$$ \begin{aligned}
 P(G|T)&=\frac{P(T|G)P(G)}{P(T|G)P(G)+P(T|\bar{G})P(\bar{G})} \\
 &=\frac{0.95\times 0.00001}{0.95\times 0.00001+0.005\times 0.99999}\approx 0.002
\end{aligned} $$

Here we use having the construction or not as the partition in the denominator. Although Joe’s test seems quite reliable, we find that using it won’t help as much as one might have hoped. On average, only 1 in every 500 sentences that the test identifies will actually contain a parasitic gap. This poor result comes about because the prior probability of a sentence containing a parasitic gap is so low.
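
The arithmetic in Example 1 can be checked with a few lines of Python. This is only a sketch; the variable names are chosen here for readability and the three input probabilities are the ones stated in the example.

```python
# Quantities given in Example 1.
p_g = 0.00001            # P(G): prior probability of a parasitic gap
p_t_given_g = 0.95       # P(T | G): test is positive when a gap is present
p_t_given_not_g = 0.005  # P(T | not G): false positive rate

# Denominator: P(T), partitioned by G and its complement.
p_t = p_t_given_g * p_g + p_t_given_not_g * (1 - p_g)

# Bayes' theorem: P(G | T).
p_g_given_t = p_t_given_g * p_g / p_t

print(f"P(G|T) = {p_g_given_t:.5f}")
```

The result is a little under 0.002, which matches the "1 in every 500" reading of the example.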

### Random Variable
- A **random variable** is simply a function $X: \Omega \rightarrow \mathbb{R}^n$ (commonly with n = 1), where $\mathbb{R}$ is the set of real numbers.
- Rather than having to work with some irregular event space which differs with every problem we look at, a random variable allows us to talk about the probabilities of numerical values that are related to the event space. 
- We think of an abstract **stochastic process** that generates numbers with a certain probability distribution. (The word stochastic simply means ‘probabilistic’ or ‘randomly generated,’ but is especially commonly used when referring to a sequence of results assumed to be generated by some underlying probability distribution.)
- A discrete random variable is a function $X: \Omega \rightarrow S$ where S is a countable subset of $\mathbb{R}$. If $X:\Omega \rightarrow \{0, 1\}$, then X is called an **indicator random variable** or a **Bernoulli trial**.

#### Example 2
Suppose events are those that result from tossing two dice.  Then we could define a discrete random variable X that is the sum of their faces: S={2,...,12}, as indicated in table 1.

#### Solution
<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;}
.tg .tg-yw4l{vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-yw4l" rowspan="2">First <br>Die</th>
    <th class="tg-yw4l" colspan="6">Second Die</th>
    <th class="tg-yw4l"></th>
    <th class="tg-yw4l"></th>
    <th class="tg-yw4l"></th>
    <th class="tg-yw4l"></th>
    <th class="tg-yw4l"></th>
    <th class="tg-yw4l"></th>
  </tr>
  <tr>
    <td class="tg-yw4l">1</td>
    <td class="tg-yw4l">2</td>
    <td class="tg-yw4l">3</td>
    <td class="tg-yw4l">4</td>
    <td class="tg-yw4l">5</td>
    <td class="tg-yw4l">6</td>
    <td class="tg-yw4l"></td>
    <td class="tg-yw4l"></td>
    <td class="tg-yw4l"></td>
    <td class="tg-yw4l"></td>
    <td class="tg-yw4l"></td>
    <td class="tg-yw4l"></td>
  </tr>
  <tr>
    <td class="tg-yw4l">6<br>5<br>4<br>3<br>2<br>1</td>
    <td class="tg-yw4l">7<br>6<br>5<br>4<br>3<br>2</td>
    <td class="tg-yw4l">8<br>7<br>6<br>5<br>4<br>3</td>
    <td class="tg-yw4l">9<br>8<br>7<br>6<br>5<br>4</td>
    <td class="tg-yw4l">10<br>9<br>8<br>7<br>6<br>5</td>
    <td class="tg-yw4l">11<br>10<br>9<br>8<br>7<br>6</td>
    <td class="tg-yw4l">12<br>11<br>10<br>9<br>8<br>7</td>
    <td class="tg-yw4l"></td>
    <td class="tg-yw4l"></td>
    <td class="tg-yw4l"></td>
    <td class="tg-yw4l"></td>
    <td class="tg-yw4l"></td>
    <td class="tg-yw4l"></td>
  </tr>
  <tr>
    <td class="tg-yw4l">x</td>
    <td class="tg-yw4l"></td>
    <td class="tg-yw4l">2</td>
    <td class="tg-yw4l">3</td>
    <td class="tg-yw4l">4</td>
    <td class="tg-yw4l">5</td>
    <td class="tg-yw4l">6</td>
    <td class="tg-yw4l">7</td>
    <td class="tg-yw4l">8</td>
    <td class="tg-yw4l">9</td>
    <td class="tg-yw4l">10</td>
    <td class="tg-yw4l">11</td>
    <td class="tg-yw4l">12</td>
  </tr>
  <tr>
    <td class="tg-yw4l">$p(X=x)$</td>
    <td class="tg-yw4l"></td>
    <td class="tg-yw4l">$\frac{1}{36}$</td>
    <td class="tg-yw4l">$\frac{1}{18}$</td>
    <td class="tg-yw4l">$\frac{1}{12}$</td>
    <td class="tg-yw4l">$\frac{1}{9}$</td>
    <td class="tg-yw4l">$\frac{5}{36}$</td>
    <td class="tg-yw4l">$\frac{1}{6}$</td>
    <td class="tg-yw4l">$\frac{5}{36}$</td>
    <td class="tg-yw4l">$\frac{1}{9}$</td>
    <td class="tg-yw4l">$\frac{1}{12}$</td>
    <td class="tg-yw4l">$\frac{1}{18}$</td>
    <td class="tg-yw4l">$\frac{1}{36}$</td>
  </tr>
</table>

**Table 1:** A random variable X for the sum of the two dice.  Entries in the body of the table show the value of X given the underlying basic outcomes, while the bottom two rows show the pmf p(x).

- Because a random variable has a numeric range, we can often do mathematics more easily by working with the values of a random variable, rather than directly with events.
- In particular, we can define the **probability mass function (pmf)** for a random variable X, which gives the probability that the random variable takes each of its numeric values:
$$\text{pmf}\quad p(x)=p(X=x)=P(A_x)\text{ where }A_x=\{\omega\in\Omega:X(\omega)=x\} - - - -\dots(4)$$
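
The pmf in Table 1 can be reproduced by enumerating the sample space directly. The sketch below assumes two fair six-sided dice; the names `omega`, `counts`, and `pmf` are illustrative choices.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely (first die, second die) outcomes.
omega = list(product(range(1, 7), repeat=2))

# X maps each basic outcome to the sum of the two faces;
# counts[x] is the size of the event A_x = {outcomes with X = x}.
counts = Counter(a + b for a, b in omega)

# p(x) = P(A_x) = |A_x| / |Omega| for equally likely outcomes.
pmf = {x: Fraction(n, len(omega)) for x, n in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)
```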

### Expectation and Variance

The **expectation** is the **mean** or average of a random variable.  If X is a random variable with a pmf p(x) such that $\sum_x|x|p(x)<\infty$ then the expectation is:
$$E(X)=\sum_xxp(x) - - - -\dots(5)$$

#### Example 3
If one rolls a single die and Y is the value on its face, then:
#### Solution
$$E(Y)=\sum_{y=1}^6yp(y)=\frac{1}{6}\sum_{y=1}^6y=\frac{21}{6}=3\frac{1}{2}$$
This is the expected average found by totaling up a large number of throws of the die and dividing by the number of throws.  If $Y \sim p(y)$ is a random variable, any function g(Y) defines a new random variable. If E(g(Y)) is defined, then:
$$E(g(Y))=\sum_yg(y)p(y) - - - -\dots(6)$$

For instance, by letting g be a linear function g(Y)=aY+b, we see that E(g(Y))=aE(Y)+b.  We also have that E(X+Y)=E(X)+E(Y) and if X and Y are independent, then E(XY)=E(X)E(Y).
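
Equations (5) and (6) and the linearity property can be checked numerically for a single fair die. This is a small sketch; `expectation` is a hypothetical helper, and the constants `a` and `b` are arbitrary values chosen for illustration.

```python
from fractions import Fraction


def expectation(pmf, g=lambda v: v):
    """E(g(Y)) = sum over y of g(y) p(y); with the default g this is E(Y)."""
    return sum(g(v) * p for v, p in pmf.items())


# pmf of one fair die: each face has probability 1/6.
die = {y: Fraction(1, 6) for y in range(1, 7)}

e_y = expectation(die)  # E(Y), equation (5)

# Linear function g(Y) = aY + b with arbitrary illustrative constants.
a, b = 2, 3
e_g = expectation(die, lambda y: a * y + b)  # E(g(Y)), equation (6)

print(e_y, e_g, a * e_y + b)
```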

The **variance** of a random variable is a measure of whether the values of the random variable tend to be consistent over trials or to vary a lot.  One measures it by finding out how much, on average, the variable's values deviate from the variable's expectation:
$$\begin{aligned}Var(X)&=E((X-E(X))^2) \\ &=E(X^2)-E^2(X)\end{aligned}- - - -\dots(7)$$

The commonly used **standard deviation** of a variable is the square root of variance.  When talking about a particular distribution or set of data, the mean is commonly denoted as $\mu$, the variance as $\sigma^2$ and hence the standard deviation written as $\sigma$.

#### Example 4
What are the expectation and variance of the random variable for the sum of two dice?

#### Solution
 The expectation follows from the result for the expectation of a sum, applied to the two individual dice:
 $$E(X)=E(Y+Y)=E(Y)+E(Y)=3\frac{1}{2}+3\frac{1}{2}=7$$
 
 The variance is given by:
 $$Var(X)=E((X-E(X))^2)=\sum_xp(x)(x-E(X))^2=5\frac{5}{6}$$
 
 Because the results for rolling two dice are concentrated around 7, the variance of this distribution is less than for an '11-sided die', which returns a uniform distribution over the numbers 2-12.  For such a uniformly distributed random variable U, we find that Var(U)=10.
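
The values in Example 4 can be verified by applying equation (7) to both pmfs. The sketch below assumes fair dice; `mean` and `variance` are hypothetical helper names introduced here.

```python
from fractions import Fraction
from itertools import product


def mean(pmf):
    """E(X) = sum over x of x p(x)."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    """Var(X) = E((X - E(X))^2)."""
    mu = mean(pmf)
    return sum(p * (x - mu) ** 2 for x, p in pmf.items())


# pmf of the sum of two fair dice, built by enumerating all 36 outcomes.
two_dice = {}
for a, b in product(range(1, 7), repeat=2):
    two_dice[a + b] = two_dice.get(a + b, 0) + Fraction(1, 36)

# The '11-sided die': uniform over the numbers 2..12.
uniform = {u: Fraction(1, 11) for u in range(2, 13)}

print(mean(two_dice), variance(two_dice))  # variance is 35/6, i.e. 5 5/6
print(mean(uniform), variance(uniform))
```

Both distributions have the same mean, so the difference in variance reflects only how the probability mass is spread around 7.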

### Joint and conditional distributions

- Often we define many random variables over a sample space giving us a joint (or multivariate) probability distribution. The joint probability mass function for two discrete random variables X, Y is:
$$p(x,y)=P(X=x,Y=y)$$
- Related to a joint pmf are the marginal pmfs, which total up the probability masses for the values of each variable separately:
$$\begin{aligned}p_X(x)=\sum_yp(x,y)&&p_Y(y)=\sum_xp(x,y)\end{aligned} - - - -\dots(8)$$

- In general the marginal mass functions do not determine the joint mass function. But if X and Y are independent, then $p(x,y) = p_X(x)\,p_Y(y)$.
- For example, for the probability of getting two sixes from rolling two dice, since these events are independent, we can compute that:
$$p(Y=6,Z=6)=p(Y=6)p(Z=6)=\frac{1}{6}\times\frac{1}{6}=\frac{1}{36}$$
- There are analogous results for joint distributions and probabilities for the intersection of events. So we can define a conditional pmf in terms of the joint distribution:
$$p_{X|Y}(x|y)=\frac{p(x,y)}{p_Y(y)} \text{ for y such that }p_Y(y)>0$$
and deduce a chain rule in terms of random variables, for instance:
$$p(w,x,y,z)=p(w)p(x|w)p(y|w,x)p(z|w,x,y)$$
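
To make the marginal and conditional definitions concrete, the sketch below builds a joint pmf for two dependent variables: Y, the face of the first die, and X, the sum of both dice. This choice of variables and the helper names `marginal` and `conditional` are assumptions made for illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Joint pmf p(x, y), keyed by (x, y): X is the sum of two fair dice,
# Y is the face of the first die.
joint = defaultdict(Fraction)
for y, z in product(range(1, 7), repeat=2):
    joint[(y + z, y)] += Fraction(1, 36)


def marginal(joint, axis):
    """Sum the joint pmf over the other variable, as in equation (8)."""
    out = defaultdict(Fraction)
    for key, p in joint.items():
        out[key[axis]] += p
    return dict(out)


p_x = marginal(joint, 0)
p_y = marginal(joint, 1)


def conditional(x, y):
    """p_{X|Y}(x | y) = p(x, y) / p_Y(y), defined only where p_Y(y) > 0."""
    return joint.get((x, y), Fraction(0)) / p_y[y]


# Chain rule for two variables: p(x, y) = p_Y(y) p_{X|Y}(x | y).
chain_rule_holds = all(
    p == p_y[y] * conditional(x, y) for (x, y), p in list(joint.items())
)

# X and Y are dependent, so the joint is not the product of the marginals.
is_product = all(p == p_x[x] * p_y[y] for (x, y), p in list(joint.items()))
```

Here `chain_rule_holds` is true while `is_product` is false, which illustrates that the marginals alone do not determine the joint pmf.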

### Standard Distributions

- Certain probability mass functions crop up commonly in practice. 
- In particular, one commonly finds the same basic form of a function, but just with different constants employed. 
- Statisticians have long studied these families of functions. They refer to the family of functions as a **distribution** and to the numbers that define the different members of the family as **parameters**. 
- Parameters are constants when one is talking about a particular pmf, but variables when one is looking at the family. 
- When writing out the arguments of a distribution, it is usual to separate the random variable arguments from the parameters with a semicolon (;). 
- In this section, we just briefly introduce the idea of distributions with one example each of a discrete distribution (the binomial distribution), and a continuous distribution (the normal distribution).

#### Binomial Distribution
- A binomial distribution results when one has a series of trials with only two outcomes (i.e., Bernoulli trials), each trial being independent from all the others.
- Repeatedly tossing a (possibly unfair) coin is the prototypical example of something with a binomial distribution. 
- Now when looking at linguistic corpora, it is never the case that the next sentence is truly independent of the previous one, so use of a binomial distribution is always an approximation. 
- Nevertheless, for many purposes, the dependency between words falls off fairly quickly and we can assume independence.
- In any situation where one is counting whether something is present or absent, or has a certain property or not, and one is ignoring the possibility of dependencies between one trial and the next, one is at least implicitly using a binomial distribution, so this distribution actually crops up quite commonly in Statistical NLP applications. Examples include: looking through a corpus to find an estimate of the percentage of sentences in English that have the word *the* in them, or finding out how commonly a verb is used transitively by looking through a corpus for instances of a certain verb and noting whether each use is transitive or not.
- The family of binomial distributions gives the number r of successes out of n trials given that the probability of success in any trial is p:
$$b(r;n,p)=\binom{n}{r}p^r(1-p)^{n-r} \text{ where } \binom{n}{r}=\frac{n!}{(n-r)!\,r!},\quad 0\le r\le n - \dots(9)$$

- The term $\binom{n}{r}$ counts the number of different possibilities for choosing r objects out of n, not considering the order in which they are chosen.

#### Example 5:  
Let R have as its value the number of heads in n tosses of a (possibly weighted) coin, where the probability of a head is p.  Then we have a binomial distribution $p(R=r)=b(r; n,p)$

#### Proof:
This follows by counting: each basic outcome with r heads and n-r tails has probability $p^r(1-p)^{n-r}$, and there are $\binom{n}{r}$ of them.
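
Equation (9) translates directly into code. The sketch below uses `math.comb` from the Python standard library (Python 3.8 or later) for the binomial coefficient; `binomial_pmf` is a hypothetical function name, and the values of `n` and `p` are arbitrary illustrative choices.

```python
from math import comb


def binomial_pmf(r, n, p):
    """b(r; n, p): probability of exactly r successes in n independent trials."""
    if not 0 <= r <= n:
        return 0.0
    # comb(n, r) counts the outcomes with r successes; each such outcome
    # has probability p^r (1 - p)^(n - r).
    return comb(n, r) * p**r * (1 - p) ** (n - r)


# Illustrative parameters (assumed): 10 tosses of a coin with P(head) = 0.3.
n, p = 10, 0.3
probs = [binomial_pmf(r, n, p) for r in range(n + 1)]

for r, pr in enumerate(probs):
    print(r, round(pr, 4))
```

Setting `n = 1` gives the single-trial case, where `binomial_pmf(1, 1, p)` is simply `p`.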

- The generalization of a binomial trial to the case where each of the trials has more than two basic outcomes is called a **multinomial** experiment, and is modeled by the multinomial distribution. A zeroth-order n-gram model of the type we discuss in chapter 6 is a straightforward example of a multinomial distribution.
- Another discrete distribution that we discuss and use in this book is the Poisson distribution (section 15.3.1). Section 5.3 discusses the Bernoulli distribution, which is simply the special case of the binomial distribution where there is only one trial. That is, we calculate b(r; 1, p).

## References

Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.