# Deep Learning Book: Probability and Information Theory

In probability theory, an **event** is a set of outcomes of an **experiment** to which a **probability** is assigned. If **E** represents an event, then **P(E)** represents the probability that E will occur. A situation where E might happen (success) or might not happen (failure) is called a **trial**. When we say that an outcome has a probability p of occurring, it means that if we repeated the experiment infinitely many times, then proportion p of the repetitions would result in that outcome. According to the book, nearly all activities require some ability to reason in the presence of uncertainty.

The chapter is divided into the following sections and subsections:

* Why Probability
* Random Variables
* Probability Distributions
    * Discrete Variables and Probability Mass Functions
    * Continuous Variables and Probability Density Functions
* Marginal Probability
* Conditional Probability
* The Chain Rule of Conditional Probabilities
* Independence and Conditional Independence
* Expectation, Variance and Covariance
* Common Probability Distributions
    * Bernoulli
    * Multinoulli
    * Gaussian
    * Exponential and Laplace Distributions
    * The Dirac Distribution and Empirical Distribution
    * Mixtures of Distributions
* Useful Properties of Common Functions
* Bayes' Rule
* Technical Details of Continuous Variables
* Information Theory
* Structured Probabilistic Models

### 1. Why Probability

As mentioned earlier, very few things are certain, so machine learning models should reason using probabilistic rules. In the language of the book, machine learning must always deal with uncertain quantities and sometimes stochastic (nondeterministic) quantities. Uncertainty and stochasticity can arise from many sources. The three possible sources are:

1. Inherent stochasticity in the system being modeled: A situation where the outcome is genuinely random. For example, say we have a language model that predicts the next word in a sentence. For it, a sentence like *Github is ___* is difficult because many different words could plausibly come next.

2. Incomplete observability: For example, in the Monty Hall problem, the outcome given the contestant's choice is deterministic, but from the contestant's point of view, the outcome is uncertain.

3. Incomplete modeling: When we use a model that must discard some of the information we have observed, the discarded information results in uncertainty in the model's predictions.

In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain one. The book gives a great example of birds. It makes more sense to remember "most birds can fly" than "all birds can fly except sick, injured or very young birds etc.".

If a probability is related directly to the rate at which events occur, it is known as a **frequentist probability** (for example, the chance of being dealt a particular hand of cards).
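The frequentist reading can be illustrated with a small simulation. This is only a sketch: the helper name `empirical_frequency`, the success probability 0.3 and the trial counts are assumptions chosen for illustration, not values from the book.

```python
import random

def empirical_frequency(p, n_trials, seed=0):
    """Fraction of n_trials independent trials that succeed,
    where each trial succeeds with probability p."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    successes = sum(rng.random() < p for _ in range(n_trials))
    return successes / n_trials

# The observed proportion should settle near p as the number of trials grows.
for n in (10, 1_000, 100_000):
    print(n, empirical_frequency(0.3, n))
```

With few trials the observed proportion can sit far from 0.3; with many trials it tends to settle close to it, which is the sense in which a frequentist probability describes a long-run rate.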

If a probability is related to qualitative levels of certainty, it is known as a **Bayesian probability** (for example, whether a patient has the flu or not). In such a case we have a **degree of belief**, with 1 indicating absolute certainty that the patient has the flu and 0 indicating absolute certainty that the patient doesn't have the flu.

### 2. Random Variables

A random variable is a variable that can take on different values randomly (Can't be more obvious :P).

Random variables may be **discrete** or **continuous**. A discrete random variable is one that has a finite or countably infinite number of states. Note that these states are not necessarily the integers; they can also just be named states that are not considered to have any numerical value. A continuous random variable is associated with a real value.

On its own, a random variable is just a description of the states that are possible; it must be coupled with a **probability distribution** that specifies how likely each of these states is.

### 3. Probability Distributions

A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states.

#### 3.1 Discrete Variables and Probability Mass Functions

A probability distribution over ***discrete variables*** may be described using a **probability mass function (PMF)**. The probability mass function maps from a state of a random variable to the probability of that random variable taking on that state.

Probability mass functions can act on many variables at the same time. Such a probability distribution over many variables is known as a **joint probability distribution**. P(x=*x*, y=*y*) denotes the probability that x=*x* and y=*y* simultaneously. We may also write P(*x*, *y*) for brevity.

To be a PMF on a random variable x, a function P must satisfy the following properties:
* The domain of P must be the set of all possible states of x.
* $\forall x \in \text{x},\ 0 \le P(x) \le 1$.
* $\sum_{x \in \text{x}} P(x) = 1$ (Normalization)

For example, consider a single discrete random variable x with k different states. We can place a uniform distribution on x by setting its PMF to $P(\text{x} = x_i) = 1/k$ for all i. This is normalized, since summing over all k states gives $\sum_i 1/k = k/k = 1$.
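The PMF properties can be checked mechanically. The following is a minimal sketch, assuming a PMF is stored as a dictionary from state to probability; the helper names `is_valid_pmf` and `uniform_pmf` are made up for this illustration.

```python
from fractions import Fraction

def is_valid_pmf(pmf):
    """Check that every probability lies in [0, 1] and that they sum to 1.
    `pmf` maps each possible state to its probability."""
    in_range = all(0 <= prob <= 1 for prob in pmf.values())
    normalized = sum(pmf.values()) == 1
    return in_range and normalized

def uniform_pmf(states):
    """Uniform PMF: each of the k states gets probability 1/k."""
    k = len(states)
    return {state: Fraction(1, k) for state in states}

die = uniform_pmf([1, 2, 3, 4, 5, 6])
colors = uniform_pmf(["red", "green", "blue"])  # states need not be numbers

print(is_valid_pmf(die), is_valid_pmf(colors))
```

`Fraction` is used instead of floats so that the normalization check is exact rather than subject to rounding error.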

#### 3.2 Continuous Variables and Probability Density Functions

For ***continuous random variables***, we describe probability distributions using a **probability density function (PDF)**.

To be a PDF, a function *p* must satisfy the following properties:
* The domain of *p* must be the set of all possible states of x.
* $\forall x \in \text{x},\ p(x) \ge 0$. Note that we do not require $p(x) \le 1$.
* $\int p(x)\,dx = 1$


A probability density function p(*x*) does not give the probability of a specific state directly; instead, the probability of landing inside an infinitesimal region with volume δ*x* is given by p(*x*)δ*x*.
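To see why a density may exceed 1 while probabilities never do, here is a small numerical sketch. It assumes a uniform distribution on the interval [0, 0.5] (chosen for illustration) and approximates the integral with a midpoint Riemann sum; both helper names are hypothetical.

```python
def uniform_density(x, a=0.0, b=0.5):
    """Density of the uniform distribution on [a, b]: 1/(b - a) inside, 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def integrate(f, lo, hi, n=10_000):
    """Approximate the integral of f over [lo, hi] with a midpoint Riemann sum."""
    dx = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * dx) for i in range(n)) * dx

print(uniform_density(0.25))                 # the density value itself is larger than 1
print(integrate(uniform_density, 0.0, 0.5))  # total probability over the support
print(integrate(uniform_density, 0.1, 0.2))  # probability of landing in [0.1, 0.2]
```

The density equals 2 everywhere on the interval, yet the integral over the whole interval is still 1, and the probability of any sub-interval is the density times its width.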

### 4. Marginal Probability

A marginal probability distribution is the probability distribution over a **subset of a set** of random variables.
Suppose we know P(x, y) for discrete random variables x and y. We can find P(x) with the sum rule: $\forall x \in \text{x},\ P(\text{x} = x) = \sum_{y} P(\text{x} = x, \text{y} = y)$.

For continuous variables, we need to use integration instead of summation: $p(x) = \int p(x, y)\,dy$

When the values of P(x, y) are written in a grid with different values of x in rows and different values of y in columns, it is natural to sum across a row of the grid, then write P(x) in the margin of the paper just to the right of the row (see the figure below).


![image.png](https://www.statisticshowto.datasciencecentral.com/wp-content/uploads/2014/02/marginal-distributions-1.jpg)
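The sum rule can also be sketched in code. The joint table below is hypothetical (the weather states and probabilities are invented for illustration), and `marginal` is a helper name introduced here, not something from the book.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P(x, y): keys are (x, y) pairs, values are probabilities.
joint = {
    ("sunny", "hot"): F(4, 10), ("sunny", "cold"): F(1, 10),
    ("rainy", "hot"): F(1, 10), ("rainy", "cold"): F(4, 10),
}

def marginal(joint, axis):
    """Sum rule: keep the variable at position `axis` and sum out the other one."""
    result = {}
    for states, prob in joint.items():
        key = states[axis]
        result[key] = result.get(key, 0) + prob
    return result

p_x = marginal(joint, 0)  # row sums: distribution of x alone
p_y = marginal(joint, 1)  # column sums: distribution of y alone
print(p_x)
print(p_y)
```

Each marginal is itself a valid PMF: its entries are the row (or column) totals of the grid and they sum to 1.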

### 5. Conditional Probability

Marginalization allows us to get the distribution of variable x ignoring variable y from the joint distribution of x and y, but what if we want to know the distribution of y given a specific value of x? Conditional probability is the probability of **some event, given that some other event has happened**. The conditional probability that y = *y* given x = *x* is denoted by $P(\text{y} = y \hspace{.1cm}| \hspace{.1cm} \text{x} = x)$ and is computed as
$$ P(\text{y} = y \hspace{.1cm}| \hspace{.1cm} \text{x} = x) = \frac{P(\text{y} = y , \text{x} = x)}{P(\text{x} = x)} $$
It is only defined when $P(\text{x} = x) > 0$.



To give an example, suppose you know that Jens can speak German but want to know the probability that he is from Germany. In such a case, you'll make use of conditional probability.
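The example can be made concrete with a toy joint distribution. All numbers below are invented purely for illustration and are not real statistics; here x is "speaks German" and y is "is from Germany", and `conditional` is a hypothetical helper.

```python
from fractions import Fraction as F

# Hypothetical joint PMF over (speaks_german, from_germany). Made-up numbers.
joint = {
    (True, True): F(10, 1000), (True, False): F(6, 1000),
    (False, True): F(1, 1000), (False, False): F(983, 1000),
}

def conditional(joint, given_speaks):
    """P(from_germany | speaks_german = given_speaks) = joint / marginal."""
    p_given = sum(p for (speaks, _), p in joint.items() if speaks == given_speaks)
    if p_given == 0:
        raise ValueError("cannot condition on an event with zero probability")
    return {origin: p / p_given
            for (speaks, origin), p in joint.items() if speaks == given_speaks}

posterior = conditional(joint, given_speaks=True)
print(posterior[True])  # P(from Germany | speaks German) under these made-up numbers
```

Dividing by the marginal P(x = *x*) renormalizes the selected row of the joint table so that the conditional distribution sums to 1.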

However, note that conditional probability does not imply causation. To take a common example from Pearl & Mackenzie, the probability of the sun rising, given that the rooster has crowed, is high. But this in no way implies that the rooster's crowing causes the sun to rise.

More precisely, the conditional probability holds under the assumption of observation without intervention. **If we observe the rooster crowing without having intervened to force it to crow, there is a good chance that the sun has risen. But if we intervene by forcing the rooster to crow, the crowing is no longer related to the sunrise.**

### 6. The Chain Rule of Conditional Probabilities

Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable. 
The general expression is given by: $$P(\text{x}^{(1)}, ..., \text{x}^{(n)}) = P(\text{x}^{(1)}) \prod_{i=2}^{n} P(\text{x}^{(i)} \hspace{.1cm} | \hspace{.1cm} \text{x}^{(1)},..., \text{x}^{(i-1)}) $$
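
As a worked instance for three variables (the names a, b and c are only illustrative), applying the definition of conditional probability twice gives:

$$ P(\text{a}, \text{b}, \text{c}) = P(\text{c} \hspace{.1cm} | \hspace{.1cm} \text{a}, \text{b}) \, P(\text{a}, \text{b}) $$

$$ P(\text{a}, \text{b}) = P(\text{b} \hspace{.1cm} | \hspace{.1cm} \text{a}) \, P(\text{a}) $$

$$ P(\text{a}, \text{b}, \text{c}) = P(\text{a}) \, P(\text{b} \hspace{.1cm} | \hspace{.1cm} \text{a}) \, P(\text{c} \hspace{.1cm} | \hspace{.1cm} \text{a}, \text{b}) $$

The last line is the general expression with n = 3, where a, b and c play the roles of $\text{x}^{(1)}$, $\text{x}^{(2)}$ and $\text{x}^{(3)}$.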

### 7. Independence and Conditional Independence

Two random variables x & y are said to be **independent** (x $\perp$ y) if they satisfy: 

$$ \forall x \in \text{x}, y \in \text{y}, P(\text{x} = x, \text{y} = y) = P(\text{x} = x)P(\text{y} = y) $$
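
The factorization condition can be tested directly on a small joint table. This is a sketch under the assumption that the joint PMF is a dictionary keyed by (x, y) pairs; the helper `is_independent` and both example tables are invented for illustration.

```python
from fractions import Fraction as F
from itertools import product

def is_independent(joint):
    """True if P(x, y) == P(x) * P(y) for every pair of states."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():  # build both marginals with the sum rule
        p_x[x] = p_x.get(x, 0) + p
        p_y[y] = p_y.get(y, 0) + p
    # Pairs missing from the table are treated as having probability 0.
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y]
               for x, y in product(p_x, p_y))

# Two fair coins flipped separately: every pair has probability 1/4.
fair = {(a, b): F(1, 4) for a, b in product("HT", repeat=2)}
# Hypothetical case where the second coin always copies the first.
copied = {("H", "H"): F(1, 2), ("T", "T"): F(1, 2)}

print(is_independent(fair), is_independent(copied))
```

In the second table both marginals are still uniform, but the joint does not factorize, so knowing x tells us y exactly.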