# Probability and Information Theory

<br>

Probability theory is a mathematical framework for representing uncertain statements. In simple terms, probability measures how certain or uncertain we are about something.

If you are already familiar with the following concepts, you can move on to the next notebook:
>Reference http://www.deeplearningbook.org/contents/prob.html
    
    1. Why Probability
    2. Random Variables
    3. Probability Distributions
    4. Marginal Probability
    5. Conditional Probability
    6. The Chain Rule of Conditional Probabilities
    7. Independence and Conditional Independence
    8. Expectation, Variance and Covariance
    9. Common Probability Distributions
    10. Useful Properties of Common Functions
    11. Bayes’ Rule
    12. Technical Details of Continuous Variables
    13. Information Theory
    14. Structured Probabilistic Models
    
----------------------
    

In [None]:
#library imports
import numpy as np
import math
import matplotlib.pyplot as plt

### 1. Why Probability

When we write a program to do a certain task, we usually know what kinds of input it will receive and what the output should be. In machine learning, as in our own brains, we instead compare things and weigh how certain or uncertain we are about them.
For example, if I show you an animal species you haven't seen before, you will try to relate it to a creature that you have seen. If you see a liger (a hybrid of a lion and a tiger), you will judge how much it looks and acts like a lion and how much like a tiger. This "how much" is what we call the probability of that feature or event. Machine learning uses a lot of probability theory to make decisions like this, which extends a machine's reasoning from strict logic to reasoning under uncertainty.

There are 3 possible sources of uncertainty:
    
    - Inherent stochasticity in the system being modeled
    - Incomplete observability
    - Incomplete modeling
    
In many cases it is more practical to use a simple but uncertain rule than a complex but certain one.

Some terms:
    
***Degree of Belief:*** Varies from 0 (absolute certainty that a statement is false) to 1 (absolute certainty that it is true).

***Frequentist Probability:*** An interpretation of probability that defines an event's probability as the limit of its relative frequency over a large number of trials.

***Bayesian Probability:*** Probability is interpreted as a reasonable expectation representing a state of knowledge, or as a quantification of a personal belief.
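The frequentist definition above can be illustrated with a small simulation. This is only a sketch: the "true" probability `p_heads` and the seed are assumptions made up for the example, and the relative frequency merely approaches the true value as the number of trials grows.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, chosen only for reproducibility
p_heads = 0.3                           # assumed "true" probability of heads
flips = rng.random(100_000) < p_heads   # True where the simulated coin lands heads

# Relative frequency of heads after the first n trials
for n in (10, 1_000, 100_000):
    print(n, flips[:n].mean())

rel_freq = flips.mean()                 # relative frequency over all trials
```

With few trials the estimate can be far from `p_heads`; with many trials it settles close to it.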

>Probability can be seen as the extension of logic to deal with uncertainty

In [None]:
#Example
## How certain is it that you are concentrating on this chapter?

## YOUR CODE HERE:
max_dof = None  # write the maximum limit of the degree of belief
#END

if(math.pow(max_dof,max_dof) == max_dof and max_dof > 0):
    print("Good! you have good concentration.")
else:
    print("You are not concentrating even a bit!!!")

> Expected output : Good! you have good concentration.
    

### 2. Random Variables

A random variable is a variable that can take on different values randomly. Random variables may be discrete or continuous. A discrete random variable is one that has a finite or countably infinite number of states. Note that these states are not necessarily integers; they can also be named states that are not considered to have any numerical value. A continuous random variable is associated with a real value.
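A minimal sketch of both kinds of random variable, assuming NumPy's `default_rng` generator. The state names and their probabilities are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete random variable with named (non-numeric) states
states = np.array(["lion", "tiger", "liger"])
probs = np.array([0.5, 0.4, 0.1])       # assumed probabilities, they sum to 1
discrete_samples = rng.choice(states, size=5, p=probs)

# Continuous random variable: real-valued draws from a standard normal
continuous_samples = rng.normal(loc=0.0, scale=1.0, size=5)

print(discrete_samples)
print(continuous_samples)
```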

### 3. Probability Distributions

A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states. The way we describe probability distributions depends on whether the variables are discrete or continuous.

----------------------

#### 3.1 Discrete Variables and Probability Mass Functions
Probabilities for a set of discrete random variables being in particular states can be described using a probability mass function (PMF).

The probability mass function maps from a state of a random variable to the probability of that random variable taking on that state.

A PMF must satisfy the following criteria:

   - The domain of P must be the set of all possible states of x.
   - The probability of each state of x must be between 0 and 1.
   - The probabilities of all possible states of a random variable must sum to 1.

> Example:
```
A = {1,9,4,7,4,3,1,4,0,8}
P(4) = (frequency of 4 in A)/(length of A)  --> PMF  = frequency/no_samples
     = 3/10
     = 0.3 or 30%
```
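The worked example above can be reproduced in a few lines. This sketch estimates the whole PMF from the sample at once using `np.unique`; the dictionary `pmf` is just a convenient container chosen for the illustration.

```python
import numpy as np

A = np.array([1, 9, 4, 7, 4, 3, 1, 4, 0, 8])

# np.unique returns each distinct state and how often it occurs
values, counts = np.unique(A, return_counts=True)
pmf = dict(zip(values.tolist(), (counts / A.size).tolist()))

print(pmf[4])                 # relative frequency of the state 4
print(sum(pmf.values()))      # the probabilities over all states add up to 1
```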

-------------------

#### 3.2 Continuous Variables and Probability Density Functions

When working with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a probability mass function.

A PDF must satisfy the following criteria:
    
   - The domain of p must be the set of all possible states of x.
   - The density p(x) must be non-negative for every x. Unlike a PMF, p(x) is not required to be at most 1.
   - ∫p(x)dx = 1, that is, the density must integrate to 1.
    
    
A probability density function p(x) does not give the probability of a specific state directly; instead, the probability of landing inside an infinitesimal region with volume δx is given by p(x)δx.
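To make the p(x)δx idea concrete, the sketch below sums p(x)·dx over many small regions to approximate probabilities. The helper `normal_pdf` and the step size `dx` are assumptions for this illustration; a plain Riemann sum is used rather than a library integrator.

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution, including the normalising constant."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

dx = 1e-4                                    # width of each small region
x_all = np.arange(-10, 10, dx)
x_mid = np.arange(-1, 1, dx)

total = np.sum(normal_pdf(x_all) * dx)       # approximates the integral over (almost) all x
p_within_1 = np.sum(normal_pdf(x_mid) * dx)  # approximates P(-1 <= x <= 1)
peak_narrow = normal_pdf(0.0, 0.0, 0.1)      # a density value may exceed 1

print(total, p_within_1, peak_narrow)
```

The total comes out close to 1, while the density of a narrow Gaussian at its peak is well above 1, which is allowed because only the integral is constrained.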

> Example: <img src='prob/pdf.gif'><center>A Gaussian PDF of the random variable Length.</center>

In [None]:
# Example Probability distribution:
# PMF:
def pmd(x,A):
    
    ## YOUR CODE HERE:
    freq = None # use np.sum(A==x); A == x gives a boolean array that is True wherever A[i] == x
    l1   = None # use A.shape[0] to get the length of A
    pxA  = None # freq/l1
    #END
    
    return pxA

# A Gaussian curve (Gauss was a German mathematician).
# Note: the normalising constant 1/(sig*sqrt(2*pi)) is omitted here, so the curve
# peaks at 1 instead of integrating to 1; mu and sig are estimated from x itself.
def gaussian_pdf(x):
    mu = np.mean(x)
    sig = np.std(x)
    return np.exp(-np.power(x - mu, 2.) / (2 * np.power(sig, 2.)))


A = np.array([1,0,2,30,2,1,22,1,5,6,3,8,6,1,2,6,1,34,1])
pmd_1 = pmd(1,A)
print(pmd_1)

A = np.arange(-10,10,0.001)
pdf_A = gaussian_pdf(A)
plt.plot(A,pdf_A)
plt.show()

>Expected Output 
```
   0.315789473684
```
   <img src="prob/pdf_A.png">
   

### 4. Marginal Probability

Sometimes we know the probability distribution over a set of variables and we want to know the probability distribution over just a subset of them. The probability distribution over the subset is known as the marginal probability distribution.

For example, suppose we have discrete random variables x and y, and we know Pr(x, y). We can find Pr(x) with the sum rule:

$$ {\Pr(X=x)=\sum _{y}\Pr(X=x,Y=y)} $$

Similarly, for continuous random variables, the marginal probability density function p<sub>X</sub>(x) is obtained by integrating over y:

$${p_{X}(x)=\int _{y}p_{X,Y}(x,y)\,\mathrm {d} y}$$
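For the discrete case, the sum rule amounts to summing a joint table along one axis. The joint table below is hypothetical, invented only to show the mechanics.

```python
import numpy as np

# Hypothetical joint table: rows index the states of X, columns the states of Y
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.05, 0.30]])

p_x = joint.sum(axis=1)   # sum over y gives the marginal Pr(X = x)
p_y = joint.sum(axis=0)   # sum over x gives the marginal Pr(Y = y)

print(p_x)
print(p_y)
```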

> Example: <img src="prob/mdist.jpg">
Each cell in the inner light sky-blue table represents a joint probability Pr(X=x, Y=y).

<br>
### 5. Conditional Probability

In many cases, we are interested in the probability of some event, given that some other event has happened. This is called a conditional probability.

> **P(X=x given Y=y) = P(X=x, Y=y) / P(Y=y)**, in mathematical symbols$$P(x|y)={\frac {P(x\cap y)}{P(y)}}$$

It is only defined when P(Y=y) > 0
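A short sketch of the definition, using a hypothetical joint table (rows are states of X, columns are states of Y). Dividing each column by its marginal P(Y=y) gives the conditional distribution of X for that y.

```python
import numpy as np

# Hypothetical joint table P(X = x, Y = y)
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.05, 0.30]])

p_y = joint.sum(axis=0)            # marginal P(Y = y), all entries are > 0 here

# P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y); broadcasting divides each column by its P(Y = y)
cond_x_given_y = joint / p_y

print(cond_x_given_y[:, 0])        # distribution of X given the first state of Y
```

Each column of `cond_x_given_y` sums to 1, as a conditional distribution should.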

<br>

### 6. The Chain Rule of Conditional Probabilities

If we apply the conditional probability rule repeatedly to a problem with n variables, we can factor the joint probability like this:

$$\mathrm {P} (A_{4},A_{3},A_{2},A_{1})=\mathrm {P} (A_{4}\mid A_{3},A_{2},A_{1})\cdot \mathrm {P} (A_{3}\mid A_{2},A_{1})\cdot \mathrm {P} (A_{2}\mid A_{1})\cdot \mathrm {P} (A_{1})$$
<center>or, equivalently, in set notation:</center>
$$\mathrm {P} (A_{4} \cap A_{3} \cap A_{2} \cap A_{1})=\mathrm {P} (A_{4}\mid A_{3} \cap A_{2} \cap A_{1})\cdot \mathrm {P} (A_{3}\mid A_{2} \cap A_{1})\cdot \mathrm {P} (A_{2}\mid A_{1})\cdot \mathrm {P} (A_{1})$$

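> Each factor conditions one variable on all the variables that come before it, so multiplying the factors back together recovers the joint probability.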

In [None]:
# Conditional probability: read and run this cell
#definition--------------------------------------------------------------------
A1 = np.array([1,2,5,1,1,6,1,8,1,0,1,1])
A2 = np.array([4,2,3,8,4,6,4,8,4,0,4,4])
A3 = np.array([2,5,8,4,2,6,2,8,2,0,2,4])
A4 = np.array([1,5,3,4,5,6,1,8,1,0,1,4])

m = A1.shape[0] # length of the arrays

#-------------------------------------------------------------------------------
# return a boolean mask that is True wherever A == x and B == y at the same index
def pr_xy(x,y,A,B):
    Ax = (A==x)
    By = (B==y)
    Cxy = np.logical_and(Ax,By)
    return Cxy


#In this section-----------------------------------------------------------------
# we will find probability p(A4 = 1, A3 = 2, A2 = 4, A1 = 1):
pA1 = np.sum(A1 == 1)/m                     # p(A1 = 1)

A2nA1 = pr_xy(1,4,A1,A2)
pA2nA1 = np.sum(A2nA1)/m                    # p(A1 = 1 , A2 = 4)
pA2_A1 = (pA2nA1/pA1)                       # p(A2 | A1)

A3nA2nA1 = pr_xy(1,2,A2nA1,A3)
pA3nA2nA1 = np.sum(A3nA2nA1)/m              # p(A1 = 1, A2 = 4, A3 = 2)
pA3_A2nA1 = (pA3nA2nA1/pA2nA1)              # p(A3 | A2,A1)

A4nA3nA2nA1 = pr_xy(1,1,A3nA2nA1,A4)
pA4nA3nA2nA1 = np.sum(A4nA3nA2nA1)/m        # p(A4 = 1, A3 = 2, A2 = 4, A1 = 1)
pA4_A3nA2nA1 = (pA4nA3nA2nA1/pA3nA2nA1)     # p(A4 | A3,A2,A1)

#------------------------------------------------------------------------------
# Verification of the chain rule:
chain = (pA4_A3nA2nA1*pA3_A2nA1*pA2_A1*pA1)
assert np.isclose(chain, pA4nA3nA2nA1)      # compare floats with a tolerance, not ==

#------------------------------------------------------------------------------
print("P(A4,A3,A2,A1)=" + str(pA4nA3nA2nA1)+"")
print("P(A4,A3,A2,A1)=P(A4∣A3,A2,A1)⋅P(A3∣A2,A1)⋅P(A2∣A1)⋅P(A1)")
print("              ="+str(pA4_A3nA2nA1)+"*"+str(pA3_A2nA1)+"*"+str(pA2_A1)+"*"+str(pA1))
print("              ="+str(chain)+"")

<br>
### 7. Independence and Conditional Independence

Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y:

$$\mathrm{P}(x \cap y) = \mathrm{P}(x)\mathrm{P}(y).$$

Two random variables x and y are conditionally independent given a random variable z if the conditional probability distribution over x and y factorizes in this way for every value of z:

$$\mathrm{P}(x \cap y | z) = \mathrm{P}(x | z)\mathrm{P}(y | z).$$
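One way to check independence numerically is to compare a joint table with the outer product of its marginals. The tables and the helper name `is_independent` below are assumptions made up for this sketch.

```python
import numpy as np

def is_independent(joint):
    """True if the joint table equals the product of its two marginals."""
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    return bool(np.allclose(joint, np.outer(p_x, p_y)))

# Built as a product of marginals, so it is independent by construction
independent_joint = np.outer([0.4, 0.6], [0.35, 0.25, 0.40])

# Hypothetical table with the same marginals that does not factorize
dependent_joint = np.array([[0.10, 0.20, 0.10],
                            [0.25, 0.05, 0.30]])

print(is_independent(independent_joint))
print(is_independent(dependent_joint))
```

The same check applied to each slice P(x, y | z) would test conditional independence given z.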

### 8. Expectation, Variance and Covariance
#### Expectation
The expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average, or mean value that f takes on when x is drawn from P.

For a discrete random variable

$${\displaystyle \operatorname {E} [f(X)]=f(x_{1})p_{1}+f(x_{2})p_{2}+\cdots +f(x_{k})p_{k}}$$

For a continuous random variable

$${\displaystyle \operatorname {E} [f(X)]=\int f(x)\,p(x)\,dx.}$$
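As a worked instance of the discrete formula, consider a fair six-sided die. The choice of f(x) = x² is arbitrary and only serves as an example.

```python
import numpy as np

x = np.arange(1, 7)            # outcomes of a fair six-sided die
p = np.full(6, 1 / 6)          # each outcome has probability 1/6

def f(v):
    return v ** 2              # an arbitrary function of the outcome

e_x = np.sum(x * p)            # E[X]    = sum_i x_i p_i,    mathematically 3.5
e_fx = np.sum(f(x) * p)        # E[f(X)] = sum_i f(x_i) p_i, mathematically 91/6
print(e_x, e_fx)
```

Note that E[f(X)] is not f(E[X]) in general: here 91/6 differs from 3.5².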

#### Variance
The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution:

$$\operatorname {Var} (f(X))=\operatorname {E} \left[(f(X)-\mu )^{2}\right]$$
$$\mu = \operatorname {E} [f(X)]$$

When the variance is low, the values of f(x) cluster near their expected value. The square root of the variance is known as the **standard deviation**.

#### Covariance

The covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:

$$\operatorname {cov} (X,Y)=\operatorname {E} {{\big [}(X-\operatorname {E} [X])(Y-\operatorname {E} [Y]){\big ]}}$$
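The dependence on scale can be seen by rescaling one variable. In this sketch the data are synthetic and the helpers `cov` and `corr` are written out by hand for illustration; the correlation (covariance divided by both standard deviations) removes the scale.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)      # y is linearly related to x, plus noise

def cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

def corr(a, b):
    return cov(a, b) / np.sqrt(cov(a, a) * cov(b, b))

print(cov(x, y), cov(100 * x, y))      # covariance grows 100-fold with the rescaling
print(corr(x, y), corr(100 * x, y))    # correlation is unchanged
```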


In [None]:
# Expectation, Variance and Covariance
x = np.array([1,8,9,2,1,0,1,2,-3,-2,-9,-11,2,1,-1])
A = x**2 + 3
B = (x) - 10
C = x**2 - 5

def mean(X):
    ## YOUR CODE HERE
    mu = None # considering probabilty of x is 1/N so you can use --> np.sum(X,axis=0)/X.shape[0]
    #END
    return mu

def vrns(X):
    mu = mean(X)
    ## YOUR CODE HERE
    var = None # read and apply the equation shown above as follows --> mean(np.square(X - mu))
    #END
    return var
    
def covr(X, Y):
    mux = mean(X)
    muy = mean(Y)
    ## YOUR CODE HERE
    cov = None # remember variance is a spl. case of covariance with itself so use --> mean((X-mux)*(Y-muy))
    #END
    return cov

mx = mean(x)
va = vrns(A)
vb = vrns(B)
cac = covr(A, C)
cab = covr(A, B)


print("The mean of x :",mx)
print("The variance of A :",va)
print("The variance of B :",vb)
print("The Covariance of A,C :",cac)
print("The Covariance of A,B :",cab)
print("Note: a covariance of large magnitude suggests a strong linear relationship, but its size also depends on the scale of the variables")

> Expected Output:
```
    The mean of x : 0.0666666666667
    The variance of A : 1502.24888889
    The variance of B : 25.1288888889
    The Covariance of A,C : 1502.24888889
    The Covariance of A,B : -56.8088888889
    Note: a covariance of large magnitude suggests a strong linear relationship, but its size also depends on the scale of the variables
```

### 9. Common Probability Distributions

#### 9.1 Bernoulli Distribution

If X is a random variable with this distribution, we have:

$${\displaystyle \Pr(X=1)=p =1-\Pr(X=0)=1-q.}$$

The PMF is defined as $${\displaystyle f(k;p)=p^{k}(1-p)^{1-k}\!\quad {\text{for }}k\in \{0,1\}}$$
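The PMF formula translates directly into code. The function name `bernoulli_pmf` and the value p = 0.3 are chosen only for this sketch.

```python
def bernoulli_pmf(k, p):
    """f(k; p) = p**k * (1 - p)**(1 - k) for k in {0, 1}."""
    if k not in (0, 1):
        raise ValueError("k must be 0 or 1")
    return p ** k * (1 - p) ** (1 - k)

p = 0.3
print(bernoulli_pmf(1, p))                        # probability of the outcome 1
print(bernoulli_pmf(0, p))                        # probability of the outcome 0
total = bernoulli_pmf(0, p) + bernoulli_pmf(1, p)
```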

--------------------------

#### 9.2 Multinoulli Distribution

The multinoulli or categorical distribution is a distribution over a single discrete variable with k different states, where k is finite.
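A small sketch of sampling from a multinoulli distribution with k = 3 states, assuming NumPy's `Generator.choice`. The probability vector is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.2, 0.5, 0.3])            # assumed probabilities of the states 0, 1, 2
n = 100_000

samples = rng.choice(len(probs), size=n, p=probs)

# Empirical frequency of each state; it should be close to probs
freqs = np.bincount(samples, minlength=len(probs)) / n
print(freqs)
```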

----------------

#### 9.3 Gaussian Distribution

As we have discussed earlier the most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:

The PDF is given by: $${\displaystyle f(x\;|\;\mu ,\sigma ^{2})={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\;e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}}}$$ where $$\mu = \text{mean},\quad \sigma = \text{standard deviation}$$

> Read a detailed explanation of the n-dimensional generalization of this distribution here: http://www.deeplearningbook.org/contents/prob.html section 3.9.3

----------------------

#### 9.4 Exponential and Laplace Distributions

In the context of deep learning, we often want to have a probability distribution with a sharp point at x= 0. To accomplish this, we can use the **exponential distribution**:

$$p(x; \lambda) = \begin{cases} \lambda \exp(-\lambda x) & x \ge 0 \\ 0 & x < 0 \end{cases}$$

A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point µ is the **Laplace distribution**:

$$\text{Laplace}(x; \mu, \gamma) = \frac{1}{2\gamma} \exp\left(-\frac{|x - \mu|}{\gamma}\right)$$

-----------------------------

#### 9.5 The Dirac Distribution and Empirical Distribution

In some cases, we wish to specify that all the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the Dirac delta function, δ(x):

$$p(x) = \delta(x - \mu)$$

> The Dirac delta function is zero everywhere except at 0, yet it integrates to 1. Informally, it is an infinitely tall, infinitely narrow peak at 0.<br>For more detail, see Wikipedia: https://en.wikipedia.org/wiki/Dirac_delta_function


A common use of the Dirac delta distribution is as a component of an **empirical distribution**,

$$p(x) = \frac{1}{m}\sum_{i=1}^{m}\delta(x - x^{(i)})$$

which puts probability mass 1/m on each of the m points x<sup>(1)</sup>, ..., x<sup>(m)</sup>, forming a given data set or collection of samples. The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables. For discrete variables, the situation is simpler: an empirical distribution can be conceptualized as a multinoulli distribution, with a probability associated with each possible input value that is simply equal to the empirical frequency of that value in the training set.

-----------------------------
#### 9.6 Mixtures of Distributions

It is also common to define probability distributions by combining other simpler probability distributions. One common way of combining distributions is to construct a mixture distribution.

For example, a finite mixture distribution has the form:
$$f(x)=\sum _{i=1}^{n}\,w_{i}\,p_{i}(x).$$

such that w<sub>i</sub> ≥ 0 and ∑w<sub>i</sub> = 1
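Sampling from a finite mixture can be sketched in two steps: first pick a component i with probability w<sub>i</sub>, then draw from that component. The weights, means and standard deviation below are assumptions for the illustration, with Gaussian components.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.4, 0.3])      # w_i >= 0 and they sum to 1
means = np.array([20.0, 40.0, 60.0])     # mean of each Gaussian component
sigma = 6.0                              # shared standard deviation
n = 100_000

component = rng.choice(len(weights), size=n, p=weights)   # step 1: choose a component
samples = rng.normal(loc=means[component], scale=sigma)   # step 2: sample from it

expected_mean = np.sum(weights * means)  # mean of the mixture is sum_i w_i * mu_i
print(samples.mean(), expected_mean)
```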

An uncountable mixture distribution:

$$f(x)=\int _{A}\,w(a)\,p(x;a)\,da$$

> For more detailed study of mixture distribution you can look here: https://en.wikipedia.org/wiki/Mixture_distribution

In [None]:
# Probability Distribution
# plot arrangements
f, mpl = plt.subplots(2, 2, figsize=[14,7])
# Different distribution are as follows: -

# Bernoulli Distribution
Ab = np.array([0,1,1,0,1,1,1,0,0,0,0,1,1,1,0,1,1]) # data

## YOUR CODE HERE
pb1 = None # probability of one = (number of ones / total length ) use --> np.sum(Ab,axis=0)/Ab.shape[0] 
#END

pb0 = 1 - pb1 # probability of zero is 1 - (probability of one)
print("The Bernoulli Probability of 1 : ",pb1,"\nThe Bernoulli Probability of 0 : ",pb0)
#---------------------------
# Exponential Distribution
Ae = np.arange(-10,100,0.01) # random variable data 
lmda = 0.2 # lambda for distribution

## YOUR CODE HERE
pe = None # use --> lmda*np.exp(-lmda*Ae)
#END

pe = pe*(Ae>=0) # masking negative values of Ae to zero
mpl[0,0].plot(Ae,pe)
mpl[0,0].set_title("The exponential distribution")
#---------------------------
# Laplace distribution
me = np.mean(Ae)

## YOUR CODE HERE
pl = None # Laplace with mu = me and gamma = 1, use --> 0.5*np.exp(-np.abs(Ae-me))
#END

mpl[0,1].plot(Ae,pl)
mpl[0,1].set_title("The Laplace distribution")
#----------------------------
# Empirical distribution
# Dirac delta function, approximated by a narrow Gaussian of width a
def diracdel(x,a):
    dxa = (1/np.abs(a)/np.sqrt(np.pi))*np.exp(-np.square(x/a))
    return dxa

a = 0.01 # smaller values give sharper peaks (closer to a true delta)
# according to the definition of the empirical distribution
pemp = (diracdel(Ae,a) + diracdel(Ae-20,a) + diracdel(Ae-30,a) + diracdel(Ae-90,a))/4
mpl[1,0].plot(Ae,pemp)
mpl[1,0].set_title("The Empirical distribution")
#-----------------------------
# Mixture Distribution
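# Unnormalised Gaussian bump: the factor 1/(sig*sqrt(2*pi)) is omitted,
# so pmix shows the shape of the mixture rather than a density that integrates to 1
def gauss(x,mu,sig):
    return np.exp(-np.power(x - mu, 2.) / (2 * np.power(sig, 2.)))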

# parameters for mixture distribution
W1 = 0.3
G1 = gauss(Ae,20,6) # distribution 1
W2 = 0.4
G2 = gauss(Ae,40,6) # distribution 2
W3 = 0.3
G3 = gauss(Ae,60,6) # distribution 3

pmix = W1*G1 + W2*G2 + W3*G3 # sum of weighted distributions
mpl[1,1].plot(Ae,pmix)
mpl[1,1].set_title("The Mixture distribution")
plt.show()

> Expected output:
```
The Bernoulli Probability of 1 :  0.588235294118 
The Bernoulli Probability of 0 :  0.411764705882
```
<img src="prob/distributions.png">

### 10. Useful Properties of Common Functions

Certain functions arise often while working with probability distributions, especially the probability distributions used in deep learning models.

Common Examples:

**Logistic Sigmoid:**$${\displaystyle S(x)={\frac {1}{1+e^{-x}}}={\frac {e^{x}}{e^{x}+1}}}$$

The logistic sigmoid has a range between 0 and 1. It saturates when its argument is very positive or very negative, meaning that the function becomes very flat and insensitive to small changes in its input.

**Softplus function:**$${\displaystyle f(x)=\ln[1+\exp(x)]}$$

The name of the softplus function comes from the fact that it is a smoothed, or “softened,” version of the rectified linear unit (**ReLU**).

**ReLU function:**$${\displaystyle f(x)=x^{+}=\max(0,x)}$$

This function outputs zero for negative inputs and passes non-negative inputs through unchanged.
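Two derivative identities connect these functions: the derivative of softplus is the sigmoid, and the derivative of the sigmoid is S(x)(1 − S(x)). The sketch below checks both numerically with finite differences via `np.gradient`; the grid and step size are arbitrary choices for the illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))           # log(1 + exp(x))

x = np.arange(-10, 10, 0.01)
s = sigmoid(x)

d_softplus = np.gradient(softplus(x), x)  # finite-difference derivative of softplus
d_sigmoid = np.gradient(s, x)             # finite-difference derivative of sigmoid

# Compare on interior points, where np.gradient uses central differences
err_softplus = np.max(np.abs(d_softplus[1:-1] - s[1:-1]))
err_sigmoid = np.max(np.abs(d_sigmoid[1:-1] - (s * (1 - s))[1:-1]))
print(err_softplus, err_sigmoid)
```

Both errors are tiny, which is consistent with the analytic derivatives.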

>NOTE: If you are familiar with integrals and derivatives, try finding the derivatives of the functions above; they will be useful later on. If you are not familiar with them, it is worth learning them, since these concepts matter for understanding many fields.
Look for short lectures on **Google**.

In [None]:
# Common Functions
# Run this cell to see plots of those functions
def sigmoid(X):
    return(1/(1+np.exp(-X)))    # use the argument X, not the global x
def softplus(X):
    return(np.log(1+np.exp(X)))
def relu(X):
    msk = X>=0
    return X*msk

x = np.arange(-10,10,0.1)
sig_x = sigmoid(x)
sof_x = softplus(x)
rel_x = relu(x)

f, ax = plt.subplots(1,3, figsize=[20,4])
ax[0].plot(x,sig_x)
ax[0].set_title("Sigmoid")
ax[1].plot(x,sof_x)
ax[1].set_title("Softplus")
ax[2].plot(x,rel_x)
ax[2].set_title("ReLU")
plt.show()

### 11. Bayes' Rule

We often find ourselves in a situation where we know P(B | A) and need to know P(A | B). Fortunately, if we also know P(A), we can compute the desired quantity using **Bayes’ rule:**

$${\displaystyle P(A\mid B)={\frac {P(B\mid A)\,P(A)}{P(B)}}}$$

Note that while P(B) appears in the formula, it is usually feasible to compute 

$$P (B) =\sum_AP (B | A)P (A)$$

so we do not need to begin with knowledge of P(B).
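A worked example of Bayes' rule with made-up numbers: let A be "has a condition" and B be "test is positive". All three input probabilities are hypothetical, chosen only to show the arithmetic.

```python
# Hypothetical inputs
p_a = 0.01              # P(A): prior probability of the condition
p_b_given_a = 0.95      # P(B | A): positive test when the condition is present
p_b_given_not_a = 0.05  # P(B | not A): positive test when it is absent

# P(B) = sum over both values of A of P(B | A) P(A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule: P(A | B) = P(B | A) P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(p_b, p_a_given_b)
```

Even with a fairly accurate test, P(A | B) stays low here because the prior P(A) is small.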

-----------
### 12. Technical Details of Continuous Variables

This section covers some finer properties of continuous random variables in probability theory.

> Read Section 3.12 here http://www.deeplearningbook.org/contents/prob.html

### 13. Information Theory

Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal.

The basic intuition behind information theory is that learning an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying “the sun rose this morning” is so uninformative as to be unnecessary to send, but a message saying “there was a solar eclipse this morning” is very informative.

In other words, the requirements are:
    - Likely events should have low information content.
    - Unlikely events should have high information content.
    - Independent events should have additive information.
    
So, we define the **self-information** of an event:
$$I(x) = -\log_e P(x)$$

Self-information deals only with a single outcome. With the natural logarithm, it is measured in nats.
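The three requirements can be checked directly on the definition. This is a minimal sketch; the probabilities are arbitrary examples.

```python
import numpy as np

def self_information(p):
    """Self-information in nats of an event with probability p."""
    return -np.log(p)

info_likely = self_information(0.99)     # likely event: close to 0
info_unlikely = self_information(0.01)   # unlikely event: much larger

# For independent events P(x, y) = P(x) P(y), so the information adds up
p_x, p_y = 0.5, 0.1
info_joint = self_information(p_x * p_y)
info_sum = self_information(p_x) + self_information(p_y)

print(info_likely, info_unlikely)
print(info_joint, info_sum)
```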

To quantify the expected amount of information in an entire distribution, we can use the

**Shannon entropy:**
$$H(x) = E_{x \sim P}[I(x)] = -E_{x \sim P}[\log P(x)]$$
<center>or</center>$${\displaystyle \mathrm {H} (X)=\sum _{i=1}^{n}{\mathrm {P} (x_{i})\,\mathrm {I} (x_{i})}=-\sum _{i=1}^{n}{\mathrm {P} (x_{i})\log _{e}\mathrm {P} (x_{i})},}$$
When x is continuous, the Shannon entropy is known as the differential entropy.

If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the **Kullback-Leibler (KL) divergence:**

$$D_{KL}(P \| Q) = E_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$$

A quantity that is closely related to the KL divergence is the cross-entropy:

$$H(P, Q) = -E_{x \sim P}[\log Q(x)]$$
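For discrete distributions these expectations become sums, and the cross-entropy equals the entropy of P plus the KL divergence: H(P, Q) = H(P) + D<sub>KL</sub>(P‖Q). The sketch below checks this on two hypothetical distributions, assuming every probability is strictly positive so that no log(0) occurs.

```python
import numpy as np

# Hypothetical distributions over the same three states (all entries > 0)
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

print(kl_divergence(P, Q), kl_divergence(Q, P))       # not symmetric in general
print(cross_entropy(P, Q), entropy(P) + kl_divergence(P, Q))
```

The KL divergence is non-negative and is zero when the two distributions are identical, but it is not a true distance because swapping P and Q changes its value.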

In [None]:
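# Plot the Shannon entropy: run this cell
def shannon(P):
    eps = 1e-18                      # avoids log(0) when P is exactly 0 or 1
    info1 = -np.log(P   + eps)       # self-information of the outcome 1
    info2 = -np.log(1-P + eps)       # self-information of the outcome 0
    shan1 = P*info1
    shan2 = (1-P)*info2
    shan = shan1 + shan2             # entropy = expected self-information
    return shan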
print("\nShannon Entropy plot of a binary random variable being 0 or 1")
p = np.arange(0,1,0.001) # probability outcomes
s = shannon(p)
plt.plot(p,s)
plt.xlabel("P")
plt.ylabel("Entropy")
plt.show()

>See Figure : When p is near 0, the distribution is nearly deterministic, because the random variable is nearly always 0. When p is near 1, the distribution is nearly deterministic, because the random variable is nearly always 1. When p= 0.5, the entropy is maximal, because the distribution is uniform over the two outcomes.

### 14. Structured Probabilistic Models

> This part is very intuitive, so I advise you to read section 3.14 of http://www.deeplearningbook.org/contents/prob.html 


---------------------
> **Congrats on completing the second tutorial in this series! Keep moving!**

In the next part we will discuss numerical computation.