# Probability Theory
Probability theory is a branch of mathematics concerned with probability. Although there are several different probability interpretations, probability theory treats the concept in a rigorous mathematical manner by expressing it through a set of axioms.

Probability is the ratio of the number of outcomes that meet a given condition to the total number of equally likely outcomes. For example, P(heads on a coin toss) = 1 way to get heads / 2 possible outcomes (heads or tails) = ½. In probability theory, an event is a set of outcomes of an experiment to which a probability is assigned. If E represents an event, then P(E) represents the probability that E will occur. A situation where E might happen (success) or might not happen (failure) is called a trial. A trial can be anything like tossing a coin, rolling a die or pulling a colored ball out of a bag. In these examples the outcome is random, so the variable that represents the outcome is called a random variable.
- The **empirical probability** of an event is the number of times the event occurs divided by the total number of trials observed. If we run n trials and observe s successes, the empirical probability of success is s/n. In the coin-toss example above, any particular sequence of tosses may contain more or fewer than exactly 50% heads.
- **Theoretical probability** on the other hand is given by the number of ways the particular event can occur divided by the total number of possible outcomes. So a head can occur once and possible outcomes are two (head, tail). The true (theoretical) probability of a head is 1/2.
- **Joint Probability** is the probability that events A and B both occur, denoted P(A and B) or P(A ∩ B). If A and B are independent, meaning that the occurrence of A does not change the probability of B and vice versa, then P(A ∩ B) = P(A) · P(B). *Example: What is the probability that a drawn card is a red four? There are two red fours in a deck of 52, the 4 of hearts and the 4 of diamonds, therefore P(four and red) = 2/52 = 1/26. Rank and color are independent, so the product rule gives the same answer: P(four) · P(red) = 4/52 · 26/52 = 1/26.*
- **Conditional Probability** matters when A and B are not independent, so that knowing one event occurred changes the probability of the other. In that case it is often useful to compute the conditional probability P(A|B), which is the probability of A given that B occurred:
    - P(A|B) = P(A ∩ B) / P(B), or similarly P(B|A) = P(A ∩ B) / P(A). We can therefore write the joint probability of A and B as P(A ∩ B) = P(A) · P(B|A), which means: *“The chance of both things happening is the chance that the first one happens, and then the second one given the first happened.”*
    - https://youtu.be/bgCMjHzXTXs
    - https://youtu.be/ES9HFNDu4Bs
    - https://mithunmanohar.medium.com/machine-learning-101-what-the-is-a-conditional-probability-f0f9a9ec6cda
- **Marginal Probability** is the probability of a single event occurring, P(A). It is an unconditional probability: it is not conditioned on any other event. *Example: The probability that a drawn card is red is P(red) = 0.5.*
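
The card examples above can be checked by brute-force enumeration. The following is a minimal sketch, not part of the original notes: the helper names `prob`, `is_four` and `is_red` are illustrative, and it assumes a standard 52-card deck in which hearts and diamonds are red.

```python
from fractions import Fraction
from itertools import product

# Build a standard 52-card deck as (rank, suit) pairs (assumed deck layout).
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = list(product(RANKS, SUITS))


def prob(event, space=DECK):
    """Theoretical probability: favourable outcomes / equally likely outcomes."""
    favourable = [card for card in space if event(card)]
    return Fraction(len(favourable), len(space))


def is_four(card):
    return card[0] == "4"


def is_red(card):
    return card[1] in ("hearts", "diamonds")


p_red = prob(is_red)                                        # marginal P(red)
p_four = prob(is_four)                                      # marginal P(four)
p_four_and_red = prob(lambda c: is_four(c) and is_red(c))   # joint P(four ∩ red)
p_four_given_red = p_four_and_red / p_red                   # conditional P(four | red)

print(p_red, p_four, p_four_and_red, p_four_given_red)
```

Because rank and color are independent in this deck, the conditional probability P(four | red) comes out equal to the marginal P(four), and the joint probability equals the product of the two marginals.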

## Probability Distribution
Probability distributions describe the dispersion of the values of a random variable. Consequently, the kind of variable determines the type of probability distribution. For a single random variable, statisticians divide distributions into the following two types:
- Probability mass functions for discrete variables (PMF)
- Probability density functions for continuous variables (PDF)

## Data Types
- Discrete data can take only specified values, e.g. a roll of a die is 1, 2, 3, 4, 5, or 6, never 1.5
- Continuous data can take any value within a given range, finite or infinite, e.g. a person’s height or weight

Online Resources:
* https://towardsdatascience.com/machine-learning-probability-statistics-f830f8c09326
* https://youtu.be/uzkc-qNVoOk

## Bayes Theorem
Bayes' theorem (BT) describes the relationship between the conditional probabilities of two events. For example, to estimate the chance of selling ice cream on a hot sunny day, BT uses prior knowledge of how likely sales were on other days (rainy, windy, snowy, etc.).

<img src='images/bayes.png'>

* where H and E are events, and P(H|E) is the conditional probability that event H occurs given that E occurred
* P(H) is the prior, which is basically frequency analysis: given our prior data, what is the probability of H occurring
* P(E|H) is called the likelihood: the probability of observing the evidence E given that H is true
* P(E) is the marginal probability of the evidence, i.e. the probability of observing E regardless of H

In the ice cream example, H is the event of selling ice cream and E is the event of a given type of weather. P(H) is the marginal probability of selling ice cream regardless of the weather, estimated from prior sales.
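
As a rough illustration of the ice cream example, the sketch below assumes the theorem in its standard form, P(H|E) = P(E|H) · P(H) / P(E). The function name `bayes_posterior` and all three input probabilities are hypothetical, invented only to show the arithmetic.

```python
def bayes_posterior(prior_h, likelihood_e_given_h, evidence_e):
    """Return P(H|E) = P(E|H) * P(H) / P(E)."""
    if not 0 < evidence_e <= 1:
        raise ValueError("P(E) must be in (0, 1]")
    return likelihood_e_given_h * prior_h / evidence_e


# Hypothetical numbers, invented for illustration only:
p_sale = 0.30               # P(H): ice cream is sold on a random day
p_sunny_given_sale = 0.60   # P(E|H): the day was sunny, given a sale
p_sunny = 0.25              # P(E): any given day is sunny

p_sale_given_sunny = bayes_posterior(p_sale, p_sunny_given_sale, p_sunny)
print(round(p_sale_given_sunny, 2))
```

With these made-up inputs, learning that the day is sunny raises the probability of a sale above the prior P(H), which is the kind of update the theorem describes.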

Online Resources:
* https://www.mathsisfun.com/data/bayes-theorem.html
* https://youtu.be/HZGCoVF3YvM

## Bernoulli Distribution
A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial. The random variable X takes the value 1 with probability of success p, or the value 0 with probability of failure q = 1 - p. The probability mass function is given by $P(X = x) = p^x (1-p)^{1-x}$ where $x \in \{0, 1\}$, or equivalently as:

$$P(X = x) = \begin{cases} 1 - p, & x = 0 \\ p, & x = 1 \end{cases}$$
    
The expected value of a distribution is its mean. The expected value of a random variable X from a Bernoulli distribution is found as follows:
```text
E(X) = 1*p + 0*(1-p) = p
```
The variance of a random variable from a Bernoulli distribution is:
```text
V(X) = E(X²) - [E(X)]² = p - p² = p(1-p)
```
There are many examples of a Bernoulli distribution, such as whether it is going to rain tomorrow (rain denotes success, no rain denotes failure) or whether a game is won (success) or lost (failure).
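
The mean and variance formulas above can be verified numerically. This is a minimal sketch under stated assumptions: `bernoulli_pmf` is an illustrative helper, p = 0.3 is an arbitrary choice, and the simulation uses Python's standard `random` module with a fixed seed.

```python
import random


def bernoulli_pmf(x, p):
    """P(X = x) = p**x * (1 - p)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return p**x * (1 - p) ** (1 - x)


p = 0.3  # hypothetical probability of success

# Exact mean and variance computed directly from the PMF.
mean = sum(x * bernoulli_pmf(x, p) for x in (0, 1))
variance = sum((x - mean) ** 2 * bernoulli_pmf(x, p) for x in (0, 1))

# Empirical check: simulate trials with a fixed seed for repeatability.
rng = random.Random(0)
samples = [1 if rng.random() < p else 0 for _ in range(100_000)]
empirical_mean = sum(samples) / len(samples)

print(mean, variance, empirical_mean)
```

The exact values match p and p(1-p), while the simulated mean is an empirical probability and so only lands close to p.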

## Uniform Distribution
When you roll a fair die, the outcomes are 1 to 6, and each outcome is equally likely; that is the basis of a uniform distribution. Unlike a Bernoulli distribution, where the two outcomes can have different probabilities, all n possible outcomes of a uniform distribution are equally likely.

A continuous variable X is said to be uniformly distributed if its density function is:

$$f(x) = \frac{1}{b-a} \quad \text{for } -\infty < a \le x \le b < \infty$$

The density curve is rectangular in shape, which is why the uniform distribution is also called the rectangular distribution. For a uniform distribution, a and b are the parameters.

Example: The number of bouquets sold daily at a flower shop is uniformly distributed with a maximum of 40 and a minimum of 10. Let’s calculate the probability that the daily sales will fall between 15 and 30.

The probability that daily sales will fall between 15 and 30 is (30-15)*(1/(40-10)) = 0.5

Similarly, the probability that daily sales are greater than 20 is (40-20)*(1/(40-10)) = 0.667

The mean and variance of X following a uniform distribution are:

- Mean -> E(X) = (a+b)/2
- Variance -> V(X) = (b-a)²/12

The standard uniform density has parameters a = 0 and b = 1, so the PDF for the standard uniform density is given by:

$$f(x) = \begin{cases} 1, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases}$$
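
The flower shop example can be reproduced in a few lines. This is a sketch only: `uniform_prob` is an illustrative helper that treats daily sales as a continuous uniform variable on [10, 40], as the example does.

```python
def uniform_prob(lo, hi, a, b):
    """P(lo <= X <= hi) for X uniform on [a, b], clipping to the support."""
    if not a < b:
        raise ValueError("require a < b")
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0) / (b - a)


a, b = 10, 40  # minimum and maximum daily sales from the example

p_15_to_30 = uniform_prob(15, 30, a, b)   # (30-15) * 1/(40-10)
p_over_20 = uniform_prob(20, b, a, b)     # (40-20) * 1/(40-10)
mean = (a + b) / 2                        # E(X) = (a+b)/2
variance = (b - a) ** 2 / 12              # V(X) = (b-a)^2 / 12

print(p_15_to_30, round(p_over_20, 3), mean, variance)
```

Each probability is just the width of the interval of interest multiplied by the constant density 1/(b-a), which is the area of a rectangle under the density curve.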

## Binomial Distribution
A binomial distribution describes the number of successes in a series of trials where each trial has only two possible outcomes, such as success or failure, gain or loss, win or lose, and where the probability of success (and therefore of failure) is the same for every trial.

The two outcomes do not have to be equally likely: if the probability of success is p = 0.2, then the probability of failure is q = 1 - p = 0.8.

On the basis of the above explanation, the properties of a binomial distribution are:

- Each trial is independent.
- There are only two possible outcomes in a trial: either a success or a failure.
- A total of n identical trials are conducted.
- The probability of success and failure is the same for all trials. (Trials are identical.)

The probability of x successes in n trials is:

$$P(X = x) = \frac{n!}{(n-x)!\,x!}\, p^x q^{n-x} \quad \text{for } x = 0, 1, \dots, n$$

- Mean -> µ = n*p
- Variance -> Var(X) = n*p*q
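
To see the PMF and the mean and variance formulas agree, the sketch below sums over all possible values of x. The helper name `binomial_pmf` and the choice n = 10 are assumptions for illustration; p = 0.2 matches the value used above. It relies on `math.comb`, available in Python 3.8 and later.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p**x * q**(n - x), with q = 1 - p."""
    q = 1 - p
    return comb(n, x) * p**x * q ** (n - x)


n, p = 10, 0.2  # hypothetical: 10 trials, success probability 0.2

# Probabilities for every possible number of successes 0..n.
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

print(round(sum(pmf), 6), round(mean, 6), round(variance, 6))
```

The probabilities sum to 1, and the computed mean and variance agree with n*p and n*p*q up to floating-point error.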

## Normal Distribution
A normal distribution has the following characteristics:

- The mean, median and mode of the distribution coincide.
- The curve of the distribution is bell-shaped and symmetrical about the line x = µ.
- The total area under the curve is 1.
- Exactly half of the values are to the left of the center and the other half to the right.

The PDF of a random variable X following a normal distribution is given by:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \quad \text{for } -\infty < x < \infty$$

- Mean -> E(X) = µ
- Variance -> Var(X) = σ²

Here, µ (mean) and σ (standard deviation) are the parameters.

The graph of a random variable X ~ N(µ, σ) is shown.

A standard normal distribution is defined as the distribution with mean 0 and standard deviation 1.
For such a case, the PDF becomes:

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \quad \text{for } -\infty < x < \infty$$
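
The listed properties (total area of 1, half the mass on each side of µ, symmetry) can be checked numerically. This is a rough sketch: `normal_pdf` and `integrate` are illustrative helpers, the parameters µ = 5 and σ = 2 are arbitrary, and the integral is approximated with a simple midpoint rule over ±10σ rather than computed exactly.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma) at x; the defaults give the standard normal."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))


def integrate(f, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


mu, sigma = 5.0, 2.0  # hypothetical parameters


def density(x):
    return normal_pdf(x, mu, sigma)


# The tails beyond 10 standard deviations are negligible.
total_area = integrate(density, mu - 10 * sigma, mu + 10 * sigma)
left_half = integrate(density, mu - 10 * sigma, mu)

print(round(total_area, 6), round(left_half, 6))
```

Calling `normal_pdf(x)` with the default arguments evaluates the standard normal density given above.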

## Poisson Distribution
The Poisson distribution is applicable in situations where events occur at random points in time or space and our interest lies only in the number of occurrences of the event. It rests on the following assumptions:

- Any successful event should not influence the outcome of another successful event.
- The rate at which events occur is constant, so the probability of success is the same for any two intervals of equal length.
- The probability of success in an interval approaches zero as the interval becomes smaller.

If a process satisfies the above assumptions, then the number of events follows a Poisson distribution. Some notations used in the Poisson distribution are:

- λ is the rate at which an event occurs,
- t is the length of a time interval,
- and X is the number of events in that time interval.

Here, X is called a Poisson random variable and the probability distribution of X is called the Poisson distribution. Let µ denote the mean number of events in an interval of length t. Then, µ = λ*t.

The PMF of X following a Poisson distribution is given by:

$$P(X = x) = \frac{e^{-\mu}\mu^x}{x!} \quad \text{for } x = 0, 1, 2, \dots$$

- Mean -> E(X) = µ
- Variance -> Var(X) = µ
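
The claim that the mean and variance both equal µ can be checked by summing over the PMF. In this sketch `poisson_pmf` is an illustrative helper, the rate and interval length are made-up values, and the infinite support is truncated at x = 100, which is an approximation that is harmless for a mean this small.

```python
from math import exp, factorial


def poisson_pmf(x, mu):
    """P(X = x) = exp(-mu) * mu**x / x! for x = 0, 1, 2, ..."""
    return exp(-mu) * mu**x / factorial(x)


rate = 3.0      # hypothetical λ: 3 events per hour
t = 2.0         # hypothetical interval length in hours
mu = rate * t   # mean number of events in the interval, µ = λ*t

# Truncate the infinite support; terms beyond x = 100 are negligible for mu = 6.
pmf = [poisson_pmf(x, mu) for x in range(101)]
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

print(round(mean, 6), round(variance, 6))
```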

## Exponential Distribution
The exponential distribution is widely used for survival analysis. It is used to model lifetimes, from the expected life of a machine to the expected life of a human.

A random variable X is said to have an exponential distribution if its PDF is:

$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0 \\ 0, & \text{otherwise} \end{cases}$$

where the parameter λ > 0 is also called the rate. For survival analysis, λ is called the failure rate of a device at any time t, given that it has survived up to t.

- Mean -> E(X) = 1/λ
- Variance -> Var(X) = (1/λ)²

Also, the greater the rate, the faster the curve drops, and the lower the rate, the flatter the curve. This is explained better with the graph shown below.

- $P(X \le x) = 1 - e^{-\lambda x}$ corresponds to the area under the density curve to the left of x
- $P(X > x) = e^{-\lambda x}$ corresponds to the area under the density curve to the right of x
- $P(x_1 < X \le x_2) = e^{-\lambda x_1} - e^{-\lambda x_2}$ corresponds to the area under the density curve between x1 and x2
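
The three area formulas translate directly into code. The following is a minimal sketch: the function names are illustrative, and the failure rate λ = 0.5 per year is a hypothetical value chosen only to produce concrete numbers.

```python
from math import exp


def exp_cdf(x, lam):
    """P(X <= x) = 1 - exp(-lam * x) for x >= 0, and 0 for x < 0."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0


def exp_survival(x, lam):
    """P(X > x) = exp(-lam * x) for x >= 0, and 1 for x < 0."""
    return 1 - exp_cdf(x, lam)


def exp_between(x1, x2, lam):
    """P(x1 < X <= x2) = exp(-lam * x1) - exp(-lam * x2) for 0 <= x1 <= x2."""
    return exp_cdf(x2, lam) - exp_cdf(x1, lam)


lam = 0.5  # hypothetical failure rate: 0.5 failures per year
mean_lifetime = 1 / lam  # E(X) = 1/λ

print(mean_lifetime)
print(round(exp_cdf(2, lam), 4))         # area to the left of x = 2
print(round(exp_survival(2, lam), 4))    # area to the right of x = 2
print(round(exp_between(1, 3, lam), 4))  # area between x1 = 1 and x2 = 3
```

The left and right areas at the same x always sum to 1, since together they cover the whole density curve.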
