# MLE Notebook 1: Probability and review
### 0. Contents
Axioms. Common discrete and continuous random variables, probability mass and density functions, cumulative functions.

Expectation, variance, statistical moments, moment generating functions. 

Central limit theorem, Weak law of large numbers. 

#### 0.1 Suggested readings
1) Millar, *Maximum Likelihood Estimation and Inference*
   
2) Crawley, *The R Book*

3) Hogg & Tanis, *Probability and Statistical Inference*

#### 0.2 Acknowledgements
This series of five notebooks is compiled by Tin-Yu Hui based on existing materials. Special thanks to Dan Reuman, who used to teach this module a long time ago. Some examples are extracted from Mick Crawley's GLM course, which I had the pleasure of attending, both as a student and as a GTA. Any errors that remain are, of course, my sole responsibility. 

### 1. The three axioms of probability
These axioms are the building blocks of modern theories of probability and statistics. 

1) For any event $A$ in a sample space $S$, $Pr(A)\geqslant 0$.

2) $Pr(S)=1$
   
3) For disjoint events $A_1, A_2, A_3, ...$, then $Pr(A_1\cup A_2\cup A_3\cup ...)=Pr(A_1)+Pr(A_2)+Pr(A_3)+...$

We assign a probability measure $Pr(A)$ to an event $A$. The first axiom states that probability is always non-negative. The smallest probability is zero (i.e. impossible). The second axiom states that the probability of the whole sample space is one. The sample space $S$ contains all possible outcomes for the given random experiment. This also specifies the upper bound for a probability. The third axiom states that the probability of the union of disjoint (i.e. non-overlapping) events equals the sum of their individual probabilities. Think of a Venn diagram. 

### 2. Random variables
A random variable (r.v.) is a variable that takes on its value by chance. A r.v. can take on a set of possible values, each with an associated probability. To fully characterise a r.v. we need to know 1) all its possible outcomes, which form the domain or support of the r.v., and 2) the probability of getting each outcome. 

Example: Let $X$ be the outcome from a coin toss. Certainly $X$ is random. There can only be two possible outcomes: head or tail. If the coin is fair then $Pr(X=head)=Pr(X=tail)=0.5$. These statements jointly characterise the r.v. $X$. 

#### 2.1 Discrete random variables
Some r.v. take a discrete collection of values. We call them discrete r.v.. An example of a discrete r.v. is the outcome from rolling a fair die. 

A probability *mass* function (pmf) for a discrete r.v. $X$ is a function that describes the relative probability that $X$ takes each of its possible values. In most textbooks, the pmf is written as $f(x)$ or $f_X(x)$. See #2.3 for more notations. 

##### 2.1.1 Bernoulli random variable
A Bernoulli r.v. is the simplest r.v. with two outcomes: success (1) or failure (0). It has one parameter $p$, the probability of success, which is bounded between 0 and 1. If $X\sim Bernoulli(p)$ then it is obvious that $Pr(X=1)=p$ and $Pr(X=0)=1-p$. While these two equations technically summarise the probabilities, the pmf has an alternative expression: $f_X(x)=p^x(1-p)^{1-x}$. 

Note that $f_X(x)=0$ elsewhere (outside of the support), but this statement is often too trivial to be included. 
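
As a quick check, the compact pmf expression can be evaluated by hand and compared with R's built-in function. This is only a sketch: the helper name `bernoulli_pmf` and the value $p=0.3$ are illustrative choices, and it relies on a Bernoulli r.v. being a binomial r.v. with a single trial.

```r
# Bernoulli pmf written out from the formula f(x) = p^x * (1 - p)^(1 - x)
bernoulli_pmf <- function(x, p) p^x * (1 - p)^(1 - x)

p <- 0.3        # illustrative probability of success
x <- c(0, 1)    # the support of a Bernoulli r.v.

by_hand  <- bernoulli_pmf(x, p)
built_in <- dbinom(x, size = 1, prob = p)  # Bernoulli is binomial with one trial

print(data.frame(x, by_hand, built_in))
```

Both columns should show $1-p$ at $x=0$ and $p$ at $x=1$.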

##### 2.1.2 Binomial random variable
A binomial r.v. is the sum of $n$ independent and identically distributed (i.i.d.) Bernoulli r.v., hence it takes values on $\{0, 1, 2, ..., n\}$. It is a two-parameter r.v.: $p$ the probability of success, inherited from the Bernoulli r.v., and $n$ the number of i.i.d. Bernoulli trials. If $X\sim binomial(n, p)$ then its pmf is
$$f_X(x)=C^n_{x}p^x(1-p)^{n-x}$$
where $C^n_{x}$ is the number of combinations when we choose $x$ objects from $n$. Order of selection does not matter here. 
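
The binomial pmf can be reproduced term by term in R, where `choose(n, x)` gives $C^n_x$. The values $n=10$ and $p=0.3$ below are arbitrary and only serve as an illustration.

```r
n <- 10    # illustrative number of trials
p <- 0.3   # illustrative probability of success
x <- 0:n   # the support {0, 1, ..., n}

# pmf from the formula: C(n, x) * p^x * (1 - p)^(n - x)
by_hand  <- choose(n, x) * p^x * (1 - p)^(n - x)
built_in <- dbinom(x, size = n, prob = p)

print(round(data.frame(x, by_hand, built_in), 4))
print(sum(by_hand))  # the probabilities over the whole support add up to one
```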

##### 2.1.3 Poisson random variable
A Poisson r.v. models the number of events occurring in a fixed interval of time. Since it is a count, its possible values are all non-negative integers $\{0, 1, 2, 3, ...\}$. While there are infinitely many possible outcomes it is still regarded as a discrete r.v.. 

Poisson has one parameter which is the rate of occurrence $\lambda>0$. If $X\sim Poisson(\lambda)$ then
$$f_X(x)=\frac{\lambda^{x}e^{-\lambda}}{x!}$$

If $X\sim binomial(n, p)$ with reasonably large $n$ and reasonably small $p$ (so that $np$ stays moderate), then $X$ can be approximated by a Poisson r.v. with $\lambda=np$. That is, the number of rare events can be modelled by Poisson. 
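
The quality of the Poisson approximation can be inspected numerically. The sketch below uses illustrative values $n=1000$ and $p=0.002$, so that $\lambda=np=2$, and compares the two pmfs over the first few counts.

```r
n <- 1000         # many trials (illustrative)
p <- 0.002        # each success is rare (illustrative)
lambda <- n * p   # matching Poisson rate, here 2

x <- 0:10
binomial_pmf <- dbinom(x, size = n, prob = p)
poisson_pmf  <- dpois(x, lambda = lambda)

print(round(data.frame(x, binomial_pmf, poisson_pmf), 5))

# Largest pointwise discrepancy between the two pmfs
max_abs_diff <- max(abs(binomial_pmf - poisson_pmf))
print(max_abs_diff)
```

Rerunning with a larger $p$ (say 0.3, with $n$ reduced to keep $np$ comparable) should show a visibly worse match.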

#### 2.2 Continuous random variables
Continuous r.v., in contrast, take a whole range of real-number values (think of tomorrow's temperature or allele frequencies). To accommodate continuous r.v., a probability *density* function (pdf) is in place to describe the relative probability that the r.v. takes each value in the range of possible values. 

Recall: The range of possible values with non-zero probability is called the *support* of a r.v.. 

##### 2.2.1 Uniform random variable
A uniform r.v. is a continuous r.v. with two parameters $a$ and $b$, which are the lower and upper bounds of its support. If $X\sim U(a,b)$ then, for $a\leqslant x\leqslant b$,
$$f_X(x)=1/(b-a)$$
which looks like a horizontal line from $a$ to $b$. 

##### 2.2.2 Exponential random variable
An exponential r.v. models the time between two successive events (remember Poisson r.v.?). Since it is a measure of time, it is continuous with support $[0, \infty)$ (inclusive of 0, but always smaller than infinity). It is a one-parameter r.v. which shares the same rate parameter $\lambda$ with Poisson. If $X\sim Exponential(\lambda)$ then
$$f_X(x)=\lambda e^{-\lambda x}$$

##### 2.2.3 Normal random variable
Later we will learn why the normal is the most famous r.v. of all, and why the normal approximation usually holds even if we have limited knowledge of the underlying distribution. For $X\sim N(\mu, \sigma^2)$, it takes values over the entire real number line, from negative to positive infinity, or $x \in \Re$. Although its bell-shaped pdf is widely known and aesthetically pleasing, its mathematical form is not as memorable: 
$$f_X(x)=\frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
It is a two-parameter r.v. with $\mu$ and $\sigma^2$ if you have not already realised. 
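
The density can be coded directly and checked against `dnorm()`. In this sketch the helper `normal_pdf` and the values $\mu=1$, $\sigma^2=4$ are illustrative; the normalising constant used is $1/\sqrt{2\pi\sigma^2}$. Note that `dnorm()` is parameterised by the standard deviation $\sigma$, not the variance $\sigma^2$.

```r
# Normal pdf from first principles, parameterised by mean and variance
normal_pdf <- function(x, mu, sigma2) {
  exp(-(x - mu)^2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)
}

mu     <- 1   # illustrative mean
sigma2 <- 4   # illustrative variance, so the standard deviation is 2
x <- seq(-5, 7, by = 0.5)

by_hand  <- normal_pdf(x, mu, sigma2)
built_in <- dnorm(x, mean = mu, sd = sqrt(sigma2))  # dnorm() expects sd, not variance

max_abs_diff <- max(abs(by_hand - built_in))
print(max_abs_diff)  # should be zero up to floating-point error
```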

#### 2.3 Some notations
Most textbooks use the function $f()$ to denote a pmf/pdf. Some may specify the r.v. of interest through the subscript, e.g. $f_X()$. By convention, capital letters denote r.v.. The use of subscripts is extremely helpful when multiple r.v. are involved, say, when mentioning $X$ and $Y$ and their associated $f_X(x)$ and $f_Y(y)$. The lowercase $x$ inside the round brackets indicates the value at which the pmf/pdf is evaluated. These small $x$ or $y$ are real numbers (not r.v.). Numbers are numbers, r.v. are r.v.. 

Some texts may even state the associated parameter(s) $\theta$ in the pmf/pdf, say, $f(x; \theta)$ or $f(x|\theta)$. The latter reads as "$x$ given $\theta$". 

#### 2.4 Built-in statistical tables in R
We should always make good use of the built-in statistical functions in R. For example, there are <code>pnorm()</code>, <code>dnorm()</code>, <code>qnorm()</code>, and <code>rnorm()</code> for the normal distribution. The prefix <code>p</code> returns the cmf/cdf (see #3.2 below), <code>d</code> the pmf/pdf, <code>q</code> the quantiles (for hypothesis testing), and <code>r</code> random numbers. We will experiment with some of these functions in today's practical. 
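
A brief tour of the four prefixes for the standard normal distribution is sketched below. The evaluation points and the seed are arbitrary; the same naming pattern applies to other distributions (e.g. `dbinom()`, `ppois()`, `qexp()`, `runif()`).

```r
# d: density (pdf) evaluated at a point
d <- dnorm(0)        # height of the standard normal curve at 0, i.e. 1 / sqrt(2 * pi)

# p: cumulative probability Pr(X <= x)
p <- pnorm(1.96)     # roughly 0.975

# q: quantile, the inverse of p
q <- qnorm(0.975)    # roughly 1.96

# r: random draws (the seed only makes the draws reproducible)
set.seed(1)
r <- rnorm(5)

print(c(d = d, p = p, q = q))
print(r)
```

Since `q` inverts `p`, `pnorm(qnorm(0.975))` returns 0.975.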

### 3. Probability and cumulative functions
#### 3.1 Properties of probability mass and density functions
As discussed, pmf/pdf are functions that describe the relative probabilities of the outcomes and characterise a r.v.. From the first axiom, probabilities are non-negative, hence the pmf/pdf never go below the horizontal axis. From the second axiom, we learnt that the pmf bars must sum to one:
$$\sum_{all~possible~outcomes} f_X(x)=1$$
For continuous case, if we take the limit of summation (of vertical bars) it becomes integration: 
$$\int_{all~possible~outcomes} f_X(x)dx=1$$
That is, the *area* under a pdf must be one. 
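
Both statements can be verified numerically. The sketch below sums a Poisson pmf over a range wide enough to capture essentially all of its mass, and uses `integrate()` for the area under a normal pdf; the distributions and parameter values are illustrative choices only.

```r
# Discrete case: the Poisson(3) support is infinite, but the mass beyond
# 100 is negligible, so a truncated sum is numerically equal to one
total_pmf <- sum(dpois(0:100, lambda = 3))

# Continuous case: numerical integration of a N(1, 2^2) pdf over the real line
area_pdf <- integrate(dnorm, lower = -Inf, upper = Inf, mean = 1, sd = 2)$value

print(c(total_pmf = total_pmf, area_pdf = area_pdf))
```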

#### 3.2 Cumulative mass and density functions
We use capital $F()$ for cumulative functions (cmf/cdf). As its name suggests, $F_X(x)=Pr(X\leqslant x)$ by definition. It is a non-decreasing function with $F(-\infty)=0$ and $F(\infty)=1$. For discrete r.v., 
$$F_X(x)=\sum_{x_i\leqslant x}f_X(x_i)$$
For continuous case, $F_X(x)$ is the area under the pdf curve, from $-\infty$ to $x$: 
$$F_X(x)=\int_{-\infty}^{x}f_X(t)dt$$
I hope you still remember the fundamental theorem of calculus. Conversely, we can obtain pdf by differentiating cdf. You only need either the cumulative or probability function to characterise a r.v. 
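
The relationships between $f$ and $F$ can be illustrated in R. This is a sketch with arbitrary parameter values: a cumulative sum of the binomial pmf is compared with `pbinom()`, the area under an exponential pdf with `pexp()`, and a finite-difference slope of the cdf with `dexp()`.

```r
# Discrete: the cdf is the running total of the pmf
x <- 0:10
cdf_by_sum   <- cumsum(dbinom(x, size = 10, prob = 0.3))
cdf_built_in <- pbinom(x, size = 10, prob = 0.3)

# Continuous: the cdf is the area under the pdf up to x0
rate <- 2     # illustrative rate
x0   <- 0.7   # illustrative evaluation point
cdf_by_area <- integrate(dexp, lower = 0, upper = x0, rate = rate)$value

# Conversely, the slope of the cdf recovers the pdf (central difference)
h <- 1e-5
pdf_by_slope <- (pexp(x0 + h, rate = rate) - pexp(x0 - h, rate = rate)) / (2 * h)

print(c(cdf_by_area = cdf_by_area, pexp = pexp(x0, rate = rate)))
print(c(pdf_by_slope = pdf_by_slope, dexp = dexp(x0, rate = rate)))
```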

### 4. Statistical moments and expectation
#### 4.1 Expectation
Imagine an experiment can be repeated *infinitely* many times. Imagine you keep tossing a coin or keep drawing random numbers from a given distribution *infinitely* many times. The expectation is the "average" of the said experiment. 

Of course this "average" is a hypothetical one as nobody can afford *infinitely* many repeats. Here we describe the "average" behaviour of a r.v. on the population level. Try not to confuse it with the "sample average" that we tend to calculate from real data. In fact, today's discussion does not involve any data. We are merely discussing the characteristics of r.v. based on some given random mechanisms. 

For discrete r.v., 
$$E[X]=\sum_{all~possible~outcomes}xf(x)$$
For continuous r.v., 
$$E[X]=\int_{-\infty}^{+\infty}xf(x)dx$$
You can replace the bounds of the integral by the support of $X$. $E[X]$ is the expected value of $X$, the "average" value weighted according to the pmf/pdf. $E[X]$ is also called the population mean or true mean of the r.v. $X$. It is a measure of central tendency. 
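
Two small worked cases may help, one discrete and one continuous. The fair die and the exponential rate of 2 are illustrative choices; for the latter the integral is evaluated numerically over the support $[0,\infty)$.

```r
# Discrete: a fair six-sided die, each face with probability 1/6
faces <- 1:6
e_die <- sum(faces * rep(1 / 6, 6))   # sum of x * f(x)

# Continuous: Exponential(rate = 2), integrating x * f(x) over the support
rate  <- 2
e_exp <- integrate(function(x) x * dexp(x, rate = rate),
                   lower = 0, upper = Inf)$value

print(c(e_die = e_die, e_exp = e_exp))   # 3.5 and 1 / rate = 0.5
```

Note that 3.5 is not a possible outcome of a die roll: the expectation is a weighted average, not necessarily a value in the support.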

#### 4.2 Variance
Similarly, we have the population variance, which is given by
$$Var[X]=E[(X-E[X])^2]$$
The formula above suggests that variance is the expected squared distance of the r.v. $X$ from its population mean. In practice we tend to use this alternative form: 
$$Var[X]=E[X^2]-(E[X])^2$$
There is no surprise that variance is a measure of dispersion. 
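
The two forms of the variance can be checked against each other. The sketch below uses a fair six-sided die as an illustrative r.v.; both expressions should give $35/12$.

```r
faces <- 1:6
prob  <- rep(1 / 6, 6)          # pmf of a fair die

mean_x <- sum(faces * prob)     # E[X] = 3.5

# Definition: expected squared distance from the mean
var_definition  <- sum((faces - mean_x)^2 * prob)

# Alternative form: E[X^2] - (E[X])^2
var_alternative <- sum(faces^2 * prob) - mean_x^2

print(c(var_definition = var_definition, var_alternative = var_alternative))
```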

#### 4.3 Higher moments
In general, the $n^{th}$ *raw* moment of $X$ is $E[X^n]$: 
$$E[X^n]=\int x^nf(x)dx$$

And the $n^{th}$ central moment is $E[(X-E[X])^n]$. In most cases only the first few moments are studied. For example, the third central moment of a r.v. describes its skewness (e.g. a normal r.v. has 0 skewness as a bell curve is symmetric about $\mu$), and the fourth central moment is a measure of kurtosis (fat tails). 

Note that not all distributions have finite moments. One example is the Cauchy distribution (t-distribution with 1 degree of freedom) whose $E[X]$ is undefined. 

#### 4.4 More on the expectation operator
After taking the expectation from a r.v., we get a real number. Note that expectation is linear: 
$$E[aX+bY]=aE[X]+bE[Y]$$
for any r.v. $X$, $Y$ and any real numbers $a$, $b$. 

In some cases, we may be required to transform a r.v. or to calculate the expectation of a transformed r.v.: 
$$E[g(X)]=\int g(x)f(x)dx$$
for any real function $g$. 

Note that $g(X)$ itself is another r.v. with its own support, pdf/pmf, expectation, etc.. The same is true for $(X+Y)$, that is, the sum (or product) of r.v. is another r.v.. Remember, transformation of a r.v. yields another r.v.. A r.v. will not suddenly turn into a real number. 
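
As an illustration, take $X\sim U(0,1)$ and $g(x)=x^2$ (both arbitrary choices). The sketch below computes $E[g(X)]$ from the integral formula, then simulates $g(X)$ to emphasise that it is a r.v. in its own right; the seed and number of draws are arbitrary.

```r
# E[g(X)] from the integral of g(x) * f(x) over the support of X
g <- function(x) x^2
e_g_integral <- integrate(function(x) g(x) * dunif(x, min = 0, max = 1),
                          lower = 0, upper = 1)$value   # exact answer is 1/3

# g(X) is itself a r.v.: transform random draws of X and average them
set.seed(2024)
gx <- g(runif(1e5))
e_g_simulated <- mean(gx)

# In general E[g(X)] differs from g(E[X]); here g(E[X]) = 0.5^2 = 0.25
g_of_mean <- g(0.5)

print(c(e_g_integral = e_g_integral,
        e_g_simulated = e_g_simulated,
        g_of_mean = g_of_mean))
```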

#### 4.5 Moment generating function
A moment generating function (mgf) is the third way to characterise a r.v.. $M_X(t)$ is a carefully crafted function of $X$ such that it "generates" statistical moments through its derivatives at $t=0$. Note that $t$ is a dummy variable. The $n^{th}$ raw moment of $X$ is: 
$$E[X^n]=\frac{d^nM_X(t)}{dt^n}|_{t=0}$$

For keen readers, $M_X(t)=E[e^{tX}]$. 
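
A short worked example, using the exponential r.v. from #2.2.2 as an illustrative case. For $X\sim Exponential(\lambda)$ and $t<\lambda$, 
$$M_X(t)=E[e^{tX}]=\int_{0}^{\infty}e^{tx}\lambda e^{-\lambda x}dx=\frac{\lambda}{\lambda-t}$$
Differentiating once and twice with respect to $t$, then setting $t=0$: 
$$E[X]=\frac{\lambda}{(\lambda-t)^2}\Big|_{t=0}=\frac{1}{\lambda}, \qquad E[X^2]=\frac{2\lambda}{(\lambda-t)^3}\Big|_{t=0}=\frac{2}{\lambda^2}$$
Hence $Var[X]=E[X^2]-(E[X])^2=2/\lambda^2-1/\lambda^2=1/\lambda^2$. 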

### 5. Central limit theorem and Weak law of large numbers
#### 5.1 Central limit theorem
Let $\{X_1, X_2, X_3, ..., X_n\}$ be i.i.d. r.v. with finite $E[X_i]=\mu$ and finite $Var[X_i]=\sigma^2$. Also let $\bar{X_n}=(X_1+X_2+...+X_n)/n$ be the sample mean of these r.v. ($\bar{X_n}$ is another r.v.). The central limit theorem states that as $n\rightarrow \infty$, the r.v. $\sqrt{n}(\bar{X_n}-\mu)$ converges *in distribution* to a normal distribution: 
$$\sqrt{n}(\bar{X_n}-\mu) \xrightarrow{d} N(0,\sigma^2)$$

See today's practical for visualisation. 
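
A minimal simulation sketch of the statement above, assuming exponential draws with an illustrative rate of 2 (so $\mu=\sigma=1/2$). The sample size, number of replicates, and seed are arbitrary choices.

```r
set.seed(42)
rate  <- 2          # illustrative rate of the underlying exponential r.v.
mu    <- 1 / rate   # population mean
sigma <- 1 / rate   # population standard deviation
n     <- 200        # size of each sample
reps  <- 5000       # number of repeated samples

# For each replicate, compute sqrt(n) * (sample mean - mu)
z <- replicate(reps, sqrt(n) * (mean(rexp(n, rate = rate)) - mu))

# If the theorem holds, z should look like draws from N(0, sigma^2)
print(c(mean_z = mean(z), sd_z = sd(z), sigma = sigma))
```

For a visual check, `hist(z, freq = FALSE)` followed by `curve(dnorm(x, 0, sigma), add = TRUE)` overlays the limiting normal density on the simulated values, even though the underlying exponential distribution is strongly skewed.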

#### 5.2 Weak law of large numbers
Let us consider a similar series of i.i.d. r.v. $\{X_1, X_2, X_3, ..., X_n\}$ with finite $E[X_i]=\mu$. The weak law of large numbers states that the sample mean $\bar{X_n}$ converges *in probability* to the expected value when $n\rightarrow \infty$. That is, for any positive $\epsilon$, 
$$\lim_{n\rightarrow \infty}Pr(|\bar{X_n}-\mu|<\epsilon)=1$$
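
The limit can be explored by simulation. The sketch below assumes fair coin tosses ($\mu=0.5$) with an arbitrary tolerance $\epsilon=0.05$, and estimates $Pr(|\bar{X_n}-\mu|<\epsilon)$ for increasing $n$. It uses the fact from #2.1.2 that the sum of $n$ i.i.d. Bernoulli r.v. is binomial, so each sample mean is a binomial draw divided by $n$.

```r
set.seed(1)
mu   <- 0.5    # probability of heads, also the population mean
eps  <- 0.05   # illustrative tolerance
reps <- 2000   # number of simulated sample means per n
n_values <- c(10, 100, 1000, 10000)

prop_within <- sapply(n_values, function(n) {
  # Sample mean of n Bernoulli(mu) trials = binomial(n, mu) count divided by n
  xbar <- rbinom(reps, size = n, prob = mu) / n
  mean(abs(xbar - mu) < eps)   # estimated Pr(|sample mean - mu| < eps)
})

print(data.frame(n = n_values, prop_within))
```

The estimated probability should climb towards one as $n$ grows.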