# Probability


## Probability Space


A *sample space* $\Omega$ is a collection of all possible outcomes. It is a set of things.

An *event* $A$ is a subset of $\Omega$. It is something of interest on the sample space.

A $\sigma$-*field* is a collection of events that is closed under countable unions, intersections, and differences.
It is a well-organized structure built on the sample space. 


A *probability measure* satisfies

* (non-negativity) $P\left(A\right)\geq0$ for all events $A$;
* (countable additivity) if $A_{i}$, $i\in\mathbb{N}$, are
mutually disjoint, then
$P\left(\bigcup_{i\in\mathbb{N}}A_{i}\right)=\sum_{i\in\mathbb{N}}P\left(A_{i}\right);$
* $P(\Omega) = 1$.
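
As a minimal sketch, the three conditions can be checked numerically on a finite sample space. The fair six-sided die, the helper `prob`, and the events `A1` and `A2` are illustrative assumptions, not part of the notes.

```r
# Sample space of a (hypothetical) fair six-sided die
omega <- 1:6
p <- rep(1 / 6, length(omega))   # probability mass on each outcome
names(p) <- omega

# P(A) for an event A, represented as a subset of omega
prob <- function(A) sum(p[as.character(A)])

A1 <- c(1, 2)   # two disjoint events
A2 <- c(5)

all(p >= 0)                                  # non-negativity
prob(union(A1, A2)) - (prob(A1) + prob(A2))  # additivity: zero up to rounding
prob(omega)                                  # total mass of the sample space
```

On a finite sample space, countable additivity reduces to finite additivity, which is what the second check illustrates.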




**Example**: (probability measure) Personal wealth management: asset allocation.
A probability function is not necessarily about uncertainty. Two allocation rules: 1. Equal weight. 2. Optimal weight.

So far we have answered the question: "What is a well-defined probability?", but we have not yet
answered "How to assign the probability?"

There are two major schools of thought on probability assignment. One is the
*frequentist* school, which views probability as the long-run relative frequency of occurrence when a large number of experiments
are carried out. The other is the *Bayesian* school, which regards probability as a subjective belief.
The principles of these two schools are largely incompatible, though each has
its own merits in different contexts.

## Random Variable

A *random variable* maps outcomes in the sample space to real numbers. If the value is multivariate, we call it a *random vector*.
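
To make the mapping concrete, here is a small sketch with a hypothetical experiment: two fair coin tosses, with the random variable defined as the number of heads. The names `omega`, `X`, and `pmf` are chosen for illustration only.

```r
# Sample space: all outcomes of two (hypothetical) fair coin tosses
omega <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                     stringsAsFactors = FALSE)
p <- rep(1 / 4, nrow(omega))    # each outcome is equally likely

# The random variable X maps each outcome to the number of heads
X <- rowSums(omega == "H")

# Distribution of X induced by P: P(X = 0), P(X = 1), P(X = 2)
pmf <- tapply(p, X, sum)
print(pmf)
```

The probabilities of `X` are inherited from the probability measure on the sample space: outcomes that map to the same value have their masses added up.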


**Data example**

Data source: [HK top 300 Youtubers](https://www.kaggle.com/datasets/patriotboy112/hks-top-300-youtubers). We look at the number of uploaded videos in these accounts.

In [None]:
d0 = readr::read_csv("HKTop300YouTubers.csv")
print(d0)

library(magrittr)

# remove NA and zeros
d0[ d0 == 0 ] <- NA
sel <- d0 %>% is.na( ) %>% apply( 1, any)
d0 %<>% dplyr::filter( !sel )

d1 <- d0 %>% dplyr::select(3:5)
names(d1) <- c("subs", "view", "count")


In [None]:
print( d1[["count"]] )

In [None]:
hist(d1[["count"]])

In [None]:
d1.log <- log(d1)
hist(d1.log[["count"]])

## Distribution Function

We go back to some terms that we have learned in an undergraduate
probability course. A *(cumulative) distribution function*
$F:\mathbb{R}\mapsto [0,1]$ is defined as
$$F\left(x\right)=P\left(X\leq x\right)=
P\left(\{X\leq x\}\right).$$
It is often abbreviated as CDF, and it has the following properties.

* $\lim_{x\to-\infty}F\left(x\right)=0$,
* $\lim_{x\to\infty}F\left(x\right)=1$,
* non-decreasing,
* right-continuous: $\lim_{y\to x^{+}}F\left(y\right)=F\left(x\right).$

In [None]:
plot(ecdf(d1.log[["count"]]), verticals = TRUE, do.points = FALSE)

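For $\tau\in(0,1)$, the $\tau$-th *quantile* of $X$ is $q_{\tau}=\inf\left\{ x:F\left(x\right)\geq\tau\right\}$. The code below reports the sample quantiles at probabilities 0.25, 0.5, 0.75, and 1.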

In [None]:
quantile(d1.log[["count"]], probs = 1:4/4 )


For a continuous distribution, if there exists a function $f$ such that for all $x$,
$$F\left(x\right)=\int_{-\infty}^{x}f\left(y\right)dy,$$
then $f$ is called the *probability density function* of $X$, often abbreviated as PDF.
It is easy to show that $f\left(x\right)\geq0$ and
$\int_{a}^{b}f\left(x\right)dx=F\left(b\right)-F\left(a\right)$.
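
The relation between the PDF and the CDF can be verified numerically. The sketch below uses the standard normal distribution and arbitrary endpoints `a` and `b` as an assumed example; `integrate` is base `R`'s numerical integration routine.

```r
a <- -1
b <- 2

# Left-hand side: integrate the standard normal density over [a, b]
lhs <- integrate(dnorm, lower = a, upper = b)$value

# Right-hand side: difference of the CDF at the endpoints
rhs <- pnorm(b) - pnorm(a)

print(c(integral = lhs, cdf_difference = rhs))

# The CDF itself is the integral of the density from -Inf
integrate(dnorm, lower = -Inf, upper = b)$value   # compare with pnorm(b)
```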



**Example** We have learned many parametric distributions, such as the Bernoulli (binary) distribution, the Poisson distribution,
the uniform distribution, the normal distribution, $\chi^{2}$, $t$, $F$, and so on.
They are called parametric because the CDF or PDF is completely
characterized by a few parameters.



**Example** `R` implements a rich collection of distributions under a unified naming rule:
`d` for density, `p` for cumulative probability, `q` for quantile, and `r` for random number generation.
For instance, `dnorm`, `pnorm`, `qnorm`, and `rnorm` are the corresponding functions of the normal distribution, and the parameters $\mu$ and $\sigma$ can be specified through the arguments `mean` and `sd`.


In [None]:
qnorm(0.975)

In [None]:
pnorm(0)

In [None]:
dnorm(0)

In [None]:
rnorm(3)

In [None]:
rpois(2, 5)


Below is a piece of `R` code for demonstration.

1. Plot the density of the standard normal distribution over an equally spaced grid `x_axis = seq(-3, 3, by = 0.01)` (black line).
2. Generate 1000 observations from $N(0,1)$. Plot the kernel density estimate, a nonparametric estimate of the density (red line).
3. Calculate the 95th percentile and the empirical probability of observing a value greater than it.
In the population, this probability is 5%. What is the number coming out of this experiment?

(Since we do not fix the random seed in the computer, the outcome is slightly different each time we run the code.)

In [None]:
x_axis = seq(-3, 3, by = 0.01)

y = dnorm(x_axis)
plot(y = y, x=x_axis, type = "l", xlab = "value", ylab = "density")
z = rnorm(1000)
lines( density(z), col = "red")
crit = qnorm(.95)

cat("the empirical rejection probability is", mean( z > crit ), "\n")

# Expected Value



## Integration

An integral
$\int X\mathrm{d}P$ is called the *expected value,* or
*expectation,* of $X$. We often use the notation
$E\left[X\right]$, instead of $\int X\mathrm{d}P$, for convenience.

Expectation gives the average of a random variable,
even though we cannot foresee its realization in a particular trial
(otherwise the study of uncertainty would be trivial). In the frequentist's view,
the expectation is the average outcome if we carry out a large number of independent
trials.

If we know the probability mass function of a discrete random variable, its expectation
is calculated as $E\left[X\right]=\sum_{x}xP\left(X=x\right)$, which is
the integral of a simple function.
If a continuous random variable has a PDF $f(x)$, its expectation
can be computed as  $E\left[X\right]=\int xf\left(x\right)\mathrm{d}x$.
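
Both formulas can be evaluated directly in `R`. The following sketch assumes two example distributions, Poisson with mean 5 and normal with mean 1 and standard deviation 2, and truncates the infinite Poisson sum at 200, where the remaining mass is negligible.

```r
# Discrete case: X ~ Poisson(5); truncate the infinite sum at a large value
x <- 0:200
e_pois <- sum(x * dpois(x, lambda = 5))     # should be close to 5

# Continuous case: X ~ N(1, 2^2); integrate x * f(x) over the real line
integrand <- function(x) x * dnorm(x, mean = 1, sd = 2)
e_norm <- integrate(integrand, lower = -Inf, upper = Inf)$value   # close to 1

print(c(poisson = e_pois, normal = e_norm))
```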




In [None]:
mean( d1.log[["count"]] )


Here are some properties of the expectation.


-  The probability of an event $A$ is the expectation
of an indicator function. $E\left[1\left\{ A\right\}  \right]= 1\times P(A) + 0 \times P(A^c) =P\left(A\right)$.

-   $E\left[X^{r}\right]$ is called the $r$-th moment of $X$. The *mean* of a random variable is the first moment $\mu=E\left[X\right]$, and
the second *centered* moment is called the *variance*
$\mathrm{var}\left[X\right]=E\left[\left(X-\mu\right)^{2}\right]$.

In [None]:
var(d1.log[["count"]])


The third centered moment $E\left[\left(X-\mu\right)^{3}\right]$,
called *skewness*, measures the asymmetry of a random variable, and the fourth centered moment
$E\left[\left(X-\mu\right)^{4}\right]$, called *kurtosis*,
measures the tail thickness.

- We call
    $E\left[\left(X-\mu\right)^{3}\right]/\sigma^{3}$ the *skewness coefficient*, and
    $E\left[\left(X-\mu\right)^{4}\right]/\sigma^{4}-3$ the *degree of excess* (also known as excess kurtosis), where $\sigma^{2}=\mathrm{var}\left[X\right]$. A normal distribution's skewness coefficient and degree of excess are both zero.

    -   **Application**: [The formula that killed Wall
        Street](http://archive.wired.com/techbiz/it/magazine/17-03/wp_quant?currentPage=all)

- Moments do not always exist. For example, the mean of the Cauchy distribution does not exist,
and the variance of the $t(2)$ distribution does not exist.

- $E[\cdot]$ is a linear operation. If $\phi(\cdot)$ is a linear function, then $E[\phi(X)] = \phi(E[X]).$

-   *Jensen's inequality* is an important fact.
A function $\varphi(\cdot)$ is convex if
$\varphi( a x_1 + (1-a) x_2 ) \leq a \varphi(x_1) + (1-a) \varphi(x_2)$ for all $x_1,x_2$
in the domain and $a\in[0,1]$. For instance, $x^2$ is a convex function.
Jensen's inequality says that if $\varphi\left(\cdot\right)$ is a convex
    function, then
    $\varphi\left(E\left[X\right]\right)\leq E\left[\varphi\left(X\right)\right].$
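
Jensen's inequality can be illustrated with the convex function $\varphi(x)=x^{2}$, replacing expectations with sample averages. The exponential distribution and the seed below are arbitrary choices made only for this sketch.

```r
set.seed(1)                      # fixed only so the sketch is reproducible
x <- rexp(10000, rate = 1)       # any distribution with a finite second moment works

phi <- function(v) v^2           # a convex function

lhs <- phi(mean(x))              # phi(E[X]), using the sample mean
rhs <- mean(phi(x))              # E[phi(X)], using the sample mean of phi(X)

print(c(phi_of_mean = lhs, mean_of_phi = rhs))
# rhs - lhs equals the sample variance with divisor n, which is never negative
```

For this particular $\varphi$, the gap $E\left[X^{2}\right]-\left(E\left[X\right]\right)^{2}$ is exactly the variance, so the inequality restates that a variance cannot be negative.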


*Markov inequality* is another simple but important fact. If $E\left[\left|X\right|^{r}\right]$ exists for some $r\geq1$,
then
$P\left(\left|X\right|>\epsilon\right)\leq E\left[\left|X\right|^{r}\right]/\epsilon^{r}$
for all $\epsilon>0$. *Chebyshev inequality* $P\left(\left|X\right|>\epsilon\right)\leq E\left[X^{2}\right]/\epsilon^{2}$
is a special case of the Markov inequality when $r=2$.
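
A quick simulation compares the tail probability with the Markov bounds for $r=1$ and $r=2$. The standard normal sample, the threshold `eps`, and the seed are assumptions made for this sketch.

```r
set.seed(2)
x <- rnorm(10000)
eps <- 2

tail_prob <- mean(abs(x) > eps)          # empirical P(|X| > eps)
bound_r1 <- mean(abs(x)) / eps           # Markov bound with r = 1
bound_r2 <- mean(x^2) / eps^2            # Chebyshev bound (r = 2)

print(c(tail = tail_prob, markov_r1 = bound_r1, chebyshev = bound_r2))
```

The bounds hold but are typically loose, because they use only a single moment of the distribution.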


# Multivariate Random Variable

A bivariate random variable is a
measurable function $X:\Omega\mapsto\mathbb{R}^{2}$, and more generally a multivariate random
variable is a measurable function $X:\Omega\mapsto\mathbb{R}^{n}$.
We can define the *joint CDF* as
$F\left(x_{1},\ldots,x_{n}\right)=P\left(X_{1}\leq x_{1},\ldots,X_{n}\leq x_{n}\right)$.
Joint PDF is defined similarly.

In [None]:
plot(d1.log$subs, d1.log$view)


It is sufficient to introduce the joint distribution, conditional distribution
and marginal distribution in the simple bivariate case, and these definitions
can be extended to multivariate distributions. Suppose
a bivariate random variable $(X,Y)$ has a joint density
$f(\cdot,\cdot)$.
The  *conditional density* can be roughly written as  $f\left(y|x\right)=f\left(x,y\right)/f\left(x\right)$ if we do not formally deal
with the case $f(x)=0$.
The *marginal density* $f\left(y\right)=\int f\left(x,y\right)dx$ integrates out
the coordinate that is not of interest.
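
The two definitions can be sketched numerically. The joint density $f(x,y)=x+y$ on the unit square is a hypothetical example chosen because its marginal, $f(y)=y+1/2$, is easy to verify by hand; the function names are illustrative.

```r
# A hypothetical joint density on the unit square: f(x, y) = x + y
f_joint <- function(x, y) x + y

# Marginal density of Y: integrate x out over [0, 1]
f_y <- function(y) {
  sapply(y, function(yy) integrate(function(x) f_joint(x, yy), 0, 1)$value)
}

# Marginal density of X, needed for the conditional density f(y | x)
f_x <- function(x) {
  sapply(x, function(xx) integrate(function(y) f_joint(xx, y), 0, 1)$value)
}

# Conditional density f(y | x) = f(x, y) / f(x)
f_cond <- function(y, x) f_joint(x, y) / f_x(x)

f_y(0.3)                                               # analytically 0.3 + 1/2
integrate(function(y) f_cond(y, x = 0.2), 0, 1)$value  # a conditional density integrates to 1
```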

## Independence

In a probability space $(\Omega, \mathcal{F}, P)$, for two events $A_1,A_2\in \mathcal{F}$ the *conditional probability* is

$$P\left(A_1|A_2\right) = \frac{P\left(A_1 A_2\right)}{ P\left(A_2\right) }.$$ 

In the definition of conditional probability, $A_2$ plays
the role of the outcome space  so that
 $P(A_1 A_2)$ is standardized by the total mass $P(A_2)$.

Since $A_1$ and $A_2$ are symmetric, we also have $P(A_1 A_2) = P(A_2|A_1)P(A_1)$.
It implies
$$P(A_1 | A_2)=\frac{P\left(A_2| A_1\right)P\left(A_1\right)}{P\left(A_2\right)}.$$
This formula is the well-known *Bayes' Theorem*. It is particularly important in
decision theory.
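
A worked calculation may help. The numbers below are hypothetical and chosen only for illustration: $A_1$ is a rare condition with prior probability 1%, and $A_2$ is a positive result from a test that detects the condition 95% of the time and gives a false positive 5% of the time.

```r
# Hypothetical numbers, chosen only for illustration
p_A1 <- 0.01               # P(A1): prior probability of the condition
p_A2_given_A1 <- 0.95      # P(A2 | A1): positive test given the condition
p_A2_given_notA1 <- 0.05   # P(A2 | A1^c): false positive rate

# Total probability: P(A2) = P(A2 | A1) P(A1) + P(A2 | A1^c) P(A1^c)
p_A2 <- p_A2_given_A1 * p_A1 + p_A2_given_notA1 * (1 - p_A1)

# Bayes' theorem
p_A1_given_A2 <- p_A2_given_A1 * p_A1 / p_A2
print(p_A1_given_A2)
```

Under these assumed numbers the posterior probability is about 0.16, far below the test's 95% detection rate, because the prior $P(A_1)$ is small.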



**Example:** $A_1$ is the event "a student can survive CUHK's MSc program", and $A_2$ is
his or her application profile.




We say two events $A_1$ and $A_2$ are *independent* if $P(A_1A_2) = P(A_1)P(A_2)$.
If $P(A_2) \neq 0$, it is equivalent to $P(A_1 | A_2 ) = P(A_1)$.
In words, knowing $A_2$ does not change the probability of $A_1$.

If $X$ and $Y$ are independent, $E[XY] = E[X]E[Y]$.


In [None]:
# two independent standard normal columns
Y <- matrix(rnorm(200), ncol = 2)
plot(x = Y[, 1], y = Y[, 2])

In [None]:
# mix the independent columns with a non-diagonal matrix to induce correlation
Y <- matrix(rnorm(200), ncol = 2) %*% matrix( c(1, 0.5, 0.5, 1), 2 )
plot(x = Y[, 1], y = Y[, 2])


**Application**: (Chebyshev law of large numbers) Suppose
$X_{1},X_{2},\ldots,X_{n}$ are independent and have the same mean
$0$ and variance $\sigma^{2}<\infty$. Let
$Z_{n}=\frac{1}{n}\sum_{i=1}^{n} X_{i}$. Then the
probability $P\left(\left|Z_{n}\right|>\epsilon\right)\to0$ as
$n\to\infty$ for any $\epsilon>0$.
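
The result follows from the Chebyshev inequality: by independence $E\left[Z_{n}^{2}\right]=\sigma^{2}/n$, so $P\left(\left|Z_{n}\right|>\epsilon\right)\leq\sigma^{2}/\left(n\epsilon^{2}\right)\to0$. The simulation below is a sketch of this argument under assumed settings (normal draws, $\sigma=1$, $\epsilon=0.1$, and 2000 replications per sample size).

```r
set.seed(3)
eps <- 0.1
sigma <- 1
n_grid <- c(10, 100, 1000)
n_rep <- 2000                     # number of simulated samples per n

# For each n, simulate Z_n many times and record how often |Z_n| > eps
freq <- sapply(n_grid, function(n) {
  z_n <- replicate(n_rep, mean(rnorm(n, mean = 0, sd = sigma)))
  mean(abs(z_n) > eps)
})

# Chebyshev bound: sigma^2 / (n * eps^2), capped at 1
bound <- pmin(1, sigma^2 / (n_grid * eps^2))

print(data.frame(n = n_grid, frequency = freq, bound = bound))
```

The simulated frequency shrinks as `n` grows and stays below the bound, which becomes informative only once $n\epsilon^{2}>\sigma^{2}$.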

The culmination of probability theory is the *law of large numbers* and the
*central limit theorem*.

## Law of Iterated Expectations




In the bivariate case, if the conditional density exists, the conditional expectation can be computed as
$E\left[Y|X\right]=\int yf\left(y|X\right)dy$.
The law of iterated expectations implies $E\left[E\left[Y|X\right]\right]=E\left[Y\right]$.


In [None]:
dx <- d1.log %>% 
  tibble::add_column(category = d0$category) %>%
  dplyr::group_by(category) %>%
  dplyr::summarize(mean = mean(count), no = dplyr::n() )
print(dx)

print( sum( dx$mean * ( dx$no/nrow(d1.log) ) ) ) # average over categories
print( mean(d1.log[["count"]])) # overall average



Below are some properties of conditional expectations:

1.  $E\left[E\left[Y|X_{1},X_{2}\right]|X_{1}\right]=E\left[Y|X_{1}\right];$
2.  $E\left[E\left[Y|X_{1}\right]|X_{1},X_{2}\right]=E\left[Y|X_{1}\right];$
3.  $E\left[h\left(X\right)Y|X\right]=h\left(X\right)E\left[Y|X\right].$

**Application**: Regression is a technique that decomposes a random variable $Y$
into two parts, a conditional mean and a residual. Write
 $Y=E\left[Y|X\right]+\epsilon$, where
$\epsilon=Y-E\left[Y|X\right]$. Show that $E[\epsilon] = 0$ and  $E[\epsilon E[Y|X] ] = 0$.
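
Both claims follow from the law of iterated expectations: $E\left[\epsilon|X\right]=0$ gives $E\left[\epsilon\right]=0$, and $E\left[\epsilon E\left[Y|X\right]\right]=E\left[E\left[\epsilon|X\right]E\left[Y|X\right]\right]=0$. The sketch below checks the sample analogue with a simulated discrete regressor, where group means play the role of $E\left[Y|X\right]$; the data-generating process is an assumption made for illustration.

```r
set.seed(4)
n <- 5000
x <- sample(1:3, n, replace = TRUE)        # a discrete regressor (hypothetical)
y <- x^2 + rnorm(n)                        # any dependence of Y on X would do

cond_mean <- ave(y, x)                     # sample analogue of E[Y | X]: group means
eps <- y - cond_mean                       # residual

mean(eps)                # zero up to floating-point error
mean(eps * cond_mean)    # zero up to floating-point error
```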

In [None]:
# readr::write_csv(d1.log, file = "logYoutuber.csv")
save(d1.log, file = "logYoutuber.Rdata")