<font size="6"><b>BASIC PROBABILITY DISTRIBUTIONS: BEYOND NORMAL DISTRIBUTION</b></font>

<font size="5"><b>Serhat Çevikel</b></font>

In [None]:
library(data.table)
library(tidyverse)
library(plotly)
library(psych) # for pairwise scatter plots
library(estimators) # for theoretical moments
library(moments) # empirical moments
library(miscTools) # for ddnorm, derivative of normal distribution curve
library(MASS) # for multivariate normal distribution
#library(GGally) # for pairwise scatter plots
library(BBmisc) # for standardization
library(PearsonDS) # for flexible four parameter distributions

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 

In [None]:
options(htmlwidgets_embed=TRUE)

In [None]:
options(knitr.kable.max_rows = 10)

![xkcd](../imagesba/t_distribution.png)

(https://xkcd.com/1347/)

# Cauchy Distribution

Up to now we covered "well-behaved" discrete and continuous probability distributions, all of which had finite means and variances.

Now let's cover an "ill-behaved" or pathological distribution type, which we will revisit in Student's t distribution: the Cauchy distribution.

Suppose that we are standing 1 meter away from an infinitely long wall and we are continually shooting balls at the wall:


[![federer](https://img.youtube.com/vi/8BdHP6FWxKU/0.jpg)](https://www.youtube.com/watch?v=8BdHP6FWxKU)

Of course we won't be as accurate as Roger Federer. Suppose there is a semicircle in front of us, between us and the wall, and before every shot we select a point on the semicircle uniformly at random, so the angle is uniformly distributed:

In [None]:
xval <- seq(-1, 1, 0.01)
yvals <- sqrt(1 - xval^2)

plot(xval, yvals, type = "l", xlim = c(-1, 1), ylim = c(0, 1))
abl <- lapply(seq(0, 1, length.out = 9), function(x) abline(a = 0, tan(pi * (x - 1/2)), col = x * 10))

Each ray from the origin is the trajectory of a shot. The position at which the ball hits the wall, measured from the point on the wall directly in front of us, follows a Cauchy distribution. As you may see, as the angle approaches 0 or $\pi$, that position approaches $-\infty$ or $\infty$.

With $u$ uniformly distributed between 0 and 1, the position can be calculated with the tangent function:

${\displaystyle x=\tan \left(\pi (u-{\tfrac {1}{2}})\right)}$

Let's draw a sample following this definition:

In [None]:
set.seed(10)
hist(tan(pi*(runif(1e3) - 1/2)))
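
The tangent formula above is the quantile function (inverse CDF) of the standard Cauchy distribution, so plugging uniform draws into it is an instance of inverse transform sampling. A minimal sketch that checks the formula against base R's `qcauchy` (the names `u_grid`, `q_tan` and `q_ref` are illustrative and not part of the original notebook):

```r
# Evenly spaced probabilities strictly between 0 and 1
u_grid <- seq(0.05, 0.95, by = 0.05)

# Quantiles from the tangent formula x = tan(pi * (u - 1/2))
q_tan <- tan(pi * (u_grid - 1/2))

# Reference quantiles of the standard Cauchy distribution
q_ref <- qcauchy(u_grid)

# TRUE when the two agree up to floating point error
isTRUE(all.equal(q_tan, q_ref))
```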

Another definition of a Cauchy distribution is the ratio of two independent standard normal variables (mean 0 and variance 1):

In [None]:
set.seed(20)
hist(rnorm(1e3) / rnorm(1e3))
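
Histograms of Cauchy samples are dominated by a few extreme values, so a comparison of quantiles may be easier to read. The sketch below, assuming an arbitrary seed and a sample size of 1e5 (both chosen here only for illustration), compares the empirical quartiles of the ratio with the theoretical quartiles of the standard Cauchy distribution:

```r
set.seed(123)  # arbitrary seed, chosen only for reproducibility
n_draws <- 1e5

# Ratio of two independent standard normal draws
ratio_sample <- rnorm(n_draws) / rnorm(n_draws)

# Compare empirical quartiles with the standard Cauchy quartiles
probs <- c(0.25, 0.5, 0.75)
emp_q <- quantile(ratio_sample, probs, names = FALSE)
theo_q <- qcauchy(probs)
round(rbind(empirical = emp_q, theoretical = theo_q), 2)
```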

Or we can sample using the `rcauchy` function:

In [None]:
set.seed(40)
hist(rcauchy(1e3))

Now to see the main difference between Cauchy and normal distributions, let's first make 10 random simulations from normal distributions and draw their histograms sequentially in an animated plot:

In [None]:
set.seed(50)
rruns <- data.table(runs = 1:10)
rruns2 <- rruns[, .(samplx = {rnorm(1e4)}), by = runs]

In [None]:
rruns2 %>%
plot_ly(x = ~samplx) %>%
add_trace(frame = ~runs, type = "histogram") %>%
animation_opts(
    frame = 500, redraw = T, easing = "linear", mode = "next"
)

Across runs, the overall location and scale of the distribution do not change much.

Now let's draw 10 large samples from Cauchy distribution and animate the histograms:

In [None]:
set.seed(60)
cruns <- data.table(runs = 1:10)
cruns2 <- cruns[, .(samplx = {rcauchy(1e4)}), by = runs]

In [None]:
cplot <- cruns2 %>%
plot_ly(x = ~samplx) %>%
add_trace(frame = ~runs, type = "histogram", nbinsx = 500) %>%
animation_opts(
    frame = 1000, redraw = T, easing = "cubic", mode = "next"
)

We have to rescale the x axis across frames:

In [None]:
cplot2 <- plotly_build(cplot)
cranges <- lapply(split(cruns2$samplx, cruns2$runs), range)
for (i in 1:10) cplot2$x$frames[[i]]$layout <- list(xaxis = list(range = cranges[[i]]))

In [None]:
cplot2

You can see that across frames the location and scale of the distribution change vastly.

Now again let's conduct random simulations and tabulate the summaries:

In [None]:
set.seed(70)
cauchyruns <- lapply(1:10,
                     function(x) { samp <- rcauchy(1e5); c(summary(samp), Var = var(samp)) })

In [None]:
set.seed(80)
normruns <- lapply(1:10,
                     function(x) { samp <- rnorm(1e5); c(summary(samp), Var = var(samp)) })

Across runs, the mean, the five-number summary and the variance are all very similar for the normal distribution:

In [None]:
normruns %>% lapply(as.list) %>% rbindlist %>% round(2)

However, for the Cauchy distribution, the mean and variance change extensively across runs while the median is always at the center:

In [None]:
cauchyruns %>% lapply(as.list) %>% rbindlist %>% round(2)

In [None]:
stepx <- 0.05
pvals <- seq(stepx, 1 - stepx, stepx)

Now let's get quantiles for 5% cumulative probability steps of normal and Cauchy distributions and plot them:

In [None]:
plot(pvals, qnorm(pvals) %>% round(2), type = "l", col = "blue", ylim = range(qcauchy(seq(stepx, 1 - stepx, stepx))))
lines(pvals, qcauchy(seq(stepx, 1 - stepx, stepx)) %>% round(2), col = "red")

You can see that the dispersion of the quantiles is far wider for the Cauchy distribution.

## Law of Large Numbers with Cauchy Distribution

As we may recall, the mean of a sample converges to the mean of the population as the sample size grows:

In [None]:
set.seed(100)
plot(cummean(rnorm(1e5)), type = "l")

For a Cauchy distribution, the law of large numbers does not apply: the mean of the sample does not converge to a certain value as the sample size grows:

In [None]:
set.seed(0)
plot(cummean(rcauchy(1e5)), type = "l")
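
The law of large numbers needs a finite expected value, and the Cauchy distribution has none: the truncated integral $\int_{-A}^{A} |x| f(x) dx$ equals $\ln(1 + A^2) / \pi$ for the standard Cauchy density $f$, which grows without bound as $A$ increases. A small numerical sketch of this, where the helper `trunc_abs_mean` and the cut-off values are illustrative choices:

```r
# Truncated expectation of |X| for a standard Cauchy variable over [-A, A],
# using symmetry: 2 * integral of x * f(x) over [0, A]
trunc_abs_mean <- function(A) {
  2 * integrate(function(x) x * dcauchy(x), lower = 0, upper = A)$value
}

A_vals <- c(10, 100, 1000)
numeric_vals <- sapply(A_vals, trunc_abs_mean)
closed_form <- log(1 + A_vals^2) / pi

# Both columns keep growing with A instead of settling at a finite value
data.frame(A = A_vals, numeric = numeric_vals, closed_form = closed_form)
```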

## CLT with Cauchy Distribution

Now let's check whether the means of samples drawn from a Cauchy distribution converge to a normal distribution:

In [None]:
sampx <- rcauchy(1e5)

The population:

In [None]:
hist(sampx)

Means of 1e4 samples of size 100 each:

In [None]:
simx <- rowMeans(t(replicate(1e4, sample(sampx, 100))))

Still a Cauchy distribution:

In [None]:
hist(simx)

In [None]:
list(c(summary(sampx), Var = var(sampx)),
     c(summary(simx), Var = var(simx))) %>%
lapply(as.list) %>% rbindlist %>% round(2)

So finite variance is a prerequisite for the central limit theorem: the means of samples from a Cauchy distribution do not converge to a normal distribution but still follow a Cauchy distribution.
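
A known property of the Cauchy distribution is that the mean of $n$ independent standard Cauchy variables is again standard Cauchy, so averaging does not shrink the spread at all. The sketch below draws samples directly with `rcauchy` (rather than resampling from a fixed pool as in the cells above) and uses the interquartile range as a scale measure, since the variance is not usable here; the seed and sizes are arbitrary:

```r
set.seed(321)   # arbitrary seed for reproducibility
n_rep <- 5000   # number of samples
n_size <- 100   # size of each sample

# One sample per row, then the mean of each row
cauchy_means <- rowMeans(matrix(rcauchy(n_rep * n_size), nrow = n_rep))
normal_means <- rowMeans(matrix(rnorm(n_rep * n_size), nrow = n_rep))

# A single standard Cauchy draw has IQR 2, a single standard normal draw about 1.35
c(cauchy_means = IQR(cauchy_means), normal_means = IQR(normal_means))
```

The interquartile range of the Cauchy means should stay near 2, while that of the normal means should shrink by a factor of about $\sqrt{100} = 10$.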

We will see Cauchy distribution as a special case of Student's t distribution later on.

# Chi-squared Distribution

Until now, we talked about the distribution of sample means to discuss square root law and central limit theorem.

We know that, for i.i.d variables with finite variances, the sample means converge to normal distribution.

But what about the scales of those samples, as measured by the sum of squared deviations, the variance or the standard deviation?

Let's first create a large sample from the population following a standard normal distribution:

In [None]:
set.seed(1)
popnorm <- rnorm(1e5)

We know that the normal distribution is bell shaped:

In [None]:
hist(popnorm)

However, the squared values follow an almost - but not exactly - exponential shape:

In [None]:
hist(popnorm^2)

Set the total number of samples and the size of each sample:

In [None]:
nsamp <- 1e4

In [None]:
sizet <- 4

Now create smaller samples from the same population:

In [None]:
set.seed(10)
sampt <- t(replicate(nsamp, rnorm(sizet)))

Get the sum of squares for each sample:

In [None]:
sumsq <- rowSums(sampt^2)

And view the histogram:

In [None]:
hist(sumsq)

Now let's draw samples from the Chi-squared distribution with $k$ degrees of freedom (as the parameter is called), setting $k$ equal to the sample size:

In [None]:
set.seed(11)
hist(rchisq(1e4, sizet))

And let's see whether they follow a similar PDF:

In [None]:
densx <- density(sumsq)[c("x", "y")]
maxx <- max(sumsq)
setDT(densx)
xvals <- seq(0, maxx, 0.001)

In [None]:
plot(densx[x > 0], type = "l", col = "blue")
lines(xvals, dchisq(xvals, sizet), col = "red")

We see that the sums of squares of the samples of size 4 that we created and the Chi-squared distribution with 4 degrees of freedom have the same density.

The ${\displaystyle \chi ^{2}}$-distribution (or *Chi-squared* distribution) with ${\displaystyle k}$ degrees of freedom is the distribution of a sum of the squares of ${\displaystyle k}$ independent standard normal random variables.

The chi-squared distribution ${\displaystyle \chi _{k}^{2}}$ is a special case of the gamma distribution. Specifically, if ${\displaystyle X\sim \chi _{k}^{2}}$ then ${\displaystyle X\sim {\text{Gamma}}(\alpha ={\frac {k}{2}},\theta =2)}$ (where ${\displaystyle \alpha }$ is the shape parameter and ${\displaystyle \theta }$ the scale parameter of the gamma distribution).

Let's confirm that with simulation:

In [None]:
hist(rgamma(1e4, sizet/2, 1/2))

And densities:

In [None]:
plot(densx[x > 0], type = "l", col = "blue")
lines(xvals, dchisq(xvals, sizet), col = "red")
lines(xvals, dgamma(xvals, sizet/2, 1/2), col = "green")
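
The overlap of the red and green curves can also be checked numerically. A minimal sketch using only base R; note that a scale of 2 corresponds to a rate of 1/2, which is the form passed to `dgamma` in the cell above. The grid `x_grid` is an arbitrary choice:

```r
k <- 4                           # degrees of freedom (same value as sizet above)
x_grid <- seq(0.1, 20, by = 0.1) # arbitrary evaluation grid

dens_chisq <- dchisq(x_grid, df = k)
dens_gamma <- dgamma(x_grid, shape = k / 2, scale = 2)  # scale 2 is rate 1/2

# TRUE when the two densities agree up to floating point error
isTRUE(all.equal(dens_chisq, dens_gamma))
```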

The expected value of $\chi^2$-distribution is $k$ while variance is $2k$. Let's check...

Theoretical statistics:

In [None]:
distx <- Chisq(sizet)

In [None]:
mean(distx)

In [None]:
var(distx)

And the empirical statistics from our samples:

In [None]:
mean(sumsq)

In [None]:
var(sumsq)
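
The theoretical values $k$ and $2k$ used above follow from the moments of a standard normal variable. A short derivation, using only the definition of the $\chi^2$-distribution as a sum of $k$ independent squared standard normals $Z_i$:

${\displaystyle \operatorname {E} [Z_{i}^{2}]=1,\qquad \operatorname {E} [Z_{i}^{4}]=3,\qquad \operatorname {Var} (Z_{i}^{2})=\operatorname {E} [Z_{i}^{4}]-\left(\operatorname {E} [Z_{i}^{2}]\right)^{2}=3-1=2}$

Since the $Z_i$ are independent, expectations and variances add up over the $k$ terms:

${\displaystyle \operatorname {E} \left[\sum _{i=1}^{k}Z_{i}^{2}\right]=k,\qquad \operatorname {Var} \left(\sum _{i=1}^{k}Z_{i}^{2}\right)=2k}$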

The $\chi^2$-distribution is closely related to Student's t distribution, the topic of the next section, and also underlies the $\chi^2$ test for goodness of fit of observed data to hypothetical distributions.

# Student's t Distribution

Now let's create a large sample and smaller samples again.

Suppose we have a sample from a population whose mean and standard deviation we don't know, and we want to test whether that sample comes from a hypothesized population (a topic which we will see later).

Let's first create a large sample from the population following a standard normal distribution:

In [None]:
set.seed(1)
popnorm <- rnorm(1e5)

Again create smaller samples from the same population:

In [None]:
nsamp <- 1e4

In [None]:
sizet <- 4

In [None]:
set.seed(10)
sampt <- t(replicate(nsamp, rnorm(sizet)))

Let's calculate the means of the samples

In [None]:
sampm <- rowMeans(sampt)

Let's calculate the sd's of the samples:

In [None]:
sampsd <- apply(sampt, 1, sd)

Combine them into a table:

In [None]:
samp_dt <- data.table(sampm, sampsd)

And also add sum of squared deviations following the formula of sample variance:

${\displaystyle s^{2}={\frac {1}{n-1}}\sum _{i=1}^{n}\left(x_{i}-{\overline {x}}\right)^{2}}$

In [None]:
samp_dt[, sampss := sampsd^2 * (sizet - 1)]

Now let's look at the distribution of standard deviations:

In [None]:
hist(samp_dt$sampsd)

Or better check distribution of sum of squared deviations:

In [None]:
hist(samp_dt$sampss)

The sum of squared deviations follows a $\chi^2$ distribution with $n - 1$ degrees of freedom:

In [None]:
densx <- density(samp_dt$sampss)[c("x", "y")]
maxx <- max(samp_dt$sampss)
setDT(densx)
xvals <- seq(0, maxx, 0.001)

In [None]:
plot(densx[x >= 0], type = "l", col = "blue")
lines(xvals, dchisq(xvals, sizet - 1), col = "red")

Since the densities overlap, we confirm the theoretical distribution.
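
One way to see why the degrees of freedom drop from $n$ to $n - 1$: deviations measured from the sample's own mean must sum to zero, so only $n - 1$ of them are free. The sketch below contrasts squared deviations around the known population mean with squared deviations around each sample mean, relying on the fact that the mean of a $\chi^2$ variable equals its degrees of freedom. The seed and object names are illustrative:

```r
set.seed(2024)  # arbitrary seed for reproducibility
n_rep <- 1e4
n_size <- 4

draws <- matrix(rnorm(n_rep * n_size), nrow = n_rep)  # true mean 0, true sd 1

# Squared deviations around the known population mean (0): n degrees of freedom
ss_true_mean <- rowSums(draws^2)

# Squared deviations around each sample's own mean: n - 1 degrees of freedom
ss_sample_mean <- rowSums((draws - rowMeans(draws))^2)

# The averages should be close to n = 4 and n - 1 = 3 respectively
c(around_true_mean = mean(ss_true_mean), around_sample_mean = mean(ss_sample_mean))
```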

Now let's view the distribution of sample means:

In [None]:
hist(samp_dt$sampm)

In line with the square root law, the sd of the sample means is $\displaystyle \sigma_{\overline{X}} = \frac {\sigma}{\sqrt{n}}$:

In [None]:
1 / sqrt(sizet)

In [None]:
sd(samp_dt$sampm)

Note that, the sample means are:

${\displaystyle {\overline {X}}_{n}={\frac {1}{n}}(X_{1}+\cdots +X_{n})}$

And sample variances are:

${\displaystyle s^{2}={\frac {1}{n-1}}\sum _{i=1}^{n}\left(X_{i}-{\overline {X}}_{n}\right)^{2}}$

The standardized sum of squares:

${\displaystyle V=(n-1){\frac {s^{2}}{\sigma ^{2}}}}$

has a chi-squared distribution with ${\displaystyle \nu =n-1}$ degrees of freedom as we have shown above.

${\displaystyle Z=\left({\overline {X}}_{n}-\mu \right){\frac {\sqrt {n}}{\sigma }}}$

is normally distributed with mean 0 and variance 1, since the sample mean ${\displaystyle {\overline {X}}_{n}}$ is normally distributed with mean μ and variance $\displaystyle \frac {\sigma^2}{n}$.

Let's calculate the Z values using this formula (note that population standard deviation $\sigma$ is 1):

In [None]:
samp_dt[, sampz := sampm * sqrt(sizet)]

And see the histogram of Z values:

In [None]:
hist(samp_dt$sampz)

They have a mean of around 0 and variance of 1:

In [None]:
mean(samp_dt$sampz)

In [None]:
var(samp_dt$sampz)

Let's see the densities of Z values and compare with normal distribution:

In [None]:
densx <- density(samp_dt$sampz)[c("x", "y")]
maxx <- max(abs(range(samp_dt$sampz)))
setDT(densx)
xvals <- seq(-maxx, maxx, 0.001)

In [None]:
plot(densx, type = "l", col = "blue")
lines(xvals, dnorm(xvals, 0, 1), col = "red")

Z values are normally distributed, as suggested in above formulation.

Now suppose from the sample means and standard deviations we want to extract the standardized scores:

${\displaystyle T={\frac {Z}{\sqrt {V/\nu }}}=Z{\sqrt {\frac {\nu }{V}}}}$

where

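- Z is a standard normal with expected value 0 and variance 1;
- V has a chi-squared distribution ($\chi^2$-distribution) with ${\displaystyle \nu }$ degrees of freedom;
- Z and V are independent.

The correlation between the simulated Z values and sums of squares is close to zero, which is consistent with (though it does not by itself prove) independence: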

In [None]:
samp_dt[, .(sampz, sampss)] %>% cor %>% round(2)

The statistic can also be rearranged such that:

${\displaystyle T\equiv {\frac {Z}{\sqrt {V/\nu }}}=\left({\overline {X}}_{n}-\mu \right){\frac {\sqrt {n}}{s}}}$

Notice that the unknown population variance $\sigma^2$ does not appear in T, since it was in both the numerator and the denominator, so it canceled.
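
The cancellation can be checked numerically. The sketch below uses a hypothetical population with mean 2 and standard deviation 5 (values picked only for illustration) and computes T in two ways: from Z and V, which both need $\sigma$, and from the rearranged formula, which does not:

```r
set.seed(99)   # arbitrary seed for reproducibility
mu <- 2        # hypothetical population mean
sigma <- 5     # hypothetical population standard deviation
n_size <- 4
n_rep <- 1000

draws <- matrix(rnorm(n_rep * n_size, mean = mu, sd = sigma), nrow = n_rep)
xbar <- rowMeans(draws)
s <- apply(draws, 1, sd)

# Route 1: build T from Z and V, both of which use sigma
z <- (xbar - mu) * sqrt(n_size) / sigma
v <- (n_size - 1) * s^2 / sigma^2
t_from_zv <- z / sqrt(v / (n_size - 1))

# Route 2: the rearranged formula, which never touches sigma
t_direct <- (xbar - mu) * sqrt(n_size) / s

# TRUE when both routes give the same values
isTRUE(all.equal(t_from_zv, t_direct))
```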

This is the distribution of t-statistic to conduct Student's t-test of whether the mean of a population has a value specified in a null hypothesis, as we will see later in hypothesis testing.

Now let's calculate this t-statistic and understand its distribution:

In [None]:
samp_dt[, sampt := sampm / sampsd * sqrt(sizet)]

In [None]:
hist(samp_dt$sampt)

It looks closer to a Cauchy distribution than to a normal distribution. Check the kurtosis:

In [None]:
kurtosis(samp_dt$sampt) - 3

The distribution has a very high level of excess kurtosis, so it is highly leptokurtic.

Now let's overlay the density with that of a normal distribution with the same standard deviation and also with that of the Cauchy distribution:

In [None]:
densx <- density(samp_dt$sampt)[c("x", "y")]
maxx <- max(abs(range(samp_dt$sampt)))
setDT(densx)
xvals <- seq(-maxx, maxx, 0.001)

In [None]:
plot(densx, type = "l", col = "blue")
lines(xvals, dnorm(xvals, 0, sd(samp_dt$sampt)), col = "red") # normal with the same sd
lines(xvals, dcauchy(xvals), col = "green") # standard Cauchy

The distribution is more peaked than the normal distribution. The statistic that we derived conforms to Student's t distribution with degrees of freedom $\nu = n - 1$.

Student's t distribution (or simply the t distribution) ${\displaystyle t_{\nu }}$ generalizes the standard normal distribution. Like the latter, it is symmetric around zero and bell-shaped.

However, ${\displaystyle t_{\nu }}$ has heavier tails, and the amount of probability mass in the tails is controlled by the parameter ${\displaystyle \nu }$.

The mean is 0 for $\nu > 1$, otherwise undefined (we will see below why it is undefined).

Variance is ${\displaystyle {\frac {\nu }{\nu -2}}}$ for ${\displaystyle \nu >2}$, ${\displaystyle \infty }$ for ${\displaystyle 1<\nu \leq 2}$ otherwise undefined.
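
The variance formula can be checked by numerical integration of $x^2 f(x)$, since the mean is 0 for $\nu > 1$. A minimal sketch with base R's `integrate` and `dt`; the helper names and the chosen $\nu$ values are illustrative, and the values are kept well above 2 so that the integral converges comfortably:

```r
# Theoretical variance nu / (nu - 2), valid for nu > 2
t_var_formula <- function(nu) nu / (nu - 2)

# Variance by numerical integration of x^2 * f(x); the mean is 0 for nu > 1
t_var_numeric <- function(nu) {
  integrate(function(x) x^2 * dt(x, df = nu), lower = -Inf, upper = Inf)$value
}

nu_vals <- c(5, 10, 30)
var_formula <- t_var_formula(nu_vals)
var_numeric <- sapply(nu_vals, t_var_numeric)

round(rbind(nu = nu_vals, formula = var_formula, numeric = var_numeric), 4)
```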

Student's t distribution is named after the pen name of William Sealy Gosset. Gosset was a statistician working for the Guinness brewery in the early 1900s. In order to prevent researchers from revealing trade secrets of the company, the Board of Directors decided that scientists at Guinness could publish their work on the condition that beer, Guinness or their surnames were not mentioned. So Gosset chose the pen name *Student* to publish his papers.

(https://en.wikipedia.org/wiki/Student%27s_t-distribution)

(https://en.wikipedia.org/wiki/William_Sealy_Gosset)

Let's try for $\nu = 3$ and sample from the distribution:

In [None]:
sizet

In [None]:
set.seed(6)
tsamp <- rt(1e5, sizet - 1)

In [None]:
hist(tsamp)

The empirical variance of the sample is:

In [None]:
var(tsamp)

While the theoretical variance is:

In [None]:
var(Stud(sizet - 1))

And the comparison of densities of the t-statistic values we derived from the simulation and the theoretical distribution:

In [None]:
densx <- density(samp_dt$sampt)[c("x", "y")]
maxx <- max(abs(range(samp_dt$sampt)))
setDT(densx)
xvals <- seq(-maxx, maxx, 0.001)

In [None]:
plot(densx, type = "l", col = "blue")
lines(xvals, dt(xvals, sizet - 1), col = "red")

So the distribution of the t-statistics that we simulated follows the t distribution with $\nu = 3$.

Now let's compare the densities of Student's t distribution for different values of the $\nu$ parameter:

In [None]:
dfs <- c(1:10, 50, 100, 1e5)

In [None]:
st_dt <- data.table(df = dfs)

For 5% cumulative p-value points:

In [None]:
pvals <- seq(0.05, 1 - 0.05, 0.05)

Calculate the quantiles:

In [None]:
st_dt2 <- st_dt[, .(pval = pvals, qvals = qt(pvals, df)), by = df]

And the densities:

In [None]:
st_dt2[, dvals := dt(qvals, df)]

Plot the PDFs:

In [None]:
p_pdf <- st_dt2 %>%
mutate_at("df", factor) %>%
ggplot(aes(x = qvals, y = dvals, color = df)) +
geom_line()

In [None]:
p_pdf

In [None]:
ggplotly(p_pdf)

Tails are fatter for lower df values.

The theoretical excess kurtosis values start high and approach 0 as the degrees of freedom increase:

In [None]:
round(sapply(dfs[dfs > 5], Stud) %>% sapply(kurt), 2)

The CDF also confirms the tails:

In [None]:
p_cdf <- st_dt2 %>%
mutate_at("df", factor) %>%
ggplot(aes(x = qvals, y = pval, color = df)) +
geom_line()

In [None]:
p_cdf

In [None]:
ggplotly(p_cdf)

The CDFs of t distributions with different df values also confirm the change in shape: at lower df values, the same p-values correspond to more extreme quantile values at both tails, so the tails are fatter.

We can also confirm with another plot showing quantiles vs degrees of freedom for each p-value:

In [None]:
pst3 <- st_dt2 %>%
mutate_at("df", factor) %>%
mutate_at("pval", factor) %>%
ggplot(aes(x = qvals, y = df, color = pval, group = pval)) +
geom_line()

In [None]:
pst3

In [None]:
ggplotly(pst3)

For each selected p-value, lower df values correspond to more extreme quantile values.

Now let's take two extreme examples, the one with a very high df:

The CDFs of the normal distribution and the t distribution with 1e5 df overlap, drawn for p-values at 0.05 intervals between 0.05 and 0.95:

In [None]:
pvals

In [None]:
plot(qnorm(pvals), pvals, type = "l", col = "blue")
lines(qt(pvals, 1e5), pvals, col = "red")

So t distribution converges to normal distribution as df approaches $\infty$.

The CDFs of the Cauchy distribution and the t distribution with 1 df also overlap:

In [None]:
plot(qcauchy(pvals), pvals, type = "l", col = "blue")
lines(qt(pvals, 1), pvals, col = "red")

So the Cauchy distribution is a special case of the t distribution with $\nu = 1$, and that's why the moments are not defined for that df value.
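
Both limiting cases can also be checked numerically instead of visually. A minimal sketch using only base R density and quantile functions; the grids are arbitrary choices:

```r
p_grid <- seq(0.05, 0.95, by = 0.05)
x_grid <- seq(-10, 10, by = 0.5)

# nu = 1: the t density coincides with the standard Cauchy density
same_as_cauchy <- isTRUE(all.equal(dt(x_grid, df = 1), dcauchy(x_grid)))

# Large nu: t quantiles are very close to standard normal quantiles
max_gap_normal <- max(abs(qt(p_grid, df = 1e5) - qnorm(p_grid)))

same_as_cauchy
max_gap_normal
```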

# Revisiting Higher Moments: Interpreting the Shape of Distributions

Now that we covered the higher moments and distributions with support on the continuum (between $-\infty$ and $\infty$), we can revisit those higher moments in order to infer the details of the distribution from those moments.

We will create simple distributions to investigate the relationship between their higher moments and some statistical summaries.

In order to simplify the analysis, we will ensure that all distributions are Z-score standardized so that the mean is 0 and variance and standard deviation are 1.

The kurtosis of the normal distribution is 3 and its skewness is 0. When kurtosis is reported as excess kurtosis, 3 is subtracted from the value.

## Mesokurtosis

A mesokurtic distribution has the same kurtosis value as the normal distribution, 3. Now let's create a very simple mesokurtic distribution from three different values:

Four sixths of the values will be at the center, one sixth of the values will be to the left and one sixth to the right at the same distance. The values are standardized to have a mean of 0 and sd of 1:

In [None]:
vals <- normalize(c(rep(0, 1e4), rep(-1, 2.5e3), rep(1, 2.5e3)))

In [None]:
table(vals)

The off-mean values after standardization are simply $\pm\sqrt{3}$:

In [None]:
sqrt(3)

Now we will show why the kurtosis is 3. Note that since the standard deviation $\sigma$ is already 1, we don't need to standardize by $\sigma$. Note that kurtosis is the fourth standardized moment.

Since the mean is 0, the fourth power of each off-mean deviation is $\sqrt{3}^4$ or $3^2$. Considering one sixth of the values to the left, one sixth to the right and four sixths at the mean:

In [None]:
(1 * (-sqrt(3))^4 +
 1 * (sqrt(3))^4 +
 4 * 0) / 6

or in simplified terms:

In [None]:
3^2 / 3

Since the base `var` and `sd` functions calculate the sample variance and standard deviation, we will create functions that undo Bessel's correction to calculate the population values:

In [None]:
sdp <- function(x)
{
    len <- length(x)
    bess <- (len - 1) / len
    sd(x) * sqrt(bess)
}

In [None]:
varp <- function(x)
{
    len <- length(x)
    bess <- (len - 1) / len
    var(x) * bess
}

Now let's check that the distribution is Z-score standardized:

In [None]:
meanx <- mean(vals)
varx <- varp(vals)
sdx <- sdp(vals)

In [None]:
round(meanx, 3)
round(varx, 3)
round(sdx, 3)

And let's see the distribution's histogram:

In [None]:
hist(vals)
abline(v = mean(vals), col = "red")
abline(v = median(vals), col = "blue")
abline(v = c(-sdx, sdx), col = "green")

Mean is shown as red vertical line, median is shown as blue vertical line and +/- $\sigma$ are shown as green vertical lines.

Median and mean completely overlap.

The distribution is centered so skewness is zero and kurtosis value is 3 as expected:

In [None]:
skewness(vals)
kurtosis(vals)

If the shares of off-mean values are smaller, we may expect a leptokurtic distribution since $\sigma$ will be lower and the standardized off-mean values have to be further out when divided by a smaller $\sigma$. So there will be more values with a higher Z-score - outlying more.

If the shares of off-mean values are larger, we may expect a platykurtic distribution since $\sigma$ will be higher and the standardized off-mean values have to be closer to the mean when divided by a larger $\sigma$. So there will be more values with a lower Z-score - outlying less.
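
For this symmetric three-point setup the argument above can be made exact: with a share $p$ of the values at each of $\pm a$ and the rest at 0, the variance is $2pa^2$ and the fourth central moment is $2pa^4$, so the kurtosis is $2pa^4 / (2pa^2)^2 = 1/(2p)$. The sketch below compares this formula with a direct computation for the three side counts used in this notebook (1.5e3, 2.5e3 and 5e3 around 1e4 central values); `pop_kurtosis` and `three_point` are illustrative helpers:

```r
# Population kurtosis of a numeric vector: fourth standardized moment
pop_kurtosis <- function(x) {
  dev <- x - mean(x)
  mean(dev^4) / mean(dev^2)^2
}

# Build a three-point sample: n_side values at -1, n_side at +1, n_mid at 0
three_point <- function(n_mid, n_side) {
  c(rep(0, n_mid), rep(-1, n_side), rep(1, n_side))
}

n_mid <- 1e4
n_side <- c(1500, 2500, 5000)            # share on each side: 3/26, 1/6, 1/4
p_side <- n_side / (n_mid + 2 * n_side)

kurt_direct <- sapply(n_side, function(k) pop_kurtosis(three_point(n_mid, k)))
kurt_formula <- 1 / (2 * p_side)

rbind(p_side = p_side, direct = kurt_direct, formula = kurt_formula)
```

A share below 1/6 gives a kurtosis above 3 (leptokurtic) and a share above 1/6 gives a kurtosis below 3 (platykurtic).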

## Leptokurtosis

### Leptokurtic and centered

Now let's create a distribution with fewer off-mean values, symmetrically, compared to the mesokurtic case:

In [None]:
vals <- normalize(c(rep(0, 1e4), rep(-1, 1.5e3), rep(1, 1.5e3)))

Standardized off-mean values have higher Z-scores so they are lying further out compared to the mesokurtic case:

In [None]:
table(vals)

Now let's check that the distribution is Z-score standardized:

In [None]:
meanx <- mean(vals)
varx <- varp(vals)
sdx <- sdp(vals)

In [None]:
round(meanx, 3)
round(varx, 3)
round(sdx, 3)

And let's see the distribution's histogram:

In [None]:
hist(vals)
abline(v = mean(vals), col = "red")
abline(v = median(vals), col = "blue")
abline(v = c(-sdx, sdx), col = "green")

Median and mean completely overlap. However extreme values are further away from the +/- $\sigma$ values.

The distribution is centered so skewness is zero and kurtosis value is larger than 3 as expected:

In [None]:
skewness(vals)
kurtosis(vals)

Now let's start to imbalance the centered distribution.

### Leptokurtic and left skewed

Now while keeping the share of off-mean values the same, let's shift the values on the left side further away from the center:

In [None]:
vals <- normalize(c(rep(0, 1e4), rep(-2, 1.5e3), rep(1, 1.5e3)))

Standardized off-mean values on the left have higher Z-scores so they are lying further out compared to the centered case:

In [None]:
table(vals)

In [None]:
meanx <- mean(vals)
varx <- varp(vals)
sdx <- sdp(vals)

And let's see the distribution's histogram:

In [None]:
hist(vals)
abline(v = mean(vals), col = "red")
abline(v = median(vals), col = "blue")
abline(v = c(-sdx, sdx), col = "green")

Note that skewness is the third standardized moment - the average of the cubed standardized deviations. So more outlying values on one side carry more weight, along with their signs.

The distribution has a negative skewness value, so it is left skewed. The kurtosis value is still larger than 3:

In [None]:
skewness(vals)
kurtosis(vals)
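
The two values above can be reproduced from the definitions, as standardized third and fourth moments computed with the population variance. Because both measures are unaffected by shifting and rescaling, the sketch works on the raw values before standardization; `std_moment` is an illustrative helper and not a function from the packages loaded above:

```r
# Standardized central moment of a given order, using the population variance
std_moment <- function(x, order) {
  dev <- x - mean(x)
  mean(dev^order) / mean(dev^2)^(order / 2)
}

# Same raw values as the left-skewed leptokurtic example above
raw_vals <- c(rep(0, 1e4), rep(-2, 1.5e3), rep(1, 1.5e3))

skew_manual <- std_moment(raw_vals, 3)
kurt_manual <- std_moment(raw_vals, 4)

c(skewness = skew_manual, kurtosis = kurt_manual,
  mean = mean(raw_vals), median = median(raw_vals))
```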

The mean is now to the left of the median. Note that since the mean is the first moment and skewness is the third moment, outlying values carry more weight in the calculation of skewness than in the calculation of the mean. So the pattern where the mean is to the left of the median in left-skewed distributions is not guaranteed in all cases; however, it is the most common pattern.

We can think of the pattern as a weighing scale that is unbalanced to the left: the center of mass is to the left of the middle position:

![sl](../imagesba/left-skew.jpg)

### Leptokurtic and right skewed

Now while keeping the share of off-mean values the same, let's shift the values on the right side further away from the center:

In [None]:
vals <- normalize(c(rep(0, 1e4), rep(2, 1.5e3), rep(-1, 1.5e3)))

Standardized off-mean values on the right have higher Z-scores so they are lying further out compared to the centered case:

In [None]:
table(vals)

In [None]:
meanx <- mean(vals)
varx <- varp(vals)
sdx <- sdp(vals)

And let's see the distribution's histogram:

In [None]:
hist(vals)
abline(v = mean(vals), col = "red")
abline(v = median(vals), col = "blue")
abline(v = c(-sdx, sdx), col = "green")

Note that skewness is the third standardized moment - the average of the cubed standardized deviations. So more outlying values on one side carry more weight, along with their signs.

The distribution has a positive skewness value, so it is right skewed. The kurtosis value is still larger than 3:

In [None]:
skewness(vals)
kurtosis(vals)

The mean is now to the right of the median. Note that since the mean is the first moment and skewness is the third moment, outlying values carry more weight in the calculation of skewness than in the calculation of the mean. So the pattern where the mean is to the right of the median in right-skewed distributions is not guaranteed in all cases; however, it is the most common pattern.

We can think of the pattern as a weighing scale that is unbalanced to the right: the center of mass is to the right of the middle position:

![sl](../imagesba/right-skew.jpg)

## Platykurtosis

### Platykurtic and centered

Now let's create a distribution with more off-mean values, symmetrically, compared to the mesokurtic case:

In [None]:
vals <- normalize(c(rep(0, 1e4), rep(-1, 5e3), rep(1, 5e3)))

Standardized off-mean values have lower Z-scores so they are lying closer to the mean, compared to the mesokurtic case:

In [None]:
table(vals)

Now let's check that the distribution is Z-score standardized:

In [None]:
meanx <- mean(vals)
varx <- varp(vals)
sdx <- sdp(vals)

In [None]:
round(meanx, 3)
round(varx, 3)
round(sdx, 3)

And let's see the distribution's histogram:

In [None]:
hist(vals)
abline(v = mean(vals), col = "red")
abline(v = median(vals), col = "blue")
abline(v = c(-sdx, sdx), col = "green")

Median and mean completely overlap. However extreme values are closer to the +/- $\sigma$ values, compared to the leptokurtic and mesokurtic cases.

The distribution is centered so skewness is zero and kurtosis value is lower than 3 as expected:

In [None]:
skewness(vals)
kurtosis(vals)

Now let's start to imbalance the centered distribution.

### Platykurtic and left skewed

Now while keeping the share of off-mean values the same, let's shift the values on the left side further away from the center:

In [None]:
vals <- normalize(c(rep(0, 1e4), rep(-2, 5e3), rep(1, 5e3)))

Standardized off-mean values on the left have higher Z-scores so they are lying further out compared to the centered case:

In [None]:
table(vals)

In [None]:
meanx <- mean(vals)
varx <- varp(vals)
sdx <- sdp(vals)

And let's see the distribution's histogram:

In [None]:
hist(vals)
abline(v = mean(vals), col = "red")
abline(v = median(vals), col = "blue")
abline(v = c(-sdx, sdx), col = "green")

Note that skewness is the third standardized moment - the average of the cubed standardized deviations. So more outlying values on one side carry more weight, along with their signs.

The distribution has a negative skewness value, so it is left skewed. The kurtosis value is still smaller than 3:

In [None]:
skewness(vals)
kurtosis(vals)

The mean is now to the left of the median. Note that since the mean is the first moment and skewness is the third moment, outlying values carry more weight in the calculation of skewness than in the calculation of the mean. So the pattern where the mean is to the left of the median in left-skewed distributions is not guaranteed in all cases; however, it is the most common pattern.

We can think of the pattern as a weighing scale that is unbalanced to the left: the center of mass is to the left of the middle position:

![sl](../imagesba/left-skew.jpg)

### Platykurtic and right skewed

Now while keeping the share of off-mean values the same, let's shift the values on the right side further away from the center:

In [None]:
vals <- normalize(c(rep(0, 1e4), rep(2, 5e3), rep(-1, 5e3)))

Standardized off-mean values on the right have higher Z-scores so they are lying further out compared to the centered case:

In [None]:
table(vals)

In [None]:
meanx <- mean(vals)
varx <- varp(vals)
sdx <- sdp(vals)

And let's see the distribution's histogram:

In [None]:
hist(vals)
abline(v = mean(vals), col = "red")
abline(v = median(vals), col = "blue")
abline(v = c(-sdx, sdx), col = "green")

Note that skewness is the third standardized moment - the average of the cubed standardized deviations. So more outlying values on one side carry more weight, along with their signs.

The distribution has a positive skewness value, so it is right skewed. The kurtosis value is still smaller than 3:

In [None]:
skewness(vals)
kurtosis(vals)

The mean is now to the right of the median. Note that since the mean is the first moment and skewness is the third moment, outlying values carry more weight in the calculation of skewness than in the calculation of the mean. So the pattern where the mean is to the right of the median in right-skewed distributions is not guaranteed in all cases; however, it is the most common pattern.

We can think of the pattern as a weighing scale that is unbalanced to the right: the center of mass is to the right of the middle position:

![sl](../imagesba/right-skew.jpg)

## Comparison of Moments with Pearson Distribution

Now we will create samples from Pearson Distribution, which is flexible enough to have all four moments as parameter values.

For benchmarking we will also add samples from normal distribution, uniform distribution and t distribution with 2 degrees of freedom.

First let's set the parameters and names:

In [None]:
kurtvals <- c(2.8, 2.8, 4, 4)
skewvals <- c(0.5, -0.5, 1, -1)
qvals <- seq(-5, 5, 0.1)
distnames1 <- c("platy_right", "platy_left", "lepto_right", "lepto_left")
distnames2 <- c("normal", "t_2", "unif_5")
distnames <- c(distnames1, distnames2)

And then create the densities for quantile values between -5 and 5:

In [None]:
densities1 <- mapply(function(x, y) dpearson(qvals, moments = c(0, 1, y, x)), kurtvals, skewvals, SIMPLIFY = F)
names(densities1) <- distnames1
densities2 <- list(normal = dnorm(qvals), t_2 = dt(qvals, 2), unif_5 = dunif(qvals, -5, 5))
densities <- c(densities1, densities2)
densities_dt <- mapply(function(x, y) data.table(qval = qvals, dens = x, distname = y),
                       densities, distnames, SIMPLIFY = F) %>% rbindlist                     

Let's create an interactive plot to superimpose the densities:

In [None]:
densep <- densities_dt %>%
highlight_key(~distname) %>%
ggplot(aes(x = qval, y = dens, col = distname)) +
geom_line()

In [None]:
ggplotly(densep) %>%
highlight(
  on = "plotly_click",
  off = "plotly_relayout",
  opacityDim = .1
  )

For example, if we compare the left-skewed leptokurtic and platykurtic distributions, the leptokurtic distribution has more density in the middle and in the leftmost quantiles, while the platykurtic distribution has more density in the moderate regions in between.

Now let's create random samples from the same distributions:

In [None]:
samplesize <- 1e4

In [None]:
set.seed(10)
samples1 <- mapply(function(x, y) rpearson(samplesize, moments = c(0, 1, y, x)), kurtvals, skewvals, SIMPLIFY = F)
names(samples1) <- distnames1
samples2 <- list(normal = rnorm(samplesize), t_2 = rt(samplesize, 2), unif_5 = runif(samplesize, -5, 5))
samples2 <- lapply(samples2, normalize)
samples <- c(samples1, samples2)
samples_dt <- mapply(function(x, y) data.table(vals = x, distname = y),
                       samples, distnames, SIMPLIFY = F) %>% rbindlist

Let's get statistical and moment summaries of the distributions:

In [None]:
samples_dt %>%
group_by(distname) %>%
summarize(meanx = mean(vals), median = median(vals), sdx = sd(vals), skewx = skewness(vals), kurtx = kurtosis(vals)) %>%
mutate_if(is.numeric, round, 2)

In left skewed distributions, the mean is lower than (to the left of) the median and vice versa.

We can also confirm with the histograms:

In [None]:
histplot2 <- samples_dt[, (c("meanx", "medianx")) := .(mean(vals), median(vals)), by = distname][] %>%
ggplot(aes(x = vals)) +
geom_histogram() +
geom_vline(aes(xintercept = meanx), color = "red") +
geom_vline(aes(xintercept = medianx), color = "blue") +
xlim(c(-5, 5)) +
facet_wrap(. ~ distname, nrow = 7)
ggplotly(histplot2) %>% layout(autosize = F, width = 800, height = 800)

We see that in leptokurtic distributions with the same skewness direction, there are more extreme values to the skewed direction. This fact is confirmed by centile values of distributions:

In [None]:
samples_dt[, as.list(quantile(vals, seq(0, 1, 0.1)) %>% round(2)), by = distname] %>% t

For each skewed distribution, the extreme quantile values on the skewed side are more distant from the mean than the extreme quantile values on the other side. For example, the 0% (minimum) and 10% quantile values of the platykurtic left-skewed distribution are -3.72 and -1.42, while the 100% (maximum) and 90% quantile values are 1.97 and 1.24.

For the distributions skewed to the same side, leptokurtic distributions have more extreme quantile values than platykurtic ones. For example, the right-skewed platykurtic distribution's 100% (maximum) quantile value is 3.70 while that of the corresponding leptokurtic distribution is 4.89.

# APPENDIX: Other Important Continuous Probability Distributions (Optional)

## Weibull distribution (optional)

In the previous life expectancy simulation, we assumed that the hazard or mortality rate is fixed throughout the whole life of each member of the population.

The resulting life expectancies or failure times formed an exponential distribution with the hazard rate as the single parameter.

Now we will relax the assumption of a fixed rate such that the rate is allowed to decrease or increase over time, as governed by the initial rate and a shape parameter. The distribution of failure times with an accelerated or decelerated hazard rate follows a Weibull distribution. So first let's give some information on the Weibull distribution.

The PDF of Weibull distribution is:

${\displaystyle f(x;\lambda ,k)={\begin{cases}{\frac {k}{\lambda }}\left({\frac {x}{\lambda }}\right)^{k-1}e^{-(x/\lambda )^{k}},&x\geq 0,\\0,&x<0,\end{cases}}}$

where k > 0 is the shape parameter and λ > 0 is the scale parameter of the distribution.

If the quantity, x, is a "time-to-failure", the Weibull distribution gives a distribution for which the failure rate is proportional to a power of time. The shape parameter, k, is that power plus one, and so this parameter can be interpreted directly as follows:

- A value of ${\displaystyle k<1\,}$ indicates that the failure rate decreases over time. This happens if there is significant "infant mortality", or defective items failing early and the failure rate decreasing over time as the defective items are weeded out of the population.

- A value of ${\displaystyle k=1\,}$ indicates that the failure rate is constant over time. This might suggest random external events are causing mortality, or failure. The Weibull distribution reduces to an exponential distribution.

- A value of ${\displaystyle k>1\,}$ indicates that the failure rate increases with time. This happens if there is an "aging" process, or parts that are more likely to fail as time goes on. The function is first convex, then concave with an inflection point at ${\displaystyle (e^{1/k}-1)/e^{1/k},\,k>1\,}$.
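
The three cases above can be sketched numerically. The helper name `weibull_hazard` and the parameter values below are assumptions for illustration; the hazard is computed as the density divided by the survival probability and compared with the closed form $\frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}$:

```r
# Hazard rate = density / survival probability (hypothetical helper)
weibull_hazard <- function(x, k, lambda) {
    dweibull(x, shape = k, scale = lambda) /
        pweibull(x, shape = k, scale = lambda, lower.tail = FALSE)
}

xh <- c(0.5, 1, 2, 4)
lambda_h <- 2

hz_dec <- weibull_hazard(xh, 0.5, lambda_h) # k < 1: decreasing over time
hz_con <- weibull_hazard(xh, 1, lambda_h)   # k = 1: constant at 1 / lambda
hz_inc <- weibull_hazard(xh, 2, lambda_h)   # k > 1: increasing over time

# Closed form of the hazard for k = 2
hz_closed <- (2 / lambda_h) * (xh / lambda_h)^(2 - 1)

rbind(hz_dec, hz_con, hz_inc, hz_closed)
```

The last two rows should agree, which suggests that the power of time in the failure rate is indeed $k - 1$.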

The expected value is:

${\displaystyle \operatorname {E} (X)=\lambda \Gamma \left(1+{\frac {1}{k}}\right)\,}$

where $\Gamma$ is the gamma function, the mathematical details of which are outside the scope of this course; however, you can access it through the `gamma` function in base R.

Variance is:

${\displaystyle \operatorname {var} (X)=\lambda ^{2}\left[\Gamma \left(1+{\frac {2}{k}}\right)-\left(\Gamma \left(1+{\frac {1}{k}}\right)\right)^{2}\right]\,}$

(https://en.wikipedia.org/wiki/Weibull_distribution)
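
A minimal sketch of the two formulas above using the base R `gamma` function. The helper names `weibull_mean` and `weibull_var` and the parameter values are assumptions for illustration; the mean is also checked against a numerical integration of $x f(x)$:

```r
# Hypothetical helpers implementing the mean and variance formulas
weibull_mean <- function(k, lambda) lambda * gamma(1 + 1 / k)
weibull_var <- function(k, lambda) {
    lambda^2 * (gamma(1 + 2 / k) - gamma(1 + 1 / k)^2)
}

k_ex <- 1.7
lambda_ex <- 10

# Numerical integration of x * f(x) as an independent check of the mean
mean_num <- integrate(function(x) x * dweibull(x, k_ex, lambda_ex), 0, Inf)$value

c(formula_mean = weibull_mean(k_ex, lambda_ex),
  numeric_mean = mean_num,
  formula_var = weibull_var(k_ex, lambda_ex))
```

For $k = 1$ the helpers return $\lambda$ and $\lambda^2$, the mean and variance of the exponential distribution with rate $1/\lambda$.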

The effect of $k$ parameters on the shape of the distribution is shown below:

Different $k$ parameters:

In [None]:
kvals <- c(0.5, 0.7, 1, 1.5, 2, 5, 10)

Scale, $\lambda$ parameter:

In [None]:
lmb <- 1

Different values for $x$, time to failure:

In [None]:
xvals <- seq(0, 5, 0.01)

Create a Cartesian product of the parameter values:

In [None]:
densewb_dt <- crossing(lmb, kvals, xvals)
setDT(densewb_dt)

Calculate the densities:

In [None]:
densewb_dt[, dwb := dweibull(xvals, kvals, lmb)]

Plot them:

In [None]:
pwb <- densewb_dt %>%
mutate_at("kvals", factor) %>%
mutate(expo = ((kvals == 1) * 1.5) + 0.5) %>%
ggplot(aes(x = xvals, y = dwb, color = kvals)) +
geom_line(aes(size = expo)) +
scale_size_identity()

In [None]:
ggplotly(pwb)

The case of $k=1$, drawn in bolder line, is the special case of exponential distribution.

Distributions with $k < 1$ have a decelerating hazard rate. So the early higher mortality rates cause the distribution to concentrate in the leftmost part, while the later lower rates cause a tail extended to the right.

Distributions with $k > 1$ have an accelerating hazard rate. So the early lower mortality rates cause the leftmost part to be thinner, while the accelerating mortality causes a concentration in the middle, and the exhaustion of the population causes a thinner tail that stretches to the right.

### Accelerated hazard rate

Now, let's simulate a population again, similar to the one we did in the fixed rate with exponential distribution case:

Number of agents:

In [None]:
nagentw <- 1e3

Initial life table:

In [None]:
life_dtw <- data.table(dead = rep(0, nagentw), life = rep(0, nagentw))

And set the hazard rate - proportion of still alive population to die in the period - to 10%

In [None]:
ratew <- 0.1

The shape parameter is $k>1$ so we have accelerated hazard rate:

In [None]:
kx <- 1.7

Initialize the period:

In [None]:
periodxw <- 0

Randomly sample from the live population using Bernoulli trials, starting from a base rate of 0.1 that is adjusted in each period, and record the age (period) of each individual when they die.

Continue until no one stays alive.

Note that the hazard rate for a time period $x$ should be adjusted with the $\lambda$ and $k$ parameters using the formula below, which is the hazard function of the Weibull distribution, i.e. the factor that multiplies the exponential term in its PDF:

$\displaystyle {\frac {k}{\lambda }\left({\frac {x}{\lambda }}\right)^{k-1}}$

$\lambda$ is the inverse of hazard rate:

In [None]:
set.seed(5)
while(life_dtw[dead == 0, .N] > 0)
{
    periodxw <- periodxw + 1
    ratew2 <- kx * ratew * (periodxw * ratew)^(kx-1)
    life_dtw[dead == 0, dead := rbinom(.N, 1, ratew2)]
    life_dtw[dead == 1 & life == 0, life := periodxw]
}

Check the individuals who die at the oldest age:

In [None]:
life_dtw[, sort(life, decreasing = T)][1:10]

The last period is:

In [None]:
periodxw

Histogram of simulated life expectancies:

In [None]:
hist(life_dtw$life)

The mean life expectancy:

In [None]:
life_dtw[, mean(life)]

And variance:

In [None]:
life_dtw[, var(life)]

The mean and variance from the theoretical distribution are (note that the hazard rate is inverted to obtain the scale parameter):

In [None]:
mean(Weib(kx, 1/ratew))

In [None]:
var(Weib(kx, 1/ratew))

Now simulate from Weibull distribution with the same parameters:

In [None]:
set.seed(5)
lewei <- rweibull(nagentw, kx, 1/ratew)
hist(lewei)

The longest life values: 

In [None]:
round(sort(lewei, decreasing = T)[1:10])

Extract the density of accelerated hazard rate simulation:

In [None]:
life_densw <- density(life_dtw[, life], bw = 1)[c("x", "y")]
setDT(life_densw)
maxxw <- life_densw[, max(x)]

And compare with theoretical density of Weibull distribution and exponential distribution:

In [None]:
plot(life_densw[x > 0], type = "l", col = "blue", ylim = c(0, max(dexp(0:maxxw, ratew))))
lines(0:maxxw, dweibull(0:maxxw, kx, 1/ratew), col = "red")
lines(0:maxxw, dexp(0:maxxw, ratew), col = "black")

We see that the density of the simulated values and the theoretical Weibull density overlap to a large extent. Compared with the exponential distribution, the density at shorter life spans is lower for Weibull, since the hazard rate in early periods is relatively lower, while the density concentrates in the middle, where most of the population dies due to the accelerated hazard rate.

### Decelerated hazard rate

Now let's simulate a case with decelerated hazard rate:

In [None]:
nagentw2 <- 1e3

In [None]:
life_dtw2 <- data.table(dead = rep(0, nagentw2), life = rep(0, nagentw2))

Set the hazard rate - proportion of still alive population to die in the period - to 10%

In [None]:
ratew2 <- 0.1

And the shape parameter is $k<1$:

In [None]:
kx2 <- 0.6

Initialize the period:

In [None]:
periodxw2 <- 0

Randomly sample from the live population using Bernoulli trials, starting from a base rate of 0.1 that is adjusted in each period, and record the age (period) of each individual when they die.

Continue until no one stays alive:

In [None]:
set.seed(5)
while(life_dtw2[dead == 0, .N] > 0)
{
    periodxw2 <- periodxw2 + 1
    ratew22 <- kx2 * ratew2 * (periodxw2 * ratew2)^(kx2 - 1)
    life_dtw2[dead == 0, dead := rbinom(.N, 1, ratew22)]
    life_dtw2[dead == 1 & life == 0, life := periodxw2]
}

Check the individuals who die at the oldest age:

In [None]:
life_dtw2[, sort(life, decreasing = T)][1:10]

In [None]:
periodxw2

Histogram of simulated life expectancies:

In [None]:
hist(life_dtw2$life)

The mean life expectancy:

In [None]:
life_dtw2[, mean(life)]

And variance:

In [None]:
life_dtw2[, var(life)]

The mean and variance from the theoretical distribution are (note that the hazard rate is inverted to obtain the scale parameter):

In [None]:
mean(Weib(kx2, 1/ratew2))

In [None]:
var(Weib(kx2, 1/ratew2))

Simulate from the Weibull distribution with the same parameters ($\lambda$ is the inverted hazard rate):

In [None]:
set.seed(5)
lewei2 <- rweibull(nagentw2, kx2, 1/ratew2)
hist(lewei2)

Largest life expectancy values:

In [None]:
round(sort(lewei2, decreasing = T)[1:10])

Density of values from iterative simulation:

In [None]:
life_densw2 <- density(life_dtw2[, life], bw = 1)[c("x", "y")]
setDT(life_densw2)
maxxw2 <- min(life_densw2[, max(x)], 50)

And compare with the theoretical densities of the Weibull and exponential distributions:

In [None]:
plot(life_densw2[x > 1 & x <= maxxw2], type = "l", col = "blue", xlim = c(0, 50), ylim = c(0, max(dweibull(1:maxxw2, kx2, 1/ratew2))))
lines(1:maxxw2, dweibull(1:maxxw2, kx2, 1/ratew2), col = "red")
lines(1:maxxw2, dexp(1:maxxw2, ratew2), col = "black")

Due to the decelerated hazard rate (starting out with higher rates), early and late deaths have more density, while the density in the middle is lower compared to the exponential distribution with the same $\lambda$ parameter.

## Gamma distribution (optional)

As you may recall from discrete distributions, the negative binomial distribution is used for modelling the number of failures until a certain number of successes is encountered.

The gamma distribution, while it has many more applications due to its flexible parametrization, can be considered in its simplest case as the continuous analogue of the negative binomial distribution.



Consider a sequence of events, with the waiting time for each event being an exponential distribution with rate $\lambda$. Then the waiting time for the $n$th event to occur is the gamma distribution with integer shape ${\displaystyle \alpha =n}$.

This construction of the gamma distribution allows it to model a wide variety of phenomena where several sub-events, each taking time with exponential distribution, must happen in sequence for a major event to occur.

Examples include the waiting time of cell-division events, number of compensatory mutations for a given mutation, waiting time until a repair is necessary for a hydraulic system and so on.

If $\alpha$ is an integer, the gamma distribution (the special case called an Erlang distribution) is the probability distribution of the waiting time until the $\alpha$-th "arrival" in a one-dimensional Poisson process with intensity $1/\theta$. If

${\displaystyle X\sim \Gamma (\alpha \in \mathbf {Z} ,\theta ),\qquad Y\sim \operatorname {Pois} \left({\frac {x}{\theta }}\right),}$

then

${\displaystyle P(X>x)=P(Y<\alpha ).}$
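
The identity above can be checked with base R distribution functions. This is a minimal sketch; the parameter values are arbitrary assumptions chosen for illustration:

```r
alpha_ex <- 3      # integer shape
theta_ex <- 0.2    # scale, so the Poisson intensity is 1 / theta_ex = 5
x_ex <- c(0.1, 0.5, 1, 2)

# P(X > x) for X ~ Gamma(shape = alpha, scale = theta)
p_gamma <- pgamma(x_ex, shape = alpha_ex, scale = theta_ex, lower.tail = FALSE)

# P(Y < alpha) = P(Y <= alpha - 1) for Y ~ Pois(x / theta)
p_pois <- ppois(alpha_ex - 1, lambda = x_ex / theta_ex)

cbind(x_ex, p_gamma, p_pois)
```

Waiting longer than $x$ for the $\alpha$-th arrival is the same event as observing fewer than $\alpha$ arrivals up to time $x$, so the two probability columns should coincide.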

Let ${\displaystyle X_{1},X_{2},\ldots ,X_{n}}$ be ${\displaystyle n}$ independent and identically distributed random variables following an exponential distribution with rate parameter $\lambda$, then

${\displaystyle \sum _{i}X_{i}\sim \operatorname {Gamma} (n,\lambda )}$

where $n$ is the shape parameter and $\lambda$ is the rate, and

${\textstyle {\bar {X}}={\frac {1}{n}}\sum _{i}X_{i}\sim \operatorname {Gamma} (n,n\lambda )}$

If X ~ Gamma(1, λ) (in the shape–rate parametrization), then X has an exponential distribution with rate parameter λ. In the shape-scale parametrization, X ~ Gamma(1, θ) has an exponential distribution with rate parameter 1/θ.
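
A short sketch of this special case in both parametrizations, using the `rate` and `scale` arguments of `dgamma` (the value of $\lambda$ is an arbitrary assumption):

```r
lambda_e <- 5
theta_e <- 1 / lambda_e
x_e <- seq(0, 2, 0.25)

d_exp <- dexp(x_e, rate = lambda_e)

# Shape-rate parametrization: Gamma(1, lambda)
d_gamma_rate <- dgamma(x_e, shape = 1, rate = lambda_e)

# Shape-scale parametrization: Gamma(1, theta) with theta = 1 / lambda
d_gamma_scale <- dgamma(x_e, shape = 1, scale = theta_e)

cbind(x_e, d_exp, d_gamma_rate, d_gamma_scale)
```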

The above-mentioned specifications are just some special cases of the gamma distribution, which has a wider area of applications.

The mean of gamma distribution is given by the product of its shape and scale parameters:

${\displaystyle \mu =\alpha \theta =\alpha /\lambda }$

The variance is:

${\displaystyle \sigma ^{2}=\alpha \theta ^{2}=\alpha /\lambda ^{2}}$

(https://en.wikipedia.org/wiki/Gamma_distribution)
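
As a rough check of the mean and variance formulas, we can integrate against the gamma density numerically. This is a sketch under assumed parameter values (shape 3 and rate 5, the same values used in the simulations of this section):

```r
alpha_m <- 3
lambda_m <- 5

# Closed-form moments in the shape-rate parametrization
gamma_mean <- alpha_m / lambda_m
gamma_var <- alpha_m / lambda_m^2

# Numerical integration of x * f(x) and (x - mean)^2 * f(x)
mean_int <- integrate(function(x) x * dgamma(x, alpha_m, rate = lambda_m),
                      0, Inf)$value
var_int <- integrate(function(x) (x - gamma_mean)^2 * dgamma(x, alpha_m, rate = lambda_m),
                     0, Inf)$value

c(gamma_mean = gamma_mean, mean_int = mean_int,
  gamma_var = gamma_var, var_int = var_int)
```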

### Gamma distribution as time to n-th event in Bernoulli trials

We have seen that the geometric distribution of the number of trials to success in discrete Bernoulli trials converges to an exponential distribution.

Now, using the same discrete Bernoulli trials simulation, we will see that the negative binomial distribution of the number of trials to the nth success converges to a gamma distribution.

We will simulate 1e7 Bernoulli trials with a success probability of 0.5%. To show convergence to Poisson, we will treat each block of 1e3 trials as a unit interval, so the expected rate of success is 5.

So these 1e7 trials are supposed to converge to 1e4 Poisson samples.

1e7 Bernoulli trials with a success probability of 0.5%:

In [None]:
ratenb <- 5

In [None]:
set.seed(2)
bernpois <- rbinom(1e7, 1, ratenb/1000)

Check total number of successes:

In [None]:
sum(bernpois)

Now, in order to calculate the failure lengths or arrival times, we will compress the data in the run lengths:

In [None]:
length(bernpois)

In [None]:
bernpoisrl <- rle(bernpois)
setDT(bernpoisrl)

The run length encoding shows the length of contiguous runs of each value:

In [None]:
bernpoisrl %>% head
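
To see what `rle` returns, here is a toy example on a short, made-up sequence of trials (the vector `toy_trials` is an assumption for illustration only):

```r
# Seven trials: two failures, a success, a failure, two successes, a failure
toy_trials <- c(0, 0, 1, 0, 1, 1, 0)
toy_rl <- rle(toy_trials)

toy_rl$lengths  # 2 1 1 2 1
toy_rl$values   # 0 1 0 1 0

# inverse.rle() recovers the original sequence
inverse.rle(toy_rl)
```

The run of two consecutive 1 values in the middle corresponds to a success that immediately follows another success, that is, a waiting time with no failures in between.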

We have ~1e5 run lengths out of 1e7 Bernoulli trials:

In [None]:
bernpoisrl[, .N]

We add an index for rows:

In [None]:
bernpoisrl[, ind := .I]

There are 274 runs of value 1 with a length larger than 1, which means there are 0 length waiting times between those consecutive 1 values:

In [None]:
bernpoisrl[values == 1 & lengths > 1, .N]

We are going to manually add those 0 length waiting times between consecutive 1 values by replicating the rows. So don't bother with what the following code does in detail:

In [None]:
bernpoisrl[, replx := values == 1 & lengths > 1]
bernpoisrl[, times := ifelse(replx, lengths, 1)]
bernpoisrl[, length2 := ifelse(replx, 1, lengths)]

In [None]:
bernpoisrl2 <- bernpoisrl[, .(lengths = rep(as.integer(length2), times),
                              values = rep(values, times)), by = ind]
setorder(bernpoisrl2, -ind)
bernpoisrl2[, successn := cumsum(values)]
setorder(bernpoisrl2, ind)
bernpoisrl2[, successn := max(successn) - successn + 1]
bernpoisrl2[.N, successn := ifelse(values == 0, NA, successn)]

In [None]:
bernpoisrl3 <- bernpoisrl2[!is.na(successn), .(lengths = sum(lengths)), by = successn]

Let's delete some interim objects that we don't need anymore and that take up a large amount of memory:

In [None]:
#sort(sapply(ls(), function(x) object.size(get(x))), decreasing = T)
rm(list = c("bernpois", "bernpoisrl", "bernpoisrl2"))
gc()

At the end we have a simpler table that shows each success and waiting time to that success (including the success itself):

In [None]:
bernpoisrl3

These rows are for length 1 waiting times, that is, for successes that occur just after a previous success without a failure in between:

In [None]:
bernpoisrl3[lengths == 1]

Now let's calculate the waiting times to each 3rd success:

In [None]:
nth <- 3

In order to trim incomplete waiting times at the end, we get the last row that a third success occurs:

In [None]:
lastrow <- bernpoisrl3[, (.N %/% nth) * nth]
lastrow

We calculate the waiting times to each 3rd success. Again, don't mind the code complexity; we just divide the table into 3-row chunks and sum the consecutive lengths:

In [None]:
nthlength <- bernpoisrl3[1:lastrow, .(lengths = sum(lengths)), by = (successn - 1) %/% nth + 1] %>% pull(lengths)
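
The chunking logic can be illustrated on a toy vector in base R. The names `waits`, `nth_toy` and `chunk_id` and the values are assumptions for illustration, not part of the simulation:

```r
# Waiting times to seven consecutive successes (made-up values)
waits <- c(2, 5, 1, 4, 3, 6, 7)
nth_toy <- 3

# Last row that completes a chunk of nth_toy successes; the 7th value is trimmed
last_complete <- (length(waits) %/% nth_toy) * nth_toy

# Rows 1-3 go to chunk 1, rows 4-6 to chunk 2
chunk_id <- (seq_len(last_complete) - 1) %/% nth_toy + 1

# Waiting time to every 3rd success: sum of lengths within each chunk
chunk_sums <- as.vector(tapply(waits[seq_len(last_complete)], chunk_id, sum))
chunk_sums  # 8 13
```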

In [None]:
summary(nthlength)

The histogram of waiting times:

In [None]:
hist(nthlength)

Let's treat the distribution as a discrete one and create a similar sample from negative binomial distribution. The rate per discrete interval is 0.5%:

In [None]:
ratenb/1000

In [None]:
set.seed(5)
sampnb <- rnbinom(length(nthlength), nth, ratenb/1000)

In [None]:
hist(sampnb)

And compare their densities:

In [None]:
simnbd <- density(nthlength)[c("x", "y")]

In [None]:
maxnb <- max(nthlength)

In [None]:
setDT(simnbd)

In [None]:
plot(simnbd[x > 0], type = "l", col = "blue")
lines(1:maxnb, dnbinom(1:maxnb, nth, ratenb/1000), col = "red")

They perfectly overlap!

Now let's treat our simulated values as continuous, such that we compress each 1000 discrete trials into an interval of 1 time unit of continuous waiting time:

In [None]:
nthlengthc <- nthlength / 1000

See the distribution:

In [None]:
summary(nthlengthc)

And histogram:

In [None]:
hist(nthlengthc)

Now we sample from the gamma distribution. The rate per continuous time interval is 5:

In [None]:
ratenb

In [None]:
set.seed(5)
sampgm <- rgamma(length(nthlengthc), nth, ratenb)

In [None]:
hist(sampgm)

And compare the densities:

In [None]:
simgmd <- density(nthlengthc)[c("x", "y")]

In [None]:
maxgm <- max(nthlengthc)

In [None]:
setDT(simgmd)

In [None]:
gmvals <- seq(0, maxgm, 0.001)

In [None]:
plot(simgmd[x > 0], type = "l", col = "blue")
lines(gmvals, dgamma(gmvals, nth, ratenb), col = "red")

They perfectly overlap!

So we show that gamma distribution is the continuous analog of negative binomial distribution for waiting times to nth successes.
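
The same correspondence can be sketched without simulation by comparing the two cumulative distribution functions directly. Note that the negative binomial functions in R count failures before the target number of successes rather than trials, so the successes are subtracted from the trial count below. The variable names are assumptions for illustration:

```r
p_disc <- 0.005   # success probability per discrete trial
n_succ <- 3       # wait for the 3rd success
trials <- seq(100, 3000, by = 100)

# P(3rd success occurs within `trials` trials);
# failures = trials - successes
cdf_nb <- pnbinom(trials - n_succ, size = n_succ, prob = p_disc)

# Same probability on the continuous scale (1000 trials = 1 time unit)
cdf_gm <- pgamma(trials / 1000, shape = n_succ, rate = p_disc * 1000)

# Largest absolute gap between the discrete and continuous CDFs
max(abs(cdf_nb - cdf_gm))
```

The gap should be small and is expected to shrink further as the success probability per trial gets smaller while the rate per unit time is held fixed.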

### Gamma distribution as sum of exponentially distributed variables

Now we will conduct a second simulation to show that the gamma distribution arises as the sum of exponentially distributed variables.

The size of the sample:

In [None]:
sizeexp <- 1e4

The rate per unit time:

In [None]:
rateexp <- 5

We will create 3 samples:

In [None]:
nexp <- 3

Create 3 samples from i.i.d exponential distribution and sum each row:

In [None]:
set.seed(6)
sampexpgm <- rowSums(replicate(nexp, rexp(sizeexp, rateexp)))

See the histogram of row sums:

In [None]:
hist(sampexpgm)

They are gamma distributed. Let's confirm with directly creating a sample from gamma distribution with the same parameters:

In [None]:
set.seed(7)
sampgm2 <- rgamma(sizeexp, nexp, rateexp)

In [None]:
hist(sampgm2)

Let's extract the density of the first sample created from row sums of exponentially distributed variables:

In [None]:
sampexpgmd <- density(sampexpgm)[c("x", "y")]

In [None]:
maxexpgm <- max(sampexpgm)

In [None]:
setDT(sampexpgmd)

In [None]:
gmvals2 <- seq(0, maxexpgm, 0.001)

And compare with the density of gamma distribution:

In [None]:
plot(sampexpgmd[x > 0], type = "l", col = "blue")
lines(gmvals2, dgamma(gmvals2, nexp, rateexp), col = "red")

They overlap perfectly! So we show that gamma distribution is the sum of i.i.d exponentially distributed variables.

# Object Generating Code

In [None]:
student_id <- 2025000000
library(tidyverse)
library(data.table)
library(BBmisc)
library(PearsonDS)
library(moments)
library(plotly)
kurtranges <- list(c(2, 2.5), c(4, 5))
sizex <- 1e3
skewsign <- c(-1, 1)
normrange <- function(x, rang) rang[1] + diff(rang) * x
set.seed(student_id)
kurtlp <- sample(2)
skewnp <- sample(2)
kurtsamp1 <- runif(2)
kurtvals <- mapply(normrange, kurtsamp1, kurtranges[kurtlp])
skewrang <- mapply("*", lapply(sqrt(kurtvals - 1), c, 0.5), skewsign[skewnp], SIMPLIFY = F) %>% lapply(sort)
skewvals <- sapply(skewrang, function(x) runif(1, x[1], x[2]))
distnames <- c("sample 1", "sample 2", "normal")
samplesize <- 1e4
set.seed(10)
samples1 <- mapply(function(x, y) rpearson(samplesize, moments = c(0, 1, y, x)), kurtvals, skewvals, SIMPLIFY = F)
samples2 <- list(rnorm(samplesize))
samples2 <- lapply(samples2, normalize)
samples <- c(samples1, samples2)
samples_dt <- mapply(function(x, y) data.table(vals = x, distname = y),
                       samples, distnames, SIMPLIFY = F) %>% rbindlist

Statistical summaries:

In [None]:
samples_dt %>%
group_by(distname) %>%
summarize(meanx = mean(vals), median = median(vals), sdx = sd(vals), skewx = skewness(vals), kurtx = kurtosis(vals)) %>%
mutate_if(is.numeric, round, 2)

Density plots:

In [None]:
qvals <- seq(-5, 5, 0.1)
densities1 <- mapply(function(x, y) dpearson(qvals, moments = c(0, 1, y, x)), kurtvals, skewvals, SIMPLIFY = F)
densities2 <- list(normal = dnorm(qvals))
densities <- c(densities1, densities2)
densities_dt <- mapply(function(x, y) data.table(qval = qvals, dens = x, distname = y),
                       densities, distnames, SIMPLIFY = F) %>% rbindlist                     
densep <- densities_dt %>%
highlight_key(~distname) %>%
ggplot(aes(x = qval, y = dens, col = distname)) +
geom_line()
ggplotly(densep) %>%
highlight(
  on = "plotly_click",
  off = "plotly_relayout",
  opacityDim = .1
  )

Histograms:

In [None]:
histplot2 <- samples_dt[, (c("meanx", "medianx")) := .(mean(vals), median(vals)), by = distname][] %>%
ggplot(aes(x = vals)) +
geom_histogram() +
geom_vline(aes(xintercept = meanx), color = "red") +
geom_vline(aes(xintercept = medianx), color = "blue") +
xlim(c(-5, 5)) +
facet_wrap(. ~ distname, nrow = 7)
ggplotly(histplot2) %>% layout(autosize = F, width = 800, height = 800)

Centile values:

In [None]:
samples_dt[, as.list(quantile(vals, seq(0, 1, 0.1)) %>% round(2)), by = distname] %>% t