# Sampling and distributions

Since R was designed for statistical applications, it contains a rich set of functions for drawing from different distributions. We cover some of the more commonly used functions below.

Note: depending on the order of presentations in this series, we may not yet have covered plotting with ggplot2. If so, don't worry about the ggplot2 syntax and just focus on the math and the resulting figures.

In [None]:
# Importing ggplot2 so that we can visualize results
# library() stops with an error if a package is missing, whereas require() only warns
library(ggplot2)
library(gridExtra)
options(repr.plot.width=5, repr.plot.height=3)

## Normal distribution

Functions for working with distributions in R follow a common naming pattern: a one-letter prefix (d, p, q or r) followed by an abbreviated distribution name. For the normal distribution these are

+ density (dnorm)
+ distribution function (pnorm)
+ quantile function (qnorm)
+ random generation (rnorm)

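We illustrate dnorm and rnorm below. The histogram on the left shows random draws from rnorm, and the curve on the right shows the density from dnorm evaluated on a regular grid.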

In [None]:
nsamples <- 5000
mean <- 1
sd <- 0.75
xmin <- -3
xmax <- 5
spacing <- 0.2

normal_sampling <- rnorm(nsamples, mean=mean, sd=sd)
xvals <- seq(xmin, xmax, spacing)
normal_function <- dnorm(xvals, mean=mean, sd=sd)

df1 <- data.frame(normal_sampling)
g1 <- ggplot(df1, aes(x=normal_sampling)) + geom_histogram(col="black", fill="blue", binwidth=spacing) +
coord_cartesian(xlim=c(xmin, xmax))

df2 <- data.frame(xvals, normal_function)
g2 <- ggplot(df2, aes(x=xvals, y=normal_function)) + geom_line() + geom_point(col="blue") +
coord_cartesian(xlim=c(xmin, xmax))
grid.arrange(g1, g2, ncol=2)
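
The other two functions in the list, pnorm and qnorm, are not plotted above. As a minimal sketch, reusing the same mean and standard deviation (stored here under the illustrative names mu and sigma, which are not part of the original cell), pnorm returns the probability of a draw falling at or below a value, and qnorm inverts that mapping:

```r
# Illustrative names; the values match the mean and sd used in the plot above
mu <- 1
sigma <- 0.75

# pnorm: cumulative probability P(X <= q)
p_below_mean <- pnorm(mu, mean = mu, sd = sigma)
p_within_1sd <- pnorm(mu + sigma, mean = mu, sd = sigma) -
  pnorm(mu - sigma, mean = mu, sd = sigma)

# qnorm: the inverse of pnorm, mapping a probability back to a value
median_value <- qnorm(0.5, mean = mu, sd = sigma)
q975 <- qnorm(0.975, mean = mu, sd = sigma)

print(p_below_mean)                        # half of the mass lies below the mean
print(p_within_1sd)                        # roughly 0.68, within one sd of the mean
print(median_value)                        # equals mu for a normal distribution
print(pnorm(q975, mean = mu, sd = sigma))  # round trip recovers 0.975
```

Because qnorm and pnorm are inverses, passing the output of one to the other returns the starting value up to floating-point error.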

## Poisson distribution

The Poisson distribution is a discrete distribution that gives the probability of a given number of events occurring in a fixed interval:

$p(k) = e^{-\lambda} \frac{\lambda^{k}}{k!}$, where $k$ is the number of events and $\lambda$ is the average number of events per interval

rpois(n, lambda) draws n samples from a Poisson distribution with $\lambda$ = lambda  
dpois(x, lambda) is the probability that a single sample takes the value x

In [None]:
nsamples <- 500
lambda <- 3

poisson_sampling <- rpois(nsamples, lambda)
poisson_function <- dpois(seq(0,10), lambda) * nsamples

df1 <- data.frame(poisson_sampling)
g1 <- ggplot(df1, aes(x=poisson_sampling)) + geom_histogram(col="black", fill="blue", binwidth=1)

# seq(0:10) would return 1..11 (the indices of 0:10); we need the values 0..10
xvals <- seq(0, 10)
df2 <- data.frame(xvals, poisson_function)
g2 <- ggplot(df2, aes(x=xvals, y=poisson_function)) + geom_line() + geom_point(col="blue")
grid.arrange(g1, g2, ncol=2)
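
As a quick sanity check, the sketch below evaluates the standard Poisson probability mass function $e^{-\lambda} \lambda^{k} / k!$ by hand and compares it with dpois. The names k, manual_pmf and builtin_pmf are illustrative only. It also uses ppois, the cumulative counterpart that follows the same naming pattern as pnorm.

```r
lambda <- 3
k <- 0:10

# Probability mass computed directly from the formula
manual_pmf <- exp(-lambda) * lambda^k / factorial(k)

# The same quantity from the built-in density function
builtin_pmf <- dpois(k, lambda)

# all.equal compares numeric vectors up to a small tolerance
print(all.equal(manual_pmf, builtin_pmf))

# ppois(q, lambda) is the cumulative probability P(X <= q),
# so it should match the running sum of the densities up to q
print(ppois(10, lambda))
print(sum(builtin_pmf))
```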

## Sampling and permutations

Many machine learning applications require that we either permute the data or draw a subsample from it. Both can be done with the sample function. To generate a permutation, we select all elements without replacement. The examples below illustrate this using the built-in R constant "letters".

In [None]:
letters

In [None]:
# Sampling without replacement
for (i in seq(1,5)) {
    print(sample(letters, 5))
}

In [None]:
# Sampling with replacement (sorting output to see repeated elements)
for (i in seq(1,5)) {
    print(sort(sample(letters, 10, replace=TRUE)))
}

In [None]:
# Permutations
for (i in seq(1,5)) {
    print(sample(letters, length(letters)))
}
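
One common use of a permutation is splitting a data set into training and test subsets. The following is a minimal sketch, not a prescribed method: it assumes the iris data set that ships with R, an 80/20 split, and an arbitrary seed, and the variable names are illustrative.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

n <- nrow(iris)            # iris is a data set bundled with R
shuffled <- sample(n)      # sample(n) returns a random permutation of 1..n
n_train <- floor(0.8 * n)  # assumed 80/20 split

# The first part of the shuffled indices goes to training, the rest to testing
train <- iris[shuffled[1:n_train], ]
test <- iris[shuffled[(n_train + 1):n], ]

print(dim(train))
print(dim(test))
```

Because every row index appears exactly once in the permutation, the two subsets are disjoint and together cover the whole data set.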

### Sampling from a weighted distribution

R also makes it easy to sample the elements of a vector with non-uniform probabilities by passing a vector of probability weights through the prob argument.

In [None]:
nsample <- 50
sampling <- sample(c('A', 'B', 'C'), nsample, replace=TRUE, prob=c(0.7, 0.2, 0.1))

In [None]:
sampling

To confirm that the distribution looks as expected, we can use the table function, which counts how many times each value occurs.

In [None]:
nsample <- 1000
sampling <- sample(c('A', 'B', 'C'), nsample, replace=TRUE, prob=c(0.7, 0.2, 0.1))
table(sampling)
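
Raw counts are easier to compare with the weights once they are converted to proportions. The sketch below does this with prop.table. The names items, probs, draws and observed, the seed, and the larger sample size are illustrative choices, not part of the cells above.

```r
set.seed(1)  # arbitrary seed for reproducibility

items <- c('A', 'B', 'C')
probs <- c(0.7, 0.2, 0.1)
nsample <- 10000

draws <- sample(items, nsample, replace = TRUE, prob = probs)

# factor() with explicit levels keeps a zero count for any item never drawn
observed <- prop.table(table(factor(draws, levels = items)))

# Show expected and observed proportions in one small matrix
print(round(rbind(expected = probs, observed = as.numeric(observed)), 3))
```

With more samples the observed proportions should move closer to the weights, although any single run will still show some random deviation.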