# Period 4: Visualization of genome-scale data with R/Bioconductor

## Overview of visualization

The basic purpose of visualization is to expose the "structure of variation" in datasets.
Even simple data elements that might occupy a single vector in R
can have intricate "structure" that visualization can help expose.  Before focusing
sharply on tools for genomic visualization, we'll introduce
some basic concepts of statistical modeling to help formalize this concept
of "structured variation".

### Brief comments on statistical models 

#### The concept of a density function

Basic model for continuous univariate measurement: $N$ data points denoted
$x_i, i = 1, \ldots, N$ 

- are _statistically independent_ (value of $x_i$ tells us
nothing about the value of $x_j$, $j \neq i$)
- have relative frequencies prescribed by mathematical functions that have certain properties
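
One way to write these two assumptions down, as a sketch in notation that is ours rather than the original text's: the relative frequencies are governed by a function $f$ that is non-negative and has total area one, and independence means that the joint density of the $N$ points, written here as $g$, factors into a product.
$$
f(t) \geq 0, \qquad \int\limits_{-\infty}^{\infty} f(t) dt = 1, \qquad \Pr(a < x_i \leq b) = \int\limits_a^b f(t) dt, \qquad g(x_1, \ldots, x_N) = \prod_{i=1}^{N} f(x_i)
$$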

Example 1, for students with some exposure to probability theory:  Define 
$$
f(t) = \frac{1}{\sqrt{2\pi}} e^{-\frac{t^2}{2}}
$$
and let $F(x) = \int\limits_{-\infty}^x f(t) dt$.  $F$ is known as the standard
normal or Gaussian _distribution function_.  You can use R to explore this formalism.

In [None]:
myf = function(t) (1/(sqrt(2*pi)))*exp(-t^2/2)
myF = function(x) integrate(myf, -Inf, x)$value
dom = seq(-3,3,.1)
plot(dom, sapply(dom, myF), xlab="x", ylab="F(x)", type="l")

The function $f$ defined above is the standard Gaussian _density function_.  We have written code for it already.  It's a little easier to plot it because we coded it in a vectorized way.
(As a mildly advanced exercise, recode `myF` so that it returns a vector of values $F(x)$ for a vector input.)

The density function has a familiar display:

In [None]:
plot(dom, myf(dom), type="l", xlab="x", ylab="f(x)")

R has more general Gaussian density and distribution functions built in (`dnorm` and `pnorm`), in the sense that
mean and standard deviation values can also be specified to define location and spread.

In [None]:
range(myf(dom)-dnorm(dom)) # dnorm is built in
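
As a small illustration of the location and spread arguments, the sketch below uses the built-in `dnorm` and `pnorm` with their `mean` and `sd` arguments. The values `mu = 2` and `sigma = 0.5` are arbitrary choices made for illustration, and `myf` and `myF` are redefined so that the block runs on its own.

```r
# standard Gaussian density and distribution function, as defined earlier
myf = function(t) (1/(sqrt(2*pi)))*exp(-t^2/2)
myF = function(x) integrate(myf, -Inf, x)$value

mu = 2       # arbitrary location, chosen for illustration
sigma = 0.5  # arbitrary spread (a standard deviation, not a variance)
x = seq(0, 4, .1)

# a general Gaussian density is a shifted and rescaled standard density
shifted = myf((x - mu)/sigma)/sigma
max_density_diff = max(abs(shifted - dnorm(x, mean=mu, sd=sigma)))

# pnorm is the built-in counterpart of myF
pts = c(-1, 0, 1)
max_cdf_diff = max(abs(sapply(pts, myF) - pnorm(pts)))

c(density=max_density_diff, cdf=max_cdf_diff)  # both should be near zero
```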

#### Histograms as estimates of densities

A basic tool of exploratory visualization is the histogram.  This can be tuned
in various ways, but let's consider how its default implementation can be used to think about the
plausibility of a model for a given set of data.  We can simulate standard
Gaussian observations in R, plot the histogram, and check its relationship to
the theoretical density.

In [None]:
dat = rnorm(1000) # simulated
hist(dat, freq=FALSE)
lines(dom, dnorm(dom), lty=2)
lines(density(dat), lty=3)
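
One way to see why `freq=FALSE` makes the histogram comparable to a density: on that scale the bar areas add up to 1, just as a density integrates to 1. The sketch below checks this numerically; the seed is an arbitrary choice added only for reproducibility.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
dat = rnorm(1000)

h = hist(dat, plot=FALSE)                 # compute the bins without drawing
bar_area = sum(h$density*diff(h$breaks))  # height times width, summed over bins
bar_area

# the kernel estimate also integrates to roughly 1 over its evaluation grid
d = density(dat)
kde_area = sum(d$y)*diff(d$x[1:2])        # simple Riemann sum on the equispaced grid
kde_area
```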

Both the default histogram and the default density estimate track the theoretical
density closely for the simulated data.  We will now apply them to a classic dataset.

#### Histogram and density with a dataset

R.A. Fisher studied measurements of parts of three species of iris plants.

In [None]:
head(iris)

We'll use a histogram and default density estimate to visualize the distribution of petal width.

In [None]:
hist(iris$Petal.Width, freq=FALSE)
lines(density(iris$Petal.Width))

Both the histogram and the density serve to demonstrate that a unimodal model 
will not work well for this dataset.  

It is worth noting that the default density display seems particularly flawed in that
it appears to support negative width measures.  (The estimate is smooth, so it extends
past the left boundary of the data and stays positive below zero.)  A density estimation tool that can respect
constraints on the measurement range is available in the logspline package.

In [None]:
library(logspline)
# lbound=0 tells logspline that negative values are not possible
suppressWarnings(f1 <- logspline(iris$Petal.Width, lbound=0))
hist(iris$Petal.Width, freq=FALSE)
dom=seq(0,2.6,.01)
lines(dom, dlogspline(dom, fit=f1), lty=2)
lines(density(iris$Petal.Width), lty=3, col="gray")
legend(1.0, 1.1, legend=c("logspline", "default"), 
       lty=c(2,3), col=c("black", "gray"), bty="n")

The histogram is a form of 'nonparametric' density estimation -- it makes no assumptions about the functional form of the underlying distribution.  The logspline density estimate
assumes that there is a smooth underlying density, and constraints like bounds
on the observation space can be imposed.  The default density estimator in R
can be tuned; see `?density`.
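
For example, the bandwidth of the default estimator can be scaled through the `adjust` argument of `density`. The sketch below compares three settings; the particular values (`adjust=0.5` and `adjust=2`) are illustrative choices, not recommendations.

```r
x = iris$Petal.Width
d_default = density(x)              # default bandwidth rule
d_narrow  = density(x, adjust=0.5)  # half the default bandwidth
d_wide    = density(x, adjust=2)    # twice the default bandwidth

# bandwidths actually used by each estimate
c(default=d_default$bw, narrow=d_narrow$bw, wide=d_wide$bw)

hist(x, freq=FALSE, main="Petal width: tuned kernel estimates")
lines(d_default, lty=3, col="gray")
lines(d_narrow, lty=2)
lines(d_wide, lty=1)
legend("topright", legend=c("default", "adjust=0.5", "adjust=2"),
       lty=c(3,2,1), col=c("gray", "black", "black"), bty="n")
```

A narrower bandwidth brings out more modes but also more noise. None of these settings respects the lower bound at zero.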

As a concluding remark on the value of refined density estimation, we note that the logspline
estimate was quite suggestive of a trimodal distribution.  We know more about the
iris data -- the measurements were in fact collected on three different species of plant.  The mean values
of petal width for these species seem to lie close to the modes suggested by the
logspline estimate.  This is much less apparent with the default density estimate.

In [None]:
sapply(split(iris$Petal.Width, iris$Species), mean)

In [None]:
hist(iris$Petal.Width, freq=FALSE)
dom=seq(0,2.6,.01)
lines(dom, dlogspline(dom, fit=f1), lty=2)
lines(density(iris$Petal.Width), lty=3, col="gray")
legend(1.0, 1.1, legend=c("logspline", "default"), 
       lty=c(2,3), col=c("black", "gray"), bty="n")
abline(v=c(.246, 1.326, 2.026), lty=4) # species means of petal width, computed above

#### Models and visualization for discrete data

We'll briefly address categorical data that do not have a natural ordering.  A nice
example dataset is the `HairEyeColor` array.


In [None]:
HairEyeColor

We'd like to understand whether eye and hair color are associated, and whether
the association varies by sex.  The null model is that of mutual independence.

We'll use the `mosaicplot` function to examine relative frequencies of different
(sex, eyecolor, haircolor) configurations in the data.  With the `shade` parameter
set, we are presented with a collection of colored rectangles that depict the
relative frequencies of the different categories.  Red boxes correspond to
configurations that are observed less often than expected under the assumption of independence.
Blue boxes correspond to configurations that have unexpectedly high prevalence.

In [None]:
options(repr.plot.width=5, repr.plot.height=5)
mosaicplot(HairEyeColor, shade=TRUE)

Thus there are fewer blond-haired, brown-eyed individuals than expected under
an assumption of independence, and black hair with brown eyes is more common than
would be expected under this assumption.  A review of log-linear modeling
gives a deeper understanding of this technique.
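
As a minimal sketch of how log-linear modeling connects to the shaded mosaic plot, the mutual independence model can be fit with `loglin` from the base `stats` package, and the Pearson residuals of that fit are the kind of quantity the shading summarizes. The variable names `fit` and `res` are introduced here for illustration.

```r
# fit the mutual independence model: Hair, Eye and Sex margins only
fit = loglin(HairEyeColor, margin=list(1, 2, 3), fit=TRUE, print=FALSE)

# overall lack of fit of independence (likelihood ratio and Pearson statistics)
c(lrt=fit$lrt, pearson=fit$pearson, df=fit$df)

# Pearson residuals: (observed - expected)/sqrt(expected)
res = (HairEyeColor - fit$fit)/sqrt(fit$fit)
round(res["Blond", "Brown", ], 2)  # negative: fewer than expected
round(res["Black", "Brown", ], 2)  # positive: more than expected
```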

## A quick survey of R's base graphics for multivariate data

### Additional views of Fisher's iris data

We focused before on a single feature, petal width.  The full dataset has four features and a species label.

In [None]:
head(iris)

In [None]:
options(repr.plot.width=4, repr.plot.height=4) # specific for jupyter
plot(Sepal.Length~jitter(Sepal.Width), data=iris)

In [None]:
plot(Sepal.Length~jitter(Sepal.Width), data=iris, pch=19, col=Species)

In [None]:
pairs(iris[,-5], col=iris$Species, pch=19, cex.labels=.9)

In [None]:
names(par())
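
The call above lists the names of the graphical parameters that `par` controls. As a small sketch of how one of them is typically used, the block below sets `mfrow` to draw two panels side by side and then restores the previous settings. The choice of panels is ours, for illustration.

```r
old = par(mfrow=c(1, 2))  # par returns the previous values of what it changes
plot(Sepal.Length~Sepal.Width, data=iris, pch=19, col=Species)
plot(Petal.Length~Petal.Width, data=iris, pch=19, col=Species)
par(old)                  # restore whatever layout was in effect before
restored = par("mfrow")
```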