# Introduction to R Visualizations

In this module we will explore some of the approaches for visualizing data within R.

If you are unfamiliar with notebooks, please review some basics [here](https://github.com/michhar/useR2016-tutorial-jupyter). 

## Essential Tips

A very brief summary of the critical components and commands within Jupyter:

1. Critically, press `Ctrl+Enter` to run (or render) the current cell.
2. Output will print to the notebook. You may have to scroll up to see it all.
3. Get help for any function by typing a question mark followed by its name in
   a code cell: `?rxLinMod`. This splits the window and brings up the documentation for
   that function below.
4. Files will appear in the specified directory. You can find them by selecting File in the menu bar and selecting "Open...". This will open a new browser window with a file navigator.
5. R objects can be listed by typing `ls()` in an R cell.
6. Run all the example code!

There are a number of hands-on exercises in the document, so while you can run the notebook from beginning to end, you will get a lot more out of it by actually walking through cell-by-cell, and filling out the corresponding exercises.

These notebooks are based on a tutorial presented at a Microsoft conference in June of 2016. The original files are available [here](https://github.com/joseph-rickert/MLADS_JUNE_2016).

Before we get started, we'll source a configuration file in the next cell. It simply makes sure that the relevant R packages and datasets are available. You do not need to look at it, but if you are interested, you can view the configuration file [here](Resources/config.R). It may take a few moments to run the first time you run it, but it should be fast afterwards.

In [None]:
source("Resources/config.R")

## Visualizations in R

There are three major plotting systems in R: 

1. base graphics
2. lattice graphics
3. ggplot2 

Additionally, a significant amount of development work is going into allowing R users to produce dynamic JavaScript plots. In this module we will plot histograms in all three systems, then show more ggplot2 examples, and finish with a JavaScript-based interactive plot.

### R's Three Plotting Systems

For this we will use Duncan's well-known occupational prestige data set, which records income, education level, and a prestige score for 45 U.S. occupations classified as "professional" (prof), "blue collar" (bc), or "white collar" (wc). Note that `income` is not a dollar amount: it is the percentage of people in each occupation who earned $3,500 or more per year in the 1950 U.S. Census. Conveniently, these data are available through the `car` package, which is the package associated with the book ["An R Companion to Applied Regression"](http://socserv.socsci.mcmaster.ca/jfox/Books/Companion/index.html). To load a package and make its functions and datasets available, we use the function `library()`.

Let's first explore the data using base graphics. 

First step - let's load and view the dataset.

In [None]:
library(car)
data(Duncan)
dim(Duncan)
head(Duncan)

Next, we will pull out one income vector per occupation type and draw a histogram of each:

In [None]:
# Select income for each occupation type using basic subsetting
inc_prof <- Duncan[Duncan$type == "prof", "income"]
inc_bc <- Duncan[Duncan$type == "bc", "income"]
inc_wc <- Duncan[Duncan$type == "wc", "income"]

## plot all 3:
par(mfrow = c(1, 3))  # put 3 plots side by side in one row
## BC
hist(inc_bc,
     prob = TRUE,             # plot densities rather than counts
     col = "pink",
     main = "Income BC",
     xlab = "Income (% earning $3,500+)")
lines(density(inc_bc))        # overlay a kernel density estimate
## Prof
hist(inc_prof,
     prob = TRUE,
     col = "yellow",
     main = "Income Prof",
     xlab = "Income (% earning $3,500+)")
lines(density(inc_prof))
## WC
hist(inc_wc,
     prob = TRUE,
     col = "lightblue",
     main = "Income WC",
     xlab = "Income (% earning $3,500+)")
lines(density(inc_wc))
par(mfrow = c(1, 1))  # set it back to 1 plot per pane
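The three near-identical `hist()` calls above can also be written as a loop. The following is a minimal sketch rather than part of the original tutorial: it assumes the `car` package (which supplies `Duncan`) is installed, and it uses base R's `split()` to break `income` into one vector per level of `type`. The names `income_by_type` and `old_par` are illustrative.

```r
library(car)   # makes the Duncan data set available
data(Duncan)

# One income vector per occupation type, returned as a named list
income_by_type <- split(Duncan$income, Duncan$type)

old_par <- par(mfrow = c(1, length(income_by_type)))  # remember the previous layout
for (type_name in names(income_by_type)) {
  x <- income_by_type[[type_name]]
  hist(x, prob = TRUE, main = paste("Income", type_name), xlab = "income")
  lines(density(x))   # kernel density estimate on top of the histogram
}
par(old_par)          # restore the previous layout
```

Saving the value returned by `par()` and passing it back afterwards restores whatever layout was active before, instead of assuming it was `c(1, 1)`.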

### Lattice (trellis) Graphics

Lattice graphics are the second major plotting system in R. Plots built with lattice have a very distinctive look, but the real value is the ease of making trellis plots: graphs that display a variable conditioned on another variable. Some useful websites are:

- http://www.statmethods.net/advgraphs/trellis.html
- http://user2007.org/program/presentations/sarkar.pdf

In [None]:
library(lattice)
histogram( ~ income | type, 
           data = Duncan,
           nint=10,
           xlab = "Income",  
           main= "Histogram by profession",
           type = "density",
           panel = function(x, ...) {
             panel.histogram(x, ...)
             panel.mathdensity(dmath = dnorm, col = "black",
                               args = list(mean=mean(x),sd=sd(x)))
           },
           layout = c(3,1)
         )



Note that in that command, I could create an arbitrary panel function (defined in-line as the `panel` argument) that described the specific steps that I wanted to run in each panel. Further, when using the lattice package, I didn't need to do any subsetting - the formula specification `| type` indicated that there should be a unique panel for each level of `type`.

Incredibly useful!
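The same conditioning formula works for other lattice plot types. As a hedged sketch (not from the original tutorial, and assuming `car` and `lattice` are installed), the code below draws `prestige` against `income` with one panel per level of `type`, and uses a custom panel function to add a least-squares line to each panel. The name `prestige_plot` is illustrative.

```r
library(lattice)
library(car)
data(Duncan)

# Scatter plot of prestige against income, one panel per level of `type`
prestige_plot <- xyplot(prestige ~ income | type,
                        data = Duncan,
                        layout = c(3, 1),
                        panel = function(x, y, ...) {
                          panel.xyplot(x, y, ...)   # the points for this panel
                          panel.lmline(x, y, ...)   # least-squares line for this panel
                        })
print(prestige_plot)  # lattice objects are drawn when printed
```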

### ggplot2 Graphics

`ggplot2` is the third major plotting system for R. It is based on Leland Wilkinson’s [grammar of graphics](http://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html) (hence the `gg` name).

Some useful websites are: 
 
- http://ggplot2.org/ 
- http://docs.ggplot2.org/current/ 
- http://www.cookbook-r.com/Graphs/
- http://www.rstudio.com/wp-content/uploads/2015/12/ggplot2-cheatsheet-2.0.pdf

A key idea in ggplot2 is that you separate out the components of a visualization:

- the data
- the aesthetic mapping (i.e. which variables map onto which visualization properties, like the x-axis and y-axis)
- the geometries (what kinds of things do you want to have in the visualization)
- the operations or statistics (what do you want the geometries to represent?)

We build a visualization in ggplot with layers - we start out by defining the data to be used and the aesthetic:

In [None]:
library(ggplot2)
p1 <- ggplot(Duncan,aes(income,..density.., fill=type))


The first argument to `ggplot()` is the data, and the second argument is the aesthetic. An aesthetic is the mapping between variables and features in a plot. Within `aes()`, the first argument corresponds to the x-axis, the second corresponds to the y-axis, and further named arguments specify other features, such as the color of lines or fills. In this particular case, we create an aesthetic object with the `aes()` function, and that object specifies that we would like to place `income` on the x-axis, `..density..` on the y-axis, and fill any bars or other objects we add with a color that maps onto the `type` variable.

However, at this point we haven't actually plotted anything yet! We've just set the data and aesthetic. To render a plot, we need to add geometry layers, and we literally use the `+` symbol to add these layers onto the plot.

In the next cell, we add two geometries to the space defined in `p1`, and we add a `facet_grid()` that separates out the different types, much as we did in lattice by specifying `| type` and in base graphics with `par(mfrow = c(1,3))`.

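Note that `..density..` is an interesting and relatively advanced argument value, as it does not exist in the `Duncan` dataset. Rather, it is computed from the data by the layers that we add in the next cell (both `geom_histogram()` and `geom_density()` compute a density). Recent versions of ggplot2 (3.4.0 and later) emit a deprecation warning for the `..density..` spelling and recommend `after_stat(density)` instead; both refer to the same computed variable.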


In [None]:
p1 + geom_histogram(bins = 10) +           ## adds a layer of the histogram
     geom_density(alpha = 0.5) +           ## adds a density layer with a semi-transparent fill
     facet_grid(. ~ type) +                ## puts each level of `type` in its own facet
     xlab("Income (% earning $3,500+)") +  ## adjust the x-axis label
     ggtitle("Histogram by Profession")    ## add a title to the graph
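To make the "nothing is drawn until a layer is added" point concrete, the sketch below inspects the layer list of a plot object before and after adding a geometry. It is an illustration under assumptions: it relies on the `layers` element of a ggplot object being accessible with `$`, and the names `base_plot` and `with_hist` are made up for this example.

```r
library(ggplot2)
library(car)
data(Duncan)

base_plot <- ggplot(Duncan, aes(income, fill = type))  # data + aesthetic only
length(base_plot$layers)                               # no layers yet

with_hist <- base_plot + geom_histogram(bins = 10)     # `+` returns a new plot object
length(with_hist$layers)                               # one layer: the histogram
length(base_plot$layers)                               # the original object is unchanged
```

Because `+` returns a new object instead of modifying the old one, a base object such as `p1` can be reused with different geometries.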

### More ggplot2 visualizations

Let's explore a few more visualizations with `ggplot2`.

We will use the diamonds data set that comes with the `ggplot2` package.

Because we have already loaded the `ggplot2` package with the `library()` function, we can actually find diamonds without adding it explicitly to the workspace.

Let's examine the diamonds dataset:

In [None]:
ls()
head(diamonds)
dim(diamonds)

Next, let's sample down to 5000 rows to make rendering a little faster. We will use the `sample_n()` function from the popular `dplyr` package:

In [None]:
library(dplyr)
set.seed(123)
dsmall <- sample_n(diamonds,5000)
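The `set.seed()` call is what makes the sample reproducible. A small sketch of that behaviour follows, with illustrative names `draw_a`, `draw_b`, and `draw_c`. Note that recent `dplyr` releases document `slice_sample(n = ...)` as the successor to `sample_n()`; the example sticks with `sample_n()` to match the cell above.

```r
library(ggplot2)  # provides the diamonds data
library(dplyr)

set.seed(123)
draw_a <- sample_n(diamonds, 5000)
set.seed(123)
draw_b <- sample_n(diamonds, 5000)   # same seed, so the same rows are drawn
draw_c <- sample_n(diamonds, 5000)   # seed not reset, so a different sample is expected

identical(draw_a, draw_b)
identical(draw_a, draw_c)
```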

### Scatter plots

Let's start by creating some simple scatter plots.

The documentation for ggplot2 is available at http://docs.ggplot2.org, and the relevant geometry for creating a scatter plot is [geom_point()](http://docs.ggplot2.org/current/geom_point.html).

Let's plot `carat` on the x axis, `price` on the y axis, and color each point according to what `cut` it is. We do this by defining the data set to be our small diamonds dataset (`dsmall`), and then constructing an aesthetic mapping with `aes()`. The first argument corresponds to the x-axis, the second corresponds to the y-axis, and then we specify color.


In [None]:
p1 <- ggplot(dsmall,aes(carat,price,color=cut))

Note again that we create the space with `ggplot()`, but that it doesn't actually render a plot until you add a geometry to it.

Now, let's add some points to our plot and render it.


In [None]:
p1 + geom_point()

We can easily facet by another variable. In this case, we make a separate plot for each level of `clarity` using `facet_grid()`:

In [None]:
p1 + geom_point() + facet_grid(. ~ clarity) 

If we want each facet to appear in its own row, we simply change the format of the formula in the `facet_grid()` call:

In [None]:
p1 + geom_point() + facet_grid(clarity ~ .) 

### Histograms

As we saw above, we create histograms with [geom_histogram](http://docs.ggplot2.org/current/geom_histogram.html).

Because a histogram aggregates the data, we only need to specify the x-axis in the aesthetic mapping. By default, it will plot the frequency count of each bin. The `bins` argument in the geometry specifies how many bins should be used.


In [None]:
p2 <- ggplot(dsmall, aes(x = price))
p2 + geom_histogram(bins=200)

As we saw in our first ggplot example, if we want to overlay with a density plot, it is useful to plot the density rather than the count. We do this by specifying `..density..` as the variable that should map onto the y-axis.

In [None]:
p2 <- ggplot(dsmall, aes(x = price, ..density..))
p2 + 
 geom_histogram(bins=200) + 
 geom_density(alpha = 0.5, color = 'red', size = 2)    
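To see where the computed density values come from, one option is to pull the per-bin data out of a histogram layer with `layer_data()`. This is a sketch, not part of the original tutorial; it uses the full `diamonds` data and the illustrative names `hist_plot`, `bin_data`, and `bin_width`.

```r
library(ggplot2)

hist_plot <- ggplot(diamonds, aes(x = price)) + geom_histogram(bins = 200)
bin_data <- layer_data(hist_plot)   # one row per bin, as computed by the stat

head(bin_data[, c("xmin", "xmax", "count", "density")])

# density is the count scaled by total count and bin width,
# so the areas of the bars should sum to (approximately) 1
bin_width <- bin_data$xmax - bin_data$xmin
sum(bin_data$density * bin_width)
```

This is why the density scale suits an overlay: both the histogram and the kernel density estimate then enclose an area of one.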

### Boxplots

A boxplot is another useful geometry in ggplot, available through the [geom_boxplot() function](http://docs.ggplot2.org/current/geom_boxplot.html).

In this example, we map `cut` to both the x-axis and the fill color, and `carat` to the y-axis.



In [None]:
p3 <- ggplot(dsmall, aes(cut,carat,fill=cut))
p3 + geom_boxplot()

This is one clear example of where different geometries really come into play. If we simply used `geom_point()`, the figure would be a dot plot, and much of the relevant information about the distribution at each level of `cut` would have been lost. However, using a boxplot allows us to visualize a number of properties of the distribution at each level of `cut`.
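The box in each boxplot summarizes the quartiles of the distribution. As a rough check, the sketch below computes the quartiles of `carat` within each level of `cut` directly. It uses the full `diamonds` data instead of the `dsmall` sample so that it runs on its own; `carat_by_cut` and `box_stats` are illustrative names.

```r
library(ggplot2)  # provides the diamonds data

# One carat vector per level of cut
carat_by_cut <- split(diamonds$carat, diamonds$cut)

# 25th, 50th, and 75th percentiles: the bottom, middle line, and top of each box
box_stats <- sapply(carat_by_cut, quantile, probs = c(0.25, 0.5, 0.75))
box_stats
```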

#### Exercise

Go ahead and try plotting with `geom_point()` instead. (Remember that `p3` still exists, so you can simply add a different geometry to that object!)

In [None]:
## Place your exercise code here

#### Exercise
Next, create a new plot that examines the relationship between cut and price.

In [None]:
## Place your exercise code here
p3 <- ggplot(dsmall, aes(cut,price,fill=cut))
p3 + geom_boxplot()

### Scatter plot with statistical smoothing

It's frequently very useful to be able to visualize trends in data. One way to do this is to estimate a smoothing function and draw a line that corresponds to that smoother. This is accomplished in ggplot with the [`geom_smooth()` function](http://docs.ggplot2.org/current/geom_smooth.html).

In [None]:
p4 <- ggplot(data = dsmall, aes(carat, price)) 
p4 +  geom_point(aes(colour=cut)) + 
      geom_smooth(method="loess") + 
      ggtitle("Sample of Diamonds Data with Smoother") 

This is a powerful method that allows a lot of flexibility. For example, we can fit a simple linear regression by setting `method` to `"lm"`:


In [None]:
p4 +  geom_point(aes(colour=cut)) + 
      geom_smooth(method="lm") + 
      ggtitle("Sample of Diamonds Data with lm Smoother") 

And if we want to use a linear regression but account for non-linear trends via polynomials, we can do that as well:

In [None]:
p4 +  geom_point(aes(colour=cut)) + 
      geom_smooth(method="lm", formula = y ~ poly(x,4)) + 
      ggtitle("Sample of Diamonds Data with Polynomial Smoother") 
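The `method = "lm"` smoothers correspond to ordinary `lm()` fits of `y` on `x`. The following sketch fits the two models directly so their coefficients and fit can be inspected. It is an illustration using the full `diamonds` data and not the `dsmall` sample; `linear_fit` and `poly_fit` are illustrative names.

```r
library(ggplot2)  # provides the diamonds data

linear_fit <- lm(price ~ carat, data = diamonds)            # straight line
poly_fit   <- lm(price ~ poly(carat, 4), data = diamonds)   # degree-4 polynomial

length(coef(linear_fit))  # intercept + slope
length(coef(poly_fit))    # intercept + 4 polynomial terms

# The linear model is nested in the polynomial one,
# so the polynomial R-squared cannot be lower
summary(linear_fit)$r.squared
summary(poly_fit)$r.squared
```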

### Plotting the Nile Overflow Data

Next we'll look at another data set, about Nile river overflow. These data are in the `pracma` package.

In [None]:
library(pracma)   ## for Nile river data
head(nile)        ## look at the first few rows
?nile             ## Examine the nile data meta-data

As we can see from the head (or from the data meta-data), the `nile` data.frame is in wide-format, where each row corresponds to a different year, and each column (other than year) corresponds to a different month. The actual value that is measured is a measure of flow at the Dongola measurement station.

In order to make this dataset more amenable to visualizing with ggplot, we can convert it to long format by using the `melt` function from the `reshape2` package:

In [None]:
library(reshape2)           # for melt function to build long form data frame

nile_dat <- melt(
    nile,                           # dataset we are processing
    id.vars="Year",                 # The grouping variable that has repeated measurements within it
    measure.vars=month.abb,         # which variables correspond to measures we want to reformat (all the months)
    variable.name="Month",          # The variable name in the new data.frame we want to create to hold month values
    value.name="Flow"               # The variable name in the new data.frame we want to create to hold the actual observed values
)
head(nile_dat)

As we can see, it has been reformatted appropriately, but we probably want to sort by year first, and then by month. Fortunately, `melt()` stored `Month` as a factor whose levels follow the order of `measure.vars`, so the months will sort in calendar order rather than alphabetically.

In [None]:
## because the dplyr package is already attached, we can use arrange to sort the way we want.
nile_dat_long <- arrange(nile_dat, Year, Month)
head(nile_dat_long)
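The wide-to-long reshaping and the factor-based sort are easier to see on a tiny table. The sketch below uses a hypothetical two-year, three-month data frame (`wide`) and not the real `nile` data; the values are made up purely for illustration.

```r
library(reshape2)
library(dplyr)

# Hypothetical toy data: 2 years x 3 months, in wide format
wide <- data.frame(Year = c(1871, 1872),
                   Jan = c(10, 11), Feb = c(20, 21), Mar = c(30, 31))

long <- melt(wide, id.vars = "Year", measure.vars = c("Jan", "Feb", "Mar"),
             variable.name = "Month", value.name = "Flow")

levels(long$Month)           # follows measure.vars order, not alphabetical order
arrange(long, Year, Month)   # sorts by year, then by factor level of Month
```

Each wide row becomes one long row per month, so two rows with three month columns turn into six rows.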

We haven't seen this yet, but R also has built-in support for dates. We can create a character string that maps onto a date string, and then we can convert that to an internal date object that R will be able to treat appropriately.

In this case, we'll just assume the 15th of each month as the date of the actual observation.

In [None]:
# Build a date string such as "Jan-15-1871", then convert it to a Date object
nile_dat_long$Date <- with(nile_dat_long, paste0(Month,"-","15","-",as.character(Year)))
nile_dat_long$Date <- as.Date(nile_dat_long$Date,format="%b-%d-%Y")  # %b = abbreviated month name (locale-dependent)

head(nile_dat_long)
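One caveat: `%b` matches month abbreviations in the current locale, so parsing "Jan", "Feb", and so on can yield `NA` in a non-English locale. A possible locale-independent alternative, sketched here on a hypothetical two-row stand-in (`obs`) for the real data, is to look up the month number in `month.abb` and build an ISO-style date string:

```r
# Hypothetical two-row stand-in for the long-format Nile data
obs <- data.frame(Year = c(1871, 1871),
                  Month = factor(c("Jan", "Feb"), levels = month.abb))

month_number <- match(as.character(obs$Month), month.abb)   # "Feb" -> 2
obs$Date <- as.Date(sprintf("%d-%02d-15", as.integer(obs$Year), month_number))

obs$Date
diff(obs$Date)   # Date objects support arithmetic: difference in days
```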

Now that we have reformatted, let's see what the data actually looks like.

Let's start by plotting the observed flow as a function of the date for a window of about 200 observations (rows 100 through 300).

In [None]:
# Plot the time series
p <- ggplot(nile_dat_long[100:300,],aes(x=Date,y=Flow)) # set up the data and aesthetic
# actually make the plot:
p + geom_line() +                           # include a line from observation to observation
  geom_point(shape=1,col="red") +           # place a point as well.  
  ggtitle("Monthly Flow of Nile River at Dongola Station")

That looks pretty cyclical. 
Now let's actually plot an aggregation of all the data, and plot a boxplot for each month to see if it has a pretty regular 12 month cycle.

In [None]:
# Boxplots of monthly flows
b <- ggplot(nile_dat_long,aes(Month,Flow))                                  # Set up data and aesthetic
b + geom_boxplot() +                                                       # Create the boxplot!
  stat_summary(fun.y=mean, geom = "line", aes(group = 1), color = 'red', size = 2) + # Draw a line connecting the mean of each distribution
  ggtitle("Variation of Flow at Dongola Station by Month")

### BONUS: Create an interactive graph with a JavaScript library

In addition to the three plotting systems above, other packages provide interactive graphs. One approach is implemented in the `dygraphs` package. See additional details [here](http://rstudio.github.io/dygraphs/).

In [None]:
library(xts)    ## works with time-series objects
library(dygraphs)
# Make into a time series object
nile_ts <- xts(nile_dat_long$Flow,          # the melted values live in the Flow column
               order.by=nile_dat_long$Date,
               frequency=12,start=c(1871,1))

# Plot with htmlwidget dygraph
dygraph(nile_ts, 
        main="Nile Monthly Flow Data", width = 600, height = 500) %>%
  dySeries("V1",label="Flow") %>%
  dyRangeSelector(dateWindow = c("1871-01-01","1984-12-01"))
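An `xts` object is a matrix of values indexed by time, which is what lets `dygraph()` place observations on a date axis. The sketch below builds a small hypothetical monthly series (`toy_ts`, with made-up values) and shows the date-string subsetting that `xts` supports:

```r
library(xts)

# Hypothetical series: 24 monthly values starting January 1871
dates <- seq(as.Date("1871-01-15"), by = "month", length.out = 24)
toy_ts <- xts(seq_along(dates), order.by = dates)

nrow(toy_ts["1871"])         # all observations in 1871
toy_ts["1871-11/1872-02"]    # a date range, endpoints inclusive
```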
