<div style="text-align: center"><img width=150px src="http://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/R_logo.svg/200px-R_logo.svg.png"></div>

# An introduction to R

R is a highly extensible language and environment for statistical computing and graphics. It's distributed for free under the GNU General Public License, enjoys strong community support, and is known for its ability to produce publication-quality plots including mathematical symbols and formulae. You can learn more about R at [r-project.org](https://www.r-project.org/about.html) and [An Introduction to R](https://cran.r-project.org/doc/manuals/r-release/R-intro.html).

Using R typically means taking the time to set up an R environment. Azure Notebooks removes this setup step, giving you a pre-configured environment that's ready for your R code.

This notebook demonstrates R within a Jupyter notebook, using material from [Section 2 - Simple manipulations; numbers and vectors](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Simple-manipulations-numbers-and-vectors) and [Appendix A - A Sample Session](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#A-sample-session) from the aforementioned *An Introduction to R* documentation. It also includes content built around the well-known `demo(graphics)` command of R, with comments converted into Markdown cells. 

Note that the R kernel is still in development, so some language features may not be available. To submit issues and requests for features, refer to the [Azure Notebooks GitHub repository](https://github.com/Microsoft/AzureNotebooks/issues).

## Simple manipulations; numbers and vectors

### Vectors and assignment

R operates on named *data structures*. The simplest such structure is the numeric *vector*, which is a single entity consisting of an ordered collection of numbers. To set up a vector named `x`, say, consisting of five numbers, namely 10.4, 5.6, 3.1, 6.4 and 21.7, use the following R command:

In [None]:
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

In a notebook, the previous cell won't show any output. You can see the contents of `x` by simply running `x` in a code cell:

In [None]:
x

The statement `x <- c(...)` is an assignment that uses the *function* `c()`, which in this context can take an arbitrary number of vector arguments and whose value is a vector obtained by concatenating its arguments end to end.

A number occurring by itself in an expression is taken as a vector of length one.

The assignment operator (`<-`) consists of the two characters `<` ("less than") and `-` ("minus") occurring strictly side by side. The operator points to the object receiving the value of the expression. In most contexts the `=` operator can be used as an alternative.

Assignment can also be made using the `assign()` function. The cell below also includes the line `x` to show the value as output. The `<-` operator can be thought of as a shortcut to `assign()`.

In [None]:
assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))
x

Assignments can also be made from left to right by changing the direction of the assignment operator: 

In [None]:
c(10.4, 5.6, 3.1, 6.4, 21.7) -> x
x

If an expression is used as a complete command, the value is printed and lost. For example, the following statement displays the reciprocals of the values in `x`, but doesn't assign those values to any variable:

In [None]:
1/x

Finally, the following code creates a vector `y` with 11 entries consisting of two copies of `x` with a zero in the middle:

In [None]:
y <- c(x, 0, x)
y

### Vector arithmetic

Vectors can be used in arithmetic expressions, in which case the operations are performed element by element. Vectors occurring in the same expression need not all be of the same length. If they are not, the value of the expression is a vector with the same length as the longest vector which occurs in the expression. Shorter vectors in the expression are *recycled* as often as need be (perhaps fractionally) until they match the length of the longest vector. In particular, a constant is simply repeated.

The following expression, using the `x` and `y` values from the previous section (which, if you ran those code cells, are in the notebook session), generates a new vector `v` of length 11 constructed by adding together, element by element, `2*x` repeated 2.2 times, `y` repeated just once, and `1` repeated 11 times. Note that the code issues a warning because the length of `y` is not an integral multiple of the length of `2*x`.

In [None]:
v <- 2*x + y + 1
v

The elementary arithmetic operators are the usual `+`, `-`, `*`, `/`, and `^` (raise to a power), along with all the common arithmetic functions: `log`, `exp`, `sin`, `cos`, `tan`, `sqrt`, and so on. `max` and `min` select the largest and smallest elements of a vector, respectively. `range(x)` returns a vector of length two, namely `c(min(x), max(x))`. `length(x)` is the number of elements in `x`, `sum(x)` gives the total of the elements in `x`, and `prod(x)` their product.

In [None]:
log(v)
sin(v)
sqrt(v)
min(v)
max(v)
range(v)
length(v)
sum(v)
prod(v)

Two statistical functions are `mean(x)`, which calculates the sample mean and is the same as `sum(x)/length(x)`, and `var(x)` which gives the sample variance:

In [None]:
sum((x-mean(x))^2)/(length(x)-1)
var(x)

If the argument to `var()` is an *n*-by-*p* matrix, the value is a *p*-by-*p* sample covariance matrix obtained by regarding the rows as independent *p*-variate sample vectors.
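
As a rough illustration of the matrix case, the sketch below builds a small 4-by-2 matrix and passes it to `var()`. The matrix `obs` and its column names are made up for this example and are not part of the original session.

```r
# Hypothetical data: 4 observations (rows) of 2 variables (columns)
obs <- cbind(a = c(1, 2, 3, 4), b = c(2, 4, 6, 9))

cov_obs <- var(obs)   # 2-by-2 sample covariance matrix
cov_obs
dim(cov_obs)

# The diagonal holds the ordinary sample variance of each column
var(obs[, "a"])
```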

`sort(x)` returns a vector of the same size as `x` with the elements arranged in increasing order; however there are other more flexible sorting facilities available (see `order()` or `sort.list()` which produce a permutation to do the sorting). 
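
The difference between sorting the values and obtaining the sorting permutation can be sketched as follows. The vector `scores` is a stand-in introduced only for this example, so the cell doesn't disturb `x`.

```r
# Hypothetical vector used only for this illustration
scores <- c(10.4, 5.6, 3.1, 6.4, 21.7)

sorted <- sort(scores)    # the values in increasing order
perm <- order(scores)     # the positions that would sort the vector

sorted
perm
scores[perm]              # indexing by the permutation gives the sorted values

sort(scores, decreasing = TRUE)
```

A permutation from `order()` is handy when a second vector has to be rearranged in step with the first.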

Note that `max` and `min` select the largest and smallest values in their arguments, even if they are given several vectors. The *parallel* max/min functions `pmax` and `pmin` return a vector (of length equal to their longest argument) that contains in each element the largest (smallest) element in that position in any of the input vectors. 
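
A minimal sketch of the contrast, using two short vectors `a` and `b` that are assumptions of this example:

```r
# Two hypothetical vectors of equal length
a <- c(1, 7, 3)
b <- c(4, 2, 9)

max(a, b)    # one value: the largest element across both vectors
min(a, b)    # one value: the smallest element across both vectors

pmax(a, b)   # element-wise maxima, one per position
pmin(a, b)   # element-wise minima, one per position
```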

For most purposes, you're not concerned if the "numbers" in a numeric vector are integers, reals or even complex. Internally calculations are done as double precision real numbers, or double precision complex numbers if the input data are complex. 

To work with complex numbers, supply an explicit complex part. Otherwise, as the code below demonstrates, you get `NaN` (not-a-number) along with a warning:

In [None]:
sqrt(-17)

But the following expression performs the computation as complex numbers:

In [None]:
sqrt(-17+0i)

### Generating regular sequences

R has a number of facilities for generating commonly used sequences of numbers. For example, the colon is a shorthand for creating a vector of sequential numbers in ascending or descending order:

In [None]:
1:30
30:1

`1:30` is equivalent to writing `c(1, 2, 3, ..., 30)`, and a lot less tedious!

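Within expressions, the colon has higher precedence than the binary arithmetic operators `*`, `/`, `+`, and `-` (though lower than `^` and unary minus), so use parentheses when you need a different grouping: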

In [None]:
2*1:15

n <- 10
1:n-1
1:(n-1)

The `seq()` function is a more general means for generating sequences. It has five arguments, only some of which may be specified in any one call. The first two arguments, if given, specify the beginning and end of the sequence, and if these are the only two arguments given the result is the same as the colon operator. For example, `seq(2,10)` is the same vector as `2:10`. 

In [None]:
seq(2,10)
2:10

Arguments to `seq()`, and to many other R functions, can also be given in named form, in which case the order in which they appear is irrelevant. With `seq()`, the first two arguments may be named `from=value` and `to=value`; thus the following expressions are all identical:

In [None]:
seq(1,30)
seq(from=1, to=30)
seq(to=30, from=1)
1:30

The next two arguments to `seq()` may be named `by=value` and `length=value`, which specify a step size and a length for the sequence respectively. If neither of these is given, the default `by=1` is assumed. For example:

In [None]:
seq(-5, 5, by=.2) -> s3               # Assigns the vector c(-5.0, -4.8, -4.6, …, 4.6, 4.8, 5.0) to s3
s3

s4 <- seq(length=51, from=-5, by=.2)  # Assigns the same vector to s4
s4

The fifth argument may be named `along=vector`, which is normally used as the only argument to create the sequence 1, 2, ..., `length(vector)`, or the empty sequence if the vector is empty (as it can be).
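
The empty-vector case is where `along` differs most visibly from the colon operator. The sketch below uses two made-up vectors, `items` and `nothing`, to show this.

```r
# Hypothetical vectors for illustration
items <- c("a", "b", "c")
nothing <- character(0)

idx <- seq(along = items)          # one index per element of items
empty_idx <- seq(along = nothing)  # empty sequence for an empty vector

idx
empty_idx

# By contrast, the colon operator counts down when the length is zero
1:length(nothing)
```

This is why `seq(along = ...)` is often preferred over `1:length(...)` when a vector might be empty.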

A related function is `rep()` which can be used for replicating an object in various complicated ways. The simplest form is the following, which puts five copies of `x` end-to-end in `s5`.

In [None]:
s5 <- rep(x, times=5)
s5

Another useful version is the following expression, which repeats each element of `x` five times before moving on to the next.

In [None]:
s6 <- rep(x, each=5)
s6

### Logical vectors

Along with numerical vectors, R allows manipulation of logical quantities. The elements of a logical vector can have the values `TRUE`, `FALSE`, and `NA` (for "not available"). The first two are often abbreviated as `T` and `F`, respectively. However, `T` and `F` are just variables that are set to `TRUE` and `FALSE` by default; they aren't reserved words and thus can be overwritten by your own code if you use the same names (and the same case; variables in R are case-sensitive). Consequently, always use `TRUE` and `FALSE` for clarity:


In [None]:
T
F

t <- 5
t
T

T <- 10
T
TRUE

Logical vectors are generated by *conditions*. The following expression, for example, sets `temp` as a vector of the same length as `x` with values `FALSE` corresponding to elements of `x` where the condition is not met and `TRUE` where it is:

In [None]:
temp <- x > 13
temp

The logical operators are `<`, `<=`, `>`, `>=`, `==` for exact equality, and `!=` for inequality. In addition if `c1` and `c2` are logical expressions, then `c1 & c2` is their intersection ("and"), `c1 | c2` is their union ("or"), and `!c1` is the inverse of `c1`. 

Logical vectors may be used in ordinary arithmetic, in which case they are coerced into numeric vectors, `FALSE` becoming 0 and `TRUE` becoming 1. However there are situations where logical vectors and their coerced numeric counterparts are not equivalent, as explained in the next section.
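
The operators and the coercion to 0 and 1 can be sketched as follows. The vector `vals` repeats the numbers used for `x` earlier but is given its own name, an assumption of this example, so the cell works on its own.

```r
# Same numbers as the earlier x, under a separate (hypothetical) name
vals <- c(10.4, 5.6, 3.1, 6.4, 21.7)

big <- vals > 6
small <- vals < 10

big & small   # TRUE only where both conditions hold
big | small   # TRUE where at least one condition holds
!big          # the negation of big

sum(big)      # number of TRUE values, because TRUE is coerced to 1
mean(big)     # proportion of TRUE values
```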

### Missing values

In some cases, the components of a vector may not be completely known. When an element or value is "not available" or a "missing value" in the statistical sense, you can reserve a place for it within a vector by assigning it the special value `NA`. In general, any operation on an `NA` becomes an `NA`. The motivation for this rule is simply that if the specification of an operation is incomplete, the result cannot be known and hence is not available. 

The function `is.na(x)` gives a logical vector of the same size as `x` with value `TRUE` if and only if the corresponding element in `x` is `NA`:


In [None]:
z <- c(1:3,NA);  ind <- is.na(z)
z
ind

Notice that the logical expression `x == NA` is quite different from `is.na(x)` because `NA` is not really a value but a marker for a quantity that is not available. Thus `x == NA` is a vector of the same length as `x`, *all* of whose values are `NA` as the logical expression itself is incomplete and hence undecidable. 
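
A short sketch of the difference, using a made-up vector `na_demo`. The `na.rm` argument shown at the end isn't discussed in the text above; it is included only as a common way to skip missing values.

```r
# Hypothetical vector with one missing value
na_demo <- c(1, NA, 3)

na_demo == NA     # every element is NA: the comparison is undecidable
is.na(na_demo)    # TRUE only at the missing position

sum(na_demo)                # NA, because one input is unknown
sum(na_demo, na.rm = TRUE)  # drops the NA before summing
```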

There is also a second kind of "missing" value, the `NaN` or not-a-number, which is produced by a numerical computation that cannot be sensibly performed:

In [None]:
0/0
Inf - Inf

In summary, `is.na(x)` is `TRUE` for *both* `NA` and `NaN` values. To differentiate these, `is.nan(x)` is only `TRUE` for NaNs. 
Missing values are sometimes printed as `<NA>` when character vectors are printed without quotes. 
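
The two tests can be compared side by side. The vector `miss` below is an assumption of this sketch and holds one ordinary number, one `NA`, and one `NaN`.

```r
# Hypothetical vector: a number, a missing value, and a not-a-number
miss <- c(1, NA, 0/0)

is.na(miss)    # TRUE for both NA and NaN
is.nan(miss)   # TRUE only for NaN
```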

### Character vectors

Character quantities and character vectors are used frequently in R, for example as plot labels. They're defined by a sequence of characters inside double quotes, for example:

In [None]:
"x-values"
"New iteration results"

Character strings are entered using either matching double (") or single (') quotes, but are printed using double quotes (or sometimes without quotes). They use C-style escape sequences, using `\` as the escape character, so `\\` is entered and printed as `\\`, and inside double quotes `"` is entered as `\"`. Other useful escape sequences are `\n` (newline), `\t` (tab), and `\b` (backspace); see `?Quotes` for a full list.
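
The sketch below shows how escape sequences behave. The string `msg` is made up for this example. Printing a string shows the escapes as entered, `cat()` writes the characters they stand for, and `nchar()` confirms that each escape counts as a single character.

```r
# Hypothetical string containing escaped quotes, a tab, and a newline
msg <- "She said \"hi\"\tand left\n"

print(msg)   # shows the escape sequences as entered
cat(msg)     # writes the actual quote, tab, and newline characters

nchar("\\")    # a single backslash character
nchar("a\tb")  # three characters: a, tab, b
```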

Character vectors may be concatenated into a vector using the `c()` function. 

The `paste()` function takes an arbitrary number of arguments and concatenates them one by one into character strings. Any numbers given among the arguments are coerced into character strings in the evident way, that is, in the same way they would be if they were printed. The arguments are by default separated in the result by a single blank character, but this can be changed by the named argument `sep=string`, which changes it to `string`, possibly empty.

For example, the following expression makes `labs` into the same character vector as the second expression:

In [None]:
labs <- paste(c("X","Y"), 1:10, sep="")
labs

c("X1", "Y2", "X3", "Y4", "X5", "Y6", "X7", "Y8", "X9", "Y10")

Note particularly that recycling of short lists takes place here; thus `c("X", "Y")` is repeated five times to match the sequence `1:10`.

### Index vectors; selecting and modifying subsets of a data set

Subsets of the elements of a vector may be selected by appending to the name of the vector an *index vector* in square brackets. More generally any expression that evaluates to a vector may have subsets of its elements similarly selected by appending an index vector in square brackets immediately after the expression. 

Such index vectors can be any of four distinct types.

#### (1) A logical vector

In this case the index vector is recycled to the same length as the vector from which elements are to be selected. Values corresponding to `TRUE` in the index vector are selected and those corresponding to `FALSE` are omitted. For example, the following expression creates (or re-creates) an object `y` which contains the non-missing values of `x`, in the same order. Note that if `x` has missing values, `y` is shorter than `x`.

In [None]:
y <- x[!is.na(x)]
y

The following expression creates an object `z` and places in it the values of the vector `x+1` for which the corresponding value in `x` was both non-missing and positive:

In [None]:
(x+1)[(!is.na(x)) & x>0] -> z
z

#### (2) A vector of positive integral quantities

In this case the values in the index vector must lie in the set {1, 2, ..., length(x)}. The corresponding elements of the vector are selected and concatenated, *in that order*, in the result. The index vector can be of any length and the result is of the same length as the index vector. For example:

In [None]:
x[6]    # The sixth component of x (NA here, because x has only five elements)
x[1:10] # The first 10 elements of x (positions beyond length(x) come back as NA)

The following expression, though an admittedly unlikely thing to use, produces a character vector of length 16 consisting of "x", "y", "y", "x" repeated four times:

In [None]:
c("x","y")[rep(c(1,2,2,1), times=4)]

#### (3) A vector of negative integral quantities

Such an index vector specifies the values to be excluded rather than included. Thus the following expression gives y all but the first five elements of x.

In [None]:
y <- x[-(1:5)]
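
With the five-element `x` used in this notebook, that expression leaves `y` empty. The sketch below uses a longer vector, `nums`, which is an assumption of this example, to make the exclusion easier to see.

```r
# Hypothetical ten-element vector
nums <- 1:10

tail_part <- nums[-(1:5)]    # everything except the first five elements
trimmed <- nums[-c(1, 10)]   # everything except the first and last elements

tail_part
trimmed
```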

#### (4) A vector of character strings

This possibility applies only where an object has a names attribute to identify its components. In this case a sub-vector of the names vector may be used in the same way as the positive integral labels in item 2 further above. 

In [None]:
fruit <- c(5, 10, 1, 20)
fruit

names(fruit) <- c("orange", "banana", "apple", "peach")
fruit

lunch <- fruit[c("apple","orange")]
lunch

The advantage of such vectors is that alphanumeric *names* are often easier to remember than *numeric indices*. This option is particularly useful in connection with data frames.

Besides the four types above, an indexed expression can also appear on the receiving end of an assignment, in which case the assignment operation is performed *only on those elements of the vector*. The expression must be of the form `vector[index_vector]` as having an arbitrary expression in place of the vector name does not make much sense here. For example, the first expression below replaces any missing values in x by zeros and the second has the same effect as `y <- abs(y)`:

In [None]:
x[is.na(x)] <- 0
x

y[y < 0] <- -y[y < 0]
y

### Other types of objects

Vectors are the most important type of object in R, but there are several others which we will meet more formally in later sections. 

*Matrices*, or more generally *arrays*, are multi-dimensional generalizations of vectors. In fact, they are vectors that can be indexed by two or more indices and will be printed in special ways. See [Arrays and matrices](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Arrays-and-matrices). 

*Factors* provide compact ways to handle categorical data. See [Factors](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Factors). 

*Lists* are a general form of vector in which the various elements need not be of the same type, and are often themselves vectors or lists. Lists provide a convenient way to return the results of a statistical computation. See [Lists](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Lists).

*Data frames* are matrix-like structures in which the columns can be of different types. Think of data frames as 'data matrices' with one row per observational unit but with (possibly) both numerical and categorical variables. Many experiments are best described by data frames: the treatments are categorical but the response is numeric. See [Data frames](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Data-frames).

*Functions* are themselves objects in R which can be stored in the project's workspace. This provides a simple and convenient way to extend R. See [Writing your own functions](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Writing-your-own-functions). 
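
As a quick preview of these object types, the sketch below creates one small example of each. All names and values (`mat`, `grp`, `res`, `trial`, `square`) are invented for illustration and are not taken from *An Introduction to R*.

```r
# A 2-by-3 matrix built from a vector
mat <- matrix(1:6, nrow = 2)
dim(mat)

# A factor holding categorical data
grp <- factor(c("low", "high", "low"))
levels(grp)

# A list whose elements have different types
res <- list(name = "fit", values = c(1.5, 2.5))
res$values

# A data frame with one categorical and one numeric column
trial <- data.frame(treatment = grp, response = c(3.1, 4.7, 2.9))
str(trial)

# A function stored as an ordinary object
square <- function(v) v^2
square(1:3)
```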

## A sample R session

The code in this walkthrough introduces you to various features of the R environment, such as plotting and management of objects.

To begin with, generate two pseudo-random normal vectors of x- and y-coordinates:

In [None]:
x <- rnorm(50)
y <- rnorm(x)

Plot the points in a plane, generating an inline graphic:

In [None]:
# Set plot size for this section
options(repr.plot.width=8, repr.plot.height=6)

plot(x, y)

See which R objects are now in the R workspace. 

In [None]:
ls()

Clean up objects that aren't needed:

In [None]:
rm(x, y)
ls()

Create a 'weight' vector of standard deviations:

In [None]:
x <- 1:20            # Create a vector of 1, 2, 3, ... 20
w <- 1 + sqrt(x)/2
w

Make a data frame of two columns, x and y, and look at it:

In [None]:
dummy <- data.frame(x=x, y= x + rnorm(x)*w)
dummy

Fit a simple linear regression and look at the analysis. With y to the left of the tilde, we are modelling y dependent on x. 

In [None]:
fm <- lm(y ~ x, data=dummy)
summary(fm)

Because we know the standard deviations, we can do a weighted regression. 

In [None]:
fm1 <- lm(y ~ x, data=dummy, weights=1/w^2)  # weights is the full argument name
summary(fm1)

Make the columns in the data frame visible as variables. 

In [None]:
attach(dummy)

Make a nonparametric local regression function.

In [None]:
lrf <- lowess(x, y)

The next lines create (a) a standard point plot, (b) a line for the local regression, (c) the true regression line, (d) the unweighted regression line, and (e) the weighted regression line.

In [None]:
plot(x, y)                     # Standard point plot
lines(x, lrf$y)                # Local regression
abline(0, 1, lty=3)            # True regression line (intercept = 0, slope = 1)
abline(coef(fm))               # Unweighted regression line
abline(coef(fm1), col = "red") # Weighted regression line

Remove data frame from the search path.

In [None]:
detach()

A standard regression diagnostic plot to check for heteroscedasticity. Can you see it? 

In [None]:
plot(fitted(fm), resid(fm),
     xlab="Fitted values",
     ylab="Residuals",
     main="Residuals vs Fitted")

A normal scores plot to check for skewness, kurtosis and outliers. (Not very useful here.) 

In [None]:
qqnorm(resid(fm), main="Residuals Rankit Plot")

Clean up again.

In [None]:
rm(fm, fm1, lrf, x, dummy)

### Work with the Michelson-Morley experiment

The next section looks at data from the classical experiment of Michelson to measure the speed of light. This dataset is available in the `morley` object, but we will read it from a file to illustrate the `read.table` function.

Get the path to the data file.

In [None]:
filepath <- system.file("data", "morley.tab" , package="datasets")
filepath

Optional. Look at the file.

In [None]:
file.show(filepath)

Read in the Michelson data as a data frame, and look at it. There are five experiments (column `Expt`), each with 20 runs (column `Run`), and `Speed` is the recorded speed of light, suitably coded.

In [None]:
mm <- read.table(filepath)
mm

Change Expt and Run into factors. 

In [None]:
mm$Expt <- factor(mm$Expt)
mm$Run <- factor(mm$Run)

Make the data frame visible at position 2 (the default).

In [None]:
attach(mm)

Compare the five experiments with simple boxplots. 

In [None]:
# Set plot size for this section
options(repr.plot.width=8, repr.plot.height=6)

plot(Expt, Speed, main="Speed of Light Data", xlab="Experiment No.")

Analyze as a randomized block, with ‘runs’ and ‘experiments’ as factors. 

In [None]:
fm <- aov(Speed ~ Run + Expt, data=mm)
summary(fm)

Fit the sub-model omitting ‘runs’, and compare using a formal analysis of variance. 

In [None]:
fm0 <- update(fm, . ~ . - Run)
anova(fm0, fm)

Clean up before moving on. 

In [None]:
detach()
rm(fm, fm0)

### Graphical features: contour and image plots

`x` is a vector of 50 equally spaced values in the interval [-pi, pi]. `y` is the same.

In [None]:
x <- seq(-pi, pi, len=50)
y <- x

f is a square matrix, with rows and columns indexed by x and y respectively, of values of the function cos(y)/(1 + x^2). 

In [None]:
f <- outer(x, y, function(x, y) cos(y)/(1 + x^2))
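
To see how `outer()` lays out its result, the sketch below applies the same function to a tiny grid. The names `gx`, `gy`, and `g` are assumptions of this example, chosen so they don't overwrite `x`, `y`, or `f`.

```r
# Hypothetical 2-by-2 grid so that every entry can be checked by hand
gx <- c(0, 1)
gy <- c(0, pi)

# Entry [i, j] is the function evaluated at (gx[i], gy[j])
g <- outer(gx, gy, function(x, y) cos(y)/(1 + x^2))
g
dim(g)
```

Rows follow the first argument and columns the second, which matches the description of `f` as indexed by `x` and `y` respectively.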

Save the plotting parameters and set the plotting region to “square”.

In [None]:
oldpar <- par(no.readonly = TRUE)
par(pty="s")

Make a contour map of f; add in more lines for more detail. 

In [None]:
contour(x, y, f)
contour(x, y, f, nlevels=15, add=TRUE)

Make a contour plot. fa is the “asymmetric part” of f. (t() is transpose). 

In [None]:
fa <- (f-t(f))/2
contour(x, y, fa, nlevels=15)

Then restore the old graphics parameters. 

In [None]:
par(oldpar)

Make some high-density image plots (of which you can get hard copies if you wish):

In [None]:
image(x, y, f)
image(x, y, fa)

Clean up before moving on.

In [None]:
objects(); rm(x, y, f, fa)

### Complex arithmetic in R

1i is used for the complex number i.

In [None]:
th <- seq(-pi, pi, len=100)
z <- exp(1i*th)

Plotting complex arguments means plotting imaginary versus real parts. This should be a circle.

In [None]:
# Set plot size for this section
options(repr.plot.width=8, repr.plot.height=6)

par(pty="s")
plot(z, type="l")

Suppose we want to sample points within the unit circle. One method would be to take complex numbers with standard normal real and imaginary parts:

In [None]:
w <- rnorm(100) + rnorm(100)*1i

And to map any outside the circle onto their reciprocal. 

In [None]:
w <- ifelse(Mod(w) > 1, 1/w, w)
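
The reciprocal mapping can be checked on two hand-picked points. The vector `pts` is an assumption of this sketch: one point already inside the unit circle and one outside it.

```r
# Hypothetical points: the first has modulus 0.5, the second has modulus 5
pts <- c(0.5 + 0i, 3 + 4i)

# Points outside the circle are replaced by their reciprocal
mapped <- ifelse(Mod(pts) > 1, 1/pts, pts)

Mod(pts)
Mod(mapped)   # the reciprocal of a point with modulus 5 has modulus 1/5
```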

All points are inside the unit circle, but the distribution is not uniform. 

In [None]:
plot(w, xlim=c(-1,1), ylim=c(-1,1), pch="+",xlab="x", ylab="y")
lines(z)

The second method uses the uniform distribution. The points should now look more evenly spaced over the disc. 

In [None]:
w <- sqrt(runif(100))*exp(2*pi*runif(100)*1i)
plot(w, xlim=c(-1,1), ylim=c(-1,1), pch="+", xlab="x", ylab="y")
lines(z)
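
One way to see why the radius is the square root of a uniform variate: the share of points within radius 0.5 should match the area ratio of the inner disc, which is 0.25. The sketch below checks this numerically. The seed, the sample size, and all variable names are arbitrary choices made for this example.

```r
set.seed(1)    # arbitrary seed so the sketch is repeatable
n <- 10000     # larger sample than in the cell above, for a steadier estimate

r_sqrt <- sqrt(runif(n))   # radius as used in the second method
r_raw <- runif(n)          # radius without the square root, for contrast

frac_sqrt <- mean(r_sqrt <= 0.5)   # should be near 0.25
frac_raw <- mean(r_raw <= 0.5)     # should be near 0.5: too many points near the center

frac_sqrt
frac_raw
```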

Clean up again. 

In [None]:
rm(th, w, z)

## Comparison of R and S graphics capabilities

The following code cells illustrate some of the differences between R and S graphics capabilities. Colors are generally specified by a character string name (taken from the X11 rgb.txt file), and line textures are given similarly. The parameter `bg` sets the background color for the plot, and there is also an `fg` parameter which sets the foreground color.

In [None]:
require(datasets)

require(grDevices); require(graphics)

In [None]:
# Set plot size for this section
options(repr.plot.width=8, repr.plot.height=6)

x <- stats::rnorm(50)
opar <- par(bg = "white")
plot(x, ann = FALSE, type = "n")             # Empty plot region, no annotation
abline(h = 0, col = gray(.90))               # Light gray reference line at zero
lines(x, col = "green4", lty = "dotted")     # Dotted line through the points
points(x, bg = "limegreen", pch = 21)        # Filled circles at each point
title(main = "Simple Use of Color In a Plot",
       xlab = "Just a Whisper of a Label",
       col.main = "blue", col.lab = gray(.8),
       cex.main = 1.2, cex.lab = 1.0, font.main = 4, font.lab = 3)

### A little color wheel

This code plots equally spaced hues in a pie chart. On low-quality monitors you may find that numerically equispaced hues are not visually equispaced and may cluster at the RGB primaries. On a high-quality monitor, the color wheel should appear quite accurate.

In [None]:
# Set plot size for this section
options(repr.plot.width=8, repr.plot.height=6)

par(bg = "gray")

pie(rep(1,24), col = rainbow(24), radius = 0.9)
title(main = "A Sample Color Wheel", cex.main = 1.4, font.main = 3)
title(xlab = "(Use this as a test of monitor linearity)",
      cex.lab = 0.8, font.lab = 3)

### A scatterplot matrix using Iris data

In [None]:
# Set plot size for this section
options(repr.plot.width=8, repr.plot.height=6)

pairs(iris[1:4], main="Edgar Anderson's Iris Data", pch=21,
       bg = c("red", "green3", "blue")[unclass(iris$Species)])
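
The `bg` argument in that call relies on indexing a color vector by a factor's integer codes. The sketch below pulls that step out on its own. The names `species_colors` and `point_colors` are assumptions of this example.

```r
# One color per species level, in the order of levels(iris$Species)
species_colors <- c("red", "green3", "blue")

# unclass() exposes the factor's integer codes (1, 2, 3), which then
# index into the color vector to give one color per observation
point_colors <- species_colors[unclass(iris$Species)]

levels(iris$Species)
table(iris$Species, point_colors)
```

Each row of the table should have a single non-zero entry, showing that every species maps to exactly one color.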