# Introduction

## History of R

### What is R?

R is a dialect of the S language.

### What is S?
- S is a language that was developed by John Chambers and others at Bell
Labs.

- S was initiated in 1976 as an internal statistical analysis
environment—originally implemented as Fortran libraries.

- Early versions of the language did not contain functions for statistical
modeling.

- In 1988 the system was rewritten in C and began to resemble the system
that we have today (this was Version 3 of the language). The book
*Statistical Models in S* by Chambers and Hastie (the white book)
documents the statistical analysis functionality.

- Version 4 of the S language was released in 1998 and is the version we
use today. The book *Programming with Data* by John Chambers documents this version of the language.

### Historical Notes

- In 1993 Bell Labs gave StatSci (now Insightful Corp.) an exclusive
license to develop and sell the S language.

- In 2004 Insightful purchased the S language from Lucent for \$2 million
and is the current owner.

- In 2006, Alcatel purchased Lucent Technologies and is now called
Alcatel-Lucent.

- Insightful sells its implementation of the S language under the product
name S-PLUS and has built a number of fancy features (GUIs, mostly) on
top of it—hence the “PLUS”.

- In 2008 Insightful was acquired by TIBCO for \$25 million.

- The fundamentals of the S language itself have not changed dramatically
since 1998.

- In 1998, S won the Association for Computing Machinery’s Software System
Award.

### S Philosophy

In “Stages in the Evolution of S”, John Chambers writes:

> “[W]e wanted users to be able to begin in an interactive environment,
> where they did not consciously think of themselves as programming.
> Then as their needs became clearer and their sophistication increased,
> they should be able to slide gradually into programming, when the
> language and system aspects would become more important.”
http://www.stat.bell-labs.com/S/history.html

### Back to R

- 1991: Created in New Zealand by Ross Ihaka and Robert Gentleman. Their
experience developing R is documented in a 1996 *JCGS* paper.

- 1993: First announcement of R to the public.

- 1995: Martin Mächler convinces Ross and Robert to use the GNU General
Public License to make R free software.

- 1996: Public mailing lists are created (R-help and R-devel).

- 1997: The R Core Group is formed (containing some people associated with
S-PLUS). The core group controls the source code for R.

- 2000: R version 1.0.0 is released.
- 2018: R version 3.5.2 is released on December 20, 2018.

### Features of R

- Syntax is very similar to S, making it easy for S-PLUS users to switch
over.

- Semantics are superficially similar to S, but in reality are quite
different (more on that later).

- Runs on almost any standard computing platform/OS (even on the
PlayStation 3)

- Frequent releases (annual + bugfix releases); active development.

### Features of R

- Quite lean, as far as software goes; functionality is divided into modular packages

- Graphics capabilities very sophisticated and better than most stat packages.

- Useful for interactive work, but contains a powerful programming language for developing new tools (user -> programmer)

- Very active and vibrant user community; R-help and R-devel mailing lists and Stack Overflow

- It's free!

### Free Software

With *free software*, you are granted

-   The freedom to run the program, for any purpose (freedom 0).

-   The freedom to study how the program works, and adapt it to your
    needs (freedom 1). Access to the source code is a precondition for
    this.

-   The freedom to redistribute copies so you can help your neighbor
    (freedom 2).

-   The freedom to improve the program, and release your improvements to
    the public, so that the whole community benefits (freedom 3). Access
    to the source code is a precondition for this.
    
http://www.fsf.org


### Drawbacks of R

-   Essentially based on 40-year-old technology.

-   Little built-in support for dynamic or 3-D graphics (but things have
    improved greatly since the “old days”).

-   Functionality is based on consumer demand and user contributions. If
    no one feels like implementing your favorite method, then it’s
    *your* job!

    -   (Or you need to pay someone to do it)

-   Objects must generally be stored in physical memory; but there have
    been advancements to deal with this too

-   Not ideal for all possible situations (but this is a drawback of all
    software packages).

### Design of the R System

The R system is divided into 2 conceptual parts:

1.  The “base” R system that you download from CRAN

2.  Everything else.

R functionality is divided into a number of *packages*.

-   The “base” R system contains, among other things, the **base**
    package which is required to run R and contains the most fundamental
    functions.

-   The other packages contained in the “base” system include
    **utils**, **stats**, **datasets**, **graphics**,
    **grDevices**, **grid**, **methods**, **tools**,
    **parallel**, **compiler**, **splines**, **tcltk**,
    **stats4**.

-   There are also “Recommended” packages: **boot**, **class**,
    **cluster**, **codetools**, **foreign**, **KernSmooth**,
    **lattice**, **mgcv**, **nlme**, **rpart**, **survival**,
    **MASS**, **spatial**, **nnet**, **Matrix**.

### Design of the R System

And there are many other packages available:

-   There are about 4000 packages on CRAN that have been developed by
    users and programmers around the world.

-   There are also many packages associated with the Bioconductor
    project (http://bioconductor.org).

-   People often make packages available on their personal websites;
    there is no reliable way to keep track of how many packages are
    available in this fashion.

### Some R Resources

Available from CRAN (http://cran.r-project.org)

-   An Introduction to R

-   Writing R Extensions

-   R Data Import/Export

-   R Installation and Administration (mostly for building R from
    sources)

-   R Internals (not for the faint of heart)

### Some Useful Books on S/R

Standard texts

-   Peng (2018). *R Programming for Data Science*, [leanpub.com](https://leanpub.com/rprogramming/)

-   Chambers (2008). *Software for Data Analysis*, Springer.

-   Chambers (1998). *Programming with Data*, Springer.

-   Venables & Ripley (2002). *Modern Applied Statistics with S*,
    Springer.

-   Venables & Ripley (2000). *S Programming*, Springer.

-   Pinheiro & Bates (2000). *Mixed-Effects Models in S and S-PLUS*,
    Springer.

-   Murrell (2005). *R Graphics*, Chapman & Hall/CRC Press.

## Data types

### Objects

R has five basic or “atomic” classes of objects:

-   character

-   numeric (real numbers)

-   integer

-   complex

-   logical (TRUE/FALSE)

The most basic object is a vector

-   A vector can only contain objects of the same class

-   BUT: The one exception is a *list*, which is represented as a vector
    but can contain objects of different classes (indeed, that’s usually
    why we use them)

Empty vectors can be created with the **vector()** function.


### Numbers

-   Numbers in R are generally treated as numeric objects (i.e. double
    precision real numbers)

-   If you explicitly want an integer, you need to specify the ```L```
    suffix

-   Ex: Entering ```1``` gives you a numeric object; entering ```1L```
    explicitly gives you an integer.

-   There is also a special number ```Inf``` which represents infinity;
    e.g. ```1 / 0```; ```Inf``` can be used in ordinary calculations;
    e.g. ```1 / Inf``` is 0

-   The value ```NaN``` represents an undefined value (“not a number”);
    e.g. 0 / 0; ```NaN``` can also be thought of as a missing value
    (more on that later)
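The points above can be checked directly at the prompt. This is only a small sketch; the variable names are illustrative and not part of the original slides.

```r
x_num <- 1    # plain numbers are stored as double-precision "numeric"
x_int <- 1L   # the L suffix requests an integer
class(x_num)
class(x_int)

pos_inf   <- 1 / 0     # Inf
back      <- 1 / Inf   # Inf can be used in ordinary calculations
undefined <- 0 / 0     # NaN
is.nan(undefined)
```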


### Attributes

R objects can have attributes

-   names, dimnames

-   dimensions (e.g. matrices, arrays)

-   class

-   length

-   other user-defined attributes/metadata

Attributes of an object can be accessed using the ```attributes()```
function.
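As a small illustration, the sketch below inspects the attributes of a matrix and then attaches a user-defined attribute to a vector. The attribute name `units` is made up for this example.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
attributes(m)              # a list containing the dim attribute

x <- 1:3
attr(x, "units") <- "cm"   # "units" is an arbitrary, user-defined attribute
attributes(x)
```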


### Entering Input

At the R prompt we type expressions. The `<-` symbol is the assignment operator.

In [12]:
x <- 1
print(x)
msg <- "hello"
msg

[1] 1


The grammar of the language determines whether an expression is complete or not.

In [15]:
x <-  ## Incomplete expression

ERROR: Error in parse(text = x, srcfile = src): <text>:2:0: unexpected end of input
1: x <-  ## Incomplete expression
   ^


The # character indicates a comment. Anything to the right of the # (including the # itself) is ignored.

### Evaluation

When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated expression is returned. The result may be auto-printed.

In [16]:
x <- 5  ## nothing printed

In [17]:
x       ## auto-printing occurs

In [18]:
print(x)  ## explicit printing

[1] 5


The `[1]` indicates that `x` is a vector and `5` is the first element.

### Printing

In [19]:
x <- 1:20 
x

The `:` operator is used to create integer sequences.

### Creating Vectors

The `c()` function can be used to create vectors of objects.

In [21]:
x <- c(0.5, 0.6)       ## numeric
x <- c(TRUE, FALSE)    ## logical
x <- c(T, F)           ## logical
x <- c("a", "b", "c")  ## character
x <- 9:29              ## integer
x <- c(1+0i, 2+4i)     ## complex

Using the `vector()` function

In [22]:
x <- vector("numeric", length = 10) 
x

### Mixing Objects

What about the following?

In [23]:
y <- c(1.7, "a")   ## character
y <- c(TRUE, 2)    ## numeric
y <- c("a", TRUE)  ## character

When different objects are mixed in a vector, _coercion_ occurs so that every element in the vector is of the same class.
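One way to see the coercion at work is to look at the resulting values and their classes. The variable names below are illustrative only.

```r
y1 <- c(1.7, "a")    # the number becomes the string "1.7"
y2 <- c(TRUE, 2)     # TRUE becomes the number 1
y3 <- c("a", TRUE)   # TRUE becomes the string "TRUE"

class(y1)
class(y2)
class(y3)
```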

### Explicit Coercion

Objects can be explicitly coerced from one class to another using the `as.*` functions, if available.

In [25]:
x <- 0:6
class(x)

In [26]:
as.numeric(x)

In [27]:
as.logical(x)

In [28]:
as.character(x)

### Explicit Coercion

Nonsensical coercion results in `NA`s.

In [30]:
x <- c("a", "b", "c")
as.numeric(x)

“NAs introduced by coercion”

In [31]:
as.logical(x)

In [32]:
as.complex(x)

“NAs introduced by coercion”

### Matrices

Matrices are vectors with a _dimension_ attribute. The dimension attribute is itself an integer vector of length 2 (nrow, ncol)


In [33]:
m <- matrix(nrow = 2, ncol = 3) 
m

     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA


In [34]:
dim(m)

In [35]:
attributes(m)

### Matrices (cont’d)

Matrices are constructed _column-wise_, so entries can be thought of starting in the “upper left” corner and running down the columns.

In [36]:
m <- matrix(1:6, nrow = 2, ncol = 3) 
m

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


### Matrices (cont’d)

Matrices can also be created directly from vectors by adding a dimension attribute.

In [37]:
m <- 1:10 
m

In [38]:
dim(m) <- c(2, 5)
m

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10


### cbind-ing and rbind-ing

Matrices can be created by _column-binding_ or _row-binding_ with `cbind()` and `rbind()`.

In [39]:
x <- 1:3
y <- 10:12
cbind(x, y)

     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12


In [40]:
rbind(x, y)

  [,1] [,2] [,3]
x    1    2    3
y   10   11   12


### Lists

Lists are a special type of vector that can contain elements of different classes. Lists are a very important data type in R and you should get to know them well.


In [41]:
x <- list(1, "a", TRUE, 1 + 4i) 
x

### Factors

Factors are used to represent categorical data. Factors can be unordered or ordered. One can think of a factor as an integer vector where each integer has a _label_.

- Factors are treated specially by modelling functions like `lm()` and `glm()`

- Using factors with labels is _better_ than using integers because factors are self-describing; having a variable that has values “Male” and “Female” is better than a variable that has values 1 and 2.


### Factors

In [43]:
x <- factor(c("yes", "yes", "no", "yes", "no")) 
x

In [44]:
table(x) 

x
 no yes 
  2   3 

In [45]:
unclass(x)

### Factors

The order of the levels can be set using the `levels` argument to `factor()`. This can be important in linear modelling because the first level is used as the baseline level.

In [33]:
x <- factor(c("yes", "yes", "no", "yes", "no"),
              levels = c("yes", "no"))
x

In [34]:
levels(x)

### Missing Values

Missing values are denoted by `NA` or `NaN` for undefined mathematical operations. 

- `is.na()` is used to test whether objects are `NA`

- `is.nan()` is used to test for `NaN`

- `NA` values have a class also, so there are integer `NA`, character `NA`, etc.

- A `NaN` value is also `NA` but the converse is not true


In [48]:
x <- c(1, 2, NA, 10, 3)
is.na(x)

In [49]:
is.nan(x)

In [50]:
x <- c(1, 2, NaN, NA, 4)
is.na(x)

In [51]:
is.nan(x)
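The last two bullets above (typed `NA` values, and `NaN` being a kind of `NA`) can be checked with a short sketch. `NA_integer_` and `NA_character_` are the typed missing-value constants that ship with base R.

```r
class(NA)              # "logical", the default NA
class(NA_integer_)     # "integer"
class(NA_character_)   # "character"

is.na(NaN)    # TRUE: a NaN value is also NA
is.nan(NA)    # FALSE: the converse is not true
```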

### Data Frames

Data frames are used to store tabular data

- They are represented as a special type of list where every element of the list has to have the same length

- Each element of the list can be thought of as a column and the length of each element of the list is the number of rows

- Unlike matrices, data frames can store different classes of objects in each column (just like lists); matrices must have every element be the same class

- Data frames also have a special attribute called `row.names`

- Data frames are usually created by calling `read.table()` or `read.csv()`

- Can be converted to a matrix by calling `data.matrix()`

In [60]:
x <- data.frame(foo = 1:4, bar = c(T, T, F, F)) 
x

  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE


In [53]:
nrow(x)

In [54]:
ncol(x)
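The list-like nature of a data frame and the conversion with `data.matrix()` can be sketched as follows, reusing the same small data frame. The name `m` is illustrative.

```r
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

is.list(x)         # TRUE: a data frame is a list of columns
sapply(x, class)   # each column keeps its own class
row.names(x)       # the default row names

m <- data.matrix(x)   # every column is converted to numbers
m
```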

### Names

R objects can also have names, which is very useful for writing readable code and self-describing objects.

In [64]:
x <- 1:3
names(x)

NULL

In [65]:
names(x) <- c("foo", "bar", "norf") 
x

In [66]:
names(x)

In [67]:
x['foo']

### Names

Lists can also have names.

In [58]:
x <- list(a = 1, b = 2, c = 3) 
x

### Names

And matrices.

In [59]:
m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d")) 
m

  c d
a 1 3
b 2 4


### Summary

Data Types

- atomic classes: numeric, logical, character, integer, complex

- vectors, lists

- factors

- missing values

- data frames

- names

## Dates and times

R has developed a special representation of dates and times

- Dates are represented by the `Date` class
- Times are represented by the `POSIXct` or the `POSIXlt` class
- Dates are stored internally as the number of days since 1970-01-01
- Times are stored internally as the number of seconds since 1970-01-01

### Dates in R

Dates are represented by the Date class and can be coerced from a character string using the `as.Date()` function.

In [6]:
x <- as.Date("1970-01-01")
x

In [7]:
unclass(x)

In [9]:
unclass(as.Date("1970-01-02"))

### Times in R

Times are represented using the `POSIXct` or the `POSIXlt` class

- `POSIXct` is just a very large number under the hood; it is a useful class when you want to store times in something like a data frame
- `POSIXlt` is a list underneath and it stores a bunch of other useful information like the day of the week, day of the year, month, day of the month

There are a number of generic functions that work on dates and times

- `weekdays`: give the day of the week
- `months`: give the month name
- `quarters`: give the quarter number (“Q1”, “Q2”, “Q3”, or “Q4”)
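A short sketch of these generic functions, together with two of the fields that `POSIXlt` stores. The date is arbitrary, and the names returned by `weekdays()` and `months()` depend on the locale of the R session.

```r
d <- as.Date("2012-01-10")
weekdays(d)   # day name, in the language of the session's locale
months(d)     # month name, also locale-dependent
quarters(d)   # "Q1"

p <- as.POSIXlt("2012-01-10 10:40:00", tz = "UTC")
p$wday   # day of the week as a number, where 0 is Sunday
p$yday   # day of the year, counted from 0
```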

### Times in R
Times can be coerced from a character string using the `as.POSIXlt` or `as.POSIXct` function.

In [10]:
x <- Sys.time()
x

[1] "2019-02-12 10:27:27 CET"

In [11]:
p <- as.POSIXlt(x)
names(unclass(p))

In [12]:
p$sec

### Times in R
You can also use the `POSIXct` format.

In [13]:
x <- Sys.time()
x  ## Already in ‘POSIXct’ format

[1] "2019-02-12 10:28:21 CET"

In [14]:
unclass(x)

In [15]:
x$sec

ERROR: Error in x$sec: $ operator is invalid for atomic vectors


In [16]:
p <- as.POSIXlt(x)
p$sec

### Times in R

Finally, there is the `strptime` function in case your dates are
written in a different format

In [17]:
datestring <- c("January 10, 2012 10:40", "December 9, 2011 9:10")
x <- strptime(datestring, "%B %d, %Y %H:%M")
x
class(x)

[1] "2012-01-10 10:40:00 CET" "2011-12-09 09:10:00 CET"

I can _never_ remember the formatting strings. Check `?strptime` for details.

In [21]:
?strptime

### Operations on Dates and Times
You can use mathematical operations on dates and times. Well, really just + and -. You can do comparisons too (e.g. `==`, `<=`).


In [22]:
x <- as.Date("2012-01-01")
y <- strptime("9 Jan 2011 11:34:21", "%d %b %Y %H:%M:%S") 
x-y

“Incompatible methods ("-.Date", "-.POSIXt") for "-"”

ERROR: Error in x - y: non-numeric argument to binary operator


In [23]:
x <- as.POSIXlt(x) 
x-y

Time difference of 356.5595 days

### Operations on Dates and Times
Even keeps track of leap years, leap seconds, daylight savings, and time zones.

In [28]:
x <- as.Date("2012-03-01"); y <- as.Date("2012-02-28") 
x-y

Time difference of 2 days

In [29]:
x <- as.POSIXct("2012-10-25 01:00:00"); y <- as.POSIXct("2012-10-25 06:00:00", tz = "GMT") 
y-x

Time difference of 7 hours
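Comparisons and simple arithmetic on `Date` objects can be sketched as below. `difftime()` is the base R function behind the "Time difference" output; choosing `units = "hours"` here is just for illustration.

```r
x <- as.Date("2012-03-01")
y <- as.Date("2012-02-28")

x > y         # comparison of two dates
x + 30        # adding a number of days gives another Date

d <- difftime(x, y, units = "hours")
as.numeric(d) # the two-day gap expressed in hours
```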

### Summary

- Dates and times have special classes in R that allow for numerical and statistical calculations
- Dates use the `Date` class
- Times use the `POSIXct` and `POSIXlt` classes
- Character strings can be coerced to Date/Time classes using the `strptime` function or the `as.Date`, `as.POSIXlt`, or `as.POSIXct` functions

## Control structures

### Control Structures

Control structures in R allow you to control the flow of execution of the program, depending on runtime conditions. Common structures are

- `if`, `else`: testing a condition

- `for`: execute a loop a fixed number of times 

- `while`: execute a loop _while_ a condition is true 

- `repeat`: execute an infinite loop

- `break`: break the execution of a loop

- `next`: skip an iteration of a loop

- `return`: exit a function

Most control structures are not used in interactive sessions, but rather when writing functions or longer expressions.

### Control Structures: if

In [None]:
if(<condition>) {
        ## do something
} else {
        ## do something else
}
if(<condition1>) {
        ## do something
} else if(<condition2>)  {
        ## do something different
} else {
        ## do something different
}

### if

This is a valid if/else structure.

In [31]:
if(x > 3) {
        y <- 10
} else {
        y <- 0
}

So is this one.

In [32]:
y <- if(x > 3) {
        10
} else { 
        0
}

### if

Of course, the else clause is not necessary. 

In [None]:
if(<condition1>) {
}
if(<condition2>) {
}

### for

`for` loops take an iterator variable and assign it successive values from a sequence or vector. For loops are most commonly used for iterating over the elements of an object (list, vector, etc.).

In [None]:
for(i in 1:10) {
        print(i)
}

This loop takes the `i` variable and in each iteration of the loop gives it values 1, 2, 3, ..., 10, and then exits.

### for

These four loops have the same behavior.

In [33]:
x <- c("a", "b", "c", "d")

In [None]:
for(i in 1:4) {
        print(x[i])
}

In [None]:
for(i in seq_along(x)) {
        print(x[i])
}

In [None]:
for(letter in x) {
        print(letter)
}

In [None]:
for(i in 1:4) print(x[i])
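The loops behave the same here because `x` has four elements. The `seq_along()` form is often the safer habit, since `1:length(x)` counts downwards when the vector is empty. A small sketch, with illustrative variable names:

```r
empty <- character(0)

1:length(empty)    # c(1, 0): a loop over this would run twice
seq_along(empty)   # integer(0): a loop over this runs zero times

n_iter <- 0
for(i in seq_along(empty)) {
        n_iter <- n_iter + 1
}
n_iter
```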

### Nested for loops

`for` loops can be nested.

In [None]:
x <- matrix(1:6, 2, 3)

for(i in seq_len(nrow(x))) {
        for(j in seq_len(ncol(x))) {
                print(x[i, j])
        }   
}

Be careful with nesting though. Nesting beyond 2–3 levels is often very difficult to read/understand.

### while

While loops begin by testing a condition. If it is true, then they execute the loop body. Once the loop body is executed, the condition is tested again, and so forth.

In [None]:
count <- 0
while(count < 10) {
        print(count)
        count <- count + 1
}

While loops can potentially result in infinite loops if not written properly. Use with care!

### while

Sometimes there will be more than one condition in the test.

In [None]:
z <- 5
while(z >= 3 && z <= 10) {
        print(z)
        coin <- rbinom(1, 1, 0.5)
        
        if(coin == 1) {  ## random walk
                z <- z + 1
        } else {
                z <- z - 1
        } 
}

Conditions are always evaluated from left to right.
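Because `&&` works from left to right and stops as soon as the result is known, a cheap or protective test can be placed first. In the sketch below (variable names are illustrative) the second test is never evaluated when `x` is `NULL`.

```r
x <- NULL

## The first test is FALSE, so `x > 3` is never evaluated
ok <- !is.null(x) && x > 3
ok
```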

### repeat

Repeat initiates an infinite loop; these are not commonly used in statistical applications but they do have their uses. The only way to exit a `repeat` loop is to call `break`.

In [None]:
x0 <- 1
tol <- 1e-8
repeat {
        x1 <- computeEstimate()
        
        if(abs(x1 - x0) < tol) {
                break
        } else {
                x0 <- x1
        } 
}

### repeat

The loop in the previous slide is a bit dangerous because there’s no guarantee it will stop. Better to set a hard limit on the number of iterations (e.g. using a for loop) and then report whether convergence was achieved or not.
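One possible way to do this is sketched below. The slides do not define `computeEstimate()`, so the version here is a stand-in (one Babylonian step toward the square root of 2, taking the current estimate as an argument), and `maxit` and `converged` are illustrative names.

```r
## Hypothetical update step: one Babylonian iteration toward sqrt(2)
computeEstimate <- function(x) (x + 2 / x) / 2

x0 <- 1
tol <- 1e-8
maxit <- 100         # hard limit on the number of iterations
converged <- FALSE

for(i in seq_len(maxit)) {
        x1 <- computeEstimate(x0)
        if(abs(x1 - x0) < tol) {
                converged <- TRUE
                break
        }
        x0 <- x1
}

## Report whether convergence was achieved
if(converged) {
        message("Converged after ", i, " iterations: ", x1)
} else {
        message("No convergence within ", maxit, " iterations")
}
```

With a `for` loop the worst case is known in advance: the loop runs at most `maxit` times, whatever the update step does.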

### next, return

`next` is used to skip an iteration of a loop

In [None]:
for(i in 1:100) {
        if(i <= 20) {
                ## Skip the first 20 iterations
                next 
        }
        ## Do something here
}

`return` signals that a function should exit and return a given value
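A minimal sketch of an early exit with `return`; the function `firstNegative()` is made up for this example.

```r
## Illustrative helper: position of the first negative value, or NA if none
firstNegative <- function(x) {
        for(i in seq_along(x)) {
                if(x[i] < 0) {
                        return(i)   # exit the function immediately
                }
        }
        NA   # reached only if the loop finishes without returning
}

firstNegative(c(3, 1, -2, 5, -7))
firstNegative(c(1, 2, 3))
```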

### Control Structures

Summary

- Control structures like `if`, `while`, and `for` allow you to control the flow of an R program

- Infinite loops should generally be avoided, even if they are theoretically correct.

- Control structures mentioned here are primarily useful for writing programs; for command-line interactive work, the `*apply` functions are more useful.

## Functions

### Functions

Functions are created using the `function()` directive and are stored as R objects just like anything else. In particular, they are R objects of class “function”.

In [None]:
f <- function(<arguments>) {
        ## Do something interesting
}

Functions in R are “first class objects”, which means that they can be treated much like any other R object. Importantly,
- Functions can be passed as arguments to other functions
- Functions can be nested, so that you can define a function inside of another function
- The return value of a function is the last expression in the function body to be evaluated.
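The three points above can be sketched in a few lines. The function names `applyTwice()` and `makePower()` are illustrative only.

```r
## A function that takes another function as an argument
applyTwice <- function(f, x) {
        f(f(x))
}

## A function defined inside another function
makePower <- function(n) {
        pow <- function(x) {
                x^n
        }
        pow   # the last expression evaluated is the return value
}

square <- makePower(2)
class(square)
applyTwice(square, 3)   # (3^2)^2
```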

### Function Arguments

Functions have _named arguments_ which potentially have _default values_.
- The _formal arguments_ are the arguments included in the function definition 
- The `formals` function returns a list of all the formal arguments of a function 
- Not every function call in R makes use of all the formal arguments
- Function arguments can be _missing_ or might have default values
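A small sketch of formal arguments, defaults, and missing arguments. The function `f()` is made up for this example; `formals()` and `missing()` are base R functions.

```r
f <- function(a, b = 1, c = 2) {
        if(missing(a)) {
                return("a was not supplied")
        }
        a + b + c
}

names(formals(f))   # the formal arguments: "a" "b" "c"
formals(f)$b        # the default value of b
f()                 # a is missing, and the function checks for that
f(10)               # b and c fall back to their defaults
```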

### Argument Matching

R function arguments can be matched positionally or by name. So the
following calls to `sd` are all equivalent.

In [None]:
mydata <- rnorm(100)
sd(mydata)
sd(x = mydata)
sd(x = mydata, na.rm = FALSE)
sd(na.rm = FALSE, x = mydata)
sd(na.rm = FALSE, mydata)

Even though it’s legal, I don’t recommend messing around with the
order of the arguments too much, since it can lead to some confusion.

### Argument Matching

You can mix positional matching with matching by name. When an argument is matched by name, it is “taken out” of the argument list and the remaining unnamed arguments are matched in the order that they are listed in the function definition.

In [None]:
> args(lm)
function (formula, data, subset, weights, na.action,
          method = "qr", model = TRUE, x = FALSE,
          y = FALSE, qr = TRUE, singular.ok = TRUE,
          contrasts = NULL, offset, ...)

The following two calls are equivalent.

In [None]:
lm(data = mydata, y ~ x, model = FALSE, 1:100)
lm(y ~ x, mydata, 1:100, model = FALSE)

### Argument Matching

- Most of the time, named arguments are useful on the command line when you have a long argument list and you want to use the defaults for everything except for an argument near the end of the list
- Named arguments also help if you can remember the name of the argument and not its position on the argument list (plotting is a good example).

### Argument Matching

Function arguments can also be _partially_ matched, which is useful for interactive work. The order of operations when given an argument is

1. Check for exact match for a named argument
2. Check for a partial match
3. Check for a positional match
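The matching order above can be tried out with a small, made-up function. The argument names `value` and `verbose` are illustrative only.

```r
## Illustrative function; the argument names are made up for this example
f <- function(value, verbose = FALSE) {
        if(verbose) {
                paste("value is", value)
        } else {
                value
        }
}

r1 <- f(3, verbose = TRUE)   # exact match on the name
r2 <- f(3, verb = TRUE)      # partial match: "verb" identifies only "verbose"
r3 <- f(verb = TRUE, 3)      # named argument matched first, then 3 goes to value
r1
```

A partial name has to identify exactly one argument. Here `v = TRUE` would be ambiguous between `value` and `verbose`, and R would report an error.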

### Defining a Function

In [41]:
f <- function(a, b = 1, c = 2, d = NULL) {

}

In addition to not specifying a default value, you can also set an argument value to `NULL`.
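A common use of a `NULL` default is to mark an argument as optional and test for it with `is.null()`. The function body below is a made-up sketch built on the same signature.

```r
## Illustrative function: d is optional and defaults to NULL
f <- function(a, b = 1, c = 2, d = NULL) {
        total <- a + b + c
        if(!is.null(d)) {
                total <- total + d
        }
        total
}

f(1)           # d is left out
f(1, d = 10)   # d is supplied
```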

### Lazy Evaluation

Arguments to functions are evaluated _lazily_, so they are evaluated only as needed.

In [43]:
f <- function(a, b) {
        a^2
} 
f(2)

This function never actually uses the argument `b`, so calling `f(2)` will not produce an error because the 2 gets positionally matched to `a`.

### Lazy Evaluation

In [44]:
f <- function(a, b) {
        print(a)
        print(b)
}
f(45)

[1] 45


ERROR: Error in print(b): argument "b" is missing, with no default


Notice that “45” got printed first before the error was triggered. This is because `b` did not have to be evaluated until after `print(a)`. Once the function tried to evaluate `print(b)` it had to throw an error.

### The “...” Argument

The ... argument indicates a variable number of arguments that are usually passed on to other functions.

- ... is often used when extending another function and you don’t want to copy the entire argument list of the original function

In [45]:
myplot <- function(x, y, type = "l", ...) {
        plot(x, y, type = type, ...)
}

- Generic functions use ... so that extra arguments can be passed to methods
(more on this later).

In [49]:
mean

### The “...” Argument

The ... argument is also necessary when the number of arguments passed to the function cannot be known in advance.


In [51]:
args(paste)

In [52]:
args(cat)
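Inside your own function, the arguments passed through ... can be collected with `list(...)`. The function `countArgs()` below is a made-up sketch.

```r
## Illustrative function that accepts any number of arguments
countArgs <- function(...) {
        args <- list(...)   # collect everything passed through ...
        length(args)
}

countArgs()
countArgs(1, "a", TRUE)
```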

### Arguments Coming After the “...” Argument

One catch with ... is that any arguments that appear _after_ ... on the argument list must be named explicitly and cannot be partially matched.


In [53]:
args(paste)

In [54]:
paste("a", "b", sep = ":")

In [55]:
paste("a", "b", se = ":")
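Storing the two results side by side makes the difference visible. The variable names are illustrative.

```r
full    <- paste("a", "b", sep = ":")   # sep is matched by its full name
partial <- paste("a", "b", se = ":")    # se is NOT matched to sep

full      # "a:b"
partial   # "a b :"
```

In the second call `se = ":"` is absorbed by ... and pasted as a third string, with the default separator (a single space) between the pieces.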

### More examples
Mixing the dots argument and named arguments:

In [48]:
f <- function(aa=7, ...) {
        print(paste("aa: ",aa))
        print(paste("arr: ", ...))
        print("")
}
f(1, 2)
f(1, aa=2, 3)
f(1, a=2, 3)

[1] "aa:  1"
[1] "arr:  2"
[1] ""
[1] "aa:  2"
[1] "arr:  1 3"
[1] ""
[1] "aa:  2"
[1] "arr:  1 3"
[1] ""


### More examples
Generating sequences with special step:

In [49]:
0:5 * 2

In [52]:
seq(1, 10, by=2)

### Questions

Practice with RStudio.