In [1]:
options(jupyter.rich_display = FALSE)

# BASIC R FEATURES

## Understanding R

> “To understand computations in R, two slogans are helpful:
> 
> Everything that exists is an object.
>
> Everything that happens is a function call.”
>
> — John Chambers

(http://adv-r.had.co.nz/Functions.html)

<img src="https://statweb.stanford.edu/~jmc4/CopyPhoto.jpg" width="200">

## Data types

The basic data structure in R is the atomic vector. There is no separate scalar type; a single value is a vector of length 1:

A numeric vector of size 1

In [2]:
class(1)

[1] "numeric"

An integer vector of size 1

In [3]:
class(1L)

[1] "integer"

A character vector of size 1

In [4]:
class("a")

[1] "character"

A logical vector of size 1

In [5]:
class(T)

[1] "logical"

Data types are checked with is.xxx() functions:

In [6]:
is.logical(T)
is.integer("a")

[1] TRUE

[1] FALSE

And converted with as.xxx() functions

In [7]:
1
as.character(1)

[1] 1

[1] "1"

## Operators

Before going on to vectors in detail, let's cover basic operators in R:

### Arithmetic

The four basic operations:

In [8]:
1 + 1

[1] 2

In [9]:
2 / 1

[1] 2

In [10]:
4 - 3

[1] 1

In [11]:
5 * 3

[1] 15

Exponentiation

In [12]:
2^3

[1] 8

Modulo operator:

In [13]:
10 %% 3

[1] 1

Floor division operator:

In [14]:
10 %/% 3

[1] 3

### Logical

And operator

In [15]:
T & F

[1] FALSE

Or operator

In [16]:
T | F

[1] TRUE

Not operator

In [17]:
!T

[1] FALSE

Comparison operators:

In [18]:
4 > 3

[1] TRUE

In [19]:
3 >= 3

[1] TRUE

In [20]:
3 < 4

[1] TRUE

In [21]:
3 <= 3

[1] TRUE

In [22]:
3 == 4

[1] FALSE

In [23]:
3 != 4

[1] TRUE

## Objects, variables and assignments

Assignment is done with the "<-" operator:

In [24]:
var_1 <- 1

In [25]:
var_1

[1] 1

In [26]:
class(var_1)

[1] "numeric"

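R has copy-on-modify semantics: assigning a variable to a new name does not duplicate the object right away, but as soon as one of them is modified it gets its own copy, so the other is left unchanged: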

In [27]:
var_2 <- var_1

In [28]:
var_1 <- var_1 + 1

In [29]:
var_1

[1] 2

In [30]:
var_2

[1] 1

An object name cannot start with a number and cannot contain a hyphen

## Vectors

A vector is an R object holding values of the same type

A single value is also a vector

A vector is created with assignment (no declaration is needed)

The c() (combine) function concatenates vectors into a single vector object:

In [31]:
var_3 <- c(1, 2, 3)

In [32]:
class(var_3)

[1] "numeric"

In [33]:
var_4 <- c("a", "b", "c")
var_4

[1] "a" "b" "c"

The colon operator (:) creates a sequence of integers:

In [34]:
var_5 <- 1:3
var_5
class(var_5)

[1] 1 2 3

[1] "integer"

### Vector attributes

A vector can have a names attribute:

In [35]:
var_6 <- 1:3
names(var_6) <- c("a", "b", "c")
var_6

a b c 
1 2 3 

In [36]:
attributes(var_6)

$names
[1] "a" "b" "c"


A vector does not have a dimension attribute (it is not a 1-dimensional array):

In [37]:
dim(var_6)

NULL

But it has length:

In [38]:
length(var_6)

[1] 3

### Vector subsetting

A vector can be subset by:

- A vector of numeric indices
- A vector of logical values for inclusion
- A vector of character values for names

**VECTOR INDEXING IN R STARTS AT 1 NOT 0!**

In [39]:
var_6

a b c 
1 2 3 

In [40]:
var_6[2:3]

b c 
2 3 

The logical vector should be of the same length as the subsetted vector (a shorter one is recycled):

In [41]:
var_6[c(T, F, T)]

a c 
1 3 

In [42]:
var_6[c("a", "b")]

a b 
1 2 

Subsetting can be used for both retrieval and assignment

In [43]:
var_7 <- var_6
var_7

a b c 
1 2 3 

In [44]:
var_7[2] <- var_7[3]

In [45]:
var_7

a b c 
1 3 3 

Non-existent items of a vector are automatically created when they are assigned to (any gap is filled with NA):

In [46]:
var_7

a b c 
1 3 3 

In [47]:
var_7[5]

<NA> 
  NA 

In [48]:
var_7[5] <- 10
var_7

 a  b  c       
 1  3  3 NA 10 

Negative indices exclude those items:

In [49]:
var_7[-5]

 a  b  c    
 1  3  3 NA 
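The subsetting styles above are often combined with comparison operators. A minimal sketch, using a made-up named vector `scores` that is not part of the original notebook, of filtering and replacing values through a logical index:

```r
# hypothetical data, for illustration only
scores <- c(ann = 55, bob = 72, cat = 90, dan = 64)

# the comparison yields a logical vector that is then used as the index
passed <- scores[scores >= 60]
names(passed)                  # "bob" "cat" "dan"

# the same kind of index works on the left-hand side of an assignment
scores[scores < 60] <- 60
scores[["ann"]]                # 60

# which() converts a logical vector into numeric positions
high <- which(scores > 70)     # positions 2 and 3 (bob, cat)
```
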

### Dynamic typing

Variables in R are dynamically typed: a name can be reassigned to an object of a different type

In [50]:
var_8 <- 1:3

In [51]:
class(var_8)

[1] "integer"

In [52]:
var_8 <- "a"

In [53]:
class(var_8)

[1] "character"

When an R vector is updated partially with a different data type, other values are coerced when necessary

In [54]:
var_9 <- 1:3
var_9

[1] 1 2 3

In [55]:
class(var_9)

[1] "integer"

In [56]:
var_9[2] <- "a"

In [57]:
var_9
class(var_9)

[1] "1" "a" "3"

[1] "character"

### Recycling

In some operations involving vectors of different length, the shorter vector is recycled to the length of the longer one

In [58]:
var_10 <- 1:4
var_10

[1] 1 2 3 4

In [59]:
var_10[c(T, F)] # "T F" vector is recycled to length 4

[1] 1 3
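Recycling also applies to arithmetic. The sketch below (variable names are illustrative) adds a length-2 vector to a length-6 vector; the shorter one is reused three times:

```r
x <- 1:6

# c(10, 20) is recycled to 10 20 10 20 10 20 before the addition
recycled <- x + c(10, 20)
recycled             # 11 22 13 24 15 26

# a single value is just a length-1 vector recycled to full length
doubled <- x * 2
doubled              # 2 4 6 8 10 12
```

When the longer length is not a multiple of the shorter one, R still recycles but issues a warning.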

### Initializing an empty vector

An empty vector can be initialized by assigning NULL:

In [60]:
var_11 <- NULL
var_11
length(var_11)
class(var_11)

NULL

[1] 0

[1] "NULL"

By c() function:

In [61]:
var_12 <- c()
var_12
length(var_12)
class(var_12)

NULL

[1] 0

[1] "NULL"

Or with the constructor function of the appropriate type:

In [62]:
var_13 <- integer(0)

In [63]:
var_13

integer(0)

In [64]:
length(var_13)
class(var_13)

[1] 0

[1] "integer"

### Vectorization

Many basic operations are vectorized in R: the operation is applied to every element of the vector without an explicit loop

In [65]:
1:10
1:10 + 1

 [1]  1  2  3  4  5  6  7  8  9 10

 [1]  2  3  4  5  6  7  8  9 10 11

Thanks to this feature, vectorized R code can approach the speed of compiled C code

Vectorization is, in fact, an implicit loop done at the C level (R is mostly written in C and Fortran)
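To make the "implicit loop" concrete, here is a small sketch that computes squares twice, once with an explicit R-level loop and once with the vectorized operator. The variable names are illustrative:

```r
x <- 1:1000

# explicit loop: one R-level iteration per element
loop_result <- numeric(length(x))
for (i in seq_along(x)) {
    loop_result[i] <- x[i]^2
}

# vectorized: the loop over elements happens inside compiled code
vec_result <- x^2

identical(loop_result, vec_result)   # TRUE
```
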

#### Vectorized and non-vectorized and/or

& is vectorized:

In [66]:
c(T, T, F) & c(F, T, T)

[1] FALSE  TRUE FALSE

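While && is not. In the R version used here it checks only the first items; since R 4.3.0, supplying a vector longer than 1 to && or || is an error: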

In [67]:
c(T, T, F) && c(F, T, T)

[1] FALSE

The same is true for | and || also
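A sketch of how the two families are typically used: & and | for element-wise work (reduced with all() or any() when one value is needed), && and || for single conditions where short-circuit evaluation matters. Variable names are illustrative:

```r
x <- c(TRUE, TRUE, FALSE)
y <- c(FALSE, TRUE, TRUE)

# reduce an element-wise result to one logical value
all_both <- all(x & y)    # FALSE
any_both <- any(x & y)    # TRUE

# && short-circuits: the right-hand side is not evaluated
# when the left-hand side is already FALSE
v <- NULL
safe <- !is.null(v) && v[1] > 0
safe                      # FALSE
```
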

### %in% operator

A set operator that checks membership. It returns a logical vector with the length of the left-hand side:

In [68]:
1:5 %in% 3:10

[1] FALSE FALSE  TRUE  TRUE  TRUE

It is not symmetric

In [69]:
3:10 %in% 1:5

[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
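Since %in% returns a logical vector, it can be used directly as a subsetting index. The sketch below also shows the related base functions intersect(), setdiff() and match(); the variable names are illustrative:

```r
a <- 1:5
b <- 3:10

common  <- a[a %in% b]         # 3 4 5
inter   <- intersect(a, b)     # 3 4 5
only_a  <- setdiff(a, b)       # 1 2

# match() returns the position of each value in b, or NA when absent
pos <- match(c(4, 11), b)      # 2 NA
```
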

### rep() function

rep() function repeats given values in a vector to create a longer vector:

In [70]:
rep(1:3, 2)

[1] 1 2 3 1 2 3

In [71]:
rep(1:3, each = 2)

[1] 1 1 2 2 3 3

The return values are like collated or non-collated multi-set print-outs from a printer

## Special variables and values

### NA

Missing values. Each atomic type has its own NA value (NA_integer_, NA_real_, NA_character_), so NA can be used with any data type; a bare NA is logical and is coerced as needed:

In [72]:
class(NA)

[1] "logical"

For integers:

In [73]:
var_14 <- 1:3
var_14
class(var_14)

[1] 1 2 3

[1] "integer"

In [74]:
var_14[2] <- NA
var_14

[1]  1 NA  3

In [75]:
var_14[2]
class(var_14)

[1] NA

[1] "integer"

For characters:

In [76]:
var_15 <- c("a", "b", "c")
var_15
class(var_15)

[1] "a" "b" "c"

[1] "character"

In [77]:
var_15[2] <- NA
var_15

[1] "a" NA  "c"

In [78]:
var_15[2]
class(var_15[2])

[1] NA

[1] "character"
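Because NA means "unknown", comparisons with it are themselves unknown. A short sketch of testing for and working around missing values with is.na() and the na.rm argument:

```r
x <- c(1, NA, 3)

# comparing with NA never gives TRUE or FALSE
cmp <- x == NA            # NA NA NA

# is.na() is the right test for missingness
miss <- is.na(x)          # FALSE TRUE FALSE

# many summary functions propagate NA unless told to drop it
m_all  <- mean(x)                  # NA
m_seen <- mean(x, na.rm = TRUE)    # 2
```
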

### NULL

The NULL object stands for a non-existent value, in contrast to NA, which marks a value that exists but is missing

In [79]:
length(NA)
length(NULL)

[1] 1

[1] 0

NA is a logical vector of length 1

NULL is not a vector: it has length 0 and its own type, "NULL"

In [80]:
class(NA)
class(NULL)

[1] "logical"

[1] "NULL"

In [81]:
c(1, NA, 3)

[1]  1 NA  3

In [82]:
c(1, NULL, 3)

[1] 1 3

### Inf, -Inf

Holds positive and negative infinite ($\infty$) values

In [83]:
var_16 <- Inf
var_16

[1] Inf

In [84]:
Inf + 1e17

[1] Inf

In [85]:
Inf / 1e22

[1] Inf

In [86]:
-Inf + 1e22

[1] -Inf

Dividing a non-zero number by zero also creates an Inf value:

In [87]:
1 / 0

[1] Inf

In [88]:
-1 / 0

[1] -Inf

Division by Inf creates 0 value:

In [89]:
3 / Inf

[1] 0

### NaN

Not a Number: the result of undefined operations such as Inf - Inf or 0 / 0

Has numeric type

In [90]:
Inf - Inf

[1] NaN

In [91]:
class(Inf - Inf)

[1] "numeric"

### letters and LETTERS

Built-in constants holding the lower and upper case letters of the English alphabet:

In [92]:
letters

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"

In [93]:
LETTERS

 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"

### pi

In [94]:
pi

[1] 3.141593

## Numeric accuracy

Numeric values are stored as double-precision floating point numbers, which carry 53 bits of precision (about 15 to 17 significant decimal digits)

Add-on packages such as Rmpfr allow more precision with special data types

The "digits" option controls how many significant digits are printed (7 by default, 22 at most)

In [95]:
getOption("digits")

[1] 7

In [96]:
options(digits = 1); 1/3
options(digits = 5); 1/3
options(digits = 10); 1/3
options(digits = 22); 1/3

[1] 0.3

[1] 0.33333

[1] 0.3333333333

[1] 0.3333333333333333148296

In [97]:
options(digits = 7)
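A practical consequence of finite precision is that exact equality tests on decimal fractions can fail. A minimal sketch comparing == with all.equal(), which compares up to a small tolerance:

```r
# 0.1, 0.2 and 0.3 have no exact binary representation
exact_equal <- (0.1 + 0.2 == 0.3)
exact_equal                          # FALSE

# all.equal() tolerates tiny differences; wrap it in isTRUE() for use in if()
near_equal <- isTRUE(all.equal(0.1 + 0.2, 0.3))
near_equal                           # TRUE

# printing more digits reveals the stored value
print(0.1 + 0.2, digits = 17)
```
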

## Scientific notation

Very large or very small numbers are automatically printed in scientific notation:

In [98]:
getOption("scipen")

[1] 0

In [99]:
1111111111111111

[1] 1.111111e+15

In [100]:
0.000000000000000000001

[1] 1e-21

To disable this, set the scipen option to a large value such as 999:

In [101]:
options(scipen = 999)

In [102]:
1111111111111111

[1] 1111111111111111

In [103]:
0.00000000000000000000001

[1] 0.00000000000000000000001

## Numeric limits

The largest value that the integer type can hold is 2^31 - 1:

In [104]:
as.integer(2^31 - 1)

[1] 2147483647

In [105]:
as.integer(2^31)

“NAs introduced by coercion to integer range”

[1] NA

A numeric (double) value can represent every integer exactly only up to 2^53; beyond that, 2^53 + 1 is rounded back to 2^53:

In [106]:
2^53
2^53 - 1
2^53 + 1

[1] 9007199254740992

[1] 9007199254740991

[1] 9007199254740992

With add-on libraries like gmp, much larger values can be held properly with special data types
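These limits can also be read from the built-in .Machine list rather than typed by hand. A short sketch (variable names are illustrative):

```r
int_max <- .Machine$integer.max
int_max                           # 2147483647
int_max == 2^31 - 1               # TRUE

# integer arithmetic past the limit overflows to NA (with a warning)
overflow <- suppressWarnings(int_max + 1L)
is.na(overflow)                   # TRUE

# adding a double instead promotes the result to numeric
promoted <- int_max + 1
class(promoted)                   # "numeric"

# beyond 2^53 consecutive integers are no longer distinguishable
same <- (2^53 == 2^53 + 1)        # TRUE
```
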

## Functions

R is intended as a functional programming language:

> R, at its heart, is a functional programming (FP) language. This means that it provides many tools for the creation and manipulation of functions. In particular, R has what’s known as first class functions. You can do anything with functions that you can do with vectors: you can assign them to variables, store them in lists, pass them as arguments to other functions, create them inside functions, and even return them as the result of a function.

(http://adv-r.had.co.nz/Functional-programming.html)

Apart from built-in functions, new functions can be created with the "function" keyword

These functions can be assigned to named objects (the usual way):

In [107]:
add_one <- function(x) x + 1

In [108]:
add_one(1)

[1] 2
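The "first class" property described in the quote can be sketched in a few lines: functions are passed as arguments, returned as results, and stored in a list. The helpers `compose2` and `double_it` are hypothetical names introduced only for this illustration:

```r
add_one   <- function(x) x + 1
double_it <- function(x) x * 2

# a function that takes two functions and returns a new one
compose2 <- function(f, g) function(x) g(f(x))

add_then_double <- compose2(add_one, double_it)
add_then_double(3)                 # (3 + 1) * 2 = 8

# functions stored in a list and applied one by one
ops <- list(inc = add_one, dbl = double_it)
results <- sapply(ops, function(f) f(10))
results                            # inc = 11, dbl = 20
```
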

The function execution terminates after the first return() call encountered

If no return() call exists, the function returns the value of the last executed statement

In [109]:
func_1 <- function(x)
{
    return("stop here")
    return(x + 1)
}

func_1(3)

[1] "stop here"

In [110]:
func_2 <- function(x)
{
    x
    x + 1
    x + 2
    x + 3
}

func_2(5)

[1] 8

### return value

A function can return only a single object, but that object may be one with multiple values or a nested one

In [111]:
func_3 <- function(x = 1, y = 3)
{
    a <- (x + y)^2
    b <- (x - y)^2
    return(a, b)
}

func_3()

ERROR: Error in return(a, b): multi-argument returns are not permitted


In [136]:
func_3 <- function(x = 1, y = 3)
{
    a <- (x + y)^2
    b <- (x - y)^2
    return(c(a, b))
}

func_3()

[1] 16  4
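When the returned values deserve names or have different types, a list is a common alternative to c(). A sketch with a hypothetical function `sq_sum_diff`:

```r
sq_sum_diff <- function(x = 1, y = 3)
{
    # a single list object carries both results, each with its own name
    list(sum_sq = (x + y)^2, diff_sq = (x - y)^2)
}

res <- sq_sum_diff()
res$sum_sq     # 16
res$diff_sq    # 4
```
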

### Default values for arguments

When default values are defined for arguments, they are used whenever no value is supplied for that argument:

In [137]:
func_3 <- function(x = 1, y = 3)
{
    (x + y)^2
}

In [138]:
func_3()

[1] 16

### Operators as functions

LISP is one of the inspirations of the R language

Every operator in R is also a function:

In [139]:
exists("var_17")
"<-"("var_17", 1)
exists("var_17")
var_17

[1] FALSE

[1] TRUE

[1] 1

In [140]:
var_18 <- 1:3

In [141]:
var_18 <- ":"(1, 3)
var_18

[1] 1 2 3

In [142]:
var_18[2] <- 4
var_18

[1] 1 4 3

In [143]:
"<-"("["(var_18, 3), 5)
var_18

[1] 1 4 5
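Because operators are ordinary functions, they can be called by name (in backticks), passed to other functions, and even defined by the user. A short sketch; the infix operator `%+%` is a made-up name for illustration:

```r
# calling an operator in prefix form
three <- `+`(1, 2)                  # 3

# passing an operator as an argument: subtract 1 from each element
shifted <- sapply(1:3, `-`, 1)      # 0 1 2

# subsetting is a function too
second <- `[`(c(10, 20, 30), 2)     # 20

# user-defined infix operators are functions named %...%
`%+%` <- function(a, b) paste(a, b)
joined <- "a" %+% "b"               # "a b"
```
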

## Environments and scoping

R is lexically scoped. Functions create their own environments on call, and objects live in the environment where they were created

For example a global object is created:

In [144]:
var_19 <- 3

In [145]:
func_3 <- function()
{
    var_19 <- 5
    return(var_19)
}

In [146]:
func_3()
var_19

[1] 5

[1] 3

var_19 at the global environment and at func_3's scope are different
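Lexical scoping also means that a function looks up free variables in the environment where it was defined, not where it is called. A minimal sketch with a hypothetical function factory `make_adder`:

```r
make_adder <- function(n)
{
    # the returned function remembers the n of this call
    function(x) x + n
}

add5 <- make_adder(5)

# a global n does not interfere with the n captured by add5
n <- 100
add5(1)     # 6
```
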

However, the superassignment operator "<<-" can modify objects in an enclosing environment (here the global one) from a function's scope

In [147]:
func_4 <- function(x)
{
    var_20 <<- x
    return(var_20)
}

In [148]:
var_20 <- 10
func_4(12)
var_20

[1] 12

[1] 12


## Control structures

### Loops

#### for loop

Used for a definite number of iterations

for creates a new variable that iterates through a vector object:

In [None]:
for (i in 1:10)
{
    print(i^2)
}

The iterated vector is evaluated once when the loop starts, so modifying it inside the loop does not affect the iteration:

In [None]:
vec_1 <- 1:10

for (i in vec_1)
{
    vec_1[i] <- vec_1[i] + 1
    print(i)
}

vec_1

Although vec_1 is modified inside the loop, the iteration went over the original object

Combined with conditions, "next" statement instantly skips to the next iteration while "break" terminates the execution of the loop
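A small sketch of both statements in one loop: even numbers are skipped with next, and the loop stops with break once the counter exceeds 7. The variable name `odd_squares` is illustrative:

```r
odd_squares <- c()

for (i in 1:10)
{
    if (i %% 2 == 0) next     # skip even numbers
    if (i > 7) break          # stop the loop entirely at i = 9
    odd_squares <- c(odd_squares, i^2)
}

odd_squares                   # 1 9 25 49
```
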

#### while loop

A while loop is used when the number of iterations is not predetermined but depends on a logical condition

A while loop does not iterate through a vector (hence it is not limited by object sizes), and the objects in the condition must already exist

while continues as long as the condition evaluates to TRUE

In [None]:
x <- 5
while(x < 10)
{
    print(x < 10)
    print("x is still below 10")
    x <- x + 1
}

x

### Conditionals

If else statements on logical conditions:

In [None]:
check3 <- function(x)
{
    if (x < 3)
    {
        print("x is smaller than 3")
    }
    else
    {
        print("x is larger than or equal to 3")
    }
}

check3(2)
check3(4)

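The condition must be a single logical value. Older R versions used only the first element of a longer vector (with a warning); since R 4.2.0, a condition of length greater than 1 is an error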

A vectorized version is given by ifelse() function:

In [None]:
ifelse(1:5 < 3,
       "x is smaller than 3",
       "x is larger than or equal to 3"
      )

## Recursion

R supports recursion (the stack size and maximum depth can be controlled)

In [None]:
factorial_r <- function(x)
{
    # base case: 0! and 1! are both 1, so x = 0 also terminates
    if(x <= 1)
    {
        return(1)
    }
    else
    {
        return(x * factorial_r(x-1))
    }
}

In [149]:
factorial_r(5)

[1] 120


## Indenting

Unlike in Python, indentation is not significant for the interpretation of the code

However, for good style, proper indentation should be followed

## Other data structures

### Matrix

If vector is a ball of wool:
![wool](http://images.esellerpro.com/2278/I/699/36/PureMerinoPM9.jpg)

a matrix is a pullover:

![pullover](https://amp.businessinsider.com/images/58516db4ca7f0cdf1e8b56ad-1136-852.jpg)

A pullover inherits all attributes of the wool (its color, softness, the ability to shrink when soaked in hot water, etc.)

However, the wool does not have all attributes of the pullover (no sleeves, collars)

You can think of a matrix as a folded form of a vector

We can create a matrix out of vector(s) by folding or binding

**JUST LIKE A VECTOR, A MATRIX HAS VALUES OF THE SAME TYPE**

#### Matrix out of a vector

In [150]:
vec_1 <- 1:20

In [151]:
mat_1 <- matrix(vec_1, nrow = 4)
mat_1

     [,1] [,2] [,3] [,4] [,5]
[1,] 1    5     9   13   17  
[2,] 2    6    10   14   18  
[3,] 3    7    11   15   19  
[4,] 4    8    12   16   20  

It is created with column-major order by default
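The fill order can be switched with the byrow argument of matrix(). A small sketch contrasting the two orders (variable names are illustrative):

```r
m_col <- matrix(1:6, nrow = 2)                 # filled column by column
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)   # filled row by row

m_col[1, ]    # 1 3 5
m_row[1, ]    # 1 2 3

# filling by row equals transposing a column-filled matrix of the flipped shape
identical(m_row, t(matrix(1:6, nrow = 3)))     # TRUE
```
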

Let's provide row and column names:

In [152]:
rownames(mat_1) <- letters[1:nrow(mat_1)]
mat_1

  [,1] [,2] [,3] [,4] [,5]
a 1    5     9   13   17  
b 2    6    10   14   18  
c 3    7    11   15   19  
d 4    8    12   16   20  

In [153]:
colnames(mat_1) <- letters[1:ncol(mat_1) + nrow(mat_1)]
mat_1

  e f g  h  i 
a 1 5  9 13 17
b 2 6 10 14 18
c 3 7 11 15 19
d 4 8 12 16 20

Now let's check dimensions, attributes and structure:

In [154]:
dim(mat_1)
length(mat_1)
attributes(mat_1)
str(mat_1)

[1] 4 5

[1] 20

$dim
[1] 4 5

$dimnames
$dimnames[[1]]
[1] "a" "b" "c" "d"

$dimnames[[2]]
[1] "e" "f" "g" "h" "i"



 int [1:4, 1:5] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:4] "a" "b" "c" "d"
  ..$ : chr [1:5] "e" "f" "g" "h" ...


A matrix has dim and dimnames (row and column names) attributes

Note that a matrix is a dimensioned vector, so it still has a length equal to row * column counts

However a vector does not have dimensions!

In [155]:
mat_1

  e f g  h  i 
a 1 5  9 13 17
b 2 6 10 14 18
c 3 7 11 15 19
d 4 8 12 16 20

#### Getting the dimensions of a matrix

dim() gets all dimensions as a vector:

In [237]:
dim(mat_1)

[1] 4 5

nrow() gets the number of rows:

In [239]:
nrow(mat_1)

[1] 4

ncol() gets the number of columns:

In [240]:
ncol(mat_1)

[1] 5

#### rbind() vectors into a matrix

rbind() creates a matrix where vectors become rows of the matrix and vector names become rownames:

In [156]:
vec_2 <- 1:3
vec_3 <- 10:8
vec_2
vec_3

[1] 1 2 3

[1] 10  9  8

In [157]:
mat_2 <- rbind(vec_2, vec_3)
mat_2
class(mat_2)
attributes(mat_2)

      [,1] [,2] [,3]
vec_2  1   2    3   
vec_3 10   9    8   

[1] "matrix"

$dim
[1] 2 3

$dimnames
$dimnames[[1]]
[1] "vec_2" "vec_3"

$dimnames[[2]]
NULL



#### cbind() vectors into a matrix

cbind() creates a matrix where vectors become columns of the matrix and vector names become colnames:

In [158]:
vec_2 <- 1:3
vec_3 <- 10:8
vec_2
vec_3

[1] 1 2 3

[1] 10  9  8

In [159]:
mat_2 <- cbind(vec_2, vec_3)
mat_2
class(mat_2)
attributes(mat_2)

     vec_2 vec_3
[1,] 1     10   
[2,] 2      9   
[3,] 3      8   

[1] "matrix"

$dim
[1] 3 2

$dimnames
$dimnames[[1]]
NULL

$dimnames[[2]]
[1] "vec_2" "vec_3"



#### Subsetting matrices with two vector arguments

Matrices can be subsetted by two arguments to return another matrix:

Two index vectors:

In [160]:
mat_1[2:3,3:5]

  g  h  i 
b 10 14 18
c 11 15 19

Or two character vectors for dimension names:

In [161]:
mat_1[c("a", "c"),c("f", "g", "h")]

  f g  h 
a 5  9 13
c 7 11 15

Or two logical vectors:

In [162]:
mat_1[c(T, F), c(F, F, T, T, F)]

  g  h 
a  9 13
c 11 15

Negative indices exclude items:

In [163]:
mat_1[-(1:2), -3]

  e f h  i 
c 3 7 15 19
d 4 8 16 20

#### Drop

When subsetted with two arguments, the object retains its matrix structure:

In [164]:
mat_1[2:3,3:5]
class(mat_1[2:3,3:5])

  g  h  i 
b 10 14 18
c 11 15 19

[1] "matrix"

However, when only a single row or column is returned, the matrix structure is dropped:

In [165]:
mat_1[2,3:5]
class(mat_1[2,3:5])

 g  h  i 
10 14 18 

[1] "integer"

To prevent this, the "drop" argument must be set to F:

In [166]:
mat_1[2,3:5, drop = F]
class(mat_1[2,3:5, drop = F])

  g  h  i 
b 10 14 18

[1] "matrix"

#### Subsetting a matrix with a single vector argument

What if we subset with a single index:

In [167]:
mat_1[2:10]

[1]  2  3  4  5  6  7  8  9 10

In [168]:
class(mat_1[2:10])

[1] "integer"

In [169]:
attributes(mat_1[2:10])

NULL

It is converted to a vector automatically

#### Subsetting a matrix by another matrix

Non-contiguous cells from a matrix can be subsetted by using a two-column matrix:

In [170]:
row_indices <- c(4, 1, 2)
col_indices <- c(2, 5, 3)

index_mat <- cbind(row_indices, col_indices)
index_mat

     row_indices col_indices
[1,] 4           2          
[2,] 1           5          
[3,] 2           3          

In [171]:
mat_1

  e f g  h  i 
a 1 5  9 13 17
b 2 6 10 14 18
c 3 7 11 15 19
d 4 8 12 16 20

In [172]:
mat_1[index_mat]

[1]  8 17 10

row() and col() functions return the row and column indices of all cells in a matrix:

In [173]:
row(mat_1)
col(mat_1)

     [,1] [,2] [,3] [,4] [,5]
[1,] 1    1    1    1    1   
[2,] 2    2    2    2    2   
[3,] 3    3    3    3    3   
[4,] 4    4    4    4    4   

     [,1] [,2] [,3] [,4] [,5]
[1,] 1    2    3    4    5   
[2,] 1    2    3    4    5   
[3,] 1    2    3    4    5   
[4,] 1    2    3    4    5   

These returned matrices can be used for subsetting and manipulating matrices in more complicated ways, for example setting an anti-diagonal to zero:

In [174]:
mat_1[row(mat_1) + col(mat_1) == 4] <- 0
mat_1

  e f g  h  i 
a 1 5  0 13 17
b 2 0 10 14 18
c 0 7 11 15 19
d 4 8 12 16 20

In [175]:
mat_1[row(mat_1)]

 [1] 1 2 0 4 1 2 0 4 1 2 0 4 1 2 0 4 1 2 0 4
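The same row()/col() idea can pick out the main diagonal and the anti-diagonal of a square matrix. A sketch on a fresh 3 x 3 matrix `m` (a hypothetical object, separate from mat_1):

```r
m <- matrix(1:9, nrow = 3)

# main diagonal: cells where row index equals column index
main_diag <- m[row(m) == col(m)]                 # 1 5 9
identical(main_diag, diag(m))                    # TRUE

# anti-diagonal: cells where row + col equals nrow + 1
# (values come out in column-major order)
anti_diag <- m[row(m) + col(m) == nrow(m) + 1]   # 3 5 7
```
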

#### Transpose a matrix

The t() function transposes a matrix: rows become columns and columns become rows:

In [235]:
mat_1

  e f g  h  i 
a 1 5  0 13 17
b 2 0 10 14 18
c 0 7 11 15 19
d 4 8 12 16 20

In [236]:
t(mat_1)

  a  b  c  d 
e  1  2  0  4
f  5  0  7  8
g  0 10 11 12
h 13 14 15 16
i 17 18 19 20

#### Vectorization

Many operators and functions work in a vectorized manner on matrices as they do on vectors:

In [176]:
mat_1

  e f g  h  i 
a 1 5  0 13 17
b 2 0 10 14 18
c 0 7 11 15 19
d 4 8 12 16 20

In [230]:
mat_1 * 2

  e f  g  h  i 
a 2 10  0 26 34
b 4  0 20 28 36
c 0 14 22 30 38
d 8 16 24 32 40

In [231]:
sqrt(mat_1)

  e        f        g        h        i       
a 1.000000 2.236068 0.000000 3.605551 4.123106
b 1.414214 0.000000 3.162278 3.741657 4.242641
c 0.000000 2.645751 3.316625 3.872983 4.358899
d 2.000000 2.828427 3.464102 4.000000 4.472136

#### %*% operator

The \* operator performs element-wise multiplication of matrices:

In [232]:
mat_2 <- matrix(20:1, nrow = 4, byrow = T)
mat_2

     [,1] [,2] [,3] [,4] [,5]
[1,] 20   19   18   17   16  
[2,] 15   14   13   12   11  
[3,] 10    9    8    7    6  
[4,]  5    4    3    2    1  

In [233]:
mat_1 * mat_2

  e  f  g   h   i  
a 20 95   0 221 272
b 30  0 130 168 198
c  0 63  88 105 114
d 20 32  36  32  20

The %\*% operator performs matrix multiplication of two matrices of sizes n x m and m x o, giving an n x o matrix:

In [234]:
mat_1 %*% t(mat_2)

  [,1] [,2] [,3] [,4]
a  608 428  248   68 
b  746 526  306   86 
c  890 630  370  110 
d 1040 740  440  140 
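The size rule can be checked on a small case that is easy to verify by hand. A sketch with two hypothetical matrices `A` (2 x 3) and `B` (3 x 2):

```r
A <- matrix(1:6, nrow = 2)    # rows: 1 3 5 / 2 4 6
B <- matrix(1:6, nrow = 3)    # rows: 1 4 / 2 5 / 3 6

P <- A %*% B
dim(P)                        # 2 2
P[1, 1]                       # 1*1 + 3*2 + 5*3 = 22
P[2, 2]                       # 2*4 + 4*5 + 6*6 = 64

# inner dimensions must agree: a 2 x 3 times a 2 x 3 matrix is an error
msg <- tryCatch(A %*% A, error = function(e) "non-conformable")
msg                           # "non-conformable"
```
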

#### outer() function

outer() applies a function to every pair of elements (the Cartesian product) of two vectors or arrays

It can create a matrix out of two vectors:

In [177]:
outer(1:5, 1:5, "*")

     [,1] [,2] [,3] [,4] [,5]
[1,] 1     2    3    4    5  
[2,] 2     4    6    8   10  
[3,] 3     6    9   12   15  
[4,] 4     8   12   16   20  
[5,] 5    10   15   20   25  

#### Extension packages

- bigmemory package creates matrix objects to handle larger data
- gpuR package creates matrices to be processed by GPUs at a much faster speed
- Matrix package handles sparse matrix objects

### List

List is a special kind of vector that holds R objects as values. It can be in a nested structure:

In [178]:
list_1 <- list(a_vector = 1:3,
              a_matrix = outer(1:2, 1:2),
              another_list = list(1, 2, 3))

list_1

$a_vector
[1] 1 2 3

$a_matrix
     [,1] [,2]
[1,]    1    2
[2,]    2    4

$another_list
$another_list[[1]]
[1] 1

$another_list[[2]]
[1] 2

$another_list[[3]]
[1] 3



Lists are very versatile and powerful objects in R for handling non-regular data

Hierarchical data structures such as JSON and XML can be converted back and forth into list objects

#### Subsetting and modifying lists

Lists can be subset with three operators:

A single bracket returns a list of requested items. Multiple items can be subsetted this way:

In [179]:
list_1[1:2]

$a_vector
[1] 1 2 3

$a_matrix
     [,1] [,2]
[1,]    1    2
[2,]    2    4


In [180]:
class(list_1[1:2])

[1] "list"

In [181]:
list_1[1]

$a_vector
[1] 1 2 3


In [182]:
class(list_1[1])

[1] "list"

Double bracket returns a single object. Numeric indices or names (with quotes) can be supplied. Multiple items cannot be subsetted:

In [183]:
list_1[[1]]

[1] 1 2 3

In [184]:
class(list_1[[1]])

[1] "integer"

In [185]:
list_1[["a_matrix"]]

     [,1] [,2]
[1,] 1    2   
[2,] 2    4   

In [186]:
class(list_1[["a_matrix"]])

[1] "matrix"

The $ operator returns a single object using the name of the item without quotes:

In [187]:
list_1$a_matrix

     [,1] [,2]
[1,] 1    2   
[2,] 2    4   

In [188]:
class(list_1$a_matrix)

[1] "matrix"

A new item can be added by c() (the new item should also be inside a list to be added as is)

In [189]:
list_1 <- c(list_1, another_vector = list(10:5))
list_1

$a_vector
[1] 1 2 3

$a_matrix
     [,1] [,2]
[1,]    1    2
[2,]    2    4

$another_list
$another_list[[1]]
[1] 1

$another_list[[2]]
[1] 2

$another_list[[3]]
[1] 3


$another_vector
[1] 10  9  8  7  6  5


By calling the name of the new item:

In [190]:
list_1$another_matrix <- matrix(1:4, nrow = 2)
list_1

$a_vector
[1] 1 2 3

$a_matrix
     [,1] [,2]
[1,]    1    2
[2,]    2    4

$another_list
$another_list[[1]]
[1] 1

$another_list[[2]]
[1] 2

$another_list[[3]]
[1] 3


$another_vector
[1] 10  9  8  7  6  5

$another_matrix
     [,1] [,2]
[1,]    1    3
[2,]    2    4


Or indexing:

In [191]:
list_1[[6]] <- list(1:3)
list_1

$a_vector
[1] 1 2 3

$a_matrix
     [,1] [,2]
[1,]    1    2
[2,]    2    4

$another_list
$another_list[[1]]
[1] 1

$another_list[[2]]
[1] 2

$another_list[[3]]
[1] 3


$another_vector
[1] 10  9  8  7  6  5

$another_matrix
     [,1] [,2]
[1,]    1    3
[2,]    2    4

[[6]]
[[6]][[1]]
[1] 1 2 3



Lists also support negative subsetting:

In [192]:
list_1[-1]

$a_matrix
     [,1] [,2]
[1,]    1    2
[2,]    2    4

$another_list
$another_list[[1]]
[1] 1

$another_list[[2]]
[1] 2

$another_list[[3]]
[1] 3


$another_vector
[1] 10  9  8  7  6  5

$another_matrix
     [,1] [,2]
[1,]    1    3
[2,]    2    4

[[5]]
[[5]][[1]]
[1] 1 2 3



#### Recursing a list

Nested lists can be traversed by chaining subsetting operators:

In [193]:
list_1

$a_vector
[1] 1 2 3

$a_matrix
     [,1] [,2]
[1,]    1    2
[2,]    2    4

$another_list
$another_list[[1]]
[1] 1

$another_list[[2]]
[1] 2

$another_list[[3]]
[1] 3


$another_vector
[1] 10  9  8  7  6  5

$another_matrix
     [,1] [,2]
[1,]    1    3
[2,]    2    4

[[6]]
[[6]][[1]]
[1] 1 2 3



In [194]:
list_1$another_list[[2]]

[1] 2
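When the nesting depth is not known in advance, a recursive function can walk the whole structure. A minimal sketch with a hypothetical helper `count_leaves` that counts the atomic values stored anywhere in a nested list:

```r
count_leaves <- function(x)
{
    # a non-list element is a leaf: count its values
    if (!is.list(x)) return(length(x))

    # otherwise descend into every item and add up the counts
    total <- 0
    for (item in x)
    {
        total <- total + count_leaves(item)
    }
    total
}

# hypothetical nested data for illustration
nested <- list(a = 1:3, b = list(4, list(5, 6)), c = "x")
count_leaves(nested)     # 3 + 1 + 2 + 1 = 7
```
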

#### Structure and attributes

In [195]:
str(list_1)

List of 6
 $ a_vector      : int [1:3] 1 2 3
 $ a_matrix      : num [1:2, 1:2] 1 2 2 4
 $ another_list  :List of 3
  ..$ : num 1
  ..$ : num 2
  ..$ : num 3
 $ another_vector: int [1:6] 10 9 8 7 6 5
 $ another_matrix: int [1:2, 1:2] 1 2 3 4
 $               :List of 1
  ..$ : int [1:3] 1 2 3


In [196]:
attributes(list_1)

$names
[1] "a_vector"       "a_matrix"       "another_list"   "another_vector"
[5] "another_matrix" ""              


List is closer to a vector than it is to a matrix: It does not have dimension, rownames or colnames attributes

A list can have names and it has a length

In [197]:
length(list_1)

[1] 6

#### unlisting

A list in R can be flattened into simpler vectors by unlist()

In [198]:
list_1

$a_vector
[1] 1 2 3

$a_matrix
     [,1] [,2]
[1,]    1    2
[2,]    2    4

$another_list
$another_list[[1]]
[1] 1

$another_list[[2]]
[1] 2

$another_list[[3]]
[1] 3


$another_vector
[1] 10  9  8  7  6  5

$another_matrix
     [,1] [,2]
[1,]    1    3
[2,]    2    4

[[6]]
[[6]][[1]]
[1] 1 2 3



In [199]:
unlist(list_1)
class(unlist(list_1))

      a_vector1       a_vector2       a_vector3       a_matrix1       a_matrix2 
              1               2               3               1               2 
      a_matrix3       a_matrix4   another_list1   another_list2   another_list3 
              2               4               1               2               3 
another_vector1 another_vector2 another_vector3 another_vector4 another_vector5 
             10               9               8               7               6 
another_vector6 another_matrix1 another_matrix2 another_matrix3 another_matrix4 
              5               1               2               3               4 
                                                
              1               2               3 

[1] "numeric"

#### Extension packages

- rlist package provides functionality to work with lists beyond base
- listviewer package makes navigating list or list-like hierarchical objects in R easy

### Data Frame

A data frame is a special type of list that is composed of vectors of the same length. The data types of the vectors may be different.

Although it is a "list" by nature, a data frame is treated like a matrix object since it has row and column dimensions and rownames for rows and names for columns

Data frame is the main data structure to use in data science since it allows for handling different data types in each column

In [200]:
df_1 <- data.frame(int1 = 1:5, char1 = letters[1:5], logi1 = c(T, T, F, T, T))
rownames(df_1) <- LETTERS[10:14]
df_1

  int1 char1 logi1
J 1    a      TRUE
K 2    b      TRUE
L 3    c     FALSE
M 4    d      TRUE
N 5    e      TRUE

In [201]:
str(df_1)
attributes(df_1)

'data.frame':	5 obs. of  3 variables:
 $ int1 : int  1 2 3 4 5
 $ char1: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
 $ logi1: logi  TRUE TRUE FALSE TRUE TRUE


$names
[1] "int1"  "char1" "logi1"

$class
[1] "data.frame"

$row.names
[1] "J" "K" "L" "M" "N"


The dimensions of a data frame are its row and column counts:

In [202]:
dim(df_1)

[1] 5 3

However length of a data frame is its column count (contrary to a matrix)

In [203]:
length(df_1)

[1] 3

#### subsetting

Data frames can be subsetted with two vectors (like a matrix) or one vector (like a list)

In [204]:
df_1[1:2, 1:2]

  int1 char1
J 1    a    
K 2    b    

In [205]:
df_1[2:3]

  char1 logi1
J a      TRUE
K b      TRUE
L c     FALSE
M d      TRUE
N e      TRUE

Data frames can be subsetted with negative indices like other R structures

In [206]:
df_1[-1]

  char1 logi1
J a      TRUE
K b      TRUE
L c     FALSE
M d      TRUE
N e      TRUE
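In practice rows are most often selected with a logical condition on one or more columns. A sketch on a small made-up data frame `df` (not one of the notebook's objects), also showing how a single column is extracted as a vector or kept as a data frame:

```r
# hypothetical data for illustration
df <- data.frame(id  = 1:5,
                 grp = c("a", "b", "a", "b", "a"),
                 ok  = c(TRUE, TRUE, FALSE, TRUE, TRUE),
                 stringsAsFactors = FALSE)

# logical row index, empty column index = keep all columns
sel <- df[df$id > 2 & df$ok, ]
sel$id                                   # 4 5

# a single column: as a vector (list-style or matrix-style) ...
id_vec <- df[["id"]]
identical(id_vec, df[, "id"])            # TRUE

# ... or as a one-column data frame with drop = FALSE
id_df <- df[, "id", drop = FALSE]
class(id_df)                             # "data.frame"
```
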

#### matrix to data.frame

A matrix can be converted to a data frame with as.data.frame:

In [207]:
df_2 <- as.data.frame(mat_1)
df_2
attributes(df_2)

  e f g  h  i 
a 1 5  0 13 17
b 2 0 10 14 18
c 0 7 11 15 19
d 4 8 12 16 20

$names
[1] "e" "f" "g" "h" "i"

$class
[1] "data.frame"

$row.names
[1] "a" "b" "c" "d"


#### list to data.frame

A list of vectors of same sizes can be converted to a data frame:

In [208]:
list_2 <- list(a = 1:3, b = letters[2:4], c = rep(T, 3))
list_2

$a
[1] 1 2 3

$b
[1] "b" "c" "d"

$c
[1] TRUE TRUE TRUE


In [209]:
df_3 <- as.data.frame(list_2)
df_3
attributes(df_3)

  a b c   
1 1 b TRUE
2 2 c TRUE
3 3 d TRUE

$names
[1] "a" "b" "c"

$class
[1] "data.frame"

$row.names
[1] 1 2 3


#### Factors

Apart from numeric, integer, logical and character values, many datasets in data science have categorical variables. These are represented by discrete values that are kept as integers internally but printed with descriptive labels.

They are called factors and they are mostly used with data frames (before R 4.0.0, data.frame() converted character columns to factors by default, which is why char1 in df_1 above is a factor)

First let's create a numeric variable:

In [210]:
vec_4 <- c(1,1,4,2,3,1)

And convert it into a factor with labels:

In [211]:
fct_1 <- factor(vec_4, levels = 1:4, labels = c("a", "b", "c", "d"))

In [212]:
fct_1

[1] a a d b c a
Levels: a b c d

We can append or modify a value with any of the defined labels:

In [213]:
fct_1[7] <- "a"

And we still have a vector of factor type:

In [214]:
class(fct_1)

[1] "factor"

However if we try to add a value of new label:

In [215]:
fct_1[8] <- "e"
fct_1

“invalid factor level, NA generated”

[1] a    a    d    b    c    a    a    <NA>
Levels: a b c d

It is not identified as a level and hence added as NA

Now get the unique levels (as labels)

In [216]:
levels(fct_1)

[1] "a" "b" "c" "d"

And add a new level:

In [217]:
levels(fct_1) <- c(levels(fct_1), "e")

In [218]:
fct_1

[1] a    a    d    b    c    a    a    <NA>
Levels: a b c d e

Now you can add a value of that new level:

In [219]:
fct_1[8] <- "e"
fct_1

[1] a a d b c a a e
Levels: a b c d e
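The integer codes behind a factor can be inspected directly, which also explains a common pitfall when converting factors of number-like labels. A short sketch with made-up data:

```r
sizes <- factor(c("m", "s", "l", "m"), levels = c("s", "m", "l"))

codes  <- as.integer(sizes)     # 2 1 3 2: positions in the levels vector
labels <- as.character(sizes)   # "m" "s" "l" "m"
counts <- table(sizes)          # s: 1, m: 2, l: 1

# pitfall: converting a factor of digits gives the codes, not the numbers
f <- factor(c("10", "20", "10"))
wrong <- as.integer(f)                  # 1 2 1
right <- as.numeric(as.character(f))    # 10 20 10
```
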

The forcats package from the tidyverse bundle provides more convenient functions to handle factors

#### Extension packages

- dplyr and tibble from tidyverse and data.table packages have better functionality and performance with data frames and they are the basis of data science with R

### Arrays

Arrays are the higher-dimensional version of matrices in R: they can have more than 2 dimensions, but they still have to contain values of the same type, like a matrix

In [220]:
ar_1 <- array(1:12, dim = c(2, 3, 2))
ar_1

, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12


In [221]:
str(ar_1)
attributes(ar_1)

 int [1:2, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ...


$dim
[1] 2 3 2


abind package extends the rbind and cbind functionality into n-dimensional arrays

#### Subsetting and modifying arrays

When drop argument with F value is not provided, subsetting an array with a single value on a margin will return a lower dimension object:

In [222]:
ar_1[1,,]
class(ar_1[1,,])

     [,1] [,2]
[1,] 1     7  
[2,] 3     9  
[3,] 5    11  

[1] "matrix"

In [223]:
ar_1[1,1,]
class(ar_1[1,1,])

[1] 1 7

[1] "integer"

Multiple values on all margins:

In [224]:
ar_1[1:2,1:2,1:2]
class(ar_1[1:2,1:2,1:2])

, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    7    9
[2,]    8   10


[1] "array"

Or the drop = F argument keeps the dimension of the array:

In [225]:
ar_1[1,1,1,drop = F]
class(ar_1[1,1,1,drop = F])

, , 1

     [,1]
[1,]    1


[1] "array"

Negative indices can be used for subsetting arrays

In [226]:
ar_1[-1,-1,-1, drop = F]

, , 1

     [,1] [,2]
[1,]   10   12


#### outer() to array

outer() function can create an array out of vector or matrix objects:

In [227]:
vec_4 <- 1:2
mat_2 <- outer(vec_4, vec_4, FUN = "*")
vec_4
mat_2

[1] 1 2

     [,1] [,2]
[1,] 1    2   
[2,] 2    4   

A 3D array:

In [228]:
ar_2 <- outer(vec_4, mat_2, FUN = "^")
ar_2
class(ar_2)
str(ar_2)
attributes(ar_2)

, , 1

     [,1] [,2]
[1,]    1    1
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    1    1
[2,]    4   16


[1] "array"

 num [1:2, 1:2, 1:2] 1 2 1 4 1 4 1 16


$dim
[1] 2 2 2


A 4D array:

In [229]:
ar_3 <- outer(mat_2, mat_2, FUN = "^")
ar_3
class(ar_3)
str(ar_3)
attributes(ar_3)

, , 1, 1

     [,1] [,2]
[1,]    1    2
[2,]    2    4

, , 2, 1

     [,1] [,2]
[1,]    1    4
[2,]    4   16

, , 1, 2

     [,1] [,2]
[1,]    1    4
[2,]    4   16

, , 2, 2

     [,1] [,2]
[1,]    1   16
[2,]   16  256


[1] "array"

 num [1:2, 1:2, 1:2, 1:2] 1 2 2 4 1 4 4 16 1 4 ...


$dim
[1] 2 2 2 2
