In [232]:
options(jupyter.rich_display = FALSE)

# Manipulating data structures in R and useful functions

## Getting help

The easiest way to get help on a function or structure in R is the "?" utility

Just write

```R
?[any_function]
```

In [31]:
?"?"

In [32]:
?c

## Import and export data

### csv and tsv files

Comma separated and tab separated values can easily be imported to and exported from R

Let's first see the datasets available in R:

In [3]:
data()

Let's select the famous iris dataset:

In [4]:
iris

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


In [8]:
str(iris)

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


Let's export this data as csv:

In [6]:
write.csv(iris, file = "iris.csv", row.names = F)

You can check the iris.csv file from the filesystem

Now let's import the data again

In [10]:
iris_2 <- read.csv("iris.csv")

In [11]:
iris_2

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


In [12]:
identical(iris_2, iris)

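The imported object is identical to the original one. Note that since R 4.0.0 read.csv() no longer converts strings to factors by default, so on recent versions you need read.csv("iris.csv", stringsAsFactors = TRUE) for the comparison to hold

read.table() and write.table() are the general functions for reading and writing delimited text files, while read.csv() and write.csv() are wrappers with defaults suited to the csv file type.

Note that the file argument can also take URLs as input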
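
The section title also mentions tsv files. A minimal sketch of a tab-separated round trip with the general functions might look like this; the temporary path `tsv_path` and the object name `iris_tsv` are illustrative names, not part of the original notebook:

```R
# Write the first rows of iris as a tab-separated file in a temporary location
tsv_path <- tempfile(fileext = ".tsv")
write.table(head(iris), file = tsv_path, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Read it back; header = TRUE because the first line holds the column names
iris_tsv <- read.table(tsv_path, header = TRUE, sep = "\t",
                       stringsAsFactors = TRUE)
str(iris_tsv)
```

Only the `sep` argument distinguishes this from the csv case; read.csv() and write.csv() essentially fix it to a comma.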

### Binary R objects

save() and load() functions work with binary files representing R objects:

Create a copy of iris:

In [14]:
iris3 <- iris

Save it as an RData file:

In [16]:
save(iris3, file = "iris.RData")

Remove the iris3 object:

In [19]:
rm(iris3)
exists("iris3")

“object 'iris3' not found”

And load it from the RData file:

In [20]:
load("iris.RData")

Check that it is imported:

In [21]:
exists("iris3")

In [22]:
iris3

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


### Extension packages

- readxl, xlsx, openxlsx and XLConnect packages import from xls(x) files; all but readxl can also export to them
- googlesheets package connects R to Google Sheets
- readr package of tidyverse extends the functionality of read.table, write.table and similar base utilities
- DBI, RPostgreSQL, RMySQL, ROracle, sqldf packages connect R to common database servers
- data.table package has a faster implementation of file reads and writes with fread() and fwrite()

## Vectorization

Many functions in R accept vectors with multiple values as inputs and apply the operation to every element, without the need for an explicit loop

The two main benefits of vectorization in R are:
- Speed: Natively vectorized functions (written in C, C++ or Fortran and compiled) run the loop in compiled code, which is much faster than an equivalent loop written in R
- Conciseness: Vectorized functions simplify code writing: less code, and clearer code

Some functions, however, are not vectorized and can only handle single values.

In this case, Vectorize() can generate vectorized versions of these functions, although it does not bring the performance advantage of natively vectorized code

In [25]:
func_single <- function(x, y)
{
    if (x < y)
    {
        return("x is smaller than y")
    }
    else
    {
        return("x is not smaller than y")
    }
}

In [26]:
func_single(3, 1)
func_single(2, 5)

In [27]:
func_single(1:4, 5:2)

“the condition has length > 1 and only the first element will be used”

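The if condition only considers the first value of each vector and ignores the rest (since R 4.2.0, a condition of length greater than one is an error rather than a warning)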

In [28]:
func_vec <- Vectorize(func_single)

In [29]:
func_vec(1:4, 5:2)
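
When the logic is a simple element-wise choice, a natively vectorized alternative to Vectorize() is ifelse(). The sketch below is an assumed rewrite of the same comparison; the name `func_ifelse` is illustrative:

```R
# ifelse() evaluates the test for every element, so no wrapper is needed
func_ifelse <- function(x, y) {
    ifelse(x < y, "x is smaller than y", "x is not smaller than y")
}

func_ifelse(1:4, 5:2)
```

The first two pairs (1 vs 5, 2 vs 4) satisfy x < y, while the last two (3 vs 3, 4 vs 2) do not.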

### outer() with Vectorize

In order for outer() to work properly, the function provided to the FUN argument must be vectorized

For example, let's create a matrix over the Cartesian product of 1:10 with itself, where each value is the maximum of its row and column index

Note that max() is not a vectorized function; it aggregates its input and returns a single value:

So this does not work:

In [73]:
outer(1:10, 1:10, max)

ERROR: Error in dim(robj) <- c(dX, dY): dims [product 100] do not match the length of object [1]


This does not work either:

In [74]:
outer(1:10, 1:10, function(x,y) max(x,y))

ERROR: Error in dim(robj) <- c(dX, dY): dims [product 100] do not match the length of object [1]


But this works:

In [77]:
outer(1:10, 1:10, Vectorize(function(x,y) max(x,y)))

0,1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9,10
2,2,3,4,5,6,7,8,9,10
3,3,3,4,5,6,7,8,9,10
4,4,4,4,5,6,7,8,9,10
5,5,5,5,5,6,7,8,9,10
6,6,6,6,6,6,7,8,9,10
7,7,7,7,7,7,7,8,9,10
8,8,8,8,8,8,8,8,9,10
9,9,9,9,9,9,9,9,9,10
10,10,10,10,10,10,10,10,10,10
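
For this particular case there is also a natively vectorized option: pmax() (covered under vector operations) computes element-wise maxima, so it can be passed to outer() directly. A short sketch, with `max_mat` as an illustrative name:

```R
# pmax() returns one value per pair of inputs, which is what outer() expects
max_mat <- outer(1:10, 1:10, pmax)
dim(max_mat)
max_mat[3, 7]   # the larger of 3 and 7
```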


## Random number generation and simulation

### set.seed()

set.seed() makes (pseudo)random number generation reproducible, so that after a certain seed is provided the same sequence of numbers is always generated
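
A quick way to see this is to reset the seed and draw again. The seed value 42 and the names `draw_1` and `draw_2` are arbitrary choices for this sketch:

```R
set.seed(42)
draw_1 <- runif(3)

# Resetting the same seed restarts the same pseudo-random sequence
set.seed(42)
draw_2 <- runif(3)

identical(draw_1, draw_2)
```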

### sample()

sample() draws a random sample from a vector, with or without replacement:

In [52]:
set.seed(1)
sample(1:20, size = 15, replace = T)

In [51]:
set.seed(2)
sample(1:20, size =15, replace = F)

### runif()

Generates uniformly distributed numeric (decimal) values from a given range:

In [56]:
runif(n = 10, min = 5, max = 8)

In [58]:
?rnorm

### rnorm()

Generates normally distributed values with a given mean and standard deviation:

In [59]:
rnorm(n= 20, mean = 2 , sd = 1)

## Useful functions

### Vector operations

#### seq()

Creates a sequence of values. More versatile than the ":" operator:

In [61]:
?seq

```R
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
    length.out = NULL, along.with = NULL, ...)
```

Sequence of values from 3, with steps of 2 and length of 7: 

In [62]:
seq(from = 3, by = 2, length.out = 7)

seq_along() creates a sequence along another vector. It is the same as 1:length(x) for a non-empty x, and is useful for creating indices across a vector:

In [63]:
samp_1 <- sample(20, 5)
samp_1

In [64]:
seq_along(samp_1)
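
The two forms differ for an empty vector, which is why seq_along() is usually the safer choice in loops. A small sketch, with `empty` as an illustrative name:

```R
empty <- integer(0)

1:length(empty)    # counts down from 1 to 0, so a loop would run twice
seq_along(empty)   # an empty sequence, so a loop would not run at all
```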

#### rev()

Reverses the order of a vector:

In [197]:
rev(1:5)

#### pmax(), pmin()

Vectorized maximum and minimum of the same indices on multiple vectors

In [199]:
set.seed(20)
samp_13 <- sample(100, 5)
samp_14 <- sample(100, 5)
samp_15 <- sample(100, 5)

samp_13
samp_14
samp_15

In [200]:
pmax(samp_13, samp_14, samp_15)

In [201]:
pmin(samp_13, samp_14, samp_15)

In [202]:
?pmax

### Logical functions

#### all()

Works on logical values and reports whether all values are TRUE

In [78]:
all(c(T,F,T))

In [79]:
all(c(T,T,T))

all() is lazy: When it finds an F, the rest is not probed:

In [88]:
large_bool1 <- c(F, rep(T, 1e7))
large_bool2 <- c(rep(T, 1e7), F)

In [89]:
system.time(all(large_bool1))
system.time(all(large_bool2))

   user  system elapsed 
      0       0       0 

   user  system elapsed 
  0.012   0.000   0.013 

In the first example, the all() call stops at the very first element, the F at the beginning, while in the second example it has to go through the whole vector

#### any()

Returns whether any of the values in a logical vector is TRUE

In [90]:
any(c(T, F, F))

In [91]:
any(c(F, F, F))

It is also a lazy function

### which(), which.min(), which.max()

which() returns the indices of the TRUE values of a logical vector

It is typically used to find the indices of the elements of a vector that satisfy a condition

In [103]:
set.seed(3)
samp_2 <- sample(5, 10, replace = T)
samp_2

In [104]:
samp_2 > 3

In [105]:
which(samp_2 > 3)

So 2nd, 5th, 6th and 10th values in the samp_2 vector satisfy the > 3 condition

which.max() returns the index of the max value, which.min() returns the index of the min value in a vector

In [106]:
set.seed(4)
samp_3 <- sample(10)
samp_3

In [107]:
which.max(samp_3)
which.min(samp_3)

### match()

Returns the first positions of the values of the first argument in the second argument

In [109]:
set.seed(5)
samp_4 <- sample(10)
samp_4

In [110]:
match(1, samp_4)

1 is the 5th item in samp_4

In [111]:
match(1:3, samp_4)

1:3 are 5th, 4th and 1st items in samp_4 respectively
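
When a value is not found, match() returns NA. If only a yes/no answer is needed, the related %in% operator can be used instead. A minimal sketch with made-up vectors:

```R
lookup <- c(5, 2, 9)

match(c(2, 11), lookup)   # position of 2, and NA because 11 is absent
c(2, 11) %in% lookup      # TRUE FALSE
```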

### Ordering functions

#### sort()

Sorts a vector into ascending (the default) or descending order

In [112]:
set.seed(6)
samp_5 <- sample(100, 10)
samp_5

In [113]:
sort(samp_5)

In [114]:
sort(samp_5, decreasing = T)

#### order()

order() returns a vector of indices which rearranges its first argument into ascending or descending order

In [117]:
set.seed(7)
samp_6 <- sample(5)
samp_6

In [118]:
order(samp_6)

It is the same as:

In [132]:
match(sort(samp_6), samp_6)

In order to arrange samp_6 in ascending order it must be subset in this order: 3rd, 2nd, 4th, 5th and 1st indices.

Let's check:

In [119]:
samp_6[order(samp_6)]

To get the descending order:

In [120]:
order(samp_6, decreasing = T)

Or (for numeric/integer vectors only):

In [121]:
order(-samp_6)

In [122]:
samp_6[order(-samp_6)]
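
A common use of order() is sorting the rows of a data frame, possibly by several keys. The data frame `df_ord` below is a made-up example for illustration:

```R
df_ord <- data.frame(name = c("a", "b", "c", "d"),
                     grp  = c(2, 1, 2, 1),
                     val  = c(10, 30, 20, 40),
                     stringsAsFactors = FALSE)

# Sort by grp ascending, then by val descending within each group
sorted <- df_ord[order(df_ord$grp, -df_ord$val), ]
sorted$name   # "d" "b" "c" "a"
```

Additional arguments to order() act as tie-breakers for the earlier ones.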

#### rank()

Returns the position of each value when sorted:

In [125]:
set.seed(10)
samp_7 <- sample(20, 5)
samp_7

In [126]:
rank(samp_7)

When there are no ties, it is the same as:

In [129]:
match(samp_7, sort(samp_7))
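
With tied values the two approaches can differ, because rank() averages the positions of ties by default. A sketch with a made-up vector `tied`:

```R
tied <- c(10, 20, 10, 30)

rank(tied)                        # 1.5 3.0 1.5 4.0 (ties share the average rank)
rank(tied, ties.method = "min")   # 1 3 1 4
match(tied, sort(tied))           # 1 3 1 4 (first position of each value)
```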

### Set operations: setdiff(), intersect(), union()

In [167]:
set.seed(200)
samp_10 <- sample(10, 7)
samp_11 <- sample(10, 7)
samp_10
samp_11

setdiff() returns the items of the first vector that are not in the second, so it is not symmetric:

In [169]:
setdiff(samp_10, samp_11)

In [170]:
setdiff(samp_11, samp_10)

intersect() returns common values (values the same, orders may differ):

In [171]:
intersect(samp_10, samp_11)

In [172]:
intersect(samp_11, samp_10)

union() combines values:

In [173]:
union(samp_10, samp_11)

In [174]:
union(samp_11, samp_10)
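
Base R has no dedicated symmetric difference function, but one can be sketched by combining the functions above. The vectors `set_a` and `set_b` are made up for illustration:

```R
set_a <- c(1, 2, 3, 4)
set_b <- c(3, 4, 5, 6)

# Items that appear in exactly one of the two vectors
sym_diff <- union(setdiff(set_a, set_b), setdiff(set_b, set_a))
sym_diff                       # 1 2 5 6

# The set functions also drop duplicates
union(c(1, 1, 2), c(2, 3))     # 1 2 3
```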

### Counting functions

In [176]:
set.seed(200)
samp_12 <- sample(10, 100, replace = T)
length(samp_12)

unique() returns the unique values in a vector

In [177]:
unique(samp_12)

table() summarizes the occurrence of each unique item:

In [178]:
table(samp_12)

samp_12
 1  2  3  4  5  6  7  8  9 10 
 7 14  9 11 11 16 14  7  4  7 

prop.table() converts the counts of a table into proportions of the total:

In [180]:
prop.table(table(samp_12))

samp_12
   1    2    3    4    5    6    7    8    9   10 
0.07 0.14 0.09 0.11 0.11 0.16 0.14 0.07 0.04 0.07 

### Rounding functions

round() rounds to the desired number of decimal places:

In [181]:
pi
round(pi)
round(pi, 1)
round(pi, 2)

In [184]:
round(1.6)
round(1.4)
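
Two details are easy to miss: values exactly halfway between two integers are rounded to the even one (the IEC 60559 convention that R follows), and signif() rounds to significant digits rather than decimal places. A short sketch:

```R
round(0.5)          # 0
round(1.5)          # 2
round(2.5)          # 2
signif(123456, 2)   # 120000
```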

ceiling() rounds up:

In [185]:
ceiling(1.6)
ceiling(-1.6)

floor() rounds down:

In [186]:
floor(1.6)
floor(-1.6)

trunc() truncates toward 0, i.e. rounds to the nearest integer in the direction of 0

In [188]:
trunc(1.6)
trunc(-1.6)

### Vector sum and products

sum() gets the single value sum of a vector:

In [189]:
sum(1:5)

cumsum() returns the cumulative sums from the first to nth values:

In [190]:
cumsum(1:5)

prod() returns the single value product of a vector:

In [192]:
prod(1:5)

cumprod() returns the cumulative products from the first to nth values

In [193]:
cumprod(1:5)

They can be combined with rev()

In [195]:
cumsum(rev(1:5))

In [196]:
cumprod(rev(1:5))
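
Cumulative functions combine well with other vectorized helpers. As an illustration (the vector `vals` is made up), a running mean can be sketched by dividing cumsum() by the element positions, and cummax() tracks the running maximum:

```R
vals <- c(2, 4, 6, 8)

# Running mean: cumulative sum divided by the number of elements so far
running_mean <- cumsum(vals) / seq_along(vals)
running_mean                  # 2 3 4 5

cummax(c(1, 3, 2, 5, 4))      # 1 3 3 5 5
```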

RcppRoll package has high performance rolling/windowed operations on vectors and matrices 

### Matrix operations

In [206]:
set.seed(30)
mat_3 <- matrix(sample(10, 25, replace = T), nrow = 5)
mat_3

0,1,2,3,4
1,2,1,10,2
5,9,4,3,6
4,3,6,9,1
5,10,9,7,4
4,2,3,5,3


colSums() returns the column sums

In [207]:
colSums(mat_3)

rowSums() returns the row sums

In [208]:
rowSums(mat_3)

max.col() returns the column index of the maximum value for each row (ties are broken at random by default, see the ties.method argument):

In [209]:
max.col(mat_3)

There is no built-in max.row but we can easily emulate its functionality:

In [210]:
max.col(t(mat_3))
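
An alternative sketch for the "max row" case uses apply() with which.max(), which always picks the first maximum instead of breaking ties at random. The matrix `mat_demo` is a small made-up example without ties:

```R
mat_demo <- matrix(c(1, 5, 3,
                     9, 2, 4,
                     7, 8, 6), nrow = 3, byrow = TRUE)

max.col(mat_demo)                # 2 1 2: column of the max in each row
max.col(t(mat_demo))             # 2 3 3: row of the max in each column
apply(mat_demo, 2, which.max)    # 2 3 3: same result via apply()
```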

matrixStats package has high-performing functions for row and column operations on matrices

### Statistical functions

#### max(), min(), mean(), median(), sd()

Returns the respective single max, min, mean, median and sd values: 

In [141]:
set.seed(15)
samp_8 <- rnorm(1e4, 0, 1)

In [142]:
max(samp_8)

In [143]:
min(samp_8)

In [144]:
mean(samp_8)

In [145]:
median(samp_8)

In [146]:
sd(samp_8)

#### quantile()

With only a single argument, it returns the five-number summary of a numeric variable

In [147]:
quantile(samp_8)

Or you can define any percentile values. For example to get the deciles:

In [155]:
quantile(samp_8, probs = seq(0.1, 1, 0.1))

#### summary()

A generic function. For a numeric variable it provides a summary table of the statistics above (five-number summary + mean):

In [149]:
summary(samp_8)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-4.210662 -0.683042  0.000698 -0.002780  0.691464  3.817933 

#### cor(), cov(), var()

cor() returns correlation values between vectors:

In [157]:
set.seed(100)
samp_9 <- runif(100)
samp_10 <- runif(100)
samp_11 <- samp_9 + 2* samp_10

Between two variables:

In [162]:
cor(samp_9, samp_11)

Or as a correlation matrix when a matrix is provided:

In [161]:
cor(cbind(samp_9, samp_10, samp_11))

Unnamed: 0,samp_9,samp_10,samp_11
samp_9,1.0,0.2412013,0.5649171
samp_10,0.2412013,1.0,0.937044
samp_11,0.5649171,0.937044,1.0


Variance of a vector:

In [166]:
var(samp_9)

Or a covariance matrix of multiple columns:

In [164]:
cov(cbind(samp_9, samp_10, samp_11))

Unnamed: 0,samp_9,samp_10,samp_11
samp_9,0.06865373,0.01956398,0.1077817
samp_10,0.01956398,0.09582769,0.2112194
samp_11,0.10778169,0.21121936,0.5302204
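
The three functions are closely related: correlation is covariance scaled by the two standard deviations, and the variance of a vector is its covariance with itself. A sketch that checks these identities numerically, using illustrative vectors `x_demo` and `y_demo`:

```R
set.seed(100)
x_demo <- runif(100)
y_demo <- x_demo + runif(100)

# cor(x, y) = cov(x, y) / (sd(x) * sd(y))
manual_cor <- cov(x_demo, y_demo) / (sd(x_demo) * sd(y_demo))
all.equal(manual_cor, cor(x_demo, y_demo))

# var(x) = cov(x, x)
all.equal(var(x_demo), cov(x_demo, x_demo))

# cov2cor() converts a covariance matrix into a correlation matrix
cov2cor(cov(cbind(x_demo, y_demo)))
```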


### Mathematical functions

abs() returns the absolute value:

In [211]:
abs(-10)

exp() returns the exponential: e^n

In [215]:
exp(1)

log() returns the logarithm in a given base (the default base is e = exp(1)):

In [216]:
log(exp(1))

sqrt() returns the square root:

In [218]:
sqrt(16)

factorial() returns the factorial (n!)

In [219]:
factorial(5)

#### Extension packages

gmp, numbers, adagio packages have efficient implementations of numeric operations

### Trigonometric functions

R's trigonometric functions expect radians, so degrees must be converted first, using $180\,^{\circ} = \pi$ radians

In [220]:
degvalues <- seq(0, 360, 45)
degvalues

In [221]:
radians <- degvalues / 180 * pi
radians

In [222]:
sin(radians)

In [223]:
cos(radians)

In [224]:
tan(radians)
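
Because pi is stored with finite precision, results that are exactly zero mathematically may come out as tiny non-zero numbers. The sketch below wraps the conversion in a helper (`deg2rad` is a hypothetical name, not a base R function) and compares with a tolerance:

```R
# Hypothetical helper: convert degrees to radians
deg2rad <- function(deg) deg * pi / 180

sin(deg2rad(180)) == 0                     # FALSE: a tiny rounding error remains
isTRUE(all.equal(sin(deg2rad(180)), 0))    # TRUE: equal within tolerance
```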

### Combinatorics

#### choose()

C(n,k) is the number of k-sized combinations out of a vector of size n. choose() is a vectorized function:

In [227]:
choose(10, 0:10)
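
The result can be checked against the factorial formula C(n,k) = n! / (k! (n-k)!) and against the fact that the binomial coefficients for a given n sum to 2^n. A short sketch:

```R
choose(5, 2)                                     # 10
factorial(5) / (factorial(2) * factorial(3))     # 10

sum(choose(10, 0:10))                            # 1024
2^10                                             # 1024
```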

#### expand.grid()

Returns the cartesian product of multiple vectors as a data frame:

In [228]:
expand.grid(1:2, 1:3)

Var1,Var2
1,1
2,1
1,2
2,2
1,3
2,3


In [229]:
expand.grid(list(1:2, 3:5, 4:7))

Var1,Var2,Var3
1,3,4
2,3,4
1,4,4
2,4,4
1,5,4
2,5,4
1,3,5
2,3,5
1,4,5
2,4,5


#### combn()

Returns all unique k sized combinations of a vector of size n

In [231]:
?combn

In [233]:
combn(1:5, 3)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1    1    1    1    1    1    2    2    2    3    
[2,] 2    2    2    3    3    4    3    3    4    4    
[3,] 3    4    5    4    5    5    4    5    5    5    
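
combn() also accepts a function through its FUN argument, which is applied to each combination. The sketch below sums every 3-element combination (`combo_sums` is an illustrative name) and confirms that the number of combinations matches choose():

```R
combo_sums <- combn(1:5, 3, FUN = sum)
combo_sums                            # 6 7 8 8 9 10 9 10 11 12

ncol(combn(1:5, 3)) == choose(5, 3)   # TRUE
```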

#### Extension packages

gtools, combinat, permutations and iterpc packages provide more functionality on combinatorics

### List operations

#### split()

split() splits a data.frame, matrix or vector into a list, based on the distinct values of another vector:

In [70]:
?split

First let's create a letter vector of size 20 from the first 5 letters of the alphabet:

In [65]:
samp_2 <- sample(letters[1:5], 20, replace = T)
samp_2

Let's create a vector of indices along samp_2:

In [66]:
ind_2 <- seq_along(samp_2)
ind_2

Now let's split the indices by the values in samp_2: The indices corresponding to each unique value of samp_2 will be held in a separate list item:

In [67]:
?split

In [68]:
split_1 <- split(ind_2, f = samp_2)
split_1

So "a" appears in 7th, 15th, 16th and 18th positions

In [69]:
str(split_1)
attributes(split_1)

List of 5
 $ a: int [1:4] 7 15 16 18
 $ b: int [1:4] 2 3 6 13
 $ c: int [1:4] 8 10 14 19
 $ d: int [1:5] 1 5 12 17 20
 $ e: int [1:3] 4 9 11


split_1 is a list
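
A typical use of split() is a split-then-summarize pattern. As a sketch on the iris data used earlier (the names `parts` and `group_means` are illustrative), the sepal lengths can be split by species and then summarized per group:

```R
# One list item per species, each holding that species' sepal lengths
parts <- split(iris$Sepal.Length, iris$Species)
lengths(parts)

# Mean sepal length per species
group_means <- sapply(parts, mean)
group_means
```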

#### do.call()

Calls a function once, passing the items of a list as its arguments

Let's create a list of ten same-sized vectors:

In [240]:
list_3 <- list()

set.seed(100)
for (i in 1: 10)
{
    vec <- sample(100, 10, replace = T)
    list_3[[i]] <- vec
}

list_3

[[1]]
 [1] 31 26 56  6 47 49 82 38 55 18

[[2]]
 [1] 63 89 29 40 77 67 21 36 36 70

[[3]]
 [1] 54 72 54 75 43 18 78 89 55 28

[[4]]
 [1] 49 93 35 96 70 89 19 63 99 14

[[5]]
 [1] 34 87 78 83 61 50 79 89 21 31

[[6]]
 [1] 34 20 24 28 60 26 13 23 60 22

[[7]]
 [1] 47 65 97 68 45 36 46 45 25 70

[[8]]
 [1] 42 33 58 97 67 63 86 78 84 10

[[9]]
 [1] 46 60 92 99  4 58 74 25 31 74

[[10]]
 [1] 91 21 36 45 91 39 52 13  4 78


Now combine all list items into a matrix without explicitly providing each vector as a separate argument to cbind():

In [241]:
do.call(cbind, list_3)

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,] 31   63   54   49   34   34   47   42   46   91   
 [2,] 26   89   72   93   87   20   65   33   60   21   
 [3,] 56   29   54   35   78   24   97   58   92   36   
 [4,]  6   40   75   96   83   28   68   97   99   45   
 [5,] 47   77   43   70   61   60   45   67    4   91   
 [6,] 49   67   18   89   50   26   36   63   58   39   
 [7,] 82   21   78   19   79   13   46   86   74   52   
 [8,] 38   36   89   63   89   23   45   78   25   13   
 [9,] 55   36   55   99   21   60   25   84   31    4   
[10,] 18   70   28   14   31   22   70   10   74   78   
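
The difference from the apply family is that do.call() makes a single call with all list items as arguments, whereas lapply() makes one call per item. A minimal sketch with a made-up list `vecs`:

```R
vecs <- list(1:3, 4:6)

do.call(rbind, vecs)                        # one call: rbind(1:3, 4:6)
do.call(paste, list("a", "b", sep = "-"))   # named items become named arguments
lapply(vecs, sum)                           # two calls: sum(1:3) and sum(4:6)
```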

### Optimization

#### solve()

Solves a system of linear equations

```
Solve a System of Equations

Description:

     This generic function solves the equation ‘a %*% x = b’ for ‘x’,
     where ‘b’ can be either a vector or a matrix.

Usage:

     solve(a, b, ...)
```

In [249]:
?solve

In [250]:
a <- matrix(runif(25, -10, 10), nrow = 5)
a

     [,1]      [,2]       [,3]      [,4]       [,5]     
[1,] -7.549348 -6.0185508  2.132548  3.9617499 -8.214875
[2,]  4.577256  0.1017252  6.317534  6.3018623 -6.746353
[3,]  9.007544  8.5277753  6.867624  1.3680012 -9.465431
[4,] -9.132882 -7.2319662  5.761544 -0.3987423  4.192590
[5,] -9.608014 -6.6161085 -9.619583 -6.7725976  5.220364

In [251]:
b <- matrix(runif(10, -5, 5), ncol = 2)
b

     [,1]       [,2]      
[1,]  3.5747478  2.9313923
[2,] -0.6273418 -1.7384743
[3,] -0.8286355  4.5793194
[4,]  0.8564949  1.5301961
[5,]  3.2496758 -0.4132418

In [253]:
x <- solve(a, b)
x

     [,1]        [,2]      
[1,] -0.16896849 -0.8810440
[2,] -0.14074988  0.9594990
[3,] -0.09368302  0.2716812
[4,] -0.19823963 -0.2399375
[5,] -0.29668069 -0.2953273

Let's confirm

In [257]:
b2 <- a %*% x
b2

     [,1]       [,2]      
[1,]  3.5747478  2.9313923
[2,] -0.6273418 -1.7384743
[3,] -0.8286355  4.5793194
[4,]  0.8564949  1.5301961
[5,]  3.2496758 -0.4132418

Check whether b and b2 are equal:

In [258]:
identical(b, b2)

[1] FALSE

In [259]:
b == b2

     [,1]  [,2] 
[1,]  TRUE FALSE
[2,] FALSE FALSE
[3,] FALSE  TRUE
[4,] FALSE FALSE
[5,]  TRUE FALSE

Because of floating-point arithmetic, the computed values can differ from the originals by tiny rounding errors

For this purpose, "near" equality should be checked with the all.equal() function: 

In [260]:
all.equal(b, b2)

[1] TRUE
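
When the values differ, all.equal() returns a character description of the difference rather than FALSE, so inside conditions it is usually wrapped in isTRUE(). A short sketch with a made-up value `tenths`:

```R
tenths <- 0.1 + 0.2

tenths == 0.3                     # FALSE because of rounding error
isTRUE(all.equal(tenths, 0.3))    # TRUE within the default tolerance

all.equal(1, 1.1)                 # a message describing the difference
isTRUE(all.equal(1, 1.1))         # FALSE
```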

#### optimize()

Finds the min or max of a function with respect to a single argument over an interval

```
optimize                 package:stats                 R Documentation

One Dimensional Optimization

The function ‘optimize’ searches the interval from ‘lower’ to ‘upper’ for a minimum or maximum of the function ‘f’ with respect to its first argument.

optimize(f, interval, ..., lower = min(interval), upper = max(interval), maximum = FALSE, tol = .Machine$double.eps^0.25)
     
Arguments:

       f: the function to be optimized.  The function is either
          minimized or maximized over its first argument depending on
          the value of ‘maximum’.

interval: a vector containing the end-points of the interval to be
          searched for the minimum.

     ...: additional named or unnamed arguments to be passed to ‘f’.

   lower: the lower end point of the interval to be searched.

   upper: the upper end point of the interval to be searched.

 maximum: logical.  Should we maximize or minimize (the default)?
```

Let's create a polynomial function:

In [289]:
polynom4 <- function(x) sum(x^(2:0) * c(1, -3, 8))

In [292]:
polynom4(1)

[1] 6

In [293]:
optimize(f = polynom4, interval = c(-100, 100), maximum = F)

$minimum
[1] 1.5

$objective
[1] 5.75


In [295]:
optimize(f = polynom4, interval = c(-100, 100), maximum = T)

$maximum
[1] -99.99993

$objective
[1] 10307.99
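
Both results can be sanity-checked analytically. The polynomial is x^2 - 3x + 8, an upward-opening parabola, so its minimum lies at the vertex x = 3/2 and its maximum over a closed interval lies at an end point. A sketch of the check for the minimum, where `quad` restates the same polynomial in expanded form and `res_min` is an illustrative name:

```R
# Same polynomial as above, written out term by term
quad <- function(x) x^2 - 3 * x + 8

res_min <- optimize(quad, interval = c(-100, 100))
res_min$minimum     # numerically close to the vertex at 1.5
quad(3 / 2)         # 5.75, the value at the vertex
```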


#### optim()

General-purpose optimization over multiple arguments (minimization by default)

In [296]:
xy1 <- function(arg)
{
    x <- arg[1]
    y <- arg[2]
    3 * x^2 - 4 * x + 5 * y - 2 * y^2 + 3
}

In [300]:
optim(c(0,0), method = "L-BFGS-B", xy1, lower = -5, upper = 5)

$par
[1]  0.6666667 -5.0000000

$value
[1] -73.33333

$counts
function gradient 
       5        5 

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
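
In the result above, y ends up on the lower bound of -5 because the -2 * y^2 term makes the function unbounded below in y, so without the box constraints there would be no minimum. For comparison, here is a sketch with a function that has an interior minimum; the function `bowl` and the choice of the "BFGS" method are assumptions made for this illustration:

```R
# A convex bowl with its minimum at x = 1, y = -2
bowl <- function(arg)
{
    x <- arg[1]
    y <- arg[2]
    (x - 1)^2 + (y + 2)^2
}

res_bowl <- optim(c(0, 0), bowl, method = "BFGS")
res_bowl$par           # close to 1 and -2
res_bowl$convergence   # 0 indicates successful convergence
```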


#### Extension packages

optimx and optimization packages provide more functionality for optimization

### Object information

#### str()

Returns the structure of an object:

In [234]:
list_2 <- list(mat_1, samp_1, df1 = as.data.frame(mat_1), list(1:3))
list_2

[[1]]
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25

[[2]]
[1] 16  9  8 14 10

$df1
  V1 V2 V3 V4 V5
1  1  6 11 16 21
2  2  7 12 17 22
3  3  8 13 18 23
4  4  9 14 19 24
5  5 10 15 20 25

[[4]]
[[4]][[1]]
[1] 1 2 3



In [235]:
str(list_2)

List of 4
 $    : int [1:5, 1:5] 1 2 3 4 5 6 7 8 9 10 ...
 $    : int [1:5] 16 9 8 14 10
 $ df1:'data.frame':	5 obs. of  5 variables:
  ..$ V1: int [1:5] 1 2 3 4 5
  ..$ V2: int [1:5] 6 7 8 9 10
  ..$ V3: int [1:5] 11 12 13 14 15
  ..$ V4: int [1:5] 16 17 18 19 20
  ..$ V5: int [1:5] 21 22 23 24 25
 $    :List of 1
  ..$ : int [1:3] 1 2 3

#### object.size()

Returns the size of an object in memory

**NOTE THAT R IS BASICALLY AN IN MEMORY COMPUTATION ENVIRONMENT. SO EFFICIENCY IN MEMORY USAGE IS IMPORTANT** 

In [236]:
object.size(1:1e5)

400048 bytes

### Performance functions

#### system.time()

Returns the execution time of an expression in seconds

In [237]:
system.time(any(c(rep(F, 1e6), T)))

   user  system elapsed 
  0.005   0.003   0.008 

microbenchmark and rbenchmark packages bring more functionality and precision to performance measurement
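
system.time() is handy for comparing two implementations of the same computation. The sketch below times an explicit loop against a vectorized expression; the function `loop_sum_sq` and the vector `x_time` are made up for this illustration, and the actual timings depend on the machine:

```R
set.seed(1)
x_time <- runif(1e5)

# Explicit loop: add the squares one element at a time
loop_sum_sq <- function(v)
{
    total <- 0
    for (value in v) total <- total + value^2
    total
}

system.time(res_loop <- loop_sum_sq(x_time))
system.time(res_vec <- sum(x_time^2))

# Both approaches give the same result up to rounding error
all.equal(res_loop, res_vec)
```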

## \*plying

A very important functionality of R is provided by the \*apply family of functions


Although they are not as fast or concise as native vectorized code, they can still replace more verbose loops and can have some performance benefits over loops:

### apply

apply() works with matrices: It applies a function on each row or column of a matrix

In [33]:
?apply

```
apply(X, MARGIN, FUN, ...)

Arguments

X	
an array, including a matrix.

MARGIN	
a vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names.

FUN	
the function to be applied: see ‘Details’. In the case of functions like +, %*%, etc., the function name must be backquoted or quoted.

...	
optional arguments to FUN.

```

In [30]:
mat_1 <- matrix(1:25, nrow = 5)
mat_1

0,1,2,3,4
1,6,11,16,21
2,7,12,17,22
3,8,13,18,23
4,9,14,19,24
5,10,15,20,25


Now apply the function max() to each row of the matrix. Rows are the 1st margin: 

In [34]:
apply(mat_1, 1, max)

And apply max() to each column:

In [35]:
apply(mat_1, 2, max)

We can also define a new (anonymous) function to be applied on the spot:

In [41]:
apply(mat_1, 2, function(x) max(x) - min(x))

Functions in apply can take more than one argument

For example, extract the third element from each row (of course, not an efficient implementation):

In [36]:
apply(mat_1, 1, "[", 3)

If the function returns multiple values for each row, each return value becomes a column in the returned matrix:

In [39]:
apply(mat_1, 1, "[", 3:4)

0,1,2,3,4
11,12,13,14,15
16,17,18,19,20
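
Because each result becomes a column, a row-wise apply() that returns whole rows yields a transposed matrix; t() restores the original orientation. A sketch with a small made-up matrix `mat_small`:

```R
mat_small <- matrix(1:6, nrow = 2)

# Doubling each row returns a vector of length 3 per row
doubled <- apply(mat_small, 1, function(r) r * 2)
dim(doubled)     # 3 2: rows of the input have become columns

t(doubled)       # back to 2 x 3, equal to mat_small * 2
```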


What if we want to have each row work with separate values?

### lapply

lapply() is list apply: It works on lists as well as vectors but always returns a list: 

Let's generate rows 1 to 10 of Pascal's triangle:

In [304]:
pascalst <- lapply(1:10, function(x) choose(x, 0:x))
pascalst

[[1]]
[1] 1 1

[[2]]
[1] 1 2 1

[[3]]
[1] 1 3 3 1

[[4]]
[1] 1 4 6 4 1

[[5]]
[1]  1  5 10 10  5  1

[[6]]
[1]  1  6 15 20 15  6  1

[[7]]
[1]  1  7 21 35 35 21  7  1

[[8]]
[1]  1  8 28 56 70 56 28  8  1

[[9]]
 [1]   1   9  36  84 126 126  84  36   9   1

[[10]]
 [1]   1  10  45 120 210 252 210 120  45  10   1


Then we can get the max value for each item as a list:

In [305]:
lapply(pascalst, max)

[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 6

[[5]]
[1] 10

[[6]]
[1] 20

[[7]]
[1] 35

[[8]]
[1] 70

[[9]]
[1] 126

[[10]]
[1] 252


We can have a vector by unlisting the object:

In [311]:
unlist(lapply(pascalst, max))

 [1]   1   2   3   6  10  20  35  70 126 252

### sapply

sapply() is the "s"implified lapply(): it returns a vector (or matrix) instead of a list whenever the results can be simplified

Recall the previous example of getting the maximum values out of Pascal's triangle:

In [310]:
sapply(pascalst, max)

 [1]   1   2   3   6  10  20  35  70 126 252
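
Since the return type of sapply() depends on the results, base R also offers vapply(), where the expected type and length of each result are declared up front and anything else raises an error. A sketch on a shorter triangle (`tri_rows` is an illustrative name):

```R
tri_rows <- lapply(1:5, function(n) choose(n, 0:n))

# Each result must be a single numeric value
vapply(tri_rows, max, numeric(1))       # 1 2 3 6 10

# Each result must be a single integer
vapply(tri_rows, length, integer(1))    # 2 3 4 5 6
```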

It can also iterate through vectors

Let's take the max of each of the values versus 5:

In [312]:
sapply(1:10, max, 5)

 [1]  5  5  5  5  5  6  7  8  9 10

Now let's make the second argument to the function a multi valued vector

And get the matching index of each of the values in the first argument inside the samp_20 vector:

In [313]:
set.seed(300)
samp_20 <- sample(20)
samp_20

 [1] 19 15 20 13 11  1 16  7  6 10 14  3 17  5  4  8 12  2  9 18

In [314]:
sapply(1:10, match, samp_20)

 [1]  6 18 12 15 14  9  8 16 19 10

What if I want to make a pairwise comparison?

In [315]:
sapply(1:10, max, 10:1)

 [1] 10 10 10 10 10 10 10 10 10 10

That did not do the job: In each iteration over 1:10, the max is done against the whole 10:1 vector

### mapply()

mapply is the multivariate sapply: It can iterate through multiple vectors or lists

So the previous example becomes:

In [318]:
mapply(max, 1:10, 10:1)

 [1] 10  9  8  7  6  6  7  8  9 10

Now max() operated pairwise on the corresponding elements of both vectors

What if I want mapply to iterate through some arguments and take other arguments as a whole? An object to be taken as a whole must be wrapped in a list, so that it becomes a single list item that is recycled across iterations:

Get the pairwise maximum of the first two vectors and find the matching index of this maximum inside the samp_20 vector

In [322]:
samp_20
list(samp_20)

 [1] 19 15 20 13 11  1 16  7  6 10 14  3 17  5  4  8 12  2  9 18

[[1]]
 [1] 19 15 20 13 11  1 16  7  6 10 14  3 17  5  4  8 12  2  9 18


Note that the vector should be a single item of the list, not itself converted into a list, as happens here:

In [323]:
as.list(samp_20)

[[1]]
[1] 19

[[2]]
[1] 15

[[3]]
[1] 20

[[4]]
[1] 13

[[5]]
[1] 11

[[6]]
[1] 1

[[7]]
[1] 16

[[8]]
[1] 7

[[9]]
[1] 6

[[10]]
[1] 10

[[11]]
[1] 14

[[12]]
[1] 3

[[13]]
[1] 17

[[14]]
[1] 5

[[15]]
[1] 4

[[16]]
[1] 8

[[17]]
[1] 12

[[18]]
[1] 2

[[19]]
[1] 9

[[20]]
[1] 18


In [321]:
mapply(function(x, y, z) match(max(x,y), z), 1:10, 10:1, list(samp_20))

 [1] 10 19 16  8  9  9  8 16 19 10
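
mapply() also has a MoreArgs argument for exactly this purpose: arguments listed there are passed whole to every call instead of being iterated over. A sketch with a small made-up lookup vector `lookup_vec`:

```R
lookup_vec <- c(3, 1, 2)

# x and y are iterated pairwise; z is passed whole on every call
mapply(function(x, y, z) match(max(x, y), z),
       1:3, 3:1,
       MoreArgs = list(z = lookup_vec))
# 1 3 1
```

The pairwise maxima are 3, 2 and 3, which sit at positions 1, 3 and 1 of `lookup_vec`.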

### Extension packages

- plyr package enables \*ply operations on multidimensional arrays
- purrr package from tidyverse brings similar functionality to \*ply functions, but more harmonized with other tidyverse packages

## Basic plotting

# String operations

sprintf

paste

strsplit

substring

gsub

grepl
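
The functions listed above are only named in this outline. As a rough sketch of what each one does, here is one minimal base R call per function; the input strings are made up for illustration:

```R
sprintf("%s has %d rows", "iris", 150L)      # "iris has 150 rows"
paste("a", "b", sep = "-")                   # "a-b"
strsplit("a,b,c", split = ",")[[1]]          # "a" "b" "c"
substring("abcdef", 2, 4)                    # "bcd"
gsub("a", "A", "banana")                     # "bAnAnA"
grepl("^se", c("setosa", "versicolor"))      # TRUE FALSE
```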