In [None]:
options(jupyter.rich_display = FALSE)

# Manipulating data structures in R and useful functions

## Getting help

The easiest way to get help on a function or structure in R is the "?" utility

Just write

```R
?[any_function]
```

In [None]:
?"?"

In [None]:
?c

## Import and export data

### csv and tsv files

Comma separated and tab separated values can easily be imported to and exported from R

Let's first see the available datasets in R:

In [None]:
data()

Let's select the famous iris dataset:

In [None]:
iris

In [None]:
str(iris)

Let's export this data as csv:

In [None]:
write.csv(iris, file = "iris.csv", row.names = F)

You can check the iris.csv file from the filesystem

Now let's import the data again

In [None]:
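# Since R 4.0 strings are read as character by default; ask for factors to match iris
iris_2 <- read.csv("iris.csv", stringsAsFactors = TRUE)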

In [None]:
iris_2

In [None]:
identical(iris_2, iris)

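The imported object is identical to the original one, provided Species is read back as a factor (since R 4.0, read.csv() returns character columns unless stringsAsFactors = TRUE is set)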

read.table() and write.table() are the general functions for reading and writing delimited text files, while read.csv() and write.csv() are wrappers with defaults suited to the csv file type.

Note that the file argument can also take URLs as input
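
The section title also mentions tab separated files. A minimal sketch of a tsv round trip with the general functions could look like the following; the temporary file path and the name iris_tsv are illustrative assumptions, and stringsAsFactors = TRUE is set so that Species comes back as a factor:

```R
# Write iris as a tab separated file to a temporary location (illustrative path)
tsv_path <- tempfile(fileext = ".tsv")
write.table(iris, file = tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# read.table() needs the header and the separator stated explicitly
iris_tsv <- read.table(tsv_path, header = TRUE, sep = "\t", stringsAsFactors = TRUE)

# Compare the re-imported data with the original
all.equal(iris_tsv, iris)
```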

### Binary R objects

save() and load() functions work with binary files representing R objects:

Create a copy of iris:

In [None]:
iris3 <- iris

Save it as an RData file:

In [None]:
save(iris3, file = "iris.RData")

Remove the iris3 object:

In [None]:
rm(iris3)
exists("iris3")

And load it from the RData file:

In [None]:
load("iris.RData")

See it is imported:

In [None]:
exists("iris3")

In [None]:
iris3

### Extension packages

- readxl, xlsx, openxlsx and XLConnect packages import from and export to xls(x) files
- googlesheets package connects R to Google Sheets
- readr package of tidyverse extends the functionality of read.table, write.table and similar base utilities
- DBI, RPostgreSQL, RMySQL, ROracle, sqldf packages connect R to common database servers
- data.table package has faster implementations of file reading and writing with fread() and fwrite()

## Vectorization

Many functions in R accept whole vectors as inputs and operate on every value in the vector without the need for an explicit loop

The two main benefits of vectorization in R are:
- Speed: Natively vectorized functions (written in C, C++ or Fortran and compiled) are as fast as compiled code
- Conciseness: Vectorized functions simplify code writing: less and clearer code

Some functions, however, are not vectorized and can only handle single values.

In this case, Vectorize() can generate vectorized versions of these functions, although it does not bring the performance advantage of natively vectorized code

In [None]:
func_single <- function(x, y)
{
    if (x < y)
    {
        return("x is smaller than y")
    }
    else
    {
        return("x is not smaller than y")
    }
}

In [None]:
func_single(3, 1)
func_single(2, 5)

In [None]:
func_single(1:4, 5:2)

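The if condition needs a single logical value: R versions before 4.2.0 used only the first value (with a warning) and ignored the rest, while R 4.2.0 and later stop with an error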

In [None]:
func_vec <- Vectorize(func_single)

In [None]:
func_vec(1:4, 5:2)
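
When the logic is a simple condition, a natively vectorized alternative is often available. As a sketch (the name func_native is hypothetical, not part of the original notebook), ifelse() evaluates the condition for every pair of values without needing Vectorize():

```R
# ifelse() is vectorized over its test argument, so no wrapper is needed
func_native <- function(x, y)
{
    ifelse(x < y, "x is smaller than y", "x is not smaller than y")
}

# Compares 1 with 5, 2 with 4, 3 with 3 and 4 with 2
func_native(1:4, 5:2)
```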

### outer() with Vectorize

In order for outer() to work properly, the function provided to FUN argument must be vectorized

For example, let's create a matrix over the cartesian product of 1:10 with itself, where each value is the larger of its row and column index

Note that max() is not a vectorized function, it aggregates its input and returns a single value:

So this does not work:

In [None]:
outer(1:10, 1:10, max)

This does not work either:

In [None]:
outer(1:10, 1:10, function(x,y) max(x,y))

But this works:

In [None]:
outer(1:10, 1:10, Vectorize(function(x,y) max(x,y)))
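
Since the pairwise maximum has a natively vectorized counterpart in pmax(), Vectorize() is arguably not needed for this particular case. A small sketch comparing the two approaches (the object names are illustrative):

```R
# Wrapped version: max() is called once per pair of values
m_vec <- outer(1:10, 1:10, Vectorize(function(x, y) max(x, y)))

# Native version: pmax() already works element by element
m_pmax <- outer(1:10, 1:10, pmax)

# Both matrices hold the same values
all(m_vec == m_pmax)
```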

## Random number generation and simulation

### set.seed()

set.seed() makes (pseudo)random number generation reproducible: after a given seed is set, the same sequence of numbers is always generated
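
A quick way to see this is to draw twice after setting the same seed. This is only a sketch; the seed value 42 and the object names are arbitrary choices:

```R
# Same seed before each draw
set.seed(42)
first_draw <- runif(3)

set.seed(42)
second_draw <- runif(3)

# The two draws are exactly the same
identical(first_draw, second_draw)
```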

### sample()

sample() takes a random sample from a vector, with or without replacement:

In [None]:
set.seed(1)
sample(1:20, size = 15, replace = T)

In [None]:
set.seed(2)
sample(1:20, size = 15, replace = F)

### runif()

Generates uniformly distributed numeric (decimal) values from a given range:

In [None]:
runif(n = 10, min = 5, max = 8)

In [None]:
?rnorm

### rnorm()

Generates normally distributed values with a given mean and standard deviation:

In [None]:
rnorm(n = 20, mean = 2, sd = 1)

## Useful functions

### Vector operations

#### seq()

Creates a sequence of values. More versatile than the ":" operator:

In [None]:
?seq

```R
seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
    length.out = NULL, along.with = NULL, ...)
```

Sequence of values from 3, with steps of 2 and length of 7: 

In [None]:
seq(from = 3, by = 2, length.out = 7)

seq_along() creates a sequence along another vector. For a non-empty x it is the same as 1:length(x). Useful for creating indices across a vector:

In [None]:
samp_1 <- sample(20, 5)
samp_1

In [None]:
seq_along(samp_1)
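
The two forms differ for an empty vector, which is why seq_along() is usually the safer choice in loops. A minimal sketch (empty_vec is an illustrative name):

```R
empty_vec <- integer(0)

# 1:0 counts down, so this gives two indices for a vector with no items
1:length(empty_vec)

# seq_along() gives a zero-length result, so a loop over it runs zero times
seq_along(empty_vec)
```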

#### rev()

Reverses the order of a vector:

In [None]:
rev(1:5)

#### pmax(), pmin()

Vectorized maximum and minimum of the same indices on multiple vectors

In [None]:
set.seed(20)
samp_13 <- sample(100, 5)
samp_14 <- sample(100, 5)
samp_15 <- sample(100, 5)

samp_13
samp_14
samp_15

In [None]:
pmax(samp_13, samp_14, samp_15)

In [None]:
pmin(samp_13, samp_14, samp_15)

In [None]:
?pmax

### Logical functions

#### all()

Works on logical values and reports whether all values are TRUE

In [None]:
all(c(T,F,T))

In [None]:
all(c(T,T,T))

all() is lazy: once it finds a FALSE, the rest of the vector is not examined:

In [None]:
large_bool1 <- c(F, rep(T, 1e7))
large_bool2 <- c(rep(T, 1e7), F)

In [None]:
system.time(all(large_bool1))
system.time(all(large_bool2))

In the first example, the all() call stops at the FALSE at the very beginning, while in the second example it has to go through the whole vector

#### any()

Returns whether any of the values in a logical vector is TRUE

In [None]:
any(c(T, F, F))

In [None]:
any(c(F, F, F))

It is also a lazy function

### which(), which.min(), which.max()

which() returns the indices of the TRUE values of a logical vector

It is used to get the indices of the values in a vector that satisfy a condition

In [None]:
set.seed(3)
samp_2 <- sample(5, 10, replace = T)
samp_2

In [None]:
samp_2 > 3

In [None]:
which(samp_2 > 3)

So 2nd, 5th, 6th and 10th values in the samp_2 vector satisfy the > 3 condition

which.max() returns the index of the max value, which.min returns the index of the min value in a vector

In [None]:
set.seed(4)
samp_3 <- sample(10)
samp_3

In [None]:
which.max(samp_3)
which.min(samp_3)

### match()

Returns the first positions of the values of the first argument in the second argument

In [None]:
set.seed(5)
samp_4 <- sample(10)
samp_4

In [None]:
match(1, samp_4)

1 is the 5th item in samp_4

In [None]:
match(1:3, samp_4)

1:3 are 5th, 4th and 1st items in samp_4 respectively

### Ordering functions

#### sort()

Sorts a vector in ascending (default) or descending order

In [None]:
set.seed(6)
samp_5 <- sample(100, 10)
samp_5

In [None]:
sort(samp_5)

In [None]:
sort(samp_5, decreasing = T)

#### order()

order() returns a vector of indices that rearranges its first argument into ascending or descending order

In [None]:
set.seed(7)
samp_6 <- sample(5)
samp_6

In [None]:
order(samp_6)

It is the same as:

In [None]:
match(sort(samp_6), samp_6)

In order to arrange samp_6 in ascending order it must be subset in this order: 3rd, 2nd, 4th, 5th and 1st indices.

Let's check:

In [None]:
samp_6[order(samp_6)]

To get the descending order:

In [None]:
order(samp_6, decreasing = T)

Or (for numeric/integer vectors only):

In [None]:
order(-samp_6)

In [None]:
samp_6[order(-samp_6)]
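
In practice order() is mostly used to rearrange one object by the values of another. A small sketch with made-up data (the players and scores vectors are purely illustrative):

```R
players <- c("Ben", "Ada", "Cem")
scores <- c(72, 95, 88)

# Indices of scores from highest to lowest, used to subset players
players[order(scores, decreasing = TRUE)]
```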

#### rank()

Returns the position of each value when sorted:

In [None]:
set.seed(10)
samp_7 <- sample(20, 5)
samp_7

In [None]:
rank(samp_7)

When there are no ties, it gives the same values as:

In [None]:
match(samp_7, sort(samp_7))
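
With tied values the two differ, because rank() averages the ranks of ties by default. A short sketch on an illustrative vector named tied:

```R
tied <- c(10, 20, 10, 30)

# Default: the two 10s share the average of ranks 1 and 2
rank(tied)

# ties.method = "min" gives every tie the smallest of its ranks
rank(tied, ties.method = "min")

# match() finds the first position in the sorted vector, same as "min" here
match(tied, sort(tied))
```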

### Set operations: setdiff(), intersect(), union()

In [None]:
set.seed(200)
samp_10 <- sample(10, 7)
samp_11 <- sample(10, 7)
samp_10
samp_11

setdiff() returns the items of the first vector that are not in the second one, so it is not symmetric:

In [None]:
setdiff(samp_10, samp_11)

In [None]:
setdiff(samp_11, samp_10)

intersect() returns common values (values the same, orders may differ):

In [None]:
intersect(samp_10, samp_11)

In [None]:
intersect(samp_11, samp_10)

union() combines the unique values of both vectors:

In [None]:
union(samp_10, samp_11)

In [None]:
union(samp_11, samp_10)

### Counting functions

In [None]:
set.seed(200)
samp_12 <- sample(10, 100, replace = T)
length(samp_12)

unique() returns the unique values in a vector

In [None]:
unique(samp_12)

table() counts the occurrences of each unique item:

In [None]:
table(samp_12)

prop.table() converts the counts of a table into proportions of the total:

In [None]:
prop.table(table(samp_12))

### Rounding functions

round() rounds to the desired number of decimal places:

In [None]:
pi
round(pi)
round(pi, 1)
round(pi, 2)

In [None]:
round(1.6)
round(1.4)
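
One detail worth knowing: for values exactly halfway between two integers, round() follows the IEC 60559 "round half to even" rule instead of always rounding up. A minimal illustration:

```R
# Halfway cases go to the nearest even integer
round(0.5)   # 0
round(1.5)   # 2
round(2.5)   # 2
```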

ceiling() rounds up:

In [None]:
ceiling(1.6)
ceiling(-1.6)

floor() rounds down:

In [None]:
floor(1.6)
floor(-1.6)

trunc() rounds toward 0, i.e. it drops the fractional part:

In [None]:
trunc(1.6)
trunc(-1.6)

### Vector sum and products

sum() gets the single value sum of a vector:

In [None]:
sum(1:5)

cumsum() returns the cumulative sums from the first to nth values:

In [None]:
cumsum(1:5)

prod() returns the single value product of a vector:

In [None]:
prod(1:5)

cumprod() returns the cumulative products from the first to nth values

In [None]:
cumprod(1:5)

They can be combined with rev()

In [None]:
cumsum(rev(1:5))

In [None]:
cumprod(rev(1:5))

RcppRoll package has high performance rolling/windowed operations on vectors and matrices 

### Matrix operations

In [None]:
set.seed(30)
mat_3 <- matrix(sample(10, 25, replace = T), nrow = 5)
mat_3

colSums() returns the column sums

In [None]:
colSums(mat_3)

rowSums() returns the row sums

In [None]:
rowSums(mat_3)

max.col() returns the column index of the maximum value for each row (ties are broken at random by default, see the ties.method argument):

In [None]:
max.col(mat_3)

There is no built-in max.row but we can easily emulate its functionality:

In [None]:
max.col(t(mat_3))
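
When a row holds its maximum more than once, the default result of max.col() can change between runs. A sketch of the deterministic alternatives on a small illustrative matrix (tie_mat is a made-up name):

```R
# Row 1 has its maximum in columns 2 and 3, row 2 in columns 1 and 3
tie_mat <- matrix(c(1, 5, 5,
                    9, 2, 9), nrow = 2, byrow = TRUE)

max.col(tie_mat, ties.method = "first")   # 2 1
max.col(tie_mat, ties.method = "last")    # 3 3
```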

matrixStats package has high-performing functions for row and column operations on matrices

### Statistical functions

#### max(), min(), mean(), median(), sd()

Returns the respective single max, min, mean, median and sd values: 

In [None]:
set.seed(15)
samp_8 <- rnorm(1e4, 0, 1)

In [None]:
max(samp_8)

In [None]:
min(samp_8)

In [None]:
mean(samp_8)

In [None]:
median(samp_8)

In [None]:
sd(samp_8)

#### quantile()

With only a single argument, it returns the five-number summary (minimum, quartiles and maximum) of a numeric variable

In [None]:
quantile(samp_8)

Or you can define any percentile values. For example to get the deciles:

In [None]:
quantile(samp_8, probs = seq(0.1, 1, 0.1))

#### summary()

A generic function. For a numeric variable it provides a summary table of the statistics above (five-number summary + mean):

In [None]:
summary(samp_8)

#### cor(), cov(), var()

cor() returns correlation values between vectors:

In [None]:
set.seed(100)
samp_9 <- runif(100)
samp_10 <- runif(100)
samp_11 <- samp_9 + 2* samp_10

Between two variables:

In [None]:
cor(samp_9, samp_11)

Or as a correlation matrix when a matrix is provided:

In [None]:
cor(cbind(samp_9, samp_10, samp_11))

Variance of a vector:

In [None]:
var(samp_9)

Or a covariance matrix of multiple columns:

In [None]:
cov(cbind(samp_9, samp_10, samp_11))
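
The three functions are tied together: correlation is covariance scaled by both standard deviations, and variance is the covariance of a vector with itself. A sketch that checks this numerically on freshly simulated data (x and y are illustrative names, the seed is arbitrary):

```R
set.seed(42)
x <- runif(100)
y <- x + rnorm(100)

# cor(x, y) = cov(x, y) / (sd(x) * sd(y))
all.equal(cor(x, y), cov(x, y) / (sd(x) * sd(y)))

# var(x) = cov(x, x)
all.equal(var(x), cov(x, x))
```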

### Mathematical functions

abs() returns the absolute value:

In [None]:
abs(-10)

exp() returns the exponential: e^x

In [None]:
exp(1)

log() returns the logarithm in a given base (the default base is e = exp(1)):

In [None]:
log(exp(1))
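
Other bases can be requested through the base argument, and base R also has the shortcuts log2() and log10(). A brief sketch:

```R
# Base 2 logarithm, in two equivalent ways
log(8, base = 2)
log2(8)

# Base 10 logarithm, in two equivalent ways
log(1000, base = 10)
log10(1000)
```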

sqrt() returns the square root:

In [None]:
sqrt(16)

factorial() returns the factorial (n!)

In [None]:
factorial(5)

#### Extension packages

gmp, numbers, adagio packages have efficient implementations of numeric operations

### Trigonometric functions

Degrees should be converted to radians as such $180\,^{\circ} = \pi$

In [None]:
degvalues <- seq(0, 360, 45)
degvalues

In [None]:
radians <- degvalues / 180 * pi
radians

In [None]:
sin(radians)

In [None]:
cos(radians)

In [None]:
tan(radians)

### Combinatorics

#### choose()

C(n,k) is the number of k sized combinations out of a vector of size n. It is a vectorized function:

In [None]:
choose(10, 0:10)
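
As a sanity check, choose() can be compared with the textbook formula $C(n,k) = \frac{n!}{k!\,(n-k)!}$, and a whole row of binomial coefficients should sum to $2^n$. A sketch with illustrative values for n and k:

```R
n <- 10
k <- 3

# Both lines should give the same count
choose(n, k)
factorial(n) / (factorial(k) * factorial(n - k))

# The coefficients for k = 0..n add up to 2^n
sum(choose(n, 0:n)) == 2^n
```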

#### expand.grid()

Returns the cartesian product of multiple vectors as a data frame:

In [None]:
expand.grid(1:2, 1:3)

In [None]:
expand.grid(list(1:2, 3:5, 4:7))

#### combn()

Returns all unique k sized combinations of a vector of size n

In [None]:
?combn

In [None]:
combn(1:5, 3)

#### Extension packages

gtools, combinat, permutations and iterpc packages provide more functionality on combinatorics

### List operations

#### split()

split() divides a data.frame, matrix or vector into a list, based on the distinct values of another vector:

In [None]:
?split

First let's create a letter vector of size 20 from the first 5 letters of the alphabet:

In [None]:
samp_2 <- sample(letters[1:5], 20, replace = T)
samp_2

Let's create a vector of indices along samp_2:

In [None]:
ind_2 <- seq_along(samp_2)
ind_2

Now let's split the indices by the values in samp_2: The indices corresponding to each unique value of samp_2 will be held in a separate list item:

In [None]:
split_1 <- split(ind_2, f = samp_2)
split_1

So "a" appears in 7th, 15th, 16th and 18th positions

In [None]:
str(split_1)
attributes(split_1)

split_1 is a list
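
split() is also handy for group-wise summaries. A sketch on the built-in iris data (sepal_by_species is an illustrative name), splitting the sepal lengths by species and then summarizing each list item:

```R
# One list item per species, each holding that species' sepal lengths
sepal_by_species <- split(iris$Sepal.Length, f = iris$Species)

names(sepal_by_species)
lengths(sepal_by_species)

# Mean sepal length per species
sapply(sepal_by_species, mean)
```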

#### do.call()

Calls a function once, passing the items of a list as its arguments

Let's create a list of ten vectors of the same size:

In [None]:
list_3 <- list()

set.seed(100)
for (i in 1: 10)
{
    vec <- sample(100, 10, replace = T)
    list_3[[i]] <- vec
}

list_3

Now combine all list items into a matrix without explicitly providing each vector as a separate argument to cbind():

In [None]:
do.call(cbind, list_3)
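
To make the mechanics explicit, here is a sketch on a smaller, illustrative list (small_list): do.call() builds a single call in which every list item becomes one argument, and named list items become named arguments:

```R
small_list <- list(1:3, 4:6, 7:9)

# These two lines make the same call to rbind()
do.call(rbind, small_list)
rbind(small_list[[1]], small_list[[2]], small_list[[3]])

# Named items are passed as named arguments, here sep = "-"
do.call(paste, list("a", "b", sep = "-"))
```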

### Optimization

#### solve()

Solves a system of linear equations

```R
Solve a System of Equations

Description:

     This generic function solves the equation ‘a %*% x = b’ for ‘x’,
     where ‘b’ can be either a vector or a matrix.

Usage:

     solve(a, b, ...)
```

In [None]:
?solve

In [None]:
a <- matrix(runif(25, -10, 10), nrow = 5)
a

In [None]:
b <- matrix(runif(10, -5, 5), ncol = 2)
b

In [None]:
x <- solve(a, b)
x

Let's confirm

In [None]:
b2 <- a %*% x
b2

Check whether b and b2 are equal:

In [None]:
identical(b, b2)

In [None]:
b == b2

Because of floating-point arithmetic, computed results can differ from the exact values by tiny rounding errors, so exact comparisons may fail

For this purpose "near" equality should be checked with the all.equal() function:

In [None]:
all.equal(b, b2)
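
The same effect shows up with plain numbers. Note that all.equal() returns a description of the difference instead of FALSE when the values differ, so inside conditions it is commonly wrapped in isTRUE(). A minimal sketch:

```R
# Exact comparison fails because of rounding in binary floating point
0.1 + 0.2 == 0.3

# Comparison within a small tolerance succeeds
all.equal(0.1 + 0.2, 0.3)

# For values that really differ, isTRUE() turns the message into FALSE
isTRUE(all.equal(1, 1.1))
```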

#### optimize()

Finds the min or max of a function vis-a-vis a single argument over an interval

```
optimize                 package:stats                 R Documentation

One Dimensional Optimization

The function ‘optimize’ searches the interval from ‘lower’ to ‘upper’ for a minimum or maximum of the function ‘f’ with respect to its first argument.

optimize(f, interval, ..., lower = min(interval), upper = max(interval), maximum = FALSE, tol = .Machine$double.eps^0.25)
     
Arguments:

       f: the function to be optimized.  The function is either
          minimized or maximized over its first argument depending on
          the value of ‘maximum’.

interval: a vector containing the end-points of the interval to be
          searched for the minimum.

     ...: additional named or unnamed arguments to be passed to ‘f’.

   lower: the lower end point of the interval to be searched.

   upper: the upper end point of the interval to be searched.

 maximum: logical.  Should we maximize or minimize (the default)?
```

Let's create a polynomial function:

In [None]:
polynom4 <- function(x) sum(x^(2:0) * c(1, -3, 8))

In [None]:
polynom4(1)

In [None]:
optimize(f = polynom4, interval = c(-100, 100), maximum = F)

In [None]:
optimize(f = polynom4, interval = c(-100, 100), maximum = T)
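
The polynomial above is $x^2 - 3x + 8$, whose minimum can be found by hand at $x = -b/(2a) = 1.5$ with value $5.75$. A sketch that compares optimize() with this analytic result (quad and res are illustrative names; the function is rewritten in plain form for readability):

```R
quad <- function(x) x^2 - 3 * x + 8

res <- optimize(f = quad, interval = c(-100, 100))

# Location and value of the minimum, close to 1.5 and 5.75
res$minimum
res$objective
```

An upward-opening parabola has no finite maximum, so the result of the maximum = T call depends on the chosen interval and not on the shape of the function.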

#### optim()

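General-purpose optimization over multiple parameters (minimization by default). The function to be optimized takes all parameters as a single vector: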

In [None]:
xy1 <- function(arg)
{
    x <- arg[1]
    y <- arg[2]
    3 * x^2 - 4 * x + 5 * y - 2 * y^2 + 3
}

In [None]:
optim(par = c(0, 0), fn = xy1, method = "L-BFGS-B", lower = -5, upper = 5)
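
To read the output of optim() it helps to try a function with a known answer. The sketch below uses an illustrative bowl-shaped function (bowl is a made-up name) whose minimum is at x = 1, y = -2 with value 0:

```R
bowl <- function(p) (p[1] - 1)^2 + (p[2] + 2)^2

fit <- optim(par = c(0, 0), fn = bowl, method = "L-BFGS-B", lower = -5, upper = 5)

fit$par           # estimated location of the minimum
fit$value         # function value at that location
fit$convergence   # 0 signals successful convergence
```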

#### Extension packages

optimx and optimization packages provide more functionality for optimization

### Object information

#### str()

Returns the structure of an object:

In [None]:
mat_1 <- matrix(1:25, nrow = 5)
list_2 <- list(mat_1, samp_1, df1 = as.data.frame(mat_1), list(1:3))
list_2

In [None]:
str(list_2)

#### object.size()

Returns the size of an object in memory

**NOTE THAT R IS BASICALLY AN IN MEMORY COMPUTATION ENVIRONMENT. SO EFFICIENCY IN MEMORY USAGE IS IMPORTANT** 

In [None]:
object.size(1:1e5)

### Performance functions

#### system.time()

Returns the execution time of an expression in seconds

In [None]:
system.time(any(c(rep(F, 1e6), T)))

microbenchmark and rbenchmark packages bring more functionality and precision to performance measurement

## \*plying

A very important functionality of R is provided by the \*apply family of functions


Although they are not as fast or concise as natively vectorized code, they can replace more verbose loops and may have some performance benefits over them:

### apply

apply() works with matrices: It applies a function on each row or column of a matrix

In [None]:
?apply

```
apply(X, MARGIN, FUN, ...)

Arguments

X	
an array, including a matrix.

MARGIN	
a vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names.

FUN	
the function to be applied: see ‘Details’. In the case of functions like +, %*%, etc., the function name must be backquoted or quoted.

...	
optional arguments to FUN.

```

In [None]:
mat_1 <- matrix(1:25, nrow = 5)
mat_1

Now apply function max() to each row of the matrix. row is the 1st margin: 

In [None]:
apply(mat_1, 1, max)

And apply max() to each column:

In [None]:
apply(mat_1, 2, max)

And we can define a new function to be applied on the spot:

In [None]:
apply(mat_1, 2, function(x) max(x) - min(x))

Functions in apply can take more than one argument

For example, extract the third element from each row (of course not an efficient implementation)

In [None]:
apply(mat_1, 1, "[", 3)

If the function returns multiple values for each row, each return value becomes a column in the returned matrix:

In [None]:
apply(mat_1, 1, "[", 3:4)
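
Because the results are stored column by column, the output is transposed relative to the input rows. A sketch that makes the shapes explicit (m and res are illustrative names, m is built the same way as mat_1):

```R
m <- matrix(1:25, nrow = 5)

# Two values per row, five rows: the result has 2 rows and 5 columns
res <- apply(m, 1, "[", 3:4)
dim(res)

# Transposing restores the row orientation, same values as m[, 3:4]
t(res)
```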

What if we want to have each row work with separate values?

#### apply() exercise

**EXERCISE 5:**

Suppose the neighbours of an item in a vector are the two adjacent values and the value itself (for the first and last values there are only two neighbours)

Create a function max_neigh() that takes a vector and returns the largest neighbour of each value in a vector.

So if vec_1 is:

```R
set.seed(123)
vec_1 <- sample(50, 10, replace = T)
vec_1

15 40 21 45 48 3 27 45 28 23
```

```R
max_neigh(vec_1)
40 40 45 48 48 48 45 45 45 28
```

Hint: 
- Create three vectors: the original, one shifted to the left and one shifted to the right.
- You may use do.call() and rbind() and you MUST use apply()

**SOLUTION 5:**

In [None]:
pass <- readline(prompt = "Please enter the password for the solution: ")
encrypt <- "U2FsdGVkX1/w2XjH2qjOd82AZXVkShNl1lYixhQO6oZY7XW4Qomh9/YLKR2wbJet g3euq8UAk6X2knx6Nw/INqM3IRFk6FIGrFnAYZp/UkbnBAwRVTR5kaF5qk0zO8Mi TFCgFxwqJpVjx1og+tLWrU6QJIeF4QpeOIjR99eI8VHUIqo9+IlxIowPkgfahYPh HBorJMf0hTLekHhAg6kpYfMJqJgPqIqYPrHzYQ2WMCBdsUGj8FSrotvDq2esnk9f w9uS1P3SFVfqxI8wo/KZsjT6/Uu0Hs+sU9Oys60pViCWbaUISqzXzC0mGajY1gPW 23ZcfZ41x/ONoZzzMe+d9myjVoqydaCbLkUXpbi1xAaMdPn0NIc31OxHs+nKEKnv gUXg0XxJqnCUwz0r/sEAunghCkrjqVCTewSD1ob8yx4="
solution <- system(sprintf("echo %s | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:%s 2> /dev/null", encrypt, pass), intern = T, ignore.stderr = T)
cat(solution, sep = "\n")
eval(parse(text = solution))

### lapply

lapply() is list apply: It works on lists as well as vectors but always returns a list: 

Let's generate rows 1 to 10 of Pascal's triangle:

In [None]:
pascalst <- lapply(1:10, function(x) choose(x, 0:x))
pascalst

Then we can get the max value for each item as a list:

In [None]:
lapply(pascalst, max)

We can have a vector by unlisting the object:

In [None]:
unlist(lapply(pascalst, max))

#### lapply() exercise

**EXERCISE 6:**

Let's revisit the previous list example:

> Create a list of 100 items and in each item, the proper divisors of the index (1 to 100) should be collected

This time use lapply() to do the same thing

**SOLUTION 6:**

In [None]:
pass <- readline(prompt = "Please enter the password for the solution: ")
encrypt <- "U2FsdGVkX1/xf/WZMAr+z21/wE6OTzvfDYWalqdsXpwkgXVVNaxDEhQZjf/9NOL7 eJIgNqD3oSMZrb7pdgOmZyTE2C/WmEpYnMjEmLAc7n8="
solution <- system(sprintf("echo %s | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:%s 2> /dev/null", encrypt, pass), intern = T, ignore.stderr = T)
cat(solution, sep = "\n")
eval(parse(text = solution))

### sapply

sapply() is the "s"implified lapply: by default it returns a vector (or matrix) whenever the results can be simplified

Think about the previous example of getting the maximum values out of Pascal's triangle:

In [None]:
sapply(pascalst, max)
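
Since the return type of sapply() depends on its input, base R also offers vapply(), where the expected type and length of each result are declared up front. A sketch on a shorter, illustrative triangle (pascal_rows is a made-up name):

```R
pascal_rows <- lapply(1:5, function(n) choose(n, 0:n))

# Each result must be a single numeric value, otherwise vapply() stops with an error
vapply(pascal_rows, max, FUN.VALUE = numeric(1))

# On empty input sapply() falls back to an empty list ...
sapply(list(), max)

# ... while vapply() still returns an empty numeric vector
vapply(list(), max, FUN.VALUE = numeric(1))
```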

It can also iterate through vectors

Let's get the max of each of the values versus 5:

In [None]:
sapply(1:10, max, 5)

Now let's make the second argument to the function a multi valued vector

And get the matching index of each of the values in the first argument inside the samp_20 vector:

In [None]:
set.seed(300)
samp_20 <- sample(20)
samp_20

In [None]:
sapply(1:10, match, samp_20)

What if I want to make a pairwise comparison?

In [None]:
sapply(1:10, max, 10:1)

That did not do the job: In each iteration over 1:10, the max is done against the whole 10:1 vector

### mapply()

mapply is the multivariate sapply: It can iterate through multiple vectors or lists

So the previous example becomes:

In [None]:
mapply(max, 1:10, 10:1)

Now max() operated pairwise on each index of both vectors
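
For this particular task the natively vectorized pmax() gives the same values, so mapply() is mainly needed when no vectorized counterpart exists. A quick sketch of the comparison (pair_max is an illustrative name):

```R
pair_max <- mapply(max, 1:10, 10:1)
pair_max

# Same values as the vectorized pairwise maximum
all(pair_max == pmax(1:10, 10:1))
```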

What if I want mapply to iterate through some arguments and take other arguments as a whole? An argument that should be taken as a whole is wrapped into a list of one item:

Get pairwise maximum of first two vectors and find the matching index of this maximum inside the samp_20 vector

In [None]:
samp_20
list(samp_20)

Note that the vector should be an item of the list, not be converted into a list itself as such:

In [None]:
as.list(samp_20)

In [None]:
mapply(function(x, y, z) match(max(x,y), z), 1:10, 10:1, list(samp_20))
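
mapply() also has a dedicated MoreArgs argument for values that should be passed as a whole to every call. A sketch comparing both styles (lookup, via_list and via_more are illustrative names; lookup is generated the same way as samp_20):

```R
set.seed(300)
lookup <- sample(20)

# Whole vector wrapped into a one-item list, recycled for every call
via_list <- mapply(function(x, y, z) match(max(x, y), z), 1:10, 10:1, list(lookup))

# Whole vector passed through MoreArgs instead
via_more <- mapply(function(x, y, z) match(max(x, y), z), 1:10, 10:1,
                   MoreArgs = list(z = lookup))

identical(via_list, via_more)
```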

### Extension packages

- plyr package enables \*ply operations on multidimensional arrays
- purrr package from tidyverse brings functionality similar to the \*apply functions, in a form more consistent with other tidyverse packages

## Basic plotting

Base R provides some basic plotting capabilities for data structures.

However, much of the power of visualization in R comes from the many extension packages, a good number of which build on JS and other visualization libraries

### scatterplot

Let's first create two series and their mean:

In [None]:
set.seed(1000)
series_1 <- rnorm(100)
series_2 <- rnorm(100)
series_12 <- (series_1 + series_2)/2

In [None]:
plot(series_1,
     series_12,
     col = "blue",
     pch = 5,
     xlab = "1st series",
     ylab = "Mean of 1st and 2nd series",
     main = "A simple scatterplot")

### Line plot

In [None]:
rads <- seq(0, 360, 10) / 180 * pi
sin2r <- sin(2 * rads)
cosr <- cos(rads)

In [None]:
plot(x = cosr,
     y = sin2r,
     type = "l",
     col = "green",
     xlab = "Cosine",
     ylab = "Sine of 2*rad",
     main = "Green Papillon")

### hist()

Let's draw a histogram for the distribution of values:

In [None]:
series_4 <- rnorm(1000, 10, 2)

In [None]:
hist(series_4)

We can have histograms with different numbers of breaks (the number is only a suggestion, hist() picks "pretty" break points close to it):

In [None]:
# invisible() suppresses printing of the returned histogram objects
invisible(lapply(seq(4, 16, 4), function(x) hist(series_4, breaks = x, main = x)))
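
hist() also returns an object describing the bins, which can be inspected without drawing anything. A sketch on freshly simulated data (values and h are illustrative names, the seed is arbitrary):

```R
set.seed(1)
values <- rnorm(1000, 10, 2)

# plot = FALSE computes the histogram without drawing it
h <- hist(values, breaks = 8, plot = FALSE)

h$breaks                 # break points actually chosen
length(h$breaks) - 1     # number of bins, not necessarily the 8 requested
sum(h$counts)            # every value falls into exactly one bin
```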

### Extension packages

Numerous R packages add a very wide range of visualization capabilities, covering most of what is available on other platforms and languages for data science. The most important and general ones are:

- ggplot2 (a very powerful, versatile and easy to use system, a part of tidyverse)
- plotly (a very advanced JS powered library that adds interactivity, 2D and 3D)
- shiny (makes any visualization interactive)
- ggiraph, ggiraphExtra, gg3D (extends ggplot2 functionality)
- lattice (versatile and powerful visualization system)

A selection of important packages for specific purposes are:
- DT (its datatable() function is an interface to the JS DataTables library, for very powerful tabular data visualization; not to be confused with the high performance data science package data.table)
- knitr, kableExtra (enhanced tabular visualizations)
- visNetwork (interactively visualizing network and treelike data structures)
- plot3D, plot3Drgl, rgl (interactive 3D visualizations)
- visreg (visualizations of regression models)
- D3partitionR, r2d3 (JS library D3 enhanced visualizations)
- dygraphs (interactive visualization of time series)
- gridExtra (enhanced grid graphics: multiple plots)
- dendextend (very enhanced dendrogram visualizations)
- VIM (missing data and imputation visualizations)
- heatmaply (advanced interactive heatmap visualizations)
- factoextra (distance and cluster visualizations)
- arulesViz (visualization of association rules models)
- wordcloud, wordcloud2 (visualization of word clouds)
- corrplot (advanced visualizations of correlation plots)
- GGally (advanced visualizations of statistical summaries)
- ggmap (GIS visualizations)


A wide gallery of available visualizations can be found here:

https://www.r-graph-gallery.com/

https://shiny.rstudio.com/gallery/

http://gallery.htmlwidgets.org/

http://www.ggplot2-exts.org/gallery/

https://plot.ly/r/

## String operations

### sprintf()

R wrapper for the C library function sprintf()

Use it as a template:

In [None]:
sprintf("The value to be inserted here is %s", 3:5)

Now pad the numbers with leading zeros to a fixed total number of digits:

In [None]:
sprintf("%0.7d", 10^(0:6))
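
A few more format specifications that come up often, shown as a small sketch: zero padding with a field width, a fixed number of decimals, right alignment of strings, and mixing several placeholders:

```R
sprintf("%07d", 42L)                           # "0000042"
sprintf("%.2f", pi)                            # "3.14"
sprintf("%5s|", "ab")                          # "   ab|"
sprintf("%s has %d rows", "iris", nrow(iris))  # "iris has 150 rows"
```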

### paste()

Concatenating words:

In [None]:
words_1 <- c("combine", "these", "words")

In [None]:
paste(words_1, collapse = " ")

In [None]:
sent_1 <- paste("combine", "these", "words", sep = " ")
sent_1

### strsplit()

The inverse of paste(): splits the elements of a character vector at the given split characters

Returns a list:

In [None]:
sent_1

In [None]:
strsplit(sent_1, split = " ")

In [None]:
strsplit(words_1, split = "")
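
Two practical points, sketched below with illustrative inputs: the split argument is treated as a regular expression unless fixed = TRUE is set (which matters for characters such as "."), and [[1]] extracts the plain character vector when a single string was split:

```R
# "." is a regex wildcard, so fixed = TRUE is needed to split at literal dots
strsplit("a.b.c", split = ".", fixed = TRUE)[[1]]

# The result for one input string, extracted from the list
strsplit("2024-01-15", split = "-")[[1]]
```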

### substring()

Returns substrings of character vectors

month.name is a built-in object for month names:

In [None]:
month.name

In [None]:
substring(month.name, 1, 3)

### grep(), grepl()

Regular expressions pattern matching functionality:

```R
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
          fixed = FALSE, useBytes = FALSE, invert = FALSE)
```

Return the indices of the items that contain either "uar" or "ber":

In [None]:
grep("uar|ber", month.name, perl = T)

Or the values:

In [None]:
grep("uar|ber", month.name, perl = T, value = T)

Or logical values for match:

In [None]:
grepl("uar|ber", month.name, perl = T)
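
Anchors restrict where a pattern may match: ^ ties it to the start and $ to the end of each string. A short sketch on month.name:

```R
# Months starting with "J"
grep("^J", month.name, value = TRUE)

# Months ending with "y", using the logical result for subsetting
month.name[grepl("y$", month.name)]

# fixed = TRUE matches the text literally, without regex interpretation
grep("ember", month.name, fixed = TRUE)
```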

### gsub()

Pattern substitution with regular expressions:

Find the "ua"'s (positive lookbehind), capture the 2 characters after the match, put "li" before the capture group and "co" after the capture group

In [None]:
gsub("(?<=ua)(.{2})", "li\\1co", month.name, perl = T)
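
The building blocks of that call can be seen in isolation in the sketch below: sub() replaces only the first match while gsub() replaces all of them, and \\1, \\2 refer back to the capture groups in the pattern:

```R
sub("a", "A", "banana")    # first match only: "bAnana"
gsub("a", "A", "banana")   # all matches: "bAnAnA"

# Swap the two captured numbers: "20-10"
gsub("(\\d+)-(\\d+)", "\\2-\\1", "10-20")

# The lookbehind example applied to a single month: "Janualiryco"
gsub("(?<=ua)(.{2})", "li\\1co", "January", perl = TRUE)
```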

### Extension packages

- stringr package of tidyverse provides more functionality on string operations and regex
- tm package is powerful for text mining operations