# Learn R for Statistics
## Section 2: Data and Programming
### 1. Data types

R has a number of basic data types:
- Numeric (Double)
- Integer
- Complex
- Logical
- Character
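
The types listed above can be checked with class() and typeof(). The following is a minimal sketch; the names `values` and `types` are introduced here only for illustration.

```r
# One example value per basic data type
values <- list(10.1, 30L, 4 + 2i, TRUE, "Statistics")

# class() reports the high-level type of each value
types <- sapply(values, class)
types
# "numeric" "integer" "complex" "logical" "character"

# typeof() reports the internal storage type; numeric values are doubles
typeof(10.1)  # "double"
```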

In [2]:
10.1 
30L
4 + 2i
TRUE
"Statistics"

### 2. Data structures
R also has a number of basic data structures. A data structure is either homogeneous (all elements are of the same data type) or heterogeneous (elements can be of more than one data type).

#### 2.1 Vectors
A vector is a data structure that stores elements of the same type. Many operations in R make heavy use of vectors. Vectors in R are indexed
starting at 1. When a long vector is printed, each additional row of output starts with [*], where * is the index of the first element of that row.

##### To create a vector there are four ways:
- c()
- :
- seq()
- rep()

###### Use c()

In [3]:
c(1,3,5,7,8,9)

A vector can be stored in a variable using the assignment operator = or <-.

There is little practical difference between the two for assignment, but stay consistent within a script.

In [10]:
x = c(1,3,5,7,8,9)
y <- c(4,2,1)
y
x[1]

When a vector is created from values of multiple types, R automatically coerces all of them to a single type.

In [19]:
c(42, "Statistics", TRUE)
c(32, FALSE)
c(4+2i, 20, 30L)
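
To see which type wins in each case, the results can be inspected with class(). This is a small sketch; the variable names `mixed_chr`, `mixed_num` and `mixed_cpl` are made up for the example.

```r
mixed_chr <- c(42, "Statistics", TRUE)  # everything becomes character
mixed_num <- c(32, FALSE)               # FALSE becomes 0
mixed_cpl <- c(4 + 2i, 20, 30L)         # everything becomes complex

class(mixed_chr)  # "character"
mixed_chr         # "42" "Statistics" "TRUE"
class(mixed_num)  # "numeric"
mixed_num         # 32 0
class(mixed_cpl)  # "complex"
```

The pattern is that R picks the most general type involved: logical, then integer, then numeric, then complex, then character.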

###### Use : and seq()
Create a vector from a sequence of numbers.

In [21]:
(y = 1:100)

In [23]:
seq(from = 2.1, to = 2.2, by = 0.01)
seq(2.1, 2.2, 0.01)

###### Use rep()
rep() repeats a value or a vector a given number of times.

In [27]:
rep("A", times = 10)
rep("B", 2)
rep(x,2)

###### Combine methods
So far we have mostly used them in isolation, but they are often used together.

In [28]:
c(x, rep(seq(1, 9, 2), 3), c(1, 2, 3), 42, 2:4)

##### Length
The length of a vector can be obtained with the length() function.

In [29]:
length(x)
length(y)

##### Subsetting
To subset a vector, use square brackets [ ]

In [31]:
x
x[1]
x[-2] #exclude second element

#subset based on vector of indices:
x[1:3] 
x[c(1,3,4)]
z = c(TRUE,TRUE,FALSE,TRUE,TRUE,FALSE)
x[z]

##### Vectorization
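Vectorization is one of the biggest strengths of R: arithmetic operators and many functions are applied to every element of a vector at once.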

In [32]:
x = 1:10
x + 1
2 * x
2 ^ x
sqrt(x)
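
As a rough illustration of what vectorization saves, the sketch below computes the same squares with an explicit loop and with a single vectorized expression. The names `squares_loop` and `squares_vec` are hypothetical.

```r
x <- 1:10

# Explicit loop: fill a result vector one element at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized: the operator is applied to every element at once
squares_vec <- x^2

identical(squares_loop, squares_vec)  # TRUE
```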

##### Logical operators
Types:
- < / > : less than/greater than
- <= / >= : less/greater than or equal to
- == : equal to
- != : not equal to
- !x : not x
- x | y : x or y
- x & y : x and y

In R, logical operators are vectorized

In [33]:
x = c(1,3,5,7,8,9)
x > 3
x == 3
x != 3
x == 3 & x != 3


This is useful for subsetting

In [41]:
x[x > 3]
x[x != 3]
sum(x > 3)
# counts the elements greater than 3: R coerces TRUE/FALSE to 1/0 before summing
as.numeric( x > 3) 
which(x > 3) #return indices
x[which(x > 3)]
max(x)
which.max(x)

##### More vectorization


In [42]:
x = c(1, 3, 5, 7, 8, 9)
y = 1:100

In [43]:
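x + y

Warning message: "longer object length is not a multiple of shorter object length"

R recycles the shorter vector x to match the length of y. Because 100 is not a multiple of 6, the last repetition of x is cut short and R issues the warning above.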

In [44]:
y = 1:60
x + y

In [58]:
all(x + y == rep(x, 10) + y)
# all() returns TRUE only if every element of the logical vector is TRUE
identical(x + y, rep(x, 10) + y)
# identical() tests whether two objects are exactly the same
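
The recycling behaviour is easier to follow on a smaller case. This is a minimal sketch with made-up vectors `short_vec` and `long_vec`.

```r
short_vec <- c(1, 2)
long_vec <- 1:6

# short_vec is repeated three times to match the length of long_vec
recycled <- short_vec + long_vec
recycled  # 2 4 4 6 6 8

# Same result as repeating the shorter vector explicitly
identical(recycled, rep(short_vec, 3) + long_vec)  # TRUE
```

Because 6 is a multiple of 2, no warning is issued here.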

#### 2.2 Matrices
Matrices have rows and columns containing a single data type. In a matrix, the order of rows and columns is important.

##### Create matrix
Matrices can be created using the matrix function.

In [60]:
x = 1:9
X = matrix(x, nrow = 3, ncol = 3)
X

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9


By default the matrix function reorders a vector into columns, but we can also tell R to use rows instead

In [61]:
Y = matrix(x, nrow = 3, ncol = 3, byrow = TRUE)
Y

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9


Create a matrix where every element is the same 

In [63]:
Z = matrix(0, 3, 2)
Z

     [,1] [,2]
[1,]    0    0
[2,]    0    0
[3,]    0    0


##### Subsetting 
To subset a matrix, use [ ] and specify the row and the column. Leaving one of them empty selects the whole row or column.

In [65]:
X
X[1,2]
X[1,]
X[,2]

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9


More than one row or column can be selected at a time.

In [66]:
X[2, c(1, 3)]

##### Combining

In [67]:
x = 1:9
rev(x) # reverse vector
rep(1,9)

- rbind: combine vectors to matrix where each is a row
- cbind: combine vectors to matrix where each is a column

Argument names can be specified to label the resulting rows or columns.

In [72]:
rbind(x, rev(x), rep(1, 9))
cbind(col_1 = x, col_2 = rev(x), col_3 = rep(1, 9)) 

  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
x    1    2    3    4    5    6    7    8    9
     9    8    7    6    5    4    3    2    1
     1    1    1    1    1    1    1    1    1


      col_1 col_2 col_3
 [1,]     1     9     1
 [2,]     2     8     1
 [3,]     3     7     1
 [4,]     4     6     1
 [5,]     5     5     1
 [6,]     6     4     1
 [7,]     7     3     1
 [8,]     8     2     1
 [9,]     9     1     1


##### Matrix calculation

In [1]:
X = matrix(1:9, 3, 3)
Y = matrix(9:1, 3, 3)
X
Y

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9


     [,1] [,2] [,3]
[1,]    9    6    3
[2,]    8    5    2
[3,]    7    4    1


In [2]:
X + Y
X - Y
X * Y # element by element multiplication
X / Y

     [,1] [,2] [,3]
[1,]   10   10   10
[2,]   10   10   10
[3,]   10   10   10


     [,1] [,2] [,3]
[1,]   -8   -2    4
[2,]   -6    0    6
[3,]   -4    2    8


     [,1] [,2] [,3]
[1,]    9   24   21
[2,]   16   25   16
[3,]   21   24    9


          [,1]      [,2]     [,3]
[1,] 0.1111111 0.6666667 2.333333
[2,] 0.2500000 1.0000000 4.000000
[3,] 0.4285714 1.5000000 9.000000


In [3]:
X %*% Y #Matrix multiplication

     [,1] [,2] [,3]
[1,]   90   54   18
[2,]  114   69   24
[3,]  138   84   30


In [4]:
t(X) #transpose

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9


In [6]:
Z = matrix(c(9, 2, -3, 2, 4, -2, -3, -2, 16), 3, byrow = TRUE)
solve(Z) #inverse of a matrix (if it is invertible)

            [,1]        [,2]       [,3]
[1,]  0.12931034 -0.05603448 0.01724138
[2,] -0.05603448  0.29094828 0.02586207
[3,]  0.01724138  0.02586207 0.06896552


In [7]:
solve(Z) %*% Z

              [,1]          [,2]          [,3]
[1,]  1.000000e+00 -6.245005e-17  0.000000e+00
[2,]  8.326673e-17  1.000000e+00 -5.551115e-17
[3,] -2.775558e-17  0.000000e+00  1.000000e+00


In [8]:
diag(3)

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1


In [9]:
all.equal(solve(Z) %*% Z, diag(3))
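
The reason for using all.equal() here is that solve(Z) %*% Z differs from the identity matrix by tiny rounding errors, so an exact comparison with == is unreliable. A minimal sketch of the difference, with the hypothetical names `product`, `exact_match` and `approx_match`:

```r
Z <- matrix(c(9, 2, -3, 2, 4, -2, -3, -2, 16), 3, byrow = TRUE)
product <- solve(Z) %*% Z

# Exact comparison: may be FALSE because of floating-point rounding
exact_match <- all(product == diag(3))
exact_match

# Comparison within a small numerical tolerance
approx_match <- isTRUE(all.equal(product, diag(3)))
approx_match  # TRUE
```

Wrapping all.equal() in isTRUE() is the usual pattern inside if conditions, since all.equal() returns a description of the differences rather than FALSE when the objects differ.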

In [11]:
dim(Z) #dimension
rowSums(Z)
colSums(Z)
rowMeans(Z)
colMeans(Z)

diag() can extract the diagonal of a matrix or create a diagonal matrix.

In [13]:
diag(Z)
diag(1:3)
diag(3)

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    2    0
[3,]    0    0    3


     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1


#### 2.3 Calculations with vectors and matrices

Some operations have different behavior on vectors and matrices.

Example: %*% applied to two vectors of the same length computes their inner product.

In [15]:
a_vec = c(1,2,3)
b_vec = c(2,2,2)
c(is.vector(a_vec), is.vector(b_vec))

In [16]:
a_vec %*% b_vec 

     [,1]
[1,]   12


%o% computes the outer product of two vectors.

Important: when a vector is coerced to a matrix with as.matrix(), it becomes a column vector (an n x 1 matrix).

In [23]:
as.matrix(a_vec)
a_vec %o% b_vec

     [,1]
[1,]    1
[2,]    2
[3,]    3


     [,1] [,2] [,3]
[1,]    2    2    2
[2,]    4    4    4
[3,]    6    6    6


But in this case, b_vec is automatically treated as a 1x3 matrix (a row vector) so that the dimensions conform with the 3x1 matrix on the left.

In [24]:
as.matrix(a_vec) %*% b_vec

     [,1] [,2] [,3]
[1,]    2    2    2
[2,]    4    4    4
[3,]    6    6    6


Another way to compute the inner and outer products is with crossprod() and tcrossprod().

In [25]:
crossprod(a_vec, b_vec)
tcrossprod(a_vec, b_vec)

     [,1]
[1,]   12


     [,1] [,2] [,3]
[1,]    2    2    2
[2,]    4    4    4
[3,]    6    6    6


When used with matrices, crossprod(X, Y) calculates t(X) %*% Y.

This is really useful in linear models, where products such as t(X) %*% X appear often.

In [27]:
C_mat = matrix(c(1, 2, 3, 4, 5, 6), 2, 3)
D_mat = matrix(c(2, 2, 2, 2, 2, 2), 2, 3)
crossprod(C_mat, D_mat)
t(C_mat) %*% D_mat

     [,1] [,2] [,3]
[1,]    6    6    6
[2,]   14   14   14
[3,]   22   22   22


     [,1] [,2] [,3]
[1,]    6    6    6
[2,]   14   14   14
[3,]   22   22   22
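
To make the link to linear models concrete, here is a hedged sketch of least squares via the normal equations, which solve t(X) %*% X %*% beta = t(X) %*% y. The simulated data and the names `x1`, `y`, `X_design`, `beta_hat` and `beta_lm` are assumptions made for the example, not part of the original notebook.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated data: y depends linearly on x1 plus a little noise
n <- 50
x1 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n, sd = 0.1)

# Design matrix with an intercept column
X_design <- cbind(intercept = 1, x1 = x1)

# Solve the normal equations using crossprod()
beta_hat <- solve(crossprod(X_design), crossprod(X_design, y))

# Compare with the coefficients estimated by lm()
beta_lm <- coef(lm(y ~ x1))
all.equal(as.vector(beta_hat), as.vector(beta_lm))  # TRUE
```

lm() itself uses a QR decomposition rather than the normal equations, so the two results agree only up to numerical tolerance.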


#### 2.4 Lists
A list is a one-dimensional heterogeneous data structure: its elements can be of any type.


##### Creation
Lists can be very complex: their elements can be vectors, functions, matrices, or even other lists.

In [28]:
list(42, "Hello", TRUE) 

In [31]:
ex_list = list(
    a = c(1,2,3,4),
    b = TRUE,
    c = "Hello!",
    d = function(arg = 42) {print("Hello World")},
    e = diag(3)
    )

In [35]:
ex_list[1]
ex_list[2:3]
ex_list["e"]

$e
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1


In [42]:
ex_list$d
ex_list$d(arg = 2)

[1] "Hello World"
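
The different subsetting operators behave differently on lists: [ ] returns a smaller list, while [[ ]] and $ return the element itself. A minimal sketch using a shortened version of the list above (the names `sub_list` and `element` are illustrative):

```r
ex_list <- list(a = c(1, 2, 3, 4), b = TRUE, c = "Hello!")

sub_list <- ex_list["a"]    # a list of length 1 that contains the vector
element  <- ex_list[["a"]]  # the numeric vector itself

class(sub_list)  # "list"
class(element)   # "numeric"

# $ is shorthand for [[ ]] with a name
identical(ex_list$a, ex_list[["a"]])  # TRUE
```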


#### 2.5 Data Frames
A data frame is a list of vectors. Each vector contains a single data type, but different vectors can store different data types.

The elements of a data frame must all be vectors of the same length.

In [46]:
example_data = data.frame(x = c(1, 3, 5, 7, 9, 1, 3, 5, 7, 9),
                        y = c(rep("Hello", 9), "Goodbye"),
                        z = rep(c(TRUE, FALSE), 5))

In [47]:
example_data

x,y,z
1,Hello,True
3,Hello,False
5,Hello,True
7,Hello,False
9,Hello,True
1,Hello,False
3,Hello,True
5,Hello,False
7,Hello,True
9,Goodbye,False


In [48]:
str(example_data)

'data.frame':	10 obs. of  3 variables:
 $ x: num  1 3 5 7 9 1 3 5 7 9
 $ y: Factor w/ 2 levels "Goodbye","Hello": 2 2 2 2 2 2 2 2 2 1
 $ z: logi  TRUE FALSE TRUE FALSE TRUE FALSE ...

Note: y is shown as a Factor because this output comes from an R version before 4.0.0, where data.frame() converted strings to factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so y is shown as chr.


In [50]:
nrow(example_data)
ncol(example_data)
dim(example_data)
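
Since a data frame is a list of columns with a matrix-like shape, it can be subset with $, with [row, column], or with subset(). The following is a small sketch on the same data; the names `big_x` and `small_true` are made up for the example.

```r
example_data <- data.frame(
  x = c(1, 3, 5, 7, 9, 1, 3, 5, 7, 9),
  y = c(rep("Hello", 9), "Goodbye"),
  z = rep(c(TRUE, FALSE), 5)
)

# A single column, returned as a vector
example_data$x

# Rows where x is greater than 5, all columns
big_x <- example_data[example_data$x > 5, ]
nrow(big_x)  # 4

# subset() with a condition on two columns, keeping only x and y
small_true <- subset(example_data, z & x < 5, select = c(x, y))
nrow(small_true)  # 2
```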

#### Read data from file
Read a CSV file using read_csv() from the readr package.

In [51]:
library(readr)

In [53]:
example_data_from_csv = read_csv("D:/HUST/20212/Applied statistics and experimental design/Code/example-data.csv")
example_data_from_csv

Parsed with column specification:
cols(
  x = col_double(),
  y = col_character(),
  z = col_logical()
)


x,y,z
1,Hello,True
3,Hello,False
5,Hello,True
7,Hello,False
9,Hello,True
1,Hello,False
3,Hello,True
5,Hello,False
7,Hello,True
9,Goodbye,False


The as_tibble() function can be used to coerce a regular data frame to a tibble.

In [54]:
library(tibble)
example_data = as_tibble(example_data)
example_data

x,y,z
1,Hello,True
3,Hello,False
5,Hello,True
7,Hello,False
9,Hello,True
1,Hello,False
3,Hello,True
5,Hello,False
7,Hello,True
9,Goodbye,False


### 3. Programming basics
#### 3.1 Control Flow

##### if/else

In [56]:
x = 1
y = 3
if (x > y) {
    z = x * y
    print("x is larger than y")
} else {
    z = x + 5 * y
    print("x is less than or equal to y")
}
z

[1] "x is less than or equal to y"


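A useful function is ifelse(), a vectorized version of if/else: it tests each element of a condition and returns the second argument where the condition is TRUE and the third where it is FALSE.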

In [57]:
ifelse(4 > 3, 1, 0)

In [58]:
fib = c(1, 1, 2, 3, 5, 8, 13, 21)
ifelse(fib > 6, "Foo", "Bar")

##### for loop

In [62]:
x = 11:15
for (i in seq_along(x)) {
    if (i %% 2 == 0) {
        x[i] = x[i] * 2  # elements at even positions are doubled
    } else {
        x[i] = x[i] * 3  # elements at odd positions are tripled
    }
}
x
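
A loop like this can often be replaced by a vectorized expression. As a sketch, assuming the same rule (double the elements at even positions, triple those at odd positions), with the hypothetical names `multiplier` and `x_vec`:

```r
x <- 11:15

# 2 for even positions, 3 for odd positions
multiplier <- ifelse(seq_along(x) %% 2 == 0, 2, 3)

# Multiply every element by its own factor in one step
x_vec <- x * multiplier
x_vec  # 33 24 39 28 45
```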

#### 3.2 Functions

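A function is called as function_name(arg1 = .., arg2 = .., ...). New functions are defined with function(); the value of the last expression in the body is returned.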

In [63]:
standardize = function(x) {
    m = mean(x)
    std = sd(x)
    result = (x - m) / std
    result
}

In [68]:
test_sample = rnorm(n = 10, mean = 2, sd = 5)
test_sample
standardize(test_sample)
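
A quick sanity check for a function like standardize() is that its output has mean 0 and standard deviation 1. This is a minimal sketch; the seed and the name `z_scores` are assumptions for the example.

```r
# Same idea as standardize() above, written compactly
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

set.seed(1)  # make the random sample reproducible
z_scores <- standardize(rnorm(n = 10, mean = 2, sd = 5))

abs(mean(z_scores)) < 1e-12     # TRUE: mean is 0 up to rounding error
isTRUE(all.equal(sd(z_scores), 1))  # TRUE
```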

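Example: a function that calculates the variance of a sample. By default it returns the unbiased estimate of the population variance (the sample variance, which divides by n - 1); with biased = TRUE it divides by n instead.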

In [82]:
get_var = function(x, biased = FALSE){
    sum((x - mean(x)) ^ 2) / (length(x) - 1 * !biased)
}

In [85]:
get_var(test_sample)
get_var(test_sample, TRUE)
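
The default result of get_var() can be checked against R's built-in var(), which also divides by n - 1. A small sketch with a made-up sample (`sample_values` is a hypothetical name):

```r
get_var <- function(x, biased = FALSE) {
  # !biased is TRUE (counted as 1) by default, so the divisor is n - 1
  sum((x - mean(x))^2) / (length(x) - 1 * !biased)
}

sample_values <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Unbiased estimate agrees with the built-in var()
all.equal(get_var(sample_values), var(sample_values))  # TRUE

# Biased estimate divides the sum of squares (32) by n = 8
get_var(sample_values, biased = TRUE)  # 4
```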