# 0 Intro

We started our R learning journey by working through the dice example together. Now you have some idea about R programming. Let's take a step back and examine formally some of the concepts we have learned so far. Along the way, we will also learn some new concepts.

For this part, instead of RStudio, we will run R in a notebook environment. A notebook lets you write notes (using [Markdown](https://www.markdownguide.org/basic-syntax/)) and code (R code in our case) together. It's very useful for presentations and for learning/teaching purposes.

We are running our R notebook in the cloud using Google's Colab. Officially, Colab only supports Python notebooks, but you can in fact run R notebooks too, as we are doing now. Let's first check which version of R is installed in Google Colab.

In [None]:
version

# 1 Expression & Assignment

You can write an expression and get a result immediately. Below we calculate $2 + \sqrt{4} + \ln(e^2) + 2^2$.

In [None]:
# an expression
2 + sqrt(4) + log(exp(2)) + 2^2

You could also assign an expression to a variable and then print out the variable.

In [None]:
# assignment
x <- (pi == 3.14)
print(x)

`<-` is the assignment operator. `x` is a variable, or R object. (You could replace `print(x)` with `x` to display the value of `x`, but I find that `print()` gives better-formatted output.)

The expression here is `(pi == 3.14)`. `pi` is a built-in constant with value `3.1415926535897931`. `==` is the "equal to" relational operator for value comparison. The expression evaluates to `TRUE` if `pi` is exactly equal to `3.14`, or `FALSE` otherwise.
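
Exact comparison with `==` can be surprising for decimal numbers, because doubles are stored with finite precision. The sketch below is only an illustration; it uses base R's `all.equal()` and `isTRUE()`, which are not introduced in the text above, to compare values within a small tolerance.

```r
# exact comparison: 0.1 + 0.2 is not stored as exactly 0.3
exact_equal <- (0.1 + 0.2 == 0.3)
print(exact_equal)

# all.equal() compares within a small tolerance; wrap it in isTRUE()
# because it returns a description (not FALSE) when the values differ
near_equal <- isTRUE(all.equal(0.1 + 0.2, 0.3))
print(near_equal)

# pi differs from 3.14 by far more than the default tolerance
pi_close <- isTRUE(all.equal(pi, 3.14))
print(pi_close)
```

For most numeric checks, a tolerance-based comparison is the safer habit; `==` is best kept for integers, logicals and strings.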

Aside: Why is `<-` the assignment operator rather than `=`, as in most other programming languages? In fact, `=` is also an assignment operator in R, but it behaves slightly differently from `<-`. If you want to know more about R's assignment operators, read [this Stack Overflow post](https://stackoverflow.com/questions/1741820/what-are-the-differences-between-and-assignment-operators-in-r).

### Exercise - 1 / 1

Compare the built-in constant `pi` with `3.1415926535897931`. Are they equal?

In [None]:
# your code here




### Exercise - 1 / 2

Create your own pi approximation variable `my_pi` using the formula $\pi=6\arctan(\frac{1}{\sqrt{3}})$. Print out the value of `my_pi`. (Hint: Google "R arctan" to find out the R function that calculates arc-tangent.)

In [None]:
# your code here




# 2 Data Structure

Data structures are about how you organize and store data. R's *basic* data structures can be summarized as below.

| Dimension | Homogeneous   | Heterogeneous |
|:---------:|:-------------:|:-------------:|
| 1D        | Atomic vector | List          |
| 2D        | Matrix        | Dataframe     |
| nD        | Array         | -             |

All elements in a homogeneous data structure (e.g., an atomic vector) must be of the same type (integer, double, etc.). Elements in a heterogeneous data structure, on the other hand, can have different types.

## 2.1 Atomic vector

In [None]:
# create R vectors
vec_character <- c("Hello,", "World!")
vec_integer <- c(1L, 2L, 3L)
vec_double <- c(1.1, 2.2, 3.3, -4.4)
vec_logical <- c(TRUE, TRUE, FALSE)

`c()` is a function in base R. It combines values into a vector.

To check whether an object/variable is a vector, you can use the `is.vector()` function. (Note that `is.vector()` also returns `TRUE` for lists; `is.atomic()` tests specifically for atomic vectors.)

In [None]:
# check if an object is a vector
is.vector(vec_integer)

The `typeof()` function shows the type of an object.

In [None]:
# check the type of the object
typeof(vec_integer)
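
Because an atomic vector must be homogeneous, `c()` silently converts (coerces) mixed inputs to a single common type. A minimal sketch of that behaviour, with variable names chosen only for illustration:

```r
# mixing types in c() coerces everything to the most flexible type:
# logical -> integer -> double -> character
mix_int_dbl <- c(1L, 2.5)
mix_lgl_int <- c(TRUE, 2L)
mix_dbl_chr <- c(1.5, "a")

print(typeof(mix_int_dbl))  # "double"
print(typeof(mix_lgl_int))  # "integer"
print(typeof(mix_dbl_chr))  # "character"
print(mix_dbl_chr)          # the number became the string "1.5"
```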

Scalars and strings in R are just the corresponding single-element vectors.

In [None]:
# scalar integer is the same as an integer vector with a single element
is.vector(1L)
typeof(1L)
# identical() tests objects for exact equality
identical(1L, c(1L))

# scalar double is the same as a double vector with a single element
is.vector(1.1)
typeof(1.1)
identical(1.1, c(1.1))

# a string is the same as a string vector with a single element
is.vector("hello")
typeof("hello")
identical("hello", c("hello"))

`str()` is a handy function to display the structure of a vector (or any object in general).

In [None]:
# display structure of a vector
str(vec_character)
str(vec_integer)
str(vec_double)
str(vec_logical)

`length()` is another useful function to tell you the length of a vector.

In [None]:
# display length of a vector
length(vec_double)

You can retrieve an element in a vector using `vec_name[]` with a numeric index. R's vector indexing starts with `1`.

In [None]:
# retrieve the first element of vec_double
print(vec_double[1])

Selecting/subsetting multiple elements in a vector is also easy.

In [None]:
# select/subset multiple elements
print(vec_double[c(1, 3)])
print(vec_double[1:2])
print(vec_double[c(-1, -2)])
print(vec_double[c(TRUE, FALSE, TRUE, FALSE)])
print(vec_double[vec_double < 0])
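
The last form above, subsetting by a condition, works in two steps: the comparison first produces a logical vector, and that vector is then used as the index. The sketch below spells this out; `vals` is an illustrative copy of `vec_double`, and `which()` is a base R function not covered in the text above.

```r
vals <- c(1.1, 2.2, 3.3, -4.4)   # same values as vec_double

# the comparison is evaluated element by element and gives a logical vector
is_negative <- vals < 0
print(is_negative)

# which() converts the logical vector into the positions that are TRUE
neg_positions <- which(is_negative)
print(neg_positions)

# both forms select the same elements
print(vals[is_negative])
print(vals[neg_positions])
```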

### Exercise - 2.1

Define two vectors `(1, 2, -3)` and `(4, -5, -6)`. Perform an element-wise multiplication, and sum up the positive elements in the resulting vector. (Hint: 1. use operator `*` for element-wise multiplication; 2. use `sum()` function to sum up elements in a numeric vector.)

In [None]:
# your code playground here




## 2.2 List

A list is another kind of one-dimensional vector. Unlike an atomic vector, it can contain elements of different types.

In [None]:
# create a list
l1 <- list(
  1:3,
  "a",
  c(TRUE, FALSE, TRUE),
  c(2.3, 5.9),
  c(1L, 2L)
)

# print the list
print(l1)

# print the structure of the list
str(l1)

A list can contain other lists as well.

In [None]:
# a nested list
l2 <- list(list(list(1)))
str(l2)

Retrieving/subsetting elements in a list is similar to retrieving elements in a vector.

In [None]:
# list subsetting
str(l1)
str(l1[1:2])
str(l1[c(1, 3)])
str(l1[c(-1, -2)])

Note that using `[]` always returns a list (i.e. it wraps the retrieved elements in a list). To return the element in a list as it is, use `[[]]`. `[[]]` only returns one single element, so usually you should only specify a single index.

In [None]:
# [] returns a list
str(l1[1])

In [None]:
# [[]] returns an element as it is in a list
# note that this element can still be a list (recall nested list)
str(l1[[1]])
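
List elements can also be given names, which allows access by name instead of by position. The following is a small sketch with a made-up list (`student` and its fields are hypothetical); it shows that `[]` keeps the list wrapper, while `[[]]` and the `$` shorthand return the element itself.

```r
# a hypothetical named list (names are illustrative only)
student <- list(
  name = "Ada",
  scores = c(90, 85.5),
  passed = TRUE
)

# [] keeps the list wrapper
sub_list <- student["scores"]
str(sub_list)

# [[]] and $ both return the element itself
scores_a <- student[["scores"]]
scores_b <- student$scores
str(scores_a)
print(identical(scores_a, scores_b))
```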

### Exercise - 2.2

Retrieve the vector `(2.3, 5.9)` from the list below and sum up its elements.

In [None]:
# define a list
l_ex <- list(
  1:3,
  list(
    "abc",
    c(2.3, 5.9)
  )
)

# your code below
# this visualization may help
# L-------------------+
# | V---+---+---+     |
# | | 1 | 2 | 3 |     |
# | +---+---+---+     |
# |                   |
# | L---------------+ |
# | | V-----+       | |
# | | |"abc"|       | |
# | | +-----+       | |
# | | V-----+-----+ | |
# | | | 2.3 | 5.9 | | |
# | | +-----+-----+ | |
# | +---------------+ |
# +-------------------+




## 2.3 Matrix (self-study section)


Use the `matrix()` function to create a matrix. Use `dim()` to find the dimensions of a matrix.

In [None]:
# use the matrix() function to create a matrix
y <- matrix(1:6, nrow = 2, ncol = 3)
print(y)
print(dim(y))

Subsetting a matrix is similar to subsetting a vector.

In [None]:
print(y[1:2, c(1,3)])
print(y[1:2, -2])

Note that `[]` by default simplifies the subsetting result to the lowest possible dimension.

In [None]:
# y[1, 1:2] gives a vector
str(y[1, 1:2])
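
If the simplification is not wanted, `[]` accepts a `drop = FALSE` argument that keeps the result as a matrix. A minimal sketch, using an illustrative copy of `y` called `y_demo`:

```r
y_demo <- matrix(1:6, nrow = 2, ncol = 3)

# default: a single row is simplified to a plain integer vector
row_vec <- y_demo[1, 1:2]
print(dim(row_vec))        # NULL, no dimensions left

# drop = FALSE keeps the result as a 1 x 2 matrix
row_mat <- y_demo[1, 1:2, drop = FALSE]
print(dim(row_mat))
```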

Matrix algebra is easy. See https://www.statmethods.net/advstats/matrix.html for a list of R matrix operations.

In [None]:
# define two matrices
m1 <- matrix(1:4, nrow = 2)
m2 <- matrix(5:8, nrow = 2)
print(m1)
print(m2)

# element-wise multiplication
print(m1 * m2)

# matrix multiplication
print(m1 %*% m2)

# transpose
print(t(m1))

# solve Ax = b problem
b <- matrix(7:8, nrow = 2)
print(b)
print(solve(m1, b))
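
One way to sanity-check a solution returned by `solve()` is to multiply it back and compare with the right-hand side. This is only a sketch: `A` and `b_vec` are illustrative stand-ins for `m1` and `b`, and `all.equal()` is used because the result may differ from `b_vec` by floating-point rounding.

```r
A <- matrix(1:4, nrow = 2)          # same values as m1
b_vec <- matrix(c(7, 8), nrow = 2)  # same values as b, stored as doubles

x_sol <- solve(A, b_vec)
print(x_sol)

# multiplying back should reproduce b_vec (up to floating-point rounding)
print(A %*% x_sol)
check <- isTRUE(all.equal(A %*% x_sol, b_vec))
print(check)
```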

## 2.4 Data frame

A data frame is like a 2-D table in an Excel sheet. Data are organized in columns and rows. Each column is identified by a column name (in the column header row). Formally, you can think of a data frame as a named list of equal-length atomic vectors (where each vector is a column), plus a few extra attributes. (Let's not get into the attributes of a data structure today.)


In [None]:
# create a data frame
df1 <- data.frame(
  x = 1:3,
  y = c("a", "a", "b"),
  z = c(1.1, 2.2, 3.3)
)

print(df1)

In [None]:
# check the structure of a data frame
str(df1)

In [None]:
# check a data frame's attributes
print(attributes(df1))
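
The "named list of columns" view can be checked directly. The sketch below uses a small illustrative data frame, `df_demo`, rather than `df1`:

```r
df_demo <- data.frame(
  x = 1:3,
  z = c(1.1, 2.2, 3.3)
)

print(typeof(df_demo))    # the underlying type is "list"
print(is.list(df_demo))
print(class(df_demo))     # the class attribute marks it as a data frame

# each list element is one column, so length() counts columns
print(length(df_demo))

# list-style [[ ]] therefore pulls out a column vector
print(df_demo[[2]])
```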

You can select a column in a data frame using `df_name$col_name` or `df_name["col_name"]`.

In [None]:
print(df1$x)
print(df1["x"])

However, note that `df_name$col_name` returns the column as a vector, whereas `df_name["col_name"]` returns a data frame containing the selected column. (`df_name[["col_name"]]` also returns a vector. Recall `[]` vs `[[]]`.)

In [None]:
str(df1$x)

In [None]:
str(df1["x"])

In [None]:
# df_name[["col_name"]] will instead return a vector too
str(df1[["x"]])

Selecting multiple columns in a data frame can be done as follows.

In [None]:
# select 'x' and 'z' columns
print(df1[c('x', 'z')])

# select row 1 and 2, and 'x' and 'z' columns
print(df1[1:2, c('x', 'z')])

In a data frame, certain columns can contain categorical information, for example, an indicator column with `0` and `1` values, or `"male"` and `"female"`. For modeling purposes, you can turn those columns into factors. Factors are R's way of storing categorical information. When you build a model using a data frame with factor columns, the model will take the categorical information into account.

In [None]:
# as.factor() turns a character vector into factor
df1$y <- as.factor(df1$y)
str(df1)

Factors are in fact integer vectors with a few extra attributes.

In [None]:
gender <- factor(c("male", "female", "female", "male"))
print(typeof(gender))

In [None]:
print(attributes(gender))
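
To see how the integer codes and the levels fit together, the sketch below rebuilds the same factor under an illustrative name, `gender_demo`, and inspects it with `levels()`, `as.integer()` and `table()`:

```r
gender_demo <- factor(c("male", "female", "female", "male"))

# levels are the distinct categories, in sorted order by default
print(levels(gender_demo))

# the underlying integers are positions in the levels vector
codes <- as.integer(gender_demo)
print(codes)

# table() counts how often each level occurs
print(table(gender_demo))
```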

When you create a data frame using the `data.frame()` function, you can optionally turn character columns into factor columns using the `stringsAsFactors = TRUE` argument. (`stringsAsFactors = TRUE` was the default before R 4.0. Since R 4.0, `stringsAsFactors = FALSE` is the default.)

In [None]:
# 'stringsAsFactors = TRUE' turns character columns into factors;
# use 'stringsAsFactors = FALSE' to keep strings as they are
df2 <- data.frame(
  x = 1:3,
  y = c("a", "a", "c"),
  stringsAsFactors = TRUE
)
str(df2)

It's often useful to find out column names and number of columns and rows.

In [None]:
# find out column names using names() or colnames()
print(names(df1))
print(colnames(df1))

# find out number of columns using length() or ncol()
print(length(df1))
print(ncol(df1))

# find out number of rows
print(nrow(df1))

You can create a new column from a formula/expression involving other columns.

In [None]:
print(df1)

# create a column x_square
df1["x_square"] <- df1["x"]^2
print(df1)

# create a column x_cube
df1$x_cube <- df1$x ^ 3
print(df1)

You can subset/filter rows based on row index.

In [None]:
print(df1)

# first row
print(df1[1, ])

# second row to the end
print(df1[2:nrow(df1), ])

# all rows except the second
print(df1[-2, ])

# all rows except the first and the third
print(df1[-c(1, 3), ])

# randomly select 2 rows
row_sample <- sample(nrow(df1), 2)
print(df1[row_sample, ])
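
`sample()` returns a different result each time it runs. If a repeatable draw is needed, for example so that everyone in a class sees the same rows, the random seed can be fixed first with `set.seed()`. A small sketch with an illustrative data frame (`df_demo` and the seed value 42 are arbitrary choices):

```r
df_demo <- data.frame(x = 1:5, y = letters[1:5])

# fixing the seed makes the "random" draw repeatable
set.seed(42)   # 42 is an arbitrary choice
rows_a <- sample(nrow(df_demo), 2)

set.seed(42)
rows_b <- sample(nrow(df_demo), 2)

# the two draws are the same because the seed was the same
print(identical(rows_a, rows_b))
print(df_demo[rows_a, ])
```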

You can subset/filter rows based on conditions. You can do it in many ways. I will show you a few below.

In [None]:
print(df1)

# df_name[row_cond, col]; leaving col empty keeps all columns
df1_new <- df1[(df1$x_square >= 4 & df1$x_cube < 20), ]
print(df1_new)

# no change in df1
print(df1)

In [None]:
print(df1)

# df_name[row_cond, col, drop = FALSE]
df1_new <- df1[(df1$x_square >= 4 & df1$x_cube < 20), ,drop = FALSE]
print(df1_new)

# no change in df1
print(df1)

In [None]:
print(df1)

# df_name[which(row_cond), col, drop = FALSE]
df1_new <- df1[which(df1$x_square >= 4 & df1$x_cube < 20), ,drop = FALSE]
print(df1_new)

# no change in df1
print(df1)

In [None]:
print(df1)

# subset(df, row_cond, select = col_vec)
df1_new <- subset(df1, x_square >= 4 & x_cube < 20, select = c(x, y, z))
print(df1_new)

# no change in df1
print(df1)
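
The approaches above give the same result on `df1`, but they can differ when the condition column contains missing values (`NA`). The sketch below uses a made-up data frame, `df_na`, to show the difference; it is an illustration, not an exhaustive comparison.

```r
df_na <- data.frame(
  id = 1:3,
  score = c(10, NA, 3)
)

# a logical condition is NA where the data are missing,
# and an NA index produces a row filled with NA
by_logical <- df_na[df_na$score > 5, ]
print(by_logical)

# which() keeps only the positions that are TRUE
by_which <- df_na[which(df_na$score > 5), ]
print(by_which)

# subset() also drops rows where the condition is NA
by_subset <- subset(df_na, score > 5)
print(by_subset)
```

When a column may contain `NA`, the `which()` or `subset()` forms are usually the less surprising choice.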

The data frame is perhaps the most important data structure in R. That's because tabular data (rows and columns) are the most common type of data. We will learn much more about it later.

### Exercise - 2.4

Given the data frame `df3` below, create a new column `xz` as `x * z`. Then subset/filter the new data frame to keep only the rows where `xz` is greater than 5.

In [None]:
# create a data frame
df3 <- data.frame(
  x = 1:3,
  y = c("a", "a", "b"),
  z = c(1.1, 2.2, 3.3)
)

print(df3)

# your code below


