# Basic R

Zhentao Shi

## Statistical Languages

* Time investment is essential for language learning
* Python vs R

* Official document: [R-Introduction](https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf)

## Help System

* If the exact name of a function is known, call `help(function_name)` or `?function_name`.
* Otherwise, search by keyword with `??key_words`.

In [None]:
?seq

In [None]:
??sequence
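
The help pages open in a viewer. As a small complementary sketch (not part of the original notes), `args()` and `apropos()` return similar information directly in the console; the object name `hits` is only illustrative.

```r
# Print the argument list of a function without opening its help page
args(seq)

# List visible objects whose names start with "seq"
hits <- apropos("^seq")
hits
```

`apropos()` matches object names only, whereas `??` searches the text of the help pages.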

## Assignment

 * `<-` or `=`
   * Personally I prefer "=" to "<-".

In [None]:
a <- 1; a

In [None]:
b <- 2; b

In [None]:
f = a + b; f # avoid naming an object `c`, which is the built-in function that combines values

In [None]:
d = log(f); d

In [None]:
e = sqrt(d); e

In [None]:
cat("sqrt(log(f)) =", e, "is a simple calculation"); print(e)

In [None]:
cat("exp(e) =", exp(e), ". I want a nice new line. \n"); print(e)

In [None]:
ls() # display the objects in memory

R is case sensitive. `a` and `A` are two different objects.

In [None]:
A = "abc"
cat("a is", a, ", whereas A is ", A, ".")

Clear all objects from memory. This command is recommended as the first line of a script, so that the script starts from a clean workspace.

In [None]:
rm(list = ls())

In [None]:
ls()

## Vector

* A collection of elements of the same type:
  * integer
  * logical
  * real number
  * complex number 
  * characters
  * factor

  
* R does not require explicit type declaration.

 * `c()`  combines two or more vectors into a long vector.
 * Binary arithmetic operations
   * element by element 
   * `+`, `-`, `*` and `/`
   * logical operations: `&` (and), `|` (or), `!` (not)

In [None]:
a = c(1,2,3, 4); a

In [None]:
b = rep(c(1,2), 2); b

In [None]:
a+b

In [None]:
# logical vectors
logi_1 <- c(T, T, F); logi_1

In [None]:
logi_2 <- c(F, T, T); logi_2

In [None]:
logi_1 & logi_2

Missing values in R are represented as `NA` (Not Available). 

In [None]:
   a = NA; b = 3; a+b
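
Since `NA` propagates through arithmetic, summary functions return `NA` unless the missing values are dropped. A short sketch with an illustrative vector `x`:

```r
x <- c(1, NA, 3)

is.na(x)               # FALSE  TRUE FALSE
sum(is.na(x))          # 1 missing value
mean(x)                # NA, because one element is missing
mean(x, na.rm = TRUE)  # 2, computed from the non-missing elements
```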

When an operation has no valid numerical result, for example `log(-1)`, R returns `NaN` (Not a Number) with a warning.

In [None]:
log(-1)

In [None]:
sqrt(-1)

In [None]:
a = Inf
a+a

In [None]:
b = -Inf
a+b
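
`NA`, `NaN` and `Inf` are distinct special values. The sketch below shows how base R tells them apart; the vector `v` is illustrative.

```r
v <- c(1, NA, NaN, Inf, -Inf)

is.na(v)      # FALSE  TRUE  TRUE FALSE FALSE  (NaN also counts as missing)
is.nan(v)     # FALSE FALSE  TRUE FALSE FALSE
is.finite(v)  #  TRUE FALSE FALSE FALSE FALSE
```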

## Selection

* Vector selection is specified in square brackets `a[ ]`,
  * by either a vector of positive integers or a logical vector.
  * Indexing starts from 1, not from 0 as in Python.

In [None]:
a = 1:10
a[5:7]

In [None]:
d = seq(-1, 1, by = 0.1); print(d)
d[5:7]

In [None]:
f = c("a","b","c","d","e","f","g","h","i","j")
f[5:7]

In [None]:
b = "abcdefghij"
b[5:7] # `b` is a single string, i.e. a vector of length 1, so elements 5 to 7 do not exist and R returns NA
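
The cells above select by integer position. As a sketch of the other route, selection by a logical vector, together with `substr()` for extracting characters from within a single string:

```r
a <- 1:10

a[a > 7]        # 8 9 10: keep the elements where the condition is TRUE
a[a %% 2 == 0]  # 2 4 6 8 10: keep the even numbers

b <- "abcdefghij"
substr(b, 5, 7) # "efg": characters 5 to 7 of the string
```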

## Data types

* The way R stores data

In [None]:
a <- "18"; a

In [None]:
b <- as.numeric(a); b

In [None]:
x = pi * c(-1:1, 10); x

In [None]:
as.integer(x)

In [None]:
a = 3; is.integer(a) # it is numeric

In [None]:
a = as.integer(3); a; is.integer(a)

In [None]:
b = as.double(a); is.integer(b); b
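
To check how a value is stored, `typeof()` reports the storage type. A brief sketch; note that a number typed without the `L` suffix is stored as a double, and that `as.integer()` truncates toward zero. The name `trunc_example` is illustrative.

```r
typeof(3)     # "double"
typeof(3L)    # "integer"
typeof("18")  # "character"
typeof(TRUE)  # "logical"
typeof(1i)    # "complex"

trunc_example <- as.integer(c(-3.9, 3.9))
trunc_example # -3 3: the fractional part is dropped
```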

## Array and Matrix

* *array*: a table of numbers with multiple dimensions. 
* *matrix*: a 2-dimensional array.

* R stores arrays in column-major order.
* Array arithmetic is element-by-element. 

In [None]:
A = array(rpois(4*3*2, lambda = 1), dim = c(4,3,2)); print(A) # 3 dimensional array

In [None]:
B = array(rnorm(4*3*2), dim = c(4,3,2)); print(B)

In [None]:
print(A+B)
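
Column-major order means that the data fill the first column first, then the second, and so on. A small sketch; `m_col` and `m_row` are illustrative names.

```r
m_col <- matrix(1:6, nrow = 2)                # default: fill column by column
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # fill row by row instead

m_col[1, ]  # 1 3 5
m_row[1, ]  # 1 2 3

# A single index walks through the stored elements column by column
m_col[4]    # 4
m_row[4]    # 5
```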

Caution must be exercised in binary operations involving two objects of different lengths. R may silently recycle the shorter vector, or stop with an error when two arrays do not conform. This is error-prone.

In [None]:
A = matrix(1:6, 3); print(A)

In [None]:
B = matrix(1:3, 3); print(B)

In [None]:
print(A+B) # produces an error: the two matrices have different dimensions

In [None]:
b = 1:3
print(A+b)

In [None]:
d = 1:4
print(A+d)
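
When a vector is added to a matrix, it is recycled down the columns. The sketch below makes that explicit and uses `sweep()` as one way to add a vector across the columns instead; the names `A_demo`, `by_col` and `by_row` are illustrative.

```r
A_demo <- matrix(1:6, nrow = 3)

# 1:3 is recycled down each column: column 1 + (1,2,3), column 2 + (1,2,3)
by_col <- A_demo + 1:3
by_col

# sweep() with MARGIN = 2 adds 10 to column 1 and 20 to column 2
by_row <- sweep(A_demo, 2, c(10, 20), "+")
by_row
```

Stating the margin explicitly avoids relying on the recycling rule.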

## Matrix Operations

* `%*%`: matrix multiplication
* `solve`: matrix inverse
* `eigen`: eigenvalues and eigenvectors
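
A quick sketch of these three operations on a small symmetric matrix (the matrix `M` is an arbitrary illustration):

```r
M <- matrix(c(2, 1, 1, 2), nrow = 2)

M_inv <- solve(M)  # matrix inverse
M %*% M_inv        # matrix product: the identity matrix, up to rounding error
M * M              # by contrast, `*` multiplies element by element

eig <- eigen(M)
eig$values         # 3 1
```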

**Example**: OLS estimation with one $x$ regressor and a constant.
Graduate textbooks express the OLS estimator in matrix form
$$\hat{\beta} = (X' X)^{-1} X'y.$$
To conduct OLS estimation in R, we literally translate the mathematical expression into code.



Step 1: We need data $Y$ and $X$ to run OLS. We simulate an artificial dataset.

In [None]:
# simulate data
rm(list = ls())
set.seed(111) # can be removed to allow the result to change

# set the parameters
n <- 100
b0 <- matrix(c(0.2, 1.0), nrow = 2)

# generate the data
e <- rnorm(n)
X <- cbind(1, rnorm(n))
Y <- X %*% b0 + e
rm(e)

Step 2: translate the formula to code


In [None]:
# OLS estimation
bhat <- solve(t(X) %*% X, t(X) %*% Y); print(bhat)

In [None]:
bhat <- solve( crossprod(X), crossprod(X, Y))
print( bhat ) # equivalent computation
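
As a sanity check, the matrix formula should agree with R's built-in `lm()`. The sketch below is self-contained and simulates its own data, so the names `x_sim`, `y_sim`, `X_sim`, `b_matrix` and `b_lm` are illustrative and not objects from the cells above.

```r
set.seed(1)
n_obs <- 50
x_sim <- rnorm(n_obs)
y_sim <- 0.2 + 1.0 * x_sim + rnorm(n_obs)

# OLS by the matrix formula
X_sim <- cbind(1, x_sim)
b_matrix <- solve(crossprod(X_sim), crossprod(X_sim, y_sim))

# OLS by the built-in function
b_lm <- coef(lm(y_sim ~ x_sim))

# The two estimates coincide up to numerical tolerance
all.equal(as.numeric(b_matrix), as.numeric(b_lm))
```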

Step 3 (additional): plot the regression graph with the scatter points and the regression line.

* Further compare the regression line (black) with the true coefficient line (red).


In [None]:
# plot
plot(y = Y, x = X[, 2], xlab = "X", ylab = "Y", main = "regression")
abline(a = bhat[1], b = bhat[2])
abline(a = b0[1], b = b0[2], col = "red")
abline(h = 0, lty = 2)
abline(v = 0, lty = 2)

Step 4: Hypothesis testing.

The *t*-statistic is widely used.
To test the null $H_0: \beta_2 = 1$, we compute the associated *t*-statistic.
Again, this is a translation.
$$
t  =  \frac{\hat{\beta}_2 - \beta_{02}}{ \hat{\sigma}_{\hat{\beta}_2}  }
   =  \frac{\hat{\beta}_2 - \beta_{02}}{ \sqrt{ \left[ (X'X)^{-1} \hat{\sigma}^2 \right]_{22} } },
$$
where $[\cdot]_{22}$ is the (2,2)-element of a matrix.


In [None]:
# calculate the t-value
bhat2 <- bhat[2] # the parameter we want to test
e_hat <- Y - X %*% bhat
sigma_hat_square <- sum(e_hat^2) / (n - 2)
Sigma_B <- solve(t(X) %*% X) * sigma_hat_square
t_value_2 <- (bhat2 - b0[2]) / sqrt(Sigma_B[2, 2])
cat("The t-statistic =", t_value_2)
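
The *t*-statistic is usually compared with a critical value or converted into a *p*-value. The following self-contained sketch does both on freshly simulated data, assuming the usual $t(n-2)$ reference distribution, and cross-checks the standard errors against `summary(lm())`. All object names here are illustrative.

```r
set.seed(2)
n_obs <- 100
x_sim <- rnorm(n_obs)
y_sim <- 0.2 + 1.0 * x_sim + rnorm(n_obs)
X_sim <- cbind(1, x_sim)

# OLS coefficients and standard errors by the matrix formulas
b_hat <- solve(crossprod(X_sim), crossprod(X_sim, y_sim))
res <- y_sim - X_sim %*% b_hat
s2 <- sum(res^2) / (n_obs - 2)
se <- sqrt(diag(solve(crossprod(X_sim)) * s2))

# t-statistic for H0: slope = 1, with two-sided p-value and 5% critical value
t_stat <- (b_hat[2] - 1) / se[2]
p_value <- 2 * pt(-abs(t_stat), df = n_obs - 2)
crit_t <- qt(0.975, df = n_obs - 2)
cat("t =", t_stat, ", p-value =", p_value, ", reject at 5%:", abs(t_stat) > crit_t, "\n")

# Standard errors reported by lm() for comparison
se_lm <- summary(lm(y_sim ~ x_sim))$coefficients[, "Std. Error"]
```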

## Mixed Data Types

* A *vector* contains elements of only one type.
* A *list* is a basket for objects of various types.
  * It is a convenient container when a procedure returns more than one useful object.

In [None]:
Lst <- list(dept = "Econ", no = 5821)
Lst

In [None]:
Lst$dept

In [None]:
Lst[[2]]

**Example**: When we invoke `eigen`, we are
interested in both eigenvalues and eigenvectors,
which are stored in `$values` and `$vectors`, respectively.

In [None]:
A = diag(2)
eigen(A)
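
A short sketch of pulling the components out of the returned list, using an illustrative diagonal matrix:

```r
eig <- eigen(matrix(c(2, 0, 0, 3), nrow = 2))

is.list(eig)      # TRUE: the result is a list
names(eig)        # "values" "vectors"
eig$values        # 3 2, sorted in decreasing order
eig[["vectors"]]  # the same component can be extracted by name with [[ ]]
```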

## Package

* Base installation is small
* Extensive ecosystem of add-on packages.
* Most packages are hosted on [CRAN](https://cran.r-project.org/web/packages/).
 

* Installation: `install.packages("package_name")`. 

* Invoking: `library(package_name)` or `package_name::function_name`

In [None]:
library(magrittr)
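
The two ways of invoking a package can be sketched as follows. The `::` form is shown with `stats`, which ships with R; the `magrittr` part runs only if that package is installed, which is an assumption about the reader's setup.

```r
# Call one function from a package without attaching the package
sd_value <- stats::sd(c(1, 2, 3))
sd_value  # 1

# Attach a package only if it is available
if (requireNamespace("magrittr", quietly = TRUE)) {
  library(magrittr)
  c(1, 4, 9) %>% sqrt() %>% sum()  # the pipe passes the left side into the next call
} else {
  message("magrittr is not installed; run install.packages(\"magrittr\") first")
}
```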

## Input and Output

* Raw data is often saved in an ASCII file or an Excel spreadsheet.
* Excel spreadsheets are discouraged.
* The `csv` format is recommended.

`read.table()` or `read.csv()` imports data from an ASCII file into an R session.

**Example**: Acemoglu, Johnson and Robinson (2001). [Data source](https://economics.mit.edu/faculty/acemoglu/data/ajr2001). 
* This empirical example was adopted by Chang, Shi and Zhang (2022).

In [None]:
AJR = read.csv("data_example/AJR.csv", header = TRUE)
head(AJR)

## Data Frame

* *data.frame* is a two-dimensional table that stores the data, 
  * similar to a spreadsheet in Excel.

* A *matrix*, by contrast, accommodates only one type of element.

* `tibble` is a newer, refined alternative to the data frame type.

In [None]:
tibble::tibble(AJR)
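
The difference between a data frame and a matrix can be seen in a toy example. The data below are made up for illustration and are unrelated to `AJR.csv`.

```r
toy_df <- data.frame(
  name = c("HKG", "USA"),
  gdp = c(10.1, 10.2),
  oecd = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

sapply(toy_df, class)  # each column keeps its own type
toy_df$gdp             # select a column by name
toy_df[toy_df$oecd, ]  # select rows with a logical vector

# Converting to a matrix forces all entries into one type
toy_mat <- as.matrix(toy_df)
typeof(toy_mat)        # "character"
```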

<a id='AJR_exec'></a>
**Exercise**

Use the dataset `AJR.csv`. 
* Collect a small dataset with six columns: `shortnam`, `logpgp95`, `avexpr` (protection against expropriation), `lat_abst`, `logem4` (log of mortality rate) and `cons1`.
* If any country has one of the above variables missing, remove that country from the data. (Hint: use `apply()`.)

* It is better to convert files with Chinese characters into the `UTF-8` encoding. 
  * Some experimentation may be needed to deal with garbled text.
  * `Notepad++` is a free tool for conversion; check `Encoding` in its menu.

In [None]:
# stock_id <- readr::read_csv("data_example/SH_stockid_UTF8.csv", 
# locale = readr::locale(encoding = "UTF-8"))

stock_id <- readr::read_csv("data_example/SH_stockid_UTF8.csv")
head(stock_id)


`write.table()` or `write.csv()` exports the data in an R session to an ASCII file.
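
A round trip through a temporary file gives a self-contained sketch of export and import. The data frame `toy` and the temporary path are illustrative.

```r
toy <- data.frame(country = c("A", "B", "C"), gdp = c(1.5, NA, 3.2))

# Export to a csv file; row.names = FALSE omits the row index column
tmp_file <- tempfile(fileext = ".csv")
write.csv(toy, file = tmp_file, row.names = FALSE)

# Import it back; the missing value is written as NA and read back as NA
toy_back <- read.csv(tmp_file, header = TRUE)
toy_back

unlink(tmp_file)  # delete the temporary file
```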

## Statistics

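* R was created by statisticians.
* A distribution function is named by a prefix (the operation) followed by a suffix (the distribution), e.g. `pnorm`.

Prefixes:

* `p` (cumulative probability)
* `d` (density)
* `q` (quantile)
* `r` (random number generator) 

Suffixes:

* `norm` (normal)
* `chisq` ($\chi^2$)
* `t` (*t*)
* `weibull` (Weibull)
* `cauchy` (Cauchy)
* `binom` (binomial)
* `pois` (Poisson)
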
In [None]:
pnorm(0)

In [None]:
qnorm(0.975)

In [None]:
rnorm(5)

In [None]:
dnorm(0)
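
The same prefixes work with the other distributions. A brief sketch with a few values that can be checked by hand; the object names are illustrative.

```r
# q and p are inverses of each other
round_trip <- pnorm(qnorm(0.975))
round_trip  # 0.975

# P(X <= 1) for X ~ Binomial(2, 0.5) is 0.25 + 0.5
p_binom <- pbinom(1, size = 2, prob = 0.5)
p_binom     # 0.75

# P(X = 0) for X ~ Poisson(1) is exp(-1)
d_pois <- dpois(0, lambda = 1)
d_pois
```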

**Example**

This example illustrates the sampling error.

1. Plot the density of $\chi^2(3)$ over an equally spaced grid system `x_axis = seq(0.01, 15, by = 0.01)` (black line).
2. Generate 1000 observations from $\chi^2(3)$ distribution. Plot the kernel density, a nonparametric estimation of the density (red line).
3. Calculate the 95th percentile of $\chi^2(3)$ and the empirical probability of observing a value greater than this percentile.
In population, this probability is 5%. What is the number in this experiment?

In [None]:
set.seed(888)
x_axis <- seq(0.01, 15, by = 0.01)

y <- dchisq(x_axis, df = 3)
plot(y = y, x = x_axis, type = "l", xlab = "x", ylab = "density")
z <- rchisq(1000, df = 3)
lines(density(z), col = "red")
crit <- qchisq(.95, df = 3)

mean(z > crit)
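
To see the sampling error shrink, the same tail frequency can be computed for several sample sizes. This is a sketch with arbitrarily chosen sizes; `sizes` and `tail_freq` are illustrative names.

```r
set.seed(888)
crit <- qchisq(0.95, df = 3)

# Empirical probability of exceeding the 95th percentile, by sample size
sizes <- c(100, 10000, 1000000)
tail_freq <- sapply(sizes, function(n) mean(rchisq(n, df = 3) > crit))

print(cbind(sizes, tail_freq))
```

With more observations the empirical frequency is expected to settle closer to the population value of 5%.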


## User-defined Function

* Encapsulating repeated procedures into a user-defined function is highly recommended.

1. In the development stage, we can focus on a small, more manageable chunk of code.
2. Variables defined inside a function are local.
3. In revision, only one place needs to change. 

The format of a user-defined function is

```
function_name <- function(input) {
  expressions
  return(output)
}
```
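
A minimal instance of this format, which also illustrates that variables defined inside a function are local. The names `add_one` and `local_tmp` are hypothetical.

```r
add_one <- function(x) {
  local_tmp <- x + 1  # `local_tmp` exists only while the function runs
  return(local_tmp)
}

add_one(2)           # 3
exists("local_tmp")  # FALSE, provided no such object was defined elsewhere
```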

**Example**

* Construct the 95% two-sided asymptotic confidence interval
$$\left(\hat{\mu} - \frac{1.96}{\sqrt{n}} \hat{\sigma}, \hat{\mu} + \frac{1.96}{\sqrt{n}} \hat{\sigma} \right)$$
from a given sample.

* An easy job, but there is no built-in function for it.


In [None]:
# construct confidence interval

CI <- function(x) {
  # x is a numeric vector that holds the sample

  n <- length(x)
  mu <- mean(x)
  sig <- sd(x)
  upper <- mu + 1.96 / sqrt(n) * sig
  lower <- mu - 1.96 / sqrt(n) * sig
  return(list(lower = lower, upper = upper))
}
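
One possible variation, sketched here under the assumption that other confidence levels are also of interest, replaces the hard-coded 1.96 with `qnorm()`. The function name `ci_mean` and its `level` argument are hypothetical and not part of the original notes.

```r
ci_mean <- function(x, level = 0.95) {
  # x is a numeric vector; level is the nominal coverage probability
  n <- length(x)
  z <- qnorm(1 - (1 - level) / 2)  # about 1.96 when level = 0.95
  half_width <- z * sd(x) / sqrt(n)
  return(list(lower = mean(x) - half_width, upper = mean(x) + half_width))
}

set.seed(1)
sample_x <- rnorm(100)
ci_mean(sample_x)                # 95% interval
ci_mean(sample_x, level = 0.90)  # a narrower 90% interval
```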


## Flow Control

* Flow control is common in all programming languages.

  * `if` is used for choice
  * `for` or `while` is used for loops.


**Example**

Calculate the empirical coverage probability of the confidence interval for the mean of a Poisson distribution with mean 2.
We repeat this experiment 1000 times.


In [None]:
Rep <- 1000
sample_size <- 100
capture <- rep(0, Rep)

if (sample_size < 50) {
  print("Sample size too small. Refuse to work")
} else {
  for (i in 1:Rep) {
    mu <- 2
    x <- rpois(sample_size, mu)
    bounds <- CI(x)
    # record whether the interval captures the true mean
    capture[i] <- ((bounds$lower <= mu) & (mu <= bounds$upper))
  }
  print("Asymptotic theory may work")
  cat("the empirical coverage = ", mean(capture)) # empirical coverage probability
}
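
For readers new to these constructs, here is a stripped-down sketch of `if`, `for` and `while` on a deterministic task; the counters `n_even` and `k` are illustrative.

```r
# for + if: count the even numbers between 1 and 10
n_even <- 0
for (i in 1:10) {
  if (i %% 2 == 0) {
    n_even <- n_even + 1
  }
}
n_even  # 5

# while: find the smallest k such that 2^k exceeds 1000
k <- 0
while (2^k <= 1000) {
  k <- k + 1
}
k       # 10
```

A `for` loop suits a known number of repetitions, while a `while` loop runs until its condition fails.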

## Statistical Model

* `y~x`
  * `y`: dependent variable,
  * `x`: explanatory variable.
* `lm(y~x, data = data_frame)`.


### A Linear Regression 

This is a toy example with simulated data.

In [None]:
n <- 100
p <- 1

b0 <- 1
# Generate data
x <- matrix(rnorm(n * p), n, 1)
y <- x %*% b0 + rnorm(n)

# Linear Model
result <- lm(y ~ x)
summary(result)
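
Beyond `summary()`, a fitted model can be queried with extractor functions. The sketch below uses the `data =` argument with a freshly simulated data frame; `dat`, `fit` and `pred` are illustrative names rather than objects from the cell above.

```r
set.seed(3)
dat <- data.frame(x = rnorm(100))
dat$y <- 1 * dat$x + rnorm(100)

fit <- lm(y ~ x, data = dat)

coef(fit)                   # estimated intercept and slope
confint(fit, level = 0.95)  # confidence intervals for the coefficients

# Predicted values at new points x = 0 and x = 1
pred <- predict(fit, newdata = data.frame(x = c(0, 1)))
pred
```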


Plot the true values of $y$ and the fitted values against the observation index.

In [None]:
plot(result$fitted.values,
  col = "red", type = "l", xlab = "observation index", ylab = "y",
  main = "Fitted Value"
)
lines(y, col = "blue", type = "l", lty = 2)
legend("bottomleft",
  legend = c("Fitted Value", "True Value"),
  col = c("red", "blue"), lty = 1:2, cex = 0.75
)



Then we plot the best fitted line.


In [None]:
plot(y = y, x = x, xlab = "x", ylab = "y", main = "Fitted Line")
abline(a = result$coefficients[1], b = result$coefficients[2])
abline(a = 0, b = b0, col = "red")

legend("bottomright",
  legend = c("Fitted Line", "True Coef"),
  col = c("black", "red"), lty = c(1, 1), cex = 0.75
)


## Reading

<!-- [Wickham and Grolemund](https://r4ds.had.co.nz/): Ch 1, 2, 4, 8, 19 and 20 -->
* A thorough reading of [R-Introduction](https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf) 
* [Wickham and Grolemund](https://r4ds.had.co.nz/)
  * Ch 4: workflow: basics
  * Ch 6: workflow: scripts
  * Ch 8: workflow: projects
