# Advanced Computing

Zhentao Shi

<!-- code is tested on SCRP -->

## Speed

* Efficient computation in R.

* R is a vector-oriented language. 
  * In most cases, vectorization speeds up computation.
* Multiple CPUs for parallel execution.
  * Saves further time once the code itself has been optimized for speed.
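
As a minimal illustration of the vectorization point above, the sketch below computes a sum of squares twice. The data `x` and the function names `sum_sq_loop` and `sum_sq_vec` are made up for this example. Both versions return the same number; the second passes the whole vector to compiled code in a single call.

```r
# Hypothetical data, for illustration only.
set.seed(1)
x <- rnorm(1e5)

# Loop version: one element at a time, interpreted in R.
sum_sq_loop <- function(x) {
  total <- 0
  for (xi in x) {
    total <- total + xi^2
  }
  total
}

# Vectorized version: `x^2` and `sum()` operate on the whole vector at once.
sum_sq_vec <- function(x) sum(x^2)

all.equal(sum_sq_loop(x), sum_sq_vec(x))
```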


## Vectorization

* Mathematical equivalence $\neq$ computational equivalence

* Speed matters in
  * Structural estimation
  * Big data
  * Simulations
  * Hyperparameter tuning


* For example, @lin2020's preferred algorithm takes
  * 8 hours on a 24-core machine = 192 core-hours

* Code optimization is essential for such problems.

* Optimizing code takes human time.

* Tradeoff between human time and computer time.

### Econometrics Example

In OLS regression, under heteroskedasticity we want to estimate 

$$
\underset{\mathrm{opt1}}{\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}'\widehat{e}_{i}^{2}}=\underset{\mathrm{opt2,3}}{\frac{1}{n}X'DX}=\underset{\mathrm{opt 4}}{\frac{1}{n}\left(X'D^{1/2}\right)\left(D^{1/2}X\right)}
$$

where $D$ is a diagonal matrix of $\left(\widehat{e}_{1}^{2},\widehat{e}_{2}^{2},\ldots,\widehat{e}_{n}^{2}\right)$.

At least 4 mathematically equivalent ways:

1. literally sum $\hat{e}_i^2 x_i x_i'$  over $i=1,\ldots,n$ one by one.
2. $X' \mathrm{diag}(\hat{e}^2) X$, with a dense central matrix.
3. $X' \mathrm{diag}(\hat{e}^2) X$, with a sparse central matrix.
4. Take the cross product of `X * e_hat`. This exploits element-by-element multiplication in R: row $i$ of `X` is scaled by `e_hat[i]`, so no $n \times n$ matrix is ever formed.
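
The four options can be checked against each other on a small simulated sample. This is only a sketch: `X` and `e_hat` below are generated on the spot rather than taken from `lpm()`, the names `V1` to `V4` are made up, and the common factor $1/n$ is left out.

```r
# Simulated data, for illustration only (not the `lpm()` data used in the lecture).
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))   # n x 2 regressor matrix
e_hat <- rnorm(n)         # stand-in for OLS residuals

# 1. Sum the n outer products one by one.
V1 <- matrix(0, ncol(X), ncol(X))
for (i in 1:n) {
  V1 <- V1 + e_hat[i]^2 * tcrossprod(X[i, ])
}

# 2. Dense n x n diagonal matrix.
V2 <- t(X) %*% diag(e_hat^2) %*% X

# 3. Sparse diagonal matrix from the Matrix package.
V3 <- as.matrix(t(X) %*% Matrix::Diagonal(x = e_hat^2) %*% X)

# 4. Scale row i of X by e_hat[i] (recycling), then take the cross product.
Xe <- X * e_hat
V4 <- t(Xe) %*% Xe

# All four agree up to floating-point error.
c(all.equal(V1, V2),
  all.equal(V1, V3, check.attributes = FALSE),
  all.equal(V1, V4))
```

Options 2 and 3 both build an $n \times n$ object whose off-diagonal entries are all zero; option 4 avoids it entirely.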


In [None]:
# an example of robust variance matrix.
# compare the implementation via matrix, Matrix (package) and vectorization.

# n = 5000; Rep = 10; # Matrix is quick, matrix is slow, adding is OK

source("data_example/lec2.R")

n <- 50
Rep <- 1000 

data.Xe <- lpm(n) # see the function in "data_example/lec2.R"
X <- data.Xe$X
e_hat <- data.Xe$e_hat

XXe2 <- matrix(0, nrow = 2, ncol = 2)

We run the 4 estimators for the same data, and compare the time.

In [None]:
for (opt in 1:4) {
  pts0 <- Sys.time()
  for (iter in 1:Rep) {
    set.seed(iter) # to make sure that the data used by
    # different estimation methods are the same
    if (opt == 1) { # sum the outer products one by one
      XXe2 <- matrix(0, nrow = ncol(X), ncol = ncol(X)) # reset the accumulator
      for (i in 1:n) {  XXe2 <- XXe2 + e_hat[i]^2 * X[i, ] %*% t(X[i, ])  }
    } else if (opt == 2) { # the vectorized version with dense matrix
      e_hat2_M <- matrix(0, nrow = n, ncol = n)
      diag(e_hat2_M) <- e_hat^2; XXe2 <- t(X) %*% e_hat2_M %*% X
    } else if (opt == 3) { # the vectorized version with sparse matrix
      e_hat2_M <- Matrix::Matrix(0, ncol = n, nrow = n)
      diag(e_hat2_M) <- e_hat^2; XXe2 <- t(X) %*% e_hat2_M %*% X
    } else if (opt == 4) { # the best vectorization method. No waste
      Xe <- X * e_hat
      XXe2 <- t(Xe) %*% Xe }
    XX_inv <- solve(t(X) %*% X)
    sig_B <- XX_inv %*% XXe2 %*% XX_inv
  }
  secs <- as.numeric(difftime(Sys.time(), pts0, units = "secs")) # always in seconds
  cat("n =", n, ", Rep =", Rep, ", opt =", opt, ", time =", secs, "seconds\n")
}

* When $n$ is small
  * The dense `matrix` (opt 2) is fast
  * The sparse `Matrix` (opt 3) is slow
  * The vectorized version (opt 4) is the fastest.

* When $n$ is big
  * The dense `matrix` (opt 2) is slow
  * The sparse `Matrix` (opt 3) is fast
  * The vectorized version (opt 4) is still the fastest.

In [None]:
# assumes X and e_hat come from a large-n sample, e.g. n = 5000 as noted above
n <- nrow(X); K <- ncol(X)
for (opt in c(1, 3, 4)) { # option 2 takes too much time. We omit it.
  pts0 <- Sys.time()
  XXe2 <- matrix(0, nrow = K, ncol = K)
  if (opt == 1) { # sum the outer products one by one
    for (i in 1:n) {
      XXe2 <- XXe2 + e_hat[i]^2 * X[i, ] %*% t(X[i, ])
    }
  } else if (opt == 2) { # the vectorized version with dense matrix
    e_hat2_M <- matrix(0, nrow = n, ncol = n)
    diag(e_hat2_M) <- e_hat^2
    XXe2 <- t(X) %*% e_hat2_M %*% X
  } else if (opt == 3) { # the vectorized version with sparse matrix
    e_hat2_M <- Matrix::Matrix(0, ncol = n, nrow = n)
    diag(e_hat2_M) <- e_hat^2
    XXe2 <- t(X) %*% e_hat2_M %*% X
  } else if (opt == 4) { # the best vectorization method. No waste
    Xe <- X * e_hat
    XXe2 <- t(Xe) %*% Xe
  }
  secs <- as.numeric(difftime(Sys.time(), pts0, units = "secs")) # always in seconds
  cat("outcome = ", as.vector(as.matrix(XXe2)), ", opt = ", opt, ", time = ", secs, "seconds\n")
}
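
Option 4 is short enough to wrap into a reusable function. The sketch below is one possible way to do so, not the lecture's own code: the name `robust_sandwich` and the simulated `X_sim` and `e_sim` are hypothetical, and base R's `crossprod(A)` is used as a shorthand for `t(A) %*% A`.

```r
# Hypothetical helper: the sandwich (X'X)^{-1} X'DX (X'X)^{-1} from X and residuals.
robust_sandwich <- function(X, e_hat) {
  Xe <- X * e_hat                # row i of X scaled by e_hat[i]
  meat <- crossprod(Xe)          # same as t(Xe) %*% Xe
  bread <- solve(crossprod(X))   # (X'X)^{-1}
  bread %*% meat %*% bread
}

# Simulated data, for illustration only.
set.seed(1)
X_sim <- cbind(1, rnorm(500))
e_sim <- rnorm(500)
sig_B_sim <- robust_sandwich(X_sim, e_sim)
sig_B_sim
```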

## Efficient Loop

* R keeps evolving to handle big data.
* Housekeeping, such as creating a container and indexing the results, is needed in `for` loops.
* `plyr` simplifies the job and facilitates parallelization.
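
To give a flavor of how `plyr` removes the housekeeping, here is a minimal sketch with `laply()`, which applies a function to each element of its input and collects the results. The functions `ci_sketch` and `covered` are hypothetical stand-ins written for this example (a normal-approximation 95% interval), not the `CI` function from `data_example/lec2.R`, and the sample size and number of replications are arbitrary.

```r
library(plyr)

# Hypothetical stand-in for a confidence-interval function:
# normal-approximation 95% interval for the mean.
ci_sketch <- function(x) {
  half <- qnorm(0.975) * sd(x) / sqrt(length(x))
  list(lower = mean(x) - half, upper = mean(x) + half)
}

mu_true <- 2

# One replication: draw a Poisson sample and check whether the interval covers mu_true.
covered <- function(i) {
  x <- rpois(200, mu_true)
  b <- ci_sketch(x)
  (b$lower <= mu_true) & (mu_true <= b$upper)
}

set.seed(1)
out_plyr <- laply(.data = 1:500, .fun = covered)  # no container, no index bookkeeping
mean(out_plyr)                                    # empirical coverage rate
```

The `plyr` apply functions also accept a `.parallel` argument, which hands the iterations to a registered `foreach` backend.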



### Example

* Empirical coverage probability of a confidence interval for the mean of a Poisson distribution
* Write a DIY function `CI` for the confidence interval

This is a standard `for` loop.


In [None]:
Rep <- 100000
sample_size <- 1000
mu <- 2

In [None]:
source("data_example/lec2.R")
# append a new outcome after each loop
pts0 <- Sys.time() # check time
for (i in 1:Rep) {
  x <- rpois(sample_size, mu)
  bounds <- CI(x)
  out_i <- ((bounds$lower <= mu) & (mu <= bounds$upper))
  if (i == 1) {
    out <- out_i
  } else {
    out <- c(out, out_i)
  }
}

pts1 <- difftime(Sys.time(), pts0, units = "secs") # elapsed time, always in seconds
cat("loop without pre-definition takes", pts1, "seconds\n")

In [None]:
# pre-define a container
out <- rep(0, Rep)
pts0 <- Sys.time() # check time
for (i in 1:Rep) {
  x <- rpois(sample_size, mu)
  bounds <- CI(x)
  out[i] <- ((bounds$lower <= mu) & (mu <= bounds$upper))
}

pts1 <- difftime(Sys.time(), pts0, units = "secs") # elapsed time, always in seconds
cat("loop with pre-definition takes", pts1, "seconds\n")

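* Pay attention to the line `out <- rep(0, Rep)`.
* Memory operates differently with and without the container: without it, `c(out, out_i)` builds a new, longer vector in every iteration; with it, each result is written into a slot that already exists.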
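
The memory effect can be isolated from the simulation itself. The sketch below strips the loop body down to a trivial computation, so any timing gap comes from how the results are stored; the function names `grow` and `prealloc` and the value of `n_rep` are made up for this example, and the timings will vary by machine.

```r
n_rep <- 20000

# Growing: every `c(out, ...)` allocates a new vector and copies the old one into it.
grow <- function(n_rep) {
  out <- c()
  for (i in 1:n_rep) out <- c(out, i %% 2 == 0)
  out
}

# Pre-allocated: the container has its final length from the start.
prealloc <- function(n_rep) {
  out <- logical(n_rep)
  for (i in 1:n_rep) out[i] <- (i %% 2 == 0)
  out
}

t_grow <- system.time(res_grow <- grow(n_rep))["elapsed"]
t_pre <- system.time(res_pre <- prealloc(n_rep))["elapsed"]

identical(res_grow, res_pre)  # same result either way
c(t_grow, t_pre)              # elapsed seconds: growing vs. pre-allocated
```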

## Parallel Computing

* Parallel computing becomes essential when the data size is beyond the storage of a single computer


* Coordinate multiple cores on a single computer
* The packages `foreach` and `doParallel` are useful for parallel computing.
* `registerDoParallel(number)` prepares a few CPU cores to accept incoming jobs.

In [None]:
library(plyr)
library(foreach) 
library(doParallel)

```
registerDoParallel(a_number) # opens specified number of CPUs

out <- foreach(icount(Rep), .combine = option) %dopar% {
  my_expressions
}
```
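
A minimal runnable instance of this template, with a toy loop body chosen only for illustration, might look as follows. The worker count of 2 is an assumption about the machine; `stopImplicitCluster()` from `doParallel` releases the workers afterwards.

```r
library(foreach)
library(doParallel)

registerDoParallel(2)  # request 2 workers; adjust to your machine

# Each iteration returns one number; `.combine = c` concatenates them into a vector.
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}

stopImplicitCluster()  # release the workers registered above
squares
```

Replacing `%dopar%` with `%do%` runs the same loop sequentially, which is handy for debugging.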



### Example

* With two CPUs running simultaneously, the time is in theory cut to half of that on a single CPU.

* Compare the speed of a parallel loop and a single-core sequential loop.

In [1]:
capture <- function(i) {
  x <- rpois(sample_size, mu)
  bounds <- CI(x)
  return((bounds$lower <= mu) & (mu <= bounds$upper))
}


registerDoParallel(2) # open 2 CPUs

pts0 <- Sys.time() # check time

out <- foreach(icount(Rep), .combine = c) %dopar% {
  capture()
}

pts1 <- difftime(Sys.time(), pts0, units = "secs") # elapsed time, always in seconds
cat("parallel loop takes", pts1, "seconds\n")




* Surprisingly, the parallel loop runs more slowly.
  * Each iteration takes very little time, so the overhead of coordinating the workers outweighs the gain.

* The code chunk below tells a different story.
  * The time spent in each iteration is non-trivial.
  * The only difference between the two loops is `%dopar%` vs. `%do%`.

In [None]:
Rep <- 200
sample_size <- 2000000

registerDoParallel(2) # change the number of open CPUs according to
# the specification of your computer

pts0 <- Sys.time() # check time
out <- foreach(icount(Rep), .combine = c) %dopar% {
  capture()
}

cat("2-core parallel loop takes", as.numeric(difftime(Sys.time(), pts0, units = "secs")), "seconds\n")

pts0 <- Sys.time()
out <- foreach(icount(Rep), .combine = c) %do% {
  capture()
}

cat("single-core loop takes", as.numeric(difftime(Sys.time(), pts0, units = "secs")), "seconds\n")
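
The number passed to `registerDoParallel()` depends on the machine. One way to pick it, sketched below, is to query the hardware with `parallel::detectCores()` and keep one core free. The variable names `n_cores` and `n_workers` are made up, and leaving one core free is a common convention rather than a rule. On a shared server the detected count may exceed the cores actually allotted to your session.

```r
library(doParallel)

n_cores <- parallel::detectCores()   # may return NA on some platforms
if (is.na(n_cores)) n_cores <- 1
n_workers <- max(1, n_cores - 1)     # leave one core free for the rest of the system

registerDoParallel(n_workers)
getDoParWorkers()                    # number of workers foreach will use
stopImplicitCluster()                # release the workers
```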

## Summary

* Speed matters
* Vectorization
* Parallel computing
* Experiments