# Advanced Computing

Zhentao Shi

<!-- code is tested on SCRP -->

## Speed

* Efficient computation in R.

* R is a vector-oriented language. 
  * In most cases, vectorization speeds up computation.
* Multiple CPUs for parallel execution
  * Save time after optimizing the code for speed.


## Vectorization

* Mathematical equivalence $\neq$ computation equivalence

* Speed matters in
  * Structural estimation
  * Big data
  * Simulations
  * Hyperparameter tuning


* For example, the preferred algorithm in Lin, Shi, Wang and Yan (2023)
  * 8 hours on a 24-core machine = 192 core hours

* Code optimization is essential for such problems.

* Optimizing code takes human time.

* Tradeoff between human time and computer time.

### Econometrics Example

In OLS regression, under heteroskedasticity we want to estimate 

$$
\underset{\mathrm{opt1}}{\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}'\widehat{e}_{i}^{2}}=\underset{\mathrm{opt2,3}}{\frac{1}{n}X'DX}=\underset{\mathrm{opt 4}}{\frac{1}{n}\left(X'D^{1/2}\right)\left(D^{1/2}X\right)}
$$

where $D$ is a diagonal matrix of $\left(\widehat{e}_{1}^{2},\widehat{e}_{2}^{2},\ldots,\widehat{e}_{n}^{2}\right)$.

At least 4 mathematically equivalent ways:

1. literally sum $\hat{e}_i^2 x_i x_i'$  over $i=1,\ldots,n$ one by one.
2. $X' \mathrm{diag}(\hat{e}^2) X$, with a dense central matrix.
3. $X' \mathrm{diag}(\hat{e}^2) X$, with a sparse central matrix.
4. Take the cross product of `X * e_hat`. It takes advantage of element-by-element operations in R.
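The equivalence between option 1 and option 4 can be checked on a toy data set. The sketch below is only an illustration: the data are simulated here (they are not the output of `lpm()`), and `row_scaled`, `S_loop` and `S_vec` are names made up for this check.

```r
set.seed(2023)
n <- 5
X <- cbind(1, rnorm(n)) # n x 2 regressor matrix, simulated for illustration
e_hat <- rnorm(n)       # stand-in for the OLS residuals

# X * e_hat recycles e_hat down each column, so row i is scaled by e_hat[i]
Xe <- X * e_hat
row_scaled <- diag(e_hat) %*% X

# option 1: add the outer products one observation at a time
S_loop <- matrix(0, nrow = 2, ncol = 2)
for (i in 1:n) S_loop <- S_loop + e_hat[i]^2 * X[i, ] %*% t(X[i, ])

# option 4: cross product of the row-scaled matrix, same as t(Xe) %*% Xe
S_vec <- crossprod(Xe)

all.equal(Xe, row_scaled) # TRUE
all.equal(S_loop, S_vec)  # TRUE
```

Option 4 never forms the $n \times n$ matrix $D$, which is why it wastes neither memory nor multiplications by zero.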


In [1]:
# an example of robust variance matrix.
# compare the implementation via matrix, Matrix (package) and vectorization.

# n = 5000; Rep = 10; # Matrix is quick, matrix is slow, adding is OK

source("data_example/lec2.R") # import the function lpm()

n <- 50
Rep <- 1000 

data.Xe <- lpm(n) # see the function in "data_example/lec2.R"
X <- data.Xe$X
e_hat <- data.Xe$e_hat

XXe2 <- matrix(0, nrow = 2, ncol = 2)

We run the 4 estimators for the same data, and compare the time.

In [3]:
for (opt in 1:4) {

  pts0 <- Sys.time()

  for (iter in 1:Rep) {
    
    if (opt == 1) {
      XXe2 <- matrix(0, nrow = 2, ncol = 2) # reset the sum in each replication
      for (i in 1:n) {  XXe2 <- XXe2 + e_hat[i]^2 * X[i, ] %*% t(X[i, ])  }
    } else if (opt == 2) { # the vectorized version with dense matrix
      e_hat2_M <- matrix(0, nrow = n, ncol = n)
      diag(e_hat2_M) <- e_hat^2; XXe2 <- t(X) %*% e_hat2_M %*% X
    } else if (opt == 3) { # the vectorized version with sparse matrix
      e_hat2_M <- Matrix::Matrix(0, ncol = n, nrow = n)
      diag(e_hat2_M) <- e_hat^2; XXe2 <- t(X) %*% e_hat2_M %*% X
    } else if (opt == 4) { # the best vectorization method. No waste
      Xe <- X * e_hat
      XXe2 <- t(Xe) %*% Xe }
  }
  cat("n =", n, ", Rep =", Rep, ", opt =", opt, ", time =", Sys.time() - pts0, "\n")
}

n = 50 , Rep = 1000 , opt = 1 , time = 0.1979077 
n = 50 , Rep = 1000 , opt = 2 , time = 0.1683941 
n = 50 , Rep = 1000 , opt = 3 , time = 0.6548798 
n = 50 , Rep = 1000 , opt = 4 , time = 0.003408909 


* When $n$ is small
  * `matrix` is fast
  * `Matrix` is slow
  * Vectorized version is the fastest.

* When $n$ is big
  * `matrix` is slow
  * `Matrix` is fast
  * Vectorized version is still the fastest.
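A back-of-the-envelope calculation suggests why the dense `matrix` falls behind as $n$ grows: it stores all $n^2$ entries, almost all of them zeros, while a sparse diagonal only needs to keep about $n$ values. The helper `dense_mb()` below is a made-up name for this sketch, and it assumes 8 bytes per double.

```r
bytes_per_double <- 8

# memory for a dense n x n matrix of doubles, in megabytes
dense_mb <- function(n) n^2 * bytes_per_double / 1e6

dense_mb(50)   # 0.02 MB: negligible, so the dense matrix is cheap
dense_mb(5000) # 200 MB: allocating and multiplying it is costly

# a direct check on a moderately sized dense matrix
m <- matrix(0, nrow = 1000, ncol = 1000)
m_bytes <- as.numeric(object.size(m)) # slightly above 8e6 bytes
```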

In [4]:
n <- 5000
Rep <- 10

data.Xe <- lpm(n) # see the function in "data_example/lec2.R"
X <- data.Xe$X
e_hat <- data.Xe$e_hat


In [6]:
for (opt in 1:4){ 
  pts0 = Sys.time()
  XXe2 = matrix(0, nrow = 2, ncol = 2)
  
  if (opt == 1){
    for ( i in 1:n){
      XXe2 = XXe2 + e_hat[i]^2 * X[i,] %*% t(X[i,])
    }
  } else if (opt == 2) {# the vectorized version with a dense matrix
    e_hat2_M = matrix(0, nrow = n, ncol = n)
    diag(e_hat2_M) = e_hat^2
    XXe2 = t(X) %*% e_hat2_M %*% X
  } else if (opt == 3)  {# the vectorized version with a sparse matrix
    e_hat2_M = Matrix::Matrix( 0, ncol = n, nrow = n)
    diag(e_hat2_M) = e_hat^2
    XXe2 = t(X) %*% e_hat2_M %*% X
  } else if (opt == 4)  {# the best vectorization method. No waste
    Xe = X * e_hat
    XXe2 = t(Xe) %*% Xe
  }
  cat("outcome = ", as.vector(XXe2), ", opt = ", opt, ", time = ", Sys.time() - pts0, "\n")
}

outcome =  663.4357 318.0449 318.0449 602.0184 , opt =  1 , time =  0.0155623 
outcome =  663.4357 318.0449 318.0449 602.0184 , opt =  2 , time =  0.484184 
outcome =  663.4357 318.0449 318.0449 602.0184 , opt =  3 , time =  0.001304626 
outcome =  663.4357 318.0449 318.0449 602.0184 , opt =  4 , time =  0.0001137257 


## Efficient Loop

* R evolves for big data
* housekeeping is needed in `for` loops
* `plyr` simplifies the job and facilitates parallelization.



### Example

* Empirical coverage probability of a confidence interval for the mean of a Poisson distribution
* Write a DIY function `CI` to compute the confidence interval
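The actual `CI` function lives in `data_example/lec2.R` and is not reproduced here. As a rough idea of what such a function might look like, the sketch below is a hypothetical stand-in based on the normal approximation; the argument `level` and the internals are assumptions, and only the returned `lower` and `upper` elements match how `CI` is used in the loops.

```r
# Hypothetical stand-in for CI() in "data_example/lec2.R" (an assumption):
# a two-sided interval for the mean based on the normal approximation
CI <- function(x, level = 0.95) {
  n <- length(x)
  mu_hat <- mean(x)
  se <- sd(x) / sqrt(n)                 # standard error of the sample mean
  z <- qnorm(1 - (1 - level) / 2)       # two-sided critical value
  list(lower = mu_hat - z * se, upper = mu_hat + z * se)
}

set.seed(1)
x <- rpois(100, 2)
bounds <- CI(x)
covered <- (bounds$lower <= 2) & (2 <= bounds$upper) # TRUE or FALSE for this draw
```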

The following is a standard `for` loop.


In [10]:
Rep <- 100000
sample_size <- 100
mu <- 2

In [11]:
source("data_example/lec2.R")
# append a new outcome after each loop
pts0 <- Sys.time() # check time
for (i in 1:Rep) {
  x <- rpois(sample_size, mu)
  bounds <- CI(x)
  out_i <- ((bounds$lower <= mu) & (mu <= bounds$upper))
  if (i == 1) {
    out <- out_i
  } else {
    out <- c(out, out_i)
  }
}

pts1 <- Sys.time() - pts0 # check time elapse
cat("loop without pre-definition takes", pts1, "seconds\n")

loop without pre-definition takes 9.060467 seconds


In [12]:
# pre-define a container
out <- rep(0, Rep)
pts0 <- Sys.time() # check time
for (i in 1:Rep) {
  x <- rpois(sample_size, mu)
  bounds <- CI(x)
  out[i] <- ((bounds$lower <= mu) & (mu <= bounds$upper))
}

pts1 <- Sys.time() - pts0 # check time elapse
cat("loop with pre-definition takes", pts1, "seconds\n")

loop with pre-definition takes 2.308328 seconds


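* Pay attention to the line `out <- rep(0, Rep)`.
* Memory is handled differently with and without the container: `c(out, out_i)` copies the whole vector in every iteration, whereas `out[i] <- ...` writes into space that is already allocated.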
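The same pattern can be isolated in a toy example that does not depend on `CI`. The functions `grow()` and `prealloc()` are made-up names for this sketch; the timings depend on the machine, so none are reported here.

```r
# grow the result by appending, which copies the vector in every iteration
grow <- function(m) {
  out <- c()
  for (i in 1:m) out <- c(out, i^2)
  out
}

# fill a container whose length is fixed in advance
prealloc <- function(m) {
  out <- numeric(m)
  for (i in 1:m) out[i] <- i^2
  out
}

m <- 20000
t_grow <- system.time(a <- grow(m))["elapsed"]
t_pre <- system.time(b <- prealloc(m))["elapsed"]

identical(a, b) # TRUE: same result, only the memory handling differs
cat("growing:", t_grow, "seconds; pre-allocated:", t_pre, "seconds\n")
```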

## Parallel Computing

* Parallel computing becomes essential when the data size is beyond the storage of a single computer


* Coordinate multiple cores on a single computer
* The packages `foreach` and `doParallel` are useful for parallel computing.
* `registerDoParallel(number)` prepares a few CPU cores to accept incoming jobs.

In [13]:
library(plyr)
library(foreach) 
library(doParallel)

Loading required package: iterators

Loading required package: parallel



```
registerDoParallel(a_number) # opens specified number of CPUs

out <- foreach(icount(Rep), .combine = option) %dopar% {
  my_expressions
}
```
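A minimal concrete instance of the template above might look as follows. It assumes the packages `foreach` and `doParallel` are installed and that at least 2 cores are available; the loop body `i^2` and the name `squares` are placeholders for illustration.

```r
library(foreach)
library(doParallel)

registerDoParallel(2) # assumes at least 2 cores are available

# each iteration returns one number; .combine = c stacks them into a vector
squares <- foreach(i = 1:8, .combine = c) %dopar% {
  i^2
}

stopImplicitCluster() # release the workers when the job is done

print(squares) # 1 4 9 16 25 36 49 64
```

By default `foreach` returns the results in the order of the iterations, even though the workers may finish them in a different order.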



### Example

* With two CPUs running simultaneously, the time is in theory cut to half of that on a single CPU

* Compare the speed of a parallel loop and a single-core sequential loop.

In [14]:
capture <- function(i) {
  x <- rpois(sample_size, mu)
  bounds <- CI(x)
  return((bounds$lower <= mu) & (mu <= bounds$upper))
}


registerDoParallel(2) # open 2 CPUs

pts0 <- Sys.time() # check time

out <- foreach(icount(Rep), .combine = c) %dopar% {
  capture()
}

pts1 <- Sys.time() - pts0 # check time elapse
cat("parallel loop takes", pts1, "seconds\n")


parallel loop takes 11.04546 seconds


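* Surprisingly, the parallel loop runs more slowly than the sequential loop (11.0 vs. 2.3 seconds)
  * Each iteration finishes in a very short time, so the overhead of dispatching tasks to the workers and collecting the results dominates.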

* The code chunk below will tell a different story.
  * Time in each loop is non-trivial
  * The only difference is `%dopar%` vs. `%do%`.

In [15]:
Rep <- 200
sample_size <- 1000000

registerDoParallel(2) # change the number of open CPUs according to
# the specification of your computer

pts0 <- Sys.time() # check time
out <- foreach(icount(Rep), .combine = c) %dopar% {
  capture()
}

cat("2-core parallel loop takes", Sys.time() - pts0 , "seconds\n")

pts0 <- Sys.time()
out <- foreach(icount(Rep), .combine = c) %do% {
  capture()
}

cat("single-core loop takes", Sys.time() - pts0 , "seconds\n")

2-core parallel loop takes 4.956544 seconds
single-core loop takes 8.69477 seconds
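The two timings above can be turned into a speedup and a parallel efficiency. The sketch takes the reported seconds at face value and assumes 2 workers, as set by `registerDoParallel(2)`; the variable names are made up for this calculation.

```r
t_parallel <- 4.956544 # seconds, parallel run reported above
t_single <- 8.69477    # seconds, sequential run reported above
workers <- 2           # as set by registerDoParallel(2)

speedup <- t_single / t_parallel # how many times faster the parallel run is
efficiency <- speedup / workers  # share of the ideal speedup that is realized

round(speedup, 2)    # 1.75
round(efficiency, 2) # 0.88
```

The speedup falls short of the theoretical factor of 2 because part of the time goes to coordinating the workers.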


## Summary

* Speed matters
* Vectorization
* Parallel computing
* Experiments

# Cloud Computing

* A remote server is typically more powerful than a personal computer
* Instruments for intensive jobs

* Cloud storage
* Cloud computing

### Workflow 

* There is no fundamental difference between local and cloud workflows
* Prepare the data and code on the cloud server
* Open a shell for communication, run the code, and collect the results

* Command-line interface (CLI)

### Open Source

* Strong justification for open-source languages
* Installation with no limitations

### User Experience

* The cost and barriers of remote computing have fallen significantly
* A remote desktop best mimics the familiar operating system on a local computer
* Internet latency
* The CLI is flexible

* Remote Jupyter and RStudio use a web browser as the interface
* Mouse and keyboard are still local
* Commands are sent from the browser to the remote computer
* Results are sent back to the browser for display
* IDEs also have file management, to partially replace WinSCP



### RStudio Server

* CLI lacks a graphic interface for interactive data analysis. 
* [RStudio Server](https://rstudio.com/products/rstudio/#rstudio-server) offers a local-like environment via a web browser to communicate with a remote server.

* RStudio on SCRP
* Jupyter Notebook on SCRP


* `RStudio Cloud` 
  * a free service to facilitate teaching and demonstration
  * computation unit is too weak to execute serious tasks.
* CUHK's `SCRP` 
  * resembles a workplace environment in a small company
  * always online (with VPN connection)
  * more powerful than the best local computer we can afford.
* `Amazon Web Services` or `Alibaba Cloud` (`阿里云`)
  * commercial service tailored according to budget
  * from tiny demonstrative display to big enterprise business applications

### CUHK Econ

* Students have access to powerful multi-core computers

1. Log in to `scrp-login-2.econ.cuhk.edu.hk`;
2. Upload R scripts and data to the server;
3. In a shell, run `R --no-save < file_name.R > result_file_name.out`;
4. To run a command in the background, add `&` at the end of the above command.

* This example comes from Lin, Shi, Wang and Yan (2023)
* Only use 15% of the data and a sparse grid of tuning parameters
* It takes about 9 minutes with 24 cores on `econsuper` (27 min on `SCRP`)

R packages `caret`, `doParallel`, and `gbm` are needed for the following script.
```
ssh zhentao.shi@scrp-login-2.econ.cuhk.edu.hk
cd data_example
R --no-save < Beijing_housing_gbm.R > GBM_BJ.out &
```

### Long jobs

* Keep jobs running in the background
* The terminal is freed for other tasks

* We can disconnect from the server while the task keeps running
* Otherwise, the task is terminated as soon as we close the window, disconnect from the server, or lose the Internet or VPN connection.

### Prepare in Advance

* Test the input and output on a small scale on a local computer or a graphical cloud server
  * Working in the CLI means no interaction with intermediate results
  * Don't debug in the CLI
* Think in advance about how to retrieve the results
* Export key results as data files (RData, csv, ...) or graph files (pdf, jpeg, png).
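Exporting results from a non-interactive script could be sketched as below. The data frame `results`, the file names, and the use of `tempdir()` are placeholders for illustration; on a server one would write to a project folder instead.

```r
out_dir <- tempdir() # placeholder; replace with a folder on the server

# toy results, generated only to have something to export
results <- data.frame(x = 1:10, y = (1:10)^2)

csv_file <- file.path(out_dir, "results.csv")
rdata_file <- file.path(out_dir, "results.RData")
pdf_file <- file.path(out_dir, "results.pdf")

write.csv(results, csv_file, row.names = FALSE) # plain text, readable anywhere
save(results, file = rdata_file)                # R binary format, restore with load()

pdf(pdf_file)                                   # open a graph file instead of a window
plot(results$x, results$y, type = "b", xlab = "x", ylab = "y")
dev.off()                                       # close the device to finish the file
```

The files can then be downloaded and inspected on a local computer after the job finishes.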


## Reproducibility

* Keep the same environment across local computers and remote clusters
* Virtual machine
* [Docker](https://hub.docker.com/repository/docker/ztshi/econ_data_sci/general)
* [Gitpod](https://gitpod.io/#https://github.com/zhentaoshi/econ_data_science)