# Lecture 2: Advanced Python
## ECON5170 Computational Methods in Economics


## Speed

* Efficient computation in Python.

* NumPy makes Python a vector-oriented environment.
  * In most cases, vectorization speeds up computation.
* Multiple CPUs allow parallel execution.
  * Parallel execution saves further time once the code has been optimized for speed.
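
As a minimal sketch of the vectorization claim above, the block below computes the same sum of squares with an explicit Python loop and with a single NumPy call. The array length and the seed are illustrative assumptions, and the timings depend on the machine, so no specific numbers are claimed.

```python
import time
import numpy as np

# illustrative data: the length and seed are arbitrary assumptions
rng = np.random.default_rng(0)
x = rng.normal(size=200_000)

# explicit Python loop: one multiplication and one addition per element
t0 = time.perf_counter()
total_loop = 0.0
for xi in x:
    total_loop += xi * xi
t_loop = time.perf_counter() - t0

# vectorized: the same sum of squares in a single NumPy call
t0 = time.perf_counter()
total_vec = float(np.dot(x, x))
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.6f} s")
```

Both versions return the same number up to floating-point rounding; only the running time differs.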


## Vectorization

* Mathematical equivalence $\neq$ computational equivalence

* Speed matters in
  * Structural estimation
  * Big data
  * Simulations
  * Hyperparameter tuning



* For example, @lin2020's preferred algorithm takes
  * 8 hours on a 24-core machine = 192 core hours

* Code optimization is essential for such problems.

* Optimizing code takes human time.

* Tradeoff between human time and computer time.

### Econometrics Example

In OLS regression, under heteroskedasticity
$$
\sqrt{n}\left(\widehat{\beta}-\beta_{0}\right)\stackrel{d}{\to}N\left(0,E\left[x_{i}x_{i}'\right]^{-1}\mathrm{var}\left(x_{i}e_{i}\right)E\left[x_{i}x_{i}'\right]^{-1}\right)
$$
where $\mathrm{var}\left(x_{i}e_{i}\right)$ can be estimated by 

$$
\underset{\mathrm{opt1}}{\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}'\widehat{e}_{i}^{2}}=\underset{\mathrm{opt2,3}}{\frac{1}{n}X'DX}=\underset{\mathrm{opt 4}}{\frac{1}{n}\left(X'D^{1/2}\right)\left(D^{1/2}X\right)}
$$

where $D$ is a diagonal matrix of $\left(\widehat{e}_{1}^{2},\widehat{e}_{2}^{2},\ldots,\widehat{e}_{n}^{2}\right)$.

There are at least 4 mathematically equivalent ways to compute it:

1. literally sum $\hat{e}_i^2 x_i x_i'$ over $i=1,\ldots,n$ one by one.
2. $X' \mathrm{diag}(\hat{e}^2) X$, with a dense central matrix.
3. $X' \mathrm{diag}(\hat{e}^2) X$, with a sparse central matrix.
4. Take the cross product of `X * e_hat`, where each row $x_i'$ is scaled by $\hat{e}_i$. It takes advantage of element-by-element operations in NumPy.
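
A short derivation, added here as a sketch, shows why option 4 agrees with the other three. The symbol $\widetilde{X}$ is notation introduced only for this note; it denotes the $n$-row matrix whose $i$-th row is $\widehat{e}_{i}x_{i}'$.

$$
X'DX=\sum_{i=1}^{n}\widehat{e}_{i}^{2}x_{i}x_{i}'=\sum_{i=1}^{n}\left(\widehat{e}_{i}x_{i}\right)\left(\widehat{e}_{i}x_{i}\right)'=\widetilde{X}'\widetilde{X}
$$

The sign of $\widehat{e}_{i}$ does not matter because it enters twice, so scaling the rows by $\widehat{e}_{i}$ gives the same product as scaling them by the diagonal of $D^{1/2}$, which is $\left|\widehat{e}_{i}\right|$. The $n\times n$ matrix $D$ is never formed.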


In [2]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix 
import random
import datetime
import math
import statistics
import matplotlib.pyplot as plt


In [3]:
def lpm(n):
    # simulate a linear probability model; return the regressors and the OLS residuals

    # set the parameters
    b0 = np.array([-1, 1])

    # generate the data
    e = np.random.normal(size=n)
    X = np.hstack((np.ones((n, 1)), np.random.normal(size=(n, 1))))
    Y = (X @ b0 + e >= 0).astype(float)  # binary outcome stored as 0.0 / 1.0

    # OLS estimation (solve the normal equations instead of inverting X'X)
    bhat = np.linalg.solve(X.T @ X, X.T @ Y)

    # OLS residuals
    e_hat = Y - X @ bhat
    return X, e_hat


We run the 4 implementations on the same data and compare the running time.

In [6]:
import time
import numpy as np
from scipy.sparse import diags

# An example of a robust variance matrix.
# Compare a loop, dense and sparse matrix products, and vectorization.
# The 1/n factor is omitted in all four options.

n = 5000; Rep = 10  # large matrix

for opt in range(1, 5):
    pts0 = time.time()

    # loop over the replications and compute X' D X for each simulated sample
    for rep in range(Rep):
        np.random.seed(rep)
        X, e_hat = lpm(n)

        # opt 1: add e_i^2 * x_i x_i' one observation at a time
        if opt == 1:
            XXe2 = np.zeros((2, 2))  # reset the accumulator for each sample
            for i in range(n):
                XXe2 += e_hat[i]**2 * np.outer(X[i], X[i])

        # opt 2: matrix multiplication with a dense n-by-n diagonal matrix
        elif opt == 2:
            e_hat2_M = np.diag(e_hat**2)
            XXe2 = X.T @ e_hat2_M @ X

        # opt 3: matrix multiplication with a sparse diagonal matrix
        elif opt == 3:
            e_hat2_M = diags(e_hat**2, format='csr')
            XXe2 = X.T @ (e_hat2_M @ X)

        # opt 4: vectorization with no waste
        elif opt == 4:
            Xe = X * e_hat[:, None]  # element-by-element: row i is e_hat[i] * x_i'
            XXe2 = Xe.T @ Xe

    print("n =", n, ", Rep =", Rep, ", opt =", opt, ", time =", time.time() - pts0, "\n")


n = 5000 , Rep = 10 , opt = 1 , time = 1.2698514461517334 

n = 5000 , Rep = 10 , opt = 2 , time = 2.802621841430664 

n = 5000 , Rep = 10 , opt = 3 , time = 0.047385454177856445 

n = 5000 , Rep = 10 , opt = 4 , time = 0.020370960235595703 



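* When $n$ is small
  * the dense matrix product is fast
  * the sparse matrix product (`scipy.sparse`) is slow
  * the vectorized version is the fastest.

* When $n$ is big
  * the dense matrix product is slow
  * the sparse matrix product (`scipy.sparse`) is fast
  * the vectorized version is still the fastest.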
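
Speed comparisons only make sense if the four options return the same matrix. The block below is a small sanity check under stated assumptions: the sample size, the seed, and the use of random numbers as stand-in residuals are illustrative choices, and the names `V1` to `V4` are hypothetical.

```python
import numpy as np
from scipy.sparse import diags

rng = np.random.default_rng(0)
n = 200  # small illustrative sample size
X = np.column_stack((np.ones(n), rng.normal(size=n)))
e_hat = rng.normal(size=n)  # stand-in residuals, not OLS residuals

# opt 1: sum e_i^2 * x_i x_i' one observation at a time
V1 = np.zeros((2, 2))
for i in range(n):
    V1 += e_hat[i] ** 2 * np.outer(X[i], X[i])

# opt 2: dense n-by-n diagonal matrix in the middle
V2 = X.T @ np.diag(e_hat ** 2) @ X

# opt 3: sparse diagonal matrix in the middle
V3 = X.T @ (diags(e_hat ** 2) @ X)

# opt 4: scale the rows of X by e_hat, then take the cross product
Xe = X * e_hat[:, None]
V4 = Xe.T @ Xe

print(np.allclose(V1, V2), np.allclose(V1, V3), np.allclose(V1, V4))
```

Using stand-in residuals keeps the check short; the algebra is the same for any vector of residuals.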

## Efficient Loop


### Example

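* Empirical coverage probability of a confidence interval for the mean of a Poisson distribution
* Write a DIY function `CI` that returns the confidence interval

We first define `CI`; a standard `for` loop can then call it once per replication.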

In [3]:
import numpy as np
import time

def CI(x):
    # x is a numpy array of random variables
    n = len(x)
    mu = np.mean(x)
    sig = np.std(x, ddof=1)  # sample standard deviation (divides by n - 1)
    upper = mu + 1.96 / np.sqrt(n) * sig
    lower = mu - 1.96 / np.sqrt(n) * sig
    return {"lower": lower, "upper": upper}
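
The loop itself can be sketched as follows. This is a minimal illustration rather than the original notebook cell: `CI` is restated so that the block runs on its own (using the sample standard deviation, `ddof=1`, as an assumption), and the values of `Rep`, `sample_size`, `mu` and the seed are arbitrary choices.

```python
import numpy as np

def CI(x):
    # restated helper: 95% normal-approximation confidence interval for the mean
    n = len(x)
    mu_hat = np.mean(x)
    sig = np.std(x, ddof=1)
    half = 1.96 * sig / np.sqrt(n)
    return {"lower": mu_hat - half, "upper": mu_hat + half}

Rep = 1000         # number of replications (illustrative)
sample_size = 100  # observations per replication (illustrative)
mu = 2             # true Poisson mean (illustrative)
rng = np.random.default_rng(0)

# pre-allocate the container instead of growing a list inside the loop
out = np.zeros(Rep, dtype=bool)
for r in range(Rep):
    x = rng.poisson(mu, size=sample_size)
    bounds = CI(x)
    out[r] = bounds["lower"] <= mu <= bounds["upper"]

coverage = out.mean()
print("empirical coverage probability =", coverage)
```

The empirical coverage should be close to the nominal 95% when the sample size is moderately large.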

## Parallel Computing

Parallel computing becomes essential when the data size exceeds the storage of a single computer, for example {% cite li2018embracing %}.
Here we explore the speed gain of parallel computing on a multicore machine.

The standard-library package `multiprocessing` is a common choice for parallel computing in Python; the third-party package `joblib` offers a similar interface.
Below is the basic structure.

### Example

* Two CPUs running simultaneously can, in theory, cut the time to half of that on a single CPU.

* Compare the speed of a parallel loop and a single-core sequential loop.

In [12]:

from multiprocessing import Pool


2-core parallel loop takes 19.011606454849243 seconds


In [4]:
from joblib import Parallel, delayed
import numpy as np
import time
# Zhentao's version uses joblib

Rep = 200  # number of replications
mu = 10  # Poisson mean
sample_size = 20  # observations per replication

def parallel_func(i):
    # one replication: draw a Poisson sample and check whether the CI covers mu
    # (i is the replication index; it is not used inside the function)
    x = np.random.poisson(mu, size=sample_size)
    bounds = CI(x)
    return (bounds["lower"] <= mu <= bounds["upper"])

pts0 = time.time() # check time
results = Parallel(n_jobs=4)(delayed(parallel_func)(i) for i in range(Rep))

print("4-core parallel loop takes", time.time() - pts0 , "seconds")


4-core parallel loop takes 1.2228224277496338 seconds


In [5]:
# single-core version
pts0 = time.time()
out = [parallel_func(i) for i in range(Rep)]
print("single-core loop takes", time.time() - pts0 , "seconds")

single-core loop takes 0.0170133113861084 seconds


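If we have two CPUs running simultaneously, in theory we can cut the time to half of that on a single CPU. Is that what happens in practice? In the run above it is not: the 4-core loop took about 1.22 seconds, while the single-core loop took about 0.017 seconds. Each replication is tiny, so the cost of starting worker processes and passing data between them dominates the computation.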

In [26]:
import multiprocessing as mp

Rep = 200
sample_size = 2000
mu = 2

def capture(i):
    # seed each replication by its index so that worker processes do not repeat the same draws
    rng = np.random.default_rng(i)
    x = rng.poisson(mu, sample_size)
    bounds = CI(x)
    return (bounds['lower'] <= mu) & (mu <= bounds['upper'])

pts0 = time.time()  # check time

# run at most 4 worker processes at a time;
# pool.map spreads the replications over the workers, whereas pool.apply blocks on each call
with mp.Pool(processes=4) as pool:
    results = pool.map(capture, range(Rep))

print("empirical coverage probability = ", np.mean(results), "\n")  # empirical coverage
pts1 = time.time() - pts0  # elapsed time
print("The calculation time is:", pts1, "\n")
# print(results)

empirical coverage probability =  0.955 

The calculation time is: 0.17428827285766602 
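
Since the lecture recommends optimizing the code before adding cores, it is worth noting that this particular experiment can also be vectorized across replications. The block below is a sketch under stated assumptions: it draws all samples at once as a `Rep` by `sample_size` array, uses the sample standard deviation (`ddof=1`), and fixes an arbitrary seed. It is an alternative to the parallel loop, not the original notebook code.

```python
import numpy as np

Rep = 200
sample_size = 2000
mu = 2
rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# one row per replication, drawn in a single call
samples = rng.poisson(mu, size=(Rep, sample_size))

# row-wise mean and sample standard deviation
means = samples.mean(axis=1)
sds = samples.std(axis=1, ddof=1)
half = 1.96 * sds / np.sqrt(sample_size)

# True where the 95% interval of a replication covers the true mean
covered = (means - half <= mu) & (mu <= means + half)
print("empirical coverage probability =", covered.mean())
```

No worker processes are started here, so there is no overhead from process creation or data transfer.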



## Remote Computing

Investing money from our own pocket in an extremely powerful laptop for heavy-lifting computational work
is unnecessary. (i) We do not run these long jobs every day, so it is more cost efficient
to share a workhorse. (ii) We cannot keep our laptop always on when we move it
around. The right solution is remote computing on a server.



No fundamental difference lies between local and remote computing.
We prepare the data and code, open a shell for communication, run the code, and collect the results.
One potential obstacle is dealing with a command-line-based operating system.
Such command-line tools were the norm two or three decades ago, but today we mostly
work in a graphical operating system like Windows or OSX.
For Windows users (I am one of them), I recommend [PuTTY](http://www.putty.org/), a shell, and [WinSCP](http://winscp.net/eng/download.php), a graphical interface for input and output.


Most servers in the world run a Unix/Linux operating system.
Here are a few commands for basic operations.

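* `mkdir`: create a directory
* `cd`: change the working directory
* `cp`: copy files
* `top`: monitor running processes and CPU usage
* `screen`: keep a session running after logging out
* `ssh user@address`: log in to a remote machine
* start a program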


Our department's computation infrastructure has been improving.
A server dedicated to professors is a 16-core machine. I have opened an account for you.
You can try out this script on `econsuper`.

1. Log in to `econsuper.econ.cuhk.edu.hk`;
2. Save the code block below as `loop_server.py`, and upload it to the server;
3. In a shell, run `python loop_server.py > result_your_name.out`;
4. To run a command in the background, add `&` at the end of the above command. To keep it running after closing the console, add `nohup` at the beginning of the command.