# Lecture 2: Advanced Python
## ECON5170 Computational Methods in Economics
#### Author: Zhentao Shi
#### Date: March 2020

# Advanced R (Python)


### Introduction

In this lecture, we will talk about efficient computation in R (Python).

*  **R is a vector-oriented language, and NumPy brings the same style to Python. In most cases, vectorization speeds up computation**.
*  We turn to more CPUs for parallel execution to save time if there is no more room to optimize the code to improve the speed.
*  Clusters are accessed remotely. Communicating with a remote cluster is different from operating a local machine.

### Vectorization

Despite mathematical equivalence, different ways of carrying out a calculation can differ greatly in computational speed.

Does computational speed matter?
For a job that takes less than a minute, the time difference is not a big deal.
For modern economic structural estimation problems commonly seen in industrial organization, a single estimation can take up to a week. For those problems, code optimization is essential.

Other computationally intensive procedures include bootstrap, simulated maximum likelihood and simulated method of moments. Even if a single execution does not take much time, repeating such a procedure for thousands of replications adds up to a non-trivial amount of time.

Of course, optimizing code takes human time. It is a balance of human time and machine time.

__Example__

The heteroskedastic-robust variance for the OLS regression is
$$(X'X)^{-1} X'\hat{e}\hat {e}'X (X'X)^{-1}$$
The difficult part is $X'\hat{e}\hat {e}'X=\sum_{i=1}^n \hat{e}_i^2 x_i x_i'$.
There are at least 4 mathematically equivalent ways to compute this term.

1.  Literally sum over $i=1,\dots,n$ one by one.
2.  $X' \mathrm{diag}(\hat{e}^2) X$, with a dense central matrix.
3.  $X' \mathrm{diag}(\hat{e}^2) X$, with a sparse central matrix.
4.  Take the cross product of `X*e_hat` with itself. It takes advantage of the element-by-element operation.
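As a quick sanity check, the four routes listed above can be compared numerically on a small example. This is only a sketch: the residuals are random stand-ins rather than OLS residuals, and the names `S_loop`, `S_dense`, `S_sparse` and `S_vec` are chosen here for illustration.

```python
import numpy as np
from scipy.sparse import diags

rng = np.random.default_rng(0)      # fixed seed so the sketch is reproducible
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # n x k regressors
e_hat = rng.normal(size=(n, 1))     # stand-in residuals (assumption), n x 1
e2 = np.square(e_hat).ravel()       # squared residuals as a length-n vector

# 1. sum the outer products e_i^2 * x_i x_i' one by one
S_loop = np.zeros((k, k))
for i in range(n):
    S_loop += e2[i] * np.outer(X[i], X[i])

# 2. dense n x n diagonal matrix in the middle
S_dense = X.T @ np.diag(e2) @ X

# 3. sparse diagonal matrix in the middle
S_sparse = X.T @ (diags(e2) @ X)

# 4. scale each row of X by e_hat_i, then take the cross product
Xe = X * e_hat
S_vec = Xe.T @ Xe

print(np.allclose(S_loop, S_dense),
      np.allclose(S_loop, S_sparse),
      np.allclose(S_loop, S_vec))
```

All four results are $2\times 2$ matrices that agree up to floating-point error; only the amount of work needed to obtain them differs.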

We first generate the data of binary response and regressors. Due to the discrete nature of the dependent variable, the error term in the linear probability model is heteroskedastic. It is necessary to use the heteroskedastic-robust variance to consistently estimate the asymptotic variance of the OLS estimator. The code chunk below estimates the coefficients and obtains the residual.

In [1]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix 
import random
import datetime
import math
import statistics
import matplotlib.pyplot as plt
import multiprocessing as mp

In [2]:
def lpm(n):
    # estimation in a linear probability model
    # set the parameters
    b0 = np.array([[-1], [1]])
    # generate the data
    e = np.random.normal(size=(n, 1))
    X = np.ones((n, 2))
    X[:, 1] = np.random.normal(size=n)
    # binary response: 1 if the latent index is non-negative, 0 otherwise
    Y = (X @ b0 + e >= 0).astype(float)
    # note that in this regression bhat does not converge to b0 because the model is changed.

    # OLS estimation (solve the normal equations rather than inverting X'X)
    bhat = np.linalg.solve(X.T @ X, X.T @ Y)

    e_hat = Y - X @ bhat
    return X, e_hat


We run the 4 estimators for the same data, and compare the time.

In [5]:
# an example of robust variance matrix.
# compare the implementation via matrix and vectorization.

import datetime
import numpy as np
from scipy.sparse import diags

n = 50; Rep = 1000  # small matrix
# n = 5000; Rep = 10  # large matrix

for opt in range(4):
    pts0 = datetime.datetime.now()
    for rep in range(Rep):
        np.random.seed(rep)
        X, e_hat = lpm(n)
        if opt == 0:
            # sum the outer products e_i^2 * x_i x_i' one by one
            XXe2 = np.zeros((2, 2))
            for i in range(n):
                XXe2 += e_hat[i, 0]**2 * np.outer(X[i], X[i])
        elif opt == 1:
            # dense diagonal matrix in the middle
            e_hat2_M = np.diag(np.square(e_hat).ravel())
            XXe2 = X.T @ e_hat2_M @ X
        elif opt == 2:
            # sparse diagonal matrix in the middle
            e_hat2_M = diags(np.square(e_hat).ravel(), format="csr")
            XXe2 = X.T @ (e_hat2_M @ X)
        elif opt == 3:
            # element-by-element product, then cross product
            Xe = X * e_hat
            XXe2 = Xe.T @ Xe
        XX_inv = np.linalg.inv(X.T @ X)
        sig_B = XX_inv @ XXe2 @ XX_inv
    print("n = ", n, ", Rep = ", Rep, ", opt = ", opt, ", time = ", datetime.datetime.now() - pts0, "\n")

We clearly see the difference in running time, though the 4 methods are mathematically the same.
When $n$ is small, the dense matrix is fast and the sparse matrix is slow; the vectorized version is the fastest.
When $n$ is big, the dense matrix is slow and the sparse matrix is fast; the vectorized version is still the fastest.
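Timings like these can also be collected with the standard-library `timeit` module, which repeats a call several times and so smooths out noise. The sketch below compares only the dense-diagonal and the vectorized routes; the sample size, the random stand-in residuals and the function names `dense_diag` and `vectorized` are assumptions made for illustration, and the measured times depend on the machine.

```python
import timeit
import numpy as np

rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
e_hat = rng.normal(size=(n, 1))   # stand-in residuals (assumption)

def dense_diag():
    # builds an n x n matrix that is almost entirely zeros
    return X.T @ np.diag(np.square(e_hat).ravel()) @ X

def vectorized():
    # scales each row of X by its residual, then takes the cross product
    Xe = X * e_hat
    return Xe.T @ Xe

# best of 3 repeats, each repeat calling the function 5 times
t_dense = min(timeit.repeat(dense_diag, number=5, repeat=3))
t_vec = min(timeit.repeat(vectorized, number=5, repeat=3))
print(f"dense diag: {t_dense:.4f}s, vectorized: {t_vec:.4f}s")
```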

## Efficient Loop

In standard `for` loops, we have to do a lot of housekeeping work. 

In [6]:
def CI(x): # construct confidence interval
           # x is a vector of random variables
    n = len(x)
    mu = np.mean(x)
    sig = np.std(x)
    upper = mu + 1.96 / np.sqrt(n) * sig
    lower = mu - 1.96 / np.sqrt(n) * sig
    return {'lower': lower, 'upper': upper}

This is a standard `for` loop.

In [7]:
Rep = 100000
sample_size = 10
mu = 2

# append a new outcome after each loop
pts0 = datetime.datetime.now() # check time
for i in range(Rep):
    x = np.random.poisson(mu, sample_size)
    bounds = CI(x)
    out_i = ( ( bounds['lower'] <= mu  ) & (mu <= bounds['upper']) )
    if i == 0:
        out = np.array(out_i)
    else:
        out = np.append(out, out_i)

stat_cover = np.count_nonzero(out)/Rep*100

print( "empirical coverage probability = ", stat_cover, "% \n") # empirical size
pts1 = datetime.datetime.now() - pts0 # check time elapse
print(pts1, "\n")

empirical coverage probability =  89.266 % 

0:00:05.230653 



### Classical loop with an empty list

In [8]:
Rep = 100000
sample_size = 10
mu = 2

# append a new outcome after each loop

pts0 = datetime.datetime.now() # check time

# Empty list
out = list()

for i in range(Rep):    
    x = np.random.poisson(mu, sample_size)
    bounds = CI(x)
    out.append( ( bounds['lower'] <= mu  ) & (mu <= bounds['upper']) )

stat_cover = out.count(True)/Rep*100


print( "empirical coverage probability = ", stat_cover, "% \n") # empirical size
pts1 = datetime.datetime.now() - pts0 # check time elapse
print(pts1) 

empirical coverage probability =  89.27000000000001 % 

0:00:03.802653


### Classical loop with an existing list and overwriting

In [9]:
Rep = 100000
sample_size = 10
mu = 2

# overwrite an existing list

pts0 = datetime.datetime.now() # check time

# List with same length as Rep
out = [0] * Rep

for i in range(Rep):
    x = np.random.poisson(mu, sample_size)
    bounds = CI(x)
    out[i] = (bounds['lower'] <= mu) & (mu <= bounds['upper'])

stat_cover = out.count(True)/Rep*100

print( "empirical coverage probability = ", stat_cover, "% \n") # empirical size
pts1 = datetime.datetime.now() - pts0 # check time elapse
print(pts1) 

empirical coverage probability =  89.156 % 

0:00:04.093897


Pay attention to the line `out = [0] * Rep`. It *pre-defines* a list `out` to be filled by `out[i] = (bounds['lower'] <= mu) & (mu <= bounds['upper'])`. The computer reserves a block of memory for `out`, and when a new result comes in, the old element is replaced. If we do not pre-define `out` but grow it with `np.append` in each loop, as in the first example, the length of `out` changes in each replication, and every time a new array is allocated and the old contents are copied into it. The latter approach spends much more time on memory management. Appending to a Python list with `out.append` is much cheaper, because a list reserves spare capacity and only occasionally needs to be moved.

`out` is the result container. In a `for` loop, we pre-define a container, and replace the elements
of the container in each loop by explicitly calling the index.
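A NumPy array can serve as the pre-defined container as well. The sketch below assumes a Boolean array created with `np.empty` and a seeded generator from `np.random.default_rng`; neither appears in the cells above, and `Rep` is reduced to keep the example quick.

```python
import numpy as np

def CI(x):
    # confidence interval for the mean of x, as defined at the start of this section
    n = len(x)
    mu_hat = np.mean(x)
    sig = np.std(x)
    upper = mu_hat + 1.96 / np.sqrt(n) * sig
    lower = mu_hat - 1.96 / np.sqrt(n) * sig
    return {'lower': lower, 'upper': upper}

Rep = 1000
sample_size = 10
mu = 2
rng = np.random.default_rng(0)     # seeded generator (assumption, for reproducibility)

out = np.empty(Rep, dtype=bool)    # container allocated once, before the loop
for i in range(Rep):
    x = rng.poisson(mu, sample_size)
    bounds = CI(x)
    out[i] = (bounds['lower'] <= mu) & (mu <= bounds['upper'])  # overwrite slot i

stat_cover = out.mean() * 100
print("empirical coverage probability = ", stat_cover, "%")
```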

### For loop with a function

In [10]:
Rep = 100000
sample_size = 10
mu = 2

# Create a function and let it run with a for loop

def capture(i):
    x = np.random.poisson(mu, sample_size)
    bounds = CI(x)
    return ((bounds['lower'] <= mu) & (mu <= bounds['upper']))

pts0 = datetime.datetime.now()  # check time

out = [capture(i) for i in range(Rep)]

stat_cover = out.count(True)/Rep*100

print( "empirical coverage probability = ", stat_cover, "% \n") # empirical size
pts1 = datetime.datetime.now() - pts0  # check time elapse
print(pts1)

empirical coverage probability =  89.117 % 

0:00:03.625780


### `apply_along_axis()` (this version still has an error)

In [11]:
import numpy as np
import datetime
from scipy import stats

# Set parameters
Rep = 10
sample_size = 1000
mu = 2

# Define a function to calculate the confidence interval
def CI(x):
    alpha = 0.05
    n = len(x)
    mu = np.mean(x)
    std_err = np.std(x, ddof=1) / np.sqrt(n)
    z = np.abs(stats.norm.ppf(alpha / 2))
    lower = mu - z * std_err
    upper = mu + z * std_err
    return {'lower': lower, 'upper': upper}

# Define a function to capture the empirical coverage probability
def capture(i):
    x = np.random.poisson(mu, sample_size)
    bounds = CI(x)
    return ((bounds['lower'] <= mu) & (mu <= bounds['upper']))

# Capture start time
pts0 = datetime.datetime.now()

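# Use apply_along_axis to apply the capture function to an array of indices.
# Caution: with a 1-D `arr`, the whole array is passed to `capture` in a single call,
# so `out` is one Boolean rather than Rep outcomes. This is the error noted in the heading,
# and it explains the 10.0 printed below (one True divided by Rep = 10).
out = np.apply_along_axis(capture, axis=0, arr=np.arange(Rep))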

# Calculate the empirical coverage probability
emp_coverage_prob = np.sum(out) / Rep * 100

# Print the result
print("empirical coverage probability = ", emp_coverage_prob, "\n")  # empirical size

# Capture elapsed time
pts1 = datetime.datetime.now() - pts0
print("Elapsed time:", pts1)



empirical coverage probability =  10.0 

Elapsed time: 0:00:00.001038
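One possible way to get one outcome per replication from `np.apply_along_axis` is to pass the indices as a `1 x Rep` array, so that each column becomes its own length-1 slice along `axis=0`. This is a sketch rather than a recommended pattern: `apply_along_axis` still loops in Python internally, the fixed 1.96 critical value and the seeded generator are assumptions, and `Rep` is enlarged so the coverage estimate is informative.

```python
import numpy as np

Rep = 200
sample_size = 1000
mu = 2
rng = np.random.default_rng(0)   # seeded generator (assumption, for reproducibility)

def CI(x):
    # normal-approximation confidence interval for the mean of x
    n = len(x)
    center = np.mean(x)
    std_err = np.std(x, ddof=1) / np.sqrt(n)
    return {'lower': center - 1.96 * std_err, 'upper': center + 1.96 * std_err}

def capture(idx):
    # idx is a length-1 slice holding the replication index; it is not used
    x = rng.poisson(mu, sample_size)
    bounds = CI(x)
    return (bounds['lower'] <= mu) & (mu <= bounds['upper'])

indices = np.arange(Rep).reshape(1, Rep)         # one column per replication
out = np.apply_along_axis(capture, 0, indices)   # capture is called once per column

emp_coverage_prob = np.sum(out) / Rep * 100
print("empirical coverage probability = ", emp_coverage_prob, "%")
```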


In [12]:
Rep = 10
sample_size = 1000
mu = 2

# Create a function and let it run with map

def capture(i):
    x = np.random.poisson(mu, sample_size)
    bounds = CI(x)
    return ((bounds['lower'] <= mu) & (mu <= bounds['upper']))

pts0 = datetime.datetime.now()  # check time

out = list(map(capture, range(Rep)))
stat_cover = out.count(True)/Rep*100

print( "empirical coverage probability = ", stat_cover, "% \n") # empirical size

pts1 = datetime.datetime.now() - pts0  # check time elapse
print(list(out))

empirical coverage probability =  90.0 % 

[True, True, False, True, True, True, True, True, True, True]
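The loops above can also be replaced by a fully vectorized simulation that draws all replications at once and computes row-wise statistics. The sketch assumes the same design as the first loops in this section (`Rep = 100000`, `sample_size = 10`, `mu = 2`, and `np.std` with its default `ddof=0`); the seeded generator and the variable names are chosen here for illustration.

```python
import numpy as np

Rep = 100000
sample_size = 10
mu = 2
rng = np.random.default_rng(0)   # seeded generator (assumption, for reproducibility)

# one row per replication, one column per observation
x = rng.poisson(mu, size=(Rep, sample_size))

mean = x.mean(axis=1)            # sample mean of each replication
sig = x.std(axis=1)              # ddof=0, matching the first CI() in this section
half_width = 1.96 / np.sqrt(sample_size) * sig

# True where the interval of a replication covers the true mean
covered = (mean - half_width <= mu) & (mu <= mean + half_width)

stat_cover = covered.mean() * 100
print("empirical coverage probability = ", stat_cover, "%")
```

No Python-level loop is left, so the work is done inside NumPy's compiled routines. The price is memory: the whole `Rep x sample_size` array is held at once.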


## Parallel Computing

Parallel computing becomes essential when the data size is beyond the storage of a single computer, for example  {% cite li2018embracing %}.
Here we explore the speed gain of parallel computing on a multicore machine.

The standard-library module `multiprocessing` is a common choice for parallel computing in Python.
Below is the basic structure.

In [13]:
# import multiprocessing
from multiprocessing import Process, current_process
import multiprocessing as mp
import os

print("Number of processors: ", mp.cpu_count())

Number of processors:  16


In [5]:
Rep = 10
sample_size = 10
mu = 2

for i in range(Rep):
    np.random.seed(i)
    x = np.random.poisson(mu, sample_size)
    print(x)

[3 2 5 1 0 0 7 1 3 3]
[2 1 0 1 2 2 0 3 3 3]
[1 2 1 2 2 1 4 1 1 0]
[2 3 1 1 2 2 2 2 1 2]
[5 1 1 2 2 0 1 5 0 1]
[2 4 1 0 2 2 2 2 1 1]
[3 0 2 2 5 3 5 2 3 4]
[0 4 1 1 3 2 2 2 2 3]
[4 0 2 3 3 2 0 1 1 1]
[0 2 1 1 0 1 2 4 4 3]


If we have two CPUs running simultaneously, in theory we can cut the time to half of that on a single CPU. Is that what happens in practice?

### Multiprocessing with the `Process` class

In [14]:
import numpy as np
import multiprocessing as mp
import datetime
from scipy import stats

# Set parameters
Rep = 10
sample_size = 1000
mu = 2

# Define a function to calculate the confidence interval
def CI(x):
    alpha = 0.05
    n = len(x)
    mu = np.mean(x)
    std_err = np.std(x, ddof=1) / np.sqrt(n)
    z = np.abs(stats.norm.ppf(alpha / 2))
    lower = mu - z * std_err
    upper = mu + z * std_err
    return {'lower': lower, 'upper': upper}


def capture(i, return_dict):
    np.random.seed(i)
    x = np.random.poisson(mu, sample_size)
    bounds = CI(x)
    result = ((bounds['lower'] <= mu) & (mu <= bounds['upper']))
    print("The result: " + str(result))
    return_dict[i] = result


pts0 = datetime.datetime.now()  # check time

manager = mp.Manager()
return_dict = manager.dict()
jobs = []

for i in range(Rep):
    p = mp.Process(target=capture, args=(i, return_dict))
    jobs.append(p)
    p.start()

for proc in jobs:
    proc.join()

# Count the number of True values in the return_dict
emp_coverage_prob = sum(return_dict.values()) / Rep * 100

print("empirical coverage probability = ", emp_coverage_prob, "% \n")  # empirical size

pts1 = datetime.datetime.now() - pts0  # check time elapse
print("The calculation time is:", pts1, "\n")


The result: True
The result: TrueThe result: True
The result: True

The result: True
The result: True
The result: True
The result: True
The result: True
The result: True
empirical coverage probability =  100.0 % 

The calculation time is: 0:00:00.171462 



### Multiprocessing with the `Pool` class & `apply()`

In [15]:
Rep = 200
sample_size = 2000
mu = 2


pts0 = datetime.datetime.now()  # check time

def capture(i):
    x = np.random.poisson(mu, sample_size)
    bounds = CI(x)
    return ((bounds['lower'] <= mu) & (mu <= bounds['upper']))

# Allow at most 4 worker processes at a time
pool = mp.Pool(processes=4)

# Run the replications with apply(); apply() blocks until each result is ready,
# so the calls below are executed one after another
results = [pool.apply(capture, args=(i,)) for i in range(Rep)]

print( "empirical coverage probability = ", results.count(True)/Rep*100, "\n") # empirical size
pts1 = datetime.datetime.now() - pts0 # check time elapse
print("The calculation time is:", pts1, "\n")
# print(results) 

empirical coverage probability =  98.0 

The calculation time is: 0:00:00.269177 



### Multiprocessing with the `Pool` class & `map()`

In [16]:
Rep = 200
sample_size = 2000
mu = 2

pts0 = datetime.datetime.now()  # check time
def capture(i):
    x = np.random.poisson(mu, sample_size)
    bounds = CI(x)
    return ((bounds['lower'] <= mu) & (mu <= bounds['upper']))
    
# Allow at most 4 worker processes at a time
pool = mp.Pool(processes=4)

# Run the replications with map(), which splits the indices across the workers
results = pool.map(capture, range(Rep))

print( "empirical coverage probability = ", results.count(True)/Rep*100, "\n") # empirical size
pts1 = datetime.datetime.now() - pts0 # check time elapse
print("The calculation time is:", pts1, "\n")
# print(results) 

empirical coverage probability =  98.0 

The calculation time is: 0:00:00.091526 
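One point to watch in parallel simulations is the random number stream. On platforms that start workers by forking, each worker inherits a copy of the parent's global NumPy random state, so unseeded workers can produce identical draws. A possible remedy, sketched below, is to spawn one child seed per replication with `np.random.SeedSequence` and build a separate generator inside the task. The function name `draw` and the root seed are hypothetical, and a serial `map` is used here so the sketch runs anywhere; the same list of child seeds could be handed to `pool.map`.

```python
import numpy as np

Rep = 8
sample_size = 5
mu = 2

def draw(seed_seq):
    # each task builds its own generator from its own child seed
    rng = np.random.default_rng(seed_seq)
    return rng.poisson(mu, sample_size)

# one statistically independent child seed per replication
children = np.random.SeedSequence(2020).spawn(Rep)

# serial stand-in for pool.map(draw, children)
samples = list(map(draw, children))

for s in samples:
    print(s)
```

Because the child seeds are derived deterministically from the root seed, the simulation is reproducible regardless of how the tasks are distributed over the workers.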



## Remote Computing

Investing money from our own pocket in an extremely powerful laptop to conduct heavy-lifting computational work
is unnecessary. (i) We do not run these long jobs every day, so it is more cost-efficient
to share a workhorse. (ii) We cannot keep our laptop always on when we move it
around. The right solution is remote computing on a server.

No fundamental difference lies between local and remote computing.
We prepare the data and code, open a shell for communication, run the code, and collect the results.
One potential obstacle is dealing with a command-line-based operating system.
Such command-line tools were the norm two or three decades ago, but today we mostly
work in a graphical operating system like Windows or OSX.
For Windows users (I am one of them), I recommend [PuTTY](http://www.putty.org/), a shell, and [WinSCP](http://winscp.net/eng/download.php), a graphic interface for input and output.

Most servers in the world run a Unix/Linux operating system.
Here are a few commands for basic operations.

* mkdir
* cd
* cp
* top
* screen
* ssh user@address
* start a program

Our department's computation infrastructure has been improving.
A server dedicated to professors is a 16-core machine. I have opened an account for you.
You can try out this script on `econsuper`.

1. Log in to `econsuper.econ.cuhk.edu.hk`;
2. Save the code block below as `loop_server.R`, and upload it to the server;
3. In a shell, run `R --vanilla < loop_server.R > result_your_name.out`;
4. To run a command in the background, add `&` at the end of the above command. To keep it running after closing the console, add `nohup` at the beginning of the command.