# Lecture 4: Simulation
## ECON5170 Computational Methods in Economics
#### Author: Zhentao Shi
#### Date: July 2019

# Simulation

Probability theory had an infamous start because of its association with gambling.
The name Monte Carlo, after the European casino capital, is another unfortunate association.
However, naming it Macau simulation or Hong Kong Jockey Club simulation does not make me feel any better.
I have decided to simply call it "simulation".

Simulation has been widely used for (i) checking finite-sample performance of asymptotic theory; (ii) bootstrap, an automated inference procedure;
(iii) generating non-standard distributions; (iv) approximating integrals with no analytic expressions. In this lecture, we will focus on (i) and (ii), whereas (iii) and (iv)
will be deferred to the next lecture on integration.


From now on, we will start writing scripts. A script is a piece of code written for a particular
purpose. We do not write a script of thousands of lines from beginning
to end in one go; we develop it incrementally. We cut a big job into small manageable tasks:
write a small piece, test it, and perhaps encapsulate it into a user-defined function.
The small pieces are then integrated by the super structure. This is just like building an Airbus A380.
The engines and wings are made in the UK, the fuselage is made in Germany, and so on.
All pieces are assembled in Toulouse, France, and then the giant steel bird can fly.
Finally, add comments to the script to facilitate
readability. Without comments you will forget
what you did when you open the script again one month later.




**Example**

Zu Chongzhi (429--500 AD), an ancient Chinese mathematician, calculated that $\pi$ lies between 3.1415926 and 3.1415927, which
for 900 years held the world record as the most accurate approximation of $\pi$.
He used a deterministic approximation algorithm.
Now imagine that we present to Zu Chongzhi, with full respect and admiration, a modern PC. How can he achieve a better approximation? Of course, we suppose that he would not google it.

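Standing on the shoulders of the law of large numbers, we can approximate $\pi$ with a stochastic algorithm.
If points are drawn uniformly from the unit square, the fraction that falls inside the inscribed circle of radius 0.5 converges to the ratio of the two areas, $\pi/4$.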

In [1]:
# Import the NumPy library
import numpy as np
# Import the pandas library
import pandas as pd
# Import the sparse matrix class from SciPy
from scipy.sparse import csr_matrix
# Import the random module
import random
# Import datetime for timing
import datetime
# Import math
import math
# Import statistics
import statistics
# Import Matplotlib
import matplotlib.pyplot as plt



In [2]:
n = 10000
Z = np.random.rand(n, 2)
Z = np.matrix(Z)

inside = np.mean((np.sqrt((np.square(Z-0.5)).sum(axis=1)) <= 0.5), axis=0)
pi_hat = 4 * inside

print(pi_hat)
type(pi_hat)

[[3.1328]]


numpy.matrix
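The same idea can be wrapped into a small function to see the law of large numbers at work as the number of draws grows. This is only a sketch: the function name `estimate_pi`, the seed and the sample sizes are illustrative assumptions, and it uses plain arrays with NumPy's `default_rng` generator rather than `np.matrix`.

```python
import numpy as np

def estimate_pi(n, rng):
    """Estimate pi from n points drawn uniformly on the unit square."""
    z = rng.random((n, 2))
    # A point lies inside the circle centred at (0.5, 0.5) with radius 0.5
    # when its squared distance from the centre is at most 0.25.
    inside = np.sum((z - 0.5) ** 2, axis=1) <= 0.25
    # The share of points inside approximates pi / 4.
    return 4.0 * inside.mean()

rng = np.random.default_rng(2019)  # arbitrary seed, for reproducibility only
for n in (100, 10_000, 1_000_000):
    print(n, estimate_pi(n, rng))
```

The function returns an ordinary floating-point number, and the estimates should tighten around $\pi$ as `n` increases, at the usual $1/\sqrt{n}$ rate.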

## Finite Sample Evaluation

In the real world, a sample is finite. The distribution of a statistic in a finite sample depends on
the sample size $n$, and it has a simple mathematical expression only in rare cases. Fortunately,
the expression can often be simplified when we imagine the sample size being arbitrarily large.
Asymptotic theory is such an apparatus for approximating finite-sample distributions.
It is so far the best mathematical tool that helps us
understand the behavior of estimators and tests, either in econometrics or in statistics in general.
Simulation is one way to evaluate the accuracy of this approximation.
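As a minimal sketch of such an evaluation, the code below checks how often a confidence interval based on the asymptotic normal approximation covers the true mean when the data are exponential. The function name `coverage`, the choice of the exponential distribution, the sample sizes and the number of replications are all assumptions made for illustration.

```python
import numpy as np

def coverage(n, reps, rng):
    """Share of replications in which the normal-based 95% interval covers the true mean."""
    x = rng.exponential(scale=1.0, size=(reps, n))  # true mean is 1
    xbar = x.mean(axis=1)
    se = x.std(axis=1, ddof=1) / np.sqrt(n)
    # 1.96 is the 97.5% quantile of the standard normal distribution
    return np.mean(np.abs(xbar - 1.0) <= 1.96 * se)

rng = np.random.default_rng(0)  # arbitrary seed
for n in (5, 20, 100, 1000):
    print(n, coverage(n, reps=5000, rng=rng))
```

One would expect the coverage to fall short of the nominal 95% when $n$ is small and to approach it as $n$ grows, which is exactly the gap between finite-sample behavior and the asymptotic approximation.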

Even though empirical examples with real data can also be used to illustrate a statistical procedure,
artificial data are convenient and have distinct advantages. The prevalent paradigm in statistics is
to assume that the data are generated from a model. We, as researchers, check how close the estimate is to
the model, which is often characterized by a set of unknown parameters. In simulation
we have full control of the data generation process, so we also know the
true parameter.
In a real example, however, we have no knowledge about the true model, so we cannot directly
evaluate the quality of parameter estimation.
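Because the true parameter is known in a simulation, quantities such as bias and root mean squared error can be computed directly. The sketch below does this for an OLS slope; the data generating process, the sample size and the number of replications are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)      # arbitrary seed
beta_true = np.array([1.0, 1.0])    # known, because we generate the data ourselves
n, reps = 50, 2000

slopes = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)
    X = np.column_stack((np.ones(n), x))
    y = X @ beta_true + rng.normal(size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS estimate
    slopes[r] = beta_hat[1]

# Compare the estimates with the truth across replications
bias = slopes.mean() - beta_true[1]
rmse = np.sqrt(np.mean((slopes - beta_true[1]) ** 2))
print(bias, rmse)
```

With real data neither number could be computed, since `beta_true` would be unknown.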

(It would be a different story if we are mostly interested in prediction, as we often
encounter in machine learning. In such cases, we can split the data into two parts: one part
for modeling and estimation, and the other for verification.)
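A sample split of this kind can be sketched as follows. The 70/30 split, the linear model and the simulated data are assumptions made only to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)  # illustrative data generating process

# Randomly assign 70% of the observations to estimation, the rest to verification.
idx = rng.permutation(n)
train, test = idx[:140], idx[140:]

X = np.column_stack((np.ones(n), x))
beta_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# Out-of-sample mean squared prediction error on the held-out part
mse_test = np.mean((y[test] - X[test] @ beta_hat) ** 2)
print(mse_test)
```

The held-out error is computable without knowing the true model, which is why prediction quality can be assessed on real data while parameter accuracy cannot.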


**Example**

In OLS theory, the classical approach is to view $X$ as fixed regressors and to care
only about the randomness of the error term.
Modern econometrics textbooks emphasize that a random $X$ is more appropriate
for econometric applications. In rigorous textbooks, the moment conditions on $X$ are explicitly
stated.
Is asymptotic inferential theory for the OLS estimator---consistency and asymptotic normality---valid when $X$ follows a
[Pareto distribution](https://en.wikipedia.org/wiki/Pareto_distribution) with shape coefficient 1.5?
(A Pareto distribution with shape coefficient between 1 and 2 has finite mean but infinite variance.)
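To get a feel for what infinite variance means in a sample, the sketch below draws from a Pareto distribution with shape 1.5 and tracks the sample mean and sample variance as more observations are included. The seed and sample sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
a = 1.5
# NumPy's pareto() draws from the Lomax (Pareto II) form; adding 1 gives
# the classical Pareto distribution with scale 1.
x = rng.pareto(a, size=1_000_000) + 1.0

# Running sample mean and sample variance over growing subsamples
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, x[:n].mean(), x[:n].var(ddof=1))
```

The population mean here is $a/(a-1) = 3$, so the sample mean should drift toward 3, although slowly. The sample variance has no finite limit to converge to and is typically dominated by a few extreme draws.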

 1. Given a sample size, get the OLS `b_hat` and its associated `t_value`.
 2. Wrap `t_value` into a user-defined function so that we can replicate it many times.
 3. Given a sample size, report the empirical test size under two error distributions.
 4. Wrap it again into a user-defined function, ready for different sample sizes.
 5. Develop the super structure.
 6. Add comments and documentation.

In [3]:
np.random.seed(888)
# set the parameters
Rep = 10
b0 = np.ones((2, 1))
df = 1  # t dist. with df = 1 is Cauchy
n = 10
# the workhorse function
def simulation(n, dist="Normal", df=df):
    # returns the OLS slope estimate and its t-value under the null
    if dist == "Normal":
        e = np.random.randn(n, 1)  # standard normal errors (rand() would be uniform)
    elif dist == "T":
        e = np.random.standard_t(df, size=(n, 1))
    else:
        raise ValueError("dist must be 'Normal' or 'T'")

    # intercept plus a heavy-tailed regressor with shape 1.5
    X = np.hstack((np.ones((n, 1)), np.random.pareto(a=1.5, size=(n, 1))))
    y = X @ b0 + e

    XXinv = np.linalg.inv(X.T @ X)
    bhat = XXinv @ (X.T @ y)
    bhat2 = bhat[1, 0]  # parameter we want to test

    e_hat = y - X @ bhat
    sigma_hat_square = np.sum(np.square(e_hat)) / (n - 2)
    sig_B = XXinv * sigma_hat_square  # estimated variance matrix of bhat
    t_value_2 = (bhat2 - b0[1, 0]) / math.sqrt(sig_B[1, 1])

    return {"bhat2": bhat2, "t_value": t_value_2}


In [4]:
type(list(range(Rep)))
a = list(range(Rep))

### report the empirical test size implementation:

In [5]:

from scipy import stats

# report the empirical test size
def report(n):
    # collect the test sizes under the two error distributions
    # this function contains some repetitive code, but is OK for such a simple one
    TEST_SIZE = np.zeros(3)

    # e ~ normal distribution, under which the t-dist is exact
    t_values = np.array([float(simulation(n, "Normal")["t_value"]) for _ in range(Rep)])
    TEST_SIZE[0] = np.mean(np.abs(t_values) > stats.t.ppf(0.975, n - 2))
    TEST_SIZE[1] = np.mean(np.abs(t_values) > stats.norm.ppf(0.975))

    # e ~ t-distribution, under which the exact distribution is complicated.
    # we rely on asymptotic normal distribution for inference instead
    t_values = np.array([float(simulation(n, "T", df)["t_value"]) for _ in range(Rep)])
    TEST_SIZE[2] = np.mean(np.abs(t_values) > stats.norm.ppf(0.975))

    return TEST_SIZE


pts0 = datetime.datetime.now()
# run the calculation of the empirical sizes for different sample sizes
NN = [5, 10, 200, 5000]
RES = pd.DataFrame([report(n) for n in NN],
                   columns=["exact", "normal.asym", "cauchy.asym"])  # to make the results readable
RES.insert(0, "n", NN)  # put the sample size in the first column
print(RES)
print(datetime.datetime.now() - pts0)