These notes introduce the concept of simulation as a potentially useful tool in a data scientist's belt.
Simulation can mean a lot of different things, depending on the discipline.
- As it relates to our previous discussions, simulation has a close relationship with optimization. That is, some objective functions are better solved by simulation than by gradient descent or other optimization algorithms. (Note: we won't cover any of these in this class.)
- Simulation can be a useful tool for debugging. Suppose you have a statistical model, and you're not quite sure what the answer should be. You can generate fake data (in which you know all of the parameters of interest) and can use that fake data to make sure that you haven't made any mistakes. This is the context we will be discussing today.
"Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results." -- Wikipedia
Why is it called "Monte Carlo"?
- The name "Monte Carlo" is usually credited to Nicholas Metropolis, a colleague of John von Neumann and Stanisław Ulam on the nuclear weapons work at Los Alamos that grew out of the Manhattan Project.
- (Monte Carlo is the name of the casino in Monaco where Ulam's uncle would gamble, borrowing money from relatives to do so)
- It was a code name for Ulam and von Neumann's early use of what we now call the Monte Carlo method
What are the odds of being dealt a flush in 5-card poker? There are two avenues to computing this:
1. Use combinatorics (i.e. combinations and permutations ... "n choose r") to solve analytically for the probability
2. Shuffle the deck and deal the cards 100,000 times and count the number of times a flush happens
Approach (2) is known as the "Monte Carlo method".
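Both avenues can be sketched in a few lines of R. This is a minimal illustration, not a definitive implementation: the seed, the helper name `is_flush`, and the choice to count any five same-suit cards (straight flushes included) are assumptions made for the example.

```r
set.seed(100)  # arbitrary seed, used only so the draws can be replicated

# Only suits matter for a flush, so represent the deck as 4 suits x 13 cards
deck <- rep(c("clubs", "diamonds", "hearts", "spades"), each = 13)

# Hypothetical helper: deal one 5-card hand and check whether all suits match
is_flush <- function() {
  hand <- sample(deck, size = 5)  # sampling without replacement = dealing
  length(unique(hand)) == 1
}

# Monte Carlo approach: deal many hands and take the share that are flushes
n_deals <- 100000
mc_prob <- mean(replicate(n_deals, is_flush()))

# Analytical approach: choose a suit, then 5 of its 13 cards, out of all hands
exact_prob <- 4 * choose(13, 5) / choose(52, 5)

print(c(monte_carlo = mc_prob, exact = exact_prob))
```

Because a flush is rare (roughly 2 hands in 1,000), the simulated share is still fairly noisy at 100,000 deals; more deals tighten it around the analytical value.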
In sports: "If the Cavs and Warriors had played that game 10 times, the Warriors would have won 9 of them." This type of discussion implicitly appeals to the Monte Carlo method (though 10 is almost never a reasonable sample size!)
Nowadays, with computing getting cheaper and cheaper, it's often easier to solve problems via simulation than by working through the analytical derivation.
Monte Carlo (or, equivalently, Simulation) can be thought of as "the data scientist's lab" because it's a way to discover new methods and test the properties of existing methods.
How to do this? Let's go through a simple example.
What's the "Hello world" of optimization? The classical linear model!
y <- X%*%beta + epsilon
Here's what we need to do the simulation:
- Random number seed (to be able to replicate random draws)
- Sample size (N)
- Number of covariates (K)
- Covariate matrix (X), which is N by K
- Parameter vector (beta), which has K elements to match the columns of X
- Distribution of ε (e.g. N(0,sigma^2))
Steps to create the data:
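set.seed(100)                                 # fix the seed so the draws can be replicated
N <- 100000                                   # sample size
K <- 10                                       # number of covariates (including the intercept)
sigma <- 0.5                                  # standard deviation of the error term
X <- matrix(rnorm(N*K,mean=0,sd=sigma),N,K)   # covariates (also drawn with sd = sigma)
X[,1] <- 1 # first column of X should be all ones
eps <- rnorm(N,mean=0,sd=sigma)               # errors, distributed N(0,sigma^2)
betaTrue <- as.vector(runif(K))               # true parameters, drawn from U(0,1)
Y <- X%*%betaTrue + eps                       # outcome implied by the linear model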
Now we have some fake data that follows our data generating process.
We can now check our simulation by, e.g., running
estimates <- lm(Y~X -1)
print(summary(estimates))
and we should get something very close to betaTrue as our estimates.
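One way to check "very close" is to put the truth and the estimates side by side. The sketch below regenerates a smaller data set so that it runs on its own; the smaller N, the `comparison` table, and the use of `drop()` to turn the outcome into a plain vector are choices made for this example, not part of the steps above.

```r
set.seed(100)
N <- 10000      # smaller than above, only to keep the example quick
K <- 10
sigma <- 0.5

# Same data generating process as in the steps above
X <- matrix(rnorm(N * K, mean = 0, sd = sigma), N, K)
X[, 1] <- 1
betaTrue <- runif(K)
Y <- drop(X %*% betaTrue) + rnorm(N, mean = 0, sd = sigma)

estimates <- lm(Y ~ X - 1)
est <- unname(coef(estimates))

# Truth, estimate, their difference, and the reported standard error
comparison <- cbind(
  truth     = betaTrue,
  estimate  = est,
  error     = est - betaTrue,
  std_error = summary(estimates)$coefficients[, "Std. Error"]
)
print(round(comparison, 4))
```

If the simulation and the estimation code are both right, each error should be small relative to its standard error; a coefficient that is off by many standard errors usually points to a mistake somewhere.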
The estimates won't be exactly equal to betaTrue unless N goes to infinity (N → ∞). This is because of randomness in our ε.
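The role of the sample size can be seen by repeating the exercise for several values of N. This is a rough sketch: the helper `max_abs_error` and the particular sample sizes are hypothetical choices for illustration.

```r
set.seed(100)
K <- 10
sigma <- 0.5
betaTrue <- runif(K)

# Hypothetical helper: simulate one data set of size N, estimate the model,
# and return the largest absolute gap between estimates and truth
max_abs_error <- function(N) {
  X <- matrix(rnorm(N * K, mean = 0, sd = sigma), N, K)
  X[, 1] <- 1
  Y <- drop(X %*% betaTrue) + rnorm(N, mean = 0, sd = sigma)
  max(abs(coef(lm(Y ~ X - 1)) - betaTrue))
}

sample_sizes <- c(100, 1000, 10000, 100000)
errors <- sapply(sample_sizes, max_abs_error)
print(data.frame(N = sample_sizes, max_abs_error = errors))
```

Since each row is a single random draw, the errors need not fall at every step, but they should shrink on the order of 1/sqrt(N) as N grows.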