diff --git a/.gitignore b/.gitignore
index 2c3747f..3013a7b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -9,3 +9,5 @@ copy-qmds.R
 crossref.sh
 cover.png
 /.quarto/
+*.tex
+*.pdf
\ No newline at end of file
diff --git a/_quarto.yml b/_quarto.yml
index 06699ba..086101e 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -43,6 +43,7 @@ book:
       - inference/clt.qmd
       - inference/confidence-intervals.qmd
       - inference/hypothesis-testing.qmd
+      - inference/bootstrap.qmd
       - inference/models.qmd
       - inference/bayes.qmd
       - inference/hierarchical-models.qmd
@@ -83,6 +84,18 @@ format:
     code-link: true
     author-meta: Rafael A. Irizarry
     callout-appearance: simple
+  pdf:
+    documentclass: krantz
+    classoption: [krantz2,10pt,twoside,onecolumn,final,openright]
+    include-in-header: preamble.tex
+    header-includes: |
+      \usepackage{amssymb}
+      \usepackage{amsmath}
+      \usepackage{graphicx}
+      \usepackage{subfigure}
+      \usepackage{makeidx}
+      \usepackage{multicol}
+    keep-tex: true
 knitr:
   opts_chunk:
diff --git a/docs/highdim/dimension-reduction.html b/docs/highdim/dimension-reduction.html
index 980c783..419a92a 100644
--- a/docs/highdim/dimension-reduction.html
+++ b/docs/highdim/dimension-reduction.html
@@ -5,7 +5,7 @@
-Advanced Data Science - 21  Dimension reduction
+Advanced Data Science - 22  Dimension reduction
10  Bootstrap


The CLT provides a useful approach to building confidence intervals and performing hypothesis testing. However, it does not always apply. Here we provide a short introduction to an alternative approach to estimating the distribution of an estimate that does not rely on the CLT.

+

+10.1 Example: median income

+

Suppose the income distribution of your population is as follows:

+
+
set.seed(1995)
+n <- 10^6
+income <- 10^(rnorm(n, log10(45000), log10(3)))
+hist(income/10^3, nclass = 1000)

The population median is:

+
+
m <- median(income)
+m
+#> [1] 44939
+
+

Suppose we don’t have access to the entire population, but want to estimate the median \(m\). We take a sample of 100 and estimate the population median \(m\) with the sample median \(M\):

+
+
N <- 100
+x <- sample(income, N)
+median(x)
+#> [1] 38461
+
+

+10.2 Confidence intervals for the median

+

Can we construct a confidence interval? What is the distribution of \(M\)?

+

Because we are simulating the data, we can use a Monte Carlo simulation to learn the distribution of \(M\).

+
+
library(gridExtra)
+B <- 10^4
+m <- replicate(B, {
+  x <- sample(income, N)
+  median(x)
+})
+hist(m, nclass = 30)
+qqnorm(scale(m)); abline(0,1)

If we know this distribution, we can construct a confidence interval. The problem here is that, as we have already described, in practice we do not have access to the distribution. Previously, we used the Central Limit Theorem, but the CLT we studied applies to averages, and here we are interested in the median. We can see that the 95% confidence interval based on the CLT

+
+
median(x) + 1.96*sd(x)/sqrt(N)*c(-1, 1)
+#> [1] 21018 55905
+
+

is quite different from the confidence interval we would generate if we knew the actual distribution of \(M\):

+
+
quantile(m, c(0.025, 0.975))
+#>  2.5% 97.5% 
+#> 34438 59050
+
+

The bootstrap permits us to approximate a Monte Carlo simulation without access to the entire distribution. The general idea is relatively simple: we act as if the observed sample is the population. We then sample, with replacement, datasets of the same size as the original dataset. Finally, we compute the summary statistic, in this case the median, on each of these bootstrap samples.

+

Theory tells us that, in many situations, the distribution of the statistic computed on the bootstrap samples approximates the distribution of our actual statistic. This is how we construct bootstrap samples and an approximate distribution:

+
+
B <- 10^4
+m_star <- replicate(B, {
+  x_star <- sample(x, N, replace = TRUE)
+  median(x_star)
+})
+
+
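To see how well the bootstrap distribution approximates the Monte Carlo distribution of \(M\), we can compare the two sets of medians computed above with a quantile-quantile plot. This comparison is an added illustration, not part of the original text, and assumes the objects m and m_star defined above are still available:
## compare the quantiles of the Monte Carlo medians (m) and the bootstrap medians (m_star)
qqplot(m, m_star, xlab = "Monte Carlo medians", ylab = "Bootstrap medians")
abline(0, 1)  ## identity line for reference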

Note that a confidence interval constructed with the bootstrap is much closer to the one constructed with the actual distribution:

+
+
quantile(m_star, c(0.025, 0.975))
+#>  2.5% 97.5% 
+#> 30253 56909
+
+

For more on the bootstrap, including corrections one can apply to improve these confidence intervals, please consult the book An Introduction to the Bootstrap by Bradley Efron and Robert Tibshirani.
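If you want to try such corrections, one option is the boot package, which implements several bootstrap confidence intervals, including the bias-corrected and accelerated (BCa) interval. The following is only a sketch, not part of the original text, applied to the sample x defined above:
library(boot)
## the statistic passed to boot() must accept the data and a vector of resampled indices
med <- function(d, i) median(d[i])
boot_fit <- boot(x, med, R = 10^4)
## percentile and BCa intervals for the median
boot.ci(boot_fit, type = c("perc", "bca"))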

+

+10.3 Exercises

+

1. Generate a random dataset like this:

+
+
y <- rnorm(100, 0, 1)
+
+

Estimate the 75th quantile, which we know is:

+
+
qnorm(0.75)
+
+

with the sample quantile:

+
+
quantile(y, 0.75)
+
+

Run a Monte Carlo simulation to learn the expected value and standard error of this random variable.

+

2. In practice, we can’t run a Monte Carlo simulation because we don’t know if rnorm is being used to simulate the data. Use the bootstrap to estimate the standard error using just the initial sample y. Use 10 bootstrap samples.

+

3. Redo exercise 2, but with 10,000 bootstrap samples.

\ No newline at end of file
diff --git a/docs/inference/clt.html b/docs/inference/clt.html
index 189d421..83ebf65 100644
--- a/docs/inference/clt.html
+++ b/docs/inference/clt.html
@@ -420,38 +426,38 @@

-

The CLT tells us that the distribution function for a sum of draws is approximately normal. We also learned that dividing a normally distributed random variable by a constant is also a normally distributed variable. This implies that the distribution of \(\bar{X}\) is approximately normal.

+

The CLT tells us that the distribution function for a sum of draws is approximately normal. Additionally, we have learned that dividing a normally distributed random variable by a constant results in another normally distributed variable. This implies that the distribution of \(\bar{X}\) is approximately normal.

In summary, we have that \(\bar{X}\) has an approximately normal distribution with expected value \(p\) and standard error \(\sqrt{p(1-p)/N}\).

-

Now how does this help us? Suppose we want to know what is the probability that we are within 1% from \(p\). We are basically asking what is

+

Now how does this help us? Suppose we want to know the probability that we are within 1% of \(p\). We are basically asking what is:

\[ \mbox{Pr}(| \bar{X} - p| \leq .01) \] which is the same as:

\[ \mbox{Pr}(\bar{X}\leq p + .01) - \mbox{Pr}(\bar{X} \leq p - .01) \]

-

Can we answer this question? We can use the mathematical trick we learned in the previous chapter. Subtract the expected value and divide by the standard error to get a standard normal random variable, call it \(Z\), on the left. Since \(p\) is the expected value and \(\mbox{SE}(\bar{X}) = \sqrt{p(1-p)/N}\) is the standard error we get:

+

Can we answer this question? We can use the mathematical trick we learned in the previous section: subtract the expected value and divide by the standard error to obtain a standard normal random variable, which we’ll denote as \(Z\), on the left. Since \(p\) is the expected value and \(\mbox{SE}(\bar{X}) = \sqrt{p(1-p)/N}\) is the standard error, we get:

\[ \mbox{Pr}\left(Z \leq \frac{ \,.01} {\mbox{SE}(\bar{X})} \right) - \mbox{Pr}\left(Z \leq - \frac{ \,.01} {\mbox{SE}(\bar{X})} \right) \]

-

One problem we have is that since we don’t know \(p\), we don’t know \(\mbox{SE}(\bar{X})\). But it turns out that the CLT still works if we estimate the standard error by using \(\bar{X}\) in place of \(p\). We say that we plug-in the estimate. Our estimate of the standard error is therefore:

+

One problem we have is that since we don’t know \(p\), we don’t know \(\mbox{SE}(\bar{X})\). However, it turns out that the CLT still works if we estimate the standard error by using \(\bar{X}\) in place of \(p\). We say that we plug in the estimate. Our estimate of the standard error is therefore:

\[ \hat{\mbox{SE}}(\bar{X})=\sqrt{\bar{X}(1-\bar{X})/N} \] In statistics textbooks, we use a little hat to denote estimates. The estimate can be constructed using the observed data and \(N\).

-

Now we continue with our calculation, but dividing by \(\hat{\mbox{SE}}(\bar{X})=\sqrt{\bar{X}(1-\bar{X})/N})\) instead. In our first sample we had 12 blue and 13 red so \(\bar{X} = 0.48\) and our estimate of standard error is:

+

Now we continue with our calculation, but dividing by \(\hat{\mbox{SE}}(\bar{X})=\sqrt{\bar{X}(1-\bar{X})/N}\) instead. In our first sample, we had 12 blue and 13 red, so \(\bar{X} = 0.48\) and our estimate of the standard error is:

x_hat <- 0.48
 se <- sqrt(x_hat*(1-x_hat)/25)
 se
 #> [1] 0.0999
-

And now we can answer the question of the probability of being close to \(p\). The answer is:

+

Now, we can answer the question of the probability of being close to \(p\). The answer is:

pnorm(0.01/se) - pnorm(-0.01/se)
 #> [1] 0.0797

Therefore, there is a small chance that we will be close. A poll of only \(N=25\) people is not really very useful, at least not for a close election.

-

Earlier we mentioned the margin of error. Now we can define it because it is simply two times the standard error, which we can now estimate. In our case it is:

+

Earlier, we mentioned the margin of error. Now, we can define it simply as two times the standard error, which we can now estimate. In our case it is:

1.96*se
 #> [1] 0.196
@@ -470,22 +476,22 @@

pnorm(1.96) - pnorm(-1.96)
 #> [1] 0.95

-

Hence, there is a 95% probability that \(\bar{X}\) will be within \(1.96\times \hat{SE}(\bar{X})\), in our case within about 0.2, of \(p\). Note that 95% is somewhat of an arbitrary choice and sometimes other percentages are used, but it is the most commonly used value to define margin of error. We often round 1.96 up to 2 for simplicity of presentation.

+

Hence, there is a 95% probability that \(\bar{X}\) will be within \(1.96\times \hat{SE}(\bar{X})\), in our case within about 0.2, of \(p\). Observe that 95% is somewhat of an arbitrary choice and sometimes other percentages are used, but it is the most commonly used value to define margin of error. We often round 1.96 up to 2 for simplicity of presentation.

In summary, the CLT tells us that our poll based on a sample size of \(25\) is not very useful. We don’t really learn much when the margin of error is this large. All we can really say is that the popular vote will not be won by a large margin. This is why pollsters tend to use larger sample sizes.

-

From the table above, we see that typical sample sizes range from 700 to 3500. To see how this gives us a much more practical result, notice that if we had obtained a \(\bar{X}\)=0.48 with a sample size of 2,000, our standard error \(\hat{\mbox{SE}}(\bar{X})\) would have been 0.0111714. So our result is an estimate of 48% with a margin of error of 2%. In this case, the result is much more informative and would make us think that there are more red balls than blue. Keep in mind, however, that this is hypothetical. We did not take a poll of 2,000 since we don’t want to ruin the competition.

+

From the table above, we see that typical sample sizes range from 700 to 3500. To see how this gives us a much more practical result, consider that if we had obtained a \(\bar{X}\)=0.48 with a sample size of 2,000, our standard error \(\hat{\mbox{SE}}(\bar{X})\) would have been 0.0111714. So our result is an estimate of 48% with a margin of error of 2%. In this case, the result is much more informative and would lead us to believe that there are more red balls than blue. Keep in mind, however, that this is hypothetical. We did not take a poll of 2,000, since we don’t want to ruin the competition.
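As a quick check of these numbers (an added illustration, not part of the original text), we can plug \(\bar{X}=0.48\) and \(N=2000\) into the standard error formula:
x_hat <- 0.48
se_hat <- sqrt(x_hat*(1 - x_hat)/2000)  ## estimated standard error with N = 2,000
se_hat
#> [1] 0.0112
1.96*se_hat  ## margin of error
#> [1] 0.0219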

7.1 A Monte Carlo simulation

-

Suppose we want to use a Monte Carlo simulation to corroborate the tools we have built using probability theory. To create the simulation, we would write code like this:

-
+

Suppose we want to use a Monte Carlo simulation to corroborate the tools we have developed using probability theory. To create the simulation, we would write code like this:

+
B <- 10000
 N <- 1000
 x_hat <- replicate(B, {
-  x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
+  x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1 - p, p))
   mean(x)
 })
-

The problem is, of course, we don’t know p. We could construct an urn like the one pictured above and run an analog (without a computer) simulation. It would take a long time, but you could take 10,000 samples, count the beads and keep track of the proportions of blue. We can use the function take_poll(n=1000) instead of drawing from an actual urn, but it would still take time to count the beads and enter the results.

-

One thing we therefore do to corroborate theoretical results is to pick one or several values of p and run the simulations. Let’s set p=0.45. We can then simulate a poll:

+

The problem is, of course, that we don’t know p. We could construct an urn, similar to the one pictured above, and conduct an analog simulation (without a computer). While time-consuming, we could take 10,000 samples, count the beads, and track the proportions of blue. We can use the function take_poll(n=1000), instead of drawing from an actual urn, but it would still take time to count the beads and enter the results.

+

Therefore, one approach we can use to corroborate theoretical results is to pick one or several values of p and run simulations. Let’s set p=0.45. We can then simulate a poll:

p <- 0.45
 N <- 1000
@@ -501,14 +507,14 @@ 

  mean(x)
})

-

To review, the theory tells us that \(\bar{X}\) is approximately normally distributed, has expected value \(p=\) 0.45 and standard error \(\sqrt{p(1-p)/N}\) = 0.0157321. The simulation confirms this:

+

To review, the theory tells us that \(\bar{X}\) is approximately normally distributed, has expected value \(p=\) 0.45, and standard error \(\sqrt{p(1-p)/N}\) = 0.0157321. The simulation confirms this:

mean(x_hat)
 #> [1] 0.45
 sd(x_hat)
-#> [1] 0.0158
+#> [1] 0.0157
-

A histogram and qq-plot confirm that the normal approximation is accurate as well:

+

A histogram and qqplot confirm that the normal approximation is also accurate:
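The plotting code is not shown in this hunk. A minimal way to produce such plots from the simulated x_hat values (a sketch added here, not necessarily the code used in the book) is:
hist(x_hat, nclass = 30)            ## distribution of the simulated sample proportions
qqnorm(scale(x_hat)); abline(0, 1)  ## compare standardized values to the standard normal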

@@ -517,25 +523,25 @@

-

Of course, in real life we would never be able to run such an experiment because we don’t know \(p\). But we could run it for various values of \(p\) and \(N\) and see that the theory does indeed work well for most values. You can easily do this by re-running the code above after changing p and N.

+

Of course, in real life, we would never be able to run such an experiment because we don’t know \(p\). However, we can run it for various values of \(p\) and \(N\) and see that the theory does indeed work well for most values. You can easily do this by rerunning the code above after changing the values of p and N.

7.2 The spread

-

The competition is to predict the spread, not the proportion \(p\). However, because we are assuming there are only two parties, we know that the spread is \(\mu = p - (1-p) = 2p - 1\). As a result, everything we have done can easily be adapted to an estimate of \(\mu\). Once we have our estimate \(\bar{X}\) and \(\hat{\mbox{SE}}(\bar{X})\), we estimate the spread with \(2\bar{X} - 1\) and, since we are multiplying by 2, the standard error is \(2\hat{\mbox{SE}}(\bar{X})\). Note that subtracting 1 does not add any variability so it does not affect the standard error.

-

For our 25 item sample above, our estimate \(p\) is .48 with margin of error .20 and our estimate of the spread is 0.04 with margin of error .40. Again, not a very useful sample size. However, the point is that once we have an estimate and standard error for \(p\), we have it for the spread \(\mu\).

+

The objective of the competition is to predict the spread, not the proportion \(p\). However, since we are assuming there are only two parties, we know that the spread is \(\mu = p - (1-p) = 2p - 1\). As a result, everything we have done can easily be adapted to an estimate of \(\mu\). Once we have our estimate \(\bar{X}\) and \(\hat{\mbox{SE}}(\bar{X})\), we estimate the spread with \(2\bar{X} - 1\) and, since we are multiplying by 2, the standard error is \(2\hat{\mbox{SE}}(\bar{X})\). Note that subtracting 1 does not add any variability, so it does not affect the standard error.

+

For our 25 item sample above, our estimate of \(p\) is .48 with margin of error .20, and our estimate of the spread is 0.04 with margin of error .40. Again, this is not a very useful sample size. Nevertheless, the point is that, once we have an estimate and standard error for \(p\), we have it for the spread \(\mu\).

-

We use \(\mu\) the denote the spread here and in the next chapters because this is the typical notation used in statistical textbooks for the parameter of interest. The reason we use \(\mu\) is because a populuation mean is often the parameter of interest and \(\mu\) is the Greek letter for m.

+

We use \(\mu\) to denote the spread here and in the next sections because this is the typical notation used in statistical textbooks for the parameter of interest. The reason we use \(\mu\) is that a population mean is often the parameter of interest, and \(\mu\) is the Greek letter for m.

-7.3 Bias: why not run a very large poll?

-

For realistic values of \(p\), say from 0.35 to 0.65, if we run a very large poll with 100,000 people, theory tells us that we would predict the election perfectly since the largest possible margin of error is around 0.3%:

-
+7.3 Bias: Why not run a very large poll?

+

For realistic values of \(p\), let’s say ranging from 0.35 to 0.65, if we conduct a very large poll with 100,000 people, theory tells us that we would predict the election perfectly, as the largest possible margin of error is around 0.3%:

+
#> Warning: `qplot()` was deprecated in ggplot2 3.4.0.
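As a rough check of the 0.3% figure (an added illustration, not part of the original text), the margin of error is largest when \(p=0.5\), so with a sample size of 100,000:
N <- 10^5
1.96*sqrt(0.5*0.5/N)  ## largest possible margin of error
#> [1] 0.0031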
@@ -544,12 +550,12 @@

-

One reason is that running such a poll is very expensive. Another possibly more important reason is that theory has its limitations. Polling is much more complicated than picking beads from an urn. Some people might lie to pollsters and others might not have phones. But perhaps the most important way an actual poll differs from an urn model is that we actually don’t know for sure who is in our population and who is not. How do we know who is going to vote? Are we reaching all possible voters? Hence, even if our margin of error is very small, it might not be exactly right that our expected value is \(p\). We call this bias. Historically, we observe that polls are indeed biased, although not by that much. The typical bias appears to be about 1-2%. This makes election forecasting a bit more interesting and we will talk about how to model this in a later chapter.

+

One reason is that conducting such a poll is very expensive. Another, and possibly more important reason, is that theory has its limitations. Polling is much more complicated than simply picking beads from an urn. Some people might lie to pollsters, and others might not have phones. However, perhaps the most important way an actual poll differs from an urn model is that we don’t actually know for sure who is in our population and who is not. How do we know who is going to vote? Are we reaching all possible voters? Hence, even if our margin of error is very small, it might not be exactly right that our expected value is \(p\). We call this bias. Historically, we observe that polls are indeed biased, although not by a substantial amount. The typical bias appears to be about 1-2%. This makes election forecasting a bit more interesting, and we will explore how to model this in a later section.

7.4 Exercises

-

1. Write an urn model function that takes the proportion of Democrats \(p\) and the sample size \(N\) as arguments and returns the sample average if Democrats are 1s and Republicans are 0s. Call the function take_sample.

-

2. Now assume p <- 0.45 and that your sample size is \(N=100\). Take a sample 10,000 times and save the vector of mean(X) - p into an object called errors. Hint: use the function you wrote for exercise 1 to write this in one line of code.

-

3. The vector errors contains, for each simulated sample, the difference between the actual \(p\) and our estimate \(\bar{X}\). We refer to this difference as the error. Compute the average and make a histogram of the errors generated in the Monte Carlo simulation and select which of the following best describes their distributions:

+

1. Write an urn model function that takes the proportion of Democrats \(p\) and the sample size \(N\) as arguments, and returns the sample average if Democrats are 1s and Republicans are 0s. Call the function take_sample.

+

2. Now assume p <- 0.45 and that your sample size is \(N=100\). Take a sample 10,000 times and save the vector of mean(X) - p into an object called errors. Hint: Use the function you wrote for exercise 1 to write this in one line of code.

+

3. The vector errors contains, for each simulated sample, the difference between the actual \(p\) and our estimate \(\bar{X}\). We refer to this difference as the error. Compute the average and make a histogram of the errors generated in the Monte Carlo simulation, and select which of the following best describes their distributions:

mean(errors)
 hist(errors)
@@ -560,18 +566,18 @@

  • The errors are symmetrically distributed around 0.
  • The errors range from -1 to 1.
-

    4. The error \(\bar{X}-p\) is a random variable. In practice, the error is not observed because we do not know \(p\). Here we observe it because we constructed the simulation. What is the average size of the error if we define the size by taking the absolute value \(\mid \bar{X} - p \mid\) ?

    -

    5. The standard error is related to the typical size of the error we make when predicting. We say size because we just saw that the errors are centered around 0, so thus the average error value is 0. For mathematical reasons related to the Central Limit Theorem, we actually use the standard deviation of errors rather than the average of the absolute values to quantify the typical size. What is this standard deviation of the errors?

    +

    4. The error \(\bar{X}-p\) is a random variable. In practice, the error is not observed because we do not know \(p\). Here, we observe it since we constructed the simulation. What is the average size of the error if we define the size by taking the absolute value \(\mid \bar{X} - p \mid\)?

    +

    5. The standard error is related to the typical size of the error we make when predicting. For mathematical reasons related to the Central Limit Theorem, we actually use the standard deviation of errors, rather than the average of the absolute values, to quantify the typical size. What is this standard deviation of the errors?

    6. The theory we just learned tells us what this standard deviation is going to be because it is the standard error of \(\bar{X}\). What does theory tell us is the standard error of \(\bar{X}\) for a sample size of 100?

7. In practice, we don’t know \(p\), so we construct an estimate of the theoretical prediction by plugging in \(\bar{X}\) for \(p\). Compute this estimate. Set the seed at 1 with set.seed(1).

    -

    8. Note how close the standard error estimates obtained from the Monte Carlo simulation (exercise 5), the theoretical prediction (exercise 6), and the estimate of the theoretical prediction (exercise 7) are. The theory is working and it gives us a practical approach to knowing the typical error we will make if we predict \(p\) with \(\bar{X}\). Another advantage that the theoretical result provides is that it gives an idea of how large a sample size is required to obtain the precision we need. Earlier we learned that the largest standard errors occur for \(p=0.5\). Create a plot of the largest standard error for \(N\) ranging from 100 to 5,000. Based on this plot, how large does the sample size have to be to have a standard error of about 1%?

    +

    8. Note how close the standard error estimates obtained from the Monte Carlo simulation (exercise 5), the theoretical prediction (exercise 6), and the estimate of the theoretical prediction (exercise 7) are. The theory is working and it gives us a practical approach to knowing the typical error we will make if we predict \(p\) with \(\bar{X}\). Another advantage that the theoretical result provides is that it gives an idea of how large a sample size is required to obtain the precision we need. Earlier, we learned that the largest standard errors occur for \(p=0.5\). Create a plot of the largest standard error for \(N\) ranging from 100 to 5,000. Based on this plot, how large does the sample size have to be to have a standard error of about 1%?

    1. 100
    2. 500
    3. 2,500
    4. 4,000
    -

    9. For sample size \(N=100\), the central limit theorem tells us that the distribution of \(\bar{X}\) is:

    +

    9. For sample size \(N=100\), the Central Limit Theorem tells us that the distribution of \(\bar{X}\) is:

    1. practically equal to \(p\).
    2. approximately normal with expected value \(p\) and standard error \(\sqrt{p(1-p)/N}\).
    3. @@ -586,7 +592,7 @@

      not a random variable.

    11. To corroborate your answer to exercise 9, make a qq-plot of the errors you generated in exercise 2 to see if they follow a normal distribution.

    -

    12. If \(p=0.45\) and \(N=100\) as in exercise 2, use the CLT to estimate the probability that \(\bar{X}>0.5\). You can assume you know \(p=0.45\) for this calculation.

    +

    12. If \(p=0.45\) and \(N=100\) as in exercise 2, use the CLT to estimate the probability that \(\bar{X}>0.5\). Assume you know \(p=0.45\) for this calculation.

    13. Assume you are in a practical situation and you don’t know \(p\). Take a sample of size \(N=100\) and obtain a sample average of \(\bar{X} = 0.51\). What is the CLT approximation for the probability that your error is equal to or larger than 0.01?

diff --git a/docs/inference/clt_files/figure-html/normal-approximation-for-polls-1.png b/docs/inference/clt_files/figure-html/normal-approximation-for-polls-1.png
index ddd05a5..5a48eeb 100644
Binary files a/docs/inference/clt_files/figure-html/normal-approximation-for-polls-1.png and b/docs/inference/clt_files/figure-html/normal-approximation-for-polls-1.png differ
diff --git a/docs/inference/confidence-intervals.html b/docs/inference/confidence-intervals.html
index a31227b..3fe8f79 100644
--- a/docs/inference/confidence-intervals.html
+++ b/docs/inference/confidence-intervals.html


    -

    Confidence intervals are a very useful concept widely employed by data analysts. A version of these that are commonly seen come from the ggplot geometry geom_smooth. Here is an example using a temperature dataset available in R:

    +

Confidence intervals are a very useful concept widely employed by data analysts. A version of these that is commonly seen comes from the ggplot geometry geom_smooth. Below is an example using a temperature dataset available in R:

    @@ -429,38 +435,38 @@

    -

    In the Machine Learning part we will learn how the curve is formed, but for now consider the shaded area around the curve. This is created using the concept of confidence intervals.

    -

    In our earlier competition, you were asked to give an interval. If the interval you submitted includes the \(p\), you get half the money you spent on your “poll” back and pass to the next stage of the competition. One way to pass to the second round is to report a very large interval. For example, the interval \([0,1]\) is guaranteed to include \(p\). However, with an interval this big, we have no chance of winning the competition. Similarly, if you are an election forecaster and predict the spread will be between -100% and 100%, you will be ridiculed for stating the obvious. Even a smaller interval, such as saying the spread will be between -10 and 10%, will not be considered serious.

    +

    In the Machine Learning section, we will learn how the curve is formed, but for now consider the shaded area around the curve. This is created using the concept of confidence intervals.

    +

In our earlier competition, you were asked to give an interval. If the interval you submitted includes \(p\), you receive half the money you spent on your “poll” back and proceed to the next stage of the competition. One way to pass to the second round is to report a very large interval. For example, the interval \([0,1]\) is guaranteed to include \(p\). However, with an interval this big, we have no chance of winning the competition. Similarly, if you are an election forecaster and predict the spread will be between -100% and 100%, you will be ridiculed for stating the obvious. Even a smaller interval, such as saying the spread will be between -10 and 10%, will not be considered serious.

    On the other hand, the smaller the interval we report, the smaller our chances are of winning the prize. Likewise, a bold pollster that reports very small intervals and misses the mark most of the time will not be considered a good pollster. We want to be somewhere in between.

    -

    We can use the statistical theory we have learned to compute the probability of any given interval including \(p\). If we are asked to create an interval with, say, a 95% chance of including \(p\), we can do that as well. These are called 95% confidence intervals.

    -

    When a pollster reports an estimate and a margin of error, they are, in a way, reporting a 95% confidence interval. Let’s show how this works mathematically.

    +

    We can use the statistical theory we have learned to compute the probability of any given interval including \(p\). If we are asked to create an interval with, say, a 95% chance of including \(p\), we can do that as well; these are called 95% confidence intervals.

    +

    When a pollster reports an estimate and a margin of error, they are, in a way, reporting a 95% confidence interval. Let’s now see how this works mathematically.

    We want to know the probability that the interval \([\bar{X} - 2\hat{\mbox{SE}}(\bar{X}), \bar{X} + 2\hat{\mbox{SE}}(\bar{X})]\) contains the true proportion \(p\). First, consider that the start and end of these intervals are random variables: every time we take a sample, they change. To illustrate this, run the Monte Carlo simulation above twice. We use the same parameters as above:

    p <- 0.45
     N <- 1000

    And notice that the interval here:

    -
    -
    x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1-p, p))
    +
    +
    x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
     x_hat <- mean(x)
    -se_hat <- sqrt(x_hat * (1 - x_hat) / N)
    -c(x_hat - 1.96 * se_hat, x_hat + 1.96 * se_hat)
    -#> [1] 0.422 0.484
+se_hat <- sqrt(x_hat*(1 - x_hat)/N)
+c(x_hat - 1.96*se_hat, x_hat + 1.96*se_hat)
+#> [1] 0.427 0.489

    is different from this one:

    -
    -
    x <- sample(c(0,1), size=N, replace=TRUE, prob=c(1-p, p))
    +
    +
    x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1 - p, p))
     x_hat <- mean(x)
    -se_hat <- sqrt(x_hat * (1 - x_hat) / N)
    -c(x_hat - 1.96 * se_hat, x_hat + 1.96 * se_hat)
    -#> [1] 0.451 0.513
+se_hat <- sqrt(x_hat*(1 - x_hat)/N)
+c(x_hat - 1.96*se_hat, x_hat + 1.96*se_hat)
+#> [1] 0.467 0.529
    -

    Keep sampling and creating intervals and you will see the random variation.

    -

    To determine the probability that the interval includes \(p\), we need to compute this:

    +

    Keep sampling and creating intervals, and you will see the random variation.

    +

    To determine the probability that the interval includes \(p\), we need to compute the following:

    \[ \mbox{Pr}\left(\bar{X} - 1.96\hat{\mbox{SE}}(\bar{X}) \leq p \leq \bar{X} + 1.96\hat{\mbox{SE}}(\bar{X})\right) \]

    -

    By subtracting and dividing the same quantities in all parts of the equation, we get that the above is equivalent to:

    +

    By subtracting and dividing the same quantities in all parts of the equation, we find that the above is equivalent to:

    \[ \mbox{Pr}\left(-1.96 \leq \frac{\bar{X}- p}{\hat{\mbox{SE}}(\bar{X})} \leq 1.96\right) \]

    @@ -484,14 +490,13 @@

z
#> [1] 2.58
-

    will achieve this because by definition pnorm(qnorm(0.995)) is 0.995 and by symmetry pnorm(1-qnorm(0.995)) is 1 - 0.995. As a consequence, we have that:

    +

will achieve this because by definition pnorm(qnorm(0.995)) is 0.995, and by symmetry pnorm(-qnorm(0.995)) is 1 - 0.995. As a consequence, we have that:

    pnorm(z) - pnorm(-z)
     #> [1] 0.99

    is 0.995 - 0.005 = 0.99.
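As an added aside (not in the original text), the corresponding 99% confidence interval for the last sample drawn above could be computed from the x_hat and se_hat already defined:
z <- qnorm(0.995)  ## 2.58, as computed above
c(x_hat - z*se_hat, x_hat + z*se_hat)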

    -

    We can use this approach for any probability, not just 0.95 and 0.99. In statistics textbooks, these are usually written for any probability as \(1-\alpha\). We can then obtain the \(z\) for the equation above noting using z = qnorm(1 - alpha / 2) because \(1 - \alpha/2 - \alpha/2 = 1 - \alpha\).

    -

    So, for example, for \(\alpha=0.05\), \(1 - \alpha/2 = 0.975\) and we get the 1.96 we have been using:

    +

We can use this approach for any probability, not just 0.95 and 0.99. In statistics textbooks, confidence interval formulas are given for arbitrary probabilities written as \(1-\alpha\). We can obtain the \(z\) for the equation above using z = qnorm(1 - alpha / 2) because \(1 - \alpha/2 - \alpha/2 = 1 - \alpha\). So, for example, for \(\alpha=0.05\), \(1 - \alpha/2 = 0.975\) and we get the \(z=1.96\) we used above:

    qnorm(0.975)
     #> [1] 1.96
    @@ -499,20 +504,20 @@

    8.1 A Monte Carlo simulation

    We can run a Monte Carlo simulation to confirm that, in fact, a 95% confidence interval includes \(p\) 95% of the time.

    -
    +
    N <- 1000
     B <- 10000
     inside <- replicate(B, {
    -  x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
    +  x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1 - p, p))
       x_hat <- mean(x)
    -  se_hat <- sqrt(x_hat * (1 - x_hat) / N)
    -  between(p, x_hat - 1.96 * se_hat, x_hat + 1.96 * se_hat)
    +  se_hat <- sqrt(x_hat*(1 - x_hat)/N)
    +  between(p, x_hat - 1.96*se_hat, x_hat + 1.96*se_hat)
     })
     mean(inside)
     #> [1] 0.948

    The following plot shows the first 100 confidence intervals. In this case, we created the simulation so the black line denotes the parameter we are trying to estimate:

    -
    +

    @@ -520,15 +525,23 @@

    -

    ::: {.callout-note title = “The correct language”}

    -

    When using the theory we described above, it is important to remember that it is the intervals that are random, not \(p\). In the plot above, we can see the random intervals moving around and \(p\), represented with the vertical line, staying in the same place. The proportion of blue in the urn \(p\) is not. So the 95% relates to the probability that this random interval falls on top of \(p\). Saying the \(p\) has a 95% chance of being between this and that is technically an incorrect statement because \(p\) is not random. :::

    +
    +
    +
    + +
    +
    +

When applying the theory we described above, it’s important to remember that it’s the intervals that are random, not \(p\). In the plot above, we can see the random intervals moving around, while the proportion of blue beads in the urn \(p\), represented with the vertical line, remains in the same place. So the 95% relates to the probability that the random interval falls on top of \(p\). Stating that \(p\) has a 95% chance of being between this and that is technically incorrect because \(p\) is not random.

    +
    +
    +

    8.2 Exercises

    For these exercises, we will use actual polls from the 2016 election. You can load the data from the dslabs package.

    library(dslabs)
    -

    Specifically, we will use all the national polls that ended within one week before the election.

    +

    Specifically, we will use all the national polls that ended within one week prior to the election.

    library(tidyverse)
     polls <- polls_us_election_2016 |> 
    @@ -540,19 +553,19 @@ 

    x_hat <- polls$rawpoll_clinton[1]/100

    Assume there are only two candidates and construct a 95% confidence interval for the election night proportion \(p\).

    -

    2. Now use dplyr to add a confidence interval as two columns, call them lower and upper, to the object poll. Then use select to show the pollster, enddate, x_hat,lower, upper variables. Hint: define temporary columns x_hat and se_hat.

    +

2. Now use dplyr to add a confidence interval as two columns, call them lower and upper, to the object polls. Then, use select to show the pollster, enddate, x_hat, lower, upper variables. Hint: Define temporary columns x_hat and se_hat.

    3. The final tally for the popular vote was Clinton 48.2% and Trump 46.1%. Add a column, call it hit, to the previous table stating if the confidence interval included the true proportion \(p=0.482\) or not.

    4. For the table you just created, what proportion of confidence intervals included \(p\)?

    5. If these confidence intervals are constructed correctly, and the theory holds up, what proportion should include \(p\)?

    -

    6. A much smaller proportion of the polls than expected produce confidence intervals containing \(p\). If you look closely at the table, you will see that most polls that fail to include \(p\) are underestimating. The reason for this is undecided voters, individuals polled that do not yet know who they will vote for or do not want to say. Because, historically, undecideds divide evenly between the two main candidates on election day, it is more informative to estimate the spread or the difference between the proportion of two candidates \(\mu\), which in this election was \(0. 482 - 0.461 = 0.021\). Assume that there are only two parties and that \(\mu = 2p - 1\), redefine polls as below and re-do exercise 1, but for the difference.

    -
    +

    6. A much smaller proportion of the polls than expected produce confidence intervals containing \(p\). If you look closely at the table, you will see that most polls that fail to include \(p\) are underestimating. The reason for this is undecided voters, individuals polled that do not yet know who they will vote for or do not want to say. Because, historically, undecideds divide evenly between the two main candidates on election day, it is more informative to estimate the spread or the difference between the proportion of two candidates, \(\mu\), which in this election was \(0. 482 - 0.461 = 0.021\). Assume that there are only two parties and that \(\mu = 2p - 1\). Redefine polls as below and re-do exercise 1, but for the difference.

    +
    polls <- polls_us_election_2016 |> 
       filter(enddate >= "2016-10-31" & state == "U.S.")  |>
    -  mutate(mu_hat = rawpoll_clinton / 100 - rawpoll_trump / 100)
    + mutate(mu_hat = rawpoll_clinton/100 - rawpoll_trump/100)

    7. Now repeat exercise 3, but for the difference.

    8. Now repeat exercise 4, but for the difference.

    -

    9. Although the proportion of confidence intervals goes up substantially, it is still lower than 0.95. In the next chapter, we learn the reason for this. To motivate this, make a plot of the error, the difference between each poll’s estimate and the actual \(mu=0.021\). Stratify by pollster.

    +

9. Although the proportion of confidence intervals increases substantially, it is still lower than 0.95. In the next chapter, we learn the reason for this. To motivate this, make a plot of the error, the difference between each poll’s estimate and the actual \(\mu=0.021\). Stratify by pollster.

    10. Redo the plot that you made for exercise 9, but only for pollsters that took five or more polls.

diff --git a/docs/inference/confidence-intervals_files/figure-html/confidence-interval-coverage-1.png b/docs/inference/confidence-intervals_files/figure-html/confidence-interval-coverage-1.png
index 138e464..563870c 100644
Binary files a/docs/inference/confidence-intervals_files/figure-html/confidence-interval-coverage-1.png and b/docs/inference/confidence-intervals_files/figure-html/confidence-interval-coverage-1.png differ
diff --git a/docs/inference/hierarchical-models.html b/docs/inference/hierarchical-models.html
index 0b0db34..5ee5f1a 100644
--- a/docs/inference/hierarchical-models.html
+++ b/docs/inference/hierarchical-models.html
@@ -5,7 +5,7 @@
-Advanced Data Science - 12  Hierarchichal Models
+Advanced Data Science - 13  Hierarchichal Models