In [None]:
from datascience import *
from prob140 import *
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

# Worksheet 4 #
All the libraries that you might reasonably want to use for the calculations in Exercises 2 and 3 are imported in the cell above.

## 1. Towards the Asymptotic Normality of the Sample Correlation
Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be i.i.d. pairs.

**(a)** Let $a, b, c, d$ be constants and assume $b$ and $d$ are non-zero. Identify the shape of the asymptotic distribution of 

$$
\frac{1}{n} \sum_{i=1}^n \left( \frac{X_i - a}{b} \right)\left( \frac{Y_i - c}{d} \right)
$$

and justify your choice. You don't have to find the asymptotic mean and variance.

**(b)** Using the notation of Worksheet 3 for the sample mean and sample variance, define the [sample correlation coefficient](https://inferentialthinking.com/chapters/15/1/Correlation.html#calculating-r)

$$
R_n ~ = ~ \frac{1}{n} \sum_{i=1}^n \left( \frac{X_i - \bar{X}_n}{\hat{\sigma}_X} \right)\left( \frac{Y_i - \bar{Y}_n}{\hat{\sigma}_Y} \right).
$$

Discuss why the asymptotic distribution of $R_n$ should have the same shape as the one you identified in Part **(a)**, apart from the mean and variance. You don't have the tools to provide a complete argument. Discuss some issues that a complete argument would have to account for.
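As a numerical sanity check on the definition of $R_n$, the sketch below computes the average product of standardized values on synthetic data and compares it with `np.corrcoef`. It assumes that $\hat{\sigma}_X$ and $\hat{\sigma}_Y$ are SDs computed with divisor $n$ (the `np.std` default); Worksheet 3 may use a different convention, in which case the two values would differ slightly. The function name `sample_correlation` and the arrays `x_demo` and `y_demo` are illustrative and not part of the worksheet.

```python
import numpy as np

def sample_correlation(x, y):
    """Average product of standardized values, using SDs with divisor n."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_su = (x - x.mean()) / x.std()   # np.std divides by n by default
    y_su = (y - y.mean()) / y.std()
    return np.mean(x_su * y_su)

# Synthetic pairs, used only to illustrate the formula
rng = np.random.default_rng(0)
x_demo = rng.normal(size=200)
y_demo = 0.5 * x_demo + rng.normal(size=200)

r_formula = sample_correlation(x_demo, y_demo)
r_numpy = np.corrcoef(x_demo, y_demo)[0, 1]
```

Under the divisor-$n$ assumption, `r_formula` and `r_numpy` agree up to floating-point error.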

## 2. Nonparametric Bootstrap and Confidence Intervals
In the nonparametric bootstrap, which is what we use in Data 8 and 100, the data $X_1, X_2, \ldots, X_n$ are assumed to be i.i.d. from some underlying distribution but nothing further is assumed about that distribution. In particular, we don't assume that it comes from any parametric family such as normal or exponential.
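The resampling step can be sketched as follows, using a synthetic one-dimensional sample and the median as the statistic. This is a minimal illustration, not the worksheet's required approach; the helper `bootstrap_statistics`, the array `demo_sample`, and the choice of 2000 resamples are all assumptions made for the example.

```python
import numpy as np

def bootstrap_statistics(sample, statistic, num_resamples, rng):
    """Resample with replacement and apply `statistic` to each resample."""
    sample = np.asarray(sample)
    n = len(sample)
    results = np.empty(num_resamples)
    for b in range(num_resamples):
        idx = rng.integers(0, n, size=n)   # n indices drawn with replacement
        results[b] = statistic(sample[idx])
    return results

# Synthetic data standing in for an observed sample
rng = np.random.default_rng(140)
demo_sample = rng.gamma(shape=2.0, scale=3.0, size=100)
boot_medians = bootstrap_statistics(demo_sample, np.median, 2000, rng)
```

Resampling by index is a deliberate choice: for paired data, the same indices can be applied to both columns so that each pair stays together in the resample.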

In this exercise you will construct bootstrap confidence intervals for an underlying correlation, using the percentile method as well as the basic bootstrap method described in seminar.

The Data 8 table `births` contains data on a random sample of mother-newborn pairs. You can assume that the sample is like a random sample drawn with replacement from a large underlying population. Let $\rho$ be the correlation between the newborns' birth weights and their mothers' heights in the underlying population. You will use the bootstrap to estimate $\rho$.

**Useful code:** For two arrays `x` and `y`, the expression `np.corrcoef(x, y)` evaluates to a $2 \times 2$ correlation matrix that has $1$ on the diagonal and ${\rm Corr}(x, y)$ as the two off-diagonal elements. So `np.corrcoef(x, y)[0, 1]` can be used to get the correlation which is the $(0, 1)$ element of the matrix.
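For instance, on two small made-up arrays (the values below are illustrative only), the call behaves as described:

```python
import numpy as np

# Toy arrays, not related to the births data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

corr_matrix = np.corrcoef(x, y)   # 2 x 2 matrix with 1s on the diagonal
r = corr_matrix[0, 1]             # the correlation between x and y
```

Working the same calculation by hand for these arrays gives a correlation of $0.8$, which is the value `r` holds up to floating-point error.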

**(a)** Find $\hat{\rho}$, the correlation between the birth weights and mothers' heights in the sample.

In [None]:
# Part (a)
births = pd.read_csv('births.csv')
...

**(b)** Now bootstrap the sample $B = 10{,}000$ times by resampling from the original sample as in Data 8 and 100. Each time, find the correlation of birth weights and mothers' heights in the new sample, and collect all these correlations in an array or other similar form.

In [None]:
# Part (b)
...

**(c) Bootstrap Percentile Method:** As in Data 8, draw the empirical histogram of all the simulated values, and find an approximate 95% confidence interval for $\rho$ by using the appropriate percentiles of the empirical distribution of your estimates.

In [None]:
# Part (c)
...

**(d) Basic or Empirical or Pivotal Bootstrap Method:** Yes, it does have a lot of names. Subtract $\hat{\rho}$ from each of your estimates in Part **(b)** and draw a histogram of these deviations. What value do you notice near the center of the distribution? Provide the appropriate percentiles of these deviations and use them along with $\hat{\rho}$ to construct an approximate 95% confidence interval for $\rho$.
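One common formulation of the basic bootstrap interval is sketched below; it may differ in notation or detail from the version described in seminar. The function `basic_bootstrap_interval` and the evenly spaced array `demo_boot` are hypothetical and are used only so that the arithmetic is easy to follow.

```python
import numpy as np

def basic_bootstrap_interval(estimate, boot_estimates, level=95):
    """Interval built from percentiles of (bootstrap estimate - estimate)."""
    deviations = np.asarray(boot_estimates, dtype=float) - estimate
    tail = (100 - level) / 2
    lower_dev, upper_dev = np.percentile(deviations, [tail, 100 - tail])
    # The upper percentile of the deviations sets the lower endpoint,
    # and the lower percentile sets the upper endpoint.
    return estimate - upper_dev, estimate - lower_dev

# Made-up "bootstrap estimates", evenly spaced to keep the numbers simple
demo_boot = np.linspace(4, 7, 301)
demo_interval = basic_bootstrap_interval(5.0, demo_boot)
```

In this toy case the 2.5th and 97.5th percentiles of the deviations are $-0.925$ and $1.925$, so the interval is $(5 - 1.925, \; 5 + 0.925) = (3.075, \; 5.925)$.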

In [None]:
# Part (d)
...

**(e)** Provide a brief comparison of the intervals in Parts **(c)** and **(d)**.

## 3. Parametric Bootstrap and Confidence Intervals
The parametric bootstrap method assumes $X_1, X_2, \ldots, X_n$ are i.i.d. from a parametric family with density $f(x \mid \theta)$. So the parametric bootstrap doesn't need to resample from the original sample. Instead, it constructs an estimate $\hat{\theta}$ based on the original sample and then creates new samples by drawing repeatedly from $f(x \mid \hat{\theta})$. 

In this exercise you will construct a parametric bootstrap confidence interval for an exponential rate and compare it with the corresponding interval based on the asymptotic distribution of the MLE.

**Useful code:** To simulate `n` draws from the exponential distribution with rate `lam`, use `stats.expon.rvs(size = n, scale = 1/lam)`. 
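A quick check of the `scale = 1/lam` convention, using a made-up rate (the value `2.0`, the sample size, and the `random_state` seed are assumptions for the demonstration):

```python
import numpy as np
from scipy import stats

lam = 2.0   # hypothetical rate, chosen only for illustration
n = 1000

# scale is the mean of the distribution, which is 1/rate
draws = stats.expon.rvs(size=n, scale=1/lam, random_state=0)
sample_mean = draws.mean()   # should land near 1/lam = 0.5
```

Passing `scale = lam` by mistake would simulate from a distribution with mean `lam` instead of mean `1/lam`.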

The array `expon_sample` contains the results of 400 i.i.d. draws from the exponential distribution with unknown rate $\lambda$. For $n = 400$, these are the observed values of $X_1, X_2, \ldots, X_n$ drawn independently from the exponential $(\lambda)$ distribution.

**(a)** Use `expon_sample` to find the observed value of $\hat{\lambda}$, the MLE of $\lambda$. This is the value of some function of your sample. Let's call it $T(X_1, X_2, \ldots, X_n)$.

In [None]:
# Part (a)
expon_sample = np.load('expon_sample.npy')
...

**(b)** Now pretend you don't know any more MLE theory, and construct $B = 10{,}000$ parametric bootstrap estimates of $\lambda$ as follows.

- Do the following $B$ times and collect the results in an array or other similar form:
    - Generate 400 i.i.d. draws from the exponential distribution with the rate you found in Part **(a)**. Call this new sample $X_1^*, X_2^*, \ldots, X_n^*$.
    - Use the new sample and construct a new estimate of $\lambda$ as you did in Part **(a)**. That is, find the value of $T(X_1^*, X_2^*, \ldots, X_n^*)$.
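The same two-step loop can be sketched for a different parametric family. The example below uses the Poisson distribution, whose MLE for the mean is the sample mean, rather than the exponential, and it simulates all $B$ samples at once as the rows of an array. The data, the sample size, the value of `B`, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, B = 200, 5000

# Stand-in for an observed sample (a true mean of 3.0 is assumed here)
original = rng.poisson(lam=3.0, size=n)
mu_hat = original.mean()                 # Poisson MLE: the sample mean

# Each row is one new sample of size n drawn from the fitted distribution
boot_samples = rng.poisson(lam=mu_hat, size=(B, n))
boot_estimates = boot_samples.mean(axis=1)   # T applied to each new sample
```

Note that the new samples are drawn from the fitted distribution, not from `original` itself, which is what distinguishes this from the nonparametric bootstrap.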

In [None]:
# Part (b)
...

**(c)** Draw an empirical histogram of your estimates and check that its shape resembles what you'd expect.

In [None]:
# Part (c)
...

**(d)** Construct an approximate 95% empirical bootstrap confidence interval for $\lambda$ based on your $B$ estimates. See Part **(d)** of Exercise 2.

In [None]:
# Part (d)
...

**(e)** Return to the original sample and use MLE theory to construct an approximate 95% confidence interval for $\lambda$. Do not use your bootstrap estimates here. Instead, refer to Exercise 1 of Worksheet 2.

In [None]:
# Part (e)
...

**(f)** Provide a brief comparison of the intervals in Parts **(d)** and **(e)**. 