University of Helsinki, Master's Programme in Mathematics and Statistics  
MAST32001 Computational Statistics, Autumn 2022  
Luigi Acerbi  

# Week 2 exercises

## 1. Permutation testing (6 pts)

We will use permutation testing to study if the mother's age (`age`) affects the birth weight (`bwt`) of their babies. We will use the absolute difference in the means as the test statistic. We will focus the analysis on full term pregnancies (`gestation >= 273`).

*Note*: When $b$ of the $m$ permuted test statistics are at least as extreme as the observed one, report the $p$-value as $p = (b+1)/(m+1)$ to avoid zero p-values. 50000 permutations will be sufficient for obtaining the required accuracy.
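The formula in the note can be sketched as a small helper. The function name `permutation_p_value` and the toy numbers are illustrative assumptions, not part of the exercise; "more extreme" is taken here to mean "at least as large as the observed statistic".

```python
import numpy as np

def permutation_p_value(perm_stats: np.ndarray, observed: float) -> float:
    """Return (b + 1) / (m + 1), where b counts the permuted statistics
    that are at least as extreme (here: at least as large) as the observed one."""
    b = int(np.sum(perm_stats >= observed))
    m = len(perm_stats)
    return (b + 1) / (m + 1)

# Toy example: 2 of the 4 permuted statistics are >= 2.0
toy_stats = np.array([0.5, 1.0, 2.0, 3.0])
print(permutation_p_value(toy_stats, 2.0))  # (2 + 1) / (4 + 1) = 0.6
```

Note that the comparison is `>=`: a small $p$-value means that few permutations produced a difference as large as the observed one.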

1. Load the data set below. Test whether the birth weights (`bwt`) of babies with young (`age < 26`) and older (`age >= 26`) mothers are statistically significantly different using the absolute difference of the means as the test statistic. Report the $p$-value you obtain in Moodle.
2. Stratify the analysis by the smoking status of the mothers by splitting the data into separate smoker (`smoke = 1`) and non-smoker (`smoke = 0`) groups. Constrain the permutations so that only changes within each group are allowed. After the permutation, merge the two groups back together to compute the means. Report the $p$-value you obtain in Moodle.

In [None]:
import pandas as pd
import numpy as np
import numpy.random as npr
from typing import Callable
from scipy.optimize import minimize, least_squares
rng = npr.default_rng(42)

#### Question 1: Birth weight non-stratified

In [None]:

# Load the data set
babies_full = pd.read_csv("https://raw.githubusercontent.com/lacerbi/compstats-files/main/data/babies2.txt", sep='\t')

# Pick a subset
babies = babies_full.query('gestation >= 273')


In [None]:
def shuffle(x1: np.ndarray, x2: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return a random reshuffling of elements in two arrays"""
    n1 = len(x1)
    # Use the seeded generator `rng` so that the results are reproducible
    z = rng.permutation(np.concatenate((x1, x2)))
    return z[0:n1], z[n1:]

In [None]:

young_mothers = babies.query('age < 26').copy()
older_mothers = babies.query('age >= 26').copy()

n_perm = 50000
perm_diff = np.zeros(n_perm)
true_diff = np.abs(young_mothers['bwt'].mean() - older_mothers['bwt'].mean())

for i in range(n_perm):
    one, two = shuffle(young_mothers['bwt'].values, older_mothers['bwt'].values)
    perm_diff[i] = np.abs(one.mean() - two.mean())

# Count the permutations whose difference is at least as extreme as the observed one
print(f"P-value: {(np.sum(perm_diff >= true_diff)+1) / (n_perm+1)}")

#### Question 2: Birth weight stratified by smoking

In [None]:
young_smokers = young_mothers.query('smoke == 1').copy()
young_non_smokers = young_mothers.query('smoke == 0').copy()
older_smokers = older_mothers.query('smoke == 1').copy()
older_non_smokers = older_mothers.query('smoke == 0').copy()

n_perm = 50000
true_diff = np.abs(young_mothers['bwt'].mean() - older_mothers['bwt'].mean())
perm_diff = np.zeros(n_perm)

for p in range(n_perm):
    yns, ons = shuffle(young_non_smokers['bwt'].values, older_non_smokers['bwt'].values)
    ys, os = shuffle(young_smokers['bwt'].values, older_smokers['bwt'].values)
    younger = np.concatenate((yns, ys), axis=0)
    older = np.concatenate((ons, os), axis=0)
    perm_diff[p] = np.abs(np.mean(younger) - np.mean(older))

# Count the permutations whose difference is at least as extreme as the observed one
print(f"P-value: {(np.sum(perm_diff >= true_diff)+1) / (n_perm+1)}")

## 2. Bootstrap confidence intervals on data statistics (4 pts)

In this exercise we use bootstrap to estimate confidence intervals for various quantities. (Using 1000 bootstrap samples will give you enough accuracy assuming everything is correctly done.)

1. Use bootstrap to estimate the central 95% confidence interval for the mean of `bwt` in the *full* data set loaded in Problem 1. Report the lower and upper ends of the interval in Moodle.
2. Use bootstrap to estimate the central 95% confidence interval for the mean of `bwt` in the smaller subset (`gestation >= 273`) of the data set used in Problem 1. Report the lower and upper ends of the interval in Moodle.
3. Use bootstrap to estimate the central 95% confidence interval for the correlation coefficient of `gestation` and `age` in the full data set loaded in Problem 1. What does this tell about the relation between the duration of the pregnancy and the age of the mother? Report the bounds of the interval in Moodle.
4. Use bootstrap to estimate the central 95% confidence interval for the correlation coefficient of `gestation` and `age` in the smaller subset (`gestation >= 273`) used in Problem 1. What does this tell about the relation between the duration of the pregnancy and the age of the mother? Report the bounds of the interval in Moodle.

*Hint*: Remember that the size of the bootstrap sample is always the same as the size of the original data set.
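As a minimal sketch of the hint, the helper below draws bootstrap samples of the same size as the data and returns a percentile interval. The name `bootstrap_percentile_ci`, its parameters, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def bootstrap_percentile_ci(x, statistic, n_bootstraps=1000, level=0.95, seed=0):
    """Central percentile confidence interval for statistic(x)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    stats = np.empty(n_bootstraps)
    for b in range(n_bootstraps):
        # Each bootstrap sample has the same size n as the original data
        idx = rng.integers(0, n, size=n)
        stats[b] = statistic(x[idx])
    tail = (1 - level) / 2
    lower, upper = np.quantile(stats, [tail, 1 - tail])
    return lower, upper

# Synthetic data standing in for a real column
toy = np.random.default_rng(1).normal(loc=10.0, scale=2.0, size=500)
lo, hi = bootstrap_percentile_ci(toy, np.mean)
print(lo, hi)
```

Resampling indices rather than values makes it straightforward to extend the same loop to statistics of several columns, since one index array can be applied to all of them.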

In [None]:
def do_bootstrap_mean(x: np.ndarray, n_bootstraps: int = 1000) -> np.ndarray:
    return np.mean(
        rng.choice(x, size=(len(x), n_bootstraps), replace=True), axis=0
    )

In [None]:
def do_bootstrap_corrcoeff(x: np.ndarray, y: np.ndarray, n_bootstraps: int = 1000) -> np.ndarray:
    """Bootstrap the correlation coefficient of paired observations (x_i, y_i)"""
    n = len(x)
    bootstrap_results = np.zeros(n_bootstraps)
    for i in range(n_bootstraps):
        # Resample individuals: the same indices are used for x and y so that
        # each pair stays together (independent resampling would destroy the correlation)
        idx = rng.integers(0, n, size=n)
        bootstrap_results[i] = np.corrcoef(x[idx], y[idx])[0, 1]
    return bootstrap_results

#### Question 1: Bootstrap of the whole dataset birth weight mean

In [None]:
birth_weights = babies_full['bwt'].copy().values
bootstrapped_means = do_bootstrap_mean(birth_weights)
print(f"N bootstraps: {bootstrapped_means.shape}")
print(f"Mean: {bootstrapped_means.mean():.2f}")
print(f"Confidence interval: {np.quantile(bootstrapped_means, [0.025, 0.975]).round(2)}")


#### Question 2: Bootstrap of `gestation >= 273` birth weight mean

In [None]:
subset = babies_full.query('gestation >= 273').copy()['bwt'].values
bootstrapped_means_subset = do_bootstrap_mean(subset)
print(f"Mean: {bootstrapped_means_subset.mean():.2f}")
print(f"Confidence interval: {np.quantile(bootstrapped_means_subset, [0.025, 0.975]).round(2)}")

#### Question 3: Bootstrap of the correlation coefficient between `gestation` and `age`.

In [None]:
bootstrapped_coeffs = do_bootstrap_corrcoeff(babies_full['age'].values, babies_full['gestation'].values)
print(f"N bootstraps: {bootstrapped_coeffs.shape}")
print(f"Mean: {bootstrapped_coeffs.mean():.2f}")
print(f"Confidence interval: {np.quantile(bootstrapped_coeffs, [0.025, 0.975]).round(2)}")

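Interpretation: if the interval contains zero, the data are compatible with no linear relationship between the age of the mother and the duration of the pregnancy in the whole dataset; an interval lying entirely on one side of zero would instead point to a (possibly weak) linear association of that sign.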

#### Question 4: Bootstrap of the correlation coefficient between `gestation` and `age` for full term pregnancies.

In [None]:
bootstrapped_coeffs = do_bootstrap_corrcoeff(babies_full.query('gestation >= 273')['age'].values, babies_full.query('gestation >= 273')['gestation'].values)
print(f"N bootstraps: {bootstrapped_coeffs.shape}")
print(f"Mean: {bootstrapped_coeffs.mean():.2f}")
print(f"Confidence interval: {np.quantile(bootstrapped_coeffs, [0.025, 0.975]).round(2)}")

## 3. Bootstrap confidence intervals on parameter estimates (4 pts)

In this task, we will use bootstrap to obtain confidence intervals on maximum likelihood parameter estimates for linear regression models. We will apply simple case resampling, i.e. resampling the individuals and then fitting the model using the data $(x_i, y_i)$ from these individuals. There are alternative methods that may work better when the data are limited, but in our case there are enough observations so that this will not be a problem. 1000 bootstrap samples will again give you enough accuracy.
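Case resampling can be sketched as follows: one set of indices is drawn with replacement and applied to both variables, so every resampled individual keeps its own $(x_i, y_i)$ pair. The helper name `case_resample` and the toy data are assumptions for illustration.

```python
import numpy as np

def case_resample(x, y, rng):
    """Draw n individuals with replacement and return their (x, y) pairs."""
    idx = rng.integers(0, len(x), size=len(x))
    return x[idx], y[idx]

rng_demo = np.random.default_rng(0)
x_demo = np.arange(10.0)
y_demo = 2.0 * x_demo  # y is an exact function of x in this toy data

xb, yb = case_resample(x_demo, y_demo, rng_demo)
# The relation y = 2x survives resampling because pairs are kept intact
print(np.allclose(yb, 2.0 * xb))  # True
```

Resampling `x` and `y` with two independent draws would break the pairing and push every bootstrap estimate of $\beta$ towards zero.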

A linear regression fit to scalar $x_i, y_i$ involves fitting the model
$$ y_i = \alpha + \beta x_i + \epsilon_i, $$
where $\beta$ is the regression coefficient and $\alpha$ is the intercept. Assuming $\epsilon_i \sim N(0, \sigma^2)$, the log-likelihood of the model is
$$ \log p(Y | X, \alpha, \beta) = \sum_{i=1}^n \log p(y_i | x_i, \alpha, \beta)
  = \sum_{i=1}^n - \frac{1}{2 \sigma^2} (y_i - \alpha - \beta x_i)^2 + C, $$
where $C$ is independent of $\alpha, \beta$. This is maximised when
$$ \hat{\beta}= \frac{\sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y}) }{ \sum_{i = 1}^n (x_i - \bar{x})^2} \\
   \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x},$$
where $\bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i$ and $\bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i$.
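The closed-form estimates above can be checked on synthetic data against `np.polyfit`. This is only a sketch: the function name `fit_simple_linear_regression` and the simulated data are assumptions, not part of the exercise.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Maximum likelihood estimates (alpha_hat, beta_hat) for y = alpha + beta * x + noise."""
    xbar, ybar = x.mean(), y.mean()
    beta_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    alpha_hat = ybar - beta_hat * xbar
    return alpha_hat, beta_hat

# Simulated data with known intercept 1.5 and slope 0.7
rng_demo = np.random.default_rng(0)
x_demo = rng_demo.uniform(0, 10, size=200)
y_demo = 1.5 + 0.7 * x_demo + rng_demo.normal(scale=0.5, size=200)

alpha_hat, beta_hat = fit_simple_linear_regression(x_demo, y_demo)
# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
slope_ref, intercept_ref = np.polyfit(x_demo, y_demo, deg=1)
print(alpha_hat, beta_hat)
```

The estimates do not depend on $\sigma$, which is why the bootstrap only needs to recompute $\hat{\beta}$ for each resampled data set.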

1. Implement the above linear regression model to predict `gestation` ($y$) as a function of `age` ($x$) in the full data set. Report the estimated $\beta$ in Moodle.
2. Use bootstrap to estimate the confidence interval of the regression coefficient $\beta$ in the above model by resampling the individuals used to fit the model. Report the lower and upper bounds of the central 95% confidence interval of $\beta$ in Moodle.


In [None]:


# Load the data set
babies_full = pd.read_csv("https://raw.githubusercontent.com/lacerbi/compstats-files/main/data/babies2.txt", sep='\t')

babies3 = babies_full

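def compute_c(sigma: float, size: int = 1) -> np.ndarray:
    """Draw Gaussian noise terms with standard deviation sigma (not needed for the estimates below)"""
    return rng.normal(loc=0, scale=sigma, size=size)

def compute_beta(x: np.ndarray, xbar: float, y: np.ndarray, ybar: float) -> float:
    """Closed-form maximum likelihood estimate of the regression coefficient"""
    return np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)

def compute_alpha(xbar: float, ybar: float, beta: float) -> float:
    """Closed-form maximum likelihood estimate of the intercept"""
    return ybar - beta * xbar

def logprob(x: np.ndarray, y: np.ndarray, alpha: float, beta: float, sigma: float, c: float) -> float:
    """Log-likelihood of the model; the constant c is added once, outside the sum"""
    return np.sum(
        -(1/(2*sigma**2)) * (y - alpha - beta * x)**2
    ) + c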

bootstrap_iterations = 1000

print(f"Non-bootstrap estimate: {compute_beta(babies3['age'].values, babies3['age'].mean(), babies3['gestation'].values, babies3['gestation'].mean())}")

x_all = babies3['age'].values
y_all = babies3['gestation'].values
n_samples = len(y_all)
beta_observations = np.zeros(bootstrap_iterations)
for i in range(bootstrap_iterations):
    # Case resampling: draw individuals with replacement and keep each (x, y) pair together
    idx = rng.integers(0, n_samples, size=n_samples)
    x, y = x_all[idx], y_all[idx]
    beta_observations[i] = compute_beta(x, x.mean(), y, y.mean())

print(f"Bootstrap estimate: {beta_observations.mean()}")
print(f"Confidence interval: {np.quantile(beta_observations, [0.025, 0.975]).round(2)}")



## 4. Density estimation (6 pts)

1. Estimate the joint density of `bwt` and `age` in the full data set using kernel density estimation with a 2-dimensional Gaussian kernel
$$ K(\mathbf{x}) = \frac{1}{2\pi} \exp\left( - \frac{\|\mathbf{x}\|^2}{2} \right)
 = \frac{1}{2\pi} \exp\left( - \frac{x_1^2 + x_2^2}{2} \right) $$
using bandwidth $h=5$. Report the value of the estimated density at point `bwt=110`, `age=31` in Moodle.

*Hint*: you can verify your results by plotting a 2D histogram of the data (`matplotlib.pyplot.hist2d`) and a contour plot of the estimated density (see e.g. Sec. 5.1.1 in Course notes for a contour plot example).

2. With the Gaussian kernel above, use leave-one-out (LOO) cross validation to find the optimal $h$ in the range `np.linspace(1.0, 5.0, 50)`. The optimal $h$ maximizes the LOO log-likelihood. Report the value of $h$ and the value of the estimated density at `bwt=110`, `age=31` in Moodle.

3. Use $k$-fold cross validation with $k=17$ to find the optimal $h$ in the range `np.linspace(1.0, 5.0, 50)`. Report the value of $h$ and the value of the estimated density at `bwt=110`, `age=31` in Moodle. For this exercise, the sample point indices for the $k$ partitions of the data should consist of consecutive indices, e.g. the first partition should be the data with indices `0:69` and so on. (In practical applications it is generally good practice to randomly permute the indices as well when creating partitions, but don't do it for this exercise.)
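The $k$-fold procedure in item 3 can be sketched as follows, using the usual estimator $\hat{f}_h(\mathbf{x}) = \frac{1}{n h^d} \sum_{i=1}^n K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)$ with $d = 2$. All function names and the synthetic data are assumptions for illustration. `np.array_split` is used to create folds of consecutive indices; when $n$ is not divisible by $k$ it makes the first folds one element longer, which may differ from the partition intended by the exercise.

```python
import numpy as np

def gauss_kernel(u):
    """Standard d-dimensional Gaussian kernel evaluated at the rows of u, shape (n, d)."""
    d = u.shape[1]
    return np.exp(-np.sum(u**2, axis=1) / 2) / (2 * np.pi) ** (d / 2)

def kde_at(points, data, h):
    """KDE built from `data` with bandwidth h, evaluated at each row of `points`."""
    n, d = data.shape
    return np.array([gauss_kernel((data - p) / h).sum() for p in points]) / (n * h**d)

def kfold_log_likelihood(data, h, k):
    """Sum of held-out log densities over k folds of consecutive indices."""
    folds = np.array_split(np.arange(len(data)), k)
    total = 0.0
    for test_idx in folds:
        train = np.delete(data, test_idx, axis=0)  # fit on everything except the fold
        total += np.sum(np.log(kde_at(data[test_idx], train, h)))
    return total

# Synthetic stand-in for the (age, bwt) columns, not the real data
rng_demo = np.random.default_rng(0)
toy = rng_demo.normal(loc=[27.0, 120.0], scale=[6.0, 18.0], size=(170, 2))

h_grid = np.linspace(1.0, 5.0, 50)
scores = np.array([kfold_log_likelihood(toy, h, k=17) for h in h_grid])
h_best = h_grid[np.argmax(scores)]
print(h_best)
```

Leave-one-out cross validation is the special case `k = len(data)`, so the same helper can be reused for item 2 at a higher computational cost.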

In [None]:
def k_ndgauss(x):
    """Gaussian kernel
    Input: x (np.array, shape (n, d)) for n points with d dimensions"""
    d = x.shape[1]
    return np.exp(-np.sum(x**2, 1)/2) / np.sqrt(2*np.pi)**d

def kde_(x, h):
    """Kernel density estimation
    Input: x (np.array, shape (n, d)) for n points with d dimensions
           h (float) bandwidth"""
    y = np.zeros(x.shape[0])
    n = x.shape[0]
    for i in range(n):
        xx = x - x[i]
        res = (k_ndgauss(xx/h)) / (n*h**x.shape[1])
        y[i] = res.sum()
    return y

# Evaluate the estimate directly at the query point; the columns are ordered (age, bwt)
data = babies3[['age', 'bwt']].values
h = 5
print(np.sum(k_ndgauss((data - np.array([31, 110])) / h)) / (len(data) * h**2))

In [None]:
babies3[['age', 'bwt']].values

In [None]:
np.delete(babies3[['age', 'bwt']].values, i, axis=0)

In [None]:
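data = babies3[['age', 'bwt']].values.astype(float)
n_rows, d = data.shape
h_candidates = np.linspace(1, 5, 50)
log_probs = np.zeros(h_candidates.shape[0])
# Squared distances between all pairs of points, computed once
sq_dists = np.sum((data[:, None, :] - data[None, :, :])**2, axis=2)
for k, _h in enumerate(h_candidates):
    kernel = np.exp(-sq_dists / (2 * _h**2)) / (2 * np.pi)**(d / 2)
    np.fill_diagonal(kernel, 0.0)  # leave each point out of its own density estimate
    loo_density = kernel.sum(axis=1) / ((n_rows - 1) * _h**d)
    # LOO log-likelihood: sum of the log densities at the left-out points
    log_probs[k] = np.sum(np.log(loo_density))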


In [None]:
log_probs

In [None]:
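h_opt = h_candidates[np.argmax(log_probs)]
h_opt

In [None]:
# Density at (age=31, bwt=110) with the LOO-optimal bandwidth
data = babies3[['age', 'bwt']].values
print(np.sum(k_ndgauss((data - np.array([31, 110])) / h_opt)) / (len(data) * h_opt**2))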
