# Week 2 stats exercises

## Problem 1.

Let $X_1,...,X_n \sim \text{Bernoulli}(p)$

### 1.1

Find the plug-in estimator for $p$.

The plug-in estimator for $p$ is the sample mean, which is: $\hat{p} = \frac{1}{n}\sum_{i=1}^n X_i = \bar{X}$


### 1.2.

Find the corresponding estimated standard error for $p$

$
SE(\hat{p}) = \sqrt{Var(\hat{p})}
$

$Var(\hat{p}) = Var(\frac{1}{n}\sum_{i=1}^n X_i)$

From variance rules: $Var(aX) = a^2 \cdot Var(X)$

$Var(\hat{p}) =\frac{1}{n^2} Var(\sum_{i=1}^n X_i)$

Also sum of independent variances is: $Var(X+Y)=Var(X) + Var(Y)$

$Var(\hat{p}) =\frac{1}{n^2} \sum_{i=1}^n Var(X_i)$

Since $Var(X_i) = Var(X_j)$ for all $i,j \in \{0, ..., n\} $

$Var(\hat{p}) =\frac{1}{n^2} n \cdot Var(X_i)$

$Var(\hat{p}) =\frac{Var(X_i)}{n}$

Using estimator $\hat{p}$ for $Var(X_i)$ and Bernoulli variance rule: $Var[X] = p(1-p)$

$Var(\hat{p}) =\frac{p(1 - p )}{n}$

$SE(\hat{p}) = \sqrt{Var(p)}$

$SE(\hat{p}) = \sqrt{\frac{p(1 - p )}{n}}$

Writing the $SE$ as sum:

$Var(\hat{p}) =\frac{1}{n^2} \sum_{i=1}^n p(1-p)$

$SE(\hat{p}) = \sqrt{Var(\hat{p})}$

$SE(\hat{p}) = \sqrt{\frac{1}{n^2} \sum_{i=1}^n p(1-p)}$ 

or alternatively:

$SE(\hat{p}) = \sqrt{\frac{1}{n^2} \sum_{i=1}^n \bar{ X_i}(1-\bar{ X_i})}$


Unfortunately there is quite a few possible formulations here for the correct answer. I couldn't get any of them through the moodle automic check, so here is the list of alternatives again:

$SE(\hat{p}) = \sqrt{\frac{1}{n^2} \sum_{i=1}^n \bar{ X_i}(1-\bar{ X_i})}$

$SE(\hat{p}) = \sqrt{\frac{1}{n^2} \sum_{i=1}^n p(1-p)}$ 

$SE(\hat{p}) = \sqrt{\frac{p(1 - p )}{n}}$


## Problem 2.

(Computer Exercise) In this exercise we will study the correlation between cholesterol levels and presence of a heart disease using bootstrapping.

In [15]:
import pandas as pd
import numpy as np

def calculate_pho(dataframe, column_one, column_two):
    return np.cov(dataframe[column_one], dataframe[column_two])[0][1] / (np.std(dataframe[column_one]) * np.std(dataframe[column_two]))

data = pd.read_csv("week2_heart_disease_data.csv", index_col=0)

df = data[['chol', 'heart-disease']].copy()
df['heart-disease'] = (df['heart-disease'] > 0).astype(int)

pho = calculate_pho(df, 'chol', 'heart-disease')
print(f"Pearson correlation coefficient: {pho:.3f}")

Pearson correlation coefficient: 0.085


Use bootstrapping with 1000 resamples to form a standard error estimate for the correlation.

In [16]:
def get_sample_with_replacements(sample):
    rng = np.random.default_rng()
    size = sample.shape[0]
    ints = rng.integers(0, size, size=size)
    return sample.iloc[ints]

pho_list = np.array([calculate_pho(get_sample_with_replacements(df), 'chol', 'heart-disease') for _ in range(1000)])

se_mean = pho_list.mean()
se_std = pho_list.std(ddof=1)

print(f"Standard error estimate. SE: {se_std:.4f}")

Standard error estimate. SE: 0.0584


Compute a 95% bootstrap confidence interval for the correlation using the Normal Interval, Percentile Interval and Pivotal Interval.

In [17]:
from scipy.stats import norm

# Normal interval

alpha = 0.05
z = norm.ppf(1 - alpha/2)
ci_normal = (pho_list.mean() - z * pho_list.std(ddof=1), pho_list.mean() + z * pho_list.std(ddof=1))
print(f"Normal confidence interval:     ({ci_normal[0]:.4f}, {ci_normal[1]:.4f})")

# Percentile 

lower = np.percentile(pho_list, 2.5)
upper = np.percentile(pho_list, 97.5)
print(f"Percentile confidence interval: ({lower:.4f}, {upper:.4f})")

# Pivotal

lower_q = np.percentile(pho_list, 100 * (1 - alpha/2))
upper_q = np.percentile(pho_list, 100 * (alpha/2))

ci_pivotal = (2 * se_mean - lower_q, 2 * se_mean - upper_q)
print(f"Pivotal confidence interval:    ({ci_pivotal[0]:.4f}, {ci_pivotal[1]:.4f})")


Normal confidence interval:     (-0.0271, 0.2019)
Percentile confidence interval: (-0.0277, 0.2004)
Pivotal confidence interval:    (-0.0255, 0.2025)


# Problem 3.

Suppose we have a computer program consisting of 100 pages of code. Let $X_i$ be the number of errors on the $i^th$ page of code. Suppose that the $X_i$ s are Poisson with mean 1 and that they are independent. 


Let $Y=\sum_{i=1}^n X_i$ be the total number of errors. Use the central limit theorem to approximate $P(Y < 90)$.

The mean for the total number of errors is the sum of $X_i$ means which is $100 \cdot 1 = 100$

Since the random variables are independent, the variance of the sum of all random variables is the sum of all individual variances:

$Var(Y)=\sum_{i=1}^{100} Var(X_i) = \sum_{i=1}^{100} 1 = 100$

So $Y$ follows a normal distribution with mean $\mu=100$ and standard deviation of $\sigma=\sqrt{Var(Y)} = \sqrt{100} = 10$

Convert Y to normal variable $Z$

$Z=\frac{Y - \mu}{\sigma} = \frac{90 - 100 }{10} = -1$

So $P(Y < 90) = P(Z<-1) = 0.1587$

In [18]:
import scipy
print(f"P(Y<90) = {scipy.stats.norm.cdf(-1)*100:.2f}%")

P(Y<90) = 15.87%


# Problem 4.

Let $X_1, ... ,X_n \sim Poisson(\lambda)$ and define an estimator $\hat \lambda = \frac{1}{n}\sum_{i=1}^n X_i $
 
### 4.1. Find the bias

Let $Y=\sum_{i=1}^n X_i$

Sum of random poisson variables is a new poisson variable where the new $\lambda$ is the sum of previous $\lambda$ s.

Therefore: $Y=Poisson(\sum_{i=1}^n \lambda)$

The expected value of $E[Y] = n \cdot \lambda$.

Expected average value $\frac{E[Y]}{n} = \lambda$

$Bias(\hat \lambda) = E[\hat \lambda] - \frac{E[Y]}{n} = \lambda - \lambda = 0$


### 4.2. Find the standard error (se)

Standard error of estimator is $SE(\hat \theta) = \sqrt{Var(\hat \theta)}$

$Var(\hat \lambda) = Var(\frac{1}{n} \sum_{i=1}^n X_i)$

$Var(\hat \lambda) = (\frac{1}{n})^2 \cdot Var( \sum_{i=1}^n X_i)$

$Var(\hat \lambda) = \frac{1}{n^2} \cdot \sum_{i=1}^n \lambda$

$Var(\hat \lambda) = \frac{1}{n^2} \cdot n \cdot \lambda$

$Var(\hat \lambda) = \frac{\lambda}{n} $

$SE(\hat \lambda) = \sqrt{Var(\hat \theta)}$

$SE(\hat \lambda) = \sqrt{\frac{\lambda}{n} }$


### 4.3. Find the mean squared error (MSE) of this estimator.

MSE definition: $MSE(\hat \theta) = E[(\hat \theta - \theta)^2] = Var(\hat \theta) + (Bias(\hat \theta))^2$

$Var(\hat \lambda) + (Bias(\hat \lambda))^2 = Var(\hat \lambda) + 0^2$

$Var(\hat \lambda) + (Bias(\hat \lambda))^2= \frac{\lambda}{n} $

$MSE(\hat \lambda) = \frac{\lambda}{n}$

# Problem 5.

1. Estimate the probability of $Binom(10, 1/2)=5$ based on $n=100$ samples from the distribution.

In [19]:
from random import randint

def simulate():
    return sum([randint(0, 1) for _ in range(10)])

def get_estimate():
    samples = [simulate() for _ in range(100)]
    return sum([1 for i in samples if i == 5]) / len(samples)

estimate = get_estimate()
print(f"Estimate of Binom(10, 1/2)=5 is {estimate}")

Estimate of Binom(10, 1/2)=5 is 0.27


Repeat the estimation $m=1000$ times, and compute fraction of repeats where the estimate lies in $[0.16, 0.33]$

In [20]:
m = 1000
many_estimates = [get_estimate() for _ in range(m)]
within = lambda x: (x >= 0.16 and x <= 0.33)

within_estimates = [estimate for estimate in many_estimates if within(estimate)]
proportion_of_within = len(within_estimates) / m
print(f"Proportion of estimates within [0.16, 0.33] = {proportion_of_within}")

Proportion of estimates within [0.16, 0.33] = 0.967


Derive an analytical distribution for the estimator. Denote the corresponding random variable with $Y$. Compute the probability that $PR(0.16 \leq Y \leq 0.33)$

Let $X=Binom(10, 1/2)$

$P(X=5) = \binom{10}{5}\cdot \frac{1}{2}^{10} $



In [21]:
from scipy.special import comb
print(f"10 choose 5: {int(comb(10,5))}")
print(f"2^10:        {2**10}")

10 choose 5: 252
2^10:        1024


$P(X=5) = 252 \cdot \frac{1}{1024} $

$P(X=5) = \frac{252}{1024} $

$P(X=5) = \frac{63}{256}$

If we are repeating the "test" m times, we can write the probablity of seeing $P(X=5)$ $m$ times as:

$\hat X=Binom(100, \frac{63}{256})$

$Y = \frac{\hat X}{100} $

$Pr(0.16 \leq Y \leq 0.33) = 1 - (Pr(Y < 0.16) + Pr(Y > 0.33))$

$Pr(Y < 0.16) = Pr(\hat X < 16)$

$Pr(Y > 0.33) = 1 - Pr(Y <= 0.33) $

$Pr(Y > 0.33) = 1 - Pr(\hat X <= 34) $

$Pr(\hat X < n) = \sum_{i=0}^{n-1} \binom{100}{i}\frac{63}{256}^i\cdot \frac{256-63}{256}^{100 -i}$

$Pr(\hat X \leq n) = \sum_{i=0}^{n} \binom{100}{i}\frac{63}{256}^i\cdot \frac{256-63}{256}^{100 -i}$

In [23]:
def calculate_X_hat_smaller_than_n(n, p, m):
    pr = 0
    for i in range(n):
        pr += comb(m, i) * (p ** i) * ((1 - p)**(m-i))
    return pr

def calculate_X_hat_smaller_or_equal_than_n(n, p, m):
    pr = 0
    for i in range(n + 1):
        pr += comb(m, i) * (p ** i) * ((1 - p)**(m-i))
    return pr

y_more_than_33 = 1 - calculate_X_hat_smaller_or_equal_than_n(33, 63/256, 100)
y_less_than_16 = calculate_X_hat_smaller_than_n(16, 63/256, 100)

y_between_values = 1 - (y_less_than_16 + y_more_than_33)
print(f"Pr(0.16 <= Y <= 0.33) = {y_between_values:.3f}")


Pr(0.16 <= Y <= 0.33) = 0.964


$Pr(0.16 \leq Y \leq 0.33) \approx 0,964$