## Exam 2025 Solutions
**Robert Mackevič** 2325045

In [53]:
from math import comb
import numpy as np

# Individualized parameters
N = 6
S = 8
I1 = 5
I2 = 4

### Problem 1

$N = 6 + 8 = 14$

$q_1 = \frac{6}{6 + 8} = \frac{6}{14} \approx 0.4286$

$q_2 = \frac{1 + 5 + 6}{5 + 6 + 8} = \frac{12}{19} \approx 0.6316$

$j = 6$

#### Task (a)

- $P_j$ depends on the sample only through $U_{(j)} = F_Y(Y_{(j)})$, the $j$-th order statistic after the probability integral transform.  
- For continuous $F_Y$, the transformed values $F_Y(Y_i)$ are i.i.d. uniform on $[0, 1]$, so $U_{(j)}$ is distributed as the $j$-th order statistic of a uniform sample, whatever $F_Y$ is.  
- Thus, $P_j$ is determined by the uniform order-statistic distribution alone and does not depend on $F_Y$, making it distribution-free.

**Assumptions:**
- $F_Y$ is continuous to avoid ties.
- The sample is i.i.d. (independent and identically distributed).
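
The distribution-free claim can be checked by simulation. The sketch below assumes $P_j = P\{q_1 < F_Y(Y_{(j)}) \leq q_2\}$, which is the reading used in Task (b); the helper names `uniform_order_stat_cdf` and `simulate_p_j` and the two test distributions (exponential and logistic) are illustrative choices, not part of the exam.

```python
from math import comb

import numpy as np


def uniform_order_stat_cdf(u, j, n):
    """CDF at u of the j-th order statistic of n i.i.d. Uniform(0, 1) draws."""
    return sum(comb(n, k) * u**k * (1 - u) ** (n - k) for k in range(j, n + 1))


def simulate_p_j(sampler, cdf, j, n, q1, q2, reps, rng):
    """Monte Carlo estimate of P(q1 < F_Y(Y_(j)) <= q2) for a given F_Y."""
    samples = np.sort(sampler(rng, (reps, n)), axis=1)
    u_j = cdf(samples[:, j - 1])  # F_Y applied to the j-th smallest value
    return np.mean((u_j > q1) & (u_j <= q2))


rng = np.random.default_rng(0)
n, j = 14, 6
q1, q2 = 6 / 14, 12 / 19

exact = uniform_order_stat_cdf(q2, j, n) - uniform_order_stat_cdf(q1, j, n)
p_exp = simulate_p_j(lambda r, size: r.exponential(size=size),
                     lambda y: 1 - np.exp(-y), j, n, q1, q2, 100_000, rng)
p_logistic = simulate_p_j(lambda r, size: r.logistic(size=size),
                          lambda y: 1 / (1 + np.exp(-y)), j, n, q1, q2, 100_000, rng)

print(f"exact (uniform formula): {exact:.4f}")
print(f"exponential sample     : {p_exp:.4f}")
print(f"logistic sample        : {p_logistic:.4f}")
```

Both simulated values should agree with the uniform formula up to Monte Carlo error, even though the two underlying distributions differ.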


---
#### Task (b)
As in (a), $U_{(j)} = F_Y(Y_{(j)})$ behaves like the $j$-th order statistic of $N$ uniform variables, so its CDF (written $F_{Y_{(j)}}$ here) is:

$$
F_{Y_{(j)}}(u) = \sum_{k=j}^{N} \binom{N}{k} u^k (1 - u)^{N - k}
$$

And the probability is:
$$
P_j = F_{Y_{(j)}}(q_2) - F_{Y_{(j)}}(q_1)
$$

Now we can calculate this with code:

In [54]:
q1 = N / (N + S)
q2 = (1 + I1 + N) / (I1 + N + S)
j = N
sample_size = N + S

# CDF of the j-th order statistic of n i.i.d. Uniform(0, 1) variables
def FY_cdf(u, j, n):
    return sum(comb(n, k) * (u**k) * ((1-u)**(n-k)) for k in range(j, n+1))

# Compute P_j for j=6
F_q1 = FY_cdf(q1, j, sample_size)
F_q2 = FY_cdf(q2, j, sample_size)
P_j = F_q2 - F_q1

print(f"P_{j} = {P_j:.6f}")

P_6 = 0.364679


The result is $P_6 = 0.364679$

---

### Problem 2

$N = 9 + \text{min}(6, 8) = 9 + 6 = 15$

$T = \{21, 35, 26, 16, 21, 17, 44, 60, 16\}$

#### Task (a)


$$
\eta_0 = P\{T > t_1 \mid T > t_0\} =
\frac{P\{T > t_1 \cap T > t_0\}}{P\{T > t_0\}} =
\frac{P\{T > t_1\}}{P\{T > t_0\}} =
\frac{1 - G_T(t_1)}{1 - G_T(t_0)}
$$

$$
\eta_1 := E[T - t_1 \mid T > t_1] =
\frac{\int_{t_1}^\infty (T - t_1) f_T(T) \, dT}{P\{T > t_1\}} = 
\frac{\int_{t_1}^\infty (T - t_1) f_T(T) \, dT}{1 - G_T(t_1)}
$$

Simplify the numerator:

$$
\int_{t_1}^\infty (T - t_1) f_T(T) \, dT = \int_{t_1}^\infty T f_T(T) \, dT - t_1 \int_{t_1}^\infty f_T(T) \, dT
$$

From the properties of PDFs:

$$
\int_{t_1}^\infty f_T(T) \, dT = P\{T > t_1\} = 1 - G_T(t_1)
$$

Thus:

$$
\eta_1 = \frac{\int_{t_1}^\infty T f_T(T) \, dT - t_1 (1 - G_T(t_1))}{1 - G_T(t_1)}
$$
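
An equivalent form that avoids the density can be obtained by integrating the numerator by parts. This is a sketch that assumes $E(T)$ is finite, so that the boundary term $(t - t_1)(1 - G_T(t))$ vanishes as $t \to \infty$:

$$
\int_{t_1}^\infty (t - t_1) f_T(t) \, dt = \int_{t_1}^\infty (1 - G_T(t)) \, dt
\quad \Longrightarrow \quad
\eta_1 = \frac{\int_{t_1}^\infty (1 - G_T(t)) \, dt}{1 - G_T(t_1)}
$$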

---
#### Task (b)

$G_T(t)$ can be estimated empirically as:
$$
\hat{G}_T(t) = \frac{\# \{ T_i \leq t \}}{n}
$$

The PDF $f_T(t)$ is related to $G_T(t)$ as:
$$
f_T(t) = \frac{d}{dt} G_T(t)
$$

Knowing this, we can write the estimates symbolically, since $t_0$ and $t_1$ are not yet given:

$$
\eta_0 = \frac{1 - \hat{G}_T(t_1)}{1 - \hat{G}_T(t_0)}
$$

$$
\eta_1 = \frac{\sum_{T_i > t_1} (T_i - t_1)}{\#\{ T_i > t_1 \}}
$$

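$$
Var(\eta_0) \approx \frac{\eta_0(1 - \eta_0)}{n}
$$

which treats $\eta_0$ as a binomial proportion (appropriate when every observation exceeds $t_0$).

The spread of the excesses $T_i - t_1$ is measured by their sample variance, with $m = \#\{ T_i > t_1 \}$:

$$
\widehat{Var}(T - t_1 \mid T > t_1) = \frac{1}{m - 1} \sum_{T_i > t_1} (T_i - t_1 - \eta_1)^2
$$

This sample variance is what the code reports as `var_eta_1`. Since $\eta_1$ is the mean of the $m$ excesses, the variance of the estimator itself is approximately this quantity divided by $m$.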

----
#### Task (c)
The CDF of the Pareto distribution is:

$G_T(t) = \begin{cases} 
1 - \left(\frac{b}{t}\right)^a & t \geq b \\
0 & t < b
\end{cases}$

The PDF is:

$f_T(t) = \begin{cases} 
\frac{a b^a}{t^{a + 1}} & t \geq b \\
0 & t < b
\end{cases}$


From Problem 2(a):

$$
\eta_0 = P\{T > t_1 \mid T > t_0\} = \frac{P\{T > t_1\}}{P\{T > t_0\}}
$$

Using $P\{T > t\} = 1 - G_T(t) = \left(\frac{b}{t}\right)^a$:

$$
\eta_0 = \frac{\left(\frac{b}{t_1}\right)^a}{\left(\frac{b}{t_0}\right)^a} = \left(\frac{t_0}{t_1}\right)^a
$$

From Problem 2(a):

$$
\eta_1 = E[T - t_1 \mid T > t_1]
$$

The conditional expectation $E[T \mid T > t_1]$ for a Pareto distribution is:

$$
E[T \mid T > t_1] = \frac{a t_1}{a - 1}, \text{for }a > 1
$$

Thus:
$$
\eta_1 = E[T - t_1 \mid T > t_1] = \frac{a t_1}{a - 1} - t_1 = t_1 \left(\frac{a}{a - 1} - 1\right) = t_1 \cdot \frac{a - (a - 1)}{a - 1} = \frac{t_1}{a - 1}
$$
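
The closed forms $\eta_0 = (t_0 / t_1)^a$ and $\eta_1 = t_1 / (a - 1)$ (the latter follows from $\frac{a t_1}{a - 1} - t_1$) can be sanity-checked by simulation. The sketch below uses inverse-transform sampling; the helper `pareto_sample` and the values of `a`, `b`, `t0`, `t1` are illustrative assumptions, not the exam's data or estimates.

```python
import numpy as np


def pareto_sample(a, b, size, rng):
    """Draw from Pareto(a, b) by inverse transform: T = b * U**(-1/a)."""
    u = 1.0 - rng.random(size)  # values in (0, 1], so the power is always finite
    return b * u ** (-1.0 / a)


rng = np.random.default_rng(1)
a, b = 4.0, 1.0      # illustrative shape and scale
t0, t1 = 1.5, 2.0    # illustrative thresholds with b <= t0 < t1

T_sim = pareto_sample(a, b, 1_000_000, rng)

# Monte Carlo versions of the two conditional quantities
eta0_mc = np.mean(T_sim > t1) / np.mean(T_sim > t0)
eta1_mc = np.mean(T_sim[T_sim > t1] - t1)

# Closed forms derived above
eta0_formula = (t0 / t1) ** a
eta1_formula = t1 / (a - 1)

print(f"eta_0: simulated {eta0_mc:.4f}, formula {eta0_formula:.4f}")
print(f"eta_1: simulated {eta1_mc:.4f}, formula {eta1_formula:.4f}")
```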


The variance of $\eta_0$ depends on the variability in $G_T(t_0)$ and $G_T(t_1)$. For large $n$ (sample size):

$$
Var(\eta_0) \propto \frac{\eta_0(1 - \eta_0)}{n}
$$

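The variance associated with $\eta_1$ depends on the second moment of $T$ conditional on $T > t_1$. For $T \sim \text{Pareto}(a, b)$ and $t_1 \geq b$, the conditional distribution of $T$ given $T > t_1$ is again Pareto with shape $a$ and scale $t_1$, so the conditional variance is:

$$
Var(T \mid T > t_1) = \frac{a t_1^2}{(a - 1)^2 (a - 2)}, \text{for } a > 2
$$

Since $T - t_1$ and $T$ differ by a constant, this is also the model-based variance of the excess:

$$
Var(T - t_1 \mid T > t_1) = Var(T \mid T > t_1)
$$

This is the quantity computed as `var_eta_1` for the Pareto model.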

---
#### Task (d)
$t_0 = 12$\
$t_1 = 15 + \min(8, 9) = 23$

Since $a$ and $b$ are not given, we can estimate using MLE for the Pareto distribution:
$$
\hat{b} = \min(T) = 16
$$
$$
\hat{a} = \frac{n}{\sum_{i=1}^{n} \ln\left(\frac{T_i}{\hat{b}}\right)} 
$$

The code below computes the Task (b) statistics:

In [55]:
T = np.array([17 + I2, 35, 26, 11 + I1, 21, 9 + S, 44, 10 * N, 16])
t0 = 12
t1 = 15 + min(S, 9)
sample_size = len(T)

G_T_t0 = len(T[T <= t0]) / sample_size  # Empirical CDF at t0
G_T_t1 = len(T[T <= t1]) / sample_size  # Empirical CDF at t1

# Compute eta_0
eta_0 = (1 - G_T_t1) / (1 - G_T_t0)

# Compute eta_1
T_greater_t1 = T[T > t1]
eta_1 = sum(T_greater_t1 - t1) / len(T_greater_t1) if len(T_greater_t1) > 0 else 0

# Variance of eta_0
var_eta_0 = (eta_0 * (1 - eta_0)) / sample_size

# Sample variance of the excesses T_i - t1
# (divide by len(T_greater_t1) to get the variance of their mean, eta_1)
deviations = [(t - t1 - eta_1)**2 for t in T_greater_t1]
var_eta_1 = sum(deviations) / (len(T_greater_t1) - 1) if len(T_greater_t1) > 1 else 0

print("eta_0", eta_0)
print("eta_1", eta_1)
print("var_eta_0", var_eta_0)
print("var_eta_1", var_eta_1)

eta_0 0.4444444444444444
eta_1 18.25
var_eta_0 0.027434842249657063
var_eta_1 210.25


The code below computes the Task (c) statistics:

In [56]:
T = np.array([17 + I2, 35, 26, 11 + I1, 21, 9 + S, 44, 10 * N, 16])
sample_size = len(T)

# Compute MLE estimates for a and b (shape and scale parameters)
b_hat = np.min(T)
a_hat = sample_size / np.sum(np.log(T / b_hat))

t0 = 12
t1 = 15 + min(S, 9)

# Compute eta_0
eta_0 = (t0 / t1) ** a_hat

# Compute eta_1
if a_hat > 1:
    eta_1 = t1 / (a_hat - 1)
else:
    eta_1 = None  # Not defined for a_hat <= 1

# Variance of eta_0
var_eta_0 = (eta_0 * (1 - eta_0)) / sample_size

# Variance of eta_1
if a_hat > 2:
    var_eta_1 = (a_hat * t1**2) / ((a_hat - 2) * (a_hat - 1)**2)
else:
    var_eta_1 = None  # Not defined for a_hat <= 2

print("eta_0", eta_0)
print("eta_1", eta_1)
print("var_eta_0", var_eta_0)
print("var_eta_1", var_eta_1)

eta_0 0.24855550959684575
eta_1 20.180018175276352
var_eta_0 0.020752852027322008
var_eta_1 6235.619660485078


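Accuracy Comparison:
- For $\eta_0$ the two variance estimates are of similar size (about 0.027 empirical in task b versus about 0.021 under the Pareto model in task c). For $\eta_1$ the Pareto-based variance is far larger than the empirical one, because $\hat{a} \approx 2.14$ lies just above 2, where the Pareto variance diverges.  
- The gap between the two $\eta_1$ values (18.25 empirical versus about 20.18 under the Pareto model) may reflect the heavy Pareto tail, which puts more weight on large values than the nine observations show.  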

---
#### Task (e)

The mean $E(T)$ can be estimated as:
$$
\hat{E}(T) = \bar{T} = \frac{1}{n} \sum_{i=1}^{n} T_i
$$

where $T_i$ are the observed values and $n$ is the sample size.

The variance of the sample mean $\bar{T}$ is given by:

$$
Var(\bar{T}) = \frac{S^2}{n}
$$

where $S^2$ is the sample variance:

$$
S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (T_i - \bar{T})^2
$$

In [57]:
T = np.array([17 + I2, 35, 26, 11 + I1, 21, 9 + S, 44, 10 * N, 16])

# Compute median
median_T = np.median(T)

# Compute mean
mean_T = np.mean(T)

# Compute sample variance
sample_variance = np.var(T, ddof=1)

# Variance of the estimator of the mean
var_mean_T = sample_variance / len(T)

print("median", median_T)
print("mean", mean_T)
print("sample_variance", sample_variance)
print("var_mean", var_mean_T)

median 21.0
mean 28.444444444444443
sample_variance 229.77777777777777
var_mean 25.530864197530864


$Median(T) = 21.0$\
$\hat{E}(T) = \bar{T} \approx 28.44$\
$Var(\hat{E}(T)) = \frac{S^2}{n} \approx 25.53$

---
#### Task (f)
The symmetry of $\log(T)$ suggests that the median of $\log(T)$ can be used to estimate $E(\log(T))$. By transforming back, the exponential of this median serves as a nonparametric estimator for $E(T)$.

$\hat{E}(\log(T)) = Median(\log(T))$\
$\hat{E}(T) = \exp(Median(\log(T)))$


To estimate the variance of $\hat{E}(T)$, the following steps can be proposed:

1. Generate bootstrap samples $T_b^*$ by sampling $T$ with replacement.
2. For each bootstrap sample, compute: $\hat{E}(T_b^*) = \exp(Median(\log(T_b^*)))$.
3. Calculate the variance of the bootstrap estimates:  
   $$
   Var(\hat{E}(T)) = \frac{1}{B - 1} \sum_{b=1}^{B} (\hat{E}(T_b^*) - \bar{E}^*)^2, \qquad \bar{E}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{E}(T_b^*)
   $$
   where $B$ is the number of bootstrap samples.
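
The bootstrap steps above can be sketched as follows. The function names `exp_median_log` and `bootstrap_variance`, the number of replicates (2000), and the seed are assumptions made for illustration; the data are the nine values of $T$ from this problem.

```python
import numpy as np


def exp_median_log(sample):
    """Estimator from this task: exp of the median of log(T)."""
    return float(np.exp(np.median(np.log(sample))))


def bootstrap_variance(sample, estimator, n_boot, rng):
    """Variance of `estimator` over n_boot resamples drawn with replacement."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    idx = rng.integers(0, n, size=(n_boot, n))  # resampling indices
    estimates = np.array([estimator(sample[row]) for row in idx])
    return estimates.var(ddof=1), estimates


T_obs = np.array([21, 35, 26, 16, 21, 17, 44, 60, 16])
rng = np.random.default_rng(2025)

point_estimate = exp_median_log(T_obs)
var_boot, boot_estimates = bootstrap_variance(T_obs, exp_median_log, 2000, rng)

print(f"point estimate    : {point_estimate:.2f}")
print(f"bootstrap variance: {var_boot:.2f}")
```

With an odd sample size the median of the logs is the log of an observed value, so each bootstrap estimate is (up to rounding) one of the observed $T_i$.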


---
#### Task (g)

In [58]:
sample_size = len(T)
T_greater_t1 = T[T > t1]

# eta_0 and eta_1 were overwritten with the Pareto values in In [56],
# so recompute the empirical estimates that the bias is measured against
eta_0_emp = (1 - len(T[T <= t1]) / sample_size) / (1 - len(T[T <= t0]) / sample_size)
eta_1_emp = sum(T_greater_t1 - t1) / len(T_greater_t1)

breakdown_point_eta0 = 1 / sample_size  # Single outlier affects ranks in G_T
breakdown_point_eta1 = 1 / len(T_greater_t1) if len(T_greater_t1) > 0 else None

# Compute maximal bias for eta_0
# Maximize eta_0 by minimizing G_T(t1) and maximizing G_T(t0)
max_bias_eta0 = abs(
    ((1 - (len(T[T <= t1]) - 1) / sample_size) / (1 - (len(T[T <= t0]) + 1) / sample_size)) - eta_0_emp
)

# Compute maximal bias for eta_1
# Add one more excess as large as the largest observed one (max(T) - t1)
max_T = max(T)  # Assume max outlier

max_bias_eta1 = abs((sum(T_greater_t1 - t1) + max_T - t1) / (len(T_greater_t1) + 1) - eta_1_emp)

print(round(breakdown_point_eta0, 4))
print(round(breakdown_point_eta1, 4))
print(round(max_bias_eta0, 4))
print(round(max_bias_eta1, 4))

0.1111
0.25
0.1806
3.75



$Breakdown\ Point(\eta_0) \approx 0.1111$

$Breakdown\ Point(\eta_1) \approx 0.25$

$Maximal\ Bias(\eta_0) \approx 0.1806$

$Maximal\ Bias(\eta_1) \approx 3.75$


**Robustness:**
- $\eta_0$ is robust to moderate contamination due to rank-based computation, but breakdown occurs with one extreme outlier, reflected in its lower maximal bias.
- $\eta_1$ is sensitive to extreme values in $T_i > t_1$, as seen in the high maximal bias. Its breakdown point is higher, but its sensitivity to magnitudes makes it less robust.


---

### Problem 3

#### Task (a)

The parameter to estimate is:

$\alpha = E(X) \cdot (E(X))^2$

If $\mu = E(X)$, then:

$\alpha = \mu \cdot \mu^2 = \mu^3$

A naive plug-in estimator for $\alpha$ could be:

$$
\hat{\alpha}_{naive} = \bar{X} \cdot (\bar{X})^2 = (\bar{X})^3
$$

where $\bar{X}$ is the sample mean.


To correct the bias, we consider the following:

- Expand $(\bar{X})^3$ using the properties of variance and higher-order moments.  
- Derive a correction term for the bias based on $Var(\bar{X})$.  

$$
Var(\bar{X}) = \frac{\sigma^2}{n},
$$

where $\sigma^2 = Var(X)$.

Using a second-order Taylor expansion of $(\bar{X})^3$ around $\mu$, $E[(\bar{X})^3] \approx \mu^3 + 3 \mu \cdot Var(\bar{X})$, so the bias-corrected estimator is:

$$
\hat{\alpha} = (\bar{X})^3 - 3 \cdot \bar{X} \cdot Var(\bar{X}) =
(\bar{X})^3 - \frac{3 \cdot \bar{X} \cdot \sigma^2}{n}.
$$

If $\sigma^2$ is unknown, replace it with the sample variance $S^2$:

$S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2$.

Thus:

$$
\hat{\alpha} = (\bar{X})^3 - \frac{3 \cdot \bar{X} \cdot S^2}{n}.
$$
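
Where the correction term $3 \bar{X} S^2 / n$ comes from can be seen from an exact expansion. This is a sketch that assumes i.i.d. observations with a finite third moment and writes $\mu_3 = E[(X - \mu)^3]$, a symbol not used elsewhere in this solution:

$$
E[(\bar{X})^3] = E\left[(\mu + (\bar{X} - \mu))^3\right] = \mu^3 + 3 \mu \cdot \frac{\sigma^2}{n} + \frac{\mu_3}{n^2}
$$

The expansion uses $E[\bar{X} - \mu] = 0$, $E[(\bar{X} - \mu)^2] = \sigma^2 / n$ and $E[(\bar{X} - \mu)^3] = \mu_3 / n^2$. Subtracting $3 \bar{X} S^2 / n$ removes the $O(1/n)$ term; because $E[\bar{X} S^2] = \mu \sigma^2 + \mu_3 / n$, the remaining bias is $-2 \mu_3 / n^2$.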

The variance of $\hat{\alpha}$ follows from the delta method with $g(\mu) = \mu^3$ and $g'(\mu) = 3 \mu^2$. To first order:

$$
V_2 = Var(\hat{\alpha}) \approx Var((\bar{X})^3) \approx (3 \mu^2)^2 \cdot \frac{\sigma^2}{n} = 9 \cdot (\mu^4) \cdot \frac{\sigma^2}{n},
$$

where terms of order $1/n^2$, which involve the third and fourth moments of $X$, are neglected.

Using sample estimates:

$$
\hat{V}_2 = 9 \cdot (\bar{X})^4 \cdot \frac{S^2}{n}.
$$
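
A quick simulation can indicate how well the bias correction and the delta-method variance behave. The sketch below assumes normally distributed data with illustrative values `mu = 2`, `sigma = 1`, `n = 20`; none of these come from the exam.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 2.0, 1.0, 20, 200_000  # illustrative values only

X_sim = rng.normal(mu, sigma, size=(reps, n))
xbar = X_sim.mean(axis=1)
s2 = X_sim.var(axis=1, ddof=1)

alpha_naive = xbar**3
alpha_corrected = xbar**3 - 3 * xbar * s2 / n

# Bias of each estimator relative to the true alpha = mu^3
bias_naive = alpha_naive.mean() - mu**3
bias_corrected = alpha_corrected.mean() - mu**3

# Delta-method variance versus the simulated variance
var_delta = 9 * mu**4 * sigma**2 / n
var_mc = alpha_corrected.var(ddof=1)

print(f"bias (naive)     : {bias_naive:.4f}")
print(f"bias (corrected) : {bias_corrected:.4f}")
print(f"variance (delta) : {var_delta:.4f}")
print(f"variance (sim)   : {var_mc:.4f}")
```

For these settings the naive estimator should show a bias close to $3 \mu \sigma^2 / n = 0.3$, while the corrected one should be close to unbiased; the delta-method variance is a first-order approximation, so a small gap to the simulated variance is expected.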

----
#### Task (b)

In [59]:
X = [17 + I2, 35, 26, 11 + I1, 21, 9 + S, 44, 10 * N, 16]
sample_size = len(X)

# Compute mean
mean_X = sum(X) / sample_size

# Compute sample variance
variance_X = sum((t - mean_X)**2 for t in X) / (sample_size - 1)

# Compute bias-adjusted alpha
alpha_hat = (mean_X**3) - (3 * mean_X * variance_X / sample_size)

# Compute variance of alpha estimator
variance_alpha_hat = 9 * (mean_X**4) * (variance_X / sample_size)

print("Sample mean", mean_X)
print("Sample variance", variance_X)
print("alpha_hat", alpha_hat)
print("variance_alpha_hat", variance_alpha_hat)


Sample mean 28.444444444444443
Sample variance 229.77777777777777
alpha_hat 20835.37997256515
variance_alpha_hat 150417320.6680553


Answers are:

$\hat{\alpha} \approx 20835.38$

$\hat{V}_2 \approx 150417320.67$

----

### Problem 5

Testing $H_0: E(T) = 18$ vs $H_1: E(T) \neq 18$

The symmetry of $h(T)$ implies that the missingness in $T$ can be addressed under the assumption that the observed values of $T$ are representative of the true distribution.

To perform bootstrap with missing data:
- Assume that $h(T)$ is symmetric about some $a$ and that the observed data $T$ can approximate the distribution.  
- Bootstrap samples are drawn only from the observed values of $T$, treating them as complete for resampling purposes.

Procedure:
- Resample $T$ (observed values only) with replacement to create bootstrap samples $T_b^*$.  
- Compute the mean $\bar{T}_b^*$ for each bootstrap sample.

- Under the null hypothesis $E(T) = 18$, shift the bootstrap means so that their distribution is centred at 18:

$$
\bar{T}_b^*(\text{null-adjusted}) = \bar{T}_b^* - (\bar{T} - 18).
$$

- Compute the test statistic $T_{obs} = \bar{T}$.
- Calculate the proportion of null-adjusted bootstrap means that lie at least as far from 18 as $T_{obs}$ does (two-sided, using the symmetry):

$$
p\text{-value} = \frac{\# \{ |\bar{T}_b^*(\text{null-adjusted}) - 18| \geq |T_{obs} - 18| \}}{\text{number of bootstrap samples}}
$$

- If the p-value is less than the significance level, then reject the null hypothesis.
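
The procedure above can be sketched as follows. The function name `bootstrap_mean_test`, the number of replicates, the seed, and the data are all assumptions: the array reuses the Problem 2 values with two hypothetical `NaN` entries standing in for missing observations, since the actual Problem 5 data are not reproduced here.

```python
import numpy as np


def bootstrap_mean_test(values, mu0, n_boot, rng):
    """Two-sided bootstrap test of H0: E(T) = mu0 using observed values only."""
    values = np.asarray(values, dtype=float)
    observed = values[~np.isnan(values)]  # drop missing entries
    n = len(observed)
    t_obs = observed.mean()

    # Resample the observed values with replacement and take the means
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = observed[idx].mean(axis=1)

    # Shift the bootstrap means so that they are centred at mu0
    null_means = boot_means - (t_obs - mu0)

    # Share of null-adjusted means at least as far from mu0 as t_obs
    p_value = np.mean(np.abs(null_means - mu0) >= abs(t_obs - mu0))
    return t_obs, p_value


# Hypothetical data: NaN marks a missing observation
T_with_missing = np.array([21, 35, np.nan, 26, 16, 21, np.nan, 17, 44, 60, 16])

rng = np.random.default_rng(5)
t_obs, p_value = bootstrap_mean_test(T_with_missing, mu0=18, n_boot=5000, rng=rng)

print(f"observed mean: {t_obs:.2f}")
print(f"p-value      : {p_value:.4f}")
```

Resampling only the observed values is valid under the assumption stated above, namely that the observed values are representative of the full distribution.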
----