## Intermediate Exam 2022 Solutions
**Robert Mackevič**

In [1]:
import numpy as np
from scipy.stats import norm

# Individualized parameters
N = 6
S = 8
I1 = 5
I2 = 4

### Problem 1

#### Task (a)
$$
q_1 = \frac{6}{1 + 6 + 8} = \frac{6}{15} = 0.4
$$
$$
q_2 = \frac{6 + 8}{1 + 6 + 8} = \frac{14}{15} \approx 0.9333
$$

**$P\{Y > y(q_1)\}$** is the probability that $Y$ exceeds the 40th percentile, which corresponds to:

$$
P\{Y > y(q_1)\} = 1 - F_Y(y(q_1)) = 1 - 0.4 = 0.6
$$

**$P\{y(q_1) < Y \leq y(q_2)\}$** is the probability that $Y$ falls between the 40th and approximately 93.33rd percentiles:

$$
P\{y(q_1) < Y \leq y(q_2)\} = F_Y(y(q_2)) - F_Y(y(q_1)) = 0.9333 - 0.4 = 0.5333
$$

Assumptions:
- $Y$ follows a continuous distribution $F_Y$ such that the quantiles $y(q_1)$ and $y(q_2)$ exist.
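
As a quick numerical check of the two probabilities, the sketch below assumes, purely for illustration, that $Y$ is standard normal. Any continuous distribution would give the same answers, because only the identity $F_Y(y(q)) = q$ is used.

```python
from scipy.stats import norm

# Individualized parameters from the setup cell
N, S = 6, 8
q1 = N / (1 + N + S)        # 6/15 = 0.4
q2 = (N + S) / (1 + N + S)  # 14/15

# Quantiles of an assumed standard normal Y (illustrative choice only)
y_q1, y_q2 = norm.ppf(q1), norm.ppf(q2)

p_above = norm.sf(y_q1)                      # P{Y > y(q1)}
p_between = norm.cdf(y_q2) - norm.cdf(y_q1)  # P{y(q1) < Y <= y(q2)}

print(f"P(Y > y(q1)) = {p_above:.4f}")
print(f"P(y(q1) < Y <= y(q2)) = {p_between:.4f}")
```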

----
#### Task (b)
- **$N$**: The total number of observations is:
$$
N = 3 + 6 + 8 = 17
$$
- **$m_1$** is the index of the 6th order statistic:
$$
m_1 = 6
$$

- **$m_2$** is the index of the 14th order statistic:
$$
m_2 = 6 + 8 = 14
$$

- **$N_1$** counts the number of observations greater than the 6th order statistic $Y(m_1)$. It represents the 11 largest values (17 total values minus the first 6 smallest):

$$
N_1 = \#\{ t \in [1, 17] : Y_t > Y(6) \} = 11
$$

- **$N_2$** counts the number of observations between the 6th order statistic $Y(m_1)$ and the 14th order statistic $Y(m_2)$. It represents the 8 values between the 6th and 14th smallest:

$$
N_2 = \#\{ t \in [1, 17] : Y(6) < Y_t \leq Y(14) \} = 8
$$

Assumptions:
1. The data $Y_t$ are independent and identically distributed (i.i.d.).
2. The values $Y_t$ are continuous, ensuring no ties among the order statistics.
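
The counts $N_1$ and $N_2$ do not depend on the actual data as long as there are no ties. A minimal sketch, assuming a simulated standard normal sample of 17 values stands in for the data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, m1, m2 = 17, 6, 14

# Any continuous distribution works; a standard normal sample is assumed here
y = rng.normal(size=n_obs)
y_sorted = np.sort(y)

y_m1 = y_sorted[m1 - 1]  # Y(6), converting the 1-based rank to a 0-based index
y_m2 = y_sorted[m2 - 1]  # Y(14)

n1 = int(np.sum(y > y_m1))                  # observations above Y(6)
n2 = int(np.sum((y > y_m1) & (y <= y_m2)))  # observations in (Y(6), Y(14)]
print(n1, n2)
```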

----
#### Task (c)
$$
N = 1 + 6 = 7
$$
$$
m = 7
$$
$$
q = \frac{S}{N + S} = \frac{8}{7 + 8} = \frac{8}{15} \approx 0.5333
$$

- $Y(m)$ is the 7th order statistic, which is the largest observation since $m = N = 7$.
- $y(q)$ is the 53.33rd percentile of the distribution $F_Y$.

The probability $P\{Y(m) \leq y(q)\}$ is the probability that all $N = 7$ observations are less than or equal to $y(q)$. For independent and identically distributed (i.i.d.) observations, this can be written as:

$$
P\{Y(m) \leq y(q)\} = P\{Y_1 \leq y(q)\}^N = F_Y(y(q))^N
$$

Substituting the values:

$$
P\{Y(m) \leq y(q)\} = F_Y(y(q))^7
$$

At $q = 0.5333$, we have:

$$
F_Y(y(q)) = q = 0.5333
$$

Thus, the probability becomes:

$$
P\{Y(m) \leq y(q)\} = (0.5333)^7 \approx 0.01227
$$

Assumptions:

1. The sample $Y_t$ is i.i.d. from the distribution $F_Y$.
2. The distribution $F_Y$ is continuous, ensuring well-defined order statistics and quantiles.
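
The closed form $q^N$ can be compared with a simulation. The sketch below assumes Uniform(0, 1) data, for which $y(q) = q$; the sample count of 200,000 is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, q = 7, 8 / 15

# Exact value: F_Y(y(q))^N with F_Y(y(q)) = q
exact = q ** n_obs

# Monte Carlo check with Uniform(0, 1) data (assumed), where y(q) = q
samples = rng.uniform(size=(200_000, n_obs))
estimate = np.mean(samples.max(axis=1) <= q)

print(f"exact = {exact:.5f}, simulated = {estimate:.5f}")
```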

---

### Problem 2

#### Task (a)
**L-statistics Example: Sample Median**

- **Statistic**: Median
- **Definition**: The median is a special case of an L-statistic:

$$
\text{Median} = Y(m), \quad m = \lceil \frac{N}{2} \rceil
$$

where $Y(m)$ is the middle value in the ordered sample.

- **Parameter Estimated**: The median is an estimator of the 50th percentile (or 0.5-quantile) of the distribution $F_Y$.

- **Explanation**: The median is robust to outliers and provides a measure of central tendency, especially for skewed distributions.

**R-statistics Example: Spearman's Rank Correlation**

- **Statistic**: Spearman's $\rho$
- **Definition**: Spearman's rank correlation measures the association between two variables by comparing their rank statistics:

$$
\rho = 1 - \frac{6 \sum_{t=1}^{N} (R(X_t) - R(Y_t))^2}{N(N^2 - 1)}
$$

where $R(X_t)$ and $R(Y_t)$ are the ranks of the values $X_t$ and $Y_t$.

- **Parameter Estimated**: The population rank correlation, a measure of the monotonic association between two variables.

- **Explanation**: It depends on the data only through ranks, so it is invariant under monotone transformations of either variable and insensitive to outliers.
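
Both statistics are easy to compute directly from their definitions. The sketch below uses simulated data (an assumption, not exam data) and compares the rank-difference formula with `scipy.stats.spearmanr`, which agrees with it when there are no ties.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(2)
x = rng.normal(size=15)
y = x**3 + rng.normal(scale=0.5, size=15)  # nonlinear link, assumed for illustration

# L-statistic: sample median as the order statistic Y(ceil(N/2))
m = int(np.ceil(len(y) / 2))
median_order_stat = np.sort(y)[m - 1]

# R-statistic: Spearman's rho from the rank-difference formula (no ties)
n = len(x)
d = rankdata(x) - rankdata(y)
rho_formula = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

rho_scipy, _ = spearmanr(x, y)
print(f"median = {median_order_stat:.4f}")
print(f"rho (formula) = {rho_formula:.4f}, rho (scipy) = {rho_scipy:.4f}")
```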


---
#### Task (b)
**1. Estimating $a$ (Location Parameter) when $b = 1$**

**L-statistics Construction**

- **Statistic**: Sample Median
- **Definition**: The sample median is given by:

$$
\text{Median} = Y(m), \quad \text{where } m = \lceil N/2 \rceil
$$

The sample median is an L-statistic:

$$
L_N = \sum_{t=1}^{N} c_t Y(t)
$$

where $c_t = 1$ for $t = m$ and $c_t = 0$ otherwise.

- **Parameter Estimated**: The median is a robust estimator of $a$, assuming $F^\circ$ is symmetric around $a$.

- **Explanation**: The median minimizes the sum of absolute deviations, making it robust to outliers in the data $Y_t$.

**R-statistics Construction**

- **Statistic**: Hodges-Lehmann Estimator
- **Definition**: The Hodges-Lehmann estimator for $a$ is:

$$
R_N = \text{Median}\left( \frac{Y_i + Y_j}{2} : 1 \leq i < j \leq N \right)
$$

- **Parameter Estimated**: The location parameter $a$.

- **Explanation**: By averaging all pairs of values, the estimator remains robust while efficiently using rank-based information.

**2. Estimating $b$ (Scale Parameter) when $a = 0$**

**L-statistics Construction**

- **Statistic**: Interquartile Range (IQR)
- **Definition**: The IQR is the difference between the 75th and 25th percentiles:

$$
\text{IQR} = Y(q_3) - Y(q_1)
$$

where:

- $q_3 = \lceil 0.75N \rceil$
- $q_1 = \lceil 0.25N \rceil$

The L-statistic for the IQR is:

$$
L_N = \sum_{t=1}^{N} c_t Y(t)
$$

where:

- $c_{q_1} = -1$
- $c_{q_3} = 1$
- $c_t = 0$ otherwise.

- **Parameter Estimated**: $b$, assuming $F^\circ$ is symmetric and normalized.

- **Explanation**: The IQR scales linearly with the spread of the data and is robust to outliers.

**R-statistics Construction**

- **Statistic**: Gini's Mean Difference
- **Definition**: Gini's mean difference is defined as:

$$
R_N = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1}^{N} |Y_i - Y_j|
$$

- **Parameter Estimated**: $b$, as it measures dispersion.

- **Explanation**: It uses rank-based pairwise differences to provide a robust estimate of scale.
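
The four constructions above can be sketched in a few lines. The helper names, the true values `a = 2` and `b = 3`, and the normal choice for $F^\circ$ are all assumptions made for illustration. Note that the IQR and Gini's mean difference recover $b$ only up to a constant that depends on $F^\circ$ (about 1.349 and $2/\sqrt{\pi}$ respectively for the standard normal).

```python
import numpy as np
from itertools import combinations

def hodges_lehmann(y):
    """Median of the pairwise averages (Y_i + Y_j) / 2 over i < j."""
    return np.median([(u + v) / 2 for u, v in combinations(y, 2)])

def iqr_order_stats(y):
    """IQR as an L-statistic: Y(ceil(0.75 N)) - Y(ceil(0.25 N))."""
    ys = np.sort(y)
    n = len(ys)
    return ys[int(np.ceil(0.75 * n)) - 1] - ys[int(np.ceil(0.25 * n)) - 1]

def gini_mean_difference(y):
    """Average absolute difference over all ordered pairs i != j."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    return np.abs(y[:, None] - y[None, :]).sum() / (n * (n - 1))

rng = np.random.default_rng(3)
a, b = 2.0, 3.0                    # hypothetical location and scale
y = a + b * rng.normal(size=200)   # Y = a + b * eps, eps ~ N(0, 1) assumed

print(f"median: {np.median(y):.3f}, Hodges-Lehmann: {hodges_lehmann(y):.3f}")
print(f"IQR / 1.349: {iqr_order_stats(y) / 1.349:.3f}")
print(f"Gini * sqrt(pi) / 2: {gini_mean_difference(y) * np.sqrt(np.pi) / 2:.3f}")
```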



---
#### Task (c)
$$
N = 2(8) + 6 = 16 + 6 = 22
$$

The L-statistic uses the order statistics from $Y(S) = Y(8)$ to $Y(N - m) = Y(12)$, with $m = 10$. The effective size is:

$$
\text{Effective Size} = (N - m) - S + 1 = (22 - 10) - 8 + 1 = 5
$$

The L-statistic ignores:

- The smallest $S - 1 = 7$ values.
- The largest $m = 10$ values.

$$
\text{Ignored Size} = 7 + 10 = 17
$$

The proportion of ignored data is:

$$
\text{Proportion Ignored} = \frac{\text{Ignored Size}}{\text{Total Size}} = \frac{17}{22}
$$
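The breakdown point is governed by the smaller of the two ignored tails rather than by the total ignored proportion. The statistic is unaffected by up to 7 arbitrarily small observations (or up to 10 arbitrarily large ones), but an 8th contaminated value in the lower tail reaches $Y(8)$:
$$
\text{Breakdown Point} = \frac{\min(7, 10)}{22} = \frac{7}{22} \approx 0.3182
$$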
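
The index bookkeeping can be checked mechanically. A small sketch, using the values $S = 8$, $N = 22$ and $m = 10$ from the text (the variable names are illustrative):

```python
import numpy as np

S, N_param, m = 8, 6, 10   # S and N from the setup cell; m = 10 as in the text
n_obs = 2 * S + N_param    # 22 observations

ranks = np.arange(1, n_obs + 1)
used = ranks[(ranks >= S) & (ranks <= n_obs - m)]  # ranks entering the L-statistic
ignored_low = int(np.sum(ranks < S))               # ranks below Y(S)
ignored_high = int(np.sum(ranks > n_obs - m))      # ranks above Y(N - m)

print(f"used ranks: {used.tolist()}")
print(f"effective size: {used.size}, ignored: {ignored_low} + {ignored_high}")
```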

---

### Problem 4

In [2]:
r = N / (20 * S)
tau = 21 + I1

loan_data = [
    (1000, 18),
    (200, 47),
    (50 + I2, 90),
    (400, 20 + S),
    (3000, 7),
    (300, 36),
]

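# Repayment: principal plus simple interest at rate r for each time unit beyond tau
repayments = []
for S_i, T_i in loan_data:
    repayment = S_i + S_i * r * max(0, T_i - tau)
    repayments.append(repayment)

repayments = np.array(repayments)

# Treat the six loans as equally likely outcomes (population moments, ddof=0)
mean_repayment = repayments.mean()
var_single_repayment = np.var(repayments, ddof=0)

# Sum over 100 consumers, assumed independent: mean and variance both scale by 100
expected_total_repayment = 100 * mean_repayment
var_total_repayment = 100 * var_single_repayment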

print(f"Repayments for each consumer: {repayments}")
print(f"Expected repayment for one consumer: {mean_repayment:.2f}")
print(f"Variance of repayment for one consumer: {var_single_repayment:.2f}")
print(f"Expected total repayment for 100 consumers: {expected_total_repayment:.2f}")
print(f"Variance of total repayment for 100 consumers: {var_total_repayment:.2f}")


Repayments for each consumer: [1000.   357.5  183.6  430.  3000.   412.5]
Expected repayment for one consumer: 897.27
Variance of repayment for one consumer: 947674.44
Expected total repayment for 100 consumers: 89726.67
Variance of total repayment for 100 consumers: 94767443.89


---

### Problem 5

In [3]:
r = S / 40
trials = N * 10

# Observed successes (assume successes observed = r * trials)
successes = r * trials

# Estimate probability of success
p_hat = successes / trials

# Logit function
logit_theta = np.log(p_hat / (1 - p_hat))

# Variance of the logit estimator using the delta method
variance_logit = (1 / (p_hat * (1 - p_hat) * trials))

# 90% lower one-sided confidence interval for the logit function
z_score = norm.ppf(0.90)  # Z-score for 90% confidence
lower_bound = logit_theta - z_score * np.sqrt(variance_logit)

print(f"Success rate (r): {r:.4f}")
print(f"Total trials: {trials}")
print(f"Estimated probability of success (p^): {p_hat:.4f}")
print(f"Logit function value (\u03B8): {logit_theta:.4f}")
print(f"Variance of logit function estimator: {variance_logit:.4e}")
print(f"90% Lower one-sided confidence interval for \u03B8: ({lower_bound:.4f}, \u221E)")


Success rate (r): 0.2000
Total trials: 60
Estimated probability of success (p^): 0.2000
Logit function value (θ): -1.3863
Variance of logit function estimator: 1.0417e-01
90% Lower one-sided confidence interval for θ: (-1.7999, ∞)
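
The variance formula used in the cell above follows from the delta method. A short derivation, assuming the number of successes is binomial with $n$ trials and success probability $p$, so that $\text{Var}[\hat{p}] = p(1 - p)/n$:

$$
g(p) = \log\frac{p}{1 - p}, \qquad g'(p) = \frac{1}{p(1 - p)}
$$

$$
\text{Var}[g(\hat{p})] \approx [g'(p)]^2 \, \text{Var}[\hat{p}] = \frac{1}{p^2 (1 - p)^2} \cdot \frac{p(1 - p)}{n} = \frac{1}{n \, p(1 - p)}
$$

With $\hat{p} = 0.2$ and $n = 60$ this gives $1 / (60 \times 0.16) = 1/9.6 \approx 0.1042$, matching the printed value.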


---

### Problem 6

#### Task (a)
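The bias of $\hat{\sigma}_N^2$ is the expected difference between $\hat{\sigma}_N^2$ and the true variance $\sigma^2$:

$$
\text{Bias}[\hat{\sigma}_N^2] = E[\hat{\sigma}_N^2] - \sigma^2
$$

For the plug-in estimator $\hat{\sigma}_N^2 = \frac{1}{N} \sum_{t=1}^{N} (Y_t - \bar{Y}_N)^2$ (the version computed in the code cell below), the expectation is:

$$
E[\hat{\sigma}_N^2] = \frac{N - 1}{N} \sigma^2
$$

$$
\text{Bias}[\hat{\sigma}_N^2] = -\frac{\sigma^2}{N}
$$

Only the version normalized by $\frac{1}{N - 1}$ is unbiased.

The variance of $\hat{\sigma}_N^2$ measures the spread of $\hat{\sigma}_N^2$ around its mean:

$$
\text{Var}[\hat{\sigma}_N^2] = E[(\hat{\sigma}_N^2 - E[\hat{\sigma}_N^2])^2]
$$

For normally distributed data, the unbiased version has variance $\frac{2\sigma^4}{N - 1}$, and the plug-in estimator has:

$$
\text{Var}[\hat{\sigma}_N^2] = \frac{2(N - 1)\sigma^4}{N^2}
$$

where:

$$
\sigma^4 = (\sigma^2)^2 = \left(E[(Y - \mu)^2]\right)^2
$$

and $\mu = E[Y]$. For non-normal data the variance also involves the fourth central moment $\mu_4 = E[(Y - \mu)^4]$, with $\text{Var}[\hat{\sigma}_N^2] \approx (\mu_4 - \sigma^4)/N$ for large $N$.

The bias of $\hat{\sigma}_N$ is more complex due to the square root operation:

$$
\text{Bias}[\hat{\sigma}_N] = E[\hat{\sigma}_N] - \sigma
$$

Using a second-order Taylor expansion of the square root around $\sigma^2$, an approximation for the bias of $\hat{\sigma}_N$ is:

$$
\text{Bias}[\hat{\sigma}_N] \approx \frac{\text{Bias}[\hat{\sigma}_N^2]}{2\sigma} - \frac{\text{Var}[\hat{\sigma}_N^2]}{8\sigma^3}
$$

For normal data, the leading terms are $-\frac{\sigma}{2N} - \frac{\sigma}{4N} = -\frac{3\sigma}{4N}$.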


In [4]:
sample = [N, 0, S]
n = len(sample)

# Plug-in (1/n) sample variance σ̂²_N
sample_mean = np.mean(sample)
sample_variance = np.mean((np.array(sample) - sample_mean)**2)

# Unbiased (1/(n-1)) variance, used as the estimate of σ² in the formulas below
s2 = np.var(sample, ddof=1)

# Bias of σ̂²_N: E[σ̂²_N] - σ² = -σ²/n, so the estimate is negative
bias_variance = -s2 / n

# Variance of σ̂²_N (nonparametric variance estimator)
variance_variance = (2 / (n * (n - 1))) * s2**2

# Standard deviation estimator σ̂_N
sample_std_dev = np.sqrt(sample_variance)

# Bias of σ̂_N: second-order Taylor expansion of the square root
bias_std_dev = (bias_variance / (2 * sample_std_dev)
                - variance_variance / (8 * sample_std_dev**3))

# Variance of σ̂_N: delta method, Var[σ̂²_N] / (4 σ̂²_N)
variance_std_dev = variance_variance / (4 * sample_variance)

# Output results
print(f"Sample: {sample}")
print(f"Sample mean: {sample_mean:.4f}")
print(f"Sample variance (σ̂²_N): {sample_variance:.4f}")
print(f"Bias of σ̂²_N: {bias_variance:.4f}")
print(f"Variance of σ̂²_N: {variance_variance:.4f}")
print(f"Sample standard deviation (σ̂_N): {sample_std_dev:.4f}")
print(f"Bias of σ̂_N: {bias_std_dev:.4f}")
print(f"Variance of σ̂_N: {variance_std_dev:.4f}")


Sample: [6, 0, 8]
Sample mean: 4.6667
Sample variance (σ̂²_N): 11.5556
Bias of σ̂²_N: -5.7778
Variance of σ̂²_N: 100.1481
Sample standard deviation (σ̂_N): 3.3993
Bias of σ̂_N: -1.1685
Variance of σ̂_N: 2.1667


---
#### Task (b)
The sample standard deviation is $\hat{\sigma}_N = T(F_N)$, where $F_N$ is the empirical distribution and $T$ is the functional:

$$
T(F) = \sigma(F) = \sqrt{\int (x - \mu)^2 \, dF(x)}
$$

where $\mu = \int x \, dF(x)$ is the population mean.

When the distribution $F$ is perturbed towards a point mass $\delta_y$, i.e. $F_\epsilon = (1 - \epsilon) F + \epsilon \delta_y$, the mean becomes:

$$
\mu_\epsilon = (1 - \epsilon) \mu + \epsilon y
$$

The variance changes to:

$$
\sigma_\epsilon^2 = \int (x - \mu_\epsilon)^2 \, dF_\epsilon(x) = (1 - \epsilon) \int (x - \mu_\epsilon)^2 \, dF(x) + \epsilon (y - \mu_\epsilon)^2
$$

Substitute $\mu_\epsilon - \mu = \epsilon (y - \mu)$ and $y - \mu_\epsilon = (1 - \epsilon)(y - \mu)$, and use:

$$
\int (x - \mu) \, dF(x) = 0
$$

This gives the exact expression:

$$
\sigma_\epsilon^2 = (1 - \epsilon) \sigma^2 + \epsilon (1 - \epsilon) (y - \mu)^2
$$

Keeping terms linear in $\epsilon$:

$$
\sigma_\epsilon^2 \approx \sigma^2 + \epsilon \left[ (y - \mu)^2 - \sigma^2 \right]
$$

The perturbed standard deviation is:

$$
T(F_\epsilon) = \sqrt{\sigma_\epsilon^2}
$$

Using a first-order Taylor expansion of the square root around $\sigma^2$:

$$
T(F_\epsilon) \approx \sigma + \epsilon \, \frac{(y - \mu)^2 - \sigma^2}{2\sigma}
$$

The influence function is the derivative of $T(F_\epsilon)$ with respect to $\epsilon$ at $\epsilon = 0$:

$$
\text{IF}(y; \hat{\sigma}_N, F) = \frac{(y - \mu)^2 - \sigma^2}{2\sigma}
$$

It is unbounded in $y$, so the standard deviation is not robust to outliers.
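
An influence function for the standard deviation can be checked numerically by contaminating a discrete distribution with a small mass $\epsilon$ at $y$. The sketch below compares the finite-difference slope with the standard closed form $((y - \mu)^2 - \sigma^2) / (2\sigma)$; the sample, the point `y = 3.0` and the helper `std_functional` are illustrative assumptions.

```python
import numpy as np

def std_functional(values, weights):
    """Standard deviation of a discrete distribution with the given weights."""
    mu = np.sum(weights * values)
    return np.sqrt(np.sum(weights * (values - mu) ** 2))

rng = np.random.default_rng(4)
x = rng.normal(size=50)          # support points of F (assumed sample)
w = np.full(x.size, 1 / x.size)  # F puts equal mass on each point

mu, sigma = np.sum(w * x), std_functional(x, w)
y, eps = 3.0, 1e-6               # contamination point and small epsilon

# F_eps = (1 - eps) F + eps * delta_y
x_eps = np.append(x, y)
w_eps = np.append((1 - eps) * w, eps)

if_numeric = (std_functional(x_eps, w_eps) - sigma) / eps
if_closed = ((y - mu) ** 2 - sigma**2) / (2 * sigma)
print(f"numeric: {if_numeric:.4f}, closed form: {if_closed:.4f}")
```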

---

### Problem 7

#### Task (a)
For a significance level of $\alpha = 0.01$:

- The total rejection region must have probability $\alpha = 0.01$.
- Since the test is two-sided, the rejection region is split equally:
  - $\frac{\alpha}{2} = 0.005$ in the left tail,
  - $\frac{\alpha}{2} = 0.005$ in the right tail.

Thus, the quantiles $q_1$ and $q_2$ correspond to:

$$
q_1 = 0.005, \quad q_2 = 1 - 0.005 = 0.995
$$

The choice of $q_1 = 0.005$ and $q_2 = 0.995$ ensures the interval covers $1 - \alpha = 99\%$ of the bootstrap distribution under the null hypothesis. This aligns with the significance level $\alpha = 0.01$, as 1% of the distribution lies outside the interval.
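
A minimal sketch of how these quantiles might be used in a two-sided bootstrap test. The exam's exact procedure is not reproduced here: the data, the null value `mu0`, the choice of the sample mean as test statistic, and the centring step that imposes the null are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.exponential(scale=1.0, size=40)  # hypothetical data
mu0 = 1.0                                # hypothetical null value of the mean
alpha, B = 0.01, 10_000

# Impose the null by centring the data at mu0 before resampling
y_null = y - y.mean() + mu0
idx = rng.integers(0, y.size, size=(B, y.size))
boot_means = y_null[idx].mean(axis=1)

# Two-sided acceptance region between the q1 and q2 bootstrap quantiles
q1, q2 = alpha / 2, 1 - alpha / 2
lower, upper = np.quantile(boot_means, [q1, q2])

reject = not (lower <= y.mean() <= upper)
print(f"acceptance region: [{lower:.3f}, {upper:.3f}], reject H0: {reject}")
```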

---
#### Task (b)
For $q_1 = 0.005$ and $q_2 = 0.995$, the accuracy of the estimated quantiles depends on the number of bootstrap samples $B$. The Monte Carlo standard error of an estimated tail probability $q$ is:

$$
\text{Standard Error} = \sqrt{\frac{q(1 - q)}{B}}
$$

where $q$ is the quantile level of interest (e.g., $q = 0.995$).

For a reasonable precision (e.g., standard error $\approx 0.001$):

$$
B \geq \frac{q(1 - q)}{(\text{Standard Error})^2}
$$

For $q = 0.995$, $1 - q = 0.005$, and Standard Error = 0.001:

$$
B \geq \frac{0.995 \times 0.005}{(0.001)^2} = 4975
$$

A common recommendation is:
- $B \geq 1000$ for rough estimates,
- $B \geq 5000$ for accurate quantile estimation.

The choice of $B$ should balance accuracy with computational constraints. For modern computing systems, $B = 10{,}000$ is a practical and robust choice.

Given the significance level $\alpha = 0.01$ and the need to estimate extreme quantiles $q_1 = 0.005$ and $q_2 = 0.995$ with precision, I recommend:

$$
B = 10{,}000
$$
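
The bound on $B$ can be wrapped in a small helper. The function name is illustrative; exact rational arithmetic via `fractions.Fraction` is used so that floating-point rounding does not push the ceiling up by one.

```python
import math
from fractions import Fraction

def min_bootstrap_samples(q, target_se):
    """Smallest integer B with sqrt(q (1 - q) / B) <= target_se."""
    q, se = Fraction(q), Fraction(target_se)
    return math.ceil(q * (1 - q) / se**2)

for level in ("0.005", "0.995"):
    print(level, min_bootstrap_samples(level, "0.001"))

# Standard error actually achieved with the recommended B = 10,000
se_at_recommended = math.sqrt(0.995 * 0.005 / 10_000)
print(f"SE at B = 10000: {se_at_recommended:.6f}")
```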


---
#### Task (c)
**Insights on the Procedure**
- **Independence and Identical Distribution (i.i.d.)**: The bootstrap assumes that the data are independent and identically distributed.
- **Nonparametric Approach**: The bootstrap is useful when the underlying distribution is unknown and no parametric model is assumed.
- **Significance Level Control**: The significance level $\alpha = 0.01$ is controlled by choosing $q_1 = 0.005$ and $q_2 = 0.995$.
- **Flexibility**: The bootstrap does not rely on parametric assumptions about the data distribution.
- **Resampling**: By resampling directly from the data, the bootstrap captures the empirical variability of the sample.
- **Quantile Estimation**: The rejection region is computed directly from the bootstrap quantiles at $q_1$ and $q_2$, so the desired significance level is attained approximately, with the approximation improving as $N$ and $B$ grow.
- **Small Sample Size**: If $N$ (sample size) is small, bootstrap samples may not sufficiently approximate the true distribution.
- **Outliers**: Outliers in the original data can disproportionately affect bootstrap estimates.
- **Computational Cost**: For large $B$, the method can become computationally expensive.


**Proposed Improvements**

- **Adjust for Small Sample Size**:
  If the sample size $N$ is small, use **bias-corrected and accelerated (BCa) bootstrap**:
  - Adjusts quantiles $q_1$ and $q_2$ to account for bias and skewness in the bootstrap distribution.

- **Robustness to Outliers**:
  Apply a robust statistic, such as the **trimmed mean** or **median**, instead of the sample mean $\bar{Y}_N$. These measures are less sensitive to outliers.

- **Precision in Quantile Estimation**:
  Ensure $B$ is large enough (e.g., $B = 10{,}000$) to minimize variability in estimating $y(q_1)^*$ and $y(q_2)^*$.

- **Alternative Resampling Methods**:
  - **Wild Bootstrap**: Use when the data exhibit heteroscedasticity (unequal variance).
  - **Block Bootstrap**: Use for dependent data to preserve the dependency structure.
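
As an illustration of the BCa suggestion, the sketch below uses `scipy.stats.bootstrap` (available in SciPy 1.7 and later) on a hypothetical small, skewed sample. The data, the choice of the mean as statistic, and `n_resamples=5000` are assumptions made for the example, not part of the exam procedure.

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(6)
y = rng.lognormal(size=30)  # hypothetical small, skewed sample

# 99% bias-corrected and accelerated interval for the mean
res = bootstrap((y,), np.mean, confidence_level=0.99,
                n_resamples=5000, method="BCa")
ci = res.confidence_interval
print(f"99% BCa interval for the mean: [{ci.low:.3f}, {ci.high:.3f}]")
```

Swapping `np.mean` for `np.median` gives the robust variant suggested above under the same call.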


---