# Statistics — Assessment

This assessment aligns with `Statistics/statistics.ipynb` and covers descriptive statistics, probability, distributions, sampling, estimation, hypothesis testing, correlation, and regression basics.

Total questions: 25 (10 Theory, 8 Fill-in-the-Blanks, 7 Coding). Difficulty mix: 40% easy, 40% medium, 20% hard.


## Instructions
- Answer all questions.
- Implement the coding tasks in Python, using NumPy/SciPy where applicable; assert statements are provided so you can check your work.
- Solutions are at the bottom.


## References
- `Statistics/statistics.ipynb`


## Part A — Theory (10)
1. Define population vs sample. Why is sampling used in practice?
2. MCQ: The law of large numbers states that as sample size increases, the sample mean (a) diverges (b) converges to population mean (c) equals median (d) is unbiased.
3. Explain bias and variance in the context of estimators.
4. What is a confidence interval? Interpret a 95% CI correctly.
5. MCQ: A p-value is (a) P(H0 true) (b) the probability, assuming H0 is true, of data at least as extreme as observed (c) P(H1 true) (d) alpha.
6. Distinguish between Type I and Type II errors.
7. When would you use a t-distribution instead of normal for inference on the mean?
8. What is the central limit theorem and why is it useful?
9. MCQ: Pearson correlation measures (a) monotonic (b) linear (c) nonlinear (d) categorical association.
10. Explain the difference between parametric and nonparametric tests.


## Part B — Fill in the Blanks (8)
1. The sample mean is an __________ of the population mean.
2. A 95% confidence interval will contain the true parameter in repeated samples about ______ of the time.
3. The null hypothesis is typically denoted by ______.
4. The area under a probability density function over its support equals ______.
5. The variance is the expected value of the squared __________ from the mean.
6. For small n, approximately normal data, and unknown population variance, use the ______ distribution.
7. The correlation coefficient r is bounded between ______ and ______.
8. Reject H0 if p-value is ______ than alpha.


## Part C — Coding Tasks (7)
Implement using NumPy/SciPy (avoid randomness in asserts except with fixed seeds).

Tasks:
1. `sample_mean(x)` — return mean of a 1D array.
2. `sample_variance(x, ddof=1)` — sample variance.
3. `z_score(x)` — return standardized array (mean 0, std 1) using sample stats.
4. `ci_mean(x, alpha=0.05)` — (1 - alpha) CI for the mean using the t-distribution (95% with the default alpha); return tuple (lo, hi).
5. `ttest_independent(x, y)` — two-sample t-test (equal var); return (t, p).
6. `pearson_r(x, y)` — compute Pearson correlation.
7. `bootstrap_mean(x, n_boot=1000, seed=0)` — 95% percentile bootstrap CI for the mean (2.5th and 97.5th percentiles of the resampled means). Return (lo, hi).


In [None]:
import numpy as np
from math import sqrt
from scipy import stats

def sample_mean(x):
    x = np.asarray(x, float)
    return float(np.mean(x))

def sample_variance(x, ddof=1):
    x = np.asarray(x, float)
    return float(np.var(x, ddof=ddof))

def z_score(x):
    x = np.asarray(x, float)
    m = x.mean()
    s = x.std(ddof=1)
    s = 1.0 if s == 0 else s
    return (x - m) / s

def ci_mean(x, alpha=0.05):
    x = np.asarray(x, float)
    n = x.size
    m = x.mean()
    s = x.std(ddof=1)
    se = s / sqrt(n)
    tcrit = stats.t.ppf(1 - alpha/2, df=n-1)
    lo, hi = m - tcrit*se, m + tcrit*se
    return float(lo), float(hi)

def ttest_independent(x, y):
    t, p = stats.ttest_ind(x, y, equal_var=True)
    return float(t), float(p)

def pearson_r(x, y):
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    xm, ym = x.mean(), y.mean()
    xs, ys = x.std(ddof=1), y.std(ddof=1)
    cov = np.sum((x - xm)*(y - ym)) / (x.size - 1)  # sample covariance (ddof=1), matching the ddof=1 stds
    return float(cov / (xs*ys))

def bootstrap_mean(x, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x, float)
    n = x.size
    means = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)
        means[i] = x[idx].mean()
    lo, hi = np.percentile(means, [2.5, 97.5])
    return float(lo), float(hi)
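
The resampling loop in `bootstrap_mean` can also be written without an explicit Python loop. The following is a minimal sketch of that alternative, not a replacement for the reference solution; the name `bootstrap_mean_vectorized` is hypothetical, and the two versions are not guaranteed to return identical numbers for the same seed because the random draws may be consumed differently.

```python
import numpy as np

def bootstrap_mean_vectorized(x, n_boot=1000, seed=0):
    """Percentile bootstrap CI for the mean, drawing all resamples at once."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, float)
    n = x.size
    # One (n_boot, n) matrix of indices: each row is one resample with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    # Fancy indexing builds the resamples; axis=1 averages within each resample.
    means = x[idx].mean(axis=1)
    lo, hi = np.percentile(means, [2.5, 97.5])
    return float(lo), float(hi)

lo, hi = bootstrap_mean_vectorized([1, 2, 3, 4, 5], n_boot=2000, seed=0)
print(lo, hi)
```

This trades memory (an `n_boot` by `n` index matrix) for speed, which is usually a reasonable choice for small samples.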


In [None]:
# Asserts
x = np.array([1,2,3,4,5])
assert sample_mean(x) == 3.0
assert round(sample_variance(x), 6) == round(np.var(x, ddof=1), 6)
zs = z_score(x)
assert abs(zs.mean()) < 1e-8
t_ci = ci_mean(x)
assert t_ci[0] < 3.0 < t_ci[1]
t, p = ttest_independent([1,2,3],[1,2,3])
assert p > 0.9
r = pearson_r([1,2,3],[2,4,6])
assert abs(r-1.0) < 1e-8
lo, hi = bootstrap_mean(x, 200, 0)
assert lo < 3.0 < hi
print('Statistics asserts passed ✅')
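
Beyond the asserts above, hand-written statistics can be cross-checked against library routines. The sketch below is self-contained and uses arbitrary illustrative values for `x` and `y` (they are not taken from the notebook); it compares a manual t-based interval with `scipy.stats.t.interval` and a manual Pearson r with `np.corrcoef`.

```python
import numpy as np
from scipy import stats

# Arbitrary illustrative data (an assumption, not from the notebook).
x = np.array([2.1, 3.4, 1.9, 5.6, 4.2, 3.8])
y = np.array([1.0, 2.9, 1.5, 5.1, 3.7, 3.9])

# t-based 95% CI for the mean, computed by hand.
n = x.size
m = x.mean()
se = x.std(ddof=1) / np.sqrt(n)
tcrit = stats.t.ppf(0.975, df=n - 1)
manual_ci = (m - tcrit * se, m + tcrit * se)

# Same interval from SciPy; the confidence level is the first positional argument.
library_ci = stats.t.interval(0.95, n - 1, loc=m, scale=se)

# Pearson r by hand, using ddof=1 for both the covariance and the stds.
cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
manual_r = cov / (x.std(ddof=1) * y.std(ddof=1))
library_r = np.corrcoef(x, y)[0, 1]

assert np.allclose(manual_ci, library_ci)
assert np.isclose(manual_r, library_r)
print("manual and library results agree")
```

The key point in the Pearson check is consistency: the covariance and both standard deviations must use the same `ddof`, otherwise r is scaled by (n - 1)/n.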


## Solutions

### Theory (sample)
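1. Population: the entire set of units of interest; sample: a subset of it. Sampling is used because measuring the whole population is usually too costly or impractical.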
2. (b)
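3. Bias: systematic difference between an estimator's expected value and the true parameter; variance: how much the estimator changes from sample to sample.
4. An interval computed from the sample by a procedure that, under repeated sampling, captures the true parameter in about 95% of samples. It does not mean there is a 95% probability that this particular interval contains the parameter.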
5. (b)
6. Type I: false positive; Type II: false negative.
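7. When the sample is small, the population variance is unknown (estimated by s), and the data are approximately normal.
8. For large n, the sum or mean of iid variables with finite variance is approximately normally distributed, whatever the underlying distribution; this justifies normal-based inference on means.
9. (b)
10. Parametric tests assume a specific distributional form (e.g., normality); nonparametric tests make fewer distributional assumptions.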
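
The repeated-sampling reading of a confidence interval (solution 4) can be checked by simulation. The sketch below assumes normally distributed data with made-up parameters (`true_mean`, `sigma`, `n`, and `n_trials` are illustrative choices) and counts how often the t-based 95% interval captures the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative settings (assumptions for this sketch).
true_mean, sigma, n, n_trials = 10.0, 2.0, 15, 2000
tcrit = stats.t.ppf(0.975, df=n - 1)

# Each row is one independent sample of size n from the same population.
samples = rng.normal(true_mean, sigma, size=(n_trials, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)

# An interval "covers" when the true mean lies between its endpoints.
covered = (means - tcrit * ses <= true_mean) & (true_mean <= means + tcrit * ses)
coverage = covered.mean()
print(f"empirical coverage: {coverage:.3f}")
```

The printed fraction should land close to 0.95, which is a statement about the procedure across many samples rather than about any single interval.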

### Fill blanks
1. estimator
2. 95%
3. H0
4. 1
5. deviations
6. t
7. -1, 1
8. less
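
The decision rule in item 8 (reject H0 when the p-value is less than alpha) fixes the Type I error rate at roughly alpha. A small simulation can illustrate this; the sketch below assumes two normal groups drawn from the same distribution, so H0 is true by construction, and the sample size and trial count are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, n_trials = 0.05, 20, 2000  # illustrative settings

# Both groups come from the same N(0, 1) population, so H0 (equal means) holds.
a = rng.normal(0.0, 1.0, size=(n_trials, n))
b = rng.normal(0.0, 1.0, size=(n_trials, n))

# One equal-variance two-sample t-test per row.
_, p = stats.ttest_ind(a, b, axis=1, equal_var=True)

# Fraction of trials where H0 is (wrongly) rejected: the empirical Type I error rate.
reject_rate = np.mean(p < alpha)
print(f"empirical Type I error rate: {reject_rate:.3f}")
```

The rejection rate should be near `alpha`, since every rejection here is a false positive.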
