# TP03. Confidence Intervals & Hypothesis Tests

#### instructor: Anastasios Giovanidis 2021 - 2022
#### date: 29 September 2021

#### student name or pair:

This TP (practical session) relates to 03. Confidence Intervals & Hypothesis Tests.

We need to import the following libraries:

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm
import random

## Exercise 1 (interval)

We wish to measure a quantity $\theta$, but there is a random error in each measurement (noise). 

Then, measurement $i$ is

$X_i = \theta+W_i$,

$W_i$ being the error in the $i$-th measurement. All $W_i$s are i.i.d.

Preliminaries: GENERATE $n$ measurements $(X_1,\ldots, X_n)$ and report the average of the measurements $\overline{X}$ as the estimated value of $\theta$. To do so,
- The $W_i$s are drawn from $Normal(0,\sigma^2)$ with **known** standard deviation $\sigma=4$.
- The **unknown** parameter $\theta=1$.
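
As a minimal sketch of the preliminaries above, the measurements can be generated with NumPy. The variable names and the seed are illustrative assumptions, not part of the exercise statement; only $\theta=1$, $\sigma=4$ and $n=10$ come from the text.

```python
import numpy as np

# Values taken from the exercise statement.
theta = 1.0   # "unknown" parameter, known only to the simulator
sigma = 4.0   # known standard deviation of the noise
n = 10        # number of measurements

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

W = rng.normal(loc=0.0, scale=sigma, size=n)  # i.i.d. errors W_i ~ Normal(0, sigma^2)
X = theta + W                                 # measurements X_i = theta + W_i
X_bar = X.mean()                              # sample mean, the estimate of theta

print(f"estimate of theta: {X_bar:.3f}")
```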

**Questions**

   a) Given a sample set of size $n=10$, provide the confidence interval for $\theta$ with confidence level $1-\alpha=90\%$ (that is, $\alpha=0.1$). 
      
   b) Repeat the experiment $T=2,000$ times: each time draw a new set of size $n=10$, compute its confidence interval, and record $1$ if the unknown parameter $\theta=1$ falls inside the interval and $0$ otherwise. In what percentage of the repetitions does $\theta$ fall inside the estimated interval?
   
   c) After having written the code, repeat (a)-(b) for unknown variance, using the sample standard deviation and approximate confidence intervals. What do you observe? Why?

**Answers**

In [1]:
#Hint: you will need X.mean(), X.var(), norm.ppf(1-a/2)
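
One possible way to use the hint is sketched below. It is not the reference solution: the function names `z_interval` and `coverage`, the seed, and the choice of the unbiased sample standard deviation (`ddof=1`) are assumptions made for illustration. The interval used is the standard one, $\overline{X}\pm z_{1-a/2}\,\sigma/\sqrt{n}$, with `a` denoting the error level (here `a = 0.10`).

```python
import numpy as np
from scipy.stats import norm

def z_interval(X, sigma, a):
    """Two-sided (1 - a) confidence interval for the mean, sigma known."""
    half_width = norm.ppf(1 - a / 2) * sigma / np.sqrt(len(X))
    return X.mean() - half_width, X.mean() + half_width

def coverage(theta, sigma, n, a, T, rng, known_sigma=True):
    """Fraction of T simulated intervals that contain theta."""
    X = theta + rng.normal(0.0, sigma, size=(T, n))  # T sample sets of size n
    means = X.mean(axis=1)
    # Either the known sigma, or the sample standard deviation as a plug-in
    # estimate (ddof=1 is an assumption; X.var() defaults to ddof=0).
    s = sigma if known_sigma else X.std(axis=1, ddof=1)
    half_width = norm.ppf(1 - a / 2) * s / np.sqrt(n)
    inside = (means - half_width <= theta) & (theta <= means + half_width)
    return inside.mean()

rng = np.random.default_rng(seed=1)  # arbitrary seed
print(coverage(1.0, 4.0, 10, 0.10, 2000, rng, known_sigma=True))
print(coverage(1.0, 4.0, 10, 0.10, 2000, rng, known_sigma=False))
```

The two printed fractions can then be compared with the nominal $90\%$ to answer (b) and (c).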

## Exercise 2 (hypothesis test)*

We will study in more detail the Neyman-Pearson Test, which leads to the Likelihood Ratio Test (LRT) we saw during the course. It can be shown that this test has the following property:

**Theory:** The LRT minimises Type II error, under the requirement that Type I error is bounded by: $\alpha\leq2^{-\lambda n}$ for a given $\lambda>0$.

**Application in Wireless Networks:** We can use a hypothesis test to determine anomalies in the normal operation of a cellular network. Consider an LTE network which serves mobile users, and let us focus on some specific period every Monday. Specifically, assume that the network consists of just two base stations ($S_1$ and $S_2$), on neighbouring cells.

During this period, and every Monday, each of these Base Stations has a load $Y_{i}$, $i\in\left\{1,2\right\}$, which is a random variable drawn from a Normal distribution of mean $\rho$ and standard deviation $\sigma$, both known. This knowledge comes from systematic measurements that the stations constantly perform and send to a control center. **Suppose we can only get measurements from $S_1$.**

If an anomaly occurs on station $S_2$, the second Base Station becomes deactivated. As a result, all users that were served by this station will migrate to the neighbouring $S_1$, and the new load of the remaining station will become $2\rho$ in mean value. This information will gradually reach the control center through the load measurements as well. The new load of $S_1$ will be drawn from a Normal distribution with the same standard deviation $\sigma$ but with mean $2\rho$.

Consider the hypotheses:

- $H_0:$ the system of two stations is operating normally, VS

- $H_1:$ there is an anomaly in base station $S_2$.

**Questions**

(A) Find (analytically) the criterion that guarantees a false alarm of $1\%$.

(B) The designer wishes to achieve a false alarm of $1\%$ within $10$ measurements. **What is the appropriate threshold?**

Draw $T=20,000$ sets of size $N=10$ under $H_0$, and verify with simulations that the false alarm is indeed $1\%$.

(C) **Keep the threshold you found from question (A)**. Suppose that at the beginning of the measurements everything works well, but at time $t_0>0$ the station breaks down. We do not know the instant at which the anomaly begins. How many additional measurements after $t_0$ are necessary to detect the anomaly? Use simulations to find out! (Again run $T=20,000$ simulations and report the average.)

First use $t_0=10$ and evaluate the average detection delay. As a next step, repeat the experiment with $t_0=50, 100, 200$.

Values: $\rho = 50$ [Mbps], and $\sigma = 5$ [Mbps].

**Answers**

The Normal density for measurements from $S_1$ with mean $\mu$ (where $\mu=\rho$ under $H_0$ and $\mu=2\rho$ under $H_1$) is given by 

$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$,

so that the $\log$-LRT between the two hypotheses reads:

$\log LRT = \log\frac{L(\rho)}{L(2\rho)} = \frac{-\sum_{i=1}^N(x_i-\rho)^2+\sum_{i=1}^N(x_i-2\rho)^2}{2\sigma^2}=\frac{\rho}{2\sigma^2}\sum_{i=1}^N\left(3\rho-2 x_i\right)$.
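
As a short worked step, the last equality can be checked term by term for a single measurement $x_i$:

$(x_i-2\rho)^2-(x_i-\rho)^2 = \left(x_i^2-4\rho x_i+4\rho^2\right)-\left(x_i^2-2\rho x_i+\rho^2\right) = 3\rho^2-2\rho x_i = \rho\left(3\rho-2x_i\right)$.

Summing over $i=1,\ldots,N$ and dividing by $2\sigma^2$ gives the expression above.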

We accept $H_0$ if $\log LRT\geq\log(c)$ for some threshold $c>0$, which is equivalent to $\overline{X}_N:=\frac{1}{N}\sum_{i=1}^N x_i\leq \frac{3}{2}\rho-\frac{\sigma^2}{N\rho}\log(c)$; otherwise we reject $H_0$ and declare an anomaly. We can write the right-hand side as $q:=\frac{3}{2}\rho-\frac{\sigma^2}{N\rho}\log(c)$, which can be any real number, positive or negative, because $c>0$ and $\log(c)$ ranges over all real values (in particular $\log(c)<0$ for $c<1$). Altogether:

- if $\overline{X}_N\leq q$, then $H_0$, otherwise
- if $\overline{X}_N > q$, then $H_1$.
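
The rule above can be tried out numerically. The sketch below is one possible approach, not the reference solution. It assumes that under $H_0$ the sample mean $\overline{X}_N$ is $Normal(\rho,\sigma^2/N)$, so that a false-alarm probability $a$ corresponds to $q=\rho+z_{1-a}\,\sigma/\sqrt{N}$. The variable names and the seed are illustrative.

```python
import numpy as np
from scipy.stats import norm

rho, sigma = 50.0, 5.0   # [Mbps], values from the exercise statement
N, T = 10, 20_000        # sample size and number of simulated sets
a = 0.01                 # target false-alarm probability

# Under H0 the sample mean is Normal(rho, sigma^2 / N), so the one-sided
# threshold is the (1 - a) quantile of that distribution.
q = rho + norm.ppf(1 - a) * sigma / np.sqrt(N)

# Corresponding log(c), obtained by inverting q = 3*rho/2 - sigma^2*log(c)/(N*rho).
log_c = (1.5 * rho - q) * N * rho / sigma**2

rng = np.random.default_rng(seed=2)        # arbitrary seed
X = rng.normal(rho, sigma, size=(T, N))    # T sample sets drawn under H0
false_alarm = (X.mean(axis=1) > q).mean()  # fraction of sets declared H1

print(f"q = {q:.3f} Mbps, log(c) = {log_c:.1f}, "
      f"empirical false alarm = {false_alarm:.4f}")
```

The empirical false-alarm fraction should fluctuate around $1\%$ from one seed to another.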