1) Work out the Neyman-Pearson detection statistic for detecting a rectangular box located at position 555, with height 1 and size 100, in white Gaussian noise (np.random.normal(0,1,10**6))
a) Write the statistical model for the two competing hypotheses (H0, H1)
b) If a false detection costs 10^6 dollars, but a true detection gains you a dollar, what would be the detection bar?
c) Suppose that the height of the rectangular box is unknown. What would you do? Is the test optimal?
d) Suppose that the noise was complex (as in complex numbers), and that the boxed signal had a uniform "height" that is a complex number of unknown magnitude and phase. What would you do? Is the test optimal?
e) Suppose that the position of the box is unknown. How would you detect it then?
How would you compute the detection statistic at all positions at once using FFT?
What would be the detection bar in this situation, for the same financial conditions as in (Q1 b)? Use Monte Carlo to set the bar. What is the equivalent look-elsewhere effect?
f) What is the amplitude of the box such that if we "inject" a signal with this amplitude, the detection probability is 50% [what about 10%, 99%, 99.99%]?
g) What would you do if the size of the box is unknown, and can take any width between 2 bins and 500 bins? Use Monte Carlo to set the bar. What is the equivalent look-elsewhere effect?

2) You are looking for the same box as in (1), but suppose that the noise is of the following form:
n(t) = np.random.normal(0,1,10**6) convolved with a normalized triangular shape of width 500 [normalized such that np.linalg.norm(triangle)=1].
a) Write a statistical model for the null (H0) and the alternative hypothesis (H1) in real space. What is the Neyman-Pearson detection statistic you would compute? [use matrix notation]
b) Write a statistical model for the null (H0) and the alternative hypothesis (H1) in Fourier space. What is the Neyman-Pearson detection statistic you would compute? Is it the same statistic? Simulate a signal and compute the score in several positions, both in real space and Fourier space, and make sure you get the same number up to machine precision.
c) Are you more sensitive with this noise source, or in Q1? What is the amplitude of the box such that if we "inject" a signal with this amplitude, the detection probability is 50% [what about 10%, 99%, 99.99%]?
d) What about the 50% detection amplitude as a function of the triangle's width? Can you compute it analytically? (A good approximation is OK.)

3) You are looking for the same box as in (1), but suppose that the noise is of the following form:
n(t) = np.random.normal(0,1,10**6) convolved with a normalized filter of width 500 and unknown shape [normalized such that np.linalg.norm(filter)=1].
a) Is it possible to obtain a good estimator of the filter? [Read about the Welch method AFTER trying to solve it yourself]
b) What is the impact of using the best-estimate filter in the detection statistic computed in (2)? Is this hampering detection at all?
c) Suppose that instead of 10**6 samples you have only 10**4 samples. How does this impact the precision of estimating the filter? Is this hampering detection at all?

4) Suppose you have the same situation as in (2), but after generating the noise, Gargamel chooses at random 10**3 samples and zeroizes them.
a) Compute the statistic from (2) in this situation, and plot their histogram. Is that the same histogram as in (2)? Would this interfere with detection?
b) Write the time-domain statistic relevant for detecting a signal at a particular place, taking the missing data into account.

In [None]:
import numpy as np

1) Work out the Neyman-Pearson detection statistic for detecting a rectangular box located at position 555, with height 1 and size 100, in white Gaussian noise (np.random.normal(0,1,10**6))

a) Write the statistical model for the two competing hypotheses (H0, H1)

we have a series of $ n = 10^6 $ data points $x[i]$ dominated by white Gaussian noise $w[i]$, in which we attempt to detect a signal $s[i]$ believed to be a rectangular box of height $h$ and size $l$ starting at position $p$.

thus $ w[i] \sim N(0,1) $

and  $ s[i] = h (\Theta[i - p] - \Theta[i - (p + l)]) $, with the convention $\Theta[k] = 1$ for $k \geq 0$ and $0$ otherwise, so that $s[i] = h$ on the $l$ samples $p \leq i \leq p + l - 1$
    
and we test the hypotheses

$H0:$ $ x[i] = w[i] $

$H1:$ $ x[i] = w[i] + s[i] $

with likelihoods 
$ L[H|x] = P[x|H] = P[H|x] \frac{P[x]}{P[H]} $

since we know that the probability of getting the draws $w[i]$ from $N(0,1)$ is

$ P[w] = (2 \pi)^{-\frac{n}{2}} \prod_{i}^{n} \exp[-\frac{w[i]^2}{2}] $ 

where the product is over all i, we can rewrite the hypotheses as 

$ H0:  w[i] = x[i] $

$ H1:  w[i] = x[i] - s[i] $
    
and plugging in these values of $w[i]$ gives us the likelihood ratio

$ \frac{L[H0]}{L[H1]} = \exp[\frac{1}{2}  \sum_{i}^{n} ((x[i] - s[i])^2 - x[i]^2)] $

$ = \exp[-\sum_{i}^{n} (s[i] x[i] - \frac{s[i]^2}{2})] $

$ \equiv \Lambda[x] $

then we reject the null hypothesis $H0$ in favor of the alternative hypothesis $H1$ when

$ \Lambda[x] < \eta $

where the threshold (the detection bar) $\eta$ is set by the requirement that

$ P[\Lambda[x] < \eta | H0] = \alpha $

for our desired significance level (false detection rate) $\alpha$

the Neyman-Pearson lemma states that this test has the greatest power (smallest false negative / missed detection rate) of all statistical tests at level $\alpha$, but we notice that $ \sum_{i}^{n} s[i]^2 = h^2 l $ does not depend on the data, so it will help our computations to instead consider the monotonically transformed statistic

$ \lambda[x] \equiv \sum_{i}^{n} s[i] x[i] = -\ln[\Lambda] + \frac{1}{2} \sum_{i}^{n} s[i]^2 = -\ln[\Lambda] + \frac{h^2 l}{2} $

(with $\Lambda = L[H0]/L[H1]$ as above). since $\lambda$ decreases as $\Lambda$ increases, the condition $\Lambda[x] < \eta$ becomes a lower bound on $\lambda$: with the new threshold $ \gamma \equiv -\ln[\eta] + \frac{h^2 l}{2} $ we reject $H0$ in favor of $H1$ when

$ \lambda[x] > \gamma $

where $\gamma$ is set to achieve our significance level

$ \alpha = P[\lambda[x] > \gamma | H0] $

next we find the distribution of our statistic,

$ \lambda[x] = \sum_{i}^{n} s[i] x[i] = h \sum_{i=p}^{p+l-1} x[i] $

which, under the null hypothesis $x[i] = w[i]$, is simply $h$ times a sum of $l$ standard Gaussian RVs, resulting in a Gaussian RV with zero mean and variance equal to $h^2 l$ (just $l$ for $h = 1$)

as a consistency check, in terms of the window sum the log-likelihood ratio is

$ 2 \ln[\Lambda] = \sum_{i}^{n} (s[i]^2 - 2 s[i] x[i]) = h \, (l h - 2 \sum_{i=p}^{p+l-1} x[i]) $
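The model and the window-sum statistic can be sketched numerically as follows. This is a minimal sketch, not the definitive solution: it assumes 0-based indexing for the position 555, takes large values of $\lambda$ as evidence for $H1$, and uses an illustrative false-alarm rate `alpha = 1e-3` that is not the answer to (b). The seed and the helper name `statistic` are also assumptions.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)   # fixed seed, illustrative only
n, p, l, h = 10**6, 555, 100, 1.0

# template s[i]: box of height h on samples p .. p+l-1 (0-based indexing assumed)
s = np.zeros(n)
s[p:p + l] = h

w = rng.normal(0.0, 1.0, n)      # white Gaussian noise
x_h0 = w                         # data under H0
x_h1 = w + s                     # data under H1

def statistic(x):
    """Statistic lambda[x] = sum_i s[i] x[i] (hypothetical helper name)."""
    return float(np.dot(s, x))

# under H0, lambda ~ N(0, h^2 l); under H1, lambda ~ N(h^2 l, h^2 l)
sigma = h * np.sqrt(l)
alpha = 1e-3                     # illustrative false-alarm rate (assumption)
gamma = sigma * NormalDist().inv_cdf(1.0 - alpha)

lam0, lam1 = statistic(x_h0), statistic(x_h1)
print(f"gamma = {gamma:.3f}")
print(f"lambda under H0 = {lam0:.3f}, detect: {lam0 > gamma}")
print(f"lambda under H1 = {lam1:.3f}, detect: {lam1 > gamma}")
```

Because the same noise realization is reused, the two statistics differ by exactly $\sum_i s[i]^2 = h^2 l$, which is the shift of the mean between $H0$ and $H1$.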









OLD***


so the rejection region $\Lambda[x] < \eta$ (with $\Lambda = L[H0]/L[H1]$ and $h > 0$) is defined by

$ \sum_{i=p}^{p+l-1} x[i] > \frac{l h}{2} - \frac{\ln[\eta]}{h} $

and the probability of being in this rejection region if the null hypothesis is actually true is

$ P[\Lambda[x] < \eta | H0] = P[\sum_{i=p}^{p+l-1} w[i] > \frac{l h}{2} - \frac{\ln[\eta]}{h}] $

which is just the probability that the sum of $l$ standard Gaussian RVs exceeds a number determined by the signal's area and our threshold

this sum is itself a Gaussian RV with mean zero and variance $l$, so the probability is given by 

$ \alpha = \frac{1}{\sqrt{2 \pi l}} \int_{\frac{l h}{2} - \frac{\ln[\eta]}{h}}^{\infty} \exp[-\frac{y^2}{2 l}] dy = \frac{1}{\sqrt{2 \pi}} \int_{\frac{\sqrt{l} h}{2} - \frac{\ln[\eta]}{\sqrt{l} h}}^{\infty} \exp[-\frac{z^2}{2}] dz $

now plugging in our values of $h = 1$ and $l = 100$ gives

$ \alpha = \frac{1}{\sqrt{2 \pi}} \int_{5 - \frac{\ln[\eta]}{10}}^{\infty} \exp[-\frac{z^2}{2}] dz $
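As a numerical cross-check, the false-alarm rate can be evaluated with the standard normal CDF. The sketch below assumes $\Lambda = L[H0]/L[H1]$ with rejection when $\Lambda < \eta$, so that the window sum must exceed $\frac{l h}{2} - \frac{\ln[\eta]}{h}$; the function names `false_alarm_rate` and `eta_for` are hypothetical.

```python
import math
from statistics import NormalDist

def false_alarm_rate(eta, h=1.0, l=100):
    """alpha = P[sum of l N(0,1) draws > l*h/2 - ln(eta)/h], assuming h > 0."""
    cut = l * h / 2.0 - math.log(eta) / h   # threshold on the window sum
    z = cut / math.sqrt(l)                  # same threshold in units of sigma
    return 1.0 - NormalDist().cdf(z)

def eta_for(alpha, h=1.0, l=100):
    """Invert the relation: the eta that gives false-alarm rate alpha."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    return math.exp(h * (l * h / 2.0 - z * math.sqrt(l)))

print(false_alarm_rate(1.0))   # eta = 1 puts the cut at 5 sigma
print(eta_for(1e-3))           # eta that yields alpha = 1e-3
```

For $\eta = 1$ the cut sits at $l h / 2 = 50$, which is five standard deviations of the window sum when $l = 100$ and $h = 1$.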




finding the $\chi^2$ DOF if Wilks' theorem applies: the null hypothesis has two parameters specifying the WGN mean and variance, and the alternative hypothesis includes these two as well as three additional parameters for the height, size, and starting position of the rectangular signal. setting all of these to fixed values gives k = 0, while allowing the height, size, or starting position to vary would each add a DOF (with the caveat that size and position are not identifiable under the null, where $h = 0$, so the regularity conditions of Wilks' theorem need not hold for them)

b) If a false detection costs 10^6 dollars, but a true detection gains you a dollar, what would be the detection bar?

c) Suppose that the height of the rectangular box is unknown. What would you do? Is the test optimal?

d) Suppose that the noise was complex (as in complex numbers), and that the boxed signal had a uniform "height" that is a complex number of unknown magnitude and phase. What would you do? Is the test optimal?

e) Suppose that the position of the box is unknown. How would you detect it then?
How would you compute the detection statistic at all positions at once using FFT?
What would be the detection bar in this situation, for the same financial conditions as in (Q1 b)? Use Monte Carlo to set the bar. What is the equivalent look-elsewhere effect?
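One possible way to get the window-sum statistic at every starting position in a single pass is a circular cross-correlation computed with the FFT. The sketch below is an illustration under stated assumptions, not a full answer: it assumes 0-based positions, treats the data as periodic (so the last $l - 1$ positions wrap around and should be discarded or handled by zero-padding), and uses a fixed seed. It does not set the bar; that would need the maximum of `lam` over many noise-only realizations.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed, illustrative only
n, l, h = 10**6, 100, 1.0

x = rng.normal(0.0, 1.0, n)
x[555:555 + l] += h              # inject the box at p = 555 (0-based, assumed)

# box template anchored at position 0
s0 = np.zeros(n)
s0[:l] = h

# circular cross-correlation: lam[p] = sum_i s0[i] * x[(i + p) % n]
lam = np.fft.irfft(np.fft.rfft(x) * np.conj(np.fft.rfft(s0)), n)

# cross-check a few positions against the direct window sum
for p in (0, 555, 12345):
    direct = h * x[p:p + l].sum()
    assert abs(lam[p] - direct) < 1e-6

print("largest statistic:", lam.max(), "at position", lam.argmax())
```

The cost is O(n log n) for all n positions, compared with O(n l) for evaluating each window sum directly.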

f) What is the amplitude of the box such that if we "inject" a signal with this amplitude, the detection probability is 50% [what about 10%, 99%, 99.99%]?

g) What would you do if the size of the box is unknown, and can take any width between 2 bins and 500 bins? Use Monte Carlo to set the bar. What is the equivalent look-elsewhere effect?

2) You are looking for the same box as in (1), but suppose that the noise is of the following form:
n(t) = np.random.normal(0,1,10**6) convolved with a normalized triangular shape of width 500 [normalized such that np.linalg.norm(triangle)=1].
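The noise recipe above can be sketched as follows. The exact triangle is not specified in the problem, so `np.bartlett(500)` is used here as one possible choice (an assumption), as are the `mode="same"` edge handling and the fixed seed.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed, illustrative only
n, width = 10**6, 500

# symmetric triangle with `width` taps (np.bartlett is one possible choice),
# rescaled to unit Euclidean norm
tri = np.bartlett(width)
tri /= np.linalg.norm(tri)

white = rng.normal(0.0, 1.0, n)
# mode="same" keeps n samples; the first/last ~width/2 samples see a
# truncated kernel and so have slightly lower variance
noise = np.convolve(white, tri, mode="same")

print("norm of triangle:", np.linalg.norm(tri))
print("sample std of noise:", noise.std())
```

With a unit-norm kernel each interior sample keeps unit variance, but neighbouring samples are now correlated over roughly the kernel width, which is what changes the detection statistic relative to Q1.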

a) Write a statistical model for the null (H0) and the alternative hypothesis (H1) in real space. What is the Neyman-Pearson detection statistic you would compute? [use matrix notation]

b) Write a statistical model for the null (H0) and the alternative hypothesis (H1) in Fourier space. What is the Neyman-Pearson detection statistic you would compute? Is it the same statistic? Simulate a signal and compute the score in several positions, both in real space and Fourier space, and make sure you get the same number up to machine precision.

c) Are you more sensitive with this noise source, or in Q1? What is the amplitude of the box such that if we "inject" a signal with this amplitude, the detection probability is 50% [what about 10%, 99%, 99.99%]?

d) What about the 50% detection amplitude as a function of the triangle's width? Can you compute it analytically? (A good approximation is OK.)

3) You are looking for the same box as in (1), but suppose that the noise is of the following form:
n(t) = np.random.normal(0,1,10**6) convolved with a normalized filter of width 500 and unknown shape [normalized such that np.linalg.norm(filter)=1].

a) Is it possible to obtain a good estimator of the filter? [Read about the Welch method AFTER trying to solve it yourself]

b) What is the impact of using the best-estimate filter in the detection statistic computed in (2)? Is this hampering detection at all?

c) Suppose that instead of 10**6 samples you have only 10**4 samples. How does this impact the precision of estimating the filter? Is this hampering detection at all?

4) Suppose you have the same situation as in (2), but after generating the noise, Gargamel chooses at random 10**3 samples and zeroizes them.

a) Compute the statistic from (2) in this situation, and plot their histogram. Is that the same histogram as in (2)? Would this interfere with detection?

b) Write the time-domain statistic relevant for detecting a signal at a particular place, taking the missing data into account.