# Comparing Spectra

This notebook demonstrates how you can use the χ² metric to compare spectra.

In [None]:
using NeXLSpectrum
using DataFrames, Gadfly, InvertedIndices

In [None]:
specs = [ loadspectrum(joinpath(@__DIR__, "..","test","ADM6005a spectra","ADM-6005a_$i.msa")) for i in 1:15 ]


In [None]:
det = matching(specs[1], 128.0)

In [None]:
set_default_plot_size(8inch, 4inch)
elms = [ n"C",n"O",n"Al",n"Ca",n"Ge",n"Si",n"Ti",n"Zn" ]
# plot(specs..., xmax=12.0e3, klms=elms)

I'll present two different ways to compare spectra.
  * Direct spectrum to spectrum comparison (`χ²(...)`)
  * Comparing a spectrum to the sum of the other spectra (`similarity(...)`).

First, `χ²`.  When the spectra differ only by count statistics, this metric will be approximately equal to the number of channels in the range.
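As a rough check of that claim, here is a minimal sketch that simulates two spectra drawn from the same underlying intensities and evaluates a simple two-sample χ² statistic. The helpers `rand_poisson` and `chi2` are hypothetical names introduced for illustration, and the formula is a common textbook form that I am assuming here; it is not necessarily the exact expression `NeXLSpectrum.χ²` evaluates.

```julia
using Random

# Hypothetical helper: Knuth's multiplication method for Poisson variates.
# Fine for modest λ; Distributions.Poisson is the usual tool.
function rand_poisson(rng::AbstractRNG, λ::Real)
    limit, k, p = exp(-λ), 0, 1.0
    while true
        p *= rand(rng)
        p <= limit && return k
        k += 1
    end
end

# Assumed statistic for two spectra collected at the same dose.
chi2(a, b) = sum((x - y)^2 / max(x + y, 1) for (x, y) in zip(a, b))

rng = MersenneTwister(1)
nch = 2000
rates = [50.0 + 150.0 * rand(rng) for _ in 1:nch]  # arbitrary mean counts per channel
sim1 = [rand_poisson(rng, r) for r in rates]
sim2 = [rand_poisson(rng, r) for r in rates]
println("channels = ", nch, ", χ² = ", round(chi2(sim1, sim2), digits=1))
```

With only counting noise separating the two simulated spectra, the printed χ² should land close to the number of channels.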

In [None]:
fullroi = channel(100.0, det):channel(10.0e3, det)
χ²(specs, fullroi)

In [None]:
χ²(specs, NeXLSpectrum.fwhmroi(specs[1], n"Si K-L3"))

In [None]:
χ²(specs, NeXLSpectrum.fwhmroi(specs[1], n"Fe K-L3"))

In [None]:
χ²(specs, NeXLSpectrum.fwhmroi(specs[1], n"O K-L3"))

In [None]:
χ²(specs, NeXLSpectrum.fwhmroi(specs[1], n"Mg K-L3"))

However, the `χ²` matrices can be hard to interpret.  Which spectrum is the "problem child"?   What we really want to know is how each spectrum compares with the mean of the others independent of the length of the range of channels examined.

We want to retain the spectra that are most similar to the mean.  That is what `similarity(...)` is used for.
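To make the idea concrete, the sketch below computes a length-normalized χ² of each simulated spectrum against the mean of the others. Everything in it (`rand_poisson`, `similarity_sketch`, the equal-dose assumption, and the formula itself) is my own illustration of the concept, not the code behind `NeXLSpectrum.similarity`.

```julia
using Random, Statistics

# Hypothetical helper: Knuth's multiplication method for Poisson variates.
function rand_poisson(rng::AbstractRNG, λ::Real)
    limit, k, p = exp(-λ), 0, 1.0
    while true
        p *= rand(rng)
        p <= limit && return k
        k += 1
    end
end

# Assumed metric: χ² of spectrum i against the mean of the other spectra
# (all at equal dose), divided by the number of channels.
function similarity_sketch(spectra::Vector{Vector{Int}}, i::Int)
    others = [s for (j, s) in enumerate(spectra) if j != i]
    k = 1 / length(others)
    total = sum(others)  # channel-wise sum of the other spectra
    chi2 = sum((a - k * t)^2 / max(a + k^2 * t, 1.0) for (a, t) in zip(spectra[i], total))
    return chi2 / length(spectra[i])
end

rng = MersenneTwister(5)
rates = [20.0 + 180.0 * rand(rng) for _ in 1:1000]
fake_specs = [[rand_poisson(rng, r) for r in rates] for _ in 1:15]
sims = [similarity_sketch(fake_specs, i) for i in eachindex(fake_specs)]
println(round.(sims, digits=2))
```

Because the result is divided by the number of channels, values near one indicate a spectrum that differs from the rest only by counting noise, whatever the width of the channel range.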

In [None]:
NeXLSpectrum.similarity(specs, det, n"O")

Note that the metric for spectrum 2 is the largest at 1.23.  (This isn't large.)

Removing spectrum 2 improves most, but not all, of the metrics for the other spectra.

In [None]:
NeXLSpectrum.similarity(specs[Not(2)], det, n"O")

Overall, the mean similarity of the spectra improves.

In [None]:
using Statistics
mean(NeXLSpectrum.similarity(specs, det, n"O")), mean(NeXLSpectrum.similarity(specs[Not(2)], det, n"O"))

Let's tabulate the similarity for ranges of channels corresponding to the relevant elements in this material.

In [None]:
ENV["COLUMNS"]=200
df=DataFrame( 
    :Spectrum=>name.(specs), 
    map(elm->Symbol(elm.symbol)=>NeXLSpectrum.similarity(specs, det, elm), elms)...,
    :All => NeXLSpectrum.similarity(specs)
)
insertcols!(df, :Mean=>map(r->mean(r[2:9]), eachrow(df)))

In [None]:
describe(df[:,2:end], :mean, :std, :min, :max)

This is odd!  As I said above, the similarity metric should take on a minimum value of approximately one and yet the mean is universally less than one in the above table.  

What is happening?  This data suggests that this measurement is not count-statistics limited. That seems improbable unless the vendor is manipulating the data.
Let's look for more evidence.

I'm going to plot, channel by channel, the variance over the mean for the 15 spectra.  For Poisson statistics, this ratio should scatter around unity and should not sit systematically below it.
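For reference, here is a sketch of what the same ratio looks like for purely simulated Poisson data: 15 synthetic spectra with an arbitrary constant mean of 100 counts per channel. The helper `rand_poisson` and all of the numbers are assumptions chosen for illustration.

```julia
using Random, Statistics

# Hypothetical helper: Knuth's multiplication method for Poisson variates.
function rand_poisson(rng::AbstractRNG, λ::Real)
    limit, k, p = exp(-λ), 0, 1.0
    while true
        p *= rand(rng)
        p <= limit && return k
        k += 1
    end
end

rng = MersenneTwister(2)
nspec, nch = 15, 4000
counts = [rand_poisson(rng, 100.0) for i in 1:nspec, j in 1:nch]  # nspec × nch matrix

# Same variance-over-mean ratio as in the notebook cell, one value per channel.
ratio = [var(counts[:, ch]) / max(1.0, mean(counts[:, ch])) for ch in 1:nch]
println("mean ratio   = ", round(mean(ratio), digits=3))
println("std of ratio = ", round(std(ratio), digits=3))
```

With only 15 spectra, individual channels scatter quite a bit on either side of one, but the average over many channels stays close to unity. It is a sustained run of channels below one that would be surprising.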

In [None]:
cx = map(eachindex(specs[1])) do i
    var(s[i] for s in specs) / max(1.0, mean(s[i] for s in specs))
end
ss=Spectrum(specs[1].energy, cx, specs[1].properties)

plot(ss, klms=elms, yscale=0.3)

In [None]:
plot(ss, klms=elms, xmin=4000.0, xmax=7000.0, yscale=0.15)

## WTF?
There are entire ranges of channels that are clearly less than unity.  This should never happen.

It would seem that this is strong evidence that the spectra are being manipulated but for what purpose?  Pulse pair removal? Noise reduction? Escape peak removal?

What type of operations can produce sub-Poisson statistics?  Let's say the nominal measured value is $A$ with $\mathrm{var}(A)=A$ or equivalently uncertainty $\sigma(A) = \sqrt{A}$.  We are looking at $\frac{\mathrm{var}(A)}{A} = \frac{A}{A} \approx 1$ nominally. To get below unity, we need to either decrease the numerator or increase the denominator.  The numerator is controlled by very fundamental statistical reasoning based on the nature of the process.  There are very few assumptions and they are exceedingly basic.  
It seems that we can either add or multiply.
  * If $B = a A$, then $\sigma(B) = a \sqrt{A}$ and $\mathrm{var}(B) = a^2 A$.  So $\frac{\mathrm{var}(B)}{B} = \frac{a^2 A}{aA} = a$. So when $a < 1$, this can produce sub-Poissonian statistics.
  * Alternatively, consider $C=A+c$ where $c$ is a noise-free constant ($\sigma(c)=0$).  $\sigma(C) = \sigma(A+c) = \sqrt{\sigma(A)^2 + \sigma(c)^2} = \sqrt{\sigma(A)^2} = \sigma(A) = \sqrt{A}$ which means $\mathrm{var}(C) = A$, so $\frac{\mathrm{var}(C)}{C} = \frac{A}{A+c} < 1$ when $c>0$.
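Both cases are easy to check numerically. The sketch below draws many Poisson values with mean 100 and compares the variance-to-mean ratio for the raw, scaled, and offset data. The helper names `rand_poisson` and `dispersion`, and the values chosen for $a$ and $c$, are arbitrary assumptions for illustration.

```julia
using Random, Statistics

# Hypothetical helper: Knuth's multiplication method for Poisson variates.
function rand_poisson(rng::AbstractRNG, λ::Real)
    limit, k, p = exp(-λ), 0, 1.0
    while true
        p *= rand(rng)
        p <= limit && return k
        k += 1
    end
end

dispersion(x) = var(x) / mean(x)  # variance-to-mean ratio

rng = MersenneTwister(3)
A = [rand_poisson(rng, 100.0) for _ in 1:50_000]
a, c = 0.8, 100.0  # illustrative scale factor and noise-free offset

println("A     : ", round(dispersion(A), digits=3))       # Poisson baseline, expected ≈ 1
println("a*A   : ", round(dispersion(a .* A), digits=3))  # expected ≈ a
println("A + c : ", round(dispersion(A .+ c), digits=3))  # expected ≈ 100 / (100 + c)
```

Scaling by $a$ pulls the ratio down to roughly $a$, and adding the constant pulls it down to roughly $A/(A+c)$, in line with the two bullets above.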

So either the detector is adding some noise-free constant to the signal or it is downscaling the signal.  Again, why?

  * Pulse pair removal would be equivalent to alternative two but with $c<0$ so this isn't it.
  * Escape peak removal is similar to pulse pair removal.  Again, this isn't it.
  * The only reasonable suggestion would seem to be that the signal is being downscaled in certain energy regions.  Why?  Why would you want to return a smaller signal than the one the detector is measuring?

  *This seems scandalous to me. We can't trust our detectors.*

In [None]:
hstack(plot(y=map(s->s[:ProbeCurrent], specs), Geom.point), plot(y=map(s->s[:LiveTime], specs), Geom.point))

We expect a bit of variation in O since the soft X-ray is quite susceptible to absorption and topography.  

Let's remove spectra 1 and 4 and see what happens.

In [None]:
using Statistics
kept = specs[Not([1, 4])]  # drop spectra 1 and 4, as described above
for elm in elms
    sim = NeXLSpectrum.similarity(kept, det, elm)
    println((mean(sim), std(sim)))
end

As we increase the X-ray energy, the variability decreases.

Let's try applying these functions to a spectrum that we know should compare well since they represent sub-samplings of the same source.

  * `subdivide(...)` takes a single spectrum and distributes the counts at random among N spectra, creating N spectra that sum to the original spectrum.
  * `subsample(...)` takes a single spectrum and emulates taking a fraction of the same live-time.  The results won't necessarily sum to the original.
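The difference between the two is easy to see on plain count vectors. The functions below (`subdivide_counts` and `subsample_counts`) are hypothetical stand-ins that I wrote to illustrate the idea; they operate on integer vectors rather than `Spectrum` objects and are not the NeXLSpectrum implementations.

```julia
using Random

# Hypothetical stand-in for subdivide: every count lands in one of n parts at random.
function subdivide_counts(rng::AbstractRNG, counts::Vector{Int}, n::Int)
    parts = [zeros(Int, length(counts)) for _ in 1:n]
    for (ch, c) in enumerate(counts)
        for _ in 1:c
            parts[rand(rng, 1:n)][ch] += 1
        end
    end
    return parts
end

# Hypothetical stand-in for subsample: keep each count with probability frac.
subsample_counts(rng::AbstractRNG, counts::Vector{Int}, frac::Float64) =
    [count(_ -> rand(rng) < frac, 1:c) for c in counts]

rng = MersenneTwister(4)
raw = rand(rng, 0:500, 1024)            # made-up channel counts
parts = subdivide_counts(rng, raw, 8)
sub = subsample_counts(rng, raw, 0.1)

println(sum(parts) == raw)              # the eight parts add back up to the original
println(sum(sub) / sum(raw))            # close to 0.1, but not exactly 0.1
```

In this sketch the subdivided parts are constrained to add back up to the original, while each subsample is an independent thinning of it, so repeated subsamples need not sum to anything in particular.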

In [None]:
sd=mapreduce(_->subdivide(specs[1], 8), append!, 1:6)
describe(DataFrame(
    :Spectrum=>eachindex(sd),
    [ Symbol(symbol(elm))=>NeXLSpectrum.similarity(sd, det, elm) for elm in elms]...
)[:,2:end], :mean, :std, :max, :min)

In [None]:
sd2=mapreduce(_->map(i->subsample(specs[1], 0.1),1:8),append!,1:10)
describe(DataFrame(
    :Spectrum=>eachindex(sd2),
    [ Symbol(symbol(elm))=>NeXLSpectrum.similarity(sd2, det, elm) for elm in elms]...
)[:,2:end], :mean, :std, :max, :min)

Interestingly, these are consistently slightly less than unity.  Why?

In [None]:
using Distributions

In [None]:
σ=10.0
n=Normal(0.0,σ)
mean(mean((rand(n,15).^2))-σ^2 for i in 1:100000)

In [None]:
p1, p2 = Dict(:ProbeCurrent=>1.0, :LiveTime=>10.0),Dict(:ProbeCurrent=>1.0, :LiveTime=>0.99*40.0)
r = rand(1:10000, 2048)
d1, d2 = Poisson.(r), Poisson.(4r)
s1 = Spectrum(det.scale, [ rand(d) for d in d1], p1)
s2 = Spectrum(det.scale, [ rand(d) for d in d2], p2)
NeXLSpectrum.similarity(s1,s2,1:2048)

In [None]:
plot(s1,s2)

In [None]:
p1, p2 = Dict(:ProbeCurrent=>1.0, :LiveTime=>10.0),Dict(:ProbeCurrent=>1.0, :LiveTime=>40.0)
mean(map(1:1000) do i
    r = rand(1:100, 2048)
    d1, d2 = Poisson.(r), Poisson.(4r)
    s1 = Spectrum(det.scale, [ rand(d) for d in d1], p1)
    s2 = Spectrum(det.scale, [ rand(d) for d in d2], p2)
    NeXLSpectrum.similarity(s1, s2, 10:20)
end)

In [None]:
p=plot(specs[1],duanehunt=true, xmin=17000.0)
# p |> SVG(joinpath(homedir(), "Desktop", "duane_hunt.svg"), 6inch, 4inch)
p

In [None]:
duane_hunt(specs[1])