# Gaussian Process Regression

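Assume the observations $y$ at inputs $X$ and the function values $f_*$ at test inputs $X_*$ are jointly multivariate normal with zero mean, where $\sigma_o$ is the observation noise standard deviation: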

$$\begin{bmatrix}
y \\
f_*
\end{bmatrix}
\sim \mathcal{N} \left(0, \begin{bmatrix}
K(X,X) + \sigma^2_o I & K(X,X_*) \\
K(X_*,X) & K(X_*,X_*)
\end{bmatrix}\right)$$

Then the posterior mean of $f_*$ is

$$\mathbb{E}[f_* | X, y, X_*] =
K(X_*,X) (K(X,X) + \sigma^2_o I)^{-1} y$$

and its posterior covariance is

$$\operatorname{cov}(f_*) = K(X_*,X_*) - K(X_*,X) \left(K(X,X) + \sigma^2_o I\right)^{-1} K(X,X_*)$$
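A minimal sketch of the two formulas above, using only `LinearAlgebra`. The function name `gp_posterior`, the kernel `kse` and the toy data are assumptions made for illustration; they are not part of the notebook. A Cholesky factorization replaces the explicit inverse.

```julia
using LinearAlgebra

# Squared-exponential kernel with length scale l (illustrative choice)
kse(a, b; l=1.0) = exp(-abs2(a - b) / (2l^2))

# Posterior mean and covariance of f_* given noisy observations y at X
function gp_posterior(k, X, y, Xs, σo)
    Kxx = [k(a, b) for a in X,  b in X]     # K(X,X)
    Kxs = [k(a, b) for a in X,  b in Xs]    # K(X,X_*)
    Ksx = [k(a, b) for a in Xs, b in X]     # K(X_*,X)
    Kss = [k(a, b) for a in Xs, b in Xs]    # K(X_*,X_*)
    C = cholesky(Symmetric(Kxx + σo^2 * I)) # factorize K(X,X) + σ_o² I
    μ = Ksx * (C \ y)                       # posterior mean
    Σ = Kss - Ksx * (C \ Kxs)               # posterior covariance
    return μ, Σ
end

# Toy data (hypothetical): noise-free samples of sin on a coarse grid
X  = collect(0.0:0.5:5.0)
y  = sin.(X)
Xs = [1.25, 2.75]
μ, Σ = gp_posterior(kse, X, y, Xs, 0.1)
```

The diagonal of `Σ` cannot exceed the prior variance $K(x_*,x_*) = 1$, since the subtracted term is positive semi-definite.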

Instead of specifying statistics for "full fields" like $p(t,x)$, can we specify statistics for the anomaly $\delta p(t,x) = p(t,x) -\hat{p}(t,x)$? By definition the anomaly has zero mean, i.e. $\hat{p}$ is a Reynolds average.

In [1]:
using Plots, Interact, Dates, DataFrames, LinearAlgebra, IterativeSolvers
using Netatmo

CSV_ARCHIVE  = "/lustre/storeB/users/roels/netatmo"
JSON_ARCHIVE = "/lustre/storeB/project/metproduction/products/netatmo/"


In [2]:
dtg       = DateTime(2019,5,1,0)
period    = Hour(72)
timerange = dtg:Minute(10):dtg + period
latrange  = 59.9:0.01:60      # small test domain (overridden below)
lonrange  = 10.7:0.01:10.8
latrange  = 52:0.01:65        # larger domain, the one actually used
lonrange  = 7:0.01:15
df = Netatmo.read(timerange, latrange=latrange, lonrange=lonrange)


Reading: CSV files 100%|█████████████████████████████████| Time: 0:00:39


Row,id,time_utc
,String,Int64
1,enc:16:RjcKJHB7i6/T7H5OVIuVH5Rvz/aHlMD/avLgo96otmA13e3gpK8Y4zhUby5ZBnoG,1556668801
2,enc:16:3DlBWufvVn5Y8frQVAfukWMKjiAlqvWkrXFhDh2h4/gPusjU1K0wZQ8SF0Vejnr+,1556668804
3,enc:16:8USjqWB8mr5di7cAMak+ZVA4I5ogvXj4Hyjl0re2lkAa211EN/omZxnkBFiCLGgW,1556668804
4,enc:16:hKBCs8IdkHb8mk03bAN6S1pBpH/EwMySvpwiCwa6iSojUCaCikjagvjA7V075gmv,1556668802
5,enc:16:e3UN4VZxKA/A1KMFuogZD+ObBZncqIbKW9akOdDr/ijgbfR/63eODmI4vulbTEMD,1556668800
6,enc:16:kVOs7FeKYJ1aX6expeNmcXP/w25jy6JPsVcFoj05hFEAkZt0s6bj7MmBGPwkhXMF,1556668804
7,enc:16:240orv4vbl7hybq5D2T4SiENhVx0skjH9X/OWvN5u9YRSdVuPw8w2spiM+MAlLnn,1556668804
8,enc:16:vCwDIMkxacAeW1vhD3Vp9Mhi3x/efCxdv2kVlKdAb7jGOvBPl1dZU5rRByiDTE+s,1556668802
9,enc:16:hGa54LSm6bPloqs5SEZ/WTfVPKCeUAp0hQ2ZFy4O5ITRbZWxTqluMmY3R1Ne/Ua0,1556669276
10,enc:16:DhnG3yKv5nrHM+PVQCmXpp34OjbjsQta33xc6oLh9riKZoKLFQJxszk4ACG4gkMT,1556668914


# Temporal Gaussian process regression  

We use Gaussian process regression to compute the pressure anomaly as the difference between the observed surface pressure and a smoothed surface pressure signal.
The squared-exponential kernel is 

$$K_{se}(t_1,t_2) = \exp \left( - \frac{ |t_1-t_2|^2 }{2 l^2} \right) $$

The Ornstein–Uhlenbeck kernel is 

$$K_{ou}(t_1,t_2) = \exp \left( - \frac{|t_1-t_2| }{l} \right) $$
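The two kernels can be compared on a small time grid. This is a sketch under assumptions: the three-argument signatures, the 10-minute sampling and the 1-hour length scale are chosen for illustration only.

```julia
using LinearAlgebra

# Kernels as written above; l is the length scale in the same units as t
Kse(t₁, t₂, l) = exp(-abs2(t₁ - t₂) / (2l^2))
Kou(t₁, t₂, l) = exp(-abs(t₁ - t₂) / l)

t = 0.0:600.0:6*3600.0     # hypothetical 10-minute sampling over 6 hours, in seconds
l = 3600.0                 # hypothetical 1-hour length scale

# Gram matrices K(X,X) for both kernels
Gse = [Kse(a, b, l) for a in t, b in t]
Gou = [Kou(a, b, l) for a in t, b in t]

# Correlation at a short lag (10 min) and at a long lag (3 h = 3l)
short_lag = (se = Gse[1, 2],  ou = Gou[1, 2])
long_lag  = (se = Gse[1, 19], ou = Gou[1, 19])
```

At short lags $K_{se}$ stays closer to 1 than $K_{ou}$, at long lags it decays faster, and the two cross at $|t_1-t_2| = 2l$. On this grid the $K_{se}$ Gram matrix is also much worse conditioned than the $K_{ou}$ one, which matters for the linear solves later on.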


## Impact of length scales

Set the length scale to 24 hours in the $K_{se}$ kernel and see the impact that the assumed $\sigma_o$ has on the anomaly at the end of the time window. That is, $\sigma_o$ is more than just a regularization parameter. 
Check station 371 on 20180510
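
One way to see why, assuming the anomaly is evaluated at the observation times, so that $X_* = X$: write $K = K(X,X)$ and $y$ for the observed pressures. The anomaly is then the residual of the smoother,

$$\delta p = y - K (K + \sigma_o^2 I)^{-1} y
= \left[(K + \sigma_o^2 I) - K\right] (K + \sigma_o^2 I)^{-1} y
= \sigma_o^2 (K + \sigma_o^2 I)^{-1} y$$

So $\sigma_o^2$ multiplies the anomaly directly, in addition to conditioning the linear solve.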

In [7]:
groupbyid = groupby(df,:id);
ranges = 1:1:48 
sigmaos = 0.0001:0.0001:0.001
indices=1:length(groupbyid)



mp = @manipulate for σₒ in sigmaos,  r in slider(ranges; label="Range"), index in indices 
    s = groupbyid[index]
    t, p = s.time_utc, s.pressure                          # Unix times [s] and pressures of one station
    # r is a range in hours; 60*60*r converts it to seconds
    Kou(t₁,t₂) = exp(-1/2*abs(t₁-t₂)/(1000000*60*60*r))   # Ornstein–Uhlenbeck; the 1/2 and 1e6 factors rescale l relative to the formula above
    Kse(t₁,t₂) = exp(-1/2*(t₁-t₂)^2/(60*60*r)^2)          # squared-exponential with l = r hours
    K  = [Kou(t₁,t₂) for t₁ in t, t₂ in t]                # swap in Kse to try the other kernel
    # Ks = [rbf(t1,t2) for t1 in datetime2unix.(timerange),     t2 in s1[:time_utc] ]     
    KpI = K + σₒ^2*I                                       # K(X,X) + σₒ² I
    timecg  = @elapsed q, cglog = cg(KpI,p,log=true)       # conjugate gradients
    timenor = @elapsed q2 = KpI\p                          # direct solve
    
    pshat2 = K*q2                                          # smoothed pressure from the direct solve
    pshat  = K*q                                           # smoothed pressure from CG
    
    scatter(unix2datetime.(t),p,marker=:o,label="Ps")     
    plot!(unix2datetime.(t),pshat,label="cg")  
    plot!(unix2datetime.(t),pshat2,label="nor")  
    
    plot!(title = "CG $cglog speedupfactor=$(round(timenor/timecg,digits=2))")
    plot!(legend=:bottomleft) 
end
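
The cell above compares `cg` from IterativeSolvers with the direct `\` solve. As a rough sketch of what the iterative solve does, here is a hand-written conjugate-gradient loop on a synthetic problem. The function `cg_sketch`, the time grid and the right-hand side are assumptions for illustration; this is not the IterativeSolvers implementation.

```julia
using LinearAlgebra

# Plain conjugate-gradient iteration for a symmetric positive definite A
function cg_sketch(A, b; tol=1e-10, maxiter=length(b))
    x  = zeros(length(b))
    r  = copy(b)                 # residual b - A*x for the initial guess x = 0
    p  = copy(r)                 # first search direction
    rs = dot(r, r)
    for _ in 1:maxiter
        Ap = A * p
        α  = rs / dot(p, Ap)     # step length along p
        x .+= α .* p
        r .-= α .* Ap
        rs_new = dot(r, r)
        sqrt(rs_new) <= tol * norm(b) && break   # relative residual small enough
        p .= r .+ (rs_new / rs) .* p             # next A-conjugate direction
        rs = rs_new
    end
    return x
end

# Synthetic stand-in for K(X,X) + σₒ² I with an Ornstein–Uhlenbeck kernel
t = collect(0.0:600.0:6*3600.0)                  # hypothetical 10-minute sampling, seconds
K = [exp(-abs(a - b) / 3600.0) for a in t, b in t]
A = K + 0.1^2 * I
b = sin.(t ./ 3600.0)                            # hypothetical right-hand side

x_cg     = cg_sketch(A, b; maxiter=10 * length(b))
x_direct = A \ b
```

Each iteration costs one matrix-vector product, so CG can pay off when $K + \sigma_o^2 I$ is large and well conditioned; a larger $\sigma_o$ improves the conditioning and usually reduces the iteration count.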