
adapt a diagonal mass matrix by default #35

Closed
tpapp opened this issue Jan 10, 2019 · 8 comments · Fixed by #44
Comments

@tpapp
Owner

tpapp commented Jan 10, 2019

With a short sample, even a regularized covariance matrix is mostly estimating noise in the off-diagonal elements. Either always use a diagonal matrix, or allow customization but make diagonal the default.

AFAIK this is what Stan does, too.
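
For reference, a minimal standalone sketch (not the package API; the shrinkage weight λ is illustrative, not Stan's exact constants) of what the two options amount to when estimating M⁻¹ from a matrix of warmup draws Q (draws × dimension):

using LinearAlgebra, Statistics

# Q: warmup draws, one row per draw, one column per coordinate.
# λ is an illustrative shrinkage weight towards the identity, not Stan's exact choice.
function estimate_Minv(Q::AbstractMatrix; diagonal::Bool = true, λ = 5 / (size(Q, 1) + 5))
    if diagonal
        Diagonal((1 - λ) .* vec(var(Q; dims = 1)) .+ λ)   # per-coordinate variances only
    else
        (1 - λ) .* cov(Q) + λ * I                         # full covariance; off-diagonals noisy for short samples
    end
end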

@cscherrer

Here is some discussion indicating this may not be the case.

I'd really like the ability to specify or update the mass matrix, along the lines of this suggestion from 5 years ago; AFAIK the Stan folks haven't gotten to it yet :)

@chriselrod

chriselrod commented Feb 27, 2019

A little off topic, but I think the mass matrix construction could be made a little more efficient.
Right now, from here:

GaussianKE(Minv::AbstractMatrix) = GaussianKE(Minv, cholesky(inv(Minv)).L)
rand(rng::AbstractRNG, κ::GaussianKE, q = nothing) = κ.W * randn(rng, size(κ.W, 1))

this could be:

GaussianKE(Minv::AbstractMatrix) = GaussianKE(Minv, inv(cholesky(Minv).U))
rand(rng::AbstractRNG, κ::GaussianKE, q = nothing) = κ.W * randn(rng, size(κ.W, 1))

W would then be an upper triangular matrix such that W*W' = M, instead of a lower triangular matrix as it is now. Either way, if z ~ Normal(0, I), then W * z = x ~ Normal(0, W*W').

The former inverse computes the Cholesky decomposition and then potri!, which essentially computes the triangular inverse while multiplying the result by its own transpose. Computing just the inverse of a triangular matrix will generally be faster.

Of course, it only gets called once, so the impact on performance is going to be negligible.
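
For what it's worth, a quick standalone check (not package code) that both constructions satisfy the claimed property W * W' == M:

using LinearAlgebra

Minv = let A = randn(5, 5)
    Symmetric(A * A' + 5I)          # a positive definite inverse mass matrix
end
M = inv(Minv)
W_lower = cholesky(M).L             # current construction: cholesky(inv(Minv)).L
W_upper = inv(cholesky(Minv).U)     # proposed construction
@assert W_lower * W_lower' ≈ M
@assert W_upper * W_upper' ≈ M      # either factor maps z ~ N(0, I) to W * z ~ N(0, M)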

@tpapp
Owner Author

tpapp commented Feb 28, 2019

@chriselrod: good point; I will address this as part of #30. Yes, it is only called maybe 5-8 times (during the adaptation), but there is no reason we should not do this elegantly.

@chriselrod

What would it take to implement this? Is it just using var instead of cov, or do other parts of the algorithm change?
Any differences in regularization?
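
To make the question concrete, here is a rough sketch (illustrative names, not the current GaussianKE API) of what I imagine the diagonal case reduces to: given regularized per-coordinate variances, W needs no Cholesky factorization or dense inverse at all.

using LinearAlgebra

# v: regularized per-coordinate variances, i.e. diag(M⁻¹), e.g. from var of the warmup draws
function diagonal_kinetic_factor(v::AbstractVector)
    Minv = Diagonal(v)
    W = Diagonal(inv.(sqrt.(v)))   # W * W' == inv(Minv) == M; no dense cholesky or inverse required
    (Minv = Minv, W = W)
end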

I'm struggling with a model that converges fine in Stan, and realized I can reproduce Julia's failure to converge in Stan by specifying "metric = dense_e".
Using otherwise default settings (just adapt_delta = 0.995 and max_treedepth = 15):

sampler_params <- get_sampler_params(itp_res, inc_warmup = FALSE)
(mean_accept_stat_by_chain <- sapply(sampler_params, function(x) mean(x[, "accept_stat__"])))
# [1] 0.9973673 0.9930246
(mean_stepsize_stat_by_chain <- sapply(sampler_params, function(x) mean(x[, "stepsize__"])))
# [1] 0.006663567 0.011519709
(max_treedepth_by_chain <- sapply(sampler_params, function(x) max(x[, "treedepth__"])))
# [1] 10  9

and I'm trying to find out how to get it to converge in Julia:

julia> NUTS_statistics(mcmc_chain2)
Hamiltonian Monte Carlo sample of length 500
  acceptance rate mean: 0.82, min/25%/median/75%/max: 0.29 0.74 0.86 0.94 1.0
  termination: MaxDepth => 100%
  depth: 15 => 100%

julia> tuned_sampler
NUTS sampler in 94 dimensions
  stepsize (ϵ)  4.44e-5
  maximum depth = 15
  Gaussian kinetic energy, √diag(M⁻¹): [1.895813313637011e-5, 0.0001115812362165227, 9.157446164012355e-5, 4.46418426218662e-5, 1.8307063191189693e-5, 3.200949304693278e-5, 2.1938155774762758e-5, 5.9087940848096736e-5, 3.73472302349929e-5, 5.274846520513566e-5, 5.367501293396515e-5, 3.1397511915424646e-5, 6.08221582019119e-5, 5.4267538547114185e-5, 5.29947670788417e-5, 0.00011075177985935453, 7.609173377511799e-5, 0.00010532538379063398, 8.51499318202174e-5, 2.478363462668975e-5, 3.658987692014494e-5, 2.007888754739028e-5, 0.00011127385432189406, 1.3223131609749478e-5, 0.00016837339824462473, 1.1839460215858603e-5, 6.585270064600971e-5, 4.19335185808455e-5, 5.544623760676068e-5, 9.784890108486913e-5, 3.311545937725982e-5, 2.471979711349106e-5, 3.112412949389102e-5, 9.570724240006883e-5, 3.49521924945658e-5, 6.0333831183082444e-5, 4.840991138763101e-5, 2.783856460714573e-5, 4.831540994951892e-5, 4.439850328987186e-5, 6.071328215069777e-5, 7.347164675861388e-5, 4.974277394209337e-5, 9.255844061671983e-5, 2.957893020284297e-5, 3.0813197939751596e-5, 0.00011275072683006586, 0.00016486450332804875, 2.436513209762428e-5, 7.000045840614247e-5, 4.893351242935738e-5, 1.9587454417079087e-5, 3.789299805871494e-5, 8.862310005896023e-5, 6.167247733285533e-5, 7.58130680565668e-5, 4.376895615887048e-5, 5.861454040079281e-5, 0.0001342476133258934, 3.45328862714302e-5, 7.746663993962745e-5, 6.656366056200958e-5, 1.4143934443648073e-5, 8.328992141907225e-6, 2.274360501496452e-5, 5.523437355505674e-5, 2.6838929735689244e-5, 2.070253693705773e-5, 7.425841165122632e-5, 3.64062126857918e-5, 8.442014624480622e-5, 5.496281774664461e-5, 3.87778371731205e-5, 9.127710121098522e-5, 1.3767075754649895e-5, 8.13280725468518e-5, 9.41620226522111e-5, 2.3713593495284312e-5, 7.345259532704505e-5, 2.5847927065962024e-5, 3.4417125925637836e-5, 4.685867281541214e-5, 4.993549553072007e-5, 1.181268068574348e-5, 0.0001727602477423264, 3.904524248961542e-5, 0.0001584036741565338, 5.5143612458872284e-5, 9.972190958460448e-5, 7.388235798203804e-5, 1.5004410870109371e-5, 2.0118912968850808e-5, 5.401727489332943e-5, 2.3884781934174396e-5]

I get similar tiny step sizes and convergence failure in Stan with the dense metric. Admittedly, this was based on a single simulation, but I have started 8 more to confirm that pattern. I'll report later.

There are of course other differences, e.g. #25, which Stan seems to do, based on slide 14.

Hopefully just a diagonal matrix will help. I'm struck by this difference (and even though Stan failed with a dense matrix, I'm still combing through my Julia code, hunting for bugs).
Stan had a treedepth of at most 10, far larger step sizes, and great acceptance probabilities. adapt_delta was set to 0.995 to avoid divergent transitions; I had 3 divergent transitions with adapt_delta = 0.95, and the two chains' step sizes were 0.039 and 0.0227. Stan converged to the correct values and had great effective sample sizes relative to the number of samples. The Julia version did not converge at all (the highest effective sample size I achieved was 10).

FWIW, the log likelihood and gradient evaluation in Julia is so much faster that, even while maxing out the tree depth, it still took less time per iteration. Of course, less time per sample is meaningless if you have approximately zero effective samples per sample.

@tpapp
Owner Author

tpapp commented Apr 5, 2019

What would it take to implement this?

Time. 😉

I am currently working on #30 and related issues, and have postponed all of these changes. This may not be ideal though, so perhaps I should address this now and organize a nice API for it later.

Will look into this soon because I am also running into these problems.

I would be grateful if you could contribute the aforementioned problem as a test; a gist is fine.

@chriselrod

chriselrod commented Apr 7, 2019

Well, embarrassingly, the problem was a bug in my gradients!

The code I was running has about a dozen unregistered dependencies, so I wrote an example simply using common libraries to share. That one converged.

Stan does struggle when specifying a dense energy matrix.
Of the 8 simulations I started before making my comment, only one has finished (2000 total iterations) -- with effective sample sizes of 1 for most parameters.
It converges well in Stan if given a high adapt delta, but even better in DynamicHMC with simply the defaults.

If you'd like it as another test anyway:

using ForwardDiff, TransformVariables, LogDensityProblems, DynamicHMC, Distributions, Parameters, Random, LinearAlgebra

struct LKJ{T}
    η::T
end

function Distributions.logpdf(lkj::LKJ, L::AbstractMatrix{T}) where {T}
    out = zero(T)
    K = size(L,1)
    η = lkj.η
    for k ∈ 2:K
        out += (K + 2η - k - 2) * log(L[k,k])
    end
    out
end

@with_kw struct ITPData
    Y₁::Array{Float64,3}
    Y₂::Array{Float64,3}
    time::Vector{Float64}
    δtime::Vector{Float64}
    domains::Matrix{Float64}
end
# Could use for loops (more efficient in general), but don't want
# autodiff to track individual values because of downstream
# matrix arithmetic.
function create_domain_matrix(domains)
    domains_matrix = zeros(sum(domains), length(domains))
    ind = 0
    for (i,domain) ∈ enumerate(domains)
        for d ∈ 1:domain
            ind += 1
            domains_matrix[ind,i] = 1.0
        end
    end
    domains_matrix
end
function ITPData(Y₁, Y₂, time, domains::Union{Vector{Int},NTuple{N,Int} where N})
    ITPData(Y₁, Y₂, time, diff(time), create_domain_matrix(domains))
end

(data::ITPData)(t::NamedTuple) = data(; t...)
function (data::ITPData)(; ρ, logκ, θ, U, μₕ₁, μₕ₂, μᵣ₁, μᵣ₂, βᵣ₁, βᵣ₂, σₕ, σᵦ, logσ)
    @unpack Y₁, Y₂, time, δtime, domains = data
    
    K, N₁, T = size(Y₁)
    K, N₂, T = size(Y₂)
    N = N₁ + N₂

    # Priors
    target = logpdf(Beta(2,2), ρ)
    κ = exp.(logκ)
    target += sum(logκ) # add log jacobian for ℝ → ℝ₊ transform
    target += sum(κᵢ -> logpdf(Gamma(0.1, 10.0), κᵢ), κ) 
    target -= 0.005sum(abs2, θ) # Normal(0, 10)
    target += logpdf(LKJ(2), U)
    σ = exp.(logσ)
    target += sum(logσ) # add log jacobian for ℝ → ℝ₊ transform
    target += sum(σᵢ -> logpdf(Gamma(0.1, 10), σᵢ), σ)
    target -= 0.005σₕ^2 # (Half-)Normal(0, 10)
    target -= 0.005σᵦ^2 # (Half-)Normal(0, 10)

    target -= 0.02μₕ₁^2 # Normal(0, 5)
    target -= 0.02μₕ₂^2 # Normal(0, 5)
    target -= 0.5sum(abs2, μᵣ₁) # Normal(0, 1) 
    target -= 0.5sum(abs2, μᵣ₂) # Normal(0, 1) 
    target -= 0.5sum(abs2, βᵣ₁) # Normal(0, 1) 
    target -= 0.5sum(abs2, βᵣ₂) # Normal(0, 1)
    
    # Center the uncentered parameters
    μᵦ₁ = domains * (μᵣ₁ .* σₕ .+ μₕ₁)
    μᵦ₂ = domains * (μᵣ₂ .* σₕ .+ μₕ₂)
    β₁ = @. μᵦ₁ + σᵦ * βᵣ₁
    β₂ = @. μᵦ₂ + σᵦ * βᵣ₂


    # AR1 matrix diagonal and negative offdiagonal (dropping first element of diagonal, which equals 1)
    if ρ > 0.5
        ρᵗ = (2ρ-1) .^ δtime
    else
        ρᵗ = -1 .* (1-2ρ) .^ δtime
    end
    ARdiag = @. 1 / sqrt(1 - ρᵗ * ρᵗ)
    nARoffdiag = @. ARdiag * ρᵗ
    
    # Expected values
    μ₁ = @. θ + β₁ * (1 - exp(-κ * time'))
    μ₂ = @. θ + β₂ * (1 - exp(-κ * time'))

    L = σ .* U'
    # Matrix Normal log determinants
    # ARdiag is the diagonal of the cholesky factor of the inverse AutoRegressive-1 matrix
    target += N * K * sum(log, ARdiag)
    target -= N * T * sum(log, diag(L))
    
    
    # Matrix Normal Kernel
    # L⁻¹ = inv(U')

    δY₁ = Y₁ .- reshape(μ₁, (K,1,T))
    δY₂ = Y₂ .- reshape(μ₂, (K,1,T))
    local L⁻¹δY₁, L⁻¹δY₂
    try
        L⁻¹δY₁ = reshape(L \ reshape(δY₁, (size(Y₁,1), size(Y₁,2) * size(Y₁,3))), size(Y₁))
        L⁻¹δY₂ = reshape(L \ reshape(δY₂, (size(Y₂,1), size(Y₂,2) * size(Y₂,3))), size(Y₂))
    catch err # There's probably a better way.
        if isa(err, SingularException)
            return -Inf
        else
            rethrow(err)
        end
    end
    target -= 0.5sum(abs2, L⁻¹δY₁[:,:,1])
    for t ∈ 2:T
        target -= 0.5sum( (@. (L⁻¹δY₁[:,:,t] * ARdiag[t-1] - L⁻¹δY₁[:,:,t-1] * nARoffdiag[t-1] ) ^ 2 ) )
    end

    target -= 0.5sum(abs2, L⁻¹δY₂[:,:,1])
    for t ∈ 2:T
        target -= 0.5sum( (@. (L⁻¹δY₂[:,:,t] * ARdiag[t-1] - L⁻¹δY₂[:,:,t-1] * nARoffdiag[t-1] ) ^ 2 ) )
    end
    
    isfinite(target) ? target : -Inf
end


function ITP_transform(D, K, T)
    as(
        (ρ = as𝕀, logκ = as(Array, K), θ = as(Array, K), U = CorrCholeskyFactor(K),
        μₕ₁ = asℝ, μₕ₂ = asℝ, μᵣ₁ = as(Array, D), μᵣ₂ = as(Array, D),
        βᵣ₁ = as(Array, K), βᵣ₂ = as(Array, K), σₕ = asℝ₊, σᵦ = asℝ₊, logσ = as(Array, K))
    )
end


ITP_D4_K9_T24 = ITP_transform(4, 9, 24);


domain_set = (2,2,2,3)
T = 24; K = sum(domain_set); D = length(domain_set);
domain_mat = create_domain_matrix(domain_set);
ρ = 0.7;
κ = 0.03125 .* ( +([randexp(K) for i ∈ 1:8]...) ); # ~ Gamma(8, 1/32)
σd = 0.0625 * sum(randexp(4));
θ = 2.0 .* randn(K);
S = randn(K,4K) |> x -> x * x';
L = cholesky(S).L ./ 4;
μₕ₁, μₕ₂ = -3.0, 9.0;
μᵦ₁ = domain_mat * (μₕ₁ .+ randn(D))
μᵦ₂ = domain_mat * (μₕ₂ .+ randn(D))
β₁ = μᵦ₁ .+ σd .* randn(K);
β₂ = μᵦ₂ .+ σd .* randn(K);

δtime = 0.06125 .* ( +([randexp(T-1) for i ∈ 1:8]...) ); # ~ Gamma(8, 1/16)
time = vcat(0.0, cumsum(δtime));

μ₁ = @. θ + β₁ * (1 - exp(-κ * time'))
μ₂ = @. θ + β₂ * (1 - exp(-κ * time'))

ρᵗ = ρ .^ δtime;
ARdiag = @. 1 / (1 - ρᵗ * ρᵗ);
ARoffdiag = @. - ARdiag * ρᵗ;
AR1_chol_inverse = Bidiagonal(vcat(1.0,ARdiag), ARoffdiag, :U);
AR1_chol = inv(AR1_chol_inverse);

N₁, N₂ = 55, 55;
Y₁ = reshape(reshape(L * randn(K, N₁ * T), (K * N₁, T)) * AR1_chol, (K, N₁, T)) .+ reshape(μ₁, (K, 1, T));
Y₂ = reshape(reshape(L * randn(K, N₂ * T), (K * N₂, T)) * AR1_chol, (K, N₂, T)) .+ reshape(μ₂, (K, 1, T));

itpdata = ITPData(Y₁, Y₂, time, δtime, domain_mat);


TLD = TransformedLogDensity(ITP_D4_K9_T24, itpdata);
# aditp_reverse = ADgradient(Val(:ReverseDiff), TLD); # slow
aditp_forward = ADgradient(Val(:ForwardDiff), TLD);
# aditp_flux = ADgradient(Val(:Flux), TLD); # error; doesn't work with TransformVariables

x = randn(94);
logdensity(LogDensityProblems.ValueGradient, aditp_forward, x)

@time mcmc_chain, tuned_sampler = NUTS_init_tune_mcmc(aditp_forward, 1000);

NUTS_statistics(mcmc_chain)
tuned_sampler
using MCMCDiagnostics
chain_matrix = get_position_matrix(mcmc_chain);
[effective_sample_size(chain_matrix[:,i]) for i in 1:10]'

It takes a couple hours to run.

The optimized version runs in about half a minute.
I still have the impression that the effective sample sizes, relative to the number of samples, aren't quite as good as with the version above. But if the gradients and densities match, I don't see why that would be the case.

The basic idea I have been working on is providing reasonably optimized probability functions that also return their analytical gradients. Other functions / transformations can be given optimized Jacobians, often with their own types, so that they can be much more efficient than dense matrix multiplication. The simplest (and perhaps most common) example is a diagonal matrix -- although that (or the equivalent) is something I think all the AD libraries are already doing (e.g., with broadcasting).
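
To illustrate the idea (hypothetical names, not an existing API): a density that returns its value together with an analytically computed gradient, and a transformation whose Jacobian is represented structurally as a Diagonal, so the chain rule is an O(n) elementwise product rather than a dense matrix multiplication.

using LinearAlgebra

# log density of x ~ Normal(0, σ²I), returned together with its analytical gradient
function normal_logpdf_and_gradient(x::AbstractVector, σ::Real)
    value = -0.5 * (length(x) * log(2π * σ^2) + sum(abs2, x) / σ^2)
    gradient = x ./ (-σ^2)
    (value = value, gradient = gradient)
end

# y = exp.(x) has Jacobian Diagonal(exp.(x)); pulling a gradient back is O(n)
pullback_through_exp(x::AbstractVector, ∇y::AbstractVector) = ∇y .* exp.(x)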

@tpapp
Owner Author

tpapp commented Apr 9, 2019

@chriselrod: just to make sure I get this right:

  1. there is no issue with this specific model now, and
  2. the problem was miscalculated gradients?

Do you think that some utility function for testing gradient calculations (eg against finite differences or ForwardDiff) in LogDensityProblems would help with issues like this?
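
Something along these lines might be enough (a rough sketch, assuming ForwardDiff as a reference and a central finite-difference fallback; ℓ is a plain x ↦ log density closure and ∇ℓ the user's gradient, both hypothetical names):

using ForwardDiff, LinearAlgebra

function check_gradient(ℓ, ∇ℓ, x::AbstractVector; h = 1e-6, atol = 1e-4)
    ad = ForwardDiff.gradient(ℓ, x)        # reference gradient via forward-mode AD
    fd = map(eachindex(x)) do i            # central finite differences
        e = zeros(length(x)); e[i] = h
        (ℓ(x .+ e) - ℓ(x .- e)) / (2h)
    end
    user = ∇ℓ(x)
    (ad_error = norm(user - ad, Inf), fd_error = norm(user - fd, Inf),
     ok = isapprox(user, ad; atol = atol) && isapprox(user, fd; atol = atol))
end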

@tpapp
Owner Author

tpapp commented Apr 9, 2019

I opened tpapp/LogDensityProblems.jl#42 for AD testing.

tpapp closed this as completed in #44 on Aug 12, 2019