# Gravitational Wave Approximate LiKelihood (GWALK)

Welcome to GWALK, a truncated Gaussian code originally written to approximate gravitational-wave likelihood functions for the parameter estimation samples provided along with the major Gravitational-Wave Transient Catalog publications.

We intend to provide an object structure for truncated Gaussians which allows for efficient evaluation, parameterization, normalization, optimization, and serialization.



In [1]:
# Import numpy
import numpy as np

# Declare a seed
SEED = 42

# Initialize a random state object
RS = np.random.RandomState(SEED)

## Parameterization ##

The $\mu$, $\sigma$, and correlation parameters of the multivariate normal distribution can be stored with fewer elements than their matrix counterparts. Working with this minimal set is essential to avoid over-constraining the parameters of the distribution.

For each N-dimensional Gaussian, we store:
- 1 Parameter for a normalization constant, denoted "offset" (in log scale, this is added to the log of the pdf)
- N Parameters for the location of the Gaussian ($\mu$), denoted "mu"
- N Parameters for the width of the Gaussian ($\sigma$), denoted "std"
- $N(N-1)/2$ Parameters for the correlation coefficients of the Gaussian, denoted "cor"

We also store a scale factor for each parameter, which is used in calculations to ortho-normalize the distribution. This avoids many of the eigenvalue issues that arise in the scipy representation when coordinates have vastly different ranges (e.g. $\sigma_{\tilde{\Lambda}} \sim 1000$ and $\sigma_{\chi_{\mathrm{eff}}} \sim 0.1$).
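
To illustrate why a per-coordinate scale factor helps, the following is a minimal NumPy sketch, not GWALK's implementation. The two widths, the correlation value, and the choice of the widths themselves as scale factors are all assumptions made for illustration. It compares the condition number of a covariance matrix before and after each coordinate is divided by its scale.

```python
import numpy as np

# Hypothetical two-parameter example: widths differing by four orders of magnitude
std = np.array([1000.0, 0.1])
cor = np.array([[1.0, 0.3],
                [0.3, 1.0]])

# Covariance in the original coordinates: cov_ij = std_i * cor_ij * std_j
cov = cor * np.outer(std, std)

# Assumed scale factors: one per coordinate, here simply the widths
scale = std.copy()

# Covariance after dividing each coordinate by its scale factor
cov_scaled = cov / np.outer(scale, scale)

# The ratio of largest to smallest eigenvalue measures how badly
# conditioned the matrix is for inversion and decomposition
cond_raw = np.linalg.cond(cov)
cond_scaled = np.linalg.cond(cov_scaled)
print("condition number, raw:   ", cond_raw)
print("condition number, scaled:", cond_scaled)
```

In the rescaled coordinates the covariance reduces to the correlation matrix, so its eigenvalues are of order one regardless of the original units.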

We provide compiled code which efficiently translates these parameters to and from the $\mu$ and $\Sigma$ matrix parameterization, and we have developed tests to ensure that a given choice of parameters satisfies the constraints of the multivariate normal distribution.

The object representation for the truncated Gaussian handles this for the user.
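
The layout described above can be sketched in plain NumPy. The helpers `pack_params` and `unpack_params` are hypothetical names for illustration and are not part of GWALK. The row-by-row ordering of the upper-triangle correlations is also an assumption. The sketch packs `offset`, `mu`, `std`, and the $N(N-1)/2$ independent correlations into one flat vector, then rebuilds the covariance and checks that it is positive definite.

```python
import numpy as np

def pack_params(offset, mu, std, cor):
    """Flatten (offset, mu, std, cor) into 1 + 2N + N(N-1)/2 numbers."""
    n = mu.size
    # Strict upper triangle of the correlation matrix, taken row by row
    # (assumed ordering, chosen for this sketch)
    iu = np.triu_indices(n, k=1)
    return np.concatenate([np.atleast_1d(offset), mu, std, cor[iu]])

def unpack_params(x, n):
    """Invert pack_params and rebuild the full covariance matrix."""
    offset = x[0]
    mu = x[1:n + 1]
    std = x[n + 1:2 * n + 1]
    # Fill the upper triangle, mirror it, and put ones on the diagonal
    upper = np.zeros((n, n))
    upper[np.triu_indices(n, k=1)] = x[2 * n + 1:]
    cor = upper + upper.T + np.eye(n)
    # cov_ij = std_i * cor_ij * std_j
    cov = cor * np.outer(std, std)
    return offset, mu, std, cor, cov

offset = np.asarray([0.])
mu = np.asarray([1., 2., 3.])
std = np.asarray([0.1, 0.5, 0.3])
cor = np.asarray([[1., 0., 0.3], [0., 1., 0.5], [0.3, 0.5, 1.]])

x = pack_params(offset, mu, std, cor)
_, mu2, std2, cor2, cov2 = unpack_params(x, mu.size)

# A valid covariance matrix must be symmetric with positive eigenvalues
is_valid = bool(np.all(np.linalg.eigvalsh(cov2) > 0))
print(x.size, is_valid)
```

For $N = 3$ the flat vector has $1 + 3 + 3 + 3 = 10$ entries, compared with $3 + 9 = 12$ numbers for a full $\mu$ and $\Sigma$.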

In [2]:
# imports
from gwalk.multivariate_normal.decomposition import params_of_offset_mu_cov
from gwalk.multivariate_normal.decomposition import cov_of_std_cor

# Pick some offset
offset = np.asarray([0.])
# Pick some mu
mu = np.asarray([1., 2., 3.])
# Pick some standard deviations
std = np.asarray([0.1, 0.5, 0.3])
# Pick some correlations
cor = np.asarray([[1., 0., 0.3], [0., 1., 0.5], [0.3, 0.5, 1.]])

# Assemble a covariance matrix
cov = cov_of_std_cor(std, cor)
# Get your parameters
X = params_of_offset_mu_cov(offset, mu, cov).flatten()

print("All parameters:")
print(X)
print()
print("offset:")
print(offset)
print("X[0]")
print(X[0])
print()
print("mu:")
print(mu)
print("X[1:N+1]:")
print(X[1:mu.size+1])
print()
print("std:")
print(std)
print("X[N+1:2*N+1]:")
print(X[mu.size+1:2*mu.size + 1])
print()
print("cor:")
print(cor)
print("X[2*N+1:]:")
print(X[2*mu.size + 1:])

All parameters:
[0.  1.  2.  3.  0.1 0.5 0.3 0.  0.3 0.5]

offset:
[0.]
X[0]
0.0

mu:
[1. 2. 3.]
X[1:N+1]:
[1. 2. 3.]

std:
[0.1 0.5 0.3]
X[N+1:2*N+1]:
[0.1 0.5 0.3]

cor:
[[1.  0.  0.3]
 [0.  1.  0.5]
 [0.3 0.5 1. ]]
X[2*N+1:]:
[0.  0.3 0.5]


In [4]:
# Introduce the object
from gwalk.multivariate_normal import MultivariateNormal

# Instantiate the object one way
MV1 = MultivariateNormal(X)
print("MV1 params:")
print(MV1.params)
print("mu:")
print(MV1.mu)

# Instantiate the object another way
MV2 = MultivariateNormal.from_properties(mu=mu,cov=cov)
print("MV2 params:")
print(MV2.params)

MV1 params:
[0.  1.  2.  3.  0.1 0.5 0.3 0.  0.3 0.5]
mu:
[1. 2. 3.]
MV2 params:
[0.  1.  2.  3.  0.1 0.5 0.3 0.  0.3 0.5]


In [9]:
# Truncate this Gaussian!
limits = np.asarray([[1., 30.0], [-30., 30.], [-30.0, 30.]])
# Print limits
print("limits:")
print(limits)
# Set up a truncated Gaussian
MV = MultivariateNormal(X, limits=limits)
print("MV.limits:")
print(MV.limits)

limits:
[[  1.  30.]
 [-30.  30.]
 [-30.  30.]]
MV.limits:
[[  1.  30.]
 [-30.  30.]
 [-30.  30.]]


## Evaluation ##

We provide a function, compiled in C, which efficiently evaluates the Gaussian PDF for many Gaussians at once on a consistent set of points. Doing the same with scipy's multivariate normal would require a Python loop over the Gaussians.
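
As a rough illustration of evaluating many Gaussians on one shared set of points without a Python loop, here is a NumPy sketch. It is not the compiled GWALK routine. The function names `batched_log_pdf` and `loop_log_pdf` and the random test data are assumptions made for this example. The batched version uses a stacked Cholesky factorization, and the looped version serves only as a reference to check against.

```python
import numpy as np

def batched_log_pdf(mus, covs, points):
    """Log-density of G Gaussians at P shared points.

    mus: (G, N), covs: (G, N, N), points: (P, N); returns (G, P).
    """
    n = mus.shape[1]
    # Cholesky factor of every covariance at once: cov = L @ L.T
    chol = np.linalg.cholesky(covs)                      # (G, N, N)
    # Residual of every point with respect to every mean
    diff = points[None, :, :] - mus[:, None, :]          # (G, P, N)
    # Solve L z = diff.T per Gaussian; sum(z**2) is the Mahalanobis distance
    z = np.linalg.solve(chol, diff.transpose(0, 2, 1))   # (G, N, P)
    maha = np.sum(z ** 2, axis=1)                        # (G, P)
    # log det(cov) = 2 * sum(log(diag(L)))
    diag = np.diagonal(chol, axis1=1, axis2=2)           # (G, N)
    log_det = 2.0 * np.sum(np.log(diag), axis=1)         # (G,)
    return -0.5 * (maha + log_det[:, None] + n * np.log(2.0 * np.pi))

def loop_log_pdf(mus, covs, points):
    """Reference implementation: one Gaussian per loop iteration."""
    out = np.empty((mus.shape[0], points.shape[0]))
    for g, (mu, cov) in enumerate(zip(mus, covs)):
        diff = points - mu
        maha = np.einsum("pi,ij,pj->p", diff, np.linalg.inv(cov), diff)
        log_det = np.log(np.linalg.det(cov))
        out[g] = -0.5 * (maha + log_det + mu.size * np.log(2.0 * np.pi))
    return out

# Hypothetical test data: 4 Gaussians in 3 dimensions, 1000 shared points
rng = np.random.RandomState(0)
G, N, P = 4, 3, 1000
mus = rng.normal(size=(G, N))
A = rng.normal(size=(G, N, N))
covs = A @ A.transpose(0, 2, 1) + N * np.eye(N)   # symmetric positive definite
points = rng.uniform(-3.0, 3.0, size=(P, N))

batched = batched_log_pdf(mus, covs, points)
looped = loop_log_pdf(mus, covs, points)
print(np.allclose(batched, looped))
```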

In [10]:
# Evaluate a non-truncated Gaussian

# Import things from gwalk
from gwalk.multivariate_normal import pdf as gwalk_pdf

# Number of points to evaluate
ndim = mu.size
npts = int(1e4)

# Come up with some bogus data to evaluate the pdf on
Y = RS.uniform(size=npts*ndim).reshape(npts, ndim)
# These points will be between 0. and 3.
Y = Y*3

# Evaluate the pdf
L1 = gwalk_pdf(mu, cov, Y, scale=None, log_scale=False)
print("Likelihood:")
print(L1)

Likelihood:
[1.60596556e-20 1.33458422e-03 5.08146369e-63 ... 1.32003607e-04
 1.11880892e-80 2.74739578e-05]


In [13]:
# Evaluate a truncated Gaussian
L2 = MV.likelihood(Y, scale=None)
print("Likelihood (object):")
print(L2)

Likelihood (object):
[3.21193112e-20 0.00000000e+00 1.01629274e-62 ... 2.64007214e-04
 2.23761783e-80 0.00000000e+00]


## Normalization ##

Normalizing a truncated Gaussian is straightforward, as the one-dimensional CDFs for each coordinate capture everything about the space required to renormalize the distribution. We also allow the normalization offset to float freely if desired, so that the multivariate normal distribution can be rescaled to suit the needs of different problems.
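
As a worked example, the enclosed fraction for the truncated Gaussian used in the cells above can be reproduced from one-dimensional normal CDFs. This is a sketch under a stated assumption and does not claim to show how `analytic_enclosed_integral` is implemented. Only the lower limit on the first coordinate is active here, and the other limits lie dozens of standard deviations from the mean, so the product of the one-dimensional masses equals the mass of the first marginal. The helper `normal_cdf` is a hypothetical name.

```python
import math
import numpy as np

def normal_cdf(x, mu, sigma):
    """CDF of a one-dimensional normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Values taken from the example cells above
mu = np.array([1.0, 2.0, 3.0])
std = np.array([0.1, 0.5, 0.3])
limits = np.array([[1.0, 30.0], [-30.0, 30.0], [-30.0, 30.0]])

# Probability mass of each one-dimensional marginal inside its limits
mass_1d = np.array([
    normal_cdf(hi, m, s) - normal_cdf(lo, m, s)
    for (lo, hi), m, s in zip(limits, mu, std)
])

# Enclosed fraction (valid here because only one limit is active)
enclosed = float(np.prod(mass_1d))
# Factor by which the pdf inside the limits must be multiplied
renorm = 1.0 / enclosed
print(mass_1d, enclosed, renorm)
```

The first coordinate is cut exactly at its mean, so half of the mass is enclosed and the pdf inside the limits is rescaled by a factor of two, consistent with the integral and scale factor printed in the cell output.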

In [12]:
# Find out what fraction of our Gaussian is inside the limits
print("Integral of enclosed region:")
print(MV.analytic_enclosed_integral())

# Normalize the MultivariateNormal object
MV.normalize()
L3 = MV.likelihood(Y)
print("Likelihood (accounting for normalization):")
print(L3)
print("Scale factor:")
print(np.mean((L3/L1)[L3 != 0]))

Integral of enclosed region:
[0.5]
Likelihood (accounting for normalization):
[3.21193112e-20 0.00000000e+00 1.01629274e-62 ... 2.64007214e-04
 2.23761783e-80 0.00000000e+00]
Scale factor:
2.000000000000006


## Optimization ##

Optimization is fairly straightforward once four issues are solved: reducing our parameters to the minimal set that describes a multivariate normal distribution (as seen above), checking the constraints of a multivariate normal distribution, evaluating the distribution efficiently, and resolving normalization under truncation analytically.
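
The idea of a bounded least-squares fit with SLSQP can be sketched on a one-dimensional toy problem using `scipy.optimize.minimize`. This is a simplified illustration and not the `optimize_likelihood_grid` routine. The model function `log_gauss_1d`, the synthetic grid, the initial guess, and the bounds are all assumptions chosen for the example.

```python
import numpy as np
from scipy.optimize import minimize

def log_gauss_1d(x, offset, mu, std):
    """Log of a 1-D Gaussian pdf plus a free normalization offset."""
    return (offset - 0.5 * ((x - mu) / std) ** 2
            - np.log(std) - 0.5 * np.log(2.0 * np.pi))

# Synthetic "likelihood grid": points and their log-likelihood values
x = np.linspace(-3.0, 5.0, 200)
true_params = np.array([0.5, 1.0, 0.8])   # offset, mu, std
ln_p = log_gauss_1d(x, *true_params)

def objective(params):
    """Mean squared residual between the model and the grid values."""
    return np.mean((log_gauss_1d(x, *params) - ln_p) ** 2)

# Start at the grid maximum with unit width and zero offset
guess = np.array([0.0, x[np.argmax(ln_p)], 1.0])
# Keep mu inside the grid and std strictly positive
bounds = [(None, None), (x.min(), x.max()), (0.05, None)]

result = minimize(objective, guess, method="SLSQP", bounds=bounds,
                  options={"ftol": 1e-12, "maxiter": 500})
print("fitted params:", result.x)
```

In this toy problem the bound on `std` plays the role of the constraint check: it keeps the optimizer away from parameter values that do not describe a valid Gaussian.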


In [15]:
## Imports ##
from gwalk.multivariate_normal import random_gauss_params
from gwalk.optimize_likelihood_grid import optimize_likelihood_grid
from gwalk.multivariate_normal import pdf as multivariate_normal_pdf
from gwalk.multivariate_normal import params_of_offset_mu_cov
from gwalk.multivariate_normal import mu_cov_of_params
from gwalk.multivariate_normal import offset_of_params
from gwalk.multivariate_normal import MultivariateNormal

## Constants ##

# Number of samples to draw from Gaussian
N_RVS = int(1e6)
# Number of points to select for grid
NSELECT = 1000
# Number of dimensions
NDIM = 3
# Seed
SEED = 0
# Limits of Gaussian
LIMITS = np.asarray([
                     [-7., 7.],
                     [-7., 7.],
                     [-7., 7.],
                    ])
# Fit the least squares of the residual
objective = "lstsq"
# Fit using scipy's SLSQP method --
# note this is the only method I have tested which
# works for bounds and constraints
method = "SLSQP"

## Setup ##
# Generate an arbitrary Gaussian
RS.seed(SEED)
X0 = random_gauss_params(1, NDIM, rs=RS)
# Get the mu and cov params
mu, cov = mu_cov_of_params(X0)
# flatten the parameters, so we know we are dealing with the parameters
# of a single Gaussian
X0 = X0.flatten()

## Generate some samples ##

# Generate uniform samples
samples = (RS.uniform(size=(N_RVS,NDIM))*(LIMITS[:,1]-LIMITS[:,0])) + LIMITS[:,0]
# Get pdf values
values = multivariate_normal_pdf(mu, cov, samples, log_scale=True).flatten()
# Adjust by the lnL offset
values = values + offset_of_params(X0)
# Downselect samples
select = np.argsort(values)[-NSELECT:]
# Keep only NSELECT samples
samples = samples[select]
values = values[select]

## Run the optimize likelihood grid function ##
print("There's some print statements I need to remove.")

MV = optimize_likelihood_grid(
                              samples,
                              lnP=values,
                              seed=SEED,
                              objective=objective,
                              method=method,
                             )

# Retrieve estimated value of parameters from optimization
Xg = MV.params

# Print the log of the absolute value of the difference between X0 and Xg
print("ln(optimized error):")
print(np.log(np.abs(Xg-X0)))

There's some print statements I need to remove.
[0.54881306 0.71518899 0.6027638  0.54488327 0.42365458 0.64589403
 0.43758695 0.44588515 0.48183223 0.19172186]
[ True]
[[1.10576961e+09 2.07108744e+09 4.50359963e+09]]
ln(optimized error):
[-14.62308264 -14.80218518 -14.67665835 -16.26862825 -15.31476706
 -16.31395211 -15.13963096 -13.51752047 -13.97419588 -13.72027351]


## Serialization ##

We also provide a way to save and load these distributions in a compressed file format.


In [17]:
# setup
fname = "test.hdf5"
label = "my_label"
attrs = {"butter": "jam"}

print("Testing serialization:")
MV.save(fname, label, attrs=attrs)
MVnew = MV.load(fname, label)

print("Saved params:")
print(MV.params)
print("Loaded params:")
print(MVnew.params)

print("MV == MVnew:")
print(MV == MVnew)

Testing serialization:
Saved params:
[0.54881306 0.71518899 0.6027638  0.54488327 0.42365458 0.64589403
 0.43758695 0.44588515 0.48183223 0.19172186]
Loaded params:
[0.54881306 0.71518899 0.6027638  0.54488327 0.42365458 0.64589403
 0.43758695 0.44588515 0.48183223 0.19172186]
MV == MVnew:
False
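
The final comparison prints `False` even though the saved and loaded parameters agree. One possible explanation, which is an assumption and has not been checked against the GWALK source, is that the object does not define `__eq__`, in which case Python compares object identity. The sketch below uses a hypothetical stand-in class, `ParamsHolder`, to show the effect and how to compare the parameter arrays directly.

```python
import numpy as np

class ParamsHolder:
    """Hypothetical stand-in for an object that stores a parameter array."""
    def __init__(self, params):
        self.params = np.asarray(params, dtype=float)

saved = ParamsHolder([0.54881306, 0.71518899, 0.6027638])
loaded = ParamsHolder(saved.params.copy())

# Without a custom __eq__, == falls back to identity comparison
same_object = (saved == loaded)
# Comparing the stored arrays checks the round trip itself
same_params = np.array_equal(saved.params, loaded.params)
print(same_object, same_params)
```

Under that assumption, `np.array_equal(MV.params, MVnew.params)` would be the more informative round-trip check.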
