
Trouble converging #7

Closed · Svalorzen opened this issue Mar 23, 2020 · 4 comments

Svalorzen commented Mar 23, 2020

Hi, I'm trying to estimate the parameters of a GP with a standard RBF kernel. I'm having some trouble converging to the same parameters I get when implementing the code in Python. I am using the current master branch (not sure if I should be using a particular commit).

With LBFGSSolver the search seems to never end, going in loops. With LBFGSBSolver, the search ends after ~20 iterations with a thrown exception. I've tried modifying the parameters to make it search longer, but it never really improves beyond what it has already achieved.

LBFGSBSolver gets somewhat close to the optimal solution found via Python (1.3455902517905933, 0.7625631641671134, with the negative log likelihood at 5.8842014), but since it then throws, I don't want to simply wrap it in a try-catch when it isn't stopping cleanly (maybe it could actually reach the same solution as in Python).

I've double-checked my gradients, and I hope I haven't made any silly mistakes. Could you help me figure out whether there's something I should be doing differently? Here is my code, if it helps:

#include <iostream>
#include <Eigen/Eigenvalues>

#include <LBFGSB.h>
#include <LBFGS.h>

constexpr double PI = 3.141592653589793238462643383279502884197169399375105820974944;

using Matrix2D = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor | Eigen::AutoAlign>;
using Vector = Eigen::Matrix<double, Eigen::Dynamic, 1>;

Matrix2D makeDist(const Matrix2D & x1, const Matrix2D & x2) {
    return ((-2.0 * (x1 * x2.transpose())).colwise() + x1.rowwise().squaredNorm()).rowwise() + x2.rowwise().squaredNorm().transpose();
}

Matrix2D kernel(const Matrix2D & x1, const Matrix2D & x2, double l = 1.0, double sigma = 1.0) {
    auto dist = makeDist(x1, x2);
    return std::pow(sigma, 2) * ((-0.5 / std::pow(l, 2)) * dist).array().exp();
}

struct Likelihood {
    Likelihood(const Matrix2D & xData, const Vector & muData, double noise) :
            xData_(xData), muData_(muData), noise_(noise) {}

    const Matrix2D & xData_;
    const Vector & muData_;
    double noise_;

    double operator()(const Eigen::VectorXd& x, Eigen::VectorXd& grad) {
        const auto l = x[0];
        const auto sigma = x[1];

        auto k = kernel(xData_, xData_, l, sigma);
        k.diagonal().array() += std::pow(noise_, 2);

        auto kinv = k.inverse();

        // Compute derivatives as per
        // https://math.stackexchange.com/questions/1030534/gradients-of-marginal-likelihood-of-gaussian-process-with-squared-exponential-co
        Matrix2D dkl = k.array() * (makeDist(xData_, xData_) / std::pow(l, 3)).array();
        auto dks = 2 / sigma * k;

        grad[0] = 0.5 * muData_.transpose() * kinv * dkl * kinv * muData_ - 0.5 * (kinv * dkl).trace();
        grad[1] = 0.5 * muData_.transpose() * kinv * dks * kinv * muData_ - 0.5 * (kinv * dks).trace();
        // Minimize negative log-likelihood
        grad = -grad;

        // Compute log likelihood (correct)
        auto L = k.llt();
        Vector alpha = L.solve(muData_);

        const double logLikelihood = -0.5 * muData_.transpose() * alpha - L.matrixL().selfadjointView().diagonal().array().log().sum() - 0.5 * muData_.size() * std::log(2.0 * PI);

        // Print current parameters
        std::cout << l << ", " << sigma << " ==> " << -logLikelihood << "  -  " << grad.transpose() << std::endl;

        // Return negative log-likelihood
        return -logLikelihood;
    }
};

int main() {
    const double noise = 0.4;

    // Some test points to train on.
    Matrix2D X_train(6, 1);
    X_train << -5, -4, -3, -2, -1, 1;

    Vector Y_train = X_train.array().sin().array();

    // Optimizer params
    LBFGSpp::LBFGSBParam<double> param;
    param.epsilon = 1e-5;
    param.max_iterations = 100;

    param.m = 10;
    param.ftol = 0.4;

    LBFGSpp::LBFGSParam<double> paramx;
    param.epsilon = 1e-5;
    param.max_iterations = 100;

    // Bounds
    Vector lb = Vector::Constant(2, 1e-8);
    Vector ub = Vector::Constant(2, std::numeric_limits<double>::infinity());
    // Initial guess
    Vector x = Vector::Constant(2, 1.0);

    LBFGSpp::LBFGSBSolver<double> solver(param);
    LBFGSpp::LBFGSSolver<double> solver2(paramx);

    // Log-likelihood function with set data/noise
    Likelihood like(X_train, Y_train, noise);

    double log;
    // solver.minimize(like, x, log, lb, ub);
    solver2.minimize(like, x, log);

    return 0;
}

yixuan (Owner) commented Mar 24, 2020

Thanks for reporting. I'll take a look today.

yixuan (Owner) commented Mar 24, 2020

The reason is that the gradient was incorrectly calculated.

Since you add diagonal noise to K, the derivatives with respect to l and sigma need to be computed before the noise is added to K.
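
In other words, something like the following ordering (a minimal sketch reusing the Matrix2D type and the kernel()/makeDist() helpers from the code above; only the order of operations changes):

// Noise-free kernel: the derivatives are taken with respect to this matrix.
Matrix2D k = kernel(xData_, xData_, l, sigma);

// dK/dl and dK/dsigma are computed from the noise-free k ...
Matrix2D dkl = k.array() * (makeDist(xData_, xData_) / std::pow(l, 3)).array();
Matrix2D dks = (2.0 / sigma) * k;

// ... and only afterwards is the observation noise added to the diagonal,
// for the K^(-1) and log|K| terms of the likelihood.
k.diagonal().array() += std::pow(noise_, 2);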

Other tweaks:

  1. You should cache makeDist(xData_, xData_) outside kernel() and operator(), since it stays the same.
  2. Also cache the Cholesky decomposition of K to compute other related quantities, for example K^(-1), K^(-1) * y, and log|K|.
  3. Avoid direct matrix-matrix multiplication: instead of y.transpose() * kinv * dkl * kinv * y, compute alpha = kinv * y first and then alpha.transpose() * dkl * alpha.
  4. tr(A * B) = sum(A .* B^T), where .* is the elementwise multiplication (and since K^(-1) and the kernel derivatives are symmetric, sum(A .* B) works here). The former is O(n^3), whereas the latter is O(n^2); see the short check after this list.
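
For example, a quick standalone check of that identity with Eigen (a toy example, not part of the GP code; both matrices are made symmetric, mirroring K^(-1) and the kernel derivatives):

#include <iostream>
#include <Eigen/Core>

int main() {
    // Build two symmetric matrices, as in the GP setting.
    Eigen::MatrixXd M = Eigen::MatrixXd::Random(50, 50);
    Eigen::MatrixXd N = Eigen::MatrixXd::Random(50, 50);
    Eigen::MatrixXd A = M + M.transpose();
    Eigen::MatrixXd B = N + N.transpose();

    const double viaProduct = (A * B).trace();          // O(n^3): full matrix product first
    const double viaCwise   = A.cwiseProduct(B).sum();  // O(n^2): elementwise product and sum

    std::cout << viaProduct << " vs " << viaCwise << std::endl;  // agree up to rounding
    return 0;
}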

Below is my version:

#include <iostream>
#include <Eigen/Core>
#include <Eigen/Cholesky>

#include <LBFGSB.h>
#include <LBFGS.h>

constexpr double PI = 3.141592653589793238462643383279502884197169399375105820974944;

using Matrix2D = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor | Eigen::AutoAlign>;
using Vector = Eigen::Matrix<double, Eigen::Dynamic, 1>;

Matrix2D makeDist(const Matrix2D & x1, const Matrix2D & x2) {
    return ((-2.0 * (x1 * x2.transpose())).colwise() + x1.rowwise().squaredNorm()).rowwise() +
    		x2.rowwise().squaredNorm().transpose();
}

Matrix2D kernel(const Matrix2D & dist, double l = 1.0, double sigma = 1.0) {
    return std::pow(sigma, 2) * ((-0.5 / std::pow(l, 2)) * dist).array().exp();
}

struct Likelihood {
    Likelihood(const Matrix2D & xData, const Vector & muData, double noise) :
            dist_(makeDist(xData, xData)), muData_(muData), noise_(noise) {}

    const Matrix2D dist_;
    const Vector & muData_;
    double noise_;

    double operator()(const Eigen::VectorXd& x, Eigen::VectorXd& grad) {
    	const int N = muData_.size();
        const double l = x[0];
        const double sigma = x[1];

        Matrix2D k = kernel(dist_, l, sigma);

        // Compute derivatives as per
        // https://math.stackexchange.com/questions/1030534/gradients-of-marginal-likelihood-of-gaussian-process-with-squared-exponential-co
        Matrix2D dkl = k.cwiseProduct(dist_) / std::pow(l, 3);
        Matrix2D dks = (2.0 / sigma) * k;

        k.diagonal().array() += std::pow(noise_, 2);
        auto llt = Eigen::LLT<Matrix2D>(k);
        Matrix2D kinv = llt.solve(Matrix2D::Identity(N, N));
        Vector alpha = llt.solve(muData_);

        grad[0] = 0.5 * alpha.transpose() * dkl * alpha - 0.5 * kinv.cwiseProduct(dkl).sum();
        grad[1] = 0.5 * alpha.transpose() * dks * alpha - 0.5 * kinv.cwiseProduct(dks).sum();
        // Minimize negative log-likelihood
        grad = -grad;

        // Compute log likelihood (correct)
        const double logLikelihood = -0.5 * muData_.dot(alpha) - llt.matrixL().selfadjointView().diagonal().array().log().sum() - 0.5 * N * std::log(2.0 * PI);

        // Print current parameters
        std::cout << l << ", " << sigma << " ==> " << -logLikelihood << "  -  " << grad.transpose() << std::endl;

        // Return negative log-likelihood
        return -logLikelihood;
    }
};

int main() {
    const double noise = 0.4;

    // Some test points to train on.
    Matrix2D X_train(6, 1);
    X_train << -5, -4, -3, -2, -1, 1;
    Vector Y_train = X_train.array().sin();

    // Log-likelihood function with set data/noise
    Likelihood like(X_train, Y_train, noise);

    // Bounds
    Vector lb = Vector::Constant(2, 1e-8);
    Vector ub = Vector::Constant(2, std::numeric_limits<double>::infinity());
    // Initial guess
    Vector x = Vector::Constant(2, 1.0);

    // L-BFGS-B
    LBFGSpp::LBFGSBParam<double> param;
    param.epsilon = 1e-5;
    param.max_iterations = 100;
    LBFGSpp::LBFGSBSolver<double> solver(param);
    double log;
    solver.minimize(like, x, log, lb, ub);

    std::cout << "\n===============================\n\n";

    LBFGSpp::LBFGSParam<double> paramx;
    paramx.epsilon = 1e-5;
    paramx.max_iterations = 10;
    LBFGSpp::LBFGSSolver<double> solver2(paramx);
    x = Vector::Constant(2, 1.0);
    solver2.minimize(like, x, log);

    return 0;
}

Output:

1, 1 ==> 6.49954  -  -1.64172  2.33992
1.85404, 0.479791 ==> 7.35173  -   2.13605 -6.82365
1.3421, 0.791619 ==> 5.88876  -  -0.101547  0.295286
1.36368, 0.760786 ==> 5.88481  -   0.0601469 -0.0763338
1.35529, 0.76696 ==> 5.88431  -  0.0152388 0.0146282
1.35236, 0.76593 ==> 5.88426  -  0.00972153  0.0133352
1.34553, 0.762505 ==> 5.8842  -  -5.29077e-06 -0.000405903
1.34559, 0.762563 ==> 5.8842  -   1.4304e-06 1.52874e-06

===============================

1, 1 ==> 6.49954  -  -1.64172  2.33992
1.57435, 0.181389 ==> 10.1205  -  1.09806 -14.218
1.28717, 0.590695 ==> 6.05536  -   0.38649 -2.43346
1.21589, 0.792321 ==> 5.92615  -  -0.488156  0.681163
1.25462, 0.747928 ==> 5.89378  -  -0.233815   0.12984
1.28602, 0.735536 ==> 5.88844  -  -0.0968389  -0.107242
1.3083, 0.741189 ==> 5.88623  -  -0.045728 -0.113439
1.34315, 0.760708 ==> 5.88421  -  -0.00147428  -0.0116414
1.34548, 0.76243 ==> 5.8842  -  0.000102842 -0.00104711
1.3456, 0.762561 ==> 5.8842  -     2.875e-05 -4.39889e-05
1.34559, 0.762563 ==> 5.8842  -   3.86423e-06 -3.34703e-06

yixuan closed this as completed Mar 24, 2020
Svalorzen (Author) commented:

Sorry for my mistake, I hope you haven't lost too much time on it. Thanks a ton for the help!

yixuan (Owner) commented Mar 24, 2020

Not a big deal. It was actually a good time to refresh my memory on GP fitting. 😂
