Failure to converge (again...) #9

Closed
Svalorzen opened this issue Apr 3, 2020 · 2 comments · Fixed by #25

Comments

@Svalorzen

Svalorzen commented Apr 3, 2020

Hi, sorry again. I'm having trouble with convergence, again using the code you helped me to fix. I have tested the code with larger problems successfully, but I have just found a small case where it fails.

It looks relatively simple, but the minimization fails because the line search step becomes too small after 2 iterations. Small variations in the setup make the problem solvable again (for example, changing the initial guess from [1.0, 1.0] to [1.1, 1.1], adding more points to the training set, or modifying some of them).

Is this type of failure a common occurrence with this kind of optimizer (as you may have guessed, I'm not really an expert in this field)? scipy, using the same algorithm, does not usually seem to have problems. Should I just put some try/catch blocks around the optimizer and, if it fails, try again with different parameters (which I should do anyway to ensure a global minimum is found, but I'm asking specifically about errors)? Or maybe it's just a bug and I can keep reporting them :)
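
Concretely, the try/catch-and-retry logic I have in mind looks something like this (just a sketch around the program below; `like`, `lb`, `ub` and `solver` are the objects defined there, and the list of restart values is arbitrary):

// Sketch only: restart the minimization from a few starting points if the
// line search throws.
Vector best;
double bestFx = std::numeric_limits<double>::infinity();
for (double start : {1.0, 1.1, 0.9, 2.0}) {
    Vector x0 = Vector::Constant(2, start);
    double fx;
    try {
        solver.minimize(like, x0, fx, lb, ub);
    } catch (const std::runtime_error & e) {
        continue;  // line search failed for this start, try the next one
    }
    if (fx < bestFx) { bestFx = fx; best = x0; }  // keep the best successful run
}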

Thanks again in advance for your help.

#include <cmath>
#include <iostream>
#include <limits>

#include <Eigen/Eigenvalues>

#include <LBFGSB.h>

constexpr double PI = 3.141592653589793238462643383279502884197169399375105820974944;

using Matrix2D = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor | Eigen::AutoAlign>;
using Vector = Eigen::Matrix<double, Eigen::Dynamic, 1>;

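// Matrix of pairwise squared Euclidean distances between the rows of x1 and x2.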
Matrix2D makeDist(const Matrix2D & x1, const Matrix2D & x2) {
    return ((-2.0 * (x1 * x2.transpose())).colwise() + x1.rowwise().squaredNorm()).rowwise() + x2.rowwise().squaredNorm().transpose();
}

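// Squared-exponential (RBF) kernel evaluated on a matrix of squared distances,
// with length-scale l and signal standard deviation sigma.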
Matrix2D kernel(const Matrix2D & dist, double l = 1.0, double sigma = 1.0) {
    return std::pow(sigma, 2) * ((-0.5 / std::pow(l, 2)) * dist).array().exp();
}

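// Negative log marginal likelihood (and its gradient) as a function of x = (l, sigma).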
struct Likelihood {
    Likelihood(const Matrix2D & xData, const Vector & muData, double noise) :
            xData_(xData), muData_(muData), noise_(noise) {}

    const Matrix2D & xData_;
    const Vector & muData_;
    double noise_;

    double operator()(const Eigen::VectorXd& x, Eigen::VectorXd& grad) {
        const auto l = x[0];
        const auto sigma = x[1];

        auto dist = makeDist(xData_, xData_);
        auto k = kernel(dist, l, sigma);

        Matrix2D dkl = k.cwiseProduct(dist / std::pow(l, 3));
        Matrix2D dks = (2.0 / sigma) * k;

        k.diagonal().array() += std::pow(noise_, 2);

        auto L = k.llt();
        Matrix2D kinv = L.solve(Matrix2D::Identity(muData_.size(), muData_.size()));
        Vector alpha = L.solve(muData_);

        // Negative gradient to minimize negative log-likelihood
        grad[0] = -(0.5 * alpha.transpose() * dkl * alpha - 0.5 * kinv.cwiseProduct(dkl).sum());
        grad[1] = -(0.5 * alpha.transpose() * dks * alpha - 0.5 * kinv.cwiseProduct(dks).sum());

        // Log-likelihood
        const double ll = -0.5 * muData_.transpose() * alpha
                          - L.matrixL().selfadjointView().diagonal().array().log().sum()
                          - 0.5 * muData_.size() * std::log(2.0 * PI);
        std::cout << l << ", " << sigma << " ==> " << -ll << "  -  " << grad.transpose() << std::endl;
        return -ll;
    }
};

int main() {
    const double noise = 1e-8;

    // Training points to test on. Small modifications seem to make the problem
    // solvable again (for example replacing the 0 with a 1).
    Matrix2D X_train(7, 1);
    X_train << -4, -3, -2, -1, 0, 2, 4;

    Vector Y_train = X_train.array().sin();

    // Log-likelihood function with set data/noise
    Likelihood like(X_train, Y_train, noise);

    // Optimizer params
    LBFGSpp::LBFGSBParam<double> param;
    param.epsilon = 1e-5;
    param.max_iterations = 100;

    // Bounds
    Vector lb = Vector::Constant(2, 1e-8);
    Vector ub = Vector::Constant(2, std::numeric_limits<double>::infinity());

    // Initial guess (replacing 1.0 with other numbers seems to make the problem
    // solvable as well).
    Vector x = Vector::Constant(2, 1.0);

    LBFGSpp::LBFGSBSolver<double> solver(param);

    double log;
    solver.minimize(like, x, log, lb, ub);

    return 0;
}

The output is:

1, 1 ==> 6.96291  -  -3.03909  3.49729
1.9499, 0.68744 ==> 5.99037  -   4.90625 -14.3289
1.28814, 0.905188 ==> 5.76892  -  -3.16873  2.10955
terminate called after throwing an instance of 'std::runtime_error'
  what():  the line search step became smaller than the minimum value allowed
Aborted (core dumped)

If the starting point [1.1, 1.1] is used, it works:

1.1, 1.1 ==> 6.99387  -  -3.4496 3.58888
2.05273, 0.796194 ==> 5.19971  -   4.32208 -9.35253
1.46782, 0.982711 ==> 5.34336  -  -3.34543  2.06644
6.15734, 0.336145 ==> 1.88086e+06  -   3.82257e+06 -1.11909e+07
2.89361, 0.78613 ==> 39.5965  -   131.639 -109.667
1.86985, 0.927282 ==> 4.3294  -  -0.92675 -1.49038
2.08611, 0.992819 ==> 4.19746  -   1.20536 -2.77274
2.10575, 1.08705 ==> 4.02656  -  0.314624 -1.26704
2.18967, 1.21189 ==> 3.94199  -   0.334645 -0.571357
2.20651, 1.27356 ==> 3.92379  -  0.00729654  -0.127709
2.22573, 1.30522 ==> 3.92164  -   0.0325654 -0.0340186
2.2261, 1.30993 ==> 3.92157  -  -0.00439909  0.00135147
2.22656, 1.31028 ==> 3.92156  -   0.000116421 -1.42964e-05
2.22655, 1.31027 ==> 3.92156  -  -3.84116e-07  4.18181e-07

The same happens if the starting point [0.9, 0.9] is used:

0.9, 0.9 ==> 6.90955  -  -2.61983  3.17243
1.84575, 0.575104 ==> 7.52613  -   6.48836 -24.6506
1.37833, 0.735678 ==> 5.43282  -  -2.02235 -1.69158
2.20017, 0.694899 ==> 8.04254  -    13.791 -24.1841
1.6203, 0.723671 ==> 5.12384  -  -0.641197  -4.86297
1.95251, 0.837837 ==> 4.62563  -   1.38858 -5.23043
2.05031, 1.00839 ==> 4.13246  -  0.396067 -1.95868
2.14692, 1.12796 ==> 3.9945  -  0.522762 -1.14357
2.16689, 1.23065 ==> 3.9341  -  -0.221894 -0.153314
2.23342, 1.30105 ==> 3.9229  -   0.202664 -0.141067
2.22493, 1.30545 ==> 3.92161  -   0.0169154 -0.0246754
2.22623, 1.30971 ==> 3.92157  -  -0.000161013  -0.00168692
2.22654, 1.31025 ==> 3.92156  -  -9.30085e-05  9.63869e-06
2.22655, 1.31026 ==> 3.92156  -  -1.78201e-06  9.67718e-07
@yixuan
Owner

yixuan commented Apr 6, 2020

It might be a numerical issue. The Gaussian kernel exponentiates the (scaled, negated) squared distance. For example, in your case the distance between -4 and 4 is 8, so when l=1 the kernel computes exp(-8^2/2)=1.266417e-14. When you then invert the kernel matrix you get a pretty large gradient, and the line search may fail. I agree that some more mature software may handle this better, but given the nature of the kernel function there can be a lot of instability. A good initial value always helps.
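
To see the scale concretely, this standalone snippet (not part of the library, just the arithmetic) reproduces that number:

#include <cmath>
#include <cstdio>

int main() {
    const double l = 1.0, sigma = 1.0;
    const double d2 = 8.0 * 8.0;  // squared distance between the points -4 and 4
    // Same formula as a single entry of kernel() in your program
    const double kij = sigma * sigma * std::exp(-0.5 * d2 / (l * l));
    std::printf("k(-4, 4) = %g\n", kij);  // roughly 1.26642e-14
    return 0;
}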

Specific to your problem, let me think of a better way to compute the gradient.

@Svalorzen
Author

I think you might be right. I now realize that the Python code actually uses numerical gradients (since I don't pass it the derivative), so it might be avoiding the numerical instabilities at the price of performance?

I am also looking at GPy's code to see if I can find a different way to compute the gradient.
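
Something like the following central-difference wrapper is what I am thinking of trying (just a sketch; `NumericalGradient` and the step size are my own choices, not anything from LBFGS++ or scipy):

#include <Eigen/Core>

// Wraps any objective with the same operator() signature as Likelihood above
// and replaces its analytic gradient with a central-difference approximation.
template <typename Func>
struct NumericalGradient {
    Func & f_;
    double h_;

    NumericalGradient(Func & f, double h = 1e-6) : f_(f), h_(h) {}

    double operator()(const Eigen::VectorXd & x, Eigen::VectorXd & grad) {
        Eigen::VectorXd dummy = Eigen::VectorXd::Zero(x.size());
        const double fx = f_(x, dummy);  // analytic gradient written to dummy is ignored
        for (Eigen::Index i = 0; i < x.size(); ++i) {
            Eigen::VectorXd xp = x, xm = x;
            xp[i] += h_;
            xm[i] -= h_;
            grad[i] = (f_(xp, dummy) - f_(xm, dummy)) / (2.0 * h_);
        }
        return fx;
    }
};

// Usage with the code above: NumericalGradient<Likelihood> nlike(like);
//                            solver.minimize(nlike, x, log, lb, ub);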
