Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Differences in scalar vs vectorized predictions with GaussianProcessRegressor #25750

Open
whophil opened this issue Mar 3, 2023 · 4 comments
Labels

Comments

@whophil
Copy link

whophil commented Mar 3, 2023

Describe the bug

I would expect that calling GaussianProcessRegressor.predict(X) with a single X matrix, or with repeated scalar evaluations rows of X should would give nearly the same result, but they don't.

An example is given below.

Steps/Code to Reproduce

import numpy as np
import pandas as pd
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, RBF
from sklearn.model_selection import train_test_split


data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]


def evaluate_gpr_vec_and_scalar(gpr, X):
    res_vector = gpr.predict(X).squeeze()
    res_scalar = np.array([gpr.predict(xi.reshape(1, -1)) for xi in X]).squeeze()
    return res_scalar, res_vector

# load/split data
X_train, X_test, y_train, y_test = train_test_split(data, target, random_state=0)

# train
gpr = GaussianProcessRegressor(DotProduct() + RBF(), alpha=1.)
gpr.fit(X_train, y_train)

# predict scalar and vector
pred_scalar, pred_vector = evaluate_gpr_vec_and_scalar(gpr, X_test)
err = pred_scalar - pred_vector
print(err.max())

Expected Results

I would expect the error to be 0 or small - perhaps around machine eps, if anything.

Actual Results

2.1609594114124775e-08

Versions

System:
    python: 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:41:54) [Clang 13.0.1 ]
executable: /Users/phil/miniconda3/envs/skl/bin/python
   machine: macOS-12.5.1-x86_64-i386-64bit
Python dependencies:
      sklearn: 1.1.1
          pip: 22.3
   setuptools: 65.5.0
        numpy: 1.23.4
        scipy: 1.9.3
       Cython: None
       pandas: 1.5.1
   matplotlib: 3.6.1
       joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True
threadpoolctl info:
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: /Users/phil/miniconda3/envs/skl/lib/libopenblasp-r0.3.21.dylib
        version: 0.3.21
threading_layer: openmp
   architecture: Nehalem
    num_threads: 8
       user_api: openmp
   internal_api: openmp
         prefix: libomp
       filepath: /Users/phil/miniconda3/envs/skl/lib/libomp.dylib
        version: None
    num_threads: 8
@whophil whophil added Bug Needs Triage Issue requires triage labels Mar 3, 2023
@ogrisel
Copy link
Member

ogrisel commented Mar 9, 2023

The condition number of K_trans is:

np.linalg.cond(K_trans) = 2441679943182.803

so maybe probably the solution to this is not numerical stable and could cause the discrepancy:

            V = solve_triangular(
                self.L_, K_trans.T, lower=GPR_CHOLESKY_LOWER, check_finite=False
            )

@ogrisel
Copy link
Member

ogrisel commented Mar 9, 2023

I checked that with the RBF kernel alone, we do not have the problem.

With the DotProduct kernel alone we get:

/Users/ogrisel/code/scikit-learn/sklearn/gaussian_process/_gpr.py:629: ConvergenceWarning: lbfgs failed to converge (status=2):
ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  _check_optimize_result("lbfgs", opt_res)
2.0055939131680134e-08

@ogrisel ogrisel added Needs Investigation Issue requires investigation Numerical Stability and removed Needs Triage Issue requires triage labels Mar 9, 2023
@ogrisel
Copy link
Member

ogrisel commented Mar 9, 2023

I am not sure what to do to fix this problem.

@ogrisel
Copy link
Member

ogrisel commented Mar 9, 2023

BTW @whophil I edited your code snippet to load the Boston dataset with pandas from the original URL and be able to reproduce the problem with the latest versions of scikit-learn (including the dev branch).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants