[performance issue] model.fantasize() is significantly slower on GPU #492

Closed · saitcakmak opened this issue Jul 23, 2020 · 5 comments
@saitcakmak (Contributor)

Issue description

Generating fantasy models using model.fantasize() takes significantly longer on GPU than on CPU. The example below is extracted from the evaluation of raw_samples while optimizing qKnowledgeGradient. Running the code below, I get ~60 ms on CPU and ~10000 ms on GPU. I traced the issue down to line 220 of gpytorch/models/exact_prediction_strategies.py, Q, R = torch.qr(new_root). That line appears to be the bottleneck; however, I do not know what is happening beyond there.
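
As a sanity check, torch.qr can also be timed in isolation, outside of fantasize(). The matrix shape below is only a placeholder guess, not the actual shape new_root takes in this example; the snippet just shows how to time the op on GPU.

import torch

# sketch: time torch.qr in isolation on a batched tall-skinny matrix;
# the shape here is a placeholder, not the actual shape of new_root
new_root = torch.randn(2000, 70, 7, device='cuda')

torch.cuda.synchronize()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
Q, R = torch.qr(new_root)
end.record()
torch.cuda.synchronize()
print('torch.qr time (ms): %f' % start.elapsed_time(end))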

Code example

Run the code below with device = torch.device('cuda') and device = torch.device('cpu').

import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_model
from botorch.models.transforms import Standardize
from gpytorch.mlls import ExactMarginalLogLikelihood
from botorch.sampling.samplers import SobolQMCNormalSampler
from time import time

# set the device, 'cuda' or 'cpu'
device = torch.device('cuda')
print('Using device:', device)

# train data, obtained from Branin function projected to unit hypercube
train_X = torch.tensor([[0.5200, 0.0661],
                        [0.9702, 0.8459],
                        [0.0119, 0.3492],
                        [0.2245, 0.9323],
                        [0.7585, 0.4347],
                        [0.7473, 0.6529]],
                       device=device
                       )
train_Y = torch.tensor([[-2.4734],
                        [-102.1244],
                        [-139.2399],
                        [-34.9917],
                        [-48.7695],
                        [-94.5974]],
                       device=device
                       )
# initialize and fit the gp model
model = SingleTaskGP(train_X, train_Y, outcome_transform=Standardize(m=1))
mll = ExactMarginalLogLikelihood(model.likelihood, model)
fit_gpytorch_model(mll)

# QMC sampler drawing 64 fantasy samples
sampler = SobolQMCNormalSampler(64)

# 2000 candidate points (q=1, d=2) at which to generate fantasy models
X = torch.rand((2000, 1, 2), device=device)

# time the fantasize call: CUDA events on GPU, wall-clock time on CPU
if device == torch.device('cuda'):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
else:
    start = time()

model.fantasize(X, sampler)

if device == torch.device('cuda'):
    end.record()
    torch.cuda.synchronize()
    print("time elapsed (ms): %f" % start.elapsed_time(end))
else:
    print("time elapsed (ms): %f" % (1000 * (time() - start)))

System Info

  • BoTorch Version 0.3.0
  • GPyTorch Version 1.1.1
  • PyTorch Version 1.5.1 with torchvision 0.6.1 and cudatoolkit 10.2.89
  • Python 3.8.3 on Anaconda
  • Computer OS: Ubuntu 20.04
@Balandat (Contributor)

Thanks for raising this; this is an upstream issue that we are aware of: cornellius-gp/gpytorch#1157

Really, it is a PyTorch issue with qr being slow here. We aim to find a workaround on the gpytorch end, since the PyTorch fix will likely take a while.

@saitcakmak (Contributor, Author)

Thanks for the quick response, Max!
It is a dirty fix, but replacing line 220 of gpytorch/models/exact_prediction_strategies.py with

        device = new_root.device
        Q, R = torch.qr(new_root.cpu())
        Q = Q.to(device)
        R = R.to(device)

fixes the issue. It reduced the runtime from ~10000 ms to ~33 ms.
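
For reference, the same workaround can be written so that it only round-trips through the CPU when the tensor actually lives on a CUDA device (just a sketch of the idea, not the actual gpytorch change):

        # fall back to CPU QR only for CUDA tensors
        if new_root.is_cuda:
            Q, R = torch.qr(new_root.cpu())
            Q, R = Q.to(new_root.device), R.to(new_root.device)
        else:
            Q, R = torch.qr(new_root)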

@jacobrgardner

@Balandat new_root should always be reasonably skinny in cases where we ought to be doing QR here. Maybe we should indeed do it on the CPU upstream for now? That would be a reasonably quick fix that would maintain the numerical stability of using QR instead of Woodbury.

@Balandat (Contributor)

Yeah, that makes sense to me. One thing I do want to do once #1102 goes in is to check whether L is a TriangularLazyTensor (e.g., always when using Cholesky); in that case we just need to do two successive triangular solves. We can use the CPU fix whenever L is not a TriangularLazyTensor.
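
Not the gpytorch change itself, just an illustration of the building block: when L is a lower-triangular (Cholesky) factor, a system L L^T x = b can be solved with two successive triangular solves instead of a QR or an explicit inverse. The tensor names below are made up for the example.

import torch

# stand-in SPD matrix playing the role of a kernel matrix
n = 6
A = torch.randn(n, n)
K = A @ A.t() + n * torch.eye(n)
L = torch.cholesky(K)  # lower-triangular factor (PyTorch 1.5 API)
b = torch.randn(n, 1)

# solve L y = b, then L^T x = y, using only triangular solves
y, _ = torch.triangular_solve(b, L, upper=False)
x, _ = torch.triangular_solve(y, L, upper=False, transpose=True)

print(torch.allclose(K @ x, b, atol=1e-4))  # expect True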

@Balandat (Contributor)

cornellius-gp/gpytorch#1224
