Choice equivalent for PyTorch #18624
Conversation
Let us know if you want a review prior to removal of WIP.

I think it would be nice.
```cpp
Tensor uniform_samples = at::rand({k}, weights.options());
Tensor cdf = weights.cumsum(0);
cdf /= cdf[-1];
samples = (uniform_samples.unsqueeze(1) > cdf.unsqueeze(0)).sum(1);
```
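For context, this hunk implements inverse-CDF sampling. A minimal standalone PyTorch illustration of the same idea (with hypothetical values, not code from the PR):

```python
import torch

# Inverse-CDF sampling: draw k uniforms and locate each one in the
# normalized cumulative distribution of the weights.
weights = torch.tensor([0.1, 0.2, 0.7])
k = 5
u = torch.rand(k)                          # k uniform samples in [0, 1)
cdf = weights.cumsum(0) / weights.sum()    # normalized CDF: [0.1, 0.3, 1.0]
# For each u_i, count how many CDF entries it exceeds; that count is the
# sampled index, so index j is drawn with probability weights[j] / sum.
samples = (u.unsqueeze(1) > cdf.unsqueeze(0)).sum(1)
print(samples)  # e.g. tensor([2, 2, 1, 2, 0])
```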
I'm making some changes in `sampling_with_replacement` for performance reasons. These are going to be the last improvements I have in mind prior to a review.

The CUDA extensions are not randomizing correctly; it's as if the random states don't actually change much. I'm investigating how to fix it, and I'll try to get some inspiration from the existing implementations.
I've made some checks available here. I won't be making any more changes until a review, and I'm looking forward to it! Thanks!
I was reviewing this PR, and my biggest comment is around implementing choice kernels. I think it'd be much simpler and probably much more performant if `choice` just used `torch.multinomial` + advanced indexing, instead of its own kernels -- considering how long we took to optimize multinomial. Also, the correctness tests look like a good start but are insufficient.
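A minimal sketch of that suggestion (the function name is hypothetical; only `torch.multinomial` and advanced indexing are real APIs):

```python
import torch

def choice_via_multinomial(x, weights, replace, k):
    # Draw k indices with probability proportional to the weights, then
    # gather the corresponding entries of x with advanced indexing.
    idx = torch.multinomial(weights, k, replacement=replace)
    return x[idx]

# Usage: sample 3 elements of x with replacement.
x = torch.arange(10.)
w = torch.rand(10)
print(choice_via_multinomial(x, w, True, 3))
```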
Thanks @soumith for your feedback. I've run some benchmarks; I could perform more tests with bigger tensors and with CUDA later if needed. I can improve the correctness tests, but first I'd like to know what would make you interested in this implementation. It seems to me that this implementation only makes sense now if I can prove performance gains in all cases.
I've done some tests with big CUDA tensors.
I've been checking `choice`'s sampling distribution correctness. So far, everything is working properly. Here is a snippet I used to run some tests (change the parameters as needed):

```python
import torch
import torch.nn.functional as F
import numpy as np

m = 20
n = 10000
k = 4
replace = False
device = 'cpu'

###################################
# Comparing Choice vs Multinomial #
###################################
multinomial_samples = []
choice_samples = []
weights = torch.rand(m, device=device)
for _ in range(n):
    multinomial_samples += torch.multinomial(
        weights,
        k,
        replace
    ).cpu().numpy().tolist()
    choice_samples += torch.choice(
        torch.arange(m).to(device),
        weights,
        replace,
        k
    ).cpu().numpy().tolist()
_, multinomial_dist = np.unique(multinomial_samples, return_counts=True)
_, choice_dist = np.unique(choice_samples, return_counts=True)
multinomial_dist = torch.Tensor(multinomial_dist) / (n * k)
choice_dist = torch.Tensor(choice_dist) / (n * k)
print(F.kl_div(choice_dist.log(), multinomial_dist, reduction='sum'))

############################################
# Comparing Choice vs Correct distribution #
############################################
choice_samples = []
weights = torch.rand(m, device=device)
for _ in range(n):
    choice_samples += torch.choice(
        torch.arange(m).to(device),
        weights,
        replace,
        1
    ).cpu().numpy().tolist()
correct_dist = weights / weights.sum()
correct_dist = correct_dist.to('cpu')
_, choice_dist = np.unique(choice_samples, return_counts=True)
choice_dist = torch.Tensor(choice_dist) / n
print(F.kl_div(choice_dist.log(), correct_dist, reduction='sum'))
```
I had a quick look at our current implementation. It seems that our GPU implementation launches many kernels (see `pytorch/aten/src/THC/generic/THCTensorRandom.cu`, lines 252 to 274 at `b6d0f6c`), which might explain the slowdown of torch.multinomial compared to @LeviViana's implementation of choice, which launches only one or two kernels.

I didn't think carefully about why this is needed in the current implementation.
```
@@ -1802,6 +1802,16 @@
    CPU: randperm_out_cpu
    CUDA: randperm_out_cuda

- func: choice(Tensor input, Tensor weights, bool replace, int k) -> Tensor
```
I would prefer if the arguments had the same names and order as in its inspiration in NumPy (with the exception of `input`, which we can leave as-is, since that's a PyTorch standard).
I'll make this change as well.
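For reference, NumPy's signature is `numpy.random.choice(a, size=None, replace=True, p=None)`. A declaration reordered to match it might look like the following sketch (the exact schema is an assumption, not the PR's final code):

```yaml
# Hypothetical argument order mirroring NumPy's (a, size, replace, p):
- func: choice(Tensor input, int k, bool replace, Tensor weights) -> Tensor
```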
```cpp
    int64_t k
){
    at::Tensor weights = at::empty({0}, input.options().dtype(at::kFloat));
    if (replace){
```
Here, instead of duplicating the code, you can just call through to `native::choice_cpu(input, weights, replace, k)`.
Indeed, I'll make this change.
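A hedged sketch of that refactor, with the signature assumed from the surrounding diff:

```cpp
// The no-weights overload delegates to the weighted one instead of
// duplicating the sampling logic; empty weights signal uniform sampling.
Tensor choice_cpu(const Tensor& input, bool replace, int64_t k) {
    Tensor weights = at::empty({0}, input.options().dtype(at::kFloat));
    return native::choice_cpu(input, weights, replace, k);
}
```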
```cpp
    const Tensor& weights,
    int64_t k
){
    int n = x.size(0);
```
So, I guess we are assuming the input tensor `x` is 1-D, which seems reasonable since that's what NumPy does. But we need to actually check that and report an error if it's not the case.
I'm not assuming `x` is 1-D; instead, I'm forcing the sampling to happen only along the first dimension. If `x = torch.Tensor([[1, 2], [3, 4]])`, then `torch.choice(x, w, True, 3)` can be `torch.Tensor([[1, 2], [1, 2], [3, 4]])`, for instance (i.e. it sampled indices `0, 0, 1`).
Got it. Anyway, we should still check that there is at least one dimension, because this will barf if you call it on a 0-dim tensor.
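A minimal sketch of such a check, using the same `AT_CHECK` macro as the rest of the PR (the message is illustrative):

```cpp
// Reject 0-dim tensors up front, since x.size(0) would throw otherwise.
AT_CHECK(x.dim() >= 1, "choice expects a tensor with at least one dimension");
```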
```cpp
int64_t *samples_ptr = samples.data<int64_t>();

Tensor cdf = weights.cumsum(0);
cdf /= cdf[-1];
```
If somebody passes in a tensor with all zero weights, this divides by zero.
You are right, I'll fix this.
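One possible guard, as a sketch (the message and exact placement are assumptions):

```cpp
// Refuse an all-zero weight vector before normalizing the CDF, since
// cdf /= cdf[-1] would otherwise divide by zero.
Tensor cdf = weights.cumsum(0);
AT_CHECK(cdf[-1].item<double>() > 0,
         "invalid weights: the sum of the weights must be positive");
cdf /= cdf[-1];
```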
```cpp
Tensor cdf = weights.cumsum(0);
cdf /= cdf[-1];

AT_DISPATCH_FLOATING_TYPES(weights.scalar_type(), "Sampling with replacement", [&] {
```
Why not allow weights to be an integral type?
I'll make that change as well. I guess some casting will be necessary in the CUDA kernels though.
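A sketch of what the wider dispatch could look like, assuming the lambda body stays as in the PR:

```cpp
// AT_DISPATCH_ALL_TYPES covers integral dtypes as well as float/double;
// real-valued intermediate math then needs explicit casts when scalar_t
// is integral, as noted above.
AT_DISPATCH_ALL_TYPES(weights.scalar_type(), "Sampling with replacement", [&] {
    // ... same body as before ...
});
```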
```cpp
AT_CHECK(
    weights.is_contiguous(),
    "The sampling weights must be contiguous."
```
why?
You are right, it doesn't need to be. I can just check whether the weights are contiguous and, if they aren't, force them to be. I'll make this change, thanks.
This is looking good, thanks!

I've looked at the CPU part for now; there are a few things that I think should be improved. Let me know what you think.
```cpp
Tensor weights_contiugous;

if(!weights.is_contiguous()){
    weights_contiugous = weights.contiguous();
```
You can always call `weights.contiguous()` unconditionally; this avoids a copy in the case the tensor is already contiguous.
```cpp
if(!weights.is_contiguous()){
    weights_contiugous = weights.contiguous();
}else{
    weights_contiugous = weights.clone();
```
do you need to clone it here?
Not anymore!
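Combining both comments gives a one-liner, as a sketch (the variable name is kept from the diff, typo included, for continuity):

```cpp
// contiguous() returns the tensor itself (no copy) when it is already
// contiguous, so neither the branch nor the clone() is needed.
Tensor weights_contiugous = weights.contiguous();
```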
```cpp
}

AT_CHECK(
    weights_contiugous.device() == x.device(),
```
There is a typo here: it's meant to be `contiguous`.
```cpp
AT_DISPATCH_FLOATING_TYPES(weights_contiugous.scalar_type(), "generate keys", [&] {
    generate_keys<scalar_t>(
        keys.data<scalar_t>(),
        weights_contiugous.data<scalar_t>(),
```
You have an assert at the beginning that the weights should be `float`, so this doesn't work for `double`?
You are right, I'll fix this.
```cpp
){

    AT_CHECK(
        weights.dtype() == kFloat,
```
Do you mean that you want it to be a floating point type, like `double` or `float`?
I'll remove this check.
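If some dtype validation is still wanted, one option is to accept any floating point type rather than only `kFloat` (a sketch, not the PR's final code):

```cpp
// at::isFloatingType accepts kHalf, kFloat, and kDouble alike.
AT_CHECK(at::isFloatingType(weights.scalar_type()),
         "The sampling weights must be a floating point tensor.");
```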
I've done some benchmarks against NumPy, using only the CPU implementation for these tests. Here are the results for sampling 2k items out of 10k elements.

To reproduce:
@LeviViana I'm not sure how the two implementations compare now. Could you try:

```bash
python3 -m timeit --setup="import torch; x = torch.arange(10 ** 7).cuda(); w = torch.arange(10 ** 7).cuda().float() + 1; J, q = torch._multinomial_alias_setup(w)" "x[torch._multinomial_alias_draw(q, J, 10 ** 4)]"
python3 -m timeit --setup="import torch; x = torch.arange(10 ** 7).cuda(); w = torch.arange(10 ** 7).cuda().float() + 1" "torch.choice(x, 10 ** 4, True, w)"
```
Thanks @ptrblck, you are right.
Thanks for the update @LeviViana! I've built your current branch and used this script to profile both methods:

```python
import torch
import time

# Setup
x = torch.arange(10 ** 7).cuda()
w = torch.arange(10 ** 7).cuda().float() + 1
nb_iters = 1000

# Warmup
for _ in range(50):
    J, q = torch._multinomial_alias_setup(w)
    output = x[torch._multinomial_alias_draw(q, J, 10 ** 4)]

# Profile 1
torch.cuda.synchronize()
t0 = time.time()
for _ in range(nb_iters):
    J, q = torch._multinomial_alias_setup(w)
    output = x[torch._multinomial_alias_draw(q, J, 10 ** 4)]
torch.cuda.synchronize()
t1 = time.time()
print('elapsed {:.6f}s/iter'.format((t1 - t0) / nb_iters))

# Warmup
for _ in range(50):
    output = torch.choice(x, 10 ** 4, True, w)

# Profile 2
torch.cuda.synchronize()
t0 = time.time()
for _ in range(nb_iters):
    output = torch.choice(x, 10 ** 4, True, w)
torch.cuda.synchronize()
t1 = time.time()
print('elapsed {:.6f}s/iter'.format((t1 - t0) / nb_iters))
```

On a TitanV with CUDA 10.1.105 and cuDNN 7500 I get the following numbers:
Could you run this script and check your current output?
Thanks @ptrblck for the script. These are my outputs (RTX 2080Ti, CUDA 10.1, CuDNN 7.5):
Indeed, the multinomial aliases are faster than choice for sampling with replacement.
It would be interesting to establish an objective for this PR. Based on what I've noticed so far, we could combine the two approaches; this way we get the best of both implementations. If you validate this plan, I can make the changes and update the PR. What do you think @gchanan @soumith?
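A hedged sketch of such a combined plan, inferred from the benchmarks above (every name except `torch._multinomial_alias_setup` and `torch._multinomial_alias_draw` is hypothetical):

```python
import torch

def choice(x, k, replace, weights):
    if replace:
        # The alias method won the with-replacement benchmarks above.
        J, q = torch._multinomial_alias_setup(weights)
        return x[torch._multinomial_alias_draw(q, J, k)]
    # Keep the dedicated kernel where it is competitive: sampling
    # without replacement.
    return torch._choice_without_replacement(x, k, weights)  # hypothetical
```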
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
Any updates on this? It could be interesting to integrate this into the C++ API (I don't know if the .yaml files allow for auto-generating C++ API code too), since you currently have to use a slower alternative.
Also, why is this getting compared to
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
/easycla As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details. This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.
Any hope to get this re-opened?
Why did this issue get closed? It would be really great to have `torch.choice`.
Related to #16897 and #18457.