Description
Hello,
I ran into some puzzling errors while porting my Lua Torch code to PyTorch. After some serious debugging, my intuition is that they come down to something internal to PyTorch.
I am running on the master branch. To reproduce the errors, place the code I provided into the examples/vae/ folder and run python xxx.py.
I attached a few modifications of examples/vae/main.py in the zip file pytorch_debug.zip
(1) main_conv_new_rep.py: this code runs, but note that I modified the reparametrization part and got rid of the KLD term.
(2) main_conv_new_rep_KLD.py: this code DOES NOT run and gives the error message below. The only difference from the previous code is KLD (line 144); I set KLD to be 0.
oat, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [65,0,0], thread: [59,0,0] Assertion `input >= 0. && input <= 1.` failed.
/home/shangw/pytorch/torch/lib/THCUNN/BCECriterion.cu:30: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::tuple<float, float, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [65,0,0], thread: [60,0,0] Assertion `input >= 0. && input <= 1.` failed.
/home/shangw/pytorch/torch/lib/THCUNN/BCECriterion.cu:30: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::tuple<float, float, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [65,0,0], thread: [61,0,0] Assertion `input >= 0. && input <= 1.` failed.
/home/shangw/pytorch/torch/lib/THCUNN/BCECriterion.cu:30: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::tuple<float, float, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [65,0,0], thread: [62,0,0] Assertion `input >= 0. && input <= 1.` failed.
/home/shangw/pytorch/torch/lib/THCUNN/BCECriterion.cu:30: Acctype bce_functor<Dtype, Acctype>::operator()(Tuple) [with Tuple = thrust::tuple<float, float, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, Dtype = float, Acctype = float]: block: [65,0,0], thread: [63,0,0] Assertion `input >= 0. && input <= 1.` failed.
CUDA error after cudaEventDestroy in future dtor: device-side assert triggered
Traceback (most recent call last):
  File "main_conv_new_rep_kl.py", line 188, in <module>
    train(epoch)
  File "main_conv_new_rep_kl.py", line 159, in train
    loss = loss_function(recon_batch, data, mu, logvar)
  File "main_conv_new_rep_kl.py", line 135, in loss_function
    BCE = reconstruction_function(recon_x, x)
  File "/home/shangw/local/anaconda3/envs/py35s/lib/python3.5/site-packages/torch/nn/modules/module.py", line 225, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/shangw/local/anaconda3/envs/py35s/lib/python3.5/site-packages/torch/nn/modules/loss.py", line 34, in forward
    return backend_fn(self.size_average, weight=self.weight)(input, target)
  File "/home/shangw/local/anaconda3/envs/py35s/lib/python3.5/site-packages/torch/nn/_functions/thnn/loss.py", line 28, in forward
    result = super(BCELoss, self).forward(input, target)
  File "/home/shangw/local/anaconda3/envs/py35s/lib/python3.5/site-packages/torch/nn/_functions/thnn/auto.py", line 41, in forward
    output, *self.additional_args)
RuntimeError: cudaEventSynchronize in future::wait: device-side assert triggered
(3) main_conv_old_rep.py: this code DOES NOT run and gives the same error. The only difference from the first file, main_conv_new_rep.py, is that it uses the original reparametrization function (line 120), which is mathematically equivalent (and should be computationally equivalent as well).
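For anyone trying to reproduce this: the assertion in the log, `input >= 0. && input <= 1.`, can be mirrored on the host side to see which values passed to the BCE loss would trip it. Below is a minimal sketch in plain Python rather than PyTorch; the helper name `find_invalid_bce_inputs` and the sample values are hypothetical and are not taken from the attached scripts. One thing it illustrates is that NaN fails the same check, so the assert alone does not tell whether the input was out of range or NaN.

```python
def find_invalid_bce_inputs(values):
    """Return (index, value) pairs that would fail `input >= 0. && input <= 1.`.

    Hypothetical helper for illustration only; it is not part of PyTorch.
    """
    bad = []
    for i, v in enumerate(values):
        # Any comparison with NaN is False, so NaN is reported as invalid too.
        if not (v >= 0.0 and v <= 1.0):
            bad.append((i, v))
    return bad


# Made-up sample values standing in for a flattened reconstruction output.
recon = [0.0, 0.5, 1.0, 1.0000001, -1e-7, float("nan")]
invalid = find_invalid_bce_inputs(recon)
print([i for i, _ in invalid])  # indices of the offending entries
```

Running the same kind of check on the reconstruction (copied to the CPU) right before the loss call might help narrow down whether the values are slightly outside [0, 1] or are already NaN.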
If someone could look into this, I would greatly appreciate the effort, since it is a little bit time sensitive. Thanks!