GPU training crashes #43
Hi @Bhavani01! This issue should have been fixed with #39. Can you specify which version of PyTorch you're using so we can test appropriately? Thanks! |
Hi,
I use version 1.2.0 of PyTorch.
Thank you.
Bhavani
|
Is there any update on this? I now have the latest versions of Kiwi and PyTorch, but the GPU training still fails. I also have an additional issue: training on the CPU is fine, but when I try to predict, it fails. It exits without giving an error. Pasting the log here. Any insights on what I could do differently? Thanks in advance. |
Hi! On the second issue, it is hard to diagnose from a log that just stops; would you mind sharing the command/config you're using to run the prediction pipeline? |
experiment-name: predict-predest |
Hi, I have made a pull request #44 that should solve the issue at hand. Miguel |
Hi @Bhavani01. Please let us know whether the current version of |
I re-installed it but it still crashes. This is the only difference in the output log. |
In the exact same line as before? As for your second issue, I can only reproduce this logging-but-no-output situation when running the predict pipeline with a Predictor and not a Predictor-Estimator. It should be noted that the Predictor is just a pre-training step and can't actually generate QE tags. You need to train the Estimator on top of the Predictor. Can you confirm you have a predictor-estimator? @kepler maybe we should add an error message when trying to run the predict pipeline with a predictor. (the names are kind of confusing hehe). This would avoid these silent crashes and provide actionable feedback. |
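For illustration, the kind of guard proposed above might look like the following sketch. The class and function names here are hypothetical and do not reflect OpenKiwi's actual internals.

```python
# Hypothetical sketch of the proposed check. Class and function names are
# illustrative only and are not OpenKiwi's actual API.
class Predictor:
    """Pre-training model only; it cannot emit QE tags."""


class PredictorEstimator(Predictor):
    """Full model, able to produce QE predictions."""


def run_predict_pipeline(model):
    # Fail loudly instead of exiting silently when the loaded model is a
    # plain Predictor rather than a Predictor-Estimator.
    if not isinstance(model, PredictorEstimator):
        raise ValueError(
            "The predict pipeline needs a Predictor-Estimator; a plain "
            "Predictor is only a pre-training step and cannot generate QE tags."
        )
    # ... actual prediction would happen here ...
```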
Could it be because of this? I assumed --load-pred-source was for when I wanted to predict the source, and similarly for --load-pred-target, but when I looked at the documentation again it says: --load-pred-target - If set, model architecture and vocabulary parameters are ignored. Load pretrained predictor tgt->src. Will let you know if this training is successful in 3 days. But the GPU problem still persists. Thanks a lot for your help. |
Hmmm, I think you assumed the correct thing and our documentation is wrong. I'm going to confirm this, but an initial look seems to indicate that --load-pred-target is indeed used when predicting src -> tgt, which is what I assume you want to do. On the other hand, I'm not able to reproduce your error with the GPU: I'm using Python 3.6.8, the latest version of OpenKiwi (installed from master, which is important as we haven't updated the version on pip yet) and PyTorch 1.2.0. The only thing I've changed in the config you provided was adding a line with As for your second issue, an easy way to test whether the model is at fault is to download our pre-trained models available on our releases and run the same config, but pointing to one of our pre-trained models. |
I did test the config with the pretrained models. So I guess there is something wrong with my model, even though the training completed successfully. I am retraining now and will test again when it completes. For the GPU issue, I will download from master again this time instead of updating it, and test. I would be really grateful if you could confirm the --load-pred-target behaviour. Thanks |
I cloned the master again instead of pulling changes and the training on the GPU seems OK so far. Thanks. Will report about the other issue when I finish. |
Nice! Glad to hear your problem has been solved :) I confirmed the issue about --load-pred-target and my suspicion was correct: it is used to load src -> tgt predictors. Our documentation has a mistake; thanks for pointing it out! |
The predictor training on the GPU was fine. However, it crashed for the estimator training.
Command: kiwi train --config /ec/dgt/local/exodus/home/bhaskbh/new_train/estimate.yaml
Logging:
2019-11-08 09:06:48.174 [root setup:380] This is run ID: 27917144000c41e4a505dcaff111c669
2019-11-08 09:06:48.174 [root setup:383] Inside experiment ID: 0 (EN-DE Train Estimator)
2019-11-08 09:06:48.174 [root setup:386] Local output directory is: /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test
2019-11-08 09:06:48.174 [root setup:389] Logging execution to MLflow at: None
2019-11-08 09:06:48.194 [root setup:395] Using GPU: 2
2019-11-08 09:06:48.194 [root setup:400] Artifacts location: None
2019-11-08 09:06:48.201 [kiwi.lib.train run:154] Training the PredEst (Predictor-Estimator) model
2019-11-08 09:07:05.865 [kiwi.data.utils load_vocabularies_to_fields:126] Loaded vocabularies from /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test/best_model.torch
2019-11-08 09:07:12.816 [kiwi.lib.train run:187] Estimator(
(predictor_tgt): Predictor(
(attention): Attention(
(scorer): MLPScorer(
(layers): ModuleList(
(0): Sequential(
(0): Linear(in_features=1600, out_features=800, bias=True)
(1): Tanh()
)
(1): Sequential(
(0): Linear(in_features=800, out_features=1, bias=True)
(1): Tanh()
)
)
)
)
(embedding_source): Embedding(45004, 200, padding_idx=1)
(embedding_target): Embedding(45004, 200, padding_idx=1)
(lstm_source): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
(forward_target): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5)
(backward_target): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5)
(W1): Embedding(45004, 200, padding_idx=1)
(_loss): CrossEntropyLoss()
)
(mlp): Sequential(
(0): Linear(in_features=1000, out_features=125, bias=True)
(1): Tanh()
)
(lstm): LSTM(125, 125, batch_first=True, bidirectional=True)
(embedding_out): Linear(in_features=250, out_features=2, bias=True)
(sentence_pred): Sequential(
(0): Linear(in_features=250, out_features=125, bias=True)
(1): Sigmoid()
(2): Linear(in_features=125, out_features=62, bias=True)
(3): Sigmoid()
(4): Linear(in_features=62, out_features=1, bias=True)
)
(binary_pred): Sequential(
(0): Linear(in_features=250, out_features=125, bias=True)
(1): Tanh()
(2): Linear(in_features=125, out_features=62, bias=True)
(3): Tanh()
(4): Linear(in_features=62, out_features=2, bias=True)
)
(xents): ModuleDict(
(tags): CrossEntropyLoss()
)
(mse_loss): MSELoss()
(xent_binary): CrossEntropyLoss()
)
2019-11-08 09:07:12.816 [kiwi.lib.train run:188] 39845791 parameters
2019-11-08 09:07:12.817 [kiwi.trainers.trainer run:75] Epoch 1 of 10
Batches: 0%| | 1/5942 [00:02<3:29:00, 2.11s/ batches]/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [10,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [26,0,0] Assertion `t >= 0 && t < n_classes` failed.
Floating point exception
My config is as follows:
model: estimator
output-dir: /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test
hidden-est: 125
rnn-layers-est: 1
dropout-est: 0.0
mlp-est: True
token-level: True
sentence-level: True
sentence-ll: False
binary-level: True
predict-target: true
target-bad-weight: 2.5
predict-source: false
source-bad-weight: 2.5
predict-gaps: false
target-bad-weight: 2.5
epochs: 10
checkpoint-validation-steps: 0
checkpoint-save: true
checkpoint-keep-only-best: 3
checkpoint-early-stop-patience: 0
log-interval: 100
learning-rate: 2e-3
train-batch-size: 64
valid-batch-size: 64
load-pred-target: /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test/best_model.torch
wmt18-format: false
train-source: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.src
train-target: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.mt
train-pe: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.pe
train-target-tags: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.tags
train-sentence-scores: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.ter
split: 0.99
experiment-name: EN-DE Train Estimator
gpu-id: 2 |
Hey @Bhavani01, I'll take a look at this today and get back to you shortly! I took the liberty of editing your comment to make it easier to read :) |
I'm having trouble reproducing your problem. I trained a small predictor with the example training config we make available on the repo and then trained an estimator with your config above. The only things I changed were the data (using WMT19) and the wmt18-format flag (since you're not predicting gaps and WMT19 has gaps, Kiwi needs to know to filter them out). Am I correct to assume that you trained your predictor with the config you show in your first comment? I'd like your confirmation on that, but while I wait I'll try training a predictor with that config and an estimator on top to see if I can find something, and get back to you. |
For the predictor I used the same config as in my first comment. Other than saving the best model, is there any other message to indicate successful training? I say the predictor training was successful because it ran for 6 epochs and saved the best model. |
Nope, that should be it. If you're getting reasonable results (Acc > 0.6), the model is improving, and training finishes without any error, then yes, it was a successful training. I've just finished training a predictor and an estimator with your configs (using WMT19 data for both, solely for testing purposes) and they both trained successfully. With this, I can't really reproduce your problem... Maybe it is related to the data you're using and it being handled wrongly by Kiwi somehow? That floating point error is extremely weird. Also, can you train with the CPU? The estimator should be pretty fast to train on CPU; that can be an alternative for the time being while we find out what's going on here! |
OK. I will try to train with the CPU and the WMT19 data. |
This is my error with the CPU:
Batches: 0%| | 1/5942 [00:37<62:35:24, 37.93s/ batches]Traceback (most recent call last):
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/bin/kiwi", line 11, in <module>
load_entry_point('openkiwi', 'console_scripts', 'kiwi')()
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/__main__.py", line 22, in main
return kiwi.cli.main.cli()
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/cli/main.py", line 71, in cli
train.main(extra_args)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/cli/pipelines/train.py", line 142, in main
train.train_from_options(options)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/lib/train.py", line 123, in train_from_options
trainer = run(ModelClass, output_dir, pipeline_options, model_options)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/lib/train.py", line 204, in run
trainer.run(train_iter, valid_iter, epochs=pipeline_options.epochs)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/trainers/trainer.py", line 76, in run
self.train_epoch(train_iterator, valid_iterator)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/trainers/trainer.py", line 96, in train_epoch
outputs = self.train_step(batch)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/trainers/trainer.py", line 141, in train_step
loss_dict = self.model.loss(model_out, batch)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/models/predictor_estimator.py", line 507, in loss
loss_bin = self.binary_loss(model_out, batch)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/models/predictor_estimator.py", line 497, in binary_loss
loss = self.xent_binary(model_out[const.BINARY], labels.long())
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/modules/loss.py", line 904, in forward
ignore_index=self.ignore_index, reduction=self.reduction)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/functional.py", line 1790, in nll_loss
ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed. at /pytorch/aten/src/THNN/generic/ClassNLLCriterion.c:93 |
Hmmm, this definitely seems to imply something weird with the data you're feeding into Kiwi. Would you mind sharing a subset of this data? Since this fails on the first batch, a couple of lines of each file should be enough for me to check what's going on! |
I suspected this had to do with the tags. Since I don't have human annotators, I have a simple script to generate the labels from the alignments. A superficial glance at them seemed fine. I trained with just TER and sentence level and got the same error. Since the predictor trained successfully with the same src, mt and pe files, TER is the only additional input. Could this be because my TER range is not 0-1, but some segments have a score greater than 1? I will try capping all scores >1 to 1 and check. Will also try to extract a subset of my training data that is in the public domain to share. Thanks. |
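A minimal sketch of the capping step described above, assuming one TER score per line; the file names are placeholders rather than paths from this thread.

```python
# Minimal sketch: cap sentence-level TER scores at 1.0 before training.
# "train.ter" and "train.capped.ter" are placeholder file names.
with open("train.ter") as scores_in, open("train.capped.ter", "w") as scores_out:
    for line in scores_in:
        score = float(line.strip())
        scores_out.write(f"{min(score, 1.0)}\n")
```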
Hi @Bhavani01, has your issue been solved by regenerating/repairing the data? |
I don't get the same error anymore after changing the TER scores. In the GPU training I get the "RuntimeError: CUDA out of memory." error even though I have more than enough memory, and irrespective of the size of the data I am training on. I got a different error in the CPU training with the full dataset. Now I am running with a subsection of the data, and it is still training. I am trying to use the --load-model option to train with smaller sets of data until I reach the part that is causing problems, as I did basic data cleaning and can't see any obvious issues. |
That's good news. On the GPU error: it is not related to the computer's memory but to the GPU memory. As such, the size of the training data does not matter (that fills the RAM, not the GPU); what matters is the batch size and the number of tokens in each sentence. My recommendations would be to decrease the batch size (while adjusting the learning rate accordingly) and to use the options we provide to control the maximum token count of src and tgt sentences. It can happen that you have some unusually long sentences being loaded onto the GPU, and this exceeds the amount of memory available. As for the CPU training, I'd be very interested in the error you're getting, as that is not expected. Finally, we are preparing some updates for Kiwi that should add some sanity checks for data. This should help us avoid errors like your previous one in the future. Stay tuned! |
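As a rough illustration of the length-based filtering suggested above, a sketch that drops overly long sentence pairs before training. File names and the threshold are placeholders, and any other aligned files (PE, tags, TER scores) would need the same rows dropped to stay in sync.

```python
# Rough sketch: drop parallel sentence pairs whose source or target side is
# longer than MAX_TOKENS whitespace tokens. File names and the threshold are
# placeholders; aligned files (pe, tags, ter) must be filtered identically.
MAX_TOKENS = 50

with open("train.src") as src_in, open("train.mt") as tgt_in, \
        open("filtered.src", "w") as src_out, \
        open("filtered.mt", "w") as tgt_out:
    for src_line, tgt_line in zip(src_in, tgt_in):
        if (len(src_line.split()) <= MAX_TOKENS
                and len(tgt_line.split()) <= MAX_TOKENS):
            src_out.write(src_line)
            tgt_out.write(tgt_line)
```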
I did try with batch size 32. I restrict the length of segments to 200 in my data cleaning. I will reduce it and test. Thanks. |
Hi, for the NuQE and Quetch trainings, the target is the MT output, right? Not the reference or the post-edit? |
When training a QE model, the target should always be the MT output. This applies to all models in OpenKiwi. We normally refer to things with the following nomenclature: Thanks for the heads-up! We'll see how to use system time. I'd say we probably set up something wrong but never noticed, since we are in London time :) |
Hi, Thanks. |
Hey @Bhavani01, that shouldn't happen, let me have a look into what's going on. Also, I'd appreciate it if you could open new issues instead of continuing the conversation on this one! That way we can containerise topics and use these issues to help similar questions in the future. |
Got it. BTW the GPU issue was from my system: all my GPUs were blocked. Sorry about that. |
Are you talking about this last comment? So the predict pipeline is working as expected? 🙂 |
Yes. I was outside the virtual env and the gpu was not visible to it. |
Ah! Glad to know it's solved! I'll close this issue for now. Feel free to open a new one in case you have any further questions. |
I have no problems training on a CPU. But when I train on a GPU it crashes every time.
2019-10-15 09:02:52.215 [root setup:380] This is run ID: 4d526700aafc4f5fba779bae21789a82
2019-10-15 09:02:52.215 [root setup:383] Inside experiment ID: 0 (EN-DE Pretrain Predictor)
2019-10-15 09:02:52.215 [root setup:386] Local output directory is: /ec/dgt/local/exodus/home/bhaskbh/gpu_train
2019-10-15 09:02:52.215 [root setup:389] Logging execution to MLflow at: None
2019-10-15 09:02:52.247 [root setup:395] Using GPU: 3
2019-10-15 09:02:52.247 [root setup:400] Artifacts location: None
2019-10-15 09:02:52.252 [kiwi.lib.train run:154] Training the PredEst Predictor model (an embedder model) model
2019-10-15 09:03:13.830 [kiwi.lib.train run:187] Predictor(
(attention): Attention(
(scorer): MLPScorer(
(layers): ModuleList(
(0): Sequential(
(0): Linear(in_features=1600, out_features=800, bias=True)
(1): Tanh()
)
(1): Sequential(
(0): Linear(in_features=800, out_features=1, bias=True)
(1): Tanh()
)
)
)
)
(embedding_source): Embedding(45004, 200, padding_idx=1)
(embedding_target): Embedding(45004, 200, padding_idx=1)
(lstm_source): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
(forward_target): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5)
(backward_target): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5)
(W1): Embedding(45004, 200, padding_idx=1)
(_loss): CrossEntropyLoss()
)
2019-10-15 09:03:13.831 [kiwi.lib.train run:188] 39389601 parameters
2019-10-15 09:03:13.831 [kiwi.trainers.trainer run:75] Epoch 1 of 6
Batches: 0%| | 0/5680 [00:00<?, ? batches/s]
Traceback (most recent call last):
File "/home/bhaskbh/.local/bin/kiwi", line 11, in
sys.exit(main())
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/main.py", line 22, in main
return kiwi.cli.main.cli()
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/cli/main.py", line 71, in cli
train.main(extra_args)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/cli/pipelines/train.py", line 142, in main
train.train_from_options(options)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/lib/train.py", line 123, in train_from_options
trainer = run(ModelClass, output_dir, pipeline_options, model_options)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/lib/train.py", line 204, in run
trainer.run(train_iter, valid_iter, epochs=pipeline_options.epochs)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/trainers/trainer.py", line 76, in run
self.train_epoch(train_iterator, valid_iterator)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/trainers/trainer.py", line 96, in train_epoch
outputs = self.train_step(batch)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/trainers/trainer.py", line 140, in train_step
model_out = self.model(batch)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/models/predictor.py", line 240, in forward
source_mask = self.get_mask(batch, source_side)[:, 1:-1]
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/models/model.py", line 205, in get_mask
input_tensor != pad_id, dtype=torch.uint8
RuntimeError: Expected object of backend CUDA but got backend CPU for argument #3 'other'
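For reference, this error generally means the two operands of the comparison live on different devices (one tensor already on the GPU, the other still on the CPU). A minimal, hypothetical illustration of keeping both operands on the same device follows; it is not OpenKiwi code, and the variable names are made up for the example.

```python
import torch

# Hypothetical illustration of the mismatch behind the traceback above: the
# batch tensor sits on one device while pad_id sits on another. Moving both
# to the same device avoids the backend/device mismatch error.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

input_tensor = torch.tensor([[5, 7, 1, 1]])  # token ids, 1 = padding
pad_id = torch.tensor(1)

mask = input_tensor.to(device) != pad_id.to(device)
print(mask)
```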
To Reproduce
A simple training run with the config file below.
epochs: 6
checkpoint-validation-steps: 5000
checkpoint-save: true
checkpoint-keep-only-best: 1
checkpoint-early-stop-patience: 0
optimizer: adam
log-interval: 100
learning-rate: 2e-3
learning-rate-decay: 0.6
learning-rate-decay-start: 2
train-batch-size: 64
valid-batch-size: 64
train-source: /home/bhaskbh/data/en_de.src
train-target: /home/bhaskbh/data/en_de.pe
split: 0.99
source-vocab-size: 45000
target-vocab-size: 45000
source-max-length: 50
source-min-length: 1
target-max-length: 50
target-min-length: 1
source-vocab-min-frequency: 1
target-vocab-min-frequency: 1
experiment-name: EN-DE Pretrain Predictor
gpu-id: 3
Environment