GPU training crashes #43
Hi @Bhavani01! This issue should have been fixed with #39. Can you specify which version of PyTorch you're using so we can test appropriately? Thanks! |
Hi,
I use version 1.2.0 of PyTorch.
Thank you.
Bhavani
|
Is there any update on this? I now have the latest versions of Kiwi and PyTorch, but the GPU training still fails. I also have an additional issue: training on the CPU is fine, but when I try to predict, it fails. It exits without giving an error. Pasting the log here. Any insights on what I could do differently? Thanks in advance. |
Hi! On the second issue, it is hard to diagnose from a log that just stops; would you mind sharing the command/config you're using to run the prediction pipeline? |
experiment-name: predict-predest |
Hi, I have made a pull request #44 that should solve the issue at hand. Miguel |
Hi @Bhavani01. Please let us know whether the current version of |
I re-installed it but it still crashes. This is the only difference in the output log. |
In the exact same line as before? As for your second issue, I can only reproduce this logging-but-no-output situation when running the predict pipeline with a Predictor and not a Predictor-Estimator. It should be noted that the Predictor is just a pre-training step and can't actually generate QE tags. You need to train the Estimator on top of the Predictor. Can you confirm you have a predictor-estimator? @kepler maybe we should add an error message when trying to run the predict pipeline with a predictor. (the names are kind of confusing hehe). This would avoid these silent crashes and provide actionable feedback. |
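For illustration, the kind of guard proposed above might look like the following sketch. The class and function names here are hypothetical and do not reflect OpenKiwi's actual internals.

```python
# Hypothetical sketch of the proposed check. Class and function names are
# illustrative only and are not OpenKiwi's actual API.
class Predictor:
    """Pre-training model only; it cannot emit QE tags."""


class PredictorEstimator(Predictor):
    """Full model, able to produce QE predictions."""


def run_predict_pipeline(model):
    # Fail loudly instead of exiting silently when the loaded model is a
    # plain Predictor rather than a Predictor-Estimator.
    if not isinstance(model, PredictorEstimator):
        raise ValueError(
            "The predict pipeline needs a Predictor-Estimator; a plain "
            "Predictor is only a pre-training step and cannot generate QE tags."
        )
    # ... actual prediction would happen here ...
```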
Could it be because of this? I assumed --load-pred-source was for when I wanted to predict the source, and similarly for --load-pred-target, but when I looked at the documentation again it says: --load-pred-target - If set, model architecture and vocabulary parameters are ignored. Load pretrained predictor tgt->src. Will let you know if this training is successful in 3 days. But the GPU problem still persists. Thanks a lot for your help. |
Hmmm, I think you assumed the correct thing and our documentation is wrong. I'm going to confirm this, but an initial look seems to indicate that --load-pred-target is indeed used when predicting src -> tgt, which is what I assume you want to do. On the other hand, I'm not able to reproduce your error with the GPU: I'm using Python 3.6.8, the latest version of OpenKiwi (installed from master, which is important as we haven't updated the version on pip yet) and PyTorch 1.2.0. The only thing I've changed in the config you provided was adding a line with As for your second issue, an easy way to test whether the model is at fault is to download our pre-trained models available on our releases and run the same config, but pointing to one of our pre-trained models. |
I did test the config with the pretrained models. So I guess there is something wrong with my model, even though the training completed successfully. I am retraining now and will test again when it completes. For the GPU issue, I will download from master again this time instead of updating it, and test. I would be really grateful if you could confirm the --load-pred-target behaviour. Thanks |
I cloned the master again instead of pulling changes and the training on the GPU seems OK so far. Thanks. Will report about the other issue when I finish. |
Nice! Glad to hear your problem has been solved :) I confirmed the issue about --load-pred-target and my suspicion was correct: it is used to load src -> tgt predictors. Our documentation has a mistake; thanks for pointing it out! |
The predictor training on the GPU was fine. However, it crashed for the estimator training.
Command: kiwi train --config /ec/dgt/local/exodus/home/bhaskbh/new_train/estimate.yaml
Logging:
2019-11-08 09:06:48.174 [root setup:380] This is run ID: 27917144000c41e4a505dcaff111c669
2019-11-08 09:06:48.174 [root setup:383] Inside experiment ID: 0 (EN-DE Train Estimator)
2019-11-08 09:06:48.174 [root setup:386] Local output directory is: /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test
2019-11-08 09:06:48.174 [root setup:389] Logging execution to MLflow at: None
2019-11-08 09:06:48.194 [root setup:395] Using GPU: 2
2019-11-08 09:06:48.194 [root setup:400] Artifacts location: None
2019-11-08 09:06:48.201 [kiwi.lib.train run:154] Training the PredEst (Predictor-Estimator) model
2019-11-08 09:07:05.865 [kiwi.data.utils load_vocabularies_to_fields:126] Loaded vocabularies from /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test/best_model.torch
2019-11-08 09:07:12.816 [kiwi.lib.train run:187] Estimator(
(predictor_tgt): Predictor(
(attention): Attention(
(scorer): MLPScorer(
(layers): ModuleList(
(0): Sequential(
(0): Linear(in_features=1600, out_features=800, bias=True)
(1): Tanh()
)
(1): Sequential(
(0): Linear(in_features=800, out_features=1, bias=True)
(1): Tanh()
)
)
)
)
(embedding_source): Embedding(45004, 200, padding_idx=1)
(embedding_target): Embedding(45004, 200, padding_idx=1)
(lstm_source): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
(forward_target): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5)
(backward_target): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5)
(W1): Embedding(45004, 200, padding_idx=1)
(_loss): CrossEntropyLoss()
)
(mlp): Sequential(
(0): Linear(in_features=1000, out_features=125, bias=True)
(1): Tanh()
)
(lstm): LSTM(125, 125, batch_first=True, bidirectional=True)
(embedding_out): Linear(in_features=250, out_features=2, bias=True)
(sentence_pred): Sequential(
(0): Linear(in_features=250, out_features=125, bias=True)
(1): Sigmoid()
(2): Linear(in_features=125, out_features=62, bias=True)
(3): Sigmoid()
(4): Linear(in_features=62, out_features=1, bias=True)
)
(binary_pred): Sequential(
(0): Linear(in_features=250, out_features=125, bias=True)
(1): Tanh()
(2): Linear(in_features=125, out_features=62, bias=True)
(3): Tanh()
(4): Linear(in_features=62, out_features=2, bias=True)
)
(xents): ModuleDict(
(tags): CrossEntropyLoss()
)
(mse_loss): MSELoss()
(xent_binary): CrossEntropyLoss()
)
2019-11-08 09:07:12.816 [kiwi.lib.train run:188] 39845791 parameters
2019-11-08 09:07:12.817 [kiwi.trainers.trainer run:75] Epoch 1 of 10
Batches: 0%| | 1/5942 [00:02<3:29:00, 2.11s/ batches]/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [10,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [26,0,0] Assertion `t >= 0 && t < n_classes` failed.
Floating point exception
My config is as follows:
model: estimator
output-dir: /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test
hidden-est: 125
rnn-layers-est: 1
dropout-est: 0.0
mlp-est: True
token-level: True
sentence-level: True
sentence-ll: False
binary-level: True
predict-target: true
target-bad-weight: 2.5
predict-source: false
source-bad-weight: 2.5
predict-gaps: false
target-bad-weight: 2.5
epochs: 10
checkpoint-validation-steps: 0
checkpoint-save: true
checkpoint-keep-only-best: 3
checkpoint-early-stop-patience: 0
log-interval: 100
learning-rate: 2e-3
train-batch-size: 64
valid-batch-size: 64
load-pred-target: /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test/best_model.torch
wmt18-format: false
train-source: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.src
train-target: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.mt
train-pe: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.pe
train-target-tags: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.tags
train-sentence-scores: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.ter
split: 0.99
experiment-name: EN-DE Train Estimator
gpu-id: 2 |
Hey @Bhavani01, I'll take a look at this today and get back to you shortly! I took the liberty of editing your comment to make it easier to read :) |
I'm having trouble reproducing your problem. I trained a small predictor with the example training config we make available on the repo and then trained an estimator with your config above. The only things I changed were the data (using WMT19) and the wmt18-format flag (since you're not predicting gaps and WMT19 has gaps, Kiwi needs to know to filter them out). Am I correct to assume that you trained your predictor with the config you show in your first comment? I'd like your confirmation on that, but while I wait I'll try training a predictor with that config and an estimator on top to see if I can find something, and get back to you. |
For the predictor I used the same config as in my first comment. Other than saving the best model, is there any other message to indicate successful training? I say the predictor training was successful because it ran for 6 epochs and saved the best model. |
Nope, that should be it. If you're getting reasonable results (Acc > 0.6), the model is improving, and training finishes without any error, then yes, it was a successful training. I've just finished training a predictor and an estimator with your configs (using WMT19 data for both, solely for testing purposes) and they both trained successfully. With this, I can't really reproduce your problem... Maybe it is related to the data you're using and it being handled wrongly by Kiwi somehow? That floating point error is extremely weird. Also, can you train with the CPU? The estimator should be pretty fast to train on CPU; that can be an alternative for the time being while we find out what's going on here! |
OK. I will try to train with the CPU and the WMT19 data. |
This is my error with the CPU:
Batches: 0%| | 1/5942 [00:37<62:35:24, 37.93s/ batches]Traceback (most recent call last):
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/bin/kiwi", line 11, in <module>
load_entry_point('openkiwi', 'console_scripts', 'kiwi')()
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/__main__.py", line 22, in main
return kiwi.cli.main.cli()
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/cli/main.py", line 71, in cli
train.main(extra_args)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/cli/pipelines/train.py", line 142, in main
train.train_from_options(options)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/lib/train.py", line 123, in train_from_options
trainer = run(ModelClass, output_dir, pipeline_options, model_options)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/lib/train.py", line 204, in run
trainer.run(train_iter, valid_iter, epochs=pipeline_options.epochs)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/trainers/trainer.py", line 76, in run
self.train_epoch(train_iterator, valid_iterator)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/trainers/trainer.py", line 96, in train_epoch
outputs = self.train_step(batch)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/trainers/trainer.py", line 141, in train_step
loss_dict = self.model.loss(model_out, batch)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/models/predictor_estimator.py", line 507, in loss
loss_bin = self.binary_loss(model_out, batch)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/models/predictor_estimator.py", line 497, in binary_loss
loss = self.xent_binary(model_out[const.BINARY], labels.long())
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/modules/loss.py", line 904, in forward
ignore_index=self.ignore_index, reduction=self.reduction)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/functional.py", line 1790, in nll_loss
ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed. at /pytorch/aten/src/THNN/generic/ClassNLLCriterion.c:93 |
Hmmm, this definitely seems to imply something weird with the data you're feeding into Kiwi. Would you mind sharing a subset of this data? Since this fails on the first batch, a couple of lines of each file should be enough for me to check what's going on! |
I suspected this had to do with the tags. Since I don't have human annotators, I have a simple script to generate the labels from the alignments. A superficial glance at them seemed fine. I trained with just TER and sentence level and got the same error. Since the predictor trained successfully with the same src, mt and pe files, TER is the only additional input. Could this be because my TER range is not 0-1, but some segments have a score greater than 1? I will try capping all scores >1 to 1 and check. Will also try to extract a subset of my training data that is in the public domain to share. Thanks. |
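A minimal sketch of the capping step described above, assuming one TER score per line; the file names are placeholders rather than paths from this thread.

```python
# Minimal sketch: cap sentence-level TER scores at 1.0 before training.
# "train.ter" and "train.capped.ter" are placeholder file names.
with open("train.ter") as scores_in, open("train.capped.ter", "w") as scores_out:
    for line in scores_in:
        score = float(line.strip())
        scores_out.write(f"{min(score, 1.0)}\n")
```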
Hi @Bhavani01, has your issue been solved by regenerating/repairing the data? |
I don't get the same error anymore after changing the TER scores. In the GPU training I get the "RuntimeError: CUDA out of memory." error even though I have more than enough memory, and irrespective of the size of the data I am training on. I got a different error in the CPU training with the full dataset. Now I am running with a subsection of the data, and it is still training. I am trying to use the --load-model option to train with smaller sets of data until I reach the part that is causing problems, as I did basic data cleaning and can't see any obvious issues. |
That's good news. On the GPU error: it is not related to the computer's memory but to the GPU memory. As such, the size of the training data does not matter (that fills the RAM, not the GPU); what matters is the batch size and the number of tokens in each sentence. My recommendations would be to decrease the batch size (while adjusting the learning rate accordingly) and to use the options we provide to control the maximum token count of src and tgt sentences. It can happen that you have some unusually long sentences being loaded onto the GPU, and this exceeds the amount of memory available. As for the CPU training, I'd be very interested in the error you're getting, as that is not expected. Finally, we are preparing some updates for Kiwi that should add some sanity checks for data. This should help us avoid errors like your previous one in the future. Stay tuned! |
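As a rough illustration of the length-based filtering suggested above, a sketch that drops overly long sentence pairs before training. File names and the threshold are placeholders, and any other aligned files (PE, tags, TER scores) would need the same rows dropped to stay in sync.

```python
# Rough sketch: drop parallel sentence pairs whose source or target side is
# longer than MAX_TOKENS whitespace tokens. File names and the threshold are
# placeholders; aligned files (pe, tags, ter) must be filtered identically.
MAX_TOKENS = 50

with open("train.src") as src_in, open("train.mt") as tgt_in, \
        open("filtered.src", "w") as src_out, \
        open("filtered.mt", "w") as tgt_out:
    for src_line, tgt_line in zip(src_in, tgt_in):
        if (len(src_line.split()) <= MAX_TOKENS
                and len(tgt_line.split()) <= MAX_TOKENS):
            src_out.write(src_line)
            tgt_out.write(tgt_line)
```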
I did try with batch size 32. I restrict the length of segments to 200 in my data cleaning. I will reduce it and test. Thanks. |
Hi, for the NuQE and Quetch trainings, the target is the MT output, right? Not the reference or the post-edit? |
When training a QE model, the target should always be the MT output. This applies to all models in OpenKiwi. We normally refer to things with the following nomenclature: Thanks for the heads-up! We'll see how to use system time. I'd say we probably set up something wrong but never noticed, since we are in London time :) |
Hi, Thanks. |
Hey @Bhavani01, that shouldn't happen, let me have a look into what's going on. Also, I'd appreciate it if you could open new issues instead of continuing the conversation on this one! That way we can containerise topics and use these issues to help similar questions in the future. |
Got it. BTW the GPU issue was from my system: all my GPUs were blocked. Sorry about that. |
Are you talking about this last comment? So the predict pipeline is working as expected? 🙂 |
Yes. I was outside the virtual env and the gpu was not visible to it. |
Ah! Glad to know it's solved! I'll close this issue for now. Feel free to open a new one in case you have any further questions. |
I have no problems training on a CPU. But when I train on a GPU it crashes every time.
2019-10-15 09:02:52.215 [root setup:380] This is run ID: 4d526700aafc4f5fba779bae21789a82
2019-10-15 09:02:52.215 [root setup:383] Inside experiment ID: 0 (EN-DE Pretrain Predictor)
2019-10-15 09:02:52.215 [root setup:386] Local output directory is: /ec/dgt/local/exodus/home/bhaskbh/gpu_train
2019-10-15 09:02:52.215 [root setup:389] Logging execution to MLflow at: None
2019-10-15 09:02:52.247 [root setup:395] Using GPU: 3
2019-10-15 09:02:52.247 [root setup:400] Artifacts location: None
2019-10-15 09:02:52.252 [kiwi.lib.train run:154] Training the PredEst Predictor model (an embedder model) model
2019-10-15 09:03:13.830 [kiwi.lib.train run:187] Predictor(
(attention): Attention(
(scorer): MLPScorer(
(layers): ModuleList(
(0): Sequential(
(0): Linear(in_features=1600, out_features=800, bias=True)
(1): Tanh()
)
(1): Sequential(
(0): Linear(in_features=800, out_features=1, bias=True)
(1): Tanh()
)
)
)
)
(embedding_source): Embedding(45004, 200, padding_idx=1)
(embedding_target): Embedding(45004, 200, padding_idx=1)
(lstm_source): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
(forward_target): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5)
(backward_target): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5)
(W1): Embedding(45004, 200, padding_idx=1)
(_loss): CrossEntropyLoss()
)
2019-10-15 09:03:13.831 [kiwi.lib.train run:188] 39389601 parameters
2019-10-15 09:03:13.831 [kiwi.trainers.trainer run:75] Epoch 1 of 6
Batches: 0%| | 0/5680 [00:00<?, ? batches/s]
Traceback (most recent call last):
File "/home/bhaskbh/.local/bin/kiwi", line 11, in
sys.exit(main())
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/main.py", line 22, in main
return kiwi.cli.main.cli()
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/cli/main.py", line 71, in cli
train.main(extra_args)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/cli/pipelines/train.py", line 142, in main
train.train_from_options(options)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/lib/train.py", line 123, in train_from_options
trainer = run(ModelClass, output_dir, pipeline_options, model_options)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/lib/train.py", line 204, in run
trainer.run(train_iter, valid_iter, epochs=pipeline_options.epochs)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/trainers/trainer.py", line 76, in run
self.train_epoch(train_iterator, valid_iterator)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/trainers/trainer.py", line 96, in train_epoch
outputs = self.train_step(batch)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/trainers/trainer.py", line 140, in train_step
model_out = self.model(batch)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/models/predictor.py", line 240, in forward
source_mask = self.get_mask(batch, source_side)[:, 1:-1]
File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/models/model.py", line 205, in get_mask
input_tensor != pad_id, dtype=torch.uint8
RuntimeError: Expected object of backend CUDA but got backend CPU for argument #3 'other'
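For reference, this error generally means the two operands of the comparison live on different devices (one tensor already on the GPU, the other still on the CPU). A minimal, hypothetical illustration of keeping both operands on the same device follows; it is not OpenKiwi code, and the variable names are made up for the example.

```python
import torch

# Hypothetical illustration of the mismatch behind the traceback above: the
# batch tensor sits on one device while pad_id sits on another. Moving both
# to the same device avoids the backend/device mismatch error.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

input_tensor = torch.tensor([[5, 7, 1, 1]])  # token ids, 1 = padding
pad_id = torch.tensor(1)

mask = input_tensor.to(device) != pad_id.to(device)
print(mask)
```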
To Reproduce
A simple training run with the config file below.
epochs: 6
checkpoint-validation-steps: 5000
checkpoint-save: true
checkpoint-keep-only-best: 1
checkpoint-early-stop-patience: 0
optimizer: adam
log-interval: 100
learning-rate: 2e-3
learning-rate-decay: 0.6
learning-rate-decay-start: 2
train-batch-size: 64
valid-batch-size: 64
train-source: /home/bhaskbh/data/en_de.src
train-target: /home/bhaskbh/data/en_de.pe
split: 0.99
source-vocab-size: 45000
target-vocab-size: 45000
source-max-length: 50
source-min-length: 1
target-max-length: 50
target-min-length: 1
source-vocab-min-frequency: 1
target-vocab-min-frequency: 1
experiment-name: EN-DE Pretrain Predictor
gpu-id: 3
Environment