GPU training crashes #43

Closed
Bhavani01 opened this issue Oct 16, 2019 · 36 comments
Labels
bug Something isn't working

Comments

@Bhavani01

I have no problems training on a CPU. But when I train on a GPU it crashes every time.

2019-10-15 09:02:52.215 [root setup:380] This is run ID: 4d526700aafc4f5fba779bae21789a82
2019-10-15 09:02:52.215 [root setup:383] Inside experiment ID: 0 (EN-DE Pretrain Predictor)
2019-10-15 09:02:52.215 [root setup:386] Local output directory is: /ec/dgt/local/exodus/home/bhaskbh/gpu_train
2019-10-15 09:02:52.215 [root setup:389] Logging execution to MLflow at: None
2019-10-15 09:02:52.247 [root setup:395] Using GPU: 3
2019-10-15 09:02:52.247 [root setup:400] Artifacts location: None
2019-10-15 09:02:52.252 [kiwi.lib.train run:154] Training the PredEst Predictor model (an embedder model) model
2019-10-15 09:03:13.830 [kiwi.lib.train run:187] Predictor(
  (attention): Attention(
    (scorer): MLPScorer(
      (layers): ModuleList(
        (0): Sequential(
          (0): Linear(in_features=1600, out_features=800, bias=True)
          (1): Tanh()
        )
        (1): Sequential(
          (0): Linear(in_features=800, out_features=1, bias=True)
          (1): Tanh()
        )
      )
    )
  )
  (embedding_source): Embedding(45004, 200, padding_idx=1)
  (embedding_target): Embedding(45004, 200, padding_idx=1)
  (lstm_source): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
  (forward_target): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5)
  (backward_target): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5)
  (W1): Embedding(45004, 200, padding_idx=1)
  (_loss): CrossEntropyLoss()
)
2019-10-15 09:03:13.831 [kiwi.lib.train run:188] 39389601 parameters
2019-10-15 09:03:13.831 [kiwi.trainers.trainer run:75] Epoch 1 of 6
Batches: 0%| | 0/5680 [00:00<?, ? batches/s]
Traceback (most recent call last):
  File "/home/bhaskbh/.local/bin/kiwi", line 11, in <module>
    sys.exit(main())
  File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/__main__.py", line 22, in main
    return kiwi.cli.main.cli()
  File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/cli/main.py", line 71, in cli
    train.main(extra_args)
  File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/cli/pipelines/train.py", line 142, in main
    train.train_from_options(options)
  File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/lib/train.py", line 123, in train_from_options
    trainer = run(ModelClass, output_dir, pipeline_options, model_options)
  File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/lib/train.py", line 204, in run
    trainer.run(train_iter, valid_iter, epochs=pipeline_options.epochs)
  File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/trainers/trainer.py", line 76, in run
    self.train_epoch(train_iterator, valid_iterator)
  File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/trainers/trainer.py", line 96, in train_epoch
    outputs = self.train_step(batch)
  File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/trainers/trainer.py", line 140, in train_step
    model_out = self.model(batch)
  File "/home/bhaskbh/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/models/predictor.py", line 240, in forward
    source_mask = self.get_mask(batch, source_side)[:, 1:-1]
  File "/home/bhaskbh/.local/lib/python3.6/site-packages/kiwi/models/model.py", line 205, in get_mask
    input_tensor != pad_id, dtype=torch.uint8
RuntimeError: Expected object of backend CUDA but got backend CPU for argument #3 'other'
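
This error typically means that two tensors used in the same operation live on different devices: here the batch is on the GPU while some other tensor (such as the padding id) is still on the CPU. A minimal sketch of the pattern and of the usual fix, in plain PyTorch rather than OpenKiwi's actual get_mask code:

import torch

def pad_mask(tokens, pad):
    # Comparing tensors that live on different devices raises a RuntimeError
    # like the one above; moving `pad` to the tokens' device avoids it.
    return tokens != pad.to(tokens.device)

device = "cuda" if torch.cuda.is_available() else "cpu"
tokens = torch.tensor([[5, 7, 1, 1]], device=device)
pad = torch.tensor([1, 1, 1, 1])  # created on the CPU
print(pad_mask(tokens, pad))      # works wherever `tokens` lives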

To Reproduce
A simple training run with the config file below.
epochs: 6
checkpoint-validation-steps: 5000
checkpoint-save: true
checkpoint-keep-only-best: 1
checkpoint-early-stop-patience: 0
optimizer: adam
log-interval: 100
learning-rate: 2e-3
learning-rate-decay: 0.6
learning-rate-decay-start: 2
train-batch-size: 64
valid-batch-size: 64
train-source: /home/bhaskbh/data/en_de.src
train-target: /home/bhaskbh/data/en_de.pe
split: 0.99
source-vocab-size: 45000
target-vocab-size: 45000
source-max-length: 50
source-min-length: 1
target-max-length: 50
target-min-length: 1
source-vocab-min-frequency: 1
target-vocab-min-frequency: 1
experiment-name: EN-DE Pretrain Predictor
gpu-id: 3

Environment

  • OS: Linux
  • OpenKiwi version 0.1.2
  • Python 3.6
@Bhavani01 Bhavani01 added the bug Something isn't working label Oct 16, 2019
@captainvera
Contributor

Hi @Bhavani01 !

This issue should have been fixed with #39. Can you specify which version of pytorch you're using so we can test appropriately?

Thanks!

@captainvera captainvera self-assigned this Oct 16, 2019
@Bhavani01
Author

Bhavani01 commented Oct 17, 2019 via email

@Bhavani01
Author

Is there any update on this? I now have the latest versions of Kiwi and PyTorch, but the GPU training still fails. I also have an additional issue: training on the CPU is fine, but when I try to predict it fails, exiting without giving an error. Pasting the log here. Any insights on what I could do differently? Thanks in advance.
2019-11-04 09:15:15.747 [kiwi.lib.predict setup:159] {'batch_size': 64,
'config': 'predict_estimator.yaml',
'debug': False,
'experiment_name': 'predict-predest',
'gpu_id': None,
'load_data': None,
'load_model': '/ec/dgt/local/exodus/home/bhaskbh/new_train/best_model.torch',
'load_vocab': '/ec/dgt/local/exodus/home/bhaskbh/new_train/vocab.torch',
'log_interval': 100,
'mlflow_always_log_artifacts': False,
'mlflow_tracking_uri': 'mlruns/',
'model': 'estimator',
'output_dir': '/ec/dgt/local/exodus/home/bhaskbh/test_data',
'quiet': False,
'run_uuid': None,
'save_config': None,
'save_data': None,
'seed': 42}
2019-11-04 09:15:15.747 [kiwi.lib.predict setup:160] Local output directory is: /ec/dgt/local/home/bhaskbh/test_data
2019-11-04 09:15:15.747 [kiwi.lib.predict run:100] Predict with the PredEst (Predictor-Estimator) model
2019-11-04 09:15:18.168 [kiwi.data.utils load_vocabularies_to_fields:126] Loaded vocabularies from /ec/dgt/local/home/bhaskbh/new_train/best_model.torch

@captainvera
Contributor

Hi!
I'm really sorry about the late response, I let this slip through the cracks. I'll give you an update later today!

On the second issue, it is hard to diagnose from a log that just stops; would you mind sharing the command/config you're using to run the prediction pipeline?

@Bhavani01
Author

experiment-name: predict-predest
output-dir: /ec/dgt/local/home/bhaskbh/test_data
seed: 42
#gpu-id: 0
model: estimator
sentence-level: True
binary-level: True
load-model: /ec/dgt/local/home/bhaskbh/new_train/best_model.torch
load-vocab: /ec/dgt/local/home/bhaskbh/new_train/vocab.torch
wmt18-format: False
test-source: /ec/dgt/local/home/bhaskbh/test_data/uuuu10k.en.txt
test-target: /ec/dgt/local/home/bhaskbh/test_data/uuuu10k.en_DE.txt
valid-batch-size: 64

@captainvera
Contributor

Hi,

I have made a pull request #44 that should solve the issue at hand.
On your second issue, I'll get back to you soon. I was able to reproduce it and am working on a fix.

Miguel

@kepler
Member

kepler commented Nov 4, 2019

Hi @Bhavani01. Please let us know whether the current version of master solves the first issue.

@Bhavani01
Author

I re-installed it but it still crashes. This is the only difference in the output log.
RuntimeError: Expected object of device type cuda but got device type cpu for argument #3 'other' in call to th_iand

@captainvera
Contributor

In the exact same line as before?
I'm having trouble reproducing this issue now; I'm using the exact same config as yours with PyTorch 1.2.

As for your second issue, I can only reproduce this logging-but-no-output situation when running the predict pipeline with a Predictor and not a Predictor-Estimator. It should be noted that the Predictor is just a pre-training step and can't actually generate QE tags. You need to train the Estimator on top of the Predictor. Can you confirm you have a predictor-estimator?

@kepler maybe we should add an error message when trying to run the predict pipeline with a predictor (the names are kind of confusing, hehe). This would avoid these silent crashes and provide actionable feedback.
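
For illustration, such a guard could look like the sketch below (plain Python; the function, attribute, and class names are hypothetical, not OpenKiwi's actual API):

def check_can_predict(model):
    # Hypothetical guard: fail fast with an actionable message instead of
    # exiting silently when the loaded checkpoint is a bare Predictor.
    if not getattr(model, "produces_qe_tags", False):  # hypothetical attribute
        raise ValueError(
            "The loaded model is a Predictor, which is only a pre-training "
            "step and cannot generate QE tags; train an Estimator on top of "
            "it and load that model instead."
        )

class LoadedPredictor:  # stand-in for a loaded Predictor checkpoint
    produces_qe_tags = False

try:
    check_can_predict(LoadedPredictor())
except ValueError as err:
    print(err)  # prints the actionable message instead of crashing silently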

@Bhavani01
Author

  1. Yes, the same line, with the addition of "in call to th_iand".
  2. I did train an estimator and the log showed a successful training. I also checked the location of the model, and I was pointing at the right one. I will test it again.

@Bhavani01
Author

Could it be because of this? I assumed --load-pred-source was for when I wanted to predict the source, and similarly for --load-pred-target, but looking at the documentation again, it says: --load-pred-target - If set, model architecture and vocabulary parameters are ignored. Load pretrained predictor tgt->src. I will let you know whether this training is successful in 3 days. But the GPU problem still persists. Thanks a lot for your help.

@captainvera
Contributor

Hmmm, I think you assumed the correct thing and our documentation is wrong. I'm going to confirm this, but an initial look seems to indicate that --load-pred-target is indeed used when predicting src -> tgt, which is what I assume you want to do.

On the other hand, I'm not able to reproduce your error with the GPU. I'm using Python 3.6.8, the latest version of OpenKiwi (installed from master, which is important as we haven't updated the version on pip yet), and PyTorch 1.2.0. The only thing I changed in the config you provided was adding a line with model: predictor, as that is required to run the training pipeline.

As for your second issue, an easy way to test if the model is at fault is to download our pre-trained models available on our releases and run the same config but pointing to one of our pre-trained models.

@Bhavani01
Author

I did test the config with the pretrained models, so I guess there is something wrong with my model even though the training completed successfully. I am retraining now and will test again when it completes. For the GPU issue, I will download from master again this time instead of updating, and test. I would be really grateful if you could confirm the --load-pred-target behaviour. Thanks.

@Bhavani01
Author

I cloned master again instead of pulling changes, and training on the GPU seems OK so far. Thanks. I will report on the other issue when I finish.

@captainvera
Contributor

Nice! Glad to hear your problem has been solved :)

I confirmed the issue about --load-pred-target, and my suspicion was correct: it is used to load src -> tgt predictors. Our documentation has a mistake; thanks for pointing it out!

@Bhavani01
Author

Bhavani01 commented Nov 8, 2019

The predictor training on the GPU was fine. However, it crashed for the estimator training.
Here is the log.

Command:

kiwi train --config /ec/dgt/local/exodus/home/bhaskbh/new_train/estimate.yaml

Logging:

2019-11-08 09:06:48.174 [root setup:380] This is run ID: 27917144000c41e4a505dcaff111c669
2019-11-08 09:06:48.174 [root setup:383] Inside experiment ID: 0 (EN-DE Train Estimator)
2019-11-08 09:06:48.174 [root setup:386] Local output directory is: /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test
2019-11-08 09:06:48.174 [root setup:389] Logging execution to MLflow at: None
2019-11-08 09:06:48.194 [root setup:395] Using GPU: 2
2019-11-08 09:06:48.194 [root setup:400] Artifacts location: None
2019-11-08 09:06:48.201 [kiwi.lib.train run:154] Training the PredEst (Predictor-Estimator) model
2019-11-08 09:07:05.865 [kiwi.data.utils load_vocabularies_to_fields:126] Loaded vocabularies from /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test/best_model.torch
2019-11-08 09:07:12.816 [kiwi.lib.train run:187] Estimator(
  (predictor_tgt): Predictor(
    (attention): Attention(
      (scorer): MLPScorer(
        (layers): ModuleList(
          (0): Sequential(
            (0): Linear(in_features=1600, out_features=800, bias=True)
            (1): Tanh()
          )
          (1): Sequential(
            (0): Linear(in_features=800, out_features=1, bias=True)
            (1): Tanh()
          )
        )
      )
    )
    (embedding_source): Embedding(45004, 200, padding_idx=1)
    (embedding_target): Embedding(45004, 200, padding_idx=1)
    (lstm_source): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
    (forward_target): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5)
    (backward_target): LSTM(200, 400, num_layers=2, batch_first=True, dropout=0.5)
    (W1): Embedding(45004, 200, padding_idx=1)
    (_loss): CrossEntropyLoss()
  )
  (mlp): Sequential(
    (0): Linear(in_features=1000, out_features=125, bias=True)
    (1): Tanh()
  )
  (lstm): LSTM(125, 125, batch_first=True, bidirectional=True)
  (embedding_out): Linear(in_features=250, out_features=2, bias=True)
  (sentence_pred): Sequential(
    (0): Linear(in_features=250, out_features=125, bias=True)
    (1): Sigmoid()
    (2): Linear(in_features=125, out_features=62, bias=True)
    (3): Sigmoid()
    (4): Linear(in_features=62, out_features=1, bias=True)
  )
  (binary_pred): Sequential(
    (0): Linear(in_features=250, out_features=125, bias=True)
    (1): Tanh()
    (2): Linear(in_features=125, out_features=62, bias=True)
    (3): Tanh()
    (4): Linear(in_features=62, out_features=2, bias=True)
  )
  (xents): ModuleDict(
    (tags): CrossEntropyLoss()
  )
  (mse_loss): MSELoss()
  (xent_binary): CrossEntropyLoss()
)
2019-11-08 09:07:12.816 [kiwi.lib.train run:188] 39845791 parameters
2019-11-08 09:07:12.817 [kiwi.trainers.trainer run:75] Epoch 1 of 10
Batches:   0%|                         | 1/5942 [00:02<3:29:00,  2.11s/ batches]/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [10,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [12,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [26,0,0] Assertion `t >= 0 && t < n_classes` failed.
Floating point exception

My config is as follows:

model: estimator
output-dir: /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test
hidden-est: 125
rnn-layers-est: 1
dropout-est: 0.0
mlp-est: True
token-level: True
sentence-level: True
sentence-ll: False
binary-level: True
predict-target: true
target-bad-weight: 2.5
predict-source: false
source-bad-weight: 2.5
predict-gaps: false
target-bad-weight: 2.5
epochs: 10
checkpoint-validation-steps: 0
checkpoint-save: true
checkpoint-keep-only-best: 3
checkpoint-early-stop-patience: 0
log-interval: 100
learning-rate: 2e-3
train-batch-size: 64
valid-batch-size: 64
load-pred-target: /ec/dgt/local/exodus/home/bhaskbh/new_train/gpu_test/best_model.torch
wmt18-format: false
train-source: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.src
train-target: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.mt
train-pe: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.pe
train-target-tags: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.tags
train-sentence-scores: /ec/dgt/local/exodus/home/bhaskbh/data/en_de.ter
split: 0.99
experiment-name: EN-DE Train Estimator
gpu-id: 2

@captainvera
Contributor

Hey, @Bhavani01 I'll take a look at this today and get back to you shortly!

I took the liberty of editing your comment to make it easier to read :)

@captainvera
Contributor

I'm having trouble reproducing your problem.

I trained a small predictor with the example training config we make available in the repo and then trained an estimator with your config above. The only things I changed were the data (using WMT19) and the wmt18-format flag (since you're not predicting gaps, and WMT19 has gaps, so Kiwi needs to know to filter them out).

Am I correct to assume that you trained your predictor with the config you showed in your first comment? I'd like your confirmation on that, but while I wait I'll try training a predictor with that config and an estimator on top, to see if I can find something, and get back to you.

@Bhavani01
Author

For the predictor I used the same config as in my first comment. Other than saving the best model, is there any other message that indicates successful training? I say the predictor training was successful because it ran for 6 epochs and saved the best model.

@captainvera
Contributor

Nope, that should be it. If you're getting reasonable results (Acc > 0.6), the model is improving, and training finishes without any error, then yes, it is a successful training.

I've just finished training a predictor and an estimator with your configs (using WMT19 data on both, something done solely for testing purposes) and they both trained successfully.

With this, I can't really reproduce your problem...
Could you try training your models with WMT19 data for testing purposes? It is available here

Maybe it is related to the data you're using being handled wrongly by Kiwi somehow? That floating point error is extremely weird.

Also, can you train with the CPU? The estimator should be pretty fast to train on CPU; that can be an alternative for the time being while we find out what's going on here!

@Bhavani01
Author

OK. I will try to train with the CPU and the WMT19 data.

@Bhavani01
Author

Bhavani01 commented Nov 12, 2019

This is my error with the CPU:

Batches:   0%|                        | 1/5942 [00:37<62:35:24, 37.93s/ batches]Traceback (most recent call last):
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/bin/kiwi", line 11, in <module>
    load_entry_point('openkiwi', 'console_scripts', 'kiwi')()
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/__main__.py", line 22, in main
    return kiwi.cli.main.cli()
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/cli/main.py", line 71, in cli
    train.main(extra_args)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/cli/pipelines/train.py", line 142, in main
    train.train_from_options(options)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/lib/train.py", line 123, in train_from_options
    trainer = run(ModelClass, output_dir, pipeline_options, model_options)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/lib/train.py", line 204, in run
    trainer.run(train_iter, valid_iter, epochs=pipeline_options.epochs)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/trainers/trainer.py", line 76, in run
    self.train_epoch(train_iterator, valid_iterator)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/trainers/trainer.py", line 96, in train_epoch
    outputs = self.train_step(batch)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/trainers/trainer.py", line 141, in train_step
    loss_dict = self.model.loss(model_out, batch)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/models/predictor_estimator.py", line 507, in loss
    loss_bin = self.binary_loss(model_out, batch)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/kiwi/models/predictor_estimator.py", line 497, in binary_loss
    loss = self.xent_binary(model_out[const.BINARY], labels.long())
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/modules/loss.py", line 904, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/ec/dgt/local/exodus/home/bhaskbh/OpenKiwi-master/env/lib64/python3.6/site-packages/torch/nn/functional.py", line 1790, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes' failed.  at /pytorch/aten/src/THNN/generic/ClassNLLCriterion.c:93
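
This assertion comes from the cross-entropy loss: with C classes, every target index must lie in [0, C-1]. A minimal standalone repro, independent of OpenKiwi (the 2-class setup just mirrors a binary OK/BAD objective):

import torch
import torch.nn as nn

loss = nn.CrossEntropyLoss()
logits = torch.randn(3, 2)              # 3 examples, 2 classes
good_targets = torch.tensor([0, 1, 0])  # all indices in [0, 1]
bad_targets = torch.tensor([0, 2, 1])   # 2 is out of range for 2 classes

print(loss(logits, good_targets))       # fine
try:
    loss(logits, bad_targets)           # trips the same class-range check
except (RuntimeError, IndexError) as err:
    print(err)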

@captainvera
Contributor

Hmmm, this definitely seems to imply something weird with the data you're feeding into Kiwi. Would you mind sharing a subset of this data?

Since this fails on the first batch, a couple of lines of each file should be enough for me to check what's going on!

@Bhavani01
Author

Bhavani01 commented Nov 12, 2019

I suspected this has to do with the tags. Since I don't have human annotators, I use a simple script to generate the labels from the alignments; a superficial glance at them seemed fine. I trained with just TER and sentence level and got the same error. Since the predictor trained successfully with the same src, mt, and pe files, TER is the only additional input. Could this be because my TER range is not 0-1 and some segments have a score greater than 1? I will try rounding all scores >1 down to 1 and check. I will also try to get a subset of my training data that is in the public domain to share. Thanks.
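
If the out-of-range scores turn out to be the culprit, a small sketch of that clipping step, assuming one TER/HTER score per line in a plain-text file (the file names are placeholders):

def clip_scores(in_path, out_path):
    # Clamp every sentence-level score into [0, 1] before training.
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            score = float(line.strip())
            dst.write(f"{min(max(score, 0.0), 1.0):.6f}\n")

clip_scores("en_de.ter", "en_de.clipped.ter")  # placeholder file names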

@captainvera
Contributor

Hi @Bhavani01, has your issue been solved by regenerating/repairing the data?

@Bhavani01
Author

Bhavani01 commented Nov 20, 2019

I don't get the same error anymore after changing the TER scores. In GPU training I now get a "RuntimeError: CUDA out of memory" error, even though I have more than enough memory, and irrespective of the size of the data I am training on. I got a different error in the CPU training with the full dataset. Now I am running with a subsection of the data, and it is still training. I am trying to use the --load-model option to train with smaller sets of data until I reach the one that is causing problems, as I did basic data cleaning and can't see any obvious issues.

@captainvera
Contributor

That's good news.

On the GPU error: that is not related to the computer's memory but to the GPU memory. As such, the size of the training data does not matter (it fills the RAM, not the GPU); what matters is the batch size and the number of tokens in each sentence.

My recommendation would be to decrease the batch size (while adjusting the learning rate accordingly) and to use the options we provide to control the maximum token count of src and tgt sentences, respectively: --max-source-length: X & --max-target-length: X, where X is usually somewhere between 50 and 100.

It can happen that some unusually long sentences are loaded onto the GPU, and this exceeds the amount of memory available.
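
As a quick sanity check before picking those caps, something like the sketch below (plain Python, whitespace tokenisation, placeholder file names) shows how many sentences exceed a candidate length:

def count_over_length(path, cap=70):
    # Returns (number of sentences longer than `cap` tokens, total sentences).
    over = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += 1
            if len(line.split()) > cap:
                over += 1
    return over, total

for path in ("en_de.src", "en_de.mt"):  # placeholder file names
    over, total = count_over_length(path)
    print(f"{path}: {over}/{total} sentences exceed 70 tokens")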

As for the CPU training, I'd be very interested in the error you're getting, as that is not expected.

Finally, we are preparing some updates for Kiwi that should add some sanity checks for data. This should help us avoid errors like your previous one in the future. Stay tuned!

@Bhavani01
Author

I did try with batch size 32. I restrict the length of segments to 200 in my data cleaning. I will reduce it and test. Thanks.

@Bhavani01
Author

Bhavani01 commented Dec 13, 2019

Hi, for the NuQE and Quetch trainings, the target is the MT output, right? Not the reference or the post-edit?
BTW, the timestamp in the log does not match system or local time (Central European in my case); it is one hour behind (London time). It doesn't affect training, just thought you should know.

@captainvera
Contributor

captainvera commented Dec 16, 2019

When training a QE model, the target should always be the MT. This applies to all models in OpenKiwi.

We normally refer to things with the following nomenclature:
Source - Text in the source language
Target - MT produced from Source

Thanks for the heads-up! We'll see how to use system time. I'd say we probably set something up wrong but never noticed, since we are in London time :)

@Bhavani01
Author

Hi,
Even if I specify the GPU id, the predict config picks the CPU. Is there a way around this?

Thanks.
Bhavani

@captainvera
Contributor

Hey @Bhavani01, that shouldn't happen; let me have a look into what's going on.

Also, I'd appreciate it if you could open new issues instead of continuing the conversation on this one! That way we can containerise topics and use these issues to help similar questions in the future.

@Bhavani01
Author

Got it. BTW, the GPU issue was on my end; all my GPUs were blocked. Sorry about that.

@captainvera
Contributor

Are you talking about this last comment? So the predict pipeline is working as expected? 🙂

@Bhavani01
Author

Yes. I was outside the virtual env, and the GPU was not visible to it.

@captainvera
Contributor

Ah! Glad to know it's solved! I'll close this issue for now. Feel free to open a new one in case you have any further questions.
