Error while Training Dall-E on a single TPU (8cores) #9

Closed
mkhoshle opened this issue Jul 21, 2021 · 9 comments

@mkhoshle

Hi, I am trying to train DALL-E on the COCO dataset, and here are the parameters I use:

%%writefile /content/tmp/run.sh
#@title Configuration
# model
model="vqgan" #@param  ['vqgan','evqgan','gvqgan','vqvae','evqvae','gvqvae','vqvae2']
# training
epochs=30 #@param {'type': 'raw'}
learning_rate=4.5e-6 #@param {'type': 'number' }
precision=16 #@param {'type': 'integer' }
batch_size=8 #@param {'type': 'raw'}
num_workers=8 #@param {'type': 'raw'} 
# fake_data=True #@param {'type': 'boolean' }
use_tpus=True #@param {'type': 'boolean' }


# modifiable
resume=False #@param {type: 'boolean'}
dropout=0.1 #@param {type: 'number'}
rescale_img_size=256 #@param {type: 'number'}
resize_ratio=0.75 #@param {type: 'number'}
# test=True #@param {type: 'boolean'}
seed=8675309
codebook_dim=1024
embedding_dim=256

python '/content/dalle-lightning-modified-/train_dalle.py' \
    --epochs $epochs \
    --learning_rate $learning_rate \
    --precision $precision \
    --batch_size $batch_size \
    --num_workers $num_workers \
    --use_tpus \
    --train_dir "/content/data/train/" \
    --val_dir "/content/data/test" \
    --vae_path "/content/vae_logs/last.ckpt"  \
    --log_dir "/content/dalle_logs/" \
    --img_size $rescale_img_size \
    --seed $seed \
    --resize_ratio $resize_ratio \
    --embedding_dim $embedding_dim \
    --codebook_dim $codebook_dim

When running, I get the following error:

WARNING:root:TPU has started up successfully with version pytorch-1.9
Global seed set to 8675309
GPU available: False, used: False
TPU available: True, using: 8 TPU cores
IPU available: False, using: 0 IPUs
Setting batch size: 8 learning rate: 4.50e-06

Global seed set to 8675309
Global seed set to 8675309
Global seed set to 8675309
Global seed set to 8675309
Global seed set to 8675309
Global seed set to 8675309
Global seed set to 8675309
Global seed set to 8675309

  | Name          | Type                     | Params
-----------------------------------------------------------
0 | text_emb      | Embedding                | 5.3 M 
1 | image_emb     | Embedding                | 4.2 M 
2 | text_pos_emb  | Embedding                | 131 K 
3 | image_pos_emb | AxialPositionalEmbedding | 32.8 K
4 | vae           | OpenAIDiscreteVAE        | 97.6 M
5 | transformer   | Transformer              | 268 M 
6 | to_logits     | Sequential               | 9.5 M 
-----------------------------------------------------------
288 M     Trainable params
97.6 M    Non-trainable params
385 M     Total params
771.301   Total estimated model params size (MB)
Exception in device=TPU:5: `Dataloader` returned 0 length. Please make sure that it returns at least 1 batch
Exception in device=TPU:3: `Dataloader` returned 0 length. Please make sure that it returns at least 1 batch
Exception in device=TPU:7: `Dataloader` returned 0 length. Please make sure that it returns at least 1 batch
Exception in device=TPU:0: `Dataloader` returned 0 length. Please make sure that it returns at least 1 batch
Exception in device=TPU:1: `Dataloader` returned 0 length. Please make sure that it returns at least 1 batch
Exception in device=TPU:2: `Dataloader` returned 0 length. Please make sure that it returns at least 1 batch
Exception in device=TPU:6: `Dataloader` returned 0 length. Please make sure that it returns at least 1 batch
Exception in device=TPU:4: `Dataloader` returned 0 length. Please make sure that it returns at least 1 batch

One process sees the folder of 94629 images and texts, while the rest see 0 images and texts. I do not understand why this is happening. Could you please help me with this? Any ideas?

@tgisaturday
Owner

@mkhoshle Your code seems to be failing to load the COCO dataset. Make sure everything runs okay with the --fake_data flag, and check --train_dir and --val_dir.

@mkhoshle
Author

@tgisaturday No, the issue is not loading. One process sees the folder of 94629 images and texts while the rest see 0, which is weird. The --fake_data flag is set to False and the directory paths are correct.

@tgisaturday
Owner

@mkhoshle I've tested with CC3M and COCO and there were no similar symptoms. Double-check all your settings and show me how to reproduce the problem.

@mkhoshle
Author

@tgisaturday Here is my code. You can see the error in my colab notebook:
https://colab.research.google.com/drive/1c9ttTLYbfhjJ59JM_f5XZkF0vMl7pAob?usp=sharing

@tgisaturday
Owner

@mkhoshle You have to use a PyTorch Lightning DataModule to run the code without problems. I can't debug every custom codebase that doesn't follow the framework.

@mkhoshle
Author

mkhoshle commented Jul 22, 2021

@tgisaturday I have followed your code examples to do this. What do you mean by needing to use the PyTorch Lightning DataModule to run the code? Do you mean that torch_xla should not be installed and that everything should be based only on PyTorch Lightning? Isn't PyTorch Lightning itself dependent on `torch_xla`?

@tgisaturday
Owner

tgisaturday commented Jul 22, 2021

@mkhoshle Not using TextImageDataModule here can cause problems.

class TextImageDataModule(LightningDataModule):

For example, using a plain torchvision Dataset class and feeding only a DataLoader to the Lightning Trainer causes OOM on large pods. See Lightning-AI/pytorch-lightning#8358 (comment).
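
For reference, here is a minimal sketch of what such a DataModule looks like. This is not the repo's actual TextImageDataModule: the class name, the ImageFolder stand-in, and the constructor arguments are assumptions for illustration only.

```python
# Minimal sketch only -- not dalle-lightning's TextImageDataModule.
# ImageFolder stands in for whatever dataset class pairs images with captions.
from pytorch_lightning import LightningDataModule
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder


class ToyImageDataModule(LightningDataModule):
    def __init__(self, train_dir, val_dir, batch_size=8, num_workers=8, img_size=256):
        super().__init__()
        self.train_dir = train_dir
        self.val_dir = val_dir
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.transform = transforms.Compose([
            transforms.Resize(img_size),
            transforms.CenterCrop(img_size),
            transforms.ToTensor(),
        ])

    def setup(self, stage=None):
        # setup() runs inside every process the Trainer spawns, so each
        # TPU core builds its own dataset instead of inheriting a shared one.
        self.train_ds = ImageFolder(self.train_dir, transform=self.transform)
        self.val_ds = ImageFolder(self.val_dir, transform=self.transform)

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=self.batch_size,
                          num_workers=self.num_workers, shuffle=True, drop_last=True)

    def val_dataloader(self):
        return DataLoader(self.val_ds, batch_size=self.batch_size,
                          num_workers=self.num_workers)
```

Passing a DataModule like this to Trainer.fit lets Lightning set up the per-core samplers itself, which is why feeding a bare DataLoader can behave differently on TPU pods.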

If this is not the case, start debugging with only one TPU core. Sometimes a hidden error gets revealed that way.
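
As a sketch of that single-core debugging step (assuming PyTorch Lightning 1.x, where the Trainer accepts a tpu_cores argument; `dalle_module` and `dm` are placeholders for the repo's actual model and DataModule objects):

```python
# Sketch only: run on one TPU core so per-process errors show up
# without being interleaved across 8 workers.
from pytorch_lightning import Trainer

trainer = Trainer(
    tpu_cores=1,              # one core instead of 8
    precision=16,
    max_epochs=1,
    limit_train_batches=10,   # a handful of batches is enough to hit data errors
)
trainer.fit(dalle_module, datamodule=dm)
```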

@afiaka87
Contributor

@tgisaturday I believe they're using the Colab notebook from the repository, if you're not aware, or a rendition of it.

@tgisaturday
Owner

> @tgisaturday No, the issue is not loading. One process sees the folder of 94629 images and texts while the rest see 0, which is weird. The --fake_data flag is set to False and the directory paths are correct.

The reason why only one process sees the folder of 94629 images is that you have set num_workers to 1. num_workers is the number of processes that handle data loading in a multi-processing manner; it has nothing to do with the number of TPU cores. However, none of the TPUs from 0-7 is being fed data. This could be a device allocation error, a dataloader error, or something else that is not visible in the current Colab notebook.
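
To illustrate the distinction (a sketch, not the repo's code): num_workers only controls how many CPU worker processes a single DataLoader forks to load batches, while tpu_cores controls how many training processes Lightning spawns, each of which builds its own DataLoader.

```python
# Sketch of the two unrelated knobs; the random tensor dataset is a stand-in.
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import Trainer

dataset = TensorDataset(torch.randn(64, 3, 256, 256))

# num_workers: CPU processes feeding batches to *this* dataloader.
loader = DataLoader(dataset, batch_size=8, num_workers=8)

# tpu_cores: number of training processes spawned by the Trainer; each of
# the 8 processes calls train_dataloader() and gets its own DataLoader
# (with its own num_workers CPU workers).
trainer = Trainer(tpu_cores=8, precision=16)  # requires a TPU runtime
```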
