bus error when running the code #6

Closed
xuetf opened this issue Jan 25, 2021 · 3 comments

xuetf commented Jan 25, 2021

  • Red Hat 4.8.5-11
  • Four V100
  • Python 3.6
  • torch 1.7.1
  • transformers 3.3.1

Running `sh agnews.sh` produces the following error. I wonder whether it is caused by multiprocessing. Could you help check it?

`Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/agnews/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['politics'], 1: ['sports'], 2: ['business'], 3: ['technology']}
Some weights of the model checkpoint at bert-base-uncased/ were not used when initializing LOTClassModel: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']

- This IS expected if you are initializing LOTClassModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing LOTClassModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LOTClassModel were not initialized from the model checkpoint at bert-base-uncased/ and are newly initialized: ['cls.predictions.decoder.bias', 'dense.weight', 'dense.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading encoded texts from datasets/agnews/train.pt
Loading texts with label names from datasets/agnews/label_name_data.pt
Loading encoded texts from datasets/agnews/test.pt
Contructing category vocabulary.
/home/hadoop-aipnlp/anaconda3/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 2 leaked semaphores to clean up at shutdown
  len(cache))
agnews.sh: line 25: 5931 Bus error (core dumped) python src/train.py --dataset_dir datasets/${DATASET}/ --label_names_file ${LABEL_NAME_FILE} --train_file ${TRAIN_CORPUS} --test_file ${TEST_CORPUS} --test_label_file ${TEST_LABEL} --max_len ${MAX_LEN} --train_batch_size ${TRAIN_BATCH} --accum_steps ${ACCUM_STEP} --eval_batch_size ${EVAL_BATCH} --gpus ${GPUS} --mcp_epochs ${MCP_EPOCH} --self_train_epochs ${SELF_TRAIN_EPOCH}`
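
For readers checking the log above: the "Effective training batch size: 128" line is consistent with multiplying `train_batch_size`, `accum_steps` and `gpus` from the printed `Namespace`. The sketch below only reproduces that arithmetic. It assumes this is how the figure is derived, and the helper name `effective_batch_size` is hypothetical rather than taken from the repository.

```python
def effective_batch_size(train_batch_size, accum_steps, gpus):
    """Number of samples contributing to one optimizer step.

    Assumption: each GPU processes `train_batch_size` samples per forward
    pass and gradients are accumulated over `accum_steps` passes.
    """
    return train_batch_size * accum_steps * gpus


# Values copied from the Namespace line in the log.
print(effective_batch_size(train_batch_size=32, accum_steps=2, gpus=2))  # 128
```

Under the same assumption, lowering any of the three values lowers the effective batch size proportionally.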

yumeng5 commented Jan 26, 2021

Hi,

Thanks for reporting the issue. This seems to be an error related to PyTorch distributed training. Unfortunately, I cannot reproduce it on my machines and I don't know why it is happening. You could try adding `export PYTHONWARNINGS='ignore:semaphore_tracker:UserWarning'` to the agnews.sh script, as suggested here. Alternatively, you could modify the code to remove the distributed training part, as discussed here. If you have a V100 GPU, you probably won't need to train on multiple GPUs.

Thanks,
Yu
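
To illustrate what the `PYTHONWARNINGS` setting suggested above does, here is a minimal sketch. It starts a child Python process with and without the filter and captures what the child writes to stderr. The child program and the helper `run_child` are hypothetical stand-ins, and the child only emits a warning shaped like the one in the log; it does not run the training code.

```python
import os
import subprocess
import sys

# Filter string from the comment above, in the form action:message:category.
WARNING_FILTER = "ignore:semaphore_tracker:UserWarning"

# Hypothetical child program that emits a warning resembling the logged one.
CHILD_CODE = (
    "import warnings;"
    "warnings.warn('semaphore_tracker: There appear to be 2 leaked semaphores', UserWarning)"
)


def run_child(warning_filter=None):
    """Run the child interpreter and return everything it wrote to stderr."""
    env = dict(os.environ)
    env.pop("PYTHONWARNINGS", None)  # start from a clean slate
    if warning_filter is not None:
        env["PYTHONWARNINGS"] = warning_filter  # inherited by the child process
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stderr


print("without filter:", run_child().strip())
print("with filter:", run_child(WARNING_FILTER).strip())
```

The variable is set in the environment because child processes inherit it. Note that the filter only silences the warning message; it should not be expected to change whether the process itself crashes.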


wws0815 commented Jan 27, 2021

Hello, when I tried to run your code today, I also got a bus error. My environment is similar to the original poster's, but with a single GPU. Does this program use a lot of CPU? While it is running, CPU usage is often above 100%, and I don't know why. I have already reduced the amount of data and the batch_size. Do you have any suggestions?


yumeng5 commented Jan 30, 2021

Hi @wws0815,

The code does not use the CPU heavily except at the beginning, when it prepares the input tensors (you will see it print something like "Converting texts into tensors." or "Reading texts from..."). Once training begins (i.e., after you see "Constructing category vocabulary."), the code should mainly use GPUs for model training. If you still see high CPU usage, you probably need to make sure the code is actually using GPUs for training.

Thanks,
Yu
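
One way to check whether PyTorch can see a GPU at all, as suggested above, is sketched below. It is a generic check rather than part of the repository: the function name `cuda_summary` is hypothetical, and it only reports what `torch.cuda.is_available()` and `torch.cuda.device_count()` return in the current environment.

```python
def cuda_summary():
    """Report whether PyTorch is installed and how many CUDA devices it sees."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False, "cuda_available": False, "gpu_count": 0}

    available = torch.cuda.is_available()
    return {
        "torch_installed": True,
        "cuda_available": available,
        "gpu_count": torch.cuda.device_count() if available else 0,
    }


print(cuda_summary())
```

If `cuda_available` is `False` in the environment used for training, high CPU usage would be expected, since the model could not be running on a GPU.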
