bus error when running the code #6

xuetf · 2021-01-25T06:23:51Z

Red Hat 4.8.5-11
Four V100
Python3.6
torch 1.7.1
transformers 3.3.1

Running `sh agnews.sh` produces the following errors. I wonder whether this is due to multiprocessing. Could you help check it?

`Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/agnews/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['politics'], 1: ['sports'], 2: ['business'], 3: ['technology']}
Some weights of the model checkpoint at bert-base-uncased/ were not used when initializing LOTClassModel: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']

This IS expected if you are initializing LOTClassModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
This IS NOT expected if you are initializing LOTClassModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LOTClassModel were not initialized from the model checkpoint at bert-base-uncased/ and are newly initialized: ['cls.predictions.decoder.bias', 'dense.weight', 'dense.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading encoded texts from datasets/agnews/train.pt
Loading texts with label names from datasets/agnews/label_name_data.pt
Loading encoded texts from datasets/agnews/test.pt
Contructing category vocabulary.
/home/hadoop-aipnlp/anaconda3/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 2 leaked semaphores to clean up at shutdown
len(cache))
agnews.sh: line 25: 5931 Bus error (core dumped) python src/train.py --dataset_dir datasets/${DATASET}/ --label_names_file ${LABEL_NAME_FILE} --train_file ${TRAIN_CORPUS} --test_file ${TEST_CORPUS} --test_label_file ${TEST_LABEL} --max_len ${MAX_LEN} --train_batch_size ${TRAIN_BATCH} --accum_steps ${ACCUM_STEP} --eval_batch_size ${EVAL_BATCH} --gpus ${GPUS} --mcp_epochs ${MCP_EPOCH} --self_train_epochs ${SELF_TRAIN_EPOCH}`

yumeng5 · 2021-01-26T06:44:25Z

Hi,

Thanks for reporting this issue. It seems to be an error related to PyTorch distributed training. Unfortunately, I cannot reproduce it on my machines and I don't know why it is happening. You could try adding `export PYTHONWARNINGS='ignore:semaphore_tracker:UserWarning'` to the agnews.sh script, as suggested here. Alternatively, you could modify the code to remove the distributed training part, as discussed here. If you have a V100 GPU, you probably won't need to train on multiple GPUs.

Thanks,
Yu
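
For readers who want to see what the suggested `PYTHONWARNINGS` setting does, here is a minimal sketch using only the standard library. It assumes the warning is a plain `UserWarning` whose message starts with `semaphore_tracker`, as in the log above. The helper name `run_child` and the sample warning text are illustrative and not part of the repository. Note that the filter only hides the warning; whether it has any effect on the bus error itself is not established in this thread.

```python
import os
import subprocess
import sys

# Same filter string as the one suggested for agnews.sh.
FILTER = "ignore:semaphore_tracker:UserWarning"


def run_child(env):
    """Start a child interpreter that emits a semaphore_tracker-style
    UserWarning and return whatever it wrote to stderr."""
    code = (
        "import warnings; "
        "warnings.warn('semaphore_tracker: There appear to be 2 leaked "
        "semaphores to clean up at shutdown', UserWarning)"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # text mode; also works on Python 3.6
    )
    return result.stderr


# Environment without the filter, and a copy with the filter set.
env_plain = {k: v for k, v in os.environ.items() if k != "PYTHONWARNINGS"}
env_filtered = dict(env_plain, PYTHONWARNINGS=FILTER)

stderr_plain = run_child(env_plain)
stderr_filtered = run_child(env_filtered)

print("without filter:", stderr_plain.strip() or "(nothing on stderr)")
print("with filter:   ", stderr_filtered.strip() or "(nothing on stderr)")
```

Because the variable is inherited by child processes, exporting it in the shell script before `python src/train.py` should also reach helper processes started by `multiprocessing`.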

wws0815 · 2021-01-27T07:27:15Z

Hello, I also got a bus error when I tried to run your code today. My environment is similar to the original poster's, but with a single GPU. Does this program use a lot of CPU? While it is running, CPU usage is often above 100%, and I don't know why. I have already reduced the data volume and batch_size. Do you have any suggestions?

yumeng5 · 2021-01-30T20:53:12Z

Hi @wws0815,

The code will not use the CPU heavily except at the beginning, when it prepares the input tensors (you will see it print something like "Converting texts into tensors." or "Reading texts from..."). Once training begins (i.e., after you see "Constructing category vocabulary."), the code should mainly use GPUs for model training. If you are still seeing high CPU usage, please check that the code is actually using GPUs for training.

Thanks,
Yu
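
One way to check whether PyTorch can see a GPU at all is sketched below. It is a generic check, not code from this repository; the function name `gpu_report` is hypothetical. It uses only `torch.cuda.is_available`, `torch.cuda.device_count`, and `torch.cuda.get_device_name`, and it degrades gracefully if PyTorch is not installed in the active environment.

```python
def gpu_report():
    """Summarize what PyTorch can see, as a plain dict that is easy to log."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing from this environment, so no GPU can be used.
        return {"torch_installed": False, "cuda_available": False, "gpu_names": []}

    available = torch.cuda.is_available()
    names = []
    if available:
        # List the name of every CUDA device visible to this process.
        names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return {"torch_installed": True, "cuda_available": available, "gpu_names": names}


report = gpu_report()
print(report)
if not report["cuda_available"]:
    print("PyTorch does not see a CUDA device in this environment.")
```

If `cuda_available` is `False`, or the list of names is shorter than expected, the driver installation and the `CUDA_VISIBLE_DEVICES` environment variable are reasonable things to check first.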

yumeng5 closed this as completed Mar 11, 2021

yumeng5 mentioned this issue May 19, 2023

AssertionError: Too few (0) documents with category indicative terms found for category 1; try to add more unlabeled documents to the training corpus (recommend) or reduce --match_threshold (not recommend) #22

Open
