bus error when running the code #6
Comments
Hi, thanks for letting me know about the issue. This seems to be an error related to PyTorch distributed training. Unfortunately, I cannot reproduce it on my machines and I have no idea why it is happening. Probably you could try to add Thanks,
Hello, when I tried to run the code you provided today, I also got a "bus error". My environment is similar to the original poster's, but with a single GPU. Does this program use a lot of CPU? While it is running, CPU usage is often above 100%, and I don't know why. I have already reduced the amount of data and the batch_size. Do you have any suggestions?
Hi @wws0815, the code should not use the CPU heavily except at the beginning, when it prepares the input tensors (you will see it print something like "Converting texts into tensors." or "Reading texts from..."). Once training begins (i.e., after you see "Constructing category vocabulary."), the code should mainly use GPUs for model training. If you still see high CPU usage at that point, you will probably need to make sure the code is actually using GPUs for training. Thanks,
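One way to check the point about GPU usage is to ask PyTorch directly whether it can see a CUDA device. The snippet below is a minimal sketch, not part of this repository: `describe_training_device` is a hypothetical helper name, and it only reports what PyTorch detects in the current environment, not what `src/train.py` actually does.

```python
def describe_training_device():
    """Return a short description of the device PyTorch can see (hypothetical helper)."""
    try:
        import torch  # imported lazily so the check degrades gracefully without PyTorch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        # Without CUDA, any training would run on the CPU, which would explain high CPU usage.
        return "CUDA is not available: training would fall back to the CPU"
    names = [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
    return "CUDA is available on {} GPU(s): {}".format(len(names), ", ".join(names))


print(describe_training_device())
```

If this reports that CUDA is not available, the driver, CUDA toolkit, and PyTorch build are worth checking before looking at the training script itself.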
Running `sh agnews.sh` gives the following errors. I wonder if this is due to multiprocessing. Could you help check it?
```
Namespace(accum_steps=2, category_vocab_size=100, dataset_dir='datasets/agnews/', dist_port=12345, early_stop=False, eval_batch_size=128, final_model='final_model.pt', gpus=2, label_names_file='label_names.txt', match_threshold=20, max_len=200, mcp_epochs=3, out_file='out.txt', self_train_epochs=1.0, test_file='test.txt', test_label_file='test_labels.txt', top_pred_num=50, train_batch_size=32, train_file='train.txt', update_interval=50)
Effective training batch size: 128
Label names used for each class are: {0: ['politics'], 1: ['sports'], 2: ['business'], 3: ['technology']}
Some weights of the model checkpoint at bert-base-uncased/ were not used when initializing LOTClassModel: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
Some weights of LOTClassModel were not initialized from the model checkpoint at bert-base-uncased/ and are newly initialized: ['cls.predictions.decoder.bias', 'dense.weight', 'dense.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading encoded texts from datasets/agnews/train.pt
Loading texts with label names from datasets/agnews/label_name_data.pt
Loading encoded texts from datasets/agnews/test.pt
Contructing category vocabulary.
/home/hadoop-aipnlp/anaconda3/envs/py36/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 2 leaked semaphores to clean up at shutdown
len(cache))
agnews.sh: line 25: 5931 Bus error (core dumped) python src/train.py --dataset_dir datasets/${DATASET}/ --label_names_file ${LABEL_NAME_FILE} --train_file ${TRAIN_CORPUS} --test_file ${TEST_CORPUS} --test_label_file ${TEST_LABEL} --max_len ${MAX_LEN} --train_batch_size ${TRAIN_BATCH} --accum_steps ${ACCUM_STEP} --eval_batch_size ${EVAL_BATCH} --gpus ${GPUS} --mcp_epochs ${MCP_EPOCH} --self_train_epochs ${SELF_TRAIN_EPOCH}
```
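Since the log shows the crash right after multiple processes start, one thing that may be worth ruling out is shared memory: bus errors in multi-process PyTorch jobs are sometimes reported when `/dev/shm` is too small (for example, inside a container). Nothing in this thread confirms that this is the cause here. The snippet below is only a rough diagnostic sketch using the Python standard library; `shared_memory_report` is a hypothetical helper name.

```python
import os
import shutil


def shared_memory_report(path="/dev/shm"):
    """Return total/free space of the shared-memory mount in GiB, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {"total_gib": usage.total / gib, "free_gib": usage.free / gib}


report = shared_memory_report()
if report is None:
    print("/dev/shm not found (no Linux-style shared-memory mount)")
else:
    print("shared memory: {total_gib:.2f} GiB total, {free_gib:.2f} GiB free".format(**report))
```

If the reported size is very small, enlarging the shared-memory mount before rerunning `agnews.sh` is a cheap experiment, with the caveat that it may not change anything for this particular error.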