Error when running predict.py in the ee example #540

Closed
zhuweigang opened this issue Jun 21, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@zhuweigang

zhuweigang commented Jun 21, 2024

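After cloning the repository, run.py failed because bert-base-chinese could not be found. I downloaded Google's bert-base-chinese model from Hugging Face to a local directory, and run.py then succeeded. Running predict.py afterwards produces the following error: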

  • OS: ubuntu 20.04
  • Python Version 3.9
  • NVIDIA RTX 3090

Screenshots
[2024-06-21 08:33:07,466][run][INFO] - Saving features into cached file /root/DeepKE/example/ee/standard/./data/DuEE/trigger/cached_dev_bert-base-chinese_256
[2024-06-21 08:33:08,084][run][INFO] - ***** Running evaluation *****
[2024-06-21 08:33:08,084][run][INFO] - Num examples = 1498
[2024-06-21 08:33:08,085][run][INFO] - Batch size = 16
[2024-06-21 08:33:08,085][run][INFO] - Mode = dev
Evaluating: 0%| | 0/94 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [0,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [1,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [2,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [3,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [4,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [5,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [6,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [7,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [8,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [9,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [10,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [11,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [12,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [13,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [14,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [15,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Evaluating: 0%| | 0/94 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/root/DeepKE/example/ee/standard/predict.py", line 115, in main
result, eval_pred_list = evaluate(args, model, eval_dataset, tokenizer, labels, pad_token_label_id, mode="dev", device=device)
File "/root/DeepKE/example/ee/standard/run.py", line 219, in evaluate
outputs = model(pad_token_label_id=pad_token_label_id, **inputs)
File "/root/anaconda3/envs/deepke/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/deepke/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/DeepKE/src/deepke/event_extraction/standard/bertcrf/bert_crf.py", line 89, in forward
loss = self.crf.neg_log_likelihood(crf_logits, crf_mask, crf_labels)
File "/root/DeepKE/src/deepke/event_extraction/standard/bertcrf/crf.py", line 273, in neg_log_likelihood
gold_score = self._score_sentence(scores, mask, tags)
File "/root/DeepKE/src/deepke/event_extraction/standard/bertcrf/crf.py", line 258, in _score_sentence
tg_energy = tg_energy.masked_select(mask.transpose(1, 0))
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

The other examples (ner, re, ae) all run successfully; only ee fails. Could you please help me figure out how to fix this?

@zhuweigang zhuweigang added the bug Something isn't working label Jun 21, 2024
@shengyumao
Collaborator

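Hello. A "RuntimeError: CUDA error: device-side assert triggered" error usually means that the output dimension of the final classification head does not match the number of labels in the test dataset.
Please check whether the task type set in predict.yaml (i.e. whether the task_name parameter is trigger or role) matches the model you trained. The default task type in predict.yaml is role (trigger evaluation is already performed during training), whereas the default task type in train.yaml is trigger. If you want to evaluate role, change task_name to role in train.yaml, train a separate model, and then run the evaluation.
If this does not solve your problem, please share your exact run parameters (for example, the configuration files).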
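
A mismatch of this kind can be looked for on the CPU before a batch ever reaches the GPU, where the failure only shows up as an opaque device-side assert. The following is a minimal sketch, not DeepKE code: the function name `find_out_of_range_labels`, the padding id of -100, and the sample batch are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_labels(
    label_ids: Iterable[Iterable[int]],
    num_labels: int,
    pad_token_label_id: int = -100,  # assumed padding id, not taken from the thread
) -> List[Tuple[int, int, int]]:
    """Return (row, column, label_id) for every id outside [0, num_labels)."""
    bad = []
    for row, sequence in enumerate(label_ids):
        for col, label_id in enumerate(sequence):
            if label_id == pad_token_label_id:
                continue  # padding positions are skipped
            if not 0 <= label_id < num_labels:
                bad.append((row, col, label_id))
    return bad


# Hypothetical batch: the model head has 5 classes, but the data contains id 7.
batch = [[0, 1, 4, -100], [2, 7, 3, -100]]
print(find_out_of_range_labels(batch, num_labels=5))
```

An empty result means every gold label fits the classification head; any reported entry points at a label set that differs from the one the model was trained with.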

@zxlzr
Contributor

zxlzr commented Jun 22, 2024

Do you have any other questions?

@zhuweigang
Author

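Thank you for the helpful reply. Following your suggestion, I changed task_name in train.yaml from trigger to role, and now I get a different error. This time it is not a CUDA error: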

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /root/anaconda3/envs/deepke/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
/root/anaconda3/envs/deepke/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /root/anaconda3/envs/deepke did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.8/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/anaconda3/envs/deepke/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
/root/anaconda3/envs/deepke/lib/python3.9/site-packages/hydra/plugins/config_source.py:190: UserWarning:
Missing @Package directive train.yaml in file:///root/DeepKE/example/ee/standard/conf.
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/adding_a_package_directive
warnings.warn(message=msg, category=UserWarning)
Traceback (most recent call last):
File "/root/DeepKE/example/ee/standard/predict.py", line 49, in main
args.dev_trigger_pred_file = os.path.join(args.cwd, args.dev_trigger_pred_file) if args.do_pipeline_predict and args.task_name=="role" else None
File "/root/anaconda3/envs/deepke/lib/python3.9/posixpath.py", line 90, in join
genericpath._check_arg_types('join', a, *p)
File "/root/anaconda3/envs/deepke/lib/python3.9/genericpath.py", line 152, in _check_arg_types
raise TypeError(f'{funcname}() argument must be str, bytes, or '
TypeError: join() argument must be str, bytes, or os.PathLike object, not 'NoneType'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

@zhuweigang
Author

I would also like to confirm one thing: is it OK to download the bert-base-chinese model published under google on Hugging Face and use it for the ee example?

@shengyumao
Collaborator


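Hello. Judging from the error message, did you run predict.py directly after changing the parameter? After changing task_name to role in train.yaml, you need to run python run.py again to train an event argument extraction model. As mentioned in the README, the event extraction task requires training models for two stages.
Also, the error message suggests that the dev_trigger_pred_file parameter in predict.yaml was deleted or set to empty. Its default value is ./exp/DuEE/trigger/bert-base-chinese/eval_pred.json. That file holds the trigger predictions produced while training the trigger model, and it is used afterwards for the pipeline's event argument extraction. Please check it again.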
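
The precondition described here (the trigger stage must already have written its prediction file before the role stage runs) can be verified up front. This is a hedged sketch rather than anything from the DeepKE code base: `resolve_trigger_pred_file` is a hypothetical helper, and it simply assumes the configured path is relative to the working directory.

```python
import os
from typing import Optional


def resolve_trigger_pred_file(cwd: str, pred_file: Optional[str]) -> str:
    """Join cwd and pred_file, failing early with a readable message."""
    if pred_file is None:
        # This is the situation that makes os.path.join raise a TypeError.
        raise ValueError("dev_trigger_pred_file is empty; set it in predict.yaml")
    path = os.path.normpath(os.path.join(cwd, pred_file))
    if not os.path.isfile(path):
        # The file is produced by the trigger stage, so that stage must run first.
        raise FileNotFoundError(f"{path} not found; train the trigger model first")
    return path
```

Calling it with the working directory and the configured value either returns a usable path or names the exact problem, instead of failing inside os.path.join.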

> I would also like to confirm one thing: is it OK to download the bert-base-chinese model published under google on Hugging Face and use it for the ee example?

Downloading it from Hugging Face is fine.

@zhuweigang
Author

zhuweigang commented Jun 24, 2024

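Thank you for the helpful reply :-)

  1. I changed task_name to role in train.yaml, ran run.py again, and then ran predict.py. That is how I got this error.
  2. I did not touch the dev_trigger_pred_file parameter in predict.yaml. It is still ./exp/DuEE/trigger/bert-base-chinese/eval_pred.json, and that file exists.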

@zhuweigang
Author

zhuweigang commented Jun 24, 2024

Here are the two configuration files.


************** train.yaml ******************


data_name: DuEE # [ACE, DuEE]
model_name_or_path: bert-base-chinese # [bert-base-uncased, bert-base-chinese] english for ace, chinese for duee
#task_name: trigger # [trigger, role]
task_name: role
model_type: bertcrf
do_train: True
do_eval: True
do_predict: False # True for ACE, False for DuEE
labels: ""
config_name: ""
tokenizer_name: ""
cache_dir: ""
evaluate_during_training: True
do_lower_case: True
weight_decay: 0.0
learning_rate: 5e-5
adam_epsilon: 1e-8
per_gpu_train_batch_size: 16
per_gpu_eval_batch_size: 16
gradient_accumulation_steps: 1
max_seq_length: 256
max_grad_norm: 1.0
num_train_epochs: 5
max_steps: 500
warmup_steps: 0
logging_steps: 500
save_steps: 500
eval_all_checkpoints: False
no_cuda: False
n_gpu: 0
overwrite_output_dir: True
overwrite_cache: True
seed: 42
fp16: False
fp16_opt_level: "01"
local_rank: -1
data_dir: "" # parsing in run.py
tag_path: "" # parsing in run.py
output_dir: "" # parsing in run.py
dev_trigger_pred_file: null
test_trigger_pred_file: null


*************** predict.yaml ***************


defaults:
  - train

data_name: DuEE # [ACE, DuEE]
model_name_or_path: ./exp/DuEE/role/bert-base-chinese
task_name: role # the trigger prediction is done during the training process.
do_train: False
do_eval: True
do_predict: False # True for ACE, False for DuEE

do_pipeline_predict: True
overwrite_cache: True

dev_trigger_pred_file: ./exp/DuEE/trigger/bert-base-chinese/eval_pred.json # change to your pred file of trigger classification
test_trigger_pred_file: ./exp/DuEE/trigger/bert-base-chinese/test_pred.json
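
Both files above define dev_trigger_pred_file (null in train.yaml, a path in predict.yaml), so the final value depends on which file is applied last when they are combined. The sketch below only illustrates that ordering effect with plain dictionaries. It is not Hydra's implementation, and the thread does not say which order the reporter's Hydra version used; the `compose` helper and the trimmed-down dictionaries are assumptions for illustration.

```python
def compose(*layers: dict) -> dict:
    """Merge flat config layers left to right; later layers win."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Only the keys relevant to the error are copied from the two files above.
train = {
    "model_name_or_path": "bert-base-chinese",
    "dev_trigger_pred_file": None,
}
predict = {
    "model_name_or_path": "./exp/DuEE/role/bert-base-chinese",
    "dev_trigger_pred_file": "./exp/DuEE/trigger/bert-base-chinese/eval_pred.json",
}

predict_last = compose(train, predict)  # predict.yaml overrides train.yaml
train_last = compose(predict, train)    # train.yaml overrides predict.yaml

print(predict_last["dev_trigger_pred_file"])
print(train_last["dev_trigger_pred_file"])
```

If train.yaml is applied last, dev_trigger_pred_file ends up as None even though predict.yaml sets it, which would be consistent with the TypeError reported earlier in the thread.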

@shengyumao
Collaborator

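Hello. The error here is "TypeError: join() argument must be str, bytes, or os.PathLike object, not 'NoneType'", which means that one of args.cwd and args.dev_trigger_pred_file is None. You could set a breakpoint or print the relevant variables to check. I re-ran the example in my own environment, and with the default parameters line 49 of predict.py does not raise an error, so please look at which variable is wrong in your environment.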
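
One way to follow the "print the relevant variables" suggestion is to list which of the suspected settings are actually None just before the failing line. This is a small sketch under assumptions: `none_valued` is a hypothetical helper, and the `SimpleNamespace` stands in for the real args object, whose exact type is not shown in the thread.

```python
from types import SimpleNamespace
from typing import List


def none_valued(args: object, names: List[str]) -> List[str]:
    """Return the attribute names that are missing or set to None."""
    return [name for name in names if getattr(args, name, None) is None]


# Stand-in for the real configuration object (values are illustrative).
args = SimpleNamespace(
    cwd="/root/DeepKE/example/ee/standard",
    dev_trigger_pred_file=None,
)
print(none_valued(args, ["cwd", "dev_trigger_pred_file"]))
```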

@zhuweigang
Author

zhuweigang commented Jun 24, 2024

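Thanks for the guidance. I am not sure what is wrong with my YAML files. After fixing the problem of the dev_trigger_pred_file and test_trigger_pred_file parameters being empty, I now get the error below. If it is convenient, could you add me on WeChat so we can look at it together? Many thanks.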

[2024-06-24 17:34:19,321][deepke.event_extraction.standard.bertcrf.processor_ee][INFO] - LOOKING AT /root/DeepKE/example/ee/standard/./data/DuEE/role/dev_with_pred_trigger.tsv train
[2024-06-24 17:34:19,345][run][INFO] - Creating features from dataset file at /root/DeepKE/example/ee/standard/./data/DuEE/role
###############
[2024-06-24 17:34:19,345][deepke.event_extraction.standard.bertcrf.processor_ee][INFO] - Writing example 0 of 2015
###############
[2024-06-24 17:34:23,558][run][INFO] - Saving features into cached file /root/DeepKE/example/ee/standard/./data/DuEE/role/cached_dev_bert-base-chinese_256
[2024-06-24 17:34:24,415][run][INFO] - ***** Running evaluation *****
[2024-06-24 17:34:24,416][run][INFO] - Num examples = 2015
[2024-06-24 17:34:24,416][run][INFO] - Batch size = 16
[2024-06-24 17:34:24,416][run][INFO] - Mode = dev
Evaluating: 0%| | 0/126 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [0,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [1,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [2,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [3,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [4,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [5,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [6,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [7,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [8,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [9,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [10,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [11,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [12,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [13,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [14,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [15,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Evaluating: 0%| | 0/126 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/root/DeepKE/example/ee/standard/predict.py", line 124, in main
result, eval_pred_list = evaluate(args, model, eval_dataset, tokenizer, labels, pad_token_label_id, mode="dev", device=device)
File "/root/DeepKE/example/ee/standard/run.py", line 219, in evaluate
outputs = model(pad_token_label_id=pad_token_label_id, **inputs)
File "/root/anaconda3/envs/deepke/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/deepke/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/DeepKE/src/deepke/event_extraction/standard/bertcrf/bert_crf.py", line 89, in forward
loss = self.crf.neg_log_likelihood(crf_logits, crf_mask, crf_labels)
File "/root/DeepKE/src/deepke/event_extraction/standard/bertcrf/crf.py", line 273, in neg_log_likelihood
gold_score = self._score_sentence(scores, mask, tags)
File "/root/DeepKE/src/deepke/event_extraction/standard/bertcrf/crf.py", line 258, in _score_sentence
tg_energy = tg_energy.masked_select(mask.transpose(1, 0))
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

@zhuweigang
Author

The ee problem is solved. The cause was that my hydra-core version was not 1.3.1. Thank you!
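
Since the fix came down to the installed hydra-core version, a quick check of the active environment can rule this out early. The sketch below uses only the standard library's importlib.metadata; the helper names `installed_version` and `check_version` are hypothetical, and 1.3.1 is the version reported as working in this thread.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_version(package: str, expected: str) -> str:
    """Describe whether the installed version equals the expected one."""
    found = installed_version(package)
    if found is None:
        return f"{package} is not installed"
    if found != expected:
        return f"{package} is {found}, expected {expected}"
    return f"{package} {expected} OK"


print(check_version("hydra-core", "1.3.1"))
```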
