
[Bug]: With the latest main-branch code on Python 3.11.4, multiprocess tasks fail: TypeError: code() argument 13 must be str, not int; RuntimeError: One of the subprocesses has abruptly died during map operation. To debug the error, disable multiprocessing. #691

Open
@HunterLG

Description


Before Reporting

  • I have pulled the latest code from the main branch and run it again; the bug still exists.

  • I have read the README carefully and no error occurred during installation. (Otherwise, we recommend asking a question using the Question template.)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bug reports.

OS

linux

Installation Method

Installed from source

Data-Juicer Version

1.3.3

Python Version

3.11.4

Describe the bug

2025-06-07 19:17:53 | INFO | data_juicer.core.executor.default_executor:50 - Using cache compression method: [None]
2025-06-07 19:17:53 | INFO | data_juicer.core.executor.default_executor:55 - Setting up dataset builder...
2025-06-07 19:17:53 | INFO | data_juicer.core.data.dataset_builder:37 - found dataset_path setting: /data/l00906000/data-juicer-main/demos/data/demo-dataset.jsonl
2025-06-07 19:17:53 | INFO | data_juicer.core.data.load_strategy:79 - Getting strategy class for exec: default, data_type: local, data_source: None
2025-06-07 19:17:53 | INFO | data_juicer.core.executor.default_executor:74 - Preparing exporter...
2025-06-07 19:17:53 | INFO | data_juicer.core.executor.default_executor:112 - Loading dataset from dataset builder...
2025-06-07 19:17:54 | INFO | data_juicer.format.formatter:186 - Unifying the input dataset formats...
2025-06-07 19:17:54 | INFO | data_juicer.format.formatter:201 - There are 6 sample(s) in the original dataset.
Filter (num_proc=2):   0%| | 0/6 [00:00<?, ? examples/s]
Process ForkPoolWorker-1:
Process ForkPoolWorker-2:
Both workers raise the same error (their output is interleaved in the raw log):
Traceback (most recent call last):
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/process.py", line 315, in _bootstrap
    self.run()
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/pool.py", line 114, in worker
    task = get()
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/queues.py", line 371, in get
    return _ForkingPickler.loads(res)
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 327, in loads
    return load(file, ignore, **kwds)
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 313, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 525, in load
    obj = StockUnpickler.load(self)
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 659, in _create_code
    if len(args) == 16: return CodeType(*args)
TypeError: code() argument 13 must be str, not int
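For context, the TypeError looks like a dill/Python-version mismatch rather than a Data-Juicer bug per se (this analysis and the snippet below are my own, not from the Data-Juicer code): Python 3.11 added `co_qualname` and `co_exceptiontable` to code objects, so `types.CodeType` now takes `co_qualname` (a `str`) as its 13th positional argument. A `dill` build that predates that layout rebuilds code objects from the old 16-element tuple via `_create_code` (the `if len(args) == 16: return CodeType(*args)` line in the traceback), which lands `co_firstlineno` (an `int`) in the 13th slot. A minimal sketch reproducing the mismatch:

```python
import sys
import types


def rebuild_code_310_layout(func):
    """Rebuild func.__code__ from the 16-element positional tuple that
    matches the Python 3.10 CodeType layout (the path an old dill takes
    in _create_code). On 3.11+ the 13th positional argument is
    co_qualname (a str), so co_firstlineno (an int) lands there and
    CPython rejects the call."""
    c = func.__code__
    args_310 = (
        c.co_argcount, c.co_posonlyargcount, c.co_kwonlyargcount,
        c.co_nlocals, c.co_stacksize, c.co_flags, c.co_code,
        c.co_consts, c.co_names, c.co_varnames, c.co_filename,
        c.co_name, c.co_firstlineno, c.co_lnotab,
        c.co_freevars, c.co_cellvars,
    )
    return types.CodeType(*args_310)


if sys.version_info >= (3, 11):
    try:
        rebuild_code_310_layout(lambda x: x + 1)
    except TypeError as e:
        print(e)  # on 3.11+: code() argument 13 must be str, not int
```

If that is the cause here, upgrading `dill` (and `multiprocess`) to releases that know the 3.11 code-object layout (reportedly `dill>=0.3.6`) should clear the worker crash.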

Filter (num_proc=2): 0%| | 0/6 [00:00<?, ? examples/s]
object address : 0x7effec591240
object refcount : 2
object type : 0x871e00
object type name: RuntimeError
object repr : RuntimeError('One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.')
lost sys.stderr
2025-06-07 19:17:54 | ERROR | __main__:<module>:19 - An error has been caught in function '<module>', process 'MainProcess' (3765013), thread 'MainThread' (139639783552832):
Traceback (most recent call last):

File "/data/l00906000/data-juicer-main/tools/process_data.py", line 19, in <module>
main()
└ <function main at 0x7effef557560>

File "/data/l00906000/data-juicer-main/tools/process_data.py", line 15, in main
executor.run()
│ └ <function DefaultExecutor.run at 0x7effef5579c0>
└ <data_juicer.core.executor.default_executor.DefaultExecutor object at 0x7f006a5ca6d0>

File "/data/l00906000/data-juicer-main/data_juicer/core/executor/default_executor.py", line 115, in run
dataset = self.dataset_builder.load_dataset(num_proc=load_data_np)
│ │ │ └ 2
│ │ └ <function DatasetBuilder.load_dataset at 0x7efff07ae200>
│ └ <data_juicer.core.data.dataset_builder.DatasetBuilder object at 0x7effef169e10>
└ <data_juicer.core.executor.default_executor.DefaultExecutor object at 0x7f006a5ca6d0>

File "/data/l00906000/data-juicer-main/data_juicer/core/data/dataset_builder.py", line 144, in load_dataset
dataset = stra.load_data(**kwargs)
│ │ └ {'num_proc': 2}
│ └ <function DefaultLocalDataLoadStrategy.load_data at 0x7effef5567a0>
└ <data_juicer.core.data.load_strategy.DefaultLocalDataLoadStrategy object at 0x7effeee388d0>

File "/data/l00906000/data-juicer-main/data_juicer/core/data/load_strategy.py", line 297, in load_data
return formatter.load_dataset(load_data_np, self.cfg)
│ │ │ │ └ Namespace(config=[Path_fr(configs/demo/process.yaml, cwd=/data/l00906000/data-juicer-main)], auto=False, auto_num=1000, hpo_c...
│ │ │ └ <data_juicer.core.data.load_strategy.DefaultLocalDataLoadStrategy object at 0x7effeee388d0>
│ │ └ 2
│ └ <function LocalFormatter.load_dataset at 0x7efff02d0680>
└ <data_juicer.format.json_formatter.JsonFormatter object at 0x7effef1849d0>

File "/data/l00906000/data-juicer-main/data_juicer/format/formatter.py", line 78, in load_dataset
ds = unify_format(datasets,
│ └ Dataset({
│ features: ['text', 'meta'],
│ num_rows: 6
│ })
└ <function unify_format at 0x7efff02d0900>

File "/data/l00906000/data-juicer-main/data_juicer/format/formatter.py", line 212, in unify_format
dataset = dataset.filter(non_empty_text,
│ │ └ <function unify_format.<locals>.non_empty_text at 0x7effef265b20>
│ └ <function NestedDataset.filter at 0x7efff07ada80>
└ Dataset({
features: ['text', 'meta'],
num_rows: 6
})

File "/data/l00906000/data-juicer-main/data_juicer/core/data/dj_dataset.py", line 449, in filter
new_ds = NestedDataset(super().filter(*args, **kargs))
│ │ └ {'num_proc': 2, 'fn_kwargs': {'target_keys': ['text']}, 'new_fingerprint': '983a0d27f769e02b'}
│ └ [<function unify_format.<locals>.non_empty_text at 0x7effec5eefc0>]
└ <class 'data_juicer.core.data.dj_dataset.NestedDataset'>

File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 2, 'fn_kwargs': {'target_keys': ['text']}, 'new_fingerprint': '983a0d27f769e02b'}
│ │ │ └ (<function unify_format.<locals>.non_empty_text at 0x7effec5eefc0>,)
│ │ └ Dataset({
│ │ features: ['text', 'meta'],
│ │ num_rows: 6
│ │ })
│ └ <function Dataset.filter at 0x7f002c2a7560>
└ typing.Union
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/fingerprint.py", line 442, in wrapper
out = func(dataset, *args, **kwargs)
│ │ │ └ {'num_proc': 2, 'fn_kwargs': {'target_keys': ['text']}, 'new_fingerprint': '983a0d27f769e02b'}
│ │ └ (<function unify_format.<locals>.non_empty_text at 0x7effec5eefc0>,)
│ └ Dataset({
│ features: ['text', 'meta'],
│ num_rows: 6
│ })
└ <function Dataset.filter at 0x7f002c2a74c0>
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3746, in filter
indices = self.map(
│ └ <function NestedDataset.map at 0x7efff07ad9e0>
└ Dataset({
features: ['text', 'meta'],
num_rows: 6
})

File "/data/l00906000/data-juicer-main/data_juicer/core/data/dj_dataset.py", line 417, in map
new_ds = NestedDataset(super().map(*args, **kargs))
│ │ └ {'function': <function wrap_func_with_nested_access.<locals>.wrapped_f at 0x7effec5ef560>, 'with_indices': True, 'with_rank':...
│ └ ()
└ <class 'data_juicer.core.data.dj_dataset.NestedDataset'>

File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ └ {'function': <function wrap_func_with_nested_access.<locals>.wrapped_f at 0x7effec5ef560>, 'with_indices': True, 'with_rank':...
│ │ │ └ ()
│ │ └ Dataset({
│ │ features: ['text', 'meta'],
│ │ num_rows: 6
│ │ })
│ └ <function Dataset.map at 0x7f002c2a7060>
└ typing.Union
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3171, in map
for rank, done, content in iflatmap_unordered(
│ └ <function iflatmap_unordered at 0x7f002c386340>
└ 1
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 721, in iflatmap_unordered
raise RuntimeError(

RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.

To Reproduce

python tools/process_data.py --config configs/demo/process.yaml
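As the RuntimeError message suggests, running with a single process surfaces the real exception in the main process instead of killing a fork-pool worker. A sketch of the override, assuming `np` is the global worker-count option as in Data-Juicer's example configs:

```yaml
# Fragment for configs/demo/process.yaml: with np = 1 the dataset is
# processed in the main process, so the underlying dill TypeError is
# raised directly with a readable traceback.
np: 1
```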

Configs

[{'language_id_score_filter': {'accelerator': None,
                               'audio_key': 'audios',
                               'batch_size': 100,
                               'cpu_required': 1,
                               'history_key': 'history',
                               'image_key': 'images',
                               'index_key': None,
                               'lang': 'zh',
                               'mem_required': 0,
                               'min_score': 0.8,
                               'num_proc': 2,
                               'query_key': 'query',
                               'response_key': 'response',
                               'skip_op_error': True,
                               'stats_export_path': None,
                               'text_key': 'text',
                               'turbo': False,
                               'video_key': 'videos',
                               'work_dir': '/data/l00906000/data-juicer-main/outputs/demo-process'}}]

Logs

No response

Screenshots

No response

Additional

No response

Labels

bug