Description
Before Reporting
- I have pulled the latest code from the main branch and run it again, and the bug still exists.
- I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)
Search before reporting
OS
Linux
Installation Method
Installed from source
Data-Juicer Version
1.3.3
Python Version
3.11.4
Describe the bug
2025-06-07 19:17:53 | INFO | data_juicer.core.executor.default_executor:50 - Using cache compression method: [None]
2025-06-07 19:17:53 | INFO | data_juicer.core.executor.default_executor:55 - Setting up dataset builder...
2025-06-07 19:17:53 | INFO | data_juicer.core.data.dataset_builder:37 - found dataset_path setting: /data/l00906000/data-juicer-main/demos/data/demo-dataset.jsonl
2025-06-07 19:17:53 | INFO | data_juicer.core.data.load_strategy:79 - Getting strategy class for exec: default, data_type: local, data_source: None
2025-06-07 19:17:53 | INFO | data_juicer.core.executor.default_executor:74 - Preparing exporter...
2025-06-07 19:17:53 | INFO | data_juicer.core.executor.default_executor:112 - Loading dataset from dataset builder...
2025-06-07 19:17:54 | INFO | data_juicer.format.formatter:186 - Unifying the input dataset formats...
2025-06-07 19:17:54 | INFO | data_juicer.format.formatter:201 - There are 6 sample(s) in the original dataset.
Filter (num_proc=2): 0%| | 0/6 [00:00<?, ? examples/s]Process ForkPoolWorker-1:
Process ForkPoolWorker-2:
Traceback (most recent call last):
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/process.py", line 315, in _bootstrap
self.run()
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/pool.py", line 114, in worker
task = get()
^^^^^
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/queues.py", line 371, in get
return _ForkingPickler.loads(res)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/process.py", line 315, in _bootstrap
self.run()
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 327, in loads
return load(file, ignore, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 313, in load
return Unpickler(file, ignore=ignore, **kwds).load()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/pool.py", line 114, in worker
task = get()
^^^^^
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 525, in load
obj = StockUnpickler.load(self)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/queues.py", line 371, in get
return _ForkingPickler.loads(res)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 327, in loads
return load(file, ignore, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 659, in _create_code
if len(args) == 16: return CodeType(*args)
^^^^^^^^^^^^^^^
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 313, in load
return Unpickler(file, ignore=ignore, **kwds).load()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: code() argument 13 must be str, not int
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 525, in load
obj = StockUnpickler.load(self)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 659, in _create_code
if len(args) == 16: return CodeType(*args)
^^^^^^^^^^^^^^^
TypeError: code() argument 13 must be str, not int
Filter (num_proc=2): 0%| | 0/6 [00:00<?, ? examples/s]
object address : 0x7effec591240
object refcount : 2
object type : 0x871e00
object type name: RuntimeError
object repr : RuntimeError('One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.')
lost sys.stderr
2025-06-07 19:17:54 | ERROR | __main__:19 - An error has been caught in function '<module>', process 'MainProcess' (3765013), thread 'MainThread' (139639783552832):
Traceback (most recent call last):
File "/data/l00906000/data-juicer-main/tools/process_data.py", line 19, in <module>
main()
└ <function main at 0x7effef557560>
File "/data/l00906000/data-juicer-main/tools/process_data.py", line 15, in main
executor.run()
│ └ <function DefaultExecutor.run at 0x7effef5579c0>
└ <data_juicer.core.executor.default_executor.DefaultExecutor object at 0x7f006a5ca6d0>
File "/data/l00906000/data-juicer-main/data_juicer/core/executor/default_executor.py", line 115, in run
dataset = self.dataset_builder.load_dataset(num_proc=load_data_np)
│ │ │ └ 2
│ │ └ <function DatasetBuilder.load_dataset at 0x7efff07ae200>
│ └ <data_juicer.core.data.dataset_builder.DatasetBuilder object at 0x7effef169e10>
└ <data_juicer.core.executor.default_executor.DefaultExecutor object at 0x7f006a5ca6d0>
File "/data/l00906000/data-juicer-main/data_juicer/core/data/dataset_builder.py", line 144, in load_dataset
dataset = stra.load_data(**kwargs)
│ │ └ {'num_proc': 2}
│ └ <function DefaultLocalDataLoadStrategy.load_data at 0x7effef5567a0>
└ <data_juicer.core.data.load_strategy.DefaultLocalDataLoadStrategy object at 0x7effeee388d0>
File "/data/l00906000/data-juicer-main/data_juicer/core/data/load_strategy.py", line 297, in load_data
return formatter.load_dataset(load_data_np, self.cfg)
│ │ │ │ └ Namespace(config=[Path_fr(configs/demo/process.yaml, cwd=/data/l00906000/data-juicer-main)], auto=False, auto_num=1000, hpo_c...
│ │ │ └ <data_juicer.core.data.load_strategy.DefaultLocalDataLoadStrategy object at 0x7effeee388d0>
│ │ └ 2
│ └ <function LocalFormatter.load_dataset at 0x7efff02d0680>
└ <data_juicer.format.json_formatter.JsonFormatter object at 0x7effef1849d0>
File "/data/l00906000/data-juicer-main/data_juicer/format/formatter.py", line 78, in load_dataset
ds = unify_format(datasets,
│ └ Dataset({
│ features: ['text', 'meta'],
│ num_rows: 6
│ })
└ <function unify_format at 0x7efff02d0900>
File "/data/l00906000/data-juicer-main/data_juicer/format/formatter.py", line 212, in unify_format
dataset = dataset.filter(non_empty_text,
│ │ └ <function unify_format.<locals>.non_empty_text at 0x7effef265b20>
│ └ <function NestedDataset.filter at 0x7efff07ada80>
└ Dataset({
features: ['text', 'meta'],
num_rows: 6
})
File "/data/l00906000/data-juicer-main/data_juicer/core/data/dj_dataset.py", line 449, in filter
new_ds = NestedDataset(super().filter(*args, **kargs))
│ │ └ {'num_proc': 2, 'fn_kwargs': {'target_keys': ['text']}, 'new_fingerprint': '983a0d27f769e02b'}
│ └ [<function unify_format.<locals>.non_empty_text at 0x7effec5eefc0>]
└ <class 'data_juicer.core.data.dj_dataset.NestedDataset'>
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'num_proc': 2, 'fn_kwargs': {'target_keys': ['text']}, 'new_fingerprint': '983a0d27f769e02b'}
│ │ │ └ (<function unify_format.<locals>.non_empty_text at 0x7effec5eefc0>,)
│ │ └ Dataset({
│ │ features: ['text', 'meta'],
│ │ num_rows: 6
│ │ })
│ └ <function Dataset.filter at 0x7f002c2a7560>
└ typing.Union
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/fingerprint.py", line 442, in wrapper
out = func(dataset, *args, **kwargs)
│ │ │ └ {'num_proc': 2, 'fn_kwargs': {'target_keys': ['text']}, 'new_fingerprint': '983a0d27f769e02b'}
│ │ └ (<function unify_format.<locals>.non_empty_text at 0x7effec5eefc0>,)
│ └ Dataset({
│ features: ['text', 'meta'],
│ num_rows: 6
│ })
└ <function Dataset.filter at 0x7f002c2a74c0>
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3746, in filter
indices = self.map(
│ └ <function NestedDataset.map at 0x7efff07ad9e0>
└ Dataset({
features: ['text', 'meta'],
num_rows: 6
})
File "/data/l00906000/data-juicer-main/data_juicer/core/data/dj_dataset.py", line 417, in map
new_ds = NestedDataset(super().map(*args, **kargs))
│ │ └ {'function': <function wrap_func_with_nested_access.<locals>.wrapped_f at 0x7effec5ef560>, 'with_indices': True, 'with_rank':...
│ └ ()
└ <class 'data_juicer.core.data.dj_dataset.NestedDataset'>
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'function': <function wrap_func_with_nested_access.<locals>.wrapped_f at 0x7effec5ef560>, 'with_indices': True, 'with_rank':...
│ │ │ └ ()
│ │ └ Dataset({
│ │ features: ['text', 'meta'],
│ │ num_rows: 6
│ │ })
│ └ <function Dataset.map at 0x7f002c2a7060>
└ typing.Union
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3171, in map
for rank, done, content in iflatmap_unordered(
│ └ <function iflatmap_unordered at 0x7f002c386340>
└ 1
File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 721, in iflatmap_unordered
raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
To Reproduce
python tools/process_data.py --config configs/demo/process.yaml
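The worker traceback above fails inside dill's `_create_code` while unpickling the task, before any Data-Juicer operator runs. A smaller check that is independent of Data-Juicer may help narrow this down. The sketch below prints the installed versions of the packages named in the traceback and round-trips a nested function through dill, similar in shape to `non_empty_text` inside `unify_format`. The helper names (`package_versions`, `make_nested`, `roundtrip_nested_function`) are hypothetical and exist only for this illustration. A possible version mismatch between dill and Python 3.11 is an assumption here, not a confirmed cause.

```python
import importlib
from importlib import metadata


def package_versions(names=("dill", "multiprocess", "datasets")):
    """Return {package: version or None} for the packages seen in the traceback."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            versions[name] = None
    return versions


def make_nested():
    """Build a function defined inside another function (a '<locals>' function)."""
    def non_empty(sample, target_keys=("text",)):
        # True only if every target key holds a non-empty value.
        return all(sample.get(key) for key in target_keys)
    return non_empty


def roundtrip_nested_function():
    """Serialize and restore a nested function with dill.

    Returns 'ok', a 'skipped: ...' note if dill is missing,
    or 'failed: ...' with the exception text.
    """
    try:
        dill = importlib.import_module("dill")
    except ImportError:
        return "skipped: dill is not installed"
    try:
        restored = dill.loads(dill.dumps(make_nested()))
        return "ok" if restored({"text": "hello"}) else "unexpected result"
    except Exception as exc:  # report whatever dill raises
        return f"failed: {type(exc).__name__}: {exc}"


if __name__ == "__main__":
    print(package_versions())
    print(roundtrip_nested_function())
```

If the round trip fails in the same environment with a `TypeError` from `_create_code`, the problem is likely in the dill installation rather than in the Data-Juicer config. If it succeeds, the failure is more likely specific to how the worker processes are started.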
Configs
[{'language_id_score_filter': {'accelerator': None,
                               'audio_key': 'audios',
                               'batch_size': 100,
                               'cpu_required': 1,
                               'history_key': 'history',
                               'image_key': 'images',
                               'index_key': None,
                               'lang': 'zh',
                               'mem_required': 0,
                               'min_score': 0.8,
                               'num_proc': 2,
                               'query_key': 'query',
                               'response_key': 'response',
                               'skip_op_error': True,
                               'stats_export_path': None,
                               'text_key': 'text',
                               'turbo': False,
                               'video_key': 'videos',
                               'work_dir': '/data/l00906000/data-juicer-main/outputs/demo-process'}}]
Logs
No response
Screenshots
No response
Additional
No response