[Bug]: Multiprocess tasks fail when running main-branch code on Python 3.11.4: TypeError: code() argument 13 must be str, not int; RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing. #691

### Before Reporting

- [x] I have pulled the latest code of the main branch and run it again, and the bug still exists.

- [x] I have read the [README](https://github.com/alibaba/data-juicer/blob/main/README.md) carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template)


### Search before reporting

- [x] I have searched the Data-Juicer [issues](https://github.com/alibaba/data-juicer/issues) and found no similar bugs.


### OS

Linux

### Installation Method

Installed from source

### Data-Juicer Version

1.3.3

### Python Version

3.11.4

### Describe the bug


2025-06-07 19:17:53 | INFO     | data_juicer.core.executor.default_executor:50 - Using cache compression method: [None]
2025-06-07 19:17:53 | INFO     | data_juicer.core.executor.default_executor:55 - Setting up dataset builder...
2025-06-07 19:17:53 | INFO     | data_juicer.core.data.dataset_builder:37 - found dataset_path setting: /data/l00906000/data-juicer-main/demos/data/demo-dataset.jsonl
2025-06-07 19:17:53 | INFO     | data_juicer.core.data.load_strategy:79 - Getting strategy class for exec: default, data_type: local, data_source: None
2025-06-07 19:17:53 | INFO     | data_juicer.core.executor.default_executor:74 - Preparing exporter...
2025-06-07 19:17:53 | INFO     | data_juicer.core.executor.default_executor:112 - Loading dataset from dataset builder...
2025-06-07 19:17:54 | INFO     | data_juicer.format.formatter:186 - Unifying the input dataset formats...
2025-06-07 19:17:54 | INFO     | data_juicer.format.formatter:201 - There are 6 sample(s) in the original dataset.
Filter (num_proc=2):   0%|          | 0/6 [00:00<?, ? examples/s]Process ForkPoolWorker-1:
Process ForkPoolWorker-2:
Traceback (most recent call last):
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/process.py", line 315, in _bootstrap
    self.run()
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/queues.py", line 371, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/process.py", line 315, in _bootstrap
    self.run()
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 327, in loads
    return load(file, ignore, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 313, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/pool.py", line 114, in worker
    task = get()
           ^^^^^
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 525, in load
    obj = StockUnpickler.load(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/multiprocess/queues.py", line 371, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 327, in loads
    return load(file, ignore, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 659, in _create_code
    if len(args) == 16: return CodeType(*args)
                               ^^^^^^^^^^^^^^^
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 313, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: code() argument 13 must be str, not int
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 525, in load
    obj = StockUnpickler.load(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/dill/_dill.py", line 659, in _create_code
    if len(args) == 16: return CodeType(*args)
                               ^^^^^^^^^^^^^^^
TypeError: code() argument 13 must be str, not int

Filter (num_proc=2):   0%|          | 0/6 [00:00<?, ? examples/s]
object address  : 0x7effec591240
object refcount : 2
object type     : 0x871e00
object type name: RuntimeError
object repr     : RuntimeError('One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.')
lost sys.stderr
2025-06-07 19:17:54 | ERROR    | __main__:19 - An error has been caught in function '<module>', process 'MainProcess' (3765013), thread 'MainThread' (139639783552832):
Traceback (most recent call last):

> File "/data/l00906000/data-juicer-main/tools/process_data.py", line 19, in <module>
    main()
    └ <function main at 0x7effef557560>

  File "/data/l00906000/data-juicer-main/tools/process_data.py", line 15, in main
    executor.run()
    │        └ <function DefaultExecutor.run at 0x7effef5579c0>
    └ <data_juicer.core.executor.default_executor.DefaultExecutor object at 0x7f006a5ca6d0>

  File "/data/l00906000/data-juicer-main/data_juicer/core/executor/default_executor.py", line 115, in run
    dataset = self.dataset_builder.load_dataset(num_proc=load_data_np)
              │    │               │                     └ 2
              │    │               └ <function DatasetBuilder.load_dataset at 0x7efff07ae200>
              │    └ <data_juicer.core.data.dataset_builder.DatasetBuilder object at 0x7effef169e10>
              └ <data_juicer.core.executor.default_executor.DefaultExecutor object at 0x7f006a5ca6d0>

  File "/data/l00906000/data-juicer-main/data_juicer/core/data/dataset_builder.py", line 144, in load_dataset
    dataset = stra.load_data(**kwargs)
              │    │           └ {'num_proc': 2}
              │    └ <function DefaultLocalDataLoadStrategy.load_data at 0x7effef5567a0>
              └ <data_juicer.core.data.load_strategy.DefaultLocalDataLoadStrategy object at 0x7effeee388d0>

  File "/data/l00906000/data-juicer-main/data_juicer/core/data/load_strategy.py", line 297, in load_data
    return formatter.load_dataset(load_data_np, self.cfg)
           │         │            │             │    └ Namespace(config=[Path_fr(configs/demo/process.yaml, cwd=/data/l00906000/data-juicer-main)], auto=False, auto_num=1000, hpo_c...
           │         │            │             └ <data_juicer.core.data.load_strategy.DefaultLocalDataLoadStrategy object at 0x7effeee388d0>
           │         │            └ 2
           │         └ <function LocalFormatter.load_dataset at 0x7efff02d0680>
           └ <data_juicer.format.json_formatter.JsonFormatter object at 0x7effef1849d0>

  File "/data/l00906000/data-juicer-main/data_juicer/format/formatter.py", line 78, in load_dataset
    ds = unify_format(datasets,
         │            └ Dataset({
         │                  features: ['text', 'meta'],
         │                  num_rows: 6
         │              })
         └ <function unify_format at 0x7efff02d0900>

  File "/data/l00906000/data-juicer-main/data_juicer/format/formatter.py", line 212, in unify_format
    dataset = dataset.filter(non_empty_text,
              │       │      └ <function unify_format.<locals>.non_empty_text at 0x7effef265b20>
              │       └ <function NestedDataset.filter at 0x7efff07ada80>
              └ Dataset({
                    features: ['text', 'meta'],
                    num_rows: 6
                })

  File "/data/l00906000/data-juicer-main/data_juicer/core/data/dj_dataset.py", line 449, in filter
    new_ds = NestedDataset(super().filter(*args, **kargs))
             │                             │       └ {'num_proc': 2, 'fn_kwargs': {'target_keys': ['text']}, 'new_fingerprint': '983a0d27f769e02b'}
             │                             └ [<function unify_format.<locals>.non_empty_text at 0x7effec5eefc0>]
             └ <class 'data_juicer.core.data.dj_dataset.NestedDataset'>

  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
         │                                 │    │      │       └ {'num_proc': 2, 'fn_kwargs': {'target_keys': ['text']}, 'new_fingerprint': '983a0d27f769e02b'}
         │                                 │    │      └ (<function unify_format.<locals>.non_empty_text at 0x7effec5eefc0>,)
         │                                 │    └ Dataset({
         │                                 │          features: ['text', 'meta'],
         │                                 │          num_rows: 6
         │                                 │      })
         │                                 └ <function Dataset.filter at 0x7f002c2a7560>
         └ typing.Union
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/fingerprint.py", line 442, in wrapper
    out = func(dataset, *args, **kwargs)
          │    │         │       └ {'num_proc': 2, 'fn_kwargs': {'target_keys': ['text']}, 'new_fingerprint': '983a0d27f769e02b'}
          │    │         └ (<function unify_format.<locals>.non_empty_text at 0x7effec5eefc0>,)
          │    └ Dataset({
          │          features: ['text', 'meta'],
          │          num_rows: 6
          │      })
          └ <function Dataset.filter at 0x7f002c2a74c0>
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3746, in filter
    indices = self.map(
              │    └ <function NestedDataset.map at 0x7efff07ad9e0>
              └ Dataset({
                    features: ['text', 'meta'],
                    num_rows: 6
                })

  File "/data/l00906000/data-juicer-main/data_juicer/core/data/dj_dataset.py", line 417, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
             │                          │       └ {'function': <function wrap_func_with_nested_access.<locals>.wrapped_f at 0x7effec5ef560>, 'with_indices': True, 'with_rank':...
             │                          └ ()
             └ <class 'data_juicer.core.data.dj_dataset.NestedDataset'>

  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
         │                                 │    │      │       └ {'function': <function wrap_func_with_nested_access.<locals>.wrapped_f at 0x7effec5ef560>, 'with_indices': True, 'with_rank':...
         │                                 │    │      └ ()
         │                                 │    └ Dataset({
         │                                 │          features: ['text', 'meta'],
         │                                 │          num_rows: 6
         │                                 │      })
         │                                 └ <function Dataset.map at 0x7f002c2a7060>
         └ typing.Union
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3171, in map
    for rank, done, content in iflatmap_unordered(
        │                      └ <function iflatmap_unordered at 0x7f002c386340>
        └ 1
  File "/data/conda_envs/data-juicer3114/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 721, in iflatmap_unordered
    raise RuntimeError(

RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
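
Both worker tracebacks above fail inside `dill/_dill.py` (`_create_code`) while unpickling the task, so the installed versions of `dill`, `multiprocess`, and `datasets` may be relevant to triage. This is only an assumption drawn from the traceback, not a confirmed cause. A minimal sketch for collecting those versions with the standard library follows; the helper name `collect_versions` is illustrative and not part of Data-Juicer.

```python
import platform
from importlib import metadata

# Packages that appear in the traceback above (assumed relevant, not confirmed).
PACKAGES = ("dill", "multiprocess", "datasets")


def collect_versions(packages=PACKAGES):
    """Return the Python version plus the installed version of each package."""
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # package is not installed in this environment
    return report


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this inside the same conda environment used for `tools/process_data.py` gives a version list that can be attached to the report.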




### To Reproduce

python tools/process_data.py --config configs/demo/process.yaml

### Configs

[{'language_id_score_filter': {'accelerator': None,
                               'audio_key': 'audios',
                               'batch_size': 100,
                               'cpu_required': 1,
                               'history_key': 'history',
                               'image_key': 'images',
                               'index_key': None,
                               'lang': 'zh',
                               'mem_required': 0,
                               'min_score': 0.8,
                               'num_proc': 2,
                               'query_key': 'query',
                               'response_key': 'response',
                               'skip_op_error': True,
                               'stats_export_path': None,
                               'text_key': 'text',
                               'turbo': False,
                               'video_key': 'videos',
                               'work_dir': '/data/l00906000/data-juicer-main/outputs/demo-process'}}]

### Logs

_No response_

### Screenshots

_No response_

### Additional

_No response_
