
[Bug]: Following the new PP-UIE documentation, calling the LLM fine-tuning script llm/run_finetune.py fails with No module named 'paddlenlp.datasets.json' #10543


Open
extendswind opened this issue May 6, 2025 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@extendswind

extendswind commented May 6, 2025

Software environment

- paddlepaddle-gpu:  3.0.0
- paddlenlp: 3.0.0b4

Duplicate check

  • I have searched the existing issues

Error description

Documentation followed: https://paddlenlp.readthedocs.io/zh/latest/llm/application/information_extraction/README.html

After annotating in doccano and generating the training data with doccano.py, calling the llm/run_finetune.py script fails with a missing 'paddlenlp.datasets.json' module.

In the previous UIE version's finetune.py, load_dataset was passed a read function defined in utils.py under the UIE directory. The current UIE documentation, however, recommends llm/run_finetune.py for LLM fine-tuning, and loading the data there fails with the missing 'paddlenlp.datasets.json' module.

It is unclear how load_dataset is supposed to resolve a "json" reader module.
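For reference, the older UIE workflow can be sketched as follows. This is a minimal illustration, not the actual utils.py code; the reader name `read_local_json` and the field names are assumptions based on the UIE data format:

```python
import json
import tempfile

def read_local_json(data_path):
    """Yield one example dict per line of a JSON-lines file,
    the kind of read function the old UIE finetune.py passed
    to paddlenlp's load_dataset."""
    with open(data_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Tiny demo with a throwaway file; contents are illustrative only.
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as f:
    f.write('{"content": "text a", "result_list": [], "prompt": "p"}\n')
    f.write('{"content": "text b", "result_list": [], "prompt": "p"}\n')
    path = f.name

examples = list(read_local_json(path))
```

With PaddleNLP installed, such a callable can be passed as the first argument per the custom-dataset docs, roughly `load_dataset(read_local_json, data_path=path, lazy=False)`.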

Steps to reproduce & code

Reproduction

Only a few simple annotations were made in doccano, producing train.json, dev.json, sample_index.json, and test.json.

Training command:

 python -u -m paddle.distributed.launch --gpus "0" run_finetune.py ./config/qwen/sft_argument_test.json

sft_argument_test.json configuration:

{
    "model_name_or_path": "paddlenlp/PP-UIE-0.5B",
    "dataset_name_or_path": "./data",
    "output_dir": "./checkpoints/ie_ckpts",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "per_device_eval_batch_size": 1,
    "eval_accumulation_steps":8,
    "num_train_epochs": 3,
    "learning_rate": 3e-05,
    "warmup_steps": 30,
    "logging_steps": 1,
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "src_length": 1024,
    "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "do_train": true,
    "do_eval": true,
    "disable_tqdm": true,
    "load_best_model_at_end": true,
    "eval_with_do_generation": false,
    "metric_for_best_model": "accuracy",
    "recompute": false,
    "save_total_limit": 1,
    "tensor_parallel_degree": 1,
    "pipeline_parallel_degree": 1,
    "sharding": "stage2",
    "zero_padding": false,
    "unified_checkpoint": true,
    "use_flash_attention": false
  }

Main error output:

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\paddlenlp\datasets\dataset.py", line 202, in load_dataset
    reader_cls = import_main_class(path_or_read_func)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\paddlenlp\datasets\dataset.py", line 99, in import_main_class
    module = importlib.import_module(module_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wind\AppData\Local\Programs\Python\Python311\Lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1140, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'paddlenlp.datasets.json'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\projects\py-paddle-uie\PaddleNLP-3.0.0-beta4\llm\run_finetune.py", line 717, in <module>
    main()
  File "C:\projects\py-paddle-uie\PaddleNLP-3.0.0-beta4\llm\run_finetune.py", line 295, in main
    train_ds, dev_ds, test_ds = create_dataset(data_args, training_args)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\py-paddle-uie\PaddleNLP-3.0.0-beta4\llm\run_finetune.py", line 660, in create_dataset
    train_ds = load_dataset(
               ^^^^^^^^^^^^^
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\paddlenlp\datasets\dataset.py", line 204, in load_dataset
    datasets = load_from_hf(
               ^^^^^^^^^^^^^
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\paddlenlp\datasets\dataset.py", line 123, in load_from_hf
    hf_datasets = load_hf_dataset(path, name=name, split=splits, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\paddlenlp\datasets\dataset.py", line 59, in load_from_ppnlp
    return origin_load_dataset(path, trust_remote_code=True, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\datasets\load.py", line 2151, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\datasets\builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\datasets\builder.py", line 1000, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\datasets\builder.py", line 1741, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\datasets\builder.py", line 1897, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
LAUNCH INFO 2025-05-06 17:22:58,771 Pod failed
LAUNCH ERROR 2025-05-06 17:22:58,772 Container failed !!!
Container rank 0 status failed cmd ['C:\\projects\\py-paddle-uie\\venv\\Scripts\\python.exe', '-u', 'run_finetune.py', './config/qwen/sft_argument_test.json'] code 1 log log\workerlog.0

Troubleshooting

In run_finetune.py, the create_dataset function calls load_dataset with "json" as the first argument whenever train.json or dev.json exists under the dataset path:

def create_dataset(data_args, training_args):
    if data_args.dataset_name_or_path is None:
        raise ValueError(f"Please specific dataset name or path (got {data_args.dataset_name_or_path})")

    train_ds = None
    dev_ds = None
    test_ds = None
    if os.path.exists(os.path.join(data_args.dataset_name_or_path, "train.json")) or os.path.exists(
        os.path.join(data_args.dataset_name_or_path, "dev.json")
    ):
        logger.info("load train")
        if training_args.do_train:
            train_ds = load_dataset(
                "json",
                data_files=os.path.join(data_args.dataset_name_or_path, "train.json"),
                lazy=data_args.lazy,
            )[0]
        logger.info("load eval")
        if training_args.do_eval:
            dev_ds = load_dataset(
                "json",
                data_files=os.path.join(data_args.dataset_name_or_path, "dev.json"),
                lazy=data_args.lazy,
            )[0]
        logger.info("load test")
        if training_args.do_predict:
            test_ds = load_dataset(
                "json",
                data_files=os.path.join(data_args.dataset_name_or_path, "test.json"),
                lazy=data_args.lazy,
            )[0]

By default the "json" name is resolved as a paddlenlp.datasets.json module, but no such implementation exists: the repository contains no json.py, and a global search finds no "json" entry in any __all__.
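To make the failure mode concrete, here is a simplified sketch of how the dataset name appears to be turned into a module path, inferred from the traceback rather than copied from the PaddleNLP source (the real import_main_class also inspects the module's __all__):

```python
import importlib

def resolve_reader_module(name: str) -> str:
    # Built-in dataset names are mapped onto modules under
    # paddlenlp.datasets, so "json" becomes "paddlenlp.datasets.json".
    return f"paddlenlp.datasets.{name}"

module_path = resolve_reader_module("json")

# No json.py ships under paddlenlp/datasets, so importing this path
# raises ModuleNotFoundError, matching the traceback above.
import_failed = False
try:
    importlib.import_module(module_path)
except ModuleNotFoundError:
    import_failed = True
```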

@extendswind extendswind added the bug Something isn't working label May 6, 2025
@paddle-bot paddle-bot bot assigned ZHUI May 6, 2025
@extendswind
Author

Relevant versions from pip list:

  • paddlenlp 3.0.0b4
  • paddlepaddle-gpu 3.0.0
  • paddle2onnx 2.0.2rc1
  • paddlefsl 1.1.0
