
[Bug]: Following the new PP-UIE documentation, calling the LLM fine-tuning script llm/run_finetune.py fails with No module named 'paddlenlp.datasets.json' #10543


Open
extendswind opened this issue May 6, 2025 · 1 comment
Assignees
Labels
bug Something isn't working

Comments

@extendswind

extendswind commented May 6, 2025

Software environment

- paddlepaddle-gpu:  3.0.0
- paddlenlp: 3.0.0b4

Duplicate check

  • I have searched the existing issues

Error description

Documentation followed: https://paddlenlp.readthedocs.io/zh/latest/llm/application/information_extraction/README.html

After annotating in doccano and generating the training data with doccano.py, calling the llm/run_finetune.py script fails with a missing 'paddlenlp.datasets.json' module.

In the previous UIE version's finetune.py, load_dataset was passed a read function defined in utils.py under the UIE directory. The current UIE documentation, however, recommends llm/run_finetune.py for LLM fine-tuning, and loading the data there fails with the missing 'paddlenlp.datasets.json' module.

It is unclear how load_dataset is supposed to resolve a "json" reader module.
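For reference, the older UIE workflow can be sketched as follows. This is a minimal illustration, not the actual utils.py code; the reader name `read_local_json` and the field names are assumptions based on the UIE data format:

```python
import json
import tempfile

def read_local_json(data_path):
    """Yield one example dict per line of a JSON-lines file,
    the kind of read function the old UIE finetune.py passed
    to paddlenlp's load_dataset."""
    with open(data_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Tiny demo with a throwaway file; contents are illustrative only.
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as f:
    f.write('{"content": "text a", "result_list": [], "prompt": "p"}\n')
    f.write('{"content": "text b", "result_list": [], "prompt": "p"}\n')
    path = f.name

examples = list(read_local_json(path))
```

With PaddleNLP installed, such a callable can be passed as the first argument per the custom-dataset docs, roughly `load_dataset(read_local_json, data_path=path, lazy=False)`.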

Steps to reproduce & code

Reproduction

Only a few simple annotations were made in doccano, producing train.json, dev.json, sample_index.json, and test.json.

Training command:

 python -u -m paddle.distributed.launch --gpus "0" run_finetune.py ./config/qwen/sft_argument_test.json

sft_argument_test.json configuration:

{
    "model_name_or_path": "paddlenlp/PP-UIE-0.5B",
    "dataset_name_or_path": "./data",
    "output_dir": "./checkpoints/ie_ckpts",
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 1,
    "per_device_eval_batch_size": 1,
    "eval_accumulation_steps":8,
    "num_train_epochs": 3,
    "learning_rate": 3e-05,
    "warmup_steps": 30,
    "logging_steps": 1,
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "src_length": 1024,
    "max_length": 2048,
    "fp16": true,
    "fp16_opt_level": "O2",
    "do_train": true,
    "do_eval": true,
    "disable_tqdm": true,
    "load_best_model_at_end": true,
    "eval_with_do_generation": false,
    "metric_for_best_model": "accuracy",
    "recompute": false,
    "save_total_limit": 1,
    "tensor_parallel_degree": 1,
    "pipeline_parallel_degree": 1,
    "sharding": "stage2",
    "zero_padding": false,
    "unified_checkpoint": true,
    "use_flash_attention": false
  }

Main error output:

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 0 examples [00:00, ? examples/s]
Traceback (most recent call last):
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\paddlenlp\datasets\dataset.py", line 202, in load_dataset
    reader_cls = import_main_class(path_or_read_func)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\paddlenlp\datasets\dataset.py", line 99, in import_main_class
    module = importlib.import_module(module_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\wind\AppData\Local\Programs\Python\Python311\Lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1204, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1176, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1140, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'paddlenlp.datasets.json'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\projects\py-paddle-uie\PaddleNLP-3.0.0-beta4\llm\run_finetune.py", line 717, in <module>
    main()
  File "C:\projects\py-paddle-uie\PaddleNLP-3.0.0-beta4\llm\run_finetune.py", line 295, in main
    train_ds, dev_ds, test_ds = create_dataset(data_args, training_args)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\py-paddle-uie\PaddleNLP-3.0.0-beta4\llm\run_finetune.py", line 660, in create_dataset
    train_ds = load_dataset(
               ^^^^^^^^^^^^^
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\paddlenlp\datasets\dataset.py", line 204, in load_dataset
    datasets = load_from_hf(
               ^^^^^^^^^^^^^
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\paddlenlp\datasets\dataset.py", line 123, in load_from_hf
    hf_datasets = load_hf_dataset(path, name=name, split=splits, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\paddlenlp\datasets\dataset.py", line 59, in load_from_ppnlp
    return origin_load_dataset(path, trust_remote_code=True, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\datasets\load.py", line 2151, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\datasets\builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\datasets\builder.py", line 1000, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\datasets\builder.py", line 1741, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "C:\projects\py-paddle-uie\venv\Lib\site-packages\datasets\builder.py", line 1897, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
LAUNCH INFO 2025-05-06 17:22:58,771 Pod failed
LAUNCH ERROR 2025-05-06 17:22:58,772 Container failed !!!
Container rank 0 status failed cmd ['C:\\projects\\py-paddle-uie\\venv\\Scripts\\python.exe', '-u', 'run_finetune.py', './config/qwen/sft_argument_test.json'] code 1 log log\workerlog.0

Troubleshooting

In run_finetune.py, the create_dataset function calls load_dataset with "json" as the first argument whenever train.json or dev.json exists under the dataset path:

def create_dataset(data_args, training_args):
    if data_args.dataset_name_or_path is None:
        raise ValueError(f"Please specific dataset name or path (got {data_args.dataset_name_or_path})")

    train_ds = None
    dev_ds = None
    test_ds = None
    if os.path.exists(os.path.join(data_args.dataset_name_or_path, "train.json")) or os.path.exists(
        os.path.join(data_args.dataset_name_or_path, "dev.json")
    ):
        logger.info("load train")
        if training_args.do_train:
            train_ds = load_dataset(
                "json",
                data_files=os.path.join(data_args.dataset_name_or_path, "train.json"),
                lazy=data_args.lazy,
            )[0]
        logger.info("load eval")
        if training_args.do_eval:
            dev_ds = load_dataset(
                "json",
                data_files=os.path.join(data_args.dataset_name_or_path, "dev.json"),
                lazy=data_args.lazy,
            )[0]
        logger.info("load test")
        if training_args.do_predict:
            test_ds = load_dataset(
                "json",
                data_files=os.path.join(data_args.dataset_name_or_path, "test.json"),
                lazy=data_args.lazy,
            )[0]

By default the "json" name is resolved as a paddlenlp.datasets.json module, but no such implementation exists: the repository contains no json.py, and a global search finds no "json" entry in any __all__.
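To make the failure mode concrete, here is a simplified sketch of how the dataset name appears to be turned into a module path, inferred from the traceback rather than copied from the PaddleNLP source (the real import_main_class also inspects the module's __all__):

```python
import importlib

def resolve_reader_module(name: str) -> str:
    # Built-in dataset names are mapped onto modules under
    # paddlenlp.datasets, so "json" becomes "paddlenlp.datasets.json".
    return f"paddlenlp.datasets.{name}"

module_path = resolve_reader_module("json")

# No json.py ships under paddlenlp/datasets, so importing this path
# raises ModuleNotFoundError, matching the traceback above.
import_failed = False
try:
    importlib.import_module(module_path)
except ModuleNotFoundError:
    import_failed = True
```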

@extendswind extendswind added the bug Something isn't working label May 6, 2025
@paddle-bot paddle-bot bot assigned ZHUI May 6, 2025
@extendswind
Author

Relevant versions from pip list:

  • paddlenlp 3.0.0b4
  • paddlepaddle-gpu 3.0.0
  • paddle2onnx 2.0.2rc1
  • paddlefsl 1.1.0
