[Bug]: Using deduplicator in config file leads to "Unrecognized arguments" error #704

Open
@yywhsgnd

Description

Before Reporting

  • I have pulled the latest code of the main branch and run it again, and the bug still exists.

  • I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)

Search before reporting

  • I have searched the Data-Juicer issues and found no similar bugs.

OS

Ubuntu

Installation Method

pip

Data-Juicer Version

latest, v0.1.2

Python Version

3.11

Describe the bug

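Hello, I ran into a problem while using Data-Juicer. Even in a clean environment, whenever the config file contains the document_minhash_deduplicator operator, the program crashes with a misleading "unrecognized command-line arguments" error. Create a file named my_final_config.yaml with the following content: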

YAML

dataset_path: ./test-data
export_path: ./cleaned_data.jsonl
np: 1
process:
  - document_minhash_deduplicator:
      key: 'text'

Run the command:

Bash

python tools/process_data.py --config my_final_config.yaml

INFO | data_juicer.config.config:577 - dataset_path config is set and a valid local path
error: Unrecognized arguments: --document_minhash_deduplicator.key=text
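
The error names a dotted argument even though nothing was passed on the command line besides --config. One plausible reading is that nested config entries are flattened into dotted names before validation. The sketch below only illustrates that idea; it is not Data-Juicer's actual parsing code, and the helper names (`flatten_to_cli_args`, `unrecognized`) and the accepted parameter name `some_param` are hypothetical.

```python
from typing import Any


def flatten_to_cli_args(config: dict[str, Any], prefix: str = "") -> list[str]:
    """Flatten a nested mapping into dotted ``--a.b=value`` strings."""
    args: list[str] = []
    for name, value in config.items():
        dotted = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            # Recurse so that nested keys become "outer.inner".
            args.extend(flatten_to_cli_args(value, dotted))
        else:
            args.append(f"--{dotted}={value}")
    return args


def unrecognized(args: list[str], accepted: set[str]) -> list[str]:
    """Return the arguments whose dotted name is not in ``accepted``."""
    return [a for a in args if a[2:].split("=", 1)[0] not in accepted]


# One entry of the `process` list from the report, as a nested mapping.
op_entry = {"document_minhash_deduplicator": {"key": "text"}}

# Hypothetical set of accepted names; the real set depends on the
# installed Data-Juicer version and is NOT taken from its source code.
accepted_names = {"document_minhash_deduplicator.some_param"}

flat_args = flatten_to_cli_args(op_entry)
rejected = unrecognized(flat_args, accepted_names)
print(flat_args)  # ['--document_minhash_deduplicator.key=text']
```

Under this reading, the message would point at an operator parameter name that the parser does not accept rather than at anything typed on the command line, which may be why it looks misleading.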

To Reproduce

python tools/process_data.py --config my_final_config.yaml
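
For anyone trying to reproduce this, a minimal sketch that recreates the config file from the description and prints the command to run. The file name and contents come from the report; the helper name `write_repro_config` is hypothetical, and the nesting of `key` under the operator is an assumption inferred from the dotted name in the error message.

```python
from pathlib import Path

# Config contents as given in the report. The indentation of `key` under
# the operator is assumed from the error message, not confirmed.
CONFIG_TEXT = """\
dataset_path: ./test-data
export_path: ./cleaned_data.jsonl
np: 1
process:
  - document_minhash_deduplicator:
      key: 'text'
"""


def write_repro_config(directory: Path) -> Path:
    """Write the config from the report into ``directory`` and return its path."""
    path = directory / "my_final_config.yaml"
    path.write_text(CONFIG_TEXT, encoding="utf-8")
    return path


# Write into the current directory, then show the command from the report.
config_path = write_repro_config(Path("."))
print(f"python tools/process_data.py --config {config_path}")
```

The script only writes the file. It does not invoke Data-Juicer, so it can be run outside the repository checkout.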

Configs

No response

Logs

No response

Screenshots

No response

Additional

No response
