Description
Before Reporting
- I have pulled the latest code of the main branch and run it again, and the bug still exists.
- I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend asking a question using the Question template.)
Search before reporting
OS
Ubuntu
Installation Method
pip
Data-Juicer Version
latest, v0.1.2,
Python Version
3.11
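Describe the bug
Hello, I ran into a problem while using Data-Juicer. Even in a clean environment, whenever the config file contains the document_minhash_deduplicator operator, the program crashes with a misleading "unrecognized command-line argument" error. Create a file named my_final_config.yaml with the following content: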
YAML
dataset_path: ./test-data
export_path: ./cleaned_data.jsonl
np: 1
process:
  - document_minhash_deduplicator:
      key: 'text'
Run the command:
Bash
python tools/process_data.py --config my_final_config.yaml
INFO | data_juicer.config.config:577 - dataset_path config is set and a valid local path
error: Unrecognized arguments: --document_minhash_deduplicator.key=text
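For context on the shape of the error message: the rejected argument has the form `--<operator>.<parameter>=<value>`, which suggests that the nested operator settings from the YAML are flattened into dotted arguments before they are validated. The sketch below only illustrates that mapping. `flatten_op_args` is a hypothetical helper written for this report, not part of Data-Juicer, and the real parsing logic may differ.

```python
def flatten_op_args(process):
    """Flatten a list of {op_name: {param: value}} mappings into dotted,
    CLI-style argument strings.

    Hypothetical helper for illustration only; not Data-Juicer's code.
    """
    args = []
    for op in process:
        for op_name, params in op.items():
            # An operator listed without parameters has no mapping to flatten.
            for param, value in (params or {}).items():
                args.append(f"--{op_name}.{param}={value}")
    return args


# The `process` section of my_final_config.yaml, written as a Python literal.
process = [{"document_minhash_deduplicator": {"key": "text"}}]

print(flatten_op_args(process))
# → ['--document_minhash_deduplicator.key=text']
```

Under that assumption, the message would refer to the `key` entry in the YAML file rather than to anything typed on the command line, which is why it reads as misleading.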
To Reproduce
python tools/process_data.py --config my_final_config.yaml
Configs
No response
Logs
No response
Screenshots
No response
Additional
No response