v0.3.02 by xming521 · Pull Request #196 · xming521/WeClone

xming521 · 2025-08-16T14:55:28Z

Summary by Sourcery

Enable configurable thinking in offline cleaning, improve image and gif handling in QA processing, refactor configuration models for cleaner dataset naming, and bump versions and dependencies for release v0.3.02

New Features:

Introduce enable_thinking flag in LLMCleanConfig to control offline cleaning behavior
Automatically assign maximum score to QA pairs containing images, bypassing LLM scoring
Add gif-to-sticker conversion in CSV loader to support animated emoticons

Enhancements:

Refactor cleaned_dataset_name to derive dynamically from original dataset
Pass enable_thinking through vLLM inference pipeline and adjust repetition_penalty and max_new_tokens accordingly
Implement CommonMethods to parse dataset names with modality-based suffixes and remove deprecated config fields
Always run LLM cleaning in training workflow, removing image-based exclusion
Convert QA id to string in save_result for consistency
Switch from pyjson5.load to loads for config parsing and populate include_type for train config

Build:

Bump project version to 0.3.02 and config_version to 0.3.02
Update dependencies: openai to 1.87.0, vllm to 0.10.0, torch to 2.7.1, add torchvision, transformers to 4.53.2, and triton to 3.3.1

CI:

Upgrade pre-commit-hooks to v6.0.0 and ruff to v0.12.8

Documentation:

Revise README and Chinese version to soften language around model size performance

Increases LoRA rank from 4 to 16 in example and default configurations. This aims to improve model fine-tuning effectiveness. Adjusts per-device batch size and gradient accumulation steps. This maintains the same effective batch size while potentially reducing memory usage. Standardizes LoRA dropout to 0.25 across configurations. Updates READMEs to clarify model performance expectations.

Ensures the ID field for generated QA items is always a string. This provides type consistency for downstream consumers.

Ensures a specific PII test data file is tracked by Git, enabling its use in related tests.

Updates pre-commit hooks. Previously, the cleaning pipeline was entirely disabled for datasets containing image messages. Image conversations now receive a default score, bypassing text-focused LLM cleaning. This enables partial cleaning for mixed datasets and prevents pipeline failures.

Reduces the default LoRA rank in training configuration templates from 16 to 8. This change aims to optimize resource usage and potentially accelerate training.

Introduces a new 'enable_thinking' configuration for LLM interactions. This allows dynamic control of thinking behavior in both offline inference and the data cleaning process. When thinking is enabled for cleaning, the maximum number of new tokens is increased to accommodate longer reasoning outputs. Also reduces the repetition penalty in the cleaning strategy for improved response diversity and skips data entries containing images during the LLM cleaning process to avoid irrelevant processing.
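A minimal sketch of how an `enable_thinking` flag might drive the generation settings used for cleaning. The names `CleaningGenParams` and `build_gen_params` and every numeric value are hypothetical; the PR only states that `max_new_tokens` is raised when thinking is enabled, that the repetition penalty is lowered, and (in the reviewed diff) that the flag is passed through `chat_template_kwargs`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CleaningGenParams:
    """Hypothetical container for generation settings used during cleaning."""

    enable_thinking: bool
    max_new_tokens: int
    repetition_penalty: float


def build_gen_params(enable_thinking: bool) -> CleaningGenParams:
    # Token budgets are illustrative placeholders, not WeClone's actual values.
    max_new_tokens = 2048 if enable_thinking else 256
    return CleaningGenParams(
        enable_thinking=enable_thinking,
        max_new_tokens=max_new_tokens,
        repetition_penalty=1.05,  # placeholder value
    )


def chat_template_kwargs(params: CleaningGenParams) -> dict:
    # The reviewed diff forwards the flag to the chat template under this key.
    return {"enable_thinking": params.enable_thinking}
```

Keeping the token budget derived from the flag in one place means a caller cannot enable thinking and forget to leave room for the longer reasoning output.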

Ensures that files with a .gif extension are correctly identified and categorized as animated expressions or stickers based on the platform. This improves the accuracy of media type classification during data processing.
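The classification described above could look roughly like the following sketch. The function name `classify_media`, the `platform` values, and the `"sticker"` label are assumptions for illustration; only the case-insensitive `.gif` check, the text-type skip, and the `动画表情` type name appear in the PR.

```python
def classify_media(src: str, type_name: str, platform: str) -> str:
    """Return the message type, remapping .gif files to a sticker-like type."""
    # Text messages are left untouched, mirroring the loop in the reviewed diff.
    if type_name.lower() in ("文本", "text"):
        return type_name
    if src.lower().endswith(".gif"):
        # Hypothetical platform mapping: one platform uses the Chinese type
        # name for animated emoticons, the others use a generic sticker label.
        return "动画表情" if platform == "wechat" else "sticker"
    return type_name
```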

Updates the Qwen3 model in test configurations from 4B to the smaller 0.6B version. This change helps improve test performance and reduce resource requirements.

Upgrades openai, vllm, torch, transformers, and triton to their latest versions. This ensures compatibility, incorporates recent bug fixes, and leverages performance improvements across the project.

This release introduces the ability to control the offline cleaning process through the `enable_thinking` configuration parameter.

Refactors dataset management to consistently append '-vl' for vision-language datasets and dynamically name cleaned datasets (e.g., 'dataset-cleaned'). Enforces vLLM as a strict dependency for LLM-based data cleaning, exiting if unavailable. Integrates 'enable_thinking' option for LLM cleaning and enables cleaning by default in relevant test configurations. Adds torchvision dependency for vision models and streamlines the cleaning call in training to centralize decision-making.
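As an illustration of the naming scheme, here is a small sketch under the assumption that the suffixes are simply appended to a base name. The function names and the base name `chat-sft` are hypothetical; the `-vl` and `-cleaned` suffixes come from the description above.

```python
from typing import Iterable


def parse_dataset_name(base: str, include_type: Iterable[str]) -> str:
    """Append '-vl' when the image modality is part of the dataset."""
    return f"{base}-vl" if "image" in list(include_type) else base


def cleaned_dataset_name(original: str) -> str:
    """Derive the cleaned dataset's name from the original dataset's name."""
    return f"{original}-cleaned"


name = parse_dataset_name("chat-sft", ["text", "image"])
print(name, cleaned_dataset_name(name))
```

Deriving the cleaned name from the original rather than storing it as a separate config field keeps the two from drifting apart when the modality list changes.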

sourcery-ai · 2025-08-16T14:55:46Z

Reviewer's Guide

This PR refines the data cleaning and inference workflows by adding image‐modality support and configurable LLM reasoning, restructures config models for dynamic dataset naming and post‐processing, bumps project and dependency versions, enhances QA parsing, simplifies the training entry point, updates pre‐commit hooks, and polishes documentation.

Sequence diagram for LLM cleaning with image-modality support and enable_thinking

sequenceDiagram
    participant TrainSft as train_sft.py
    participant Cleaner as LLMCleaningStrategy
    participant Infer as vllm_infer
    TrainSft->>Cleaner: clean()
    Cleaner->>Infer: vllm_infer(inputs, enable_thinking)
    Infer-->>Cleaner: parsed_scores, failed_indexs
    Cleaner-->>TrainSft: cleaned_dataset_name

ER diagram for updated dataset naming and image-modality flag

erDiagram
    WCTrainSftConfig {
        string dataset
        string output_dir
    }
    WCMakeDatasetConfig {
        string dataset
    }
    WCTrainSftConfig ||--o{ CleanDatasetConfig : has
    CleanDatasetConfig {
        bool enable_clean
        string clean_strategy
        LLMCleanConfig llm
    }
    LLMCleanConfig {
        int accept_score
        bool enable_thinking
    }

Class diagram for updated config models with dynamic dataset naming

classDiagram
    class CommonMethods {
        +_parse_dataset_name() str
    }
    class WCTrainSftConfig {
        +output_dir: Optional[str]
        +dataset: str
        +process_config()
    }
    class WCMakeDatasetConfig {
        +process_config()
        +dataset: str
    }
    CommonMethods <|.. WCTrainSftConfig
    CommonMethods <|.. WCMakeDatasetConfig
    WCTrainSftConfig --|> CommonArgs
    WCTrainSftConfig --|> TrainSftArgs
    WCMakeDatasetConfig --|> CommonArgs
    WCMakeDatasetConfig --|> MakeDatasetArgs
    class LLMCleanConfig {
        +accept_score: int
        +enable_thinking: bool
    }
    class CleanDatasetConfig {
        +enable_clean: bool
        +clean_strategy: CleanStrategy
        +llm: LLMCleanConfig
    }

File-Level Changes

Change	Details	Files
Enhanced cleaning strategy to handle image-based QA and configurable LLM reasoning	compute cleaned_dataset_name dynamically from original_dataset_name; in judge(), assign a score to QA pairs with images and skip vLLM calls for them; make repetition_penalty, enable_thinking, and max_new_tokens config-driven; propagate enable_thinking to vllm_infer and delete the LLM instance after use	`weclone/data/clean/strategies.py` `weclone/core/inference/offline_infer.py`
Refactored config models for optional adapters and dynamic dataset naming	make adapter_name_or_path optional and add enable_thinking flag; introduce CommonMethods._parse_dataset_name to append '-vl' when image modality is included; extend WCTrainSftConfig and WCMakeDatasetConfig with CommonMethods, derive the dataset field, and remove adapter_name_or_path and include_type post-validation; switch to pyjson5.loads in config loading and inject include_type into train_sft args	`weclone/utils/config_models.py` `weclone/utils/config.py`
Bumped project version and updated dependencies	increment project version and config_version to v0.3.02; upgrade openai, vllm, torch, torchvision, transformers, triton; update config changelog entry	`pyproject.toml`
Improved QA generator data handling and error paths	enforce exit on missing vLLM support; skip cleaning-image error, support '.gif' as animation and map to sticker type; include new '动画表情' (animated emoticon) type and clear src for stickers; serialize QA id as string in save_result	`weclone/data/qa_generator.py`
Simplified training entry point to always apply cleaning	remove condition that skipped cleaning when images were included or cleaning was disabled; always instantiate LLMCleaningStrategy and assign the cleaned dataset	`weclone/train/train_sft.py`
Updated pre-commit hook versions	bump pre-commit-hooks to v6.0.0; bump ruff-pre-commit to v0.12.8	`.pre-commit-config.yaml`
Rephrased documentation for clarity	update English and Chinese README model performance statements	`README.md` `README_zh.md`

Possibly linked issues

#0: The PR updates data cleaning logic, dataset naming, and related dependencies, which directly address the issue's data preprocessing failure.

sourcery-ai

Hey there - I've reviewed your changes - here's some feedback:

In the judge method, skipping image-based QA pairs without adjusting the input‐to‐output index mapping can misalign vLLM inference results and scores—consider preserving placeholder inputs or remapping indices to keep parsed scores in sync.
The train_sft.main cleanup step now unconditionally runs the cleaner even when enable_clean is false or image modalities are included; if that wasn’t intentional, reintroduce the guard or document the behavior change.
The new CommonMethods._parse_dataset_name and the dynamic removal of adapter_name_or_path and include_type seem duplicated across config classes—consider consolidating this logic to avoid inconsistencies in output_dir/dataset initialization.
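The index-remapping option from the first comment can be sketched as follows. This is not the project's `judge` implementation: the `QA` dataclass, the `score_batch` callable standing in for the vLLM call, and the `prompt_to_item` list are hypothetical. The score of 6 for image pairs is the value shown in the reviewed diff.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

IMAGE_SCORE = 6  # value assigned to image pairs in the reviewed diff


@dataclass
class QA:
    id: int
    text: str
    images: List[str] = field(default_factory=list)
    score: Optional[int] = None


def judge(items: List[QA], score_batch: Callable[[List[str]], List[int]]) -> None:
    prompts: List[str] = []
    prompt_to_item: List[int] = []  # position in `prompts` -> position in `items`
    for idx, qa in enumerate(items):
        if qa.images:
            qa.score = IMAGE_SCORE  # image pairs bypass the LLM entirely
            continue
        prompts.append(qa.text)
        prompt_to_item.append(idx)
    # `score_batch` returns one score per prompt, in prompt order.
    for out_idx, score in enumerate(score_batch(prompts)):
        items[prompt_to_item[out_idx]].score = score


items = [QA(1, "a"), QA(2, "b", images=["x.png"]), QA(3, "ccc")]
judge(items, lambda ps: [len(p) for p in ps])  # toy scorer: prompt length
print([q.score for q in items])
```

Because the mapping is recorded at the moment a prompt is appended, skipped items cannot shift later scores onto the wrong QA pair.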

Prompt for AI Agents

Please address the comments from this code review:
## Overall Comments
- In the `judge` method, skipping image-based QA pairs without adjusting the input‐to‐output index mapping can misalign vLLM inference results and scores—consider preserving placeholder inputs or remapping indices to keep parsed scores in sync.
- The `train_sft.main` cleanup step now unconditionally runs the cleaner even when `enable_clean` is false or image modalities are included; if that wasn’t intentional, reintroduce the guard or document the behavior change.
- The new `CommonMethods._parse_dataset_name` and the dynamic removal of `adapter_name_or_path` and `include_type` seem duplicated across config classes—consider consolidating this logic to avoid inconsistencies in output_dir/dataset initialization.

## Individual Comments

### Comment 1
<location> `weclone/train/train_sft.py:34` </location>
<code_context>
                 f"Dataset file '{data_path}' does not exist, please check if make-dataset was executed"
             )

-    if not dataset_config.clean_dataset.enable_clean or "image" in dataset_config.include_type:
-        logger.info("Data cleaning is not enabled or images are included, will use the original dataset.")
-    else:
-        cleaner = LLMCleaningStrategy(make_dataset_config=dataset_config)
-        train_config.dataset = cleaner.clean()
+    cleaner = LLMCleaningStrategy(make_dataset_config=dataset_config)
+    train_config.dataset = cleaner.clean()
</code_context>

<issue_to_address>
Removing conditional cleaning may lead to unintended cleaning of image datasets.

Consider restoring the conditional or adding explicit handling to prevent cleaning unsupported image datasets.
</issue_to_address>

Allow `vllm_infer` to return `None` for failed JSON parsing results, clarifying the sequence index for such failures. This enables `LLMCleaningStrategy` to robustly align parsed LLM scores with original data, correctly handling examples where JSON parsing failed by assigning a default score of 0. Adds an option to disable data cleaning during training setup.
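A sketch of the alignment idea, assuming the model is asked to return a JSON object with a `score` key (the key name and both helper functions are hypothetical). A failed parse yields `None` at the same position as its input, so scores stay aligned with the original data and the failure can later be replaced by the default score of 0.

```python
import json
from typing import List, Optional


def parse_score(raw: str) -> Optional[int]:
    """Return the parsed score, or None when the output is not usable JSON."""
    try:
        return int(json.loads(raw)["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None


def align_scores(parsed: List[Optional[int]], default: int = 0) -> List[int]:
    """Replace failed parses with a default score while keeping positions."""
    return [default if s is None else s for s in parsed]


raw_outputs = ['{"score": 4}', "not json", '{"score": 2}']
parsed = [parse_score(r) for r in raw_outputs]
print(parsed, align_scores(parsed))
```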

Copilot

Pull Request Overview

This release updates WeClone to version 0.3.02, introducing configurable "thinking" functionality for offline LLM cleaning and improving multimodal data handling. The update enhances the data processing pipeline by allowing more control over LLM behavior during dataset cleaning operations.

Added configurable enable_thinking flag to control LLM cleaning behavior with adjustable parameters
Improved handling of images and GIFs in QA processing with automatic scoring and conversion features
Refactored configuration models to use dynamic dataset naming and cleaner field management

Reviewed Changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 5 comments.

File	Description
weclone/utils/config_models.py	Added `enable_thinking` field and refactored config models with dynamic dataset naming
weclone/data/clean/strategies.py	Enhanced LLM cleaning to handle images automatically and pass thinking configuration
weclone/core/inference/offline_infer.py	Updated vLLM inference to support thinking mode and improved error handling
weclone/data/qa_generator.py	Added GIF-to-sticker conversion and improved CSV processing
weclone/train/train_sft.py	Removed image-based exclusion from LLM cleaning workflow
pyproject.toml	Updated dependencies and bumped version to 0.3.02

Copilot · 2025-08-17T07:13:58Z

-            prompt_value = prompt_template.invoke({"id": qa.id, "messages": messages_str.strip()})
-            inputs.append(prompt_value.to_string())
+            if qa.images:
+                qa.score = 6


The hardcoded score value of 6 should be defined as a named constant or configuration parameter to improve maintainability and make the scoring logic more transparent.

Suggested change

qa.score = 6

qa.score = self.IMAGE_SCORE

Copilot · 2025-08-17T07:13:59Z

+        chat_template_kwargs={"enable_thinking": enable_thinking},
    )  # type: ignore

+    del llm


Explicitly deleting the LLM object may not immediately free GPU memory. Consider using proper context management or calling torch.cuda.empty_cache() after deletion to ensure GPU memory is released.

Suggested change

del llm

del llm

torch.cuda.empty_cache()
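One way to package this suggestion is a small helper called right after `del llm`. This is a sketch rather than the code merged in the PR; the helper name `release_gpu_memory` is hypothetical. It runs the garbage collector first, because `del` only drops a reference, and it tolerates machines without PyTorch or without a CUDA device.

```python
import gc


def release_gpu_memory() -> bool:
    """Collect garbage, then ask PyTorch to release cached CUDA memory.

    Returns True if the CUDA cache was emptied, False otherwise.
    """
    gc.collect()  # make sure unreachable objects holding tensors are freed
    try:
        import torch
    except ImportError:
        return False  # PyTorch is not installed, so there is no cache to clear
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()  # returns unused cached blocks to the driver
    return True
```

Note that `torch.cuda.empty_cache()` only releases memory the caching allocator is no longer using; tensors that are still referenced elsewhere stay allocated.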

Copilot · 2025-08-17T07:13:59Z

        for i in df.index:
            if df.loc[i, "type_name"].lower() in ["文本", "text"]:
                continue
+            if df.loc[i, "src"].lower().endswith(".gif"):


Calling .lower() on every iteration is inefficient. Consider preprocessing the data or using a case-insensitive comparison method to improve performance when processing large datasets.

Adds a call to empty the CUDA cache immediately after the LLM object is deleted, ensuring prompt release of GPU memory. This helps prevent out-of-memory issues and improves resource utilization for subsequent operations.

xming521 added 14 commits July 17, 2025 22:21

refactor(data): converts QA item ID to string

533b0fa

performance : some improve

ff0015a

chore(tests): include PII test data file

a0f6d0e

Merge remote-tracking branch 'origin/master' into dev

a05bb7e

perf(config): reduce default lora rank

7e9886c

fix(data): classify gif as animated stickers

cc34bea

test(tests): update qwen3 test model to 0.6B

c89dc81

chore(deps): update core dependencies

cdef421

Merge branch 'dev' of https://github.com/xming521/WeClone into dev

8309c49

chore(config): Updates the project and configuration versions to 0.3.02.

1e8de85

xming521 requested a review from Copilot August 16, 2025 14:55

sourcery-ai Bot reviewed Aug 16, 2025

xming521 requested a review from Copilot August 17, 2025 07:12

Copilot AI reviewed Aug 17, 2025

perf(infer): clear cuda cache after inference

5dcaad0

xming521 merged commit a96996f into master Aug 17, 2025
2 checks passed
