
Add XLMRoberta in Embedding Train #10074

Merged
merged 4 commits into PaddlePaddle:develop on Mar 18, 2025

Conversation

jie-z-0607 (Contributor)

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
  • Add test cases to the tests folder. If there are Codecov issues, please add test cases first.

PR types

New features

PR changes

Models

Description

1. Add support for the XLMRoberta model in embedding training, enabling fine-tuning of bge-m3 and related models:
   1. Add the relevant model classes to the XLMRoberta modeling file (a rough sketch of such a wrapper appears after this list);
   2. Adjust the model selection and initialization code in the training script;
   3. Adjust the data construction code in the embedding dataset scripts;
   4. Add the supporting parameter/config files, etc.

2. Fix the issue where enabling recompute did not work correctly for the original XLMRoberta model.
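For item 1.1, a minimal sketch of what a sentence-embedding wrapper over XLMRoberta could look like is shown below. This is not the PR's actual implementation: the import paths for XLMRobertaModel/XLMRobertaConfig and the pooling details (CLS pooling with L2 normalization, as commonly used for bge-m3 dense retrieval) are assumptions for illustration.

import paddle.nn as nn
import paddle.nn.functional as F
from paddlenlp.transformers import XLMRobertaConfig, XLMRobertaModel  # assumed import path


class XLMRobertaSentenceEmbedding(nn.Layer):
    """Wrap the encoder and pool the [CLS] hidden state into a normalized sentence vector."""

    def __init__(self, config: XLMRobertaConfig):
        super().__init__()
        self.roberta = XLMRobertaModel(config)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state = outputs[0]            # [batch, seq_len, hidden], assuming tuple-style return
        cls_embedding = last_hidden_state[:, 0]   # CLS pooling
        return F.normalize(cls_embedding, axis=-1)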


paddle-bot bot commented Mar 11, 2025

Thanks for your contribution!

ZHUI previously approved these changes Mar 11, 2025

@ZHUI ZHUI (Collaborator) left a comment


LGTM

b = np.tril(np.ones([cur_len, cur_len]), 0)  # lower-triangular (causal) block for this sequence
input_mask_data[0, 0, offset : offset + cur_len, offset : offset + cur_len] = b
b = np.ones([cur_len])  # 1-D variant: mark the same span as valid (non-padding) tokens
input_mask_data[0, offset : offset + cur_len] = b
Collaborator

Please double-check that the data processing stays compatible here.

@@ -248,6 +253,7 @@ def main():
return_tensors="np",
return_attention_mask=not model_args.flash_mask,
pad_to_multiple_of=data_args.pad_to_multiple_of,
return_position_ids=False
Collaborator

Please make this configurable via a parameter instead.
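A hedged sketch of one way to address this: expose the hard-coded return_position_ids=False as a dataclass argument, in the HF/PaddleNLP argument-parser style used by the training scripts. The field name and default on DataArguments are illustrative assumptions, not the final API.

from dataclasses import dataclass, field


@dataclass
class DataArguments:
    pad_to_multiple_of: int = field(
        default=8, metadata={"help": "Pad sequence length to a multiple of this value."}
    )
    return_position_ids: bool = field(
        default=False,
        metadata={"help": "Whether the data collator should also return position_ids."},
    )

# The collator call from the diff above could then read the flag instead of a literal:
#     pad_to_multiple_of=data_args.pad_to_multiple_of,
#     return_position_ids=data_args.return_position_ids,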

Comment on lines +111 to +114
if isinstance(model_config, XLMRobertaConfig):
model_class = XLMRobertaSentenceEmbedding
elif isinstance(model_config, Qwen2Config):
model_class = Qwen2SentenceEmbedding
Collaborator

Later on we could consider adding an AutoModelForSentenceEmbedding.
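A rough sketch of that idea: replace the isinstance chain above with a config-to-class registry behind an auto class. The class name, registry contents, and error handling are assumptions for illustration; XLMRobertaSentenceEmbedding and Qwen2SentenceEmbedding are the classes already referenced in the diff above (their import path is assumed to be the training script's).

from paddlenlp.transformers import Qwen2Config, XLMRobertaConfig

# Config class -> sentence-embedding model class (extend as new backbones are added).
SENTENCE_EMBEDDING_MAPPING = {
    XLMRobertaConfig: XLMRobertaSentenceEmbedding,
    Qwen2Config: Qwen2SentenceEmbedding,
}


class AutoModelForSentenceEmbedding:
    @classmethod
    def class_from_config(cls, model_config):
        for config_cls, model_cls in SENTENCE_EMBEDDING_MAPPING.items():
            if isinstance(model_config, config_cls):
                return model_cls
        raise ValueError(f"No sentence-embedding model registered for {type(model_config).__name__}")

# Usage in the training script would then collapse to a single line:
# model_class = AutoModelForSentenceEmbedding.class_from_config(model_config)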

@ZHUI ZHUI (Collaborator) commented Mar 14, 2025

There is a problem with the unit tests.


codecov bot commented Mar 18, 2025

Codecov Report

Attention: Patch coverage is 43.54839% with 35 lines in your changes missing coverage. Please review.

Project coverage is 49.94%. Comparing base (595e74f) to head (a9080ee).
Report is 191 commits behind head on develop.

Files with missing lines Patch % Lines
paddlenlp/transformers/xlm_roberta/modeling.py 47.27% 29 Missing ⚠️
paddlenlp/data/data_collator.py 20.00% 4 Missing ⚠️
paddlenlp/datasets/embedding_dataset.py 0.00% 2 Missing ⚠️

❌ Your patch status has failed because the patch coverage (43.54%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (49.94%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #10074      +/-   ##
===========================================
- Coverage    52.11%   49.94%   -2.18%     
===========================================
  Files          730      757      +27     
  Lines       116557   122583    +6026     
===========================================
+ Hits         60744    61223     +479     
- Misses       55813    61360    +5547     

☔ View full report in Codecov by Sentry.

@ZHUI ZHUI merged commit 5475a8a into PaddlePaddle:develop Mar 18, 2025
9 of 13 checks passed