
Add XLMRoberta in Embedding Train #10074

Merged
merged 4 commits into PaddlePaddle:develop on Mar 18, 2025

Conversation

jie-z-0607 (Contributor)

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
  • Add test cases to the tests folder. If there are Codecov issues, please add test cases first.

PR types

New features

PR changes

Models

Description

1. Add support for the XLMRoberta model in embedding training, enabling fine-tuning of bge-m3 and related models:
   1. Add the relevant model classes to the XLMRoberta modeling file (a rough sketch of such a wrapper appears after this list);
   2. Adjust the model selection and initialization code in the training script;
   3. Adjust the data construction code in the embedding dataset scripts;
   4. Add the supporting parameter/config files, etc.

2. Fix the issue where enabling recompute did not work correctly for the original XLMRoberta model.
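For item 1.1, a minimal sketch of what a sentence-embedding wrapper over XLMRoberta could look like is shown below. This is not the PR's actual implementation: the import paths for XLMRobertaModel/XLMRobertaConfig and the pooling details (CLS pooling with L2 normalization, as commonly used for bge-m3 dense retrieval) are assumptions for illustration.

import paddle.nn as nn
import paddle.nn.functional as F
from paddlenlp.transformers import XLMRobertaConfig, XLMRobertaModel  # assumed import path


class XLMRobertaSentenceEmbedding(nn.Layer):
    """Wrap the encoder and pool the [CLS] hidden state into a normalized sentence vector."""

    def __init__(self, config: XLMRobertaConfig):
        super().__init__()
        self.roberta = XLMRobertaModel(config)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state = outputs[0]            # [batch, seq_len, hidden], assuming tuple-style return
        cls_embedding = last_hidden_state[:, 0]   # CLS pooling
        return F.normalize(cls_embedding, axis=-1)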


paddle-bot bot commented Mar 11, 2025

Thanks for your contribution!

ZHUI previously approved these changes Mar 11, 2025

@ZHUI ZHUI (Collaborator) left a comment


LGTM

b = np.tril(np.ones([cur_len, cur_len]), 0)  # lower-triangular (causal) block for this sequence
input_mask_data[0, 0, offset : offset + cur_len, offset : offset + cur_len] = b
b = np.ones([cur_len])  # 1-D variant: mark the same span as valid (non-padding) tokens
input_mask_data[0, offset : offset + cur_len] = b
Collaborator

Please double-check that the data processing stays compatible here.

@@ -248,6 +253,7 @@ def main():
return_tensors="np",
return_attention_mask=not model_args.flash_mask,
pad_to_multiple_of=data_args.pad_to_multiple_of,
return_position_ids=False
Collaborator

Please make this configurable via a parameter instead.
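A hedged sketch of one way to address this: expose the hard-coded return_position_ids=False as a dataclass argument, in the HF/PaddleNLP argument-parser style used by the training scripts. The field name and default on DataArguments are illustrative assumptions, not the final API.

from dataclasses import dataclass, field


@dataclass
class DataArguments:
    pad_to_multiple_of: int = field(
        default=8, metadata={"help": "Pad sequence length to a multiple of this value."}
    )
    return_position_ids: bool = field(
        default=False,
        metadata={"help": "Whether the data collator should also return position_ids."},
    )

# The collator call from the diff above could then read the flag instead of a literal:
#     pad_to_multiple_of=data_args.pad_to_multiple_of,
#     return_position_ids=data_args.return_position_ids,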

Comment on lines +111 to +114
if isinstance(model_config, XLMRobertaConfig):
model_class = XLMRobertaSentenceEmbedding
elif isinstance(model_config, Qwen2Config):
model_class = Qwen2SentenceEmbedding
Collaborator

Later on we could consider adding an AutoModelForSentenceEmbedding.
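A rough sketch of that idea: replace the isinstance chain above with a config-to-class registry behind an auto class. The class name, registry contents, and error handling are assumptions for illustration; XLMRobertaSentenceEmbedding and Qwen2SentenceEmbedding are the classes already referenced in the diff above (their import path is assumed to be the training script's).

from paddlenlp.transformers import Qwen2Config, XLMRobertaConfig

# Config class -> sentence-embedding model class (extend as new backbones are added).
SENTENCE_EMBEDDING_MAPPING = {
    XLMRobertaConfig: XLMRobertaSentenceEmbedding,
    Qwen2Config: Qwen2SentenceEmbedding,
}


class AutoModelForSentenceEmbedding:
    @classmethod
    def class_from_config(cls, model_config):
        for config_cls, model_cls in SENTENCE_EMBEDDING_MAPPING.items():
            if isinstance(model_config, config_cls):
                return model_cls
        raise ValueError(f"No sentence-embedding model registered for {type(model_config).__name__}")

# Usage in the training script would then collapse to a single line:
# model_class = AutoModelForSentenceEmbedding.class_from_config(model_config)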

@ZHUI ZHUI (Collaborator) commented Mar 14, 2025

There is a problem with the unit tests.


codecov bot commented Mar 18, 2025

Codecov Report

Attention: Patch coverage is 43.54839% with 35 lines in your changes missing coverage. Please review.

Project coverage is 49.94%. Comparing base (595e74f) to head (a9080ee).
Report is 191 commits behind head on develop.

Files with missing lines Patch % Lines
paddlenlp/transformers/xlm_roberta/modeling.py 47.27% 29 Missing ⚠️
paddlenlp/data/data_collator.py 20.00% 4 Missing ⚠️
paddlenlp/datasets/embedding_dataset.py 0.00% 2 Missing ⚠️

❌ Your patch status has failed because the patch coverage (43.54%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (49.94%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #10074      +/-   ##
===========================================
- Coverage    52.11%   49.94%   -2.18%     
===========================================
  Files          730      757      +27     
  Lines       116557   122583    +6026     
===========================================
+ Hits         60744    61223     +479     
- Misses       55813    61360    +5547     

☔ View full report in Codecov by Sentry.

@ZHUI ZHUI merged commit 5475a8a into PaddlePaddle:develop Mar 18, 2025
9 of 13 checks passed