Improve data loading error message #55

Merged
sfc-gh-mwyatt merged 4 commits into main from mwyatt/data-error-messages
Feb 19, 2025
@sfc-gh-mwyatt sfc-gh-mwyatt commented Feb 19, 2025

With some data loading configurations, filtering can remove every record in the dataset during data loading. This leads to ambiguous error messages later. This PR adds checks that raise more descriptive errors when we encounter this problem.
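
As a rough illustration of the kind of check described above, here is a minimal sketch. It is not the code from this PR: the function name `filter_by_max_length`, the `input_ids` record field, and the error wording are all hypothetical and chosen only for the example.

```python
from typing import Dict, List

Record = Dict[str, List[int]]


def filter_by_max_length(records: List[Record], max_length: int) -> List[Record]:
    """Hypothetical filter step: keep records that fit within max_length tokens.

    Raises a descriptive ValueError as soon as the dataset becomes empty,
    instead of letting an empty dataset cause an ambiguous failure later.
    """
    if not records:
        # Nothing was loaded in the first place, so filtering is not to blame.
        raise ValueError("Dataset is empty before filtering; check the data source.")

    kept = [r for r in records if len(r["input_ids"]) <= max_length]

    if not kept:
        # Every record was filtered out: report the setting that caused it.
        shortest = min(len(r["input_ids"]) for r in records)
        raise ValueError(
            f"Empty dataset after filtering: no record fits max_length={max_length} "
            f"(the shortest record has {shortest} tokens)."
        )
    return kept
```

Raising at the filtering step keeps the error next to its cause, so the message can name the setting that emptied the dataset.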

@sfc-gh-sbekman sfc-gh-sbekman left a comment

That's very useful, Mike - thank you for working on this!

Would it be ok to add a hint to the operator as to why the dataset is empty when they know it certainly isn't empty?

e.g. perhaps adding something like "Empty dataset after filtering out data that doesn't match the set requirements, e.g. max_length is smaller than the length of the shortest record"

Perhaps it's too verbose and can be phrased more succinctly, but it'd help to explain it the way you did on Slack for me.

@sfc-gh-mwyatt
Collaborator Author

Sure, I think we could add these to the actual filtering step for SFTDataFactory. Let me make those changes now!

@sfc-gh-mwyatt sfc-gh-mwyatt changed the title from "Improve data source loading error message" to "Improve data loading error message" Feb 19, 2025
@sfc-gh-sbekman sfc-gh-sbekman left a comment

This is great now, thank you, Michael!

Comment thread arctic_training/data/sft_factory.py Outdated
@sfc-gh-mwyatt sfc-gh-mwyatt merged commit 67ba04e into main Feb 19, 2025
@sfc-gh-mwyatt sfc-gh-mwyatt deleted the mwyatt/data-error-messages branch February 19, 2025 19:54