v0.5.1, add support for reading xlsx files…#180
Merged
thammegowda merged 5 commits into main, Apr 13, 2026
…nd include Motorola datasets
- Add read_csv() using Python csv module for proper quoted-field handling. Fixes VistecAI scb-mt column misalignment
- Support data_files in HF loader for datasets with unregistered configs (SMOL smolsent/smoldoc en_lij)
- Restore Google-smol_sent and Google-smol_doc entries for eng-lij_Latn
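For reviewers unfamiliar with the misalignment, the sketch below shows how splitting on commas breaks a quoted field while Python's standard `csv` module keeps it intact. The helper name `read_csv_rows` and the sample row are hypothetical illustrations, not the `read_csv()` added in this PR.

```python
import csv
import io


def read_csv_rows(stream, delimiter=","):
    """Yield rows from a text stream, honoring quoted fields (hypothetical helper)."""
    yield from csv.reader(stream, delimiter=delimiter)


# Made-up sample: the second column contains an embedded comma inside quotes.
sample = 'id,en,th\n1,"Hello, world",สวัสดี\n'

# Naive approach: split every line on commas, ignoring the quoting.
naive = [line.split(",") for line in sample.splitlines()]

# csv-module approach: quoted fields stay in one column.
parsed = list(read_csv_rows(io.StringIO(sample, newline="")))

print(len(naive[1]), naive[1])    # the naive split yields one column too many
print(len(parsed[1]), parsed[1])  # the csv reader yields the expected three columns
```

With the naive split, every column after the quoted field shifts by one, which is the kind of misalignment the commit message describes.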
v0.5.1: New datasets, formats, and bug fixes
New Features
- `xlsx` parser: Read Excel files via `openpyxl` (optional dependency). Handles Motorola's endangered language datasets, which are distributed as `.xlsx` files
- `csv` parser: Proper CSV reader using Python's `csv` module for quoted fields with embedded commas. Fixes column misalignment in the `csvwithheader` format
- `--strict` langpair ordering for `mtdata list`: `mtdata list -l eng-deu --strict` now only returns datasets stored in eng→deu order. Closes #157 (Allow strict langpair ordering)
- `data_files` support: Load HuggingFace datasets by file path when config names aren't registered in the repo YAML (needed for SMOL smolsent/smoldoc)
- `entry.is_archive` check before treating files as zip/tar, preventing `.xlsx` files (which are internally zip-based) from being misidentified

New Datasets
Bug Fixes
- Remove `\r` from outputs: Fix carriage returns leaking through `read_plain`, `read_tsv`, and the `echo` command. Closes #173 (Remove `\r` from outputs)
- `cols=(1,)` now yields a scalar string instead of a one-element list

Recipe Updates
- `wmt26-eng-lld`: Added Motorola eng-lld parallel data
- `wmt26-eng-lij_Latn`: Replaced NLLB (noisy) with SMOL smol_sent + smol_doc; added Tatoeba
- `wmt26-eng-jpn`: Added TALPCo jpn-eng

Closes
#157, #173, #178, #151, #152
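
A note on the `entry.is_archive` check listed under New Features: the sketch below shows why content sniffing alone can misidentify a workbook. It builds a tiny zip container in memory as a stand-in for an `.xlsx` file and shows that `zipfile.is_zipfile` accepts it. The `handling` function is a hypothetical guard for illustration only; it is not mtdata's code, and the `is_archive` parameter merely mirrors the flag named in this PR.

```python
import io
import zipfile

# Stand-in for an .xlsx file: a zip container holding one XML part.
# (Assumption: a real workbook has more parts, but it is a zip package all the same.)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")

# Sniffing the bytes alone reports "this is a zip archive".
looks_like_zip = zipfile.is_zipfile(buf)


def handling(is_archive: bool, data) -> str:
    """Hypothetical guard: consult the entry's declared flag before sniffing bytes."""
    if is_archive and zipfile.is_zipfile(data):
        return "archive"
    return "plain file"


print(looks_like_zip)          # the container passes the zip check
print(handling(False, buf))    # entry not declared as an archive: read as a plain file
print(handling(True, buf))     # entry declared as an archive: unpack it
```

The design point is that the dataset entry's own declaration decides whether to unpack, so a zip-based format such as `.xlsx` reaches its dedicated parser.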