v0.5.1, add support for reading xlsx files… by thammegowda · Pull Request #180 · thammegowda/mtdata

thammegowda · 2026-04-13T19:51:09Z

v0.5.1: New datasets, formats, and bug fixes

New Features

xlsx parser: Read Excel files via openpyxl (optional dependency). Handles Motorola's endangered language datasets, which are distributed as .xlsx files.
csv parser: Proper CSV reader using Python's csv module for quoted fields with embedded commas. Fixes column misalignment in csvwithheader format
--strict langpair ordering for mtdata list: mtdata list -l eng-deu --strict now returns only datasets stored in eng→deu order. Closes #157 (Allow strict langpair ordering)
HF data_files support: Load HuggingFace datasets by file path when config names aren't registered in the repo YAML (needed for SMOL smolsent/smoldoc)
Generic SSL fallback: Retry downloads without SSL verification when certificate validation fails (e.g., hostnames with underscores violating RFC 952)
Archive detection fix: Check entry.is_archive before treating files as zip/tar, so that .xlsx files (which are zip containers internally) are not misidentified as archives
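
Two of the items above can be illustrated with the standard library alone. The sketch below is not mtdata's code: the helper names, the column headers (en_text, th_text), and the stand-in zip payload are hypothetical and chosen only for illustration. It shows why splitting on commas misaligns columns when a quoted field contains a comma, and why content-based zip detection reports an .xlsx-style container as an archive.

```python
import csv
import io
import zipfile


def naive_split(line):
    """Split on every comma, ignoring quotes (the misalignment failure mode)."""
    return line.rstrip("\r\n").split(",")


def parse_csv(text):
    """Parse CSV text with the csv module, which honours quoted fields."""
    return list(csv.reader(io.StringIO(text, newline="")))


def fake_xlsx_bytes():
    """Build a tiny zip container standing in for an .xlsx file.

    Assumption: this is NOT a valid workbook, only a zip with one entry,
    which is enough to show how zip sniffing behaves.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
    return buf.getvalue()


# Hypothetical two-column sample with a comma inside a quoted field.
sample = 'en_text,th_text\n"Hello, world",สวัสดี\n'
rows = parse_csv(sample)

print(naive_split(sample.splitlines()[1]))  # three fields: columns misaligned
print(rows[1])                              # two fields: quoted comma preserved
print(zipfile.is_zipfile(io.BytesIO(fake_xlsx_bytes())))  # zip sniffing says "archive"
```

Because sniffing the bytes cannot tell a spreadsheet from a plain zip archive, an explicit per-entry flag (as the entry.is_archive check described above suggests) is one way to avoid unpacking .xlsx files by mistake.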

New Datasets

Motorola Language Revitalization (8 datasets): eng-lld (Ladin), eng-chr (Cherokee), eng-xnr (Kangri), eng-mri (Maori), eng-yrl (Nheengatu), eng-kgp (Kaingang), por-yrl, por-kgp
Google SMOL eng-lij: SmolSent (863 sentences), SmolDoc (825 sentences), GATITOS (4,332 lexicon entries)
ZurichNLP wmt24pp-rm (6 Romansh varieties): Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, Vallader (deu-roh test sets). Closes #178 (add wmt24pp-rm)
VistecAI scb-mt-en-th-2020 (12 CSV subsets): ~1M eng-tha sentence pairs. Closes #151 (Add Thai-English parallel corpus "scb-mt-en-th-2020")
TALPCo (8 language pairs): jpn × {eng, kor, mya, zsm, ind, jav, tha, vie}. Closes #152 (Add TALPCo)

Bug Fixes

Strip \r from outputs: Fix carriage returns leaking through in read_plain, read_tsv, and the echo command. Closes #173 (Remove \r from outputs)
Fix TSV two-file pairing: Two separate TSV files are now zipped (paired line by line) instead of being flattened
Single-column TSV: cols=(1,) now yields a scalar string instead of a one-element list
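
The three fixes above interact, so a small sketch may help. It is not mtdata's read_tsv: read_column and pair_files are hypothetical helpers, and the 0-based column index is an assumption made for this example. Each line has its trailing carriage return and newline stripped, a single selected column is returned as a plain string, and two files are paired with zip.

```python
def read_column(lines, col=0):
    """Yield one column per line as a plain string, without trailing \\r or \\n.

    Hypothetical helper: col is a 0-based index chosen for this sketch.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        yield fields[col]  # scalar string, not a one-element list


def pair_files(src_lines, tgt_lines):
    """Pair two single-column inputs line by line instead of flattening them."""
    return list(zip(read_column(src_lines), read_column(tgt_lines)))


# Lines with Windows-style endings, as they might arrive from a downloaded file.
src = ["Hello\r\n", "Good morning\r\n"]
tgt = ["Hallo\r\n", "Guten Morgen\r\n"]

pairs = pair_files(src, tgt)
print(pairs)
```

If each reader yielded a one-element list rather than a string, the pairs would come out as nested lists, which is the kind of mismatch the single-column fix avoids.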

Recipe Updates

wmt26-eng-lld: Added Motorola eng-lld parallel data
wmt26-eng-lij_Latn: Replaced NLLB (noisy) with SMOL smol_sent + smol_doc; added Tatoeba
wmt26-eng-jpn: Added TALPCo jpn-eng

Closes

#157, #173, #178, #151, #152

…151. Closes #152
- VistecAI scb-mt-en-th-2020: 12 CSV subsets with ~1M eng-tha sentence pairs
- TALPCo: jpn paired with eng, kor, mya, zsm, ind, jav, tha, vie
- Fix parser: TSV with 2 files now zips (pairs) instead of flattening
- Fix parser: single-col TSV yields scalar string for proper pairing

- Add read_csv() using Python csv module for proper quoted-field handling. Fixes VistecAI scb-mt column misalignment
- Support data_files in HF loader for datasets with unregistered configs (SMOL smolsent/smoldoc en_lij)
- Restore Google-smol_sent and Google-smol_doc entries for eng-lij_Latn

thammegowda added 5 commits April 13, 2026 19:47

chore: update version to 0.5.1, add support for reading xlsx files, a…

92e818d

…nd include Motorola datasets

strip \r from outputs; add wmt24pp-rm dataset

9f340c9

- Strip \r from parser outputs (read_plain, read_tsv) and echo command. Closes #173
- Add ZurichNLP/wmt24pp-rm: 6 Romansh varieties (deu-roh) test sets. Closes #178

add --strict flag for langpair ordering in mtdata list. Closes #157

d35b862

thammegowda merged commit cc92769 into main Apr 13, 2026
22 checks passed

thammegowda deleted the develop branch April 13, 2026 21:09

thammegowda merged 5 commits into main from develop