
v0.5.1, add support for reading xlsx files…#180

Merged
thammegowda merged 5 commits into main from develop
Apr 13, 2026

@thammegowda commented Apr 13, 2026

v0.5.1: New datasets, formats, and bug fixes

New Features

  • xlsx parser: Read Excel files via openpyxl (optional dependency). Handles Motorola's endangered-language datasets, which are distributed as .xlsx files
  • csv parser: Proper CSV reader using Python's csv module for quoted fields with embedded commas. Fixes column misalignment in csvwithheader format
  • --strict langpair ordering for mtdata list: mtdata list -l eng-deu --strict now returns only datasets stored in eng→deu order. Closes #157 (Allow strict langpair ordering)
  • HF data_files support: Load HuggingFace datasets by file path when config names aren't registered in the repo YAML (needed for SMOL smolsent/smoldoc)
  • Generic SSL fallback: Retry downloads without SSL verification when certificate validation fails (e.g., hostnames with underscores violating RFC 952)
  • Archive detection fix: Check entry.is_archive before treating files as zip/tar, which prevents .xlsx files (internally zip-based) from being misidentified as archives
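
To illustrate why the archive check matters, here is a minimal sketch using only the standard library. It builds a tiny zip container that mimics the outer layout of an .xlsx file and shows that content sniffing alone reports it as a zip. The helper `should_extract` and its `is_archive` parameter are hypothetical stand-ins for the `entry.is_archive` check named above, not mtdata's actual code.

```python
import io
import zipfile


def make_fake_xlsx() -> bytes:
    """Build a tiny zip container that mimics the outer layout of an .xlsx file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("xl/workbook.xml", "<workbook/>")
    return buf.getvalue()


def should_extract(data: bytes, is_archive: bool) -> bool:
    """Hypothetical guard: unpack only when the entry is declared an archive."""
    return is_archive and zipfile.is_zipfile(io.BytesIO(data))


xlsx_bytes = make_fake_xlsx()

# Content sniffing alone cannot tell a spreadsheet from a real archive.
looks_like_zip = zipfile.is_zipfile(io.BytesIO(xlsx_bytes))

# With the explicit flag, the spreadsheet is passed to the parser untouched.
extract_spreadsheet = should_extract(xlsx_bytes, is_archive=False)
extract_archive = should_extract(xlsx_bytes, is_archive=True)
```

Relying on a declared flag rather than on file contents keeps the decision with the dataset entry, which knows whether the download is a container or a data file.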

New Datasets

  • Motorola Language Revitalization (8 datasets): eng-lld (Ladin), eng-chr (Cherokee), eng-xnr (Kangri), eng-mri (Maori), eng-yrl (Nheengatu), eng-kgp (Kaingang), por-yrl, por-kgp
  • Google SMOL eng-lij: SmolSent (863 sentences), SmolDoc (825 sentences), GATITOS (4,332 lexicon entries)
  • ZurichNLP wmt24pp-rm (6 Romansh varieties): Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, Vallader (deu-roh test sets). Closes #178 (add wmt24pp-rm)
  • VistecAI scb-mt-en-th-2020 (12 CSV subsets): ~1M eng-tha sentence pairs. Closes #151 (Add Thai-English parallel corpus "scb-mt-en-th-2020")
  • TALPCo (8 language pairs): jpn × {eng, kor, mya, zsm, ind, jav, tha, vie}. Closes #152 (Add TALPCo)

Bug Fixes

  • Strip \r from outputs: Fix carriage returns leaking through read_plain, read_tsv, and the echo command. Closes #173 (Remove \r from outputs)
  • Fix TSV two-file pairing: Two separate TSV files now zip (pair) correctly instead of flattening
  • Single-column TSV: cols=(1,) now yields a scalar string instead of a one-element list
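
The three fixes above interact, so a small sketch may help. It assumes a hypothetical reader `read_cols` (not mtdata's read_tsv) with 0-based column indices, strips trailing `\r\n` from each line, yields a scalar when only one column is requested, and pairs two files with `zip` rather than concatenating them.

```python
from typing import Iterable, Iterator, List, Sequence, Union


def read_cols(lines: Iterable[str], cols: Sequence[int]) -> Iterator[Union[str, List[str]]]:
    """Hypothetical TSV reader: pick columns (0-based) from each line."""
    for line in lines:
        # Drop the line terminator, including a stray carriage return.
        fields = line.rstrip("\r\n").split("\t")
        picked = [fields[i] for i in cols]
        # A single requested column yields a plain string, not a 1-item list.
        yield picked[0] if len(picked) == 1 else picked


# Two separate TSV "files" with Windows line endings (made-up sample data).
src_lines = ["Hello\tmeta1\r\n", "Goodbye\tmeta2\r\n"]
tgt_lines = ["id1\tHallo\r\n", "id2\tAuf Wiedersehen\r\n"]

# Pair the i-th source record with the i-th target record.
pairs = list(zip(read_cols(src_lines, (0,)), read_cols(tgt_lines, (1,))))
```

Because each reader yields scalars, `zip` produces clean (source, target) tuples; with one-element lists the pairs would be nested and need extra unwrapping downstream.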

Recipe Updates

  • wmt26-eng-lld: Added Motorola eng-lld parallel data
  • wmt26-eng-lij_Latn: Replaced NLLB (noisy) with SMOL smol_sent + smol_doc; added Tatoeba
  • wmt26-eng-jpn: Added TALPCo jpn-eng

Closes

#157, #173, #178, #151, #152

- Strip \r from parser outputs (read_plain, read_tsv) and echo command. Closes #173
- Add ZurichNLP/wmt24pp-rm: 6 Romansh varieties (deu-roh) test sets. Closes #178
…151. Closes #152

- VistecAI scb-mt-en-th-2020: 12 CSV subsets with ~1M eng-tha sentence pairs
- TALPCo: jpn paired with eng, kor, mya, zsm, ind, jav, tha, vie
- Fix parser: TSV with 2 files now zips (pairs) instead of flattening
- Fix parser: single-col TSV yields scalar string for proper pairing
- Add read_csv() using Python csv module for proper quoted-field handling. Fixes VistecAI scb-mt column misalignment
- Support data_files in HF loader for datasets with unregistered configs (SMOL smolsent/smoldoc en_lij)
- Restore Google-smol_sent and Google-smol_doc entries for eng-lij_Latn
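
The column misalignment that read_csv() addresses can be reproduced with a short sketch. The sample row and column names below are made up for illustration and are not taken from scb-mt-en-th-2020; only Python's standard csv module is used.

```python
import csv
import io

# A header plus one record whose first field contains an embedded comma.
raw = 'en_text,th_text\r\n"Hello, world",sawatdi\r\n'

# Naive splitting on commas breaks the quoted field into two columns.
naive_rows = [line.split(",") for line in raw.splitlines()]

# csv.reader honours the quotes and keeps the field intact.
# newline="" leaves line endings for the csv module to interpret.
csv_rows = list(csv.reader(io.StringIO(raw, newline="")))

header, first = csv_rows[0], csv_rows[1]
```

Here the naive split produces three fields for the data row while the header has two, which is the kind of shift that puts text in the wrong column.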
@thammegowda merged commit cc92769 into main Apr 13, 2026
22 checks passed
@thammegowda deleted the develop branch April 13, 2026 21:09