Release v0.0.2

@fracpete released this 31 Oct 04:08
· 134 commits to main since this release
  • added text-stats filter
  • stream writers now also accept an iterable of data records, which improves throughput
  • text_utils.apply_max_length now breaks lines by simple whitespace splitting instead of searching for the nearest word boundary, which is much faster
  • fix: text_utils.remove_patterns no longer multiplies the number of generated lines when more than one pattern is used
  • added remove-patterns filter
  • pretrain and translation text writers now buffer records by default (-b, --buffer_size) to improve throughput
  • jsonlines writers for pair, pretrain and translation data are now stream writers
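
To illustrate two of the text_utils changes listed above, here is a minimal Python sketch. It is not the library's code: the function names wrap_on_whitespace and remove_all_patterns, their signatures, and the greedy packing strategy are assumptions chosen for illustration. The first function breaks a line by plain whitespace splitting; the second applies several regular expressions to each line in turn, so the output always has exactly one line per input line.

```python
import re
from typing import List


def wrap_on_whitespace(line: str, max_length: int) -> List[str]:
    """Hypothetical helper: split on whitespace and greedily pack words
    into lines of at most max_length characters."""
    if max_length <= 0:
        # Assumption: a non-positive limit means "do not wrap".
        return [line]
    result = []
    current = ""
    for word in line.split():
        if not current:
            current = word
        elif len(current) + 1 + len(word) <= max_length:
            current += " " + word
        else:
            result.append(current)
            current = word
    if current:
        result.append(current)
    return result


def remove_all_patterns(lines: List[str], patterns: List[str]) -> List[str]:
    """Hypothetical helper: apply every pattern to each line in sequence,
    emitting one output line per input line regardless of pattern count."""
    compiled = [re.compile(p) for p in patterns]
    result = []
    for line in lines:
        for pattern in compiled:
            line = pattern.sub("", line)
        result.append(line)
    return result


wrapped = wrap_on_whitespace("the quick brown fox", 10)
cleaned = remove_all_patterns(["a1b2", "c3"], [r"\d", r"b"])
```

In this sketch a single word longer than max_length is kept whole on its own line rather than being cut, which is one possible trade-off of splitting only on whitespace.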