Release v0.0.5

@fracpete fracpete released this 23 Jan 23:26
  • added flag -b/--force_batch to the llm-convert tool, which forces all data to be read from the reader before it is filtered and passed to the writer; useful for batch filters.
  • added the randomize-records batch filter
  • added the --encoding ENC option to file readers
  • auto-determined encoding is now being logged (INFO level)
  • the LDC_ENCODING_MAX_CHECK_LENGTH environment variable allows overriding the default number of bytes used for determining the file encoding in auto-detect mode
  • the default maximum number of bytes inspected for determining the file encoding is now 10 KB
  • method locate_files in base_io no longer includes directories when expanding globs
  • added tool llm-file-encoding for determining file encodings of text files
  • added method replace_extension to the base_io module for changing a file's extension (removes any supported compression suffix first)
  • stream writers (.jsonl/.txt) now work with --force_batch mode; the output file name gets automatically generated from the input file name when just using a directory for the output
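
The last two items can be illustrated together: changing a file's extension after stripping a compression suffix, and using that to derive an output file name when only an output directory is given. The following is a minimal sketch under stated assumptions, not the library's implementation. The function names `replace_extension_sketch` and `derive_output_path`, and the list of compression suffixes, are hypothetical and chosen only for illustration; the actual `replace_extension` method in `base_io` may have a different signature and support a different set of suffixes.

```python
import os

# Assumed set of compression suffixes (hypothetical; the library's
# actual list of supported suffixes may differ).
COMPRESSION_SUFFIXES = (".gz", ".bz2", ".xz", ".zst")


def replace_extension_sketch(path: str, new_ext: str) -> str:
    """Strip one trailing compression suffix (if present), then swap the extension."""
    for suffix in COMPRESSION_SUFFIXES:
        if path.lower().endswith(suffix):
            path = path[: -len(suffix)]
            break
    # splitext only splits off the last extension, e.g. ".json"
    base, _ = os.path.splitext(path)
    return base + new_ext


def derive_output_path(input_path: str, output_dir: str, new_ext: str) -> str:
    """Build an output path inside output_dir from the input file's name."""
    name = os.path.basename(input_path)
    return os.path.join(output_dir, replace_extension_sketch(name, new_ext))
```

For example, `replace_extension_sketch("data/train.json.gz", ".jsonl")` returns `"data/train.jsonl"`, and `derive_output_path("in/train.json.gz", "out", ".jsonl")` places a file named `train.jsonl` inside `out`.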