added flag -b/--force_batch to the llm-convert tool, which forces all data to be read from the reader before filtering it and then passing it to the writer; useful for batch filters.
added the randomize-records batch filter
added the --encoding ENC option to file readers
auto-determined encoding is now being logged (INFO level)
the LDC_ENCODING_MAX_CHECK_LENGTH environment variable allows overriding the default number of bytes used for determining the file encoding in auto-detect mode
default max number of bytes inspected for determining file encoding is now 10kb
method locate_files in base_io no longer includes directories when expanding globs
added tool llm-file-encoding for determining file encodings of text files
added method replace_extension to base_io module for changing a file's extension (removes any supported compression suffix first)
stream writers (.jsonl/.txt) now work with --force_batch mode; the output file name gets automatically generated from the input file name when just using a directory for the output
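
The behavior described for replace_extension (strip a compression suffix, then swap the extension) can be illustrated with a minimal sketch. This is not the library's implementation: the function name `replace_extension_sketch` and the list of compression suffixes are assumptions made for the example, and the real method's signature and supported suffixes may differ.

```python
import os

# Assumed set of compression suffixes; the library's actual list may differ.
COMPRESSION_SUFFIXES = (".gz", ".bz2", ".xz", ".zst")


def replace_extension_sketch(path: str, new_ext: str) -> str:
    """Hypothetical stand-in for replace_extension, not the library's code."""
    # Drop a trailing compression suffix first, e.g. "train.jsonl.gz" -> "train.jsonl".
    lower = path.lower()
    for suffix in COMPRESSION_SUFFIXES:
        if lower.endswith(suffix):
            path = path[: -len(suffix)]
            break
    # Swap whatever extension remains for the new one.
    base, _ = os.path.splitext(path)
    return base + new_ext


print(replace_extension_sketch("data/train.jsonl.gz", ".txt"))
```

Under these assumptions, a compressed input such as `data/train.jsonl.gz` maps to an uncompressed output name with the new extension, which is the kind of name derivation the --force_batch stream writers need when only an output directory is given.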