Extra BOM in CSV file, hledger reports an error #2189

Open
PSLLSP opened this issue Mar 29, 2024 · 6 comments
Labels
A-WISH Some kind of improvement request, hare-brained proposal, or plea. csv The csv file format, csv output format, or generally CSV-related. docs Documentation-related. i18n Internationalisation/localisation-related.

Comments

@PSLLSP
PSLLSP commented Mar 29, 2024

hledger 1.32.3, linux

I have a CSV file in UTF-8 that starts with a BOM (U+FEFF, shown as <feff>).

When I join several such files into one with cat test-bom-*.csv > test-bom.csv, the resulting file contains several BOM characters.
hledger does not accept those extra BOM characters and reports an error:

$ hledger -f test-bom.csv print
hledger: error: could not parse "2024-01-02" as a date using date format "%Y-%m-%d"
the CSV record is:  "\65279\&2024-01-02", "0.2", "test 2"
the date rule is:   %1
the date-format is: %Y-%m-%d
you may need to change your date rule, change your date-format rule, or add a skip rule
for m/d/y or d/m/y dates, use date-format %-m/%-d/%Y or date-format %-d/%-m/%Y

I am not sure, but I do not think it is wrong for a UTF-8 file to contain several BOMs; the other utilities I tried did not fail on it. In theory, the encoding of a file could even change in the middle, for example from UTF-8 to UTF-16LE...

How to reproduce: prepare test data, several simple CSV files with and without a BOM:

for I in 1 2 3; do
  echo -e "2024-01-0${I},0.${I},test ${I}" > "test-nobom-${I}.csv"
  echo -e "\xef\xbb\xbf2024-01-0${I},0.${I},test ${I}" > "test-bom-${I}.csv"
done
cat test-nobom-[123].csv > test-nobom.csv
cat test-bom-[123].csv > test-bom.csv

The files test-bom.csv and test-nobom.csv look the same when printed, but they differ in size:

$ cat test-bom.csv
2024-01-01,0.1,test 1
2024-01-02,0.2,test 2
2024-01-03,0.3,test 3

$ ls -l test-nobom.csv test-bom.csv
-rw-rw-r-- 1 user user 75 Mar 29 17:59 test-bom.csv
-rw-rw-r-- 1 user user 66 Mar 29 17:59 test-nobom.csv
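
The 9-byte difference is exactly three UTF-8 BOMs of 3 bytes each. A minimal Python sketch of that arithmetic follows; it rebuilds the contents of the two files in memory instead of reading them from disk, so the variable names are illustrative only:

```python
import codecs

# Rebuild the contents of the two concatenated files from the report in memory.
rows = [f"2024-01-0{i},0.{i},test {i}\n".encode("utf-8") for i in (1, 2, 3)]
no_bom = b"".join(rows)                                      # like test-nobom.csv
with_bom = b"".join(codecs.BOM_UTF8 + row for row in rows)   # like test-bom.csv

# Each UTF-8 BOM is the 3-byte sequence EF BB BF.
bom_count = with_bom.count(codecs.BOM_UTF8)
print(len(no_bom), len(with_bom), bom_count)  # 66 75 3
```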

grep is "confused" by the BOM:

$ grep ^2024 test-nobom.csv
2024-01-01,0.1,test 1
2024-01-02,0.2,test 2
2024-01-03,0.3,test 3

$ grep ^2024 test-bom.csv

$ grep ^.2024 test-bom.csv
2024-01-01,0.1,test 1
2024-01-02,0.2,test 2
2024-01-03,0.3,test 3
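
The same effect can be reproduced with Python's re module, as a rough analogy: the BOM is an ordinary (invisible) character at the start of each line, so an anchored pattern must account for it. This sketch assumes grep runs in a UTF-8 locale, where "." matches the whole three-byte BOM as one character:

```python
import re

# First line of test-bom.csv as a pattern matcher sees it:
# the BOM is a real character in front of the date.
line = "\ufeff2024-01-01,0.1,test 1"

anchored = re.search(r"^2024", line)    # no match: the line starts with U+FEFF
with_dot = re.search(r"^.2024", line)   # match: "." consumes the BOM
print(anchored is None, with_dot is not None)  # True True
```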

Create the import rules. They are identical for all three files: I created test-bom.csv.rules and then ran ln -s test-bom.csv.rules test-nobom.csv.rules and ln -s test-bom.csv.rules test-bom-1.csv.rules:

$ cat test-bom.csv.rules 

fields      date,amount,description
date-format %Y-%m-%d
$ cat test-nobom.csv.rules 

fields      date,amount,description
date-format %Y-%m-%d
$ cat test-bom-1.csv.rules 

fields      date,amount,description
date-format %Y-%m-%d

TEST

hledger can import a CSV file with a single leading BOM and a file without a BOM:

$ hledger -f test-bom-1.csv bal
                -0.1  income:unknown
                 0.1  unknown
--------------------
                   0

$ hledger -f test-nobom.csv bal
                -0.6  income:unknown
                 0.6  unknown
--------------------
                   0

hledger rejects the file with several BOMs:

$ hledger -f test-bom.csv bal
hledger: error: could not parse "2024-01-02" as a date using date format "%Y-%m-%d"
the CSV record is:  "\65279\&2024-01-02", "0.2", "test 2"
the date rule is:   %1
the date-format is: %Y-%m-%d
you may need to change your date rule, change your date-format rule, or add a skip rule
for m/d/y or d/m/y dates, use date-format %-m/%-d/%Y or date-format %-d/%-m/%Y
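
The "\65279" in the error message is U+FEFF written in decimal (the "\&" after it is Haskell's empty escape, which keeps the following digits from being read as part of the number). The pattern of "first row parses, second row fails" can be imitated in Python, whose "utf-8-sig" codec also removes only one BOM at the start of the stream. This is an analogy under that assumption, not hledger's actual decoding code, which this report does not show:

```python
from datetime import datetime

# Contents of test-bom.csv: every row is preceded by a UTF-8 BOM.
data = b"".join(
    b"\xef\xbb\xbf" + f"2024-01-0{i},0.{i},test {i}\n".encode("utf-8")
    for i in (1, 2, 3)
)

# "utf-8-sig" strips one BOM at the very start of the stream, nothing more.
lines = data.decode("utf-8-sig").split("\n")
first_date = lines[0].split(",")[0]
second_date = lines[1].split(",")[0]

print(repr(first_date))     # '2024-01-01'
print(repr(second_date))    # '\ufeff2024-01-02'
print(ord(second_date[0]))  # 65279

datetime.strptime(first_date, "%Y-%m-%d")  # parses without error
try:
    datetime.strptime(second_date, "%Y-%m-%d")
    second_parses = True
except ValueError:
    # The invisible U+FEFF in front of the date makes the format mismatch.
    second_parses = False
print(second_parses)  # False
```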
@simonmichael simonmichael added A-WISH Some kind of improvement request, hare-brained proposal, or plea. csv The csv file format, csv output format, or generally CSV-related. i18n Internationalisation/localisation-related. labels Mar 29, 2024
@simonmichael
Owner

simonmichael commented Mar 30, 2024

That's very clear! Thank you.

I also found:

We do want hledger to just work on real world data where possible, so we should be permissive where it doesn't add complications. But I'm not sure if we need to go as far as ignoring BOMs appearing anywhere in the input. It seems like an unusual niche case, and one that's easy to solve with preprocessing. Is it really valid for files to change encoding in the middle? I can't imagine many tools that would handle that properly.
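
As one possible shape for such preprocessing, here is a minimal Python sketch that removes every UTF-8 BOM from a file before it is handed to hledger. The function names strip_utf8_boms and clean_file are hypothetical, and the sketch assumes the input really is UTF-8:

```python
import codecs

def strip_utf8_boms(data: bytes) -> bytes:
    """Remove every UTF-8 BOM (bytes EF BB BF) from UTF-8 encoded data."""
    # In valid UTF-8 this byte sequence can only be an encoded U+FEFF,
    # so a byte-level replace cannot damage any other character.
    return data.replace(codecs.BOM_UTF8, b"")

def clean_file(src: str, dst: str) -> None:
    """Hypothetical helper: write a BOM-free copy of src to dst."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(strip_utf8_boms(data))

# In-memory demonstration with two rows that each carry a BOM.
sample = b"\xef\xbb\xbf2024-01-01,0.1,test 1\n\xef\xbb\xbf2024-01-02,0.2,test 2\n"
cleaned = strip_utf8_boms(sample)
print(cleaned)
```

For the files in this report, clean_file("test-bom.csv", "test-clean.csv") would produce a copy that can be read with hledger -f test-clean.csv print, given a matching test-clean.csv.rules file.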

@simonmichael
Owner

Our BOM handling should be mentioned at https://hledger.org/dev/hledger.html#text-encoding .

@simonmichael simonmichael added the docs Documentation-related. label Mar 30, 2024
@simonmichael
Owner

simonmichael commented Mar 30, 2024

Related, https://www.unicode.org/faq/utf_bom.html#BOM says:

Q: What should I do with U+FEFF in the middle of a file?

  • In the absence of a protocol supporting its use as a BOM and when not at the beginning of a text stream, U+FEFF should normally not occur.
  • For backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the file or string.
  • When designing a markup language or data protocol, the use of U+FEFF can be restricted to that of Byte Order Mark. In that case, any U+FEFF occurring in the middle of a file can be treated as an unsupported character.
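
The second and third bullets describe two different policies for an interior U+FEFF: keep it as content, or reject it as unsupported. A small Python sketch of both, using hypothetical helper names (split_bom, reject_interior_bom) and making no claim about which policy hledger should adopt:

```python
BOM = "\ufeff"

def split_bom(text: str):
    """Drop one leading U+FEFF and report where any others remain."""
    if text.startswith(BOM):
        text = text[1:]
    # Policy "ZWNBSP": the remaining U+FEFF characters stay in the content;
    # their character offsets are returned so the caller can decide.
    interior = [i for i, ch in enumerate(text) if ch == BOM]
    return text, interior

def reject_interior_bom(text: str) -> str:
    """Policy "unsupported character": fail if U+FEFF occurs mid-stream."""
    body, interior = split_bom(text)
    if interior:
        raise ValueError(f"unsupported U+FEFF at character offsets {interior}")
    return body

body, interior = split_bom("\ufeffa\n\ufeffb\n")
print(repr(body), interior)
```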

@PSLLSP
Author

PSLLSP commented Mar 30, 2024

The BOM is a troublemaker... ;-) We use extended ASCII here, and banks used to produce CSV files in CP-1250. Some of them upgraded their software and moved to UTF-8, and I believe that is why they now emit UTF-8 files with a BOM: to signal clearly that the CSV file is UTF-8 and not CP-1250.

It is possible to create a file that starts with a UTF-8 BOM and has a UTF-16LE BOM in the middle: just join a UTF-8 file with a UTF-16LE file. But the result is invalid, because the BOM is a single code point (U+FEFF) that is encoded differently in each UTF. I thought it might be possible to start in UTF-8 and use a BOM in the middle of the file to switch the encoding to UTF-16LE, but that does not work, because the UTF-16LE BOM (bytes FF FE) is an invalid sequence in UTF-8... Software could handle it in principle, but it would have to inspect each decoding error and test whether the offending bytes are the BOM of another UTF variant... The good news is that UTF-16LE files are rare; UTF-8 is used in most cases.
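
This can be checked directly. The Python sketch below encodes U+FEFF in three UTFs and then tries to decode a UTF-8 file joined with a UTF-16LE file as UTF-8; the sample rows "a,b" and "c,d" are made up for illustration:

```python
BOM = "\ufeff"

# One code point, three different byte sequences.
utf8_bom = BOM.encode("utf-8")         # b'\xef\xbb\xbf'
utf16le_bom = BOM.encode("utf-16-le")  # b'\xff\xfe'
utf16be_bom = BOM.encode("utf-16-be")  # b'\xfe\xff'

# A UTF-8 file joined with a UTF-16LE file, as described above.
mixed = (
    utf8_bom + "a,b\n".encode("utf-8")
    + utf16le_bom + "c,d\n".encode("utf-16-le")
)

bad_offset = None
try:
    mixed.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    bad_offset = err.start  # index of the first byte that is not valid UTF-8

print(decoded_ok, bad_offset)  # False 7
```

The decoder stops at byte offset 7, which is the FF of the UTF-16LE BOM: the byte FF never appears in valid UTF-8.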

@simonmichael
Owner

simonmichael commented Mar 30, 2024 via email

@PSLLSP
Author

PSLLSP commented Apr 2, 2024

What about ignoring ZWNBSP characters during CSV import? I do not see how these invisible troublemakers could be useful in an hledger journal... Another option is to treat them as EOL, which would also help when a CSV file does not end with an EOL... An exception would be a file that uses ZWNBSP as the field separator. I do not know whether there is a way to define the invisible ZWNBSP as a field separator, maybe separator \uFEFF or separator ZWNBSP, and I do not know of any such CSV file...

Or maybe this could be addressed by adding a new rules command that maps one character to another, like the UNIX command tr. I could then use it to translate a CSV file from CP-1250 to UTF-8 by defining the translation table in the hledger import rules: several such commands could appear in a rules file, one mapping per line. The problem is that hledger reads the input file as UTF-8, and extended ASCII characters are invalid when the file is read as a UTF-8 stream (hledger reports the error invalid byte sequence). To address this, a new command to disable UTF-8 parsing would be needed too, maybe encoding utf-8 (the default) and encoding binary (to parse the CSV in 8-bit mode).
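
The rules commands proposed above do not exist as far as this thread shows. Until something like them does, the CP-1250 case can be handled outside hledger. A minimal Python sketch, in which the function name cp1250_to_utf8 and the sample text are made up for illustration:

```python
def cp1250_to_utf8(data: bytes) -> bytes:
    """Transcode CP-1250 bytes to UTF-8 (hypothetical helper)."""
    return data.decode("cp1250").encode("utf-8")

# Sample text with Central European letters, encoded as a legacy bank
# export might be. The text itself is invented for this example.
legacy = "Žluťoučký kůň".encode("cp1250")

# Read as UTF-8, the legacy bytes are an invalid byte sequence.
try:
    legacy.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

converted = cp1250_to_utf8(legacy)
print(valid_utf8, converted.decode("utf-8"))  # False Žluťoučký kůň
```

A per-character mapping table, as tr would use, is not needed here because the codec already knows the whole CP-1250 table.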

2 participants