Better handling of encodings other than utf-8 for "sqlite-utils insert" #182

Closed
kaihendry opened this issue Sep 30, 2020 · 5 comments
Labels
enhancement New feature or request

Comments

@kaihendry

Makefile:

data.db:
        curl -O http://maps.natalian.org/data.txt
        go run csv-write.go > data.csv
        sqlite-utils insert data.db travels data.csv --csv

clean:
        rm data*

csv-write.go

Error message is:

sqlite-utils insert data.db travels data.csv --csv
Traceback (most recent call last):
  File "/home/hendry/.local/bin/sqlite-utils", line 8, in <module>
    sys.exit(cli())
  File "/home/hendry/.local/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/hendry/.local/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/hendry/.local/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/hendry/.local/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/hendry/.local/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/hendry/.local/lib/python3.8/site-packages/sqlite_utils/cli.py", line 614, in insert
    insert_upsert_implementation(
  File "/home/hendry/.local/lib/python3.8/site-packages/sqlite_utils/cli.py", line 553, in insert_upsert_implementation
    headers = next(reader)
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 1234: invalid continuation byte
make: *** [Makefile:4: data.db] Error 1
[hendry@t14s datasette-map]$ sqlite-utils --version
sqlite-utils, version 2.19

Little bit surprised if Go is spewing out bad Unicode, but I'm not sure how to grok position 1234..

@simonw
Owner

simonw commented Sep 30, 2020

It looks like http://maps.natalian.org/data.txt is encoded as latin-1, but sqlite-utils assumes utf-8 and hence breaks.

It would be worth improving the error message here. I could also add a --encoding latin-1 option to sqlite-utils insert to help in consuming files that are stored in charsets other than utf-8.
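
To illustrate the diagnosis above, here is a minimal sketch that reproduces the same class of error. The sample string is made up for illustration and is not taken from data.txt. In latin-1, "ã" is the single byte 0xe3; utf-8 treats 0xe3 as the start of a multi-byte sequence, so the plain ASCII byte that follows is rejected as an invalid continuation byte. The position in the message is a byte offset into the data being decoded.

```python
# Hypothetical sample row; in latin-1 the "ã" is stored as the single byte 0xe3.
raw = "São Paulo,Brazil".encode("latin-1")

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the byte offset of the offending byte within `raw`.
    failure = (exc.start, hex(raw[exc.start]), exc.reason)

# Decoding with the encoding the bytes were written in recovers the text.
decoded = raw.decode("latin-1")

print(failure)
print(decoded)
```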

@simonw simonw changed the title from 'UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 1234: invalid continuation byte' to 'Better handling of encodings other than utf-8 for "sqlite-utils insert"' Sep 30, 2020
@simonw simonw added the enhancement New feature or request label Sep 30, 2020
@simonw
Owner

simonw commented Oct 14, 2020

I could use https://github.com/chardet/chardet to help here, though I'd rather not add it as a dependency (sqlite-utils has very few dependencies at the moment). I could add it as an optional dependency though.
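
One way the optional-dependency idea could look, as a rough sketch rather than what sqlite-utils actually does: import chardet if it happens to be installed and fall back to a default otherwise. The function name guess_encoding and the utf-8 fallback are assumptions made for this example.

```python
try:
    import chardet  # optional third-party package; may not be installed
except ImportError:
    chardet = None


def guess_encoding(raw, default="utf-8"):
    """Return a best-guess encoding name for `raw` (hypothetical helper)."""
    if chardet is None:
        # Without chardet there is nothing to detect with, so use the default.
        return default
    guess = chardet.detect(raw)
    # chardet reports None as the encoding when it cannot make a guess.
    return guess["encoding"] or default


print(guess_encoding("São Paulo,Brazil\n".encode("latin-1")))
```

Detection is a heuristic, so the result is only a suggestion, especially for short inputs.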

@simonw
Owner

simonw commented Oct 14, 2020

For the moment I'm going to add a --encoding option and some code that catches UnicodeDecodeError and shows an error message that suggests using --encoding.

That error message could detect if the file command is available and, if it is, suggest running file filename.txt to detect the character encoding.
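
A rough sketch of that plan, not the code that ended up in sqlite-utils: catch the UnicodeDecodeError, raise a friendlier error that points at --encoding, and mention the file command only when shutil.which() finds it on the PATH. The helper names encoding_hint and read_headers, the message wording, and the use of ValueError are all assumptions for illustration.

```python
import shutil


def encoding_hint(filename):
    """Build a message suggesting --encoding (hypothetical wording)."""
    message = (
        "The input could not be decoded as utf-8. "
        "Try again with --encoding, for example --encoding latin-1."
    )
    # Only suggest the `file` command if it is available on this system.
    if shutil.which("file"):
        message += f" Running 'file {filename}' may help identify the encoding."
    return message


def read_headers(raw, filename="data.csv"):
    """Decode raw CSV bytes as utf-8 and return the header row."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Replace the bare traceback with an actionable message.
        raise ValueError(encoding_hint(filename)) from exc
    return text.splitlines()[0].split(",")


print(read_headers(b"name,country\n"))
```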

@simonw
Owner

simonw commented Oct 16, 2020

The file is opened for me by click.File(), which also handles things like - for stdin. But I need to be able to switch the encoding used to read from it based on the --encoding option.

I think the way to do that is to open the file in binary mode and then wrap it in a codec reader:

fp = codecs.getreader(encoding)(fp)
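
Here is a small self-contained sketch of that wrapping approach. It uses io.BytesIO as a stand-in for the binary file object that click.File() would hand over, and the function name read_csv_rows and the sample data are invented for the example.

```python
import codecs
import csv
import io


def read_csv_rows(binary_fp, encoding="utf-8"):
    """Read CSV rows from a binary file object using the given encoding."""
    # codecs.getreader() returns a StreamReader class for the encoding;
    # calling it wraps the binary stream so that it yields decoded text.
    text_fp = codecs.getreader(encoding)(binary_fp)
    reader = csv.reader(text_fp)
    headers = next(reader)
    return [dict(zip(headers, row)) for row in reader]


# Stand-in for a latin-1 file opened in binary mode.
data = "name,country\nSão Paulo,Brazil\n".encode("latin-1")
rows = read_csv_rows(io.BytesIO(data), encoding="latin-1")
print(rows)
```

Because the stream stays binary until it is wrapped, the same code path can serve both regular files and stdin, with the encoding chosen at run time.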

@simonw simonw closed this as completed in 2c541fa Oct 16, 2020
@simonw
Owner

simonw commented Oct 16, 2020

simonw added a commit that referenced this issue Oct 16, 2020