gh-108590: Fix sqlite3 iterdump() for table columns containing invalid Unicode sequences #108683

erlend-aasland · 2023-08-30T12:21:13Z

This partially reverts 400a1ce, except for the test, which is amended.

Issue: sqlite3.iterdump() incompatible with binary data #108590

…e sequences This also reverts 400a1ce.

erlend-aasland · 2023-08-30T12:23:06Z

cc. @CorvinM

erlend-aasland · 2023-08-30T12:28:26Z

Sorry 'bout the unneeded churn, Serhiy; I should have caught this earlier.

serhiy-storchaka

It is still not good.

It modifies the input connection object, is not thread-safe, and even if iterdump() will no longer fail, you need a way to feed its result back to SQLite without errors. There will also be new encoding errors in attempt to save the dump in file or print to stdout.

serhiy-storchaka · 2023-08-30T13:32:48Z

For now, it may be easier to revert the previous changes and start to work on proper fix from clear page.

erlend-aasland · 2023-08-30T13:36:09Z

It modifies the input connection object [...]

Good point; Corvin added the context manager for this; let's throw that back in.

is not thread-safe [...]

The underlying calls should take care of that; you cannot share a sqlite3.Connection object between threads and call execute on it without explicitly setting check_same_thread=False. I think we're good here.

erlend-aasland · 2023-08-30T14:26:55Z

For now, it may be easier to revert the previous changes and start to work on proper fix from clear page.

I agree; I'm reverting the previous change in main and 3.11 (I already closed the 3.12 backport).

erlend-aasland · 2023-08-30T19:54:09Z

There will also be new encoding errors in attempt to save the dump in file or print to stdout.

There is no encoding errors when printing to stdout. With write, you need to pass errors="surrogateescape", so it is not optimal.

CorvinM · 2023-08-30T20:11:22Z

Does the recreation from dump work here? I was encountering encoding errors when attempting it with surrogateescape but not with the old chr() for some reason.
test from #108695:

    def test_dump_recreation(self):
        self.cu.executescript("""
            CREATE TABLE foo (id INTEGER, text TEXT, blob BLOB);
            INSERT INTO foo VALUES (0, CAST(X'619f' AS TEXT), X'619f');
            INSERT INTO foo VALUES (1, 'Hello SQLite!', X'98194eff46ab29f79064');
        """)
        original_dump = list(self.cx.iterdump())
        with memory_database() as cx2:
            query = "".join(original_dump)
            cx2.executescript(query)
            recreation_dump = list(cx2.iterdump())
        self.assertEqual(original_dump, recreation_dump)

erlend-aasland · 2023-08-30T20:13:32Z

Well, using the "surrogateescape" trick correctly dumps and reproduces the database, but it is not optimal, since the user must also pass errors="surrogateescape" to write(), unless print() is used, in which case it just works.

So the fix is valid, technically.

But from a user perspective; requiring the user to pass "surrogateescape" when writing the dump is a bad choice. Perhaps this is best solved as a documentation issue. We can add a "How to deal with table columns with broken Unicode sequences" How To, and reference the "surrogateescape" trick there (and point to the Unicode HOWTO).

erlend-aasland · 2023-08-30T20:23:38Z

Yes, I can recreate it with this:

$ sqlite3 test.db
SQLite version 3.39.5 2022-10-14 20:58:05
Enter ".help" for usage hints.
sqlite> CREATE TABLE t(t TEXT);
sqlite> INSERT INTO t VALUES(CAST(X'619f' AS TEXT));
sqlite> ^D
$ cat > test.py
import sqlite3
cx = sqlite3.connect("test.db")
for line in cx.iterdump():
    print(line)
cx.close()
$ ./python.exe test.py > dump1.sql
$ sqlite3 test.db .dump > dump2.sql
$ hexdump -C dump1.sql
00000000  42 45 47 49 4e 20 54 52  41 4e 53 41 43 54 49 4f  |BEGIN TRANSACTIO|
00000010  4e 3b 0a 43 52 45 41 54  45 20 54 41 42 4c 45 20  |N;.CREATE TABLE |
00000020  74 28 74 20 54 45 58 54  29 3b 0a 49 4e 53 45 52  |t(t TEXT);.INSER|
00000030  54 20 49 4e 54 4f 20 22  74 22 20 56 41 4c 55 45  |T INTO "t" VALUE|
00000040  53 28 27 61 9f 27 29 3b  0a 43 4f 4d 4d 49 54 3b  |S('a.');.COMMIT;|
00000050  0a                                                |.|
00000051
$ hexdump -C dump2.sql
00000000  50 52 41 47 4d 41 20 66  6f 72 65 69 67 6e 5f 6b  |PRAGMA foreign_k|
00000010  65 79 73 3d 4f 46 46 3b  0a 42 45 47 49 4e 20 54  |eys=OFF;.BEGIN T|
00000020  52 41 4e 53 41 43 54 49  4f 4e 3b 0a 43 52 45 41  |RANSACTION;.CREA|
00000030  54 45 20 54 41 42 4c 45  20 74 28 74 20 54 45 58  |TE TABLE t(t TEX|
00000040  54 29 3b 0a 49 4e 53 45  52 54 20 49 4e 54 4f 20  |T);.INSERT INTO |
00000050  74 20 56 41 4c 55 45 53  28 27 61 9f 27 29 3b 0a  |t VALUES('a.');.|
00000060  43 4f 4d 4d 49 54 3b 0a                           |COMMIT;.|
00000068

Apart from the preamble and table name quoting, they're equal. The VALUES string is 27 61 9f 27 in both dumps, as expected.

erlend-aasland · 2023-08-30T20:28:25Z

test_dump_recreation does obviously not work with the surrogateescapes approach, since it fails during recreation.

CorvinM · 2023-08-30T20:31:46Z

Unless I did something wrong I think this PR will fail the test I posted but works for print()
Whereas #108695 will pass the test but I doubt it will print() without encoding garbage.

What do you think about breaking consistency with sqlite cli for 3.11, 3.12 and just dumping CAST(x AS TEXT) for these invalid unicode cases? We could get a more consistent with sqlite cli fix in main since we are not bound to backwards compatibility?

erlend-aasland · 2023-08-30T20:32:10Z

FTR, with #108695 and the recreation sequence outlined in #108683 (comment), I'm not able to recreate the database. The resulting VALUES string becomes: 27 61 c2 9f 27

erlend-aasland · 2023-08-30T20:35:19Z

[...] I think this PR will fail the test I posted but works for print() [...]

Both correct.

What do you think about breaking consistency with sqlite cli for 3.11, 3.12 and just dumping CAST(x AS TEXT) for these invalid unicode cases? We could get a more consistent with sqlite cli fix in main since we are not bound to backwards compatibility?

I'm not sure if it is worth it for this corner case. At the moment, I think this may be best solved in documentation. Let's think about it for some days; there is no hurry in getting this fixed.

erlend-aasland · 2023-08-30T21:00:05Z

Thanks for helping investigate this, @CorvinM.

pythongh-108590: Correctly fix sqlite3 iterdump() with invalid Unicod…

c2e9d90

…e sequences This also reverts 400a1ce.

erlend-aasland requested a review from berkerpeksag as a code owner August 30, 2023 12:21

erlend-aasland added needs backport to 3.11 only security fixes needs backport to 3.12 bug and security fixes labels Aug 30, 2023

erlend-aasland requested a review from serhiy-storchaka August 30, 2023 12:21

bedevere-bot mentioned this pull request Aug 30, 2023

sqlite3.iterdump() incompatible with binary data #108590

Closed

2 tasks

bedevere-bot added the awaiting core review label Aug 30, 2023

Fix NEWS

4bb45da

erlend-aasland linked an issue Aug 30, 2023 that may be closed by this pull request

sqlite3.iterdump() incompatible with binary data #108590

Closed

2 tasks

serhiy-storchaka reviewed Aug 30, 2023

View reviewed changes

Add back context manager

3059ad9

erlend-aasland marked this pull request as draft August 30, 2023 19:34

bedevere-bot removed the awaiting core review label Aug 30, 2023

erlend-aasland changed the title ~~gh-108590: Correctly fix sqlite3 iterdump() with invalid Unicode sequences~~ gh-108590: Fix sqlite3 iterdump() for table columns containing invalid Unicode sequences Aug 30, 2023

Pull in main

48931c0

erlend-aasland closed this Oct 11, 2023

erlend-aasland deleted the sqlite/dump-invalid-utf8 branch October 11, 2023 20:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gh-108590: Fix sqlite3 iterdump() for table columns containing invalid Unicode sequences #108683

gh-108590: Fix sqlite3 iterdump() for table columns containing invalid Unicode sequences #108683

erlend-aasland commented Aug 30, 2023 •

edited

erlend-aasland commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

serhiy-storchaka left a comment

serhiy-storchaka commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023 •

edited

CorvinM commented Aug 30, 2023 •

edited

erlend-aasland commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

CorvinM commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

gh-108590: Fix sqlite3 iterdump() for table columns containing invalid Unicode sequences #108683

gh-108590: Fix sqlite3 iterdump() for table columns containing invalid Unicode sequences #108683

Conversation

erlend-aasland commented Aug 30, 2023 • edited

erlend-aasland commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

serhiy-storchaka left a comment

Choose a reason for hiding this comment

serhiy-storchaka commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023 • edited

CorvinM commented Aug 30, 2023 • edited

erlend-aasland commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

CorvinM commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023

erlend-aasland commented Aug 30, 2023 •

edited

erlend-aasland commented Aug 30, 2023 •

edited

CorvinM commented Aug 30, 2023 •

edited