Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-108590: Fix sqlite3 iterdump() for table columns containing invalid Unicode sequences #108683

Closed

Conversation

erlend-aasland
Copy link
Contributor

@erlend-aasland erlend-aasland commented Aug 30, 2023

This partially reverts 400a1ce, except for the test, which is amended.

@erlend-aasland
Copy link
Contributor Author

cc. @CorvinM

@erlend-aasland
Copy link
Contributor Author

Sorry 'bout the unneeded churn, Serhiy; I should have caught this earlier.

@erlend-aasland erlend-aasland linked an issue Aug 30, 2023 that may be closed by this pull request
2 tasks
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still not good.

It modifies the input connection object, is not thread-safe, and even if iterdump() will no longer fail, you need a way to feed its result back to SQLite without errors. There will also be new encoding errors in attempt to save the dump in file or print to stdout.

@serhiy-storchaka
Copy link
Member

For now, it may be easier to revert the previous changes and start to work on proper fix from clear page.

@erlend-aasland
Copy link
Contributor Author

It modifies the input connection object [...]

Good point; Corvin added the context manager for this; let's throw that back in.

is not thread-safe [...]

The underlying calls should take care of that; you cannot share a sqlite3.Connection object between threads and call execute on it without explicitly setting check_same_thread=False. I think we're good here.

@erlend-aasland
Copy link
Contributor Author

For now, it may be easier to revert the previous changes and start to work on proper fix from clear page.

I agree; I'm reverting the previous change in main and 3.11 (I already closed the 3.12 backport).

@erlend-aasland erlend-aasland marked this pull request as draft August 30, 2023 19:34
@erlend-aasland erlend-aasland changed the title gh-108590: Correctly fix sqlite3 iterdump() with invalid Unicode sequences gh-108590: Fix sqlite3 iterdump() for table columns containing invalid Unicode sequences Aug 30, 2023
@erlend-aasland
Copy link
Contributor Author

erlend-aasland commented Aug 30, 2023

There will also be new encoding errors in attempt to save the dump in file or print to stdout.

There is no encoding errors when printing to stdout. With write, you need to pass errors="surrogateescape", so it is not optimal.

@CorvinM
Copy link
Contributor

CorvinM commented Aug 30, 2023

Does the recreation from dump work here? I was encountering encoding errors when attempting it with surrogateescape but not with the old chr() for some reason.
test from #108695:

    def test_dump_recreation(self):
        self.cu.executescript("""
            CREATE TABLE foo (id INTEGER, text TEXT, blob BLOB);
            INSERT INTO foo VALUES (0, CAST(X'619f' AS TEXT), X'619f');
            INSERT INTO foo VALUES (1, 'Hello SQLite!', X'98194eff46ab29f79064');
        """)
        original_dump = list(self.cx.iterdump())
        with memory_database() as cx2:
            query = "".join(original_dump)
            cx2.executescript(query)
            recreation_dump = list(cx2.iterdump())
        self.assertEqual(original_dump, recreation_dump)

@erlend-aasland
Copy link
Contributor Author

Well, using the "surrogateescape" trick correctly dumps and reproduces the database, but it is not optimal, since the user must also pass errors="surrogateescape" to write(), unless print() is used, in which case it just works.

So the fix is valid, technically.

But from a user perspective; requiring the user to pass "surrogateescape" when writing the dump is a bad choice. Perhaps this is best solved as a documentation issue. We can add a "How to deal with table columns with broken Unicode sequences" How To, and reference the "surrogateescape" trick there (and point to the Unicode HOWTO).

@erlend-aasland
Copy link
Contributor Author

Yes, I can recreate it with this:

$ sqlite3 test.db
SQLite version 3.39.5 2022-10-14 20:58:05
Enter ".help" for usage hints.
sqlite> CREATE TABLE t(t TEXT);
sqlite> INSERT INTO t VALUES(CAST(X'619f' AS TEXT));
sqlite> ^D
$ cat > test.py
import sqlite3
cx = sqlite3.connect("test.db")
for line in cx.iterdump():
    print(line)
cx.close()
$ ./python.exe test.py > dump1.sql
$ sqlite3 test.db .dump > dump2.sql
$ hexdump -C dump1.sql
00000000  42 45 47 49 4e 20 54 52  41 4e 53 41 43 54 49 4f  |BEGIN TRANSACTIO|
00000010  4e 3b 0a 43 52 45 41 54  45 20 54 41 42 4c 45 20  |N;.CREATE TABLE |
00000020  74 28 74 20 54 45 58 54  29 3b 0a 49 4e 53 45 52  |t(t TEXT);.INSER|
00000030  54 20 49 4e 54 4f 20 22  74 22 20 56 41 4c 55 45  |T INTO "t" VALUE|
00000040  53 28 27 61 9f 27 29 3b  0a 43 4f 4d 4d 49 54 3b  |S('a.');.COMMIT;|
00000050  0a                                                |.|
00000051
$ hexdump -C dump2.sql
00000000  50 52 41 47 4d 41 20 66  6f 72 65 69 67 6e 5f 6b  |PRAGMA foreign_k|
00000010  65 79 73 3d 4f 46 46 3b  0a 42 45 47 49 4e 20 54  |eys=OFF;.BEGIN T|
00000020  52 41 4e 53 41 43 54 49  4f 4e 3b 0a 43 52 45 41  |RANSACTION;.CREA|
00000030  54 45 20 54 41 42 4c 45  20 74 28 74 20 54 45 58  |TE TABLE t(t TEX|
00000040  54 29 3b 0a 49 4e 53 45  52 54 20 49 4e 54 4f 20  |T);.INSERT INTO |
00000050  74 20 56 41 4c 55 45 53  28 27 61 9f 27 29 3b 0a  |t VALUES('a.');.|
00000060  43 4f 4d 4d 49 54 3b 0a                           |COMMIT;.|
00000068

Apart from the preamble and table name quoting, they're equal. The VALUES string is 27 61 9f 27 in both dumps, as expected.

@erlend-aasland
Copy link
Contributor Author

test_dump_recreation does obviously not work with the surrogateescapes approach, since it fails during recreation.

@CorvinM
Copy link
Contributor

CorvinM commented Aug 30, 2023

Unless I did something wrong I think this PR will fail the test I posted but works for print()
Whereas #108695 will pass the test but I doubt it will print() without encoding garbage.

What do you think about breaking consistency with sqlite cli for 3.11, 3.12 and just dumping CAST(x AS TEXT) for these invalid unicode cases? We could get a more consistent with sqlite cli fix in main since we are not bound to backwards compatibility?

@erlend-aasland
Copy link
Contributor Author

FTR, with #108695 and the recreation sequence outlined in #108683 (comment), I'm not able to recreate the database. The resulting VALUES string becomes: 27 61 c2 9f 27

@erlend-aasland
Copy link
Contributor Author

[...] I think this PR will fail the test I posted but works for print() [...]

Both correct.

What do you think about breaking consistency with sqlite cli for 3.11, 3.12 and just dumping CAST(x AS TEXT) for these invalid unicode cases? We could get a more consistent with sqlite cli fix in main since we are not bound to backwards compatibility?

I'm not sure if it is worth it for this corner case. At the moment, I think this may be best solved in documentation. Let's think about it for some days; there is no hurry in getting this fixed.

@erlend-aasland
Copy link
Contributor Author

Thanks for helping investigate this, @CorvinM.

@erlend-aasland erlend-aasland deleted the sqlite/dump-invalid-utf8 branch October 11, 2023 20:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs backport to 3.11 only security fixes needs backport to 3.12 bug and security fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sqlite3.iterdump() incompatible with binary data
4 participants