Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-108590: Improve sqlite3 docs on encoding issues and how to handle those #108699

Merged
merged 18 commits into from
Oct 25, 2023

Conversation

erlend-aasland
Copy link
Contributor

@erlend-aasland erlend-aasland commented Aug 30, 2023

Document how to handle table columns with invalid Unicode sequences.


📚 Documentation preview 📚: https://cpython-previews--108699.org.readthedocs.build/

Document how to handle table columns with invalid Unicode sequences.
@bedevere-bot bedevere-bot added the docs Documentation in the Doc dir label Aug 30, 2023
@erlend-aasland
Copy link
Contributor Author

erlend-aasland commented Aug 30, 2023

IMO, the existing text_factory examples look contrived; the added example where we solve the issue of invalid Unicode sequences is a more interesting and useful example.

@erlend-aasland
Copy link
Contributor Author

cc. @AlexWaygood, if you have time to look at the prose.

Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
@erlend-aasland
Copy link
Contributor Author

Thanks, Alex!

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

print() will fail, because sys.stdout.errors is "strict" by default. It is not a new issue, you have the same issue when you print the result of os.listdir(), for example.

Other issue is that you cannot simply feed the result of iterdump() to execute(), because the Python interface only accept UTF-8 encodable string.

A minor problem is that using anything besides str, bytes or bytearray as text_factory slows down requests. So you only need to use this if necessary.

@erlend-aasland erlend-aasland changed the title gh-108590: Add sqlite3 text factory howto gh-108590: Improve sqlite3 docs on encoding issues and how to handle those Aug 31, 2023
@CorvinM
Copy link
Contributor

CorvinM commented Aug 31, 2023

It may be helpful to emphasize that the surrogate escapes cannot be passed back to sqlite through python.

@serhiy-storchaka
Copy link
Member

The plausible real-world example -- a DB created from a CSV file or other source in legacy encoding.

Ukrainian or Korean are bad as examples, because all characters are non-ASCII, so you cannot not recognize the word in its bytes repr. "Österreich" is better, it contains only one non-Latin letter. Of course you need to use the encoding different from UTF-8 for example. It would be better if the natural legacy encoding is different from Latin1, but if there is nothing better, Latin1 works too. "Österreich" is a bit too long, some shorter Norwegian geographical name may be better.

The new example only shows the use of iterdump() and does not show the use of bytes. I think that examples with fetch() are important too. You can show it for bytes, encoding='latin1' and errors='surrogateescape', then say that iterdump() only works with text_factory producing a string and show how to save it in a file (and you can chose whether you want to recreate the DB with the originally encoded data or recode it in UTF-8 if you know the used encoding).

@encukou
Copy link
Member

encukou commented Aug 31, 2023

The CSV export from my Czech bank account uses iso-8859-2 and contains the column b'\xc8\xe1stka'.
Částka means "sum" (quantity of money). Decoding with Latin-1 this gives you Èástka which is wrong.

Or you could use b'Polo\xbeka' (položka, "entry"), which latin1 decodes to the more obviously wrong Polo¾ka.

@serhiy-storchaka
Copy link
Member

Good examples!

Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we all agree that the current version has serious issues. No need to beat it more, it is already dead. Let @erlend-aasland to address our comments and prepare a new version. Then we will start a new round of bikeshedding.

Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
@erlend-aasland
Copy link
Contributor Author

@serhiy-storchaka, do you want to take another look?

@erlend-aasland
Copy link
Contributor Author

@ezio-melotti, would you like to take another look?

Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Copy link
Member

@CAM-Gerlach CAM-Gerlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good now, thanks! Just a few small suggestions.

Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Doc/library/sqlite3.rst Outdated Show resolved Hide resolved
Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
@erlend-aasland
Copy link
Contributor Author

I think this is ready to land. Shout out if you disagree. OTOH, we can easily patch up things with a new PR. I'll be merging this tomorrow morning CET.

@erlend-aasland
Copy link
Contributor Author

A big thanks to everyone who helped shape this PR!

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change removes any mention of bytes. It makes the documentation less informative.

It also does not resolve the original issue, in particularly, it does not document that iterdump() is not compatible with text_factory=bytes. And it does not say what to do with the result of iterdump() with custom text_factory.

@erlend-aasland
Copy link
Contributor Author

This change removes any mention of bytes. It makes the documentation less informative.

Serhiy, please see #108699 (comment). I do not intend to mention bytes unless there is a reason to do it. If you care strongly for it, I suggest you create a follow-up PR with your use-case. IMO, it does not block this PR.

It also does not resolve the original issue, in particularly, it does not document that iterdump() is not compatible with text_factory=bytes. And it does not say what to do with the result of iterdump() with custom text_factory.

It does so implicitly; it helps the user understand how to work around such encoding issues. If you disagree with this approach, please open a competing PR.

@serhiy-storchaka
Copy link
Member

Okay. I do not fully understand what is the benefit of this change, but I do not oppose it.

@erlend-aasland
Copy link
Contributor Author

Okay. I do not fully understand what is the benefit of this change, but I do not oppose it.

Why so dismissive? I've asked you twice for a concrete proposal regarding your bytes comment, but you seem to ignore this. I do not oppose such a change, I only ask for a use-case for such an example.

@serhiy-storchaka
Copy link
Member

Sorry, I meant no disrespect. There were so many changes and comments in this PR that I was already lost. After the old examples were removed, there was a lot of discussion about new examples, but they never appeared, so I didn't see much to comment on, as this PR looked far from finished. If you consider it complete, I will perhaps continue the work in a new PR. text_factory should be fully documented (and special cases for bytes and bytearray are specially optimized in the code).

@erlend-aasland
Copy link
Contributor Author

Ok, thanks for your response, Serhiy. It is ok to continue improving the docs in follow-up PRs.

Let's land this, then we can follow up your concerns in a new PR. I appreciate your comments; I just find some of them hard to address.

@erlend-aasland erlend-aasland merged commit 1262e41 into python:main Oct 25, 2023
22 checks passed
@erlend-aasland erlend-aasland deleted the sqlite/doc-text-factory branch October 25, 2023 13:58
@miss-islington-app
Copy link

Thanks @erlend-aasland for the PR 🌮🎉.. I'm working now to backport this PR to: 3.11, 3.12.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Oct 25, 2023
…andle those (pythonGH-108699)

Add a guide for how to handle non-UTF-8 text encodings.
Link to that guide from the 'text_factory' docs.

(cherry picked from commit 1262e41)

Co-authored-by: Erlend E. Aasland <erlend@python.org>
Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
Co-authored-by: Corvin <corvin@corvin.dev>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Oct 25, 2023
…andle those (pythonGH-108699)

Add a guide for how to handle non-UTF-8 text encodings.
Link to that guide from the 'text_factory' docs.

(cherry picked from commit 1262e41)

Co-authored-by: Erlend E. Aasland <erlend@python.org>
Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
Co-authored-by: Corvin <corvin@corvin.dev>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
@bedevere-app
Copy link

bedevere-app bot commented Oct 25, 2023

GH-111324 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 bug and security fixes label Oct 25, 2023
@bedevere-app
Copy link

bedevere-app bot commented Oct 25, 2023

GH-111325 is a backport of this pull request to the 3.11 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.11 only security fixes label Oct 25, 2023
erlend-aasland added a commit that referenced this pull request Oct 25, 2023
…handle those (GH-108699) (#111325)

Add a guide for how to handle non-UTF-8 text encodings.
Link to that guide from the 'text_factory' docs.

(cherry picked from commit 1262e41)

Co-authored-by: Erlend E. Aasland <erlend@python.org>
Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
Co-authored-by: Corvin <corvin@corvin.dev>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
erlend-aasland added a commit that referenced this pull request Oct 25, 2023
…handle those (GH-108699) (#111324)

Add a guide for how to handle non-UTF-8 text encodings.
Link to that guide from the 'text_factory' docs.

(cherry picked from commit 1262e41)

Co-authored-by: Erlend E. Aasland <erlend@python.org>
Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
Co-authored-by: Corvin <corvin@corvin.dev>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024
…andle those (python#108699)

Add a guide for how to handle non-UTF-8 text encodings.
Link to that guide from the 'text_factory' docs.

Co-authored-by: Alex Waygood <Alex.Waygood@Gmail.com>
Co-authored-by: C.A.M. Gerlach <CAM.Gerlach@Gerlach.CAM>
Co-authored-by: Corvin <corvin@corvin.dev>
Co-authored-by: Ezio Melotti <ezio.melotti@gmail.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation in the Doc dir skip news
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants