ASCII codes (128-255) trigger "Binary data can't be viewed with the text editor"? #1279
Comments
Ok, I figured out why the text & grid editor might be detecting binary content in my blobs.
The algorithm must be interpreting ASCII characters > 127 as binary.
Ok, this does sound like a bug that needs fixing. 😄
We are performing almost the same check in the table widget as in the cell editor, but for the former, apparently for performance reasons, we check at most the first 512 bytes of the field. The check compares the UTF-8 conversion of the data, as a string, with the original data. But what if the text uses another encoding? Do we always keep the text in UTF-8, or would this break for other encodings? @sky5walk What's your encoding? I guess it is ISO 8859-1.
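For reference, a minimal sketch of the round-trip check described above, in Qt-style C++ (the function name and details are illustrative, not the actual DB4S code):

```cpp
#include <QByteArray>
#include <QString>

// Decode the (possibly truncated) field as UTF-8, re-encode it, and treat
// the field as binary when the round trip is lossy: invalid sequences
// become U+FFFD and no longer match the original bytes. Two caveats:
// truncating at maxBytes can split a multi-byte sequence, and a leading
// BOM may be stripped by the decode, both making valid text look binary.
bool looksLikeText(const QByteArray& data, int maxBytes = 512)
{
    const QByteArray head = data.left(maxBytes);
    return QString::fromUtf8(head).toUtf8() == head;
}
```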
As I mentioned above, my files have no BOM and are plain ANSI text.
Ok, this muddies the water a bit. Should all 3 cases below be considered text by the grid and editor?
Have you set an encoding for the table? Does changing it help? Right-click on the table header and select "Set encoding" from the context menu. In the dialog, write "ISO 8859-1" or "CP1250" or whatever is appropriate, and then click OK. In any case, the binary check will fail in the table widget when the first 512 bytes contain a non-7-bit US-ASCII character.
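If it helps to see it in code, here is a hypothetical Qt 5-style sketch of resolving a user-typed encoding name from that dialog (the function name is made up; QTextCodec::codecForName() also accepts common aliases such as "latin1" for ISO 8859-1):

```cpp
#include <QByteArray>
#include <QString>
#include <QTextCodec>

// Decode a BLOB with a user-selected encoding name such as "ISO 8859-1"
// or "CP1250"; fall back to UTF-8 when the name is unknown.
QString decodeField(const QByteArray& data, const QByteArray& encodingName)
{
    QTextCodec* codec = QTextCodec::codecForName(encodingName);
    if (!codec)
        codec = QTextCodec::codecForName("UTF-8");
    return codec->toUnicode(data);
}
```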
Ok, I see the issue. There is only UTF-8 or UTF-16 encoding, no ANSI.
I think this would be a fix for the binary check in the table when the proper encoding is used. Maybe @MKleusberg can take a quick look at it to validate the solution.
@sky5walk The behaviour regarding encoding is probably different on Linux than on Windows. My current OS encoding is UTF-8 and DB4S is aligned with that, but I don't know what the behaviour is on Windows. In my environment, DB4S expects UTF-8, which covers the full Unicode set. Any other encoding can be set as the table encoding inside DB4S, so I think there isn't any limitation to 7-bit ASCII. The unintuitive part is knowing which encoding to set. Powerful text editors usually detect and display the encoding used when opening a text file, so one of them can help in guessing it.
Ok, it sounds like I'll store UTF-8 with no BOM. This shows as normal text in the current DB4S grid and editor. The user gets a simple browser for plain text docs.
@mgrojo This looks like the right approach to me. We definitely need to take encoding into consideration here because currently we're assuming it is always going to be UTF-8. That's reasonable because there are no ASCII or ANSI SQLite databases, but of course it doesn't work for ASCII or ANSI BLOBs.
@MKleusberg The EditDialog seems to be working already (I haven't taken a look at how it's done). I've made a test database with some BLOBs in UTF-8 and some others in ISO 8859-1. Before setting the encoding, the UTF-8 BLOBs work as expected, but the Latin-1 BLOBs are identified as binary, both in the table widget and the cell editor, except when the first non-US-ASCII character is outside the first 512 bytes of the value; then it is identified as text by the table and as binary by the cell editor (this is expected due to the performance compromise). When the ISO 8859-1 encoding is set for the table, the Latin-1 BLOBs are correctly identified as text everywhere, and the UTF-8 values are transcoded (i.e. 'á' turns into 'Ã¡') as expected.

The only minor glitch is that the binary view of the decoded Latin-1 texts shows the UTF-8 encoding and not the original Latin-1 bytes. This is probably messy to change, and it would usually be uninteresting for the user anyway. I think we can leave it with the decode call for the table widget that I've copied here.

The BOM part is beyond my current knowledge. I've only encountered BOMs while working on Windows, with much surprise and trouble; on Linux they are apparently very unusual. In fact I don't understand why the BOM's presence breaks our current check, but I suppose that toUtf8() strips it.
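As a worked example of that transcoding (a hypothetical snippet, for illustration): the UTF-8 encoding of 'á' is the two bytes 0xC3 0xA1; decoded as Latin-1, each byte becomes its own character.

```cpp
#include <QByteArray>
#include <QString>

const QByteArray utf8Bytes("\xC3\xA1", 2);                // UTF-8 bytes of 'á'
const QString asUtf8   = QString::fromUtf8(utf8Bytes);    // "á"  (one character)
const QString asLatin1 = QString::fromLatin1(utf8Bytes);  // "Ã¡" (two characters)
```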
…oding The current table encoding is now used in the binary check, so text encoded in 8-bit encodings is properly recognised as text in the table widget. Also make it possible to reset the encoding in the way suggested to the user: "Leave the field empty for using the database encoding". Previously this was rejected with the message: "This encoding is either not valid or not supported." See issue #1279
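Presumably the encoding-aware version of the check looks something like this hedged sketch (not the literal commit diff; names are illustrative), with the empty-field case falling back to the database encoding:

```cpp
#include <QByteArray>
#include <QString>
#include <QTextCodec>

// Round-trip the data through the table's codec instead of assuming
// UTF-8. An empty encoding name means "use the database encoding"
// (UTF-8 here); an unknown name is reported as unsupported.
bool looksLikeTextIn(const QByteArray& data, const QByteArray& tableEncoding)
{
    QTextCodec* codec = QTextCodec::codecForName(
        tableEncoding.isEmpty() ? QByteArray("UTF-8") : tableEncoding);
    if (!codec)
        return false;  // "This encoding is either not valid or not supported."
    const QString decoded = codec->toUnicode(data);
    return codec->fromUnicode(decoded) == data;
}
```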
@sky5walk My last commit should fix the problem with the binary check in the table widget once you've set the proper encoding. Maybe this is enough for you once you've returned to your original Windows encoding (CP1252, I guess, which is very similar to ISO 8859-1). The BOM issue is still pending.
@mgrojo You're right! Sorry for the confusion 😄 Your commit is correct then. I'll probably add the BOM bits over the next few days to make that work as well.
Thanks, I will check the nightly.
Your table is correct, but you have to take into account that the grid will only read the first 512 bytes.
Happy New Year!
Happy new year 😉 Tomorrow's nightly should change the behaviour for UTF-8 + BOM so it's always treated as text (even if there is a trailing null). Can you check if that is working for you? And of course if it helps you? 😃
Cool, the trailing-null handling will help code readability. I hate carrying -1's sprinkled with comments.
Ok, I tried the same blob insertion code on the 170102 nightly, and the grid shows my UTF-8 blob contents, but the editor clips the first few bytes: 4 from simple text, 2 from ASCII codes > 127?
This fixes a bug introduced in 27c6579. In that commit we are checking if a string starts with a Unicode BOM and removing the BOM if necessary. Apparently, though, Qt's startsWith() function doesn't work as expected here, so it's changed to a manual comparison in this commit, which seems to work fine. The first person who can explain that behaviour to me gets a free beer if we meet in person. See issue #1279. See issue #1282.
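For illustration, a byte-level comparison along the lines the commit describes (hypothetical name, not the actual DB4S code). One plausible explanation for the startsWith() surprise, matching the guess earlier in the thread, is that converting through QString strips a leading BOM, so the decoded string never contains it; checking the raw bytes sidesteps that:

```cpp
#include <QByteArray>

// Check the raw bytes for the UTF-8 BOM (0xEF 0xBB 0xBF) before any
// QString conversion has a chance to strip it.
bool startsWithUtf8Bom(const QByteArray& data)
{
    return data.size() >= 3 &&
           static_cast<unsigned char>(data.at(0)) == 0xEF &&
           static_cast<unsigned char>(data.at(1)) == 0xBB &&
           static_cast<unsigned char>(data.at(2)) == 0xBF;
}
```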
Thanks for pointing this out! That's what happens if you only test the new bits and not the default behaviour 😉 I think this should be fixed in the next nightly build. Can you give that another try?
Yes, will do.
Ok, I tried the nightly 170103 with the following results.
The reason for treating text with trailing NULLs as binary is that the NULL character cannot be printed or edited properly in the text editors. So, to avoid possible data loss, we have to make sure the user knows that there is some non-printable character in the field. Does that make sense? 😃 I have just tried again and for this line in your table
I get different results:
Are you sure it's different on your system? Because it would be weird if the grid and editor behaved differently here.
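To make the trailing-NULL reasoning above concrete, here is a minimal, hypothetical detection sketch (the name is illustrative, not the actual DB4S code):

```cpp
#include <QByteArray>

// A field whose bytes contain NUL cannot round-trip through a plain
// text editor, so it is safer to flag it as binary than to risk
// silently losing everything after (or including) the NUL.
bool containsNul(const QByteArray& data)
{
    return data.contains('\0');
}
```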
Yes, as of nightly 170103, I get the same results as you.
Take into account that this is possible:
In the grid we are checking only the first 512 bytes, so it is perfectly possible that the first 512 bytes are US-ASCII and some remaining bytes are in an 8-bit ASCII extension that needs a configured encoding to be treated as text.
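A constructed example of that case (hypothetical values, assuming the 512-byte window described above):

```cpp
#include <QByteArray>

// The first 512 bytes are plain US-ASCII, so a check truncated to 512
// bytes sees only ASCII; the Latin-1 byte at offset 550 is what actually
// needs the configured encoding to be treated as text.
QByteArray makeTrickyBlob()
{
    QByteArray blob(600, 'a');   // 600 bytes of US-ASCII
    blob[550] = '\xE1';          // 'á' in ISO 8859-1, past the 512-byte window
    return blob;
}
```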
Why the limit to the first 512 bytes? Is the speed hit really that bad?
Details for the issue
I wrote 3 ANSI text files as blobs to a db. Each cell in the grid displays the ASCII contents, but the text editor to the right is not consistently reading the contents.
What is the algorithm that flags supposed binary content in my ASCII content?
Is it size-related?
Useful extra information
Each text file, when viewed in Notepad++, showed as ANSI with no byte order mark (BOM).