
Ascii Codes(128-255) trigger "Binary data can't be viewed with the text editor"? #1279

Open
sky5walk opened this issue Dec 29, 2017 · 27 comments
Labels
bug · encoding

Comments

@sky5walk

sky5walk commented Dec 29, 2017

Details for the issue

I wrote 3 ANSI text files as blobs to a db. Each cell in the grid displays the ASCII contents, but the text editor to the right is not consistently reading the contents.
What is the algorithm that flags supposed binary content in my ASCII content?
Is it size related?

Useful extra information

Each text file, when viewed in Notepad++, showed as ANSI with no byte order mark (BOM).

I'm opening this issue because:

  • DB4S is crashing
  • DB4S has a bug
  • DB4S needs a feature
  • DB4S has another problem

I'm using DB4S on:

  • Windows: ( version: 10 Pro )
  • Linux: ( distro: ___ )
  • Mac OS: ( version: ___ )
  • Other: ___

I'm using DB4S version:

  • 3.10.1
  • 3.10.0
  • 3.9.1
  • Other: 3.10.99 nightly from 171205


@sky5walk
Author

Ok, I figured out why the text & grid editor might be detecting binary content in my blobs.
Some of my files contain ASCII symbols with codes above 127 but below 256.
Example content from a blob:

1[CR][LF]
±2[CR][LF]    --<< plus/minus symbol = Ascii(177)
3°[CR][LF]    --<< degrees sign      = Ascii(176)
#END[CR][LF]

The algorithm must be interpreting ASCII characters > 127 as binary.
Please extend the editor to handle ASCII codes up to 255.
I want my blobs viewable to the user as text.

@sky5walk sky5walk changed the title What algo = "Binary data can't be viewed with the text editor"? Ascii Codes(127-255) trigger "Binary data can't be viewed with the text editor"? Dec 30, 2017
@justinclift justinclift added the bug label Dec 30, 2017
@justinclift
Member

k, this does sound like a bug that needs fixing. 😄

@mgrojo
Member

mgrojo commented Dec 30, 2017

We are performing almost the same check in the table widget as in the cell editor, but for the former, apparently for performance reasons, we check at most the first 512 bytes of the field.

The check compares the UTF-8 conversion of the data as a string with the original data. But what if the text uses another encoding? Do we always keep the text in UTF-8, or would it break for other encodings?
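
For reference, a minimal sketch of that round-trip check (the function name is illustrative, not the actual DB4S symbol): data is treated as binary when converting it to a QString and back does not reproduce the original bytes.

#include <QByteArray>
#include <QString>

// Sketch only: QString::fromUtf8() replaces invalid byte sequences,
// so a lossless round trip implies the bytes were valid UTF-8 text.
bool looksBinary(const QByteArray& data)
{
    return QString::fromUtf8(data).toUtf8() != data;
}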

@sky5walk What's your encoding? I guess it is ISO 8859-1.

@sky5walk
Author

As I mentioned above, my files are plain ANSI text without a BOM.
I will try with UTF-8 and reply.

@sky5walk
Author

Ok, this muddies the waters a bit. Should all 3 cases below be considered text by the grid and editor?

File Format      blob action
-----------      -----------
Ansi no BOM      binary
UTF8 no BOM      text
UTF8 w/BOM       binary

@mgrojo
Member

mgrojo commented Dec 30, 2017

Have you set an encoding for the table? Does changing it help? Right-click on the table header and select "Set encoding" from the context menu. In the dialog, write "ISO 8859-1" or "CP1250" or whatever is appropriate, and then click OK.

In any case, the binary check will fail in the table widget when the first 512 bytes contain a non-7-bit US-ASCII character.

@sky5walk
Author

Ok, I see the issue. There is only UTF-8 or UTF-16 encoding, no ANSI.
What is the reason for the 7-bit limit? Are ASCII codes 128-255 not recognized?
I typed these directly into this editor and Notepad and pasted them into the DB4S editor: °®±
Why does it not switch to binary on paste?
I'm debating how to proceed.
Do I convert my simple ANSI docs to UTF-8 before dumping to a blob?

@mgrojo
Member

mgrojo commented Dec 30, 2017

I think this would be a fix for the binary check in the table when the proper encoding is used. Maybe @MKleusberg can take a quick look at it to validate the solution.

@@ -871,11 +871,11 @@ void SqliteTableModel::clearCache()
 
 bool SqliteTableModel::isBinary(const QModelIndex& index) const
 {
     // We're using the same way to detect binary data here as in the EditDialog class. For performance reasons we're only looking at
     // the first couple of bytes though.
-    QByteArray data = m_data.at(index.row()).at(index.column()).left(512);
+    QByteArray data = decode(m_data.at(index.row()).at(index.column()).left(512));
     return QString(data).toUtf8() != data;
 }
 
 QByteArray SqliteTableModel::encode(const QByteArray& str) const
 {
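
For context, a hypothetical sketch of what such a decode() step could do (names are mine; the real DB4S implementation may differ): transcode the stored bytes from the table's configured encoding to UTF-8 before the round-trip check runs.

#include <QByteArray>
#include <QTextCodec>

// Sketch only: convert from the configured table encoding (e.g.
// "ISO 8859-1") to UTF-8; unknown encodings pass through unchanged.
QByteArray decodeFromTableEncoding(const QByteArray& raw, const QByteArray& encodingName)
{
    QTextCodec* codec = QTextCodec::codecForName(encodingName);
    if (!codec)
        return raw;
    return codec->toUnicode(raw).toUtf8();
}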

@mgrojo
Member

mgrojo commented Dec 30, 2017

@sky5walk The behaviour regarding encoding is probably different on Linux than on Windows. My current OS encoding is UTF-8 and DB4S is aligned with that, but I don't know what the behaviour is on Windows. In my environment, DB4S expects UTF-8, which covers the full Unicode set. Any other encoding can be set as the table encoding inside DB4S, so I think there isn't any limitation to 7-bit ASCII. The unintuitive issue is knowing which encoding to set. Powerful text editors usually detect and display the encoding used when opening a text file, so one of them can help in guessing it.

@sky5walk
Author

sky5walk commented Dec 31, 2017

Ok, it sounds like I'll store UTF-8 with NO BOM. This shows as normal text in the current DB4S grid and editor. The user gets a simple browser for plain text docs.
But this means my blobs are not the true file contents: I have to strip BOMs from the files prior to writing blobs to my db. I'm trying to maintain docs and only update them when their SHA3-256 hash changes. If a doc is used and matches an existing blob's SHA3-256, I'll refer to its ID in the table; else I'll write a new blob to the table, and so on.
I started with plain ANSI, but that is not universal enough anymore.
Haha, now I have to contend with file SHA3's vs blob SHA3's without BOMs... or drop BOMs altogether, which is another debate. Argggg.
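
A minimal sketch of that dedup step, assuming the UTF-8 BOM is stripped before hashing so that BOM and non-BOM copies of the same document map to the same blob key (the function name is hypothetical):

#include <QByteArray>
#include <QCryptographicHash>

// Sketch only: drop a leading UTF-8 BOM, then hash the remaining bytes.
QByteArray blobKey(QByteArray doc)
{
    static const QByteArray utf8Bom = QByteArray::fromHex("efbbbf");
    if (doc.startsWith(utf8Bom))
        doc.remove(0, utf8Bom.size());
    return QCryptographicHash::hash(doc, QCryptographicHash::Sha3_256);
}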

@MKleusberg
Member

@mgrojo This looks like the right approach to me. We definitely need to take the encoding into consideration here because currently we're assuming it is always going to be UTF-8. That's reasonable because there are no ASCII or ANSI SQLite databases, but of course it doesn't work for ASCII or ANSI BLOBs.
I think there are two problems which aren't yet covered in your fix, though: 1) we would need to do the same in the EditDialog class, which doesn't know about the encoding yet, and 2) this doesn't help for UTF-8 data with a BOM. It's probably a good idea to look for all possible BOMs and always assume it's text data if we find one. What do you think?

@mgrojo
Member

mgrojo commented Dec 31, 2017

@MKleusberg The EditDialog seems to be already working (I haven't taken a look at how it's done). I've made a test database with some BLOBs in UTF-8 and some others in ISO 8859-1. Before setting the encoding, the UTF-8 BLOBs work as expected, but the Latin-1 BLOBs are identified as binary, both in the table widget and the cell editor, except when the first non-US-ASCII character is outside the first 512 bytes of the value; then it is identified as text by the table and as binary by the cell editor (this is expected due to the performance compromise). When the ISO 8859-1 encoding is set for the table, the Latin-1 BLOBs are correctly identified as text everywhere and the UTF-8 values are transcoded (i.e. 'á' turns into 'Ã¡') as expected.

The only minor glitch is that the binary view of the decoded Latin-1 texts shows the UTF-8 decoding and not the original binary Latin-1 encoding. This is probably messy to change, and it would usually be uninteresting for the user anyway.

I think we can leave it with the decode call for the table widget that I've copied here.

The BOM part is beyond my current knowledge. I've only encountered BOMs while working on Windows, with much surprise and trouble; on Linux they are apparently very unusual. In fact, I don't understand why the BOM presence breaks our current check, but I suppose toUtf8() strips them.

mgrojo added a commit that referenced this issue Dec 31, 2017
…oding

The current table encoding is used in the binary check, so text encoded in
8-bit encodings is properly recognised as text in the table widget.

Make it possible to reset the encoding in the way suggested to the
user: "Leave the field empty for using the database encoding". Previously
this was rejected with the message: "This encoding is either not valid or
not supported."

See issue #1279
@mgrojo
Member

mgrojo commented Dec 31, 2017

@sky5walk My last commit should fix the problem with the binary check in the table widget once you've set the proper encoding. Maybe this is enough for you once you've returned to your original Windows encoding (CP1252, I guess, which is very similar to ISO 8859-1). The BOM issue is still pending.

@MKleusberg
Member

@mgrojo You're right! Sorry for the confusion 😄 Your commit is correct then.

I'll probably add the BOM bits over the next few days to make that work as well.

@sky5walk
Author

Thanks, I will check the nightly.
Is this table correct for the grid and editor?

blob content     grid/editor
--------------   ------------
Ansi <asc(128)   text
Ansi >asc(±°)    binary
UTF8 - BOM       text
UTF8 + BOM       *binary
*May be fixed later?

@mgrojo
Member

mgrojo commented Dec 31, 2017

Your table is correct, but you have to take into account that the grid only reads the first 512 bytes.

@sky5walk sky5walk changed the title Ascii Codes(127-255) trigger "Binary data can't be viewed with the text editor"? Ascii Codes(128-255) trigger "Binary data can't be viewed with the text editor"? Jan 1, 2018
@sky5walk
Author

sky5walk commented Jan 1, 2018

Happy New Year!
I tried the nightly build and my blobs behave as below.
So, I will force my blobs to UTF-8 without BOM and with no trailing nulls.

blob content     grid/editor
--------------   ------------
Ansi <asc(128)   text
Ansi >asc(±°)    binary
UTF8 - BOM       text   --< But, a single trailing Null flips to binary?
UTF8 + BOM       binary --< May be fixed later?

@MKleusberg
Member

Happy new year 😉

Tomorrow's nightly should change the behaviour for UTF-8 + BOM so it's always treated as text (even if there is a trailing null). Can you check if that is working for you? And of course if it helps you? 😃

@sky5walk
Author

sky5walk commented Jan 1, 2018

Cool, the trailing null thing will help code readability. Hate carrying -1's sprinkled with comments.
I'll check tomorrow.

@sky5walk
Author

sky5walk commented Jan 2, 2018

Ok, I tried the same blob insertion code on the 180102 nightly, and the grid shows my UTF-8 blob contents, but the editor clips the first few bytes: 4 from simple text, 2 from content with ASCII codes > 127?
I will revert to 180101 until this is fixed.

MKleusberg added a commit that referenced this issue Jan 2, 2018
This fixes a bug introduced in 27c6579.
In that commit we check whether a string starts with a Unicode BOM and
remove the BOM if necessary. Apparently, though, Qt's startsWith()
function doesn't work as expected here, so this commit changes it to a
manual comparison, which seems to work fine.

The first person who can explain that behaviour to me gets a free beer
if we meet in person.

See issue #1279.
See issue #1282.
@MKleusberg
Member

Thanks for pointing this out! That's what happens if you only test the new bits and not the default behaviour 😉 I think this should be fixed in the next nightly build. Can you give that another try?

@sky5walk
Author

sky5walk commented Jan 2, 2018

Yes, will do.
BOM detection with a Qt string function may not be advised.
Instead, quickly scan the first 3 or 4 bytes for the known BOM prefixes DB4S supports, in conjunction with the endianness in play, BE or LE (see the sketch after the table).

Encoding     HEX
-----------  ----------
UTF-8        EF BB BF
UTF-16 (BE)  FE FF
UTF-16 (LE)  FF FE
UTF-32 (BE)  00 00 FE FF
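
A minimal sketch of that raw-byte scan, using only the prefixes from the table above (illustrative, not the actual DB4S code):

#include <QByteArray>

// Sketch only: compare the leading bytes against the known BOM
// prefixes instead of going through a QString conversion first.
bool hasKnownBom(const QByteArray& data)
{
    return data.startsWith(QByteArray::fromHex("efbbbf"))    // UTF-8
        || data.startsWith(QByteArray::fromHex("0000feff"))  // UTF-32 (BE)
        || data.startsWith(QByteArray::fromHex("feff"))      // UTF-16 (BE)
        || data.startsWith(QByteArray::fromHex("fffe"));     // UTF-16 (LE)
}

Note that a UTF-32 (LE) BOM (FF FE 00 00) would match the UTF-16 (LE) prefix first; since the table above omits it, the sketch does too.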

@sky5walk
Author

sky5walk commented Jan 3, 2018

Ok, I tried the nightly 180103 with the following results.
Note how trailing nulls are treated differently?
Any reason trailing nulls cannot be unified?
I am going to force my blob text content to UTF-8, minus BOM, minus trailing nulls.

blob content    grid    editor
--------------  ------  ------------
Ansi <asc(128)  *text   *text
Ansi >asc(±°)   *text   binary
UTF8 - BOM      *text   *text
UTF8 + BOM      text    text
* Trailing null's flip to binary?

@MKleusberg
Member

The reason for treating text with trailing NULLs as binary is that the NULL character cannot be printed or edited properly in the text editors. So to avoid possible data loss we have to make sure the user knows that there is a non-printable character in the field.
Now the reason for treating NULLs differently is that the rule "a BOM means text" has higher priority in our code than any other rule. I'm not super sure this is a good idea, but the reasoning is that a BOM indicates Unicode text, which in turn means you want to edit it in a text editor. And if your editor/font doesn't support some characters, that's a user problem.

Does that make sense? 😃
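
A hedged sketch of that rule ordering (illustrative only; the real checks live across the table model and the EditDialog):

#include <QByteArray>
#include <QString>

// Sketch only: a BOM outranks every other rule, an embedded NUL forces
// binary, and otherwise the UTF-8 round trip decides.
bool treatAsText(const QByteArray& data)
{
    if (data.startsWith(QByteArray::fromHex("efbbbf")))
        return true;                                  // BOM => text
    if (data.contains('\0'))
        return false;                                 // NUL => binary
    return QString::fromUtf8(data).toUtf8() == data;  // round trip
}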

I have just tried again and for this line in your table

Ansi >asc(±°)   *text   binary

I get different results:

Ansi >asc(±°)  binary   binary

Are you sure it's different on your system? Because that would be weird if grid and editor behaved differently here.

@sky5walk
Author

sky5walk commented Jan 5, 2018

Yes, as of nightly 180103, I get the same results as you.
This thread is helpful for future users, but I will not deviate from 'UTF8 minus BOM' for my text document blobs. (Unless UTF-16 is demanded for international text?)

@mgrojo
Member

mgrojo commented Jan 5, 2018

Take into account that this is possible:

Ansi >asc(±°) *text binary

In the grid we are checking only the first 512 bytes, so it is perfectly possible that the first 512 bytes are US-ASCII and some of the remaining bytes are in an 8-bit ASCII extension that needs a configured encoding to be treated as text.

@sky5walk
Author

sky5walk commented Jan 5, 2018

Why limit to the first 512 bytes? Is the speed hit really that bad?
Blob insertion of non-text documents such as images, zip files, etc. will often have 0's and bytes above asc(127).
And now, text docs with BOMs can be classified after 3 or 4 bytes.
That leaves text with no BOMs.
I would think the 512-byte limit should be a preference setting, so users can tailor their viewing results.
