Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cell edit window considers binary data to be text/numeric if first byte is in ascii range #1772

Closed
mbenningfield1 opened this Issue Feb 28, 2019 · 24 comments

Comments

Projects
None yet
4 participants
@mbenningfield1
Copy link

mbenningfield1 commented Feb 28, 2019

Details for the issue

What did you do?

Select a blob column while browsing a table.

What did you expect to see?

I expected to see the data represented as binary since the column affinity for the column is BLOB

What did you see instead?

On columns where the first byte of the binary header is outside of ASCII printable range, it shows the data as binary with the correct count of bytes.

The columns where the first byte of the binary header is within ASCII printable range, it shows the data as text/numeric with 1 char(s), instead of 32 bytes.

Useful extra information

The info below often helps, please fill it out if you're able to. :)

What operating system are you using?

  • Windows: ( _version:7 Pro version 6.1 Build 7601 SP1)

What is your DB4S version?

  • 3.11.1

Did you also

@justinclift justinclift added the bug label Mar 1, 2019

@justinclift

This comment has been minimized.

Copy link
Member

justinclift commented Mar 1, 2019

Oh, interesting bug. At least, it sounds like a bug.

Most of the time people seem to want things displayed as text if there's any reasonable chance of having it turn out ok. But yeah, for the situation where showing things in binary is preferred, that's been a lot less tested (AFAIK). Or at least, less issues have been opened about that approach.

Hopefully we can get things worked out, so both use cases are catered to. 😄

@mbenningfield1

This comment has been minimized.

Copy link
Author

mbenningfield1 commented Mar 1, 2019

Yeah, I think trying to "massage" the representation in order to anticipate the user's expectation is doomed to failure. If the column has BLOB affinity, or the "Binary" tab in the cell edit window is active, just display the raw bytes. For example, a column with text or numeric affinity with a value of 1 will show a binary representation in the cell edit window (with the "Binary" tab active) as '0x31', which is binary for the ASCII character '1', not a hex value of 0x01. If I ask the sqlite engine for the same column value as a BLOB, it will give me one byte (unsigned char) as 0x01, NOT 0x031.

@mgrojo

This comment has been minimized.

Copy link
Contributor

mgrojo commented Mar 1, 2019

I'm not sure whether what you are seeing is it working as expected or a bug. In theory, If the first 512 bytes of a cell value have a correct UTF-8 encoding, they are displayed as text. But you say "The columns where the first byte of the binary header is within ASCII printable range, it shows the data as text/numeric with 1 char(s), instead of 32 bytes." I can't reproduce this, all the 32 bytes have to be printable in UTF-8.

Could you post examples of BLOB detected as binary and BLOB detected as text? Screenshots might also help.

@sky5walk

This comment has been minimized.

Copy link

sky5walk commented Mar 1, 2019

I do not follow mbenningfield1's claim?
The Cell Editor has a dropdown for multiple formats.
Choose Binary and the data is shown unaltered.
In my case, I don't mind the Text view of my Blob cells since I often store plain text documents as Blobs.
We had an issue regarding UTF-8 and ANSI showing different results in Text view, but you solved that already.

@mbenningfield1

This comment has been minimized.

Copy link
Author

mbenningfield1 commented Mar 1, 2019

The first shot is of a BLOB that shows up correctly, the second is a shot of one that does not.

ok1
notok1

@mbenningfield1

This comment has been minimized.

Copy link
Author

mbenningfield1 commented Mar 1, 2019

Sorry, I hit the wrong button and accidently closed this.

@sky5walk

This comment has been minimized.

Copy link

sky5walk commented Mar 1, 2019

Again, confused. You are only showing the DB4S Binary view and claiming 1 is wrong.
What is the actual contents of both cells? Can you show an external hex editor image?

@mbenningfield1

This comment has been minimized.

Copy link
Author

mbenningfield1 commented Mar 1, 2019

@sky5walk No, I'm not saying that the binary is showing the wrong bits. I'm saying that it isn't telling me the count of bytes for the binary, it it telling me there is 1 character in this BLOB. Working with binary data, I need the correct count of bytes.

@mgrojo mgrojo self-assigned this Mar 1, 2019

@mgrojo

This comment has been minimized.

Copy link
Contributor

mgrojo commented Mar 1, 2019

I see now. There's a problem with the fix for #1731 when there are embedded nulls. The text detection is only working until the first null and the same for the character count.

@justinclift justinclift added this to the 3.11.2 milestone Mar 2, 2019

@justinclift justinclift referenced this issue Mar 2, 2019

Closed

3.11.2 - outstanding pieces #1773

13 of 13 tasks complete

mgrojo added a commit that referenced this issue Mar 2, 2019

Fix text detection for blobs containing bytes in ASCII range up to a …
…zero

Despite we provide a length for toUnicode() the validity/decode is being
performed only up to the first null character, so it passes as text blobs
containing bytes in the ASCII range, followed by a zero and anything else
after.

See issue #1772
@mgrojo

This comment has been minimized.

Copy link
Contributor

mgrojo commented Mar 2, 2019

@mbenningfield1 The problem with the embedded zeros should be fixed and will enter in tomorrow's nightly build. Could you give it a try and report whether all your blobs are now correctly detected?

@sky5walk

This comment has been minimized.

Copy link

sky5walk commented Mar 3, 2019

  1. Just tried 190303 Nightly.
    Confusing results in the detection count?
    It shows "X char(s)".
    Characters only make sense for text, xml, json.
    Bytes are required for binary, image.
  2. The detected char(s) is intermittent and wrong on 1st touch.
    Ex: View a blob or large text field with the cell editor in:
    Text mode --> Less than actual
    Select Binary mode --> Less than actual
    Select XML mode --> Less than actual
    Re-select Text mode --> Correct count.
    EDIT: The detected char(s) sometimes is correct on 1st touch. But after changing modes to binary or other, it then becomes something lower, even returning to text mode.
    EDIT2: Change the Cell Editor mode on a large blob field to binary, then click the auto-detect wheel and it changes modes to text, but the char(s) count is less than the actual.
@mbenningfield1

This comment has been minimized.

Copy link
Author

mbenningfield1 commented Mar 3, 2019

Cheers. Looks like it's working sensibly now. I can't reliably confirm @sky5walk 's observations, since I don't have any large text stored as BLOB's. I did notice that since TEXT and NUMERIC affinities are conflated by the Cell Edit window as 'Text/Numeric', even numeric values from INTEGER affinity columns get treated as strings of numeric characters when displayed in Binary. That could be contributing or related to the behavior that @sky5walk is seeing.

mgrojo added a commit that referenced this issue Mar 3, 2019

@justinclift

This comment has been minimized.

Copy link
Member

justinclift commented Mar 6, 2019

EDIT: The detected char(s) sometimes is correct on 1st touch. But after changing modes to binary or other, it then becomes something lower, even returning to text mode.
EDIT2: Change the Cell Editor mode on a large blob field to binary, then click the auto-detect wheel and it changes modes to text, but the char(s) count is less than the actual.

Sounds like this shouldn't be too hard to reproduce. 😄

@mbenningfield1

This comment has been minimized.

Copy link
Author

mbenningfield1 commented Mar 6, 2019

I created a test table in my current database (id INTEGER, name TEXT, data BLOB) and imported the following lorem ipsum text file into the data column:
LoremIpsum.txt

On first touch, whether the Cell Edit window is in Text or Binary mode, the character count is correct. Switching from Binary to Text shows an incorrect lower count. Alternatively, clicking 'Auto-Detect' in Binary mode correctly detects as Text, but with the incorrect lower count. Alternatively again, if initially in Text mode, switching to Binary and back to Text results in the incorrect lower count.

So, while my main concern has been addressed (correct byte count in Binary for binary data), it appears that the text detection process still needs a little work.

@justinclift

This comment has been minimized.

Copy link
Member

justinclift commented Mar 6, 2019

Thanks @mbenningfield1, that helps. 😄

@mgrojo

This comment has been minimized.

Copy link
Contributor

mgrojo commented Mar 7, 2019

I can reproduce the problem with the file that you've provided, but not with the ones I import from my local disk. It's clear that the problem is related to CR+LF endings of Windows. Qt seems to strip those CR at the minimal touch. If the file is exported again, it should be written in the local OS text format.

https://forum.qt.io/topic/42464/qplaintextedit-eats-line-endings/5

@sky5walk

This comment has been minimized.

Copy link

sky5walk commented Mar 7, 2019

Ok, good catch but super confusing and certainly a BUG.
My column is a BLOB. Its contents cannot be stripped according to TEXT rules.
My confusion was copy and paste from the cell edit window reapplied the [CR+LF] characters so I could not find the missing character count.

Also, please change the nomenclature from Char(s) to Byte(s).
Ansi and Unicode are a mess with Char(s).

@mgrojo

This comment has been minimized.

Copy link
Contributor

mgrojo commented Mar 7, 2019

My column is a BLOB. Its contents cannot be stripped according to TEXT rules.

Well, you're editing it as text after all. Maybe the solution is to throw away the Qt text editor and use always the Scintilla version. I've seen that it is not string the CR characters, but at least in Linux, it is interpreting them as an additional end of line, leaving an empty line after every real line. How is it behaving in Windows? I've tested it using Wine and it is behaving as the 100% Linux version, which is strange to me.

Also, please change the nomenclature from Char(s) to Byte(s).
Ansi and Unicode are a mess with Char(s).

I've always understood that as an abbreviation for characters, not bytes. In fact I changed it some time ago to count correctly characters for multibyte encodings. I saw that the count for Spanish was incorrect when the text included accented characters in UTF-8. We might want to change it to "character(s)" and maybe include also the byte count, but not sure about the available space.

mgrojo added a commit that referenced this issue Mar 7, 2019

Make clear we are counting characters and not bytes or C char(s)
Update translations with source texts.

Update en_GB translation to include plural forms (and some colour
instances+automatic removal of duplicate messages done by linguist).

See issue #1772
@mgrojo

This comment has been minimized.

Copy link
Contributor

mgrojo commented Mar 7, 2019

I've updated "char(s)" to "character(s)". This is how most translators have understood it and the actual count. In the British "translation" I've also included the plural forms, so it will even read "1 character" or "n characters". This can only be made through a translation file, not in the original English text.

@mbenningfield1

This comment has been minimized.

Copy link
Author

mbenningfield1 commented Mar 7, 2019

@mgrojo: That sounds good. Should I close this issue and let someone open another one with the focus on text detection? As I said, my initial concern about binary data as BINARY appears to be resolved. I'm not that concerned about text detection, since I usually store large text files as TEXT. Is there anything else you guys need from me?

@mgrojo

This comment has been minimized.

Copy link
Contributor

mgrojo commented Mar 7, 2019

Thanks, @mbenningfield1. Yes, I think it's better to separate the issues.

@mgrojo mgrojo closed this Mar 7, 2019

@sky5walk

This comment has been minimized.

Copy link

sky5walk commented Mar 7, 2019

My column is a BLOB. Its contents cannot be stripped according to TEXT rules.
"mgrojo replied"
Well, you're editing it as text after all. Maybe the solution is to throw away the Qt text editor and use always the Scintilla version. I've seen that it is not string the CR characters, but at least in Linux, it is interpreting them as an additional end of line, leaving an empty line after every real line. How is it behaving in Windows? I've tested it using Wine and it is behaving as the 100% Linux version, which is strange to me.

Yes, I thought we discussed moving to the Scintilla gadget in another issue? You get the extra formatting/styling too!
But regardless, there is still a bug if my BLOB content is altered while in the cell editor.
Should I post an issue or can you reference one from here?

@justinclift

This comment has been minimized.

Copy link
Member

justinclift commented Mar 8, 2019

In the British "translation" ...

Heh Heh Heh. From memory (years ago), that came from my needing something to play around with for gaining an understanding of Qt Linguist. We can probably drop it, unless someone really wants a proper UK spelling version and is willing to update it. 😄

@justinclift

This comment has been minimized.

Copy link
Member

justinclift commented Mar 8, 2019

But regardless, there is still a bug if my BLOB content is altered while in the cell editor.

Yeah, sounds like a bug. Probably one that's a PITA to fix too. 🙄

Should I post an issue or can you reference one from here?

Post a new issue. 😄

mgrojo added a commit that referenced this issue Mar 12, 2019

Fix text detection for blobs containing bytes in ASCII range up to a …
…zero

Despite we provide a length for toUnicode() the validity/decode is being
performed only up to the first null character, so it passes as text blobs
containing bytes in the ASCII range, followed by a zero and anything else
after.

See issue #1772

mgrojo added a commit that referenced this issue Mar 12, 2019

@justinclift justinclift referenced this issue Apr 12, 2019

Closed

Blob data is displayed incorrectly #1846

0 of 7 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.