Binary cell dump not visible - with no scrollbars (and it's not even a real blob) #1461

Open
tlhackque opened this Issue Jul 5, 2018 · 14 comments

tlhackque commented Jul 5, 2018

Details for the issue

What did you do?

Selected a database cell classified as a "blob"

What did you expect to see?

A binary and ASCII dump. (Everything is grayed out, perhaps because the DB is open read-only.)

What did you see instead?

Only the leftmost two characters of the ASCII dump.
No horizontal scrollbars.

[screenshot]

It is also mildly annoying that a single G1 character ('E4'), which is a valid Latin-1 graphic character, suffices to make this text look like a blob.

E4 is "small a dieresis or umlaut"

The binary test should require C0 or C1 control characters; G1 graphics are not an indication of binary data.

For ISO standard (non-MS) text:

The C0 range is 0x00 - 0x1F. Text can be expected to contain some of these: 09 (HT), 0A (LF), 0B (VT), 0C (FF), 0D (CR), and in some cases 1B (ESC) and 1A (^Z).

The C1 range is 0x80 - 0x9F, none of which is common in ANSI text. (Windows, however, uses many of these for graphics.)

The G0 range is 0x20 - 0x7e (7f is DEL).
The G1 range is 0xA0 - 0xFF.

I would say that a field is "binary" if it has any bytes in the C0 set (except as noted) or any of the C1 set.

Some might disagree because of the Microsoft usage.
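
Roughly, as a sketch in Qt terms (the helper name and this exact exemption list are mine, not anything in DB4S):

static bool looksBinaryByC0C1(const QByteArray& data)
{
    for(char c : data)
    {
        const unsigned char b = static_cast<unsigned char>(c);

        // C0 controls that ordinary text can be expected to contain
        const bool allowedC0 = (b == 0x09 || b == 0x0A || b == 0x0B ||
                                b == 0x0C || b == 0x0D || b == 0x1A || b == 0x1B);

        // Any other C0 control, or any C1 control, marks the field as binary
        if((b <= 0x1F && !allowedC0) || (b >= 0x80 && b <= 0x9F))
            return true;
    }
    return false;
}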

A very common (and simple) test for "binary" is the presence of a 0 (NUL) byte. It's not foolproof - a slightly enhanced version checks whether 0 bytes constitute more than a small percentage (say, 10%) of the data - in text structures, 0 tends to be a terminator.

The latter test is probably the best choice here... it's simple, effective, and avoids the Microsoft issue.
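
A sketch of that zero-byte version (again just an illustration, not existing code):

static bool looksBinaryByNulCount(const QByteArray& data)
{
    if(data.isEmpty())
        return false;

    // Binary if NUL bytes make up more than ~10% of the data
    return data.count('\0') * 10 > data.size();
}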

(I might mention byte order marks and other Unicode indicators, but this note is already too long.)

Useful extra information

The info below often helps, please fill it out if you're able to. :)

What operating system are you using?

  • [x] Windows: (version: 10 Pro)
  • Linux: ( distro: ___ )
  • Mac OS: ( version: ___ )
  • Other: ___

What is your DB4S version?

  • 3.10.1
  • 3.10.0
  • 3.9.1
  • [x] Other: 3.10.99 (5-Jul)

Did you also

justinclift (Member) commented Jul 5, 2018

As a thought, if you resize the Edit Database Cell window/pane, is it possible to show the full contents correctly?

I'm kind of thinking maybe the left side of that window pane might be fixed width atm.

With the binary detection, this looks like the code which does it:

int EditDialog::checkDataType(const QByteArray& data)
{
    QByteArray cellData = data;

    // Check for NULL data type
    if (cellData.isNull()) {
        return Null;
    }

    // Check if it's an image. First do a quick test by calling canRead() which only checks the first couple of bytes or so. Only if
    // that returned true, do a more sophisticated test of the data. This way we get both, good performance and proper data checking.
    QBuffer imageBuffer(&cellData);
    QImageReader readerBuffer(&imageBuffer);
    QString imageFormat = readerBuffer.format();
    if(readerBuffer.canRead() && !readerBuffer.read().isNull())
        return imageFormat == "svg" ? SVG : Image;

    // Check if it's text only
    if(isTextOnly(cellData))
    {
        QJsonDocument jsonDoc = QJsonDocument::fromJson(cellData);
        if (!jsonDoc.isNull())
            return JSON;
        else
            return Text;
    }

    // It's none of the above, so treat it as general binary data
    return Binary;
}

justinclift (Member) commented Jul 5, 2018

Hmmm, seems to be the isTextOnly() call in there which is relevant:

bool isTextOnly(QByteArray data, const QString& encoding, bool quickTest)
{
    // If the data starts with a Unicode BOM, we always assume it is text
    if(startsWithBom(data))
        return true;

    // Truncate to the first couple of bytes for quick testing
    if(quickTest)
        data = data.left(512);

    // Convert to Unicode if necessary
    if(!encoding.isEmpty())
        data = QTextCodec::codecForName(encoding.toUtf8())->toUnicode(data).toUtf8();

    // Perform check
    return QString(data).toUtf8() == data;
}
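
For illustration only (this snippet is not in the project, and the sample bytes are made up): the final comparison rejects a lone 0xE4 because it is not a valid UTF-8 sequence, so it decodes to U+FFFD and the re-encoded bytes no longer match, whereas running the same bytes through the Latin-1 codec first makes the check pass:

// "Skäl" in Latin-1: the bytes 53 6B E4 6C (made-up sample data)
QByteArray cell = "Sk" + QByteArray::fromHex("e4") + "l";

// Default path (no encoding selected): 0xE4 alone is invalid UTF-8,
// decodes to U+FFFD, and the comparison fails -> treated as Binary
bool utf8Text = QString::fromUtf8(cell).toUtf8() == cell;                               // false

// With the Latin-1 codec applied first, the same bytes pass the check
QByteArray recoded = QTextCodec::codecForName("ISO-8859-1")->toUnicode(cell).toUtf8();
bool latin1Text = QString::fromUtf8(recoded).toUtf8() == recoded;                       // true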

tlhackque (Author) commented Jul 5, 2018

As a thought, if you resize the Edit Database Cell window/pane, is it possible to show the full contents correctly?

Yes - the problem is that my window is 27 in wide, the database cell pane occupies 6 in of that, and I'm still scrolling the row data. The horizontal scrollbar on the edit cell should work...

I'm kind of thinking maybe the left side of that window pane might be fixed width atm.

Yes, that sounds right, because of the dump. But I should still be able to scroll to see the right half...

justinclift (Member) commented Jul 5, 2018

Good point. That scroll bar should definitely be shown when the window size isn't big enough. 😄

justinclift added the bug label Jul 5, 2018

tlhackque (Author) commented Jul 5, 2018

Hmmm, seems to be the isTextOnly() call in there which is relevant:

That's a strange approach (seeing if a round trip through Unicode results in no change).

I think after the BOM check, you can simply do a memchr() and look for 0. Or a loop and count the zeros vs length....
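
Something like this (a sketch only; containsNul is a made-up helper, not DB4S code):

#include <cstring>

// After the BOM check, a single scan for a NUL byte decides "binary"
static bool containsNul(const QByteArray& data)
{
    return std::memchr(data.constData(), 0, static_cast<size_t>(data.size())) != nullptr;
    // Qt-only equivalent: return data.contains('\0');
}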

UTF-8 is a superset of Latin-1, so I'm not sure why my accented character is being rejected.

FWIW, the data in that cell is originally Latin-1 - Perl encodes it into its internal representation, and what happens inside the DBI/SQLite internals is a mystery to me at the moment.

justinclift (Member) commented Jul 5, 2018

From rough memory, I think we used to look for 0x0 characters, but we switched to the current approach for some reason. Not remembering exactly why off the top of my head though.

justinclift (Member) commented Jul 5, 2018

Hmmm, I wonder if it's possible that the 512-byte boundary the data is split on (for performance) could be leaving a malformed UTF-8 character, which is then capable of causing an incorrect result for the data type check.
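
If that is the cause, one possible guard (just a sketch, assuming the 512-byte quick test stays) would be to drop any UTF-8 sequence the truncation cuts in half before running the comparison:

QByteArray sample = data.left(512);

// Step back over trailing UTF-8 continuation bytes (10xxxxxx)
int end = sample.size();
int continuations = 0;
while(end > 0 && (static_cast<unsigned char>(sample.at(end - 1)) & 0xC0) == 0x80)
{
    --end;
    ++continuations;
}

// If the preceding lead byte announces more continuation bytes than were found,
// the truncation split a character: drop the incomplete sequence entirely
if(end > 0)
{
    const unsigned char lead = static_cast<unsigned char>(sample.at(end - 1));
    const int expected = (lead & 0xE0) == 0xC0 ? 1
                       : (lead & 0xF0) == 0xE0 ? 2
                       : (lead & 0xF8) == 0xF0 ? 3 : 0;
    if(expected > continuations)
        sample.truncate(end - 1);
}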

tlhackque (Author) commented Jul 5, 2018

It could happen that way, but not in this case: as shown in the screenshot, the data is only 90 bytes.

It is true that if you have reason to know that a field should be UTF-8, and it's malformed, calling it binary is reasonable. But you have no way to know. And in this case, the data is valid ISO Latin-1 (and so valid UTF-8).

tlhackque (Author) commented Jul 5, 2018

Not remembering exactly why off the top of my head though.

git blame.

(Which should have been called 'git credit', but Linus is rarely politically correct.)

mgrojo self-assigned this Jul 7, 2018

mgrojo added a commit that referenced this issue Jul 8, 2018

Issue #1461: Binary cell dump not visible - with no scrollbars
QHexEdit widget was being set disabled instead of read-only, according to
comments, because there wasn't any setReadOnly in qhexedit. At least in
the current version, the method actually exists, so it is now used.

This allows the user to move the scroll and select text in the binary mode
of the 'Edit Database Cell' dock.
mgrojo (Contributor) commented Jul 8, 2018

@tlhackque I made a commit yesterday that fixes the problem of the hex editor being disabled, instead of read-only.

That's a strange approach (seeing if a round trip through Unicode results in no change).

Regarding the binary detection, I think the application is doing the right thing. Of course, the term binary is a fuzzy concept, but we follow a simple approach: whatever is not fully correct text in the selected encoding is treated as binary (though only the first 512 bytes are tested, for performance). Would you say this is binary?
FF FF FF FF
In UTF-8 it is meaningless, but in Latin-1 it is a correct text string: ÿÿÿÿ

The default encoding is UTF-8, and consequently FF FF FF FF is considered a BLOB. But if you change the encoding to Latin-1, you will see it treated as text and shown as ÿÿÿÿ.
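
For illustration (not project code), the same four bytes pass or fail depending on the encoding used for the check:

const QByteArray bytes = QByteArray::fromHex("ffffffff");

// Default UTF-8 check: FF FF FF FF is not valid UTF-8 -> treated as a BLOB
bool validUtf8 = QString::fromUtf8(bytes).toUtf8() == bytes;   // false

// With Latin-1 selected, the same bytes decode to ordinary text
QString asLatin1 = QString::fromLatin1(bytes);                 // "ÿÿÿÿ"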

UTF-8 is a superset of Latin-1, so I'm not sure why my accented character is being rejected.

That's only true when you consider the set of characters (we should say Unicode in that sense, I think), but not in representation terms. Only US-ASCII is a common subset of UTF-8 and Latin-1. All the extended characters (over 7 bits) are represented differently.

E4 is "small a dieresis or umlaut"

E4 is not always 'ä'. For example, in UTF-8 E4 by itself is meaningless. On the other hand, 'ä' in UTF-8 is the two byte sequence: C3 A4.

A very common (and simple) test for "binary" is the presence of a 0 (NUL) byte.

Once again that depends on encoding. 'ä' in UTF-16 is 0x00E4 and text in UTF-16 shouldn't be considered binary if you know the correct encoding.
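
To illustrate (not project code): most UTF-16 code units contain a zero byte, so a NUL-based heuristic would flag perfectly good UTF-16 text as binary:

// 'ä' (U+00E4) in UTF-16LE is the byte pair E4 00
const QByteArray aUmlautUtf16 = QByteArray::fromHex("e400");
bool flaggedAsBinary = aUmlautUtf16.contains('\0');   // true, yet this is valid text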

The way to tell DB4S which encoding to use is clicking with the right mouse button over the column header and choosing 'Set encoding'.

In my opinion, the only sensible enhancement in this regard would be detecting the more usual text encodings, like ISO-8859-?, and applying them automatically or asking the user whether to apply them.

tlhackque (Author) commented Jul 8, 2018

I made a commit yesterday that fixes the problem of the hex editor being disabled, instead of read-only.

Thanks! I'll look for that the next time I pull a nightly build.

I agree that encoding vs. codepoints can be confusing. I should have been more careful. Still, it is the case that even in Unicode, 00 (C0 NUL) is a good marker for binary data.

I'm not sure if autodetection of "If not Unicode, then ..." is feasible in the general case, but there are certainly some heuristics that could be tried. But...

The way to tell DB4S which encoding to use is clicking with the right mouse button over the column header and choosing 'Set encoding'.

Another great but hidden feature. It should be available in "Edit Database Cell" - at least, that's where I looked for it. Or by right-clicking on a cell, which is the second place I looked.

How about a pulldown next to Mode for Encoding? If that had been there, I'd never have started this thread :-)

That's an extensive collection of encodings - nice job on getting them all in. Is there a reason for the ordering in the dropdown menu? I guess I understand putting the most frequently used at the top, but the rest seem to be in random order. I'd list the frequent (by some measure) ones at the top of the list, then a separator, then all in alphabetical order...

That's only true when you consider the set of characters (we should say Unicode in that sense, I think), but not in representation terms. Only US-ASCII is a common subset of UTF-8 and Latin-1. All the extended characters (over 7 bits) are represented differently.

Depends on whether you are thinking of the encoding or the code points. UTF-8 (which I restricted my comments to) uses the same code points as 8859-1 for 0x00-0xFF. See unicode.org P37:

ASCII and Latin-1 Compatibility Area. For compatibility with the ASCII and ISO 8859-1, Latin-1 standards, this area contains the same repertoire and ordering as Latin-1. Accordingly, it contains the basic Latin alphabet, European digits, and then the same collection of miscellaneous punctuation, symbols, and additional Latin letters as are found in Latin-1.

And even the Latin-* control codes have matching code points - unicode.org P 544

There are 65 code points set aside in the Unicode Standard for compatibility with the C0 and C1 control codes defined in the ISO/IEC 2022 framework. The ranges of these code points are U+0000..U+001F, U+007F, and U+0080..U+009F, which correspond to the 8-bit controls 00₁₆ to 1F₁₆ (C0 controls), 7F₁₆ (delete), and 80₁₆ to 9F₁₆ (C1 controls), respectively. For example, the 8-bit legacy control code character tabulation (or tab) is the byte value 09₁₆; the Unicode Standard encodes the corresponding control code at U+0009. The Unicode Standard provides for the intact interchange of these code points, neither adding to nor subtracting from their semantics. The semantics of the control codes are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.

On the other hand, the UTF-8 encoding uses a set MSB to indicate that the following bits encode a multi-byte sequence (unicode.org P95). So at that level, yes, it's trickier to distinguish Unicode from Latin-1.

I appreciate all the effort you've been putting in to responding to my experience. Many thanks!

justinclift (Member) commented Jul 9, 2018

Another great but hidden feature.

Yeah, this is one of the things which has bugged me for a while. It's reasonably common for us to implement new things and requests, but they're often very hmmm... hard to discover (?) unless the person already knows about them.

We seem to struggle in figuring out how to implement the "make this easily discoverable by new users". Suggestions welcome of course. 😄

How about a pulldown next to Mode for Encoding?

Awesome, good idea!

mgrojo (Contributor) commented Jul 9, 2018

Another great but hidden feature. It should be available in "Edit Database Cell" - at least, that's where I looked for it. Or by right-clicking on a cell, which is the second place I looked.

Yes, I know it's hidden. Indeed, I only learned about it when it was mentioned in some issue 😄 I then discovered all the menu options in the contextual menu of the column header.

How about a pulldown next to Mode for Encoding?

Awesome, good idea!

Well, the space is limited there. What about another entry under the future Tools menu in the menu bar? At least the "Set encoding for all tables" option. "Set encoding" for this table has a context problem: it would only make sense in the Data Browse tab.

Another option for both entries would be in the contextual menu of the Edit DB Cell dock.

tlhackque (Author) commented Aug 23, 2018

Well, the space is limited there.

I meant here, where I seem to have space.

[screenshot]

Or, if that's a problem on smaller screens, the "Mode" label could become something like "View", which cascades to Mode and Encoding. E.g.

VIEW
| -- Mode --> Text
              Binary
              Image
              JSON
              XML
| -- Encoding
            UTF-8
            ISO-8859-15
             ...

That would be expandable, for when someone comes up with "Writing mode LeftToRight/RightToLeft", "Suppress non-printables" - or whatever else the future brings. (No, I'm not suggesting either of these right now, just observing that the future usually brings "more".)
