Binary cell dump not visible - with no scrollbars (and it's not even a real blob) #1461
Details for the issue
What did you do?
Selected a database cell classified as a "blob"
What did you expect to see?
A binary and ASCII dump. Everything is grayed out, perhaps because the DB is open read-only.
What did you see instead?
Only the leftmost 2 chars of the ASCII dump.
It is also mildly annoying that a single G1 character ('E4'), which is a valid Latin-1 graphic character, suffices to make this text look like a blob.
E4 is "small a dieresis or umlaut"
The binary test should require C0 or C1 control characters; G1 graphics are not an indication of binary data.
For ISO standard (non-MS) text:
The C0 range is 0x00 - 0x1F (Text can be expected to have some of these:
The C1 range is 0x80 - 0x9F, none of which are common in ANSI text. (Windows, however, uses many of these for graphics.)
The G0 range is 0x20 - 0x7e (7f is DEL).
I would say that a field is "binary" if it has any bytes in the C0 set (except as noted) or any of the C1 set.
Some might disagree because of the Microsoft usage.
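The control-character test proposed above could be sketched roughly as follows. This is only an illustration, not DB4S code: the function name is hypothetical, and because the list of tolerated C0 characters is not spelled out above, the set used here (tab, line feed, carriage return) is an assumption.

```python
# Assumption: which C0 controls are acceptable in text. The issue does not
# list them, so tab, LF and CR are used here purely for illustration.
ALLOWED_C0 = frozenset(b"\t\n\r")


def looks_binary_by_controls(data: bytes) -> bool:
    """Return True if data contains a disallowed C0 byte or any C1 byte."""
    for byte in data:
        if byte <= 0x1F and byte not in ALLOWED_C0:
            return True  # C0 control other than the tolerated ones
        if 0x80 <= byte <= 0x9F:
            return True  # C1 control
    return False  # only G0, G1, DEL and tolerated C0 bytes were seen
```

Note that this treats each byte as one character, which fits single-byte ISO 8859 text. In UTF-8, bytes in the 0x80 - 0x9F range also occur as continuation bytes, so the check would need to run on decoded characters instead.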
A very common (and simple) test for "binary" is the presence of a 0 (NUL) byte. It's not foolproof. A slightly enhanced version checks whether 0 bytes constitute more than a small percentage (say, 10%) of the data, since in text structures 0 tends to be a terminator....
The latter test is probably the best choice here... it's simple, effective, and avoids the Microsoft issue.
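A minimal sketch of both variants of the NUL test, assuming the 10% threshold suggested above. The function names and the default parameter are made up for illustration.

```python
def has_nul(data: bytes) -> bool:
    """Simple variant: any NUL byte marks the data as binary."""
    return b"\x00" in data


def looks_binary_by_nul(data: bytes, max_nul_fraction: float = 0.10) -> bool:
    """Enhanced variant: binary only if NUL bytes exceed the given fraction."""
    if not data:
        return False  # empty data is treated as text
    return data.count(b"\x00") / len(data) > max_nul_fraction
```

With these definitions, a string with a single trailing terminator such as `b"hello world\x00"` stays below the threshold, while data like `b"\x00\x01\x00\x02"` is flagged. Very short values are sensitive to the threshold, since one terminator in five bytes is already 20%.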
(I might mention byte order marks and other Unicode indicators, but this note is already too long.)
Useful extra information
The info below often helps, please fill it out if you're able to. :)
What operating system are you using?
What is your DB4S version?
Did you also
As a thought, if you resize the Edit Database Cell window/pane, is it possible to show the full contents correctly?
I'm kind of thinking maybe the left side of that window pane might be fixed width atm.
With the binary detection, this looks like the code which does it:
Yes, the problem is that my window is 27 in wide, the database cell occupies 6 in of that, and I'm still scrolling the row data. The horizontal scrollbar on the edit cell should work...
Yes, sounds right because of the dump. But still should be able to scroll to see the right half...
That's a strange approach (seeing if round trip through unicode results in no change).
I think after the BOM check, you can simply do a memchr() and look for 0. Or a loop and count the zeros vs length....
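As a rough sketch of that suggestion (BOM check first, then a search for a zero byte), something like the following could work. It is not the DB4S implementation; `classify` and its return strings are hypothetical, and `b"\x00" in data` stands in for `memchr()`.

```python
import codecs

# UTF-32 BOMs are checked first because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def classify(data: bytes) -> str:
    """Return 'text:<encoding>' if a BOM is present, else 'binary' or 'text'."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return "text:" + name  # a BOM announces Unicode text
    # No BOM: fall back to the NUL-byte search.
    return "binary" if b"\x00" in data else "text"
```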
UTF-8 is a superset of Latin-1, so I'm not sure why my accented character is being rejected.
FWIW, the data in that cell is originally Latin-1 - Perl encodes it into its internal representation, and what happens inside the DBI/SQLite internals is a mystery to me at the moment.
Could happen that way. But not in this case, as shown in the screenshot, the data is only 90 bytes here.
It is true that if you have reason to know that a field should be UTF-8, and it's malformed, calling it binary is reasonable. But you have no way to know. And in this case, the data is valid ISO Latin-1 (and so valid UTF-8).
@tlhackque I made a commit yesterday that fixes the problem of the hex editor being disabled, instead of read-only.
Regarding the binary detection, I think the application is doing the right thing. Of course, "binary" is a diffuse concept, but we follow a simple approach: whatever is not fully correct text in the selected encoding is treated as binary (though only 512 bytes are tested, for performance). Would you say this is binary?:
The default encoding is UTF-8 and consequently FF FF FF FF is considered a BLOB. But if you change the encoding to Latin-1, you will see it treated as text and shown as: ÿÿÿÿ.
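The approach described here can be illustrated with a small sketch. It is not the actual DB4S code: the function name is hypothetical, and it assumes that the 512 bytes tested are the first 512.

```python
import codecs


def is_text(data: bytes, encoding: str = "utf-8", sample_size: int = 512) -> bool:
    """Return True if the sampled bytes decode cleanly in the given encoding."""
    sample = data[:sample_size]  # assumption: the first 512 bytes are sampled
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        # final=False when the sample was cut short, so a multi-byte sequence
        # split at the sample boundary is not reported as an error.
        decoder.decode(sample, final=len(data) <= sample_size)
    except UnicodeDecodeError:
        return False
    return True
```

Under these assumptions, `is_text(b"\xff\xff\xff\xff")` is false with the UTF-8 default, while `is_text(b"\xff\xff\xff\xff", "latin-1")` is true, which matches the BLOB versus ÿÿÿÿ behavior described above.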
That's only true when you consider the set of characters (we should say Unicode in that sense, I think), but not in representation terms. Only US-ASCII is a common subset of UTF-8 and Latin-1. All the extended characters (over 7 bits) are represented differently.
E4 is not always 'ä'. For example, in UTF-8 E4 by itself is meaningless. On the other hand, 'ä' in UTF-8 is the two byte sequence: C3 A4.
Once again that depends on encoding. 'ä' in UTF-16 is 0x00E4 and text in UTF-16 shouldn't be considered binary if you know the correct encoding.
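To make the byte-level difference concrete, here is a short sketch that encodes 'ä' in the three encodings mentioned and checks whether a lone E4 byte is valid UTF-8. The helper names are hypothetical.

```python
def encodings_of(ch: str) -> dict:
    """Map each encoding name to the hex bytes that represent ch."""
    return {enc: ch.encode(enc).hex() for enc in ("latin-1", "utf-8", "utf-16-be")}


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(encodings_of("\u00e4"))  # U+00E4 is 'ä'
print(decodes_as_utf8(b"\xe4"))      # a lone E4 byte
print(decodes_as_utf8(b"\xc3\xa4"))  # the two-byte UTF-8 form of 'ä'
```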
The way to tell DB4S which encoding to use is clicking with the right mouse button over the column header and choosing 'Set encoding'.
In my opinion, the only sensible enhancement in this regard would be detecting the more usual text encodings, like ISO-8859-?, and applying them automatically or asking the user whether to apply them.
I agree that encoding vs. codepoints can be confusing. I should have been more careful. Still, it is the case that even in Unicode, 00 (C0 NUL) is a good marker for binary data.
I'm not sure if autodetection of "If not Unicode, then ..." is feasible in the general case, but there are certainly some heuristics that could be tried. But...
Another great, but hidden, feature. It should be available in "Edit Database Cell" - at least, that's where I looked for it. Or by right-clicking on a cell, which is the second place I looked.
How about a pulldown next to Mode for Encoding? If that had been there, I'd never have started this thread :-)
That's an extensive collection of encodings - nice job on getting them all in. Is there a reason for the ordering in the dropdown menu? I guess I understand putting the most frequently used at the top, but the rest seem to be in random order. I'd list the frequent (by some measure) ones at the top of the list, then a separator, then all in alphabetical order...
Depends on if you are thinking encoding or code points. UTF-8 (which I restricted my comments to) uses the same codepoints as 8859-1 - for 0x00-0xFF. See unicode.org P37:
And even the Latin-* control codes have matching code points - unicode.org P544
On the other hand, the encoding uses the MSB set to indicate that the next few bits encode multi-byte strings (unicode.org P95). So at that level, yes, it's trickier to distinguish Unicode from Latin-1.
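Both halves of this point can be checked with a short sketch: the Latin-1 byte values coincide with the Unicode code points 0x00-0xFF, yet UTF-8 needs two bytes for everything from 0x80 upward. The function names are hypothetical.

```python
def latin1_matches_unicode() -> bool:
    """True if every Latin-1 byte decodes to the code point of the same value."""
    return all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))


def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses to encode the given code point."""
    return len(chr(code_point).encode("utf-8"))


print(latin1_matches_unicode())
print([utf8_length(cp) for cp in (0x41, 0x7F, 0x80, 0xE4, 0xFF)])
```

So the code points agree, but only the 7-bit range shares a byte representation between the two encodings.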
I appreciate all the effort you've been putting in to responding to my experience. Many thanks!
Yeah, this is one of the things which has bugged me for a while. It's reasonably common for us to implement new things and requests, but they're often very hmmm... hard to discover (?) unless the person already knows about them.
We seem to struggle with figuring out how to implement the "make this easily discoverable by new users" part. Suggestions welcome of course.
Awesome, good idea!
Yes, I know it's hidden. Indeed, I only learned about it when it was mentioned in some issue.
Well, the space is limited there. What about another entry in the future Tools entry in the menu bar? At least the "Set encoding for all tables" option. "Set encoding" for this table has a context problem: it would only make sense in the Data Browse tab.
Another option for both entries would be in the contextual menu of the Edit DB Cell dock.
I meant here, where I seem to have space.
Or, if that's a problem on smaller screens, the "Mode" label could become something like "View", which cascades to Mode and Encoding. E.g.
That would be expandable, for when someone comes up with "Writing mode LeftToRight/RightToLeft", "Suppress non-printables" - or whatever else the future brings. (No, I'm not suggesting either of these right now, just observing that the future usually brings "more".)