Add support for a chunked string construct #10

Closed
ghost opened this Issue Aug 23, 2012 · 15 comments


@ghost
ghost commented Aug 23, 2012

Multiple people have requested support for an "unbounded" or "chunked" string construct for streaming unknown-length strings.

Originally I thought it was sufficient to simply suggest transferring unbounded strings as an ARRAY of STRING elements. People have correctly pointed out where this falls apart:

An ARRAY of STRINGS is not the same thing as a single STRING that could potentially be huge.

When I thought more about this at the library level, it made more sense. For example, processing an ARRAY of STRINGS in Java will parse each string into a char[] object and return it, regardless of how big it is. For sufficiently large amounts of data, this can lead to a lot of overhead. If the library knew it was receiving an unbounded or chunked string, it could respond to the caller with a Stream instead of trying to parse out all the text itself.

James Manger proposed the following use of '*' to define the type:

    [S][*]
        [S][3][Bob]
        [S][4][ ran]
        [S][5][ fast]
    [E]

I added the use of 'E' as the terminator... in a lot of ways this becomes a specialized array of String elements that library implementors may be able to handle more efficiently, and it provides the ability to mask the chunked string's total length from the caller.

In Java this would work by returning an InputStreamReader to the client that can simply keep calling read() to pull chars out of the stream.

Under the covers, the standard STRING decoding logic would be running, decoding each chunked string out of the stream (until it hit [E]) and returning it to calls to Reader.read() as incremental chunks of chars.
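A rough sketch of what that Reader could look like, assuming a hypothetical ChunkSource interface for the low-level chunk decoding (neither name is a real API; this is illustration only):

    import java.io.IOException;
    import java.io.Reader;

    // Assumed low-level parser interface (hypothetical): returns the chars
    // of the next [S][len][data] chunk, or null once [E] is reached.
    interface ChunkSource extends java.io.Closeable {
        char[] nextChunk() throws IOException;
    }

    // Sketch of the Reader the library could hand back to the caller.
    // The caller just calls read(); chunks are decoded lazily, one at a time.
    class ChunkedStringReader extends Reader {
        private final ChunkSource source;
        private char[] chunk = new char[0];
        private int pos = 0;
        private boolean terminated = false;

        ChunkedStringReader(ChunkSource source) { this.source = source; }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            while (pos == chunk.length) {          // current chunk exhausted
                if (terminated) return -1;
                char[] next = source.nextChunk();  // decode the next [S] chunk
                if (next == null) {                // hit the [E] terminator
                    terminated = true;
                    return -1;
                }
                chunk = next;
                pos = 0;
            }
            int n = Math.min(len, chunk.length - pos);
            System.arraycopy(chunk, pos, cbuf, off, n);
            pos += n;
            return n;
        }

        @Override
        public void close() throws IOException { source.close(); }
    }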

This is a pretty slick idea. I'd like to know what others think.

@ghost ghost was assigned Aug 23, 2012
@kxepal
Collaborator
kxepal commented Aug 23, 2012

What's the difference between the next two data snippets:

[A]
    [S][4294967296][... very very long story ...]
    [S][4294967296][... very very long story ...]
    [S][4294967296][... very very long story ...]
    [S][4294967296][... very very long story ...]
    [S][4294967296][... very very long story ...]
[E]

and

[S] [*]
    [S][4294967296][... very very long story ...]
    [S][4294967296][... very very long story ...]
    [S][4294967296][... very very long story ...]
    [S][4294967296][... very very long story ...]
    [S][4294967296][... very very long story ...]
[E]

?

Actually, the proposed type doesn't solve the problem: chunks could be of any size, so we're just creating a typed array, nothing more. If we really need a chunked string, we probably need to define the chunk size first:

[S] [*] [I] [4096]
    [S][4096][... chunk of very very long story ...]
    [S][4096][... chunk of very very long story ...]
    [S][4096][... chunk of very very long story ...]
    [S][4096][... chunk of very very long story ...]
    [S][4096][... chunk of very very long story ...]
    ...
    [S][4][tail]
[E]

This looks better: we're forcing chunks to be no longer than the specified length, so the type now actually matters. And again I'd like to recall implementations of chunked responses at the protocol level (:
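A minimal sketch of the producing side under this proposal: declare the maximum chunk size up front, then emit [S] chunks no longer than that, ending with [E]. The marker and length encodings below are simplified assumptions, and chunking is done by char count (UTF-8 expansion for non-ASCII text is ignored for brevity):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.io.Reader;
    import java.nio.charset.StandardCharsets;

    final class ChunkedStringWriter {
        static final int MAX_CHUNK = 4096;

        static void write(Reader text, OutputStream out) throws IOException {
            out.write('S'); out.write('*');             // [S][*]: chunked string
            out.write('I'); writeInt16(out, MAX_CHUNK); // [I][4096]: max chunk size
            char[] buf = new char[MAX_CHUNK];
            int n;
            while ((n = text.read(buf)) != -1) {
                byte[] utf8 = new String(buf, 0, n).getBytes(StandardCharsets.UTF_8);
                out.write('S');                         // one chunk: [S][len][data]
                out.write('I'); writeInt16(out, utf8.length);
                out.write(utf8);
            }
            out.write('E');                             // terminator
        }

        private static void writeInt16(OutputStream out, int v) throws IOException {
            out.write((v >>> 8) & 0xFF);                // big-endian int16
            out.write(v & 0xFF);
        }
    }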

@kxepal
Collaborator
kxepal commented Aug 23, 2012

While this question is mostly about how well the data structure was designed, an ARRAY of STRING types can't solve the problem of a single large item:

[A]
    [S] [L][4294967296][... very very long story ...]
    [S] [i][3][foo]
    [S] [L][4294967296][... very very long story ...]
    [S] [i][3][bar]
    [S] [L][4294967296][... very very long story ...]
[E]

In this case a chunked string may help us. It would probably be better to assign it a specific marker, for example C, as a Chunked type with length support and no data block, trailed by an E marker. (A decoder sketch enforcing these rules follows the examples below.)

Valid cases

Empty case:

[C] [i] [42]
[E]

Normal case:

[C] [i] [3]
    [S] [i] [3] [foo]
    [S] [i] [3] [bar]
    [S] [i] [3] [baz]
[E]

While the C length nominally specifies the string chunk length, it is actually a maximum chunk length. The next case is valid:

[C] [i] [3]
    [S] [i] [3] [foo]
    [S] [i] [3] [bar]
    [S] [i] [3] [baz]
    [S] [i] [1] [!]
[E]

and this is valid too:

[C] [i] [42]
    [S] [i] [3] [foo]
    [S] [i] [3] [bar]
    [S] [i] [3] [baz]
[E]

NOOPs are welcome as well:

[C] [i] [3]
    [N]
    [N]
    [S] [i] [3] [foo]
    [S] [i] [3] [bar]
    [N]
    [S] [i] [3] [baz]
    [N]
    [N]
    [N]
[E]

Invalid cases

Chunk size overflow:

[C] [i] [3]
    [S] [i] [7] [abcdefg]
    [S] [i] [3] [bar]
    [S] [i] [3] [baz]
[E]

Chunk item is not a string:

[C] [i] [3]
    [S] [i] [3] [foo]
    [Z]
[E]

An H (huge number) is not a string either:

[C] [i] [3]
    [H] [3] [3.14]
[E]

Chunks of chunks:

[C] [i] [3]
    [S] [i] [3] [foo]
    [C] [I] [4096]
        [S] [I] [4096] [...]
        [S] [I] [4096] [...]
        [S] [I] [4096] [...]
    [E]
    [S] [i] [3] [bar]
[E]
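A minimal sketch of a decoder enforcing these rules, written against a hypothetical pull-parser interface (Tokens, nextMarker, readLength and skipStringBody are assumptions for illustration, not real API):

    import java.io.IOException;

    // Assumed pull-parser primitives (hypothetical).
    interface Tokens {
        char nextMarker() throws IOException;   // e.g. 'C', 'S', 'N', 'E', 'Z', 'H'
        int readLength() throws IOException;    // reads a [i]/[I]/[L] length value
        void skipStringBody(int len) throws IOException;
    }

    // [C][maxLen], then only [S] chunks no longer than maxLen
    // (plus [N] NOOPs), closed by [E].
    final class ChunkedStringValidator {
        static void validate(Tokens in) throws IOException {
            if (in.nextMarker() != 'C') throw new IOException("expected [C]");
            int maxLen = in.readLength();       // declared max chunk length
            for (char m = in.nextMarker(); m != 'E'; m = in.nextMarker()) {
                switch (m) {
                    case 'N':                   // NOOPs are welcome anywhere
                        break;
                    case 'S':
                        int len = in.readLength();
                        if (len > maxLen)       // invalid: chunk size overflow
                            throw new IOException("chunk of " + len + " > " + maxLen);
                        in.skipStringBody(len);
                        break;
                    case 'C':                   // invalid: chunks of chunks
                        throw new IOException("nested [C] container");
                    default:                    // invalid: [Z], [H], ... not strings
                        throw new IOException("non-string chunk item: " + m);
                }
            }
        }
    }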
@ghost
ghost commented Aug 24, 2012

A few notes...

A specialized, unbounded container type is not the same as an array of strings.

When processing an array of strings, I have no idea whether all the values are even strings; when I process an unbounded C chunked container, I know all the types are the same.

I don't like the container element count; this was a mistake I made with the original A and O types in the spec, and we had to undo it.

We also don't need the specified default chunk length at the root.

I DO think the idea of a typed chunked container is interesting though, if we support chunked strings and chunked binary data, for example... Maybe like:

[S][*] for unbounded string container
[B][*] for unbounded binary container

Using C was a good idea, but once I realized we had multiple types of unbounded containers, it was just easier (and I think more intuitive) to overload the existing markers with specialized lengths to indicate them.

In both cases, the processing logic is very very similar.

Thoughts?

@AnyCPU
Collaborator
AnyCPU commented Aug 24, 2012
[S][*]
    [S][3][Bob]
    [S][4][ ran]
    [S][5][ fast]
[E]

Can be represented like an object:

Msg
{
    type = 'string';
    body =
    {
        'first str',
        'str 1',
        ...,
        'super str'
    }
}

A little overhead, but fully capable without spec modification.

@kxepal
Collaborator
kxepal commented Aug 24, 2012

We also don't need the specified default chunk length at the root.

I just had a thought that a defined chunk length would help in building custom UBJSON-based streaming protocols that mostly operate on text/binary data, since it lets readers stop worrying about reads that overlap chunk boundaries. If that doesn't matter, the trick can be moved to the client implementation.

[S][*] for unbounded string container
[B][*] for unbounded binary container

Using C was a good idea, but once I realized we had multiple types of unbounded containers, it was just easier (and I think more intuitive) to overload the existing markers with specialized lengths to indicate them.

There are a few problems with your idea:

  1. There is no way to create an array of integers, floats, bools and so on; it mostly looks like a very limited typed array.
  2. It's not parser friendly, since you put a special magic value in the place where a length value is expected. Moreover, you need to read two bytes instead of one to decide which data constructor to use.
  3. It's not an optimal solution. If you define that a C container may contain data of only a single type, there is no reason to repeat the type marker for each item. This saves us 1 byte per item and solves the problem of storing binary data in the way @AnyCPU mentioned in issue #12.

Assume you need to store a 10000-byte binary file in UBJSON. Solutions:

  1. S and base64. The result would be 13512 bytes. (The arithmetic is checked in the sketch below.)

  2. A of i gives you 20002 bytes, because each array item needs a type marker and a value - 2 bytes per item.

  3. C with strict type i: 10003 bytes - 10000 for the raw data and 3 for the C, i and E markers.

    [C] [i]
        [42]
        [211]
        [125]
        [23]
        ...
    [E] 
    
  4. Adding a new B type would produce 10004 bytes - 10000 for the raw data, two for the B and I markers and two for the length - at the price of being less Universal and JSON, but more Binary.

Note that only option 4 breaks JSON compatibility - C is just an optimized array that can be converted back from JSON with ease.
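A quick back-of-the-envelope check of these figures. One assumption: the base64 number is consistent with MIME-style output that adds a line feed after every 76 characters (as the common base64 CLI produces); the marker counts are simplified as in the list above:

    public class SizeCheck {
        public static void main(String[] args) {
            int raw = 10000;

            int b64 = 4 * ((raw + 2) / 3);        // plain base64: 13336
            int wrapped = b64 + (b64 + 75) / 76;  // + one LF per 76-char line: 13512
            System.out.println("S + base64 : " + wrapped + " bytes of text");

            int arrayOfI = 2 + 2 * raw;           // [A]+[E] + ([i]+byte) each: 20002
            System.out.println("A of i     : " + arrayOfI + " bytes");

            int chunkedC = 3 + raw;               // [C]+[i]+[E] + raw bytes: 10003
            System.out.println("C of i     : " + chunkedC + " bytes");
        }
    }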

@ghost
ghost commented Aug 24, 2012

Michael,

I think by the body notation you meant an array, right? If so, you are absolutely right that it can be represented that way, but unfortunately on the processing side (library implementation) we don't know it is an unbounded array of a potentially large number of strings, so we cannot do anything optimized like returning a Stream for the caller to pull content out of.

The customized data structure is primarily motivated by making library implementors' lives easier, so they know: "Oh, I just hit a potentially huge section of data; instead of parsing it in, I'm just going to return a stream."

Alex,

Ah! This whole argument didn't dawn on me until I saw your last Example #3 -- are you and Michael suggesting a new [C]hunked type whose first argument is a TYPE indicator -- initially only [i] (for byte) and [S] (for String)?

If so, I like that idea... I was misunderstanding what was being proposed before unfortunately.

SIDE QUESTION: If we are going to use [i] this way for a typing indicator, would it be more intuitive to keep it named [B] for Byte?

    [C][B]
    ...
    [E] -- chunked Byte collection.

    [C][S]
    ...
    [E] -- chunked String collection.

?

@AnyCPU
Collaborator
AnyCPU commented Aug 25, 2012

As I understand it, we want to have a strongly typed container without overhead.
[C][E] - construction for Spec 8
{} - object in Spec 9
[] - array in Spec 9
So we need one more pair of brackets. According to http://en.wikipedia.org/wiki/Bracket I propose to use the chevrons < and >.

{} - often used for objects and similar data
[] - always used for arrays
<> - used for generics.
So we will have generic arrays.

[<] [generic type]
[... items ...]
[>]

generic type is a type from Spec 9.

For example:
an array of int8, or bytes:
[<][i]
[0x01]
[0xa3]
[>]

array of strings
[<][S]
[8][asdfghjk]
[1][Y]
[>]
For strings, the type specifier [S] in the body can be omitted and only the length specifier used.
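A small sketch of this encoding, stating the element type once and dropping the per-item [S] marker. GenericArrayWriter is a hypothetical name, and the length is simplified to a single byte purely for illustration:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    final class GenericArrayWriter {
        // Emits [<][S] [len][bytes] [len][bytes] ... [>]
        static byte[] writeStrings(String... items) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write('<');                      // open generic array
            out.write('S');                      // element type, stated once
            for (String s : items) {
                byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
                out.write(utf8.length);          // length only, no [S] marker
                out.write(utf8);
            }
            out.write('>');                      // close generic array
            return out.toByteArray();
        }
    }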

@AnyCPU
Collaborator
AnyCPU commented Aug 25, 2012

[] is just an array of any type - an any-type array
<> is a specialized array

@kxepal
Collaborator
kxepal commented Aug 25, 2012

@AnyCPU, awesome idea with < >!
+1 for all points.

However, I suppose it couldn't be backported to Draft 8, because this is a new feature, which has to increment the version number to avoid conflicts between implementations.

@thebuzzmedia ,

SIDE QUESTION: If we are going to use [i] this way for a typing indicator, would it be more intuitive to keep it named [B] for Byte?

I suppose not, because this binary type is composite: generic array + 8-bit integer. Why couldn't I encode binary data using 2- or 4-byte integers? Let's just say hello to UTF-16 and UTF-32 support! (: Better to keep the integers in a consistent state than to bend them to one specific use case.

@AnyCPU
Collaborator
AnyCPU commented Aug 25, 2012

We need to keep moving forward, and an upgrade is not a big deal because this version consumes the same system resources; it is just another, updated version.
The main goal is a version fully compliant with JSON.

Modern apps are constructed as modular systems, so if a system has a stable public interface, the internal parts can be safely modified.
We say "send this message to the target". We say what to do, not how to do it.

@adilbaig
   [<] [generic type]
   [... items ...]
   [>]

Very cool. Quick question: how can we differentiate between "a large string that is chunked" and "an array of strings that is chunked"?

On another note, I have a thought that takes us back a bit.

Chunks are supposed to be a solution for sending large streams of data down the wire while intermittently making sure they are not corrupt. At least that's how I see it being most useful. It would be helpful in that respect to enclose entire JSON documents in a chunk, with intermittent checksums that act as both verifiers and keep-alive notices. Ex:

[<]  //Notice, no generic type specified

    [checksum for next X bytes] //This checksum verifies the following X bytes
    [ ... part of a JSON document ... ]

    [checksum for next X bytes] //This checksum verifies the following X bytes
    [ ... another part of a JSON document ... ]
    .
    .
[>] //End of stream
//Now we can process this doc, or part of doc.

This makes chunks generically more useful, and agnostic to data type and data structure. With this design, even parts of a JSON doc can be chunked with integrity. Thoughts?
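A sketch of the idea: wrap arbitrary document bytes in checksummed chunks. The framing here (a 4-byte length plus a 4-byte CRC32 before each chunk) is invented for illustration; none of it is part of UBJSON:

    import java.io.ByteArrayOutputStream;
    import java.util.zip.CRC32;

    final class ChecksummedChunks {
        static byte[] frame(byte[] doc, int chunkSize) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write('<');                            // stream start, no type
            for (int off = 0; off < doc.length; off += chunkSize) {
                int n = Math.min(chunkSize, doc.length - off);
                CRC32 crc = new CRC32();
                crc.update(doc, off, n);               // checksum for next n bytes
                writeInt32(out, n);
                writeInt32(out, (int) crc.getValue());
                out.write(doc, off, n);                // part of the document
            }
            out.write('>');                            // end of stream
            return out.toByteArray();
        }

        private static void writeInt32(ByteArrayOutputStream out, int v) {
            out.write((v >>> 24) & 0xFF); out.write((v >>> 16) & 0xFF);
            out.write((v >>> 8) & 0xFF);  out.write(v & 0xFF);
        }
    }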

@AnyCPU
Collaborator
AnyCPU commented Aug 26, 2012
  1. string

     [<][S]
     [1][q]
     [5][fg 44]
     [>]

     array of strings

     [ <S [1][q] > <S [5][fg 44] > ]

  2. JSON that contains JSON
class DataWithCheckSum
{
    string checksum;
    object data;

    DataWithCheckSum(object toPack)
    {
        data = toPack;
        checksum = hash(data);
    }

    JSON getDoc()
    {
        return getJSON(this);
    }
}
@kxepal
Collaborator
kxepal commented Aug 26, 2012

Quick question, how can we differentiate between "a large string that is chunked" and "an array of strings that is chunked"?

Actually, the topic title is misleading: since there will not be any limit on chunk size, this is just a typed array that an app can handle in specific ways without worrying about the collection items:

  • a typed array of int8 may be just an array of numbers, or a binary container;
  • a typed array of int32 may be just an array of numbers, or a UTF-32 string;
  • a typed array of floats may be just an array of floats, or GIS information, or object coordinates.

Note that a typed array of booleans wouldn't be available in UBJSON, but for JSON this is a very expensive structure too, so probably nothing is lost in this case.

In this form it solves two problems:

  1. It provides a guarantee about the stored data types;
  2. Data is stored in a very compact way, with less overhead than the regular array type.

It is not a new data type; it is an optimization of a base JSON type, just like the int8/int16/int32/int64 and float/double types.

Chunks are supposed to be a solution for sending large streams of data down the wire while intermittently making sure they are not corrupt. At least that's how I see it being most useful. It would be helpful in that respect to enclose entire JSON documents in a chunk, with intermittent checksums that act as both verifiers and keep-alive notices.

This is a good and seductive idea, but it's mostly about data protocol design, not data format. Instead of checksums you would probably be happier signing the data with a PGP key to ensure it comes from a trusted endpoint and that there wasn't any data injection, but all of this is out of UBJSON's scope.

@AnyCPU
Collaborator
AnyCPU commented Aug 26, 2012

Note that a typed array of booleans wouldn't be available in UBJSON, but for JSON this is a very expensive structure too, so probably nothing is lost in this case.

[TFTFTFTFTFTFT] even shorter, hmm...

It is not a new data type; it is an optimization of a base JSON type, just like the int8/int16/int32/int64 and float/double types.

+1

@ghost
ghost commented Aug 26, 2012

Combining Issue #10 and Issue #12 into the 'support for type-inferred arrays' work tracked in Issue #13:
#13

@ghost ghost closed this Aug 26, 2012
@ghost ghost removed this from the Draft 9 milestone Feb 25, 2014
@ghost ghost added wontfix and removed wontfix labels Feb 25, 2014
@ghost ghost was unassigned Aug 15, 2016