ReQL proposal: blob type #2612

Closed

coffeemug opened this issue Jun 26, 2014 · 58 comments

@coffeemug
Contributor

This proposal covers the binary data part of #137.

A blob is a non-streaming, reasonably small (< ~10 MB), base64-encoded binary data type. The user can store a blob in a document.

  • For now, any time the document gets loaded, the blob gets loaded (this might change if we allow not loading fields if the user plucks them out). EDIT: It'd be really nice to optimize that right away, but I don't know how hard that is.
  • Any time the document gets copied, the blob gets copied.
  • Hopefully if only non-blob fields in a document get updated, the blob doesn't get rewritten.
  • For now, the only way to modify a blob is to rewrite it entirely.
> r.expr({ 'binary': Uint8Array(...) }).run(conn)
{ 'binary': Uint8Array(...) }

> r.expr({ 'binary': Uint8Array(...) }).run(conn, { blobFormat: 'native' })
{ 'binary': Uint8Array(...) }

> r.expr({ 'binary': Uint8Array(...) }).run(conn, { blobFormat: 'raw' })
{ 'binary': {"$reql_type$":"BLOB", data: "..."}

I propose not including any of the following in v1:

  • Reading part of a blob
  • Modifying part of a blob without rewriting the whole thing
  • Appending to a blob without rewriting the whole thing
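A hedged Python-side sketch of the same round trip; the blob_format run option and the table, document, and file names here are assumptions carried over from the JS examples above, not a finalized API:

import rethinkdb as r

conn = r.connect('localhost', 28015)

# Store a small image as a blob field next to ordinary metadata
# (well under the ~10 MB guideline).
with open('avatar.png', 'rb') as f:
    image = f.read()

r.table('users').insert({'id': 'alice', 'avatar': image}).run(conn)

# By default the blob would come back as native bytes; the 'raw' format
# would expose the pseudotype wrapper instead.
doc = r.table('users').get('alice').run(conn, blob_format='raw')
# doc['avatar'] -> {'$reql_type$': 'BLOB', 'data': '...base64...'}
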
@mlucy
Member

mlucy commented Jun 26, 2014

I agree with this interface (although I might call the pseudotype "BINARY" or "BASE64" instead of "BLOB").

One question with this is what native types should be turned into binary pseudotypes. In Ruby, people generally just put binary data in strings (there's no separate type for it, although strings can have encodings).

If we go with turning binary strings in Ruby into binary pseudotypes (and vice-versa), we might also want to offer an r.binary(...) command to turn a non-binary string into a blob (e.g. r.binary("abc")).

@coffeemug
Contributor Author

One question with this is what native types should be turned into binary pseudotypes.

Embarrassingly, I don't know any of the three driver languages well enough to know this. @neumino @deontologician @larkost @Tryneus can speak about their respective languages.

we might also want to offer an r.binary(...) command

👍

@srh
Contributor

srh commented Jun 26, 2014

we might also want to offer an r.binary(...) command to turn a non-binary string into a blob (e.g. r.binary("abc"))

How would it do that?

@larkost
Collaborator

larkost commented Jun 26, 2014

Python 3 has a binary type (type(b'23') is <class 'bytes'>), but while Python 2.x technically supports the b'23' syntax, it just produces a plain string. So something like an r.binary(...) command would generally be needed for Python as well. There is a valid question about whether we should implicitly convert <class 'bytes'> objects, and what we should present on fetch.

The argument against natively supporting <class 'bytes'> is that some surprising (but correct) places return bytes objects in Python 3, such as subprocess STDERR and STDOUT. If we natively bring those in as binary types, then the same code would put the data into the database as a string in Python 2.x, while in Python 3.x it would insert a binary.

The same arguments hold for what we present when fetching. In order to be consistent I think we should present r.binary objects rather than native items in Python3. One good option would be to have the r.binary class be a subclass of bytes (this maps to <type 'str'> on Python 2.x), which would "do the right thing" in most cases.
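A minimal sketch of that subclass idea, assuming a hypothetical wrapper name (not the shipped driver class), just to illustrate why subclassing bytes "does the right thing":

class BinaryWrapper(bytes):
    # Marker type: identical to bytes, but lets the driver recognize
    # values that should round-trip through the binary pseudotype.
    def __repr__(self):
        return '<binary, %d bytes>' % len(self)

blob = BinaryWrapper(b'\x00\x01\x02')
assert isinstance(blob, bytes)   # drop-in anywhere plain bytes is expected
assert blob == b'\x00\x01\x02'   # comparisons, slicing, len() behave like bytes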

@neumino
Member

neumino commented Jun 26, 2014

Node has a Buffer class; that should work fine.

The Data Explorer will probably drop the blobs and replace them with a string like <blob> or something along those lines.

@larkost
Collaborator

larkost commented Jun 26, 2014

@srh For Python 2.x we are pretty much stuck converting non-binary objects, since even a read on a file opened in binary mode results in a plain string object. Probably the best approach would be to use the same formats as the bytes constructor accepts in Python 3.x:

class bytes(object)
 |  bytes(iterable_of_ints) -> bytes
 |  bytes(string, encoding[, errors]) -> bytes
 |  bytes(bytes_or_buffer) -> immutable copy of bytes_or_buffer
 |  bytes(int) -> bytes object of size given by the parameter initialized with null bytes
 |  bytes() -> empty bytes object

Just subclassing the bytes object gets us this already on Python3. However, on Python2 bytes is just an alias to str, and so there we probably would have to call the encode methods ourselves.

@deontologician
Contributor

Python2 str is the same as Python3 bytes. Neither holds any encoding information. The bytes() function and b"123" syntax are just dummy operations to make migrating to Python3 easier. I'd say only those two respective types should be base64'd in Python. Implicitly converting unicode strings to bytes is probably a bad idea since the user doesn't get to pick the encoding, and when you retrieve it again you need to know that to decode it.

@mlucy
Member

mlucy commented Jun 26, 2014

How would it do that?

r.binary would be a term on the server that takes a string and returns a binary pseudotype. (If passed a binary pseudotype, it just returns what it was passed.)

I think we should try hard to turn binary pseudotypes into a reasonable native type in the clients. In Ruby, if I have x = "a\0b".encode('binary'), I really want r(x).run to give me back that same string with a binary encoding.

@mlucy
Member

mlucy commented Jun 26, 2014

@deontologician -- that makes sense. Converting str in Python 2 and bytes in Python 3 to binary pseudotypes automatically seems right. (And turning them back, obviously.)

@AtnNn
Member

AtnNn commented Jun 26, 2014

Converting str in Python 2 to binary pseudotypes

I would love to do this, but it is a bad idea. It is a big break from the current behaviour, and will break in unexpected ways when the rethinkdb driver is used alongside other libraries that return str instead of unicode values.

@deontologician
Contributor

@AtnNn Ah, OK. I think I see the issue; I misunderstood and thought we were talking about what to coerce to a blob without the r.binary() wrapper. I agree that you can only really do that in Python3, since Python 2 strings are overloaded with no way to discern the meaning.

I was talking more about r.binary() not coercing a unicode string to a byte array.

@mlucy
Member

mlucy commented Jun 26, 2014

So you think we should automatically convert in Python 3, but not automatically convert anything in Python 2?

I sort of feel like r.binary should be able to turn any string into a binary blob. That feels like a common task. But if that seems crazy, I'll defer to the people who actually use Python regularly.

@AtnNn
Member

AtnNn commented Jun 26, 2014

In Python 2, strings are bytes by default. I think r.binary should be required to convert them to blobs.

In Python 3, strings are unicode by default. I think r.binary should be a no-op on bytes and should fail on unicode strings.

@deontologician
Contributor

I think it's reasonable to automatically convert bytes to blobs, since their meanings are the same in Python3. The r.binary() wrapper should probably allow bytes in the Python3 driver though. And yeah, there's no good way to automatically convert in python2.

The issue I see with r.binary accepting unicode strings is that we'd have to decide on an encoding for the user. I think it's better if it blows up and says "pick an encoding first", because then the user won't accidentally shoot themselves in the foot by storing their string in a format they aren't sure how to decode. It's basically the same conflation python2 does: "this blob is probably arbitrary bytes, but it also might be a string in some encoding which was not recorded anywhere".

Plus, what's the use case for writing text strings into a blob field? Don't you really want an actual string field? I think if the user is confused enough to be doing that, we could help them out by throwing an error.

While I was writing this, @AtnNn's comment came in, which I agree with.

@mlucy
Member

mlucy commented Jun 27, 2014

If it's easy to go from a unicode string to bytes, I wouldn't be that opposed to leaving off r.binary in Python 3.

If we do do things this way, we should make sure the error message you get when you try to insert a binary string in Python 2 suggests using r.binary.

@deontologician
Contributor

Well, the issue is you just don't know in python2 from the type alone. If you insert a str, it'll just go in as a normal string value, there's nothing to tell us we need to raise an exception. You might try utf-8 encoding and decoding first and if it fails assume they were trying to do a binary, but you'd incur that cost on every string insert which doesn't sound like a good idea.

I'd say just use r.binary in both python2 and 3, and then as a bonus in Python3 you can also infer it if they try to insert bytes. Then, as @AtnNn mentioned, you would raise an exception in r.binary if someone tried to pass it a unicode string (in either 2 or 3).


@mlucy
Member

mlucy commented Jun 27, 2014

Well, the issue is you just don't know in python2 from the type alone. If you insert a str, it'll just go in as a normal string value, there's nothing to tell us we need to raise an exception. You might try utf-8 encoding and decoding first and if it fails assume they were trying to do a binary, but you'd incur that cost on every string insert which doesn't sound like a good idea.

We have to detect the error at some point on the server when we're decoding the string. (One could argue we have to detect the error in the client when we serialize the query to JSON because the JSON spec defines a string as a sequence of Unicode characters, so you can't just drop e.g. a null byte in the middle of one.) Wherever we detect that the string isn't a Unicode string, we should produce an error suggesting r.binary.

I think it would be very bad if r.binary threw an error when you passed it a string that doesn't technically need to be encoded as binary data, both because it makes it hard to write code that you know won't throw and because we need some way to let people make binary blobs containing only ASCII characters (to match a schema they have in their head, for example, or because another query branches on the type of a field).
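An illustrative sketch of that error path (not the actual driver code; the helper name is made up):

def serialize_string(value):
    # Turn a byte string into a JSON-safe text string, or point the
    # user at r.binary when it isn't valid UTF-8.
    try:
        return value.decode('utf-8')
    except UnicodeDecodeError:
        raise TypeError('Cannot convert non-UTF-8 bytes to a ReQL string; '
                        'wrap the value in r.binary(...) instead.')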

@deontologician
Contributor

Good point, I wasn't thinking about the fact that you need to check UTF-8 for JSON anyway. The python JSON library will raise an exception if you don't have valid unicode (though it assumes ascii encoding).

I agree that r.binary definitely shouldn't try to figure out if the binary secretly contains a text string. I just meant in the case you get a python unicode type, it should fail and refuse to silently convert it to a binary.

Here's what I'm thinking explicitly:

Python 2

r.binary("hi") # perfectly ok, not going to infer text
r.binary(b"hi") # identical to above, b prefix is a dummy for compatibility
r.binary(u"hi") # should fail, this is the unicode type which is unambiguously text and never binary

Python 3

r.binary("hi") # should fail, default strings are unicode
r.binary(u"hi") # identical to above, u prefix is a dummy for compatibility
r.binary(b"hi") # perfectly ok

@AtnNn
Member

AtnNn commented Jun 27, 2014

To add to @deontologician's table:

r.binary(u"hi".encode("utf-8")) # perfectly ok

@coffeemug
Contributor Author

@mlucy -- other than some driver-specific behavior (e.g. which types to automagically convert in which languages) I think this proposal is well defined. Could we mark it as settled?

@srh
Contributor

srh commented Jul 2, 2014

It's not well defined yet. None of the operations are defined. Here is what I propose:

Supported operations:

eq, ne, gt, ge, lt, le -- lexicographic comparison (using unsigned octets).

binary.typeOf() -> "binary"

binary.info() -> something appropriate, presumably. the type? the length?

The "raw" representation: {"$reql_type$":"binary", "data":"aB_-bc"}, for example, where the data uses base64url with no padding (and no unrecognized characters). As long as the wire protocol actually uses this "raw" representation, any deviation from this precise format should be rejected.

string.coerceTo("binary") -> encodes the UTF-8 representation of the string into the binary value.
binary.coerceTo("string") -> decodes the binary value in the UTF-8 format. It is an error if it's not valid UTF-8.
value.coerceTo("binary") -> an error for any other type.

binary.length() -> number. A new command! Returns the length in bytes.

binary.slice(startIndex[, endIndex, {leftBound:'closed', rightBound:'open'}]) -> binary. This is possible (unlike with strings where it's kind of messed up to want such a thing, and they're UTF-8 encoded). We could wait for users to ask for this. This would be useful, for example, if you wanted to retrieve all the ID3 tags from a bunch of MP3's. (Which, in this case, might actually be file objects.)
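A hedged usage sketch of the operations proposed above, written in the Python driver's snake_case spelling; the table, key, and field names are invented, and conn is an open connection:

doc = r.table('files').get('report.pdf')

# Proposed slice(): read the first few bytes of a stored binary value.
doc['data'].slice(0, 4).run(conn)

# Proposed coerce_to(): round-trip a string through binary as UTF-8.
r.expr('hello').coerce_to('binary').coerce_to('string').run(conn)

# Proposed type_of() on a binary value.
doc['data'].type_of().run(conn)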

@coffeemug
Contributor Author

I like this, though I think it's overkill -- we don't really need comparison operators for the first version, and probably not length and slice (though those would be really cool). I'd keep those out and only add them if users want them (though I'm not opposed to adding them if we really want to).

I'm on board with the definition of coerceTo, typeOf, and info.

@srh
Contributor

srh commented Jul 2, 2014

we don't really need comparison operators for the first version,

Yes we do; comparisons (and orderBy, distinct queries, etc.) must work for all values. Comparing two objects for equality will require comparing the binary values they contain.

probably not length

I think we want length for sure. I can remember many times, for example, querying Windows for files with a length greater than X, to see what humongous files I had that could be deleted. Suppose the developer has avatars uploaded in a table as binary values (and not files, because they're small, or because the user is new to RethinkDB and thinks binary values are more comfortable) but eventually realizes that users uploaded avatars that are too big.
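For example, the oversized-avatar check could look something like this; a sketch using the proposed byte-length operation (written here as length(); the command that eventually shipped for this is count()), with illustrative field names:

r.table('users').filter(
    lambda user: user['avatar'].length() > 1024 * 1024   # avatars over 1 MiB
).pluck('id', 'name').run(conn)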

@AtnNn
Member

AtnNn commented Jul 2, 2014

We should not call it "binary". Perhaps "bytes" or "bytestring".

We already have a binary type, it is called "bool".

I would not call it "blob" because it doesn't behave the same as SQL blobs.

@srh
Contributor

srh commented Jul 2, 2014

"bytes" is good.

@mlucy
Member

mlucy commented Jul 3, 2014

The "raw" representation: {"$reql_type$":"binary", "data":"aB_-bc"}, for example, where the data uses base64url with no padding (and no unrecognized characters). As long as the wire protocol actually uses this "raw" representation, any deviation from this precise format should be rejected.

I think we should use normal base64 with '+' and '/'. That's what most tools generate. I think we should accept either padding or no padding and add padding ourselves where it doesn't exist (there are several Base64 variants where padding is mandatory, but also at least one popular one where it's optional).
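A sketch of that padding-tolerant decoding (illustrative only, not the server's implementation):

import base64

def decode_reql_base64(data):
    # Accept standard base64 ('+' and '/') with or without '=' padding
    # by re-padding to a multiple of four characters before decoding.
    return base64.b64decode(data + '=' * (-len(data) % 4))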

length

I don't think we should introduce a new term for this. Some other options I would prefer more:

  • Overload count.
  • Add a term size which will get the size in bytes of a string or binary data type. (This is at least useful on multiple things. Also, arguably count should be named size.)
  • Skip it for the first version.

@mlucy
Member

mlucy commented Jul 3, 2014

I'm OK with the name "bytes". I think we should leave off slice for the first version. Other than that and the two quibbles above, I'm on board with Sam's proposal.

@mlucy
Member

mlucy commented Jul 4, 2014

This is complete nonsense and not true at all.

I apparently was misremembering. Ruby's behavior with strings is kind of complicated, but it uses size to mean length and bytesize to mean number of bytes. (Which, IMO, is what size should mean.)

If other people aren't using that convention, then we probably shouldn't use it. A command bytesize which works on strings and binary blobs would be good. Once we have Unicode support, a command length that works on both (and on arrays) would be good too.

@mlucy
Member

mlucy commented Jul 4, 2014

I think at a minimum we should accept both 3548 and 2045 and store them internally as the common subset.

I have no idea what you're talking about, but binary datums will surely be stored with a varint length N, followed by N bytes consisting of the binary data.

To save space, sure, but from the user's perspective it should be {"$reql_type$": ..., "data": SOME_BASE64_STRING} (if they call run with e.g. binary_format: 'raw'). I'm saying that the string we give them should be in the common subset of 3548 and 2045, so if they have a strict parser it will work with either.

@mlucy
Member

mlucy commented Jul 4, 2014

I'm fine with either binary or bytes. I think that @srh is right that binary will be interpreted to mean "data using the full 8 bits", not "a single boolean value". I would never interpret it as the second, and it's commonly used to mean the first. (For example, Wikipedia describes Base64 as a binary-to-text encoding.)

@mlucy
Member

mlucy commented Jul 4, 2014

OK, I think we should add a command length that works on blobs and arrays (as a synonym for count), and when we add better Unicode support it should work on strings too. (Long-term I think we should maybe deprecate count, but that's a whole can of worms I don't want to open right now.)

@coffeemug
Contributor Author

  • I'm ok with either r.bytes or r.binary.
  • I'm ok with the command length that's also an alias for count. I don't think we should deprecate count though -- it's very entrenched in the SQL world and it feels sort of unnatural to say table.length().

@srh
Contributor

srh commented Jul 7, 2014

OK, I think we should add a command length that works on blobs and arrays

If we're going to overload length for arrays, making it a synonym of count, then we should drop the name "length" altogether and just go with count for everything. I don't see what causes you to want to overload length for arrays in the first place.

@coffeemug
Contributor Author

If we're going to overload length for arrays, making it a synonym of count, then we should drop the name "length" altogether and just go with count for everything.

I'd be ok with that too.

I don't see what causes you to want to overload length for arrays in the first place.

I'm not sure what @mlucy's rationale is, but I think count makes sense for tables since everyone's used to that in SQL; the naming is a little weird for getting the number of bytes in a blob, though. I think aliasing length to count makes sense, but I also wouldn't be opposed to just defining length on blobs. I basically think anything we do here will be ok.

@srh
Contributor

srh commented Jul 7, 2014

The name count makes sense on tables because tables are not an ordered linear sequence the way arrays are. They do not have a "length".

@srh
Contributor

srh commented Jul 7, 2014

The whole point of asking why length should work on arrays is the question: what does this accomplish? What bad things can happen?

It's not a question of "What names make me feel good." I can't think of any good things that happen by making length also work on arrays. And it's easy to think of bad things happening.

@mlucy
Member

mlucy commented Jul 7, 2014

Let's just go with count for everything. If we aren't going to deprecate it, it would be bad to have two primitives for the same task.

@mlucy
Member

mlucy commented Jul 7, 2014

The name count makes sense on tables because tables are not an ordered linear sequence the way arrays are. They do not have a "length".

This is off-topic, but tables are ordered (by the primary key of the rows), it's just that some operations (like reading from them in batches) ignore that ordering.

@srh
Contributor

srh commented Jul 7, 2014

Result sets are not ordered.

@coffeemug
Contributor Author

Marking as settled. To clarify:

  • We'll use count to get the number of bytes and not add a length command
  • We'll call the type binary and the command r.binary
  • We'll accept both RFC 3548 and RFC 2045, and send back the common subset
  • We'll define count, info, type_of, coerce_to and the comparison operators to start with

@Tryneus self-assigned this Jul 9, 2014
@hardkrash

Please contemplate how you would retrieve documents without the binary components. That is, how would someone perform a query and get all the documents without the binary elements?

Pretend that I have a document with a handful of random metadata entries and 1 to 3 binary entries. For speed reasons I want to look at all the metadata, and my binary elements are 20 MiB images. Clearly, if I have 100k images I may want to look at all the metadata without the large images. Excluding a key in the document may not be enough; those images might have different keys in each document. Clearly in that scenario exclusion via type would be useful.

@coffeemug
Contributor Author

You'd say r.table('foo').without('blob_field_name').
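If the blob fields have unpredictable names, a hedged sketch of the type-based exclusion asked about above: rebuild each document from its non-binary fields only. The exact type string reported for binary values is an assumption here; check type_of() on a binary value in your version first.

r.table('foo').map(lambda doc:
    doc.keys()
       .filter(lambda key: doc[key].type_of().ne('PTYPE<BINARY>'))
       .map(lambda key: [key, doc[key]])
       .coerce_to('object')
).run(conn)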

@mlucy
Member

mlucy commented Jul 23, 2014

I'd like to propose that we re-introduce slice since @Tryneus already implemented it in the CR. The use, meaning, and edge cases should be the same as array.slice.

@mlucy
Member

mlucy commented Jul 23, 2014

(From an administrative perspective, this issue is temporarily back in its discussion period for that one aspect.)

@mlucy
Member

mlucy commented Jul 24, 2014

@coffeemug: any opinion on slice?

@coffeemug
Contributor Author

Adding slice would be great. I talked to @Tryneus and I thought he already implemented it, but I'm not entirely sure.

@mlucy
Member

mlucy commented Jul 25, 2014

Alright, it's been 2-3 days with no objections and the code's already written, marking as settled once more.

@Tryneus
Member

Tryneus commented Jul 25, 2014

Ok, r.binary has been approved and merged to next in commits b878582, d7730a5, b0285bc, a0f0973, and 9f7247b. Review was 1799. Will be in release 1.14.

A few notes about the final state of the feature:

  • base64 is only used on the wire; as soon as it is converted to a datum_t, the base64 is translated to a wire_string_t. The reverse applies when sending r.binary to a client.
  • We support the common subset of RFC 3548 and RFC 2045, but not base64url.
  • The r.binary term is client-only. There is no use case I could find for running r.binary on existing data that is not fulfilled by using coerce_to('binary').
  • count works on r.binary objects, similar to arrays.
  • slice works on r.binary objects, with the same semantics as when used on an array.
  • info, type_of, and coerce_to are all implemented (the type string is BINARY).

Drivers:

  • The python driver will implicitly convert bytes to r.binary unless bytes is the same as str (in python2).
  • The python driver will return r.binary objects as a thin wrapper of bytes (to allow implicit conversion when re-inserting in python2).
  • The javascript driver will implicitly convert Buffer to r.binary.
  • The javascript driver will return r.binary objects as Buffers.
  • The ruby driver does not implicitly convert data to r.binary, like python2.
  • The ruby driver will return r.binary objects as a thin wrapper of String (to allow implicit conversion when re-inserting).
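A short round-trip sketch against the shipped 1.14 behavior described above (table, key, and file names are illustrative):

import rethinkdb as r

conn = r.connect('localhost', 28015)

with open('logo.png', 'rb') as f:
    r.table('assets').insert({'id': 'logo', 'image': r.binary(f.read())}).run(conn)

row = r.table('assets').get('logo').run(conn)
row['image']                                                   # a thin bytes wrapper

r.table('assets').get('logo')['image'].count().run(conn)       # length in bytes
r.table('assets').get('logo')['image'].slice(0, 8).run(conn)   # first 8 bytes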

@Tryneus closed this as completed Jul 25, 2014
@coffeemug
Contributor Author

A few more questions.

  • Can I pass binary_format='raw' to get the pseudotype in the driver? (if not, I think we should do that, both for clarity and consistency).
  • How is the wrapper r.binary object specified? What methods can I call on it/what can I do with it, etc.?

@Tryneus
Member

Tryneus commented Jul 25, 2014

@coffeemug:

  • I did not implement binary_format='raw' (I didn't see much of a use case), but that's a simple change; opened "Add binary_format='raw' in the drivers" (#2762) for this.
  • The thin wrappers in Ruby and Python simply inherit from String and bytes, respectively, without redefining anything. They are functionally identical to a normal string, but we can tell if it was the result of r.binary in an earlier query.
