ReQL proposal: blob type #2612
Comments
I agree with this interface (although I might call the pseudotype "BINARY" or "BASE64" instead of "BLOB"). One question with this is what native types should be turned into binary pseudotypes. In Ruby, people generally just put binary data in strings (there's no separate type for it, although strings can have encodings). If we go with turning binary strings in Ruby into binary pseudotypes (and vice-versa), we might also want to offer an |
Embarrassingly, I don't know any of the three driver languages well enough to know this. @neumino @deontologician @larkost @Tryneus can speak about their respective languages.
👍 |
How would it do that? |
Python3 has a binary type ( The argument against natively supporting The same arguments hold for what we present when fetching. In order to be consistent I think we should present |
Node has a buffer class, that should work fine. The dataexplorer will probably drop the blobs and replace them with a string |
@srh For Python2.x we are pretty much stuck converting non-binary objects since even a read on a file in binary mode results in a plain string object. Probably the best approach would be to use the same format at the
Just subclassing the |
Python2 |
I think we should try hard to turn binary pseudotypes into a reasonable native type in the clients. In Ruby, if I have |
@deontologician -- that makes sense. Converting |
I would love to do this, but it is a bad idea. It is a big break from the current behaviour, and will break in unexpected ways when the rethinkdb driver is used alongside other libraries that return |
@AtnNn Ah ok. I think I see the issue, I misunderstood that we were talking about what to coerce to a blob without the I was talking more about r.binary() not coercing a unicode string to a byte array. |
So you think we should automatically convert in Python 3, but not automatically convert anything in Python 2? I sort of feel like |
In Python 2, strings are bytes by default. I think In Python 3, strings are unicode by default. I think r.binary should be a no-op on bytes and should fail on unicode strings. |
I think it's reasonable to automatically convert The issue I see with r.binary accepting unicode strings is that we'd have to decide on an encoding for the user. I think it's better if it blows up and says "pick an encoding first", because then the user won't accidentally shoot themselves in the foot by storing their string in a format they aren't sure how to decode. It's basically the same conflation Python 2 does: "this blob is probably arbitrary bytes, but it also might be a string in some encoding which was not recorded anywhere". Plus, what's the use case for writing text strings into a blob field? Don't you really want an actual string field? I think if the user is confused enough to be doing that, we could help them out by throwing an error. While I was writing this, @AtnNn's comment came in, which I agree with. |
If it's easy to go from a unicode string to If we do do things this way, we should make sure the error message you get when you try to insert a binary string in Python 2 suggests using |
Well, the issue is you just don't know in python2 from the type alone. If I'd say just use On Thu, Jun 26, 2014 at 5:04 PM, Michael Lucy notifications@github.com
|
We have to detect the error at some point on the server when we're decoding the string. (One could argue we have to detect the error in the client when we serialize the query to JSON because the JSON spec defines a string as a sequence of Unicode characters, so you can't just drop e.g. a null byte in the middle of one.) Wherever we detect that the string isn't a Unicode string, we should produce an error suggesting I think it would be very bad if |
Good point, I wasn't thinking about the fact that you need to check UTF-8 for JSON anyway. The Python JSON library will raise an exception if you don't have valid unicode (though it assumes ASCII encoding). I agree that Here's what I'm thinking explicitly:
Python 2
r.binary("hi") # perfectly ok, not going to infer text
r.binary(b"hi") # identical to above, b prefix is a dummy for compatibility
r.binary(u"hi") # should fail, this is the unicode type which is unambiguously text and never binary
Python 3
r.binary("hi") # should fail, default strings are unicode
r.binary(u"hi") # identical to above, u prefix is a dummy for compatibility
r.binary(b"hi") # perfectly ok |
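The acceptance rules in the table above can be sketched as one check that behaves the same way on both interpreters. This is only an illustration: `check_binary_argument` is a hypothetical helper name, not the driver's actual code, and the error wording is made up.

```python
# Sketch only: check_binary_argument is a hypothetical name, not driver code.

# type(u"") is `unicode` on Python 2 and `str` on Python 3, so this is
# always "the unambiguous text type" for the running interpreter.
text_type = type(u"")


def check_binary_argument(value):
    """Return value unchanged if it is a byte string, otherwise raise."""
    if isinstance(value, text_type):
        # Refuse to guess an encoding on the user's behalf.
        raise TypeError(
            "expected bytes, got text; pick an encoding first, "
            "e.g. value.encode('utf-8')"
        )
    if isinstance(value, bytes):
        # On Python 2, `bytes` is an alias of `str`, so "hi" and b"hi"
        # both land here. On Python 3 only b"hi" does.
        return value
    raise TypeError("expected bytes, got %s" % type(value).__name__)
```

On Python 3 this rejects `"hi"` and `u"hi"` and accepts `b"hi"`; on Python 2 it accepts `"hi"` and `b"hi"` and rejects `u"hi"`, which matches the table.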
To add to @deontologician table,
|
@mlucy -- other than some driver specific behavior (e.g. which types to automagically convert in which languages) I think this proposal is well defined. Could we mark it as settled? |
It's not well defined yet. None of the operations are defined. Here is what I propose: Supported operations:
The "raw" representation:
|
I like this, though I think it's overkill -- we don't really need comparison operators for the first version, and probably not I'm on board with the definition of |
Yes we do, comparisons (and orderBy, distinct queries, etc) must work for all values. Comparing two objects for equality will require comparing the binary values they contain.
I think we want length for sure. I can remember many times, for example, querying Windows for files with a length greater than X, to see what humongous files I had that could be deleted. Suppose the developer has avatars uploaded in a table as binary values (and not files, because they're small, or because the user is new to RethinkDB and thinks binary values are more comfortable) but eventually realizes that users uploaded avatars that are too big. |
We should not call it "binary". Perhaps "bytes" or "bytestring". We already have a binary type, it is called "bool". I would not call it "blob" because it doesn't behave the same as SQL blobs. |
"bytes" is good. |
I think we should use normal base64 with '+' and '/'. That's what most tools generate. I think we should accept either padding or no padding and add padding ourselves where it doesn't exist (there are several Base64 variants where padding is mandatory, but also at least one popular one where it's optional).
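To make the padding suggestion concrete, here is a rough sketch of "accept either, add padding where it's missing" using Python's standard `base64` module. `decode_base64_lenient` is a hypothetical name for illustration and says nothing about where this check would live in the server or drivers.

```python
import base64


def decode_base64_lenient(text):
    """Decode standard base64 ('+' and '/'), with or without '=' padding."""
    stripped = text.rstrip("=")
    if len(stripped) % 4 == 1:
        # No whole number of bytes encodes to this many characters.
        raise ValueError("invalid base64 length")
    # Re-add the padding ourselves so padded and unpadded input
    # follow the same decoding path.
    padded = stripped + "=" * (-len(stripped) % 4)
    # validate=True rejects characters outside the standard alphabet.
    return base64.b64decode(padded, validate=True)
```

With this, `"aGk="` and `"aGk"` both decode to `b"hi"`.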
I don't think we should introduce a new term for this. Some other options I would prefer more:
|
I'm OK with the name "bytes". I think we should leave off |
I apparently was misremembering. Ruby's behavior with strings is kind of complicated, but it uses If other people aren't using that convention, then we probably shouldn't use it. A command |
To save space, sure, but from the user's perspective it should be |
I'm fine with either |
OK, I think we should add a command |
|
If we're going to overload |
I'd be ok with that too.
I'm not sure what @mlucy's rationale is, but I think |
The name |
The whole point of asking, why should It's not a question of "What names make me feel good." I can't think of any good things that happen by making |
Let's just go with |
This is off-topic, but tables are ordered (by the primary key of the rows), it's just that some operations (like reading from them in batches) ignore that ordering. |
Result sets are not ordered. |
Marking as settled. To clarify:
|
Please consider how you would retrieve documents without their binary components. That is, how would someone perform a query and get all the documents without the binary elements? Suppose I have a document with a handful of random metadata entries and 1 to 3 binary entries. For speed reasons I want to look at all the metadata, and my binary elements are 20MiB images. Clearly, if I have 100k images, I may want to look at all the metadata without the large images. Excluding a key in the document may not be enough: those images might have different keys in each document. In that scenario, exclusion by type would be useful. |
You'd say |
I'd like to propose that we re-introduce |
(From an administrative perspective, this issue is temporarily back in its discussion period for that one aspect.) |
@coffeemug: any opinion on |
Adding |
Alright, it's been 2-3 days with no objections and the code's already written, marking as settled once more. |
OK, a few notes about the final state of the feature:
Drivers:
|
A few more questions.
|
|
This proposal covers the binary data part of #137.
A blob is a non-streaming, reasonably small (< ~10MB), base64 encoded binary data type. The user can store a blob in a document.
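As a rough picture of what "small, base64 encoded" could mean on the client side, here is a sketch that wraps raw bytes in a pseudotype-style object. Everything specific in it is an assumption for illustration: the helper name `to_blob_document`, the exact 10 MiB cap, and the `$reql_type$` / `data` field names are not defined by this proposal text.

```python
import base64

# Assumed cap, taken loosely from the "< ~10MB" figure in the proposal.
MAX_BLOB_BYTES = 10 * 1024 * 1024


def to_blob_document(raw):
    """Wrap raw bytes in a base64-encoded, pseudotype-style dict (sketch)."""
    if not isinstance(raw, bytes):
        raise TypeError("expected bytes")
    if len(raw) > MAX_BLOB_BYTES:
        raise ValueError("blob too large for a non-streaming value")
    return {
        # Field names are assumptions, not part of the proposal.
        "$reql_type$": "BINARY",
        "data": base64.b64encode(raw).decode("ascii"),
    }
```

For example, `to_blob_document(b"hi")` yields an object whose `data` field is `"aGk="`. Base64 makes the stored value roughly a third larger than the raw bytes, which is one reason to keep blobs small.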
I propose not including any of the following in v1: