Fix double base64 of binary attributes #1112
Conversation
Force-pushed from b019707 to 94d5522
Force-pushed from 94d5522 to 1fe4b3d
    attribute_value[BINARY] = base64.b64decode(attribute_value.pop(STRING))
elif attr_type == BINARY_SET and LIST in attribute_value:
    attribute_value[BINARY_SET] = [base64.b64decode(v[STRING]) for v in attribute_value.pop(LIST)]
Why do we base64-decode here when we didn't use to? Since `BinaryAttribute(legacy_encoding=False).deserialize` does not base64-decode (assuming `_convert_binary` in connection/base.py already did the job), we have to do it here for the sake of `Model.from_dict`.
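A minimal standalone sketch of what this decode step does, using literal `"S"`/`"B"` keys in place of the `STRING`/`BINARY` constants and a hypothetical attribute value:

```python
import base64

# Hypothetical DynamoDB-JSON fragment as Model.to_dict would carry it:
# the binary payload travels as a base64 string under "S".
attribute_value = {"S": "aGVsbG8="}

# The decode step converts it back to the wire-level binary form under "B".
attribute_value = {"B": base64.b64decode(attribute_value.pop("S"))}

print(attribute_value)  # {'B': b'hello'}
```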
elif MAP in attr:
    for sub_attr in attr[MAP].values():
        _convert_binary(sub_attr)
elif LIST in attr:
    for sub_attr in attr[LIST]:
        _convert_binary(sub_attr)
Maps and lists are now being recursed into. This matches botocore behavior.
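A self-contained sketch of this recursion, with literal type-descriptor keys standing in for the module's constants and base64 strings as the pre-conversion form (assumptions for illustration):

```python
import base64

def convert_binary(attr: dict) -> None:
    # Walk a DynamoDB-JSON attribute value in place: decode "B"/"BS"
    # leaves, and recurse into "M" (map) and "L" (list) containers.
    if "B" in attr:
        attr["B"] = base64.b64decode(attr["B"])
    elif "BS" in attr:
        attr["BS"] = [base64.b64decode(v) for v in attr["BS"]]
    elif "M" in attr:
        for sub_attr in attr["M"].values():
            convert_binary(sub_attr)
    elif "L" in attr:
        for sub_attr in attr["L"]:
            convert_binary(sub_attr)

item = {"L": [{"M": {"pic": {"B": "aGVsbG8="}}}]}
convert_binary(item)
print(item["L"][0]["M"]["pic"]["B"])  # b'hello'
```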
Worth noting that this is going to have some performance impact (I actually did notice this as part of the delta in #1079). We might want to run benchmarks to get a sense for it and include it in the release notes.
If the difference is meaningful then we may also want to add an optimisation for short-circuiting this when binary attributes aren't in use
Merely iterating a dict/list recursively has a notable performance effect?
In the average case I doubt it, but my guess is the high-throughput cases with larger items (e.g. containing large lists) would notice something. I can use the benchmark script in #1079 with varying sized lists to get a sense for it - I might be wrong and it's small enough to not be of concern
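A rough micro-benchmark sketch along the lines discussed (the walk function and item shape are simplified stand-ins, not the benchmark script from #1079):

```python
import timeit

def walk(attr: dict) -> None:
    # Traversal-only cost: recurse into maps and lists, decode nothing.
    if "M" in attr:
        for sub in attr["M"].values():
            walk(sub)
    elif "L" in attr:
        for sub in attr["L"]:
            walk(sub)

# A large list-valued attribute, the case most likely to notice the recursion.
item = {"L": [{"S": "x"} for _ in range(1000)]}
seconds = timeit.timeit(lambda: walk(item), number=100)
print(f"100 walks over a 1000-element list: {seconds:.4f}s")
```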
if attr_type == BINARY:
    return base64.b64encode(attr_value).decode()
if attr_type == BINARY_SET:
    return [base64.b64encode(v).decode() for v in attr_value]
if attr_type in {BOOLEAN, STRING, STRING_SET}:
Why do we base64-encode here when we didn't use to? Since `BinaryAttribute(legacy_encoding=False).serialize` does not base64-encode anymore (assuming botocore would do the job), we have to do it here for the sake of `Model.to_dict`.
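The mirror of the decode path, as a minimal sketch of what the quoted encode step produces (literal values chosen for illustration):

```python
import base64

# A BINARY attribute: raw bytes become a JSON-safe base64 string.
attr_value = b"hello"
encoded = base64.b64encode(attr_value).decode()
print(encoded)  # aGVsbG8=

# A BINARY_SET attribute: each member is encoded individually.
encoded_set = [base64.b64encode(v).decode() for v in [b"a", b"b"]]
print(encoded_set)  # ['YQ==', 'Yg==']
```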
BTW, now that:

As a result, we don't have a JSON-serializable representation that roundtrips. Not sure what to do about that.
Is this breaking in the sense that we fail to deserialise these, or just that they're serialised differently when read back and written? Maybe we can add a test for this as well (apologies if one was already there and I didn't notice)
I wonder if we should talk more about the options here. Seems like we're drifting towards binary attributes becoming a second-class citizen that contains so many footguns that we might as well not support them natively (which also feels wrong, since they are part of the DynamoDB API). I'd be curious to see how much worse each of these points are becoming as of this PR and what it would take to ensure that
The previous behavior is low on my priority to preserve, since
Agree. There are many ORMs out there; a DynamoDB-specific one had better cover what DynamoDB supports.

On the topic of JSON-izing, here's how Amazon goes about it: in the AWS console, if you add a binary attribute and then switch from "Form" to "JSON view", there's a nested option to toggle between "JSON" and "DynamoDB JSON", and that option is grayed out once there are binary attributes or sets.

If designing from scratch, I'd expect
Couple of remaining thoughts:
I like

I find that

Also, initially I thought:

    def to_dynamodb_json(self) -> str:
        return json.dumps(self.serialize(), cls=EncoderWhichBase64sBytes)

but it's no good for the "use it in something larger" case.
Here's a thought. Given a JSON encoder:

    import base64
    import json
    from typing import Any

    class CustomEncoder(json.JSONEncoder):
        def default(self, obj: Any) -> Any:
            if isinstance(obj, bytes):
                return base64.b64encode(obj).decode()
            return super().default(obj)

you can do:

    class MyModel(Model):
        ...
        picture = BinaryAttribute(legacy_encoding=False)

    data = {"items": [my_model.serialize()]}
    data_str = json.dumps(data, cls=CustomEncoder)

Now, how do you deserialize that? You can do:

    data = json.loads(data_str)
    items = [
        MyModel.from_raw_data(item)
        for item in data['items']
    ]

but that's broken since

What if we change:

    class BinaryAttribute:
        def deserialize(self, value):
    -       if self.legacy_encoding:
    +       if self.legacy_encoding or isinstance(value, str):
                return b64decode(value)
            return value

and then perhaps get rid of
Co-authored-by: Garrett <garrettheel@users.noreply.github.com>
Since #1112, `Model.serialize`'s return value is not necessarily JSON-serializable. We're adding `to_dynamodb_dict` as a replacement (basically, `bytes` are base64-encoded), and `to_simple_dict` to provide for a common ask of a "simple" JSON representation similar to what the AWS Console defaults to (previously called `to_dict` in earlier versions of the 6.x branch). We're making those methods of `AttributeContainer` rather than `Model` so they'd be equally applicable to `MapAttribute`.
For many versions (verified in versions 3-5) we had a known issue where
Clearly (1) is inefficient and (2) is unusable, and we wanted PynamoDB to be efficient (and not broken) by default, but similarly didn't want to risk breaking systems during an upgrade to PynamoDB 6. With that in mind, we're introducing a `legacy_encoding` required parameter to `BinaryAttribute` and `BinarySetAttribute`, the rationale being that a new required parameter will prompt an informed decision. The parameter will exist throughout the lifetime of PynamoDB 6.x and may be removed in a future version (in favor of `legacy_encoding=False` becoming the default behavior).

Since nested binary attributes were previously unusable, we're safely defaulting to `legacy_encoding=False` for nested attributes (and forcing it to be `False` when binary attributes are declared in non-raw maps).

Along with this, a slight breaking change is introduced: a set in a "raw" list or map was previously serialized as a list (L) but will now result in a string/number/binary set:
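A hedged illustration of this breaking change (the dict shapes below show the DynamoDB wire representations, not PynamoDB's actual serializer entry points):

```python
# A Python set inside a "raw" map attribute, e.g. {"tags": {"a", "b"}}.

# Previously, the set was serialized as a DynamoDB list of strings (L):
old_form = {"tags": {"L": [{"S": "a"}, {"S": "b"}]}}

# Now it serializes as a typed string set (SS); number and binary sets
# map to NS and BS analogously.
new_form = {"tags": {"SS": sorted({"a", "b"})}}
print(new_form)  # {'tags': {'SS': ['a', 'b']}}
```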
Additions to docs: