Fix double base64 of binary attributes #1112

ikonst · 2022-11-16T06:38:43Z

For many versions (verified in versions 3-5), we had a known issue where:

(1) top-level Binary(Set)Attributes were double-base64-encoded in the underlying DynamoDB table item
(2) nested Binary(Set)Attributes were unusable: every serialization would add another round of Base64 encoding(!), so nested binary attributes could not be used in practice

Clearly (1) is inefficient and (2) is unusable. We wanted PynamoDB to be efficient (and not broken) by default, but also didn't want to risk breaking systems during an upgrade to PynamoDB 6. With that in mind, we're introducing a required legacy_encoding parameter to BinaryAttribute and BinarySetAttribute, the rationale being that a new required parameter will prompt an informed decision. The parameter will exist throughout the lifetime of PynamoDB 6.x and may be removed in a future version (in favor of legacy_encoding=False becoming the default behavior).

Since nested binary attributes were previously unusable, we're safely defaulting to legacy_encoding=False for nested attributes (and forcing it to be False when binary attributes are declared in non-raw maps).
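
To make the cost of the old behavior concrete, here is a standalone sketch using only the standard library. It does not call PynamoDB; `payload` is an arbitrary example value, and it assumes the legacy path amounts to one round of base64 in the attribute plus one more on the wire, as described above:

```python
import base64

payload = b"\x00\x01binary payload"  # arbitrary example bytes

# Wire format for a DynamoDB B value: one round of base64.
single = base64.b64encode(payload)

# Legacy behavior (assumed model): the attribute base64-encodes first,
# and the wire encoding then adds a second round on top.
double = base64.b64encode(single)

# Compare the sizes of the raw, single-encoded and double-encoded forms.
print(len(payload), len(single), len(double))
```

Each round inflates the data by about a third, and a reader has to undo exactly as many rounds as were applied. That is why the nested case, which added one more round on every save, was unusable.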

Along with this, a slight breaking change is introduced: a set in a "raw" list or map was previously serialized as a list (L) but will now be serialized as a string, number, or binary set:

for all strings, string set (SS)
for all numbers, number set (NS)
for all bytes, binary set (BS)
else, raises an error on serialization
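
The rule above could be sketched roughly as follows. `infer_set_type` is a hypothetical name used only for illustration, not a PynamoDB function, and the handling of empty sets and of `bool` is an assumption of this sketch:

```python
from decimal import Decimal
from typing import AbstractSet, Any


def infer_set_type(values: AbstractSet[Any]) -> str:
    """Return the DynamoDB set type key for a homogeneous Python set."""
    if not values:
        # Assumption for this sketch: reject empty sets outright.
        raise ValueError("cannot infer a set type from an empty set")
    if all(isinstance(v, str) for v in values):
        return "SS"
    # bool is a subclass of int, so exclude it explicitly (an assumption).
    if all(
        isinstance(v, (int, float, Decimal)) and not isinstance(v, bool)
        for v in values
    ):
        return "NS"
    if all(isinstance(v, bytes) for v in values):
        return "BS"
    raise ValueError(
        f"set elements must be all str, all numbers, or all bytes: {values!r}"
    )
```

With this rule, a mixed set such as {42, "ham"} fails at serialization time instead of silently turning into a list.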

Additions to docs:

ikonst · 2022-11-16T14:31:17Z

pynamodb/attributes.py

+            attribute_value[BINARY] = base64.b64decode(attribute_value.pop(STRING))
+        elif attr_type == BINARY_SET and LIST in attribute_value:
+            attribute_value[BINARY_SET] = [base64.b64decode(v[STRING]) for v in attribute_value.pop(LIST)]


Why do we base64-decode here when we didn't use to? Since BinaryAttribute(legacy_encoding=False).deserialize does not base64-decode (assuming _convert_binary in connection/base.py already did the job), we have to do it here for the sake of Model.from_dict.

ikonst · 2022-11-16T14:33:29Z

pynamodb/connection/base.py

+    elif MAP in attr:
+        for sub_attr in attr[MAP].values():
+            _convert_binary(sub_attr)
+    elif LIST in attr:
+        for sub_attr in attr[LIST]:
+            _convert_binary(sub_attr)


Maps and lists are now being recursed into. This matches botocore behavior.

Worth noting that this is going to have some performance impact (I actually did notice this as part of the delta in #1079). We might want to run benchmarks to get a sense for it and include it in the release notes.

If the difference is meaningful then we may also want to add an optimisation for short-circuiting this when binary attributes aren't in use

Merely iterating a dict/list recursively has a notable performance effect?

In the average case I doubt it, but my guess is the high-throughput cases with larger items (e.g. containing large lists) would notice something. I can use the benchmark script in #1079 with varying sized lists to get a sense for it - I might be wrong and it's small enough to not be of concern
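
For reference, the recursion under discussion can be sketched as a standalone function. This is a minimal illustration built from the diff fragment above, not the actual code in connection/base.py; the name `convert_binary` and the literal type keys are assumptions:

```python
import base64
from typing import Any, Dict


def convert_binary(attr: Dict[str, Any]) -> None:
    """Base64-decode B/BS values in place, recursing into maps and lists."""
    if "B" in attr:
        attr["B"] = base64.b64decode(attr["B"])
    elif "BS" in attr:
        attr["BS"] = [base64.b64decode(v) for v in attr["BS"]]
    elif "M" in attr:
        for sub_attr in attr["M"].values():
            convert_binary(sub_attr)
    elif "L" in attr:
        for sub_attr in attr["L"]:
            convert_binary(sub_attr)
```

The walk visits every nested value once, so its cost grows linearly with item size, which is the overhead the benchmark would measure.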

ikonst · 2022-11-16T14:34:50Z

pynamodb/util.py

+    if attr_type == BINARY:
+        return base64.b64encode(attr_value).decode()
+    if attr_type == BINARY_SET:
+        return [base64.b64encode(v).decode() for v in attr_value]
+    if attr_type in {BOOLEAN, STRING, STRING_SET}:


Why do we base64-encode here when we didn't use to? Since BinaryAttribute(legacy_encoding=False).serialize does not base64-encode anymore (assuming botocore would do the job), we have to do it here for the sake of Model.to_dict.

ikonst · 2022-11-16T15:32:44Z

BTW, now that Attribute.(de)serialize and subsequently Model.(de)serialize are not (de)serializing a JSON-serializable dict (for DynamoDB API) but a dict for botocore (with non-JSON-serializable bytes), there’s an unfortunate gap forming:

Model.serialize's output cannot be JSON-serialized (w/o custom encoder)
Model.to_dict's output does not always roundtrip: bytes in raw maps will be encoded as base64 strings, but then Model.from_dict will decode them as str containing base64

As a result, we don't have a JSON-serializable representation that roundtrips. Not sure what to do about that.
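
A small standalone illustration of the second point, using only the standard library. `encode_bytes` is a hypothetical helper standing in for what a to_dict-style conversion does to an untyped ("raw") map:

```python
import base64
import json
from typing import Any, Dict


def encode_bytes(raw_map: Dict[str, Any]) -> Dict[str, Any]:
    """Replace bytes values with base64 strings so json can serialize the map."""
    return {
        key: base64.b64encode(value).decode() if isinstance(value, bytes) else value
        for key, value in raw_map.items()
    }


# One real bytes value, and one str that merely looks like base64.
original = {"blob": b"hi", "text": "aGk="}
as_json = json.dumps(encode_bytes(original))
restored = json.loads(as_json)
# Both values come back as the same str; nothing marks "blob" as bytes.
```

Once type information is gone, no decoder can tell which strings used to be bytes, so the roundtrip cannot be repaired on the reading side alone.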

garrettheel · 2022-11-17T13:39:47Z

Along with this, a slight breaking change is introduced: a set in a "raw" list or map was previously serialized as a list (L) but will now be serialized as a string, number, or binary set

Is this breaking in the sense that we fail to deserialise these, or just that they're serialised differently when read back and written? Maybe we can add a test for this as well (apologies if one was already there and I didn't notice)

garrettheel · 2022-11-17T13:52:24Z

BTW, now that Attribute.(de)serialize and subsequently Model.(de)serialize are not (de)serializing a JSON-serializable dict (for DynamoDB API) but a dict for botocore (with non-JSON-serializable bytes), there’s an unfortunate gap forming:

Model.serialize's output cannot be JSON-serialized (w/o custom encoder)

Model.to_dict's output does not always roundtrip: bytes in raw maps will be encoded as base64 strings, but then Model.from_dict will decode them as str containing base64

As a result, we don't have a JSON-serializable representation that roundtrips. Not sure what to do about that.

I wonder if we should talk more about the options here. Seems like we're drifting towards binary attributes becoming a second-class citizen that contains so many footguns that we might as well not support them natively (which also feels wrong, since they are part of the DynamoDB API).

I'd be curious to see how much worse each of these points are becoming as of this PR and what it would take to ensure that from_dict, to_dict, serialize, etc. work better with binary. We can maybe do that elsewhere to avoid cluttering this PR too much

ikonst · 2022-11-17T14:33:16Z

Along with this, a slight breaking change is introduced: a set in a "raw" list or map was previously serialized as a list (L) but will now be serialized as a string, number, or binary set

Is this breaking in the sense that we fail to deserialise these, or just that they're serialised differently when read back and written? Maybe we can add a test for this as well (apologies if one was already there and I didn't notice)

TestMapAttribute.test_attribute_children: raw maps now round-trip sets (e.g. BS)
TestMapAttribute.test_serialize_invalid_set: when assigning a set to a map, its elements must be homogeneous and of one of the supported types (e.g. no more {42, "ham"})

The previous behavior is low on my priority to preserve, since

it doesn't roundtrip

m.raw_map.my_stuff = {42, "ham"}
m.save()
m.refresh()
assert type(m.raw_map.my_stuff) is list

now we fail loudly so change is apparent
old behavior doesn't offer advantages
old behavior doesn't give a way to use sets in raw maps

ikonst · 2022-11-17T14:47:30Z

which also feels wrong, since they are part of the DynamoDB API

Agreed. There are many ORMs out there; a DynamoDB-specific one had better cover what DynamoDB supports.

On the topic of JSON-izing, here's how Amazon goes about it. In the AWS console, if you add a binary attribute and then switch from "Form" to "JSON view", there's a nested option to toggle between "JSON" and "DynamoDB JSON", and that option is grayed out once there are binary attributes or sets.

If designing from scratch, I'd expect

one method which returns a "JSON" and raises if sets or binaries are present
another method which returns a "DynamoDB JSON" (with binaries as base64 strs, such that json can serialize them)
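
A rough sketch of what those two methods might do, written as free functions over DynamoDB attribute values. The names `plain_json` and `dynamodb_json` are hypothetical, and parsing numbers with `json.loads` is a simplification chosen for this sketch:

```python
import base64
import json
from typing import Any, Dict


def plain_json(attr: Dict[str, Any]) -> Any:
    """Unwrap a DynamoDB attribute value; raise on sets and binaries."""
    (type_key, value), = attr.items()
    if type_key in {"B", "BS", "NS", "SS"}:
        raise ValueError(f"{type_key} has no plain JSON representation")
    if type_key == "M":
        return {k: plain_json(v) for k, v in value.items()}
    if type_key == "L":
        return [plain_json(v) for v in value]
    if type_key == "N":
        return json.loads(value)  # DynamoDB transmits numbers as strings
    if type_key == "NULL":
        return None
    return value  # S and BOOL map directly


def dynamodb_json(attr: Dict[str, Any]) -> Dict[str, Any]:
    """Keep the type keys, but turn bytes into base64 strs for json.dumps."""
    (type_key, value), = attr.items()
    if type_key == "B":
        return {"B": base64.b64encode(value).decode()}
    if type_key == "BS":
        return {"BS": [base64.b64encode(v).decode() for v in value]}
    if type_key == "M":
        return {"M": {k: dynamodb_json(v) for k, v in value.items()}}
    if type_key == "L":
        return {"L": [dynamodb_json(v) for v in value]}
    return {type_key: value}
```

The typed form stays reversible because the B and BS keys say which strings to decode; the plain form refuses the lossy cases instead of guessing.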

docs/upgrading_binary.rst

garrettheel · 2022-11-18T15:01:26Z

Couple of remaining thoughts:

Should we document that serialize() now retains values as bytes, since that method is public and can break anyone who tries to json encode the output? I know we don't have release notes in here yet but it might be helpful to add some while we're discussing everything
Regarding below, I wonder whether there are any changes we should implement now to better warn users about these kinds of issues. For example we could split out to_dynamodb_json and to_simple_json (or similar) and mark to_json as deprecated so that we can remove it in a future release. I'm OK to defer this decision until later if you want, since the release is already getting quite large.

If designing from scratch, I'd expect

one method which returns a "JSON" and raises if sets or binaries are present

another method which returns a "DynamoDB JSON" (with binaries as base64 strs, such that json can serialize them)


ikonst · 2022-11-18T15:43:46Z

I like to_dynamodb_json and to_simple_json. One caveat is that current API is to_dict -> dict and to_json -> str.

I find that to_json is almost never what you want, since it's common that you'd want to use the pre-serialized dict as part of something bigger (e.g. as part of something you're caching in redis, as a list of items...). Would it be OK to name it to_dynamodb_json but have it return a dict?

Also, initially I thought:

def to_dynamodb_json(self) -> str:
    return json.dumps(self.serialize(), cls=EncoderWhichBase64sBytes)

but it's no good for the "use it in something larger" case.

ikonst · 2022-11-18T16:02:03Z

Here's a thought. Given a JSON encoder:

import base64
import json
from typing import Any


class CustomEncoder(json.JSONEncoder):
    def default(self, obj: Any) -> Any:
        if isinstance(obj, bytes):
            return base64.b64encode(obj).decode()
        return super().default(obj)

you can do:

class MyModel(Model):
   ...
   picture = BinaryAttribute(legacy_encoding=False)

data = {"items": [my_model.serialize()]}
data_str = json.dumps(data, cls=CustomEncoder)

Now, how do you deserialize that? You can do

data = json.loads(data_str)
items = [
   MyModel.from_raw_data(item)
   for item in data['items']
]

but that's broken since BinaryAttribute(legacy_encoding=False) will just pass through the str value.

What if we change

 class BinaryAttribute:
     def deserialize(self, value):
-        if self.legacy_encoding:
+        if self.legacy_encoding or isinstance(value, str):
             return b64decode(value)
         return value

and then perhaps get rid of _handle_binary_attributes, which you suspect can slow things down?
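
To check that the idea holds together, here is a standalone sketch of the whole round trip. `deserialize_binary` is a hypothetical free-function stand-in for the proposed BinaryAttribute.deserialize, and the item is a hand-written dict rather than real Model.serialize output:

```python
import base64
import json
from typing import Any, Union


class CustomEncoder(json.JSONEncoder):
    def default(self, obj: Any) -> Any:
        if isinstance(obj, bytes):
            return base64.b64encode(obj).decode()
        return super().default(obj)


def deserialize_binary(value: Union[bytes, str], legacy_encoding: bool = False) -> bytes:
    """Stand-in for the proposed deserialize: decode legacy values and strs."""
    if legacy_encoding or isinstance(value, str):
        return base64.b64decode(value)
    return value


# Hand-written stand-in for a serialized item holding raw bytes.
serialized = {"picture": {"B": b"\x89PNG"}}
data_str = json.dumps({"items": [serialized]}, cls=CustomEncoder)

# After json.loads the B value is a base64 str, which the check now decodes.
item = json.loads(data_str)["items"][0]
picture = deserialize_binary(item["picture"]["B"])
```

Values that arrive as bytes still pass through untouched, so the isinstance check only affects the JSON path.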

docs/release_notes.rst

Co-authored-by: Garrett <garrettheel@users.noreply.github.com>

Since #1112, `Model.serialize` return value is not necessarily JSON-serializable. We're adding `to_dynamodb_dict` as a replacement (basically, `bytes` are base-64 encoded), and `to_simple_dict` to provide for a common ask of a "simple" JSON representation similar to what the AWS Console defaults to (previously called `to_dict` in earlier versions of the 6.x branch). We're making those methods of `AttributeContainer` rather than `Model` so they'd be equally applicable to `MapAttribute`.

ikonst mentioned this pull request Nov 16, 2022

Split BinaryAttribute into BinaryDataAttribute and LegacyBinaryAttribute #1110

Closed

ikonst force-pushed the 2022-11-16-binary-attr-v2 branch 2 times, most recently from b019707 to 94d5522, November 16, 2022 06:53

Fix double base64 for binary attributes

1fe4b3d

ikonst force-pushed the 2022-11-16-binary-attr-v2 branch from 94d5522 to 1fe4b3d, November 16, 2022 06:57

ikonst commented Nov 16, 2022

ikonst added 3 commits November 16, 2022 09:37

short name for MapAttribute link

3e92a2d

more nuanced docs

f351f62

clarify bug

b438e6a

ikonst requested a review from garrettheel November 16, 2022 15:31

ikonst added 2 commits November 18, 2022 00:06

better PickleAttribute w/o type:ignore

ff72286

brevity in guideline doc

a3e997c

garrettheel reviewed Nov 18, 2022

docs/upgrading_binary.rst (outdated, resolved)

garrettheel reviewed Nov 18, 2022

docs/upgrading_binary.rst (outdated, resolved)

garrettheel reviewed Nov 18, 2022

docs/upgrading_binary.rst (outdated, resolved)

ikonst added 2 commits November 18, 2022 10:22

rephrase docs per Garrett's feedback

b4600a6

Merge remote-tracking branch 'upstream/master' into 2022-11-16-binary-attr-v2

8759b74

add a warning to Model.serialize

9c8eb9b

ikonst mentioned this pull request Nov 18, 2022

Prevent double encoding of binary #129

Closed

1 task

add to release notes

88b4170

ikonst added 3 commits December 2, 2022 22:40

Merge branch 'master' into 2022-11-16-binary-attr-v2

c946d12

remove space

fb12e66

avoid -> prevent

3554f5f

ikonst requested a review from garrettheel December 3, 2022 03:49

garrettheel reviewed Dec 5, 2022

docs/release_notes.rst (outdated, resolved)

garrettheel approved these changes Dec 5, 2022

Update release notes re: to_dict/to_json

21bbeb0

Co-authored-by: Garrett <garrettheel@users.noreply.github.com>

ikonst changed the title ~~Fix double base64 for binary attributes~~ Fix double base64 of binary attributes Dec 5, 2022

ikonst merged commit ddb8f7d into master Dec 5, 2022

ikonst deleted the 2022-11-16-binary-attr-v2 branch December 5, 2022 20:05

ikonst mentioned this pull request Dec 7, 2022

Replace to_dict with to_simple_dict, to_dynamodb_dict #1126

Merged
