Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix double base64 of binary attributes #1112

Merged
merged 14 commits into from
Dec 5, 2022
Merged

Conversation

ikonst
Copy link
Contributor

@ikonst ikonst commented Nov 16, 2022

For many versions (verified in versions 3-5) we had a known issue where

  1. top-level Binary(Set)Attributes were double-base64-encoded in the underlying DynamoDB table item
  2. nested Binary(Set)Attributes were unusable: every serialization would add another round of Base64 encoding(!), so nested binary attributes could not have been used in practice

Clearly (1) is inefficient and (2) is unusable, and we wanted PynamoDB to be efficient (and not broken) by default, but similarly didn't want to risk breaking systems during an upgrade to PynamoDB 6. With that in mind, we're introducing a legacy_encoding required parameter to BinaryAttribute and BinarySetAttribute, the rationale being a new required parameter will prompt an informed decision. The parameter will exist throughout the lifetime of PynamoDB 6.x and perhaps removed in a future version (in favor of legacy_encoding=False becoming the default behavior).

Since nested binary attributes were previously unusable, we're safely defaulting to legacy_encoding=False for nested attributes (and forcing it to be False when binary attributes are declared in non-raw maps).

Along with this, a slight breaking change in introduced: A set in a "raw" list or map have previously serialized as a list (L) but will now result in a string/number/binary set:

  • for all strings, string set (SS)
  • for all numbers, number set (NS)
  • for all bytes, binary set (BS)
  • else, raises an error on serialization

Additions to docs:

Comment on lines +445 to +447
attribute_value[BINARY] = base64.b64decode(attribute_value.pop(STRING))
elif attr_type == BINARY_SET and LIST in attribute_value:
attribute_value[BINARY_SET] = [base64.b64decode(v[STRING]) for v in attribute_value.pop(LIST)]
Copy link
Contributor Author

@ikonst ikonst Nov 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we base64-decode here when we didn't use to? Since BinaryAttribute(legacy_encoding=False).deserialize does not base64-decode (assuming _convert_binary in connection/base.py already did the job), we have to do it here for the sake of Model.from_dict.

Comment on lines +1394 to +1399
elif MAP in attr:
for sub_attr in attr[MAP].values():
_convert_binary(sub_attr)
elif LIST in attr:
for sub_attr in attr[LIST]:
_convert_binary(sub_attr)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maps and lists are now being recursed into. This matches botocore behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting that this is going to have some performance impact (I actually did notice this as part of the delta in #1079). We might want to run benchmarks to get a sense for it and include it in the release notes.

If the difference is meaningful then we may also want to add an optimisation for short-circuiting this when binary attributes aren't in use

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merely iterating a dict/list recursively has a notable performance effect?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the average case I doubt it, but my guess is the high-throughput cases with larger items (e.g. containing large lists) would notice something. I can use the benchmark script in #1079 with varying sized lists to get a sense for it - I might be wrong and it's small enough to not be of concern

Comment on lines +29 to +33
if attr_type == BINARY:
return base64.b64encode(attr_value).decode()
if attr_type == BINARY_SET:
return [base64.b64encode(v).decode() for v in attr_value]
if attr_type in {BOOLEAN, STRING, STRING_SET}:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we base64-encode here when we didn't use to? Since BinaryAttribute(legacy_encoding=False).serialize does not base64-encode anymore (assuming botocore would do the job), we have to do it here for the sake of Model.to_dict.

@ikonst
Copy link
Contributor Author

ikonst commented Nov 16, 2022

BTW, now that Attribute.(de)serialize and subsequently Model.(de)serialize are not (de)serializing a JSON-serializable dict (for DynamoDB API) but a dict for botocore (with non-JSON-serializable bytes), there’s an unfortunate gap forming:

  • Model.serialize ’s output cannot be JSON-serialized (w/o custom encoder)
  • Model.to_dict ’s output does not always roundtrip — bytes in raw maps will be encoded as base64 strings, but then Model.from_dict will decode them as str containing base64

As a result, we don't have a JSON-serializable representation that roundtrips. Not sure what to do about that.

@garrettheel
Copy link
Member

Along with this, a slight breaking change in introduced: A set in a "raw" list or map have previously serialized as a list (L) but will now result in a string/number/binary set

Is this breaking in the sense that we fail to deserialise these, or just that they're serialised differently when read back and written? Maybe we can add a test for this as well (apologies if one was already there and I didn't notice)

@garrettheel
Copy link
Member

BTW, now that Attribute.(de)serialize and subsequently Model.(de)serialize are not (de)serializing a JSON-serializable dict (for DynamoDB API) but a dict for botocore (with non-JSON-serializable bytes), there’s an unfortunate gap forming:

  • Model.serialize ’s output cannot be JSON-serialized (w/o custom encoder)
  • Model.to_dict ’s output does not always roundtrip — bytes in raw maps will be encoded as base64 strings, but then Model.from_dict will decode them as str containing base64

As a result, we don't have a JSON-serializable representation that roundtrips. Not sure what to do about that.

I wonder if we should talk more about the options here. Seems like we're drifting towards binary attributes becoming a second-class citizen that contains so many footguns that we might as well not support them natively (which also feels wrong, since they are part of the DynamoDB API).

I'd be curious to see how much worse each of these points are becoming as of this PR and what it would take to ensure that from_dict, to_dict, serialize, etc. work better with binary. We can maybe do that elsewhere to avoid cluttering this PR too much

@ikonst
Copy link
Contributor Author

ikonst commented Nov 17, 2022

Along with this, a slight breaking change in introduced: A set in a "raw" list or map have previously serialized as a list (L) but will now result in a string/number/binary set

Is this breaking in the sense that we fail to deserialise these, or just that they're serialised differently when read back and written? Maybe we can add a test for this as well (apologies if one was already there and I didn't notice)

  • TestMapAttribute.test_attribute_children: raw maps now round-trip sets (e.g. BS)
  • TestMapAttribute.test_serialize_invalid_set: when assigning a set to a map, they must be heterogenous, and of one of the supported types (e.g no more {42, "ham"})

The previous behavior is low on my priority to preserve, since

  1. it doesn't roundtrip
    m.raw_map.my_stuff = {42, "ham"}
    m.save()
    m.refresh()
    assert type(m.raw_map.my_stuff) is list
  2. now we fail loudly so change is apparent
  3. old behavior doesn't offer advantages
  4. old behavior doesn't give a way to use sets in raw maps

@ikonst
Copy link
Contributor Author

ikonst commented Nov 17, 2022

which also feels wrong, since they are part of the DynamoDB API

Agree. There are many ORMs out there, a DynamoDB specific better cover what DynamoDB supports.

On the topic of JSON-izing, here's how Amazon goes about it. In the AWS console, if you add a binary attribute and then switch from "Form" to "JSON view", there's a nested option to toggle between "JSON" and "DynamoDB JSON" and that option is grayed out once there are binary attributes or sets:

image

If designing from scratch, I'd expect

  • one method which returns a "JSON" and raises if sets or binaries are present
  • another method which returns a "DynamoDB JSON" (with binaries as base64 strs, such that json can serialize them)

@garrettheel
Copy link
Member

Couple of remaining thoughts:

  1. Should we document that serialize() now retains values as bytes, since that method is public and can break anyone who tries to json encode the output? I know we don't have release notes in here yet but it might be helpful to add some while we're discussing everything

  2. Regarding below, I wonder whether there are any changes we should implement now to better warn users about these kinds of issues. For example we could split out to_dynamodb_json and to_simple_json (or similar) and mark to_json as deprecated so that we can remove it in a future release. I'm OK to defer this decision until later if you want, since the release is already getting quite large.

If designing from scratch, I'd expect

  • one method which returns a "JSON" and raises if sets or binaries are present
  • another method which returns a "DynamoDB JSON" (with binaries as base64 strs, such that json can serialize them)

@ikonst
Copy link
Contributor Author

ikonst commented Nov 18, 2022

I like to_dynamodb_json and to_simple_json. One caveat is that current API is to_dict -> dict and to_json -> str.

I find that to_json is almost never what you want, since it's common that you'd want to use the pre-serialized dict as part of something bigger (e.g. as part of something you're caching in redis, as a list of items...). Would it be OK to name it to_dynamodb_json but have it return a dict?

Also, initially I thought:

def to_dynamodb_json() -> str:
  return json.dumps(self.serialize(), cls=EncoderWhichBase64sBytes)

but it's no good for the "use it in something larger" case.

@ikonst
Copy link
Contributor Author

ikonst commented Nov 18, 2022

Here's a thought. Given a JSON encoder:

import base64
import json
from typing import Any


class CustomEncoder(json.JSONEncoder):
    def default(self, obj: Any) -> Any:
        if isinstance(obj, bytes):
            return base64.b64encode(obj).decode()
        return super().default(obj)

you can do:

class MyModel(Model):
   ...
   picture = BinaryAttribute(legacy_encoding=False)

data = {"items": [my_model.serialize()]}
data_str = json.dumps(data, cls=CustomEncoder)

Now, how do you deserialize that? You can do

data = json.loads(data_str)
items = [
   MyModel.from_raw_data(item)
   for item in data['items']
]

but that's broken since BinaryAttribute(legacy_encoding=False) will just pass through the `str value.

What if we change

 class BinaryAttribute
     def deserialize(self, value):
-        if self.legacy_encoding:
+        if self.legacy_encoding or isinstance(value, str):
             return b64decode(value)
         return value

and then perhaps get rid of _handle_binary_attributes, which you suspect can slow things down?

@ikonst ikonst mentioned this pull request Nov 18, 2022
1 task
docs/release_notes.rst Outdated Show resolved Hide resolved
Co-authored-by: Garrett <garrettheel@users.noreply.github.com>
@ikonst ikonst changed the title Fix double base64 for binary attributes Fix double base64 of binary attributes Dec 5, 2022
@ikonst ikonst merged commit ddb8f7d into master Dec 5, 2022
@ikonst ikonst deleted the 2022-11-16-binary-attr-v2 branch December 5, 2022 20:05
ikonst added a commit that referenced this pull request Dec 9, 2022
Since #1112, `Model.serialize` return value is not necessarily JSON-serializable. We're adding `to_dynamodb_dict` as a replacement (basically, `bytes` are base-64 encoded), and `to_simple_dict` to provide for a common ask of a "simple" JSON representation similar to what the AWS Console defaults to (previously called `to_dict` in earlier versions of the 6.x branch).

We're making those methods of `AttributeContainer` rather than `Model` so they'd be equally applicable to `MapAttribute`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants