[proposal] Make an option for lazy-deserialization of complex types #1373

Closed
ak0rz opened this issue May 19, 2022 · 3 comments

Comments

@ak0rz
Contributor

ak0rz commented May 19, 2022

One of the crucial performance bottlenecks that protobuf-java addresses is eager deserialization of complex (e.g. variable-sized) types.
There are even classes like LazyField<T> that store a ByteString underneath until the field is actually accessed.

If we don't actually care about binary compatibility (which is usually the case for the generated code), we may accomplish that in a neat API-compatible way.

First, we introduce a ByteStringCodec[T] type class, a sealed LazyField[T] trait, and case classes that model such a data type:

import com.google.protobuf.ByteString

/**
 * Base trait for a codec of an arbitrary field type.
 * Implicit codecs should be generated in every message companion.
 * Base implicits (for [[String]] and [[Seq]]) should be hand-written
 * in the [[ByteStringCodec]] companion.
 */
trait ByteStringCodec[T] {
  def fromByteString(bs: ByteString): T
  def toByteString(value: T): ByteString
}

object ByteStringCodec {
  // Summoner: ByteStringCodec[T] resolves the implicit instance for T
  def apply[T: ByteStringCodec]: ByteStringCodec[T] = implicitly[ByteStringCodec[T]]

  implicit val stringByteStringCodec: ByteStringCodec[String] = new ByteStringCodec[String] {
    def fromByteString(bs: ByteString): String = bs.toStringUtf8
    def toByteString(value: String): ByteString = ByteString.copyFromUtf8(value)
  }

  // ...any other arbitrary implicits
}

sealed trait LazyField[T] {
  def value: T
  def byteString: ByteString
}

case class SerializedField[T: ByteStringCodec](byteString: ByteString) extends LazyField[T] {
  lazy val value: T = ByteStringCodec[T].fromByteString(byteString)
}

case class ValueField[T: ByteStringCodec](value: T) extends LazyField[T] {
  lazy val byteString: ByteString = ByteStringCodec[T].toByteString(value)
}

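object LazyField {
  import scala.language.implicitConversions

  // Unwrapping forces deserialization; wrapping defers serialization
  implicit def lazyFieldUnwrap[T](field: LazyField[T]): T = field.value
  implicit def lazyFieldWrap[T: ByteStringCodec](value: T): LazyField[T] = ValueField[T](value)
}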

Then, when the option for lazy fields is enabled (at the file, message, or field level), we just generate LazyField[T] instead of T for any type we consider "complex" (e.g. string, repeated, message):

option (scalapb.options).lazy_eval = true;

message Foo {
  option (scalapb.message).lazy_eval = true;
  string bar = 1 [(scalapb.field).lazy_eval = true];
}

case class Foo(bar: LazyField[String]) extends scalapb.GeneratedMessage {
  ...
}

object Foo extends scalapb.GeneratedMessageCompanion[Foo] {
  implicit val fooCodec: ByteStringCodec[Foo] = ???

  ...
}

The implicits in the LazyField companion object should maintain API compatibility, so end users shouldn't notice anything.
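To make the compatibility claim concrete, here is a minimal, self-contained sketch of how user code might look against such a field. It assumes protobuf-java is on the classpath; `Foo` is a hand-written stand-in for generated code, and `decodeCount` is a hypothetical counter added only to make the laziness observable.

```scala
import com.google.protobuf.ByteString
import scala.language.implicitConversions

object LazyFieldUsageSketch {
  trait ByteStringCodec[T] {
    def fromByteString(bs: ByteString): T
    def toByteString(value: T): ByteString
  }

  // Hypothetical counter: incremented every time a String is actually decoded.
  var decodeCount = 0

  implicit val stringCodec: ByteStringCodec[String] = new ByteStringCodec[String] {
    def fromByteString(bs: ByteString): String = { decodeCount += 1; bs.toStringUtf8 }
    def toByteString(value: String): ByteString = ByteString.copyFromUtf8(value)
  }

  sealed trait LazyField[T] {
    def value: T
    def byteString: ByteString
  }

  final case class SerializedField[T](byteString: ByteString)(implicit codec: ByteStringCodec[T])
      extends LazyField[T] {
    lazy val value: T = codec.fromByteString(byteString)
  }

  final case class ValueField[T](value: T)(implicit codec: ByteStringCodec[T])
      extends LazyField[T] {
    lazy val byteString: ByteString = codec.toByteString(value)
  }

  object LazyField {
    implicit def lazyFieldUnwrap[T](field: LazyField[T]): T = field.value
    implicit def lazyFieldWrap[T: ByteStringCodec](value: T): LazyField[T] = ValueField[T](value)
  }

  // Stand-in for a generated message with one lazy string field.
  final case class Foo(bar: LazyField[String])

  // Simulates a freshly parsed message: only the raw bytes are stored.
  val parsed: Foo = Foo(SerializedField[String](ByteString.copyFromUtf8("hello")))
  val countAfterParse: Int = decodeCount // no decoding has happened yet

  val asString: String = parsed.bar      // lazyFieldUnwrap forces the decode
  val length: Int = parsed.bar.length    // member access unwraps too, reusing the cached value
  val countAfterAccess: Int = decodeCount

  val built: Foo = Foo("world")          // lazyFieldWrap lifts a plain String
}
```

One limit of this approach: implicit conversions are not applied to `==`, so in this sketch `parsed.bar == "hello"` compares a `LazyField` with a `String` and is false, and a `SerializedField` never equals a `ValueField` holding the same value. Generated code would probably need to define equality on the decoded value.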

@thesamet
Contributor

Thanks for the clear write-up. This seems like a very interesting idea which I'd like us to explore!

A few thoughts:

  1. I briefly looked up LazyField in protobuf-java, but I wasn't able to tell when it's being utilized. I can't recall ever seeing it in generated code. I haven't used the lite implementation much and it seems to be related. Do you have more information on the choices made by protobuf-java?
  2. There are cases to watch for when the user combines this with other ScalaPB customizations (no_box, custom types, custom collection types, and so on). This can lead to additional complexity in the generator, or to cases where the user needs to supply their own ByteStringCodec instance via an import option.
  3. Methods such as getField and PMessage processing need to wrap/unwrap properly.

In summary, I'm interested in seeing this in a PR and ultimately landing it as an optional feature, though let's weigh the complexity it adds to the code base against the benefits once there's a PR.

@ak0rz
Contributor Author

ak0rz commented May 23, 2022

> I briefly looked up LazyField in protobuf-java, but I wasn't able to tell when it's being utilized. I can't recall ever seeing it in generated code. I haven't used the lite implementation much and it seems to be related. Do you have more information on the choices made by protobuf-java?

Actually, the simplest example is the way they treat String values in generated code:

private java.lang.Object name_ = "";
/**
  * <code>string name = 1;</code>
  * @return The name.
  */
public java.lang.String getName() {
  java.lang.Object ref = name_;
  if (!(ref instanceof java.lang.String)) {
    com.google.protobuf.ByteString bs =
      (com.google.protobuf.ByteString) ref;
    java.lang.String s = bs.toStringUtf8();
    name_ = s;
    return s;
  } else {
    return (java.lang.String) ref;
  }
}

You may note here that they also throw away the underlying ByteString (which is good for the memory footprint), but I'm not sure whether it would be better to keep it for later serialization. Anyway, this code looks hacky and gives "at least once" deserialization semantics.
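For comparison, the two strategies can be sketched side by side in Scala. This is only an illustration, assuming protobuf-java is on the classpath; both class names are hypothetical and neither is taken from protobuf-java or ScalaPB.

```scala
import com.google.protobuf.ByteString

object SwappingFieldSketch {
  // Mirrors the Java getter above: one slot holds either the raw bytes or the
  // decoded String. The bytes become garbage once decoded. There is no
  // synchronization, so two racing threads may both decode ("at least once").
  final class SwappingStringField(bytes: ByteString) {
    private var ref: AnyRef = bytes

    def get: String = ref match {
      case s: String => s
      case bs: ByteString =>
        val s = bs.toStringUtf8
        ref = s // replace the bytes with the decoded value
        s
    }

    def isDecoded: Boolean = ref.isInstanceOf[String]
  }

  // The lazy val variant: the bytes are kept for later re-serialization and
  // the decode runs at most once, at the cost of holding both representations.
  final class CachingStringField(val bytes: ByteString) {
    lazy val get: String = bytes.toStringUtf8
  }

  val swapping = new SwappingStringField(ByteString.copyFromUtf8("name"))
  val before: Boolean = swapping.isDecoded
  val name: String = swapping.get
  val after: Boolean = swapping.isDecoded

  val caching = new CachingStringField(ByteString.copyFromUtf8("name"))
  val cachedName: String = caching.get
}
```

The first variant trades a possible repeated decode for a smaller footprint; the second keeps the original bytes available, which matters if the message is likely to be re-serialized unchanged.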

> There are cases to watch for when the user combines this with other ScalaPB customizations (no_box, custom types, custom collection types, and so on). This can lead to additional complexity in the generator, or to cases where the user needs to supply their own ByteStringCodec instance via an import option.

I suppose this is true, but I think it can be mitigated somewhat by specifying the types in advance and doing a 'little bit' of refactoring in the generator. I'm working on a PoC PR for this issue, and type derivation is indeed one of the most painful parts.

> Methods such as getField and PMessage processing need to wrap/unwrap properly.

That's not a problem, since it will use the same implicit conversions as regular user code, just inside the generated code.

> In summary, I'm interested in seeing this in a PR and ultimately landing it as an optional feature, though let's weigh the complexity it adds to the code base against the benefits once there's a PR.

I'm on it, and I'll try to measure it with benchmarks.
But there may also be other caveats:

  1. Lazy deserialization makes any protocol-level errors blow up on field access instead of in parseFrom. Unwrapping such fields could return an Either of an error and a value, or we could just leave this as the user's responsibility when they enable this generator option.
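The Either-based option could look roughly like the sketch below. It assumes protobuf-java is on the classpath; `safeValue` and the strict String codec are hypothetical names introduced for illustration and are not part of ScalaPB.

```scala
import com.google.protobuf.ByteString
import scala.util.Try

object SafeAccessSketch {
  trait ByteStringCodec[T] {
    def fromByteString(bs: ByteString): T
  }

  // Hypothetical strict codec: rejects invalid UTF-8 instead of silently
  // substituting replacement characters, so decoding can actually fail.
  implicit val strictStringCodec: ByteStringCodec[String] = new ByteStringCodec[String] {
    def fromByteString(bs: ByteString): String =
      if (bs.isValidUtf8) bs.toStringUtf8
      else throw new IllegalArgumentException("field contains invalid UTF-8")
  }

  final case class SerializedField[T](byteString: ByteString)(implicit codec: ByteStringCodec[T]) {
    // Decoding is attempted once; the outcome (value or error) is cached.
    lazy val safeValue: Either[Throwable, T] =
      Try(codec.fromByteString(byteString)).toEither

    // Throwing accessor for callers that accept failure on field access.
    def value: T = safeValue match {
      case Right(v) => v
      case Left(e)  => throw e
    }
  }

  val good = SerializedField[String](ByteString.copyFromUtf8("ok"))
  // A lone 0xFF byte is never valid UTF-8.
  val bad = SerializedField[String](ByteString.copyFrom(Array[Byte](0xff.toByte)))
}
```

With this shape, `good.safeValue` is a `Right` and `bad.safeValue` is a `Left`, while `bad.value` throws only when the field is accessed. Exposing both accessors would let users opt into explicit error handling per field without changing the default API.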

@thesamet
Contributor

Closing due to inactivity. Feel free to comment if this is still needed.

ak0rz added a commit to ak0rz/ScalaPB that referenced this issue Jan 29, 2024
ak0rz added a commit to ak0rz/ScalaPB that referenced this issue Jan 31, 2024