[proposal] Make an option for lazy-deserialization of complex types #1373

ak0rz · 2022-05-19T09:21:05Z

One of the crucial performance bottlenecks that protobuf-java addresses is eager deserialization of complex (e.g. variable-sized) types.
There are even classes like LazyField<T> that store a ByteString underneath until the field is actually accessed.

If we don't actually care about binary compatibility (which is usually the case for the generated code), we may accomplish that in a neat API-compatible way.

First, we introduce a sealed LazyField[T] trait and case classes that model such a datatype, together with a codec type class:

import com.google.protobuf.ByteString

/**
 * Base trait for a codec of an arbitrary field type.
 * Implicit codecs should be generated in every message companion.
 * Base implicits (for [[String]] and [[Seq]]) should be hand-written
 * in the [[ByteStringCodec]] companion.
 */
trait ByteStringCodec[T] {
  def fromByteString(bs: ByteString): T
  def toByteString(value: T): ByteString
}

case object ByteStringCodec {
  def apply[T: ByteStringCodec]: ByteStringCodec[T] = implicitly

  implicit val stringByteStringCodec: ByteStringCodec[String] = new ByteStringCodec[String] {
    // T is not in scope inside this instance; the concrete type is String
    def fromByteString(bs: ByteString): String = bs.toStringUtf8
    def toByteString(value: String): ByteString = ByteString.copyFromUtf8(value)
  }

  // ...any other arbitrary implicits
}

sealed trait LazyField[T] {
  def value: T
  def byteString: ByteString
}

case class SerializedField[T: ByteStringCodec](byteString: ByteString) extends LazyField[T] {
  lazy val value: T = ByteStringCodec[T].fromByteString(byteString)
}

case class ValueField[T: ByteStringCodec](value: T) extends LazyField[T] {
  lazy val byteString: ByteString = ByteStringCodec[T].toByteString(value)
}

case object LazyField {
  // Avoids the feature warning for implicit conversions
  import scala.language.implicitConversions

  implicit def lazyFieldUnwrap[T](field: LazyField[T]): T = field.value
  implicit def lazyFieldWrap[T: ByteStringCodec](value: T): LazyField[T] = ValueField[T](value)
}

Then, when the option for lazy fields is enabled, we just generate LazyField[T] instead of T for any type we consider "complex" (e.g. string, repeated, message).

option (scalapb.options).lazy_eval = true;

message Foo {
  option (scalapb.message).lazy_eval = true;
  string bar = 1 [(scalapb.field).lazy_eval = true];
}

case class Foo(bar: LazyField[String]) extends scalapb.GeneratedMessage {
  ...
}

case object Foo extends scalapb.GeneratedMessageCompanion[Foo] {
  implicit val fooCodec: ByteStringCodec[Foo] = ???

  ...
}

Implicits from case object LazyField should maintain API compatibility and end-users shouldn't notice anything.
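To illustrate the compatibility claim, here is a minimal, dependency-free sketch. It is an assumption-laden stand-in, not generated ScalaPB code: `Array[Byte]` replaces `ByteString` so it runs without protobuf-java, `BytesCodec` is a hypothetical counterpart of `ByteStringCodec`, and `Foo` is hand-written.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.language.implicitConversions

// Hypothetical stand-in for ByteStringCodec, with Array[Byte] instead of ByteString.
trait BytesCodec[T] {
  def fromBytes(bytes: Array[Byte]): T
  def toBytes(value: T): Array[Byte]
}

object BytesCodec {
  implicit val stringCodec: BytesCodec[String] = new BytesCodec[String] {
    def fromBytes(bytes: Array[Byte]): String = new String(bytes, UTF_8)
    def toBytes(value: String): Array[Byte] = value.getBytes(UTF_8)
  }
}

sealed trait LazyField[T] {
  def value: T
  def bytes: Array[Byte]
}

final case class SerializedField[T: BytesCodec](bytes: Array[Byte]) extends LazyField[T] {
  // Decoded on first access only
  lazy val value: T = implicitly[BytesCodec[T]].fromBytes(bytes)
}

final case class ValueField[T: BytesCodec](value: T) extends LazyField[T] {
  // Encoded on first access only
  lazy val bytes: Array[Byte] = implicitly[BytesCodec[T]].toBytes(value)
}

object LazyField {
  // Living in the companion puts both conversions in implicit scope,
  // so call sites need no extra import.
  implicit def unwrap[T](field: LazyField[T]): T = field.value
  implicit def wrap[T: BytesCodec](value: T): LazyField[T] = ValueField[T](value)
}

// Hand-written approximation of what a generated message might look like.
final case class Foo(bar: LazyField[String])

object ApiCompatDemo {
  def main(args: Array[String]): Unit = {
    val built  = Foo(bar = "hello")                                   // String is wrapped
    val parsed = Foo(SerializedField[String]("wire".getBytes(UTF_8))) // as if read from the wire
    val s: String = parsed.bar                                        // unwrapped by expected type
    println(built.bar.length)                                         // unwrapped by member selection
    println(s.toUpperCase)
  }
}
```

The conversions apply when an expected type or a missing member triggers them. One thing this sketch does not cover is `==`, which is defined on `Any` and therefore does not trigger the unwrap conversion.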


thesamet · 2022-05-19T13:17:42Z

Thanks for the clear write-up. This seems like a very interesting idea which I'd like us to explore!

A few thoughts:

I briefly looked up LazyField in protobuf-java, but I wasn't able to tell when it's being utilized. I can't recall ever seeing it in generated code. I haven't used the lite implementation much and it seems to be related. Do you have more information on the choices made by protobuf-java?
There are cases to watch for when the user combines this with other ScalaPB customizations (no_box, custom types, custom collection types, and so on). This can lead to additional complexity in the generator, or to cases where the user needs to supply their own ByteStringCodec instance via an import option.
Methods such as getField and PMessage processing need to wrap/unwrap properly.

In summary, I'm interested in seeing this in a PR and ultimately landing it as an optional feature, though let's weigh the complexity it adds to the code base against the benefits once there's a PR.

ak0rz · 2022-05-23T12:20:40Z

> I briefly looked up LazyField in protobuf-java, but I wasn't able to tell when it's being utilized. I can't recall ever seeing it in generated code. I haven't used the lite implementation much and it seems to be related. Do you have more information on the choices made by protobuf-java?

Actually, the simplest example is the way they treat String values in generated code:

private java.lang.Object name_ = "";
/**
  * <code>string name = 1;</code>
  * @return The name.
  */
public java.lang.String getName() {
  java.lang.Object ref = name_;
  if (!(ref instanceof java.lang.String)) {
    com.google.protobuf.ByteString bs =
      (com.google.protobuf.ByteString) ref;
    java.lang.String s = bs.toStringUtf8();
    name_ = s;
    return s;
  } else {
    return (java.lang.String) ref;
  }
}

You may note that they also throw away the underlying ByteString (which is good for the memory footprint), but I'm not sure whether it would be better to keep it for later serialization. Anyway, this code looks hacky and gives "at least once" deserialization.
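For comparison, the same trick can be sketched in Scala. This is only an illustration of the pattern in the Java getter above, not something ScalaPB generates: `LazyStringSlot` and `isDecoded` are hypothetical names, and `Array[Byte]` stands in for `ByteString` so the snippet runs without protobuf-java.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// The slot starts out holding raw bytes and is overwritten with the decoded
// String on first access, mirroring the name_ field in the Java snippet.
final class LazyStringSlot(initial: Array[Byte]) {
  // Neither volatile nor synchronized, like the field shown above: threads
  // racing on the first access may each decode once, hence "at least once".
  // Every decode yields an equal String, so the result stays correct.
  private var ref: Any = initial

  def get: String = ref match {
    case s: String => s
    case bytes: Array[Byte] =>
      val s = new String(bytes, UTF_8)
      ref = s // the byte array is dropped here and can be garbage collected
      s
  }

  // For illustration only: reports whether decoding has already happened.
  def isDecoded: Boolean = ref.isInstanceOf[String]
}

object LazyStringSlotDemo {
  def main(args: Array[String]): Unit = {
    val slot = new LazyStringSlot("name".getBytes(UTF_8))
    println(slot.isDecoded) // false
    println(slot.get)       // name
    println(slot.isDecoded) // true
  }
}
```

Because the bytes are discarded after decoding, serializing such a field later has to re-encode the String, which is the trade-off questioned above.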

> There are cases to watch for when the user combines this with other ScalaPB customizations (no_box, custom types, custom collection types, and so on). This can lead to additional complexity in the generator, or to cases where the user needs to supply their own ByteStringCodec instance via an import option.

I suppose this is true, but I think it can be relaxed a bit by specifying the types in advance and doing a 'little bit' of refactoring in the generator. I'm working on a PoC PR for this issue, and type derivation is indeed one of the most painful parts.

> Methods such as getField and PMessage processing need to wrap/unwrap properly.

That's not a problem, since it will use the same implicit conversions as regular user code, just in the generated context.

> In summary, I'm interested in seeing this in a PR and ultimately landing it as an optional feature, though let's weigh the complexity it adds to the code base against the benefits once there's a PR.

I'm on it, and I'll try to measure it in benchmarks.
There may also be other caveats:

Lazy deserialization makes any protocol-level error surface on field access instead of in parseFrom. Unwrapping such fields could return an Either of error and value, or we could leave this as the user's responsibility when they enable the generator option.
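Both options can sit side by side, roughly as sketched below. All names here (`LazyString`, `decoded`, `value`) are hypothetical, `Array[Byte]` again stands in for `ByteString`, and strict UTF-8 decoding is used only as a convenient example of a parse step that can fail after the message itself has been read.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

// Hypothetical lazy string field that defers decoding, and therefore errors.
final class LazyString(bytes: Array[Byte]) {
  // Safe accessor: malformed input is reported as a Left instead of thrown.
  // The outcome, success or failure, is computed once and cached.
  lazy val decoded: Either[CharacterCodingException, String] =
    try {
      val decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
      Right(decoder.decode(ByteBuffer.wrap(bytes)).toString)
    } catch {
      case e: CharacterCodingException => Left(e)
    }

  // Plain accessor: keeps the usual signature but throws on access,
  // which is the behavior the caveat above warns about.
  def value: String = decoded match {
    case Right(s) => s
    case Left(e)  => throw e
  }
}

object LazyErrorDemo {
  def main(args: Array[String]): Unit = {
    val good = new LazyString("ok".getBytes(StandardCharsets.UTF_8))
    // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
    val bad = new LazyString(Array(0xC3.toByte, 0x28.toByte))
    println(good.decoded)       // Right(ok)
    println(bad.decoded.isLeft) // true
  }
}
```

Constructing `bad` succeeds, which is exactly the point: the failure only becomes visible when `decoded` or `value` is first touched.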

thesamet · 2023-10-17T17:30:20Z

Closing due to inactivity. Feel free to comment if this is still needed.

ak0rz mentioned this issue May 19, 2022

[proposal] Provide a way to inject String deduplication into the deserialization pipeline #1374

Closed

ak0rz mentioned this issue May 23, 2022

[WIP] #1373 lazy fields #1376

Closed

thesamet closed this as completed Oct 17, 2023

ak0rz added a commit to ak0rz/ScalaPB that referenced this issue Jan 29, 2024

scalapb#1373 Lazy protobuf parsing implementation

e91ffe0

ak0rz added a commit to ak0rz/ScalaPB that referenced this issue Jan 31, 2024

scalapb#1373 Implementation of lazy parsing for LEN encoded types

5402c2c
