feat(chain): add ParquetSerializer #140
base: main
Conversation
Force-pushed from dfbdab3 to e012443.
Commits:
- add code first pass no tests
- remove overwrite of Serializer
- add pyarrow-stubs
- add ignore missing imports
- pre commit happy
- figure out polars datatypes
Force-pushed from e012443 to 1782aee.
@@ -167,6 +173,20 @@ def build_step(self, name: str, ctx: Pipeline, previous: Step) -> Step:
    )


@dataclass
class ParquetSerializer(Applier[Message[TIn], bytes], Generic[TIn]):
Did you try using this class in an example or test to make sure it works?
Please see my comment on the dependencies
"polars==1.30.0", | ||
"pyarrow==19.0.0", | ||
"pyarrow-stubs==19.0", | ||
"pandas==2.2.3", |
Please do not pin versions.
polars>=1.30.0
pyarrow>=19.0.0
pandas>=2.2.3
If you pin versions and a client of the library requires a different version, they will not be able to use this library.
Also, please set the minimum versions to something older. These are the most recent releases of all those packages; requiring them would force users to be fully up to date with everything for no reason.
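For context, a minimal pyproject.toml sketch of what unpinned constraints could look like (the layout below is illustrative, not the repo's actual file):

```toml
[project]
dependencies = [
    # Lower bounds only, no pins; per the review, these minimums should
    # ideally be older than the latest releases shown here.
    "polars>=1.30.0",
    "pyarrow>=19.0.0",
    "pandas>=2.2.3",
]
```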
Also, you should not need pyarrow-stubs in the main dependencies. Please add it to the dev section.
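A sketch of what that could look like, assuming the dev section is an optional-dependencies extra (this repo's actual pyproject.toml layout is an assumption):

```toml
[project.optional-dependencies]
dev = [
    # Type stubs are only needed by mypy/CI, never at runtime.
    "pyarrow-stubs>=19.0",
]
```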
"polars==1.30.0", | ||
"pyarrow==19.0.0", | ||
"pyarrow-stubs==19.0", | ||
"pandas==2.2.3", |
Do you need pandas only for tests?
If yes, please do not add this dependency. We should minimize the dependencies a library imports; that makes it easier to use in other code bases.
Option 1: do without pandas (see the sketch below).
Option 2: add it only to the dev section we use for tests.
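A hypothetical sketch of Option 1, building the expected frame with polars itself so the test needs no pandas at all (variable names assumed):

```python
import io

import polars as pl

# Build the expected parquet bytes directly with polars instead of pandas.
expected_df = pl.DataFrame({"id": [1, 2], "name": ["a", "b"]})
buf = io.BytesIO()
expected_df.write_parquet(buf)
expected = buf.getvalue()
```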
"polars==1.30.0", | ||
"pyarrow==19.0.0", | ||
"pyarrow-stubs==19.0", | ||
"pandas==2.2.3", |
Also consider making these libraries optional dependencies.
https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#dependencies-optional-dependencies
That way, if snuba does not want to use parquet or arrow together with all their transitive dependencies, it does not need to import those libraries. This also makes the library easier for clients to use.
This would affect how the classes are imported: the ParquetSerializer would have to live in its own module that the client imports, and the streaming code would not be able to import it directly, which is likely OK.
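A sketch of what such an extra could look like (the extra's name and bounds are illustrative):

```toml
[project.optional-dependencies]
parquet = [
    # Installed only by clients that opt in, e.g. `pip install <pkg>[parquet]`;
    # everyone else avoids polars/pyarrow and their transitive dependencies.
    "polars>=1.30.0",
    "pyarrow>=19.0.0",
]
```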
def _get_parquet(msg: Message[Any], schema_fields: PolarsSchema) -> bytes:
    df = pl.DataFrame(
        [i for i in msg.payload if i is not None],
How does this work? Message[Any] is not known to be iterable to the type checker.
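For illustration, a hypothetical tightening of the signature that would make the iteration type-check explicitly (the real payload type is not confirmed by the PR; Message and PolarsSchema come from the PR's own module):

```python
from typing import Any, Mapping, Optional, Sequence

def _get_parquet(
    msg: Message[Sequence[Optional[Mapping[str, Any]]]],
    schema_fields: PolarsSchema,
) -> bytes:
    # With a concrete payload type, mypy can verify this comprehension
    # instead of silently accepting Any.
    rows = [row for row in msg.payload if row is not None]
    ...
```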
Also, what's the type of msg.payload? If it is pyarrow.Table or pyarrow.Array, there should be a from_arrow method (https://docs.pola.rs/api/python/stable/reference/api/polars.from_arrow.html#polars-from-arrow) which should try to auto-infer the types.
If instead it is a python dictionary, then why do we need arrow at all?
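A minimal sketch of the from_arrow path, assuming the payload really is an arrow table:

```python
import io

import polars as pl
import pyarrow as pa

# If the payload is already arrow data, polars ingests it directly and
# infers the schema; no manual row iteration or dtype mapping needed.
arrow_table = pa.table({"id": [1, 2], "name": ["a", "b"]})
df = pl.from_arrow(arrow_table)
assert isinstance(df, pl.DataFrame)

buf = io.BytesIO()
df.write_parquet(buf)
parquet_bytes = buf.getvalue()
```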
@@ -167,6 +173,20 @@ def build_step(self, name: str, ctx: Pipeline, previous: Step) -> Step:
    )


@dataclass
class ParquetSerializer(Applier[Message[TIn], bytes], Generic[TIn]):
I am not sure the type of the message is right. Isn't the input message supposed to be a Message[pyarrow.Table] or Message[pyarrow.Array]?
Or is it a Sequence of python dictionaries, as it seems from the unit test?
    if isinstance(schema_fields, PolarsSchema):
        return _get_parquet(msg, schema_fields)
    else:
        polars_schema = _map_arrow_to_polars_schema(schema_fields)
Please let's do the schema conversion in the build_step method, which is executed only once at startup, rather than doing it at runtime for each message.
Doing it in build_step means that:
- It is done only once, so you waste fewer resources.
- If there is an error, we learn about it before we try to consume, and we can even validate it.
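A rough sketch of that shape, reusing names from the diff (the actual step wiring is assumed):

```python
from dataclasses import dataclass
from functools import partial

@dataclass
class ParquetSerializer(Applier[Message[TIn], bytes], Generic[TIn]):
    schema_fields: Sequence[Tuple[str, PADataType]]

    def build_step(self, name: str, ctx: Pipeline, previous: Step) -> Step:
        # Convert arrow -> polars once at pipeline build time; an invalid
        # schema fails here, before we ever start consuming.
        polars_schema = _map_arrow_to_polars_schema(self.schema_fields)
        serializer = partial(parquet_serializer, schema_fields=polars_schema)
        # ... wire `serializer` into the step the way other Appliers do
        ...
```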
def parquet_serializer(
    msg: Message[Any], schema_fields: Union[Sequence[Tuple[str, PADataType]], PolarsSchema]
) -> bytes:
    if isinstance(schema_fields, PolarsSchema):
Why do we need to support both PolarsSchema and Sequence[Tuple[str, PADataType]] if the ParquetSerializer can only support Sequence[Tuple[str, PADataType]]?
@@ -167,6 +173,20 @@ def build_step(self, name: str, ctx: Pipeline, previous: Step) -> Step:
    )


@dataclass
class ParquetSerializer(Applier[Message[TIn], bytes], Generic[TIn]):
    schema_fields: Sequence[Tuple[str, PADataType]]
This is only OK if the input type is an arrow table/batch/array. If not, defining the types as arrow data types would not work, and it would also leak the implementation details of how this is built.
If the expected payload is a python dictionary, the schema cannot be defined with arrow; we need an abstract definition that depends only on parquet, as this is a parquet serializer. Arrow and polars are implementation details.
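One possible shape for such an abstraction (entirely hypothetical, names made up): the public schema speaks only in neutral field types, and the mapping to polars dtypes stays internal.

```python
from enum import Enum
from typing import Sequence, Tuple

import polars as pl

# Engine-neutral field types exposed by the serializer's public API.
class FieldType(Enum):
    STRING = "string"
    INT64 = "int64"
    FLOAT64 = "float64"
    BOOL = "bool"

# Internal detail: how each neutral type maps onto a polars dtype.
_POLARS_DTYPES = {
    FieldType.STRING: pl.String,
    FieldType.INT64: pl.Int64,
    FieldType.FLOAT64: pl.Float64,
    FieldType.BOOL: pl.Boolean,
}

def _to_polars_schema(fields: Sequence[Tuple[str, FieldType]]) -> pl.Schema:
    return pl.Schema({name: _POLARS_DTYPES[ftype] for name, ftype in fields})
```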
def test_parquet_parser_nominal_case() -> None:
    schema_fields: Sequence[Tuple[str, Any]] = [
Please test with a simpler schema.
What this test checks is whether the parquet serialization works and whether it takes the schema into account. Any non-trivial schema would let you accomplish that goal.
This is the whole schema of the sentry error, which is out of scope for what you are trying to test. The test would be considerably simpler with a simpler, made-up schema, without losing almost anything.
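For illustration, a made-up two-field schema would cover the same ground (parquet_serializer and the message constructor are assumed from the PR):

```python
import io
from typing import Any, Sequence, Tuple

import polars as pl
import pyarrow as pa

def test_parquet_serializer_simple_schema() -> None:
    # Two fields are enough to prove serialization works and that the
    # declared schema is applied.
    schema_fields: Sequence[Tuple[str, Any]] = [
        ("id", pa.int64()),
        ("name", pa.string()),
    ]
    rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

    result = parquet_serializer(make_message(rows), schema_fields)  # assumed helpers

    df = pl.read_parquet(io.BytesIO(result))
    assert df.columns == ["id", "name"]
    assert df.height == 2
```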
# for testing purposes, only compare the contents of the parquet tables
expected_table = pq.read_table(pa.BufferReader(expected))
expected_table = pq.read_table(pa.BufferReader(result))
assert expected_table == expected_table
This does not compare two tables as the two variables have the same name. The condition would always pass.
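A fixed version of the comparison might read:

```python
# Read each buffer into its own variable so the assertion actually
# compares the serializer's output against the expected table.
expected_table = pq.read_table(pa.BufferReader(expected))
result_table = pq.read_table(pa.BufferReader(result))
assert result_table == expected_table
```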
Serializing to Parquet is functionality that should be provided by the streaming platform.