updated serialize.md
flexatone committed Jan 24, 2024
1 parent 5ebf73f commit f3d65e8
Showing 1 changed file with 18 additions and 8 deletions.
doc/source/articles/serialize.md
StaticFrame (an open-source DataFrame library of which I am an author) builds up…

## The Challenge of Serializing DataFrames

DataFrames are not just collections of columnar data with string column labels, such as those found in relational databases. In addition to columnar data, DataFrames have labelled rows and columns, and those row and column labels can be of any type or (with hierarchical labels) many types.
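
To make this concrete, below is a minimal sketch (labels and values are arbitrary examples, not from the benchmark code) of a StaticFrame `Frame` whose row labels form a hierarchy of dates and integers, and whose column labels are integers rather than strings:

```python
import numpy as np
import static_frame as sf

# row labels: a hierarchical index pairing dates with integers
index = sf.IndexHierarchy.from_product(
        np.array(['2024-01-01', '2024-01-02'], dtype=np.datetime64),
        (1024, 2048),
        )

# columns of Booleans and unsigned integers, labelled 0 and 1 by default
f = sf.Frame.from_fields(
        (np.array([False, True, False, True]), np.arange(4, dtype=np.uint8)),
        index=index,
        )
```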

As Parquet was originally designed to just store collections of columnar data, the full range of DataFrame characteristics is not natively supported. Pandas supplies this additional information by adding JSON metadata into the Parquet file. Further, Parquet does not support the full range of NumPy dtypes.
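
That embedded metadata is easy to observe with PyArrow. A minimal sketch (file path and contents are arbitrary):

```python
import json
import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({'a': [1.5, 2.5]}, index=pd.Index([10, 20], name='k'))
df.to_parquet('/tmp/demo.parquet')

# pandas embeds index and dtype information as JSON in the Arrow schema metadata
meta = json.loads(pq.read_schema('/tmp/demo.parquet').metadata[b'pandas'])
print(meta['index_columns']) # ['k']
```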

While Python pickles are capable of efficiently serializing DataFrames and NumPy arrays, they are suitable only for short-term caches from trusted sources. Although pickles are fast, they can become invalid due to code changes and are insecure to load from untrusted sources.
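
The trade-off is easy to see in a sketch: pickling a DataFrame is a one-liner, but loading one requires trusting its origin.

```python
import pickle
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

data = pickle.dumps(df) # fast binary serialization of the full object graph
df2 = pickle.loads(data) # only unpickle data you trust: loading a pickle
                         # can execute arbitrary code
```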

Another alternative to Parquet, originating in the Arrow project, is Feather. While Feather succeeds in being faster than Parquet, it is still two times slower at reading DataFrames than NPZ.

Parquet and Feather support compression to reduce file size on disk. Parquet defaults to using "snappy" compression, while Feather uses "lz4". As the NPZ format prioritizes performance, it does not yet support compression. As will be shown below, uncompressed NPZ file sizes are (for the data sets benchmarked) comparable to compressed Parquet and Feather.
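
Compression settings for both formats are exposed through the pandas writers; a sketch for comparing file sizes on disk (paths are arbitrary):

```python
import os
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random_sample((10_000, 100)))

df.to_parquet('/tmp/c.parquet') # snappy compression by default
df.to_parquet('/tmp/u.parquet', compression=None) # uncompressed
df.to_feather('/tmp/c.feather') # lz4 compression by default

for fp in ('/tmp/c.parquet', '/tmp/u.parquet', '/tmp/c.feather'):
    print(fp, os.path.getsize(fp))
```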


## DataFrame Serialization Performance Comparisons
Numerous publications offer DataFrame performance comparisons by testing just a…

To avoid this problem, I present nine performance results across two dimensions of synthetic fixtures: shape (tall, square, and wide) and columnar heterogeneity (columnar, mixed, and uniform). Shape variations alter the distribution of the same number of elements between tall (e.g., 10,000 rows and 100 columns), square (e.g., 1,000 rows and columns), and wide (e.g., 100 rows and 10,000 columns) geometries. Columnar heterogeneity variations alter the diversity of types between columnar (no adjacent columns have the same type), mixed (some adjacent columns have the same type), and uniform (all columns have the same type).
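
These nine fixtures are generated with the `frame-fixtures` library, described next. As a sketch of how the shape and heterogeneity variations might be specified (the type specifiers here are illustrative; see the `frame-fixtures` documentation for the full DSL):

```python
import frame_fixtures as ff

# shape variations, all with one million elements
f_tall = ff.parse('s(10000,100)|v(float)') # uniform: every column float
f_square = ff.parse('s(1000,1000)|v(float)')
f_wide = ff.parse('s(100,10000)|v(float)')

# heterogeneity variations: dtypes cycle across columns
f_columnar = ff.parse('s(1000,1000)|v(int,bool,float,str)') # no adjacent repeats
f_mixed = ff.parse('s(1000,1000)|v(int,int,bool,float,float)') # some adjacent repeats
```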

The `frame-fixtures` library defines a domain-specific language to create deterministic but randomly generated DataFrames for testing; the nine variations of DataFrames are generated with this tool. The following IPython session performs a basic performance test using `%time`.

```python
>>> import numpy as np
...
CPU times: user 14.1 s, sys: 1.44 s, total: 15.5 s
Wall time: 15.6 s
```

Plotted performance tests, as shown below, extend this basic approach by using `frame-fixtures` for systematic variation of shape and type heterogeneity, averaging results over ten iterations.
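
A sketch of that averaging approach with `timeit` (paths and parameters are illustrative, not the benchmark harness itself):

```python
import timeit

import frame_fixtures as ff
import static_frame as sf

f = ff.parse('s(1000,1000)|v(float)')
f.to_npz('/tmp/frame.npz')

# average read time over ten iterations
total = timeit.timeit(lambda: sf.Frame.from_npz('/tmp/frame.npz'), number=10)
print(f'{total / 10:.4f} s per read')
```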


### Read Performance

As data is generally read more often than it is written, read performance is a priority. As shown for all nine DataFrames of one million (1e+06) elements, NPZ significantly outperforms Parquet and Feather with every fixture. NPZ read performance is nearly ten times faster than compressed Parquet. For example, with the Uniform Tall fixture, compressed Parquet reading takes 21 ms compared to 1.5 ms with NPZ.
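
That comparison can be reproduced in outline as follows (fixture construction and paths are illustrative; measured times will vary by machine):

```python
import frame_fixtures as ff
import pandas as pd
import static_frame as sf

# approximate the Uniform Tall fixture: 10,000 rows of 100 float columns
f = ff.parse('s(10000,100)|v(float)')
f.to_npz('/tmp/fixture.npz')
f.to_pandas().to_parquet('/tmp/fixture.parquet') # snappy-compressed by default

# in IPython, compare read times:
# %time sf.Frame.from_npz('/tmp/fixture.npz')
# %time pd.read_parquet('/tmp/fixture.parquet')
```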

The chart below shows processing time, where lower bars correspond to faster performance.

![Read performance 1e6](serialize/serialize-read-linux-1e6.png "Title")


The impressive performance of NPZ is retained with scale. Moving to 100 million (1e+08) elements, NPZ continues to be at least twice as fast as Parquet and Feather, regardless of whether compression is used.

![Read performance 1e8](serialize/serialize-read-linux-1e8.png "Title")



### Write Performance

As data is generally read more often than it is written, write performance is secondary. Nonetheless, NPZ outperforms Parquet (both compressed and uncompressed) in all scenarios. For example, with the Uniform Square fixture, compressed Parquet writing takes 200 ms compared to 18.3 ms with NPZ.
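
A corresponding write sketch (again, fixture and paths are illustrative):

```python
import frame_fixtures as ff

# approximate the Uniform Square fixture: 1,000 rows and 1,000 float columns
f = ff.parse('s(1000,1000)|v(float)')
df = f.to_pandas()

# in IPython, compare write times:
# %time f.to_npz('/tmp/fixture.npz')
# %time df.to_parquet('/tmp/fixture.parquet') # snappy-compressed by default
```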


![Write performance 1e6](serialize/serialize-write-linux-1e6.png "Title")


As with read performance, the impressive performance of NPZ is retained with scale. Moving to 100 million (1e+08) elements, NPZ continues to be at least twice as fast as Parquet, regardless of whether compression is used. Feather (both compressed and uncompressed) outperforms NPZ by a small amount in a few scenarios, but NPZ write performance is generally comparable to Feather, and sometimes faster.


![Write performance 1e8](serialize/serialize-write-linux-1e8.png "Title")


