# Comparing File Formats
This notebook uses Polars instead of Pandas; the reasons are covered in the next notebooks.

In [3]:
import sys
import pyarrow as pa
import pyarrow.parquet as pq
import polars as pl
from pyarrow import orc

### Parquet File Format

In [4]:
%%time
df_polars_parquet = pl.read_parquet(source='/home/jovyan/work/shared-datasets/nasdaq.parquet')
df_polars_parquet.head(5)

CPU times: user 1.01 s, sys: 2.52 s, total: 3.53 s
Wall time: 1.03 s


| ticker | per | recordtime | open | high | low | close | vol | openint |
|---|---|---|---|---|---|---|---|---|
| binary | binary | i64 | f64 | f64 | f64 | f64 | f64 | i64 |
| [binary data] | [binary data] | 1201478400 | 0.738 | 0.738 | 0.738 | 0.738 | 0.0 | 0 |
| [binary data] | [binary data] | 1201564800 | 0.738 | 0.776 | 0.6657 | 0.6797 | 19169939.0 | 0 |
| [binary data] | [binary data] | 1201651200 | 0.6797 | 0.7108 | 0.6448 | 0.6603 | 2818723.0 | 0 |
| [binary data] | [binary data] | 1201737600 | 0.6595 | 0.8001 | 0.6595 | 0.7418 | 2345961.0 | 0 |
| [binary data] | [binary data] | 1201824000 | 0.7714 | 0.7721 | 0.738 | 0.7387 | 361853.0 | 0 |


- Polars has native support for reading Parquet
- Timestamps in this Parquet file are stored as epoch integers (i64) and need converting on read
- Strings are stored as binary and need casting to UTF-8 on read

### ORC File Format

In [5]:
%%time
pyarrow_orc = orc.read_table('/home/jovyan/work/shared-datasets/nasdaq.orc')

CPU times: user 768 ms, sys: 615 ms, total: 1.38 s
Wall time: 1.38 s


In [6]:
%%time
df_polars_orc_via_arrow = pl.from_arrow(pyarrow_orc)
df_polars_orc_via_arrow.head(5)

CPU times: user 308 ms, sys: 1.34 s, total: 1.65 s
Wall time: 1.64 s


| ticker | per | recordtime | open | high | low | close | vol | openint |
|---|---|---|---|---|---|---|---|---|
| binary | binary | datetime[ns] | f64 | f64 | f64 | f64 | f64 | i64 |
| [binary data] | [binary data] | 2008-01-28 00:00:00 | 0.738 | 0.738 | 0.738 | 0.738 | 0.0 | 0 |
| [binary data] | [binary data] | 2008-01-29 00:00:00 | 0.738 | 0.776 | 0.6657 | 0.6797 | 19169939.0 | 0 |
| [binary data] | [binary data] | 2008-01-30 00:00:00 | 0.6797 | 0.7108 | 0.6448 | 0.6603 | 2818723.0 | 0 |
| [binary data] | [binary data] | 2008-01-31 00:00:00 | 0.6595 | 0.8001 | 0.6595 | 0.7418 | 2345961.0 | 0 |
| [binary data] | [binary data] | 2008-02-01 00:00:00 | 0.7714 | 0.7721 | 0.738 | 0.7387 | 361853.0 | 0 |


- Polars has no native ORC reader at this time
- PyArrow can read ORC, and Polars can convert the resulting Arrow table
- Timestamps arrive strongly typed as a datetime data type (good)
- String fields are read as binary, just as with the Parquet read

In [7]:
sys.getsizeof(df_polars_orc_via_arrow)

56

#### Use Arrow to type on deserialization
- The ORC read path used here (`orc.read_table`) does not take a schema override at read time.
- We can fix the string columns during the Arrow-to-Polars conversion, using Arrow as the deserialization medium.

https://arrow.apache.org/docs/python/generated/pyarrow.orc.read_table.html

In [8]:
# Map the two binary columns to UTF-8 strings during the Arrow-to-Polars conversion.
schema_overrides = {"ticker": pl.Utf8, "per": pl.Utf8}
df_polars_orc_via_arrow_with_overrides = pl.from_arrow(pyarrow_orc, schema_overrides=schema_overrides)
df_polars_orc_via_arrow_with_overrides.head(5)

| ticker | per | recordtime | open | high | low | close | vol | openint |
|---|---|---|---|---|---|---|---|---|
| str | str | datetime[ns] | f64 | f64 | f64 | f64 | f64 | i64 |
| "AACG.US" | "D" | 2008-01-28 00:00:00 | 0.738 | 0.738 | 0.738 | 0.738 | 0.0 | 0 |
| "AACG.US" | "D" | 2008-01-29 00:00:00 | 0.738 | 0.776 | 0.6657 | 0.6797 | 19169939.0 | 0 |
| "AACG.US" | "D" | 2008-01-30 00:00:00 | 0.6797 | 0.7108 | 0.6448 | 0.6603 | 2818723.0 | 0 |
| "AACG.US" | "D" | 2008-01-31 00:00:00 | 0.6595 | 0.8001 | 0.6595 | 0.7418 | 2345961.0 | 0 |
| "AACG.US" | "D" | 2008-02-01 00:00:00 | 0.7714 | 0.7721 | 0.738 | 0.7387 | 361853.0 | 0 |


- Using schema overrides, we can assign the string type during deserialization
- All fields are now strongly typed

In [9]:
sys.getsizeof(df_polars_orc_via_arrow_with_overrides)

56

# Conclusions
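- Both Parquet and ORC read quickly here: about 1 s wall time for Parquet, and about 3 s for ORC including the Arrow-to-Polars conversion. The 56 bytes reported by `sys.getsizeof` covers only the Python wrapper object, not the column data, so it does not measure memory use
- Polars has native support for Parquet; ORC required reading via PyArrow
- Strings stored as binary on disk can be converted to UTF-8 during the Arrow-to-Polars conversion
- Parquet has a limited set of primitive data types; in this file, timestamps surfaced as i64 epochs and strings as binary
- ORC may be the better choice when timestamp columns must be preserved and Polars filters (predicate pushdown) should use dates instead of epochs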