Improve ORC reader performance #555

dain · 2019-03-28T01:54:38Z

For TPC-DS, this saves about 9.5% of total CPU when running over gzip-compressed data.

The benchmarks are in BenchmarkStreamReaders. For this test, I used 3 forks, each with
30 warmup and 20 measurement iterations. The results are average nanoseconds per value read.

Benchmark method	Old (ns/value)		Error	New (ns/value)		Error	Speedup
readBooleanNoNull	2.48	±	0.04	0.67	±	0.01	3.73
readBooleanWithNull	9.49	±	0.27	2.01	±	0.02	4.72
readByteNoNull	2.83	±	0.05	0.45	±	0.02	6.23
readByteWithNull	8.60	±	0.30	1.95	±	0.02	4.41
readDoubleNoNull	15.61	±	0.17	1.31	±	0.07	11.89
readDoubleWithNull	16.14	±	0.25	2.83	±	0.04	5.70
readFloatNoNull	14.64	±	0.22	0.81	±	0.04	18.11
readFloatWithNull	15.77	±	0.22	2.08	±	0.04	7.60
readIntNoNull	7.83	±	0.08	2.56	±	0.03	3.06
readIntWithNull	13.62	±	0.18	2.99	±	0.05	4.55
readLongDecimalNoNull	82.88	±	0.73	16.85	±	0.27	4.92
readLongDecimalWithNull	55.30	±	0.81	11.77	±	0.19	4.70
readLongNoNull	9.43	±	0.19	3.33	±	0.06	2.83
readLongWithNull	14.13	±	0.17	3.26	±	0.06	4.34
readShortDecimalNoNull	27.35	±	0.51	9.62	±	0.15	2.84
readShortDecimalWithNull	20.81	±	0.30	6.96	±	0.11	2.99
readShortNoNull	6.79	±	0.11	2.55	±	0.03	2.66
readShortWithNull	11.19	±	0.12	2.95	±	0.04	3.79
readSliceDictionaryNoNull	3.31	±	0.04	2.22	±	0.02	1.49
readSliceDictionaryWithNull	8.65	±	0.10	3.67	±	0.05	2.36
readSliceDirectNoNull	34.46	±	0.47	3.94	±	0.08	8.76
readSliceDirectWithNull	33.57	±	0.56	6.99	±	0.11	4.80
readTimestampNoNull	17.91	±	0.30	13.55	±	0.20	1.32
readTimestampWithNull	16.07	±	0.24	11.77	±	0.12	1.36
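
As a reading aid for the Speedup column: it appears to be the old average time divided by the new average time. The sketch below recomputes that ratio from the rounded values in the table. The class and method names are made up for illustration, and the small differences from the listed figures (for example 2.48 / 0.67 versus the listed 3.73) are presumably because the listed ratios were computed from unrounded measurements; that is an assumption, not something stated in the PR.

```java
public class SpeedupCheck {
    // Ratio of old to new average time per value.
    // A result above 1 means the new reader is faster.
    static double speedup(double oldNanos, double newNanos) {
        if (oldNanos < 0 || newNanos <= 0) {
            throw new IllegalArgumentException("times must be positive");
        }
        return oldNanos / newNanos;
    }

    public static void main(String[] args) {
        // Rounded values copied from the table rows above.
        System.out.printf("readDoubleNoNull: %.2f%n", speedup(15.61, 1.31));
        System.out.printf("readTimestampNoNull: %.2f%n", speedup(17.91, 13.55));
    }
}
```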

sopel39 · 2019-03-28T09:39:47Z

Here are the actual benchmark results (both wall time and CPU): https://s3.us-east-2.amazonaws.com/starburstdata/karol/Benchmarks+comparison-orc_improvements.pdf

presto-orc/src/test/java/io/prestosql/orc/BenchmarkStreamReaders.java

presto-orc/src/main/java/io/prestosql/orc/reader/SliceDictionaryStreamReader.java

presto-orc/src/main/java/io/prestosql/orc/OrcReader.java

presto-orc/src/main/java/io/prestosql/orc/reader/ReaderUtils.java

Pass SQL type to ORC stream reader constructor and use that instead of passing to each readBlock call.
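
A minimal sketch of the idea in this commit message, assuming a much simplified reader: the SQL type is supplied once to the constructor, validated there, and kept in a field, so readBlock no longer takes a type argument. SqlType, LongStreamReader, and the long[] source are hypothetical stand-ins and not the actual presto-orc classes.

```java
import java.util.Objects;

public class TypedReaderSketch {
    // Hypothetical stand-in for a SQL type; not the Presto SPI Type.
    enum SqlType { BIGINT, INTEGER, VARCHAR }

    // Hypothetical reader: the type is fixed at construction time.
    static final class LongStreamReader {
        private final SqlType type;
        private final long[] source;
        private int offset;

        LongStreamReader(SqlType type, long[] source) {
            this.type = Objects.requireNonNull(type, "type is null");
            // Type validation happens once here, not on every read.
            if (type == SqlType.VARCHAR) {
                throw new IllegalArgumentException("unsupported type: " + type);
            }
            this.source = source.clone();
        }

        // No type parameter: the reader already knows what it produces.
        long[] readBlock(int positions) {
            if (positions < 0 || positions > source.length - offset) {
                throw new IllegalArgumentException("invalid position count: " + positions);
            }
            long[] block = new long[positions];
            for (int i = 0; i < positions; i++) {
                long value = source[offset + i];
                // The per-type decision uses the field set in the constructor.
                block[i] = (type == SqlType.INTEGER) ? (int) value : value;
            }
            offset += positions;
            return block;
        }
    }

    public static void main(String[] args) {
        LongStreamReader reader = new LongStreamReader(SqlType.BIGINT, new long[] {1, 2, 3});
        System.out.println(reader.readBlock(2).length);
    }
}
```

With this shape, callers cannot pass a different type on each call, and unsupported types fail when the reader is created instead of in the middle of a read.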

Change reporting to avg nanoseconds per row
Close output streams
Add long reader benchmarks for all types and pollute profile
Add direct slice benchmark
Improve dictionary slice benchmark
Add long decimal benchmark
Load data into memory before benchmarking
Use Presto writer

sopel39 · 2019-04-11T17:14:16Z

Well done!

Pass SQL type to ORC stream reader constructor and use that instead of passing to each readBlock call.

Cherry-pick of trinodb/trino#555

The differences from the original commit include:
1) Removed systemMemoryContext because the BatchStreamReaders don't have local arrays;
2) Fixed Raptor tests by converting all SPI types to storage types;
3) Other nit changes

Co-authored-by: Dain Sundstrom <dain@iq80.com>

Convert column type to storage type for Raptor

Raptor stores TIME and TIMESTAMP data as longs. When creating the batch RecordReader, these types need to be converted to the storage types.

Cherry-pick of trinodb/trino#555 Co-authored-by: Martin Traverso <mtraverso@gmail.com>

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>
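
The Raptor note in the cherry-pick message above (TIME and TIMESTAMP are stored as longs, so column types are converted to storage types before the batch RecordReader is created) could look roughly like the sketch below. ColumnType and toStorageType are hypothetical names chosen for illustration; the real code works on Presto's type objects, which are not reproduced here.

```java
public class StorageTypeSketch {
    // Hypothetical stand-in for column types; not the Presto SPI Type.
    enum ColumnType { BIGINT, TIME, TIMESTAMP, VARCHAR }

    // Maps a column type to the type used for the stored data.
    // TIME and TIMESTAMP are stored as longs, so both map to BIGINT;
    // every other type is assumed to be stored as-is.
    static ColumnType toStorageType(ColumnType type) {
        switch (type) {
            case TIME:
            case TIMESTAMP:
                return ColumnType.BIGINT;
            default:
                return type;
        }
    }

    public static void main(String[] args) {
        for (ColumnType type : ColumnType.values()) {
            System.out.println(type + " -> " + toStorageType(type));
        }
    }
}
```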

cla-bot bot added the cla-signed label Mar 28, 2019

dain force-pushed the orc-types branch 2 times, most recently from 5da422f to 144013e Compare April 4, 2019 17:29

martint approved these changes Apr 10, 2019

dain and others added 9 commits April 10, 2019 16:50

Remove unused LongDictionaryStreamReader

a690c96

Remove type from OrcRecordReader.readBlock

22e2bba

Pass SQL type to ORC stream reader constructor and use that instead of passing to each readBlock call.

Remove unused inDictionary from SliceDictionaryStreamReader

e269629

Rename stripeDictionary to dictionary

fb844a7

Simplify OrcDataSource interface

6c44964

Convert OrcInputStream to use a chunked loader

4a3c252

Improve ORC boolean reader

4f7c688

Improve ORC double and float readers

cff1f44

dain force-pushed the orc-types branch from 144013e to 2014cf2 Compare April 11, 2019 00:03

dain added 7 commits April 10, 2019 17:57

Improve ORC slice direct reader

5402a5b

Improve ORC list and map readers

6c9fa02

Improve ORC long reader

fd46381

Improve ORC byte reader

2ec3b61

Improve ORC decimal reader

2799a3e

Improve ORC slice dictionary reader

52c7843

Improve ORC timestamp reader

fcc753b

dain force-pushed the orc-types branch from 2014cf2 to fcc753b Compare April 11, 2019 01:03

dain closed this Apr 11, 2019

dain mentioned this pull request Apr 11, 2019

Release notes for 308 #583

Closed

5 tasks

dain added this to the 308 milestone Apr 11, 2019

dain reopened this Apr 11, 2019

dain merged commit b6961f8 into trinodb:master Apr 11, 2019

elonazoulay mentioned this pull request Apr 18, 2019

Improve orc reader performance prestodb/presto#12695

Closed

yingsu00 pushed a commit to yingsu00/presto that referenced this pull request Nov 12, 2019

Improve ORC boolean reader

d6d33e4

Cherry-pick of trinodb/trino#555 Co-authored-by: Martin Traverso <mtraverso@gmail.com>

yingsu00 pushed a commit to yingsu00/presto that referenced this pull request Nov 12, 2019

Improve ORC byte reader

a50cf92

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>

yingsu00 pushed a commit to yingsu00/presto that referenced this pull request Nov 12, 2019

Improve ORC slice direct reader

f7205c1

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>

yingsu00 pushed a commit to yingsu00/presto that referenced this pull request Nov 12, 2019

Improve ORC list and map readers

866da3e

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>

yingsu00 pushed a commit to yingsu00/presto that referenced this pull request Nov 12, 2019

Improve ORC timestamp reader

9b727fe

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>

yingsu00 pushed a commit to yingsu00/presto that referenced this pull request Nov 12, 2019

Improve ORC LongReader

3cf6fc1

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>

rongrong pushed a commit to prestodb/presto that referenced this pull request Nov 13, 2019

Improve ORC boolean reader

95a57bd

Cherry-pick of trinodb/trino#555 Co-authored-by: Martin Traverso <mtraverso@gmail.com>

rongrong pushed a commit to prestodb/presto that referenced this pull request Nov 13, 2019

Improve ORC byte reader

9633991

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>

rongrong pushed a commit to prestodb/presto that referenced this pull request Nov 13, 2019

Improve ORC slice direct reader

bf96a44

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>

rongrong pushed a commit to prestodb/presto that referenced this pull request Nov 13, 2019

Improve ORC list and map readers

0814f33

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>

rongrong pushed a commit to prestodb/presto that referenced this pull request Nov 13, 2019

Improve ORC timestamp reader

61408c2

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>

rongrong pushed a commit to prestodb/presto that referenced this pull request Nov 13, 2019

Improve ORC LongReader

f92fa25

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>

kaikalur pushed a commit to kaikalur/presto that referenced this pull request Jan 22, 2020

Improve ORC boolean reader

7237c07

Cherry-pick of trinodb/trino#555 Co-authored-by: Martin Traverso <mtraverso@gmail.com>

kaikalur pushed a commit to kaikalur/presto that referenced this pull request Jan 22, 2020

Improve ORC byte reader

1114dcb

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>

kaikalur pushed a commit to kaikalur/presto that referenced this pull request Jan 22, 2020

Improve ORC slice direct reader

965d1bf

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>

kaikalur pushed a commit to kaikalur/presto that referenced this pull request Jan 22, 2020

Improve ORC list and map readers

56b0bfd

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>

kaikalur pushed a commit to kaikalur/presto that referenced this pull request Jan 22, 2020

Improve ORC timestamp reader

791b023

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>

kaikalur pushed a commit to kaikalur/presto that referenced this pull request Jan 22, 2020

Improve ORC LongReader

6107aec

Cherry-pick of trinodb/trino#555 Co-authored-by: Dain Sundstrom <dain@iq80.com>
