Derive predicate over base column for CAST(a_varchar AS date) > a_date_constant condition #12925

Closed
ubyyj opened this issue Jun 22, 2022 · 14 comments · Fixed by #13567

Comments

ubyyj (Contributor) commented Jun 22, 2022

For an Iceberg table, if the partition key appears inside an expression (a CAST) in the WHERE clause, the predicate is not pushed down to the table scan, while for a Hive table it is.

Below is the EXPLAIN output of a query in which the partition key dt is cast to date (this is required by Tableau for incremental refresh); the predicate is not pushed down to the table scan.

CREATE EXTERNAL TABLE t1(
  name string COMMENT 'from deserializer',
  age int COMMENT 'from deserializer',
  dt string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.iceberg.mr.hive.HiveIcebergSerDe'
STORED BY
  'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION
  'hdfs://namenode/user/hive/warehouse/t1'
TBLPROPERTIES (
  'bucketing_version'='1',
  'engine.hive.enabled'='true',
  'external.table.purge'='TRUE',
  'format-version'='2',
  'last_modified_by'='root',
  'last_modified_time'='1655557984',
  'metadata_location'='...',
  'previous_metadata_location'='..',
  'table_type'='ICEBERG',
  'transient_lastDdlTime'='1655557984',
  'uuid'='81f55231-f6ed-4a22-97ea-3fe792cb2927',
  'write.delete.mode'='copy-on-write')
explain select cast(dt as date) as dt, count(name) as cnt from iceberg.default.t1 where cast(dt as date) > date_add('day', -3, current_date)  and cast(dt as date)  <= date_add('day', 3, current_date)   group by dt;
                                                                                               
 Fragment 0 [SINGLE]
     Output layout: [expr, count]
     Output partitioning: SINGLE []
     Stage Execution Strategy: UNGROUPED_EXECUTION
     Output[dt, cnt]
     │   Layout: [expr:date, count:bigint]
     │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
     │   dt := expr
     │   cnt := count
     └─ RemoteSource[1]
            Layout: [count:bigint, expr:date]

 Fragment 1 [HASH]
     Output layout: [count, expr]
     Output partitioning: SINGLE []
     Stage Execution Strategy: UNGROUPED_EXECUTION
     Project[]
     │   Layout: [count:bigint, expr:date]
     │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
     │   expr := CAST("dt" AS date)
     └─ Aggregate(FINAL)[dt][$hashvalue]
        │   Layout: [dt:varchar, $hashvalue:bigint, count:bigint]
        │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
        │   count := count("count_0")
        └─ LocalExchange[HASH][$hashvalue] ("dt")
           │   Layout: [dt:varchar, count_0:bigint, $hashvalue:bigint]
           │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
           └─ RemoteSource[2]
                  Layout: [dt:varchar, count_0:bigint, $hashvalue_1:bigint]

 Fragment 2 [SOURCE]
     Output layout: [dt, count_0, $hashvalue_2]
     Output partitioning: HASH [dt][$hashvalue_2]
     Stage Execution Strategy: UNGROUPED_EXECUTION
     Aggregate(PARTIAL)[dt][$hashvalue_2]
     │   Layout: [dt:varchar, $hashvalue_2:bigint, count_0:bigint]
     │   count_0 := count("name")
     └─ ScanFilterProject[table = iceberg:default.t1$data@3690406149830550716, grouped = false, filterPredicate = ((CAST("dt" AS date) > DATE '2022-06-19') AND (CAST("dt" AS date) <= DATE '2022-06-25'))]
            Layout: [name:varchar, dt:varchar, $hashvalue_2:bigint]
            Estimates: {rows: 2 (264B), cpu: 246, memory: 0B, network: 0B}/{rows: 1 (66B), cpu: 492, memory: 0B, network: 0B}/{rows: 1 (66B), cpu: 558, memory: 0B, network: 0B}
            $hashvalue_2 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("dt"), 0))
            dt := 3:dt:varchar
            name := 1:name:varchar

For a similar Hive table, the condition is pushed down.

CREATE EXTERNAL TABLE actions(
  user_id bigint)
PARTITIONED BY (
  dt string,
  hh string,
  action string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://namenode/user/hive/warehouse/actions'
TBLPROPERTIES (
  'transient_lastDdlTime'='1655858884')
explain select cast(dt as date) as dt, count(user_id) as cnt from hive.default.actions where cast(dt as date) > date_add('day', -3, current_date)  and cast(dt as date)  <= date_add('day', 3, current_date)  group by dt;

 Fragment 0 [SINGLE]
     Output layout: [expr, count]
     Output partitioning: SINGLE []
     Stage Execution Strategy: UNGROUPED_EXECUTION
     Output[dt, cnt]
     │   Layout: [expr:date, count:bigint]
     │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
     │   dt := expr
     │   cnt := count
     └─ RemoteSource[1]
            Layout: [count:bigint, expr:date]

 Fragment 1 [HASH]
     Output layout: [count, expr]
     Output partitioning: SINGLE []
     Stage Execution Strategy: UNGROUPED_EXECUTION
     Project[]
     │   Layout: [count:bigint, expr:date]
     │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
     │   expr := CAST("dt" AS date)
     └─ Aggregate(FINAL)[dt][$hashvalue]
        │   Layout: [dt:varchar, $hashvalue:bigint, count:bigint]
        │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
        │   count := count("count_0")
        └─ LocalExchange[HASH][$hashvalue] ("dt")
           │   Layout: [dt:varchar, count_0:bigint, $hashvalue:bigint]
           │   Estimates: {rows: ? (?), cpu: ?, memory: ?, network: ?}
           └─ RemoteSource[2]
                  Layout: [dt:varchar, count_0:bigint, $hashvalue_1:bigint]

 Fragment 2 [SOURCE]
     Output layout: [dt, count_0, $hashvalue_2]
     Output partitioning: HASH [dt][$hashvalue_2]
     Stage Execution Strategy: UNGROUPED_EXECUTION
     Aggregate(PARTIAL)[dt][$hashvalue_2]
     │   Layout: [dt:varchar, $hashvalue_2:bigint, count_0:bigint]
     │   count_0 := count("user_id")
     └─ ScanFilterProject[table = hive:default:actions, grouped = false, filterPredicate = ((CAST("dt" AS date) > DATE '2022-06-19') AND (CAST("dt" AS date) <= DATE '2022-06-25'))]
            Layout: [user_id:bigint, dt:varchar, $hashvalue_2:bigint]
            Estimates: {rows: 1 (33B), cpu: 24, memory: 0B, network: 0B}/{rows: 0 (8B), cpu: 48, memory: 0B, network: 0B}/{rows: 0 (8B), cpu: 56, memory: 0B, network: 0B}
            $hashvalue_2 := combine_hash(bigint '0', COALESCE("$operator$hash_code"("dt"), 0))
            dt := dt:string:PARTITION_KEY
                :: [[2022-06-20]]
            user_id := user_id:bigint:REGULAR
            hh:string:PARTITION_KEY
                :: [[00]]
            action:string:PARTITION_KEY
                :: [[post]]
findepi (Member) commented Jun 22, 2022

For the record, #12795 #12918 improve Iceberg pushdown capabilities, but only when the partition key is of a temporal type.

In this issue, your Iceberg table doesn't seem to be partitioned, while the Hive table is.
Also, can you elaborate on the actual and expected behavior?
Thanks for sharing the EXPLAIN plans, but I don't know what detail of the plans you're looking at.

ubyyj (Contributor, Author) commented Jun 23, 2022

The Iceberg table t1 is partitioned; the SHOW CREATE TABLE output just doesn't show it. Below is what I got from the metadata file:
"partition-specs" : [ { "spec-id" : 0, "fields" : [ { "name" : "dt", "transform" : "identity", "source-id" : 3, "field-id" : 1000 } ] } ],
I would expect (I guess) the ScanFilterProject to tell which dates will be scanned, like the Hive plan does in the [[2022-06-20]] part:
dt := dt:string:PARTITION_KEY :: [[2022-06-20]]

ubyyj (Contributor, Author) commented Jun 23, 2022

As far as I can understand, the pushdown does not succeed because dt is a varchar type while timestamp(6) is expected, here.

findepi (Member) commented Jun 23, 2022

I want (I guess) the ScanFilterProject should tell which dates will be scanned, like what hive plan tells, the [[2022-06-20]] part:
dt := dt:string:PARTITION_KEY :: [[2022-06-20]]

The actual way to test this is to count the number of source splits.
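For example (a hedged sketch; exactly which counters are shown depends on the Trino version), comparing the Iceberg and Hive tables with EXPLAIN ANALYZE makes the effect visible:

EXPLAIN ANALYZE
SELECT count(name)
FROM iceberg.default.t1
WHERE CAST(dt AS date) > date_add('day', -3, current_date)
  AND CAST(dt AS date) <= date_add('day', 3, current_date);
-- If partition pruning is effective, the source stage reads only the matching
-- partitions (few input rows/splits); without pruning it reads the whole table.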

I think your needs should be addressed by @homar 's #9830

ubyyj (Contributor, Author) commented Jun 24, 2022

I am not sure it's addressed by #9830. On the latest release (387), such a query keeps the coordinator busy for a long time and then times out (the table is relatively big, with millions of Parquet files and millions of partitions).

findepi (Member) commented Jun 24, 2022

and then timeout

do you have a stacktrace?

ubyyj (Contributor, Author) commented Jun 25, 2022

Here is one:

Query-20220625_001742_00041_njuwj-43241 RUNNABLE #43241 
  at java.net.SocketInputStream.socketRead0(SocketInputStream.java:-2)
  at java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
  at java.net.SocketInputStream.read(SocketInputStream.java:168)
  at java.net.SocketInputStream.read(SocketInputStream.java:140)
  at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:478)
  at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:472)
  at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:160)
  at sun.security.ssl.SSLTransport.decode(SSLTransport.java:111)
  at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1506)
  at sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1416)
  at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:456)
  at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:427)
  at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
  at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
  at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:142)
  at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
  at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
  at jdk.internal.reflect.GeneratedMethodAccessor840.invoke(:-1)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:566)
  at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
  at com.amazonaws.http.conn.$Proxy634.connect(:-1)
  at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
  at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
  at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
  at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
  at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
  at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1343)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1154)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:811)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:779)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:753)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:713)
  at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:695)
  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:559)
  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:539)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5453)
  at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5400)
  at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1372)
  at io.trino.plugin.hive.s3.TrinoS3FileSystem.lambda$getS3ObjectMetadata$6(TrinoS3FileSystem.java:785)
  at io.trino.plugin.hive.s3.TrinoS3FileSystem$$Lambda$10174/0x0000ffc62c1fb508.call(:-1)
  at io.trino.plugin.hive.util.RetryDriver.run(RetryDriver.java:130)
  at io.trino.plugin.hive.s3.TrinoS3FileSystem.getS3ObjectMetadata(TrinoS3FileSystem.java:782)
  at io.trino.plugin.hive.s3.TrinoS3FileSystem.getS3ObjectMetadata(TrinoS3FileSystem.java:766)
  at io.trino.plugin.hive.s3.TrinoS3FileSystem.getFileStatus(TrinoS3FileSystem.java:446)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
  at org.apache.iceberg.hadoop.HadoopInputFile.lazyStat(HadoopInputFile.java:156)
  at org.apache.iceberg.hadoop.HadoopInputFile.getLength(HadoopInputFile.java:167)
  at io.trino.plugin.iceberg.HdfsInputFile$$Lambda$10172/0x0000ffc62c1fc8b0.run(:-1)
  at io.trino.plugin.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:25)
  at io.trino.plugin.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:97)
  at io.trino.plugin.iceberg.HdfsInputFile.getLength(HdfsInputFile.java:55)
  at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:101)
  at org.apache.iceberg.avro.AvroIterable.getMetadata(AvroIterable.java:66)
  at org.apache.iceberg.ManifestReader.<init>(ManifestReader.java:104)
  at org.apache.iceberg.ManifestFiles.read(ManifestFiles.java:87)
  at org.apache.iceberg.ManifestGroup.lambda$entries$15(ManifestGroup.java:245)
  at org.apache.iceberg.ManifestGroup$$Lambda$10203/0x0000ffc62bff4920.apply(:-1)
  at org.apache.iceberg.relocated.com.google.common.collect.Iterators$6.transform(Iterators.java:783)
  at org.apache.iceberg.relocated.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
  at org.apache.iceberg.relocated.com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
  at org.apache.iceberg.util.ParallelIterable$ParallelIterator.submitNextTask(ParallelIterable.java:131)
  at org.apache.iceberg.util.ParallelIterable$ParallelIterator.checkTasks(ParallelIterable.java:118)
  at org.apache.iceberg.util.ParallelIterable$ParallelIterator.hasNext(ParallelIterable.java:155)
  at java.lang.Iterable.forEach(Iterable.java:74)
  at io.trino.plugin.iceberg.TableStatisticsMaker.makeTableStatistics(TableStatisticsMaker.java:109)
  at io.trino.plugin.iceberg.TableStatisticsMaker.getTableStatistics(TableStatisticsMaker.java:60)
  at io.trino.plugin.iceberg.IcebergMetadata.getTableStatistics(IcebergMetadata.java:1916)
  at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorMetadata.getTableStatistics(ClassLoaderSafeConnectorMetadata.java:326)
  at io.trino.metadata.MetadataManager.getTableStatistics(MetadataManager.java:445)
  at io.trino.cost.TableScanStatsRule.doCalculate(TableScanStatsRule.java:64)
  at io.trino.cost.TableScanStatsRule.doCalculate(TableScanStatsRule.java:38)
  at io.trino.cost.SimpleStatsRule.calculate(SimpleStatsRule.java:39)
  at io.trino.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:85)
  at io.trino.cost.ComposableStatsCalculator.calculateStats(ComposableStatsCalculator.java:73)
  at io.trino.cost.CachingStatsProvider.getStats(CachingStatsProvider.java:80)
  at io.trino.sql.planner.iterative.rule.UseNonPartitionedJoinLookupSource.lambda$getSourceTablesRowCount$0(UseNonPartitionedJoinLookupSource.java:153)
  at io.trino.sql.planner.iterative.rule.UseNonPartitionedJoinLookupSource$$Lambda$10465/0x0000ffc9489db0b0.applyAsDouble(:-1)
  at java.util.stream.ReferencePipeline$6$1.accept(ReferencePipeline.java:246)
  at java.util.Collections$2.tryAdvance(Collections.java:4747)
  at java.util.Collections$2.forEachRemaining(Collections.java:4755)
  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
  at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
  at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
  at java.util.stream.DoublePipeline.collect(DoublePipeline.java:516)
  at java.util.stream.DoublePipeline.sum(DoublePipeline.java:425)
  at io.trino.sql.planner.iterative.rule.UseNonPartitionedJoinLookupSource.getSourceTablesRowCount(UseNonPartitionedJoinLookupSource.java:154)
  at io.trino.sql.planner.iterative.rule.UseNonPartitionedJoinLookupSource.getSourceTablesRowCount(UseNonPartitionedJoinLookupSource.java:135)
  at io.trino.sql.planner.iterative.rule.UseNonPartitionedJoinLookupSource.apply(UseNonPartitionedJoinLookupSource.java:89)
  at io.trino.sql.planner.iterative.rule.UseNonPartitionedJoinLookupSource.apply(UseNonPartitionedJoinLookupSource.java:65)
  at io.trino.sql.planner.iterative.IterativeOptimizer.transform(IterativeOptimizer.java:202)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreNode(IterativeOptimizer.java:169)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:132)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreChildren(IterativeOptimizer.java:251)
  at io.trino.sql.planner.iterative.IterativeOptimizer.exploreGroup(IterativeOptimizer.java:134)
  at io.trino.sql.planner.iterative.IterativeOptimizer.optimize(IterativeOptimizer.java:117)
  at io.trino.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:240)
  at io.trino.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:217)
  at io.trino.sql.planner.LogicalPlanner.plan(LogicalPlanner.java:212)
  at io.trino.execution.SqlQueryExecution.doPlanQuery(SqlQueryExecution.java:477)
  at io.trino.execution.SqlQueryExecution.planQuery(SqlQueryExecution.java:458)
  at io.trino.execution.SqlQueryExecution.start(SqlQueryExecution.java:399)
  at io.trino.execution.SqlQueryManager.createQuery(SqlQueryManager.java:243)
  at io.trino.dispatcher.LocalDispatchQueryFactory$$Lambda$8076/0x0000ffc641ff1cb0.accept(:-1)
  at io.trino.dispatcher.LocalDispatchQuery.lambda$startExecution$7(LocalDispatchQuery.java:143)
  at io.trino.dispatcher.LocalDispatchQuery$$Lambda$8174/0x0000ffc641d2ecb0.run(:-1)
  at io.trino.$gen.Trino_387____20220623_071609_2.run(:-1)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
  at java.lang.Thread.run(Thread.java:829)

ubyyj (Contributor, Author) commented Jun 25, 2022

And in Tableau, we got the following error message:
[Simba][Presto] (1060) Presto Query Error: The optimizer exhausted the time limit of 180000 ms: Top rules: { io.trino.sql.planner.iterative.rule.UseNonPartitionedJoinLookupSource@7fc3a7eb: 269393 ms, 1 invocations, 0 ...[see log] (65556

findepi (Member) commented Jun 27, 2022

and in Tableau, we got following error message: The optimizer exhausted the time limit of 180000 ms

This is very disturbing.
Can you please file a separate issue about this? Please include your query reproducing the problem, the table schema and partitioning (SHOW CREATE TABLE output).

Edit: disregard.

findepi (Member) commented Jun 28, 2022

When dt is a varchar column, let's think about what the value of dt can be for cast(dt as date) >= DATE '2022-06-29' to be true.
Obviously, dt can be '2022-06-29', '2022-06-30', '2022-07-...'.
However, CAST accepts various other forms of dates (see #13011).
So dt can start with whitespace (any whitespace character, not just U+0020 SPACE), e.g. ' \t\n\t\r 2022-06-29'.
The year can be zero-padded, e.g. '0002022-06-29'.

We could still derive a predicate directly over dt that would be very efficient in practice, in the spirit of #3618.
We just need to think through a few edge cases when doing that.
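To make those edge cases concrete (a small illustration based on the forms mentioned above):

-- Both of these evaluate to DATE '2022-06-29':
SELECT CAST(' 2022-06-29' AS date);      -- leading whitespace is accepted
SELECT CAST('0002022-06-29' AS date);    -- zero-padded year is accepted
-- Neither value satisfies the lexical comparison dt >= '2022-06-29'
-- (a space and '0' both sort before '2'), so a naive varchar range alone is not
-- a safe substitute for the CAST predicate; deriving a range over dt has to
-- account for such non-canonical forms.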

findepi changed the title from "iceberg partition key in cast expression is not pushed down" to "Derive predicate over base column for CAST(a_varchar AS date) > a_date_constant condition" on Jun 28, 2022
findepi (Member) commented Jun 28, 2022

BTW, for Iceberg I strongly recommend switching your column data type to date.
You will be able to leverage partitioning transforms later, if you need to. cc @alexjo2144
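For example, one possible migration (a hedged sketch using the names from this issue; a CTAS into a new table, keeping identity partitioning on dt):

CREATE TABLE iceberg.default.t1_new
WITH (partitioning = ARRAY['dt'])          -- identity partitioning on the new date column
AS
SELECT name, age, CAST(dt AS date) AS dt
FROM iceberg.default.t1;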

The issue may still be useful for existing data sets with a suboptimal data type, and also for Hive.

ubyyj (Contributor, Author) commented Jun 29, 2022

The table is a successor of a Hive table, so we kept the data type.
For the Hive table everything worked as expected; I am wondering what the difference is between HiveMetadata and IcebergMetadata.

alexjo2144 (Member) commented
It will work okay, but you lose the ability to use any of the date transforms: https://iceberg.apache.org/spec/#partition-transforms
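For instance (a hedged sketch; assumes dt is a date column and a Trino version that supports changing partitioning through SET PROPERTIES):

-- Switch to a coarser monthly partition transform on the date column:
ALTER TABLE iceberg.default.t1 SET PROPERTIES partitioning = ARRAY['month(dt)'];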

ubyyj (Contributor, Author) commented Jun 30, 2022

Can we treat the identity transform separately, so Iceberg can match Hive's performance in this scenario?
