
Query causes hive cache to crash; requires restart of Trino #9776

Closed

nbayle opened this issue Oct 26, 2021 · 4 comments
Labels
bug Something isn't working

Comments

nbayle commented Oct 26, 2021

Trino 358. A query will execute successfully between zero and two times (usually once), and then the cache stops working. Restarting Trino resolves the issue. The error doesn't seem too helpful, though.

Properties:

connector.name=hive-hadoop2
hive.metastore=glue
hive.metastore.glue.region=us-east-1
hive.s3.aws-access-key=ASDF
hive.s3.aws-secret-key=ASDF
hive.metastore.glue.aws-access-key=ASDF
hive.metastore.glue.aws-secret-key=ASDF
hive.recursive-directories=true
hive.security=allow-all
hive.allow-register-partition-procedure=true
hive.non-managed-table-writes-enabled=true
hive.max-partitions-per-scan=3000000
hive.translate-hive-views=true
hive.storage-format=PARQUET
hive.compression-codec=SNAPPY
hive.cache.enabled=true
hive.cache.location=/opt/hive-cache/us-east-1
hive.cache.data-transfer-port=8804
hive.cache.bookkeeper-port=8805
hive.file-status-cache-tables=*
hive.file-status-cache-expire-time=30m
hive.metastore-cache-ttl=1m
hive.metastore-cache-maximum-size=100000
hive.metastore-refresh-interval=1m
hive.metastore.glue.max-connections=10
hive.metastore.glue.partitions-segments=10
hive.metastore.glue.get-partition-threads=100
hive.metastore.glue.read-statistics-threads=10
hive.metastore.glue.write-statistics-threads=10
hive.max-outstanding-splits=5000

Sample query:

trino> SELECT "resource_purpose_type" AS "resource_purpose_type",
    ->        count(*) * 100 / sum(count(*)) over() AS "Percentage"
    -> FROM
    ->   (select *,
    ->           cast(concat(day, ' ', hour, ':', minute) as TIMESTAMP) as activity_date
    ->    from s3_us_east_1.prod_rap_combined.full_useractivity
    ->    where account_id = 'ASDF'
    ->      and cast(concat(day, ' ', hour, ':', minute) as TIMESTAMP) >= at_timezone(current_timestamp, 'UTC') - interval '5' minute limit 10000) AS "virtual_table"
    -> GROUP BY "resource_purpose_type"
    -> ORDER BY "Percentage" ASC
    -> LIMIT 10000;

First query succeeds, with results:

Query 20211022_223034_00007_x848v, FINISHED, 6 nodes
Splits: 2,196 total, 1,850 done (84.24%)
50.45 [149K rows, 171MB] [2.96K rows/s, 3.39MB/s]

Subsequent query hits the bookkeeper crash and runs for ages:

Query 20211022_223339_00008_x848v, RUNNING, 6 nodes, 7,422 splits
7:47 [ 610K rows,  669MB] [1.31K rows/s, 1.43MB/s] [ <=>                                      ]

     STAGES   ROWS  ROWS/s  BYTES  BYTES/s  QUEUED    RUN   DONE
0.........R      0       0     0B       0B       0     51      0
  1.......S   610K   1.31K   669M    1.43M     434     48   6889

Error log (different query ID, but this is the same error we always get):

2021-10-25T16:03:42.927Z	WARN	20211025_160037_84338_u4v8f.3.1-179-66	com.qubole.rubix.spi.fop.SocketObjectFactory	Unable to open connection to host 192.168.9.55
org.apache.thrift.shaded.transport.TTransportException: java.net.SocketTimeoutException: connect timed out
	at org.apache.thrift.shaded.transport.TSocket.open(TSocket.java:226)
	at com.qubole.rubix.spi.fop.SocketObjectFactory.create(SocketObjectFactory.java:44)
	at com.qubole.rubix.spi.fop.SocketObjectFactory.create(SocketObjectFactory.java:24)
	at com.qubole.rubix.spi.fop.ObjectPoolPartition.increaseObjects(ObjectPoolPartition.java:128)
	at com.qubole.rubix.spi.fop.ObjectPoolPartition.getObject(ObjectPoolPartition.java:85)
	at com.qubole.rubix.spi.fop.ObjectPool.getObject(ObjectPool.java:95)
	at com.qubole.rubix.spi.fop.ObjectPool.borrowObject(ObjectPool.java:81)
	at com.qubole.rubix.spi.BookKeeperFactory.createBookKeeperClient(BookKeeperFactory.java:75)
	at com.qubole.rubix.core.NonLocalRequestChain.<init>(NonLocalRequestChain.java:75)
	at com.qubole.rubix.core.CachingInputStream.setupReadRequestChains(CachingInputStream.java:404)
	at com.qubole.rubix.core.CachingInputStream.readInternal(CachingInputStream.java:254)
	at com.qubole.rubix.core.CachingInputStream.read(CachingInputStream.java:223)
	at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
	at org.apache.hadoop.fs.BufferedFSInputStream.readFully(BufferedFSInputStream.java:113)
	at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
	at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
	at io.trino.plugin.hive.parquet.HdfsParquetDataSource.readFully(HdfsParquetDataSource.java:135)
	at io.trino.plugin.hive.parquet.HdfsParquetDataSource$ReferenceCountedReader.read(HdfsParquetDataSource.java:301)
	at io.trino.plugin.hive.parquet.HdfsParquetDataSource$1.read(HdfsParquetDataSource.java:208)
	at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:276)
	at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:337)
	at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:320)
	at io.trino.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:214)
	at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:390)
	at io.trino.spi.block.LazyBlock$LazyData.getTopLevelBlock(LazyBlock.java:359)
	at io.trino.spi.block.LazyBlock.getBlock(LazyBlock.java:264)
	at io.trino.spi.block.LazyBlock.isNull(LazyBlock.java:253)
	at io.trino.$gen.PageFilter_20211025_160110_125.filter(Unknown Source)
	at io.trino.$gen.PageFilter_20211025_160110_125.filter(Unknown Source)
	at io.trino.operator.project.PageProcessor.createWorkProcessor(PageProcessor.java:121)
	at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.lambda$processPageSource$1(ScanFilterAndProjectOperator.java:280)
	at io.trino.operator.WorkProcessorUtils.lambda$flatMap$4(WorkProcessorUtils.java:245)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:319)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:306)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:306)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:221)
	at io.trino.operator.WorkProcessorUtils.lambda$processStateMonitor$2(WorkProcessorUtils.java:200)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorUtils.lambda$flatten$6(WorkProcessorUtils.java:277)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:319)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:306)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:221)
	at io.trino.operator.WorkProcessorUtils.lambda$processStateMonitor$2(WorkProcessorUtils.java:200)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:221)
	at io.trino.operator.WorkProcessorUtils.lambda$finishWhen$3(WorkProcessorUtils.java:215)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorSourceOperatorAdapter.getOutput(WorkProcessorSourceOperatorAdapter.java:149)
	at io.trino.operator.Driver.processInternal(Driver.java:387)
	at io.trino.operator.Driver.lambda$processFor$9(Driver.java:291)
	at io.trino.operator.Driver.tryWithLock(Driver.java:683)
	at io.trino.operator.Driver.processFor(Driver.java:284)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1075)
	at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
	at io.trino.$gen.Trino_358____20211021_011419_2.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketTimeoutException: connect timed out
	at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399)
	at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242)
	at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224)
	at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.base/java.net.Socket.connect(Socket.java:609)
	at org.apache.thrift.shaded.transport.TSocket.open(TSocket.java:221)
	... 63 more


2021-10-25T16:03:42.927Z	WARN	20211025_160037_84338_u4v8f.3.1-179-66	com.qubole.rubix.spi.fop.ObjectPoolPartition	Pool: bks-pool : Host: 192.168.9.55 : Could not increase pool size. Pool state: totalCount=0 queueSize=0 delta=1
2021-10-25T16:03:42.927Z	ERROR	20211025_160037_84338_u4v8f.3.1-179-66	com.qubole.rubix.core.NonLocalRequestChain	Could not get cache status from bookkeeper server at 192.168.9.55
java.lang.RuntimeException: Could not add connections to pool
	at com.qubole.rubix.spi.fop.ObjectPoolPartition.getObject(ObjectPoolPartition.java:93)
	at com.qubole.rubix.spi.fop.ObjectPool.getObject(ObjectPool.java:95)
	at com.qubole.rubix.spi.fop.ObjectPool.borrowObject(ObjectPool.java:81)
	at com.qubole.rubix.spi.BookKeeperFactory.createBookKeeperClient(BookKeeperFactory.java:75)
	at com.qubole.rubix.core.NonLocalRequestChain.<init>(NonLocalRequestChain.java:75)
	at com.qubole.rubix.core.CachingInputStream.setupReadRequestChains(CachingInputStream.java:404)
	at com.qubole.rubix.core.CachingInputStream.readInternal(CachingInputStream.java:254)
	at com.qubole.rubix.core.CachingInputStream.read(CachingInputStream.java:223)
	at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
	at org.apache.hadoop.fs.BufferedFSInputStream.readFully(BufferedFSInputStream.java:113)
	at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
	at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
	at io.trino.plugin.hive.parquet.HdfsParquetDataSource.readFully(HdfsParquetDataSource.java:135)
	at io.trino.plugin.hive.parquet.HdfsParquetDataSource$ReferenceCountedReader.read(HdfsParquetDataSource.java:301)
	at io.trino.plugin.hive.parquet.HdfsParquetDataSource$1.read(HdfsParquetDataSource.java:208)
	at io.trino.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:276)
	at io.trino.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:337)
	at io.trino.parquet.reader.ParquetReader.readBlock(ParquetReader.java:320)
	at io.trino.plugin.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:214)
	at io.trino.spi.block.LazyBlock$LazyData.load(LazyBlock.java:390)
	at io.trino.spi.block.LazyBlock$LazyData.getTopLevelBlock(LazyBlock.java:359)
	at io.trino.spi.block.LazyBlock.getBlock(LazyBlock.java:264)
	at io.trino.spi.block.LazyBlock.isNull(LazyBlock.java:253)
	at io.trino.$gen.PageFilter_20211025_160110_125.filter(Unknown Source)
	at io.trino.$gen.PageFilter_20211025_160110_125.filter(Unknown Source)
	at io.trino.operator.project.PageProcessor.createWorkProcessor(PageProcessor.java:121)
	at io.trino.operator.ScanFilterAndProjectOperator$SplitToPages.lambda$processPageSource$1(ScanFilterAndProjectOperator.java:280)
	at io.trino.operator.WorkProcessorUtils.lambda$flatMap$4(WorkProcessorUtils.java:245)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:319)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:306)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:306)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:221)
	at io.trino.operator.WorkProcessorUtils.lambda$processStateMonitor$2(WorkProcessorUtils.java:200)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorUtils.lambda$flatten$6(WorkProcessorUtils.java:277)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:319)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:306)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:221)
	at io.trino.operator.WorkProcessorUtils.lambda$processStateMonitor$2(WorkProcessorUtils.java:200)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:221)
	at io.trino.operator.WorkProcessorUtils.lambda$finishWhen$3(WorkProcessorUtils.java:215)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:372)
	at io.trino.operator.WorkProcessorSourceOperatorAdapter.getOutput(WorkProcessorSourceOperatorAdapter.java:149)
	at io.trino.operator.Driver.processInternal(Driver.java:387)
	at io.trino.operator.Driver.lambda$processFor$9(Driver.java:291)
	at io.trino.operator.Driver.tryWithLock(Driver.java:683)
	at io.trino.operator.Driver.processFor(Driver.java:284)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1075)
	at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
	at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)
	at io.trino.$gen.Trino_358____20211021_011419_2.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
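
For what it's worth, the underlying failure above is just a TCP connect timeout from one worker to the bookkeeper on another worker. A minimal probe like the following, assuming the cache ports configured above (hive.cache.bookkeeper-port=8805, hive.cache.data-transfer-port=8804) and the worker IP from the log, can confirm whether those ports are reachable between workers at all:

import socket

# Worker IP taken from the log above; cache ports taken from the catalog
# properties above. Adjust both for your own cluster.
HOST = "192.168.9.55"
PORTS = {"bookkeeper": 8805, "data-transfer": 8804}

for name, port in PORTS.items():
    try:
        # Plain TCP connect with a 5 second timeout, roughly what the Rubix
        # client is timing out on in the stack trace.
        with socket.create_connection((HOST, port), timeout=5):
            print(f"{name} port {port} on {HOST}: reachable")
    except OSError as exc:
        print(f"{name} port {port} on {HOST}: NOT reachable ({exc})")

If the ports are reachable from other workers while the errors still occur, the bookkeeper process on that worker is more likely wedged than the network.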
nbayle changed the title from "Query causes hive connector to crash; requires restart of Trino" to "Query causes hive cache to crash; requires restart of Trino" on Oct 26, 2021
findepi added the bug label on Oct 30, 2021
findepi (Member) commented Oct 30, 2021

manavjaiswal commented

Hi. Is there any progress on this issue? Please help get it resolved. Thanks.

tiechengsu commented Jun 24, 2023

We use Trino 380 and see the same error after enabling the cache. We run Trino on Kubernetes and see the error on a bunch of workers, but one worker in particular has far more of these errors, while the rest of the workers report Unable to open connection to host: <ip of the problematic worker>. The symptom is that the problematic worker has much higher CPU usage than the others, even though plenty of queries are running on the cluster. Can someone look into this?
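
One way to pin down which worker the failing IP belongs to is to query system.runtime.nodes on the coordinator. A rough sketch using the trino Python client (the coordinator host, port, and user below are placeholders, not values from this issue):

import trino

# Placeholder connection details; replace with your coordinator's host/port
# and any user name.
conn = trino.dbapi.connect(host="trino-coordinator", port=8080, user="debug")
cur = conn.cursor()

# Each row's http_uri contains the worker's address, so the IP from
# "Unable to open connection to host: <ip>" can be mapped to a node id / pod.
cur.execute("SELECT node_id, http_uri, state FROM system.runtime.nodes")
for node_id, http_uri, state in cur.fetchall():
    print(node_id, http_uri, state)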

raunaqmorarka (Member) commented

Rubix has now been replaced by Alluxio: #20550

raunaqmorarka closed this as not planned on Feb 14, 2024