
GH-1067: Close cached HDFS FileSystem instances#4

Draft
xborder wants to merge 1 commit into main from xborder/fix-ipc-channel-read

Conversation

@xborder (Owner) commented Mar 29, 2026

Summary

Fixes apache#1067.

This change adds explicit HDFS resource cleanup, in the same spirit as the explicit lifecycle cleanup the dataset JNI layer already performs for S3. For S3, Arrow Java registers a shutdown hook that finalizes the native S3 subsystem. That approach cannot work for HDFS: the leak comes from cached Hadoop FileSystem instances whose non-daemon IPC threads keep the JVM alive, so a normal-exit shutdown hook would never get a chance to run. The cleanup therefore has to happen when FileSystemDatasetFactory is closed.
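For context, the S3-style pattern referenced above is a JVM shutdown hook. A minimal sketch of that pattern follows; `finalizeNativeS3` is a stand-in name for illustration, not Arrow's actual method:

```java
public class S3StyleShutdownHook {
    // Stand-in for Arrow's native S3 finalizer; illustrative only.
    private static void finalizeNativeS3() {
        System.out.println("native S3 subsystem finalized");
    }

    public static void main(String[] args) {
        // Register a hook that runs when the JVM begins an orderly shutdown.
        // This only fires once all non-daemon threads have finished, which is
        // exactly why it cannot release lingering HDFS IPC threads.
        Thread hook = new Thread(S3StyleShutdownHook::finalizeNativeS3);
        Runtime.getRuntime().addShutdownHook(hook);
        System.out.println("work done; hook runs at JVM exit");
    }
}
```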

FileSystemDatasetFactory now retains its input URIs and, on close(), detects HDFS-backed URIs, normalizes and deduplicates the filesystem roots, and best-effort closes the corresponding cached Hadoop FileSystem instances. That releases the lingering IPC threads and allows the JVM to exit normally after HDFS dataset reads.
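The root normalization and deduplication step can be sketched as below. Names and structure are illustrative, not the actual patch; the real cleanup would then best-effort close each cached Hadoop FileSystem for the deduplicated roots (roughly `org.apache.hadoop.fs.FileSystem.get(root, conf).close()`):

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class HdfsRootDedup {
    // Extract the filesystem root (scheme + authority) from each HDFS-backed
    // URI and deduplicate, so each cached FileSystem is closed at most once.
    public static Set<URI> hdfsRoots(List<String> uris) {
        Set<URI> roots = new LinkedHashSet<>();
        for (String s : uris) {
            URI u = URI.create(s);
            if ("hdfs".equalsIgnoreCase(u.getScheme())) {
                // Normalize to the root: keep scheme and authority, drop the path.
                roots.add(URI.create(u.getScheme() + "://" + u.getAuthority() + "/"));
            }
        }
        return roots;
    }

    public static void main(String[] args) {
        Set<URI> roots = hdfsRoots(List.of(
            "hdfs://namenode:9000/data/a.parquet",
            "hdfs://namenode:9000/data/b.parquet",
            "file:///tmp/local.parquet"));
        System.out.println(roots); // prints [hdfs://namenode:9000/]
    }
}
```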

This PR also adds a regression test using MiniDFSCluster and a forked child JVM to show the behavioral difference:

  • without cleanup, the child JVM hangs
  • with FileSystemDatasetFactory cleanup, the child JVM exits normally
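The fork-and-wait pattern behind that regression test can be sketched as follows. This is a simplified stand-in, not the actual test: `-version` replaces the real HDFS-read child workload, and a hung child fails a timeout instead of hanging the suite:

```java
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

public class ForkedJvmExitCheck {
    // Launch a child JVM and report whether it terminates within the deadline.
    // A child kept alive by non-daemon HDFS IPC threads would miss the deadline.
    public static boolean childExitsWithin(long seconds, String... jvmArgs) throws Exception {
        String javaBin = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        String[] cmd = new String[jvmArgs.length + 1];
        cmd[0] = javaBin;
        System.arraycopy(jvmArgs, 0, cmd, 1, jvmArgs.length);
        Process child = new ProcessBuilder(cmd).inheritIO().start();
        boolean exited = child.waitFor(seconds, TimeUnit.SECONDS);
        if (!exited) {
            child.destroyForcibly(); // don't leak a hung child into later tests
        }
        return exited;
    }

    public static void main(String[] args) throws Exception {
        // A trivial child ("java -version") exits well within the deadline.
        System.out.println(childExitsWithin(30, "-version"));
    }
}
```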

Testing

mvn -Parrow-jni -pl dataset -Dmaven.gitcommitid.skip=true -Dtest=TestHdfsFileSystemCleanup test

Result:

  • Tests run: 2, Failures: 0, Errors: 0, Skipped: 0
  • BUILD SUCCESS



Development

Successfully merging this pull request may close these issues.

[ARROW Java][HDFS] JVM hangs after reading HDFS files via Arrow Dataset API due to non-daemon native threads
