
GH-1067: Close cached HDFS FileSystem instances#4

Draft
xborder wants to merge 1 commit into main from xborder/fix-ipc-channel-read

Conversation

@xborder (Owner) commented Mar 29, 2026

Summary

Fixes apache#1067.

This change adds explicit HDFS resource cleanup, in the same spirit as the explicit lifecycle cleanup the dataset JNI layer already performs for S3. For S3, Arrow Java registers a shutdown hook that finalizes the native S3 subsystem. That approach cannot work for HDFS: the leak comes from cached Hadoop FileSystem instances whose non-daemon IPC threads keep the JVM alive, so a normal-exit shutdown hook would never get a chance to run. The cleanup therefore has to happen when FileSystemDatasetFactory is closed.
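For context, the S3-style pattern referenced above is a JVM shutdown hook. A minimal sketch of that pattern follows; `finalizeNativeS3` is a stand-in name for illustration, not Arrow's actual method:

```java
public class S3StyleShutdownHook {
    // Stand-in for Arrow's native S3 finalizer; illustrative only.
    private static void finalizeNativeS3() {
        System.out.println("native S3 subsystem finalized");
    }

    public static void main(String[] args) {
        // Register a hook that runs when the JVM begins an orderly shutdown.
        // This only fires once all non-daemon threads have finished, which is
        // exactly why it cannot release lingering HDFS IPC threads.
        Thread hook = new Thread(S3StyleShutdownHook::finalizeNativeS3);
        Runtime.getRuntime().addShutdownHook(hook);
        System.out.println("work done; hook runs at JVM exit");
    }
}
```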

FileSystemDatasetFactory now retains its input URIs and, on close(), detects HDFS-backed URIs, normalizes and deduplicates the filesystem roots, and best-effort closes the corresponding cached Hadoop FileSystem instances. That releases the lingering IPC threads and allows the JVM to exit normally after HDFS dataset reads.
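The root normalization and deduplication step can be sketched as below. Names and structure are illustrative, not the actual patch; the real cleanup would then best-effort close each cached Hadoop FileSystem for the deduplicated roots (roughly `org.apache.hadoop.fs.FileSystem.get(root, conf).close()`):

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class HdfsRootDedup {
    // Extract the filesystem root (scheme + authority) from each HDFS-backed
    // URI and deduplicate, so each cached FileSystem is closed at most once.
    public static Set<URI> hdfsRoots(List<String> uris) {
        Set<URI> roots = new LinkedHashSet<>();
        for (String s : uris) {
            URI u = URI.create(s);
            if ("hdfs".equalsIgnoreCase(u.getScheme())) {
                // Normalize to the root: keep scheme and authority, drop the path.
                roots.add(URI.create(u.getScheme() + "://" + u.getAuthority() + "/"));
            }
        }
        return roots;
    }

    public static void main(String[] args) {
        Set<URI> roots = hdfsRoots(List.of(
            "hdfs://namenode:9000/data/a.parquet",
            "hdfs://namenode:9000/data/b.parquet",
            "file:///tmp/local.parquet"));
        System.out.println(roots); // prints [hdfs://namenode:9000/]
    }
}
```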

This PR also adds a regression test using MiniDFSCluster and a forked child JVM to show the behavioral difference:

  • without cleanup, the child JVM hangs
  • with FileSystemDatasetFactory cleanup, the child JVM exits normally
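The fork-and-wait pattern behind that regression test can be sketched as follows. This is a simplified stand-in, not the actual test: `-version` replaces the real HDFS-read child workload, and a hung child fails a timeout instead of hanging the suite:

```java
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

public class ForkedJvmExitCheck {
    // Launch a child JVM and report whether it terminates within the deadline.
    // A child kept alive by non-daemon HDFS IPC threads would miss the deadline.
    public static boolean childExitsWithin(long seconds, String... jvmArgs) throws Exception {
        String javaBin = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        String[] cmd = new String[jvmArgs.length + 1];
        cmd[0] = javaBin;
        System.arraycopy(jvmArgs, 0, cmd, 1, jvmArgs.length);
        Process child = new ProcessBuilder(cmd).inheritIO().start();
        boolean exited = child.waitFor(seconds, TimeUnit.SECONDS);
        if (!exited) {
            child.destroyForcibly(); // don't leak a hung child into later tests
        }
        return exited;
    }

    public static void main(String[] args) throws Exception {
        // A trivial child ("java -version") exits well within the deadline.
        System.out.println(childExitsWithin(30, "-version"));
    }
}
```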

Testing

mvn -Parrow-jni -pl dataset -Dmaven.gitcommitid.skip=true -Dtest=TestHdfsFileSystemCleanup test

Result:

  • Tests run: 2, Failures: 0, Errors: 0, Skipped: 0
  • BUILD SUCCESS



Development

Successfully merging this pull request may close these issues.

[ARROW Java][HDFS] JVM hangs after reading HDFS files via Arrow Dataset API due to non-daemon native threads
