[core] Ray session conflicts with PyArrow+HDFS #36415

Open
krfricke opened this issue Jun 14, 2023 · 4 comments
Labels: bug, core, core-fundamentals, P1, stability

Comments

@krfricke
Contributor

What happened + What you expected to happen

Using the PyArrow filesystem API with HDFS works fine outside a Ray session:

file_sys, file_path = pyarrow.fs.FileSystem.from_uri(hdfs_folder)
file_infos = file_sys.get_file_info(pyarrow.fs.FileSelector(file_path, recursive=False))

However, after ray.init(), the same code results in a segmentation fault:

2023-06-14 01:27:37,622 INFO worker.py:1614 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
*** SIGSEGV received at time=1686731258 on cpu 0 ***     
PC: @     0x7f99d20c5822  (unknown)  (unknown)   
    @     0x7f996fa6ec85        208  absl::lts_20220623::WriteFailureInfo() 
    @     0x7f996fa6e9c8         64  absl::lts_20220623::AbslFailureSignalHandler()                                    
    @     0x7f99e81c6420       3408  (unknown)      
    @     0x7f99d1c2782e         48  (unknown)
    @     0x7f99d1c2cc0f        240  (unknown)                                                                                                        
    @     0x7f99d2267a5f        144  (unknown)                                                                                                        
    @     0x7f99d2267d53        128  (unknown)
    @     0x7f99d21092a0         64  (unknown)
    @     0x7f99e81ba609  (unknown)  start_thread           
[2023-06-14 01:27:38,591 E 9716 9731] logging.cc:361: *** SIGSEGV received at time=1686731258 on cpu 0 ***
[2023-06-14 01:27:38,591 E 9716 9731] logging.cc:361: PC: @     0x7f99d20c5822  (unknown)  (unknown)
[2023-06-14 01:27:38,591 E 9716 9731] logging.cc:361:     @     0x7f996fa6ec85        208  absl::lts_20220623::WriteFailureInfo()
[2023-06-14 01:27:38,592 E 9716 9731] logging.cc:361:     @     0x7f996fa6e9e1         64  absl::lts_20220623::AbslFailureSignalHandler()
[2023-06-14 01:27:38,593 E 9716 9731] logging.cc:361:     @     0x7f99e81c6420       3408  (unknown)
[2023-06-14 01:27:38,593 E 9716 9731] logging.cc:361:     @     0x7f99d1c2782e         48  (unknown)
[2023-06-14 01:27:38,593 E 9716 9731] logging.cc:361:     @     0x7f99d1c2cc0f        240  (unknown)
[2023-06-14 01:27:38,593 E 9716 9731] logging.cc:361:     @     0x7f99d2267a5f        144  (unknown)
[2023-06-14 01:27:38,593 E 9716 9731] logging.cc:361:     @     0x7f99d2267d53        128  (unknown)      
[2023-06-14 01:27:38,593 E 9716 9731] logging.cc:361:     @     0x7f99d21092a0         64  (unknown)
[2023-06-14 01:27:38,593 E 9716 9731] logging.cc:361:     @     0x7f99e81ba609  (unknown)  start_thread                          
Fatal Python error: Segmentation fault                                                                                                                
                                                                                                                                                      
#                                                                                                                                                     
# A fatal error has been detected by the Java Runtime Environment:                                                                                    
#                                                                                                                                                     
#  SIGSEGV (0xb) at pc=0x00007f99e81c62ab, pid=9716, tid=0x00007f99baa56700                         
#                                                                                                                                                     
# JRE version: OpenJDK Runtime Environment (8.0_362-b09) (build 1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09)
# Java VM: OpenJDK 64-Bit Server VM (25.362-b09 mixed mode linux-amd64 compressed oops)             
# Problematic frame:                                                                                                                                  
# C  [libpthread.so.0+0x142ab]  raise+0xcb                                                                                                            
#                                                                                                                                                     
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /ray/hs_err_pid9716.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
[failure_signal_handler.cc : 332] RAW: Signal 6 raised at PC=0x7f99e800300b while already in AbslFailureSignalHandler()
*** SIGABRT received at time=1686731258 on cpu 0 ***
PC: @     0x7f99e800300b  (unknown)  raise
    @     0x7f996fa6ec85        208  absl::lts_20220623::WriteFailureInfo()
    @     0x7f996fa6e9c8         64  absl::lts_20220623::AbslFailureSignalHandler()
    @     0x7f99e81c6420       3952  (unknown)
    @     0x7f99d22c3843        240  (unknown)
    @     0x7f99d211410e        352  JVM_handle_linux_signal
    @     0x7f99d210731c         64  (unknown)
    @     0x7f99e81c6420      10576  (unknown)
    @     0x7f99d1c2782e         48  (unknown)
    @     0x7f99d1c2cc0f        240  (unknown)
    @     0x7f99d2267a5f        144  (unknown)
    @     0x7f99d2267d53        128  (unknown)
    @     0x7f99d21092a0         64  (unknown)
    @     0x7f99e81ba609  (unknown)  start_thread
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361: *** SIGABRT received at time=1686731258 on cpu 0 ***
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361: PC: @     0x7f99e800300b  (unknown)  raise
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361:     @     0x7f996fa6ec85        208  absl::lts_20220623::WriteFailureInfo()
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361:     @     0x7f996fa6e9e1         64  absl::lts_20220623::AbslFailureSignalHandler()
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361:     @     0x7f99e81c6420       3952  (unknown)
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361:     @     0x7f99d22c3843        240  (unknown)
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361:     @     0x7f99d211410e        352  JVM_handle_linux_signal
[2023-06-14 01:27:38,618 E 9716 9731] logging.cc:361:     @     0x7f99d210731c         64  (unknown)
[2023-06-14 01:27:38,619 E 9716 9731] logging.cc:361:     @     0x7f99e81c6420      10576  (unknown)
[2023-06-14 01:27:38,619 E 9716 9731] logging.cc:361:     @     0x7f99d1c2782e         48  (unknown)
[2023-06-14 01:27:38,619 E 9716 9731] logging.cc:361:     @     0x7f99d1c2cc0f        240  (unknown)
[2023-06-14 01:27:38,619 E 9716 9731] logging.cc:361:     @     0x7f99d2267a5f        144  (unknown)
[2023-06-14 01:27:38,619 E 9716 9731] logging.cc:361:     @     0x7f99d2267d53        128  (unknown)
[2023-06-14 01:27:38,619 E 9716 9731] logging.cc:361:     @     0x7f99d21092a0         64  (unknown)
[2023-06-14 01:27:38,619 E 9716 9731] logging.cc:361:     @     0x7f99e81ba609  (unknown)  start_thread
Fatal Python error: Aborted

Here is the log dump from Java:

hs_err_pid9716.log

The segfault occurs almost every time, but not always.

It never occurs when Ray is not initialized, so there is probably some interference between the Ray session/global state and the Java/PyArrow/HDFS connection.
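
To put a number on "almost every time", one option is to run the reproduction script repeatedly in fresh processes and tally how each run ends. The following is a minimal sketch, not part of the original report: `classify_exit` and `tally_runs` are hypothetical helper names, and it assumes a POSIX system, where a negative return code means the child was killed by that signal.

```python
import collections
import signal
import subprocess
import sys


def classify_exit(returncode):
    """Map a subprocess return code to 'ok', a signal name, or 'exit N'."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, -N means the child was terminated by signal N.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit {returncode}"


def tally_runs(argv, runs=20, timeout=300):
    """Run `argv` in a fresh process `runs` times and count the outcomes."""
    outcomes = collections.Counter()
    for _ in range(runs):
        try:
            proc = subprocess.run(
                argv,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            outcomes["timeout"] += 1
            continue
        outcomes[classify_exit(proc.returncode)] += 1
    return outcomes
```

For example, `tally_runs([sys.executable, "repro.py"], runs=20)` returns a counter with keys such as `ok`, `SIGSEGV`, or `SIGABRT`; `repro.py` is an assumed file name for the reproduction script below.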

Versions / Dependencies

Ray latest master, Hadoop 3.2.4, Java OpenJDK version "1.8.0_362"

Reproduction script

  • Install HDFS with ./ci/env-install-hdfs.sh
  • Create a directory in HDFS, e.g. with /opt/hadoop-3.2.4/bin/hdfs dfs -put /tmp/somewhere hdfs://[host]:8020/somewhere
  • Run this script
import os
import sys

import pyarrow
import pyarrow.fs


def setup_hdfs():
    """Set env vars required by pyarrow to talk to hdfs correctly.

    Returns hostname and port needed for the hdfs uri."""

    # the following file is written in `install-hdfs.sh`.
    with open("/tmp/hdfs_env", "r") as f:
        for line in f:
            line = line.rstrip("\n")
            tokens = line.split("=", maxsplit=1)
            os.environ[tokens[0]] = tokens[1]

    sys.path.insert(0, os.path.join(os.environ["HADOOP_HOME"], "bin"))
    hostname = os.getenv("CONTAINER_ID")
    port = os.getenv("HDFS_PORT")
    return hostname, port


hostname, port = setup_hdfs()


workspace_dir = f'hdfs://{hostname}:{port}/somewhere'


# from ray.air._internal.remote_storage import upload_to_uri 
# upload_to_uri("/tmp/content", workspace_dir)

def get_list_of_files_under_hdfs_folder(hdfs_folder):
    file_sys, file_path = pyarrow.fs.FileSystem.from_uri(hdfs_folder)
    file_infos = file_sys.get_file_info(pyarrow.fs.FileSelector(file_path, recursive=False))
    return file_infos

print(f"Success!, number of files in {workspace_dir}: {len(get_list_of_files_under_hdfs_folder(workspace_dir))}")
print(f"Success!, number of files in {workspace_dir}: {len(get_list_of_files_under_hdfs_folder(workspace_dir))}")

                                                                           
print("initializing ray, and get number of files again.")
import ray
ray.is_initialized()
ray.init()
print("After ray init", len(get_list_of_files_under_hdfs_folder(workspace_dir)))
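
As a possible stopgap while the root cause is open, the listing could run in a short-lived child process that never imports Ray, since the same calls work in a process without a Ray session. This is an untested sketch under that assumption, not a confirmed fix: `list_hdfs_folder_isolated` and `_CHILD_CODE` are hypothetical names, and the child relies on inheriting the HDFS environment variables that `setup_hdfs()` puts into `os.environ`.

```python
import json
import subprocess
import sys

# Code executed in the child interpreter. It never imports ray, so no Ray
# session state exists in the process that talks to HDFS.
_CHILD_CODE = """
import json, sys
import pyarrow.fs

fs, path = pyarrow.fs.FileSystem.from_uri(sys.argv[1])
infos = fs.get_file_info(pyarrow.fs.FileSelector(path, recursive=False))
print(json.dumps([info.path for info in infos]))
"""


def list_hdfs_folder_isolated(hdfs_folder, child_code=_CHILD_CODE):
    """Return the paths under `hdfs_folder`, listed by a separate process."""
    proc = subprocess.run(
        [sys.executable, "-c", child_code, hdfs_folder],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if the child exits non-zero
    )
    # Parse only the last stdout line, in case anything else was printed first.
    return json.loads(proc.stdout.splitlines()[-1])
```

Unlike the function above, this returns plain path strings rather than `FileInfo` objects, because the result has to cross a process boundary as JSON.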

Issue Severity

High: It blocks me from completing my task.

krfricke added the bug and triage labels on Jun 14, 2023
xieus added the P1 and core labels and removed the triage label on Jun 20, 2023
@wxy117

wxy117 commented Sep 7, 2023

same error here

@yydai

yydai commented Mar 3, 2024

Any update? Same error here.

jjyao removed their assignment on Mar 17, 2024
@redhatdean

Same error

@kewenkang

Any update? I ran into a similar error.

8 participants