
shuffle server OOM #49

Closed
packageman opened this issue Dec 27, 2021 · 5 comments

Comments

@packageman

I deployed shuffle server release 0.1.0 on a 16-core, 64 GB machine with XMX_SIZE="55g". While a Spark application is running, the shuffle server's memory keeps growing until it reaches about 60 GB, at which point the OOM is triggered and the process exits.

server.conf

rss.rpc.server.port 20001
rss.jetty.http.port 20000
rss.rpc.executor.size 2000
rss.storage.basePath /data/rss
rss.storage.type LOCALFILE
rss.coordinator.quorum xxx:19999,xxx:19999,xxx:19999
rss.server.buffer.capacity 20000000000
rss.server.buffer.spill.threshold 5000000000
rss.server.partition.buffer.size 157200000
rss.server.read.buffer.capacity 10000000000
rss.server.heartbeat.timeout 60000
rss.server.heartbeat.interval 10000
rss.rpc.message.max.size 1073741824
rss.server.preAllocation.expired 120000
rss.server.commit.timeout 3600000
rss.storage.data.replica 1
rss.server.flush.thread.alive 5
rss.server.flush.threadPool.size 20
rss.server.app.expired.withoutHeartbeat 30000

rss-env.sh

set -o pipefail
set -e

XMX_SIZE="55g"

RUNNER="${JAVA_HOME}/bin/java"
JPS="${JAVA_HOME}/bin/jps"

[3 images]

Is my configuration incorrect, or is there a memory leak in the program?

@colinmjj
Collaborator

Can you share the client's config?

@packageman
Author

spark.properties

#Java properties built from Kubernetes config map with name: spark-drv-e832d57dfa6994bc-conf-map
#Mon Dec 27 13:42:43 CST 2021
spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager
spark.rss.coordinator.quorum=xxx\:19999,xxx\:19999,xxx\:19999
spark.rss.storage.type=LOCALFILE

spark.driver.port=7078
spark.kubernetes.resource.type=java
spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=bigdata-pv
spark.executor.cores=6
spark.history.fs.cleaner.enabled=true
spark.kubernetes.executor.request.cores=6
spark.submit.pyFiles=
spark.executor.memory=30g
spark.kubernetes.driverEnv.APP_TYPES=spark
spark.driver.memoryOverhead=4g
spark.kubernetes.container.image=127.0.0.1\:65001/xxx/service-spark\:staging
spark.master=k8s\://https\://kubernetes.default
spark.driver.memory=4g
spark.kubernetes.driver.request.cores=0.05
spark.kubernetes.driver.pod.name=bigdata-warehouseeditorfirestorm
spark.driver.host=bigdata-warehouseeditorfirestorm-d275ac7dfa6992c2-driver-svc.scrm.svc
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=bigdata-pv
spark.eventLog.compress=true
spark.submit.deployMode=cluster
spark.executor.extraJavaOptions=-DREQ_ID\=98cf7a8b-730a-4902-ab12-f213f4268156
spark.kubernetes.authenticate.driver.serviceAccountName=bigdata-api
spark.history.fs.logDirectory=file\:///data/spark-history
spark.kubernetes.submitInDriver=true
spark.kubernetes.pyspark.pythonVersion=3
spark.kubernetes.memoryOverheadFactor=0.2
spark.app.name=bigdata-warehouseeditorfirestorm
spark.eventLog.enabled=true
spark.driver.cores=1
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/data
spark.kubernetes.authenticate.submission.caCertFile=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
spark.driver.blockManager.port=7079
spark.kubernetes.driverEnv.SPRING_PROFILES_ACTIVE=spark,staging
spark.executor.memoryOverhead=5g
spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/data
spark.app.id=spark-cfbe5cf4715042cd82ebd6cab82d069c
spark.eventLog.dir=file\:///data/spark-history
spark.kubernetes.authenticate.submission.oauthTokenFile=/var/run/secrets/kubernetes.io/serviceaccount/token
spark.kubernetes.namespace=scrm
spark.executor.instances=5
spark.jars=local\:///opt/spark/jars/app.jar

@colinmjj
Collaborator

I can't tell the root cause yet. The shuffle server's memory consists of the write buffer, the read buffer, and metadata, so there shouldn't be an OOM with your configuration.
Please check the shuffle server's log. Also, for a shuffle server with one storage device, update the following configuration:
rss.server.flush.thread.alive 5 -> 2
rss.server.flush.threadPool.size 20 -> 4
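
As a rough cross-check of that breakdown, the configured buffer capacities can be added up and compared with the heap size. This is only a sketch: the numbers are copied from the server.conf and rss-env.sh posted earlier in this issue, the helper name `buffer_budget` is hypothetical, and it assumes XMX_SIZE="55g" ends up as a 55 GiB maximum heap. Metadata has no configured limit here, so it simply has to fit in whatever remains.

```python
# Values copied from the server.conf posted in this issue (bytes).
WRITE_BUFFER_CAPACITY = 20_000_000_000  # rss.server.buffer.capacity
READ_BUFFER_CAPACITY = 10_000_000_000   # rss.server.read.buffer.capacity

# Assumption: XMX_SIZE="55g" from rss-env.sh becomes a 55 GiB max heap.
HEAP_MAX = 55 * 1024**3


def buffer_budget(write_cap: int, read_cap: int, heap_max: int) -> dict:
    """Sum the configured buffers and report the heap left for everything else."""
    configured = write_cap + read_cap
    return {
        "configured_buffers": configured,
        "heap_remaining": heap_max - configured,
        "buffer_fraction_of_heap": configured / heap_max,
    }


budget = buffer_budget(WRITE_BUFFER_CAPACITY, READ_BUFFER_CAPACITY, HEAP_MAX)
for key, value in budget.items():
    print(f"{key}: {value}")
```

Under these assumptions the two buffers account for roughly half of the heap, which matches the expectation that the configuration alone should not exhaust it.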

@packageman
Author

rss.server.flush.thread.alive 5 -> 2
rss.server.flush.threadPool.size 20 -> 4

The OOM still happened after these changes.

I called the metrics API (/metrics/jvm, /metrics/server) to check the buffer usage and the JVM metrics: the buffer-related metrics are all 0 or very small, but jvm_memory_bytes_used is 16512134504, about 16 GB. Apart from the read/write/in-flush/preallocated buffers, is it the metadata that occupies these 16 GB?

shuffle server metrics:

{
    "metrics": [
        {
            "name": "event_size_threshold_level4",
            "labelNames": [],
            "labelValues": [],
            "value": 199,
            "timestampMs": null
        },
        {
            "name": "registered_shuffle",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_write_num",
            "labelNames": [],
            "labelValues": [],
            "value": 199,
            "timestampMs": null
        },
        {
            "name": "event_size_threshold_level3",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "event_size_threshold_level2",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_read_memory_data",
            "labelNames": [],
            "labelValues": [],
            "value": 976536477845,
            "timestampMs": null
        },
        {
            "name": "in_flush_buffer_size",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_write_data",
            "labelNames": [],
            "labelValues": [],
            "value": 10785298196,
            "timestampMs": null
        },
        {
            "name": "used_buffer_size",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "app_num_with_node",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_read_time",
            "labelNames": [],
            "labelValues": [],
            "value": 162889,
            "timestampMs": null
        },
        {
            "name": "registered_shuffle_engine",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_received_data",
            "labelNames": [],
            "labelValues": [],
            "value": 26123582195,
            "timestampMs": null
        },
        {
            "name": "total_upload_time_s",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_write_exception",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_read_data",
            "labelNames": [],
            "labelValues": [],
            "value": 1184709143754,
            "timestampMs": null
        },
        {
            "name": "allocated_buffer_size",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_read_local_data_file",
            "labelNames": [],
            "labelValues": [],
            "value": 208101043509,
            "timestampMs": null
        },
        {
            "name": "total_upload_size",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "partition_num_with_node",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "buffered_data_size",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_write_slow",
            "labelNames": [],
            "labelValues": [],
            "value": 13,
            "timestampMs": null
        },
        {
            "name": "total_dropped_event_num",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_read_local_index_file",
            "labelNames": [],
            "labelValues": [],
            "value": 71622400,
            "timestampMs": null
        },
        {
            "name": "total_write_time",
            "labelNames": [],
            "labelValues": [],
            "value": 468057,
            "timestampMs": null
        },
        {
            "name": "event_size_threshold_level1",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_write_block",
            "labelNames": [],
            "labelValues": [],
            "value": 58669,
            "timestampMs": null
        },
        {
            "name": "event_queue_size",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "total_write_handler",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        }
    ],
    "timeStamp": 1640933730956
}
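
To make a dump like this easier to scan, the zero-valued entries can be filtered out. The sketch below is illustrative only: `SAMPLE` holds four entries copied from the output above, `nonzero_metrics` is a hypothetical helper, and it keys results by metric name alone, which suits the label-free server metrics but would collapse labelled series such as those returned by /metrics/jvm.

```python
import json

# Four entries copied from the /metrics/server output above; the real
# response contains many more metrics in the same shape.
SAMPLE = """
{
    "metrics": [
        {"name": "used_buffer_size", "labelNames": [], "labelValues": [],
         "value": 0, "timestampMs": null},
        {"name": "in_flush_buffer_size", "labelNames": [], "labelValues": [],
         "value": 0, "timestampMs": null},
        {"name": "total_received_data", "labelNames": [], "labelValues": [],
         "value": 26123582195, "timestampMs": null},
        {"name": "total_write_data", "labelNames": [], "labelValues": [],
         "value": 10785298196, "timestampMs": null}
    ],
    "timeStamp": 1640933730956
}
"""


def nonzero_metrics(payload: dict) -> dict:
    """Map metric name to value, keeping only metrics whose value is not 0."""
    return {m["name"]: m["value"] for m in payload["metrics"] if m["value"] != 0}


nonzero = nonzero_metrics(json.loads(SAMPLE))
for name, value in sorted(nonzero.items()):
    print(f"{name}: {value}")
```

Applied to the full response, this would leave only the cumulative counters and confirm that every buffer gauge is at 0.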

JVM metrics:

{
    "metrics": [
        {
            "name": "jvm_info",
            "labelNames": [
                "version",
                "vendor",
                "runtime"
            ],
            "labelValues": [
                "1.8.0_292-b10",
                "AdoptOpenJDK",
                "OpenJDK Runtime Environment"
            ],
            "value": 1,
            "timestampMs": null
        },
        {
            "name": "jvm_gc_collection_seconds_count",
            "labelNames": [
                "gc"
            ],
            "labelValues": [
                "G1 Young Generation"
            ],
            "value": 128,
            "timestampMs": null
        },
        {
            "name": "jvm_gc_collection_seconds_sum",
            "labelNames": [
                "gc"
            ],
            "labelValues": [
                "G1 Young Generation"
            ],
            "value": 25.19,
            "timestampMs": null
        },
        {
            "name": "jvm_gc_collection_seconds_count",
            "labelNames": [
                "gc"
            ],
            "labelValues": [
                "G1 Old Generation"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_gc_collection_seconds_sum",
            "labelNames": [
                "gc"
            ],
            "labelValues": [
                "G1 Old Generation"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_current",
            "labelNames": [],
            "labelValues": [],
            "value": 1067,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_daemon",
            "labelNames": [],
            "labelValues": [],
            "value": 1051,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_peak",
            "labelNames": [],
            "labelValues": [],
            "value": 1067,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_started_total",
            "labelNames": [],
            "labelValues": [],
            "value": 1069,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_deadlocked",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_deadlocked_monitor",
            "labelNames": [],
            "labelValues": [],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_state",
            "labelNames": [
                "state"
            ],
            "labelValues": [
                "NEW"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_state",
            "labelNames": [
                "state"
            ],
            "labelValues": [
                "TIMED_WAITING"
            ],
            "value": 6,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_state",
            "labelNames": [
                "state"
            ],
            "labelValues": [
                "TERMINATED"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_state",
            "labelNames": [
                "state"
            ],
            "labelValues": [
                "RUNNABLE"
            ],
            "value": 36,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_state",
            "labelNames": [
                "state"
            ],
            "labelValues": [
                "WAITING"
            ],
            "value": 1018,
            "timestampMs": null
        },
        {
            "name": "jvm_threads_state",
            "labelNames": [
                "state"
            ],
            "labelValues": [
                "BLOCKED"
            ],
            "value": 7,
            "timestampMs": null
        },
        {
            "name": "jvm_classes_loaded",
            "labelNames": [],
            "labelValues": [],
            "value": 4689,
            "timestampMs": null
        },
        {
            "name": "jvm_classes_loaded_total",
            "labelNames": [],
            "labelValues": [],
            "value": 4692,
            "timestampMs": null
        },
        {
            "name": "jvm_classes_unloaded_total",
            "labelNames": [],
            "labelValues": [],
            "value": 3,
            "timestampMs": null
        },
        {
            "name": "jvm_buffer_pool_used_bytes",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "direct"
            ],
            "value": 32769,
            "timestampMs": null
        },
        {
            "name": "jvm_buffer_pool_used_bytes",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "mapped"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_buffer_pool_capacity_bytes",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "direct"
            ],
            "value": 32768,
            "timestampMs": null
        },
        {
            "name": "jvm_buffer_pool_capacity_bytes",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "mapped"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_buffer_pool_used_buffers",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "direct"
            ],
            "value": 5,
            "timestampMs": null
        },
        {
            "name": "jvm_buffer_pool_used_buffers",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "mapped"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_used",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "heap"
            ],
            "value": 16512134504,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_used",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "nonheap"
            ],
            "value": 51022000,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_committed",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "heap"
            ],
            "value": 53687091200,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_committed",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "nonheap"
            ],
            "value": 52494336,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_max",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "heap"
            ],
            "value": 53687091200,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_max",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "nonheap"
            ],
            "value": -1,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_init",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "heap"
            ],
            "value": 53687091200,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_bytes_init",
            "labelNames": [
                "area"
            ],
            "labelValues": [
                "nonheap"
            ],
            "value": 2555904,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_used",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Code Cache"
            ],
            "value": 19920576,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_used",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Metaspace"
            ],
            "value": 31101424,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_used",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Eden Space"
            ],
            "value": 805306368,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_used",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Survivor Space"
            ],
            "value": 33554432,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_used",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Old Gen"
            ],
            "value": 15673273704,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_committed",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Code Cache"
            ],
            "value": 20512768,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_committed",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Metaspace"
            ],
            "value": 31981568,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_committed",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Eden Space"
            ],
            "value": 5603590144,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_committed",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Survivor Space"
            ],
            "value": 33554432,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_committed",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Old Gen"
            ],
            "value": 48049946624,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_max",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Code Cache"
            ],
            "value": 251658240,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_max",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Metaspace"
            ],
            "value": -1,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_max",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Eden Space"
            ],
            "value": -1,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_max",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Survivor Space"
            ],
            "value": -1,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_max",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Old Gen"
            ],
            "value": 53687091200,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_init",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Code Cache"
            ],
            "value": 2555904,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_init",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Metaspace"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_init",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Eden Space"
            ],
            "value": 5637144576,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_init",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Survivor Space"
            ],
            "value": 0,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_bytes_init",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Old Gen"
            ],
            "value": 48049946624,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_allocated_bytes_total",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Old Gen"
            ],
            "value": 1055979279032,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_allocated_bytes_total",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Code Cache"
            ],
            "value": 23076416,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_allocated_bytes_total",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Eden Space"
            ],
            "value": 297090940928,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_allocated_bytes_total",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "G1 Survivor Space"
            ],
            "value": 9294577664,
            "timestampMs": null
        },
        {
            "name": "jvm_memory_pool_allocated_bytes_total",
            "labelNames": [
                "pool"
            ],
            "labelValues": [
                "Metaspace"
            ],
            "value": 27300720,
            "timestampMs": null
        },
        {
            "name": "process_cpu_seconds_total",
            "labelNames": [],
            "labelValues": [],
            "value": 2042.23,
            "timestampMs": null
        },
        {
            "name": "process_start_time_seconds",
            "labelNames": [],
            "labelValues": [],
            "value": 1640913776.419,
            "timestampMs": null
        },
        {
            "name": "process_open_fds",
            "labelNames": [],
            "labelValues": [],
            "value": 379,
            "timestampMs": null
        },
        {
            "name": "process_max_fds",
            "labelNames": [],
            "labelValues": [],
            "value": 999999,
            "timestampMs": null
        },
        {
            "name": "process_virtual_memory_bytes",
            "labelNames": [],
            "labelValues": [],
            "value": 66492104704,
            "timestampMs": null
        },
        {
            "name": "process_resident_memory_bytes",
            "labelNames": [],
            "labelValues": [],
            "value": 58613448704,
            "timestampMs": null
        }
    ],
    "timeStamp": 1640933954231
}

@jerqi
Collaborator

jerqi commented Jan 4, 2022

It seems the process was killed by the kernel. You should use a virtual machine that has more memory.
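
The numbers in the /metrics/jvm output above can be lined up to see how much of the process footprint lies outside the Java heap. This is a rough sketch, not a diagnosis: the values are copied from that output, the 64 GiB figure assumes "64g" means 64 GiB of RAM, and because resident memory only counts pages that have actually been touched, the subtraction is an estimate rather than an exact accounting.

```python
GIB = 1024**3

# Values copied from the /metrics/jvm output in this issue (bytes).
HEAP_USED = 16_512_134_504       # jvm_memory_bytes_used, area=heap
HEAP_COMMITTED = 53_687_091_200  # jvm_memory_bytes_committed, area=heap
NONHEAP_COMMITTED = 52_494_336   # jvm_memory_bytes_committed, area=nonheap
RSS = 58_613_448_704             # process_resident_memory_bytes

# Assumption: the "16c64g" machine has 64 GiB of physical memory.
MACHINE_RAM = 64 * GIB


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / GIB


# Resident memory not explained by the committed heap and non-heap pools
# (thread stacks, GC structures, native allocations, and so on).
outside_jvm_pools = RSS - HEAP_COMMITTED - NONHEAP_COMMITTED

print(f"heap used:         {to_gib(HEAP_USED):.2f} GiB")
print(f"heap committed:    {to_gib(HEAP_COMMITTED):.2f} GiB")
print(f"resident (RSS):    {to_gib(RSS):.2f} GiB")
print(f"outside JVM pools: {to_gib(outside_jvm_pools):.2f} GiB")
print(f"RAM headroom:      {to_gib(MACHINE_RAM - RSS):.2f} GiB")
```

Read this way, the live heap data is far smaller than the resident size; the process footprint follows the committed heap plus several GiB of native overhead, which leaves limited headroom on a machine of this size.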

@jerqi closed this as completed Jan 10, 2022