
Stuck on LightGBM Distributed Training in PySpark – Hanging After Socket Communication #2392

Open
@amanjethani-equbot

Description


🔧 My Setup:
I'm trying to run distributed LightGBM training using synapseml.lightgbm.LightGBMRegressor in PySpark.

💻 Cluster Details:

  • Spark version: 3.5.1 (compatible with PySpark 3.5.6)
  • PySpark version: 3.5.6
  • synapseml: v0.11.1 (latest)
  • Spark Cluster: 3 Hetzner nodes
  • Driver: 5.161.217.134 (3342)
  • Worker 1: 159.69.6.195 (3343)
  • Worker 2: 91.99.133.95 (3349)
  • Ports open: 30000–45000 and 3340–3380 TCP on all nodes (a very wide range, just to get things working)

✅ What Works:
The cluster is configured correctly: all Spark jobs run, and partitions are assigned and shuffled as expected.

LightGBM begins training; it opens its sockets, and the driver receives enabledTask:::: messages from all worker nodes.

No errors appear in the logs.

❌ The Problem:
Training hangs at the point where the driver closes all sockets after receiving the topology info. Specifically, the logs stop here:

NetworkManager: driver writing back network topology to 2 connections: ...
NetworkManager: driver writing back partition topology to 2 connections: ...
NetworkManager: driver closing all sockets and server socket
NetworkManager: driver done closing all sockets and server socket
πŸ” What I’ve Tried:

  • Repartitioned data to match number of workers.
  • Verified that all workers are reachable from driver on the open ports.
  • Set parallelism="data_parallel", also tried tree_learner="data" explicitly.
  • Experimented with broadcast and partition sizes, to no avail.
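One more check that may be worth adding to the list above: verifying reachability in both directions, since the workers must be able to reach each other (and the driver's LightGBM server socket), not only the driver reaching the workers. A minimal TCP probe, using the worker IPs from my setup (the ports below are placeholders, substitute whatever LightGBM actually binds):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder targets: worker IPs from this issue, ports are guesses.
targets = [("159.69.6.195", 3343), ("91.99.133.95", 3349)]
for host, port in targets:
    print(f"{host}:{port} -> {'open' if port_open(host, port) else 'CLOSED'}")
```

Running this from the driver and from each worker (against the other nodes) would rule out an asymmetric firewall rule as the cause of the hang.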

❓ My Questions:

  • Why does training hang even after all workers successfully establish socket communication?
  • Is this a known issue with certain versions of synapseml or LightGBM?
  • How can I restrict or fix the port range LightGBM uses? I want to avoid opening a massive 30000–45000 range; can it be pinned reliably? (I tried the LightGBM default-port parameter, but it didn't work.)
  • Any workaround or logs I should enable to debug deeper (e.g., LightGBM internal debug mode)?
  • Is it possible that a missing barrier or stage finalization in Spark is causing this silent hang?
My reproduction script:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMRegressor

spark = SparkSession.builder.appName("Distributed LightGBM").getOrCreate()

# Synthetic dataset: 100k rows, 20 numeric features, binary label
df = spark.range(0, 100000)
for i in range(20):
    df = df.withColumn(f"f{i}", (df["id"] * 0.1 + i) % 1)
df = df.withColumn("label", (df["id"] % 2).cast("double"))

features = [f"f{i}" for i in range(20)]
vec = VectorAssembler(inputCols=features, outputCol="features")
df = vec.transform(df).select("features", "label").repartition(2)  # one partition per worker

lgbm = LightGBMRegressor(
    objective="binary",
    featuresCol="features",
    labelCol="label",
    numIterations=100,
    learningRate=0.1,
    numLeaves=31,
    earlyStoppingRound=10,
    verbosity=1,
    parallelism="data_parallel",
)

model = lgbm.fit(df)

I'm launching it with this spark-submit command:


$SPARK_HOME/bin/spark-submit \
  --master spark://5.161.217.134:3342 \
  --conf spark.driver.host=5.161.217.134 \
  --conf spark.driver.port=3346 \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.executor.memory=29g \
  --conf spark.executor.cores=16 \
  --conf spark.driver.memory=8g \
  --conf spark.blockManager.port=3347 \
  --conf spark.fileserver.port=3348 \
  --conf spark.ui.port=3379 \
  --conf spark.broadcast.port=3350 \
  --conf spark.task.cpus=1 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.sql.execution.arrow.pyspark.enabled=true \
  --conf spark.memory.fraction=0.9 \
  --conf spark.memory.storageFraction=0.4 \
  --packages com.microsoft.azure:synapseml_2.12:0.11.1 \
  spark_lgb.py
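Regarding my own question about pinning the port range: from what I can tell in the SynapseML docs, the LightGBM estimators expose a `defaultListenPort` parameter (default 12400), and each executor task binds one port counting up from that base, so the range to open in the firewall should be roughly `defaultListenPort` through `defaultListenPort + numTasks - 1` rather than a 15000-port window. This is my reading of the docs, not confirmed behaviour; a sketch of the arithmetic:

```python
def lightgbm_port_range(default_listen_port: int, num_executor_tasks: int) -> tuple:
    """Assumed model: task i binds default_listen_port + i.
    Returns the inclusive (low, high) firewall range needed."""
    return (default_listen_port, default_listen_port + num_executor_tasks - 1)

# Hypothetical base port 3351 with 2 executors x 16 cores = 32 tasks.
print(lightgbm_port_range(3351, 32))  # -> (3351, 3382)
```

If that model is right, setting the base explicitly, e.g. `LightGBMRegressor(defaultListenPort=3351, ...)`, would let me open only 3351–3382. SynapseML also has a `useBarrierExecutionMode` parameter that sounds related to my question about a missing barrier, though whether either of these fixes the hang is exactly what I'm asking.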

πŸ™ Any help or guidance is appreciated!
Let me know if logs or config files would help.
