
Stuck on LightGBM Distributed Training in PySpark – Hanging After Socket Communication #2392

Open
@amanjethani-equbot

Description


🔧 My Setup:
I'm trying to run distributed LightGBM training using synapseml.lightgbm.LightGBMRegressor in PySpark.

💻 Cluster Details:

  • Spark version: 3.5.1 (compatible with PySpark 3.5.6)
  • PySpark version: 3.5.6
  • synapseml: v0.11.1 (latest)
  • Spark Cluster: 3 Hetzner nodes
  • Driver: 5.161.217.134 (3342)
  • Worker 1: 159.69.6.195 (3343)
  • Worker 2: 91.99.133.95 (3349)
  • Ports open: 30000–45000 and 3340–3380 TCP on all nodes (a very wide range, just to get things working)

✅ What Works:
The cluster is configured correctly: all Spark jobs run, and partitions are assigned and shuffled as expected.

LightGBM begins training; it opens its sockets, and the driver receives enabledTask:::: messages from all worker nodes.

No errors appear in the logs.

❌ The Problem:
Training hangs at the point where the driver closes all sockets after receiving the topology info. Specifically, the logs stop here:

NetworkManager: driver writing back network topology to 2 connections: ...
NetworkManager: driver writing back partition topology to 2 connections: ...
NetworkManager: driver closing all sockets and server socket
NetworkManager: driver done closing all sockets and server socket
πŸ” What I’ve Tried:

  • Repartitioned data to match number of workers.
  • Verified that all workers are reachable from driver on the open ports.
  • Set parallelism="data_parallel", also tried tree_learner="data" explicitly.
  • Experimented with broadcast and partition sizes, to no avail.
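One more check that may be worth adding to the list above: verifying reachability in both directions, since the workers must be able to reach each other (and the driver's LightGBM server socket), not only the driver reaching the workers. A minimal TCP probe, using the worker IPs from my setup (the ports below are placeholders, substitute whatever LightGBM actually binds):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder targets: worker IPs from this issue, ports are guesses.
targets = [("159.69.6.195", 3343), ("91.99.133.95", 3349)]
for host, port in targets:
    print(f"{host}:{port} -> {'open' if port_open(host, port) else 'CLOSED'}")
```

Running this from the driver and from each worker (against the other nodes) would rule out an asymmetric firewall rule as the cause of the hang.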

❓ My Questions:

  • Why does training hang even after all workers successfully establish socket communication?
  • Is this a known issue with certain versions of synapseml or LightGBM?
  • How can I restrict or fix the port range LightGBM uses? I want to avoid opening a massive 30000–45000 range; can it be pinned reliably? (I tried the LightGBM default-port parameter, but it didn't work.)
  • Any workaround or logs I should enable to debug deeper (e.g., LightGBM internal debug mode)?
  • Is it possible that a missing barrier or stage finalization in Spark is causing this silent hang?
My reproduction script:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMRegressor

spark = SparkSession.builder.appName("Distributed LightGBM").getOrCreate()

# Synthetic dataset: 100k rows, 20 numeric features, binary label
df = spark.range(0, 100000)
for i in range(20):
    df = df.withColumn(f"f{i}", (df["id"] * 0.1 + i) % 1)
df = df.withColumn("label", (df["id"] % 2).cast("double"))

features = [f"f{i}" for i in range(20)]
vec = VectorAssembler(inputCols=features, outputCol="features")
df = vec.transform(df).select("features", "label").repartition(2)  # one partition per worker

lgbm = LightGBMRegressor(
    objective="binary",
    featuresCol="features",
    labelCol="label",
    numIterations=100,
    learningRate=0.1,
    numLeaves=31,
    earlyStoppingRound=10,
    verbosity=1,
    parallelism="data_parallel",
)

model = lgbm.fit(df)

I'm launching it with this spark-submit command:


$SPARK_HOME/bin/spark-submit \
  --master spark://5.161.217.134:3342 \
  --conf spark.driver.host=5.161.217.134 \
  --conf spark.driver.port=3346 \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.executor.memory=29g \
  --conf spark.executor.cores=16 \
  --conf spark.driver.memory=8g \
  --conf spark.blockManager.port=3347 \
  --conf spark.fileserver.port=3348 \
  --conf spark.ui.port=3379 \
  --conf spark.broadcast.port=3350 \
  --conf spark.task.cpus=1 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.sql.execution.arrow.pyspark.enabled=true \
  --conf spark.memory.fraction=0.9 \
  --conf spark.memory.storageFraction=0.4 \
  --packages com.microsoft.azure:synapseml_2.12:0.11.1 \
  spark_lgb.py
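Regarding my own question about pinning the port range: from what I can tell in the SynapseML docs, the LightGBM estimators expose a `defaultListenPort` parameter (default 12400), and each executor task binds one port counting up from that base, so the range to open in the firewall should be roughly `defaultListenPort` through `defaultListenPort + numTasks - 1` rather than a 15000-port window. This is my reading of the docs, not confirmed behaviour; a sketch of the arithmetic:

```python
def lightgbm_port_range(default_listen_port: int, num_executor_tasks: int) -> tuple:
    """Assumed model: task i binds default_listen_port + i.
    Returns the inclusive (low, high) firewall range needed."""
    return (default_listen_port, default_listen_port + num_executor_tasks - 1)

# Hypothetical base port 3351 with 2 executors x 16 cores = 32 tasks.
print(lightgbm_port_range(3351, 32))  # -> (3351, 3382)
```

If that model is right, setting the base explicitly, e.g. `LightGBMRegressor(defaultListenPort=3351, ...)`, would let me open only 3351–3382. SynapseML also has a `useBarrierExecutionMode` parameter that sounds related to my question about a missing barrier, though whether either of these fixes the hang is exactly what I'm asking.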

πŸ™ Any help or guidance is appreciated!
Let me know if logs or config files would help.
