Description
🔧 My Setup:
I'm trying to run distributed LightGBM training with synapse.ml.lightgbm.LightGBMRegressor in PySpark.
💻 Cluster Details:
- Spark version: 3.5.1 (compatible with PySpark 3.5.6)
- PySpark version: 3.5.6
- synapseml: v0.11.1 (latest)
- Spark Cluster: 3 Hetzner nodes
- Driver: 5.161.217.134 (3342)
- Worker 1: 159.69.6.195 (3343)
- Worker 2: 91.99.133.95 (3349)
- Ports Open: 30000–45000 and 3340–3380 TCP on all nodes (a very wide range, just to get things working)
✅ What Works:
The cluster is configured correctly: all Spark jobs run, and partitions are assigned and shuffled as expected.
LightGBM begins training: it opens sockets, and the driver receives enabledTask:::: messages from all worker nodes.
No errors appear in the logs.
❌ The Problem:
The training gets stuck at the point where the driver closes all sockets after receiving topology info. Specifically, logs stop here:
NetworkManager: driver writing back network topology to 2 connections: ...
NetworkManager: driver writing back partition topology to 2 connections: ...
NetworkManager: driver closing all sockets and server socket
NetworkManager: driver done closing all sockets and server socket
🔍 What I've Tried:
- Repartitioned data to match number of workers.
- Verified that all workers are reachable from the driver on the open ports (see the connect test sketched after this list).
- Set parallelism="data_parallel", also tried tree_learner="data" explicitly.
- Experimented with broadcast & partition sizes to no avail.
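The reachability check in the second item above was a plain TCP connect test, roughly like the following (worker IPs are from the cluster details above; the port list is just a sample from the open range):

import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    # True if a TCP connection to host:port succeeds within the timeout.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host in ["159.69.6.195", "91.99.133.95"]:
    for port in [3343, 3349, 30000]:
        print(f"{host}:{port} reachable={can_connect(host, port)}")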
❓ My Questions:
- Why does training hang even after all workers successfully establish socket communication?
- Is this a known issue with certain versions of synapseml or LightGBM?
- How can I restrict or pin the port range LightGBM uses? I want to avoid opening the massive 30000–45000 range; can this be pinned reliably? (I tried LightGBM's defaultListenPort parameter, but it didn't help; see the sketch after this list.)
- Are there any workarounds, or logs I should enable to debug deeper (e.g., LightGBM's internal debug mode)? (See the debugging sketch at the end of this post.)
- Is it possible that a missing barrier or stage finalization in Spark is causing this silent hang?
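For question 3, this is roughly the port-pinning attempt, shown as a sketch only. My assumption (not confirmed from the docs) is that defaultListenPort is the base port and each LightGBM task binds a small offset above it, so only defaultListenPort through defaultListenPort + number of tasks would need to be open:

from synapse.ml.lightgbm import LightGBMRegressor

# Sketch only: defaultListenPort is a SynapseML LightGBM parameter; the
# per-task offset behavior described above is my assumption.
lgbm_pinned = LightGBMRegressor(
    featuresCol="features",
    labelCol="label",
    defaultListenPort=30000,
)

For completeness, the full repro script (spark_lgb.py):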
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMRegressor

spark = SparkSession.builder.appName("Distributed LightGBM").getOrCreate()

# Synthetic data: 100k rows, 20 numeric features, binary 0/1 label
df = spark.range(0, 100000)
for i in range(20):
    df = df.withColumn(f"f{i}", (df["id"] * 0.1 + i) % 1)
df = df.withColumn("label", (df["id"] % 2).cast("double"))

features = [f"f{i}" for i in range(20)]
vec = VectorAssembler(inputCols=features, outputCol="features")
# One partition per worker node
df = vec.transform(df).select("features", "label").repartition(2)

lgbm = LightGBMRegressor(
    objective="binary",  # binary objective on the synthetic 0/1 labels
    featuresCol="features",
    labelCol="label",
    numIterations=100,
    learningRate=0.1,
    numLeaves=31,
    earlyStoppingRound=10,
    verbosity=1,
    parallelism="data_parallel",
)
model = lgbm.fit(df)
I run the script with the following command:
$SPARK_HOME/bin/spark-submit \
  --master spark://5.161.217.134:3342 \
  --conf spark.driver.host=5.161.217.134 \
  --conf spark.driver.port=3346 \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.executor.memory=29g \
  --conf spark.executor.cores=16 \
  --conf spark.driver.memory=8g \
  --conf spark.blockManager.port=3347 \
  --conf spark.fileserver.port=3348 \
  --conf spark.ui.port=3379 \
  --conf spark.broadcast.port=3350 \
  --conf spark.task.cpus=1 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.sql.execution.arrow.pyspark.enabled=true \
  --conf spark.memory.fraction=0.9 \
  --conf spark.memory.storageFraction=0.4 \
  --packages com.microsoft.azure:synapseml_2.12:0.11.1 \
  spark_lgb.py
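For question 4, this is the next debugging step I plan to try, assuming the generated PySpark setters behave as I expect (a verbosity above 1 should map to more detailed native LightGBM logging, and barrier execution mode makes all tasks start together, so a hang should surface as an explicit stage error instead of silence):

# Hedged sketch: setVerbosity and setUseBarrierExecutionMode are generated
# SynapseML setters; the effect of verbosity=2 on native logging is my assumption.
lgbm_debug = lgbm.copy().setVerbosity(2).setUseBarrierExecutionMode(True)
model = lgbm_debug.fit(df)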
🙏 Any help or guidance is appreciated!
Let me know if logs or config files would help.