Description
SynapseML version
1.0.11
System information
- Language version: Python 3.11, Scala 2.12
- Spark version: 3.4.2
- Spark platform: MaxCompute
Describe the problem
2025-07-02 17:42:50.216 | INFO | main::23 - CDC py ver:3.11.11 (main, Dec 16 2024, 17:23:09) [GCC 10.2.1 20200825 (Alibaba 10.2.1-3.8 2.32)] /workdir
2025-07-02 17:43:22.637 | INFO | main::57 - Spark configuration:
2025-07-02 17:43:23.406 | INFO | main::58 - Number of executors: 40
2025-07-02 17:43:23.406 | INFO | main::59 - Memory per executor: 64g
2025-07-02 17:43:23.406 | INFO | main::60 - Cores per executor: 16
Current Working Directory (os.getcwd()): /workdir
2025-07-02 17:43:23.407 | INFO | main:main:349 - Execution start time: 2025-07-02 17:43:23.407060
2025-07-02 17:43:29.618 | INFO | main:main:390 - --- Starting training pipeline ---
2025-07-02 17:43:29.618 | INFO | main:main:393 - Loading and transforming training data...
2025-07-02 17:43:29.843 | INFO | main:process_multi_value_feature:311 - Processing multi-value feature 'goods_name_preference_total', keeping TOP 100...
2025-07-02 17:43:29.953 | INFO | main:process_multi_value_feature:324 - Fitting CountVectorizer for 'goods_name_preference_total'...
2025-07-02 17:45:09.976 | INFO | main:process_multi_value_feature:330 - Feature 'goods_name_preference_total' processed; generated vector column 'goods_name_preference_total_vec'.
2025-07-02 17:45:10.021 | INFO | main:process_multi_value_feature:311 - Processing multi-value feature 'goods_name_preference', keeping TOP 100...
2025-07-02 17:45:10.060 | INFO | main:process_multi_value_feature:324 - Fitting CountVectorizer for 'goods_name_preference'...
2025-07-02 17:46:00.475 | INFO | main:process_multi_value_feature:330 - Feature 'goods_name_preference' processed; generated vector column 'goods_name_preference_vec'.
2025-07-02 17:47:42.085 | INFO | main:main:412 - Training data loading and feature engineering complete. Row count: 19074476
2025-07-02 17:47:42.129 | INFO | main:main:417 - Identified 303 numeric features and 23 categorical features.
2025-07-02 17:47:42.129 | INFO | main:preprocess_data:281 - Starting efficient data preprocessing...
2025-07-02 17:47:42.129 | INFO | main:preprocess_data:284 - Filling null values in categorical features...
2025-07-02 17:47:42.186 | INFO | main:preprocess_data:288 - Computing the means of all numeric features in a single pass...
2025-07-02 17:48:29.125 | INFO | main:preprocess_data:303 - The computed means will be used to fill 303 numeric columns.
2025-07-02 17:48:29.228 | INFO | main:preprocess_data:306 - Data preprocessing complete.
2025-07-02 17:48:29.228 | INFO | main:fit:123 - Starting target encoding of 23 categorical features...
2025-07-02 17:48:36.529 | INFO | main:fit:126 - Global label mean: 0.13839336923331472
2025-07-02 17:48:36.529 | INFO | main:fit:132 - Processing feature batch 1/1, feature count: 23
2025-07-02 17:48:36.529 | INFO | main:fit:136 - Processing feature: reg_channel
2025-07-02 17:48:41.802 | INFO | main:fit:163 - Feature reg_channel processed, elapsed: 5.27 s, encoded values: 16
2025-07-02 17:48:41.860 | INFO | main:fit:136 - Processing feature: member_card_level
2025-07-02 17:48:43.563 | INFO | main:fit:163 - Feature member_card_level processed, elapsed: 1.70 s, encoded values: 5
2025-07-02 17:48:43.604 | INFO | main:fit:136 - Processing feature: business_district_typ
2025-07-02 17:48:46.556 | INFO | main:fit:163 - Feature business_district_typ processed, elapsed: 2.95 s, encoded values: 61
2025-07-02 17:48:46.593 | INFO | main:fit:136 - Processing feature: belong_province_name
2025-07-02 17:48:50.744 | INFO | main:fit:163 - Feature belong_province_name processed, elapsed: 4.15 s, encoded values: 37
2025-07-02 17:48:50.783 | INFO | main:fit:136 - Processing feature: belong_company_name
2025-07-02 17:48:55.478 | INFO | main:fit:163 - Feature belong_company_name processed, elapsed: 4.69 s, encoded values: 28
2025-07-02 17:48:55.514 | INFO | main:fit:136 - Processing feature: order_type_preference
2025-07-02 17:48:57.140 | INFO | main:fit:163 - Feature order_type_preference processed, elapsed: 1.63 s, encoded values: 2
2025-07-02 17:48:57.177 | INFO | main:fit:136 - Processing feature: city_name_preference
2025-07-02 17:49:02.530 | INFO | main:fit:163 - Feature city_name_preference processed, elapsed: 5.35 s, encoded values: 354
2025-07-02 17:49:02.566 | INFO | main:fit:136 - Processing feature: province_preference
2025-07-02 17:49:04.604 | INFO | main:fit:163 - Feature province_preference processed, elapsed: 2.04 s, encoded values: 32
2025-07-02 17:49:04.646 | INFO | main:fit:136 - Processing feature: company_preference
2025-07-02 17:49:06.745 | INFO | main:fit:163 - Feature company_preference processed, elapsed: 2.10 s, encoded values: 27
2025-07-02 17:49:06.781 | INFO | main:fit:136 - Processing feature: member_life_cycle
2025-07-02 17:49:08.614 | INFO | main:fit:163 - Feature member_life_cycle processed, elapsed: 1.83 s, encoded values: 6
2025-07-02 17:49:08.651 | INFO | main:fit:136 - Processing feature: portrait_name
2025-07-02 17:49:11.487 | INFO | main:fit:163 - Feature portrait_name processed, elapsed: 2.84 s, encoded values: 22
2025-07-02 17:49:11.523 | INFO | main:fit:136 - Processing feature: gender
2025-07-02 17:49:13.976 | INFO | main:fit:163 - Feature gender processed, elapsed: 2.45 s, encoded values: 3
2025-07-02 17:49:14.013 | INFO | main:fit:136 - Processing feature: age
2025-07-02 17:49:17.451 | INFO | main:fit:163 - Feature age processed, elapsed: 3.44 s, encoded values: 80
2025-07-02 17:49:17.488 | INFO | main:fit:136 - Processing feature: belong_city_name
2025-07-02 17:49:24.142 | INFO | main:fit:163 - Feature belong_city_name processed, elapsed: 6.65 s, encoded values: 365
2025-07-02 17:49:24.179 | INFO | main:fit:136 - Processing feature: belong_city_level
2025-07-02 17:49:26.558 | INFO | main:fit:163 - Feature belong_city_level processed, elapsed: 2.38 s, encoded values: 6
2025-07-02 17:49:26.594 | INFO | main:fit:136 - Processing feature: consume_belong_city_name
2025-07-02 17:49:32.207 | INFO | main:fit:163 - Feature consume_belong_city_name processed, elapsed: 5.61 s, encoded values: 353
2025-07-02 17:49:32.243 | INFO | main:fit:136 - Processing feature: consume_belong_city_level
2025-07-02 17:49:34.259 | INFO | main:fit:163 - Feature consume_belong_city_level processed, elapsed: 2.02 s, encoded values: 6
2025-07-02 17:49:34.294 | INFO | main:fit:136 - Processing feature: prefer_coupon_type
2025-07-02 17:49:42.789 | INFO | main:fit:163 - Feature prefer_coupon_type processed, elapsed: 8.50 s, encoded values: 175
2025-07-02 17:49:42.826 | INFO | main:fit:136 - Processing feature: goods_series_preference
2025-07-02 17:49:49.254 | INFO | main:fit:163 - Feature goods_series_preference processed, elapsed: 6.43 s, encoded values: 75
2025-07-02 17:49:49.290 | INFO | main:fit:136 - Processing feature: goods_series_preference_total
2025-07-02 17:49:55.378 | INFO | main:fit:163 - Feature goods_series_preference_total processed, elapsed: 6.09 s, encoded values: 68
2025-07-02 17:49:55.414 | INFO | main:fit:136 - Processing feature: date_type_preference
2025-07-02 17:49:58.796 | INFO | main:fit:163 - Feature date_type_preference processed, elapsed: 3.38 s, encoded values: 4
2025-07-02 17:49:58.832 | INFO | main:fit:136 - Processing feature: goods_sugar_preference
2025-07-02 17:50:01.975 | INFO | main:fit:163 - Feature goods_sugar_preference processed, elapsed: 3.14 s, encoded values: 62
2025-07-02 17:50:02.017 | INFO | main:fit:136 - Processing feature: business_district_type_preference
2025-07-02 17:50:04.395 | INFO | main:fit:163 - Feature business_district_type_preference processed, elapsed: 2.38 s, encoded values: 61
2025-07-02 17:50:04.435 | INFO | main:fit:169 - Target encoding complete for all features!
2025-07-02 17:50:04.436 | INFO | main:transform:172 - Applying target encoding to 23 categorical features (using the efficient JOIN approach)...
2025-07-02 17:50:04.560 | INFO | main:transform:209 - Feature reg_channel encoded, elapsed: 0.12 s
2025-07-02 17:50:04.653 | INFO | main:transform:209 - Feature member_card_level encoded, elapsed: 0.09 s
2025-07-02 17:50:04.745 | INFO | main:transform:209 - Feature business_district_typ encoded, elapsed: 0.09 s
2025-07-02 17:50:04.840 | INFO | main:transform:209 - Feature belong_province_name encoded, elapsed: 0.10 s
2025-07-02 17:50:04.943 | INFO | main:transform:209 - Feature belong_company_name encoded, elapsed: 0.10 s
2025-07-02 17:50:05.048 | INFO | main:transform:209 - Feature order_type_preference encoded, elapsed: 0.11 s
2025-07-02 17:50:05.157 | INFO | main:transform:209 - Feature city_name_preference encoded, elapsed: 0.11 s
2025-07-02 17:50:05.270 | INFO | main:transform:209 - Feature province_preference encoded, elapsed: 0.11 s
2025-07-02 17:50:05.391 | INFO | main:transform:209 - Feature company_preference encoded, elapsed: 0.12 s
2025-07-02 17:50:05.512 | INFO | main:transform:209 - Feature member_life_cycle encoded, elapsed: 0.12 s
2025-07-02 17:50:05.637 | INFO | main:transform:209 - Feature portrait_name encoded, elapsed: 0.12 s
2025-07-02 17:50:05.768 | INFO | main:transform:209 - Feature gender encoded, elapsed: 0.13 s
2025-07-02 17:50:05.902 | INFO | main:transform:209 - Feature age encoded, elapsed: 0.13 s
2025-07-02 17:50:06.102 | INFO | main:transform:209 - Feature belong_city_name encoded, elapsed: 0.20 s
2025-07-02 17:50:06.246 | INFO | main:transform:209 - Feature belong_city_level encoded, elapsed: 0.14 s
2025-07-02 17:50:06.407 | INFO | main:transform:209 - Feature consume_belong_city_name encoded, elapsed: 0.16 s
2025-07-02 17:50:06.560 | INFO | main:transform:209 - Feature consume_belong_city_level encoded, elapsed: 0.15 s
2025-07-02 17:50:06.716 | INFO | main:transform:209 - Feature prefer_coupon_type encoded, elapsed: 0.16 s
2025-07-02 17:50:06.879 | INFO | main:transform:209 - Feature goods_series_preference encoded, elapsed: 0.16 s
2025-07-02 17:50:07.048 | INFO | main:transform:209 - Feature goods_series_preference_total encoded, elapsed: 0.17 s
2025-07-02 17:50:07.222 | INFO | main:transform:209 - Feature date_type_preference encoded, elapsed: 0.17 s
2025-07-02 17:50:07.400 | INFO | main:transform:209 - Feature goods_sugar_preference encoded, elapsed: 0.18 s
2025-07-02 17:50:07.580 | INFO | main:transform:209 - Feature business_district_type_preference encoded, elapsed: 0.18 s
2025-07-02 17:50:08.417 | INFO | main:main:428 - Preparing data for distributed model training...
2025-07-02 17:50:14.420 | INFO | main:main:454 - Validating the cached final training data...
2025-07-02 17:56:17.281 | INFO | main:main:456 - Row counts split by the validation indicator column:
2025-07-02 17:56:17.281 | INFO | main:main:458 - is_validation=True: 3813174 rows
2025-07-02 17:56:17.281 | INFO | main:main:458 - is_validation=False: 15261302 rows
2025-07-02 17:56:18.165 | INFO | main:main:462 - Starting distributed LightGBM model training...
[LightGBM] [Warning] Using too small bin_construct_sample_cnt may encounter unexpected errors and poor accuracy.
[LightGBM] [Info] Saving data reference to binary buffer
2025-07-02 18:31:21.482 | ERROR | main:main:565 - Fatal error during processing: An error occurred while calling o2252.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(315, 250) finished unsuccessfully.
ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason:
The executor with id 5 exited with exit code 134(unexpected).
The API gave the following container statuses:
container name: spark-kubernetes-executor
container image: reg.docker.alibaba-inc.com/odps-kube/spark:v3.4.2-odps0.48.1-alinux3
container state: terminated
container started at: 2025-07-02T09:42:55Z
container finished at: 2025-07-02T10:31:04Z
exit code: 134
termination reason: Error
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2786)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2722)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2721)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2721)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:2158)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2979)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2924)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2913)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:971)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2258)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2298)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2323)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1019)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
at org.apache.spark.rdd.RDD.collect(RDD.scala:1018)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executePartitionTasks(LightGBMBase.scala:621)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.executeTraining(LightGBMBase.scala:598)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.trainOneDataBatch(LightGBMBase.scala:446)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.$anonfun$train$2(LightGBMBase.scala:62)
at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb(SynapseMLLogging.scala:163)
at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logVerb$(SynapseMLLogging.scala:160)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logVerb(LightGBMClassifier.scala:27)
at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logFit(SynapseMLLogging.scala:153)
at com.microsoft.azure.synapse.ml.logging.SynapseMLLogging.logFit$(SynapseMLLogging.scala:152)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.logFit(LightGBMClassifier.scala:27)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train(LightGBMBase.scala:64)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMBase.train$(LightGBMBase.scala:36)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
at com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier.train(LightGBMClassifier.scala:27)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
2025-07-02 18:31:21.482 | INFO | main:main:569 - Processing end time: 2025-07-02 18:31:21.482348
2025-07-02 18:31:21.482 | INFO | main:main:570 - Total elapsed time: 47.97 minutes
2025-07-02 18:31:21.482 | INFO | main:main:571 - Cleaning up resources...
2025-07-02 18:31:25.859 | INFO | main:main:574 - Spark session stopped; processing complete.
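A note on the exit code in the container status above: by the common `128 + N` convention, exit code 134 corresponds to signal 6 (SIGABRT), which would suggest the executor process aborted rather than exiting cleanly. The log alone does not show what triggered the abort. Below is a minimal sketch of that decoding, assuming Linux signal numbers; `describe_exit_code` and its signal table are hypothetical helpers for illustration, not part of Spark or SynapseML.

```python
# Assumption: Linux signal numbers and the shell convention "exit code = 128 + signal".
# Only a few common signals are listed; this table is illustrative, not exhaustive.
COMMON_SIGNALS = {
    6: "SIGABRT",
    9: "SIGKILL",
    11: "SIGSEGV",
    15: "SIGTERM",
}


def describe_exit_code(code: int) -> str:
    """Interpret a container exit code (hypothetical helper)."""
    if code > 128:
        signum = code - 128
        name = COMMON_SIGNALS.get(signum, f"signal {signum}")
        return f"terminated by {name}"
    return f"exited with status {code}"


# Exit code reported for executor 5 in the log above.
print(describe_exit_code(134))
```

If this reading is right, the executor's own stderr or any JVM crash file for executor 5 would be the next place to look, since the driver-side stack trace only reports that the barrier stage could not recover.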
Code to reproduce issue
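import time

from synapse.ml.lightgbm import LightGBMClassifier

# Config.LGBM_PARAMS holds the remaining LightGBM settings; its contents are not shown in this report.
lgbm = LightGBMClassifier(
    # numIterations=Config.LGBM_NUM_ROUNDS,
    labelCol="label",
    featuresCol="features",
    validationIndicatorCol="is_validation",
    useBarrierExecutionMode=True,
    timeout=1800.0,
    defaultListenPort=25500,
    **Config.LGBM_PARAMS,
)

start_time_train = time.time()
model = lgbm.fit(training_data_final)  # fails here with the barrier stage error shown in the log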
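For a rough sense of scale, the figures in the log can be turned into a back-of-envelope size estimate. This is only a sketch under loud assumptions: it treats the `features` column as a dense float64 vector with 303 numeric, 23 target-encoded, and 2 × 100 CountVectorizer dimensions. The actual vector layout, the size of LightGBM's binned dataset, and any JVM or native copies are not shown in this report.

```python
# Figures taken from the log output in this report.
ROWS = 19_074_476
NUMERIC_FEATURES = 303
CATEGORICAL_FEATURES = 23
EXECUTORS = 40
EXECUTOR_MEMORY_GIB = 64  # "64g" per executor, treated here as GiB

# Assumption: each of the two CountVectorizer columns contributes 100 dimensions.
VECTOR_DIMS = 2 * 100


def dense_size_gib(rows: int, cols: int, bytes_per_value: int = 8) -> float:
    """Size of a dense rows x cols matrix of fixed-width values, in GiB."""
    return rows * cols * bytes_per_value / 2**30


total_cols = NUMERIC_FEATURES + CATEGORICAL_FEATURES + VECTOR_DIMS
total_gib = dense_size_gib(ROWS, total_cols)
per_executor_gib = total_gib / EXECUTORS

print(f"columns: {total_cols}")
print(f"dense size: {total_gib:.1f} GiB total, {per_executor_gib:.2f} GiB per executor")
print(f"configured executor memory: {EXECUTOR_MEMORY_GIB} GiB")
```

Under these assumptions the raw feature data is a small fraction of each executor's configured memory, so the dataset size by itself would not obviously explain the crash. That is an estimate, not a diagnosis.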
Other info / logs
No response
What component(s) does this bug affect?
- area/cognitive: Cognitive project
- area/core: Core project
- area/deep-learning: DeepLearning project
- area/lightgbm: Lightgbm project
- area/opencv: Opencv project
- area/vw: VW project
- area/website: Website
- area/build: Project build system
- area/notebooks: Samples under notebooks folder
- area/docker: Docker usage
- area/models: models related issue
What language(s) does this bug affect?
- language/scala: Scala source code
- language/python: Pyspark APIs
- language/r: R APIs
- language/csharp: .NET APIs
- language/new: Proposals for new client languages
What integration(s) does this bug affect?
- integrations/synapse: Azure Synapse integrations
- integrations/azureml: Azure ML integrations
- integrations/databricks: Databricks integrations