Spark MLlib GBT 100M dataset #18

Open
szilard opened this issue May 8, 2019 · 6 comments
szilard commented May 8, 2019

du -sm *.csv
467     train-10m.csv
47      train-1m.csv
5       train-0.1m.csv

du -sm *.parquet
2385    spark_ohe-train-100m.parquet
239     spark_ohe-train-10m.parquet
25      spark_ohe-train-1m.parquet
3       spark_ohe-train-0.1m.parquet
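(The prep code that produced these one-hot-encoded parquet files isn't shown in this issue; below is only a rough Spark 2.4 sketch of the kind of pipeline involved, with placeholder column names cat1/cat2/num1/target, not the actual columns of this dataset:)

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import org.apache.spark.sql.functions.col

// placeholder columns: cat1/cat2 categorical, num1 numeric, target = 0/1 outcome
val raw = spark.read.option("header", "true").option("inferSchema", "true").csv("train-10m.csv")
val catCols = Array("cat1", "cat2")
val numCols = Array("num1")

// index the string categories, one-hot encode them, assemble everything into a "features" vector
val indexers = catCols.map(c => new StringIndexer().setInputCol(c).setOutputCol(c + "_idx"))
val encoder = new OneHotEncoderEstimator().
  setInputCols(catCols.map(_ + "_idx")).setOutputCols(catCols.map(_ + "_ohe"))
val assembler = new VectorAssembler().
  setInputCols(catCols.map(_ + "_ohe") ++ numCols).setOutputCol("features")

val stages: Array[PipelineStage] = indexers ++ Array(encoder, assembler)
val prep = new Pipeline().setStages(stages).fit(raw)
prep.transform(raw).
  withColumn("label", col("target").cast("double")).   // "target" is a placeholder label column
  select("label", "features").
  write.parquet("spark_ohe-train-10m.parquet")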


free -m
              total        used        free      shared  buff/cache   available
Mem:         245854         568      244920           8         365      244043


lscpu
CPU(s):                32


${SPARK_ROOT}/bin/spark-shell --master local[*] --driver-memory 220G --executor-memory 220G


import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val d_train = spark.read.parquet("spark_ohe-train.parquet").cache()
val d_test = spark.read.parquet("spark_ohe-test.parquet").cache()
(d_train.count(), d_test.count())


free -m
              total        used        free      shared  buff/cache   available
Mem:         245854       64579      178405           8        2868      180025

[screenshot: 2019-05-08, 5:50 AM]


val rf = new GBTClassifier().setLabelCol("label").setFeaturesCol("features").
  setMaxIter(100).setMaxDepth(10).setStepSize(0.1).
  setMaxBins(100).setMaxMemoryInMB(10240)     // max possible setMaxMemoryInMB (otherwise errors out)
val pipeline = new Pipeline().setStages(Array(rf))

val now = System.nanoTime
val model = pipeline.fit(d_train)
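(The training times and AUC values reported below are presumably computed roughly like this once fit() returns; the evaluator is the one imported above, while the elapsed/preds/auc names are placeholders, not copied from the issue:)

// elapsed training time in seconds, from the `now` timestamp captured before fit()
val elapsed = (System.nanoTime - now) / 1e9

// score the held-out set and compute AUC; GBTClassifier in Spark 2.4 emits a rawPrediction
// column, which BinaryClassificationEvaluator uses by default (metric: areaUnderROC)
val preds = model.transform(d_test)
val auc = new BinaryClassificationEvaluator().setLabelCol("label").evaluate(preds)
println(s"time: $elapsed s, AUC: $auc")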
szilard commented May 8, 2019

[screenshots: 2019-05-08, 5:55 AM]

starts spilling to disk

[screenshots: 2019-05-08, 6:00-6:08 AM]

no more disk writes

[screenshots: 2019-05-08, 6:25-6:27 AM]

job fails

[screenshots: 2019-05-08, 6:36 AM]

szilard commented May 8, 2019

moar RAM:

x1e.8xlarge (32 cores, 1 NUMA, 960GB RAM)

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E7-8880 v3 @ 2.30GHz
Stepping:              4
CPU MHz:               2699.984
CPU max MHz:           3100.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.10
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt ida
~/spark-2.4.2-bin-hadoop2.7/bin/spark-shell --master local[*] --driver-memory 940G --executor-memory 940G

[screenshot: 2019-05-08, 1:41 PM]

scala> val model = pipeline.fit(d_train)
[Stage 443:>                                                      (0 + 32) / 32]
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007eb838e80000, 51384942592, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 51384942592 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/ubuntu/GBM-perf/wip-testing/spark/hs_err_pid2301.log

[screenshot: 2019-05-08, 11:51 PM]

szilard commented May 9, 2019

Let's try to learn only 1 tree of depth 1:
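(Presumably the same pipeline as above with the iteration count and depth dialed down; the exact invocation isn't shown in the issue, so this is just a sketch:)

// same setup as before, but a single tree of depth 1
val gbt1 = new GBTClassifier().setLabelCol("label").setFeaturesCol("features").
  setMaxIter(1).setMaxDepth(1).setStepSize(0.1).
  setMaxBins(100).setMaxMemoryInMB(10240)
val model1 = new Pipeline().setStages(Array(gbt1)).fit(d_train)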

runs in 1150 sec, AUC = 0.634, RAM usage 620 GB

[screenshots: 2019-05-09, 12:18-12:19 PM]

szilard commented May 9, 2019

1 tree depth 10:

runs in 1350 sec, AUC = 0.712, RAM usage 620 GB

[screenshots: 2019-05-09, 12:56 PM]

szilard commented May 10, 2019

10 trees depth 10:

runs in 7850 sec, AUC = 0.731, RAM usage 780 GB

szilard commented May 10, 2019

                     100M                               10M
trees  depth   time [s]      AUC     RAM [GB]      time [s]   AUC     RAM [GB]
  1      1       1150        0.634     620             70     0.635     110
  1     10       1350        0.712     620             90     0.712     112
 10     10       7850        0.731     780            830     0.731     125
100     10    crash (OOM)      -     >960 (OOM)      8070     0.755     230

100M ran on:
x1e.8xlarge (32 cores, 1 NUMA, 960GB RAM)

10M ran on:
r4.8xlarge (32 cores, 1 NUMA, 240GB RAM)
