Spark MLlib GBT 100M dataset #18

Open
szilard opened this issue May 8, 2019 · 6 comments
szilard commented May 8, 2019

du -sm *.csv
467     train-10m.csv
47      train-1m.csv
5       train-0.1m.csv

du -sm *.parquet
2385    spark_ohe-train-100m.parquet
239     spark_ohe-train-10m.parquet
25      spark_ohe-train-1m.parquet
3       spark_ohe-train-0.1m.parquet
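(The prep code that produced these one-hot-encoded parquet files isn't shown in this issue; below is only a rough Spark 2.4 sketch of the kind of pipeline involved, with placeholder column names cat1/cat2/num1/target, not the actual columns of this dataset:)

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import org.apache.spark.sql.functions.col

// placeholder columns: cat1/cat2 categorical, num1 numeric, target = 0/1 outcome
val raw = spark.read.option("header", "true").option("inferSchema", "true").csv("train-10m.csv")
val catCols = Array("cat1", "cat2")
val numCols = Array("num1")

// index the string categories, one-hot encode them, assemble everything into a "features" vector
val indexers = catCols.map(c => new StringIndexer().setInputCol(c).setOutputCol(c + "_idx"))
val encoder = new OneHotEncoderEstimator().
  setInputCols(catCols.map(_ + "_idx")).setOutputCols(catCols.map(_ + "_ohe"))
val assembler = new VectorAssembler().
  setInputCols(catCols.map(_ + "_ohe") ++ numCols).setOutputCol("features")

val stages: Array[PipelineStage] = indexers ++ Array(encoder, assembler)
val prep = new Pipeline().setStages(stages).fit(raw)
prep.transform(raw).
  withColumn("label", col("target").cast("double")).   // "target" is a placeholder label column
  select("label", "features").
  write.parquet("spark_ohe-train-10m.parquet")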


free -m
              total        used        free      shared  buff/cache   available
Mem:         245854         568      244920           8         365      244043


lscpu
CPU(s):                32


${SPARK_ROOT}/bin/spark-shell --master local[*] --driver-memory 220G --executor-memory 220G


import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val d_train = spark.read.parquet("spark_ohe-train.parquet").cache()
val d_test = spark.read.parquet("spark_ohe-test.parquet").cache()
(d_train.count(), d_test.count())


free -m
              total        used        free      shared  buff/cache   available
Mem:         245854       64579      178405           8        2868      180025

[screenshot: 2019-05-08, 5:50 AM]


val rf = new GBTClassifier().setLabelCol("label").setFeaturesCol("features").
  setMaxIter(100).setMaxDepth(10).setStepSize(0.1).
  setMaxBins(100).setMaxMemoryInMB(10240)     // max possible setMaxMemoryInMB (otherwise errors out)
val pipeline = new Pipeline().setStages(Array(rf))

val now = System.nanoTime
val model = pipeline.fit(d_train)
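(The training times and AUC values reported below are presumably computed roughly like this once fit() returns; the evaluator is the one imported above, while the elapsed/preds/auc names are placeholders, not copied from the issue:)

// elapsed training time in seconds, from the `now` timestamp captured before fit()
val elapsed = (System.nanoTime - now) / 1e9

// score the held-out set and compute AUC; GBTClassifier in Spark 2.4 emits a rawPrediction
// column, which BinaryClassificationEvaluator uses by default (metric: areaUnderROC)
val preds = model.transform(d_test)
val auc = new BinaryClassificationEvaluator().setLabelCol("label").evaluate(preds)
println(s"time: $elapsed s, AUC: $auc")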
szilard commented May 8, 2019

[screenshots: 2019-05-08, 5:55 AM]

starts spilling to disk

[screenshots: 2019-05-08, 6:00-6:08 AM]

no more disk writes

[screenshots: 2019-05-08, 6:25-6:27 AM]

job fails

[screenshots: 2019-05-08, 6:36 AM]

szilard commented May 8, 2019

moar RAM:

x1e.8xlarge (32 cores, 1 NUMA, 960GB RAM)

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E7-8880 v3 @ 2.30GHz
Stepping:              4
CPU MHz:               2699.984
CPU max MHz:           3100.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.10
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt ida
~/spark-2.4.2-bin-hadoop2.7/bin/spark-shell --master local[*] --driver-memory 940G --executor-memory 940G

[screenshot: 2019-05-08, 1:41 PM]

scala> val model = pipeline.fit(d_train)
[Stage 443:>                                                      (0 + 32) / 32]
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007eb838e80000, 51384942592, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 51384942592 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/ubuntu/GBM-perf/wip-testing/spark/hs_err_pid2301.log

[screenshot: 2019-05-08, 11:51 PM]

szilard commented May 9, 2019

Let's try to learn only 1 tree of depth 1:
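(Presumably the same pipeline as above with the iteration count and depth dialed down; the exact invocation isn't shown in the issue, so this is just a sketch:)

// same setup as before, but a single tree of depth 1
val gbt1 = new GBTClassifier().setLabelCol("label").setFeaturesCol("features").
  setMaxIter(1).setMaxDepth(1).setStepSize(0.1).
  setMaxBins(100).setMaxMemoryInMB(10240)
val model1 = new Pipeline().setStages(Array(gbt1)).fit(d_train)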

runs in 1150 sec, AUC = 0.634, RAM usage 620 GB

[screenshots: 2019-05-09, 12:18-12:19 PM]

szilard commented May 9, 2019

1 tree depth 10:

runs in 1350 sec, AUC = 0.712, RAM usage 620 GB

[screenshots: 2019-05-09, 12:56 PM]

szilard commented May 10, 2019

10 trees depth 10:

runs in 7850 sec, AUC = 0.731, RAM usage 780 GB

szilard commented May 10, 2019

                     100M                               10M
trees  depth   time [s]      AUC     RAM [GB]      time [s]   AUC     RAM [GB]
  1      1       1150        0.634     620             70     0.635     110
  1     10       1350        0.712     620             90     0.712     112
 10     10       7850        0.731     780            830     0.731     125
100     10    crash (OOM)      -     >960 (OOM)      8070     0.755     230

100M ran on:
x1e.8xlarge (32 cores, 1 NUMA, 960GB RAM)

10M ran on:
r4.8xlarge (32 cores, 1 NUMA, 240GB RAM)
