
single-node xgboost-ray is 3.5x slower than single-node vanilla xgboost #230

Closed

faaany opened this issue Aug 19, 2022 · 7 comments

@faaany

faaany commented Aug 19, 2022

Hi guys, I am trying to improve my xgboost training with xgboost-ray because of its attractive features, such as multi-node training and distributed data loading. However, my experiment shows that training with vanilla xgboost is 3.5x faster than with xgboost-ray on a single node.

My configuration is as follows:

XGBoost-Ray Version: 0.1.10
XGBoost Version: 1.6.1
ray: 1.13.0

Training Data size: 18GB, with 51,995,904 rows and 174 columns.
Cluster Machine: 160 cores and 500 GB RAM

Ray Config: 1 actor and 154 cpus_per_actor
Ray is initialized with ray.init(). The Ray cluster is set up manually, and the nodes are connected via Docker swarm (which is probably not relevant for this issue).
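
For reference, here is a minimal sketch of this setup in xgboost-ray (the path, label column, and boosting parameters below are placeholders, not my actual script):

```python
from xgboost_ray import RayDMatrix, RayParams, train

# Placeholder path and label column; the real data is an 18 GB file on local disk.
dtrain = RayDMatrix("/data/train.parquet", label="target")

ray_params = RayParams(num_actors=1, cpus_per_actor=154)

bst = train(
    {"objective": "binary:logistic", "tree_method": "hist"},  # placeholder params
    dtrain,
    num_boost_round=100,
    ray_params=ray_params,
)
```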

Please let me know if there is more info needed from your side. Thanks a lot!

@krfricke
Collaborator

Just to be sure, a few questions:

  1. Are you comparing vanilla xgboost on 1 node with 160 cores vs. xgboost-ray on 1 node with 160 cores, or on 2 nodes with 160 cores each? (I assume it's the latter.)
  2. What is the absolute time training takes?
  3. Where does the training data live? Is it one file or multiple files?

It would be great if you could post (parts of) your training script so we can look into how to optimize the data loading part of training.

Just as a test for your next run, can you set the n_params parameter of xgboost (in both xgboost-ray and vanilla xgboost) to the number of CPUs (e.g. 154 if using 154 cpus_per_actor)?
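
For example, something along these lines, assuming the thread-count parameter (i.e. nthread in the params dict); the toy data below is only a placeholder:

```python
import numpy as np
import xgboost as xgb
from xgboost_ray import RayDMatrix, RayParams, train as ray_train

# Toy stand-in for the real training data.
X = np.random.rand(1000, 174)
y = np.random.randint(2, size=1000)

params = {
    "objective": "binary:logistic",
    "tree_method": "hist",
    "nthread": 154,  # match the number of CPUs available to training
}

# Vanilla xgboost: nthread sizes the thread pool directly.
xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=10)

# xgboost-ray: same params, with cpus_per_actor matching nthread.
ray_train(
    params,
    RayDMatrix(X, label=y),
    num_boost_round=10,
    ray_params=RayParams(num_actors=1, cpus_per_actor=154),
)
```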

@krfricke
Collaborator

krfricke commented Aug 19, 2022

Quick update: I've run a benchmark on a single-node 16-CPU instance with 10 GB of data and can't find any meaningful difference in training time with a single worker.

Vanilla xgboost

Running xgboost training benchmark...
run_xgboost_training takes 47.01420674300016 seconds.
Results: {'training_time': 47.01420674300016, 'prediction_time': 0.0}

XGBoost-Ray

UserWarning: `num_actors` in `ray_params` is smaller than 2 (1). XGBoost will NOT be distributed!
run_xgboost_training takes 43.593930259000444 seconds.
Results: {'training_time': 43.593930259000444, 'prediction_time': 0.0}

Note that this is with local data. I'll benchmark with two nodes next (possibly only next week though)

@Yard1
Member

Yard1 commented Aug 19, 2022

I've walked through it with @faaany and it seems that for their use-case, running 5 actors with 30 CPUs on a single node gives major speedups over 1 actor with 154 CPUs, though still not as fast as vanilla XGBoost.

For 2 nodes, using 9 actors with 32 CPUs per actor is around 15% faster than vanilla XGBoost.

It seems that there is an upper ceiling on CPUs per actor, and adding more CPUs beyond that gives diminishing returns compared to adding more actors. It's still not clear why vanilla XGBoost is so fast, though.
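
In xgboost-ray terms, the resource splits compared above look roughly like this (only the actor/CPU counts reflect the actual runs):

```python
from xgboost_ray import RayParams

# Original single-node run: one actor holding nearly all cores.
single_actor = RayParams(num_actors=1, cpus_per_actor=154)

# Tuned single-node run: several smaller actors, which gave a major speedup.
multi_actor = RayParams(num_actors=5, cpus_per_actor=30)

# Two-node run: around 15% faster than vanilla XGBoost in this case.
two_nodes = RayParams(num_actors=9, cpus_per_actor=32)
```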

@faaany, please confirm when you have a moment :)

@xwjiang2010

@Yard1 is there a script that resembles @faaany's use case?

@faaany
Author

faaany commented Aug 20, 2022

@Yard1 is correct. Regarding @krfricke's questions at the beginning, here are the answers:

  1. I am comparing vanilla xgboost on 1 node with 160 cores vs. xgboost-ray on 1 node with 160 cores (I mistyped the number of actors in the original post. I used 1 actor for single-node xgboost-ray).
  2. The absolute training time is about 8 min (vanilla xgboost) vs. about 35 min (xgboost-ray).
  3. The training data lives on a local hard disk and is mounted into the Docker container for training.

Do you mean the nthread param?

@faaany
Author

faaany commented Aug 20, 2022

@xwjiang2010 my code is actually very simple, not that different from the xgboost-ray example code. I agree with @Yard1's statement that there might be an upper ceiling on CPUs per actor. For a machine with 160 CPUs, it is not a good idea to set the number of actors to 1 and the number of CPUs per actor close to 160.

On the other hand, the hardware I am using might not be that typical for other xgboost-ray users. So you may close this issue if you want.

@faaany closed this as completed Aug 24, 2022
@faaany
Author

faaany commented Aug 24, 2022

As stated above, I needed to tune the number of actors and the number of CPUs per actor to get runtime performance comparable to vanilla xgboost on my machine with 160 cores.
