
single-node xgboost-ray is 3.5x slower than single-node vanilla xgboost #230

Closed

faaany opened this issue Aug 19, 2022 · 7 comments

@faaany

faaany commented Aug 19, 2022

Hi guys, I am trying to improve my xgboost training with xgboost-ray because of its attractive features, such as multi-node training and distributed data loading. However, my experiment shows that training with vanilla xgboost is 3.5x faster than with xgboost-ray on a single node.

My configuration is as follows:

XGBoost-Ray Version: 0.1.10
XGBoost Version: 1.6.1
ray: 1.13.0

Training Data size: 18GB, with 51,995,904 rows and 174 columns.
Cluster Machine: 160 cores and 500 GB RAM

Ray Config: 1 actor and 154 cpus_per_actor
Ray is initialized with ray.init(). The Ray cluster is set up manually, and the nodes are connected via Docker swarm (which is probably not relevant for this issue).
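
For reference, here is a minimal sketch of this setup in xgboost-ray (the path, label column, and boosting parameters below are placeholders, not my actual script):

```python
from xgboost_ray import RayDMatrix, RayParams, train

# Placeholder path and label column; the real data is an 18 GB file on local disk.
dtrain = RayDMatrix("/data/train.parquet", label="target")

ray_params = RayParams(num_actors=1, cpus_per_actor=154)

bst = train(
    {"objective": "binary:logistic", "tree_method": "hist"},  # placeholder params
    dtrain,
    num_boost_round=100,
    ray_params=ray_params,
)
```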

Please let me know if there is more info needed from your side. Thanks a lot!

@krfricke
Collaborator

Just to be sure, a few questions:

  1. Are you comparing vanilla xgboost on 1 node with 160 cores vs. xgboost-ray on 1 node with 160 cores, or on 2 nodes with 160 cores each? (I assume it's the latter.)
  2. What is the absolute time training takes?
  3. Where does the training data live? Is it one file or multiple files?

It would be great if you could post (parts of) your training script so we can look into how to optimize the data loading part of training.

Just as a test for your next run, can you set the n_params parameter of xgboost (in both xgboost-ray and vanilla xgboost) to the number of CPUs (e.g. 154 if using 154 cpus_per_actor)?
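
For example, something along these lines, assuming the thread-count parameter (i.e. nthread in the params dict); the toy data below is only a placeholder:

```python
import numpy as np
import xgboost as xgb
from xgboost_ray import RayDMatrix, RayParams, train as ray_train

# Toy stand-in for the real training data.
X = np.random.rand(1000, 174)
y = np.random.randint(2, size=1000)

params = {
    "objective": "binary:logistic",
    "tree_method": "hist",
    "nthread": 154,  # match the number of CPUs available to training
}

# Vanilla xgboost: nthread sizes the thread pool directly.
xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=10)

# xgboost-ray: same params, with cpus_per_actor matching nthread.
ray_train(
    params,
    RayDMatrix(X, label=y),
    num_boost_round=10,
    ray_params=RayParams(num_actors=1, cpus_per_actor=154),
)
```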

@krfricke
Collaborator

krfricke commented Aug 19, 2022

Quick update: I've run a benchmark on a single-node 16-CPU instance with 10 GB of data and can't find any meaningful difference in training time with a single worker.

Vanilla xgboost

Running xgboost training benchmark...
run_xgboost_training takes 47.01420674300016 seconds.
Results: {'training_time': 47.01420674300016, 'prediction_time': 0.0}

XGBoost-Ray

UserWarning: `num_actors` in `ray_params` is smaller than 2 (1). XGBoost will NOT be distributed!
run_xgboost_training takes 43.593930259000444 seconds.
Results: {'training_time': 43.593930259000444, 'prediction_time': 0.0}

Note that this is with local data. I'll benchmark with two nodes next (possibly only next week though)

@Yard1
Member

Yard1 commented Aug 19, 2022

I've walked through it with @faaany and it seems that for their use-case, running 5 actors with 30 CPUs on a single node gives major speedups over 1 actor with 154 CPUs, though still not as fast as vanilla XGBoost.

For 2 nodes, using 9 actors with 32 CPUs per actor is around 15% faster than vanilla XGBoost.

It seems that there is an upper ceiling on CPUs per actor, and adding more CPUs beyond that gives diminishing returns compared to adding more actors. It's still not clear why vanilla XGBoost is so fast, though.
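
In xgboost-ray terms, the resource splits compared above look roughly like this (only the actor/CPU counts reflect the actual runs):

```python
from xgboost_ray import RayParams

# Original single-node run: one actor holding nearly all cores.
single_actor = RayParams(num_actors=1, cpus_per_actor=154)

# Tuned single-node run: several smaller actors, which gave a major speedup.
multi_actor = RayParams(num_actors=5, cpus_per_actor=30)

# Two-node run: around 15% faster than vanilla XGBoost in this case.
two_nodes = RayParams(num_actors=9, cpus_per_actor=32)
```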

@faaany, please confirm when you have a moment :)

@xwjiang2010

@Yard1 is there a script that resembles @faaany's use case?

@faaany
Author

faaany commented Aug 20, 2022

@Yard1 is correct. Regarding @krfricke's questions at the beginning, here are the answers:

  1. I am comparing vanilla xgboost on 1 node with 160 cores vs. xgboost-ray on 1 node with 160 cores (I mistyped the number of actors in the original post. I used 1 actor for single-node xgboost-ray).
  2. The absolute training time is about 8 min (vanilla xgboost) vs. about 35 min (xgboost-ray).
  3. The training data lives on a local hard disk and is mounted into the Docker container for training.

Do you mean the nthread param?

@faaany
Author

faaany commented Aug 20, 2022

@xwjiang2010 my code is actually very simple, not that different from the xgboost-ray example code. I agree with @Yard1's statement that there might be an upper ceiling on CPUs per actor. For a machine with 160 CPUs, it is not a good idea to set the number of actors to 1 and the number of CPUs per actor close to 160.

On the other hand, the hardware I am using might not be that typical for other xgboost-ray users. So you may close this issue if you want.

@faaany closed this as completed Aug 24, 2022
@faaany
Author

faaany commented Aug 24, 2022

As stated above, I needed to tune the number of actors and the number of CPUs per actor to get runtime performance comparable to vanilla xgboost on my machine with 160 cores.
