Issues when training on a large dataset #21
Hi, first please make sure that you're using the binary compiled from Rust directly instead of the Python wrapper. Currently, to keep the Python API ergonomic, a redundant copy of the dataset is kept in memory, which is fine for smaller datasets but could be problematic for really large ones.

I'm a bit surprised that you couldn't even finish initialization. I added some additional logging to facilitate debugging; could you try installing the latest source and see how far it gets? You can do so by running […]

Once initialization passes, you can try using the new […]
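(For reference, a minimal sketch of how installing the CLI from the latest source with Cargo might look; it assumes the CLI is gated behind a cargo feature named "cli", as in the released crate, so check the README if this doesn't match.)

```bash
# Sketch: install the omikuji CLI from the latest GitHub source.
# Assumes the CLI is behind a "cli" cargo feature, as in the released crate.
cargo install --git https://github.com/tomtung/omikuji.git --features cli --locked --force
```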
Thank you for the quick reply!
Yes, I use the CLI compiled from source; I use the Python binding just for inference on new data.
I'm not 100% sure it happened because of Omikuji: I was connected through SSH, and the only output I saw was […]

Yesterday I re-ran this training, this time on a 3.4TB-RAM instance, and this time I executed the command as a background bash job (with […]). Right now it's been running for 26 hours and it's at 90% of forest training.

Thank you for the update you've made: I'll reinstall your tool before the next training and will try […]

Also, just as a suggestion: do you think it would be possible/feasible to implement checkpoints? Right now the cost of training on my dataset is about $600 with this instance, and if something happens before the model is saved, that's a painful waste of resources.
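(For anyone hitting the same SSH-disconnect problem, a minimal sketch of running training as a background job that survives the session; the dataset path and the --model_path value below are placeholders.)

```bash
# Sketch: keep training running after the SSH session drops, log output to a file.
# train.txt and ./model are placeholder paths.
nohup omikuji train train.txt --model_path ./model > train.log 2>&1 &

# Follow progress later, even from a new SSH session.
tail -f train.log
```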
Yep, you might have noticed that this has already been added. When I get around to it, I could also try improving parallelization for this part; I didn't expect it to take as long as 6 hours.
In principle it should be possible, but it would require some fairly significant refactoring & redesign. I guess we could also aim for something simpler, e.g. saving the initialized trainer, or saving each tree immediately after it's trained, but that would still require some refactoring. I probably can't get to it at the moment, but you're welcome to take a stab at it :)
And one last question regarding usage on large datasets. Yesterday's training process finished successfully in 26 hours, and I'm very happy with the precision@5 I got on my test set. But there's a new interesting issue: the resulting model consists of 3 trees of 120GB each, and it takes 70 minutes to load the model (in fact, it requires more than 624GB of memory to finish loading the model for inference, since it failed to load on a 624GB instance). I tried loading the model into Python, then […]

Also, I understand I can leave only one tree in the model folder, but it's still 120GB, and there's quite a performance drop when I do that (tested only on a smaller 2M dataset). Is […]
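(For context, the Python-side loading and inference described above is roughly the following; the method and parameter names are as I remember them from the omikuji Python binding, so treat this as a sketch and double-check against its documentation.)

```python
import omikuji

# Loading deserializes all trees in the model directory; for a ~360GB model
# this is the step that takes ~70 minutes and needs the most memory.
model = omikuji.Model.load("./model")

# Features are sparse (index, value) pairs, e.g. from a TF-IDF vectorizer.
features = [(1023, 0.37), (50210, 0.12)]

# Returns (label, score) pairs for the highest-scoring labels.
print(model.predict(features, top_k=5))
```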
Unfortunately, for now I can't really think of any way to further speed up model loading... I guess we could first load the entire files into memory, then parallelize deserializing individual trees, but that would probably make the memory-usage problem even worse. Calling […]

You could try increasing […]

I might eventually try to support some sparsity-inducing loss function like L1-regularized SVM, but that will take time and might be tricky too. (E.g., according to Babbar & Schölkopf 2019, the LIBLINEAR solver underfits, and they suggested using a proximal gradient method instead, which I suspect could be much slower.)

Out of curiosity, could you tell me a bit more about your use case? In particular, do you need to regularly retrain the model? If so, I could try to prioritize speeding up the initialization process, as I assume shaving 6 hours off of 26 would be quite significant.
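(As a general pointer for tuning, the full list of training hyperparameters and their defaults can be printed by the CLI itself, and a trained model can be evaluated from the command line as well; the flag names below are as I recall them from the README, and the paths are placeholders.)

```bash
# Print every training hyperparameter and its default value.
omikuji train --help

# Evaluate a trained model on a held-out set (placeholder paths).
omikuji test ./model test.txt --out_path predictions.txt
```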
Closing for now due to inactivity; feel free to re-open if you have more questions.
Hi Tom! First of all, I want to thank you for your great contribution. This is the best implementation of XMC I've found (and one that is also feasible to use in production).
I ran a number of experiments, and my observation is that it works great when the training set is around 1-2M samples. However, the task I'm trying to solve has 60M samples in the training set, with 1M labels and 3M TF-IDF features. I always use the default Parabel-like parameters.
Once I managed to train a model on 60M samples with 260k labels, but the only machine that could fit it was a 160-CPU, 3.4TB-RAM GCP instance, which is very expensive.
I tried a 96-CPU, 1.4TB machine to decrease costs, but it hangs for 3-4 hours on the
Initializing tree trainer
step and then disconnects (I guess it runs out of memory). Do you have any tips and tricks for running training on a dataset of this size at a reasonable cost? E.g., would it be possible to train in batches on smaller/cheaper machines? Or are there any "magic" hyperparameter settings to achieve this?
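(For reference, the training runs described above are plain CLI invocations roughly like the one below, with the data in the Extreme Classification Repository format expected by the tool; the dataset path is a placeholder.)

```bash
# Train with the default (Parabel-like) hyperparameters; train_60M.txt is a placeholder.
omikuji train train_60M.txt --model_path ./model
```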