Multiple training runs in parallel #715
Comments
Hmm, interesting. I don't think we can support that feature by adding a batch dimension automatically. The code would be very different, as each module holds its state without a batch size. I believe the easiest and most flexible solution is probably to create one learner per model and launch them in parallel. This way, each model/experiment can have its own artifact directory with metrics, checkpoints, etc. that you can compare.

let data = ...;
let learners = vec![
    build_learner(device1, artifact1),
    build_learner(device2, artifact2),
    build_learner(device3, artifact3),
];
// rayon's parallel iterator runs each training in its own thread
let models: Vec<_> = learners
    .into_par_iter()
    .map(|learner| learner.fit(data.clone()))
    .collect();

Let me know if it helps!
If there is no better way because of the burn compute graph design, tensor generics, etc., perhaps we can build this multithreading approach into burn itself. E.g. training could look like:

let learner = LearnerBuilder::new("./artifacts")
    .metric_train_plot(LossMetric::new())
    .metric_valid_plot(LossMetric::new())
    .devices(vec![device])
    .num_epochs(50)
    .build(
        [1e-3, 3e-3, 5e-3, 8e-3].map(|lr| (
            model.clone(),
            AdamConfig::new().init(),
            lr,
        )),
    );

We could perhaps redefine build to accept a list of (model, optimizer, learning rate) tuples. This way the ordinary usage hardly changes at all (or we could instead move this functionality to a new method):

.build([(
    model,
    AdamConfig::new().init(),
    1e-3,
)]);
I think we should make sure that you can leverage the GPU efficiently with multiple threads at the same time! CUDA has streams for that (https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/) and I think LibTorch is using them. WGPU is thread-safe; I'm not sure about the internals, but I would be interested in knowing more about how it behaves. A CPU backend will probably have no problem being executed this way.
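For reference, a minimal sketch of the thread-per-run pattern under discussion, assuming hypothetical build_learner / load_data helpers like the ones in the earlier comment (they are placeholders, not Burn API). Each run owns its learner and drives the device from its own host thread, and the backend is free to overlap the work (e.g. CUDA streams in LibTorch, a thread-safe queue in WGPU):

use std::thread;

// Stand-ins for the hypothetical helpers from the comment above; in real
// code these would build a Burn learner and load the dataset.
struct Learner { artifact_dir: String, lr: f64 }
struct Data;

fn build_learner(artifact_dir: &str, lr: f64) -> Learner {
    Learner { artifact_dir: artifact_dir.to_string(), lr }
}
fn load_data() -> Data { Data }

impl Learner {
    fn fit(&self, _data: Data) {
        println!("run in {} with lr = {}", self.artifact_dir, self.lr);
    }
}

fn main() {
    let configs = vec![
        ("artifacts/run-0".to_string(), 1e-3),
        ("artifacts/run-1".to_string(), 3e-3),
    ];

    // One OS thread per run; each thread drives its own learner, so the
    // backend decides how to schedule the device work.
    thread::scope(|scope| {
        for (artifact_dir, lr) in configs {
            scope.spawn(move || {
                let learner = build_learner(&artifact_dir, lr);
                learner.fit(load_data());
            });
        }
    });
}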
The problem isn't the computation graph but the modules. The Linear module has a tensor of rank two for its weights and a tensor of rank one for its bias. Having to support a batch dimension would be a significant breaking change that would affect every module, adding a lot of complexity to the API. If you are building your own modules, you can add a batch dimension to your parameters and take that approach if you want, but I don't think we should enforce it for popular modules.
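As an illustration of that do-it-yourself route, here is a rough sketch of a user-defined layer that keeps N independent parameter sets in a leading dimension. The module/Param details are assumptions and may differ between burn versions; the point is only the extra leading dimension on the parameters:

use burn::module::{Module, Param};
use burn::tensor::{backend::Backend, Tensor};

// Hypothetical layer: weight is [n_runs, d_input, d_output], i.e. one
// weight matrix per run, stacked along a leading dimension.
#[derive(Module, Debug)]
pub struct MultiRunLinear<B: Backend> {
    weight: Param<Tensor<B, 3>>,
}

impl<B: Backend> MultiRunLinear<B> {
    // input: [n_runs, batch, d_input] -> output: [n_runs, batch, d_output].
    // The batched matmul keeps each run's activations tied to its own
    // parameter slice, so the runs never mix.
    pub fn forward(&self, input: Tensor<B, 3>) -> Tensor<B, 3> {
        input.matmul(self.weight.val())
    }
}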
We could provide a
I plead that if this gets implemented, it's opaque. As for the streams and underlying backend implementations, I'm a bit ignorant; I assumed that the tensor kernels being run would end up unsynchronized (so data and instruction access patterns would be a lot worse). I can do a test to check the performance implications of multithreading vs "batching":

let size = 2usize.pow(26);
let batch = 16;
let x = || Tensor::<TchBackend<f32>, 2>::random_device([batch, size], Distribution::Default, &TchDevice::Mps);
let y = || Tensor::<TchBackend<f32>, 1>::random_device([size], Distribution::Default, &TchDevice::Mps);

c.bench_function("gpu_mul_batch", |bench| {
    let ab = (x(), x());
    bench.iter(|| {
        let (a, b) = black_box(ab.clone());
        let z = a * b;
        black_box(z);
    });
});

c.bench_function("gpu_mul_multithread", |bench| {
    // rayon parallel iterator: one elementwise multiply per thread
    let ab = (0..batch).map(|_| (y(), y())).collect::<Vec<_>>();
    bench.iter(|| {
        ab.clone()
            .into_par_iter()
            .for_each(|ab| {
                let (a, b) = black_box(ab);
                let z = a * b;
                black_box(z);
            });
    });
});

The results are very shocking:
How can this be?! My GPU is the M1 Max and has a unified memory architecture, so perhaps this result doesn't generalize to discrete GPUs.
@wbrickner I think it heavily depends on the size of the tensors. For small tensors, I expect the batching to be faster, but for big ones, I expect the multithreaded version to be equally fast. I'm also a bit surprised by the results, but I guess when working with big matrices, allocating that amount of contiguous memory is slower than allocating smaller chunks.
This would be extremely helpful for RL use cases.
Feature description
I have an optimization that is very sensitive to initialization. No idea why. Instead of getting it right with elegant math, I have found I can just try over and over until I get a good initial state.
I'm not nearly saturating my GPU's parallelism. What I want is an optimizer / training loop from burn that can perform the whole optimization process over N parameter sets, basically adding another dimension to all the tensors (and isolating certain operations across this dimension).

Feature motivation
Being able to conduct N training runs over an identical model architecture and loss function (possibly not with the same data, learning rate, or initialization) at the same time.

(Optional) Suggest a Solution
This might be very easy to implement with some modifications to autodiff and the optimizers. I don't have enough familiarity to say. Ideally, the resulting external API would not change when using a list of learning rates / schedules, initializers, etc.