Multiple training runs in parallel #715

Open
wbrickner opened this issue Aug 29, 2023 · 6 comments
Labels
enhancement Enhance existing features

Comments

@wbrickner
Contributor

Feature description

I have an optimization that is very sensitive to initialization. No idea why. Instead of getting it right with elegant math, I have found I can just try over and over until I get a good initial state.

I'm not nearly saturating my GPU's parallelism. What I want is an optimizer / training loop from burn that can perform the whole optimization process over N parameter sets, essentially adding another dimension to all the tensors (and isolating certain operations across this dimension).

Feature motivation

Being able to conduct N training runs over an identical model architecture and loss function (possibly with different data, learning rates, and initializations) at the same time.

(Optional) Suggest a Solution

This might be very easy to implement with some modifications to autodiff and the optimizers. I don't have enough familiarity to say. Ideally the resulting external API will not change when using a list of learning rates / schedules, initializers, etc.

@nathanielsimard
Member

Hmm, interesting. I don't think we can support that feature by adding a batch dimension automatically. The code would be very different, since each module holds its state without a batch dimension.

I believe the easiest and most flexible solution is probably to create one learner per model and launch them in parallel. This way, each model/experiment can have its own artifact directory with metrics, checkpoints, etc. that you can compare.

let data = ...;
let learners = vec![
    build_learner(device1, artifact1),
    build_learner(device2, artifact2),
    build_learner(device3, artifact3),
];

// Fan out with rayon: each learner trains on its own clone of the data.
let models: Vec<_> = learners
    .into_par_iter()
    .map(|learner| learner.fit(data.clone()))
    .collect();

Let me know if it helps!

@wbrickner
Contributor Author

wbrickner commented Aug 29, 2023

  • Does burn / the underlying backend allow for sharing the GPU this way? Is this efficient? I would think this would cause massive performance loss because of memory churn and a lack of efficient parallelism (I don't know how GPU sharing works at a low level, but I assume multiple simultaneous operations do not get synchronized and coalesced by the driver into one big operation).

  • Can this transform be done automatically? Or is it the const generic on tensor dimension, plus the lack of a priori trait knowledge about how the computational graph is connected, that prevents this from working? I was an ArrayFire user for a while, and it lacks any generics that communicate dimension, so you can pull these tricks easily.


If there is no better way, because of burn's compute graph design, tensor generics, etc., perhaps we can build this multithreading approach into burn itself. For example, training could look like:

let learner =
    LearnerBuilder::new("./artifacts")
      .metric_train_plot(LossMetric::new())
      .metric_valid_plot(LossMetric::new())
      .devices(vec![device])
      .num_epochs(50)
      .build(
        [1e-3, 3e-3, 5e-3, 8e-3]
          .map(|lr| (
            model.clone(),
            AdamConfig::new().init(),
            lr
          ))
      );

Perhaps we redefine the build method to accept an Iterator<Item = (Model, Optim, LR)>, allowing full control over the set of runs conducted.

This way the ordinary usage hardly changes at all (or we could instead move this functionality to a new method build_multi):

.build([(
  model,
  AdamConfig::new().init(),
  1e-3
)]);

@nathanielsimard
Member

Does burn / the underlying backend allow for sharing the GPU this way? Is this efficient? I would think this would cause massive performance loss because of memory churn and a lack of efficient parallelism (I don't know how GPU sharing works at a low level, but I assume multiple simultaneous operations do not get synchronized and coalesced by the driver into one big operation).

I think we should make sure that you can leverage the GPU efficiently with multiple threads at the same time! CUDA has streams for that (https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/) and I think LibTorch is using them. WGPU is thread-safe; I'm not sure about the internals, but I would be interested in knowing more about how it behaves. A CPU backend will probably have no problem being executed this way.
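
For illustration, here is a minimal sketch (not from burn's docs) of driving one device from several threads, each running an independent workload; it assumes the device handle is Clone + Send, which holds for the Tch and Wgpu backends:

use burn::tensor::{backend::Backend, Distribution, Tensor};

// One independent workload per thread, all targeting the same device.
// Whether the kernels actually overlap on the GPU is up to the backend/driver
// (e.g. CUDA streams in the LibTorch case).
fn run_parallel<B: Backend>(device: &B::Device, n_runs: usize)
where
    B::Device: Clone + Send,
{
    std::thread::scope(|s| {
        for _ in 0..n_runs {
            let device = device.clone();
            s.spawn(move || {
                let a = Tensor::<B, 2>::random_device([512, 512], Distribution::Default, &device);
                let b = Tensor::<B, 2>::random_device([512, 512], Distribution::Default, &device);
                // Read the result back to force execution on the device.
                let _ = a.matmul(b).into_data();
            });
        }
    });
}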

Can this transform be done automatically? Or is it the const generic on tensor dimension, plus the lack of a priori trait knowledge about how the computational graph is connected, that prevents this from working? I was an ArrayFire user for a while, and it lacks any generics that communicate dimension, so you can pull these tricks easily.

The problem isn't the computation graph but the modules. The Linear module has a tensor of rank two for its weights and a tensor of rank one for its bias. Having to support a batch dimension would be a significant breaking change that would affect every module, adding a lot of complexity to the API. If you are building your own modules, you can add a batch dimension to your parameters and take that approach if you want, but I don't think we should enforce it for popular modules.
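
If you do take the custom-module route, the idea is just to carry an extra leading "run" dimension on your parameters. A rough sketch of the forward math only (not a burn Module; it assumes rank-3 matmul is a batched matrix multiply, which is how burn's matmul behaves on higher-rank tensors, and it keeps the bias pre-expanded so the add is shape-exact):

use burn::tensor::{backend::Backend, Tensor};

/// Hypothetical "multi-run" linear forward.
/// input:  [n_runs, batch, d_in]
/// weight: [n_runs, d_in, d_out]
/// bias:   [n_runs, batch, d_out] (pre-expanded to match the output shape)
fn multi_run_linear<B: Backend>(
    input: Tensor<B, 3>,
    weight: Tensor<B, 3>,
    bias: Tensor<B, 3>,
) -> Tensor<B, 3> {
    // Batched matmul over the leading run dimension, then an element-wise add.
    input.matmul(weight) + bias
}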

This way the ordinary usage hardly changes at all (or we could instead move this functionality to a new method build_multi):

We could provide a build_multi method on the builder. It would return a list of learners instead of just one. Additionally, we could offer a function fit_all(dataloader, learners) -> Vec<Modules> to execute them all in parallel.
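
For illustration, a rough sketch of what such a fit_all helper could look like; the Fit trait here is a stand-in for the real Learner API (which this thread only references as learner.fit(data)), and the rayon fan-out is the point of the example:

use rayon::prelude::*;

/// Stand-in for whatever `Learner::fit` ends up looking like.
trait Fit {
    type Model;
    type Data: Clone + Send + Sync;
    fn fit(self, data: Self::Data) -> Self::Model;
}

/// Train every learner in parallel, each on its own clone of the data, one rayon task per learner.
fn fit_all<L>(data: L::Data, learners: Vec<L>) -> Vec<L::Model>
where
    L: Fit + Send,
    L::Model: Send,
{
    learners
        .into_par_iter()
        .map(|learner| learner.fit(data.clone()))
        .collect()
}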

@wbrickner
Contributor Author

I plead that, if this gets implemented, it's opaque and build_multi also returns a Learner rather than a Vec<Learner>; otherwise it's no easier or more elegant than building my own Vec of learners.

As for the streams and underlying backend implementations, I'm a bit ignorant; I assumed that the tensor kernels being run would end up unsynchronized (so data and instruction access patterns would be a lot worse). I can do a test to check the performance implications of multithreading vs "batching":

use burn::tensor::{Distribution, Tensor};
use burn_tch::{TchBackend, TchDevice};
use criterion::{black_box, Criterion};
use rayon::prelude::*;

// `c` is the `&mut Criterion` handle of the benchmark harness.
let size = 2usize.pow(26);
let batch = 16;

// "Batched" case: one big rank-2 tensor; "multithread" case: `batch` rank-1 tensors.
let x = || Tensor::<TchBackend<f32>, 2>::random_device([batch, size], Distribution::Default, &TchDevice::Mps);
let y = || Tensor::<TchBackend<f32>, 1>::random_device([size], Distribution::Default, &TchDevice::Mps);

c.bench_function("gpu_mul_batch", |bench| {
  let ab = (x(), x());

  bench.iter(|| {
    let (a, b) = black_box(ab.clone());
    let z = a * b;
    black_box(z);
  });
});

c.bench_function("gpu_mul_multithread", |bench| {
  let ab = (0..batch).map(|_| (y(), y())).collect::<Vec<_>>();

  bench.iter(|| {
    ab
      .clone()
      .into_par_iter()
      .for_each(|ab| {
        let (a, b) = black_box(ab);
        let z = a * b;
        black_box(z);
      });
  });
});

The results are very shocking:

gpu_mul_batch           time:   [39.374 ms 40.432 ms 41.563 ms]
Found 15 outliers among 100 measurements (15.00%)
  15 (15.00%) high severe

gpu_mul_multithread     time:   [34.787 ms 34.830 ms 34.878 ms]
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

How can this be?! My GPU is the M1 Max and has a unified memory architecture, so perhaps this result doesn't generalize to discrete GPUs.

@nathanielsimard
Member

@wbrickner I think it heavily depends on the size of the tensors. For small tensors, I expect the batching to be faster, but for big ones, I expect the multithreaded version to be equally fast. I'm also a bit surprised by the results, but I guess when working with big matrices, allocating that amount of contiguous memory is slower than allocating smaller chunks.

@antimora added the enhancement label on Sep 8, 2023
@giucesar

This would be extremely helpful for RL use cases.
I'm experimenting with Godot + Rust + Burn to build some AI for games (dummy tests for now), and being able to train several agents (small similar models) in parallel would be welcome.
