
RFC: Tensor mutability of TensorOps #145

Closed
djdisodo opened this issue Jan 4, 2023 · 6 comments
Labels
enhancement Enhance existing features performance Anything related to performance

Comments


djdisodo commented Jan 4, 2023

With the current design of TensorOps, a TensorPrimitive is never mutated once initialized, so the buffer of a passed TensorPrimitive cannot be reused.
(It would be possible with interior mutability, but that wouldn't be expected behavior.)

Some operations could simply reuse the same buffer (e.g. reshape), or write their result into it (e.g. sqrt, relu).
This would avoid allocating a new buffer for the result, potentially cutting memory usage in half,
but it forces the user to clone the Tensor before passing it if they plan to use the original value.

I'm suggesting we support both the current behavior and a new "append" (in-place) behavior
by adding both functions to the TensorOps trait, with the current behavior provided as an optional trait default:

trait Foo {
    fn foo_append(t: &mut TensorPrimitive);
    fn foo(t: &TensorPrimitive) -> TensorPrimitive {
        let mut t = t.clone();
        Self::foo_append(&mut t);
        t
    }
}

I'm a newbie, so correct me 😓

@nathanielsimard
Member

Thanks for the proposal. I'd like to highlight the pros and cons of having mutable operations.

Pros:

  • Potentially increase performance, particularly during inference rather than training. This is because tensors often need to be reused in the backward pass during training, which requires an immutable API or frequent cloning.

Cons:

  • Increase the size of the backend API
  • May increase the userland Tensor API
    • Decrease developer experience by requiring them to choose between the mutable and immutable versions of an operation.

I have two potential solutions in mind:

  1. Allow backends to implement mutable operations (with default implementations provided). However, I would not include these mutable operations in the userland Tensor API. Instead, a lazy decorator backend could analyze the computational graph and use these mutable operations internally. One potential issue with this approach is that the decorator backend would need to handle dynamic partial graphs, which may make it difficult to know for certain if a tensor will never be used.

  2. Another way to allow mutating tensor in the backend is to change the API so that each operation takes ownership of each input tensor. Each backend could then handle the reusability of tensor data in their clone implementation of the tensor primitive. This solution is simpler, but it's not clear how we could provide more information to backends to help them know when to share storage or reuse and modify it.

Maybe both solutions could be combined in a way that simplifies the decorator backend's analysis of graphs, using explicit clone calls to provide lifetime information.

@nathanielsimard nathanielsimard added enhancement Enhance existing features performance Anything related to performance labels Jan 6, 2023
@nathanielsimard
Member

nathanielsimard commented Jan 8, 2023

I thought about the problem and came up with an even better potential solution!

  1. Each operation would receive owned input tensors, which would allow them to reuse data storage or buffer for performance improvement. However, they would also need to handle shared data structures.
  2. Tensor would no longer implement Clone, but would implement share instead: a new method for creating shared references to tensor data.
pub trait TensorOps<B> {
    ...
    fn share<const D: usize>(tensor: &mut B::TensorPrimitive<D>) -> B::TensorPrimitive<D>;
}

Backends can implement this with a simple clone if they want, but they can also change the data store to a shared one.

struct MyTensorPrimitive {
   ...
   storage: MyTensorStorage,
}

enum MyTensorStorage {
    Owned(Storage),
    Shared(Arc<Storage>),
}

The share function implementation modifies the inner storage by moving it into an immutable Arc reference. This gives backends more flexibility to reuse existing buffers without increasing the number of functions they need to implement.

I don't see any drawback to this solution. It does not increase the size of the API in the Backend trait or the Tensor struct, does not require graph analysis for performance improvement, and even allows for partial mutability in the API (the left-hand side Tensor may be shared, but the right-hand side may not, allowing for even more optimization opportunities). It also provides room for better documentation, as we can add custom documentation to the share method but not the Clone trait.

@antimora
Collaborator

+1 on improving inference performance.

@antimora
Collaborator

I came across the clone_from method, which could be memory-efficient: https://doc.rust-lang.org/nightly/core/clone/trait.Clone.html#method.clone_from

@antimora
Collaborator

antimora commented Apr 2, 2023

@nathanielsimard You worked on this. Is this ticket complete?

@nathanielsimard
Member

Yes, it's completed.
