
RFC: Tensor mutability of TensorOps #145

Closed
djdisodo opened this issue Jan 4, 2023 · 6 comments
Labels
enhancement Enhance existing features performance Anything related to performance

Comments


djdisodo commented Jan 4, 2023

With the current design of TensorOps, a TensorPrimitive is never mutated once initialized, so the buffer of a passed TensorPrimitive cannot be reused.
(It would be possible with interior mutability, but that wouldn't be expected behavior.)

Some operations could simply reuse the same buffer (e.g. reshape), or write their result into it (e.g. sqrt, relu).
This would avoid allocating a new buffer for the result, potentially cutting memory usage in half,
but it forces the user to clone the Tensor before passing it if they plan to use the original value.

I'm suggesting we support both the current behavior and a new "append" (in-place) behavior
by adding both functions to the TensorOps trait, with the current behavior provided as an optional trait default:

trait Foo {
    fn foo_append(t: &mut TensorPrimitive);
    fn foo(t: &TensorPrimitive) -> TensorPrimitive {
        let mut t = t.clone();
        Self::foo_append(&mut t);
        t
    }
}

I'm a newbie, so correct me 😓

@nathanielsimard
Member

Thanks for the proposal. I'd like to highlight the pros and cons of having mutable operations.

Pros:

  • Potentially increase performance, particularly during inference rather than training. This is because tensors often need to be reused in the backward pass during training, which requires an immutable API or frequent cloning.

Cons:

  • Increase the size of the backend API
  • May increase the userland Tensor API
    • Decrease developer experience by requiring them to choose between the mutable and immutable versions of an operation.

I have two potential solutions in mind:

  1. Allow backends to implement mutable operations (with default implementations provided). However, I would not include these mutable operations in the userland Tensor API. Instead, a lazy decorator backend could analyze the computational graph and use these mutable operations internally. One potential issue with this approach is that the decorator backend would need to handle dynamic partial graphs, which may make it difficult to know for certain if a tensor will never be used.

  2. Another way to allow mutating tensor in the backend is to change the API so that each operation takes ownership of each input tensor. Each backend could then handle the reusability of tensor data in their clone implementation of the tensor primitive. This solution is simpler, but it's not clear how we could provide more information to backends to help them know when to share storage or reuse and modify it.

Maybe both solutions could be combined in a way that simplifies the decorator backend's analysis of graphs, using explicit clone calls to provide lifetime information.

@nathanielsimard nathanielsimard added enhancement Enhance existing features performance Anything related to performance labels Jan 6, 2023
@nathanielsimard
Member

nathanielsimard commented Jan 8, 2023

I thought about the problem and came up with an even better potential solution!

  1. Each operation would receive owned input tensors, which would allow them to reuse data storage or buffer for performance improvement. However, they would also need to handle shared data structures.
  2. Tensor would no longer implement Clone, but would implement share instead: a new method for creating shared references to tensor data.
pub trait TensorOps<B> {
    ...
    fn share<const D: usize>(tensor: &mut B::TensorPrimitive<D>) -> B::TensorPrimitive<D>;
}

Backends can implement this with a simple clone if they want, but they can also change the data store to a shared one.

struct MyTensorPrimitive {
   ...
   storage: MyTensorStorage,
}

enum MyTensorStorage {
    Owned(Storage),
    Shared(Arc<Storage>),
}

The share function implementation modifies the inner storage by moving it into an immutable Arc reference. This gives backends more flexibility to reuse existing buffers without increasing the number of functions they need to implement.

I don't see any drawback to this solution. It does not increase the size of the API in the Backend trait or the Tensor struct, does not require graph analysis for performance improvement, and even allows for partial mutability in the API (the left-hand side Tensor may be shared, but the right-hand side may not, allowing for even more optimization opportunities). It also provides room for better documentation, as we can add custom documentation to the share method but not the Clone trait.

@antimora
Collaborator

+1 on improving inference performance.

@antimora
Collaborator

I came across the clone_from method, which could be memory-efficient: https://doc.rust-lang.org/nightly/core/clone/trait.Clone.html#method.clone_from

@antimora
Collaborator

antimora commented Apr 2, 2023

@nathanielsimard You worked on this. Is this ticket complete?

@nathanielsimard
Member

Yes, it's completed.
