
Fixed-Shape Tensor RFC revisions #25

Merged
gatesn merged 3 commits into develop from ct/tensor-revise
Mar 5, 2026
Conversation

@connortsui20
Contributor

@connortsui20 connortsui20 commented Mar 5, 2026

Rendered

Some revisions from #24

This also moves the RFC into the accepted directory.

I'll just keep this named tensor since future RFCs can be called variable or sparse tensors.

The only change that was not directly prompted by comments on the last PR is a change to the strides section, since some of the description there was incorrect.

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
@connortsui20 connortsui20 requested review from danking and gatesn March 5, 2026 14:35
@connortsui20 connortsui20 changed the title Tensor RFC revisions Fixed-Shape Tensor RFC revisions Mar 5, 2026
@connortsui20
Contributor Author

connortsui20 commented Mar 5, 2026

Still working on a change to the strides section; it's a bit more complicated than it first seems, because Arrow does not have a logical type system and "shape + permutation" can mean two different things.

Edit: done in commit 53ad802

@connortsui20 connortsui20 reopened this Mar 5, 2026
Comment on lines +86 to +89
Nullability exists only at the tensor level: within a tensor array, an individual tensor may be
null, but elements within a tensor may not be. This is because tensor operations like matmul cannot
be efficiently implemented over nullable elements, and most tensor libraries (e.g., PyTorch) do not
support per-element nulls either.
Collaborator


commenting here but maybe it should go on the previous PR?

IDK how Arrow does it, but I don't think that's necessarily true.
Most vectorized compute just runs through null values that are zeroed out. IDK how you'd matmul the validity itself, but I think that's a reasonable thing.


I think interpretation of NULLs is context dependent. If NULL means "there was no data observed at this position" and you're doing a weighted sum of the features, treating NULLs as zero is probably the right choice. The result is indeed the count of what you observed. You can't infer anything about things you did not observe.

On the other hand, if NULL means "there is some data here but for technical reasons it was unrecoverable" and you're doing a linear regression, you probably want to replace NULL by a mean value over some dimension(s). I don't have a good linear regression example, but suppose you flip one hundred coins and record heads as 1 and tails as 0. Suppose further that you lose 10 coins before observing them. If you compute the sum of this vector with NULL as zeros you'll conclude the coins are tails-biased! If you compute the sum of this vector with NULL as the sample mean, you'll have an unbiased estimate of the coin's heads/tails probability.

IMO, matmul, sum, etc. should only be defined on tensors with non-nullable elements. I suppose null elements are fine, if they're representable in torch (I think they are not?).

NumPy is able to represent them when you use the catch-all object dtype, but if you request a primitive type it converts them to NaNs.

```python
In [8]: np.array([1., None])
Out[8]: array([1.0, None], dtype=object)

In [9]: np.array([1., None], dtype=float)
Out[9]: array([ 1., nan])

In [10]: np.array([1., None], dtype=np.dtype('f4'))
Out[10]: array([ 1., nan], dtype=float32)
```
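The coin-flip argument above can be sketched numerically (a hypothetical illustration using a plain NumPy boolean validity mask, not a Vortex API): zero-filling the lost observations biases the estimate toward tails, while filling with the observed sample mean leaves it unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
coins = rng.integers(0, 2, size=100).astype(float)  # heads=1, tails=0

# Lose 10 observations: mark them invalid with a separate validity mask.
valid = np.ones(100, dtype=bool)
valid[rng.choice(100, size=10, replace=False)] = False

# Zero-fill: treating NULL as tails drags the estimate downward.
zero_filled = np.where(valid, coins, 0.0)
biased_estimate = zero_filled.mean()

# Mean-imputation: fill NULLs with the observed sample mean instead.
sample_mean = coins[valid].mean()
imputed = np.where(valid, coins, sample_mean)
unbiased_estimate = imputed.mean()

# (90*m + 10*m) / 100 == m: imputing the mean leaves the estimate unchanged,
# while zero-filling yields 90*m / 100, which understates it.
assert np.isclose(unbiased_estimate, sample_mean)
assert biased_estimate <= unbiased_estimate
```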

@connortsui20 connortsui20 mentioned this pull request Mar 5, 2026

Physical shape favors Arrow compatibility and simpler stride math. Logical shape favors
NumPy/PyTorch compatibility and is arguably more intuitive for our users since Vortex has a logical
type system.

FWIW, I think torch/numpy integration matters more for tensors than arrow compatibility. There's no linear algebra library that natively works on arrow arrays.

Contributor Author


I agree, and the conversion will be cheap regardless
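As a rough sketch of why the conversion is cheap (hypothetical, expressed in NumPy terms rather than Vortex APIs): going between a physical (reversed) shape and the logical NumPy/PyTorch shape is just an axis permutation, i.e. an O(ndim) metadata-only view.

```python
import numpy as np

def to_logical_view(t: np.ndarray) -> np.ndarray:
    """Hypothetical helper: reinterpret a physically-shaped tensor under the
    logical (reversed-axes) convention. Only shape/stride metadata changes."""
    return t.transpose(tuple(range(t.ndim - 1, -1, -1)))

phys = np.arange(24, dtype=np.float32).reshape(4, 3, 2)
logical = to_logical_view(phys)
assert logical.shape == (2, 3, 4)
assert np.shares_memory(phys, logical)  # cheap: the buffer is shared, not copied
```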


@danking danking left a comment


looks great

@gatesn gatesn merged commit 0ffa944 into develop Mar 5, 2026
3 checks passed
@gatesn gatesn deleted the ct/tensor-revise branch March 5, 2026 15:56
4 participants