
[discussion] [feature request] Native tensor-backed array of strings (at least a read-only one) and basic string processing functions for addition into core + discussion of extremely basic data frames (also for reducing python object heap pressure) #101699

Open
vadimkantorov opened this issue May 17, 2023 · 13 comments
Labels
feature: A request for a proper, new feature.
needs research: We need to decide whether or not this merits inclusion, based on research world
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@vadimkantorov (Contributor) commented May 17, 2023

🚀 The feature, motivation and pitch

This is useful to avoid copies related to copy-on-write (actually copy-on-read, because of Python's finicky ref-counting) problems with DataLoader: #13246. Typical applications: lists of file names or file paths in a dataset (avoiding the creation of hundreds of thousands or millions of Python string objects on the heap), and string <-> token lookup tables.

For fixed-size characters (ASCII: uint8 / UTF-16: int16 / UTF-32: int32) there is my prototype in https://gist.github.com/vadimkantorov/86c3a46bf25bed3ad45d043ae86fff57 , but other designs can be considered. In essence, this initially means fast, parallelized APIs for converting from Python string lists and for accessing individual elements / sub-lists (maybe with parallelized string encoding/decoding). Different storage formats can be envisaged: e.g. a fixed-length string array (with some null-byte padding) or a no-padding packed format as in the gist above. I think that for practical compaction of strings in datasets the no-padding format is needed (although for parallelized hashing, fixed-length strings may be easier). It should also be decided whether (stable) hashes can be precomputed/cached/stored along with the strings.
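
To make the packed variant concrete, here is a minimal sketch (the class name and methods are illustrative, not the gist's actual API), assuming UTF-8 bytes stored in a uint8 data tensor plus an int64 offsets tensor:

```python
import torch

class PackedStringArray:
    """Minimal sketch of a read-only, tensor-backed string array using the
    no-padding packed format: the bytes of all strings are concatenated into
    one uint8 tensor, and an int64 offsets tensor marks the boundaries
    (len(offsets) == len(strings) + 1)."""

    def __init__(self, strings, encoding='utf-8'):
        encoded = [s.encode(encoding) for s in strings]
        self.encoding = encoding
        # offsets[i] .. offsets[i + 1] delimit the bytes of the i-th string
        self.offsets = torch.tensor([0] + [len(b) for b in encoded], dtype=torch.int64).cumsum(0)
        self.data = torch.frombuffer(bytearray(b''.join(encoded)), dtype=torch.uint8)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        begin, end = self.offsets[i].item(), self.offsets[i + 1].item()
        return bytes(self.data[begin:end].tolist()).decode(self.encoding)

paths = PackedStringArray(['train/img_000001.jpg', 'val/img_000002.jpg'])
assert len(paths) == 2 and paths[1] == 'val/img_000002.jpg'
```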

This would be useful for fast column-based string processing or for mmap'ing dataset files. NumPy / HDF5 / Apache Arrow / Parquet / data-frame libraries probably also have some support along these lines.

It seems that torcharrow.StringColumn might implement this. I think it's worth moving a string list class like this into core. Maybe even something more lightweight: a Tensor subclass, or even just methods for working with string-array-holding uint8/int16/int32 tensors, because it's very useful for working around #13246 and, more generally, for more economical / parallelized basic string / file-path manipulation.

A useful string function to include would be a parallelized, stable string-hashing method (e.g. hashing all of the strings in the array at once). This could then be used for fast hash-table construction / key-hash computation. Another useful concept could be "string lists" that allow appends (with some exponential storage reallocation): #64359
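
As a rough illustration of "hash all of the strings at once", here is a hedged sketch of a vectorized polynomial rolling hash over the fixed-length zero-padded representation; the function name and constants are made up for the example and are not an existing PyTorch API:

```python
import torch

def stable_hash_padded(padded, base=257, mod=2**31 - 1):
    """Vectorized polynomial rolling hash over a [num_strings, max_len] byte tensor
    whose rows are right-padded with zeros. All strings are hashed together, one
    column at a time, and the result is stable across runs and processes (unlike
    Python's randomized built-in str hash)."""
    h = torch.zeros(padded.shape[0], dtype=torch.int64)
    for column in padded.to(torch.int64).unbind(dim=1):
        h = (h * base + column) % mod  # stays well within int64 range for these constants
    return h

byte_rows = [torch.frombuffer(bytearray(s.encode('utf-8')), dtype=torch.uint8)
             for s in ['cat', 'horse', 'giraffe']]
padded = torch.nn.utils.rnn.pad_sequence(byte_rows, batch_first=True)
print(stable_hash_padded(padded))  # one int64 hash per string
```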

Related issues on "zero-copy": #43949, #33041, #34651 (about getting a bytes view over a sub-tensor; this can be useful as an ASCII string substitute and, in general, for zero-copy PyTorch interop. I wonder if Python has some native string views over UTF-8 strings?). It also seems there's an option to hack around CPython's PyUnicode structure and create a "view" over char bytes (stored in a tensor) without any char-byte copies (although it's maybe not very safe): python/cpython#104689 , https://stackoverflow.com/questions/76291943/create-a-python-string-from-a-native-pointer-without-char-buffer-copy


Going further, maybe some simplistic dataframe class could be added to PyTorch (a tuple of tensors sharing the same leftmost dim). These dataframes would primarily be used for simple dataset serialization/deserialization, filtering and transformation. Ideally, a dataframe should support two modes of serialization: array-of-structs and column-based. Imagine having a list of COCO per-image annotation objects and just giving it to some sort of dataframe constructor (maybe along with some schema/spec) and getting back a set of column tensors (with some helper accessor methods). This dataframe could then be scattered without copies to DataLoader workers. Native CUDA-accelerated basic CSV parsing could also be nice (especially if combined with mmap-based file reading?). I can see that this is implemented by torcharrow; maybe it's time to move some of its core structures into core?

Discussion of conversion of nested structures to columns:

Maybe some simple nested schemas can be supported first:

  • array of dicts of primitive types
  • array of dicts with nested arrays of dicts of primitive types

These might be enough to represent data annotation schemas of common datasets (?)
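
To illustrate the first case above, a tiny sketch of columnizing an array of dicts of numeric primitives (the COCO-like field names below are made up):

```python
import torch

def dicts_to_columns(records):
    """Turn a list of dicts with numeric primitive fields into a dict of 1-D
    tensors (one column per key), all sharing the same leftmost dimension."""
    keys = list(records[0].keys())
    return {k: torch.tensor([r[k] for r in records]) for k in keys}

# Hypothetical COCO-like per-image annotations (field names are made up here)
annotations = [
    {'image_id': 1, 'width': 640, 'height': 480},
    {'image_id': 2, 'width': 512, 'height': 512},
]
columns = dicts_to_columns(annotations)
assert columns['width'].tolist() == [640, 512]
```

String-valued fields would need something like the packed string array sketched earlier, and nested arrays of dicts would additionally need per-record offsets (similar to Arrow's list columns).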

@ngimel added the feature and triage review labels on May 17, 2023
@albanD added the triaged and needs research labels and removed the triage review label on May 22, 2023
@vadimkantorov changed the title from "[discussion] Native tensor-backed string array" to "[discussion] [feature request] Native tensor-backed string array and basic string processing functions for addition into core + discussion of extremely basic data frames" on May 23, 2023
@vadimkantorov (Contributor, Author) commented May 27, 2023

It also appears that ONNX / onnxruntime support some sort of string dtype and NumPy-like string arrays:

Native support for string arrays is useful for embedding vocabs inside models and for mapping token indices to strings.

@vadimkantorov (Contributor, Author) commented May 27, 2023

In NumPy, string arrays also exist via the dtype numpy.str_ (https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.str_), although it seems to always use the fixed-width UCS-4 (UTF-32) format, spending 4 bytes per character.
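
A quick check of the 4-bytes-per-character point (fixed-width UCS-4, padded to the longest string in the array):

```python
import numpy as np

arr = np.array(['cat', 'horse'], dtype=np.str_)
# Fixed-width UCS-4: 5 characters * 4 bytes = 20 bytes per element, even for 'cat'
print(arr.dtype, arr.itemsize)  # <U5 20
```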

This could also be useful for better compatibility with NumPy.

@vadimkantorov (Contributor, Author) commented Jun 10, 2023

A related discussion on the importance of reducing GC pressure and the number of Python objects: #103339

So native StringArrays + some basic data frames are a useful idea for dataset classes.

@vadimkantorov (Contributor, Author) commented Jun 15, 2023

Related: https://pytorch.org/rl/tensordict/saving.html#saving-memmory-mapped-tensordicts

TensorDicts from https://github.com/pytorch/rl and https://github.com/pytorch-labs/tensordict seem to be an instantiation of this basic dataframe idea (https://pytorch.org/rl/tensordict/tutorials/data_fashion.html) but do not support strings yet. Could this (along with torcharrow) be a basis for the design of simple data frames in core PyTorch (enabling fast ffcv/tfrecord-like datasets)?

cc @vmoens

@vmoens (Contributor) commented Jun 15, 2023

That would be an interesting feature to add to tensordict, but it would need some discussion.
For usage in LLMs it would definitely be super useful.

cc @tcbegley @apbard

@vadimkantorov (Contributor, Author) commented Aug 1, 2023

@mikaylagawarecki btw, maybe it's worth natively supporting Apache Arrow / arrow2 / HDF5, or expanding column-based mmap support as in https://twitter.com/rohanpaul_ai/status/1686472387091648512 ?

@vadimkantorov (Contributor, Author):

Another torch/arrow related dataframe package with StringColumn support: https://github.com/wenleix/StructTorch by @wenleix

@wenleix (Contributor) commented Aug 8, 2023

> Another torch/arrow related dataframe package with StringColumn support: https://github.com/wenleix/StructTorch

cc @dracifer

@vadimkantorov vadimkantorov changed the title [discussion] [feature request] Native tensor-backed string array and basic string processing functions for addition into core + discussion of extremely basic data frames [discussion] [feature request] Native tensor-backed string array and basic string processing functions for addition into core + discussion of extremely basic data frames (also for reducing python object heap pressure) Aug 8, 2023
@wenleix (Contributor) commented Aug 8, 2023

> It seems that torcharrow.StringColumn might implement this. I think it's worth moving a string list class like this into core. Maybe even something more lightweight: a Tensor subclass, or even just methods for working with string-array-holding uint8/int16/int32 tensors, because it's very useful for working around #13246 and, more generally, for more economical / parallelized basic string / file-path manipulation.

This is an interesting topic; here is some context FWIW:

I am no longer actively working on TorchArrow and my thoughts/knowledge may be outdated. cc @dracifer

@dracifer (Contributor):

> Going further, maybe some simplistic dataframe class could be added to PyTorch (a tuple of tensors sharing the same leftmost dim). These dataframes would primarily be used for simple dataset serialization/deserialization, filtering and transformation. Ideally, a dataframe should support two modes of serialization: array-of-structs and column-based.

This idea pretty much matches our thoughts about adding "structure" to PyTorch to provide stronger types and memory layouts, and to better bridge data and models with transformation libraries. cc @laurencer

@vadimkantorov (Contributor, Author):

Another aspect is padding / alignment to the memory page size (which is OS/machine-dependent), which is sometimes needed for mmap.

@vadimkantorov (Contributor, Author) commented Nov 17, 2023

As one padding-less form of representation, NestedTensors / the jagged layout could be useful for these string arrays @cpuhrsch. The padded format (where all strings in the array occupy the same number of bytes) can be just a regular dense tensor, or maybe a dense tensor + an int tensor (storing the per-string lengths). Btw, I don't know whether NestedTensor or some other built-in representation exists for representing padded tensors.
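
A small sketch contrasting the two representations (padding-less jagged vs. dense + lengths), assuming UTF-8 bytes; torch.nested is still a prototype API, so this is illustrative only:

```python
import torch

byte_rows = [torch.frombuffer(bytearray(s.encode('utf-8')), dtype=torch.uint8)
             for s in ['cat', 'horse', 'giraffe']]

# Padding-less: a nested (jagged) tensor of per-string byte tensors
nested = torch.nested.nested_tensor(byte_rows)

# Padded alternative: a dense [num_strings, max_len] tensor plus an int lengths tensor
lengths = torch.tensor([t.numel() for t in byte_rows])
padded = torch.nn.utils.rnn.pad_sequence(byte_rows, batch_first=True)
```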

@vadimkantorov changed the title from "[discussion] [feature request] Native tensor-backed string array and basic string processing functions for addition into core + discussion of extremely basic data frames (also for reducing python object heap pressure)" to "[discussion] [feature request] Native tensor-backed array of strings (at least a read-only one) and basic string processing functions for addition into core + discussion of extremely basic data frames (also for reducing python object heap pressure)" on Dec 20, 2023
@vadimkantorov (Contributor, Author) commented Dec 27, 2023

Related:

For model hosting, it's often useful to be able to store single strings, and arrays of strings, in tensors.

Triton Inference Server supports TYPE_STRING: https://github.com/triton-inference-server/server/blob/r22.02/docs/model_configuration.md#datatypes (it appears to be just a uint8 bytes tensor under the hood, so it needs to be decoded with some user-provided encoding).

It would be nice if PyTorch natively supported some sort of string dtype (both for encoding a single string and for (at least read-only) lists of strings).
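
A minimal illustration of the "uint8 bytes tensor + user-provided encoding" pattern for a single string (just plain tensor ops, no special dtype):

```python
import torch

s = 'zürich.jpg'
t = torch.frombuffer(bytearray(s.encode('utf-8')), dtype=torch.uint8)  # 11 raw bytes
assert bytes(t.tolist()).decode('utf-8') == s  # the tensor itself carries no encoding metadata
```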
