[discussion] [feature request] Native tensor-backed array of strings (at least a read-only one) and basic string processing functions for addition into core + discussion of extremely basic data frames (also for reducing python object heap pressure) #101699
Comments
It also appears that ONNX / onnxrt supports some sort of string dtype and numpy-like string arrays:
Native support for string arrays is useful for embedding vocabs inside models and for mapping token indices to strings.
String arrays also exist in NumPy via a dedicated dtype, so this could improve compatibility with NumPy too.
A related discussion on the importance of reducing GC pressure and the number of Python objects: native StringArrays plus some basic data frames would be useful for dataset classes.
Related: https://pytorch.org/rl/tensordict/saving.html#saving-memmory-mapped-tensordicts TensorDicts from https://github.com/pytorch/rl and https://github.com/pytorch-labs/tensordict seem to be an instantiation of this basic dataframe idea (https://pytorch.org/rl/tensordict/tutorials/data_fashion.html) but do not support strings yet. Could this (along with torcharrow) be a basis for the design of simple data frames in core PyTorch (enabling fast ffcv/tfrecord-like datasets)? cc @vmoens
@mikaylagawarecki btw maybe worth natively supporting apache arrow / arrow2 / hdf5, or expanding column-based mmap support as in https://twitter.com/rohanpaul_ai/status/1686472387091648512 ?
Another torch/arrow-related dataframe package with StringColumn support: https://github.com/wenleix/StructTorch by @wenleix
cc @dracifer |
This is an interesting topic; here is some context FWIW:
I am no longer actively working on TorchArrow and my thoughts/knowledge may be outdated. cc @dracifer
This idea pretty much matches our thoughts about adding "structure" to PyTorch to provide stronger types and memory layout, and to better bridge data and models with transformation libraries. cc @laurencer
Another aspect is padding/alignment to the memory page size (which is OS/machine-dependent), sometimes needed for mmap.
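A minimal sketch of such page-size-dependent layout, where each stored chunk starts on a page boundary (the helper name `page_aligned_offsets` is hypothetical, not an existing API):

```python
import mmap

def page_aligned_offsets(sizes, page_size=mmap.PAGESIZE):
    # Hypothetical sketch: lay out chunks in an mmap'ed file so each one
    # starts on a page boundary; the page size is OS/machine-dependent,
    # hence taken from mmap.PAGESIZE by default rather than hardcoded.
    offsets, cur = [], 0
    for size in sizes:
        offsets.append(cur)
        cur += -(-size // page_size) * page_size  # round size up to a page multiple
    return offsets

print(page_aligned_offsets([100, 5000, 1], page_size=4096))  # [0, 4096, 12288]
```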
As one padding-less form of representation, NestedTensors / JaggedLayout can be useful for these string arrays @cpuhrsch. The padded format (where all strings in the array occupy the same number of bytes) can be just a regular dense tensor, or maybe a dense tensor plus an int tensor (storing the string lengths of the array elements). Btw, I don't know whether Nested or some other built-in representation exists for representing padded tensors.
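A minimal sketch of the padded representation described above (a dense uint8 tensor with zero-padded rows plus an int64 lengths tensor); the helper names are hypothetical, not a PyTorch API:

```python
import torch

def to_padded_string_tensor(strings):
    # Hypothetical helper: encode a list of ASCII strings as one dense
    # uint8 tensor, rows zero-padded to the longest string, plus an
    # int64 tensor of the true per-string byte lengths.
    encoded = [s.encode("ascii") for s in strings]
    lengths = torch.tensor([len(b) for b in encoded], dtype=torch.int64)
    data = torch.zeros(len(encoded), int(lengths.max()), dtype=torch.uint8)
    for i, b in enumerate(encoded):
        data[i, : len(b)] = torch.tensor(list(b), dtype=torch.uint8)
    return data, lengths

def get_string(data, lengths, i):
    # Recover element i by slicing off the zero padding.
    return bytes(data[i, : int(lengths[i])].tolist()).decode("ascii")

data, lengths = to_padded_string_tensor(["cat", "zebra"])
print(data.shape, get_string(data, lengths, 0))  # torch.Size([2, 5]) cat
```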
Related: for model hosting, it is often useful to be able to store single strings and arrays of strings in tensors. Triton inference server supports TYPE_STRING: https://github.com/triton-inference-server/server/blob/r22.02/docs/model_configuration.md#datatypes (under the hood it appears as just a uint8 bytes tensor, so it needs to be decoded with some user-provided encoding). It would be nice if PyTorch natively supported some sort of string dtype (both for encoding a single string and, at least read-only, lists of strings).
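The Triton-style trick above (a single string stored as a uint8 bytes tensor, decoded with a user-provided encoding) can be sketched in a few lines; the helper names are hypothetical:

```python
import torch

def string_to_tensor(s, encoding="utf-8"):
    # Hypothetical sketch: store one string as a 1-D uint8 tensor of
    # its encoded bytes (the same representation Triton's TYPE_STRING
    # appears to use under the hood).
    return torch.tensor(list(s.encode(encoding)), dtype=torch.uint8)

def tensor_to_string(t, encoding="utf-8"):
    # Decoding needs the user-provided encoding, since the tensor
    # itself is just a bag of bytes.
    return bytes(t.tolist()).decode(encoding)

t = string_to_tensor("héllo wörld")
print(t.dtype, tensor_to_string(t))  # torch.uint8 héllo wörld
```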
🚀 The feature, motivation and pitch
This is useful to avoid copies related to copy-on-write (actually copy-on-read, because of Python's refcounting) problems with DataLoader: #13246. A typical application: lists of file names or file paths in a dataset (avoiding the creation of hundreds of thousands or millions of Python string objects on the heap), and string<->token lookup tables.
For fixed-size characters (ascii: uint8 / utf_16: int16 / utf_32: int32) there is my prototype in https://gist.github.com/vadimkantorov/86c3a46bf25bed3ad45d043ae86fff57 , but other designs can be considered. In essence, this initially means fast parallelized APIs for conversion from Python string lists and for accessing individual elements / sub-lists (maybe with parallelized string encoding/decoding). Different storage formats can be envisaged: e.g. a fixed-length string array (with some null-byte padding) or a no-padding packed format as in the gist above. I think that for practical use in compacting strings in datasets, the no-padding format is needed (although for parallelized hashing, fixed-length strings may be easier). It should also be decided whether (stable) hashes can be precomputed/cached/stored along with the strings.
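The no-padding packed format can be sketched as one flat byte tensor plus an offsets tensor marking element boundaries; `PackedStringArray` below is a hypothetical illustration of the idea, not the gist's actual API:

```python
import torch

class PackedStringArray:
    # Hypothetical read-only packed (no-padding) string array sketch:
    # all bytes concatenated in a single uint8 tensor, with an int64
    # offsets tensor marking element boundaries, so no per-string
    # Python object survives on the heap.
    def __init__(self, strings, encoding="utf-8"):
        blobs = [s.encode(encoding) for s in strings]
        offsets = [0]
        for b in blobs:
            offsets.append(offsets[-1] + len(b))
        self.encoding = encoding
        self.offsets = torch.tensor(offsets, dtype=torch.int64)
        self.data = torch.tensor(list(b"".join(blobs)), dtype=torch.uint8)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        # Decode element i on access; only two offset lookups and one slice.
        lo, hi = int(self.offsets[i]), int(self.offsets[i + 1])
        return bytes(self.data[lo:hi].tolist()).decode(self.encoding)

arr = PackedStringArray(["img/0001.jpg", "img/0002.jpg"])
print(len(arr), arr[1])  # 2 img/0002.jpg
```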
This would be useful for fast column-based string processing or for mmap'ing dataset files. NumPy / HDF5 / Apache Arrow / parquet / data frame libraries probably also have some support along these lines.
It seems that torcharrow.StringColumn might implement this. I think it's worth moving a string list class like this into core. Maybe even something more lightweight: a Tensor subclass, or even just methods for working with string-array-holding uint8/int16/int32 tensors, because this is very useful for working around #13246 and otherwise enables more economical, parallelized basic string / file path manipulation.
Useful string functions to include are parallelized string hashing methods that are stable (e.g. hashing all of the strings in the array at once). This could then be used for fast hashtable construction / key hash computation. Another useful concept could be "string lists" that allow appends (with some exponential storage reallocation): #64359
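A sketch of a stable, batched hash over a padded string array, using 32-bit FNV-1a as one example of a stable (run-to-run reproducible) hash; the function name and padded-input convention are assumptions for illustration:

```python
import torch

def fnv1a_batch(data, lengths):
    # Stable 32-bit FNV-1a hashes for a whole padded (N, L) uint8 string
    # array at once, vectorized across the N rows; the Python loop is
    # only over the L character positions.
    h = torch.full((data.shape[0],), 2166136261, dtype=torch.int64)
    for j in range(data.shape[1]):
        hj = ((h ^ data[:, j].to(torch.int64)) * 16777619) % (1 << 32)
        h = torch.where(lengths > j, hj, h)  # skip padding bytes
    return h

# Build a tiny zero-padded array for ["cat", "zebra"].
rows = [b"cat", b"zebra"]
lengths = torch.tensor([len(b) for b in rows], dtype=torch.int64)
data = torch.zeros(len(rows), int(lengths.max()), dtype=torch.uint8)
for i, b in enumerate(rows):
    data[i, : len(b)] = torch.tensor(list(b), dtype=torch.uint8)

print(fnv1a_batch(data, lengths).tolist())
```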
Related issues on "zero-copy": #43949, #33041, #34651 (about getting a bytes view over a sub-tensor - this can be useful as an ascii string substitute, and in general for zero-copy PyTorch interop; I wonder if Python has some native string views over utf-8 strings?). It may even be possible to hack around the CPython PyUnicode structure and create a "view" over char bytes (stored in a tensor) without any char byte copies (although this is maybe not very safe): python/cpython#104689 https://stackoverflow.com/questions/76291943/create-a-python-string-from-a-native-pointer-without-char-buffer-copy

Going further, maybe some simplistic dataframe class can be added to PyTorch (a tuple of tensors with equal leftmost dim). These dataframes would primarily be used for simple dataset serialization/deserialization, filtering and transformation. Ideally, a dataframe should support two modes of serialization: array-of-structs and column-based. Imagine having a list of COCO per-image annotation objects and just giving it to some sort of dataframe constructor (maybe along with some schema/spec) and getting back a set of column tensors (with some helper accessor methods). This dataframe could be scattered without copies to DataLoader workers. Native CUDA-accelerated basic CSV parsing could also be nice (especially if combined with mmap-based file reading?). I can see that this is implemented by torcharrow - maybe it's time to move some of its core structures to core?
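The array-of-structs to columns conversion can be sketched for COCO-like annotations as follows; `records_to_columns` and its schema format (field name to dtype) are hypothetical, made up for illustration:

```python
import torch

def records_to_columns(records, schema):
    # Hypothetical sketch: convert a list of per-annotation dicts
    # (array-of-structs) into a dict of column tensors that all share
    # the same leftmost dim; `schema` maps field name -> torch dtype.
    return {
        name: torch.tensor([r[name] for r in records], dtype=dtype)
        for name, dtype in schema.items()
    }

# COCO-like per-annotation records (field names for illustration only).
records = [
    {"image_id": 1, "category_id": 18, "area": 702.1},
    {"image_id": 1, "category_id": 18, "area": 123.4},
    {"image_id": 2, "category_id": 1,  "area": 55.0},
]
cols = records_to_columns(
    records,
    {"image_id": torch.int64, "category_id": torch.int64, "area": torch.float32},
)
print(cols["image_id"].tolist())  # [1, 1, 2]
```

The resulting column tensors could then be shared with DataLoader workers without per-record Python objects.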
Discussion of conversion of nested structures to columns:
Maybe some simple nested schemas can be supported first:
These might be enough to represent data annotation schemas of common datasets (?)