
[discussion] [feature request] Native tensor-backed array of strings (at least a read-only one) and basic string processing functions for addition into core + discussion of extremely basic data frames (also for reducing python object heap pressure) #101699

Open
vadimkantorov opened this issue May 17, 2023 · 13 comments
Labels
feature: A request for a proper, new feature.
needs research: We need to decide whether or not this merits inclusion, based on research world
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@vadimkantorov (Contributor) commented May 17, 2023

🚀 The feature, motivation and pitch

This is useful to avoid copies related to copy-on-write (actually copy-on-read, because of Python's finicky ref-counting) problems with DataLoader: #13246. Typical applications: lists of file names or file paths in a dataset (avoiding the creation of hundreds of thousands or millions of Python string objects on the heap), and string <-> token lookup tables.

For fixed-size characters (ASCII: uint8 / UTF-16: int16 / UTF-32: int32) there is my prototype in https://gist.github.com/vadimkantorov/86c3a46bf25bed3ad45d043ae86fff57 , but other designs can be considered. In essence, this initially means fast, parallelized APIs for converting from Python string lists and for accessing individual elements / sub-lists (maybe with parallelized string encoding/decoding). Different storage formats can be envisaged: e.g. a fixed-length string array (with some null-byte padding) or a no-padding packed format as in the gist above. I think that for practical compaction of strings in datasets the no-padding format is needed (although for parallelized hashing, fixed-length strings may be easier). It should also be decided whether (stable) hashes can be precomputed/cached/stored along with the strings.
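
To make the packed variant concrete, here is a minimal sketch (the class name and methods are illustrative, not the gist's actual API), assuming UTF-8 bytes stored in a uint8 data tensor plus an int64 offsets tensor:

```python
import torch

class PackedStringArray:
    """Minimal sketch of a read-only, tensor-backed string array using the
    no-padding packed format: the bytes of all strings are concatenated into
    one uint8 tensor, and an int64 offsets tensor marks the boundaries
    (len(offsets) == len(strings) + 1)."""

    def __init__(self, strings, encoding='utf-8'):
        encoded = [s.encode(encoding) for s in strings]
        self.encoding = encoding
        # offsets[i] .. offsets[i + 1] delimit the bytes of the i-th string
        self.offsets = torch.tensor([0] + [len(b) for b in encoded], dtype=torch.int64).cumsum(0)
        self.data = torch.frombuffer(bytearray(b''.join(encoded)), dtype=torch.uint8)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        begin, end = self.offsets[i].item(), self.offsets[i + 1].item()
        return bytes(self.data[begin:end].tolist()).decode(self.encoding)

paths = PackedStringArray(['train/img_000001.jpg', 'val/img_000002.jpg'])
assert len(paths) == 2 and paths[1] == 'val/img_000002.jpg'
```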

This would be useful for fast column-based string processing or for mmap'ing dataset files. NumPy / HDF5 / Apache Arrow / Parquet / data-frame libraries probably also have some support along these lines.

It seems that torcharrow.StringColumn might implement this. I think it's worth moving a string list class like this into core. Maybe even something more lightweight: a Tensor subclass, or even just methods for working with string-array-holding uint8/int16/int32 tensors, because it's very useful for working around #13246 and, more generally, for more economical / parallelized basic string / file-path manipulation.

A useful string function to include would be a parallelized, stable string-hashing method (e.g. hashing all of the strings in the array at once). This could then be used for fast hash-table construction / key-hash computation. Another useful concept could be "string lists" that allow appends (with some exponential storage reallocation): #64359
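
As a rough illustration of "hash all of the strings at once", here is a hedged sketch of a vectorized polynomial rolling hash over the fixed-length zero-padded representation; the function name and constants are made up for the example and are not an existing PyTorch API:

```python
import torch

def stable_hash_padded(padded, base=257, mod=2**31 - 1):
    """Vectorized polynomial rolling hash over a [num_strings, max_len] byte tensor
    whose rows are right-padded with zeros. All strings are hashed together, one
    column at a time, and the result is stable across runs and processes (unlike
    Python's randomized built-in str hash)."""
    h = torch.zeros(padded.shape[0], dtype=torch.int64)
    for column in padded.to(torch.int64).unbind(dim=1):
        h = (h * base + column) % mod  # stays well within int64 range for these constants
    return h

byte_rows = [torch.frombuffer(bytearray(s.encode('utf-8')), dtype=torch.uint8)
             for s in ['cat', 'horse', 'giraffe']]
padded = torch.nn.utils.rnn.pad_sequence(byte_rows, batch_first=True)
print(stable_hash_padded(padded))  # one int64 hash per string
```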

Related issues on "zero-copy": #43949, #33041, #34651 (about getting a bytes view over a sub-tensor; this can be useful as an ASCII string substitute and, in general, for zero-copy PyTorch interop. I wonder if Python has some native string views over UTF-8 strings?). It also seems there's an option to hack around CPython's PyUnicode structure and create a "view" over char bytes (stored in a tensor) without any char-byte copies (although it's maybe not very safe): python/cpython#104689 , https://stackoverflow.com/questions/76291943/create-a-python-string-from-a-native-pointer-without-char-buffer-copy


Going further, maybe some simplistic dataframe class could be added to PyTorch (a tuple of tensors sharing the same leftmost dim). These dataframes would primarily be used for simple dataset serialization/deserialization, filtering and transformation. Ideally, a dataframe should support two modes of serialization: array-of-structs and column-based. Imagine having a list of COCO per-image annotation objects and just giving it to some sort of dataframe constructor (maybe along with some schema/spec) and getting back a set of column tensors (with some helper accessor methods). This dataframe could then be scattered without copies to DataLoader workers. Native CUDA-accelerated basic CSV parsing could also be nice (especially if combined with mmap-based file reading?). I can see that this is implemented by torcharrow; maybe it's time to move some of its core structures into core?

Discussion of conversion of nested structures to columns:

Maybe some simple nested schemas can be supported first:

  • array of dicts of primitive types
  • array of dicts with nested arrays of dicts of primitive types

These might be enough to represent data annotation schemas of common datasets (?)
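
To illustrate the first case above, a tiny sketch of columnizing an array of dicts of numeric primitives (the COCO-like field names below are made up):

```python
import torch

def dicts_to_columns(records):
    """Turn a list of dicts with numeric primitive fields into a dict of 1-D
    tensors (one column per key), all sharing the same leftmost dimension."""
    keys = list(records[0].keys())
    return {k: torch.tensor([r[k] for r in records]) for k in keys}

# Hypothetical COCO-like per-image annotations (field names are made up here)
annotations = [
    {'image_id': 1, 'width': 640, 'height': 480},
    {'image_id': 2, 'width': 512, 'height': 512},
]
columns = dicts_to_columns(annotations)
assert columns['width'].tolist() == [640, 512]
```

String-valued fields would need something like the packed string array sketched earlier, and nested arrays of dicts would additionally need per-record offsets (similar to Arrow's list columns).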

@ngimel added the feature and triage review labels on May 17, 2023
@albanD added the triaged and needs research labels and removed the triage review label on May 22, 2023
@vadimkantorov changed the title from "[discussion] Native tensor-backed string array" to "[discussion] [feature request] Native tensor-backed string array and basic string processing functions for addition into core + discussion of extremely basic data frames" on May 23, 2023
@vadimkantorov (Contributor, Author) commented May 27, 2023

It also appears that ONNX / onnxruntime support some sort of string dtype and NumPy-like string arrays:

Native support for string arrays is useful for embedding vocabs inside models and for mapping token indices to strings.

@vadimkantorov (Contributor, Author) commented May 27, 2023

In NumPy, string arrays also exist via the dtype numpy.str_ (https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.str_), although it seems to always use the fixed-width UCS-4 (UTF-32) format, spending 4 bytes per character.
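
A quick check of the 4-bytes-per-character point (fixed-width UCS-4, padded to the longest string in the array):

```python
import numpy as np

arr = np.array(['cat', 'horse'], dtype=np.str_)
# Fixed-width UCS-4: 5 characters * 4 bytes = 20 bytes per element, even for 'cat'
print(arr.dtype, arr.itemsize)  # <U5 20
```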

This could also be useful for better compatibility with NumPy.

@vadimkantorov (Contributor, Author) commented Jun 10, 2023

A related discussion on the importance of reducing GC pressure and the number of Python objects: #103339

So native StringArrays + some basic data frames are a useful idea for dataset classes.

@vadimkantorov (Contributor, Author) commented Jun 15, 2023

Related: https://pytorch.org/rl/tensordict/saving.html#saving-memmory-mapped-tensordicts

TensorDicts from https://github.com/pytorch/rl and https://github.com/pytorch-labs/tensordict seem to be an instantiation of this basic dataframe idea (https://pytorch.org/rl/tensordict/tutorials/data_fashion.html) but do not support strings yet. Could this (along with torcharrow) be a basis for the design of simple data frames in core PyTorch (enabling fast ffcv/tfrecord-like datasets)?

cc @vmoens

@vmoens (Contributor) commented Jun 15, 2023

That would be an interesting feature to add to tensordict, but it would need some discussion.
For usage in LLMs it would definitely be super useful.

cc @tcbegley @apbard

@vadimkantorov (Contributor, Author) commented Aug 1, 2023

@mikaylagawarecki btw, maybe it's worth natively supporting Apache Arrow / arrow2 / HDF5, or expanding column-based mmap support as in https://twitter.com/rohanpaul_ai/status/1686472387091648512 ?

@vadimkantorov (Contributor, Author):

Another torch/arrow related dataframe package with StringColumn support: https://github.com/wenleix/StructTorch by @wenleix

@wenleix (Contributor) commented Aug 8, 2023

> Another torch/arrow related dataframe package with StringColumn support: https://github.com/wenleix/StructTorch

cc @dracifer

@vadimkantorov vadimkantorov changed the title [discussion] [feature request] Native tensor-backed string array and basic string processing functions for addition into core + discussion of extremely basic data frames [discussion] [feature request] Native tensor-backed string array and basic string processing functions for addition into core + discussion of extremely basic data frames (also for reducing python object heap pressure) Aug 8, 2023
@wenleix (Contributor) commented Aug 8, 2023

> It seems that torcharrow.StringColumn might implement this. I think it's worth moving a string list class like this into core. Maybe even something more lightweight: a Tensor subclass, or even just methods for working with string-array-holding uint8/int16/int32 tensors, because it's very useful for working around #13246 and, more generally, for more economical / parallelized basic string / file-path manipulation.

This is an interesting topic; here is some context FWIW:

I am no longer actively working on TorchArrow and my thoughts/knowledge may be outdated. cc @dracifer

@dracifer (Contributor):

> Going further, maybe some simplistic dataframe class could be added to PyTorch (a tuple of tensors sharing the same leftmost dim). These dataframes would primarily be used for simple dataset serialization/deserialization, filtering and transformation. Ideally, a dataframe should support two modes of serialization: array-of-structs and column-based.

This idea pretty much matches our thoughts about adding "structure" to PyTorch to provide stronger types and memory layouts, and to better bridge data and models with transformation libraries. cc @laurencer

@vadimkantorov (Contributor, Author):

Another aspect is padding / alignment to the memory page size (which is OS/machine-dependent), which is sometimes needed for mmap.

@vadimkantorov (Contributor, Author) commented Nov 17, 2023

As one padding-less form of representation, NestedTensors / the jagged layout could be useful for these string arrays @cpuhrsch. The padded format (where all strings in the array occupy the same number of bytes) can be just a regular dense tensor, or maybe a dense tensor + an int tensor (storing the per-string lengths). Btw, I don't know whether NestedTensor or some other built-in representation exists for representing padded tensors.
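
A small sketch contrasting the two representations (padding-less jagged vs. dense + lengths), assuming UTF-8 bytes; torch.nested is still a prototype API, so this is illustrative only:

```python
import torch

byte_rows = [torch.frombuffer(bytearray(s.encode('utf-8')), dtype=torch.uint8)
             for s in ['cat', 'horse', 'giraffe']]

# Padding-less: a nested (jagged) tensor of per-string byte tensors
nested = torch.nested.nested_tensor(byte_rows)

# Padded alternative: a dense [num_strings, max_len] tensor plus an int lengths tensor
lengths = torch.tensor([t.numel() for t in byte_rows])
padded = torch.nn.utils.rnn.pad_sequence(byte_rows, batch_first=True)
```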

@vadimkantorov changed the title from "[discussion] [feature request] Native tensor-backed string array and basic string processing functions for addition into core + discussion of extremely basic data frames (also for reducing python object heap pressure)" to "[discussion] [feature request] Native tensor-backed array of strings (at least a read-only one) and basic string processing functions for addition into core + discussion of extremely basic data frames (also for reducing python object heap pressure)" on Dec 20, 2023
@vadimkantorov (Contributor, Author) commented Dec 27, 2023

Related:

For model hosting, it's often useful to be able to store single strings, and arrays of strings, in tensors.

Triton Inference Server supports TYPE_STRING: https://github.com/triton-inference-server/server/blob/r22.02/docs/model_configuration.md#datatypes (it appears to be just a uint8 bytes tensor under the hood, so it needs to be decoded with some user-provided encoding).

It would be nice if PyTorch natively supported some sort of string dtype (both for encoding a single string and for (at least read-only) lists of strings).
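
A minimal illustration of the "uint8 bytes tensor + user-provided encoding" pattern for a single string (just plain tensor ops, no special dtype):

```python
import torch

s = 'zürich.jpg'
t = torch.frombuffer(bytearray(s.encode('utf-8')), dtype=torch.uint8)  # 11 raw bytes
assert bytes(t.tolist()).decode('utf-8') == s  # the tensor itself carries no encoding metadata
```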
