This repository has been archived by the owner on Aug 15, 2019. It is now read-only.

Add string dtype to Tensor #1408

Merged
merged 31 commits into from Nov 27, 2018

Conversation

dsmilkov
Contributor

@dsmilkov dsmilkov commented Nov 20, 2018

Add string dtype to Tensor.

This opens the door for adding Python's string_ops to TensorFlow.js, which are used for text-based models, and for adding pre-processing layers that operate on strings.

Details:

  • dtype was not added as a generic to the Tensor class in order to keep compiler errors simple and code backwards compatible.
  • dataSync() can be optionally typed to cast its result. E.g. t.dataSync<'string'>() returns string[] while t.dataSync() returns TypedArray for backwards compatibility.
  • layers and converter pass with this build. node has 30ish failed tests since string is an unknown dtype.
  • Only clone, reshape and cast work with strings at this point to keep this PR small. Other ops will get the functionality in a follow-up PR.
  • Added unit tests to assert that numeric ops throw on string tensors.
  • Backends should now support dtype string in their register/write/read methods.
  • Added a VSCode config for debugging directly from VSCode.
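The optionally-typed `dataSync()` described above can be sketched with a default generic parameter. This is a minimal, self-contained illustration; `DataTypeMap` and `SketchTensor` are hypothetical stand-ins, not the actual tfjs-core types:

```typescript
// Illustrative sketch of an optionally-typed dataSync().
// DataTypeMap and SketchTensor are hypothetical, not tfjs-core APIs.
type DataTypeMap = {
  float32: Float32Array;
  int32: Int32Array;
  bool: Uint8Array;
  complex64: Float32Array;
  string: string[];
};

class SketchTensor {
  constructor(private values: Float32Array | string[]) {}
  // Defaults to the numeric case so existing callers stay backwards compatible.
  dataSync<D extends keyof DataTypeMap = 'float32'>(): DataTypeMap[D] {
    return this.values as unknown as DataTypeMap[D];
  }
}

const t = new SketchTensor(['a', 'bb', 'ccc']);
const vals: string[] = t.dataSync<'string'>(); // typed as string[]
console.log(vals.join(',')); // a,bb,ccc
```

Calling `t.dataSync()` with no type argument is still typed as `Float32Array`, which is what keeps existing numeric call sites compiling unchanged.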

FEATURE


This change is Reviewable

@dsmilkov dsmilkov changed the title NOT READY YET Tensor<string> Add string as dtype to Tensor Nov 26, 2018
@dsmilkov dsmilkov changed the title Add string as dtype to Tensor Add string dtype to Tensor Nov 26, 2018
Contributor

@nkreeger nkreeger left a comment


:lgtm:

Looks fairly straightforward.

Food for thought: do you think tackling large PRs like this could be piecemealed into a working branch? Even if tests don't pass, I wonder if we can avoid very large PRs (reviews can miss some details).

Reviewed 43 of 43 files at r1.
Reviewable status: 0 of 1 approvals obtained (waiting on @dsmilkov, @nsthorat, @caisq, and @nkreeger)


.vscode/launch.json, line 3 at r1 (raw file):

  // Use IntelliSense to learn about possible attributes.
  // Hover to view descriptions of existing attributes.
  // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387

Curious - is this auto-gen in VSCode? Looks like this is a config to attach VSCode to a Chromium process?


src/buffer_test.ts, line 26 at r1 (raw file):

buff.set(2, 0, 1, 0);

Nit - does it make sense to add an actual floating point number here? 2.1 or something.


src/types.ts, line 37 at r1 (raw file):

Quoted 5 lines of code…
  float32: number;
  int32: number;
  bool: boolean;
  complex64: number;
  string: string;

Does the order matter here? If not, do you think it makes sense to order these by size?

bool,
int32,
float32,
complex64,
string

Contributor

@nsthorat nsthorat left a comment


Can you add an explicit test for .data<'string'> and .dataSync<'string'> to test the compiler?

Reviewed 43 of 43 files at r1.
Reviewable status: 0 of 1 approvals obtained (waiting on @dsmilkov, @nsthorat, and @caisq)


src/engine.ts, line 291 at r1 (raw file):

    if (refCount <= 1) {
      if (a.dtype === 'string') {
        this.numBytes -= bytesFromStringArray(a.dataSync<'string'>());

I have a feeling dataSync() on this will be expensive in Node.js; one thought would be to keep the size when you write strings.
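The suggestion above can be sketched as caching the byte size once at write time, so disposal never needs a (potentially expensive) dataSync(). The names here are hypothetical, not the actual engine internals:

```typescript
// Sketch (assumption): compute the string byte size once when the tensor is
// written, and read the cached value at disposal time.
function bytesFromStringArray(arr: string[]): number {
  // Approximation used in this PR: 2 bytes per character.
  return arr.reduce((bytes, s) => bytes + s.length * 2, 0);
}

class StringTensorData {
  readonly byteSize: number;
  constructor(readonly values: string[]) {
    this.byteSize = bytesFromStringArray(values); // cached at write time
  }
}

const data = new StringTensorData(['a', 'bb', 'ccc']);
console.log(data.byteSize); // 12 (6 characters * 2 bytes)
```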


src/jasmine_util.ts, line 96 at r1 (raw file):

export let TEST_ENVS: TestEnv[] = [
  {
    name: 'test-webgl1',

Just want to point this test out to you which now will have slightly different behavior: https://github.com/tensorflow/tfjs-core/blob/master/src/engine_test.ts#L511


src/tape.ts, line 172 at r1 (raw file):

      // Call the gradient function.
      const dx = inputGradients[inputName]();
      if (dx.dtype !== 'float32') {

can we test this?


src/tensor_test.ts, line 1373 at r1 (raw file):

        '  shape: [3]\n' +
        '  values:\n' +
        '    [a, bb, ccc]');

nit (optional): for consistency with console.log(), each element should be wrapped with a quote


src/kernels/backend_cpu_test.ts, line 55 at r1 (raw file):

  it('register string tensor with values', () => {
    const backend = new MathBackendCPU();
    tf.ENV.registerBackend('test-storage', () => backend);

you can put this in beforeEach to parallel afterEach

Contributor Author

@dsmilkov dsmilkov left a comment


Thanks! I tried to make the PR as small as possible; most of it is unit tests and assertions on updated error messages. To get a better sense of the core change in this PR, see master...9351969, which compares one of the earliest commits in this branch with master (no unit tests, and it's about 300 lines of diff).

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @dsmilkov, and @caisq)


.vscode/launch.json, line 3 at r1 (raw file):

Previously, nkreeger (Nick Kreeger) wrote…
  // Use IntelliSense to learn about possible attributes.
  // Hover to view descriptions of existing attributes.
  // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387

Curious - is this auto-gen in VSCode? Looks like this is a config to attach VSCode to a Chromium process?

Not auto-gen, but the recommended settings from VSCode if you want to debug Karma/Chrome directly in VSCode. I used it in this PR to save time and wanted to commit it so everyone can benefit; all you need is the Chrome debugger plugin for VSCode.

Contributor Author

@dsmilkov dsmilkov left a comment


Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @dsmilkov, and @caisq)


.vscode/launch.json, line 3 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

Not auto-gen, but the recommended setting from vscode if you want to debug karma chrome directly in vscode (I used it in this PR to save time, and wanted to commit so everyone can benefit - all you need is the Chrome debugger plugin for vscode.

Oh, the comments are auto-gen, yes.

Collaborator

@pyu10055 pyu10055 left a comment


Great stuff, couple of high-level questions:

  1. Will string-type tensor ops be supported on GPU, or is it a CPU-only dtype?
  2. Have you considered using either Uint8Array or Uint16Array for string storage, to be aligned with current types?

const a = tf.tensor([['a', 'bb'], ['c', 'd']]);

expect(tf.memory().numTensors).toBe(1);
expect(tf.memory().numBytes).toBe(10); // 10 letters, each 2 bytes.
Collaborator


5 letters? Do we always store string type as UTF-16? Do we support UTF-8?

@@ -678,7 +678,7 @@ export class MathBackendCPU implements KernelBackend {
     this.assertNotComplex(x, 'logicalNot');

     const values = x.dataSync();
-    const newValues = new Int32Array(values.length);
+    const newValues = new Uint8Array(values.length);
Collaborator


does it support int32 logicalNot?

Collaborator

@caisq caisq left a comment


Great! Thanks for doing this, @dsmilkov

Please see my comments below.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @dsmilkov, @nsthorat, @pyu10055, and @caisq)


src/engine_test.ts, line 369 at r1 (raw file):

10

10 --> 5


src/tensor_test.ts, line 580 at r1 (raw file):

['a', 'b']

I don't see any unit tests involving

  • empty strings
  • non-ASCII strings.

Should you add them? In particular, for non-ASCII strings, what do you do? Throw an error?


src/tensor_test.ts, line 1373 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

nit (optional): for consistency with console.log(), each element should be wrapped with a quote

+1 Python TensorFlow adds quotes to print() output of string-type tensors. This also helps user see empty strings more clearly.


src/util.ts, line 325 at r1 (raw file):

new Array(size)

Maybe new Array<string>()?


src/util.ts, line 402 at r1 (raw file):

bytes += x.length * 2);

This doesn't handle non-ASCII strings.

If you plan to handle non-ASCII strings, you should calculate the byte size of the strings differently
(e.g., https://stackoverflow.com/questions/5515869/string-length-in-bytes-in-javascript)

If you don't plan to handle non-ASCII strings, you should check for them and throw Error if a
non-ASCII string is found.

FYI, non-ASCII strings are supported by Python TensorFlow.
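An exact byte count for non-ASCII strings, along the lines of the linked StackOverflow answer, can be computed with the standard TextEncoder API (a sketch, not what this PR does; the PR uses the 2-bytes-per-character approximation instead):

```typescript
// Exact UTF-8 byte length of a JS string via the standard TextEncoder API.
function byteLengthUtf8(s: string): number {
  return new TextEncoder().encode(s).length;
}

console.log(byteLengthUtf8('abc'));   // 3
console.log(byteLengthUtf8('héllo')); // 6: 'é' encodes to 2 bytes in UTF-8
```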


src/util.ts, line 475 at r1 (raw file):

throw new Error('Cannot convert a string[] to a TypedArray');

I'm curious why this is not supported. You could convert it to a Uint8Array.


src/ops/binary_ops_test.ts, line 211 at r1 (raw file):

  it('throws for string tensor', () => {
    expect(() => tf.maximum('q', 3))

Certain tf.* operations actually should support string tensors, including:

  • tf.add (concatenates strings)

Contributor Author

@dsmilkov dsmilkov left a comment


PTAL.

Thanks for the good questions. Answers inline.

  1. Will string-type tensor ops be supported on GPU, or is it a CPU-only dtype?

Only on CPU. Storing strings on the GPU would be highly complicated, with no obvious benefits.

  2. Have you considered using either Uint8Array or Uint16Array for string storage, to be aligned with current types?

In the browser we will store strings natively for speed. When you store strings natively, the encoding depends on the <meta charset="enc"> of the HTML page that hosts the JS. Since you can't know how much actual memory the browser used, I use 2 bytes per character as an approximation, since tf.memory() will be used only for debugging.

I did consider using TextEncoder/TextDecoder to convert native strings to a predictable UTF-16 encoding, but this:

  • has no obvious benefit other than a predictable byte size
  • slows down string manipulation
  • increases code complexity
  • requires encoding/decoding polyfills for some browsers.

Can you add an explicit test for .data<'string'> and .dataSync<'string'> to test the compiler?

Added.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @nsthorat, @pyu10055, @caisq, and @bileschi)


src/buffer_test.ts, line 26 at r1 (raw file):

Previously, nkreeger (Nick Kreeger) wrote…
buff.set(2, 0, 1, 0);

Nit - does it make sense to add an actual floating point number here? 2.1 or something.

Done. This test was moved out from


src/engine.ts, line 291 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

I have a feeling dataSync() on this will be expensive in Node.js, one thought would be to keep the size when you write strings

Done.


src/engine_test.ts, line 369 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

5 letters? Do we always store string type as UTF-16, do we support UTF-8?

We store strings natively. The browser internally can store strings as either UTF-8 or UTF-16, but that's an implementation detail and we always assume 2 bytes per character so we are conservative with memory usage: https://stackoverflow.com/questions/2219526/how-many-bytes-in-a-javascript-string


src/engine_test.ts, line 369 at r1 (raw file):

Previously, caisq (Shanqing Cai) wrote…
10

10 --> 5

Done.


src/jasmine_util.ts, line 96 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

Just want to point this test out to you which now will have slightly different behavior: https://github.com/tensorflow/tfjs-core/blob/master/src/engine_test.ts#L511

Should not affect the test, since we were already prepending the test- prefix when registering backends below (see https://github.com/tensorflow/tfjs-core/blob/master/src/jasmine_util.ts#L136), so we ended up with test-test-webgl1.


src/tape.ts, line 172 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

can we test this?

Done. I was already testing with dy being a string (gradient: 1D string throws error with non-numeric dy), but I added more tests with dy being bool/int.


src/tensor_test.ts, line 580 at r1 (raw file):

Previously, caisq (Shanqing Cai) wrote…
['a', 'b']

I don't see any unit tests involving

  • empty strings
  • non-ASCII strings.

Should you add them? In particular, for non-ASCII strings, what do you do? Throw an error?

Non-ASCII should be supported, so I'm adding unit tests for that. It's the job of the backends to decide how to store the strings. For the webgl and vanilla cpu backends we store strings natively, so Unicode is supported. TF Python seems to support Unicode as well (tested in a colab), but the node.js backend might have to do a bit of work to convert between Unicode and native strings when crossing the backend <--> user interface boundary (cc @nkreeger)


src/tensor_test.ts, line 1373 at r1 (raw file):

Previously, caisq (Shanqing Cai) wrote…

+1 Python TensorFlow adds quotes to print() output of string-type tensors. This also helps user see empty strings more clearly.

Great point. Done.


src/types.ts, line 37 at r1 (raw file):

Previously, nkreeger (Nick Kreeger) wrote…
  float32: number;
  int32: number;
  bool: boolean;
  complex64: number;
  string: string;

Does the order matter here? If not - do you think it makes sense to order these by size?

bool,
int32,
float32,
complex64,
string

Done.


src/util.ts, line 325 at r1 (raw file):

Previously, caisq (Shanqing Cai) wrote…
new Array(size)

Maybe new Array<string>()?

Done.


src/util.ts, line 402 at r1 (raw file):

Previously, caisq (Shanqing Cai) wrote…
bytes += x.length * 2);

This doesn't handle non-ASCII strings.

If you plan to handle non-ASCII strings, you should calculate the byte size of the strings differently
(e.g., https://stackoverflow.com/questions/5515869/string-length-in-bytes-in-javascript)

If you don't plan to handle non-ASCII strings, you should check for them and throw Error if a
non-ASCII string is found.

FYI, non-ASCII strings are supported by Python TensorFlow.

We will support non-ASCII strings, but the memory accounting is going to be approximate (2 bytes per character) for speed. When you store strings natively, the encoding depends on the <meta charset> of the HTML page that hosts the JS. Added this info to the jsdoc and updated the result returned by tf.memory(). Whenever there is at least 1 string tensor, tf.memory() returns (among other things):

{
  unreliable: true,
  reasons: ['Memory usage by string tensors is approximate, 2 bytes per character']
}

src/util.ts, line 475 at r1 (raw file):

Previously, caisq (Shanqing Cai) wrote…
throw new Error('Cannot convert a string[] to a TypedArray');

I'm curious why this is not supported. You could convert it to a Uint8Array.

This method is used only internally. We can add that functionality when the need arises.


src/kernels/backend_cpu.ts, line 681 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

does it support int32 logicalNot?

logical_not in TF requires a bool input and returns a bool tensor.


src/kernels/backend_cpu_test.ts, line 55 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

you can put this in beforeEach to parallel afterEach

Done.


src/ops/binary_ops_test.ts, line 211 at r1 (raw file):

Previously, caisq (Shanqing Cai) wrote…

Certain tf.* operations actually should support string tensors, including:

  • tf.add (concatenates strings)

That's good to know! I'll be adding string op functionality in follow-up PRs to keep this PR small.
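For reference, the elementwise concatenation behavior mentioned above (what Python TF's tf.add does for string tensors) could look like this sketch on plain arrays; this is not part of this PR, and the function name is hypothetical:

```typescript
// Hypothetical sketch: elementwise concat for same-shape string "tensors",
// mirroring Python TF where tf.add on string tensors concatenates elements.
function addStrings(a: string[], b: string[]): string[] {
  if (a.length !== b.length) {
    throw new Error('Operands must have the same shape');
  }
  return a.map((s, i) => s + b[i]);
}

console.log(addStrings(['a', 'bb'], ['1', '22'])); // [ 'a1', 'bb22' ]
```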

Collaborator

@caisq caisq left a comment


:lgtm_strong: Great! Thanks!

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @nsthorat, @pyu10055, @caisq, and @bileschi)

Collaborator

@caisq caisq left a comment


:lgtm_strong:

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @nsthorat, @pyu10055, @caisq, and @bileschi)

Contributor

@nkreeger nkreeger left a comment


Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @nsthorat, @pyu10055, @caisq, and @bileschi)

@dsmilkov dsmilkov merged commit f217045 into master Nov 27, 2018
@dsmilkov dsmilkov deleted the string-data branch December 3, 2018 18:02
dsmilkov added a commit that referenced this pull request Dec 6, 2018
Allow users to provide different dtypes in binary arithmetic ops (add/sub/mul/div/...) and matmul, just like in numpy.

The dtype of the result is upcasted i.e. matMul(float32, int32) => float32

This will result in release patch 0.14.1, which will fix the breakage in 0.14.0 caused by #1408 due to improved dtype inference, where tensor(new Int32Array()) is now inferred to be int32 while it previously was float32.

Fixes tensorflow/tfjs#934, tensorflow/tfjs#966