This repository has been archived by the owner on Aug 15, 2019. It is now read-only.

Add string dtype to Tensor #1408

Merged
merged 31 commits into from Nov 27, 2018

Conversation

dsmilkov
Contributor

@dsmilkov dsmilkov commented Nov 20, 2018

Add string dtype to Tensor.

This opens the door for adding Python's string_ops to TensorFlow.js, which are used for text-based models, and for adding pre-processing layers that operate on strings.

Details:

  • dtype was not added as a generic to the Tensor class in order to keep compiler errors simple and code backwards compatible.
  • dataSync() can be optionally typed to cast its result. E.g. t.dataSync<'string'>() returns string[] while t.dataSync() returns TypedArray for backwards compatibility.
  • layers and converter pass with this build. node has 30ish failed tests since string is an unknown dtype.
  • Only clone, reshape and cast work with strings at this point to keep this PR small. Other ops will get the functionality in a follow-up PR.
  • Added unit tests to assert that numeric ops throw on string tensors.
  • Backends should now support dtype string in their register/write/read methods.
  • Added a VSCode config for debugging directly from VSCode.
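The optionally-typed `dataSync()` described above can be sketched with a default generic parameter. This is a minimal, self-contained illustration; `DataTypeMap` and `SketchTensor` are hypothetical stand-ins, not the actual tfjs-core types:

```typescript
// Illustrative sketch of an optionally-typed dataSync().
// DataTypeMap and SketchTensor are hypothetical, not tfjs-core APIs.
type DataTypeMap = {
  float32: Float32Array;
  int32: Int32Array;
  bool: Uint8Array;
  complex64: Float32Array;
  string: string[];
};

class SketchTensor {
  constructor(private values: Float32Array | string[]) {}
  // Defaults to the numeric case so existing callers stay backwards compatible.
  dataSync<D extends keyof DataTypeMap = 'float32'>(): DataTypeMap[D] {
    return this.values as unknown as DataTypeMap[D];
  }
}

const t = new SketchTensor(['a', 'bb', 'ccc']);
const vals: string[] = t.dataSync<'string'>(); // typed as string[]
console.log(vals.join(',')); // a,bb,ccc
```

Calling `t.dataSync()` with no type argument is still typed as `Float32Array`, which is what keeps existing numeric call sites compiling unchanged.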

FEATURE


This change is Reviewable

@dsmilkov dsmilkov changed the title NOT READY YET Tensor<string> Add string as dtype to Tensor Nov 26, 2018
@dsmilkov dsmilkov changed the title Add string as dtype to Tensor Add string dtype to Tensor Nov 26, 2018
Contributor

@nkreeger nkreeger left a comment


:lgtm:

Looks fairly straightforward.

Food for thought: do you think tackling large PRs like this could be piecemealed into a working branch? Even if tests don't pass, I wonder if we can avoid very large PRs (reviews can miss some details).

Reviewed 43 of 43 files at r1.
Reviewable status: 0 of 1 approvals obtained (waiting on @dsmilkov, @nsthorat, @caisq, and @nkreeger)


.vscode/launch.json, line 3 at r1 (raw file):

  // Use IntelliSense to learn about possible attributes.
  // Hover to view descriptions of existing attributes.
  // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387

Curious - is this auto-gen in VSCode? Looks like this is a config to attach VSCode to a Chromium process?


src/buffer_test.ts, line 26 at r1 (raw file):

buff.set(2, 0, 1, 0);

Nit - does it make sense to add an actual floating point number here? 2.1 or something.


src/types.ts, line 37 at r1 (raw file):

Quoted 5 lines of code…
  float32: number;
  int32: number;
  bool: boolean;
  complex64: number;
  string: string;

Does the order matter here? If not, do you think it makes sense to order these by size?

bool,
int32,
float32,
complex64,
string

Contributor

@nsthorat nsthorat left a comment


Can you add an explicit test for .data<'string'> and .dataSync<'string'> to test the compiler?

Reviewed 43 of 43 files at r1.
Reviewable status: 0 of 1 approvals obtained (waiting on @dsmilkov, @nsthorat, and @caisq)


src/engine.ts, line 291 at r1 (raw file):

    if (refCount <= 1) {
      if (a.dtype === 'string') {
        this.numBytes -= bytesFromStringArray(a.dataSync<'string'>());

I have a feeling dataSync() on this will be expensive in Node.js; one thought would be to keep the size when you write strings.
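The suggestion above can be sketched as caching the byte size once at write time, so disposal never needs a (potentially expensive) dataSync(). The names here are hypothetical, not the actual engine internals:

```typescript
// Sketch (assumption): compute the string byte size once when the tensor is
// written, and read the cached value at disposal time.
function bytesFromStringArray(arr: string[]): number {
  // Approximation used in this PR: 2 bytes per character.
  return arr.reduce((bytes, s) => bytes + s.length * 2, 0);
}

class StringTensorData {
  readonly byteSize: number;
  constructor(readonly values: string[]) {
    this.byteSize = bytesFromStringArray(values); // cached at write time
  }
}

const data = new StringTensorData(['a', 'bb', 'ccc']);
console.log(data.byteSize); // 12 (6 characters * 2 bytes)
```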


src/jasmine_util.ts, line 96 at r1 (raw file):

export let TEST_ENVS: TestEnv[] = [
  {
    name: 'test-webgl1',

Just want to point this test out to you which now will have slightly different behavior: https://github.com/tensorflow/tfjs-core/blob/master/src/engine_test.ts#L511


src/tape.ts, line 172 at r1 (raw file):

      // Call the gradient function.
      const dx = inputGradients[inputName]();
      if (dx.dtype !== 'float32') {

can we test this?


src/tensor_test.ts, line 1373 at r1 (raw file):

        '  shape: [3]\n' +
        '  values:\n' +
        '    [a, bb, ccc]');

nit (optional): for consistency with console.log(), each element should be wrapped with a quote


src/kernels/backend_cpu_test.ts, line 55 at r1 (raw file):

  it('register string tensor with values', () => {
    const backend = new MathBackendCPU();
    tf.ENV.registerBackend('test-storage', () => backend);

you can put this in beforeEach to parallel afterEach

Contributor Author

@dsmilkov dsmilkov left a comment


Thanks! I tried to make the PR as small as possible; most of it is unit tests and assertions on updated error messages. To get a better sense of the core change in this PR, see master...9351969, which compares one of the earliest commits in this branch with master (no unit tests, and it's about 300 lines of diff).

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @dsmilkov, and @caisq)


.vscode/launch.json, line 3 at r1 (raw file):

Previously, nkreeger (Nick Kreeger) wrote…
  // Use IntelliSense to learn about possible attributes.
  // Hover to view descriptions of existing attributes.
  // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387

Curious - is this auto-gen in VSCode? Looks like this is a config to attach VSCode to a Chromium process?

Not auto-gen, but the recommended settings from VSCode if you want to debug Karma/Chrome directly in VSCode. I used it in this PR to save time and wanted to commit it so everyone can benefit; all you need is the Chrome debugger plugin for VSCode.

Contributor Author

@dsmilkov dsmilkov left a comment


Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @dsmilkov, and @caisq)


.vscode/launch.json, line 3 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

Not auto-gen, but the recommended setting from vscode if you want to debug karma chrome directly in vscode (I used it in this PR to save time, and wanted to commit so everyone can benefit - all you need is the Chrome debugger plugin for vscode.

Oh, the comments are auto-gen, yes.

Collaborator

@pyu10055 pyu10055 left a comment


Great stuff, couple of high-level questions:

  1. Will string-type tensor ops be supported on GPU, or is it a CPU-only dtype?
  2. Have you considered using either Uint8Array or Uint16Array for string storage, to be aligned with current types?

const a = tf.tensor([['a', 'bb'], ['c', 'd']]);

expect(tf.memory().numTensors).toBe(1);
expect(tf.memory().numBytes).toBe(10); // 10 letters, each 2 bytes.
Collaborator


5 letters? Do we always store string type as UTF-16? Do we support UTF-8?

@@ -678,7 +678,7 @@ export class MathBackendCPU implements KernelBackend {
     this.assertNotComplex(x, 'logicalNot');

     const values = x.dataSync();
-    const newValues = new Int32Array(values.length);
+    const newValues = new Uint8Array(values.length);
Collaborator


does it support int32 logicalNot?

Collaborator

@caisq caisq left a comment


Great! Thanks for doing this, @dsmilkov

Please see my comments below.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @dsmilkov, @nsthorat, @pyu10055, and @caisq)


src/engine_test.ts, line 369 at r1 (raw file):

10

10 --> 5


src/tensor_test.ts, line 580 at r1 (raw file):

['a', 'b']

I don't see any unit tests involving

  • empty strings
  • non-ASCII strings.

Should you add them? In particular, for non-ASCII strings, what do you do? Throw an error?


src/tensor_test.ts, line 1373 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

nit (optional): for consistency with console.log(), each element should be wrapped with a quote

+1 Python TensorFlow adds quotes to print() output of string-type tensors. This also helps user see empty strings more clearly.


src/util.ts, line 325 at r1 (raw file):

new Array(size)

Maybe new Array<string>()?


src/util.ts, line 402 at r1 (raw file):

bytes += x.length * 2);

This doesn't handle non-ASCII strings.

If you plan to handle non-ASCII strings, you should calculate the byte size of the strings differently
(e.g., https://stackoverflow.com/questions/5515869/string-length-in-bytes-in-javascript)

If you don't plan to handle non-ASCII strings, you should check for them and throw Error if a
non-ASCII string is found.

FYI, non-ASCII strings are supported by Python TensorFlow.
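An exact byte count for non-ASCII strings, along the lines of the linked StackOverflow answer, can be computed with the standard TextEncoder API (a sketch, not what this PR does; the PR uses the 2-bytes-per-character approximation instead):

```typescript
// Exact UTF-8 byte length of a JS string via the standard TextEncoder API.
function byteLengthUtf8(s: string): number {
  return new TextEncoder().encode(s).length;
}

console.log(byteLengthUtf8('abc'));   // 3
console.log(byteLengthUtf8('héllo')); // 6: 'é' encodes to 2 bytes in UTF-8
```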


src/util.ts, line 475 at r1 (raw file):

throw new Error('Cannot convert a string[] to a TypedArray');

I'm curious why this is not supported. You could convert it to a Uint8Array.


src/ops/binary_ops_test.ts, line 211 at r1 (raw file):

  it('throws for string tensor', () => {
    expect(() => tf.maximum('q', 3))

Certain tf.* operations actually should support string tensors, including:

  • tf.add (concatenates strings)

Contributor Author

@dsmilkov dsmilkov left a comment


PTAL.

Thanks for the good questions. Answers inline.

  1. Will string-type tensor ops be supported on GPU, or is it a CPU-only dtype?

Only on CPU. Storing strings on the GPU would be highly complicated, with no obvious benefits.

  2. Have you considered using either Uint8Array or Uint16Array for string storage, to be aligned with current types?

In the browser we will store strings natively for speed. When you store strings natively, the encoding depends on the <meta charset="enc"> of the HTML page that hosts the JS. Since you can't know how much actual memory the browser used, I use 2 bytes per character as an approximation, since tf.memory() will be used only for debugging.

I did consider using TextEncoder/TextDecoder to convert native strings to a predictable UTF-16 encoding, but this:

  • has no obvious benefit other than a predictable byte size
  • slows down string manipulation
  • increases code complexity
  • requires encoding/decoding polyfills for some browsers.

Can you add an explicit test for .data<'string'> and .dataSync<'string'> to test the compiler?

Added.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @nsthorat, @pyu10055, @caisq, and @bileschi)


src/buffer_test.ts, line 26 at r1 (raw file):

Previously, nkreeger (Nick Kreeger) wrote…
buff.set(2, 0, 1, 0);

Nit - does it make sense to add an actual floating point number here? 2.1 or something.

Done. This test was moved out from


src/engine.ts, line 291 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

I have a feeling dataSync() on this will be expensive in Node.js, one thought would be to keep the size when you write strings

Done.


src/engine_test.ts, line 369 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

5 letters? Do we always store string type as UTF-16, do we support UTF-8?

We store strings natively. The browser internally can store strings as either UTF-8 or UTF-16, but that's an implementation detail and we always assume 2 bytes per character so we are conservative with memory usage: https://stackoverflow.com/questions/2219526/how-many-bytes-in-a-javascript-string


src/engine_test.ts, line 369 at r1 (raw file):

Previously, caisq (Shanqing Cai) wrote…
10

10 --> 5

Done.


src/jasmine_util.ts, line 96 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

Just want to point this test out to you which now will have slightly different behavior: https://github.com/tensorflow/tfjs-core/blob/master/src/engine_test.ts#L511

Should not affect the test, since we were already prepending the test- prefix when registering backends below (see https://github.com/tensorflow/tfjs-core/blob/master/src/jasmine_util.ts#L136), so we ended up with test-test-webgl1.


src/tape.ts, line 172 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

can we test this?

Done. I was already testing with dy being a string (gradient: 1D string throws error with non-numeric dy), but I added more tests with dy being bool/int.


src/tensor_test.ts, line 580 at r1 (raw file):

Previously, caisq (Shanqing Cai) wrote…
['a', 'b']

I don't see any unit tests involving

  • empty strings
  • non-ASCII strings.

Should you add them? In particular, for non-ASCII strings, what do you do? Throw an error?

Non-ASCII should be supported, so I'm adding unit tests for that. It's the job of the backends to decide how to store the strings. For the webgl and vanilla cpu backends we store strings natively, so Unicode is supported. TF Python seems to support Unicode as well (tested in a colab), but the node.js backend might have to do a bit of work to convert between Unicode and native strings when crossing the backend <--> user interface boundary (cc @nkreeger)


src/tensor_test.ts, line 1373 at r1 (raw file):

Previously, caisq (Shanqing Cai) wrote…

+1 Python TensorFlow adds quotes to print() output of string-type tensors. This also helps user see empty strings more clearly.

Great point. Done.


src/types.ts, line 37 at r1 (raw file):

Previously, nkreeger (Nick Kreeger) wrote…
  float32: number;
  int32: number;
  bool: boolean;
  complex64: number;
  string: string;

Does the order matter here? If not - do you think it makes sense to order these by size?

bool,
int32,
float32,
complex64,
string

Done.


src/util.ts, line 325 at r1 (raw file):

Previously, caisq (Shanqing Cai) wrote…
new Array(size)

Maybe new Array<string>()?

Done.


src/util.ts, line 402 at r1 (raw file):

Previously, caisq (Shanqing Cai) wrote…
bytes += x.length * 2);

This doesn't handle non-ASCII strings.

If you plan to handle non-ASCII strings, you should calculate the byte size of the strings differently
(e.g., https://stackoverflow.com/questions/5515869/string-length-in-bytes-in-javascript)

If you don't plan to handle non-ASCII strings, you should check for them and throw Error if a
non-ASCII string is found.

FYI, non-ASCII strings are supported by Python TensorFlow.

We will support non-ASCII strings, but the memory accounting is going to be approximate (2 bytes per character) for speed. When you store strings natively, the encoding depends on the <meta charset> of the HTML page that hosts the JS. Added this info to the jsdoc and updated the result returned by tf.memory(). Whenever there is at least 1 string tensor, tf.memory() returns (among other things):

{
  unreliable: true,
  reasons: ['Memory usage by string tensors is approximate, 2 bytes per character']
}

src/util.ts, line 475 at r1 (raw file):

Previously, caisq (Shanqing Cai) wrote…
throw new Error('Cannot convert a string[] to a TypedArray');

I'm curious why this is not supported. You could convert it to a Uint8Array.

This method is used only internally. We can add that functionality when the need arises.


src/kernels/backend_cpu.ts, line 681 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

does it support int32 logicalNot?

logical_not in TF requires a bool input and returns a bool tensor.


src/kernels/backend_cpu_test.ts, line 55 at r1 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

you can put this in beforeEach to parallel afterEach

Done.


src/ops/binary_ops_test.ts, line 211 at r1 (raw file):

Previously, caisq (Shanqing Cai) wrote…

Certain tf.* operations actually should support string tensors, including:

  • tf.add (concatenates strings)

That's good to know! I'll be adding string op functionality in follow-up PRs to keep this PR small.
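For reference, the elementwise concatenation behavior mentioned above (what Python TF's tf.add does for string tensors) could look like this sketch on plain arrays; this is not part of this PR, and the function name is hypothetical:

```typescript
// Hypothetical sketch: elementwise concat for same-shape string "tensors",
// mirroring Python TF where tf.add on string tensors concatenates elements.
function addStrings(a: string[], b: string[]): string[] {
  if (a.length !== b.length) {
    throw new Error('Operands must have the same shape');
  }
  return a.map((s, i) => s + b[i]);
}

console.log(addStrings(['a', 'bb'], ['1', '22'])); // [ 'a1', 'bb22' ]
```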

Collaborator

@caisq caisq left a comment


:lgtm_strong: Great! Thanks!

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @nsthorat, @pyu10055, @caisq, and @bileschi)

Collaborator

@caisq caisq left a comment


:lgtm_strong:

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @nsthorat, @pyu10055, @caisq, and @bileschi)

Contributor

@nkreeger nkreeger left a comment


Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @nsthorat, @pyu10055, @caisq, and @bileschi)

@dsmilkov dsmilkov merged commit f217045 into master Nov 27, 2018
@dsmilkov dsmilkov deleted the string-data branch December 3, 2018 18:02
dsmilkov added a commit that referenced this pull request Dec 6, 2018
Allow users to provide different dtypes in binary arithmetic ops (add/sub/mul/div/...) and matmul, just like in numpy.

The dtype of the result is upcasted i.e. matMul(float32, int32) => float32

This will result in release patch 0.14.1, which will fix the breakage in 0.14.0 caused by #1408 due to improved dtype inference, where tensor(new Int32Array()) is now inferred to be int32 while it previously was float32.

Fixes tensorflow/tfjs#934, tensorflow/tfjs#966