Conversation
Looks fairly straightforward.
Food for thought: do you think tackling large PRs like this could be piecemealed in a working branch? Even if tests don't pass, I wonder if we can help avoid very large PRs (reviews can miss some details).
Reviewed 43 of 43 files at r1.
Reviewable status: 0 of 1 approvals obtained (waiting on @dsmilkov, @nsthorat, @caisq, and @nkreeger)
.vscode/launch.json, line 3 at r1 (raw file):
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
Curious - is this auto-gen in VSCode? Looks like this is a config to attach VSCode to a Chromium process?
src/buffer_test.ts, line 26 at r1 (raw file):
buff.set(2, 0, 1, 0);
Nit - does it make sense to add an actual floating-point number here? 2.1 or something.
src/types.ts, line 37 at r1 (raw file):
Quoted 5 lines of code…
float32: number;
int32: number;
bool: boolean;
complex64: number;
string: string;
Does the order matter here? If not - do you think it makes sense to order these by size?
bool,
int32,
float32,
complex64,
string
Can you add an explicit test for .data<'string'>
and .dataSync<'string'>
to test the compiler?
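A compiler-level test for the typed `dataSync` could look roughly like the sketch below. `DataTypeMap` and the `dataSync` signature here are simplified stand-ins for the real tfjs-core declarations, not the library code itself:

```typescript
// Simplified stand-ins for tfjs-core's declarations (assumed shapes,
// not the real library code).
interface DataTypeMap {
  float32: Float32Array;
  int32: Int32Array;
  bool: Uint8Array;
  string: string[];
}

class FakeTensor {
  constructor(private values: string[] | Float32Array) {}
  // The generic parameter lets the compiler narrow the return type:
  // dataSync<'string'>() is string[], plain dataSync() stays a TypedArray.
  dataSync<D extends keyof DataTypeMap = 'float32'>(): DataTypeMap[D] {
    return this.values as unknown as DataTypeMap[D];
  }
}

// The compiler should accept string[] here...
const t = new FakeTensor(['a', 'bb']);
const vals: string[] = t.dataSync<'string'>();
// ...and reject e.g. `const bad: Float32Array = t.dataSync<'string'>();`
console.log(vals.join(','));
```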
Reviewed 43 of 43 files at r1.
Reviewable status: 0 of 1 approvals obtained (waiting on @dsmilkov, @nsthorat, and @caisq)
src/engine.ts, line 291 at r1 (raw file):
if (refCount <= 1) {
  if (a.dtype === 'string') {
    this.numBytes -= bytesFromStringArray(a.dataSync<'string'>());
I have a feeling dataSync() on this will be expensive in Node.js, one thought would be to keep the size when you write strings
src/jasmine_util.ts, line 96 at r1 (raw file):
export let TEST_ENVS: TestEnv[] = [
  {
    name: 'test-webgl1',
Just want to point this test out to you which now will have slightly different behavior: https://github.com/tensorflow/tfjs-core/blob/master/src/engine_test.ts#L511
src/tape.ts, line 172 at r1 (raw file):
// Call the gradient function.
const dx = inputGradients[inputName]();
if (dx.dtype !== 'float32') {
can we test this?
src/tensor_test.ts, line 1373 at r1 (raw file):
' shape: [3]\n' +
' values:\n' +
' [a, bb, ccc]');
nit (optional): for consistency with console.log(), each element should be wrapped with a quote
src/kernels/backend_cpu_test.ts, line 55 at r1 (raw file):
it('register string tensor with values', () => {
  const backend = new MathBackendCPU();
  tf.ENV.registerBackend('test-storage', () => backend);
you can put this in beforeEach to parallel afterEach
Thanks! I tried to make the PR as small as possible, and most of it is unit tests and assertions of updated error messages. To get a better sense of the core change in this PR see master...9351969, which compares one of the earliest commits in this branch with master (no unit tests, and it's about 300 lines of diff).
Reviewable status: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @dsmilkov, and @caisq)
.vscode/launch.json, line 3 at r1 (raw file):
Previously, nkreeger (Nick Kreeger) wrote…
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
Curious - is this auto-gen in VSCode? Looks like this is a config to attach VSCode to a Chromium process?
Not auto-gen, but the recommended setting from vscode if you want to debug karma chrome directly in vscode (I used it in this PR to save time, and wanted to commit so everyone can benefit - all you need is the Chrome debugger plugin for vscode).
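For reference, a minimal `launch.json` of that shape might look like the fragment below. This is a sketch, not the file from the PR; the port and `webRoot` are assumptions and must match your karma/Chrome setup, and the Chrome debugger extension must be installed:

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "chrome",
      "request": "attach",
      "name": "Attach to Karma Chrome",
      // Port and webRoot are assumptions; adjust to your karma config.
      "port": 9222,
      "webRoot": "${workspaceFolder}"
    }
  ]
}
```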
Reviewable status: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @dsmilkov, and @caisq)
.vscode/launch.json, line 3 at r1 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
Not auto-gen, but the recommended setting from vscode if you want to debug karma chrome directly in vscode (I used it in this PR to save time, and wanted to commit so everyone can benefit - all you need is the Chrome debugger plugin for vscode.
Oh, the comments are auto-gen, yes.
Great stuff, a couple of high-level questions:
- will string-type tensor ops be supported on GPU, or is it a CPU-only dtype?
- have you considered using either Uint8Array or Uint16Array for string storage, to align with the current types?
src/engine_test.ts
Outdated
const a = tf.tensor([['a', 'bb'], ['c', 'd']]);
expect(tf.memory().numTensors).toBe(1);
expect(tf.memory().numBytes).toBe(10);  // 10 letters, each 2 bytes.
5 letters? Do we always store string type as UTF-16, do we support UTF-8?
@@ -678,7 +678,7 @@ export class MathBackendCPU implements KernelBackend {
   this.assertNotComplex(x, 'logicalNot');
   const values = x.dataSync();
-  const newValues = new Int32Array(values.length);
+  const newValues = new Uint8Array(values.length);
does it support int32 logicalNot?
Great! Thanks for doing this, @dsmilkov
Please see my comments below.
Reviewable status: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @dsmilkov, @nsthorat, @pyu10055, and @caisq)
src/engine_test.ts, line 369 at r1 (raw file):
10
10 --> 5
src/tensor_test.ts, line 580 at r1 (raw file):
['a', 'b']
I don't see any unit tests involving
- empty strings
- non-ASCII strings.
Should you add them? In particular, for non-ASCII strings, what do you do? Throw an error?
src/tensor_test.ts, line 1373 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
nit (optional): for consistency with console.log(), each element should be wrapped with a quote
+1 Python TensorFlow adds quotes to print() output of string-type tensors. This also helps user see empty strings more clearly.
src/util.ts, line 325 at r1 (raw file):
new Array(size)
Maybe new Array<string>()?
src/util.ts, line 402 at r1 (raw file):
bytes += x.length * 2);
This doesn't handle non-ASCII strings.
If you plan to handle non-ASCII strings, you should calculate the byte size of the strings differently
(e.g., https://stackoverflow.com/questions/5515869/string-length-in-bytes-in-javascript)
If you don't plan to handle non-ASCII strings, you should check for them and throw Error if a
non-ASCII string is found.
FYI, non-ASCII strings are supported by Python TensorFlow.
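The byte-size technique from the linked StackOverflow answer can be sketched as below. `utf8ByteLength` is an illustrative stand-in name, not tfjs-core code:

```typescript
// Byte size of a JS string when encoded as UTF-8, without needing
// TextEncoder (the encodeURIComponent trick from the linked answer).
// utf8ByteLength is an illustrative name, not tfjs-core API.
function utf8ByteLength(s: string): number {
  // encodeURIComponent escapes each non-ASCII byte as %XX; replacing
  // each escape with one char makes string length equal byte length.
  return encodeURIComponent(s).replace(/%[0-9A-F]{2}/g, '.').length;
}

utf8ByteLength('abc');    // ASCII: 3 bytes
utf8ByteLength('héllo');  // 'é' takes 2 bytes in UTF-8: 6 total
```

Contrast with the PR's `x.length * 2` approximation, which reports 10 bytes for `'héllo'` regardless of the actual encoding.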
src/util.ts, line 475 at r1 (raw file):
throw new Error('Cannot convert a string[] to a TypedArray');
I'm curious why this is not supported. You could convert it to a Uint8Array.
src/ops/binary_ops_test.ts, line 211 at r1 (raw file):
it('throws for string tensor', () => {
  expect(() => tf.maximum('q', 3))
Certain tf.* operations actually should support string tensors, including:
- tf.add (concatenates strings)
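For illustration, elementwise string concatenation (the semantics `tf.add` has for string tensors in Python TensorFlow) could look like this toy sketch on plain arrays; `stringAdd` is a hypothetical helper, not part of tfjs-core:

```typescript
// Toy illustration of elementwise string concatenation over flat
// arrays, mirroring what tf.add does for string tensors in Python TF.
// `stringAdd` is a hypothetical helper, not part of tfjs-core.
function stringAdd(a: string[], b: string[]): string[] {
  if (a.length !== b.length) {
    throw new Error('shapes must match');
  }
  return a.map((s, i) => s + b[i]);
}

stringAdd(['foo', 'x'], ['bar', 'y']);  // ['foobar', 'xy']
```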
PTAL.
Thanks for the good questions. Answers inline.
- will string type tensor op be supported in GPU or it is a CPU only dtype?
Only on CPU. Storing strings on the GPU would be highly complicated, with no obvious benefits.
- have you consider either use UInt8Array or UInt16Array for string storage to be aligned with current types?
In the browser we will store strings natively for speed. When you store strings natively, the encoding depends on the <meta charset="enc">
of the html page that hosts the js. Since you can't know how much actual memory the browser used, I use 2 bytes per character as an approximation since tf.memory()
will be used only for debugging.
I did consider using a TextEncoder/TextDecoder to convert native strings to predictable UTF16 encoding, but this:
- has no obvious benefits, other than predictable byte size
- slows down string manipulation
- increases code complexity
- encoding/decoding requires polyfills for some browsers.
Can you add an explicit test for .data<'string'> and .dataSync<'string'> to test the compiler?
Added.
Reviewable status: complete! 1 of 1 approvals obtained (waiting on @nkreeger, @nsthorat, @pyu10055, @caisq, and @bileschi)
src/buffer_test.ts, line 26 at r1 (raw file):
Previously, nkreeger (Nick Kreeger) wrote…
buff.set(2, 0, 1, 0);
Nit - does it make sense to add an actual floating point number here?
2.1
or something.
Done. This test was moved out from
src/engine.ts, line 291 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
I have a feeling dataSync() on this will be expensive in Node.js, one thought would be to keep the size when you write strings
Done.
src/engine_test.ts, line 369 at r1 (raw file):
Previously, pyu10055 (Ping Yu) wrote…
5 letters? Do we always store string type as UTF-16, do we support UTF-8?
We store strings natively. The browser internally can store strings as either UTF-8 or UTF-16, but that's an implementation detail and we always assume 2 bytes per character so we are conservative with memory usage: https://stackoverflow.com/questions/2219526/how-many-bytes-in-a-javascript-string
src/engine_test.ts, line 369 at r1 (raw file):
Previously, caisq (Shanqing Cai) wrote…
10
10 --> 5
Done.
src/jasmine_util.ts, line 96 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
Just want to point this test out to you which now will have slightly different behavior: https://github.com/tensorflow/tfjs-core/blob/master/src/engine_test.ts#L511
Should not affect the test since we were already prepending the test-
prefix when we register backends below (see https://github.com/tensorflow/tfjs-core/blob/master/src/jasmine_util.ts#L136) so we ended up with test-test-webgl1
src/tape.ts, line 172 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
can we test this?
Done. I was testing already with dy being string (gradient: 1D string throws error with non-numeric dy), but I added more tests with dy being bool/int
src/tensor_test.ts, line 580 at r1 (raw file):
Previously, caisq (Shanqing Cai) wrote…
['a', 'b']
I don't see any unit tests involving
- empty strings
- non-ASCII strings.
Should you add them? In particular, for non-ASCII strings, what do you do? Throw an error?
Non-ascii should be supported, so I'm adding unit tests for that. It's the job of the backends to decide how to store the strings. For the webgl and vanilla cpu backend we store strings natively so unicode-escaping is supported. TF Python seems to support Unicode as well (tested in a colab) but the node.js backend might have to do a bit of work to convert from unicode to native string when crossing from backend <--> user interface (cc @nkreeger)
src/tensor_test.ts, line 1373 at r1 (raw file):
Previously, caisq (Shanqing Cai) wrote…
+1 Python TensorFlow adds quotes to print() output of string-type tensors. This also helps user see empty strings more clearly.
Great point. Done.
src/types.ts, line 37 at r1 (raw file):
Previously, nkreeger (Nick Kreeger) wrote…
float32: number; int32: number; bool: boolean; complex64: number; string: string;
Does the order matter here? If not - do you think it makes sense to order these by size?
bool, int32, float32, complex64, string
Done.
src/util.ts, line 325 at r1 (raw file):
Previously, caisq (Shanqing Cai) wrote…
new Array(size)
Maybe
new Array<string>()
?
Done.
src/util.ts, line 402 at r1 (raw file):
Previously, caisq (Shanqing Cai) wrote…
bytes += x.length * 2);
This doesn't handle non-ASCII strings.
If you plan to handle non-ASCII strings, you should calculate the byte size of the strings differently
(e.g., https://stackoverflow.com/questions/5515869/string-length-in-bytes-in-javascript)
If you don't plan to handle non-ASCII strings, you should check for them and throw Error if a
non-ASCII string is found.
FYI, non-ASCII strings are supported by Python TensorFlow.
We will support non-ASCII strings, but the memory accounting is going to be approximate (2 bytes per character) for speed. When you store strings natively, the encoding depends on the <meta charset> of the html page that hosts the js. Added this info to the jsdoc and updated the result returned by tf.memory(). Whenever there is at least 1 string tensor, tf.memory returns (among other things):
{
  unreliable: true,
  reasons: ['Memory usage by string tensors is approximate, 2 bytes per character']
}
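A sketch of that behavior, with illustrative names and shapes (not the actual tfjs-core implementation):

```typescript
// Sketch of a tf.memory()-style report that flags approximate
// accounting whenever string tensors are present. Names and shapes
// are illustrative, not the real tfjs-core types.
interface MemoryInfo {
  numTensors: number;
  numBytes: number;
  unreliable?: boolean;
  reasons?: string[];
}

function memoryInfo(numTensors: number, numBytes: number,
                    numStringTensors: number): MemoryInfo {
  const info: MemoryInfo = {numTensors, numBytes};
  if (numStringTensors > 0) {
    // String bytes are estimated at 2 bytes/char, so mark the report.
    info.unreliable = true;
    info.reasons =
        ['Memory usage by string tensors is approximate, 2 bytes per character'];
  }
  return info;
}

memoryInfo(1, 10, 1);  // {numTensors: 1, numBytes: 10, unreliable: true, ...}
```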
src/util.ts, line 475 at r1 (raw file):
Previously, caisq (Shanqing Cai) wrote…
throw new Error('Cannot convert a string[] to a TypedArray');
I'm curious why this is not supported. You could convert it to a Uint8Array.
This method is used only internally. We can add that functionality when the need arises.
src/kernels/backend_cpu.ts, line 681 at r1 (raw file):
Previously, pyu10055 (Ping Yu) wrote…
does it support int32 logicalNot?
logical_not in TF requires a bool input and returns a bool tensor.
src/kernels/backend_cpu_test.ts, line 55 at r1 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
you can put this in beforeEach to parallel afterEach
Done.
src/ops/binary_ops_test.ts, line 211 at r1 (raw file):
Previously, caisq (Shanqing Cai) wrote…
Certain tf.* operations actually should support string tensors, including:
- tf.add (concatenates strings)
That's good to know! I'll be adding op string functionality in follow up PRs to keep this PR small.
Reviewable status: complete! 2 of 1 approvals obtained (waiting on @nsthorat, @pyu10055, @caisq, and @bileschi)
Allow users to provide different dtypes in binary arithmetic ops (add/sub/mul/div/...) and matmul, just like in NumPy. The dtype of the result is upcasted, i.e. matMul(float32, int32) => float32. This will go out in patch release 0.14.1, which fixes the breakage in 0.14.0 caused by #1408: with improved dtype inference, tensor(new Int32Array()) is now inferred to be int32, where previously it was float32. Fixes tensorflow/tfjs#934, tensorflow/tfjs#966
Add `string` dtype to `Tensor`. This opens the door for adding Python's `string_ops` to TensorFlow.js, which are used for text-based models, and for adding pre-processing layers that operate on strings.

Details:
- `t.dataSync<'string'>()` returns `string[]`, while `t.dataSync()` returns `TypedArray` for backwards compatibility.
- `layers` and `converter` pass with this build. `node` has 30ish failed tests since `string` is an unknown dtype.
- Only `clone`, `reshape` and `cast` work with strings at this point to keep this PR small. Other ops will get the functionality in a follow-up PR.
- Backends need to handle `string` in their `register/write/read` methods.

FEATURE