-
Notifications
You must be signed in to change notification settings - Fork 943
Add support for string weights #1795
Conversation
|
CLA is ok since I'm a Googler (the confusion is because I didn't use email address for the first commit - it was made from a new laptop) |
nsthorat
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 8 of 14 files at r2, 6 of 6 files at r3.
Reviewable status:complete! 1 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @nsthorat)
src/io/io_utils.ts, line 66 at r3 (raw file):
const spec: WeightsManifestEntry = {name, shape: t.shape, dtype: t.dtype}; if (t.dtype === 'string') { const utf8bytes =
small optimization: wrap this in a promise and add the promise to dataPromises so you don't serially await data here (Promise.all can parallelize)
src/ops/array_ops.ts, line 421 at r3 (raw file):
/** @doc {heading: 'Tensors', subheading: 'Slicing and Joining'} */ function tile_<T extends Tensor>(x: T|TensorLike, reps: number[]): T { const $x = convertToTensor(x, 'x', 'tile', null);
would be good to name null here and below
src/platforms/platform_browser.ts, line 20 at r3 (raw file):
import {Platform} from './platform'; export class PlatformBrowser implements Platform {
add a simple unit test for this file in platform_browser_test
src/platforms/platform_node.ts, line 26 at r3 (raw file):
importFetch: () => require('node-fetch') };
add a simple unit test for this file in platform_browser_test
|
LGTM, but wait for approval from Nick / Ping on the doc. |
|
Thanks, yup will do |
|
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google. ℹ️ Googlers: Go here for more info. |
dsmilkov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 1 approvals obtained (waiting on @caisq)
src/io/io_utils.ts, line 66 at r3 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
small optimization: wrap this in a promise and add the promise to dataPromises so you don't serially await data here (Promise.all can parallelize)
Done.
src/ops/array_ops.ts, line 421 at r3 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
would be good to name null here and below
Done.
src/platforms/platform_browser.ts, line 20 at r3 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
add a simple unit test for this file in platform_browser_test
Done.
src/platforms/platform_node.ts, line 26 at r3 (raw file):
Previously, nsthorat (Nikhil Thorat) wrote…
add a simple unit test for this file in platform_browser_test
Done.
pyu10055
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 1 approvals obtained (waiting on @caisq and @dsmilkov)
src/io/io_utils.ts, line 27 at r4 (raw file):
/** Used to delimit neighboring strings when encoding string tensors. */ export const STRING_DELIMITER = '\x00';
will the tensor itself contains this delimiter? should it be escaped before encoding?
dsmilkov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 1 approvals obtained (waiting on @caisq and @pyu10055)
src/io/io_utils.ts, line 27 at r4 (raw file):
Previously, pyu10055 (Ping Yu) wrote…
will the tensor itself contains this delimiter? should it be escaped before encoding?
If you ship a tensor of 2 values ['a', 'b'], it will be shipped over the wire as a\x00b encoded in utf-8 (utf-8 assigns the 0 byte for \x00), parsed on the other side as tensor(['a', 'b']). No escaping is needed, since \x00 is the escaped representation of the null-terminating byte. Over the wire you will not see \x00
caisq
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 2 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)
src/io/io_utils_test.ts, line 336 at r4 (raw file):
здраво
Consider adding test cases for wider unicode characters, e.g., '正常'
src/io/types.ts, line 113 at r4 (raw file):
// U
Nit: use C-style comments for consistency with the rest of the file.
src/io/types.ts, line 117 at r4 (raw file):
including the delimiters.
State that there are n - 1 delimiters if there are n strings.
src/platforms/platform_node.ts, line 35 at r4 (raw file):
const util = require('util');
Why not import this at the top of the file?
dsmilkov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pyu10055 PTAL for this change in tfjs-core. I will detect the source encoding in Python in the other PR tensorflow/tfjs-converter#386 and will ping you when to review that PR as well. Thanks!
Reviewable status:
complete! 2 of 1 approvals obtained (waiting on @caisq and @pyu10055)
src/io/io_utils.ts, line 27 at r4 (raw file):
Previously, dsmilkov (Daniel Smilkov) wrote…
If you ship a tensor of 2 values
['a', 'b'], it will be shipped over the wire asa\x00bencoded in utf-8 (utf-8 assigns the 0 byte for\x00), parsed on the other side astensor(['a', 'b']). No escaping is needed, since\x00is the escaped representation of the null-terminating byte. Over the wire you will not see\x00
We talked offline. Any string tensor (from python) will be decoded before it gets encoded as utf-8. In UTF-8, no other Unicode code point can be encoded with a zero byte anywhere within it: https://stackoverflow.com/questions/6907297/can-utf-8-contain-zero-byte so no tensor will contain that delimiter as part of its value.
In the converter PR I will detect the source encoding to make sure that the decoding succeeds.
src/io/io_utils_test.ts, line 336 at r4 (raw file):
Previously, caisq (Shanqing Cai) wrote…
здравоConsider adding test cases for wider unicode characters, e.g., '正常'
Done.
src/io/types.ts, line 113 at r4 (raw file):
Previously, caisq (Shanqing Cai) wrote…
// UNit: use C-style comments for consistency with the rest of the file.
Done.
src/io/types.ts, line 117 at r4 (raw file):
Previously, caisq (Shanqing Cai) wrote…
including the delimiters.State that there are n - 1 delimiters if there are n strings.
Done.
src/platforms/platform_node.ts, line 35 at r4 (raw file):
Previously, caisq (Shanqing Cai) wrote…
const util = require('util');Why not import this at the top of the file?
Because it's a dynamic import that runs only when in node (in the browser this never gets called b/c it would fail).
caisq
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 2 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)
src/io/io_utils_test.ts, line 338 at r5 (raw file):
// Chinese.
- Indent seems off.
- This is not just Chinese. It's used in other languages as well. You can call it "East Asian" or "CJK" (Chinese-Japanese-Korean)
pyu10055
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 3 of 1 approvals obtained (waiting on @caisq and @dsmilkov)
dsmilkov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 3 of 1 approvals obtained (waiting on @caisq)
src/io/io_utils_test.ts, line 338 at r5 (raw file):
Previously, caisq (Shanqing Cai) wrote…
// Chinese.
- Indent seems off.
- This is not just Chinese. It's used in other languages as well. You can call it "East Asian" or "CJK" (Chinese-Japanese-Korean)
Thanks for clarifying! Corrected the comment. The clang-formatter wants to place the comment there for whatever reason.
stringtileandexpandDimsops (these ops are used by AutoML)PlatformThis is needed to enable execution of AutoML models, which store the vocab as a weight (
Consttensor) of dtypestring.Corresponding change in tfjs-converter: tensorflow/tfjs-converter#386
This change is