Skip to content
This repository was archived by the owner on Aug 15, 2019. It is now read-only.

Conversation

@dsmilkov
Copy link
Contributor

@dsmilkov dsmilkov commented Jun 18, 2019

  • Add support for deserializing weights of dtype string
  • Allow string tensors in tile and expandDims ops (these ops are used by AutoML)
  • Add utf-8 encoding/decoding capability to Platform

This is needed to enable execution of AutoML models, which store the vocab as a weight (Const tensor) of dtype string.

Corresponding change in tfjs-converter: tensorflow/tfjs-converter#386


This change is Reviewable

@dsmilkov
Copy link
Contributor Author

dsmilkov commented Jun 18, 2019

CLA is ok since I'm a Googler (the confusion is because I didn't use email address for the first commit - it was made from a new laptop)

@dsmilkov dsmilkov changed the title WIP Add support for string weights Add support for string weights Jun 18, 2019
@dsmilkov dsmilkov requested review from caisq and nsthorat June 18, 2019 13:38
@tensorflow tensorflow deleted a comment from googlebot Jun 18, 2019
Copy link
Contributor

@nsthorat nsthorat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 14 files at r2, 6 of 6 files at r3.
Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @nsthorat)


src/io/io_utils.ts, line 66 at r3 (raw file):

    const spec: WeightsManifestEntry = {name, shape: t.shape, dtype: t.dtype};
    if (t.dtype === 'string') {
      const utf8bytes =

small optimization: wrap this in a promise and add the promise to dataPromises so you don't serially await data here (Promise.all can parallelize)


src/ops/array_ops.ts, line 421 at r3 (raw file):

/** @doc {heading: 'Tensors', subheading: 'Slicing and Joining'} */
function tile_<T extends Tensor>(x: T|TensorLike, reps: number[]): T {
  const $x = convertToTensor(x, 'x', 'tile', null);

would be good to name null here and below


src/platforms/platform_browser.ts, line 20 at r3 (raw file):

import {Platform} from './platform';

export class PlatformBrowser implements Platform {

add a simple unit test for this file in platform_browser_test


src/platforms/platform_node.ts, line 26 at r3 (raw file):

  importFetch: () => require('node-fetch')
};

add a simple unit test for this file in platform_browser_test

@nsthorat
Copy link
Contributor

LGTM, but wait for approval from Nick / Ping on the doc.

@dsmilkov
Copy link
Contributor Author

Thanks, yup will do

@googlebot
Copy link

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

Copy link
Contributor Author

@dsmilkov dsmilkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @caisq)


src/io/io_utils.ts, line 66 at r3 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

small optimization: wrap this in a promise and add the promise to dataPromises so you don't serially await data here (Promise.all can parallelize)

Done.


src/ops/array_ops.ts, line 421 at r3 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

would be good to name null here and below

Done.


src/platforms/platform_browser.ts, line 20 at r3 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

add a simple unit test for this file in platform_browser_test

Done.


src/platforms/platform_node.ts, line 26 at r3 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

add a simple unit test for this file in platform_browser_test

Done.

Copy link
Collaborator

@pyu10055 pyu10055 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @caisq and @dsmilkov)


src/io/io_utils.ts, line 27 at r4 (raw file):

/** Used to delimit neighboring strings when encoding string tensors. */
export const STRING_DELIMITER = '\x00';

will the tensor itself contains this delimiter? should it be escaped before encoding?

Copy link
Contributor Author

@dsmilkov dsmilkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @caisq and @pyu10055)


src/io/io_utils.ts, line 27 at r4 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

will the tensor itself contains this delimiter? should it be escaped before encoding?

If you ship a tensor of 2 values ['a', 'b'], it will be shipped over the wire as a\x00b encoded in utf-8 (utf-8 assigns the 0 byte for \x00), parsed on the other side as tensor(['a', 'b']). No escaping is needed, since \x00 is the escaped representation of the null-terminating byte. Over the wire you will not see \x00

Copy link
Collaborator

@caisq caisq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)


src/io/io_utils_test.ts, line 336 at r4 (raw file):

здраво

Consider adding test cases for wider unicode characters, e.g., '正常'


src/io/types.ts, line 113 at r4 (raw file):

  // U

Nit: use C-style comments for consistency with the rest of the file.


src/io/types.ts, line 117 at r4 (raw file):

including the delimiters.

State that there are n - 1 delimiters if there are n strings.


src/platforms/platform_node.ts, line 35 at r4 (raw file):

const util = require('util');

Why not import this at the top of the file?

Copy link
Contributor Author

@dsmilkov dsmilkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pyu10055 PTAL for this change in tfjs-core. I will detect the source encoding in Python in the other PR tensorflow/tfjs-converter#386 and will ping you when to review that PR as well. Thanks!

Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @caisq and @pyu10055)


src/io/io_utils.ts, line 27 at r4 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

If you ship a tensor of 2 values ['a', 'b'], it will be shipped over the wire as a\x00b encoded in utf-8 (utf-8 assigns the 0 byte for \x00), parsed on the other side as tensor(['a', 'b']). No escaping is needed, since \x00 is the escaped representation of the null-terminating byte. Over the wire you will not see \x00

We talked offline. Any string tensor (from python) will be decoded before it gets encoded as utf-8. In UTF-8, no other Unicode code point can be encoded with a zero byte anywhere within it: https://stackoverflow.com/questions/6907297/can-utf-8-contain-zero-byte so no tensor will contain that delimiter as part of its value.

In the converter PR I will detect the source encoding to make sure that the decoding succeeds.


src/io/io_utils_test.ts, line 336 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…
здраво

Consider adding test cases for wider unicode characters, e.g., '正常'

Done.


src/io/types.ts, line 113 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…
  // U

Nit: use C-style comments for consistency with the rest of the file.

Done.


src/io/types.ts, line 117 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…
including the delimiters.

State that there are n - 1 delimiters if there are n strings.

Done.


src/platforms/platform_node.ts, line 35 at r4 (raw file):

Previously, caisq (Shanqing Cai) wrote…
const util = require('util');

Why not import this at the top of the file?

Because it's a dynamic import that runs only when in node (in the browser this never gets called b/c it would fail).

Copy link
Collaborator

@caisq caisq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @caisq, @dsmilkov, and @pyu10055)


src/io/io_utils_test.ts, line 338 at r5 (raw file):

// Chinese.
  1. Indent seems off.
  2. This is not just Chinese. It's used in other languages as well. You can call it "East Asian" or "CJK" (Chinese-Japanese-Korean)

Copy link
Collaborator

@pyu10055 pyu10055 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 3 of 1 approvals obtained (waiting on @caisq and @dsmilkov)

Copy link
Contributor Author

@dsmilkov dsmilkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 3 of 1 approvals obtained (waiting on @caisq)


src/io/io_utils_test.ts, line 338 at r5 (raw file):

Previously, caisq (Shanqing Cai) wrote…
// Chinese.
  1. Indent seems off.
  2. This is not just Chinese. It's used in other languages as well. You can call it "East Asian" or "CJK" (Chinese-Japanese-Korean)

Thanks for clarifying! Corrected the comment. The clang-formatter wants to place the comment there for whatever reason.

@dsmilkov dsmilkov merged commit a68c1ce into master Jun 19, 2019
@dsmilkov dsmilkov deleted the string-weights branch June 19, 2019 22:43
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants