Skip to content
This repository has been archived by the owner on Aug 15, 2019. It is now read-only.

Introduce src/contrib/data. #639

Merged
merged 11 commits into from Feb 13, 2018
Merged

Conversation

davidsoergel
Copy link
Member

@davidsoergel davidsoergel commented Feb 3, 2018

This subpackage provides Dataset, DataStream, and related subclasses and utilities.

It provides for streaming data loading and preprocessing, with the intent of feeding machine learning models with data for both training and evaluation.

This cannot yet replace src/data, nor does it yet support the needs of the current deeplearn.js demos. It's just an initial commit, moving the current version of the code unmodified from its private repo. Deeper integration with deeplearn.js will follow.


This change is Reviewable

This cannot yet replace src/data, nor does it yet support the needs of the demos.  It's just an initial commit, moving the current version of the code unmodified from its private repo.
@dsmilkov
Copy link
Contributor

dsmilkov commented Feb 4, 2018

Reviewed 5 of 23 files at r1.
Review status: 5 of 23 files reviewed at latest revision, 9 unresolved discussions.


src/contrib/data/dataset.ts, line 20 at r1 (raw file):

the most that NDArray currently supports

FYI, NDArray supports arbitrary dimensions. We have sugar for up to Array4D, but you can make arbitrary arrays using the internal NDArray.make(this.shape, {values}, dtype) The result you get is typed as NDArray<Rank> (which means any rank)


src/contrib/data/dataset.ts, line 39 at r1 (raw file):

 * A map from string keys (aka column names) to values for a single element.
 */
export type DatasetElement = {

curious: does tf.data also make an explicit distinction between DatasetElement and DatasetBatch? Or is this distinction solely for typings? Would users that use plain js see this distinction?


src/contrib/data/dataset.ts, line 61 at r1 (raw file):

 * A `Dataset` provides a stream of unbatched examples, and its transformations
 * are applied one example at a time.  Batching produces a BatchDataset, and so
 * must come last in the pipeline because there are no batch-enabled

will there be batch-enabled transformation later? in that case, might be good to say: Currently, there are no batch-...


src/contrib/data/dataset.ts, line 74 at r1 (raw file):

   * Create a `Dataset` from an array of elements.
   */
  static ofElements(items: DatasetElement[]): Dataset {

FYI, when these methods face the user, in JS land, it's super-unusual to have static methods e.g. dl.Dataset.ofElements(). We've been deprecating all those static methods. Much more common is to have dl.data.fromElements() where data is just a namespace. This doesn't change anything here, but just something to be aware of in naming all these static methods as if there were free floating functions. All the static methods will all end up under the same namespace dl.data.* regardless of which class they live.

Also note I prefer from instead of of since we already have several user-facing methods that use from (e.g. dl.fromPixels)


src/contrib/data/dataset.ts, line 74 at r1 (raw file):

   * Create a `Dataset` from an array of elements.
   */
  static ofElements(items: DatasetElement[]): Dataset {

Another note: I don't see ofElements in tf.data. Is this analogous to from_tensors? Maybe we should name it like that.


src/contrib/data/dataset.ts, line 88 at r1 (raw file):

   * nondeterministic order, then this concatenated `Dataset` will do the same.
   */
  static ofConcatenated(datasets: Dataset[]) {

why not just call it concatenate? Also I don't see ofConcatenated in tf.data. There is concatenate.


src/contrib/data/dataset.ts, line 160 at r1 (raw file):

   * @returns A `Dataset`.
   */
  concatenate(dataset: Dataset): Dataset {

merge this and ofConcatenated above


src/contrib/data/dataset.ts, line 293 at r1 (raw file):

 *   batchSize elements.  (Default true).
 */
export class BatchDataset {

I might be misunderstanding as I just started looking at tf.data, but I don't see any explicit distinction between BatchDataset and Dataset in tf.data. I'm curious what are the motivations for that distinction.


src/contrib/data/stream.ts, line 11 at r1 (raw file):

 * stream of elements.
 */
export abstract class DataStream<T> {

for my own understanding, is this basically tf.data.Iterator? Was the idea to call it stream to make it explicit that it is asynchronous? I'm good with that, just wanted to know your thoughts.


Comments from Reviewable

@dsmilkov
Copy link
Contributor

dsmilkov commented Feb 4, 2018

This looks great! Left a few high-level comments (mostly for my own understanding) so we can discuss tomorrow in person

@davidsoergel
Copy link
Member Author

Review status: 4 of 23 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


src/contrib/data/dataset.ts, line 20 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

the most that NDArray currently supports

FYI, NDArray supports arbitrary dimensions. We have sugar for up to Array4D, but you can make arbitrary arrays using the internal NDArray.make(this.shape, {values}, dtype) The result you get is typed as NDArray<Rank> (which means any rank)

Done.


src/contrib/data/dataset.ts, line 39 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

curious: does tf.data also make an explicit distinction between DatasetElement and DatasetBatch? Or is this distinction solely for typings? Would users that use plain js see this distinction?

tf.data does not make the distinction; indeed TF itself does not. I think that's a mistake since an element and a batch are semantically different: some transformations make sense on one but not the other, and some transformations need to be implemented differently in the two cases. Since we are in a typed context we should be explicit about this.


src/contrib/data/dataset.ts, line 61 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

will there be batch-enabled transformation later? in that case, might be good to say: Currently, there are no batch-...

Done.


src/contrib/data/dataset.ts, line 74 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

FYI, when these methods face the user, in JS land, it's super-unusual to have static methods e.g. dl.Dataset.ofElements(). We've been deprecating all those static methods. Much more common is to have dl.data.fromElements() where data is just a namespace. This doesn't change anything here, but just something to be aware of in naming all these static methods as if there were free floating functions. All the static methods will all end up under the same namespace dl.data.* regardless of which class they live.

Also note I prefer from instead of of since we already have several user-facing methods that use from (e.g. dl.fromPixels)

OK, I renamed all the static methods from of* to from*. I'm leaving them in place for now, but we can move them to be free functions in a later CL if you like. (I just don't want to edit too dramatically here, so that it's clear how this code was moved from the other repo).


src/contrib/data/dataset.ts, line 74 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

Another note: I don't see ofElements in tf.data. Is this analogous to from_tensors? Maybe we should name it like that.

No, from_tensors creates a tf.data.Dataset with a single element. Here we're just wrapping an existing array of DatasetElements as a Dataset.


src/contrib/data/dataset.ts, line 88 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

why not just call it concatenate? Also I don't see ofConcatenated in tf.data. There is concatenate.

These are two different things: a.concatenate(b) is an instance method. It gives the same result as fromConcatenated([a, b]), which is a static method. However the latter allows concatenating many Datasets at once, i.e. fromConcatenated([a, b, c, d, ...]).


src/contrib/data/dataset.ts, line 160 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

merge this and ofConcatenated above

See above


src/contrib/data/dataset.ts, line 293 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

I might be misunderstanding as I just started looking at tf.data, but I don't see any explicit distinction between BatchDataset and Dataset in tf.data. I'm curious what are the motivations for that distinction.

See above. Briefly, there are things you can do with a Dataset that don't make sense (or need to be implemented differently) on a BatchDataset. For instance, if you want to filter elements where foobar < 42, Dataset.filter() and BatchDataset.filter() have very different implementations.


Comments from Reviewable

@nsthorat
Copy link
Contributor

nsthorat commented Feb 7, 2018

High level comments:

  • Add licenses to all files
  • What do you think about making directories "streams" and "datasets" , moving individual classes out into their own files (like in java), and leaving the core stuff like "dataset.ts" / "batch_dataset.ts" / "stream.ts" in the root I think that will make it quite clear where new pieces go.

I will do another in depth pass once the cosmetic stuff of breaking files up / directory structure is done so it's a little easier to grok.


Review status: 4 of 23 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed.


src/test_util.ts, line 287 at r2 (raw file):

}

export function describeMathCPUAndGPU(testName: string, tests: MathTests[]) {

Maybe we can just use CPU for your unit tests (we don't have this because GPU we usually also want to test other flags).


src/contrib/data/dataset.ts, line 18 at r2 (raw file):

 * n-dimensional array.
 */
export type ElementArray = number|number[]|NDArray|string|undefined;

you don't need to explicitly type "undefined"


src/contrib/data/dataset.ts, line 25 at r2 (raw file):

 * Such a value must always have a batch dimension, even if it is of length 1.
 */
export type BatchArray = NDArray|string[];

since these are all used in multiple places, consider moving these to a "types.ts" file


src/contrib/data/dataset.ts, line 66 at r2 (raw file):

   */
  static fromElements(items: DatasetElement[]): Dataset {
    return new (class ArrayDataset extends Dataset {

this is a little funky to read, can we just have this as a separate class?


src/contrib/data/dataset.ts, line 80 at r2 (raw file):

   */
  static fromConcatenated(datasets: Dataset[]) {
    return new (class OfConcatenatedDataset extends Dataset {

same here


src/contrib/data/dataset.ts, line 284 at r2 (raw file):

 *   batchSize elements.  (Default true).
 */
export class BatchDataset {

I think this makes sense in another file


src/contrib/data/dataset_test.ts, line 93 at r2 (raw file):

  });

  it('can be repeated indefinitely', (done) => {

when there is only one arg, you can simply write done => {


src/contrib/data/decode_utf8.ts, line 1 at r2 (raw file):

// TODO(michaelterry): Determine root cause of why this import is

should this file be called utf8_stream.ts?


src/contrib/data/stream.ts, line 1 at r2 (raw file):

// Here we implement a simple asynchronous iterator.

would be good to split this file up


src/contrib/data/stream.ts, line 17 at r2 (raw file):

   * Calling next() on a closed stream returns `undefined`.
   */
  abstract async next(): Promise<T|undefined>;

no need for "undefined" explicitly here


src/contrib/data/stream.ts, line 148 at r2 (raw file):

   */
  skip(count: number): DataStream<T> {
    if (count < 0 || count === undefined) return this;

instead of "=== undefined" use "== null" here and above


src/contrib/data/stream.ts, line 246 at r2 (raw file):

  async pump(): Promise<boolean> {
    const item = await this.upstream.next();
    if (item === undefined) {

loose equality "== null" and everywhere else in this file.

A good reference on why: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Equality_comparisons_and_sameness


Comments from Reviewable

@davidsoergel
Copy link
Member Author

Yep I was planning to do that kind of large-scale rearrangement but originally figured I should do a clean copy first.

Option A) submit this as is and reorganize in later PRs, B) give up on making a clean copy and reorganize in this PR.

The point of a "clean copy" was that you've seen much of this code before, but if you want to review it thoroughly now anyway (which I certainly welcome!) then there's no benefit. So I'll go straight to option B unless you prefer otherwise.


Review status: 4 of 23 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed.


Comments from Reviewable

@davidsoergel
Copy link
Member Author

Review status: 4 of 23 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed.


src/contrib/data/dataset.ts, line 18 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

you don't need to explicitly type "undefined"

I was considering using --strictNullChecks (but never actually turned it on). Have you considered doing that in DLJS?


Comments from Reviewable

@davidsoergel
Copy link
Member Author

BTW since I'm now reorganizing anyway, I can easily break this up into ~5 consecutive PRs if you prefer. On the other hand it may make more sense as a unit. WDYT?


Review status: 4 of 23 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed.


Comments from Reviewable

@davidsoergel
Copy link
Member Author

Review status: 2 of 32 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed.


src/contrib/data/dataset.ts, line 293 at r1 (raw file):

Previously, davidsoergel (David Soergel) wrote…

See above. Briefly, there are things you can do with a Dataset that don't make sense (or need to be implemented differently) on a BatchDataset. For instance, if you want to filter elements where foobar < 42, Dataset.filter() and BatchDataset.filter() have very different implementations.

Oh, the other thing I forgot to mention: a Dataset makes no claims about the values in its constituent columns, but a BatchDataset asserts that every value agrees on the length of the batch dimension-- i.e., the data is column-oriented per batch, and the columns can be sensibly zipped.


src/contrib/data/dataset.ts, line 18 at r2 (raw file):

Previously, davidsoergel (David Soergel) wrote…

I was considering using --strictNullChecks (but never actually turned it on). Have you considered doing that in DLJS?

As a Persnickety Purist, I would like to turn on --strictNullChecks at some point if possible, but that's obviously not urgent. Until then I got rid of explicit "undefined" types thrroughout; we can always reinstate them.


src/contrib/data/dataset.ts, line 25 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

since these are all used in multiple places, consider moving these to a "types.ts" file

Done.


src/contrib/data/dataset.ts, line 66 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

this is a little funky to read, can we just have this as a separate class?

Fixed using Daniel's suggestion to just construct a Dataset by passing a getStream() method.


src/contrib/data/dataset.ts, line 80 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

same here

Done.


src/contrib/data/dataset.ts, line 284 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

I think this makes sense in another file

Done.


src/contrib/data/dataset_test.ts, line 93 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

when there is only one arg, you can simply write done => {

Done throughout.


src/test_util.ts, line 287 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

Maybe we can just use CPU for your unit tests (we don't have this because GPU we usually also want to test other flags).

Done.


src/contrib/data/decode_utf8.ts, line 1 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

should this file be called utf8_stream.ts?

Now folded into byte_stream.ts due to the circular-dependency issue.


src/contrib/data/stream.ts, line 11 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

for my own understanding, is this basically tf.data.Iterator? Was the idea to call it stream to make it explicit that it is asynchronous? I'm good with that, just wanted to know your thoughts.

Yes, this is basically an Iterator. Indeed this one is explicitly async (i.e., next() returns a Promise). Also I found that there is a lot of precedent in the JavaScript community for calling this sort of thing a "stream" so I went with that terminology.


src/contrib/data/stream.ts, line 1 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

would be good to split this file up

Tried but couldn't, due to circular dependencies as discussed offline.


src/contrib/data/stream.ts, line 17 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

no need for "undefined" explicitly here

Done.


src/contrib/data/stream.ts, line 148 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

instead of "=== undefined" use "== null" here and above

Done throughout.


src/contrib/data/stream.ts, line 246 at r2 (raw file):

Previously, nsthorat (Nikhil Thorat) wrote…

loose equality "== null" and everywhere else in this file.

A good reference on why: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Equality_comparisons_and_sameness

Done.


Comments from Reviewable

@dsmilkov
Copy link
Contributor

:lgtm_strong: You'll have to merge with the latest dir reorg - ping me when done and I'll merge this in!


Review status: 0 of 33 files reviewed at latest revision, 17 unresolved discussions, some commit checks failed.


package.json, line 16 at r3 (raw file):

^

replace all ^ with ~


src/contrib/data/batch_dataset.ts, line 19 at r3 (raw file):

math

we just made a big dir reorg, tensor.ts moved from /src/math to src/, you'll have to merge again. sorry about that


src/contrib/data/batch_dataset.ts, line 120 at r3 (raw file):

private static

FYI, a much more js way is to have a free floating function outside the class, instead of private static


src/contrib/data/dataset.ts, line 18 at r2 (raw file):

Previously, davidsoergel (David Soergel) wrote…

As a Persnickety Purist, I would like to turn on --strictNullChecks at some point if possible, but that's obviously not urgent. Until then I got rid of explicit "undefined" types thrroughout; we can always reinstate them.

We've had strictNullChecks and we decided to turn it off. strict null checks made the code base less readable and it provided no value. We haven't had a bug that would have been caught due to strict null checks.


src/contrib/data/dataset.ts, line 217 at r3 (raw file):

 * A Dataset defined by a getStream() function given as a constructor argument.
 */
export class DatasetFromStreamFn extends Dataset {

class DatasetFromStreamFn looks strange. Why not have it as an actual function:

function datasetFromStream(getStreamFn: () => Promise<...>) {
  return anonymous class....
}

which you call it from other places.


src/contrib/data/readers.ts, line 94 at r1 (raw file):

   *   line of the input.
   */
  static async create(input: DataSource, csvColumnNames?: string[]) {

if user doesn't provide headers, let's make headers keys be the numbers 0, 1, ... N, to be compatible with other libs, like d3.tsv/csv


Comments from Reviewable

@davidsoergel
Copy link
Member Author

All set, thanks!


Review status: 0 of 33 files reviewed at latest revision, 17 unresolved discussions, some commit checks pending.


package.json, line 16 at r3 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

^

replace all ^ with ~

Done.


src/contrib/data/batch_dataset.ts, line 19 at r3 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

math

we just made a big dir reorg, tensor.ts moved from /src/math to src/, you'll have to merge again. sorry about that

Done.


src/contrib/data/batch_dataset.ts, line 120 at r3 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

private static

FYI, a much more js way is to have a free floating function outside the class, instead of private static

Done here and a few other places


src/contrib/data/dataset.ts, line 18 at r2 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

We've had strictNullChecks and we decided to turn it off. strict null checks made the code base less readable and it provided no value. We haven't had a bug that would have been caught due to strict null checks.

OK, thanks for the info!


src/contrib/data/dataset.ts, line 217 at r3 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

class DatasetFromStreamFn looks strange. Why not have it as an actual function:

function datasetFromStream(getStreamFn: () => Promise<...>) {
  return anonymous class....
}

which you call it from other places.

Done.


src/contrib/data/datasets/csv_dataset.ts, line 94 at r1 (raw file):

Previously, dsmilkov (Daniel Smilkov) wrote…

if user doesn't provide headers, let's make headers keys be the numbers 0, 1, ... N, to be compatible with other libs, like d3.tsv/csv

Yep I had a note about this and had figured I'd do it in a followup CL, but now I did it here.


Comments from Reviewable

@dsmilkov dsmilkov merged commit 3cb2f66 into tensorflow:master Feb 13, 2018
@dsmilkov
Copy link
Contributor

Hey David,

Several tests fail at HEAD due to this PR. Can you take a look when you get a chance. Travis didn't run the tests since you didn't have write access to our repo. Will give you access now. See https://api.travis-ci.org/v3/job/340794358/log.txt for the failing tests - search for failed)

@davidsoergel
Copy link
Member Author

Huh that's weird, everything passed for me. Looking into it now-- sorry about that!

@nsthorat
Copy link
Contributor

nsthorat commented Feb 13, 2018

Sent #701 for review to increase test timeout interval.

Will merge in the morning since I'm tired, but the travis build passes.

@davidsoergel
Copy link
Member Author

Thanks much! I found that the tests are running near the timeout locally, and I can make them fail just by making the test streams longer--so it makes sense that as written they might always timeout on a slower machine. Maybe I have a serious performance bug, but at least I have a place to start in the morning.

(It's already on my list to get some benchmarks set up; just upped the priority :S)


Comments from Reviewable

@kedevked kedevked mentioned this pull request Aug 28, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants