This cannot yet replace src/data, nor does it yet support the needs of the demos. It's just an initial commit, moving the current version of the code unmodified from its private repo.
Reviewed 5 of 23 files at r1. src/contrib/data/dataset.ts, line 20 at r1 (raw file):
FYI, NDArray supports arbitrary dimensions. We have sugar for up to src/contrib/data/dataset.ts, line 39 at r1 (raw file):
curious: does tf.data also make an explicit distinction between DatasetElement and DatasetBatch? Or is this distinction solely for typings? Would users that use plain js see this distinction? src/contrib/data/dataset.ts, line 61 at r1 (raw file):
will there be batch-enabled transformation later? in that case, might be good to say: src/contrib/data/dataset.ts, line 74 at r1 (raw file):
FYI, when these methods face the user, in JS land, it's super-unusual to have static methods e.g. Also note I prefer src/contrib/data/dataset.ts, line 74 at r1 (raw file):
Another note: I don't see src/contrib/data/dataset.ts, line 88 at r1 (raw file):
why not just call it concatenate? Also I don't see src/contrib/data/dataset.ts, line 160 at r1 (raw file):
merge this and ofConcatenated above src/contrib/data/dataset.ts, line 293 at r1 (raw file):
I might be misunderstanding as I just started looking at tf.data, but I don't see any explicit distinction between BatchDataset and Dataset in tf.data. I'm curious what are the motivations for that distinction. src/contrib/data/stream.ts, line 11 at r1 (raw file):
for my own understanding, is this basically tf.data.Iterator? Was the idea to call it stream to make it explicit that it is asynchronous? I'm good with that, just wanted to know your thoughts. Comments from Reviewable |
This looks great! Left a few high-level comments (mostly for my own understanding) so we can discuss tomorrow in person |
Review status: 4 of 23 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed. src/contrib/data/dataset.ts, line 20 at r1 (raw file): Previously, dsmilkov (Daniel Smilkov) wrote…
Done. src/contrib/data/dataset.ts, line 39 at r1 (raw file): Previously, dsmilkov (Daniel Smilkov) wrote…
tf.data does not make the distinction; indeed TF itself does not. I think that's a mistake since an element and a batch are semantically different: some transformations make sense on one but not the other, and some transformations need to be implemented differently in the two cases. Since we are in a typed context we should be explicit about this. src/contrib/data/dataset.ts, line 61 at r1 (raw file): Previously, dsmilkov (Daniel Smilkov) wrote…
Done. src/contrib/data/dataset.ts, line 74 at r1 (raw file): Previously, dsmilkov (Daniel Smilkov) wrote…
OK, I renamed all the static methods from of* to from*. I'm leaving them in place for now, but we can move them to be free functions in a later CL if you like. (I just don't want to edit too dramatically here, so that it's clear how this code was moved from the other repo). src/contrib/data/dataset.ts, line 74 at r1 (raw file): Previously, dsmilkov (Daniel Smilkov) wrote…
No, from_tensors creates a tf.data.Dataset with a single element. Here we're just wrapping an existing array of DatasetElements as a Dataset. src/contrib/data/dataset.ts, line 88 at r1 (raw file): Previously, dsmilkov (Daniel Smilkov) wrote…
These are two different things: a.concatenate(b) is an instance method. It gives the same result as fromConcatenated([a, b]), which is a static method. However the latter allows concatenating many Datasets at once, i.e. fromConcatenated([a, b, c, d, ...]). src/contrib/data/dataset.ts, line 160 at r1 (raw file): Previously, dsmilkov (Daniel Smilkov) wrote…
See above src/contrib/data/dataset.ts, line 293 at r1 (raw file): Previously, dsmilkov (Daniel Smilkov) wrote…
See above. Briefly, there are things you can do with a Dataset that don't make sense (or need to be implemented differently) on a BatchDataset. For instance, if you want to filter elements where foobar < 42, Dataset.filter() and BatchDataset.filter() have very different implementations. Comments from Reviewable |
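To illustrate the filter() point above, here is a minimal sketch of why the two cases need different implementations. It uses plain arrays rather than tensors, and `Row`, `RowBatch`, `filterRows`, and `filterBatch` are hypothetical names for illustration only, not the types or methods in this PR.

```typescript
// Hypothetical, simplified stand-ins for DatasetElement and DatasetBatch.
type Row = {[column: string]: number};
type RowBatch = {[column: string]: number[]};

// Element-wise filter: each element is kept or dropped as a whole.
function filterRows(rows: Row[], predicate: (r: Row) => boolean): Row[] {
  return rows.filter(predicate);
}

// Batch-wise filter: evaluate the predicate per position along the batch
// dimension, then apply the same mask to every column so they stay aligned.
function filterBatch(batch: RowBatch, predicate: (r: Row) => boolean): RowBatch {
  const columns = Object.keys(batch);
  const batchSize = columns.length === 0 ? 0 : batch[columns[0]].length;
  const keep: boolean[] = [];
  for (let i = 0; i < batchSize; i++) {
    const row: Row = {};
    for (const c of columns) {
      row[c] = batch[c][i];
    }
    keep.push(predicate(row));
  }
  const result: RowBatch = {};
  for (const c of columns) {
    result[c] = batch[c].filter((_, i) => keep[i]);
  }
  return result;
}

// Usage: keep only positions where foobar < 42.
const filtered = filterBatch(
    {foobar: [1, 50, 41], label: [0, 1, 2]}, (r) => r.foobar < 42);
```

In the batch case the result has a smaller batch dimension than the input, which is one reason a batch-level filter cannot simply reuse the element-level one.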
High level comments:
I will do another in depth pass once the cosmetic stuff of breaking files up / directory structure is done so it's a little easier to grok. Review status: 4 of 23 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed. src/test_util.ts, line 287 at r2 (raw file):
Maybe we can just use CPU for your unit tests (we don't have this because for GPU we usually also want to test other flags). src/contrib/data/dataset.ts, line 18 at r2 (raw file):
you don't need to explicitly type "undefined" src/contrib/data/dataset.ts, line 25 at r2 (raw file):
since these are all used in multiple places, consider moving these to a "types.ts" file src/contrib/data/dataset.ts, line 66 at r2 (raw file):
this is a little funky to read, can we just have this as a separate class? src/contrib/data/dataset.ts, line 80 at r2 (raw file):
same here src/contrib/data/dataset.ts, line 284 at r2 (raw file):
I think this makes sense in another file src/contrib/data/dataset_test.ts, line 93 at r2 (raw file):
when there is only one arg, you can simply write src/contrib/data/decode_utf8.ts, line 1 at r2 (raw file):
should this file be called utf8_stream.ts? src/contrib/data/stream.ts, line 1 at r2 (raw file):
would be good to split this file up src/contrib/data/stream.ts, line 17 at r2 (raw file):
no need for "undefined" explicitly here src/contrib/data/stream.ts, line 148 at r2 (raw file):
instead of "=== undefined" use "== null" here and above src/contrib/data/stream.ts, line 246 at r2 (raw file):
Use loose equality "== null" here and everywhere else in this file. A good reference on why: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Equality_comparisons_and_sameness Comments from Reviewable |
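For context on the "== null" suggestion: loose equality against null is true for both null and undefined and for nothing else, so one check covers both "missing" values. A small sketch follows; `isMissing` is a hypothetical helper name, not something in this PR.

```typescript
// Hypothetical helper: true only for null and undefined.
function isMissing(value: unknown): boolean {
  return value == null;
}

// Falsy values such as 0, '' and false are not treated as missing.
const checks = [
  isMissing(undefined),  // true
  isMissing(null),       // true
  isMissing(0),          // false
  isMissing(''),         // false
  isMissing(false),      // false
];
```

By contrast, "=== undefined" would let a null slip through.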
Yep I was planning to do that kind of large-scale rearrangement but originally figured I should do a clean copy first. Option A) submit this as is and reorganize in later PRs, B) give up on making a clean copy and reorganize in this PR. The point of a "clean copy" was that you've seen much of this code before, but if you want to review it thoroughly now anyway (which I certainly welcome!) then there's no benefit. So I'll go straight to option B unless you prefer otherwise. Review status: 4 of 23 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed. Comments from Reviewable |
Review status: 4 of 23 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed. src/contrib/data/dataset.ts, line 18 at r2 (raw file): Previously, nsthorat (Nikhil Thorat) wrote…
I was considering using --strictNullChecks (but never actually turned it on). Have you considered doing that in DLJS? Comments from Reviewable |
BTW since I'm now reorganizing anyway, I can easily break this up into ~5 consecutive PRs if you prefer. On the other hand it may make more sense as a unit. WDYT? Review status: 4 of 23 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed. Comments from Reviewable |
Review status: 2 of 32 files reviewed at latest revision, 20 unresolved discussions, some commit checks failed. src/contrib/data/dataset.ts, line 293 at r1 (raw file): Previously, davidsoergel (David Soergel) wrote…
Oh, the other thing I forgot to mention: a Dataset makes no claims about the values in its constituent columns, but a BatchDataset asserts that every value agrees on the length of the batch dimension-- i.e., the data is column-oriented per batch, and the columns can be sensibly zipped. src/contrib/data/dataset.ts, line 18 at r2 (raw file): Previously, davidsoergel (David Soergel) wrote…
As a Persnickety Purist, I would like to turn on --strictNullChecks at some point if possible, but that's obviously not urgent. Until then I got rid of explicit "undefined" types throughout; we can always reinstate them. src/contrib/data/dataset.ts, line 25 at r2 (raw file): Previously, nsthorat (Nikhil Thorat) wrote…
Done. src/contrib/data/dataset.ts, line 66 at r2 (raw file): Previously, nsthorat (Nikhil Thorat) wrote…
Fixed using Daniel's suggestion to just construct a Dataset by passing a getStream() method. src/contrib/data/dataset.ts, line 80 at r2 (raw file): Previously, nsthorat (Nikhil Thorat) wrote…
Done. src/contrib/data/dataset.ts, line 284 at r2 (raw file): Previously, nsthorat (Nikhil Thorat) wrote…
Done. src/contrib/data/dataset_test.ts, line 93 at r2 (raw file): Previously, nsthorat (Nikhil Thorat) wrote…
Done throughout. src/test_util.ts, line 287 at r2 (raw file): Previously, nsthorat (Nikhil Thorat) wrote…
Done. src/contrib/data/decode_utf8.ts, line 1 at r2 (raw file): Previously, nsthorat (Nikhil Thorat) wrote…
Now folded into byte_stream.ts due to the circular-dependency issue. src/contrib/data/stream.ts, line 11 at r1 (raw file): Previously, dsmilkov (Daniel Smilkov) wrote…
Yes, this is basically an Iterator. Indeed this one is explicitly async (i.e., next() returns a Promise). Also I found that there is a lot of precedent in the JavaScript community for calling this sort of thing a "stream" so I went with that terminology. src/contrib/data/stream.ts, line 1 at r2 (raw file): Previously, nsthorat (Nikhil Thorat) wrote…
Tried but couldn't, due to circular dependencies as discussed offline. src/contrib/data/stream.ts, line 17 at r2 (raw file): Previously, nsthorat (Nikhil Thorat) wrote…
Done. src/contrib/data/stream.ts, line 148 at r2 (raw file): Previously, nsthorat (Nikhil Thorat) wrote…
Done throughout. src/contrib/data/stream.ts, line 246 at r2 (raw file): Previously, nsthorat (Nikhil Thorat) wrote…
Done. Comments from Reviewable |
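As a rough picture of the "explicitly async iterator" idea discussed above, here is a minimal sketch of a stream whose next() returns a Promise. The interface and helpers (`SketchStream`, `streamFromArray`, `collect`) are hypothetical, and using undefined to signal exhaustion is an assumption; the actual DataStream class in this PR may differ.

```typescript
// Hypothetical interface, for illustration only.
interface SketchStream<T> {
  // Resolves to the next item, or undefined once the stream is exhausted
  // (assumed convention).
  next(): Promise<T | undefined>;
}

// Wrap an in-memory array as an async stream.
function streamFromArray<T>(items: T[]): SketchStream<T> {
  let i = 0;
  return {
    next: async () => (i < items.length ? items[i++] : undefined),
  };
}

// Drain a stream into an array, awaiting each item in turn.
async function collect<T>(stream: SketchStream<T>): Promise<T[]> {
  const out: T[] = [];
  for (let x = await stream.next(); x != null; x = await stream.next()) {
    out.push(x);
  }
  return out;
}
```

Because every call to next() is awaited, a consumer written this way works unchanged whether the items come from memory, a file, or the network.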
You'll have to merge with the latest dir reorg - ping me when done and I'll merge this in! Review status: 0 of 33 files reviewed at latest revision, 17 unresolved discussions, some commit checks failed. package.json, line 16 at r3 (raw file):
replace all src/contrib/data/batch_dataset.ts, line 19 at r3 (raw file):
we just made a big dir reorg, tensor.ts moved from /src/math to src/, you'll have to merge again. sorry about that src/contrib/data/batch_dataset.ts, line 120 at r3 (raw file):
FYI, a much more js way is to have a free floating function outside the class, instead of src/contrib/data/dataset.ts, line 18 at r2 (raw file): Previously, davidsoergel (David Soergel) wrote…
We've had strictNullChecks and we decided to turn it off. strict null checks made the code base less readable and it provided no value. We haven't had a bug that would have been caught due to strict null checks. src/contrib/data/dataset.ts, line 217 at r3 (raw file):
which you call it from other places. src/contrib/data/readers.ts, line 94 at r1 (raw file):
if user doesn't provide headers, let's make headers keys be the numbers 0, 1, ... N, to be compatible with other libs, like d3.tsv/csv Comments from Reviewable |
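A minimal sketch of the header fallback suggested above: when no headers are given, each value is keyed by its column index. `rowToObject` is a hypothetical helper for illustration, not the reader code in this PR, and it makes no claim about what d3 itself does.

```typescript
// Hypothetical helper: turn one parsed CSV row into a keyed object.
// If headers are omitted, the keys default to "0", "1", ..., "N".
function rowToObject(
    values: string[], headers?: string[]): {[key: string]: string} {
  const keys = headers != null ? headers : values.map((_, i) => String(i));
  const row: {[key: string]: string} = {};
  values.forEach((v, i) => {
    row[keys[i]] = v;
  });
  return row;
}

const withoutHeaders = rowToObject(['a', 'b']);
const withHeaders = rowToObject(['a', 'b'], ['x', 'y']);
```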
All set, thanks! Review status: 0 of 33 files reviewed at latest revision, 17 unresolved discussions, some commit checks pending. package.json, line 16 at r3 (raw file): Previously, dsmilkov (Daniel Smilkov) wrote…
Done. src/contrib/data/batch_dataset.ts, line 19 at r3 (raw file): Previously, dsmilkov (Daniel Smilkov) wrote…
Done. src/contrib/data/batch_dataset.ts, line 120 at r3 (raw file): Previously, dsmilkov (Daniel Smilkov) wrote…
Done here and a few other places src/contrib/data/dataset.ts, line 18 at r2 (raw file): Previously, dsmilkov (Daniel Smilkov) wrote…
OK, thanks for the info! src/contrib/data/dataset.ts, line 217 at r3 (raw file): Previously, dsmilkov (Daniel Smilkov) wrote…
Done. src/contrib/data/datasets/csv_dataset.ts, line 94 at r1 (raw file): Previously, dsmilkov (Daniel Smilkov) wrote…
Yep I had a note about this and had figured I'd do it in a followup CL, but now I did it here. Comments from Reviewable |
Hey David, several tests fail at HEAD due to this PR. Can you take a look when you get a chance? Travis didn't run the tests since you didn't have write access to our repo. Will give you access now. See https://api.travis-ci.org/v3/job/340794358/log.txt for the failing tests - search for |
Huh that's weird, everything passed for me. Looking into it now-- sorry about that! |
Sent #701 for review to increase test timeout interval. Will merge in the morning since I'm tired, but the travis build passes. |
Thanks much! I found that the tests are running near the timeout locally, and I can make them fail just by making the test streams longer--so it makes sense that as written they might always timeout on a slower machine. Maybe I have a serious performance bug, but at least I have a place to start in the morning. (It's already on my list to get some benchmarks set up; just upped the priority :S) Comments from Reviewable |
This subpackage provides Dataset, DataStream, and related subclasses and utilities.
It provides for streaming data loading and preprocessing, with the intent of feeding machine learning models with data for both training and evaluation.
This cannot yet replace src/data, nor does it yet support the needs of the current deeplearn.js demos. It's just an initial commit, moving the current version of the code unmodified from its private repo. Deeper integration with deeplearn.js will follow.