Follow-up on comments of #338 #355

sgugger · 2020-03-02T16:46:42Z

Addresses @dan-zheng comments on #338
I also renamed `items` to `numericalizedTexts`, which better documents the expected input.

review-notebook-app · 2020-03-02T16:46:48Z

Check out this pull request on ReviewNB.

You'll be able to see Jupyter notebook diffs and discuss changes. Powered by ReviewNB.

Datasets/LanguageModelDataset.swift

Datasets/LabeledExample.swift

dan-zheng · 2020-03-02T16:52:27Z

Datasets/LanguageModelDataset.swift

-}
-
-//sampleIndices function to use in conjunction with a LanguageModelDataset
+/// sampleIndices function to use in conjunction with a `LanguageModelDataset`


This doc comment should read like a sentence. The meaning right now is unclear to me:

- What is the returned `[Int]`?
- Why is the dataset `inout`?

Please add `/// - Parameters:` and `/// - Returns:` doc comments.

Added more prose, let me know if it's still unclear.
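
For reference, a minimal sketch of the doc-comment shape being requested. The function name `inPlaceSampleIndices` and its array-based signature are hypothetical stand-ins, not the PR's actual code; the sketch only illustrates where the parameter and return descriptions go and how they can answer the two questions above.

```swift
/// Shuffles `dataset` in place and returns the indices to read it with.
///
/// - Parameter dataset: The collection to sample from. It is `inout` because
///   the shuffle mutates it rather than permuting the returned indices.
/// - Returns: The indices of `dataset` in ascending order.
// Hypothetical function, written only to illustrate the doc-comment layout.
func inPlaceSampleIndices<Element>(_ dataset: inout [Element]) -> [Int] {
    dataset.shuffle()
    return Array(dataset.indices)
}
```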

dan-zheng · 2020-03-02T16:53:30Z

Datasets/LanguageModelDataset.swift

-  public let openItem: (Item) -> [Int]
-  //The size of a batch
+/// A dataset suitable for language modeling 
+public struct LanguageModelDataset<C> where C: Collection, C.Index==Int, C.Element==[Int] {


Could you find a more descriptive name than C? Generic parameter names are important: Tensor<Scalar> is much clearer than Tensor<T>.

Used Texts since it's the type of the collection of the underlying texts. Let me know if you have a better idea.

dan-zheng · 2020-03-02T18:51:23Z

Datasets/LanguageModelDataset.swift

-  //The size of a batch
+/// A dataset suitable for language modeling.
+public struct LanguageModelDataset<Texts> 
+where Texts: Collection, Texts.Index==Int, Texts.Element==[Int] {


It may be surprising that Texts is a Collection where Index == Int and Element == [Int]. One might expect Texts to be a collection of String.

Could you please add doc comments that help explain the Texts generic parameter? Maybe you can clarify that it's numericalized.

Did that and added some explanation on what those terms mean. NumericalizedTexts is kind of long for the generic parameters, which is why I stuck to Texts. Let me know if there is anything else I can do to make this clearer.

dan-zheng

These incremental improvements LGTM, thanks!

dabrahams · 2020-03-18T21:45:24Z

Datasets/LabeledExample.swift

-// one tensor of labels). Note that if your task is more elaborate, you should write your
-// own struct of Tensors. Automatic conformance to Collatable can be derived the same way as
-// here.
+/// A generic tuple of two tensors `Tensor`.


“two tensors Tensor” doesn't quite parse for me.

How is this a “tuple” and not a “pair”? The former usually implies that it contains an arbitrary number of elements.

I would say “A heterogeneous pair of Tensors”

dabrahams · 2020-03-18T21:49:25Z

Datasets/LabeledExample.swift

+/// A generic tuple of two tensors `Tensor`.
+/// 
+/// - Note: `TensorPair` has a generic name and provides little semantic information, to conform to
+/// `Collatable`. You can use it for most basic datasets with one tensor of inputs and one tensor of


Nit: Markdown bullets should left-align all the text, so there should be two spaces before all but the first line of the bullet.

I don't know about Collatable yet, but it's hard to imagine how conforming to a protocol could impose naming requirements, or requirements on the non-provision of semantic information. I think you should try again for this comment, thinking about what you really mean to say. Hint: there's no need to make excuses for providing a very general-purpose component and naming it accordingly ;-)
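
A small sketch of the bullet alignment described in the nit above. `Pair` is a hypothetical stand-in for `TensorPair` (which would need the TensorFlow module), and the wording of the note is illustrative only.

```swift
/// A pair of values of possibly different types.
///
/// - Note: Continuation lines of a bullet are indented by two extra spaces,
///   so every line of the bullet's text starts in the same column.
// Hypothetical type, used only to carry the doc comment above.
struct Pair<First, Second> {
    var first: First
    var second: Second
}
```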

dabrahams · 2020-03-18T21:52:42Z

Datasets/LanguageModelDataset.swift

+/// vocabulary). Therefore the generic type `Texts` refers to a collection of
+/// numericalized texts.
+public struct LanguageModelDataset<Texts> 
+where Texts: Collection, Texts.Index==Int, Texts.Element==[Int] {


`Index == Int` is almost invariably an over-constraint. What you probably mean is `Texts: RandomAccessCollection`; you can always get to the nth element via `t[t.index(t.startIndex, offsetBy: n)]`, which is admittedly horrible, but can be encapsulated so you don't have to repeat it.
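
One way the encapsulation suggested here might look. The method name `element(atOffset:)` is hypothetical; the body uses only the standard `index(_:offsetBy:)` and `startIndex` members.

```swift
extension RandomAccessCollection {
    /// Returns the element `n` positions after `startIndex`.
    ///
    /// - Precondition: `0 <= n && n < count`.
    // Hypothetical helper name; constant time for random-access collections.
    func element(atOffset n: Int) -> Element {
        return self[index(startIndex, offsetBy: n)]
    }
}

// An ArraySlice has Int indices that do not start at 0, so offsets and
// indices differ: offset 0 of this slice is the element at index 1.
let slice = [10, 20, 30, 40][1...]
let first = slice.element(atOffset: 0)
```

With a helper like this, a type could in principle require only `Texts: RandomAccessCollection` and never subscript with a raw `Int`.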

dabrahams · 2020-03-18T21:54:12Z

Datasets/LanguageModelDataset.swift

-  //The size of a batch
+/// A dataset suitable for language modeling.
+public struct LanguageModelDataset<Texts> 
+where Texts: Collection, Texts.Index==Int, Texts.Element==[Int] {


How about textIDs? Is there anything more to “numericalizing” than associating each token with an integer ID?
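
For context, a minimal sketch of numericalizing in its simplest form, where each token is replaced by its integer ID in a vocabulary. The function name, the dictionary-based vocabulary, and the fallback ID for unknown tokens are all assumptions made for illustration; the PR may do this differently.

```swift
/// Maps each token to its integer ID in `vocabulary`, using `unknownID` for
/// tokens that are not in the vocabulary.
// Hypothetical helper; the names and the unknown-token policy are assumptions.
func numericalize(
    _ tokens: [String], vocabulary: [String: Int], unknownID: Int
) -> [Int] {
    return tokens.map { vocabulary[$0] ?? unknownID }
}

let vocabulary = ["the": 1, "cat": 2]
let ids = numericalize(["the", "cat", "sat"], vocabulary: vocabulary, unknownID: 0)
```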

dabrahams · 2020-03-18T21:55:09Z

Datasets/LanguageModelDataset.swift

  private var batchCount: Int
-  //The sequence length of the last batch
+  /// The sequence length of the last batch.
  private var lastLength: Int


I realize it's not part of this PR, but this should be lastBatchLength.

dabrahams · 2020-03-18T21:56:15Z

Datasets/LanguageModelDataset.swift

-
-//sampleIndices function to use in conjunction with a LanguageModelDataset
+/// The sampleIndices function to use in conjunction with a `LanguageModelDataset` in a `Batcher`.
+/// Will shuffle the dataset in place instead of the indices (like the default function does).


Need a blank line after the summary.

dabrahams · 2020-03-18T21:58:01Z

Datasets/TextUnsupervised/TextUnsupervised.swift

@@ -57,8 +57,8 @@ public struct TextUnsupervised {
        var fileExtension = "tgz"
    }

-    public let trainingDataset: LanguageModelDataset<[Int]>
-    public let validationDataset: LanguageModelDataset<[Int]>
+    public let trainingDataset: LanguageModelDataset<[[Int]]>


What is this [[Int]] I keep seeing? Would a typealias for it make the code more descriptive?
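
A sketch of what such a typealias might look like. Both alias names are hypothetical and are not taken from the PR.

```swift
/// A text represented as the integer IDs of its tokens.
typealias NumericalizedText = [Int]

/// A collection of numericalized texts.
typealias NumericalizedTexts = [NumericalizedText]

// Hypothetical sample data: two texts of three and two token IDs.
let texts: NumericalizedTexts = [[1, 2, 3], [4, 5]]
```

Under these assumed names, the declarations in the diff would read `LanguageModelDataset<NumericalizedTexts>` instead of `LanguageModelDataset<[[Int]]>`.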

Follow-up on comments of tensorflow#338

0f8d83e

dan-zheng reviewed Mar 2, 2020

Improve documentation

e06e070

dan-zheng reviewed Mar 2, 2020

dan-zheng reviewed Mar 2, 2020

sgugger and others added 2 commits March 2, 2020 13:44

Update Datasets/LanguageModelDataset.swift

63737d4

Co-Authored-By: Dan Zheng <danielzheng@google.com>

Update Datasets/LanguageModelDataset.swift

e492ffd

Co-Authored-By: Dan Zheng <danielzheng@google.com>

dan-zheng reviewed Mar 2, 2020

Add a note

567597c

dan-zheng reviewed Mar 2, 2020

dan-zheng reviewed Mar 2, 2020

sgugger and others added 2 commits March 2, 2020 14:59

Update Datasets/LanguageModelDataset.swift

75c3f59

Co-Authored-By: Dan Zheng <danielzheng@google.com>

Update Datasets/LanguageModelDataset.swift

280ccfc

Co-Authored-By: Dan Zheng <danielzheng@google.com>

dabrahams added the google* label Mar 4, 2020

shabalind assigned dabrahams Mar 11, 2020

shabalind requested a review from dan-zheng March 11, 2020 17:43

sgugger mentioned this pull request Mar 12, 2020

Adds the ability to drop the last batch #379

Merged

sgugger added 4 commits March 13, 2020 13:16

Merge branch 'master' into follow_up

ff05dc4

Adapt to new generic type

af1364c

Fix typo

4971cf3

Use right name

c8e6dbb

dan-zheng approved these changes Mar 13, 2020

sgugger merged commit dabab58 into tensorflow:master Mar 13, 2020

sgugger deleted the follow_up branch March 13, 2020 18:16

dabrahams reviewed Mar 18, 2020

dabrahams removed their assignment Mar 18, 2020
