Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Follow-up on comments of #338 #355

Merged
merged 11 commits into from Mar 13, 2020
Merged

Follow-up on comments of #338 #355

merged 11 commits into from Mar 13, 2020

Conversation

sgugger
Copy link
Contributor

@sgugger sgugger commented Mar 2, 2020

Addresses @dan-zheng comments on #338
I also changed the name of items to numericalizedTexts as it documents better what the expected input is.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

You'll be able to see Jupyter notebook diff and discuss changes. Powered by ReviewNB.

Datasets/LanguageModelDataset.swift Outdated Show resolved Hide resolved
Datasets/LabeledExample.swift Outdated Show resolved Hide resolved
}

//sampleIndices function to use in conjunction with a LanguageModelDataset
/// sampleIndices function to use in conjunction with a `LanguageModelDataset`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc comment should read like a sentence. The meaning right now is unclear to me:

  • What is the returned [Int]?
  • Why is the dataset inout?

Please add /// - Parameters: and /// Returns: doc comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added more prose, let me know if it's still unclear.

public let openItem: (Item) -> [Int]
//The size of a batch
/// A dataset suitable for language modeling
public struct LanguageModelDataset<C> where C: Collection, C.Index==Int, C.Element==[Int] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you find a more descriptive name than C? Generic parameter names are important: Tensor<Scalar> is much clearer than Tensor<T>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used Texts since it's the type of the collection of the underlying texts. Let me know if you have a better idea.

sgugger and others added 2 commits March 2, 2020 13:44
Co-Authored-By: Dan Zheng <danielzheng@google.com>
Co-Authored-By: Dan Zheng <danielzheng@google.com>
//The size of a batch
/// A dataset suitable for language modeling.
public struct LanguageModelDataset<Texts>
where Texts: Collection, Texts.Index==Int, Texts.Element==[Int] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be surprising that Texts is a Collection where Index == Int and Element == [Int]. One might expect Texts to be a collection of String.

Could you please add doc comments that help explain the Texts generic parameter? Maybe you can clarify that it's numericalized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did that and added some explanation on what those terms mean. NumericalizedTexts is kind of long for the generic parameters, which is why I stuck to Texts. Let me know if there is anything else I can do to make this clearer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about textIDs? Is there anything more to “numericalizing” than associating each token with an integer ID?

sgugger and others added 2 commits March 2, 2020 14:59
Co-Authored-By: Dan Zheng <danielzheng@google.com>
Co-Authored-By: Dan Zheng <danielzheng@google.com>
Copy link
Member

@dan-zheng dan-zheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These incremental improvements LGTM, thanks!

@sgugger sgugger merged commit dabab58 into tensorflow:master Mar 13, 2020
@sgugger sgugger deleted the follow_up branch March 13, 2020 18:16
// one tensor of labels). Note that if your task is more elaborate, you should write your
// own struct of Tensors. Automatic conformance to Collatable can be derived the same way as
// here.
/// A generic tuple of two tensors `Tensor`.
Copy link
Contributor

@dabrahams dabrahams Mar 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“two tensors Tensor” doesn't quite parse for me.

How is this a “tuple” and not a “pair?” The former usually implies it has contains an arbitrary number of elements.

I would say “A heterogeneous pair of Tensors”

/// A generic tuple of two tensors `Tensor`.
///
/// - Note: `TensorPair` has a generic name and provides little semantic information, to conform to
/// `Collatable`. You can use it for most basic datasets with one tensor of inputs and one tensor of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Markdown bullets should left-align all the text, so there should be two spaces before all but the first line of the bullet.

I don't know about Collatable yet, but it's hard to imagine how conforming to a protocol could impose naming requirements, or requirements on the non-provision of semantic information. I think you should try again for this comment, thinking about what you really mean to say. Hint: there's no need to make excuses for providing a very general-purpose component and naming it accordingly ;-)

/// vocabulary). Therefore the generic type `Texts` refers to a collection of
/// numericalized texts.
public struct LanguageModelDataset<Texts>
where Texts: Collection, Texts.Index==Int, Texts.Element==[Int] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Index==Int is almost invariably an over-constraint. What you probably mean is Texts: RandomAccessCollection; you can always get to the nth element via t[t.index(t.startIndex, offsetBy: n)] which is admittedly horrible, but can be encapsulated so you don't have to repeat it.

//The size of a batch
/// A dataset suitable for language modeling.
public struct LanguageModelDataset<Texts>
where Texts: Collection, Texts.Index==Int, Texts.Element==[Int] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about textIDs? Is there anything more to “numericalizing” than associating each token with an integer ID?

private var batchCount: Int
//The sequence length of the last batch
/// The sequence length of the last batch.
private var lastLength: Int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize it's not part of this PR, but this should be lastBatchLength.


//sampleIndices function to use in conjunction with a LanguageModelDataset
/// The sampleIndices function to use in conjunction with a `LanguageModelDataset` in a `Batcher`.
/// Will shuffle the dataset in place instead of the indices (like the default function does).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a blank line after the summary.

@@ -57,8 +57,8 @@ public struct TextUnsupervised {
var fileExtension = "tgz"
}

public let trainingDataset: LanguageModelDataset<[Int]>
public let validationDataset: LanguageModelDataset<[Int]>
public let trainingDataset: LanguageModelDataset<[[Int]]>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this [[Int]] I keep seeing? Would a typealias for it make the code more descriptive?

@dabrahams dabrahams removed their assignment Mar 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants