Follow-up on comments of #338 #355
Datasets/LanguageModelDataset.swift
Outdated
}

//sampleIndices function to use in conjunction with a LanguageModelDataset
/// sampleIndices function to use in conjunction with a `LanguageModelDataset`
This doc comment should read like a sentence. The meaning right now is unclear to me:
- What is the returned `[Int]`?
- Why is the dataset `inout`?

Please add `/// - Parameters:` and `/// Returns:` doc comments.
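A doc comment in the requested shape might look like the sketch below. `ToyDataset` and `inPlaceSampleIndices` are hypothetical stand-ins, not the PR's actual declarations. The sketch assumes the function shuffles the dataset in place, which would explain the `inout`, and it uses the `- Returns:` spelling that Swift's documentation markup recognizes.

```swift
/// A hypothetical stand-in for a dataset that can reorder its own contents.
struct ToyDataset {
    var items: [Int]

    /// Shuffles the items in place.
    mutating func shuffle() { items.shuffle() }
}

/// Shuffles `dataset` in place and returns its indices in ascending order.
///
/// - Parameters:
///   - dataset: The dataset to shuffle. It is `inout` because the shuffle
///     mutates the dataset itself rather than a list of indices.
/// - Returns: The indices of `dataset.items`, unshuffled, since the
///   randomness already lives in the dataset's new order.
func inPlaceSampleIndices(on dataset: inout ToyDataset) -> [Int] {
    dataset.shuffle()
    return Array(dataset.items.indices)
}

var dataset = ToyDataset(items: [3, 1, 4, 1, 5])
let indices = inPlaceSampleIndices(on: &dataset)
```

The summary is a single sentence, followed by a blank `///` line, so documentation tools can separate it from the details.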
Added more prose, let me know if it's still unclear.
Datasets/LanguageModelDataset.swift
Outdated
public let openItem: (Item) -> [Int]
//The size of a batch
/// A dataset suitable for language modeling
public struct LanguageModelDataset<C> where C: Collection, C.Index==Int, C.Element==[Int] {
Could you find a more descriptive name than `C`? Generic parameter names are important: `Tensor<Scalar>` is much clearer than `Tensor<T>`.
Used `Texts` since it's the type of the collection of the underlying texts. Let me know if you have a better idea.
Co-Authored-By: Dan Zheng <danielzheng@google.com>
//The size of a batch
/// A dataset suitable for language modeling.
public struct LanguageModelDataset<Texts>
where Texts: Collection, Texts.Index==Int, Texts.Element==[Int] {
It may be surprising that `Texts` is a `Collection` where `Index == Int` and `Element == [Int]`. One might expect `Texts` to be a collection of `String`.

Could you please add doc comments that help explain the `Texts` generic parameter? Maybe you can clarify that it's numericalized.
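To illustrate why the element type would be `[Int]` rather than `String`, here is a minimal sketch of numericalization, assuming it means replacing each token with its integer ID in a vocabulary. The `numericalize` function, its `unknownID` fallback, and the sample vocabulary are all hypothetical and not part of the PR.

```swift
/// Maps each token of each text to its integer ID in `vocabulary`.
///
/// Tokens missing from `vocabulary` map to `unknownID` (an assumption of
/// this sketch, not something the PR specifies).
func numericalize(
    _ texts: [[String]], vocabulary: [String: Int], unknownID: Int = 0
) -> [[Int]] {
    return texts.map { tokens in tokens.map { vocabulary[$0] ?? unknownID } }
}

let vocabulary = ["<unk>": 0, "the": 1, "cat": 2, "sat": 3]
let tokenizedTexts = [["the", "cat", "sat"], ["the", "dog"]]
let numericalizedTexts = numericalize(tokenizedTexts, vocabulary: vocabulary)
// numericalizedTexts == [[1, 2, 3], [1, 0]]
```

Each inner array is one text, so a collection of such arrays satisfies the `Element == [Int]` constraint.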
Did that and added some explanation on what those terms mean. `NumericalizedTexts` is kind of long for the generic parameters, which is why I stuck to `Texts`. Let me know if there is anything else I can do to make this clearer.
How about `textIDs`? Is there anything more to “numericalizing” than associating each token with an integer ID?
These incremental improvements LGTM, thanks!
// one tensor of labels). Note that if your task is more elaborate, you should write your
// own struct of Tensors. Automatic conformance to Collatable can be derived the same way as
// here.
/// A generic tuple of two tensors `Tensor`.
“two tensors `Tensor`” doesn't quite parse for me.
How is this a “tuple” and not a “pair”? The former usually implies it contains an arbitrary number of elements.
I would say “A heterogeneous pair of `Tensor`s”.
/// A generic tuple of two tensors `Tensor`.
///
/// - Note: `TensorPair` has a generic name and provides little semantic information, to conform to
/// `Collatable`. You can use it for most basic datasets with one tensor of inputs and one tensor of
Nit: Markdown bullets should left-align all the text, so there should be two spaces before all but the first line of the bullet.
I don't know about `Collatable` yet, but it's hard to imagine how conforming to a protocol could impose naming requirements, or requirements on the non-provision of semantic information. I think you should try again for this comment, thinking about what you really mean to say. Hint: there's no need to make excuses for providing a very general-purpose component and naming it accordingly ;-)
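One way the doc comment could read after this feedback is sketched below. The `Tensor` struct is a minimal stand-in so the sketch compiles on its own, and the generic parameter names `First` and `Second` are assumptions rather than the PR's actual declaration. The continuation lines of the `- Note:` bullet are indented by two spaces so the text left-aligns.

```swift
/// A minimal stand-in for a real tensor type, so the sketch compiles on its own.
struct Tensor<Scalar> {
    var scalars: [Scalar]
}

/// A heterogeneous pair of `Tensor`s.
///
/// - Note: Suits basic datasets with one tensor of inputs and one tensor of
///   labels. For a more elaborate task, a dedicated struct of tensors is
///   likely a better fit.
struct TensorPair<First, Second> {
    var first: Tensor<First>
    var second: Tensor<Second>
}

let pair = TensorPair(
    first: Tensor(scalars: [0.5, 1.5] as [Float]),
    second: Tensor(scalars: [0, 1] as [Int32]))
```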
/// vocabulary). Therefore the generic type `Texts` refers to a collection of
/// numericalized texts.
public struct LanguageModelDataset<Texts>
where Texts: Collection, Texts.Index==Int, Texts.Element==[Int] {
`Index==Int` is almost invariably an over-constraint. What you probably mean is `Texts: RandomAccessCollection`; you can always get to the nth element via `t[t.index(t.startIndex, offsetBy: n)]`, which is admittedly horrible, but can be encapsulated so you don't have to repeat it.
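The encapsulation suggested here could be sketched as a small extension. The method name `element(atOffset:)` is hypothetical; the standard-library calls it wraps (`index(_:offsetBy:)`, `startIndex`, `count`) are the ones named in the comment or available on every `Collection`.

```swift
extension RandomAccessCollection {
    /// Returns the element `n` positions after `startIndex`.
    ///
    /// - Precondition: `0 <= n && n < count`.
    func element(atOffset n: Int) -> Element {
        precondition(n >= 0 && n < count, "offset out of bounds")
        return self[index(startIndex, offsetBy: n)]
    }
}

// An ArraySlice keeps its base array's indices, so its startIndex is 1 here.
let slice = [10, 20, 30, 40][1...]
let third = slice.element(atOffset: 2)
// third == 40
```

Because the helper never assumes that indices start at zero or are integers, it works for slices and other random-access collections where subscripting with a raw `Int` offset would be wrong or would not compile.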
private var batchCount: Int
//The sequence length of the last batch
/// The sequence length of the last batch.
private var lastLength: Int
I realize it's not part of this PR, but this should be `lastBatchLength`.
//sampleIndices function to use in conjunction with a LanguageModelDataset
/// The sampleIndices function to use in conjunction with a `LanguageModelDataset` in a `Batcher`.
/// Will shuffle the dataset in place instead of the indices (like the default function does).
Need a blank line after the summary.
@@ -57,8 +57,8 @@ public struct TextUnsupervised {
var fileExtension = "tgz"
}

public let trainingDataset: LanguageModelDataset<[Int]>
public let validationDataset: LanguageModelDataset<[Int]>
public let trainingDataset: LanguageModelDataset<[[Int]]>
What is this `[[Int]]` I keep seeing? Would a `typealias` for it make the code more descriptive?
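A `typealias` along these lines might make the intent visible at use sites. Both alias names below are hypothetical suggestions, not names from the PR.

```swift
/// The token IDs of a single text (hypothetical name).
typealias NumericalizedText = [Int]

/// A collection of numericalized texts (hypothetical name).
typealias NumericalizedTexts = [NumericalizedText]

let texts: NumericalizedTexts = [[1, 2, 3], [1, 0]]
let tokenCount = texts.reduce(0) { $0 + $1.count }
// tokenCount == 5
```

With such an alias, a declaration like `LanguageModelDataset<[[Int]]>` could be spelled `LanguageModelDataset<NumericalizedTexts>`, which tells the reader what the nested arrays hold.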
Addresses @dan-zheng comments on #338
I also changed the name of `items` to `numericalizedTexts` as it documents better what the expected input is.