torchtext iterator that tokenizes each line of words between the tokens <sos> and <eos> #654
Hello,

I generated a text file called `openbookQA_train`. The contents of this file are shown below. I am trying to use or define a torchtext Iterator to generate the input that I can pass into my Transformer.

I want each sample in my `next(iter(openbookQA_train)).text` to be a series of integers obtained by tokenizing each line of words between `<sos>` and `<eos>` (including those special tokens), and, for a sample that contains fewer tokens than the bptt length, I want the sample to include all of the tokenized words between `<sos>` and `<eos>` with the rest of the slots filled with the `<pad>` token up to the bptt length. How can I achieve this objective?

Thank you,
Comments
The fixed length can be set for certain Fields by using the `fix_length` argument, e.g. `fix_length=128`:

```python
TEXT = data.Field(..., fix_length=max_length)

# construct the Iterator over the train data
train_iter = data.Iterator(openbookQA_train, batch_size=batch_size)
for batch in train_iter:
    print(batch.text)
    # This will be a tensor of [batch_size x max_length], or that transposed.
```

Note that in case you have an instance whose length is larger than `fix_length`, the trailing tokens will be dropped (including the `<eos>`).
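To make the padding/truncation behaviour concrete, here is a minimal, self-contained sketch (assuming the legacy `torchtext.data` API, which lives under `torchtext.legacy.data` on newer releases; the sentences and variable names are hypothetical). As in the question, the raw lines already contain literal `<sos>`/`<eos>` tokens:

```python
from torchtext import data  # torchtext.legacy.data on torchtext >= 0.9

# The lines already carry literal <sos>/<eos>, so no init_token/eos_token is
# set on the Field; fix_length pads or truncates every example to 6 tokens.
TEXT = data.Field(pad_token='<pad>', fix_length=6)

lines = ["<sos> I like dogs <eos>",                  # 5 tokens -> padded
         "<sos> I like all kinds of animals <eos>"]  # 8 tokens -> truncated
examples = [data.Example.fromlist([line], [('text', TEXT)]) for line in lines]
dataset = data.Dataset(examples, [('text', TEXT)])
TEXT.build_vocab(dataset)

batch = next(iter(data.Iterator(dataset, batch_size=2, shuffle=False)))
print(batch.text.size())  # torch.Size([6, 2]) -- fix_length x batch_size
# column 0: <sos> I like dogs <eos> <pad>
# column 1: <sos> I like all kinds of   <- trailing tokens, incl. <eos>, dropped
```

This illustrates the truncation caveat above: the `<eos>` of the second line falls past `fix_length` and is dropped.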
Hello, thank you for your reply. @mttk When I execute the code below, each sample contains tokens from multiple sentences, because my sample length is much larger than the length of each sentence:

```python
train_openbookQA_iter, val_openbookQA_iter, test_openbookQA_iter = BPTTIterator.splits(
    (new_train_openbookQA, val_openbookQA, test_openbookQA),
    batch_size=batch_size,
    bptt_len=bptt,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
    shuffle=False,
    device=device,
    repeat=False)
```

Should I use a different Iterator in this case?

Thank you,
If I understood your use-case correctly, a plain Iterator would work fine in this case. Are you training the model for QA, or are you doing some variation of language modelling, or both?
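For context on the difference (a sketch, reusing the hypothetical iterator and dataset names from the snippet above): `BPTTIterator` concatenates the whole corpus into one token stream and chops it into `bptt_len`-sized chunks, which is why one sample can span several sentences, whereas a plain `Iterator` yields one (padded) example per entry:

```python
from torchtext.data import Iterator  # torchtext.legacy.data on torchtext >= 0.9

# BPTTIterator batch: a continuous stream; `target` is `text` shifted by one
bptt_batch = next(iter(train_openbookQA_iter))
print(bptt_batch.text.size())    # (bptt_len, batch_size)
print(bptt_batch.target.size())  # same shape as text

# plain Iterator batch: one example per column, padded to the longest
# example in the batch (or to fix_length, if set on the Field)
plain_iter = Iterator(new_train_openbookQA, batch_size=batch_size)
plain_batch = next(iter(plain_iter))
print(plain_batch.text.size())   # (max_len_in_batch, batch_size)
```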
Hello, thank you for your reply. If the maximum length of an individual sequence that I want to feed into my Transformer for language modelling is 100, is `fix_length=100` then the right setting?

Thank you,
Hello, thank you again for your reply. For a given `bptt_len`, will every token in an instance be processed, or are the tokens beyond `bptt_len` dropped?

Thank you,
Every token in an instance will always be processed, no matter the `bptt_len`.
@mttk Hello, thank you again for your reply. Sorry for the ongoing questions; I will try to look up more on the internet about the term "instance", but what exactly counts as an instance here?

Thank you,
Yes, in this case an instance is a single sequence you pass as input. The length of the "I like dogs" instance is 3 if you use word tokenization (note that most Transformers use subword tokenizers). If you would define your `fix_length=5`, a batch with the instances "I like all animals" and "I like dogs" would look like:

```python
[
    ["I", "like", "all", "animals", "<pad>"],
    ["I", "like", "dogs", "<pad>", "<pad>"]
]
```
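If you want to see this padding in isolation, the legacy `Field.pad` helper reproduces it directly (a minimal sketch; the pre-tokenized minibatch below is hypothetical):

```python
from torchtext import data  # torchtext.legacy.data on torchtext >= 0.9

TEXT = data.Field(fix_length=5, pad_token='<pad>')
minibatch = [["I", "like", "all", "animals"],
             ["I", "like", "dogs"]]
print(TEXT.pad(minibatch))
# [['I', 'like', 'all', 'animals', '<pad>'],
#  ['I', 'like', 'dogs', '<pad>', '<pad>']]
```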
Thank you!
Hello, just a follow-up question. Say I am using the `BPTTIterator` on the WikiText-2 dataset:

```python
# set batch_size and bptt
batch_size = 1
bptt = 1024

train_Wiki2_iter, val_Wiki2_iter, test_Wiki2_iter = BPTTIterator.splits(
    (train_Wiki2, val_Wiki2, test_Wiki2),
    batch_size=batch_size,
    bptt_len=bptt,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
    shuffle=False,
    repeat=False)

train_Wiki2_i = next(iter(train_Wiki2_iter))
val_Wiki2_i = next(iter(val_Wiki2_iter))
test_Wiki2_i = next(iter(test_Wiki2_iter))
test_dummy_Wiki2_i = test_Wiki2_i

train_Wiki2_i.text.size()
>> torch.Size([1024, 1])
```

The size of `train_Wiki2_i` tells me that only 1024 * 1 tokens will be processed, whereas the length of `train_Wiki2` is far greater than 1024 * 1. How can I ensure that the entire text corpus will be processed? If any of my understanding is wrong, please correct me.

Thank you, I really appreciate your help.
This size is just a single batch of your dataset; iterating over the iterator yields the rest of the corpus, batch by batch.
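A quick way to convince yourself of this (a sketch using the same iterator; `BPTTIterator` pads the token stream slightly so it divides evenly across the batch):

```python
# Summing over one full epoch visits every batch, i.e. the whole corpus,
# not just the first bptt_len * batch_size tokens.
total_tokens = sum(batch.text.numel() for batch in train_Wiki2_iter)
print(total_tokens)  # roughly the token count of the train_Wiki2 corpus
```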
Thank you! :)
Hello, sorry for continuing this thread.

```python
# set batch_size and bptt
batch_size = 1
bptt = 1024

train_Wiki2_iter, val_Wiki2_iter, test_Wiki2_iter = BPTTIterator.splits(
    (train_Wiki2, val_Wiki2, test_Wiki2),
    batch_size=batch_size,
    bptt_len=bptt,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
    shuffle=False,
    repeat=False)

train_Wiki2_i = next(iter(train_Wiki2_iter))
val_Wiki2_i = next(iter(val_Wiki2_iter))
test_Wiki2_i = next(iter(test_Wiki2_iter))
test_dummy_Wiki2_i = test_Wiki2_i
```

In your previous post you mentioned that this size is just a single batch of the dataset. How do I iterate over all of the batches so that the entire corpus is processed?

Thank you,
```python
for batch_index, batch in enumerate(train_Wiki2_iter):
    # This is now batch #batch_index
    # Do things with the batch
    pass
```
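Note that with `repeat=False` the loop stops after one full pass over the corpus; for multiple epochs you can simply nest it (a sketch; `n_epochs` is a hypothetical name):

```python
n_epochs = 3  # hypothetical value
for epoch in range(n_epochs):
    for batch_index, batch in enumerate(train_Wiki2_iter):
        # train on batch.text / batch.target here
        pass
```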
Thank you! Highly appreciated.
@mttk Hello again, from your example below:

```python
for batch_index, batch in enumerate(train_Wiki2_iter):
    # This is now batch #batch_index
    # Do things with the batch
    pass
```

Is there a way I can turn the […]?

Thank you,
If the […]