
torchtext iterator that tokenizes each line of words between the tokens <sos> and <eos> #654

Closed
h56cho opened this issue Nov 26, 2019 · 20 comments

Comments

@h56cho

h56cho commented Nov 26, 2019

Hello,

I generated a text file called openbookQA_train. The contents of this file are shown below:

<sos> The sun is responsible for <mcoption> (A) puppies learning new tricks <eos>
<sos> The sun is responsible for <mcoption> (B) children growing up and getting old <eos>
<sos> The sun is responsible for <mcoption> (C) flowers wilting in a vase <eos>
<sos> The sun is responsible for <mcoption> (D) plants sprouting, blooming and wilting <eos>

I am trying to use or define a torchtext Iterator to generate the input that I can pass into my Transformer.

I want each sample in next(iter(openbookQA_train)).text to be a sequence of integers obtained by tokenizing one line of words between <sos> and <eos> (including those special tokens). For a sample that contains fewer tokens than the bptt length, I want the sample to include all of the tokenized words between <sos> and <eos>, with the remaining slots filled with the <pad> token up to the bptt length.

How can I achieve this objective?

Thank you,

@zhangguanheng66
Contributor

@mttk may have a solution for this.

Otherwise, you could take a look at the new language modeling dataset (#624) and write a custom pipeline.

@mttk
Contributor

mttk commented Nov 26, 2019

The fixed length can be set for certain Fields by using the fix_length argument of Field. Let's say you want to pad every example to 128 tokens.

fix_length = 128
TEXT = data.Field(..., fix_length=fix_length)
# construct the Iterator over the train data (call it openbookQA_train_iter), then:
for batch in openbookQA_train_iter:
  print(batch.text)
  # This will be a tensor of [batch_size x fix_length] or that transposed

Note that if an instance is longer than fix_length, the trailing tokens will be dropped (including the <eos>).
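
For completeness, here is a minimal end-to-end sketch of that idea. It assumes the examples live in a file named openbookQA_train.txt with one sentence per line (as in the original post); the names TEXT, train_iter and the whitespace tokenizer are just placeholders:

from torchtext import data

fix_length = 128
TEXT = data.Field(sequential=True, tokenize=str.split,
                  pad_token='<pad>', fix_length=fix_length)
fields = [('text', TEXT)]

# one Example per line, so each sample is exactly one <sos> ... <eos> sentence
with open('openbookQA_train.txt') as f:
    examples = [data.Example.fromlist([line.strip()], fields)
                for line in f if line.strip()]

openbookQA_train = data.Dataset(examples, fields)
TEXT.build_vocab(openbookQA_train)

train_iter = data.Iterator(openbookQA_train, batch_size=32,
                           sort_key=lambda ex: len(ex.text),
                           sort_within_batch=True, shuffle=False)

for batch in train_iter:
    print(batch.text.size())  # [fix_length x batch_size], shorter examples padded with '<pad>'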

@h56cho
Author

h56cho commented Nov 26, 2019

Hello,

Thank you for your reply.

@mttk
How can I ensure that each fixed-length sample passed into the Transformer (I will set the fixed sample length so that it is always greater than the number of tokens between <sos> and <eos>) contains all of the tokenized words from a single sentence (between the tokens <sos> and <eos>), with the rest of the slots filled with padding?

When I execute the code below, each sample contains tokens from multiple sentences because my sample length is much larger than the length of each sentence:

train_openbookQA_iter, val_openbookQA_iter, test_openbookQA_iter = BPTTIterator.splits(
            (new_train_openbookQA, val_openbookQA, test_openbookQA),
            batch_size = batch_size,
            bptt_len= bptt,
            sort_key=lambda x: len(x.text),
            sort_within_batch = True,
            shuffle = False,
            device= device,
            repeat=False)

Should I use a different Iterator in this case?

Thank you,

@mttk
Contributor

mttk commented Nov 26, 2019

If I understood your use-case correctly, a plain Iterator would work fine in this case. Are you training the model for QA or are you doing some variation of language modelling, or both?

@h56cho
Author

h56cho commented Nov 26, 2019

Hello,

@mttk

Thank you for your reply.
I am a bit confused --- what exactly are the batch_size and bptt_len in language modelling?

Is batch_size the same as the length of an individual sample (or sequence)?
What's the difference between batch_size and bptt_len?

If the maximum length of an individual sequence that I want to feed into my Transformer for language modelling is 100, then is batch_size = 100?

Thank you,

@mttk
Contributor

mttk commented Nov 27, 2019

bptt_len is the length of a single instance, while batch_size is the number of instances you process in parallel (in "batches").
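
As an illustration (just a sketch with a toy one-line corpus; it assumes the legacy torchtext.data API and a field named text, since BPTTIterator expects a single-example language-modelling dataset with a field called text):

from torchtext import data

TEXT = data.Field(tokenize=str.split)
fields = [('text', TEXT)]
line = "<sos> The sun is responsible for plants sprouting blooming and wilting <eos>"  # 12 tokens
corpus = data.Dataset([data.Example.fromlist([line], fields)], fields)
TEXT.build_vocab(corpus)

# bptt_len = tokens per instance, batch_size = instances processed in parallel
it = data.BPTTIterator(corpus, batch_size=2, bptt_len=5)
batch = next(iter(it))
print(batch.text.size())    # torch.Size([5, 2]) -> [bptt_len x batch_size]
print(batch.target.size())  # same shape, shifted one token ahead (the LM targets)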

@h56cho
Author

h56cho commented Nov 27, 2019

@mttk

Hello,

Thank you again for your reply.

For a given bptt_len, what is the easiest way to adjust the batch_size to ensure that every token in a given corpus of text is being processed by the Transformer?

Thank you,

@mttk
Contributor

mttk commented Nov 27, 2019

Every token in an instance will always be processed, no matter the batch_size. The size of batches affects the optimization process (magnitude / bias of updates) & is limited by the amount of GPU memory available to you.
I would suggest checking lecture notes on optimization from an ML / NLP course for more information on this.

@h56cho
Author

h56cho commented Nov 27, 2019

@mttk

Hello,

Thank you again for your reply.
So just to clarify a bit more, when you say "a single instance", do you mean "a single sequence" that's being passed into the Transformer? For example, if we pass a sequence "I like dogs" into the Transformer, would the length of a single instance in this case be 3 ('I', 'like', 'dogs')?

Sorry for the ongoing questions; I will try to look up more about batch_size on the internet.

Thank you,

@mttk
Contributor

mttk commented Nov 27, 2019

Yes, in this case, an instance is a single sequence you pass as input. The length of the "I like dogs" instance is 3 if you use word tokenization (note that most Transformers use subword tokenizers).

If you defined your bptt_len to be, for example, 5, the instance that your model sees would look like ["I", "like", "dogs", "<pad>", "<pad>"]. A batch contains multiple instances which are processed in parallel, so a single batch with batch_size=2 could look like this (illustratively):

[
  ["I", "like", "all", "animals", "<pad>"],
  ["I", "like", "dogs", "<pad>", "<pad>"]
]

@h56cho
Author

h56cho commented Nov 27, 2019

Thank you!

@h56cho h56cho closed this as completed Nov 27, 2019
@h56cho h56cho reopened this Nov 27, 2019
@h56cho
Author

h56cho commented Nov 27, 2019

@mttk

Hello,

just a follow up question,

Say I am using the WikiText2 dataset as my text corpus. To ensure the entire corpus is processed by the Transformer, how should I set my batch_size? It seems to me that only bptt_len * batch_size tokens are being processed, which is not necessarily equal to the length of the entire text corpus. The code and output below are where this reasoning comes from:

# set batch_size and bptt
batch_size=1
bptt = 1024

train_Wiki2_iter, val_Wiki2_iter, test_Wiki2_iter = BPTTIterator.splits(
            (train_Wiki2, val_Wiki2, test_Wiki2),
            batch_size = batch_size,
            bptt_len= bptt,
            sort_key=lambda x: len(x.text),
            sort_within_batch = True,
            shuffle = False,
            repeat=False)

train_Wiki2_i = next(iter(train_Wiki2_iter))
val_Wiki2_i = next(iter(val_Wiki2_iter))
test_Wiki2_i = next(iter(test_Wiki2_iter))
test_dummy_Wiki2_i = test_Wiki2_i

train_Wiki2_i.text.size()
>> torch.Size([1024, 1]) 
 # the size of train_Wiki2_i tells me that 
 # only 1024 * 1 tokens will be processed, 
 # whereas the length of train_Wiki2 is far greater than 1024*1.

How can I ensure that the entire text corpus will be processed? If any of my understanding is wrong, please correct me.

Thank you, I really appreciate your help.

@mttk
Contributor

mttk commented Nov 27, 2019

This size is just a single batch of your dataset. train_Wiki2_iter will have multiple batches of that size, and the total amount of data processed will be len(train_Wiki2_iter) (the number of batches in the dataset) times the size of a single batch (which is [1024, 1] in your case).
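
For instance, a quick sanity check along those lines (just a sketch; it assumes train_Wiki2_iter was built as in your snippet above):

total_tokens = 0
for batch in train_Wiki2_iter:
    total_tokens += batch.text.numel()   # bptt_len * batch_size per batch (the last one may be shorter)
print(len(train_Wiki2_iter), "batches,", total_tokens, "tokens processed in total")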

@h56cho
Author

h56cho commented Nov 27, 2019

Thank you! :)

@h56cho h56cho closed this as completed Nov 27, 2019
@h56cho h56cho reopened this Nov 27, 2019
@h56cho h56cho closed this as completed Nov 27, 2019
@h56cho h56cho reopened this Nov 27, 2019
@h56cho
Author

h56cho commented Nov 27, 2019

@mttk

Hello,

Sorry for continuing this thread.
So say I want my batch_size = 1 since I am interested in doing Stochastic Gradient Descent.

# set batch_size and bptt
batch_size=1
bptt = 1024

train_Wiki2_iter, val_Wiki2_iter, test_Wiki2_iter = BPTTIterator.splits(
            (train_Wiki2, val_Wiki2, test_Wiki2),
            batch_size = batch_size,
            bptt_len= bptt,
            sort_key=lambda x: len(x.text),
            sort_within_batch = True,
            shuffle = False,
            repeat=False)

train_Wiki2_i = next(iter(train_Wiki2_iter))
val_Wiki2_i = next(iter(val_Wiki2_iter))
test_Wiki2_i = next(iter(test_Wiki2_iter))
test_dummy_Wiki2_i = test_Wiki2_i

In your previous post you mentioned that train_Wiki2_iter will have multiple batches of size 1024 * 1. How can I access the second batch, the third batch, and so on? I tried train_Wiki2_i[[2]], train_Wiki2_i[2], and train_Wiki2_i[2, :, :], but none of these seem to do the trick.

Thank you,

@mttk
Contributor

mttk commented Nov 27, 2019

next(iter(...)) will only give you the first batch (the first element of the iterable). If you want to iterate over all batches, do something like this:

for batch_index, batch in enumerate(train_Wiki2_iter):
  # This is now batch #batch_index
  # Do things with the batch
  pass

@h56cho
Author

h56cho commented Nov 27, 2019

Thank you! Highly appreciated.

@h56cho h56cho closed this as completed Nov 27, 2019
@h56cho h56cho reopened this Nov 27, 2019
@h56cho
Author

h56cho commented Nov 27, 2019

@mttk

Hello again,

From your example below:

for batch_index, batch in enumerate(train_Wiki2_iter):
  # This is now batch #batch_index
  # Do things with the batch
  pass

Is there a way I can turn the batch into a Python list?
If so, how can I change batch into a list?

Thank you,

@mttk
Contributor

mttk commented Nov 27, 2019

If the batch is in format [bptt_size, batch_size], then
instance_list = [instance for instance in batch.t()] should work

@h56cho
Author

h56cho commented Nov 27, 2019

instance_list = [instance for instance in batch.t()] doesn't work but
instance_list = [instance for instance in batch.text] seems to work... thank you!
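
In case it helps later readers, here is a small sketch of getting one plain Python list per instance; it assumes the field is named text and that TEXT is the corresponding Field with a built vocab:

# batch.text is [bptt_len x batch_size]; transpose so each row is one instance
instance_ids = batch.text.t().tolist()
instance_tokens = [[TEXT.vocab.itos[i] for i in row] for row in instance_ids]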
