
torchtext iterator that tokenizes each line of words between the tokens <sos> and <eos> #654

Closed
h56cho opened this issue Nov 26, 2019 · 20 comments

Comments

@h56cho

h56cho commented Nov 26, 2019

Hello,

I generated a text file called openbookQA_train. The contents of this file are shown below:

<sos> The sun is responsible for <mcoption> (A) puppies learning new tricks <eos>
<sos> The sun is responsible for <mcoption> (B) children growing up and getting old <eos>
<sos> The sun is responsible for <mcoption> (C) flowers wilting in a vase <eos>
<sos> The sun is responsible for <mcoption> (D) plants sprouting, blooming and wilting <eos>

I am trying to use or define a torchtext Iterator to generate the input that I can pass into my Transformer.

I want each sample in next(iter(openbookQA_train)).text to be a sequence of integers obtained by tokenizing one line of words between <sos> and <eos> (including those special tokens). For a sample that contains fewer tokens than the bptt length, I want the sample to include all of the tokenized words between <sos> and <eos>, with the remaining slots filled with the <pad> token up to the bptt length.

How can I achieve this objective?

Thank you,

@zhangguanheng66
Contributor

@mttk may have a solution for this.

Otherwise, you could take a look at the new language modeling dataset (#624) and write a custom pipeline.

@mttk
Contributor

mttk commented Nov 26, 2019

The fixed length can be set for certain Fields by using the fix_length argument of Field. Let's say you want to pad every example to 128 tokens.

fix_length = 128
TEXT = data.Field(..., fix_length=fix_length)
# construct the Iterator over the train data (call it openbookQA_train_iter), then:
for batch in openbookQA_train_iter:
  print(batch.text)
  # This will be a tensor of [batch_size x fix_length] or that transposed

Note that if an instance is longer than fix_length, the trailing tokens will be dropped (including the <eos>).
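
For completeness, here is a minimal end-to-end sketch of that idea. It assumes the examples live in a file named openbookQA_train.txt with one sentence per line (as in the original post); the names TEXT, train_iter and the whitespace tokenizer are just placeholders:

from torchtext import data

fix_length = 128
TEXT = data.Field(sequential=True, tokenize=str.split,
                  pad_token='<pad>', fix_length=fix_length)
fields = [('text', TEXT)]

# one Example per line, so each sample is exactly one <sos> ... <eos> sentence
with open('openbookQA_train.txt') as f:
    examples = [data.Example.fromlist([line.strip()], fields)
                for line in f if line.strip()]

openbookQA_train = data.Dataset(examples, fields)
TEXT.build_vocab(openbookQA_train)

train_iter = data.Iterator(openbookQA_train, batch_size=32,
                           sort_key=lambda ex: len(ex.text),
                           sort_within_batch=True, shuffle=False)

for batch in train_iter:
    print(batch.text.size())  # [fix_length x batch_size], shorter examples padded with '<pad>'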

@h56cho
Author

h56cho commented Nov 26, 2019

Hello,

Thank you for your reply.

@mttk
How can I ensure that each fixed-length sample passed into the Transformer (I will set the fixed sample length so that it is always greater than the number of tokens between <sos> and <eos>) contains all of the tokenized words from a single sentence (between the tokens <sos> and <eos>), with the rest of the slots filled with padding?

When I execute the code below, each sample contains tokens from multiple sentences because my sample length is much larger than the length of each sentence:

train_openbookQA_iter, val_openbookQA_iter, test_openbookQA_iter = BPTTIterator.splits(
            (new_train_openbookQA, val_openbookQA, test_openbookQA),
            batch_size = batch_size,
            bptt_len= bptt,
            sort_key=lambda x: len(x.text),
            sort_within_batch = True,
            shuffle = False,
            device= device,
            repeat=False)

Should I use a different Iterator in this case?

Thank you,

@mttk
Contributor

mttk commented Nov 26, 2019

If I understood your use-case correctly, a plain Iterator would work fine in this case. Are you training the model for QA or are you doing some variation of language modelling, or both?

@h56cho
Author

h56cho commented Nov 26, 2019

Hello,

@mttk

Thank you for your reply.
I am a bit confused --- what exactly are the batch_size and bptt_len in language modelling?

Is batch_size the same as the length of an individual sample (or sequence)?
What's the difference between batch_size and bptt_len?

If the maximum length of an individual sequence that I want to feed into my Transformer for language modelling is 100, then is batch_size = 100?

Thank you,

@mttk
Contributor

mttk commented Nov 27, 2019

bptt_len is the length of a single instance, while batch_size is the number of instances you process in parallel (in "batches").
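
As an illustration (just a sketch with a toy one-line corpus; it assumes the legacy torchtext.data API and a field named text, since BPTTIterator expects a single-example language-modelling dataset with a field called text):

from torchtext import data

TEXT = data.Field(tokenize=str.split)
fields = [('text', TEXT)]
line = "<sos> The sun is responsible for plants sprouting blooming and wilting <eos>"  # 12 tokens
corpus = data.Dataset([data.Example.fromlist([line], fields)], fields)
TEXT.build_vocab(corpus)

# bptt_len = tokens per instance, batch_size = instances processed in parallel
it = data.BPTTIterator(corpus, batch_size=2, bptt_len=5)
batch = next(iter(it))
print(batch.text.size())    # torch.Size([5, 2]) -> [bptt_len x batch_size]
print(batch.target.size())  # same shape, shifted one token ahead (the LM targets)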

@h56cho
Author

h56cho commented Nov 27, 2019

@mttk

Hello,

Thank you again for your reply.

For a given bptt_len, what is the easiest way to adjust the batch_size to ensure that every token in a given corpus of text is being processed by the Transformer?

Thank you,

@mttk
Contributor

mttk commented Nov 27, 2019

Every token in an instance will always be processed, no matter the batch_size. The size of batches affects the optimization process (magnitude / bias of updates) & is limited by the amount of GPU memory available to you.
I would suggest checking lecture notes on optimization from an ML / NLP course for more information on this.

@h56cho
Author

h56cho commented Nov 27, 2019

@mttk

Hello,

Thank you again for your reply.
So just to clarify a bit more, when you say "a single instance", do you mean "a single sequence" that's being passed into the Transformer? For example, if we pass a sequence "I like dogs" into the Transformer, would the length of a single instance in this case be 3 ('I', 'like', 'dogs')?

Sorry for the ongoing questions; I will try to look up more about batch_size on the internet.

Thank you,

@mttk
Contributor

mttk commented Nov 27, 2019

Yes, in this case, an instance is a single sequence you pass as input. The length of the "I like dogs" instance is 3 if you use word tokenization (note that most Transformers use subword tokenizers).

If you defined your bptt_len to be, for example, 5, the instance that your model sees would look like ["I", "like", "dogs", "<pad>", "<pad>"]. A batch contains multiple instances which are processed in parallel, so a single batch with batch_size=2 could look like this (illustratively):

[
  ["I", "like", "all", "animals", "<pad>"],
  ["I", "like", "dogs", "<pad>", "<pad>"]
]

@h56cho
Author

h56cho commented Nov 27, 2019

Thank you!

@h56cho h56cho closed this as completed Nov 27, 2019
@h56cho h56cho reopened this Nov 27, 2019
@h56cho
Author

h56cho commented Nov 27, 2019

@mttk

Hello,

just a follow up question,

Say I am using the WikiText2 dataset as my text corpus. To ensure the entire corpus is processed by the Transformer, how should I set my batch_size? It seems to me that only bptt_len * batch_size tokens are being processed, which is not necessarily equal to the length of the entire text corpus. The code and output below are where this reasoning comes from:

# set batch_size and bptt
batch_size=1
bptt = 1024

train_Wiki2_iter, val_Wiki2_iter, test_Wiki2_iter = BPTTIterator.splits(
            (train_Wiki2, val_Wiki2, test_Wiki2),
            batch_size = batch_size,
            bptt_len= bptt,
            sort_key=lambda x: len(x.text),
            sort_within_batch = True,
            shuffle = False,
            repeat=False)

train_Wiki2_i = next(iter(train_Wiki2_iter))
val_Wiki2_i = next(iter(val_Wiki2_iter))
test_Wiki2_i = next(iter(test_Wiki2_iter))
test_dummy_Wiki2_i = test_Wiki2_i

train_Wiki2_i.text.size()
>> torch.Size([1024, 1]) 
 # the size of train_Wiki2_i tells me that 
 # only 1024 * 1 tokens will be processed, 
 # whereas the length of train_Wiki2 is far greater than 1024*1.

How can I ensure that the entire text corpus will be processed? If any of my understanding is wrong, please correct me.

Thank you, I really appreciate your help.

@mttk
Contributor

mttk commented Nov 27, 2019

This size is just a single batch of your dataset. train_Wiki2_iter will have multiple batches of that size, and the total amount of data processed will be len(train_Wiki2_iter) (the number of batches in the dataset) times the size of a single batch (which is [1024, 1] in your case).
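
For instance, a quick sanity check along those lines (just a sketch; it assumes train_Wiki2_iter was built as in your snippet above):

total_tokens = 0
for batch in train_Wiki2_iter:
    total_tokens += batch.text.numel()   # bptt_len * batch_size per batch (the last one may be shorter)
print(len(train_Wiki2_iter), "batches,", total_tokens, "tokens processed in total")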

@h56cho
Author

h56cho commented Nov 27, 2019

Thank you! :)

@h56cho h56cho closed this as completed Nov 27, 2019
@h56cho h56cho reopened this Nov 27, 2019
@h56cho h56cho closed this as completed Nov 27, 2019
@h56cho h56cho reopened this Nov 27, 2019
@h56cho
Author

h56cho commented Nov 27, 2019

@mttk

Hello,

Sorry for continuing this thread.
So say I want my batch_size = 1 since I am interested in doing Stochastic Gradient Descent.

# set batch_size and bptt
batch_size=1
bptt = 1024

train_Wiki2_iter, val_Wiki2_iter, test_Wiki2_iter = BPTTIterator.splits(
            (train_Wiki2, val_Wiki2, test_Wiki2),
            batch_size = batch_size,
            bptt_len= bptt,
            sort_key=lambda x: len(x.text),
            sort_within_batch = True,
            shuffle = False,
            repeat=False)

train_Wiki2_i = next(iter(train_Wiki2_iter))
val_Wiki2_i = next(iter(val_Wiki2_iter))
test_Wiki2_i = next(iter(test_Wiki2_iter))
test_dummy_Wiki2_i = test_Wiki2_i

In your previous post you mentioned that train_Wiki2_iter will have multiple batches of size 1024 * 1. How can I access the second batch, the third batch, and so on? I tried train_Wiki2_i[[2]], train_Wiki2_i[2], and train_Wiki2_i[2, :, :], but none of these seem to do the trick.

Thank you,

@mttk
Contributor

mttk commented Nov 27, 2019

next(iter(...)) will only give you the first batch (the first element of the iterable). If you want to iterate over all batches, do something like this:

for batch_index, batch in enumerate(train_Wiki2_iter):
  # This is now batch #batch_index
  # Do things with the batch
  pass

@h56cho
Author

h56cho commented Nov 27, 2019

Thank you! Highly appreciated.

@h56cho h56cho closed this as completed Nov 27, 2019
@h56cho h56cho reopened this Nov 27, 2019
@h56cho
Author

h56cho commented Nov 27, 2019

@mttk

Hello again,

From your example below:

for batch_index, batch in enumerate(train_Wiki2_iter):
  # This is now batch #batch_index
  # Do things with the batch
  pass

Is there a way I can turn the batch into a Python list?
If so, how can I change batch into a list?

Thank you,

@mttk
Contributor

mttk commented Nov 27, 2019

If the batch is in format [bptt_size, batch_size], then
instance_list = [instance for instance in batch.t()] should work

@h56cho
Author

h56cho commented Nov 27, 2019

instance_list = [instance for instance in batch.t()] doesn't work but
instance_list = [instance for instance in batch.text] seems to work... thank you!
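
In case it helps later readers, here is a small sketch of getting one plain Python list per instance; it assumes the field is named text and that TEXT is the corresponding Field with a built vocab:

# batch.text is [bptt_len x batch_size]; transpose so each row is one instance
instance_ids = batch.text.t().tolist()
instance_tokens = [[TEXT.vocab.itos[i] for i in row] for row in instance_ids]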
