sampler unavailable in BucketIterator #1152

Closed
StephennFernandes opened this issue Feb 11, 2021 · 33 comments

@StephennFernandes

StephennFernandes commented Feb 11, 2021

Unable to use XLA's distributed data sampler, or any multi-GPU training, with BucketIterator because it doesn't have a sampler argument.

train_iterator, valid_iterator = BucketIterator.splits((train_data, test_data), batch_size=batch_size, sort_within_batch=True, sort_key=lambda x: len(x.word_token), device=device)

So I am constrained to using only one GPU.

I use BucketIterator because it produces good batches with minimal padding, but the single-device limit is a real constraint on scaling.

@zhangguanheng66
Contributor

BucketIterator will be retired as legacy code, but we are happy to hear about your use case if you could put together a code snippet.

@StephennFernandes
Author

StephennFernandes commented Feb 11, 2021

Okay, so here's how it goes: I am building a seq2seq model.

from torchtext.data import Field, BucketIterator, TabularDataset

tokenize = lambda x: x.split(" ")
konkani = Field(sequential=True, tokenize=tokenize, init_token='<sos>', eos_token='<eos>')
hindi = Field(sequential=True, tokenize=tokenize, init_token='<sos>', eos_token='<eos>')

fields = [('word_token_konkani', konkani), ('word_token_hindi', hindi)]

train_data, test_data = TabularDataset.splits(path="translation/", train="train.csv", test="test.csv", format="csv", fields=fields)

konkani.build_vocab(train_data, test_data, max_size=100000, min_freq=2)
hindi.build_vocab(train_data, test_data, max_size=100000, min_freq=2)

Then I build batches of similar lengths, minimizing the padding required, by using BucketIterator:

train_iterator, valid_iterator = BucketIterator.splits((train_data, test_data), batch_size=batch_size, sort_within_batch=True, sort_key=lambda x: len(x.word_token_konkani), device=device)

But BucketIterator doesn't take a sampler, so training cannot be distributed across TPUs or GPUs, and I can't use DDP on the data.
I need an alternative way to distribute training by sampling batches to multiple devices.
Any alternative approach for my use case would be awesome.

@ankitvad

@zhangguanheng66 I consulted the code sample you provided for the torchtext revamp, where BucketIterator is being deprecated, here: #664

This piece of sample code:

data_len = [(len(txt), idx, label, txt) for idx, (label, txt) in enumerate(train_dataset)]
data_len.sort()

where train_dataset is of type Dataset. Isn't this really inefficient, since we are not sampling anything and the whole Dataset is loaded into memory at once? I haven't had a chance to check through the BucketIterator code, but is that how torchtext actually handled this: loading all the samples into memory at once? I'm guessing there's no other way to achieve this sort of bucketing without loading everything into memory?

I've been trying the BucketBatchSampler from PyTorch-NLP, but I'm having issues making it work with the samplers introduced by torch.

Is there any way to achieve bucketing of similar-length sentences while still introducing some shuffling among them?

@zhangguanheng66
Contributor

Is there any way to achieve bucketing of similar-length sentences while still introducing some shuffling among them?

An iterable dataset doesn't load the whole dataset into memory, but you don't have an index to access the data the way you do with a map-style dataset.

In practice, you can use a "pool": randomly check out a chunk of data (one that fits into memory) and sort it to group samples of similar length.

@ankitvad

Just to confirm, do you mean an iterable dataset in general, or the torch.utils.data.IterableDataset class? Probably not the class, right, since that's for real-time streaming?

Or does this mean that the dataset is loaded in memory? In my use case I use a pandas DataFrame to load the whole .tsv file in the Dataset class's __init__, so the data is already in memory there. Then, using the example I mentioned above, I create the data_len list, which I know can't be indexed since it's just a list, but it still holds all the data in memory, right? For a seq2seq task I end up with [(len(seq1), idx, seq1, seq2)], which is the whole dataset again.

So is this optimal?

@zhangguanheng66
Contributor

zhangguanheng66 commented Feb 18, 2021

I mean iterator-style data, which yields items in sequence. In general, if the data cannot fit into memory, you should only check out part of it at a time. RawTextIterableDataset is an iterable dataset: you can use its offset to check out different parts of the data and sort them by the lengths of the samples. That way, you only load what you need into memory.
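
Something like this rough sketch, where chunk_size and tokenizer are placeholders and I'm assuming (label, text) samples:

import itertools

def length_sorted_chunks(data, chunk_size, tokenizer):
    # Check out one memory-sized chunk at a time from an iterable dataset,
    # sort it by token length, and yield the sorted samples back out.
    it = iter(data)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        chunk.sort(key=lambda sample: len(tokenizer(sample[1])))
        yield from chunk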

@StephennFernandes
Author

@zhangguanheng66 Could you please show a simple, workable example of RawTextIterableDataset that would also support a distributed sampler for TPU training?

@zhangguanheng66
Contributor

I don't have an example for TPU training, but you can use DataLoader with the new RawTextIterableDataset to reproduce what BucketIterator does. Let's take IMDB as an example:

import random

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from torchtext.datasets import IMDB

# tokenizer, text_transform, and label_transform are assumed to be defined
# elsewhere (e.g. a tokenizer plus vocab/label lookups).

def collate_batch(batch):
    label_list, text_list = [], []
    for (_label, _text) in batch:
        label_list.append(label_transform(_label))
        processed_text = torch.tensor(text_transform(_text))
        text_list.append(processed_text)
    return torch.tensor(label_list), pad_sequence(text_list, padding_value=3.0)

train_iter = IMDB(split='train')
train_list = list(train_iter)
random.shuffle(train_list)  # Randomly shuffle the whole dataset
batch_size = 8  # A batch size of 8

# Create pools, each of size batch_size*100
pools = [train_list[i:i + batch_size * 100] for i in range(0, len(train_list), batch_size * 100)]
# Sort each pool by token length so similar-length samples end up in the same batch
pools = [sorted(samples, key=lambda x: len(tokenizer(x[1]))) for samples in pools]
pools = sum(pools, [])
bucket_dataloader = DataLoader(pools, batch_size=batch_size,
                               shuffle=False,  # shuffle is set to False to keep the order
                               collate_fn=collate_batch)

This needs the nightly release to run.
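
Iterating over it is then just a regular DataLoader loop, roughly:

for labels, padded_texts in bucket_dataloader:
    # labels has shape (batch_size,); padded_texts has shape (max_len_in_batch, batch_size)
    pass  # feed the batch to the model here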

@StephennFernandes
Author

@zhangguanheng66 Thanks for the help. Does this implementation group similar-length sequences together in a batch for minimum padding, similar to BucketIterator?

@zhangguanheng66
Contributor

zhangguanheng66 commented Feb 24, 2021

@zhangguanheng66 Thanks for the help. Does this implementation group similar-length sequences together in a batch for minimum padding, similar to BucketIterator?

Yes. This code snippet has the following steps:

  • Shuffle the whole dataset randomly.
  • Generate multiple "pools", each of size batch_size*100.
  • Sort the samples in each pool by the length of their token lists.
  • Yield batches of size batch_size via DataLoader, processing the text/label data in collate_batch.

Since we sort the samples within each pool, texts with similar lengths are grouped together, which minimizes the padding.

@StephennFernandes
Author

StephennFernandes commented Feb 25, 2021

In your code example above, where do pad_sequence, text_transform, and label_transform come from?

What is their purpose, and what specific transformations do they apply to the existing data?

@zhangguanheng66
Contributor

OK, we just published the migration tutorial (literally yesterday), you can check it here for all the details - link.

@StephennFernandes
Author

@zhangguanheng66 Thank you enormously for all the assistance.

Could you please show me how to parse a corpus.txt file through BPTTIterator for a language modeling task, preferably using the latest experimental functionality too?

I tried to follow the documentation, but found it messy.

@StephennFernandes
Author

@zhangguanheng66 I did try out the experimental code approach with my custom dataset, which is a .csv file with two columns for NMT.

But TabularDataset returns odd objects that I cannot pass to the torchtext.vocab.Vocab class or through any other approach mentioned in the migration tutorial.

@zhangguanheng66
Contributor

@zhangguanheng66 I did try out the experimental code approach with my custom dataset, which is a .csv file with two columns for NMT.

But TabularDataset returns odd objects that I cannot pass to the torchtext.vocab.Vocab class or through any other approach mentioned in the migration tutorial.

Let me reply to your question in the other issue you just opened.

@zhangguanheng66
Contributor

zhangguanheng66 commented Feb 26, 2021

@zhangguanheng66 Thank you enormously for all the assistance.

Could you please show me how to parse a corpus.txt file through BPTTIterator for a language modeling task, preferably using the latest experimental functionality too?

I tried to follow the documentation, but found it messy.

You don't need BPTTIterator to produce source/target sequences for a language modeling task. Take a look at the data processing part of our language modeling tutorial.

@StephennFernandes
Author

@zhangguanheng66

Using batch_sampler makes it mutually exclusive with sampler and batch_size in the DataLoader class, but I need the sampler argument to pass a DistributedSampler for TPU and DDP training, as the sketch below illustrates.

So I am again stuck on single-GPU training.
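
A minimal sketch of what I mean; the toy dataset and the hard-coded num_replicas/rank are only for illustration:

from torch.utils.data import DataLoader, BatchSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler

dataset = list(range(100))  # toy map-style dataset

# What I need for DDP/TPU training (one shard per device):
dist_sampler = DistributedSampler(dataset, num_replicas=2, rank=0)
# What the bucketing approach needs:
bucket_batches = BatchSampler(SequentialSampler(dataset), batch_size=8, drop_last=False)

# Raises ValueError: batch_sampler is mutually exclusive with batch_size,
# shuffle, sampler, and drop_last.
loader = DataLoader(dataset, sampler=dist_sampler, batch_sampler=bucket_batches)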

@zhangguanheng66
Contributor

For this kind of question, you can ask people on the PyTorch NLP forum.

@StephennFernandes
Author

@zhangguanheng66

Using the methods above, I tried implementing the model, but unfortunately I am getting a "Target out of bounds" error. I used the vocab from torchtext.vocab, built the data iterator with RawTextIterableDataset as you recommended, and used all the collate-function and pooling techniques.

I tried every debugging method and checked my code for errors. I also checked the exact dimensionality of the input and target tensors going into the network, and they are exactly the same as those produced by the BucketIterator class.

I also tested the same model implementation with the BucketIterator method and everything works fine.

I cannot seem to find the issue in the current implementation.

@StephennFernandes
Author

It seems like the error is caused by the optimizer being unable to process the loss value, due to some conditionality error.

@zhangguanheng66
Contributor

@zhangguanheng66

Using the methods above, I tried implementing the model, but unfortunately I am getting a "Target out of bounds" error. I used the vocab from torchtext.vocab, built the data iterator with RawTextIterableDataset as you recommended, and used all the collate-function and pooling techniques.

I tried every debugging method and checked my code for errors. I also checked the exact dimensionality of the input and target tensors going into the network, and they are exactly the same as those produced by the BucketIterator class.

I also tested the same model implementation with the BucketIterator method and everything works fine.

I cannot seem to find the issue in the current implementation.

Is this a language modeling task? You should just follow the transformer tutorial to set up the model/optimizer. Check the dimension of the last layer, since the error is "target out of bounds".
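
This is roughly the situation that produces the error; the numbers here are only illustrative:

import torch
import torch.nn as nn

vocab_size = 100                                    # dimension of the model's last layer
logits = torch.randn(8, vocab_size)                 # (batch, vocab_size)
targets = torch.full((8,), 110, dtype=torch.long)   # index 110 doesn't exist in a 100-class output

# Any target index >= vocab_size makes CrossEntropyLoss fail with "Target ... is out of bounds."
loss = nn.CrossEntropyLoss()(logits, targets)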

@StephennFernandes
Author

No, it's just a simple seq2seq NMT.

@zhangguanheng66
Contributor

zhangguanheng66 commented Mar 5, 2021

No, it's just a simple seq2seq NMT.

OK, then you should check the NMT tutorial and the dimension of the last layer (see OUTPUT_DIM and INPUT_DIM). https://pytorch.org/tutorials/beginner/torchtext_translation.html

@StephennFernandes
Author

I did implement the above-mentioned tutorial on a single GPU and everything works perfectly fine.

But the seq2seq decoder produces an outputs tensor that holds the predicted values for multiple examples, and I need to explicitly move it with .to(device), irrespective of using PyTorch Lightning.

Here is the code snippet for the same:

outputs = torch.zeros(target_len, batch_size, target_vocab_size).to(device)  # this is the tensor

hidden, cell = self.encoder(source)

x = target[0]
for t in range(1, target_len):
    output, hidden, cell = self.decoder(x, hidden, cell)

    outputs[t] = output

    best_guess = output.argmax(1)
    x = target[t] if random.random() < teacher_force_ratio else best_guess

For the .to(device) I used xm.xla_device() to move the tensor to a specific device.

But since it's ideal to use all TPU cores to distribute the compute, the problem I am stuck on is how to sync the outputs tensor across all the TPU cores: when I move the tensor to one particular TPU core, I get an error.

I want a way to sync the given tensor across all TPU cores, similar to how .to(device) moves a tensor to a given GPU.

@StephennFernandes
Author

@zhangguanheng66 I was thinking of contributing to torchtext. Is that possible? I would love to contribute.

@zhangguanheng66
Contributor

OSS contributions are always welcome.

@StephennFernandes
Author

I've further been trying to use a simple LSTM, and other LSTM variants like AWD-LSTM, to train a language model on raw text data. But the documentation uses the old BPTTIterator method to train the language model, which again puts me under a single-GPU constraint.

Similar to the solution you provided above, is there a way to train language models on the torchtext nightly?

@zhangguanheng66
Contributor

zhangguanheng66 commented Mar 17, 2021

I've further been trying to use a simple LSTM, and other LSTM variants like AWD-LSTM, to train a language model on raw text data. But the documentation uses the old BPTTIterator method to train the language model, which again puts me under a single-GPU constraint.

Similar to the solution you provided above, is there a way to train language models on the torchtext nightly?

Could you explain to me why the old BPTTIterator limits your language model to a single GPU? To my understanding, those two are not related. And just FYI, in the recent 0.9.0 release the old BPTTIterator was moved to the legacy folder.

@StephennFernandes
Author

Because I use the DistributedSampler class to get a sampler object that I place in the DataLoader to distribute samples of my data across devices.

But BPTTIterator has no sampler argument where I could place that sampler object. Roughly, this is the pattern I rely on:
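
(A minimal sketch; the dataset and the hard-coded num_replicas/rank are placeholders, since in real training they come from the distributed/XLA runtime.)

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(1000))  # placeholder dataset

# One DistributedSampler per process, so each device sees a distinct shard of the data.
sampler = DistributedSampler(dataset, num_replicas=8, rank=0, shuffle=True)

loader = DataLoader(dataset, batch_size=32, sampler=sampler)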

@StephennFernandes
Author

Okay, another issue: what if my raw text data is huge, like 50 GB or so? How do I iteratively read sequences of data and pass them to the appropriate DataLoader without exhausting my memory?

@zhangguanheng66
Contributor

Please take a look at the new datasets in the torchtext.datasets folder. You should use an iterable dataset, which doesn't load the whole dataset into memory.
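
For a custom raw text file, the same idea looks roughly like this sketch (the file path and class name are just placeholders):

from torch.utils.data import IterableDataset, DataLoader

class LineStream(IterableDataset):
    """Stream a text file one line at a time; the file is never fully in memory."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")

loader = DataLoader(LineStream("corpus.txt"), batch_size=32)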

@StephennFernandes
Author

I tried this, but the dataset_utils module is missing, so I couldn't implement the _read_text_iterator.
It would be great if you could show a way (demo code or a tutorial) to load a custom .txt file with a vocab for a language modeling task.

@StephennFernandes
Author

StephennFernandes commented Mar 20, 2021

I get an AttributeError when using BPTTIterator

train_iter = BPTTIterator(train_dl, batch_size=32, bptt_len=30)

~/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataset.py in __getattr__(self, attribute_name)
    169             return function
    170         else:
--> 171             raise AttributeError
    172
    173     @classmethod

AttributeError:
