sampler unavailable in BucketIterator #1152

Closed
StephennFernandes opened this issue Feb 11, 2021 · 33 comments

@StephennFernandes

StephennFernandes commented Feb 11, 2021

Unable to use XLA's distributed data sampler, or any multi-GPU training, with BucketIterator because it doesn't have a sampler argument.

train_iterator, valid_iterator = BucketIterator.splits((train_data, test_data), batch_size=batch_size, sort_within_batch=True, sort_key=lambda x: len(x.word_token), device=device)

So I am constrained to using only one GPU.

I use BucketIterator because it produces good batches with minimal padding, but the single-device limit is a real constraint on scaling.

@zhangguanheng66
Contributor

BucketIterator will be retired as legacy code, but we are happy to hear about your use case if you could put together a code snippet.

@StephennFernandes
Author

StephennFernandes commented Feb 11, 2021

Okay, so here's how it goes: I am building a seq2seq model.

from torchtext.data import Field, BucketIterator, TabularDataset

tokenize = lambda x: x.split(" ")
konkani = Field(sequential=True, tokenize=tokenize, init_token='<sos>', eos_token='<eos>')
hindi = Field(sequential=True, tokenize=tokenize, init_token='<sos>', eos_token='<eos>')

fields = [('word_token_konkani', konkani), ('word_token_hindi', hindi)]

train_data, test_data = TabularDataset.splits(path="translation/", train="train.csv", test="test.csv", format="csv", fields=fields)

konkani.build_vocab(train_data, test_data, max_size=100000, min_freq=2)
hindi.build_vocab(train_data, test_data, max_size=100000, min_freq=2)

Then I build batches of similar lengths, minimizing the padding required, by using BucketIterator:

train_iterator, valid_iterator = BucketIterator.splits((train_data, test_data), batch_size=batch_size, sort_within_batch=True, sort_key=lambda x: len(x.word_token_konkani), device=device)

But BucketIterator doesn't take a sampler, so training cannot be distributed across TPUs or GPUs, and I can't use DDP on the data.
I need an alternative way to distribute training by sampling batches to multiple devices.
Any alternative approach for my use case would be awesome.

@ankitvad

@zhangguanheng66 I consulted the code sample you provided for the torchtext revamp, where BucketIterator is being deprecated, here: #664

This piece of sample code:

data_len = [(len(txt), idx, label, txt) for idx, (label, txt) in enumerate(train_dataset)]
data_len.sort()

where train_dataset is of type Dataset. Isn't this really inefficient, since we are not sampling anything and the whole Dataset is loaded into memory at once? I haven't had a chance to check through the BucketIterator code, but is that how torchtext actually handled this: loading all the samples into memory at once? I'm guessing there's no other way to achieve this sort of bucketing without loading everything into memory?

I've been trying the BucketBatchSampler from PyTorch-NLP, but I'm having issues making it work with the samplers introduced by torch.

Is there any way to achieve bucketing of similar-length sentences while still introducing some shuffling among them?

@zhangguanheng66
Contributor

Is there any way to achieve bucketing of similar-length sentences while still introducing some shuffling among them?

An iterable dataset doesn't load the whole dataset into memory, but you don't have an index to access the data the way you do with a map-style dataset.

In practice, you can use a "pool": randomly check out a chunk of data (one that fits into memory) and sort it to group samples of similar length.

@ankitvad

Just to confirm, do you mean an iterable dataset in general, or the torch.utils.data.IterableDataset class? Probably not the class, right, since that's for real-time streaming?

Or does this mean that the dataset is loaded in memory? In my use case I use a pandas DataFrame to load the whole .tsv file in the Dataset class's __init__, so the data is already in memory there. Then, using the example I mentioned above, I create the data_len list, which I know can't be indexed since it's just a list, but it still holds all the data in memory, right? For a seq2seq task I end up with [(len(seq1), idx, seq1, seq2)], which is the whole dataset again.

So is this optimal?

@zhangguanheng66
Contributor

zhangguanheng66 commented Feb 18, 2021

I mean iterator-style data, which yields items in sequence. In general, if the data cannot fit into memory, you should only check out part of it at a time. RawTextIterableDataset is an iterable dataset: you can use its offset to check out different parts of the data and sort them by the lengths of the samples. That way, you only load what you need into memory.
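
Something like this rough sketch, where chunk_size and tokenizer are placeholders and I'm assuming (label, text) samples:

import itertools

def length_sorted_chunks(data, chunk_size, tokenizer):
    # Check out one memory-sized chunk at a time from an iterable dataset,
    # sort it by token length, and yield the sorted samples back out.
    it = iter(data)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        chunk.sort(key=lambda sample: len(tokenizer(sample[1])))
        yield from chunk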

@StephennFernandes
Author

@zhangguanheng66 Could you please show a simple, workable example of RawTextIterableDataset that would also support a distributed sampler for TPU training?

@zhangguanheng66
Contributor

I don't have an example for TPU training, but you can use DataLoader with the new RawTextIterableDataset to reproduce what BucketIterator does. Let's take IMDB as an example:

import random

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from torchtext.datasets import IMDB

# tokenizer, text_transform, and label_transform are assumed to be defined
# elsewhere (e.g. a tokenizer plus vocab/label lookups).

def collate_batch(batch):
    label_list, text_list = [], []
    for (_label, _text) in batch:
        label_list.append(label_transform(_label))
        processed_text = torch.tensor(text_transform(_text))
        text_list.append(processed_text)
    return torch.tensor(label_list), pad_sequence(text_list, padding_value=3.0)

train_iter = IMDB(split='train')
train_list = list(train_iter)
random.shuffle(train_list)  # Randomly shuffle the whole dataset
batch_size = 8  # A batch size of 8

# Create pools, each of size batch_size*100
pools = [train_list[i:i + batch_size * 100] for i in range(0, len(train_list), batch_size * 100)]
# Sort each pool by token length so similar-length samples end up in the same batch
pools = [sorted(samples, key=lambda x: len(tokenizer(x[1]))) for samples in pools]
pools = sum(pools, [])
bucket_dataloader = DataLoader(pools, batch_size=batch_size,
                               shuffle=False,  # shuffle is set to False to keep the order
                               collate_fn=collate_batch)

This needs the nightly release to run.
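
Iterating over it is then just a regular DataLoader loop, roughly:

for labels, padded_texts in bucket_dataloader:
    # labels has shape (batch_size,); padded_texts has shape (max_len_in_batch, batch_size)
    pass  # feed the batch to the model here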

@StephennFernandes
Author

@zhangguanheng66 Thanks for the help. Does this implementation group similar-length sequences together in a batch for minimum padding, similar to BucketIterator?

@zhangguanheng66
Contributor

zhangguanheng66 commented Feb 24, 2021

@zhangguanheng66 Thanks for the help. Does this implementation group similar-length sequences together in a batch for minimum padding, similar to BucketIterator?

Yes. This code snippet has the following steps:

  • Shuffle the whole dataset randomly.
  • Generate multiple "pools", each of size batch_size*100.
  • Sort the samples in each pool by the length of their token lists.
  • Yield batches of size batch_size via DataLoader, processing the text/label data in collate_batch.

Since we sort the samples within each pool, texts with similar lengths are grouped together, which minimizes the padding.

@StephennFernandes
Author

StephennFernandes commented Feb 25, 2021

In your code example above, where do pad_sequence, text_transform, and label_transform come from?

What is their purpose, and what specific transformations do they apply to the existing data?

@zhangguanheng66
Contributor

OK, we just published the migration tutorial (literally yesterday), you can check it here for all the details - link.

@StephennFernandes
Author

@zhangguanheng66 Thank you enormously for all the assistance.

Could you please show me how to parse a corpus.txt file through BPTTIterator for a language modeling task, preferably using the latest experimental functionality too?

I tried to follow the documentation, but found it messy.

@StephennFernandes
Author

@zhangguanheng66 I did try out the experimental code approach with my custom dataset, which is a .csv file with two columns for NMT.

But TabularDataset returns odd objects that I cannot pass to the torchtext.vocab.Vocab class or through any other approach mentioned in the migration tutorial.

@zhangguanheng66
Contributor

@zhangguanheng66 I did try out the experimental code approach with my custom dataset, which is a .csv file with two columns for NMT.

But TabularDataset returns odd objects that I cannot pass to the torchtext.vocab.Vocab class or through any other approach mentioned in the migration tutorial.

Let me reply to your question in the other issue you just opened.

@zhangguanheng66
Contributor

zhangguanheng66 commented Feb 26, 2021

@zhangguanheng66 Thank you enormously for all the assistance.

Could you please show me how to parse a corpus.txt file through BPTTIterator for a language modeling task, preferably using the latest experimental functionality too?

I tried to follow the documentation, but found it messy.

You don't need BPTTIterator to produce source/target sequences for a language modeling task. Take a look at the data processing part of our language modeling tutorial.

@StephennFernandes
Author

@zhangguanheng66

Using batch_sampler makes it mutually exclusive with sampler and batch_size in the DataLoader class, but I need the sampler argument to pass a DistributedSampler for TPU and DDP training, as the sketch below illustrates.

So I am again stuck on single-GPU training.
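
A minimal sketch of what I mean; the toy dataset and the hard-coded num_replicas/rank are only for illustration:

from torch.utils.data import DataLoader, BatchSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler

dataset = list(range(100))  # toy map-style dataset

# What I need for DDP/TPU training (one shard per device):
dist_sampler = DistributedSampler(dataset, num_replicas=2, rank=0)
# What the bucketing approach needs:
bucket_batches = BatchSampler(SequentialSampler(dataset), batch_size=8, drop_last=False)

# Raises ValueError: batch_sampler is mutually exclusive with batch_size,
# shuffle, sampler, and drop_last.
loader = DataLoader(dataset, sampler=dist_sampler, batch_sampler=bucket_batches)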

@zhangguanheng66
Contributor

For this kind of question, you can ask people on the PyTorch NLP forum.

@StephennFernandes
Author

@zhangguanheng66

Using the methods above, I tried implementing the model, but unfortunately I am getting a "Target out of bounds" error. I used the vocab from torchtext.vocab, built the data iterator with RawTextIterableDataset as you recommended, and used all the collate-function and pooling techniques.

I tried every debugging method and checked my code for errors. I also checked the exact dimensionality of the input and target tensors going into the network, and they are exactly the same as those produced by the BucketIterator class.

I also tested the same model implementation with the BucketIterator method and everything works fine.

I cannot seem to find the issue in the current implementation.

@StephennFernandes
Author

It seems like the error is caused by the optimizer being unable to process the loss value, due to some conditionality error.

@zhangguanheng66
Contributor

@zhangguanheng66

Using the methods above, I tried implementing the model, but unfortunately I am getting a "Target out of bounds" error. I used the vocab from torchtext.vocab, built the data iterator with RawTextIterableDataset as you recommended, and used all the collate-function and pooling techniques.

I tried every debugging method and checked my code for errors. I also checked the exact dimensionality of the input and target tensors going into the network, and they are exactly the same as those produced by the BucketIterator class.

I also tested the same model implementation with the BucketIterator method and everything works fine.

I cannot seem to find the issue in the current implementation.

Is this a language modeling task? You should just follow the transformer tutorial to set up the model/optimizer. Check the dimension of the last layer, since the error is "target out of bounds".
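
This is roughly the situation that produces the error; the numbers here are only illustrative:

import torch
import torch.nn as nn

vocab_size = 100                                    # dimension of the model's last layer
logits = torch.randn(8, vocab_size)                 # (batch, vocab_size)
targets = torch.full((8,), 110, dtype=torch.long)   # index 110 doesn't exist in a 100-class output

# Any target index >= vocab_size makes CrossEntropyLoss fail with "Target ... is out of bounds."
loss = nn.CrossEntropyLoss()(logits, targets)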

@StephennFernandes
Author

No, it's just a simple seq2seq NMT.

@zhangguanheng66
Contributor

zhangguanheng66 commented Mar 5, 2021

No, it's just a simple seq2seq NMT.

OK, then you should check the NMT tutorial and the dimension of the last layer (see OUTPUT_DIM and INPUT_DIM). https://pytorch.org/tutorials/beginner/torchtext_translation.html

@StephennFernandes
Author

I did implement the above-mentioned tutorial on a single GPU and everything works perfectly fine.

But the seq2seq decoder produces an outputs tensor that holds the predicted values for multiple examples, and I need to explicitly move it with .to(device), irrespective of using PyTorch Lightning.

Here is the code snippet for the same:

outputs = torch.zeros(target_len, batch_size, target_vocab_size).to(device)  # this is the tensor

hidden, cell = self.encoder(source)

x = target[0]
for t in range(1, target_len):
    output, hidden, cell = self.decoder(x, hidden, cell)

    outputs[t] = output

    best_guess = output.argmax(1)
    x = target[t] if random.random() < teacher_force_ratio else best_guess

For the .to(device) I used xm.xla_device() to move the tensor to a specific device.

But since it's ideal to use all TPU cores to distribute the compute, the problem I am stuck on is how to sync the outputs tensor across all the TPU cores: when I move the tensor to one particular TPU core, I get an error.

I want a way to sync the given tensor across all TPU cores, similar to how .to(device) moves a tensor to a given GPU.

@StephennFernandes
Author

@zhangguanheng66 I was thinking of contributing to torchtext. Is that possible? I would love to contribute.

@zhangguanheng66
Contributor

OSS contributions are always welcome.

@StephennFernandes
Author

I've further been trying to use a simple LSTM, and other LSTM variants like AWD-LSTM, to train a language model on raw text data. But the documentation uses the old BPTTIterator method to train the language model, which again puts me under a single-GPU constraint.

Similar to the solution you provided above, is there a way to train language models on the torchtext nightly?

@zhangguanheng66
Contributor

zhangguanheng66 commented Mar 17, 2021

I've further been trying to use a simple LSTM, and other LSTM variants like AWD-LSTM, to train a language model on raw text data. But the documentation uses the old BPTTIterator method to train the language model, which again puts me under a single-GPU constraint.

Similar to the solution you provided above, is there a way to train language models on the torchtext nightly?

Could you explain to me why the old BPTTIterator limits your language model to a single GPU? To my understanding, those two are not related. And just FYI, in the recent 0.9.0 release the old BPTTIterator was moved to the legacy folder.

@StephennFernandes
Author

Because I use the DistributedSampler class to get a sampler object that I place in the DataLoader to distribute samples of my data across devices.

But BPTTIterator has no sampler argument where I could place that sampler object. Roughly, this is the pattern I rely on:
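
(A minimal sketch; the dataset and the hard-coded num_replicas/rank are placeholders, since in real training they come from the distributed/XLA runtime.)

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(1000))  # placeholder dataset

# One DistributedSampler per process, so each device sees a distinct shard of the data.
sampler = DistributedSampler(dataset, num_replicas=8, rank=0, shuffle=True)

loader = DataLoader(dataset, batch_size=32, sampler=sampler)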

@StephennFernandes
Author

Okay, another issue: what if my raw text data is huge, like 50 GB or so? How do I iteratively read sequences of data and pass them to the appropriate DataLoader without exhausting my memory?

@zhangguanheng66
Contributor

Please take a look at the new datasets in the torchtext.datasets folder. You should use an iterable dataset, which doesn't load the whole dataset into memory.
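
For a custom raw text file, the same idea looks roughly like this sketch (the file path and class name are just placeholders):

from torch.utils.data import IterableDataset, DataLoader

class LineStream(IterableDataset):
    """Stream a text file one line at a time; the file is never fully in memory."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")

loader = DataLoader(LineStream("corpus.txt"), batch_size=32)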

@StephennFernandes
Author

I tried this, but the dataset_utils module is missing, so I couldn't implement the _read_text_iterator.
It would be great if you could show a way (demo code or a tutorial) to load a custom .txt file with a vocab for a language modeling task.

@StephennFernandes
Author

StephennFernandes commented Mar 20, 2021

I get an AttributeError when using BPTTIterator

train_iter = BPTTIterator(train_dl, batch_size=32, bptt_len=30)

~/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataset.py in __getattr__(self, attribute_name)
    169             return function
    170         else:
--> 171             raise AttributeError
    172
    173     @classmethod

AttributeError:
