Sampler unavailable in BucketIterator #1152
Okay, so here's how it goes: I am building a seq2seq model.

```python
from torchtext.data import Field, BucketIterator, TabularDataset

tokenize = lambda x: x.split(" ")

# Field definitions assumed (not shown in the original post)
konkani = Field(tokenize=tokenize)
hindi = Field(tokenize=tokenize)

fields = [('word_token_konkani', konkani), ('word_token_hindi', hindi)]

train_data, test_data = TabularDataset.splits(
    path="translation/", train="train.csv", test="test.csv",
    format="csv", fields=fields)

konkani.build_vocab(train_data, test_data, max_size=100000, min_freq=2)
```

Then I build batches of similar-length sequences, minimizing the padding required, by using BucketIterator. But BucketIterator doesn't take a sampler, so training cannot be distributed across TPUs and GPUs.
@zhangguanheng66 I consulted the code sample you provided for the torchtext revamp, in which BucketIterator is being deprecated (#664).
In that sample, train_dataset is of type Dataset. Isn't that really inefficient? We are not sampling anything, and the whole dataset is loaded into memory at once. I haven't had a chance to read through the BucketIterator code, but is that how torchtext actually handled this, loading all the samples into memory at once? I'm guessing there's no other way to achieve this sort of bucketing without loading everything into memory? I've been trying BucketBatchSampler from PyTorch-NLP, but I'm having issues with the samplers introduced by torch. Is there any way to bucket sentences of similar length and still introduce some shuffling among them?
An iterable dataset does not load the whole dataset into memory, but you don't have an index to access the data as you do with a map-style dataset. In practice, you can keep a "pool": repeatedly check out a chunk of data (one that fits into memory) and sort it to group items of similar length.
Just to confirm, do you mean an iterable dataset in general, or the torch IterableDataset class? Probably not the class, right, since that's for real-time streaming? And to be clear, does this mean the dataset ends up in memory or not? In my use case I load the whole .tsv file into a pandas DataFrame inside the Dataset class's __init__, so the data is already in memory there. Then, following the example you mentioned above, I create a list data_len, which I know isn't an indexed dataset since it's just a list, but it still holds all the data in memory, right? For example, for seq2seq I end up with [(len(seq1), idx, seq1, seq2)], which is the whole dataset again. So is this optimal?
I mean iterator-style data, which yields the data in sequence. In general, if the data cannot fit into your memory, you should only check out part of it at a time. RawTextIterableDataset is an iterable dataset you can use.
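For illustration, here is a minimal iterator-style dataset that streams examples from disk, so only one line is ever in memory at a time (a sketch, assuming one whitespace-tokenized example per line; the class and file names are hypothetical):

```python
from torch.utils.data import IterableDataset

class LineStreamDataset(IterableDataset):
    # Yields one tokenized example at a time; the file is never fully in memory.
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf8") as f:
            for line in f:
                yield line.strip().split(" ")

# usage: examples are produced lazily as the loop advances
for tokens in LineStreamDataset("train.txt"):
    pass  # numericalize/batch the tokens here
```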
@zhangguanheng66 Could you please show a simple workable example of this approach?
I don't have an example for TPU training, but you can use the approach below. This needs the nightly release to run.
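A minimal sketch of the pool-based batching idea being described (illustrative names, not the exact snippet from the thread; data_iter is any iterator of tokenized examples):

```python
import random

def pooled_bucket_batches(data_iter, batch_size, pool_factor=100):
    # Read a pool of examples, sort the pool by length so similar-length
    # texts land in the same batch, then yield the batches in random order.
    pool = []
    for example in data_iter:
        pool.append(example)
        if len(pool) == batch_size * pool_factor:
            yield from _flush(pool, batch_size)
            pool = []
    if pool:
        yield from _flush(pool, batch_size)

def _flush(pool, batch_size):
    pool.sort(key=len)  # group similar lengths to minimize padding
    batches = [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]
    random.shuffle(batches)  # keep some randomness across batches
    yield from batches
```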
@zhangguanheng66 Thanks for the help. Does this implementation group similar-length sequences together in a batch for minimum padding, similar to BucketIterator?
Yes. The code snippet has the following steps: read a chunk of data into a pool, sort the pool, split it into batches, and shuffle the order of the batches. Since we sort the samples in the pool, texts of similar length are grouped together, which minimizes the padding.
In your code example, what is the premise of those transforms, and what specific transformations do they apply to the existing data?
OK, we just published the migration tutorial (literally yesterday). You can check it for all the details: link.
@zhangguanheng66 Thank you enormously for all the assistance. Could you please show me how to parse a corpus.txt file through BPTTIterator for a language modeling task? I tried to follow the documentation, but it was messy.
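For reference, a minimal sketch with the legacy API (assuming a plain-text corpus.txt and whitespace tokenization; not an official recipe):

```python
from torchtext.data import Field, BPTTIterator
from torchtext.datasets import LanguageModelingDataset

TEXT = Field(lower=True, tokenize=lambda s: s.split(" "))

# LanguageModelingDataset concatenates the whole file into one long token stream
corpus = LanguageModelingDataset(path="corpus.txt", text_field=TEXT)
TEXT.build_vocab(corpus, min_freq=2)

# BPTTIterator slices the stream into (text, target) chunks of bptt_len tokens,
# where target is text shifted by one position
train_iter = BPTTIterator(corpus, batch_size=32, bptt_len=30, device="cpu")

for batch in train_iter:
    src, tgt = batch.text, batch.target  # shapes: [bptt_len, batch_size]
```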
@zhangguanheng66 I did try out the experimental code approach with my custom dataset, a .csv file with two columns for NMT, but TabularDataset returns odd objects that I cannot pass through the rest of the pipeline.
Let me reply to your question in the other issue you just opened.
You don't need the sampler argument for that.
Using batch_sampler makes sampler, batch_sampler, and batch_size mutually exclusive in the DataLoader, but I do need the sampler argument to set my distributed sampler; otherwise I am again stuck with single-GPU training.
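One way to get both (a sketch, assuming a map-style dataset with precomputed lengths; BucketBatchSampler, lengths, and pad_collate are illustrative names): wrap the distributed sampler inside a custom batch sampler, so the DataLoader only receives batch_sampler while the sharding still happens per replica.

```python
from torch.utils.data import DataLoader, DistributedSampler, Sampler

class BucketBatchSampler(Sampler):
    # Yields batches of indices grouped by similar length. Wraps another
    # sampler (e.g. DistributedSampler), so sharding and bucketing compose.
    def __init__(self, sampler, lengths, batch_size, pool_factor=100):
        self.sampler = sampler          # per-replica index stream
        self.lengths = lengths          # lengths[i] = length of example i
        self.batch_size = batch_size
        self.pool_size = batch_size * pool_factor

    def __iter__(self):
        pool = []
        for idx in self.sampler:
            pool.append(idx)
            if len(pool) >= self.pool_size:
                yield from self._batches(pool)
                pool = []
        if pool:
            yield from self._batches(pool)

    def _batches(self, pool):
        pool.sort(key=lambda i: self.lengths[i])  # bucket by length
        for i in range(0, len(pool), self.batch_size):
            yield pool[i:i + self.batch_size]

    def __len__(self):
        return (len(self.sampler) + self.batch_size - 1) // self.batch_size

# usage (requires an initialized process group for DistributedSampler):
# dist_sampler = DistributedSampler(train_dataset)   # shards indices per replica
# loader = DataLoader(train_dataset,
#                     batch_sampler=BucketBatchSampler(dist_sampler, lengths, 32),
#                     collate_fn=pad_collate)         # pad_collate: your padding collate_fn
```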
For this kind of question, you can ask the folks on the PyTorch NLP forum.
Using your methods above, I tried implementing the model, but unfortunately I am getting a "target out of bounds" error. I tried all the debugging methods and checked my code for errors. I also checked the exact dimensionality of the input and target tensors going into the network, and they were exactly the same as those produced by the BucketIterator class. I also tested the same model implementation with the BucketIterator method, and everything works fine. I cannot seem to find the issue in the current implementation.
It seems the error is caused by the optimizer being unable to process the loss value due to some conditionality error.
Is this a language model? You should just follow the transformer tutorial and set up the model/optimizer. Check the dimension of the last layer, since the error is "target out of bound".
No, it's just a simple seq2seq NMT model.
OK, then you should check the NMT tutorial and the dimension of the last layer.
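A quick sanity check for this class of error (a sketch; check_target_bounds is an illustrative helper, not a library function):

```python
import torch

def check_target_bounds(logits: torch.Tensor, target: torch.Tensor) -> None:
    # For CrossEntropyLoss, the last logits dimension must equal the target
    # vocab size, and every target index must fall inside [0, vocab_size).
    vocab_size = logits.shape[-1]
    if target.min().item() < 0 or target.max().item() >= vocab_size:
        raise ValueError(
            f"target indices span [{target.min().item()}, {target.max().item()}], "
            f"which does not fit the logits vocab size {vocab_size}")
```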
I did implement the above-mentioned tutorial on a single GPU and everything works perfectly fine. But the seq2seq decoder produces an output tensor holding the predicted values for multiple examples, which I need to move explicitly with .to(device), irrespective of my using PyTorch Lightning. Here is the code snippet:

```python
# this is the tensor that accumulates the decoder predictions
outputs = torch.zeros(target_len, batch_size, target_vocab_size).to(device)
```

But since it's ideal to use all TPU cores to distribute the compute, the problem I am stuck on is how to sync this outputs tensor across all the TPU cores. I want a way to move the given tensor so it stays in sync on all TPU cores.
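One common fix (a sketch, assuming the decoder has access to a tensor from the current batch, e.g. the encoder outputs): allocate the buffer on the device of a tensor you already have instead of a global device. Under Lightning/XLA each process then creates the tensor on its own core, with no manual .to(device) needed:

```python
import torch

def init_decoder_outputs(encoder_outputs: torch.Tensor,
                         target_len: int, target_vocab_size: int) -> torch.Tensor:
    # Allocate the accumulator on whatever device the batch already lives on,
    # so the same code runs unchanged on CPU, GPU, or a TPU core.
    batch_size = encoder_outputs.size(1)
    return torch.zeros(target_len, batch_size, target_vocab_size,
                       device=encoder_outputs.device,
                       dtype=encoder_outputs.dtype)
```

Note that in the usual data-parallel setup each core computes the forward pass and loss on its own shard, and only gradients are reduced across cores, so this per-batch outputs tensor normally does not need to be synced across cores at all.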
@zhangguanheng66 I was thinking of contributing to torchtext; is that possible? I would love to contribute.
OSS contributions are always welcome. |
I've been trying to use a simple LSTM, and variations of it like AWD-LSTM, to train a language model on raw text data, similar to the solution you provided above. Is there a way to train language models on the torchtext nightly?
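A sketch of the nightly-era pattern for language-modeling data (build_vocab_from_iterator is a real torchtext function; the file name, whitespace tokenization, and the batchify layout, which follows the transformer tutorial, are assumptions):

```python
import torch
from torchtext.vocab import build_vocab_from_iterator

def line_tokens(path):
    # stream tokenized lines; the corpus is never fully in memory here
    with open(path, encoding="utf8") as f:
        for line in f:
            yield line.split()

vocab = build_vocab_from_iterator(line_tokens("corpus.txt"))

# flatten the corpus into one long tensor of token ids
ids = torch.cat([torch.tensor([vocab[t] for t in toks], dtype=torch.long)
                 for toks in line_tokens("corpus.txt")])

def batchify(data, batch_size):
    # Trim to a multiple of batch_size and reshape so each column is a
    # contiguous token stream (the transformer-tutorial layout).
    n = data.size(0) // batch_size
    return data[:n * batch_size].view(batch_size, -1).t().contiguous()

train_data = batchify(ids, 32)  # shape: [seq_len, 32]
```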
Could you explain why the old BPTTIterator limits your language model to a single GPU? To my understanding, those two are not related. And just FYI, in the recent 0.9.0 release, the old BPTTIterator has been moved to the legacy folder.
Because I use the DistributedSampler class to get a sampler object that I pass to the DataLoader to distribute samples of my data to different devices. But BPTTIterator doesn't have a sampler argument where I could place that sampler object.
Okay, another issue: what if my raw text data is huge, say 50 GB? How do I iteratively read sequences of data and pass them to the appropriate DataLoader without exhausting my memory?
Please take a look at the new datasets in the library.
I tried this, but the `dataset_utils` modules are missing, so I couldn't use `_read_text_iterator`.
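If the helper is unavailable, an equivalent lazy reader is only a few lines (a sketch of the same idea; tokenizer is whatever callable you use):

```python
import io

def read_text_iterator(path, tokenizer):
    # Yield one tokenized line at a time; only the current line is in memory,
    # so even a 50 GB file streams through without exhausting RAM.
    with io.open(path, encoding="utf8") as f:
        for line in f:
            yield tokenizer(line)
```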
I get an `AttributeError` when using BPTTIterator.
Unable to use XLA's distributed data sampler, or any multi-GPU training, with BucketIterator because it doesn't expose a sampler argument:

```python
train_iterator, valid_iterator = BucketIterator.splits(
    (train_data, test_data),
    batch_size=batch_size,
    sort_within_batch=True,
    sort_key=lambda x: len(x.word_token),
    device=device)
```

So I am constrained to using only one GPU. I used BucketIterator because it gives good batches with minimal padding, but the limit on scaling is a hard constraint.