
Batch should be sorted by decreasing size. #95

Merged: 4 commits merged into pytorch:master on Aug 17, 2017

Conversation

@PetrochukM (Contributor) commented Aug 16, 2017

Intended effect:

  • rnn.pack_padded_sequence requires that the examples within a minibatch be sorted in decreasing order of length.
  • Curriculum learning requires that batches be produced in increasing order.

Proposed Solution:
Flip the sign of self.sort_key(x) when creating the Batch.

`rnn.pack_padded_sequence` requires that a minibatch be sorted in decreasing order.

It's important for `self.sort_key(x)` to sort the data in increasing order for curriculum learning, but for the rows within each batch to be sorted in decreasing order.
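A minimal sketch of the proposed sign flip, using toy examples and `len` as a stand-in for `self.sort_key` (not the torchtext source):

    # A key meant for increasing order, when negated, yields decreasing order.
    minibatch = ["ab", "abcd", "a", "abc"]   # hypothetical examples
    sort_key = len                           # stand-in for self.sort_key

    minibatch_decreasing_size = sorted(minibatch, key=lambda ex: -sort_key(ex))
    print(minibatch_decreasing_size)         # ['abcd', 'abc', 'ab', 'a']

Note that negating the key only works when it returns a number; sorting with the unchanged key and then reversing, as discussed below, is the more generic option.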
@nelson-liu (Contributor)

Not sure how I feel about this (and thus curriculum learning) being on by default.

@jekbradbury (Contributor)

I support the idea of intra-batch sorting being the opposite of inter-batch sorting, since the only reason for the former is to support packed sequences. It won’t turn curriculum learning on by default if you use a shuffled iterator like BucketIterator.
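
For context, a minimal usage sketch of a shuffled bucketing iterator, assuming the 0.2-era torchtext API and a hypothetical `train_dataset` whose examples have a `text` field (parameter values are illustrative):

    from torchtext import data

    # sort_key orders examples by increasing length so similarly sized examples
    # get bucketed together across batches; the decreasing order inside each
    # batch is handled by the iterator itself, and shuffling between epochs is
    # what avoids accidental curriculum learning.
    train_iter = data.BucketIterator(
        train_dataset, batch_size=32,
        sort_key=lambda ex: len(ex.text),
        device=-1)  # -1 selected the CPU in this era of torchtext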

@jekbradbury (Contributor)

I believe what should happen here is just a reverse() call, though, since the two sorts should always use the same key and just run in opposite order.
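
A self-contained illustration of sorting once with the shared key and then reversing (toy data, not the torchtext source):

    minibatch = ["ab", "abcd", "a", "abc"]   # hypothetical examples
    sort_key = len                           # the same key used for inter-batch sorting

    minibatch.sort(key=sort_key)             # increasing order (curriculum-friendly)
    minibatch.reverse()                      # in-place flip: longest first, as packing requires
    print(minibatch)                         # ['abcd', 'abc', 'ab', 'a']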

@nelson-liu (Contributor)

> It won't turn curriculum learning on by default if you use a shuffled iterator like BucketIterator.

Good point; this was my main concern.

@jekbradbury (Contributor)

@Deepblue129, if you agree that just adding minibatch.reverse() before the Batch constructor solves your issue, I think it'd be the most generic solution and I'll merge that for 0.2.

@PetrochukM (Contributor, Author)

Thank you for your comments. I made the change you suggested.

I also added a comment, because the reverse seems to come out of nowhere unless you have extra context.

@jekbradbury (Contributor)

You need to do minibatch.reverse() on a separate line, because .reverse() returns None (you could also use reversed, but there's no reason not to do it in place here).

Also, maybe the comment should just say something like "pack_padded_sequence requires that a minibatch be sorted by decreasing order, which requires reversing relative to typical sort keys" rather than asking the reader to copy and paste a GitHub URL?
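
A quick illustration of the pitfall (plain Python, not the torchtext source):

    minibatch = ["a", "ab", "abc"]

    wrong = minibatch.reverse()              # list.reverse() mutates in place and returns None
    print(wrong)                             # None, so passing this to Batch(...) would break

    minibatch = ["a", "ab", "abc"]
    minibatch.reverse()                      # correct: call it on its own line...
    print(minibatch)                         # ['abc', 'ab', 'a']

    also_reversed = list(reversed(minibatch))  # ...or use reversed() if a new list is fine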

@@ -157,8 +157,8 @@ def __iter__(self):
                continue
            self.iterations += 1
            self._iterations_this_epoch += 1
            minibatch_decreasing_size = sorted(minibatch, key=lambda x: -self.sort_key(x))
            yield Batch(minibatch_decreasing_size, self.dataset, self.device,
            # NOTE: Find out more here for why we reverse: https://github.com/pytorch/text/pull/95

@nelson-liu (Contributor)

LGTM when CI passes.

@PetrochukM (Contributor, Author)

@nelson-liu Did you know we both go to UW Comp Sci? And both worked @ Google?

@jekbradbury (Contributor)

Flake8 failed with `./torchtext/data/iterator.py:160:91: E501 line too long (107 > 90 characters)`.

@jekbradbury merged commit a5049b9 into pytorch:master on Aug 17, 2017
@JianyuZhan commented Aug 26, 2017

Hi @jekbradbury, I still stumbled on this issue even with this commit (a5049b9). See OpenNMT/OpenNMT-py#189. Below is my analysis.

I think the minibatch.reverse() fix is not correct. In my case, the iterator uses the pool() method to produce the batches it yields. The pool() method shuffles the examples within each minibatch, so when we later call minibatch.reverse() in Iterator.__iter__, the batch still does not satisfy the decreasing-size requirement, and the call to rnn.pack_padded_sequence(embedding, lengths) crashes.

I think the fix should be to explicitly sort the minibatch by decreasing size before the Batch constructor. I previously (wrongly) fixed it at the call site of my rnn.pack_padded_sequence(embedding, lengths) call, but I don't think that is the right place: this is a detail that should be hidden inside the Iterator.

I just tracked the problem down to the torchtext code and found this PR. Hope I didn't misunderstand the code.
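
A sketch of the explicit-sort fix being suggested here, assuming a minibatch whose order has been scrambled by pool() and a length-returning sort key (illustrative names, not a patch from this thread):

    # After pool() has shuffled the examples, reversing alone is not enough;
    # re-sort by decreasing size right before constructing the Batch.
    minibatch = ["ab", "abcd", "a", "abc"]   # order after a hypothetical shuffle
    sort_key = len                           # stand-in for self.sort_key

    minibatch = sorted(minibatch, key=sort_key, reverse=True)  # longest first
    print(minibatch)                         # ['abcd', 'abc', 'ab', 'a']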

@kyteague commented Sep 19, 2017

I don't think this change is intuitive, as it goes against what is expected when using a sort key in Python.

This also breaks a popular downstream library: https://github.com/IBM/pytorch-seq2seq
I made an issue about it there as well: IBM/pytorch-seq2seq#77
