
Questions regarding the code #2

Closed
dimeldo opened this issue Oct 21, 2019 · 4 comments

Comments

@dimeldo

dimeldo commented Oct 21, 2019

  1. What does this line check?

if sum([len(item) for item in batch[0][1]]) > 1024:

  2. What is the maximum number of turns a dialogue can have? Or is it set by the maximum length a dialogue can have? If so, where is it specified? I saw a few constants that could be contenders for that:

train_data = [data[idx] for idx in indices[100:]]
val_data = [data[idx] for idx in indices[:100]]
self.tokenizer.max_len = 1500
        # tokenizer weird behavior
@qywu
Owner

qywu commented Oct 22, 2019

  1. Since GPT2 only supports sequences shorter than 1024 tokens, this line skips dialogs longer than that. Depending on your dataset, though, you can use a fixed-size window to extract dialogs instead.

  2. Yes, it is set by the maximum length. If your dialog is longer than that, you can use a window to sample from it (see the sketch below).
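Not the repo's exact code, just a minimal sketch of the two options above (the helper names are made up): skipping dialogs that exceed GPT-2's 1024-token limit versus splitting them into fixed-size windows of whole turns.

MAX_LEN = 1024  # GPT-2's maximum sequence length

def total_tokens(dialog):
    # dialog: a list of turns, each turn a list of token ids
    return sum(len(turn) for turn in dialog)

def skip_long(dialogs):
    # Option 1: drop any dialog that would exceed the model's limit,
    # mirroring the `sum(...) > 1024` check asked about above.
    return [d for d in dialogs if total_tokens(d) <= MAX_LEN]

def window(dialog, max_len=MAX_LEN):
    # Option 2: split a long dialog into chunks of whole turns, each chunk
    # at most max_len tokens, instead of discarding the dialog entirely.
    # (A single turn longer than max_len is still kept whole here.)
    chunks, chunk, chunk_len = [], [], 0
    for turn in dialog:
        if chunk and chunk_len + len(turn) > max_len:
            chunks.append(chunk)
            chunk, chunk_len = [], 0
        chunk.append(turn)
        chunk_len += len(turn)
    if chunk:
        chunks.append(chunk)
    return chunks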

@dimeldo
Author

dimeldo commented Oct 22, 2019

Thanks for answering.

  1. Does it check whether there's a dialogue longer than 1024 tokens in the batch and, if so, skip the entire batch? I suspected this because my GPU isn't fully utilized even with a batch size of 128. Or does it skip only the offending dialogue? Or does it sum the lengths of all dialogues in the batch and skip them all if the total exceeds 1024?

  2. What do the 100 and 1500 constants represent, then?

Please, excuse my lack of understanding. Thanks again.

@qywu
Owner

qywu commented Oct 22, 2019

The batch size is currently forced to 1, which probably consumes all the memory of an 11GB GPU. However, I have included batch support in the actual implementation; you can check out camrest and multiwoz. If a dialogue is longer than 1024 tokens, it is skipped (batch size = 1).

The 100 is the dataset split: the first 100 shuffled dialogues are used for validation. The 1500 is the maximum length of a dialogue; it is only used by the tokenizer when tokenizing your dialogues.
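To make the two constants concrete, here is a small sketch based on the snippets quoted in the question (the function name and the fixed seed are mine, not the repo's):

import random

def split_data(data, n_val=100, seed=0):
    # data: the full list of dialogues; the first n_val shuffled indices
    # become the validation set, the rest the training set.
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    val_data = [data[idx] for idx in indices[:n_val]]
    train_data = [data[idx] for idx in indices[n_val:]]
    return train_data, val_data

# 1500 is only a tokenizer-side maximum so long dialogues can be tokenized in full;
# the model itself still accepts at most 1024 tokens per sequence.
# self.tokenizer.max_len = 1500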

@dimeldo
Author

dimeldo commented Oct 23, 2019

Okay, thanks a lot!

@dimeldo dimeldo closed this as completed Oct 23, 2019