
Questions regarding the code #2

Closed
dimeldo opened this issue Oct 21, 2019 · 4 comments

Comments

@dimeldo

dimeldo commented Oct 21, 2019

  1. What does this line check?

if sum([len(item) for item in batch[0][1]]) > 1024:

  2. What is the maximum number of turns a dialogue can have? Or is it set by the maximum length a dialogue can have? If so, where is it specified? I saw a few constants that could be contenders for that:

train_data = [data[idx] for idx in indices[100:]]
val_data = [data[idx] for idx in indices[:100]]
self.tokenizer.max_len = 1500
        # tokenizer weird behavior
@qywu
Owner

qywu commented Oct 22, 2019

  1. Since GPT2 only supports sequences shorter than 1024 tokens, this line skips dialogs longer than that. Depending on your dataset, though, you can use a fixed-size window to extract dialogs instead.

  2. Yes, it is set by the maximum length. If your dialog is longer than that, you can use a window to sample from it (see the sketch below).
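Not the repo's exact code, just a minimal sketch of the two options above (the helper names are made up): skipping dialogs that exceed GPT-2's 1024-token limit versus splitting them into fixed-size windows of whole turns.

MAX_LEN = 1024  # GPT-2's maximum sequence length

def total_tokens(dialog):
    # dialog: a list of turns, each turn a list of token ids
    return sum(len(turn) for turn in dialog)

def skip_long(dialogs):
    # Option 1: drop any dialog that would exceed the model's limit,
    # mirroring the `sum(...) > 1024` check asked about above.
    return [d for d in dialogs if total_tokens(d) <= MAX_LEN]

def window(dialog, max_len=MAX_LEN):
    # Option 2: split a long dialog into chunks of whole turns, each chunk
    # at most max_len tokens, instead of discarding the dialog entirely.
    # (A single turn longer than max_len is still kept whole here.)
    chunks, chunk, chunk_len = [], [], 0
    for turn in dialog:
        if chunk and chunk_len + len(turn) > max_len:
            chunks.append(chunk)
            chunk, chunk_len = [], 0
        chunk.append(turn)
        chunk_len += len(turn)
    if chunk:
        chunks.append(chunk)
    return chunks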

@dimeldo
Author

dimeldo commented Oct 22, 2019

Thanks for answering.

  1. Does it check whether there's a dialogue longer than 1024 tokens in the batch and, if so, skip the entire batch? I suspected this because my GPU isn't fully utilized even with a batch size of 128. Or does it skip only the offending dialogue? Or does it sum the lengths of all dialogues in the batch and skip them all if the total exceeds 1024?

  2. What do the 100 and 1500 constants represent, then?

Please, excuse my lack of understanding. Thanks again.

@qywu
Owner

qywu commented Oct 22, 2019

The batch size is currently forced to 1, which probably consumes all the memory of an 11GB GPU. However, I have included batch support in the actual implementation; you can check out camrest and multiwoz. If a dialogue is longer than 1024 tokens, it is skipped (batch size = 1).

The 100 is the dataset split: the first 100 shuffled dialogues are used for validation. The 1500 is the maximum length of a dialogue; it is only used by the tokenizer when tokenizing your dialogues.
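To make the two constants concrete, here is a small sketch based on the snippets quoted in the question (the function name and the fixed seed are mine, not the repo's):

import random

def split_data(data, n_val=100, seed=0):
    # data: the full list of dialogues; the first n_val shuffled indices
    # become the validation set, the rest the training set.
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    val_data = [data[idx] for idx in indices[:n_val]]
    train_data = [data[idx] for idx in indices[n_val:]]
    return train_data, val_data

# 1500 is only a tokenizer-side maximum so long dialogues can be tokenized in full;
# the model itself still accepts at most 1024 tokens per sequence.
# self.tokenizer.max_len = 1500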

@dimeldo
Author

dimeldo commented Oct 23, 2019

Okay, thanks a lot!

@dimeldo dimeldo closed this as completed Oct 23, 2019