Questions about Results on Question Answering, Table 3 #8

Open
haichao592 opened this issue Nov 18, 2021 · 5 comments

@haichao592

In Lester et al. (2021), T5 is used as the pre-trained model and the LM head generates the answers.
For models like BERT and RoBERTa, which are explored in this work, the LM head cannot be used to extract context spans as answers, so a linear QA head is essential.
Is the task-specific linear head fine-tuned together with the prompt embeddings in PT, Table 3?
If so, this implementation differs slightly from the original one.
If not, the randomly initialized QA head cannot be expected to produce meaningful outputs and would hinder PT training, which would make the PT results in Table 3 meaningless.

Or do I misunderstand how the LM head is used in QA tasks?
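
For concreteness, the task-specific head I have in mind is just a linear layer over the encoder hidden states that predicts start/end span positions; a minimal sketch (the class name and `hidden_size` are only illustrative, not from the repo):

```python
from torch import nn

class SpanQAHead(nn.Module):
    """Minimal extractive-QA head: per-token start/end logits over the context."""

    def __init__(self, hidden_size):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, sequence_output):
        # sequence_output: [batch, seq_len, hidden_size] from the encoder
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
```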

@Xiao9905
Member

The task-specific linear head is fine-tuned together with the prompt embeddings.
A comparison between using the LM head and the task-specific linear head is provided in our experiments (Table 5), which shows that in a data-rich setting an LM head does not perform better than a co-tuned task-specific head.

@haichao592
Author

> The task-specific linear head is fine-tuned together with the prompt embeddings. A comparison between using the LM head and the task-specific linear head is provided in our experiments (Table 5), which shows that in a data-rich setting an LM head does not perform better than a co-tuned task-specific head.

Thanks!
Firstly, my concern is only about extractive question answering tasks, e.g., SQuAD.
As of now, I think the LM head cannot be used to produce outputs with RoBERTa on SQuAD, so Table 5 does not apply here. Is that right?

Secondly, I have tried PT with T5-base-v1.1 as in Lester et al. (2021) and with RoBERTa-base as described above (fine-tuning both the prompt embeddings, input layer only, and the task-specific QA head). The F1 scores easily exceed 80 without a careful hyperparameter search, which is quite different from the results in Table 3. Are there any other constraints that need to be met in the implementation of PT?

@Xiao9905
Member

Yes, the LM head cannot be applied to sequence tagging for now.
Your observation on PT with SQuAD is quite interesting. Have you frozen the pre-trained model's parameters? If so, could you please share your implementation with us for reference? I am also curious why our PT results on QA are so low.

@haichao592
Author

> Yes, the LM head cannot be applied to sequence tagging for now. Your observation on PT with SQuAD is quite interesting. Have you frozen the pre-trained model's parameters? If so, could you please share your implementation with us for reference? I am also curious why our PT results on QA are so low.

Yes, I am sure. Only the prompt embeddings and the QA head are added to the optimizer.

I think these short code snippets are enough, since it is easy to implement.

```python
import torch
from torch.nn import Module, Parameter, init


class PromptEmbedding(Module):
    """Learnable prompt embeddings that are prepended to the input embeddings."""

    def __init__(self, num_embeddings, embedding_dim):
        super().__init__()
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim
        self.weight = Parameter(torch.empty(num_embeddings, embedding_dim))
        self.reset_parameters()

    def reset_parameters(self, weight=None):
        # Optionally initialize from given vectors (e.g. sampled vocabulary embeddings).
        if weight is not None:
            self.weight.data = weight.clone().detach()
        else:
            init.normal_(self.weight, mean=0.0, std=1.0)

    def forward(self, input):
        # Prepend the prompt embeddings to every sequence in the batch.
        return torch.cat([self.weight.repeat(input.size(0), 1, 1), input], dim=1)
```
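
For reference, a quick shape check of the module above (the sizes are arbitrary):

```python
prompt = PromptEmbedding(num_embeddings=20, embedding_dim=768)
token_embeds = torch.randn(4, 128, 768)   # [batch, seq_len, hidden]
out = prompt(token_embeds)
print(out.shape)                          # torch.Size([4, 148, 768])
```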

In `RobertaEmbeddings`,

```python
        embeddings = inputs_embeds + token_type_embeddings
        if self.position_embedding_type == "absolute":
            position_embeddings = self.position_embeddings(position_ids)
            embeddings += position_embeddings

        # Prepend the prompt embeddings before LayerNorm and dropout.
        if hasattr(self, "prompt_embeddings"):
            embeddings = self.prompt_embeddings(embeddings)

        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings
```

For the attention mask,

```python
        # Extend the attention mask to cover the prepended prompt positions.
        if self.config.num_prompts > 0 and self.config.prompt_tuning:
            attention_mask = torch.cat(
                [
                    torch.ones(
                        (attention_mask.size(0), self.config.num_prompts),
                        device=device,
                        dtype=attention_mask.dtype,
                    ),
                    attention_mask,
                ],
                dim=1,
            )
```

And for the `RobertaForQuestionAnswering` outputs,

```python
        sequence_output = outputs[0]
        # Drop the prompt positions so the QA head sees only the original tokens.
        if self.config.num_prompts > 0:
            sequence_output = sequence_output[:, self.config.num_prompts:]

        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1).contiguous()
        end_logits = end_logits.squeeze(-1).contiguous()
```
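
The parameter selection is roughly as follows (a sketch, not my exact training script; `model` stands for the patched `RobertaForQuestionAnswering`, and the learning rate and weight decay shown are placeholders):

```python
from torch.optim import AdamW

# Freeze the whole backbone first.
for param in model.parameters():
    param.requires_grad = False

# Un-freeze only the prompt embeddings and the QA head, and collect them for the optimizer.
trainable = []
for name, param in model.named_parameters():
    if "prompt_embeddings" in name or "qa_outputs" in name:
        param.requires_grad = True
        trainable.append(param)

optimizer = AdamW(trainable, lr=1e-3, weight_decay=0.0)  # placeholder values
```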

@haichao592
Author

@Xiao9905 Hi, could you share the hyperparameters and optimizer configuration used for PT2 on SQuAD 1.1 with RoBERTa-large?
For example: learning rate, prompt length, epochs or max steps, warmup rate, weight decay, optimizer, initialization, and so on.
Thanks!
