🚀 Feature request
Task heads that backpropagate deliberately reversed gradients to the encoder, plus a flag that requests this behavior when a task head is constructed.
Motivation
Transfer learning experiments lend themselves to questions about the extent to which two tasks rely on the same information about a word/sentence, and to experiments probing whether and how word encodings contain/correspond to syntax trees, lemmas, frequencies, and other objects of linguistic/psycholinguistic study.
A difficulty is that a pretrained model, without fine-tuning, may already encode certain information too thoroughly and accessibly for intermediate training to make much of a difference. For example, BERT's masked language modeling objective produces word encodings in which syntax information is readily accessible. Intermediate training on a syntax task requires training a task head to extract this information, of course, but it will result in very little reorganization of the encoder itself.
Adversarial training, such as the amnesic probing of Elazar et al. 2020, can avoid this pitfall. Intermediate training can aim to burn particular information out of the encodings, and then measure how much this impairs trainability of the target task. Simply reversing the sense of the training data won't do it, though: getting all the answers exactly wrong requires just as much domain knowledge as getting them all right. And randomizing the labels on the training data may just produce a feckless task head, one that discards useful information passed to it from the encoder, rather than changing the encoder itself.
Ideally, then, the task head would be trained to reproduce the gold-standard labels correctly, but would flip all its gradients before backpropagating them to the shared encoder. The encoder would thus be trained to stop producing exactly those signals that the task head found most informative. The following work by Cory Shain illustrates flipping gradients in this way (although it's not applied to shared-encoder transfer learning, but rather to development of encoders that disentangle semantics from syntax).
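To make the request concrete, here is a minimal PyTorch sketch of the kind of gradient flipping described above. It is only an illustration under assumptions, not code from this library or from the work mentioned: the names `GradReverse`, `AdversarialHead`, `reverse_gradients`, and `lambd` are all hypothetical, and a plain linear classifier stands in for a real task head.

```python
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambd in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One return value per forward() input; lambd itself gets no gradient.
        return -ctx.lambd * grad_output, None


class AdversarialHead(nn.Module):
    """Hypothetical task head; `reverse_gradients` is the flag proposed in this issue."""

    def __init__(self, hidden_dim, num_labels, reverse_gradients=False, lambd=1.0):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_labels)
        self.reverse_gradients = reverse_gradients
        self.lambd = lambd  # assumed scaling factor for the reversed gradient

    def forward(self, encodings):
        if self.reverse_gradients:
            # The head still sees the unmodified encodings, so its own weights
            # are trained toward the gold labels as usual. Only the gradient
            # flowing back into the encoder has its sign flipped.
            encodings = GradReverse.apply(encodings, self.lambd)
        return self.classifier(encodings)


def gradients(encoder, head, inputs, labels, reverse):
    """Return (encoder weight grad, head weight grad) for one backward pass."""
    head.reverse_gradients = reverse
    encoder.zero_grad()
    head.zero_grad()
    loss = nn.functional.cross_entropy(head(encoder(inputs)), labels)
    loss.backward()
    return encoder.weight.grad.clone(), head.classifier.weight.grad.clone()
```

With a toy `nn.Linear` standing in for the shared encoder, calling `gradients(...)` once with `reverse=False` and once with `reverse=True` should give the same gradient for the head's weights and a sign-flipped gradient for the encoder's weights, which is the behavior the feature request asks for.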
Your contribution
I am deeply unfamiliar with PyTorch, unfortunately, and utterly ignorant of TensorFlow. I can't offer much.