How to use tf.nn.dropout to implement embedding dropout #14746
@zotroneneis did you figure out how to do it?
Yes, I did.
@zotroneneis can you show a code example?
Sure! You just have to drop random rows of the embedding matrix (i.e. set them to zero). This corresponds to dropping certain words in the input sequence.
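For instance, a minimal sketch of that idea (my illustration, not the exact code from this thread; `vocab_size`, `embed_dim`, the keep probability of 0.8, and `input_ids` are all placeholder assumptions):

```python
import tensorflow as tf

vocab_size, embed_dim = 10000, 300
embedding = tf.get_variable("embedding", [vocab_size, embed_dim])

# One keep/drop draw per vocabulary row, broadcast over embed_dim,
# so a dropped row (word) becomes an all-zero vector.
keep = tf.cast(tf.random_uniform([vocab_size, 1]) < 0.8, tf.float32)
masked_embedding = embedding * keep

input_ids = tf.placeholder(tf.int32, [None, None])  # [batch, time]
embedded = tf.nn.embedding_lookup(masked_embedding, input_ids)
```

Note that, unlike `tf.nn.dropout`, this plain mask does not rescale the kept rows by `1/keep_prob`.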
@zotroneneis But this drops the same words for the entire batch -- this might make learning noisier, correct?
Is it supposed to be like this?

```python
with tf.name_scope('input'):
    with tf.name_scope('word_dropout'):
        # presumably: drop units of the looked-up word vectors
        # (embedded = output of the embedding lookup)
        embedded = tf.nn.dropout(embedded, keep_prob)
```
@makai281 No, this would be dropout of the embedding vector (which is also a good idea, but not what we referred to here). We want to drop out inputs (= word types) efficiently.
Directly dropping rows of the whole embedding matrix is inefficient for a large vocabulary, since it requires sampling a dropout mask over every vocabulary entry even though only a small fraction of them shows up in any given batch.
The biggest problem I see with this is that typically only a tiny fraction of the overall vocabulary occurs in a given batch. Dropping a uniformly random (instead of word-frequency-informed) 20% of the vocabulary will most of the time do nothing. To have a non-negligible effect, one would need to increase the dropout to very high values (e.g. 70%), at which point the danger is that some sentences (or batches, since the same words are dropped for the entire batch) will be dropped almost entirely. So my impression is that training will become noisy. The solution is to drop existing word IDs, preferably separately for each sample in the batch.
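A minimal sketch of that per-sample variant (my own illustration, not code from this thread; it drops individual word tokens by mapping them to a designated `unk_id`, assuming `inputs` is an int32 tensor of word IDs):

```python
import tensorflow as tf

def word_id_dropout(inputs, keep_prob, unk_id=0):
    # inputs: int32 [batch, time] word IDs. Each position draws its own
    # keep/drop decision, so every sample in the batch gets its own mask
    # and only words that actually occur in the batch can be affected.
    mask = tf.random_uniform(tf.shape(inputs)) < keep_prob
    return tf.where(mask, inputs, tf.fill(tf.shape(inputs), unk_id))
```

Mapping dropped IDs to `unk_id` only zeroes the word out if that row of the embedding matrix is a dedicated all-zero placeholder; alternatively, one could mask the looked-up vectors directly.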
Recent papers in language modeling use a specific form of embedding dropout that was proposed in this paper. The paper also proposed variational recurrent dropout, which was already discussed in this issue.
In embedding dropout, the same dropout mask is used at each timestep and entire words are dropped (i.e. the whole word vector of a word is set to zero). This behavior can be achieved by providing a `noise_shape` to `tf.nn.dropout`. In addition, the same words are dropped throughout a sequence:

> "Since we repeat the same mask at each time step, we drop the same words throughout the sequence – i.e. we drop word types at random rather than word tokens (as an example, the sentence “the dog and the cat” might become “— dog and — cat” or “the — and the cat”, but never “— dog and the cat”)."
I couldn't find a way to implement this functionality of embedding dropout efficiently. Are there any plans to incorporate these advances?
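For reference, the `noise_shape` idea can be sketched as below, applying dropout to the embedding matrix itself before the lookup, which is exactly the vocabulary-sized operation whose cost is discussed above (names and sizes are illustrative, not from this thread):

```python
import tensorflow as tf

vocab_size, embed_dim, keep_prob = 10000, 300, 0.9
embedding = tf.get_variable("embedding", [vocab_size, embed_dim])

# noise_shape=[vocab_size, 1] samples one Bernoulli variable per word
# type and broadcasts it across the embedding dimension: a dropped word
# is zeroed everywhere it occurs, in every sequence of the batch.
dropped_embedding = tf.nn.dropout(embedding, keep_prob,
                                  noise_shape=[vocab_size, 1])

input_ids = tf.placeholder(tf.int32, [None, None])  # [batch, time]
embedded = tf.nn.embedding_lookup(dropped_embedding, input_ids)
```

Because the mask is drawn once per dropout call, this drops the same word types for the whole batch, which is the per-batch behavior questioned earlier in the thread.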