LazyEmbedding, an embedding layer with a dynamically sized vocabulary #55981
Comments
Hey @PetrochukM, thanks for the cool suggestion!
My initial thoughts are that …

Given that there are multiple valid answers for each of the above, I think it's easier to maintain only `nn.Embedding`. That said, I'm open to being convinced, especially if there's a lot of interest for something like this :)
Thanks for considering my suggestion. Here are my thoughts:
Re: …

With regards to this, I'm happy to include an implementation in an open-source library. The issue is that it's difficult to implement a `LazyEmbedding` efficiently. Let me know if you think this is possible to implement efficiently without PyTorch support... I posted this feature request because of this perception. Either way, I'll definitely open-source an inefficient implementation because it'll make NLP so much easier. I have an NLP student and …
(I pinged a couple of NLP chats to see if there is interest in something like this!)
Should we discuss whether this is a better fit for torchtext?
@jspisak I'd be happy to discuss. I don't think an efficient implementation is doable without making modifications to PyTorch itself. So, I don't think `torchtext` alone can host the efficient version. Even so, we could implement a version of this in `torchtext`.
Love the idea! Have been using the …
+1 with this idea. @jbschlosser reminded me of this issue when I asked for the same embedding: https://gist.github.com/wangkuiyi/dd2e3794d11010f0cd562ed009664f90. Slightly different from @PetrochukM's sample, the above gist doesn't allocate the underlying dense embedding table beforehand. This fully lazy and dynamic embedding is useful not only for NLP but also for recommendation and ad systems. TensorFlow had a distributed version of this dynamic embedding feature for recommendation systems in 2020 (https://github.com/tensorflow/community/blob/a4542d13baaa64e81ec4689719fbe30abe89aee0/rfcs/20200424-sparse-domain-isolation.md), and Baidu's Paddle has had this for advertising systems since 2018 (https://github.com/PaddlePaddle/Paddle/projects/56).
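For illustration, a minimal sketch of that fully lazy flavor (this is not the gist's code; the class name and details are made up): rows live in an `nn.ParameterDict` and are allocated on first access, so no dense table exists up front.

```python
import torch
from torch import nn

class DictEmbedding(nn.Module):
    """Fully lazy sketch: no dense table is allocated beforehand; a row
    is created the first time its token id is seen."""

    def __init__(self, embedding_dim: int):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.rows = nn.ParameterDict()  # keys must be strings

    def forward(self, indices: torch.Tensor) -> torch.Tensor:
        vectors = []
        for idx in indices.tolist():
            key = str(idx)
            if key not in self.rows:
                # Allocate and initialize a new row on first access.
                self.rows[key] = nn.Parameter(torch.randn(self.embedding_dim))
            vectors.append(self.rows[key])
        return torch.stack(vectors)
```

One catch, picked up further down the thread: parameters created inside `forward` are unknown to any optimizer constructed earlier.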
I'm wondering what people think about the following design proposal: …

This is using the same kernels as `nn.Embedding`, but it doubles the underlying embedding table whenever it runs out of space for a given index and stores a table that maps a token index to the actual underlying data. It's also different in that it accepts indices and returns Tensors, so it's actually a full-on replacement for `nn.Embedding`. To actually make this performant, the map from given index to physical index will need to be written efficiently, and we might want to reorder the embedding table based on the observed frequency of tokens on a call to `eval()` to fully match the inference performance of `nn.Embedding`. Other costs, such as doubling and reallocating the underlying data, initializing an entry that isn't available yet (repeated calls to `torch.nn.init.normal_`), and most importantly cache locality, are potentially minimal, assuming the input follows a distribution such as Zipf's law and the program runs long enough. In particular, we might want to run some very precise benchmarks around cache locality.
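A rough, unoptimized sketch of that design (a plain Python dict stands in for the efficient index map the comment calls for, and all names here are illustrative):

```python
import torch
from torch import nn
import torch.nn.functional as F

class DoublingEmbedding(nn.Module):
    """Sketch of the proposal above: a dense table that doubles when it
    runs out of space, plus a map from token index to physical row."""

    def __init__(self, embedding_dim: int, initial_capacity: int = 8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(initial_capacity, embedding_dim))
        self.index_map = {}  # token index -> physical row in self.weight

    def _physical(self, token: int) -> int:
        if token not in self.index_map:
            if len(self.index_map) == self.weight.shape[0]:
                # Out of space: double the table, copying existing rows over.
                capacity, dim = self.weight.shape
                grown = torch.randn(2 * capacity, dim)
                grown[:capacity] = self.weight.data
                # NB: replacing the Parameter invalidates existing optimizer
                # references (see the optimizer discussion below).
                self.weight = nn.Parameter(grown)
            self.index_map[token] = len(self.index_map)
        return self.index_map[token]

    def forward(self, indices: torch.Tensor) -> torch.Tensor:
        # Remap token indices to physical rows, then use the same lookup
        # as nn.Embedding. Assumes a 1-D tensor of indices for brevity.
        physical = torch.tensor(
            [self._physical(t) for t in indices.tolist()],
            device=self.weight.device,
        )
        return F.embedding(physical, self.weight)
```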
I think overall I really like the idea. This is similar in spirit to std::vector, which grows dynamically as we push more objects into the structure. One question I have, though, is on the forward API: why not support a list of strings directly? It seems `LazyEmbedding` would need to depend on some external structure to provide token indices. That external structure also needs to support dynamic (lazy) semantics. It may be an overhead in terms of workflow and checkpointing compared to the alternative where `LazyEmbedding` encapsulates both the vocabulary and the embeddings together?
The idea is to use a Vocab to map from string to int and then LazyEmbedding to map from int to a vector (same as nn.Embedding now). |
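Under that split, usage might look like the following sketch, with a plain dict standing in for the vocab object and the `DoublingEmbedding` sketch above playing the role of `LazyEmbedding`:

```python
import torch

vocab = {}  # stands in for torchtext.vocab / torchnlp.encoders

def encode(tokens):
    # string -> int: assign the next free integer id to each unseen token.
    return torch.tensor([vocab.setdefault(t, len(vocab)) for t in tokens])

# int -> vector, via the DoublingEmbedding sketch from the earlier comment.
embedding = DoublingEmbedding(embedding_dim=4)
vectors = embedding(encode(["the", "cat", "sat", "the"]))
print(vectors.shape)  # torch.Size([4, 4])
```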
Good point. AFAIK optimizers can't handle the parameters they're optimizing being resized; that would either result in new parameters the optimizer doesn't know about or mess with the internal optimizer state (more info here). An alternative I haven't fully thought through might be to maintain the table across multiple Parameters, calling … as each new Parameter is created. Not sure about DDP either...
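One real API the multi-Parameter idea could lean on is `Optimizer.add_param_group`, which registers new parameters with a live optimizer without touching the state of existing ones. A minimal sketch (the setup and growth trigger are illustrative):

```python
import torch
from torch import nn

# Existing setup: one Parameter chunk, one optimizer.
table = nn.Parameter(torch.randn(8, 4))
optimizer = torch.optim.SGD([table], lr=0.1)

# "Grow" by allocating a second chunk instead of resizing the first,
# so the optimizer's state for `table` stays valid...
extra_rows = nn.Parameter(torch.randn(8, 4))

# ...and register the new chunk with the existing optimizer.
optimizer.add_param_group({"params": [extra_rows]})
```

Lookups would then have to route each logical index to the right chunk, which is where this gets fiddly.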
There are a couple of strategies I can think of: …

I wonder how Baidu or TF handled this problem! And I'm not an expert in PyTorch's distributed toolchain, so I hope someone has better ideas!
@cpuhrsch I'm a big fan of A LOT of your ideas! Thanks for providing concrete and actionable ideas around design and performance!
🚀 Feature
It'd be AMAZING to have a lazy embedding layer that grows to accommodate new tokens. So, for example, the interface would look like this:
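(A hypothetical sketch of such an interface; the argument name is a guess, and the `LazyEmbedding` class itself is defined in the implementation sketch below.)

```python
import torch

# No num_embeddings argument: the vocabulary size is unknown up front.
embedding = LazyEmbedding(embedding_dim=16)

# Any integer indices are accepted; the table grows to accommodate them.
out = embedding(torch.tensor([0, 1, 2]))
out = embedding(torch.tensor([1000]))  # no IndexError; the table grew
print(out.shape)  # torch.Size([1, 16])
```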
And, for example, here is a basic, inefficient implementation:
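(A sketch with that flavor, not the author's original code: correct but slow, since every growth step reallocates the whole Parameter.)

```python
import torch
from torch import nn
import torch.nn.functional as F

class LazyEmbedding(nn.Module):
    """Naive sketch: grow the dense table whenever an index beyond the
    current capacity is requested."""

    def __init__(self, embedding_dim: int):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.weight = nn.Parameter(torch.empty(0, embedding_dim))

    def forward(self, indices: torch.Tensor) -> torch.Tensor:
        needed = int(indices.max()) + 1
        if needed > self.weight.shape[0]:
            # Initialize rows for every index up to the largest one seen.
            new_rows = torch.randn(needed - self.weight.shape[0], self.embedding_dim)
            # Reallocating the Parameter also invalidates any optimizer
            # references to the old one (see the discussion above).
            self.weight = nn.Parameter(torch.cat([self.weight.data, new_rows]))
        return F.embedding(indices, self.weight)
```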
There is also a bit of work that can be done with modularity. For example, PyTorch could be responsible for providing a `LazyEmbedding` that supports only `torch.Tensor`. `torchtext` could provide a wrapper that adds the vocabulary.

Motivation
This feature has been on my mind a lot because it'd dramatically simplify my NLP training pipelines.
Without a `LazyEmbedding`, typically, I need to use `torchnlp.encoders` or `torchtext.vocab`. I'll need to initialize these objects by looping through my entire dataset in order to determine all the tokens that I might need. Afterward, I'll need to use the vocabulary with my `DataLoader` in order to encode training examples. Lastly, I need to store this object in the related checkpoints, so that other people can use the same encoding.

With a module like this, I wouldn't need to use `torchnlp.encoders` and I wouldn't need to use `torchtext.vocab`. Also, we could: …

Furthermore, a `LazyEmbedding` could also be extended with a tokenizer, padding tokens, eos tokens, unknown tokens, pre-trained word vectors, etc.

cc @albanD @mruberry @jbschlosser