
Where is Expectation #5

Open
zetyquickly opened this issue Jun 28, 2018 · 5 comments

@zetyquickly

zetyquickly commented Jun 28, 2018

[Screenshot of the negative-sampling objective from Mikolov et al., $\log \sigma({v'_{w_O}}^\top v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\big[\log \sigma(-{v'_{w_i}}^\top v_{w_I})\big]$]

In this formula we have an expectation over $w_i$. That means for each pair $(w_I, w_O)$ we should estimate this expectation. But as far as I can see in your code, you sample n_negs negative samples for each pair $(w_I, w_O)$. Wouldn't it be more correct to sample $N$ draws of $w_i$ per negative term, obtain an empirical mean of the expression in square brackets, and only after that accumulate the n_negs means?
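
For reference, what I have in mind is a Monte Carlo estimate of that term (the number of draws $N$ is my own notation, not something from the repo):

$\mathbb{E}_{w_i \sim P_n(w)}\big[\log \sigma(-{v'_{w_i}}^\top v_{w_I})\big] \approx \frac{1}{N} \sum_{j=1}^{N} \log \sigma(-{v'_{w_{i,j}}}^\top v_{w_I}), \qquad w_{i,j} \sim P_n(w)$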

@theeluwin
Owner

That would give a more precise loss value; however, I'm afraid applying it might make training too slow. I quickly wrote some sample code for your idea.
With an appropriate self.n_samples on the model, change the lines below https://github.com/theeluwin/pytorch-sgns/blob/master/model.py#L66:

if self.weights is not None:
    # draw n_negs * n_samples noise words per context position from the noise distribution
    nwords = t.multinomial(self.weights, batch_size * context_size * self.n_negs * self.n_samples, replacement=True).view(batch_size, -1)
else:
    # fall back to uniform sampling over the vocabulary
    nwords = FT(batch_size, context_size * self.n_negs * self.n_samples).uniform_(0, self.vocab_size - 1).long()
ivectors = self.embedding.forward_i(iword).unsqueeze(2)
ovectors = self.embedding.forward_o(owords)
nvectors = self.embedding.forward_o(nwords).neg()
oloss = t.bmm(ovectors, ivectors).squeeze().sigmoid().log().mean(1)
# mean(3): empirical mean over the n_samples draws (the expectation); sum(2): sum over the n_negs negatives; mean(1): average over context positions
nloss = t.bmm(nvectors, ivectors).squeeze().sigmoid().log().view(-1, context_size, self.n_negs, self.n_samples).mean(3).sum(2).mean(1)
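
Just to make the shape bookkeeping explicit, here is a standalone sketch with random embeddings and made-up sizes (not the actual module, just a sanity check of the reshaping above):

import torch as t

batch_size, context_size, n_negs, n_samples, dim, vocab_size = 4, 5, 20, 3, 100, 1000

iword = t.randint(vocab_size, (batch_size,))
owords = t.randint(vocab_size, (batch_size, context_size))
nwords = t.randint(vocab_size, (batch_size, context_size * n_negs * n_samples))

emb_i = t.nn.Embedding(vocab_size, dim)  # stands in for embedding.forward_i
emb_o = t.nn.Embedding(vocab_size, dim)  # stands in for embedding.forward_o

ivectors = emb_i(iword).unsqueeze(2)   # (batch, dim, 1)
ovectors = emb_o(owords)               # (batch, context, dim)
nvectors = emb_o(nwords).neg()         # (batch, context * n_negs * n_samples, dim)

oloss = t.bmm(ovectors, ivectors).squeeze(2).sigmoid().log().mean(1)
nloss = (t.bmm(nvectors, ivectors).squeeze(2).sigmoid().log()
         .view(-1, context_size, n_negs, n_samples)
         .mean(3)   # empirical mean over the noise draws (the expectation)
         .sum(2)    # sum over the negatives
         .mean(1))  # average over context positions
print(oloss.shape, nloss.shape)  # both torch.Size([4])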

@zetyquickly
Author

I completely agree with you that it will make training slower. My colleague and I concluded that your approach also leads to convergence. But I wish there were some notes about this formula and about the implementation in the README or in the code. Thank you.

@theeluwin
Owner

If you have a good idea for the implementation, please go ahead and open a PR. Otherwise, I'll close this issue. Thank you.

@msummerfield

Thanks for publishing this repo. I have found it very useful in improving the performance of my own word2vec implementation.

On this particular issue, I note that mean(), sum(), log(), and sigmoid() are all continuous and monotonically-increasing functions of their inputs. Thus, barring any issues with numerical stability, minimising:

-(t.bmm(ovectors, ivectors).squeeze().sigmoid().log().mean(1) + t.bmm(nvectors, ivectors).squeeze().sigmoid().log().view(-1, context_size, self.n_negs).sum(2).mean(1))

is equivalent to minimising:

-(t.bmm(ovectors, ivectors).mean() + t.bmm(nvectors, ivectors).mean())

Given that all of the vectors start out 'small', and do not become excessively large in the course of training, numerical stability does not seem to be an issue. (Indeed, if there were some problem with stability I suspect it might arise anyway, since each function is computed successively, rather than using some numerically-stable compound implementation in the manner of nn.NLLLoss().) So, even though the loss function is 'wrong', it has the same argmin as the 'correct' loss function, which is all we really care about.

The same argument should apply to the 'improved' computation. Ultimately, the order of application of mean() and sum() operations makes no difference to the location of the minimum of the loss function. So, in terms of the optimisation, all you are doing is increasing the number of nwords. But so long as you have 'enough' negative samples, you should be fine - as Mikolov et al. say in their paper, 'we are free to simplify NCE as long as the vector representations retain their quality.'

I have tried the above simplification in my own implementation. It seems to work, in the sense that it converges and produces a sensible-looking result, although I have not done any testing to check that it produces the same embeddings (all else being equal). However, the speed-up is hardly worth it - about 5% for small vectors, e.g. around 25 elements, but with much smaller relative benefits for larger vectors, since the matrix multiplications dominate the computation time. The advantage of retaining the log() and sigmoid() functions is that the magnitude of the loss function is about the same, regardless of the parameters of the model, rather than being, e.g., roughly proportional to the vector size.
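
A rough, standalone way to reproduce this kind of timing comparison (random tensors, arbitrary sizes; it only times the loss computation, not full training) would be:

import time
import torch as t

B, C, K, D = 256, 5, 20, 25  # batch, context size, negatives, embedding dimension
ivectors = t.randn(B, D, 1)
ovectors = t.randn(B, C, D)
nvectors = t.randn(B, C * K, D)

def full_loss():
    # the full loss: log-sigmoid of the dot products
    oloss = t.bmm(ovectors, ivectors).squeeze(2).sigmoid().log().mean(1)
    nloss = t.bmm(nvectors, ivectors).squeeze(2).sigmoid().log().view(-1, C, K).sum(2).mean(1)
    return -(oloss + nloss).mean()

def simple_loss():
    # the simplified loss: raw dot products only
    return -(t.bmm(ovectors, ivectors).mean() + t.bmm(nvectors, ivectors).mean())

for fn in (full_loss, simple_loss):
    start = time.time()
    for _ in range(1000):
        fn()
    print(fn.__name__, time.time() - start)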

Incidentally, as far as I can tell from the original word2vec code (https://github.com/dav/word2vec/blob/9b8b58001ba5d58babe1c62327a8501b62cd6179/src/word2vec.c#L529) they use a fixed number of negative samples (just five by default), and it looks like they compute the sigmoid() function (by table lookup), but not the log().
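
For reference, a rough Python transcription of that sigmoid lookup table (EXP_TABLE_SIZE = 1000 and MAX_EXP = 6 are the defaults in word2vec.c) would look like:

import numpy as np

EXP_TABLE_SIZE, MAX_EXP = 1000, 6

# precompute sigmoid on a grid covering [-MAX_EXP, MAX_EXP), as word2vec.c does at startup
grid = (np.arange(EXP_TABLE_SIZE) / EXP_TABLE_SIZE * 2 - 1) * MAX_EXP
exp_table = np.exp(grid) / (np.exp(grid) + 1)

def fast_sigmoid(f):
    # dot products outside the table range are clamped to 0 or 1, as in the original C code
    if f >= MAX_EXP:
        return 1.0
    if f <= -MAX_EXP:
        return 0.0
    return exp_table[int((f + MAX_EXP) * (EXP_TABLE_SIZE / (2 * MAX_EXP)))]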

@theeluwin
Owner

@msummerfield Thanks for the detailed feedback! Awesome.
The idea of using the 'faster' loss looks meaningful. The main reason I kept all the details is that the overall loss stays mathematically interpretable (e.g., a loss of 1 means the predicted probability is $e^{-1} \approx 36.8\%$), but yes, the long chain of calculations might suffer from numerical issues, since I've never paid attention to those.
Again, thanks for the awesome feedback. It inspired me in many ways, and I hope you post it somewhere else (like a blog) too.
