
ProdLDA tutorial improvements #2794

Open · wants to merge 2 commits into dev

Conversation

ahoho

@ahoho ahoho commented Apr 5, 2021

A handful of changes to the ProdLDA tutorial:

  1. Calculate the NPMI of topics, a measurement of coherence, using the held-out 20 newsgroups test set. This allows us to estimate whether changes we make to the implementation yield any improvements, and accords with standard practice (NB: this change means that we use a smaller training set, since previously the model was trained on train+test). The parameters yielding the best NPMI are then used for the word clouds.
  2. Remove headers and footers from the 20ng data, which helps with topic readability.
  3. Increase the number of epochs to 200, following the original implementation. It still runs relatively quickly on CPU (<30 min), but perhaps it's worth having a flag that changes the number depending on whether CUDA is available.
  4. Correct the batch norm terms to align with the original implementation. In particular, the original does not use the equivalent of affine=False but of scale=False: the scale is fixed at 1, while a learnable bias term remains. This improves the best NPMI from 0.276 to 0.355. See below:
        self.bn = nn.BatchNorm1d(
            vocab_size, eps=0.001, momentum=0.001, affine=True
        )
        self.bn.weight.data.copy_(torch.ones(vocab_size))
        self.bn.weight.requires_grad = False
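The NPMI evaluation in change 1 can be sketched in plain Python. This is a minimal illustration, not the tutorial's actual code (the function name and the handling of degenerate pairs are my own choices): for a pair of top words, NPMI is the PMI normalized by -log of the joint probability, with probabilities estimated as document frequencies over the held-out reference corpus.

```python
# Minimal sketch of document-level NPMI topic coherence (names are
# illustrative, not from the tutorial). Probabilities are estimated
# as document frequencies over a held-out reference corpus, e.g. the
# 20 newsgroups test set mentioned above.
import math
from itertools import combinations

def npmi_coherence(topic_words, reference_docs):
    """Mean NPMI over all pairs of a topic's top words."""
    doc_sets = [set(doc) for doc in reference_docs]
    n_docs = len(doc_sets)

    def doc_freq(*words):
        # Fraction of documents containing all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w_i, w_j in combinations(topic_words, 2):
        p_i, p_j, p_ij = doc_freq(w_i), doc_freq(w_j), doc_freq(w_i, w_j)
        if p_ij == 0.0:
            scores.append(-1.0)   # words never co-occur: minimum NPMI
        elif p_ij == 1.0:
            scores.append(1.0)    # words always co-occur: maximum NPMI
        else:
            # NPMI = PMI / -log p(w_i, w_j), bounded in [-1, 1]
            scores.append(math.log(p_ij / (p_i * p_j)) / -math.log(p_ij))
    return sum(scores) / len(scores)
```

Per-topic scores would then be averaged across topics, and the hyperparameters yielding the best mean NPMI used for the word clouds, as described above.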

While I didn't include it, I'll note that thanks to @fritzo and @martinjankowiak, we needn't rely on ProdLDA's Laplace approximation and can just use distributions.Dirichlet as a drop-in replacement. I've done so in a separate pyro reimplementation of the Dirichlet-VAE, and I have no reason to suspect it wouldn't work here. Of course, this would deviate from ProdLDA.

nbviewer link

@martinjankowiak
Collaborator

thanks @ahoho ! none of us are experts in NLP so it's great to have this tutorial see some attention from an NLP person. i hope your interest in it suggests that this can be a useful starting point for doing actual NLP modeling.

some comments/suggestions:

  • can you please expand/reword this comment? "NB: here we turn off the scaling to reduce..."? be explicit that there is still a bias term.
  • can you include the definition of the NPMI acronym?
  • are the warning filters new?
  • is 1 * (docs_test > 0) equivalent to (docs_test > 0).float()? if so can we use the latter which is more explicit?
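For what it's worth, the two forms agree numerically but not in type, which can be illustrated with plain Python booleans (the claim about the corresponding PyTorch tensor dtypes in the comments is my assumption from standard type-promotion rules, not something verified against the notebook):

```python
# Plain-Python analogue of the question above: multiplying a boolean
# by 1 yields an int, while an explicit conversion yields a float.
# Values agree; types differ. In PyTorch the same likely holds
# elementwise: 1 * (x > 0) gives an integer tensor, while
# (x > 0).float() gives float32 (assumption, not verified here).
mask_int = 1 * (3 > 0)       # 1, an int
mask_float = float(3 > 0)    # 1.0, a float

assert mask_int == mask_float         # numerically equal
assert type(mask_int) is int
assert type(mask_float) is float
```

So beyond readability, the explicit `.float()` also pins down the dtype that downstream floating-point ops expect.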

@fehiepsi
Member

fehiepsi commented Apr 6, 2021

I like the idea of evaluation. I just wonder if we need to normalize beta before computing any score (if so, we might update ProdLDA.beta method to reflect this)? You'll need to rerun the notebook from the beginning to avoid any seed-related inconsistency. Currently, the order of cell execution is a bit messy.

Increase the number of epochs to 200

This is a bit unfortunate. I guess as a tutorial, we don't need to achieve state-of-the-art results at the cost of a 4x slowdown. But it is fine as long as you think the result is good. You don't need to stick with the original ProdLDA implementation, though.

By the way, it is not clear to me that the word cloud shows a better result than the current one. Maybe my judgment is out of the mainstream (given that "NPMI correlates with human judgments of topic quality"), or maybe a good metric is still an open research question?

BatchNorm1d

I don't have much opinion on this, but if you think having those additional bias parameters is good, it might be clearer to separate out the definition of the BatchNorm into a function or a class, e.g.

def BatchNorm1d(vocab_size, eps=0.001, momentum=0.001):
    bn = nn.BatchNorm1d(
        vocab_size, eps=eps, momentum=momentum, affine=True
    )
    bn.weight.data.copy_(torch.ones(vocab_size))
    bn.weight.requires_grad = False
    return bn

I also wonder if we use BatchNorm to collect running mean/variance and bias parameters, will we want to use that information in prediction/evaluation, in particular the unnormalized beta?

We have a dictionary of 8,902 unique words

This is inconsistent with the above cell.

we needn't rely on ProdLDA's Laplace approximation and can just use distributions.Dirichlet as a drop-in replacement

Yeah, could you elaborate on the following sentence in the tutorial, "(Note, however, that PyTorch/Pyro have included support for reparameterizable gradients for the Dirichlet distribution since 2018).", with links to papers/references that use Dirichlet.rsample?

@ahoho
Author

ahoho commented Apr 6, 2021

Really appreciate the feedback! I do think this is a great starting point for neural topic models (in fact, I'm surprised I didn't notice it before; pyro has proven great for experimentation with them).

I'll make the fixes you both suggested, but to respond to your questions:

are the warning filters new?

Since we turn off gradient updates for the BN scaling terms, the trace gives a warning about those variables. If there's a better way, I'll gladly update.

I just wonder if we need to normalize beta before computing any score (if so, we might update ProdLDA.beta method to reflect this)?

You do not, since the score just relies on an argsort over each row.

Increase the number of epochs to 200

I'll change this back, but make a note in the text that increasing it can help (or make it conditional on CUDA availability)

it is not clear to me that the word cloud shows a better result than the current one. [...] or a good metric is still under active research?

Yeah, this could be the result of (a) using less data (since we went from all data to just train), (b) problems with preprocessing, or (c) issues with the metric, which can be sometimes attributed to bad preprocessing, since that affects the words included in the estimates. I will play around with preprocessing to see if this helps; if memory serves I had better ProdLDA results with 20ng in the past. And yes, it's still under development---in fact, my own group is working on this now! That said, I still thought it made sense to have some sort of standard quantitative evaluation as a benchmark.

I also wonder if we use BatchNorm to collect running mean/variance and bias parameters, will we want to use that information in prediction/evaluation, in particular the unnormalized beta?

Hm, good question. In the NTM model I've usually built on in the past, the output of the BN layer is annealed to zero over training ((alpha) * beta_output + (1 - alpha) * bn_beta_output), so this hasn't come up. I guess it'd look something like this:

# in ProdLDA class
def beta(self):
    pseudo_batch = torch.eye(self.num_topics)
    beta = self.decoder.bn(self.decoder.beta(pseudo_batch))
    return beta.cpu().detach().numpy()

As a funny aside, I've seen a couple of topic modeling papers in the past two years that just pretend the Pathwise Derivatives paper doesn't exist and come up with other ways of estimating a Dirichlet-based neural model, all while it's been implemented in torch from day one.

@fehiepsi
Member

fehiepsi commented Apr 6, 2021

the score just relies on an argsort over each row

You are right. The value is not used in the metric but used in plot_word_cloud. Probably it is not important...

it made sense to have some sort of standard quantitative evaluation as a benchmark

Yeah, agreed!

I guess it'd look something like this

Looks reasonable to me. If you want to incorporate this, just make sure to switch bn to eval mode for evaluation, then switch back to train mode.
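That eval/train toggling could look something like the following. This is a hedged sketch, not code from the PR: `beta_with_bn` is an illustrative name, and it assumes `decoder.beta` is the Linear layer mapping topics to the vocabulary and `decoder.bn` the BatchNorm1d layer, as in the tutorial.

```python
# Sketch of the suggestion above: switch the BatchNorm layer to eval
# mode while extracting beta (so its running mean/variance are used
# and not updated by the pseudo-batch), then restore the prior mode.
# Names and structure are assumptions based on the tutorial's ProdLDA.
import torch
import torch.nn as nn

def beta_with_bn(decoder, num_topics):
    was_training = decoder.bn.training
    decoder.bn.eval()                 # use running stats, don't update them
    with torch.no_grad():
        pseudo_batch = torch.eye(num_topics)
        beta = decoder.bn(decoder.beta(pseudo_batch))
    decoder.bn.train(was_training)    # restore the previous mode
    return beta.cpu().numpy()
```

Wrapping the forward pass in `torch.no_grad()` also avoids building an unneeded graph for what is purely an evaluation-time extraction.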
