ARTM bug - again #33

Closed
r0mainK opened this issue Aug 14, 2019 · 3 comments
r0mainK commented Aug 14, 2019

As you may recall, we had a bug where the shape of the theta matrix was incorrect: the number of documents was lower than expected. We were able to get rid of that by using an experimental feature the API exposes, which allows us to store the theta matrix in a second phi matrix.

However, this was unfortunately not the only bug. While working on a PR to implement the consolidated training (see issue #1), I came across the interesting fact that, while the matrix shape was now correct, its contents were not. Case in point: I simply summed all values in the matrix, expecting to get the number of documents, but in some cases it fell short. More precisely, although the rows for those documents did exist in the theta matrix, they were entirely zero.
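
For reference, the sanity check is essentially this (a minimal sketch; doctopic is assumed to be the documents x topics matrix and num_docs the expected document count - the names are illustrative, not the actual code in the repo):

import numpy as np

def doctopic_is_sane(doctopic: np.ndarray, num_docs: int, rtol: float = 1e-3) -> bool:
    # every document row of theta should sum to ~1, so the grand total should be
    # ~num_docs; null rows make it fall short
    return doctopic.shape[0] == num_docs and abs(float(doctopic.sum()) - num_docs) < rtol * num_docs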

As I was testing on a small corpus (bash files extracted from pytorch), I am not sure whether this also applies to large corpora. However, I assume it does, as this bug is closely related to the previous one, which applied to both large and small corpora.

After trying out different things, I made a few observations:

  • the problem usually appeared after the first and last phases of training
  • using a number of topics large compared to the range towards which the model converged did not change much; using too few topics, however, did
  • as before, it usually affected small documents
  • an alternative way of retrieving the matrix corrected the problem after the first phase of training:
# instead of 
doctopic, _, _ = model_artm.get_phi_dense(model_name="theta")
# we do
doctopic = model_artm.transform_sparse(batch_vectorizer)[0].todense().T
  • the above method did not work after the last phase of the training, where we induce sparsity - it also hit the same bug as before: documents being cut out, resulting in an incorrect matrix shape. Past that phase, neither method worked (I tested literally all methods this time), and in some cases the current one gave "better" results, although almost never good ones.
  • when setting --sparse-doc-coeff to a lower value - or even 0 - the problem did not occur and the above method worked every time. However, doing so systematically decreased the model quality, more often than not by a lot. I also have not observed significant performance increases with that regularizer in general (see the sketch after this list).
  • I did not find any issues, at any point, for the wordtopic (phi) matrix
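
For context on the --sparse-doc-coeff point, sparsity on theta goes through ARTM's theta regularizer; this is roughly how such a coefficient would map onto it (a sketch under my assumptions about the wiring - the regularizer name and the sign convention here are illustrative):

import artm

sparse_doc_coeff = 0.5  # hypothetical value passed via --sparse-doc-coeff
# a negative tau on the smooth/sparse theta regularizer pushes document-topic
# weights towards zero, i.e. induces sparsity; tau = 0 disables the effect
model_artm.regularizers.add(
    artm.SmoothSparseThetaRegularizer(name="sparse_theta", tau=-sparse_doc_coeff)
)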

Given all this, here is my proposal (I will implement it directly; we can always discuss this further when you come back from vacation @m09):

  1. Systematically retrieve the theta matrix with the method shown above
  2. Check that the doctopic matrix is sane after each phase of training (in DEBUG mode, except for the last phase)
  3. Save the doctopic and wordtopic matrices before inducing sparsity
  4. Compare results before and after inducing sparsity, and save them only if the doctopic matrix is sane and the results are better (a rough sketch of steps 2-4 follows below)

I will implement this in an upcoming PR - probably after implementing the consolidated creation and training. If all else fails, the next step will be to downgrade the ARTM version, hoping the package was more stable previously.
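
Here is a rough sketch of steps 2-4, reusing the retrieval method from the observations above (doctopic_is_sane, quality and save are placeholder helpers, and num_docs comes from our own pipeline - the actual wiring will be in the PR):

import numpy as np

def retrieve_doctopic(model_artm, batch_vectorizer):
    # the retrieval method that worked after the first phase of training
    return np.asarray(model_artm.transform_sparse(batch_vectorizer)[0].todense().T)

# save both matrices before inducing sparsity
doctopic_before = retrieve_doctopic(model_artm, batch_vectorizer)
wordtopic_before = model_artm.get_phi()

# ... sparsity-inducing phase runs here ...

doctopic_after = retrieve_doctopic(model_artm, batch_vectorizer)
if doctopic_is_sane(doctopic_after, num_docs) and quality(doctopic_after) >= quality(doctopic_before):
    save(doctopic_after, model_artm.get_phi())
else:
    save(doctopic_before, wordtopic_before)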

r0mainK commented Aug 15, 2019

Okay, so a couple more fun things:

  1. When using the above method to retrieve the theta matrix after reshaping, this C++ error shows up:
    E0815 15:05:07.551517 25 topic_selection_theta.cc:112] ProcessBatchesArgs.topic_name_size() != TopicSelectionThetaConfig.topic_value_size()

It is just a warning and does not impact anything, but yeah, kinda ugly. It seems to be because the theta matrix is not reshaped unless we refit the model, although the phi matrix is. However, the retrieved matrix is still good.

  2. There was an error I introduced during the reshape. Basically, we were keeping more topics than we should have (including the good ones).

  3. The score for theta sparsity is completely wrong. No idea why; I'm guessing it is linked to the bugs we found, but it was simply incorrect. I checked and this is not the case for the phi matrix. I also tried using the phi scorer for the theta matrix, but it does not work. As there is no easy way to compute the perplexity, I have no idea whether it is also bugged. Anyway, I am going to implement a custom sparsity score for theta (2 lines, sketched below) and will remove the perplexity score just in case.
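
The custom theta sparsity score really is about two lines; a minimal sketch (the near-zero threshold eps is an assumption):

import numpy as np

def theta_sparsity(doctopic: np.ndarray, eps: float = 1e-9) -> float:
    # fraction of (near-)zero entries in the document-topic matrix
    return float((doctopic <= eps).sum()) / doctopic.size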

I've also looked into the sparsity regularizer a bit more: it does indeed seem to be either detrimental or to not improve the model significantly, and it often induces the bug. I will leave it as is, since I did not test on large corpora.

r0mainK commented Aug 16, 2019

Merged the bug fixes, keeping this open until @m09 comes back so he can see for himself.

m09 commented Aug 17, 2019

Great stuff! It seems the white rabbit wants us to go deeper, the artm python client is an endless stream of fun 😭

m09 closed this as completed Aug 17, 2019