ARTM bug - again #33

Closed
r0mainK opened this issue Aug 14, 2019 · 3 comments
r0mainK commented Aug 14, 2019

As you may recall, we had a bug where the shape of the theta matrix was incorrect: the number of documents was lower than expected. We were able to get rid of that by using an experimental feature the API exposes, which allows us to store the theta matrix in a second phi matrix.

However, this was unfortunately not the only bug. While working on a PR to implement the consolidated training (see issue #1), I came across the interesting fact that, while the matrix shape was now correct, its contents were not. Case in point: I simply summed all values in the matrix, expecting to get the number of documents, but in some cases it fell short. More precisely, although the rows for those documents did exist in the theta matrix, they were entirely zero.
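
For reference, the sanity check is essentially this (a minimal sketch; doctopic is assumed to be the documents x topics matrix and num_docs the expected document count - the names are illustrative, not the actual code in the repo):

import numpy as np

def doctopic_is_sane(doctopic: np.ndarray, num_docs: int, rtol: float = 1e-3) -> bool:
    # every document row of theta should sum to ~1, so the grand total should be
    # ~num_docs; null rows make it fall short
    return doctopic.shape[0] == num_docs and abs(float(doctopic.sum()) - num_docs) < rtol * num_docs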

As I was testing on a small corpus (bash files extracted from pytorch), I am not sure whether this also applies to large corpora. However, I assume it does, as this bug is closely related to the previous one, which applied to both large and small corpora.

After trying out different things, I made a few observations:

  • the problem usually appeared after the first and last phases of training
  • using a number of topics large compared to the range towards which the model converged did not change much; using too few topics, however, did
  • as before, it usually affected small documents
  • an alternative way of retrieving the matrix corrected the problem after the first phase of training:
# instead of 
doctopic, _, _ = model_artm.get_phi_dense(model_name="theta")
# we do
doctopic = model_artm.transform_sparse(batch_vectorizer)[0].todense().T
  • the above method did not work after the last phase of the training, where we induce sparsity - it also hit the same bug as before: documents being cut out, resulting in an incorrect matrix shape. Past that phase, neither method worked (I tested literally all methods this time), and in some cases the current one gave "better" results, although almost never good ones.
  • when setting --sparse-doc-coeff to a lower value - or even 0 - the problem did not occur and the above method worked every time. However, doing so systematically decreased the model quality, more often than not by a lot. I also have not observed significant performance increases with that regularizer in general (see the sketch after this list).
  • I did not find any issues, at any point, for the wordtopic (phi) matrix
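
For context on the --sparse-doc-coeff point, sparsity on theta goes through ARTM's theta regularizer; this is roughly how such a coefficient would map onto it (a sketch under my assumptions about the wiring - the regularizer name and the sign convention here are illustrative):

import artm

sparse_doc_coeff = 0.5  # hypothetical value passed via --sparse-doc-coeff
# a negative tau on the smooth/sparse theta regularizer pushes document-topic
# weights towards zero, i.e. induces sparsity; tau = 0 disables the effect
model_artm.regularizers.add(
    artm.SmoothSparseThetaRegularizer(name="sparse_theta", tau=-sparse_doc_coeff)
)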

Given all this, here is my proposal (I will implement it directly; we can always discuss this further when you come back from vacation @m09):

  1. Systematically retrieve the theta matrix with the method shown above
  2. Check that the doctopic matrix is sane after each phase of training (in DEBUG mode, except for the last phase)
  3. Save the doctopic and wordtopic matrices before inducing sparsity
  4. Compare results before and after inducing sparsity, and save them only if the doctopic matrix is sane and the results are better (a rough sketch of steps 2-4 follows below)

I will implement this in an upcoming PR - probably after implementing the consolidated creation and training. If all else fails, the next step will be to downgrade the ARTM version, hoping the package was more stable previously.
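
Here is a rough sketch of steps 2-4, reusing the retrieval method from the observations above (doctopic_is_sane, quality and save are placeholder helpers, and num_docs comes from our own pipeline - the actual wiring will be in the PR):

import numpy as np

def retrieve_doctopic(model_artm, batch_vectorizer):
    # the retrieval method that worked after the first phase of training
    return np.asarray(model_artm.transform_sparse(batch_vectorizer)[0].todense().T)

# save both matrices before inducing sparsity
doctopic_before = retrieve_doctopic(model_artm, batch_vectorizer)
wordtopic_before = model_artm.get_phi()

# ... sparsity-inducing phase runs here ...

doctopic_after = retrieve_doctopic(model_artm, batch_vectorizer)
if doctopic_is_sane(doctopic_after, num_docs) and quality(doctopic_after) >= quality(doctopic_before):
    save(doctopic_after, model_artm.get_phi())
else:
    save(doctopic_before, wordtopic_before)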

r0mainK commented Aug 15, 2019

Okay, so a couple more fun things:

  1. When using the above method to retrieve the theta matrix after reshaping, this C++ error shows up:
    E0815 15:05:07.551517 25 topic_selection_theta.cc:112] ProcessBatchesArgs.topic_name_size() != TopicSelectionThetaConfig.topic_value_size()

It is just a warning and does not impact anything, but yeah, kinda ugly. It seems to be because the theta matrix is not reshaped unless we refit the model, although the phi matrix is. However, the retrieved matrix is still good.

  2. There was an error I introduced during the reshape. Basically, we were keeping more topics than we should have (including the good ones).

  3. The score for theta sparsity is completely wrong. No idea why; I'm guessing it is linked to the bugs we found, but it was simply incorrect. I checked and this is not the case for the phi matrix. I also tried using the phi scorer for the theta matrix, but it does not work. As there is no easy way to compute the perplexity, I have no idea whether it is also bugged. Anyway, I am going to implement a custom sparsity score for theta (2 lines, sketched below) and will remove the perplexity score just in case.
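
The custom theta sparsity score really is about two lines; a minimal sketch (the near-zero threshold eps is an assumption):

import numpy as np

def theta_sparsity(doctopic: np.ndarray, eps: float = 1e-9) -> float:
    # fraction of (near-)zero entries in the document-topic matrix
    return float((doctopic <= eps).sum()) / doctopic.size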

I've also looked into the sparsity regularizer a bit more: it does indeed seem to be either detrimental or to not improve the model significantly, and it often induces the bug. I will leave it as is, since I did not test on large corpora.

r0mainK commented Aug 16, 2019

Merged the bug fixes, keeping this open until @m09 comes back so he can see for himself.

m09 commented Aug 17, 2019

Great stuff! It seems the white rabbit wants us to go deeper, the artm python client is an endless stream of fun 😭

m09 closed this as completed Aug 17, 2019