ARTM bug - again #33
Okay, so a couple more fun things:
It is just a warning and does not impact anything, but yeah, kinda ugly. It seems to happen because the Theta matrix is not reshaped unless we refit the model, although the Phi matrix is. The retrieved matrix is still correct, however.
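For reference, a minimal sketch of the symptom as I understand it; the batch path and model settings are placeholders, not our actual pipeline:

```python
import artm

# Placeholder corpus: any directory of pre-built BigARTM batches works here.
batch_vectorizer = artm.BatchVectorizer(data_path="batches/", data_format="batches")

model = artm.ARTM(num_topics=20, cache_theta=True)
model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)

theta = model.get_theta()  # emits the (harmless) shape warning
print(theta.shape)         # the matrix itself is still usable

# Refitting the model reshapes the cached Theta, after which the warning goes away.
model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=1)
theta = model.get_theta()
```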
I've also looked into the sparsity regularizer a bit more: it does indeed seem to be either detrimental or to not improve the model significantly, and it often induces the bug. I will leave it as is, since I did not test on large corpora.
Merged the bug fixes, keeping this open until @m09 comes back so he can see for himself.
Great stuff! It seems the white rabbit wants us to go deeper; the artm python client is an endless stream of fun 😭
As you may recall, we had a bug where the shape of the theta matrix was incorrect: the number of documents was lower than expected. We were able to get rid of that by using an experimental feature the API exposes, which allows us to store the theta matrix in a second phi matrix.
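In case anyone needs to reproduce the workaround, here is a rough sketch; I'm assuming the experimental feature in question is the `theta_name` parameter, which keeps Theta around as an extra Phi matrix retrievable through `get_phi`:

```python
import artm

batch_vectorizer = artm.BatchVectorizer(data_path="batches/", data_format="batches")

# Ask ARTM to store Theta as a second Phi matrix (the experimental feature).
model = artm.ARTM(num_topics=20, theta_name="theta")
model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)

# Retrieve Theta through the Phi accessor; this is the version whose shape
# was correct where plain get_theta() came up short.
theta = model.get_phi(model_name=model.theta_name)
```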
However, this was unfortunately not the only bug. While working on a PR to implement the consolidated training (see issue #1), I came across the interesting fact that, while the matrix shape was now correct, its contents were not. Case in point: I simply summed all values in the matrix, expecting to get the number of documents, but in some cases the total came up short. More precisely, although the document rows in the theta matrix did exist, they were null.

As I was testing on a small corpus (bash files extracted from pytorch), I am not sure this also applies to large corpora. However, I assume it does, as this bug is closely related to the previous one, which applied to both large and small corpora.
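The check I used boils down to something like this (a sketch; I'm assuming documents end up on the rows when Theta is fetched through `get_phi`):

```python
import numpy as np

# Each document's topic distribution should sum to 1, so the grand total
# over the whole matrix should equal the number of documents.
theta = model.get_phi(model_name=model.theta_name)
n_docs = len(theta.index)
total = float(theta.values.sum())
print(total, n_docs)  # on buggy runs, total comes up short of n_docs

# The missing mass comes from document rows that exist but are all zeros.
null_docs = theta.index[np.isclose(theta.sum(axis=1), 0.0)]
print(f"{len(null_docs)} null document rows")
```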
After trying out different things, I made a couple of observations:
- When setting `--sparse-doc-coeff` to a lower value - or even 0 - the problem did not occur and the above method worked each time (see the sketch at the end of this comment). However, doing so systematically decreased the model quality, more often than not by a lot. I also did not observe significant performance gains with that regularizer in the past.

Given all this, here is my proposal (I will implement it directly; we can always discuss this further when you come back from vacation @m09):
I will implement this in an upcoming PR - probably after implementing the consolidated creation and training. If all else fails, the next step will be to downgrade the ARTM version, hoping the package was more stable previously.
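For completeness, the mitigation from the observation above, sketched in BigARTM terms; I'm assuming our `--sparse-doc-coeff` flag maps onto the `tau` of a `SmoothSparseThetaRegularizer`, so treat the names as placeholders:

```python
import artm

batch_vectorizer = artm.BatchVectorizer(data_path="batches/", data_format="batches")

model = artm.ARTM(num_topics=20, theta_name="theta")

# Zeroing the document-sparsity coefficient is what made the bug disappear
# in my tests, at the cost of model quality.
model.regularizers.add(
    artm.SmoothSparseThetaRegularizer(name="sparse_theta", tau=0.0)
)

model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)
```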