accuracy decays with more topics #3

Closed
strin opened this issue Jan 2, 2016 · 2 comments
strin commented Jan 2, 2016

with the following setting

import medlda

log = open("20ng_result.txt", "a")
batchsize = 512
label = 20        # number of classes in 20 Newsgroups
numword = 53975   # vocabulary size
num_sample = 100
# sweep over the number of topics k = 10, 20, ..., 100
for k in range(10, 110, 10):
    # symmetric Dirichlet prior alpha = 1/k; use float division so it
    # does not truncate to 0 under Python 2
    pamedlda = medlda.OnlineGibbsMedLDA(num_topic=k, labels=label,
                                        words=numword, alpha=1.0 / k)
    pamedlda.train_with_gml('/home/wenbo/mfs/data/20ng/20ng_train.gml', batchsize)
    (pred, ind, acc) = pamedlda.infer_with_gml('/home/wenbo/mfs/data/20ng/20ng_test.gml', num_sample)
    log.write("topic: %d, batch: %d, numsample: %d, acc: %0.2f\n" % (k, batchsize, num_sample, acc))
log.close()

the results differ from the original paMedLDAgibbs implementation: here the accuracy drops dramatically as the number of topics k increases.

topic: 10, batch: 512, numsample: 100, acc: 0.80
topic: 20, batch: 512, numsample: 100, acc: 0.80
topic: 30, batch: 512, numsample: 100, acc: 0.80
topic: 40, batch: 512, numsample: 100, acc: 0.80
topic: 50, batch: 512, numsample: 100, acc: 0.79
topic: 60, batch: 512, numsample: 100, acc: 0.78
topic: 70, batch: 512, numsample: 100, acc: 0.76
topic: 80, batch: 512, numsample: 100, acc: 0.72
topic: 90, batch: 512, numsample: 100, acc: 0.72
topic: 100, batch: 512, numsample: 100, acc: 0.70
strin commented Jan 2, 2016

The same degradation shows up even for binary classification:

pamedlda = medlda.OnlineGibbsMedLDA(num_topic=80, labels=2, words=61188)
pamedlda.train_with_gml('../data/binary_train.gml', batchsize=32)
(pred, ind, acc) = pamedlda.infer_with_gml('../data/binary_test.gml', num_sample=10)

acc = 0.56 for 80 topics, while acc = 0.80 for 20 topics.

strin closed this as completed in 1347c56 on Jan 2, 2016
strin commented Jan 2, 2016

I added a parameter called "stepsize", which adjusts the weight of each data point. In our ICML paper this parameter is set to dataset size / batch size.

A rough explanation for this phenomenon:

Assume that K is large (say 80). After the first few mini-batches, the Bayesian posterior is going to be multi-modal due to uncertainty (too many parameters compared to the amount of data). Therefore, for the latent samples of the initial mini-batches, as well as the variational approximation, to be accurate, we need to set J (the number of latent samples per data point) to be large.

So any of the following three solutions can mitigate the accuracy degeneration:

  • increase J.
  • more sweeps over dataset (more than 1).
  • set stepsize to be large.

Revisiting the binary classification experiment, the solutions lead to the following results:

  • set J = 10, test accuracy = 0.76.
  • set pass = 3, test accuracy = 0.81.
  • set stepsize = 25, test accuracy = 0.81.

The third solution seems to be the most computationally efficient.
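The stepsize heuristic above can be sketched as a small helper. `recommended_stepsize` and the `dataset_size` value of 800 below are hypothetical illustrations, not figures from this repository; note only that stepsize = 25 with a batch size of 32 is consistent with a training set of 800 documents under this rule:

```python
def recommended_stepsize(dataset_size, batch_size):
    # The heuristic from the ICML paper: reweight each mini-batch as if
    # it represented the full dataset, i.e. dataset size / batch size.
    return float(dataset_size) / batch_size

# A hypothetical corpus of 800 training documents with mini-batches of
# 32 documents yields a stepsize of 25.
print(recommended_stepsize(800, 32))
```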
