accuracy decays with more topics #3

Closed
strin opened this issue Jan 2, 2016 · 2 comments
strin commented Jan 2, 2016

with the following setting

import medlda

log = open("20ng_result.txt", "a")
batchsize = 512
label = 20        # number of classes in 20 Newsgroups
numword = 53975   # vocabulary size
num_sample = 100
# sweep over the number of topics k = 10, 20, ..., 100
for k in range(10, 110, 10):
    # symmetric Dirichlet prior alpha = 1/k; use float division so it
    # does not truncate to 0 under Python 2
    pamedlda = medlda.OnlineGibbsMedLDA(num_topic=k, labels=label,
                                        words=numword, alpha=1.0 / k)
    pamedlda.train_with_gml('/home/wenbo/mfs/data/20ng/20ng_train.gml', batchsize)
    (pred, ind, acc) = pamedlda.infer_with_gml('/home/wenbo/mfs/data/20ng/20ng_test.gml', num_sample)
    log.write("topic: %d, batch: %d, numsample: %d, acc: %0.2f\n" % (k, batchsize, num_sample, acc))
log.close()

the results differ from the original paMedLDAgibbs implementation: here the accuracy drops dramatically as the number of topics k increases.

topic: 10, batch: 512, numsample: 100, acc: 0.80
topic: 20, batch: 512, numsample: 100, acc: 0.80
topic: 30, batch: 512, numsample: 100, acc: 0.80
topic: 40, batch: 512, numsample: 100, acc: 0.80
topic: 50, batch: 512, numsample: 100, acc: 0.79
topic: 60, batch: 512, numsample: 100, acc: 0.78
topic: 70, batch: 512, numsample: 100, acc: 0.76
topic: 80, batch: 512, numsample: 100, acc: 0.72
topic: 90, batch: 512, numsample: 100, acc: 0.72
topic: 100, batch: 512, numsample: 100, acc: 0.70
strin commented Jan 2, 2016

The same degradation shows up even for binary classification:

pamedlda = medlda.OnlineGibbsMedLDA(num_topic=80, labels=2, words=61188)
pamedlda.train_with_gml('../data/binary_train.gml', batchsize=32)
(pred, ind, acc) = pamedlda.infer_with_gml('../data/binary_test.gml', num_sample=10)

acc = 0.56 for 80 topics, while acc = 0.80 for 20 topics.

strin closed this as completed in 1347c56 on Jan 2, 2016
strin commented Jan 2, 2016

I added a parameter called "stepsize", which adjusts the weight of each data point. In our ICML paper this parameter is set to dataset size / batch size.

A rough explanation for this phenomenon:

Assume that K is large (say 80). After the first few mini-batches, the Bayesian posterior is going to be multi-modal due to uncertainty (too many parameters compared to the amount of data). Therefore, for the latent samples of the initial mini-batches, as well as the variational approximation, to be accurate, we need to set J (the number of latent samples per data point) to be large.

So any of the following three solutions can mitigate the accuracy degeneration:

  • increase J.
  • more sweeps over dataset (more than 1).
  • set stepsize to be large.

Revisiting the binary classification experiment, the solutions lead to the following results:

  • set J = 10, test accuracy = 0.76.
  • set pass = 3, test accuracy = 0.81.
  • set stepsize = 25, test accuracy = 0.81.

The third solution seems to be the most computationally efficient.
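The stepsize heuristic above can be sketched as a small helper. `recommended_stepsize` and the `dataset_size` value of 800 below are hypothetical illustrations, not figures from this repository; note only that stepsize = 25 with a batch size of 32 is consistent with a training set of 800 documents under this rule:

```python
def recommended_stepsize(dataset_size, batch_size):
    # The heuristic from the ICML paper: reweight each mini-batch as if
    # it represented the full dataset, i.e. dataset size / batch size.
    return float(dataset_size) / batch_size

# A hypothetical corpus of 800 training documents with mini-batches of
# 32 documents yields a stepsize of 25.
print(recommended_stepsize(800, 32))
```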
