
Spot add ldaonlineoptimizer support #3

Open · wants to merge 6 commits into master

Conversation

rabarona (Owner)

No description provided.

@@ -21,6 +21,7 @@

DSOURCE=$1
RAWDATA_PATH=$2
LDAOPTIMIZER="online"

NathanSegerlind:
Why is online the default?

rabarona (Owner, Author):
It shouldn't be, and it also references an out-of-date variable name. It will read from spot.conf.

--ldamaxiterations 11 \

NathanSegerlind:
This is the ml_test.sh script, but why are we using a different value than what is in spot.conf?

rabarona (Owner, Author):
Not sure; I think it's a leftover from some tests. I will match it with ml_ops.sh.

val optimizer = new EMLDAOptimizer
val ldaOptimizer = ldaOptimizerOption match {
case "em" => new EMLDAOptimizer
case "online" => new OnlineLDAOptimizer().setOptimizeDocConcentration(true).setMiniBatchFraction({

NathanSegerlind:
Where did these values come from? Are they taken from the Spark documentation, a paper, or some experiments that we ran?

rabarona (Owner, Author):
(0.05 + 1) / corpus size: I'm sure it's from https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala. "if corpus size < 2 then 0.75": I can't recall right now, but I'm pretty sure that came up in a conversation with @brandon.
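
For reference, a minimal sketch of the selection logic under discussion. The mini-batch expression below follows the linked Spark LDAExample, which uses 0.05 + 1.0 / actualCorpusSize (slightly different from the "(0.05 + 1) / corpus size" recalled above); the corpusSize < 2 fallback is only as recalled in this thread. Treat both as assumptions about the PR's actual code, and chooseOptimizer as a hypothetical helper:

import org.apache.spark.mllib.clustering.{EMLDAOptimizer, LDAOptimizer, OnlineLDAOptimizer}

def chooseOptimizer(ldaOptimizerOption: String, corpusSize: Long): LDAOptimizer =
  ldaOptimizerOption match {
    case "em" => new EMLDAOptimizer
    case "online" =>
      // Tiny corpora get a fixed fraction; otherwise sample slightly more
      // than one document's worth of the corpus per mini-batch (assumed heuristic).
      val miniBatchFraction = if (corpusSize < 2) 0.75 else 0.05 + 1.0 / corpusSize
      new OnlineLDAOptimizer()
        .setOptimizeDocConcentration(true)
        .setMiniBatchFraction(miniBatchFraction)
    case invalid => throw new IllegalArgumentException(s"Invalid LDA optimizer: $invalid")
  }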


// If the caller does not provide a seed to LDA (i.e. ldaSeed is empty), LDA is automatically seeded with the hash value of the class name

if (ldaSeed.nonEmpty) {
  lda.setSeed(ldaSeed.get)
}

val (wordTopicMat, docTopicDist) = ldaOptimizer match {
case _: EMLDAOptimizer => {
val ldaModel = lda.run(ldaCorpus).asInstanceOf[DistributedLDAModel]//.toLocal

NathanSegerlind:
The //.toLocal is code in a comment; it should be removed.
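
For context, a sketch of how the EM branch could read once that residue is dropped; the function name and signature here are illustrative, not the PR's exact code:

import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.rdd.RDD

def runEmLda(lda: LDA, ldaCorpus: RDD[(Long, Vector)]): (Matrix, RDD[(Long, Vector)]) = {
  // EM training yields a DistributedLDAModel; the per-document topic
  // distributions stay on the cluster as an RDD.
  val ldaModel = lda.run(ldaCorpus).asInstanceOf[DistributedLDAModel]
  (ldaModel.topicsMatrix, ldaModel.topicDistributions)
}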


//Create LDA model
val ldaModel = lda.run(ldaCorpus)
//Topic distribution: for each document, return distribution (vector) over topics for that docs

NathanSegerlind:
Maybe add "entry i is the fraction of the document which belongs to topic i".

rabarona (Owner, Author):
Something like this?
// Topic distribution: for each document, return the distribution (vector) over topics for that document, where entry i is the fraction of the document which belongs to topic i
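
If useful, a self-contained sketch of that call for the online (LocalLDAModel) case, with the clarified comment; docTopicMixes is a hypothetical helper:

import org.apache.spark.mllib.clustering.LocalLDAModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def docTopicMixes(ldaModel: LocalLDAModel, ldaCorpus: RDD[(Long, Vector)]): RDD[(Long, Vector)] =
  // For each document, a vector over topics where entry i is the
  // fraction of the document which belongs to topic i (entries sum to 1).
  ldaModel.topicDistributions(ldaCorpus)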

case _: OnlineLDAOptimizer => {
val ldaModel = lda.run(ldaCorpus).asInstanceOf[LocalLDAModel]

//Get word topic mix: columns = topic (in no guaranteed order), rows = words (# rows = vocab size)

NathanSegerlind:
This comment confuses me... shouldn't column i correspond to topic i? How else can we interpret the results?

rabarona (Owner, Author):
It confuses me too. If I remember correctly (and based on the subsequent code), topicsMatrix contains 20 rows, one row for each topic, and N columns, where N is the number of words (the vocab size).

The code:

On line 155 we call val wordResults = formatSparkLDAWordOutput(wordTopicMat, revWordMap). In that function we find, at line 255, val wordProbs: Seq[Array[Double]] = wordTopMat.transpose.toArray.grouped(n).toSeq. Until then we have a matrix where each row is a word and each column is a topic.
rabarona (Owner, Author):
Never mind, I just checked and it's as you said: columns = number of topics.
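
To pin down the final layout, a small sketch (describeTopicsMatrix is a hypothetical helper) of the orientation of topicsMatrix, vocabSize rows by k topic columns, which is why formatSparkLDAWordOutput transposes before grouping per-word probabilities:

import org.apache.spark.mllib.clustering.LDAModel
import org.apache.spark.mllib.linalg.Matrix

def describeTopicsMatrix(ldaModel: LDAModel): Unit = {
  val wordTopicMat: Matrix = ldaModel.topicsMatrix
  // Rows index words (numRows == vocab size), columns index topics
  // (numCols == k); wordTopicMat(w, t) is the weight of word w in topic t.
  println(s"words (rows): ${wordTopicMat.numRows}, topics (cols): ${wordTopicMat.numCols}")
}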

//Get word topic mix: columns = topic (in no guaranteed order), rows = words (# rows = vocab size)
val wordTopicMat: Matrix = ldaModel.topicsMatrix

//Topic distribution: for each document, return distribution (vector) over topics for that docs

NathanSegerlind:
Minor formatting issue throughout (an easy query-replace): add whitespace after //, e.g.
//Topic
should be
// Topic

rabarona (Owner, Author):
I actually tried to have IntelliJ do this for me with auto-formatting, but it seems IntelliJ doesn't care whether you typed a space between the // and your comment.

@NathanSegerlind left a comment:

Approved with minor comments on defaults and comments.


Fixed inline comments format; added one space after //.
Labels: none yet
Projects: none yet
2 participants